ProActive

The ProActive R package automatically detects regions of gapped and elevated read coverage using a pattern-matching algorithm. ProActive can detect, characterize and visualize read coverage patterns in both genomes and metagenomes. Optionally, users may provide gene predictions associated with their genome or metagenome in the form of a .gff file. In this case, ProActive will generate an additional output table containing the gene predictions found within the detected regions of gapped and elevated read coverage. ProActive is best used as a screening method to identify genetic regions for further investigation.

Elevations or gaps in read coverage can be caused by differential abundance of specific genetic elements. For example, an elevation in read coverage may be caused by prophage activation. When a prophage activates and enters the lytic cycle, its genome begins replicating and the ratio of phage:bacterial genomes in the cell increases. Because there are more phage genomes than bacterial genomes, sequencing generates more phage reads than bacterial reads. When these reads are mapped back to their reference sequence, the read coverage of the prophage region is elevated relative to the read coverage of the bacterial genome on either side of the prophage. The same principle applies to temperate phages that are highly abundant in the environment, as well as to other mobile genetic elements that are freely present in the environment at a higher ratio than the originating or ‘host’ genome.
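
The arithmetic behind an elevation can be sketched with a toy model. All numbers below are illustrative assumptions, not ProActive output, and the model ignores sequencing bias and mixed populations: if every host genome carries one integrated prophage plus some number of freely replicating phage genomes, the prophage region is expected to be covered roughly (1 + free copies) times as deeply as its flanks.

```r
# Toy model: every value here is an assumption chosen for illustration.
host_coverage <- 20        # mean read depth across the bacterial chromosome
free_phage_per_cell <- 4   # replicating phage genomes per host genome

# Reads on the prophage region come from the integrated copy plus the free copies.
prophage_coverage <- host_coverage * (1 + free_phage_per_cell)

# Coverage in 100 bp bins: left flank, prophage region, right flank.
coverage <- c(
  rep(host_coverage, 50),
  rep(prophage_coverage, 30),
  rep(host_coverage, 50)
)

# Fold elevation of the prophage region relative to its flanks.
elevation_ratio <- prophage_coverage / host_coverage
elevation_ratio
```

Plotting `coverage` against bin index gives the block-shaped elevation that the paragraph above describes.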

Conversely, a gap in read coverage may indicate genetic heterogeneity in the associated bacterial population. Genetic variants with and without specific genetic elements, such as prophages or certain genes, will produce differential abundances of sequencing reads that may form read coverage gaps. Whether genetic variants produce a read coverage gap depends on the assembly, i.e. on whether the assembler assembles the variants as separate entities or not. Gaps in read coverage may also form at regions with high mutation rates.

Input files

Pileup file:

ProActive detects read coverage patterns using a pattern-matching algorithm that operates on pileup files. A pileup file is a file format where each row summarizes the ‘pileup’ of reads at specific genomic locations. Pileup files can be used to generate a rolling mean of read coverages and associated base pair positions, which reduces data size while preserving read coverage patterns. ProActive requires that input pileup files be generated using a 100 bp window/bin size.
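
To illustrate what a 100 bp window does to the data, here is a minimal sketch that averages simulated per-base read depth into 100 bp bins. The simulated depths, the column names, and the choice to label each bin by its end position are assumptions made for the example; this is not the exact layout of a ProActive input file (see the vignette for that).

```r
set.seed(1)

# Simulated per-base read depth for a 1,000 bp contig (illustrative data only).
per_base_depth <- rpois(1000, lambda = 30)

bin_size <- 100

# Assign each base to a bin: bases 1-100 -> bin 0, 101-200 -> bin 1, and so on.
bin_index <- (seq_along(per_base_depth) - 1) %/% bin_size

# One row per bin: the bin's end position and its mean depth.
binned <- data.frame(
  position = (unique(bin_index) + 1) * bin_size,
  coverage = as.numeric(tapply(per_base_depth, bin_index, mean))
)

nrow(binned)
head(binned)
```

The 1,000 per-base values collapse to 10 rows, which is the data reduction described above.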

Pileup files can be generated by mapping sequencing reads to a metagenome or genome fasta. Read mapping should be performed using a high minimum identity (0.97 or higher) and random mapping of ambiguous reads. The pileup files needed for ProActive are generated from the .bam files produced during read mapping. Some read mappers, like BBMap, can generate the pileup file directly in the bbmap.sh command by requesting the bincov output and setting covbinsize=100. Otherwise, BBMap’s pileup.sh can convert .bam files produced by any read mapper to pileup files compatible with ProActive using the bincov output with binsize=100.
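
As a rough sketch, the two BBMap routes described above could be assembled from R as follows. The file names are hypothetical placeholders, and `minid=0.97` and `ambiguous=random` are the BBMap settings assumed here to correspond to the identity and ambiguous-read recommendations; check them against the BBMap documentation for your version. The block only builds and prints the commands, it does not run them.

```r
# Hypothetical file names; replace with your own.
reference <- "assembly.fasta"
reads <- "reads.fastq.gz"

# Route 1: map reads and write 100 bp binned coverage in one bbmap.sh call.
bbmap_args <- c(
  paste0("ref=", reference),
  paste0("in=", reads),
  "minid=0.97",              # high minimum identity
  "ambiguous=random",        # place ambiguous reads at random
  "bincov=pileup.bincov100", # binned coverage output file
  "covbinsize=100"           # 100 bp bins
)

# Route 2: convert a .bam from any read mapper with pileup.sh.
pileup_args <- c(
  "in=mapped.bam",
  "bincov=pileup.bincov100",
  "binsize=100"
)

# Print the commands for inspection.
cat("bbmap.sh", bbmap_args, "\n")
cat("pileup.sh", pileup_args, "\n")

# To execute from R once BBMap is installed and on the PATH:
# system2("bbmap.sh", bbmap_args)
# system2("pileup.sh", pileup_args)
```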

NOTE: For detailed information on input file format, please see the vignette. Users may also use the ‘sampleMetagenomePileup’ and ‘sampleGenomePileup’ files that come pre-loaded with ProActive as a reference.

gffTSV:

ProActive optionally accepts a .gff file as input. The .gff file must be associated with the same metagenome or genome used to create your pileup file. The .gff file should be a TSV and should follow the same general format described here.
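
A minimal sketch of reading a tab-separated .gff into R is shown below. The two gene records are made up, and the nine column names follow the general GFF3 convention; they are assumptions for illustration and may not match what ProActive expects, so compare the result with the sample gff data that ships with the package before using it.

```r
# Two made-up GFF records preceded by a header comment line.
gff_text <- c(
  "##gff-version 3",
  "contig_1\tProdigal\tCDS\t100\t400\t.\t+\t0\tID=gene_1;product=hypothetical protein",
  "contig_1\tProdigal\tCDS\t450\t900\t.\t-\t0\tID=gene_2;product=integrase"
)

# For a real file, replace `text = gff_text` with the path to the .gff file.
gff <- read.delim(
  text = gff_text,
  header = FALSE,          # GFF files have no column header row
  comment.char = "#",      # skip '##' directive lines
  quote = "",              # attribute strings may contain quote characters
  stringsAsFactors = FALSE,
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")
)

str(gff)
```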

Installation

Install ProActive from CRAN with:

install.packages("ProActive")
library(ProActive)

Install the development version of ProActive from GitHub with:

if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

devtools::install_github("jlmaier12/ProActive")
library(ProActive)

Quick start

Metagenome mode:

library(ProActive)

MetagenomeProActive <- ProActive(
  pileup = sampleMetagenomePileup,
  mode = "metagenome",
  gffTSV = sampleMetagenomegffTSV
)
#> Preparing input file for pattern-matching...
#> Starting pattern-matching...
#> A quarter of the way done with pattern-matching
#> Half of the way done with pattern-matching
#> Almost done with pattern-matching!
#> Summarizing pattern-matching results
#> Finding gene predictions in elevated or gapped regions of read coverage...
#> Finalizing output
#> Execution time: 1.98secs
#> 0 contigs were filtered out based on low read coverage
#> 0 contigs were filtered out based on length (< minContigLength)
#> 
#> Elevation       Gap NoPattern 
#>         3         3         1

MetagenomePlots <- plotProActiveResults(pileup = sampleMetagenomePileup,
                                        ProActiveResults = MetagenomeProActive)

Genome mode:

GenomeProActive <- ProActive(
  pileup = sampleGenomePileup,
  mode = "genome",
  gffTSV = sampleGenomegffTSV
)
#> Preparing input file for pattern-matching...
#> Starting pattern-matching...
#> A quarter of the way done with pattern-matching
#> Half of the way done with pattern-matching
#> Almost done with pattern-matching!
#> Summarizing pattern-matching results
#> Finding gene predictions in elevated or gapped regions of read coverage...
#> Finalizing output
#> Execution time: 36.98secs
#> 0 contigs were filtered out based on low read coverage
#> 0 contigs were filtered out based on length (< minContigLength)
#> 
#> Elevation       Gap NoPattern 
#>        25         3        21

GenomePlots <- plotProActiveResults(pileup = sampleGenomePileup,
                                    ProActiveResults = GenomeProActive)