Bioconductor Pipeline


The Bioconductor pipeline is an amplicon sequencing workflow that can be implemented entirely within the R environment. The pipeline relies on the denoising algorithm DADA2 to generate a table of exact amplicon sequence variants (ASVs) rather than clustered operational taxonomic units (OTUs). This page provides a complete walkthrough of the Bioconductor pipeline, using Illumina MiSeq v2 data from the GEOTRACES ocean acidification experiment. This page is modified from a tutorial produced by the pipeline authors.

 

Install DADA2

The first stage is to install DADA2 and phyloseq if they are not already available on your computer, then load the packages used throughout this walkthrough (dada2, phyloseq, and ggplot2).

source("https://bioconductor.org/biocLite.R")
biocLite("dada2")
biocLite("phyloseq")
# note: biocLite() is the installer for older Bioconductor releases; on
# current releases (3.8 and later) use BiocManager::install("dada2") and
# BiocManager::install("phyloseq") instead
library(dada2)
library(phyloseq)
library(ggplot2)

 

Read files

Raw reads are typically delivered as compressed .fastq.gz files. Recent versions of dada2 can read .fastq.gz files directly, but this walkthrough assumes the files have been unzipped to plain .fastq files and saved in a dedicated folder. Note that unzipping these files will use a substantial amount of disk space.
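
The unzipping can be done from within R if preferred. The snippet below is a minimal sketch of one way to do this, assuming the R.utils package is installed; "path/to/your/gz_files" is a placeholder to replace with the folder containing your compressed files. The original .fastq.gz files are kept alongside the unzipped copies.

library(R.utils)
gz.files = list.files("path/to/your/gz_files", pattern = ".fastq.gz$", 
    full.names = TRUE)
# write an unzipped .fastq copy of each file, keeping the .gz original
for (f in gz.files) gunzip(f, remove = FALSE, overwrite = TRUE)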

Reading files into dada2 first requires an appropriate path to be set. This should point to the folder in which the raw .fastq files for your project are saved. It is important that only the .fastq files from your specific project are in this folder, as the script will read every file it contains. Confirm that the correct path has been set using the list.files() function.

path = "~/geotraces-exp/Geotraces_exp_fastq"
# check the file path is correct using list.files(path).

# we take this opportunity to extract the file names as a
# vector, which will be used later in the script
f.names = as.vector(list.files(path, pattern = "_R1_001.fastq", 
    full.names = F))
r.names = as.vector(list.files(path, pattern = "_R2_001.fastq", 
    full.names = F))

# pattern identifies all files in the path ending in
# '_R1_001.fastq'. This orders and separates forward and
# reverse reads.
fnFs = sort(list.files(path, pattern = "_R1_001.fastq", full.names = TRUE))
fnRs = sort(list.files(path, pattern = "_R2_001.fastq", full.names = TRUE))

 

Plot quality scores from .fastq files

Plot the quality (Q) scores associated with each .fastq file. Look for drop-offs in sequence quality, bearing in mind that a Q score of 30 is equivalent to a 99.9% probability of an accurate base call and a Q score of 20 is equivalent to a 99% probability. The number of plots is 36 in this case, as there are 36 samples; this should be adjusted to the number of samples in your investigation.

qp.f = plotQualityProfile(fnFs[1:36])
qp.r = plotQualityProfile(fnRs[1:36])

# illustration of plot outputs (first 2 only)
plotQualityProfile(fnFs[1:2])

The grey blocks in the background of each plot represent the density of reads in that sample with a given quality score at each position; ideally this produces a dark grey band above Q30 and a much paler distribution of blocks below. The solid green line represents the mean quality score, the solid orange line the median, and the dashed orange lines the 25th and 75th quantiles.

 

Filter and trim (quality control)

The first step in the filtering and trimming script is to set up a new directory in which the filtered and trimmed .fastq files will be saved. This is achieved with the file.path() function, which creates a subdirectory in the path location. File paths are then created for the filtered forward and reverse reads, with predetermined suffixes ("_F_filt.fastq.gz" and "_R_filt.fastq.gz") that allow us to differentiate them.

After inspecting the quality plots, determine the trim and truncate parameters for .fastq files.

Check the raw .fastq files for primer sequences and, if present, remove the primers using the trimLeft argument of the filterAndTrim function (one way of checking is sketched below).
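
The snippet below is a minimal sketch of such a check, assuming the ShortRead package is installed (it is installed alongside dada2). The 19-bp 515F-Y forward primer is used here as an example; substitute your own primer sequence.

library(ShortRead)
fwd.primer = "GTGYCAGCMGCCGCGGTAA"  #replace with your forward primer
# read the first forward .fastq file and extract the sequences
fwd.reads = sread(readFastq(fnFs[1]))
# count reads whose first bases match the primer (fixed = FALSE allows
# IUPAC ambiguity codes such as Y and M in the primer)
sum(vcountPattern(fwd.primer, narrow(fwd.reads, 1, nchar(fwd.primer)), 
    fixed = FALSE))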

In this instance, the forward and reverse primers are still present. The forward primer is 19 bp and the reverse primer is 20 bp, so they are removed by setting trimLeft = c(19, 20). The truncation length should be set to cut reads off where any significant drop-off in quality occurs, but it is advisable to remove approximately the last 10 bp in any case, as these positions can be unreliable.

filt_path = file.path(path, "filtered")  #place filtered files in 'filtered' subdirectory
filtFs = file.path(filt_path, paste0(f.names, "_F_filt.fastq.gz"))
filtRs = file.path(filt_path, paste0(r.names, "_R_filt.fastq.gz"))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen = c(240, 
    181), trimLeft = c(19, 20), maxN = 0, maxEE = c(2, 2), truncQ = 2, 
    rm.phix = TRUE, compress = TRUE, multithread = TRUE)

head(out)  #check the results

 

Learning error rates

The dada2 algorithm relies on an error model calculated from each amplicon dataset. This model is then used downstream to distinguish, using a probabilistic approach, true sequence variants from sequence variants which have arisen from sequencing errors. From the pipeline authors: “The learnErrors method learns the error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many optimization problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).”

# calculating the error model
errF = learnErrors(filtFs, multithread = TRUE)
errR = learnErrors(filtRs, multithread = TRUE)

# plotting errors
plot.errF = plotErrors(errF, nominalQ = TRUE) + theme_bw() + 
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.errR = plotErrors(errR, nominalQ = TRUE) + theme_bw() + 
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

The error rates for each possible base-to-base transition (e.g. A to C, A to G) are shown. The red line represents the error rate expected from the nominal quality score alone (e.g. 99.9% accuracy at Q30). The black line represents the estimated error rate, fitted to the observed error rates (points). Estimated error rates should decrease with increasing quality score. Note that, to reduce computational demand, the algorithm estimates errors from only the first one million reads by default. If your error plots look unreasonable (e.g. a trend of increasing error with increasing quality score), you can manually increase the number of reads considered (increasing computational demand, and therefore runtime) using the nreads argument of the learnErrors function.
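
For example (a sketch; in older dada2 releases the argument is nreads, while newer releases control the amount of data used via nbases instead, so check ?learnErrors for your installed version):

# re-learn the forward-read error model from more reads if the fitted
# error rates look unreasonable
errF = learnErrors(filtFs, multithread = TRUE, nreads = 2e+06)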

 

Dereplicate sequences

Dereplication merges all identical sequence reads into a new object containing only unique sequences, each with corresponding abundance information (the number of reads it represents). This is done to reduce the computational demand of downstream calculations by removing redundant comparisons, so the pipeline runs significantly faster.

Sequence quality scores are retained during this process: a consensus quality profile is generated for each unique sequence from all of the identical reads collapsed into it. These quality profiles are then used to inform the error model of the denoising step, which increases dada2's accuracy.

derepFs = derepFastq(filtFs, verbose = TRUE)
derepRs = derepFastq(filtRs, verbose = TRUE)
names(derepFs) = f.names
names(derepRs) = r.names

 

DADA2 core sequence variant inference algorithm

At this stage the DADA2 algorithm takes all of the information in the dereplicated sequences and the error models and determines which sequences represent true amplicon sequence variants and which have arisen from errors. We therefore end up with a list of true amplicon sequence variants, avoiding the archaic practice of clustering.

dadaFs = dada(derepFs, err = errF, multithread = TRUE)
dadaRs = dada(derepRs, err = errR, multithread = TRUE)
dadaFs[[1]]
dadaRs[[1]]

Calling dadaFs[[1]] displays the number of real sequence variants inferred from the number of unique input sequences for sample 1. Much more information is available, which can be accessed through help("dada-class").

 

Merge paired reads

The forward and reverse reads are then merged using the mergePairs function. Note that the forward and reverse reads must be in matching order at this stage.

mergers = mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose = TRUE)

# Check merger success by sample ([[1]] refers to sample 1)
head(mergers[[1]])

 

Generate sequence table

Construct a sequence table from the merged sequence object. Viewing this table will reveal a number of amplicon lengths outside the expected range for your target region; these are likely to be erroneous.

seqtab = makeSequenceTable(mergers)
dim(seqtab)
table(nchar(getSequences(seqtab)))

 

Plot amplicon length distribution

Plot the amplicon length frequencies to visualise the distribution of amplicon sizes; this informs which length range to conserve and which to discard.

ex = as.data.frame(table(nchar(getSequences(seqtab))))
ex.plot = ggplot(ex, aes(Var1, Freq)) + geom_bar(stat = "identity", 
    aes(fill = Var1)) + ylab("Frequency") + xlab("Merged Sequence Length") + 
    scale_x_discrete(breaks = c(220, 240, 250, 260, 270, 300, 
        330, 360)) + theme_bw() + theme(panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank()) + theme(legend.position = "none")
ex.plot

 

Excise amplicons of desired length

In this case, only amplicons between 250 and 256 bp in length are retained. Amplicons outside this range are discarded.

seqtab.ex = seqtab[, nchar(colnames(seqtab)) %in% seq(250, 256)]
table(nchar(getSequences(seqtab.ex)))

 

Remove chimeras

Chimeric sequences (PCR artefacts formed when an incompletely extended amplicon primes the wrong template in a later cycle) are identified and removed here. Note that if you lose a large number of sequence reads at this stage, it is likely that primer sequences still remain in your reads; these should have been removed at the trim and truncate stage.

seqtab.ex.chi = removeBimeraDenovo(seqtab.ex, method = "consensus", 
    multithread = T, verbose = T)

 

Track sequence loss

Track the number of sequence reads per sample remaining after each stage of processing. This provides an opportunity to identify points in the pipeline where large proportions of sequences were lost, which may be worth revisiting and tweaking. Commonly, 15-20% of reads are lost over the course of the pipeline.

getN = function(x) sum(getUniques(x))
track = cbind(out, sapply(dadaFs, getN), sapply(mergers, getN), 
    rowSums(seqtab.ex), rowSums(seqtab.ex.chi))
colnames(track) = c("input", "filtered", "denoised", "merged", 
    "tabled", "no chim")
rownames(track) = f.names
head(track)

 

Assign taxonomy

It is important to consider which taxonomic database to use for this step. The two most common options are SILVA and Greengenes. Typically, Greengenes offers more reliable taxonomic classifications, but it has less coverage and is not updated regularly (last update ~2014). SILVA is a curated database which is updated frequently, but it can be less reliable than Greengenes in certain cases. Some studies use integrated approaches in which taxonomies are assigned from numerous databases and checked for agreement. The bioconductor pipeline team maintain taxonomic classifiers from the SILVA, Greengenes, and RDP databases for the Earth Microbiome Project (EMP) primers 515F (or 515F-Y for marine samples) and 806R. In this case, these classifiers can be downloaded directly.

Once you have downloaded your desired classifier, place it in a folder and supply the specific file path to the assignTaxonomy function. This step may take a significant amount of time, depending on computing power, sample number, and sequencing depth.

gg.taxa = assignTaxonomy(seqtab.ex.chi, "~/geotraces-exp/Geotraces_exp_unzipped/gg_13_8_train_set_97.fa", 
    multithread = T)
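
As a quick optional check (a sketch, not part of the original walkthrough), the assignments can be inspected without printing the full-length sequences used as row names:

# view the first few taxonomic assignments with the sequence rownames removed
taxa.print = gg.taxa
rownames(taxa.print) = NULL
head(taxa.print)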

 

Upload metadata

The metadata file associated with your sequences should be uploaded with samples as rows and variables as columns. Various file formats can be used; the ultimate goal is a metadata data frame in R that can be incorporated into a phyloseq object. It is critical that the sample order in your metadata table matches the sample order in your sequence table (in this case seqtab.ex.chi); a quick check of the ordering is sketched after the code below.

sd.geo = read.csv("~/geotraces-exp/geo-mapping.csv")
head(sd.geo)
# if your variable names (column names) are unclear, rename
# them at this stage.
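
As a quick sanity check (a sketch; sample_id is a hypothetical identifier column, so substitute the column name used in your own mapping file), the metadata identifiers can be viewed side by side with the sequence-table row names, which at this point are still the .fastq file names:

# the order of the two columns printed below should correspond
cbind(rownames(seqtab.ex.chi), as.character(sd.geo$sample_id))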

 

Rename samples and ASVs

Before constructing the phyloseq object, it is helpful to rename the samples (which will currently be named by their file names) and ASVs (which will currently be named by their full sequences). This will make handling the phyloseq object easier downstream. Because the taxonomy table is also indexed by the full-length sequences, its row names are renamed in the same way so that the taxonomy and sequence tables continue to match.

# create a vector of sample names
s.names = paste0("sample_", 1:36)  #number should reflect your total number of samples

# apply sample names to the metadata
row.names(sd.geo) = s.names
head(sd.geo)

# apply sample names to the sequence table (samples are rows)
row.names(seqtab.ex.chi) = s.names

# create a vector of ASV names
asv.names = paste0("asv_", 1:1937)  #number should reflect your total ASVs

# apply ASV names to the sequence table columns (ASVs are columns, not rows)
# and to the taxonomy table rows; assignTaxonomy returns rows in the same
# order as the sequence-table columns, so the same name vector keeps the
# two tables matched when building the phyloseq object
colnames(seqtab.ex.chi) = asv.names
rownames(gg.taxa) = asv.names
head(seqtab.ex.chi)

 

Construct a phyloseq object

Finally, a phyloseq object is created which can be handled easily for statistical analyses. The phyloseq object from this tutorial is notably missing a phylogenetic tree. This can easily be constructed and will be added to this tutorial at a later date.

geo.ex = phyloseq(tax_table(gg.taxa), otu_table(seqtab.ex.chi, 
    taxa_are_rows = FALSE), sample_data(sd.geo))
geo.ex
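
As a brief illustration of downstream use (a sketch, not part of the original walkthrough), alpha diversity can be plotted directly from the phyloseq object:

# quick example: plot Shannon diversity for each sample
plot_richness(geo.ex, measures = "Shannon") + theme_bw()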