Chapter 2 Data Processing

2.1 FastQC

Many tools/options to filter and trim data
Trimming does not always improve things as valuable information can be lost!
Removal of adapters is critical for downstream analysis
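
As a minimal sketch of adapter removal, the snippet below cuts a read at the first exact occurrence of an adapter sequence. The function name `trim_adapter` and the adapter string are illustrative assumptions. Dedicated trimming tools also handle mismatches, partial adapters at the read end, and base qualities, none of which this sketch attempts.

```python
def trim_adapter(read, adapter):
    """Cut the read at the first exact occurrence of the adapter.

    Reads without the adapter are returned unchanged.
    """
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]


# Assumed example adapter, used here only for illustration
ADAPTER = "AGATCGGAAGAGC"

# A short amplicon whose read runs through into the adapter
read_with_adapter = "ACGTACGTGG" + ADAPTER + "TTTT"
read_without_adapter = "ACGTACGTGGCCAATT"

print(trim_adapter(read_with_adapter, ADAPTER))
print(trim_adapter(read_without_adapter, ADAPTER))
```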

2.2 Dereplication

In this process all the quality-filtered sequences are collapsed into a set of unique reads, which are then clustered into OTUs
The dereplication step significantly reduces computation time by eliminating redundant sequences
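
Dereplication can be sketched in a few lines: identical reads are collapsed into one unique sequence with an abundance count. The function name `dereplicate` and the toy reads are hypothetical, and real pipelines work on FASTA/FASTQ files rather than Python lists.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, count) pairs.

    Pairs are sorted by decreasing abundance, then alphabetically,
    so the output order is deterministic.
    """
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA", "ACGA"]
unique_reads = dereplicate(reads)
print(unique_reads)  # [('ACGT', 3), ('ACGA', 2), ('TTGA', 1)]
```

Six reads become three unique sequences, and the counts are kept so that abundance information is not lost.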

What’s an OTU?

https://www.youtube.com/watch?v=azI9taClDhQ

2.3 Chimera detection and removal of non-bacterial sequences

Chimeras are artifact sequences formed by two or more biological sequences incorrectly joined together

Chimera

Incomplete extensions during PCR allow subsequent PCR cycles to use a partially extended strand to bind to the template of a different, but similar, sequence. This partially extended strand then acts as a primer to extend and form a chimeric sequence.
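
The idea can be illustrated with a toy model: a chimera starts like one parent and ends like another. The sketch below assumes equal-length, already aligned sequences and exact matches, and the names `find_breakpoint`, `parent_a`, and `parent_b` are hypothetical. Real chimera detection relies on alignment and scoring against candidate parents, which this sketch does not implement.

```python
def find_breakpoint(query, parent_a, parent_b):
    """Return i such that query == parent_a[:i] + parent_b[i:], else None.

    Assumes all three sequences have the same length and are aligned.
    A query identical to either parent is not reported as a chimera.
    """
    if not (len(query) == len(parent_a) == len(parent_b)):
        raise ValueError("this sketch assumes equal-length sequences")
    if query in (parent_a, parent_b):
        return None
    for i in range(1, len(query)):
        if query[:i] == parent_a[:i] and query[i:] == parent_b[i:]:
            return i
    return None


parent_a = "ACGTACGTAC"
parent_b = "ACCTACGAAC"

# Simulate an incomplete extension of parent_a that is completed on parent_b
chimera = parent_a[:5] + parent_b[5:]

bp = find_breakpoint(chimera, parent_a, parent_b)
print(chimera, bp)
```

Because the two parents are identical over part of their length, the reported breakpoint is only one of several positions consistent with the chimera.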

2.4 Clustering

Analysis of 16S rRNA relies on clustering of related sequences at a particular level of identity and counting the representatives of each cluster

Clustering

Some level of sequence divergence should be allowed. Typical similarity cutoffs used in practice are 95% (genus-level, partial 16S gene), 97% (species-level), or 99%. The resulting cluster of nearly identical tags (assumed to come from identical genomes) is referred to as an OTU (Operational Taxonomic Unit)
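
A minimal sketch of identity-based clustering is shown below, using a greedy centroid approach: each sequence joins the first centroid it matches at or above the cutoff, otherwise it founds a new cluster. This is one common strategy, not necessarily the one used by any particular pipeline. The names `identity` and `greedy_cluster` are hypothetical, and identity is computed position by position on equal-length sequences, which stands in for a real pairwise alignment.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length, pre-aligned sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(unique_reads, threshold=0.97):
    """Cluster (sequence, count) pairs around greedily chosen centroids.

    Input is expected in decreasing order of abundance, so that abundant
    sequences become centroids first.
    """
    clusters = {}  # centroid sequence -> list of (sequence, count) members
    for seq, count in unique_reads:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append((seq, count))
                break
        else:
            # No centroid is similar enough: this sequence founds a new OTU
            clusters[seq] = [(seq, count)]
    return clusters


base = "ACGT" * 25            # 100 bp toy sequence
near = "TT" + base[2:]        # 2 mismatches -> 98% identity to base
far = "T" * 10 + base[10:]    # 8 mismatches -> 92% identity to base

otus_97 = greedy_cluster([(base, 50), (far, 20), (near, 5)], threshold=0.97)
otus_90 = greedy_cluster([(base, 50), (far, 20), (near, 5)], threshold=0.90)
print(len(otus_97), len(otus_90))  # 2 1
```

Lowering the cutoff from 97% to 90% merges the two clusters into one, which shows how the choice of cutoff changes the number of OTUs.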

2.5 Create OTU tables

An OTU table is a matrix that gives the number of reads per sample per OTU
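
As a small sketch, an OTU table can be built by counting (sample, OTU) pairs, one pair per read. The function name `build_otu_table`, the sample and OTU labels, and the orientation (samples as rows, OTUs as columns) are assumptions made for illustration.

```python
from collections import Counter


def build_otu_table(assignments):
    """Build a count matrix from (sample_id, otu_id) pairs, one per read.

    Returns (samples, otus, table) where table[i][j] is the number of
    reads from samples[i] assigned to otus[j].
    """
    counts = Counter(assignments)
    samples = sorted({sample for sample, _ in counts})
    otus = sorted({otu for _, otu in counts})
    table = [[counts[(sample, otu)] for otu in otus] for sample in samples]
    return samples, otus, table


assignments = [
    ("S1", "OTU1"), ("S1", "OTU1"), ("S1", "OTU2"),
    ("S2", "OTU2"), ("S2", "OTU3"), ("S2", "OTU3"), ("S2", "OTU3"),
]
samples, otus, table = build_otu_table(assignments)

print(otus)
for sample, row in zip(samples, table):
    print(sample, row)
```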

OTUs

2.6 Bin OTUs into Taxonomy (assign taxonomy)

Accuracy of assigning taxonomy depends on the reference database chosen

Ribosomal Database Project
GreenGenes
SILVA

Accuracy depends on the completeness of databases

Database

2.7 Assess Population Diversity: alpha diversity

Assessment of diversity involves two aspects

Species richness (# of species present in a sample)
Species evenness (distribution of relative abundance of species)

Total community diversity of a single sample/environment is given by alpha-diversity and represented using rarefaction curves
Quantitative indices such as Shannon or Simpson summarize alpha-diversity while accounting for evenness as well as richness
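
The two aspects can be made concrete with a short sketch that computes richness, the Shannon index H = -sum(p_i * ln p_i), and the Simpson index in its Gini-Simpson form 1 - sum(p_i^2) from a vector of OTU counts. The function names and toy counts are hypothetical, and the natural logarithm is an assumed convention (other bases are also used).

```python
import math


def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over observed OTUs."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


even_sample = [25, 25, 25, 25]   # four OTUs, equally abundant
skewed_sample = [97, 1, 1, 1]    # four OTUs, one dominant

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, richness(counts), round(shannon(counts), 3), round(simpson(counts), 3))
```

Both toy samples have the same richness, but the skewed sample scores lower on both indices because its reads are concentrated in one OTU.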

Human Mol. Genet., 2013

2.8 Assess Beta Diversity

Beta-diversity measures community structure differences (taxon composition and relative abundance) between two or more samples

For example, beta-diversity indices can compare similarities and differences in microbial communities in healthy and diseased states

Many qualitative (presence/absence of taxa) and quantitative (taxon abundance) measures of community distance are available through several tools

LIBSHUFF, TreeClimber, DPCoA, UniFrac (QIIME)
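
To illustrate the difference between qualitative and quantitative measures, the sketch below computes the Jaccard distance (presence/absence) and the Bray-Curtis dissimilarity (abundance) between two count vectors. These two measures are swapped in as simple, widely used examples and are not among the tools listed above. The function names and counts are hypothetical.

```python
def jaccard_distance(a, b):
    """Qualitative distance: 1 - |shared taxa| / |taxa present in either sample|."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Quantitative dissimilarity: sum|a_i - b_i| / (sum a_i + sum b_i)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Toy OTU counts for two samples over the same four OTUs
healthy = [10, 10, 0, 5]
diseased = [1, 19, 5, 0]

print(jaccard_distance(healthy, diseased))  # 0.5
print(bray_curtis(healthy, diseased))       # 0.56
```

Both measures range from 0 (identical communities) to 1 (nothing in common), but only Bray-Curtis responds to the shift in abundance of the shared OTUs.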

2.9 Measuring Population Diversity: alpha and beta diversity

PLoS Computational Biol., 2012

2.10 Diversity Measurements with 16S rRNA sequencing

Overall Benefits

Cost effective
Data analysis can be performed by established pipelines
Large body of archived data is available for reference

Overall Limitations

Sequences only a single region of the genome
Classifications often lack accuracy at the species level
Copy number per genome can vary. While copy numbers tend to be taxon-specific, variation among strains is possible
Relative abundance measurements are unreliable because of amplification biases
Diversity of the gene tends to inflate diversity estimates

FastQC for 16S rRNA dataset

Extremely biased per base sequence content
Extremely narrow distribution of GC content
Very high sequence duplication levels
Abundance of overrepresented sequences
In cases where the PCR target is shorter than the read length, the sequence will read through into adapters

2.11 Taxonomy: Expectation vs Reality

Expectation vs. Reality

2.12 Beta Diversity - UniFrac

Measures how different two samples’ component sequences are

Unifrac

Weighted UniFrac: takes abundance of each sequence into account
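
A minimal sketch on a toy tree may help. Unweighted UniFrac is the fraction of total branch length that leads to sequences from only one of the two samples. The raw (unnormalized) weighted variant sums, over branches, the branch length times the absolute difference in the proportion of each sample's reads below that branch. The tree encoding (a list of branch lengths with the set of leaves below each branch), the function names, and the data are assumptions for illustration and do not reflect any tool's input format.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """branches: list of (length, set of leaf names below the branch).
    sample_a, sample_b: sets of leaf names observed in each sample."""
    unique = shared = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & sample_a)
        in_b = bool(leaves & sample_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    total = unique + shared
    return unique / total if total else 0.0


def weighted_unifrac_raw(branches, counts_a, counts_b):
    """Raw weighted UniFrac: sum of length * |prop_a - prop_b| per branch.
    counts_a, counts_b: dicts mapping leaf name -> read count."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples need at least one read")
    distance = 0.0
    for length, leaves in branches:
        prop_a = sum(counts_a.get(leaf, 0) for leaf in leaves) / total_a
        prop_b = sum(counts_b.get(leaf, 0) for leaf in leaves) / total_b
        distance += length * abs(prop_a - prop_b)
    return distance


# Toy tree ((A:1,B:1):1,(C:1,D:1):1), written as one entry per branch
branches = [
    (1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (1.0, {"C", "D"}),
]

print(unweighted_unifrac(branches, {"A", "B"}, {"B", "C"}))  # 0.6
print(weighted_unifrac_raw(branches, {"A": 3, "B": 1}, {"B": 1, "C": 3}))  # 3.0
```

In the unweighted case, three of the five branch-length units covered by the two samples are unique to one sample, giving 0.6. Samples with no shared lineage give 1.0, and identical samples give 0.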

2.13 Results from Paper

Main phyla: Firmicutes, Bacteroidetes, Proteobacteria, Actinobacteria, Fusobacteria, with differences between samples
Sputum (patient) samples had the highest diversity, followed by oropharynx samples, followed by nasal samples
Healthy controls (nasal, N, and oropharynx, O) were more diverse than samples from TB patients
Between-group comparisons?
Phyla differences?