Chapter 5 Read Processing (ONT)

5.1 Oxford Nanopore Sequencing (ONT)

Wang, Y., Zhao, Y., Bollas, A., Wang, Y., & Au, K. F. (2021). Nanopore sequencing technology, bioinformatics and applications. Nature Biotechnology, 39(11), 1348-1365.

https://www.youtube.com/watch?v=E9-Rm5AoZGw

5.2 Filter Host DNA

To focus on sequencing reads from the microbes in the nodule, we will filter out reads that align to the red alder genome as follows:

Align the fastq-formatted reads to the red alder genome using minimap2.
Select the reads that do not align to red alder using samtools.
Write a fastq file containing only the unaligned reads using samtools fastq.
Compress the fastq file using gzip.

5.2.1 Setup

Activate the environment that contains minimap2 and samtools

conda activate filter-reads

Make a directory and go into it

mkdir ~/microbe_fastq
cd ~/microbe_fastq

Create a symbolic link to the merged MinION reads

ln -s /home/data/metagenomics/red-alder-reads/3469-3.all.fastq .

5.2.2 Alignment

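Run minimap2 to align the MinION reads to the red alder genome

The -x map-ont preset is tuned for Oxford Nanopore reads (it allows ~10% error + divergence). The -a option writes the alignments in SAM format, -t 8 uses 8 threads, and -L moves very long CIGAR strings into the CG tag.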

minimap2 -x map-ont -L -t 8 -a \
/home/data/metagenomics/red-alder-reads/red-alder-genome.fasta \
3469-3.all.fastq > 3469-3-minionxredalder.mm2.sam

5.2.3 Get microbial reads

Now we will use samtools, which is available in the same environment, to pull out reads that didn’t align to red alder.

Convert the unmapped reads in the alignment file (sam) to a fastq file

The -f4 option keeps only reads that have flag 4 (read unmapped) set

samtools fastq -f4 3469-3-minionxredalder.mm2.sam > 3469-3.microbe.fq
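As a rough illustration of what the flag 4 test means, the sketch below counts the records in a SAM file whose FLAG (column 2) has the 4 bit set. The function name count_unmapped_sam is made up for this example and is not part of samtools; it only uses awk.

```shell
# Hypothetical helper (not a samtools command): count SAM records whose
# FLAG field (column 2) has bit 4 set, i.e. "read unmapped".
# Header lines start with "@" and are skipped.
count_unmapped_sam() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ } END { print n + 0 }' "$1"
}
```

Running count_unmapped_sam 3469-3-minionxredalder.mm2.sam should give the number of reads that samtools fastq -f4 writes out, which makes it a quick cross-check on the filtering step.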

Compress the new fastq file

(note that gzip automatically adds the extension .gz and replaces the uncompressed file)

gzip 3469-3.microbe.fq
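To see how many reads survived host filtering, one option is to count the records in the fastq files before and after. The sketch below assumes each record occupies exactly four lines (true for samtools fastq output, and typical for basecaller output); the function name count_fastq_reads is invented for this example.

```shell
# Hypothetical helper: count reads in a FASTQ file, plain or gzip-compressed.
# Assumes 4 lines per record (no line-wrapped sequences).
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;   # decompress to stdout, leave the file alone
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}
```

For example, compare count_fastq_reads 3469-3.all.fastq with count_fastq_reads 3469-3.microbe.fq.gz; the difference is the number of reads that aligned to red alder.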

Now filter host reads from sample 4956-3

Reads are here:

/home/data/metagenomics/red-alder-reads/4956-3.all.fastq
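If you want to check your work, the steps above can be wrapped in a small function that takes the sample name as an argument. This is only a sketch: the function name filter_host_reads and its optional second argument are invented here, and it reads the fastq file directly from the data directory instead of through a symbolic link. The minimap2 and samtools options are the same ones used for sample 3469-3.

```shell
# Hypothetical helper: filter_host_reads SAMPLE [DATA_DIR]
# Repeats the alignment, extraction and compression steps for one sample.
filter_host_reads() {
    sample=$1
    data_dir=${2:-/home/data/metagenomics/red-alder-reads}
    sam="${sample}-minionxredalder.mm2.sam"

    # Stop early if the environment is not activated.
    for tool in minimap2 samtools gzip; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 1
        }
    done

    # 1. Align the reads to the red alder genome.
    minimap2 -x map-ont -L -t 8 -a \
        "$data_dir/red-alder-genome.fasta" \
        "$data_dir/${sample}.all.fastq" > "$sam" || return 1

    # 2. Keep only unmapped reads (flag 4) and write them as fastq.
    samtools fastq -f 4 "$sam" > "${sample}.microbe.fq" || return 1

    # 3. Compress the result (produces SAMPLE.microbe.fq.gz).
    gzip "${sample}.microbe.fq"
}
```

With the filter-reads environment active, filter_host_reads 4956-3 would leave 4956-3.microbe.fq.gz in the current directory.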

5.3 Quality Control

“PycoQC computes metrics and generates interactive QC plots for Oxford Nanopore technologies sequencing data” (https://a-slide.github.io/pycoQC/ )

What do we need in order to run a quality control check?

Sequencing summary file
- Automatically produced by the MinION basecaller.
PycoQC package + dependencies
- Already installed for you on Ghostwheel.
A line of code to produce the .html file.
A line of code to secure copy the file to your computer.

5.3.1 Sequencing Summary File

Where is it?

/home/data/metagenomics/red-alder-reads/sequencing_summary_FAS21661_9ac87089.txt
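The sequencing summary is a tab-separated text file with one header row and one row per read, so it can be inspected with standard tools before running pycoQC. The sketch below looks up a column by its header name and prints the number of rows and the column mean. The function name summary_column_mean is made up for this example, and the column name used in the usage note (sequence_length_template) is an assumption based on typical basecaller output; check the header of your own file first with head -n 1.

```shell
# Hypothetical helper: summary_column_mean FILE COLUMN_NAME
# Prints "<number of rows><TAB><mean of the column>" for a tab-separated
# file whose first line is a header.
summary_column_mean() {
    awk -F '\t' -v col="$2" '
        NR == 1 {
            # Find the position of the requested column in the header.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { sum += $idx; n++ }
        END { if (idx && n) printf "%d\t%.1f\n", n, sum / n }
    ' "$1"
}
```

For example, summary_column_mean sequencing_summary_FAS21661_9ac87089.txt sequence_length_template would report the read count and mean read length, assuming that column exists in the file.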

5.3.2 PycoQC package & code

Where is it? How to activate it?

Log in to Ghostwheel
Stay in your home directory (check with pwd)
Type the following:

source activate pycoQC
pycoQC
pycoQC -f inputfilename.txt -o outputfilename.html

5.3.3 Secure Copy

Which terminal do we copy from? What is the command?

Open a new terminal window on your own computer, but don’t connect to the Linux server
Type:

scp -P 2508 <username>@inbre.ncgr.org:/home/<username>/outputfilename.html ~/Desktop/

5.3.4 Open your URL

Find your file on your desktop.

Double-click the file to open it, or right-click to choose a browser.

For help interpreting the plots, see the pycoQC documentation: https://a-slide.github.io/pycoQC/

5.3.5 Normalization

Normalization aims to recover the correct relative gene expression abundances between cells.

Gene expression is compared between cells using count data.

What does a count in a count matrix represent? The successful completion of three steps:

Capture of a molecule of mRNA
Reverse transcription of that mRNA
Sequencing of the resulting molecule

The most common normalization protocol is count depth scaling, also known as CPM (counts per million):

It assumes that all cells in the dataset initially contain an equal number of mRNA molecules.
It assumes that count depth differences arise only from sampling.
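As a sketch of what count depth scaling computes, write $x_{gc}$ for the raw count of gene $g$ in cell $c$ (this notation is introduced here for illustration only). Each count is divided by the total count of its cell and multiplied by one million:

```latex
\mathrm{CPM}_{gc} = \frac{x_{gc}}{\sum_{g'} x_{g'c}} \times 10^{6}
```

For example, a gene with 50 counts in a cell whose counts total 20,000 has a CPM of $50 / 20000 \times 10^{6} = 2500$.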

Normalization complete.

But wait!
We still have unwanted variability in the data.
What kind of unwanted variability?
What is the solution? Data Correction.

5.3.6 Data correction and integration

Covariates

Cell-cycle effects (biological)
Batch (technical)
Dropout (technical)

Which covariates to consider?

It depends on the downstream analysis
Corrections for biological and technical covariates should be considered separately
The two kinds of correction are used for different purposes
Each approach to correction presents unique challenges

What are the Correction methods?

Regressing out biological effects
Regressing out technical effects
Batch effects and data integration
Expression recovery