Chapter 6 Illumina WGS Community Profiling (Centrifuge)
6.1 What is Centrifuge?
“Centrifuge is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples, with better sensitivity than and comparable accuracy to other leading systems. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem”.
6.2 Data
We will use data from this publication: “Metagenomic Characterization of the Human Intestinal Microbiota in Fecal Samples from STEC-Infected Patients” (https://www.frontiersin.org/articles/10.3389/fcimb.2018.00025/). Data were retrieved from the NCBI BioProject PRJEB23207.
Illumina whole genome metagenomic sequencing of the intestinal microbiome (feces samples).
11 samples
Shiga toxin-producing E. Coli (STEC) infection (N = 2).
Crohn’s Patients (N = 4)
Healthy (N = 2)
Healed controls (N = 3)
Can you identify the STEC positive samples?
Can you identify the specific strain of STEC?
6.3 Classification
Because of large time and space requirements to build indexes, we will use pre-built indexes that contain complete RefSeq genomes for archea, bacteria, viruses and fungi. The human chromosome level reference genome was included and we also added the Eukaryotic Pathogens Databased (EuPathDB) as well. We’ll kick off the Centrifuge classification and while it is running we will teach you how to create a small database and also go over the commands we used to create the big database.
Make a directory to store your results
6.3.1 Centrifuge
Now let’s run Centrifuge (it will take 10 - 20 minutes per sample to run).
First, activate the environment.
Parameters
-x Path to the indexes (/home/data/metagenomics-2310/centrifuge/complete_genomes/Arc_Bac_Vir_Hum_Eupath_v2)
-1 Read 1 fastq file
-2 Read 2 fastq file
-t Print wall-clock time taken by search phase
-p Number of alignment threads to launch
--met-file
--met-stderr Send metrics to stderr
-S Output file name
centrifuge \
-x /home/data/metagenomics/centrifuge/complete_genomes/Arc_Bac_Vir_Hum_Eupath_v2 \
-1 /home/data/metagenomics/centrifuge/ERR2271042_1.fastq \
-2 /home/data/metagenomics/centrifuge/ERR2271042_2.fastq -t \
-p 4 \
--met-file ~/centrifuge/centrifuge_results/ERR2271042_meta.txt \
--met-stderr -S ~/centrifuge/centrifuge_results/ERR2271042_cent.outYou can also run all the samples together using a for loop (>1 hr).
for i in /home/data/metagenomics/centrifuge/*_1.fastq; do
bn=`basename $i _1.fastq`
centrifuge \
-x /home/data/metagenomics/centrifuge/complete_genomes/Arc_Bac_Vir_Hum_Eupath_v2 \
-1 /home/data/metagenomics/centrifuge/${bn}_1.fastq \
-2 /home/data/metagenomics/centrifuge/${bn}_2.fastq -t \
-p 4 \
--met-file ~/centrifuge/centrifuge_results/${bn}_meta.txt \
--met-stderr -S ~/centrifuge/centrifuge_results/${bn}_cent.out
done6.3.2 Kraken-style reports
Now let’s create a Kraken-style report from the Centrifuge output using the centrifuge-kreport command (~1 min).
Parameters
-x Path to the indexes
You also feed in the _cent.out file we generated when we ran Centrifuge.
Make a directory to store your results
Create the report. We will use a for loop to generate the report for all the samples.
6.4 Visualize the Kraken Reports with Pavian
First, download all *.krn files to your computer.
If you are using RStudio, open it up and type “pavian::runApp(port=5000)” in the console.
Or go to the website: https://fbreitwieser.shinyapps.io/pavian/
Here are some things to try:
• Drag the .krn files to the “Browse” bar under the “Upload files” tab.
• Click on the “Results Overview” tab on the left side of the screen.
• Click on the “Sample” tab on the left side of the screen. You will notice an interactive Sankey chart that displays different classified taxa. You can click and drag nodes to reorder the nodes if you want. You can select different samples by clicking on the down arrow next to the “Select sample” header. Select sample ERR2270960 or ERR2270961 since we know that these two samples are likely infected with STEC.
• From the “Sample” page, click on the “Table” tab. You should see an interactive, filterable and searchable table that summarizes the results of Centrifuge. For example, type “Escherichia coli” into the “Search:” box. A histogram will appear that displays the numbers of reads across all samples that were classified as “Escherichia coli”. Sort the table by the most abundant taxa by clicking on the up arrow next to “TaxonReads”.
• Can you can you identify the strain(s) of STEC that samples ERR2270960 and ERR2270961 likely contain?
6.5 Practice Database
We will create a small database that has only fungi and human genomes so that you learn how to create Centrifuge databases. In the “Full Database” section we have the commands we used to create the database we used above but we will not run them. If you need to create an updated database for your own research, you can run them at a later time.
Go into the centrifuge directory.
6.5.1 Taxonomy files
First create a centrifuge directory and another directory inside that called taxonomy.
# Create a directory to store results
# The -p directory creates parent directories (in this case "centrifuge") as needed
mkdir -p ~/centrifuge/taxonomyCentrifuge provides a few scripts that will allow you to download the NCBI taxonomy tree and RefSeq reference genomes. In addition to the reference fasta files, we need the taxonomy tree files:
nodes.dmp: Links taxonomy IDs to their parents
names.dmp: Links taxonomy IDs to their scientific name
seqid2taxid.map: Links sequence IDs to taxonomy IDs
If you are using custom taxonomy or sequence files for the database, please refer to the manual (https://ccb.jhu.edu/software/centrifuge/manual.shtml)
Let’s download the taxonomy files (~20 seconds). This gets the nodes.dmp and the names.dmp files.
Parameters
-o specifies the folder to which the files are downloaded
taxonomy database to download (refseq, genbank, contaminants or taxonomy)
6.5.2 Fungal genomes
Now lets download complete fungal genomes from RefSeq (~1 min). This will also make the seqid2taxid.map file.
Parameters
-a Only download genomes with the specified assembly level. Default: ‘Complete Genome’. If you want any assembly level, use ‘Any’
-m Mask low-complexity regions using dustmasker.
-P Number of threads or processes to use when downloading
-d What domain to download (bacteria, viral, archaea, fungi, protozoa, invertebrate, plant, vertebrate_mammalian, vertebrate_other); comma-separated
refseq Tells the program to use the refseq database
-o Output directory
6.5.3 Human genome
Download the RefSeq human genome (~14 min). This is our host since we have human samples. It will also create the seqid2taxid.map which we will append to the seqid2taxid.map file that we just made with fungal genomes.
Additional parameters
-t Only download the specified taxonomy IDs, comma separated. Default: any. 9606 is the taxonomy id for Human
-c Only download genomes in the specified RefSeq category. Default: any.
6.6 Eukaryotic Pathogens
We will add in Eukaryotic Pathogens, which can be helpful for medical metagenomic samples. To save time, we will use data we previously downloaded. Instructions for downloading it are in the “Full Database” section. If you use this database for your own research, you might want to download it later on to make sure it is up to date.
Here is the link to the publication: https://doi.org/10.1371/journal.pcbi.1006277
Here is the webpage for the data download https://ccb.jhu.edu/data/eupathDB/
First, make a directory to put the eupath data in.
Copy our already downloaded file.
This is a compressed tarball file that contains several files. Let’s uncompress it and look into the library file it created.
How many genomes are there?
Create a directory in the centrifuge library directory that has our fungal and human data and move the fasta files from this library directory there.
Because we didn’t do this through centrifuge-download, we have to create the information for the seqid2taxid.map file. We’ll extract it from the prelim_map.txt file that came with the eupath data.
awk -F'\t' '{print $2,$3}' ~/centrifuge/eupath/library/prelim_map.txt | \
awk -F'|' '{print $1,$3}' | awk -F' ' '{print $1"\t"$2}' \
>> ~/centrifuge/seqid2taxid.mapNow we will concatenate all the reference genome fasta files (fungal, human, eukaryotic pathogens) that we downloaded. We’ll make sure we are in the library directory first.
Centrifuge doesn’t like the header format of some sequences so let’s reformat the headers.
6.6.1 Build database indices
Now, let’s build the database indices (~39 min).
Parameters
-p   The number of processes/threads to use for creating the index
–conversion-table   The seqid2taxid.map file that you created with the centrifuge-download commands
–taxonomy-tree   The nodes.dmp file that was downloaded with the first centrifuge-download command
–name-table   The names.dmp file that was downloaded with the first centrifuge-download command
6.7 Full Database
Due to time and space considerations, do not run this. If you need to update this database or want to create a similar database for your research, these commands should help you do that at a later date. Refer to the “Practice Database” section for more information on each command.
This database will include archaea, bacteria, viral and fungal genomes from RefSeq. At the time of download, there were 289 Archea, 13,287 bacterial, 8,583 viral, and 10 fungal genomes.
# Create a directory to store results
mkdir -p ~/centrifuge_fulldb/taxonomy
# Download microbial genomes
centrifuge-download -o ~/centrifuge_fulldb/library -a "Complete Genome" -m -P 30 -d "archaea,bacteria,viral,fungi" refseq > seqid2taxid.map
# Download the host genome (human in our case) and append info to the seqid2taxid.map file
centrifuge-download -o ~/centrifuge_fulldb/library -a "Chromosome" -m -P 4 -d "vertebrate_mammalian" \
-t 9606 -c "reference genome" refseq >> seqid2taxid.map
# Get eupath since we have medical metagenomes (wget downloads the data)
mkdir -p ~/centrifuge_fulldb/eupath
cd ~/centrifuge_fulldb/eupath
wget http://ccb.jhu.edu/data/eupathDB/dl/eupathDB.tar.gz
# Extract the eupath data and move fasta files to the centrifuge library directory
tar -zxvf eupathDB.tar.gz
mkdir ~/centrifuge_fulldb/library/eupath
mv ~/centrifuge_fulldb/eupath/library/*.fna ~/centrifuge/library/eupath/
# Add info to the seqid2taxid.map file
awk -F'\t' '{print $2,$3}' ~/centrifuge_fulldb/eupath/library/prelim_map.txt | \
awk -F'|' '{print $1,$3}' | awk -F' ' '{print $1"\t"$2}' \
>> ~/centrifuge_fulldb/seqid2taxid.map
# Cat the downloaded genomes together
cd ~/centrifuge_fulldb/library/
cat */*.fna > allgenomes.fna
# Reformat the headers
sed 's/|kraken[^|]*|/ /' allgenomes.fna > allgenomes_fixed.fna
# Build the database indices (~14 hr and ~410 GB of RAM using 30 threads)
centrifuge-build -p 4 --conversion-table ~/centrifuge_fulldb/seqid2taxid.map \
--taxonomy-tree ~/centrifuge_fulldb/taxonomy/nodes.dmp \
--name-table ~/centrifuge_fulldb/taxonomy/names.dmp \
~/centrifuge_fulldb/library/allgenomes_fixed.fna \
~/centrifuge_fulldb/allgenomes_indicesNOTE: the Centrifuge developers have provided various premade indexes that you can download here, ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz using wget.