Chapter 7 KrakenUniq

“confident and fast metagenomics classification using unique k-mer counts”

Default kmer = 31

KrakenUniq Overview

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0

https://github.com/fbreitwieser/krakenuniq

https://gitlab.umiacs.umd.edu/derek/krakenuniq/-/blob/master/MANUAL.md

7.1 Kraken/Centrifuge Family

Kraken 1 is obsolete.

Centrifuge was written to improve Kraken’s memory issues. It is completely new code with a different classification index. Its databases are also much smaller. Centrifuge can assign a sequence to multiple taxa. We’ll use centrifuge later in the course.

KrakenUniq is based on Kraken 1. Adds an efficient algorithm to assess coverage of unique kmers, running at the same speed and with only slightly more memory than Kraken 1. KrakenUniq distinguishes low abundance organisms from false positives.

Kraken 2 uses “probabilistic data structures” to reduce memory and increase speed at the expense of lower accuracy that leads to false positives during the classifications (“a few dozen out of millions”).

KrakenUniq was updated in May 2022 (v0.7+) that helped to deal with the large databases on computers without enough RAM to load them into memory. Databases can be read in chunks that fit in RAM.

Kraken preload

KrakenUniq gives you k-mer coverage information, reporting the number and percentage k-mers hit by reads. This helps to differentiate between false positives and good classifications.

Alignment

https://www.biorxiv.org/content/10.1101/2022.06.01.494344v1.full.pdf

http://ccb.jhu.edu/software/choosing-a-metagenomics-classifier/#:~:text=However%2C%20while%20Kraken%20provides%20only,1%20databases%2C%20not%20Kraken%202.

https://github.com/fbreitwieser/krakenuniq/blob/master/MANUAL.md

7.2 KrakenUniq databases

KrakenUniq requires Kraken1 not Kraken2 databases

We won’t build a database since it can take several days, but let’s look at the documentation:
https://github.com/fbreitwieser/krakenuniq#database-building

You can also get prebuilt databases:
https://benlangmead.github.io/aws-indexes/k2

We will use a bacterial databases located here:
/home/data/metagenomics-2310/krakenuniq

Link to it.

ln -s /home/data/metagenomics-2310/krakenuniq/ .

The taxa it contains are here: /home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig

Take a look at the file. How many sequences are there?

How many genera?

awk '{print $2}' /home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig | sort -u| wc -l

And how many of each genera?

awk '{print $2}' /home/data/metagenomics-2310/krakenuniq/library/bacteria/library_headers.orig | sort | uniq -c | less

7.3 Run KrakenUniq

Make a working directory and go into it.

mkdir ~/krakenuniq
cd ~/krakenuniq

Activate the KrakenUniq environment.

source activate krakenuniq

Run Kraken on sample 3469-3.

time krakenuniq \
    --db /home/data/metagenomics-2310/krakenuniq \
    --threads 32 \
    --report-file 3469-3.report \
    --unclassified-out 3469-3.unclassified.fna \
    --classified-out 3469-3.classified.fna \
    ~/microbe_fastq/3469-3.microbe.fq.gz \
    > 3469-3.krakenuniqoutput

Make sure you got the following files: 3469-3.classified.fna 3469-3.krakenuniqoutput 3469-3.report 3469-3.unclassified.fna

This command will also bring up the input file as well (3469-3.microbe.fq.gz)

ls 3469*

Use the less command to look at each of them.

Now run it on the other samples.

7.4 Pavian

Copy the reports to your computer. Let’s open them in Pavian.

Information on installing and some things you can do in Pavian is in the Centrifuge chapter.

<!—Comments here–>