Chapter 9 Isolate Assembly
We have been focusing on metagenomes—communities of microbial organisms. Now we are going to shift our focus to isolates. We will assemble and annotate single microbial organisms that have been isolated and cultured.
9.1 Get example data
We will use some pathogenic E. coli sequence data isolated from birds in South Africa, sequenced with Oxford Nanopore.
First, make a directory to work in and go into that directory.
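A minimal sketch of this step. The directory name isolate_assembly is an assumption borrowed from the path used later in this chapter; any name works as long as you use it consistently.

```shell
# Create a working directory (no error if it already exists) and move into it.
# "isolate_assembly" is an assumed name, not a requirement.
mkdir -p isolate_assembly
cd isolate_assembly
pwd   # confirm where you are
```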
Now, get the example data from the Sequence Read Archive (SRA), which houses publicly available sequencing reads. We will prefetch the data before we extract the sequencing reads, which makes the extraction faster.
prefetch SRR32629793 --output-directory .
fasterq-dump --outdir . --outfile ecoli.fastq --threads 4 --progress ./SRR32629793/SRR32629793.sra
Now, let’s clean up the reads and get some information about them. We’ll use fastplong, which is designed for long reads. By default, it removes reads in which more than 40% of the nucleotides have a quality score below 15. Since we don’t have access to the nanopore output files, we can’t use pycoQC.
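A possible fastplong invocation is sketched below. It assumes the short options -i (input), -o (output), and -w (worker threads) from the fastplong README, and it leaves the report names at their defaults (fastplong.html and fastplong.json). The thread count is also an assumption. The guard only skips the run when fastplong or the reads are missing.

```shell
raw_reads="ecoli.fastq"        # written by fasterq-dump above
trimmed_reads="ecoli.trim.fq"  # file name used in the rest of this chapter
threads=4                      # assumption: adjust to your machine

if command -v fastplong >/dev/null 2>&1 && [ -s "$raw_reads" ]; then
    # Default filtering: drop reads with >40% of bases below Q15.
    fastplong -i "$raw_reads" -o "$trimmed_reads" -w "$threads"
else
    echo "fastplong or $raw_reads not found; check your environment and path" >&2
fi
```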
We now have a file called ecoli.trim.fq. Download the .html file onto your computer and take a look at the read data before and after trimming. If you are using scp instead of a GUI, an example command is below.
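A hedged sketch of such an scp command. The user name, host name, and remote directory are placeholders, and fastplong.html is assumed to be the report name (the fastplong default). The snippet only prints the command; paste the printed line into a terminal on your own computer, not on the server.

```shell
# Placeholders: replace with your own login, server, and remote directory.
remote_user="username"
remote_host="server.example.edu"
remote_path="isolate_assembly/fastplong.html"

# Build and print the command; the trailing "." copies into the current local directory.
scp_cmd="scp ${remote_user}@${remote_host}:${remote_path} ."
echo "$scp_cmd"
```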
Find the total number of bases after filtering in the .html file. Calculate the approximate coverage by dividing this number by the E. coli genome size (~5 Mb). We want to make sure the coverage is at least 30X.
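The arithmetic can be sketched in the shell as follows. The total_bases value is a made-up placeholder, not a result from this dataset; replace it with the number from your own report.

```shell
total_bases=400000000   # placeholder: use the "total bases after filtering" from your report
genome_size=5000000     # approximate E. coli genome size (~5 Mb)

# Coverage = total bases / genome size
coverage=$(awk -v b="$total_bases" -v g="$genome_size" 'BEGIN { printf "%.1f", b / g }')
echo "Approximate coverage: ${coverage}X"

# Check against the 30X minimum.
if awk -v c="$coverage" 'BEGIN { exit !(c >= 30) }'; then
    status="sufficient"
else
    status="low"
fi
echo "Coverage is $status"
```

With the placeholder value, 400,000,000 / 5,000,000 gives 80X, which is above the 30X threshold.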
9.2 Assemble the genome
We will use flye to assemble the genome. It has preset parameters specifically for nanopore data. We’ll use the --nano-raw option. Newer versions of flye also have a --nano-hq option for high-quality reads called with the Guppy5+ SUP base callers.
More information is available at https://github.com/mikolmogorov/Flye/blob/flye/docs/USAGE.md
Activate the environment.
Now run the assembly. This will take about 10 minutes.
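A minimal sketch of the flye command, assuming the trimmed reads from the previous section and an output directory called ecoli_flye (the name that appears in the CheckM2 step). The thread count is an assumption. The guard only skips the run when flye or the reads are missing.

```shell
reads="ecoli.trim.fq"   # trimmed reads from the previous section
outdir="ecoli_flye"     # output directory name used later in this chapter
threads=4               # assumption: adjust to your machine

if command -v flye >/dev/null 2>&1 && [ -s "$reads" ]; then
    # --nano-raw selects the preset for regular nanopore reads.
    flye --nano-raw "$reads" --out-dir "$outdir" --threads "$threads"
else
    echo "flye or $reads not found; activate the environment and check the path" >&2
fi
```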
As it is finishing up, it will print out some assembly statistics and the path to the final assembly. Discuss with your neighbor what you think each one means, then discuss as a group.
We can get a little more information from one of the output files, assembly_info.txt in the flye output directory. Take a look and see what you understand and what questions you have.
Columns
* Contig/scaffold id
* Length
* Coverage
* Is circular, (Y)es or (N)o
* Is repetitive, (Y)es or (N)o
* Multiplicity (based on coverage)
* Alternative group (alternative haplotypes)
* Graph path (graph path corresponding to this contig/scaffold)
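One way to pull the first four columns out of that file is sketched below. It assumes the file is tab-separated, that its header line starts with #, and that it sits at ecoli_flye/assembly_info.txt. The function name summarize_contigs is made up for this example.

```shell
# Print id, length, coverage, and circularity for each contig.
# Lines starting with "#" (the header) are skipped.
summarize_contigs() {
    awk -F'\t' '!/^#/ { printf "%s\t%d bp\t%sx\tcircular=%s\n", $1, $2, $3, $4 }' "$1"
}

info="ecoli_flye/assembly_info.txt"   # assumed location of the flye output
if [ -f "$info" ]; then
    summarize_contigs "$info"
else
    echo "$info not found; run the assembly first" >&2
fi
```

A circular contig of roughly 5 Mb would be consistent with a complete chromosome, while short circular contigs are often plasmids.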
9.3 Assembly Assessment
We will use checkM2, a successor to checkM, to assess the quality of our assembly. It uses machine learning to figure out what lineage each genome has (this works on metagenomic assemblies as well) and whether the genome has the complete complement of genes expected for that lineage.
More information is available here: https://github.com/chklovski/checkm2
First, activate the environment.
Then run checkM2.
checkm2 predict --threads 20 --input ecoli_flye/assembly.fasta --output-directory ecoli_checkm2 --database_path /opt/checkm2/CheckM2_database/uniref100.KO.1.dmnd
Take a look at the report.
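The report is a wide, tab-separated table, so it can be easier to read with one field per line. The sketch below assumes the report is written to ecoli_checkm2/quality_report.tsv. The function name show_report is made up for this example.

```shell
# Print each column name next to its value, one pair per line, for every genome row.
show_report() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }   # remember the header
                { for (i = 1; i <= NF; i++) printf "%s\t%s\n", name[i], $i }
    ' "$1"
}

report="ecoli_checkm2/quality_report.tsv"   # assumed CheckM2 report location
if [ -f "$report" ]; then
    show_report "$report"
else
    echo "$report not found; run checkm2 predict first" >&2
fi
```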
Name                      assembly
Completeness              100.0
Contamination             0.47
Completeness_Model_Used   Neural Network (Specific Model)
Translation_Table_Used    11
Coding_Density            0.872
Contig_N50                5120307
Average_Gene_Length       308.58413888340897
Genome_Size               5369202
GC_Content                0.51
Total_Coding_Sequences    5069
Total_Contigs             3
Max_Contig_Length         5120307
Additional_Notes          None
9.4 Annotation
We will use the National Center for Biotechnology Information’s (NCBI’s) Prokaryotic Genome Annotation Pipeline (PGAP).
More information is available here: https://www.ncbi.nlm.nih.gov/refseq/annotation_prok/process/ and here: https://github.com/ncbi/pgap/wiki/Quick-Start
The parameters:
-r report anonymized usage data
-o output directory (can include full path)
-g genome assembly (fasta)
-s 'organism_name' (genus or genus and species)
Note: To save time, and because we are having some permissions issues, we have already run this for you, so don’t run it yourself. The command we used was:
/opt/pgap/pgap.py -r -o ecoli_annotation -g /home/jm/isolate_assembly/ecoli_flye/assembly.fasta -s 'Escherichia coli'
The annotation is put into a file in GFF format. More information on the GFF annotation format is here: https://useast.ensembl.org/info/website/upload/gff.html
Link to the GFF file from the annotation that we ran previously and take a look at the file.
Let’s count how many annotations of each type there are in the GFF file.
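The feature type is the third tab-separated column of a GFF file, so one way to count the types is sketched below. The file name annot.gff is an assumption; use whatever name your linked file has. The function name count_feature_types is made up for this example.

```shell
# Count how often each feature type (column 3) occurs, most frequent first.
count_feature_types() {
    grep -v '^#' "$1" |   # drop comment and header lines
        cut -f3 |         # keep only the feature type column
        sort | uniq -c |  # count identical types
        sort -rn          # largest count first
}

gff="annot.gff"   # assumed file name; adjust to match your link
if [ -f "$gff" ]; then
    count_feature_types "$gff"
else
    echo "$gff not found; link the GFF file first" >&2
fi
```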