Chapter 9 Isolate Assembly

We have been focusing on metagenomes—communities of microbial organisms. Now we are going to shift our focus to isolates. We will assemble and annotate single microbial organisms that have been isolated and cultured.

9.1 Get example data

We will use some pathogenic E. coli sequence data isolated from birds in South Africa, sequenced with Oxford Nanopore.

First, make a directory to work in and go into that directory.

mkdir ~/isolate_assembly

cd ~/isolate_assembly

Now, get the example data from the sequence read archive, which houses publicly available sequencing reads. We will prefetch the data before we extract the sequencing reads, which makes it faster.

prefetch SRR32629793 --output-directory .

fasterq-dump --outdir . --outfile ecoli.fastq --threads 4 --progress ./SRR32629793/SRR32629793.sra

Now, let’s clean up the reads and get some information about them. We’ll use fastplong, which is for long reads. By default, it will remove reads with more than 40% of nts having a quality score <15. Since we don’t have access to the nanopore output files, we can’t use pycoQC.

/opt/fastplong/fastplong -i ecoli.fastq -o ecoli.trim.fq

We now have a file called ecoli.trim.fq. Download the .html file onto your computer and take a look at the read data before and after trimming. If you are using scp instead of gui, an example command is below.

scp -P 2508 <username>@inbre.ncgr.org:/home/jm/isolate_assembly/fastplong.html .

Find the total number of bases after filtering in the .html file. Calculate the approximate coverage by dividing this number by the E. coli genome size (~5 Mb). We want to make sure the coverage is at least 30X.

Click for Answer
After filtering, we have 238.209373 Mb of sequence data.

238.209373 Mb / 5 Mb = 47.642X coverage

9.2 Assemble the genome

We will use flye to assemble the genome. It has some preset parameters specifically for nanopore data. We’ll use the –nano-raw parameter. Newer versions of flye have a –nano-hq version for high quality reads called with the Guppy5+ SUP base callers.

More information is available at https://github.com/mikolmogorov/Flye/blob/flye/docs/USAGE.md

Activate the environment.

conda activate flye

Now run the assembly. This will take about 10 minutes.

flye --nano-raw ecoli.trim.fq --out-dir ecoli_flye --threads 4 --no-alt-contigs

As it is finishing up, it will print out some assembly statistics and the path to the final assembly. Discuss with your neighbor what you think each one means then discuss as a group.

Click for Answer
[2025-03-11 04:19:23] INFO: Assembly statistics:

    Total length:   5369202  
    Fragments:  3  
    Fragments N50:  5120307  
    Largest frg:    5120307  
    Scaffolds:  0  
    Mean coverage:  43

[2025-03-11 16:05:17] INFO: Final assembly:   /home/jm/isolate_assembly/ecoli_flye/assembly.fasta

We can get a little more information in one of the output files. Take a look and see what you understand and what questions you have.

Columns * Contig/scaffold id * Length * Coverage * Is circular, (Y)es or (N)o * Is repetitive, (Y)es or (N)o * Multiplicity (based on coverage) * Alternative group (alternative haplotypes) * Graph path (graph path corresponding to this contig/scaffold).

cat ecoli_flye/assembly_info.txt
Click for Answer
#seq_name   length  cov.    circ.   repeat  mult.   alt_group   graph_path  
contig_1    5120307 44  Y   N   1   *   1  
contig_3    161968  38  Y   N   1   *   3  
contig_2    86927   41  Y   N   1   *   2

contig_1 is the main genome and is circular.

Contigs 2, 3 are plasmids (much smaller and also circular).

9.3 Assembly Assessment

We will use checkM2, a successor to checkM, to assess the quality of our assembly. It uses machine learning to figure out what lineage each genome has (this works on metagenomic assemblies as well) and whether the genome has the complete complement of genes expected for that lineage.

More information is available here: https://github.com/chklovski/checkm2

First, activate the environment.

conda activate checkm2

Then run checkM2.

checkm2 predict --threads 20 --input ecoli_flye/assembly.fasta --output-directory ecoli_checkm2 --database_path /opt/checkm2/CheckM2_database/uniref100.KO.1.dmnd 

Take a look at the report.

cat ecoli_checkm2/quality_report.tsv
Click for Answer
Name    Completeness    Contamination   Completeness_Model_Used Translation_Table_Used  Coding_Density  Contig_N50  Average_Gene_Length Genome_Size  GC_Content Total_Coding_Sequences  Total_Contigs   Max_Contig_Length   Additional_Notes  
assembly    100.0   0.47    Neural Network (Specific Model) 11  0.872   5120307 308.58413888340897    5369202   0.51    5069    35120307    None

9.4 Annotation

We will use the National Center for Biotechnology Information’s (NCBI’s) Prokaryotic Genome Annotation Pipeline (PGAP).

More information is available here: https://www.ncbi.nlm.nih.gov/refseq/annotation_prok/process/ and here: https://github.com/ncbi/pgap/wiki/Quick-Start

The parameters: -r report anonymized usage data -o output directory (can include full path) -g genome assembly (fasta) -s ‘organism_name’ (genus or genus and species)

Note: To save time and because we are having some permissions issues, we have run this for you.

This has already been run for you so don’t run it.

/opt/pgap/pgap.py -r -o ecoli_annotation -g /home/jm/isolate_assembly/ecoli_flye/assembly.fasta -s 'Escherichia coli'

The annotation is put into a file in GFF format. More information on the GFF annotation format is here: https://useast.ensembl.org/info/website/upload/gff.html

Link to the GFF file that we ran previously and take a look at the file.

ln -s /opt/pgap/ecoli_annotation/ .

Let’s count how many of each type of annotation there is in the gff file.

grep -v "^#" ecoli_annotation/annot.gff |awk '{print $3}' |sort|uniq -c
Click for Answer
      1 antisense_RNA
     24 CDS
      1 direct_repeat
    121 exon
   4852 gene
   5088 Homology
      7 ncRNA
    355 pseudogene
      3 region
      7 riboswitch
      1 RNase_P_RNA
     22 rRNA
      7 sequence_feature
      1 SRP_RNA
      1 tmRNA
     88 tRNA