Chapter 6 VG Toolkit

6.1 Variation Graph (VG) Toolkit

  • Constructs graphs
  • Manipulates graphs
  • Indexes graphs
  • Maps sequences to graphs
  • Calls variants on mapped sequences
  • Visualizes graphs
Garrison, E., Sirén, J., Novak, A. et al.
Garrison, E., Sirén, J., Novak, A. et al.
vg Graph Genomics Pipeline: https://github.com/vgteam/vg
vg Graph Genomics Pipeline: https://github.com/vgteam/vg

Also does:

Graphs are acyclic, but otherwise general, i.e. reference, iterative, reference-free, etc.

6.2 Reference Graphs

A reference genome “decorated” with variants:

GRAF™ Pan Genome Reference: https://www.sevenbridges.com/graf/
GRAF™ Pan Genome Reference: https://www.sevenbridges.com/graf/

6.2.1 Pipeline

  1. Prepare input
  2. Construct graph
  3. View with Bandage

6.2.2 Preparing the Input

Already done for you:

  • Reference FASTA
    • Description lines should only contain chromosome names
    • Remove extraneous sequences, e.g. mitochondrial DNA
  • VCF
    • Ideally chromosome names exactly match chromosome names in reference

6.2.3 Set up directories

  1. Make sure you’re working in a screen
  2. Make directory
mkdir ~/vg
  1. Navigate to the directory
cd ~/vg
  1. Link to data
ln -s /home/data/pangenomics-2503/yprp .

6.2.4 Variant Call (VCF) Format

https://en.wikipedia.org/wiki/Variant_Call_Format

  • Plain text file
  • Header:
    • Header lines start with #
    • Header lines with special keywords start with ##
  • Columns
    • Tab separated
    • 8 mandatory columns
      • CHROM, POS, ID, REF, ALT, QUAL , FILTER, INFO
    • 9th column is optional
      • FORMAT
    • Each additional column represents a sample and has the format described in the FORMAT column

6.2.5 Variation Graph (VG) Format

https://github.com/vgteam/vg/wiki/File-Formats

  • Protocol Buffer messages (protobufs) describing a graph that can store sequence in its nodes and paths
  • Encoded as binary
  • Code for reading/writing protobufs can be generated for many languages
  • vg.proto

6.2.6 Graphical Fragment Assembly (GFA) Format

  • Originally developed for representing genomes during assembly
  • Now used for pangenomics
  • More on this later…

6.2.7 Yeast Data

Reference: S288C

VCF: 3 SV calling pipelines on the 11 non-reference strains

This workshop used an updated version of the Phase 1 pipeline available here

6.2.8 Construct Graph

Construct a graph from the reference and VCF:

vg construct -r yprp/assemblies/S288C.genome.fa -v yprp/variants/S288C.vcf -S -a -f -p -t 10 > S288C.vg
  • -r yprp/assemblies/S288C.fa
    • the reference genome
  • -v yprp/variants/S288C.sv.vcf
    • the VCF file containing variants (indels, structures, etc)
  • -S
    • include structural variants in construction of graph
  • -a
    • save paths for alts of variants by variant ID
  • -f
    • don’t chop up alternate alleles from input VCF
  • -p
    • show progress
  • -t 10
    • use 10 threads
  • > S288C.vg
    • write the standard out (the computed graph) into S288C.vg

Compact chains of nodes:

vg mod -u S288C.vg -t 10 > S288C.unchop.vg
  • -u
    • unchop nodes, i.e. compact chains

6.2.9 Viewing with Bandage

Convert the graph from vg to gfa for viewing:

vg view S288C.unchop.vg -g > S288C.unchop.gfa
  • -g
    • output GFA format

6.2.10 Converting to GFA

Convert the graph from vg to gfa for analysis:

vg convert S288C.vg -f > S288C.gfa
  • -f
    • output GFA format

6.3 Graph Indexing

6.3.1 Pipeline

  1. Autoindex
  2. Manual index

6.3.2 VG Index Formats

https://github.com/vgteam/vg/wiki/File-Formats

Distance Index (.dist)

  • Minimum path distance between nodes in the graph

Graph Burrows-Wheeler Transform (.gbz)

  • GBWT haplotype index, augmented with a path cover of graph components without haplotypes and/or subsampled to a reasonable number of local haplotypes
  • GBWTGraph that provides node sequences

Minimizer index (.min)

  • Minimizers used for seed finding

Zipcodes (.zipcodes)

  • Zipcode annotations for finding sequence positions of minimizers

6.3.3 Indexing

VG uses the autoindex command to generate the necessary index files for various workflows. Generate the indexes for read mapping as follows (1min):

vg autoindex --workflow giraffe -g S288C.gfa -p S288C
  • –workflow giraffe
    • generate indexes for the “giraffe” read mapper
  • -g S288C.gfa
    • the .gfa file to index
  • -p S2888C
    • the prefix to give all generated index files

If you have a graph with haplotypes as P-lines, you must build the indexes manually. However, version 1.53.0 or later fixes this issue as long as the haplotypes are named correctly. For instance, autoindexing works for PGGB. Graphs converted from .vg to .gfa will have haplotypes as P-lines.

vg gbwt --gbz-format -g S288C.giraffe.gbz -G S288C.gfa
vg index -j S288C.dist S288C.gfa
vg minimizer -d S288C.dist -o S288C.shortread.withzip.min S288C.giraffe.gbz
  • vg gbwt –gbz-format -g S288C.giraffe.gbz
    • the .gbz file to generate
  • vg index -j S288C.dist
    • the .dist file to generate
  • vg minimizer -o S288C.shortread.withzip.min
    • the .min file to generate

6.4 Read Mapping

6.4.1 Pipeline

  1. Mapping reads
  2. Surjecting to a reference

6.4.2 Graph Alignment/Map (GAM) Format

https://github.com/vgteam/vg/wiki/File-Formats

  • Analogous to BAM, but for graphs
  • Binary file describing where reads mapped to in the graph structure
  • Uncompressed has one read per line
  • Can be converted to JSON for manual parsing (very inefficient!)

6.4.3 Read Mapping

Map reads (9min):

vg giraffe -p -Z S288C.giraffe.gbz -m S288C.shortread.withzip.min -d S288C.dist -f yprp/reads/SK1.illumina.fastq.gz -t 10 > S288C.SK1.illumina.gam
  • -p
    • show progress
  • -Z S288C.giraffe.gbz
    • the Graph Burrows-Wheeler Transform to use
  • -m S288C.shortread.withzip.min
    • the minimizer index to use
  • -d S288C.dist
    • the distance index to use
  • -f yprp/reads/SK1.illumina.fastq.gz
    • the reads to map
  • -t 10
    • number of threads to use

Computing mapping stats (<1min):

vg stats -a S288C.SK1.illumina.gam
  • -a
    • input is an alignment (GAM) file

6.4.4 Bringing Alignments Back to Single Genomes

Map reads (1min):

vg surject -x S288C.giraffe.gbz -b -t 10 S288C.SK1.illumina.gam > S288C.SK1.illumina.BAM
  • -x S288C.giraffe.gbz
    • graph or xg to use
  • -b
    • BAM output
  • -t 10
    • use 10 threads

Exercise:

  1. Rename chromosomes in BED for use with IGV
  2. Copy BED to personal computer
  3. Spend some time looking at BED in IGV and share any interesting observations you may have?
  • What’s up with CUP1?

6.4.5 Preparing the BAM IGV (or other genome viewer)

Only necessary if genome viewer can’t match UCSC style names names in BAM to chromosomes in reference.

  1. Convert BAM to SAM (<1min):
samtools view -h -o S288C.SK1.illumina.sam S288C.SK1.illumina.BAM
  1. Remove assembly name (<1min):
sed 's/S288C.//' S288C.SK1.illumina.sam > S288C.SK1.illumina.renamed.sam
  1. Convert SAM to BAM (6min):
samtools view -bS S288C.SK1.illumina.renamed.sam > S288C.SK1.illumina.renamed.bam

<===

  1. Sort the bam file (6min):
samtools sort -o S288C.SK1.illumina.renamed.sorted.bam S288C.SK1.illumina.renamed.bam
  1. Index BAM (<1min):
samtools index S288C.SK1.illumina.renamed.sorted.bam

6.5 Calling Graph Supported Variants

6.5.1 Pipeline

  1. Convert to Packed-Graph
  2. Generate pack file
  3. Generate snarls file
  4. Call variants

6.5.2 Packed-Graph Format

https://github.com/vgteam/vg/wiki/File-Formats

  • A memory-efficient, dynamic graph representation
  • The format isn’t actually documented…

6.5.3 Pack (pileup support) Format

https://github.com/vgteam/vg/wiki/File-Formats

  • Binary file
  • Computes pileup support for each read in a mapping
  • The format isn’t actually documented…

6.5.4 Snarls Format

https://github.com/vgteam/vg/wiki/File-Formats

  • Computes nested cyclic structure in the graph
  • The format isn’t actually documented…

6.5.5 Calling Graph-Supported Variants

Convert the graph into a packed graph (<1min):

vg convert S288C.vg -p > S288C.pg
  • -p
    • the output should be a packed-graph

Compute read support for variation already in the graph (1min):

vg pack -x S288C.pg -g S288C.SK1.illumina.gam -Q 5 -s 5 -o S288C.SK1.illumina.pack -t 10
  • -x S288C.pg
    • the graph file to use (.gfa, .gbz, .xg, etc)
  • -Q 5
    • ignore mapping and base quality < 5
  • -s 5
    • ignore the first and last 5pb of each read
  • -o S288C.SK1.illumina.pack
    • the output pack file
  • -t 10
    • use 10 threads

Compute snarls in graph structure (<1min):

vg snarls S288C.pg -t 10 > S288C.snarls
  • S288C.pg
    • the graph file to use (.gfa, .gbz, .xg, etc)
  • -t 10
    • use 10 threads

Generate a VCF from the support (<1min):

vg call S288C.pg -k S288C.SK1.illumina.pack -r S288C.snarls -t 10 > S288C.SK1.illumina.graph_calls.vcf
  • S288C.pg
    • the graph file to use (.gfa, .gbz, .xg, etc)

6.6 Calling Novel Variants

6.6.1 Pipeline

  1. Augment packed-graph
  2. Run variant calling pipeline on augmented graph

6.6.2 Calling Variants

Augment the packed graph with the mapped reads (30min):

vg augment S288C.pg S288C.SK1.illumina.gam -m 4 -q 5 -Q 5 -A S288C.SK1.illumina.aug.gam -t 10 > S288C.SK1.illumina.aug.pg
  • -m 4
    • filter breakpoints with coverage less than 4
  • -q 5
    • filter bases with quality less than 5
  • -Q 5
    • filter mappings with quality less than 5
  • -A S288C.SK1.illumina.aug.gam
    • output the read mappings relative to the augmented graph
  • -t 10
    • use 10 threads
  • > S288C.SK1.illumina.aug.pg
    • write the standard out (the augmented graph) into S288C.SK1.illumina.aug.pg

Compute read support for the augmented graph (2min):

vg pack -x S288C.SK1.illumina.aug.pg -g S288C.SK1.illumina.aug.gam -Q 5 -s 5 -o S288C.SK1.illumina.aug.pack -t 10

Compute snarls in the augmented graph structure (<1min):

vg snarls S288C.SK1.illumina.aug.pg -t 10 > S288C.SK1.illumina.aug.snarls

Generate a VCF from the support (3min)

vg call S288C.SK1.illumina.aug.pg -k S288C.SK1.illumina.aug.pack -r S288C.SK1.illumina.aug.snarls -t 10 > S288C.SK1.illumina.aug_calls.vcf

6.7 Calling Variants from Graph Structure

6.7.1 Pipeline

  1. Deconstruct graph

6.8 Calling Variants Already in the Graph

Output variants used to construct graph (<1min):

vg deconstruct S288C.pg -P S288C -t 10 > S288C.deconstruct.vcf
  • -P S288C
    • report variants relative to paths with names that start with S288C

S288C.deconstruct.vcf might not be identical to S288C.vcf because VG takes liberties with variants when constructing the graph.

6.9 Pros and Cons Reference Graphs

Pros:

  • Graphs are relatively small
    • Easy to inspect visually
    • Good for low-quality / fragmented assemblies
    • Shorter run-times

Cons:

  • Variation represented is determined by VCF pipeline
    • Only describes variation relative to the reference
    • Suboptimal mappings
    • Suboptimal population structure

6.10 Additional vg Commands that are Useful

6.10.1 vg combine

Combines two or more graphs into one.

  • Use to construct graph in parts (e.g. individual chromosomes) and then combine
  • Use to combine graphs of related species

6.10.2 Modifying and Simplifying Graphs

vg chunk - Splits graphs or alignments into smaller chunks.

vg mod - Used to modify graphs.

  • We’ve already used vg mod -u to “unchop” nodes.

vg simplifying - Simplifies a graph.

vg find - Used to identify subgraphs based on various metrics and parameters.

  • e.g. anchored by a region on a specific chromosome.

Use –help with any command to learn how to use it.

6.11 Drawing Graphs

vg view - Outputs graphs and other structures for drawing.

  • We’ve already used vg view -g to convert from vg to gfa for viewing in Bandage.
  • Can also output dot (among other files types) which is a more general graph drawing format (compatible with Graphviz).

vg viz - Basic graph drawing (documentation).

  • Recommend only using on small (sub)graphs.
  • Use in conjunction with chunk, mod, and/or find to draw an interesting subgraph.
  • You can include mapped reads in the visualization!