Chapter 11 Reference-Free Graphs with pggb

11.1 pggb

All-pairs whole genome alignment
Induces a graph from the alignments

11.2 PanGenome Graph Builder

pggb is built on the idea that a pangenome graph represents an alignment of the genomes in the graph (this is literally what Cactus does), but infers the graph from all pairwise alignments instead of a multiple alignment.

pggb computes all pairwise alignments efficiently by focusing on long, colinear homologies, instead of using the more traditional k-mer matching alignment approach.

Critically, pggb performs graph normalization to ensure that paths through the graph (e.g. chromosomes) have a linear structure while allowing for cyclic graph structures that capture structural variation.

Input Genomes

11.3 Reference-Free Graphs

https://academic.oup.com/bioinformatics/article/30/24/3476/2422268

Input Genomes

11.4 pggb Algorithm

All-pairs align genomes with wfmash
Convert alignments into a graph using seqwish
Progressively normalize graph with smoothxg and gfaffix

11.5 Pipeline

Prepare the input
pggb
View with Bandage

11.5.1 Set up Directories

Make sure you’re working in a screen
Make Directory

mkdir ~/pggb

Navigate to the Directory

cd ~/pggb

Link to data

ln -s /home/data/pangenomics-2503/yprp/ .

11.6 Yeast Data

Using all 12 YPRP assemblies

11.7 Prepare the Input

Create a FASTA file containing all the yprp assemblies. Call it yprp.all.fa.

cat yprp/assemblies/*.fa > yprp.all.fa

Create a FASTA file containing chromosome VIII from every assembly. Call it yprp.chrVIII.fa.

awk special variables:
FS = Field Separator
RS = Record Separator

awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fa"; print ">" $0 > fnme; close(fnme);}' yprp/assemblies/*.genome.fa

cat *.chrVIII.fa > yprp.chrVIII.fa

rm `ls | grep -v yprp`

How can you confirm that the contents of these files is correct?
Compress the files

bgzip the FASTA files

bgzip -c yprp.all.fa > yprp.all.fa.gz
bgzip -c yprp.chrVIII.fa > yprp.chrVIII.fa.gz

-c
- output the bgzipped file to standard output
>
- redirect the standard output into a file

Index the bgzip files with samtools faidx:

samtools faidx yprp.all.fa.gz
samtools faidx yprp.chrVIII.fa.gz

11.8 Running pggb on Chromosome VIII

Build a graph containing all the yprp assemblies (2min):

singularity run --no-privs /home/data/pangenomics-2503/sif/pggb.sif pggb build -i yprp.chrVIII.fa.gz -o output_chrVIII -n 12 -t 20 -p 95 -s 5000

-i yprp.chrVIII.fa
- an input FASTA containing all sequences
-o output_chrVIIII
- the directory where all output files should be placed
-n 12
- the number of haplotypes (assemblies) in the input file
-t 20
- the number of threads to use
-p 95
- minimum sequence identity of alignment segments
-s 5000
- nucleotide segment length when scaffolding the graph

NOTE: These arguments were taken from the pggb paper. Refer to the paper for other species.

Create a copy of the output graph with a simpler name

cp output_chrVIII/yprp.chrVIII.fa.gz.*.smooth.final.gfa yprp.chrVIII.pggb.gfa

11.9 Viewing with Bandage

View your Chromsome VIII graph with Bandage (exercise):

Find CUP1 and YHR054C
Take a screenshot

pggb graphs are very detailed and it will take a long time to view all of Chromosome VIII in Bandage. You can load the whole graph but then only draw around the CUP1 genes.

Other graphs might be larger, so you can also make a subgraph to load into bandage.

Convert the GFA to VG (PackedGraph format)

vg convert -p yprp.chrVIII.pggb.gfa > yprp.chrVIII.pggb.pack.vg

-p
- output in PackedGraph format

Create a subgraph only containing the CUP1 region.

vg chunk -p S288C.chrVIII:212000-230000 -x yprp.chrVIII.pggb.pack.vg -c 10 > yprp.chrVIII.pggb.cup1_chunk.vg

-p S288C.chrVIII:212000-230000
- the path (region) to build the chunk from
-x yprp.chrVIII.pggb.vg
- the graph to chunk
-c 10
- expand the context of the chunk by 10 node steps

Convert the subgraph into a GFA for viewing (<1min):

vg view + -g + gfa output

vg view -g yprp.chrVIII.pggb.cup1_chunk.vg > yprp.chrVIII.pggb.cup1_chunk.gfa

Note that you can also use vg convert to get an identical gfa graph.

vg convert + -f + gfa output

vg convert -f yprp.chrVIII.pggb.cup1_chunk.vg > yprp.chrVIII.pggb.cup1_chunk.gfa

View your Chromsome VIII chunk graph with Bandage (exercise):

Find CUP1 and YHR054C
Take a screenshot

11.10 Exercises

Use vg to index the chromosome VIII graph
Use vg to map SK1 reads to the chromosome VIII graph
Use vg to call variants on chromosome VIII read mapping GAMS

Click for Answer

Graph-supported variants


# Index graph
vg autoindex --workflow giraffe -g yprp.chrVIII.pggb.gfa -p yprp.chrVIII.pggb

# Align reads
vg giraffe -p -Z yprp.chrVIII.pggb.giraffe.gbz -m yprp.chrVIII.pggb.min -d yprp.chrVIII.pggb.dist -f yprp/reads/SK1.illumina.fastq.gz > yprp.chrVIII.pggb.gam

# Get alignment statistics
vg stats -a yprp.chrVIII.pggb.gam

# Calculate Read support
vg pack -x yprp.chrVIII.pggb.giraffe.gbz -g yprp.chrVIII.pggb.gam -Q 5 -s 5 -o yprp.chrVIII.pggb.pack -t 4

# Call variants
vg call -k yprp.chrVIII.pggb.pack -t 4 yprp.chrVIII.pggb.giraffe.gbz > yprp.chrVIII.pggb.vcf

Including novel variants from the reads


# Convert graph to a format that we can change
vg convert yprp.chrVIII.pggb.giraffe.gbz > yprp.chrVIII.pggb.giraffe.vg

# Augment the graph with the variation from the reads
vg augment yprp.chrVIII.pggb.giraffe.vg yprp.chrVIII.pggb.gam -A yprp.chrVIII.pggb.aug.gam -t 4 > yprp.chrVIII.pggb.aug.vg

# Index the graph
vg index -t 4 -x yprp.chrVIII.pggb.aug.xg yprp.chrVIII.pggb.aug.vg

# Calculate read support
vg pack -x yprp.chrVIII.pggb.aug.xg -g yprp.chrVIII.pggb.aug.gam -Q5 s-5 -o yprp.chrVIII.pggb.aug.pack -t 4

# Call variants
vg call yprp.chrVIII.pggb.aug.xg -k yprp.chrVIII.pggb.aug.pack -t4 > yprp.chrVIII.pggb.aug.vcf

11.11 Running pggb on all Chromosomes

Partition the sequences before building the graph

singularity run --no-privs /home/data/pangenomics-2503/sif/pggb.sif partition-before-pggb -i yprp.all.fa.gz -o output_all -n 12 -t 20 -p 95 -s 5000

This partitions the input FASTA into smaller FASTAs containing sequences that should be in the same subgraph:

Will likely correspond to chromosomes if you have complete assemblies
May improve run-time of normalization step and make downstream analysis easier
Consider skipping if your assemblies/organism has complex structure you want represented in the graph, e.g. polyploidy, translocations, etc.

This will print a pggb command for every partition to the command line.

Commands are also written to a log file: output_all/yprp.all.fa.gz.<gibberish>.log

11.11.1 Exercise

Make a bash script to run all of the generated commands

sed -n '/pggb -i output_all/,$p' output_all/*.log > ./run-pggb-partitions.sh
chmod +x run-pggb-partitions.sh

This bash script would be ready to run if we weren’t using singularity to run pggb. Because we are, we need to add in the singularity call at the beginning of each line so that pggb is run in the container.

sed -i "s/^pggb/singularity run --no-privs \/home\/data\/pangenomics-2503\/sif\/pggb.sif pggb/" run-pggb-partitions.sh

We also need to fix a bug. This version of partition-before-pggb prints out two -p parameters. One of the should be -P (capital letter). Let’s fix that.

sed -i 's/-p 1/-P 1/' run-pggb-partitions.sh

Run the command NOT DURING THE WORKSHOP to build the graphs (~30min)

./run-pggb-partitions.sh

11.12 Pros and Cons

Pros:

Graphs are truly reference-free
Includes normalization step
- Especially good for species with complex structural variation, such as plants

Cons:

Normalization step is currently a bottleneck
Partitioning removes inter-chromosome structural variation
Graphs contain lots of variation
- Makes the files large
- Makes them hard to view with Bandage