Chapter 5 Introduction to Pangenomics

5.1 What is a “pangenome”?

The term “pangenome” was coined by Sigaux (2000) to describe a public database containing an assessment of genome and transcriptome alterations in major types of tumors, tissues, and experimental models.

Sigaux F. Génome du cancer ou de la construction des cartes d’identité moléculaire des tumeurs [Cancer genome or the development of molecular portraits of tumors]. Bull Acad Natl Med. 2000;184(7):1441-7; discussion 1448-9. French. PMID: 11261250.

The term was later revived by Tettelin et al. (2005) to describe a set of microbial genomes in terms of which genes were core (present in all strains) and which were dispensable (missing from one or more of the strains).

Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., … & Fraser, C. M. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proceedings of the National Academy of Sciences, 102(39), 13950-13955.

Pangenome: https://en.wikipedia.org/wiki/Pan-genome

5.1.1 Open vs. Closed Genomes

Open and Closed Pangenomes: https://en.wikipedia.org/wiki/Pan-genome

5.1.2 Then vs. Now

Cost per Genome: https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data

Low Cost
High Quality Long Reads (HiFi)
Many reference-quality assemblies per species

Pangenome Publications: https://www.nature.com/articles/s41477-020-0733-0

5.1.3 “Pangenome” Today

“Any collection of genomic sequences to be analyzed jointly or to be used as a reference. These sequences can be linked in a graph-like structure, or simply constitute sets of (aligned or unaligned) sequences.” – Computational Pangenomics Consortium

https://academic.oup.com/bib/article/19/1/118/2566735

5.1.4 The Benefit of Pangenomes

Removes reference bias; a single reference genome:
- May only represent one organism
- Could be a “mosaic” of individuals, i.e. doesn’t represent a coherent haplotype
- Introduces allele bias
- Doesn’t include common variation
Allows multiple assemblies to be analyzed simultaneously, i.e. efficiently

5.1.5 What are pangenomes good for?

Core vs dispensable genes:
- How big is the core?
- How big is the dispensable?
- How big is the pangenome?
- What traits are associated with the core/dispensable?
Unbiased read mapping and variant calling
More robust variation-trait association
Visual exploration of genomic structure of population
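
The first three questions above reduce to set operations once gene presence/absence is known for each strain. The following is a minimal sketch, assuming gene families have already been assigned to strains; the strain and gene names are invented for illustration and do not come from any real data set.

```python
# Hypothetical gene presence/absence data: strain -> set of gene families.
# All names here are made up for illustration only.
strain_genes = {
    "strainA": {"g1", "g2", "g3", "g4"},
    "strainB": {"g1", "g2", "g3", "g5"},
    "strainC": {"g1", "g2", "g6"},
}


def partition_pangenome(strain_genes):
    """Split gene families into pangenome, core, and dispensable sets."""
    gene_sets = list(strain_genes.values())
    pangenome = set().union(*gene_sets)   # present in at least one strain
    core = set.intersection(*gene_sets)   # present in every strain
    dispensable = pangenome - core        # missing from one or more strains
    return pangenome, core, dispensable


pangenome, core, dispensable = partition_pangenome(strain_genes)
print(len(pangenome), len(core), len(dispensable))  # prints: 6 2 4
```

In practice the hard part is deciding which genes in different strains count as the same gene family; this sketch assumes that step has been done.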

5.2 Computational Pangenomics

“Questions about efficient data structures, algorithms and statistical methods to perform bioinformatic analyses of pan-genomes give rise to the discipline of ‘computational pan-genomics’.”

Computational Pangenomics: https://academic.oup.com/bib/article/19/1/118/2566735

5.2.1 Pangenome Representations

Gene sets
Multiple sequence alignments
K-mer sets
Graphs
- De Bruijn graphs
- Haplotype graphs
- Variation graphs

5.2.2 Variation Graphs

Variation forms bubbles
Nodes represent sequences
Chains of nodes represent contiguous sequence in one or more assemblies
The sequences of nodes connected by an edge may overlap
Graphs can be acyclic or cyclic
Haplotypes are “threaded” through graph as paths
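
The properties above can be illustrated with a toy variation graph. This is a minimal sketch rather than the data model of any particular tool: the node IDs, sequences, and path names are invented, and it assumes node sequences do not overlap, so a path's sequence is the plain concatenation of its nodes.

```python
# Toy variation graph; all IDs, sequences, and path names are hypothetical.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}

# Directed edges. Nodes 2 and 3 are alternative routes between 1 and 4,
# so together they form a bubble (here, a single-base variant).
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Each haplotype is "threaded" through the graph as a path of node IDs.
paths = {
    "hapA": [1, 2, 4],
    "hapB": [1, 3, 4],
}


def path_sequence(path):
    """Spell out a path's sequence, checking that each step follows an edge."""
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges:
            raise ValueError(f"no edge {a}->{b}")
    # Assumes non-overlapping node sequences (simple concatenation).
    return "".join(nodes[n] for n in path)


for name, path in paths.items():
    print(name, path_sequence(path))
# prints:
# hapA ACGTATTCA
# hapB ACGTGTTCA
```

Nodes 1 and 4 are shared by both haplotypes, while the bubble records where they differ.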

Pangenome Representations: https://academic.oup.com/bib/article/19/1/118/2566735

5.2.3 Types of Variation Graphs

Reference Graph (vg)

A reference with variants
e.g. the human reference now includes a VCF with common variation

Reference Backbone; “iterative” (minigraph)

Graph starts as reference and other sequences are layered on, i.e. variants can be relative to sequences other than the reference

Reference-Free (Cactus and pggb)

Graph is built using non-reference techniques, such as multiple sequence alignment

These are all methods used by the Human Pangenome Reference Consortium

5.2.4 Mapping Reads to Variation Graphs

Genotyping Variation: https://link.springer.com/article/10.1186/s13059-020-1941-7

5.3 Pangenome Data Sets

5.3.1 Data/Yeast Genomes:

Yeast Genomes: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/

12 Mb
16 chromosomes
12 strains from Yeast Population Reference Panel (YPRP)
- 7 Saccharomyces cerevisiae (brewer’s yeast)
  - Includes S288C reference
- 5 Saccharomyces paradoxus (wild yeast)
Manuscript
Software (LRSDAY)
- Manuscript
- GitHub

5.3.2 Yeast Assemblies

YPRP: 12 Yeast PacBio Assemblies (Chromosome level)
- ~100-200x PacBio sequencing reads
- HGAP + Quiver polishing
- ~200-500x Illumina (Pilon correction)
- Manual curation
- Annotation

5.3.3 SK1 Illumina Reads

SK1 is the most distant from S288C

Yeast Genomes: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/

5.3.4 CUP1 Gene

Structural Rearrangements: https://www.nature.com/articles/ng.3847

CUP1 - A gene involved in heavy metal (copper) tolerance with copy-number variation (CNV) in the population.
YHR054C - Putative protein of unknown function.

5.3.5 We Changed the Names

YPRP FASTA files only contain chromosome names
We prefixed every chromosome name with its assembly name and a “.” delimiter
- e.g. S288C.chrVIII
Pangenome Sequence Naming Specification
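
The renaming described above can be sketched in a few lines. This is an illustrative sketch, not the script actually used to prepare the data: the function name and the in-memory example are assumptions, and only the "assembly name plus “.”" convention comes from the text.

```python
import io


def prefix_fasta_headers(lines, assembly, delimiter="."):
    """Yield FASTA lines with each sequence name prefixed by the assembly name.

    Hypothetical helper for illustration; header lines start with ">".
    """
    for line in lines:
        if line.startswith(">"):
            yield f">{assembly}{delimiter}{line[1:]}"
        else:
            yield line


# A tiny in-memory stand-in for a FASTA file that contains only chromosome names.
example = io.StringIO(">chrVIII\nACGT\n>chrIX\nTTGA\n")
renamed = "".join(prefix_fasta_headers(example, "S288C"))
print(renamed, end="")
# prints:
# >S288C.chrVIII
# ACGT
# >S288C.chrIX
# TTGA
```

Prefixing keeps sequence names unique once several assemblies are combined into one file, since every assembly otherwise has its own chrVIII.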