Chapter 5 Introduction to Pangenomics

5.1 What is a “pangenome”?

The term “pangenome” was first coined by Sigaux (2000) to describe a public database containing an assessment of genome and transcriptome alterations in major types of tumors, tissues, and experimental models.

  • Sigaux F. Génome du cancer ou de la construction des cartes d’identité moléculaire des tumeurs [Cancer genome or the development of molecular portraits of tumors]. Bull Acad Natl Med. 2000;184(7):1441-7; discussion 1448-9. French. PMID: 11261250.

The term was later revitalized by Tettelin et al. to describe a microbial genome in terms of which genes were in the core (present in all strains) and which genes were dispensable (missing from one or more of the strains).

  • Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., … & Fraser, C. M. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proceedings of the National Academy of Sciences, 102(39), 13950-13955.

5.1.1 Open vs. Closed Genomes

Open and Closed Pangenomes: https://en.wikipedia.org/wiki/Pan-genome

5.1.2 Then vs. Now

  • Low Cost
  • High Quality Long Reads (HiFi)
  • Many reference-quality assemblies per species

5.1.3 “Pangenome” Today

“Any collection of genomic sequences to be analyzed jointly or to be used as a reference. These sequences can be linked in a graph-like structure, or simply constitute sets of (aligned or unaligned) sequences.” – Computational Pangenomics Consortium

https://academic.oup.com/bib/article/19/1/118/2566735

5.1.4 The Benefit of Pangenomes

  • Removes reference bias
    • May only represent one organism
    • Could be a “mosaic” of individuals, i.e. doesn’t represent a coherent haplotype
    • Allele bias
    • Doesn’t include common variation
  • Allows multiple assemblies to be analyzed simultaneously, i.e. efficiently

5.1.5 What are pangenomes good for?

  • Core vs dispensable genes:
    • How big is the core?
    • How big is the dispensable?
    • How big is the pangenome?
    • What traits are associated with the core/dispensable?
  • Unbiased read mapping and variant calling
  • More robust variation-trait association
  • Visual exploration of genomic structure of population
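
The core/dispensable questions above reduce to set operations once each genome is summarized as a set of gene families. The following is a minimal sketch under that assumption; the strain names, gene identifiers, and the `partition_pangenome` helper are hypothetical and are not taken from any particular tool.

```python
# Hypothetical presence/absence data: strain name -> set of gene families.
# All names here are made up for illustration.
strains = {
    "strainA": {"g1", "g2", "g3", "g4"},
    "strainB": {"g1", "g2", "g4", "g5"},
    "strainC": {"g1", "g2", "g6"},
}


def partition_pangenome(gene_sets):
    """Return (core, dispensable, pangenome) for a non-empty dict of gene sets."""
    sets = list(gene_sets.values())
    pangenome = set().union(*sets)      # genes present in at least one strain
    core = set.intersection(*sets)      # genes present in every strain
    dispensable = pangenome - core      # genes missing from one or more strains
    return core, dispensable, pangenome


core, dispensable, pangenome = partition_pangenome(strains)
print("core:", len(core))
print("dispensable:", len(dispensable))
print("pangenome:", len(pangenome))
```

In practice the hard part is deciding which genes from different assemblies belong to the same family; this sketch assumes that clustering has already been done.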

5.2 Computational Pangenomics

“Questions about efficient data structures, algorithms and statistical methods to perform bioinformatic analyses of pan-genomes give rise to the discipline of ‘computational pan-genomics’.”

5.2.1 Pangenome Representations

  • Gene sets
  • Multiple sequence alignments
  • K-mer sets
  • Graphs
    • De Bruijn graphs
    • Haplotype graphs
    • Variation graphs

5.2.2 Variation Graphs

  • Variation forms bubbles
  • Nodes represent sequences
  • Chains of nodes represent contiguous sequence in one or more assemblies
  • The sequences of nodes connected by an edge may overlap
  • Graphs can be acyclic or cyclic
  • Haplotypes are “threaded” through graph as paths
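
To make the properties above concrete, here is a small sketch of a variation graph holding a single SNP bubble. It assumes the simplest case, in which node sequences do not overlap and the graph is acyclic; the node IDs, sequences, haplotype names, and the `path_sequence` helper are all hypothetical.

```python
# Nodes: id -> sequence. Nodes 2 and 3 are the two alleles of a SNP,
# so together with nodes 1 and 4 they form a bubble.
nodes = {1: "ACG", 2: "C", 3: "T", 4: "GGA"}

# Directed edges between node ids (no sequence overlap assumed).
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Haplotypes are threaded through the graph as paths (ordered node ids).
paths = {"hapA": [1, 2, 4], "hapB": [1, 3, 4]}


def path_sequence(path):
    """Spell out the sequence of a path, checking that every step is an edge."""
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges:
            raise ValueError(f"no edge from node {a} to node {b}")
    return "".join(nodes[n] for n in path)


for name, path in paths.items():
    print(name, path_sequence(path))
```

Both paths share nodes 1 and 4 and differ only inside the bubble, which is how a graph stores the shared sequence once while still representing each haplotype in full.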

5.2.3 Types of Variation Graphs

  1. Reference Graph (vg)
  2. Reference Backbone; “iterative” (minigraph)
    • Graph starts as reference and other sequences are layered on, i.e. variants can be relative to sequences other than the reference
  3. Reference-Free (Cactus and pggb)
    • Graph is built using non-reference techniques, such as multiple sequence alignment

These are all methods used by the Human Pangenome Reference Consortium.

5.2.4 Mapping Reads to Variation Graphs

5.3 Pangenome Data Sets

5.3.1 Data/Yeast Genomes:

5.3.2 Yeast Assemblies

5.3.3 SK1 Illumina Reads

SK1 is the most distant from S288C

5.3.4 CUP1 Gene

Structural Rearrangements: https://www.nature.com/articles/ng.3847
  • CUP1 - A gene involved in heavy metal (copper) tolerance with copy-number variation (CNV) in population.
  • YHR054C - Putative protein of unknown function.

5.3.5 We Changed the Names