2 Biological Context

Where does the sequencing data come from? Once we have analysed our sequencing data, how does it translate back to cells, tissues, and organisms?

2.1 What is Single Cell RNA Seq (scRNA-seq)?

To understand what single cell RNA Seq is, we can compare and contrast with Bulk RNA Seq. Let’s walk through the bulk RNA Seq workflow below:

Bulk RNASeq

  • An estimation of the average expression level for each gene within a population of cells
  • Population-level resolution
  • RNA from every cell in the sample is mixed together for sequencing
  • Expression levels from every cell type are lumped together

Now let’s walk through the Single Cell RNA Seq workflow:

Single Cell RNASeq

First Isolate Cells

  • Emphasizes the differences and variability between individual cells
  • Individual-level resolution
  • RNA from each cell is captured and labelled separately, so every read can be traced back to one cell
  • Expression levels from individual cell types are separated
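The contrast between the two lists above can be illustrated with a toy calculation. This is only a sketch with made-up counts for a single gene and two hypothetical cell types (A and B); it is not part of any analysis tool. The bulk-like view collapses all cells into one average, while the single-cell view keeps the values apart.

```python
from statistics import mean

# Made-up expression counts for one gene in six cells (illustrative only).
# Cells 1-3 belong to a hypothetical type A, cells 4-6 to a hypothetical type B.
cells = {
    "cell_1": ("A", 0), "cell_2": ("A", 1), "cell_3": ("A", 0),
    "cell_4": ("B", 20), "cell_5": ("B", 22), "cell_6": ("B", 19),
}

# Bulk-like view: one average over every cell in the sample.
bulk_average = mean(count for _, count in cells.values())

# Single-cell view: expression is kept per cell, so it can be summarized per type.
per_type = {}
for cell_type, count in cells.values():
    per_type.setdefault(cell_type, []).append(count)
per_type_average = {t: mean(v) for t, v in per_type.items()}

print(bulk_average)
print(per_type_average)
```

The bulk average sits between the two groups and describes neither of them well, which is the "lumped together" effect described for bulk RNA Seq.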

2.1.1 How is a single tiny cell isolated from other cells in a sample?

Cell Isolation Example

There are many technologies available commercially to perform single cell sorting.

Cell Sorting Tools

A common protocol comes from a company called 10x Genomics. Below is an overview of the cell sorting options available to researchers just to give you a quick impression of the variety of approaches.

Cell Sorting Tools

2.1.2 Once the single cells are isolated, how do we keep track of them? How do we know which cells belong to which sequences?

Droplets and UMIs

Cell Sorting Tools

What does the UMI do?

UMIs are oligonucleotides consisting of random bases.

  • The library contains far more UMIs than target DNA molecules, so that each target DNA molecule gets tagged with a unique UMI
  • Target DNA molecule + UMI gets carried forward to amplification
  • Reads can later be grouped by UMI to generate a consensus sequence from all of the reads that contain one specific UMI
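Grouping reads by UMI is also what allows amplification duplicates to be collapsed when molecules are counted. The following is a minimal sketch of that idea, assuming reads have already been assigned a cell barcode, a UMI, and a gene; the barcodes, UMIs, and gene names are made up, and real pipelines additionally handle sequencing errors in UMIs, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical aligned reads: (cell barcode, UMI, gene). All values are made up.
reads = [
    ("AAAC", "GTTA", "GeneX"),
    ("AAAC", "GTTA", "GeneX"),  # amplification duplicate of the read above (same UMI)
    ("AAAC", "CCGT", "GeneX"),  # a second original molecule of GeneX
    ("AAAC", "TTAG", "GeneY"),
    ("TTTG", "GTTA", "GeneX"),  # same UMI but a different cell barcode
]

# Collect the set of distinct UMIs seen for every (barcode, gene) pair.
umis_seen = defaultdict(set)
for barcode, umi, gene in reads:
    umis_seen[(barcode, gene)].add(umi)

# The molecule count is the number of distinct UMIs, not the number of reads.
counts = {key: len(umis) for key, umis in umis_seen.items()}
print(counts)
```

Three reads of GeneX in cell AAAC collapse to two molecules, because two of the reads share a UMI and are therefore treated as copies of the same original molecule.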

https://www.youtube.com/watch?v=gMzzXFMyM5g

UMIs

What does the Barcode do?

What does the Poly dT do?

2.2 Analysing Sequencing Data

Biological Context for interpreting scRNA-Seq Data and Data Analysis

This is an introduction to concepts. The purpose is to familiarize you with the steps in an scRNA-Seq analysis workflow. Above all, I hope that you come to understand what the purpose of each data processing step is.

We are going to approach our learning using a journal-club model. I see that all of you are researchers and are likely very familiar with academic research publications and journal clubs. We will dissect a review paper from 2019 published in Molecular Systems Biology, titled “Current best practices in single cell RNA-seq analysis: a tutorial.”

This paper is paired with a tutorial freely available on GitHub. It is not the same as the tutorials that we will cover in the course, so it provides an opportunity for extra practice as homework during the week of the workshop or afterward.

Luecken, M. D., & Theis, F. J. (2019). Current best practices in single‐cell RNA‐seq analysis: a tutorial. Molecular systems biology, 15(6), e8746.

https://www.github.com/theislab/single-cell-tutorial.

2.2.1 Objectives:

1.) scRNA-Seq Workflow:

  • To be able to recognize, select, and design an scRNA-Seq workflow

2.) Pre Processing and Visualization:

3.) Quality Control:

  • To distinguish between count depth per cell, genes detected per cell, count depth distribution, and mitochondrial read fractions.

  • To be able to interpret quality control metrics

4.) Normalization

5.) Data correction and integration

  • Regressing out biological effects

  • Regressing out technical effects

  • Batch effects and data integration

  • Expression recovery

6.) Feature selection, dimensionality reduction and visualization

  • Feature Selection

  • Dimensionality reduction

  • Visualization

7.) Stages of Pre-Processed data

  • Five stages of data processing

  • Three pre-processing layers

  • Downstream analysis

8.) Clustering analysis

  • Clustering

  • Cluster Annotation

  • Compositional Analysis

9.) Trajectory analysis

  • Trajectory inference

  • Gene expression dynamics

  • Metastable states

10.) Cell Level Analysis unification

11.) Gene-level analysis

  • Differential Expression testing

  • Gene set analysis

  • Gene regulatory networks

Keep an eye out for references to the following analysis tools, because you will be using them later on in the workshop:

Cell Ranger

Seurat

2.2.2 scRNA-Seq Workflow:

To be able to recognize, select, and design an scRNA-Seq workflow to answer your own question.

Analysis Workflow

Questions:

1.) To get from a cluster to an annotated cluster, what do we need to match up to the cluster data? See Figure 1.

Answer: Marker Identifiers

2.) The barcode details listed below indicate that a cell has a broken membrane. Why?

  • Barcodes with a low count depth

  • few detected genes

  • high fraction of mitochondrial counts

Answer: When the cell membrane is broken, cytoplasmic mRNA leaks out of the cell, so most of the mRNA that remains to be captured is the mRNA enclosed in the mitochondria.

Keep in mind that there are always other explanations to consider. For example, cells involved in respiratory processes can ALSO have a high fraction of mitochondrial counts.

2.2.3 Processing and Visualization

1.) Raw Data

2.) Processed data

  • molecular counts (count matrices)

  • read counts (read matrices)

  • UMIs

  • Cell Ranger

Quality Control (QC)

Assigning reads to cellular barcodes (demultiplexing)

Assigning reads to mRNA molecules of origin (demultiplexing)

genome alignment

quantification

3.) Doublets

  • Mistake whereby a barcode may tag multiple cells instead of one

  • like 2 cells in a droplet

4.) Empty Droplet/Well

  • Mistake where no cells are tagged

  • 0 cells in droplet

2.2.4 Quality Control:

To distinguish between count depth per cell, genes detected per cell, count depth distribution, and mitochondrial read fractions. See Figure 2.

We will be regularly referring to “covariates” so it is important to keep in mind what a covariate is.

With respect to quality control in scRNA-seq, covariates are:

  • counts per barcode (count depth)
  • genes per barcode
  • fractions of counts from mitochondrial genes

Consider all three QC covariates jointly rather than in isolation, and cross-reference the results, because the covariates depend on one another.
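The three QC covariates can be computed directly from a count matrix. The sketch below uses a made-up toy matrix and assumes that mitochondrial genes can be recognised by an "MT-" prefix, a common convention for human gene symbols; the function name `qc_covariates` is hypothetical and only for illustration.

```python
# Toy count matrix: barcode -> {gene: UMI count}. All values are made up.
# Assumption: mitochondrial genes are recognised by an "MT-" prefix,
# a common convention for human gene symbols.
counts = {
    "barcode_1": {"ACTB": 50, "GAPDH": 30, "CD3E": 10, "MT-CO1": 10},
    "barcode_2": {"ACTB": 2, "MT-CO1": 5, "MT-ND1": 3},
}

def qc_covariates(gene_counts):
    """Return count depth, genes detected, and mitochondrial fraction."""
    count_depth = sum(gene_counts.values())
    genes_detected = sum(1 for c in gene_counts.values() if c > 0)
    mito = sum(c for g, c in gene_counts.items() if g.startswith("MT-"))
    mito_fraction = mito / count_depth if count_depth else 0.0
    return count_depth, genes_detected, mito_fraction

qc = {barcode: qc_covariates(genes) for barcode, genes in counts.items()}
for barcode, (depth, n_genes, mito_frac) in qc.items():
    print(barcode, depth, n_genes, round(mito_frac, 2))
```

In this toy data, barcode_2 combines a low count depth, few detected genes, and a high mitochondrial fraction. It is that combination, not any single value, that would make it a candidate for a damaged cell.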

Questions:

What about barcodes that have

  • unexpectedly high counts
  • large number of detected genes?

What might that indicate?

  • Answer: Could be doublets
  • A doublet is where multiple cells are captured together in a microfluidic droplet

What might a quiescent cell population’s barcode details look like?

  • Answer:
  • low counts
  • low genes

What might the QC covariates show for cells that are larger in size?

  • Answer:
  • high counts (count depth)

Should thresholds be set permissively?

  • Answer: yes
  • as permissively as possible

Why? - Don’t want to filter out viable cell populations.

What if you don’t filter?

  • Only the barcodes with the lowest count depth and the fewest genes per barcode should be considered non-viable

Quality Control

2.2.5 Normalization

With normalization we are trying to get the correct relative gene expression abundances between cells. We are also trying to get uniform variance to satisfy assumptions of downstream tools.

There are many different methods ranging from:

Shifted logarithm
+ Adjusts for a size factor and adds a pseudocount before taking the logarithm
+ Simple and fast

Analytic Pearson residuals
+ Tries to adjust for technical variation while preserving biological variation
+ Takes into account the count depth
+ Normalized values can be positive or negative (a negative value means lower counts than expected based on the average expression of the gene and the sequencing depth of the cell)
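Both methods can be sketched in a few lines on a toy count matrix. Several details here are assumptions made for illustration and are not stated in the text above: the size factor is taken as a cell's count depth divided by the mean count depth, the pseudocount is 1, and the Pearson residuals use a negative binomial variance with an overdispersion parameter `theta = 100` and are clipped to plus or minus the square root of the number of cells.

```python
import math

# Toy count matrix: rows are cells, columns are genes (made-up values).
counts = [
    [10, 0, 5],
    [20, 2, 8],
    [2, 1, 0],
]

def shifted_log(matrix, pseudocount=1.0):
    """Scale each cell by a size factor, add a pseudocount, take the log."""
    depths = [sum(row) for row in matrix]
    mean_depth = sum(depths) / len(depths)
    # Assumed size factor: the cell's count depth relative to the mean depth.
    size_factors = [d / mean_depth for d in depths]
    return [[math.log(x / s + pseudocount) for x in row]
            for row, s in zip(matrix, size_factors)]

def pearson_residuals(matrix, theta=100.0):
    """Residuals of observed counts against a depth-based expectation."""
    n_cells = len(matrix)
    cell_totals = [sum(row) for row in matrix]
    gene_totals = [sum(col) for col in zip(*matrix)]
    total = sum(cell_totals)
    clip = math.sqrt(n_cells)
    result = []
    for row, cell_total in zip(matrix, cell_totals):
        out_row = []
        for x, gene_total in zip(row, gene_totals):
            mu = cell_total * gene_total / total  # expected count
            r = (x - mu) / math.sqrt(mu + mu * mu / theta)
            out_row.append(max(-clip, min(clip, r)))  # clip extreme residuals
        result.append(out_row)
    return result

log_norm = shifted_log(counts)
residuals = pearson_residuals(counts)
print([[round(v, 2) for v in row] for row in log_norm])
print([[round(v, 2) for v in row] for row in residuals])
```

With the shifted logarithm a zero count stays at zero and all values are non-negative. With Pearson residuals a zero count becomes negative, because the expected count for that gene and cell is above zero.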

2.2.6 Feature Selection

You can get expression values for up to 25,000 genes in a human single-cell RNA-seq dataset.

Many genes won’t be informative for any given experimental inquiry. The goal is to obtain 500-2000 informative genes.

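Single cell RNA approaches produce a lot of dropouts (genes recorded as zero even though they are expressed) because each cell contains only a limited amount of RNA. Genes with zero or very low counts are removed, and the analysis focuses on genes that are highly variable across cells.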
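The two steps described above (dropping genes that are barely detected, then keeping the most variable ones) can be sketched as follows. The gene names and values are made up, the function `select_variable_genes` is hypothetical, and ranking by raw variance is a deliberate simplification: established tools typically account for the relationship between a gene's mean and its variance.

```python
from statistics import pvariance

# Toy normalized expression: gene -> values across five cells (made up).
expression = {
    "GeneA": [0.0, 0.0, 0.0, 0.0, 0.0],   # never detected
    "GeneB": [1.0, 1.1, 0.9, 1.0, 1.0],   # expressed, but nearly constant
    "GeneC": [0.0, 3.0, 0.1, 2.8, 0.0],   # varies strongly between cells
    "GeneD": [2.0, 0.2, 2.2, 0.1, 1.9],   # varies strongly between cells
}

def select_variable_genes(expr, n_top, min_cells=1):
    """Drop rarely detected genes, then keep the n_top most variable ones."""
    detected = {g: v for g, v in expr.items()
                if sum(1 for x in v if x > 0) >= min_cells}
    ranked = sorted(detected, key=lambda g: pvariance(detected[g]), reverse=True)
    return ranked[:n_top]

selected = select_variable_genes(expression, n_top=2)
print(selected)
```

GeneA is removed because it is never detected, and GeneB is ranked low because it barely changes between cells, so neither would help to tell cell populations apart.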

2.2.7 Dimension Reduction

2.2.7.1 Dimensionality reduction

Dimensionality reduction is done for several reasons:

  • To ease downstream computational burden
  • To reduce noise in the data
  • To reduce redundancy in the data
  • To visualize data

Popular dimensionality reduction techniques:

  1. Principal Component Analysis (PCA)
     • Highly interpretable
     • Linear approach to reduce dimensions
     • Computationally efficient
     • But scRNA-seq is sparse and non-linear
     • Doesn’t reduce dimensions as much as non-linear methods
  2. tSNE (t-distributed stochastic neighbor embedding)
     • Graph-based, non-linear approach
     • Maintains local structure really well
     • Computationally intensive
  3. UMAP (uniform manifold approximation and projection)
     • Graph-based, non-linear method
     • Maintains local and global structure
     • Relatively fast
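To make the linear idea behind PCA concrete, the sketch below finds only the first principal component of a tiny made-up dataset using power iteration on the covariance matrix. This is an illustration under simplifying assumptions (two features, five cells, a fixed number of iterations), not how analysis packages compute PCA on real data, and the function name is hypothetical.

```python
import math

# Toy data: five cells measured on two features (made-up values).
# The points lie close to the line y = x, so one component captures most variance.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]

def first_principal_component(rows, n_iter=100):
    """Return the leading principal axis and the cell scores along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * d
    for _ in range(n_iter):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores

axis, scores = first_principal_component(data)
print([round(v, 2) for v in axis])
print([round(s, 2) for s in scores])
```

Each cell is now described by a single score instead of two features. The axis is a weighted combination of the original features, which is why PCA results remain interpretable.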

2.2.8 Data correction and integration

2.2.8.1 Batch effects

Batch effects are technical variation introduced by differences in how samples are handled during collection or processing, by differences in experimental protocols, or even by differences in sequencing depth. The source of the variation is often hard to pinpoint.

Batch effects can mask variation in rare cell populations and other important biological differences.

The recommendation is to be lenient on batch effect removal as you can remove important biological variation if you are too aggressive.

Basic steps in correcting for batch effects:

  • Reduce dimensionality (improves signal to noise ratio)
  • Model and remove the batch effect
  • Project back into high-dimensional space (optional)

Different types of batch correction are available:

Global models

  • From bulk RNA-seq
  • Assumes the batch effect is consistent across all cells
  • Example: ComBat
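The assumption behind global models, that a batch shifts every cell in the same way, can be illustrated with a deliberately simple mean-centering sketch. This is not ComBat, which also models scale differences and uses empirical Bayes estimation; the data and the function `center_batches` are made up for illustration.

```python
from collections import defaultdict

# Toy expression of one gene: (batch label, value) per cell. Values are made up.
# Batch "b2" carries a constant offset of +2 relative to batch "b1".
cells = [("b1", 1.0), ("b1", 3.0), ("b2", 3.0), ("b2", 5.0)]

def center_batches(values):
    """Shift every batch so that its mean matches the overall mean."""
    overall_mean = sum(v for _, v in values) / len(values)
    by_batch = defaultdict(list)
    for batch, v in values:
        by_batch[batch].append(v)
    batch_means = {b: sum(v) / len(v) for b, v in by_batch.items()}
    # The same shift is applied to every cell of a batch (the "global" assumption).
    return [(b, v - batch_means[b] + overall_mean) for b, v in values]

corrected = center_batches(cells)
print(corrected)
```

Because one shift is applied to a whole batch, a correction like this would also erase a genuine biological difference between batches, for example when the batches contain different proportions of cell types.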

Linear embedding models

  • Focuses on local neighborhoods of similar cells across batches
  • This method is locally adaptive and non-linear
  • Examples: Seurat and Harmony

Graph-based models

  • Fast
  • Connects cells from different batches but prunes these connections to account for cell type differences
  • Example: Batch-Balanced k-Nearest Neighbor (BBKNN)

Deep Learning models

  • Usually based on autoencoder networks
  • Can integrate cell identity labels to help maintain biological variation
  • Examples: scVI, scANVI, and scGen

UMAP

2.2.8.2 Regressing out biological effects

Cell cycle markers, conserved

  • Across tissue
  • Across species
  • Do you want to study effects of cell-cycle markers?
  • Do you want to study something else?

How to regress out Cell-Cycle markers

  • simple linear regression
  • against a cell cycle score
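Regressing a gene against a cell cycle score means fitting a straight line and keeping only what the line does not explain. The sketch below does this for one gene with ordinary least squares; the score and expression values are made up, the score is assumed to be given already, and the function `regress_out` is a hypothetical name used only here.

```python
def regress_out(expression, score):
    """Return residuals of a simple linear regression of expression on score."""
    n = len(score)
    mean_x = sum(score) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in score)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(score, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted from the score.
    return [y - (intercept + slope * x) for x, y in zip(score, expression)]

# Made-up values: a per-cell cell cycle score and one gene's expression.
cell_cycle_score = [0.0, 1.0, 2.0, 3.0]
gene_expression = [1.0, 3.1, 4.9, 7.0]

residuals = regress_out(gene_expression, cell_cycle_score)
print([round(r, 2) for r in residuals])
```

The residuals no longer rise with the cell cycle score. In a full analysis the same fit would be repeated for every gene, and the residuals would replace the original expression values.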

Where does this score come from?

  • Lists of marker genes in literature
  • Macosko et al, 2015

Some argue that normalizing for Cell Size already accounts for Cell-Cycle effects

Other markers: ribosomal and mitochondrial genes