2 Biological Context
Where does the sequening data come from? Once we have analysed our sequencing data, how does it translate back to cells, tissues, and organisms?
2.1 What is Single Cell RNA Seq (scRNA-seq)?
To understand what single cell RNA Seq is, we can compare and contrast with Bulk RNA Seq. Let’s walk through the bulk RNA Seq workflow below:
Bulk RNASeq
- An estimation of the average expression level for each gene within a population of cells
- Population-level resolution
- DNA from every cell in the sample is mixed together for sequencing
- Expression levels from every cell type are lumped together
Now let’s walk throug the Single Cell RNA Seq workflow:
Single Cell RNASeq
First Isolate Cells
- Emphasizes the differences and variability between individual cells
- Individual-level resolution
- DNA from only one cell is sequenced
- Expression levels from from individual cell types are separated
2.1.1 How is a single tiny cell isolated from other cells in a sample?
Cell Isolation Example
There are many technologies available commercially to perform single cell sorting.
Cell Sorting Tools
A common protocol comes from a company called 10x Genomics. Below is an overview of the cell sorting options available to researchers just to give you a quick impression of the variety of approaches.
Cell Sorting Tools
2.1.2 Once the single cells are isolated, how do we keep track of them? How do we know which cells belong to which sequences?
Droplets and UMIs
Cell Sorting Tools
What does the UMI do?
UMIs are oligonucleotides consisting of random bases.
- Far more UMIs in the library than target DNA molecules so that ach target DNA molecule gets tagged with unique UMI
- Target DNA molecule + UMI gets carried forward to amplification
- Reads can later be grouped by UMI to generate a consensus sequence from all of the reads that contain one specific UMI
https://www.youtube.com/watch?v=gMzzXFMyM5g
UMIs
What does the Barcode do?
What does the Poly dT do?
2.2 Analysis Sequencing Data
Biological Context for interpreting scRNA-Seq Data and Data Analysis
This is an introduction to concepts. The purpose is to familiarize you with steps in an scRNA-Seq analysis workflow. Really, I hope that you understand what the purpose of the data processing steps are.
We are going to approach our learning using a journal-club model. I see that all of you are researchers and are likely very familiar with academic research publications and journal clubs. We will dissect a review paper from 2019 published in Molecular Systems Biology, titled “Current best practices in single cell RNA-seq analysis: a tutorial.”
This paper is paired with a tutorial freely available on github. These tutorials are not the same tutorials that we will cover in the course, providing an opportunity for extra practice as homework during the week of the workshop or afterward.
Luecken, M. D., & Theis, F. J. (2019). Current best practices in single‐cell RNA‐seq analysis: a tutorial. Molecular systems biology, 15(6), e8746.
https://www.github.com/theislab/single-cell-tutorial.
2.2.1 Objectives:
1.) scRNA-Seq Workflow:
- To be able to recognize, select, and design an scRNA-Seq workflow
2.) Pre Processing and Visualization:
3.) Quality Control:
To distinguish between cell count per depth, genes detected per cell, count depth distribution, mitochondrial read fractions.
To be able to interpret quality control metrics
4.) Normalization
5.) Data correction and integration
Regressing out biological effects
Regressing out technical effects
Batch effects and data integration
Expression recovery
6.) Feature selection, dimensionality reduction and visualization
Feature Selection
Dimensionality reduction
Visualization
7.) Stages of Pre-Processed data
Five stages of data processing
Three pre-processing layers
Downstream analysis
8.) Clustering analysis
Clustering
Cluster Annotation
Compositional Analysis
9.) Trajectory analysis
Trajectory interference
Gene expression dynamics
Metastable states
10.) Cell Level Analysis unification
11.) Gene-level analysis
Differential Expression testing
Gene set analysis
Gene regulatory networks
Keep an eye out for references to the following analysis tools, because you will be using them later on in the workshop:
Cell Ranger
Seurat
2.2.2 scRNA-Seq Workflow:
To be able to recognize, select, and design an scRNA-Seq workflow to answer your own question.
Analysis Workflow
Questions:
1.) To get from a cluster to an annotated cluster, what do we need match up to the to the cluster data? See Figure 1.
Answer: Marker Identifiers
2.) The barcode details listed below indicate that a cell has a broken membrane. Why?
Barcodes with a low count depth
few detected genes
high fraction of mitochondrial counts
Answer: After nuclear DNA leaks through a broken nuclear membrane, the only DNA left in a cell is the mitochondrial DNA.
Keep in mind that there are always other implications to consider. For example: Cells involved in respiration ALSO have a high mitochondrial count.
2.2.3 Processing and Visualization
1.) Raw Data
2.) Processed data
molecular counts (count matrices)
read counts (read matrices)
UMIs
Cell Ranger
Quality Control (QC)
Assigning reads to cellular barcodes (demultiplexing)
Assigning reads to mRNA molecules of origin (demultiplexing)
genome alignment
quantification
3.) Doublets
Mistake whereby a barcode may tag multiple cells instead of one
like 2 cells in a droplet
4.) Empty Droplet/Well
Mistake where no cells are tagged
0 cells in droplet
2.2.4 Quality Control:
To distinguish between cell count per depth, genes detected per cell, count depth distribution, and mitochondrial read fractions. See Figure 2.
We will be regularly referring to “covariates” so it is important to keep in mind what a covariate is.
With respect to quality control in scRNA-seq, covariates are:
- counts per barcode (count depth)
- genes per barcode
- fractions of counts from mitochondrial genes
Run all QC Covariates (Cojointly) and cross reference the results of all three. Covariate dependence.
Questions:
What about barcodes that have
- unexpectedly high counts
- large number of detected genes?
What might that indicate?
- Answer: Could be doublets
- A doublet is where multiple cells are captured together
- in a microfluidic droplet
What might a quiescent cell population’s barcode details look like?
- Answer:
- low counts
- low genes
What might a QC Covariates show for cells that are larger in size?
- Answer:
- high cell counts
Should thresholds be set permissively?
- Answer: yes
- as permissively as possible
Why? - Don’t want to filter out viable cell populations.
What if you don’t filter?
- Only lowest count depth and lowest gene per barcode should be considered non-viable
Quality Control
2.2.5 Normalization
With normalization we are trying to get the correct relative gene expression abundances between cells. We are also trying to get uniform variance to satisfy assumptions of downstream tools.
There are many different methods ranging from:
Shifted logarithm
+ Adjusts for a size factor and adds a pseudocount before taking the logartihm
+ Simple and fast
Analytic Pearson residuals
+ Tries to adjust for technical variation while preserving biological variation
+ Takes into account the count depth
+ Normalized values can be positive or negative (lower counts than expected based on average expression of the gene and the sequencing depth of the cell)
2.2.6 Feature Selection
You can get expression values for up to 25,000 genes in a human singls-cell RNA-seq dataset.
Many genes won’t be informative for any given experimental inquiry. The goal is to obtain 500-2000 informative genes.
Single cell RNA approaches contain a lot of drop outs because of limited RNA. Genes with zero or very low counts are removed. Focus on genes that are highly variable across cells.
2.2.7 Dimension Reduction
2.2.7.1 Dimensionality reduction
Dimensionality reduction is done to:
- To ease downstream computational burden
- To reduce noise in the data
- To reduce redundancy in the data
- To visualize data
Popular dimensionality reduction techniques:
- Principal Component Analysis (PCA)
- Highly interpretable
- Linear approach to reduce dimensions
- Computationally efficient
- But scRNA-seq is sparse and non-linear
- Doesn’t reduce dimensions as much as non-linear methods
- tSNE (t-distributed stochastic neighbor embedding)
- Graph-based, non-linear approach
- Maintains local structure really well
- Computationally intensive
- UMAP (uniform manifold approximation and projection)
- Graph based, non-linear method
- Maintains local and global structure
- Relatively fast
2.2.8 Data correction and integration
2.2.8.1 Batch effects
These include technical variation introduced by differences in how samples are handled during collection or processing, differences in experimental protocols or even from differences in sequencing depths but the source of variation is often hard to pinpoint.
Batch effects can mask variation in rare cell populations and other important biological differences.
The recommendation is to be lenient on batch effect removal as you can remove important biological variation if you are too aggressive.
Basic steps in correcting for batch effects:
- Reduce dimensionality (improves signal to noise ratio)
- Model and remove the batch effect
- Project into high-dimensional space (opt)
Different types of batch correct are available:
Global models
- From bulk RNA-seq
- Assumes the batch effect is consistent across all cells
- Example: comBat
Linear embedding models
- Focuses on local neighborhoods of similar cells across batches
- This method is locally adaptive and non-linear
- Examples: Seurat and Harmony
Graph-based models
- Fast
- Connects cells from different batchs but prunes these connections to account for cell type differences
- Example: Batch-Balanced k-Nearest Neighbor (BBKNN)
Deep Learning models
- Usually based on autoencoder networks
- Can integrate cell identity labels to help maintain biological variation
- Examples: scVI, scANVI, and scGen
UMAP
2.2.8.2 Regressing out biological effects
Cell cycle markers, conserved
- Across tissue
- Across spieces
- Do you want to study effects of cell-cycle markers?
- Do you want to study something else?
How to regress out Cell-Cycle markers
- simple linear regression
- against a cell cycle score
Where does this score come from?
- Lists of marker genes in literature
- Macosko et al, 2015
Some argue that normalizing for Cell Size already accounts for Cell-Cycle effects
- McDavid et al, 2016
- https://www.nature.com/articles/srep33892
Other markers: Ribosome Mitochondrial