6 Dimensionality Reduction
A principal goal of single cell analysis is to distinguish and identify cells by their pattern of gene expression (counts). Cells of the same type or function (like xylem or phloem) should have similar expression patterns. In a mathematical sense, if we constructed a grid in N dimensions (where N is the number of genes) and plotted each cell’s (properly normalized) expression in this space, all xylem cells should end up clustered together in about the same spot, and all phloem cells should cluster together in a different spot.
A practical problem with this simple N-dimensional approach is the sheer number of dimensions. Recall that our dataset has 34218 genes overall and 23848 after eliminating those unexpressed in any cell. It is too difficult to visualize such a space.
Dimensionality reduction is a solution to this problem. We do not expect the cell counts to be uniformly distributed across the entire N-dimensional space, but to lie on lower-dimensional “manifolds” due to the correlation in their expression patterns and the fact that most counts are zero and do not contribute any information. We can use mathematical techniques to “project” the data into a 2-dimensional space that we can more easily visualize. As we will see, this is often sufficient to resolve different kinds of cell.
Besides the counts matrices and row and column metadata, the SingleCellExperiment
class has a reducedDim()
function to access any dimensionality reduction data we generate. In this chapter, we will explore three commonly used dimensionality reduction techniques, often known by their acronyms PCA, t-SNE, and UMAP.
Our dataset involves cells of known type, but in general your experimental single cell data may not include the cell type. (In fact, there may not be a clear definition of cell type.) In such cases, you may want to classify your cells using statistical clustering methods.
library(SingleCellExperiment)
library(scater)
library(ggplot2)
6.1 Load data and clusters
We begin by reading in our saved data after quality control (if necessary).
# Do not run - depends on workshop timing
# my_SCE_qc <- readRDS("data/my_SCE-04.rds")
We also load the previously determined cluster and supercluster ids corresponding to cell types.
<- readRDS("data/clusters_5.rds")
df.clusters $cell_barcode <- substring(df.clusters$cell_barcode, 1, 18)
df.clusterscolData(my_SCE_qc) <- merge(colData(my_SCE_qc), df.clusters, by.x = "Barcode", by.y = "cell_barcode", all.x = TRUE)
head(colData(my_SCE_qc))
## DataFrame with 6 rows and 13 columns
## Barcode Sample sizeFactor sum detected
## <character> <character> <numeric> <numeric> <integer>
## 1 AAACCCACACGCGCAT-1 data/filtered_featur.. 0.764455 2285 1382
## 2 AAACCCACAGAAGTTA-1 data/filtered_featur.. 0.872851 2609 1525
## 3 AAACCCACAGACAAAT-1 data/filtered_featur.. 1.576417 4712 2606
## 4 AAACCCAGTAGCTTTG-1 data/filtered_featur.. 0.415850 1243 998
## 5 AAACCCAGTCCAGTTA-1 data/filtered_featur.. 0.366336 1095 914
## 6 AAACCCAGTTATAGAG-1 data/filtered_featur.. 0.909986 2720 1707
## total low_lib_size low_n_features discard doublet_score
## <numeric> <outlier.filter> <outlier.filter> <logical> <numeric>
## 1 2285 FALSE FALSE FALSE 0.12716
## 2 2609 FALSE FALSE FALSE 0.49708
## 3 4712 FALSE FALSE FALSE 0.46240
## 4 1243 FALSE FALSE FALSE 0.02312
## 5 1095 FALSE FALSE FALSE 0.00000
## 6 2720 FALSE FALSE FALSE 0.16184
## probable_doublet cluster_id supercluster
## <logical> <factor> <factor>
## 1 FALSE 15 Endodermal cells
## 2 FALSE 15 Endodermal cells
## 3 FALSE 4 Atrichoblasts
## 4 FALSE 17 Stele cells
## 5 FALSE 9 Meristematic cells
## 6 FALSE 15 Endodermal cells
# Assign display colors for the 21 clusters and 6 superclusters
# (close to those used in Farmer et al.)
<- c(
cluster_colors "cornflowerblue", "darkblue", "cyan",
"springgreen", "olivedrab", "chartreuse", "darkgreen",
"gray", "black", "saddlebrown",
"slateblue", "purple", "pink", "lightpink3", "hotpink", "darkred",
"lightsalmon", "orange", "tan", "firebrick1", "indianred"
)<- c("blue", "green", "black", "magenta", "red", "orange") supercluster_colors
And as one more step in the quality control process, we can remove cells of unknown type. (There are 37 of these.)
<- is.na(my_SCE_qc$cluster_id)
unknown_cluster_id sum(unknown_cluster_id)
## [1] 37
<- my_SCE_qc[, !unknown_cluster_id]
my_SCE_qc dim(my_SCE_qc)
## [1] 23848 5056
6.2 Using counts
We will apply these methods first to the raw counts and then to the log-normalized counts, to see firsthand how the latter enhances variation and makes the clusters more distinct.
6.2.1 PCA
The idea behind Principal Component Analysis is that due to correlation (or covariance), the axes in feature (gene) space are generally not the axes with the highest variance. For example, in the simple 2-dimensional example below, the axes of maximum and minimum variance are not along either the x or y axis, but rotated through about 30°. These rotated axes are the principal components of the data. In higher dimensional cases, we find the axis of maximum variance, then the axis of next highest variance at 90° from it, then the axis of next highest variance at 90° from both previous ones, and so on.
By computing the principal components and retaining only those with the highest variance, we reduce dimensionality at the expense of not accounting for 100% of the variance (like a data compression scheme).
We can use the function runPCA()
(from the scater
package). The default is to compute the top 50 principal components (generally more than we need), and to consider only the top 500 genes (meaning those with the highest variance).
set.seed(42) # only necessary for sign of PC values, not magnitude
<- runPCA(my_SCE_qc, exprs_values = "counts") # ncomponents = 50, ntop = 500 by default my_SCE_qc
Now we can examine the principal components, and make a scatter plot of the first two.
reducedDims(my_SCE_qc)
## List of length 1
## names(1): PCA
dim(reducedDim(my_SCE_qc, "PCA"))
## [1] 5056 50
reducedDim(my_SCE_qc, "PCA")[1:6, 1:4]
## PC1 PC2 PC3 PC4
## [1,] -4.053179 11.61384 15.050736 1.305555
## [2,] -19.147165 13.15111 17.176057 1.955806
## [3,] 7.409991 -14.81521 19.296659 31.310012
## [4,] -35.331884 18.12503 8.338565 -7.429924
## [5,] -44.130172 14.22139 7.365172 -5.096884
## [6,] 18.924708 16.63463 11.947987 -6.169856
pdf("pdf/pca-counts-sce.pdf")
plotPCA(my_SCE_qc, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id")
dev.off()
PCA is a “linear” method in that it considers variance only along Euclidean straight lines (axes) in the multidimensional space, while the variance might be even higher along a more general curve. We therefore do not expect PCA to distinguish similar cells as well as the nonlinear methods described in the next sections.
6.2.2 t-SNE
t-distributed Stochastic Neighbor Embedding is a nonlinear dimensionality reduction method that stochastically estimates probabilities that pairs of data points are close to each other in the high-dimensional space determined by a previously computed PCA, and uses those probabilities to try to keep them close in the low-dimensional space.
To compute t-SNE, we use the runTSNE()
function, which by default still considers the top 500 genes but creates only two columns (2-dimensional output). We use the existing 50 principal components as the high-dimensional space.
set.seed(42)
<- runTSNE(my_SCE_qc, exprs_values = "counts", dimred = "PCA")
my_SCE_qc
reducedDims(my_SCE_qc)
## List of length 2
## names(2): PCA TSNE
dim(reducedDim(my_SCE_qc, "TSNE"))
## [1] 5056 2
head(reducedDim(my_SCE_qc, "TSNE"))
## TSNE1 TSNE2
## [1,] 10.519504 -21.018531
## [2,] 3.113556 -26.409550
## [3,] 10.979244 13.235036
## [4,] -13.993152 -13.178915
## [5,] -10.317739 2.847872
## [6,] 11.016841 -24.874545
pdf("pdf/tsne-counts-sce.pdf")
plotTSNE(my_SCE_qc, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id")
dev.off()
6.2.3 UMAP
Uniform Manifold Approximation and Projection is another nonlinear method. Like t-SNE, it starts with a high-dimensional space of principal components, and tries to keep nearby data points close in a lower-dimensional space. The main difference is that UMAP represents the points as nodes of a sparse graph, so that topological connections (edges) between neighboring points are what matter instead of absolute distances between points. It therefore compares far less pairs of points than t-SNE, resulting in faster run time.
To compute UMAP, we use the runUMAP()
function, which by default (like t-SNE) considers the top 500 genes but creates only two columns (2-dimensional output).
set.seed(42)
<- runUMAP(my_SCE_qc, exprs_values = "counts", dimred = "PCA")
my_SCE_qc
reducedDims(my_SCE_qc)
## List of length 3
## names(3): PCA TSNE UMAP
dim(reducedDim(my_SCE_qc, "UMAP"))
## [1] 5056 2
head(reducedDim(my_SCE_qc, "UMAP"))
## UMAP1 UMAP2
## [1,] -0.2264755 -5.6009413
## [2,] -0.4828674 -9.1512276
## [3,] -3.3064879 1.4108115
## [4,] 5.5791100 0.3584753
## [5,] 3.5515417 2.6182750
## [6,] -0.4037494 -5.9438730
pdf("pdf/umap-counts-sce.pdf")
plotUMAP(my_SCE_qc, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id")
dev.off()
6.3 Using logcounts
Nonlinear dimensionality reduction using raw counts somewhat resolves the cell types, but perhaps we can do better. Let’s try it again using the log-normalized counts.
6.3.1 PCA
set.seed(42)
<- runPCA(my_SCE_qc, exprs_values = "logcounts") my_SCE_qc
pdf("pdf/pca-logcounts-sce.pdf")
plotPCA(my_SCE_qc, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id")
dev.off()
Click for answer
set.seed(42)
<- runPCA(my_SCE_qc, exprs_values = "logcounts", ntop = nrow(my_SCE_qc)) my_SCE_qc_all_genes
pdf("pdf/pca-logcounts-sce-all-genes.pdf")
plotPCA(my_SCE_qc_all_genes, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id")
dev.off()
We may also plot all combinations of the top ncomponents
pairs of PCs. The charts on the diagonal show the density of counts v. value of that principal component.
pdf("pdf/pca-logcounts-pairs.pdf")
plotPCA(my_SCE_qc, ncomponents = 4, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id")
dev.off()
It is also worth looking at the percentage of variance explained by each principal component.
<- ncol(reducedDim(my_SCE_qc, "PCA"))
num_pcs <- data.frame(
df_pv pc = 1:num_pcs,
percent_variance = attr(reducedDim(my_SCE_qc, "PCA"), "percentVar")
)
pdf("pdf/pca-percent-variance.pdf")
ggplot(df_pv, aes(x = pc, y = percent_variance)) +
geom_point() +
xlab("Principal Component") +
ylab("Variance explained (%)")
dev.off()
This suggests that we can speed up t-SNE and UMAP by using only the first 8-10 principal components (the n_dimred
argument below).
6.3.2 t-SNE
t-SNE has a perplexity
parameter that usually requires some tuning. In general, a lower perplexity favors local closeness relationships between a small number of data points, and can resolve data that naturally map well to a lower-dimensional space, while a higher perplexity tries to retain longer-range or global relationships between more data points, and is necessary to model fundamentally higher-dimensional data. Some of the links at the end of this chapter discuss the subject in more detail.
We will run t-SNE for three different perplexity values: 10, 30 (the default value), and 100.
set.seed(42)
<- runTSNE(my_SCE_qc, exprs_values = "logcounts", dimred = "PCA", n_dimred = 10, perplexity = 10) my_SCE_qc
pdf("pdf/tsne-logcounts-sce-10.pdf")
plotTSNE(my_SCE_qc, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id") +
ggtitle("perplexity = 10")
dev.off()
set.seed(42)
<- runTSNE(my_SCE_qc, exprs_values = "logcounts", dimred = "PCA", n_dimred = 10) #, perplexity = 30) my_SCE_qc
pdf("pdf/tsne-logcounts-sce-30.pdf")
plotTSNE(my_SCE_qc, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id") +
ggtitle("perplexity = 30")
dev.off()
set.seed(42)
<- runTSNE(my_SCE_qc, exprs_values = "logcounts", dimred = "PCA", n_dimred = 10, perplexity = 100) my_SCE_qc
pdf("pdf/tsne-logcounts-sce-100.pdf")
plotTSNE(my_SCE_qc, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id") +
ggtitle("perplexity = 100")
dev.off()
Exercise: Which perplexity
value seems most effective? Discuss.
6.3.3 UMAP
We will generate UMAP plots showing both the cluster and supercluster.
set.seed(42)
<- runUMAP(my_SCE_qc, exprs_values = "logcounts", dimred = "PCA", n_dimred = 10)
my_SCE_qc <- as.data.frame(reducedDim(my_SCE_qc, "UMAP"))
xx $cluster_id <- as.character(my_SCE_qc$cluster_id)
xx<- unique(xx$cluster_id)
cc <- sapply(cc, function(ci) mean(na.omit(xx[xx$cluster_id == ci, 1])))
xc <- sapply(cc, function(ci) mean(na.omit(xx[xx$cluster_id == ci, 2])))
yc <- data.frame(cluster_id = cc, x_centroid = xc, y_centroid = yc, row.names = NULL) my_cluster_data
pdf("pdf/umap-logcounts-sce-cluster.pdf")
plotUMAP(my_SCE_qc, colour_by = "cluster_id") +
scale_color_manual(values = cluster_colors) +
labs(color = "cluster_id") +
geom_text(data = my_cluster_data, aes(x = x_centroid, y = y_centroid, label = cluster_id))
dev.off()
$supercluster <- as.character(my_SCE_qc$supercluster)
xx<- unique(xx$supercluster)
cc <- sapply(cc, function(ci) mean(na.omit(xx[xx$supercluster == ci, 1])))
xc <- sapply(cc, function(ci) mean(na.omit(xx[xx$supercluster == ci, 2])))
yc <- data.frame(supercluster = cc, x_centroid = xc, y_centroid = yc, row.names = NULL) my_supercluster_data
pdf("pdf/umap-logcounts-sce-supercluster.pdf")
plotUMAP(my_SCE_qc, colour_by = "supercluster") +
scale_color_manual(values = supercluster_colors) +
labs(color = "supercluster") +
geom_text(data = my_supercluster_data, aes(x = x_centroid, y = y_centroid, label = supercluster))
dev.off()
To summarize: due to their nonlinearity, t-SNE and UMAP achieve better dimensionality reduction than PCA. UMAP is typically much faster than t-SNE, though each has a unique visual je ne sais quoi that may appeal to you (or suit your data).
The cell type clusters are now reasonably well resolved. To save the data,
# Do not run - depends on workshop timing
# saveRDS(my_SCE_qc, "data/my_SCE-05.rds")
6.4 Visualizing differential gene expression
6.4.1 Feature plots
Feature plots can identify marker genes associated with a particular cell type. They are basically a dimensionality reduction plot (here, UMAP) colored by gene expression level.
<- c("AT5G17390", "AT1G68850") my_genes
pdf("pdf/featureplots-sce.pdf")
plotUMAP(my_SCE_qc, colour_by = my_genes[1]) +
scale_color_gradient(low = "lightgray", high = "blue") +
labs(color = my_genes[1]) +
geom_text(data = my_cluster_data, aes(x = x_centroid, y = y_centroid, label = cluster_id))
plotUMAP(my_SCE_qc, colour_by = my_genes[2]) +
scale_color_gradient(low = "lightgray", high = "blue") +
labs(color = my_genes[2]) +
geom_text(data = my_cluster_data, aes(x = x_centroid, y = y_centroid, label = cluster_id))
dev.off()
6.4.2 Dot plots
Dot plots show expression by cluster (or any other category) for several genes simultaneously.
<- head(rowData(my_SCE_qc)$ID[order(-rowData(my_SCE_qc)$detected)]) top_6_genes
pdf("pdf/dotplot-sce.pdf")
plotDots(my_SCE_qc, features = top_6_genes, group = "supercluster") +
scale_color_gradient(low = "darkblue", high = "red") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
dev.off()
Click for answer
The bright red dots represent high gene expression. One candidate is gene AT3G06355 which is expressed in meristematic cells.
<- "AT3G06355"
my_gene pdf("pdf/featureplot-exercise.pdf")
plotUMAP(my_SCE_qc, colour_by = my_gene) +
scale_color_gradient(low = "lightgray", high = "blue") +
labs(color = my_gene) +
geom_text(data = my_cluster_data, aes(x = x_centroid, y = y_centroid, label = cluster_id))
dev.off()
6.5 Clustering
What if we did not know the cell types? We can assign clusters using standard methods like k-means. For this method, we must provide the centers
argument, which is a guess for either the number of clusters (nc = 10 in this example) or an nc × 2 matrix of their centroids.
set.seed(42)
<- kmeans(reducedDim(my_SCE_qc, "UMAP"), centers = 10)$cluster
new_clusters colData(my_SCE_qc)$kmeans_cluster <- as.factor(new_clusters)
pdf("pdf/umap-kmeans-clusters-42.pdf")
plotUMAP(my_SCE_qc, colour_by = "kmeans_cluster")
dev.off()
k-means is a stochastic clustering algorithm and will give different results for different initial random number seeds.
set.seed(43)
<- kmeans(reducedDim(my_SCE_qc, "UMAP"), centers = 10)$cluster
new_clusters colData(my_SCE_qc)$kmeans_cluster <- as.factor(new_clusters)
pdf("pdf/umap-kmeans-clusters-43.pdf")
plotUMAP(my_SCE_qc, colour_by = "kmeans_cluster")
dev.off()
Exercise: Our choice of 10 clusters was arbitrary. How many would you use? Try it.
6.6 SingleR
While the cell clusters we created above may seem plausible, they tell us nothing about the real cell types to which they correspond (if indeed they do). A better way is to match your cells’ expression patterns against a database of known cell types.
The SingleR
Bioconductor package automates this predictive process for you. We start by loading the package,
library(SingleR)
As a simple example, we will import one of the other samples from the Farmer et al. paper, and match it against our current dataset. To save time, this has already passed through basic quality control and log-normalization, and contains its known cell types.
<- readRDS("data/another_SCE.rds")
another_SCE dim(another_SCE)
## [1] 20200 1867
head(colData(another_SCE))
## DataFrame with 6 rows and 8 columns
## Barcode Sample sum detected total
## <character> <character> <numeric> <integer> <numeric>
## 1 AAACCTGAGGTGCACA-1 data-farmer-et-al/sN.. 1503 1153 1503
## 2 AAACCTGGTATAATGG-1 data-farmer-et-al/sN.. 947 533 947
## 3 AAACCTGGTTCGGCAC-1 data-farmer-et-al/sN.. 651 570 651
## 4 AAACCTGTCACTTATC-1 data-farmer-et-al/sN.. 547 397 547
## 5 AAACCTGTCCTTTCTC-1 data-farmer-et-al/sN.. 2173 1041 2173
## 6 AAACCTGTCTTTCCTC-1 data-farmer-et-al/sN.. 846 676 846
## sizeFactor cluster_id supercluster
## <numeric> <factor> <factor>
## 1 1.285180 5 Atrichoblasts
## 2 0.809757 11 Cortical cells
## 3 0.556655 3 Trichoblasts
## 4 0.467727 4 Atrichoblasts
## 5 1.858081 11 Cortical cells
## 6 0.723395 5 Atrichoblasts
We predict the cell types using the function SingleR()
(this takes a few minutes), then inspect the results. Note that the default de.method = "classic"
is meant to be used with bulk data; for single cell data you should use "wilcox"
or "t"
instead, and either increase the number of genes it considers (de.n
) or explicitly pass it a list of known marker genes. See the SingleR
package documentation for more details.
<- SingleR(test = another_SCE, ref = my_SCE_qc, labels = my_SCE_qc$cluster_id, de.method = "wilcox", de.n = 100)
predicted_clusters predicted_clusters
## DataFrame with 1867 rows and 4 columns
## scores labels delta.next pruned.labels
## <matrix> <character> <numeric> <character>
## 1 0.145729:0.0906747:0.308452:... 5 0.02696302 5
## 2 0.142641:0.0756601:0.249562:... 11 0.05165787 11
## 3 0.133225:0.0813310:0.150437:... 3 0.00914204 3
## 4 0.145553:0.0456477:0.116264:... 1 0.00459201 1
## 5 0.182280:0.0821189:0.285521:... 11 0.11554030 11
## ... ... ... ... ...
## 1863 0.220292:0.0437221:0.115311:... 1 0.2193211 1
## 1864 0.117324:0.0723303:0.246897:... 11 0.0921154 11
## 1865 0.130900:0.0592665:0.264765:... 11 0.0734595 11
## 1866 0.233276:0.0688232:0.190697:... 1 0.0797890 1
## 1867 0.153706:0.0927359:0.276151:... 11 0.0125291 11
The scores
column is a matrix of prediction quality scores for each cell type, and delta.next
is the difference between the best and second best score. The labels
and pruned.labels
columns indicate the predicted cell type before and after pruning; the latter may be NA
(unknown) if the prediction is especially uncertain.
sum(is.na(predicted_clusters$pruned.labels))
## [1] 1
In our example, this was true for one cell.
We now assign the predicted clusters to another_SCE
.
# convert the predicted cluster id to a factor
$pred_cluster_id <- as.factor(predicted_clusters$labels)
another_SCE# put the factor levels in the correct order
$pred_cluster_id <- factor(another_SCE$pred_cluster_id, levels = levels(my_SCE_qc$cluster_id))
another_SCEhead(colData(another_SCE))
## DataFrame with 6 rows and 9 columns
## Barcode Sample sum detected total
## <character> <character> <numeric> <integer> <numeric>
## 1 AAACCTGAGGTGCACA-1 data-farmer-et-al/sN.. 1503 1153 1503
## 2 AAACCTGGTATAATGG-1 data-farmer-et-al/sN.. 947 533 947
## 3 AAACCTGGTTCGGCAC-1 data-farmer-et-al/sN.. 651 570 651
## 4 AAACCTGTCACTTATC-1 data-farmer-et-al/sN.. 547 397 547
## 5 AAACCTGTCCTTTCTC-1 data-farmer-et-al/sN.. 2173 1041 2173
## 6 AAACCTGTCTTTCCTC-1 data-farmer-et-al/sN.. 846 676 846
## sizeFactor cluster_id supercluster pred_cluster_id
## <numeric> <factor> <factor> <factor>
## 1 1.285180 5 Atrichoblasts 5
## 2 0.809757 11 Cortical cells 11
## 3 0.556655 3 Trichoblasts 3
## 4 0.467727 4 Atrichoblasts 1
## 5 1.858081 11 Cortical cells 11
## 6 0.723395 5 Atrichoblasts 5
Finally, we compute its overall accuracy and plot its confusion matrix.
<- table(another_SCE$cluster_id, another_SCE$pred_cluster_id, dnn = c("Real", "Pred"))
tbl <- sum(diag(tbl))/sum(tbl)
accuracy print(paste("Accuracy =", round(100*accuracy, 1), "%"))
## [1] "Accuracy = 73.9 %"
# Convert tbl from counts to probabilities
<- tbl/rowSums(tbl) # Real is in rows, Pred in columns
tbl <- as.data.frame(tbl) conf_mx
pdf("pdf/predicted-clusters-confusion-matrix.pdf")
ggplot(conf_mx, aes(x = Pred, y = Real)) +
geom_tile(aes(fill = Freq)) +
coord_fixed() +
scale_fill_gradient(low = "white", high = "darkblue")
dev.off()
SingleR
includes functions for assessing the uncertainty of the predicted cell types. For example, plotScoreHeatmap()
creates a heatmap from the values in the scores
matrix.
pdf("pdf/predicted-clusters-score-heatmap.pdf")
plotScoreHeatmap(predicted_clusters)
dev.off()
Its rows represent cell types (cluster_id
), its columns represent cells, and the bar across the top organizes the cells by predicted type. For example, many cells of types 5 and 11 receive nearly ambiguous predictions, as do cell types 1 and 2. These cell types probably have similar patterns of gene expression (possibly indicating similar function or even structure), making them difficult to distinguish.
another_SCE
superclusters as well. Assign them to a pred_supercluster
column and create a confusion matrix v. the real superclusters.
Click for answer
<- SingleR(test = another_SCE, ref = my_SCE_qc, labels = my_SCE_qc$supercluster, de.method = "wilcox", de.n = 100)
predicted_superclusters
$pred_supercluster <- as.factor(predicted_superclusters$labels)
another_SCE$pred_supercluster <- factor(another_SCE$pred_supercluster, levels = levels(my_SCE_qc$supercluster))
another_SCE
<- table(another_SCE$supercluster, another_SCE$pred_supercluster, dnn = c("Real", "Pred"))
tbl <- sum(diag(tbl))/sum(tbl)
accuracy print(paste("Accuracy =", round(100*accuracy, 1), "%"))
## [1] "Accuracy = 80.8 %"
<- tbl/rowSums(tbl)
tbl <- as.data.frame(tbl) conf_mx
pdf("pdf/predicted-superclusters-confusion-matrix.pdf")
ggplot(conf_mx, aes(x = Pred, y = Real)) +
geom_tile(aes(fill = Freq)) +
coord_fixed() +
scale_fill_gradient(low = "white", high = "darkred")
dev.off()
Here we used our own dataset as the reference. If you work with human or mouse single-cell data, the Bioconductor packages celldex and scRNAseq provide several public cell type databases for annotating them.
6.7 Further resources
Coenen and Pearce, Understanding UMAP.
Czarnewski, Dimensionality reduction video, ELIXIR EXCELERATE course Single cell RNA-seq data analysis with R (2019).
Lun, Assigning Cell Types with SingleR (2020).
McInnes, Healy, and Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2020).
Tran, Comparing UMAP vs t-SNE in Single-cell RNA-Seq Data Visualization, Simply Explained
van der Maaten and Hinton, Visualizing Data using t-SNE, J. Machine Learning Research 1, 1-48 (2008).
Wattenberg, Viégas, and Johnson, How to Use t-SNE Effectively, Distill (2016).
Wikipedia: Principal Component Analysis • t-SNE • UMAP