What Non-Linear Dimension Reduction hides and misrepresents.
DNA genome → RNA transcripts → proteins
How can we know which RNA transcripts and proteins a cell is making, and how this is regulated?
I will concentrate on single-cell RNA-sequencing (scRNA-Seq).
Read hundreds of millions of sequences of RNA.
Each sequence has an attached “barcode” telling the cell it came from.
GACAATGCCCAGGGATCCCATGTGGGTTTTTTTTTT...ATCACGTCGTCCCACATACCCTCAACGTCAGTAGCGTGACGGTTC
[ Cell barcode ][ UMI ] [ RNA sequence, reverse transcribed to DNA ]
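As a minimal sketch of how such a read might be split into its parts: the lengths below (a 16-base cell barcode followed by a 10-base UMI) are an assumption read off the example above, and other protocols use different layouts. The helper name `split_read` is hypothetical.

```python
# Assumed layout: 16-base cell barcode, then 10-base UMI, then everything else.
BARCODE_LEN = 16
UMI_LEN = 10

def split_read(read, barcode_len=BARCODE_LEN, umi_len=UMI_LEN):
    """Split a read into (cell barcode, UMI, remainder)."""
    if len(read) < barcode_len + umi_len:
        raise ValueError("read is shorter than barcode + UMI")
    barcode = read[:barcode_len]
    umi = read[barcode_len:barcode_len + umi_len]
    rest = read[barcode_len + umi_len:]
    return barcode, umi, rest

# The start of the example read shown above.
barcode, umi, rest = split_read("GACAATGCCCAGGGATCCCATGTGGGTTTTTTTTTT")
```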
Large p: 20,000+ protein-coding genes in the human genome.
Large n: thousands or millions of individual cells.
Cells are like points in a \(p\)-dimensional space.
We will look at 10,000 Peripheral Blood Mononuclear Cells (PBMC) from eight human donors.
Public dataset from Kang et al. (2018).
A large sparse count matrix.
A small subset of the data is shown →
(seriated nicely)
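One way such a matrix could be assembled is sketched here, assuming each read has already been assigned to a cell barcode, a gene, and a UMI. The reads and the name `count_umis` are made up for illustration. Only non-zero entries are stored, which is what keeps a matrix of this size manageable.

```python
from collections import defaultdict

def count_umis(assigned_reads):
    """Count distinct UMIs per (cell, gene); absent pairs are implicit zeros."""
    seen = defaultdict(set)
    for cell, gene, umi in assigned_reads:
        # Repeated reads of the same UMI are counted as one molecule.
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

# Hypothetical assigned reads: (cell barcode, gene, UMI).
reads = [
    ("cellA", "CD3E", "AAAC"),
    ("cellA", "CD3E", "AAAC"),  # duplicate read of the same molecule
    ("cellA", "CD3E", "GGTA"),
    ("cellB", "LYZ", "TTGC"),
]
counts = count_umis(reads)
```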
We have a high-D set of points to make sense of.
We want a 2-D layout that captures important high-D features.
Two popular methods for this are UMAP and t-SNE.
Both focus on capturing the topology of the high-D data.
Standard Seurat processing steps:
35,635 genes
→ Normalize and log1p transform counts.
→ 2,000 “highly variable genes” chosen.
→ 10 Principal Components found.
→ 2D UMAP layout of cells.
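The steps before the UMAP layout can be sketched roughly in NumPy. This is a simplified stand-in, not Seurat's implementation: in particular, genes are ranked here by plain variance, whereas Seurat's highly-variable-gene selection is more elaborate. The function name `simple_pipeline` and the toy data are assumptions for illustration.

```python
import numpy as np

def simple_pipeline(counts, n_hvg=2000, n_pcs=10, scale_factor=10_000):
    """counts: cells x genes array of raw counts. Returns cells x n_pcs scores."""
    counts = np.asarray(counts, dtype=float)
    # Normalize each cell to the same total, then log1p transform.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    lognorm = np.log1p(counts / totals * scale_factor)
    # Keep the genes with the largest variance (simplified gene selection).
    n_hvg = min(n_hvg, lognorm.shape[1])
    hvg = np.argsort(lognorm.var(axis=0))[::-1][:n_hvg]
    x = lognorm[:, hvg]
    # Center and scale each gene, then find principal components by SVD.
    x = x - x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0
    x = x / sd
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    n_pcs = min(n_pcs, len(s))
    return u[:, :n_pcs] * s[:n_pcs]

# Toy data: 200 cells, 500 genes of Poisson noise.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(200, 500))
scores = simple_pipeline(toy_counts, n_hvg=100, n_pcs=10)
```

The resulting principal component scores are what would then be handed to UMAP for the 2D layout.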
Cells have also been clustered and clusters annotated.
Some cells were stimulated with a cytokine, IFN-β.
U = Unstimulated
S = Stimulated
UMAP seeks a 2-D layout that makes two weighted graphs as similar as possible
(according to an objective function).
1. Weighted graph from High-D points
2. Weighted graph from 2-D points
Many details omitted. The authors have a complicated topological justification of their algorithm.
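To make the idea of comparing two weighted graphs concrete, here is a toy sketch of a cross-entropy style objective of the kind UMAP uses. It is heavily simplified: the low-D curve parameters `a` and `b` are set to 1 as an assumption (UMAP derives them from min_dist), every pair of points is scored, and nothing is optimized. The function names and example weights are illustrative only.

```python
import math

def low_d_weight(p, q, a=1.0, b=1.0):
    """Edge weight between two 2-D points; decreases with distance."""
    d2 = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return 1.0 / (1.0 + a * d2 ** b)

def graph_cross_entropy(high_d_weights, layout, eps=1e-12):
    """Sum over edges of the cross entropy between high-D and low-D weights."""
    total = 0.0
    for (i, j), w in high_d_weights.items():
        v = low_d_weight(layout[i], layout[j])
        w = min(max(w, eps), 1.0 - eps)   # clamp to avoid log(0)
        v = min(max(v, eps), 1.0 - eps)
        total += w * math.log(w / v) + (1.0 - w) * math.log((1.0 - w) / (1.0 - v))
    return total

# Hypothetical high-D graph: points 0 and 1 are close neighbours, 2 is far away.
high_d = {(0, 1): 0.9, (0, 2): 0.05, (1, 2): 0.05}
good_layout = {0: (0.0, 0.0), 1: (0.3, 0.0), 2: (5.0, 0.0)}
bad_layout = {0: (0.0, 0.0), 1: (5.0, 0.0), 2: (0.3, 0.0)}

good_score = graph_cross_entropy(high_d, good_layout)
bad_score = graph_cross_entropy(high_d, bad_layout)
```

A layout that keeps strongly connected points close and weakly connected points apart gets the lower score.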
Let’s look at some synthetic examples motivated by biology.
Parameters
UMAP: min_dist=0.3, n_neighbors=30 (Seurat defaults).
t-SNE: perplexity=200
A higher perplexity than the default produced better results.
A couple of these examples are based on ones by Jayani Lakshika.
A batch or treatment effect might simply offset cells from one biological sample relative to another.
With UMAP or t-SNE this simple geometry may be hidden.
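A small synthetic illustration of this kind of offset, using made-up numbers: two samples drawn from the same distribution, with one shifted by a constant vector. Under any linear projection the shift between the sample means is simply the projection of the high-D shift, so the simple geometry is kept.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20
offset = np.zeros(p)
offset[:3] = [2.0, -1.0, 0.5]                  # hypothetical batch effect

sample_a = rng.normal(size=(300, p))           # cells from sample A
sample_b = rng.normal(size=(300, p)) + offset  # cells from sample B, shifted

# Any linear projection to 2-D: here a random orthonormal p x 2 basis.
basis, _ = np.linalg.qr(rng.normal(size=(p, 2)))

shift_high_d = sample_b.mean(axis=0) - sample_a.mean(axis=0)
shift_2d = (sample_b @ basis).mean(axis=0) - (sample_a @ basis).mean(axis=0)

# By linearity, shift_2d equals shift_high_d @ basis (up to rounding).
```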
A long tail of points in one direction becomes a whisker.
Parallel whiskers may attract.
There is no good 2D layout of this shape, so it is torn apart.
Even if a good layout is possible, NLDR may get stuck at a torn local optimum!
Tree-like patterns can be put into a plane without tearing.
UMAP and t-SNE are very effective in this case.
Linear projections of data have less potential to misrepresent data than NLDR.
Animated linear projection of data is called a tour.
Tours have been implemented in desktop software such as GGobi, and more recently in R packages such as tourr.
I recently developed langevitour, a JavaScript tour widget (usable from R or Python).
Langevitour uses a physics engine to animate and interact with the linear projections, using concepts of momentum, potential energy, and heat.
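The basic idea of a tour can be sketched independently of any of these packages: produce a smooth sequence of orthonormal 2-D projection frames and project the data onto each in turn. The sketch below linearly blends two random frames and re-orthonormalizes each blend. This is an assumption made for brevity; it is not the interpolation used by tourr, and langevitour instead moves the projection with its physics engine.

```python
import numpy as np

def orthonormalize(frame):
    """Return an orthonormal basis for the columns of frame (sign-stable QR)."""
    q, r = np.linalg.qr(frame)
    signs = np.where(np.diag(r) < 0, -1.0, 1.0)
    return q * signs

def tour_frames(start, end, n_steps):
    """Yield projection frames moving from start to end."""
    for t in np.linspace(0.0, 1.0, n_steps):
        yield orthonormalize((1.0 - t) * start + t * end)

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 5))                 # toy data: 100 points in 5-D
start = orthonormalize(rng.normal(size=(5, 2)))
end = orthonormalize(rng.normal(size=(5, 2)))

frames = list(tour_frames(start, end, n_steps=10))
projections = [data @ frame for frame in frames]  # one 2-D scatter per frame
```

Drawing the scatter plots in `projections` one after another gives the animation.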
PCA and UMAP only depend on Euclidean distance.
No direction is considered special.
But genes are special!
Specific biological functions are linked to specific genes and sets of genes.
We can apply varimax rotation:
(Other methods exist to find sparse loadings, such as PCA with an elastic net or prenet penalty, NMF, MOFA.)
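A minimal sketch of varimax rotation, assuming the classical approach of rotating one pair of components at a time, with no row normalization. Library implementations may differ in detail. The function name and the toy loadings are illustrative.

```python
import numpy as np

def varimax(loadings, n_sweeps=20, tol=1e-10):
    """Rotate the columns of a (genes x components) loadings matrix."""
    L = np.array(loadings, dtype=float)
    p, k = L.shape
    for _ in range(n_sweeps):
        max_angle = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                x, y = L[:, i].copy(), L[:, j].copy()
                u = x ** 2 - y ** 2
                v = 2.0 * x * y
                A, B = u.sum(), v.sum()
                num = 2.0 * (u * v).sum() - 2.0 * A * B / p
                den = (u ** 2 - v ** 2).sum() - (A ** 2 - B ** 2) / p
                phi = 0.25 * np.arctan2(num, den)  # best angle for this pair
                c, s = np.cos(phi), np.sin(phi)
                L[:, i] = c * x + s * y
                L[:, j] = -s * x + c * y
                max_angle = max(max_angle, abs(phi))
        if max_angle < tol:
            break
    return L

# Toy example: sparse loadings (each gene on one component), mixed by 30 degrees.
simple = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), np.sin(theta)],
                [-np.sin(theta), np.cos(theta)]])
dense = simple @ mix
recovered = varimax(dense)
```

In this toy case the rotation undoes the mixing, so each gene again loads on a single component.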
The routine use of NLDR in biology for single cell data is a remarkable success, but:
UMAP has largely supplanted an earlier method called t-SNE.
logarithmic.net/langevitour/2023-iasc-ars (some slides are CPU intensive)
UMAP and t-SNE effectively separate clusters.
Differences in density are hidden. Size merely reflects the number of points in each cluster.