Dimensionality reduction example

Typical dimensionality reduction steps for scRNA-Seq:

1000s of genes → 10s of Principal Components → 2 UMAP dimensions


Thousands of dimensions are hard to draw, so this simulated example will use 3 genes.

Assume the genes are already log transformed, centered, and scaled as usual with Seurat.

Code
## 2023-11-10 This code requires the development version of langevitour
# remotes::install_github("pfh/langevitour")

library(langevitour)
library(uwot)
library(htmltools)
library(glue)

radio_button <- function(text, id, self, code) {
    down <- self == "1"
    group <- paste0("group_", id)
    self <- paste0(group, "_", self)
    
    code <- glue('
        {code};
        Array.from(document.getElementsByClassName("{group}")).forEach(i=>i.classList.remove("down"));
        document.getElementById("{self}").classList.add("down");
    ')
    
    tags$button(
        text,
        class=paste0(group,ifelse(down," down","")),
        id=self,
        onclick=code)
}




# Simulate some data
set.seed(563)
n <- 2000
mat <- cbind(
    rnorm(n),
    rnorm(n),
    rnorm(n)*0.3 + rep(c(-1,1),n/2))

gene_mat <- mat %*% cbind(
    gene1 = c(1,1,0),
    gene2 = c(1,-1,0),
    gene3 = c(1,0,2))


# Get principal components
pcs <- prcomp(gene_mat)

# Calculate UMAP layout
layout <- umap(pcs$x, min_dist=0.5)
layout[,1] <- -layout[,1] # Flip sign just to make the animation nicer
colnames(layout) <- c("UMAP1","UMAP2")

stateF <- '{
        "labelInactive":["UMAP1","UMAP2"],
        "labelPos":{},
        "heatOn":true,"labelAttraction":0,"damping":-1}'

state12 <- '{
        "labelInactive":[],
        "labelPos":{"PC1":[0.9,0],"PC2":[0,0.9]},
        "heatOn":false,"labelAttraction":-1.5,"damping":1.75,"guideType":"none"}'

state13 <- '{
        "labelInactive":[],
        "labelPos":{"PC1":[0.9,0],"PC3":[0,0.9]},
        "heatOn":false,"labelAttraction":-1.5,"damping":1.75,"guideType":"none"}'

state23 <- '{
        "labelInactive":[],
        "labelPos":{"PC2":[0.9,0],"PC3":[0,0.9]},
        "heatOn":false,"labelAttraction":-1.5,"damping":1.75,"guideType":"none"}'

stateU <- '{
        "labelInactive":[],
        "labelPos":{"UMAP1":[0.9,0],"UMAP2":[0,0.9]},
        "heatOn":false,"labelAttraction":-1.5,"damping":1.75,"guideType":"none"}'

bF <- radio_button("Spin freely", "example", "1",
    glue('document.getElementById("widget").langevitour.setState({stateF})'))

b12 <- radio_button("PC1, PC2", "example", "12",
    glue('document.getElementById("widget").langevitour.setState({state12})'))

b13 <- radio_button("PC1, PC3", "example", "13",
    glue('document.getElementById("widget").langevitour.setState({state13})'))

b23 <- radio_button("PC2, PC3", "example", "23",
    glue('document.getElementById("widget").langevitour.setState({state23})'))

bU <- radio_button("UMAP", "example", "U",
    glue('document.getElementById("widget").langevitour.setState({stateU})'))

widget <- langevitour(
    cbind(gene_mat, layout/2), 
    extraAxes=rbind(pcs$rotation, matrix(0,nrow=2,ncol=3)), 
    scale=8, pointSize=1, 
    state=stateF,
    elementId="widget")

browsable(div(bF, b12, b13, b23, bU, br(), br(), widget))



Principal Components Analysis finds a new basis for the data. We go from genes-as-axes to PCs-as-axes. Each PC axis is defined by a vector of “gene loadings” (or “feature loadings”). The projection of each cell onto each PC is called the “cell score” or “cell embedding”.

This is a linear transformation. In fact it is a rotation. Straight lines in gene space remain straight lines in PC space.

Principal Components are ordered by how much variance in the data they explain. The first n components are the best possible approximation of the data with n components, in terms of total squared error. We can simplify the data by discarding all but the first few PCs. Press the “PC1, PC2” button: this is the 2D view containing the most variance.

The most variance does not necessarily mean the most interesting structure. Here PC1 and PC3 actually show more interesting structure. Press the “PC1, PC3” button. This works ok here, but will not work well in general – biological phenomena of interest may be mixed up across many PCs. (Aside: an under-appreciated way to disentangle components so they have more direct biological meaning is to find the varimax rotation of the gene loadings.)


UMAP is good at revealing interesting structure.

UMAP tries to put nearest neighbours in a high dimensional space close to each other in a 2D layout. By using the nearest neighbours graph, it follows the topology of the data.

UMAP is a non-linear transformation, straight lines may become curved or even torn apart.

Press the “UMAP” button. The UMAP was computed from all three PCs. Because I am mean, I found an example where the UMAP went a bit weird. Can you work out what has happened?

Further reading:


About nearest neighbours graphs

The UMAP layout is based on the nearest neighbour graph. For each cell the (default) 30 nearest neighbours are found (n.neighbors parameter in RunUMAP). This tends to capture the topology of shapes in high-dimensional space. It also adapts to how dense clusters are, so as a feature UMAP erases differences in cluster density. All the points within clusters will be packed together with about the same density (min.dist parameter in RunUMAP).

Similarly, graph-based clustering methods such as Louvain clustering use nearest neighbour connections to find clusters following the topology of the data. Notice in the two clusters above there are pairs of points within a cluster that are further apart than some pairs of points between the two different clusters. A purely distance-based clustering method such as k-means would perform poorly here.

Louvain clustering can handle clusters with complex curved shapes, and clusters of different size and density.