Visualising high-dimensional genomics data

High-dimensional data in genomics

DNA genome → RNA transcripts → proteins

How can we know which RNA transcripts and proteins a cell is making, and how this is regulated?

Sequence millions to billions of short DNA sequences at a time.
Or use thousands of complementary “probes” for short DNA and RNA sequences with fluorescent markers.
Many variations!

scRNA-Seq

I will concentrate on single-cell RNA-sequencing (scRNA-Seq).

Read hundreds of millions of sequences of RNA.

Each sequence has an attached “barcode” telling the cell it came from.

        GACAATGCCCAGGGATCCCATGTGGGTTTTTTTTTT...ATCACGTCGTCCCACATACCCTCAACGTCAGTAGCGTGACGGTTC
        [ Cell barcode ][ UMI    ]             [ RNA sequence, reverse transcribed to DNA  ]
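A minimal sketch of how such a read might be split up, assuming the 16-base cell barcode and 10-base UMI widths suggested by the diagram above (real protocols vary). The gene label `GENE_A` and the `count_umis` helper are hypothetical. Assigning a gene requires aligning the RNA sequence, which is not shown here.

```python
from collections import defaultdict

BARCODE_LEN = 16  # assumed from the diagram; protocol dependent
UMI_LEN = 10      # assumed from the diagram; protocol dependent


def split_read(read):
    """Return (cell barcode, UMI) taken from the start of a barcode read."""
    barcode = read[:BARCODE_LEN]
    umi = read[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi


def count_umis(records):
    """Count distinct UMIs per (cell barcode, gene).

    records: iterable of (read, gene) pairs, where gene stands in for the
    (hypothetical) result of aligning the paired RNA sequence.
    """
    seen = defaultdict(set)
    for read, gene in records:
        barcode, umi = split_read(read)
        seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


example_read = "GACAATGCCCAGGGATCCCATGTGGGTTTTTTTTTT"
barcode, umi = split_read(example_read)

# Two reads sharing a barcode and UMI are counted once (a likely PCR duplicate).
counts = count_umis([
    (example_read, "GENE_A"),
    (example_read, "GENE_A"),
    ("GACAATGCCCAGGGATAAAAACCCCCTTTTTTTTTT", "GENE_A"),
])
```

Only (cell, gene) pairs that were actually observed are stored, so the result is naturally sparse.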

Large \(p\): 20,000+ protein-coding genes in the human genome.

Large \(n\): thousands or millions of individual cells.

Cells are like points in a \(p\)-dimensional space.

Example scRNA-Seq data

We will look at 10,000 Peripheral Blood Mononuclear Cells (PBMC) from eight human donors.

Public dataset from Kang et al. (2018).

A large sparse count matrix.

A small subset of the data is shown →
(seriated nicely)

Non-Linear Dimension Reduction (NLDR)

We have a high-D set of points to make sense of.

We want a 2-D layout that captures important high-D features.

Two popular methods for this are:

UMAP (Uniform Manifold Approximation and Projection) ← currently popular
t-SNE (t-distributed Stochastic Neighbor Embedding) ← a similar, earlier method

Focus on capturing the topology of the high-D data.

scRNA-Seq UMAP

What are we seeing here?

Standard Seurat processing steps:

35,635 genes
→ Normalize and log1p transform counts.
→ 2,000 “highly variable genes” chosen.
→ 10 Principal Components found.
→ 2D UMAP layout of cells.
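The steps up to PCA can be sketched in NumPy. This is not Seurat itself, just an illustration under stated assumptions: the `preprocess` function is hypothetical, genes are ranked by plain variance (Seurat's default selection method is more refined), each gene is centred and scaled before PCA, and the final UMAP step is left out.

```python
import numpy as np


def preprocess(counts, n_hvg=2000, n_pcs=10, scale_factor=1e4):
    """counts: cells x genes array of non-negative counts."""
    counts = np.asarray(counts, dtype=float)

    # Normalize each cell to a common total, then log1p transform.
    totals = counts.sum(axis=1, keepdims=True)
    logn = np.log1p(counts / np.maximum(totals, 1.0) * scale_factor)

    # Choose the most variable genes (simplification: plain variance).
    n_hvg = min(n_hvg, logn.shape[1])
    hvg = np.argsort(logn.var(axis=0))[::-1][:n_hvg]
    x = logn[:, hvg]

    # Centre and scale each gene (assumed here; guards against zero variance).
    x = x - x.mean(axis=0)
    sd = x.std(axis=0)
    x = x / np.where(sd > 0, sd, 1.0)

    # PCA via singular value decomposition; scores are cells x n_pcs.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    n_pcs = min(n_pcs, len(s))
    scores = u[:, :n_pcs] * s[:n_pcs]
    return scores, hvg


# Toy data standing in for a real count matrix.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(200, 50))
scores, hvg = preprocess(toy_counts, n_hvg=20, n_pcs=10)
```

The `scores` matrix is what a 2D layout method such as UMAP would then take as input.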

Cells have also been clustered and clusters annotated.

Some cells were stimulated with a cytokine, IFN-β.

U = Unstimulated
S = Stimulated

UMAP overview

UMAP seeks a 2-D layout that makes two weighted graphs as similar as possible
(according to an objective function).

1. Weighted graph from High-D points

k-nearest neighbours graph:
Only nearby points are connected, with the neighbourhood scale adapting to local density.
Weights based on distance for the k neighbours.

2. Weighted graph from 2-D points

Weights based on distance between all pairs of points.

Many details omitted. The authors have a complicated topological justification of their algorithm.
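A rough sketch of the two weighted graphs and the objective, following the description in the UMAP paper but heavily simplified. All function names here are hypothetical, the values `a=1, b=1` are placeholders (UMAP derives them from `min_dist`), and brute-force distances replace the approximate nearest neighbour search used in practice.

```python
import numpy as np


def high_d_weights(x, k=5):
    """Symmetric k-nearest-neighbour weights, adaptive to local density."""
    n = len(x)
    dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))
    directed = np.zeros((n, n))
    target = np.log2(k)
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]  # skip the point itself
        d = dist[i, nbrs]
        excess = np.maximum(d - d[0], 0.0)   # distance beyond nearest neighbour
        # Bisection for the local scale sigma so the weights sum to log2(k).
        lo, hi = 0.0, 1.0
        while np.exp(-excess / hi).sum() < target:
            hi *= 2.0
        for _ in range(64):
            mid = (lo + hi) / 2.0
            if np.exp(-excess / mid).sum() < target:
                lo = mid
            else:
                hi = mid
        directed[i, nbrs] = np.exp(-excess / hi)
    # Combine the two directions: an edge exists if either direction has it.
    return directed + directed.T - directed * directed.T


def low_d_weights(y, a=1.0, b=1.0):
    """Weights between all pairs of layout points: 1 / (1 + a d^(2b))."""
    d2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + a * d2 ** b)


def cross_entropy(w_high, w_low, eps=1e-12):
    """Layout-dependent part of the cross-entropy between the two graphs."""
    off_diag = ~np.eye(len(w_high), dtype=bool)
    p = w_high[off_diag]
    q = np.clip(w_low[off_diag], eps, 1.0 - eps)
    return float(-(p * np.log(q) + (1.0 - p) * np.log(1.0 - q)).sum())


rng = np.random.default_rng(0)
points = rng.normal(size=(40, 5))   # toy high-D data
layout = rng.normal(size=(40, 2))   # an arbitrary (unoptimized) 2-D layout
w_high = high_d_weights(points, k=5)
loss = cross_entropy(w_high, low_d_weights(layout))
```

UMAP would then move the layout points to reduce `loss`, using stochastic gradient descent with sampling of non-edges rather than the full sum shown here.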

Let’s look at some synthetic examples motivated by biology.

Parameters

UMAP: min_dist=0.3, n_neighbors=30
Seurat defaults.

t-SNE: perplexity=200
A higher perplexity than the default produced better results.

A couple of these examples are based on ones by Jayani Lakshika.

Clusters

UMAP and t-SNE effectively separate clusters.

Differences in density are hidden. Size merely reflects the number of points in each cluster.

Different densities

A batch or treatment effect might simply offset cells from one biological sample relative to another.

With UMAP or t-SNE this simple geometry may be hidden.
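To make the "simple offset" geometry concrete, here is a small synthetic example (all names and numbers are illustrative assumptions, not the data behind the slide). Two samples are drawn from the same distribution, and one is shifted by a fixed vector. The difference of the sample means recovers the shift, and a single linear projection along it shows the effect directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_dims = 500, 10

# A fixed shift along the first axis, standing in for a batch or treatment effect.
offset = np.zeros(n_dims)
offset[0] = 3.0

sample_a = rng.normal(size=(n_cells, n_dims))
sample_b = rng.normal(size=(n_cells, n_dims)) + offset

# The difference of means estimates the offset vector.
estimated_offset = sample_b.mean(axis=0) - sample_a.mean(axis=0)

# Projecting onto the offset direction separates the two samples by its length.
direction = estimated_offset / np.linalg.norm(estimated_offset)
gap = (sample_b @ direction).mean() - (sample_a @ direction).mean()
```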

Density gradient

A long tail of points in one direction becomes a whisker.

Parallel whiskers may attract.

Non-planar topology

There is no good 2D layout of this shape, so it is torn apart.

Even if a good layout is possible, NLDR may get stuck at a torn local optimum!

Planar topology

Tree-like patterns can be put into a plane without tearing.

UMAP and t-SNE are very effective in this case.

Touring the data

Linear projections of data have less potential to misrepresent data than NLDR.

An animated sequence of linear projections of data is called a tour.

Grand Tour: explore random linear projections.
Guided Tour: go towards an informative projection.
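A single frame of a grand tour can be sketched as a random orthonormal projection. This is only an illustration with a hypothetical `random_frame` helper. A real tour also moves smoothly between projections, which this sketch does not attempt.

```python
import numpy as np


def random_frame(n_dims, rng):
    """Random n_dims x 2 matrix with orthonormal columns."""
    q, _ = np.linalg.qr(rng.normal(size=(n_dims, 2)))
    return q


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 6))          # toy data: 100 points in 6 dimensions
frame = random_frame(data.shape[1], rng)  # one projection of the tour
projected = data @ frame                  # 2-D coordinates to plot
```

Because the columns of `frame` are orthonormal, distances in the plot never exceed the true high-D distances.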

Tours have been implemented in desktop software such as GGobi, and more recently in R packages such as tourr.

I recently developed langevitour, a JavaScript tour widget (usable from R or Python).

Langevitour uses a physics engine to animate and interact with the linear projections, using concepts of momentum, potential energy, and heat.

scRNA-Seq PCA

Genes are special directions

PCA and UMAP only depend on Euclidean distance.

No direction is considered special.

But genes are special!

Specific biological functions are linked to specific genes and sets of genes.

We can apply varimax rotation:

Same subspace as PCA.
Sparse loadings.
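A minimal sketch of varimax using the classical pairwise rotation formula (Kaiser's method), without row normalization. The function names are hypothetical and the loadings are a toy example, not the scRNA-Seq data. Since the result is the input multiplied by an orthogonal matrix, the subspace is unchanged.

```python
import numpy as np


def varimax_criterion(loadings):
    """Sum over components of the variance of squared loadings."""
    return float(np.var(loadings ** 2, axis=0).sum())


def varimax(loadings, max_sweeps=50, tol=1e-10):
    """Return (rotated loadings, rotation) with rotated = loadings @ rotation."""
    lam = np.array(loadings, dtype=float)
    p, k = lam.shape
    rot = np.eye(k)
    for _ in range(max_sweeps):
        largest = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                x, y = lam[:, i], lam[:, j]
                u = x ** 2 - y ** 2
                v = 2 * x * y
                a, b = u.sum(), v.sum()
                c = (u ** 2 - v ** 2).sum()
                d = 2 * (u * v).sum()
                # Angle that maximizes the criterion for this pair of columns.
                phi = np.arctan2(d - 2 * a * b / p,
                                 c - (a ** 2 - b ** 2) / p) / 4
                g = np.array([[np.cos(phi), -np.sin(phi)],
                              [np.sin(phi), np.cos(phi)]])
                lam[:, [i, j]] = lam[:, [i, j]] @ g
                rot[:, [i, j]] = rot[:, [i, j]] @ g
                largest = max(largest, abs(phi))
        if largest < tol:
            break
    return lam, rot


# Toy example: sparse "true" loadings (6 genes, 2 components), mixed by 30 degrees.
truth = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
theta = np.pi / 6
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
mixed = truth @ mix
rotated, rotation = varimax(mixed)
```

In this toy case the rotation undoes the mixing, so each gene loads on just one component again.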

(Other methods exist to find sparse loadings, such as PCA with an elastic net or prenet penalty, NMF, MOFA.)

scRNA-Seq varimax

Discussion

The routine use of NLDR in biology for single cell data is a remarkable success, but:

NLDR, while powerful, hides aspects of the data or can misrepresent it.
Even PCA hid an underlying simplicity that nature provided.

Need to communicate to biologists what exactly they are seeing.

Extra slides

UMAP vs t-SNE

UMAP has largely supplanted an earlier method called t-SNE.

t-SNE doesn’t use a k-nearest neighbours graph, but does something similar.
Both produce similar results, with the right parameter choices.
UMAP optimizes starting from a better initial layout (spectral embedding).
UMAP has better default settings, producing a less crowded layout.
UMAP is faster.