This is an improved way to calculate log2(normalized_count+pseudocount) that removes a bias between samples due to differing library sizes (normalization amount). The idea is to use the expectation from rarefying to the smallest library size and then applying the log transformation.
It’s a small improvement, but we use this a lot in RNA-Seq and scRNA-Seq.
Rarefication is performed using the binomial distribution to sub-sample. (Hypergeometric sampling could be used but this requires a strict idea of library sizes. Using the binomial, we just need the subsampling fraction, which could be derived from a TMM-adjusted library size, or a particular normalizing gene, etc.)
The concept of rarefication uses random sampling, but actually we can calculate the expectation exactly. My code does this up to a certain count (max_exact argument), then approximates the remainder. Use max_exact=-1 for the old, uncorrected method.
log2plus <- function(add) function(count) log2(count+add)
#Expectation of f(rbinom(length(count),count,fraction))
#Calculate by summing dbinom(i,count,fraction)*f(i)
#Exact summation up to max_exact, then the remaining mean is transformed
efrare_vec <- function(counts, fraction, f=log2plus(1), max_exact=10) {
result <- numeric(length(counts))
mean_remainder <- counts*fraction
p_remainder <- rep(1,length(counts))
i <- 0
while(i <= max_exact) {
mask <- counts >= i
if (!any(mask)) break
p <- dbinom(i, counts[mask], fraction)
result[mask] <- result[mask] + p*f(i)
p_remainder[mask] <- p_remainder[mask] - p
mean_remainder[mask] <- mean_remainder[mask] - i*p
i <- i+1
}
mask <- p_remainder > 1e-10
# Could have used counts > max_exact,
# but numerical accuracy issues also arise as p->0
conditional_mean <- mean_remainder[mask]/p_remainder[mask]
result[mask] <- result[mask] + p_remainder[mask]*f(conditional_mean)
result
}
#Apply E(f(rarefied)) method to a matrix
efrare <- function(counts, fractions, f=log2plus(1), max_exact=10) {
result <- vector("list",ncol(counts))
for(i in seq_len(ncol(counts))) {
result[[i]] <- efrare_vec(counts[,i], fractions[i], f=f, max_exact=max_exact)
}
result <- do.call(cbind, result)
colnames(result) <- colnames(counts)
rownames(result) <- rownames(counts)
result
}
I rarefiy an RNA-Seq count matrix by various fractions (0.01,0.02,0.05,0.1,0.2,0.5,1), then apply the transformation.
The usual convention is to choose a pseudocount of 1 based on the mean library size.
To simulate this, I use a pseudocount of min(lib_sizes)/mean(lib_sizes) after normalizing to min(lib_sizes).
library(NBPSeq)
library(limma)
set.seed(1234)
rarefy <- function(counts, fraction) {
for(i in seq_len(ncol(counts)))
counts[,i] <- rbinom(nrow(counts), counts[,i], fraction)
colnames(counts) <- paste0(colnames(counts),"_",fraction)
counts
}
# arab dataset from NBPSeq, and various downsamplings of it
data(arab)
arab_down <- cbind(
rarefy(arab, 0.01),
rarefy(arab, 0.02),
rarefy(arab, 0.05),
rarefy(arab, 0.1),
rarefy(arab, 0.2),
rarefy(arab, 0.5),
arab)
lib_sizes <- colSums(arab_down)
fractions <- min(lib_sizes) / lib_sizes
# Pseudo-count of 0.01 in smallest library
# The idea is to simulate the effect of a pseudo-count of one in
# a small library if normalization was to the average library size.
pseudocount <- min(lib_sizes)/mean(lib_sizes)
pseudocount
## [1] 0.0206204
f <- log2plus(pseudocount)
conventional <- f(sweep(arab_down, 2, fractions, "*"))
# or conventional <- efrare(arab_down, fractions, f=f, max_exact=-1)
plotMDS(conventional, col=rep(rainbow(6),7), main="Old method")
efrared <- efrare(arab_down, fractions, f=f, max_exact=10)
plotMDS(efrared, col=rep(rainbow(6),7), main="Efrare method")
Consider a rarefication to 10% then a pseudocount of 0.1.
f <- log2plus(0.1)
i <- 0:100
conventional <- efrare_vec(i, 0.1, f, max_exact=-1)
efrared <- efrare_vec(i, 0.1, f)
unrared <- f(i)
plot(i/10, conventional, xlab="Normalized count", ylab="Transformed value")
points(i/10, efrared, col="blue")
points(i, unrared, col="red", cex=2)
You can see the transformation is different based on the rarefication amount.
The usual convention is to apply the pseudocount after normalizing to the mean library size. If we instead normalize to the smallest library size, a pseudocount of 1 is fine.
(However this may be inefficient if only one or two samples have a small library size. Hence the usual convention as above, letting these small samples be noisy in order to extract the most information from the rest of the samples.)
f <- log2plus(1)
conventional <- efrare(arab_down, fractions, f=f, max_exact=-1)
plotMDS(conventional, col=rep(rainbow(6),7), main="Old method, pseudocount 1 in smallest lib")
efrared <- efrare(arab_down, fractions, f=f, max_exact=10)
plotMDS(efrared, col=rep(rainbow(6),7), main="Efrare method, pseudocount 1 in smallest lib")
f <- sqrt
conventional <- efrare(arab_down, fractions, f=f, max_exact=-1)
plotMDS(conventional, col=rep(rainbow(6),7), main="Sqrt transformation")
efrared <- efrare(arab_down, fractions, f=f, max_exact=10)
plotMDS(efrared, col=rep(rainbow(6),7), main="Sqrt transformation with efrare method")
Li and Tibshirani (2011) proposed using resampling with replacement rather than downsampling, in the context of a method for testing for differential expression. As compared to downsampling, this has the advantage that we can increase the size of libraries as well as decrease them.
This can be done using a Poisson distribution.
Li and Tibshirani resampled to the geometric mean of library sizes, but to be consistent with the above I will continue using the arithmetic mean. After resampling, I use a pseudocount of 1.
It does not work well.
efresample_vec <- function(counts, scale, f=log2plus(1), max_exact=10) {
mean <- counts*scale
result <- numeric(length(counts))
mean_remainder <- mean
p_remainder <- rep(1,length(counts))
i <- 0
while(i <= max_exact) {
p <- dpois(i, mean)
result <- result + p*f(i)
p_remainder <- p_remainder - p
mean_remainder <- mean_remainder - i*p
i <- i+1
}
mask <- p_remainder > 1e-10
conditional_mean <- mean_remainder[mask]/p_remainder[mask]
result[mask] <- result[mask] + p_remainder[mask]*f(conditional_mean)
result
}
efresample <- function(counts, scales, f=log2plus(1), max_exact=10) {
result <- vector("list",ncol(counts))
for(i in seq_len(ncol(counts))) {
result[[i]] <- efresample_vec(counts[,i], scales[i], f=f, max_exact=max_exact)
}
result <- do.call(cbind, result)
colnames(result) <- colnames(counts)
rownames(result) <- rownames(counts)
result
}
scales <- mean(lib_sizes)/lib_sizes
f <- log2plus(1)
efresampled <- efresample(arab_down, scales, f=f, max_exact=100)
plotMDS(efresampled, col=rep(rainbow(6),7), main="Resampling method")