This is an improved way to calculate log2(normalized_count+pseudocount) that removes a bias between samples due to differing library sizes (i.e. differing amounts of normalization). The idea is to take the expected value of the log transformation applied to counts rarefied to the smallest library size.

It’s a small improvement, but we use this a lot in RNA-Seq and scRNA-Seq.

Rarefaction is performed using the binomial distribution to sub-sample. (Hypergeometric sampling could be used, but that requires knowing exact library sizes. With the binomial we only need the subsampling fraction, which could be derived from a TMM-adjusted library size, a particular normalizing gene, etc.)
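
As a minimal illustration of the difference (the numbers here are invented): binomial subsampling of a single gene's count only needs the target fraction, whereas hypergeometric subsampling also needs the exact library size.

# Illustration only: subsample a gene with 50 reads from a library of
# 1,000,000 reads down to 100,000 reads.
count <- 50
lib_size <- 1e6
target <- 1e5
rbinom(1, count, target/lib_size)           # binomial: only the fraction is needed
rhyper(1, count, lib_size - count, target)  # hypergeometric: needs the full library size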

Rarefaction is defined in terms of random sampling, but we can actually calculate the expectation exactly. My code does this exactly up to a certain count (the max_exact argument), then approximates the remainder. Use max_exact=-1 for the old, uncorrected method.
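
In symbols, if a gene has count $n$ and the subsampling fraction is $p$, the rarefied count is $X \sim \mathrm{Binomial}(n, p)$, and with $m$ denoting max_exact the quantity computed below is approximately

$$
\operatorname{E}[f(X)] \approx \sum_{i=0}^{m} \Pr(X=i)\,f(i) \;+\; \Pr(X>m)\,f\!\big(\operatorname{E}[X \mid X>m]\big).
$$

The tail term replaces the exact conditional expectation of $f(X)$ with $f$ applied to the conditional mean; for a smooth, slowly curving $f$ such as log2 at larger counts, this should be a close approximation.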

Implementation

# Returns a function that adds a pseudocount then takes log2
log2plus <- function(add) function(count) log2(count+add)


# Expectation of f(rbinom(length(counts), counts, fraction))
# Calculated by summing dbinom(i, counts, fraction)*f(i)
# Exact summation up to max_exact, then the remaining mean is transformed
efrare_vec <- function(counts, fraction, f=log2plus(1), max_exact=10) {    
    result <- numeric(length(counts))    
    mean_remainder <- counts*fraction
    p_remainder <- rep(1,length(counts))    
    i <- 0
    while(i <= max_exact) {
        mask <- counts >= i
        if (!any(mask)) break
        
        p <- dbinom(i, counts[mask], fraction)
        result[mask] <- result[mask] + p*f(i)
        p_remainder[mask] <- p_remainder[mask] - p
        mean_remainder[mask] <- mean_remainder[mask] - i*p
        
        i <- i+1
    }

    mask <- p_remainder > 1e-10 
    # Could have used counts > max_exact, 
    # but numerical accuracy issues also arise as p->0
    conditional_mean <- mean_remainder[mask]/p_remainder[mask]
    result[mask] <- result[mask] + p_remainder[mask]*f(conditional_mean)
    result
}
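
As a quick sanity check (a sketch, not part of the original code), the result can be compared against a brute-force Monte Carlo estimate from actual binomial subsampling; the two should agree closely.

# Sketch: brute-force check of efrare_vec for a single count of 25
# subsampled to 10%, using a log2(x+1) transformation.
set.seed(1)
draws <- rbinom(1e5, 25, 0.1)
mean(log2plus(1)(draws))            # Monte Carlo estimate of E(f(rarefied count))
efrare_vec(25, 0.1, f=log2plus(1))  # exact summation plus tail approximation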


#Apply E(f(rarefied)) method to a matrix
efrare <- function(counts, fractions, f=log2plus(1), max_exact=10) {
    result <- vector("list",ncol(counts))
    for(i in seq_len(ncol(counts))) {
       result[[i]] <- efrare_vec(counts[,i], fractions[i], f=f, max_exact=max_exact)
    }
    result <- do.call(cbind, result)
    colnames(result) <- colnames(counts)
    rownames(result) <- rownames(counts)
    result
}

Test with NBPSeq::arab dataset, small pseudocount

I rarefy an RNA-Seq count matrix by various fractions (0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1), then apply the transformation.

The usual convention is to use a pseudocount of 1 after normalizing to the mean library size.

To simulate this while normalizing (rarefying) to min(lib_sizes), I use a pseudocount of min(lib_sizes)/mean(lib_sizes).

library(NBPSeq)
library(limma)
set.seed(1234)

rarefy <- function(counts, fraction) {
    for(i in seq_len(ncol(counts)))
        counts[,i] <- rbinom(nrow(counts), counts[,i], fraction)
    colnames(counts) <- paste0(colnames(counts),"_",fraction)
    counts
}

# arab dataset from NBPSeq, and various downsamplings of it
data(arab)
arab_down <- cbind(
    rarefy(arab, 0.01),
    rarefy(arab, 0.02),
    rarefy(arab, 0.05),
    rarefy(arab, 0.1),
    rarefy(arab, 0.2),
    rarefy(arab, 0.5),
    arab)

lib_sizes <- colSums(arab_down)
fractions <- min(lib_sizes) / lib_sizes

# Small pseudo-count (about 0.02) in the smallest library.
# The idea is to simulate the effect of a pseudo-count of one in 
# a small library if normalization were to the average library size.
pseudocount <- min(lib_sizes)/mean(lib_sizes)
pseudocount
## [1] 0.0206204
f <- log2plus(pseudocount)

conventional <- f(sweep(arab_down, 2, fractions, "*"))
# or conventional <- efrare(arab_down, fractions, f=f, max_exact=-1)
plotMDS(conventional, col=rep(rainbow(6),7), main="Old method")

efrared <- efrare(arab_down, fractions, f=f, max_exact=10)
plotMDS(efrared, col=rep(rainbow(6),7), main="Efrare method")

What efrare does

Consider rarefaction to 10%, followed by a pseudocount of 0.1.

f <- log2plus(0.1)

i <- 0:100
conventional <- efrare_vec(i, 0.1, f, max_exact=-1)
efrared <- efrare_vec(i, 0.1, f)
unrared <- f(i)

plot(i/10, conventional, xlab="Normalized count", ylab="Transformed value")
points(i/10, efrared, col="blue")
points(i, unrared, col="red", cex=2)

You can see that the effective transformation differs depending on the rarefaction amount.

Using a pseudocount of 1 with the smallest library size is also fine

The usual convention is to apply the pseudocount after normalizing to the mean library size. If we instead normalize to the smallest library size, a pseudocount of 1 is fine.

(However, this may be inefficient if only one or two samples have a small library size. Hence the usual convention described above, which lets these small samples be noisy in order to extract the most information from the rest of the samples.)

f <- log2plus(1)

conventional <- efrare(arab_down, fractions, f=f, max_exact=-1)
plotMDS(conventional, col=rep(rainbow(6),7), main="Old method, pseudocount 1 in smallest lib")

efrared <- efrare(arab_down, fractions, f=f, max_exact=10)
plotMDS(efrared, col=rep(rainbow(6),7), main="Efrare method, pseudocount 1 in smallest lib")

sqrt transformation is also fine

f <- sqrt

conventional <- efrare(arab_down, fractions, f=f, max_exact=-1)
plotMDS(conventional, col=rep(rainbow(6),7), main="Sqrt transformation")

efrared <- efrare(arab_down, fractions, f=f, max_exact=10)
plotMDS(efrared, col=rep(rainbow(6),7), main="Sqrt transformation with efrare method")

Resampling with replacement doesn’t work well

Li and Tibshirani (2011) proposed using resampling with replacement rather than downsampling, in the context of a method for testing for differential expression. As compared to downsampling, this has the advantage that we can increase the size of libraries as well as decrease them.

This can be done using a Poisson distribution.
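
As a sketch of what is being modelled (resample here is an illustrative helper, not part of the method): a resampled library could be simulated in the same style as the rarefy function above, replacing each count with a Poisson draw of mean count*scale. The code below instead works with the exact expectation, as for efrare.

# Sketch: simulated resampling with replacement, analogous to rarefy() above.
# Each count is replaced by a Poisson draw with mean count*scale.
resample <- function(counts, scale) {
    for(i in seq_len(ncol(counts)))
        counts[,i] <- rpois(nrow(counts), counts[,i]*scale)
    colnames(counts) <- paste0(colnames(counts), "_x", scale)
    counts
}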

Li and Tibshirani resampled to the geometric mean of library sizes, but to be consistent with the above I will continue using the arithmetic mean. After resampling, I use a pseudocount of 1.

It does not work well.

# Expectation of f(rpois(length(counts), counts*scale))
# Same structure as efrare_vec, but resampling with the Poisson distribution
efresample_vec <- function(counts, scale, f=log2plus(1), max_exact=10) {
    mean <- counts*scale
    result <- numeric(length(counts))    
    mean_remainder <- mean
    p_remainder <- rep(1,length(counts))    
    i <- 0
    while(i <= max_exact) {
        p <- dpois(i, mean)
        result <- result + p*f(i)
        p_remainder <- p_remainder - p
        mean_remainder <- mean_remainder - i*p
        
        i <- i+1
    }

    mask <- p_remainder > 1e-10 
    conditional_mean <- mean_remainder[mask]/p_remainder[mask]
    result[mask] <- result[mask] + p_remainder[mask]*f(conditional_mean)
    result
}


efresample <- function(counts, scales, f=log2plus(1), max_exact=10) {
    result <- vector("list",ncol(counts))
    for(i in seq_len(ncol(counts))) {
       result[[i]] <- efresample_vec(counts[,i], scales[i], f=f, max_exact=max_exact)
    }
    result <- do.call(cbind, result)
    colnames(result) <- colnames(counts)
    rownames(result) <- rownames(counts)
    result
}


scales <- mean(lib_sizes)/lib_sizes
f <- log2plus(1)

efresampled <- efresample(arab_down, scales, f=f, max_exact=100)
plotMDS(efresampled, col=rep(rainbow(6),7), main="Resampling method")