This is a method of computing log counts that deals sensibly with zeros and differing library sizes, and should cope well with very sparse data. The input is transformed using log2(x/scale+1)
with a different scale for each sample. The scale for each sample is chosen so that every sample's transformed values add up to the same total. These scale values are found efficiently using Newton's method.
Usage
samesum_log2_norm(x, scale = 2, sum = NA, tol = 1e-09, verbose = FALSE)
samesum_log2_cpm(x, scale = 2, sum = NA, tol = 1e-09, verbose = FALSE)
Arguments
- x
A matrix or a sparse matrix. Usually these will be counts, but any matrix of non-negative values could potentially be used. We regard columns as samples.
- scale
Target scale. Each sample will be given a scale value close to this. The default value of 2 should be reasonable for count data, but see discussion below.
- sum
Target sum. Aim to give each sample exactly this sum. scale is ignored if this parameter is used. This could be used to provide a transformation that is consistent across datasets.
- tol
Tolerance when optimizing scale values. 1/scale is increased from zero using Newton's method until the sum is at least target_sum*(1-tol).
- verbose
If TRUE, output some debugging messages.
Value
A matrix or a sparse matrix, depending on the input.
The return value will have some attributes accessible using attr().
"samples" is a data frame with per-sample information. The most important value for the transformation is the "scale" column. There is also "true_lib_size", which is the sum of the input column, and "adjusted_lib_size" if you want to use this method just as a library size adjustment method.
"zero" is the value that zero is transformed to. This will be zero for samesum_log2_norm.
"sum" is the value each column of the output sums to (but not including the CPM adjustment if using samesum_log2_cpm).
Details
You can either specify the scale argument, in which case individual samples are given scale values close to it, or directly specify the target sum with sum.
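The scale search described above can be sketched as follows. This is a hypothetical re-implementation in Python/NumPy for illustration only (the function name log2_norm_scales is made up); it mirrors the documented behaviour of increasing 1/scale from zero with Newton's method until each column's sum reaches target_sum*(1-tol):

```python
import numpy as np

def log2_norm_scales(x, target_sum, tol=1e-9, max_iter=100):
    """For each column, find scale s so that sum(log2(x/s + 1)) ~= target_sum.

    Works in u = 1/s: f(u) = sum(log2(u*x + 1)) - target_sum is increasing
    and concave, so Newton's method starting from u = 0 converges to the
    root from below.
    """
    x = np.asarray(x, dtype=float)
    scales = np.empty(x.shape[1])
    for j in range(x.shape[1]):
        col = x[:, j]
        u = 0.0  # u = 1/scale, increased from zero
        for _ in range(max_iter):
            f = np.log2(u * col + 1.0).sum() - target_sum
            if f >= -tol * target_sum:  # sum is at least target_sum*(1-tol)
                break
            fprime = (col / ((u * col + 1.0) * np.log(2.0))).sum()
            u -= f / fprime
        scales[j] = 1.0 / u
    return scales

# Toy count matrix: rows are features, columns are samples.
x = np.array([[12.,  9., 14.],
              [ 8., 11.,  7.],
              [10., 13.,  9.],
              [ 9.,  6., 12.],
              [ 4., 10., 11.]])
scales = log2_norm_scales(x, target_sum=12.0)
y = np.log2(x / scales + 1.0)  # every column of y sums to 12
```

Because f is concave and increasing in u, each Newton step stays below the root, so the column sums approach the target from below and the stopping rule above is safe.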
samesum_log2_cpm adds a constant to the result to produce values that can be treated as log2 Counts Per Million (CPM). The result will have the property that mean(colSums(2^lcpm-2^attr(lcpm,"zero"))) equals one million.
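One constant consistent with this stated property can be derived directly: since 2^y - 1 = x/scale column-wise, each adjusted column sum is the adjusted library size, and shifting by log2(1e6 / mean(adjusted library sizes)) makes the mean equal one million. A hypothetical NumPy illustration (not the package's code; the scale values here are stand-ins):

```python
import numpy as np

np.random.seed(1)
x = np.random.poisson(20, size=(6, 4)).astype(float)
s = np.array([1.8, 2.0, 2.2, 1.9])            # stand-in per-sample scales
y = np.log2(x / s + 1.0)                      # samesum-style transform
adjusted_lib = (x / s).sum(axis=0)            # equals colSums(2^y - 1)
offset = np.log2(1e6 / adjusted_lib.mean())   # plays the role of the "zero" attribute
lcpm = y + offset
# mean over samples of colSums(2^lcpm - 2^offset) comes out as one million:
check = (2.0**lcpm - 2.0**offset).sum(axis=0).mean()
```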
Unlike library size normalization, the samesum method is robust to variation in high abundance features, because it works on the log transformed values. It is also robust to noise from low abundance features, because of the +1 in the transformation.
This method has some similarity to Centered Log Ratio (CLR) transformation. Where the CLR ensures each sample adds to zero, the samesum transformation ensures each sample adds to the same non-zero value. Like the CLR, distances between samesum transformed data are a good way to measure sample similarity. Transformed values could be used with PCA or with non-linear dimension reduction methods.
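The CLR half of this contrast is easy to check numerically. A small illustration (not the package's code; note that the CLR requires strictly positive values, whereas samesum handles zeros):

```python
import numpy as np

# Per-sample (column-wise) CLR: subtract each column's mean log value.
x = np.array([[4., 10.],
              [6.,  2.],
              [5.,  8.]])
clr = np.log(x) - np.log(x).mean(axis=0)
col_sums = clr.sum(axis=0)  # each column sums to zero (up to rounding)
```

Samesum-transformed columns instead all share the same non-zero sum, so distances between samples remain comparable in the same way.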
Samesum transformed data can also be used in differential abundance analysis using the limma-trend or limma-vooma methods. As usual, the assumption here is that there is a background of unchanging features that largely determine the scaling of each sample. This is usually true for RNA-Seq differential expression analysis, but seems to be questioned more often in metagenomics.
The default value of 2 for scale should be reasonable for count data, but is not variance stabilizing. limma-trend or limma-vooma will take this into account. For visualization and exploratory methods you could try a larger scale value, which will damp down variation in low abundance rows and provide better variance stabilization. The scale parameter acts similarly to a pseudocount in log transformation methods that have a pseudocount parameter (such as edgeR's cpm function).
When different samples have different library sizes, some of them may be given a scale much smaller than the target scale. When this happens, these samples may appear as outliers in data visualization, purely as an artifact of the transformation. A warning is produced if any samples are given a scale less than 0.5. Consider excluding these samples from your analysis. If all the samples need to be used, you can increase the target scale to avoid this problem. This will allow you to compare all samples, but at the cost of losing some ability to resolve fine differences between samples.
Examples
set.seed(563)
x <- matrix(rpois(15, 10), ncol=3)
samesum_log2_norm(x)
#> [,1] [,2] [,3]
#> [1,] 2.835302 2.874385 2.228180
#> [2,] 1.922011 2.775210 1.912293
#> [3,] 3.044870 2.668712 2.601092
#> [4,] 2.294345 1.780618 3.285829
#> [5,] 2.294345 2.291949 2.363480
#> attr(,"samples")
#> scale true_lib_size adjusted_lib_size
#> 1 1.792430 43 43.18113
#> 2 2.052771 49 49.45295
#> 3 2.170713 53 52.29429
#> attr(,"zero")
#> [1] 0
#> attr(,"sum")
#> [1] 12.39087
samesum_log2_cpm(x)
#> [,1] [,2] [,3]
#> [1,] 18.17639 18.21547 17.56927
#> [2,] 17.26310 18.11630 17.25338
#> [3,] 18.38596 18.00980 17.94218
#> [4,] 17.63543 17.12171 18.62692
#> [5,] 17.63543 17.63304 17.70457
#> attr(,"samples")
#> scale true_lib_size adjusted_lib_size
#> 1 1.792430 43 43.18113
#> 2 2.052771 49 49.45295
#> 3 2.170713 53 52.29429
#> attr(,"zero")
#> [1] 15.34109
#> attr(,"sum")
#> [1] 12.39087