This article puts samesum_log2_norm()
and
samesum_log2_cpm()
through their paces.
Example usage
x <- assay(airway, "counts")
lnorm <- samesum_log2_norm(x)
colSums(lnorm)
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
#> 123445.8 123445.8 123445.8 123445.8 123445.8 123445.8 123445.8
#> SRR1039521
#> 123445.8
attr(lnorm,"sum")
#> [1] 123445.8
attr(lnorm,"samples")
#> name scale true_lib_size adjusted_lib_size
#> 1 SRR1039508 2.006479 20637971 21667160
#> 2 SRR1039509 1.760615 18809481 19012176
#> 3 SRR1039512 2.424240 25348649 26178399
#> 4 SRR1039513 1.293811 15163415 13971346
#> 5 SRR1039516 2.342422 24448408 25294871
#> 6 SRR1039517 2.778465 30818215 30003524
#> 7 SRR1039520 1.861340 19126151 20099863
#> 8 SRR1039521 1.833587 21164133 19800165
plot_heatmap(
lnorm,
feature_labels=rowData(airway)$symbol,
n=50, baseline_to=0)
plot_biplot(lnorm, feature_labels=rowData(airway)$symbol)
Filtering
Some analysis methods don’t work well with features that are nearly all zero. In particular we suggest filtering before using limma. However samesum transformation works fine with very sparse data, so we suggest transforming your data before filtering it. This allows filtering to be applied fairly across samples with different scale values.
keep <- rowSums(lnorm) >= 1
table(keep)
#> keep
#> FALSE TRUE
#> 34722 28955
lnorm_filtered <- lnorm[keep,]
plot_stability(lnorm_filtered)
You can provide a larger target scale for greater variance stabilization
The default target scale parameter is 2, which may produce noisy results for features with low abundance. This scale parameter can be increased.
lnorm <- samesum_log2_norm(x, scale=10)
colSums(lnorm)
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
#> 81285.03 81285.03 81285.03 81285.03 81285.03 81285.03 81285.03
#> SRR1039521
#> 81285.03
attr(lnorm, "samples")
#> name scale true_lib_size adjusted_lib_size
#> 1 SRR1039508 10.057054 20637971 21701696
#> 2 SRR1039509 8.832103 18809481 19058426
#> 3 SRR1039512 11.910477 25348649 25701122
#> 4 SRR1039513 6.622363 15163415 14290121
#> 5 SRR1039516 11.576043 24448408 24979458
#> 6 SRR1039517 13.829760 30818215 29842661
#> 7 SRR1039520 9.229036 19126151 19914952
#> 8 SRR1039521 9.351696 21164133 20179633
keep <- rowSums(lnorm) >= 1
table(keep)
#> keep
#> FALSE TRUE
#> 40617 23060
lnorm_filtered <- lnorm[keep,]
plot_stability(lnorm_filtered)
You can provide a target sum for consistent transformation across datasets
lnorm <- samesum_log2_norm(x, sum=1e5)
colSums(lnorm)
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
#> 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05 1e+05
#> SRR1039521
#> 1e+05
attr(lnorm, "samples")
#> name scale true_lib_size adjusted_lib_size
#> 1 SRR1039508 4.711026 20637971 21695573
#> 2 SRR1039509 4.132692 18809481 19032182
#> 3 SRR1039512 5.624991 25348649 25904631
#> 4 SRR1039513 3.077556 15163415 14172993
#> 5 SRR1039516 5.449664 24448408 25097201
#> 6 SRR1039517 6.483651 30818215 29858998
#> 7 SRR1039520 4.344943 19126151 20009660
#> 8 SRR1039521 4.346977 21164133 20019024
You can also produce log2 CPM values
This is quite similar to edgeR’s cpm
function with
log=TRUE
.
lcpm <- samesum_log2_cpm(x)
Zeros will no longer be zero. This is fine, log transformation is meant to transform positive values to cover the whole real line. It is convenient sometimes to pick a log transformation with a pseudocount that transforms zeros to zero, but it is not necessary.
You can check the value that zeros are transformed to.
min(lcpm)
#> [1] -3.434502
attr(lcpm, "zero")
#> [1] -3.434502
mean(colSums(2^lcpm-2^attr(lcpm,"zero")))
#> [1] 1e+06
Adjust your filtering appropriately.
Low library size samples produce a warning
Here I have downsampled the first sample to 1% of the original.
set.seed(563)
x <- assay(airway, "counts")
x[,1] <- rbinom(nrow(x), x[,1], 0.01)
lnorm <- samesum_log2_norm(x)
#> Warning in samesum_log2_norm(x): 1 samples have been given a scale less than
#> 0.5. These may subsequently appear to be outliers as an artifact of the
#> transformation. Consider using a larger target scale or excluding these samples
#> from analysis.
You can examine the samples attribute for more details.
attr(lnorm,"samples")
#> name scale true_lib_size adjusted_lib_size
#> 1 SRR1039508 0.0103538 205717 83191.73
#> 2 SRR1039509 2.7133387 18809481 21801392.68
#> 3 SRR1039512 3.7132470 25348649 29835551.65
#> 4 SRR1039513 2.0091857 15163415 16143596.90
#> 5 SRR1039516 3.5924834 24448408 28865228.65
#> 6 SRR1039517 4.2665191 30818215 34281035.59
#> 7 SRR1039520 2.8609935 19126151 22987783.61
#> 8 SRR1039521 2.8410567 21164133 22827594.11
Sample 1 is indeed an outlier now, and this is purely an artifact of the transformation:
plot_heatmap(lnorm, feature_labels=rowData(airway)$symbol, n=50, baseline_to=0)
Increasing the scale
parameter removes the warning and
fixes the problem. The cost is that we squash variability in genes with
low expression levels. It’s like we’ve reduced the “resolution” of the
data.
lnorm <- samesum_log2_norm(x, scale=200)
plot_heatmap(lnorm, feature_labels=rowData(airway)$symbol, n=50, baseline_to=0)
You will also get a warning if a sample is all zero.
x[,1] <- 0
lnorm <- samesum_log2_norm(x)
#> Warning in samesum_log2_norm(x): 1 samples are entirely zero.
Or sometimes a sample has very low library size or complexity, and the scale underflows to zero.
x[1,1] <- 1
lnorm <- samesum_log2_norm(x)
#> Warning in samesum_log2_norm(x): 1 samples where the scale has underflowed and
#> been given a value of zero. Please remove these samples.
Sparse matrices can be used
samesum_log2_norm
transforms zeros to zeros, so the
matrix remains sparse.
x <- Matrix( assay(airway, "counts") )
lnorm <- samesum_log2_norm(x)
class(lnorm)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"