Skip to contents

This article puts samesum_log2_norm() and samesum_log2_cpm() through their paces.

Example usage

x <- assay(airway, "counts")

lnorm <- samesum_log2_norm(x)

colSums(lnorm)
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
#>   123445.8   123445.8   123445.8   123445.8   123445.8   123445.8   123445.8 
#> SRR1039521 
#>   123445.8
attr(lnorm,"sum")
#> [1] 123445.8
attr(lnorm,"samples")
#>         name    scale true_lib_size adjusted_lib_size
#> 1 SRR1039508 2.006479      20637971          21667160
#> 2 SRR1039509 1.760615      18809481          19012176
#> 3 SRR1039512 2.424240      25348649          26178399
#> 4 SRR1039513 1.293811      15163415          13971346
#> 5 SRR1039516 2.342422      24448408          25294871
#> 6 SRR1039517 2.778465      30818215          30003524
#> 7 SRR1039520 1.861340      19126151          20099863
#> 8 SRR1039521 1.833587      21164133          19800165
plot_heatmap(
  lnorm, 
  feature_labels=rowData(airway)$symbol, 
  n=50, baseline_to=0)

plot_biplot(lnorm, feature_labels=rowData(airway)$symbol)

Filtering

Some analysis methods don’t work well with features that are nearly all zero. In particular we suggest filtering before using limma. However samesum transformation works fine with very sparse data, so we suggest transforming your data before filtering it. This allows filtering to be applied fairly across samples with different scale values.

keep <- rowSums(lnorm) >= 1
table(keep)
#> keep
#> FALSE  TRUE 
#> 34722 28955
lnorm_filtered <- lnorm[keep,]
plot_stability(lnorm_filtered)

You can provide a larger target scale for greater variance stabilization

The default target scale parameter is 2, which may produce noisy results for features with low abundance. This scale parameter can be increased.

lnorm <- samesum_log2_norm(x, scale=10)

colSums(lnorm)
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
#>   81285.03   81285.03   81285.03   81285.03   81285.03   81285.03   81285.03 
#> SRR1039521 
#>   81285.03
attr(lnorm, "samples")
#>         name     scale true_lib_size adjusted_lib_size
#> 1 SRR1039508 10.057054      20637971          21701696
#> 2 SRR1039509  8.832103      18809481          19058426
#> 3 SRR1039512 11.910477      25348649          25701122
#> 4 SRR1039513  6.622363      15163415          14290121
#> 5 SRR1039516 11.576043      24448408          24979458
#> 6 SRR1039517 13.829760      30818215          29842661
#> 7 SRR1039520  9.229036      19126151          19914952
#> 8 SRR1039521  9.351696      21164133          20179633

keep <- rowSums(lnorm) >= 1
table(keep)
#> keep
#> FALSE  TRUE 
#> 40617 23060
lnorm_filtered <- lnorm[keep,]
plot_stability(lnorm_filtered)

You can provide a target sum for consistent transformation across datasets

lnorm <- samesum_log2_norm(x, sum=1e5)

colSums(lnorm)
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
#>      1e+05      1e+05      1e+05      1e+05      1e+05      1e+05      1e+05 
#> SRR1039521 
#>      1e+05
attr(lnorm, "samples")
#>         name    scale true_lib_size adjusted_lib_size
#> 1 SRR1039508 4.711026      20637971          21695573
#> 2 SRR1039509 4.132692      18809481          19032182
#> 3 SRR1039512 5.624991      25348649          25904631
#> 4 SRR1039513 3.077556      15163415          14172993
#> 5 SRR1039516 5.449664      24448408          25097201
#> 6 SRR1039517 6.483651      30818215          29858998
#> 7 SRR1039520 4.344943      19126151          20009660
#> 8 SRR1039521 4.346977      21164133          20019024

You can also produce log2 CPM values

This is quite similar to edgeR’s cpm function with log=TRUE.

lcpm <- samesum_log2_cpm(x)

Zeros will no longer be zero. This is fine, log transformation is meant to transform positive values to cover the whole real line. It is convenient sometimes to pick a log transformation with a pseudocount that transforms zeros to zero, but it is not necessary.

You can check the value that zeros are transformed to.

min(lcpm)
#> [1] -3.434502
attr(lcpm, "zero")
#> [1] -3.434502

mean(colSums(2^lcpm-2^attr(lcpm,"zero")))
#> [1] 1e+06

Adjust your filtering appropriately.

keep <- rowMeans(lcpm) >= attr(lcpm,"zero")+1
table(keep)
#> keep
#> FALSE  TRUE 
#> 43662 20015

Low library size samples produce a warning

Here I have downsampled the first sample to 1% of the original.

set.seed(563)
x <- assay(airway, "counts")
x[,1] <- rbinom(nrow(x), x[,1], 0.01) 

lnorm <- samesum_log2_norm(x)
#> Warning in samesum_log2_norm(x): 1 samples have been given a scale less than
#> 0.5. These may subsequently appear to be outliers as an artifact of the
#> transformation. Consider using a larger target scale or excluding these samples
#> from analysis.

You can examine the samples attribute for more details.

attr(lnorm,"samples")
#>         name     scale true_lib_size adjusted_lib_size
#> 1 SRR1039508 0.0103538        205717          83191.73
#> 2 SRR1039509 2.7133387      18809481       21801392.68
#> 3 SRR1039512 3.7132470      25348649       29835551.65
#> 4 SRR1039513 2.0091857      15163415       16143596.90
#> 5 SRR1039516 3.5924834      24448408       28865228.65
#> 6 SRR1039517 4.2665191      30818215       34281035.59
#> 7 SRR1039520 2.8609935      19126151       22987783.61
#> 8 SRR1039521 2.8410567      21164133       22827594.11

Sample 1 is indeed an outlier now, and this is purely an artifact of the transformation:

plot_heatmap(lnorm, feature_labels=rowData(airway)$symbol, n=50, baseline_to=0)

Increasing the scale parameter removes the warning and fixes the problem. The cost is that we squash variability in genes with low expression levels. It’s like we’ve reduced the “resolution” of the data.

lnorm <- samesum_log2_norm(x, scale=200)

plot_heatmap(lnorm, feature_labels=rowData(airway)$symbol, n=50, baseline_to=0)

You will also get a warning if a sample is all zero.

x[,1] <- 0 
lnorm <- samesum_log2_norm(x)
#> Warning in samesum_log2_norm(x): 1 samples are entirely zero.

Or sometimes a sample has very low library size or complexity, and the scale underflows to zero.

x[1,1] <- 1
lnorm <- samesum_log2_norm(x)
#> Warning in samesum_log2_norm(x): 1 samples where the scale has underflowed and
#> been given a value of zero. Please remove these samples.

Sparse matrices can be used

samesum_log2_norm transforms zeros to zeros, so the matrix remains sparse.

x <- Matrix( assay(airway, "counts") )
lnorm <- samesum_log2_norm(x)
class(lnorm)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"