Perform a Variance Stabilizing Transformation (VST) of a matrix of count data.
vst(x, method = "anscombe.nb", lib.size = NULL, cpm = FALSE, dispersion = NULL, design = NULL)
x | A matrix of counts. Rows are genes (or other features), and columns are samples. |
---|---|
method | VST to use, see details. |
lib.size | Optional, estimated if not given. |
cpm | Should the output be in log2 Counts Per Million, rather than simply log2. |
dispersion | Optional, estimated if not given. Dispersion parameter of the negative binomial distribution of the data. |
design | Optional. If dispersion isn't given, a design matrix to use when estimating dispersion. |
A transformed matrix.
Several methods are available. "anscombe.nb" is recommended.
Methods:
"anscombe.nb": Default, asinh(sqrt((x+3/8)/(1/dispersion-3/4))). Anscombe's VST for the negative binomial distribution.
"anscombe.nb.simple": log(x+0.5/dispersion), a simplified VST also given by Anscombe.
"anscombe.poisson": sqrt(x+3/8). Anscombe's VST for the Poisson distribution. Only appropriate if you know there is no biological noise.
"naive.nb": asinh(sqrt(x/dispersion)). Resultant variance is slightly inflated at low counts.
"naive.poisson": sqrt(x). Resultant variance is slightly inflated at low counts.
Dispersion:
edgeR's estimate of the common dispersion of the count matrix would be a reasonable choice of dispersion. However Poisson noise in RNA-Seq data may be over-dispersed, in which case a slightly smaller dispersion may work better. I recommend not providing a dispersion and letting varistran pick an appropriate value.
If "dispersion" is not given, it is chosen so as to minimize sd(residual s.d.)/mean(residual s.d.). Residuals are calculated from the linear model specified by the parameter "design".
If "design" also isn't given, a linear model containing only an intercept term is used. This may lead to an over-estimate of the dispersion, so do give a design if possible.
Anscombe, F.J. (1948) "The transformation of Poisson, binomial, and negative-binomial data", Biometrika 35 (3-4): 246-254
# Generate some random data. means <- runif(100,min=0,max=1000) counts <- matrix(rnbinom(1000, size=1/0.01, mu=rep(means,10)), ncol=10) y <- varistran::vst(counts)#> Dispersion estimated as 0.00779547#> count transformed_count twofold_step #> 1 0 5.151076 NA #> 2 1 5.293759 NA #> 3 2 5.387236 0.09347714 #> 4 4 5.526106 0.13886989 #> 5 8 5.726246 0.20014036 #> 6 16 6.007677 0.28143123 #> 7 32 6.393095 0.38541768 #> 8 64 6.902603 0.50950760 #> 9 128 7.544541 0.64193818 #> 10 256 8.308032 0.76349117 #> 11 512 9.165837 0.85780476 #> 12 1024 10.086408 0.92057133 #> 13 2048 11.044161 0.95775269 #> 14 4096 12.022332 0.97817077