These slides: https://tinyurl.com/y3nmue36
RNA-Seq as a typical bioinformatics data type
False Discovery Rates
The wider debate around p-values
False Coverage-statement Rates
topconfects package
Biological samples containing mRNA molecules
 ↓ RNA to DNA reverse transcription
 ↓ Fragmentation
 ↓ High-throughput shotgun sequencing (Illumina)
Millions of short DNA sequences per sample, called “reads”
 ↓ “Align” reads to reference genome (approximate string search)
 ↓ Count number of reads associated with each gene
Matrix of read counts
~20,000 genes.
Often only 2 or 3 biological samples per experimental group.
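The end product above can be pictured as a small toy matrix. This is a sketch only: gene names and counts here are invented for illustration.

```python
# Hypothetical genes-by-samples matrix of read counts
# (real data: ~20,000 rows, often only 2-3 samples per group).
counts = {
    #           groupA_1  groupA_2  groupB_1  groupB_2
    "GeneX": [     1042,      987,       15,       22],
    "GeneY": [       12,        9,       11,       14],
    "GeneZ": [        0,        3,      514,      488],
}

for gene, row in counts.items():
    print(gene, row)
```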
Which genes differ in expression level between two groups?
Typically done using limma Bioconductor package.
Experimental design may be complicated, so allow any linear model.
Many people’s first encounter with linear models!
limma’s novel feature: empirical Bayes moderation, sharing information about variance between genes (similar count-based packages: edgeR, DESeq2).
We have ~20,000 p-values, one for each gene.
We want to select the significantly differentially expressed genes.
If we select genes with \(p \leq 0.05\), we will get ~1000 “discoveries” purely by chance.
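A quick simulation of this point. A sketch only, assuming no gene is truly differentially expressed, so all p-values are uniform on [0, 1]:

```python
import random

random.seed(1)
n_genes = 20_000
alpha = 0.05

# Null p-values: uniform on [0, 1] when no effect exists.
p_values = [random.random() for _ in range(n_genes)]

# "Discoveries" made purely by chance at p <= 0.05.
false_discoveries = sum(p <= alpha for p in p_values)
print(false_discoveries)  # close to n_genes * alpha = 1000
```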
Assume the set of true discoveries to be made is much smaller than \(n_\text{gene}\).
For p-value cutoff \(\alpha\) and total discoveries \(k\), the FDR \(q\) will be approximately
\[ q = { n_\text{gene}\alpha \over k } \]
So to achieve a specified FDR \(q\), we need
\[ \alpha = { k \over n_\text{gene} } q \]
The larger the set of discoveries \(k\), the larger the \(\alpha\). Weirdly circular!
Greedily choose the largest \(\alpha\) possible.
“For whoever has, to him more will be given, and he will have abundance; but whoever does not have, even what he has will be taken away from him.”
Matthew 13:12
Benjamini and Hochberg proved this works, assuming the tests are independent or positively correlated.
In practice, software reports “FDR-adjusted p-values” that let the reader select their desired FDR.
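A minimal sketch of this Benjamini-Hochberg adjustment (the computation performed by R’s `p.adjust(method="BH")`), in Python for illustration:

```python
def bh_adjust(p_values):
    """FDR-adjusted p-values by the Benjamini-Hochberg procedure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, taking the cumulative
    # minimum of p * n / rank, so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

ps = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print([round(q, 3) for q in bh_adjust(ps)])
```

Selecting genes with adjusted p-value ≤ q then controls the FDR at q.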
[Figure: example RNA-Seq volcano plot; red points significant at FDR 0.05]
\(n_\text{sample}\)-poor, \(n_\text{gene}\)-rich situation, struggling to make any discoveries.
It has made sense to sort results by p-value or plot p-values.
Distributional assumptions are needed; rank-based methods can’t produce small enough p-values!
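A concrete illustration of why rank-based methods fail here: with 3 samples per group, an exact permutation test over group labels has only a handful of distinct relabelings, so its p-values have a hard floor.

```python
from math import comb

# With 3 vs 3 samples, there are C(6, 3) = 20 ways to assign group labels.
n_per_group = 3
n_labelings = comb(2 * n_per_group, n_per_group)

# Even the most extreme possible data gets a one-sided p-value of 1/20.
min_p_one_sided = 1 / n_labelings
print(n_labelings, min_p_one_sided)  # 20 0.05
```

A floor of 0.05 is hopeless after multiple-testing correction across ~20,000 genes, hence the reliance on parametric models such as limma’s.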
What if we do make a lot of discoveries?
Many ’omics datasets follow this \(n_\text{sample}\)-poor, \(n_\text{feature}\)-rich pattern:
Analysis almost always focused on p-values.
Ioannidis estimates that the Positive Predictive Value (PPV = 1 − FDR) in most fields is less than 50%.
Biases:
American Statistical Association statement on p-values, March 2016
Special issue of the ASA’s journal, March 2019
800 scientists signed a statement published in Nature, March 2019
Mostly reasonable, but…
“How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see?”
“This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval.”
Gelman broadly liked it, Ioannidis disliked the undercurrent
Comment: in bioinformatics the best-looking noise can look very convincing.
An “insignificant” result may be due to no effect, a lack of power, or luck.
A “significant” result may be due to a large effect, a powerful experiment, selective reporting, or luck.
Dichotomization leads to apparent paradoxes