Selecting the most interesting results to report

Outline

These slides: https://tinyurl.com/y3nmue36

RNA-Seq as a typical bioinformatics data type

False Discovery Rates

The wider debate around p-values

False Coverage-statement Rates

topconfects package

RNA-Seq introduction

Biological samples containing mRNA molecules

↓ RNA to DNA reverse transcription

↓ Fragmentation

↓ High-throughput shotgun sequencing (Illumina)

Millions of short DNA sequences per sample, called “reads”

↓ “Align” reads to reference genome (approximate string search)

↓ Count number of reads associated with each gene

Matrix of read counts

RNA-Seq introduction

~20,000 genes.

Often only 2 or 3 biological samples per experimental group.

Which genes differ in expression level between two groups?

log transform read counts
perform 20,000 t-tests

RNA-Seq introduction

Typically done using limma Bioconductor package.

Experimental design may be complicated, so allow any linear model.
Many people’s first encounter with linear models!

limma’s novel feature:

genes have similar but not identical variability
use an “emprical Bayes” prior for the residual variance,
as if we have some extra “prior” residual degrees of freedom

Smyth, G. K. (2004). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, 3(1):1–25.

(More advanced methods use negative-binomial GLM on counts directly: edgeR, DESeq2)

RNA-Seq introduction

We have ~20,000 p-values, one for each gene.

We want to select the significantly differentially expressed genes.

If we select genes with \(p \leq 0.05\), we will get ~1000 “discoveries” purely by chance.

False Discovery Rate (FDR)

Assume the set of true discoveries to be made is much smaller than \(n_\text{gene}\).

For p-value cutoff \(\alpha\) and total discoveries \(k\), the FDR \(q\) will be approximately

\[ q = { n_\text{gene}\alpha \over k } \]

False Discovery Rate (FDR)

\[ q = { n_\text{gene}\alpha \over k } \]

So to achieve a specified FDR \(q\), we need

\[ \alpha = { k \over n_\text{gene} } q \]

The larger the set of discoveries \(k\), the larger the \(\alpha\). Weirdly circular!

Greedily choose the largest \(\alpha\) possible.

“For whoever has, to him more will be given, and he will have abundance; but whoever does not have, even what he has will be taken away from him.”
Mathew 13:12

False Discovery Rate (FDR)

Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.

Benjamini and Hochberg proved this works, assuming the tests are independent or positively correlated.

In practice, provide “FDR-adjusted p-values” that let the reader select their desired FDR.

sort results by p-value
reader takes the head of the sorted table, down to their desired FDR

Path dependence

Example RNA-Seq volcano plot
Red points significant
at FDR 0.05

\(n_\text{sample}\)-poor, \(n_\text{gene}\)-rich situation, struggling to make any discoveries.

It has made sense to sort results by p-value or plot p-values.

Need distributional assumptions, rank-based methods can’t produce small enough p-values!

What if we do make a lot of discoveries?

Choose a smaller FDR
Ad-hoc combination of p-value and fold-change cutoff

Path dependence

Example GWAS Manhattan plot

(from https://en.wikipedia.org/wiki/Genome-wide_association_study)

Many ’omics datasets follow this \(n_\text{sample}\)-poor, \(n_\text{feature}\)-rich pattern:

microarrays
Genome Wide Association Study
(which mutations cause a disease?)
protein mass-spectrometry
16S sequencing
(bacterial species ecology)
Bisulfite sequencing, Hi-C, ATAC-seq, ChIP-seq, …
(DNA epigenetics, physical layout, openness to transcription, associated proteins)

Analysis almost always focussed on p-values.

The FDR of science

Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8):e124.

Ioannidis estimates that the Positive Predictive Value (PPV=1-FDR) of most fields is less then 50%.

Tests performed with \(\alpha=0.05\)
Mostly tests where pre-study odds of a true effect are low
Mostly under powered, false-negative probability \(\beta\) high

Biasses:

Hypotheses tested multiple times by different labs
Flexibility in analysis
Tests selectively reported

Ongoing controversy about p-values

American Statistical Association statement on p-values, March 2016

what p-values aren’t, what they shouldn’t be used for

Special issue of the ASA’s journal, March 2019

many interesting ideas, no overall consensus

Meanwhile,
“Scientists rise up against statistical significance”

800 scientists signed a statement published in Nature, March 2019

Mostly reasonable, but…

“How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see?”

“This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval.”

Gelman broadly liked it, Ioannidis disliked the undercurrent

Comment: in bioinformatics the best looking noise can look very convincing.

Some specific problems with p-values

“Insignificant” result may owe to no effect, or a lack of power, or luck
“Significant” result may owe to a large effect, or a powerful experiment, or selective reporting, or luck
Dichotomization leads to apprarent paradoxes

ASA statement on p-values, March 2016

“4. Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively.”

“5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.”

Confidence Intervals are better

\(p \leq 0.05\) iff 95% Confidence Interval doesn’t pass through zero.

Confidence Intervals are better

Cumming, Geoff. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Taylor & Francis Group, New York and London.

“New Statistics” approach:

Stop dichotomising
Instead think in terms of measurement accuracy
Report everything, combine using meta-analysis

But giving up selection isn’t an option for us!

False Coverage-statement Rates

Benjamini, Y. and Yekutieli, D. (2005). False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters. Journal of the American Statistical Association, 100(469):71– 81.

After selecting \(k\) of \(n_\text{gene}\) results,

for an FCR of \(q\),

provide \(1-{{k \over n_\text{gene}} q}\) confidence intervals.

Any selection rule may be used!

Caveat: point estimates remain biassed

Back to RNA-Seq

I want to move us away from p-values and use confidence bounds.

Sometimes we are not \(n_\text{sample}\) poor. Sometimes we have too many discoveries!

larger scale RNA-Seq studies
single cell RNA-Seq

I want to rank the results and let the reader select some to follow up.

\(n_\text{sample}\)-large example

TCGA breast cancer data.

97 RNA-seq tumor-normal pairs.

Shown is a random sample of 20 genes.

Highly heteroscedastic.

Too many dicoveries:

13,784 DE genes at 5% FDR

Do we care about:

smallest p-values?

largest signal-to-noise ratios?

largest log2 fold changes (LFC)?

Top table by FDR adjusted \(p\)-value

Focusses on high signal-noise ratio.

Top enriched gene-sets to do with cell division.

To follow up a reasonable number of genes, choose a tiny FDR?

TREAT: using an interval null hypothesis

McCarthy, D. J. and Smyth, G. K. (2009a). Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25(6):765–771.

Null hypothesis for each gene that absolute LFC is smaller than threshold \(e\)

Specify threshold to be a biologically significant amount
Table sorted by adjusted \(p\)-values, again allows reader to choose FDR

But

FDR of 5% is fine, don’t need to give me this choice!
Not clear how to choose threshold

(Could also use worst case p-value from a t-test over null hypotheses within the interval. Compared to this, TREAT shrinks p-value by up to half.)

Introducting Topconfects

Harrison, P. F., Pattison, A. D., Powell, D. R., and Beilharz, T. H. (March 2019). Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list. Genome Biology, 20(1):67.

TOP CONFident efFECT Sizes

An alternative presentation of TREAT.

Fixed FDR
Ranking by “confect” values, allow reader to select threshold \(e\)

Bonus: “confects” also work as FCR adjusted confidence bounds.

Topconfects method

Let \(S(e)\) be the set of genes selected by TREAT at threshold \(e\), for some fixed FDR (eg 5%)

These sets nest:
If \(e_1 > e_2\) then \(S(e_1) \subseteq S(e_2)\)

Call the largest \(e\) for which a gene is a member of \(S(e)\) its “confect”, with appropriate sign

Top table by confect at 5% FDR

Top enriched gene-sets to do with blood vessels, cell-cell signalling, cell adhesion.

Topconfects for \(n_\text{sample}\)-small

Sub-sample of 10 patients.

Confects much smaller than effects indicates lack of power.

Bioconductor package

# ... filter out low expression genes ...
# ... transform to log2 Reads Per Million ...

# Fit linear models with limma
library(limma)
fit <- lmFit(log2_rpm_matrix, design_matrix)
# ... depending on design matrix, might also need contrasts.fit ...

# Calculate confects
library(topconfects)
result <- limma_confects(fit, coef="treatmentTRUE", trend=TRUE)

edger_confects and deseq2_confects also available, but slower!

normal_confects if you have a colleciton of estimates with standard errors.

Discussion

FCR adjustment provides a very flexible fix for selective reporting of CIs.

Topconfects provides FCR control when selecting the head of a ranked list.

Topconfects tends to be conservative. Bayesian credible intervals method may be more efficient. If the prior distribution includes everything the selector knows, it will be immune from selection.

FCR concept gives a way to test methods.

Discussion

Methods we use are cultural, path dependent:

We started \(n_\text{sample}\)-small, concentrated on significance and controlling FDR
GWAS uses a fixed p-value cutoff of 5e-8 rather than FDR!
Problems in mass-spectrometry, ecology are similar, but different methods used

We may be overlooking useful \(n_\text{sample}\)-large methods:

sandwich estimator
bootstrap

Discussion

Why have current methods worked?

Significance implies more than \(\neq 0\)
also implies the sign of the effect is known
Some effect sizes line up with p-values
Cohen’s \(d\), \(R^2\), alignment scores
Some things that look like p-values are good effect sizes
gene set enrichment

We now have a new degree of freedom: what effect size?

For interactions, ratio of ratios or difference of proportions?
For gene set enrichment?

Acknowledgements

RNA Systems Biology Lab

Traude Beilharz

Andrew Pattison

Monash Bioinformatics Platform

David Powell

Extra slides

Simulation

\(df_\mathrm{within}=2\), \(s_\mathrm{within}=0.75\), \(df_\mathrm{between}=3\), \(s_\mathrm{between}=0.5\)

Outline

RNA-Seq introduction

RNA-Seq introduction

RNA-Seq introduction

RNA-Seq introduction

False Discovery Rate (FDR)

False Discovery Rate (FDR)

False Discovery Rate (FDR)

Path dependence

Path dependence

The FDR of science

Ongoing controversy about p-values

Meanwhile,“Scientists rise up against statistical significance”

Some specific problems with p-values

Confidence Intervals are better

Confidence Intervals are better

False Coverage-statement Rates

Back to RNA-Seq

\(n_\text{sample}\)-large example

Top table by FDR adjusted \(p\)-value

TREAT: using an interval null hypothesis

Introducting Topconfects

Topconfects method

Top table by confect at 5% FDR

Topconfects for \(n_\text{sample}\)-small

Bioconductor package

Discussion

Discussion

Discussion

Acknowledgements

Extra slides

Simulation

Meanwhile,
“Scientists rise up against statistical significance”