2020-03-10

Dissatisfaction with p-value usage in science

What goes wrong with p-values?

Selective reporting invalidates Type I error control.

Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8):e124.

  • Argues that amongst significant results reported, a majority are false.
    (Note: this is about Positive Predictive Value, PPV=1-FDR, not Type I error rate.)


People read more into a p-value than they should:

  • p-values are not an effect size.
  • Dichotomization (p<0.05?) leads to apparent paradoxes.
  • “Insignificant” taken as evidence of no effect, but may be lack of power or bad luck.
  • “Significant” taken to mean large effect, but may be a powerful experiment or luck.

Confidence Intervals are better

  • Interval given in meaningful units, can judge importance.
  • The cause of dichotomization “paradoxes” is clear.
  • Rejects more hypotheses — reject everything outside the interval!
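For instance, a sketch with invented numbers (a plain z-interval; the estimate and standard error are made up for illustration):

```python
from statistics import NormalDist

# A 95% confidence interval rejects, at the 5% level, the null hypothesis
# mu = x for *every* x outside the interval, not just the single null mu = 0.
estimate, se = 1.2, 0.4          # invented estimate and standard error
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
lo, hi = estimate - z * se, estimate + z * se
```

Here the interval excludes zero, so mu = 0 is rejected, along with every other value outside (lo, hi).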

Confidence Intervals are better

Cumming, Geoff. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Taylor & Francis Group, New York and London.

“New Statistics” approach:

  • Stop dichotomising.
  • Instead think in terms of measurement accuracy.
  • Report everything, combine using meta-analysis.



Good prescription for clinical trials or psychology experiments.

Not so great for bioinformatics.

  • We often consider thousands of p-values; selective reporting is the whole point.

False Discovery Rate

Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.

To select a set of hypotheses to reject, \(S\), from \(n\), with an FDR of \(q\),

choose the largest set satisfying:

\[ S = \left\{ i : p_i \leq {|S| \over n} q \right\} \]


Sets for smaller FDR nest within sets for larger FDR, so can be presented in a way that leaves choice of FDR to reader (idea attributed to Gordon Smyth):

  • Calculate an “adjusted p-value”.
  • Reader can read down a sorted list to desired cutoff value.
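The adjustment can be sketched in a few lines (an illustrative Python translation of R's `p.adjust(method="fdr")`; `bh_adjust` is a hypothetical helper name):

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, matching R's p.adjust(method="fdr").

    Each p-value is scaled by n / rank, then a cumulative minimum from the
    largest p-value downwards enforces monotonicity, capped at 1.
    """
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)                         # ascending p-values
    ranked = p[order] * n / np.arange(1, n + 1)
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# The reader rejects every hypothesis with adjusted p-value <= their chosen FDR.
print(bh_adjust([0.1, 0.125, 0.15, 0.4, 0.5]))   # matches R's p.adjust
```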

What about Confidence Intervals?

False Coverage-statement Rate

Benjamini, Y. and Yekutieli, D. (2005). False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters. Journal of the American Statistical Association, 100(469):71– 81.


After selecting a subset \(S\) out of \(n\) parameters, for an FCR of \(q\),

provide intervals with coverage probability \(1-{{|S| \over n} q}\).
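As a sketch of this prescription (normal errors assumed; `fcr_interval` is a hypothetical helper, not part of any package):

```python
from statistics import NormalDist

def fcr_interval(estimate, se, n_tested, n_selected, q=0.05):
    """FCR-adjusted confidence interval for one selected parameter,
    assuming a normally distributed estimate.

    Coverage is raised from 1 - q to 1 - (n_selected / n_tested) * q, so on
    average at most a fraction q of the reported intervals miss their target.
    """
    alpha = (n_selected / n_tested) * q
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se

# 50 parameters selected out of 10,000 tested: coverage is
# 1 - (50/10000)*0.05 = 0.99975 per interval, wider than an unadjusted 95% CI.
lo, hi = fcr_interval(2.0, se=0.5, n_tested=10_000, n_selected=50)
```

The fewer parameters selected, the higher the required coverage and the wider each reported interval.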


Very broadly applicable:

  • Example of CIs reported in an abstract.


Caveat:

  • Point estimates remain biased by selection.

RNA-seq example with many samples

TCGA breast cancer data.

97 RNA-seq tumor-normal pairs.


Shown is a random sample of 20 genes.

Highly heteroscedastic.






Ranking by p-value

Using this table, read down until the desired FDR is reached.

eg 13,784 DE genes at 5% FDR.

Too many discoveries!



What would choosing a stricter FDR actually do?

Sorting by p-value focusses on a high signal-to-noise ratio.

Like using Cohen’s d effect size, but not comparable between experiments.
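A small illustration (invented estimates, z-tests for simplicity): ranking by two-sided p-value is exactly ranking by |estimate|/se, a Cohen's d style quantity, so a tiny but precisely measured effect outranks a large noisy one.

```python
import numpy as np
from statistics import NormalDist

est = np.array([3.0, 0.3, 1.0, -2.0])   # made-up effect estimates
se = np.array([2.0, 0.1, 0.9, 1.5])     # and their standard errors

z = np.abs(est) / se                     # standardized effects: 1.5, 3.0, ...
p = np.array([2 * (1 - NormalDist().cdf(zi)) for zi in z])

# p-value order is the standardized-effect order, not the effect-size order:
# the smallest p belongs to est = 0.3 (z = 3.0), not to est = 3.0 (z = 1.5).
order_by_p = np.argsort(p)
```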

TREAT solves the problem of too many results

McCarthy, D. J. and Smyth, G. K. (2009). Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25(6):765–771.

The null hypothesis for each gene is that the absolute LFC is smaller than a threshold \(e\).

  • Specify threshold to be a biologically significant amount.

  • Table sorted by adjusted \(p\)-values, again allows reader to choose FDR.


But:

  • FDR of 5% is fine, don’t need to give me this choice!

  • Not clear how to choose threshold.


(Could also use the worst-case p-value from a t-test over the null hypotheses within the interval. Compared to this, TREAT shrinks the p-value by up to half.)

Topconfects improves TREAT usability

TOP CONFident efFECT Sizes

Harrison, P. F., Pattison, A. D., Powell, D. R., and Beilharz, T. H. (2019). Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list. Genome Biology, 20(1):67.


An alternative presentation of TREAT.

  • Fixed FDR.

  • Table ranked by “confect” values, allows reader to select threshold \(e\).

  • Once a threshold is chosen, confect values are valid FCR-adjusted confidence bounds (slightly conservative).

Topconfects method

Let \(S(e)\) be the set of genes selected by TREAT at threshold \(e\), for some fixed FDR (eg 5%).

These sets nest:

  • If \(e_1 > e_2\) then \(S(e_1) \subseteq S(e_2)\).


Try \(e=0, 0.01, 0.02, 0.03, \dots\)

Call the largest \(e\) for which a gene is a member of \(S(e)\) its “confect”, with appropriate sign.
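A minimal sketch of this search (Python; it uses a worst-case normal-tail p-value as a stand-in for the TREAT test, so the resulting confects are slightly smaller than topconfects would report):

```python
import numpy as np
from statistics import NormalDist

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (as in R's p.adjust)."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    adj = np.empty(n)
    adj[order] = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    return adj

def treatlike_p(est, se, e):
    """Worst-case p-value for H0: |true effect| <= e, with normal errors.
    (A simple stand-in for TREAT; TREAT's p-values are up to half this size.)"""
    nd, a = NormalDist(), abs(est)
    return (1 - nd.cdf((a - e) / se)) + nd.cdf(-(a + e) / se)

def confects(est, se, fdr=0.05, step=0.01):
    """Largest threshold e at which each gene is still selected at the given
    FDR; NaN if never selected. Sign is taken from the estimate."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    confect = np.full(len(est), np.nan)
    e = 0.0
    while True:
        p = [treatlike_p(x, s, e) for x, s in zip(est, se)]
        selected = bh_adjust(p) <= fdr
        if not selected.any():
            return confect * np.sign(est)
        confect[selected] = e       # record the latest e at which selected
        e += step
```

With the example estimates used later in these slides (2.7, 2.6, 2.5, 2.3, 1.5, 0.5, all with standard error 1), four of the six genes receive a confect and the top three share a run of equal values.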

Topconfects at 5% FDR

Top ranked genes have large LFCs.

Changes are less consistent than top-ranked by p-value, but confidently large on average.


Had we sorted by p-value, a common default in software, we might have missed some of the largest effects.

Topconfects at 5% FDR, small number of samples

Sub-sample of 10 patients.

Confects much smaller than effects indicate a lack of power.

Using topconfects

BiocManager::install("topconfects")

  • As a further step in a limma analysis:

fit <- lmFit(y, design)
fit <- contrasts.fit(fit, contrasts)
limma_confects(fit, "contrast_to_test", fdr=0.05)

  • Also has (slower) functions to follow edgeR or DESeq2 analyses.

  • Normal- or \(t\)-distributed errors:

normal_confects(estimates, standard_errors, df, fdr=0.05)

  • Any TREAT-like test:

nest_confects(pvalue_function, n, fdr=0.05)

FCR with simulated data

Select a LFC threshold, obtain a set. What FDR and FCR are achieved?


(15,000 simulated genes, scaled inverse chi-square noise, \(df_\mathrm{within}=2\), \(s_\mathrm{within}=0.75\), t-distributed LFCs, \(df_\mathrm{between}=3\), \(s_\mathrm{between}=0.5\))

FDR: Proportion of selected genes actually below the threshold.
(Exactly the same as for the TREAT test.)

FCR: Proportion of selected genes outside the confidence bound.

The FCR is below the allowed level: confects are conservative.

Runs of confect values

Think about what happens as we dial up the TREAT threshold.

Some gene’s p-value is now above the p-value threshold.
→ The proportion of significant genes goes down.
→ p-value threshold for other genes becomes stricter.

library(topconfects)
normal_confects(effect=c(2.7, 2.6, 2.5, 2.3, 1.5, 0.5), se=1, fdr=0.05)
## $table
##  rank index confect effect
##  1    1     0.517   2.7   
##  2    2     0.517   2.6   
##  3    3     0.517   2.5   
##  4    4     0.420   2.3   
##  5    5        NA   1.5   
##  6    6        NA   0.5   
## 4 of 6 non-zero effect size at FDR 0.05

Several genes can drop out simultaneously.

Runs of FDR adjusted p-values

Think about what happens as we dial down the allowed FDR.

Some gene’s p-value is now above the p-value threshold.
→ The proportion of significant genes goes down.
→ p-value threshold for other genes becomes stricter.

p.adjust(c(0.1, 0.125, 0.15, 0.4, 0.5), method="fdr")
## [1] 0.25 0.25 0.25 0.50 0.50

Several genes can drop out simultaneously.

Thinking in terms of measurement




What is the null hypothesis?

What are we measuring?

How accurately was it measured?

Example: gene set enrichment

Tumor-normal LFC as a vector in gene-space.


For each gene-set:

Measure the length of the projection onto a centered indicator variable for the gene set. (Centering makes this “competitive” enrichment.)

Projection is linear, so each patient gives an unbiased estimate of the population average projection.

Calculate mean and standard error of mean, using all patients.
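The measurement can be sketched with toy data (Python; the gene-set membership and effect sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: per-patient tumor-normal LFCs (patients x genes), and a gene set.
n_patients, n_genes = 8, 100
lfc = rng.normal(size=(n_patients, n_genes))
in_set = np.zeros(n_genes, dtype=bool)
in_set[:10] = True
lfc[:, in_set] += 1.0            # this set really is up-regulated

# Centered indicator, scaled to unit length ("competitive" enrichment).
v = in_set - in_set.mean()
v = v / np.linalg.norm(v)

# Projection is linear, so each patient's dot product is an unbiased score.
scores = lfc @ v
mean = scores.mean()
se = scores.std(ddof=1) / np.sqrt(n_patients)
```

The key point is that `mean` and `se` treat patients, not genes, as the sampling unit.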



Having cake: Choose a measurement that selects results most interesting to me.

Eating it too: Assess the accuracy of measurement with patients as the sampling unit, not genes!

If a “p-value” was the only output, would need to give up one or other of these.

Gene set enrichment by confect

The length of the tumor-normal LFC vector is 143. (Largest possible effect size.)

Cancer may have something to do with cell division.

GO BP Term                  Confect  Effect (LFC length)  Size   Up  Down
chromosome organization        11.4                 15.5  1155  595   320
DNA conformation change        10.0                 12.8   292  182    59
system process                -10.0                -13.3  1656  400   884
DNA packaging                   9.7                 12.1   176  117    36
mitotic cell cycle process      9.6                 12.9   827  430   248
mitotic cell cycle              9.6                 12.8   967  488   306
nucleosome assembly             9.3                 11.6   118   85    20
chromosome segregation          9.2                 12.1   311  182    77
DNA metabolic process           9.2                 12.1   974  515   259
chromatin assembly              9.1                 11.3   137   94    26


(Packages GSVA and singscore also offer sample-level scores, can use with limma then topconfects. Choice of scale/normalization may need tweaking — similar results with singscore after scaling scores by \(\sqrt{\textrm{gene set size}}\) .)

Gene set enrichment by p-value

These sets undoubtedly contain a gene with expression differing from the background.

Top ranked terms miss the importance of cell division in cancer.

GO BP Term                                                     Adjusted p-value  Size  Up  Down
parturition                                                             1.5e-41    14   0    10
regulation of NAD+ ADP-ribosyltransferase activity                      1.5e-41     3   3     0
regulation of long-term synaptic potentiation                           1.5e-41    43   9    27
negative regulation of intrinsic apoptotic signaling pathway in response to DNA damage by p53 class mediator  2.1e-40  16  12  3
negative regulation of cytosolic calcium ion concentration              5.2e-40    11   1     9
maintenance of rDNA                                                     5.2e-40     2   2     0
negative regulation of insulin secretion                                2.8e-39    36   9    21
regulation of systemic arterial blood pressure by vasopressin           2.8e-39     4   0     4
bicarbonate transport                                                   2.8e-39    42   9    24
negative regulation of peptide hormone secretion                        3.6e-39    42  10    25

Transition to measurement

More choice:

  • My choice to use geometric projection implies scaling the indicator variable by \(\approx 1/\sqrt{\textrm{gene set size}}\) then taking the dot product with LFC. Other scaling choices possible.

  • Interaction effects: Ratio of ratios or difference of proportions?

    • Will affect confect ranking.
    • No difference for p-values.


Thinking about measurement leads to thinking about bias and reproducibility.

  • Effect sizes with unbiased estimators seem preferable for sparse large-\(n\) data such as single cell.


Naive calculation of confects or CIs by test inversion is computationally intensive.

  • Likelihood-ratio and permutation/rotation methods are slower still.

Summary

  • Greater use of confidence intervals in science is desirable.

  • FCR does for CIs what FDR does for p-values.

  • Described topconfects, a method that gives a useful ranking of results and provides confidence bounds once the top results are selected.

    • Good usability.
    • Conservative in simulation.
    • FCR concept can be used to validate similar methods (eg Bayesian).
  • Choice of measurement gives more freedom than choice of null hypothesis.




Acknowledgements: Traude Beilharz, David Powell, Andrew Pattison

Extra slides

Mixed-sign enrichment using squared LFCs

\(E(X_1) = E(X_2) = \mu\) with \(X_1, X_2\) independent \(\implies E(X_1 X_2) = \mu^2\)

Multiplying the estimated LFC of a gene from two different patients provides an unbiased estimate of the squared LFC. Average over all pairs of patients. Project to measure enrichment as before.

Jackknife to estimate standard error.
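A sketch of the estimator (Python, invented toy numbers; both helper names are hypothetical):

```python
import numpy as np

def mean_pairwise_product(x):
    """Average of x_i * x_j over all distinct pairs; unbiased for E(X)^2
    when the x_i are independent with common mean E(X)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Sum over distinct pairs equals (sum^2 - sum of squares) / 2.
    return (x.sum() ** 2 - (x ** 2).sum()) / (n * (n - 1))

def jackknife_se(x, stat):
    """Leave-one-out jackknife standard error of a statistic."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * ((loo - loo.mean()) ** 2).sum())

# Per-gene LFC estimates from three patients (toy numbers):
est_sq = mean_pairwise_product([1.0, 2.0, 3.0])   # (1*2 + 1*3 + 2*3) / 3
se = jackknife_se([1.0, 2.0, 3.0], mean_pairwise_product)
```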

Term                                  Confect  Effect (LFC² length)  Size    Up  Down
system process                           28.1                  39.4  1656   400   884
multicellular organismal process         28.1                  38.2  6614  2173  2932
extracellular structure organization     25.2                  32.5   404   143   179
tendon development                       24.0                  33.9     6     4     2
extracellular matrix organization        23.3                  29.9   357   129   155
anatomical structure development         23.1                  30.5  5426  1814  2386
system development                       22.5                  29.8  4430  1478  1970
developmental process                    22.5                  29.8  5789  1950  2529
response to chemical                     22.4                  31.5  4077  1442  1733
nervous system process                   22.4                  29.4   952   236   487