Commentary
Clarifications on the application and interpretation of the test for excess significance and its extensions

https://doi.org/10.1016/j.jmp.2013.03.002

Highlights

  • The test for excess significance depends on several assumptions.

  • Interpretation of the test should be cautious.

  • Significance-related biases may follow a complex pattern.

  • Likelihood ratio estimates can be used to generate the post-test probability of bias.

  • Correcting effect estimates for bias is not necessarily reliable.

Abstract

This commentary discusses challenges in the application of the test for excess significance (Ioannidis & Trikalinos, 2007) including the definition of the body of evidence, the plausible effect size for power calculations and the threshold of statistical significance. Interpretation should be cautious, given that it is not possible to separate different mechanisms of bias (classic publication bias, selective analysis, and fabrication) that lead to an excess of significance and in some fields significance-related biases may follow a complex pattern (e.g. Proteus phenomenon and occasional preference for “negative” results). Likelihood ratio estimates can be used to generate the post-test probability of bias, and correcting effect estimates for bias is possible in theory, but may not necessarily be reliable.

Section snippets

Nomenclature

Francis uses the terms consistency and inconsistency and defines the test as examining the consistency of a set of reported experiments (Francis, 2013). I am afraid that these terms may create some confusion in the literature. The terms “consistency” and “inconsistency” are used interchangeably with the terms “homogeneity” and “heterogeneity” in the field of meta-analysis (Higgins, Thompson, Deeks, & Altman, 2003), and TES is applied typically when many studies and meta-analyses thereof are

Definition of body of evidence

Francis has typically applied the test to probe for bias in sets of multiple experiments published by the same team in the same paper. The experiments are not necessarily the same, but may deviate in important aspects that may or may not also induce differences in the genuine effect sizes. The number of studies included in such bodies of evidence is usually relatively small, often <10. Nevertheless, TES always shows that there are too many significant results, because in the examples that
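For concreteness, a minimal sketch of the excess-significance calculation is given below, using a common binomial approximation in which the observed number of significant results is compared with the number expected from the studies' estimated powers. The power values, study count and alpha level are illustrative assumptions, not figures from any of the papers discussed here.

```python
# Hedged sketch of an excess-significance calculation: compare the observed
# number of significant studies O with the expected number E (sum of study
# powers), using a binomial approximation based on the mean power.
from scipy.stats import binom

def excess_significance_test(powers, n_significant):
    """Return (E, p), where E is the expected number of significant studies
    and p is the one-sided probability of observing >= n_significant
    significant results if each study had power equal to the mean power."""
    n_studies = len(powers)
    expected = sum(powers)                     # E = sum of estimated powers
    mean_power = expected / n_studies
    p_value = binom.sf(n_significant - 1, n_studies, mean_power)
    return expected, p_value

# Illustrative example: 8 experiments, each with ~35% power, all 8 significant.
expected, p = excess_significance_test([0.35] * 8, n_significant=8)
print(f"expected significant: {expected:.1f} of 8, excess-significance p = {p:.4f}")
```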

Definition of plausible effect size

TES results depend on the assumptions about the plausible effect size, since these directly affect the power estimates for each study. This is a clear limitation, but, as Francis shows, the conclusions tend to be fairly robust when different assumptions are made about the plausible effect size within a sensible range. I would like to add here some additional considerations. First, it is possible to perform power calculations assuming a distribution of a plausible effect instead of a
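As a rough sketch of the first consideration, power can be averaged over a distribution of plausible effect sizes rather than computed at a single value. The normal distribution of effects, the parameter values and the sample size below are illustrative assumptions, not choices made in the commentary.

```python
# Hedged sketch: expected power of a two-sample comparison when the plausible
# effect size d is treated as a distribution rather than a single value.
import numpy as np
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for effect size d."""
    se = np.sqrt(2.0 / n_per_group)            # SE of the standardized difference
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - d / se) + norm.cdf(-z_crit - d / se)

def expected_power(d_mean, d_sd, n_per_group, n_draws=100_000, seed=0):
    """Average power over a normal distribution of plausible effect sizes."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(d_mean, d_sd, n_draws)
    return power_two_sample(draws, n_per_group).mean()

# Illustrative example: plausible effect centred on d = 0.4 (sd 0.15), n = 20 per group.
print(f"power at d = 0.4:            {power_two_sample(0.4, 20):.3f}")
print(f"expected power (distribution): {expected_power(0.4, 0.15, 20):.3f}")
```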

Definition of nominal statistical significance threshold

Francis has used the p=0.05 threshold to separate “positive” from “negative” results. This threshold acts as an attractor for investigators in many fields (Bakker et al., 2012, Simmons et al., 2011), but it is not absolute. Some fields increasingly require more stringent thresholds and/or use multiplicity corrections, some investigators may bias the results of their analysis too much and strive to get p-values much below 0.05, and investigators occasionally make leaps to claim significance for

Separating mechanisms of reporting bias

There are many mechanisms of selective reporting. I agree with Francis that fabrication bias, i.e. clear fraud, is unlikely to be a major player in most scientific fields. However, I also doubt that classic publication bias is the main explanation for excess significance in most fields. Classic publication bias means that “negative” results entirely disappear (by authors and/or editors/reviewers). The prevalence of this bias may vary across different scientific fields, proportional to the ease

Proteus phenomenon and complex bias patterns

The notion that reporting biases always favor “positive” over “negative” results is an over-simplification. Incentives for reporting (or not) specific types of results may vary. Occasionally “negative” results may be more attractive to obtain and publish than “positive” results. For example, if a study publishes a prominent observation in a major journal, other scientists may wish to contradict it. A strong contradiction may be attractive also to editors and reviewers. This generates the

Post-test probability of bias

The probability of bias in a body of evidence depends not only on the results of the TES but also on the prior probability of such bias. TES can be seen as a diagnostic test with some sensitivity (sens = power to detect bias) and specificity (spec). The post-test odds of bias O_post are:

O_post = LR(+) × O_pre = [sens / (1 − spec)] × O_pre when p < 0.10, and
O_post = LR(−) × O_pre = [(1 − sens) / spec] × O_pre when p > 0.10,

where LR(+) and LR(−) are the positive and negative likelihood ratios, respectively.
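A minimal worked example of this conversion from prior to post-test probability is sketched below; the sensitivity, specificity and prior probability are illustrative assumptions, not the extrapolated values reported in Table 1.

```python
# Hedged sketch of the post-test probability formula above.
# sens, spec and the prior probability are illustrative values only.
def post_test_probability(prior_prob, sens, spec, tes_positive):
    """Convert a prior probability of bias into a post-test probability
    using LR(+) = sens/(1 - spec) or LR(-) = (1 - sens)/spec."""
    o_pre = prior_prob / (1 - prior_prob)                 # prior odds
    lr = sens / (1 - spec) if tes_positive else (1 - sens) / spec
    o_post = lr * o_pre                                   # post-test odds
    return o_post / (1 + o_post)                          # back to probability

# Illustrative example: prior probability of bias 30%, sens = 0.40, spec = 0.90.
print(post_test_probability(0.30, sens=0.40, spec=0.90, tes_positive=True))   # ~0.63
print(post_test_probability(0.30, sens=0.40, spec=0.90, tes_positive=False))  # ~0.22
```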

As shown in Table 1, extrapolating

Correcting for bias

Assuming that TES is “positive” and bias does exist, the true effect is likely to be smaller than what is observed, but it is not necessarily null. Empirical pragmatic examples, based on the exchange of information between Francis and authors whose papers were tagged as “positive” by TES, show that TES did indeed pick up the presence of biases, at a minimum classic publication bias with some studies being unpublished. Selective analysis and related questionable research practices may be more difficult to

Acknowledgment

I am grateful to Greg Francis for supplying the raw values for figures 4 and 5 of his paper.

References (33)

  • C.J. Ferguson et al. Publication bias in psychological science: prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods (2012)
  • G. Francis. The psychology of replication and replication in psychology. Perspectives on Psychological Science (2012)
  • G. Francis. Too good to be true: publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review (2012)
  • G.L. Gadbury et al. Inappropriate fiddling with statistical analyses to obtain a desirable p-value: tests to detect its presence in published literature. PLoS ONE (2012)
  • J. Galak et al. You could have just asked: reply to Francis (2012). Perspectives on Psychological Science (2012)
  • J.P. Higgins et al. Measuring inconsistency in meta-analyses. British Medical Journal (2003)