Where the 0.05 Significance Threshold Came From, and What the ASA Said in 2016

Domain knowledge · Published by AppCrib
Pvalr: Test statistic in, p-value out. 30 seconds.

A student runs a two-tailed t-test, gets p = 0.0512, and stops. The threshold says 0.05. Their result is not significant. They got close, but close doesn't count, and the lab report calls for a clean reject-or-fail-to-reject sentence.

That binary cutoff has been the source of more arguments in statistics than almost any other number. Where it came from, what it was meant to mean, and what professional statisticians now say about it are three different conversations. None of them began with a vote. None of them ended with one either.

Where the 0.05 threshold came from: Fisher's 1925 textbook

Ronald A. Fisher introduced the 0.05 convention in *Statistical Methods for Research Workers* (1925). He did not derive it. He suggested it as a working rule for researchers who needed a cutoff. The passage that everyone quotes runs roughly: the value for which P equals 0.05, or one in twenty, is a deviation of 1.96 standard deviations, or nearly 2, and it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.
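The arithmetic behind that sentence is easy to check. The sketch below assumes a standard normal deviate and uses Python's standard-library `statistics.NormalDist`; the function name `two_sided_p_from_z` is made up for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p_from_z(z: float) -> float:
    """Probability of a deviation at least as large as |z| in either direction."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


# Forward: how rare is a deviation of 1.96 standard deviations?
p_at_196 = two_sided_p_from_z(1.96)

# Backward: which deviation leaves 0.05 split across the two tails?
z_for_05 = STD_NORMAL.inv_cdf(1 - 0.05 / 2)

print(f"P(|Z| >= 1.96) = {p_at_196:.4f}")               # 0.0500
print(f"z for two-sided P = 0.05: {z_for_05:.4f}")      # 1.9600
```

Nothing in the calculation singles out one in twenty. Any other probability would map to its own deviation just as cleanly.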

Convenient. Not derived. Not principled. Fisher picked a round-number probability that was easy to look up in a table, namely the table he was about to print at the end of the book.

In *The Design of Experiments* (1935) he reiterated it, again as a working convention. He did not recommend treating it as a fixed boundary between truth and falsehood. He thought experimenters should look at the actual p-value rather than whether it crossed a line. Later in his career he openly resisted the idea that 0.05 should be a universal cutoff. The convention outran his intent within a generation.

Neyman and Pearson built the decision rule Fisher didn't intend

Jerzy Neyman and Egon Pearson published their hypothesis-testing framework in 1933 ("On the Problem of the Most Efficient Tests of Statistical Hypotheses," *Philosophical Transactions of the Royal Society A*, Vol 231). Their framework was structurally different from Fisher's. Fisher offered a measure of evidence against the null. Neyman and Pearson offered a decision procedure with explicit Type I (false positive) and Type II (false negative) error rates.

The 0.05 number survived the transition, but it changed jobs. In the Neyman-Pearson framework, alpha is the long-run rate of false positives you are willing to tolerate. It is set before the experiment, defended in advance, and not adjusted after seeing the data. The fixed-cutoff interpretation that frustrates working scientists today is closer to the Neyman-Pearson framing than to Fisher's.
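The "long-run rate" reading can be made concrete with a small simulation. This is a sketch under stated assumptions, not anything taken from Neyman and Pearson's paper: the null is true by construction, the data are standard normal with known variance, and the function name, sample size, and trial count are illustrative choices.

```python
import random
from statistics import NormalDist


def simulate_false_positive_rate(alpha=0.05, n=30, trials=20_000, seed=0):
    """Share of experiments that reject a TRUE null with a two-sided z-test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # fixed before any data are seen
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # null is true: mean 0
        z = sum(sample) / n ** 0.5  # sample mean divided by its standard error (sigma = 1)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = simulate_false_positive_rate()
print(f"Observed false-positive rate at alpha = 0.05: {rate:.3f}")
```

Across many repetitions the rejection rate settles near alpha. That is a statement about the procedure, not about any single experiment.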

Modern teaching collapses the two. Students learn that p < 0.05 means significant and treat the p-value both as evidence (Fisher) and as a decision rule (Neyman-Pearson) at the same time. That blend was never the recommendation of either originator.

The 2016 ASA statement and its six principles

In March 2016, the American Statistical Association released a formal statement on p-values, the first time a major professional body had issued a consensus document warning about how the statistic was being misused. Wasserstein and Lazar's paper "The ASA Statement on p-Values: Context, Process, and Purpose" appeared in *The American Statistician* (Vol 70, Issue 2). It opened with George Cobb's circular exchange: colleges teach p = 0.05 because that is what journals use, and journals use p = 0.05 because that is what everyone was taught in college.

The statement lays out six principles. The short version:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Principle 3 is the one that lands hardest on the 0.05 convention. The ASA did not ask the world to abandon the cutoff. It asked everyone to stop pretending the cutoff was load-bearing.

The 2018 proposal to move the line to 0.005

In January 2018, a group of 72 statisticians and researchers published "Redefine Statistical Significance" in *Nature Human Behaviour* (Benjamin et al., 2018). The argument: for *new discoveries*, the conventional cutoff should be tightened from 0.05 to 0.005. They left 0.05 in place as a softer "suggestive" threshold.

The motivation was the replication crisis. Studies that cleared p < 0.05 were failing to replicate at high rates across psychology, biomedical research, and parts of economics. The authors argued that a stricter cutoff for claims of discovery would catch a substantial fraction of false positives at the source.
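A back-of-the-envelope calculation shows why a stricter cutoff would thin out false positives. The numbers below are hypothetical, not taken from the paper: a prior in which 10% of tested hypotheses are real effects, and 80% power held fixed at both thresholds. Holding power fixed is generous to the stricter cutoff, since the same sample size has less power at a smaller alpha.

```python
def false_positive_share(alpha: float, power: float, prior_true: float) -> float:
    """Among results that clear the threshold, the fraction where the null was true."""
    false_positives = alpha * (1 - prior_true)  # true nulls that still pass
    true_positives = power * prior_true         # real effects that are detected
    return false_positives / (false_positives + true_positives)


# Hypothetical inputs: 10% of tested hypotheses are real, 80% power.
for alpha in (0.05, 0.005):
    share = false_positive_share(alpha, power=0.8, prior_true=0.1)
    print(f"alpha = {alpha}: {share:.1%} of 'discoveries' are false positives")
```

With these inputs the share drops from 36% to about 5%. Change the assumed prior or power and the figures move, which is part of why the proposal drew a counter-argument.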

Not everyone agreed. A counter-proposal led by Daniel Lakens and 87 co-authors ("Justify Your Alpha," *Nature Human Behaviour*, March 2018) argued for justifying alpha case by case rather than swapping one universal cutoff for another. The two camps agreed on the diagnosis. They disagreed on whether a sharper line was the cure.

In March 2019, Wasserstein, Schirm, and Lazar edited a special issue of *The American Statistician*, "Statistical Inference in the 21st Century: A World Beyond p < 0.05." Their editorial, "Moving to a World Beyond p < 0.05," took the firmer position: stop saying "statistically significant" at all. Report the p-value, the effect size, the confidence interval, and let the reader judge.

What different fields actually use

The 0.05 convention is the default in most coursework, but real practice has always varied. Some fields use it, some do not, and the difference matters when you are reading research.

| Field | Convention | Notes |
| --- | --- | --- |
| Introductory teaching, most undergraduate labs | 0.05 | The default in nearly every stats textbook. |
| Clinical trials | 0.05, often 0.01 | FDA guidance favors 0.05 for primary endpoints with multiple-testing corrections. |
| Genomics (GWAS) | 5 × 10⁻⁸ | Reflects Bonferroni correction over roughly one million independent SNP tests. |
| Particle physics | "5 sigma" ≈ 2.87 × 10⁻⁷ | Convention for claiming a discovery; the July 2012 Higgs announcement reached it. |
| Exploratory and pilot studies | 0.10 sometimes | Higher false-positive tolerance accepted to flag follow-up candidates. |
| Psychology and social science | 0.05, with caveats | Increasing use of pre-registration and Bayes factors alongside. |
| Basic and Applied Social Psychology (journal) | None | The journal banned p-values entirely in February 2015. |
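The two smallest thresholds in the table can be reproduced in a few lines. The sketch assumes a family-wise alpha of 0.05 spread over exactly one million tests for the GWAS row, and a one-sided standard normal tail for the particle-physics row.

```python
from statistics import NormalDist

# GWAS: Bonferroni divides the family-wise alpha by the number of tests.
family_alpha = 0.05
independent_tests = 1_000_000
bonferroni_gwas = family_alpha / independent_tests

# Particle physics: probability of a fluctuation beyond 5 standard
# deviations in one direction (one-sided tail of a standard normal).
five_sigma_one_sided = NormalDist().cdf(-5.0)

print(f"GWAS per-test threshold: {bonferroni_gwas:.1e}")       # 5.0e-08
print(f"5-sigma one-sided tail:  {five_sigma_one_sided:.2e}")  # 2.87e-07
```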

The takeaway is that 0.05 is a convention, not a law. It is sticky because it is easy to teach, easy to write into a methods section, and easy for a reviewer to scan. It is not sticky because it has a defensible derivation.

Why 0.049 vs 0.051 sits at the heart of the debate

This is the worked-out version of the ASA's third principle. Two studies, one with p = 0.049 and one with p = 0.051, are usually treated as opposites. One rejects the null. The other does not. One gets reported as a "significant finding." The other gets the careful hedge of "no statistical significance."

The actual statistical evidence (the strength of the signal in the data) is essentially identical. The two p-values differ by 0.002. If you ran the same experiment a second time, ordinary sampling variation could easily flip the verdict the other way, even though nothing about the underlying effect changed.
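How easily the verdict flips can be estimated with a rough calculation. The sketch below makes a strong assumption, labeled as such: the true effect is exactly what each study observed, and the replication is an identical two-sided z-test. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_two_sided_p(p: float) -> float:
    """The |z| statistic that produces a given two-sided p-value."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def replication_probability(p_observed: float, alpha: float = 0.05) -> float:
    """Chance an identical replication reaches p < alpha,
    ASSUMING the true effect equals the observed one."""
    z_true = z_from_two_sided_p(p_observed)
    z_crit = z_from_two_sided_p(alpha)
    # The replicate z is distributed Normal(z_true, 1); add both rejection tails.
    upper = 1 - STD_NORMAL.cdf(z_crit - z_true)
    lower = STD_NORMAL.cdf(-z_crit - z_true)
    return upper + lower


for p in (0.049, 0.051):
    print(f"p = {p}: z = {z_from_two_sided_p(p):.3f}, "
          f"P(replication significant) = {replication_probability(p):.3f}")
```

Under that assumption both studies have close to a one-in-two chance of a "significant" replication. The labels differ, while the outlook for a repeat run is nearly the same.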

The 2018 proposal to move the cutoff to 0.005 did not fix this problem. It moved it. A study with p = 0.0049 would still land on the opposite side of the line from one with p = 0.0051, despite carrying nearly identical evidence. Any binary cutoff produces a discontinuity at the threshold, and the discontinuity will trip up any field that treats the verdict as more meaningful than the number behind it.

The most consistent advice from statisticians over the last decade is to stop reading a p-value as a binary verdict and to report it as a continuous quantity along with the effect size and the confidence interval. That advice predates the ASA statement and has been repeated in nearly every commentary since.

Reading p-values without leaning on a single cutoff

The practical, post-2016 way to read a p-value goes like this.

A p-value tells you something about the compatibility between your data and a specific null hypothesis under a specific model. A small p means the data would be unusual *if* the null were true. It does not tell you the null is false. It does not tell you the alternative is true. It does not tell you how large the effect is.

When you write up a result, the conventional sentence ("the result is statistically significant at the alpha = 0.05 level") is fine, provided you also report the actual p-value, the effect size, and the confidence interval. The convention is not the problem. The convention being the *only* thing reported is the problem.
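A fuller report might be assembled along these lines. This is a minimal sketch, not a recommended analysis pipeline: it uses a large-sample normal approximation (a t-based interval would be somewhat wider for samples this small), the data are invented for illustration, and `summarize_difference` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize_difference(group_a, group_b, confidence=0.95):
    """Difference in means with a CI, Cohen's d, and a two-sided p-value
    (normal approximation; an assumption, not an exact small-sample test)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    se = sqrt(var_a / n_a + var_b / n_b)            # standard error of the difference
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return {
        "difference": diff,
        "ci": (diff - z_crit * se, diff + z_crit * se),
        "cohens_d": diff / pooled_sd,               # standardized effect size
        "p_value": p_value,
    }


# Invented measurements, for illustration only.
treated = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7]
control = [4.8, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2]

result = summarize_difference(treated, control)
low, high = result["ci"]
print(f"difference = {result['difference']:.2f}, 95% CI [{low:.2f}, {high:.2f}], "
      f"d = {result['cohens_d']:.2f}, p = {result['p_value']:.4f}")
```

The printed line carries the estimate, its uncertainty, and its size. The significance sentence can sit next to it without standing in for it.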

For coursework, the 0.05 cutoff is what your grader expects, and writing a results section that hits it cleanly is the appropriate move. For anything beyond coursework, the ASA's third principle is the one to internalize: the threshold is a guide, not an oracle, and the post-2016 consensus is that a p-value standing alone should not carry a decision.

When you need the actual number, the calculation itself is the easy part. Pvalr handles the CDF lookup for Z, T, chi-square, and F distributions across all three tail directions and prints the conventional interpretation sentence next to a reminder about what a p-value does and does not measure. The threshold is a convention, but the number is real, and getting the number right is the only place a calculator can actually help.
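For the Z case, the lookup amounts to a few lines. The sketch below is not Pvalr's implementation. The function name and the `tail` labels are assumptions, and the T, chi-square, and F cases are left out because they need distributions beyond Python's standard library.

```python
from statistics import NormalDist


def p_value_from_z(z: float, tail: str = "two") -> float:
    """p-value for a standard normal test statistic under one of three tails."""
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z)                        # P(Z <= z)
    if tail == "right":
        return 1.0 - cdf(z)                  # P(Z >= z)
    if tail == "two":
        return 2.0 * (1.0 - cdf(abs(z)))     # P(|Z| >= |z|)
    raise ValueError("tail must be 'left', 'right', or 'two'")


for tail in ("left", "right", "two"):
    print(f"z = 2.1, {tail}-tailed: p = {p_value_from_z(2.1, tail):.4f}")
```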
