I often hear in talks from statisticians that P-values are uniformly distributed under the null. But how can this be? And what does it mean? As the demonstration is pretty straightforward but nonetheless hard to find on the Internet, here it is.
Everything starts with an experiment (or at least with the observation of a natural phenomenon, whether or not it is part of an experiment). The aim is to assess whether the hypothesis we have about this phenomenon seems to be true. But first, let’s recall that a parametric test (see Wikipedia) consists of:
- data: the observations $x_1$, $x_2$, …, $x_N$ are realizations of $N$ random variables $X_1$, $X_2$, …, $X_N$ assumed to be identically distributed;
- statistical model: the probability distribution of the $X_1$, $X_2$, …, $X_N$ depends on parameter(s) $\theta$;
- hypothesis: an assertion concerning $\theta$, noted $H_0$ for the null (e.g. $\theta = a$), and $H_1$ for the alternative (e.g. $\theta = b$ with $b \neq a$);
- decision rule: given a test statistic $T$, if it belongs to the critical region $C$, the null hypothesis is rejected.
In practice, $T$ follows a given distribution under $H_0$ (e.g. a Normal distribution, or a Student distribution) that does not depend on $\theta$ but on $N$. We use the observations to compute a realization, noted $t$, of $T$.
The P-value, noted $P$, can be seen as a random variable, and its realization, noted $p$, depends on the observations. With this notation, the formal definition of the P-value for the given observations is:
$$p = \mathbb{P}(T \geq t \mid H_0)$$
Therefore, according to Wikipedia, a P-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. According to Matthew Stephens (source), a p value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.
Very importantly, note that the 2nd definition emphasizes the fact that, although it is computed from the data, a P-value is not the probability that $H_0$ is true given the data we actually observed!
A P-value only tells us what would happen if we repeated the experiment a large number of times… (That’s why P-values are often decried.)
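To make the "repeat the experiment a large number of times" reading concrete, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption chosen for illustration: a one-sided z-test with known standard deviation, a sample size of 20, a hypothetical observed data set drawn with mean 0.3, and the helper name `z_statistic`. It compares the P-value computed from the definition with the proportion of simulated null experiments whose statistic is at least as extreme as the observed one.

```python
import random
from statistics import NormalDist

def z_statistic(sample, mu0=0.0, sigma=1.0):
    """Hypothetical test statistic: T = sqrt(N) * (sample mean - mu0) / sigma."""
    n = len(sample)
    return (sum(sample) / n - mu0) / (sigma / n ** 0.5)

rng = random.Random(1)
n = 20  # assumed sample size

# One "observed" data set (assumption: drawn here with true mean 0.3).
observed = [rng.gauss(0.3, 1.0) for _ in range(n)]
t_obs = z_statistic(observed)

# P-value from the definition: P(T >= t_obs | H0), with T ~ N(0, 1) under H0.
p_exact = 1.0 - NormalDist().cdf(t_obs)

# Frequency reading: repeat the experiment under H0 many times and count how
# often the statistic is at least as extreme as the observed one.
n_rep = 20000
hits = sum(
    z_statistic([rng.gauss(0.0, 1.0) for _ in range(n)]) >= t_obs
    for _ in range(n_rep)
)
p_mc = hits / n_rep

print(f"t = {t_obs:.3f}, exact p = {p_exact:.4f}, Monte Carlo p = {p_mc:.4f}")
```

The two numbers should agree up to Monte Carlo noise, which is all the P-value claims: a statement about hypothetical repetitions under the null, not about the probability of the null itself.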
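Ok, back on topic now. From the formula above, we can also write:
$$p = 1 - \mathbb{P}(T < t \mid H_0)$$
By noting $F_0$ the cumulative distribution function (cdf, fonction de répartition in French) of $T$ under $H_0$, we obtain:
$$p = 1 - F_0(t)$$
And here is the trick, which works when $T$ is a continuous random variable, so that its cdf is continuous and strictly increasing on the support of $T$, hence invertible there. For any $u$ in $[0, 1]$:
$$\mathbb{P}(F_0(T) < u \mid H_0) = \mathbb{P}(T < F_0^{-1}(u) \mid H_0) = F_0(F_0^{-1}(u)) = u$$
Therefore, we have:
$$\mathbb{P}(F_0(T) < u \mid H_0) = u$$
Which means that $F_0(T)$ follows a uniform distribution on $[0, 1]$. And, as this also means that $P = 1 - F_0(T)$ is uniformly distributed (if $U$ is uniform on $[0, 1]$, so is $1 - U$), we can conclude that P-values are uniformly distributed under the null hypothesis.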
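The result can be checked numerically. The sketch below is only an illustration under assumed settings (a one-sided z-test on 10 standard Normal observations, a hypothetical helper `p_value_under_null`): it simulates many experiments in which the null is true, computes one minus the null cdf of each test statistic, and compares the empirical cdf of the resulting P-values with that of a uniform distribution.

```python
import random
from statistics import NormalDist

rng = random.Random(42)
std_normal = NormalDist()  # cdf of the test statistic under the null

def p_value_under_null(n=10):
    """Simulate one experiment where the null is true; return its P-value."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    t = sum(sample) / n * n ** 0.5      # T ~ N(0, 1) under the null
    return 1.0 - std_normal.cdf(t)      # p = 1 - cdf(t)

p_values = [p_value_under_null() for _ in range(50000)]

# If P is uniform on [0, 1], then P(P <= u) = u for every u in [0, 1].
checks = {}
for u in (0.01, 0.05, 0.25, 0.5, 0.9):
    checks[u] = sum(p <= u for p in p_values) / len(p_values)
    print(f"P(P <= {u:.2f}) ~ {checks[u]:.4f}")
```

Each printed fraction should be close to the corresponding `u`, up to simulation noise.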
But what does it mean? Well, we usually consider a significance level, noted $\alpha$ (small, e.g. 5%, 1%, 0.1%…), and if the P-value falls below this threshold, we reject the null and declare the result significant. However, let’s say we re-do the same experiment $N$ times and compute a P-value for each of them. Since P-values are uniformly distributed under the null, it is as likely to find some of them between 0.8 and 0.85 as to find some of them below 0.05, if $H_0$ is indeed true. That is, some of them will fall below the significance threshold, just by chance. The experiments corresponding to these P-values are called false positives: we think they are positives, i.e. we decide to accept $H_1$, while in fact they are false, i.e. $H_0$ is true and should not be rejected.
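As a rough illustration, the sketch below draws P-values directly from a uniform distribution, which is what the result above says they look like when the null is true. The number of experiments (100), the 5% threshold and the number of repeats are assumptions picked for the example. It counts how many P-values land below 0.05 and how many land between 0.8 and 0.85, two intervals of the same width.

```python
import random

rng = random.Random(7)
alpha = 0.05
n_experiments = 100   # assumed number of repeated experiments
n_repeats = 2000      # repeat the whole series to average out the noise

below_alpha = []      # false positives per series
in_high_band = []     # P-values in [0.8, 0.85) per series
for _ in range(n_repeats):
    # With a true null, each P-value is a draw from Uniform(0, 1).
    p_values = [rng.random() for _ in range(n_experiments)]
    below_alpha.append(sum(p < alpha for p in p_values))
    in_high_band.append(sum(0.8 <= p < 0.85 for p in p_values))

avg_false_pos = sum(below_alpha) / n_repeats
avg_high_band = sum(in_high_band) / n_repeats
print(f"average count with p < {alpha}: {avg_false_pos:.2f}")
print(f"average count with 0.8 <= p < 0.85: {avg_high_band:.2f}")
```

Both averages should come out close to `alpha * n_experiments`, since the two intervals have the same width.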
Last but not least, if we re-do the same experiment 100 times and consider a threshold of 5%:
- if $H_0$ is true (although we are not supposed to know it before doing the experiment), how many P-values will fall below this threshold just by chance? 5, on average;
- if now $H_0$ is supposed to be true 50% of the time, what proportion of the P-values around $0.05$ will come from a true $H_0$? At least 23%, and typically close to 50% (see the paper of Sellke et al in 2001). In other words, when $H_0$ is true 50% of the time, a P-value of 5% tells us very little, as up to about half of the experiments from which such P-values were calculated correspond to a true $H_0$…
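The second point can be explored with a small simulation. This is only a sketch under assumptions that are not in the text above: a two-sided z-test, a null that is true with probability 0.5, true effects under the alternative drawn from a Normal distribution with a hypothetical spread `tau`, and "around 0.05" taken to mean between 0.04 and 0.06. The share it reports depends on these choices, so it should be read as an illustration and not as a reproduction of the figures of Sellke et al.

```python
import random
from statistics import NormalDist

rng = random.Random(2001)
std_normal = NormalDist()
tau = 2.0                 # assumed spread of true effects under the alternative
n_experiments = 200_000   # assumed number of simulated experiments

null_near = 0  # P-values near 0.05 coming from a true null
alt_near = 0   # P-values near 0.05 coming from a false null
for _ in range(n_experiments):
    h0_true = rng.random() < 0.5                 # null true half of the time
    effect = 0.0 if h0_true else rng.gauss(0.0, tau)
    t = rng.gauss(effect, 1.0)                   # test statistic ~ N(effect, 1)
    p = 2.0 * (1.0 - std_normal.cdf(abs(t)))     # two-sided P-value
    if 0.04 <= p <= 0.06:                        # "around 0.05"
        if h0_true:
            null_near += 1
        else:
            alt_near += 1

share_null = null_near / (null_near + alt_near)
print(f"P-values in [0.04, 0.06]: {null_near + alt_near}")
print(f"share of them coming from a true null: {share_null:.2f}")
```

Changing `tau` (or the shape of the distribution of effects) moves the share, which is why the statement in the text is phrased as a lower bound plus a typical value.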