Proof that P-values under the null are uniformly distributed

I often hear statisticians say in talks that P-values are uniformly distributed under the null. But how can this be? And what does it mean? As the demonstration is pretty straightforward but nonetheless hard to find on the Internet, here it is.

Everything starts with an experiment (or at least with the observation of a natural phenomenon, whether it is part of an experiment or not). The aim is to assess whether or not the hypothesis we have about this phenomenon seems to be true. But first, let’s recall that a parametric test (see Wikipedia) consists of:

  • data: the n observations x_1, x_2, …, x_n are realizations of n random variables X_1, X_2, …, X_n assumed to be independent and identically distributed;
  • statistical model: the probability distribution of the X_1, X_2, …, X_n depends on parameter(s) \theta;
  • hypothesis: an assertion concerning \theta, denoted H_0 for the null (e.g. \theta = a), and H_1 for the alternative (e.g. \theta = b with b > a);
  • decision rule: given a test statistic T, if it belongs to the critical region C, the null hypothesis H_0 is rejected.

In practice, under H_0, T follows a known distribution (e.g. a Normal distribution, or a Student’s t distribution) that does not depend on \theta but may depend on n. We use the observations to compute a realization of T, denoted t.
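
As an illustration, here is a minimal sketch in Python of one such test, assuming a hypothetical setup (not part of the argument above): n i.i.d. observations from N(\theta, 1), testing H_0: \theta = 0 against H_1: \theta > 0 with the statistic T = \sqrt{n} \, \bar{X}_n, which follows a standard Normal distribution under H_0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: x_1, ..., x_n are realizations of
# X_1, ..., X_n i.i.d. N(theta, 1); H0: theta = 0 vs H1: theta > 0.
n = 30
theta = 0.0                         # we simulate under the null here
x = rng.normal(theta, 1.0, size=n)

# Test statistic: T = sqrt(n) * sample mean, which is N(0, 1) under H0.
t = np.sqrt(n) * x.mean()
print(f"observed t = {t:.3f}")
```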

The P-value, denoted P, can be seen as a random variable, and its realization, denoted p, depends on the observations. With these notations, and for a test that rejects H_0 for large values of T, the formal definition of the P-value for the given observations is:

p = \mathbb{P} ( T \ge t | H_0 )

Therefore, according to Wikipedia, a P-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. According to Matthew Stephens (source), a P-value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.

Very importantly, note that the second definition emphasizes the fact that, although it is computed from the data, a P-value does not correspond to the probability that H_0 is true given the data we actually observed!

p \ne \mathbb{P} ( H_0 | x_1, x_2, \dots, x_n )

A P-value simply tells us something about what would happen if we repeated the experiment a large number of times… (That’s why P-values are often decried.)

Ok, back on topic now. From the formula above, we can also write:

p = 1 - \mathbb{P} ( T < t | H_0 )

Denoting by F_0 the cumulative distribution function (cdf, fonction de répartition in French) of T under H_0, and assuming that T is a continuous random variable, so that \mathbb{P}(T < t \,|\, H_0) = \mathbb{P}(T \le t \,|\, H_0) = F_0(t), we obtain:

p = 1 - F_0( t )
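
Continuing the sketch above, where T is standard Normal under H_0, computing p = 1 - F_0(t) is a one-liner (scipy’s sf, the survival function, is exactly 1 - cdf):

```python
from scipy.stats import norm

t = 1.8                  # a hypothetical realization of T
p = 1 - norm.cdf(t)      # p = 1 - F_0(t)
print(p, norm.sf(t))     # sf = 1 - cdf, numerically safer in the upper tail
```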

And here is the trick, thanks to the fact that the cdf is non-decreasing and, here, assumed to be continuous and strictly increasing on the support of T, so that the events T \ge t and F_0(T) \ge F_0(t) coincide:

\mathbb{P} ( T \ge t \,|\, H_0 ) = \mathbb{P} ( F_0(T) \ge F_0(t) \,|\, H_0 ) = 1 - \mathbb{P} ( F_0(T) < F_0(t) \,|\, H_0 )

Comparing this with p = 1 - F_0(t) obtained above, we therefore have:

\mathbb{P} ( F_0(T) < F_0(t) \,|\, H_0 ) = F_0( t )

As t varies over the support of T, u = F_0(t) takes every value in (0, 1), so the equality above says that \mathbb{P}(F_0(T) < u \,|\, H_0) = u for all u \in (0, 1): in other words, F_0(T) follows a uniform distribution on [0, 1]. And since 1 - U is also uniform on [0, 1] whenever U is, P = 1 - F_0(T) is uniformly distributed too: P-values are uniformly distributed under the null hypothesis.

QED.
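
This conclusion is easy to check by simulation. Here is a minimal sketch, assuming the same hypothetical Gaussian mean test as above: draw many datasets under H_0, compute the P-value each time, and compare the empirical distribution of the P-values to the uniform one, e.g. with a Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(42)
n, n_experiments = 30, 10_000

# Draw n_experiments datasets under H0 (theta = 0) and compute the
# one-sided P-value p = 1 - F_0(t) of the Gaussian mean test for each.
x = rng.normal(0.0, 1.0, size=(n_experiments, n))
p = norm.sf(np.sqrt(n) * x.mean(axis=1))

# Under H0, the P-values should be uniform on [0, 1]: a large P-value
# of the KS test means no detectable departure from uniformity.
print(kstest(p, "uniform"))
```

With this many replicates, the KS statistic should come out small and its own P-value large; a histogram of p would look flat.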

But what does it mean? Well, we usually consider a significance level, denoted \alpha (small, e.g. 5%, 1%, 0.1%…), and if the P-value falls below this threshold, we reject the null and declare the result significant. However, let’s say we redo the same experiment N times and compute a P-value for each of them. Since P-values are uniformly distributed under the null, it is as likely to find some of them between 0.8 and 0.85 as to find some of them below 0.05, if H_0 is indeed true. That is, some of them will fall below the significance threshold, just by chance. The experiments corresponding to these P-values are called false positives: we think they are positives, i.e. we decide to accept H_1, whereas in fact H_0 is true and should not have been rejected.
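
A quick sketch of this point, re-using the same hypothetical simulation as above: the two intervals [0, 0.05] and [0.80, 0.85] receive about the same proportion of P-values, and about 5% of the true-null experiments end up as false positives.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n, n_experiments, alpha = 30, 10_000, 0.05

# Same simulation as above: P-values of the one-sided Gaussian mean
# test, computed on data generated under H0.
x = rng.normal(0.0, 1.0, size=(n_experiments, n))
p = norm.sf(np.sqrt(n) * x.mean(axis=1))

# Two intervals of equal width are equally likely under uniformity...
print("p in [0.00, 0.05]:", np.mean(p <= 0.05))
print("p in [0.80, 0.85]:", np.mean((p >= 0.80) & (p <= 0.85)))

# ...so about alpha = 5% of these true-null experiments are false positives.
print("false positives:", np.mean(p < alpha))
```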

Last but not least, if we redo the same experiment 100 times and consider a threshold of 5%:

  • if H_0 is true (although we are not supposed to know it before doing the experiment), how many P-values will fall below this threshold just by chance? 5, on average;
  • if now H_0 is true in 50% of the experiments, what proportion of the experiments whose P-value lands around 5\% \pm \epsilon correspond to a true H_0? At least 23%, and typically around 50% (see the paper by Sellke et al. in 2001, and the sketch after this list). In other words, when H_0 is true half of the time, a P-value of 5% doesn’t tell us much on its own, as about half of the experiments from which such P-values were calculated correspond to a true H_0, and half to a false H_0.
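
Here is a minimal sketch of this last scenario, with an arbitrary hypothetical effect size under the alternative; the exact proportion depends on that effect size and on the power of the test, which is why Sellke et al. derive lower bounds over classes of alternatives.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, n_experiments = 30, 100_000

# H0 (theta = 0) is true in half of the experiments; in the other half
# the alternative holds, with an arbitrary hypothetical effect theta = 0.5.
h0_true = rng.random(n_experiments) < 0.5
theta = np.where(h0_true, 0.0, 0.5)
x = rng.normal(theta[:, None], 1.0, size=(n_experiments, n))
p = norm.sf(np.sqrt(n) * x.mean(axis=1))

# Among the experiments whose P-value lands near 0.05, which fraction
# actually comes from a true H0?
near_5pct = (p > 0.04) & (p < 0.06)
print("fraction of true H0 among p ~ 5%:", h0_true[near_5pct].mean())
```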
