Uncertainty and collective science

22 mai 2011

There are some things that you, the reader of this preface, know to be true, and others that you know to be false; yet, despite this extensive knowledge that you have, there remain many things whose truth or falsity is not known to you. We say that you are uncertain about them. You are uncertain, to varying degrees, about everything in the future; much of the past is hidden from you; and there is a lot of the present about which you do not have full information. Uncertainty is everywhere and you cannot escape from it. Truth and falsity are the subjects of logic, which has a long history going back at least to classical Greece. The object of this book is to tell you about work that has been done in the twentieth century about uncertainty.


Research is carried out by individuals and often the best research is the product of one person thinking deeply on their own. For example, relativity is essentially the result of Einstein’s thoughts. Yet, in a sense, the person is irrelevant, for most scientists feel that if he had not discovered relativity, then someone else would; that relativity is somehow ‘‘out there’’ waiting to be revealed, the revelation necessarily being made by human beings but not necessarily by that human being. This may not be true in the arts so that, for example, if Shakespeare had not written his plays it would not follow that someone else would have produced equivalent writing. Science is a collective activity, much more so than art, and although some scientists stand out from the rest, the character of science depends to only a very small extent on individuals and what little effect they have disappears over time as their work is absorbed into the work of others.

Dennis Lindley, Understanding uncertainty

Je ne sais pas pour vous, mais moi, quand je lis ça, j’ai envie de m’installer dans un fauteuil et lire toute la journée.

Do you know what heritability is, do you?

9 mai 2011

Genetics is the study of genes. When they were « invented » (i.e. conceptualized) in the 19th century, genes were defined as units of heredity. Thanks to the revolution of molecular biology, we now know that a gene is a section of DNA (although it is still possible to debate on the precise meaning of this statement). But what I am interested here, is to ask a few questions about quantitative genetics, and especially heritability.

Indeed, this notion of heritability is key in genetics (and, by extension, in biology as a whole) but it is often obscure for many biologists (notably for me, until I decided to seriously read some papers, and thus write this post to be sure I understood properly). And you, the reader, even if you think you know everything about heritability, I hope it can still be worth the reading.

* * *

But let’s start with basics: Gregor Mendel and the birth of genetics. This Austrian monk was interested in inheritance (passing of traits from parents to offsprings). To study this phenomenon, he worked with pea plants and chose seven characters to study, but let’s focus on one: seed color.

Mendel observed that some pea plants had green seeds while others had yellow seeds, we will speak of a « green » phenotype and a « yellow » phenotype. Intrigued about what would happen with offsprings, he started by crossing the peas having green seeds among themselves during several generations, idem for the peas with yellow seeds (selfing is possible in this species). After doing this for several generations, he always observed that peas with the « green » phenotype always had offsprings with the « green » phenotype, and reciprocally for the « yellow » phenotype. These offsprings resulting of several generations of selfing were called « pure lines« , as they were obtained by crossing plants with the same phenotype.

Then, Mendel went on to cross a yellow male plant (denoted as P1) with a green female plant (denoted as P2), and named the offsprings the F1 generation. He observed that all F1 plants had green seeds, as if the yellow material disappeared. He also observed such a result when doing the reciprocal cross: a green male plant crossed with yellow female plant gave green offsprings. He therefore decided to qualify the « green » phenotype as dominant and the « yellow » phenotype as recessive.

Then, he continued his experiment, and crossed F1 plants together to obtain F2 plants. And here, what is interesting, is that some F2 plants had the « yellow » phenotype (although most had the « green » one): as if the « yellow » material, somehow, jumped from the P generation to the F2 generation…! At this point, Mendel had the genial idea of counting the plants: he found that 3/4 had the « green » phenotype and 1/4 had the « yellow » phenotype, i.e. a 3:1 ratio.

In front of such a striking observation, Mendel went on to characterize the F2 generation. And here again, he observed something strange: when crossing « green » F2 with themselves, some had only « green » offsprings but some had a mix of « green » and « yellow » offsprings, here also in a 3:1 ratio. This was not the case when selfing the F2 « yellow » plants. Therefore, the F2 « yellow » plant seemed to be « pure » as the P « yellow » plants, but 2/3 of the F2 « green » plants were like the F1 plants, and 1/3 were « pure » like the P « green » plants.

Mendel then realized that the behind the 3:1 ratio was a more fundamental 1:2:1 ratio, 3/4 F2 green is in fact 1/4 « pure » green and 2/4 « impure » green, whereas 1/4 F2 yellow is 1/4 « pure » yellow:

From all this, Mendel was able to develop his theory, summarized in the 5 points below:

  1. existence of genes, i.e. discrete units (« atomic particles ») of heredity;
  2. genes are in pairs, i.e. a gene may have different forms called alleles;
  3. each gamete carries only one member of each gene pair;
  4. the members of each gene pair segregate equally into the gametes;
  5. random fertilization, i.e. gametes combine together to form an organism without regard to which allele is carried.

Now let’s recapitulate by noting « A » the « green » allele, and « a » the « yellow » allele. In the case of the pea described above, Mendel hypothesized that the P « green » plants had an « AA » gene pair, called their genotype, while the P « yellow » plants had the « aa » genotype. As each parent can make only one kind of gametes, « A » for the green and « a » for the yellow, the F1 plants must have the « Aa » genotype. But these plants can make « A » gametes as well as « a » gametes, in equal proportions. This gives rise to F2 plants, 1/4 have the « AA » genotype and thus have the green phenotype, 1/2 have the « Aa » genotype and thus the green phenotype also, and 1/4 have the « aa » genotype and therefore the yellow phenotype:

This is Mendel’s explanation of the 1:2:1 ratio. It seems likely that he got it by imagining that F1 needed to have a « bit » of yellow as well as a « bit » of green somewhere. But still, this will be, forever, one of the greatest scientific discovery…

* * *

Ok, now, let’s look at another classical example which, at the end, introduces quantitative genetics, and from there on, heritability. We are now in the early 20th century and the American geneticist, Edward M. East, also has fun with crossing plants. He worked with a species related to tobacco, some of the plants having a long corolla, ~90mm, some having a short one, ~40mm. As Mendel, he started by selfing plants with a long corolla, idem for the plants with a short corolla. Thus, after several round of selfing, he got two « pure » lines: one pure line with a long corolla (we will call them the P1), and one with a short one (P2). He crossed plants from P1 with plants from P2 and got F1 plants having a medium-size corolla, ~65mm.

Until here, everything seemed fine. But when he crossed F1 together to obtain F2 plants, he didn’t get the 3:1 ratio of corolla length. Instead, he got plants with a corolla of ~65mm on average, and a much larger variability (the appropriate mathematical term is variance). To understand a bit more what happened, he chose F2 plants with a small corolla, crossed them, and obtained F3 plants also with a small corolla. When he crossed F2 plants with a medium corolla, he also obtained F3 plants with a medium corolla. And so on (see the picture below).

This means that the inheritance of the corolla length has indeed a genetic component, but not as simple as the Mendelian case seen above. It is very likely that, instead of a single one, it is several genes that influence the length of the corolla.

Let’s imagine that 5 genes are involved, each with two alleles, + and -, and that the each + allele lengthens the corolla by 1mm, whereas each – allele shortens the corolla by 1mm. The P1 plants with a small corolla are likely to have only the – allele at each of the 5 genes, whereas the P2 plants with the long corolla have the + allele at each of the 5 gene. As usual, the F1 plants have 5 – alleles, coming from their P1 parent, and 5 + alleles from their P2 parent. However, when producing the F2 generation, recombination occurs in all gametes from the F1 parents, and thus F2 offsprings will have different numbers of + and – alleles. Consequently, F2 offsprings will have a wider range of corolla lengths than their F1 parents.

There comes quantitative genetics: a trait is called quantitative if it varies continuously. The seed color of Mendel was a discrete trait (qualitative), while the corolla length is a continuous trait (quantitative). And therefore, quantitative genetics is the study of the genetic basis of quantitative traits. Why is it so important? Well, the overwhelming majority of traits are quantitative…

Here are the questions that quantitative geneticists try to answer:

  • Is the observed variation in a character influenced at all by genetic variation? (versus environmental variation)
  • If there is genetic variation, what are the norms of reaction of the various genotypes?
  • How important is genetic variation as a source of total phenotypic variation?
  • Do many loci (or only a few) contribute to the variation in the character? How are they distributed throughout the genome?

* * *

I won’t answer all of them because it would be way too long (!), and in these times of whole-genome sequencing, lots of ongoing research is still underway… But I can still say a few words more.

Misconception: what differentiates a Mendelian trait from a quantitative one is the number of genes involved.

Although, the earlier part may suggest it, this is false. The critical difference between Mendelian and quantitative traits is not the number of segregating loci, but the size of phenotypic differences between genotypes compared with the individual variation within genotypic classes. The scheme below should make this clear. It represents the phenotypic distribution according to the genotype at a given bi-allelic locus, height being the phenotype of interest.

Hence, the definition of a quantitative character becomes: a quantitative character is one for which the average phenotypic differences between genotypes are small compared with the variation between individuals within genotypes.

* * *

After all this, what is heritability? Well, it’s simple. Let’s take a given phenotype, whatever it is (seed color, corolla length, hair color, disease status, growth rate, milk production…). The phenotype is the result of the interaction between genotype and environment. That is, a given genotype in a given environment may not lead to the same phenotype as the same genotype in a different environment, or as a different genotype in the same environment. That’s why the notion of reaction norm is essential.

But still, people always want to know how much genes contribute to some phenotypes. Is it possible? Here again, it’s tricky. Let’s take the example of two bricklayers building a wall. If both of them work in parallel, that is one builds the left of the wall and the other builds the right, it is possible to assess their respective contribution: we just have to count the number of bricks each made. But if now one makes mortar, and the other lays bricks, it is not possible anymore to compare their work (it would be absurd to do it). And it is the same for genes.

Thus, what do we do? Instead of trying to assess the contribution of genes to the phenotype (compare to the contribution of the environment), we can try to assess the contribution of genes to variations of the phenotype, using what statisticians call the analysis of variance. When looking at a given phenotype among many individuals, we try to partition the variability of their phenotype (\sigma^2_P) into a variability due to the fact that they have different alleles for the genes involved in the phenotype (\sigma^2_G), and into a variability due to the fact that they live in different environments (\sigma^2_E):

\sigma^2_P = \sigma^2_G + \sigma^2_E

Last but not least, the heritability H^2 can now be defined:

H^2 = \sigma^2_G / \sigma^2_P

The question “Is a trait heritable?” is a question about the role that differences in genes play in the phenotypic differences between individuals or groups of individuals.

Misconception: a high heritability means that a character is unaffected by the environment.

Hell, no! Because genotype and environment interact to produce the phenotype, no partition of variation into its genetic and environmental components can actually separate causes of variation. But a highly heritable trait means that the genetic component is much more important than the environmental component in contributing to the variation in phenotype.

* * *

There is still much to say of course, and one can find lots of very well explained details in the reference book Introduction to Genetic Analysis from Griffiths and colleagues. Indeed, I even allowed myself to scan some of its figures as they are very self-explanatory. (But  due to copyright issues, I will remove them if the authors and/or publisher ask me to do so.)

Proof that P-values under the null are uniformly distributed

22 avril 2011

I often hear in talks from statisticians that P-values are uniformly distributed under the null. But how can this be? And what does it mean? As the demonstration is pretty straightforward but nonetheless hard to find on the Internet, here it is.

Everything starts with an experiment (or at least with the observation of a natural phenomenon, be it part of an experiment or not). The aim is to assess whether or not the hypothesis we have about this phenomenon seems to be true. But first, let’s recall that a parametric test (see Wikipedia) is constituted of:

  • data: the n observations x_1, x_2, …, x_n are realizations of n random variables X_1, X_2, …, X_n assumed to be identically distributed;
  • statistical model: the probability distribution of the X_1, X_2, …, X_n depends on parameter(s) \theta;
  • hypothesis: an assertion concerning \theta, noted H_0 for the null (e.g. \theta=a), and H_1 for the alternative (e.g. \theta=b with b > a);
  • decision rule: given a test statistic T, if it belongs to the critical region C, the null hypothesis H_0 is rejected.

In practice, T follows a given distribution under H_0 (e.g. a Normal distribution, or a Student distribution) that does not depend on \theta but on n. We use the observations to compute a realization, noted t, of T.

The P-value, noted P, can be seen as a random variable, and its realization, noted p, depends on the observations. According to the notations, the formal definition of the P-value for the given observations is:

p = \mathbb{P} ( T \ge t | H_0 )

Therefore, according to Wikipedia, a P-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. According to Matthew Stephens (source), a p value is the proportion of times that you would see evidence stronger than what was observed, against the null hypothesis, if the null hypothesis were true and you hypothetically repeated the experiment (sampling of individuals from a population) a large number of times.

Very importantly, note that the 2nd definition emphasizes the fact that, although it is computed from the data, a P-value does not correspond to the probability that H_0 is true given the data we actually observed!!

p \ne \mathbb{P} ( H_0 | x_1, x_2,..., x_n )

A P-value simply gives information in the case we would repeat the experiment a large number of times… (That’s why P-values are often decried.)

Ok, back on topic now. From the formula above, we can also write:

p = 1 - \mathbb{P} ( T < t | H_0 )

By noting F_0 the cumulative distribution function (cdf, fonction de répartition in French) of T under H_0, we obtain:

p = 1 - F_0( t )

And here is the trick, thanks to the fact that the cdf is monotonic, increasing and (left-)continuous:

\mathbb{P} ( T \ge t | H_0 ) = \mathbb{P} ( F_0(T) \ge F_0(t) ) = 1 - \mathbb{P} ( F_0(T) < F_0(t) )

Therefore, we have:

\mathbb{P} ( F_0(T) < F_0(t) ) = F_0( t )

Which means that F_0(T) is following a uniform distribution. And, as this means also that 1 - F_0(T) is uniformly distributed, then we can conclude that P-values are uniformly distributed under the null hypothesis.


But what does it mean? Well, we usually consider a significance level, noted alpha (small, e.g. 5%, 1%, 0.1%…), and if the P-value falls below this threshold, we reject the null and decide that the alternative is significant. However, let’s say we re-do the same experiment N times and compute a P-value for each of them. Since P-values are uniformly distributed under the null, it is as likely to find some of them between 0.8 and 0.85 than to find some of them below 0.05, if H_0 is indeed true. That is, some of them will fall below the significance threshold, just by chance. The experiments corresponding to these P-values are called false-positives: we think they are positives, i.e. we decide to accept H_1, while in fact they are really false, i.e. H_0 is true and should not be rejected.

Last but not least, if we re-do the same experiment 100 times and consider a threshold of 5%:

  • if H_0 is false (although we are not supposed to know it before doing the experiment), how many P-values will fall below this threshold just by chance? 5, on average;
  • if now H_0 is supposed to be true 50% of the time, what proportion of P-values will be around 5\% +- \epsilon? at least 23%, and typically 50% (see the paper of Sellke et al in 2001). In other words, when H_0 is true 50% of the time, a P-value of 5% doesn’t tell us anything, as half of the experiments from which they were calculated correspond to a true H_0, and half to a false H_0

The linear model (2) – simple linear regression

25 novembre 2010

In this post, I explicitly derive the mathematical formulas of the linear model, and show how to compute them in practice with the R computer program. The first post of this series is available here (sorry, it’s in french): « le modèle linéaire (1) – présentation ».


We want to analyze the relationship between the rate of DDT in fishes (variable to explain y) and their age (explanatory variable x). And for this, without any hesitation, we’re gonna use R.


We have a sample of n = 15 fishes. For each fish, we have its age and its rate of DDT. For each age, we have three different fishes. Here is the whole dataset:

\begin{pmatrix} obs & age & rate \\ 1 & 2 & 0.20 \\ 2 & 2 & 0.25 \\ 3 & 2 & 0.18 \\ 4 & 3 & 0.19 \\ 5 & 3 & 0.29 \\ 6 & 3 & 0.28 \\ 7 & 4 & 0.31 \\ 8 & 4 & 0.33 \\ 9 & 4 & 0.36 \\ 10 & 5 & 0.71 \\ 11 & 5 & 0.38 \\ 12 & 5 & 0.47 \\ 13 & 6 & 1.10 \\ 14 & 6 & 0.87 \\ 15 & 6 & 0.83 \end{pmatrix}

Assuming this dataset is stored on a flat file named « data.csv », we can load it into R and start analyzing it:

d <- read.csv( "data.csv", header=TRUE )
summary( d )
n <- length( d$obs )
tapply( d$rate, d$age, mean )
tapply( d$rate, d$age, sd )
library( ggplot2 )
qplot( age, rate, data=d )
ggsave( "data_fishes.png" )

From this, we see that the older the fish, the higher its rate of DDT. Moreover, the older the fish, the more variable the rate. We can propose hypotheses about the relationship between age and rate, but we cannot conclude for sure of which kind it is. That’s why we need a statistical model, i.e. a formal way to quantify the uncertainty in the measurements and our confidence into several hypotheses describing the mechanisms behind the phenomenon under study.

Building the model

We assume that the natural phenomenon under study (the rate of DDT in fish) can be modeled by a continuous random variable Y linearly depending on a continuous factor x (the age of fish). That is, we should be able to represent the link between x and Y by a straight line. The observations Y_1, \ldots, Y_n are measured for several given values x_1, \ldots, x_n:

Y_i = a + b x_i + E_i \mbox{ with } E_i \mbox{ i.i.d.} \sim \mathcal{N} ( 0, \sigma^2 )

  • i is the index representing the identifier of the fish (i=1 \ldots 15);
  • Y_i is a random variable corresponding to the rate of the ith fish;
  • x_i is the age of the ith fish (not random);
  • E_i is a random variable for the errors, following a Normal distribution with mean 0 and variance \sigma^2;
  • \sigma^2 is the variance of the E_i, i.e. an unknown parameter we want to estimate (\sigma is called the standard deviation);
  • a is a constant corresponding to the intercept of the regression line, i.e. an unknown parameter we want to estimate;
  • b is a constant corresponding to the regression coefficient (slope of the regression line), i.e. an unknown parameter we want to estimate;
  • « i.i.d. » stands for « independent and identically distributed ».

This model means that we decompose the value of the rate Y in two parts:

  • one explained by x, a + b x_i, the fixed part;
  • one left unexplained, E_i, the random part.

As a result, Y follows a normal distribution (also called Gaussian), and we can write, for a given x, Y \sim \mathcal{N} ( a + b x, \sigma^2 ), the Y_i being independents:

  • expectation: \mathbb{E}(Y_i) = \mathbb{E}(a + bx_i + E_i) = a + bx_i
  • variance: \mathbb{V}(Y_i) = \mathbb{V}(a + bx_i + E_i) = \mathbb{V}(E_i) = \sigma^2
  • covariance: \mathbb{C}ov(Y_i,Y_j) = \mathbb{C}ov(E_i,E_j) = 0

The term « linear » in the expression « linear model » means that « linear » applies to the parameters:

  • Y_i = a + b x_i + c x_i^2 + E_i is a linear model although the relation between Y and x is polynomial;
  • Y_i = a + b cos(x_i) + E_i is a linear model;
  • Y_i = a e^{b x_i} + epsilon_i is not a linear model as a e^{b_1+b_2x} \ne a e^{b_1} + ae^{b_2x};
  • Y_i = \frac{a}{b + x_i} + \epsilon_i is not a linear model.

Deriving estimators of the parameters

We will estimate the 3 unknown parameters, a, b and \sigma^2, by finding their values that maximize the likelihood \nu of the parameters given the data. For a given set of parameters, the likelihood is the probability of obtaining the data given these parameters, and we want to maximize this probability:

\nu = \mathbb{P}( data | parameters )

As Y \sim \mathcal{N} ( a + b x, \sigma^2 ), the Y_i have a probability density f:

f( y_i; a, b, \sigma | x_i ) = \frac{1}{\sigma \sqrt{2\pi}} exp\{ - \frac{1}{2 \sigma^2} [ y_i - ( a + b x_i ) ]^2 \}

And as the Y_i are independents, we can write the likelihood \nu as a product:

\nu = Pr( y_1, \ldots y_n; a, b, \sigma | x_1, \ldots x_n )

\nu = \prod_{i=1}^{n} f( y_i; a, b, \sigma | x_i )

\nu = (\frac{1}{\sigma \sqrt{2\pi}})^n \prod_{i=1}^{n} exp\{ - \frac{1}{2 \sigma^2} [ y_i - ( a + b x_i ) ]^2 \}

We note \mathcal{L} the logarithm of the likelihood:

\mathcal{L} = log( \nu )

\mathcal{L} = -nlog(\sigma \sqrt{2 \pi}) - \sum_{i=1}^{n} \frac{[y_i - (a+bx_i)]^2}{2 \sigma^2}

\mathcal{L} = -\frac{n}{2} log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2}\sum_{i=1}^{n} [y_i - (a+bx_i)]^2

To find the values of a, b and \sigma^2 that maximize \mathcal{L}, we only need to set to zero the partial derivatives of \mathcal{L} with respect to (w.r.t.) each parameter:

\frac{\partial \mathcal{L}}{\partial a} = 0

\Leftrightarrow \frac{\partial}{\partial a}( \sum_{i=1}^{n} [y_i - (a+bx_i)]^2 ) = 0

\Leftrightarrow \frac{\partial}{\partial a}( \sum_{i=1}^{n} [y_i^2 - 2y_i(a+bx_i) + (a+bx_i)^2] ) = 0

\Leftrightarrow \sum_{i=1}^{n} ( -2y_i + 2a + 2bx_i ) = 0

\Leftrightarrow \sum_{i=1}^{n} [ y_i - (a+bx_i) ] = 0

\Leftrightarrow \bar{y} - (a+b\bar{x}) = 0

We can thus deduce the maximum-likelihood estimator (MLE) A of parameter a:

A = \bar{Y} - B \bar{x}

Similarly for b:

\frac{\partial \mathcal{L}}{\partial b} = 0

\Leftrightarrow \sum_{i=1}^{n} [ y_i - (a+bx_i) ] x_i = 0

\Leftrightarrow \sum_{i=1}^{n}( x_i y_i ) - a \sum_{i=1}^{n} x_i - b \sum_{i=1}^{n} x_i^2 = 0

When replacing a by its value obtained previously, we obtain:

\Leftrightarrow \sum_{i=1}^{n}( x_i y_i ) - (\bar{y} - b\bar{x}) \sum_{i=1}^{n} x_i - b \sum_{i=1}^{n} x_i^2 = 0

…<to do>…

We can thus deduce the maximum-likelihood estimator B of parameter b:

B = \frac{ \sum_{i=1}^{n} (x_i -\bar{x} )( Y_i - \bar{Y} )}{ \sum_{i=1}^{n} (x_i -\bar{x} )^2 }

And finally for \sigma^2:

\frac{\partial \mathcal{L}}{\partial \sigma^2} = 0

\Leftrightarrow -\frac{n}{2 \sigma^2} + \frac{1}{2 \sigma^4}\sum_{i=1}^{n} [ y_i - (a + bx_i) ]^2 = 0

We can thus deduce the maximum-likelihood estimator S_n^2 of parameter \sigma^2:

S_n^2 = \frac{1}{n} \sum_{i=1}^{n} [ Y_i - (A + Bx_i) ]^2

Now, for each parameter again, we should compute the second derivatives of the likelihood and check that they are positives. This is to be sure that the formulas we obtained above with the first derivatives truly correspond to maxima and not minima. But I’m feeling a bit lazy here… Whatsoever, it works.

Properties of the estimators

The estimators A and B are linear combinations of independent Gaussian variables, the Y_i, thus both follow a Gaussian distribution: A \sim \mathcal{N}(E_A, S_A^2) and B \sim \mathcal{N}(E_B, S_B^2). We now need to derive the expectations and variances of these distributions.

First, let’s define a new quantity:

c_i = \frac{x_i - \bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

This new term is useful because it is not random, which facilitates the derivations of expectations and variances, as you will see below.

First formula:

\sum_{i=1}^{n} c_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} ( \sum_{i=1}^{n}x_i - \sum_{i=1}^{n} \bar{x} )

\sum_{i=1}^{n} c_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} ( n \bar{x} - n \bar{x} )

\sum_{i=1}^{n} c_i = 0

Second formula:

\sum_{i=1}^{n} c_i x_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sum_{i=1}^{n} ( x_i - \bar{x})x_i


\sum_{i=1}^{n} (x_i - \bar{x})\bar{x} = 0


\sum_{i=1}^{n} c_i x_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sum_{i=1}^{n} ( x_i - \bar{x})( x_i - \bar{x})


\sum_{i=1}^{n} c_i x_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sum_{i=1}^{n}( x_i - \bar{x})^2

\sum_{i=1}^{n} c_i x_i = 1

Third formula:

B = \sum_{i=1}^{n} c_i (Y_i - \bar{Y})

B = \sum_{i=1}^{n} c_i Y_i - \bar{Y} \sum_{i=1}^{n} c_i

B = \sum_{i=1}^{n} c_i Y_i

Now, let’s calculate E_B:

E_B = \mathbb{E} ( B )

E_B = \sum_{i=1}^{n} c_i \mathbb{E}( Y_i )

E_B = \sum_{i=1}^{n} c_i ( a + bx_i )

E_B = a \sum_{i=1}^{n} c_i + b \sum_{i=1}^{n} c_i x_i

The B estimator has no bias: \mathbb{E}(B) = b.

Now, let’s calculate E_A:

E_A = \mathbb{E} ( A )

E_A = \mathbb{E} ( \bar{Y} - B \bar{x} )

E_A = \mathbb{E} ( \bar{Y} ) - \bar{x} \mathbb{E} ( B )

E_A = \frac{1}{n} \mathbb{E} ( \sum_{i=1}^n Y_i ) - b \bar{x}

E_A = \frac{1}{n} \sum_{i=1}^n ( \mathbb{E} ( Y_i ) ) - b \bar{x}

E_A = \frac{1}{n} \sum_{i=1}^n ( a + bx_i ) - b \bar{x}

The A estimator has no bias: \mathbb{E}(A) = a.

Now let’s calculate S_B^2:

\mathbb{V}(B) = S_B^2

\mathbb{V}(B) = \mathbb{V}( \sum_{i=1}^n c_i Y_i )

\mathbb{V}(B) = \sum_{i=1}^n c_i^2 \mathbb{V}( Y_i )

\mathbb{V}(B) = \sigma^2 \sum_{i=1}^n \frac{(x_i - \bar{x})^2}{(\sum_{i=1}^n (x_i-\bar{x})^2)^2}

\mathbb{V}(B) = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2} \sum_{i=1}^n \frac{(x_i - \bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}

\mathbb{V}(B) = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}

Now let’s calculate S_A^2:

\mathbb{V}(A) = S_A^2

\mathbb{V}(A) = \mathbb{V}(\bar{Y} - B\bar{x})

\mathbb{V}(A) = \mathbb{V}( \frac{1}{n} \sum_{i=1}^n Y_i - \bar{x} \sum_{i=1}^n c_i Y_i )

\mathbb{V}(A) = \mathbb{V}( \sum_{i=1}^n (\frac{1}{n} - c_i \bar{x}) Y_i)

\mathbb{V}(A) = \sum_{i=1}^n (\frac{1}{n} - c_i \bar{x})^2 \sigma^2

\mathbb{V}(A) = \sigma^2 \sum_{i=1}^n ( \frac{1}{n^2} -2\frac{c_i\bar{x}}{n} + c_i^2\bar{x}^2 )

\mathbb{V}(A) = \sigma^2 ( \frac{1}{n} -2\frac{\bar{x}}{n}\sum_{i=1}^n c_i + \bar{x}^2 \sum_{i=1}^n c_i^2 )

\mathbb{V}(A) = \sigma^2 ( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} )

Now, let’s calculate \mathbb{E}(S_n^2):

…<to do>…

The S_n^2 estimator of \sigma^2 is biased: \mathbb{E}(S_n^2) = (n-2)\frac{\sigma^2}{n}.

Thus, we rather use the following, unbiased estimator for \sigma^2:

S_{n-2}^2 = \frac{n}{n-2} S_n^2

S_{n-2}^2 = \frac{1}{n-2} \sum_{i=1}^n [ Y_i - (A+Bx_i)]^2

All these estimators are called « least-square estimators » because they minimize the sum of prediction squared errors:

\sum_{i=1}^n E_i^2 = \sum_{i=1}^n [ Y_i - (a+bx_i)]^2

Also, S_{n-2} (i.e. the square root of S_{n-2}^2) is also called the standard error of the regression.

Thanks to the formulas above for A, B and S^2_{n-2}, we can compute the estimations \hat{a}, \hat{b} and \hat{\sigma^2} of the parameters a, b and \sigma^2. These estimations are thus also estimations of E_A and E_B. But how to estimate S_A and S_B?

In fact, we just replace \sigma^2 by S_{n-2}^2 in the formulas of \mathbb{V}(A) and \mathbb{V}(B):

S_A^2 = S_{n-2}^2 ( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} )

S_B^2 = \frac{S_{n-2}^2}{\sum_{i=1}^n (x_i-\bar{x})^2}

Moreover, for each estimation, it is often very useful to compute a confidence interval. Such an interval indicates that, if we were to re-do the whole experiment 100 times, the true value of the parameter a would fall in it 95% of the times (corresponding to a level \alpha=5\%).

We know that both (A-a)/S_A and (B-b)/S_B follow a Student distribution \mathcal{T} with n-2 degrees of freedom. This allows to compute confidence intervals:

CI_{\alpha}(a) = [ \hat{a} - t_{n-2;1-\alpha/2} s_A ; \hat{a} + t_{n-2;1-\alpha/2} s_A ]

CI_{\alpha}(b) = [ \hat{b} - t_{n-2;1-\alpha/2} s_B ; \hat{b} + t_{n-2;1-\alpha/2} s_B ]

where \hat{a}, \hat{b}, s_A, s_B are realizations of A, B, S_A, S_B, and the t‘s are quantiles taken from the Student cumulative distribution function.

Predicting values

It is possible with this model, for a given x=x_i, to predict a value for Y. We named T_i such a prediction that corresponds to an estimator of the expectation of Y given x_i:

T_i = A + Bx_i , estimator of \mathbb{E}(Y | x=x_i)

We also note F_i the calculated residual:

F_i = Y_i - T_i

The difference between E_i and F_i is that E_i is an unobserved random variable whereas F_i is a residual computed thanks to the estimators A and B.

For x=x_0, T_0 follows a normal law (as a linear combination of a Gaussian vector), and:

\mathbb{E}(T_0) = a +bx_0

\mathbb{V}(T_0) = \sigma^2 (\frac{1}{n} + \frac{(x_0-\bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2})

As an estimation of the expectation, we have: t_0 = \hat{a} + \hat{b}x_0.

And to estimate the variance, we do as above:

S_{T_0}^2 = S_{n-2}^2 ( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} )

Similarly, to compute a confidence interval for \mathbb{E}(Y | x=x_0):

CI_{\alpha}(T_0) = [ t_0 - t_{n-2;1-\alpha/2} s_{T_0} ; t_0 + t_{n-2;1-\alpha/2} s_{T_0} ]

Moreover, we can also compute a prediction interval. For Y_0 a random variable whose values correspond to the results we can observe for Y given that x=x_0, we have:

Y_0 = a + bx_0 + E_0


\hat{Y_0} = A+Bx_0+E_0 = T_0+E_0

\mathbb{V}(\hat{Y_0}) = \mathbb{V}(T_0) + \mathbb{V}(E_0) = \mathbb{V}(T_0) + \sigma^2

And the prediction interval is:

PI_{\alpha}(Y_0) = [ t_0 - t_{n-2;1-\alpha/2} s_{\hat{Y}_0} ; t_0 + t_{n-2;1-\alpha/2} s_{\hat{Y}_0} ]


s_{\hat{Y}_0}^2 = s_{T_0}^2 + s_{E_0}^2 = s_{n-2}^2( 1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2} )

The prediction interval is always wider than the confidence interval but, as the estimation of \sigma^2 by S_{n-2}^2 gets more precise (e.g. with more sampled data), the predictions will be more precise also.

Checking the hypotheses of the model

However, even if we spent hours writing equations, we first need to be sure that all the hypotheses of our model are verified, otherwise, no way of being confident in the results (i.e. in the statistical inference). Here are the hypotheses:

  • the relationship between the outcome and the explanatory variable(s) is linear;
  • the error terms E_i have the same variance \sigma^2;
  • the error terms E_i are independents;
  • the error terms E_i follow a Gaussian distribution.

Basically, we need to look at the estimations of the error terms (\hat{E}_i=F_i=Y_i-\hat{Y}_i) as a function of the predicted values(\hat{Y}_i = \hat{a} + \hat{b} x_i).

Therefore, first we record the results of the linear regression, and then we plot the residuals versus the fitted values:

mod1 <- lm( rate ~ age, data=d )
ggplot( mod1, aes(.fitted, .resid) ) + geom_hline( yintercept=0 )  + geom_point() + geom_smooth( se=F )
ggsave( "data_fishes_mod1_residuals-fitted.png" )

Clearly, the residuals are « structured », i.e. there is a tendency, which indicates that a relevant term was not considered in the modeling of \mathbb{E}(Y). Moreover, the residuals don’t have the same variance (heteroscedasticity). Therefore, we can’t carry on with this model, let’s modify it:

log(Y_i) = a + bx_i + E_i

again with E_i \mbox{ i.i.d.} \sim \mathcal{N}(0, \sigma^2) but these E_i are different from model 1.

mod2 <- lm( log10(rate) ~ age, data=d )
ggplot( mod2, aes(.fitted, .resid) ) + geom_hline( yintercept=0 )  + geom_point() + geom_smooth( se=F )
ggsave( "data_fishes_mod2_residuals-fitted.png" )

The variance was stabilized but the residuals are still slightly structured. But let’s keep this model as, although the blue line is similarly convex as in the previous model, the y-axis has a much zoomed-in scale. Therefore, our final model is: log(Y_i) = a + bx_i + E_i with E_i i.i.d. \sim \mathcal{N}(0, \sigma^2). To ease the equations, we will use Z_i = log(Y_i). Now, the model is Z_i = a + bx_i + E_i.

Testing the model

The first test aims at deciding if the relationship between the outcome Z_i and its explanatory variable x_i is truly linear. The null hypothesis is:

H_0: { Z_i = a + E_i } (no relation between Z_i and x_i)

and the alternative hypothesis is:

H_1: { Z_i = a + bx_i + E_i } (there is a relation between Z_i and x_i, which is linear)

To test this, we decompose the variance of the data into a part explained by the model and the rest being residual:

\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{i=1}^n (\hat{Z}_i - \bar{Z})^2 + \sum_{i=1}^n (Z_i - \hat{Z}_i)^2

The equation above corresponds to decomposing the total sum of squares (TSS) into a model sum of squares (MSS) and an residual sum of squares (RSS):


Each sum of squares follows a chi-squared distribution (\chi^2). Such a distribution is characterized by a parameter called « degrees of freedom » (df):

df_{TSS} = \mbox{observations } - 1 = n - 1 = 14

df_{MSS} = \mbox{explanatory variables} = p - 1 = 1

df_{RSS} = df_{TSS} - df_{MSS} = n - p = 13

To test the H_0 hypothesis, we use the Fisher’s F statistics:

F = \frac{MSS / df_{MSS}}{RSS / df_{RSS}}

This statistics can be interpreted as a variance ratio:

F = \frac{\mbox{variance explained by x}}{\mbox{residual variance}}

Under hypothesis H_0, this statistics follows a Fischer distribution with parameters (degrees of freedom) p-1 and n-p. We reject the null hypothesis, H_0, at the level \alpha (typically 5%), if:

F > f_{p-1,n-p,1-\alpha}

i.e. if F is higher than the quantile f of order 1-\alpha from a Fisher law \mathcal{F}_{p-1,n-p}.

We can also compute a P-value that measures the agreement between the tested hypothesis and the obtained result, i.e. the probability to draw a value from a \mathcal{F}_{p-1,n-p} distribution that is higher than F:

\mbox{P-value} = \mathbb{P}( \mathcal{F}_{p-1,n-p} > F )

The smaller the P-value, the the stronger the disagreement between the null hypothesis and the results of the experiment. Usually we reject H_0 when the P-value is smaller than \alpha=5\%.

In our case, we can compute all the sum of squares as well as the Fisher statistics:

MSS <- sum( ( mod2$fitted.values - mean(log10(d$rate)) )^2 )
RSS <- sum( ( log10(d$rate) - mod2$fitted.values )^2 )
TSS <- sum( ( log10(d$rate) - mean(log10(d$rate)) )^2 )
F <- (MSS / 1) / (RSS / 13)
f <- qf( p=0.95, df1=1, df2=13 )
Pval <- pf( q=F, df1=1, df2=13, lower.tail=FALSE )

We obtain MSS=0.77003, RSS=0.11998, F=83.43f=4.67 and P-value=5.092.10^{-7}.

It is also possible to have all these results in a simple way (although it is always good to know how to compute these quantities by oneself):

summary( mod2 )
anova( mod2 )

These results show that the variability explained by the model is far greater than the residual variability (F>80 and P-value << 5\%). Thus, we can reject the null hypothesis and consider that there exists a linear relationship between the log of the DDT rate in fishes and their age.

Assessing the quality of the model

It is important to assess the adjustment of the model to the data as we may use it to predict the value of Y knowing the value of x.For this purpose, we compute the R-square, R^2, that corresponds to the proportion of the variability in Z explained by the model (in percentage).

R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS}

However, it is more relevant to calculate the adjusted R-square, i.e. to divide it by the number of explanatory variables. Indeed, with more explanatory variables, the adjustment will be better but we may end with a model that is over-adjusted:

\mbox{adjusted-}R^2 = 1 - \frac{RSS/(n-p)}{TSS/(n-1)}

R2 <- 1 - RSS/TSS
R2.adj <- 1 - (RSS/13)/(TSS/14)

We obtain R^2=86.5\% and R^2adj=85.5\%. It means that 85% of the variability of the log of DDT rate in fishes is explained by their age.

Estimating the parameters

According to the formulas above of A, B and S_{n-2}^2, we can compute estimations of the parameters:

b <- sum( (d$age - mean(d$age)) * (log10(d$rate) - mean(log10(d$rate))) ) / sum( (d$age - mean(d$age))^2 )
a <- mean(log10(d$rate)) - b * mean(d$age)
sn2 <- 1/(n-2) * sum( (log10(d$rate) - (a + b*d$age))^2 )

Know that \bar{Z}=-0.419 and \bar{x}=4, we get \hat{b}=0.16021, \hat{a}=-1.06005 and S_{n-2}^2=0.00923.

Here is the equation of the regression line: z = \hat{a} + \hat{b}x = 0.16x - 1.06. It means that, in a year, the log of the DDT rate increases 0.16 times. We can also plot this line with the data:

qplot( age, log10(rate), data=d ) + geom_abline( intercept=a, slope=b, colour="red" )
ggsave( "data_fishes_mod2_regline.png" )

Are our estimates precise? Let’s compute the variances and confidence intervals for \hat{a} and \hat{b}:

s.a <- sn2 * ( 1/n + mean(d$age)^2 / sum( ( d$age - mean(d$age))^2 ) )
s.b <- sn2 / sum( ( d$age - mean(d$age) )^2 )
alpha <- 0.05
ci.a <- c( a - qt( p=1-alpha/2, df=n-2 ) * s.a, a + qt( p=1-alpha/2, df=n-2 ) * s.a )
ci.b <- c( b - qt( p=1-alpha/2, df=n-2 ) * s.b, b + qt( p=1-alpha/2, df=n-2 ) * s.b )

We get S_A^2=0.00553, S_B^2=0.000307, CI_{5\%}(a)=(-1.072, -1.048) and CI_{5\%}(b)=(0.1595,0.1608). These values show that our estimates are quite precise.

Although it seems almost certain when looking at the confidence intervals, we can still wonder if we can reject the null hypotheses according to which the a and b parameters equal zero: \mathcal{H}_0: a=b=0. Note that the test for b=0 is equivalent to the test made previously with \mathcal{H}_0: Z_i = a + E_i.

test.a <- a / sqrt( s.a )
test.b <- b / sqrt( s.b )
qt( 0.975, df=13 )^2 == qf( 0.95, df1=1, df2=13 )
Pval.a <- 2 * pt( q=test.a, df=n-2 )
Pval.b <- 2 * pt( q=test.b, df=n-2, lower.tail=FALSE )

We obtain \hat{a}/\hat{\sigma}_a=-14.245, \hat{b}/\hat{\sigma}_b=9.134, t_{1-\alpha/2}^2=f_{1-\alpha;p-1;n-p}, P-value_a=2.61.10^{-9} and P-value_b=5.09.10^{-7}. The P-values are far below 5%. Thus we reject both null hypotheses and conclude that:

  • the log of the DDT rate for a fish that is just born is significantly different from zero (a \ne 0);
  • the log of the DDT rate for a fish is linearly linked to its age(b \ne 0). And as \hat{b} > 0, this is a growing relationship.

Confidence and prediction intervals for the variable to explain

We can compute such interval for the values of the DDT rate from each fish:

s2.t <- sn2 * ( 1/n + (d$age-mean(d$age))^2 / sum((d$age-mean(d$age))^2) )
ci.y <- cbind( a+b*d$age - qt(p=1-alpha/2,df=n-2) * sqrt(s2.t), a+b*d$age + qt(p=1-alpha/2,df=n-2) * sqrt(s2.t) )
s2.y <- sn2 + s2.t
pi.y <- cbind( a+b*d$age - qt(p=1-alpha/2,df=n-2) * sqrt(s2.y), a+b*d$age + qt(p=1-alpha/2,df=n-2) * sqrt(s2.y) )
pl <- qplot( age, log10(rate), data=d ) + geom_abline( intercept=a, slope=b, colour="red" )
pl + geom_line( aes(x=d$age, y=ci.y[,1], colour="low CI") ) + geom_line( aes(x=d$age, y=ci.y[,2], colour="high CI") ) + geom_line( aes(x=d$age, y=pi.y[,1], colour="low PI") ) + geom_line( aes(x=d$age, y=pi.y[,2], colour="high PI") ) + scale_colour_manual( "Intervals", c("blue","blue","green","green") ) + opts( legend.position="none" )
ggsave( "data_fishes_mod2_regline-intervals-nolegend.png" )

Here is the final plot with the regression line in red, the lines of the confidence interval in blue and the lines of the prediction interval in green:

That’s it!

Sources: Statistique inférentielle de Daudin, Robin et Vuillet (sur Amazon.fr), Exemples d’applications du modèle linéaire de Lebarbier et Robin (here)

%d blogueurs aiment cette page :