## The linear model (2) – simple linear regression

In this post, I explicitly derive the mathematical formulas of the linear model and show how to compute them in practice with R. The first post of this series is available here (sorry, it's in French): « Le modèle linéaire (1) – présentation ».

### Aim

We want to analyze the relationship between the rate of DDT in fish (variable to explain, $y$) and their age (explanatory variable, $x$). And for this, without any hesitation, we're gonna use R.

### Data

We have a sample of $n = 15$ fish. For each fish, we have its age and its rate of DDT. For each age, we have three different fish. Here is the whole dataset:

| obs | age | rate |
|----:|----:|-----:|
| 1 | 2 | 0.20 |
| 2 | 2 | 0.25 |
| 3 | 2 | 0.18 |
| 4 | 3 | 0.19 |
| 5 | 3 | 0.29 |
| 6 | 3 | 0.28 |
| 7 | 4 | 0.31 |
| 8 | 4 | 0.33 |
| 9 | 4 | 0.36 |
| 10 | 5 | 0.71 |
| 11 | 5 | 0.38 |
| 12 | 5 | 0.47 |
| 13 | 6 | 1.10 |
| 14 | 6 | 0.87 |
| 15 | 6 | 0.83 |
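If you do not have the file yet, the dataset is small enough to be rebuilt directly in R and written to « data.csv » (a minimal sketch; the column names `obs`, `age` and `rate` match the table above):

```r
# Rebuild the dataset from the table above and write it to "data.csv"
d <- data.frame(
  obs  = 1:15,
  age  = rep( 2:6, each=3 ),
  rate = c( 0.20, 0.25, 0.18,  0.19, 0.29, 0.28,  0.31, 0.33, 0.36,
            0.71, 0.38, 0.47,  1.10, 0.87, 0.83 )
)
write.csv( d, file="data.csv", row.names=FALSE )
```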

Assuming this dataset is stored in a flat file named « data.csv », we can load it into R and start analyzing it:

d <- read.csv( "data.csv" )
summary( d )
n <- length( d$obs )
tapply( d$rate, d$age, mean )
tapply( d$rate, d$age, sd )
library( ggplot2 )
qplot( age, rate, data=d )
ggsave( "data_fishes.png" )

From this, we see that the older the fish, the higher its rate of DDT. Moreover, the older the fish, the more variable the rate. We can propose hypotheses about the relationship between age and rate, but we cannot conclude for sure which kind it is. That's why we need a statistical model, i.e. a formal way to quantify the uncertainty in the measurements and our confidence in the various hypotheses describing the mechanisms behind the phenomenon under study.

### Building the model

We assume that the natural phenomenon under study (the rate of DDT in fish) can be modeled by a continuous random variable $Y$ linearly depending on a continuous factor $x$ (the age of the fish). That is, we should be able to represent the link between $x$ and $Y$ by a straight line. The observations $Y_1, \ldots, Y_n$ are measured for several given values $x_1, \ldots, x_n$:

$Y_i = a + b x_i + E_i \mbox{ with } E_i \mbox{ i.i.d.} \sim \mathcal{N} ( 0, \sigma^2 )$

• $i$ is the index identifying the fish ($i=1 \ldots 15$);
• $Y_i$ is a random variable corresponding to the rate of the $i$th fish;
• $x_i$ is the age of the $i$th fish (not random);
• $E_i$ is a random variable for the errors, following a Normal distribution with mean $0$ and variance $\sigma^2$;
• $\sigma^2$ is the variance of the $E_i$, i.e. an unknown parameter we want to estimate ($\sigma$ is called the standard deviation);
• $a$ is a constant corresponding to the intercept of the regression line, i.e. an unknown parameter we want to estimate;
• $b$ is a constant corresponding to the regression coefficient (slope of the regression line), i.e. an unknown parameter we want to estimate;
• « i.i.d. » stands for « independent and identically distributed ».
This model means that we decompose the value of the rate $Y$ into two parts:

• one explained by $x$, $a + b x_i$, the fixed part;
• one left unexplained, $E_i$, the random part.

As a result, $Y$ follows a Normal distribution (also called Gaussian), and we can write, for a given $x$, $Y \sim \mathcal{N} ( a + b x, \sigma^2 )$, the $Y_i$ being independent:

• expectation: $\mathbb{E}(Y_i) = \mathbb{E}(a + bx_i + E_i) = a + bx_i$
• variance: $\mathbb{V}(Y_i) = \mathbb{V}(a + bx_i + E_i) = \mathbb{V}(E_i) = \sigma^2$
• covariance: $\mathbb{C}ov(Y_i,Y_j) = \mathbb{C}ov(E_i,E_j) = 0$ for $i \ne j$

The term « linear » in the expression « linear model » means that « linear » applies to the parameters:

• $Y_i = a + b x_i + c x_i^2 + E_i$ is a linear model although the relation between $Y$ and $x$ is polynomial;
• $Y_i = a + b \cos(x_i) + E_i$ is a linear model;
• $Y_i = a e^{b x_i} + \epsilon_i$ is not a linear model, as it cannot be written as a sum of terms each linear in a parameter;
• $Y_i = \frac{a}{b + x_i} + \epsilon_i$ is not a linear model.

### Deriving estimators of the parameters

We will estimate the 3 unknown parameters, $a$, $b$ and $\sigma^2$, by finding the values that maximize the likelihood $\nu$ of the parameters given the data.
For a given set of parameters, the likelihood is the probability of obtaining the data given these parameters, and we want to maximize this probability:

$\nu = \mathbb{P}( \mbox{data} \, | \, \mbox{parameters} )$

As $Y \sim \mathcal{N} ( a + b x, \sigma^2 )$, the $Y_i$ have a probability density $f$:

$f( y_i; a, b, \sigma | x_i ) = \frac{1}{\sigma \sqrt{2\pi}} \exp\{ - \frac{1}{2 \sigma^2} [ y_i - ( a + b x_i ) ]^2 \}$

And as the $Y_i$ are independent, we can write the likelihood $\nu$ as a product:

$\nu = \mathbb{P}( y_1, \ldots, y_n; a, b, \sigma | x_1, \ldots, x_n )$

$\nu = \prod_{i=1}^{n} f( y_i; a, b, \sigma | x_i )$

$\nu = \left( \frac{1}{\sigma \sqrt{2\pi}} \right)^n \prod_{i=1}^{n} \exp\{ - \frac{1}{2 \sigma^2} [ y_i - ( a + b x_i ) ]^2 \}$

We note $\mathcal{L}$ the logarithm of the likelihood:

$\mathcal{L} = \log( \nu )$

$\mathcal{L} = -n \log(\sigma \sqrt{2 \pi}) - \sum_{i=1}^{n} \frac{[y_i - (a+bx_i)]^2}{2 \sigma^2}$

$\mathcal{L} = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2}\sum_{i=1}^{n} [y_i - (a+bx_i)]^2$

To find the values of $a$, $b$ and $\sigma^2$ that maximize $\mathcal{L}$, we only need to set to zero the partial derivatives of $\mathcal{L}$ with respect to (w.r.t.)
each parameter:

$\frac{\partial \mathcal{L}}{\partial a} = 0$

$\Leftrightarrow \frac{\partial}{\partial a}\left( \sum_{i=1}^{n} [y_i - (a+bx_i)]^2 \right) = 0$

$\Leftrightarrow \frac{\partial}{\partial a}\left( \sum_{i=1}^{n} [y_i^2 - 2y_i(a+bx_i) + (a+bx_i)^2] \right) = 0$

$\Leftrightarrow \sum_{i=1}^{n} ( -2y_i + 2a + 2bx_i ) = 0$

$\Leftrightarrow \sum_{i=1}^{n} [ y_i - (a+bx_i) ] = 0$

$\Leftrightarrow \bar{y} - (a+b\bar{x}) = 0$

We can thus deduce the maximum-likelihood estimator (MLE) $A$ of parameter $a$:

$A = \bar{Y} - B \bar{x}$

Similarly for $b$:

$\frac{\partial \mathcal{L}}{\partial b} = 0$

$\Leftrightarrow \sum_{i=1}^{n} [ y_i - (a+bx_i) ] x_i = 0$

$\Leftrightarrow \sum_{i=1}^{n} x_i y_i - a \sum_{i=1}^{n} x_i - b \sum_{i=1}^{n} x_i^2 = 0$

When replacing $a$ by the value obtained previously, we obtain:

$\Leftrightarrow \sum_{i=1}^{n} x_i y_i - (\bar{y} - b\bar{x}) \sum_{i=1}^{n} x_i - b \sum_{i=1}^{n} x_i^2 = 0$

$\Leftrightarrow \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - b \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) = 0$

Using $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$ and $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$, we can thus deduce the maximum-likelihood estimator $B$ of parameter $b$:

$B = \frac{ \sum_{i=1}^{n} (x_i -\bar{x} )( Y_i - \bar{Y} )}{ \sum_{i=1}^{n} (x_i -\bar{x} )^2 }$

And finally for $\sigma^2$:

$\frac{\partial \mathcal{L}}{\partial \sigma^2} = 0$

$\Leftrightarrow -\frac{n}{2 \sigma^2} + \frac{1}{2 \sigma^4}\sum_{i=1}^{n} [ y_i - (a + bx_i) ]^2 = 0$

We can thus deduce the maximum-likelihood estimator $S_n^2$ of parameter $\sigma^2$:

$S_n^2 = \frac{1}{n} \sum_{i=1}^{n} [ Y_i - (A + Bx_i) ]^2$

Now, for each parameter, we should also compute the second derivatives of the log-likelihood and check that they are negative. This is to be sure that the formulas we obtained above with the first derivatives truly correspond to maxima and not minima. But I'm feeling a bit lazy here… Anyway, it works.

### Properties of the estimators

The estimators $A$ and $B$ are linear combinations of independent Gaussian variables, the $Y_i$, thus both follow a Gaussian distribution: $A \sim \mathcal{N}(E_A, S_A^2)$ and $B \sim \mathcal{N}(E_B, S_B^2)$.
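Before deriving their properties, we can sanity-check the closed-form estimators numerically against R's `lm()` function. This is a sketch on simulated data (the variables `x` and `y` are hypothetical, not the fish dataset):

```r
# Closed-form MLEs of a and b versus lm() on simulated data
set.seed( 3 )
x <- 1:20
y <- 2 + 0.5 * x + rnorm( 20, sd=0.3 )
b.hat <- sum( (x - mean(x)) * (y - mean(y)) ) / sum( (x - mean(x))^2 )  # B
a.hat <- mean( y ) - b.hat * mean( x )                                  # A
sn2 <- sum( ( y - (a.hat + b.hat * x) )^2 ) / 20                        # S_n^2 (biased MLE)
coef( lm( y ~ x ) )   # same values as c(a.hat, b.hat)
```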
We now need to derive the expectations and variances of these distributions. First, let's define a new quantity:

$c_i = \frac{x_i - \bar{x}}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$

This new term is useful because it is not random, which facilitates the derivation of the expectations and variances, as you will see below.

First formula:

$\sum_{i=1}^{n} c_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \left( \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} \right)$

$\sum_{i=1}^{n} c_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} ( n \bar{x} - n \bar{x} )$

$\sum_{i=1}^{n} c_i = 0$

Second formula:

$\sum_{i=1}^{n} c_i x_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sum_{i=1}^{n} ( x_i - \bar{x}) x_i$

However:

$\sum_{i=1}^{n} (x_i - \bar{x})\bar{x} = 0$

Thus:

$\sum_{i=1}^{n} c_i x_i = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sum_{i=1}^{n} ( x_i - \bar{x})( x_i - \bar{x})$

Finally:

$\sum_{i=1}^{n} c_i x_i = \frac{\sum_{i=1}^{n} ( x_i - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 1$

Third formula, rewriting $B$ with the $c_i$:

$B = \sum_{i=1}^{n} c_i (Y_i - \bar{Y})$

$B = \sum_{i=1}^{n} c_i Y_i - \bar{Y} \sum_{i=1}^{n} c_i$

$B = \sum_{i=1}^{n} c_i Y_i$

Now, let's calculate $E_B$:

$E_B = \mathbb{E} ( B ) = \sum_{i=1}^{n} c_i \mathbb{E}( Y_i )$

$E_B = \sum_{i=1}^{n} c_i ( a + bx_i )$

$E_B = a \sum_{i=1}^{n} c_i + b \sum_{i=1}^{n} c_i x_i = a \times 0 + b \times 1 = b$

The $B$ estimator is unbiased: $\mathbb{E}(B) = b$.

Now, let's calculate $E_A$:

$E_A = \mathbb{E} ( A ) = \mathbb{E} ( \bar{Y} - B \bar{x} )$

$E_A = \mathbb{E} ( \bar{Y} ) - \bar{x} \mathbb{E} ( B )$

$E_A = \frac{1}{n} \sum_{i=1}^n \mathbb{E} ( Y_i ) - b \bar{x}$

$E_A = \frac{1}{n} \sum_{i=1}^n ( a + bx_i ) - b \bar{x} = a + b\bar{x} - b\bar{x} = a$

The $A$ estimator is unbiased: $\mathbb{E}(A) = a$.
Now let's calculate $S_B^2$:

$\mathbb{V}(B) = \mathbb{V}\left( \sum_{i=1}^n c_i Y_i \right)$

$\mathbb{V}(B) = \sum_{i=1}^n c_i^2 \mathbb{V}( Y_i )$ (as the $Y_i$ are independent)

$\mathbb{V}(B) = \sigma^2 \sum_{i=1}^n \frac{(x_i - \bar{x})^2}{\left( \sum_{i=1}^n (x_i-\bar{x})^2 \right)^2}$

$\mathbb{V}(B) = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2} \sum_{i=1}^n \frac{(x_i - \bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}$

$\mathbb{V}(B) = S_B^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}$

Now let's calculate $S_A^2$:

$\mathbb{V}(A) = \mathbb{V}(\bar{Y} - B\bar{x})$

$\mathbb{V}(A) = \mathbb{V}\left( \frac{1}{n} \sum_{i=1}^n Y_i - \bar{x} \sum_{i=1}^n c_i Y_i \right)$

$\mathbb{V}(A) = \mathbb{V}\left( \sum_{i=1}^n \left( \frac{1}{n} - c_i \bar{x} \right) Y_i \right)$

$\mathbb{V}(A) = \sum_{i=1}^n \left( \frac{1}{n} - c_i \bar{x} \right)^2 \sigma^2$

$\mathbb{V}(A) = \sigma^2 \sum_{i=1}^n \left( \frac{1}{n^2} - 2\frac{c_i\bar{x}}{n} + c_i^2\bar{x}^2 \right)$

$\mathbb{V}(A) = \sigma^2 \left( \frac{1}{n} - 2\frac{\bar{x}}{n}\sum_{i=1}^n c_i + \bar{x}^2 \sum_{i=1}^n c_i^2 \right)$

$\mathbb{V}(A) = S_A^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)$

using $\sum_{i=1}^n c_i = 0$ and $\sum_{i=1}^n c_i^2 = \frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2}$.

Now, let's calculate $\mathbb{E}(S_n^2)$: …<to do>…

The $S_n^2$ estimator of $\sigma^2$ is biased: $\mathbb{E}(S_n^2) = \frac{n-2}{n}\sigma^2$. Thus, we rather use the following unbiased estimator of $\sigma^2$:

$S_{n-2}^2 = \frac{n}{n-2} S_n^2 = \frac{1}{n-2} \sum_{i=1}^n [ Y_i - (A+Bx_i)]^2$

All these estimators are called « least-squares estimators » because they minimize the sum of squared prediction errors:

$\sum_{i=1}^n E_i^2 = \sum_{i=1}^n [ Y_i - (a+bx_i)]^2$

$S_{n-2}$ (i.e. the square root of $S_{n-2}^2$) is also called the standard error of the regression.

Thanks to the formulas above for $A$, $B$ and $S^2_{n-2}$, we can compute the estimates $\hat{a}$, $\hat{b}$ and $\hat{\sigma}^2$ of the parameters $a$, $b$ and $\sigma^2$. These estimates are thus also estimates of $E_A$ and $E_B$. But how to estimate $S_A$ and $S_B$?
In fact, we just replace $\sigma^2$ by $S_{n-2}^2$ in the formulas of $\mathbb{V}(A)$ and $\mathbb{V}(B)$:

$S_A^2 = S_{n-2}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)$

$S_B^2 = \frac{S_{n-2}^2}{\sum_{i=1}^n (x_i-\bar{x})^2}$

Moreover, for each estimate, it is often very useful to compute a confidence interval. Such an interval is built so that, if we were to redo the whole experiment many times, the interval computed each time would contain the true value of the parameter 95% of the time (corresponding to a level $\alpha=5\%$). We know that both $(A-a)/S_A$ and $(B-b)/S_B$ follow a Student distribution $\mathcal{T}$ with $n-2$ degrees of freedom. This allows us to compute confidence intervals:

$CI_{\alpha}(a) = [ \hat{a} - t_{n-2;1-\alpha/2} s_A ; \hat{a} + t_{n-2;1-\alpha/2} s_A ]$

$CI_{\alpha}(b) = [ \hat{b} - t_{n-2;1-\alpha/2} s_B ; \hat{b} + t_{n-2;1-\alpha/2} s_B ]$

where $\hat{a}, \hat{b}, s_A, s_B$ are realizations of $A, B, S_A, S_B$, and the $t$'s are quantiles taken from the Student cumulative distribution function.

### Predicting values

It is possible with this model, for a given $x=x_i$, to predict a value for $Y$. We name $T_i$ such a prediction, which corresponds to an estimator of the expectation of $Y$ given $x_i$:

$T_i = A + Bx_i \mbox{ , estimator of } \mathbb{E}(Y | x=x_i)$

We also note $F_i$ the calculated residual:

$F_i = Y_i - T_i$

The difference between $E_i$ and $F_i$ is that $E_i$ is an unobserved random variable whereas $F_i$ is a residual computed thanks to the estimators $A$ and $B$. For $x=x_0$, $T_0$ follows a Normal distribution (as a linear combination of a Gaussian vector), and:

$\mathbb{E}(T_0) = a + bx_0$

$\mathbb{V}(T_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)$

As an estimate of the expectation, we have: $t_0 = \hat{a} + \hat{b}x_0$.
And to estimate the variance, we do as above:

$S_{T_0}^2 = S_{n-2}^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)$

Similarly, to compute a confidence interval for $\mathbb{E}(Y | x=x_0)$:

$CI_{\alpha}(T_0) = [ t_0 - t_{n-2;1-\alpha/2} s_{T_0} ; t_0 + t_{n-2;1-\alpha/2} s_{T_0} ]$

Moreover, we can also compute a prediction interval. For $Y_0$ a random variable whose values correspond to the results we can observe for $Y$ given that $x=x_0$, we have:

$Y_0 = a + bx_0 + E_0$

Therefore:

$\hat{Y_0} = A + Bx_0 + E_0 = T_0 + E_0$

$\mathbb{V}(\hat{Y_0}) = \mathbb{V}(T_0) + \mathbb{V}(E_0) = \mathbb{V}(T_0) + \sigma^2$

And the prediction interval is:

$PI_{\alpha}(Y_0) = [ t_0 - t_{n-2;1-\alpha/2} s_{\hat{Y}_0} ; t_0 + t_{n-2;1-\alpha/2} s_{\hat{Y}_0} ]$

where:

$s_{\hat{Y}_0}^2 = s_{T_0}^2 + s_{E_0}^2 = s_{n-2}^2 \left( 1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2} \right)$

The prediction interval is always wider than the confidence interval, but as the estimation of $\sigma^2$ by $S_{n-2}^2$ gets more precise (e.g. with more sampled data), the predictions also become more precise.

### Checking the hypotheses of the model

However, even if we spent hours writing equations, we first need to make sure that all the hypotheses of our model are verified; otherwise, there is no way to be confident in the results (i.e. in the statistical inference). Here are the hypotheses:

• the relationship between the outcome and the explanatory variable(s) is linear;
• the error terms $E_i$ have the same variance $\sigma^2$ (homoscedasticity);
• the error terms $E_i$ are independent;
• the error terms $E_i$ follow a Gaussian distribution.

Basically, we need to look at the estimates of the error terms ($\hat{E}_i = F_i = Y_i - \hat{Y}_i$) as a function of the predicted values ($\hat{Y}_i = \hat{a} + \hat{b} x_i$).
Therefore, we first record the results of the linear regression, and then plot the residuals versus the fitted values:

mod1 <- lm( rate ~ age, data=d )
ggplot( mod1, aes(.fitted, .resid) ) + geom_hline( yintercept=0 ) + geom_point() + geom_smooth( se=FALSE )
ggsave( "data_fishes_mod1_residuals-fitted.png" )

Clearly, the residuals are « structured », i.e. there is a tendency, which indicates that a relevant term was not considered in the modeling of $\mathbb{E}(Y)$. Moreover, the residuals don't have the same variance (heteroscedasticity). Therefore, we can't carry on with this model; let's modify it:

$\log(Y_i) = a + bx_i + E_i$

again with $E_i \mbox{ i.i.d.} \sim \mathcal{N}(0, \sigma^2)$, but these $E_i$ are different from those of model 1 (in practice, we use the base-10 logarithm, as in the code):

mod2 <- lm( log10(rate) ~ age, data=d )
ggplot( mod2, aes(.fitted, .resid) ) + geom_hline( yintercept=0 ) + geom_point() + geom_smooth( se=FALSE )
ggsave( "data_fishes_mod2_residuals-fitted.png" )

The variance has been stabilized, but the residuals are still slightly structured. Let's nevertheless keep this model: although the blue line is similarly convex as in the previous model, the y-axis scale is now much smaller.

Therefore, our final model is: $\log(Y_i) = a + bx_i + E_i$ with $E_i \mbox{ i.i.d.} \sim \mathcal{N}(0, \sigma^2)$. To ease the equations, we will use $Z_i = \log(Y_i)$. Now, the model is $Z_i = a + bx_i + E_i$.

### Testing the model

The first test aims at deciding whether the relationship between the outcome $Z_i$ and its explanatory variable $x_i$ is truly linear.
The null hypothesis is:

$H_0: \{ Z_i = a + E_i \}$ (no relation between $Z_i$ and $x_i$)

and the alternative hypothesis is:

$H_1: \{ Z_i = a + bx_i + E_i \}$ (there is a relation between $Z_i$ and $x_i$, and it is linear)

To test this, we decompose the variance of the data into a part explained by the model and a residual part:

$\sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{i=1}^n (\hat{Z}_i - \bar{Z})^2 + \sum_{i=1}^n (Z_i - \hat{Z}_i)^2$

The equation above corresponds to decomposing the total sum of squares (TSS) into a model sum of squares (MSS) and a residual sum of squares (RSS):

$TSS = MSS + RSS$

Each sum of squares follows a chi-squared distribution ($\chi^2$). Such a distribution is characterized by a parameter called « degrees of freedom » (df). With $p = 2$ parameters in the model ($a$ and $b$):

$df_{TSS} = \mbox{number of observations} - 1 = n - 1 = 14$

$df_{MSS} = \mbox{number of explanatory variables} = p - 1 = 1$

$df_{RSS} = df_{TSS} - df_{MSS} = n - p = 13$

To test the $H_0$ hypothesis, we use Fisher's $F$ statistic:

$F = \frac{MSS / df_{MSS}}{RSS / df_{RSS}}$

This statistic can be interpreted as a variance ratio:

$F = \frac{\mbox{variance explained by } x}{\mbox{residual variance}}$

Under hypothesis $H_0$, this statistic follows a Fisher distribution with parameters (degrees of freedom) $p-1$ and $n-p$. We reject the null hypothesis $H_0$ at the level $\alpha$ (typically 5%) if:

$F > f_{p-1,n-p,1-\alpha}$

i.e. if $F$ is higher than the quantile $f$ of order $1-\alpha$ from a Fisher distribution $\mathcal{F}_{p-1,n-p}$. We can also compute a P-value that measures the agreement between the tested hypothesis and the obtained result, i.e. the probability of drawing a value from a $\mathcal{F}_{p-1,n-p}$ distribution that is higher than $F$:

$\mbox{P-value} = \mathbb{P}( \mathcal{F}_{p-1,n-p} > F )$

The smaller the P-value, the stronger the disagreement between the null hypothesis and the results of the experiment. Usually we reject $H_0$ when the P-value is smaller than $\alpha=5\%$.
In our case, we can compute all the sums of squares as well as the Fisher statistic:

MSS <- sum( ( mod2$fitted.values - mean(log10(d$rate)) )^2 )
RSS <- sum( ( log10(d$rate) - mod2$fitted.values )^2 )
TSS <- sum( ( log10(d$rate) - mean(log10(d$rate)) )^2 )
F <- (MSS / 1) / (RSS / 13)
f <- qf( p=0.95, df1=1, df2=13 )
Pval <- pf( q=F, df1=1, df2=13, lower.tail=FALSE )

We obtain $MSS=0.77003$, $RSS=0.11998$, $F=83.43$, $f=4.67$ and $\mbox{P-value}=5.092 \times 10^{-7}$. It is also possible to obtain all these results in a simpler way (although it is always good to know how to compute these quantities by oneself):

summary( mod2 )
anova( mod2 )

These results show that the variability explained by the model is far greater than the residual variability ($F > 80$ and $\mbox{P-value} \ll 5\%$). Thus, we can reject the null hypothesis and consider that there exists a linear relationship between the log of the DDT rate in fish and their age.

### Assessing the quality of the model

It is important to assess the fit of the model to the data, as we may use it to predict the value of $Y$ knowing the value of $x$. For this purpose, we compute the R-square, $R^2$, which corresponds to the proportion of the variability in $Z$ explained by the model (as a percentage):

$R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS}$

However, it is more relevant to calculate the adjusted R-square, which penalizes for the number of explanatory variables. Indeed, with more explanatory variables, the fit mechanically improves, but we may end up with an over-fitted model:

$\mbox{adjusted-}R^2 = 1 - \frac{RSS/(n-p)}{TSS/(n-1)}$

R2 <- 1 - RSS/TSS
R2.adj <- 1 - (RSS/13)/(TSS/14)

We obtain $R^2=86.5\%$ and adjusted-$R^2=85.5\%$. It means that about 85% of the variability of the log of the DDT rate in fish is explained by their age.
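As a cross-check, the $R^2$ and adjusted $R^2$ computed by hand match the values reported by R's `summary()`. A minimal sketch on simulated data (the variables `x` and `y` are hypothetical, not the fish dataset):

```r
# R-square by hand versus summary(lm(...)) on simulated data
set.seed( 1 )
x <- 1:30
y <- 2 + 0.5 * x + rnorm( 30 )
mod <- lm( y ~ x )
RSS <- sum( residuals( mod )^2 )
TSS <- sum( ( y - mean(y) )^2 )
R2 <- 1 - RSS / TSS                               # by hand
R2.adj <- 1 - ( RSS/(30-2) ) / ( TSS/(30-1) )     # adjusted, by hand
c( summary( mod )$r.squared, summary( mod )$adj.r.squared )  # same values
```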
### Estimating the parameters

According to the formulas above for $A$, $B$ and $S_{n-2}^2$, we can compute estimates of the parameters:

b <- sum( (d$age - mean(d$age)) * (log10(d$rate) - mean(log10(d$rate))) ) / sum( (d$age - mean(d$age))^2 )
a <- mean(log10(d$rate)) - b * mean(d$age)
sn2 <- 1/(n-2) * sum( (log10(d$rate) - (a + b*d$age))^2 )

Knowing that $\bar{Z}=-0.419$ and $\bar{x}=4$, we get $\hat{b}=0.16021$, $\hat{a}=-1.06005$ and $S_{n-2}^2=0.00923$. Here is the equation of the regression line: $z = \hat{a} + \hat{b}x = 0.16x - 1.06$. It means that each additional year of age increases the log of the DDT rate by 0.16. We can also plot this line with the data:

qplot( age, log10(rate), data=d ) + geom_abline( intercept=a, slope=b, colour="red" )
ggsave( "data_fishes_mod2_regline.png" )

Are our estimates precise? Let's compute the variances and confidence intervals for $\hat{a}$ and $\hat{b}$:

s.a <- sn2 * ( 1/n + mean(d$age)^2 / sum( ( d$age - mean(d$age))^2 ) )
s.b <- sn2 / sum( ( d$age - mean(d$age) )^2 )
alpha <- 0.05
ci.a <- c( a - qt( p=1-alpha/2, df=n-2 ) * sqrt(s.a), a + qt( p=1-alpha/2, df=n-2 ) * sqrt(s.a) )
ci.b <- c( b - qt( p=1-alpha/2, df=n-2 ) * sqrt(s.b), b + qt( p=1-alpha/2, df=n-2 ) * sqrt(s.b) )

We get $S_A^2=0.00553$ and $S_B^2=0.000307$. Note that the confidence intervals are built with the standard errors $\sqrt{S_A^2}$ and $\sqrt{S_B^2}$, not the variances, which gives $CI_{5\%}(a) \approx (-1.221, -0.899)$ and $CI_{5\%}(b) \approx (0.122, 0.198)$. These values show that our estimates are reasonably precise.
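R also provides `confint()`, which applies exactly these Student-based formulas to a fitted model. A small sketch on simulated data (not the fish dataset):

```r
# Manual Student confidence intervals versus confint() on simulated data
set.seed( 2 )
x <- 1:30
y <- -1 + 0.2 * x + rnorm( 30, sd=0.1 )
mod <- lm( y ~ x )
se <- coef( summary( mod ) )[, "Std. Error"]     # standard errors of A and B
ci.manual <- cbind( coef( mod ) - qt( 0.975, df=28 ) * se,
                    coef( mod ) + qt( 0.975, df=28 ) * se )
confint( mod )   # same intervals
```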

Although it seems almost certain when looking at the confidence intervals, we can still wonder whether we can reject the null hypotheses according to which the $a$ and $b$ parameters equal zero: $H_0: a=0$ and $H_0: b=0$, tested separately. Note that the test for $b=0$ is equivalent to the test made previously with $H_0: Z_i = a + E_i$.

test.a <- a / sqrt( s.a )
test.b <- b / sqrt( s.b )
all.equal( qt( 0.975, df=13 )^2, qf( 0.95, df1=1, df2=13 ) )
Pval.a <- 2 * pt( q=abs(test.a), df=n-2, lower.tail=FALSE )
Pval.b <- 2 * pt( q=test.b, df=n-2, lower.tail=FALSE )

We obtain $\hat{a}/\hat{\sigma}_a=-14.245$, $\hat{b}/\hat{\sigma}_b=9.134$, $t_{1-\alpha/2}^2=f_{1-\alpha;p-1;n-p}$, $\mbox{P-value}_a=2.61 \times 10^{-9}$ and $\mbox{P-value}_b=5.09 \times 10^{-7}$. The P-values are far below 5%. Thus we reject both null hypotheses and conclude that:

• the log of the DDT rate for a fish that is just born is significantly different from zero ($a \ne 0$);
• the log of the DDT rate for a fish is linearly linked to its age ($b \ne 0$). And as $\hat{b} > 0$, the relationship is increasing.
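These two Student tests are exactly what R reports in the coefficient table of `summary()`. A minimal sketch on simulated data (not the fish dataset):

```r
# t statistic and two-sided P-value by hand versus summary()'s coefficient table
set.seed( 4 )
x <- 1:30
y <- 1 + 0.3 * x + rnorm( 30 )
mod <- lm( y ~ x )
tab <- coef( summary( mod ) )   # columns: Estimate, Std. Error, t value, Pr(>|t|)
t.b <- tab["x", "Estimate"] / tab["x", "Std. Error"]
p.b <- 2 * pt( abs( t.b ), df=28, lower.tail=FALSE )
c( tab["x", "t value"], tab["x", "Pr(>|t|)"] )   # same values as c(t.b, p.b)
```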

### Confidence and prediction intervals for the variable to explain

We can compute such intervals for the values of the DDT rate of each fish:

s2.t <- sn2 * ( 1/n + (d$age-mean(d$age))^2 / sum((d$age-mean(d$age))^2) )
ci.y <- cbind( a+b*d$age - qt(p=1-alpha/2,df=n-2) * sqrt(s2.t), a+b*d$age + qt(p=1-alpha/2,df=n-2) * sqrt(s2.t) )
s2.y <- sn2 + s2.t
pi.y <- cbind( a+b*d$age - qt(p=1-alpha/2,df=n-2) * sqrt(s2.y), a+b*d$age + qt(p=1-alpha/2,df=n-2) * sqrt(s2.y) )
pl <- qplot( age, log10(rate), data=d ) + geom_abline( intercept=a, slope=b, colour="red" )
pl + geom_line( aes(x=d$age, y=ci.y[,1], colour="low CI") ) +
  geom_line( aes(x=d$age, y=ci.y[,2], colour="high CI") ) +
  geom_line( aes(x=d$age, y=pi.y[,1], colour="low PI") ) +
  geom_line( aes(x=d$age, y=pi.y[,2], colour="high PI") ) +
  scale_colour_manual( "Intervals", values=c("low CI"="blue", "high CI"="blue", "low PI"="green", "high PI"="green") ) +
  theme( legend.position="none" )
ggsave( "data_fishes_mod2_regline-intervals-nolegend.png" )

Here is the final plot with the regression line in red, the lines of the confidence interval in blue and the lines of the prediction interval in green:
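Note that R's `predict()` computes both kinds of intervals directly, which is a handy cross-check of the manual formulas above. A sketch on simulated data (not the fish dataset):

```r
# Confidence and prediction intervals via predict() on simulated data
set.seed( 5 )
x <- 1:30
y <- 2 + 0.5 * x + rnorm( 30 )
mod <- lm( y ~ x )
new <- data.frame( x=10 )
conf.int <- predict( mod, newdata=new, interval="confidence", level=0.95 )  # CI for E(Y|x=10)
pred.int <- predict( mod, newdata=new, interval="prediction", level=0.95 )  # PI, always wider
```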

That’s it!

Sources: Statistique inférentielle by Daudin, Robin and Vuillet (on Amazon.fr), Exemples d'applications du modèle linéaire by Lebarbier and Robin (here)
