[R] lm coefficients output confusing

Thu Aug 13 23:53:25 CEST 2009

On 13-Aug-09 20:45:31, Ross Culloch wrote:
> 
> Hi all,
> 
> I have an issue with the lm() function regarding the listing of
> the coefficients. My data are below, showing a list of hours (HR)
> relating to the time spent resting (R) by an individual animal.
> Simply i want to run a lm() to run in an anova() to see if there
> is a significant difference in resting between hours. 
> 
>    HR         R
> 1   2 0.6666667
> 2   2 0.4666667
> 3   2 0.8000000
> 4   2 0.6333333
> 5   2 0.7333333
> 6   2 0.8000000
> 7   2 0.8666667
> 8   2 0.7857143
> 9   2 0.7826087
> 10  2 0.6666667
> 11  2 0.9166667
> 12  2 0.6666667
> 13  3 0.5294118
> 14  3 0.8541667
> 15  3 0.4583333
> 16  3 0.5882353
> 17  3 0.9347826
> 18  3 0.7878788
> 19  3 0.7857143
> 20  3 0.6944444
> 21  3 0.8333333
> 22  3 0.7450980
> 23  3 0.9230769
> 24  3 0.7222222
> 25  4 0.6571429
> 26  4 0.7241379
> 27  4 0.7391304
> 28  4 0.6571429
> 29  4 0.8000000
> 30  4 0.9130435
> 31  4 0.7187500
> 32  4 0.8437500
> 33  4 0.9230769
> 34  4 0.8571429
> 35  4 0.8695652
> 36  4 0.8888889
> 37  5 0.3333333
> 38  5 0.5365854
> 39  5 0.6774194
> 40  5 0.7142857
> 41  5 0.6904762
> 42  5 0.5483871
> 43  5 0.5952381
> 44  5 0.4166667
> 45  5 0.5666667
> 46  5 0.5952381
> 47  5 0.7894737
> 48  5 0.7500000
> 49  6 0.6268657
> 50  6 0.7187500
> 51  6 0.5500000
> 52  6 0.7164179
> 53  6 0.7656250
> 54  6 0.5869565
> 55  6 0.7164179
> 56  6 0.7031250
> 57  6 0.7230769
> 58  6 0.7462687
> 59  6 0.9200000
> 60  6 0.8536585
> 61  7 0.6379310
> 62  7 0.5357143
> 63  7 0.5227273
> 64  7 0.8000000
> 65  7 0.6724138
> 66  7 0.7083333
> 67  7 0.7241379
> 68  7 0.6938776
> 69  7 0.6545455
> 70  7 0.7931034
> 71  7 0.7560976
> 72  7 0.8684211
> 73  8 0.6727273
> 74  8 0.6000000
> 75  8 0.8333333
> 76  8 0.8181818
> 77  8 0.7818182
> 78  8 0.7647059
> 79  8 0.5818182
> 80  8 0.5918367
> 81  8 0.7450980
> 82  8 0.7818182
> 83  8 0.8048780
> 84  8 0.8684211
> 
> 
> The script i'm using and output is as follows:
> 
>> anova(rdayml <- lm(R ~ HR, data=rdata2, na.action=na.exclude)) 
> Analysis of Variance Table
> 
> Response: R
>           Df  Sum Sq Mean Sq F value  Pr(>F)   
> HR         6 0.25992 0.04332  3.1762 0.00774 **
> Residuals 77 1.05021 0.01364                   
> ---
> Signif. codes:  0 â€˜***â€™ 0.001 â€˜**â€™ 0.01 â€˜*â€™ 0.05 â€˜.â€™
> 0.1 â€˜ â€™ 1 
>> 
>> summary(rdayml <- lm(R ~ HR,data=rdata2))
> 
> Call:
> lm(formula = R ~ HR, data = rdata2)
> 
> Residuals:
>       Min     >    1Q    Median        3Q       Max 
> -0.279725 -0.065416  0.005593  0.077486  0.201070 
> 
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  0.732082   0.033713  21.715   <2e-16 ***
> HR3          0.005976   0.047678   0.125   0.9006    
> HR4          0.067232   0.047678   1.410   0.1625    
> HR5         -0.130935   0.047678  -2.746   0.0075 ** 
> HR6         -0.013152   0.047678  -0.276   0.7834    
> HR7         -0.034807   0.047678  -0.730   0.4676    
> HR8          0.004971   0.047678   0.104   0.9172    
> ---> Signif. codes:  0 â€˜***â€™ 0.001 â€˜**â€™ .01 â€˜*â€™ 0.05
â€˜.â€™
> 0.1 â€˜ â€™ 1 
> 
> Residual standard error: 0.1168 on 77 degrees of freedom
> Multiple R-squared: 0.1984,     Adjusted R-squared: 0.1359 
> F-statistic: 3.176 on 6 and 77 DF,  p-value: 0.00774 
> 
> 
> What i really don't understand is why the lm summary lists the hour
> numbers in the coefficient of the lm, as apposed to just reading HR?

What has apparently happened here is that HR has been read in as
a factor, nor a numerical covariate (which is probably what you wanted
anyway, since you seem to want to compare hours).

Hence HR is a factor with 7 levels: "2","3","4","5","6","7","8"
(I have wirtten them between "" to show that they will not be
interpreted numerically). When you get a coefficient from lm(),
e.g. for HR6, this is the coefficient for level "6" of HR, and
that is the standard way of writing a factor with its level in
the output of ln().

> On top of that if R does display the data like this then i don't
> understand why it omits hour 2?

This is because your have the Intercept in the model, so along
with the 7 levels of HR you now have 8 variables present for a
model with 7 coefficients. So one has to go (being redundant).
Conventionally this is the first level of the factor (though you
can arrange it differently if you want to). Hence HR2 is, in effect,
absorbed into the Intercept. In this case, the Intercept is equal
to the mean of the values for Hour 2.

If you were to add the coefficint for Intercept to the coefficient
for any level of HR, e.g.

   Intercept           HR5
  (0.732082) + (-0.130935)  =  0.601147

you should find that it is the same as the mean of the values in
Hour 5:

  (0.3333333+0.5365854+0.6774194+0.7142857+0.6904762+0.5483871+
   0.5952381+0.4166667+0.5666667+0.5952381+0.7894737+0.7500000)/12

  = 0.6011475

as indeed it is! The coefficient of a given level of HR is here
the difference between the mean for that hour and the mean for
Hour 2. This depends on the system of contrasts ... by default,
lm() uses "treatment contrasts" for factors, which express the
difference between each "treatment" and the "base level" treatment.
Alternativve systems of cintrasts give different representations.

> If i can get this to work correctly
> can I use the p value to determine which of the hours is significantly
> different to the others - so in this example hour 5 is significantly
> different? Or is it just a case of using the p value from the
anova
> to determine that there is a significant difference between hours
> (in this case) and use a plot to determine which hour(s) are likely
> to be the cause?

Don't trust the individual P-values in the first instance, since with
several levels somce are bound to have more extreme P-values than
others.

However, your anova() test is an overall test for whether one or more
of the means jointly depart significantly from the Null Hypothesis
that they are all equal. Here, your anova() has given a significant
result, so the hour means are not all equal. *Now* you can look at
the individual P-values to see where this may be coming from.
In the list of coefficients, there is only one candidate for
"being different from Intercept", and this is Hour 5. Indeed the
P-value for HR5 and the P-value for the anova() are nearly equal
(the latter being slightly larger, allowing for the fact that it
is 1 out of 7).

Hoping this helps!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Aug-09                                       Time: 22:53:22
------------------------------ XFMail ------------------------------