[R] NEWBIE: Help explaining use of lm()?

Tue Nov 21 21:42:43 CET 2006

Hi Kevin,

The hint here would be the fact that you are coding your 'Groups' 
variable as a factor, which implies (at least in this situation) that 
the Groups are unordered levels.

Your understanding of a linear model fit is correct when the independent 
variable is continuous (like the amount of hormone administered). 
However, in this case, with unordered factor levels, a linear model is 
the same as fitting an analysis of variance (ANOVA) model.
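
A minimal sketch of that equivalence, in case it helps. The data are 
typed in from the output you pasted below so that it runs without the 
ISwR package; the object names fit.lm and fit.aov are my own.

```r
# Ages at walking (months), copied from the quoted output below
zelazo <- list(
  active  = c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50),
  passive = c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00),
  none    = c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00),
  ctr.8w  = c(13.25, 11.50, 12.00, 13.50, 11.50)
)
walk  <- unlist(zelazo)
group <- factor(rep(names(zelazo), sapply(zelazo, length)),
                levels = names(zelazo))

fit.lm  <- lm(walk ~ group)    # linear model with a factor predictor
fit.aov <- aov(walk ~ group)   # one-way ANOVA on the same data

anova(fit.lm)      # ANOVA table derived from the lm() fit
summary(fit.aov)   # the same table from aov()

# Behind the scenes the factor becomes 0/1 indicator columns,
# one for each non-baseline group
unique(model.matrix(fit.lm))
```

Both tables should show the same F-statistic on 3 and 19 degrees of 
freedom as the last line of your summary() output.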

Note that the statistic for an ANOVA (the F-statistic) is a 
generalization of the t-statistic to more than two groups. In essence, 
the overall F-test from the lm() fit asks 'Does the mean age at walking 
differ among any of these groups?', and each coefficient row asks 'Is 
the mean for this group different from the baseline mean?', as compared 
to the t-test which asks 'Is the mean age at walking different between 
Group x and Group y?'.
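
To see the sense in which F generalizes t, here is a small sketch 
restricted to two of your groups. With only two groups, the F-statistic 
from lm() should equal the square of the equal-variance t-statistic. The 
names y, g, tt and ff are just placeholders of mine, and note that I set 
var.equal = TRUE, which is not the t.test() default.

```r
# Two of the four groups, values copied from the quoted output below
active <- c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50)
ctr.8w <- c(13.25, 11.50, 12.00, 13.50, 11.50)

y <- c(active, ctr.8w)
g <- factor(rep(c("active", "ctr.8w"), c(6, 5)))

tt <- t.test(y ~ g, var.equal = TRUE)   # pooled-variance two-sample t-test
ff <- anova(lm(y ~ g))                  # one-way ANOVA with two groups

unname(tt$statistic)^2    # squared t-statistic
ff[["F value"]][1]        # F-statistic from the linear model

# The two p-values agree as well
c(t.test = tt$p.value, lm = ff[["Pr(>F)"]][1])
```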

The way you have set up the analysis uses the 'active' group as a 
baseline and compares all others to that. In the output you supply, the 
intercept is the mean age that the active group started walking, and it 
appears that there is moderate evidence that the 'ctr.8w' group started 
walking later (the estimate being a little over 2 months later).
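
You can check this reading of the coefficients against the raw group 
means. The following is only a sketch using the values from your output; 
group2 is a name I made up to show how relevel() changes the baseline.

```r
# Ages at walking (months), copied from the quoted output below
zelazo <- list(
  active  = c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50),
  passive = c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00),
  none    = c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00),
  ctr.8w  = c(13.25, 11.50, 12.00, 13.50, 11.50)
)
walk  <- unlist(zelazo)
group <- factor(rep(names(zelazo), sapply(zelazo, length)),
                levels = names(zelazo))

fit   <- lm(walk ~ group)
means <- tapply(walk, group, mean)   # plain group means

coef(fit)                      # intercept plus three differences
means["active"]                # matches the intercept
means[-1] - means["active"]    # matches the other three coefficients

# Choosing a different baseline changes what each coefficient is
# compared against, but not the fitted group means
group2 <- relevel(group, ref = "none")
coef(lm(walk ~ group2))
```

The baseline is simply the first level of the factor, which is 'active' 
here because of the order of the names in the list.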

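If you were to do individual t-tests making the same comparisons you 
would end up with the same conclusion, but slightly different p-values, 
etc. The lm() fit estimates the error variance from all four groups at 
once (19 residual degrees of freedom here), whereas each t-test uses 
only the two groups being compared, so both the standard errors and the 
degrees of freedom differ.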
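
As a rough illustration of that difference, the sketch below puts the 
'ctr.8w' row of the lm() fit next to a two-sample t-test of the same two 
groups. I use var.equal = TRUE to keep the comparison close; the default 
Welch test would differ a little more. The names fit and tt are mine.

```r
# Ages at walking (months), copied from the quoted output below
zelazo <- list(
  active  = c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50),
  passive = c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00),
  none    = c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00),
  ctr.8w  = c(13.25, 11.50, 12.00, 13.50, 11.50)
)
walk  <- unlist(zelazo)
group <- factor(rep(names(zelazo), sapply(zelazo, length)),
                levels = names(zelazo))

# Comparison of ctr.8w with the baseline inside the four-group model
fit <- lm(walk ~ group)
coef(summary(fit))["groupctr.8w", ]
df.residual(fit)     # residual degrees of freedom from all four groups

# The same comparison using only the two groups involved
tt <- t.test(zelazo$ctr.8w, zelazo$active, var.equal = TRUE)
tt$parameter         # degrees of freedom from these two groups only
tt$statistic
tt$p.value
```

The estimated difference in means is identical in both; only the 
standard error, the degrees of freedom and hence the p-value change.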

HTH,

Jim

Zembower, Kevin wrote:
> I'm attempting the Herculean task of teaching myself Introductory
> Statistics and R at the same time. I'm working through Peter Dalgaard's
> Introductory Statistics with R, but don't understand why the answer to
> one of the exercises works. I'm hoping someone will have the patience to
> explain the answer to me, both in the statistics and R areas.
> 
> Exercise 6.1 says:
> The zelazo data are in the form of a list of vectors, one for each of
> the four groups. Convert the data to a form suitable for the use of lm,
> and calculate the relevant tests. ...
> 
> This stumped me right from the beginning. I thought I understood that
> linear models tried to correlate an independent variable (such as the
> amount of a hormone administered) to a dependent variable (such as the
> height of a cornstalk). Its output was a model that could state, "for
> every 10% increase in the hormone, the height increased by X%."
> 
> The zelazo data are the ages at walking (in months) of four groups of
> infants, two controls and two experimentals subjected to different
> exercise regimens. I don't understand why lm() can be used at all in
> this circumstance. My initial attempt was to use t.test(), which the
> answer key does also. I would have never thought to use lm() except for
> the requirement in the problem. I've pasted in the output of the
> exercise below, for those without the dataset. Would someone explain why
> lm() is appropriate to use in this situation, and what the results mean
> 'in plain English?'
> 
> Thanks for your patience with a newbie. I'm comfortable asking this
> question to this group because of the patience and understanding this
> group has almost always shown in explaining and teaching statistics.
> 
> Kevin Zembower
> Internet Services Group manager
> Center for Communication Programs
> Bloomberg School of Public Health
> Johns Hopkins University
> 111 Market Place, Suite 310
> Baltimore, Maryland  21202
> 410-659-6139 
> ============================================================
> 
>>library("ISwR")
> 
> Loading required package: survival
> Loading required package: splines
> 
> Attaching package: 'ISwR'
> 
> 
>         The following object(s) are masked from package:survival :
> 
>          lung 
> 
> 
>>data(zelazo)
>>head(zelazo)
> 
> $active
> [1]  9.00  9.50  9.75 10.00 13.00  9.50
> 
> $passive
> [1] 11.00 10.00 10.00 11.75 10.50 15.00
> 
> $none
> [1] 11.50 12.00  9.00 11.50 13.25 13.00
> 
> $ctr.8w
> [1] 13.25 11.50 12.00 13.50 11.50
> 
> 
>>walk <- unlist(zelazo)
>>walk
> 
>  active1  active2  active3  active4  active5  active6 passive1 passive2 
>     9.00     9.50     9.75    10.00    13.00     9.50    11.00    10.00 
> passive3 passive4 passive5 passive6    none1    none2    none3    none4 
>    10.00    11.75    10.50    15.00    11.50    12.00     9.00    11.50 
>    none5    none6  ctr.8w1  ctr.8w2  ctr.8w3  ctr.8w4  ctr.8w5 
>    13.25    13.00    13.25    11.50    12.00    13.50    11.50 
> 
>>group <- factor(rep(1:4,c(6,6,6,5)), labels=names(zelazo))
>>group
> 
>  [1] active  active  active  active  active  active  passive passive passive
> [10] passive passive passive none    none    none    none    none    none   
> [19] ctr.8w  ctr.8w  ctr.8w  ctr.8w  ctr.8w 
> Levels: active passive none ctr.8w
> 
>>summary(lm(walk ~ group))
> 
> 
> Call:
> lm(formula = walk ~ group)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -2.7083 -0.8500 -0.3500  0.6375  3.6250 
> 
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)    
> (Intercept)   10.1250     0.6191  16.355 1.19e-12 ***
> grouppassive   1.2500     0.8755   1.428   0.1696    
> groupnone      1.5833     0.8755   1.809   0.0864 .  
> groupctr.8w    2.2250     0.9182   2.423   0.0255 *  
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> 
> Residual standard error: 1.516 on 19 degrees of freedom
> Multiple R-Squared: 0.2528,     Adjusted R-squared: 0.1348 
> F-statistic: 2.142 on 3 and 19 DF,  p-value: 0.1285 
> 
> 
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
