[R] NEWBIE: Help explaining use of lm()?
James W. MacDonald
jmacdon at med.umich.edu
Tue Nov 21 21:42:43 CET 2006
Hi Kevin,
The hint here would be the fact that you are coding your 'Groups'
variable as a factor, which implies (at least in this situation) that
the Groups are unordered levels.
Your understanding of a linear model fit is correct when the independent
varible is continuous (like the amount of hormone administered).
However, in this case, with unordered factor levels, a linear model is
the same as fitting an analysis of variance (ANOVA) model.
Note that the statistic for an ANOVA (the F-statistic) is a
generalization of the t-statistic to more than two groups. In essence
the question you are asking with the lm() fit is 'Is the mean age at
walking different from a baseline mean for any of these groups?' as
compared to the t-test which asks 'Is the mean age at walking different
between Group x and Group y?'.
The way you have set up the analysis uses the 'active' group as a
baseline and compares all others to that. In the output you supply, the
intercept is the mean age that the active group started walking, and it
appears that there is moderate evidence that the 'ctr.8w' group started
walking later (the estimate being a little over 2 months later).
If you were to do individual t-tests making the same comparisons you
would end up with the same conclusion, but slightly different p-values,
etc, because the degrees of freedom would be different.
HTH,
Jim
Zembower, Kevin wrote:
> I'm attempting the heruclean task of teaching myself Introductory
> Statistics and R at the same time. I'm working through Peter Dalgaard's
> Introductory Statistics with R, but don't understand why the answer to
> one of the exercises works. I'm hoping someone will have the patience to
> explain the answer to me, both in the statistics and R areas.
>
> Exercise 6.1 says:
> The zelazo data are in the form of a list of vectors, one for each of
> the four groups. Convert the data to a form suitable for the use of lm,
> and calculate the relevant tests. ...
>
> This stumped me right from the beginning. I thought I understood that
> linear models tried to correlate an independent variable (such as the
> amount of a hormone administered) to a dependent variable (such as the
> height of a cornstalk). Its output was a model that could state, "for
> every 10% increase in the hormone, the height increased by X%."
>
> The zelazo data are the ages at walking (in months) of four groups of
> infants, two controls and two experimentals subjected to different
> exercise regimens. I don't understand why lm() can be used at all in
> this circumstance. My initial attempt was to use t.test(), which the
> answer key does also. I would have never thought to use lm() except for
> the requirement in the problem. I've pasted in the output of the
> exercise below, for those without the dataset. Would someone explain why
> lm() is appropriate to use in this situation, and what the results mean
> 'in plain English?'
>
> Thanks for your patience with a newbie. I'm comfortable asking this
> question to this group because of the patience and understanding this
> group has almost always shown in explaining and teaching statistics.
>
> Kevin Zembower
> Internet Services Group manager
> Center for Communication Programs
> Bloomberg School of Public Health
> Johns Hopkins University
> 111 Market Place, Suite 310
> Baltimore, Maryland 21202
> 410-659-6139
> ============================================================
>
>>library("ISwR")
>
> Loading required package: survival
> Loading required package: splines
>
> Attaching package: 'ISwR'
>
>
> The following object(s) are masked from package:survival :
>
> lung
>
>
>>data(zelazo)
>>head(zelazo)
>
> $active
> [1] 9.00 9.50 9.75 10.00 13.00 9.50
>
> $passive
> [1] 11.00 10.00 10.00 11.75 10.50 15.00
>
> $none
> [1] 11.50 12.00 9.00 11.50 13.25 13.00
>
> $ctr.8w
> [1] 13.25 11.50 12.00 13.50 11.50
>
>
>>walk <- unlist(zelazo)
>>walk
>
> active1 active2 active3 active4 active5 active6 passive1 passive2
> 9.00 9.50 9.75 10.00 13.00 9.50 11.00 10.00
> passive3 passive4 passive5 passive6 none1 none2 none3 none4
> 10.00 11.75 10.50 15.00 11.50 12.00 9.00 11.50
> none5 none6 ctr.8w1 ctr.8w2 ctr.8w3 ctr.8w4 ctr.8w5
> 13.25 13.00 13.25 11.50 12.00 13.50 11.50
>
>>group <- factor(rep(1:4,c(6,6,6,5)), labels=names(zelazo))
>>group
>
> [1] active active active active active active passive passive
> passive
> [10] passive passive passive none none none none none
> none
> [19] ctr.8w ctr.8w ctr.8w ctr.8w ctr.8w
> Levels: active passive none ctr.8w
>
>>summary(lm(walk ~ group))
>
>
> Call:
> lm(formula = walk ~ group)
>
> Residuals:
> Min 1Q Median 3Q Max
> -2.7083 -0.8500 -0.3500 0.6375 3.6250
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 10.1250 0.6191 16.355 1.19e-12 ***
> grouppassive 1.2500 0.8755 1.428 0.1696
> groupnone 1.5833 0.8755 1.809 0.0864 .
> groupctr.8w 2.2250 0.9182 2.423 0.0255 *
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 1.516 on 19 degrees of freedom
> Multiple R-Squared: 0.2528, Adjusted R-squared: 0.1348
> F-statistic: 2.142 on 3 and 19 DF, p-value: 0.1285
>
>
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.
More information about the R-help
mailing list