[R] lm function in R
Douglas Bates
bates at stat.wisc.edu
Sun Feb 14 14:50:12 CET 2010
On Sat, Feb 13, 2010 at 7:09 PM, Daniel Malter <daniel at umd.edu> wrote:
> It seems to me that your question is more about the econometrics than about
> R. Any introductory econometric textbook or compendium on econometrics will
> cover this as it is a basic. See, for example, Greene 2006 or Wooldridge
> 2002.
>
> Say X is your data matrix, that contains columns for each of the individual
> variables (x), columns with their interactions, and one column of 1s for the
> intercept. Let y be your dependent variable. Then, OLS estimates are
> computed by
>
> X'X inverse X'y
>
> Or in R
>
> solve(t(X)%*%X)%*%t(X)%*%y
But that is not the way that the coefficients are calculated in R. It
is the formula that is given in text books but, like most text book
formulas, is not suitable for computation, especially with large data
sets and complex models.
There are many practical ways to calculate the least squares
coefficients in large data sets and they all involve decompositions.
For a Hadoop environment a scatter-gather approach would be to form
Cholesky decompositions of the cross-product of sections of rows of
the model matrix then combine the decompositions.
> Best,
> Daniel
>
>
> -------------------------
> cuncta stricte discussurus
> -------------------------
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Something Something
> Sent: Saturday, February 13, 2010 5:04 PM
> To: Bert Gunter
> Cc: r-help at r-project.org
> Subject: Re: [R] lm function in R
>
> I tried..
>
> mod = lm(Y ~ X1*X2*X3, na.action = na.exclude)
> formula(mod)
>
> This produced....
> Y ~ X1 * X2 * X3
>
>
> When I typed just mod I got:
>
> Call:
> lm(formula = Y ~ X1 * X2 * X3, na.action = na.exclude)
>
> Coefficients:
> (Intercept) X11 X21 X31 X11:X21 X11:X31
> X21:X31 X11:X21:X31
> 177.9245 0.2005 2.4482 3.1216 0.8127 -26.6166
> -3.0398 29.6049
>
>
> I am trying to figure out how R computed all these coefficients.
>
>
>
>
>
> On Sat, Feb 13, 2010 at 1:30 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>
>> ?formula
>>
>>
>> Bert Gunter
>> Genentech Nonclinical Statistics
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On
>> Behalf Of Something Something
>> Sent: Saturday, February 13, 2010 1:24 PM
>> To: Daniel Nordlund
>> Cc: r-help at r-project.org
>> Subject: Re: [R] lm function in R
>>
>> Thanks Dan. Yes that was very helpful. I didn't see the change from '*'
>> to
>> '+'.
>>
>> Seems like when I put * it means - interaction & when I put + it's not an
>> interaction.
>>
>> Is it correct to assume then that...
>>
>> When I put + R evaluates the following equation:
>> Y-Hat = b0 + b1X1 + b2X2 + . . . bkXk + 7 7 7 + bkXk
>>
>>
>> But when I put * R evaluates the following equation;
>> Y-Hat = b0 + b1X1 + b2x2 + ... + bkXk + b12 X12+ b13 X13 +........ + c
>>
>> Is this correct? If it is then can someone point me to any sources that
>> will explain how the coefficients (such as b0... bk, b12.. , b123..) are
>> calculated. I guess, one source is the R source code :) but is there any
>> other documentation anywhere?
>>
>> Please let me know. Thanks.
>>
>>
>>
>> On Fri, Feb 12, 2010 at 5:54 PM, Daniel Nordlund
>> <djnordlund at verizon.net>wrote:
>>
>> > > -----Original Message-----
>> > > From: r-help-bounces at r-project.org [mailto:
>> r-help-bounces at r-project.org]
>> > > On Behalf Of Something Something
>> > > Sent: Friday, February 12, 2010 5:28 PM
>> > > To: Phil Spector; r-help at r-project.org
>> > > Subject: Re: [R] lm function in R
>> > >
>> > > Thanks for the replies everyone. Greatly appreciate it. Some
>> progress,
>> > > but
>> > > now I am getting the following values when I don't use "as.factor"
>> > >
>> > > 13.14167 25.11667 28.34167 49.14167 40.39167 66.86667
>> > >
>> > > Is that what you guys get?
>> >
>> >
>> > If you look at Phil's response below, no, that is not what he got. The
>> > difference is that you are specifying an interaction, whereas Phil did
>> not
>> > (because the equation you initially specified did not include an
>> > interaction. Use Y ~ X1 + X2 instead of Y ~ X1*X2 for your formula.
>> >
>> > >
>> > >
>> > > On Fri, Feb 12, 2010 at 5:00 PM, Phil Spector
>> > > <spector at stat.berkeley.edu>wrote:
>> > >
>> > > > By converting the two variables to factors, you are fitting
>> > > > an entirely different model. Leave out the as.factor stuff
>> > > > and it will work exactly as you want it to.
>> > > >
>> > > > dat
>> > > >>
>> > > > V1 V2 V3 V4
>> > > > 1 s1 14 4 1
>> > > > 2 s2 23 4 2
>> > > > 3 s3 30 7 2
>> > > > 4 s4 50 7 4
>> > > > 5 s5 39 10 3
>> > > > 6 s6 67 10 6
>> > > >
>> > > >> names(dat) = c('id','y','x1','x2')
>> > > >> z = lm(y~x1+x2,dat)
>> > > >> predict(z)
>> > > >>
>> > > > 1 2 3 4 5 6 15.16667
>> 24.66667
>> > > > 27.66667 46.66667 40.16667 68.66667
>> > > >
>> > > >
>> > > > - Phil Spector
>> > > > Statistical Computing
>> Facility
>> > > > Department of Statistics
>> > > > UC Berkeley
>> > > > spector at stat.berkeley.edu
>> > > >
>> > > >
>> > > >
>> > > > On Fri, 12 Feb 2010, Something Something wrote:
>> > > >
>> > > > Hello,
>> > > >>
>> > > >> I am trying to learn how to perform Multiple Regression Analysis in
>> R.
>> > > I
>> > > >> decided to take a simple example given in this PDF:
>> > > >> http://www.utdallas.edu/~herve/abdi-prc-pretty.pdf
>> > > >>
>> > > >> I created a small CSV called, students.csv that contains the
>> following
>> > > >> data:
>> > > >>
>> > > >> s1 14 4 1
>> > > >> s2 23 4 2
>> > > >> s3 30 7 2
>> > > >> s4 50 7 4
>> > > >> s5 39 10 3
>> > > >> s6 67 10 6
>> > > >>
>> > > >> Col headers: Student id, Memory span(Y), age(X1), speech rate(X2)
>> > > >>
>> > > >> Now the expected results are:
>> > > >>
>> > > >> yHat[0]:15.166666666666668
>> > > >> yHat[1]:24.666666666666668
>> > > >> yHat[2]:27.666666666666664
>> > > >> yHat[3]:46.666666666666664
>> > > >> yHat[4]:40.166666666666664
>> > > >> yHat[5]:68.66666666666667
>> > > >>
>> > > >> This is based on the following equation (given in the PDF): Y =
>> 1.67
>> > +
>> > > X1
>> > > >> +
>> > > >> 9.50 X2
>> > > >>
>> > > >> I ran the following commands in R:
>> > > >>
>> > > >> data = read.table("students.csv", head=F, as.is=T, na.string=".",
>> > > >> row.nam=NULL)
>> > > >> X1 = as.factor(data[[3]])
>> > > >> X2 = as.factor(data[[4]])
>> > > >> Y = data[[2]]
>> > > >> mod = lm(Y ~ X1*X2, na.action = na.exclude)
>> > > >> Y.hat = fitted(mod)
>> > > >> Y.hat
>> > > >>
>> > > >> This gives me the following output:
>> > > >>
>> > > >> Y.hat
>> > > >>>
>> > > >> 1 2 3 4 5 6
>> > > >> 14 23 30 50 39 67
>> > > >>
>> > > >> Obviously I am doing something wrong. Please help. Thanks.
>> > > >>
>> >
>> > Hope this is helpful,
>> >
>> > Dan
>> >
>> > Daniel Nordlund
>> > Bothell, WA USA
>> >
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list