[R] lm function in R

Ista Zahn istazahn at gmail.com
Sun Feb 14 02:14:20 CET 2010

```Dear Something,
I think you have no hope of completing this project without going back
to an introductory regression text. The two models you listed (Y-Hat =
b0 + b1X1 + b2X2 + . . . bkXk + · · · + bkXk) and (Y-Hat = b0 + b1X1 +
b2x2 + ... + bkXk + b12 X12+ b13 X13 +........  + c) are identical
except for the typos. Cohen and Cohen [1] might be a good place to
start.

Best,
Ista

[1] http://www.amazon.com/Multiple-Regression-Correlation-Analysis-Behavioral/dp/0805822232

On Sat, Feb 13, 2010 at 8:03 PM, Something Something
<mailinglists19 at gmail.com> wrote:
>>>From your question it is difficult to determine what sort of tutoring you
> are expecting.
> The kind of tutoring you are providing is definitely helpful :)
>
> I guess I should quickly explain what I am trying to accomplish.  I have
> been a computer scientist for quite a few years, but I studied Statistics
> long time ago.  Now trying to get back into the Statistics field.
>
> For my latest project I have been asked to implement Multiple Regression
> Analysis (using interactions - I guess) in Java & Distributed Computing
> (more specifically Hadoop).  I looked at the possibility of using rJava, but
> it's 'single threaded' model plus complex deployment on 1000s of machines
> does not make it very appealing.  I looked at using 3rd party libraries such
> as 'Apache Commons Math' & Flanagan's JAVA statistics library.  Both of them
> use this formula:
> Y-Hat = b0 + b1X1 + b2X2 + . . . bkXk + · · · + bkXk
>
> and NOT the one I have been asked to implement - which is -
> Y-Hat = b0 + b1X1 + b2x2 + ... + bkXk + b12 X12+ b13 X13 +........  + c
>
> (Not to mention these tools do not use BigDecimal so the answers given by
> them are not as precise as those from R).
>
> So now the deadline is approaching... and I can't find any Java library that
> matches results from R.. so I have to roll up my sleeves and start coding in
> Java - which means I need to get up to speed on Y-Hat calculations very
> quickly.  Ideally, looking for a paper (or a text book) that gives step by
> step instructions on calculating coefficients such as (b0, b1... b13).  I
> have been searching but haven't found anything.  My last resort is to look
> at R source code and start converting it to Java.
>
> I am new to R and a little rusty on Statistics, so I apologize for all the
> dumb questions, and GREATLY appreciate your patience and help.  Thanks.
>
>
> On Sat, Feb 13, 2010 at 3:20 PM, David Winsemius <dwinsemius at comcast.net>wrote:
>
>>
>> On Feb 13, 2010, at 5:03 PM, Something Something wrote:
>>
>>  I tried..
>>>
>>> mod = lm(Y ~ X1*X2*X3, na.action = na.exclude)
>>> formula(mod)
>>>
>>> This produced....
>>> Y ~ X1 * X2 * X3
>>>
>>>
>>> When I typed just mod I got:
>>>
>>> Call:
>>> lm(formula = Y ~ X1 * X2 * X3, na.action = na.exclude)
>>>
>>> Coefficients:
>>> (Intercept)          X11          X21          X31      X11:X21
>>>  X11:X31
>>>    X21:X31  X11:X21:X31
>>>  177.9245       0.2005       2.4482       3.1216       0.8127     -26.6166
>>>    -3.0398      29.6049
>>>
>>>
>>> I am trying to figure out how R computed all these coefficients.
>>>
>>
>> From your question it is difficult to determine what sort of tutoring you
>> are expecting. To get the code of an R formula, you just type its name:
>>
>> lm
>>
>> Leads to lm.fit:
>>
>> lm.fit
>>
>> Reading further it appears the lm and lm.fit functions are really front
>> ends for this call:
>>
>> .Fortran("dqrls", qr = x, n = n, p = p, y = y, ny = ny,
>>        tol = as.double(tol), coefficients = mat.or.vec(p, ny),
>>        residuals = y, effects = y, rank = integer(1L), pivot = 1L:p,
>>        qraux = double(p), work = double(2 * p), PACKAGE = "base")
>>
>> Seems pretty likely that is a QR decomposition-based method that i
>> implemented in compiled code.
>>
>> So if you want to go deeper, at least you know what to search for. Or if
>> you want to know how regression works on a matrix level, you should consult
>> a good reference text or Wikipedia, which is surprisingly good for that sort
>> of question these days.
>>
>> --
>> David.
>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Feb 13, 2010 at 1:30 PM, Bert Gunter <gunter.berton at gene.com>
>>> wrote:
>>>
>>>  ?formula
>>>>
>>>>
>>>> Bert Gunter
>>>> Genentech Nonclinical Statistics
>>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>>> On
>>>> Behalf Of Something Something
>>>> Sent: Saturday, February 13, 2010 1:24 PM
>>>> To: Daniel Nordlund
>>>> Cc: r-help at r-project.org
>>>> Subject: Re: [R] lm function in R
>>>>
>>>> Thanks Dan.  Yes that was very helpful.  I didn't see the change from '*'
>>>> to
>>>> '+'.
>>>>
>>>> Seems like when I put * it means - interaction & when I put + it's not an
>>>> interaction.
>>>>
>>>> Is it correct to assume then that...
>>>>
>>>> When I put + R evaluates the following equation:
>>>> Y-Hat = b0 + b1X1 + b2X2 + . . . bkXk + 7 7 7 + bkXk
>>>>
>>>>
>>>> But when I put * R evaluates the following equation;
>>>> Y-Hat = b0 + b1X1 + b2x2 + ... + bkXk + b12 X12+ b13 X13 +........  + c
>>>>
>>>> Is this correct?  If it is then can someone point me to any sources that
>>>> will explain how the coefficients (such as b0... bk, b12.. , b123..) are
>>>> calculated.  I guess, one source is the R source code :) but is there any
>>>> other documentation anywhere?
>>>>
>>>> Please let me know.  Thanks.
>>>>
>>>>
>>>>
>>>> On Fri, Feb 12, 2010 at 5:54 PM, Daniel Nordlund
>>>> <djnordlund at verizon.net>wrote:
>>>>
>>>>  -----Original Message-----
>>>>>> From: r-help-bounces at r-project.org [mailto:
>>>>>>
>>>>> r-help-bounces at r-project.org]
>>>>
>>>>> On Behalf Of Something Something
>>>>>> Sent: Friday, February 12, 2010 5:28 PM
>>>>>> To: Phil Spector; r-help at r-project.org
>>>>>> Subject: Re: [R] lm function in R
>>>>>>
>>>>>> Thanks for the replies everyone.  Greatly appreciate it.  Some
>>>>>>
>>>>> progress,
>>>>
>>>>> but
>>>>>> now I am getting the following values when I don't use "as.factor"
>>>>>>
>>>>>> 13.14167 25.11667 28.34167 49.14167 40.39167 66.86667
>>>>>>
>>>>>> Is that what you guys get?
>>>>>>
>>>>>
>>>>>
>>>>> If you look at Phil's response below, no, that is not what he got.  The
>>>>> difference is that you are specifying an interaction, whereas Phil did
>>>>>
>>>> not
>>>>
>>>>> (because the equation you initially specified did not include an
>>>>> interaction.  Use Y ~ X1 + X2 instead of Y ~ X1*X2 for your formula.
>>>>>
>>>>>
>>>>>>
>>>>>> On Fri, Feb 12, 2010 at 5:00 PM, Phil Spector
>>>>>> <spector at stat.berkeley.edu>wrote:
>>>>>>
>>>>>>  By converting the two variables to factors, you are fitting
>>>>>>> an entirely different model.  Leave out the as.factor stuff
>>>>>>> and it will work exactly as you want it to.
>>>>>>>
>>>>>>> dat
>>>>>>>
>>>>>>>>
>>>>>>>>  V1 V2 V3 V4
>>>>>>> 1 s1 14  4  1
>>>>>>> 2 s2 23  4  2
>>>>>>> 3 s3 30  7  2
>>>>>>> 4 s4 50  7  4
>>>>>>> 5 s5 39 10  3
>>>>>>> 6 s6 67 10  6
>>>>>>>
>>>>>>>  names(dat) = c('id','y','x1','x2')
>>>>>>>> z = lm(y~x1+x2,dat)
>>>>>>>> predict(z)
>>>>>>>>
>>>>>>>>      1        2        3        4        5        6 15.16667
>>>>>>>
>>>>>> 24.66667
>>>>
>>>>> 27.66667 46.66667 40.16667 68.66667
>>>>>>>
>>>>>>>
>>>>>>>                                      - Phil Spector
>>>>>>>                                       Statistical Computing
>>>>>>>
>>>>>> Facility
>>>>
>>>>>                                       Department of Statistics
>>>>>>>                                       UC Berkeley
>>>>>>>                                       spector at stat.berkeley.edu
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 12 Feb 2010, Something Something wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>>>
>>>>>>>> I am trying to learn how to perform Multiple Regression Analysis in
>>>>>>>>
>>>>>>> R.
>>>>
>>>>> I
>>>>>>
>>>>>>> decided to take a simple example given in this PDF:
>>>>>>>> http://www.utdallas.edu/~herve/abdi-prc-pretty.pdf
>>>>>>>>
>>>>>>>> I created a small CSV called, students.csv that contains the
>>>>>>>>
>>>>>>> following
>>>>
>>>>> data:
>>>>>>>>
>>>>>>>> s1 14 4 1
>>>>>>>> s2 23 4 2
>>>>>>>> s3 30 7 2
>>>>>>>> s4 50 7 4
>>>>>>>> s5 39 10 3
>>>>>>>> s6 67 10 6
>>>>>>>>
>>>>>>>> Col headers:  Student id, Memory span(Y), age(X1), speech rate(X2)
>>>>>>>>
>>>>>>>> Now the expected results are:
>>>>>>>>
>>>>>>>> yHat[0]:15.166666666666668
>>>>>>>> yHat[1]:24.666666666666668
>>>>>>>> yHat[2]:27.666666666666664
>>>>>>>> yHat[3]:46.666666666666664
>>>>>>>> yHat[4]:40.166666666666664
>>>>>>>> yHat[5]:68.66666666666667
>>>>>>>>
>>>>>>>> This is based on the following equation (given in the PDF):  Y =
>>>>>>>>
>>>>>>> 1.67
>>>>
>>>>> +
>>>>>
>>>>>> X1
>>>>>>
>>>>>>> +
>>>>>>>> 9.50 X2
>>>>>>>>
>>>>>>>> I ran the following commands in R:
>>>>>>>>
>>>>>>>> data = read.table("students.csv", head=F, as.is=T, na.string=".",
>>>>>>>> row.nam=NULL)
>>>>>>>> X1 = as.factor(data[[3]])
>>>>>>>> X2 = as.factor(data[[4]])
>>>>>>>> Y = data[[2]]
>>>>>>>> mod = lm(Y ~ X1*X2, na.action = na.exclude)
>>>>>>>> Y.hat = fitted(mod)
>>>>>>>> Y.hat
>>>>>>>>
>>>>>>>> This gives me the following output:
>>>>>>>>
>>>>>>>> Y.hat
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  1  2  3  4  5  6
>>>>>>>> 14 23 30 50 39 67
>>>>>>>>
>>>>>>>> Obviously I am doing something wrong.  Please help.  Thanks.
>>>>>>>>
>>>>>>>>
>>>>> Hope this is helpful,
>>>>>
>>>>> Dan
>>>>>
>>>>> Daniel Nordlund
>>>>> Bothell, WA USA
>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>       [[alternative HTML version deleted]]
>>>>
>>>>
>>>>
>>>>
>>>        [[alternative HTML version deleted]]
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
>>
>
>        [[alternative HTML version deleted]]
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>

--
Ista Zahn