[R] Question about multiple regression

Mon Sep 8 20:35:41 CEST 2008

R squared is: 1 - sum(residuals^2)/crossprod(y - mean(y))
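
As a minimal sketch of the formula above (the simulated data and the names x1, x2, X and fit are illustrative assumptions, not from the original question): lm.fit() returns the residuals but neither the total nor the model sum of squares, so both are computed from y directly. The identity holds as written only when the design matrix contains an intercept column. Using sum() rather than crossprod() in the denominator gives a plain number instead of a 1x1 matrix.

```r
# Simulated data; all names here are illustrative assumptions
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# lm.fit() takes a design matrix, so the intercept column is added by hand
X   <- cbind(Intercept = 1, x1, x2)
fit <- lm.fit(X, y)

ss_res <- sum(fit$residuals^2)     # residual sum of squares
ss_tot <- sum((y - mean(y))^2)     # total sum of squares (not part of lm.fit output)
ss_mod <- ss_tot - ss_res          # model sum of squares
r2     <- 1 - ss_res / ss_tot      # R squared

# Cross-checks: squared correlation of fitted values with y, and lm()'s own value
r2_cor <- cor(fit$fitted.values, y)^2
r2_lm  <- summary(lm(y ~ x1 + x2))$r.squared
```

All three values (r2, r2_cor, r2_lm) should agree up to floating-point error.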


On Mon, Sep 8, 2008 at 2:27 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
> I could get an r squared from lm.fit by correlating fitted.values and
> my response variable.
> But could I do it somehow using Sums of Squares? I am clear on SS for
> residuals. But where is SS for the model or the total SS in lm.fit
> output?
> Thank you!
> Dimitri
>
> On Mon, Sep 8, 2008 at 1:57 PM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>> On Mon, Sep 8, 2008 at 1:47 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
>>> Thank you everyone for your responses. I'll answer several questions.
>>>
>>> 1. >  Disclaimer: I have **NO IDEA** of the details of what you want
>>> to do or why
>>>> -- but I am willing to bet that there are better ways of doing it than 1.8
>>>> million multiple regressions that take 270 secs each!! (which I find difficult to
>>>> believe in itself -- are you sure you are doing things right? Something
>>>> sounds very fishy here: R's regression code is typically very fast).
>>> I probably should not bore everyone, but just to explain where the
>>> large number is coming from. I have an experimental design with 7
>>> factors. Each factor has between 3 and 5 levels. Once you cross them
>>> all, you end up with 18,000 cells. For each cell, I want to generate a
>>> sample of N=100. For each sample I have to analyze the data using 3
>>> different statistical methods of analysis (the goal of the
>>> Monte Carlo is to compare those methods). One of the methods requires
>>> running up to ~32,000 simple multiple regressions - yes, just for
>>> one sample, and it's not a mistake. I test-ran one such analysis for a
>>> sample with N=800 and 15 predictors and it took 270 seconds. R was
>>> actually very fast - it ran each of the individual regressions in
>>> about 0.008 seconds. Still I need something faster.
>>>
>>> 2. Sorry - what was the formula sum(lm.fit(x,y)$residuals^2) for? For
>>> example, using it on my data, I got a value of 36,644...
>>
>> It's the sum of the squares of the residuals.
>>
>>>
>>> 3. I know that for similarly challenging situations people have used
>>> Fortran compilers. So, has anyone heard of a free Fortran library or an
>>> efficient piece of code?
>>>
>>> Thank you!
>>> Dimitri
>>>
>>>
>>>>
>>>> -- Bert Gunter
>>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>>>> Behalf Of Dimitri Liakhovitski
>>>> Sent: Monday, September 08, 2008 9:56 AM
>>>> To: Prof Brian Ripley
>>>> Cc: R-Help List
>>>> Subject: Re: [R] Question about multiple regression
>>>>
>>>> Yes, see my previous e-mail on how long R takes (270 seconds for one
>>>> of the 1,800,000 sets I need) - using system.time.
>>>> Not sure how to test the same for Fortran...
>>>>
>>>> On Mon, Sep 8, 2008 at 12:51 PM, Prof Brian Ripley
>>>> <ripley at stats.ox.ac.uk> wrote:
>>>>> Are you sure R's ways are not fast enough (there are many layers underneath
>>>>> lm)?  For an example of how you might do this at C/Fortran level, see the
>>>>> function lqs() in MASS.
>>>>>
>>>>> On Mon, 8 Sep 2008, Dimitri Liakhovitski wrote:
>>>>>
>>>>>> Dear R-list,
>>>>>> maybe some of you could point me in the right direction:
>>>>>>
>>>>>> Are you aware of any FREE Fortran or Java libraries/actual pieces of
>>>>>> code that are VERY efficient (time-wise) in running the regular linear
>>>>>> least-squares multiple regression?
>>>>>
>>>>> A lot of the effort is in getting the right answer fast, including for e.g.
>>>>> collinear inputs.
>>>>>
>>>>>> More specifically, I have to run small regression models (between 1
>>>>>> and 15 predictors) on samples of up to N=700 but thousands and
>>>>>> thousands of them.
>>>>>>
>>>>>> I am designing a simulation in R and running those regressions, and R
>>>>>> itself is way too slow. So, I am thinking of compiling the regression
>>>>>> run itself in Fortran or Java and then calling it from R.
>>>>>
>>>>> I think Java is unlikely to be fast compared to the Fortran R itself uses.
>>>>>
>>>>> Have you profiled to find where the time is really being spent (both R and
>>>>> C/Fortran profiling if necessary)?
>>>>>
>>>>>>
>>>>>> Thank you very much for any advice!
>>>>>>
>>>>>> Dimitri Liakhovitski
>>>>>> MarketTools, Inc.
>>>>>> Dimitri.Liakhovitski at markettools.com
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>> --
>>>>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>>>>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>>>>> University of Oxford,             Tel:  +44 1865 272861 (self)
>>>>> 1 South Parks Road,                     +44 1865 272866 (PA)
>>>>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>>>>
>>>>
>>>>
>>>
>>
>