[R] Regression with many independent variables

Thu Mar 3 22:16:47 CET 2011

What you might need to do is create a character string with your formula in it (looping through pairs of variables and using paste or sprintf), then convert that to a formula using the as.formula function.
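
A minimal sketch of that approach, assuming a small made-up data frame (here called toy, with hypothetical predictor columns P1 to P4 standing in for the real 339). It uses combn to enumerate the pairs, drops any pair whose product is zero in every row, and pastes the remaining terms into a formula string. The names toy, preds, pairs, keep, terms and fmla are illustrative only, and this has not been checked at the scale of the real data.

```r
# Toy stand-in for adj0708 (hypothetical values; the real data has 339 predictors)
set.seed(1)
toy <- data.frame(
  MARGIN = rnorm(50),
  Poss   = rep(1:10, 5),
  P1     = rep(c(1, 0, 0, -1,  0), 10),
  P2     = rep(c(0, 1, 0,  0,  0), 10),
  P3     = rep(c(1, 1, 0,  0, -1), 10),
  P4     = rep(c(0, 0, 1,  0,  0), 10)
)

# Every column except the response and the weights is a predictor
preds <- setdiff(names(toy), c("MARGIN", "Poss"))

# 2 x choose(k, 2) character matrix, one column per pair of predictor names
pairs <- combn(preds, 2)

# Keep a pair only if its product is non-zero in at least one row
keep <- apply(pairs, 2, function(p) any(toy[[p[1]]] * toy[[p[2]]] != 0))

# Build "A:B" interaction terms and paste them into one formula string
terms <- paste(pairs[1, keep], pairs[2, keep], sep = ":")
fmla  <- as.formula(paste("MARGIN ~", paste(terms, collapse = " + ")))

fit <- lm(fmla, data = toy, weights = Poss)
```

Dropping the all-zero pairs before building the formula keeps those columns out of the model matrix entirely, which is the point of building the string by hand rather than using ^2. The model matrix that lm builds is still dense, so this may or may not be enough on its own for the full data set.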

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
> Sent: Thursday, March 03, 2011 2:09 PM
> To: Greg Snow
> Cc: r-help at r-project.org
> Subject: Re: [R] Regression with many independent variables
> 
> Thanks Greg,
> 
> That formula was exactly what I was looking for. Except now when I
> run it on my data I get the following error:
> 
> "Error in model.matrix.default(mt, mf, contrasts) : cannot allocate
> vector of length 2043479998"
> 
> I know there are probably many 2-way interactions that are zero, so I
> thought I could save space by removing these. Is there some way that
> I can just delete all the 2-way interactions that are zero and keep
> the columns that have non-zero entries? I think that will
> significantly cut down the memory needed. Or is there just another way
> to get around this?
> 
> thanks,
> Matt
> 
> On Tue, Mar 1, 2011 at 3:56 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> > You can use ^2 to get all 2-way interactions and ^3 to get all
> > 3-way interactions, e.g.:
> >
> > lm(Sepal.Width ~ (. - Sepal.Length)^2, data=iris)
> >
> > The lm.fit function is what actually does the fitting, so you could
> > go directly there, but then you lose the benefits of using . and ^.
> > The Matrix package has ways of dealing with sparse matrices, but I
> > don't know if that would help here or not.
> >
> > You could also just create the x'x and x'y matrices directly, since
> > the variables are 0/1, then use solve. A lot depends on what you are
> > doing and what questions you are trying to answer.
> >
> >
> >> -----Original Message-----
> >> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
> >> Sent: Tuesday, March 01, 2011 1:09 PM
> >> To: Greg Snow
> >> Cc: r-help at r-project.org
> >> Subject: Re: [R] Regression with many independent variables
> >>
> >> Hi Greg,
> >>
> >> Thanks for the help, it works perfectly. To answer your question,
> >> there are 339 independent variables but only 10 will be used at one
> >> time. So at any given line of the data set there will be 10 non-zero
> >> entries for the independent variables and the rest will be zeros.
> >>
> >> One more question:
> >>
> >> 1. I still want to find a way to look at the interactions of the
> >> independent variables.
> >>
> >> the regression would look like this:
> >>
> >> y = b12*X1X2 + b23*X2X3 +...+ bk-1k*Xk-1Xk
> >>
> >> so I think the regression in R would look like this:
> >>
> >> lm(MARGIN ~ P235:P236 + P236:P237 + ...., weights = Poss, data = adj0708)
> >>
> >> My problem is that since I technically have 339 independent
> >> variables, when I do this regression I would have 339 choose 2 =
> >> approx. 57,000 independent variables (the vast majority will be 0s,
> >> though), so I don't want to have to write all of these out. Is there
> >> a way to do this quickly in R?
> >>
> >> Also, just a curious question that I can't seem to find an answer
> >> to online: is there a more efficient model than lm() that is better
> >> for very sparse data sets like mine?
> >>
> >> Thanks,
> >> Matt
> >>
> >>
> >> On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> >> > Don't put the name of the dataset in the formula; use the data
> >> > argument to lm to provide that. A single period (".") on the right
> >> > hand side of the formula will represent all the columns in the data
> >> > set that are not on the left hand side (you can then use "-" to
> >> > remove any other columns that you don't want included on the RHS).
> >> >
> >> > For example:
> >> >
> >> >> lm(Sepal.Width ~ . - Sepal.Length, data=iris)
> >> >
> >> > Call:
> >> > lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
> >> >
> >> > Coefficients:
> >> >       (Intercept)       Petal.Length        Petal.Width  Speciesversicolor
> >> >            3.0485             0.1547             0.6234            -1.7641
> >> >  Speciesvirginica
> >> >           -2.1964
> >> >
> >> >
> >> > But are you sure that a regression model with 339 predictors will
> >> > be meaningful?
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: r-help-bounces at r-project.org
> >> >> [mailto:r-help-bounces at r-project.org] On Behalf Of Matthew Douglas
> >> >> Sent: Monday, February 28, 2011 1:32 PM
> >> >> To: r-help at r-project.org
> >> >> Subject: [R] Regression with many independent variables
> >> >>
> >> >> Hi,
> >> >>
> >> >> I am trying to use lm() on some data. The code works fine, but I
> >> >> would like to use a more efficient way to do this.
> >> >>
> >> >> The data looks like this (the data is very sparse, with a few 1s
> >> >> and -1s and the rest 0s):
> >> >>
> >> >> > head(adj0708)
> >> >>       MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337 ....
> >> >> 1   64.28571   29    0    0    0    0    0    0   0    0    0    0    0    0    0
> >> >> 2 -100.00000    6    0    0    0    0    0    0   0    1    0    0    0    0    0
> >> >> 3  100.00000    4    0    0    0    0    0    0   0    1    0    0    0    0    0
> >> >> 4  -33.33333    7    0    0    0    0    0    0   0    0    0    0    0    0    0
> >> >> 5  200.00000    2    0    0    0    0    0    0   0    0    0    0   -1    0    0
> >> >> 6  -83.33333   12    0   -1    0    0    0    0   0    0    0    0    0    0    0
> >> >>
> >> >> adj0708 is actually a 35657x341 data set. Each column after "Poss"
> >> >> is an independent variable, the dependent variable is "MARGIN", and
> >> >> it is weighted by "Poss".
> >> >>
> >> >>
> >> >> The regression is below:
> >> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
> >> >> adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
> >> >> adj0708$P605 + adj0708$P337 + .... +
> >> >> adj0708$P510,weights=adj0708$Poss)
> >> >>
> >> >> I have two questions:
> >> >>
> >> >> 1. Is there a way to condense how I write the independent
> >> >> variables in the lm(), instead of having such a long line of code
> >> >> (I have 339 independent variables to be exact)?
> >> >> 2. I would like to pair the data to look at a regression of the
> >> >> interactions between two independent variables. I think it would
> >> >> look something like this....
> >> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
> >> >> adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
> >> >> adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
> >> >> but there will be 339 choose 2 combinations, so a lot of
> >> >> independent variables! Is there a more efficient way of writing
> >> >> this code? Is there a way I can do this?
> >> >>
> >> >> Thanks,
> >> >> Matt