[R] Regression with many independent variables

Greg Snow Greg.Snow at imail.org
Fri Mar 4 19:22:14 CET 2011


Here is one possible way (you will need to change the dataset and condition, etc.):

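# For each pair of predictors, return NA if the pair fails the condition,
# otherwise the interaction term "a:b"
tmp1 <- combn(names(iris)[1:4], 2, function(x) {
	if( any( iris[[ x[1] ]] * iris[[ x[2] ]] < .25 )) {
		NA
	} else {
		paste(x, collapse=':')
	}} )

# Drop the pairs that failed the condition
tmp1 <- tmp1[ !is.na(tmp1) ]

# Join the remaining terms into the right-hand side of a formula
paste(tmp1, collapse=' + ')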
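
To go from that string to a fitted model, the as.formula step mentioned earlier in the thread could look something like the sketch below. It is only an illustration: the `toy` data frame, its columns P1..P6, and the "drop pairs whose product is zero in every row" condition are made up for the example, standing in for adj0708 and whatever condition fits your data.

```r
set.seed(1)
n <- 200

# Hypothetical toy data: 6 sparse predictors with values -1/0/1,
# plus a response (MARGIN) and a weight column (Poss).
preds <- paste0("P", 1:6)
toy <- as.data.frame(matrix(
  sample(c(-1, 0, 1), n * 6, replace = TRUE, prob = c(0.1, 0.8, 0.1)),
  nrow = n))
names(toy) <- preds
toy$MARGIN <- rnorm(n)
toy$Poss <- sample(1:30, n, replace = TRUE)

# Keep a pair only if its product is non-zero in at least one row.
keep <- combn(preds, 2, function(x) {
  if (all(toy[[x[1]]] * toy[[x[2]]] == 0)) {
    NA_character_
  } else {
    paste(x, collapse = ":")
  }
})
keep <- keep[!is.na(keep)]

# Build the formula as a string, then convert it with as.formula.
fml <- as.formula(paste("MARGIN ~", paste(keep, collapse = " + ")))

# Fit using the data and weights arguments rather than toy$... in the formula.
fit <- lm(fml, data = toy, weights = Poss)
```

With the real data you would replace `preds` with the names of your predictor columns, e.g. something like setdiff(names(adj0708), c("MARGIN", "Poss")).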

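If the model matrix is still too big for lm, the x'x and x'y idea from the earlier message quoted below can be sketched in base R with crossprod and solve. Again this is just a sketch on simulated data: X, y and w are made-up stand-ins for the predictors, MARGIN and Poss, there is no intercept, and it assumes the weighted cross-product matrix is not singular.

```r
set.seed(2)
n <- 100

# Hypothetical stand-ins: sparse -1/0/1 predictors, a response, and weights.
X <- matrix(sample(c(-1, 0, 1), n * 4, replace = TRUE,
                   prob = c(0.15, 0.7, 0.15)), nrow = n)
y <- rnorm(n)
w <- sample(1:30, n, replace = TRUE)

# Weighted normal equations: (X' W X) beta = X' W y.
# w * X multiplies row i of X by w[i], so crossprod(X, w * X) is X' W X.
XtWX <- crossprod(X, w * X)
XtWy <- crossprod(X, w * y)
beta <- solve(XtWX, XtWy)

# Cross-check against lm() on the same data (no intercept).
fit <- lm(y ~ X - 1, weights = w)
same <- isTRUE(all.equal(as.vector(beta), unname(coef(fit))))
```

The point of going this way is that the cross-product matrices only have as many rows and columns as there are predictors, however many observations there are.
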
-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
> Sent: Thursday, March 03, 2011 3:43 PM
> To: Greg Snow
> Cc: r-help at r-project.org
> Subject: Re: [R] Regression with many independent variables
> 
> Thanks for getting back to me so quickly, Greg. I'm not quite sure how
> to do what you just said; is there an example that you can show?
> 
> I understand how to create the string with a formula in it, but I'm not
> sure how to loop through the pairs of variables. How do I first get
> these 2-way interaction variables? I can no longer use the "^", right?
> 
> Sorry for so many questions,
> 
> Matt
> On Thu, Mar 3, 2011 at 4:16 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> > What you might need to do is create a character string with your
> > formula in it (looping through pairs of variables and using paste or
> > sprintf), then convert that to a formula using the as.formula function.
> >
> >
> >> -----Original Message-----
> >> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
> >> Sent: Thursday, March 03, 2011 2:09 PM
> >> To: Greg Snow
> >> Cc: r-help at r-project.org
> >> Subject: Re: [R] Regression with many independent variables
> >>
> >> Thanks Greg,
> >>
> >> that formula was exactly what I was looking for. Except now when I
> >> run it on my data I get the following error:
> >>
> >> "Error in model.matrix.default(mt, mf, contrasts) : cannot allocate
> >> vector of length 2043479998"
> >>
> >> I know there are probably many 2-way interactions that are zero, so I
> >> thought I could save space by removing these. Is there some way that I
> >> can just delete all the two-way interactions that are zero and keep
> >> the columns that have non-zero entries? I think that will
> >> significantly cut down the memory needed. Or is there just another way
> >> to get around this?
> >>
> >> thanks,
> >> Matt
> >>
> >> On Tue, Mar 1, 2011 at 3:56 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> >> > You can use ^2 to get all 2-way interactions (plus the main effects)
> >> > and ^3 to get everything up to the 3-way interactions, e.g.:
> >> >
> >> > lm(Sepal.Width ~ (. - Sepal.Length)^2, data=iris)
> >> >
> >> > The lm.fit function is what actually does the fitting, so you could
> >> > go directly there, but then you lose the benefits of using . and ^.
> >> > The Matrix package has ways of dealing with sparse matrices, but I
> >> > don't know if that would help here or not.
> >> >
> >> > You could also just create the x'x and x'y matrices directly, since the
> >> > variables are 0/1, then use solve. A lot depends on what you are doing
> >> > and what questions you are trying to answer.
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
> >> >> Sent: Tuesday, March 01, 2011 1:09 PM
> >> >> To: Greg Snow
> >> >> Cc: r-help at r-project.org
> >> >> Subject: Re: [R] Regression with many independent variables
> >> >>
> >> >> Hi Greg,
> >> >>
> >> >> Thanks for the help, it works perfectly. To answer your question,
> >> >> there are 339 independent variables but only 10 will be used at one
> >> >> time. So at any given line of the data set there will be 10 non-zero
> >> >> entries for the independent variables and the rest will be zeros.
> >> >>
> >> >> One more question:
> >> >>
> >> >> 1. I still want to find a way to look at the interactions of the
> >> >> independent variables.
> >> >>
> >> >> The regression would look like this:
> >> >>
> >> >> y = b12*X1X2 + b23*X2X3 +...+ bk-1k*Xk-1Xk
> >> >>
> >> >> so I think the regression in R would look like this:
> >> >>
> >> >> lm(MARGIN ~ P235:P236 + P236:P237 + ...., weights = Poss, data = adj0708)
> >> >>
> >> >> My problem is that since I technically have 339 independent variables,
> >> >> when I do this regression I would have 339 choose 2 = 57,291
> >> >> independent variables (the vast majority will be 0s, though), so I don't
> >> >> want to have to write all of these out. Is there a way to do this
> >> >> quickly in R?
> >> >>
> >> >> Also, just a curious question that I can't seem to find an answer to online:
> >> >> is there a more efficient model than lm() that is better for
> >> >> very sparse data sets like mine?
> >> >>
> >> >> Thanks,
> >> >> Matt
> >> >>
> >> >>
> >> >> On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> >> >> > Don't put the name of the dataset in the formula; use the data
> >> >> > argument to lm to provide that. A single period (".") on the right
> >> >> > hand side of the formula will represent all the columns in the data set
> >> >> > that are not on the left hand side (you can then use "-" to remove any
> >> >> > other columns that you don't want included on the RHS).
> >> >> >
> >> >> > For example:
> >> >> >
> >> >> > > lm(Sepal.Width ~ . - Sepal.Length, data=iris)
> >> >> >
> >> >> > Call:
> >> >> > lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
> >> >> >
> >> >> > Coefficients:
> >> >> >       (Intercept)       Petal.Length        Petal.Width  Speciesversicolor
> >> >> >            3.0485             0.1547             0.6234            -1.7641
> >> >> >  Speciesvirginica
> >> >> >           -2.1964
> >> >> >
> >> >> >
> >> >> > But are you sure that a regression model with 339 predictors will be
> >> >> > meaningful?
> >> >> >
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: r-help-bounces at r-project.org
> >> >> >> [mailto:r-help-bounces at r-project.org] On Behalf Of Matthew Douglas
> >> >> >> Sent: Monday, February 28, 2011 1:32 PM
> >> >> >> To: r-help at r-project.org
> >> >> >> Subject: [R] Regression with many independent variables
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> I am trying to use lm() on some data. The code works fine, but I would
> >> >> >> like to use a more efficient way to do this.
> >> >> >>
> >> >> >> The data looks like this (the data is very sparse, with a few 1s and -1s
> >> >> >> and the rest 0s):
> >> >> >>
> >> >> >> > head(adj0708)
> >> >> >>       MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337 ....
> >> >> >> 1   64.28571   29    0    0    0    0    0    0   0    0    0    0    0    0    0
> >> >> >> 2 -100.00000    6    0    0    0    0    0    0   0    1    0    0    0    0    0
> >> >> >> 3  100.00000    4    0    0    0    0    0    0   0    1    0    0    0    0    0
> >> >> >> 4  -33.33333    7    0    0    0    0    0    0   0    0    0    0    0    0    0
> >> >> >> 5  200.00000    2    0    0    0    0    0    0   0    0    0    0   -1    0    0
> >> >> >> 6  -83.33333   12    0   -1    0    0    0    0   0    0    0    0    0    0    0
> >> >> >>
> >> >> >> adj0708 is actually a 35657x341 data set. Each column after "Poss" is
> >> >> >> an independent variable, the dependent variable is "MARGIN", and it is
> >> >> >> weighted by "Poss".
> >> >> >>
> >> >> >>
> >> >> >> The regression is below:
> >> >> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
> >> >> >> adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
> >> >> >> adj0708$P605 + adj0708$P337 + .... +
> >> >> >> adj0708$P510,weights=adj0708$Poss)
> >> >> >>
> >> >> >> I have two questions:
> >> >> >>
> >> >> >> 1. Is there a way to condense how I write the independent variables
> >> >> >> in the lm(), instead of having such a long line of code (I have 339
> >> >> >> independent variables to be exact)?
> >> >> >> 2. I would like to pair the data to look at a regression of the
> >> >> >> interactions between two independent variables. I think it would look
> >> >> >> something like this....
> >> >> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
> >> >> >> adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
> >> >> >> adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
> >> >> >> but there will be 339 choose 2 combinations, so a lot of independent
> >> >> >> variables! Is there a more efficient way of writing this code? Is
> >> >> >> there a way I can do this?
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Matt
> >> >> >>
> >> >> >> ______________________________________________
> >> >> >> R-help at r-project.org mailing list
> >> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> >> >> and provide commented, minimal, self-contained, reproducible code.
> >> >> >
> >> >
> >


