[R] Regression with many independent variables
Matthew Douglas
matt.douglas01 at gmail.com
Thu Mar 3 23:43:00 CET 2011
Thanks for getting back to me so quickly, Greg. I'm not quite sure how
to do what you just described; is there an example you could show?
I understand how to create the string with a formula in it, but I'm not
sure how to loop through the pairs of variables. How do I first get
these 2-way interaction variables? I can no longer use "^", right?
Sorry for so many questions,
Matt
On Thu, Mar 3, 2011 at 4:16 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> What you might need to do is create a character string with your formula in it (looping through pairs of variables and using paste or sprintf), then convert that to a formula using the as.formula function.
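For example, a rough sketch of this approach might look like the following
(assuming the adj0708 data frame, with MARGIN as the response and Poss as
the weights, as in the messages below):

    ## all predictor columns (everything except the response and the weights)
    vars  <- setdiff(names(adj0708), c("MARGIN", "Poss"))
    ## every 2-way pair, pasted into "P235:P247"-style interaction terms
    pairs <- combn(vars, 2)
    rhs   <- paste(apply(pairs, 2, paste, collapse = ":"), collapse = " + ")
    ## turn the string into a formula and fit
    f     <- as.formula(paste("MARGIN ~", rhs))
    fit   <- lm(f, data = adj0708, weights = Poss)

Note this only constructs the formula; with 339 predictors the resulting
model matrix will still be enormous, so you would probably also want to drop
pairs that are never jointly non-zero before pasting the terms together.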
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
>> Sent: Thursday, March 03, 2011 2:09 PM
>> To: Greg Snow
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Regression with many independent variables
>>
>> Thanks, Greg,
>>
>> that formula was exactly what I was looking for. Except now when I
>> run it on my data I get the following error:
>>
>> "Error in model.matrix.default(mt, mf, contrasts) : cannot allocate
>> vector of length 2043479998"
>>
>> I know there are probably many 2-way interactions that are all zero, so
>> I thought I could save space by removing them. Is there some way to just
>> delete all the 2-way interactions that are zero and keep the columns
>> that have non-zero entries? I think that would significantly cut down
>> the memory needed. Or is there another way to get around this?
>>
>> thanks,
>> Matt
>>
>> On Tue, Mar 1, 2011 at 3:56 PM, Greg Snow <Greg.Snow at imail.org> wrote:
>> > You can use ^2 to get all 2-way interactions and ^3 to get all 3-way
>> interactions, e.g.:
>> >
>> > lm(Sepal.Width ~ (. - Sepal.Length)^2, data=iris)
>> >
>> > The lm.fit function is what actually does the fitting, so you could
>> go directly there, but then you lose the benefits of using . and ^.
>> The Matrix package has ways of dealing with sparse matrices, but I
>> don't know if that would help here or not.
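For illustration, a minimal sketch of calling the fitting code directly
(the two example columns are just placeholders taken from the posted data;
lm.wfit is the weighted analogue of lm.fit):

    ## build the design matrix yourself, then fit without the formula machinery
    X   <- model.matrix(~ P235 + P247 + P235:P247, data = adj0708)
    fit <- lm.wfit(X, adj0708$MARGIN, adj0708$Poss)
    fit$coefficients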
>> >
>> > You could also just create the X'X and X'y matrices directly, since the
>> variables are 0/1, then use solve. A lot depends on what you are doing
>> and what questions you are trying to answer.
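A rough sketch of that normal-equations route, folding in the Poss weights
(again with a small placeholder design matrix, not the full 339 columns):

    ## weighted least squares by hand: b = (X'WX)^(-1) X'Wy
    X  <- model.matrix(~ P235 + P247, data = adj0708)
    w  <- adj0708$Poss
    y  <- adj0708$MARGIN
    Xw <- X * sqrt(w)                      # scales row i by sqrt(w[i])
    b  <- solve(crossprod(Xw), crossprod(Xw, sqrt(w) * y))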
>> >
>> > --
>> > Gregory (Greg) L. Snow Ph.D.
>> > Statistical Data Center
>> > Intermountain Healthcare
>> > greg.snow at imail.org
>> > 801.408.8111
>> >
>> >
>> >> -----Original Message-----
>> >> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
>> >> Sent: Tuesday, March 01, 2011 1:09 PM
>> >> To: Greg Snow
>> >> Cc: r-help at r-project.org
>> >> Subject: Re: [R] Regression with many independent variables
>> >>
>> >> Hi Greg,
>> >>
>> >> Thanks for the help, it works perfectly. To answer your question,
>> >> there are 339 independent variables but only 10 will be used at one
>> >> time. So in any given row of the data set there will be 10 non-zero
>> >> entries for the independent variables and the rest will be zeros.
>> >>
>> >> One more question:
>> >>
>> >> 1. I still want to find a way to look at the interactions of the
>> >> independent variables.
>> >>
>> >> the regression would look like this:
>> >>
>> >> y = b_12*X1*X2 + b_23*X2*X3 + ... + b_(k-1,k)*X_(k-1)*X_k
>> >>
>> >> so I think the regression in R would look like this:
>> >>
>> >> lm(MARGIN ~ P235:P236 + P236:P237 + ...., weights = Poss, data = adj0708)
>> >>
>> >> my problem is that since I technically have 339 independent variables,
>> >> when I do this regression I would have 339 choose 2 = approx. 57,000
>> >> independent variables (the vast majority will be 0s, though), so I don't
>> >> want to have to write all of these out. Is there a way to do this
>> >> quickly in R?
>> >>
>> >> Also, just a curious question that I can't seem to find an answer to
>> >> online: is there a more efficient alternative to lm() that is better
>> >> suited to very sparse data sets like mine?
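If it helps, one rough sketch along the lines of the Matrix-package
suggestion in Greg's reply above (column names taken from the posted data;
this keeps the design matrix sparse rather than building a dense one):

    library(Matrix)
    pred <- setdiff(names(adj0708), c("MARGIN", "Poss"))
    ## sparse design matrix over all predictor columns
    X  <- sparse.model.matrix(~ ., data = adj0708[, pred])
    w  <- adj0708$Poss
    Xw <- Diagonal(x = sqrt(w)) %*% X      # apply sqrt-weights row-wise, stays sparse
    b  <- solve(crossprod(Xw), crossprod(Xw, sqrt(w) * adj0708$MARGIN))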
>> >>
>> >> Thanks,
>> >> Matt
>> >>
>> >>
>> >> On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <Greg.Snow at imail.org>
>> wrote:
>> >> > Don't put the name of the dataset in the formula; use the data
>> >> argument to lm to provide that. A single period (".") on the right-hand
>> >> side of the formula will represent all the columns in the data set
>> >> that are not on the left-hand side (you can then use "-" to remove
>> >> any other columns that you don't want included on the RHS).
>> >> >
>> >> > For example:
>> >> >
>> >> >> lm(Sepal.Width ~ . - Sepal.Length, data=iris)
>> >> >
>> >> > Call:
>> >> > lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
>> >> >
>> >> > Coefficients:
>> >> >       (Intercept)       Petal.Length        Petal.Width  Speciesversicolor
>> >> >            3.0485             0.1547             0.6234            -1.7641
>> >> >  Speciesvirginica
>> >> >           -2.1964
>> >> >
>> >> >
>> >> > But, are you sure that a regression model with 339 predictors will
>> be
>> >> meaningful?
>> >> >
>> >> > --
>> >> > Gregory (Greg) L. Snow Ph.D.
>> >> > Statistical Data Center
>> >> > Intermountain Healthcare
>> >> > greg.snow at imail.org
>> >> > 801.408.8111
>> >> >
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> >> >> project.org] On Behalf Of Matthew Douglas
>> >> >> Sent: Monday, February 28, 2011 1:32 PM
>> >> >> To: r-help at r-project.org
>> >> >> Subject: [R] Regression with many independent variables
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I am trying to use lm() on some data. The code works fine, but I
>> >> >> would like to use a more efficient way to do this.
>> >> >>
>> >> >> The data look like this (they are very sparse, with a few 1s, -1s,
>> >> >> and the rest 0s):
>> >> >>
>> >> >> > head(adj0708)
>> >> >>       MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337 ....
>> >> >> 1   64.28571   29    0    0    0    0    0    0   0    0    0   0  0  0  0
>> >> >> 2 -100.00000    6    0    0    0    0    0    0   0    1    0   0  0  0  0
>> >> >> 3  100.00000    4    0    0    0    0    0    0   0    1    0   0  0  0  0
>> >> >> 4  -33.33333    7    0    0    0    0    0    0   0    0    0   0  0  0  0
>> >> >> 5  200.00000    2    0    0    0    0    0    0   0    0    0   0 -1  0  0
>> >> >> 6  -83.33333   12    0   -1    0    0    0    0   0    0    0   0  0  0  0
>> >> >>
>> >> >> adj0708 is actually a 35657 x 341 data set. Each column after "Poss"
>> >> >> is an independent variable; the dependent variable is "MARGIN", and
>> >> >> the regression is weighted by "Poss".
>> >> >>
>> >> >>
>> >> >> The regression is below:
>> >> >> fit.adj0708 <- lm(adj0708$MARGIN ~ adj0708$P235 + adj0708$P247 +
>> >> >>     adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
>> >> >>     adj0708$P605 + adj0708$P337 + .... + adj0708$P510,
>> >> >>     weights = adj0708$Poss)
>> >> >>
>> >> >> I have two questions:
>> >> >>
>> >> >> 1. Is there a way to condense how I write the independent variables
>> >> >> in the lm(), instead of having such a long line of code (I have 339
>> >> >> independent variables, to be exact)?
>> >> >> 2. I would like to pair the data to look at a regression of the
>> >> >> interactions between two independent variables. I think it would look
>> >> >> something like this:
>> >> >> fit.adj0708 <- lm(adj0708$MARGIN ~ adj0708$P235:adj0708$P247 +
>> >> >>     adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
>> >> >>     adj0708$P605:adj0708$P337 + ...., weights = adj0708$Poss)
>> >> >> but there will be 339 choose 2 combinations, so a lot of independent
>> >> >> variables! Is there a more efficient way of writing this code? Is
>> >> >> there a way I can do this?
>> >> >>
>> >> >> Thanks,
>> >> >> Matt
>> >> >>
>> >> >> ______________________________________________
>> >> >> R-help at r-project.org mailing list
>> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >> PLEASE do read the posting guide
>> >> >> http://www.R-project.org/posting-guide.html
>> >> >> and provide commented, minimal, self-contained, reproducible code.
>> >> >
>> >
>