[R] Re gression by groups questions
Chris Friedl
cfriedalek at gmail.com
Thu Jun 18 14:02:38 CEST 2009
Thanks Xavier. That's exactly the solution.
xavier.chardon wrote:
>
> Hi,
>
> It seems like your multivariate regression works for groups A and B. For
> group C, values are calculated using an interaction between x1 and x2. Try
> to include it in the regression model to find the right coeffs:
>
> y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1
>
> It works for me. Notice in summary() that the coefficients for the x1:x2
> interaction are not significantly different from 0 in groups A and B,
> which is what we expected.
>
> HTH,
>
> Xavier
>
>
>
> ----- Mail Original -----
> De: "Chris Friedl" <cfriedalek at gmail.com>
> À: r-help at r-project.org
> Envoyé: Mercredi 17 Juin 2009 12h07:08 GMT +01:00 Amsterdam / Berlin /
> Berne / Rome / Stockholm / Vienne
> Objet: [R] Re gression by groups questions
>
>
> I have a large dataset grouped by a factor and I want to perform a
> regression
> on each data subset based on this factor. There are many ways to do this,
> posted here and elsewhere. I have tried several. However I found one
> method
> posted on the R wiki which works exactly as I want, and I like the
> elegance
> and simplicity of the solution, but I don't understand how it works. Its
> all in the formula specification. Problem is also that I want to extend it
> from a single variable case to a multivariable case and haven't really got
> a
> clue how to go about it. Hope someone can help.
>
> # The code below creates a data.frame comprising two marginally noisy
> straight lines and performs a
> # regression on each based on the grp factor. This is a single variable
> case
> based on R wiki one liner.
> # y = x for grp A
> # y = 2 + 5x for grp B
> set.seed(5)
> ind <- as.vector(runif(50))
> d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
> d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50))
> data1 <- rbind(d1,d2)
> fits <- lm(y ~ grp:x + grp - 1, data=data1)
> summary(fits)
> with(data1, plot(x,y))
> abline(coef(fits)[c(1,3)], col="red")
> abline(coef(fits)[c(2,4)], col="blue")
>
> Output:
> Call:
> lm(formula = y ~ grp:x + grp - 1, data = data1)
>
> Residuals:
> Min 1Q Median 3Q Max
> -0.206414 -0.057582 -0.001376 0.062737 0.242111
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> grpA -0.02935 0.02811 -1.044 0.299
> grpB 1.96200 0.02811 69.803 <2e-16 ***
> grpA:x 1.06276 0.04921 21.596 <2e-16 ***
> grpB:x 5.09351 0.04921 103.502 <2e-16 ***
> ---
>
> We can see that the intercepts of are close to 0 and 2 and slopes close to
> 1
> and 5 as suggested by the data creation equation. In fact these results
> are
> identical to running lm individually on each subset of data with formula y
> ~
> x. What I don't understand is how this result comes about from the formula
> specification y ~ grp:x + grp -1 . Can someone offer an explanation?
>
>
> In addition I want to apply the same idea for a multivariate data set. The
> code below shows a multivariate dataset which I hope to fit by group.
>
> # The code below creates a data.frame comprising three marginally noisy
> planes. I want to perform a
> # regression on each based on the grp factor.
> # y = x1 + x2 for grp A
> # y = 2 + 2x1 + 4x2 for grp B
> # y = 2 + 2x1 + 4x2 + 6x1x2 for grp C
> ind <- matrix(runif(200), ncol=2, dimnames=list(c(), c("x1","x2")))
> d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("A",
> 100))
> d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("B", 100))
> d3 <- data.frame(ind,
> y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
> 100))
> data2 <- rbind(d1,d2,d3)
> # visualise
> ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found"))
> fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
> summary(fits)
>
> Output:
>
> Call:
> lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)
>
> Residuals:
> Min 1Q Median 3Q Max
> -1.338348 -0.092069 -0.002844 0.094813 1.259094
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> grpA -0.01943 0.08442 -0.230 0.818
> grpB 1.98345 0.08442 23.495 < 2e-16 ***
> grpC 0.51002 0.08442 6.042 4.66e-09 ***
> grpA:x1 1.03376 0.11769 8.784 < 2e-16 ***
> grpB:x1 2.02563 0.11769 17.212 < 2e-16 ***
> grpC:x1 4.83101 0.11769 41.050 < 2e-16 ***
> grpA:x2 1.02371 0.09664 10.593 < 2e-16 ***
> grpB:x2 4.00069 0.09664 41.396 < 2e-16 ***
> grpC:x2 7.22331 0.09664 74.742 < 2e-16 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.2838 on 291 degrees of freedom
> Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972
> F-statistic: 1.199e+04 on 9 and 291 DF, p-value: < 2.2e-16
>
> Whilst there seems to be some correlation between coefficients and terms
> that doesn't apply in all cases. I've tried many other combinations but
> noe
> that gave me coefficients similar to the defining equations. Bottom line
> is
> I don't understand what the formula specification should be for the
> multivariate version.
>
> Thanks for any help given.
>
> Chris
>
> --
> View this message in context:
> http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
--
View this message in context: http://www.nabble.com/Re%3A-Re-gression-by-groups-questions-tp24087803p24091678.html
Sent from the R help mailing list archive at Nabble.com.
More information about the R-help
mailing list