[R] Re gression by groups questions
Chris Friedl
cfriedalek at gmail.com
Wed Jun 17 12:07:08 CEST 2009
I have a large dataset grouped by a factor and I want to perform a regression
on each data subset based on this factor. There are many ways to do this,
posted here and elsewhere. I have tried several. However I found one method
posted on the R wiki which works exactly as I want, and I like the elegance
and simplicity of the solution, but I don't understand how it works. Its
all in the formula specification. Problem is also that I want to extend it
from a single variable case to a multivariable case and haven't really got a
clue how to go about it. Hope someone can help.
# The code below creates a data.frame comprising two marginally noisy
straight lines and performs a
# regression on each based on the grp factor. This is a single variable case
based on R wiki one liner.
# y = x for grp A
# y = 2 + 5x for grp B
set.seed(5)
ind <- as.vector(runif(50))
d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50))
data1 <- rbind(d1,d2)
fits <- lm(y ~ grp:x + grp - 1, data=data1)
summary(fits)
with(data1, plot(x,y))
abline(coef(fits)[c(1,3)], col="red")
abline(coef(fits)[c(2,4)], col="blue")
Output:
Call:
lm(formula = y ~ grp:x + grp - 1, data = data1)
Residuals:
Min 1Q Median 3Q Max
-0.206414 -0.057582 -0.001376 0.062737 0.242111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
grpA -0.02935 0.02811 -1.044 0.299
grpB 1.96200 0.02811 69.803 <2e-16 ***
grpA:x 1.06276 0.04921 21.596 <2e-16 ***
grpB:x 5.09351 0.04921 103.502 <2e-16 ***
---
We can see that the intercepts of are close to 0 and 2 and slopes close to 1
and 5 as suggested by the data creation equation. In fact these results are
identical to running lm individually on each subset of data with formula y ~
x. What I don't understand is how this result comes about from the formula
specification y ~ grp:x + grp -1 . Can someone offer an explanation?
In addition I want to apply the same idea for a multivariate data set. The
code below shows a multivariate dataset which I hope to fit by group.
# The code below creates a data.frame comprising three marginally noisy
planes. I want to perform a
# regression on each based on the grp factor.
# y = x1 + x2 for grp A
# y = 2 + 2x1 + 4x2 for grp B
# y = 2 + 2x1 + 4x2 + 6x1x2 for grp C
ind <- matrix(runif(100), ncol=2, dimnames=list(c(), c("x1","x2")))
d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1), grp=rep("A",
100))
d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
grp=rep("B", 100))
d3 <- data.frame(ind,
y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
100))
data2 <- rbind(d1,d2,d3)
# visualise
ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found"))
fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
summary(fits)
Output:
Call:
lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)
Residuals:
Min 1Q Median 3Q Max
-1.338348 -0.092069 -0.002844 0.094813 1.259094
Coefficients:
Estimate Std. Error t value Pr(>|t|)
grpA -0.01943 0.08442 -0.230 0.818
grpB 1.98345 0.08442 23.495 < 2e-16 ***
grpC 0.51002 0.08442 6.042 4.66e-09 ***
grpA:x1 1.03376 0.11769 8.784 < 2e-16 ***
grpB:x1 2.02563 0.11769 17.212 < 2e-16 ***
grpC:x1 4.83101 0.11769 41.050 < 2e-16 ***
grpA:x2 1.02371 0.09664 10.593 < 2e-16 ***
grpB:x2 4.00069 0.09664 41.396 < 2e-16 ***
grpC:x2 7.22331 0.09664 74.742 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2838 on 291 degrees of freedom
Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972
F-statistic: 1.199e+04 on 9 and 291 DF, p-value: < 2.2e-16
Whilst there seems to be some correlation between coefficients and terms
that doesn't apply in all cases. I've tried many other combinations but noe
that gave me coefficients similar to the defining equations. Bottom line is
I don't understand what the formula specification should be for the
multivariate version.
Thanks for any help given.
Chris
--
View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html
Sent from the R help mailing list archive at Nabble.com.
More information about the R-help
mailing list