[R] Re gression by groups questions

Chris Friedl cfriedalek at gmail.com
Wed Jun 17 12:07:08 CEST 2009


I have a large dataset grouped by a factor and I want to perform a regression
on each data subset based on this factor. There are many ways to do this,
posted here and elsewhere. I have tried several. However I found one method
posted on the R wiki which works exactly as I want, and I like the elegance
and simplicity of the solution,  but I don't understand how it works. Its
all in the formula specification. Problem is also that I want to extend it
from a single variable case to a multivariable case and haven't really got a
clue how to go about it. Hope someone can help.

# The code below creates a data.frame comprising two marginally noisy
straight lines and performs a 
# regression on each based on the grp factor. This is a single variable case
based on R wiki one liner.
# y = x		for grp A
# y = 2 + 5x	for grp B
set.seed(5)
ind <- as.vector(runif(50))
d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50))
data1 <- rbind(d1,d2)
fits <- lm(y ~ grp:x + grp - 1, data=data1)
summary(fits)
with(data1, plot(x,y))
abline(coef(fits)[c(1,3)], col="red")
abline(coef(fits)[c(2,4)], col="blue")

Output:
Call:
lm(formula = y ~ grp:x + grp - 1, data = data1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.206414 -0.057582 -0.001376  0.062737  0.242111 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
grpA   -0.02935    0.02811  -1.044    0.299    
grpB    1.96200    0.02811  69.803   <2e-16 ***
grpA:x  1.06276    0.04921  21.596   <2e-16 ***
grpB:x  5.09351    0.04921 103.502   <2e-16 ***
---

We can see that the intercepts of are close to 0 and 2 and slopes close to 1
and 5 as suggested by the data creation equation. In fact these results are
identical to running lm individually on each subset of data with formula y ~
x. What I don't understand is how this result comes about from the formula
specification y ~ grp:x + grp -1 . Can someone offer an explanation?


In addition I want to apply the same idea for a multivariate data set. The
code below shows a multivariate dataset which I hope to fit by group.

# The code below creates a data.frame comprising three marginally noisy
planes. I want to perform a 
# regression on each based on the grp factor. 
# y = x1 + x2 			for grp A
# y = 2 + 2x1 + 4x2 		for grp B
# y = 2 + 2x1 + 4x2 + 6x1x2  for grp C
ind <- matrix(runif(100), ncol=2, dimnames=list(c(), c("x1","x2")))
d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1), grp=rep("A",
100))
d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
grp=rep("B", 100))
d3 <- data.frame(ind,
y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
100))
data2 <- rbind(d1,d2,d3)
# visualise
ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found"))
fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
summary(fits)

Output:

Call:
lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)

Residuals:
      Min        1Q    Median        3Q       Max 
-1.338348 -0.092069 -0.002844  0.094813  1.259094 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
grpA    -0.01943    0.08442  -0.230    0.818    
grpB     1.98345    0.08442  23.495  < 2e-16 ***
grpC     0.51002    0.08442   6.042 4.66e-09 ***
grpA:x1  1.03376    0.11769   8.784  < 2e-16 ***
grpB:x1  2.02563    0.11769  17.212  < 2e-16 ***
grpC:x1  4.83101    0.11769  41.050  < 2e-16 ***
grpA:x2  1.02371    0.09664  10.593  < 2e-16 ***
grpB:x2  4.00069    0.09664  41.396  < 2e-16 ***
grpC:x2  7.22331    0.09664  74.742  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2838 on 291 degrees of freedom
Multiple R-squared: 0.9973,     Adjusted R-squared: 0.9972 
F-statistic: 1.199e+04 on 9 and 291 DF,  p-value: < 2.2e-16 

Whilst there seems to be some correlation between coefficients and terms
that doesn't apply in all cases. I've tried many other combinations but noe
that gave me coefficients similar to the defining equations. Bottom line is
I don't understand what the formula specification should be for the
multivariate version.

Thanks for any help given.

Chris

-- 
View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html
Sent from the R help mailing list archive at Nabble.com.




More information about the R-help mailing list