[R] Re gression by groups questions
Chris Friedl
cfriedalek at gmail.com
Thu Jun 18 14:01:05 CEST 2009
pasting reply by Xavier into this thread for the sake of others who might
come across it in their R travels. Thanks Xavier
Hi,
It seems like your multivariate regression works for groups A and B. For
group C, values are calculated using an interaction between x1 and x2. Try
to include it in the regression model to find the right coeffs:
y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1
It works for me. Notice in summary() that the coefficients for the x1:x2
interaction are not significantly different from 0 in groups A and B, which
is what we expected.
HTH,
Xavier
Chris Friedl wrote:
>
> I have a large dataset grouped by a factor and I want to perform a
> regression on each data subset based on this factor. There are many ways
> to do this, posted here and elsewhere. I have tried several. However I
> found one method posted on the R wiki which works exactly as I want, and I
> like the elegance and simplicity of the solution, but I don't understand
> how it works. Its all in the formula specification. Problem is also that I
> want to extend it from a single variable case to a multivariable case and
> haven't really got a clue how to go about it. Hope someone can help.
>
> # The code below creates a data.frame comprising two marginally noisy
> straight lines and performs a
> # regression on each based on the grp factor. This is a single variable
> case based on R wiki one liner.
> # y = x for grp A
> # y = 2 + 5x for grp B
> set.seed(5)
> ind <- as.vector(runif(50))
> d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
> d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50))
> data1 <- rbind(d1,d2)
> fits <- lm(y ~ grp:x + grp - 1, data=data1)
> summary(fits)
> with(data1, plot(x,y))
> abline(coef(fits)[c(1,3)], col="red")
> abline(coef(fits)[c(2,4)], col="blue")
>
> Output:
> Call:
> lm(formula = y ~ grp:x + grp - 1, data = data1)
>
> Residuals:
> Min 1Q Median 3Q Max
> -0.206414 -0.057582 -0.001376 0.062737 0.242111
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> grpA -0.02935 0.02811 -1.044 0.299
> grpB 1.96200 0.02811 69.803 <2e-16 ***
> grpA:x 1.06276 0.04921 21.596 <2e-16 ***
> grpB:x 5.09351 0.04921 103.502 <2e-16 ***
> ---
>
> We can see that the intercepts of are close to 0 and 2 and slopes close to
> 1 and 5 as suggested by the data creation equation. In fact these results
> are identical to running lm individually on each subset of data with
> formula y ~ x. What I don't understand is how this result comes about from
> the formula specification y ~ grp:x + grp -1 . Can someone offer an
> explanation?
>
>
> In addition I want to apply the same idea for a multivariate data set. The
> code below shows a multivariate dataset which I hope to fit by group.
>
> # The code below creates a data.frame comprising three marginally noisy
> planes. I want to perform a
> # regression on each based on the grp factor.
> # y = x1 + x2 for grp A
> # y = 2 + 2x1 + 4x2 for grp B
> # y = 2 + 2x1 + 4x2 + 6x1x2 for grp C
> ind <- matrix(runif(100), ncol=2, dimnames=list(c(), c("x1","x2")))
> d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("A", 100))
> d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("B", 100))
> d3 <- data.frame(ind,
> y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
> 100))
> data2 <- rbind(d1,d2,d3)
> # visualise
> ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found"))
> fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
> summary(fits)
>
> Output:
>
> Call:
> lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)
>
> Residuals:
> Min 1Q Median 3Q Max
> -1.338348 -0.092069 -0.002844 0.094813 1.259094
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> grpA -0.01943 0.08442 -0.230 0.818
> grpB 1.98345 0.08442 23.495 < 2e-16 ***
> grpC 0.51002 0.08442 6.042 4.66e-09 ***
> grpA:x1 1.03376 0.11769 8.784 < 2e-16 ***
> grpB:x1 2.02563 0.11769 17.212 < 2e-16 ***
> grpC:x1 4.83101 0.11769 41.050 < 2e-16 ***
> grpA:x2 1.02371 0.09664 10.593 < 2e-16 ***
> grpB:x2 4.00069 0.09664 41.396 < 2e-16 ***
> grpC:x2 7.22331 0.09664 74.742 < 2e-16 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.2838 on 291 degrees of freedom
> Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972
> F-statistic: 1.199e+04 on 9 and 291 DF, p-value: < 2.2e-16
>
> Whilst there seems to be some correlation between coefficients and terms
> that doesn't apply in all cases. I've tried many other combinations but
> noe that gave me coefficients similar to the defining equations. Bottom
> line is I don't understand what the formula specification should be for
> the multivariate version.
>
> Thanks for any help given.
>
> Chris
>
>
--
View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24091659.html
Sent from the R help mailing list archive at Nabble.com.
More information about the R-help
mailing list