[R] Seeking help for automating regression (over columns) and storing selected output
Gabor Grothendieck
ggrothendieck at myway.com
Sat Apr 3 18:45:11 CEST 2004
Note that there is a QUESTION at the end regarding
random effects.
Suppose your data frame is df and has components
y, x1, x2, x3 and u where u is a factor.
1. A problem about doing repeated regressions was posted last
month (search for "Operating on windows of data"); it has
similarities to this one.
Making use of those ideas, the first sapply below loops
over the y~xi regressions and the next two (nested) loop over
the usergroup-specific regressions. We just rbind
them all together:
xvars <- c("x1", "x2", "x3")
rbind(
  sapply( xvars, function(xi) coef( lm(y ~ df[,xi], data=df))[[2]] ),
  sapply( xvars, function(xi)
    sapply( levels(df$u), function(ulev)
      coef(lm(y ~ df[,xi], subset=u==ulev, data=df))[[2]]
    )
  )
)
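A variation on the same idea, as a minimal sketch: it builds each
formula with reformulate() instead of indexing df inside the formula,
and labels the first row. The helper name slope and its rows argument
are made up for illustration; the data are the state data from the
P.S. of this message.

```r
# Example data from the P.S. (built-in state data sets)
df <- data.frame(y = state.x77[, 4], x1 = state.x77[, 1],
                 x2 = state.x77[, 2], x3 = state.x77[, 3],
                 u = factor(state.region))
xvars <- c("x1", "x2", "x3")

# slope() is a hypothetical helper: slope of y on one x variable,
# fitted on the rows selected by the logical vector 'rows'
slope <- function(xi, rows = rep(TRUE, nrow(df))) {
  fit <- lm(reformulate(xi, response = "y"), data = df[rows, ])
  coef(fit)[[2]]
}

res <- rbind(
  All = sapply(xvars, slope),                      # total sample
  t(sapply(levels(df$u), function(ulev)            # one row per level of u
    sapply(xvars, slope, rows = df$u == ulev)))
)
res
```

The result has one column per x variable and one row for the total
sample plus one row per level of u.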
2. Another possibility is to create a giant regression that does
all the usergroup specific regressions at once and then repeat
it without the usergroup variable to get the rest.
df2 is a new data frame that strings out all the x variables into
a single long column and adds a new factor i that identifies
which x variable it is. y and u are repeated three times to bring
them into line with x.
xvars <- c("x1", "x2", "x3")
xm <- as.matrix(df[,xvars])
df2 <- data.frame(y=rep(df$y,3), x = c(xm), i=factor(c(col(xm))), u=rep(df$u,3))
# We could alternatively have used reshape like this:
# df2 <- reshape(df,timevar="i",times=factor(1:3),
# varying=list(xvars),direction="long",v.names="x")
# The slopes by usergroup and across all usergroups are:
coef.u <- coef(lm(y ~ i/u/x, data=df2))
coef.all <- coef(lm(y ~ i/x, data=df2))
# Pick off the slopes (they are at the end of each coef vector) and reform.
# Within coef.u the x variable (i) varies fastest, so fill the matrix by row:
z <- matrix( c( matrix( coef.all, ncol=2)[,2], matrix( coef.u, ncol=2)[,2] ),
  ncol=3, byrow=TRUE)
colnames(z) <- xvars
rownames(z) <- c("All", levels(df$u))
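As a cross-check on the giant regression, here is a small sketch that
picks one slope out by coefficient name rather than by position and
compares it with the corresponding separate regression. It assumes the
state data from the P.S.; the names slopes.u, direct and pooled are
only for illustration.

```r
# Example data from the P.S. (built-in state data sets)
df <- data.frame(y = state.x77[, 4], x1 = state.x77[, 1],
                 x2 = state.x77[, 2], x3 = state.x77[, 3],
                 u = factor(state.region))
xvars <- c("x1", "x2", "x3")
xm <- as.matrix(df[, xvars])
df2 <- data.frame(y = rep(df$y, 3), x = c(xm),
                  i = factor(c(col(xm))), u = rep(df$u, 3))

coef.u <- coef(lm(y ~ i/u/x, data = df2))

# The slope coefficients are the ones whose names end in ":x",
# e.g. "i2:uSouth:x" is the slope of y on x2 within the South region
slopes.u <- coef.u[grepl(":x$", names(coef.u))]
pooled <- slopes.u[["i2:uSouth:x"]]

# The same slope from a separate regression on that subset
direct <- coef(lm(y ~ x2, data = df, subset = u == "South"))[[2]]

all.equal(direct, pooled)
```

Selecting by name avoids depending on the order in which lm returns
the interaction terms.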
3. Note that the giant regression approach works only as long as you are
interested in the coefficients alone. If you were interested in the
variances (for example, the standard errors of the slopes) it would not work,
since each of the two giant regressions uses a single pooled estimate of the
residual variance.
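To illustrate the point about pooled variance, the following sketch
compares the slope for x1 from the giant regression with the one from
a separate regression. It assumes the state data from the P.S.; the
object names fit.giant, fit.x1, giant and alone are only for
illustration.

```r
# Example data from the P.S. (built-in state data sets)
df <- data.frame(y = state.x77[, 4], x1 = state.x77[, 1],
                 x2 = state.x77[, 2], x3 = state.x77[, 3],
                 u = factor(state.region))
xm <- as.matrix(df[, c("x1", "x2", "x3")])
df2 <- data.frame(y = rep(df$y, 3), x = c(xm), i = factor(c(col(xm))))

fit.giant <- lm(y ~ i/x, data = df2)  # one residual variance for all three x's
fit.x1    <- lm(y ~ x1, data = df)    # residual variance for x1 alone

giant <- summary(fit.giant)$coefficients["i1:x", c("Estimate", "Std. Error")]
alone <- summary(fit.x1)$coefficients["x1", c("Estimate", "Std. Error")]

# Same estimate, different standard error
rbind(giant, alone)

# The standard errors differ by the ratio of the two residual sigmas
c(sigma.giant = summary(fit.giant)$sigma,
  sigma.alone = summary(fit.x1)$sigma)
```

The estimates agree, while the standard errors differ because the
giant fit estimates one residual variance from all three regressions
combined.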
QUESTION: As a matter of interest, could someone who is familiar with random
effects models show what the corresponding giant model is, with separate
variances for each regression?
P.S. I tried the above out on the following which is similar
to the original problem except there are 4 levels in u:
data(state)
x <- state.x77[,1:3]
u <- state.region
y <- state.x77[,4]
df <- data.frame(y=y, x1=x[,1], x2=x[,2], x3=x[,3], u=factor(u))
Greg Blevins <gblevins <at> mn.rr.com> writes:
:
: Hello,
:
: I have spent considerable time trying to figure out that which I am about to
: describe. This included searching Help, consulting my various R books, and
: trial and (always) error. I have been assuming I would need to use a loop
: (looping over columns) but perhaps an apply function would do the trick. I
: have unsuccessfully tried both.
:
: A scaled down version of my situation is as follows:
:
: I have a dataframe as follows:
:
: ID Y x1 x2 x3 usergroup.
:
: Y is a continuous criterion, x1-x3 continuous predictors, and usergroup is
: coded as 1, 2 or 3 to indicate user status.
:
: My end goal is a (dataframe or matrix) with just the regression coef from
: each of 12 runs (each x regressed separately on Y for the total sample and
: for each usergroup). I envision output as follows, a three column by four
: row dataframe or matrix.
:
: Y and x1; Y and x2; Y and x3.
: Total sample:
: usergroup 1:
: usergroup 2: (Regression Coefs fill the matrix)
: usergroup 3:
:
: Using 1.8.1
: Windows 2000 and XP
:
: Help would be most appreciated.
:
: Greg Blevins, Partner
: The Market Solutions Group