[R] repost: problems with lm for nested fixed-factor Anova (ANOVA I)

Thu Feb 12 18:18:53 CET 2009

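## Name the columns with "=" so that x1 and z1 live inside tmp
## rather than being assigned in the workspace by "<-".
tmp <- data.frame(y=rnorm(15000),
                  x1=factor(sample(48, 15000, replace=TRUE)),
                  z1=factor(sample(242, 15000, replace=TRUE)))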
system.time(
            tmp.aov <- aov(y ~ x1/z1, data=tmp)
            )
## exceeds memory

## Same layout, but z1 has only 5 levels
tmp2 <- data.frame(y=rnorm(15000),
                   x1=factor(sample(48, 15000, replace=TRUE)),
                   z1=factor(sample(5, 15000, replace=TRUE)))
system.time(
            tmp2.aov <- aov(y ~ x1/z1, data=tmp2)
            )
anova(tmp2.aov)
## about 5 seconds

Use data.frames.  They make the code easier to read.
Use aov() instead of lm().  It is the same arithmetic,
but the unneeded (linearly dependent) columns of X are handled more
gracefully.
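
To illustrate the "same arithmetic" point, here is a minimal sketch on a
small made-up nested layout (the data frame d and its sizes are
assumptions for illustration, not the original poster's data).  It only
checks that lm() and aov() return the same fitted values and residual
degrees of freedom, and shows how each reports the redundant columns.

```r
# Small nested layout (hypothetical data): 3 levels of x1, 6 distinct z1 labels
set.seed(1)
d <- data.frame(y  = rnorm(60),
                x1 = factor(rep(1:3, each = 20)),
                z1 = factor(rep(1:6, each = 10)))

fit.lm  <- lm(y ~ x1/z1, data = d)
fit.aov <- aov(y ~ x1/z1, data = d)

# Same least-squares fit from both functions
all.equal(fitted(fit.lm), fitted(fit.aov))
df.residual(fit.lm) == df.residual(fit.aov)

# lm() reports the linearly dependent columns as NA coefficients ...
sum(is.na(coef(fit.lm)))
# ... while aov() presents one line per model term
summary(fit.aov)
```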

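My guess is that your data has 100s of distinct values for z1.
The x1/z1 formula sets up columns of X for every combination of x1 and
z1 levels, including combinations that never occur, so excess space
was allocated.  Distinct values of z1 within each x1 are easier to
understand, but as you see they are costly in computer resources.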
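
A rough way to see the excess allocation is to compare the width of the
design matrix with the number of x1:z1 cells that are actually observed.
The sketch below uses a small made-up layout (4 levels of x1, 12 distinct
z1 labels, 3 per x1); the names d, X and the sizes are assumptions chosen
only to keep the example fast.

```r
# Hypothetical nested data: each z1 label occurs under exactly one x1
set.seed(2)
d <- data.frame(y  = rnorm(120),
                x1 = factor(rep(1:4, each = 30)),
                z1 = factor(rep(1:12, each = 10)))

# Design matrix for the nested formula
X <- model.matrix(~ x1/z1, data = d)

ncol(X)      # columns that are set up for x1/z1
qr(X)$rank   # columns that carry information

# Number of x1:z1 cells that occur in the data
nlevels(interaction(d$x1, d$z1, drop = TRUE))
```

The gap between ncol(X) and the rank grows with the product of the
numbers of levels, which is why hundreds of z1 values become expensive.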

You can force the actual numerical values of the second term to be
distinct across levels of x1 with the interaction() function.  Then
use the simpler model and let the linear dependencies work in your
favor.

system.time(
            tmp.aov <- aov(y ~ x1 + interaction(x1, z1), data=tmp)
)
anova(tmp.aov)
## about 6 seconds
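
As a sanity check that the simpler model gives the same fit, the sketch
below compares the two formulations on a small made-up nested layout
(the data frame d is an assumption for illustration only).  It relies on
aov() dropping the unused levels of interaction(x1, z1) when it builds
the model frame.

```r
# Hypothetical nested data: 4 levels of x1, 12 distinct z1 labels
set.seed(3)
d <- data.frame(y  = rnorm(120),
                x1 = factor(rep(1:4, each = 30)),
                z1 = factor(rep(1:12, each = 10)))

fit.nested <- aov(y ~ x1/z1, data = d)
fit.cells  <- aov(y ~ x1 + interaction(x1, z1), data = d)

# Same fitted values and residual degrees of freedom
all.equal(fitted(fit.nested), fitted(fit.cells))
df.residual(fit.nested) == df.residual(fit.cells)

# The interaction() version needs far fewer columns of X
ncol(model.matrix(fit.nested))
ncol(model.matrix(fit.cells))
```

The saving depends on how many x1:z1 cells are actually observed.  When
z1 is drawn independently of x1, as in the simulated tmp above, most
combinations can occur and far fewer columns are dropped than with
truly nested data.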

Rich