[R] Is there a fast way to do several hundred thousand ANOVA tests?

Charles C. Berry cberry at tajo.ucsd.edu
Mon Aug 24 05:03:25 CEST 2009


On Mon, 24 Aug 2009, big permie wrote:

> Dear R users,
>
> I have a matrix a and a classification vector b such that
>
>> str(a)
> num [1:50, 1:800000]
> and
>> str(b)
> Factor w/ 3 levels "cond1","cond2","cond3"
>
> I'd like to do an anova on all 800000 columns and record the F statistic for
> each test; I currently do this using
>
> f.stat.vec <- numeric(length(a[1,])
>
> for (i in 1:length(a[1,]) {
>  f.test.frame <- data.frame(nums = a[,i], cond = b)
>  aov.vox <- aov(nums ~ cond, data = f.test.frame)
>  f.stat <- summary(aov.vox)[[1]][1,4]
>  f.stat.vec[i] <- f.stat
> }
>
> The problem is that this code takes about 70 minutes to run.

Using lsfit(), my five year old windows XP PC does 100k columns in about 
40 seconds, so I reckon that in 5 minutes it could do 800000:

> x <- factor(sample(1:3,50,repl=T))
> x.mat <- model.matrix( ~x )[, 2:3 ] # drop intercept here
> y <- matrix(rnorm(50*100000),nc=100000)
> system.time({fit <- lsfit( x.mat, y );ls.pr <- ls.print( fit,  print.it=FALSE )})
    user  system elapsed
   39.16    0.58   39.79
>
> # check F-statistic:
> summary(as.numeric( ls.pr[[ 'summary' ]][, "F-value" ]))
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.0000  0.2932  0.7095  1.0460  1.4270 17.1100
>
> # theoretical 1st Qu., Median, 3rd Qu match
>
> qf(c(.25,0.5, 0.75 ), 2, 47 )
[1] 0.2894502 0.7034708 1.4280000
>


If it needed to be faster, I would take the part out of ls.print() that 
does just the F-statistic and work on it.

Of course, a newer computer would be a lot faster, too.

HTH,

Chuck

>
> Is there a faster way to do an anova & record the F stat for each column?
>
> Any help would be appreciated.
>
> Thanks
> Heath
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901




More information about the R-help mailing list