[R] Quickly calculating the mean results over a collection of data sets?

Dan Davison davison at stats.ox.ac.uk
Tue Aug 12 13:04:43 CEST 2008


On Tue, Aug 12, 2008 at 04:47:14AM -0400, Michael R. Head wrote:
> I have a collection of datasets in separate data frames which have 3
> independent test parameters (w, x, y) and one dependent variable (z) ,
> together with some additional static test data on each row. What I want
> is a data frame which contains the test data, the parameters (w, x, y)
> and the mean value of all (z)s in the Z column.
> 
> Each dataset has around 6000 rows and around 7 columns, which doesn't
> seem outrageously large, so it seems like this shouldn't be too time
> consuming, but the way I've been approaching it seems to take way too
> long (20 seconds for datasets over 4 runs, longer for my datasets over
> 10 runs). 
> 
> My imperative-coding brain led me to use for loops, which seem to be
> particularly problematic for R performance. My first attempt at this
> looked like the following, which takes roughly 60 seconds to complete. I
> rewrote it a little, but the code was much longer and effectively
> replaces one of the for loops with an lapply(). I could paste the other
> code, but it's much longer and less clear about its intent.
> 

Hi Michael,

> #######################
> # Start code snippet
> #######################
> ### inputFiles just a list of paths to the test runs
> testRuns <- lapply(inputFiles, 
> 		function(x) {
> 			read.table(x, header=TRUE)})

(By the way, lapply(inputFiles, read.table, header=TRUE) is slightly nicer to look at.)
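
As a quick check that the two forms behave the same, here is a small sketch. The temporary files and their contents are made up for the example; in the original post inputFiles holds the real paths to the test runs.

```r
# Hypothetical stand-in for inputFiles: two small temporary files
inputFiles <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
for (f in inputFiles) {
  write.table(data.frame(W = 1:2, X = 3:4, Y = 5:6, Z = c(0.5, 1.5)),
              file = f, row.names = FALSE)
}

# Wrapper-function form, as in the original post
runsA <- lapply(inputFiles, function(x) read.table(x, header = TRUE))

# Shorter form: extra arguments given to lapply() are passed on to read.table()
runsB <- lapply(inputFiles, read.table, header = TRUE)

# Compare the two lists of data frames
sameResult <- identical(runsA, runsB)

# Clean up the temporary files
unlink(inputFiles)
```

Both calls should read the same data; the shorter one just avoids the anonymous function.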

> 
> ### W, X, Y have (small) natural values
> w <- unique(testRuns[[1]]$W)
> x <- unique(testRuns[[1]]$X)
> y <- unique(testRuns[[1]]$Y)
> 
> ### All runs have the same values for all columns
> ### with the exception of the Z values, so just
> ### copy the first test run data
> testMeans <- data.frame(testRuns[[1]])

How about rbind()ing all the data frames together, and working with
the combined data frame? Say that testRuns is

> testRuns
[[1]]
  W X Y          Z
1 1 5 5 -0.5251156
2 5 1 3  1.1761139
3 2 4 4 -0.8934380
4 5 1 1  1.4076303
5 5 3 1  0.4679745

[[2]]
  W X Y          Z
1 1 5 5 -0.8556862
2 5 1 3  0.3517671
3 2 4 4 -1.0202064
4 5 1 1  1.2152349
5 5 3 1  0.4340249

> allRuns <- do.call("rbind", testRuns)
> aggregate(allRuns$Z, by=allRuns[c("W","X","Y")], mean)
  W X Y          x
1 5 1 1  1.3114326
2 5 3 1  0.4509997
3 5 1 3  0.7639405
4 2 4 4 -0.9568222
5 1 5 5 -0.6904009
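
For anyone who wants to try this, here is a self-contained sketch that rebuilds the two example runs above and goes one step further, attaching the means to the rows of the first run. This is only one possible way to do it. The names params, zMeans, firstRun and byW are invented for the example, and the formula interface to aggregate() is used so that the result column is called Z rather than x.

```r
# Example data copied from the two runs shown above
params <- data.frame(W = c(1, 5, 2, 5, 5),
                     X = c(5, 1, 4, 1, 3),
                     Y = c(5, 3, 4, 1, 1))
testRuns <- list(
  cbind(params, Z = c(-0.5251156, 1.1761139, -0.8934380, 1.4076303, 0.4679745)),
  cbind(params, Z = c(-0.8556862, 0.3517671, -1.0202064, 1.2152349, 0.4340249))
)

# Stack all runs into one data frame
allRuns <- do.call("rbind", testRuns)

# Mean of Z for each (W, X, Y) combination; the result column is named Z
zMeans <- aggregate(Z ~ W + X + Y, data = allRuns, FUN = mean)

# Take the first run without its Z column (any static columns would be kept)
# and attach the mean Z values by matching on the parameters
firstRun <- testRuns[[1]]
testMeans <- merge(firstRun[names(firstRun) != "Z"], zMeans,
                   by = c("W", "X", "Y"))

# Optional: one table of mean Z values per value of W, e.g. for plotting
byW <- split(testMeans, testMeans$W)
```

Note that this takes the mean over all pooled rows, whereas the loop in the original post takes the mean of the per-run means. The two agree when every run has the same number of rows for each (W, X, Y) combination, as in the example data.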

Dan

> for(w0 in w) {
>    for(y0 in y) {
>      for (x0 in x) {
>        row <- which(testMeans$W == w0 &
>                     testMeans$Y == y0 &
>                     testMeans$X == x0)
>        meanValues <- sapply(testRuns,
>                             function(r)
>                             {mean( subset(r,
>                                           r$W == w0 &
>                                           r$Y == y0 &
>                                           r$X == x0)$Z )})
>        testMeans[row,]$Z = mean(meanValues)
>      }
>    }
>  }
> ### I will then want to plot certain values over (X, Z),
> ### so ultimately, I'm going to subset the data further.
> ### Code which gives me a list of W tables with mean Z values
> ### works, too.
> #######################
> # End code snippet
> #######################
> 
> 
> Thanks,
> mike
> 
> -- 
> Michael R. Head <burner at suppressingfire.org>
> http://www.cs.binghamton.edu/~mike/
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
www.stats.ox.ac.uk/~davison


