[BioC] tapply for enormous (>2^31 row) matrices

Wed Feb 22 01:23:28 CET 2012

One alternative to Steve's suggestion is to load the file into an SQL
database (say, SQLite via RSQLite) and summarize it there.

Below is the naive solution (with a comparison against an "all in R"
alternative); parallelising the task is left as an exercise.

--benilton

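## simulate the data and compute the reference result entirely in R
set.seed(1)
n <- 1e6
tmp <- data.frame(V1=sample(1000, n, replace=TRUE),
                  V2=sample(1000, n, replace=TRUE),
                  V3=sample(1000, n, replace=TRUE),
                  V4=runif(n))
ref <- with(tmp, aggregate(list(SV4=V4), by=list(V1=V1, V2=V2), sum))

## dump the data to a space-delimited file and free the memory
write.table(tmp, file='tmp.loc', sep=' ', col.names=TRUE, row.names=FALSE)
rm(tmp)
gc()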

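library(sqldf)
## create an empty file to hold the SQLite database
cat(file='tmploc.db')
## import the text file into table 'main' of that database;
## the return value is not used afterwards
conn <- read.csv.sql('tmp.loc', sep=' ', dbname='tmploc.db',
                     sql="CREATE TABLE main AS SELECT * FROM file")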

sqldf('SELECT * FROM main LIMIT 10', dbname='tmploc.db')

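## ORDER BY makes the row order explicit (V1 varying fastest within V2),
## which is the order aggregate() uses for 'ref'
test <- sqldf('SELECT V1, V2, SUM(V4) AS SV4 FROM main
               GROUP BY V2, V1 ORDER BY V2, V1',
              dbname='tmploc.db')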

all.equal(ref, test)
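
The same summary can also be run through DBI/RSQLite directly, without
sqldf. What follows is only a minimal sketch under some assumptions: it
uses a small demo data frame ('small', a made-up name) and an in-memory
database instead of the file above, and it reuses the table name 'main'
purely for illustration.

```r
library(DBI)
library(RSQLite)

## small demo data (hypothetical stand-in for the big file)
set.seed(1)
n <- 1e4
small <- data.frame(V1 = sample(10, n, replace = TRUE),
                    V2 = sample(10, n, replace = TRUE),
                    V4 = runif(n))

## in-memory SQLite database; a file name could be given instead
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "main", small)

## grouped sum, ordered with V1 varying fastest within V2
res <- dbGetQuery(con, "SELECT V1, V2, SUM(V4) AS SV4 FROM main
                        GROUP BY V2, V1 ORDER BY V2, V1")
dbDisconnect(con)

## all-in-R reference for the same demo data
ref_small <- with(small, aggregate(list(SV4 = V4),
                                   by = list(V1 = V1, V2 = V2), sum))
all.equal(ref_small$SV4, res$SV4)
```

For a file that does not fit in memory, one option would be to read it in
pieces and call dbWriteTable(..., append = TRUE) on each piece, so that only
the database ever holds the full table.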