[BioC] tapply for enormous (>2^31 row) matrices

Benilton Carvalho beniltoncarvalho at gmail.com
Wed Feb 22 01:23:28 CET 2012


One alternative to Steve's suggestion is to dump the file into an SQL
database (say, via RSQLite) and summarize through that.

Below is the naive solution, with a comparison against an "all in R"
alternative. Parallelising the task is left as an exercise.

--benilton


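## Simulate the data and compute the reference result entirely in R
set.seed(1)
n <- 1e6
tmp <- data.frame(V1=sample(1000, n, replace=TRUE),
                  V2=sample(1000, n, replace=TRUE),
                  V3=sample(1000, n, replace=TRUE),
                  V4=runif(n))
ref <- with(tmp, aggregate(list(SV4=V4), by=list(V1=V1, V2=V2), sum))
## Write the data to disk, then free the in-memory copy
write.table(tmp, file='tmp.loc', sep=' ', col.names=TRUE, row.names=FALSE)
rm(tmp)
gc()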

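library(sqldf)
## Create an empty file to hold the SQLite database
cat(file='tmploc.db')
## Load the text file into the table 'main' without reading it into R
conn <- read.csv.sql('tmp.loc', sep=' ', dbname='tmploc.db',
                     sql="CREATE TABLE main AS SELECT * FROM file")

## Peek at the first rows
sqldf('SELECT * FROM main LIMIT 10', dbname='tmploc.db')

## Per-group sums, computed by SQLite
test <- sqldf('SELECT V1, V2, sum(V4) as SV4 FROM main GROUP BY V2, V1',
              dbname='tmploc.db')

## Compare with the all-in-R reference
all.equal(ref, test)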
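
The same idea can also be sketched with DBI and RSQLite directly,
appending the file to the database in chunks so that only one chunk is
held in R at a time. This is a minimal sketch rather than a tuned
solution: the chunk size, the temporary file names and the small
synthetic input are assumptions made for illustration, and it assumes
the DBI and RSQLite packages are installed.

```r
library(DBI)  # assumes DBI and RSQLite are installed

# Small synthetic input so the sketch is self-contained (sizes are illustrative).
set.seed(1)
n <- 1e4
tmp <- data.frame(V1 = sample(10, n, replace = TRUE),
                  V2 = sample(10, n, replace = TRUE),
                  V4 = runif(n))
infile <- tempfile(fileext = ".txt")
write.table(tmp, file = infile, sep = " ", quote = FALSE, row.names = FALSE)

con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".db"))

# Read the header once, then keep the connection open so that each
# read.table() call continues where the previous one stopped.
input <- file(infile, open = "r")
cols <- strsplit(readLines(input, n = 1), " ")[[1]]

chunk_rows <- 2500  # hypothetical chunk size; tune to the available memory
repeat {
  # read.table() raises an error once no lines are left; that is used here
  # as the stop signal (note that it would also hide other read errors).
  chunk <- tryCatch(read.table(input, nrows = chunk_rows, col.names = cols),
                    error = function(e) NULL)
  if (is.null(chunk)) break
  dbWriteTable(con, "main", chunk, append = TRUE)
}
close(input)

# Per-group sums computed inside SQLite, as in the sqldf version.
res <- dbGetQuery(con,
  "SELECT V1, V2, SUM(V4) AS SV4 FROM main GROUP BY V2, V1 ORDER BY V2, V1")
dbDisconnect(con)

# All-in-R reference on the same data, for comparison.
ref <- aggregate(list(SV4 = tmp$V4), by = list(V1 = tmp$V1, V2 = tmp$V2), sum)
all.equal(ref$SV4[order(ref$V1, ref$V2)], res$SV4[order(res$V1, res$V2)])
```

With a real file, the synthetic-data lines would be dropped and infile
pointed at the existing file; the loop itself does not depend on how
many rows the file has.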


