[BioC] tapply for enormous (>2^31 row) matrices
Benilton Carvalho
beniltoncarvalho at gmail.com
Wed Feb 22 01:23:28 CET 2012
One alternative to Steve's suggestion is to dump the file into an SQL
database (say, SQLite via RSQLite) and do the summarizing there...
Below is the naive solution (with a comparison against an "all in R"
alternative); parallelising the task is left as an exercise.
--benilton
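# Simulate the data and compute the reference result entirely in R.
set.seed(1)
n <- 1e6
tmp <- data.frame(V1 = sample(1000, n, replace = TRUE),
                  V2 = sample(1000, n, replace = TRUE),
                  V3 = sample(1000, n, replace = TRUE),
                  V4 = runif(n))
ref <- with(tmp, aggregate(list(SV4 = V4), by = list(V1 = V1, V2 = V2), sum))

# Write the data to disk and free the memory.
write.table(tmp, file = 'tmp.loc', sep = ' ', col.names = TRUE, row.names = FALSE)
rm(tmp)
gc()

# Load the file into an on-disk SQLite database without reading it into R.
library(sqldf)
cat(file = 'tmploc.db')  # create an empty database file
conn <- read.csv.sql('tmp.loc', sep = ' ', dbname = 'tmploc.db',
                     sql = "CREATE TABLE main AS SELECT * FROM file")
sqldf('SELECT * FROM main LIMIT 10', dbname = 'tmploc.db')

# Summarize in SQL; ORDER BY matches the row order that aggregate() returns.
test <- sqldf('SELECT V1, V2, sum(V4) AS SV4 FROM main
               GROUP BY V2, V1 ORDER BY V2, V1', dbname = 'tmploc.db')
all.equal(ref, test)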
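
As a possible starting point for the parallelising exercise, here is a minimal base-R sketch of the same group-wise sum computed chunk by chunk, so the whole file never has to sit in memory. It is a different technique from the SQL approach above, not a replacement for it. The smaller data set, the temporary file, and the names `chunk_size`, `cols`, `partial` and `res` are assumptions made for illustration only.

```r
# Small stand-in data set (assumption: the real file is far larger).
set.seed(1)
n <- 1e4
tmp <- data.frame(V1 = sample(10, n, replace = TRUE),
                  V2 = sample(10, n, replace = TRUE),
                  V4 = runif(n))
ref <- aggregate(V4 ~ V1 + V2, data = tmp, FUN = sum)

path <- tempfile()
write.table(tmp, file = path, sep = " ", col.names = TRUE, row.names = FALSE)

chunk_size <- 1000  # hypothetical value; tune to the available memory
con <- file(path, open = "r")
cols <- scan(con, what = "", nlines = 1, quiet = TRUE)  # consume the header line

partial <- list()
repeat {
  # read.table() raises an error once no lines remain; treat that as end of file.
  chunk <- tryCatch(read.table(con, nrows = chunk_size, col.names = cols),
                    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break
  # Per-chunk partial sums for each (V1, V2) pair.
  partial[[length(partial) + 1]] <- aggregate(V4 ~ V1 + V2, data = chunk, FUN = sum)
}
close(con)

# Sums are associative, so summing the partial sums gives the overall sums.
res <- aggregate(V4 ~ V1 + V2, data = do.call(rbind, partial), FUN = sum)
all.equal(ref, res)
```

Because each chunk's partial sums are independent of the others, the per-chunk step is the natural piece to hand to separate workers (for example by splitting the file beforehand and processing the pieces with the parallel package). That only works for summaries that can be combined this way, such as sums and counts.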