[BioC] tapply for enormous (>2^31 row) matrices

Benilton Carvalho beniltoncarvalho at gmail.com
Wed Feb 22 01:25:46 CET 2012


Erratum: that should read "and the *extension of* parallelising the task is left as an exercise."

On 22 February 2012 00:23, Benilton Carvalho <beniltoncarvalho at gmail.com> wrote:
> One alternative to Steve's suggestion is to load the file into an SQL
> database (say, SQLite via RSQLite) and do the summary there.
>
> Below is the naive solution (and a comparison with an "all in R"
> alternative); the exercise on parallelising the task is left as an
> exercise.
>
> --benilton
>
>
> set.seed(1)
> n <- 1e6
> tmp <- data.frame(V1=sample(1000, n, replace=TRUE),
>                  V2=sample(1000, n, replace=TRUE),
>                  V3=sample(1000, n, replace=TRUE),
>                  V4=runif(n))
> ref <- with(tmp, aggregate(list(SV4=V4), by=list(V1=V1, V2=V2), sum))
> write.table(tmp, file='tmp.loc', sep=' ', col.names=TRUE, row.names=FALSE)
> rm(tmp)
> gc()
>
> library(sqldf)
> cat(file='tmploc.db')
> conn <- read.csv.sql('tmp.loc', sep=' ', dbname='tmploc.db',
>                     sql="CREATE TABLE main AS SELECT * FROM file")
>
> sqldf('SELECT * FROM main LIMIT 10', dbname='tmploc.db')
>
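> # ORDER BY makes the row order explicit so it matches aggregate()'s
> # output (sorted by V2, then V1) instead of relying on GROUP BY
> test <- sqldf('SELECT V1, V2, SUM(V4) AS SV4 FROM main
>                GROUP BY V2, V1 ORDER BY V2, V1', dbname='tmploc.db')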
>
> all.equal(ref, test)
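
For the "all in R" side, here is a rough sketch of how the same summary could be done without holding the whole file in memory. It relies on sums being additive, so per-chunk partial sums can simply be re-aggregated. The function name chunked_group_sum and its chunk_rows argument are made up for illustration, and the column layout (V1..V4, space-separated, with a header) is assumed to match the file written above. Base R only, and not tuned for speed.

```r
# Hypothetical helper (not from any package): sum V4 by (V1, V2) over a
# space-separated file with a header line, reading chunk_rows rows at a time.
chunked_group_sum <- function(path, chunk_rows = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Consume the header line; scan() strips the quotes write.table() adds
  header <- scan(text = readLines(con, n = 1), what = character(), quiet = TRUE)
  partial <- NULL
  repeat {
    # read.table() raises an error once no lines remain; this sketch treats
    # any read error as end of input
    chunk <- tryCatch(
      read.table(con, nrows = chunk_rows, col.names = header,
                 colClasses = c("integer", "integer", "integer", "numeric")),
      error = function(e) NULL)
    if (is.null(chunk) || nrow(chunk) == 0) break
    # Fold this chunk's rows into the running partial sums
    partial <- aggregate(V4 ~ V1 + V2,
                         data = rbind(partial, chunk[, c("V1", "V2", "V4")]),
                         FUN = sum)
  }
  names(partial)[names(partial) == "V4"] <- "SV4"
  partial
}

# Small self-contained check against the in-memory aggregate
set.seed(1)
n <- 1e4
demo <- data.frame(V1 = sample(10, n, replace = TRUE),
                   V2 = sample(10, n, replace = TRUE),
                   V3 = sample(10, n, replace = TRUE),
                   V4 = runif(n))
path <- tempfile(fileext = ".loc")
write.table(demo, file = path, sep = " ", col.names = TRUE, row.names = FALSE)

ref <- with(demo, aggregate(list(SV4 = V4), by = list(V1 = V1, V2 = V2), sum))
res <- chunked_group_sum(path, chunk_rows = 1000)
all.equal(ref, res)
```

Because each chunk is summarised independently before being combined, the per-chunk step could in principle be handed to separate workers, which is one way to start on the parallelising exercise.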


