[R] Using plyr::ddply more (memory) efficiently?
Matthew Dowle
mdowle at mdowle.plus.com
Thu Apr 29 15:52:56 CEST 2010
I don't know about that, but try this:
install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
summaries = data.table(summaries)    # convert the data.frame to a data.table
summaries[, sum(counts), by=symbol]  # sum of counts per symbol
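
If you need one row per transcript (keeping the symbol), something along
these lines should mirror your ddply call -- just a sketch, and the exact
by= syntax accepted may vary between data.table versions:

summaries[, list(symbol=symbol[1], counts=sum(counts)), by="transcript"]
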
Please let us know whether that returns the correct result, and whether
its memory/speed is OK?
Matthew
"Steve Lianoglou" <mailinglist.honeypot at gmail.com> wrote in message
news:w2kbbdc7ed01004290606lc425e47cs95b36f6bf0ab3d at mail.gmail.com...
> Hi all,
>
> In short:
>
> I'm running ddply on an admittedly (somewhat) large data.frame (not
> that large). It runs fine until it finishes and gets to the
> "collating" part, where all subsets of my data.frame have been
> summarized and are being reassembled into the final summary
> data.frame (sorry, I don't know the correct plyr terminology). During
> collation, my R workspace RAM usage goes from about 1.5 GB up to 20 GB,
> at which point I kill it.
>
> Running a similar piece of code that iterates manually without ddply,
> using a combination of lapply and a do.call(rbind, ...), uses
> considerably less RAM (it tops out at about 8 GB).
>
> How can I use ddply more efficiently?
>
> Longer:
>
> Here's more info:
>
> * The data.frame itself is ~15.8 MB when loaded.
> * ~400,000 rows, 8 columns
>
> It looks like this:
>
>     exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
> 1         4225        468                 0       utr      0 WASH5P     WASH5P chr1
> 2         4833         69                 0       utr      1 WASH5P     WASH5P chr1
> 3         5659        152                38       utr      1 WASH5P     WASH5P chr1
> 4         6470        159                 0       utr      0 WASH5P     WASH5P chr1
> 5         6721        198                 0       utr      0 WASH5P     WASH5P chr1
> 6         7096        136                 0       utr      0 WASH5P     WASH5P chr1
> 7         7469        137                 0       utr      0 WASH5P     WASH5P chr1
> 8         7778        147                 0       utr      0 WASH5P     WASH5P chr1
> 9         8131         99                 0       utr      0 WASH5P     WASH5P chr1
> 10       14601        154                 0       utr      0 WASH5P     WASH5P chr1
> 11       19184         50                 0       utr      0 WASH5P     WASH5P chr1
> 12        4693        140                36    intron      2 WASH5P     WASH5P chr1
> 13        4902        757                36    intron      1 WASH5P     WASH5P chr1
> 14        5811        659               144    intron     47 WASH5P     WASH5P chr1
> 15        6629         92                21    intron      1 WASH5P     WASH5P chr1
> 16        6919        177                 0    intron      0 WASH5P     WASH5P chr1
> 17        7232        237                35    intron      2 WASH5P     WASH5P chr1
> 18        7606        172                 0    intron      0 WASH5P     WASH5P chr1
> 19        7925        206                 0    intron      0 WASH5P     WASH5P chr1
> 20        8230       6371               109    intron     67 WASH5P     WASH5P chr1
> 21       14755       4429                55    intron     12 WASH5P     WASH5P chr1
> ...
>
> I'm "ply"-ing over the "transcript" column and the function transforms
> each such subset of the data.frame into a new data.frame that is just
> 1 row / transcript that basically has the sum of the "counts" for each
> transcript.
>
> The code would look something like this (`summaries` is the data.frame
> I'm referring to):
>
> rpkm <- ddply(summaries, .(transcript), function(df) {
>   data.frame(symbol=df$symbol[1], counts=sum(df$counts))
> })
>
> (It actually calculates 2 more columns that are returned in the
> data.frame, but I'm not sure that's really important here).
>
> To test some things out, I've written another function to manually
> iterate/create subsets of my data.frame to summarize.
>
> I'm using sqldf to dump the data.frame into a db, then I lapply over
> subsets of the db (`where transcript=x`) to summarize each subset of my
> data into a list of single-row data.frames (like ddply is doing), and
> finish with a `do.call(rbind, the.dfs)` on this list.
>
> This returns the exact same result ddply would return, and by the time
> `do.call` finishes, my RAM usage hits about 8 GB.
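>
> In code, that manual route looks roughly like this (a sketch that skips
> the sqldf layer; the names `per.tx` and `rpkm.manual` are just
> illustrative):
>
> # split by transcript, summarize each piece, then rbind the pieces
> per.tx <- lapply(split(summaries, summaries$transcript), function(df) {
>   data.frame(symbol=df$symbol[1], counts=sum(df$counts))
> })
> rpkm.manual <- do.call(rbind, per.tx)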
>
> So, what am I doing wrong with ddply that makes the RAM usage in the
> last step ("collation" -- the equivalent of my final
> `do.call(rbind, my.dfs)`) more than 12 GB higher?
>
> Thanks,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>