[R] Grouping data in a data frame: is there an efficient way to do it?

Leo Alekseyev dnquark at gmail.com
Thu Sep 3 02:01:34 CEST 2009


Thanks everyone for the useful suggestions.  The bottleneck might be
the memory limits of my machine (3.2 GHz CPU, 2 GB RAM) and the fact
that I am aggregating on a string field.  Using the suggested
as.data.frame(table(my.df$my.field)) I do get a speedup, but the
computation still takes 30 seconds.  For comparison, I wrote the
"counting up rows with common values" function using a Perl hash
(it's only 5 lines of Perl) and it takes 15 seconds to run -- a
2x speedup.  Not yet sure if it's worth the hassle.
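
For anyone who wants to try this at home, here is a rough sketch of the
table() approach next to one possible variation: convert the string
column to a factor once and tabulate() its integer codes.  The data
frame below is a made-up stand-in (the number of distinct keys is an
assumption; the real my.df is not shown in this thread), so timings on
real data will differ.

```r
# Hypothetical stand-in for my.df: 10^6 rows, one string-valued column.
set.seed(1)
n <- 1e6
my.df <- data.frame(
  my.field = sample(sprintf("key%04d", 1:5000), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# 1. table() on the character column, as suggested in the thread.
t1 <- system.time(
  counts.table <- as.data.frame(table(my.df$my.field),
                                stringsAsFactors = FALSE)
)

# 2. Build the factor once, then count its integer codes with tabulate().
t2 <- system.time({
  f <- factor(my.df$my.field)
  counts.tab <- data.frame(Var1 = levels(f),
                           Freq = tabulate(f, nbins = nlevels(f)),
                           stringsAsFactors = FALSE)
})

print(rbind(table = t1, tabulate = t2))
head(counts.table)
```

Both versions have to turn the strings into a factor at some point, so
the second mainly pays off when the factor can be reused for several
aggregations.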

--Leo

On Wed, Sep 2, 2009 at 4:28 PM, David M
Smith<david at revolution-computing.com> wrote:
> You may want to try using isplit (from the iterators package). Combined with
> foreach, it's an efficient way of iterating through a data frame by groups
> of rows defined by common values of a column (which I think is what you're
> after). You can speed things up further if you have a multiprocessor system
> with the doMC package to run iterations in parallel. There's an example
> here:
> http://blog.revolution-computing.com/2009/08/blockprocessing-a-data-frame-with-isplit.html
> Hope this helps,
> # David Smith
> On Wed, Sep 2, 2009 at 3:39 PM, Leo Alekseyev <dnquark at gmail.com> wrote:
>>
>> I have a data frame with about 10^6 rows; I want to group the data
>> according to entries in one of the columns and do something with it.
>> For instance, suppose I want to count up the number of elements in
>> each group.  I tried something like aggregate(my.df$my.field,
>> list(my.df$my.field), length) but it seems to be very slow.  Likewise,
>> the split() function was slow (I killed it before it completed).  Is
>> there a way to do this efficiently in R?  I am almost
>> tempted to write an external Perl/Python script entering every row
>> into a hashtable keyed by my.field and iterating over the keys...
>> Might this be faster?
>>
>
>
>
> --
> David M Smith <david at revolution-computing.com>
> Director of Community, REvolution Computing www.revolution-computing.com
> Tel: +1 (206) 577-4778 x3203 (San Francisco, USA)
>
> Check out our upcoming events schedule at
> www.revolution-computing.com/events
>
>
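
A minimal sketch of the isplit/foreach idea quoted above, assuming the
iterators and foreach packages are installed.  The data frame df and
its columns g and x are hypothetical toy data, not taken from the
thread or from the linked blog post.

```r
library(iterators)
library(foreach)

# Toy data: g is the grouping column, x is a payload column.
df <- data.frame(g = c("a", "b", "a", "c", "b", "a"),
                 x = 1:6,
                 stringsAsFactors = FALSE)

# isplit() yields one element per group; each element carries the
# subset of rows in $value and the group label(s) in $key.
counts <- foreach(chunk = isplit(df, df$g), .combine = c) %do% {
  n <- nrow(chunk$value)
  names(n) <- as.character(chunk$key[[1]])
  n
}

print(counts)
```

With a parallel backend registered (doMC is the one mentioned above),
%do% could be swapped for %dopar% to process groups in parallel.  For
plain counting the per-group overhead probably outweighs the gain; the
pattern makes more sense when the per-group work is heavy.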



