[R] Thoughts for faster indexing

Thu Nov 21 16:56:14 CET 2013

Hi,

On Nov 21, 2013, at 10:42 AM, "MacQueen, Don" <macqueen1 at llnl.gov> wrote:

> I have some processes where I do the same thing, iterate over subsets of a
> data frame.
> My data frame has ~250,000 rows, 30 variables, and the subsets are such
> that there are about 6000 of them.
> 
> Performing a which() statement like yours seems quite fast.
> 
> For example, wrapping unix.time() around the which() expression, I get
> 
>   user  system elapsed
>  0.008   0.000   0.008
> 
> It's hard for me to imagine the single task of getting the indexes is slow
> enough to be a bottleneck.
> 
> 
> 
> On the other hand, if the variable being used to identify subsets is a
> factor with many levels (~6000 in my case), it is noticeably slower.
> 
>   user  system elapsed
>  0.024   0.002   0.026
> 
> 
> I haven't tested it, and have no real expectation that it will make a
> difference, but perhaps sorting by the index variable before iterating
> will help (if you haven't already). Since these are not true indexes in
> the sense used by relational database systems, maybe it will make a
> difference.
> 

You might also want to check this out…

http://adv-r.had.co.nz/Performance.html
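
If the repeated which() calls really are the bottleneck, one option is to compute all of the row indexes in a single pass before the loop. Below is a minimal base-R sketch of that idea. The data frame is made up (only the names d and id come from the original post; the value column and the mean() step are placeholders for the real data and processing), so treat it as an illustration rather than a drop-in fix.

```r
# Hypothetical stand-in for the poster's data: an id column plus one value column.
set.seed(1)
d <- data.frame(id = sample(1:6000, 150000, replace = TRUE),
                value = rnorm(150000))

# Build the id -> row-number lookup once, instead of calling
# which(d$id == person) inside the loop for every person.
rows_by_id <- split(seq_len(nrow(d)), d$id)

# Per-person processing then indexes with the precomputed row numbers.
# mean() is only a placeholder for the real processing step.
result <- vapply(rows_by_id, function(j) mean(d$value[j]), numeric(1))

# Spot check: the lookup agrees with which() for one id.
person <- d$id[1]
stopifnot(identical(rows_by_id[[as.character(person)]], which(d$id == person)))
```

Each which(d$id == person) call scans the whole id column, so a loop over ~6000 ids scans it ~6000 times. split() groups the row numbers in one pass, and the loop body then only does a list lookup by name.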

Cheers,
Ben

> 
> -- 
> Don MacQueen
> 
> Lawrence Livermore National Laboratory
> 7000 East Ave., L-627
> Livermore, CA 94550
> 925-423-1062
> 
> 
> 
> 
> 
> On 11/20/13 12:16 PM, "Noah Silverman" <noahsilverman at g.ucla.edu> wrote:
> 
>> Hello,
>> 
>> I have a fairly large data.frame.  (About 150,000 rows of 100
>> variables.) There are case IDs, and multiple entries for each ID, with a
>> date stamp.  (i.e., records of people's activity.)
>> 
>> 
>> I need to iterate over each person (record ID) in the data set, and then
>> process their data for each date.  The processing part is fast, the date
>> part is fast.  Locating the records is slow.  I've even tried using
>> data.table, with ID set as the index, and it is still slow.
>> 
>> The line with the slow process (according to Rprof) is:
>> 
>> 
>> j <- which( d$id == person )
>> 
>> (I then process all the records indexed by j, which seems fast enough.)
>> 
>> where d is my data.frame or data.table
>> 
>> I thought that using the data.table indexing would speed things up, but
>> not in this case.
>> 
>> Any ideas on how to speed this up?
>> 
>> 
>> Thanks!
>> 
>> -- 
>> Noah Silverman, M.S., C.Phil
>> UCLA Department of Statistics
>> 8117 Math Sciences Building
>> Los Angeles, CA 90095
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org