[R] Thoughts for faster indexing
Ben Tupper
btupper at bigelow.org
Thu Nov 21 16:56:14 CET 2013
Hi,
On Nov 21, 2013, at 10:42 AM, "MacQueen, Don" <macqueen1 at llnl.gov> wrote:
> I have some processes where I do the same thing: iterate over subsets of a
> data frame.
> My data frame has ~250,000 rows, 30 variables, and the subsets are such
> that there are about 6000 of them.
>
> Performing a which() statement like yours seems quite fast.
>
> For example, wrapping unix.time() around the which() expression, I get
>
> user system elapsed
> 0.008 0.000 0.008
>
> It's hard for me to imagine the single task of getting the indexes is slow
> enough to be a bottleneck.
>
>
>
> On the other hand, if the variable being used to identify subsets is a
> factor with many levels (~6000 in my case), it is noticeably slower.
>
> user system elapsed
> 0.024 0.002 0.026
>
>
> I haven't tested it, and have no real expectation that it will make a
> difference, but perhaps sorting by the index variable before iterating
> will help (if you haven't already). Since these are not true indexes in
> the sense used by relational database systems, maybe it will make a
> difference.
>
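To reproduce the kind of timing Don describes, here is a minimal sketch. The data frame is synthetic (only the row and ID counts come from his message), the names d, id_factor and person are made up for illustration, and I use system.time(), for which unix.time() was an alias. The numbers will differ from machine to machine.

```r
# Synthetic stand-in: about 250,000 rows and about 6000 distinct IDs.
set.seed(1)
n_rows <- 250000
n_ids  <- 6000
d <- data.frame(
  id = sample(sprintf("id%04d", seq_len(n_ids)), n_rows, replace = TRUE),
  x  = rnorm(n_rows),
  stringsAsFactors = FALSE
)
d$id_factor <- factor(d$id)   # same IDs stored as a factor with many levels
person <- d$id[1]

# Time a single lookup against the character column and the factor column.
t_chr <- system.time(j_chr <- which(d$id == person))
t_fac <- system.time(j_fac <- which(d$id_factor == person))
print(t_chr)
print(t_fac)

# One lookup is cheap, but the loop repeats it once per ID.
# Time 100 IDs and scale up for a rough estimate of the full loop.
some_ids <- head(unique(d$id), 100)
t_loop <- system.time(for (p in some_ids) j <- which(d$id == p))
est_full_loop <- t_loop[["elapsed"]] * n_ids / length(some_ids)
print(est_full_loop)
```

Each which() call scans the whole column, so the total cost grows with (number of IDs) x (number of rows) even when a single call looks negligible.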
You might also want to check out the Performance chapter of Hadley Wickham's Advanced R:
http://adv-r.had.co.nz/Performance.html
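One common way to avoid calling which() once per person is to build all the row indexes in a single pass with split(). This is only a sketch on made-up data (d, value and rows_by_id are hypothetical names, and mean() stands in for the real per-person processing), not a claim about what will be fastest on Noah's data.

```r
# Synthetic data: 20,000 rows spread over 500 IDs.
set.seed(42)
d <- data.frame(
  id    = sample(1:500, 20000, replace = TRUE),
  value = rnorm(20000)
)

# Build the lookup once: a named list with one integer vector of row numbers per ID.
rows_by_id <- split(seq_len(nrow(d)), d$id)

# Per-person processing now needs only a list lookup, not a scan of d$id.
result <- vapply(rows_by_id, function(j) mean(d$value[j]), numeric(1))

# Sanity check: the precomputed rows match what which() returns for one ID.
person  <- d$id[1]
j_which <- which(d$id == person)
j_split <- rows_by_id[[as.character(person)]]
stopifnot(identical(j_which, j_split))
```

The list names are the IDs converted to character, hence the as.character() in the lookup.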
Cheers,
Ben
>
> --
> Don MacQueen
>
> Lawrence Livermore National Laboratory
> 7000 East Ave., L-627
> Livermore, CA 94550
> 925-423-1062
>
>
>
>
>
> On 11/20/13 12:16 PM, "Noah Silverman" <noahsilverman at g.ucla.edu> wrote:
>
>> Hello,
>>
>> I have a fairly large data.frame. (About 150,000 rows of 100
>> variables.) There are case IDs, and multiple entries for each ID, with a
>> date stamp (i.e., records of people's activity).
>>
>>
>> I need to iterate over each person (record ID) in the data set, and then
>> process their data for each date. The processing part is fast, the date
>> part is fast. Locating the records is slow. I've even tried using
>> data.table, with ID set as the index, and it is still slow.
>>
>> The slow line (according to Rprof) is:
>>
>>
>> j <- which( d$id == person )
>>
>> (I then process all the records indexed by j, which seems fast enough.)
>>
>> where d is my data.frame or data.table
>>
>> I thought that using the data.table indexing would speed things up, but
>> not in this case.
>>
>> Any ideas on how to speed this up?
>>
>>
>> Thanks!
>>
>> --
>> Noah Silverman, M.S., C.Phil
>> UCLA Department of Statistics
>> 8117 Math Sciences Building
>> Los Angeles, CA 90095
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org