[R-sig-Geo] [R] strata -- really slow performance

Jonathan Greenberg greenberg at ucdavis.edu
Mon Jul 13 04:04:52 CEST 2009


Marc:

    Thanks!  I wrote a semi-similar function (not as elegant as the one 
you have below) to do the stratified sampling via the sample() function 
instead of the strata() function (from the sampling package) and, yes, 
it is WAY faster.  Seconds instead of hours.  I'll send a note to the 
sampling maintainer to see if they can take a look at what's going on 
with that function.  The data frame does have a lot of columns, but I'm 
confused why that should matter, since strata() is supposed to return 
an index (so, at least in theory, it should only need to look at the 
stratification columns, no?)

    Again, thanks for looking into this -- verification that I'm not 
crazy and that function is behaving oddly was very helpful.
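    For what it's worth, one workaround along those lines (a sketch, 
untested on my data, assuming strata() from the sampling package) is to 
hand strata() only the stratification column and then use the returned 
ID_unit row numbers to subset the full data frame:

```r
library(sampling)

# Pass a one-column data frame so strata() never touches the other
# columns; ID_unit in the result holds the selected row numbers.
uniqueids <- unique(patch_summary$UniqueID)
s <- strata(patch_summary["UniqueID"], c("UniqueID"),
            size = rep(1, length(uniqueids)), method = "srswor")
sampled <- patch_summary[s$ID_unit, ]
```

(I believe strata() expects the data to be sorted by the stratification 
variable, so a sort on UniqueID may be needed first.)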

--j

P.S. "Also, the construct:

  uniqueids_stratasize <- c(1:length(uniqueids))*0+1

is rather strange. You are essentially creating a vector of 661 0's, 
then adding 1 to each element to get 661 1's. Why not use:

  rep(1, length(uniqueids)) "

--> Learn something new every day!  Wasn't familiar with the rep() 
function, so I got my vector of 1s the only way I currently knew how!
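A quick check that the two constructs really do produce the same vector:

```r
n <- 661
# The roundabout version: integers 1:n, zeroed out, then shifted to 1
identical(c(1:n) * 0 + 1, rep(1, n))  # TRUE -- both are n ones
```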

Marc Schwartz wrote:
>
> On Jul 11, 2009, at 8:42 PM, Jonathan Greenberg wrote:
>
>> I'm a bit confused why the following command is taking an 
>> extraordinarily long time (> 2-3 hours) to run on a 3.06 GHz iMac 
>> (brand new).  I'm trying to do a stratified random sample, drawing 
>> only a single value per UniqueID from the patch_summary data frame:
>>
>> uniqueids <- unique(patch_summary$UniqueID)
>> uniqueids_stratasize <- c(1:length(uniqueids))*0+1
>> temp_species_patch_random_strata <- 
>> strata(patch_summary,c("UniqueID"),size=uniqueids_stratasize)
>>
>> The patch_summary data frame is too big to include in this email, but 
>> I'll give some summary results for the inputs.  patch_summary has 
>> 48253 rows, UniqueID has 661 levels, and uniqueids_stratasize is a 
>> vector of 661 "1"s (corresponding to the 661 levels of UniqueID -- 
>> as I said, I only want one per UniqueID).
>>
>> Am I doing something silly here?  Am I using the wrong stratified 
>> sampling algorithm?  Any help would be appreciated -- thanks -- I 
>> don't see why a stratified sampling would take hours to run -- I am 
>> about to recode this as a series of subset statements which I'm sure 
>> will take only minutes...
>>
>> --j
>
>
> I presume that this is the strata() function from the sampling 
> package? That was unstated, but seems to be the only strata() function 
> that makes sense here.
>
> Also, the construct:
>
>   uniqueids_stratasize <- c(1:length(uniqueids))*0+1
>
> is rather strange. You are essentially creating a vector of 661 0's, 
> then adding 1 to each element to get 661 1's. Why not use:
>
>   rep(1, length(uniqueids))
>
> ?
>
>
> I installed the sampling package and ran the strata() function first 
> on the examples and then on some test data. The examples ran fine and 
> reasonably fast.
>
> The test data that I created (see below) had a basic structure of a 
> certain number of unique IDs and then just a record counter per ID. 
> There are 75 records per ID, just to pick the number that I used in 
> the sample data below. So the record counter runs from 1:75 for each ID.
>
> When I ran the strata() function on 10 unique IDs (750 records), it 
> took about 1 second to execute. However, when I ran the function with 
> 50 unique IDs (3750 records), it took about 25 seconds. For 100 unique 
> IDs (7500 records), it took around 90 seconds.
>
> This suggests to me that the problem that you are having is that the 
> strata() function does not scale well at all to large datasets, 
> perhaps being more sensitive to the number of unique strata. You might 
> want to contact the package maintainer to see if they intended the 
> function and/or package to be used with large datasets or not. I am 
> going to reasonably guess that they did not or at least, did not test 
> it on large datasets to optimize their algorithms.
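>
> A sketch of that timing test (the exact sizes are illustrative; it 
> assumes strata() from the sampling package with method "srswor"):
>
> ```r
> library(sampling)
>
> # Time strata() for increasing numbers of unique IDs, 75 rows per ID
> for (n_ids in c(10, 50, 100)) {
>   test <- data.frame(ID = rep(seq_len(n_ids), each = 75),
>                      RecNum = rep(1:75, n_ids))
>   t <- system.time(strata(test, c("ID"), size = rep(1, n_ids),
>                           method = "srswor"))
>   cat(n_ids, "IDs:", t["elapsed"], "seconds\n")
> }
> ```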
>
>
> I am not clear how large your data frame is in terms of columns, but 
> if we create one to at least have the same number of unique IDs and 
> approximately the same number of records and create some specific code 
> to get one randomly selected record from each unique ID, it runs very 
> quickly.
>
>
> # Create a data frame with 661 unique IDs, 75 records per ID.
> # Include a second column with a counter of 1:75 for each ID
> DF <- data.frame(ID = rep(1:661, each = 75), RecNum = rep(1:75, 661))
>
>
> # Randomize the rows, so that the unique IDs are not sorted together
> DF <- DF[sample(661 * 75), ]
>
>
> > str(DF)
> 'data.frame':    49575 obs. of  2 variables:
>  $ ID    : int  512 587 640 602 148 250 31 487 336 145 ...
>  $ RecNum: int  14 23 72 49 39 56 58 17 37 8 ...
>
>
> # There are 661 unique IDs
> > length(unique(DF$ID))
> [1] 661
>
>
> # There are 75 of each
> > all(table(DF$ID) == 75)
> [1] TRUE
>
>
> # Now, get one randomly selected record per unique ID
> > system.time(DF.new <- do.call(rbind,
>                                 lapply(split(DF, DF$ID),
>                                        function(x) x[sample(nrow(x), 
> 1), ])))
>    user  system elapsed
>   0.361   0.110   0.478
>
>
> # We have 661 records now
> > str(DF.new)
> 'data.frame':    661 obs. of  2 variables:
>  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
>  $ RecNum: int  54 3 73 5 6 19 51 7 18 8 ...
>
>
> # We have 661 unique IDs
> > length(unique(DF.new$ID))
> [1] 661
>
>
> # 1 record per ID
> > all(table(DF.new$ID) == 1)
> [1] TRUE
>
>
>
> In this simple example, it took less than half a second to generate 
> the result. That is on a 2.93 GHz MacBook Pro.
>
>
> So, for your data, the code would look something like this:
>
>
> system.time(DF.new <- do.call(rbind,
>                               lapply(split(patch_summary, 
> patch_summary$UniqueID),
>                                      function(x) x[sample(nrow(x), 1), 
> ])))
>
>
> DF.new will contain your subset of data and you will get output 
> similar to mine above, with the time it took for the function to 
> execute. See ?system.time for more information.
>
> Try that and see if that works and how fast it executes on your actual 
> data.
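>
> A further alternative (a sketch, not timed on your data) that avoids 
> rbind()-ing hundreds of small data frames: draw one random row index 
> per UniqueID and subset the data frame once. Note the 
> i[sample(length(i), 1)] idiom -- sample(i, 1) on a length-one i would 
> sample from 1:i instead of from i itself.
>
> ```r
> # One random row index per stratum, then a single subset
> idx <- tapply(seq_len(nrow(patch_summary)), patch_summary$UniqueID,
>               function(i) i[sample(length(i), 1)])
> DF.new <- patch_summary[idx, ]
> ```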
>
> HTH,
>
> Marc Schwartz
>

-- 

Jonathan A. Greenberg, PhD
Postdoctoral Scholar
Center for Spatial Technologies and Remote Sensing (CSTARS)
University of California, Davis
One Shields Avenue
The Barn, Room 250N
Davis, CA 95616
Cell: 415-794-5043
AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307
