[R] strata -- really slow performance

Marc Schwartz marc_schwartz at me.com
Mon Jul 13 03:36:48 CEST 2009


On Jul 11, 2009, at 8:42 PM, Jonathan Greenberg wrote:

> I'm a bit confused why the following command is taking an  
> extraordinarily long time (> 2-3 hours) to run on a 3.06 GHz iMac  
> (brand new).  I'm trying to do a stratified random sample, drawing  
> only a single value per UniqueID from the patch_summary data frame:
>
> uniqueids <- unique(patch_summary$UniqueID)
> uniqueids_stratasize <- c(1:length(uniqueids))*0+1
> temp_species_patch_random_strata <- strata(patch_summary,c("UniqueID"),size=uniqueids_stratasize)
>
> The patch_summary data frame is too big to include in this email,  
> but I'll give some summary results for the inputs.  patch_summary  
> has 48253 rows, UniqueID has 661 levels, and uniqueids_stratasize is  
> a vector of 661 "1"s (corresponding to the 661 levels from UniqueID  
> -- as I said I only want one per UniqueID).
>
> Am I doing something silly here?  Am I using the wrong stratified  
> sampling algorithm?  Any help would be appreciated -- thanks -- I  
> don't see why a stratified sampling would take hours to run -- I am  
> about to recode this as a series of subset statements which I'm sure  
> will take only minutes...
>
> --j


I presume that this is the strata() function from the sampling  
package? That was unstated, but seems to be the only strata() function  
that makes sense here.

Also, the construct:

   uniqueids_stratasize <- c(1:length(uniqueids))*0+1

is rather convoluted. You are creating the vector 1:661, multiplying  
by 0 to get 661 0's, then adding 1 to each element to get 661 1's.  
Why not simply use:

   rep(1, length(uniqueids))

?
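To illustrate that the two constructs are interchangeable, here is a quick check (using a stand-in for uniqueids, since I don't have your data):

```r
# Stand-in for unique(patch_summary$UniqueID); only its length matters here
uniqueids <- 1:661

# The original construct: 1:661, times 0, plus 1
v1 <- c(1:length(uniqueids)) * 0 + 1

# The simpler equivalent: state the intent directly
v2 <- rep(1, length(uniqueids))

identical(v1, v2)  # TRUE
```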


I installed the sampling package and ran the strata() function first  
on the examples and then on some test data. The examples ran fine and  
reasonably fast.

The test data that I created (see below) had a basic structure of a  
certain number of unique IDs and a record counter per ID. There are  
75 records per ID (an arbitrary choice, matching the sample data  
below), so the record counter runs from 1:75 for each ID.

When I ran the strata() function on 10 unique IDs (750 records), it  
took about 1 second to execute. However, when I ran the function with  
50 unique IDs (3750 records), it took about 25 seconds. For 100 unique  
IDs (7500 records), it took around 90 seconds.

This suggests to me that the problem you are having is that the  
strata() function does not scale well to large datasets, and may be  
especially sensitive to the number of unique strata. You might want  
to contact the package maintainer to ask whether the function and/or  
package was intended for use with large datasets. My guess is that it  
was not, or at least that it was not tested on large datasets with an  
eye toward optimizing the algorithms.
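For reference, the timing test I described above looks roughly like this (a sketch, not my exact code; it assumes the sampling package is installed and uses its default "srswor" method, with the data already sorted by the stratification variable as strata() expects):

```r
# Sketch of the scaling test: time strata() for increasing numbers of
# unique IDs, with 75 records per ID and one draw per stratum.
# Assumes the 'sampling' package is installed.
library(sampling)

time_strata <- function(n_ids, recs_per_id = 75) {
  d <- data.frame(ID = rep(seq_len(n_ids), each = recs_per_id),
                  RecNum = rep(seq_len(recs_per_id), n_ids))
  unname(system.time(
    strata(d, stratanames = "ID", size = rep(1, n_ids),
           method = "srswor")
  )["elapsed"])
}

# Elapsed seconds for 10, 50 and 100 unique IDs
sapply(c(10, 50, 100), time_strata)
```

The absolute times will differ by machine, but the disproportionate growth as the number of strata increases should be readily apparent.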


I am not clear how large your data frame is in terms of columns, but  
if we create one with the same number of unique IDs and approximately  
the same number of records, then write some specific code to draw one  
randomly selected record from each unique ID, it runs very quickly.


# Create a data frame with 661 unique IDs, 75 records per ID.
# Include a second column with a counter of 1:75 for each ID
DF <- data.frame(ID = rep(1:661, each = 75), RecNum = rep(1:75, 661))


# Randomize the rows, so that the unique IDs are not sorted together
DF <- DF[sample(661 * 75), ]


 > str(DF)
'data.frame':	49575 obs. of  2 variables:
  $ ID    : int  512 587 640 602 148 250 31 487 336 145 ...
  $ RecNum: int  14 23 72 49 39 56 58 17 37 8 ...


# There are 661 unique IDs
 > length(unique(DF$ID))
[1] 661


# There are 75 of each
 > all(table(DF$ID) == 75)
[1] TRUE


# Now, get one randomly selected record per unique ID
 > system.time(DF.new <- do.call(rbind,
                                 lapply(split(DF, DF$ID),
                                        function(x) x[sample(nrow(x), 1), ])))
    user  system elapsed
   0.361   0.110   0.478


# We have 661 records now
 > str(DF.new)
'data.frame':	661 obs. of  2 variables:
  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
  $ RecNum: int  54 3 73 5 6 19 51 7 18 8 ...


# We have 661 unique IDs
 > length(unique(DF.new$ID))
[1] 661


# 1 record per ID
 > all(table(DF.new$ID) == 1)
[1] TRUE



In this simple example, it took less than half a second to generate  
the result, on a 2.93 GHz MacBook Pro.


So, for your data, the code would look something like this:


system.time(DF.new <- do.call(rbind,
                              lapply(split(patch_summary, patch_summary$UniqueID),
                                     function(x) x[sample(nrow(x), 1), ])))


DF.new will contain your subset of data and you will get output  
similar to mine above, with the time it took for the function to  
execute. See ?system.time for more information.

Give that a try and see whether it works, and how fast it executes,  
on your actual data.
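As a side note, if rbind()-ing many small data frames ever becomes the bottleneck (it can, on very wide data frames), a variant that first picks one random row index per ID and then subsets once should behave similarly. Sketched here on the DF example data, not your patch_summary:

```r
# Same example data as above: 661 IDs, 75 records each
DF <- data.frame(ID = rep(1:661, each = 75), RecNum = rep(1:75, 661))

# One random row index per ID; i[sample(length(i), 1)] is used rather
# than sample(i, 1), which misbehaves when a stratum has a single row
idx <- tapply(seq_len(nrow(DF)), DF$ID,
              function(i) i[sample(length(i), 1)])

# A single subset operation instead of many rbind()'s
DF.new <- DF[idx, ]

all(table(DF.new$ID) == 1)  # TRUE: one record per ID
```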

HTH,

Marc Schwartz



