[R] strata -- really slow performance
Marc Schwartz
marc_schwartz at me.com
Mon Jul 13 03:36:48 CEST 2009
On Jul 11, 2009, at 8:42 PM, Jonathan Greenberg wrote:
> I'm a bit confused why the following command is taking an
> extrodinarily long time (> 2-3 hours) to run on an 3.06ghz iMac
> (brand new). I'm trying to do a stratified random sample, drawing
> only a single value per UniqueID from the patch_summary data frame:
>
> uniqueids <- unique(patch_summary$UniqueID)
> uniqueids_stratasize <- c(1:length(uniqueids))*0+1
> temp_species_patch_random_strata <-
> strata(patch_summary,c("UniqueID"),size=uniqueids_stratasize)
>
> The patch_summary data frame is too big to include in this email,
> but I'll give some summary results for the inputs. patch_summary
> has 48253 rows, UniqueID has 661 levels, and uniqueid_stratasize is
> a vector if 661 "1"s (corresponding to the 661 levels from UniqueID
> -- as I said I only want one per UniqueID).
>
> Am I doing something silly here? Am I using the wrong stratified
> sampling algorithm? Any help would be appreciated -- thanks -- I
> don't see why a stratified sampling would take hours to run -- I am
> about to recode this as a series of subset statements which I'm sure
> will take only minutes...
>
> --j
I presume that this is the strata() function from the sampling
package? That was unstated, but seems to be the only strata() function
that makes sense here.
Also, the construct:
uniqueids_stratasize <- c(1:length(uniqueids))*0+1
is rather strange. You are essentially creating a vector of 661 0's,
then adding 1 to each element to get 661 1's. Why not use:
rep(1, length(uniqueids))
?
I installed the sampling package and ran the strata() function first
on the examples and then on some test data. The examples ran fine and
reasonably fast.
The test data that I created (see below) had a basic structure of a
certain number of unique IDs and then just a record counter per ID.
There are 75 records per ID, just to pick the number that I used in
the sample data below. So the record counter runs from 1:75 for each ID.
When I ran the strata() function on 10 unique IDs (750 records), it
took about 1 second to execute. However, when I ran the function with
50 unique IDs (3750 records), it took about 25 seconds. For 100 unique
IDs (7500 records), it took around 90 seconds.
This suggests to me that the problem that you are having is that the
strata() function does not scale well at all to large datasets,
perhaps being more sensitive to the number of unique strata. You might
want to contact the package maintainer to see if they intended the
function and/or package to be used with large datasets or not. I am
going to reasonably guess that they did not or at least, did not test
it on large datasets to optimize their algorithms.
I am not clear how large your data frame is in terms of columns, but
if we create one to at least have the same number of unique IDs and
approximately the same number of records and create some specific code
to get one randomly selected record from each unique ID, it runs very
quickly.
# Create a data frame with 661 unique IDs, 75 records per ID.
# Include a second column with a counter of 1:75 for each ID
DF <- data.frame(ID = rep(1:661, each = 75), RecNum = rep(1:75, 661))
# Randomize the rows, so that the unique IDs are not sorted together
DF <- DF[sample(661 * 75), ]
> str(DF)
'data.frame': 49575 obs. of 2 variables:
$ ID : int 512 587 640 602 148 250 31 487 336 145 ...
$ RecNum: int 14 23 72 49 39 56 58 17 37 8 ...
# There are 661 unique IDs
> length(unique(DF$ID))
[1] 661
# There are 75 of each
> all(table(DF$ID) == 75)
[1] TRUE
# Now, get one randomly selected record per unique ID
> system.time(DF.new <- do.call(rbind,
lapply(split(DF, DF$ID),
function(x) x[sample(nrow(x),
1), ])))
user system elapsed
0.361 0.110 0.478
# We have 661 records now
> str(DF.new)
'data.frame': 661 obs. of 2 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ RecNum: int 54 3 73 5 6 19 51 7 18 8 ...
# We have 661 unique IDs
> length(unique(DF.new$ID))
[1] 661
# 1 record per ID
> all(table(DF.new$ID) == 1)
[1] TRUE
In this simple example, it took less than half a second to generate
the result. That is on a 2.93 Ghz MacBook Pro.
So, for your data, the code would look something like this:
system.time(DF.new <- do.call(rbind,
lapply(split(patch_summary,
patch_summary$UniqueID),
function(x) x[sample(nrow(x),
1), ])))
DF.new will contain your subset of data and you will get output
similar to mine above, with the time it took for the function to
execute. See ?system.time for more information.
Try that and see if that works and how fast it executes on your actual
data.
HTH,
Marc Schwartz
More information about the R-help
mailing list