[R] Fastest way to repeatedly subset a data frame?

Iestyn Lewis ilewis at pharm.emory.edu
Fri Apr 20 18:29:54 CEST 2007


Hi -

 I have a data frame with a large number of observations (62,000 rows, 
but only 2 columns: a character ID and a numeric result). 

Sample:

 > my.df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)
 > my.df
   id result
1 ID1      1
2 ID2      2
3 ID3      3

I have a list of ID vectors.  This list will have anywhere from 100 to 
1000 members, and each member will have anywhere from 10 to 5000 id entries.

Sample:

 > my.idlist <- list()
 > my.idlist[["List1"]] <- c("ID1", "ID3")
 > my.idlist[["List2"]] <- c("ID2")
 > my.idlist
$List1
[1] "ID1" "ID3"

$List2
[1] "ID2"


I need to subset that data frame by the IDs in each vector, ending up 
with one result vector per list element.  My current approach is to add 
a new column to the original data frame for each item in the list, with 
the results for any non-matching IDs replaced by NA.  Here is what I've 
done so far:

createSubsets <- function(res, slib) {
    for (i in 1:length(slib)) {
        res[, names(slib)[i]] <- replace(res$result,
            which(!is.element(res$id, slib[[i]])), NA)
        return(res)
    }
}

I have 2 problems:

1)  My function only works for the first item in the list:

 > my.df <- createSubsets(my.df, my.idlist)
 > my.df
   id result List1
1 ID1      1     1
2 ID2      2    NA
3 ID3      3     3

In order to get all results, I have to copy the loop out of the function 
and paste it into R directly.
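I suspect the early `return` is the culprit: `return(res)` inside the 
loop exits on the first iteration.  A minimal sketch of the same 
function with the `return` moved outside the loop, using the sample 
data from above:

```r
# Same function, but return only after the loop has processed
# every element of the ID list.
createSubsets <- function(res, slib) {
    for (i in seq_along(slib)) {
        # Keep result where the ID is in this list element; NA elsewhere.
        res[, names(slib)[i]] <- replace(res$result,
            which(!is.element(res$id, slib[[i]])), NA)
    }
    res
}

my.df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3)
my.idlist <- list(List1 = c("ID1", "ID3"), List2 = "ID2")
my.df <- createSubsets(my.df, my.idlist)
```

With this version `my.df` gains both a `List1` and a `List2` column in 
one call, so the loop no longer has to be pasted in by hand.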

2)  It is very, very slow.  For a dataset of 62,000 rows and 253 list 
entries, it takes about 5 minutes on a Pentium D.  An implementation 
of this kind of subsetting using hashtables in C# takes a negligible 
amount of time. 
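The closest thing I could think of to the hashtable approach in R is a 
named-vector lookup, indexing results directly by ID instead of 
scanning all 62,000 rows once per list element.  A minimal sketch (the 
names `lookup` and `subsets` are just illustrative; it produces a list 
of result vectors rather than NA-padded columns):

```r
# Build the sample data as above.
my.df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3)
my.idlist <- list(List1 = c("ID1", "ID3"), List2 = "ID2")

# One-time setup: a vector of results named by ID, so results can be
# fetched by name rather than by searching the data frame each time.
lookup <- setNames(my.df$result, as.character(my.df$id))

# One lookup per list element; each entry is the results for that
# element's IDs, in the order the IDs appear in the list.
subsets <- lapply(my.idlist, function(ids) lookup[ids])
```

This does a single pass over the data frame up front, after which each 
list element costs only one indexing operation, which should scale far 
better than rebuilding a 62,000-element column per list entry.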

I am open to any suggestions about data format, methods, anything. 

Thanks,

Iestyn Lewis
Emory University
