[R] Fastest way to repeatedly subset a data frame?
hadley wickham
h.wickham at gmail.com
Fri Apr 20 20:48:26 CEST 2007
On 4/20/07, Iestyn Lewis <ilewis at pharm.emory.edu> wrote:
> I have a data frame with a large number of observations (62,000 rows,
> but only 2 columns - a character ID and a result list).
> Sample:
> > my.df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)
> > my.df
>    id result
> 1 ID1      1
> 2 ID2      2
> 3 ID3      3
> I have a list of ID vectors. This list will have anywhere from 100 to
> 1000 members, and each member will have anywhere from 10 to 5000 id entries.
> Sample:
>
> > my.idlist[["List1"]] <- c("ID1", "ID3")
> > my.idlist[["List2"]] <- c("ID2")
> > my.idlist
> $List1
> [1] "ID1" "ID3"
> $List2
> [1] "ID2"
> I need to subset that data frame by the list of IDs in each vector, to
> end up with vectors that contain just the results for the IDs found in
> each vector in the list. My current approach is to create new columns
> in the original data frame with the names of the list items, and any
> results that don't match replaced with NA. Here is what I've done so far:
> createSubsets <- function(res, slib) {
>   for(i in 1:length(slib)) {
>     res[ ,names(slib)[i]] <- replace(res$result,
>       which(!is.element(res$sid, slib[[i]])), NA)
>     return (res)
>   }
> }
> I have 2 problems:
>
> 1) My function only works for the first item in the list:
>
> > my.df <- createSubsets(my.df, my.idlist)
> > my.df
>    id result List1
> 1 ID1      1     1
> 2 ID2      2    NA
> 3 ID3      3     3
> In order to get all results, I have to copy the loop out of the function
> and paste it into R directly.
>
> 2) It is very, very slow. For a dataset of 62,000 rows and 253 list
> entries, it takes probably 5 minutes on a pentium D. An implementation
> of this kind of subsetting using hashtables in C# takes a negligible
> amount of time.
> I am open to any suggestions about data format, methods, anything.
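On the first problem: return(res) sits inside the for loop, so the function exits after the first list member has been processed. Below is a rough sketch of the same function with the return moved after the loop. It assumes the id column is called id, as in the sample data frame (the quoted function refers to res$sid), and it swaps replace()/which() for ifelse() and %in% purely for brevity.

```r
createSubsets <- function(res, slib) {
  for (name in names(slib)) {
    # Keep results whose id is in this list member; set the rest to NA.
    res[[name]] <- ifelse(res$id %in% slib[[name]], res$result, NA)
  }
  res  # returned once, after every list member has been processed
}

my.df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3)
my.idlist <- list(List1 = c("ID1", "ID3"), List2 = c("ID2"))
my.df <- createSubsets(my.df, my.idlist)
```

This only addresses the early return; it still adds one full-length column per list member, so it is not expected to help with the speed problem.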
How about:
df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)
ids <- list()
ids[["List1"]] <- c("ID1", "ID3")
ids[["List2"]] <- c("ID2")
rownames(df) <- df$id
lapply(ids, function(id) df[id, ])
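If only the result values are needed rather than whole rows, a named vector can serve as the lookup table in the same way. This is a minimal sketch along those lines, not something timed on the 62,000-row data. The names results and subsets are made up for illustration, and it assumes the ids are unique.

```r
df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3)
ids <- list(List1 = c("ID1", "ID3"), List2 = c("ID2"))

# Named vector: names are the ids, values are the results.
# as.character() guards against id being stored as a factor.
results <- setNames(df$result, as.character(df$id))

# Index by name once per list member; ids absent from df come back as NA.
subsets <- lapply(ids, function(id) unname(results[id]))
```

subsets$List1 then holds the results for ID1 and ID3, and subsets$List2 the result for ID2.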
Hadley
