[R] data frame subset too slow

Duke duke.lists at gmx.com
Thu Dec 30 16:23:31 CET 2010


Hi all,

First, I don't have much experience with R, so please be gentle. I am
dealing with a dataset of tens of thousands of lines, each with about 10
columns of data. I have to create subsets of this data based on certain
conditions (for example, rows whose first column matches the first
column of another dataset). Here is how I did it:

# import data
dat <- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
list <- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
# create sub data
subdat <- dat[dat[1] %in% list[1],]

The third line is meant to create a new data frame containing the rows
of dat whose first-column value also appears in the first column of
list. The code runs fine on small test data. When I tried it on my real
data (~80k lines, ~15 MB), it took forever (a few hours). I don't know
why it takes that long, but I think it shouldn't. Even with a for loop
in C++, I think I could get this done in a few minutes.
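
For reference, here is a minimal sketch of the row-wise subset described
above, on made-up toy data (the data frames and the names lst and keep
are hypothetical, not taken from test.txt or list.txt). It assumes the
goal is to keep the rows of dat whose first-column value occurs anywhere
in the first column of the other table. Note that dat[1] is a one-column
data frame while dat[[1]] is a plain vector; %in% applied to data frames
compares whole columns as list elements and returns a single TRUE/FALSE
rather than one value per row. Whether that also accounts for the long
run time on the real data is a guess, not something verified here.

```r
# Hypothetical toy data standing in for test.txt and list.txt
dat <- data.frame(id = c("a", "b", "c", "d"), value = 1:4,
                  stringsAsFactors = FALSE)
lst <- data.frame(id = c("b", "d", "e"), stringsAsFactors = FALSE)

# [[1]] extracts the first column as a vector, so %in% yields
# one logical value per row of dat
keep <- dat[[1]] %in% lst[[1]]

# Keep only the rows flagged TRUE (all columns)
subdat <- dat[keep, ]
```

The second table is called lst here only to avoid masking the base
function list(); the original variable name would work the same way.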

Does anyone have any ideas, advice, or suggestions?

Thanks so much in advance and Happy New Year to all of you.

D.


