[R] data frame subset too slow
duke.lists at gmx.com
Thu Dec 30 16:23:31 CET 2010
First, I don't have much experience with R, so please be gentle. I am
dealing with a dataset of tens of thousands of lines, each with about 10
columns of data. I have to create a subset of this data based on certain
conditions (for example, rows whose first column also appears in the
first column of another dataset). Here is how I did it:
# import data
dat <- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
list <- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
# create sub data
subdat <- dat[dat %in% list,]
The last line of code is meant to create a new data frame containing the
rows of dat whose first column also appears in the first column of list.
The code runs fine on a small test dataset, but on my real data (~80k
lines, ~15 MB) it takes forever (a few hours). I don't know why it takes
that long, and I don't think it should: even a plain for loop in C++
could get this done in, say, a few minutes.
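For reference, here is a minimal sketch of the subset described above, written so that it compares the first columns directly. Applied to two data frames, %in% treats them as lists of columns, so the result has one element per column of dat rather than one per row. It is not a row filter, and it may also account for part of the slowdown. The toy data and the name keep are hypothetical stand-ins for test.txt and list.txt (keep is used instead of list to avoid masking the base function list()).

```r
# Toy stand-ins for the two input files (hypothetical data, not the real files).
dat  <- data.frame(id = c("a", "b", "c", "d"), value = 1:4,
                   stringsAsFactors = FALSE)
keep <- data.frame(id = c("b", "d", "z"), stringsAsFactors = FALSE)

# Compare the first columns as plain vectors: one TRUE/FALSE per row of dat.
rows <- dat[[1]] %in% keep[[1]]

# Keep only the matching rows.
subdat <- dat[rows, ]
```

On vectors of this size, a membership test like this should be a single vectorized lookup rather than a row-by-row loop.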
Does anyone have any ideas, advice, or suggestions?
Thanks so much in advance and Happy New Year to all of you.