[R] data frame subset too slow
Duke
duke.lists at gmx.com
Thu Dec 30 17:34:00 CET 2010
Actually, there are several different ways of subsetting a column:
[1]
[[1]]
[,1]
$V1
Please let me know which one is the fastest (and most commonly used). Thanks.
D.
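
For reference, here is a minimal sketch of what each of the four forms returns, using a small made-up data frame. The values in dat and lst and the column name V1 are illustrative assumptions, not the real gene data. The point it tries to show is that [1] keeps a one-column data frame, while [[1]], [,1] and $V1 all extract the column as a plain vector, which is what %in% needs to give one result per row.

```r
# Toy data: V1 stands in for the gene-name column (hypothetical values).
dat <- data.frame(V1 = c("Xkr4", "Rp1", "Sox17", "Mrpl15"),
                  V2 = c(1, 2, 3, 4),
                  stringsAsFactors = FALSE)
lst <- data.frame(V1 = c("Rp1", "Mrpl15", "0610007C21Rik"),
                  stringsAsFactors = FALSE)

# dat[1] keeps the data frame wrapper: the result is a one-column data frame.
class(dat[1])

# dat[[1]], dat[, 1] and dat$V1 all return the column itself as a vector.
class(dat[[1]])
identical(dat[[1]], dat[, 1])
identical(dat[[1]], dat$V1)

# With plain vectors, %in% yields one TRUE/FALSE per row of dat,
# which can be used directly as a row index.
keep <- dat[[1]] %in% lst[[1]]
subdat <- dat[keep, ]
subdat
```

Among the three vector-returning forms, any timing difference should be small next to the %in% step itself (not measured here); the choice is mostly about readability, with $V1 or [["V1"]] being explicit about which column is meant.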
On 12/30/10 11:28 AM, Duke wrote:
> Hi Jim,
>
> Is this really a problem for me to use [1] instead of [[1]]? Will this
> make it run slower? Also, if I use dat$V1 %in% list$V1, will it be fine?
>
> Anyway, my data and list are basically gene lists (tab delimited):
>
> $ head test.txt
> Xkr4 chr1 - 3204562 3661579 3206102 3661429 3 3204562,3411782,3660632, 3207049,3411982,3661579,
> Rp1 chr1 - 4280926 4399322 4283061 4399268 4 4280926,4341990,4342282,4399250, 4283093,4342162,4342918,4399322,
> Rp1_2 chr1 - 4333587 4350395 4334680 4342906 4 4333587,4341990,4342282,4350280, 4340172,4342162,4342918,4350395,
> Sox17 chr1 - 4481008 4486494 4481796 4483487 5 4481008,4483180,4483852,4485216,4486371, 4482749,4483547,4483944,4486023,4486494,
> Mrpl15 chr1 - 4763278 4775807 4764532 4775758 5 4763278,4767605,4772648,4774031,4775653, 4764597,4767729,4772814,4774186,4775807,
> Mrpl15_2 chr1 - 4763278 4775807 4775807 4775807 4 4763278,4767605,4772648,4775653, 4764597,4767729,4772814,4775807,
> $ head list.txt
> GeneNames Chr Start End
> 0610007C21Rik chr5 31351012 31356996
> 0610007L01Rik chr5 130695613 130719635
> 0610007L01Rik_2 chr5 130698204 130719635
> 0610007P08Rik chr13 63916627 64001609
> 0610007P08Rik_2 chr13 63916641 63970963
> 0610007P14Rik chr12 87156404 87165495
>
> Thanks,
>
> D.
>
> On 12/30/10 11:13 AM, jim holtman wrote:
>> You should be using dat[[1]]. Here is an example with 80000 rows that
>> takes about 0.02 seconds to get the subset.
>>
>> Please provide the output of 'str' so we can see what your data looks like.
>>
>> > n <- 80000  # rows to create
>> > dat <- data.frame(sample(1:200, n, TRUE), runif(n), runif(n),
>> +                   runif(n), runif(n))
>> > lst <- data.frame(sample(1:100, n, TRUE), runif(n), runif(n),
>> +                   runif(n), runif(n))
>> > str(dat)
>> 'data.frame': 80000 obs. of 5 variables:
>> $ sample.1.200..n..TRUE.: int 39 116 69 163 51 125 144 32 28 4 ...
>> $ runif.n. : num 0.519 0.793 0.549 0.77 0.272 ...
>> $ runif.n..1 : num 0.691 0.89 0.783 0.467 0.357 ...
>> $ runif.n..2 : num 0.705 0.254 0.584 0.998 0.279 ...
>> $ runif.n..3 : num 0.873 1 0.678 0.702 0.455 ...
>> > str(lst)
>> 'data.frame': 80000 obs. of 5 variables:
>> $ sample.1.100..n..TRUE.: int 38 83 38 70 77 44 81 55 32 1 ...
>> $ runif.n. : num 0.0621 0.7374 0.074 0.4281 0.0516 ...
>> $ runif.n..1 : num 0.879 0.294 0.146 0.884 0.58 ...
>> $ runif.n..2 : num 0.648 0.745 0.825 0.507 0.799 ...
>> $ runif.n..3 : num 0.2523 0.1679 0.9728 0.0478 0.0967 ...
>> > system.time({
>> +     dat.sub <- dat[dat[[1]] %in% lst[[1]], ]
>> + })
>>    user  system elapsed
>>    0.02    0.00    0.01
>> > str(dat.sub)
>> 'data.frame': 39803 obs. of 5 variables:
>> $ sample.1.200..n..TRUE.: int 39 69 51 32 28 4 69 3 48 69 ...
>> $ runif.n. : num 0.5188 0.5494 0.2718 0.5566 0.0893 ...
>> $ runif.n..1 : num 0.691 0.783 0.357 0.619 0.717 ...
>> $ runif.n..2 : num 0.705 0.584 0.279 0.789 0.192 ...
>> $ runif.n..3 : num 0.873 0.678 0.455 0.843 0.383 ...
>> On Thu, Dec 30, 2010 at 10:23 AM, Duke <duke.lists at gmx.com> wrote:
>>> Hi all,
>>>
>>> First, I don't have much experience with R, so please be gentle. I am
>>> dealing with a dataset (tens of thousands of lines, each with about 10
>>> columns of data). I have to create subsets of this data based on certain
>>> conditions (for example, rows whose first column also appears in another
>>> dataset). Here is how I did it:
>>>
>>> # import data
>>> dat <- read.table("test.txt", header=TRUE, fill=TRUE, sep="\t")
>>> list <- read.table("list.txt", header=TRUE, fill=TRUE, sep="\t")
>>> # create sub data
>>> subdat <- dat[dat[1] %in% list[1], ]
>>>
>>> The third line is meant to create a new data frame containing the rows
>>> of dat whose first column also appears in the first column of list. The
>>> code runs fine on small test data. When I tried it on my real data (~80k
>>> lines, ~15 MB), it took forever (a few hours). I don't know why it takes
>>> that long, but I don't think it should. Even with a for loop in C++, I
>>> think I could get this done in, say, a few minutes.
>>>
>>> Does anyone have any ideas, advice, or suggestions?
>>>
>>> Thanks so much in advance and Happy New Year to all of you.
>>>
>>> D.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>