[R] data frame subset too slow

Thu Dec 30 17:40:01 CET 2010

If you want the data in the first column of the data frame as a plain
vector, then you should be using '[['.  Notice what comes back in each
of these cases:

> str(dat)
'data.frame':   80000 obs. of  5 variables:
 $ sample.1.200..n..TRUE.: int  25 199 70 124 93 157 49 137 192 57 ...
 $ runif.n.              : num  0.7725 0.0263 0.0728 0.7594 0.2792 ...
 $ runif.n..1            : num  0.4304 0.8608 0.0882 0.5666 0.1721 ...
 $ runif.n..2            : num  0.3797 0.1191 0.0481 0.3297 0.0649 ...
 $ runif.n..3            : num  0.0895 0.0441 0.0403 0.9679 0.3986 ...
> str(dat[1])
'data.frame':   80000 obs. of  1 variable:
 $ sample.1.200..n..TRUE.: int  25 199 70 124 93 157 49 137 192 57 ...
> str(dat[[1]])
 int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
> str(dat$sample.1.200..n..TRUE)
 int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
>  str(dat[,1])
 int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...

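You get different classes of values: dat[1] is still a data frame with
one column, while dat[[1]], dat$name and dat[,1] all return the column
itself as a vector, which is what '%in%' expects.  We would really need
to see the output of 'str' on your data structures to see what might be
happening.  Your data is not that big, and most subsetting/extraction
operations should take less than a second unless there is something
funny in your data.  So please provide the 'str' output so we can see.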
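
To make the difference concrete, here is a minimal sketch on made-up
data.  The names dat, lst, id, x and y are invented for the example and
are not your data; I use lst rather than list as the object name.  It
only shows what '%in%' returns for each way of indexing, not where the
time goes in your real script.

```r
set.seed(1)                       # made-up data, for illustration only
n   <- 1000
dat <- data.frame(id = sample(1:200, n, TRUE), x = runif(n))
lst <- data.frame(id = sample(1:100, n, TRUE), y = runif(n))

# '[' keeps the data frame, so '%in%' sees a one-element list and
# returns a single value instead of one value per row
bad <- dat[1] %in% lst[1]
length(bad)    # 1

# '[[' (or '$') extracts the column as a vector: one TRUE/FALSE per row
good <- dat[[1]] %in% lst[[1]]
length(good)   # 1000
identical(good, dat$id %in% lst$id)

# subset the rows with the per-row logical vector
dat.sub <- dat[good, ]
str(dat.sub)
```

Because the '[' version gives back a single TRUE or FALSE, it gets
recycled over all rows, so the subset would be either everything or
nothing even if it ran quickly.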

On Thu, Dec 30, 2010 at 11:28 AM, Duke <duke.lists at gmx.com> wrote:
> Hi Jim,
>
> Is this really a problem for me to use [1] instead of [[1]]? Will this make
> it run slower? Also, if I use dat$V1 %in% list$V1, will it be fine?
>
> Anyway, my data and list are basically gene lists (tab delimited):
>
> $ head test.txt
> Xkr4    chr1    -    3204562    3661579    3206102    3661429    3
>  3204562,3411782,3660632,    3207049,3411982,3661579,
> Rp1    chr1    -    4280926    4399322    4283061    4399268    4
>  4280926,4341990,4342282,4399250,    4283093,4342162,4342918,4399322,
> Rp1_2    chr1    -    4333587    4350395    4334680    4342906    4
>  4333587,4341990,4342282,4350280,    4340172,4342162,4342918,4350395,
> Sox17    chr1    -    4481008    4486494    4481796    4483487    5
>  4481008,4483180,4483852,4485216,4486371,
>  4482749,4483547,4483944,4486023,4486494,
> Mrpl15    chr1    -    4763278    4775807    4764532    4775758    5
>  4763278,4767605,4772648,4774031,4775653,
>  4764597,4767729,4772814,4774186,4775807,
> Mrpl15_2    chr1    -    4763278    4775807    4775807    4775807    4
>  4763278,4767605,4772648,4775653,    4764597,4767729,4772814,4775807,
> $ head list.txt
> GeneNames    Chr    Start    End
> 0610007C21Rik    chr5    31351012    31356996
> 0610007L01Rik    chr5    130695613    130719635
> 0610007L01Rik_2    chr5    130698204    130719635
> 0610007P08Rik    chr13    63916627    64001609
> 0610007P08Rik_2    chr13    63916641    63970963
> 0610007P14Rik    chr12    87156404    87165495
>
> Thanks,
>
> D.
>
> On 12/30/10 11:13 AM, jim holtman wrote:
>>
>> You should be using dat[[1]].  Here is an example with 80000 rows that
>> takes about 0.02 seconds to get the subset.
>>
>> Provide an 'str' of what your data looks like.
>>
>>> n<- 80000  # rows to create
>>> dat<- data.frame(sample(1:200, n, TRUE), runif(n), runif(n), runif(n),
>>> runif(n))
>>> lst<- data.frame(sample(1:100, n, TRUE), runif(n), runif(n), runif(n),
>>> runif(n))
>>> str(dat)
>>
>> 'data.frame':   80000 obs. of  5 variables:
>>  $ sample.1.200..n..TRUE.: int  39 116 69 163 51 125 144 32 28 4 ...
>>  $ runif.n.              : num  0.519 0.793 0.549 0.77 0.272 ...
>>  $ runif.n..1            : num  0.691 0.89 0.783 0.467 0.357 ...
>>  $ runif.n..2            : num  0.705 0.254 0.584 0.998 0.279 ...
>>  $ runif.n..3            : num  0.873 1 0.678 0.702 0.455 ...
>>>
>>> str(lst)
>>
>> 'data.frame':   80000 obs. of  5 variables:
>>  $ sample.1.100..n..TRUE.: int  38 83 38 70 77 44 81 55 32 1 ...
>>  $ runif.n.              : num  0.0621 0.7374 0.074 0.4281 0.0516 ...
>>  $ runif.n..1            : num  0.879 0.294 0.146 0.884 0.58 ...
>>  $ runif.n..2            : num  0.648 0.745 0.825 0.507 0.799 ...
>>  $ runif.n..3            : num  0.2523 0.1679 0.9728 0.0478 0.0967 ...
>>>
>>> system.time({
>>
>> + dat.sub<- dat[dat[[1]] %in% lst[[1]],]
>> + })
>>    user  system elapsed
>>    0.02    0.00    0.01
>>>
>>> str(dat.sub)
>>
>> 'data.frame':   39803 obs. of  5 variables:
>>  $ sample.1.200..n..TRUE.: int  39 69 51 32 28 4 69 3 48 69 ...
>>  $ runif.n.              : num  0.5188 0.5494 0.2718 0.5566 0.0893 ...
>>  $ runif.n..1            : num  0.691 0.783 0.357 0.619 0.717 ...
>>  $ runif.n..2            : num  0.705 0.584 0.279 0.789 0.192 ...
>>  $ runif.n..3            : num  0.873 0.678 0.455 0.843 0.383 ...
>> On Thu, Dec 30, 2010 at 10:23 AM, Duke<duke.lists at gmx.com>  wrote:
>>>
>>> Hi all,
>>>
>>> First, I don't have much experience with R, so be gentle. I am dealing
>>> with a dataset (tens of thousands of lines, each with ~10 columns of
>>> data). I have to create subsets of this data based on certain
>>> conditions (for example, rows whose first column also appears in
>>> another dataset). Here is how I did it:
>>>
>>> # import data
>>> dat<- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
>>> list<- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
>>> # create sub data
>>> subdat<- dat[dat[1] %in% list[1],]
>>>
>>> The third line is meant to create a new data frame containing the rows
>>> of dat whose first column also appears in the first column of list. The
>>> code runs just fine with small test data. When I tried it with my real
>>> data (~80k lines, ~15 MB), it took forever (a few hours). I don't know
>>> why it takes that long, but I don't think it should. Even with a for
>>> loop in C++, I could get this done in, say, a few minutes.
>>>
>>> Does anyone have any ideas, advice, or suggestions?
>>>
>>> Thanks so much in advance and Happy New Year to all of you.
>>>
>>> D.

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?