[R] data frame subset too slow
jim holtman
jholtman at gmail.com
Thu Dec 30 17:40:01 CET 2010
If you want the data in the first column of the dataframe, then you
should be using '[['. Notice what comes back in each of these cases:
> str(dat)
'data.frame': 80000 obs. of 5 variables:
$ sample.1.200..n..TRUE.: int 25 199 70 124 93 157 49 137 192 57 ...
$ runif.n. : num 0.7725 0.0263 0.0728 0.7594 0.2792 ...
$ runif.n..1 : num 0.4304 0.8608 0.0882 0.5666 0.1721 ...
$ runif.n..2 : num 0.3797 0.1191 0.0481 0.3297 0.0649 ...
$ runif.n..3 : num 0.0895 0.0441 0.0403 0.9679 0.3986 ...
> str(dat[1])
'data.frame': 80000 obs. of 1 variable:
$ sample.1.200..n..TRUE.: int 25 199 70 124 93 157 49 137 192 57 ...
> str(dat[[1]])
int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
> str(dat$sample.1.200..n..TRUE)
int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
> str(dat[,1])
int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
You will get different classes of values: 'dat[1]' is still a data frame,
while 'dat[[1]]', 'dat$name' and 'dat[,1]' return the column as a vector.
We would really need to see the output of 'str' on your data structures
to see what might be happening. Your data is not that big, and most
subsetting/extractions should take less than a second unless there is
something funny in your data. So provide the 'str' output so we can see.
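To make the difference concrete, here is a minimal sketch with two tiny
data frames. The column names and the 'Lypla1' entry are made up for
illustration and are not taken from your files; the second data frame is
called 'lst' rather than 'list' only to avoid masking the base function.

```r
# Hypothetical toy data, invented for illustration only
dat <- data.frame(gene  = c("Xkr4", "Rp1", "Sox17", "Mrpl15"),
                  start = c(3204562, 4280926, 4481008, 4763278),
                  stringsAsFactors = FALSE)
lst <- data.frame(gene = c("Rp1", "Mrpl15", "Lypla1"),
                  stringsAsFactors = FALSE)

# dat[1] is a one-column data frame, i.e. a list of length 1,
# so %in% returns a single logical instead of one value per row
whole <- dat[1] %in% lst[1]
length(whole)

# dat[[1]] is the column itself, so %in% returns one logical per row
keep <- dat[[1]] %in% lst[[1]]
length(keep)

# Row subset using the per-row logical vector
subdat <- dat[keep, ]
subdat$gene
```

With the '[' form the test is not done row by row, so the result is
unlikely to be the subset you intended, whatever the timing turns out to be.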
On Thu, Dec 30, 2010 at 11:28 AM, Duke <duke.lists at gmx.com> wrote:
> Hi Jim,
>
> Is this really a problem for me to use [1] instead of [[1]]? Will this make
> it run slower? Also, if I use dat$V1 %in% list$V1, will it be fine?
>
> Anyway, my data and list are basically gene lists (tab delimited):
>
> $ head test.txt
> Xkr4 chr1 - 3204562 3661579 3206102 3661429 3
> 3204562,3411782,3660632, 3207049,3411982,3661579,
> Rp1 chr1 - 4280926 4399322 4283061 4399268 4
> 4280926,4341990,4342282,4399250, 4283093,4342162,4342918,4399322,
> Rp1_2 chr1 - 4333587 4350395 4334680 4342906 4
> 4333587,4341990,4342282,4350280, 4340172,4342162,4342918,4350395,
> Sox17 chr1 - 4481008 4486494 4481796 4483487 5
> 4481008,4483180,4483852,4485216,4486371,
> 4482749,4483547,4483944,4486023,4486494,
> Mrpl15 chr1 - 4763278 4775807 4764532 4775758 5
> 4763278,4767605,4772648,4774031,4775653,
> 4764597,4767729,4772814,4774186,4775807,
> Mrpl15_2 chr1 - 4763278 4775807 4775807 4775807 4
> 4763278,4767605,4772648,4775653, 4764597,4767729,4772814,4775807,
> $ head list.txt
> GeneNames Chr Start End
> 0610007C21Rik chr5 31351012 31356996
> 0610007L01Rik chr5 130695613 130719635
> 0610007L01Rik_2 chr5 130698204 130719635
> 0610007P08Rik chr13 63916627 64001609
> 0610007P08Rik_2 chr13 63916641 63970963
> 0610007P14Rik chr12 87156404 87165495
>
> Thanks,
>
> D.
>
> On 12/30/10 11:13 AM, jim holtman wrote:
>>
>> You should be using dat[[1]]. Here is an example with 80000 rows that
>> takes about 0.02 seconds to get the subset.
>>
>> Provide a 'str' of what your data looks like
>>
>>> n<- 80000 # rows to create
>>> dat<- data.frame(sample(1:200, n, TRUE), runif(n), runif(n), runif(n),
>>> runif(n))
>>> lst<- data.frame(sample(1:100, n, TRUE), runif(n), runif(n), runif(n),
>>> runif(n))
>>> str(dat)
>>
>> 'data.frame': 80000 obs. of 5 variables:
>> $ sample.1.200..n..TRUE.: int 39 116 69 163 51 125 144 32 28 4 ...
>> $ runif.n. : num 0.519 0.793 0.549 0.77 0.272 ...
>> $ runif.n..1 : num 0.691 0.89 0.783 0.467 0.357 ...
>> $ runif.n..2 : num 0.705 0.254 0.584 0.998 0.279 ...
>> $ runif.n..3 : num 0.873 1 0.678 0.702 0.455 ...
>>>
>>> str(lst)
>>
>> 'data.frame': 80000 obs. of 5 variables:
>> $ sample.1.100..n..TRUE.: int 38 83 38 70 77 44 81 55 32 1 ...
>> $ runif.n. : num 0.0621 0.7374 0.074 0.4281 0.0516 ...
>> $ runif.n..1 : num 0.879 0.294 0.146 0.884 0.58 ...
>> $ runif.n..2 : num 0.648 0.745 0.825 0.507 0.799 ...
>> $ runif.n..3 : num 0.2523 0.1679 0.9728 0.0478 0.0967 ...
>>>
>>> system.time({
>>
>> + dat.sub<- dat[dat[[1]] %in% lst[[1]],]
>> + })
>> user system elapsed
>> 0.02 0.00 0.01
>>>
>>> str(dat.sub)
>>
>> 'data.frame': 39803 obs. of 5 variables:
>> $ sample.1.200..n..TRUE.: int 39 69 51 32 28 4 69 3 48 69 ...
>> $ runif.n. : num 0.5188 0.5494 0.2718 0.5566 0.0893 ...
>> $ runif.n..1 : num 0.691 0.783 0.357 0.619 0.717 ...
>> $ runif.n..2 : num 0.705 0.584 0.279 0.789 0.192 ...
>> $ runif.n..3 : num 0.873 0.678 0.455 0.843 0.383 ...
>> On Thu, Dec 30, 2010 at 10:23 AM, Duke<duke.lists at gmx.com> wrote:
>>>
>>> Hi all,
>>>
>>> First, I don't have much experience with R, so be gentle. OK, I am dealing
>>> with a dataset (~ tens of thousands of lines, each line ~ 10 columns of
>>> data). I have to create some subsets of this data based on certain
>>> conditions (for example, same first column as another dataset, etc.).
>>> Here is how I did it:
>>>
>>> # import data
>>> dat<- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
>>> list<- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
>>> # create sub data
>>> subdat<- dat[dat[1] %in% list[1],]
>>>
>>> So the third line is to create a new data frame with all the rows whose
>>> first column appears in both dat and list. There is no problem with the
>>> code, as it runs just fine with small test data. When I tried with my real
>>> data (~80k lines, ~15MB size), it takes forever (a few hours). I don't
>>> know why it takes that long, but I think it shouldn't. I think even with
>>> a for loop in C++, I could get this done in, say, a few minutes.
>>>
>>> So does anyone have any idea/advice/suggestion?
>>>
>>> Thanks so much in advance and Happy New Year to all of you.
>>>
>>> D.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>
>
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?