[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Emmanuel Levy
emmanuel.levy at gmail.com
Wed Aug 13 17:53:06 CEST 2008
Sorry for being unclear, I thought the example above was clear enough.
I have a data frame of the form:
name info
1 YAL001C 1
2 YAL001C 1
3 YAL001C 1
4 YAL001C 1
5 YAL001C 0
6 YAL001C 1
7 YAL001C 1
8 YAL001C 1
9 YAL001C 1
10 YAL001C 1
...
...
~2700000 lines and ~6000 different names,
which correspond to yeast proteins + some info.
So there are about 6000 names like "YAL001C".
I would like to transform this data frame into the following form:
1/ a list, where each protein corresponds to an index, and the element at
that index is the vector of info values for that protein:
> L[[1]]
[1] 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 ....
> L[[2]]
[1] 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 ....
etc.
2/ an index, which gives me the position of each protein in the list:
> index
[1] "YAL001C" "YAL002W" "YAL003W" "YAL005C" "YAL007C" ...
I hope this will be clearer!
I'll have a look right now at the split and hash.mat functions.
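For reference, a minimal sketch of how split() might produce the structure described above, assuming a data frame df with columns name and info; the toy values below are invented for illustration and are not taken from the real data:

```r
# Toy data in the same shape as the real data frame (values are made up).
df <- data.frame(
  name = factor(c("YAL001C", "YAL001C", "YAL001C", "YAL002W", "YAL002W")),
  info = c(1, 1, 0, 0, 1)
)

# split() groups 'info' by the levels of 'name' in a single call,
# instead of calling which(df$name == ...) once per protein.
L <- split(df$info, df$name)

# The names of the resulting list serve as the index of protein names.
index <- names(L)

L[[1]]                        # info values for the first protein
L[[match("YAL002W", index)]]  # look up a protein via its position in 'index'
L[["YAL002W"]]                # or directly by name
```

Whether this is fast enough on ~2700000 lines is something I would still need to time on the real data.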
Thanks for your help,
Emmanuel
2008/8/13 Erik Iverson <iverson at biostat.wisc.edu>:
> I still don't understand what you are doing. Can you make a small example
> that shows what you have and what you want?
>
> Is ?split what you are after?
>
> Emmanuel Levy wrote:
>>
>> Dear Peter and Henrik,
>>
>> Thanks for your replies - this helps speed up a bit, but I thought
>> there would be something much faster.
>>
>> What I mean is that I thought that a particular value of a level
>> could be accessed instantly, similarly to a "hash" key.
>>
>> Since I've got about 6000 levels in that data frame, it means that
>> making a list L of the form
>> L[[1]] = values of name "1"
>> L[[2]] = values of name "2"
>> L[[3]] = values of name "3"
>> ...
>> would take ~1hour.
>>
>> Best,
>>
>> Emmanuel
>>
>>
>>
>>
>> 2008/8/12 Henrik Bengtsson <hb at stat.berkeley.edu>:
>>>
>>> To simplify:
>>>
>>> n <- 2.7e6;
>>> x <- factor(c(rep("A", n/2), rep("B", n/2)));
>>>
>>> # Identify 'A':s
>>> t1 <- system.time(res <- which(x == "A"));
>>>
>>> # To compare a factor to a string, the factor is in practice
>>> # coerced to a character vector.
>>> t2 <- system.time(res <- which(as.character(x) == "A"));
>>>
>>> # Interestingly enough, this seems to be faster (repeated many times)
>>> # Don't know why.
>>> print(t2/t1);
>>> user system elapsed
>>> 0.632653 1.600000 0.754717
>>>
>>> # Avoid coercing the factor, but instead coerce the level compared to
>>> t3 <- system.time(res <- which(x == match("A", levels(x))));
>>>
>>> # ...but gives no speed up
>>> print(t3/t1);
>>> user system elapsed
>>> 1.041667 1.000000 1.018182
>>>
>>> # But coercing the factor to integers does
>>> t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
>>> print(t4/t1);
>>> user system elapsed
>>> 0.4166667 0.0000000 0.3636364
>>>
>>> So, the latter seems to be the fastest way to identify those elements.
>>>
>>> My $.02
>>>
>>> /Henrik
>>>
>>>
>>> On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <cowan.pd at gmail.com> wrote:
>>>>
>>>> Emmanuel,
>>>>
>>>> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com>
>>>> wrote:
>>>>>
>>>>> Dear All,
>>>>>
>>>>> I have a large data frame ( 2700000 lines and 14 columns), and I would
>>>>> like to
>>>>> extract the information in a particular way illustrated below:
>>>>>
>>>>>
>>>>> Given a data frame "df":
>>>>>
>>>>> > col1=sample(c(0,1),10, rep=T)
>>>>> > names = factor(c(rep("A",5),rep("B",5)))
>>>>> > df = data.frame(names,col1)
>>>>> > df
>>>>>    names col1
>>>>> 1      A    1
>>>>> 2      A    0
>>>>> 3      A    1
>>>>> 4      A    0
>>>>> 5      A    1
>>>>> 6      B    0
>>>>> 7      B    0
>>>>> 8      B    1
>>>>> 9      B    0
>>>>> 10     B    0
>>>>>
>>>>> I would like to transform it into the form:
>>>>>
>>>>> > index = c("A","B")
>>>>> > col1[[1]]=df$col1[which(df$name=="A")]
>>>>> > col1[[2]]=df$col1[which(df$name=="B")]
>>>>
>>>> I'm not sure I fully understand your problem; your example would not run
>>>> for me.
>>>>
>>>> You could get a small speedup by omitting which(), since you can also
>>>> subset directly by a logical vector.
>>>>
>>>> > n <- 2700000
>>>> > foo <- data.frame(
>>>> + one = sample(c(0,1), n, rep = T),
>>>> + two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
>>>> + )
>>>> > system.time(out <- which(foo$two=="A"))
>>>>    user  system elapsed
>>>>   0.566   0.146   0.761
>>>> > system.time(out <- foo$two=="A")
>>>>    user  system elapsed
>>>>   0.429   0.075   0.588
>>>>
>>>> You might also find use for unstack(), though I didn't see a speedup.
>>>> > system.time(out <- unstack(foo))
>>>>    user  system elapsed
>>>>   1.068   0.697   2.004
>>>>
>>>> HTH
>>>>
>>>> Peter
>>>>
>>>>> My problem is that the command: *** which(df$name=="A") ***
>>>>> takes about 1 second because df is so big.
>>>>>
>>>>> I was thinking that a "level" could maybe be accessed instantly but I
>>>>> am not
>>>>> sure about how to do it.
>>>>>
>>>>> I would be very grateful for any advice that would allow me to speed
>>>>> this up.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Emmanuel
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>