[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

Wed Aug 13 18:09:37 CEST 2008

split is probably what you are after.  Here is an example:

> n <- 2700000
> x <- data.frame(name=sample(1:6000,n,TRUE), value=runif(n))
> # split it into 6000 lists
> system.time(y <- split(x$value, x$name))
   user  system elapsed
   0.80    0.20    1.07
> str(y[1:10])
List of 10
 $ 1 : num [1:454] 0.270 0.380 0.238 0.048 0.715 ...
 $ 2 : num [1:440] 0.769 0.822 0.832 0.527 0.808 ...
 $ 3 : num [1:444] 0.626 0.324 0.918 0.916 0.743 ...
 $ 4 : num [1:455] 0.341 0.482 0.134 0.237 0.324 ...
 $ 5 : num [1:430] 0.610 0.217 0.245 0.716 0.600 ...
 $ 6 : num [1:443] 0.460 0.335 0.503 0.798 0.181 ...
 $ 7 : num [1:424] 0.4417 0.4759 0.7436 0.0863 0.1770 ...
 $ 8 : num [1:480] 0.0712 0.6774 0.2995 0.8378 0.1902 ...
 $ 9 : num [1:431] 0.892 0.836 0.397 0.612 0.395 ...
 $ 10: num [1:448] 0.984 0.601 0.793 0.363 0.898 ...
>
It takes about 1 second (1.07 s elapsed) to split the values into a list of 6000 vectors.
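
For the small data frame in the original question, the same idea might look like the sketch below. The object names by_name and a_values are made up for illustration; the column names follow the question's example, with fixed values in place of the random sample so the result is reproducible.

```r
# Toy data modelled on the example from the original question
# (column names 'names' and 'col1' come from that post).
df <- data.frame(
  names = factor(c(rep("A", 5), rep("B", 5))),
  col1  = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 0)
)

# One pass over the data: a named list with one vector per factor level.
by_name <- split(df$col1, df$names)

# Each group can now be looked up by name without rescanning the column.
a_values <- by_name[["A"]]

# Same values as the per-level subsetting used in the question.
stopifnot(identical(a_values, df$col1[which(df$names == "A")]))
```

The point of the design is that the column is scanned once by split(), instead of once per level as with repeated which() calls.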

On Wed, Aug 13, 2008 at 9:03 AM, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:
> Wow great! Split was exactly what was needed. It takes about 1 second
> for the whole operation :D
>
> Thanks again - I can't believe I never used this function in the past.
>
> All the best,
>
> Emmanuel
>
>
> 2008/8/13 Erik Iverson <iverson at biostat.wisc.edu>:
>> I still don't understand what you are doing.  Can you make a small example
>> that shows what you have and what you want?
>>
>> Is ?split what you are after?
>>
>> Emmanuel Levy wrote:
>>>
>>> Dear Peter and Henrik,
>>>
>>> Thanks for your replies - this helps speed up a bit, but I thought
>>> there would be something much faster.
>>>
>>> What I mean is that I thought that a particular value of a level
>>> could be accessed instantly, similarly to a "hash" key.
>>>
>>> Since I've got about 6000 levels in that data frame, it means that
>>> making a list L of the form
>>> L[[1]] = values of name "1"
>>> L[[2]] = values of name "2"
>>> L[[3]] = values of name "3"
>>> ...
>>> would take ~1hour.
>>>
>>> Best,
>>>
>>> Emmanuel
>>>
>>>
>>>
>>>
>>> 2008/8/12 Henrik Bengtsson <hb at stat.berkeley.edu>:
>>>>
>>>> To simplify:
>>>>
>>>> n <- 2.7e6;
>>>> x <- factor(c(rep("A", n/2), rep("B", n/2)));
>>>>
>>>> # Identify 'A':s
>>>> t1 <- system.time(res <- which(x == "A"));
>>>>
>>>> # To compare a factor to a string, the factor is in practice
>>>> # coerced to a character vector.
>>>> t2 <- system.time(res <- which(as.character(x) == "A"));
>>>>
>>>> # Interestingly enough, this seems to be faster (repeated many times)
>>>> # Don't know why.
>>>> print(t2/t1);
>>>>   user   system  elapsed
>>>> 0.632653 1.600000 0.754717
>>>>
>>>> # Avoid coercing the factor; instead convert the level being compared to its integer code
>>>> t3 <- system.time(res <- which(x == match("A", levels(x))));
>>>>
>>>> # ...but gives no speed up
>>>> print(t3/t1);
>>>>   user   system  elapsed
>>>> 1.041667 1.000000 1.018182
>>>>
>>>> # But coercing the factor to integers does
>>>> t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
>>>> print(t4/t1);
>>>>    user    system   elapsed
>>>> 0.4166667 0.0000000 0.3636364
>>>>
>>>> So, the latter seems to be the fastest way to identify those elements.
>>>>
>>>> My $.02
>>>>
>>>> /Henrik
>>>>
>>>>
>>>> On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <cowan.pd at gmail.com> wrote:
>>>>>
>>>>> Emmanuel,
>>>>>
>>>>> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I have a large data frame ( 2700000 lines and 14 columns), and I would
>>>>>> like to
>>>>>> extract the information in a particular way illustrated below:
>>>>>>
>>>>>>
>>>>>> Given a data frame "df":
>>>>>>
>>>>>>> col1=sample(c(0,1),10, rep=T)
>>>>>>> names = factor(c(rep("A",5),rep("B",5)))
>>>>>>> df = data.frame(names,col1)
>>>>>>> df
>>>>>>
>>>>>>  names col1
>>>>>> 1      A    1
>>>>>> 2      A    0
>>>>>> 3      A    1
>>>>>> 4      A    0
>>>>>> 5      A    1
>>>>>> 6      B    0
>>>>>> 7      B    0
>>>>>> 8      B    1
>>>>>> 9      B    0
>>>>>> 10     B    0
>>>>>>
>>>>>> I would like to transform it into the form:
>>>>>>
>>>>>>> index = c("A","B")
>>>>>>> col1[[1]]=df$col1[which(df$name=="A")]
>>>>>>> col1[[2]]=df$col1[which(df$name=="B")]
>>>>>
>>>>> I'm not sure I fully understand your problem; your example would not run
>>>>> for me.
>>>>>
>>>>> You could get a small speedup by omitting which(), since you can also
>>>>> subset directly by a logical vector.
>>>>>
>>>>>> n <- 2700000
>>>>>> foo <- data.frame(
>>>>>
>>>>> +       one = sample(c(0,1), n, rep = T),
>>>>> +       two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
>>>>> +       )
>>>>>>
>>>>>> system.time(out <- which(foo$two=="A"))
>>>>>
>>>>>  user  system elapsed
>>>>>  0.566   0.146   0.761
>>>>>>
>>>>>> system.time(out <- foo$two=="A")
>>>>>
>>>>>  user  system elapsed
>>>>>  0.429   0.075   0.588
>>>>>
>>>>> You might also find use for unstack(), though I didn't see a speedup.
>>>>>>
>>>>>> system.time(out <- unstack(foo))
>>>>>
>>>>>  user  system elapsed
>>>>>  1.068   0.697   2.004
>>>>>
>>>>> HTH
>>>>>
>>>>> Peter
>>>>>
>>>>>> My problem is that the command:  *** which(df$name=="A") ***
>>>>>> takes about 1 second because df is so big.
>>>>>>
>>>>>> I was thinking that a "level" could maybe be accessed instantly but I
>>>>>> am not
>>>>>> sure about how to do it.
>>>>>>
>>>>>> I would be very grateful for any advice that would allow me to speed
>>>>>> this up.
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Emmanuel
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>
>>
>
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?