[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

Wed Aug 13 04:31:33 CEST 2008

Emmanuel,

On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:
> Dear All,
>
> I have a large data frame (2700000 rows and 14 columns), and I would like to
> extract the information in a particular way illustrated below:
>
>
> Given a data frame "df":
>
>> col1=sample(c(0,1),10, rep=T)
>> names = factor(c(rep("A",5),rep("B",5)))
>> df = data.frame(names,col1)
>> df
>   names col1
> 1      A    1
> 2      A    0
> 3      A    1
> 4      A    0
> 5      A    1
> 6      B    0
> 7      B    0
> 8      B    1
> 9      B    0
> 10     B    0
>
> I would like to transform it into the form:
>
>> index = c("A","B")
>> col1[[1]]=df$col1[which(df$name=="A")]
>> col1[[2]]=df$col1[which(df$name=="B")]

I'm not sure I fully understand your problem; your example would not run for me.

You could get a small speedup by omitting which(): a logical vector can be
used for subsetting directly, so the extra call is not needed.

> n <- 2700000
> foo <- data.frame(
+ 	one = sample(c(0,1), n, replace = TRUE),
+ 	two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
+ 	)
> system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761
> system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588
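
To make the comparison above concrete, here is a minimal sketch on a small stand-in data frame (the six-row `foo` below is made up for illustration, not the 2.7-million-row one timed above). It only shows that the logical vector selects the same values as the which() version when the factor has no missing values.

```r
# Small stand-in for the large data frame in this thread (illustrative data)
foo <- data.frame(
  one = c(1, 0, 1, 0, 1, 0),
  two = factor(c("A", "A", "A", "B", "B", "B"))
)

# Index with the logical vector directly; no which() needed
a_logical <- foo$one[foo$two == "A"]

# Same selection through which(), for comparison
a_which <- foo$one[which(foo$two == "A")]

identical(a_logical, a_which)  # TRUE here, since foo$two has no NAs
```

One caveat: the two forms differ when the factor contains NA. Logical indexing returns an NA element for each NA in the comparison, whereas which() silently drops them.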

You might also find use for unstack(), though I didn't see a speedup.
> system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004
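
If the goal is a list with one vector per level, as I read the original question, split() might be another option: it groups the values in a single call instead of one comparison per level. This is only a sketch on made-up data, and I have not timed it on the full-size data frame.

```r
# Small stand-in data frame (illustrative data, not the one timed above)
foo <- data.frame(
  one = c(1, 0, 1, 0, 1, 0),
  two = factor(c("A", "A", "A", "B", "B", "B"))
)

# One call builds a named list with one vector of values per factor level
by_level <- split(foo$one, foo$two)

names(by_level)   # the level names, "A" and "B"
by_level[["A"]]   # values of 'one' in the rows where two == "A"
```

The result is indexed by level name, so no separate index vector is needed.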

HTH

Peter

> My problem is that the command:  *** which(df$name=="A") ***
> takes about 1 second because df is so big.
>
> I was thinking that a "level" could maybe be accessed instantly but I am not
> sure about how to do it.
>
> I would be very grateful for any advice that would allow me to speed this up.
>
> Best wishes,
>
> Emmanuel
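
On the idea of accessing a level directly: a factor stores one integer code per element plus a table of levels, so there is no built-in per-level index to look up. Two things that might still help are comparing the integer codes rather than the labels, and computing the row indices for every level once up front. The sketch below uses a made-up five-element factor `f`; whether it is actually faster on the real data is an assumption that would need to be checked with system.time().

```r
# Illustrative factor; the real one would be the data frame's grouping column
f <- factor(c("A", "A", "B", "B", "A"))

# Look up the integer code of level "A" once, then compare integers
code_A    <- match("A", levels(f))
idx_codes <- which(as.integer(f) == code_A)

# Same rows as the label-based comparison
idx_chars <- which(f == "A")
identical(idx_codes, idx_chars)  # TRUE

# Row indices for every level, computed in one pass and reusable afterwards
idx_by_level <- split(seq_along(f), f)
idx_by_level[["A"]]
```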