# [R] which(df\$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

Peter Cowan cowan.pd at gmail.com
Wed Aug 13 04:31:33 CEST 2008

```
Emmanuel,

On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:
> Dear All,
>
> I have a large data frame ( 2700000 lines and 14 columns), and I would like to
> extract the information in a particular way illustrated below:
>
>
> Given a data frame "df":
>
>> col1=sample(c(0,1),10, rep=T)
>> names = factor(c(rep("A",5),rep("B",5)))
>> df = data.frame(names,col1)
>> df
>   names col1
> 1      A    1
> 2      A    0
> 3      A    1
> 4      A    0
> 5      A    1
> 6      B    0
> 7      B    0
> 8      B    1
> 9      B    0
> 10     B    0
>
> I would like to transform it into the form:
>
>> index = c("A","B")
>> col1[[1]]=df$col1[which(df$name=="A")]
>> col1[[2]]=df$col1[which(df$name=="B")]

I'm not sure I fully understand your problem; your example would not run for me.

You could get a small speedup by omitting which(): you can subset
directly with the logical vector instead.

> n <- 2700000
> foo <- data.frame(
+ 	one = sample(c(0,1), n, rep = T),
+ 	two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
+ 	)
> system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761
> system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588

You might also find use for unstack(), though I didn't see a speedup.

> system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004

HTH

Peter

> My problem is that the command:  *** which(df$name=="A") ***
> takes about 1 second because df is so big.
>
> I was thinking that a "level" could maybe be accessed instantly but I am not
> sure about how to do it.
>
> I would be very grateful for any advice that would allow me to speed this up.
>
> Best wishes,
>
> Emmanuel

```
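The two suggestions in the reply (dropping which() and using unstack()) can be checked on a small stand-in data frame. The sketch below is only illustrative: the size `n <- 10` and the `set.seed()` call are assumptions added for reproducibility, and `split()` is not mentioned in the thread. It is shown here as one possible base R way to group a column by every factor level in a single call rather than running one comparison per level.

```r
# Small stand-in for the large data frame in the thread
# (n and the seed are illustrative assumptions).
set.seed(1)
n <- 10
foo <- data.frame(
  one = sample(c(0, 1), n, replace = TRUE),
  two = factor(c(rep("A", n / 2), rep("B", n / 2)))
)

# Subsetting through which() and through the bare logical vector returns
# the same values as long as the comparison contains no NA.
by_which   <- foo$one[which(foo$two == "A")]
by_logical <- foo$one[foo$two == "A"]
stopifnot(identical(by_which, by_logical))

# split() groups a vector by the levels of a factor and returns a named
# list with one element per level.
by_level <- split(foo$one, foo$two)
stopifnot(identical(by_level[["A"]], by_which))

# unstack() with an explicit formula produces the same grouping; because
# both groups have equal length here, the result is a data frame.
unstacked <- unstack(foo, one ~ two)
stopifnot(identical(unstacked$A, by_which))
```

One caveat on dropping which(): if the factor column contains NA, logical subsetting returns NA entries for those rows, whereas which() silently drops them, so the two forms are interchangeable only when the column has no missing values.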