[Rd] Very slow subsetting by name

Thu Jul 15 10:12:10 CEST 2010

Hi,

I'm subsetting a named vector using character indices.
My vector of indices (or keys) is 10x longer than the vector
I'm subsetting. All my keys are distinct and only 10% of them
are valid (i.e. match a name of the vector being subsetted).
It is surprisingly slow:

x1 <- 1:1000
names(x1) <- paste("a", x1, sep="")
keys <- sample(c(names(x1), paste("b", 1:9000, sep="")))
 > system.time(y1 <- x1[keys])
    user  system elapsed
   0.410   0.000   0.416

x2 <- 1:2000
names(x2) <- paste("a", x2, sep="")
keys <- sample(c(names(x2), paste("b", 1:18000, sep="")))
 > system.time(y2 <- x2[keys])
    user  system elapsed
   1.730   0.000   1.736

x3 <- 1:4000
names(x3) <- paste("a", x3, sep="")
keys <- sample(c(names(x3), paste("b", 1:36000, sep="")))
 > system.time(y3 <- x3[keys])
    user  system elapsed
   8.900   0.010   9.227

x4 <- 1:8000
names(x4) <- paste("a", x4, sep="")
keys <- sample(c(names(x4), paste("b", 1:72000, sep="")))
 > system.time(y4 <- x4[keys])
    user  system elapsed
130.390   0.000 132.316

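And it's apparently worse than quadratic in time! Each step
doubles both the vector and the keys, so quadratic growth would
mean about 4x per step, but the user times grow by about 4.2x,
5.1x and then 14.7x.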
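
As a rough check on that claim, the local scaling exponent can be
estimated from the user times reported above. This is only a
sketch: the variable names are made up for illustration, and four
single-run timings are not a careful benchmark.

```r
# User times (seconds) copied from the four system.time() runs above.
user_time <- c(x1 = 0.410, x2 = 1.730, x3 = 8.900, x4 = 130.390)

# Ratio between consecutive runs; the problem size doubles each time.
ratio <- user_time[-1] / user_time[-length(user_time)]

# If time ~ n^k, doubling n multiplies time by 2^k, so k = log2(ratio).
# k = 2 would be quadratic.
exponent <- log2(ratio)
print(round(exponent, 1))
```

The estimated exponent is above 2 at every step and close to 4
for the last doubling.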

I'm wondering why subsetting by name is so slow, since it seems
it could be implemented as x4[match(keys, names(x4))], which is
very fast: only 0.012s!
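
A minimal sketch of that comparison on a small instance of the
same setup (the sizes, the seed and the variable names are chosen
only for illustration). It assumes the names are unique and that
neither names nor keys contain "" or NA, which holds here; for
other inputs the two forms are not guaranteed to agree.

```r
# Small instance of the setup above: 1000 named values, 10000 keys,
# of which only 10% match a name.
set.seed(1)
x <- 1:1000
names(x) <- paste("a", x, sep="")
keys <- sample(c(names(x), paste("b", 1:9000, sep="")))

# Direct subsetting by name (the slow path reported above).
y_name <- x[keys]

# Lookup via match(): keys with no matching name give NA in idx,
# and indexing with NA yields NA in the result.
idx <- match(keys, names(x))
y_match <- x[idx]

# Compare the values returned by the two approaches.
same_values <- identical(unname(y_name), unname(y_match))
n_hits <- sum(!is.na(y_match))
print(same_values)
print(n_hits)
```

Wrapping each of the two subsetting calls in system.time() gives
the timing comparison on a given machine.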

This is with R-2.11.0 and R-2.12.0.

Thanks,
H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319