[Rd] extracting rows from a data frame by looping over the row names: performance issues
Seth Falcon
sfalcon at fhcrc.org
Sat Mar 3 06:48:15 CET 2007
Herve Pages <hpages at fhcrc.org> writes:
> So apparently here extracting with dat[i, ] is 300 times faster than
> extracting with dat[key, ] !
>
>> system.time(for (i in 1:100) dat["1", ])
> user system elapsed
> 12.680 0.396 13.075
>
>> system.time(for (i in 1:100) dat[1, ])
> user system elapsed
> 0.060 0.076 0.137
>
> Good to know!
I think what you are seeing here has to do with the space-efficient
storage of row.names of a data.frame. The example data you are
working with has no specified row names and so they get stored in a
compact fashion:
mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
dat <- as.data.frame(mat)
> typeof(attr(dat, "row.names"))
[1] "integer"
In the call to [.data.frame when i is character, the appropriate index
is found using pmatch and this requires that the row names be
converted to character. So in a loop, you get to convert the integer
vector to a character vector at each iteration.
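The cost described above can be sketched roughly as follows. This is only an illustration, not the actual body of [.data.frame: the data frame size is arbitrary, and the names rn, rn_chr, idx_each_time and idx_reused are made up for the example.

```r
# Small stand-in for the data frame in the thread (size chosen arbitrarily).
n <- 100000L
dat <- data.frame(x = seq_len(n))

# Automatic row names come back as an integer vector.
rn <- attr(dat, "row.names")
typeof(rn)

key <- "2"

# Roughly what a character lookup has to do on every call: coerce the
# whole integer vector to character, then match the key against it.
idx_each_time <- pmatch(key, as.character(rn), duplicates.ok = TRUE)

# Coercing once and reusing the result avoids repeating that work.
rn_chr <- as.character(rn)
idx_reused <- pmatch(key, rn_chr, duplicates.ok = TRUE)
```

Both lookups find the same row; the difference is only in how often the coercion is paid for.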
If you assign character row names, things will be a bit faster:
# before
system.time(for (i in 1:25) dat["2", ])
user system elapsed
9.337 0.404 10.731
# this looks funny, but has the desired result
rownames(dat) <- rownames(dat)
typeof(attr(dat, "row.names"))
# after
system.time(for (i in 1:25) dat["2", ])
user system elapsed
0.343 0.226 0.608
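A related sketch, offered as a suggestion rather than something measured in this thread: when looping over many row names, the names can be resolved to integer positions once with match() and the loop can then index by position. Note the assumption that exact matching is what you want; match() does not do the partial matching that pmatch does. The toy data frame and the names keys, pos and rows are hypothetical.

```r
# Toy data frame with automatic row names "1" .. "5".
dat <- data.frame(a = 1:5, b = letters[1:5])

# Row names to extract, in the order they are wanted.
keys <- c("4", "2", "5")

# Resolve all names to integer positions in a single pass.
pos <- match(keys, rownames(dat))

# Index by integer position inside the loop.
rows <- lapply(pos, function(i) dat[i, ])
```

A key that is not present yields NA in pos, so it may be worth checking anyNA(pos) before the loop.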
And you probably would have seen this if you had looked at the
profiling data:
Rprof()
for (i in 1:25) dat["2", ]
Rprof(NULL)
summaryRprof()
+ seth