[Rd] extracting rows from a data frame by looping over the row names: performance issues
hpages at fhcrc.org
Sat Mar 3 09:22:23 CET 2007
Hi Seth,
Quoting Seth Falcon <sfalcon at fhcrc.org>:
> Herve Pages <hpages at fhcrc.org> writes:
> > So apparently here extracting with dat[i, ] is 300 times faster than
> > extracting with dat[key, ] !
> >
> >> system.time(for (i in 1:100) dat["1", ])
> >    user  system elapsed
> >  12.680   0.396  13.075
> >
> >> system.time(for (i in 1:100) dat[1, ])
> >    user  system elapsed
> >   0.060   0.076   0.137
> >
> > Good to know!
>
> I think what you are seeing here has to do with the space efficient
> storage of row.names of a data.frame. The example data you are
> working with has no specified row names and so they get stored in a
> compact fashion:
>
> mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> dat <- as.data.frame(mat)
>
> > typeof(attr(dat, "row.names"))
> [1] "integer"
>
> In the call to [.data.frame when i is character, the appropriate index
> is found using pmatch and this requires that the row names be
> converted to character. So in a loop, you get to convert the integer
> vector to character vector at each iteration.
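To make the point above concrete, here is a minimal sketch on a smaller data frame than the 300000-row example. The explicit as.character() call is only there to make the coercion visible; it stands in for what happens to integer row names before pmatch() can compare them with a character key.

```r
# Small stand-in for the 300000-row example above.
dat <- as.data.frame(matrix("x", nrow = 1000, ncol = 5))

rows <- attr(dat, "row.names")
typeof(rows)   # "integer": default row names are not stored as character

# pmatch() compares character vectors, so integer row names have to be
# coerced first. Inside a loop this coercion is repeated on every pass.
idx <- pmatch("2", as.character(rows), duplicates.ok = TRUE)
idx            # 2
```

With 300000 rows, that coercion dominates the cost of each dat["1", ] call.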
Maybe this could be avoided. Why do you need to call pmatch when
the row names are integer?
In [.data.frame, if you replace this:
    ...
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    ...
with this:
    ...
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        if (typeof(rows) == "integer")
            i <- as.integer(i)
        else
            i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    ...
then you get a huge boost:
- with current [.data.frame
    > system.time(for (i in 1:100) dat["1", ])
       user  system elapsed
     34.994   1.084  37.915
- with "patched" [.data.frame
    > system.time(for (i in 1:100) dat["1", ])
       user  system elapsed
      0.264   0.068   0.364
but maybe I'm missing something...
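One case the as.integer() shortcut may not cover is integer row names that are not 1:n (for example after subsetting), where the key is a name rather than a position. The sketch below looks the key up with match() instead. lookup_row is a hypothetical helper written for illustration, not part of base R or of [.data.frame, and it drops partial matching for integer row names, so it is not a drop-in replacement.

```r
# Hypothetical helper (illustration only): resolve a character key against
# row names without converting integer row names to character.
lookup_row <- function(rows, key) {
  if (is.integer(rows)) {
    # Convert the key once and compare as integers; match() also handles
    # integer row names that are not simply 1:n.
    match(suppressWarnings(as.integer(key)), rows)
  } else {
    # Character row names: keep the partial matching done by pmatch().
    pmatch(key, rows, duplicates.ok = TRUE)
  }
}

lookup_row(c(5L, 3L, 10L), "3")   # 2: the row named "3" is the second row
lookup_row(c("a", "b"), "b")      # 2
```

The design choice here is to convert the single key rather than the whole vector of row names, which is where the saving would come from.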
Cheers,
H.
>
> If you assign character row names, things will be a bit faster:
>
> # before
> system.time(for (i in 1:25) dat["2", ])
>    user  system elapsed
>   9.337   0.404  10.731
>
> # this looks funny, but has the desired result
> rownames(dat) <- rownames(dat)
> typeof(attr(dat, "row.names"))
>
> # after
> system.time(for (i in 1:25) dat["2", ])
>    user  system elapsed
>   0.343   0.226   0.608
>
> And you probably would have seen this if you had looked at the
> profiling data:
>
> Rprof()
> for (i in 1:25) dat["2", ]
> Rprof(NULL)
> summaryRprof()
>
>
> + seth