[R] lapply (and friends) with data.frames are slow

R. Michael Weylandt michael.weylandt at gmail.com
Sat Jan 5 21:46:47 CET 2013


On Sat, Jan 5, 2013 at 7:38 PM, Kevin Ushey <kevinushey at gmail.com> wrote:
> Hey guys,
>
> I noticed something curious in the lapply call. I'll copy+paste the
> function call here because it's short enough:
>
> lapply <- function (X, FUN, ...)
> {
>     FUN <- match.fun(FUN)
>     if (!is.vector(X) || is.object(X))
>         X <- as.list(X)
>     .Internal(lapply(X, FUN))
> }
>
> Notice that lapply coerces X to a list if the
> !is.vector(X) || is.object(X) check passes.
>
> Curiously, data.frames fail the test (is.vector(data.frame()) returns
> FALSE); but it seems that coercion of a data.frame
> to a list would be unnecessary for the *apply family of functions.
>
> Is there a reason why we must coerce data.frames to list for these
> functions? I thought data.frames were essentially just 'structured lists'?
>
> I ask because it is generally quite slow coercing a (large) data.frame to a
> list, and it seems like this could be avoided for data.frames.

Not sure it's a huge deal, but it does seem to be an avoidable function
call, with something like this:

lapply1 <- function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!((is.vector(X) && !is.object(X)) || is.data.frame(X)))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
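
To see which inputs each condition sends through as.list(), the two
checks can be pulled out into standalone predicates. This is only a
sketch: the helper names coerced_original and coerced_modified are made
up for illustration and are not part of base R.

```r
# TRUE means lapply would call as.list(X) before the internal loop.
# Both helper names are hypothetical, used only for this comparison.
coerced_original <- function(X) !is.vector(X) || is.object(X)
coerced_modified <- function(X) {
    !((is.vector(X) && !is.object(X)) || is.data.frame(X))
}

df <- data.frame(a = 1:3, b = c("x", "y", "z"))

is.list(df)      # a data.frame is stored as a list of columns
is.vector(df)    # FALSE, because it carries class and row.names attributes

# Inputs to compare: a data.frame, a plain list, and a factor
inputs <- list(df = df, plain_list = list(1, 2), fac = factor("a"))
data.frame(
    original = vapply(inputs, coerced_original, logical(1)),
    modified = vapply(inputs, coerced_modified, logical(1))
)
```

Only the data.frame row should differ between the two columns; plain
lists skip the coercion under both checks, and other classed objects
such as factors are still coerced under both.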

On a microbenchmark:

xx <- data.frame(rnorm(5e7), rexp(5e7), runif(5e7))
xx <- cbind(xx, xx, xx, xx, xx)

system.time(lapply(xx, range))
system.time(lapply1(xx, range))

It saves me about 50% of the time -- that's of course only with a
relatively cheap FUN argument, where the as.list() overhead makes up a
large share of the total.
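
For anyone without the memory for 5e7 rows, a scaled-down sketch along
these lines separates the as.list() step from the total lapply() time.
The row count n, the repetition count reps and the column names are
arbitrary choices for illustration, and timings will vary with machine
and R version, so no numbers are given here.

```r
set.seed(1)
n    <- 1e5   # far fewer rows than the 5e7 used above (arbitrary choice)
reps <- 50    # repeat so the timings are measurable (arbitrary choice)
xx <- data.frame(a = rnorm(n), b = rexp(n), c = runif(n))

# Time the coercion alone, then the full lapply() call
t_coerce <- system.time(for (i in seq_len(reps)) as.list(xx))
t_total  <- system.time(for (i in seq_len(reps)) lapply(xx, range))

print(t_coerce)
print(t_total)

# Coercing first does not change the result, only the cost
res  <- lapply(xx, range)
same <- identical(res, lapply(as.list(xx), range))
```

Comparing t_coerce with t_total gives a rough idea of how much of the
call is spent in as.list() for a cheap FUN such as range.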

Others will hopefully comment more

M



