[Rd] extracting rows from a data frame by looping over the row names: performance issues

Sat Mar 3 06:48:15 CET 2007

Herve Pages <hpages at fhcrc.org> writes:
> So apparently here extracting with dat[i, ] is 300 times faster than
> extracting with dat[key, ] !
>
>> system.time(for (i in 1:100) dat["1", ])
>    user  system elapsed
>  12.680   0.396  13.075
>
>> system.time(for (i in 1:100) dat[1, ])
>    user  system elapsed
>   0.060   0.076   0.137
>
> Good to know!

I think what you are seeing here has to do with the space-efficient
storage of the row.names of a data.frame.  The example data you are
working with has no specified row names, so they get stored in a
compact fashion:

    mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
    dat <- as.data.frame(mat)

    > typeof(attr(dat, "row.names"))
    [1] "integer"

In the call to [.data.frame when i is character, the appropriate index
is found using pmatch, and this requires that the row names be
converted to character.  So in a loop, you end up converting the integer
vector to a character vector at each iteration.
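
To make the mechanism concrete, here is a small sketch on a toy data
frame.  The names df, rn and idx are just for illustration, and the
pmatch call only approximates what [.data.frame does internally; it is
not a copy of that code.

```r
# Toy data frame with automatic (unspecified) row names
df <- data.frame(x = 1:5, y = letters[1:5])

# A negative count from .row_names_info() signals automatic row names
# that are stored compactly rather than as a character vector
n_info <- .row_names_info(df)

# The row.names attribute is seen as an integer vector
rn_type <- typeof(attr(df, "row.names"))

# A character key has to be matched against character row names, so
# the integer vector must be converted first.  Inside a loop this
# conversion would be repeated on every iteration.
rn  <- as.character(attr(df, "row.names"))
idx <- pmatch("2", rn, duplicates.ok = TRUE)
```

With 1.5 million rows, as in your example, it is that as.character step
that gets expensive when repeated.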

If you assign character row names, things will be a bit faster:

    # before
    system.time(for (i in 1:25) dat["2", ])
       user  system elapsed 
      9.337   0.404  10.731 

    # this looks funny, but has the desired result
    rownames(dat) <- rownames(dat)
    typeof(attr(dat, "row.names"))

    # after
    system.time(for (i in 1:25) dat["2", ])
       user  system elapsed 
      0.343   0.226   0.608 
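
Another option, if you really need to loop over keys, is to translate
the keys to integer positions once, before the loop, and then index by
position.  This is only a sketch on a toy data frame, not something I
have timed on your data; df, keys and pos are made-up names, and match
does exact matching only, whereas pmatch also allows partial matches.

```r
# Toy data frame with automatic row names
df <- data.frame(x = 1:5, y = letters[1:5])

# Hypothetical keys we want to look up, given as character row names
keys <- c("4", "2", "5")

# Convert the row names to character a single time and find each key
pos <- match(keys, rownames(df))

# Inside the loop, index by integer position only
xs <- integer(length(pos))
for (k in seq_along(pos)) {
    xs[k] <- df[pos[k], "x"]
}
```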

And you probably would have seen this if you had looked at the
profiling data:

    Rprof()
    for (i in 1:25) dat["2", ]
    Rprof(NULL)
    summaryRprof()

+ seth