[Rd] extracting rows from a data frame by looping over the row names: performance issues

Sat Mar 3 18:28:22 CET 2007

Quoting hpages at fhcrc.org:
> In [.data.frame if you replace this:
> 
>     ...
>     if (is.character(i)) {
>         rows <- attr(xx, "row.names")
>         i <- pmatch(i, rows, duplicates.ok = TRUE)
>     }
>     ...
> 
> by this
> 
>     ...
>     if (is.character(i)) {
>         rows <- attr(xx, "row.names")
>         if (typeof(rows) == "integer")
>             i <- as.integer(i)
>         else
>             i <- pmatch(i, rows, duplicates.ok = TRUE)
>     }
>     ...
> 
> then you get a huge boost:
> 
>   - with current [.data.frame
>     > system.time(for (i in 1:100) dat["1", ])
>        user  system elapsed
>      34.994   1.084  37.915
> 
>   - with "patched" [.data.frame
>     > system.time(for (i in 1:100) dat["1", ])
>        user  system elapsed
>       0.264   0.068   0.364
> 

mmmh, replacing
    i <- pmatch(i, rows, duplicates.ok = TRUE)
by just
    i <- as.integer(i)
was a bit naive. It picks the wrong row whenever the row names are not
exactly 1, 2, ..., n (i.e. not a "seq_len" sequence).
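
A minimal sketch of that failure, assuming a made-up vector of integer row names (the names `rows`, `naive` and `exact` are illustrative only and do not come from [.data.frame):

```r
# Hypothetical integer row names that are NOT 1, 2, 3, ...
rows <- c(11L, 25L, 1L, 3L)
i <- "1"

# Naive conversion: treats the name "1" as the position 1,
# which is the row named 11.
naive <- as.integer(i)
rows[naive]

# Exact lookup by name: the row named 1 sits at position 3.
exact <- match(as.integer(i), rows)
rows[exact]
```

Here the naive conversion selects the row named 11, while 'match' finds the row actually named 1.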

So I need to be more careful: first call 'match' to find the exact
matches, then call 'pmatch' _only_ on those indices that don't have
an exact match. For example, something like this:

    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        if (typeof(rows) == "integer") {
            i2 <- match(as.integer(i), rows)
            if (any(is.na(i2)))
                i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows,
                                        duplicates.ok = TRUE)
            i <- i2
        } else {
            i <- pmatch(i, rows, duplicates.ok = TRUE)
        }
    }
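
The same two-step lookup can be tried outside of [.data.frame. This is only a sketch under stated assumptions: `lookup_rows` is a hypothetical helper, not a base R function, and the `suppressWarnings` call is an addition to silence the coercion warning that `as.integer` gives for non-numeric strings.

```r
# Hypothetical helper, not part of base R: resolve character indices
# against integer row names, exact matches first, partial matches second.
lookup_rows <- function(i, rows) {
    stopifnot(is.character(i), is.integer(rows))
    # Non-numeric strings become NA here; silence the coercion warning.
    i2 <- match(suppressWarnings(as.integer(i)), rows)
    miss <- is.na(i2)
    # Fall back to pmatch() only for the indices without an exact match.
    if (any(miss))
        i2[miss] <- pmatch(i[miss], rows, duplicates.ok = TRUE)
    i2
}

rows <- c(11L, 25L, 1L, 3L)
res <- lookup_rows(c("1", "3", "2", "9"), rows)
res
```

One caveat that may be worth checking: `as.integer` also accepts strings such as "01", so this sketch treats them as an exact match for the row named 1, which plain 'pmatch' would not.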

Correctness:

  > dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4,
                       row.names=c(11,25,1,3))
  > dat2
     aa bb
  11  a  1
  25  b  2
  1   c  3
  3   d  4

  > dat2["1",]
    aa bb
  1  c  3

  > dat2["3",]
    aa bb
  3  d  4

  > dat2["2",]
     aa bb
  25  b  2

Performance:

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)
  > system.time(for (i in 1:100) dat["1", ])
     user  system elapsed
    2.036   0.880   2.917

Still about 17 times faster in user time (about 13 times in elapsed time)
than with the non-patched [.data.frame.

Maybe 'pmatch(x, table, ...)' itself could be improved to be
more efficient when 'x' is a character vector and 'table' an
integer vector so the above trick is not needed anymore.

My point is that something can probably be done to improve the
performance of 'dat[i, ]' when the row names are integer and 'i'
a character vector. I'm assuming that, in the typical use-case,
there is an exact match for 'i' in the row names so converting
those row names to a character vector in order to find this match
is (most of the time) a waste of time.
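
A rough way to see the cost of that conversion in isolation, assuming a plain integer vector of 300000 row names rather than a real data frame (the variable names and the loop count of 20 are arbitrary choices for illustration; timings will vary by machine, so none are shown):

```r
# Assumed setup: 300000 integer row names, as in the timing example above.
rows <- seq_len(300000L)
i <- "1"

# pmatch() works on character vectors, so 'rows' is converted on each call.
t_pmatch <- system.time(
    for (k in 1:20) p <- pmatch(i, rows, duplicates.ok = TRUE)
)

# match() on integers avoids converting 'rows' to character.
t_match <- system.time(
    for (k in 1:20) m <- match(as.integer(i), rows)
)

# Compare the two timings; both lookups should agree on the position.
rbind(pmatch = t_pmatch[1:3], match = t_match[1:3])
```

Both calls return the same position here; only the amount of work done on 'rows' differs.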

Cheers,
H.

> but maybe I'm missing something...
> 
> Cheers,
> H.
> 
> > 
> > If you assign character row names, things will be a bit faster:
> > 
> >     # before
> >     system.time(for (i in 1:25) dat["2", ])
> >        user  system elapsed 
> >       9.337   0.404  10.731 
> >     
> >     # this looks funny, but has the desired result
> >     rownames(dat) <- rownames(dat)
> >     typeof(attr(dat, "row.names"))
> >     
> >     # after
> >     system.time(for (i in 1:25) dat["2", ])
> >        user  system elapsed 
> >       0.343   0.226   0.608 
> > 
> > And you probably would have seen this if you had looked at the
> > profiling data:
> > 
> >     Rprof()
> >     for (i in 1:25) dat["2", ]
> >     Rprof(NULL)
> >     summaryRprof()
> > 
> > 
> > + seth
> > 
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >