[R] Lookups in R

deepayan.sarkar at gmail.com deepayan.sarkar at gmail.com
Thu Jul 5 02:15:25 CEST 2007


On 7/4/07, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Michael,
>
> A hash provides constant-time access, though the resulting perl-esque
> data structures (a hash of lists, e.g.) are not convenient for other
> manipulations
>
> > n_accts <- 10^3
> > n_trans <- 10^4
> > t <- list()
> > t$amt <- runif(n_trans)
> > t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
> >
> > uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
> > ## keys, presumably account ids
> > for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
> >
> > system.time(for (i in seq_along(t$amt)) {
> +     acct <- t$acct[i]
> +     x <- uhash[[acct]]
> +     uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
> + })
>    user  system elapsed
>   0.264   0.000   0.262
> > udf <- data.frame(amt=0, n=rep(0L, n_accts),
> +                   row.names=as.character(1:n_accts))
> > system.time(for (i in seq_along(t$amt)) {
> +     idx <- row.names(udf)==t$acct[i]
> +     udf[idx, ] <- c(udf[idx,"amt"], udf[idx, "n"]) + c(t$amt[i], 1)
> + })
>    user  system elapsed
>  18.398   0.000  18.394

I don't think that's a fair comparison--- much of the overhead comes
from the use of data frames and the creation of the indexing vector. I
get

> n_accts <- 10^3
> n_trans <- 10^4
> t <- list()
> t$amt <- runif(n_trans)
> t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
> uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
> for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
> system.time(for (i in seq_along(t$amt)) {
+     acct <- t$acct[i]
+     x <- uhash[[acct]]
+     uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
+ }, gcFirst = TRUE)
   user  system elapsed
  0.508   0.008   0.517
> udf <- matrix(0, nrow = n_accts, ncol = 2)
> rownames(udf) <- as.character(1:n_accts)
> colnames(udf) <- c("amt", "n")
> system.time(for (i in seq_along(t$amt)) {
+     idx <- t$acct[i]
+     udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
+ }, gcFirst = TRUE)
   user  system elapsed
  1.872   0.008   1.883

The loop is still going to be the problem for realistic examples.

-Deepayan



More information about the R-help mailing list