[R] Lookups in R
deepayan.sarkar at gmail.com
deepayan.sarkar at gmail.com
Thu Jul 5 02:15:25 CEST 2007
On 7/4/07, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Michael,
>
> A hash provides constant-time access, though the resulting perl-esque
> data structures (a hash of lists, e.g.) are not convenient for other
> manipulations
>
> > n_accts <- 10^3
> > n_trans <- 10^4
> > t <- list()
> > t$amt <- runif(n_trans)
> > t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
> >
> > uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
> > ## keys, presumably account ids
> > for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
> >
> > system.time(for (i in seq_along(t$amt)) {
> + acct <- t$acct[i]
> + x <- uhash[[acct]]
> + uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
> + })
> user system elapsed
> 0.264 0.000 0.262
> > udf <- data.frame(amt=0, n=rep(0L, n_accts),
> + row.names=as.character(1:n_accts))
> > system.time(for (i in seq_along(t$amt)) {
> + idx <- row.names(udf)==t$acct[i]
> + udf[idx, ] <- c(udf[idx,"amt"], udf[idx, "n"]) + c(t$amt[i], 1)
> + })
> user system elapsed
> 18.398 0.000 18.394
I don't think that's a fair comparison--- much of the overhead comes
from the use of data frames and the creation of the indexing vector. I
get
> n_accts <- 10^3
> n_trans <- 10^4
> t <- list()
> t$amt <- runif(n_trans)
> t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
> uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
> for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
> system.time(for (i in seq_along(t$amt)) {
+ acct <- t$acct[i]
+ x <- uhash[[acct]]
+ uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
+ }, gcFirst = TRUE)
user system elapsed
0.508 0.008 0.517
> udf <- matrix(0, nrow = n_accts, ncol = 2)
> rownames(udf) <- as.character(1:n_accts)
> colnames(udf) <- c("amt", "n")
> system.time(for (i in seq_along(t$amt)) {
+ idx <- t$acct[i]
+ udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
+ }, gcFirst = TRUE)
user system elapsed
1.872 0.008 1.883
The loop is still going to be the problem for realistic examples.
-Deepayan
More information about the R-help
mailing list