[R] Lookups in R

Martin Morgan mtmorgan at fhcrc.org
Wed Jul 4 23:54:30 CEST 2007


Michael,

A hash provides constant-time access, though the resulting Perl-esque
data structures (a hash of lists, e.g.) are not convenient for other
manipulations.

> n_accts <- 10^3
> n_trans <- 10^4
> t <- list()
> t$amt <- runif(n_trans)
> t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
> 
> uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
> ## keys, presumably account ids
> for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
> 
> system.time(for (i in seq_along(t$amt)) {
+     acct <- t$acct[i]
+     x <- uhash[[acct]]
+     uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
+ })
   user  system elapsed 
  0.264   0.000   0.262 
> udf <- data.frame(amt=0, n=rep(0L, n_accts),
+                   row.names=as.character(1:n_accts))
> system.time(for (i in seq_along(t$amt)) {
+     idx <- row.names(udf)==t$acct[i]
+     udf[idx, ] <- c(udf[idx,"amt"], udf[idx, "n"]) + c(t$amt[i], 1)
+ })
   user  system elapsed 
 18.398   0.000  18.394 
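Once the loop has run, the environment "hash" can be flattened back
into a data frame for further manipulation. This is a sketch, not part
of the original message; the names (uhash, amt, n) follow the example
above, and eapply()/do.call(rbind, ...) are one way among several:

```r
## Rebuild a small uhash as in the example above, with one
## account pretending to have seen updates
n_accts <- 3
uhash <- new.env(hash = TRUE, parent = emptyenv(), size = n_accts)
for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt = 0, n = 0)
uhash[["2"]] <- list(amt = 1.5, n = 2)

## eapply() applies a function to every object in the environment;
## do.call(rbind, ...) stacks the one-row data frames, with the
## account ids becoming the row names
res <- do.call(rbind, eapply(uhash, function(x) data.frame(amt = x$amt, n = x$n)))
res <- res[order(as.integer(rownames(res))), ]
```

Note that eapply() returns the objects in no guaranteed order, hence
the explicit sort on the row names at the end.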

Peter Dalgaard <p.dalgaard at biostat.ku.dk> writes:

> mfrumin wrote:
>> Hey all; I'm a beginner++ user of R, trying to use it to do some processing
>> of data sets of over 1M rows, and running into a snafu.  Imagine that my
>> input is a huge table of transactions, each linked to a specific user id.  As
>> I run through the transactions, I need to update a separate table for the
>> users, but I am finding that the traditional ways of doing a table lookup
>> are way too slow to support this kind of operation.
>>
>> i.e.:
>>
>> for(i in 1:1000000) {
>>    userid <- transactions$userid[i]
>>    amt <- transactions$amounts[i]
>>    users[users$id == userid, 'amt'] <- users[users$id == userid, 'amt'] + amt
>> }
>>
>> I assume this is a linear lookup through the users table (in which there are
>> tens of thousands of rows), when really what I need is O(1), or
>> at worst O(log(# users)).
>>
>> is there any way to manage a list of ID's (be they numeric, string, etc) and
>> have them efficiently mapped to some other table index?
>>
>> I see the CRAN package for SQLite hashes, but that seems to be going a bit
>> too far.
>>   
> Sometimes you need a bit of lateral thinking. I suspect that you could 
> do it like this:
>
> tbl <- with(transactions, tapply(amount, userid, sum))
> users$amt <- users$amt + tbl[users$id]
>
> One catch is that there could be users with no transactions, in which 
> case you may need to replace userid with factor(userid, levels=users$id). 
> None of this is tested, of course.
>
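Peter's suggestion, which he notes is untested, can be checked with a
small self-contained sketch. The users/transactions data here are made
up for illustration; the factor(levels=...) step is his proposed fix for
users with no transactions, whose entries come back as NA from tapply:

```r
## Toy data: user "b" has no transactions
users <- data.frame(id = c("a", "b", "c"), amt = c(0, 0, 0))
transactions <- data.frame(userid = c("a", "a", "c"),
                           amount = c(1, 2, 5))

## Sum amounts per user in one vectorized pass; the factor levels
## guarantee one entry per user, NA where a user had no transactions
tbl <- with(transactions,
            tapply(amount, factor(userid, levels = users$id), sum))
tbl[is.na(tbl)] <- 0

## Named-vector lookup replaces the per-row linear scan
users$amt <- users$amt + tbl[as.character(users$id)]
```

The tapply() result is a named vector indexed by user id, so the final
update is a single vectorized lookup rather than a million-iteration loop.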

-- 
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org
