[R] How to make this for() loop memory efficient?

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Jan 11 01:18:04 CET 2012

I'm having a really difficult time understanding what you're trying to
get -- copying and pasting your code fails to run, and your question
isn't clear, i.e.:

"For each phone call that BEGINS with the module which is denoted by 81
(i.e. of the form 81X,XXX), what is the expected number of modules in these

How does one calculate the expected number of "modules" in this
module? What does that even mean?

Anyway, here's some code using your `data` data.frame that calculates
the number of unique calls and other statistics on the "call id" within
each module prefix. I'm using both data.table and plyr ... there are
no for loops.

You will want to do `whatever it is you really want to do` inside the
"blocks" below.

## R code
library(data.table)
library(plyr)

## the first two characters of `modules` are the module prefix
data <- transform(data, module.prefix=substring(modules, 1, 2))

## take a look at `data` now

## calculate "stuff" inside each module.prefix using data.table
xx <- data.table(data, key="module.prefix")

ans <- xx[, {
  ## the columns of the particular subset of your data.table
  ## are "injected" into the scope for this expression block
  ## which is where the `calls` variable below comes from
  tabled <- table(as.character(calls))
  list(unique.calls=length(tabled), min=min(tabled),
       median=as.numeric(median(tabled)), max=max(tabled))
  ## you will want to return your own list of "stuff"
}, by='module.prefix']

## with plyr
ans <- ddply(data, "module.prefix", function(x) {
  ## `x` is a data.frame whose rows all share the same module.prefix
  ## do whatever you want with it here
  tabled <- table(as.character(x$calls))
  c(unique.calls=length(tabled), min=min(tabled),
    median=median(tabled), max=max(tabled))
})
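
If you want to try the idea without installing anything, here is a
rough base-R sketch of the same split/summarise pattern. The `toy`
data.frame and its values are made up for illustration (only the
`modules` and `calls` column names come from your code), and
`summarise.calls` is just a name I picked, so treat this as a sketch
rather than a drop-in replacement:

```r
## toy stand-in for `data`; the values are invented for illustration
toy <- data.frame(
  modules = c("81123", "81123", "81456", "92001", "92001", "92777"),
  calls   = c(1, 1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
toy$module.prefix <- substring(toy$modules, 1, 2)

## split the rows by prefix: one data.frame per module.prefix
pieces <- split(toy, toy$module.prefix)

## hypothetical helper: summarise how often each call id occurs
summarise.calls <- function(x) {
  tabled <- as.vector(table(as.character(x$calls)))
  c(unique.calls = length(tabled), min = min(tabled),
    median = median(tabled), max = max(tabled))
}

## apply the helper to each piece and stack the results into a matrix
ans.base <- do.call(rbind, lapply(pieces, summarise.calls))
ans.base
```

Each row of `ans.base` is one module prefix. This will not be as fast
as data.table on large data, but it shows the shape of the computation.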

You'll have to read up on the particulars of data.table and plyr. Both
are really powerful packages ... you should get familiar with at least
one of them.

plyr is a bit more flexible in some ways.

data.table is a bit more strict (cf. the need for
`as.numeric(median(tabled))`), but also tends to be (much) faster when
working over large datasets.
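
To see why the `as.numeric(median(tabled))` wrapper matters, here is a
small base-R sketch (the count vectors are made up): `median()` hands
back an integer when there is an odd number of integer counts, but a
double when there is an even number, because it then averages the two
middle values. As far as I understand it, data.table wants each result
column to have the same type in every group, so forcing a double
everywhere sidesteps the mismatch.

```r
## per-call counts as plain integer vectors (toy data, made up)
odd.counts  <- as.vector(table(c("a", "a", "b", "c", "c", "c")))  # 3 counts
even.counts <- as.vector(table(c("a", "a", "b")))                 # 2 counts

typeof(median(odd.counts))              # "integer"
typeof(median(even.counts))             # "double"

## coercing makes the type the same no matter how many counts there are
typeof(as.numeric(median(odd.counts)))  # "double"
```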


Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
