[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Bill Dunlap bill at insightful.com
Fri Jul 28 22:51:32 CEST 2006

On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:

> In my particular case, the problem was my data frame had over 1
> million lines had probably over 500,000 unique sort keys (ie. think
> of it as an R factor with over 500,000 levels).  The implementation
> of "by" uses "tapply" which in turn uses "split".  So "split" simply
> ate up all the time trying to create 500,000 vectors each of short
> length 1, 2, or 3; and the associated garbage collection.
> I simple loop that walked the short sequence of values (since the
> data frame was already sorted) calculating what it needed, would work
> much faster than splitting the original vector into so very many
> smaller vectors (and the associated copying of data).
> That problem is very similar problem to the calculation of basic
> stats on a short moving window over a very long vector.
> I will look into that package and maybe use it for a model for what I
> want to do.

Splus8.0 has something like what you are talking about
that provides a fast way to compute
    sapply(split(xVector, integerGroupCode), summaryFunction)
for some common summary functions.  The 'integerGroupCode'
is typically the codes from a factor, but you could compute
it in other ways.  It needs to be a "small" integer in
the range 1:ngroups (like the 'bin' argument to tabulate).
Like tabulate(), which is called from table(), these are
meant to be called from other functions that can set up
appropriate group codes.  E.g., groupSums or rowSums or
fancier things could be based on this.

They do not insist you sort the input in any way.  That
would really only be useful for group medians and we haven't
written that one yet.

The typical prototype is

> igroupSums
function(x, group = NULL, na.rm = F, weights = NULL, ngroups = if(is.null(
        group)) 1 else max(as.integer(group), na.rm = T))

and the currently supported summary functions are
   mean : igroupMeans
   sum : igroupSums
   prod : igroupProds
   min : igroupMins
   max : igroupMaxs
   range : igroupRanges
   any : igroupAnys
   all : igroupAlls
(The plurals are not all properly spelled and they
are plural so match related functions like rowSums.)
It might be useful to also have one to count the number
of non-missing values in each group of x's.

They are fast:

   > x<-runif(2e6)
   > i<-rep(1:1e6, 2)
   > sys.time(sx <- igroupSums(x,i))
   [1] 0.66 0.67
   > length(sx)
   [1] 1000000

On that machine R takes 44 seconds to go the lapply/split

   > unix.time(unlist(lapply(split(x,i), sum)))
   [1] 43.24  0.78 44.11  0.00  0.00

To save setup time in the S code, out-of-range values
in the group argument (negatives, values greater than
ngroup, and NA's), mean that the correponding element
in x is ignored.

Bill Dunlap
Insightful Corporation
bill at insightful dot com

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."

More information about the R-devel mailing list