[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Thomas Lumley tlumley at u.washington.edu
Mon Jul 31 16:19:01 CEST 2006


On Sat, 29 Jul 2006, Kevin B. Hendricks wrote:

> Hi Bill,
>
>>>>    sum : igroupSums
>
> Okay, after thinking about this ...
>
> # assumes i is the small integer factor with n levels
> # v is some long vector
> # no sorting required
>
> igroupSums <- function(v,i) {
>   sums <- rep(0,max(i))
>   for (j in seq_along(v)) {
>       sums[[i[[j]]]] <- sums[[i[[j]]]] + v[[j]]
>   }
>   sums
> }
>
> If written in Fortran or C, this might be faster than using split.  It
> is at least linear in time in the length of vector v.

For sums you should look at rowsum().  It uses a hash table in C and, last
time I looked, was faster than using split().  It returns a one-column
matrix with one row per distinct group, which is easily converted to a
plain vector.
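
A minimal sketch of the rowsum() suggestion, next to a split()-based
version for comparison. The data values are made up for illustration; v and
i follow the naming in the quoted message. For a plain vector, rowsum()
should give a one-column matrix with one row per distinct group value, which
as.vector() flattens.

```r
# Hypothetical example data: v is the value vector, i the small integer factor.
v <- c(1.5, 2.0, 3.5, 4.0, 5.0)
i <- c(1L, 2L, 1L, 3L, 2L)

# rowsum() sums v within each group of i. Rows are ordered by sorted
# group value (reorder = TRUE is the default).
sums_rowsum <- rowsum(v, i)

# The split()-based equivalent, for comparison.
sums_split <- sapply(split(v, i), sum)

# Flatten the one-column matrix to a plain numeric vector.
sums <- as.vector(sums_rowsum)
```

Which of the two is faster will depend on the R version and the data, so it
is worth timing both with system.time() on realistic input.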

The same approach would work for min, max, range, count, mean, but not for 
arbitrary functions.
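
What these statistics have in common is that each can be kept up to date
with a small running value per group, so one pass over the data is enough.
A rough sketch in plain R, extending the loop from the quoted message:
igroupStats is a hypothetical name, and the code assumes i holds integers in
1..n and that neither v nor i is empty or contains NA.

```r
# igroupStats is a hypothetical helper, named after igroupSums above.
# Assumes: i has integer codes in 1..n, v is numeric, no NAs, length > 0.
igroupStats <- function(v, i, n = max(i)) {
  count <- integer(n)      # running count per group
  total <- numeric(n)      # running sum per group
  lo <- rep(Inf, n)        # running minimum per group
  hi <- rep(-Inf, n)       # running maximum per group
  for (j in seq_along(v)) {
    g <- i[[j]]
    x <- v[[j]]
    count[g] <- count[g] + 1L
    total[g] <- total[g] + x
    if (x < lo[g]) lo[g] <- x
    if (x > hi[g]) hi[g] <- x
  }
  # The mean follows from sum and count; the range is the (min, max) pair.
  data.frame(count = count, mean = total / count, min = lo, max = hi)
}

stats <- igroupStats(c(1.5, 2.0, 3.5, 4.0, 5.0), c(1L, 2L, 1L, 3L, 2L))
```

An exact median, by contrast, needs all of a group's values together, which
is why the trick does not carry over to arbitrary functions.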

 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle


