[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Sat Jul 29 22:05:09 CEST 2006

Hi Bill,

So you wrote one routine that can calculate any single of a variety  
of stats and allows weights, is that right?  Can it return a data  
frame of any subset of requested stats as well (that is what I was  
thinking of doing anyway).

I think someone can easily calculate all of those things in one pass  
through the array and then allow the user to select which of the new  
columns of stats should be added to a data.frame that is returned to  
the user.

To test all of this, I simply wrote my own igroupSums and integrated  
it into r-devel based on the code in split.c.  I can easily modify it  
to handle the case of calculating  a variety of stats (even all at  
the same time if desired).  I do not deal with "weights" at all and  
ignored that for now.

Here is what your test case now shows on my machine with the latest R  
build
with my added igroupSums routine (added internally to R).

 > x <- rnorm(2e6)
 > i <- rep(1:1e6,2)
 > unix.time(Asums <- unlist(lapply(split(x,i),sum)))
[1] 8.940 0.112 9.053 0.000 0.000
 > names(Asums) <- NULL

My version of igroupSums does not keep the names so I remove them to  
make the results comparable.

Here is my my own internal function igroupSums

 > unix.time(Bsums <- igroupSums(x,i))
[1] 0.932 0.024 0.958 0.000 0.000
 >
 > all.equal(Asums, Bsums)
[1] TRUE

So the speed up is quite significant (9.053 seconds vs 0.858 seconds).

I will next modify my code to handle any single one of maxs, mins,  
sums, counts, anys, alls, means, prods, and ranges by user choice.   
Although I will leave the use of weights as unimplemented for now (I  
always get mixed up thinking about weights and basic stats and I  
never use them so ...)

In case others want to play around with this too, here is the R  
wrapper in igroupSums.R to put in src/library/base/R/

igroupSums <- function(x, f, drop = FALSE, ...) UseMethod("igroupSums")

igroupSums.default <- function(x, f, drop=FALSE, ...)
{
     if(length(list(...))) .NotYetUsed(deparse(...), error = FALSE)

     if (is.list(f)) f <- interaction(f, drop = drop)
     else if (drop || !is.factor(f)) # drop extraneous levels
         f <- factor(f)
     storage.mode(f) <- "integer"  # some factors have double
     if (is.null(attr(x, "class")))
         return(.Internal(igroupSums(x, f)))
     ## else
     r <- by(x,f,sum)
     r
}

igroupSums.data.frame <- function(x, f, drop = FALSE, ...)
     lapply(split(seq(length=nrow(x)), f, drop = drop, ...),
            function(ind) x[ind, , drop = FALSE])

And here is a very simple igroupSums.c to put in src/main/

It still needs a lot of work since it does not handle NAs in the  
vector x yet and still needs to be modified into a general routine to  
handle any single function of counts, sums, maxs, mins, means, prods,  
anys, alls, and ranges

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include "Defn.h"

SEXP attribute_hidden do_igroupSums(SEXP call, SEXP op, SEXP args,  
SEXP env)
{
     SEXP x, f, sums;
     int i, j, nobs, nlevs, nfac;

     checkArity(op, args);

     x = CAR(args);
     f = CADR(args);
     if (!isVector(x))
         errorcall(call, _("first argument must be a vector"));
     if (!isFactor(f))
         errorcall(call, _("second argument must be a factor"));
     nlevs = nlevels(f);
     nfac = LENGTH(CADR(args));
     nobs = LENGTH(CAR(args));
     if (nobs <= 0)
         return R_NilValue;
     if (nfac <= 0)
         errorcall(call, _("Group length is 0 but data length > 0"));
     if (nobs % nfac != 0)
         warningcall(call, _("data length is not a multiple of split  
variable"));

     PROTECT(sums = allocVector(TYPEOF(x), nlevs));
     switch (TYPEOF(x)) {
     case INTSXP:
         for (i=0; i < nlevs; i++) INTEGER(sums)[i] = 0;
         break;
     case REALSXP:
         for (i=0; i < nlevs; i++) REAL(sums)[i] = 0.0;
         break;
     default:
         UNIMPLEMENTED_TYPE("igroupSums", x);
     }
     for (i = 0;  i < nobs; i++) {
         j = INTEGER(f)[i % nfac];
         if (j != NA_INTEGER) {
             j--;
             switch (TYPEOF(x)) {
             case INTSXP:
                 INTEGER(sums)[j] = INTEGER(sums)[j] + INTEGER(x)[i];
                 break;
             case REALSXP:
                 REAL(sums)[j] = REAL(sums)[j] + REAL(x)[i];
                 break;
             default:
                 UNIMPLEMENTED_TYPE("igroupSums", x);
             }
         }
     }
     UNPROTECT(1);
     return sums;
}

If anyone is playing with this themselves, don't forget to update  
Internal.h and names.c to reflect the added routine before you make  
clean and then rebuild.

Once I finish, I will post me patches here and then if someone would  
like to modify them to implement "weights", please let me know.

Even if these never get added to R I can keep them in my own tree and  
use them for my own work.

Thanks again for all of your hints and guidance.  This alone will  
speed up my R code greatly!

Kevin

> That is roughly what I did in C code for the Splus version.
> E.g., here is the integer x case for sums and means.  It accumlates
> the weighted group sum and the group weight so that if we
> are doing the mean it has the right divisor.  It would
> go a bit faster if I didn't bother with the group weight
> in the sum case.
>
> (I was mostly interested in seeing if this function's interface
> was sufficiently general for your uses.  Computing the integer
> group codes can sometimes be a pain and there are cases where you
> might want to combine that with computing the grouped summaries.
> I am guessing that in most cases you want those two functions
> to be separate.)
>
> case S_SUM: /*FALLTHROUGH*/
> case S_MEAN:
>     for(i=0;i<length;i++) {
>         bin = binp ? *binp++ - 1 : 0 ; /* bin is now 0-based */
>         weight = weighted ? *weightp++ : 1.0 ;
>         x = *xp++ ;
>         if (is_na_INT(&bin) || bin<0 || bin>=maxbin || weight==0.0)
>             continue ;
>         if (is_na_DOUBLE(&groupWeightp[bin]))
>             continue ;
>         if (is_na_DOUBLE(&x) || is_na_DOUBLE(&weight)) {
>             if (!na_rm) {
>                 na_set3(&valuep[bin], value->mode, To_NA);
>                 na_set3(&groupWeightp[bin], groupWeight->mode, To_NA);
>             }
>             continue ;
>         }
>         if (!is_na_DOUBLE(&valuep[bin])) {
>             valuep[bin] += x * weight ;
>             groupWeightp[bin] += weight ; /* groupWeightp and  
> valuep should both have same NA-ness */
>         }
>     }
>     if (which==S_MEAN) {
>         for(ibin=0;ibin<maxbin;ibin++) {
>             if (is_na_DOUBLE(&valuep[ibin])) {
>                 /* leave valuep[ibin] as NA */
>             } else if (groupWeightp[ibin]==0.0) {
>                 /* valuep[ibin] must be 0.0 if groupWeightp[ibin]  
> is */
>                 na_set3(&valuep[ibin], value->mode, To_NaN) ;
>             } else {
>                 valuep[ibin] = valuep[ibin] / groupWeightp[ibin] ;
>             }
>         }
>     }
>     break;