[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Kevin B. Hendricks
kevin.hendricks at sympatico.ca
Sat Jul 29 22:05:09 CEST 2006
Hi Bill,
So you wrote one routine that can calculate any single of a variety
of stats and allows weights, is that right? Can it return a data
frame of any subset of requested stats as well (that is what I was
thinking of doing anyway).
I think someone can easily calculate all of those things in one pass
through the array and then allow the user to select which of the new
columns of stats should be added to a data.frame that is returned to
the user.
To test all of this, I simply wrote my own igroupSums and integrated
it into r-devel based on the code in split.c. I can easily modify it
to handle the case of calculating a variety of stats (even all at
the same time if desired). I do not deal with "weights" at all and
ignored that for now.
Here is what your test case now shows on my machine with the latest R
build
with my added igroupSums routine (added internally to R).
> x <- rnorm(2e6)
> i <- rep(1:1e6,2)
> unix.time(Asums <- unlist(lapply(split(x,i),sum)))
[1] 8.940 0.112 9.053 0.000 0.000
> names(Asums) <- NULL
My version of igroupSums does not keep the names so I remove them to
make the results comparable.
Here is my my own internal function igroupSums
> unix.time(Bsums <- igroupSums(x,i))
[1] 0.932 0.024 0.958 0.000 0.000
>
> all.equal(Asums, Bsums)
[1] TRUE
So the speed up is quite significant (9.053 seconds vs 0.858 seconds).
I will next modify my code to handle any single one of maxs, mins,
sums, counts, anys, alls, means, prods, and ranges by user choice.
Although I will leave the use of weights as unimplemented for now (I
always get mixed up thinking about weights and basic stats and I
never use them so ...)
In case others want to play around with this too, here is the R
wrapper in igroupSums.R to put in src/library/base/R/
igroupSums <- function(x, f, drop = FALSE, ...) UseMethod("igroupSums")
igroupSums.default <- function(x, f, drop=FALSE, ...)
{
if(length(list(...))) .NotYetUsed(deparse(...), error = FALSE)
if (is.list(f)) f <- interaction(f, drop = drop)
else if (drop || !is.factor(f)) # drop extraneous levels
f <- factor(f)
storage.mode(f) <- "integer" # some factors have double
if (is.null(attr(x, "class")))
return(.Internal(igroupSums(x, f)))
## else
r <- by(x,f,sum)
r
}
igroupSums.data.frame <- function(x, f, drop = FALSE, ...)
lapply(split(seq(length=nrow(x)), f, drop = drop, ...),
function(ind) x[ind, , drop = FALSE])
And here is a very simple igroupSums.c to put in src/main/
It still needs a lot of work since it does not handle NAs in the
vector x yet and still needs to be modified into a general routine to
handle any single function of counts, sums, maxs, mins, means, prods,
anys, alls, and ranges
#ifdef HAVE_CONFIG_H
#include <config.h>
#endif
#include "Defn.h"
SEXP attribute_hidden do_igroupSums(SEXP call, SEXP op, SEXP args,
SEXP env)
{
SEXP x, f, sums;
int i, j, nobs, nlevs, nfac;
checkArity(op, args);
x = CAR(args);
f = CADR(args);
if (!isVector(x))
errorcall(call, _("first argument must be a vector"));
if (!isFactor(f))
errorcall(call, _("second argument must be a factor"));
nlevs = nlevels(f);
nfac = LENGTH(CADR(args));
nobs = LENGTH(CAR(args));
if (nobs <= 0)
return R_NilValue;
if (nfac <= 0)
errorcall(call, _("Group length is 0 but data length > 0"));
if (nobs % nfac != 0)
warningcall(call, _("data length is not a multiple of split
variable"));
PROTECT(sums = allocVector(TYPEOF(x), nlevs));
switch (TYPEOF(x)) {
case INTSXP:
for (i=0; i < nlevs; i++) INTEGER(sums)[i] = 0;
break;
case REALSXP:
for (i=0; i < nlevs; i++) REAL(sums)[i] = 0.0;
break;
default:
UNIMPLEMENTED_TYPE("igroupSums", x);
}
for (i = 0; i < nobs; i++) {
j = INTEGER(f)[i % nfac];
if (j != NA_INTEGER) {
j--;
switch (TYPEOF(x)) {
case INTSXP:
INTEGER(sums)[j] = INTEGER(sums)[j] + INTEGER(x)[i];
break;
case REALSXP:
REAL(sums)[j] = REAL(sums)[j] + REAL(x)[i];
break;
default:
UNIMPLEMENTED_TYPE("igroupSums", x);
}
}
}
UNPROTECT(1);
return sums;
}
If anyone is playing with this themselves, don't forget to update
Internal.h and names.c to reflect the added routine before you make
clean and then rebuild.
Once I finish, I will post me patches here and then if someone would
like to modify them to implement "weights", please let me know.
Even if these never get added to R I can keep them in my own tree and
use them for my own work.
Thanks again for all of your hints and guidance. This alone will
speed up my R code greatly!
Kevin
> That is roughly what I did in C code for the Splus version.
> E.g., here is the integer x case for sums and means. It accumlates
> the weighted group sum and the group weight so that if we
> are doing the mean it has the right divisor. It would
> go a bit faster if I didn't bother with the group weight
> in the sum case.
>
> (I was mostly interested in seeing if this function's interface
> was sufficiently general for your uses. Computing the integer
> group codes can sometimes be a pain and there are cases where you
> might want to combine that with computing the grouped summaries.
> I am guessing that in most cases you want those two functions
> to be separate.)
>
> case S_SUM: /*FALLTHROUGH*/
> case S_MEAN:
> for(i=0;i<length;i++) {
> bin = binp ? *binp++ - 1 : 0 ; /* bin is now 0-based */
> weight = weighted ? *weightp++ : 1.0 ;
> x = *xp++ ;
> if (is_na_INT(&bin) || bin<0 || bin>=maxbin || weight==0.0)
> continue ;
> if (is_na_DOUBLE(&groupWeightp[bin]))
> continue ;
> if (is_na_DOUBLE(&x) || is_na_DOUBLE(&weight)) {
> if (!na_rm) {
> na_set3(&valuep[bin], value->mode, To_NA);
> na_set3(&groupWeightp[bin], groupWeight->mode, To_NA);
> }
> continue ;
> }
> if (!is_na_DOUBLE(&valuep[bin])) {
> valuep[bin] += x * weight ;
> groupWeightp[bin] += weight ; /* groupWeightp and
> valuep should both have same NA-ness */
> }
> }
> if (which==S_MEAN) {
> for(ibin=0;ibin<maxbin;ibin++) {
> if (is_na_DOUBLE(&valuep[ibin])) {
> /* leave valuep[ibin] as NA */
> } else if (groupWeightp[ibin]==0.0) {
> /* valuep[ibin] must be 0.0 if groupWeightp[ibin]
> is */
> na_set3(&valuep[ibin], value->mode, To_NaN) ;
> } else {
> valuep[ibin] = valuep[ibin] / groupWeightp[ibin] ;
> }
> }
> }
> break;
More information about the R-devel
mailing list