[Rd] Fastest non-overlapping binning mean function out there?
Henrik Bengtsson
hb at biostat.ucsf.edu
Thu Oct 4 22:45:30 CEST 2012
Thank you all - I appreciate all your suggestions. My post was mainly
to check whether a fast implementation already exists out there. I've
ended up adopting Martin's first proposal; something like this:
#include <Rdefines.h>
#include <R.h>
#include <R_ext/Error.h>

SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) {
  int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx) - 1;
  double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx);
  SEXP ans = PROTECT(NEW_NUMERIC(nb));
  double *ansp = REAL(ans);
  int retcount = LOGICAL(retCount)[0];
  SEXP count = NULL;
  int *countp = NULL;
  int i = 0, j = 0, n = 0, iStart = 0;
  double sum = 0.0;

  // Assert same lengths of 'x' and 'y'
  if (ny != nx) {
    error("Arguments 'y' and 'x' are of different lengths: %d != %d",
          ny, nx);
  }

  if (retcount) {
    count = PROTECT(NEW_INTEGER(nb));
    countp = INTEGER(count);
  }

  // Skip data points that fall below the first bin
  while (iStart < nx && xp[iStart] < bxp[0]) {
    ++iStart;
  }

  // For each x...
  for (i = iStart; i < nx; ++i) {
    // Move to the bin containing xp[i], closing bins as we pass them
    while (j < nb && xp[i] >= bxp[j+1]) {
      // Record the statistics of the current bin...
      if (retcount) { countp[j] = n; }
      ansp[j] = (n > 0) ? sum / n : 0;
      sum = 0.0;
      n = 0;
      // ...and move to the next one
      ++j;
    }
    // Remaining data points fall beyond the last bin; since 'x' is
    // sorted, we are done
    if (j >= nb) break;
    sum += yp[i];
    n += 1;
  }

  // Record the statistics of the last bin reached...
  if (j < nb) {
    if (retcount) { countp[j] = n; }
    ansp[j] = (n > 0) ? sum / n : 0;
    ++j;
  }
  // ...and zero out any trailing empty bins
  for (; j < nb; ++j) {
    ansp[j] = 0;
    if (retcount) { countp[j] = 0; }
  }

  if (retcount) {
    setAttrib(ans, install("count"), count);
    UNPROTECT(1); // 'count'
  }
  UNPROTECT(1); // 'ans'

  return ans;
} // binMeans()
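
For the record, this is roughly how I compile and call it from R (a
minimal sketch; the file name binMeans.c, the simulated data, and the
tapply() cross-check are mine for illustration only):

# Compile with: R CMD SHLIB binMeans.c
dyn.load("binMeans.so")  # ".dll" on Windows

x <- sort(runif(1e5))
y <- rnorm(1e5)
bx <- seq(0, 1, length.out = 101)  # 100 left-closed bins [bx[b], bx[b+1])

z <- .Call("binMeans", y, x, bx, TRUE)
str(attr(z, "count"))  # per-bin counts

# Cross-check against a plain-R reference
z0 <- unname(tapply(y, cut(x, breaks = bx, right = FALSE), mean))
stopifnot(all.equal(z0, z, check.attributes = FALSE))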
BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6)
and not a scalar (10^10), but the misinterpretation got Martin to
point out some of the exciting work extending R core to support "very
large" integers. And, as Herve pointed out, I forgot to say that the
number of bins could be of that range as well, or maybe an order of
magnitude less (say ~1-100 data points per bin).
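
Regarding Martin's suggestion below to iterate over chunks of 'x',
collecting the sufficient statistics (n, sum) per bin and computing
the means at the end, here is a rough plain-R sketch of how I picture
it (binMeansChunked, the chunk size, and the findInterval()/split()
bookkeeping are my own illustrative choices, not Martin's code):

binMeansChunked <- function(y, x, bx, chunkSize = 1e6) {
  nb <- length(bx) - 1L
  sums <- numeric(nb)
  counts <- numeric(nb)
  for (from in seq(1, length(x), by = chunkSize)) {
    to <- min(from + chunkSize - 1, length(x))
    xi <- x[from:to]
    yi <- y[from:to]
    # Bin index per data point; 0 and nb+1 flag out-of-range points
    b <- findInterval(xi, bx)
    keep <- (b >= 1L & b <= nb)
    b <- factor(b[keep], levels = seq_len(nb))
    # Accumulate the sufficient statistics (sum, n) for each bin
    sums <- sums + vapply(split(yi[keep], b), sum, numeric(1))
    counts <- counts + tabulate(b, nbins = nb)
  }
  ifelse(counts > 0, sums / counts, 0)
}

Note that findInterval() is fine here, because its 'vec' argument is
the short 'bx' rather than the long 'x'; and since each chunk only
contributes a (sums, counts) pair, the chunks could be processed in
parallel and combined afterwards.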
/Henrik
On Wed, Oct 3, 2012 at 10:50 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 10/3/2012 6:47 AM, Martin Morgan wrote:
>>
>> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
>>>
>>> Hi,
>>>
>>> I'm looking for a super-duper fast mean/sum binning implementation
>>> available in R, and before implementing z = binnedMeans(x, y) in
>>> native code myself, does anyone know of an existing function/package
>>> for this? I'm sure it already exists. So, given data (x,y) and B bins
>>> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
>>> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
>>> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's
>>> assume there are no missing values and that 'x' and 'bx' are already
>>> ordered. The length of 'x' is on the order of 10,000-millions. The
>>> number of elements in each bin varies.
>>
>>
>> since x and bx are ordered (sorting x would be expensive), the C code
>> seems pretty straightforward and memory-efficient -- create a result
>> vector as long as bx, then iterate through x, accumulating n and its
>> sum until x[i] >= bx[j]. (I think R's implementation of mean()
>> actually pays more attention to numerical error, but with this much
>> data...)
>>
>> library(inline)
>> binmean <- cfunction(signature(x="numeric", bx="numeric"),
>> " int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
>
>
> I'll take my solution back. The problem specification says that x has
> 10,000-millions of elements, so we need to use R-devel and
>
> R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;
>
> but there are two further issues. The first is that on my system
>
> p$ gcc --version
> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
>
> I have __SIZEOF_SIZE_T__ 8, but
>
> (a) the test at Rinternals.h:52 checks SIZEOF_SIZE_T, which is
> undefined, so I end up with 'typedef int R_xlen_t' (this is after R
> CMD SHLIB rather than via the inline package, to avoid that level of
> uncertainty) and then
>
>
>>  SEXP ans = PROTECT(NEW_NUMERIC(nb));
>>  double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
>>  sum = j = n = 0;
>>  for (i = 0; i < nx; ++i) {
>
>
> (b) because nx is of a signed type, a length nx > .Machine$integer.max
> is represented as a negative number, so I never iterate this loop. I'd
> have to be more clever if I wanted this to work on all platforms.
>
> For what it's worth, Herve's solution is also problematic:
>
>> xx = findInterval(bx, x)
> Error in findInterval(bx, x) : long vector 'vec' is not supported
>
> A different strategy for the problem at hand would seem to involve
> iterating over chunks of x, collecting the sufficient statistics (n,
> sum) for each chunk, and calculating the means at the end. This might
> also result in better memory use and allow parallel processing.
>
> Martin
>
>
>>    while (xp[i] >= bxp[j]) {
>>      ansp[j++] = n > 0 ? sum / n : 0;
>>      sum = n = 0;
>>    }
>>    n += 1;
>>    sum += xp[i];
>>  }
>>  ansp[j] = n > 0 ? sum / n : 0;
>>  UNPROTECT(1);
>>  return ans;
>> ")
>>
>> with a test case
>>
>> nx <- 4e7
>> nb <- 1e3
>> x <- sort(runif(nx))
>> bx <- do.call(seq, c(as.list(range(x)), length.out=nb))
>>
>> leading to
>>
>> > bx1 <- c(bx[-1], bx[nb] + 1)
>> > system.time(res <- binmean(x, bx1))
>>    user  system elapsed
>>   0.052   0.000   0.050
>>
>> Martin
>>
>>>
>>> Thanks,
>>>
>>> Henrik
>
> --
> Dr. Martin Morgan, PhD
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109