[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Kevin B. Hendricks
kevin.hendricks at sympatico.ca
Fri Jul 28 20:53:57 CEST 2006
Hi,
> There was a performance comparison of several moving average
> approaches here:
> http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html
>
Thanks for that link. It is not quite the same thing but is very
similar.
The idea is to somehow make functions that work well over small sub-
sequences of a much longer vector without resorting to splitting the
vector into many smaller vectors.
In my particular case, the problem was that my data frame had over 1
million lines and probably over 500,000 unique sort keys (i.e. think
of it as an R factor with over 500,000 levels). The implementation
of "by" uses "tapply", which in turn uses "split". So "split" simply
ate up all the time trying to create 500,000 vectors, each of short
length 1, 2, or 3, plus the associated garbage collection.
A simple loop that walked each short sequence of values (since the
data frame was already sorted), calculating what it needed, would work
much faster than splitting the original vector into so very many
smaller vectors (and the associated copying of data).
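
To make the single-pass idea concrete, here is a rough sketch. It is
written in Python rather than R purely for illustration, and the names
(grouped_sums, keys, values) are hypothetical, not part of "by" or any
existing package. It assumes the keys are already sorted, so equal keys
sit next to each other and no splitting is needed:

```python
def grouped_sums(keys, values):
    """Sum `values` over each run of equal `keys` in a single pass.

    Assumption: `keys` is already sorted (or at least grouped), so
    equal keys are adjacent. No per-group vectors are ever created.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")

    out_keys = []   # one entry per distinct run of keys
    out_sums = []   # running total for the matching run
    for k, v in zip(keys, values):
        if out_keys and out_keys[-1] == k:
            # Still inside the current run: accumulate in place.
            out_sums[-1] += v
        else:
            # Key changed: start a new run.
            out_keys.append(k)
            out_sums.append(v)
    return out_keys, out_sums


# Hypothetical sample data: three sort keys with runs of length 2, 1, 3.
keys = [1, 1, 2, 3, 3, 3]
values = [10.0, 5.0, 7.0, 1.0, 2.0, 3.0]
result = grouped_sums(keys, values)
```

The work is proportional to the length of the vector, and the only
allocations are the two output vectors, one element per unique key.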
That problem is very similar to the calculation of basic
stats on a short moving window over a very long vector.
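
For comparison, a moving mean can be computed the same way, walking the
long vector once and never materialising the individual windows. This
is only a sketch in Python with hypothetical names (moving_mean, x,
width); it is not how caTools or any other package does it:

```python
def moving_mean(x, width):
    """Mean of every full window of `width` consecutive elements of `x`.

    Keeps a running total and updates it as the window slides, instead
    of building a separate short vector for each window.
    """
    if width < 1 or width > len(x):
        raise ValueError("width must be between 1 and len(x)")

    total = sum(x[:width])        # total of the first window
    out = [total / width]
    for i in range(width, len(x)):
        # Slide by one: add the entering element, drop the leaving one.
        total += x[i] - x[i - width]
        out.append(total / width)
    return out


# Hypothetical sample data.
means = moving_mean([1, 2, 3, 4, 5], 3)
```

With floating-point input the running total can accumulate rounding
error over a very long vector, so a careful implementation would need
to guard against that.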
> The author of that message ultimately wrote the caTools R package
> which contains some optimized versions.
I will look into that package and maybe use it as a model for what I
want to do.
Thanks,
Kevin