[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Fri Jul 28 20:53:57 CEST 2006

Hi,

> There was a performance comparison of several moving average
> approaches here:
> http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html
>

Thanks for that link.  It is not quite the same thing but is very  
similar.

The idea is to somehow make functions that work well over small
subsequences of a much longer vector without resorting to splitting
the vector into many smaller vectors.

In my particular case, the problem was that my data frame had over 1
million lines and probably over 500,000 unique sort keys (i.e. think
of it as an R factor with over 500,000 levels).  The implementation
of "by" uses "tapply", which in turn uses "split".  So "split" simply
ate up all the time trying to create 500,000 vectors, each of short
length 1, 2, or 3, plus the associated garbage collection.

A simple loop that walked each short sequence of values (since the
data frame was already sorted), calculating what it needed, would work
much faster than splitting the original vector into so very many
smaller vectors (and the associated copying of data).
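
To make the idea concrete, here is a rough sketch, not a proposed
implementation.  The function name sorted_group_sum and the choice of
"sum" as the statistic are made up for illustration, and instead of an
explicit loop it finds the run boundaries in the sorted keys and uses
cumsum, which is the same single pass expressed in vectorized R.  It
assumes the keys are already sorted (equal keys are adjacent) and
contain no NAs.

```r
# Hypothetical helper: sum x within runs of equal, already-sorted keys,
# without calling split().
sorted_group_sum <- function(keys, x) {
  n <- length(keys)
  stopifnot(length(x) == n)
  if (n == 0L) return(data.frame(key = keys, sum = numeric(0)))

  # TRUE at every position where a new run of keys begins
  is_start <- c(TRUE, keys[-1L] != keys[-n])
  starts <- which(is_start)
  # each run ends just before the next one starts; the last ends at n
  ends <- c(starts[-1L] - 1L, n)

  # run sum = cumulative sum at the run's end minus the cumulative sum
  # at the end of the previous run (0 for the first run)
  cs <- cumsum(x)
  sums <- cs[ends] - c(0, cs[ends[-length(ends)]])

  data.frame(key = keys[starts], sum = sums)
}

keys <- c(1, 1, 2, 3, 3, 3)   # already sorted
x    <- c(10, 20, 5, 1, 2, 3)
res  <- sorted_group_sum(keys, x)

# Same numbers as the split-based route, for comparison
ref <- as.vector(tapply(x, keys, sum))
```

On this toy input res$sum and ref agree.  The cumsum trick only
covers sums (and hence means); for other statistics the starts and
ends vectors could drive a loop over the runs instead, and for long
floating-point vectors the differences of cumulative sums can lose a
little precision.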

That problem is very similar to the calculation of basic
stats on a short moving window over a very long vector.
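
For the moving window case, a minimal sketch of the same "no
splitting" approach might look like the following.  The name
moving_mean is made up here, and this is not what caTools or the
linked benchmark does; it is just one common way to get a windowed
mean from a single cumulative sum, assuming no NAs in x.

```r
# Hypothetical helper: mean over every full window of width k,
# computed from one cumulative sum rather than per-window subvectors.
moving_mean <- function(x, k) {
  n <- length(x)
  stopifnot(k >= 1, k <= n)
  cs <- c(0, cumsum(x))
  # window ending at position i has sum cs[i + 1] - cs[i - k + 1]
  (cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]) / k
}

mm <- moving_mean(c(1, 2, 3, 4, 5), 3)
```

The result has length(x) - k + 1 elements, one per full window.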

> The author of that message ultimately wrote the caTools R package
> which contains some optimized versions.

I will look into that package and maybe use it as a model for what I
want to do.

Thanks,

Kevin