[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Martin Maechler
maechler at stat.math.ethz.ch
Fri Jul 28 21:55:37 CEST 2006
>>>>> "Kevin" == Kevin B Hendricks <kevin.hendricks at sympatico.ca>
>>>>> on Fri, 28 Jul 2006 14:53:57 -0400 writes:
[.........]
Kevin> The idea is to somehow make functions that work well
Kevin> over small sub-sequences of a much longer vector
Kevin> without resorting to splitting the vector into many
Kevin> smaller vectors.
Kevin> In my particular case, the problem was my data frame
Kevin> had over 1 million lines and probably over 500,000
Kevin> unique sort keys (i.e., think of it as an R factor with
Kevin> over 500,000 levels). The implementation of "by"
Kevin> uses "tapply" which in turn uses "split". So "split"
Kevin> simply ate up all the time trying to create 500,000
Kevin> vectors each of short length 1, 2, or 3; and the
Kevin> associated garbage collection.
Not that I have spent enough time thinking about this thread's
topic, but I have seen more than one case where using tapply()
unnecessarily slowed down computations.
I don't remember the details, but I know that in one case, replacing
tapply() with a few lines of code {one of which used lapply(), IIRC}
sped up that computation by a factor (of 2? or more?).
I also vaguely remember that I thought about making tapply()
faster, but came to the conclusion it could not be
sped up quickly, because it works in a much more general
context than it was used in that application (and maybe yours?).
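A minimal sketch of the kind of specialization meant here, assuming the
data are already sorted by key, the keys contain no NAs, and only group
sums are wanted. It is not the code from the case mentioned above;
`sorted_group_sum` is a hypothetical name, not an existing R function.
It takes differences of a running sum at the group boundaries, so no
per-group vectors are created:

```r
# Hypothetical helper: per-group sums for data already sorted by key.
# Assumes length(x) == length(key) and no NA values in key.
sorted_group_sum <- function(x, key) {
  stopifnot(length(x) == length(key))
  n <- length(x)
  if (n == 0L) return(numeric(0))
  # TRUE at the last element of each run of equal keys
  is_last <- c(key[-1L] != key[-n], TRUE)
  # running sum, sampled at the end of each run
  ends <- cumsum(x)[is_last]
  # difference of consecutive run-end totals gives each group's sum
  out <- diff(c(0, ends))
  names(out) <- key[is_last]
  out
}

x   <- 1:6
key <- c("a", "a", "b", "c", "c", "c")
sorted_group_sum(x, key)   # same values as tapply(x, key, sum)
```

For floating-point data the differences of a long running sum may lose
some precision compared with summing each group separately, which is one
reason such a shortcut cannot simply replace the general tapply().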
Kevin> A simple loop that walked the short sequence of
Kevin> values (since the data frame was already sorted)
Kevin> calculating what it needed, would work much faster
Kevin> than splitting the original vector into so very many
Kevin> smaller vectors (and the associated copying of data).
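The loop described here might look roughly like the following sketch,
which walks a sorted vector once and keeps a running maximum per group.
The function name `walk_group_max` and the choice of max as the statistic
are assumptions for illustration only; it also assumes no NAs in x or
key. An interpreted R loop like this is probably best seen as a
prototype for a version written in C:

```r
# Hypothetical helper: per-group maximum for data already sorted by key,
# computed in a single pass without split().
# Assumes length(x) == length(key) and no NA values in x or key.
walk_group_max <- function(x, key) {
  stopifnot(length(x) == length(key))
  n <- length(x)
  if (n == 0L) return(numeric(0))
  # TRUE at the first element of each run of equal keys
  is_first <- c(TRUE, key[-1L] != key[-n])
  out <- numeric(sum(is_first))   # one slot per group
  g <- 0L                         # index of the current group
  for (i in seq_len(n)) {
    if (is_first[i]) {
      g <- g + 1L                 # a new key starts: open a new group
      out[g] <- x[i]
    } else if (x[i] > out[g]) {
      out[g] <- x[i]              # same key: update the running maximum
    }
  }
  names(out) <- key[is_first]
  out
}

x   <- c(5, 2, 7, 1, 9, 3)
key <- c("a", "a", "b", "c", "c", "c")
walk_group_max(x, key)   # same values as tapply(x, key, max)
```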
Kevin> That problem is very similar to the
Kevin> calculation of basic stats on a short moving window
Kevin> over a very long vector.
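For the moving-window case, a small sketch of one common approach,
assuming a fixed window of k consecutive elements and the mean as the
statistic. `window_mean` is a hypothetical name used only here. Each
window sum is obtained as a difference of two running totals, so no
window is ever copied out of the long vector:

```r
# Hypothetical helper: mean of every window of k consecutive elements.
# Assumes 1 <= k <= length(x) and no NA values in x.
window_mean <- function(x, k) {
  n <- length(x)
  stopifnot(k >= 1L, k <= n)
  # cs[j + 1] holds the sum of the first j elements of x
  cs <- c(0, cumsum(x))
  # sum of x[i:(i + k - 1)] is cs[i + k] - cs[i]
  (cs[(k + 1L):(n + 1L)] - cs[1L:(n - k + 1L)]) / k
}

window_mean(1:5, 3)   # means of 1:3, 2:4 and 3:5
```

As with any running-sum shortcut, rounding error can accumulate for very
long double-precision vectors.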
>> The author of that message ultimately wrote the caTools R
>> package which contains some optimized versions.
Kevin> I will look into that package and maybe use it for a
Kevin> model for what I want to do.
Kevin> Thanks,
Kevin> Kevin