[Rd] RE: [R] Benchmarking R, why sort() is so slow?
Martin Maechler
Martin Maechler <maechler@stat.math.ethz.ch>
Fri, 27 Apr 2001 17:54:15 +0200
>>>>> "TL" == Thomas Lumley <tlumley@u.washington.edu> writes:
TL> On Fri, 27 Apr 2001, Philippe Grosjean wrote:
>> I suspect it should be a question of algorithm choice. May be sort() is
>> important enough, including for large datasets, to place some improvement in
>> the "to do" list for a further version...?
>>
TL> Actually, I'm not convinced on the speed issue. Shellsort works
TL> very well for vectors of lengths that R can reasonably
TL> handle. Using a million record dataset in R you really can't get
TL> much done in the 20seconds that a really good sorting routine might
TL> save you.
TL> Heapsort might be worth trying -- it has good worst-case behaviour
TL> and is about half as fast as the typical case of quicksort. Merge
TL> sort might also be nice because it's stable: it leaves ties in the
TL> order it found them, but it needs extra space. Still, I don't think
TL> it's a high optimisation priority.
I tend to disagree.
For me, R is not just for data analysis.
It's my compute engine for almost everything, nowadays.
I definitely have wanted to add better sort() routines to R,
and I think R's default should be to use some O(n log(n)) versions, at
least for larghish n.
20 years ago (or so), ``the'' good thing
{{for in-memory sorting; note that we might consider a different algorithm
for the case of a really large array where [i] is really disk access}}
was Singleton's Quickersort
| R.C. Singleton 'Revised Quickersort'
| (CACM Algorithm 347, Sept.1968)
(quicker than quicksort), a version of which (that does additionally return
the sorted index) I had re-programmed in Pascal and then C.
I can put it up for ftp if there's interest.
The GSL (GNU scientific library) has sorting routines, too,
all of them heapsort.
>From an outdated manual,
http://sources.redhat.com/gsl/ref/gsl-ref_17.html#SEC253
>> Sorting
>> This chapter describes functions for sorting data, both directly and
>> indirectly (using an index). All the functions use the heapsort
>> algorithm. Heapsort is an O(N \log N) algorithm which operates
>> in-place. It does not require any additional storage and provides
>> consistent performance. The running time for its worst-case (ordered
>> data) which is not significantly different from the average and best
>> cases. Note that the heapsort algorithm does not preserve the relative
>> ordering of equal elements -- it is an unstable sort. However the
>> resulting order of equal elements will be consistent across different
>> platforms when using these functions.
>> Sorting objects
>>
>> The following function provides a simple alternative to the standard
>> library function qsort. It is intended for systems lacking qsort, not as
>> a replacement for it. The function qsort should be used whenever
>> possible, as it will be faster and can provide stable ordering of equal
>> elements. Documentation for qsort is available in the GNU C Library
>> Reference Manual.
I.e., they seem to say one should really use qsort() whenever that is
available... hmm,
could we use configure to find if qsort() is available?
Martin
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._