[Rd] Faster sorting algorithm...

irederik m@iii@g oii oib@@et irederik m@iii@g oii oib@@et
Mon Mar 22 06:05:07 CET 2021


I think it is "Professor Neal" :)

I also appreciate the pqR comparisons.

On Wed, Mar 17, 2021 at 09:23:15AM +0000, Morgan Morgan wrote:
>Thank you Neal. This is interesting. I will have a look at pqR.
>Indeed radix only does C collation, I believe that is why it is not the
>default choice for character ordering and sorting.
>Not sure but I believe it can help address the following bugzilla item:
>https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400
>
>On the same topic of collation, there is an experimental sorting function
>"psort" in package kit that might help address this issue.
>
>> library(kit)
>Attaching kit 0.0.7 (OPENMP enabled using 1 thread)
>> x <- c("b","A","B","a","\xe4")
>> Encoding(x) <- "latin1"
>> identical(psort(x, c.locale=FALSE), sort(x))
>[1] TRUE
>> identical(psort(x, c.locale=TRUE), sort(x, method="radix"))
>[1] TRUE
>
>Coming back to the topic of fsort, I have just finished the implementation
>for double, integer, factor and logical.
>The implementation takes into account NA, Inf.. values. Values can be
>sorted in a decreasing order or increasing order.
>Comparing benchmark with the current implementation in data.table, it is
>currently over 30% faster.
>There might bugs but I am sure performance can be further improved as I did
>not really try hard.
>If there is interest in both the implementation and cross community
>sharing, please let know....
>
>Best regards,
>Morgan
>
>On Wed, 17 Mar 2021, 00:37 Radford Neal, <radford using cs.toronto.edu> wrote:
>
>> Those interested in faster sorting may want to look at the merge sort
>> implemented in pqR (see pqR-project.org).  It's often used as the
>> default, because it is stable, and does different collations, while
>> being faster than shell sort (except for small vectors).
>>
>> Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2,
>> compiled identically:
>>
>> -----------------------------
>> pqR-2020-07-23 in C locale:
>>
>> > set.seed(1)
>> > N <- 1000000
>> > x <- as.character (sample(N,N,replace=TRUE))
>> > print(system.time (os <- order(x,method="shell")))
>>    user  system elapsed
>>   1.332   0.000   1.334
>> > print(system.time (or <- order(x,method="radix")))
>>    user  system elapsed
>>   0.092   0.004   0.096
>> > print(system.time (om <- order(x,method="merge")))
>>    user  system elapsed
>>   0.363   0.000   0.363
>> > print(identical(os,or))
>> [1] TRUE
>> > print(identical(os,om))
>> [1] TRUE
>> >
>> > x <- c("a","~")
>> > print(order(x,method="shell"))
>> [1] 1 2
>> > print(order(x,method="radix"))
>> [1] 1 2
>> > print(order(x,method="merge"))
>> [1] 1 2
>>
>> -----------------------------
>> R-4.0.2 in C locale:
>>
>> > set.seed(1)
>> > N <- 1000000
>> > x <- as.character (sample(N,N,replace=TRUE))
>> > print(system.time (os <- order(x,method="shell")))
>>    user  system elapsed
>>   2.381   0.004   2.387
>> > print(system.time (or <- order(x,method="radix")))
>>    user  system elapsed
>>   0.138   0.000   0.137
>> > #print(system.time (om <- order(x,method="merge")))
>> > print(identical(os,or))
>> [1] TRUE
>> > #print(identical(os,om))
>> >
>> > x <- c("a","~")
>> > print(order(x,method="shell"))
>> [1] 1 2
>> > print(order(x,method="radix"))
>> [1] 1 2
>> > #print(order(x,method="merge"))
>>
>> ------------------------------------
>> pqR-2020-07-23 in fr_CA.utf8 locale:
>>
>> > set.seed(1)
>> > N <- 1000000
>> > x <- as.character (sample(N,N,replace=TRUE))
>> > print(system.time (os <- order(x,method="shell")))
>> utilisateur     système      écoulé
>>       2.960       0.000       2.962
>> > print(system.time (or <- order(x,method="radix")))
>> utilisateur     système      écoulé
>>       0.083       0.008       0.092
>> > print(system.time (om <- order(x,method="merge")))
>> utilisateur     système      écoulé
>>       1.143       0.000       1.142
>> > print(identical(os,or))
>> [1] TRUE
>> > print(identical(os,om))
>> [1] TRUE
>> >
>> > x <- c("a","~")
>> > print(order(x,method="shell"))
>> [1] 2 1
>> > print(order(x,method="radix"))
>> [1] 1 2
>> > print(order(x,method="merge"))
>> [1] 2 1
>>
>> ------------------------------------
>> R-4.0.2 in fr_CA.utf8 locale:
>>
>> > set.seed(1)
>> > N <- 1000000
>> > x <- as.character (sample(N,N,replace=TRUE))
>> > print(system.time (os <- order(x,method="shell")))
>> utilisateur     système      écoulé
>>       4.222       0.016       4.239
>> > print(system.time (or <- order(x,method="radix")))
>> utilisateur     système      écoulé
>>       0.136       0.000       0.137
>> > #print(system.time (om <- order(x,method="merge")))
>> > print(identical(os,or))
>> [1] TRUE
>> > #print(identical(os,om))
>> >
>> > x <- c("a","~")
>> > print(order(x,method="shell"))
>> [1] 2 1
>> > print(order(x,method="radix"))
>> [1] 1 2
>> > #print(order(x,method="merge"))
>>
>>
>> pqR is faster in all the tests, but more relevant to this discussion
>> is that the "merge" method is substantially faster than "shell" for
>> these character vectors with a million elements, while retaining the
>> stability and collation properties of "shell" (whereas "radix" only
>> does C collation).
>>
>> It would probably not be too hard to take the merge sort code from pqR
>> and use it in R core's implementation.
>>
>> The merge sort in pqR doesn't exploit parallelism at the moment, but
>> merge sort is potentially quite parallelizable (though I think the
>> storage allocation strategy I use would have to be modified).
>>
>> Regards,
>>
>>    Radford Neal
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>	[[alternative HTML version deleted]]
>
>______________________________________________
>R-devel using r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-devel
>



More information about the R-devel mailing list