[Rd] cwilcox - new version
Aidan Lakshman
AHL27 @end|ng |rom p|tt@edu
Tue Jan 16 15:47:53 CET 2024
New email client, hoping that Outlook won’t continue to HTMLify my emails and mess up my attachments without my consent :/
I’m re-attaching the patch file here, compressed as a .gz in case that was the issue with the previous attachment. Thanks to Ivan for pointing out that my attachment didn’t go through correctly on the first try.
During the day today I’ll see whether I can optimize this so that it actually improves performance.
-Aidan
On 15 Jan 2024, at 5:51, Andreas Löffler wrote:
> I am a newbie to C (although I did some programming in Perl and Pascal) so
> forgive me for the language that follows. I am writing because I have a
> question concerning the implementation of (the internal function)
> cwilcox in
> https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c.
> This function is used to determine the test statistics of the
> Wilcoxon-Mann-Whitney U-test.
>
> The function `cwilcox` has three inputs: k, m and n. To establish the test
> statistics `cwilcox` determines the number of possible sums of natural
> numbers with the following restrictions:
> - the sum of the elements must be k,
> - the elements (summands) must be in a decreasing (or increasing) order,
> - every element must be at most m and
> - there must be at most n summands.
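
To make the counting definition above concrete, here is a minimal brute-force sketch in C. It assumes the bound on the summands is inclusive (each summand is between 1 and m), and the name `count_partitions` is hypothetical, not something from wilcox.c. It simply enumerates the summands in non-increasing order, so it is only meant for checking small cases.

```c
#include <assert.h>

/* Hypothetical helper: number of ways to write k as a sum of at most
   `parts` summands, each between 1 and `max_part`, listed in
   non-increasing order. Exponential time; for small inputs only. */
static long count_partitions(int k, int parts, int max_part)
{
    assert(parts >= 0 && max_part >= 0);
    if (k == 0) return 1;               /* the empty sum */
    if (k < 0 || parts == 0) return 0;  /* nothing left to use */

    long total = 0;
    /* choose the largest summand, then bound the rest by it */
    for (int first = 1; first <= max_part && first <= k; first++)
        total += count_partitions(k - first, parts - 1, first);
    return total;
}
```

With this argument order, the count described for `cwilcox(k,m,n)` would correspond to `count_partitions(k, n, m)`.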
>
> The problem with the evaluation of this function `cwilcox(k,m,n)` is that
> it requires a recursion where in fact a three-dimensional array has to be
> evaluated (see the code around line 157). This requires a lot of memory and
> takes time and seems still an issue even with newer machines, see the
> warning in the documentation
> https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test
> .
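
For reference, the recursion in question (quoted in the code comment further down as `w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1)`) can be sketched without the memoisation table. This is a simplified illustration, not the code in wilcox.c: the name `count_recursive` and the base cases are assumptions, and the real function additionally caches results in the three-dimensional array `w`.

```c
#include <assert.h>

/* Hypothetical, unmemoised version of the classical recursion.
   Both m and n change between calls, which is why a cached version
   needs a table indexed by (m, n, k). */
static double count_recursive(int k, int m, int n)
{
    assert(m >= 0 && n >= 0);
    if (k < 0 || k > m * n) return 0.0;  /* outside the support */
    if (m == 0 || n == 0) return 1.0;    /* only k == 0 gets here */
    return count_recursive(k - n, m - 1, n)   /* largest summand used */
         + count_recursive(k, m, n - 1);      /* largest summand unused */
}
```

Without caching the running time grows exponentially, so this form is only useful for checking small values.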
>
> In an old paper I have written a formula where the evaluation can be done
> in a one-dimensional array that uses way less memory and is much faster.
> The paper itself is in German (you find a copy here:
> https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf),
> so I uploaded a translation into English (to be found in
> https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf).
>
> I also looked at the code in R and wrote a new version of the code that
> uses my formulae so that a faster implementation is possible (code with
> comments is below). I have several questions regarding the code:
>
> 1. A lot of commands in the original R code are used to handle the
> memory. I do not know how to do this so I skipped memory handling
> completely and simply allocated space to an array (see below int
> cwilcox_distr[(m*n+1)/2];). Since my array is 1-dimensional instead of
> 3-dimensional I think as a first guess that will be ok.
> 2. I read the documentation when and why code in R should be changed. I
> am not familiar enough with R to understand how this applies here. My code
> uses less memory - is that enough? Or should another function be defined
> that uses my code? Or shall it be disregarded?
> 3. I was not able to file my ideas in bugzilla.r-project or the like and
> I hope that this mailing list is a good address for my question.
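
On question 1, a minimal sketch of how the one-dimensional array could be heap-allocated instead of placed on the stack. It assumes plain C `malloc`/`free` rather than whatever allocation helpers R uses internally; the names `sigma_sketch` and `cwilcox_1d`, and the `-1.0` return value on allocation failure, are hypothetical. The array holds `double` values so that large counts do not overflow an `int`, and only entries 0..k are computed because a single value needs nothing beyond that.

```c
#include <assert.h>
#include <stdlib.h>

/* Divisors of k in 1..m count positively, divisors in n+1..n+m negatively. */
static int sigma_sketch(int k, int m, int n)
{
    int s = 0;
    for (int d = 1; d <= m; d++)
        if (k % d == 0) s += d;
    for (int d = n + 1; d <= n + m; d++)
        if (k % d == 0) s -= d;
    return s;
}

/* Hypothetical heap-allocated variant of the one-dimensional recursion. */
static double cwilcox_1d(int k, int m, int n)
{
    assert(m >= 0 && n >= 0);
    int u = m * n;
    if (k < 0 || k > u) return 0.0;   /* outside the support */
    if (2 * k > u) k = u - k;         /* the distribution is symmetric */

    double *distr = malloc((size_t)(k + 1) * sizeof *distr);
    if (distr == NULL) return -1.0;   /* assumed error signal */

    distr[0] = 1.0;                   /* only the empty sum gives 0 */
    for (int kk = 1; kk <= k; kk++) {
        double s = 0.0;
        for (int i = 0; i < kk; i++)
            s += distr[i] * sigma_sketch(kk - i, m, n);
        distr[kk] = s / kk;           /* s is a multiple of kk */
    }
    double result = distr[k];
    free(distr);
    return result;
}
```

The results stay exact only while the intermediate sums fit in the 53-bit mantissa of a `double`.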
>
> I also have written an example of a main function where the original code
> from R is compared to my code. I do not attach this example because this
> email is already very long but I can provide that example if needed.
>
> Maybe we can discuss this further. Best wishes, Andreas Löffler.
>
>
> ```
> #include <stdio.h>
> #include <stdlib.h>
>
> /* The traditional approach to determine the Mann-Whitney statistics uses a
> recursion formula for partitions of natural numbers, namely in the line
> w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1);
> (taken from
> https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c).
> This approach requires a huge number of partitions to be evaluated because
> the second variable (j in the left term) and the third variable (k in the
> left term) in this recursion are not constant but change as well. Hence, a
> three dimensional array is evaluated.
>
> In an old paper a recursion equation was shown that avoids this
> disadvantage. The recursion equation of that paper uses only an array where
> the second as well as the third variable remain constant. This implies
> faster evaluation and less memory used. The original paper is in German and
> can be found in
> https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf
> and the author has uploaded a translation into English in
> https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf. This
> function uses this approach. */
>
> static int
> cwilcox_sigma(int k, int m, int n) { /* this relates to the sigma function
> below */
> int s, d;
>
> s=0;
> for (d = 1; d <= m; d++) {
> if (k%d == 0) {
> s=s+d;
> }
> }
> for (d = n+1; d <= n+m; d++) {
> if (k%d == 0) {
> s=s-d;
> }
> }
> return s;
> }
>
> /* this can replace cwilcox. It runs faster and uses way less memory */
> static double
> cwilcox2(int k, int m, int n){
>
> int cwilcox_distr[(m*n+1)/2 + 1]; /* will store (one half of the) distribution;
> the loop below writes indices 0..(m*n+1)/2, hence the + 1 */
> int s, i, kk;
>
> if (2*k>m*n){
> k=m*n-k; /* permutation function is symmetric */
> }
>
> for (kk=0; 2*kk<=m*n+1; kk++){
> if (kk==0){
> cwilcox_distr[0]=1; /* by definition 0 has only 1 partition */
> } else {
> s=0;
> for (i = 0; i<kk; i++){
> s=s+cwilcox_distr[i]*cwilcox_sigma(kk-i,m,n); /* recursion formula */
> }
> cwilcox_distr[kk]=s/kk;
> }
> }
>
> return (double) cwilcox_distr[k];
> }
>
> ```
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wilcox_patch_draft.gz
Type: application/x-gzip
Size: 1707 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20240116/b2822fbd/attachment.bin>