[Rd] cwilcox - new version

Mon Jan 15 22:58:38 CET 2024

Hi Andreas,

These are the kinds of suggested improvements that get me really excited. I’m not a part of R-Core, but I am happy to help out as much as I can with optimizing your C code to submit this.

> I was not able to file my ideas in bugzilla.r-project or the like and
> I hope that this mailing list is a good address for my question.

You should send an email to bug-report-request using r-project.org to get an account. See this link for details: https://www.r-project.org/bugs.html#:~:text=In%20order%20to%20get%20a,the%20report%20isn't%20extraneous.

> I read the documentation when and why code in R should be changed. I
> am not familiar enough with R to understand how this applies here. My code
> uses less memory - is that enough? Or should another function be defined
> that uses my code? Or shall it be disregarded?

It depends a bit on the submission and R-Core, but the first step for something like this is to create a bug report on Bugzilla along with a reproducible example of a situation where your code outperforms the current version. From there, there are typically some discussions around the code itself, leading to a decision on whether or not to incorporate it.

I threw together a quick patch file based on your proposed changes, which I’ve attached here if others would like to look at it and test. I have a feeling the loops in `cwilcox_sigma` can be factored out. The patch file is far from perfect; a final patch would remove some internal calls made obsolete by this method, like `w_free`, `w_init_maybe`, and `wilcox_free`. The current patch just makes `wilcox_free` a noop and comments out calls to the other functions.

In my builds, `pwilcox` / `dwilcox` run much slower and `wilcox.test` runs about equally fast (as the current implementation). I haven’t done any memory benchmarking yet. I think that your version should be faster, so I’m not quite sure where the slowdown is coming from…maybe someone else can weigh in with thoughts. I’ll have more time to work on optimization and tracing down the differences in runtime tomorrow.

I do have to say, I like this implementation better. The code is significantly cleaner (in my opinion), and it doesn’t require use of a static global variable with `on.exit` hooks. If you can find a reproducible example where the current R build crashes/breaks and your version does not, that would be extremely helpful. Some test cases where your version outperforms the existing one would also be helpful.

I’m happy to discuss more and help out with this, including setting up a Bugzilla report if you’d like. There are a lot of people that know more about this than I; hoping to hear their thoughts on this as well.

-Aidan

-----------------------
Aidan Lakshman (he/him)
PhD Candidate, Wright Lab
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
www.AHL27.com
ahl27 using pitt.edu | (724) 612-9940

From: R-devel <r-devel-bounces using r-project.org> on behalf of Andreas Löffler <andreas.loeffler using gmail.com>
Date: Monday, January 15, 2024 at 13:52
To: r-devel using r-project.org <r-devel using r-project.org>
Subject: [Rd] cwilcox - new version
I am a newbie to C (although I did some programming in Perl and Pascal) so
forgive me for the language that follows. I am writing because I have a
question concerning the *implementation of *(the internal function)*
cwilcox* in
https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsvn.r-project.org%2FR%2F!svn%2Fbc%2F81274%2Fbranches%2FR-DL%2Fsrc%2Fnmath%2Fwilcox.c&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285927560%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=Yu9u36OOyZWSVjA6%2BTKlYSTLh0VjMzOZKuQN48Ml3S0%3D&reserved=0<https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c>.
This function is used to determine the test statistics of the
Wilcoxon-Mann-Whitney U-test.

_______________________________________________________________________________________________________________________________________
The function `cwilcox` has three inputs: k, m and n. To establish the test
statistics `cwilcox` determines the number of possible sums of natural
numbers with the following restrictions:
- the sum of the elements must be k,
- the elements (summands) must be in a decreasing (or increasing) order,
- every element must be less then m and
- there must be at most n summands.

The problem with the evaluation of this function `cwilcox(k,m,n)` is that
it requires a recursion where in fact a three-dimensional array has to be
evaluated (see the code around line 157). This requires a lot of memory and
takes time and seems still an issue even with newer machines, see the
warning in the documentation
https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.rdocumentation.org%2Fpackages%2Fstats%2Fversions%2F3.6.2%2Ftopics%2Fwilcox.test&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285935895%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=GAkMVhP1tqf%2FezoLdrTKnU7KoysWzCpsDkpCdxCp5aQ%3D&reserved=0<https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test>
.

In an old paper I have written a formula where the evaluation can be done
in a one-dimensional array that uses way less memory and is much faster.
The paper itself is in German (you find a copy here:
https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Ff%2Ff5%2FLoefflerWilcoxonMannWhitneyTest.pdf&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285940564%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=37lZBA9tQ49Gk8NKsR6OidF3U6nk%2Bv3LVbt21IwGSYk%3D&reserved=0)<https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf>,
so I uploaded a translation into English (to be found in
https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fde%2F1%2F19%2FMannWhitney_151102.pdf&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285945105%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=MZb7dgk%2Bbt38HuEqz3G9M9UA0dsmjQHLgcDl01RAXsg%3D&reserved=0)<https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf>.

I also looked at the code in R and wrote a new version of the code that
uses my formulae so that a faster implementation is possible (code with
comments is below). I have several questions regarding the code:

   1. A lot of commands in the original R code are used to handle the
   memory. I do not know how to do this so I skipped memory handling
   completely and simply allocated space to an array (see below int
   cwilcox_distr[(m*n+1)/2];). Since my array is 1-dimensional instead of
   3-dimensional I think as a first guess that will be ok.
   2. I read the documentation when and why code in R should be changed. I
   am not familiar enough with R to understand how this applies here. My code
   uses less memory - is that enough? Or should another function be defined
   that uses my code? Or shall it be disregarded?
   3. I was not able to file my ideas in bugzilla.r-project or the like and
   I hope that this mailing list is a good address for my question.

I also have written an example of a main function where the original code
from R is compared to my code. I do not attach this example because this
email is already very long but I can provide that example if needed.

Maybe we can discuss this further. Best wishes, Andreas Löffler.

```
#include <stdio.h>
#include <stdlib.h>

/* The traditional approch to determine the Mann-Whitney statistics uses a
recursion formular for partitions of natural numbers, namely in the line
w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1);
(taken from
https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsvn.r-project.org%2FR%2F!svn%2Fbc%2F81274%2Fbranches%2FR-DL%2Fsrc%2Fnmath%2Fwilcox.c&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285949688%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=pPnCBuSzFfmE0%2FH3nmRmtO9qLLBiHSCcG2cNoI2scw4%3D&reserved=0)<https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c>.
This approach requires a huge number of partitions to be evaluated because
the second variable (j in the left term) and the third variable (k in the
left term) in this recursion are not constant but change as well. Hence, a
three dimensional array is evaluated.

In an old paper a recursion equations was shown that avoids this
disadvantage. The recursion equation of that paper uses only an array where
the second as well as the third variable remain constant. This implies
faster evaluation and less memory used. The original paper is in German and
can be found in
https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Ff%2Ff5%2FLoefflerWilcoxonMannWhitneyTest.pdf&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285954898%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=u7%2FH8geUfv0nwqspn%2FQ0DYGBbmGr9bA180B8vWxKIGc%3D&reserved=0<https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf>
and the author has uploaded a translation into English in
https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fde%2F1%2F19%2FMannWhitney_151102.pdf&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285959440%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=2nhVTotogBjd%2Fa2qaJD0pJLPB%2FE9FLy2d1fXYyz0BI0%3D&reserved=0<https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf>. This
function uses this approach. */

static int
cwilcox_sigma(int k, int m, int n) { /* this relates to the sigma function
below */
  int s, d;

  s=0;
  for (d = 1; d <= m; d++) {
  if (k%d == 0) {
  s=s+d;
  }
  }
  for (d = n+1; d <= n+m; d++) {
  if (k%d == 0) {
  s=s-d;
  }
  }
  return s;
}

/* this can replace cwilcox. It runs faster and uses way less memory */
static double
cwilcox2(int k, int m, int n){

int cwilcox_distr[(m*n+1)/2]; /* will store (one half of the) distribution
*/
int s, i, kk;

if (2*k>m*n){
k=m*n-k; /* permutation function is symmetric */
}

for (kk=0; 2*kk<=m*n+1; kk++){
if (kk==0){
cwilcox_distr[0]=1; /* by definition 0 has only 1 partition */
} else {
s=0;
for (i = 0; i<kk; i++){
s=s+cwilcox_distr[i]*cwilcox_sigma(kk-i,m,n); /* recursion formula */
   }
   cwilcox_distr[kk]=s/kk;
   }
}

return (double) cwilcox_distr[k];
}

        [[alternative HTML version deleted]]

______________________________________________
R-devel using r-project.org mailing list
https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-devel&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285963952%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=Xl2%2FW6Th506HlCklz2cmUWh5EHJ7E9%2BBcIgHwUVFpFg%3D&reserved=0<https://stat.ethz.ch/mailman/listinfo/r-devel>