[Rd] cwilcox - new version

Aidan Lakshman AHL27 at pitt.edu
Wed Jan 17 16:02:23 CET 2024


Hi everyone,

I’ve opened a Bugzilla report for Andreas with the most recent implementation here: https://bugs.r-project.org/show_bug.cgi?id=18655. Feedback would be greatly appreciated.


The most straightforward approach is likely to implement both methods and choose between them based on population size. The cutoff at n = 50 is very sharp; it would already be a large improvement to call Andreas’s algorithm when either population is larger than 50 and use the current method otherwise (a minimal sketch of that dispatch follows).
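
For concreteness, here is a rough sketch of what that switch might look like in C. The helper names `cwilcox_cached` and `cwilcox_loeffler` are hypothetical placeholders for the existing table-based count and Andreas’s recurrence; only `WILCOX_MAX = 50` comes from the actual sources.

```c
#define WILCOX_MAX 50  /* cutoff used by the current implementation */

/* Hypothetical helpers standing in for the two real implementations:
 * the existing table-based count and Andreas's O(mn)-memory recurrence. */
double cwilcox_cached(int k, int m, int n);
double cwilcox_loeffler(int k, int m, int n);

/* Dispatch on population size: keep the fast cached method for small
 * populations, fall back to the memory-frugal recurrence otherwise. */
double cwilcox(int k, int m, int n)
{
    if (m > WILCOX_MAX || n > WILCOX_MAX)
        return cwilcox_loeffler(k, m, n);
    return cwilcox_cached(k, m, n);
}
```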

For the Bugzilla report I’ve only submitted the new version, for benchmarking purposes. If there is a way to improve this algorithm so that it matches the performance of the current version for population sizes under 50, that would be significantly cleaner than maintaining two algorithms with an internal switch.

As for remaining performance improvements:

1. cwilcox_sigma is definitely a performance loss. Since the formula just looks for potential factors of k, it would be faster to loop from 1 to min(m, sqrt(k)) and from n+1 to min(m+n, sqrt(k)) instead. There may be other ways to improve this, but factorization is a notoriously hard problem, so further optimization may be intractable. (A sketch of the divisor-scan idea follows this list.)

2. Calculation of the distribution values has quadratic scaling. Maybe there’s a way to optimize that further? See lines 91-103 in the most recent version.
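
For point 1, here is a hedged sketch of the sqrt(k) divisor scan: every divisor d <= sqrt(k) pairs with k/d, so summing the divisors of k within a range like [1, m] or [n+1, m+n] needs only about sqrt(k) trial divisions. The function name and signature are mine for illustration, not code from the patch:

```c
/* Hypothetical helper: sum the divisors of k that lie in [lo, hi].
 * Each divisor d <= sqrt(k) is paired with k / d, so only ~sqrt(k)
 * candidates are tested instead of scanning the whole range.
 * Assumes k >= 1. */
static double divisor_sum_in_range(int k, int lo, int hi)
{
    double s = 0.0;
    for (int d = 1; d * d <= k; d++) {
        if (k % d != 0)
            continue;
        int e = k / d;  /* paired divisor */
        if (d >= lo && d <= hi)
            s += d;
        if (e != d && e >= lo && e <= hi)
            s += e;
    }
    return s;
}
```

cwilcox_sigma would then combine such range sums (e.g. over [1, m] and [n+1, m+n]); the exact combination is Andreas’s formula.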

Regardless of runtime, memory is certainly improved. For population sizes m and n, the current version has memory complexity O((mn)^2), whereas Andreas’s version has complexity O(mn). For m = n = 500 that is on the order of (500·500)^2 ≈ 6×10^10 table entries, far beyond available RAM, which is presumably why `qwilcox(0.5,500,500)` crashes my R session with the old version but runs successfully in about 10 seconds with the new one.

I’ve written up all the information so far on the Bugzilla report, and I’m sure Andreas will add more if necessary once his account is approved. Thanks again to Andreas for introducing this algorithm; I’m hopeful it will improve the performance of the wilcox functions.

-Aidan


-----------------------
Aidan Lakshman (he/him)
PhD Candidate, Wright Lab
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
www.AHL27.com
ahl27 at pitt.edu | (724) 612-9940

On 17 Jan 2024, at 5:55, Andreas Löffler wrote:

>>
>>
>> Performance statistics are interesting. If we assume the two populations
>> have a total of `m` members, then this implementation runs slightly slower
>> for m < 20, and much slower for 50 < m < 100. However, this implementation
>> works significantly *faster* for m > 200. The breakpoint is precisely when
>> each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8
>> microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5
>> milliseconds. The new version runs in roughly 1 millisecond for both. This
>> is probably because of internal logic that requires many more `free/calloc`
>> calls if either population is larger than `WILCOX_MAX`, which is set to 50.
>>
> Also because cwilcox_sigma has to be evaluated, which is slightly more
> demanding since it uses k % d.
>
> There is a tradeoff here between memory usage and execution time. I am
> not a heavy user of the U test, but I think the typical use case does not
> involve several hundred tests in a session, so execution time (my 2
> cents) is less important. But if R crashes, even a single execution is
> problematic.
>
> But the takeaway is probably: we should implement both approaches in the
> code and leave the choice to the user. If time is important, memory is
> not an issue, and m and n are low, go for the "traditional" approach.
> Otherwise, use my formula?
>
> PS (@Aidan): I applied for a Bugzilla account two days ago and have not
> heard back; my spam folder is also empty. Is that expected, or should I
> do something?


