[Rd] Updated `dendrapply`

Lakshman, Aidan H AHL27 @end|ng |rom p|tt@edu
Fri Sep 1 11:17:57 CEST 2023


Hi everyone,
Ivan and I had a few discussions several months ago regarding `dendrapply`, but now that I've had the chance to work on it more specifically and discuss it Martin Maechler at the R Project Sprint, I figured it would be a good idea to start a new email chain.
I've refactored `dendrapply`, and the current implementation is available in my bugzilla report (https://bugs.r-project.org/show_bug.cgi?id=18480). This project started due to the help page for `dendrapply`, which specifically requested users to contribute improvements to the function. I'm including a lot of writeup here because I'm very aware that this is a relatively large contribution for someone that doesn't have a history of contributing a lot of code to base, and I'd like to justify the inclusion.
Feel free to skip everything I've written and instead use the following links. A thorough discussion follows.
- Bugzilla with patch: https://bugs.r-project.org/show_bug.cgi?id=18480
- R Checks: https://github.com/r-devel/r-svn/pull/111
- Discussion at R Project Sprint: https://github.com/r-devel/r-project-sprint-2023/discussions/6
- Original blog post about this (long, code out of date, but has a simpler explanation of the implementation): https://www.ahl27.com/posts/2023/02/dendrapply/

Responses to common questions:
- Why does this project need to be done?
The current implementation in `stats::dendrapply` is recursive, and thus has issues with deeply nested dendrogram objects. As of 2015, users experienced issues with recursive operations on dendrograms causing stack overflow issues (see https://bugs.r-project.org/show_bug.cgi?id=15215). This has been alleviated by better computers and short-term workarounds, but many users have limited resources and/or need for large trees. Even with sufficient memory, a recursive implementation incurs a nontrivial amount of unneccessary computational overhead. I'll also add that this is a feature that was requested in R itself (see Note section in `?dendrapply`), and Martin Maechler has been supportive of the work thus far on it.
- What does this implementation do?
There are a few improvements in this implementation. The first is that function application to dendrogram objects is no longer recursive. This implementation is also based in C, providing a performance boost of at least 4x (see later question for details). Finally, iterative application of functions in C allows for flexibility in *how* the dendrogram is traversed, which gives end-users a significant amount of power to work with tree-like structures in an intuitive way. An easy example is subsetting based on leaf values--if a user wanted to subset a dendrogram to only certain leaves, there isn't a good way to do this in base R (even with dendrapply). `how='post.order'` fixes this problem. I'm happy to provide additional examples if needed.
- Why C? This is harder to maintain than R.
This is a valid point. I did my best to include as much documentation as possible, and I'm also volunteering myself to help maintain this function. C provides a lot of power in working with dendrograms, since R's toolkit for tree-like structures is relatively lacking. This refactor is theoretically doable in R, but the implementation would involve an immense amount of memory overhead to ensure we can preserve the node states as we traverse the tree. There is precedence for a C implementation of dendrapply (see other `*apply` functions). Additionally, this decreases function application overhead, and allows future extensions to be written purely in R in a much simpler way. I think this tradeoff is worth it, but I am happy to discuss implementation specifics with anyone that is interested.
- Ivan previously mentioned issues with user specific `[[.dendrogram` implementations, and it doesn't seem that you've fixed that.
This is correct. I discovered during the R project sprint that `stats::dendrapply` does not respect user-specific implementations of `[[.dendrogram`. stats::`[[.dendrogram` has its own issues; if the user defines multiple classes for a dendrogram object, double bracket subsetting will remove the class (a bug report will be filed for this shortly). My implementation exactly replicates the performance of stats::`[[.dendrogram`, and if users are in need of a function that can respect custom subset overloading, I can address those feature requests if/when they are submitted.
- Backwards compatibility?
>From current testing, this is a drop-in replacement for `stats::dendrapply`. R builds successfully, and all >400 tests in the CRAN package that uses `dendrapply` the most (dendextend) pass with no changes from the original. The additional argument `how=c('pre.order', 'post.order')` is the same syntax as `rapply`, and additional documentation has been added to the `dendrapply.Rd` to reflect this change. This is still an unfinished TODO; the internal R testing for `dendrapply` is very sparse. I haven't been able to find any differences between stats::dendrapply and this implementation, but I am planning to run a full check against all CRAN packages that use `dendrapply`. I'm also planning to add additional regression testing to R either as part of this patch in a separate patch.
- You mentioned there was more to the listed '4x improvement'
Yes. I haven't yet put together a comprehensive benchmark for highly unbalanced trees, and in truth there are so many possible tree structures that it would be challenging to test them all. However, on trees with 5 leaves the performance is roughly identical to that of `stats`, and benchmarking with `microbenchmark` demonstrates performance gains of roughly 5x on fully balanced trees with 10-5000 leaves. This should be a lower bound for performance, since fully balanced trees minimize internal nodes and thus have less recursion...so on reasonably sized trees of arbitrary structure we should have at least around a 4x improvement. I'll also stress that the focus of this patch is not a runtime improvement--it's nice that we get a speedup, but the added value here is the removal of recursive calls.
- Why not just put this in a package?
I think there's value in fleshing out the structures included in base R. Dendrograms are a general tree structure, and few programming languages provide support for these out of the box. Dendrogram objects are currently rarely used, but with a little bit of additional functionality, they could be a very powerful tool for users. These have applications in a variety of fields that are not just phylogenetics; implementations of domain-specific tools (e.g. `ape`, `DECIPHER`) are better suited to 3rd party packages. However, `dendrograms` already exist in base and have poor support, which is even admitted in their help files. `dendrapply`. While it's currently limited to acting on dendrogram objects, a solid implementation would open the door to generalizing dendrapply to work on any nested list. This is my personal opinion, and there is certainly an argument to be made that `dendrapply` (and even `dendrogram` as a whole) could live outside of base.
- Why/how were the included traversal strategies chosen?
The default, pre.order, was chosen because it replicates existing functionality. post.order was included because it lends itself well to a lot of applications. Between these two methods, we have a way to apply a function to trees ensuring that parents are evaluated before children, and ensuring children are evaluated before parents. Future extensions to support an in.order or BFS/level.order traversal is definitely an option, but I don't think the added implementation effort and complexity adds a lot of functionality over the two that have been included.
There's also been previous comments regarding the structure of dendrograms. While hclust will return a bifurcating tree, dendrogram objects (and dendrapply) support arbitrary multifurcations and edge weights.
Apologies for the lengthy email. If you've read even half, thanks for your time. There�s likely some optimizations that could be made in the C code dealing with R structures; I�m still learning the intricacies of some of the more specific points of this. One that comes to mind is converting the repeated `lang2` calls to instead initialize a `lang2` call and then change the symbol in the call, as is done in `lapply` (and was mentioned by Ivan in my last submission). I�ll test that as well.
Further feedback is welcome and much appreciated.
-Aidan

-----------------------
Aidan Lakshman (he/him)<https://www.ahl27.com/>
Doctoral Fellow, Wright Lab<https://www.wrightlabscience.com/>
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ahl27 using pitt.edu
(724) 612-9940


	[[alternative HTML version deleted]]



More information about the R-devel mailing list