[Bioc-devel] Why should Bioconductor developers re-use core classes?

Martin Morgan martin.morgan at roswellpark.org
Thu Oct 19 18:53:10 CEST 2017

On 10/19/2017 10:12 AM, Levi Waldron wrote:
> Thanks for all your thoughts Joey, and I hope I didn't come across as
> critical of phyloseq in particular. In fact, the couple packages I created
> as a post-doc (doppelgangR and ffpe) did exactly the same thing, but unlike
> phyloseq have never been used enough for it to make much difference :).
> This comparison with phyloseq and metagenomeSeq is just one that I have
> personal experience with because I use both packages and it was something
> of a revelation to me when I realized that MRExperiment and all other
> eSet-derived classes would "just work" as elements of a
> MultiAssayExperiment. My motivation in making these slides was mainly to
> share my own very recent revelations and try to make life better for new
> package developers and their users.  A couple comments below:
> On Wed, Oct 18, 2017 at 11:36 PM, Paul Joseph McMurdie <joey711 at gmail.com>
> wrote:
>> - Hopefully it is obvious from my description, and also what I imagine to
>> be Levi's motivation for making the slide deck, but somehow new eager
>> developers are missing out on this great infrastructure and it isn't
>> because they want to re-implement core stuff. I sure didn't! I simply
>> didn't know what was there or best practices for BioC. *A "beginner's
>> guide to BioC package development"* would have been at the top of my list
>> of things to read back then.
> I think this is still not abundantly clear from
> http://bioconductor.org/developers/ and I'm not sure how much it factors
> into the package review process. Would it be helpful to have a short
> questionnaire for creators of new packages that includes some things not
> part of R CMD BiocCheck, or CHECK, like - what new classes do you define,
> and what is the purpose of defining them?

[ should have mentioned that the slack channel is #tree-like-se ]

A couple of quick comments on new packages / reviews.

New package submissions have a checklist that people submit with their 
package. Since it is not uncommon to find checked items that have not 
been satisfied, I'm quite cynical about this as a way of enforcing good 
behavior, though maybe it does something to elevate awareness.

One of the check-list items is acknowledge reading the package guidelines


which include an S4 section


and links to general guidance


as well as emphasis on core containers


A relatively humorous pattern is for newer team members to eventually 
lament that new packages don't re-use existing classes and methods, 
suggesting that more documentation is the solution; usually the litany 
above is cited, and the newer team member either tweaks or adds to the 
existing documentation to the point that they think it is surely 
satisfactory, or sees that more documentation is not an effective 
answer. This pattern started again yesterday...

One thing that Joey said, and that I think is true of a lot of new 
package contributors, is that they are 'new' developers. This is 
somewhat ironic, since they are writing packages that are meant (and 
sometimes are, as with phyloseq) to be widely used and influential; it 
seems somehow appropriate to keep that channel open, with the domain and 
other expertise that 'new' developers bring as a trade-off against the 
sophistication of their current development skills. Of course being new 
developers mean that they make both 'mistakes' and choices that are not 
entirely consistent with the overall project, e.g., use of S3 or R6 
rather than S4 classes, reimplementation of existing functionality, 
adopting unfamiliar API conventions (a minor favorite of mine is the 
presentation of a novel method for some aspect of differential 
expression, with the primary input a matrix of samples x 'genes' rather 
than the bioinformatics convention of 'genes' x samples). The diversity 
of programming and analysis approaches available in R makes it not at 
all surprising that the insights that can come even after many years of 
working with Bioconductor are not accessible to many new package 

I'll also mention my own philosophy (which I sometimes emphasize with 
other members of the core team of reviewers) in reviewing packages, 
which is to seek incremental gain rather than perfection; sometimes 
there are issues other than S4 class use (!) where greater strides can 
be made.


>> - It isn't that I didn't read other established packages. I did. However,
>> a lot of core BioC tools had gene-expression specific names even for data
>> classes that were not intrinsically gene expression (e.g. it's actually a
>> matrix, or related tables) -- and I'm happy many of these now use more
>> general names like "experiment" or "row". The old names signaled to me
>> "this isn't for you". And I naively, ignorantly, accepted that at mostly
>> face value.
>> - Conversely, sometimes not-inheriting methods is a feature, because it
>> protects users from doing something that is great and appropriate for one
>> domain (gene expression) but totally irrational in another (microbiome).
>> I'm not saying my original implementation made great nuanced decisions
>> about this -- it has many trappings of a new developer -- but I did have
>> some pretty naive users in mind with phyloseq, for whom navigating legacy
>> methods and method names from other domain(s) was expected to be a hurdle.
>> Curious to hear thoughts on this.
> Something you did well in phyloseq was define methods for all common user
> operations, and if you had inherited from eSet, you probably still would've
> wanted to do this - except that more of your methods would've been just
> wrappers for inappropriately named eSet methods, and your average user
> wouldn't have known the difference. Because of your use of methods and not
> direct slot access by users, I think you could still change the underlying
> data model without it being noticeable to basic users, even if it expanded
> the number of methods actually available. Thankfully, SummarizedExperiment
> I think now avoids these "this isn't for you" signals.
>> - There actually *still isn't core support for evolutionary trees in BioC* (as
>> mentioned by Joe Paulson and Ben Callahan in other threads). One of
>> phyloseq's key contributions was to leverage the fantastic representation
>> of trees implemented in the CRAN package "ape" in order to support analysis
>> techniques popular among microbiome researchers that require a phylogenetic
>> tree. The integration in the phyloseq-class and ape is necessarily pretty
>> deep, including certain row operations. Users also needed a familiar and
>> simple R interface to manipulate that composite object despite the complex
>> hierarchical relationship among rows. Correct me if I'm wrong, but I think
>> there is still no core BioC support for representing tree-like or
>> bio-taxonomy-like hierarchy among rows in a SummarizedExperiment, or
>> equivalent; and consequently certain row operations may have to be modified
>> more deeply than usual if we were to re-implement phyloseq "the right way".
>> I'd love to hear thoughts on this.
> AFAIK you're right, and I don't know the solution, although I hope it can
> be built on SummarizedExperiment. Looking forward to talking more about
> this.
>> Even though phyloseq is at the receiving end, I think the criticism is
>> fair, and I want current and future new BioC contributors to not re-make my
>> mistakes circa 2011-12. I'm happy to help if I can.
>> Cheers, and thanks for the interesting, collegial thread.
>> Joey
> Thanks Joey, and I do want to say also that I think phyloseq is responsible
> for making Bioconductor a viable and already superior choice for
> statistical analysis of microbiome data!
> 	[[alternative HTML version deleted]]
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

This email message may contain legally privileged and/or...{{dropped:2}}

More information about the Bioc-devel mailing list