[Bioc-devel] Why should Bioconductor developers re-use core classes?

Paul Joseph McMurdie joey711 at gmail.com
Fri Oct 20 04:19:46 CEST 2017

Thanks for those additional comments, Levi. I don't think you were being
unfair to phyloseq, and it sounds like some of the issues involved are
still relevant (e.g. new domain, new contributor).

I wanted to riff on Martin's comment about "incremental gain rather than
perfection". I imagine there are many reasons to use this as a guiding
principle, but one of them could be that it helps foster new-to-BioC
contributors by not overwhelming them with sub-critical requirements.
Interoperability is a feature that developers are already incentivized to
leverage, when they realize it, if they haven't buried themselves too deep
O:-). Similarly, after taking a peek at those developer docs, I think
<http://bioconductor.org/developers/how-to/buildingPackagesForBioc/> looks
pretty great, and I wish it had existed (or I had known about it) when I
unwittingly blundered these early decisions in phyloseq.

Thanks again for the discussion/feedback!

Paul J. McMurdie II
Sent from Gmail

On Thu, Oct 19, 2017 at 9:53 AM, Martin Morgan <
martin.morgan at roswellpark.org> wrote:

> On 10/19/2017 10:12 AM, Levi Waldron wrote:
>> Thanks for all your thoughts Joey, and I hope I didn't come across as
>> critical of phyloseq in particular. In fact, the couple packages I created
>> as a post-doc (doppelgangR and ffpe) did exactly the same thing, but
>> unlike
>> phyloseq have never been used enough for it to make much difference :).
>> This comparison with phyloseq and metagenomeSeq is just one that I have
>> personal experience with because I use both packages and it was something
>> of a revelation to me when I realized that MRExperiment and all other
>> eSet-derived classes would "just work" as elements of a
>> MultiAssayExperiment. My motivation in making these slides was mainly to
>> share my own very recent revelations and try to make life better for new
>> package developers and their users.  A couple comments below:
>> On Wed, Oct 18, 2017 at 11:36 PM, Paul Joseph McMurdie <joey711 at gmail.com
>> >
>> wrote:
>>> - Hopefully it is obvious from my description, and also what I imagine to
>>> be Levi's motivation for making the slide deck, but somehow new eager
>>> developers are missing out on this great infrastructure and it isn't
>>> because they want to re-implement core stuff. I sure didn't! I simply
>>> didn't know what was there or best practices for BioC. *A "beginner's
>>> guide to BioC package development"* would have been at the top of my list
>>> of things to read back then.
>> I think this is still not abundantly clear from
>> http://bioconductor.org/developers/ and I'm not sure how much it factors
>> into the package review process. Would it be helpful to have a short
>> questionnaire for creators of new packages that includes some things not
>> part of R CMD BiocCheck, or CHECK, like - what new classes do you define,
>> and what is the purpose of defining them?
> [ should have mentioned that the slack channel is #tree-like-se ]
> A couple of quick comments on new packages / reviews.
> New package submissions have a checklist that people submit with their
> package. Since it is not uncommon to find checked items that have not been
> satisfied, I'm quite cynical about this as a way of enforcing good
> behavior, though maybe it does something to elevate awareness.
> One of the check-list items is acknowledge reading the package guidelines
>   https://bioconductor.org/developers/package-guidelines/
> which include an S4 section
>   https://bioconductor.org/developers/package-guidelines/#classes
> and links to general guidance
> https://bioconductor.org/developers/how-to/efficient-code/#
> re-use-existing-functionality
> as well as emphasis on core containers
>   https://bioconductor.org/developers/how-to/commonMethodsAndClasses/
> A relatively humorous pattern is for newer team members to eventually
> lament that new packages don't re-use existing classes and methods,
> suggesting that more documentation is the solution; usually the litany
> above is cited, and the newer team member either tweaks or adds to the
> existing documentation to the point that they think it is surely
> satisfactory, or sees that more documentation is not an effective answer.
> This pattern started again yesterday...
> One thing that Joey said, and that I think is true of a lot of new package
> contributors, is that they are 'new' developers. This is somewhat ironic,
> since they are writing packages that are meant (and sometimes are, as with
> phyloseq) to be widely used and influential; it seems somehow appropriate
> to keep that channel open, with the domain and other expertise that 'new'
> developers bring as a trade-off against the sophistication of their current
> development skills. Of course being new developers mean that they make both
> 'mistakes' and choices that are not entirely consistent with the overall
> project, e.g., use of S3 or R6 rather than S4 classes, reimplementation of
> existing functionality, adopting unfamiliar API conventions (a minor
> favorite of mine is the presentation of a novel method for some aspect of
> differential expression, with the primary input a matrix of samples x
> 'genes' rather than the bioinformatics convention of 'genes' x samples).
> The diversity of programming and analysis approaches available in R makes
> it not at all surprising that the insights that can come even after many
> years of working with Bioconductor are not accessible to many new package
> contributors.
> I'll also mention my own philosophy (which I sometimes emphasize with
> other members of the core team of reviewers) in reviewing packages, which
> is to seek incremental gain rather than perfection; sometimes there are
> issues other than S4 class use (!) where greater strides can be made.
> Martin
>> - It isn't that I didn't read other established packages. I did. However,
>>> a lot of core BioC tools had gene-expression specific names even for data
>>> classes that were not intrinsically gene expression (e.g. it's actually a
>>> matrix, or related tables) -- and I'm happy many of these now use more
>>> general names like "experiment" or "row". The old names signaled to me
>>> "this isn't for you". And I naively, ignorantly, accepted that at mostly
>>> face value.
>>> - Conversely, sometimes not-inheriting methods is a feature, because it
>>> protects users from doing something that is great and appropriate for one
>>> domain (gene expression) but totally irrational in another (microbiome).
>>> I'm not saying my original implementation made great nuanced decisions
>>> about this -- it has many trappings of a new developer -- but I did have
>>> some pretty naive users in mind with phyloseq, for whom navigating legacy
>>> methods and method names from other domain(s) was expected to be a
>>> hurdle.
>>> Curious to hear thoughts on this.
>> Something you did well in phyloseq was define methods for all common user
>> operations, and if you had inherited from eSet, you probably still
>> would've
>> wanted to do this - except that more of your methods would've been just
>> wrappers for inappropriately named eSet methods, and your average user
>> wouldn't have known the difference. Because of your use of methods and not
>> direct slot access by users, I think you could still change the underlying
>> data model without it being noticeable to basic users, even if it expanded
>> the number of methods actually available. Thankfully, SummarizedExperiment
>> I think now avoids these "this isn't for you" signals.
>> - There actually *still isn't core support for evolutionary trees in
>>> BioC* (as
>>> mentioned by Joe Paulson and Ben Callahan in other threads). One of
>>> phyloseq's key contributions was to leverage the fantastic representation
>>> of trees implemented in the CRAN package "ape" in order to support
>>> analysis
>>> techniques popular among microbiome researchers that require a
>>> phylogenetic
>>> tree. The integration in the phyloseq-class and ape is necessarily pretty
>>> deep, including certain row operations. Users also needed a familiar and
>>> simple R interface to manipulate that composite object despite the
>>> complex
>>> hierarchical relationship among rows. Correct me if I'm wrong, but I
>>> think
>>> there is still no core BioC support for representing tree-like or
>>> bio-taxonomy-like hierarchy among rows in a SummarizedExperiment, or
>>> equivalent; and consequently certain row operations may have to be
>>> modified
>>> more deeply than usual if we were to re-implement phyloseq "the right
>>> way".
>>> I'd love to hear thoughts on this.
>> AFAIK you're right, and I don't know the solution, although I hope it can
>> be built on SummarizedExperiment. Looking forward to talking more about
>> this.
>> Even though phyloseq is at the receiving end, I think the criticism is
>>> fair, and I want current and future new BioC contributors to not re-make
>>> my
>>> mistakes circa 2011-12. I'm happy to help if I can.
>>> Cheers, and thanks for the interesting, collegial thread.
>>> Joey
>> Thanks Joey, and I do want to say also that I think phyloseq is
>> responsible
>> for making Bioconductor a viable and already superior choice for
>> statistical analysis of microbiome data!
>>         [[alternative HTML version deleted]]
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> This email message may contain legally privileged and/or confidential
> information.  If you are not the intended recipient(s), or the employee or
> agent responsible for the delivery of this message to the intended
> recipient(s), you are hereby notified that any disclosure, copying,
> distribution, or use of this email message is prohibited.  If you have
> received this message in error, please notify the sender immediately by
> e-mail and delete this email message from your computer. Thank you.

	[[alternative HTML version deleted]]

More information about the Bioc-devel mailing list