[Bioc-devel] Why should Bioconductor developers re-use core classes?

Thu Oct 19 05:36:15 CEST 2017

Thanks, Levi, nice slides.

In case it is a helpful perspective, I'll try to share what I recall of my
thought process as author of phyloseq. And I should preface by admitting
that I've been embarrassed by this major development oversight for some
years now.

At the beginning of 2011 I was a new postdoc, heavy R user, completely new
to R development, and in a "cousin" field (microbiome & bacterial genomics)
that had virtually no presence in BioC. Some of the recommendations you've
made in this slide deck were not available then, but admittedly, I might
have missed them even if they were. I had access to training in base R
devel (Hadley's bootcamp, John Chambers' course at Stanford), but a lot of
resources in BioC were still very new to me.

If you can believe it, my original idea for the phyloseq-class was even
worse! Valerie Obenchain was great and patiently talked me into a better
solution, but somehow she also missed that I might have re-used available
classes and avoided some unnecessary implementation and maintenance.

What additional recommendations can I make with the benefit of hindsight
that have not been mentioned in this thread?

- Hopefully it is obvious from my description, and I imagine it was also
Levi's motivation for making the slide deck: new, eager developers are
missing out on this great infrastructure, and it isn't because they want
to re-implement core stuff. I sure didn't! I simply didn't know what was
there, or what the best practices for BioC were. *A "beginner's guide
to BioC package development"* would have been at the top of my list of
things to read back then.

- It isn't that I didn't read other established packages. I did. However, a
lot of core BioC tools had gene-expression specific names even for data
classes that were not intrinsically gene expression (e.g. it's actually a
matrix, or related tables) -- and I'm happy many of these now use more
general names like "experiment" or "row". The old names signaled to me
"this isn't for you". And I naively, ignorantly, mostly accepted that at
face value.

- Conversely, sometimes not inheriting methods is a feature, because it
protects users from doing something that is great and appropriate for one
domain (gene expression) but totally irrational in another (microbiome).
I'm not saying my original implementation made great nuanced decisions
about this -- it has many trappings of a new developer -- but I did have
some pretty naive users in mind with phyloseq, for whom navigating legacy
methods and method names from other domain(s) was expected to be a hurdle.
Curious to hear thoughts on this.

- There actually *still isn't core support for evolutionary trees in BioC* (as
mentioned by Joe Paulson and Ben Callahan in other threads). One of
phyloseq's key contributions was to leverage the fantastic representation
of trees implemented in the CRAN package "ape" in order to support analysis
techniques popular among microbiome researchers that require a phylogenetic
tree. The integration between the phyloseq-class and ape is necessarily
pretty deep, including certain row operations. Users also needed a
familiar and simple R interface to manipulate that composite object
despite the complex hierarchical relationship among rows. Correct me if
I'm wrong, but I think
there is still no core BioC support for representing tree-like or
bio-taxonomy-like hierarchy among rows in a SummarizedExperiment, or
equivalent; and consequently certain row operations may have to be modified
more deeply than usual if we were to re-implement phyloseq "the right way".
I'd love to hear thoughts on this.
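
To make the row-operation point concrete, here is a minimal sketch. It is
written in Python rather than R purely as a language-neutral illustration,
and every name in it (`Experiment`, `TreeExperiment`, `prune`) is
hypothetical: none of this is the API of phyloseq, ape, or
SummarizedExperiment. It assumes a tree stored as nested tuples whose tips
are row names. The subclass inherits `row_totals` unchanged, but has to
override row subsetting so that the tree is pruned to match the rows.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical generic rows-by-samples container (not a real BioC class)."""
    row_names: list
    counts: list  # one inner list of per-sample counts for each row

    def subset_rows(self, keep):
        """Return a new Experiment holding only the rows named in `keep`."""
        kept = set(keep)
        idx = [i for i, name in enumerate(self.row_names) if name in kept]
        return Experiment(
            [self.row_names[i] for i in idx],
            [self.counts[i] for i in idx],
        )

    def row_totals(self):
        """Sum each row across samples; knows nothing about any tree."""
        return {name: sum(row) for name, row in zip(self.row_names, self.counts)}


def prune(tree, keep):
    """Drop tips not in `keep` and collapse nodes left with a single child."""
    if isinstance(tree, str):  # a tip
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


@dataclass
class TreeExperiment(Experiment):
    """Hypothetical subclass whose rows are also the tips of a tree."""
    tree: tuple

    def subset_rows(self, keep):
        # Reuse the parent's row logic, then keep the tree in sync with the rows.
        kept = set(keep)
        base = super().subset_rows(kept)
        return TreeExperiment(base.row_names, base.counts, prune(self.tree, kept))


te = TreeExperiment(
    ["otu1", "otu2", "otu3", "otu4"],
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    (("otu1", "otu2"), ("otu3", "otu4")),
)
sub = te.subset_rows(["otu1", "otu3", "otu4"])
print(sub.row_names)     # rows that survived the subset
print(sub.tree)          # tree pruned to the same tips
print(sub.row_totals())  # inherited method, used as-is
```

The design choice being illustrated is only this: most methods can come
from the parent class for free, while the few operations that touch the
row hierarchy need a deliberate override.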

Even though phyloseq is at the receiving end, I think the criticism is
fair, and I want current and future new BioC contributors to not re-make my
mistakes circa 2011-12. I'm happy to help if I can.

Cheers, and thanks for the interesting, collegial thread.

Joey

---
"Joey"
Paul J. McMurdie II

On Wed, Oct 18, 2017 at 11:46 AM, Levi Waldron <lwaldron.research at gmail.com>
wrote:

> On Wed, Oct 18, 2017 at 10:26 AM, Ryan Thompson <rct at thompsonclan.org>
> wrote:
>
>> I think the main reason for reusing/subclassing core classes that users
>> can appreciate is that it makes it much easier for users to integrate
>> multiple packages into a single workflow. Only the most basic of
>> pipelines uses just a single Bioconductor package. For instance, an
>> "edgeR" pipeline obviously uses the edgeR package, but it likely also
>> uses several other packages, like sva, RUV, variancePartition, etc. The
>> more these different packages operate on the same core data structures,
>> the less work the user has to do to use them together. And to bring
>> that back around to an incentive for developers, making your package
>> interoperate with other packages more easily means that users will be
>> more likely to use your package.
>
> My impression is that the interoperability argument may already be more
> widely appreciated, because in the pipeline example you can have several
> packages operating on the same data class. It seems less obvious when you
> are doing something different that requires defining a new class, why you
> should extend an existing class to meet your needs. Although I guess your
> point extends to interoperability with other packages providing methods for
> the parent class, and the ability to use coercion methods defined for the
> parent class, which I didn't mention...
>
