[Bioc-devel] Why should Bioconductor developers re-use core classes?

Thu Oct 19 16:12:55 CEST 2017

Thanks for all your thoughts Joey, and I hope I didn't come across as
critical of phyloseq in particular. In fact, the couple packages I created
as a post-doc (doppelgangR and ffpe) did exactly the same thing, but unlike
phyloseq have never been used enough for it to make much difference :).
This comparison with phyloseq and metagenomeSeq is just one that I have
personal experience with because I use both packages and it was something
of a revelation to me when I realized that MRExperiment and all other
eSet-derived classes would "just work" as elements of a
MultiAssayExperiment. My motivation in making these slides was mainly to
share my own very recent revelations and try to make life better for new
package developers and their users.  A couple comments below:

On Wed, Oct 18, 2017 at 11:36 PM, Paul Joseph McMurdie <joey711 at gmail.com>
wrote:

>
> - Hopefully it is obvious from my description, and also what I imagine to
> be Levi's motivation for making the slide deck, but somehow new eager
> developers are missing out on this great infrastructure and it isn't
> because they want to re-implement core stuff. I sure didn't! I simply
> didn't know what was there or best practices for BioC. *A "beginner's
> guide to BioC package development"* would have been at the top of my list
> of things to read back then.
>

I think this is still not abundantly clear from
http://bioconductor.org/developers/ and I'm not sure how much it factors
into the package review process. Would it be helpful to have a short
questionnaire for creators of new packages that includes some things not
part of R CMD BiocCheck, or CHECK, like - what new classes do you define,
and what is the purpose of defining them?

> - It isn't that I didn't read other established packages. I did. However,
> a lot of core BioC tools had gene-expression specific names even for data
> classes that were not intrinsically gene expression (e.g. it's actually a
> matrix, or related tables) -- and I'm happy many of these now use more
> general names like "experiment" or "row". The old names signaled to me
> "this isn't for you". And I naively, ignorantly, accepted that at mostly
> face value.
>
> - Conversely, sometimes not-inheriting methods is a feature, because it
> protects users from doing something that is great and appropriate for one
> domain (gene expression) but totally irrational in another (microbiome).
> I'm not saying my original implementation made great nuanced decisions
> about this -- it has many trappings of a new developer -- but I did have
> some pretty naive users in mind with phyloseq, for whom navigating legacy
> methods and method names from other domain(s) was expected to be a hurdle.
> Curious to hear thoughts on this.
>

Something you did well in phyloseq was define methods for all common user
operations, and if you had inherited from eSet, you probably still would've
wanted to do this - except that more of your methods would've been just
wrappers for inappropriately named eSet methods, and your average user
wouldn't have known the difference. Because of your use of methods and not
direct slot access by users, I think you could still change the underlying
data model without it being noticeable to basic users, even if it expanded
the number of methods actually available. Thankfully, SummarizedExperiment
I think now avoids these "this isn't for you" signals.

> - There actually *still isn't core support for evolutionary trees in BioC* (as
> mentioned by Joe Paulson and Ben Callahan in other threads). One of
> phyloseq's key contributions was to leverage the fantastic representation
> of trees implemented in the CRAN package "ape" in order to support analysis
> techniques popular among microbiome researchers that require a phylogenetic
> tree. The integration in the phyloseq-class and ape is necessarily pretty
> deep, including certain row operations. Users also needed a familiar and
> simple R interface to manipulate that composite object despite the complex
> hierarchical relationship among rows. Correct me if I'm wrong, but I think
> there is still no core BioC support for representing tree-like or
> bio-taxonomy-like hierarchy among rows in a SummarizedExperiment, or
> equivalent; and consequently certain row operations may have to be modified
> more deeply than usual if we were to re-implement phyloseq "the right way".
> I'd love to hear thoughts on this.
>

AFAIK you're right, and I don't know the solution, although I hope it can
be built on SummarizedExperiment. Looking forward to talking more about
this.

> Even though phyloseq is at the receiving end, I think the criticism is
> fair, and I want current and future new BioC contributors to not re-make my
> mistakes circa 2011-12. I'm happy to help if I can.
>
> Cheers, and thanks for the interesting, collegial thread.
>
> Joey
>

Thanks Joey, and I do want to say also that I think phyloseq is responsible
for making Bioconductor a viable and already superior choice for
statistical analysis of microbiome data!

	[[alternative HTML version deleted]]