[Bioc-devel] Request to add 'normalize' to BiocGenerics

Wolfgang Huber whuber at embl.de
Thu Feb 21 22:23:54 CET 2013


Hi Herve

I have absolutely no objections against 'normalize' in BiocGenerics, I think it is a good idea.

However, I find objectionable the concept of a 'universal namespace' in which no package may mask symbols defined in other packages. There is no 'redefining' of functions, as you suggest below in (b). Rather, there are multiple co-existing definitions associated with the same name, and there are clear and well-designed rules for resolving such cases.
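
A minimal sketch of co-existing definitions, using base R only rather than any Bioconductor package: a user-level `median` (a hypothetical stand-in for a package-level `normalize`) masks `stats::median` in ordinary lookup, while the `::` operator still reaches the original. The variable names are illustrative assumptions.

```r
# Hypothetical user-level definition that masks stats::median.
median <- function(x, ...) "masked"

# Unqualified lookup finds the most recent (masking) definition first.
masked_result <- median(c(1, 2, 3))

# Qualified lookup goes straight to the stats namespace.
qualified_result <- stats::median(c(1, 2, 3))

# Removing the mask restores the default lookup; nothing was redefined.
rm(median)
restored_result <- median(c(1, 2, 3))
```

Both definitions exist side by side the whole time; which one an unqualified call reaches depends only on lookup order.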

In the Bioconductor project we seem to have concluded that a certain amount of name policing and allocation of 'reserved function names' has more benefits than problems. I would vote for making this policy explicit and stating it in the package guidelines. I would also suggest using this approach sparingly. Looking at BiocGenerics, there are currently 67 generics defined, with 90 more in Biobase and 158 in IRanges. Some seem rather un-generic, like 'snpCallProbability' or 'pmax.int', or cryptic, such as 'NROW' and 'NCOL'. (Perhaps we're already closer to the suggestion in my post of yesterday than I thought :)

	Best wishes
	Wolfgang




On Feb 21, 2013, at 9:16 PM, Hervé Pagès <hpages at fhcrc.org> wrote:

> Hi Laurent, Robert, and others,
> 
> On 02/20/2013 01:33 PM, Laurent Gautier wrote:
>> On 2013-02-20 22:02, Hervé Pagès wrote:
>>> Hi Robert,
>>> 
>>> Nice to hear from you!
>>> 
>>> I'm just a little bit worried that if we put normalize() in
>>> BiocGenerics without at the same time having a "take it or leave it"
>>> policy, then most of the authors of the 10 packages are likely going
>>> to ignore that there is now a new normalize() in BiocGenerics.
>> 
>> There are roughly two schools for getting compliance:
>> a- enforcement
>> b- communication and education
>> 
>> Start with b-, and most will comply I think (since it will not change
>> their API).
>> For new packages there are already a few requirements for acceptance
>> into Bioconductor (not all of which are necessarily fulfilled by every
>> package); this would only make one more requirement.
> 
> Requirement == enforcement. Otherwise we reject the package.
> 
> My understanding of Robert's email is that people might have good
> reasons for re-defining a generic that is already defined in BiocGenerics.
> 
> So here is my attempt at wrapping this up:
> 
>  (a) There seems to be a consensus (Wolfgang?) in favor of adding
>      the normalize() generic to BiocGenerics.
> 
>  (b) We should strongly encourage (not enforce) existing packages and
>      new packages to not redefine their own normalize() function or
>      any generic function already defined in BiocGenerics.
> 
> If we agree on this, I'll add it to our post-BioC-2.12-release TODO
> list. Unless I hear a big "NO" from Wolfgang who I believe is the
> author/maintainer of one of the packages that define a normalize()
> function.
> 
> FWIW, as a package reviewer myself, when I load the package I'm
> reviewing for the first time and see warnings that it's masking
> stuff defined in other packages, it's an orange light to me.
> Or a red light if the masked stuff is defined in packages that are
> at the bottom of the stack. So they'll need to have really really
> good reasons for doing this ;-)
> 
> H.
> 
>> 
>>> They
>>> didn't have any problem so far clashing with 9 other Bioc packages so
>>> why would they care if now it's 10 instead of 9?
>> 
>> I'd suspect that they mostly do/did not care because either they were
>> not using those packages or, if they were, probably not within the same
>> R session (so clashes were not so obvious).
>> 
>>> 
>>> Furthermore, the good citizens that modify their package to use the
>>> normalize() in BiocGenerics will be in a situation worse than before
>>> because, from an end user point of view, their normalize() function
>>> (which is now a method attached to BiocGenerics::normalize) won't
>>> seem to work anymore, even if their package was loaded last, just
>>> because one bad citizen was loaded before (unless the end user calls
>>> BiocGenerics::normalize()). That would be unfair.
>>> So if I were the maintainer of one of those packages, I wouldn't see
>>> any benefit in making that move, quite the contrary, unless everybody
>>> else also makes it.
>> 
>> Eh... "the Bioconductor package maintainer" as a variant of the
>> "Prisoner's dilemma".
>> ;-)
>> 
>>> 
>>> I kind of agree that namespaces can accommodate the current situation,
>>> even though, for the end user working interactively, the experience is
>>> not really pleasant, especially when they try to access the man page for
>>> normalize(). But if we are happy with that, then I don't really see
>>> the need to put normalize() in BiocGenerics.
>> 
>> Another possible benefit of having a set of functions in
>> "BiocGenerics" is to suggest that some method names are more commonly
>> found (and make it easier for the end user by having fewer names to
>> remember).
>> For example, if there is a generic "plot", the right method is more
>> likely to be called "plot" rather than "draw", "sketch", or "paint".
>> 
>>> 
>>> Cheers,
>>> H.
>>> 
>>> 
>>> On 02/20/2013 11:52 AM, Robert Gentleman wrote:
>>>> my 2c worth
>>>> 
>>>> On Wed, Feb 20, 2013 at 10:45 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>> Hi,
>>>>> 
>>>>> I agree with Laurent that we can't really play the semantic and concept
>>>>> police. It's the responsibility of package authors to decide whether
>>>>> it's appropriate or not to call "normalization" that particular
>>>>> transformation they're implementing.
>>>>> 
>>>>> However I hope that we all agree on the following rule regarding the
>>>>> generics that make it into BiocGenerics:
>>>>> 
>>>>>   If foo() is a generic function defined in BiocGenerics, no
>>>>>   BioC package should redefine the function (either as a generic
>>>>>   or an ordinary function). It can only define methods for it,
>>>>>   or move away and use a different name for this functionality.
>>>> 
>>>>  but really the point of namespaces is that you don't need to do that.
>>>>  And we really don't want to be the naming police.
>>>>   The sole advantage of BiocGenerics, I think, is that there is a
>>>> common and standard location for a set of generic functions that get
>>>> used in different packages.  This allows package authors to add
>>>> methods that specialize the behavior of a generic function.  They have
>>>> some confidence that the generic will always exist and hence can plan
>>>> accordingly.  It hopefully reduces dependencies between packages.
>>>> 
>>>>   I don't think it should define a set of reserved words; that seems
>>>> counterproductive.
>>>>    There are often good reasons why the same name is used for different
>>>> concepts (normalize being one of them).  And in some cases a single
>>>> generic suffices, but in others it will not.  A single generic falls
>>>> apart when there are really different argument lists, and when
>>>> inheritance (and hence things like NextMethod) is going to get messed
>>>> up because the disparate methods are all linked to a single generic.
>>>> Generics are really concepts - and the methods are realizations of
>>>> those concepts.
>>>> 
>>>>   Of course, packages that define functions whose names clash with
>>>> BiocGenerics will cause problems, and it would generally be best
>>>> to avoid that, but really I don't think I would advocate any sort of
>>>> prohibition.
>>>> 
>>>> 
>>>>> 
>>>>> Does that sound reasonable? Otherwise that would kind of defeat the
>>>>> purpose of having the BiocGenerics package in the 1st place.
>>>>> 
>>>>> To me, having 10 BioC packages defining a normalize() function is far
>>>>> from being ideal. I think having it defined in BiocGenerics would
>>>>> improve things a little bit. Also one potential positive side effect
>>>>> I see is that it would give an opportunity to the authors of those
>>>>> 10 packages to reconsider if they still want to ride the normalize()
>>>>> pony or not. Maybe some of them won't and they'll pick another
>>>>> name. Not something we can really decide for them...
>>>>> 
>>>>> H.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 02/20/2013 09:47 AM, Laurent Gautier wrote:
>>>>>> 
>>>>>> On 2013-02-20 17:32, Schalkwyk, Leonard wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Is this not just an indication that normalize is now a poor choice of
>>>>>>> a function name?
>>>>>> 
>>>>>> 
>>>>>> If the package authors called the functions "normalize", this means
>>>>>> either:
>>>>>> 1- at least some of the package authors have named a function
>>>>>> performing an action that is inappropriately described as "normalize"
>>>>>> 2- all functions "normalize" do perform an action that can be
>>>>>> described with that verb
>>>>>> 
>>>>>> Without more details, I'd vote for 2.
>>>>>> 
>>>>>> (more below)
>>>>>> 
>>>>>>> 
>>>>>>> Leo
>>>>>>> 
>>>>>>> On 20 Feb 2013, at 16:14, Wolfgang Huber wrote:
>>>>>>> 
>>>>>>>> Hi
>>>>>>>> 
>>>>>>>> is it clear that all these different functions (methods) share
>>>>>>>> similar semantics and enough (conceptually) of their interface?
>>>>>> 
>>>>>> 
>>>>>> Playing the semantic and concept police would come after defining
>>>>>> things like ontologies of data processing; I am not sure this should
>>>>>> be a priority.
>>>>>> I'd see working out a minimal common signature that keeps everyone
>>>>>> going with minimal fuss as coming first.
>>>>>> 
>>>>>>>> 
>>>>>>>> Wouldn't the implication be that preemptively every possible string
>>>>>>>> of characters should already be defined as a generic function in
>>>>>>>> BiocGenerics?
>>>>>> 
>>>>>> 
>>>>>> No. Otherwise this would probably also mean that R's S4 system
>>>>>> should in fact define all possible strings as generics, which by
>>>>>> extension would also mean that generic functions do not need to be
>>>>>> explicitly declared: since all possible generics would be declared,
>>>>>> it is more practical to implicitly assume that any given function
>>>>>> already has a generic declared. S4 has notions about implicit generic
>>>>>> functions; a starting point is the man page for setGeneric().
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> 
>>>>>>>>     Best wishes
>>>>>>>>     Wolfgang
>>>>>>>> 
>>>>>>>> On Feb 20, 2013, at 11:04 AM, Laurent Gatto
>>>>>>>> <lg390 at cam.ac.uk> wrote:
>>>>>>>> 
>>>>>>>>> On 19 February 2013 22:44, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Laurent, and maintainers of packages with a normalize()
>>>>>>>>>> function,
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 02/15/2013 04:28 AM, Laurent Gatto wrote:
>>>>>>>>>>> 
>>>>>>>>>>> A quick (and incomplete) manual search using
>>>>>>>>>>> http://search.bioconductor.jp/ suggests the following usage of
>>>>>>>>>>> normalize:
>>>>>>>>>>> 
>>>>>>>>>>> As a function:
>>>>>>>>>>> xps::normalize
>>>>>>>>>>> codelink::normalize
>>>>>>>>>>> EBImage::normalize
>>>>>>>>>>> diffGeneAnalysis::normalize
>>>>>>>>>>> 
>>>>>>>>>>> Defining a generic and methods:
>>>>>>>>>>> oligo::normalize
>>>>>>>>>>> flowCore::normalize
>>>>>>>>>>> MSnbase::normalize
>>>>>>>>>>> isobar::normalize
>>>>>>>>>>> 
>>>>>>>>>>> and
>>>>>>>>>>> 
>>>>>>>>>>> several normalize\.[*+] functions
>>>>>>>>>>> 
>>>>>>>>>>> Would it be reasonable to add a normalize generic definition to
>>>>>>>>>>> BiocGenerics?  The generic definitions in the above packages
>>>>>>>>>>> differ, however.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Sounds good to me.
>>>>>>>>>> 
>>>>>>>>>> However, since the various normalize() functions have different
>>>>>>>>>> signatures, we need to agree on what the signature of the generic
>>>>>>>>>> in BiocGenerics should be.
>>>>>>>>>> 
>>>>>>>>>> Here is a summary of the situation:
>>>>>>>>>> 
>>>>>>>>>> ** xps package: normalize() is an ordinary function with the
>>>>>>>>>>     following arg list:
>>>>>>>>>> 
>>>>>>>>>>       normalize(xps.data, filename=character(0), filedir=getwd(),
>>>>>>>>>>                 tmpdir="", update=FALSE, select="all", method="mean",
>>>>>>>>>>                 option="transcript:all", logbase="0", exonlevel="",
>>>>>>>>>>                 refindex=0, refmethod="mean", params=list(0.02, 0),
>>>>>>>>>>                 add.data=TRUE, verbose=TRUE)
>>>>>>>>>> 
>>>>>>>>>>     The package also defines normalize.constant(), normalize.lowess(),
>>>>>>>>>>     normalize.quantiles(), normalize.supsmu(), which are also ordinary
>>>>>>>>>>     functions (not S3 methods), and have similar but slightly
>>>>>>>>>>     different arg lists.
>>>>>>>>>> 
>>>>>>>>>> ** codelink package: Ordinary function with the following args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(object, method="quantiles", log.it=TRUE,
>>>>>>>>>>                 preserve=FALSE, weights=NULL, verbose=FALSE)
>>>>>>>>>> 
>>>>>>>>>> ** EBImage package: Ordinary function with the following args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(x, separate=TRUE, ft=c(0, 1))
>>>>>>>>>> 
>>>>>>>>>> ** diffGeneAnalysis package: Ordinary function with the following
>>>>>>>>>>     args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(rawdata, numSlides, ctrl, expm, ctrlbg=0.30,
>>>>>>>>>>                 expmbg=0.30)
>>>>>>>>>> 
>>>>>>>>>> ** deepSNV package: S4 generic with the following args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(test, control, ...)
>>>>>>>>>> 
>>>>>>>>>> ** isobar package: S4 generic with the following args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(x, f=median, target="intensity",
>>>>>>>>>>                 exclude.protein=NULL, use.protein=NULL,
>>>>>>>>>>                 f.doapply=TRUE, log=TRUE, channels=NULL,
>>>>>>>>>>                 na.rm=FALSE, per.file=TRUE, ...)
>>>>>>>>>> 
>>>>>>>>>> ** affy package: S4 generic with the following args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(object, ...)
>>>>>>>>>> 
>>>>>>>>>> ** flowCore package: S4 generic with the following args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(data, x, ...)
>>>>>>>>>> 
>>>>>>>>>> ** MSnbase package: S4 generic with the following args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(object, method, ...)
>>>>>>>>>> 
>>>>>>>>>> ** oligo package: S4 generic with the following args:
>>>>>>>>>> 
>>>>>>>>>>       normalize(object, method=normalizationMethods(),
>>>>>>>>>>                 copy=TRUE, subset=NULL,
>>>>>>>>>>                 target='core', verbose=TRUE, ...)
>>>>>>>>>> 
>>>>>>>>>> So it looks like the greatest common factor is normalize(x, ...).
>>>>>>>>>> Not too surprising for a generic that covers such a wide range of
>>>>>>>>>> related but slightly different concepts / algorithms.
>>>>>>>>>> 
>>>>>>>>>> One technical difficulty though is that, even though almost all
>>>>>>>>>> these functions seem to take an S4 object as their 1st arg, some
>>>>>>>>>> of them don't:
>>>>>>>>>> 
>>>>>>>>>> (a) For EBImage::normalize(), 'x' can be an ordinary array in
>>>>>>>>>>      addition to an Image object.
>>>>>>>>>> 
>>>>>>>>>> (b) For diffGeneAnalysis::normalize(), 'rawdata' is an ordinary
>>>>>>>>>>      matrix.
>>>>>>>>>> 
>>>>>>>>>> (c) For deepSNV::normalize(), 'test' can be an ordinary matrix
>>>>>>>>>>      in addition to a deepSNV object.
>>>>>>>>>> 
>>>>>>>>>> (d) For oligo::normalize(), 'object' can be an ordinary matrix
>>>>>>>>>>      in addition to a FeatureSet object.
>>>>>>>>>> 
>>>>>>>>>> So how can we disambiguate when the first arg is an ordinary
>>>>>>>>>> matrix? IMO normalize() should fail in that case, i.e. no package
>>>>>>>>>> should define methods for ordinary arrays or matrices. Not ideal,
>>>>>>>>>> but better than the current situation, where what is returned
>>>>>>>>>> depends on which package was loaded last.
>>>>>>>>>> 
>>>>>>>>>> I could put normalize(x, ...) in BiocGenerics if nobody objects,
>>>>>>>>>> but that's all. I don't have time to fix the 10 packages that this
>>>>>>>>>> change will affect. However, I'd rather wait for the beginning of
>>>>>>>>>> the Bioc 2.13 devel cycle (April) for such a change. For some
>>>>>>>>>> packages like diffGeneAnalysis (which doesn't use S4 at all), that
>>>>>>>>>> will probably require a significant amount of changes since it
>>>>>>>>>> will need to pass the data to normalize in an S4 container instead
>>>>>>>>>> of an ordinary matrix.
>>>>>>>>>> 
>>>>>>>>>> Comments and suggestions are welcome.
>>>>>>>>> 
>>>>>>>>> Fine by me.
>>>>>>>>> 
>>>>>>>>> Laurent
>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> H.
>>>>>>>>>> 
>>>>>>>>>>> Best wishes,
>>>>>>>>>>> 
>>>>>>>>>>> Laurent
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Hervé Pagès
>>>>>>>>>> 
>>>>>>>>>> Program in Computational Biology
>>>>>>>>>> Division of Public Health Sciences
>>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>>>> P.O. Box 19024
>>>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>>> 
>>>>>>>>>> E-mail: hpages at fhcrc.org
>>>>>>>>>> Phone:  (206) 667-5791
>>>>>>>>>> Fax:    (206) 667-1319
>>>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
