[Bioc-devel] A new beginner

Fri Feb 18 02:01:47 CET 2011

On 02/17/2011 08:56 AM, Tim Triche, Jr. wrote:
> A silly and perhaps tangential question regarding S4 versus R5 classes:
> 
> What about R5 classes?  Aren't these sort of handy for the large-scale data
> processing targeted in GWAS and integrative studies?  Where can I find out
> more (aside from Chambers' slides and Dirk's offhand references) in terms of
> using them in BioC packages?
> 
> Learning about what Vince Carey did with GGtools, and Benilton Carvalho did
> with crlmm, has piqued my interest in more efficient representations.  I use
> S4 classes and methods out of habit (I learned to do this by aping Sean
> Davis' code, FWIW) and would not want to go back to the implicit dispatch of
> S3 methods.  However, I also find that having to create a clone() method for
> things like eSets is a bit of a drag, and suggests (to me) that the whole
> endeavor would be better carried out with explicit C++-style semantics and
> R5 classes.
> 
> Obviously that is not something that would happen overnight, or perhaps
> ever.  But the thought occurs after a few misteaks that maybe C++ is
> actually the right frame of mind (!?).  What do developers with long
> experience on Bioconductor think of the R5 class model?  Is there any
> intention to move towards reference classes in Bioconductor for the long
> haul?

Hi Tim -- I think you're talking about ?ReferenceClasses.

I think they're appropriate when the thing being represented itself has
reference semantics, e.g., a file on a disk or a url, something of which
there is only one. They might also be appropriate for data that is
essentially read-only, like a 'view' onto a DNA sequence.

A very serious challenge to using these more generally is that they
introduce 'action at a distance' -- x <- y; <update y>; and now x is
changed. This will be very surprising for users.
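
To make the surprise concrete, here is a minimal sketch. The class name
'Counter', its 'count' field and the increment() method are made up for
illustration and do not come from any package; the list at the end shows
ordinary copy semantics for comparison.

```r
library(methods)

# Hypothetical reference class with a single numeric field
Counter <- setRefClass("Counter",
    fields = list(count = "numeric"),
    methods = list(
        increment = function() {
            count <<- count + 1   # modifies the object in place
            invisible(.self)
        }
    ))

y <- Counter$new(count = 0)
x <- y             # x and y refer to the same underlying object
y$increment()      # update y ...
x$count            # ... and x reports the updated value as well

# Ordinary R objects are copied on modification
b <- list(count = 0)
a <- b
b$count <- b$count + 1
a$count            # a is unaffected

# copy() requests an independent object explicitly
z <- y$copy()
y$increment()      # z keeps the value it had when it was copied
```

The user has to remember to call copy() to get the behavior that plain
assignment gives for every other R object.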

Reference classes also introduce another syntax -- a$foo() -- and this
too will be confusing.

In ShortRead, ?"Sampler-class" is a reference class. It is used to refer
to files, and is manipulated with an S4 generic yield() that hides the
implementation from the 'user'.
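
The general pattern might look roughly like the sketch below. This is
not the ShortRead code: 'ChunkSource', its fields and the yield()
generic defined here are hypothetical stand-ins, with a character vector
playing the role of the file so that the example is self-contained.

```r
library(methods)

# Hypothetical reference class standing in for a file-like resource;
# 'pos' remembers how many records have already been handed out.
.ChunkSource <- setRefClass("ChunkSource",
    fields = list(records = "character", pos = "integer"))

# User-facing constructor, so the user never calls $new() directly
ChunkSource <- function(records) {
    .ChunkSource$new(records = records, pos = 0L)
}

# S4 generic and method: the user writes yield(src), not src$something()
setGeneric("yield", function(x, n = 2L, ...) standardGeneric("yield"))

setMethod("yield", "ChunkSource", function(x, n = 2L, ...) {
    remaining <- length(x$records) - x$pos
    n <- min(n, remaining)
    idx <- x$pos + seq_len(n)
    x$pos <- x$pos + as.integer(n)   # state change stays inside the object
    x$records[idx]
})

src <- ChunkSource(c("a", "b", "c"))
first  <- yield(src)   # next two records
second <- yield(src)   # the one remaining record
third  <- yield(src)   # nothing left: zero-length character vector
```

Here the reference semantics is the point (the position in the 'file'
has to advance), and the a$foo() syntax never reaches the user.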

This is an interesting discussion to have, and it would be good to hear
what others think; at the moment I think new package submissions using
reference classes where action at a distance is likely should be
discouraged.

Martin

> 
> thanks,
> 
> --t
> 
> 
> On Thu, Feb 17, 2011 at 8:14 AM, Wolfgang Huber <whuber at embl.de> wrote:
> 
>> Hi
>>
>> I find that S4 classes really help with writing robust, maintainable and
>> elegant code, since they allow one to put related data into one object and can
>> automate much of the validity checking that would be very tedious e.g. with
>> lists.
>>
>> My mileage with S4 methods is more variable. If a certain method is only
>> ever going to be used with one particular signature, one might as well
>> implement it as a normal function, since the syntax for doing so and the
>> debugging is simpler, and not much is lost in other respects.
>>
>> There are, of course, also examples where S4 methods are very useful, like
>> what Martin mentioned. Often there is one big, "substantial" function (S4 or
>> not), which is wrapped by different S4-method definitions that do some
>> datatype-specific pre- or postprocessing.
>>
>>        Wolfgang
>>
>>
>> Martin Morgan scripsit 17/02/11 16:47:
>>
>>  On 02/17/2011 06:34 AM, Stefano Calza wrote:
>>>
>>>> Ciao
>>>>
>>>> you probably mean you have been programming using S3 methods not S4.
>>>>
>>>> Using S4 methods is not compulsory though highly recommended. At
>>>> least this is my understanding. There are packages in BioC not using
>>>> S4, therefore I assume you can go ahead like this.
>>>>
>>>
>>> Actually, new package authors should really think of S4 as 'compulsory'.
>>> Here are two reasons for this:
>>>
>>> 1. Classes provide a way to structure the complicated data that we
>>> typically see in high throughput assays. For instance, a class can
>>> coordinate sample descriptions with expression values, minimizing
>>> mix-ups when the user subsets one but not the other.
>>>
>>> 2. Classes provide a way for users to combine different packages. For
>>> instance the ExpressionSet returned by affy's justRMA can be used
>>> directly by arrayQualityMetrics. This is both convenient for the user
>>> and minimizes opportunities for error. For this reason it is often a
>>> good strategy to re-use existing classes (like ExpressionSet in the
>>> microarray world, or the classes in IRanges / Biostrings in the
>>> sequencing world), rather than to invent new ones.
>>>
>>> The S4 requirement is not meant to get in the way of high-quality
>>> algorithms; a good strategy is to implement algorithms that operate on
>>> basic data types (a matrix of expression values, for instance) but
>>> expose these as methods on an S4 object.
>>>
>>> In terms of examples, one possibility is the 'StudentGWAS' package we'll
>>> use in a course here at the Hutch in the next two days; it'll become
>>> available at
>>>
>>> http://bioconductor.org/help/course-materials/2011/
>>>
>>> soon. It implements a single class with essential methods (constructor,
>>> accessors, show) and a method for doing something a little more
>>> substantial, so it's not too complicated. Next choices would be Biobase
>>> (something like AnnotatedDataFrame might be a good start) or for a more
>>> advanced example IRanges.
>>>
>>> limma does use S4 classes, e.g., RGList, etc. While these classes are a
>>> little 'loose' for my taste, they represent for the authors a compromise
>>> between structuring data and implementing foundational algorithms, and
>>> the package was developed at a time when S4 was more in flux than it is
>>> currently.
>>>
>>> Martin
>>>
>>>
>>>> Most packages in BioC use S4 methods, so just pick one not too
>>>> complex!
>>>>
>>>> regards
>>>>
>>>> Stefano
>>>>
>>>> On Thu, Feb 17, 2011 at 02:17:31PM +0000, Stefano Berri wrote:
>>>>> Hi everybody.
>>>>>
>>>>> I am about to start assembling my code to make my first Bioconductor
>>>>> package.
>>>>>
>>>>> I've read the instructions about "Package Guidelines" and "Package
>>>>> Submission" and I will try to follow those instructions the best I can.
>>>>>
>>>>> I have a first question, however.
>>>>> You seem to ask for your code to be in S4 Classes and Methods
>>>>>
>>>>> ( Packages should also conform to the following:
>>>>> * Use S4 classes and methods.)
>>>>>
>>>>> At the moment I wrote my code in the form
>>>>>
>>>>> List<- myFunction(List, bar = bar, foo = foo)
>>>>>
>>>>> Using "plain functions" and Lists as input and output.
>>>>> I was inspired by 'limma' that, as far as I understand, works this way.
>>>>>
>>>>> Can I submit using this interface or shall I really use S4 implementation? If
>>>>> so, could you recommend a simple package that uses classes as you would
>>>>> recommend that I can use as template/inspiration/guide for my code?
>>>>>
>>>>> Thank you very much
>>>>>
>>>>> Stefano Berri
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>
>>>
>>
>> --
>>
>>
>> Wolfgang Huber
>> EMBL
>> http://www.embl.de/research/units/genome_biology/huber
>>
>>
>>
> 
> 
> 

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793