[Bioc-devel] eset.Rnw revised in Biobase, please review

Tue Sep 6 16:19:16 CEST 2005

Hi Kasper,

Kasper Daniel Hansen wrote:
> Hi Vince and others
> 
> Below are my first thoughts about the eSet class. I must say that I
> like small "tight" classes with strong validity checking.
> 
> I will start with some specific comments:
> 
> 1) The history slot: a reasonable idea. But if we have a specific  
> history slot, shouldn't it be filled automatically every time an eSet
> is created or modified? That is, every replacement function or
> initialization should update this slot. Otherwise I do not really see  
> the need to keep this slot separate from the notes.

  I doubt that such a comprehensive approach will be useful, especially
since we do not yet have a markup for the history, or an intended
mechanism for displaying or managing it. I suspect that, at least
initially, less is going to be more helpful. Perhaps tracking changes to
the expressions, or to a few other slots, would be a good first cut.

> 
> 2) The dim method: since it is part of your validity checking that  
> every component of the assayData slot has the same dimensions, there  
> is no need to have the dim be a matrix (every column will by  
> definition be the same). You need an internal method to extract the  
> matrix of dimensions, in order to do the validity checking of course...
> 
  Vince answered this - we are not yet sure that they would be, and 
would appreciate examples where they are not.
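
  To make the dim question concrete, here is a minimal sketch in base R.
It assumes assayData is a plain named list of matrices; assayDims and
sameDims are hypothetical helper names, not Biobase functions.

```r
## Hypothetical helper: one column of dimensions per assayData element.
assayDims <- function(assayData) {
    d <- vapply(assayData, dim, integer(2))
    rownames(d) <- c("rows", "cols")
    d
}

## TRUE when every element has the same dimensions as the first one.
sameDims <- function(assayData) {
    d <- assayDims(assayData)
    all(d == d[, 1])
}

ad <- list(exprs = matrix(0, 4, 3), se.exprs = matrix(0, 4, 3))
assayDims(ad)   # 2 x 2 matrix, one column per element
sameDims(ad)
```

If the elements are always required to agree, the matrix collapses to a
single column; if they are not, the matrix form is what tells you which
element differs.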

> 3) I like the idea of having reportNames separate from the assayData.  
> That also means that the names do not need to be unique. But should
> sampleNames be a separate slot or just be the rownames of the
> phenoData slot? There should be some kind of checking that the lengths
> of these names are either 0 (no names given) or equal to the number of
> samples/reporters.

   I think that these should be checked in many different ways. Anywhere
they can be assigned they should be scrutinized, and if present we
should check that they are the same as, and in the same order as, those
in the phenoData (whether row names on the dataframe or in a special
slot).
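
   A rough sketch of one such check, assuming the phenoData values are
held in an ordinary data.frame with row names; checkSampleNames is a
hypothetical name and is not part of Biobase.

```r
## Hypothetical check: sample names must be absent, or match the
## phenoData row names in both content and order.
checkSampleNames <- function(sampleNames, pData) {
    if (length(sampleNames) == 0L)
        return(TRUE)                       # no names given
    if (length(sampleNames) != nrow(pData))
        return("sampleNames length differs from the number of samples")
    if (!identical(as.character(sampleNames), rownames(pData)))
        return("sampleNames do not match phenoData row names in order")
    TRUE
}

pd <- data.frame(age = c(31, 45, 52), row.names = c("s1", "s2", "s3"))
checkSampleNames(c("s1", "s2", "s3"), pd)   # names agree
checkSampleNames(c("s2", "s1", "s3"), pd)   # same names, wrong order
```

Returning TRUE or a message string follows the convention for S4
validity functions, so the same function could be called from a validity
method and from each replacement function.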

> 
> 4) I think the class of reporterInfo (data.frameOrNULL) is a bit too
> strict. You give a compelling reason that we might want to give a
> control/active factor. Now, since the number of reporters is huge,
> this slot will (if not empty) be a very big structure, so I think we
> really want to allow a very specific usage of this kind of slot
> (data.frames are not terribly efficient). I would like the option of
> having it be either a factor, an integer or a matrix. A possible use  
> scenario (which I strongly advocate) would be the use of an integer  
> to indicate (x,y) position on the chip for AffyBatch-like objects  
> (right now the map between row and (x,y) position in the AffyBatch  
> object is implicit which does not allow for subsetting of the object,  
> since that would break the link).

    I don't see the inefficiencies you mention. A data.frame is merely a
list of vectors, and since I don't think we will solve all problems with
a single vector of reporterInfo, a data.frame is the natural data
structure. If you have some other data indicating specific
inefficiencies, please provide it. Your example, and others, are what we
had in mind.
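
    As an illustration only, here is a small base R sketch of a
reporterInfo data.frame holding chip positions and a control/active
factor. The column names and values are made up for the example.

```r
## Hypothetical reporterInfo: chip position plus a control/active factor.
reporterInfo <- data.frame(
    x = c(1L, 2L, 1L, 2L),
    y = c(1L, 1L, 2L, 2L),
    type = factor(c("control", "active", "active", "control"))
)

is.list(reporterInfo)         # a data.frame is a list of column vectors
sapply(reporterInfo, class)   # each column keeps its own type

## Subsetting reporters keeps each row's (x, y) position attached,
## so the link to the chip layout survives.
keep <- reporterInfo$type == "active"
active <- reporterInfo[keep, ]
```

Each column is stored as its own integer or factor vector, so several
kinds of reporter information can sit side by side.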

> 
> Also, if someone wants to do splitting of the assayData based on a
> factor, it may be _way_ more efficient to have the split done once
> and for all (I imagine assayDataControl, assayDataActive) (something
> which btw is not really doable in the current setup since the two
> structures would have different dimensions), instead of using a
> factor to do the split "every time". Hmm. I haven't really thought this
> through.

   Not sure what you are worried about here, but we do envisage some 
general uses of splitting parts, or all of eSets via different variables 
that are being made available. Again, it is probably best to see what 
the real usage patterns are before we commit to the implementation.
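
   For reference, splitting on demand might look like the base R sketch
below. It assumes a single assay matrix and a reporter-level factor; the
object names are hypothetical and nothing here is timed.

```r
## Hypothetical assay matrix: 4 reporters by 2 samples.
exprs <- matrix(1:8, nrow = 4,
                dimnames = list(paste0("r", 1:4), c("s1", "s2")))
type <- factor(c("control", "active", "active", "control"))

## Split row indices by the factor, then extract each block of rows.
rowsByType <- split(seq_len(nrow(exprs)), type)
parts <- lapply(rowsByType, function(i) exprs[i, , drop = FALSE])

names(parts)            # one element per factor level
rownames(parts$active)  # reporters flagged as active
```

The pieces have different numbers of rows, which is why they cannot sit
together in one assayData under a same-dimensions rule.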

> 
> 5) I am not really in favour of the varMetadata slot of the phenoData  
> class, although the vignette seems to indicate that this was included  
> in Bioc 1.6. The only example you include is the specification of  
> units, something I feel belongs in the varLabels slot such as
> "specimen age, in years". As I currently understand it, I feel this  
> is a bit too much annotation. The same goes for a hypothetical  
> reporterMetadata slot. Perhaps you have another usage in mind? There  
> does not seem to be validity checking of this slot?
> 

   I don't see how you could ever realistically parse a label and get
back what you want (or even know, in some programmatic way, that there is
valuable information there); your experience may be different.
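
   A small sketch of the difference, in base R. The layout of varMetadata
used here (one row per variable, with a units column) is an assumption
for illustration, and the variables and units are invented.

```r
## Hypothetical phenoData pieces: the data, plus one metadata row per
## variable.
pData <- data.frame(age = c(31, 45), dose = c(0.5, 1.0))
varMetadata <- data.frame(
    labelDescription = c("specimen age", "treatment dose"),
    units = c("years", "mg"),
    row.names = names(pData),
    stringsAsFactors = FALSE
)

## Structured metadata: a direct lookup, no parsing.
ageUnits <- varMetadata["age", "units"]

## Label-only alternative: an ad hoc parser that works only while every
## label happens to follow the same ", in <units>" wording.
label <- "specimen age, in years"
parsedUnits <- sub(".*, in ", "", label)
```

Both give "years" here, but only the first tells a program that units
exist for the variable at all.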

> 6) the assayData slot: I do not really understand the pass-by- 
> reference comments you make in the vignette, but they seem to  
> indicate that there would be performance gains to using an  
> environment. Could you explain this in some more detail? And if there
> is, I see no reason to allow a list type structure. I think it should  
> be mandatory to have either a list or an environment, allowing both  
> just adds confusion. I would rather have the community choose the  
> most efficient way and then "force" developers to use this.
> 

   We try not to force much of anything onto developers. Lists and 
environments are essentially equivalent here, and there is probably no 
need to impose one or the other. Users/developers need to store things 
together and to access them by name - lists and environments both 
provide that capability. If you, or someone else, wants to do some 
careful time and space comparisons, we would certainly take that under 
advisement, but for now, we think we have the resources to get this new 
data structure in place for the next release.
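
   A minimal base R sketch of what "equivalent for access by name" means,
and of the one behavioural difference behind the pass-by-reference
remark. The names are hypothetical, and this measures neither time nor
space.

```r
m <- matrix(0, 2, 2)
asList <- list(exprs = m)
asEnv <- new.env()
assign("exprs", m, envir = asEnv)

## Access by name works the same way for both containers.
identical(asList[["exprs"]], asEnv[["exprs"]])

## The difference shows up when a function modifies its argument.
setFirstToOne <- function(x) {
    x[["exprs"]][1, 1] <- 1
    invisible(x)
}
setFirstToOne(asList)
setFirstToOne(asEnv)

asList[["exprs"]][1, 1]   # caller's list is unchanged (copy semantics)
asEnv[["exprs"]][1, 1]    # environment was modified in place
```

The in-place update is where any saving from an environment would come
from, and it is also why code that expects copy semantics can be
surprised by it.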

> 7) So the assayData slot does not have a specific number/names for  
> its components. I see the need for this. But let us say I want to use  
> it for a specific case where I have two assays (let us say a two- 
> color micro array experiment). Do you imagine that people will create  
> more specific versions of the class by something like (code not tested)
>    setClass("twocolor", representation("eSet"),
>       validity = function(object) {
>          if (!validObject(as(object, "eSet")))
>             return(FALSE)  ## this might be unnecessary
>          if (any(sort(names(assayData(object))) != c("green", "red")))
>             return(FALSE)
>          else
>            return(TRUE)
>        })
> or how do users actually make sure that the elements of the assayData  
> have the relevant names (and numbers)?

   That would be one use. Martin already pointed out one set of
problems; let me suggest that the need to sort seems wrong, as does the
notion that only red and green are valid names (%in%, toupper, and a
few other functions might make any user of such a class much happier).
You probably also want to run the eSet validity checker.
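
   A sketch along those lines, using only the methods package. toyESet is
a hypothetical stand-in so the example runs without Biobase, and the
assayData slot is assumed to be a plain list. validObject() calls the
validity methods of the superclasses before that of the class itself, so
the parent check runs without being invoked explicitly.

```r
library(methods)

## Toy stand-in for eSet (hypothetical; the real class lives in Biobase).
setClass("toyESet", representation(assayData = "list"),
    validity = function(object) {
        d <- vapply(object@assayData,
                    function(m) paste(dim(m), collapse = "x"), "")
        if (length(unique(d)) > 1L)
            return("assayData elements differ in dimension")
        TRUE
    })

## Subclass: require red and green channels, ignoring case and order,
## while still allowing further elements.
setClass("toyTwoColor", contains = "toyESet",
    validity = function(object) {
        required <- c("RED", "GREEN")
        present <- toupper(names(object@assayData))
        if (!all(required %in% present))
            return("assayData must contain 'red' and 'green' elements")
        TRUE
    })

m <- matrix(0, 3, 2)
ok <- new("toyTwoColor",
          assayData = list(Green = m, Red = m, background = m))
bad <- tryCatch(new("toyTwoColor", assayData = list(green = m)),
                error = function(e) conditionMessage(e))
```

Returning a message string instead of FALSE gives the user a reason when
construction fails.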

   Thanks again for all the comments,
     Robert

> 
> Kasper
> 
> 
> On Sep 2, 2005, at 9:26 AM, Vincent Carey 525-2265 wrote:
> 
> 
>>We need discussion of the eSet class, which is to take the place
>>of exprSet in the future.  eset.Rnw in Biobase/inst/doc has
>>been revised.  Please review and discuss.
>>
>>you will need R 2.2 and the latest Biobase to build this vignette.
>>
>>vc
>>
>>_______________________________________________
>>Bioc-devel at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org