[Bioc-devel] eset.Rnw revised in Biobase, please review

Tue Sep 6 03:07:27 CEST 2005

> Hi Vince and others
>
> Below are my first thoughts about the eSet class. I must say that I
> like small "tight" classes with strong validity checking.
>
> I will start with some specific comments:
>
> 1) The history slot: a reasonable idea. But if we have a specific
> history slot, shouldn't it be filled automatically every time an eSet
> is created or modified? That is, every replacement function or
> initialization should update this slot. Otherwise I do not really see
> the need to keep this slot separate from the notes.

RG is working on the history concept now so I will pass on this.

>
> 2) The dim method: since it is part of your validity checking that
> every component of the assayData slot has the same dimensions, there
> is no need to have the dim be a matrix (every column will by
> definition be the same). You need an internal method to extract the
> matrix of dimensions, in order to do the validity checking of course...

good point.  i am hoping to hear from folks whether they can
imagine situations in which the assayData components may have
different dimensions.  in that case the validity check would have
to be relaxed.
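
A minimal sketch of the point above, assuming assayData is held as a named list of matrices (the names `assays`, `dimMatrix`, `sameDims` and `commonDim` are hypothetical, not part of Biobase). The matrix of dimensions is kept as an internal helper for the validity check, while the user-facing result is a plain length-2 vector:

```r
## hypothetical stand-in for an assayData slot: a named list of matrices
assays <- list(
  green = matrix(rnorm(12), nrow = 4, ncol = 3),
  red   = matrix(rnorm(12), nrow = 4, ncol = 3)
)

## matrix of dimensions, one column per component (internal use only)
dimMatrix <- function(assays) {
  d <- vapply(assays, dim, integer(2))
  rownames(d) <- c("rows", "cols")
  d
}

## TRUE when every component has the same dimensions as the first
sameDims <- function(assays) {
  d <- dimMatrix(assays)
  all(d == d[, 1])
}

## once the check passes, a single length-2 vector describes the object
commonDim <- function(assays) {
  stopifnot(sameDims(assays))
  dimMatrix(assays)[, 1]
}
```

If components with different dimensions were ever allowed, only `sameDims` would need to be relaxed; `dimMatrix` would then become the natural return value again.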

>
> 3) I like the idea of having reporterNames separate from the assayData.
> That also means that the names do not need to be unique. But should
> sampleNames be a separate slot or just be the rownames of the
> phenoData slot? There should be some kind of check that the length
> of these names is either 0 (no names given) or equal to the number of
> samples/reporters.

i have vacillated on this aspect of metadata.  currently i believe
that rownames and colnames should be supplied and that the reporterNames
must come from there.  we now have the reporterData data.frame in there
(in annotatedDataset) that can ameliorate the problem of requiring unique
reporterNames.

>
> 4) I think the class of reporterInfo (data.frameOrNULL) is a bit too
> strict. You give a compelling reason that we might want to give a
> control/active factor. Now, since the number of reporters is huge,
> this slot will (if not empty) be a very big structure, so I think we
> really want to allow a very specific usage of this kind of slot
> (data.frames are not terribly efficient). I would like the option of
> having it be either a factor, an integer or a matrix. A possible use
> scenario (which I strongly advocate) would be the use of an integer
> to indicate (x,y) position on the chip for AffyBatch-like objects
> (right now the map between row and (x,y) position in the AffyBatch
> object is implicit which does not allow for subsetting of the object,
> since that would break the link).

is the data.frame as a container of a factor really an efficiency
loss?
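
One way to put a number on this question is to compare the storage of a bare factor with the same factor wrapped in a one-column data.frame. This is only a sketch with simulated data (`n`, `status` and `statusDF` are hypothetical), and it measures memory only, not the speed of subsetting or other operations:

```r
## hypothetical control/active factor for 100,000 reporters
n <- 1e5
status <- factor(sample(c("control", "active"), n, replace = TRUE))

## the same factor wrapped in a one-column data.frame
statusDF <- data.frame(status = status)

sizeFactor <- as.numeric(object.size(status))
sizeDF     <- as.numeric(object.size(statusDF))

## relative storage overhead of the data.frame wrapper
overhead <- (sizeDF - sizeFactor) / sizeFactor
```

The result depends on how the R version in use stores row names, so `overhead` is best inspected on the target installation rather than assumed.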

>
> Also, if someone wants to do splitting of the assayData based on a
> factor, it may be _way_ more efficient to have the split done once
> and for all (I imagine assayDataControl, assayDataActive) (something
> which btw is not really doable in the current setup since the two
> structures would have different dimensions), instead of using a
> factor to do the split "every time". Hmm. I haven't really thought this
> through.

we do need to think through the split use cases.  for example, we would
like to make it easy for people to compute a normalization function
based strictly on control spots.
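
As an illustration of the control-spot use case, here is a sketch that uses a control/active factor to estimate a per-sample offset from the control reporters and then applies it to every reporter. The data and the names `x`, `type`, `controlOffset` and `normalizeByControls` are hypothetical, and simple mean-centering stands in for whatever normalization would actually be used:

```r
set.seed(1)
## hypothetical data: 6 reporters x 3 samples, with a control/active factor
x <- matrix(rnorm(18, mean = 8), nrow = 6,
            dimnames = list(paste0("r", 1:6), paste0("s", 1:3)))
type <- factor(c("control", "active", "active",
                 "control", "active", "control"))

## per-sample offset estimated from the control reporters only
controlOffset <- function(x, type) {
  colMeans(x[type == "control", , drop = FALSE])
}

## subtract each sample's control mean from all of its reporters
normalizeByControls <- function(x, type) {
  sweep(x, 2, controlOffset(x, type))
}

xn <- normalizeByControls(x, type)
```

Here the factor is consulted on each call; a pre-split design (assayDataControl, assayDataActive) would trade that repeated subsetting for two structures of different dimensions.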

>
> 5) I am not really in favour of the varMetadata slot of the phenoData
> class, although the vignette seems to indicate that this was included
> in Bioc 1.6. The only example you include is the specification of
> units, something I feel belongs in the varLabels slot such as
> "specimen age, in years". As I currently understand it, I feel this
> is a bit too much annotation. The same goes for a hypothetical
> reporterMetadata slot. Perhaps you have another usage in mind? There
> does not seem to be validity checking of this slot?

right, no validity checking yet.  you are right that such metadata
could be contained in labels, but how do you compute on those labels?
if you have a few datasets and need to make years and months variables
compatible, a convention on a units method may be helpful.  we have
one vote (private, a long time ago) in favor of the varMetadata approach and
now one against.
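
To make the "compute on units" argument concrete, here is a sketch in which units live in a separate per-variable table rather than in a free-text label. The tables `pd1`, `pd2`, `vm1`, `vm2` and the helper `toYears` are hypothetical and only mimic the phenoData/varMetadata idea; they are not the Biobase API:

```r
## hypothetical phenoData-like tables from two studies
pd1 <- data.frame(age = c(30, 45, 62))
pd2 <- data.frame(age = c(360, 540, 744))

## hypothetical varMetadata: one row per variable, with a 'units' column
vm1 <- data.frame(units = "years",  row.names = "age",
                  stringsAsFactors = FALSE)
vm2 <- data.frame(units = "months", row.names = "age",
                  stringsAsFactors = FALSE)

## convert a variable to years using the recorded units
toYears <- function(pd, vm, var) {
  perYear <- c(years = 1, months = 12)
  u <- vm[var, "units"]
  if (!u %in% names(perYear)) stop("unknown units: ", u)
  pd[[var]] / perYear[[u]]
}

## ages from both studies on a common scale
ages <- c(toYears(pd1, vm1, "age"), toYears(pd2, vm2, "age"))
```

With a label such as "specimen age, in years" the same conversion would require parsing free text, which is the difficulty raised above.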

>
> 6) the assayData slot: I do not really understand the pass-by-
> reference comments you make in the vignette, but they seem to
> indicate that there would be performance gains to using an
> environment. Could you explain this in some more detail? And if there
> is, I see no reason to allow a list type structure. I think it should
> be mandatory to have either a list or an environment; allowing both
> just adds confusion. I would rather have the community choose the
> most efficient way and then "force" developers to use this.

environments are not copied when passed to functions.  everything
else is, afaik.  why not require environments?  it is open for
additional discussion
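
The difference in semantics can be seen in a small sketch (the names `zeroFirst`, `asList` and `asEnv` are hypothetical). A function that modifies a list changes only its own copy, while the same modification on an environment is visible to the caller. Note that R defers the copy of a list until it is actually modified, so read-only access is cheap in both cases:

```r
## modify the 'exprs' component of a container inside a function
zeroFirst <- function(container) {
  container$exprs[1] <- 0
  invisible(NULL)
}

## list: the function works on its own copy, the caller's list is unchanged
asList <- list(exprs = c(5, 6, 7))
zeroFirst(asList)
asList$exprs[1]    # still 5

## environment: the function modifies the caller's object in place
asEnv <- new.env()
asEnv$exprs <- c(5, 6, 7)
zeroFirst(asEnv)
asEnv$exprs[1]     # now 0
```

The flip side of the performance gain is that code receiving an environment-backed object can change it for every other holder of that object, which is part of what a decision to require environments would have to weigh.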

>
> 7) So the assayData slot does not have a specific number/names for
> its components. I see the need for this. But let us say I want to use
> it for a specific case where I have two assays (let us say a two-
> color micro array experiment). Do you imagine that people will create
> more specific versions of the class by something like (code not tested)
>    setClass("twocolor", contains = "eSet",
>       validity = function(object) {
>          ## validObject() checks the inherited eSet validity first,
>          ## so only the component names need testing here
>          nms <- sort(names(assayData(object)))
>          if (!identical(nms, c("green", "red")))
>             return("assayData components must be 'green' and 'red'")
>          TRUE
>       })
> or how do users actually make sure that the elements of the assayData
> have the relevant names (and numbers)?
>

conceptually i think this is right.  we want to make sure the basic
infrastructure is not missing anything that you would want to have
in COMMON to all the different extensions that one can anticipate
for high throughput platforms.
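
A self-contained sketch of this extension pattern, using a hypothetical stand-in class `miniSet` instead of the real eSet so that it runs without Biobase. The base class holds the checks common to all platforms, and the extension `twoColor` adds only the platform-specific requirement on component names:

```r
library(methods)

## hypothetical stand-in for eSet: assayData is a named list of matrices
setClass("miniSet", representation(assayData = "list"),
  validity = function(object) {
    d <- lapply(object@assayData, dim)
    if (length(unique(d)) > 1)
      return("all assayData components must have the same dimensions")
    TRUE
  })

## platform-specific extension: exactly the channels 'green' and 'red'
setClass("twoColor", contains = "miniSet",
  validity = function(object) {
    nms <- sort(names(object@assayData))
    if (!identical(nms, c("green", "red")))
      return("assayData must have components 'green' and 'red'")
    TRUE
  })

m <- matrix(0, nrow = 4, ncol = 2)
ok <- new("twoColor", assayData = list(green = m, red = m))

## a missing channel is rejected by the subclass validity method
bad <- tryCatch(new("twoColor", assayData = list(green = m)),
                error = function(e) conditionMessage(e))
```

Because validObject runs the inherited validity method before the subclass one, the dimension check in `miniSet` does not need to be repeated in `twoColor`.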