[Bioc-devel] eSet questions

Thu Jan 11 18:46:40 CET 2007

Hi,

Vincent Carey 525-2265 wrote:
>> Hi,
>>
>> First, according to the manual pages for the "annotatedDataSet" class
> 
> small "s" for Dataset ... threw me for a minute

Oops; sorry about that

> 
>> (in Bioconductor 1.9) is a "virtual superset for 'exprSet', 'eSet',
>> etc".  While this seems to be the case for the soon-to-be-deprecated
>> exprSet, it seems not to be the case for an eSet.  Is that
>> interpretation correct?
> 
> I am not sure annotatedDataset is going anywhere.  It probably should
> be removed.

That's what I thought; thanks for the confirmation.

> 
>>
>> Now to the real question.  
[[SNIP]]
>>
>> Given this description, one might attempt a design something like
>>
>> setClass("ArrayCube", representation=list(
>> 	rawData = "AssayData",
>> 	experimentData = "MIAME",
>> 	featureData = "AnnotatedDataFrame",
>> 	hybridizationData = "AnnotatedDataFrame",
>> 	measurementData = "AnnotatedDataFrame"
>> ))
>>
>> This obviously looks a lot like an eSet.  The differences are
> 
> It seems to me that you don't want to adopt the "AssayData"-"phenoData"
> relationship documented in the eSet man page.  So the above is
> not like an eSet, and the conflict is mostly with the AssayData
> structure.

AssayData is fine, since it's just a class union of list or environment.
The conflict really is (I think) with the validity function for an eSet.
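
To make the conflict concrete, here is a minimal sketch in base R of the kind of check described in the quoted text (columns of each assay matrix must correspond to rows of the phenoData). The function name columnsMatchPheno and the toy data are hypothetical, and this is not Biobase's actual validity code; it only illustrates why a list of per-file matrices would fail such a check.

```r
# Hypothetical stand-in for the eSet-style constraint: every matrix in the
# assay list must have one column per row of the phenotype table, with
# matching names in the same order.
columnsMatchPheno <- function(assays, pheno) {
  all(vapply(assays, function(m) {
    ncol(m) == nrow(pheno) && identical(colnames(m), rownames(pheno))
  }, logical(1)))
}

pheno <- data.frame(treatment = c("ctl", "trt"), row.names = c("S1", "S2"))

# eSet-style layout: one matrix, samples in columns.
exprs <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
                dimnames = list(c("g1", "g2", "g3"), c("S1", "S2")))

# One-matrix-per-file layout: measurement types in columns.
perFile <- list(
  S1 = matrix(c(10, 20, 30, 1, 2, 3), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("Signal", "Background"))),
  S2 = matrix(c(40, 50, 60, 4, 5, 6), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("Signal", "Background")))
)

okESet  <- columnsMatchPheno(list(exprs = exprs), pheno)  # passes the check
okFiles <- columnsMatchPheno(perFile, pheno)              # fails the check
```

In the per-file layout the samples index the list rather than the matrix columns, which is exactly the mismatch with the eSet constraint.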

> 
>> [1] I am thinking about the rawData entry as a list of data frames (or
>> data matrices), with each one corresponding to a unique file on the hard
>> disk.  These would be easy to read into R in the use cases above, but
>> violate one of the validity constraints on the assayData object in the
>> current eSet. (Specifically, the constraint that the columns in any
>> matrix in the assayData object must correspond to rows of the phenoData
>> object.)
> 
> This constraint is quite important for all the applications of eSet
> in use, so abandoning it suggests designing another class.

Unfortunately, that was my conclusion as well.

>> [2] The featureData slot would describe the rows in each of those data
>> matrices.  In order to accommodate the RPPA data, however, featureData
>> might refer to patient samples instead of the genes that it would refer
>> to in the eSet design.
> 
> I have not had time to think at length about the RPPA data structure.
> It seems possible to use the eSet design to represent it, but there
> is substantial reorganization of the data relative to its physical
> origins.  There are costs and benefits to shoehorning the data into
> an ExpressionSet-like structure and I don't know how to weigh
> them at the moment.  The real question seems to me to be whether it
> is valuable to request X[G, S] for any of these data structures, where
> X is the basic container, G is a predicate identifying a gene selection
> and S is a predicate identifying a sample selection.  If you want that
> AND you want to inherit the infrastructure available for ExpressionSets
> to get that, then it makes sense for you to try to extend what we have in Biobase
> to cover what you are dealing with.  It seems to me that you might
> want to combine AssayData and AnnotatedDataFrame components in a
> structure that does not extend eSet to get what you want.

That is basically what the "ArrayCube" I semi-described above tries to do.
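
As a rough sketch of that idea, the class below combines a list of matrices with three annotation tables and its own validity function, without extending eSet. To keep it runnable without Biobase, "list" stands in for AssayData and "data.frame" stands in for AnnotatedDataFrame, and the MIAME slot is omitted; those substitutions and the particular validity rules are assumptions for illustration, not a settled design.

```r
library(methods)

# Assumptions: "list" replaces AssayData and "data.frame" replaces
# AnnotatedDataFrame so that the sketch does not depend on Biobase.
setClass("ArrayCube",
  representation(
    rawData           = "list",        # one matrix per hybridization (file)
    featureData       = "data.frame",  # describes rows of each matrix
    hybridizationData = "data.frame",  # describes elements of the list
    measurementData   = "data.frame"   # describes columns of each matrix
  ),
  validity = function(object) {
    msgs <- character(0)
    if (length(object@rawData) != nrow(object@hybridizationData))
      msgs <- c(msgs, "rawData needs one element per row of hybridizationData")
    for (m in object@rawData) {
      if (nrow(m) != nrow(object@featureData))
        msgs <- c(msgs, "rows of each matrix must match featureData")
      if (ncol(m) != nrow(object@measurementData))
        msgs <- c(msgs, "columns of each matrix must match measurementData")
    }
    if (length(msgs) == 0) TRUE else unique(msgs)
  }
)

# Toy data: two hybridizations, three features, two measurement columns.
mk <- function(v) matrix(v, nrow = 3,
                         dimnames = list(c("g1", "g2", "g3"),
                                         c("Signal", "Background")))
cube <- new("ArrayCube",
  rawData           = list(H1 = mk(c(10, 20, 30, 1, 2, 3)),
                           H2 = mk(c(40, 50, 60, 4, 5, 6))),
  featureData       = data.frame(symbol = c("A", "B", "C"),
                                 row.names = c("g1", "g2", "g3")),
  hybridizationData = data.frame(dye = c("Cy3", "Cy5"),
                                 row.names = c("H1", "H2")),
  measurementData   = data.frame(units = c("au", "au"),
                                 row.names = c("Signal", "Background"))
)
validObject(cube)  # raises an error if any of the checks above fails
```

Here the annotation table that an eSet would match against matrix columns is matched against the list dimension instead, which is the main departure from the eSet constraint.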

> 
>> [3] Similarly, hybridizationData would replace the phenoData slot, and
>> it also could refer to samples or to genes/proteins depending on the
>> data type.  Also, the phenoData object has to describe the "list"
>> dimension of the rawData instead of the "column" dimension.
>> [4] The measurementData slot would describe the measurement columns from
>> the software.  For already known packages, it would then be easy to
>> convert an ArrayCube into, for example, an RGList by slicing along the
>> desired measurement columns.  For novel quantification packages, one
>> could make an interface that lets the user specify which measurements
>> have which interpretation, and then make an RGList after they have had a
>> chance to load the data easily and start exploring it.
>>
>> In any event, before I head further down this road, I'd like to get some
>> feedback on whether it would be [a] feasible or [b] desirable either to
>> create such a thing or to change the design of an eSet into such a thing.
> 
> My reaction, based on very brief contemplation, is that you'll be designing
> a structure that does not extend eSet but shares some components and some
> functionalities.  If the ability to represent, e.g.,
> RPPA and Expression arrays in a single container type becomes important
> we'll consider how the eSet constraints need to evolve.  Thus far they
> seem to be effective for the most common types of high-throughput data
> encountered.

I think it might be possible (if I get an ArrayCube actually working) to 
make it a "superclass" of eSet. So in a sense, I think I'm trying to 
convince people that this more general class is where a lot of the basic 
functionality should live, and then an eSet and an RPPA class could both 
be derived from it.
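
One way to picture the relationship, as a sketch only: slicing a single measurement column out of every per-file matrix gives a features-by-hybridizations matrix, which has the shape an eSet-style container expects. The helper name sliceMeasurement and the toy data are hypothetical, and plain lists and data frames stand in for the Biobase classes.

```r
# Hypothetical helper: take one measurement column from each per-file
# matrix and bind the results into a features x hybridizations matrix.
sliceMeasurement <- function(rawData, measurement) {
  do.call(cbind, lapply(rawData, function(m) m[, measurement]))
}

mk <- function(v) matrix(v, nrow = 3,
                         dimnames = list(c("g1", "g2", "g3"),
                                         c("Signal", "Background")))
rawData <- list(H1 = mk(c(10, 20, 30, 1, 2, 3)),
                H2 = mk(c(40, 50, 60, 4, 5, 6)))
hyb <- data.frame(dye = c("Cy3", "Cy5"), row.names = c("H1", "H2"))

signal <- sliceMeasurement(rawData, "Signal")

# The slice has hybridizations in its columns, so its column names can be
# compared with the rows of the hybridization table, eSet-style.
sliceMatchesHyb <- identical(colnames(signal), rownames(hyb))
```

The same kind of slice, taken for the appropriate measurement columns, is what the quoted point [4] describes for building something like an RGList.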

I've been ignoring most of the things I'm supposed to be doing this 
morning and trying to write code to import a MINiML formatted data set 
into R as an ArrayCube. That seems to be off to a promising start, since 
the format maps pretty directly into the tentative class design.

> 
> The folks who actually designed the key Biobase containers may well have
> different views of this situation.  This is just my personal reaction.

Thanks for your input,
	Kevin