[Bioc-devel] BioC 2.5: Added scanDates slot to Biobase's eSet class

Fri Jun 19 18:44:16 CEST 2009

On Fri, Jun 19, 2009 at 11:25 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> Laurent,
> One of the subtle requirements I have gotten out of Martin's design is the
> notion of standard and additional columns in a data table, which in this
> case takes the form of an AnnotatedDataFrame. As a programmer, objects like
> data.frames and AnnotatedDataFrames give me no end of headaches, because the
> methods I write have to contain tedious data checking code to ensure what
> they are operating on is what the methods are expecting.
>
> This discussion can have much wider implications. If done right, we can
> create a scheme that others can leverage to create eSet subclasses that have
> specialized phenoData, featureData, and potentially new
> arrayData/covariateData/experimentData slots with standard and additional

Is new infrastructure required for this?  You can extend a given class
and write a validity method that defines the requirements by denying
construction if requirements are not met.
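
A minimal sketch of that idea, using only the methods package.  The class
name ScanAnnotatedFrame and the required column scan_date are hypothetical,
and a plain data.frame stands in for AnnotatedDataFrame, so this is an
illustration rather than a proposal for Biobase itself:

```r
library(methods)

# Hypothetical: columns every instance must carry (the "standard" columns).
requiredCols <- c("scan_date")

# Toy container; a plain data.frame stands in for AnnotatedDataFrame here.
setClass("ScanAnnotatedFrame",
         slots = c(data = "data.frame"),
         validity = function(object) {
             absent <- setdiff(requiredCols, names(object@data))
             if (length(absent) > 0)
                 paste("missing required column(s):",
                       paste(absent, collapse = ", "))
             else
                 TRUE
         })

# Construction succeeds when the standard column is present;
# additional columns (here, treatment) are simply carried along.
ok <- new("ScanAnnotatedFrame",
          data = data.frame(scan_date = "01/23/04 14:30:57",
                            treatment = "control"))

# Construction is denied when the standard column is absent;
# the error message is captured instead of stopping the script.
bad <- tryCatch(new("ScanAnnotatedFrame",
                    data = data.frame(treatment = "control")),
                error = function(e) conditionMessage(e))
```

Methods written for such a class could then assume scan_date exists and skip
their own checking code.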

> columns enforced by a yet to be defined class. (I was curious if anybody has
> expanded on the AnnotatedDataFrame class to include the notion of standard
> and additional columns and the only package I found that creates subclasses
> of AnnotatedDataFrame is ShortRead, and those classes didn't hit upon this
> topic.)
>
> I agree with you that without a formal data model, this discussion can
> devolve into semantic hair splitting. If, however, we create a lightweight,
> flexible data model that can be adapted to different situations, we can
> provide benefits to both developers and end-users who can assume the
> standard data columns exist and use defined methods to access them.
>

It seems to me that we have this in eSet as it stands.

I noticed that scanDates is character.  So we will have to do some programming
to figure out what is in there.  Is POSIXt possibly a suitable class for
this information?
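
As a rough sketch of what that conversion might look like, using the two
strings from Patrick's scanDates(abatch) output further down the thread.
The format string and the UTC time zone are my assumptions; the strings
themselves carry no zone, and other platforms may write dates differently:

```r
# Scan-date strings as shown by scanDates(abatch) later in this thread.
scanStrings <- c(binary.cel = "01/23/04 14:30:57",
                 text.cel   = "08/29/03 15:12:30")

# Assumption: month/day/two-digit-year followed by a 24-hour time.
# UTC is chosen arbitrarily because the strings carry no time zone.
scanTimes <- as.POSIXct(scanStrings, format = "%m/%d/%y %H:%M:%S", tz = "UTC")
names(scanTimes) <- names(scanStrings)

# Once parsed, ordering and differences are ordinary date-time arithmetic.
scanOrder <- order(scanTimes)
scanGap <- difftime(scanTimes["binary.cel"], scanTimes["text.cel"],
                    units = "days")
```

Strings that do not match the assumed format come back as NA, so a validity
method could check for that.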

I have no problem with adding some slots and validity checking possibilities
to eSet, and I think the discussion is important.  I note that we already
have an experimentData structure and that it is supposed to hold all relevant
MIAME information.  We did not elaborate it carefully.  It may not be sensible
to put scanDates in the MIAME class definition -- I don't know.  If we did
this, all structures that use experimentData would be able to hold scanDates
in this way.  The internal representation shouldn't be that important -- what
is important is that we give people reasonable ways of working with eSet
instances through a scanDates method.  So my proposal would be that we figure
out a way of putting scanDates in experimentData as a guide and as something
that satisfies an emergent requirement. [I am not saying that we need to
change anything that has been done, merely that if there were further
decision-making to be done, this is how I would contribute to the process.]
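
To illustrate the point about the accessor mattering more than the storage,
here is a toy sketch.  ToyMIAME and ToyESet are hypothetical stand-ins, not
the Biobase classes, and the generic defined here would mask Biobase's own
scanDates if Biobase were attached, so try it in a fresh session:

```r
library(methods)

# Toy stand-ins (hypothetical names) for MIAME and eSet.
setClass("ToyMIAME", slots = c(scanDates = "character"))
setClass("ToyESet", slots = c(experimentData = "ToyMIAME"))

# The accessor is the public interface; callers never touch the slot.
setGeneric("scanDates", function(object) standardGeneric("scanDates"))
setMethod("scanDates", "ToyMIAME", function(object) object@scanDates)
setMethod("scanDates", "ToyESet",
          function(object) scanDates(object@experimentData))

eset <- new("ToyESet",
            experimentData = new("ToyMIAME",
                scanDates = c(binary.cel = "01/23/04 14:30:57",
                              text.cel   = "08/29/03 15:12:30")))

# User code looks the same wherever the dates are actually stored.
dates <- scanDates(eset)
```

If the storage later moved to a dedicated slot or a phenoData column, only
the method bodies would change.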

If some people decide down the road that they also need hybDates or IVTdates or
other metadata, I would say we don't need that in the infrastructure -- those
who require this information can decide if they want to extend experimentData
to include these items with suitable validity checking support.
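
Such a downstream extension might look roughly like the following.  The
class names are hypothetical, the base class is a toy stand-in for the real
experimentData container, and the hybDates values are made-up placeholders:

```r
library(methods)

# Hypothetical base container standing in for experimentData (MIAME).
setClass("ToyExperimentData", slots = c(scanDates = "character"))

# A user-level extension that adds hybDates and checks it against scanDates.
setClass("ToyHybExperimentData",
         contains = "ToyExperimentData",
         slots = c(hybDates = "character"),
         validity = function(object) {
             if (length(object@hybDates) != length(object@scanDates))
                 "hybDates must have one element per scanDates element"
             else
                 TRUE
         })

# Valid: one (placeholder) hybridization date per scan date.
x <- new("ToyHybExperimentData",
         scanDates = c("01/23/04 14:30:57", "08/29/03 15:12:30"),
         hybDates  = c("placeholder-1", "placeholder-2"))

# Invalid: lengths disagree, so construction is refused.
msg <- tryCatch(new("ToyHybExperimentData",
                    scanDates = c("01/23/04 14:30:57", "08/29/03 15:12:30"),
                    hybDates  = "placeholder-1"),
                error = function(e) conditionMessage(e))
```

Code written against the base class keeps working, since the extended object
is still a ToyExperimentData.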

[PS -- are we going to have to run updateObject on all serialized eSet
instances after this change?  This seems to me to be an important
consideration regarding how we address this issue.  Regarding phenoData as a
container for all sample-specific information seems to me a very
cost-effective solution to the problem under discussion -- it is not ideal
but it is extremely safe.  Perhaps the class versioning infrastructure will
minimize the need for reserialization ... I have not studied it.  But this
issue should be on the radar screen.  Class version mismatch errors are, in
my case, a major trauma related to Java programming that has influenced my
approach to software development.]

>
> Patrick
>
>
>
> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>
>> Kevin R. Coombes wrote:
>>>
>>> I believe that [1] this is not phenoData and [2] it is critical to
>>> understanding the data set.
>>>
>>> The second point says that there should definitely be some place to
>>> store it; the first point suggests that the phenoData slot is not ideal.
>>> One strong argument against including it in the phenoData
>>> slot comes from the situation when replicate assays are performed on
>>> the same sample.
>>
>> What would make it less "ideal" than usual?
>> In the R/data.frame paradigm, information is just (partially) repeated
>> across the rows. The hierarchical relationships (such as some rows
>> representing a common sample) are modelled with a column (or several
>> columns) containing that information. It's pretty much like one
>> table within a relational database.
>>
>> Moreover, I want to stress (again) that the very notion of what is
>> phenoData, and what is not, is I think very subjective.
>> What is a "replicate assay" can depend very much on the nature of the
>> study. To take just one example:
>> - when looking for somatic mutation (say with SNP arrays, or CGH),
>> having samples from different tissues/organs are just "replicate"
>> measures
>> - when looking at expression pattern changes between tissues, they are
>> no longer.
>>
>> Finally, phenoData as meaning "phenotypic data" has probably been misused
>> for quite some time (anyone storing a known mutation status for example
>> has been storing "genotypic data", as has anyone storing patient
>> samples with the hospital / physician / surgeon).
>>
>> Rather than (suddenly) trying to play the semantic police about what
>> goes into phenoData, it would be an option to either a) keep the name
>> and say it's like that for historical reasons or b) take a generic
>> name.
>>
>> "arrayData" does not sound too good to me as it evokes the actual
>> spot signal (that goes into AssayData). "covariateData" would be an
>> option (if only it had one syllable less). The suggested
>> "experimentData" is a good name (although an abbreviation could be
>> considered ?).
>>
>>
>>
>> L.
>>
>>
>>> The phenotypic/clinical/demographic data for
>>> repeated samples is the same, but the experimental characteristics
>>> (run date, etc) can be very different. And, as already pointed out by
>>> several people, there are numerous other experimental characteristics
>>> (such as sample preparation date) that could also affect the
>>> interpretation of the results.
>>>
>>> So, I would argue in favor of a new slot (probably implemented as yet
>>> another AnnotatedDataFrame) called something like
>>>  experimentCharacteristics, in which scanDate would be one commonly
>>> used column.
>>>
>>> Kevin
>>>
>>> James MacDonald wrote:
>>>>
>>>> If by phenoData we want to mean 'Any random information that may or
>>>> may not be phenotypic in nature', then scan date should certainly
>>>> go there. However, it seems to me that up to this time we have been
>>>> very careful about what goes where precisely because we didn't want
>>>> to stuff random information in odd places.
>>>>
>>>> To me, the idea of having different slots with names like phenoData
>>>> and assayData and featureData implies to the end user what sort of
>>>> data are in there.
>>>>
>>>> If we are to store non-phenotypic, non-biological data somewhere, I
>>>> think it makes sense to have another slot. All the slots we have in
>>>> the eSet class right now are for data that are conceptually quite
>>>> different from things like 'who ran these chips' or 'what day they
>>>> were run' or whatever. So putting this sort of data in with
>>>> phenotypic data makes no sense to me at all.
>>>>
>>>> Jim
>>>>
>>>>
>>>>>>> Kasper Daniel Hansen <khansen at stat.berkeley.edu> wrote:
>>>>>>>
>>>>> I am adding my support to Laurent: I think scanDate is simply
>>>>> another column in the phenotype info, indeed something I always
>>>>> put in, if I have it available (well, actually I am usually more
>>>>> interested in prep date). Putting in a new slot seems
>>>>> counterintuitive to me.
>>>>>
>>>>> Kasper
>>>>>
>>>>> On Jun 18, 2009, at 12:07 , Patrick Aboyoun wrote:
>>>>>
>>>>>
>>>>>> Laurent, The scan dates were singled out originally because we
>>>>>> have encountered data sets at the Hutch that appear to have a
>>>>>> scan date effect and wanted a location to store this
>>>>>> information so it can be included in the analysis. As you
>>>>>> mentioned, there are other variables that could be important as
>>>>>> well and shouldn't be ignored.
>>>>>>
>>>>>> Given that you have been actively working towards a solution of
>>>>>> managing array metadata, you can help create a design that can
>>>>>> be implemented in the Biobase package. Martin Morgan is
>>>>>> currently leading this effort and we can start a dialog
>>>>>> off-list (so as not to spam the rest of the developers with
>>>>>> minutiae) with those who are interested to hammer out a
>>>>>> solution to this problem. I think once the requirements are
>>>>>> formally expressed, we can easily put together a design that
>>>>>> meets the user's needs.
>>>>>>
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>>
>>>>>>
>>>>>> Laurent Gautier wrote:
>>>>>>
>>>>>>> Patrick,
>>>>>>>
>>>>>>> The conceptual distinction you want to make can be seen as
>>>>>>> artificial.
>>>>>>>
>>>>>>> When you start introducing "arrayData" as a separate entity,
>>>>>>> you will soon have to introduce "samplepreparationData" (what
>>>>>>> extraction protocol was used, was there any biopsy,
>>>>>>> etc...), "imageAnalysisData" (you know grid alignment, spot
>>>>>>> segmentation). Is it reasonable to add a slot each time?
>>>>>>> Moreover, those categories can probably also be broken down
>>>>>>> into subcategories. Finally, what is making the scanning date
>>>>>>> so important? Wouldn't the version of the software used, or
>>>>>>> the scanner, or the scanner settings, or the name of the
>>>>>>> person who performed the scanning be of relevance?
>>>>>>>
>>>>>>> One route would be to construct an initial AnnotatedDataFrame
>>>>>>> and populate it with whatever you fancy from the raw-data
>>>>>>> files (scan date, software, etc...). I have been going that way
>>>>>>> with my homebrew infrastructure, and it has so far been
>>>>>>> quite expressive. Reserved words are not
>>>>>>> necessarily very limiting (if sufficiently specific, say
>>>>>>> "array_scan_date" and the associated varMetaData = "Date when
>>>>>>> scanning the hybridized microarray"), and I'd think it better to
>>>>>>> carefully design and document what happens when one
>>>>>>> tries to add another column with the same name rather than
>>>>>>> rely on security-through-obscurity with mangled names.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> L.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Patrick Aboyoun wrote:
>>>>>>>
>>>>>>>> Laurent, As you mentioned the existing phenoData
>>>>>>>> infrastructure could be used to house information like scan
>>>>>>>> dates, scanner model, and scanning software version, but
>>>>>>>> this information is not conceptually phenotype data, and
>>>>>>>> adding it to an AnnotatedDataFrame comes with the
>>>>>>>> limitation of using reserved words (maybe name mangled like
>>>>>>>> .__ScanDates__?) for column names in the
>>>>>>>> AnnotatedDataFrame.
>>>>>>>>
>>>>>>>> The internal discussion we have been having about making this
>>>>>>>> more general is to add a different slot (candidate name
>>>>>>>> arrayData) to eSet (and remove the scanDates slot) that
>>>>>>>> would house the type of information we have been discussing
>>>>>>>> in a combination of dedicated slots like scanDates and a
>>>>>>>> catch-all AnnotatedDataFrame slot for less universal data.
>>>>>>>> This design would separate the array data from the
>>>>>>>> phenotype data, and having dedicated slots for important
>>>>>>>> information like scan dates would avoid having to manage special
>>>>>>>> columns in an AnnotatedDataFrame.
>>>>>>>>
>>>>>>>> As you rightly point out we need to ensure we support a
>>>>>>>> rich suite of functionality like "[", subset, etc., but
>>>>>>>> this can all be handled through methods for the eSet class.
>>>>>>>>
>>>>>>>>
>>>>>>>> Keep in mind that this recent change is just a first step,
>>>>>>>> not a final design, and with your help and input from the
>>>>>>>> rest of the BioC developer community, we can ensure we end
>>>>>>>> up with a sufficiently useful microarray data
>>>>>>>> infrastructure.
>>>>>>>>
>>>>>>>> Cheers, Patrick
>>>>>>>>
>>>>>>>>
>>>>>>>> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Patrick,
>>>>>>>>>
>>>>>>>>> There are indeed always several ways to address needs,
>>>>>>>>> and my comment is mostly pointing at the fact that
>>>>>>>>> creating yet-an-other slot is not necessary since one can
>>>>>>>>> currently store such data into phenoData (into a column
>>>>>>>>> named... say "scan_date").
>>>>>>>>>
>>>>>>>>> I would in fact describe as overbuilding the approach that
>>>>>>>>> adds a new (and exclusive) slot when improving the
>>>>>>>>> existing infrastructure could perfectly answer the needs.
>>>>>>>>> So today it's "scanDates", and next could be
>>>>>>>>> "scannerModel", or "scanningSoftwareVersion".
>>>>>>>>>
>>>>>>>>> I have been a little unclear (even to myself) in my
>>>>>>>>> comment about using "[", so here are more details. *If*
>>>>>>>>> the extract operator was made to evaluate expressions
>>>>>>>>> such as the function subset() does, or in fact if a
>>>>>>>>> method subset was implemented for eSet objects, storing
>>>>>>>>> all information into phenoData makes such things nice:
>>>>>>>>>
>>>>>>>>> # silly example: only get the control data scanned in the future:
>>>>>>>>> eset[, scan_date > date() & treatment == "control"]
>>>>>>>>> # same with subset:
>>>>>>>>> subset(eset, , scan_date > date() & treatment == "control")
>>>>>>>>>
>>>>>>>>> # a little longer to write
>>>>>>>>> eset[, scanDates(eset) > date() & pData(eset)$treatment == "control"]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If for some reasons a distinction between phenoData and
>>>>>>>>>  like-phenoData-but-can't-be-the-same is needed, please do
>>>>>>>>> consider the creation of an AnnotatedDataFrame that
>>>>>>>>> contains all of them.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> L.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Patrick Aboyoun wrote:
>>>>>>>>>
>>>>>>>>>> Laurent, We had some immediate need for scan date
>>>>>>>>>> information and rather than overbuild a system for
>>>>>>>>>> managing metadata that we may or may not need, we
>>>>>>>>>> opted to start simply and then build up as appropriate.
>>>>>>>>>> There have been some internal discussions about managing
>>>>>>>>>> other metadata along with scan dates, but nothing else
>>>>>>>>>> has bubbled to the top yet. Your thoughts and design
>>>>>>>>>> can help speed up this process. The class versioning
>>>>>>>>>> system in Biobase supports iterative development and
>>>>>>>>>> we can make further changes once we lock a design in
>>>>>>>>>> place. One editorial comment I have is that lots of
>>>>>>>>>> designs are possible for a given need and, for example,
>>>>>>>>>> the current class properly subsets the scanDates information
>>>>>>>>>> using "[" despite its not being stored in the
>>>>>>>>>> phenoData (AnnotatedDataFrame) slot.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers, Patrick
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hi Patrick,
>>>>>>>>>>>
>>>>>>>>>>> Storing the scan dates is indeed useful information,
>>>>>>>>>>> and it is nice to have it offered at the parsing
>>>>>>>>>>> stage. However, my first comment would be "does it
>>>>>>>>>>> justify a new slot" in eSet?
>>>>>>>>>>>
>>>>>>>>>>> I have been storing scan dates for quite some time
>>>>>>>>>>> now, but opted for having them in the phenoData as it
>>>>>>>>>>> made more sense to me, both from an implementation
>>>>>>>>>>> standpoint and from a practical standpoint (as standard extraction
>>>>>>>>>>> of an eset-subset on columns with the "["
>>>>>>>>>>> operator works).
>>>>>>>>>>>
>>>>>>>>>>> If having something specific for scan dates is really
>>>>>>>>>>> really wished, would it make sense to have that
>>>>>>>>>>> by extending AnnotatedDataFrame?
>>>>>>>>>>>
>>>>>>>>>>> In my opinion, the stage at which the data are
>>>>>>>>>>> extracted (in this case when parsing the files coming
>>>>>>>>>>> out of the image analysis) should not dictate where
>>>>>>>>>>> the data are stored. In fact, it might make for a
>>>>>>>>>>> nice(r) workflow if the function reading raw array
>>>>>>>>>>> data could return an eSet-inheriting instance and a phenoData
>>>>>>>>>>> with information such as dates and file
>>>>>>>>>>> names. I am working on a workflow that is in fact
>>>>>>>>>>> getting much more data from the header (I suppose
>>>>>>>>>>> that I'd contribute it when I have enough time to wrap it
>>>>>>>>>>> up).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Just few thoughts,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> L.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Patrick Aboyoun wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear Bioconductor developers, The Biocore group has
>>>>>>>>>>>> just committed a change to the BioC 2.5 code line
>>>>>>>>>>>> (Biobase version 2.5.3) to support the use of microarray scan
>>>>>>>>>>>> date in statistical analyses by
>>>>>>>>>>>> adding a scanDates slot to Biobase's eSet class.
>>>>>>>>>>>> This information can be retrieved and set using
>>>>>>>>>>>> the new scanDates and scanDates<- functions,
>>>>>>>>>>>> respectively. The scanDates slot is designed to
>>>>>>>>>>>> hold a character vector of length = # of samples,
>>>>>>>>>>>> with one character element for each sample. (See
>>>>>>>>>>>> help(scanDates) for more information.)
>>>>>>>>>>>>
>>>>>>>>>>>> In this first round of check-ins we have added affy
>>>>>>>>>>>> support of this new slot to functions like
>>>>>>>>>>>> ReadAffy and we will be working towards adding
>>>>>>>>>>>> this information to other microarray platforms as
>>>>>>>>>>>> well.
>>>>>>>>>>>>
>>>>>>>>>>>> This change involved bumping the eSet version
>>>>>>>>>>>> number from 1.1.0 to 1.2.0 in the Biobase class
>>>>>>>>>>>> definition. In order to minimize the impact of
>>>>>>>>>>>> this change, the Biobase methods support both the
>>>>>>>>>>>> current eSet version 1.2.0 and old 1.1.0
>>>>>>>>>>>> serialized objects, so updateObject need not be
>>>>>>>>>>>> run on eSet-derived objects
>>>>>>>>>>>> prior to use in other functions. We have also
>>>>>>>>>>>> tested and version bumped (and patched where
>>>>>>>>>>>> needed) the following packages that create
>>>>>>>>>>>> eSet-derived classes to minimize any package build
>>>>>>>>>>>> issues: ACME, beadarray, beadarraySNP, cellHTS2,
>>>>>>>>>>>> CGHbase, codelink, crlmm, GeneRegionScan, GGBase,
>>>>>>>>>>>> maDB, oligoClasses, ontoTools, puma, rMAT,
>>>>>>>>>>>> SNPchip, and spkTools.
>>>>>>>>>>>>
>>>>>>>>>>>> Below is a demonstration of the new functionality.
>>>>>>>>>>>> If you encounter any issues related to this
>>>>>>>>>>>> change, please e-mail this list so the community
>>>>>>>>>>>> can monitor the change.
>>>>>>>>>>>>
>>>>>>>>>>>> - The Biocore Team
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> > suppressMessages(library(affy))
>>>>>>>>>>>> > example(ReadAffy)
>>>>>>>>>>>>
>>>>>>>>>>>> RdAffy> if(require(affydata)){
>>>>>>>>>>>> RdAffy+      celpath <- system.file("celfiles", package="affydata")
>>>>>>>>>>>> RdAffy+      fns <- list.celfiles(path=celpath,full.names=TRUE)
>>>>>>>>>>>> RdAffy+
>>>>>>>>>>>> RdAffy+      cat("Reading files:\n",paste(fns,collapse="\n"),"\n")
>>>>>>>>>>>> RdAffy+      ##read a binary celfile
>>>>>>>>>>>> RdAffy+      abatch <- ReadAffy(filenames=fns[1])
>>>>>>>>>>>> RdAffy+      ##read a text celfile
>>>>>>>>>>>> RdAffy+      abatch <- ReadAffy(filenames=fns[2])
>>>>>>>>>>>> RdAffy+      ##read all files in that dir
>>>>>>>>>>>> RdAffy+      abatch <- ReadAffy(celfile.path=celpath)
>>>>>>>>>>>> RdAffy+ }
>>>>>>>>>>>> Loading required package: affydata
>>>>>>>>>>>> Reading files:
>>>>>>>>>>>>  /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/binary.cel
>>>>>>>>>>>> /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/text.cel
>>>>>>>>>>>>
>>>>>>>>>>>> > scanDates(abatch)
>>>>>>>>>>>>          binary.cel            text.cel
>>>>>>>>>>>> "01/23/04 14:30:57" "08/29/03 15:12:30"
>>>>>>>>>>>>
>>>>>>>>>>>> > sessionInfo()
>>>>>>>>>>>> R version 2.10.0 Under development (unstable) (2009-06-12 r48755)
>>>>>>>>>>>> i386-apple-darwin9.6.0
>>>>>>>>>>>>
>>>>>>>>>>>> locale:
>>>>>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>>>>>>
>>>>>>>>>>>> attached base packages:
>>>>>>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>>>>>>
>>>>>>>>>>>> other attached packages:
>>>>>>>>>>>> [1] affydata_1.11.6 affy_1.23.2     Biobase_2.5.3
>>>>>>>>>>>>
>>>>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>>>>> [1] affyio_1.13.3        preprocessCore_1.7.4 tools_2.10.0
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>  Bioc-devel at stat.math.ethz.ch mailing list
>>>>>>>>>>>>  https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>>>
>>>>
>>>> **********************************************************  Electronic
>>>> Mail is not secure, may not be read every day, and
>>>> should not be used for urgent or sensitive issues
>>>>
>

-- 
Vincent Carey, PhD
Biostatistics, Channing Lab
617 525 2265