[Bioc-devel] BioC 2.5: Added scanDates slot to Biobase's eSet class

Thu Jun 18 22:15:53 CEST 2009

I am adding my support to Laurent: I think scanDate is simply another  
column in the phenotype info, indeed something I always put in, if I  
have it available (well, actually I am usually more interested in prep  
date). Putting in a new slot seems counterintuitive to me.

Kasper

On Jun 18, 2009, at 12:07 , Patrick Aboyoun wrote:

> Laurent,
> The scan dates were singled out originally because we have  
> encountered data sets at the Hutch that appear to have a scan date  
> effect and wanted a location to store this information so it can be  
> included in the analysis. As you mentioned, there are other  
> variables that could be important as well and shouldn't be ignored.
>
> Given that you have been actively working towards a solution of  
> managing array metadata, you can help create a design that can be  
> implemented in the Biobase package. Martin Morgan is currently  
> leading this effort and we can start a dialog off-list (so as not to  
> spam the rest of the developers with minutiae) with those who are  
> interested to hammer out a solution to this problem. I think once  
> the requirements are formally expressed, we can easily put together  
> a design that meets the user's needs.
>
>
> Patrick
>
>
>
> Laurent Gautier wrote:
>>
>> Patrick,
>>
>> The conceptual distinction you want to make can be seen as  
>> artificial.
>>
>> When you start introducing "arrayData" as a separate entity, you
>> will soon have to introduce "samplepreparationData" (what
>> extraction protocol was used, was there any biopsy, etc.),
>> "imageAnalysisData" (you know, grid alignment, spot segmentation).
>> Is it reasonable to add a slot each time? Moreover, those
>> categories can probably also be broken down into subcategories.
>> Finally, what makes the scanning date so important?
>> Wouldn't the version of the software used, or the scanner, or the
>> scanner settings, or the name of the person who performed the
>> scanning be of relevance?
>>
>> One route would be to construct an initial AnnotatedDataFrame and
>> populate it with whatever you fancy from the raw-data files (scan
>> date, software, etc.). I have been going that way with my homebrew
>> infrastructure, and it has so far proven quite expressive.
>> Reserved words are not necessarily very limiting (if
>> sufficiently specific, say "array_scan_date" and the associated
>> varMetaData = "Date when scanning the hybridized microarray"), and
>> I'd think it better to carefully design and document what happens
>> when one tries to add another column with the same name rather
>> than rely on security-through-obscurity with mangled names.
>>
>>
>>
>> L.
>>
>>
>>
>>
>>
>> Patrick Aboyoun wrote:
>>> Laurent,
>>> As you mentioned, the existing phenoData infrastructure could be
>>> used to house information like scan dates, scanner model, and
>>> scanning software version, but this information is not
>>> conceptually phenotype data, and adding it to an
>>> AnnotatedDataFrame comes with the limitation of using reserved
>>> words (maybe name mangled like .__ScanDates__?) for column names
>>> in the AnnotatedDataFrame.
>>>
>>> The internal discussion we have been having about making this more
>>> general is to add a different slot (candidate name arrayData) to
>>> eSet (and remove the scanDates slot) that would house the type
>>> of information we have been discussing in a combination of
>>> dedicated slots like scanDates and a catch-all AnnotatedDataFrame
>>> slot for less universal data. This design would separate the array
>>> data from the phenotype data, and having dedicated slots for
>>> important information like scan dates would avoid having to manage
>>> special columns in an AnnotatedDataFrame.
>>>
>>> As you rightly point out we need to ensure we support a rich suite  
>>> of functionality like "[", subset, etc., but this can all be  
>>> handled through methods for the eSet class.
>>>
>>> Keep in mind that this recent change is just a first step, not a  
>>> final design, and with your help and input from the rest of the  
>>> BioC developer community, we can ensure we end up with a  
>>> sufficiently useful microarray data infrastructure.
>>>
>>> Cheers,
>>> Patrick
>>>
>>>
>>> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>>>
>>>> Patrick,
>>>>
>>>> There are indeed always several ways to address needs, and my
>>>> comment is mostly pointing at the fact that creating yet another
>>>> slot is not necessary since one can currently store such data in
>>>> phenoData (in a column named... say "scan_date").
>>>>
>>>> I would in fact describe as overbuilding the approach that adds a
>>>> new (and exclusive) slot when improving the existing
>>>> infrastructure could perfectly answer the needs. So today it's
>>>> "scanDates", and next could be "scannerModel", or
>>>> "scanningSoftwareVersion".
>>>>
>>>> I have been a little unclear (even to myself) in my comment about
>>>> using "[", so here are more details. *If* the extract operator
>>>> were made to evaluate expressions the way the function subset()
>>>> does, or in fact if a subset method were implemented for eSet
>>>> objects, storing all information in phenoData would make such
>>>> things nice:
>>>>
>>>> # silly example: only get the control data scanned in the future:
>>>> eset[, scan_date > date() & treatment == "control"]
>>>> # same with subset:
>>>> subset(eset, , scan_date > date() & treatment == "control")
>>>>
>>>> # a little longer to write
>>>> eset[, scanDates(eset) > date() & pData(eset)$treatment == "control"]
>>>>
>>>>
>>>> If for some reason a distinction between phenoData and
>>>> like-phenoData-but-can't-be-the-same is needed, please do
>>>> consider the creation of an AnnotatedDataFrame that contains all
>>>> of them.
>>>>
>>>>
>>>>
>>>> L.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Patrick Aboyoun wrote:
>>>>> Laurent,
>>>>> We had some immediate need for scan date information and, rather
>>>>> than overbuild a system for managing metadata that we may or
>>>>> may not need, we opted to start simply and then build up as
>>>>> appropriate. There have been some internal discussions about
>>>>> managing other metadata along with scan dates, but nothing else
>>>>> has bubbled to the top yet. Your thoughts and design can help
>>>>> speed up this process. The class versioning system in Biobase
>>>>> supports iterative development and we can make further changes
>>>>> once we lock a design in place. One editorial comment I have is
>>>>> that lots of designs are possible for a given need and, for
>>>>> example, the current class properly subsets the scanDates
>>>>> information using "[" despite it not being stored in the
>>>>> phenoData (AnnotatedDataFrame) slot.
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Patrick
>>>>>
>>>>>
>>>>> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>>>>>
>>>>>> Hi Patrick,
>>>>>>
>>>>>> Storing the scan dates is indeed useful, and it is nice to
>>>>>> have it offered at the parsing stage.
>>>>>> However, my first comment would be "does it justify a new slot"
>>>>>> in eSet?
>>>>>>
>>>>>> I have been storing scan dates for quite some time now, but
>>>>>> opted for having them in the phenoData as it made more sense to
>>>>>> me, both from an implementation standpoint and from a practical
>>>>>> standpoint (as standard extraction of an eset-subset on columns
>>>>>> with the "[" operator works).
>>>>>>
>>>>>> If having something specific for scan dates is really, really
>>>>>> wished for, would it make sense to have that by extending
>>>>>> AnnotatedDataFrame?
>>>>>>
>>>>>> In my opinion, the stage at which the data are extracted (in
>>>>>> that case when parsing the files coming out of the image
>>>>>> analysis) should not dictate where the data are stored.
>>>>>> In fact, it might make for a nice(r) workflow if the function
>>>>>> reading raw array data could return an eSet-inheriting instance
>>>>>> and a phenoData with information such as dates and file names. I
>>>>>> am working on a workflow that is in fact getting much more data
>>>>>> from the header (I suppose that I'd contribute it when I have
>>>>>> enough time to wrap it up).
>>>>>>
>>>>>>
>>>>>> Just a few thoughts,
>>>>>>
>>>>>>
>>>>>>
>>>>>> L.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Patrick Aboyoun wrote:
>>>>>>> Dear Bioconductor developers,
>>>>>>> The Biocore group has just committed a change to the BioC 2.5
>>>>>>> code line (Biobase version 2.5.3) to support the use of
>>>>>>> microarray scan date in statistical analyses by adding a
>>>>>>> scanDates slot to Biobase's eSet class. This information can
>>>>>>> be retrieved and set using the new scanDates and
>>>>>>> scanDates<- functions, respectively. The scanDates slot is
>>>>>>> designed to hold a character vector of length = # of
>>>>>>> samples, with one character element for each sample. (See
>>>>>>> help(scanDates) for more information.)
>>>>>>>
>>>>>>> In this first round of check-ins we have added affy support
>>>>>>> for this new slot to functions like ReadAffy, and we will be
>>>>>>> working towards adding this information to other microarray
>>>>>>> platforms as well.
>>>>>>>
>>>>>>> This change involved bumping the eSet version number from
>>>>>>> 1.1.0 to 1.2.0 in the Biobase class definition. In order to
>>>>>>> minimize the impact of this change, the Biobase methods
>>>>>>> support both the current eSet version 1.2.0 and old
>>>>>>> 1.1.0 serialized objects, so updateObject will not need
>>>>>>> to be run on eSet-derived objects prior to
>>>>>>> use in other functions. We have also tested and version-bumped
>>>>>>> (and patched where needed) the following packages
>>>>>>> that create eSet-derived classes to minimize any package
>>>>>>> build issues: ACME, beadarray, beadarraySNP, cellHTS2,
>>>>>>> CGHbase, codelink, crlmm, GeneRegionScan, GGBase, maDB,
>>>>>>> oligoClasses, ontoTools, puma, rMAT, SNPchip, and spkTools.
>>>>>>>
>>>>>>> Below is a demonstration of the new functionality. If you
>>>>>>> encounter any issues related to this change, please e-mail
>>>>>>> this list so the community can monitor the change.
>>>>>>>
>>>>>>> - The Biocore Team
>>>>>>>
>>>>>>>
>>>>>>> > suppressMessages(library(affy))
>>>>>>> > example(ReadAffy)
>>>>>>>
>>>>>>> RdAffy> if(require(affydata)){
>>>>>>> RdAffy+      celpath <- system.file("celfiles", package="affydata")
>>>>>>> RdAffy+      fns <- list.celfiles(path=celpath,full.names=TRUE)
>>>>>>> RdAffy+
>>>>>>> RdAffy+      cat("Reading files:\n",paste(fns,collapse="\n"),"\n")
>>>>>>> RdAffy+      ##read a binary celfile
>>>>>>> RdAffy+      abatch <- ReadAffy(filenames=fns[1])
>>>>>>> RdAffy+      ##read a text celfile
>>>>>>> RdAffy+      abatch <- ReadAffy(filenames=fns[2])
>>>>>>> RdAffy+      ##read all files in that dir
>>>>>>> RdAffy+      abatch <- ReadAffy(celfile.path=celpath)
>>>>>>> RdAffy+ }
>>>>>>> Loading required package: affydata
>>>>>>> Reading files:
>>>>>>> /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/binary.cel
>>>>>>> /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/text.cel
>>>>>>> > scanDates(abatch)
>>>>>>>      binary.cel            text.cel
>>>>>>> "01/23/04 14:30:57" "08/29/03 15:12:30"
>>>>>>> > sessionInfo()
>>>>>>> R version 2.10.0 Under development (unstable) (2009-06-12 r48755)
>>>>>>> i386-apple-darwin9.6.0
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>> other attached packages:
>>>>>>> [1] affydata_1.11.6 affy_1.23.2     Biobase_2.5.3
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] affyio_1.13.3        preprocessCore_1.7.4 tools_2.10.0
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at stat.math.ethz.ch mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel