[Bioc-devel] BioC 2.5: Added scanDates slot to Biobase's eSet class

Henrik Bengtsson hb at stat.berkeley.edu
Thu Jun 18 22:16:48 CEST 2009

FYI, be careful not to blindly interpret the timestamps in the Affymetrix
CEL file headers (originating from the DAT header) as always being the
time of the scan, even though the Affymetrix file format labels this
DAT header field as 'Date and time of scan (padded with spaces).', cf.


These timestamps can be set/modified by other steps in the analysis
process, e.g.


Another issue is that the timestamp doesn't specify the time zone.
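
A minimal sketch of making the time zone assumption explicit when parsing
such strings in R. The format string is inferred from the scanDates() output
quoted further down in this thread, and both time zones used here are pure
assumptions for illustration; the header itself does not say which one applies.

```r
# Timestamps as printed by scanDates(abatch) later in this thread.
stamps <- c(binary.cel = "01/23/04 14:30:57", text.cel = "08/29/03 15:12:30")

# Assumed layout: month/day/two-digit year, then 24-hour time.
fmt <- "%m/%d/%y %H:%M:%S"

# The header carries no time zone, so one has to be chosen explicitly.
# "UTC" is only a placeholder; use the zone of the scanner site if known.
parsed <- as.POSIXct(strptime(unname(stamps), format = fmt, tz = "UTC"))
names(parsed) <- names(stamps)

# Strings that do not match the assumed layout come back as NA.
if (any(is.na(parsed))) warning("some timestamps did not match 'fmt'")

print(format(parsed, "%Y-%m-%d %H:%M:%S %Z"))

# The same strings read under a different assumed zone denote different
# instants, which is why the choice should be recorded with the data.
alt <- as.POSIXct(strptime(unname(stamps), format = fmt,
                           tz = "America/Los_Angeles"))
print(difftime(alt, parsed, units = "hours"))
```

Whatever zone is picked, keeping it next to the parsed values (for example in
the metadata of the column that holds them) avoids silent shifts later on.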


On Thu, Jun 18, 2009 at 12:07 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> Laurent,
> The scan dates were singled out originally because we have encountered data
> sets at the Hutch that appear to have a scan date effect and wanted a
> location to store this information so it can be included in the analysis. As
> you mentioned, there are other variables that could be important as well and
> shouldn't be ignored.
> Given that you have been actively working towards a solution of managing
> array metadata, you can help create a design that can be implemented in the
> Biobase package. Martin Morgan is currently leading this effort and we can
> start a dialog off-list (so as not to spam the rest of the developers with
> minutiae) with those who are interested to hammer out a solution to this
> problem. I think once the requirements are formally expressed, we can easily
> put together a design that meets the user's needs.
> Patrick
> Laurent Gautier wrote:
>> Patrick,
>> The conceptual distinction you want to make can be seen as artificial.
>> When you start introducing "arrayData" as a separate entity, you will
>> soon have to introduce "samplepreparationData" (what extraction protocol was
>> used, were there any biopsies, etc...), "imageAnalysisData" (you know grid
>> alignment, spot segmentation). Is it reasonable to add a slot each time ?
>> Moreover, those categories can probably also be broken down into
>> subcategories. Finally, what is making the scanning date so important ?
>> Wouldn't the version of the software used, or the scanner, or the scanner
>> settings, or the name of the person who performed the scanning be of
>> relevance ?
>> One route would be to construct an initial AnnotatedDataFrame and populate
>> it with whatever you fancy from the raw-data files (scan date, software,
>> etc...). I have been going this way with my homebrew infrastructure, and it has
>> so far led to a good deal of expressiveness. Reserved words are not
>> necessarily very limiting (if sufficiently specific, say "array_scan_date"
>> and the associated varMetaData = "Date when scanning the hybridized
>> microarray"), and I'd think it better to carefully design and document what
>> happens when one tries to add another column with the same name
>> rather than rely on security-through-obscurity with mangled names.
>> L.
>> Patrick Aboyoun wrote:
>>> Laurent,
>>> As you mentioned the existing phenoData infrastructure could be used to
>>> house information like scan dates, scanner model, and scanning software
>>> version, but this information is not conceptually phenotype data, and
>>> adding it to an AnnotatedDataFrame comes with the limitation of using
>>> reserved words (maybe name mangled like .__ScanDates__?) for column names in
>>> the AnnotatedDataFrame.
>>> The internal discussion we have been having about making this more general
>>> is to add a different slot (candidate name arrayData) to eSet (and remove
>>> the scanDates slot) that would house the type of information we have been
>>> discussing in a combination of dedicated slots like scanDates and a
>>> catch-all AnnotatedDataFrame slot for less universal data. This design would
>>> separate the array data from the phenotype data, and having dedicated slots
>>> for important information like scan dates would avoid having to manage
>>> special columns in an AnnotatedDataFrame.
>>> As you rightly point out we need to ensure we support a rich suite of
>>> functionality like "[", subset, etc., but this can all be handled through
>>> methods for the eSet class.
>>> Keep in mind that this recent change is just a first step, not a final
>>> design, and with your help and input from the rest of the BioC developer
>>> community, we can ensure we end up with a sufficiently useful microarray
>>> data infrastructure.
>>> Cheers,
>>> Patrick
>>> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>>>> Patrick,
>>>> There are indeed always several ways to address needs, and my comment
>>>> is mostly pointing at the fact that creating yet another slot is not
>>>> necessary since one can currently store such data in phenoData (in
>>>> a column named... say "scan_date").
>>>> I would in fact describe as overbuilding the approach that adds a new
>>>> (and exclusive) slot when improving the existing infrastructure could
>>>> perfectly answer the needs. So today it's "scanDates", and next could
>>>> be "scannerModel", or "scanningSoftwareVersion".
>>>> I have been a little unclear (even to myself) in my comment about using
>>>> "[", so here are more details. *If* the extract operator were made to
>>>> evaluate expressions the way the function subset() does, or in fact if
>>>> a subset method were implemented for eSet objects, storing all
>>>> information in phenoData would make such things nice:
>>>> # silly example: only get the control data scanned in the future:
>>>> eset[, scan_date > date() & treatment == "control"]
>>>> # same with subset:
>>>> subset(eset, , scan_date > date() & treatment == "control")
>>>> # a little longer to write
>>>> eset[, scanDates(eset) > date() & pData(eset)$treatment == "control"]
>>>> If for some reason a distinction between phenoData and
>>>> like-phenoData-but-can't-be-the-same is needed, please do consider the
>>>> creation of an AnnotatedDataFrame that contains all of them.
>>>> L.
>>>> Patrick Aboyoun wrote:
>>>>> Laurent,
>>>>> We had some immediate need for scan date information and rather than
>>>>> overbuild a system for managing metadata that we may or may not need, we
>>>>> opted to start simply and then build up as appropriate. There have been some
>>>>> internal discussions about managing other metadata along with scan dates,
>>>>> but nothing else has bubbled to the top yet. Your thoughts and design can
>>>>> help speed up this process. The class versioning system in Biobase supports
>>>>> iterative development and we can make further changes once we lock a
>>>>> design in place. One editorial comment I have is that lots of designs are
>>>>> possible for a given need and, for example, the current class properly
>>>>> subsets the scanDates information using "[" despite it not being stored in the
>>>>> phenoData (AnnotatedDataFrame) slot.
>>>>> Cheers,
>>>>> Patrick
>>>>> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>>>>>> Hi Patrick,
>>>>>> Storing the scan dates is indeed useful, and it is nice to
>>>>>> have it offered at the parsing stage.
>>>>>> However, my first comment would be "does it justify a new slot" in eSet ?
>>>>>> I have been storing scan dates for quite some time now, but opted for
>>>>>> having them in the phenoData as it made more sense to me, both from an
>>>>>> implementation standpoint and from a practical standpoint (as standard
>>>>>> extraction of an eset-subset on columns with the "[" operator works).
>>>>>> If having something specific for scan dates is really wished for,
>>>>>> would it make sense to have that by extending AnnotatedDataFrame ?
>>>>>> In my opinion, the stage at which the data are extracted (in this
>>>>>> case when parsing the files coming out of the image analysis) should
>>>>>> not dictate where the data are stored.
>>>>>> In fact, it might make for a nice(r) workflow if the function
>>>>>> reading raw array data could return an eSet-inheriting instance and a
>>>>>> phenoData with information such as dates and file names. I am working
>>>>>> on a workflow that is in fact getting much more data from the header
>>>>>> (I suppose that I'd contribute it when I have enough time to wrap it up).
>>>>>> Just a few thoughts,
>>>>>> L.
>>>>>> Patrick Aboyoun wrote:
>>>>>>> Dear Bioconductor developers,
>>>>>>> The Biocore group has just committed a change to the BioC 2.5 code
>>>>>>> line (Biobase version 2.5.3) to support the use of microarray scan date
>>>>>>> in statistical analyses by adding a scanDates slot to Biobase's eSet
>>>>>>> class. This information can be retrieved and set using the new scanDates
>>>>>>> and scanDates<- functions, respectively. The scanDates slot is designed to
>>>>>>> hold a character vector of length = # of samples, with one character
>>>>>>> element for each sample. (See help(scanDates) for more information.)
>>>>>>> In this first round of check-ins we have added affy support of this
>>>>>>> new slot to functions like ReadAffy and we will be working towards adding
>>>>>>> this information to other microarray platforms as well.
>>>>>>> This change involved bumping the eSet version number from 1.1.0 to
>>>>>>> 1.2.0 in the Biobase class definition. In order to minimize the impact of
>>>>>>> this change, the Biobase methods support both the current eSet version
>>>>>>> 1.2.0 and old 1.1.0 serialized objects, so updateObject will not need
>>>>>>> to be run on eSet-derived objects prior to use in other
>>>>>>> functions. We have also tested and version bumped (and patched where
>>>>>>> needed) the following packages that create eSet-derived classes to
>>>>>>> minimize any package build issues: ACME, beadarray, beadarraySNP,
>>>>>>> cellHTS2, CGHbase, codelink, crlmm, GeneRegionScan, GGBase, maDB,
>>>>>>> oligoClasses, ontoTools, puma, rMAT, SNPchip, and spkTools.
>>>>>>> Below is a demonstration of the new functionality. If you encounter
>>>>>>> any issues related to this change, please e-mail this list so the
>>>>>>> community can monitor the change.
>>>>>>> - The Biocore Team
>>>>>>>> suppressMessages(library(affy))
>>>>>>>> example(ReadAffy)
>>>>>>> RdAffy> if(require(affydata)){
>>>>>>> RdAffy+      celpath <- system.file("celfiles", package="affydata")
>>>>>>> RdAffy+      fns <- list.celfiles(path=celpath,full.names=TRUE)
>>>>>>> RdAffy+
>>>>>>> RdAffy+      cat("Reading files:\n",paste(fns,collapse="\n"),"\n")
>>>>>>> RdAffy+      ##read a binary celfile
>>>>>>> RdAffy+      abatch <- ReadAffy(filenames=fns[1])
>>>>>>> RdAffy+      ##read a text celfile
>>>>>>> RdAffy+      abatch <- ReadAffy(filenames=fns[2])
>>>>>>> RdAffy+      ##read all files in that dir
>>>>>>> RdAffy+      abatch <- ReadAffy(celfile.path=celpath)
>>>>>>> RdAffy+ }
>>>>>>> Loading required package: affydata
>>>>>>> Reading files:
>>>>>>> /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/binary.cel
>>>>>>> /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/text.cel
>>>>>>>> scanDates(abatch)
>>>>>>>      binary.cel            text.cel
>>>>>>> "01/23/04 14:30:57" "08/29/03 15:12:30"
>>>>>>>> sessionInfo()
>>>>>>> R version 2.10.0 Under development (unstable) (2009-06-12 r48755)
>>>>>>> i386-apple-darwin9.6.0
>>>>>>> locale:
>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>> other attached packages:
>>>>>>> [1] affydata_1.11.6 affy_1.23.2     Biobase_2.5.3
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] affyio_1.13.3        preprocessCore_1.7.4 tools_2.10.0
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at stat.math.ethz.ch mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
