[Bioc-devel] BioC 2.5: Added scanDates slot to Biobase's eSet class
James MacDonald
jmacdon at med.umich.edu
Thu Jun 18 23:02:06 CEST 2009
If by phenoData we want to mean 'Any random information that may or may not be phenotypic in nature', then scan date should certainly go there. However, it seems to me that up to this time we have been very careful about what goes where precisely because we didn't want to stuff random information in odd places.
To me, the idea of having different slots with names like phenoData and assayData and featureData implies to the end user what sort of data are in there.
If we are to store non-phenotypic, non-biological data somewhere, I think it makes sense to have another slot. All the slots we have in the eSet class right now are for data that are conceptually quite different from things like 'who ran these chips' or 'what day they were run' or whatever. So putting this sort of data in with phenotypic data makes no sense to me at all.
Jim
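
A minimal sketch of the "separate slot" idea discussed in this thread, written with plain S4 in base R rather than Biobase. The class name `ToySet` and its `run` slot are hypothetical stand-ins, not the eSet definition; the point is only that a "[" method can keep a dedicated per-sample slot aligned with the assay and phenotype data.

```r
library(methods)

# Toy stand-in for an eSet-like container (hypothetical, NOT the Biobase class).
setClass("ToySet", representation(
  assay = "matrix",      # features x samples
  pheno = "data.frame",  # one row per sample: phenotypic covariates
  run   = "data.frame"   # one row per sample: processing metadata (scan date, ...)
))

# Subsetting on samples (j) keeps all three pieces aligned.
setMethod("[", "ToySet", function(x, i, j, ..., drop = FALSE) {
  if (missing(i)) i <- seq_len(nrow(x@assay))
  if (missing(j)) j <- seq_len(ncol(x@assay))
  new("ToySet",
      assay = x@assay[i, j, drop = FALSE],
      pheno = x@pheno[j, , drop = FALSE],
      run   = x@run[j, , drop = FALSE])
})

# Made-up data for illustration only.
samples <- c("a.cel", "b.cel", "c.cel")
ts <- new("ToySet",
  assay = matrix(1:6, nrow = 2, dimnames = list(c("f1", "f2"), samples)),
  pheno = data.frame(treatment = c("control", "treated", "control"),
                     row.names = samples, stringsAsFactors = FALSE),
  run   = data.frame(scan_date = as.Date(c("2003-08-29", "2004-01-23", "2004-01-23")),
                     row.names = samples))

# Keep only the samples scanned after a given day.
sub <- ts[, ts@run$scan_date > as.Date("2004-01-01")]
```

Whether such metadata lives in its own slot or in the phenotype table, the subsetting cost is the same; the difference is mainly in what the slot names communicate to the user.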
>>> Kasper Daniel Hansen <khansen at stat.berkeley.edu> wrote:
> I am adding my support to Laurent: I think scanDate is simply another
> column in the phenotype info, indeed something I always put in, if I
> have it available (well, actually I am usually more interested in prep
> date). Putting in a new slot seems counterintuitive to me.
>
> Kasper
>
> On Jun 18, 2009, at 12:07 , Patrick Aboyoun wrote:
>
>> Laurent,
>> The scan dates were singled out originally because we have
>> encountered data sets at the Hutch that appear to have a scan date
>> effect and wanted a location to store this information so it can be
>> included in the analysis. As you mentioned, there are other
>> variables that could be important as well and shouldn't be ignored.
>>
>> Given that you have been actively working towards a solution of
>> managing array metadata, you can help create a design that can be
>> implemented in the Biobase package. Martin Morgan is currently
>> leading this effort and we can start a dialog off-list (so as not to
>> spam the rest of the developers with minutiae) with those who are
>> interested to hammer out a solution to this problem. I think once
>> the requirements are formally expressed, we can easily put together
>> a design that meets the user's needs.
>>
>>
>> Patrick
>>
>>
>>
>> Laurent Gautier wrote:
>>>
>>> Patrick,
>>>
>>> The conceptual distinction you want to make can be seen as
>>> artificial.
>>>
>>> When you start introducing "arrayData" as a separate entity, you
>>> will soon have to introduce "samplepreparationData" (what
>>> extraction protocol was used, was there a biopsy, etc.) and
>>> "imageAnalysisData" (you know, grid alignment, spot segmentation).
>>> Is it reasonable to add a slot each time? Moreover, those
>>> categories can probably also be broken down into subcategories.
>>> Finally, what makes the scanning date so important?
>>> Wouldn't the version of the software used, or the scanner, or the
>>> scanner settings, or the name of the person who performed the
>>> scanning be of relevance?
>>>
>>> One route would be to construct an initial AnnotatedDataFrame and
>>> populate it with whatever you fancy from the raw-data files (scan
>>> date, software, etc.). I have been going that way with my homebrew
>>> infrastructure, and so far it has proved quite expressive. Reserved
>>> words are not necessarily very limiting (if sufficiently specific,
>>> say "array_scan_date" with the associated varMetadata "Date when
>>> scanning the hybridized microarray"), and I think it is better to
>>> carefully design and document what happens when one tries to add
>>> another column with the same name than to rely on
>>> security-through-obscurity with mangled names.
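
The "specific column name plus documented collision behaviour" approach described above can be sketched in base R. This is an illustration under stated assumptions, not Biobase code: `add_annotation` is a hypothetical helper, and the pair of data frames only mimics the data/varMetadata layout of an AnnotatedDataFrame.

```r
# Hypothetical helper: add a described column, refusing to overwrite silently.
add_annotation <- function(pd, meta, name, values, description) {
  if (name %in% names(pd)) {
    stop("column '", name, "' already exists; refusing to overwrite")
  }
  stopifnot(length(values) == nrow(pd))
  pd[[name]] <- values
  meta[name, "labelDescription"] <- description  # adds a row named `name`
  list(data = pd, varMetadata = meta)
}

# Made-up sample annotations and their descriptions.
pd <- data.frame(treatment = c("control", "treated"),
                 row.names = c("a.cel", "b.cel"), stringsAsFactors = FALSE)
meta <- data.frame(labelDescription = "Treatment group",
                   row.names = "treatment", stringsAsFactors = FALSE)

out <- add_annotation(pd, meta, "array_scan_date",
                      as.Date(c("2003-08-29", "2004-01-23")),
                      "Date when scanning the hybridized microarray")

# A second attempt with the same name fails loudly instead of clobbering data.
second <- tryCatch(
  add_annotation(out$data, out$varMetadata, "array_scan_date",
                 as.Date(c("2009-01-01", "2009-01-02")), "duplicate"),
  error = function(e) conditionMessage(e))
```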
>>>
>>>
>>>
>>> L.
>>>
>>>
>>>
>>>
>>>
>>> Patrick Aboyoun wrote:
>>>> Laurent,
>>>> As you mentioned, the existing phenoData infrastructure could be
>>>> used to house information like scan dates, scanner model, and
>>>> scanning software version, but this information is not
>>>> conceptually phenotype data, and adding it to an
>>>> AnnotatedDataFrame comes with the limitation of using reserved
>>>> words (maybe name mangled like .__ScanDates__?) for column names
>>>> in the AnnotatedDataFrame.
>>>>
>>>> The internal discussion we have been having about making this more
>>>> general is to add a different slot (candidate name arrayData) to
>>>> eSet (and remove the scanDates slot) that would house the type
>>>> of information we have been discussing in a combination of
>>>> dedicated slots like scanDates and a catch-all AnnotatedDataFrame
>>>> slot for less universal data. This design would separate the array
>>>> data from the phenotype data, and having dedicated slots for
>>>> important information like scan dates would avoid having to manage
>>>> special columns in an AnnotatedDataFrame.
>>>>
>>>> As you rightly point out we need to ensure we support a rich suite
>>>> of functionality like "[", subset, etc., but this can all be
>>>> handled through methods for the eSet class.
>>>>
>>>> Keep in mind that this recent change is just a first step, not a
>>>> final design, and with your help and input from the rest of the
>>>> BioC developer community, we can ensure we end up with a
>>>> sufficiently useful microarray data infrastructure.
>>>>
>>>> Cheers,
>>>> Patrick
>>>>
>>>>
>>>> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>>>>
>>>>> Patrick,
>>>>>
>>>>> There are indeed always several ways to address needs, and my
>>>>> comment is mostly pointing at the fact that creating yet another
>>>>> slot is not necessary since one can currently store such data in
>>>>> phenoData (in a column named... say "scan_date").
>>>>>
>>>>> I would in fact call overbuilding the approach that adds a new
>>>>> (and exclusive) slot when improving the existing infrastructure
>>>>> could perfectly answer the needs. So today it's "scanDates", and
>>>>> next could be "scannerModel", or "scanningSoftwareVersion".
>>>>>
>>>>> I have been a little unclear (even to myself) in my comment about
>>>>> using "[", so here are more details. *If* the extract operator
>>>>> were made to evaluate expressions as the function subset() does,
>>>>> or in fact if a subset method were implemented for eSet objects,
>>>>> storing all information in phenoData would make such things nice:
>>>>>
>>>>> # silly example: only get the control data scanned in the future:
>>>>> eset[, scan_date > date() & treatment == "control"]
>>>>> # same with subset:
>>>>> subset(eset, , scan_date > date() & treatment == "control")
>>>>>
>>>>> # a little longer to write
>>>>> eset[, scanDates(eset) > date() & pData(eset)$treatment == "control"]
>>>>>
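
The subset()-style evaluation proposed above is hypothetical for eSet, but the mechanism can be sketched on a plain data frame of sample annotations. The helper `select_samples` and the data are assumptions made for illustration; nothing here is existing Biobase behaviour.

```r
# Made-up sample annotations, one row per array.
pd <- data.frame(
  treatment = c("control", "treated", "control"),
  scan_date = as.Date(c("2003-08-29", "2004-01-23", "2004-01-23")),
  row.names = c("a.cel", "b.cel", "c.cel"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate an unquoted condition using the columns of
# `pd` as variables, the way subset() does, and return matching sample names.
select_samples <- function(pd, cond) {
  keep <- eval(substitute(cond), pd, parent.frame())
  stopifnot(is.logical(keep), length(keep) == nrow(pd))
  rownames(pd)[keep & !is.na(keep)]
}

picked <- select_samples(
  pd, scan_date > as.Date("2004-01-01") & treatment == "control")
picked
# → "c.cel"
```

An eSet method built this way would evaluate the condition against the phenotype table and then reuse the existing column subsetting with the resulting sample names.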
>>>>>
>>>>> If for some reason a distinction between phenoData and
>>>>> like-phenoData-but-can't-be-the-same is needed, please do
>>>>> consider the creation of an AnnotatedDataFrame that contains all
>>>>> of them.
>>>>>
>>>>>
>>>>>
>>>>> L.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Patrick Aboyoun wrote:
>>>>>> Laurent,
>>>>>> We had some immediate need for scan date information and rather
>>>>>> than overbuild a system for managing metadata that we may or
>>>>>> may not need, we opted to start simply and then build up as
>>>>>> appropriate. There have been some internal discussions about
>>>>>> managing other metadata along with scan dates, but nothing else
>>>>>> has bubbled to the top yet. Your thoughts and design can help
>>>>>> speed up this process. The class versioning system in Biobase
>>>>>> supports iterative development and we can make further changes
>>>>>> once we lock a design in place. One editorial comment I have is
>>>>>> that lots of designs are possible for a given need and, for
>>>>>> example, the current class properly subsets the scanDates
>>>>>> information using "[" despite not being stored in the phenoData
>>>>>> (AnnotatedDataFrame) slot.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Patrick
>>>>>>
>>>>>>
>>>>>> Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
>>>>>>
>>>>>>> Hi Patrick,
>>>>>>>
>>>>>>> Storing the scan dates is indeed useful information, and it is
>>>>>>> nice to have it offered at the parsing stage.
>>>>>>> However, my first comment would be: does it justify a new slot
>>>>>>> in eSet?
>>>>>>>
>>>>>>> I have been storing scan dates for quite some time now, but opted
>>>>>>> for having them in the phenoData as it made more sense to me,
>>>>>>> from both an implementation standpoint and a practical
>>>>>>> standpoint (as standard extraction of an eset-subset on columns
>>>>>>> with the "[" operator works).
>>>>>>>
>>>>>>> If having something specific for scan dates is really wished for,
>>>>>>> would it make sense to have that by extending
>>>>>>> AnnotatedDataFrame?
>>>>>>>
>>>>>>> In my opinion, the stage at which the data are extracted (in
>>>>>>> this case when parsing the files coming out of the image
>>>>>>> analysis) should not dictate where the data are stored.
>>>>>>> In fact, it might make for a nice(r) workflow if the function
>>>>>>> reading raw array data could return an eSet-inheriting instance
>>>>>>> and a phenoData with information such as dates and file names.
>>>>>>> I am working on a workflow that in fact gets much more data
>>>>>>> from the header (I suppose I will contribute it when I have
>>>>>>> enough time to wrap it up).
>>>>>>>
>>>>>>>
>>>>>>> Just a few thoughts,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> L.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Patrick Aboyoun wrote:
>>>>>>>> Dear Bioconductor developers,
>>>>>>>> The Biocore group has just committed a change to the BioC 2.5
>>>>>>>> code line (Biobase version 2.5.3) to support the use of
>>>>>>>> microarray scan date in statistical analyses by adding a
>>>>>>>> scanDates slot to Biobase's eSet class. This information can
>>>>>>>> be retrieved and set using the new scanDates and
>>>>>>>> scanDates<- functions, respectively. The scanDates slot is
>>>>>>>> designed to hold a character vector of length = # of
>>>>>>>> samples, with one character element for each sample. (See
>>>>>>>> help(scanDates) for more information.)
>>>>>>>>
>>>>>>>> In this first round of check-ins we have added affy support
>>>>>>>> of this new slot to functions like ReadAffy and we will be
>>>>>>>> working towards adding this information to other microarray
>>>>>>>> platforms as well.
>>>>>>>>
>>>>>>>> This change involved bumping the eSet version number from
>>>>>>>> 1.1.0 to 1.2.0 in the Biobase class definition. In order to
>>>>>>>> minimize the impact of this change, the Biobase methods
>>>>>>>> support both the current eSet version 1.2.0 as well as old
>>>>>>>> 1.1.0 serialized objects so updateObject will not be
>>>>>>>> required to be performed on eSet-derived objects prior to
>>>>>>>> use in other functions. We have also tested and version-bumped
>>>>>>>> (and patched where needed) the following packages
>>>>>>>> that create eSet-derived classes to minimize any package
>>>>>>>> build issues: ACME, beadarray, beadarraySNP, cellHTS2,
>>>>>>>> CGHbase, codelink, crlmm, GeneRegionScan, GGBase, maDB,
>>>>>>>> oligoClasses, ontoTools, puma, rMAT, SNPchip, and spkTools.
>>>>>>>>
>>>>>>>> Below is a demonstration of the new functionality. If you
>>>>>>>> encounter any issues related to this change, please e-mail
>>>>>>>> this list so the community can monitor the change.
>>>>>>>>
>>>>>>>> - The Biocore Team
>>>>>>>>
>>>>>>>>
>>>>>>>>> suppressMessages(library(affy))
>>>>>>>>> example(ReadAffy)
>>>>>>>>
>>>>>>>> RdAffy> if(require(affydata)){
>>>>>>>> RdAffy+ celpath <- system.file("celfiles", package="affydata")
>>>>>>>> RdAffy+ fns <- list.celfiles(path=celpath,full.names=TRUE)
>>>>>>>> RdAffy+
>>>>>>>> RdAffy+ cat("Reading files:\n",paste(fns,collapse="\n"),"\n")
>>>>>>>> RdAffy+ ##read a binary celfile
>>>>>>>> RdAffy+ abatch <- ReadAffy(filenames=fns[1])
>>>>>>>> RdAffy+ ##read a text celfile
>>>>>>>> RdAffy+ abatch <- ReadAffy(filenames=fns[2])
>>>>>>>> RdAffy+ ##read all files in that dir
>>>>>>>> RdAffy+ abatch <- ReadAffy(celfile.path=celpath)
>>>>>>>> RdAffy+ }
>>>>>>>> Loading required package: affydata
>>>>>>>> Reading files:
>>>>>>>> /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/binary.cel
>>>>>>>> /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/text.cel
>>>>>>>>> scanDates(abatch)
>>>>>>>> binary.cel text.cel
>>>>>>>> "01/23/04 14:30:57" "08/29/03 15:12:30"
>>>>>>>>> sessionInfo()
>>>>>>>> R version 2.10.0 Under development (unstable) (2009-06-12 r48755)
>>>>>>>> i386-apple-darwin9.6.0
>>>>>>>>
>>>>>>>> locale:
>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>>
>>>>>>>> attached base packages:
>>>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>>>
>>>>>>>> other attached packages:
>>>>>>>> [1] affydata_1.11.6 affy_1.23.2 Biobase_2.5.3
>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>> [1] affyio_1.13.3 preprocessCore_1.7.4 tools_2.10.0
>>>>>>>>
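
Because the scanDates values in the demonstration above are character strings, they need parsing before they can serve as a covariate. The sketch below assumes the strings follow a month/day/two-digit-year layout and uses UTC for lack of time-zone information; both are assumptions read off the printed output, not documented guarantees.

```r
# Values copied from the scanDates(abatch) output in the demonstration.
scan_dates <- c(binary.cel = "01/23/04 14:30:57",
                text.cel   = "08/29/03 15:12:30")

# ASSUMPTION: month/day/year order and UTC; adjust if the scanner
# software writes a different layout or a local time zone.
parsed <- as.POSIXct(scan_dates, format = "%m/%d/%y %H:%M:%S", tz = "UTC")
names(parsed) <- names(scan_dates)

# Calendar day of each scan, usable as a batch-like factor in a model.
scan_day <- factor(format(parsed, "%Y-%m-%d"))

# Samples in chronological scan order.
names(parsed)[order(parsed)]
```

Any string that does not match the assumed format comes back as NA, so checking `anyNA(parsed)` after parsing is a cheap safeguard.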
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at stat.math.ethz.ch mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel