[BioC] getGEO - getting the .CEL files from GEO
tfrayner at gmail.com
Thu Mar 18 12:49:56 CET 2010
On 17 March 2010 16:51, Sean Davis <seandavi at gmail.com> wrote:
> 2010/3/17 Vincent Carey <stvjc at channing.harvard.edu>:
>> do you really want to put sample-characteristics data in a CEL file?
>> the sample characteristics are available as follows:
>> library(GEOquery)
>> ff = getGEO("GSE4045")
>> conventional colorectal tumor, mucinous, Dukes Stage c, MSS,
>> no cancer in the family, male, Distal Location , Tumor Grade 2
>> conventional colorectal tumor, non-mucinous, Dukes Stage b, MSS, no
>> cancer in the family, female, Distal Location , Tumor Grade 2
>> conventional colorectal tumor, non-mucinous, Dukes Stage c, MSI, no
>> cancer in the family, female, Proximal Location , Tumor Grade 3
>> and you will have to parse that 'description' field to extract stage
>> and other relevant information. for example
>> de = as.character(ff[[1]]$description)
>> gr = gsub(".*, Tumor Grade.(.)$", "\\1", de)
>> gives you a single character string for grade, except for sample 14 --
>> where my regexp doesn't do as much as it should.
>> such activities would be used to populate an annotated data frame
>> which could then serve as the phenoData component of an AffyBatch
>> instance, which is a typical container for CEL-based intensity data,
>> to be propagated downstream through background correction and
>> normalization and so forth. The experimentData element should also be
>> suitably populated, as early in the workflow as possible. If we look
>> closely enough we can find that the ExpressionSet returned by getGEO
>> has quantifications generated by MAS 5.0.
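
The parsing step described above can be sketched in base R without downloading anything. This is only an illustration under stated assumptions: the three strings are the example descriptions shown above with the line wrapping removed, extract_field() is a hypothetical helper (not part of GEOquery or Biobase), and the patterns cover only the fields visible in those three samples.

```r
# The three example descriptions quoted above, joined onto single lines.
de <- c(
  "conventional colorectal tumor, mucinous, Dukes Stage c, MSS, no cancer in the family, male, Distal Location , Tumor Grade 2",
  "conventional colorectal tumor, non-mucinous, Dukes Stage b, MSS, no cancer in the family, female, Distal Location , Tumor Grade 2",
  "conventional colorectal tumor, non-mucinous, Dukes Stage c, MSI, no cancer in the family, female, Proximal Location , Tumor Grade 3"
)

# Hypothetical helper: return the first capture group of `pattern`,
# or NA where the pattern does not match. (A bare gsub() returns the
# whole input string on a non-match, which is easy to overlook.)
extract_field <- function(pattern, x) {
  full <- paste0("^.*", pattern, ".*$")
  ifelse(grepl(pattern, x), sub(full, "\\1", x), NA_character_)
}

# One column per clinical variable; the patterns are assumptions based
# only on the three descriptions above.
pheno <- data.frame(
  mucinous = extract_field(", *(non-mucinous|mucinous),", de),
  stage    = extract_field("Dukes Stage ([a-d])", de),
  msi      = extract_field(", *(MSS|MSI),", de),
  sex      = extract_field(", *(male|female),", de),
  location = extract_field("(Proximal|Distal) Location", de),
  grade    = extract_field("Tumor Grade (.)", de),
  stringsAsFactors = FALSE
)
print(pheno)
```

Returning NA on a non-match makes samples whose description deviates from the expected layout easy to spot. With row names set to the sample names, a data frame like this could then be wrapped with Biobase's AnnotatedDataFrame() and used as phenoData.
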
>> On Wed, Mar 17, 2010 at 11:27 AM, 張 語恬 <greengarden_0925 at hotmail.com> wrote:
>>> I've downloaded the GSE CEL files from GEO, but I am having trouble adding the individual characteristics, such as tumor site, age, gender and so on, to the CEL files.
>>> I've read the thread "[BioC] getGEO - getting the .CEL files from GEO", but I still don't understand it.
>>> Could you use GSE4045 as an example to demonstrate
>>> how to use exprs() (I found the instructions on the mailing list) to replace the GSE4045 SOFT data with the raw CEL microarray data while keeping the sample characteristics?
> There are a couple of tricks here that can sometimes be useful to get
> better annotation. In this case, they are not a big improvement.
> The GEO GSE data entity contains information as supplied by the
> submitters. The GDS data entity contains data taken from GSE records
> that have been further curated by GEO staff. Often, that leads to
> more useful annotation than comma-separated lists (although the
> information is usually the same or similar, at least). To give an
> example of how one might learn of the existence of such a GDS given a
> GSE, one can use the GEOmetadb package:
> library(GEOmetadb)
> # Next command will take a minute....
> sqlfile = getSQLiteFile()
> # Check to see if the GSE record has a corresponding
> # GDS record
> This series of commands will result in the following:
> from_acc to_acc
> 1 GSE4045 GDS2201
> So, GSE4045 has been curated by NCBI GEO staff and the accession of
> the curated data is GDS2201. We can check to see what subsets
> (phenotypic variables) are available using GEOmetadb, but we have to
> resort to writing SQL to do so:
> # make a connection to the database
> conn = dbConnect('SQLite',sqlfile)
> dbGetQuery(conn, "select gds_subset.gds,gds_subset.description,gds_subset.type from gds_subset
> where gds='GDS2201'")
> One can use the columnDescriptions() function to get a data.frame of
> columns, tables, and descriptions if writing SQL is necessary. The
> query above returns this small data.frame:
> gds description type
> 1 GDS2201 serrated colerectal carcinoma disease state
> 2 GDS2201 conventional colorectal carcinoma disease state
> So, unfortunately, the GEO staff has annotated only the two different
> types of colorectal carcinoma and not the other clinical variables.
> If this is all you wanted, then you can use getGEO('GDS2201') to get
> the annotations and attach those to the ExpressionSet that you create
> by normalizing the .CEL files of your choosing. If not, then Vince's
> method is the way to go.
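
The gds_subset lookup quoted above can be tried offline against a toy table before downloading the full GEOmetadb SQLite file. This is a minimal sketch, assuming the DBI and RSQLite packages are installed; the in-memory table only mimics the three columns named in the query, the two GDS2201 rows are copied from the output quoted above, and the "GDS0000" row is a made-up placeholder.

```r
library(DBI)

# In-memory stand-in for the GEOmetadb gds_subset table (toy data only).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(
  gds         = c("GDS2201", "GDS2201", "GDS0000"),  # GDS0000 is a placeholder
  description = c("serrated colerectal carcinoma",   # spelling as in the quoted output
                  "conventional colorectal carcinoma",
                  "placeholder subset"),
  type        = c("disease state", "disease state", "other"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "gds_subset", toy)

# Same shape of query as above, with the accession passed as a bound
# parameter instead of being pasted into the SQL string.
subsets <- dbGetQuery(
  con,
  "select gds, description, type from gds_subset where gds = ?",
  params = list("GDS2201")
)
print(subsets)

dbDisconnect(con)
```

Binding the accession as a parameter is a design choice for this sketch; it avoids quoting problems when the same query is run for many accessions.
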
It's also worth noting that ArrayExpress have imported much of the
data from common Affymetrix platforms (and some other platforms) from
GEO. These imported data sets have generally been put through a basic
curation step which does improve the computability of the annotation
somewhat. As a general rule, the ArrayExpress accession for a GEO
series GSENNNN is E-GEOD-NNNN:
library(ArrayExpress)
abatch <- ArrayExpress('E-GEOD-4045')
Not that it makes a huge difference in this case, but this is a pretty
good workaround when a GDS set is not available in GEO.
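
The accession rule above can be written as a small base R helper. This is only a sketch: gse_to_geod() is a hypothetical name, not a function from the ArrayExpress package, and it does not check whether ArrayExpress has actually imported the series in question.

```r
# Hypothetical helper: map GEO series accessions (GSENNNN) to the
# corresponding ArrayExpress accession form (E-GEOD-NNNN).
gse_to_geod <- function(gse) {
  ok <- grepl("^GSE[0-9]+$", gse)
  if (!all(ok)) {
    stop("not a GSE accession: ", paste(gse[!ok], collapse = ", "))
  }
  sub("^GSE", "E-GEOD-", gse)
}

print(gse_to_geod("GSE4045"))
```

The result could then be passed to ArrayExpress() as in the call above.
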
(former AE curator)
Bioinformatician, Smith Lab
CIMR, University of Cambridge