[BioC] getGEO - getting the .CEL files from GEO

Wed Mar 17 17:51:53 CET 2010

2010/3/17 Vincent Carey <stvjc at channing.harvard.edu>:
> do you really want to put sample-characteristics data in a CEL file?
>
> the sample characteristics are available as follows:
>
>  ff = getGEO("GSE4045")
>
>> table(pData(ff[[1]])$descr)
>
>        conventional colorectal tumor, mucinous, Dukes Stage c, MSS,
> no cancer in the family, male, Distal Location , Tumor Grade 2
>
>                                                           1
>  conventional colorectal tumor, non-mucinous, Dukes Stage b, MSS, no
> cancer in the family, female, Distal Location , Tumor Grade 2
>
>                                                           1
> conventional colorectal tumor, non-mucinous, Dukes Stage c, MSI, no
> cancer in the family, female, Proximal Location , Tumor Grade 3
>
>                                                           1
> ....
>
> and you will have to parse that 'description' field to extract stage
> and other relevant information.  for example
>
> de = as.character(ff[[1]]$desc)
> gr = gsub(".*, Tumor Grade.(.)$", "\\1", de)
>
> gives you a single character string for grade, except for sample 14 --
> where my regexp doesn't do as much as it should.
>
> such activities would be used to populate an annotated data frame
> which could then serve as the phenoData component of an AffyBatch
> instance, which is a typical container for CEL-based intensity data,
> to be propagated downstream through background correction and
> normalization and so forth.  The experimentData element should also be
> suitably populated, as early in the workflow as possible.  If we look
> closely enough we can find that the ExpressionSet returned by getGEO
> has quantifications generated by MAS 5.0.
>
> On Wed, Mar 17, 2010 at 11:27 AM, 張 語恬 <greengarden_0925 at hotmail.com> wrote:
>>
>>
>> Hi:
>>
>> I've downloaded the GSE CEL files from GEO, but I have trouble adding the individual characteristics, such as tumor site, age, gender, and so on, to the CEL file.
>>
>> I've read the mail thread "[BioC] getGEO - getting the .CEL files from GEO", but I still don't understand it.
>>
>> Could you use GSE4045 as an example to demonstrate
>> how to use exprs() (I can find the instruction in the mailing list) to replace the GSE4045.SOFT data with the raw CEL microarray data while keeping the sample characteristics?
>>

There are a couple of tricks here that can sometimes be useful to get
better annotation.  In this case, they are not a big improvement.

The GEO GSE data entity contains information as supplied by the
submitters.  The GDS data entity contains data taken from GSE records
that have been further curated by GEO staff.  Often, that leads to
more useful annotation than comma-separated lists (although the
information is usually the same or similar).  To find out whether
such a GDS exists for a given GSE, one can use the GEOmetadb package:

library(GEOmetadb)
# Next command will take a minute....
sqlfile = getSQLiteFile()
# Check to see if the GSE record has a corresponding
# GDS record
geoConvert('GSE4045','gds')

This series of commands will result in the following:

$gds
  from_acc  to_acc
1  GSE4045 GDS2201

So, GSE4045 has been curated by NCBI GEO staff and the accession of
the curated data is GDS2201.  We can check to see what subsets
(phenotypic variables) are available using GEOmetadb, but we have to
resort to writing SQL to do so:

# make a connection to the database
conn = dbConnect(SQLite(), sqlfile)
dbGetQuery(conn, "select gds, description, type
                  from gds_subset
                  where gds = 'GDS2201'")

If writing SQL is necessary, the columnDescriptions() function returns
a data.frame of tables, columns, and their descriptions.  The query
above returns this small data.frame:

      gds                       description          type
1 GDS2201     serrated colerectal carcinoma disease state
2 GDS2201 conventional colorectal carcinoma disease state

So, unfortunately, the GEO staff has annotated only the two different
types of colorectal carcinoma and not the other clinical variables.
If this is all you wanted, then you can use getGEO('GDS2201') to get
the annotations and attach those to the ExpressionSet that you create
by normalizing the .CEL files of your choosing.  If not, then Vince's
method is the way to go.
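
As a rough illustration of that approach, here is a minimal base-R sketch that
splits the comma-separated description strings into one column per field.  The
three description strings are the ones shown in the quoted table() output; the
sample names, the column names, the irregular fourth string, the CEL file names
and the helper parse_descriptions() are all hypothetical and only there to make
the example run on its own.  Real GSE4045 data may need more careful handling
(Vince mentions sample 14 does not fit the pattern).

```r
# Description strings in the layout shown above; names are hypothetical.
de <- c(
  sampleA = "conventional colorectal tumor, mucinous, Dukes Stage c, MSS, no cancer in the family, male, Distal Location , Tumor Grade 2",
  sampleB = "conventional colorectal tumor, non-mucinous, Dukes Stage b, MSS, no cancer in the family, female, Distal Location , Tumor Grade 2",
  sampleC = "conventional colorectal tumor, non-mucinous, Dukes Stage c, MSI, no cancer in the family, female, Proximal Location , Tumor Grade 3",
  sampleD = "conventional colorectal tumor, Tumor Grade 2"  # hypothetical irregular entry
)

# Assumed column names; the source does not name these fields.
col_names <- c("tumor_type", "mucinous", "dukes_stage", "ms_status",
               "family_history", "sex", "location", "grade")

# Hypothetical helper: split on commas, trim whitespace, and leave NA
# for any description that does not have the expected number of fields.
parse_descriptions <- function(de, col_names) {
  parts <- lapply(strsplit(de, ",", fixed = TRUE), trimws)
  ok <- lengths(parts) == length(col_names)
  out <- matrix(NA_character_, nrow = length(de), ncol = length(col_names),
                dimnames = list(names(de), col_names))
  if (any(ok)) out[ok, ] <- do.call(rbind, parts[ok])
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  out$parsed <- ok   # flag rows that need manual inspection
  out
}

pheno <- parse_descriptions(de, col_names)

# Strip the constant labels so only the values remain.
pheno$dukes_stage <- sub("^Dukes Stage ", "", pheno$dukes_stage)
pheno$location    <- sub(" Location$", "", pheno$location)
pheno$grade       <- sub("^Tumor Grade ", "", pheno$grade)

# Reorder the rows to match a (hypothetical) set of CEL file names, so the
# table lines up with the arrays in the order they would be read.
cel_files <- c("sampleC.CEL", "sampleA.CEL", "sampleB.CEL")
ids <- sub("\\.CEL$", "", cel_files, ignore.case = TRUE)
pheno_for_cel <- pheno[match(ids, rownames(pheno)), ]
```

A data.frame like pheno_for_cel is the sort of thing that could then populate
the phenoData of the object built from the CEL files.  Rows flagged with
parsed == FALSE are left as NA rather than guessed at, so they can be fixed by
hand.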

Sean