[BioC] GEOmetadb query to retrieve sample groups
Sean Davis
sdavis2 at mail.nih.gov
Mon Jun 10 18:24:06 CEST 2013
Hi, Tom.
Sorry to take so long to get back to you. See below.
On Thu, Jun 6, 2013 at 11:15 AM, Thomas H. Hampton
<Thomas.H.Hampton at dartmouth.edu> wrote:
> The following getGEO query retrieves data files and metadata for a recent GEO submission of mine,
> one that has been curated:
>
> GDS4252 <- getGEO("GDS4252")
> Columns(GDS4252)
>> str(Columns(GDS4252))
> 'data.frame': 16 obs. of 4 variables:
> $ sample : Factor w/ 16 levels "GSM754979","GSM754980",..: 5 6 7 8 1 2 3 4 13 14 ...
> $ genotype/variation: Factor w/ 2 levels "CFTR mutant",..: 1 1 1 1 1 1 1 1 2 2 ...
> $ agent : Factor w/ 2 levels "PA01","unexposed": 1 1 1 1 2 2 2 2 1 1 ...
>
> The folks at NCBI have correctly created two factors with two levels to describe the 16 samples in my experiment.
>
> I am interested in retrieving similar information using GEOmetadb, but this has proved problematic.
>
> getSQLiteFile(destdir = getwd(), destfile = "GEOmetadb.sqlite.gz")
>
> con <- dbConnect(SQLite(), "GEOmetadb.sqlite")
> dat <- dbGetQuery(con, "select * from gds where gds = 'GDS4252'")
>
>> dat
> [1] ID gds title
> [4] description type pubmed_id
> [7] gpl platform_organism platform_technology_type
> [10] feature_count sample_organism sample_type
> [13] channel_count sample_count value_type
> [16] gse order update_date
> <0 rows> (or 0-length row.names)
>
> It seems, for starters, that this GDS identifier for my particular submission isn't accounted for in the current
> database.
>
> Other identifiers are present, so my query syntax appears to be correct:
>
>> dat <- dbGetQuery(con, "select gds from gds limit 10")
>> dat
> gds
> 1 GDS5
> 2 GDS6
> 3 GDS10
> 4 GDS12
> 5 GDS15
> 6 GDS16
> 7 GDS17
> 8 GDS18
> 9 GDS19
> 10 GDS20
>
>
> There is also the question of where a set of fields (variable in number) describing sample factors and their levels would actually "live"
> in the SQLite database.
It does appear that our update script has a bug: GDS4252 is not
present in the current database, so we'll look into that.
> This information does not seem to be an attribute of the GDS in any case:
You'll want to check out the gds_subset table for details of the GDS groups.
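As a rough sketch of that lookup: the helper name gds_groups below is hypothetical (it is not part of GEOmetadb), and it assumes gds_subset has a gds column to filter on, as the gds table does. It selects every column rather than naming them, so run dbListFields(con, "gds_subset") first to see the actual layout. GDS5 appears in the commented usage only because your query above showed it is present in the database.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper (not part of GEOmetadb): return every gds_subset row
# recorded for one GDS identifier.
# Assumption: gds_subset has a column named "gds".
gds_groups <- function(con, gds_id) {
  # Bind the identifier as a parameter instead of pasting it into the SQL
  dbGetQuery(con,
             "select * from gds_subset where gds = ?",
             params = list(gds_id))
}

# Usage, assuming the unpacked database file is in the working directory:
# con <- dbConnect(SQLite(), "GEOmetadb.sqlite")
# dbListFields(con, "gds_subset")   # inspect the available columns first
# gds_groups(con, "GDS5")
# dbDisconnect(con)
```

A GDS that is missing from the database, as GDS4252 currently is from the gds table, should simply come back with zero rows here.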
>> dat <- dbGetQuery(con, "select fieldname from geodb_column_desc where TableName = 'gds'")
>> dat
> FieldName
> 1 ID
> 2 channel_count
> 3 description
> 4 feature_count
> 5 gds
> 6 order
> 7 platform
> 8 platform_organism
> 9 platform_technology_type
> 10 pubmed_id
> 11 reference_series
> 12 sample_count
> 13 sample_organism
> 14 sample_type
> 15 title
> 16 type
> 17 update_date
> 18 value_type
>
> Nor does it seem to be a feature stored in the samples:
>
>> dat <- dbGetQuery(con, "select fieldname from geodb_column_desc where TableName = 'gsm'")
>> dat
> FieldName
> 1 ID
> 2 channel_count
> 3 characteristics_ch1
> 4 characteristics_ch2
> 5 contact
> 6 data_processing
> 7 data_row_count
> 8 description
> 9 extract_protocol_ch1
> 10 extract_protocol_ch2
> 11 gpl
> 12 gse
> 13 gsm
> 14 hyb_protocol
> 15 label_ch1
> 16 label_ch2
> 17 label_protocol_ch1
> 18 label_protocol_ch2
> 19 last_update_date
> 20 molecule_ch1
> 21 molecule_ch2
> 22 organism_ch1
> 23 organism_ch2
> 24 source_name_ch1
> 25 source_name_ch2
> 26 status
> 27 submission_date
> 28 supplementary_file
> 29 title
> 30 treatment_protocol_ch1
> 31 treatment_protocol_ch2
> 32 type
>
>
> Any advice greatly appreciated.