[BioC] GEOmetadb and gse organism type
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Thu Mar 19 16:08:34 CET 2009
hello Jack, Sean, and others,
i have an observation and a question related to GEOmetadb. the queries
below are made directly in sqlite, but could easily have been done within
R, for example.
the gds (data sets) table contains two fields related to organisms:
sqlite GEOmetadb.sqlite 'pragma table_info(gds)' | grep organism
# 7|platform_organism|TEXT|0||0
# 10|sample_organism|TEXT|0||0
the gse (data series) table does not contain any such field:
sqlite GEOmetadb.sqlite 'pragma table_info(gse)' | grep organism
the gpl (platform) table contains one such field:
sqlite GEOmetadb.sqlite 'pragma table_info(gpl)' | grep organism
# 8|organism|TEXT|0||0
(and likewise for the gsm (sample) table). i would like to retrieve all
data *series* (not data sets) where gene expression in a specific
organism was investigated. in the vignette to the GEOmetadb
Bioconductor package (as well as the GEOtools Matlab one), you give the
following example:
"we would like to find all the human breast cancer-related
Affymetrix gene expression GEO series."
and then you use the gse_gpl table to retrieve the *platform* organism
-- which doesn't seem quite right, because you can hybridize non-human
samples to human arrays, as well as human samples to non-human arrays:
sqlite GEOmetadb.sqlite '
select gds, sample_organism as so, platform_organism as po
from gds
where (so like "%homo%" and po not like "%homo%")
or (so not like "%homo%" and po like "%homo%")'
(and likewise for data series, but the query would be a little bit more
involved.)
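to make the mismatch concrete, here is a minimal sketch of the same filter on a toy in-memory table, using python's sqlite3 module. the rows and the GDS_A..GDS_D identifiers are invented for illustration and are not taken from GEOmetadb; only the column names come from the gds table described above. the helper name organism_mismatches is hypothetical.

```python
import sqlite3

# Toy stand-in for the gds table; rows are invented, not real GEO records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gds (gds TEXT, platform_organism TEXT, sample_organism TEXT)"
)
conn.executemany(
    "INSERT INTO gds VALUES (?, ?, ?)",
    [
        ("GDS_A", "Homo sapiens", "Homo sapiens"),
        ("GDS_B", "Homo sapiens", "Pan troglodytes"),
        ("GDS_C", "Mus musculus", "Homo sapiens"),
        ("GDS_D", "Mus musculus", "Mus musculus"),
    ],
)


def organism_mismatches(conn, pattern="%homo%"):
    """Return data sets where exactly one of sample/platform matches pattern."""
    return conn.execute(
        """
        SELECT gds, sample_organism, platform_organism
        FROM gds
        WHERE (sample_organism LIKE :p AND platform_organism NOT LIKE :p)
           OR (sample_organism NOT LIKE :p AND platform_organism LIKE :p)
        ORDER BY gds
        """,
        {"p": pattern},
    ).fetchall()


mismatches = organism_mismatches(conn)
for row in mismatches:
    print(row)
```

on this toy data, filtering by platform organism alone would keep GDS_B (non-human samples on a human array) and drop GDS_C (human samples on a non-human array).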
the comment: you may want to update the documentation, unless i
misunderstood your intentions.
the question: is there no direct way to check the sample organism for
data series entries in GEO? it would be very useful to filter gse by
sample organism, without having to join the table with gse_gsm and gsm.
you already have it for gds, it would be pretty simple to add an
analogous field to gse. (and likewise for platform organism). surely,
a gse may include more than one organism in the platforms or samples,
but that's what we have in gds already:
sqlite GEOmetadb.sqlite '
select platform_organism as po, sample_organism as so
from gds
where po like "%,%" or so like "%,%"'
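a small sketch of what such multi-organism entries look like and how they could be split, again on invented rows. it assumes the organisms in one field are separated by commas, which is what the query above tests for; the exact formatting in GEOmetadb may differ. split_organisms is a hypothetical helper.

```python
import sqlite3

# Toy gds table with invented rows; some fields list more than one organism.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gds (gds TEXT, platform_organism TEXT, sample_organism TEXT)"
)
conn.executemany(
    "INSERT INTO gds VALUES (?, ?, ?)",
    [
        ("GDS_A", "Homo sapiens", "Homo sapiens"),
        ("GDS_B", "Homo sapiens", "Homo sapiens,Mus musculus"),
        ("GDS_C", "Mus musculus,Rattus norvegicus", "Mus musculus"),
    ],
)

# Same test as in the shell query: a comma in either organism field.
multi = conn.execute(
    """
    SELECT gds, platform_organism, sample_organism
    FROM gds
    WHERE platform_organism LIKE '%,%' OR sample_organism LIKE '%,%'
    ORDER BY gds
    """
).fetchall()


def split_organisms(field):
    """Split a comma-separated organism field into a list of names."""
    return [name.strip() for name in field.split(",") if name.strip()]


sample_sets = {gds: split_organisms(so) for gds, _po, so in multi}
print(sample_sets)
```

note that a pattern such as "%homo%" also matches mixed entries like the sample field of GDS_B, so splitting is only needed when a series should contain one organism exclusively.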
so i would like to be able to
sqlite GEOmetadb.sqlite '
select gds, sample_organism as so, platform_organism as po from gds
where so like "%homo%"
and po like "%homo%"'
rather than
sqlite GEOmetadb.sqlite '
select distinct gse.gse, gsm.
from gse
join gse_gsm on gse.gse = gse_gsm.gse
join gsm on gsm.gsm = gse_gsm.gsm
join gse_gpl on gse.gse = gse_gpl.gse
join gpl on gpl.gpl = gse_gpl.gpl
where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2
like "%homo%")
and gpl.organism like "%homo%"'
after all, gds are backed by gse, so if gds entries can have
sample_organism and platform_organism, why could not gse have them too?
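in the meantime, one possible workaround is a view that derives both fields per gse from the existing tables. the sketch below is only an illustration under stated assumptions: the view name gse_organism is hypothetical, the rows are invented, and the table and column names are the ones used in the queries above (gse, gsm, gpl, gse_gsm, gse_gpl, organism_ch1, organism_ch2, organism). it is not how GEOmetadb populates the gds fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Toy schema using the table/column names from the queries above.
    CREATE TABLE gse (gse TEXT);
    CREATE TABLE gsm (gsm TEXT, organism_ch1 TEXT, organism_ch2 TEXT);
    CREATE TABLE gpl (gpl TEXT, organism TEXT);
    CREATE TABLE gse_gsm (gse TEXT, gsm TEXT);
    CREATE TABLE gse_gpl (gse TEXT, gpl TEXT);

    -- Invented rows: GSE_X has human samples, GSE_Y chimp samples,
    -- both hybridized to a human platform.
    INSERT INTO gse VALUES ('GSE_X'), ('GSE_Y');
    INSERT INTO gpl VALUES ('GPL_H', 'Homo sapiens'), ('GPL_M', 'Mus musculus');
    INSERT INTO gsm VALUES
        ('GSM_1', 'Homo sapiens', NULL),
        ('GSM_2', 'Homo sapiens', 'Homo sapiens'),
        ('GSM_3', 'Pan troglodytes', NULL);
    INSERT INTO gse_gsm VALUES
        ('GSE_X', 'GSM_1'), ('GSE_X', 'GSM_2'), ('GSE_Y', 'GSM_3');
    INSERT INTO gse_gpl VALUES ('GSE_X', 'GPL_H'), ('GSE_Y', 'GPL_H');

    -- Hypothetical view: one row per gse with comma-joined organism lists.
    CREATE VIEW gse_organism AS
    SELECT gse.gse AS gse,
           (SELECT group_concat(DISTINCT o.organism)
              FROM (SELECT gse_gsm.gse AS gse, gsm.organism_ch1 AS organism
                      FROM gse_gsm JOIN gsm ON gsm.gsm = gse_gsm.gsm
                    UNION
                    SELECT gse_gsm.gse, gsm.organism_ch2
                      FROM gse_gsm JOIN gsm ON gsm.gsm = gse_gsm.gsm) AS o
             WHERE o.gse = gse.gse AND o.organism IS NOT NULL
           ) AS sample_organism,
           (SELECT group_concat(DISTINCT gpl.organism)
              FROM gse_gpl JOIN gpl ON gpl.gpl = gse_gpl.gpl
             WHERE gse_gpl.gse = gse.gse
           ) AS platform_organism
    FROM gse;
    """
)

# With the view in place, filtering gse looks just like filtering gds.
rows = conn.execute(
    """
    SELECT gse, sample_organism, platform_organism
    FROM gse_organism
    WHERE sample_organism LIKE '%homo%'
      AND platform_organism LIKE '%homo%'
    ORDER BY gse
    """
).fetchall()
print(rows)
```

the view collects both channels, so a single-channel sample with an empty organism_ch2 still contributes its organism_ch1 value. on the real database the correlated subqueries may be slow without indexes on gse_gsm and gse_gpl.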
regards,
vQ
--
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD
Email: waku at idi.ntnu.no
Phone: +47 73591875, +47 72574609
Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303
Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060