[BioC] GEOmetadb and gse organism type
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Thu Mar 19 21:59:50 CET 2009
Sean Davis wrote:
> On Thu, Mar 19, 2009 at 11:08 AM, Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>
>
>> the question: is there no direct way to check the sample organism for
>> data series entries in GEO? it would be very useful to filter gse by
>> sample organism, without having to join the table with gse_gsm and gsm.
>> you already have it for gds, it would be pretty simple to add an
>> analogous field to gse. (and likewise for platform organism). surely,
>> a gse may include more than one organism in the platforms or samples,
>> but that's what we have in gds already:
>>
>> sqlite GEOmetadb.sqlite '
>> select platform_organism as po, sample_organism as so
>> from gds
>> where po like "%,%" or so like "%,%"'
>>
>> so i would like to be able to
>>
>> sqlite GEOmetadb.sqlite '
>> select gds, sample_organism as so, platform_organism as po from gds
>> where so like "%homo%"
>> and po like "%homo%"'
>>
>> rather than
>>
>> sqlite GEOmetadb.sqlite '
>> select distinct gse.gse
>> from gse
>> join gse_gsm on gse.gse = gse_gsm.gse
>> join gsm on gsm.gsm = gse_gsm.gsm
>> join gse_gpl on gse.gse = gse_gpl.gse
>> join gpl on gpl.gpl = gse_gpl.gpl
>> where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2
>> like "%homo%")
>> and gpl.organism like "%homo%"'
>>
>> after all, gds are backed by gse, so if gds entries can have
>> sample_organism and platform_organism, why could not gse have them too?
>>
>>
>
> A GSE can have samples from multiple platforms and can, therefore, represent
> multiple organisms. That is why doing the join to platform (if you are
> willing to trust that the platform was used for the specified organism) or a
> join to samples is necessary. A GDS, on the other hand, does not have that
> characteristic and represents a set of samples from a single platform.
>
from a single platform, but then these:

select gds, platform_organism from gds where platform_organism like "%,%";
# 7 data sets

indicate mixed-organism platforms, and it is organisms i was originally
interested in.

and, likewise,

select gds, sample_organism from gds where sample_organism like "%,%";
# 8 data sets

some data sets seem to contain data from more than one organism.

if a data set can have multiple organisms in the platform_organism or
sample_organism fields, what is the point in data series not having
organism fields at all? gds2302 has several organisms listed in its
sample_organism field, and gds2349 has several organisms listed in its
platform_organism field; data series could do the same. it wouldn't
have to mean that all samples and all platforms involved in a data
series include material from all of the species listed in the respective
fields of the data series record.

such an update to the database is a matter of a fairly simple statement,
as far as i can see.
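
for illustration, a minimal sketch of what such a statement might look like, run against a toy in-memory database rather than the real GEOmetadb.sqlite. the three-table schema below keeps only the columns used in the queries above and is an assumption, as is the sample data; the sample_organism column on gse is hypothetical (it is the field being proposed), and only organism_ch1 is aggregated, so second-channel organisms are ignored here.

```python
import sqlite3

# Toy stand-in for GEOmetadb.sqlite: only the columns used in this thread.
# The real schema has many more columns; this reduced layout is an assumption.
con = sqlite3.connect(":memory:")
con.executescript("""
create table gse (gse text primary key);
create table gsm (gsm text primary key, organism_ch1 text);
create table gse_gsm (gse text, gsm text);

insert into gse values ('GSE1'), ('GSE2'), ('GSE3');
insert into gsm values
  ('GSM1', 'Homo sapiens'),
  ('GSM2', 'Mus musculus'),
  ('GSM3', 'Homo sapiens'),
  ('GSM4', 'Homo sapiens'),
  ('GSM5', 'Mus musculus');
insert into gse_gsm values
  ('GSE1', 'GSM1'), ('GSE1', 'GSM2'),
  ('GSE2', 'GSM3'), ('GSE2', 'GSM4'),
  ('GSE3', 'GSM5');
""")

# The proposed update: add a (hypothetical) sample_organism column to gse and
# fill it with the distinct organisms of the series' samples, comma-separated,
# in the same style as the existing gds.sample_organism field.
con.executescript("""
alter table gse add column sample_organism text;

update gse set sample_organism = (
  select group_concat(distinct gsm.organism_ch1)
  from gse_gsm
  join gsm on gsm.gsm = gse_gsm.gsm
  where gse_gsm.gse = gse.gse
);
""")

# Map each series to its aggregated organism string.
rows = dict(con.execute("select gse, sample_organism from gse"))

# Filtering by organism now needs no join (like is case-insensitive for ASCII).
hits = [r[0] for r in con.execute(
    "select gse from gse where sample_organism like '%homo%' order by gse")]

print(rows)
print(hits)
```

the order of organisms inside the concatenated field is not guaranteed by group_concat, so it is safer to match with like (as above) than to compare whole strings.
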
best,
vQ