[BioC] GEOmetadb and gse organism type
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Fri Mar 20 16:07:43 CET 2009
Sean Davis wrote:
> On Thu, Mar 19, 2009 at 5:30 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>
>> On Thu, Mar 19, 2009 at 4:59 PM, Wacek Kusnierczyk <
>> Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>>
>>
>>> if a data set can have multiple organisms in the platform_organism or
>>> sample_organism fields, what is the point in data series not having
>>> organism fields at all? just like gds2302 has several organisms listed
>>> in the sample_organism field, and gds2349 has several organisms listed
>>> in the platform_organism field, so could data series have. it wouldn't
>>> have to mean that all samples and all platforms involved in a data
>>> series include material from all of the species listed in the respective
>>> fields of the data series record.
>>>
>>> such an update to the database is a matter of a fairly simple statement,
>>> as far as i can see.
>>>
>>>
>> There are a couple of reasons for not doing so:
>>
>> 1) GEO does not provide such a field, so we do not either. As for GDS and
>> sample_organism and platform_organism, we have parsed those directly from
>> GEO.
>>
>> 2) Doing so violates a rule of database design (it denormalizes the
>> database).
>>
>> Of course, the power of SQL is that you can manipulate the data in your
>> local instance as you like, so you do not have to stick to our ideas of what
>> the best database design is.
>>
>>
>
> Just to follow up, something like this:
>
> select gse.gse,platform_organism,sample_organism
> from
> (select gse.gse,group_concat(gpl.organism,',') as platform_organism
> from gse
> join gse_gpl gg on gg.gse=gse.gse
> join gpl on gg.gpl=gpl.gpl
> group by gse.gse) as o1
> join gse on o1.gse=gse.gse (select gse.gse,group_concat(gsm.organism_ch1,',') || ',' ||
> group_concat(gsm.organism_ch2,',') as sample_organism
> from gse join gse_gsm gs on gs.gse=gse.gse
> join gsm on gsm.gsm=gs.gsm group by gse.gse) as o2 on gse.gse=o2.gse;
>
> could be used to create a table or view similar to what you are asking for.
>
this will give redundant replicates for each series, which is not
necessarily bad, but could be avoided with a few modifications. also,
i'm not sure if you're aware that some series do not have corresponding
samples:
select count(*) from (select distinct gsm from gse_gsm);
-- 288939
select count(*) from (select distinct gsm from gse_gsm natural join
gsm);
-- 268106
which looks like some 20000 entries in gse_gsm do not have a counterpart
in gsm (are these typos in gse_gsm?)
and likewise for platforms:
select count(*) from (select distinct gpl from gpl);
-- 5322
select count(*) from (select distinct gpl from gse_gpl natural join
gpl);
-- 4505
which looks like some 800 entries in gse_gpl do not have a counterpart
in gpl.
if anyone is interested, the following is a statement that will, as far
as i can see, construct a table of the shape i need -- with each gse
entry being a comination of entries from gse and three additional
fields, cumulating the organisms from the paltforms and samples involved
in the series:
create table xgse as
select
gse.*, p.porg as platform_organism,
s1.sorg1 as sample_organism_ch1,
s2.sorg2 as sample_organism_ch2
from gse
natural join (
select gse, group_concat(porg) as porg
from (
select distinct gse.gse as gse, gpl.organism as porg
from gse
left join gse_gpl on gse.gse = gse_gpl.gse
left join gpl on gse_gpl.gpl = gpl.gpl)
group by gse) as p
natural join (
select gse, group_concat(sorg) as sorg1
from (
select distinct gse.gse as gse, gsm.organism_ch1 as sorg
from gse
left join gse_gsm on gse.gse = gse_gsm.gse
left join gsm on gse_gsm.gsm = gsm.gsm)
group by gse) as s1
natural join (
select gse, group_concat(sorg) as sorg2
from (
select distinct gse.gse as gse, gsm.organism_ch2 as sorg
from gse
left join gse_gsm on gse.gse = gse_gsm.gse
left join gsm on gse_gsm.gsm = gsm.gsm)
group by gse) as s2;
and now it's easy to query for series on a platform or with a sample
from a specific organism, with no joins.
vQ
More information about the Bioconductor
mailing list