[BioC] GEOmetadb and gse organism type

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Thu Mar 19 16:08:34 CET 2009

hello Jack, Sean, and others,

i have an observation and a question related to GEOmetadb.  the queries
below are made directly in sqlite, but could easily have been done within
R, for example.

the gds (data sets) table contains two fields related to organisms:

    sqlite GEOmetadb.sqlite 'pragma table_info(gds)' | grep organism
    # 7|platform_organism|TEXT|0||0
    # 10|sample_organism|TEXT|0||0

the gse (data series) table does not contain any such field:

    sqlite GEOmetadb.sqlite 'pragma table_info(gse)' | grep organism

the gpl (platform) table contains one such field:

    sqlite GEOmetadb.sqlite 'pragma table_info(gpl)' | grep organism
    # 8|organism|TEXT|0||0

(and likewise for the gsm (sample) table).  i would like to retrieve all
data *series* (not data sets) where gene expression in a specific
organism was investigated.  in the vignette to the GEOmetadb
Bioconductor package (as well as the GEOtools Matlab one), you give the
following example:

    "we would like to find all the human breast cancer-related
    Affymetrix gene expression GEO series."

and then you use the gse_gpl table to retrieve the *platform* organism
-- which doesn't seem quite right, because you can hybridize non-human
samples to human arrays, as well as human samples to non-human arrays:

    sqlite GEOmetadb.sqlite '
        select gds, sample_organism as so, platform_organism as po
            from gds
            where (so like "%homo%" and po not like "%homo%")
                or (so not like "%homo%" and po like "%homo%")'

(and likewise for data series, but the query would be a little bit more
complicated.)
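
a minimal sketch of what the data-series version of that mismatch check might look like, run from python's sqlite3 module against a toy in-memory database rather than the real GEOmetadb.sqlite.  the table and column names follow the ones used in this message; the accessions and organisms are made up for illustration, and only channel 1 is checked:

```python
import sqlite3

# toy stand-in for GEOmetadb.sqlite: only the columns mentioned in this
# thread, filled with made-up accessions and organisms (assumption)
con = sqlite3.connect(":memory:")
con.executescript("""
    create table gse (gse text);
    create table gsm (gsm text, organism_ch1 text, organism_ch2 text);
    create table gpl (gpl text, organism text);
    create table gse_gsm (gse text, gsm text);
    create table gse_gpl (gse text, gpl text);

    insert into gse values ('GSE1'), ('GSE2');
    insert into gsm values ('GSM1', 'Homo sapiens', null),
                           ('GSM2', 'Macaca mulatta', null);
    insert into gpl values ('GPL1', 'Homo sapiens');
    insert into gse_gsm values ('GSE1', 'GSM1'), ('GSE2', 'GSM2');
    insert into gse_gpl values ('GSE1', 'GPL1'), ('GSE2', 'GPL1');
""")

# series where the channel-1 sample organism and the platform organism
# disagree about being human
mismatch_sql = """
    select distinct gse_gsm.gse, gsm.organism_ch1 as so, gpl.organism as po
        from gse_gsm
            join gsm on gsm.gsm = gse_gsm.gsm
            join gse_gpl on gse_gpl.gse = gse_gsm.gse
            join gpl on gpl.gpl = gse_gpl.gpl
        where (gsm.organism_ch1 like '%homo%'
                and gpl.organism not like '%homo%')
            or (gsm.organism_ch1 not like '%homo%'
                and gpl.organism like '%homo%')
        order by gse_gsm.gse
"""
mismatched = con.execute(mismatch_sql).fetchall()
print(mismatched)
```

note that this pairs every sample of a series with every platform of that series, so for a series spanning several platforms it may report combinations that were never actually hybridized together.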

the comment:  you may want to update the documentation, unless i
misunderstood your intentions.

the question:  is there no direct way to check the sample organism for
data series entries in GEO?  it would be very useful to filter gse by
sample organism, without having to join the table with gse_gsm and gsm. 
you already have it for gds, it would be pretty simple to add an
analogous field to gse.  (and likewise for platform organism).  surely,
a gse may include more than one organism in the platforms or samples,
but that's what we have in gds already:

    sqlite GEOmetadb.sqlite '
        select platform_organism as po, sample_organism as so
            from gds
            where po like "%,%" or so like "%,%"'

so i would like to be able to

    sqlite GEOmetadb.sqlite '
        select gse, sample_organism as so, platform_organism as po
            from gse
            where so like "%homo%"
                and po like "%homo%"'

rather than

    sqlite GEOmetadb.sqlite '
        select distinct gse.gse
            from gse
                join gse_gsm on gse.gse = gse_gsm.gse
                join gsm on gsm.gsm = gse_gsm.gsm
                join gse_gpl on gse.gse = gse_gpl.gse
                join gpl on gpl.gpl = gse_gpl.gpl
            where (gsm.organism_ch1 like "%homo%"
                    and gsm.organism_ch2 like "%homo%")
                and gpl.organism like "%homo%"'

after all, gds are backed by gse, so if gds entries can have
sample_organism and platform_organism, why could not gse have them too?
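
as a rough illustration of how such fields could be derived from what is already in the database, here is a sketch that defines a view with one row per series and comma-separated organism lists, in the style of the gds fields.  it runs from python's sqlite3 module against a toy in-memory database; the view name gse_organism and all the data are hypothetical, nothing like this is claimed to exist in GEOmetadb, and only organism_ch1 is aggregated to keep it short:

```python
import sqlite3

# toy stand-in for GEOmetadb.sqlite with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.executescript("""
    create table gse (gse text);
    create table gsm (gsm text, organism_ch1 text, organism_ch2 text);
    create table gpl (gpl text, organism text);
    create table gse_gsm (gse text, gsm text);
    create table gse_gpl (gse text, gpl text);

    insert into gse values ('GSE1'), ('GSE2'), ('GSE3');
    insert into gsm values ('GSM1', 'Homo sapiens', null),
                           ('GSM2', 'Macaca mulatta', null),
                           ('GSM3', 'Homo sapiens', null),
                           ('GSM4', 'Mus musculus', null);
    insert into gpl values ('GPL1', 'Homo sapiens'),
                           ('GPL2', 'Mus musculus');
    insert into gse_gsm values ('GSE1', 'GSM1'), ('GSE2', 'GSM2'),
                               ('GSE3', 'GSM3'), ('GSE3', 'GSM4');
    insert into gse_gpl values ('GSE1', 'GPL1'), ('GSE2', 'GPL1'),
                               ('GSE3', 'GPL2');

    -- hypothetical view: per-series organism lists, as in the gds table
    create view gse_organism as
        select gse.gse as gse,
               (select group_concat(distinct gsm.organism_ch1)
                    from gse_gsm join gsm on gsm.gsm = gse_gsm.gsm
                    where gse_gsm.gse = gse.gse) as sample_organism,
               (select group_concat(distinct gpl.organism)
                    from gse_gpl join gpl on gpl.gpl = gse_gpl.gpl
                    where gse_gpl.gse = gse.gse) as platform_organism
            from gse;
""")

# the simple filter asked for above, now possible without explicit joins
human_series = con.execute("""
    select gse from gse_organism
        where sample_organism like '%homo%'
            and platform_organism like '%homo%'
        order by gse
""").fetchall()

# series with more than one sample organism show up with a comma
multi_organism = con.execute("""
    select gse from gse_organism
        where sample_organism like '%,%'
        order by gse
""").fetchall()

print(human_series, multi_organism)
```

the order of names inside each comma-separated list is not guaranteed by group_concat, so a like filter is safer here than an exact string comparison.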


Wacek Kusnierczyk, MD PhD

Email: waku at idi.ntnu.no
Phone: +47 73591875, +47 72574609

Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303

Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060
