[BioC] queryGEO fails on GDS files (GEO Datasets)

Peter bioconductor-mailinglist at maubp.freeserve.co.uk
Wed Jan 4 16:27:23 CET 2006


This follows on from a question from Saurin D. Jani, on the list a year ago:

https://stat.ethz.ch/pipermail/bioconductor/2005-January/007405.html

A working example:

library(AnnBuilder)
geo <- GEO()
queryGEO(geo,"GSM107")

This downloads and parses:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM107&targ=self&form=text&view=data

This fails for GEO Datasets (GDS files) like GDS813 (Saurin's example) 
because the equivalent URL isn't accepted.  Instead, NCBI returns an 
HTML page which redirects you to:

http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=813

This page in turn can be used (easily by a human, less easily from code) 
to download the actual GDS file, but only in compressed form:

ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/GDS813.soft.gz

What this means is that at the moment, queryGEO doesn't support GDS 
files.  Even if it did, they are generally large and only available in 
compressed form, so they would need to be decompressed before parsing.
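For illustration, the two URL patterns quoted above could be wrapped in a small helper.  This is only a sketch: geo_url is a made-up name (not part of AnnBuilder), the patterns are copied from the examples in this message, and it is an assumption that they hold for other accessions, or that GPL and GSE records use the same acc.cgi form as GSM.

```r
# Hypothetical helper, not part of AnnBuilder: map a GEO accession to the
# URL it would be fetched from.  Both patterns are taken from the examples
# in this message; that they generalise is an assumption.
geo_url <- function(acc) {
  acc <- toupper(acc)
  if (grepl("^GDS[0-9]+$", acc)) {
    # GEO Datasets: gzip-compressed SOFT file on the FTP site
    sprintf("ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/%s.soft.gz", acc)
  } else if (grepl("^(GSM|GPL|GSE)[0-9]+$", acc)) {
    # Plain-text record from acc.cgi (only GSM is confirmed in this thread)
    sprintf("http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=%s&targ=self&form=text&view=data",
            acc)
  } else {
    stop("unrecognised GEO accession: ", acc)
  }
}
```

With this, geo_url("GDS813") gives the FTP address above and geo_url("GSM107") gives the acc.cgi address from the working example.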

Would it make more sense to provide two separate functions:

Firstly, to download the file (dealing with all possible URLs) and if 
need be decompress it.

Secondly, to parse a GEO file from the provided handle/filename/URL.

This makes sense for other large GEO files like the GPL annotation 
files, as well as the GEO datasets (GDS files).  It seems wasteful and 
slow to download them fresh each time.
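A minimal sketch of how that split might look in base R follows.  The names fetch_geo_file and parse_soft are hypothetical (neither exists in AnnBuilder), and the parser assumes the SOFT text layout in which metadata lines start with ^, ! or # and the data table sits between lines ending in _table_begin and _table_end.  Only the first table in a file is read.

```r
# Step 1 (hypothetical): download once and reuse the local copy afterwards.
fetch_geo_file <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps .gz files intact on Windows
    download.file(url, destfile, mode = "wb")
  }
  destfile
}

# Step 2 (hypothetical): parse a SOFT file that is already on disk.
# gzfile() reads gzip-compressed and uncompressed files alike, so no
# separate decompression step is needed.
parse_soft <- function(path) {
  con <- gzfile(path, "rt")
  on.exit(close(con))
  lines <- readLines(con)

  begin <- grep("_table_begin", lines, fixed = TRUE)
  end   <- grep("_table_end", lines, fixed = TRUE)
  is_marker <- seq_along(lines) %in% c(begin, end)

  # Metadata lines, without the table begin/end markers
  header <- lines[grepl("^[!#^]", lines) & !is_marker]

  # Tab-delimited table between the first pair of markers, if present
  tab <- NULL
  if (length(begin) >= 1 && length(end) >= 1 && end[1] > begin[1] + 1) {
    body <- lines[(begin[1] + 1):(end[1] - 1)]
    tab <- read.delim(text = body, quote = "", check.names = FALSE,
                      stringsAsFactors = FALSE)
  }
  list(header = header, table = tab)
}
```

Usage would be along the lines of parse_soft(fetch_geo_file(url, "GDS813.soft.gz")), so a second call with the same destfile reads the cached copy instead of downloading again.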

Peter


