[BioC] GEOquery package

Sean Davis sdavis2 at mail.nih.gov
Wed Jan 11 21:05:22 CET 2006




On 1/11/06 2:29 PM, "Peter" <bioconductor-mailinglist at maubp.freeserve.co.uk>
wrote:

> Sean Davis wrote:
>> Peter,
>> 
>> I have recently uploaded a new package to bioconductor called GEOquery.
> 
> I've had a little play - very nice work.  Cheers.  Just a few
> queries/questions for you...
> 
> I never did work out how to load the package from the source files, but
> I noticed there is now a Windows binary package on the website...
> 
> http://www.bioconductor.org/packages/bioc/1.8/html/GEOquery.html
> 
> I downloaded the ZIP file and installed it on Windows XP with R 2.1.1
> and got the following warning:
> 
> package 'GEOquery' successfully unpacked and MD5 sums checked
> updating HTML package descriptions
> Warning message:
> no package 'file15658' was found in: packageDescription(i, fields =
> "Title", lib.loc = lib)
> 
> Question One
> ------------
> Is the above "no package" warning important?

I don't know the answer to that one, but I will look into it.

> Question Two
> ------------
> 
>> library(GEOquery)
> Warning message:
> package 'GEOquery' was built under R version 2.3.0
> 
> Does the version of R matter?  I assume R version 2.3.0 is the
> development version of R, as 2.2.1 is the latest official release.

By definition, the development versions of Bioconductor packages are built
to work with the current development version of R.  That said, I venture to
say that most of them will work with relatively recent versions of R,
GEOquery included. 


> Question Three
> --------------
> 
>> gds37 <- getGEO('GDS37', destdir="c:/temp/geo")
> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/GDS37.soft.gz'
> ftp data connection made, file length 132384 bytes
> opened URL
> downloaded 129Kb
> 
> File stored at:
> c:/temp/geo/GDS37.soft.gz
> c:/temp/geo/GDS37.soft.gz
> parsing geodata
> parsing subsets
> ready to return
> 
> Why does it print the file location twice?

Sloppy debugging code that didn't get removed.  Thanks for pointing this
out.

> Question Four
> -------------
> If I repeat the command getGEO, why does it re-download the file?
> 
>> gds37 <- getGEO('GDS37', destdir="c:/temp/geo")
> 
> I would personally have written the getGEO code to check in the
> destination folder for the files GDS37.soft or GDS37.soft.gz and just
> load the local copy if it existed.

I can make that change, yes.
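
A rough sketch of the kind of check Peter describes, for illustration only. The helper name find_local_soft is hypothetical and not part of GEOquery; it assumes the cached file is named after the GEO accession with a .soft or .soft.gz extension, as in the download output above.

```r
# Hypothetical helper (not part of GEOquery): return the path of a cached
# SOFT file for a GEO accession, or NULL if none is found in destdir.
find_local_soft <- function(geo_id, destdir) {
  # Candidate file names, uncompressed first, then gzipped.
  candidates <- file.path(destdir, paste0(geo_id, c(".soft", ".soft.gz")))
  hits <- candidates[file.exists(candidates)]
  if (length(hits) > 0) hits[1] else NULL
}

# Usage sketch (needs GEOquery and network access, so left commented out):
# local <- find_local_soft("GDS37", "c:/temp/geo")
# gds37 <- if (!is.null(local)) getGEO(filename = local) else
#   getGEO("GDS37", destdir = "c:/temp/geo")
```

Only when no local copy exists would the download path be taken; otherwise the file is loaded through the filename argument shown below.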

> I know I should use the following instead:
> 
>> gds37 <- getGEO(filename="c:/temp/geo/gds37.soft.gz")

Obviously what I envisioned....

> 
> Question Five
> -------------
> I like how you have handled converting subset information into phenotype
> data in GDS2eSet.
> 
> Have you considered also parsing the "description" to extract the
> "Alternative Sample Name" and the "Sample Source"?
> 
> As far as I can tell, all the current NCBI GDS files use the same format
> for the description lines:
> 
> "Value for SAMPLENAME: ALTNAME; src: SOURCE"
> 
> On the other hand, this is clearly not a "defined field" and is subject
> to change.  

That is exactly why I don't parse it.  I can ask the folks at GEO whether
this is likely to change or not.

> Maybe automatically parse the lines if and only if they
> follow that format?

That is a possibility.
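
One way the "if and only if" parsing might look, as a minimal sketch in base R. The function name parse_gds_description and the column names are hypothetical, and the regular expression assumes the description layout Peter quotes above ("Value for SAMPLENAME: ALTNAME; src: SOURCE"), which is not a defined field and could change. Lines that do not match are left as NA rather than guessed at.

```r
# Hypothetical helper (not part of GEOquery): split GDS description lines of
# the form "Value for SAMPLENAME: ALTNAME; src: SOURCE" into three fields.
parse_gds_description <- function(desc) {
  pattern <- "^Value for ([^:]+): (.*); src: (.*)$"
  matches <- regmatches(desc, regexec(pattern, desc))
  # Each match holds the full string plus three capture groups; anything
  # else (no match) becomes a row of NAs.
  fields <- t(vapply(matches, function(m) {
    if (length(m) == 4) m[2:4] else rep(NA_character_, 3)
  }, character(3)))
  data.frame(sample = fields[, 1], alt_name = fields[, 2],
             source = fields[, 3], stringsAsFactors = FALSE)
}

# Illustrative input only; these strings are made up, not taken from GDS37.
parse_gds_description(c("Value for GSM123: Liver A; src: Mouse liver",
                        "some free-text description"))
```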


> Thanks again - GEOquery looks like it will be very handy...

Thanks for the feedback.  Keep it coming....

Sean


