[BioC] GEOquery and parsing SOFT files

Mon May 25 21:15:56 CEST 2009

Dear Wacek,

Thank you for the feedback and for pointing this out. Two general remarks:

1. Please include a reproducible example (an R script) so that others can
reproduce what you observed, followed by the output of sessionInfo().

2. Robert Gentleman's book "R Programming for Bioinformatics" (as well 
as many free sources on the web) describes how to profile R code in 
order to see in which functions the CPU time is spent. Based on this, 
you can investigate where to invest developer time for improving the code.
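
As a rough illustration of both remarks, here is a minimal sketch using
R's built-in sampling profiler (Rprof and summaryRprof from the utils
package). The function slow_parse() and its input are hypothetical
stand-ins, not GEOquery code; to profile the real case, the call would
be replaced by getGEO(filename = ...) with the path to the local SOFT
file.

```r
# Hypothetical stand-in for the slow call under investigation.
# It parses tab-separated lines and grows the result one element
# at a time, a common source of quadratic run time in R.
slow_parse <- function(lines) {
  out <- character(0)
  for (ln in lines) {
    fields <- strsplit(ln, "\t", fixed = TRUE)[[1]]
    out <- c(out, fields[1])  # copies the whole vector on every iteration
  }
  out
}

# Made-up input, only so that the script is self-contained.
lines <- sprintf("ID_%d\t%f", seq_len(30000), runif(30000))

# Start the profiler, run the code of interest, stop the profiler.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
result <- slow_parse(lines)
Rprof(NULL)

# Summarise where the time was spent, by function.
prof <- summaryRprof(prof_file)
print(head(prof$by.self))

# Include this output when reporting the problem to the list.
print(sessionInfo())
```

The by.self table lists the functions in which the samples were taken,
and by.total includes time spent in the functions they call, which
usually suggests where to start reading the source.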

Best wishes
      Wolfgang

Wacek Kusnierczyk wrote:
> Hello,
> 
> The getGEO function from GEOquery parses GEO soft files.  With a
> particular GSE file (GSE13638), it took over 15 minutes on my
> not-so-crappy machine to parse the file (a local file, download time
> excluded).  I've written a simple parser in perl, and parsing the same
> file and storing the data in a nested hash/array structure takes ca. 2
> seconds.  I'm pretty sure there is more essential processing done by
> getGEO to organize the data into a GSE object, but still, there seems to
> be an incredibly inefficient implementation underneath.
> 
> I haven't looked at the source code yet, but here's a question:  what is
> the likely reason getGEO is so slow?  Is it the parsing itself, or
> rather wrapping the data into the appropriate structure?  Where should I
> start to look for code to be improved?
> 
> vQ

------------------------------------------------
Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber