[BioC] getGEO

Kevin R. Coombes krc at mdacc.tmc.edu
Fri Nov 3 21:09:50 CET 2006


Slow means REALLY SLOW.

I recently downloaded both a SOFT file and a MINiML file for the same 
data set (27 two-color glass arrays with 23,184 spots per array) from 
GEO.  Reading the SOFT format with GEOquery took more than an 
hour-and-a-half. Unzipping and reading the TSV files in the MINiML 
format took less than 5 minutes. I took the rest of the hour-and-a-half 
to learn how to use the XML package from CRAN to parse the sample 
information out of the accompanying XML file.
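
As a rough illustration of the MINiML route described above, here is a
minimal R sketch, not the exact code from the course notes. It assumes the
XML package from CRAN is installed. The tiny XML document, the namespace
URI, the sample IDs, the file name, and the column names are all made-up
stand-ins modeled on the MINiML layout; a real download will have more
fields and different names.

```r
library(XML)  # CRAN package mentioned above

# Hypothetical stand-in for the XML file that accompanies the TSV tables.
# Element names and namespace are assumptions, not copied from a real file.
miniml <- '<?xml version="1.0" encoding="UTF-8"?>
<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML">
  <Sample iid="GSM0000001"><Title>array 1</Title></Sample>
  <Sample iid="GSM0000002"><Title>array 2</Title></Sample>
</MINiML>'

doc <- xmlParse(miniml, asText = TRUE)

# The document uses a default namespace, so XPath needs an explicit prefix.
ns <- c(m = "http://www.ncbi.nlm.nih.gov/geo/info/MINiML")

# One row of sample information per <Sample> element.
sampleInfo <- data.frame(
  id    = xpathSApply(doc, "//m:Sample", xmlGetAttr, "iid", namespaces = ns),
  title = xpathSApply(doc, "//m:Sample/m:Title", xmlValue, namespaces = ns),
  stringsAsFactors = FALSE
)

# The per-sample data tables are plain tab-separated text without a header
# line, so base R can read them directly. File name and columns are made up.
tsv <- file.path(tempdir(), "GSM0000001-tbl-1.txt")
writeLines(c("1\t0.25", "2\t-1.10"), tsv)
spots <- read.delim(tsv, header = FALSE, col.names = c("ID_REF", "VALUE"))
```

With real data, the same read.delim call would be applied to each of the
unzipped table files in turn, and sampleInfo would be used to match each
table to its sample.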

It may well be that the problem is intrinsic to the SOFT format; I don't 
really know. But I do know that there is a big difference between loading 
data in 5 minutes and loading data in 90 minutes.

For more details of the saga, see Lectures 18 and 19 in the online notes
for a course I'm teaching this semester:
	http://bioinformatics.mdanderson.org/MicroarrayCourse

   -- Kevin Coombes

Sean Davis wrote:
> On Friday 03 November 2006 13:17, Weiwei Shi wrote:
>> hi,
>>
>> I am a newbie using this GEOquery package and have a couple of issues
>> when I used it:
>> 2. When I used
>> getGEO("GSE3210")
>> then there is a kind of connection problem. I am wondering how to correct
>> this?
> 
> You will need to use the version from 1.9 or 2.0 (devel) of Bioconductor.  The 
> URL for GEO changes relatively regularly, which is what causes this problem.
> 
>> 3. When I download the gz file and run it locally,
>> it seems working but ends with a couple of warnings (shown below). Is it
>> ok?
> 
> This has to do with the encoding in the file being incorrect.  This isn't 
> SUPPOSED to happen, but I have seen this also.  You may have to do some 
> simple checks to make sure that you have all the samples represented, etc.
> 
>> 4. BTW, it is really slow, even running from local. Should gunzipping
>> at first help?
> 
> I'm not sure what you mean by slow, but it is parsing a 4.7 million line file, 
> so that may take a while.  Gunzipping will not help significantly--sorry.
> 
> Sean
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


