[BioC] getGEO
Kevin R. Coombes
krc at mdacc.tmc.edu
Fri Nov 3 21:09:50 CET 2006
Slow means REALLY SLOW.
I recently downloaded both a SOFT file and a MINiML file for the same
data set (27 two-color glass arrays with 23,184 spots per array) from
GEO. Reading the SOFT format with GEOquery took more than an
hour-and-a-half. Unzipping and reading the TSV files in the MINiML
format took less than 5 minutes. I took the rest of the hour-and-a-half
to learn how to use the XML package from CRAN to parse the sample
information out of the accompanying XML file.
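
For anyone who wants to try the MINiML route, here is a rough sketch of the two steps described above. It is not the exact code I used. The function names (read_miniml_tables, read_miniml_samples), the "GSM...-tbl-1.txt" file naming, the headerless tab-delimited layout, and the namespace URI are assumptions you should check against your own download. The toy files and toy XML are made up so that the sketch runs on its own.

```r
# Sketch only: file naming, column layout and namespace URI are assumptions.

# Step 1: read every per-sample table from an unpacked MINiML directory.
# Assumes tab-delimited files with no header row, named "GSM<id>-tbl-1.txt".
read_miniml_tables <- function(dir) {
  files <- list.files(dir, pattern = "^GSM.*-tbl-1\\.txt$", full.names = TRUE)
  tables <- lapply(files, read.delim, header = FALSE, quote = "",
                   stringsAsFactors = FALSE)
  names(tables) <- sub("-tbl-1\\.txt$", "", basename(files))
  tables
}

# Step 2: pull sample IDs and titles out of the accompanying XML file.
# Needs the XML package from CRAN; extra arguments go to XML::xmlParse().
read_miniml_samples <- function(xml, ...) {
  doc <- XML::xmlParse(xml, ...)
  # MINiML declares a default namespace, so XPath needs an explicit prefix.
  ns <- c(m = "http://www.ncbi.nlm.nih.gov/geo/info/MINiML")
  data.frame(
    id    = XML::xpathSApply(doc, "//m:Sample", XML::xmlGetAttr, "iid",
                             namespaces = ns),
    title = XML::xpathSApply(doc, "//m:Sample/m:Title", XML::xmlValue,
                             namespaces = ns),
    stringsAsFactors = FALSE
  )
}

# Toy demonstration with two invented two-spot "arrays".
demo_dir <- tempfile("miniml_demo")
dir.create(demo_dir)
writeLines(c("1\t0.52", "2\t-1.10"), file.path(demo_dir, "GSM0001-tbl-1.txt"))
writeLines(c("1\t0.31", "2\t0.87"),  file.path(demo_dir, "GSM0002-tbl-1.txt"))

tables <- read_miniml_tables(demo_dir)
# Take column 2 of each table to build a spots-by-arrays matrix.
values <- sapply(tables, function(tb) tb[[2]])
print(values)

toy_xml <- paste0(
  '<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML">',
  '<Sample iid="GSM0001"><Title>toy array A</Title></Sample>',
  '<Sample iid="GSM0002"><Title>toy array B</Title></Sample>',
  '</MINiML>')

# Only run the XML step when the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  samples <- read_miniml_samples(toy_xml, asText = TRUE)
  print(samples)
}
```

On a real series you would point read_miniml_tables() at the directory where you unpacked the MINiML archive and call read_miniml_samples() on the family XML file, without asText. Which column holds the values you want depends on the submitter, so check the column descriptions in the XML before picking one.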
It may well be that the problem is intrinsic to the SOFT format; I don't
really know. But I do know that there is a big difference between loading
data in 5 minutes and loading data in 90 minutes.
For more details of the saga, you can look at Lectures 18 and 19 at
http://bioinformatics.mdanderson.org/MicroarrayCourse
for the online course notes from a course I'm teaching this semester.
-- Kevin Coombes
Sean Davis wrote:
> On Friday 03 November 2006 13:17, Weiwei Shi wrote:
>> hi,
>>
>> I am a newbie using this GEOquery package and have a couple of issues
>> when I used it:
>> 2. When I used
>> getGEO("GSE3210")
>> then there is a kind of connection problem. I am wondering how to correct
>> this?
>
> You will need to use the GEOquery version from Bioconductor 1.9 or 2.0 (devel). The
> URL for GEO changes relatively regularly, which is what causes this problem.
>
>> 3. When I download the gz file and run it locally,
>> it seems to work but ends with a couple of warnings (shown below). Is it
>> ok?
>
> This has to do with the encoding in the file being incorrect. This isn't
> SUPPOSED to happen, but I have seen this also. You may have to do some
> simple checks to make sure that you have all the samples represented, etc.
>
>> 4. BTW, it is really slow, even running from local. Would gunzipping
>> first help?
>
> I'm not sure what you mean by slow, but it is parsing a 4.7 million line file,
> so that may take a while. Gunzipping will not help significantly--sorry.
>
> Sean
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor