[BioC] Normalization of array data from GEO repository
Markus Schmidberger
schmidb at ibe.med.uni-muenchen.de
Tue Jul 14 11:58:21 CEST 2009
Hi,
have a look at the ArrayExpress (AE) FAQ:
http://www.ebi.ac.uk/microarray/doc/help/faq.html#submitter_FAQ_general
*How much over-lap is there between ArrayExpress and the Gene Expression
Omnibus (GEO)?*
We import data on a weekly basis from GEO (NCBI). As a priority all GEO
experiments which are in GEO datasets on catalogue Affymetrix and
Agilent platforms are imported and we re-curate these before loading
into ArrayExpress. We also import all GSE on these platforms and these
are loaded uncurated if they pass our quality checks (e.g. no corrupt
data files). All experiments imported from GEO have accession numbers in
the format of E-GEOD-n, where n is a number. For more information see
http://www.ebi.ac.uk/microarray/doc/help/GEO_data.html
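A minimal Python sketch of the accession format described above. It assumes that the n in E-GEOD-n is simply the numeric part of the GEO series accession (GSEn); that is a reading of the FAQ text, not something it states explicitly. The function names are made up for illustration and are not part of any package.

```python
import re


def gse_to_arrayexpress(gse_id: str) -> str:
    """Map a GEO series accession such as 'GSE1234' to 'E-GEOD-1234'.

    Assumption: the ArrayExpress number equals the GEO series number.
    """
    match = re.fullmatch(r"GSE(\d+)", gse_id.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a GEO series accession: {gse_id!r}")
    return f"E-GEOD-{int(match.group(1))}"


def is_geo_import(ae_id: str) -> bool:
    """Return True if an ArrayExpress accession has the E-GEOD-n form."""
    return re.fullmatch(r"E-GEOD-\d+", ae_id.strip()) is not None
```

With such a helper one could check a list of GSE accessions against ArrayExpress before deciding which repository to download from.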
I took a closer look at the "HG-U133A" chip type and found an overlap
of more than 90% between the two repositories. In particular, all the
recent experiments are also available in AE. For analyses with R and
Bioconductor, I found the AE file format more convenient to work with.
Best
Markus
James F. Reid wrote:
> Hi,
>
> caveat: this is my understanding and I might be quite wrong.
>
> There is indeed no synchronization between the two databases for lack
> of a common standard (each has its own flavour of MAGE-ML).
> In addition to investigators submitting to both repositories,
> ArrayExpress also imports experiments from GEO according to certain
> criteria. These are prefixed by 'E-GEOD' in the experiment ID.
> Querying ArrayExpress for these returns 5155 such experiments out of a
> total of 8372. GEO contains 12810 Series (experiments), so GEO does
> contain more data I would say.
>
> HTH,
> James.
>
>
> Sean Davis wrote:
>> On Wed, Jul 8, 2009 at 6:16 AM, Joern Toedling
>> <Joern.Toedling at curie.fr> wrote:
>>
>>> Hello,
>>>
>>> just a small addendum: you may also want to have a look at the
>>> ArrayExpress package, which allows the user to retrieve data sets from
>>> the ArrayExpress database at EBI and returns the data in the form of an
>>> AffyBatch, NChannelSet, RGList or the like. Since GEO and ArrayExpress
>>> are regularly synchronized, you may be able to find your data sets of
>>> interest there as well.
>>>
>>
>> Actually, ArrayExpress and GEO are NOT synchronized. There are some
>> overlaps where investigators have submitted to both and for other
>> reasons, but GEO is still the larger of the two and they each contain
>> largely non-overlapping data sets.
>>
>>
>>> Regards,
>>> Joern
>>>
>>>
>>> On Tue, 7 Jul 2009 13:59:19 -0400, Steve Lianoglou wrote
>>>> Hi,
>>>>
>>>> On Jul 7, 2009, at 5:38 AM, Aleš Maver wrote:
>>>>
>>>>> Hi all,
>>>>> I have obtained several GEO Series (GSE) entries from the GEO
>>>>> repository using the getGEO function (GEOquery package).
>>>>> Data obtained in this manner are stored in the ExpressionSet class.
>>>>> The problem is that I don't know how to perform quality control
>>>>> analyses and normalization procedures on ExpressionSet data, because
>>>>> functions like expresso (affy package) work only on AffyBatch
>>>>> objects. Is there anything I am missing?
>>>> Sorry, I've never used the GEOquery package before, so I can't speak
>>>> much to that, but I'd be surprised if there isn't an option to
>>>> return your results as an AffyBatch object, because I'd dare say
>>>> that you can get most of the data from geo in its raw format (eg,
>>>> CEL file or whatever).
>>>>
>>>>> And, does anyone know whether the data in the GEO repository are
>>>>> already normalised or not?
>>>> It depends, sometimes you aren't given the raw files: sometimes the
>>>> data is from a custom array, or I've also seen some datasets
>>>> provided in the post-processed form (already MAS5 normalized, for
>>>> example), but it's been my experience that you can get the raw data
>>>> for most of the experiments you find there.
>>>>
>>>> Also, for array quality assessment, look into the
>>>> arrayQualityMetrics package:
>>>>
>>>> http://www.bioconductor.org/packages/release/bioc/html/arrayQualityMetrics.html
>>>
>>>> Hope that helps,
>>>> -steve
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
--
Dipl.-Tech. Math. Markus Schmidberger
Ludwig-Maximilians-Universität München
IBE - Institut für medizinische Informationsverarbeitung,
Biometrie und Epidemiologie
Marchioninistr. 15, D-81377 Muenchen
URL: http://www.ibe.med.uni-muenchen.de
Mail: Markus.Schmidberger [at] ibe.med.uni-muenchen.de
Tel: +49 (089) 7095 - 4497