[BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data

Gustavo Fernández Bayón gbayon at gmail.com
Thu Jun 28 16:08:37 CEST 2012


Dear Sean and James,  

first of all, I would like to apologize for my late reply. There were a lot of storms yesterday here at Oviedo, and a lot of derived technical problems at my workplace which kept me from accessing the net.  

Thank you for your kind replies. Having read James' answer, I tried to replicate again the problematic commands:

> gpl <- getGEO('GPL13534', destdir='/Users/gbayon/Documents/GEO/')
Using locally cached version of GPL13534 found here:
/Users/gbayon/Documents/GEO//GPL13534.soft

> Meta(gpl)$data_row_count == nrow(Table(gpl))
[1] FALSE

> Meta(gpl)$data_row_count
[1] "485577"

> nrow(Table(gpl))
[1] 143889

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base  
other attached packages:
[1] GEOquery_2.23.5 Biobase_2.16.0 BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] RCurl_1.91-1 XML_3.9-4  



As you can see, the problem was still there. GEOquery, Biobase and BioCGenerics versions are the same than in James' case. R version is 2.15.0 instead of 2.15.1. I see James has executed getGEO() directly, while I employed a cached version of the SOFT file. So, I decided to run previous commands again, but forcing the download:

> gpl <- getGEO('GPL13534')
File stored at:  
/var/folders/rs/6p03vvrs5xjcts16s73025lm0000gn/T//RtmpTSHxXq/GPL13534.soft

> Meta(gpl)$data_row_count == nrow(Table(gpl))
[1] TRUE

> nrow(Table(gpl))
[1] 485577



oO. Now it worked. In order to find if there was any problem with the cached version on disk, I then tried to replicate the experiment again, but this time using the temporary folder soft file.

> gpl2 <- getGEO('GPL13534',destdir='/var/folders/rs/6p03vvrs5xjcts16s73025lm0000gn/T//RtmpTSHxXq/')
Using locally cached version of GPL13534 found here:
/var/folders/rs/6p03vvrs5xjcts16s73025lm0000gn/T//RtmpTSHxXq//GPL13534.soft

> Meta(gpl2)$data_row_count == nrow(Table(gpl2))
[1] TRUE



Ok. Now it worked smoothly. It seems that my cached version could be corrupted. Let me see:

gbayon$ ls -lh
total 538984
-rw-r--r-- 1 gbayon staff 203M 28 jun 15:53 GPL13534.soft
-rw-r--r-- 1 gbayon staff 60M 28 jun 15:54 GPL13534_prev.soft



OMG you are going to kill me… that difference in size… :-O

gbayon$ tail -n 1 GPL13534.soft  
!platform_table_end

gbayon$ tail -n 1 GPL13534_prev.soft  
cg05516328 cg05516328 21692480 ATTCACCAAATAACCTAACAAAATAATCACAAAACACAAAACTCAAAAAC II GCTCTGCTGTCCCGAGCCACTCATGGTGAGCTGCCTCCCTACGAATTCCAGCCGCTGTCG[CG]CCCTTGAGTTCTGTGTCCTGTGACCACCTTGCCAGGCCACTTGGTGAACACGC



Now it makes sense. It seems the cached file I had was not complete. Maybe due to a broken or timed out connection. getGEO() read exactly the number of probes it had, no more, no less.  

I am sorry for wasting your time chasing a GEOquery bug that finally did not exist. Maybe, just a suggestion, it would be nice if it could write a warning message or something if it did not find the '!platform_table_end' string. Just in case another newbie arrives to the same point.

Thank you again for your hints. And for the new, corrected, GPL data I have now.  

Regards,
Gus


---------------------------
Enviado con Sparrow (http://www.sparrowmailapp.com/?sig)


El miércoles 27 de junio de 2012 a las 21:18, Sean Davis escribió:

> Thanks, James, for doing my work for me. ; )
>  
> Sean
>  
>  
> On Wed, Jun 27, 2012 at 12:40 PM, James F. Reid <reidjf at gmail.com (mailto:reidjf at gmail.com)> wrote:
> > Dear Sean and Gustavo,
> >  
> >  
> > I cannot reproduce this error. See below.
> >  
> > On 27/06/12 16:54, Sean Davis wrote:
> > > On Wed, Jun 27, 2012 at 11:38 AM, Gustavo Fernández Bayón
> > > <gbayon at gmail.com (mailto:gbayon at gmail.com)>wrote:
> > >  
> > > > Hi again.
> > > >  
> > > > I would like to add a little bit more of information on this issue. I have
> > > > been debugging inside the parseGSEMatrix() function in GEOquery source
> > > > code. The suspicious NA's appeared when execution arrived to the following
> > > > line:
> > > >  
> > > > ## Apparently, NCBI GEO uses case-insensitive matching
> > > > ## between platform IDs and series ID Refs ???
> > > > dat <- dat[match(tolower(rownames(datamat)),tolower(rownames(dat))),]
> > > >  
> > > >  
> > > >  
> > > > The problem here is that 'datamat' has the correct number of rows, which
> > > > is around 480K, BUT 'dat' doesn't. At a glance, 'datamat' comes from the
> > > > series matrix file while 'dat' comes from the GPL.
> > > >  
> > > > If you go to the GEO page of that GPL (
> > > > http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=djaxxiayqmwyspu&acc=GPL13534),
> > > > you'll find it says that the GPL decryption table has exactly 485577 rows,
> > > > which is kind of logical, a description for each probeset. However, inside
> > > > the code, 'dat' has only 143889 rows.
> > > >  
> > > > Replicating directly from R console:
> > > >  
> > > > > gpl <- getGEO('GPL13534',destdir='../../GEO/')
> > > > > Meta(gpl)$data_row_count
> > > >  
> > > > [1] "485577"
> > > >  
> > > > > t <- Table(gpl)
> > > > > dim(t)
> > > >  
> > > > [1] 143889 37
> > > >  
> > > >  
> > > >  
> > > > I was really surprised to find this, and I do not have enough knowledge as
> > > > to know if it responds to an unknown constraint I happen to ignore. Is that
> > > > ok? Or is there any bug in the GPL processing code? Now I'm going home, but
> > > > I'll try to continue debugging to see what is really happening inside.
> > >  
> > > This is most likely a bug in GPL parsing. There are A LOT of edge cases
> > > that I have tried to deal with, some not very appropriately. Often, the
> > > error is due to an extraneous quote in an unexpected location. I'll look
> > > into this one. Could you do me a favor and send along sessionInfo() just
> > > so I know?
> > >  
> > > Thanks,
> > > Sean
> > >  
> > >  
> > >  
> > > > Any help will be very much appreciated.
> > > >  
> > > > Regards,
> > > > Gus
> > > >  
> > > >  
> > > > ---------------------------
> > > > Enviado con Sparrow (http://www.sparrowmailapp.com/?sig)
> > > >  
> > > >  
> > > > El miércoles 27 de junio de 2012 a las 10:51, Gustavo Fernández Bayón
> > > > escribió:
> > > >  
> > > > > Hi everybody.
> > > > >  
> > > > > I am experiencing quite a few problems while trying to download and
> > > > parse a dataset of methylation values. These are not technical problems,
> > > > IMHO. GEOquery works perfectly, and it really makes getting this kind of
> > > > data an easy task. However, I think I do not understand exactly the
> > > > lifecycle of GEO series data, and I would like to ask in this list for any
> > > > hint on this behavior, so I could try to fix it.
> > > > >  
> > > > > What I first did was to download and parse the desired GSE data file,
> > > > with the default value of GSMMatrix parameter (TRUE). Besides, I extracted
> > > > the ExpressionSet and the assayData I was looking for.
> > > > >  
> > > > > my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/')
> > > > > my.expr.set <- my.gse[[1]]
> > > > > beta.values <- exprs(my.expr.set)
> > > > >  
> > > > > What really gave me a surprise at first, was to see many strange values
> > > > (all containing the 'NA' string) in the featureNames of the expression set.
> > > > >  
> > > > > > head(featureNames(es), n=20)
> > > > > [1] "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2" "NA.3"
> > > > > [8] "NA.4" "cg00000363" "NA.5" "NA.6" "NA.7" "NA.8" "cg00000734"
> > > > > [15] "NA.9" "cg00000807" "cg00000884" "NA.10" "NA.11" "NA.12"
> > > > >  
> > > > >  
> > > > >  
> > > > > If I select an individual GSM in the series, and download it, the
> > > > featureNames are ok. If I try to download the GSE with GSEMatrix=FALSE, I
> > > > get a list of GSM data sets, and the results is again good. This made me
> > > > suspect of the intermediate, pre-parsed, matrix form. I haven't found a
> > > > clue about the lifecycle of this kind of data. I mean, how the matrix is
> > > > built. Is it a manual process? Is it automatic?
> > > > >  
> > > > > If it is a manual process, then I guess I will have to contact the
> > > > responsible of uploading the data to see if they can fix it. But, if it is
> > > > not, I would like to know if this is something relating to BioC or, more
> > > > plausibly, to GEO.
> > > > >  
> > > > > Any help would be appreciated.
> > > > >  
> > > > > Regards,
> > > > > Gustavo
> > > > >  
> > > > >  
> > > > > ---------------------------
> > > > > Enviado con Sparrow (http://www.sparrowmailapp.com/?sig)
> > > >  
> > > >  
> > > > _______________________________________________
> > > > Bioconductor mailing list
> > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > Search the archives:
> > > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >  
> > >  
> > > [[alternative HTML version deleted]]
> > >  
> > >  
> > >  
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >  
> >  
> >  
> > library(GEOquery)
> >  
> > my.gse <- getGEO('GSE30870', destdir=".")
> >  
> > featureNames(my.gse[[1]])[1:10]
> > # [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
> > # [6] "cg00000289" "cg00000292" "cg00000321" "cg00000363" "cg00000622"
> > all(featureNames(my.gse[[1]]) == rownames(exprs(my.gse[[1]])))
> > #[1] TRUE
> >  
> > gpl <- getGEO('GPL13534',destdir=".")
> > Meta(gpl)$data_row_count == nrow(Table(gpl))
> > # [1] TRUE
> >  
> >  
> > > sessionInfo()
> > R version 2.15.1 (2012-06-22)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> >  
> > locale:
> > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
> > [7] LC_PAPER=C LC_NAME=C
> > [9] LC_ADDRESS=C LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
> >  
> > attached base packages:
> > [1] stats graphics grDevices utils datasets tools methods
> > [8] base
> >  
> > other attached packages:
> > [1] GEOquery_2.23.5 Biobase_2.16.0 BiocGenerics_0.2.0
> >  
> > loaded via a namespace (and not attached):
> > [1] RCurl_1.91-1 XML_3.9-4
> >  
> >  
> > HTH,
> > J.
>  



More information about the Bioconductor mailing list