[BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data

Wed Jun 27 17:38:07 CEST 2012

Hi again.  

I would like to add a bit more information on this issue. I have been debugging inside the parseGSEMatrix() function in the GEOquery source code. The suspicious NAs appear when execution reaches the following line:

## Apparently, NCBI GEO uses case-insensitive matching
## between platform IDs and series ID Refs ???
dat <- dat[match(tolower(rownames(datamat)),tolower(rownames(dat))),]

The problem is that 'datamat' has the correct number of rows, around 480K, but 'dat' does not. As far as I can tell, 'datamat' comes from the series matrix file, while 'dat' comes from the GPL.
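
To make the failure mode concrete, here is a minimal base-R sketch with made-up toy data (the probe IDs, values and the names 'aligned' and 'missing.ids' are hypothetical, not taken from GEOquery). It only illustrates the general R behaviour: when the annotation table lacks some IDs, match() returns NA for them, and indexing a data frame with NA yields rows named "NA", "NA.1", and so on.

```r
# Toy stand-ins for the two objects in parseGSEMatrix(); all values are made up.
# 'datamat' plays the series matrix (4 probes), 'dat' a truncated GPL table (2 probes).
datamat <- matrix(c(0.1, 0.5, 0.9, 0.3), ncol = 1,
                  dimnames = list(c("cg01", "cg02", "cg03", "cg04"), "GSM1"))
dat <- data.frame(ID  = c("CG02", "cg04"),
                  CHR = c("1", "7"),
                  row.names = c("CG02", "cg04"),
                  stringsAsFactors = FALSE)

# Same case-insensitive alignment as the line quoted above.
idx <- match(tolower(rownames(datamat)), tolower(rownames(dat)))
aligned <- dat[idx, ]

print(idx)                 # NA marks probes absent from 'dat'
print(rownames(aligned))   # absent probes get row names "NA", "NA.1", ...

# Probes present in the matrix but absent from the annotation table.
missing.ids <- rownames(datamat)[is.na(idx)]
print(missing.ids)
```

Running the last three lines against the real 'datamat' and 'dat' inside the debugger would list exactly which probes the GPL table is missing.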

If you go to the GEO page for that GPL (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=djaxxiayqmwyspu&acc=GPL13534), it says that the GPL description table has exactly 485577 rows, which makes sense: one description per probeset. Inside the code, however, 'dat' has only 143889 rows.

Replicating directly from R console:

>gpl <- getGEO('GPL13534',destdir='../../GEO/')
>Meta(gpl)$data_row_count
[1] "485577"

>t <- Table(gpl)
>dim(t)
[1] 143889 37

I was really surprised to find this, and I do not know enough to tell whether it reflects some constraint I am unaware of. Is this expected, or is there a bug in the GPL processing code? I am going home now, but I will keep debugging to see what is really happening inside.
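
For anyone who wants to repeat the comparison, here is a small sketch of it as a function. The helper name 'check.row.count' is hypothetical and not part of GEOquery, and the toy table merely stands in for Table(gpl). It says nothing about why rows are missing, only how many.

```r
# Hypothetical helper (not part of GEOquery): compare the row count a platform
# declares in its metadata with the number of rows actually parsed.
check.row.count <- function(declared, tbl) {
  declared <- as.integer(declared)   # the metadata value arrives as a string
  parsed <- nrow(tbl)
  list(declared = declared,
       parsed   = parsed,
       missing  = declared - parsed,
       complete = identical(declared, parsed))
}

# Toy data standing in for Meta(gpl)$data_row_count and Table(gpl).
toy.table <- data.frame(ID = c("cg01", "cg02", "cg03"), stringsAsFactors = FALSE)
res <- check.row.count("5", toy.table)
print(res$missing)
print(res$complete)

# With the real objects from the session above, the call would be:
# check.row.count(Meta(gpl)$data_row_count, Table(gpl))
```

With the numbers reported above (485577 declared, 143889 parsed), the difference is 341688 rows.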

Any help will be very much appreciated.

Regards,
Gus


On Wednesday, 27 June 2012 at 10:51, Gustavo Fernández Bayón wrote:

> Hi everybody.  
>  
> I am running into quite a few problems while trying to download and parse a dataset of methylation values. These are not technical problems, IMHO: GEOquery works perfectly and makes getting this kind of data an easy task. However, I think I do not fully understand the lifecycle of GEO series data, and I would like to ask this list for any hints about the behavior described below, so that I can try to fix it.
>  
> The first thing I did was download and parse the desired GSE data file with the default value of the GSEMatrix parameter (TRUE). Then I extracted the ExpressionSet and the assayData I was looking for.
>  
> my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/')
> my.expr.set <- my.gse[[1]]
> beta.values <- exprs(my.expr.set)
>  
> What surprised me at first was seeing many strange values (all containing the string 'NA') in the featureNames of the expression set.
>  
> > head(featureNames(my.expr.set), n=20)
> [1] "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2" "NA.3"  
> [8] "NA.4" "cg00000363" "NA.5" "NA.6" "NA.7" "NA.8" "cg00000734"
> [15] "NA.9" "cg00000807" "cg00000884" "NA.10" "NA.11" "NA.12"
>  
>  
>  
> If I select an individual GSM in the series and download it, the featureNames are fine. If I download the GSE with GSEMatrix=FALSE, I get a list of GSM data sets, and the results are again good. This made me suspect the intermediate, pre-parsed matrix form. I have not found any information about the lifecycle of this kind of data, that is, how the matrix is built. Is it a manual process? Is it automatic?
>  
> If it is a manual process, then I guess I will have to contact whoever uploaded the data to see if they can fix it. If it is not, I would like to know whether this is related to BioC or, more plausibly, to GEO.
>  
> Any help would be appreciated.
>  
> Regards,
> Gustavo
>  
>  