[BioC] GEOquery and GEO issues
Christian.Stratowa@vie.boehringer-ingelheim.com
Christian.Stratowa at vie.boehringer-ingelheim.com
Mon Jan 23 11:18:15 CET 2006
Dear Sean
While trying to find a parser for the GEO soft files I encountered your
GEOquery package, which works great.
Nevertheless, I would like to mention three issues which might be of
general interest:
1, Memory problems:
I first downloaded the file 'GSE2109_family.soft.gz' from GEO (due to our
proxy settings I cannot use getGEO for this purpose) and then imported it
into R with:
gse2109 <- getGEO(filename='GSE2109_family.soft.gz')
Although I succeeded in importing the file into R, it took 39.3 hours on a
64-bit Opteron machine with 16 GB RAM and used 9.7 GB RAM. The final .Rdata
file has a size of 2.0 GB.
Maybe a future version of GEOquery could reduce both time and memory
consumption.
2, Non-unique GEO platforms:
I have also downloaded our own CLL dataset 'GSE2466_family.soft.gz', where
we had to use both the Affymetrix HGU95A and HGU95Av2 chips. In my personal
opinion it is a serious flaw of the GEO database that it declares both chips
as a single platform, GPL91.
In your description of the GEOquery package, chapter 4.3 Converting GSE to
an exprSet, you supply code to make sure that all of the GSMs are from the
same platform (see my small function below).
Unfortunately, this is not sufficient in this case (and probably for other
Affymetrix chips where two versions exist).
Even though the Sample_data_row_count is different (12625 vs 12626), cbind
simply recycles the rows.
In this case I could test whether Sample_data_row_count is identical for all
chips, but in theory different chip versions could still have the same
number of probe sets.
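Since row counts alone cannot distinguish chip versions, a stricter test
would compare the probe identifiers themselves. What follows is only a
minimal sketch in base R, assuming the ID_REF column of each sample has
already been extracted into a named list. The helper same_probe_ids and the
toy identifiers are hypothetical and not part of GEOquery:

```r
# Hypothetical helper (not part of GEOquery): for each sample, report whether
# its probe identifiers match those of the first sample exactly, in the same
# order. as.character() guards against identifiers stored as factors.
same_probe_ids <- function(id_list) {
  ref <- as.character(id_list[[1]])
  vapply(id_list,
         function(ids) identical(as.character(ids), ref),
         logical(1))
}

# Toy data standing in for the ID_REF columns of three samples
ids <- list(
  GSM_a = c("p1", "p2", "p3"),
  GSM_b = c("p1", "p2", "p3"),
  GSM_c = c("p1", "p2", "p4")   # same row count, different probe set
)

res <- same_probe_ids(ids)
bad <- names(res)[!res]         # samples that do not match the first one
if (length(bad) > 0) {
  message("Samples with deviating probe IDs: ", paste(bad, collapse = ", "))
}
```

With a real series, the list could presumably be built with
lapply(GSMList(gse), function(x) Table(x)$ID_REF), using the same accessors
as the attached function.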
One possibility would be that GEO forces the submitters not only to supply
Sample_platform_id, but
also a "Sample_platform_title" which would contain the name of the chip as
given by the manufacturer.
3, Sample descriptions:
Since most data are useless without the sample description, which contains
the clinical data, it would be helpful if GEO defined a standard format for
adding the clinical data, so that it would be possible to write a parser to
extract these data automatically into a table.
Best regards
Christian
Attached function:
#---------------------------------------------------------------------#
table4GEO <- function(gse, column = "VALUE", lg2 = TRUE) {
   # (c) Christian Stratowa   created: 01/19/2006   last modified: 01/19/2006
   # Get sample table of columns "column" for GEO Series GSExxxx
   #    gse:    GEOquery class imported from GEO GSE file GSExxxx_family.soft
   #            (or soft.gz)
   #    column: name of column to be extracted from data table
   #    lg2:    if TRUE, return log2-transformed values

   # load libraries
   library(Biobase)
   library(GEOquery)

   # get list of samples (GSMs)
   gsm <- GSMList(gse)

   # check number of platforms (must be one platform only)
   tmp <- unlist(lapply(gsm, function(x) Meta(x)$platform))
   if (length(unique(tmp)) != 1) {
      stop("Data must belong to one platform ID only!")
   }

   # number of samples
   size <- length(tmp)
   print(paste("Number of samples:", size))

   # check if all samples have the chosen column
   tmp <- unlist(lapply(gsm, function(x) which(Columns(x)[, 1] == column)))
   if (length(tmp) != size) {
      stop(paste("Only <", length(tmp), "> of <", size,
                 "> samples have column ", column))
   }

   # get "column" from all chips; note that cbind recycles shorter columns
   data <- do.call("cbind", lapply(gsm, function(x) Table(x)[, column]))
   rownames(data) <- Table(gsm[[1]])$ID_REF

   if (lg2) {
      data <- log2(data)
   }

   return(data)
}#table4GEO
==============================================
Christian Stratowa, PhD
Boehringer Ingelheim Austria
Dept NCE Lead Discovery - Bioinformatics
Dr. Boehringergasse 5-11
A-1121 Vienna, Austria
Tel.: ++43-1-80105-2470
Fax: ++43-1-80105-2782
email: christian.stratowa at vie.boehringer-ingelheim.com