[BioC] topTable (fit) annotation

Freudenberg, Johannes (NIH/NIEHS) [E] johannes.freudenberg at nih.gov
Wed Aug 31 18:38:29 CEST 2011


Hi Jing, 

I'm not quite sure I understand your question.  Why can't you use the data for further analysis?  Maybe the issue is that getGEO() returns a list? Have you tried something like:

> gse16962 <- getGEO("GSE16962")
> eset <- gse16962[[1]]
> head(fData(eset))
                 ID GB_ACC SPOT_ID Species.Scientific.Name Annotation.Date
1007_s_at 1007_s_at U48705    <NA>            Homo sapiens    Mar 11, 2009
1053_at     1053_at M87338    <NA>            Homo sapiens    Mar 11, 2009
117_at       117_at X51757    <NA>            Homo sapiens    Mar 11, 2009
121_at       121_at X69699    <NA>            Homo sapiens    Mar 11, 2009
1255_g_at 1255_g_at L36861    <NA>            Homo sapiens    Mar 11, 2009
1294_at     1294_at L13852    <NA>            Homo sapiens    Mar 11, 2009
              Sequence.Type                 Sequence.Source
1007_s_at Exemplar sequence Affymetrix Proprietary Database
1053_at   Exemplar sequence                         GenBank
117_at    Exemplar sequence Affymetrix Proprietary Database
121_at    Exemplar sequence                         GenBank
1255_g_at Exemplar sequence Affymetrix Proprietary Database
1294_at   Exemplar sequence                         GenBank 

Etc.

--Johannes



-----Original Message-----
From: Jing Huang [mailto:huangji at ohsu.edu] 
Sent: Wednesday, August 31, 2011 12:21 PM
To: Davis, Sean (NIH/NCI) [E]
Cc: bioconductor at r-project.org
Subject: Re: [BioC] topTable (fit) annotation

Thank YOU Sean for responding my question. I am not sure where I should add the platform annotation in.

Here are what I observed:

If I extract "GDS2162" by typing in

> gds=getGEO("GDS2162")
> eset=GDS2eSet(gds,do.log2=T)

>eset

ExpressionSet (storageMode: lockedEnvironment)
assayData: 45101 features, 16 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM67339 GSM67343 ... GSM67352 (16 total)
  varLabels: sample genotype/variation agent description
  varMetadata: labelDescription
featureData
  featureNames: 1415670_at 1415671_at ... AFFX-TrpnX-M_at (45101 total)
  fvarLabels: ID Gene.title ... GO.Component.1 (21 total)
      
  fvarMetadata: Column labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 16237459
Annotation:  

Most of annotation is in.



If I extract "GSE16962" by typing in

>gse=getGEO("GSE16962")

> gse
$GSE16962_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 12 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM424759 GSM424760 ... GSM424770 (12 total)
  varLabels: title geo_accession ... data_row_count (34 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
  fvarLabels: ID GB_ACC ... Gene.Ontology.Molecular.Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL570 

It looks to me it includes annotation package such as GO term....
But I can't use this eset data to do further analysis (such as fit table) what I need.

If I extract ("GSE16962") by typing in

>gse=getGEO("GSE16962", GSEMatrix=F)
 
Then following GEOquery package, I can generate eset2, which I can use to do analysis what I need. But the eset2 looks like this:

> eset2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 12 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM424759 GSM424760 ... GSM424770 (12 total)
  varLabels: samples
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:  



Here are the scripts that I use to generate eset2:

>probesets <- Table(GPLList(gse)[[1]])$ID  data.matrix <- 
>do.call("cbind", lapply(GSMList(gse), function(x) {
+ tab <- Table(x)
+ mymatch <- match(probesets,tab$ID_REF)
+ return(tab$VALUE[mymatch])
+ }))
> data.matrix <- apply(data.matrix, 2, function(x) {
+ as.numeric(as.character(x))
+ })
> require(Biobase)
> rownames(data.matrix) <- probesets
> colnames(data.matrix) <- names(GSMList(gse)) pdata <- 
> data.frame(samples=names(GSMList(gse)))
> rownames(pdata) <- names(GSMList(gse)) pheno <- 
> as(pdata,"AnnotatedDataFrame")
> eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)


At which step, I should add

>gplannot = getGEO("GPL96", AnnotGPL=TRUE)


Many Many thanks

Jing















On 8/30/11 6:01 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:

>Hi, Jing.
>
>NCBI GEO maintains two types of GPL records.  The normal variant is 
>just supplied by the submitter.  However, when a GEO Series is curated 
>by NCBI GEO into a GEO DataSet (GDS), they create a so-called 
>"Annotation GPL".  These have a relatively standard set of columns.  I 
>have not made the change to GEOquery yet to grab this annotation GPL 
>when getting Series Matrix files.  But, you can get them yourself by
>specifying:
>
>gplannot = getGEO("GPL96", AnnotGPL=TRUE)
>
>You can always replace the feature data of the ExpressionSets with the 
>information in the retrieved Annotation GPL.
>
>I hope that is clear.
>
>Sean
>
>
>On Tue, Aug 30, 2011 at 5:01 PM, Jing Huang <huangji at ohsu.edu> wrote:
>> Dear All members,
>>
>> I have been extracting data from GEO (GEO package) and do some 
>>analysis on them by using limma package. What I discover is the 
>>components of
>>topTable(fit) are different from the dataset GDS and GSE.
>>
>> If the data is from GDS, then the colnames of topTable (fit) looks 
>>like this.
>>
>>> colnames(topTable(fit))
>> [1] "ID"                    "Gene.title"            "Gene.symbol"
>>  [4] "Gene.ID"               "UniGene.title"         "UniGene.symbol"
>>  [7] "UniGene.ID"            "Nucleotide.Title"      "GI"
>> [10] "GenBank.Accession"     "Platform_CLONEID"      "Platform_ORF"
>> [13] "Platform_SPOTID"       "Chromosome.location"
>>"Chromosome.annotation"
>> [16] "GO.Function"           "GO.Process"            "GO.Component"
>> [19] "GO.Function.1"         "GO.Process.1"          "GO.Component.1"
>> [22] "CTRL"                  "HIF1a"                 "HIF2a"
>> [25] "HIF1a2a"               "AveExpr"               "F"
>> [28] "P.Value"               "adj.P.Val"
>>
>> If the data is from GSE, then the   colnames of topTable(fit) looks
>>like this:
>>
>>>colnames(topTable(fit)
>>
>> [1] "ID"        "mir210"    "CTRL2"     "AveExpr"   "F"
>>"P.Value"   "adj.P.Val"
>>
>> I am trying to add some term into this table by doing following one 
>>by
>>one: the data is generated by Affymetrix human U133 platform:
>>
>>>Library(hgu133plus2.db)
>>>x=hgu133plus2SYMBOL
>>>y=topTable(fit)
>>>y$SYMBOL=unlist(as.list(x[y$ID]))
>>
>> It works but I need to add ENTREZID,SYMBOL,CHR, CHRloc, and GO 
>>annotations as well.  I like to have the topTable more like the
>>topTable(fit) generated at top by data GEO GDS data
>>
>> I am wondering if there is an easy way to annotate all once.
>>
>> In addition, I am having a trouble to annotate GO term.
>>
>> Many Thanks
>>
>> Jing
>>
>>
>>
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>>http://news.gmane.org/gmane.science.biology.informatics.conductor
>>

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list