[Bioc-devel] Making GOstats pathway analysis using data from both a and b chip s in one go.

Tue Aug 22 17:52:38 CEST 2006

Hi,
   Basically the problem with combining data from multiple chips is that 
you need to be pretty careful about duplicate probes for single genes 
(since GO maps at the EntrezGene level that is the level of resolution 
you have to work with and double counting some genes and not others is a 
pretty bad idea) and next it is getting the Universe right.  This is 
also incredibly important, and is straightforward for a single chip (and 
that is what has been done in GOstats) and less straightforward for 
anything else.  You will need to do some programming to create your own 
Universe and solve a few other problems (the GOHyper has a universe 
argument that should work, let us know if it does not).

  Finally as Wolfgang recently said, this approach (hypergeometric 
testing) is probably not as good as Gene Set Enrichment (there are two 
Bioc packages for doing this, one is called Category the other is called 
sigPathway

  best wishes
    Robert

Peder.Worning at astrazeneca.com wrote:
> Dear Bioconductor list,
> 
> I have been using the GOstats package to make pathway analysis of Affymetrix
> data from the mouse MOE430a chips, and that works fine. Now I want to make
> an analysis which includes the data from both the a and b chips in one go is
> that possible?
> 
> My problem comes when I use the GOHyperG function where I have to make
> either lib="moe430a" or lib="moe430b". Is there a way to incude annotation
> from both chips in the same analysis?
> 
> Here is the R code I have used:
> ****************************************************************************
> ************************************** 
> library(GOstats)
> library(moe430a)
> library(moe430b)
> library
> 
> indat <- read.delim("ASM+CSM+12X2+LPS_A.SAMROC.ann.molcl.logP.tab.txt",
> header = T, sep = "\t", blank.lines.skip = T)
> indata <- indat[!(substr(as.character(indat[,1]),1,3) %in% "AFF"),] # remove
> AFF probesets
> 
> newdata_ASMup_01_pr02 <- indata[indata[,10] >= 2 & indata[,2] >=0.2,]
> 
> # GO pathway analysis from here
> 
> ############################################################################
> #############################################
> newdata <- newdata_ASMup_01_pr02 # Here is the dataset chosen everything
> from here must be repeated for a new dataset ####
> newprobeset <- as.character(newdata[,1])
> newprobeset_a <- as.character(newdata[newdata[,61]=="A" ,1])
> newprobeset_b <- as.character(newdata[newdata[,61]=="B" ,1])
> 
> gNll_a = as.character(unlist(mget(newprobeset_a, moe430aLOCUSID)))
> gNsLL_a <- unique(unlist(mget(newprobeset_a, env=moe430aLOCUSID,
> ifnotfound=NA)))
> gNll_b = as.character(unlist(mget(newprobeset_b, moe430bLOCUSID)))
> gNsLL_b <- unique(unlist(mget(newprobeset_b, env=moe430bLOCUSID,
> ifnotfound=NA)))
> 
> gNll <- c(gNll_a,gNll_b)
> gNsLL <- unique(c(gNsLL_a,gNsLL_b))
> 
> 
> # The three GO ontologies using only data from the a-chip
> gGhyp_BP <- GOHyperG(gNsLL_a, lib="moe430a", what="BP")
> gGO_BP <- makeGOGraph(gNll_a, Ontology="BP")
> 
> gGhyp_CC <- GOHyperG(gNsLL_a, lib="moe430a", what="CC")
> gGO_CC <- makeGOGraph(gNll_a, Ontology="CC")
> 
> gGhyp_MF <- GOHyperG(gNsLL_a, lib="moe430a", what="MF")
> gGO_MF <- makeGOGraph(gNll_a, Ontology="MF")
> 
> gGhyp_BP.pv <- gGhyp_BP$pv[nodes(gGO_BP)]
> gg_BP = sort(gGhyp_BP.pv[gGhyp_BP.pv < 0.05])
> 
> gGhyp_CC.pv <- gGhyp_CC$pv[nodes(gGO_CC)]
> gg_CC = sort(gGhyp_CC.pv[gGhyp_CC.pv < 0.05])
> 
> gGhyp_MF.pv <- gGhyp_MF$pv[nodes(gGO_MF)]
> gg_MF = sort(gGhyp_MF.pv[gGhyp_MF.pv < 0.05])
> 
> gg_BP.terms <- getGOTerm(names(gg_BP))[["BP"]]
> gg_CC.terms <- getGOTerm(names(gg_CC))[["CC"]]
> gg_MF.terms <- getGOTerm(names(gg_MF))[["MF"]]
> 
> ggt_BP = as.character(unlist(gg_BP.terms))
> numCh_BP = nchar(ggt_BP)
> ggt2_BP = substr(ggt_BP, 1, 50)
> ggt3_BP = paste("BP: ",ggt2_BP, ifelse(numCh_BP > 50, "..", ""), sep="")
> 
> ggt_CC = as.character(unlist(gg_CC.terms))
> numCh_CC = nchar(ggt_CC)
> ggt2_CC = substr(ggt_CC, 1, 50)
> ggt3_CC = paste("CC: ",ggt2_CC, ifelse(numCh_CC > 50, "..", ""), sep="")
> 
> ggt_MF = as.character(unlist(gg_MF.terms))
> numCh_MF = nchar(ggt_MF)
> ggt2_MF = substr(ggt_MF, 1, 50)
> ggt3_MF = paste("MF: ",ggt2_MF, ifelse(numCh_MF > 50, "..", ""), sep="")
> 
> 
> ##get counts
> gg_BP.counts <- gGhyp_BP$intCounts[names(gg_BP)]
> gg_CC.counts <- gGhyp_CC$intCounts[names(gg_CC)]
> gg_MF.counts <- gGhyp_MF$intCounts[names(gg_MF)]
> gg_BP.allcounts <- gGhyp_BP$goCounts[names(gg_BP)]
> gg_CC.allcounts <- gGhyp_CC$goCounts[names(gg_CC)]
> gg_MF.allcounts <- gGhyp_MF$goCounts[names(gg_MF)]
> 
> ggMat_BP = matrix(c(names(gg_BP.terms), ggt3_BP, signif(gg_BP,3),
> gg_BP.counts, gg_BP.allcounts),
>     byrow=FALSE, nc=5, dimnames=list(1:length(gg_BP), c("GO ID",
>    "Term", "p-value","No of Genes","Total number")))
> 
> ggMat_CC = matrix(c(names(gg_CC.terms), ggt3_CC, signif(gg_CC,3),
> gg_CC.counts, gg_CC.allcounts),
>     byrow=FALSE, nc=5, dimnames=list(1:length(gg_CC), c("GO ID",
>    "Term", "p-value","No of Genes","Total number")))
> 
> ggMat_MF = matrix(c(names(gg_MF.terms), ggt3_MF, signif(gg_MF,3),
> gg_MF.counts, gg_MF.allcounts),
>     byrow=FALSE, nc=5, dimnames=list(1:length(gg_MF), c("GO ID",
>    "Term", "p-value","No of Genes","Total number")))
> 
> newMat <- rbind(ggMat_BP,ggMat_CC,ggMat_MF)
> 
> This is what I have done and it works beautifully, but how do I make the
> analysis using both chips, because making the analysis seaprate of the A and
> the B chip data doesn't make sense. It means that the B chip data is somehow
> useless in this context. I have the same problem with my human data.
> 
> Some technical details: R is version 2.3.1 and GOstats is version 1.6.0 and
> the annotation files are moe430a_1.12.0 and moeb430_1.12.0
> 
> Sincerly yours
> 
> Peder
> 
> ________________________________________________
> Peder Worning, PhD
> 
> Senior Scientist, Bioinformatics
> AstraZeneca R&D Lund
> S-221 87 Lund, Sweden
> 
> Tel:     +46 46 33 72 93
> Fax:    +46 46 33 71 64
> Email:  peder.worning at astrazeneca.com
> 
> _______________________________________________
> Bioc-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org