[Bioc-devel] Making GOstats pathway analysis using data from both a and b chip s in one go.
Robert Gentleman
rgentlem at fhcrc.org
Tue Aug 22 17:52:38 CEST 2006
Hi,
Basically the problem with combining data from multiple chips is that
you need to be pretty careful about duplicate probes for single genes
(since GO maps at the EntrezGene level that is the level of resolution
you have to work with and double counting some genes and not others is a
pretty bad idea) and next it is getting the Universe right. This is
also incredibly important, and is straightforward for a single chip (and
that is what has been done in GOstats) and less straightforward for
anything else. You will need to do some programming to create your own
Universe and solve a few other problems (the GOHyper has a universe
argument that should work, let us know if it does not).
Finally as Wolfgang recently said, this approach (hypergeometric
testing) is probably not as good as Gene Set Enrichment (there are two
Bioc packages for doing this, one is called Category the other is called
sigPathway
best wishes
Robert
Peder.Worning at astrazeneca.com wrote:
> Dear Bioconductor list,
>
> I have been using the GOstats package to make pathway analysis of Affymetrix
> data from the mouse MOE430a chips, and that works fine. Now I want to make
> an analysis which includes the data from both the a and b chips in one go is
> that possible?
>
> My problem comes when I use the GOHyperG function where I have to make
> either lib="moe430a" or lib="moe430b". Is there a way to incude annotation
> from both chips in the same analysis?
>
> Here is the R code I have used:
> ****************************************************************************
> **************************************
> library(GOstats)
> library(moe430a)
> library(moe430b)
> library
>
> indat <- read.delim("ASM+CSM+12X2+LPS_A.SAMROC.ann.molcl.logP.tab.txt",
> header = T, sep = "\t", blank.lines.skip = T)
> indata <- indat[!(substr(as.character(indat[,1]),1,3) %in% "AFF"),] # remove
> AFF probesets
>
> newdata_ASMup_01_pr02 <- indata[indata[,10] >= 2 & indata[,2] >=0.2,]
>
> # GO pathway analysis from here
>
> ############################################################################
> #############################################
> newdata <- newdata_ASMup_01_pr02 # Here is the dataset chosen everything
> from here must be repeated for a new dataset ####
> newprobeset <- as.character(newdata[,1])
> newprobeset_a <- as.character(newdata[newdata[,61]=="A" ,1])
> newprobeset_b <- as.character(newdata[newdata[,61]=="B" ,1])
>
> gNll_a = as.character(unlist(mget(newprobeset_a, moe430aLOCUSID)))
> gNsLL_a <- unique(unlist(mget(newprobeset_a, env=moe430aLOCUSID,
> ifnotfound=NA)))
> gNll_b = as.character(unlist(mget(newprobeset_b, moe430bLOCUSID)))
> gNsLL_b <- unique(unlist(mget(newprobeset_b, env=moe430bLOCUSID,
> ifnotfound=NA)))
>
> gNll <- c(gNll_a,gNll_b)
> gNsLL <- unique(c(gNsLL_a,gNsLL_b))
>
>
> # The three GO ontologies using only data from the a-chip
> gGhyp_BP <- GOHyperG(gNsLL_a, lib="moe430a", what="BP")
> gGO_BP <- makeGOGraph(gNll_a, Ontology="BP")
>
> gGhyp_CC <- GOHyperG(gNsLL_a, lib="moe430a", what="CC")
> gGO_CC <- makeGOGraph(gNll_a, Ontology="CC")
>
> gGhyp_MF <- GOHyperG(gNsLL_a, lib="moe430a", what="MF")
> gGO_MF <- makeGOGraph(gNll_a, Ontology="MF")
>
> gGhyp_BP.pv <- gGhyp_BP$pv[nodes(gGO_BP)]
> gg_BP = sort(gGhyp_BP.pv[gGhyp_BP.pv < 0.05])
>
> gGhyp_CC.pv <- gGhyp_CC$pv[nodes(gGO_CC)]
> gg_CC = sort(gGhyp_CC.pv[gGhyp_CC.pv < 0.05])
>
> gGhyp_MF.pv <- gGhyp_MF$pv[nodes(gGO_MF)]
> gg_MF = sort(gGhyp_MF.pv[gGhyp_MF.pv < 0.05])
>
> gg_BP.terms <- getGOTerm(names(gg_BP))[["BP"]]
> gg_CC.terms <- getGOTerm(names(gg_CC))[["CC"]]
> gg_MF.terms <- getGOTerm(names(gg_MF))[["MF"]]
>
> ggt_BP = as.character(unlist(gg_BP.terms))
> numCh_BP = nchar(ggt_BP)
> ggt2_BP = substr(ggt_BP, 1, 50)
> ggt3_BP = paste("BP: ",ggt2_BP, ifelse(numCh_BP > 50, "..", ""), sep="")
>
> ggt_CC = as.character(unlist(gg_CC.terms))
> numCh_CC = nchar(ggt_CC)
> ggt2_CC = substr(ggt_CC, 1, 50)
> ggt3_CC = paste("CC: ",ggt2_CC, ifelse(numCh_CC > 50, "..", ""), sep="")
>
> ggt_MF = as.character(unlist(gg_MF.terms))
> numCh_MF = nchar(ggt_MF)
> ggt2_MF = substr(ggt_MF, 1, 50)
> ggt3_MF = paste("MF: ",ggt2_MF, ifelse(numCh_MF > 50, "..", ""), sep="")
>
>
> ##get counts
> gg_BP.counts <- gGhyp_BP$intCounts[names(gg_BP)]
> gg_CC.counts <- gGhyp_CC$intCounts[names(gg_CC)]
> gg_MF.counts <- gGhyp_MF$intCounts[names(gg_MF)]
> gg_BP.allcounts <- gGhyp_BP$goCounts[names(gg_BP)]
> gg_CC.allcounts <- gGhyp_CC$goCounts[names(gg_CC)]
> gg_MF.allcounts <- gGhyp_MF$goCounts[names(gg_MF)]
>
> ggMat_BP = matrix(c(names(gg_BP.terms), ggt3_BP, signif(gg_BP,3),
> gg_BP.counts, gg_BP.allcounts),
> byrow=FALSE, nc=5, dimnames=list(1:length(gg_BP), c("GO ID",
> "Term", "p-value","No of Genes","Total number")))
>
> ggMat_CC = matrix(c(names(gg_CC.terms), ggt3_CC, signif(gg_CC,3),
> gg_CC.counts, gg_CC.allcounts),
> byrow=FALSE, nc=5, dimnames=list(1:length(gg_CC), c("GO ID",
> "Term", "p-value","No of Genes","Total number")))
>
> ggMat_MF = matrix(c(names(gg_MF.terms), ggt3_MF, signif(gg_MF,3),
> gg_MF.counts, gg_MF.allcounts),
> byrow=FALSE, nc=5, dimnames=list(1:length(gg_MF), c("GO ID",
> "Term", "p-value","No of Genes","Total number")))
>
> newMat <- rbind(ggMat_BP,ggMat_CC,ggMat_MF)
>
> This is what I have done and it works beautifully, but how do I make the
> analysis using both chips, because making the analysis seaprate of the A and
> the B chip data doesn't make sense. It means that the B chip data is somehow
> useless in this context. I have the same problem with my human data.
>
> Some technical details: R is version 2.3.1 and GOstats is version 1.6.0 and
> the annotation files are moe430a_1.12.0 and moeb430_1.12.0
>
> Sincerly yours
>
> Peder
>
> ________________________________________________
> Peder Worning, PhD
>
> Senior Scientist, Bioinformatics
> AstraZeneca R&D Lund
> S-221 87 Lund, Sweden
>
> Tel: +46 46 33 72 93
> Fax: +46 46 33 71 64
> Email: peder.worning at astrazeneca.com
>
> _______________________________________________
> Bioc-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
More information about the Bioc-devel
mailing list