[BioC] goseq analysis - extracting list of my genes which are DE in the enriched GO category
Iain Gallagher
iaingallagher at btopenworld.com
Fri Sep 9 15:59:16 CEST 2011
Hi Helen
Here's how I did exactly what you want (with added symbol and logFC goodness)
I was studying cows hence the bosTau4 and using ENSEBML gene ids so hence the ensGene. You can adapt this all you like though.
These are the libraries I loaded upfront (i.e. before the edgeR analysis from which the following is snipped):
library(edgeR)
library(org.Bt.eg.db)#cows!
library(goseq)
library(GO.db)
library(annotate)
#this calculates a vector that indicates whether each gene is regulated (at FDR 1% in my case) or not. It's fed to the nullp function of goseq (see below)
# e.g. 0 0 1 1 0 0 1 # two genes not regulated, next two are, next two not etc
genes <- as.integer(p.adjust(lrtFiltered$table$p.value, method = "BH") < 0.01)#FDR = 1%
#...GO ANALYSIS...#
pwf=nullp(genes,'bosTau4','ensGene') # genes = the regulated genes
goCats <- goseq(pwf, 'bosTau4','ensGene', test.cats=("GO:BP"))
sigCats <- goCats[which(goCats[,2] < 0.05),] # getting the sig categories
#let's annotate the GO categories
cats <- sigCats$category
terms <- stack(lapply(mget(cats, GOTERM, ifnotfound=NA), Term))
#write out GO data
sigCats$Term <- with(sigCats, terms$values[match(terms$ind, sigCats$category)] )
#get the genes in each category
allGos <- stack(getgo(rownames(topGenes$table), 'bosTau4', 'ensGene')) # so here I pull the GO terms for every gene that is regulated.
#add the terms
onlySigCats <- allGos[allGos$values %in% sigCats$category,]
onlySigCats$Term <- with( onlySigCats, terms$value[match(onlySigCats$values, terms$ind)] )
#add the gene symbol
onlySigCats$Symbol <- with( onlySigCats, annot2[,2][match(onlySigCats$ind, rownames(annot2) )] )
#add the gene logFC
onlySigCats$logFC <- with( onlySigCats, topGenes$table$logFC[match(onlySigCats$ind, rownames(topGenes$table) )] )
You can then write out this table.
HTH
Best
Iain
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C
[3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.utf8
[7] LC_PAPER=en_GB.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] annotate_1.30.0 GO.db_2.5.0 goseq_1.4.0
[4] geneLenDataBase_0.99.7 BiasedUrn_1.04 org.Bt.eg.db_2.5.0
[7] RSQLite_0.9-4 DBI_0.2-5 AnnotationDbi_1.14.1
[10] Biobase_2.12.2 edgeR_2.2.5
loaded via a namespace (and not attached):
[1] biomaRt_2.8.1 Biostrings_2.20.3 BSgenome_1.20.0
[4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 grid_2.13.1
[7] IRanges_1.10.6 lattice_0.19-33 limma_3.8.3
[10] Matrix_0.9996875-3 mgcv_1.7-6 nlme_3.1-102
[13] RCurl_1.6-9 rtracklayer_1.12.4 tools_2.13.1
[16] XML_3.4-2 xtable_1.5-6
>
--- On Fri, 9/9/11, Helen Wright <hlwright.uk at gmail.com> wrote:
> From: Helen Wright <hlwright.uk at gmail.com>
> Subject: [BioC] goseq analysis - extracting list of my genes which are DE in the enriched GO category
> To: bioconductor at r-project.org
> Date: Friday, 9 September, 2011, 14:30
> Hello everyone
>
> This is my first message to Bioconductor, I hope someone
> out there can
> help me. I have been working through the goseq vignette
> and adapting
> it to my own dataset. I have got my list of enriched go
> categories,
> which looks like this:
>
> > head(GO.wall)
> category
> over_represented_pvalue
> under_represented_pvalue
> 3217 GO:0006954 1.496746e-10
> 1
> 3548 GO:0008009 4.502492e-10
> 1
> 9081 GO:0042379 1.010877e-09
> 1
> 2160 GO:0005126 1.033292e-09
> 1
> 2159 GO:0005125 1.534272e-09
> 1
> 4251 GO:0009611 1.983311e-09
> 1
>
>
> I can get a list of the genes which are contained in each
> GO category using :
> > ensembl.gene.ids=get("GO:0006954",
> org.Hs.egGO2ALLEGS)
> > unlist(mget(unique(ensembl.gene.ids),
> org.Hs.egGENENAME))
>
> but this just gives me a list of ALL the genes attached to
> this GO
> category. I would like to find out which of my DE genes
> are in this
> GO category.
>
>
> I have referred to a previous Bioconductor post
> <https://stat.ethz.ch/pipermail/bioconductor/attachments/20110308/92b27df4/attachment.pl>
> which shows how to get a list of all the genes in each GO
> category,
> like this:
>
> > data(genes)
> > genes2go=getgo(names(genes),'hg19','ensGene')
> > go2genes=goseq:::reversemapping(genes2go)
>
> Here is an extract from go2genes:
> > go2genes$'GO:0006954'
> [1] "ENSG00000088812" "ENSG00000078747"
> "ENSG00000172216"
> "ENSG00000196839" "ENSG00000243646" "ENSG00000128271"
> "ENSG00000100014" "ENSG00000128335" "ENSG00000100292"
> "ENSG00000128284"
> [11] "ENSG00000240972" "ENSG00000117215"
> "ENSG00000050628"
> "ENSG00000187554" "ENSG00000188257" "ENSG00000159377"
> "ENSG00000117525" "ENSG00000116288" "ENSG00000158769"
> "ENSG00000160712"
> [21] "ENSG00000135744" "ENSG00000117335"
> "ENSG00000163435"
> "ENSG00000177807" "ENSG00000117586" "ENSG00000078900"
> "ENSG00000178585" "ENSG00000168297" "ENSG00000134070"
> "ENSG00000144802"
> [31] "ENSG00000113889" "ENSG00000172936"
> "ENSG00000074416"
> "ENSG00000085276" "ENSG00000121807" "ENSG00000157017"
> "ENSG00000233276" "ENSG00000175040" "ENSG00000072274"
> "ENSG00000132170"
>
> > class(go2genes)
> [1] "list"
>
>
> I have my list of genes in a data.frame called "pwf" which
> looks like this"
> > head(pwf)
> DEgenes
> bias.data pwf
> ENSG00000000003 0 1563.5 0.003486092
> ENSG00000000005 0 1353.0 0.003287926
> ENSG00000000419 0 1079.5 0.002955906
> ENSG00000000457 0 3128.5 0.003943449
> ENSG00000000460 0 3126.0 0.003943445
> ENSG00000000938 0 2437.0 0.003896895
>
> The number 1 in pwf$DEgenes indicates a differentially
> expressed gene.
>
>
> So what (I think) I need to do is:
> 1) get the GOTERM (e.g. GO:0006954) from GO.wall$category
> 2) look in go2genes for $'GO:0006954' and get the list of
> all genes
> associated with this GOTERM
> 3) look for these genes in rownames(pwf), and if
> pwf$DEgenes==1 then
> print GO.wall$category, GO.wall$over_represented_pvalue,
> rownames(pwf)
>
>
> I have no idea if this is possible. The sort of output I
> would expect
> would be something like (GOTERM, p-value, list of genes
> from this
> category enriched in my dataset):
>
> GO:0006954 1.496746e-10
> "ENSG00000078747" "ENSG00000177807" "ENSG00000117586"
> "ENSG00000078900" "ENSG00000175040" "ENSG00000072274"
> "ENSG00000132170"
>
>
> Thank you for taking the time to read my email. I hope
> someone out
> there will be able to help me.
>
> Many thanks and have a nice weekend everybody.
>
> Helen Wright
> University of Liverpool, UK
>
>
>
> > sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> locale:
> [1] C/en_US.UTF-8/C/C/C/C
> attached base packages:
> [1] stats graphics grDevices utils datasets
> methods base
> other attached packages:
> [1] GO.db_2.5.0 org.Hs.eg.db_2.5.0
> RSQLite_0.9-4
> DBI_0.2-5 AnnotationDbi_1.14.1
> edgeR_2.2.5
> goseq_1.4.0 geneLenDataBase_0.99.7
> [9] BiasedUrn_1.04 DESeq_1.4.1
> locfit_1.5-6
> lattice_0.19-30 akima_0.5-4
> Biobase_2.12.2
> Rsamtools_1.4.3 Biostrings_2.20.3
> [17] GenomicFeatures_1.4.4 GenomicRanges_1.4.8
> IRanges_1.10.6
> loaded via a namespace (and not attached):
> [1] BSgenome_1.20.0 Matrix_0.999375-50
> RColorBrewer_1.0-5
> RCurl_1.6-10 XML_3.4-2
> annotate_1.30.1 biomaRt_2.8.1
> genefilter_1.34.0 geneplotter_1.30.0
> [10] grid_2.13.1 limma_3.8.3
> mgcv_1.7-6
> nlme_3.1-101 rtracklayer_1.12.4 splines_2.13.1
> survival_2.36-9 tools_2.13.1 xtable_1.5-6
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list