[BioC] goseq analysis - extracting list of my genes which are DE in the enriched GO category

Fri Sep 9 15:59:16 CEST 2011

Hi Helen

Here's how I did exactly what you want (with added symbol and logFC goodness)

I was studying cows hence the bosTau4 and using ENSEBML gene ids so hence the ensGene. You can adapt this all you like though.

These are the libraries I loaded upfront (i.e. before the edgeR analysis from which the following is snipped):

library(edgeR)
library(org.Bt.eg.db)#cows!
library(goseq)
library(GO.db)
library(annotate)

#this calculates a vector that indicates whether each gene is regulated (at FDR 1% in my case) or not. It's fed to the nullp function of goseq (see below)

# e.g. 0 0 1 1 0 0 1 # two genes not regulated, next two are, next two not etc

genes <- as.integer(p.adjust(lrtFiltered$table$p.value, method = "BH") < 0.01)#FDR = 1%

#...GO ANALYSIS...#
pwf=nullp(genes,'bosTau4','ensGene') # genes = the regulated genes

goCats <- goseq(pwf, 'bosTau4','ensGene', test.cats=("GO:BP"))

sigCats <- goCats[which(goCats[,2] < 0.05),] # getting the sig categories

#let's annotate the GO categories
cats <- sigCats$category

terms <- stack(lapply(mget(cats, GOTERM, ifnotfound=NA), Term))

#write out GO data
sigCats$Term <- with(sigCats, terms$values[match(terms$ind, sigCats$category)] )

#get the genes in each category
allGos <- stack(getgo(rownames(topGenes$table), 'bosTau4', 'ensGene')) # so here I pull the GO terms for every gene that is regulated.

#add the terms
onlySigCats <- allGos[allGos$values %in% sigCats$category,]
onlySigCats$Term <- with( onlySigCats, terms$value[match(onlySigCats$values, terms$ind)] )

#add the gene symbol
onlySigCats$Symbol <- with( onlySigCats, annot2[,2][match(onlySigCats$ind, rownames(annot2) )] )

#add the gene logFC
onlySigCats$logFC <- with( onlySigCats, topGenes$table$logFC[match(onlySigCats$ind, rownames(topGenes$table) )] )

You can then write out this table.

HTH

Best

Iain

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8    
 [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8   
 [7] LC_PAPER=en_GB.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] annotate_1.30.0        GO.db_2.5.0            goseq_1.4.0           
 [4] geneLenDataBase_0.99.7 BiasedUrn_1.04         org.Bt.eg.db_2.5.0    
 [7] RSQLite_0.9-4          DBI_0.2-5              AnnotationDbi_1.14.1  
[10] Biobase_2.12.2         edgeR_2.2.5           

loaded via a namespace (and not attached):
 [1] biomaRt_2.8.1         Biostrings_2.20.3     BSgenome_1.20.0      
 [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8   grid_2.13.1          
 [7] IRanges_1.10.6        lattice_0.19-33       limma_3.8.3          
[10] Matrix_0.9996875-3    mgcv_1.7-6            nlme_3.1-102         
[13] RCurl_1.6-9           rtracklayer_1.12.4    tools_2.13.1         
[16] XML_3.4-2             xtable_1.5-6         
> 

--- On Fri, 9/9/11, Helen Wright <hlwright.uk at gmail.com> wrote:

> From: Helen Wright <hlwright.uk at gmail.com>
> Subject: [BioC] goseq analysis - extracting list of my genes which are DE in the enriched GO category
> To: bioconductor at r-project.org
> Date: Friday, 9 September, 2011, 14:30
> Hello everyone
> 
> This is my first message to Bioconductor, I hope someone
> out there can
> help me. I have been working through the goseq vignette
> and adapting
> it to my own dataset.  I have got my list of enriched go
> categories,
> which looks like this:
> 
> > head(GO.wall)
>          category       
>    over_represented_pvalue
> under_represented_pvalue
> 3217 GO:0006954            1.496746e-10        
>                1
> 3548 GO:0008009            4.502492e-10        
>                1
> 9081 GO:0042379            1.010877e-09        
>                1
> 2160 GO:0005126            1.033292e-09        
>                1
> 2159 GO:0005125            1.534272e-09        
>                1
> 4251 GO:0009611            1.983311e-09        
>                1
> 
> 
> I can get a list of the genes which are contained in each
> GO category using :
> > ensembl.gene.ids=get("GO:0006954",
> org.Hs.egGO2ALLEGS)
> > unlist(mget(unique(ensembl.gene.ids),
> org.Hs.egGENENAME))
> 
> but this just gives me a list of ALL the genes attached to
> this GO
> category.  I would like to find out which of my DE genes
> are in this
> GO category.
> 
> 
> I have referred to a previous Bioconductor post
> <https://stat.ethz.ch/pipermail/bioconductor/attachments/20110308/92b27df4/attachment.pl>
> which shows how to get a list of all the genes in each GO
> category,
> like this:
> 
> > data(genes)
> > genes2go=getgo(names(genes),'hg19','ensGene')
> > go2genes=goseq:::reversemapping(genes2go)
> 
> Here is an extract from go2genes:
> > go2genes$'GO:0006954'
>   [1] "ENSG00000088812" "ENSG00000078747"
> "ENSG00000172216"
> "ENSG00000196839" "ENSG00000243646" "ENSG00000128271"
> "ENSG00000100014" "ENSG00000128335" "ENSG00000100292"
> "ENSG00000128284"
>  [11] "ENSG00000240972" "ENSG00000117215"
> "ENSG00000050628"
> "ENSG00000187554" "ENSG00000188257" "ENSG00000159377"
> "ENSG00000117525" "ENSG00000116288" "ENSG00000158769"
> "ENSG00000160712"
>  [21] "ENSG00000135744" "ENSG00000117335"
> "ENSG00000163435"
> "ENSG00000177807" "ENSG00000117586" "ENSG00000078900"
> "ENSG00000178585" "ENSG00000168297" "ENSG00000134070"
> "ENSG00000144802"
>  [31] "ENSG00000113889" "ENSG00000172936"
> "ENSG00000074416"
> "ENSG00000085276" "ENSG00000121807" "ENSG00000157017"
> "ENSG00000233276" "ENSG00000175040" "ENSG00000072274"
> "ENSG00000132170"
> 
> > class(go2genes)
> [1] "list"
> 
> 
> I have my list of genes in a data.frame called "pwf" which
> looks like this"
> > head(pwf)
>                           DEgenes   
> bias.data         pwf
> ENSG00000000003       0    1563.5 0.003486092
> ENSG00000000005       0    1353.0 0.003287926
> ENSG00000000419       0    1079.5 0.002955906
> ENSG00000000457       0    3128.5 0.003943449
> ENSG00000000460       0    3126.0 0.003943445
> ENSG00000000938       0    2437.0 0.003896895
> 
> The number 1 in pwf$DEgenes indicates a differentially
> expressed gene.
> 
> 
> So what (I think) I need to do is:
> 1) get the GOTERM (e.g. GO:0006954) from GO.wall$category
> 2) look in go2genes for $'GO:0006954' and get the list of
> all genes
> associated with this GOTERM
> 3) look for these genes in rownames(pwf), and if
> pwf$DEgenes==1 then
> print GO.wall$category, GO.wall$over_represented_pvalue,
> rownames(pwf)
> 
> 
> I have no idea if this is possible.  The sort of output I
> would expect
> would be something like (GOTERM, p-value, list of genes
> from this
> category enriched in my dataset):
> 
> GO:0006954 1.496746e-10
> "ENSG00000078747" "ENSG00000177807" "ENSG00000117586"
> "ENSG00000078900" "ENSG00000175040" "ENSG00000072274"
> "ENSG00000132170"
> 
> 
> Thank you for taking the time to read my email. I hope
> someone out
> there will be able to help me.
> 
> Many thanks and have a nice weekend everybody.
> 
> Helen Wright
> University of Liverpool, UK
> 
> 
> 
> > sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
> locale:
> [1] C/en_US.UTF-8/C/C/C/C
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets
>  methods   base
> other attached packages:
>  [1] GO.db_2.5.0            org.Hs.eg.db_2.5.0    
> RSQLite_0.9-4
>    DBI_0.2-5              AnnotationDbi_1.14.1  
> edgeR_2.2.5
>  goseq_1.4.0            geneLenDataBase_0.99.7
>  [9] BiasedUrn_1.04         DESeq_1.4.1          
>  locfit_1.5-6
>    lattice_0.19-30        akima_0.5-4          
>  Biobase_2.12.2
>  Rsamtools_1.4.3        Biostrings_2.20.3
> [17] GenomicFeatures_1.4.4  GenomicRanges_1.4.8  
>  IRanges_1.10.6
> loaded via a namespace (and not attached):
>  [1] BSgenome_1.20.0    Matrix_0.999375-50
> RColorBrewer_1.0-5
> RCurl_1.6-10       XML_3.4-2        
>  annotate_1.30.1    biomaRt_2.8.1
>      genefilter_1.34.0  geneplotter_1.30.0
> [10] grid_2.13.1        limma_3.8.3      
>  mgcv_1.7-6
> nlme_3.1-101       rtracklayer_1.12.4 splines_2.13.1
> survival_2.36-9    tools_2.13.1       xtable_1.5-6
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>