[BioC] goseq analysis - extracting list of my genes which are DE in the enriched GO category

Fri Sep 9 15:30:23 CEST 2011

Hello everyone

This is my first message to Bioconductor, I hope someone out there can
help me. I have been working through the goseq vignette and adapting
it to my own dataset.  I have got my list of enriched go categories,
which looks like this:

> head(GO.wall)
         category           over_represented_pvalue
under_represented_pvalue
3217 GO:0006954            1.496746e-10                        1
3548 GO:0008009            4.502492e-10                        1
9081 GO:0042379            1.010877e-09                        1
2160 GO:0005126            1.033292e-09                        1
2159 GO:0005125            1.534272e-09                        1
4251 GO:0009611            1.983311e-09                        1

I can get a list of the genes which are contained in each GO category using :
> ensembl.gene.ids=get("GO:0006954", org.Hs.egGO2ALLEGS)
> unlist(mget(unique(ensembl.gene.ids), org.Hs.egGENENAME))

but this just gives me a list of ALL the genes attached to this GO
category.  I would like to find out which of my DE genes are in this
GO category.

I have referred to a previous Bioconductor post
<https://stat.ethz.ch/pipermail/bioconductor/attachments/20110308/92b27df4/attachment.pl>
which shows how to get a list of all the genes in each GO category,
like this:

> data(genes)
> genes2go=getgo(names(genes),'hg19','ensGene')
> go2genes=goseq:::reversemapping(genes2go)

Here is an extract from go2genes:
> go2genes$'GO:0006954'
  [1] "ENSG00000088812" "ENSG00000078747" "ENSG00000172216"
"ENSG00000196839" "ENSG00000243646" "ENSG00000128271"
"ENSG00000100014" "ENSG00000128335" "ENSG00000100292"
"ENSG00000128284"
 [11] "ENSG00000240972" "ENSG00000117215" "ENSG00000050628"
"ENSG00000187554" "ENSG00000188257" "ENSG00000159377"
"ENSG00000117525" "ENSG00000116288" "ENSG00000158769"
"ENSG00000160712"
 [21] "ENSG00000135744" "ENSG00000117335" "ENSG00000163435"
"ENSG00000177807" "ENSG00000117586" "ENSG00000078900"
"ENSG00000178585" "ENSG00000168297" "ENSG00000134070"
"ENSG00000144802"
 [31] "ENSG00000113889" "ENSG00000172936" "ENSG00000074416"
"ENSG00000085276" "ENSG00000121807" "ENSG00000157017"
"ENSG00000233276" "ENSG00000175040" "ENSG00000072274"
"ENSG00000132170"

> class(go2genes)
[1] "list"

I have my list of genes in a data.frame called "pwf" which looks like this"
> head(pwf)
                          DEgenes    bias.data         pwf
ENSG00000000003       0    1563.5 0.003486092
ENSG00000000005       0    1353.0 0.003287926
ENSG00000000419       0    1079.5 0.002955906
ENSG00000000457       0    3128.5 0.003943449
ENSG00000000460       0    3126.0 0.003943445
ENSG00000000938       0    2437.0 0.003896895

The number 1 in pwf$DEgenes indicates a differentially expressed gene.

So what (I think) I need to do is:
1) get the GOTERM (e.g. GO:0006954) from GO.wall$category
2) look in go2genes for $'GO:0006954' and get the list of all genes
associated with this GOTERM
3) look for these genes in rownames(pwf), and if pwf$DEgenes==1 then
print GO.wall$category, GO.wall$over_represented_pvalue, rownames(pwf)

I have no idea if this is possible.  The sort of output I would expect
would be something like (GOTERM, p-value, list of genes from this
category enriched in my dataset):

GO:0006954 1.496746e-10
"ENSG00000078747" "ENSG00000177807" "ENSG00000117586"
"ENSG00000078900" "ENSG00000175040" "ENSG00000072274"
"ENSG00000132170"

Thank you for taking the time to read my email. I hope someone out
there will be able to help me.

Many thanks and have a nice weekend everybody.

Helen Wright
University of Liverpool, UK

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
 [1] GO.db_2.5.0            org.Hs.eg.db_2.5.0     RSQLite_0.9-4
   DBI_0.2-5              AnnotationDbi_1.14.1   edgeR_2.2.5
 goseq_1.4.0            geneLenDataBase_0.99.7
 [9] BiasedUrn_1.04         DESeq_1.4.1            locfit_1.5-6
   lattice_0.19-30        akima_0.5-4            Biobase_2.12.2
 Rsamtools_1.4.3        Biostrings_2.20.3
[17] GenomicFeatures_1.4.4  GenomicRanges_1.4.8    IRanges_1.10.6
loaded via a namespace (and not attached):
 [1] BSgenome_1.20.0    Matrix_0.999375-50 RColorBrewer_1.0-5
RCurl_1.6-10       XML_3.4-2          annotate_1.30.1    biomaRt_2.8.1
     genefilter_1.34.0  geneplotter_1.30.0
[10] grid_2.13.1        limma_3.8.3        mgcv_1.7-6
nlme_3.1-101       rtracklayer_1.12.4 splines_2.13.1
survival_2.36-9    tools_2.13.1       xtable_1.5-6