[R] Check a list of genes for a specific GO term

Mon Jul 8 09:12:49 CEST 2013

Please ask follow-up questions about Bioconductor packages on the Bioconductor mailing list.

  http://bioconductor.org/help/mailing-list/mailform/

If you are interested in organisms rather than chips, use the organism package, e.g., for Homo sapiens

  library(org.Hs.eg.db)
  df0 = select(org.Hs.eg.db, keys(org.Hs.eg.db), "GO")

giving

  > head(df0)
    ENTREZID         GO EVIDENCE ONTOLOGY
  1        1 GO:0003674       ND       MF
  2        1 GO:0005576      IDA       CC
  3        1 GO:0008150       ND       BP
  4       10 GO:0004060      IEA       MF
  5       10 GO:0005829      TAS       CC
  6       10 GO:0006805      TAS       BP

from which you might

  df = unique(df0[df0$ONTOLOGY == "BP", c("ENTREZID", "GO")])
  len = tapply(df$ENTREZID, df$GO, length)
  keep = len[len < 1000]

to get a vector of counts, with names being GO ids; keep holds only the terms that annotate fewer than 1000 genes. Remember that the GO is a directed acyclic graph, so terms are nested; you'll likely want to give some thought to what you actually want.
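
As a rough sketch of how the filtered counts might be carried back to the annotations themselves, here is the same calculation on a small made-up data frame standing in for df0 (the toy rows, the max_genes threshold of 3, and the names df_small and genes_by_term are all just for illustration; with the real data you would use the df0 from select() and a threshold of 1000).

```r
# Toy stand-in for df0; the real one comes from select() on the org package.
df0 <- data.frame(
    ENTREZID = c("1", "1", "10", "10", "100", "100"),
    GO       = c("GO:0008150", "GO:0005576", "GO:0006805", "GO:0008150",
                 "GO:0008150", "GO:0006805"),
    EVIDENCE = c("ND", "IDA", "TAS", "IEA", "TAS", "IEA"),
    ONTOLOGY = c("BP", "CC", "BP", "BP", "BP", "BP"),
    stringsAsFactors = FALSE
)
max_genes <- 3   # illustrative threshold; the question asked for 1000

# One row per (gene, BP term) pair, then the number of genes per term
df   <- unique(df0[df0$ONTOLOGY == "BP", c("ENTREZID", "GO")])
len  <- tapply(df$ENTREZID, df$GO, length)
keep <- len[len < max_genes]

# Restrict the annotations to the retained terms
df_small <- df[df$GO %in% names(keep), ]

# Genes grouped by retained GO term
genes_by_term <- split(df_small$ENTREZID, df_small$GO)
```

Note that the "GO" column counts only direct annotations. If you would rather count a gene toward every ancestor of the terms it is annotated to, the organism packages should also offer a "GOALL" column (with a matching "ONTOLOGYALL"); check columns(org.Hs.eg.db) to confirm it is available in your version.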

The vignettes in the AnnotationDbi and Category packages

  http://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
  http://bioconductor.org/packages/release/bioc/html/Category.html

are two useful sources of information, as is the annotation work flow

  http://bioconductor.org/help/workflows/annotation/

Martin

----- Chirag Gupta <cxg040 at email.uark.edu> wrote:
> Hi
> I think I asked the wrong question. Apologies.
> 
> Actually I want all the GO BP annotations for my organism and from them I
> want to retain only those annotations which annotate less than a specified
> number of genes. (say <1000 genes)
> 
> I hope I have put it clearly.
> 
> sorry again.
> 
> Thanks!
> 
> 
> On Sun, Jul 7, 2013 at 6:55 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> 
> > In Bioconductor, install the annotation package
> >
> >
> > http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData
> >
> > corresponding to your chip, e.g.,
> >
> >   source("http://bioconductor.org/biocLite.R")
> >   biocLite("hgu95av2.db")
> >
> > then load it and select the GO terms corresponding to your probes
> >
> >   library(hgu95av2.db)
> >   lkup <- select(hgu95av2.db, rownames(dat), "GO")
> >
> > then use standard R commands to find the probesets that have the GO id
> > you're interested in
> >
> >   keep = lkup$GO %in% "GO:0006355"
> >   unique(lkup$PROBEID[keep])
> >
> > Ask follow-up questions about Bioconductor packages on the Bioconductor
> > mailing list
> >
> >   http://bioconductor.org/help/mailing-list/mailform/
> >
> > Martin
> > ----- Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > Hello,
> > >
> > > Your question is not very clear, maybe if you post a data example.
> > > > To do so, use ?dput. If your data frame is named 'dat', use the
> > > > following.
> > >
> > > dput(head(dat, 50))  # paste the output of this in a post
> > >
> > >
> > > If you want to get the rownames matching a certain pattern, maybe
> > > something like the following.
> > >
> > >
> > > idx <- grep("GO:0006355", rownames(dat))
> > > dat[idx, ]
> > >
> > >
> > > Hope this helps,
> > >
> > > Rui Barradas
> > >
> > >
> > > Em 07-07-2013 07:01, Chirag Gupta escreveu:
> > > > Hello everyone
> > > >
> > > > I have a dataframe with rows as probeset ID and columns as samples
> > > > > I want to check the rownames and find which of those probes are
> > > > transcription factors. (GO:0006355 )
> > > >
> > > > Any suggestions?
> > > >
> > > > Thanks!
> > > >
> > >
> >
> 
> 
> 
> -- 
> *Chirag Gupta*
> Department of Crop, Soil, and Environmental Sciences,
> 115 Plant Sciences Building, Fayetteville, Arkansas 72701