[R] Check a list of genes for a specific GO term
Martin Morgan
mtmorgan at fhcrc.org
Mon Jul 8 09:12:49 CEST 2013
Please ask follow-up questions about Bioconductor packages on the Bioconductor mailing list.
http://bioconductor.org/help/mailing-list/mailform/
If you are interested in organisms rather than chips, use the organism package, e.g., for Homo sapiens
library(org.Hs.eg.db)
df0 = select(org.Hs.eg.db, keys(org.Hs.eg.db), "GO")
giving
> head(df0)
  ENTREZID         GO EVIDENCE ONTOLOGY
1        1 GO:0003674       ND       MF
2        1 GO:0005576      IDA       CC
3        1 GO:0008150       ND       BP
4       10 GO:0004060      IEA       MF
5       10 GO:0005829      TAS       CC
6       10 GO:0006805      TAS       BP
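from which you might compute
# %in% (not ==) drops any rows where ONTOLOGY is NA
df = unique(df0[df0$ONTOLOGY %in% "BP", c("ENTREZID", "GO")])
len = tapply(df$ENTREZID, df$GO, length)
keep = len[len < 1000]
to get a vector of counts, with names being GO ids, for the terms that annotate fewer than 1000 genes. Remember that the GO is a directed acyclic graph, so terms are nested: a gene annotated to a term is implicitly annotated to all of that term's ancestors, and the counts above cover direct annotations only. You'll likely want to give some thought to which of the two you actually want.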
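The counting step can be tried without installing any annotation package. The sketch below uses a small made-up data frame with the same columns that select() returns; the gene/term pairs, the threshold of 3, and the names max_genes and small are illustrative assumptions, not real annotations or part of the original reply.

```r
# Toy stand-in for the data frame returned by select(); values are made up
df0 <- data.frame(
    ENTREZID = c("1", "1", "1", "10", "10", "100", "100"),
    GO       = c("GO:0008150", "GO:0008150", "GO:0003674",
                 "GO:0006805", "GO:0008150", "GO:0006805", "GO:0008150"),
    EVIDENCE = c("ND", "IEA", "ND", "TAS", "IEA", "TAS", "IDA"),
    ONTOLOGY = c("BP", "BP", "MF", "BP", "BP", "BP", "BP"),
    stringsAsFactors = FALSE
)

# One row per gene / BP term pair, regardless of evidence code
df <- unique(df0[df0$ONTOLOGY %in% "BP", c("ENTREZID", "GO")])

# Number of distinct genes annotated to each BP term
len <- tapply(df$ENTREZID, df$GO, length)

# Keep terms below a threshold (3 here; 1000 in the question)
max_genes <- 3
keep <- len[len < max_genes]

# Gene / term pairs restricted to the retained terms
small <- df[df$GO %in% names(keep), ]
print(keep)
print(small)
```

The unique() call matters because the same gene/term pair can appear once per evidence code; without it such pairs would be counted twice. If counts that include annotations inherited from child terms are wanted, the organism packages should also offer a GOALL column; check columns(org.Hs.eg.db) to confirm it is available in your version.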
The vignettes in the AnnotationDbi and Category packages
http://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
http://bioconductor.org/packages/release/bioc/html/Category.html
are two useful sources of information, as is the annotation work flow
http://bioconductor.org/help/workflows/annotation/
Martin
----- Chirag Gupta <cxg040 at email.uark.edu> wrote:
> Hi
> I think I asked the wrong question. Apologies.
>
> Actually I want all the GO BP annotations for my organism and from them I
> want to retain only those annotations which annotate less than a specified
> number of genes. (say <1000 genes)
>
> I hope I have put it clearly.
>
> sorry again.
>
> Thanks!
>
>
> On Sun, Jul 7, 2013 at 6:55 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
> > In Bioconductor, install the annotation package
> >
> >
> > http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData
> >
> > corresponding to your chip, e.g.,
> >
> > source("http://bioconductor.org/biocLite.R")
> > biocLite("hgu95av2.db")
> >
> > then load it and select the GO terms corresponding to your probes
> >
> > library(hgu95av2.db)
> > lkup <- select(hgu95av2.db, rownames(dat), "GO")
> >
> > then use standard R commands to find the probesets that have the GO id
> > you're interested in
> >
> > keep = lkup$GO %in% "GO:0006355"
> > unique(lkup$PROBEID[keep])
> >
> > Ask follow-up questions about Bioconductor packages on the Bioconductor
> > mailing list
> >
> > http://bioconductor.org/help/mailing-list/mailform/
> >
> > Martin
> > ----- Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > Hello,
> > >
> > > Your question is not very clear, maybe if you post a data example.
> > > To do so, use ?dput. If your data frame is named 'dat', use the
> > > following.
> > >
> > > dput(head(dat, 50)) # paste the output of this in a post
> > >
> > >
> > > If you want to get the rownames matching a certain pattern, maybe
> > > something like the following.
> > >
> > >
> > > idx <- grep("GO:0006355", rownames(dat))
> > > dat[idx, ]
> > >
> > >
> > > Hope this helps,
> > >
> > > Rui Barradas
> > >
> > >
> > > Em 07-07-2013 07:01, Chirag Gupta escreveu:
> > > > Hello everyone
> > > >
> > > > I have a dataframe with rows as probeset ID and columns as samples
> > > > I want to check the rownames and find which of those probes are
> > > > transcription factors (GO:0006355).
> > > >
> > > > Any suggestions?
> > > >
> > > > Thanks!
> > > >
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> *Chirag Gupta*
> Department of Crop, Soil, and Environmental Sciences,
> 115 Plant Sciences Building, Fayetteville, Arkansas 72701