[BioC] How to get position for each gene ID/gene symbol instead of position for each transcript
Steve Lianoglou
mailinglist.honeypot at gmail.com
Wed Aug 25 04:20:42 CEST 2010
Hi Shirley,
On Tue, Aug 24, 2010 at 9:07 PM, shirley zhang <shirley0818 at gmail.com> wrote:
> Hi Marc,
>
> Thanks a lot for your detailed reply. It is very helpful.
>
> I tried library "GenomicFeatures" and your code. It works great. I did
> get position for each Transcript. However I am only interested in
> obtaining position for each gene ID/gene symbol instead of position
> for each transcript. For example, if one gene has multiple
> transcripts, I only want the biggest boundary of each gene which
> includes all of its transcripts. Does GenomicFeatures or other package
> can do this? Or I have to compute the gene position by combining all
> of its corresponding transcripts?
You can do this pretty "simply" with GenomicFeatures, if you want to
stick with that:
R> txdb <- loadFeatures('your.transcript.db')
R> xcripts <- transcriptsBy(txdb, by='gene')
## This part is really slow -- this will be subject of next email
R> gene.bounds <- seqapply(xcripts, reduce)
the names() of gene.bounds is the entrez.id of the gene. You can use
the org.Hs.eg.db pacakges
R> library(org.Hs.eg.db)
R> symbols <- mget(names(gene.bounds), org.Hs.egSYMBOL, ifnotfound=NA)
symbols will now be a list (names are entrez ids, values are the gene
symbols) that you can manipulate in "the standard R way"
Hope that helps,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioconductor
mailing list