[BioC] exon genomic coordinates
Hervé Pagès
hpages at fhcrc.org
Tue Nov 26 00:30:35 CET 2013
Hi Michael, John,
I was hoping maybe making a TranscriptDb object from the CCDS track
would help here:
library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC("hg19", "ccdsGene")
but unfortunately it's not easy to link the exons in 'txdb' to
Ensembl transcript or gene ids because 'txdb' lacks this information:
> ex <- exons(txdb, columns=c("exon_id", "tx_name", "gene_id"))
> head(ex)
GRanges with 6 ranges and 3 metadata columns:
seqnames ranges strand | exon_id tx_name
<Rle> <IRanges> <Rle> | <integer> <CharacterList>
[1] chr1 [ 69091, 70008] + | 1 CCDS30547.1
[2] chr1 [367659, 368597] + | 2 CCDS41220.1
[3] chr1 [861322, 861393] + | 3 CCDS2.2
[4] chr1 [865535, 865716] + | 4 CCDS2.2
[5] chr1 [866419, 866469] + | 5 CCDS2.2
[6] chr1 [871152, 871276] + | 6 CCDS2.2
gene_id
<CharacterList>
[1] NA
[2] NA
[3] NA
[4] NA
[5] NA
[6] NA
---
seqlengths:
chr1 chr2 ... chrUn_gl000249
249250621 243199373 ... 38502
Querying Ensembl directly to make a TranscriptDb object:
library(GenomicFeatures)
txdb <- makeTranscriptDbFromBiomart() # takes a while! (40 min. for me)
# this used to be much faster,
# don't know what's going on
saveDb(txdb, file="hsapiens_gene_ensembl_txdb.sqlite") # save for
later re-use
KIT_exons <- exons(txdb, vals=list(gene_id="ENSG00000157404"),
columns=c("exon_name", "tx_name", "gene_id"))
tx_names <- unique(unlist(mcols(KIT_exons)$tx_name))
# tx_names # 4 transcripts
# [1] "ENST00000288135" "ENST00000412167" "ENST00000514582"
"ENST00000512959"
ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)
KIT_ex_by_tx <- ex_by_tx[tx_names]
Transcript lengths:
> sum(width(KIT_ex_by_tx))
ENST00000288135 ENST00000412167 ENST00000514582 ENST00000512959
5186 3470 538 746
Pick-up the longest:
> KIT_ex_by_tx[["ENST00000288135"]]
GRanges with 21 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 4 [55524085, 55524248] + | 156828 ENSE00001905199
[2] 4 [55561678, 55561947] + | 156830 ENSE00001032350
[3] 4 [55564450, 55564731] + | 156832 ENSE00001074448
[4] 4 [55565796, 55565932] + | 156833 ENSE00001121859
[5] 4 [55569890, 55570058] + | 156834 ENSE00001074426
... ... ... ... ... ... ...
[17] 4 [55599236, 55599358] + | 156850 ENSE00001074435
[18] 4 [55602664, 55602775] + | 156852 ENSE00001074442
[19] 4 [55602887, 55602986] + | 156853 ENSE00001224349
[20] 4 [55603341, 55603446] + | 156854 ENSE00001074415
[21] 4 [55604595, 55606881] + | 156856 ENSE00001898693
exon_rank
<integer>
[1] 1
[2] 2
[3] 3
[4] 4
[5] 5
... ...
[17] 17
[18] 18
[19] 19
[20] 20
[21] 21
---
seqlengths:
1 2 ... LRG_98
LRG_99
249250621 243199373 ... 18750
13294
Cheers,
H.
On 11/25/2013 02:19 PM, Michael Lawrence wrote:
> You could use rtracklayer to grab the CCDS track from UCSC. Might be some
> way with Biomart from Ensembl.
>
>
> On Mon, Nov 25, 2013 at 2:06 PM, array chip <arrayprofile at yahoo.com> wrote:
>
>> Thanks Michael. How do I restrict to consensus CDS?
>>
>> John
>>
>> ------------------------------
>> *From:* Michael Lawrence <lawrence.michael at gene.com>
>> *To:* array chip <arrayprofile at yahoo.com>
>> *Cc:* "bioconductor at r-project.org" <bioconductor at r-project.org>
>> *Sent:* Monday, November 25, 2013 1:55 PM
>>
>> *Subject:* Re: [BioC] exon genomic coordinates
>>
>> Well, one approach is to take the longest one. That's what UCSC uses to
>> call its "canonical transcripts". And restrict to the consensus CDS (CCDS).
>>
>>
>> On Mon, Nov 25, 2013 at 1:12 PM, array chip <arrayprofile at yahoo.com>wrote:
>>
>> Hi all, have another questions about exon genomic coordinates:
>>
>> library(biomaRt)
>> ensembl = useMart("ensembl")
>>
>> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>>
>>> getBM(attributes =
>> c("external_gene_id","chromosome_name","exon_chrom_start","exon_chrom_end","ensembl_transcript_id","rank"), filters
>> = 'hgnc_symbol', values=c("KIT"),mart=ensembl)
>>
>> chromosome_name exon_chrom_start exon_chrom_end ensembl_transcript_id
>> rank
>> 1 4 55524085 55524248 ENST00000412167
>> 1
>> 2 4 55561678 55561947 ENST00000412167
>> 2
>> 3 4 55564450 55564731 ENST00000412167
>> 3
>> 4 4 55565796 55565932 ENST00000412167
>> 4
>> 5 4 55569890 55570058 ENST00000412167
>> 5
>> 6 4 55573264 55573453 ENST00000412167
>> 6
>> 7 4 55575590 55575705 ENST00000412167
>> 7
>> 8 4 55589750 55589864 ENST00000412167
>> 8
>> 9 4 55592023 55592204 ENST00000412167
>> 9
>> 10 4 55593384 55593490 ENST00000412167
>> 10
>> 11 4 55593582 55593708 ENST00000412167
>> 11
>> 12 4 55593989 55594093 ENST00000412167
>> 12
>> 13 4 55594177 55594287 ENST00000412167
>> 13
>> 14 4 55595501 55595651 ENST00000412167
>> 14
>> 15 4 55597494 55597585 ENST00000412167
>> 15
>> 16 4 55598037 55598164 ENST00000412167
>> 16
>> 17 4 55599236 55599358 ENST00000412167
>> 17
>> 18 4 55602664 55602775 ENST00000412167
>> 18
>> 19 4 55602887 55602986 ENST00000412167
>> 19
>> 20 4 55603341 55603446 ENST00000412167
>> 20
>> 21 4 55604595 55605177 ENST00000412167
>> 21
>> 22 4 55524085 55524248 ENST00000288135
>> 1
>> 23 4 55561678 55561947 ENST00000288135
>> 2
>> 24 4 55564450 55564731 ENST00000288135
>> 3
>> 25 4 55565796 55565932 ENST00000288135
>> 4
>> 26 4 55569890 55570058 ENST00000288135
>> 5
>> 27 4 55573264 55573453 ENST00000288135
>> 6
>> 28 4 55575590 55575705 ENST00000288135
>> 7
>> 29 4 55589750 55589864 ENST00000288135
>> 8
>> 30 4 55592023 55592216 ENST00000288135
>> 9
>> 31 4 55593384 55593490 ENST00000288135
>> 10
>> 32 4 55593582 55593708 ENST00000288135
>> 11
>> 33 4 55593989 55594093 ENST00000288135
>> 12
>> 34 4 55594177 55594287 ENST00000288135
>> 13
>> 35 4 55595501 55595651 ENST00000288135
>> 14
>> 36 4 55597494 55597585 ENST00000288135
>> 15
>> 37 4 55598037 55598164 ENST00000288135
>> 16
>> 38 4 55599236 55599358 ENST00000288135
>> 17
>> 39 4 55602664 55602775 ENST00000288135
>> 18
>> 40 4 55602887 55602986 ENST00000288135
>> 19
>> 41 4 55603341 55603446 ENST00000288135
>> 20
>> 42 4 55604595 55606881 ENST00000288135
>> 21
>> 43 4 55524106 55524248 ENST00000514582
>> 1
>> 44 4 55561678 55562072 ENST00000514582
>> 2
>> 45 4 55595458 55595651 ENST00000512959
>> 1
>> 46 4 55597494 55597585 ENST00000512959
>> 2
>> 47 4 55598037 55598164 ENST00000512959
>> 3
>> 48 4 55599236 55599567 ENST00000512959
>> 4
>>
>> This will give many versions of genomic coordinates. For example, KIT has
>> 3 sets of exons. I think these different versions may refer to different
>> splicing variants/isoforms. Is there a "default"/"standard" set of exons
>> for each gene? and how do I know which one is such one?
>>
>> Thanks
>>
>> John
>>
>>
>>
>>
>> ________________________________
>>
>> To: Hans-Rudolf Hotz <hrh at fmi.ch>; "bioconductor at r-project.org" <
>> bioconductor at r-project.org>
>> Sent: Monday, November 25, 2013 12:41 PM
>> Subject: Re: [BioC] exon genomic coordinates
>>
>>
>> Hi,
>>
>> I am trying to use4 bioMart to retrieve the exon coordinates using the
>> example provided below:
>>
>> library(biomaRt)
>> ensembl = useMart("ensembl")
>>
>> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>>
>> getBM(attributes =
>> c("chromosome_name","exon_chrom_start","exon_chrom_end","rank"),
>> filters = 'hgnc_symbol', values=c("KIT"),mart=ensembl)
>>
>> The above works fine. However, when I tried to add "hgnc_symbol" to the
>> attributes list, it gave me error:
>>
>> getBM(attributes =
>>
>> c("hgnc_symbol","chromosome_name","exon_chrom_start","exon_chrom_end","rank"),
>> filters = 'hgnc_symbol', values=c("KIT"),mart=ensembl)
>>
>>
>> Error in getBM(attributes = c("hgnc_symbol", "chromosome_name",
>> "exon_chrom_start", :
>> Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple
>> attribute pages are not allowed
>>
>> But if I keep "hgnc_symbol" in the atributes list and
>> remove "exon_chrom_start" and "exon_chrom_end", then it worked again:
>> getBM(attributes =
>> c("hgnc_symbol","chromosome_name","ensembl_transcript_id","rank"),
>> filters = 'hgnc_symbol', values=c("KIT"),mart=ensembl)
>>
>> Can anyone tell me why is that?
>>
>> Thanks
>>
>> John
>>
>>
>> ________________________________
>> From: Hans-Rudolf Hotz <hrh at fmi.ch>
>>
>> onductor at r-project.org>
>> Sent: Thursday, November 21, 2013 5:38 AM
>> Subject: Re: [BioC] exon genomic coordinates
>>
>>
>> Hi John
>>
>> You can use the BioMart database, which you can access with the biomaRt
>> package to get all exons for all transcripts for a given giene, eg:
>>
>> library(biomaRt)
>> ensembl = useMart("ensembl")
>> #assuming you are interested in mouse
>> mouse.ensembl = useDataset("mmusculus_gene_ensembl",mart=ensembl)
>>
>> getBM(attributes =
>>
>> c("chromosome_name","exon_chrom_start","exon_chrom_end","ensembl_exon_id","ensembl_transcript_id","ensembl_gene_id"),
>> filters = 'mgi_symbol', values=c("KIT"),mart=mouse.ensembl)
>>
>>
>> Hope this helps
>>
>> Hans-Rudolf
>>
>>
>> On 11/21/2013 09:14 AM, array chip wrote:
>>> Hi,
>>>
>>>
>>> Can anyone suggest how to retrieve the genomic coordinates for all exons
>> for a given gene by say gene symbol? For example, how to retrieve the
>> coordinates for all 21 exons for gene KIT?
>>>
>>> Thanks
>>>
>>> John
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>>>
>> [[alternative HTML version deleted]]
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> [[alternative HTML version deleted]]
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list