[Bioc-sig-seq] Why are there non-imformative names() from GenomicFeatures:::exonsBy(...)?

Wed Aug 4 17:38:19 CEST 2010

Steve,
Herve has better knowledge of the transcript names key than I do since 
he was dealing with the pre-processed data from UCSC and BioMart, but it 
is my understanding that there is no guarantee that the transcript names 
assigned by these resources are unique or even exist so using the 
transcript names can be problematic. We could add a flag 
(useTxNames=FALSE?) to the exonsBy function that, if TRUE, would check 
to see if the transcript names are suitable, and if they are suitable, 
it would return the output you are looking for, and if they are not 
suitable, the function would throw and error stating the transcript 
names cannot be used for this purpose. Thoughts?

Cheers,
Patrick

On 8/4/10 6:20 AM, Steve Lianoglou wrote:
> Howdy,
>
> For what it's worth, I'm trying to hack on GenomicFeatures in a
> non-intrusive so that I can (i) wrap some functionality in a way that
> I think is useful/intuitive, and (ii) hopefully contribute back to the
> package itself.
>
> I addressed the "problem" in my original email by tweaking the
> .featuresBy function here:
>
> http://github.com/lianos/GenomicFeaturesX/blob/master/R/transcriptsBy.R#L56
>
> (around lines 56, and again 93)
>
> As I'm a noob with this package, I'm not sure if it will aversely
> effect anything else down stream, but it's doing what I need for now,
> so that exonsBy(txdb, 'tx') now has the transcript id's in the names()
> slot of the GRangesList that is returned.
>
> If it pleases the court, I'd also like to S4-ize the `exons` and
> `transcripts` functions (which currently work on TranscriptDb's) since
> I'd like to use them as accessors for my Gene-like objects:
>
> http://github.com/lianos/GenomicFeaturesX/blob/master/R/Gene-class.R
>
> Cheers,
> -steve
>
> On Tue, Aug 3, 2010 at 7:33 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com>  wrote:
>    
>> Hi,
>>
>> I'm seeing if I can transition some of my code from my
>> (previously-mentioned) GenomeAnnotations package to use
>> GenomicFeatures, so I have a few questions of how certain things
>> should be done w/ GenomicFeatures.
>>
>> I'll start with this one:
>>
>> Shouldn't the GRangesList returned by exonsBy(txdb, 'tx') have more
>> informative names than:
>> "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
>> ?
>>
>> I've made a TranscriptDb from the 'ensembl' gene annos, and I'd like
>> to reconcile which "transcript exons" belong to which transcripts, but
>> it seems there's no direct link from the GRangesList returned by
>> exonsBy(..., 'tx') and the GRanges object returned by
>> transcripts(txdb).
>>
>> Shouldn't there be? One would expect a 1:1 map, no? The length of the
>> exonsBy/GRangesList is the same as transcripts/GRanges object, so ...
>> yes :-).
>>
>> I mean:
>>
>> R>  txdb<- makeTranscriptDbFromUCSC(genome='hg18', tablename='ensGene')
>> R>  txdb
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Data source: UCSC
>> | Genome: hg18
>> | UCSC Table: ensGene
>> | Type of Gene ID: Ensembl gene ID
>> | Full dataset: yes
>> | transcript_nrow: 63280
>> | exon_nrow: 276075
>> | cds_nrow: 225373
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2010-08-03 16:50:06 -0400 (Tue, 03 Aug 2010)
>> | GenomicFeatures version at creation time: 1.0.6
>> | RSQLite version at creation time: 0.9-2
>>
>> R>  xcripts<- transcripts(txdb)
>> R>  xcripts
>> xcripts
>> GRanges with 63280 ranges and 2 elementMetadata values
>>         seqnames               ranges strand   |     tx_id         tx_name
>>            <Rle>              <IRanges>    <Rle>     |<integer>       <character>
>>     [1]     chr1     [  1873,   3533]      +   |      1730 ENST00000404059
>>     [2]     chr1     [ 20229,  20366]      +   |      1732 ENST00000408384
>> ...
>>
>> R>  xcripts.exons<- exonsBy(txdb, 'tx')
>> R>  head(names(xcripts.exons))
>> [1] "1" "2" "3" "4" "5" "6"
>>
>> I feel like names(xcripts.exons) should give me something like:
>> "ENST00000404059" "ENST00000408384" "ENST00000359752" .... (in correct
>> order, of course)
>>
>> My same confusion has to do lack of informative names returned from
>> exonsBy(txdb, 'gene') -- I feel like it should set the names in the
>> same way that transcriptsBy(txdb, 'gene').
>>
>> So, I wonder if this is just an oversight, or is it not there on
>> purpose and I have to rethink the way I approach these problems.
>> Should I be findOverlap-ing between my xcripts.exons GRangesList and
>> my xcripts GRanges, or something? And if so, why is that better?
>>
>> R>  sessionInfo()
>> R version 2.12.0 Under development (unstable) (2010-06-28 r52408)
>> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] RSQLite_0.9-2         DBI_0.2-5             GenomicFeatures_1.1.6
>> GenomicRanges_1.1.20
>> [5] IRanges_1.7.15
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.9.0      biomaRt_2.5.1      Biostrings_2.17.27
>> BSgenome_1.17.6    RCurl_1.4-3
>> [6] rtracklayer_1.9.4  tools_2.12.0       XML_3.1-0
>>
>>
>> Thanks,
>> -steve
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>   | Memorial Sloan-Kettering Cancer Center
>>   | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>>      
>
>
>