[Bioc-sig-seq] Why are there non-imformative names() from GenomicFeatures:::exonsBy(...)?

Hervé Pagès hpages at fhcrc.org
Mon Aug 16 21:10:49 CEST 2010


Hi Steve,

On 08/04/2010 06:20 AM, Steve Lianoglou wrote:
> Howdy,
>
> For what it's worth, I'm trying to hack on GenomicFeatures in a
> non-intrusive so that I can (i) wrap some functionality in a way that
> I think is useful/intuitive, and (ii) hopefully contribute back to the
> package itself.
>
> I addressed the "problem" in my original email by tweaking the
> .featuresBy function here:
>
> http://github.com/lianos/GenomicFeaturesX/blob/master/R/transcriptsBy.R#L56
>
> (around lines 56, and again 93)
>
> As I'm a noob with this package, I'm not sure if it will aversely
> effect anything else down stream, but it's doing what I need for now,
> so that exonsBy(txdb, 'tx') now has the transcript id's in the names()
> slot of the GRangesList that is returned.

Sorry for the delay on this.

I agree that the names or the returned GRangesList object are not
very informative. We choose to use them only because it's the only
thing that is guaranteed to be defined and unique. For example,
there are many tables at UCSC where the transcript names (txName
col) are not unique. This was the main motivation for introducing
our own "internal id", make it the primary key for the transcript
table. Then it was natural to use it for naming the groups of exons
returned by 'exonsBy(txdb, "tx")'.
Note that the situation is not any better with exon names:
Biomart/Ensembl provides them but UCSC doesn't. So this was one
more reason to come up with the current db schema where we have
the tx_id, tx_name, exon_id, exon_name cols.

But yes, even if the transcript names are not guaranteed to be
unique, it's still useful to have the GRangesList object returned
by exonsBy() set with those names.

Starting with GenomicFeatures 1.0.10 (will be available in the
next 24 hours or so), transcriptsBy(), exonsBy(), cdsBy(),
intronsByTranscript(), fiveUTRsByTranscript() and
threeUTRsByTranscript() accept extra argument 'use.names' (FALSE
by default). This allows the user to choose how the returned
object should be named: with the feature internal ids (not very
informative but guaranteed to be defined and unique), or with the
feature names (more informative but not guaranteed to be unique or
even defined).
A warning is issued if some names are not defined (NAs) or not unique.


>
> If it pleases the court, I'd also like to S4-ize the `exons` and
> `transcripts` functions (which currently work on TranscriptDb's) since
> I'd like to use them as accessors for my Gene-like objects:
>
> http://github.com/lianos/GenomicFeaturesX/blob/master/R/Gene-class.R

Just to clarify, are you asking us to make these functions generic
so you can write your own methods for your Gene-like objects?
If nobody objects, I'll be happy to make this change.

Cheers,
H.

>
> Cheers,
> -steve
>
> On Tue, Aug 3, 2010 at 7:33 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com>  wrote:
>> Hi,
>>
>> I'm seeing if I can transition some of my code from my
>> (previously-mentioned) GenomeAnnotations package to use
>> GenomicFeatures, so I have a few questions of how certain things
>> should be done w/ GenomicFeatures.
>>
>> I'll start with this one:
>>
>> Shouldn't the GRangesList returned by exonsBy(txdb, 'tx') have more
>> informative names than:
>> "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
>> ?
>>
>> I've made a TranscriptDb from the 'ensembl' gene annos, and I'd like
>> to reconcile which "transcript exons" belong to which transcripts, but
>> it seems there's no direct link from the GRangesList returned by
>> exonsBy(..., 'tx') and the GRanges object returned by
>> transcripts(txdb).
>>
>> Shouldn't there be? One would expect a 1:1 map, no? The length of the
>> exonsBy/GRangesList is the same as transcripts/GRanges object, so ...
>> yes :-).
>>
>> I mean:
>>
>> R>  txdb<- makeTranscriptDbFromUCSC(genome='hg18', tablename='ensGene')
>> R>  txdb
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Data source: UCSC
>> | Genome: hg18
>> | UCSC Table: ensGene
>> | Type of Gene ID: Ensembl gene ID
>> | Full dataset: yes
>> | transcript_nrow: 63280
>> | exon_nrow: 276075
>> | cds_nrow: 225373
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2010-08-03 16:50:06 -0400 (Tue, 03 Aug 2010)
>> | GenomicFeatures version at creation time: 1.0.6
>> | RSQLite version at creation time: 0.9-2
>>
>> R>  xcripts<- transcripts(txdb)
>> R>  xcripts
>> xcripts
>> GRanges with 63280 ranges and 2 elementMetadata values
>>         seqnames               ranges strand   |     tx_id         tx_name
>>            <Rle>              <IRanges>    <Rle>     |<integer>       <character>
>>     [1]     chr1     [  1873,   3533]      +   |      1730 ENST00000404059
>>     [2]     chr1     [ 20229,  20366]      +   |      1732 ENST00000408384
>> ...
>>
>> R>  xcripts.exons<- exonsBy(txdb, 'tx')
>> R>  head(names(xcripts.exons))
>> [1] "1" "2" "3" "4" "5" "6"
>>
>> I feel like names(xcripts.exons) should give me something like:
>> "ENST00000404059" "ENST00000408384" "ENST00000359752" .... (in correct
>> order, of course)
>>
>> My same confusion has to do lack of informative names returned from
>> exonsBy(txdb, 'gene') -- I feel like it should set the names in the
>> same way that transcriptsBy(txdb, 'gene').
>>
>> So, I wonder if this is just an oversight, or is it not there on
>> purpose and I have to rethink the way I approach these problems.
>> Should I be findOverlap-ing between my xcripts.exons GRangesList and
>> my xcripts GRanges, or something? And if so, why is that better?
>>
>> R>  sessionInfo()
>> R version 2.12.0 Under development (unstable) (2010-06-28 r52408)
>> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] RSQLite_0.9-2         DBI_0.2-5             GenomicFeatures_1.1.6
>> GenomicRanges_1.1.20
>> [5] IRanges_1.7.15
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.9.0      biomaRt_2.5.1      Biostrings_2.17.27
>> BSgenome_1.17.6    RCurl_1.4-3
>> [6] rtracklayer_1.9.4  tools_2.12.0       XML_3.1-0
>>
>>
>> Thanks,
>> -steve
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>   | Memorial Sloan-Kettering Cancer Center
>>   | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list