[Bioc-sig-seq] Why are there non-imformative names() from GenomicFeatures:::exonsBy(...)?

Tue Aug 17 02:45:44 CEST 2010

Steve,

On 08/16/2010 02:52 PM, Steve Lianoglou wrote:
> Hi Hervé,
>
> 2010/8/16 Hervé Pagès<hpages at fhcrc.org>:
> <snip>
>> Sorry for the delay on this.
>
> No problem :-)
>
> <snip>
>
>> But yes, even if the transcript names are not guaranteed to be
>> unique, it's still useful to have the GRangesList object returned
>> by exonsBy() set with those names.
>>
>> Starting with GenomicFeatures 1.0.10 (will be available in the
>> next 24 hours or so), transcriptsBy(), exonsBy(), cdsBy(),
>> intronsByTranscript(), fiveUTRsByTranscript() and
>> threeUTRsByTranscript() accept extra argument 'use.names' (FALSE
>> by default). This allows the user to choose how the returned
>> object should be named: with the feature internal ids (not very
>> informative but guaranteed to be defined and unique), or with the
>> feature names (more informative but not guaranteed to be unique or
>> even defined).
>> A warning is issued if some names are not defined (NAs) or not unique.
>
> Perhaps another option (more consistent(?) since the names() are still
> unique)

AFAIK the names of a vector (atomic or list) in R are not required to
be unique. They can even contain NAs and the empty string.
The rownames of a data frame have these kinds of restrictions though.

> might be to just keep it as it is now, but add a DataFrame to
> the GRangesList with something like a tx_name column if the user asks
> for it?
>
> (I didn't realize we could add DF's to *RangesList objects until I
> reread some of the IRanges/GenomicFeatures vignettes)

Yes, every Sequence class has an elementMetadata slot accessible thru
the values() accessor. So you can attach values to the ranges of an
IRanges or GRanges object, to the elements of an Rle object, to the
top-level elements of a GRangesList object, to the letters of a
DNAString object, to the sequences of a DNAStringSet object, etc...
and even to the column of a DataFrame object (DataFrame is also
a Sequence subclass).

This design makes attaching data to the elements of a Sequence easy
and very flexible. But also it comes at the price of making things
look tricky (and maybe a little bit overwhelming) to the user when
a high-level Sequence subclass like GRangesList has slots that are
themselves Sequence objects which have slots that are themselves
Sequence objects which have slots etc etc...

Just look at the output of str(exonsBy(txdb)) and try to spot all
the "@ elementMetadata" lines to see what I mean.

There is also a display challenge e.g. when a GRangesList object
has values attached at 2 different levels (top-level values and
element-level values). Currently the "show" method for GRangesList
only displays the latter (try to add top-level values to the result
of exonsBy(): they are not displayed). Since it would be impossible
to display the content of all the elementMetadata slots contained
in a high-level Sequence object in general, we need to make choices.
For GRangesList the choice was to display the element-level values,
not the top-level values.

For these reasons, we decided to avoid using the GRangesList's
top-level elementMetadata slot in the GenomicFeatures package.

>
> This way the user will be assured of unique names in the GRangesList's
> `names()` slot, but can get at the tx_name of the given list item via
> something like:
>
> values(exon.list)$tx_name
>
> And perhaps this extra DataFrame would only be added if exonsBy, etc.
> function is called with a "with.names=TRUE" (instead of use.names).

For the current implementation of the 'use.names' arg, I added an
helper function id2name() that is exported but not documented (I will
document it now). It returns the mapping between ids and names for
the specified feature:

   > tx_id2name <- id2name(txdb, feature.type="tx")
   > tx_id2name[1:4]
              1            2            3            4
   "uc001aaa.2" "uc009vip.1" "uc009vis.1" "uc002qvt.1"

The mapping is just a named character vector where the names are
the internal ids.

So you can still use exonsBy() with use.names=FALSE and do your own
mapping separately with id2name().

>
> Just brainstorming ...
>
> Also:
>
>>> If it pleases the court, I'd also like to S4-ize the `exons` and
>>> `transcripts` functions (which currently work on TranscriptDb's) since
>>> I'd like to use them as accessors for my Gene-like objects:
>>>
>>> http://github.com/lianos/GenomicFeaturesX/blob/master/R/Gene-class.R
>>
>> Just to clarify, are you asking us to make these functions generic
>> so you can write your own methods for your Gene-like objects?
>> If nobody objects, I'll be happy to make this change.
>
> Yes, that's what I'm asking -- I'd like functions like exonsBy, cdsBy,
> transcripts, exons, etc. to be S4-ized (and allowed to take an '...'
> parameter).

Sounds good to me. I'll wait a couple of days before I make the change
so people have time to comment/object if they want.

>
> Actually, a few days ago, I think I've more-or-less moved all of the
> GenomicFeatures methods I'm trampling (S4-izing) over here, on line 37
> down:
>
> http://github.com/lianos/GenomicFeaturesX/blob/master/R/AllS4.toGenomicFeatures.R#L37
>
> Though I think you might find some of the other functions for
> TranscriptDb objects on that page (simple as they may be) useful, such
> as get/setMetadata, for instance.

Right, we don't provide anything to extract stuff from the metadata
table (something like your getMetadata() extractor). But we could if
people find this useful. Right now, they can do:

   dbReadTable(GenomicFeatures:::txdbConn(txdb), "metadata")

and then do standard data.frame manipulation (like row subsetting)
on the result.

As for your setMetadata(), I can see how people might want to add
their own metadata to the TranscriptDb object but this breaks the
"TranscriptDb objects are read-only" paradigm. Here is a suggestion:
if the user already knows those extra metadata at db creation time,
maybe we could add an extra argument to the makeTranscriptDbFrom*()
family so s/he can pass them when s/he makes the object. Would that
be good enough? What kind of metadata do you want to add to your
TranscriptDb objects?

Thanks for your feedback,
H.

>
> Thanks,
> -steve
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319