[Bioc-devel] Meta data access with $ for GRanges vs RangedData vs data.frame

Mon Mar 19 23:27:14 CET 2012

Hi Michael,

On 03/19/2012 01:57 PM, Michael Lawrence wrote:
>
>
> On Mon, Mar 19, 2012 at 12:37 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     Hi Tim, Steve,
>
>
>     On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
>
>         Hi Tim,
>
>         On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates <TYates at picr.man.ac.uk> wrote:
>         [snip]
>
>             # S3-style accessors that forward `$` to the elementMetadata columns
>             `$.GRanges` <- function(x, name) {
>               elementMetadata(x)[[name]]
>             }
>             `$<-.GRanges` <- function(x, name, value) {
>               elementMetadata(x)[[name]] <- value
>               x
>             }
>
>             Then all the above functions behave in a similar fashion,
>             however I don't want to do this in my package, as I believe
>             it will be a gateway to namespace hell…
>
>             Is there any plan to add this accessor operator to GRanges
>             so that access is normalised across all three of these types?
>
>
>         This does seem to come up every now and again. Most recently (I
>         think) in this post:
>
>         http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189
>
>         There is a *technically* correct reason as to why there's no `$`
>         accessor on GRanges that is explained there by Martin (it has to
>         do w/ GRanges extending the Vector class), but still ... it seems
>         that some of us feel like we still want it, even if it's bad for
>         us. :-)
>
>
>     Also note that, generally speaking, you can't expect to be able to
>     manipulate a GRanges exactly like a RangedData or like a data.frame.
>     There are important differences:
>
>      > rd
>     RangedData with 3 rows and 3 value columns across 1 space
>          space    ranges |      strand     score          id
>       <factor> <IRanges> | <character> <numeric> <character>
>     1        1    [1, 4] |           +       5.1           K
>     2        1    [2, 5] |           +       5.2           L
>     3        1    [3, 6] |           +       5.3           M
>      > gr <- as(rd, "GRanges")
>      > names(gr) <- letters[1:3]
>      > gr
>     GRanges with 3 ranges and 2 elementMetadata cols:
>         seqnames    ranges strand |     score          id
>            <Rle> <IRanges>  <Rle> | <numeric> <character>
>       a        1    [1, 4]      + |       5.1           K
>       b        1    [2, 5]      + |       5.2           L
>       c        1    [3, 6]      + |       5.3           M
>       ---
>       seqlengths:
>        1
>        6
>
>      > dim(rd)
>     [1] 3 3
>      > dim(as.data.frame(rd))
>     [1] 3 7
>      > dim(gr)
>     NULL
>
>      > length(rd)
>     [1] 1
>      > length(as.data.frame(rd))
>     [1] 7
>      > length(gr)
>     [1] 3
>
>      > names(rd)
>     [1] "1"
>      > names(as.data.frame(rd))
>     [1] "space" "start" "end" "width" "strand" "score" "id"
>      > names(gr)
>     [1] "a" "b" "c"
>
>     Each of them has its own notion of dim, length and names.
>     Therefore each of them needs to have its own notion of what $ does.
>     For GRanges, $ would have to be compatible with its notion of
>     length and names, i.e. gr$c would access the element named "c".
>     This is already the case for GRangesList, DNAStringSet, and probably
>     98% of the classes defined in the IRanges/GenomicRanges/Biostrings
>     infrastructure (I wish I could say 100%):
>
>      > grl <- GRangesList(tx1=gr, tx2=gr)
>      > grl$tx2
>     GRanges with 3 ranges and 2 elementMetadata cols:
>         seqnames    ranges strand |     score          id
>            <Rle> <IRanges>  <Rle> | <numeric> <character>
>       a        1    [1, 4]      + |       5.1           K
>       b        1    [2, 5]      + |       5.2           L
>       c        1    [3, 6]      + |       5.3           M
>       ---
>       seqlengths:
>        1
>        6
>
>      > DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
>       7-letter "DNAString" instance
>     seq: ACGTATT
>
>     I can see it's tempting to sacrifice consistency for convenience
>     but there are so many classes in the IRanges/GenomicRanges
>     infrastructure that IMO this would be a slippery slope in the long
>     run.
>
>
>
> We need to balance the concerns of the developers and the users.
> Developers who are building on top of low-level infrastructure need data
> structures with consistent API and behaviors, whereas for users,
> convenience is more of a concern.
>
> For any List derivative, I would be against having '$' mean anything
> other than what it does now. However, $ on a non-List Vector is not
> generally useful for extracting elements of the vector, and indeed R has
> disallowed the use of $ on atomic vectors.

IMO the difference between being a List derivative and being a non-List
Vector is too subtle to serve as the dividing line. This dividing line
wouldn't even necessarily give you what you want, e.g. IRanges and
DNAStringSet are List derivatives but few users/developers know or
care about this. If $ gives you an elementMetadata column of a GRanges,
how could we justify that it doesn't do so for an IRanges too?
Because IRanges and GRanges are so close and use the same notion of
length and names, this would be an unfortunate inconsistency,
a much more unfortunate one than the current inconsistencies between
RangedData and GRanges.

I'm concerned that going down that slope will leave us in a situation
where some of the 140+ classes in the IRanges/GenomicRanges/Biostrings
/ShortRead infrastructure do one thing with $, while others do
something else, with no clear dividing line, so every time I need
to use $ on an object, I need to check the man page for that kind
of object (I'm not very good at remembering the specifics of
140+ containers for such a standard operator as $). I'd rather
have them all do the same thing.

Sacrificing clean design for convenience might seem like a good idea
at first, but we've seen notable examples in base R where this
turned out not to be such a good deal in the long run... and in some
cases the convenience was finally abandoned (e.g. partial name matching
for list element access). The cost of the return ticket can be high,
and sometimes going back is just impossible (e.g. name mangling in
unlist()). In our own code, I recently had to go through a
deprecation/defunct cycle for the coercion methods from AtomicList to
atomic vectors in IRanges. It will take 1 year to completely get rid of
this and in the meantime it broke a lot of code.
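
For reference, here is a small base R sketch of the two conveniences
mentioned above. The list `x` and the vector `u` are made-up
illustration objects, and the comments describe what I'd expect from a
recent R, where `[[` matches names exactly by default while `$` still
matches partially:

```r
# A made-up list used only for illustration.
x <- list(score = 5.1, id = "K")

# `$` still does partial name matching on lists.
partial_dollar <- x$sc

# `[[` matches exactly by default, so an abbreviated name finds nothing.
exact_bracket <- x[["sc"]]

# Partial matching with `[[` has to be requested explicitly.
partial_bracket <- x[["sc", exact = FALSE]]

# unlist() builds new names by pasting the outer name and a position.
u <- unlist(list(a = 1:2, b = 3))

print(partial_dollar)
print(exact_bracket)
print(partial_bracket)
print(names(u))
```

The last line shows the kind of mangling I mean: the element named "a"
comes back under names that never existed in the input.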

> Thus, '$' is available for
> other uses, such as extracting elements from a List associated with a
> Vector (like elementMetadata). This might be confusing to people,
> because they might start thinking that the Vector object is actually a
> List, or something, but in practice this has not happened with
> RangedData. There have been few queries about the "shape" of RangedData.
> People seem to just get it.  Even with the proposed $ accessor, GRanges
> would remain comparatively simple. There is also something to be said
> for preserving the consistency between RangedData and GRanges, as
> evidenced by the multiple discussions on this topic.

I don't like having to type elementMetadata every time I need
to access an elementMetadata column either. But at least I know I can
rely on this to work *anywhere* (i.e. on any Vector derivative), *even*
on a RangedData. Note that both rd$score and elementMetadata(rd)$score
work on a RangedData but they do different things. Making gr$score
a shortcut for elementMetadata(gr)$score on a GRanges object
would actually introduce another inconsistency between RangedData
and GRanges, rather than eliminate one.
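
To make the ambiguity concrete, here is a toy sketch in plain R. The
class `ToyRanges`, its slots, and the two helper functions are
hypothetical and are NOT part of IRanges/GenomicRanges; they only mimic
an object that has both element names and a table of metadata columns:

```r
library(methods)

# Hypothetical toy class: element starts, element names, metadata columns.
setClass("ToyRanges",
         representation(start = "integer",
                        elt_names = "character",
                        meta = "data.frame"))

tr <- new("ToyRanges",
          start = 1:3,
          elt_names = c("a", "b", "c"),
          meta = data.frame(score = c(5.1, 5.2, 5.3),
                            c = c("K", "L", "M"),
                            stringsAsFactors = FALSE))

# Reading 1: the name refers to an element, consistent with length/names.
by_element <- function(x, name) x@start[match(name, x@elt_names)]

# Reading 2: the name refers to a metadata column.
by_column <- function(x, name) x@meta[[name]]

elt <- by_element(tr, "c")   # the start of the element named "c"
col <- by_column(tr, "c")    # the whole metadata column named "c"

print(elt)
print(col)
```

Both readings are reasonable for the same name "c", and a single $
operator can only pick one of them for a given class.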

H.

>
> Michael
>
>     Cheers,
>     H.
>
>
>         -steve
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org
>     Phone: (206) 667-5791
>     Fax: (206) 667-1319
>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319