[Bioc-devel] Meta data access with $ for GRanges vs RangedData vs data.frame

Hervé Pagès hpages at fhcrc.org
Tue Mar 20 01:17:28 CET 2012


On 03/19/2012 04:00 PM, Michael Lawrence wrote:
>
>
> On Mon, Mar 19, 2012 at 3:27 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     Hi Michael,
>
>
>     On 03/19/2012 01:57 PM, Michael Lawrence wrote:
>
>
>
>         On Mon, Mar 19, 2012 at 12:37 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>             Hi Tim, Steve,
>
>
>             On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
>
>                 Hi Tim,
>
>                 On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates
>                 <TYates at picr.man.ac.uk> wrote:
>
>                 [snip]
>
>                     `$.GRanges`   = function( x, name )        {
>                       elementMetadata( x )[[ name ]]
>                     }
>                     `$<-.GRanges` = function( x, name, value ) {
>                       elementMetadata( x )[[ name ]] = value
>                       x
>                     }
>
>                     Then all the above functions behave in a similar fashion,
>                     however I don't want to do this in my package, as I believe
>                     it will be a gateway to namespace hell…
>
>                     Is there any plan to add this accessor operator to GRanges
>                     so that access is normalised across all three of these types?
>
>
>                 This does seem to come up every now and again. Most recently
>                 (I think) in this post:
>
>                 http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189
>
>                 There is a *technically* correct reason as to why there's no `$`
>                 accessor on GRanges that is explained there by Martin (it has to
>                 do w/ GRanges extending the Vector class), but still ... it seems
>                 that some of us feel like we still want it, even if it's bad for
>                 us. :-)
>
>
>             Also note that, generally speaking, you can't expect to be able to
>             manipulate a GRanges exactly like a RangedData or like a data.frame.
>             There are important differences:
>
>             > rd
>             RangedData with 3 rows and 3 value columns across 1 space
>                  space    ranges |      strand     score          id
>               <factor> <IRanges> | <character> <numeric> <character>
>             1        1    [1, 4] |           +       5.1           K
>             2        1    [2, 5] |           +       5.2           L
>             3        1    [3, 6] |           +       5.3           M
>             > gr <- as(rd, "GRanges")
>             > names(gr) <- letters[1:3]
>             > gr
>             GRanges with 3 ranges and 2 elementMetadata cols:
>                 seqnames    ranges strand |     score          id
>                    <Rle> <IRanges>  <Rle> | <numeric> <character>
>               a        1    [1, 4]      + |       5.1           K
>               b        1    [2, 5]      + |       5.2           L
>               c        1    [3, 6]      + |       5.3           M
>               ---
>               seqlengths:
>                1
>                6
>
>             > dim(rd)
>             [1] 3 3
>             > dim(as.data.frame(rd))
>             [1] 3 7
>             > dim(gr)
>             NULL
>
>             > length(rd)
>             [1] 1
>             > length(as.data.frame(rd))
>             [1] 7
>             > length(gr)
>             [1] 3
>
>             > names(rd)
>             [1] "1"
>             > names(as.data.frame(rd))
>             [1] "space" "start" "end" "width" "strand" "score" "id"
>             > names(gr)
>             [1] "a" "b" "c"
>
>             Each of them has its own notion of dim, length and names.
>             Therefore each of them needs to have its own notion of what $ does.
>             For GRanges, a consistent $ would have to agree with its
>             notion of length and names, i.e. gr$c would access the element
>             named "c". This is already the case for GRangesList, DNAStringSet,
>             and probably 98% of the classes defined in the
>             IRanges/GenomicRanges/Biostrings infrastructure (I wish I could
>             say 100%):
>
>             > grl <- GRangesList(tx1=gr, tx2=gr)
>             > grl$tx2
>             GRanges with 3 ranges and 2 elementMetadata cols:
>                 seqnames    ranges strand |     score          id
>                    <Rle> <IRanges>  <Rle> | <numeric> <character>
>               a        1    [1, 4]      + |       5.1           K
>               b        1    [2, 5]      + |       5.2           L
>               c        1    [3, 6]      + |       5.3           M
>               ---
>               seqlengths:
>                1
>                6
>
>             > DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
>               7-letter "DNAString" instance
>             seq: ACGTATT
>
>             I can see it's tempting to sacrifice consistency for convenience
>             but there are so many classes in the IRanges/GenomicRanges
>             infrastructure that IMO this would be a dangerous slope in the
>             long run.
>
>
>
>         We need to balance the concerns of the developers and the users.
>         Developers who are building on top of low-level infrastructure need
>         data structures with consistent API and behaviors, whereas for users,
>         convenience is more of a concern.
>
>         For any List derivative, I would be against having '$' mean anything
>         other than what it does now. However, $ on a non-List Vector is not
>         generally useful for extracting elements of the vector, and indeed R
>         has disallowed the use of $ on atomic vectors.
>
>
>     IMO the difference between being a List derivative or a non-List Vector
>     is too subtle to take this as the dividing line. This dividing line
>     wouldn't even necessarily give you what you want e.g. IRanges or
>     DNAStringSet are List derivatives but few users/developers know or
>     care about this.
>
>
> I think this hits at the core of the issue. Very few users understand
> the low-level intricacies of the IRanges infrastructure and are instead
> guided by their intuition. Empirically, it seems few have tried to
> manipulate the element metadata of IRanges, probably due to a lack of a
> use case. Similarly, most people do not even realize that one could use
> elementMetadata on a RangedData, nor would they often need to do so.
> They instead think in high-level notions such as having "values" on
> ranges, where GRanges and RangedData could have more consistency.
>
> This is all about choosing a balance between high-level convenience and
> low-level consistency, and one's opinion probably strongly depends on
> how often one needs (and maybe forgets) to type "values()" in a day, vs.
> how often one writes "setMethod(...)" in a day.
>
> Looking back at Tim's question, perhaps we need some sort of
> compatibility API for package authors, so that RangedData and GRanges
> could be treated identically for many operations. I've found that there
> is often a need to get a value column using the same API. This is not
> presently possible.

Yes, most of the time RangedData and GRanges are interchangeable. It
would be good to know why users/developers need to be able to switch
back and forth between the 2 representations. Maybe that could be
avoided by adding the functionalities that are missing on either side.
I guess bringing the 2 containers exactly on par won't be possible
(otherwise that would mean they are 100% redundant?), and in some
rare situations, switching back and forth will still be the only
solution. Having specific use cases would help.

I would still prefer trying to work in that direction rather than
encouraging the users/developers to switch back and forth or to write
code that works on both containers. In the case of Tim's package for
example, it's of course a good idea to write tools that support
different types of input, but maybe the input could be coerced to
1 type (the "native" type) and all the internal code written to work
on that "native" type only? It's a technique I personally use a lot
in my own code and I find it makes the code cleaner and *much*
easier to maintain e.g. it's straightforward to add support for
additional types and also I'm not at the mercy of subtle changes
in the API of *one* of the many types I support (I only need to
know and care about the API of the "native" type). Even if the
initial coercion introduces some overhead, I find this approach
really worth it.
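
A minimal sketch of that coerce-once approach, in base R only, might look
like the following. The names toNative() and rangeWidths() are hypothetical,
and a plain data.frame stands in for the "native" type so that the example
runs without any Bioconductor package; in a real package the native type
would more likely be GRanges, with the coercion done by something like
as(x, "GRanges").

```r
# Hypothetical sketch: coerce every supported input to ONE "native" type at
# the package boundary, then write all internal code against that type only.
# A data.frame with integer 'start' and 'end' columns plays the native type.

toNative <- function(x, ...) UseMethod("toNative")

# The native type passes through after a basic sanity check.
toNative.data.frame <- function(x, ...) {
  missing_cols <- setdiff(c("start", "end"), names(x))
  if (length(missing_cols) > 0L)
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  x
}

# A second supported input: a two-column matrix of start/end positions.
toNative.matrix <- function(x, ...) {
  stopifnot(ncol(x) == 2L)
  data.frame(start = x[, 1L], end = x[, 2L])
}

# Anything else is rejected up front instead of failing deep inside.
toNative.default <- function(x, ...) {
  stop("unsupported input type: ", class(x)[1L])
}

# Internal code only needs to know the API of the native type.
rangeWidths <- function(x) {
  native <- toNative(x)
  native$end - native$start + 1L
}

rangeWidths(data.frame(start = c(1L, 2L, 3L), end = c(4L, 5L, 6L)))
rangeWidths(cbind(c(1L, 2L, 3L), c(4L, 5L, 6L)))
```

Supporting an additional input type then only means adding one more
toNative() method; rangeWidths() and the rest of the internal code stay
untouched.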

Cheers,
H.

>
> Michael
>
>     If $ gives you an elementMetadata column of a GRanges,
>     how could we justify that it doesn't do so for an IRanges too?
>     Because IRanges and GRanges are so close and use the same notion of
>     length and names, this would be an unfortunate inconsistency,
>     a much more unfortunate one than the current inconsistencies between
>     RangedData and GRanges.
>
>     I have concerns that going down that slope will give us a situation
>     where some of the 140+ classes in the
>     IRanges/GenomicRanges/Biostrings/ShortRead infrastructure do one
>     thing with $, while others do something else, with no clear dividing
>     line, so every time I need to use $ on an object, I need to check the
>     man page for that kind of object (I'm not very good at remembering the
>     specificities of 140+ containers for such a standard operator as $).
>     I'd rather have everybody do the same thing.
>
>     Sacrificing clean design for convenience might seem like a good idea
>     at first sight but we've seen notable examples in base R where this
>     turned out to not be such a good deal in the long run... and in some
>     cases the convenience was finally abandoned (e.g. name partial matching
>     for list element access). The cost of the return ticket can be high,
>     sometimes it's just impossible (e.g. name mangling in unlist()). In
>     our own code, I recently had to go thru a deprecation/defunct cycle
>     for the coercion methods from AtomicList to atomic vectors in IRanges.
>     It will take 1 year to completely get rid of this and in the meantime
>     it broke a lot of code.
>
>
>         Thus, '$' is available for other uses, such as extracting elements
>         from a List associated with a Vector (like elementMetadata). This
>         might be confusing to people, because they might start thinking that
>         the Vector object is actually a List, or something, but in practice
>         this has not happened with RangedData. There have been few queries
>         about the "shape" of RangedData. People seem to just get it.  Even
>         with the proposed $ accessor, GRanges would remain comparatively
>         simple. There is also something to be said for preserving the
>         consistency between RangedData and GRanges, as evidenced by the
>         multiple discussions on this topic.
>
>
>     I don't like having to type elementMetadata either every time I need
>     to access an elementMetadata column. But at least I know I can rely
>     on this to work *anywhere* (i.e. any Vector derivative), *even* on a
>     RangedData. Note that both rd$score and elementMetadata(rd)$score
>     work on a RangedData but they do different things. Having gr$score
>     being a shortcut for elementMetadata(gr)$score on a GRanges object
>     would actually introduce another inconsistency between RangedData
>     and GRanges, rather than eliminate one.
>
>     H.
>
>
>         Michael
>
>             Cheers,
>             H.
>
>
>                 -steve
>
>
>
>             --
>             Hervé Pagès
>
>             Program in Computational Biology
>             Division of Public Health Sciences
>             Fred Hutchinson Cancer Research Center
>             1100 Fairview Ave. N, M1-B514
>             P.O. Box 19024
>             Seattle, WA 98109-1024
>
>             E-mail: hpages at fhcrc.org
>             Phone: (206) 667-5791
>             Fax: (206) 667-1319
>
>
>             _______________________________________________
>             Bioc-devel at r-project.org mailing list
>             https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


