[Bioc-devel] Meta data access with $ for GRanges vs RangedData vs data.frame
Hervé Pagès
hpages at fhcrc.org
Tue Mar 20 01:17:28 CET 2012
On 03/19/2012 04:00 PM, Michael Lawrence wrote:
>
>
> On Mon, Mar 19, 2012 at 3:27 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Michael,
>
>
> On 03/19/2012 01:57 PM, Michael Lawrence wrote:
>
>
>
> On Mon, Mar 19, 2012 at 12:37 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Tim, Steve,
>
>
> On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
>
> Hi Tim,
>
> On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates <TYates at picr.man.ac.uk> wrote:
>
> [snip]
>
> `$.GRanges` <- function(x, name) {
>   elementMetadata(x)[[name]]
> }
> `$<-.GRanges` <- function(x, name, value) {
>   elementMetadata(x)[[name]] <- value
>   x
> }
>
> Then all the above functions behave in a similar fashion,
> however I don't want to do this in my package, as I believe
> it will be a gateway to namespace hell…
>
> Is there any plan to add this accessor operator to GRanges
> so that access is normalised across all three of these types?
>
>
> This does seem to come up every now and again. Most recently (I think)
> in this post:
>
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189
>
> There is a *technically* correct reason as to why there's no `$`
> accessor on GRanges that is explained there by Martin (it has to do
> with GRanges extending the Vector class), but still ... it seems that
> some of us feel like we still want it, even if it's bad for us. :-)
>
>
> Also note that, generally speaking, you can't expect to be able to
> manipulate a GRanges exactly like a RangedData or like a data.frame.
> There are important differences:
>
> > rd
> RangedData with 3 rows and 3 value columns across 1 space
>      space    ranges |      strand     score          id
>   <factor> <IRanges> | <character> <numeric> <character>
> 1        1    [1, 4] |           +       5.1           K
> 2        1    [2, 5] |           +       5.2           L
> 3        1    [3, 6] |           +       5.3           M
> > gr <- as(rd, "GRanges")
> > names(gr) <- letters[1:3]
> > gr
> GRanges with 3 ranges and 2 elementMetadata cols:
>     seqnames    ranges strand |     score          id
>        <Rle> <IRanges>  <Rle> | <numeric> <character>
>   a        1    [1, 4]      + |       5.1           K
>   b        1    [2, 5]      + |       5.2           L
>   c        1    [3, 6]      + |       5.3           M
>   ---
>   seqlengths:
>     1
>     6
>
> > dim(rd)
> [1] 3 3
> > dim(as.data.frame(rd))
> [1] 3 7
> > dim(gr)
> NULL
>
> > length(rd)
> [1] 1
> > length(as.data.frame(rd))
> [1] 7
> > length(gr)
> [1] 3
>
> > names(rd)
> [1] "1"
> > names(as.data.frame(rd))
> [1] "space" "start" "end" "width" "strand" "score" "id"
> > names(gr)
> [1] "a" "b" "c"
>
> Each of them has its own notion of dim, length and names.
> Therefore each of them needs to have its own notion of what $ does.
> For GRanges, $ would have to be compatible with its notion of length
> and names, i.e. gr$c would access the element named "c".
> This is already the case for GRangesList, DNAStringSet, and probably
> 98% of the classes defined in the IRanges/GenomicRanges/Biostrings
> infrastructure (I wish I could say 100%):
>
> > grl <- GRangesList(tx1=gr, tx2=gr)
> > grl$tx2
> GRanges with 3 ranges and 2 elementMetadata cols:
>     seqnames    ranges strand |     score          id
>        <Rle> <IRanges>  <Rle> | <numeric> <character>
>   a        1    [1, 4]      + |       5.1           K
>   b        1    [2, 5]      + |       5.2           L
>   c        1    [3, 6]      + |       5.3           M
>   ---
>   seqlengths:
>     1
>     6
>
> > DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
> 7-letter "DNAString" instance
> seq: ACGTATT
>
> I can see it's tempting to sacrifice consistency for convenience
> but there are so many classes in the IRanges/GenomicRanges
> infrastructure that IMO this would be a slippery slope in the long run.
>
>
>
> We need to balance the concerns of the developers and the users.
> Developers who are building on top of low-level infrastructure need data
> structures with consistent APIs and behaviors, whereas for users,
> convenience is more of a concern.
>
> For any List derivative, I would be against having '$' mean anything
> other than what it does now. However, $ on a non-List Vector is not
> generally useful for extracting elements of the vector, and indeed R has
> disallowed the use of $ on atomic vectors.
>
>
> IMO the difference between being a List derivative or a non-List Vector
> is too subtle to take this as the dividing line. This dividing line
> wouldn't even necessarily give you what you want e.g. IRanges or
> DNAStringSet are List derivatives but few users/developers know or
> care about this.
>
>
> I think this hits at the core of the issue. Very few users understand
> the low-level intricacies of the IRanges infrastructure and are instead
> guided by their intuition. Empirically, it seems few have tried to
> manipulate the element metadata of IRanges, probably due to a lack of a
> use case. Similarly, most people do not even realize that one could use
> elementMetadata on a RangedData, nor would they often need to do so.
> They instead think in high-level notions such as having "values" on
> ranges, where GRanges and RangedData could have more consistency.
>
> This is all about choosing a balance between high-level convenience and
> low-level consistency, and one's opinion probably strongly depends on
> how often one needs (and maybe forgets) to type "values()" in a day, vs.
> how often one writes "setMethod(...)" in a day.
>
> Looking back at Tim's question, perhaps we need some sort of
> compatibility API for package authors, so that RangedData and GRanges
> could be treated identically for many operations. I've found that there
> is often a need to get a value column using the same API. This is not
> presently possible.
Yes, most of the time RangedData and GRanges are interchangeable. It
would be good to know why users/developers need to be able to switch
back and forth between the two representations. Maybe that could be
avoided by adding the functionality that is missing on either side.
I guess bringing the two containers exactly on par won't be possible
(otherwise that would mean they are 100% redundant?), and in some
rare situations, switching back and forth will still be the only
solution. Having specific use cases would help.

I would still prefer trying to work in that direction rather than
encouraging the users/developers to switch back and forth or to write
code that works on both containers. In the case of Tim's package, for
example, it's of course a good idea to write tools that support
different types of input, but maybe the input could be coerced to
one type (the "native" type) and all the internal code written to work
on that "native" type only? It's a technique I personally use a lot
in my own code and I find it makes the code cleaner and *much*
easier to maintain, e.g. it's straightforward to add support for
additional types, and I'm not at the mercy of subtle changes
in the API of *one* of the many types I support (I only need to
know and care about the API of the "native" type). Even if the
initial coercion introduces some overhead, I find this approach
really worth it.
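
To make the "coerce once, then work on the native type" idea concrete, here
is a rough sketch using only base R's methods package. Everything in it is
made up for illustration: the NativeRanges class is a toy stand-in for
whatever container a package picks as its native type (GRanges, say), and
the toNative() generic and meanScorePerBase() function are hypothetical
names, not part of IRanges or GenomicRanges.

```r
library(methods)

# Toy stand-in for a package's "native" container (hypothetical; in a real
# package this role could be played by e.g. GRanges).
setClass("NativeRanges",
         representation(start = "integer", end = "integer", score = "numeric"))

# Hypothetical generic: one coercion method per supported input type.
setGeneric("toNative", function(x, ...) standardGeneric("toNative"))

# Native input passes through unchanged.
setMethod("toNative", "NativeRanges", function(x, ...) x)

# data.frame input: assumed to carry 'start', 'end' and 'score' columns.
setMethod("toNative", "data.frame", function(x, ...) {
    stopifnot(all(c("start", "end", "score") %in% names(x)))
    new("NativeRanges",
        start = as.integer(x$start),
        end   = as.integer(x$end),
        score = as.numeric(x$score))
})

# Internal code only knows about the native type.
meanScorePerBase <- function(x) {
    x <- toNative(x)                 # single coercion at the entry point
    width <- x@end - x@start + 1L    # ranges are inclusive on both ends
    sum(x@score * width) / sum(width)
}

df <- data.frame(start = c(1, 2, 3), end = c(4, 5, 6),
                 score = c(5.1, 5.2, 5.3))
meanScorePerBase(df)
```

Supporting another input type then only means adding one more toNative()
method; meanScorePerBase() and any other internal function stay untouched.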
Cheers,
H.
>
> Michael
>
> If $ gives you an elementMetadata column of a GRanges,
> how could we justify that it doesn't do so for an IRanges too?
> Because IRanges and GRanges are so close and use the same notion of
> length and names, this would be an unfortunate inconsistency,
> a much more unfortunate one than the current inconsistencies between
> RangedData and GRanges.
>
> I have concerns that going on that slope will give us a situation
> where some of the 140+ classes in the IRanges/GenomicRanges/Biostrings/ShortRead
> infrastructure do one thing with $, while others do
> something else, with no clear dividing line, so every time I need
> to use $ on an object, I need to check the man page for that kind
> of object (I'm not very good at remembering the specificities of
> 140+ containers for such a standard operator as $). I'd rather
> have everybody do the same thing.
>
> Sacrificing clean design for convenience might seem like a good idea
> at first sight but we've seen notable examples in base R where this
> turned out to not be such a good deal in the long run... and in some
> cases the convenience was finally abandoned (e.g. name partial matching
> for list element access). The cost of the return ticket can be high,
> sometimes it's just impossible (e.g. name mangling in unlist()). In
> our own code, I recently had to go thru a deprecation/defunct cycle
> for the coercion methods from AtomicList to atomic vectors in IRanges.
> It will take 1 year to completely get rid of this and in the meantime
> it broke a lot of code.
>
>
> Thus, '$' is available for other uses, such as extracting elements
> from a List associated with a Vector (like elementMetadata). This
> might be confusing to people, because they might start thinking that
> the Vector object is actually a List, or something, but in practice
> this has not happened with RangedData. There have been few queries
> about the "shape" of RangedData. People seem to just get it. Even
> with the proposed $ accessor, GRanges would remain comparatively
> simple. There is also something to be said for preserving the
> consistency between RangedData and GRanges, as evidenced by the
> multiple discussions on this topic.
>
>
> I don't like having to type elementMetadata either every time I need
> to access an elementMetadata column. But at least I know I can rely
> on this to work *anywhere* (i.e. any Vector derivative), *even* on a
> RangedData. Note that both rd$score and elementMetadata(rd)$score
> work on a RangedData but they do different things. Having gr$score
> being a shortcut for elementMetadata(gr)$score on a GRanges object
> would actually introduce another inconsistency between RangedData
> and GRanges, rather than eliminate one.
>
> H.
>
>
> Michael
>
> Cheers,
> H.
>
>
> -steve
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
> ___________________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319