[Bioc-devel] Meta data access with $ for GRanges vs RangedData vs data.frame

Mon Mar 19 20:37:47 CET 2012

Hi Tim, Steve,

On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
> Hi Tim,
>
> On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates<TYates at picr.man.ac.uk>  wrote:
> [snip]
>
>> `$.GRanges`   = function( x, name )        {
>>   elementMetadata( x )[[ name ]]
>> }
>> `$<-.GRanges` = function( x, name, value ) {
>>   elementMetadata( x )[[ name ]] = value
>>   x
>> }
>>
>> Then all the above functions behave in a similar fashion, however I don't want to do this in my package, as I believe it will be a gateway to namespace hell…
>>
>> Is there any plan to add this accessor operator to GRanges so that access is normalised across all three of these types?
>
> This does seem to come up every now and again. Most recently (I think)
> in this post:
>
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189
>
> There is a *technically* correct reason as to why there's no `$`
> accessor on GRanges that is explained there by Martin (it has to do w/
> GRanges extending the Vector class), but still ... it seems that some
> of us feel like we still want it, even if it's bad for us. :-)

Also note that, generally speaking, you can't expect to be able to
manipulate a GRanges exactly like a RangedData or like a data.frame.
There are important differences:

 > rd
RangedData with 3 rows and 3 value columns across 1 space
      space    ranges |      strand     score          id
   <factor> <IRanges> | <character> <numeric> <character>
1        1    [1, 4] |           +       5.1           K
2        1    [2, 5] |           +       5.2           L
3        1    [3, 6] |           +       5.3           M
 > gr <- as(rd, "GRanges")
 > names(gr) <- letters[1:3]
 > gr
GRanges with 3 ranges and 2 elementMetadata cols:
     seqnames    ranges strand |     score          id
        <Rle> <IRanges>  <Rle> | <numeric> <character>
   a        1    [1, 4]      + |       5.1           K
   b        1    [2, 5]      + |       5.2           L
   c        1    [3, 6]      + |       5.3           M
   ---
   seqlengths:
    1
    6

 > dim(rd)
[1] 3 3
 > dim(as.data.frame(rd))
[1] 3 7
 > dim(gr)
NULL

 > length(rd)
[1] 1
 > length(as.data.frame(rd))
[1] 7
 > length(gr)
[1] 3

 > names(rd)
[1] "1"
 > names(as.data.frame(rd))
[1] "space"  "start"  "end"    "width"  "strand" "score"  "id"
 > names(gr)
[1] "a" "b" "c"

Each of them has its own notion of dim, length and names, so each
of them needs its own notion of what $ does. For GRanges, a consistent
$ would have to agree with its notion of length and names, i.e. gr$c
would access the element named "c". This is already the case for
GRangesList, DNAStringSet, and probably 98% of the classes defined in
the IRanges/GenomicRanges/Biostrings infrastructure (I wish I could
say 100%):

 > grl <- GRangesList(tx1=gr, tx2=gr)
 > grl$tx2
GRanges with 3 ranges and 2 elementMetadata cols:
     seqnames    ranges strand |     score          id
        <Rle> <IRanges>  <Rle> | <numeric> <character>
   a        1    [1, 4]      + |       5.1           K
   b        1    [2, 5]      + |       5.2           L
   c        1    [3, 6]      + |       5.3           M
   ---
   seqlengths:
    1
    6

 > DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
   7-letter "DNAString" instance
seq: ACGTATT
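
To make the consistency argument concrete, here is a rough base-R
sketch. It uses a hypothetical toy S4 class (ToyRanges, with made-up
slots and a made-up metaCol() helper), not the real GRanges, so treat
it as an illustration of the idea rather than of how GenomicRanges is
implemented. The point is that $ follows the object's own names(),
while metadata columns go through an explicit accessor:

```r
library(methods)

# Hypothetical toy class, NOT the real GRanges: a vector of start
# positions, one name per element, and per-element metadata columns.
setClass("ToyRanges",
         representation(start = "integer",
                        nms   = "character",
                        meta  = "data.frame"))

# Vector-like notion of length and names: one entry per range.
setMethod("length", "ToyRanges", function(x) length(x@start))
setMethod("names",  "ToyRanges", function(x) x@nms)

# `$` consistent with names(): x$c returns the element named "c".
setMethod("$", "ToyRanges", function(x, name) {
  i <- match(name, x@nms)
  if (is.na(i)) stop("no element named '", name, "'")
  new("ToyRanges",
      start = x@start[i],
      nms   = x@nms[i],
      meta  = x@meta[i, , drop = FALSE])
})

# Metadata columns get an explicit (hypothetical) accessor instead of `$`.
metaCol <- function(x, name) x@meta[[name]]

tr <- new("ToyRanges",
          start = 1:3,
          nms   = letters[1:3],
          meta  = data.frame(score = c(5.1, 5.2, 5.3),
                             id    = c("K", "L", "M"),
                             stringsAsFactors = FALSE))

length(tr)            # 3
names(tr)             # "a" "b" "c"
metaCol(tr$c, "id")   # "M": element named "c", then its id column
metaCol(tr, "score")  # 5.1 5.2 5.3
```

With this layout, names(x), length(x) and x$name all talk about the
same thing (the elements), and column access stays explicit, much like
elementMetadata(x)[[name]] in Tim's snippet above.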

I can see it's tempting to sacrifice consistency for convenience,
but there are so many classes in the IRanges/GenomicRanges
infrastructure that IMO this would be a slippery slope in the long run.

Cheers,
H.

>
> -steve
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319