[Bioc-devel] Meta data access with $ for GRanges vs RangedData vs data.frame
Hervé Pagès
hpages at fhcrc.org
Mon Mar 19 20:37:47 CET 2012
Hi Tim, Steve,
On 03/19/2012 05:50 AM, Steve Lianoglou wrote:
> Hi Tim,
>
> On Mon, Mar 19, 2012 at 8:02 AM, Tim Yates<TYates at picr.man.ac.uk> wrote:
> [snip]
>
>> `$.GRanges` = function( x, name ) {
>>   elementMetadata( x )[[ name ]]
>> }
>> `$<-.GRanges` = function( x, name, value ) {
>>   elementMetadata( x )[[ name ]] = value
>>   x
>> }
>>
>> Then all the above functions behave in a similar fashion, however I don't want to do this in my package, as I believe it will be a gateway to namespace hell…
>>
>> Is there any plan to add this accessor operator to GRanges so that access is normalised across all three of these types?
>
> This does seem to come up every now and again. Most recently (I think)
> in this post:
>
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/39169/focus=39189
>
> There is a *technically* correct reason as to why there's no `$`
> accessor on GRanges that is explained there by Martin (it has to do w/
> GRanges extending the Vector class), but still ... it seems that some
> of us feel like we still want it, even if it's bad for us. :-)
Also note that, generally speaking, you can't expect to be able to
manipulate a GRanges exactly like a RangedData or like a data.frame.
There are important differences:
> rd
RangedData with 3 rows and 3 value columns across 1 space
     space    ranges |      strand     score          id
  <factor> <IRanges> | <character> <numeric> <character>
1        1    [1, 4] |           +       5.1           K
2        1    [2, 5] |           +       5.2           L
3        1    [3, 6] |           +       5.3           M
> gr <- as(rd, "GRanges")
> names(gr) <- letters[1:3]
> gr
GRanges with 3 ranges and 2 elementMetadata cols:
    seqnames    ranges strand |     score          id
       <Rle> <IRanges>  <Rle> | <numeric> <character>
  a        1    [1, 4]      + |       5.1           K
  b        1    [2, 5]      + |       5.2           L
  c        1    [3, 6]      + |       5.3           M
---
seqlengths:
1
6
> dim(rd)
[1] 3 3
> dim(as.data.frame(rd))
[1] 3 7
> dim(gr)
NULL
> length(rd)
[1] 1
> length(as.data.frame(rd))
[1] 7
> length(gr)
[1] 3
> names(rd)
[1] "1"
> names(as.data.frame(rd))
[1] "space" "start" "end" "width" "strand" "score" "id"
> names(gr)
[1] "a" "b" "c"
Each of them has its own notion of dim, length and names.
Therefore each of them needs to have its own notion of what $ does.
For GRanges, $ would have to be consistent with its own notion of
length and names, i.e. gr$c would access the element named "c".
This is already the case for GRangesList, DNAStringSet, and probably
98% of the classes defined in the IRanges/GenomicRanges/Biostrings
infrastructure (I wish I could say 100%):
> grl <- GRangesList(tx1=gr, tx2=gr)
> grl$tx2
GRanges with 3 ranges and 2 elementMetadata cols:
    seqnames    ranges strand |     score          id
       <Rle> <IRanges>  <Rle> | <numeric> <character>
  a        1    [1, 4]      + |       5.1           K
  b        1    [2, 5]      + |       5.2           L
  c        1    [3, 6]      + |       5.3           M
---
seqlengths:
1
6
> DNAStringSet(c(aa="aa", bb="acgtatt"))$bb
7-letter "DNAString" instance
seq: ACGTATT
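The same principle can be checked in plain base R, without any of the
packages discussed here. This is only a minimal sketch: the objects
`lst` and `df` and the helper `consistent()` are made up for
illustration. For both a list and a data.frame, $ extracts exactly the
things that names() enumerates and length() counts, even though those
things are elements in one case and columns in the other.

```r
# Illustrative objects only; nothing here comes from IRanges/GenomicRanges.
lst <- list(a = 1:4, b = 2:5, c = 3:6)
df  <- data.frame(score = c(5.1, 5.2, 5.3),
                  id    = c("K", "L", "M"),
                  stringsAsFactors = FALSE)

# List: names() are element names, length() counts elements,
# and `$` extracts an element by name.
names(lst)    # "a" "b" "c"
length(lst)   # 3
lst$c         # 3 4 5 6

# data.frame: names() are column names, length() counts columns,
# and `$` extracts a column by name.
names(df)     # "score" "id"
length(df)    # 2
df$id         # "K" "L" "M"

# Hypothetical helper: TRUE if x$nm is identical to x[[nm]]
# for every nm in names(x).
consistent <- function(x) {
  all(vapply(names(x),
             function(nm) identical(do.call("$", list(x, nm)), x[[nm]]),
             logical(1)))
}
consistent(lst)   # TRUE
consistent(df)    # TRUE
```

A $ on GRanges that reached into the metadata columns would break this
agreement, since names(gr) and length(gr) refer to ranges, not columns.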
I can see it's tempting to sacrifice consistency for convenience,
but there are so many classes in the IRanges/GenomicRanges
infrastructure that IMO this would be a slippery slope in the long run.
Cheers,
H.
>
> -steve
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319