[Bioc-devel] Feedback wanted on design of fixed-width Ranges class
Gabe Becker
becker.gabe at gene.com
Thu Nov 24 03:59:12 CET 2016
Hey all,
I just wanted to chime in on this as it relates to some work I'm doing with
Luke Tierney and Tomas Kalibera. There's another approach to this that will
be available in the near future (we hope).
Alternative internal representations of atomic vectors, including compact
representations, are coming to R, hopefully (though not guaranteed) in the
2017 release.
See https://svn.r-project.org/R/branches/ALTREP/ALTREP.html for more
details.
With this approach, we could simply have a length N integer vector for
width that only took 1 integer in memory for its payload (so long as its
data was accessed properly). There would likely be some gotchas but it
would let a "normal" GRanges/IRanges object exhibit the behavior you want.
It could very well be worth doing separately in the interim, but it might
also be good to leverage the machinery for this that R will be getting soon.
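To make the idea concrete, here is a rough sketch, in Python rather than R
and not using the ALTREP API at all. The class name CompactConstantInts and
its methods are made up purely for illustration. It only shows the general
shape: an object that reports length N and answers element access, while
storing a single integer. The expand() method illustrates the kind of gotcha
mentioned above, where code that needs the full data forces the whole vector
into memory.

```python
from collections.abc import Sequence


class CompactConstantInts(Sequence):
    """Hypothetical compact vector: N logical elements, one stored value."""

    def __init__(self, value, length):
        self._value = int(value)
        self._length = int(length)

    def __len__(self):
        # Logical length, not the amount of data actually stored.
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice of a constant vector is a shorter constant vector,
            # so the result stays compact.
            n = len(range(*index.indices(self._length)))
            return CompactConstantInts(self._value, n)
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("index out of range")
        return self._value

    def expand(self):
        # The gotcha: materializing allocates all N elements.
        return [self._value] * self._length


# All 1.1 billion CH loci have width 1, but only one integer is stored.
widths = CompactConstantInts(1, 1_100_000_000)
```

Code that goes through len() and indexing never sees the difference; only
something like expand() pays the full memory cost.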
I'm pretty excited about exploring applications to this stuff. I'm pretty
confident we'll be able to find more ways for it to synergize with the
Bioconductor infrastructure.
Best,
~G
On Wed, Nov 23, 2016 at 5:01 PM, Ryan <rct at thompsonclan.org> wrote:
> Is it possible to allow the width slot of IRanges to be either a normal
> vector or an Rle?
>
>
>
> On 11/23/16 6:18 PM, Peter Hickey wrote:
>
>> I've been toying with the idea of a fixed/constant width Ranges
>> subclass. The motivation comes from storing DNA methylation data at CH
>> loci (non-CpG methylation): there are 1.1 billion CH loci in the human
>> genome, so storing these as a GRanges object requires two integer
>> vectors of length 1.1 billion each, one for the @start slot and one for
>> the @width slot of the IRanges object in the @ranges slot. But in this
>> case, and perhaps
>> others, such as storing SNP data, we have a situation where all loci
>> have the same width, namely 1. Of course, you might argue such a
>> 2-fold reduction in size is purely academic, but I think it could be a
>> nice efficiency that's worth pursuing.
>>
>> I've sketched out two different prototypes, neither of which I've
>> worked up to a complete implementation; I'd like to get some feedback
>> on these two designs, along with a variation that I've not yet even
>> tried implementing, before I decide how/whether to proceed.
>>
>> The two approaches are:
>>
>> 1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better
>> name suggestions).
>>   a. The @width slot would be an integer vector of length 1
>>   b. [variation not yet implemented] The @width slot would be an Rle
>>      vector parallel to @start
>> 2. Modifying the IRanges class. The @width slot may be an integer
>> vector of length 1 or a vector parallel to @start
>>
>> [Upon reflection, I suppose there could be a '2b' where the @width
>> slot is an Rle, but I'm going to ignore this for now since in general
>> it would be inefficient when the ranges have (random) variable widths]
>>
>> # Pros of 1
>>
>> - It seems the proper thing is to create a new Ranges subclass
>> - No dangers associated with stuffing around with internals of the
>> IRanges class and clean code separation
>>
>> # Pros of 1b compared to 1a
>>
>> - Like for IRanges, the @width slot would remain parallel to the @start
>> slot
>>
>> # Cons of 1
>>
>> - Can't immediately use in a GRanges object because the @ranges slot
>> is classed as an IRanges object
>>   - Perhaps this could be changed to allow a Ranges object in the
>>     @ranges slot of a GRanges object?
>>   - Otherwise, would also need to implement a subclass of GenomicRanges
>>     (say, FWGRanges) that used a FWRanges object in the @ranges slot. This
>>     would necessitate a fair bit of code duplicated from GRanges methods.
>> - Methods like start<-, end<-, width<- would either have to
>>   - (A) return an error if the new object no longer has fixed/constant
>>     widths
>>   - (B) coerce it to an IRanges object (with or without warning) thus
>>     meaning these operations would not be strict endomorphisms
>> - Users would only get the space-savings of the FWRanges class if they
>> explicitly construct a FWRanges object or coerce a compatible IRanges
>> object to an FWRanges object
>> - Clean code separation from the IRanges class may also lead to
>> duplicated code
>>
>> # Cons of 1b compared to 1a
>>
>> - Endomorphic versions of methods like start<-, end<-, width<- could
>> create a @width slot that is twice the 'necessary' size (e.g., an Rle
>> representation of a vector that contains no 'runs').
>>
>> # Pros of 2
>>
>> - If properly implemented, the user wouldn't need to think about
>> whether the ranges were fixed or variable width, they'd just get the
>> most efficient representation
>>
>> # Cons of 2
>>
>> - This is fairly obvious, 2 would be a major (internal) change to a
>> core Bioconductor class
>> - The @width slot would no longer necessarily be parallel to @start
>> slot, e.g., code that does direct slot access via @width could easily
>> break (of course, the width() getter would be modified to return a
>> parallel vector to the @start slot, but people (*cough* me) have code
>> that does the wrong thing with respect to the use of getters vs.
>> direct slot access)
>> - New IRanges objects may be incompatible with earlier versions of IRanges
>>
>> Your feedback is much appreciated,
>> Pete
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
--
Gabriel Becker, Ph.D
Associate Scientist
Bioinformatics and Computational Biology
Genentech Research