[Bioc-devel] Feedback wanted on design of fixed-width Ranges class

Vincent Carey stvjc at channing.harvard.edu
Thu Nov 24 00:36:16 CET 2016


pace Wolfgang Huber ...

Peter I don't mean to be rude.  Your comments deserve more study.  But it
was fun to remember GPos, which I had forgotten.

On Wed, Nov 23, 2016 at 6:34 PM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:

> library(GenomicRanges)
> class?GPos
>
> On Wed, Nov 23, 2016 at 6:18 PM, Peter Hickey <peter.hickey at gmail.com>
> wrote:
>
>> I've been toying with the idea of a fixed/constant width Ranges
>> subclass. The motivation comes from storing DNA methylation data at CH
>> loci (non-CpG methylation): there are 1.1 billion CH loci in the human
>> genome, so to store these as a GRanges object requires 2 x 1.1 billion
>> integer vectors, one for the @start and one for the @width slots of
>> the IRanges object in the @ranges slot. But in this case, and perhaps
>> others, such as storing SNP data, we have a situation where all loci
>> have the same width, namely 1. Of course, you might argue such a
>> 2-fold reduction in size is purely academic, but I think it could be a
>> nice efficiency that's worth pursuing.
>>
>> I've sketched out two different prototypes, neither of which I've
>> worked up to a complete implementation; I'd like to get some feedback
>> on these two designs, along with a variation that I've not yet even
>> tried implementing, before I decide how/whether to proceed.
>>
>> The two approaches are:
>>
>> 1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better
>> name suggestions).
>> a. The @width slot would be an integer vector of length 1
>> b. [variation not yet implemented] The @width slot would be an Rle
>> vector parallel to @start
>> 2. Modifying the IRanges class. The @width slot may be a integer
>> vector of length 1 or a vector parallel to @start
>>
>> [Upon reflection, I suppose there could be a '2b' where the @width
>> slot is an Rle, but I'm going to ignore this for now since in general
>> it would be inefficient when the ranges have (random) variable widths]
>>
>> # Pros of 1
>>
>> - It seems the proper thing is to create a new Ranges subclass
>> - No dangers associated with stuffing around with internals of the
>> IRanges class and clean code separation
>>
>> # Pros of 1b compared to 1a
>>
>> - Like for IRanges, the @width slot would remain parallel to the @start
>> slot
>>
>> # Cons of 1
>>
>> - Can't immediately use in a GRanges object because the @ranges slot
>> is classed as an IRanges object
>> - Perhaps this could be changed to allow a Ranges object in the
>> @ranges slot of a GRanges object?
>> - Otherwise, would also need to implement a subclass of GenomicRanges
>> (say, FWGRanges) that used a FWRanges object in the @ranges slot. This
>> would necessitate a fair bit of code duplicated from GRanges methods.
>> - Methods like start<-, end<-, width<- would either have to
>> - (A) return an error if the new object no longer has fixed/constant
>> widths
>> - (B) coerce it to an IRanges object (with or without warning) thus
>> meaning these operations would not be strict endomorphisms
>> - Users would only get the space-savings of the FWRanges class if they
>> explicitly construct a FWRanges object or coerce a compatible IRanges
>> object to an FWRanges object
>> - Clean code separation from the IRanges class may also lead to
>> duplicated code
>>
>> # Cons of 1b compared to 1a
>>
>> - Endomorphic versions of methods like start<-, end<-, width<- could
>> create a @width slot that is twice the 'necessary' size (e.g., an Rle
>> representation of a vector that contains no 'runs').
>>
>> # Pros of 2
>>
>> - If properly implemented, the user wouldn't need to think about
>> whether the ranges were fixed or variable width, they'd just get the
>> most efficient representation
>>
>> # Cons of 2
>>
>> - This is fairly obvious, 2 would be a major (internal) change to a
>> core Bioconductor class
>> - The @width slot would no longer necessarily be parallel to @start
>> slot, e.g., code that does direct slot access via @width could easily
>> break (of course, the width() getter would be modified to return a
>> parallel vector to the @start slot, but people (*cough* me) have code
>> that does the wrong thing with respect to the use of getters vs.
>> direct slot access)
>> - New IRanges objects may be incompatible with earlier version of IRanges
>>
>> Your feedback is very appreciated,
>> Pete
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list