[Bioc-devel] Feedback wanted on design of fixed-width Ranges class

Ryan rct at thompsonclan.org
Thu Nov 24 04:14:19 CET 2016


Hi all,

In addition to the technical concerns, I suppose we should consider 
whether fixed-width ranges are conceptually different enough from 
general ranges to warrant a separate class, or whether this is just 
being considered for purely technical reasons. My feeling is that 
fixed-width ranges aren't sufficiently different from general ranges to 
justify a separate class. The two main uses I can think of for 
fixed-width ranges are genomic positions (i.e. length 1 ranges) and 
cases like "1kb upstream of" or "1kb radius around" a set of specified 
positions. But even for that case, fixed-width ranges are not 
necessarily usable because a position less than 1kb from the end of a 
chromosome would require a truncated range. (What behavior would we 
expect from a hypothetical FWRanges class in this case?)
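
To make the truncation problem concrete, here is a minimal sketch in
plain Python (not the IRanges API). The function names, the 10 kb
chromosome length, and the positions are all made up for illustration.
It builds a "1kb radius around" window for each position and clips it
to the chromosome bounds, assuming 1-based inclusive coordinates:

```python
def flank_around(positions, radius, chrom_length):
    """Build a window of `radius` bases on each side of every position,
    clipped to the chromosome bounds [1, chrom_length] (1-based, inclusive)."""
    ranges = []
    for pos in positions:
        start = max(1, pos - radius)           # clip at the chromosome start
        end = min(chrom_length, pos + radius)  # clip at the chromosome end
        ranges.append((start, end))
    return ranges


def widths(ranges):
    """Width of each inclusive (start, end) range."""
    return [end - start + 1 for start, end in ranges]


# Hypothetical toy chromosome of 10 kb with three positions of interest.
chrom_length = 10_000
positions = [5_000, 9_500, 300]

windows = flank_around(positions, 1_000, chrom_length)
w = widths(windows)

# True only if every window ended up with the same width.
is_fixed_width = len(set(w)) == 1

print(windows)
print(w)
print(is_fixed_width)
```

Only the interior position keeps the full width here; the two windows
near the chromosome ends come out shorter, so the set as a whole is no
longer fixed-width unless clipping is disallowed or handled some other
way.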

-Ryan

On 11/23/16 8:01 PM, Ryan wrote:
> Is it possible to allow the width slot of IRanges to be either a 
> normal vector or an Rle?
>
>
> On 11/23/16 6:18 PM, Peter Hickey wrote:
>> I've been toying with the idea of a fixed/constant width Ranges
>> subclass. The motivation comes from storing DNA methylation data at CH
>> loci (non-CpG methylation): there are 1.1 billion CH loci in the human
>> genome, so storing these as a GRanges object requires two integer
>> vectors of length 1.1 billion, one for the @start and one for the
>> @width slot of the IRanges object in the @ranges slot. But in this
>> case, and perhaps others, such as storing SNP data, we have a
>> situation where all loci
>> have the same width, namely 1. Of course, you might argue such a
>> 2-fold reduction in size is purely academic, but I think it could be a
>> nice efficiency that's worth pursuing.
>>
>> I've sketched out two different prototypes, neither of which I've
>> worked up to a complete implementation; I'd like to get some feedback
>> on these two designs, along with a variation that I've not yet even
>> tried implementing, before I decide how/whether to proceed.
>>
>> The two approaches are:
>>
>> 1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better
>> name suggestions).
>> a. The @width slot would be an integer vector of length 1
>> b. [variation not yet implemented] The @width slot would be an Rle
>> vector parallel to @start
>> 2. Modifying the IRanges class. The @width slot may be an integer
>> vector of length 1 or a vector parallel to @start
>>
>> [Upon reflection, I suppose there could be a '2b' where the @width
>> slot is an Rle, but I'm going to ignore this for now since in general
>> it would be inefficient when the ranges have (random) variable widths]
>>
>> # Pros of 1
>>
>> - It seems the proper thing is to create a new Ranges subclass
>> - No dangers associated with stuffing around with internals of the
>> IRanges class and clean code separation
>>
>> # Pros of 1b compared to 1a
>>
>> - Like for IRanges, the @width slot would remain parallel to the 
>> @start slot
>>
>> # Cons of 1
>>
>> - Can't immediately use in a GRanges object because the @ranges slot
>> is classed as an IRanges object
>> - Perhaps this could be changed to allow a Ranges object in the
>> @ranges slot of a GRanges object?
>> - Otherwise, would also need to implement a subclass of GenomicRanges
>> (say, FWGRanges) that used a FWRanges object in the @ranges slot. This
>> would necessitate a fair bit of code duplicated from GRanges methods.
>> - Methods like start<-, end<-, width<- would either have to
>> - (A) return an error if the new object no longer has fixed/constant 
>> widths
>> - (B) coerce it to an IRanges object (with or without warning) thus
>> meaning these operations would not be strict endomorphisms
>> - Users would only get the space-savings of the FWRanges class if they
>> explicitly construct a FWRanges object or coerce a compatible IRanges
>> object to an FWRanges object
>> - Clean code separation from the IRanges class may also lead to 
>> duplicated code
>>
>> # Cons of 1b compared to 1a
>>
>> - Endomorphic versions of methods like start<-, end<-, width<- could
>> create a @width slot that is twice the 'necessary' size (e.g., an Rle
>> representation of a vector that contains no 'runs').
>>
>> # Pros of 2
>>
>> - If properly implemented, the user wouldn't need to think about
>> whether the ranges were fixed or variable width, they'd just get the
>> most efficient representation
>>
>> # Cons of 2
>>
>> - This is fairly obvious: 2 would be a major (internal) change to a
>> core Bioconductor class
>> - The @width slot would no longer necessarily be parallel to @start
>> slot, e.g., code that does direct slot access via @width could easily
>> break (of course, the width() getter would be modified to return a
>> parallel vector to the @start slot, but people (*cough* me) have code
>> that does the wrong thing with respect to the use of getters vs.
>> direct slot access)
>> - New IRanges objects may be incompatible with earlier versions of 
>> IRanges
>>
>> Your feedback is very appreciated,
>> Pete
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


