[Bioc-devel] Feedback wanted on design of fixed-width Ranges class

Peter Hickey peter.hickey at gmail.com
Thu Nov 24 00:18:48 CET 2016


I've been toying with the idea of a fixed/constant width Ranges
subclass. The motivation comes from storing DNA methylation data at CH
loci (non-CpG methylation): there are 1.1 billion CH loci in the human
genome, so storing these as a GRanges object requires two integer
vectors, each of length 1.1 billion: one for the @start slot and one
for the @width slot of the IRanges object in the @ranges slot. But in
this case, and perhaps others, such as storing SNP data, all loci
have the same width, namely 1. Of course, you might argue such a
2-fold reduction in size is purely academic, but I think it could be a
nice efficiency that's worth pursuing.
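
For a sense of scale, here is a back-of-the-envelope estimate of the saving. It is only a sketch: it assumes 4 bytes per element (R integer vectors are 32-bit) and ignores all per-object overhead. It is written in Python rather than R purely for illustration, and the names `N_LOCI` and `ranges_bytes` are made up for this example.

```python
# Rough memory estimate for the @start and @width slots only.
# Assumptions: 4-byte integers, no per-object overhead counted.
N_LOCI = 1_100_000_000   # CH loci in the human genome (figure quoted above)
BYTES_PER_INT = 4


def ranges_bytes(n_loci, width_len):
    """Bytes for a start vector of length n_loci plus a width vector of length width_len."""
    return (n_loci + width_len) * BYTES_PER_INT


parallel_width = ranges_bytes(N_LOCI, N_LOCI)  # width stored per range
scalar_width = ranges_bytes(N_LOCI, 1)         # one shared width

print(f"parallel width vector: {parallel_width / 1e9:.1f} GB")  # 8.8 GB
print(f"single shared width:   {scalar_width / 1e9:.1f} GB")    # 4.4 GB
```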

I've sketched out two different prototypes, neither of which I've
worked up to a complete implementation; I'd like to get some feedback
on these two designs, along with a variation that I've not yet even
tried implementing, before I decide how/whether to proceed.

The two approaches are:

1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better
   name suggestions).
   a. The @width slot would be an integer vector of length 1
   b. [variation not yet implemented] The @width slot would be an Rle
      vector parallel to @start
2. Modifying the IRanges class. The @width slot may be an integer
   vector of length 1 or a vector parallel to @start

[Upon reflection, I suppose there could be a '2b' where the @width
slot is an Rle, but I'm going to ignore this for now since in general
it would be inefficient when the ranges have (random) variable widths]
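
To make approach 1a concrete, here is a minimal sketch of the data layout: one start per range and a single shared width, with accessors that still hand back vectors parallel to the starts. This is a Python stand-in rather than an S4 class, and the method names (`widths`, `end`) are hypothetical; only the name FWRanges comes from the proposal above.

```python
from array import array


class FWRanges:
    """Sketch of approach 1a: starts stored per range, width stored once."""

    def __init__(self, start, width):
        self.start = array("i", start)  # one 32-bit integer per range
        self.width = int(width)         # the length-1 'slot'
        if self.width < 0:
            raise ValueError("width must be non-negative")

    def __len__(self):
        return len(self.start)

    def widths(self):
        # Expand the shared width so callers still see a parallel vector.
        return array("i", [self.width]) * len(self.start)

    def end(self):
        # Closed intervals: end = start + width - 1.
        return array("i", (s + self.width - 1 for s in self.start))


loci = FWRanges([5, 10, 20], width=1)
print(len(loci), list(loci.widths()), list(loci.end()))
```

The point of the getter is that only code reaching into the stored field directly would see the compact representation.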

# Pros of 1

- It seems the proper thing is to create a new Ranges subclass
- No risk of breaking the internals of the IRanges class, plus clean
  code separation

# Pros of 1b compared to 1a

- Like for IRanges, the @width slot would remain parallel to the @start slot

# Cons of 1

- Can't immediately use in a GRanges object because the @ranges slot
  is classed as an IRanges object
  - Perhaps this could be changed to allow a Ranges object in the
    @ranges slot of a GRanges object?
  - Otherwise, would also need to implement a subclass of GenomicRanges
    (say, FWGRanges) that used a FWRanges object in the @ranges slot.
    This would necessitate a fair bit of code duplicated from GRanges
    methods.
- Methods like start<-, end<-, width<- would either have to
  - (A) return an error if the new object no longer has fixed/constant
    widths
  - (B) coerce it to an IRanges object (with or without warning) thus
    meaning these operations would not be strict endomorphisms
- Users would only get the space-savings of the FWRanges class if they
  explicitly construct a FWRanges object or coerce a compatible IRanges
  object to an FWRanges object
- Clean code separation from the IRanges class may also lead to
  duplicated code
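
The choice between (A) and (B) for the replacement methods can be sketched as follows. This is an illustration in Python, not R: `FixedRanges` and `VarRanges` are hypothetical stand-ins for FWRanges and IRanges, and `set_width` with its `policy` argument is an invented name for what a width<- method might do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VarRanges:
    """Stand-in for IRanges: widths parallel to starts."""
    start: List[int]
    width: List[int]


@dataclass
class FixedRanges:
    """Stand-in for FWRanges: one width shared by all ranges."""
    start: List[int]
    width: int


def set_width(x, value, policy="error"):
    """Replace the widths of a FixedRanges.

    policy="error"  -> option (A): refuse non-constant widths
    policy="coerce" -> option (B): fall back to the variable-width class
    """
    if policy not in ("error", "coerce"):
        raise ValueError("policy must be 'error' or 'coerce'")
    if len(value) not in (1, len(x.start)):
        raise ValueError("value must have length 1 or one entry per range")
    if len(set(value)) <= 1:
        # Still constant width, so the result keeps the same class.
        new_width = value[0] if value else x.width
        return FixedRanges(x.start, new_width)
    if policy == "error":
        raise ValueError("replacement widths are not constant")
    return VarRanges(x.start, list(value))  # no longer an endomorphism


x = FixedRanges([1, 5, 9], 1)
print(set_width(x, [3]))
print(set_width(x, [1, 2, 3], policy="coerce"))
```

Under (B) the class of the result depends on the values supplied, which is exactly the loss of strict endomorphism noted in the list above.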

# Cons of 1b compared to 1a

- Endomorphic versions of methods like start<-, end<-, width<- could
create a @width slot that is twice the 'necessary' size (e.g., an Rle
representation of a vector that contains no 'runs').
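
The doubling is easy to see with a toy run-length encoding. An Rle keeps one vector of run values and one of run lengths, so a vector with no repeated neighbours needs two stored elements per original element. The sketch below uses plain Python lists, and the helper names `rle` and `rle_size` are made up for this example.

```python
from itertools import groupby


def rle(values):
    """Run-length encode a sequence into (run_values, run_lengths)."""
    run_values, run_lengths = [], []
    for value, run in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in run))
    return run_values, run_lengths


def rle_size(values):
    """Number of elements stored by the run-length representation."""
    run_values, run_lengths = rle(values)
    return len(run_values) + len(run_lengths)


constant = [1] * 1000    # fixed widths: a single run
no_runs = [1, 2] * 500   # alternating widths: every run has length 1

print(rle_size(constant))  # 2
print(rle_size(no_runs))   # 2000, twice the 1000 elements of the plain vector
```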

# Pros of 2

- If properly implemented, the user wouldn't need to think about
  whether the ranges were fixed or variable width; they'd just get the
  most efficient representation

# Cons of 2

- This is fairly obvious: 2 would be a major (internal) change to a
  core Bioconductor class
- The @width slot would no longer necessarily be parallel to the @start
  slot, e.g., code that does direct slot access via @width could easily
  break (of course, the width() getter would be modified to return a
  vector parallel to the @start slot, but people (*cough* me) have code
  that does the wrong thing with respect to the use of getters vs.
  direct slot access)
- New IRanges objects may be incompatible with earlier versions of
  IRanges
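
A small sketch of approach 2 shows both the appeal and the hazard. The constructor picks the compact form automatically, the getter always returns a parallel vector, and code that reads the stored field directly silently gets the wrong length. This is Python standing in for S4, with `CompactRanges` and `_width` as hypothetical names for the modified class and its slot.

```python
class CompactRanges:
    """Sketch of approach 2: the stored width has length 1 when all widths agree."""

    def __init__(self, start, width):
        start, width = list(start), list(width)
        if len(width) != len(start):
            raise ValueError("start and width must have the same length")
        self.start = start
        # Collapse to a single element when every width is identical.
        self._width = width[:1] if len(set(width)) == 1 else width

    def width(self):
        # Getter: always parallel to start, whatever is stored.
        if len(self._width) == len(self.start):
            return list(self._width)
        return self._width * len(self.start)


x = CompactRanges([10, 20, 30], [1, 1, 1])

via_getter = [s + w - 1 for s, w in zip(x.start, x.width())]
via_slot = [s + w - 1 for s, w in zip(x.start, x._width)]  # direct access

print(via_getter)  # [10, 20, 30]
print(via_slot)    # [10], silently truncated
```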

Your feedback would be much appreciated,
Pete


