[Bioc-devel] Feedback wanted on design of fixed-width Ranges class

Peter Hickey peter.hickey at gmail.com
Thu Nov 24 00:45:14 CET 2016


Vince - From my understanding GPos mostly gains its efficiencies when
positions are adjacent, which is generally not the case for the types of
positions I'm considering. In fact, the @ranges slot of the @pos_runs slot
in a GPos object is just an IRanges object where n adjacent positions are
'compressed' into a single width-n range.
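
To make the compression point concrete, here is a minimal base-R sketch. It is not the actual GPos implementation, and `compress_positions` is a hypothetical helper invented for this illustration. It collapses sorted positions into (start, width) runs, so adjacent positions shrink to one range while scattered positions do not shrink at all.

```r
# Hypothetical helper, not part of GenomicRanges: collapse sorted, unique
# integer positions into maximal runs of adjacent positions.
compress_positions <- function(pos) {
  stopifnot(is.integer(pos), !is.unsorted(pos, strictly = TRUE))
  if (length(pos) == 0L) {
    return(data.frame(start = integer(0), width = integer(0)))
  }
  # A new run begins wherever the gap to the previous position is not 1.
  run_id <- cumsum(c(TRUE, diff(pos) != 1L))
  data.frame(
    start = as.integer(tapply(pos, run_id, min)),
    width = as.integer(tapply(pos, run_id, length))
  )
}

adjacent  <- compress_positions(1:10)
scattered <- compress_positions(c(3L, 10L, 42L, 99L))
nrow(adjacent)   # a single run covers all ten positions
nrow(scattered)  # one run per position, so nothing is saved
```

With isolated loci, the run table has as many rows as there are positions, which is the situation described above.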

(Also, FWRanges could generalise to intervals with a fixed width > 1.)
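
As a rough illustration of design 1a from the quoted message, the sketch below stores the starts plus one shared width. It is an assumption-laden toy: the class here does not extend the real Ranges class, and the accessor names `fw_start`, `fw_width` and `fw_end` are hypothetical, chosen to avoid masking the real generics. It also works through the memory arithmetic, assuming 4 bytes per R integer.

```r
library(methods)

# Toy version of option 1a: all ranges share a single width.
# Illustration only; this does not extend the real Ranges class.
setClass("FWRanges",
  representation(start = "integer", width = "integer"),
  validity = function(object) {
    if (length(object@width) != 1L || is.na(object@width) ||
        object@width < 0L) {
      return("'width' must be a single non-negative integer")
    }
    TRUE
  }
)

# Hypothetical constructor; width defaults to 1 (single-base loci).
FWRanges <- function(start, width = 1L) {
  new("FWRanges", start = as.integer(start), width = as.integer(width))
}

# Hypothetical accessors. The width getter expands the scalar so that
# callers still see a vector parallel to the starts.
fw_start <- function(x) x@start
fw_width <- function(x) rep.int(x@width, length(x@start))
fw_end   <- function(x) x@start + x@width - 1L

# Fixed width > 1 works the same way.
fw <- FWRanges(c(10L, 50L, 90L), width = 3L)
fw_width(fw)
fw_end(fw)

# Back-of-envelope memory for 1.1 billion loci at 4 bytes per integer.
n_loci        <- 1.1e9
bytes_per_int <- 4
two_slots_gb  <- 2 * n_loci * bytes_per_int / 1e9  # @start and @width
one_slot_gb   <- n_loci * bytes_per_int / 1e9      # @start only
c(two_slots_gb, one_slot_gb)
```

The validity check rejects a width vector of any length other than 1, which corresponds to option (A) for setters in the quoted message: fail rather than silently change class.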

On Thu, 24 Nov 2016 at 10:36 Vincent Carey <stvjc at channing.harvard.edu>
wrote:

> pace Wolfgang Huber ...
>
> Peter I don't mean to be rude.  Your comments deserve more study.  But it
> was fun to remember GPos, which I had forgotten.
>
> On Wed, Nov 23, 2016 at 6:34 PM, Vincent Carey <stvjc at channing.harvard.edu>
> wrote:
>
> library(GenomicRanges)
> class?GPos
>
> On Wed, Nov 23, 2016 at 6:18 PM, Peter Hickey <peter.hickey at gmail.com>
> wrote:
>
> I've been toying with the idea of a fixed/constant width Ranges
> subclass. The motivation comes from storing DNA methylation data at CH
> loci (non-CpG methylation): there are 1.1 billion CH loci in the human
> genome, so storing these as a GRanges object requires two integer vectors
> of length 1.1 billion, one for the @start and one for the @width slot of
> the IRanges object in the @ranges slot. But in this case, and perhaps
> others, such as storing SNP data, we have a situation where all loci
> have the same width, namely 1. Of course, you might argue such a
> 2-fold reduction in size is purely academic, but I think it could be a
> nice efficiency that's worth pursuing.
>
> I've sketched out two different prototypes, neither of which I've
> worked up to a complete implementation; I'd like to get some feedback
> on these two designs, along with a variation that I've not yet even
> tried implementing, before I decide how/whether to proceed.
>
> The two approaches are:
>
> 1. A new Ranges subclass, FWRanges (fixed-width Ranges; open to better
>    name suggestions).
>    a. The @width slot would be an integer vector of length 1
>    b. [variation not yet implemented] The @width slot would be an Rle
>       vector parallel to @start
> 2. Modifying the IRanges class. The @width slot may be an integer
>    vector of length 1 or a vector parallel to @start
>
> [Upon reflection, I suppose there could be a '2b' where the @width
> slot is an Rle, but I'm going to ignore this for now since in general
> it would be inefficient when the ranges have (random) variable widths]
>
> # Pros of 1
>
> - It seems the proper thing is to create a new Ranges subclass
> - No dangers associated with stuffing around with the internals of the
> IRanges class, plus clean code separation
>
> # Pros of 1b compared to 1a
>
> - Like for IRanges, the @width slot would remain parallel to the @start
> slot
>
> # Cons of 1
>
> - Can't immediately be used in a GRanges object because the @ranges slot
>   is classed as an IRanges object
>   - Perhaps this could be changed to allow a Ranges object in the
>     @ranges slot of a GRanges object?
>   - Otherwise, would also need to implement a subclass of GenomicRanges
>     (say, FWGRanges) that uses an FWRanges object in the @ranges slot. This
>     would necessitate a fair bit of code duplicated from GRanges methods.
> - Methods like start<-, end<-, width<- would either have to
>   - (A) return an error if the new object no longer has fixed/constant widths
>   - (B) coerce it to an IRanges object (with or without warning), meaning
>     these operations would not be strict endomorphisms
> - Users would only get the space savings of the FWRanges class if they
> explicitly construct an FWRanges object or coerce a compatible IRanges
> object to an FWRanges object
> - Clean code separation from the IRanges class may also lead to duplicated
> code
>
> # Cons of 1b compared to 1a
>
> - Endomorphic versions of methods like start<-, end<-, width<- could
> create a @width slot that is twice the 'necessary' size (e.g., an Rle
> representation of a vector that contains no 'runs').
>
> # Pros of 2
>
> - If properly implemented, the user wouldn't need to think about
> whether the ranges were fixed or variable width; they'd just get the
> most efficient representation
>
> # Cons of 2
>
> - This is fairly obvious: 2 would be a major (internal) change to a
> core Bioconductor class
> - The @width slot would no longer necessarily be parallel to the @start
> slot, so code that does direct slot access via @width could easily
> break (of course, the width() getter would be modified to return a
> parallel vector to the @start slot, but people (*cough* me) have code
> that does the wrong thing with respect to the use of getters vs.
> direct slot access)
> - New IRanges objects may be incompatible with earlier versions of IRanges
>
> Your feedback is much appreciated,
> Pete
>
