[Bioc-devel] Feedback wanted on design of fixed-width Ranges class

Peter Hickey peter.hickey at gmail.com
Thu Nov 24 04:50:08 CET 2016


Gabe - very cool! I'll be following this with interest.

Ryan - conceptually I haven't been thinking of fixed-width ranges as
different from general ranges, which is why I think it'd be neat if the
user just got the benefits of a space-efficient representation without
having to know or care about the underlying representation. My
delineation by class is more a matter of prototyping/conceptual
convenience: I think of IRanges and FWRanges as concrete implementations
of the (virtual) Ranges class (albeit with FWRanges subject to additional
constraints).
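
To make that concrete, here is a minimal sketch in base R (S4 via the
methods package). Every name in it (SketchRanges, SketchIRanges,
SketchFWRanges, sketchWidth) is hypothetical and stands in for the real
classes; it is not how IRanges is implemented, just an illustration of a
getter hiding whether @width is stored once or parallel to @start:

```r
library(methods)

# Hypothetical virtual parent, standing in for the real Ranges class.
setClass("SketchRanges", representation("VIRTUAL", start = "integer"))

# General case: @width is parallel to @start.
setClass("SketchIRanges", contains = "SketchRanges",
         representation(width = "integer"))

# Fixed-width case: @width is a single integer shared by all ranges.
setClass("SketchFWRanges", contains = "SketchRanges",
         representation(width = "integer"),
         validity = function(object) {
           if (length(object@width) != 1L)
             "'width' slot must have length 1"
           else
             TRUE
         })

# Hypothetical getter: always returns a vector parallel to @start,
# whatever the underlying storage.
setGeneric("sketchWidth", function(x) standardGeneric("sketchWidth"))
setMethod("sketchWidth", "SketchIRanges", function(x) x@width)
setMethod("sketchWidth", "SketchFWRanges",
          function(x) rep.int(x@width, length(x@start)))

pos <- c(10L, 20L, 30L)
gen <- new("SketchIRanges", start = pos, width = rep.int(1L, 3L))
fw  <- new("SketchFWRanges", start = pos, width = 1L)

# Both representations give the same answer through the getter.
stopifnot(identical(sketchWidth(gen), sketchWidth(fw)))
```

Code that goes through the getter sees the same widths for both classes;
only code that reads @width directly would notice the difference.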

Cheers,
Pete

On Thu, 24 Nov 2016 at 14:14 Ryan <rct at thompsonclan.org> wrote:
>
> Hi all,
>
> In addition to the technical concerns, I suppose we should consider
> whether fixed-width ranges are conceptually different enough from
> general ranges to warrant a separate class, or whether this is just
> being considered for purely technical reasons. My feeling is that
> fixed-width ranges aren't sufficiently different from general ranges to
> justify a separate class. The two main uses I can think of for
> fixed-width ranges are genomic positions (i.e. length 1 ranges) and
> cases like "1kb upstream of" or "1kb radius around" a set of specified
> positions. But even for that case, fixed-width ranges are not
> necessarily usable because a position less than 1kb from the end of a
> chromosome would require a truncated range. (What behavior would we
> expect from a hypothetical FWRanges class in this case?)
>
> -Ryan
>
> On 11/23/16 8:01 PM, Ryan wrote:
> > Is it possible to allow the width slot of IRanges to be either a
> > normal vector or an Rle?
> >
> >
> > On 11/23/16 6:18 PM, Peter Hickey wrote:
> >> I've been toying with the idea of a fixed/constant width Ranges
> >> subclass. The motivation comes from storing DNA methylation data at CH
> >> loci (non-CpG methylation): there are 1.1 billion CH loci in the human
> >> genome, so storing these as a GRanges object requires two integer
> >> vectors of length 1.1 billion, one for the @start and one for the @width
> >> slot of the IRanges object in the @ranges slot. But in this case, and perhaps
> >> others, such as storing SNP data, we have a situation where all loci
> >> have the same width, namely 1. Of course, you might argue such a
> >> 2-fold reduction in size is purely academic, but I think it could be a
> >> nice efficiency that's worth pursuing.
> >>
> >> I've sketched out two different prototypes, neither of which I've
> >> worked up to a complete implementation; I'd like to get some feedback
> >> on these two designs, along with a variation that I've not yet even
> >> tried implementing, before I decide how/whether to proceed.
> >>
> >> The two approaches are:
> >>
> >> 1. A new Ranges subclass, FWRanges (fixed-width Ranges, open to better
> >> name suggestions).
> >>    a. The @width slot would be an integer vector of length 1
> >>    b. [variation not yet implemented] The @width slot would be an Rle
> >>       vector parallel to @start
> >> 2. Modifying the IRanges class. The @width slot may be an integer
> >> vector of length 1 or a vector parallel to @start
> >>
> >> [Upon reflection, I suppose there could be a '2b' where the @width
> >> slot is an Rle, but I'm going to ignore this for now since in general
> >> it would be inefficient when the ranges have (random) variable widths]
> >>
> >> # Pros of 1
> >>
> >> - It seems the proper thing is to create a new Ranges subclass
> >> - No dangers associated with stuffing around with internals of the
> >> IRanges class and clean code separation
> >>
> >> # Pros of 1b compared to 1a
> >>
> >> - Like for IRanges, the @width slot would remain parallel to the
> >> @start slot
> >>
> >> # Cons of 1
> >>
> >> - Can't immediately use in a GRanges object because the @ranges slot
> >> is classed as an IRanges object
> >>   - Perhaps this could be changed to allow a Ranges object in the
> >>     @ranges slot of a GRanges object?
> >>   - Otherwise, would also need to implement a subclass of GenomicRanges
> >>     (say, FWGRanges) that used an FWRanges object in the @ranges slot.
> >>     This would necessitate a fair bit of code duplicated from GRanges
> >>     methods.
> >> - Methods like start<-, end<-, width<- would either have to
> >>   - (A) return an error if the new object no longer has fixed/constant
> >>     widths
> >>   - (B) coerce it to an IRanges object (with or without warning), thus
> >>     meaning these operations would not be strict endomorphisms
> >> - Users would only get the space-savings of the FWRanges class if they
> >> explicitly construct an FWRanges object or coerce a compatible IRanges
> >> object to an FWRanges object
> >> - Clean code separation from the IRanges class may also lead to
> >> duplicated code
> >>
> >> # Cons of 1b compared to 1a
> >>
> >> - Endomorphic versions of methods like start<-, end<-, width<- could
> >> create a @width slot that is twice the 'necessary' size (e.g., an Rle
> >> representation of a vector that contains no 'runs').
> >>
> >> # Pros of 2
> >>
> >> - If properly implemented, the user wouldn't need to think about
> >> whether the ranges were fixed or variable width, they'd just get the
> >> most efficient representation
> >>
> >> # Cons of 2
> >>
> >> - Fairly obviously, 2 would be a major (internal) change to a
> >> core Bioconductor class
> >> - The @width slot would no longer necessarily be parallel to the @start
> >> slot, e.g., code that does direct slot access via @width could easily
> >> break (of course, the width() getter would be modified to return a
> >> vector parallel to the @start slot, but people (*cough* me) have code
> >> that does the wrong thing with respect to the use of getters vs.
> >> direct slot access)
> >> - New IRanges objects may be incompatible with earlier versions of
> >> IRanges
> >>
> >> Your feedback is much appreciated,
> >> Pete
> >>
> >> _______________________________________________
> >> Bioc-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
>
