[Bioc-devel] is.unsorted method for GRanges objects
Peter Hickey
peter.hickey at gmail.com
Tue Nov 3 12:50:38 CET 2015
Hi Michael,
Sorry, I took this off-list with Hervé. I've written a prototype
is.unsorted,GenomicRanges-method.
I structured it following the lead of order,GenomicRanges-method, so
there's an outer R-level that "translates" the GenomicRanges object to
4 integer vectors (actually the slowest part of the whole process, I
think). These 4 integer vectors are then passed to a lower-level
function that does the actual looped comparisons. Side note, this
lower-level function, isUnsortedIntegerQuads(), is perhaps better
housed in S4Vectors.
isUnsortedIntegerQuads() currently calls an Rcpp function, but I'll
convert that to a plain C function callable using .Call2() tomorrow
morning (Melbourne, Australia time).
I'll then try to figure out the necessary plumbing in order to get it
all up and running as part of the GenomicRanges package. I hope to
finish it all and send through a patch tomorrow, but it depends what
else hits my desk. (I'll add docs and unit tests if the patch is considered
helpful).
Cheers,
Pete
On 3 November 2015 at 22:41, Michael Lawrence <lawrence.michael at gene.com> wrote:
> If we're going to do that, it brings up the question of whether
> is.unsorted() could be made to handle multiple vectors like order(). It
> would be nice to implement that logic only once. Suggestions for the API?
> New function? Additional argument? Patches welcome ;)
>
> Michael
>
> On Mon, Nov 2, 2015 at 10:45 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>
>> OK. Thanks Pete for the timings. The fact that the relative difference
>> in speed is larger for small n in your brief tests is because one
>> performs roughly in n*log(n) (quicksort-based) and the other one is
>> linear in time. Which is why I assumed (but without doing any testing)
>> that the latter was going to perform better. Anyway it seems that there
>> is just too much overhead involved in that solution to make it a good
>> candidate.
>>
>> So back to square one and to the business of trying to come up with
>> something even more efficient than is.unsorted(order(x)) for
>> GenomicRanges objects. It's indeed important that is.unsorted() be
>> as fast and as memory efficient as possible since it is typically
>> used as a quick/cheap way of checking whether a costly sort is
>> required or not (e.g. with something like if (is.unsorted(x))
>> x <- sort(x)).
>>
>> So it seems that unfortunately we won't be able to do it without
>> writing some C code. Your proposal sounds very reasonable to me. It
>> will perform in linear time (in the worst case) and avoid any copy
>> of the object (that we get with the expensive calls to head() and
>> tail() in my solution). So will be much faster than the 2 R solutions
>> whatever n is. Should work on GenomicRanges objects, not just GRanges
>> (this is easily achieved by passing S4Vectors:::decodeRle(seqnames(x)),
>> start(x), with(x), and S4Vectors:::decodeRle(strand(x)) to the .Call
>> entry point).
>>
>> I'll take your patch if you want to work on this or I can add this
>> to GenomicRanges, let me know. We should probably take this off-list.
>>
>> Thanks,
>> H.
>>
>>
>> On 11/02/2015 09:43 PM, Peter Hickey wrote:
>>>
>>> Thanks for everyones' input.
>>>
>>> @Hervé FWIW, the below benchmark suggests that unfortunately this is a
>>> fair bit slower than is.unsorted(order(gr)) when the length of the
>>> GRanges object is < 10,000,000 (the relative difference in speed is
>>> larger for small n in my brief tests; I didn't check above n >
>>> 10,000,000)
>>>
>>> ```r
>>> # GenomicRanges_1.23.1
>>> library(GenomicRanges)
>>>
>>> # Simulate some random ranges
>>> sim_gr <- function(n) {
>>> GRanges(seqnames = sample(paste0("chr", 1:22), n, replace = TRUE),
>>> ranges = IRanges(sample(n * 10, size = n, replace = TRUE),
>>> width = runif(n, 1, 100000)),
>>> strand = sample(c("+", "-", "*"), n, replace = TRUE),
>>> seqinfo = Seqinfo(paste0("chr", 1:22)))
>>>
>>> }
>>>
>>> gr <- sim_gr(10000000)
>>>
>>> herve <- function(x, na.rm=FALSE, strictly=FALSE) {
>>> if (length(x) <= 1L)
>>> return(FALSE)
>>> x1 <- head(x, n=-1)
>>> x2 <- tail(x, n=-1)
>>> if (strictly)
>>> return(any(x1 >= x2))
>>> any(x1 > x2)
>>> }
>>>
>>> # 22 seconds
>>> system.time(herve(gr))
>>> # 11.3 seconds
>>> system.time(is.unsorted(order(gr)))
>>>
>>> # And when it's already sorted
>>> gr2 <- sort(gr)
>>>
>>> # 4.3 seconds
>>> system.time(herve(gr2))
>>> # 0.2 seconds
>>> system.time(is.unsorted(order(gr2)))
>>> ```
>>>
>>> Roughly, it looks like the head(), tail() calls take approximately 1/4
>>> of the time each, while the any() call takes the remaining 1/2 of the
>>> time. I was thinking it might be possible to make this quite fast by
>>> looping over the GRanges object at the C-level and breaking out of the
>>> loop if gr[i+1] <= gr[i] or gr[i+1] < gr[i], as appropriate. Does this
>>> sound reasonable?
>>>
>>> Cheers,
>>> Pete
>>>
>>> On 3 November 2015 at 14:06, Michael Lawrence <lawrence.michael at gene.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> On Mon, Nov 2, 2015 at 6:39 PM, Hervé Pagès <hpages at fredhutch.org>
>>>> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> @Pete:
>>>>>
>>>>> 2a- I would just compare pairs of adjacent elements, taking
>>>>> advantage of the fact that <= is vectorized and cheap. So something
>>>>> like:
>>>>>
>>>>> setMethod("is.unsorted", "Vector",
>>>>> function(x, na.rm=FALSE, strictly=FALSE)
>>>>> {
>>>>> if (length(x) <= 1L)
>>>>> return(FALSE)
>>>>> x1 <- head(x, n=-1)
>>>>> x2 <- tail(x, n=-1)
>>>>> if (strictly)
>>>>> return(any(x1 >= x2))
>>>>> any(x1 > x2)
>>>>> }
>>>>> )
>>>>>
>>>>> Since this will work on any Vector derivative for which <= and
>>>>> subsetting are defined, it's a good candidate for being the default
>>>>> "is.unsorted" method for Vector objects. I'll add it to S4Vectors.
>>>>>
>>>>> 2b- The semantic of is.unsorted() on a GRangesList object or any
>>>>> List object in general should be sapply(x, is.unsorted), for
>>>>> consistency with order(), sort(), etc:
>>>>>
>>>>> > sort(IntegerList(4:3, 1:-2))
>>>>> IntegerList of length 2
>>>>> [[1]] 3 4
>>>>> [[2]] -2 -1 0 1
>>>>>
>>>>> I'll add this too.
>>>>>
>>>>> 2c - That won't be needed. The default method for Vector objects will
>>>>> work on RangedSummarizedExperiment objects (<= and 1D subsetting are
>>>>> defined and along the same dimension).
>>>>>
>>>>> @Gabe:
>>>>>
>>>>> See ?`GenomicRanges-comparison` for how the order of genomic ranges
>>>>> is defined.
>>>>>
>>>>> @Michael:
>>>>>
>>>>> Calling base::.gt() in a loop sounds indeed very inefficient. What
>>>>> about having base::is.unsorted() do the above on "objects" instead?
>>>>> base::.gt() seems to also require that the object is subsettable so
>>>>> the requirements are the same.
>>>>>
>>>>> Then we wouldn't need the default "is.unsorted" method for Vector
>>>>> objects, only a default "anyNA" method for Vector objects that always
>>>>> returns FALSE (plus some specific ones for Rle and other Vector
>>>>> derivatives that support NAs).
>>>>>
>>>>
>>>> Yea, I assumed it did what you suggested before looking at it. It would
>>>> be
>>>> the right place to fix this.
>>>>
>>>>>
>>>>> Thanks,
>>>>> H.
>>>>>
>>>>>
>>>>> On 11/02/2015 05:35 PM, Michael Lawrence wrote:
>>>>>>
>>>>>>
>>>>>> The notion of sortedness is already formally defined, which is why we
>>>>>> have
>>>>>> an order method, etc.
>>>>>>
>>>>>> The base is.unsorted implementation for "objects" ends up calling
>>>>>> base::.gt() for each adjacent pair of elements, which is likely too
>>>>>> slow
>>>>>> to
>>>>>> be practical, so we probably should add a custom method.
>>>>>>
>>>>>> This does bring up the tangential question of whether GenomicRanges
>>>>>> should
>>>>>> have an anyNA method that returns FALSE (and similarly an is.na()
>>>>>> method),
>>>>>> although we have never defined the concept of a "missing range".
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On Mon, Nov 2, 2015 at 4:55 PM, Gabe Becker <becker.gabe at gene.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Pete,
>>>>>>>
>>>>>>> What does sorted mean for granges? If the starts are sorted but the
>>>>>>> ends
>>>>>>> aren't does that count? What if only the ends are but the ranges are
>>>>>>> on
>>>>>>> the
>>>>>>> negative strand?
>>>>>>>
>>>>>>> Do we consider seqlevels to be ordinal in the order the levels are
>>>>>>> returned
>>>>>>> from seqlevels ()? That usually makes sense, but does it always?
>>>>>>>
>>>>>>> In essence I'm asking if sortedness is a well enough defined term for
>>>>>>> an
>>>>>>> is.sorted method to make sense.
>>>>>>>
>>>>>>> Best,
>>>>>>> ~G
>>>>>>> On Nov 2, 2015 4:27 PM, "Peter Hickey" <peter.hickey at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I sometimes want to test whether a GRanges object (or some object
>>>>>>>> with
>>>>>>>> a GRanges slot, e.g., a SummarizedExperiment object) is (un)sorted.
>>>>>>>> There is no is.unsorted,GRanges-method or, rather, it defers to
>>>>>>>> is.unsorted,ANY-method. I'm unsure that deferring to the
>>>>>>>> is.unsorted,ANY-method is what is really desired when a user calls
>>>>>>>> is.unsorted on a GRanges object, and it will certainly return a
>>>>>>>> (possibly unrelated) warning - "In is.na(x) : is.na() applied to
>>>>>>>> non-(list or vector) of type 'S4'".
>>>>>>>>
>>>>>>>>
>>>>>>>> For this reason, I tend to use is.unsorted(order(x)) when x is a
>>>>>>>> GRanges object. This workaround is also used, for example, by minfi
>>>>>>>>
>>>>>>>>
>>>>>>>> (https://github.com/kasperdanielhansen/minfi/blob/master/R/blocks.R#L121
>>>>>>>
>>>>>>>
>>>>>>> ).
>>>>>>>>
>>>>>>>>
>>>>>>>> However, this is slow because it essentially sorts the object to
>>>>>>>> test
>>>>>>>> whether it is already sorted.
>>>>>>>>
>>>>>>>>
>>>>>>>> So, to my questions:
>>>>>>>>
>>>>>>>> 1. Have I overlooked a fast way to test whether a GRanges object is
>>>>>>>
>>>>>>>
>>>>>>> sorted?
>>>>>>>>
>>>>>>>>
>>>>>>>> 2a. Could a is.unsorted,GenomicRanges-method be added to the
>>>>>>>> GenomicRanges package? Side note, I'm unsure at which level to
>>>>>>>> define
>>>>>>>> this method, e.g., GRanges vs. GenomicRanges.
>>>>>>>> 2b. Is it possible to have a sensible definition and implementation
>>>>>>>> for is.unsorted,GRangesList-method?
>>>>>>>> 2c. Could a is.unsorted,RangedSummarizedExperiment-method be added
>>>>>>>> to
>>>>>>>> the SummarizedExperiment package?
>>>>>>>>
>>>>>>>> I started working on a patch for 2a/2c, but wanted to ensure I
>>>>>>>> hadn't
>>>>>>>> overlooked something obvious. Also, I'm sure 2a/2b/2c could be
>>>>>>>> written
>>>>>>>> much more efficiently at the C-level but I'm afraid this might be a
>>>>>>>> bit beyond my abilities to integrate nicely with the existing code.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Pete
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fredhutch.org
>>>>> Phone: (206) 667-5791
>>>>> Fax: (206) 667-1319
>>>>
>>>>
>>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
>
More information about the Bioc-devel
mailing list