[Bioc-devel] is.unsorted method for GRanges objects

Tue Nov 3 06:43:51 CET 2015

Thanks for everyones' input.

@Hervé FWIW, the below benchmark suggests that unfortunately this is a
fair bit slower than is.unsorted(order(gr)) when the length of the
GRanges object is < 10,000,000 (the relative difference in speed is
larger for small n in my brief tests; I didn't check above n >
10,000,000)

```r
# GenomicRanges_1.23.1
library(GenomicRanges)

# Simulate some random ranges
sim_gr <- function(n) {
  GRanges(seqnames = sample(paste0("chr", 1:22), n, replace = TRUE),
          ranges = IRanges(sample(n * 10, size = n, replace = TRUE),
                  width = runif(n, 1, 100000)),
          strand = sample(c("+", "-", "*"), n, replace = TRUE),
          seqinfo = Seqinfo(paste0("chr", 1:22)))

}

gr <- sim_gr(10000000)

herve <- function(x, na.rm=FALSE, strictly=FALSE) {
  if (length(x) <= 1L)
    return(FALSE)
  x1 <- head(x, n=-1)
  x2 <- tail(x, n=-1)
  if (strictly)
    return(any(x1 >= x2))
  any(x1 > x2)
}

# 22 seconds
system.time(herve(gr))
# 11.3 seconds
system.time(is.unsorted(order(gr)))

# And when it's already sorted
gr2 <- sort(gr)

# 4.3 seconds
system.time(herve(gr2))
# 0.2 seconds
system.time(is.unsorted(order(gr2)))
```

Roughly, it looks like the head(), tail() calls take approximately 1/4
of the time each, while the any() call takes the remaining 1/2 of the
time. I was thinking it might be possible to make this quite fast by
looping over the GRanges object at the C-level and breaking out of the
loop if gr[i+1] <= gr[i] or gr[i+1] < gr[i], as appropriate. Does this
sound reasonable?

Cheers,
Pete

On 3 November 2015 at 14:06, Michael Lawrence <lawrence.michael at gene.com> wrote:
>
>
> On Mon, Nov 2, 2015 at 6:39 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>
>> Hi,
>>
>> @Pete:
>>
>> 2a- I would just compare pairs of adjacent elements, taking
>> advantage of the fact that <= is vectorized and cheap. So something
>> like:
>>
>>   setMethod("is.unsorted", "Vector",
>>     function(x, na.rm=FALSE, strictly=FALSE)
>>     {
>>         if (length(x) <= 1L)
>>             return(FALSE)
>>         x1 <- head(x, n=-1)
>>         x2 <- tail(x, n=-1)
>>         if (strictly)
>>             return(any(x1 >= x2))
>>         any(x1 > x2)
>>     }
>>   )
>>
>> Since this will work on any Vector derivative for which <= and
>> subsetting are defined, it's a good candidate for being the default
>> "is.unsorted" method for Vector objects. I'll add it to S4Vectors.
>>
>> 2b- The semantic of is.unsorted() on a GRangesList object or any
>> List object in general should be sapply(x, is.unsorted), for
>> consistency with order(), sort(), etc:
>>
>>   > sort(IntegerList(4:3, 1:-2))
>>   IntegerList of length 2
>>   [[1]] 3 4
>>   [[2]] -2 -1 0 1
>>
>> I'll add this too.
>>
>> 2c - That won't be needed. The default method for Vector objects will
>> work on RangedSummarizedExperiment objects (<= and 1D subsetting are
>> defined and along the same dimension).
>>
>> @Gabe:
>>
>> See ?`GenomicRanges-comparison` for how the order of genomic ranges
>> is defined.
>>
>> @Michael:
>>
>> Calling base::.gt() in a loop sounds indeed very inefficient. What
>> about having base::is.unsorted() do the above on "objects" instead?
>> base::.gt() seems to also require that the object is subsettable so
>> the requirements are the same.
>>
>> Then we wouldn't need the default "is.unsorted" method for Vector
>> objects, only a default "anyNA" method for Vector objects that always
>> returns FALSE (plus some specific ones for Rle and other Vector
>> derivatives that support NAs).
>>
>
> Yea, I assumed it did what you suggested before looking at it. It would be
> the right place to fix this.
>
>>
>> Thanks,
>> H.
>>
>>
>> On 11/02/2015 05:35 PM, Michael Lawrence wrote:
>>>
>>> The notion of sortedness is already formally defined, which is why we
>>> have
>>> an order method, etc.
>>>
>>> The base is.unsorted implementation for "objects" ends up calling
>>> base::.gt() for each adjacent pair of elements, which is likely too slow
>>> to
>>> be practical, so we probably should add a custom method.
>>>
>>> This does bring up the tangential question of whether GenomicRanges
>>> should
>>> have an anyNA method that returns FALSE (and similarly an is.na()
>>> method),
>>> although we have never defined the concept of a "missing range".
>>>
>>> Michael
>>>
>>> On Mon, Nov 2, 2015 at 4:55 PM, Gabe Becker <becker.gabe at gene.com> wrote:
>>>
>>>> Pete,
>>>>
>>>> What does sorted mean for granges? If the starts  are sorted but the
>>>> ends
>>>> aren't does that count? What if only the ends are but the ranges are on
>>>> the
>>>> negative strand?
>>>>
>>>> Do we consider seqlevels to be ordinal in the order the levels are
>>>> returned
>>>> from seqlevels ()? That usually makes sense, but does it always?
>>>>
>>>> In essence I'm asking if sortedness is a well enough defined term for an
>>>> is.sorted method to make sense.
>>>>
>>>> Best,
>>>> ~G
>>>> On Nov 2, 2015 4:27 PM, "Peter Hickey" <peter.hickey at gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I sometimes want to test whether a GRanges object (or some object with
>>>>> a GRanges slot, e.g., a SummarizedExperiment object) is (un)sorted.
>>>>> There is no is.unsorted,GRanges-method or, rather, it defers to
>>>>> is.unsorted,ANY-method. I'm unsure that deferring to the
>>>>> is.unsorted,ANY-method is what is really desired when a user calls
>>>>> is.unsorted on a GRanges object, and it will certainly return a
>>>>> (possibly unrelated) warning - "In is.na(x) : is.na() applied to
>>>>> non-(list or vector) of type 'S4'".
>>>>>
>>>>>
>>>>> For this reason, I tend to use is.unsorted(order(x)) when x is a
>>>>> GRanges object. This workaround is also used, for example, by minfi
>>>>>
>>>>> (https://github.com/kasperdanielhansen/minfi/blob/master/R/blocks.R#L121
>>>>
>>>> ).
>>>>>
>>>>> However, this is slow because it essentially sorts the object to test
>>>>> whether it is already sorted.
>>>>>
>>>>>
>>>>> So, to my questions:
>>>>>
>>>>> 1. Have I overlooked a fast way to test whether a GRanges object is
>>>>
>>>> sorted?
>>>>>
>>>>> 2a. Could a is.unsorted,GenomicRanges-method be added to the
>>>>> GenomicRanges package? Side note, I'm unsure at which level to define
>>>>> this method, e.g., GRanges vs. GenomicRanges.
>>>>> 2b. Is it possible to have a sensible definition and implementation
>>>>> for is.unsorted,GRangesList-method?
>>>>> 2c. Could a is.unsorted,RangedSummarizedExperiment-method be added to
>>>>> the SummarizedExperiment package?
>>>>>
>>>>> I started working on a patch for 2a/2c, but wanted to ensure I hadn't
>>>>> overlooked something obvious. Also, I'm sure 2a/2b/2c could be written
>>>>> much more efficiently at the C-level but I'm afraid this might be a
>>>>> bit beyond my abilities to integrate nicely with the existing code.
>>>>>
>>>>> Thanks,
>>>>> Pete
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>
>