[Bioc-devel] GRanges Unique [actually -- `order`] Method

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Jun 15 18:30:34 CEST 2011


Hi,

On Wed, Jun 15, 2011 at 8:33 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 06/15/2011 04:26 AM, Michael Lawrence wrote:
>>
>> Thanks for looking into this Steve. Maybe I am missing something here, but
>> why not just do something like:
>>
>> order(as.factor(seqnames(gr)), as.factor(strand(gr)), start(gr))

*blush* :-)

I guess I was transfixed with rigging up a way to take advantage of
the Ranges::order method and skipped over the (in retrospect) more
obvious solution you provided ... to be honest, it didn't even pop
into mind.

The comments in order,Ranges method indicates that calling
orderTwoIntegers(start(x), width(x)) is  substantially faster than
order(start(x), width(x)) when working over a large number of ranges
which is way I initially set my mind to do it this way.

For the example GRanges object I quoted previously though (~ 1 million
ranges spread over 25 chromosomes), the runtimes between my method and
yours are pretty much even (~5.5 seconds) -- I guess you'd want to
change yours a bit to keep the same "sort characteristics" as
order,Ranges, eg:

order(as.factor(seqnames(gr)), as.factor(strand(gr)), start(gr), width(gr))

If I take the example from the Ranges,order comment block, though, the
speed difference becomes more evident (the way I'm proposing is ~ 2x
faster):

N <- 20000000L  # nb of ranges
W <- 40L        # average width of the ranges
start <- 1L
end <- 55000000L
set.seed(777))
x <- start <- sample(end - W - 2L, N, replace=TRUE)
x <- width <- W + sample(-3:3, N, replace=TRUE)
x <- IRanges(start=x <- start, width=x <- width)

xgr <- GRanges(sample(c('chr1', 'chr2'), length(x), replace=TRUE), x,
               sample(c('+', '-'), length(x), replace=TRUE))

And try to reorder it, the difference in speed is a bit more evident:

## Using "my" order function
system.time(o <- order(xgr))
   user  system elapsed
107.935   9.928 117.905

## Using yours:
system.time(oo <- order(as.factor(seqnames(xgr)),
as.factor(strand(xgr)), start(xgr), width(xgr)))
user  system elapsed
283.970   1.233 285.264

identical(o, oo)
[1] TRUE


Neither of them are winning any speed records, though -- as is evident
from doing a "clean" sort on just the ranges:

system.time(ro <- order(ranges(xgr)))
   user  system elapsed
 17.053   0.423  17.480

... so ... lots of room for improvement for order,GenomicRanges I reckon.


>> I think we'd want an option for including strand or not.
>
> like nearest,GenomicRanges,GenomicRanges, which has ignore.strand=FALSE.

I agree, but I noticed that passing values to "new" parameters messes
up the dispatch.

For instance, changing the method signature to:

setMethod("order", "GenomicRanges",
    function(..., ignore.strand=FALSE, na.last=TRUE, decreasing=FALSE) {
      ...
    })

Won't work -- passing in order(x, ignore.strand=TRUE) looks as if it
invokes the order,ANY method. showMethods('order') also seems to
corroborate that with the "GRanges#logical" signature (never seen that
#* notation before):

R> showMethods('order')
Function: order (package IRanges)
...="ANY"
...="CompressedAtomicList"
...="CompressedRleList"
...="GenomicRanges"
...="GRanges"
    (inherited from: ...="GenomicRanges")
...="GRanges#logical"
    (inherited from: ...="ANY")
...="IRanges"
    (inherited from: ...="Ranges")
...="Ranges"
...="SimpleAtomicList"
...="SimpleRleList"
...="XStringSet"

How would one fix that so that adding additional params (like
ignore.strand) could work? I'm not sure how dispatch works over `...`
(I think I read it somewhere in some man page I stumbled on, but I
can't recall where).

> GRangesList maybe an easy approach is to add a first argument
> order(rep(seq_along(grl), elementLengths(grl)), ...) then unlist, order, and
> re-list.
>
> Also not a fan of allowing the user to specify seqnames.order; you can't do
> this for factors, and sounds really like the user wants seqlevels(gr) <- ...

Yeah, your probably right.

-steve


> Martin
>>
>> Thanks again,
>> Michael
>>
>> On Tue, Jun 14, 2011 at 10:17 PM, Steve Lianoglou<
>> mailinglist.honeypot at gmail.com>  wrote:
>>
>>> I took another crack at my original attempt and reduced a call to my
>>> GenomicRanges::order from ~ 22 seconds to ~ 5.5 seconds over 1 million
>>> randomly picked ranges over hsapiens.
>>>
>>> Still not super fast, but not as abysmal as before.
>>>
>>> I'll put it here for review before checking in (or not):
>>> https://gist.github.com/1026520
>>>
>>> Thanks,
>>> -steve
>>>
>>> On Tue, Jun 14, 2011 at 8:06 PM, Steve Lianoglou
>>> <mailinglist.honeypot at gmail.com>  wrote:
>>>>
>>>> Hi,
>>>>
>>>> (Digging up an old [related] thread since I'm not sure of the status
>>>> of the code that Michael referred to in this context is ...)
>>>>
>>>> I have a suboptimal-but-working implementation of `order` (and by
>>>> extension, `sort`) for GenomicRanges objects, eg. it calculates the
>>>> `order`ing of a GRanges object of length 1 million (randomly spread
>>>> across all Hsapiens chromosomes and strands) in ~ 22 seconds[*].
>>>>
>>>> The resulting/ordered ranges are sorted/grouped by
>>>> seqnames,strand,ranges (the caller can specify the ordering of the
>>>> seqnames, otherwise the ordering as defined by
>>>> seqleves(your.granges.object) is used.
>>>>
>>>> Also it is only defined for one GRanges object (not sure what the
>>>> appropriate result would be if multiple granges objects are passed in)
>>>>
>>>> I can check it into SVN if that sounds good so it can work as a
>>>> stop-gap until one of the *Ranges-guru's can whip up a superior one.
>>>>
>>>> [*] By the by, the runtime is dominated by iterating over the seqnames
>>>> and subselecting the appropriate ranges to work for one at a time ...
>>>> maybe the speed can be increased by using `split` a few times, but
>>>> then you have several copies of your GRanges object in memory, so ...
>>>> not sure what's best atm or how useful it is to talk about code in the
>>>> "abstract," but we can continue the discussion if you reckon it's
>>>> worthy to be checked in for now ...
>>>>
>>>> -steve
>>>>
>>>> On Wed, May 25, 2011 at 9:02 AM, Michael Lawrence
>>>> <lawrence.michael at gene.com>  wrote:
>>>>>
>>>>> Someone has to write the methods...
>>>>>
>>>>> On Tue, May 24, 2011 at 11:00 PM, Dario Strbenac
>>>>> <D.Strbenac at garvan.org.au>wrote:
>>>>>
>>>>>>>   Yes, the sort method just calls order.
>>>>>>
>>>>>> Something isn't quite working out for me.
>>>>>>
>>>>>> library(GenomicRanges) # 1.4.5
>>>>>> gr<- GRanges("chr1", IRanges(c(1, 10), c(50, 60)), '+')
>>>>>> sort(gr)
>>>>>>
>>>>>> --------------------------------------
>>>>>> Dario Strbenac
>>>>>> Research Assistant
>>>>>> Cancer Epigenetics
>>>>>> Garvan Institute of Medical Research
>>>>>> Darlinghurst NSW 2010
>>>>>> Australia
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>
>>>>>        [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>  | Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>
>>>
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>>  | Memorial Sloan-Kettering Cancer Center
>>>  | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



More information about the Bioc-devel mailing list