[Bioc-devel] GRanges Unique [actually -- `order`] Method

Wed Jun 15 18:49:57 CEST 2011

Ugh, sorry -- I pasted some of my code through emacs/ess which
transformed "_" into "<-" in the wrong places (there must be a way of
turning off text mangling when pasting/yanking that I don't know about
yet!).

Setting up the example GRanges object should be like this:

N <- 20000000L  # nb of ranges
W <- 40L        # average width of the ranges
start <- 1L
end <- 55000000L
set.seed(777)
x_start <- sample(end - W - 2L, N, replace=TRUE)
x_width <- W + sample(-3:3, N, replace=TRUE)
x <- IRanges(start=x_start, width=x_width)

xgr <- GRanges(sample(c('chr1', 'chr2'), length(x), replace=TRUE), x,
               sample(c('+', '-'), length(x), replace=TRUE))

On Wed, Jun 15, 2011 at 12:30 PM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi,
>
> On Wed, Jun 15, 2011 at 8:33 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 06/15/2011 04:26 AM, Michael Lawrence wrote:
>>>
>>> Thanks for looking into this Steve. Maybe I am missing something here, but
>>> why not just do something like:
>>>
>>> order(as.factor(seqnames(gr)), as.factor(strand(gr)), start(gr))
>
> *blush* :-)
>
> I guess I was transfixed with rigging up a way to take advantage of
> the Ranges::order method and skipped over the (in retrospect) more
> obvious solution you provided ... to be honest, it didn't even pop
> into mind.
>
> The comments in order,Ranges method indicates that calling
> orderTwoIntegers(start(x), width(x)) is  substantially faster than
> order(start(x), width(x)) when working over a large number of ranges
> which is way I initially set my mind to do it this way.
>
> For the example GRanges object I quoted previously though (~ 1 million
> ranges spread over 25 chromosomes), the runtimes between my method and
> yours are pretty much even (~5.5 seconds) -- I guess you'd want to
> change yours a bit to keep the same "sort characteristics" as
> order,Ranges, eg:
>
> order(as.factor(seqnames(gr)), as.factor(strand(gr)), start(gr), width(gr))
>
> If I take the example from the Ranges,order comment block, though, the
> speed difference becomes more evident (the way I'm proposing is ~ 2x
> faster):
>
> N <- 20000000L  # nb of ranges
> W <- 40L        # average width of the ranges
> start <- 1L
> end <- 55000000L
> set.seed(777))
> x <- start <- sample(end - W - 2L, N, replace=TRUE)
> x <- width <- W + sample(-3:3, N, replace=TRUE)
> x <- IRanges(start=x <- start, width=x <- width)
>
> xgr <- GRanges(sample(c('chr1', 'chr2'), length(x), replace=TRUE), x,
>               sample(c('+', '-'), length(x), replace=TRUE))
>
> And try to reorder it, the difference in speed is a bit more evident:
>
> ## Using "my" order function
> system.time(o <- order(xgr))
>   user  system elapsed
> 107.935   9.928 117.905
>
> ## Using yours:
> system.time(oo <- order(as.factor(seqnames(xgr)),
> as.factor(strand(xgr)), start(xgr), width(xgr)))
> user  system elapsed
> 283.970 1.233 285.264
>
> identical(o, oo)
> [1] TRUE
>
>
> Neither of them are winning any speed records, though -- as is evident
> from doing a "clean" sort on just the ranges:
>
> system.time(ro <- order(ranges(xgr)))
>   user  system elapsed
>  17.053   0.423  17.480
>
> ... so ... lots of room for improvement for order,GenomicRanges I reckon.
>
>
>>> I think we'd want an option for including strand or not.
>>
>> like nearest,GenomicRanges,GenomicRanges, which has ignore.strand=FALSE.
>
> I agree, but I noticed that passing values to "new" parameters messes
> up the dispatch.
>
> For instance, changing the method signature to:
>
> setMethod("order", "GenomicRanges",
>    function(..., ignore.strand=FALSE, na.last=TRUE, decreasing=FALSE) {
>      ...
>    })
>
> Won't work -- passing in order(x, ignore.strand=TRUE) looks as if it
> invokes the order,ANY method. showMethods('order') also seems to
> corroborate that with the "GRanges#logical" signature (never seen that
> #* notation before):
>
> R> showMethods('order')
> Function: order (package IRanges)
> ...="ANY"
> ...="CompressedAtomicList"
> ...="CompressedRleList"
> ...="GenomicRanges"
> ...="GRanges"
>    (inherited from: ...="GenomicRanges")
> ...="GRanges#logical"
>    (inherited from: ...="ANY")
> ...="IRanges"
>    (inherited from: ...="Ranges")
> ...="Ranges"
> ...="SimpleAtomicList"
> ...="SimpleRleList"
> ...="XStringSet"
>
> How would one fix that so that adding additional params (like
> ignore.strand) could work? I'm not sure how dispatch works over `...`
> (I think I read it somewhere in some man page I stumbled on, but I
> can't recall where).
>
>> GRangesList maybe an easy approach is to add a first argument
>> order(rep(seq_along(grl), elementLengths(grl)), ...) then unlist, order, and
>> re-list.
>>
>> Also not a fan of allowing the user to specify seqnames.order; you can't do
>> this for factors, and sounds really like the user wants seqlevels(gr) <- ...
>
> Yeah, your probably right.
>
> -steve
>
>
>> Martin
>>>
>>> Thanks again,
>>> Michael
>>>
>>> On Tue, Jun 14, 2011 at 10:17 PM, Steve Lianoglou<
>>> mailinglist.honeypot at gmail.com>  wrote:
>>>
>>>> I took another crack at my original attempt and reduced a call to my
>>>> GenomicRanges::order from ~ 22 seconds to ~ 5.5 seconds over 1 million
>>>> randomly picked ranges over hsapiens.
>>>>
>>>> Still not super fast, but not as abysmal as before.
>>>>
>>>> I'll put it here for review before checking in (or not):
>>>> https://gist.github.com/1026520
>>>>
>>>> Thanks,
>>>> -steve
>>>>
>>>> On Tue, Jun 14, 2011 at 8:06 PM, Steve Lianoglou
>>>> <mailinglist.honeypot at gmail.com>  wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> (Digging up an old [related] thread since I'm not sure of the status
>>>>> of the code that Michael referred to in this context is ...)
>>>>>
>>>>> I have a suboptimal-but-working implementation of `order` (and by
>>>>> extension, `sort`) for GenomicRanges objects, eg. it calculates the
>>>>> `order`ing of a GRanges object of length 1 million (randomly spread
>>>>> across all Hsapiens chromosomes and strands) in ~ 22 seconds[*].
>>>>>
>>>>> The resulting/ordered ranges are sorted/grouped by
>>>>> seqnames,strand,ranges (the caller can specify the ordering of the
>>>>> seqnames, otherwise the ordering as defined by
>>>>> seqleves(your.granges.object) is used.
>>>>>
>>>>> Also it is only defined for one GRanges object (not sure what the
>>>>> appropriate result would be if multiple granges objects are passed in)
>>>>>
>>>>> I can check it into SVN if that sounds good so it can work as a
>>>>> stop-gap until one of the *Ranges-guru's can whip up a superior one.
>>>>>
>>>>> [*] By the by, the runtime is dominated by iterating over the seqnames
>>>>> and subselecting the appropriate ranges to work for one at a time ...
>>>>> maybe the speed can be increased by using `split` a few times, but
>>>>> then you have several copies of your GRanges object in memory, so ...
>>>>> not sure what's best atm or how useful it is to talk about code in the
>>>>> "abstract," but we can continue the discussion if you reckon it's
>>>>> worthy to be checked in for now ...
>>>>>
>>>>> -steve
>>>>>
>>>>> On Wed, May 25, 2011 at 9:02 AM, Michael Lawrence
>>>>> <lawrence.michael at gene.com>  wrote:
>>>>>>
>>>>>> Someone has to write the methods...
>>>>>>
>>>>>> On Tue, May 24, 2011 at 11:00 PM, Dario Strbenac
>>>>>> <D.Strbenac at garvan.org.au>wrote:
>>>>>>
>>>>>>>>   Yes, the sort method just calls order.
>>>>>>>
>>>>>>> Something isn't quite working out for me.
>>>>>>>
>>>>>>> library(GenomicRanges) # 1.4.5
>>>>>>> gr<- GRanges("chr1", IRanges(c(1, 10), c(50, 60)), '+')
>>>>>>> sort(gr)
>>>>>>>
>>>>>>> --------------------------------------
>>>>>>> Dario Strbenac
>>>>>>> Research Assistant
>>>>>>> Cancer Epigenetics
>>>>>>> Garvan Institute of Medical Research
>>>>>>> Darlinghurst NSW 2010
>>>>>>> Australia
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>
>>>>>>        [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>>  | Weill Medical College of Cornell University
>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>  | Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>>
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact