[Bioc-devel] GRanges Unique [actually -- `order`] Method

Wed Jun 15 19:24:48 CEST 2011

Steve,

re emacs/ESS, here is a snippet form my init.el, inspired by http://www.r-bloggers.com/a-small-customization-of-ess/

;;; There is no way I know of to disable the 'smart underscore' just for pasted code (as opposed to directly typed).
;;; You can disable it altogether, and use control-= to insert <-,
(ess-toggle-underscore nil)		; disable 'smart underscore'
(setq ess-S-assign-key (kbd "C-="))	; we want to use control-= instead...
(ess-toggle-S-assign-key	t)	; ... which we enable here

~Malcolm

> -----Original Message-----
> From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-bounces at r-
> project.org] On Behalf Of Steve Lianoglou
> Sent: Wednesday, June 15, 2011 11:50 AM
> To: Martin Morgan
> Cc: Michael Lawrence; bioc-devel at r-project.org
> Subject: Re: [Bioc-devel] GRanges Unique [actually -- `order`] Method
> 
> Ugh, sorry -- I pasted some of my code through emacs/ess which
> transformed "_" into "<-" in the wrong places (there must be a way of
> turning off text mangling when pasting/yanking that I don't know about
> yet!).
> 
> Setting up the example GRanges object should be like this:
> 
> N <- 20000000L  # nb of ranges
> W <- 40L        # average width of the ranges
> start <- 1L
> end <- 55000000L
> set.seed(777)
> x_start <- sample(end - W - 2L, N, replace=TRUE)
> x_width <- W + sample(-3:3, N, replace=TRUE)
> x <- IRanges(start=x_start, width=x_width)
> 
> xgr <- GRanges(sample(c('chr1', 'chr2'), length(x), replace=TRUE), x,
>                sample(c('+', '-'), length(x), replace=TRUE))
> 
> 
> On Wed, Jun 15, 2011 at 12:30 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
> > Hi,
> >
> > On Wed, Jun 15, 2011 at 8:33 AM, Martin Morgan <mtmorgan at fhcrc.org>
> wrote:
> >> On 06/15/2011 04:26 AM, Michael Lawrence wrote:
> >>>
> >>> Thanks for looking into this Steve. Maybe I am missing something here,
> but
> >>> why not just do something like:
> >>>
> >>> order(as.factor(seqnames(gr)), as.factor(strand(gr)), start(gr))
> >
> > *blush* :-)
> >
> > I guess I was transfixed with rigging up a way to take advantage of
> > the Ranges::order method and skipped over the (in retrospect) more
> > obvious solution you provided ... to be honest, it didn't even pop
> > into mind.
> >
> > The comments in order,Ranges method indicates that calling
> > orderTwoIntegers(start(x), width(x)) is  substantially faster than
> > order(start(x), width(x)) when working over a large number of ranges
> > which is way I initially set my mind to do it this way.
> >
> > For the example GRanges object I quoted previously though (~ 1 million
> > ranges spread over 25 chromosomes), the runtimes between my method
> and
> > yours are pretty much even (~5.5 seconds) -- I guess you'd want to
> > change yours a bit to keep the same "sort characteristics" as
> > order,Ranges, eg:
> >
> > order(as.factor(seqnames(gr)), as.factor(strand(gr)), start(gr), width(gr))
> >
> > If I take the example from the Ranges,order comment block, though, the
> > speed difference becomes more evident (the way I'm proposing is ~ 2x
> > faster):
> >
> > N <- 20000000L  # nb of ranges
> > W <- 40L        # average width of the ranges
> > start <- 1L
> > end <- 55000000L
> > set.seed(777))
> > x <- start <- sample(end - W - 2L, N, replace=TRUE)
> > x <- width <- W + sample(-3:3, N, replace=TRUE)
> > x <- IRanges(start=x <- start, width=x <- width)
> >
> > xgr <- GRanges(sample(c('chr1', 'chr2'), length(x), replace=TRUE), x,
> >               sample(c('+', '-'), length(x), replace=TRUE))
> >
> > And try to reorder it, the difference in speed is a bit more evident:
> >
> > ## Using "my" order function
> > system.time(o <- order(xgr))
> >   user  system elapsed
> > 107.935   9.928 117.905
> >
> > ## Using yours:
> > system.time(oo <- order(as.factor(seqnames(xgr)),
> > as.factor(strand(xgr)), start(xgr), width(xgr)))
> > user  system elapsed
> > 283.970 1.233 285.264
> >
> > identical(o, oo)
> > [1] TRUE
> >
> >
> > Neither of them are winning any speed records, though -- as is evident
> > from doing a "clean" sort on just the ranges:
> >
> > system.time(ro <- order(ranges(xgr)))
> >   user  system elapsed
> >  17.053   0.423  17.480
> >
> > ... so ... lots of room for improvement for order,GenomicRanges I reckon.
> >
> >
> >>> I think we'd want an option for including strand or not.
> >>
> >> like nearest,GenomicRanges,GenomicRanges, which has
> ignore.strand=FALSE.
> >
> > I agree, but I noticed that passing values to "new" parameters messes
> > up the dispatch.
> >
> > For instance, changing the method signature to:
> >
> > setMethod("order", "GenomicRanges",
> >    function(..., ignore.strand=FALSE, na.last=TRUE, decreasing=FALSE) {
> >      ...
> >    })
> >
> > Won't work -- passing in order(x, ignore.strand=TRUE) looks as if it
> > invokes the order,ANY method. showMethods('order') also seems to
> > corroborate that with the "GRanges#logical" signature (never seen that
> > #* notation before):
> >
> > R> showMethods('order')
> > Function: order (package IRanges)
> > ...="ANY"
> > ...="CompressedAtomicList"
> > ...="CompressedRleList"
> > ...="GenomicRanges"
> > ...="GRanges"
> >    (inherited from: ...="GenomicRanges")
> > ...="GRanges#logical"
> >    (inherited from: ...="ANY")
> > ...="IRanges"
> >    (inherited from: ...="Ranges")
> > ...="Ranges"
> > ...="SimpleAtomicList"
> > ...="SimpleRleList"
> > ...="XStringSet"
> >
> > How would one fix that so that adding additional params (like
> > ignore.strand) could work? I'm not sure how dispatch works over `...`
> > (I think I read it somewhere in some man page I stumbled on, but I
> > can't recall where).
> >
> >> GRangesList maybe an easy approach is to add a first argument
> >> order(rep(seq_along(grl), elementLengths(grl)), ...) then unlist, order,
> and
> >> re-list.
> >>
> >> Also not a fan of allowing the user to specify seqnames.order; you can't
> do
> >> this for factors, and sounds really like the user wants seqlevels(gr) <- ...
> >
> > Yeah, your probably right.
> >
> > -steve
> >
> >
> >> Martin
> >>>
> >>> Thanks again,
> >>> Michael
> >>>
> >>> On Tue, Jun 14, 2011 at 10:17 PM, Steve Lianoglou<
> >>> mailinglist.honeypot at gmail.com>  wrote:
> >>>
> >>>> I took another crack at my original attempt and reduced a call to my
> >>>> GenomicRanges::order from ~ 22 seconds to ~ 5.5 seconds over 1
> million
> >>>> randomly picked ranges over hsapiens.
> >>>>
> >>>> Still not super fast, but not as abysmal as before.
> >>>>
> >>>> I'll put it here for review before checking in (or not):
> >>>> https://gist.github.com/1026520
> >>>>
> >>>> Thanks,
> >>>> -steve
> >>>>
> >>>> On Tue, Jun 14, 2011 at 8:06 PM, Steve Lianoglou
> >>>> <mailinglist.honeypot at gmail.com>  wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> (Digging up an old [related] thread since I'm not sure of the status
> >>>>> of the code that Michael referred to in this context is ...)
> >>>>>
> >>>>> I have a suboptimal-but-working implementation of `order` (and by
> >>>>> extension, `sort`) for GenomicRanges objects, eg. it calculates the
> >>>>> `order`ing of a GRanges object of length 1 million (randomly spread
> >>>>> across all Hsapiens chromosomes and strands) in ~ 22 seconds[*].
> >>>>>
> >>>>> The resulting/ordered ranges are sorted/grouped by
> >>>>> seqnames,strand,ranges (the caller can specify the ordering of the
> >>>>> seqnames, otherwise the ordering as defined by
> >>>>> seqleves(your.granges.object) is used.
> >>>>>
> >>>>> Also it is only defined for one GRanges object (not sure what the
> >>>>> appropriate result would be if multiple granges objects are passed in)
> >>>>>
> >>>>> I can check it into SVN if that sounds good so it can work as a
> >>>>> stop-gap until one of the *Ranges-guru's can whip up a superior one.
> >>>>>
> >>>>> [*] By the by, the runtime is dominated by iterating over the
> seqnames
> >>>>> and subselecting the appropriate ranges to work for one at a time ...
> >>>>> maybe the speed can be increased by using `split` a few times, but
> >>>>> then you have several copies of your GRanges object in memory, so
> ...
> >>>>> not sure what's best atm or how useful it is to talk about code in the
> >>>>> "abstract," but we can continue the discussion if you reckon it's
> >>>>> worthy to be checked in for now ...
> >>>>>
> >>>>> -steve
> >>>>>
> >>>>> On Wed, May 25, 2011 at 9:02 AM, Michael Lawrence
> >>>>> <lawrence.michael at gene.com>  wrote:
> >>>>>>
> >>>>>> Someone has to write the methods...
> >>>>>>
> >>>>>> On Tue, May 24, 2011 at 11:00 PM, Dario Strbenac
> >>>>>> <D.Strbenac at garvan.org.au>wrote:
> >>>>>>
> >>>>>>>>   Yes, the sort method just calls order.
> >>>>>>>
> >>>>>>> Something isn't quite working out for me.
> >>>>>>>
> >>>>>>> library(GenomicRanges) # 1.4.5
> >>>>>>> gr<- GRanges("chr1", IRanges(c(1, 10), c(50, 60)), '+')
> >>>>>>> sort(gr)
> >>>>>>>
> >>>>>>> --------------------------------------
> >>>>>>> Dario Strbenac
> >>>>>>> Research Assistant
> >>>>>>> Cancer Epigenetics
> >>>>>>> Garvan Institute of Medical Research
> >>>>>>> Darlinghurst NSW 2010
> >>>>>>> Australia
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Bioc-devel at r-project.org mailing list
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>>>>>
> >>>>>>
> >>>>>>        [[alternative HTML version deleted]]
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Bioc-devel at r-project.org mailing list
> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Steve Lianoglou
> >>>>> Graduate Student: Computational Systems Biology
> >>>>>  | Memorial Sloan-Kettering Cancer Center
> >>>>>  | Weill Medical College of Cornell University
> >>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Steve Lianoglou
> >>>> Graduate Student: Computational Systems Biology
> >>>>  | Memorial Sloan-Kettering Cancer Center
> >>>>  | Weill Medical College of Cornell University
> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
> >>>>
> >>>
> >>>        [[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> Bioc-devel at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >>
> >> --
> >> Computational Biology
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >>
> >> Location: M1-B861
> >> Telephone: 206 667-2793
> >>
> >
> >
> >
> > --
> > Steve Lianoglou
> > Graduate Student: Computational Systems Biology
> >  | Memorial Sloan-Kettering Cancer Center
> >  | Weill Medical College of Cornell University
> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> >
> 
> 
> 
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
> 
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel