[Bioc-devel] How to speed up GRange comparison

Pages, Herve hpages using fredhutch.org
Thu Jan 30 00:18:01 CET 2020


On 1/29/20 13:14, Jianhong Ou, Ph.D. wrote:
> Oh, I forgot that. Thank you for the reminder.
> Then how about:
> 
> distance(query, narrow(subject, start=2, end=-2)) == 0
> 
> ?

Yep, that's more accurate. With the following gotcha:

   'narrow(subject, start=2, end=-2)' will fail if 'subject'
   contains ranges that cover less than 2 positions

Not an unlikely situation, e.g. if 'subject' contains TSS (typically
ranges of width 1)!
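
To make the gotcha concrete, here is a minimal sketch. The width-1 range
below is a made-up stand-in for a TSS (it is not taken from the original
data), and tryCatch() is only there so the example runs to completion
instead of stopping at the error:

```r
library(GenomicRanges)

# Hypothetical width-1 range standing in for a TSS (assumption for this sketch).
tss <- GRanges("chr1", IRanges(start = 100, end = 100))

# Trimming one position from each end of a width-1 range would leave a
# negative width, so narrow() signals an error; tryCatch() captures it.
res <- tryCatch(narrow(tss, start = 2, end = -2), error = function(e) e)

narrow_failed <- inherits(res, "error")
narrow_failed
```

So any code path that applies narrow() this way would need to filter or
special-case such short ranges first.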

I just feel that distance() is not really appropriate to detect overlaps.
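
For illustration only, one way to test the i-th range of 'query' against
the i-th range of 'subject' without going through distance() is to compare
coordinates directly. This is a sketch under stated assumptions, not
necessarily the recommended solution: overlaps_by_row() is a hypothetical
helper (not part of any package), it assumes both objects have the same
length and contain no zero-width ranges, and it treats strand "*" as
compatible with any strand:

```r
library(GenomicRanges)

# Same coordinates as the example from the original post.
query   <- GRanges("chr1", IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)))
subject <- GRanges("chr1", IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)))

# Hypothetical helper: element-wise (row-by-row) overlap test.
overlaps_by_row <- function(x, y) {
  stopifnot(length(x) == length(y))
  # Ranges on different sequences never overlap.
  same_seq <- as.character(seqnames(x)) == as.character(seqnames(y))
  # Assumption: "*" is compatible with "+", "-" and "*".
  sx <- as.character(strand(x))
  sy <- as.character(strand(y))
  strand_ok <- sx == "*" | sy == "*" | sx == sy
  # Two non-empty ranges share a position iff each starts before the other ends.
  same_seq & strand_ok & start(x) <= end(y) & start(y) <= end(x)
}

hits  <- overlaps_by_row(query, subject)   # FALSE FALSE FALSE  TRUE
dist0 <- distance(query, subject) == 0     #  TRUE FALSE FALSE  TRUE
```

The first pair (1-2 vs 3-4) is adjacent but does not overlap, which is
exactly where the two results differ. It may also be worth checking whether
the installed IRanges/GenomicRanges versions provide a parallel overlap
function such as poverlaps() before writing a helper like this one.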

H.

> 
> 
> On 1/29/20, 12:40 PM, "Pages, Herve" <hpages using fredhutch.org> wrote:
> 
>      On 1/29/20 08:04, Jianhong Ou, Ph.D. wrote:
>      > Try
>      > dist=distance(query, subject)
>      > dist==0
>      > ?
>      
>      Please be aware that dist==0 does NOT mean that 2 ranges overlap. It
>      means that they overlap OR are **adjacent**:
>      
>       > distance(GRanges("chr1:1-20"), GRanges("chr1:21-25"))
>      [1] 0
>      
>      H.
>      
>      >
>      > On 1/29/20, 10:50 AM, "Bioc-devel on behalf of web working" <bioc-devel-bounces using r-project.org on behalf of webworking using posteo.de> wrote:
>      >
>      >      Hello,
>      >
>      >      I have two big GRanges objects and want to search for an overlap of the
>      >      first range of query with the first range of subject. Then take the
>      >      second range of query and compare it with the second range of subject
>      >      and so on. Here is an example of my problem:
>      >
>      >      # GRanges objects
>      >      query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10,
>      >      22)), id=1:4)
>      >      subject <- GRanges(rep("chr1",4), IRanges(c(3, 1, 1, 15), c(4, 2, 2,
>      >      21)), id=1:4)
>      >
>      >      # The 2 overlaps at the first position should not be counted, because
>      >      # these ranges are in different rows.
>      >      countOverlaps(query, subject)
>      >
>      >      # Approach 1 (bad style; I have simplified it to make it easier to understand)
>      >      dat <- as.data.frame(findOverlaps(query, subject))
>      >      indexDat <- apply(dat, 1, function(x) x[1]==x[2])
>      >      indexBool <- dat[indexDat,1]
>      >      out <- rep(FALSE, length(query))
>      >      out[indexBool] <- TRUE
>      >      as.numeric(out)
>      >
>      >      # Approach 2 (bad style and takes too long)
>      >      out <- vector("numeric", 4)
>      >      for(i in seq_along(query)) out[i] <- (overlapsAny(query[i], subject[i]))
>      >      out
>      >
>      >      # Approach 3 (wrong results)
>      >      as.numeric(overlapsAny(query, subject))
>      >      as.numeric(overlapsAny(split(query, 1:4), split(subject, 1:4)))
>      >
>      >
>      >      Maybe someone has an idea to speed this up?
>      >
>      >
>      >      Best,
>      >
>      >      Tobias
>      >
>      >      _______________________________________________
>      >      Bioc-devel using r-project.org mailing list
>      >      https://stat.ethz.ch/mailman/listinfo/bioc-devel
>      >
>      >
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

