[Bioc-devel] How to speed up GRange comparision

Thu Jan 30 08:31:08 CET 2020

Hi Herve,

Thank you for your answer. pcompare works fine for me. Here my solution:

query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10, 22)))
subject <- GRanges(rep("chr1",4), IRanges(c(3, 1, 1, 15), c(4, 2, 2, 21)))
out <- vector("numeric", length(query))
out[(which(abs(pcompare(query, subject))<5))] <- 1
out

Carey was right that this here is off list. Next time I will pose my 
question on support.bioconductor.org <http://support.bioconductor.org>.

Best,

Tobias

Am 29.01.20 um 18:02 schrieb Pages, Herve:
> Yes poverlaps().
>
> Or pcompare(), which should be even faster. But only if you are not
> afraid to go low-level. See ?rangeComparisonCodeToLetter for the meaning
> of the codes returned by pcompare().
>
> H.
>
> On 1/29/20 08:01, Michael Lawrence via Bioc-devel wrote:
>> poverlaps()?
>>
>> On Wed, Jan 29, 2020 at 7:50 AM web working <webworking using posteo.de> wrote:
>>> Hello,
>>>
>>> I have two big GRanges objects and want to search for an overlap of  the
>>> first range of query with the first range of subject. Then take the
>>> second range of query and compare it with the second range of subject
>>> and so on. Here an example of my problem:
>>>
>>> # GRanges objects
>>> query <- GRanges(rep("chr1", 4), IRanges(c(1, 5, 9, 20), c(2, 6, 10,
>>> 22)), id=1:4)
>>> subject <- GRanges(rep("chr1",4), IRanges(c(3, 1, 1, 15), c(4, 2, 2,
>>> 21)), id=1:4)
>>>
>>> # The 2 overlaps at the first position should not be counted, because
>>> these ranges are at different rows.
>>> countOverlaps(query, subject)
>>>
>>> # Approach 1 (bad style. I have simplified it to understand)
>>> dat <- as.data.frame(findOverlaps(query, subject))
>>> indexDat <- apply(dat, 1, function(x) x[1]==x[2])
>>> indexBool <- dat[indexDat,1]
>>> out <- rep(FALSE, length(query))
>>> out[indexBool] <- TRUE
>>> as.numeric(out)
>>>
>>> # Approach 2 (bad style and takes too long)
>>> out <- vector("numeric", 4)
>>> for(i in seq_along(query)) out[i] <- (overlapsAny(query[i], subject[i]))
>>> out
>>>
>>> # Approach 3 (wrong results)
>>> as.numeric(overlapsAny(query, subject))
>>> as.numeric(overlapsAny(split(query, 1:4), split(subject, 1:4)))
>>>
>>>
>>> Maybe someone has an idea to speed this up?
>>>
>>>
>>> Best,
>>>
>>> Tobias
>>>
>>> _______________________________________________
>>> Bioc-devel using r-project.org mailing list
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=FSrHBK59_OMc6EbEtcPhkTVO0cfDgSbQBGFOXWyHhjc&s=3tZpvRAw7T5dP21u32TRTf4lZ4QFLtmkouKR7TUlJws&e=
>>
>>

	[[alternative HTML version deleted]]