[BioC] Most efficient way to compute width of overlap of multiple features
Charles Berry
ccberry at ucsd.edu
Tue Jan 7 18:30:48 CET 2014
Vince S. Buffalo <vsbuffalo at ...> writes:
>
> Hi Bioconductor folks,
>
> I'm trying to create some GRanges summaries, but I think I may be missing
> an obvious solution. I have fixed-width windows as a GRanges object, and
> for each window/row I'd like to add a metadata column that indicates how
> many base pairs of this window overlap features in another GRange object.
> I'll need to add a few columns for different features in different GRanges
> objects.
>
> I've tried using the approach of findOverlaps, followed by ranges() to
> extract range widths. This creates an error: "'query' must be a Ranges of
> length equal to number of queries". I saw in the source
> that pintersect(query[queryHits(x)], subject[subjectHits(x)]) works too
> (and does without error). This returns the overlapping ranges, but it'd
> take a load of data munging to get it into the format I'd like it seems
> like I may be overlooking an easier solution.
>
> thanks,
> Vince
>
>
pintersect() seems like the right approach.
Use coverage() to normalize the subject GRanges, then tapply to add the
bases covered:
> set.seed(12345)
> gr <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE),
+ IRanges(start=sample(100000,100),width=1000))
> gr2 <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE),
+ IRanges(start=sample(100000,100),width=1000))
> cover2.gr <- subset( as( coverage(gr2)>0, "GRanges"), score)
> seqinfo(cover2.gr) <- seqinfo(gr2)
> hits <- findOverlaps(gr,cover2.gr)
> gr.over <- pintersect(gr[queryHits(hits)],cover2.gr[subjectHits(hits)])
> gr.counts <- tapply(gr.over,queryHits(hits),FUN=function(x) sum(width(x)))
> gr$overlap<- 0
> gr$overlap[as.numeric(names(gr.counts))]<- unname(gr.counts)
> gr
GRanges with 100 ranges and 1 metadata column:
seqnames ranges strand | overlap
<Rle> <IRanges> <Rle> | <numeric>
[1] b [29447, 30446] * | 0
[2] b [61725, 62724] * | 601
[3] b [97426, 98425] * | 0
[4] b [61820, 62819] * | 696
[5] a [52135, 53134] * | 441
... ... ... ... ... ...
[96] b [ 693, 1692] * | 0
[97] b [27406, 28405] * | 0
[98] b [79728, 80727] * | 755
[99] a [24344, 25343] * | 353
[100] a [45782, 46781] * | 0
---
seqlengths:
a b
NA NA
> length(findOverlaps(subset(gr,overlap==0),gr2))==0 # check
[1] TRUE
> length(findOverlaps(subset(gr,overlap!=0),gr2))>0 # check
[1] TRUE
>
HTH,
Chuck
More information about the Bioconductor
mailing list