[Bioc-devel] Merging GInteraction/GenomicInteractions ranges

Tue Feb 19 19:00:15 CET 2019

Hi Luke,

If I understand correctly what you want to do, you can use `findOverlaps()`
to create a new GInteractions object from a set of expanded ranges and the
original interactions.

anchor.one <- GRanges(c("chr1", "chr1", "chr1", "chr1"),
                     IRanges(c(10, 20, 30, 20), width=5))
anchor.two <- GRanges(c("chr1", "chr1", "chr1", "chr2"),
                     IRanges(c(100, 200, 200, 50), width=5))
interaction_counts <- sample(1:10, 4)
test <- GInteractions(anchor1 = anchor.one,
                      anchor2 = anchor.two,
                      counts=interaction_counts)

wider_ranges <- reduce(resize(regions(test), fix = "center", width = 10))

first_ol <- findOverlaps(test, wider_ranges, use.region = "first")
second_ol <- findOverlaps(test, wider_ranges, use.region = "second")

new_gi <- GInteractions(anchor1 = subjectHits(first_ol),
                        anchor2 = subjectHits(second_ol),
                        regions = wider_ranges,
                        counts = test$counts)

This will give you a GInteractions object that's the same length as your
original object. You said you already had a way of summing the counts for
each group of interactions, but you could also incorporate that here. I
would do it with dplyr, but I'm sure there's alternatives that introduce
fewer dependencies...

new_gi_info_df <- data.frame(anchor1 = subjectHits(first_ol),
                             anchor2 = subjectHits(second_ol),
                             counts = test$counts) %>%
  dplyr::group_by(anchor1, anchor2) %>%
  dplyr::summarise(counts = sum(counts))

best wishes,
Liz

On Tue, Feb 19, 2019 at 6:15 PM Luke Klein <lklei001 using ucr.edu> wrote:

> Thank you for the tip, Aaron!  I’m working on the sorting step right now.
>
> One thing I’m not clear on is how to expand the ranges from one step to
> the next.  The way GenomicInteractions are structured, there is a Granges
> object with all possible ranges, and the GInteractions object is populated
> by reference to said interactions.
>
> What I am going to need is a new GRanges object with the new set of
> (expanded) ranges, and a way to map the prior ranges to the new, wider
> range.
>
> For your edification, I’ve included the images in this email (which is
> addressed directly to you) so that you can see what I’d written in my first
> question.
>
> Best,
>
> — Luke Klein
>     PhD Student
>     Department of Statistics
>     University of California, Riverside
>     lklei001 using ucr.edu
>
>
>
>
>
>
>
> > On Feb 13, 2019, at 3:34 AM, Aaron Lun <
> infinite.monkeys.with.keyboards using gmail.com> wrote:
> >
> > Note that your visual won't show up for many (all?) of us. Nonetheless,
> > I think I know what you want to do.
> >
> > Your task does not lend itself to vectorization, which makes it
> > difficult to write efficient R code. It's not impossible, but it would
> > be quite hard to read and debug, and your maintenance programmer will
> > be cursing you somewhere down the line.
> >
> > If speed is truly a concern, I would write this code in C++. This would
> > probably be several lines' worth of code:
> >
> > 1. Compute a pair of bin IDs for each interaction by dividing each
> > anchor coordinate by the bin width and truncating the result. (You'll
> > need to decide if you want to use the midpoint/start/end, etc.)
> > 2. Sort the interactions by the paired bin IDs, e.g., with std::sort.
> > 3. Identify each "run" of interactions with the same paired IDs.
> > 4. Repeat step 1 within each run (you'll need to offset the anchor
> > coordinate before dividing this time). Append the current quadrant to
> > the quadrant sequence for return to R at the end of recursion.
> >
> > Clear, concise, and can be slapped together in less than half an hour
> > with Rcpp and C++11, if you know what you're doing.
> >
> > -A
> >
> > On Tue, 2019-02-12 at 11:34 -0800, Luke Klein wrote:
> >> Hello.  I am planning to develop a new package which extends the
> >> GenomicInteractions package.  I would like some help/advice on
> >> implementing the following functionality.
> >>
> >> Consider the follow GenomicInteractions object
> >>
> >> GenomicInteractions object with 10 interactions and 1 metadata
> >> column:
> >>        seqnames1   ranges1     seqnames2   ranges2 |    counts
> >>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> >>    [1]      chrA       1-2 ---      chrA      9-10 |         1
> >>    [2]      chrA       1-2 ---      chrA     15-16 |         1
> >>    [3]      chrA       3-4 ---      chrA       3-4 |         1
> >>    [4]      chrA       5-6 ---      chrA       7-8 |         1
> >>    [5]      chrA       5-6 ---      chrA      9-10 |         1
> >>    [6]      chrA       7-8 ---      chrA       7-8 |         1
> >>    [7]      chrA       7-8 ---      chrA     11-12 |         1
> >>    [8]      chrA       7-8 ---      chrA     17-18 |         1
> >>    [9]      chrA      9-10 ---      chrA      9-10 |         1
> >>   [10]      chrA      9-10 ---      chrA     15-16 |         1
> >>   -------
> >>   regions: 8 ranges and 0 metadata columns
> >>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> >>
> >>
> >> Which is visually represented thusly
> >>
> >>
> >>
> >> I would like to do the following:
> >>
> >> 1) I want to group the regions into bins of WxW (in this case, W will
> >> be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Qua <
> https://en.wikipedia.org/wiki/Qua>
> >> dtree> with the final group being WxW (instead of 2x2).  This will
> >> involve
> >>      - iteratively dividing the matrix into quadrants {upper-left
> >> (0), upper-right (1), lower-left (2), lower-right(3)} .
> >>      - labeling each subdivision in a new column until the final WxW
> >> resolution is reached.
> >>      - sorting by the columns
> >>
> >>
> >>
> >>
> >> GenomicInteractions object with 10 interactions and 1 metadata
> >> column:
> >>        seqnames1   ranges1     seqnames2   ranges2
> >> |    counts     quad1     quad2
> >>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> >> <integer> <integer>
> >>    [1]      chrA       1-2 ---      chrA      9-10 |         1
> >>        0      1
> >>    [2]      chrA       1-2 ---      chrA     15-16 |         1
> >>        1      0
> >>    [3]      chrA       3-4 ---      chrA       3-4 |         1
> >>        0      0
> >>    [4]      chrA       5-6 ---      chrA       7-8 |         1
> >>        0      1
> >>    [5]      chrA       5-6 ---      chrA      9-10 |         1
> >>        0      1
> >>    [6]      chrA       7-8 ---      chrA       7-8 |         1
> >>        0      3
> >>    [7]      chrA       7-8 ---      chrA     11-12 |         1
> >>        0      3
> >>    [8]      chrA       7-8 ---      chrA     17-18 |         1
> >>        1      2
> >>    [9]      chrA      9-10 ---      chrA      9-10 |         1
> >>        0      3
> >>   [10]      chrA      9-10 ---      chrA     15-16 |         1
> >>        1      2
> >>   -------
> >>   regions: 8 ranges and 0 metadata columns
> >>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> >>
> >>
> >> Sorting by the two columns yields what I am after.  Of course, I
> >> include the “quadX” column for illustration only.  Upon
> >> implementation, I would like these columns hidden from the user.
> >>
> >> GenomicInteractions object with 10 interactions and 1 metadata
> >> column:
> >>        seqnames1   ranges1     seqnames2   ranges2
> >> |    counts     quad1     quad2
> >>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> >> <integer> <integer>
> >>    [1]      chrA       3-4 ---      chrA       3-4 |         1
> >>        0      0
> >>    [2]      chrA       1-2 ---      chrA      9-10 |         1
> >>        0      1
> >>    [3]      chrA       5-6 ---      chrA       7-8 |         1
> >>        0      1
> >>    [4]      chrA       5-6 ---      chrA      9-10 |         1
> >>        0      1
> >>    [5]      chrA       7-8 ---      chrA       7-8 |         1
> >>        0      3
> >>    [6]      chrA       7-8 ---      chrA     11-12 |         1
> >>        0      3
> >>    [7]      chrA      9-10 ---      chrA      9-10 |         1
> >>        0      3
> >>    [8]      chrA       1-2 ---      chrA     15-16 |         1
> >>        1      0
> >>    [9]      chrA       7-8 ---      chrA     17-18 |         1
> >>        1      2
> >>   [10]      chrA      9-10 ---      chrA     15-16 |         1
> >>        1      2
> >>   -------
> >>   regions: 8 ranges and 0 metadata columns
> >>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> >>
> >> The sorting gives me the quad-tree structure, and each unique
> >> quadrant sequence defines the group.
> >>
> >>
> >> GenomicInteractions object with 10 interactions and 1 metadata
> >> column:
> >>        seqnames1   ranges1     seqnames2   ranges2 |    counts
> >>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> >>    [1]      chrA       3-4 ---      chrA       3-4 |         1
> >>    [2]      chrA       1-2 ---      chrA      9-10 |         1
> >>    [3]      chrA       5-6 ---      chrA       7-8 |         1
> >>    [4]      chrA       5-6 ---      chrA      9-10 |         1
> >>    [5]      chrA       7-8 ---      chrA       7-8 |         1
> >>    [6]      chrA       7-8 ---      chrA     11-12 |         1
> >>    [7]      chrA      9-10 ---      chrA      9-10 |         1
> >>    [8]      chrA       1-2 ---      chrA     15-16 |         1
> >>    [9]      chrA       7-8 ---      chrA     17-18 |         1
> >>   [10]      chrA      9-10 ---      chrA     15-16 |         1
> >>   -------
> >>   regions: 8 ranges and 0 metadata columns
> >>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> >>
> >>
> >> 2) Then I would like to merge the WxW window (i.e. bin the regions),
> >> expanding the ranges accordingly and adding the counts..  This
> >> process will
> >>      - ***identify all range-pairs in the same window and merge them
> >> into a new range pair with appropriately expanded ranges*** (this is
> >> my primary goal)
> >>      - sum the counts for each of the aforementioned range-pairs (i
> >> have already figured a way to do this)
> >>
> >>
> >>
> >> GenomicInteractions object with 5 interactions and 1 metadata column:
> >>        seqnames1   ranges1     seqnames2   ranges2 |    counts
> >>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> >>    [1]      chrA       1-6  ---      chrA      1-6  |         1
> >>    [2]      chrA       1-6  ---      chrA      7-12 |         3
> >>    [3]      chrA       7-12 ---      chrA      7-12 |         3
> >>    [4]      chrA       1-6  ---      chrA     13-18 |         1
> >>    [5]      chrA       7-12 ---      chrA     13-18 |         2
> >>   -------
> >>   regions: 3 ranges and 0 metadata columns
> >>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> >>
> >>
> >> NOTE that ranges1 and ranges2 MUST expand so that the region width is
> >> 6, though the counts will only change if there exists another
> >> subrange covered by this bin/expansion that contains a positive
> >> count.
> >>
> >> As always, speed in a concern.
> >>
> >> Best,
> >>
> >> — Luke Klein
> >>     PhD Student
> >>     Department of Statistics
> >>     University of California, Riverside
> >>     lklei001 using ucr.edu
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> Bioc-devel using r-project.org <mailto:Bioc-devel using r-project.org> mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel <
> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
> Hello.  I am planning to develop a new package which extends the
> GenomicInteractions package.  I would like some help/advice on implementing
> the following functionality.
>
> Consider the follow GenomicInteractions object
>
> GenomicInteractions object with 10 interactions and 1 metadata column:
>        seqnames1   ranges1     seqnames2   ranges2 |    counts
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
>    [1]      chrA       1-2 ---      chrA      9-10 |         1
>    [2]      chrA       1-2 ---      chrA     15-16 |         1
>    [3]      chrA       3-4 ---      chrA       3-4 |         1
>    [4]      chrA       5-6 ---      chrA       7-8 |         1
>    [5]      chrA       5-6 ---      chrA      9-10 |         1
>    [6]      chrA       7-8 ---      chrA       7-8 |         1
>    [7]      chrA       7-8 ---      chrA     11-12 |         1
>    [8]      chrA       7-8 ---      chrA     17-18 |         1
>    [9]      chrA      9-10 ---      chrA      9-10 |         1
>   [10]      chrA      9-10 ---      chrA     15-16 |         1
>   -------
>   regions: 8 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>
> Which is visually represented thusly
>
>
>
> I would like to do the following:
>
> 1) I want to group the regions into bins of WxW (in this case, W will be
> 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Quadtree>
> with the final group being WxW (instead of 2x2).  This will involve
>         - iteratively dividing the matrix into quadrants {upper-left (0),
> upper-right (1), lower-left (2), lower-right(3)} .
>         - labeling each subdivision in a new column until the final WxW
> resolution is reached.
>         - sorting by the columns
>
>
>
>
> GenomicInteractions object with 10 interactions and 1 metadata column:
>        seqnames1   ranges1     seqnames2   ranges2 |    counts     quad1
>    quad2
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer> <integer>
> <integer>
>    [1]      chrA       1-2 ---      chrA      9-10 |         1         0
>        1
>    [2]      chrA       1-2 ---      chrA     15-16 |         1         1
>        0
>    [3]      chrA       3-4 ---      chrA       3-4 |         1         0
>        0
>    [4]      chrA       5-6 ---      chrA       7-8 |         1         0
>        1
>    [5]      chrA       5-6 ---      chrA      9-10 |         1         0
>        1
>    [6]      chrA       7-8 ---      chrA       7-8 |         1         0
>        3
>    [7]      chrA       7-8 ---      chrA     11-12 |         1         0
>        3
>    [8]      chrA       7-8 ---      chrA     17-18 |         1         1
>        2
>    [9]      chrA      9-10 ---      chrA      9-10 |         1         0
>        3
>   [10]      chrA      9-10 ---      chrA     15-16 |         1         1
>        2
>   -------
>   regions: 8 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>
> Sorting by the two columns yields what I am after.  Of course, I include
> the “quadX” column for illustration only.  Upon implementation, I would
> like these columns hidden from the user.
>
> GenomicInteractions object with 10 interactions and 1 metadata column:
>        seqnames1   ranges1     seqnames2   ranges2 |    counts     quad1
>    quad2
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer> <integer>
> <integer>
>    [1]      chrA       3-4 ---      chrA       3-4 |         1         0
>        0
>    [2]      chrA       1-2 ---      chrA      9-10 |         1         0
>        1
>    [3]      chrA       5-6 ---      chrA       7-8 |         1         0
>        1
>    [4]      chrA       5-6 ---      chrA      9-10 |         1         0
>        1
>    [5]      chrA       7-8 ---      chrA       7-8 |         1         0
>        3
>    [6]      chrA       7-8 ---      chrA     11-12 |         1         0
>        3
>    [7]      chrA      9-10 ---      chrA      9-10 |         1         0
>        3
>    [8]      chrA       1-2 ---      chrA     15-16 |         1         1
>        0
>    [9]      chrA       7-8 ---      chrA     17-18 |         1         1
>        2
>   [10]      chrA      9-10 ---      chrA     15-16 |         1         1
>        2
>   -------
>   regions: 8 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> The sorting gives me the quad-tree structure, and each unique quadrant
> sequence defines the group.
>
>
> GenomicInteractions object with 10 interactions and 1 metadata column:
>        seqnames1   ranges1     seqnames2   ranges2 |    counts
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
>    [1]      chrA       3-4 ---      chrA       3-4 |         1
>    [2]      chrA       1-2 ---      chrA      9-10 |         1
>    [3]      chrA       5-6 ---      chrA       7-8 |         1
>    [4]      chrA       5-6 ---      chrA      9-10 |         1
>    [5]      chrA       7-8 ---      chrA       7-8 |         1
>    [6]      chrA       7-8 ---      chrA     11-12 |         1
>    [7]      chrA      9-10 ---      chrA      9-10 |         1
>    [8]      chrA       1-2 ---      chrA     15-16 |         1
>    [9]      chrA       7-8 ---      chrA     17-18 |         1
>   [10]      chrA      9-10 ---      chrA     15-16 |         1
>   -------
>   regions: 8 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>
> 2) Then I would like to merge the WxW window (i.e. bin the regions),
> expanding the ranges accordingly and adding the counts..  This process will
>         - ***identify all range-pairs in the same window and merge them
> into a new range pair with appropriately expanded ranges*** (this is my
> primary goal)
>         - sum the counts for each of the aforementioned range-pairs (i
> have already figured a way to do this)
>
>
>
> GenomicInteractions object with 5 interactions and 1 metadata column:
>        seqnames1   ranges1     seqnames2   ranges2 |    counts
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
>    [1]      chrA       1-6  ---      chrA      1-6  |         1
>    [2]      chrA       1-6  ---      chrA      7-12 |         3
>    [3]      chrA       7-12 ---      chrA      7-12 |         3
>    [4]      chrA       1-6  ---      chrA     13-18 |         1
>    [5]      chrA       7-12 ---      chrA     13-18 |         2
>   -------
>   regions: 3 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>
> NOTE that ranges1 and ranges2 MUST expand so that the region width is 6,
> though the counts will only change if there exists another subrange covered
> by this bin/expansion that contains a positive count.
>
> As always, speed in a concern.
>
> Best,
>
> — Luke Klein
>     PhD Student
>     Department of Statistics
>     University of California, Riverside
>     lklei001 using ucr.edu <mailto:lklei001 using ucr.edu>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]