[Bioc-devel] Merging GInteraction/GenomicInteractions ranges

Aaron Lun |n||n|te@monkey@@w|th@keybo@rd@ @end|ng |rom gm@||@com
Wed Feb 20 12:22:55 CET 2019


> One thing I’m not clear on is how to expand the ranges from one step
> to the next.  The way GenomicInteractions are structured, there is a
> Granges object with all possible ranges, and the GInteractions object
> is populated by reference to said interactions.

If you're already in C++, then life is easy. Just compute the minimum
bounding box within each run of interactions (if you want empirical
bounds) or compute the theoretical bounds based on the bin widths and
offsets. Collect all of these ranges and report them at the end of the
function to construct a new G(enomic)Interactions object.

Note that the theoretical bounds will be better at leveraging the
memory-saving internal structure of the GInteractions object, as it is
more likely that the anchor regions will be shared between different
bin pairs at the same height of the tree. With empirical bounds, this
is unlikely to be the case.

> What I am going to need is a new GRanges object with the new set of
> (expanded) ranges, and a way to map the prior ranges to the new,
> wider range.

If you're constructing the quad-tree using the recursive algorithm
discussed below, mapping is trivial. At each step of the recursion, you
pass the identity of the parent interaction and record it for each new
child interaction that is generated.

-A

> On Feb 13, 2019, at 3:34 AM, Aaron Lun <infinite.monkeys.with.keybo
> > ards using gmail.com> wrote:
> > 
> > Note that your visual won't show up for many (all?) of us.
> > Nonetheless,
> > I think I know what you want to do.
> > 
> > Your task does not lend itself to vectorization, which makes it
> > difficult to write efficient R code. It's not impossible, but it
> > would
> > be quite hard to read and debug, and your maintenance programmer
> > will
> > be cursing you somewhere down the line.
> > 
> > If speed is truly a concern, I would write this code in C++. This
> > would
> > probably be several lines' worth of code:
> > 
> > 1. Compute a pair of bin IDs for each interaction by dividing each
> > anchor coordinate by the bin width and truncating the result.
> > (You'll
> > need to decide if you want to use the midpoint/start/end, etc.)
> > 2. Sort the interactions by the paired bin IDs, e.g., with
> > std::sort.
> > 3. Identify each "run" of interactions with the same paired IDs.
> > 4. Repeat step 1 within each run (you'll need to offset the anchor
> > coordinate before dividing this time). Append the current quadrant
> > to
> > the quadrant sequence for return to R at the end of recursion.
> > 
> > Clear, concise, and can be slapped together in less than half an
> > hour
> > with Rcpp and C++11, if you know what you're doing.
> > 
> > -A
> > 
> > On Tue, 2019-02-12 at 11:34 -0800, Luke Klein wrote:
> > > Hello.  I am planning to develop a new package which extends the
> > > GenomicInteractions package.  I would like some help/advice on
> > > implementing the following functionality.
> > > 
> > > Consider the follow GenomicInteractions object
> > > 
> > > GenomicInteractions object with 10 interactions and 1 metadata
> > > column:
> > >        seqnames1   ranges1     seqnames2   ranges2 |    counts
> > >            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> > >    [1]      chrA       1-2 ---      chrA      9-10 |         1
> > >    [2]      chrA       1-2 ---      chrA     15-16 |         1
> > >    [3]      chrA       3-4 ---      chrA       3-4 |         1
> > >    [4]      chrA       5-6 ---      chrA       7-8 |         1
> > >    [5]      chrA       5-6 ---      chrA      9-10 |         1
> > >    [6]      chrA       7-8 ---      chrA       7-8 |         1
> > >    [7]      chrA       7-8 ---      chrA     11-12 |         1
> > >    [8]      chrA       7-8 ---      chrA     17-18 |         1
> > >    [9]      chrA      9-10 ---      chrA      9-10 |         1
> > >   [10]      chrA      9-10 ---      chrA     15-16 |         1
> > >   -------
> > >   regions: 8 ranges and 0 metadata columns
> > >   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> > > 
> > > 
> > > Which is visually represented thusly
> > > 
> > > 
> > > 
> > > I would like to do the following:
> > > 
> > > 1) I want to group the regions into bins of WxW (in this case, W
> > > will
> > > be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki
> > > /Qua
> > > dtree> with the final group being WxW (instead of 2x2).  This
> > > will
> > > involve 
> > > 	- iteratively dividing the matrix into quadrants {upper-left
> > > (0), upper-right (1), lower-left (2), lower-right(3)} .
> > > 	- labeling each subdivision in a new column until the final WxW
> > > resolution is reached.
> > > 	- sorting by the columns
> > > 
> > > 
> > > 
> > > 
> > > GenomicInteractions object with 10 interactions and 1 metadata
> > > column:
> > >        seqnames1   ranges1     seqnames2   ranges2
> > > |    counts     quad1     quad2
> > >            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> > > <integer> <integer>
> > >    [1]      chrA       1-2 ---      chrA      9-10 |         1	
> > >        0	 1
> > >    [2]      chrA       1-2 ---      chrA     15-16 |         1	
> > >        1	 0
> > >    [3]      chrA       3-4 ---      chrA       3-4 |         1	
> > >        0	 0
> > >    [4]      chrA       5-6 ---      chrA       7-8 |         1	
> > >        0	 1
> > >    [5]      chrA       5-6 ---      chrA      9-10 |         1	
> > >        0	 1
> > >    [6]      chrA       7-8 ---      chrA       7-8 |         1	
> > >        0	 3	 
> > >    [7]      chrA       7-8 ---      chrA     11-12 |         1	
> > >        0	 3
> > >    [8]      chrA       7-8 ---      chrA     17-18 |         1	
> > >        1	 2
> > >    [9]      chrA      9-10 ---      chrA      9-10 |         1	
> > >        0	 3
> > >   [10]      chrA      9-10 ---      chrA     15-16 |         1	
> > >        1	 2
> > >   -------
> > >   regions: 8 ranges and 0 metadata columns
> > >   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> > > 
> > > 
> > > Sorting by the two columns yields what I am after.  Of course, I
> > > include the “quadX” column for illustration only.  Upon
> > > implementation, I would like these columns hidden from the user.
> > > 
> > > GenomicInteractions object with 10 interactions and 1 metadata
> > > column:
> > >        seqnames1   ranges1     seqnames2   ranges2
> > > |    counts     quad1     quad2
> > >            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> > > <integer> <integer>
> > >    [1]      chrA       3-4 ---      chrA       3-4 |         1	
> > >        0	 0
> > >    [2]      chrA       1-2 ---      chrA      9-10 |         1	
> > >        0	 1
> > >    [3]      chrA       5-6 ---      chrA       7-8 |         1	
> > >        0	 1
> > >    [4]      chrA       5-6 ---      chrA      9-10 |         1	
> > >        0	 1
> > >    [5]      chrA       7-8 ---      chrA       7-8 |         1	
> > >        0	 3
> > >    [6]      chrA       7-8 ---      chrA     11-12 |         1	
> > >        0	 3	 
> > >    [7]      chrA      9-10 ---      chrA      9-10 |         1	
> > >        0	 3
> > >    [8]      chrA       1-2 ---      chrA     15-16 |         1	
> > >        1	 0
> > >    [9]      chrA       7-8 ---      chrA     17-18 |         1	
> > >        1	 2
> > >   [10]      chrA      9-10 ---      chrA     15-16 |         1	
> > >        1	 2
> > >   -------
> > >   regions: 8 ranges and 0 metadata columns
> > >   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> > > 
> > > The sorting gives me the quad-tree structure, and each unique
> > > quadrant sequence defines the group.
> > >   
> > > 
> > > GenomicInteractions object with 10 interactions and 1 metadata
> > > column:
> > >        seqnames1   ranges1     seqnames2   ranges2 |    counts
> > >            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> > >    [1]      chrA       3-4 ---      chrA       3-4 |         1
> > >    [2]      chrA       1-2 ---      chrA      9-10 |         1
> > >    [3]      chrA       5-6 ---      chrA       7-8 |         1
> > >    [4]      chrA       5-6 ---      chrA      9-10 |         1
> > >    [5]      chrA       7-8 ---      chrA       7-8 |         1
> > >    [6]      chrA       7-8 ---      chrA     11-12 |         1
> > >    [7]      chrA      9-10 ---      chrA      9-10 |         1
> > >    [8]      chrA       1-2 ---      chrA     15-16 |         1
> > >    [9]      chrA       7-8 ---      chrA     17-18 |         1
> > >   [10]      chrA      9-10 ---      chrA     15-16 |         1
> > >   -------
> > >   regions: 8 ranges and 0 metadata columns
> > >   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> > > 
> > > 
> > > 2) Then I would like to merge the WxW window (i.e. bin the
> > > regions),
> > > expanding the ranges accordingly and adding the counts..  This
> > > process will
> > > 	- ***identify all range-pairs in the same window and merge them
> > > into a new range pair with appropriately expanded ranges*** (this
> > > is
> > > my primary goal)
> > > 	- sum the counts for each of the aforementioned range-pairs (i
> > > have already figured a way to do this)
> > > 
> > > 
> > > 
> > > GenomicInteractions object with 5 interactions and 1 metadata
> > > column:
> > >        seqnames1   ranges1     seqnames2   ranges2 |    counts
> > >            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> > >    [1]      chrA       1-6  ---      chrA      1-6  |         1
> > >    [2]      chrA       1-6  ---      chrA      7-12 |         3
> > >    [3]      chrA       7-12 ---      chrA      7-12 |         3
> > >    [4]      chrA       1-6  ---      chrA     13-18 |         1
> > >    [5]      chrA       7-12 ---      chrA     13-18 |         2
> > >   -------
> > >   regions: 3 ranges and 0 metadata columns
> > >   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> > > 
> > > 
> > > NOTE that ranges1 and ranges2 MUST expand so that the region
> > > width is
> > > 6, though the counts will only change if there exists another
> > > subrange covered by this bin/expansion that contains a positive
> > > count.
> > > 
> > > As always, speed in a concern.
> > > 
> > > Best,
> > > 
> > > — Luke Klein
> > >     PhD Student
> > >     Department of Statistics
> > >     University of California, Riverside
> > >     lklei001 using ucr.edu
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > Bioc-devel using r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> Hello.  I am planning to develop a new package which extends the
> GenomicInteractions package.  I would like some help/advice on
> implementing the following functionality.
> 
> Consider the follow GenomicInteractions object
> 
> GenomicInteractions object with 10 interactions and 1 metadata
> column:
>        seqnames1   ranges1     seqnames2   ranges2 |    counts
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
>    [1]      chrA       1-2 ---      chrA      9-10 |         1
>    [2]      chrA       1-2 ---      chrA     15-16 |         1
>    [3]      chrA       3-4 ---      chrA       3-4 |         1
>    [4]      chrA       5-6 ---      chrA       7-8 |         1
>    [5]      chrA       5-6 ---      chrA      9-10 |         1
>    [6]      chrA       7-8 ---      chrA       7-8 |         1
>    [7]      chrA       7-8 ---      chrA     11-12 |         1
>    [8]      chrA       7-8 ---      chrA     17-18 |         1
>    [9]      chrA      9-10 ---      chrA      9-10 |         1
>   [10]      chrA      9-10 ---      chrA     15-16 |         1
>   -------
>   regions: 8 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> 
> Which is visually represented thusly
> 
> 
> 
> I would like to do the following:
> 
> 1) I want to group the regions into bins of WxW (in this case, W will
> be 3), as in a quad-tree structure with the final group being WxW
> (instead of 2x2).  This will involve 
> 	- iteratively dividing the matrix into quadrants {upper-left
> (0), upper-right (1), lower-left (2), lower-right(3)} .
> 	- labeling each subdivision in a new column until the final WxW
> resolution is reached.
> 	- sorting by the columns
> 
> 
> 
> 
> GenomicInteractions object with 10 interactions and 1 metadata
> column:
>        seqnames1   ranges1     seqnames2   ranges2
> |    counts     quad1     quad2
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> <integer> <integer>
>    [1]      chrA       1-2 ---      chrA      9-10 |         1	
>        0	 1
>    [2]      chrA       1-2 ---      chrA     15-16 |         1	
>        1	 0
>    [3]      chrA       3-4 ---      chrA       3-4 |         1	
>        0	 0
>    [4]      chrA       5-6 ---      chrA       7-8 |         1	
>        0	 1
>    [5]      chrA       5-6 ---      chrA      9-10 |         1	
>        0	 1
>    [6]      chrA       7-8 ---      chrA       7-8 |         1	
>        0	 3	 
>    [7]      chrA       7-8 ---      chrA     11-12 |         1	
>        0	 3
>    [8]      chrA       7-8 ---      chrA     17-18 |         1	
>        1	 2
>    [9]      chrA      9-10 ---      chrA      9-10 |         1	
>        0	 3
>   [10]      chrA      9-10 ---      chrA     15-16 |         1	
>        1	 2
>   -------
>   regions: 8 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> 
> Sorting by the two columns yields what I am after.  Of course, I
> include the “quadX” column for illustration only.  Upon
> implementation, I would like these columns hidden from the user.
> 
> GenomicInteractions object with 10 interactions and 1 metadata
> column:
>        seqnames1   ranges1     seqnames2   ranges2
> |    counts     quad1     quad2
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
> <integer> <integer>
>    [1]      chrA       3-4 ---      chrA       3-4 |         1	
>        0	 0
>    [2]      chrA       1-2 ---      chrA      9-10 |         1	
>        0	 1
>    [3]      chrA       5-6 ---      chrA       7-8 |         1	
>        0	 1
>    [4]      chrA       5-6 ---      chrA      9-10 |         1	
>        0	 1
>    [5]      chrA       7-8 ---      chrA       7-8 |         1	
>        0	 3
>    [6]      chrA       7-8 ---      chrA     11-12 |         1	
>        0	 3	 
>    [7]      chrA      9-10 ---      chrA      9-10 |         1	
>        0	 3
>    [8]      chrA       1-2 ---      chrA     15-16 |         1	
>        1	 0
>    [9]      chrA       7-8 ---      chrA     17-18 |         1	
>        1	 2
>   [10]      chrA      9-10 ---      chrA     15-16 |         1	
>        1	 2
>   -------
>   regions: 8 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> The sorting gives me the quad-tree structure, and each unique
> quadrant sequence defines the group.
>   
> 
> GenomicInteractions object with 10 interactions and 1 metadata
> column:
>        seqnames1   ranges1     seqnames2   ranges2 |    counts
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
>    [1]      chrA       3-4 ---      chrA       3-4 |         1
>    [2]      chrA       1-2 ---      chrA      9-10 |         1
>    [3]      chrA       5-6 ---      chrA       7-8 |         1
>    [4]      chrA       5-6 ---      chrA      9-10 |         1
>    [5]      chrA       7-8 ---      chrA       7-8 |         1
>    [6]      chrA       7-8 ---      chrA     11-12 |         1
>    [7]      chrA      9-10 ---      chrA      9-10 |         1
>    [8]      chrA       1-2 ---      chrA     15-16 |         1
>    [9]      chrA       7-8 ---      chrA     17-18 |         1
>   [10]      chrA      9-10 ---      chrA     15-16 |         1
>   -------
>   regions: 8 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> 
> 2) Then I would like to merge the WxW window (i.e. bin the regions),
> expanding the ranges accordingly and adding the counts..  This
> process will
> 	- ***identify all range-pairs in the same window and merge them
> into a new range pair with appropriately expanded ranges*** (this is
> my primary goal)
> 	- sum the counts for each of the aforementioned range-pairs (i
> have already figured a way to do this)
> 
> 
> 
> GenomicInteractions object with 5 interactions and 1 metadata column:
>        seqnames1   ranges1     seqnames2   ranges2 |    counts
>            <Rle> <IRanges>         <Rle> <IRanges> | <integer>
>    [1]      chrA       1-6  ---      chrA      1-6  |         1
>    [2]      chrA       1-6  ---      chrA      7-12 |         3
>    [3]      chrA       7-12 ---      chrA      7-12 |         3
>    [4]      chrA       1-6  ---      chrA     13-18 |         1
>    [5]      chrA       7-12 ---      chrA     13-18 |         2
>   -------
>   regions: 3 ranges and 0 metadata columns
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 
> 
> NOTE that ranges1 and ranges2 MUST expand so that the region width is
> 6, though the counts will only change if there exists another
> subrange covered by this bin/expansion that contains a positive
> count.
> 
> As always, speed in a concern.
> 
> Best,
> 
> — Luke Klein
>     PhD Student
>     Department of Statistics
>     University of California, Riverside
>     lklei001 using ucr.edu
>



More information about the Bioc-devel mailing list