[Bioc-devel] Block bootstrap for GenomicRanges

Hervé Pagès hp@ge@ @ending from fredhutch@org
Tue Aug 14 01:30:44 CEST 2018


Hi Michael,

On 08/13/2018 01:39 PM, Michael Love wrote:
> dear Bioc-devels,
> 
>  From a conversation on Twitter [1], I've been wondering if there has
> been any work to implement block bootstrap for GRanges.
> 
> I guess that's my main question, and then the rest is details and
> ideas in case it hasn't already been done.
> 
> The block bootstrap helps to create perhaps more realistic null data,
> in way that deals with clumping of features. You take many random,
> potentially overlapping blocks from the original data (ranges) to fill
> up a new set of data (ranges). There's a 2011 paper from Peter Bickel
> about this topic [2], which had some python software associated with
> it.
> 
> Aside from the statistical considerations, one would want to do each
> bootstrap quickly so you can make many bootstrap samples, and so
> having vectorized code would be an obvious desirable thing. I was
> thinking about how the Views infrastructure is kind of what one would
> want, but it operates on Rle's if I remember correctly, and the
> desired output here would be ranges. Below is kind of getting what I
> want, as a hack I'd then convert the Rle's back to ranges and
> concatenate them if I wanted 5 x 40bp = 200 bp of bootstrap data.
> 
>> ir <- IRanges(c(51,61,71,111,121,131,200),width=5) # make some clumpy data
>> Views(coverage(ir), start=round(runif(5,1,150)), width=40) # e.g. 5 random 40 bp blocks
> Views on a 204-length Rle subject
> 
> views:
>      start end width
> [1]    97 136    40 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1
> 1 1 1 1 0 0 0 0 0 1 1 ...]
> [2]    50  89    40 [0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1
> 1 0 0 0 0 0 0 0 0 0 0 ...]
> [3]    18  57    40 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 1 1 1 ...]
> [4]     8  47    40 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0 0 0 ...]
> [5]    97 136    40 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1
> 1 1 1 1 0 0 0 0 0 1 1 ...]

Never heard of block bootstrapping before and don't really understand
the issue nor what you are after. My feeling though is that your
question might be about data representation, not just about block
bootstrapping implementations. So I'll try to focus on the data
representation aspect. Sorry if I'm completely off-topic.

IIUC you have blocks along the genome and these blocks are represented
by genomic ranges. Now you want to be able to represent some genomic
regions of interest (what you call "views") and for each region of
interest the subregions that correspond to blocks. Yes a Views object
sounds like a natural candidate for that but as you pointed out each
view is actually returned to the user as an Rle, not as a range-based
container. Another problem is that creating views on a GRanges object
is not currently supported.

I can think of at least 2 alternatives: (1) use a GPos-based
representation, or (2) use a GRangesList nested in a GRanges (or
the other way around).

(1) GPos-based representation
-----------------------------

The full object (i.e. before extraction of regions of interest) would
look like this:

 > gpos
GPos object with 207 positions and 1 metadata column:
         seqnames       pos strand | block
            <Rle> <integer>  <Rle> | <Rle>
     [1]     chr1         1      * |     0
     [2]     chr1         2      * |     0
     [3]     chr1         3      * |     0
     [4]     chr1         4      * |     0
     [5]     chr1         5      * |     0
     ...      ...       ...    ... .   ...
   [203]     chr1       203      * |     1
   [204]     chr1       204      * |     1
   [205]     chr1       205      * |     0
   [206]     chr1       206      * |     0
   [207]     chr1       207      * |     0
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

This is assuming that we're only working with chromosome 1, even though,
like with a GRanges object, a GPos object can contain ranges on various
chromosomes.

If 'my_ROI' is a GRanges object containing my regions of interest:

 > my_ROI
GRanges object with 4 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1    97-136      *
   [2]     chr1     50-89      *
   [3]     chr1     18-57      *
   [4]     chr1      8-47      *
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then extracting these regions from 'gpos' is just a matter of subsetting
'gpos' with subsetByOverlaps():

 > subsetByOverlaps(gpos, my_ROI)
GPos object with 122 positions and 1 metadata column:
         seqnames       pos strand | block
            <Rle> <integer>  <Rle> | <Rle>
     [1]     chr1         8      * |     0
     [2]     chr1         9      * |     0
     [3]     chr1        10      * |     0
     [4]     chr1        11      * |     0
     [5]     chr1        12      * |     0
     ...      ...       ...    ... .   ...
   [118]     chr1       132      * |     1
   [119]     chr1       133      * |     1
   [120]     chr1       134      * |     1
   [121]     chr1       135      * |     1
   [122]     chr1       136      * |     0
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

There are 3 serious cons with this approach though: one is that the
subsetted object doesn't keep track of the separation between the
regions of interest. Another one is that nothing prevents the user
from completely loosing the hierarchy of blocks within regions of
interest by shuffling the object elements. The last one is that the
blocks are still represented with Rle's, which is not the natural
way to represent genomic ranges.

(2) Nested GRanges/GRangesList combo
------------------------------------

It would look something like this:

 > my_ROI
GRanges object with 4 ranges and 1 metadata column:
       seqnames    ranges strand |                                 blocks
          <Rle> <IRanges>  <Rle> |                          <GRangesList>
   [1]     chr1    97-136      * | chr1:111-115,chr1:121-125,chr1:131-135
   [2]     chr1     50-89      * |       chr1:51-55,chr1:61-65,chr1:71-75
   [3]     chr1     18-57      * |                             chr1:51-55
   [4]     chr1      8-47      * |
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that the chromosome and strand info of the "blocks" metadata column
is not needed (the chrom can be inferred from the regions of interest
and blocks are unstranded features by nature). So a simpler solution is
to use an IRangesList instead of a GRangesList for the blocks:

 > my_ROI
GRanges object with 4 ranges and 1 metadata column:
       seqnames    ranges strand |                  blocks
          <Rle> <IRanges>  <Rle> |           <IRangesList>
   [1]     chr1    97-136      * | 111-115,121-125,131-135
   [2]     chr1     50-89      * |       51-55,61-65,71-75
   [3]     chr1     18-57      * |                   51-55
   [4]     chr1      8-47      * |
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now we don't provide an easy way to create or operate on this kind of
object at the moment but that's something we could work on. Am I
completely off-topic or is this the kind of object that would be
useful for your block bootstrapping use case?

Finally note that from a data representation point of view, we face
the same problem for representing a set of transcripts and their
exons. More generally speaking we don't have a simple/easy way to
represent a hierarchy of ranges. For example right now we provide 2
separate extractors for the transcripts and exons of a TxDb object
but we don't have an extractor that returns the transcripts and exons
ranges together in a single object that reflects the range hierarchy.
This has to be done by hand with something like:

 > tx <- transcripts(txdb, columns=NULL)
 > mcols(tx)$exons <- as(exonsBy(txdb, by="tx"), "IRangesList")
 > tx[1:3]
GRanges object with 3 ranges and 1 metadata column:
       seqnames            ranges strand |
          <Rle>         <IRanges>  <Rle> |
   [1]     chr1 32671236-32674288      + |
   [2]     chr1 32671236-32674288      + |
   [3]     chr1 32671236-32674288      + |
                                                           exons
                                                   <IRangesList>
   [1] 32671236-32671324,32671755-32671898,32672110-32672362,...
   [2] 32671236-32671564,32671755-32671898,32672110-32672362,...
   [3] 32671236-32671324,32671755-32672362,32672603-32672721,...
   -------
   seqinfo: 93 sequences (1 circular) from hg19 genome

or with something like:

 > transcripts(txdb, columns=c("exon_start", "exon_end"))[1:3]
GRanges object with 3 ranges and 2 metadata columns:
       seqnames            ranges strand |                     exon_start
          <Rle>         <IRanges>  <Rle> |                  <IntegerList>
   [1]     chr1 32671236-32674288      + | 32671236,32671755,32672110,...
   [2]     chr1 32671236-32674288      + | 32671236,32671755,32672110,...
   [3]     chr1 32671236-32674288      + | 32671236,32671755,32672603,...
                             exon_end
                        <IntegerList>
   [1] 32671324,32671898,32672362,...
   [2] 32671564,32671898,32672362,...
   [3] 32671324,32672362,32672721,...
   -------
   seqinfo: 93 sequences (1 circular) from hg19 genome

H.


> 
> Another option would be to convert to time-series data, then use an
> existing package like 'boot' then convert back to ranges.
> 
> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__twitter.com_mikelove_status_1027296115346092032&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=58qAk0b5MsXbsLCIUZP84lVgqb3DywZToIUQoX2WpTc&s=jujtHTMS-p5wE8t4KciQuk70osCp836zSy6QMqGtqA8&e=
> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1101.0947&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=58qAk0b5MsXbsLCIUZP84lVgqb3DywZToIUQoX2WpTc&s=ImgUVf9qB2n3xNN8-8GQxPe8P5o4EBiL_st1fXkVpD0&e=
> 
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=58qAk0b5MsXbsLCIUZP84lVgqb3DywZToIUQoX2WpTc&s=kaZA6WamwApoWQOGryaxzaB3GgxiLFgfd1YRK4w0O7U&e=
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list