[Bioc-devel] Block bootstrap for GenomicRanges

Tue Aug 14 14:26:53 CEST 2018

dear Hervé,

Thanks again for the quick and useful reply!

I think that the theory behind the block bootstrap [Kunsch (1989), Liu
and Singh (1992), Politis and Romano (1994)], needs that the blocks be
drawn with replacement (you can get some features twice) and that the
blocks can be overlapping. In a hand-waving way, I think, it's "good"
for the variance estimation on any statistic of interest that y' may
have more or less features than y.

I will explore a bit using the solutions you've laid out.

Now that I think about it, the start-position based solution that I
was thinking about will break if two features in y share the same
start position, so that's not good.

On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès <hpages using fredhutch.org> wrote:
> That helps. I think I start to understand what you are after.
>
> See below...
>
>
> On 08/13/2018 06:07 PM, Michael Love wrote:
>>
>> dear Hervé,
>>
>> Thanks for the quick reply about directions to take this.
>>
>> I'm sorry for not providing sufficient detail about the goal of block
>> bootstrapping in my initial post. Let me try again. For a moment, let
>> me ignore multiple chromosomes/seqs and just focus on a single set of
>> IRanges.
>>
>> The point of the block bootstrap is: Let's say we want to find the
>> number of overlaps of x and y, and then assess how surprised we are at
>> how large that overlap is. Both of them may have features that tend to
>> cluster together along the genome (independently). One method would
>> just be to move the features in y around to random start sites, making
>> y', say B times, and then calculate each of the B times the number of
>> overlaps between x and y'. Or we might make this better by having
>> blacklisted sites where the randomly shuffled features in y cannot go.
>>
>> The block bootstrap is an alternative to randomly moving the start
>> sites, where instead we create random data, by taking big "blocks" of
>> features in y. Each block is a lot like a View. And the ultimate goal
>> is to make B versions of the data y where the features have been
>> shuffled around, but by taking blocks, we preserve the clumpiness of
>> the features in y.
>>
>> Let me give some numbers to make this more concrete, so say we're
>> making a single block bootstrap sample of a chromosome that is 1000 bp
>> long. Here is the original y:
>>
>> y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)
>>
>> If I go with my coverage approach, I should extend it all the way to
>> the end of the chromosome. Here I lose information if there are
>> overlaps of features in y, and I'm thinking of a fix I'll describe
>> below.
>>
>> cov <- c(coverage(y), Rle(rep(0,55)))
>>
>> I could make one block bootstrap sample of y (this is 1 out of B in
>> the ultimate procedure) by taking 10 blocks of width 100. The blocks
>> have random start positions from 1 to 901.
>>
>> y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))
>
>
> Choosing blocks that can overlap with each others could make y' appear
> to have more features than y (by repeating some of the original
> features). Also choosing blocks that can leave big gaps in the
> chromosome could make y' appear to have less features than y
> (by dropping some of the original ranges). Isn't that a problem?
>
> Have you considered choosing a set of blocks that represent a
> partitioning of the chromosome? This would make y' look more like y
> by preserving the number of original ranges.
>
> In other words, if you can assign each range in y to one block and
> one block only, then you could generate y' by just shifting the
> ranges in y. The exact amount of shifting (positive or negative)
> would only depend on the block that the range belongs to. This would
> give you an y' that is parallel to y (i.e. same number of ranges
> and y'[i] corresponds to y[i]).
>
> Something like this:
>
> 1) Define the blocks (must be a partitioning of the chromosome):
>
>   blocks <- successiveIRanges(rep(100, 10))
>
> 2) Assign each range in y to the block it belongs to:
>
>   mcols(y)$block <- findOverlaps(y, blocks, select="first")
>
> 1) and 2) are preliminary steps that you only need to do once,
> before you generate B versions of the shuffled data (what you
> call y'). The next steps are for generating one version of the
> shuffled data so would need to be repeated B times.
>
> 3) Shuffle the blocks:
>
>   perm <- sample(length(blocks))
>   perm
>   # [1]  6  5  8  3  2  7  1  9  4 10
>
>   permuted_blocks <- successiveIRanges(width(blocks)[perm])
>   permuted_blocks[perm] <- permuted_blocks
>
> 4) Compute the shift along the chromosome that each block went
> thru:
>
>   block_shift <- start(permuted_blocks) - start(blocks)
>
> 5) Apply this shift to the ranges in y:
>
>   shift(y, block_shift[mcols(y)$block])
>   # IRanges object with 12 ranges and 1 metadata column:
>   #          start       end     width |     block
>   #      <integer> <integer> <integer> | <integer>
>   #  [1]       651       655         5 |         1
>   #  [2]       661       665         5 |         1
>   #  [3]       671       675         5 |         1
>   #  [4]       411       415         5 |         2
>   #  [5]       421       425         5 |         2
>   #  ...       ...       ...       ... .       ...
>   #  [8]        11        15         5 |         6
>   #  [9]        21        25         5 |         6
>   # [10]       921       925         5 |        10
>   # [11]       931       935         5 |        10
>   # [12]       941       945         5 |        10
>
> This code would work with overlapping ranges in y and/or
> blocks of variable size.
>
> Hope this helps,
>
> H.
>
>>
>> This last line below is a hack to get back to the ranges for a single
>> block bootstrap sample of y. It works here, but only because none of
>> the features in y were overlapping each other.
>>
>> Instead of coverage(), if I'd made an Rle where the non-zero elements
>> are located at the start of ranges, and the value is the width of the
>> range, I think this Views approach would actually work.
>>
>> y.boot.1.rng <- IRanges(start(y.boot.1)[runValue(y.boot.1) == 1],
>>    width=runLength(y.boot.1)[runValue(y.boot.1) == 1])
>>
>> I'm interested in building a function that takes in IRanges and
>> outputs these shuffled set of IRanges.
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages using fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319