[Bioc-devel] Block bootstrap for GenomicRanges

Tue Aug 14 18:49:03 CEST 2018

I see. If the blocks need to be completely random (but still
with the constraint that once rearranged they form a partition
of the chromosome), then a slightly modified solution would be:

## Add a feature id to the ranges in y. This is not required
## but will help see what happens to the features:

   mcols(y)$feature_id <- head(letters, length(y))
   y
   # IRanges object with 12 ranges and 1 metadata column:
   #          start       end     width |  feature_id
   #      <integer> <integer> <integer> | <character>
   #  [1]        51        55         5 |           a
   #  [2]        61        65         5 |           b
   #  [3]        71        75         5 |           c
   #  [4]       111       115         5 |           d
   #  [5]       121       125         5 |           e
   #  ...       ...       ...       ... .         ...
   #  [8]       511       515         5 |           h
   #  [9]       521       525         5 |           i
   # [10]       921       925         5 |           j
   # [11]       931       935         5 |           k
   # [12]       941       945         5 |           l

## Generate the random blocks:

   random_blocks <- IRanges(start=round(runif(10,1,901)), width=100)

## Add a block id. Again, not needed for the algo below, but will
## help understand the final object y_prime:

   mcols(random_blocks)$block_id <- head(LETTERS, length(random_blocks))
   random_blocks
   # IRanges object with 10 ranges and 1 metadata column:
   #          start       end     width |    block_id
   #      <integer> <integer> <integer> | <character>
   #  [1]       283       382       100 |           A
   #  [2]       898       997       100 |           B
   #  [3]       298       397       100 |           C
   #  [4]       680       779       100 |           D
   #  [5]       722       821       100 |           E
   #  [6]       632       731       100 |           F
   #  [7]       594       693       100 |           G
   #  [8]       689       788       100 |           H
   #  [9]       886       985       100 |           I
   # [10]       673       772       100 |           J

## Compute the shift involved in rearranging each block:

   rearranged_blocks <- successiveIRanges(width(random_blocks))
   block_shift <- start(rearranged_blocks) - start(random_blocks)

## Compute y':

   y_prime <- do.call(c,
     lapply(seq_along(random_blocks),
       function(b) {
         features_to_shift <- subsetByOverlaps(y, random_blocks[b])
         block_id <- mcols(random_blocks)$block_id[b]
         mcols(features_to_shift)$block_id <- rep(block_id, 
length(features_to_shift))
         shift(features_to_shift, block_shift[b])
       }
     )
   )

   y_prime
   # IRanges object with 6 ranges and 2 metadata columns:
   #         start       end     width |  feature_id    block_id
   #     <integer> <integer> <integer> | <character> <character>
   # [1]       124       128         5 |           j           B
   # [2]       134       138         5 |           k           B
   # [3]       144       148         5 |           l           B
   # [4]       836       840         5 |           j           I
   # [5]       846       850         5 |           k           I
   # [6]       856       860         5 |           l           I

Still based on shift(), which avoids all the little annoyances
of using Rle's as an intermediate representation of the ranges.

It uses a loop which might be problem if the number of blocks is
big (say more than 50000). There might be a way to avoid the loop
though, but it's probably not trivial...

H.

On 08/14/2018 05:26 AM, Michael Love wrote:
> dear Hervé,
> 
> Thanks again for the quick and useful reply!
> 
> I think that the theory behind the block bootstrap [Kunsch (1989), Liu
> and Singh (1992), Politis and Romano (1994)], needs that the blocks be
> drawn with replacement (you can get some features twice) and that the
> blocks can be overlapping. In a hand-waving way, I think, it's "good"
> for the variance estimation on any statistic of interest that y' may
> have more or less features than y.
> 
> I will explore a bit using the solutions you've laid out.
> 
> Now that I think about it, the start-position based solution that I
> was thinking about will break if two features in y share the same
> start position, so that's not good.
> 
> On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès <hpages using fredhutch.org> wrote:
>> That helps. I think I start to understand what you are after.
>>
>> See below...
>>
>>
>> On 08/13/2018 06:07 PM, Michael Love wrote:
>>>
>>> dear Hervé,
>>>
>>> Thanks for the quick reply about directions to take this.
>>>
>>> I'm sorry for not providing sufficient detail about the goal of block
>>> bootstrapping in my initial post. Let me try again. For a moment, let
>>> me ignore multiple chromosomes/seqs and just focus on a single set of
>>> IRanges.
>>>
>>> The point of the block bootstrap is: Let's say we want to find the
>>> number of overlaps of x and y, and then assess how surprised we are at
>>> how large that overlap is. Both of them may have features that tend to
>>> cluster together along the genome (independently). One method would
>>> just be to move the features in y around to random start sites, making
>>> y', say B times, and then calculate each of the B times the number of
>>> overlaps between x and y'. Or we might make this better by having
>>> blacklisted sites where the randomly shuffled features in y cannot go.
>>>
>>> The block bootstrap is an alternative to randomly moving the start
>>> sites, where instead we create random data, by taking big "blocks" of
>>> features in y. Each block is a lot like a View. And the ultimate goal
>>> is to make B versions of the data y where the features have been
>>> shuffled around, but by taking blocks, we preserve the clumpiness of
>>> the features in y.
>>>
>>> Let me give some numbers to make this more concrete, so say we're
>>> making a single block bootstrap sample of a chromosome that is 1000 bp
>>> long. Here is the original y:
>>>
>>> y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)
>>>
>>> If I go with my coverage approach, I should extend it all the way to
>>> the end of the chromosome. Here I lose information if there are
>>> overlaps of features in y, and I'm thinking of a fix I'll describe
>>> below.
>>>
>>> cov <- c(coverage(y), Rle(rep(0,55)))
>>>
>>> I could make one block bootstrap sample of y (this is 1 out of B in
>>> the ultimate procedure) by taking 10 blocks of width 100. The blocks
>>> have random start positions from 1 to 901.
>>>
>>> y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))
>>
>>
>> Choosing blocks that can overlap with each others could make y' appear
>> to have more features than y (by repeating some of the original
>> features). Also choosing blocks that can leave big gaps in the
>> chromosome could make y' appear to have less features than y
>> (by dropping some of the original ranges). Isn't that a problem?
>>
>> Have you considered choosing a set of blocks that represent a
>> partitioning of the chromosome? This would make y' look more like y
>> by preserving the number of original ranges.
>>
>> In other words, if you can assign each range in y to one block and
>> one block only, then you could generate y' by just shifting the
>> ranges in y. The exact amount of shifting (positive or negative)
>> would only depend on the block that the range belongs to. This would
>> give you an y' that is parallel to y (i.e. same number of ranges
>> and y'[i] corresponds to y[i]).
>>
>> Something like this:
>>
>> 1) Define the blocks (must be a partitioning of the chromosome):
>>
>>    blocks <- successiveIRanges(rep(100, 10))
>>
>> 2) Assign each range in y to the block it belongs to:
>>
>>    mcols(y)$block <- findOverlaps(y, blocks, select="first")
>>
>> 1) and 2) are preliminary steps that you only need to do once,
>> before you generate B versions of the shuffled data (what you
>> call y'). The next steps are for generating one version of the
>> shuffled data so would need to be repeated B times.
>>
>> 3) Shuffle the blocks:
>>
>>    perm <- sample(length(blocks))
>>    perm
>>    # [1]  6  5  8  3  2  7  1  9  4 10
>>
>>    permuted_blocks <- successiveIRanges(width(blocks)[perm])
>>    permuted_blocks[perm] <- permuted_blocks
>>
>> 4) Compute the shift along the chromosome that each block went
>> thru:
>>
>>    block_shift <- start(permuted_blocks) - start(blocks)
>>
>> 5) Apply this shift to the ranges in y:
>>
>>    shift(y, block_shift[mcols(y)$block])
>>    # IRanges object with 12 ranges and 1 metadata column:
>>    #          start       end     width |     block
>>    #      <integer> <integer> <integer> | <integer>
>>    #  [1]       651       655         5 |         1
>>    #  [2]       661       665         5 |         1
>>    #  [3]       671       675         5 |         1
>>    #  [4]       411       415         5 |         2
>>    #  [5]       421       425         5 |         2
>>    #  ...       ...       ...       ... .       ...
>>    #  [8]        11        15         5 |         6
>>    #  [9]        21        25         5 |         6
>>    # [10]       921       925         5 |        10
>>    # [11]       931       935         5 |        10
>>    # [12]       941       945         5 |        10
>>
>> This code would work with overlapping ranges in y and/or
>> blocks of variable size.
>>
>> Hope this helps,
>>
>> H.
>>
>>>
>>> This last line below is a hack to get back to the ranges for a single
>>> block bootstrap sample of y. It works here, but only because none of
>>> the features in y were overlapping each other.
>>>
>>> Instead of coverage(), if I'd made an Rle where the non-zero elements
>>> are located at the start of ranges, and the value is the width of the
>>> range, I think this Views approach would actually work.
>>>
>>> y.boot.1.rng <- IRanges(start(y.boot.1)[runValue(y.boot.1) == 1],
>>>     width=runLength(y.boot.1)[runValue(y.boot.1) == 1])
>>>
>>> I'm interested in building a function that takes in IRanges and
>>> outputs these shuffled set of IRanges.
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages using fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319