[Bioc-devel] Block bootstrap for GenomicRanges
Hervé Pagès
hp@ge@ @ending from fredhutch@org
Tue Aug 14 18:49:03 CEST 2018
I see. If the blocks need to be completely random (but still
with the constraint that once rearranged they form a partition
of the chromosome), then a slightly modified solution would be:
## Add a feature id to the ranges in y. This is not required
## but will help see what happens to the features:
mcols(y)$feature_id <- head(letters, length(y))
y
# IRanges object with 12 ranges and 1 metadata column:
# start end width | feature_id
# <integer> <integer> <integer> | <character>
# [1] 51 55 5 | a
# [2] 61 65 5 | b
# [3] 71 75 5 | c
# [4] 111 115 5 | d
# [5] 121 125 5 | e
# ... ... ... ... . ...
# [8] 511 515 5 | h
# [9] 521 525 5 | i
# [10] 921 925 5 | j
# [11] 931 935 5 | k
# [12] 941 945 5 | l
## Generate the random blocks:
random_blocks <- IRanges(start=round(runif(10,1,901)), width=100)
## Add a block id. Again, not needed for the algo below, but will
## help understand the final object y_prime:
mcols(random_blocks)$block_id <- head(LETTERS, length(random_blocks))
random_blocks
# IRanges object with 10 ranges and 1 metadata column:
# start end width | block_id
# <integer> <integer> <integer> | <character>
# [1] 283 382 100 | A
# [2] 898 997 100 | B
# [3] 298 397 100 | C
# [4] 680 779 100 | D
# [5] 722 821 100 | E
# [6] 632 731 100 | F
# [7] 594 693 100 | G
# [8] 689 788 100 | H
# [9] 886 985 100 | I
# [10] 673 772 100 | J
## Compute the shift involved in rearranging each block:
rearranged_blocks <- successiveIRanges(width(random_blocks))
block_shift <- start(rearranged_blocks) - start(random_blocks)
## Compute y':
y_prime <- do.call(c,
lapply(seq_along(random_blocks),
function(b) {
features_to_shift <- subsetByOverlaps(y, random_blocks[b])
block_id <- mcols(random_blocks)$block_id[b]
mcols(features_to_shift)$block_id <- rep(block_id,
length(features_to_shift))
shift(features_to_shift, block_shift[b])
}
)
)
y_prime
# IRanges object with 6 ranges and 2 metadata columns:
# start end width | feature_id block_id
# <integer> <integer> <integer> | <character> <character>
# [1] 124 128 5 | j B
# [2] 134 138 5 | k B
# [3] 144 148 5 | l B
# [4] 836 840 5 | j I
# [5] 846 850 5 | k I
# [6] 856 860 5 | l I
Still based on shift(), which avoids all the little annoyances
of using Rle's as an intermediate representation of the ranges.
It uses a loop which might be problem if the number of blocks is
big (say more than 50000). There might be a way to avoid the loop
though, but it's probably not trivial...
H.
On 08/14/2018 05:26 AM, Michael Love wrote:
> dear Hervé,
>
> Thanks again for the quick and useful reply!
>
> I think that the theory behind the block bootstrap [Kunsch (1989), Liu
> and Singh (1992), Politis and Romano (1994)], needs that the blocks be
> drawn with replacement (you can get some features twice) and that the
> blocks can be overlapping. In a hand-waving way, I think, it's "good"
> for the variance estimation on any statistic of interest that y' may
> have more or less features than y.
>
> I will explore a bit using the solutions you've laid out.
>
> Now that I think about it, the start-position based solution that I
> was thinking about will break if two features in y share the same
> start position, so that's not good.
>
> On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès <hpages using fredhutch.org> wrote:
>> That helps. I think I start to understand what you are after.
>>
>> See below...
>>
>>
>> On 08/13/2018 06:07 PM, Michael Love wrote:
>>>
>>> dear Hervé,
>>>
>>> Thanks for the quick reply about directions to take this.
>>>
>>> I'm sorry for not providing sufficient detail about the goal of block
>>> bootstrapping in my initial post. Let me try again. For a moment, let
>>> me ignore multiple chromosomes/seqs and just focus on a single set of
>>> IRanges.
>>>
>>> The point of the block bootstrap is: Let's say we want to find the
>>> number of overlaps of x and y, and then assess how surprised we are at
>>> how large that overlap is. Both of them may have features that tend to
>>> cluster together along the genome (independently). One method would
>>> just be to move the features in y around to random start sites, making
>>> y', say B times, and then calculate each of the B times the number of
>>> overlaps between x and y'. Or we might make this better by having
>>> blacklisted sites where the randomly shuffled features in y cannot go.
>>>
>>> The block bootstrap is an alternative to randomly moving the start
>>> sites, where instead we create random data, by taking big "blocks" of
>>> features in y. Each block is a lot like a View. And the ultimate goal
>>> is to make B versions of the data y where the features have been
>>> shuffled around, but by taking blocks, we preserve the clumpiness of
>>> the features in y.
>>>
>>> Let me give some numbers to make this more concrete, so say we're
>>> making a single block bootstrap sample of a chromosome that is 1000 bp
>>> long. Here is the original y:
>>>
>>> y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)
>>>
>>> If I go with my coverage approach, I should extend it all the way to
>>> the end of the chromosome. Here I lose information if there are
>>> overlaps of features in y, and I'm thinking of a fix I'll describe
>>> below.
>>>
>>> cov <- c(coverage(y), Rle(rep(0,55)))
>>>
>>> I could make one block bootstrap sample of y (this is 1 out of B in
>>> the ultimate procedure) by taking 10 blocks of width 100. The blocks
>>> have random start positions from 1 to 901.
>>>
>>> y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))
>>
>>
>> Choosing blocks that can overlap with each others could make y' appear
>> to have more features than y (by repeating some of the original
>> features). Also choosing blocks that can leave big gaps in the
>> chromosome could make y' appear to have less features than y
>> (by dropping some of the original ranges). Isn't that a problem?
>>
>> Have you considered choosing a set of blocks that represent a
>> partitioning of the chromosome? This would make y' look more like y
>> by preserving the number of original ranges.
>>
>> In other words, if you can assign each range in y to one block and
>> one block only, then you could generate y' by just shifting the
>> ranges in y. The exact amount of shifting (positive or negative)
>> would only depend on the block that the range belongs to. This would
>> give you an y' that is parallel to y (i.e. same number of ranges
>> and y'[i] corresponds to y[i]).
>>
>> Something like this:
>>
>> 1) Define the blocks (must be a partitioning of the chromosome):
>>
>> blocks <- successiveIRanges(rep(100, 10))
>>
>> 2) Assign each range in y to the block it belongs to:
>>
>> mcols(y)$block <- findOverlaps(y, blocks, select="first")
>>
>> 1) and 2) are preliminary steps that you only need to do once,
>> before you generate B versions of the shuffled data (what you
>> call y'). The next steps are for generating one version of the
>> shuffled data so would need to be repeated B times.
>>
>> 3) Shuffle the blocks:
>>
>> perm <- sample(length(blocks))
>> perm
>> # [1] 6 5 8 3 2 7 1 9 4 10
>>
>> permuted_blocks <- successiveIRanges(width(blocks)[perm])
>> permuted_blocks[perm] <- permuted_blocks
>>
>> 4) Compute the shift along the chromosome that each block went
>> thru:
>>
>> block_shift <- start(permuted_blocks) - start(blocks)
>>
>> 5) Apply this shift to the ranges in y:
>>
>> shift(y, block_shift[mcols(y)$block])
>> # IRanges object with 12 ranges and 1 metadata column:
>> # start end width | block
>> # <integer> <integer> <integer> | <integer>
>> # [1] 651 655 5 | 1
>> # [2] 661 665 5 | 1
>> # [3] 671 675 5 | 1
>> # [4] 411 415 5 | 2
>> # [5] 421 425 5 | 2
>> # ... ... ... ... . ...
>> # [8] 11 15 5 | 6
>> # [9] 21 25 5 | 6
>> # [10] 921 925 5 | 10
>> # [11] 931 935 5 | 10
>> # [12] 941 945 5 | 10
>>
>> This code would work with overlapping ranges in y and/or
>> blocks of variable size.
>>
>> Hope this helps,
>>
>> H.
>>
>>>
>>> This last line below is a hack to get back to the ranges for a single
>>> block bootstrap sample of y. It works here, but only because none of
>>> the features in y were overlapping each other.
>>>
>>> Instead of coverage(), if I'd made an Rle where the non-zero elements
>>> are located at the start of ranges, and the value is the width of the
>>> range, I think this Views approach would actually work.
>>>
>>> y.boot.1.rng <- IRanges(start(y.boot.1)[runValue(y.boot.1) == 1],
>>> width=runLength(y.boot.1)[runValue(y.boot.1) == 1])
>>>
>>> I'm interested in building a function that takes in IRanges and
>>> outputs these shuffled set of IRanges.
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages using fredhutch.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list