[Bioc-devel] Block bootstrap for GenomicRanges

Tue Aug 14 15:06:13 CEST 2018

I agree this is super important.  I think there may be multiple ways of
thinking about a decent bootstrapping or permutation of ranges, in
genomics. I am quite interested in the topic. I think it might belong in a
new package. I would be interesting in extending the conversation and have
a couple of different approaches (theoretical) that we could work on being
efficient.

Best,
Kasper

On Tue, Aug 14, 2018 at 8:27 AM Michael Love <michaelisaiahlove using gmail.com>
wrote:

> dear Hervé,
>
> Thanks again for the quick and useful reply!
>
> I think that the theory behind the block bootstrap [Kunsch (1989), Liu
> and Singh (1992), Politis and Romano (1994)], needs that the blocks be
> drawn with replacement (you can get some features twice) and that the
> blocks can be overlapping. In a hand-waving way, I think, it's "good"
> for the variance estimation on any statistic of interest that y' may
> have more or less features than y.
>
> I will explore a bit using the solutions you've laid out.
>
> Now that I think about it, the start-position based solution that I
> was thinking about will break if two features in y share the same
> start position, so that's not good.
>
> On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès <hpages using fredhutch.org>
> wrote:
> > That helps. I think I start to understand what you are after.
> >
> > See below...
> >
> >
> > On 08/13/2018 06:07 PM, Michael Love wrote:
> >>
> >> dear Hervé,
> >>
> >> Thanks for the quick reply about directions to take this.
> >>
> >> I'm sorry for not providing sufficient detail about the goal of block
> >> bootstrapping in my initial post. Let me try again. For a moment, let
> >> me ignore multiple chromosomes/seqs and just focus on a single set of
> >> IRanges.
> >>
> >> The point of the block bootstrap is: Let's say we want to find the
> >> number of overlaps of x and y, and then assess how surprised we are at
> >> how large that overlap is. Both of them may have features that tend to
> >> cluster together along the genome (independently). One method would
> >> just be to move the features in y around to random start sites, making
> >> y', say B times, and then calculate each of the B times the number of
> >> overlaps between x and y'. Or we might make this better by having
> >> blacklisted sites where the randomly shuffled features in y cannot go.
> >>
> >> The block bootstrap is an alternative to randomly moving the start
> >> sites, where instead we create random data, by taking big "blocks" of
> >> features in y. Each block is a lot like a View. And the ultimate goal
> >> is to make B versions of the data y where the features have been
> >> shuffled around, but by taking blocks, we preserve the clumpiness of
> >> the features in y.
> >>
> >> Let me give some numbers to make this more concrete, so say we're
> >> making a single block bootstrap sample of a chromosome that is 1000 bp
> >> long. Here is the original y:
> >>
> >> y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)
> >>
> >> If I go with my coverage approach, I should extend it all the way to
> >> the end of the chromosome. Here I lose information if there are
> >> overlaps of features in y, and I'm thinking of a fix I'll describe
> >> below.
> >>
> >> cov <- c(coverage(y), Rle(rep(0,55)))
> >>
> >> I could make one block bootstrap sample of y (this is 1 out of B in
> >> the ultimate procedure) by taking 10 blocks of width 100. The blocks
> >> have random start positions from 1 to 901.
> >>
> >> y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))
> >
> >
> > Choosing blocks that can overlap with each others could make y' appear
> > to have more features than y (by repeating some of the original
> > features). Also choosing blocks that can leave big gaps in the
> > chromosome could make y' appear to have less features than y
> > (by dropping some of the original ranges). Isn't that a problem?
> >
> > Have you considered choosing a set of blocks that represent a
> > partitioning of the chromosome? This would make y' look more like y
> > by preserving the number of original ranges.
> >
> > In other words, if you can assign each range in y to one block and
> > one block only, then you could generate y' by just shifting the
> > ranges in y. The exact amount of shifting (positive or negative)
> > would only depend on the block that the range belongs to. This would
> > give you an y' that is parallel to y (i.e. same number of ranges
> > and y'[i] corresponds to y[i]).
> >
> > Something like this:
> >
> > 1) Define the blocks (must be a partitioning of the chromosome):
> >
> >   blocks <- successiveIRanges(rep(100, 10))
> >
> > 2) Assign each range in y to the block it belongs to:
> >
> >   mcols(y)$block <- findOverlaps(y, blocks, select="first")
> >
> > 1) and 2) are preliminary steps that you only need to do once,
> > before you generate B versions of the shuffled data (what you
> > call y'). The next steps are for generating one version of the
> > shuffled data so would need to be repeated B times.
> >
> > 3) Shuffle the blocks:
> >
> >   perm <- sample(length(blocks))
> >   perm
> >   # [1]  6  5  8  3  2  7  1  9  4 10
> >
> >   permuted_blocks <- successiveIRanges(width(blocks)[perm])
> >   permuted_blocks[perm] <- permuted_blocks
> >
> > 4) Compute the shift along the chromosome that each block went
> > thru:
> >
> >   block_shift <- start(permuted_blocks) - start(blocks)
> >
> > 5) Apply this shift to the ranges in y:
> >
> >   shift(y, block_shift[mcols(y)$block])
> >   # IRanges object with 12 ranges and 1 metadata column:
> >   #          start       end     width |     block
> >   #      <integer> <integer> <integer> | <integer>
> >   #  [1]       651       655         5 |         1
> >   #  [2]       661       665         5 |         1
> >   #  [3]       671       675         5 |         1
> >   #  [4]       411       415         5 |         2
> >   #  [5]       421       425         5 |         2
> >   #  ...       ...       ...       ... .       ...
> >   #  [8]        11        15         5 |         6
> >   #  [9]        21        25         5 |         6
> >   # [10]       921       925         5 |        10
> >   # [11]       931       935         5 |        10
> >   # [12]       941       945         5 |        10
> >
> > This code would work with overlapping ranges in y and/or
> > blocks of variable size.
> >
> > Hope this helps,
> >
> > H.
> >
> >>
> >> This last line below is a hack to get back to the ranges for a single
> >> block bootstrap sample of y. It works here, but only because none of
> >> the features in y were overlapping each other.
> >>
> >> Instead of coverage(), if I'd made an Rle where the non-zero elements
> >> are located at the start of ranges, and the value is the width of the
> >> range, I think this Views approach would actually work.
> >>
> >> y.boot.1.rng <- IRanges(start(y.boot.1)[runValue(y.boot.1) == 1],
> >>    width=runLength(y.boot.1)[runValue(y.boot.1) == 1])
> >>
> >> I'm interested in building a function that takes in IRanges and
> >> outputs these shuffled set of IRanges.
> >>
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages using fredhutch.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]