[Bioc-devel] Block bootstrap for GenomicRanges
Kasper Daniel Hansen
k@@perd@nielh@n@en @ending from gm@il@com
Tue Aug 14 15:06:13 CEST 2018
I agree this is super important. I think there may be multiple ways of
thinking about a decent bootstrapping or permutation of ranges, in
genomics. I am quite interested in the topic. I think it might belong in a
new package. I would be interesting in extending the conversation and have
a couple of different approaches (theoretical) that we could work on being
efficient.
Best,
Kasper
On Tue, Aug 14, 2018 at 8:27 AM Michael Love <michaelisaiahlove using gmail.com>
wrote:
> dear Hervé,
>
> Thanks again for the quick and useful reply!
>
> I think that the theory behind the block bootstrap [Kunsch (1989), Liu
> and Singh (1992), Politis and Romano (1994)], needs that the blocks be
> drawn with replacement (you can get some features twice) and that the
> blocks can be overlapping. In a hand-waving way, I think, it's "good"
> for the variance estimation on any statistic of interest that y' may
> have more or less features than y.
>
> I will explore a bit using the solutions you've laid out.
>
> Now that I think about it, the start-position based solution that I
> was thinking about will break if two features in y share the same
> start position, so that's not good.
>
> On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès <hpages using fredhutch.org>
> wrote:
> > That helps. I think I start to understand what you are after.
> >
> > See below...
> >
> >
> > On 08/13/2018 06:07 PM, Michael Love wrote:
> >>
> >> dear Hervé,
> >>
> >> Thanks for the quick reply about directions to take this.
> >>
> >> I'm sorry for not providing sufficient detail about the goal of block
> >> bootstrapping in my initial post. Let me try again. For a moment, let
> >> me ignore multiple chromosomes/seqs and just focus on a single set of
> >> IRanges.
> >>
> >> The point of the block bootstrap is: Let's say we want to find the
> >> number of overlaps of x and y, and then assess how surprised we are at
> >> how large that overlap is. Both of them may have features that tend to
> >> cluster together along the genome (independently). One method would
> >> just be to move the features in y around to random start sites, making
> >> y', say B times, and then calculate each of the B times the number of
> >> overlaps between x and y'. Or we might make this better by having
> >> blacklisted sites where the randomly shuffled features in y cannot go.
> >>
> >> The block bootstrap is an alternative to randomly moving the start
> >> sites, where instead we create random data, by taking big "blocks" of
> >> features in y. Each block is a lot like a View. And the ultimate goal
> >> is to make B versions of the data y where the features have been
> >> shuffled around, but by taking blocks, we preserve the clumpiness of
> >> the features in y.
> >>
> >> Let me give some numbers to make this more concrete, so say we're
> >> making a single block bootstrap sample of a chromosome that is 1000 bp
> >> long. Here is the original y:
> >>
> >> y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)
> >>
> >> If I go with my coverage approach, I should extend it all the way to
> >> the end of the chromosome. Here I lose information if there are
> >> overlaps of features in y, and I'm thinking of a fix I'll describe
> >> below.
> >>
> >> cov <- c(coverage(y), Rle(rep(0,55)))
> >>
> >> I could make one block bootstrap sample of y (this is 1 out of B in
> >> the ultimate procedure) by taking 10 blocks of width 100. The blocks
> >> have random start positions from 1 to 901.
> >>
> >> y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))
> >
> >
> > Choosing blocks that can overlap with each others could make y' appear
> > to have more features than y (by repeating some of the original
> > features). Also choosing blocks that can leave big gaps in the
> > chromosome could make y' appear to have less features than y
> > (by dropping some of the original ranges). Isn't that a problem?
> >
> > Have you considered choosing a set of blocks that represent a
> > partitioning of the chromosome? This would make y' look more like y
> > by preserving the number of original ranges.
> >
> > In other words, if you can assign each range in y to one block and
> > one block only, then you could generate y' by just shifting the
> > ranges in y. The exact amount of shifting (positive or negative)
> > would only depend on the block that the range belongs to. This would
> > give you an y' that is parallel to y (i.e. same number of ranges
> > and y'[i] corresponds to y[i]).
> >
> > Something like this:
> >
> > 1) Define the blocks (must be a partitioning of the chromosome):
> >
> > blocks <- successiveIRanges(rep(100, 10))
> >
> > 2) Assign each range in y to the block it belongs to:
> >
> > mcols(y)$block <- findOverlaps(y, blocks, select="first")
> >
> > 1) and 2) are preliminary steps that you only need to do once,
> > before you generate B versions of the shuffled data (what you
> > call y'). The next steps are for generating one version of the
> > shuffled data so would need to be repeated B times.
> >
> > 3) Shuffle the blocks:
> >
> > perm <- sample(length(blocks))
> > perm
> > # [1] 6 5 8 3 2 7 1 9 4 10
> >
> > permuted_blocks <- successiveIRanges(width(blocks)[perm])
> > permuted_blocks[perm] <- permuted_blocks
> >
> > 4) Compute the shift along the chromosome that each block went
> > thru:
> >
> > block_shift <- start(permuted_blocks) - start(blocks)
> >
> > 5) Apply this shift to the ranges in y:
> >
> > shift(y, block_shift[mcols(y)$block])
> > # IRanges object with 12 ranges and 1 metadata column:
> > # start end width | block
> > # <integer> <integer> <integer> | <integer>
> > # [1] 651 655 5 | 1
> > # [2] 661 665 5 | 1
> > # [3] 671 675 5 | 1
> > # [4] 411 415 5 | 2
> > # [5] 421 425 5 | 2
> > # ... ... ... ... . ...
> > # [8] 11 15 5 | 6
> > # [9] 21 25 5 | 6
> > # [10] 921 925 5 | 10
> > # [11] 931 935 5 | 10
> > # [12] 941 945 5 | 10
> >
> > This code would work with overlapping ranges in y and/or
> > blocks of variable size.
> >
> > Hope this helps,
> >
> > H.
> >
> >>
> >> This last line below is a hack to get back to the ranges for a single
> >> block bootstrap sample of y. It works here, but only because none of
> >> the features in y were overlapping each other.
> >>
> >> Instead of coverage(), if I'd made an Rle where the non-zero elements
> >> are located at the start of ranges, and the value is the width of the
> >> range, I think this Views approach would actually work.
> >>
> >> y.boot.1.rng <- IRanges(start(y.boot.1)[runValue(y.boot.1) == 1],
> >> width=runLength(y.boot.1)[runValue(y.boot.1) == 1])
> >>
> >> I'm interested in building a function that takes in IRanges and
> >> outputs these shuffled set of IRanges.
> >>
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages using fredhutch.org
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list