[Bioc-sig-seq] Comparing two chipseq position sets

Tue May 19 22:09:47 CEST 2009

Hi Ivan, Steve,

Steve Lianoglou wrote:
> Hi,
> 
> On May 7, 2009, at 10:53 AM, Steve Goldstein wrote:
>> A simple permutation test could be done by selecting random sets of 
>> intervals "matching" the query intervals and counting the number of 
>> overlaps with the reference intervals.  Each random set of intervals 
>> could be picked so that the number and size of the intervals was the 
>> same as the query.   A general implementation of the method would need 
>> to know the length of each chromosome.
> 
> I was just about to suggest something similar, though I didn't think to 
> consider chromosome length ... can you give some intuition as to why 
> that's important for this question?
> 
> I guess you'd expect more "collisions" to happen at random on a 
> chromosome if it's longer, but I think in general one wouldn't be 
> interested in finding the number of reads that "collide" between two 
> experiments for a particular chromosome as you might be interested in 
> just seeing how many collisions happen over the entire extent of the 
> genome ... so is it helpful to think of the genome as broken up into 
> chromosome-pieces, or would it suffice to simply think of it as being 
> one contiguous length of sequence for this purpose?
> 
>> Of course, if the null hypothesis for this permutation test (the sets 
>> intervals are not related) is rejected, then you have to think about 
>> the next questions:  To what degree are the set related?
> 
> I'm picturing the aligned reads as painting small lines on a large 
> canvas (the entire canvas is the genome in this analogy).
> 
> If your first expt is painting red lines.
> Your second expt is painting in blue lines.
> The question is how much of the canvas is purple.

That's an interesting approach. It has the advantage to be very
simple (some would say "naive") but it's easy to implement with
the IRanges facilities.

In fact what makes this "painting approach" so interesting is because
it's possible to formalize it a little bit and to define a
pseudo-distance between 2 arbitrary IRanges objects.

Let's call "the covered width" of an IRanges object the number of
positions that are covered by at least one of its ranges:

   coveredWidth <- function(x) sum(width(reduce(x)))

Then the following metric

   IRangesPseudoDist <- function(x1, x2)
   {
     cwu <- coveredWidth(union(x1, x2))
     if (cwu == 0)
         return(1)
     1 - coveredWidth(intersect(x1, x2)) / cwu
   }

has the interesting property that it's a distance in the space of *normal*
IRanges objects i.e. if 'x1' and 'x2' are 2 normal IRanges objects (in a
normal IRanges object the ranges don't overlap), then

   IRangesPseudoDist(x1, x2) == 0 iff the set of ranges
       in 'x1' is exactly the same as the set of ranges in 'x2'.

If 'x1' and 'x2' are just any IRanges objects (not necesarily
normal), then IRangesPseudoDist(x1, x2) == 0 only means that
they cover exactly the same set of positions.

Note that this metric is blind to the order in which the ranges
are stored in the 2 IRanges objects that are being compared.

Example:

   set.seed(235)

   ## Comparing 2 arbitrary IRanges objects
   ## -------------------------------------

   start1 <- sample(9000000, 500, replace=TRUE)
   width1 <- sample(1200, 500, replace=TRUE) + 200
   ir1 <- IRanges(start=start1, width=width1)

   start2 <- sample(9500000, 475, replace=TRUE)
   width2 <- sample(1200, 475, replace=TRUE) + 200
   ir2 <- IRanges(start=start2, width=width2)

   > IRangesPseudoDist(ir1, ir2)
   [1] 0.9782646
   > IRangesPseudoDist(ir1, ir1)
   [1] 0

   ## The order of the ranges doesn't matter:
   > IRangesPseudoDist(ir1, rev(ir1))
   [1] 0

   ## Comparing 2 very similar IRanges objects
   ## ----------------------------------------
   ## We build 'ir3', an IRanges object "close" to 'ir1', by
   ## (a) Shuffling the ranges (including dropping and repeating
   ## some of them):
   ii <- c(seq_len(length(ir1)), sample(length(ir1), 30, replace=TRUE))
   ii <- sample(ii)
   ii <- ii[-seq.int(from=1, to=length(ii), by=10)]
   ir3 <- ir1[ii]

   ## (b) Modifying the boundaries of the ranges:
   delta_start <- sample(15, length(ir3), replace=TRUE) - 8
   delta_end <- sample(15, length(ir3), replace=TRUE) - 8
   ir3 <- IRanges(start(ir3) + delta_start, end(ir3) + delta_end)

   ## 'ir1' and 'ir3' are still very close!
   > IRangesPseudoDist(ir1, ir3)
   [1] 0.08862256

Cheers,
H.

> 
> So it "feels" like some sort of an enrichment test to me, could you try 
> to answer this question in a similar fashion to GO enrichment, via some 
> sort of hypergeometric test? That's not exactly correct ... just 
> brainstorming is all.
> 
>> Where do they differ and where are they the same?
> 
> I'll stop with my speculation here ... :-)
> 
> -steve
> 
> -- 
> Steve Lianoglou
> Graduate Student: Physiology, Biophysics and Systems Biology
> Weill Medical College of Cornell University
> 
> http://cbio.mskcc.org/~lianos
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319