[BioC] help with reduction operation using IRanges/GRanges

Thu Apr 26 00:55:22 CEST 2012

Hi list,

On 04/24/2012 03:54 PM, Steve Lianoglou wrote:
> Hi,
>
> On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com>  wrote:
>>> ir<- IRanges(start = c(1,5), width = 3)
>>> ir
>> IRanges of length 2
>>     start end width
>> [1]     1   3     3
>> [2]     5   7     3
>>> reduce(ir)
>> IRanges of length 2
>>     start end width
>> [1]     1   3     3
>> [2]     5   7     3
>>> reduce(ir, min.gapwidth = 2)
>> IRanges of length 1
>>     start end width
>> [1]     1   7     7
>>
>> As far as I understand, this is exactly what you want.  If you have
>> further questions, I suggest you read the help page.
>
> This actually isn't what he wants -- but what he wants isn't
> answerable w/ `reduce`.
>
> All of the OP's ranges overlap eachother -- I think he's trying to
> split his ranges into two groups because one group has ends that are
> very close to each other and the other's are a bit different. Note
> that the last two rows, the ends are a greater distance away from (but
> still *within*) the ends of the first group of ranges he's trying to
> group.
>
> It's almost trying to cluster groups of reads together, but again -- not really.

The clustering approach looks good to me because it allows reusing
existing clustering tools like as.dist(), hclust() and cutree(). It's
just a matter of defining the "right" distance between 2 ranges. Here
it seems that a suitable distance would be the distance between the 2
starts + the distance between the 2 ends:

   D(range1, range2) = abs(start(range1) - start(range2)) +
                       abs(end(range1) - end(range2))

Another good candidate would be the euclidian distance if ranges are
seen as points in the 2D space but it's a little bit more expensive to
compute and I would expect it to give very similar results to D() for
clustering.

The code is very simple:

   rangesDist <- function(x)
   {
     lx <- length(x)
     m <- sapply(seq_len(lx),
                 function(i)
                   abs(start(x)[i] - start(x)) + abs(end(x)[i] - end(x))
                 )
     as.dist(m)
   }

   d <- rangesDist(ir)
   hc <- hclust(d)
   plot(hc)

   ## Then use cutree() to assign a group id to each original range.
   ## For example if you want to split in 2 groups:

   > cutree(hc, k=2)
   [1] 1 1 1 1 2 2

Would be nice if the "cutting" could be specified in terms of minimum
distance between groups but I'm not familiar enough with the existing
clustering tools to know how to do that...

Cheers,
H.

>
> Not sure why you might want to do this, but ... just to say that
> reduce isn't the answer here.
>
> -steve
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319