[BioC] A code kata: evening-out *Ranges ends.

Wed Aug 25 06:17:50 CEST 2010

Quoting Steve Lianoglou <mailinglist.honeypot at gmail.com>:

> In case anyone else is trying to sharpen their IRanges-fu, here's a
> possible solution to my own question -- works much faster.
>
> Instead of:
>
>>  reads <- GRanges(seqnames='chr1', strand='+',
>>                ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE),
>>                                 sample(100:108, 10, replace=TRUE)),
>>                  width=sample(18:20, 20, replace=TRUE)))
>>
>> You can see there are two "islands" of reads there, at around
>> positions 20-40, and 100-120.
>> What I want to do is to "fix" the reads in each island to have the
>> same start/end position.
>>
>> Here's code I have to do that (forget any "strand" issues here --
>> those are already taken care of):
>>
>>  fence.posts <- reduce(reads)
>>  o <- findOverlaps(fence.posts, reads)

Just

   ranges(reads)[subjectHits(o)] <- ranges(fence.posts)[queryHits(o)]

? (I would have done findOverlaps(reads, fence.posts), but...)

>>  subjects <- subjectHits(o)
>>  query <- queryHits(o)
>>
>>  for (idx in unique(query)) {
>>    adjust <- subjects[query == idx]
>>    start(reads[adjust]) <- start(fence.posts[idx])
>>    end(reads[adjust]) <- end(fence.posts[idx])
>>  }
>
> I'm doing:
>
> covr <- coverage(reads)[[1]]
> fences <- slice(covr, 1)

or

   s = covr > 1
   breaks = cumsum(runLength(s))[!runValue(s)]
   post = cut(start(reads), c(breaks, Inf), labels=FALSE)
   ranges(reads) = ranges(fence.posts)[post]

? but the method above keeps track of sequences...

Martin

> f.repeat <- viewMaxs(fences)
> repacked <- IRanges(start=rep(start(fences), f.repeat),
>                     end=rep(end(fences), f.repeat))
> if (length(repacked) != length(reads)) {
>   stop("Length of repacked reads is not the same as the original")
> reads <- GRanges(seqnames=seqnames(reads), ranges=repacked,
>                  strand=strand(reads))
>
> I'm assuming that all the reads come from the same chromosome (so I
> unlist the result from coverage), and I'm only sending in reads of the
> same strand to this part of the function.
>
> Anyway, running this newer version on 1000 tags takes 0.181 seconds,
> where the older version took 20 seconds.
>
> Thought some might find it interesting, otherwise sorry to spam.
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:   
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>