[BioC] A code kata: evening-out *Ranges ends.
Patrick Aboyoun
paboyoun at fhcrc.org
Wed Aug 25 09:16:59 CEST 2010
Steve,
Riffing off of Martin's post, I would use the match function to do the
mapping:
fence.posts <- reduce(reads)
ranges(reads) <- ranges(fence.posts)[match(reads, fence.posts)]
It is not as fast as the Rle/cut based method by Martin, but it is more
readable and, thus, easier to maintain.
Cheers,
Patrick
On 8/24/10 9:17 PM, mtmorgan at fhcrc.org wrote:
> Quoting Steve Lianoglou <mailinglist.honeypot at gmail.com>:
>
>> In case anyone else is trying to sharpen their IRanges-fu, here's a
>> possible solution to my own question -- works much faster.
>>
>> Instead of:
>>
>>> reads <- GRanges(seqnames='chr1', strand='+',
>>> ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE),
>>> sample(100:108, 10, replace=TRUE)),
>>> width=sample(18:20, 20, replace=TRUE)))
>>>
>>> You can see there are two "islands" of reads there, at around
>>> positions 20-40, and 100-120.
>>> What I want to do is to "fix" the reads in each island to have the
>>> same start/end position.
>>>
>>> Here's code I have to do that (forget any "strand" issues here --
>>> those are already taken care of):
>>>
>>> fence.posts <- reduce(reads)
>>> o <- findOverlaps(fence.posts, reads)
>
> Just
>
> ranges(reads)[subjectHits(o)] <- ranges(fence.posts)[queryHits(o)]
>
> ? (I would have done findOverlaps(reads, fence.posts), but...)
>
>>> subjects <- subjectHits(o)
>>> query <- queryHits(o)
>>>
>>> for (idx in unique(query)) {
>>> adjust <- subjects[query == idx]
>>> start(reads[adjust]) <- start(fence.posts[idx])
>>> end(reads[adjust]) <- end(fence.posts[idx])
>>> }
>>
>> I'm doing:
>>
>> covr <- coverage(reads)[[1]]
>> fences <- slice(covr, 1)
>
> or
>
> s = covr > 1
> breaks = cumsum(runLength(s))[!runValue(s)]
> post = cut(start(reads), c(breaks, Inf), labels=FALSE)
> ranges(reads) = ranges(fence.posts)[post]
>
> ? but the method above keeps track of sequences...
>
> Martin
>
>> f.repeat <- viewMaxs(fences)
>> repacked <- IRanges(start=rep(start(fences), f.repeat),
>> end=rep(end(fences), f.repeat))
>> if (length(repacked) != length(reads)) {
>> stop("Length of repacked reads is not the same as the original")
>> reads <- GRanges(seqnames=seqnames(reads), ranges=repacked,
>> strand=strand(reads))
>>
>> I'm assuming that all the reads come from the same chromosome (so I
>> unlist the result from coverage), and I'm only sending in reads of the
>> same strand to this part of the function.
>>
>> Anyway, running this newer version on 1000 tags takes 0.181 seconds,
>> where the older version took 20 seconds.
>>
>> Thought some might find it interesting, otherwise sorry to spam.
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list