[BioC] A code kata: evening-out *Ranges ends.
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue Aug 24 23:47:58 CEST 2010
Hi,
Preamble: This is a low priority question. It's basically something
I'm just asking to see how one might code it more cleanly.
One of Martin's responses to a question today suggested that the
person use IRanges::slice to perform a specific task:
http://thread.gmane.org/gmane.science.biology.informatics.conductor/30550/focus=30554
I was quite pleasantly surprised from that post since as one step in
cleaning some of my short-read input data. I haven't stumbled on that
before, and I essentially had written up a small piece of code that
does the same thing that the IRanges::slice operation does, but using
normal R operations on a vector-converted Rle object.
Using IRanges::slice is much nicer, and no doubt more efficient
(although my implementation wasn't terribly inefficient ... wonderful.
That's what's prompting me to ask this question.
I have another small piece of code that further cleans my data, but
this one IS terribly inefficient (it has a for loop). It runs over a
(large) I/GRanges object (representing short reads) and sets a
"cluster" of reads to have the same start/end. I'll show you what I
mean with example data and code below.
As some background, my data is coming from a high-throughput SAGE-like
protocol, where I get one very specific read/tag per transcript. As a
result of some variability in enzyme digestion (either "naturally
occurring" in the cell, or as a result of the protocol itself), the
resulting reads are one or two bp's off of what they should be.
Here's some fake data:
reads <- GRanges(seqnames='chr1', strand='+',
ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE),
sample(100:108, 10, replace=TRUE)),
width=sample(18:20, 20, replace=TRUE)))
You can see there are two "islands" of reads there, at around
positions 20-40, and 100-120.
What I want to do is to "fix" the reads in each island to have the
same start/end position.
Here's code I have to do that (forget any "strand" issues here --
those are already taken care of):
fence.posts <- reduce(reads)
o <- findOverlaps(fence.posts, reads)
subjects <- subjectHits(o)
query <- queryHits(o)
for (idx in unique(query)) {
adjust <- subjects[query == idx]
start(reads[adjust]) <- start(fence.posts[idx])
end(reads[adjust]) <- end(fence.posts[idx])
}
It works, but it's slow.
So, I'm curious if there is a more clever/idiomatic way to do this
using the IRanges infrastructure while at the same time being more
efficient.
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioconductor
mailing list