[BioC] stranded findOverlaps

Steve Lianoglou mailinglist.honeypot at gmail.com
Mon Jan 25 19:57:06 CET 2010


Hi,

On Mon, Jan 25, 2010 at 12:56 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> On Fri, Jan 22, 2010 at 11:41 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
>
>> dear list, and particularly, the IRanges developers,
>>
>> i'm using the function findOverlaps from the IRanges package because i
>> need to find what stranded genomic intervals from one set (as a
>> RangedData object) overlap with what stranded genomic intervals from
>> another set (as another RangedData object). the problem is that i don't
>> want to consider overlaps between genomic intervals from different
>> strands.
>>
>> i've been looking at the help page of findOverlaps (devel version, see
>> my sessionInfo() below) and searched through the BioC mailinglist and my
>> preliminary conclusion is that such an operation is not yet supported.
>>
>> i've been thinking of using rdapply to break down the RangedData objects
>> into spaces and then again by the two strands but the problem is that
>> the query and subject indexes resulting from findOverlaps will not match
>> the dimension of the original RangedData objects.
>>
>> so, i'd like to suggest that some option is added to this useful
>> function to restrict the overlapping search by strand. of course, if
>> this is somehow already implemented and i just missed it, then i'll be
>> very grateful if you let me know what function/parameter i should be
>> using.
>>
>>
> Well, IRanges knows nothing about Biology, so a 'strand' option would be out
> of place, in my opinion. That said, I can think of at least two approaches.
>
> 1) Simply filter the results for matches that are on the same strand. This
> is something as simple as:
> result <- findOverlaps(a, b)
> mat <- as.matrix(result)
> mat <- mat[a$strand[mat[,1L]] == b$strand[mat[,2L]], , drop = FALSE]
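
[A minimal base-R sketch of approach 1, so it runs without IRanges. The toy data frames `a` and `b` and the helper `naive_overlaps()` are assumptions made for illustration; the helper is only a stand-in for `as.matrix(findOverlaps(a, b))` and is not part of any package.]

```r
# Toy stranded intervals as plain data frames (hypothetical data, not RangedData).
a <- data.frame(start = c(1, 10, 20), end = c(5, 15, 25),
                strand = c("+", "-", "+"), stringsAsFactors = FALSE)
b <- data.frame(start = c(3, 12, 22), end = c(8, 18, 30),
                strand = c("+", "+", "+"), stringsAsFactors = FALSE)

# Hypothetical stand-in for as.matrix(findOverlaps(a, b)): returns a
# two-column matrix of (query, subject) row indexes of overlapping pairs.
naive_overlaps <- function(q, s) {
  pairs <- expand.grid(query = seq_len(nrow(q)), subject = seq_len(nrow(s)))
  hit <- q$start[pairs$query] <= s$end[pairs$subject] &
         s$start[pairs$subject] <= q$end[pairs$query]
  as.matrix(pairs[hit, , drop = FALSE])
}

mat <- naive_overlaps(a, b)

# Keep only the pairs whose query and subject lie on the same strand.
# drop = FALSE keeps the result a matrix even when one row survives.
same_strand <- a$strand[mat[, 1L]] == b$strand[mat[, 2L]]
mat_stranded <- mat[same_strand, , drop = FALSE]
```

[Because the filter acts on the hit matrix, the surviving indexes still refer to rows of the original `a` and `b`.]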
>
> 2) Out of recognition that we are really treating the two strands as
> separate spaces, break down the RangedData into chrom*strand spaces, as in:
> rd <- RangedData(...)
> rd <- do.call(c, split(rd, rd$strand))
> result <- findOverlaps(rd, ...)
> ## then maybe eventually go back to chromosome spaces
> rds <- split(rd, rd$strand)
> names(rds[[1]]) <- chromNames
> names(rds[[2]]) <- chromNames
> rd <- do.call(rbind, rds)
>
> The second approach would be very convenient if you always want to treat the
> strands separately. The separation could be specified at construction time,
> e.g.:
> RangedData(ranges, strand, space = interaction(chrom, strand))
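
[A rough base-R sketch of the chrom*strand idea, assuming plain data frames rather than RangedData objects. The tables `q` and `s` and their column names are hypothetical. Overlaps are computed separately inside each space keyed by `interaction(chrom, strand)`, while the reported indexes still point into the original tables.]

```r
# Hypothetical query and subject tables with chromosome and strand columns.
q <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                start = c(1, 10, 1), end = c(5, 15, 5),
                strand = c("+", "-", "+"), stringsAsFactors = FALSE)
s <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                start = c(3, 12, 2), end = c(8, 18, 6),
                strand = c("+", "+", "-"), stringsAsFactors = FALSE)

# One space per chromosome/strand combination, e.g. "chr1.+".
q_space <- as.character(interaction(q$chrom, q$strand))
s_space <- as.character(interaction(s$chrom, s$strand))

# Within each shared space, test every query/subject pair for overlap.
# which() returns positions in the original tables, so indexes stay valid.
hits <- do.call(rbind, lapply(intersect(q_space, s_space), function(sp) {
  pairs <- expand.grid(query = which(q_space == sp),
                       subject = which(s_space == sp))
  ok <- q$start[pairs$query] <= s$end[pairs$subject] &
        s$start[pairs$subject] <= q$end[pairs$query]
  as.matrix(pairs[ok, , drop = FALSE])
}))
```

[In this toy data the chr2 intervals overlap by position but sit on opposite strands, so they never share a space and are not reported. If no space is shared at all, `hits` is NULL.]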
>
> But in general neither of these is awfully convenient, and I've always had
> the suspicion that we'd eventually need multiple space variables. Yes, we
> could add some argument to the findOverlaps method for RangedData that takes
> a vector of variable names for splitting into subspaces, but I think we
> would want a more general solution, where the RangedData itself has the
> notion of subspaces. This would be a non-trivial change. Would it behave
> like a nested list in some ways?
>
> Hopefully others have better ideas...

How about defining findOverlaps on "AlignedRead" objects (from the
ShortRead package), and having "easy" ways to create an AlignedRead
object out of IRanges/RangesList objects (with appropriate additional
metadata)?

I reckon you'd need a juiced up findOverlaps function to add params
specifying how to (or not) deal with the metadata in the AlignedRead
objects, though, among other things.

-steve


-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
