[BioC] stranded findOverlaps

Tue Jan 26 19:06:28 CET 2010

Michael,
Before creating a new class to capture the strandedness of a RangedData 
objects, it would be useful to have a list of methods in which 
strandedness can be used:

Inter-interval ops: disjoin, gaps, reduce, range, coverage
Between interval set ops: intersect, setdiff, union, findOverlaps, %in%, 
match

For each of these operations, is strandedness a separate and unique 
categorization of the data or are there other categories users would 
like to group intervals during these operations? For example, I added a 
"by" argument to the reduce method for RangedData because I was in 
correspondence with someone who wanted to use both "strand" and "score" 
columns to match during the reduction exercise. My instinct is that all 
the inter-interval operations could use grouping capabilities and these 
groupings are fluid where at one pass you might want to use a "score" 
column and at another you would like to ignore that information. So in 
short, I would argue to not add any more classes and instead add "by" 
arguments to all the operations listed above that could support it and 
for those between interval set operation that couldn't support it use 
all the columns in the RangedData objects and not just the strand so if 
you wanted to find the intersection of two RangedData objects, the 
entire row of data would have to match and not just the intervals or the 
stranded intervals.

Patrick

Michael Lawrence wrote:
> It seems that RangedData (with a strand variable) has fallen into this role
> within the IRanges framework. At one point, there was a GenomicData subclass
> that did allow for special strand options. Unfortunately, the GenomicData()
> convenience constructor of RangedData is all that is left of that. So we
> could bring that back. Whatever happened to the proposed GenomeRanges
> package?
>
> Michael
>
> On Mon, Jan 25, 2010 at 11:02 AM, Kasper Daniel Hansen <
> khansen at stat.berkeley.edu> wrote:
>
>   
>> On Jan 25, 2010, at 12:56 PM, Michael Lawrence wrote:
>>
>>     
>>> On Fri, Jan 22, 2010 at 11:41 AM, Robert Castelo <robert.castelo at upf.edu
>>> wrote:
>>>
>>>       
>>>> dear list, and particularly, the IRanges developers,
>>>>
>>>> i'm using the function findOverlaps from the IRanges package because i
>>>> need to find what stranded genomic intervals from one set (as a
>>>> RangedData object) overlap with what stranded genomic intervals from
>>>> another set (as another RangedData object). the problem is that i don't
>>>> what to consider overlaps between genomic intervals from different
>>>> strands.
>>>>
>>>> i've been looking to the help page of findOverlaps (devel version, see
>>>> my sessionInfo() below) and searched through the BioC mailinglist and my
>>>> preliminary conclusion is that such an operation is not yet supported.
>>>>
>>>> i've been thinking of using rdapply to break down the RangedData objects
>>>> into spaces and then again by the two strands but the problem is that
>>>> the query and subject indexes resulting of findOverlaps will not match
>>>> the dimension of the original RangedData objects.
>>>>
>>>> so, i'd like to suggest that some option is added to this useful
>>>> function to restrict the overlapping search by strand. of course, if
>>>> this is somehow already implemented and i just missed it, then i'll be
>>>> very grateful if you let me know what function/parameter i should be
>>>> using.
>>>>
>>>>
>>>>         
>>> Well, IRanges knows nothing about Biology, so a 'strand' option would be
>>>       
>> out
>>     
>>> of place, in my opinion. That said, I can think of at least two
>>>       
>> approaches.
>>     
>>> 1) Simply filter the results for matches that are the the same strand.
>>>       
>> This
>>     
>>> is something as simple as:
>>> result <- findOverlaps(a, b)
>>> mat <- as.matrix(result)
>>> mat <- mat[a$strand[mat[,1L]] == b$strand[mat[,2L]],]
>>>
>>> 2) Out of recognition that we are really treating the two strands as
>>> separate spaces, break down the RangedData into chrom*strand spaces, as
>>>       
>> in:
>>     
>>> rd <- RangedData(...)
>>> rd <- do.call(c, split(rd, rd$strand))
>>> result <- findOverlaps(rd, ...)
>>> ## then maybe eventually go back chromosome spaces
>>> rds <- split(rd, rd$strand)
>>> names(rds[[1]]) <- chromNames
>>> names(rds[[2]]) <- chromNames
>>> rd <- do.call(rbind, rds)
>>>
>>> The second approach would be very convenient if you always want to treat
>>>       
>> the
>>     
>>> strands separately. The separation could be specified at construction
>>>       
>> time,
>>     
>>> e.g.:
>>> RangedData(ranges, strand, space = interaction(chrom, strand))
>>>
>>> But in general neither of these are awfully convenient, and I've always
>>>       
>> had
>>     
>>> the suspicion that we'd eventually need multiple space variables. Yes, we
>>> could add some argument to the findOverlaps method for RangedData that
>>>       
>> takes
>>     
>>> a vector of variable names for splitting into subspaces, but I think we
>>> would want a more general solution, where the RangedData itself has the
>>> notion of subspaces. This would be a non-trivial change. Would it behave
>>> like a nested list in some ways?
>>>
>>> Hopefully others have better ideas...
>>>       
>> We will need good support for stranded genomic intervals.  This is a very
>> important case to handle, and will be even more important in the future
>> where a number of assays will be stranded.  We need support for doing
>> operations on such objects, both ignoring strand and not ignoring strand.
>>
>> An example could be that we take (stranded) genome annotation and what to
>> perform a per-chromosome reduce().  Users might want to do a reduce
>> respecting strand information where we would get one IRanges per chromosome
>> * strand or we might want to do a reduce(anno , ignoreStrand = TRUE) which
>> yields one IRanges per chromosome.
>>
>> I agree that the general design might be to allow for any number of nested
>> subspaces, but we do have a very important special case where we know that
>> the second level of nestedness only have two components.  I believe a lot of
>> value would be gained from being able to operate easily on such objects.
>>
>> Kasper
>>     
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>