[BioC] GRanges - reduce() function

Wed Nov 30 02:44:32 CET 2011

Martin Morgan <mtmorgan at ...> writes:

> 
> On 11/17/2011 05:57 PM, Jason Ross wrote:
> > Hi Fahim,
> >
> > I am also frustrated by this. The meta-data also vanishes when using
> > findOverlaps(). I'm thinking of writing some wrapper functions to place the
> > meta-data back into the GRanges object.
> 
> Hi Jason et al.,
> 
> The problem in 'reduce' is that the elementMetadata columns need to be 
> 'reduce'd too, and there is no universal way to do that -- for 
> 'transcripts' in Fahim's example, maybe it's just collapsing entries 
> into a CharacterList, whereas for "Gene" it's split-by-reduced-range and 
> 'unique'. For numeric values one might sum or mean or max or ....
> 
> Can you be more specific about findOverlaps? It's not really clear which 
> data you'd like to have propagated.
> 
> For Fahim's question, I arrived at
> 
> values(r)[["Gene"]] <-
>      tapply(values(gr)[["Gene"]], match(gr, r), unique)
> 
> which I think is quite robust, but I'd recommend checking carefully on 
> complicated data.
> 
> Martin
> 

Hi Martin,

I use GenomicRanges objects a lot for annotating features, so I want
functionality like R's merge() or an SQL join. I was joining data to
annotations using MySQL but found that the indices broke down with range
joins. I considered BEDtools but didn't like the constraints of only using
BED/GFF and the shell. I switched to GenomicRanges and findOverlaps() because I
liked the very efficient interval tree approach. I usually wrap the output of
findOverlaps() in a function that emulates a left or inner join of two data
frames. This process is handled natively, but rather inelegantly, in BEDtools.
GRanges is more powerful but doesn't offer boolean switches on union/intersect,
etc., or wrapper functions that keep the metadata.
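
A minimal sketch of such a join wrapper follows. The helper name overlap_join(),
its "type" argument and the toy data are hypothetical, made up for illustration;
it uses the current mcols() accessor rather than values(), and assumes the
metadata columns are plain atomic vectors with no name clashes between the two
objects.

```r
library(GenomicRanges)

# Hypothetical helper: attach the metadata of `subject` to `query` by overlap.
# type = "inner" keeps only overlapping pairs; type = "left" keeps every query
# range and fills the subject columns with NA where nothing overlaps.
overlap_join <- function(query, subject, type = c("inner", "left")) {
  type <- match.arg(type)
  hits <- findOverlaps(query, subject)
  q <- queryHits(hits)
  s <- subjectHits(hits)
  if (type == "left") {
    # Add one row for each query range that has no hit at all
    unmatched <- setdiff(seq_along(query), q)
    q <- c(q, unmatched)
    s <- c(s, rep(NA_integer_, length(unmatched)))
    o <- order(q)
    q <- q[o]
    s <- s[o]
  }
  out <- query[q]
  # Indexing a data.frame with NA yields all-NA rows, which gives the
  # "left join" behaviour for unmatched query ranges
  subj_md <- as.data.frame(mcols(subject))[s, , drop = FALSE]
  rownames(subj_md) <- NULL
  mcols(out) <- cbind(mcols(out), DataFrame(subj_md))
  out
}

# Toy data (made up): three features, two annotations
features <- GRanges("chr1", IRanges(c(1, 50, 200), width = 10),
                    id = c("f1", "f2", "f3"))
annot <- GRanges("chr1", IRanges(c(5, 45), c(20, 60)), Gene = c("A", "B"))

joined_left  <- overlap_join(features, annot, "left")   # f3 kept, Gene is NA
joined_inner <- overlap_join(features, annot, "inner")  # f3 dropped
```

A query range that overlaps several subject ranges appears once per hit, as in
a data-frame join, so the result can be longer than the query.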

I appreciate that there is no universal way to disentangle metadata when 
aggregating but it would be nice to have some of the options available in the 
union/intersection/reduce functions, or in wrapper functions. At the moment I 
roll my own.
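
As an illustration of what such options could look like, here is a sketch that
carries metadata through reduce(), with a different aggregation per column. It
assumes a GenomicRanges release whose reduce() accepts with.revmap = TRUE
(releases from around the time of this thread may not have it), and the toy
ranges and column names are made up. As far as I know, match() on GRanges in
newer releases tests for equal ranges rather than overlaps, so the
tapply()/match() idiom quoted above may need
findOverlaps(gr, r, select = "first") in its place.

```r
library(GenomicRanges)

# Toy data (made up): the first two ranges overlap and will be merged
gr <- GRanges("chr1", IRanges(c(1, 5, 30), c(10, 15, 40)),
              Gene = c("A", "A", "B"), score = c(1, 2, 3))

# revmap records, for each reduced range, which rows of gr were merged into it
r <- reduce(gr, with.revmap = TRUE)
revmap <- mcols(r)$revmap

# Aggregate each column as appropriate: unique labels, summed scores
mcols(r)$Gene  <- unique(extractList(gr$Gene, revmap))
mcols(r)$score <- sum(extractList(gr$score, revmap))
```

The choice of unique() and sum() here is arbitrary; any per-group function
(mean, max, paste, ...) can be applied to the extracted lists in the same way.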

Regardless, I find GenomicRanges and related packages very useful and powerful,
and they are my preferred way of dealing with genomic data.

Cheers,
Jason.

At first I created GRanges objects from the dataframes