[BioC] findOverlaps method in GenomicRanges not supporting type="equal" for GRangesList, GRangesList?

Nicolas Delhomme nicolas.delhomme at umu.se
Wed Nov 27 13:28:58 CET 2013


Hej Hervé!
---------------------------------------------------------------
Nicolas Delhomme

Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center

Tel: +46 90 786 7989
Email: nicolas.delhomme at plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------

On 22 Nov 2013, at 22:03, Hervé Pagès <hpages at fhcrc.org> wrote:

> Hi Nico,
> 
> On 11/22/2013 06:36 AM, Nicolas Delhomme wrote:
>> Hej Michael, Hervé!
>> 
>> I have to admit that I’m not exactly following the details of your discussion here. What would appear intuitive for me for a findOverlaps function on GRangesList, GRangesList is that it returns a list of findOverlaps on GRanges,GRanges. I can see difficulties in determining the concordance between the GRanges of the two GRangesList though. And I don’t have the overview to decide whether such a behaviour would be consistent, but that’s naively what I would expect.
> 
> Typically when 'query' and 'subject' are GRangesList, one is
> representing alignments (e.g. single-end with gaps or paired-end,
> thus more than 1 range is needed to represent an alignment), and
> the other one is representing a set of compound genomic features
> e.g. exons grouped by transcript or by gene.
> findOverlaps() then just returns yes/no answers to the questions
> "do query[[i]] and subject[[j]] overlap?" for all the possible (i, j)
> combinations. It doesn't return any detailed information about which
> ranges in GRanges objects query[[i]] and subject[[j]] have overlaps.
> If a hit is reported between query[[i]] and subject[[j]], it just
> means that at least one range in the former overlaps with one range
> in the latter. For many analyses, this is enough information i.e.
> knowing that there is a hit between a read and a transcript is enough.
> 

Sure, completely make senses.

> For analyses that need to know a little bit more about the
> characteristics of those hits, some post-processing of the Hits object
> is generally required. That's what tools like summarizeOverlaps or
> findSpliceOverlaps do. In your case, if you want a "list of
> findOverlaps on GRanges,GRanges", it's also something that you could
> obtained by post-processing the hit object:
> 
>   ov <- findOverlaps(query, subject, ...)
>   mapply(findOverlaps, query[queryHits(ov)], subject[subjectHits(ov)])
> 

Exactly what I ended up doing and as my GRangesList are not large ~50 elements, it went fast. 
mclapply helped made it even faster obviously :-)


> It probably won't be very efficient but we could address this by
> implementing a fast "parallel findOverlaps" (e.g. pfindOverlaps)
> if there is interest in something like that.
> 

I have no such needs yet, but thanks.

> But back to your original request of supporting type="equal" on
> GRangesList objects. When you use the 'type' argument, that only
> changes the criteria that findOverlaps uses for deciding whether
> to report a hit or not. That doesn't change what is reported.
> What is reported is still a Hits object answering the yes/no
> answers. When type="equal" will be available, a hit between
> query[[i]] and subject[[j]] will be reported only if the ranges
> are the same in the 2 objects. In that case there doesn't seem to
> be a lot of value in trying to generate a "list of findOverlaps on 
> GRanges,GRanges" because you already know what you need to know.
> 

And that’s the behaviour I should indeed expect. I was biased by my own needs ;-)

> So what would be more useful: (1) a fast pfindOverlaps, (2) supporting 
> type="equal" on GRangesList, (3) both? Knowing a little bit more
> about what you are trying to do would help us decide.
> 

I think (2) would be good simply because it is stated as possible in the man page. Or the man page would need correction.

> FWIW let me mention another way of post-processing a Hits object.
> encodeOverlaps() can be used on the Hits object to produce "overlap
> encodings". These are small strings that describe how the individual
> ranges in the 2 GRanges objects involved in a hit are positioned
> with respect to each other. The idea is a little bit the same as
> with the CIGAR encodings that provide details about how reads align
> to the reference. See the OverlapEncoding vignette in the GenomicRanges
> package for more info (it's a low level tool and it can be a little
> bit tricky to use, also the vignette is probably too technical and
> would need to be improved).

That sounds very interesting; I’ll take a look at it. 

Thanks for this very detailed very useful answer!

Cheers,

Nico

> 
> Cheers,
> H.
> 
>> 
>> Anyway, just my 2 cts. Thanks for taking a look.
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Nathaniel Street Lab
>> Department of Plant Physiology
>> Umeå Plant Science Center
>> 
>> Tel: +46 90 786 7989
>> Email: nicolas.delhomme at plantphys.umu.se
>> SLU - Umeå universitet
>> Umeå S-901 87 Sweden
>> ---------------------------------------------------------------
>> 
>> On 21 Nov 2013, at 19:43, Michael Lawrence <lawrence.michael at gene.com> wrote:
>> 
>>> So I've checked into devel a match,GRangesList,GRangesList. This allows findMatches() to return what you want. There is a question though before this is approved: does it make sense for match() to act like findOverlaps and consider each GRanges atomically (one returned index per GRanges) or should match behave as it does other Lists and return an IntegerList, with a value per range, grouped by the top-level elements. If we decide on the latter, then the method I wrote needs to be removed and the implementation moved to the "equals" mode in findOverlaps. Either way, findOverlaps(type="equals") should be made to work.
>>> 
>>> Michael
>>> 
>>> 
>>> On Thu, Nov 21, 2013 at 8:13 AM, Nicolas Delhomme <nicolas.delhomme at umu.se> wrote:
>>> Thanks!
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>> 
>>> Nathaniel Street Lab
>>> Department of Plant Physiology
>>> Umeå Plant Science Center
>>> 
>>> Tel: +46 90 786 7989
>>> Email: nicolas.delhomme at plantphys.umu.se
>>> SLU - Umeå universitet
>>> Umeå S-901 87 Sweden
>>> ---------------------------------------------------------------
>>> 
>>> On 21 Nov 2013, at 17:06, Michael Lawrence <lawrence.michael at gene.com> wrote:
>>> 
>>>> I will work on this today.
>>>> 
>>>> Michael
>>>> 
>>>> 
>>>> On Thu, Nov 21, 2013 at 4:43 AM, Nicolas Delhomme <nicolas.delhomme at umu.se> wrote:
>>>> Hej Bioc!
>>>> 
>>>> When I try to find “equal” ranges from two GRangesList object, I get the following error:
>>>> 
>>>>> findOverlaps(query=grng.def,subject=grng.mod,type="equal")
>>>> Error in match.arg(type) :
>>>>   'arg' should be one of “any”, “start”, “end”, “within”
>>>> 
>>>> Isn’t type=“equal” supported for the GRangesList, GRangesList signature?
>>>> 
>>>> Cheers,
>>>> 
>>>> Nico
>>>> 
>>>> sessionInfo()
>>>> R version 3.0.2 (2013-09-25)
>>>> Platform: x86_64-apple-darwin13.0.0 (64-bit)
>>>> 
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>> 
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>> 
>>>> other attached packages:
>>>>  [1] easyRNASeq_1.8.2       ShortRead_1.20.0       Rsamtools_1.14.1       GenomicRanges_1.14.3   DESeq_1.14.0           lattice_0.20-24       locfit_1.5-9.1
>>>>  [8] Biostrings_2.30.1      XVector_0.2.0          IRanges_1.20.5         edgeR_3.4.0            limma_3.18.3           biomaRt_2.18.0         Biobase_2.22.0
>>>> [15] genomeIntervals_1.18.0 BiocGenerics_0.8.0     intervals_0.14.0
>>>> 
>>>> loaded via a namespace (and not attached):
>>>>  [1] annotate_1.40.0      AnnotationDbi_1.24.0 bitops_1.0-6         DBI_0.2-7            genefilter_1.44.0    geneplotter_1.40.0   grid_3.0.2           hwriter_1.3
>>>>  [9] latticeExtra_0.6-26  LSD_2.5              RColorBrewer_1.0-5   RCurl_1.95-4.1       RSQLite_0.11.4       splines_3.0.2        stats4_3.0.2         survival_2.37-4
>>>> [17] tools_3.0.2          XML_3.98-1.1         xtable_1.7-1         zlibbioc_1.8.0
>>>> 
>>>> 
>>>> ---------------------------------------------------------------
>>>> Nicolas Delhomme
>>>> 
>>>> Nathaniel Street Lab
>>>> Department of Plant Physiology
>>>> Umeå Plant Science Center
>>>> 
>>>> Tel: +46 90 786 7989
>>>> Email: nicolas.delhomme at plantphys.umu.se
>>>> SLU - Umeå universitet
>>>> Umeå S-901 87 Sweden
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>>> 
>>> 
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioconductor mailing list