[Bioc-devel] IRanges findOverlaps Result Different for Recent Update

Michael Lawrence lawrence.michael at gene.com
Thu Jan 15 20:59:07 CET 2015


My concern is mostly with user code that is not in Bioc svn. But perhaps the
partial sorting (by query) is sufficient for much of that code.
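
To make the difference concrete, here is a minimal base R sketch built from the hit pairs in the report quoted below. It does not use IRanges itself: the matrix `hits` and the helper name `fully_sort_hits` are illustrative assumptions. On a real Hits object the equivalent step would be sort(), as suggested in the quoted reply.

```r
# Hit pairs as reported for IRanges 2.1.35: sorted by query only.
hits <- cbind(queryHits   = c(1, 1, 2, 4, 4, 6, 6),
              subjectHits = c(1, 4, 2, 1, 4, 7, 6))

# Hypothetical helper: reorder rows by query index, then by subject index.
fully_sort_hits <- function(m) {
  m[order(m[, "queryHits"], m[, "subjectHits"]), , drop = FALSE]
}

# The partial sorting holds: query indices are non-decreasing.
partially_sorted <- !is.unsorted(hits[, "queryHits"])

# Full sorting gives the row order reported for IRanges 2.0.1.
full <- fully_sort_hits(hits)
```

Code that only groups hits by query needs nothing beyond `partially_sorted` being TRUE; code that also relies on the subject order within each query needs the extra sort.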

On Thu, Jan 15, 2015 at 11:34 AM, Hervé Pagès <hpages at fredhutch.org> wrote:

> Hi guys,
>
> Indeed, the Hits object returned by findOverlaps() is not fully
> sorted anymore. Now it's sorted by query hit *only* and not by query
> hit *and* subject hit. Fully sorting a big Hits object has a high
> cost, both in terms of time and memory footprint. The partial
> sorting is *much* cheaper: it's done using a "tabulated sorting"
> algo implemented in C that works in linear time.
>
> The partial sorting is important: it allows a very common
> transformation like as(hits, "List") to be super fast. But the
> full sorting was overkill and generally not needed. Also note that
> the full sorting was never enforced via the validity method for
> Hits objects (and t(hits) was breaking that order in BioC < 3.1).
> Now the validity method for Hits enforces the partial sorting and
> t(hits) preserves it.
>
> There were only 3 or 4 packages that broke in devel because of
> that change (typically the change broke their unit tests). I fixed
> them (except Repitools, but it's still on my list). The fix is easy:
> if having the hits fully sorted matters, just use sort() on the Hits
> object. The man page for ?findOverlaps will soon be updated to
> reflect these changes.
>
> Cheers,
> H.
>
>
>
> On 01/15/2015 06:42 AM, Kasper Daniel Hansen wrote:
>
>> Has it ever been documented that the return object is sorted in a specific
>> way?  I just want to make sure we think about whether that is something we
>> want to enforce, given the possibility of using a different algorithm in
>> the future.
>>
>> We could also address this by implementing (perhaps it already exists) a
>> sort() method for the return object.  That would still break existing code
>> though.
>>
>> Best,
>> Kasper
>>
>> On Wed, Jan 14, 2015 at 11:13 PM, Michael Lawrence <
>> lawrence.michael at gene.com> wrote:
>>
>>> I bet there is a lot of code that depends on having the hits (conveniently)
>>> ordered by query,subject index, so we should try to restore the previous
>>> behavior.
>>>
>>> On Wed, Jan 14, 2015 at 8:00 PM, Dario Strbenac <
>>> dstr7320 at uni.sydney.edu.au>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> For an identical query, the matrix results are in a different order.
>>>> Consider the subject hits of the last two rows :
>>>>
>>>> > mapping   # R Under development (unstable) (2015-01-13 r67453) and IRanges 2.1.35
>>>>       queryHits subjectHits
>>>> [1,]         1           1
>>>> [2,]         1           4
>>>> [3,]         2           2
>>>> [4,]         4           1
>>>> [5,]         4           4
>>>> [6,]         6           7
>>>> [7,]         6           6
>>>>
>>>> > mapping   # R Under development (unstable) (2015-01-13 r67453) and IRanges 2.0.1
>>>>       queryHits subjectHits
>>>> [1,]         1           1
>>>> [2,]         1           4
>>>> [3,]         2           2
>>>> [4,]         4           1
>>>> [5,]         4           4
>>>> [6,]         6           6
>>>> [7,]         6           7
>>>>
>>>> This causes some values to be extracted in a different order by our
>>>> annotationLookup function, and causes an error for the development version
>>>> of Repitools on a test case which uses all.equal to compare a list to a
>>>> correct list, but not for the release version which uses the release
>>>> version of IRanges. Should I update the test case to have a new expected
>>>> result, or is this new characteristic of findOverlaps likely to revert to
>>>> the previous output soon?
>>>>
>>>> The two sets of intervals to produce this result are anno and probesGR,
>>>> defined in the tests.R file in the Repitools package.
>>>>
>>>> --------------------------------------
>>>> Dario Strbenac
>>>> PhD Student
>>>> University of Sydney
>>>> Camperdown NSW 2050
>>>> Australia
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

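The "tabulated sorting" mentioned in the quoted message is implemented in C inside the package. The following is only a base R sketch of a counting-sort style approach, assuming that is roughly what is meant; the function name `sort_by_query` and its arguments are hypothetical. It orders hits by query index in time linear in the number of hits plus the number of queries, and leaves the subject indices within each query in the order they arrived.

```r
# Hypothetical sketch: order hit pairs by query index with a counting sort.
# q, s: integer vectors of query and subject indices; n_query: number of queries.
sort_by_query <- function(q, s, n_query) {
  # Count how many hits each query has.
  counts <- tabulate(q, nbins = n_query)
  # Position just before the first output slot of each query.
  offset <- cumsum(counts) - counts
  out_q <- integer(length(q))
  out_s <- integer(length(s))
  for (i in seq_along(q)) {
    # Advance to the next free slot for this query and fill it.
    offset[q[i]] <- offset[q[i]] + 1L
    out_q[offset[q[i]]] <- q[i]
    out_s[offset[q[i]]] <- s[i]
  }
  cbind(queryHits = out_q, subjectHits = out_s)
}

# Unsorted input in which subject 7 is seen before subject 6 for query 6.
res <- sort_by_query(q = c(6L, 1L, 4L, 6L, 2L, 1L, 4L),
                     s = c(7L, 1L, 1L, 6L, 2L, 4L, 4L),
                     n_query = 6L)
```

With this input the rows come out grouped by query, while query 6 keeps subject 7 ahead of subject 6, which is the kind of within-query order seen in the IRanges 2.1.35 output in the original report.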