Thanks for providing this use case.

It's actually documented that the Hits are sorted by query. So using that
assumption, an Rle-based approach like that provided by Harris is probably
the most efficient. But I would use Rle from IRanges instead of the base
"rle" class, because it offers more conveniences; for example, the code is
pretty intuitive for this use case:

  queryHits_rle <- Rle(queryHits(all_hits))
  first_hits <- all_hits[start(queryHits_rle)]
  last_hits <- all_hits[end(queryHits_rle)]

Michael



On Sun, Oct 20, 2013 at 1:32 PM, Peter Hickey <hickey@wehi.edu.au> wrote:

> Hi all,
>
> Suppose I've created a Hits object by
> all_hits <- findOverlaps(query, subject, select = 'all', type = 'any')
> and now I want to get the first (or last) hit for each query. Of course, I
> could use
> first_hits <- findOverlaps(query, subject, select = 'first', type = 'any')
> last_hits <- findOverlaps(query, subject, select = 'last', type = 'any')
> but I this isn't necessary as all the required information is in the
> all_hits Hits object.
>
> I was unable to find a function to do what I wanted, however, based at the
> source for findOverlaps-methods.R (actually Hector's branch at
> https://github.com/hcorrada/GenomicRanges/blob/master/R/findOverlaps-methods.R)
> I wrote the following simple functions
> getFirstHit <- function(all_hits){
>   ans <- rep.int(NA_integer_, queryLength(all_hits))
>   oo <- IRanges:::orderIntegerPairs(queryHits(all_hits),
> subjectHits(all_hits), decreasing = TRUE)
>   ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo]
>   return(ans)
> }
> getLastHit <- function(all_hits){
>   ans <- rep.int(NA_integer_, queryLength(all_hits))
>   ans[queryHits(all_hits)] <- subjectHits(all_hits)
>   return(ans)
> }
>
> These do what I want and are faster than re-running findOverlaps for my
> use case (millions of query intervals with tens of thousands of subject
> intervals)
>
> Now to my questions:
> (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions?
> (2) Would others find "getFirstHit" and "getLastHit" useful enough to
> justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)?
> (3) I recall some discussion on this mailing list and Rdevel stating that
> using ":::" to access non-exported functions isn't a great idea for use in
> package code. If I need to include my own "getFirstHit" would someone be so
> kind as to point me to the best advice for avoiding calls to ":::" in my
> code to access IRanges:::orderIntegerPairs(queryHits)?
>
> Many thanks,
> Pete
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey@wehi.edu.au
> http://www.wehi.edu.au
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:13}}

