[R] Fast multiple match function
Keshav Dhandhania
kshav.91 at gmail.com
Fri Apr 17 18:44:05 CEST 2015
Hi Jeff,
Indeed the data.table package does provide a much cleaner way to achieve
the same functionality, and a lot of other functionality as bonus.
Thanks for letting me know about it.
On Tue, 7 Apr 2015 at 15:41 Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> You might find the data.table package helpful. It uses an index sorted
> with a radix sort and minimizes moving the data around in memory.
> ---------------------------------------------------------------------------
> Jeff Newmiller The ..... ..... Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live
> Go...
> Live: OO#.. Dead: OO#.. Playing
> Research Engineer (Solar/Batteries O.O#. #.O#. with
> /Software/Embedded Controllers) .OO#. .OO#. rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On April 7, 2015 1:50:39 PM PDT, Keshav Dhandhania <kshav.91 at gmail.com>
> wrote:
> >Hi all,
> >
> >Thanks for the responses.
> >Herve's example is a good small size example of what I wanted.
> >
> >> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
> >> someCoolFunc(-2, y)
> >[1] 3 9
> >> someCoolFunc(15, y)
> >[1] 4 5 8
> >
> >The requirement is that I want someCoolFunc() to run in O(number of
> >matches) time, instead of O(size of y).
> >This is because y is big. And I don't know all the queries I want to
> >do up-front. And the results of some queries might change the queries
> >I want to do in the future.
> >
> >@David: I hope the above description is more clear.
> >@Enrico, Herve: I want both the functionality provided by one function.
> >- On repeated calls, fmatch() does give O(1) performance, but it does
> >not give all matches.
> >- findMatches() gives all matches, but I need to know the entire
> >vector x beforehand. I don't have that luxury.
> >
> >
> >I do have something that works now, using split and fmatch (package
> >fastmatch). So just posting that in case anyone in the future has the
> >same problem.
> >> y.unique <- unique(y)
> >>
> >> # create a map from the unique elements of y to the locations of all
> >occurrences of the element
> >> y.map <- split(1:length(y), match(y, y.unique))
> >>
> >> # write a wrapper function that does a look-up on the unique list.
> >and then returns all matches using the map.
> >> someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }
> >
> >
> >
> >On Tue, 7 Apr 2015 at 13:21 Hervé Pagès <hpages at fredhutch.org> wrote:
> >>
> >> Hi Keshav,
> >>
> >> findMatches() in the S4Vectors/IRanges packages (Bioconductor) I
> >think
> >> does what you want:
> >>
> >> library(IRanges)
> >> y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
> >> x <- c(unique(y), 999L)
> >> hits <- findMatches(x, y)
> >>
> >> Then:
> >>
> >> > hits
> >> Hits object with 9 hits and 0 metadata columns:
> >> queryHits subjectHits
> >> <integer> <integer>
> >> [1] 1 1
> >> [2] 2 2
> >> [3] 3 3
> >> [4] 3 9
> >> [5] 4 4
> >> [6] 4 5
> >> [7] 4 8
> >> [8] 5 6
> >> [9] 6 7
> >> -------
> >> queryLength: 7
> >> subjectLength: 9
> >>
> >> The Hits object can be turned into a list with:
> >>
> >> > as.list(hits)
> >> [[1]]
> >> [1] 1
> >>
> >> [[2]]
> >> [1] 2
> >>
> >> [[3]]
> >> [1] 3 9
> >>
> >> [[4]]
> >> [1] 4 5 8
> >>
> >> [[5]]
> >> [1] 6
> >>
> >> [[6]]
> >> [1] 7
> >>
> >> [[7]]
> >> integer(0)
> >>
> >> H.
> >>
> >> > sessionInfo()
> >> R version 3.2.0 beta (2015-04-05 r68151)
> >> Platform: x86_64-unknown-linux-gnu (64-bit)
> >> Running under: Ubuntu 14.04.2 LTS
> >>
> >> locale:
> >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> >> [9] LC_ADDRESS=C LC_TELEPHONE=C
> >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>
> >> attached base packages:
> >> [1] parallel stats4 stats graphics grDevices utils
> >datasets
> >> [8] methods base
> >>
> >> other attached packages:
> >> [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11
> >>
> >> loaded via a namespace (and not attached):
> >> [1] tools_3.2.0
> >>
> >> On 04/06/2015 01:56 PM, Keshav Dhandhania wrote:
> >> > Hi,
> >> >
> >> > I know that one can find all occurrences of x in a vector v by
> >doing
> >> >> which(x == v).
> >> >
> >> > However, if I need to do this again and again, where v is remaining
> >the
> >> > same, then this is quite inefficient. In my particular case, I need
> >to do
> >> > this millions of times, and length(v) = 100 million.
> >> >
> >> > Does anyone have suggestion on how to go about it?
> >> > I know of a package called fmatch that does the above for the match
> >> > function. But they don't handle multiple matches.
> >> >
> >> > Thanks
> >> >
> >> > [[alternative HTML version deleted]]
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >>
> >> --
> >> Hervé Pagès
> >>
> >> Program in Computational Biology
> >> Division of Public Health Sciences
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N, M1-B514
> >> P.O. Box 19024
> >> Seattle, WA 98109-1024
> >>
> >> E-mail: hpages at fredhutch.org
> >> Phone: (206) 667-5791
> >> Fax: (206) 667-1319
> >
> >______________________________________________
> >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >https://stat.ethz.ch/mailman/listinfo/r-help
> >PLEASE do read the posting guide
> >http://www.R-project.org/posting-guide.html
> >and provide commented, minimal, self-contained, reproducible code.
>
>
[[alternative HTML version deleted]]
More information about the R-help
mailing list