[R] Fast multiple match function
Hervé Pagès
hpages at fredhutch.org
Tue Apr 7 22:21:51 CEST 2015
Hi Keshav,
findMatches() in the S4Vectors/IRanges packages (Bioconductor) I think
does what you want:
library(IRanges)
y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
x <- c(unique(y), 999L)
hits <- findMatches(x, y)
Then:
> hits
Hits object with 9 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 2 2
[3] 3 3
[4] 3 9
[5] 4 4
[6] 4 5
[7] 4 8
[8] 5 6
[9] 6 7
-------
queryLength: 7
subjectLength: 9
The Hits object can be turned into a list with:
> as.list(hits)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3 9
[[4]]
[1] 4 5 8
[[5]]
[1] 6
[[6]]
[1] 7
[[7]]
integer(0)
H.
> sessionInfo()
R version 3.2.0 beta (2015-04-05 r68151)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11
loaded via a namespace (and not attached):
[1] tools_3.2.0
On 04/06/2015 01:56 PM, Keshav Dhandhania wrote:
> Hi,
>
> I know that one can find all occurrences of x in a vector v by doing
>> which(x == v).
>
> However, if I need to do this again and again, where v is remaining the
> same, then this is quite inefficient. In my particular case, I need to do
> this millions of times, and length(v) = 100 million.
>
> Does anyone have suggestion on how to go about it?
> I know of a package called fmatch that does the above for the match
> function. But they don't handle multiple matches.
>
> Thanks
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the R-help
mailing list