[R] Compact Patricia Trees (Tries)

Gabor Grothendieck ggrothendieck at gmail.com
Fri Apr 30 20:13:55 CEST 2010


What do you care if its optimal or not.  If is fast enough for your
application that is all that matters.  If its not then you will likely
have to code it yourself.

On Fri, Apr 30, 2010 at 2:09 PM, Richard R. Liu <richard.liu at pueo-owl.ch> wrote:
> Gabor,
>
> Thanks for the hint to use charmatch.  It is faster than what I was doing,
> but I suspect it might still be suboptimal.  I have a list of words, and I
> want to know which of those exactly or partially match one word.  By the
> very description of charmatch's arguments, the second being "table", I
> suspect, if it's investing any time at all in creating a structure that it
> can search more efficiently, it's going it with "table", not with the first
> argument "x".  I suspect that the time difference in favor of charmatch is
> due to the fact that it's doing natively what I was doing with R script,
> i.e., comparing each element of the first argument to a substring of the
> second argument with the same length as the element.
>
> Any insights?
>
> Regards,
> Richard
>
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
>
> Tel.:  +41 61 331 10 47
> Mobil: +41 79 708 67 66
> Email:  richard.liu at pueo-owl.ch
>
>
>
> On Apr 29, 2010, at 13:06 , Gabor Grothendieck wrote:
>
>> Using charmatch partial matches of 10,000 5 leters words to the same
>> list can be done in 10 seconds on my machine and 10,000 5 letter words
>> to 100,000 10 letter words in 1 minute.  Is that good enough?  Try
>> this simulation:
>>
>> # generate N random words each k long
>> rwords <- function(N, k) {
>>  L <- sample(letters, N*k, replace = TRUE)
>>  apply(matrix(L, k), 2, paste, collapse = "")
>> }
>> w1 <- rwords(1e5, 10)
>> w2 <- rwords(1e4, 5)
>>
>> system.time(charmatch(w2, w2))
>>
>> system.time(charmatch(w2, w1))
>>
>>
>> On Thu, Apr 29, 2010 at 4:05 AM, Richard R. Liu <richard.liu at pueo-owl.ch>
>> wrote:
>>>
>>> I have an application that a long list of character strings to determine
>>> which
>>> occur at the beginning of a given word.  A straight forward R script
>>> using grep
>>> takes a long time to run.  Rewriting it to use substr and match might be
>>> an
>>> option, but I have the impression that preparing the list as a trie and
>>> performing trie searches might lead to dramatic improvements in
>>> performance.
>>>
>>>
>>> I have searched the CRAN packages and find no packages that support
>>> Compact
>>> Patricia Trees.  Does anybody know of such?
>>>
>>>
>>> Thanks,
>>> Richard
>>>
>>> Richard R. Liu
>>> richard.liu at pueo-owl.ch
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>
>



More information about the R-help mailing list