[R] Fast string comparison

Romain Francois romain.francois at dbmail.com
Tue Jul 13 15:48:18 CEST 2010


Hi Matt,

I think there are some confusing factors in your results.


system.time(strcmp(strings[-1], strings[-1e5]))

would also include the time required to perform both subscripting 
(strings[-1] and strings[-1e5] ) which actually "takes some time".


Also, you do have a bit of overhead due to the use of STRING_ELT and the 
write barrier.


I've include below a version that uses R internals so that you get the 
fast (but you have to understand the risks, etc ...) version of 
STRING_ELT using the plugin system of inline.

library(inline)
code <- "
     SEXP ans;
     int i, len, *cans;
     if(!isString(s1) || !isString(s2))
         error(\"invalid arguments\");
     len = length(s1)>length(s2)?length(s2):length(s1);
     PROTECT(ans = allocVector(INTSXP, len));
     cans = INTEGER(ans);
     for(i = 0; i < len; i++)
         cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
                          CHAR(STRING_ELT(s2,i)));
     UNPROTECT(1);
     return ans;
"
sig <- signature(s1="character", s2="character")
strcmp <- cfunction(sig, code)

strings <- replicate(1e5, paste(sample(letters, 100, rep = T), collapse 
=  ""))


lhs <- strings[-1]
rhs <- strings[-1e5]
system.time( lhs == rhs )
system.time(strcmp( lhs, rhs) )

library(inline)
settings <- getPlugin( "default" )
settings$includes <- paste( "#define USE_RINTERNALS", settings$includes, 
collapse = "\n" )
code2 <- "
     SEXP ans;
     int i, len, *cans;
     if(!isString(s1) || !isString(s2))
         error(\"invalid arguments\");
     len = length(s1)>length(s2)?length(s2):length(s1);
     PROTECT(ans = allocVector(INTSXP, len));
     cans = INTEGER(ans);
     for(i = 0; i < len; i++)
         cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
                          CHAR(STRING_ELT(s2,i)));
     UNPROTECT(1);
     return ans;
"
sig <- signature(s1="character", s2="character" )
strcmp2 <- cxxfunction(sig, code2, settings = settings)
system.time(strcmp2( lhs, rhs) )



I get:

$ Rscript strings.R
Le chargement a nécessité le package : methods
utilisateur     système      écoulé
       0.002       0.000       0.002
utilisateur     système      écoulé
       0.004       0.000       0.005
utilisateur     système      écoulé
       0.003       0.000       0.003

Romain


Le 13/07/10 15:24, Matt Shotwell a écrit :
>
> On Tue, 2010-07-13 at 01:42 -0400, Hadley Wickham wrote:
>> strings<- replicate(1e5, paste(sample(letters, 100, rep = T), collapse =  ""))
>> system.time(strings[-1] == strings[-1e5])
>> #   user  system elapsed
>> #  0.016   0.000   0.017
>>
>> So it takes ~1/100 of a second to do ~100,000 string comparisons. You
>> need to provide a reproducible example that illustrates why you think
>> string comparisons are slow.
>
> Here's a vectorized alternative to '==' for strings, with minimal
> argument checking or result conversion. I haven't looked at the
> corresponding R source code, it may be similar:
>
> library(inline)
> code<- "
>      SEXP ans;
>      int i, len, *cans;
>      if(!isString(s1) || !isString(s2))
>          error(\"invalid arguments\");
>      len = length(s1)>length(s2)?length(s2):length(s1);
>      PROTECT(ans = allocVector(INTSXP, len));
>      cans = INTEGER(ans);
>      for(i = 0; i<  len; i++)
>          cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
>                           CHAR(STRING_ELT(s2,i)));
>      UNPROTECT(1);
>      return ans;
> "
> sig<- signature(s1="character", s2="character")
> strcmp<- cfunction(sig, code)
>
>
>> system.time(strings[-1] == strings[-1e5])
>     user  system elapsed
>    0.036   0.000   0.035
>> system.time(strcmp(strings[-1], strings[-1e5]))
>     user  system elapsed
>    0.032   0.000   0.034
>
> That's pretty fast, though I seem to be working with a slower system
> than Hadley. It's hard to see how this could be improved, except maybe
> by caching results of string comparisons.
>
> -Matt
>
>>
>> Hadley
>>
>>
>> On Tue, Jul 13, 2010 at 6:52 AM, Ralf B<ralf.bierig at gmail.com>  wrote:
>>> I am asking this question because String comparison in R seems to be
>>> awfully slow (based on profiling results) and I wonder if perhaps '=='
>>> alone is not the best one can do. I did not ask for anything
>>> particular and I don't think I need to provide a self-contained source
>>> example for the question. So, to re-phrase my question, are there more
>>> (runtime) effective ways to find out if two strings (about 100-150
>>> characters long) are equal?
>>>
>>> Ralf
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jul 11, 2010 at 2:37 PM, Sharpie<chuck at sharpsteen.net>  wrote:
>>>>
>>>>
>>>> Ralf B wrote:
>>>>>
>>>>> What is the fastest way to compare two strings in R?
>>>>>
>>>>> Ralf
>>>>>
>>>>
>>>> Which way is not fast enough?
>>>>
>>>> In other words, are you asking this question because profiling showed one of
>>>> R's string comparison operations is causing a massive bottleneck in your
>>>> code? If so, which one and how are you using it?
>>>>
>>>> -Charlie


-- 
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/bc8jNi : Rcpp 0.8.4
|- http://bit.ly/dz0RlX : bibtex 0.2-1
`- http://bit.ly/a5CK2h : Les estivales 2010



More information about the R-help mailing list