[R] Fast string comparison

Matt Shotwell shotwelm at musc.edu
Tue Jul 13 17:37:22 CEST 2010


Good idea Romain, there is quite a bit of type testing in the function
versions of STRING_ELT and CHAR, not to mention the function call
overhead. Since the types are checked explicitly, I believe this
function is safe. All together now...

> system.time(strings[-1] == strings[-1e5])
   user  system elapsed 
  0.032   0.000   0.035 
> system.time(strcmp(strings[-1], strings[-1e5]))
   user  system elapsed 
  0.032   0.000   0.034 
> system.time(strcmp2(strings[-1], strings[-1e5]))
   user  system elapsed 
  0.024   0.000   0.026 
> system.time(lhs==rhs)
   user  system elapsed 
  0.012   0.000   0.013 
> system.time(strcmp(lhs, rhs))
   user  system elapsed 
  0.012   0.000   0.011 
> system.time(strcmp2(lhs, rhs))
   user  system elapsed 
  0.004   0.000   0.004

I looks like you can squeeze out more speed using the macro versions of
STRING_ELT and CHAR.

On Tue, 2010-07-13 at 09:48 -0400, Romain Francois wrote:
> Hi Matt,
> 
> I think there are some confusing factors in your results.
> 
> 
> system.time(strcmp(strings[-1], strings[-1e5]))
> 
> would also include the time required to perform both subscripting 
> (strings[-1] and strings[-1e5] ) which actually "takes some time".
> 
> 
> Also, you do have a bit of overhead due to the use of STRING_ELT and the 
> write barrier.
> 
> 
> I've include below a version that uses R internals so that you get the 
> fast (but you have to understand the risks, etc ...) version of 
> STRING_ELT using the plugin system of inline.
> 
> library(inline)
> code <- "
>      SEXP ans;
>      int i, len, *cans;
>      if(!isString(s1) || !isString(s2))
>          error(\"invalid arguments\");
>      len = length(s1)>length(s2)?length(s2):length(s1);
>      PROTECT(ans = allocVector(INTSXP, len));
>      cans = INTEGER(ans);
>      for(i = 0; i < len; i++)
>          cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
>                           CHAR(STRING_ELT(s2,i)));
>      UNPROTECT(1);
>      return ans;
> "
> sig <- signature(s1="character", s2="character")
> strcmp <- cfunction(sig, code)
> 
> strings <- replicate(1e5, paste(sample(letters, 100, rep = T), collapse 
> =  ""))
> 
> 
> lhs <- strings[-1]
> rhs <- strings[-1e5]
> system.time( lhs == rhs )
> system.time(strcmp( lhs, rhs) )
> 
> library(inline)
> settings <- getPlugin( "default" )
> settings$includes <- paste( "#define USE_RINTERNALS", settings$includes, 
> collapse = "\n" )
> code2 <- "
>      SEXP ans;
>      int i, len, *cans;
>      if(!isString(s1) || !isString(s2))
>          error(\"invalid arguments\");
>      len = length(s1)>length(s2)?length(s2):length(s1);
>      PROTECT(ans = allocVector(INTSXP, len));
>      cans = INTEGER(ans);
>      for(i = 0; i < len; i++)
>          cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
>                           CHAR(STRING_ELT(s2,i)));
>      UNPROTECT(1);
>      return ans;
> "
> sig <- signature(s1="character", s2="character" )
> strcmp2 <- cxxfunction(sig, code2, settings = settings)
> system.time(strcmp2( lhs, rhs) )
> 
> 
> 
> I get:
> 
> $ Rscript strings.R
> Le chargement a nécessité le package : methods
> utilisateur     système      écoulé
>        0.002       0.000       0.002
> utilisateur     système      écoulé
>        0.004       0.000       0.005
> utilisateur     système      écoulé
>        0.003       0.000       0.003
> 
> Romain
> 
> 
> Le 13/07/10 15:24, Matt Shotwell a écrit :
> >
> > On Tue, 2010-07-13 at 01:42 -0400, Hadley Wickham wrote:
> >> strings<- replicate(1e5, paste(sample(letters, 100, rep = T), collapse =  ""))
> >> system.time(strings[-1] == strings[-1e5])
> >> #   user  system elapsed
> >> #  0.016   0.000   0.017
> >>
> >> So it takes ~1/100 of a second to do ~100,000 string comparisons. You
> >> need to provide a reproducible example that illustrates why you think
> >> string comparisons are slow.
> >
> > Here's a vectorized alternative to '==' for strings, with minimal
> > argument checking or result conversion. I haven't looked at the
> > corresponding R source code, it may be similar:
> >
> > library(inline)
> > code<- "
> >      SEXP ans;
> >      int i, len, *cans;
> >      if(!isString(s1) || !isString(s2))
> >          error(\"invalid arguments\");
> >      len = length(s1)>length(s2)?length(s2):length(s1);
> >      PROTECT(ans = allocVector(INTSXP, len));
> >      cans = INTEGER(ans);
> >      for(i = 0; i<  len; i++)
> >          cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),\
> >                           CHAR(STRING_ELT(s2,i)));
> >      UNPROTECT(1);
> >      return ans;
> > "
> > sig<- signature(s1="character", s2="character")
> > strcmp<- cfunction(sig, code)
> >
> >
> >> system.time(strings[-1] == strings[-1e5])
> >     user  system elapsed
> >    0.036   0.000   0.035
> >> system.time(strcmp(strings[-1], strings[-1e5]))
> >     user  system elapsed
> >    0.032   0.000   0.034
> >
> > That's pretty fast, though I seem to be working with a slower system
> > than Hadley. It's hard to see how this could be improved, except maybe
> > by caching results of string comparisons.
> >
> > -Matt
> >
> >>
> >> Hadley
> >>
> >>
> >> On Tue, Jul 13, 2010 at 6:52 AM, Ralf B<ralf.bierig at gmail.com>  wrote:
> >>> I am asking this question because String comparison in R seems to be
> >>> awfully slow (based on profiling results) and I wonder if perhaps '=='
> >>> alone is not the best one can do. I did not ask for anything
> >>> particular and I don't think I need to provide a self-contained source
> >>> example for the question. So, to re-phrase my question, are there more
> >>> (runtime) effective ways to find out if two strings (about 100-150
> >>> characters long) are equal?
> >>>
> >>> Ralf
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Sun, Jul 11, 2010 at 2:37 PM, Sharpie<chuck at sharpsteen.net>  wrote:
> >>>>
> >>>>
> >>>> Ralf B wrote:
> >>>>>
> >>>>> What is the fastest way to compare two strings in R?
> >>>>>
> >>>>> Ralf
> >>>>>
> >>>>
> >>>> Which way is not fast enough?
> >>>>
> >>>> In other words, are you asking this question because profiling showed one of
> >>>> R's string comparison operations is causing a massive bottleneck in your
> >>>> code? If so, which one and how are you using it?
> >>>>
> >>>> -Charlie
> 
> 
-- 
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com



More information about the R-help mailing list