[R] AGREP

Thu Feb 12 17:11:44 CET 2004

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Gabor 
> Grothendieck
> Sent: den 12 februari 2004 16:07
> To: rodrigo.abt at sii.cl; r-help at stat.math.ethz.ch
> Subject: Re: [R] AGREP
> 
> One could shorten it slightly with these minor improvements.  
> Unfortunately, the key performance problem, the double loop 
> at the end which implements the dynamic programming calculation, 
> is still there.
> 
> levenshtein<-function(s1,s2) {
>      # Make sure args are strings
>      a <- as.character(s1); an <- nchar(s1)+1
>      b <- as.character(s2); bn <- nchar(s2)+1
> 
>      # If one arg is an empty string, returns the length of the
other
>      if (nchar(a)==0) return(nchar(b))
>      if (nchar(b)==0) return(nchar(a))
> 
>      # Initialize matrix for calculations
>      m <- matrix(0, nrow=an, ncol=bn)
>      m[1,] <- 1:bn
>      m[,1] <- 1:an 
> 
>      # Cost calculation - line beginning (substr... is 0-1 cost f'n
>      for (i in 2:an) 
>           for (j in 2:bn) 
> 		  m[i,j] <- min( m[i-1,j]+1, m[i,j-1]+1, m[i-1,j-1]+
> 		       (substr(a,i-1,i-1)!=substr(b,j-1,j-1)) ) 
> 
>      # Returns the distance
>      m[an,bn]-1
> }
> 

But a very expensive part of the code though is the substr() calls.
Instead of doing this nchar(a)*nchar(b) times it's enough to do it
nchar(a)+nchar(b). Even better is to use strsplit() first as below:

levenshteinFast <- function(s1,s2) {
  # Make sure args are strings
  a <- as.character(s1)
  b <- as.character(s2)

  # Split strings into vectors
  a <- strsplit(a, split="")[[1]]
  b <- strsplit(b, split="")[[1]]

  # If one arg is an empty string, returns the length of the other
  an <- length(a)
  bn <- length(b)
  if (an==0) return(bn)
  if (bn==0) return(an)

  # Initialize matrix for calculations
  m <- matrix(0, nrow=an+1, ncol=bn+1)
  m[1,] <- 1:(bn+1)
  m[,1] <- 1:(an+1)

  # Cost calculation - line beginning (substr... is 0-1 cost f'n
  for (i in 2:(an+1)) {
    for (j in 2:(bn+1)) {
      m[i,j] <- min(m[i-1,j  ] + 1,
                    m[i  ,j-1] + 1,
                    m[i-1,j-1] + !identical(a[i-1],b[j-1]))
    }
  }

  # Returns the distance
  m[an+1,bn+1]-1;
} # levenshteinFast()

# Example
N <- 500
s1 <- sample(letters, size=N, replace=TRUE)
s1 <- paste(s1, collapse="")
s2 <- sample(letters, size=N, replace=TRUE)
s2 <- paste(s2, collapse="")

t1 <- system.time(dist1 <- levenshtein(s1,s2))
print(c(t1,dist1))
# [1]  46.83   0.23  54.24     NA     NA 443.00

t2 <- system.time(dist2 <- levenshteinFast(s1,s2))
print(c(t2,dist2))
# [1]  18.82   0.07  20.90     NA     NA 443.00

/Henrik

> ---
> Date:   Thu, 12 Feb 2004 11:01:19 -0300 
> From:   Rodrigo Abt <rodrigo.abt at sii.cl>
> To:   'Lista de Correo de R' <r-help at stat.math.ethz.ch> 
> Subject:   Re: [R] AGREP 
> 
>  
> "Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br> writes:
> 
> >Hi listers
> >
> >If you don't know what is the Edit Distance beetwen two 
> strings, I will 
> >try to explain, in fact it is very simple to understund but not to 
> >calculate througth a program. It is simplilly the minimum number of

> >operations you must perform to transform string A on string B, by 
> >operations I mean delete letters, insert letters or 
> substitute letter.
> >
> >If you need to do few operations, it means string A is 
> almost the same 
> >as string B. Otherwise they are more differente as the number of 
> >operations increase.
> >
> >If you have a idea of how to make a function to calculate this 
> >distance, it would be very important for me.
> >
> >Thanks very much,
> >
> >Marcos
> 
> I guess you're looking for Levenshtein distance, so try this:
> 
> levenshtein<-function(s1,s2) {
>      # Make sure args are strings
>      a<-as.character(s1);an=nchar(s1)+1
>      b<-as.character(s2);bn=nchar(s2)+1
> 
>      # Initialize matrix for calculations
>      m<-matrix(nrow=an,ncol=bn)
> 
>      # If one arg is an empty string, returns the length of the
other
>      if (nchar(a)==0)
>           return(nchar(b))
>      if (nchar(b)==0)
>           return(nchar(a))
> 
>      # Matrix initialization
>      for (i in 1:an) {
>           for (j in 1:bn) {
>                m[i,j]<-0
>                m[1,j]<-j
>           }
>           m[i,1]<-i
>      }
> 
>      # Cost calculation
>      for (i in 2:an) {
>           for (j in 2:bn) {
>                if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
>                     cost<-0
>                else
>                     cost<-1
>           m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost)
>           }
>      }
>      # Returns the distance
>      m[an,bn]-1
> }
> 
> Examples:
> 
> > levenshtein("Great","Grreat")          <-- One addition
> [1] 1
> > levenshtein("mahrcoz","Marcos") <-- One substitution,one deletion
> and one substitution
> [1] 3
> 
> Note that this function IS case sensitive. If you want to 
> apply this on vectors of strings you'll have to write the 
> corresponding wrapper function.
> 
> Hope that helps,
> 
> Rodrigo Abt B,
> Analyst,
> Dept. Economic Studies,
> SII, Chile.
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list 
> https://www.stat.math.ethz.ch/mailma> n/listinfo/r-help
> PLEASE 
> do read the posting guide! 
> http://www.R-project.org/posting-guide.html