RES: [R] AGREP
Marcos Sanches
marcos.sanches at ipsos-opinion.com.br
Thu Feb 12 21:35:54 CET 2004
Hi Henrik,
Your function is really faster, but I tested it to solve my
problem. And I found it is too time consuming yet for me. This happens
because I need to compare strings from two very large vectors. Bellow I
wrote the syntax I have to use:
##########################
Ls1<-length(s1)
Ls2<-length(s2)
for ( p in 1:ls1){
for (q in 1:ls2){
t1<-levenshteinFast(s1[p],s2[q])
if (t1<s12[p]){
s12[p]<-s2[q]
n12[q]<-t1}
}
}
#############################
If you want to have na idea, the size of my loop are:
Ls1=42000
Ls2=70000
I think I will wait for months untill this program ends. Do you have any
sugestion to increase the speed? The AGREP function is much faster, and
I am using it, but I don't have a efficient comparation because
AGREP(a,b) is not equal AGREP(b,a).
Thanks,
Marcos
-----Mensagem original-----
De: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] Em nome de Henrik Bengtsson
Enviada em: quinta-feira, 12 de fevereiro de 2004 13:12
Para: ggrothendieck at myway.com; rodrigo.abt at sii.cl;
r-help at stat.math.ethz.ch
Assunto: RE: [R] AGREP
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Gabor
> Grothendieck
> Sent: den 12 februari 2004 16:07
> To: rodrigo.abt at sii.cl; r-help at stat.math.ethz.ch
> Subject: Re: [R] AGREP
>
> One could shorten it slightly with these minor improvements.
> Unfortunately, the key performance problem, the double loop
> at the end which implements the dynamic programming calculation,
> is still there.
>
> levenshtein<-function(s1,s2) {
> # Make sure args are strings
> a <- as.character(s1); an <- nchar(s1)+1
> b <- as.character(s2); bn <- nchar(s2)+1
>
> # If one arg is an empty string, returns the length of the
other
> if (nchar(a)==0) return(nchar(b))
> if (nchar(b)==0) return(nchar(a))
>
> # Initialize matrix for calculations
> m <- matrix(0, nrow=an, ncol=bn)
> m[1,] <- 1:bn
> m[,1] <- 1:an
>
> # Cost calculation - line beginning (substr... is 0-1 cost f'n
> for (i in 2:an)
> for (j in 2:bn)
> m[i,j] <- min( m[i-1,j]+1, m[i,j-1]+1, m[i-1,j-1]+
> (substr(a,i-1,i-1)!=substr(b,j-1,j-1)) )
>
> # Returns the distance
> m[an,bn]-1
> }
>
But a very expensive part of the code though is the substr() calls.
Instead of doing this nchar(a)*nchar(b) times it's enough to do it
nchar(a)+nchar(b). Even better is to use strsplit() first as below:
levenshteinFast <- function(s1,s2) {
# Make sure args are strings
a <- as.character(s1)
b <- as.character(s2)
# Split strings into vectors
a <- strsplit(a, split="")[[1]]
b <- strsplit(b, split="")[[1]]
# If one arg is an empty string, returns the length of the other
an <- length(a)
bn <- length(b)
if (an==0) return(bn)
if (bn==0) return(an)
# Initialize matrix for calculations
m <- matrix(0, nrow=an+1, ncol=bn+1)
m[1,] <- 1:(bn+1)
m[,1] <- 1:(an+1)
# Cost calculation - line beginning (substr... is 0-1 cost f'n
for (i in 2:(an+1)) {
for (j in 2:(bn+1)) {
m[i,j] <- min(m[i-1,j ] + 1,
m[i ,j-1] + 1,
m[i-1,j-1] + !identical(a[i-1],b[j-1]))
}
}
# Returns the distance
m[an+1,bn+1]-1;
} # levenshteinFast()
# Example
N <- 500
s1 <- sample(letters, size=N, replace=TRUE)
s1 <- paste(s1, collapse="")
s2 <- sample(letters, size=N, replace=TRUE)
s2 <- paste(s2, collapse="")
t1 <- system.time(dist1 <- levenshtein(s1,s2))
print(c(t1,dist1))
# [1] 46.83 0.23 54.24 NA NA 443.00
t2 <- system.time(dist2 <- levenshteinFast(s1,s2))
print(c(t2,dist2))
# [1] 18.82 0.07 20.90 NA NA 443.00
/Henrik
> ---
> Date: Thu, 12 Feb 2004 11:01:19 -0300
> From: Rodrigo Abt <rodrigo.abt at sii.cl>
> To: 'Lista de Correo de R' <r-help at stat.math.ethz.ch>
> Subject: Re: [R] AGREP
>
>
> "Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br> writes:
>
> >Hi listers
> >
> >If you don't know what is the Edit Distance beetwen two
> strings, I will
> >try to explain, in fact it is very simple to understund but not to
> >calculate througth a program. It is simplilly the minimum number of
> >operations you must perform to transform string A on string B, by
> >operations I mean delete letters, insert letters or
> substitute letter.
> >
> >If you need to do few operations, it means string A is
> almost the same
> >as string B. Otherwise they are more differente as the number of
> >operations increase.
> >
> >If you have a idea of how to make a function to calculate this
> >distance, it would be very important for me.
> >
> >Thanks very much,
> >
> >Marcos
>
> I guess you're looking for Levenshtein distance, so try this:
>
> levenshtein<-function(s1,s2) {
> # Make sure args are strings
> a<-as.character(s1);an=nchar(s1)+1
> b<-as.character(s2);bn=nchar(s2)+1
>
> # Initialize matrix for calculations
> m<-matrix(nrow=an,ncol=bn)
>
> # If one arg is an empty string, returns the length of the
other
> if (nchar(a)==0)
> return(nchar(b))
> if (nchar(b)==0)
> return(nchar(a))
>
> # Matrix initialization
> for (i in 1:an) {
> for (j in 1:bn) {
> m[i,j]<-0
> m[1,j]<-j
> }
> m[i,1]<-i
> }
>
> # Cost calculation
> for (i in 2:an) {
> for (j in 2:bn) {
> if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
> cost<-0
> else
> cost<-1
> m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost)
> }
> }
> # Returns the distance
> m[an,bn]-1
> }
>
> Examples:
>
> > levenshtein("Great","Grreat") <-- One addition
> [1] 1
> > levenshtein("mahrcoz","Marcos") <-- One substitution,one deletion
> and one substitution
> [1] 3
>
> Note that this function IS case sensitive. If you want to
> apply this on vectors of strings you'll have to write the
> corresponding wrapper function.
>
> Hope that helps,
>
> Rodrigo Abt B,
> Analyst,
> Dept. Economic Studies,
> SII, Chile.
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailma> n/listinfo/r-help
> PLEASE
> do read the posting guide!
> http://www.R-project.org/posting-guide.html
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
More information about the R-help
mailing list