[R] Comparison of two very large strings

David Winsemius dwinsemius at comcast.net
Tue Jul 13 00:46:16 CEST 2010


On Jul 12, 2010, at 6:03 PM, harsh yadav wrote:

> Hi,
>
> I have a function in R that compares two very large strings for about
> 1 million records.
>
> The strings are very long URLs like:
>
> http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8 ...
>
> or longer.
>
> The data-frame looks like:
>
>  id  url
>  1   http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8 ...
>  2   http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse
>  3   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N ...
>  4   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
>  5   http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt
>
> and so on for about 1 million records.
>
> Here is the function that I am using to compare the two strings:
>
> stringCompare <- function(currentURL, currentId){
>   j <- currentId - 1
>   while(j >= 1){
>     previousURL <- urls[j]  # assumed: urls is the vector of all URLs
>     previousURLLength <- nchar(previousURL)
>     # compare the shorter string with the longer one
>     if(nchar(currentURL) <= previousURLLength){
>       matchPhrase <- substr(previousURL, 1, nchar(currentURL))
>       if(matchPhrase == currentURL){
>         return(TRUE)
>       }
>     }else{
>       matchPhrase <- substr(currentURL, 1, previousURLLength)
>       if(matchPhrase == previousURL){
>         return(TRUE)
>       }
>     }
>     j <- j - 1
>   }
>   return(FALSE)
> }

Couldn't you just run the "url" vector through nchar once, store the
lengths, and then do the comparison in a vectorized manner?
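
(rd.txt below is not a base function -- just a small convenience wrapper for
reading a literal table from a string; a minimal sketch, if you need one,
might be:)

rd.txt <- function(txt, header = TRUE, ...) {
    ## read a whitespace-separated table from a literal character string
    con <- textConnection(txt)
    on.exit(close(con))
    read.table(con, header = header, ...)
}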

test <- rd.txt('id url
1 "http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8"
2 "http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse"
3 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N"
4 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N"
5 "http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt"', stringsAsFactors=FALSE)

copyUrls <- test[, "url"]
sizeUrls <- nchar(copyUrls)
lengU <- length(sizeUrls)
# length of the shorter member of each consecutive pair of URLs
sizidx <- pmin(sizeUrls[1:(lengU-1)], sizeUrls[2:lengU])
# TRUE wherever the shorter URL of a pair is a prefix of the other
substr(copyUrls[2:lengU], 1, sizidx) == substr(copyUrls[1:(lengU-1)], 1, sizidx)

#[1] FALSE FALSE  TRUE FALSE
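
That only checks each URL against the row immediately before it, though. If
you really need every row compared against *all* earlier rows, as your
function does, you can still vectorize the inner comparison. A rough sketch
(the function name is mine, and it is untested at your scale):

hasEarlierPrefixMatch <- function(urls) {
    ## TRUE for row i when some earlier URL is a prefix of urls[i],
    ## or urls[i] is a prefix of some earlier URL
    lens <- nchar(urls)
    out <- logical(length(urls))
    for (i in seq_along(urls)[-1]) {
        prev <- urls[seq_len(i - 1)]
        k <- pmin(nchar(prev), lens[i])   # length of the shorter of each pair
        out[i] <- any(substring(prev, 1, k) == substring(urls[i], 1, k))
    }
    out
}

hasEarlierPrefixMatch(test$url)

With the five example rows, only row 4 should come back TRUE, since it
duplicates row 3. If the row order doesn't actually matter, sorting the URLs
first and running the neighbour comparison above in one pass would find every
prefix pair, because after sorting any string that is a prefix of another is
also a prefix of its immediate successor.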


> Here, I compare the URL at a given row with all the previous URLs in the
> data-frame. I compare the smaller of the two given URLs with the larger
> one (up to the length of the smaller).
>
> When I run the above function for about 1 million records, execution
> becomes really slow; it is otherwise fast if I remove the string
> comparison step.
>
> Any ideas how this can be implemented in a fast and efficient way?
>
> Thanks and Regards,
> Harsh Yadav

David Winsemius, MD
West Hartford, CT


