[R] text vector clustering

David Winsemius dwinsemius at comcast.net
Thu Jan 22 15:59:25 CET 2009



Tabulating the names and isolating the entries that occur only once
might have worked if the count discrepancy weren't so high. It appears
you have a greater degree of corruption than would be expected from
"typos" alone.
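
For what it is worth, the tabulation idea can be sketched in base R as
follows. The vector student_names is a made-up sample standing in for your
one-column CSV (which you would read with read.csv); the names in it are
invented for illustration only.

```r
# Hypothetical sample standing in for the real one-column CSV of names
student_names <- c("Anita Rao", "Anita Rao", "Anita Rao", "Anita Roa",
                   "Vikram Shah", "Vikram Shah", "Vikrm Shah")

# Count how often each distinct spelling occurs
name_counts <- table(student_names)

# Spellings that occur only once are candidate typos
singletons <- names(name_counts)[name_counts == 1]

# Spellings that occur more than once are candidate "true" names
repeated <- names(name_counts)[name_counts > 1]

print(singletons)
print(repeated)
```

This only helps when most misspellings are rare and most correct spellings
are repeated, which is the assumption that seems to fail for your data.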

Have you looked at the packages referenced at:

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

The Soundex algorithm is an old programming chestnut which I have seen
implemented in R, but I understand there are improved versions. How
well they perform on persons' names may depend strongly on the cultural
origins of your population.
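
To give a flavour of what Soundex does, here is a minimal base R sketch of
the classic American variant. It is not taken from any package: the function
name soundex and the sample surnames are my own for illustration. It also
assumes plain ASCII letters, which is part of the cultural caveat above.

```r
# Minimal sketch of classic American Soundex (illustrative, not a package API)
soundex <- function(name) {
  # Keep ASCII letters only and upper-case them; accented letters are dropped
  letters_only <- toupper(gsub("[^A-Za-z]", "", name))
  if (nchar(letters_only) == 0) return(NA_character_)
  chars <- strsplit(letters_only, "")[[1]]

  # Map each letter to a digit: vowels and Y -> "0", H and W -> "9" (marker)
  codes <- chartr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                  "01230129022455012623019202", chars)

  # Drop H and W (except as first letter) so consonants with the same code
  # on either side of them collapse into one
  keep <- codes != "9"
  keep[1] <- TRUE
  codes <- codes[keep]

  # Collapse runs of identical adjacent codes
  codes <- rle(codes)$values

  # Drop the first letter's own code, then the vowel placeholders
  digits <- codes[-1]
  digits <- digits[digits != "0"]

  # Pad with zeros or truncate to exactly three digits
  padded <- c(digits, "0", "0", "0")[1:3]
  paste0(chars[1], paste(padded, collapse = ""))
}

# Hypothetical surnames; names sharing a code land in the same group
surnames <- c("Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft", "Tymczak")
codes <- vapply(surnames, soundex, character(1), USE.NAMES = FALSE)
groups <- split(surnames, codes)
print(groups)
```

Grouping on the code treats "Robert" and "Rupert" as the same name, which
shows both the appeal and the bluntness of the method: it is a phonetic
key, not a percentage match.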

-- 
David Winsemius

On Jan 22, 2009, at 6:03 AM, srinivasa raghavan wrote:

> Hi,
>
> I am a new user of R, using R 2.8.1 on Windows 2003. I have a CSV file
> with a single column containing 30,000 student names. There were typing
> errors when these names were entered. The actual list of names has
> fewer than 1,000 entries, but we don't have that list for a keyword
> search.
>
> I am interested in grouping (clustering) these names by letter-by-letter
> similarity. Is there a text clustering algorithm in R that can group
> similar names into segments of exactly matching, 90% matching,
> 80% matching, etc.?
>
> thanks in advance,
>
> regards,
> srinivas
> statistical analyst.



