[R] Effective distance measures for Text Clustering
Harsh
singhalblr at gmail.com
Tue Apr 20 14:48:11 CEST 2010
Hi useRs,
(Disclaimer: my question is more statistical than specific to the
R system.)
I am using the "tm" package in R to create a Document-Term Matrix,
with Tf-Idf measures.
A) Once done, I create a distance matrix using the "euclidean" distance measure.
B) After this, I use hierarchical clustering to find an "appropriate"
separation in the data using the "ward" method.
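A minimal sketch of steps A and B in base R. The matrix `m` and its
values are made up for illustration; it stands in for the Tf-Idf weighted
Document-Term Matrix produced by tm, and the choice of two clusters is
an assumption as well.

```r
# Toy stand-in for a Tf-Idf weighted Document-Term Matrix
# (rows = documents, columns = terms). Values are invented.
m <- matrix(c(0.5, 0.0, 0.2,
              0.4, 0.1, 0.0,
              0.0, 0.9, 0.3,
              0.0, 0.8, 0.4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("alpha", "beta", "gamma")))

# A) Euclidean distance between documents
d <- dist(m, method = "euclidean")

# B) Hierarchical clustering with Ward's criterion.
# In R >= 3.1.0 the old method = "ward" is called "ward.D";
# "ward.D2" is the variant meant for unsquared Euclidean distances.
hc <- hclust(d, method = "ward.D2")

# Cut the dendrogram into two groups (k = 2 is an arbitrary choice here)
groups <- cutree(hc, k = 2)
print(groups)
```

plot(hc) draws the dendrogram, which helps when judging where to cut.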
For A above, what are generally the best practices for distance
measures on Tf-Idf weights? I have also used the cosine similarity
measure, but that creates NaN/Inf values which have to be converted to zero.
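One likely source of NaN values (an assumption, not verified on the
email data) is a document whose row is all zeros after feature
selection: its norm is 0, so the cosine becomes 0/0. A sketch in base R
that drops such rows instead of converting the result to zero; the
helper name `cosine_dist` and the toy matrix are hypothetical:

```r
# Cosine distance (1 - cosine similarity) between the rows of a matrix.
# All-zero rows have norm 0 and would give 0/0 = NaN, so they are
# reported and dropped before the division.
cosine_dist <- function(m) {
  norms <- sqrt(rowSums(m^2))
  empty <- norms == 0
  if (any(empty)) {
    message("dropping ", sum(empty), " all-zero row(s)")
    m <- m[!empty, , drop = FALSE]
    norms <- norms[!empty]
  }
  sim <- tcrossprod(m) / outer(norms, norms)  # pairwise cosine similarity
  as.dist(1 - sim)
}

# Toy example: row "c" has no terms at all
x <- rbind(a = c(1, 0), b = c(0, 1), c = c(0, 0), d = c(2, 0))
cd <- cosine_dist(x)
print(cd)
```

The resulting object can be passed to hclust() like any other "dist"
object. Whether empty documents should be dropped or handled some other
way depends on the data.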
For B above, I used "ward" since the Details section of the documentation
alluded to it being the most commonly used method and one that gives
better results.
I understand that such a question requires extensive research since
the underlying data (emails in my case) may have a great influence on
the results.
I have used a part-of-speech tagger to extract nouns, which serve as the
feature dictionary, in order to weed out trivial words.
Any feedback/link to online knowledge resources/your experience would
be greatly appreciated.
Thank you for your time.
Regards,
Harsh Singhal