[R] Effective distance measures for Text Clustering

Tue Apr 20 14:48:11 CEST 2010

Hi useRs,
(Disclaimer: My question is more statistical than specific to the
R system.)
I am using the "tm" package in R to create a Document-Term Matrix,
with Tf-Idf measures.
A) Once done, I create a distance matrix using "euclidean" distance measure.

B) After this, I use hierarchical clustering to find an "appropriate"
separation in the data using the "ward" method.

For A above, what are generally the best practices for distance
measures on Tf-Idf? I also tried the cosine similarity measure, but that
creates NaN/Inf values which have to be converted to zero.
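
To make the NaN/Inf issue concrete, here is a minimal base-R sketch of a
cosine distance. It assumes the NaN/Inf values come from documents whose
rows are all zero (a norm of 0 gives 0/0), which may not be the only cause
in practice. The toy matrix and the helper name cosine_dist are made up
for illustration and are not part of "tm".

```r
# Toy document-term matrix (rows = documents, columns = terms).
# The weights are invented; doc4 is an empty document (all-zero row).
m <- matrix(c(1, 2, 0,
              2, 4, 0,
              0, 0, 3,
              0, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("t1", "t2", "t3")))

# Hypothetical helper: cosine distance between the rows of a numeric matrix.
# Rows with zero norm are dropped up front, because their cosine
# similarity is undefined (0/0) and would otherwise show up as NaN.
cosine_dist <- function(m) {
  norms <- sqrt(rowSums(m^2))
  keep <- norms > 0
  m <- m[keep, , drop = FALSE]
  unit <- m / norms[keep]        # scale every remaining row to unit length
  sim <- tcrossprod(unit)        # pairwise dot products = cosine similarity
  sim[sim > 1] <- 1              # guard against floating-point overshoot
  as.dist(1 - sim)               # convert similarity to a "dist" object
}

d <- cosine_dist(m)
print(d)
```

Dropping the empty documents before computing distances avoids replacing
NaN with zero afterwards, which would treat an empty document as if it had
a well-defined similarity to every other document.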

For B above, I used "ward" since the Details section alluded to it
being the most used method, which provides better results.
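
For reference, steps A and B can be sketched in base R as below. This is
only an illustration under stated assumptions: the matrix is a made-up
stand-in for the Tf-Idf weights, rows are scaled to unit length first
(for unit vectors, squared Euclidean distance equals 2 * (1 - cosine
similarity), so Euclidean distance then orders pairs the same way cosine
does), and the method name "ward.D2" needs R 3.1.0 or later, where the
old "ward" option was renamed "ward.D" and "ward.D2" was added.

```r
# Made-up stand-in for a Tf-Idf weighted document-term matrix.
m <- matrix(c(3, 1, 0, 0,
              2, 1, 0, 0,
              4, 2, 0, 1,
              0, 0, 2, 3,
              0, 1, 1, 2,
              0, 0, 3, 4),
            nrow = 6, byrow = TRUE,
            dimnames = list(paste0("doc", 1:6), paste0("t", 1:4)))

# Scale each row to unit length (assumes no all-zero rows).
unit <- m / sqrt(rowSums(m^2))

# Step A: Euclidean distances between the normalized documents.
d <- dist(unit, method = "euclidean")

# Step B: hierarchical clustering with Ward's criterion.
hc <- hclust(d, method = "ward.D2")

# Cut the tree into k groups; k = 2 is an arbitrary choice for this toy data.
groups <- cutree(hc, k = 2)
print(groups)
```

Calling plot(hc) draws the dendrogram, which can help when deciding where
to cut the tree.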

I understand that such a question requires extensive research since
the underlying data (emails in my case) may have a great influence on
the results.

I have used a Part of Speech tagger to extract nouns as features to
use as the dictionary in order to weed out trivial words.

Any feedback/link to online knowledge resources/your experience would
be greatly appreciated.

Thank you for your time.

Regards,
Harsh Singhal