[R] Clustering with R - efficient processing of large sparse data sets (text data)

Sun Sep 27 20:10:01 CEST 2009

I checked the R function hclust (hierarchical clustering), but it
looks like it requires the full lower triangle of an n x n
dissimilarity matrix as input, where n = number of observations. The
number of variables is 200.

My data set has n = 50,000 observations (keywords), and I use ad-hoc
similarity measures, not available in R, to measure keyword
similarity. Here, the vast majority of the n x n similarities are
equal to zero.

So I am looking for a clustering procedure that would accept the
following alternate input:

x1, y1, s1
x2, y2, s2
...
xk, yk, sk

where xi, yi are 2 keywords with similarity si > 0 (1 <= i <= k). This
input would contain k = 10,000 rows, which is much smaller than the
n x n = 50,000 x 50,000 elements of the similarity matrix. The hclust
function would most likely run out of memory if it were given the full
dissimilarity matrix as input.
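
To put rough numbers on that size gap, here is a back-of-the-envelope
calculation in R. It assumes 8-byte doubles for the dense entries and,
for the triplet list, two 4-byte integer keyword ids plus one 8-byte
similarity per row. The variable names are only illustrative.

```r
n <- 50000   # observations (keywords)
k <- 10000   # keyword pairs with similarity > 0

# Lower triangle, which is what a dist object would have to store.
dense_entries <- n * (n - 1) / 2
dense_gb <- dense_entries * 8 / 1e9

# Triplet list: two integer ids (4 bytes each) and one double per row.
sparse_kb <- k * (4 + 4 + 8) / 1e3

cat(sprintf("dense:  %.0f entries, about %.1f GB\n", dense_entries, dense_gb))
cat(sprintf("sparse: %.0f rows, about %.0f KB\n", k, sparse_kb))
```

Under these assumptions the dense lower triangle holds about 1.25
billion entries (roughly 10 GB), while the triplet list needs about
160 KB.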

Do you know how to use my small data input in R, instead of a very
large sparse similarity matrix? Or in SAS? I need a simple solution,
otherwise I'll just write the hierarchical clustering code myself, in
C or Perl, or use a library. It would take me 2 hours to write that
code from scratch, so I'm looking for a simple solution that will take
less than 2 hours to implement.
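
For reference, the from-scratch route can also be sketched in base R
directly on the x, y, s triplets. This is only a minimal sketch under
stated assumptions, not a drop-in replacement for hclust: it uses
single linkage (repeatedly merge the two clusters joined by the
strongest remaining similarity), it treats every pair missing from the
input as similarity 0, so unconnected keywords never merge, and the
function name single_linkage_sparse and its threshold argument are
hypothetical, not part of any R package.

```r
# Single-linkage agglomeration over a sparse list of similarities.
# x, y: character vectors of keywords; s: their similarities (> 0).
# Only pairs with s >= threshold are allowed to merge clusters.
single_linkage_sparse <- function(x, y, s, threshold = 0) {
  keys <- unique(c(x, y))
  a <- match(x, keys)
  b <- match(y, keys)
  parent <- seq_along(keys)   # union-find forest, one node per keyword

  # Return the representative of i's cluster, compressing the path.
  find <- function(i) {
    root <- i
    while (parent[root] != root) root <- parent[root]
    while (parent[i] != root) {
      nxt <- parent[i]
      parent[i] <<- root
      i <- nxt
    }
    root
  }

  # Visit pairs from most to least similar; a pair whose keywords are
  # already in the same cluster is skipped.
  ord <- order(s, decreasing = TRUE)
  ord <- ord[s[ord] >= threshold]
  merged <- logical(length(s))
  for (e in ord) {
    ra <- find(a[e])
    rb <- find(b[e])
    if (ra != rb) {
      parent[rb] <- ra
      merged[e] <- TRUE
    }
  }
  keep <- ord[merged[ord]]

  # Relabel cluster representatives as 1, 2, 3, ...
  roots <- vapply(seq_along(keys), find, numeric(1))
  list(
    merges = data.frame(x = x[keep], y = y[keep], similarity = s[keep],
                        stringsAsFactors = FALSE),
    cluster = setNames(match(roots, unique(roots)), keys)
  )
}

# Toy input in the x, y, s format (made-up keywords and similarities).
x <- c("car",  "car",     "auto",    "loan",     "bank")
y <- c("auto", "vehicle", "vehicle", "mortgage", "loan")
s <- c(0.9,    0.4,       0.7,       0.8,        0.3)

res <- single_linkage_sparse(x, y, s, threshold = 0.5)
print(res$merges)    # merge history, strongest similarity first
print(res$cluster)   # cluster label per keyword at this threshold
```

The merge history plays the role of the dendrogram: re-running with a
lower threshold gives coarser clusters. Memory use grows with k rather
than with n x n, since only the listed pairs are ever touched.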

Follow up at http://www.analyticbridge.com/group/R_Packages/forum/topics/clustering-with-r-efficient