[R] SVD for reducing dimensions
Corrin Lakeland
lakeland at cs.otago.ac.nz
Sun Nov 17 21:55:38 CET 2002
Hi all, this is probably simple and I'm just doing something stupid, sorry
about that :-)
I'm trying to map words (strings of letters) into a fairly low-dimensional space
(say 10 dimensions, though anything between about 5 and 50 would be fine), which
I will call a feature vector. The distance between two words should represent the
similarity of the contexts the words occur in, so big and little, which have very
similar contexts, should get similar representations. Basically I want to build
something like a thesaurus.
I have computed bigram counts between the n most common words, for varying
values of n between 500 and 5000. These are saved to a file which I can load
with read.table. The matrix is symmetric and far from sparse, although I can
adjust the sparseness by changing the bigram window. First question: should I
scale the counts? The angle is all that really matters; I'd like 1,1,1,2 to be
treated basically the same as 2,2,2,4, perhaps with the latter carrying more
weight when resolving discrepancies.
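What I have in mind is scaling each row to unit length, so only the direction
is kept (just a sketch; 'm' here stands for the bigram count matrix with one
row per word):

  # scale each row of the count matrix to unit length, so 1,1,1,2 and
  # 2,2,2,4 end up identical (only the angle is kept)
  row.len <- sqrt(rowSums(m^2))
  row.len[row.len == 0] <- 1                 # leave all-zero rows alone
  m.scaled <- sweep(m, 1, row.len, "/")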
Next comes the job of reducing the matrix from 500 dimensions to, say, 10. I think
the correct way of doing this is with SVD; does that sound right? At least, I have
read a paper by Schuetze which used SVD. Other algorithms (k-means, SOM) also sound
applicable, but they may balk at the amount of data or might not give me the
distance property I'm after.
However, I must be doing something stupid here, because the result I get from
SVD has n dimensions instead of k. Firstly, I don't seem to be able to use
La.svd at all, and with plain svd I'm not getting the results I expect.
> x <- read.table("bigram.500")
> xs <- La.svd(x)
Error in La.svd(x) : argument to La.svd must be numeric or complex
> xs <- svd(x)
> ncol(xs$v)
[1] 500
> nrow(xs$v)
[1] 500
> nrow(xs$u)
[1] 500
> ncol(xs$u)
[1] 500
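My best guess at the La.svd error is that read.table() returns a data frame
rather than a numeric matrix, and that svd() on a 500 x 500 matrix returns
500 x 500 u and v unless asked otherwise. So I imagine something along these
lines should work (k = 10 is just my target dimension):

  x  <- as.matrix(read.table("bigram.500"))  # La.svd/svd want a numeric matrix
  k  <- 10
  xs <- svd(x, nu = k, nv = k)               # only keep the first k singular vectors
  feat <- xs$u %*% diag(xs$d[1:k])           # feature vectors, scaled by the
                                             # singular values
  dim(feat)                                  # should be 500 by k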
Also, how should I place the million or so less common words into the space
generated by this? Running svd on the full bigram matrix sounds infeasible; it
would be a 200GB matrix, for a start. Really I just want to 'predict' their
location rather than build the classifier with a larger set.
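My vague idea, just a sketch with made-up names, is to re-use v and the singular
values from the SVD above to project each rare word's bigram counts (taken against
the same 500 common words, and scaled the same way as the rows of x) without
redoing the whole decomposition:

  # 'new.counts' is a length-500 vector of one rare word's bigram counts
  # with the 500 common words
  fold.in <- function(new.counts, xs, k = 10) {
      new.counts %*% xs$v %*% diag(1 / xs$d[1:k])   # 1 x k location
  }

Does that sound like a reasonable way to handle the rare words?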
Thank you for your time
Corrin Lakeland