[R] SVD for reducing dimensions
Corrin Lakeland
lakeland at cs.otago.ac.nz
Sun Nov 17 21:55:38 CET 2002
Hi all, this is probably simple and I'm just doing something stupid, sorry
about that :-)
I'm trying to map words (strings of letters) into a fairly low-dimensional space
(say 10 dimensions, though anything between about 5 and 50 would be fine), which
I will call a feature vector. The distance between two words should represent the
similarity of the contexts the words occur in, so big and little, which have very
similar contexts, should get similar representations. Basically I want to build
something like a thesaurus.
I have computed bigram counts between the n most common words, for varying
values of n between 500 and 5000. These are saved to a file which I can load
with read.table. The matrix is symmetric and far from sparse, although I can
adjust the sparseness by changing the bigram window. First question: should I
scale the counts? The angle is all that really matters; I'd like 1,1,1,2 to be
treated basically the same as 2,2,2,4, perhaps with the latter carrying more
weight when resolving discrepancies.
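What I have in mind is scaling each row to unit length, so only the direction
is kept (just a sketch; 'm' here stands for the bigram count matrix with one
row per word):

  # scale each row of the count matrix to unit length, so 1,1,1,2 and
  # 2,2,2,4 end up identical (only the angle is kept)
  row.len <- sqrt(rowSums(m^2))
  row.len[row.len == 0] <- 1                 # leave all-zero rows alone
  m.scaled <- sweep(m, 1, row.len, "/")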
Next comes the job of reducing the matrix from 500 dimensions to, say, 10. I think
the correct way of doing this is with SVD; does that sound right? At least, I have
read a paper by Schuetze which used SVD. Other algorithms (k-means, SOM) also sound
applicable, but they may balk at the amount of data or might not give me the
distance property I'm after.
However, I must be doing something stupid here, because the result I get from
SVD has n dimensions instead of k. Firstly, I don't seem to be able to use
La.svd at all, and with plain svd I'm not getting the results I expect.
> x <- read.table("bigram.500")
> xs <- La.svd(x)
Error in La.svd(x) : argument to La.svd must be numeric or complex
> xs <- svd(x)
> ncol(xs$v)
[1] 500
> nrow(xs$v)
[1] 500
> nrow(xs$u)
[1] 500
> ncol(xs$u)
[1] 500
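My best guess at the La.svd error is that read.table() returns a data frame
rather than a numeric matrix, and that svd() on a 500 x 500 matrix returns
500 x 500 u and v unless asked otherwise. So I imagine something along these
lines should work (k = 10 is just my target dimension):

  x  <- as.matrix(read.table("bigram.500"))  # La.svd/svd want a numeric matrix
  k  <- 10
  xs <- svd(x, nu = k, nv = k)               # only keep the first k singular vectors
  feat <- xs$u %*% diag(xs$d[1:k])           # feature vectors, scaled by the
                                             # singular values
  dim(feat)                                  # should be 500 by k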
Also, how should I place the million or so less common words into the space
generated by this? Running svd on the full bigram matrix sounds infeasible; it
would be a 200GB matrix, for a start. Really I just want to 'predict' their
location rather than build the classifier with a larger set.
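My vague idea, just a sketch with made-up names, is to re-use v and the singular
values from the SVD above to project each rare word's bigram counts (taken against
the same 500 common words, and scaled the same way as the rows of x) without
redoing the whole decomposition:

  # 'new.counts' is a length-500 vector of one rare word's bigram counts
  # with the 500 common words
  fold.in <- function(new.counts, xs, k = 10) {
      new.counts %*% xs$v %*% diag(1 / xs$d[1:k])   # 1 x k location
  }

Does that sound like a reasonable way to handle the rare words?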
Thank you for your time
Corrin Lakeland