[R] Principle components analysis on a large dataset
Prof. John C Nash
nashjc at uottawa.ca
Fri Aug 21 16:44:26 CEST 2009
The essential issue is that the matrix you need to manipulate is very
large. This is not a new problem, and about a year ago I exchanged ideas
with the Rff package developers (things have been on the back burner
since then due to recession woes and illness issues). These ideas were
based on some very small codes from my 1979 book "Compact Numerical
Methods for Computers". The book contains a code that reads a matrix
row by row from a file, builds a triangular decomposition together with
a list of the orthogonal transformations used, and then does an SVD of
the resulting triangle. Because you have far more variables than
observations, your problem would work on the transpose.

This is a whole lot different from how R users generally work, so there
are lots of interfacing and similar issues. There are also likely to be
more efficient computational methods than the one I used, but I was
working in 1974 on an HP9830 desk calculator with the matrix on punched
cards to develop this. The code is short, though, and can be written in
a fairly vectorized way in R alone, which may make the human/computer
trade-off favourable, depending on how many times you need to run such
problems.
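
To make the row-wise idea concrete, here is a minimal sketch in plain R.
It is not the code from the book: it assumes a Givens-rotation update as
the orthogonal transformation, uses a small in-memory matrix X as a
stand-in for data that would really be read from a file one variable at
a time, and the function name givens_update is made up for this example.
It keeps only the small n-by-n triangle (n = number of observations) and
does not store the rotations.

```r
# Fold one new row x into the upper-triangular factor R using Givens
# rotations, so that t(R) %*% R gains the term x %*% t(x).
givens_update <- function(R, x) {
  p <- length(x)
  for (j in seq_len(p)) {
    h <- sqrt(R[j, j]^2 + x[j]^2)
    if (h == 0) next                 # nothing to rotate in this column
    cs <- R[j, j] / h
    sn <- x[j] / h
    idx <- j:p
    Rj <- R[j, idx]
    xj <- x[idx]
    R[j, idx] <- cs * Rj + sn * xj   # rotated row of R
    x[idx]    <- cs * xj - sn * Rj   # x[j] is now zero
  }
  R
}

set.seed(1)
n <- 6; p <- 40                      # toy sizes: few observations, many variables
X <- matrix(rnorm(n * p), n, p)      # assumption: stands in for data on disk

R <- matrix(0, n, n)                 # only this n-by-n triangle is kept
for (j in seq_len(p)) {              # one variable (one row of t(X)) at a time
  v <- X[, j] - mean(X[, j])         # centre the variable
  R <- givens_update(R, v)
}

d_stream <- svd(R)$d                 # singular values from the small triangle
d_direct <- svd(scale(X, center = TRUE, scale = FALSE))$d  # in-memory check
```

The two sets of singular values should agree to rounding error, and
d_stream^2 / (n - 1) gives the component variances. With 6000
observations the triangle would be 6000 by 6000, about 288 MB if held
as a full double-precision matrix. Recovering the loadings would need
either the stored rotations or a second pass over the file.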
However, the main point is that you need to use some sort of "out of
core" (how dated that sounds!) method, which is and will remain an issue
for systems like R that work on objects in memory.
I'm willing to kibitz on such work, but it would go best if there are
3-4 folk involved to bring different skills to the table.
John Nash
Date: Thu, 20 Aug 2009 17:45:00 -0700 (PDT)
From: misha680 <mk144210 at bcm.edu>
Subject: [R] Principle components analysis on a large dataset
To: r-help at r-project.org
Dear Sirs:
Please pardon me; I am very new to R. I have been using MATLAB.
I was wondering if R would allow me to do principal components analysis
on a very large dataset.
Specifically, our dataset has 68800 variables and around 6000 observations.
Matlab gives "out of memory" errors. I have tried also doing princomp in
pieces, but this does not seem to quite work for our approach.
Anything that might help would be much appreciated, as would hearing
from anyone who has experience doing this in R.
Thank you
Misha