[R] Principal components analysis on a large dataset

Prof. John C Nash nashjc at uottawa.ca
Fri Aug 21 16:44:26 CEST 2009


The essential issue is that the matrix you need to manipulate is very 
large. This is not a new problem. About a year ago I exchanged ideas 
with the Rff package developers, though things have been on the back 
burner since then because of recession woes and illness. Those ideas 
were based on some very small codes from my 1979 book "Compact numerical 
methods for computers". The book contains a code that reads a matrix row 
by row from a file, builds a triangular decomposition along with a list 
of the orthogonal transformations used, and then does an svd of the 
triangular result. For your problem it would work on the transpose. This 
is quite different from how R users generally work, so there are plenty 
of interfacing and similar issues. There are also likely to be more 
efficient computational methods than the one I used, but I developed it 
in 1974 on an HP9830 desk calculator with the matrix on punched cards. 
The code is short and can be written in a fairly vectorized way in R 
alone, which may make the human/computer trade-off favourable, depending 
on how many times you need to run such problems.
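
As a rough illustration of the row-wise idea, here is a minimal sketch in
R. It is not the code from the book: it uses Givens rotations to fold one
row at a time into an upper-triangular factor, it does not store the
orthogonal transformations, and it assumes the data are already centered.
The function name update_triangle and the small random matrix standing in
for rows read from a file are made up for the example.

```r
# Fold one new row x into the upper-triangular factor R (p x p) with Givens
# rotations, so that t(R) %*% R equals the cross-product of all rows seen so far.
update_triangle <- function(R, x) {
  p <- length(x)
  for (j in seq_len(p)) {
    if (x[j] != 0) {
      r  <- sqrt(R[j, j]^2 + x[j]^2)
      cs <- R[j, j] / r                 # cosine of the rotation
      sn <- x[j] / r                    # sine of the rotation
      Rj <- R[j, j:p]
      xj <- x[j:p]
      R[j, j:p] <- cs * Rj + sn * xj    # updated row j of the triangle
      x[j:p]    <- -sn * Rj + cs * xj   # x[j] is now zero
    }
  }
  R
}

set.seed(1)
A <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)  # stand-in for a file of rows
R <- matrix(0, 5, 5)
for (i in seq_len(nrow(A))) {
  R <- update_triangle(R, A[i, ])   # only one row of A is needed at a time
}

sv_stream <- svd(R)$d   # singular values from the small triangle
sv_full   <- svd(A)$d   # singular values from the full matrix, for comparison
print(all.equal(sv_stream, sv_full))
```

Only the p x p triangle has to stay in memory, whatever the number of
rows. The right singular vectors of the triangle match those of the full
matrix up to sign.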

However, the main point is that you need to use some sort of "out of 
core" (how dated that sounds!) method, which is and will remain an issue 
for systems like R that work on objects in memory.
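
To make "out of core" a little more concrete, the sketch below reads a
text file a few lines at a time through an open connection and keeps only
a small observations-by-observations matrix in memory. Note that it swaps
in a simpler technique than the triangular decomposition: it accumulates a
cross-product, which squares the condition number and so is less accurate
on badly conditioned data. The file layout (one line per variable, as in
the transposed problem), the block size of 8 lines and the tiny simulated
data set are all assumptions made for the example.

```r
# Simulated stand-in for a large file: one line per variable,
# each line holding that variable's value for every observation.
set.seed(42)
n_obs <- 6
n_var <- 40
X <- matrix(rnorm(n_obs * n_var), nrow = n_obs)   # observations x variables
path <- tempfile(fileext = ".txt")
write.table(t(X), path, row.names = FALSE, col.names = FALSE)

G <- matrix(0, n_obs, n_obs)       # running n_obs x n_obs cross-product
con <- file(path, open = "r")
repeat {
  block <- readLines(con, n = 8)   # next block of variables (8 lines at most)
  if (length(block) == 0) break
  B <- matrix(scan(text = block, quiet = TRUE), ncol = n_obs, byrow = TRUE)
  B <- B - rowMeans(B)             # center each variable; a line is a whole variable
  G <- G + crossprod(B)            # add this block's contribution
}
close(con)
unlink(path)

# Eigenvalues of G / (n_obs - 1) are the variances of the principal components.
pc_var <- eigen(G, symmetric = TRUE)$values / (n_obs - 1)
print(round(pc_var[1:(n_obs - 1)], 4))
```

On this toy data the leading values of pc_var should agree with
prcomp(X)$sdev^2. The last one is zero up to rounding because centering
removes one degree of freedom.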

I'm willing to kibitz on such work, but it would go best if 3-4 folk 
were involved to bring different skills to the table.

John Nash




Message: 128
Date: Thu, 20 Aug 2009 17:45:00 -0700 (PDT)
From: misha680 <mk144210 at bcm.edu>
Subject: [R]  Principle components analysis on a large dataset
To: r-help at r-project.org
Message-ID: <25072510.post at talk.nabble.com>
Content-Type: text/plain; charset=us-ascii


Dear Sirs:

Please pardon me; I am very new to R. I have been using MATLAB.

I was wondering if R would allow me to do principal components analysis on a
very large dataset.

Specifically, our dataset has 68800 variables and around 6000 observations.
MATLAB gives "out of memory" errors. I have also tried running princomp in
pieces, but this does not seem to work for our approach.

Any help would be much appreciated, especially from anyone who has experience
doing this in R.

Thank you
Misha



