[R] Principle components analysis on a large dataset

Moshe Olshansky m_olshansky at yahoo.com
Fri Aug 21 03:13:35 CEST 2009

Hi Misha,

Since PCA is a linear procedure and you have only about 6000 observations, you do not need all 68800 variables. Any 6000 of your variables will do, provided the resulting 6000x6000 matrix is non-singular. You can choose these 6000 variables (columns) at random, hoping that the resulting matrix is non-singular (and checking that it is). Alternatively, you can try something like choosing one "nice" column, then choosing as the second the column that is most nearly orthogonal to the first (a kind of Gram-Schmidt), then choosing as the third the one most nearly orthogonal to the first two, and so on. (I am not sure how much of a problem round-off may be; try doing this in higher precision if you can.) Note that you do not need to load the entire 6000x68800 matrix into memory: you can load several thousand columns at a time, process them and discard them.
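
A rough sketch of the Gram-Schmidt-style selection, processed in chunks, might look like the following. It is a simplification: instead of searching for the most orthogonal column at each step, it accepts any column whose residual after projection onto the columns kept so far is not negligible. The names read_chunk and select_columns, the tolerance and the toy matrix sizes are all made up for illustration; in practice read_chunk would read a block of columns from disk.

```r
# Toy stand-in for the real data: 20 observations, 200 variables.
set.seed(1)
n <- 20; p <- 200
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- 2 * X[, 1]          # a redundant column that should be skipped

# Hypothetical chunk reader; here it just slices the in-memory toy matrix.
read_chunk <- function(cols) X[, cols, drop = FALSE]

select_columns <- function(n, p, chunk_size = 50, tol = 1e-6) {
  Q <- matrix(0, n, 0)        # orthonormal basis of the accepted columns
  keep <- integer(0)          # indices of the accepted columns
  for (start in seq(1, p, by = chunk_size)) {
    cols <- start:min(start + chunk_size - 1, p)
    chunk <- read_chunk(cols)
    for (j in seq_along(cols)) {
      v <- chunk[, j]
      r <- v - Q %*% crossprod(Q, v)   # remove the part already spanned by Q
      r <- r - Q %*% crossprod(Q, r)   # second pass to limit round-off
      # Accept the column only if a non-negligible residual remains.
      if (sqrt(sum(r^2)) > tol * sqrt(sum(v^2))) {
        Q <- cbind(Q, r / sqrt(sum(r^2)))
        keep <- c(keep, cols[j])
        if (length(keep) == n) return(keep)
      }
    }
  }
  keep                        # may be shorter than n if the data are rank deficient
}

keep <- select_columns(n, p)
```

Only one chunk plus the n x n basis Q is held at a time, so the full matrix never has to be in memory. If fewer than n columns come back, the data do not contain n linearly independent columns at that tolerance.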
Either way, you will end up with a 6000x6000 matrix, i.e. 36,000,000 entries, which fits in memory (about 288 MB in double precision), and you can perform the usual PCA on this matrix.
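
For the random-selection route, something along these lines should work. It is only a sketch on a small simulated matrix: the sizes, the seed and the choice of prcomp without scaling are my assumptions, not anything specific to your data.

```r
set.seed(42)
n <- 30; p <- 500                     # toy sizes; the real data are about 6000 x 68800
X <- matrix(rnorm(n * p), n, p)

# Draw n columns at random and redraw if the square submatrix is rank deficient.
repeat {
  cols <- sort(sample(p, n))
  if (qr(X[, cols])$rank == n) break
}

# The usual PCA on the reduced n x n matrix.
pca <- prcomp(X[, cols], center = TRUE, scale. = FALSE)
summary(pca)$importance[, 1:3]        # variance explained by the first three PCs

# Rough memory footprint of a 6000 x 6000 double-precision matrix, in MB.
6000^2 * 8 / 1e6                      # 288
```

Note that qr() decides the rank using a numerical tolerance, so a "non-singular" result here means non-singular up to that tolerance.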

Good luck!


P.S. I am curious to see what other people think.

--- On Fri, 21/8/09, misha680 <mk144210 at bcm.edu> wrote:

> From: misha680 <mk144210 at bcm.edu>
> Subject: [R]  Principle components analysis on a large dataset
> To: r-help at r-project.org
> Received: Friday, 21 August, 2009, 10:45 AM
> Dear Sirs:
> Please pardon me I am very new to R. I have been using
> I was wondering if R would allow me to do principal
> components analysis on a
> very large
> dataset.
> Specifically, our dataset has 68800 variables and around
> 6000 observations.
> Matlab gives "out of memory" errors. I have tried also
> doing princomp in
> pieces, but this does not seem to quite work for our
> approach.
> Anything that might help much appreciated. If anyone has
> had experience
> doing this in R much appreciated.
> Thank you
> Misha
