[R] Principal components analysis on a large dataset

misha680 mk144210 at bcm.edu
Fri Aug 21 21:20:19 CEST 2009


Hi Moshe,

Your idea sounds reasonable to me. It seems analogous to having a system of
linear equations with
more unknowns than equations - there are many solutions, so there is
no unique "exact" PCA solution.

My plan (* = dot product):
1. Pick the first "nice" vector to be the longest - that is, x1 * x1 is maximal.
2. For all second vectors x2 ~= x1, compute
(x2 * x1)^2 / (x1 * x1)
and pick the minimum as my second vector.
3. For all third vectors x3 ~= x2, x3 ~= x1, compute
(x3 * x1)^2 / (x1 * x1) + (x3 * x2)^2 / (x2 * x2)
and pick the minimum as my third vector.
4. And so on until I have 6000 vectors.
5. Perform PCA on the resulting 6000x6000 matrix.
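The selection loop above could be sketched in R roughly as follows (the function
and variable names are mine, not from the thread, and this is toy scale - for the
real 6000x68000 problem X would have to be streamed from disk in chunks):

```r
# Greedy pick of k nearly-orthogonal columns of X (steps 1-4 above).
pick_columns <- function(X, k) {
  norms2 <- colSums(X^2)
  chosen <- which.max(norms2)          # step 1: the "longest" column
  for (i in seq_len(k - 1)) {
    B <- X[, chosen, drop = FALSE]     # columns already picked
    # Squared projection of every column onto each picked column,
    # (x * b)^2 / (b * b); row m of proj2 is divided by b_m * b_m.
    proj2 <- crossprod(B, X)^2 / colSums(B^2)
    score <- colSums(proj2)            # total over the picked columns
    score[chosen] <- Inf               # never re-pick a chosen column
    chosen <- c(chosen, which.min(score))
  }
  chosen
}
```

With the chosen indices, step 5 would just be prcomp(X[, idx]) on the
6000x6000 submatrix.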

What do you think?
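For comparison, the random-column variant Moshe suggests below might look like
this in R (toy sizes stand in for 6000/68000/6000; in the real problem the
matrix would be assembled a few thousand columns at a time from disk, as he
notes):

```r
set.seed(1)
n <- 50; p <- 200; k <- 40            # stand-ins for 6000, 68000, 6000
X_full <- matrix(rnorm(n * p), n, p)  # in reality, read chunk-by-chunk
keep <- sort(sample(p, k))            # random choice of k columns
X <- X_full[, keep]

# Check that the chosen columns give a full-rank (non-singular) matrix:
stopifnot(qr(X)$rank == k)

# Usual PCA on the reduced matrix:
pr <- prcomp(X, center = TRUE, scale. = FALSE)
```

If the rank check fails, one would redraw the random columns and try again.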


Moshe Olshansky-2 wrote:
> 
> Hi Misha,
> 
> Since PCA is a linear procedure and you have only 6000 observations, you
> do not need 68000 variables. Using any 6000 of your variables so that the
> resulting 6000x6000 matrix is non-singular will do. You can choose these
> 6000 variables (columns) randomly, hoping that the resulting matrix is
> non-singular (and checking for this). Alternatively, you can try something
> like choosing one "nice" column, then choosing the second one which is the
> most orthogonal to the first one (a kind of Gram-Schmidt), then choosing the
> third one which is the most orthogonal to the first two, etc. (I am not sure
> how much roundoff may be a problem - try doing this using higher precision
> if you can). Note that you do not need to load the entire 6000x68000
> matrix into memory (you can load several thousands of columns, process
> them and discard them).
> Anyway, you will end up with a 6000x6000 matrix, i.e. 36,000,000 entries,
> which can fit into memory, and you can perform the usual PCA on this
> matrix.
> 
> Good luck!
> 
> Moshe.
> 
> P.S. I am curious to see what other people think.
> 
> --- On Fri, 21/8/09, misha680 <mk144210 at bcm.edu> wrote:
> 
>> From: misha680 <mk144210 at bcm.edu>
>> Subject: [R]  Principle components analysis on a large dataset
>> To: r-help at r-project.org
>> Received: Friday, 21 August, 2009, 10:45 AM
>> 
>> Dear Sirs:
>> 
>> Please pardon me I am very new to R. I have been using
>> MATLAB.
>> 
>> I was wondering if R would allow me to do principal
>> components analysis on a
>> very large
>> dataset.
>> 
>> Specifically, our dataset has 68800 variables and around
>> 6000 observations.
>> Matlab gives "out of memory" errors. I have tried also
>> doing princomp in
>> pieces, but this does not seem to quite work for our
>> approach.
>> 
>> Anything that might help much appreciated. If anyone has
>> had experience
>> doing this in R much appreciated.
>> 
>> Thank you
>> Misha
>> -- 
>> View this message in context:
>> http://www.nabble.com/Principle-components-analysis-on-a-large-dataset-tp25072510p25072510.html
>> Sent from the R help mailing list archive at Nabble.com.
>> 
>> ______________________________________________
>> R-help at r-project.org
>> mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained,
>> reproducible code.
>>
> 




