[R] Principal component analysis

ripley at stats.ox.ac.uk
Mon Dec 9 12:15:06 CET 2002


On Mon, 9 Dec 2002 Arne.Muller at aventis.com wrote:

> Dear R users,
>
> I'm trying to cluster 30 gene chips using principal component analysis in
> package mva.prcomp. Each chip is a point with 1,000 dimensions. PCA is
> probably just one of several methods to cluster the 30 chips. However, I
> don't know how to run prcomp, and I don't know how to interpret its output.
>
> If there are 30 data points in 1,000 dimensions each, do I have to provide
> the data in a 1,000x30 matrix or data frame (i.e. 1000 columns)?

Neither.  Use a 30x1000 matrix: one row per chip, one column per dimension.
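
A minimal sketch of getting the data into that orientation, assuming the measurements start out as a data frame with one row per dimension and one column per chip (as in the printout quoted below). The object `expr`, the simulated counts and the chip names are hypothetical stand-ins, not the poster's data.

```r
# Hypothetical stand-in for the real data: 1000 rows (dimensions) x 30 columns (chips)
set.seed(1)
expr <- as.data.frame(matrix(rpois(1000 * 30, lambda = 50),
                             nrow = 1000, ncol = 30))
names(expr) <- paste0("chip", 1:30)

# t() applied to a data frame returns a matrix:
# 30 rows (one per chip) x 1000 columns (one per dimension)
m <- t(expr)
dim(m)   # 30 1000
```

prcomp treats rows as observations and columns as variables, which is why the chips have to end up in the rows.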

> > data[1:5,1:5]
>   x.HU.04h.Ctr.118.01.4.ctrl x.HU.04h.010.118.04.4.0.1
> 1                         21                        45
> 2                         24                        35
> 3                        109                       173
> 4                         86                        99
> 5                        130                       204
>   x.HU.04h.050.118.05.4.0.5 x.HU.04h.100.118.06.4.1 x.HU.24h.Ctr.118.07.24.ctrl
> 1                        24                      28                          22
> 2                        25                      25                          20
> 3                       107                     125                          95
> 4                        72                      79                          61
> 5                       126                     166                         128
>
> > m <- t(data)
> > m[1:5,1:5]
>                              1  2   3  4   5
> x.HU.04h.Ctr.118.01.4.ctrl  21 24 109 86 130
> x.HU.04h.010.118.04.4.0.1   45 35 173 99 204
> x.HU.04h.050.118.05.4.0.5   24 25 107 72 126
> x.HU.04h.100.118.06.4.1     28 25 125 79 166
> x.HU.24h.Ctr.118.07.24.ctrl 22 20  95 61 128
>
> > pca <- prcomp(m, retx = TRUE)
>
> there are 30 "PC"s displayed (I've truncated the output). Shouldn't there be
> 1000 PCs, with the 1st PC being the most discriminative PC? In a principal

No.  970 of them span the null space: you have massive over-fitting.
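
A small illustration of this on simulated data (the matrix `m` here is random noise standing in for the real chips, so only the dimensions are meaningful). It assumes a current version of R, where prcomp lives in package stats rather than mva. Because prcomp centres the columns by default, the last of the 30 returned components should itself have essentially zero variance.

```r
# Simulated stand-in: 30 observations in 1000 dimensions
set.seed(1)
m <- matrix(rnorm(30 * 1000), nrow = 30, ncol = 1000)

pca <- prcomp(m, retx = TRUE)

dim(pca$rotation)   # 1000 30: only 30 loading vectors are returned
dim(pca$x)          # 30 30:   scores of the 30 points on those components
length(pca$sdev)    # 30

# Centring uses up one degree of freedom, so the last standard
# deviation is zero up to rounding error
pca$sdev[30]

# Proportion of the total variance carried by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var[1:5], 3)
```

All the variation in 30 points fits in the space spanned by those few directions; the remaining directions in the 1000-dimensional space carry none.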

> comp. analysis, aren't there as many PCs as dimensions? On the other hand I
> thought that PCA somehow collapses dimensionality ... . What are the PCs for
> my 30 data points? Afterwards I'd also like to display the results in a
> diagram, e.g. in 2 or 3 dimensions, to visualise clusters. I'm not sure I'm
> doing the right thing.

Well, statistically neither am I.  But mathematically at least, the PCs
for your 30 data points are the `x' component of the result, and you can
plot them via

plot(pca$x[, 1:2])

in two dimensions, or use scatterplot3d (a package) or (preferably as it
is dynamic) the ggobi or xgobi interfaces in 3D.
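
A sketch of the two-dimensional plot with the points labelled by chip, again on simulated data. The chip names are hypothetical, and labelling with text() is just one convenient choice for spotting groups by eye.

```r
# Simulated stand-in: 30 named chips in 1000 dimensions
set.seed(1)
m <- matrix(rnorm(30 * 1000), nrow = 30, ncol = 1000,
            dimnames = list(paste0("chip", 1:30), NULL))

pca <- prcomp(m, retx = TRUE)

# Scores on the first two components: note the comma,
# which selects columns 1 and 2 of the score matrix
scores <- pca$x[, 1:2]

# Empty frame first, then chip names drawn at the score positions
plot(scores, xlab = "PC1", ylab = "PC2", type = "n")
text(scores, labels = rownames(scores), cex = 0.7)
```

Taking `pca$x[, 1:3]` instead gives the three columns to hand to a 3D viewer.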

This sort of thing *is* covered in many of the texts about S (or S-PLUS or
R).

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



