[R] Several PCA questions...

Tue Jun 29 18:35:56 CEST 2004

See `cor' in ?princomp, and its references.  I meant `scale' as in ?scale.
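
To make the two pointers concrete, here is a minimal sketch on made-up data (the names `toy`, `pc_cov`, `pc_cor`, `pc_scaled` and `prop_var` are illustrative, not from the thread). It assumes two unrelated variables on very different scales and compares the default covariance-based fit with `cor = TRUE` and with scaling the data first via scale():

```r
# Made-up data: two independent variables on very different scales.
set.seed(1)
toy <- data.frame(small = rnorm(50, mean = 10,  sd = 1),
                  large = rnorm(50, mean = 100, sd = 50))

pc_cov    <- princomp(toy)              # default: covariance matrix (unscaled)
pc_cor    <- princomp(toy, cor = TRUE)  # correlation matrix
pc_scaled <- princomp(scale(toy))       # standardize the columns first

# Proportion of total variance carried by each component.
prop_var <- function(pc) pc$sdev^2 / sum(pc$sdev^2)

prop_var(pc_cov)     # first component dominated by the large-variance column
prop_var(pc_cor)     # far more balanced
prop_var(pc_scaled)  # same proportions as the cor = TRUE fit
```

In the unscaled fit the first component mostly reflects whichever column has the largest variance, which is one possible reading of the "95% of the variance" plot discussed below.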

On Tue, 29 Jun 2004, Dan Bolser wrote:

> 
> Perhaps this question is less dumb... (in context below...)
> 
> 
> On Tue, 29 Jun 2004, Prof Brian Ripley wrote:
> 
> >On Tue, 29 Jun 2004, Dan Bolser wrote:
> >
> >> Hi, I am doing PCA on several columns of data in a data.frame.
> >> 
> >> I am interested in particular rows of data which may have a particular
> >> combination of 'types' of column values (without any pre-conception of
> >> what they may be).
> >> 
> >> I do the following...
> >> 
> >> # My data table.
> >> allDat <- read.table("big_select_thresh_5", header=TRUE)
> >> 
> >> # Where some rows look like this...
> >> # PDB     SUNID1  SUNID2  AA      CH      IPCA    PCA     IBB     BB
> >> # 3sdh    14984   14985   6       10      24      24      93      116
> >> # 3hbi    14986   14987   6       10      20      22      94      117
> >> # 4sdh    14988   14989   6       10      20      20      104     122
> >> 
> >> # NB First three columns = row ID, last 6 = variables
> >> 
> >> # My columns of interest (variables), selected by name.
> >> part <- allDat[, c("AA", "CH", "IPCA", "PCA", "IBB", "BB")]
> >> 
> >> pc <- princomp(part)
> >
> >Do you really want an unscaled PCA on that data set?  Looks unlikely (but 
> >then two of the columns are constant in the sample, which is also 
> >worrying).
> 
> 
> That is just sample bias. By unscaled I assume you mean something like
> normalized?
> 
> 
> >> plot(pc)
> >> 
> >> The above plot shows that 95% of the variance is due to the first
> >> 'Component' (which I assume is AA).
> >
> >No, it is the first (principal) component.  You did ask for P>C<A!
> >
> >> i.e. all the variables behave in much the same way.
> >
> >Or you failed to scale the data so one dominates.
> 
> Yes.
> 
> I added the following to the above....
> 
> 
> x <- colMeans(part)
> partNorm <- part/x
> pc1 <- princomp(partNorm)
> 
> plot(pc1)
> 
> biplot(pc1)
> 
> Which shows two major components, and possibly a third.
> 
> What I want to know is that given my data is not uniformly distributed, is
> my normalization valid?
> 
> I know I should find this out via further investigation of PCA, but in
> general if my variables have a very skewed distribution (possibly without
> a theoretically definable mean) should I attempt to use any standard
> clustering technique?
> 
> I guess I should log transform my data.
> 
> Cheers,
> Dan.
> 
> 
> 
> 
> 
> >> I then did ...
> >> 
> >> 
> >> biplot(pc)
> >> 
> >> Which showed some outliers labelled with a numeric ID. How do I get back the
> >> three-part ID used in allDat?
> >
> >Set row names on your data frame.  Like almost all of R, it is the row 
> >names of a data frame that are used for labelling, and you did not give 
> >any so you got numbers.
> >
> >> In the above plot I saw all the variables (correctly named) pointing in
> >> more or less the same direction (as shown by the variance). I then did the
> >> following...
> >> 
> >> postscript(file="test.ps",paper="a4")
> >> 
> >> biplot(pc)
> >> 
> >> dev.off()
> >> 
> >> However, looking at test.ps shows that the arrows are missing (using
> >> ggv)... Hmmm, they come back when I pstoimg then xv... never mind.
> >
> >So ggv is unreliable, perhaps cannot cope with colours?
> >
> >> Finally, I would like to make a contour plot of the above biplot, is this
> >> possible? (or even a good way to present the data?)
> >
> >What do you propose to represent by the contours?  Biplots have a 
> >well-defined interpretation in terms of distances and angles.
> >
> >
> 
> 
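
On the `part/x` normalization quoted above: dividing a data frame by a plain vector recycles the vector down the rows (in column-major order) rather than matching one element to each column, so it does not divide every column by its own mean. A minimal sketch on made-up data (the names `toy`, `m`, `bad` and `ok` are illustrative, not from the original post), using sweep() as one way to get the per-column division:

```r
# Made-up data: two columns on different scales.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
m <- colMeans(toy)                      # a = 2.5, b = 25

# The vector m is recycled down the rows, so the divisor alternates by row.
bad <- toy / m

# sweep() applies m[j] to column j: each column is divided by its own mean.
ok <- sweep(as.matrix(toy), 2, m, "/")

colMeans(bad)  # not all equal to 1
colMeans(ok)   # each column mean is 1
```

scale(toy, center = FALSE, scale = m) should give the same values as `ok`. Whether dividing by the mean is a sensible scaling for skewed data is a separate question from whether the arithmetic does what was intended.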

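The row-name advice above can be sketched as follows. The data frame is a small stand-in: the first three rows echo the sample shown in the post, the last two rows (IDs `xxx1`, `xxx2`) are invented so that there are more rows than variables, and only two of the six variables are used. Joining the three ID columns with paste() is just one assumed way to build a unique label.

```r
# Stand-in for allDat: rows 1-3 echo the posted sample, rows 4-5 are invented.
allDat <- data.frame(
  PDB    = c("3sdh", "3hbi", "4sdh", "xxx1", "xxx2"),
  SUNID1 = c(14984, 14986, 14988, 14990, 14992),
  SUNID2 = c(14985, 14987, 14989, 14991, 14993),
  IBB    = c(93, 94, 104, 88, 110),
  BB     = c(116, 117, 122, 101, 130)
)

# Build one label per row from the three ID columns and use it as the row name.
rownames(allDat) <- paste(allDat$PDB, allDat$SUNID1, allDat$SUNID2, sep = "_")

part <- allDat[, c("IBB", "BB")]   # subsetting columns keeps the row names
pc <- princomp(part)

rownames(pc$scores)                # the composite IDs, not 1, 2, 3, ...
# biplot(pc)                       # points would be labelled with those IDs
```
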
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595