[R] Several PCA questions...
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Jun 29 18:35:56 CEST 2004
See `cor' in ?princomp, and its references. I meant `scale' as in ?scale.
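
A minimal sketch of the two routes, using made-up data (the values below are hypothetical stand-ins for the six variables; only the column names come from the quoted message). It assumes nothing beyond base R and the stats package. It also shows sweep() for dividing each column by its own mean, since dividing a data frame by a vector with `/` recycles the vector down the columns rather than across them.

```r
# Hypothetical data: values are invented, column names follow the post.
set.seed(1)
n <- 50
part <- data.frame(
  AA   = rpois(n, 6),
  CH   = rpois(n, 10),
  IPCA = rpois(n, 22),
  PCA  = rpois(n, 22),
  IBB  = rpois(n, 100),
  BB   = rpois(n, 120)
)

# Unscaled PCA: variables with the largest variance dominate.
pc_cov <- princomp(part)

# Scaled PCA, two routes.
pc_cor   <- princomp(part, cor = TRUE)  # PCA on the correlation matrix
pc_scale <- princomp(scale(part))       # centre, divide by column sd, then PCA

# Proportion of variance explained by each component.
prop_var <- function(pc) pc$sdev^2 / sum(pc$sdev^2)

# Both scaled fits give the same proportions (sdev itself differs by a
# constant factor because scale() uses divisor n - 1 and princomp uses n).
same_props <- isTRUE(all.equal(unname(prop_var(pc_cor)),
                               unname(prop_var(pc_scale))))

# Column-wise division by the column means.
x <- colMeans(part)
part_norm <- sweep(part, 2, x, "/")
```

Comparing plot(pc_cov) with plot(pc_cor) shows how much of the apparent dominance of the first component is due to scale alone.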
On Tue, 29 Jun 2004, Dan Bolser wrote:
>
> Perhaps this question is less dumb... (in context below...)
>
>
> On Tue, 29 Jun 2004, Prof Brian Ripley wrote:
>
> >On Tue, 29 Jun 2004, Dan Bolser wrote:
> >
> >> Hi, I am doing PCA on several columns of data in a data.frame.
> >>
> >> I am interested in particular rows of data which may have a particular
> >> combination of 'types' of column values (without any pre-conception of
> >> what they may be).
> >>
> >> I do the following...
> >>
> >> # My data table.
> >> allDat <- read.table("big_select_thresh_5", header=TRUE)
> >>
> >> # Where some rows look like this...
> >> # PDB   SUNID1  SUNID2  AA  CH  IPCA  PCA  IBB  BB
> >> # 3sdh  14984   14985   6   10  24    24   93   116
> >> # 3hbi  14986   14987   6   10  20    22   94   117
> >> # 4sdh  14988   14989   6   10  20    20   104  122
> >>
> >> # NB First three columns = row ID, last 6 = variables
> >>
> >> attach(allDat)
> >>
> >> # My columns of interest (variables).
> >> part <- data.frame(AA,CH,IPCA,PCA,IBB,BB)
> >>
> >> pc <- princomp(part)
> >
> >Do you really want an unscaled PCA on that data set? Looks unlikely (but
> >then two of the columns are constant in the sample, which is also
> >worrying).
>
>
> That is just sample bias. By unscaled I assume you mean something like
> normalized?
>
>
> >> plot(pc)
> >>
> >> The above plot shows that 95% of the variance is due to the first
> >> 'Component' (which I assume is AA).
> >
> >No, it is the first (principal) component. You did ask for P>C<A!
> >
> >> i.e. All the variables behave in much the same way.
> >
> >Or you failed to scale the data so one dominates.
>
> Yes.
>
> I added the following to the above....
>
>
> x <- colMeans(part)
> partNorm <- part/x
> pc1 <- princomp(partNorm)
>
> plot(pc1)
>
> biplot(pc1)
>
> Which shows two major components, and possibly a third.
>
> What I want to know is that given my data is not uniformly distributed, is
> my normalization valid?
>
> I know I should find this out via further investigation of PCA, but in
> general if my variables have a very skewed distribution (possibly without
> a theoretically definable mean) should I attempt to use any standard
> clustering technique?
>
> I guess I should log transform my data.
>
> Cheers,
> Dan.
>
>
>
>
>
> >> I then did ...
> >>
> >>
> >> biplot(pc)
> >>
> >> Which showed some outliers with a numeric ID - How do I get back my old 3
> >> part ID used in allDat?
> >
> >Set row names on your data frame. Like almost all of R, it is the row
> >names of a data frame that are used for labelling, and you did not give
> >any so you got numbers.
> >
> >> In the above plot I saw all the variables (correctly named) pointing in
> >> more or less the same direction (as shown by the variance). I then did the
> >> following...
> >>
> >> postscript(file="test.ps",paper="a4")
> >>
> >> biplot(pc)
> >>
> >> dev.off()
> >>
> >> However, looking at test.ps shows that the arrows are missing (using
> >> ggv)... Hmmm, they come back when I pstoimg then xv... never mind.
> >
> >So ggv is unreliable, perhaps cannot cope with colours?
> >
> >> Finally, I would like to make a contour plot of the above biplot, is this
> >> possible? (or even a good way to present the data?)
> >
> >What do you propose to represent by the contours? Biplots have a
> >well-defined interpretation in terms of distances and angles.
> >
> >
>
>
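
On the labelling question quoted above, a small sketch of setting row names before the PCA. The data are invented (the identifiers and values are hypothetical; only the column names come from the quoted message), and joining the three ID columns with "_" is just one possible choice.

```r
# Hypothetical table with three ID columns followed by six variables.
set.seed(2)
n <- 20
allDat <- data.frame(
  PDB    = sprintf("x%03d", seq_len(n)),  # made-up identifiers
  SUNID1 = 10000 + 2 * seq_len(n),
  SUNID2 = 10001 + 2 * seq_len(n),
  AA   = rpois(n, 6),
  CH   = rpois(n, 10),
  IPCA = rpois(n, 22),
  PCA  = rpois(n, 22),
  IBB  = rpois(n, 100),
  BB   = rpois(n, 120)
)

# Select the variables by name instead of attach()ing the data frame.
part <- allDat[, c("AA", "CH", "IPCA", "PCA", "IBB", "BB")]

# Build a unique three-part ID per row and use it as the row name.
rownames(part) <- with(allDat, paste(PDB, SUNID1, SUNID2, sep = "_"))

pc <- princomp(part, cor = TRUE)

# The scores carry the row names, which is what biplot() uses for labels.
labels_used <- rownames(pc$scores)

# biplot(pc)  # uncomment to draw; points are labelled with the IDs
```

Row names must be unique, so the pasted ID needs to identify each row.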
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595