[R] Several PCA questions...
dmb at mrc-dunn.cam.ac.uk
Tue Jun 29 17:49:13 CEST 2004
Perhaps this question is less dumb... (in context below...)
On Tue, 29 Jun 2004, Prof Brian Ripley wrote:
>On Tue, 29 Jun 2004, Dan Bolser wrote:
>> Hi, I am doing PCA on several columns of data in a data.frame.
>> I am interested in particular rows of data which may have a particular
>> combination of 'types' of column values (without any pre-conception of
>> what they may be).
>> I do the following...
>> # My data table.
>> allDat <- read.table("big_select_thresh_5", header=TRUE)
>> # Where some rows look like this...
>> # PDB SUNID1 SUNID2 AA CH IPCA PCA IBB BB
>> # 3sdh 14984 14985 6 10 24 24 93 116
>> # 3hbi 14986 14987 6 10 20 22 94 117
>> # 4sdh 14988 14989 6 10 20 20 104 122
>> # NB First three columns = row ID, last 6 = variables
>> # My columns of interest (variables).
>> part <- allDat[, c("AA","CH","IPCA","PCA","IBB","BB")]
>> pc <- princomp(part)
>Do you really want an unscaled PCA on that data set? Looks unlikely (but
>then two of the columns are constant in the sample, which is also
That is just sample bias. By unscaled I assume you mean something like
>> The above plot shows that 95% of the variance is due to the first
>> 'Component' (which I assume is AA).
>No, it is the first (principal) component. You did ask for P>C<A!
>> i.e. All the variables behave in much the same way.
>Or you failed to scale the data so one dominates.
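To illustrate the scaling point above, here is a minimal sketch comparing a covariance-based and a correlation-based PCA. The data are simulated stand-ins (the real `allDat` is not available here), and the names `size`, `pc_cov` and `pc_cor` are made up for the example. `cor = TRUE` is the `princomp()` argument that rescales each variable to unit variance; it will not work if a column is constant.

```r
# Hypothetical stand-in for the six variables; the real data are not available.
set.seed(1)
n <- 200
size <- rlnorm(n)                      # shared, skewed "size" factor (an assumption)
part <- data.frame(
  AA   = 6   * size + rnorm(n),
  CH   = 10  * size + rnorm(n),
  IPCA = 20  * size + rnorm(n, sd = 3),
  PCA  = 22  * size + rnorm(n, sd = 3),
  IBB  = 100 * size + rnorm(n, sd = 30),
  BB   = 120 * size + rnorm(n, sd = 30)
)

pc_cov <- princomp(part)               # covariance matrix: columns keep their raw scale
pc_cor <- princomp(part, cor = TRUE)   # correlation matrix: each column has unit variance

summary(pc_cov)                        # proportion of variance per component, unscaled
summary(pc_cor)                        # the same after scaling
loadings(pc_cor)                       # each component is a weighted mix of the variables
```

The loadings also show why the first component is not simply one of the original columns: it is a linear combination of all of them.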
I added the following to the above....
x <- colMeans(part)
# Divide each column by its own mean (part/x would recycle x down the rows).
partNorm <- sweep(part, 2, x, "/")
pc1 <- princomp(partNorm)
Which shows two major components, and possibly a third.
What I want to know is: given that my data are not uniformly distributed,
is my normalization valid?
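A note on the arithmetic of this kind of normalization: in R, dividing a data frame by a plain vector with `/` recycles the vector down the rows, so a vector of column means is not applied column by column. The sketch below uses a made-up two-column data frame (`d`, `m`, `by_col` and `by_col2` are illustrative names, not from the post) and shows `sweep()` as one way to divide each column by its own mean.

```r
# Toy data frame (hypothetical values) to show how `/` recycles a vector.
d <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
m <- colMeans(d)                 # one mean per column

d / m                            # m is recycled down the rows, not matched to columns
by_col <- sweep(d, 2, m, "/")    # divides column j by m[j]
colMeans(by_col)                 # every column now has mean 1

# An equivalent using scale(), which returns a matrix rather than a data frame:
by_col2 <- scale(d, center = FALSE, scale = m)
```

Whether dividing by the mean is a sensible scaling for skewed data is a separate question; `princomp(part, cor = TRUE)` is the more conventional way to put variables on a common scale.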
I know I should find this out via further investigation of PCA, but in
general if my variables have a very skewed distribution (possibly without
a theoretically definable mean) should I attempt to use any standard
I guess I should log transform my data.
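A minimal sketch of the log-transform idea, assuming all values are strictly positive. The data are simulated and the names `size`, `partLog` and `pcLog` are made up for the example; with zeros in the data something like `log1p()` would be needed instead, which changes the interpretation.

```r
# Hypothetical positive, right-skewed data standing in for the real variables.
set.seed(2)
n <- 200
size <- rlnorm(n, sdlog = 1)
part <- data.frame(
  AA = size * rlnorm(n, sdlog = 0.3),
  CH = 2  * size * rlnorm(n, sdlog = 0.3),
  BB = 20 * size * rlnorm(n, sdlog = 0.3)
)

partLog <- log(part)                      # requires strictly positive values
pcLog   <- princomp(partLog, cor = TRUE)  # PCA on the correlation matrix of the logs

summary(pcLog)
# Constant multipliers such as the 2 and 20 above become additive shifts after
# log(), so centring removes them and they do not affect the components.
```

One consequence is that dividing each column by its mean before taking logs makes no difference to the result, since that division is also just a shift on the log scale.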
>> I then did ...
>> Which showed some outliers with a numeric ID - How do I get back my old 3
>> part ID used in allDat?
>Set row names on your data frame. Like almost all of R, it is the row
>names of a data frame that are used for labelling, and you did not give
>any so you got numbers.
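A minimal sketch of the row-name suggestion, using the three example rows shown earlier in the post. Joining the three ID columns with `_` is an assumption about what a useful label looks like, and it only works if the combined IDs are unique.

```r
# Toy rows copied from the example in the post; the label format is an assumption.
allDat <- data.frame(
  PDB    = c("3sdh", "3hbi", "4sdh"),
  SUNID1 = c(14984, 14986, 14988),
  SUNID2 = c(14985, 14987, 14989),
  AA = c(6, 6, 6), CH = c(10, 10, 10),
  IPCA = c(24, 20, 20), PCA = c(24, 22, 20),
  IBB = c(93, 94, 104), BB = c(116, 117, 122)
)

# Build one label per row from the three ID columns and use it as the row name.
rownames(allDat) <- paste(allDat$PDB, allDat$SUNID1, allDat$SUNID2, sep = "_")

# Subsetting columns keeps the row names, so they travel with the variables.
part <- allDat[, c("AA", "CH", "IPCA", "PCA", "IBB", "BB")]
rownames(part)
```

On the full data set, `princomp(part)` keeps these row names on its scores, and `biplot()` uses them as point labels in place of the row numbers.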
>> In the above plot I saw all the variables (correctly named) pointing in
>> more or less the same direction (as shown by the variance). I then did the
>> However, looking at test.ps shows that the arrows are missing (using
>> ggv)... Hmmm, they come back when I pstoimg then xv... never mind.
>So ggv is unreliable, perhaps cannot cope with colours?
>> Finally, I would like to make a contour plot of the above biplot. Is this
>> possible (or even a good way to present the data)?
>What do you propose to represent by the contours? Biplots have a
>well-defined interpretation in terms of distances and angles.