[R] highly biased PCA data?
andy_liaw at merck.com
Fri Nov 5 01:53:59 CET 2004
I am no expert on this sort of matter, but that has never stopped me from
tossing in my $0.02...
As Gabor and Bert hinted, this is what I would try:
Run randomForest on the data, using sampsize=c(10, 10, 10) and
importance=TRUE, for example. Then take the few most important variables
with respect to each class and maybe do PCA on those to see if you can see
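
A rough sketch of that suggestion, on made-up data, might look like the
following. Everything about the data here is an assumption for illustration
only: the variable names v1..v8, the 1000/100/10 class sizes borrowed from
Dan's question, and the choice of two top variables per class. It assumes the
randomForest package is installed, and passes strata explicitly so that the
three sampsize values clearly refer to the three classes.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1)

# Hypothetical data: 1000 dog, 100 cat and 10 fish owners, 8 rating variables.
n   <- c(dog = 1000, cat = 100, fish = 10)
pet <- factor(rep(names(n), times = n), levels = names(n))
x   <- matrix(rnorm(sum(n) * 8), ncol = 8,
              dimnames = list(NULL, paste0("v", 1:8)))

# Make v1..v3 informative (one per class); v4..v8 stay pure noise.
x[pet == "dog",  "v1"] <- x[pet == "dog",  "v1"] + 3
x[pet == "cat",  "v2"] <- x[pet == "cat",  "v2"] + 3
x[pet == "fish", "v3"] <- x[pet == "fish", "v3"] + 3

# Draw 10 cases per class for each tree, so no class swamps the others.
rf <- randomForest(x, pet, strata = pet, sampsize = c(10, 10, 10),
                   importance = TRUE)

# With importance=TRUE the importance matrix has one column per class.
imp <- importance(rf)
top <- unique(unlist(lapply(levels(pet), function(cl) {
  names(sort(imp[, cl], decreasing = TRUE))[1:2]  # two top variables per class
})))

# PCA on the selected variables only, then plot the first two components.
pc <- prcomp(x[, top], scale. = TRUE)
plot(pc$x[, 1:2], col = as.integer(pet), pch = 20, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(pet), col = seq_along(levels(pet)), pch = 20)
```

Whether the groups actually separate in the plot depends entirely on the
data; the simulated shifts above are deliberately large.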
> From: Dan Bolser
> On Thu, 4 Nov 2004, Berton Gunter wrote:
> >1) There is no guarantee that PCA will show separate groups, of course, as
> >that is not its purpose, although it is frequently a side effect.
> >2) If you were to use a classification method of some sort
> >analysis, neural nets, SVMs, model-based classification, ...), my
> >understanding is that yes, indeed, severely unbalanced group
> >would, indeed, affect results. A guess is that Bayesian or other methods
> >that could explicitly model the prior membership probabilities would do
> >better. To make it clear why, suppose that there was a 99.9% preference of
> >"dog" and .05% each of the others. Then your dataset would have almost no
> >information on how covariates could distinguish the classes and the best
> >classifier would be to call everything a "dog" no matter what values the
> >covariates had.
> >I presume experts will have more and better to say about this.
> Sounds interesting. Thanks very much for the input. Just out of curiosity,
> given that I can make my data more uniform (less biased), how could I best
> generate a 2d plot to encapsulate the clusters (and inter cluster
> Actually I am thinking of a 2d density.
> >-- Bert Gunter
> >Genentech Non-Clinical Statistics
> >South San Francisco, CA
> >"The business of the statistician is to catalyze the scientific learning
> >process." - George E. P. Box
> >> -----Original Message-----
> >> From: r-help-bounces at stat.math.ethz.ch
> >> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dan Bolser
> >> Sent: Thursday, November 04, 2004 9:41 AM
> >> To: R mailing list
> >> Subject: [R] highly biased PCA data?
> >> Hello, supposing that I have two or three clear categories for my data,
> >> let's say pet preference across fish, cat, dog. Let's say most people rate
> >> their preference as being mostly one of the categories.
> >> I want to do PCA on the data to see three 'groups' of people, one group
> >> for fish, one for cat and one for dog. I would like to see the odd person
> >> who likes both or all three in the (appropriate) middle of the other main
> >> groups.
> >> Will my data be affected by the fact that I have interviewed 1000 dog
> >> owners, 100 cat owners and 10 fish owners? (assuming that each scale of
> >> preference has an equal range).
> >> Cheers,
> >> dan.
> >> ______________________________________________
> >> R-help at stat.math.ethz.ch mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide!
> >> http://www.R-project.org/posting-guide.html