[R] highly biased PCA data?

Sat Nov 6 00:36:54 CET 2004

I'd suggest you start by using lda() or qda() from MASS,
benefits being that

(a) if the frequencies in the sample do not reflect the frequencies
in the target population, you can set 'prior' to mirror the target
frequencies.  The question, perhaps, is whether your odd person is odd
in a 1000 dog : 100 cat : 10 fish owner population or odd in, e.g.,
a 1000:1000:50 population.  You can also vary the prior to see
what the effect is.  If, however, you set a large prior probability for
a group that is poorly represented, results will be 'noisy'.  Note
the use of 'classwt' for the prior probabilities in randomForest().
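
A minimal sketch of varying the prior in lda().  The survey data from
this thread are not available, so iris, thinned to an unbalanced
50:25:5 sample, is used as a stand-in; the equal target frequencies are
likewise an assumption made only for illustration.

```r
library(MASS)

# Stand-in data (assumption): iris thinned to 50 setosa, 25 versicolor,
# 5 virginica, to mimic an unbalanced sample.
keep <- c(1:50, 51:75, 101:105)
dat  <- iris[keep, ]

# Default: the prior is taken from the class proportions in the sample.
fit_sample <- lda(Species ~ ., data = dat)

# Alternative: set the prior to assumed target-population frequencies,
# given in the order of the factor levels.
target <- c(setosa = 1/3, versicolor = 1/3, virginica = 1/3)
fit_target <- lda(Species ~ ., data = dat, prior = target)

# Priors actually used by each fit
fit_sample$prior
fit_target$prior

# Cross-tabulate the predicted classes under the two priors
table(sample_prior = predict(fit_sample)$class,
      target_prior = predict(fit_target)$class)
```

Refitting with several different 'prior' vectors and comparing the
resulting tables is one way to see how sensitive the classification is
to the assumed population frequencies.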

(b) You can plot second versus first discriminant function scores,
to get a direct graphical representation of results.
Other discrimination techniques may have to use an ordination
technique, or even lda() or qda() on a >2 dimensional representation
of results, in order to get a scatterplot.
[cf MDSplot() for randomForest()]
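
A rough sketch of the plot described in (b), again with iris as a
stand-in dataset (an assumption; any data with three or more groups
gives at least two discriminant functions):

```r
library(MASS)

fit    <- lda(Species ~ ., data = iris)  # iris is a stand-in dataset
scores <- predict(fit)$x                 # matrix with columns LD1, LD2

# Second versus first discriminant function scores, coloured by group
plot(scores[, "LD1"], scores[, "LD2"],
     col = as.integer(iris$Species), pch = 19,
     xlab = "First discriminant function (LD1)",
     ylab = "Second discriminant function (LD2)")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```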

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 5 Nov 2004, at 10:18 PM, r-help-request at stat.math.ethz.ch wrote:

> From: Berton Gunter <gunter.berton at gene.com>
> Date: 5 November 2004 5:08:38 AM
> To: "'Dan Bolser'" <dmb at mrc-dunn.cam.ac.uk>, "'R-help'" 
> <r-help at stat.math.ethz.ch>
> Cc:
> Subject: RE: [R] highly biased PCA data?
>
> Dan:
>
> 1) There is no guarantee that PCA will show separate groups, of course,
> as that is not its purpose, although it is frequently a side effect.
>
> 2) If you were to use a classification method of some sort (discriminant
> analysis, neural nets, SVMs, model-based classification, ...), my
> understanding is that yes, severely unbalanced group membership would
> indeed affect results. A guess is that Bayesian or other methods that
> could explicitly model the prior membership probabilities would do
> better. To make it clear why, suppose that there was a 99.9% preference
> for "dog" and .05% for each of the others. Then your datasets would have
> almost no information on how covariates could distinguish the classes,
> and the best classifier would be to call everything a "dog" no matter
> what values the covariates had.
>
> I presume experts will have more and better to say about this.
>
> -- Bert Gunter
>
>
>> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dan Bolser
>> Sent: Thursday, November 04, 2004 9:41 AM
>> To: R mailing list
>> Subject: [R] highly biased PCA data?
>>
>> Hello, supposing that I have two or three clear categories
>> for my data, let's say pet preference across fish, cat, dog. Let's say
>> most people rate their preference as being mostly one of the categories.
>>
>> I want to do PCA on the data to see three 'groups' of people,
>> one group for fish, one for cat and one for dog. I would like to see
>> the odd person who likes both or all three in the (appropriate) middle
>> of the other main groups.
>>
>> Will my data be affected by the fact that I have interviewed 1000 dog
>> owners, 100 cat owners and 10 fish owners? (assuming that
>> each scale of preference has an equal range).
>>
>> Cheers,
>> dan.