[R] Animal Morphology: Deriving Classification Equation with

Sun May 24 21:07:46 CEST 2009

[Your data and output listings removed. For comments, see at end]

On 24-May-09 13:01:26, cdm wrote:
> Fellow R Users:
> I'm not extremely familiar with lda or R programming, but a recent
> editorial review of a manuscript submission has prompted a crash
> course. I am on this forum hoping I could solicit some much needed
> advice for deriving a classification equation.
> 
> I have used three basic measurements in lda to predict two groups:
> male and female. I have a working model, low Wilks' lambda, graphs,
> coefficients, eigenvalues, etc. (see below). I adjusted the sample
> analysis for Fisher's or Anderson's Iris data provided in the MASS
> library for my own data.
> 
> My final step is simply to form the classification equation.
> The classification equation is simply using standardized coefficients
> to classify each group- in this case male or female. A more thorough
> explanation is provided:
> 
> "For cases with an equal sample size for each group the classification
> function coefficient (Cj) is expressed by the following equation:
> 
> Cj = cj0+ cj1x1+ cj2x2+...+ cjpxp
> 
> where Cj is the score for the jth group, j = 1, ..., k, cj0 is the
> constant for the jth group, and x = raw scores of each predictor.
> If W = within-group variance-covariance matrix, and M = column matrix
> of means for group j, then the constant cj0 = (-1/2)CjMj" (Julia
> Barfield, John Poulsen, and Aaron French 
> http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discriminant.htm).
> 
> I am unable to navigate this last step based on the R output I have.
> I only have the linear discriminant coefficients for each predictor
> that would be needed to complete this equation.
> 
> Please, if anybody is familiar or able to help please let me know.
> There is a spot in the acknowledgments for you.
> 
> All the best,
> Chase Mendenhall

The first thing I did was to plot your data. This indicates in the
first place that a perfect discrimination can be obtained on the
basis of your variables WRMA_WT and WRMA_ID alone (names abbreviated
to WG, WT, ID, SEX):

  D0 <- read.csv("horsesLDA.csv")
  # names(D0) # "WRMA_WG"  "WRMA_WT"  "WRMA_ID"  "WRMA_SEX"
  WG<-D0$WRMA_WG; WT<-D0$WRMA_WT;
  ID<-D0$WRMA_ID; SEX<-D0$WRMA_SEX

  ix.M<-(SEX=="M"); ix.F<-(SEX=="F")

  ## Plot WT vs ID (M & F)
  plot(ID,WT,xlim=c(0,12),ylim=c(8,15))
  points(ID[ix.M],WT[ix.M],pch="+",col="blue")
  points(ID[ix.F],WT[ix.F],pch="+",col="red")
  lines(ID,15.5-1.0*(ID))

The plot also shows that there is a lot of latitude in the choice of
discriminating line: WT = 15.5 - 1.0*ID is only one of many lines
that separate the two groups.
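
The separation can also be checked numerically rather than by eye, by
tabulating sex against the side of the line on which each point falls.
A minimal sketch follows. It uses simulated stand-in values (not your
data) so that it runs on its own, and the assignment of each sex to a
side of the line is an arbitrary assumption; with the real data, WT, ID
and SEX as defined above would be used instead.

```r
## Simulated stand-in values, NOT the real measurements; which sex lies
## on which side of the line is an arbitrary choice here.
set.seed(1)
n   <- 30
SEX <- rep(c("F", "M"), each = n)
ID  <- c(runif(n, 1, 5),    runif(n, 6, 11))
WT  <- c(runif(n, 8.5, 10), runif(n, 11, 14))

## Which side of the line WT = 15.5 - 1.0*ID does each point fall on?
side <- ifelse(WT + ID > 15.5, "above", "below")
tab  <- table(SEX, side)
print(tab)

## Perfect separation: each sex occupies exactly one side, and vice versa
perfect <- all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
print(perfect)
```

Shifting the intercept 15.5 up or down and re-running the tabulation
gives a rough idea of how much room there is between the two groups.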

Also, it is apparent that the covariance between WT and ID for Females
is different from the covariance between WT and ID for Males. Hence
the assumption (of common covariance matrix in the two groups) for
standard LDA (which you have been applying) does not hold.
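
A quick informal check of this is to compute the covariance matrix of
(WT, ID) separately within each sex and compare the two. The sketch
below again runs on simulated stand-in values (not your data), which I
have deliberately generated with covariances of opposite sign in the
two groups; it is an eyeball comparison, not a formal test of equality.

```r
## Simulated stand-in values, NOT the real measurements: the WT/ID
## covariance is made positive in one group and negative in the other.
set.seed(2)
n   <- 40
idF <- rnorm(n, 3, 1); wtF <- 9  + 0.5 * (idF - 3) + rnorm(n, 0, 0.3)
idM <- rnorm(n, 8, 1); wtM <- 12 - 0.5 * (idM - 8) + rnorm(n, 0, 0.3)
SEX <- rep(c("F", "M"), each = n)
X   <- data.frame(WT = c(wtF, wtM), ID = c(idF, idM))

## One covariance matrix per sex
covs <- lapply(split(X, SEX), cov)
print(covs)

## The off-diagonal entries are the within-group WT/ID covariances
print(sapply(covs, function(S) S["WT", "ID"]))
```

If the two matrices differ markedly (as they do here by construction),
pooling them into a single within-group covariance matrix, which is
what standard LDA does, is not justified.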

Given that the sexes can be perfectly discriminated within the data
on the basis of the linear discriminator (WT + ID) (and others),
the variable WG is in effect a close approximation to noise.

However, to the extent that there was a common covariance matrix
to the two groups (in all three variables WG, WT, ID), and this
was well estimated from the data, then inclusion of the third
variable WG could yield a slightly improved discriminator in that
the probability of misclassification (a rare event for such data)
could be minimised. But it would not make much difference!

However, since that assumption does not hold, this analysis would
not be valid.

If you plot WT vs WG, a common covariance is more plausible; but
there is considerable overlap for these two variables:

  plot(WG,WT)
  points(WG[ix.M],WT[ix.M],pch="+",col="blue")
  points(WG[ix.F],WT[ix.F],pch="+",col="red")

If you plot WG vs ID, there is perhaps not much overlap, but a
considerable difference in covariance between the two groups:

  plot(ID,WG)
  points(ID[ix.M],WG[ix.M],pch="+",col="blue")
  points(ID[ix.F],WG[ix.F],pch="+",col="red")

This looks better on a log scale, however:

  lWG <- log(WG) ; lWT <- log(WT) ; lID <- log(ID)
  ## Plot log(WG) vs log(ID) (M & F)
  plot(lID,lWG)
  points(lID[ix.M],lWG[ix.M],pch="+",col="blue")
  points(lID[ix.F],lWG[ix.F],pch="+",col="red")

and common covariance still looks good for WG vs WT:

  ## Plot log(WT) vs log(WG) (M & F)
  plot(lWG,lWT)
  points(lWG[ix.M],lWT[ix.M],pch="+",col="blue")
  points(lWG[ix.F],lWT[ix.F],pch="+",col="red")

but there is no improvement for WT vs ID:

  ## Plot log(WT) vs log(ID) (M & F)
  plot(lID,lWT)
  points(lID[ix.M],lWT[ix.M],pch="+",col="blue")
  points(lID[ix.F],lWT[ix.F],pch="+",col="red")

So there is no simple road to applying a routine LDA to your data.

To take account of different covariances between the two groups,
you would normally be looking at a quadratic discriminator. However,
as indicated above, the fact that a linear discriminator using
the variables ID & WT alone works so well would leave considerable
imprecision in any conclusions drawn from the results of a quadratic one.
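
For reference, a quadratic discriminator can be fitted with qda() from
the MASS package, which estimates a separate covariance matrix for each
group. The sketch below is only an illustration on simulated stand-in
values (not your data), with group covariances that differ by
construction; the data frame D and its columns are names I have made up
for the purpose.

```r
library(MASS)

## Simulated stand-in values, NOT the real measurements
set.seed(3)
n   <- 40
idF <- rnorm(n, 3, 1); wtF <- 9  + 0.5 * (idF - 3) + rnorm(n, 0, 0.3)
idM <- rnorm(n, 8, 1); wtM <- 12 - 0.5 * (idM - 8) + rnorm(n, 0, 0.3)
D   <- data.frame(SEX = factor(rep(c("F", "M"), each = n)),
                  WT  = c(wtF, wtM),
                  ID  = c(idF, idM))

## Quadratic discriminant analysis: one covariance matrix per group
fit.q <- qda(SEX ~ WT + ID, data = D)

## Resubstitution classification (optimistic; qda(..., CV = TRUE)
## gives leave-one-out predictions instead)
pred <- predict(fit.q)$class
print(table(observed = D$SEX, predicted = pred))
```

When the groups are already cleanly separated by a straight line, many
different quadratic boundaries classify the sample equally well, which
is the imprecision referred to above.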

Sorry this is not the straightforward answer you were hoping for
(which I confess I have not sought); it is simply a reaction to
what your data say.
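
That said, if the common-covariance assumption were acceptable, the
classification functions in the formula you quote could be computed
directly from the group means and the pooled within-group covariance
matrix W. The following is a rough sketch only, on simulated stand-in
values (not your data). It assumes Cj = W^{-1} Mj, which is the usual
definition but is not stated in the passage you quote, together with
equal group sizes and equal priors.

```r
## Simulated stand-in values, NOT the real measurements
set.seed(4)
n     <- 25                              # equal group sizes
SEX   <- rep(c("F", "M"), each = n)
shift <- rep(c(0, 3), each = n)          # arbitrary group mean shift
X <- cbind(WG = 60 + shift + rnorm(2 * n),
           WT = 10 + shift + rnorm(2 * n),
           ID =  4 + shift + rnorm(2 * n))

grp <- split(as.data.frame(X), SEX)
M   <- sapply(grp, colMeans)             # p x k matrix of group means

## Pooled within-group covariance matrix W
W <- Reduce(`+`, lapply(grp, function(g) (nrow(g) - 1) * cov(g))) /
     (nrow(X) - length(grp))

Cj  <- solve(W, M)                       # column j holds cj1, ..., cjp
cj0 <- -0.5 * colSums(M * Cj)            # cj0 = (-1/2) Cj' Mj
print(rbind(constant = cj0, Cj))

## Score every case on each function; assign to the highest score
scores <- sweep(X %*% Cj, 2, cj0, "+")
colnames(scores) <- colnames(M)
pred <- colnames(scores)[max.col(scores, ties.method = "first")]
print(table(observed = SEX, predicted = pred))
```

With equal priors these assignments should agree with those from
predict() on an lda() fit of the same variables, but for your data the
caveats above about unequal covariances still apply.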

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 24-May-09                                       Time: 20:07:43
------------------------------ XFMail ------------------------------