[R] Obtaining illogical results from posterior LDA-classification because of "too good" data?

Fri Oct 30 22:33:08 CET 2009

Dear list,
my problem seems to be primarily a statistical one, but maybe I have
misspecified something in R (and hopefully there is a solution).

I have two groups with two measured variables as training data. The groups
are perfectly separated on both variables. I know that this is a very simple
situation, but the later analysis will follow the same principle (only with
more groups and more possible values). The following example should be
enough to illustrate my problem:
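library(MASS)  # provides lda()
# Column 1 is the group label; columns 2 and 3 are the two predictors.
matrix <- matrix(rep(c(0,0,0,0,0,1,1,1,1,1),3), ncol = 3, byrow = FALSE)
matrix[,2:3] <- jitter(matrix[,2:3], factor = .001)
lda <- lda(matrix[,2:3],matrix[,1], prior = c(5,5)/10)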
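
The jitter in the second line is what makes the model estimable. As a minimal sketch (the names grp, raw and res are hypothetical and not part of the code above), fitting the same model to the unjittered predictors is expected to stop with an error, since both variables are constant within each group:

```r
library(MASS)  # provides lda()

grp <- rep(c(0, 1), each = 5)
raw <- cbind(grp, grp)   # predictors equal the group label, no jitter

# lda() is expected to stop here because no variable varies within a group.
res <- try(lda(raw, grp, prior = c(5, 5) / 10), silent = TRUE)
print(inherits(res, "try-error"))
if (inherits(res, "try-error")) cat(conditionMessage(attr(res, "condition")))
```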

I added some jitter to obtain a little within-group variance; the LDA would
fail otherwise. When I try to predict the posterior probabilities for new
values, I get some strange results:
testmatrix <- matrix(c(0,0,1,1,0,1,1,0), ncol = 2, byrow = TRUE)
predict(lda,testmatrix)$posterior
> predict(lda,testmatrix)$posterior
     0 1
[1,] 1 0
[2,] 0 1
[3,] 0 1
[4,] 1 0

Rows 1 and 2 are plausible, although the probabilities should be close to 1
rather than exactly 1. But rows 3 and 4 really bother me: the posterior
probabilities should be .5 for each group. Additionally, the coefficients
seem to be way too high:
> lda[["scaling"]]
          LD1
[1,] 5835.805
[2,] 7000.393

When I insert one error per group, the results look right (jitter is not
needed in this case):
matrix <- matrix(rep(c(0,0,0,0,0,1,1,1,1,1),3), ncol = 3, byrow = FALSE)
matrix[3,2] <- 1
matrix[8,3] <- 0
lda <- lda(matrix[,2:3],matrix[,1], prior = c(5,5)/10)
predict(lda,testmatrix)$posterior
> predict(lda,testmatrix)$posterior
                0            1
[1,] 0.9996646499 0.0003353501
[2,] 0.0003353501 0.9996646499
[3,] 0.5000000000 0.5000000000
[4,] 0.5000000000 0.5000000000
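
For comparison, the same quantities can be inspected for this second data set. This is a minimal sketch; the names grp, x, fit, midpoint, scores and post are hypothetical. Here the within-group spread is the same for both predictors, so the two LD1 coefficients come out equal and the points (0,1) and (1,0) fall on the midpoint between the groups.

```r
library(MASS)  # provides lda()

grp <- rep(c(0, 1), each = 5)
x <- cbind(grp, grp)
x[3, 1] <- 1   # one deviating value in group 0, first predictor
x[8, 2] <- 0   # one deviating value in group 1, second predictor

fit <- lda(x, grp, prior = c(5, 5) / 10)
print(fit$scaling)   # both coefficients are equal up to floating point

# LD1 scores of the test points, centred at the midpoint of the group means.
testmatrix <- matrix(c(0, 0, 1, 1, 0, 1, 1, 0), ncol = 2, byrow = TRUE)
midpoint <- colMeans(fit$means)
scores <- sweep(testmatrix, 2, midpoint) %*% fit$scaling
print(scores)        # rows 3 and 4 are numerically zero

post <- predict(fit, testmatrix)$posterior
print(post)
```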

My question is now: Is my data "too good" or did I make a mistake in my
code?

Best regards,
Arne Schulz