[R] Obtaining illogical results from posterior LDA-classification because of "too good" data?

Sun Nov 1 18:58:06 CET 2009

Arne Schulz wrote:
> Dear list,
> my problem seems to be primarily a statistical one, but maybe there is a
> misspecification within R (and hopefully a solution).
> 
> I have two groups with two measured variables as training data. According to
> the variables, the groups differ totally. I know that this is a very easy
> situation, but the later analysis will use the same principle (aside from
> more groups and more possible values). The example should be enough to
> illustrate my problem:
> matrix <- matrix(rep(c(0,0,0,0,0,1,1,1,1,1),3), ncol = 3, byrow = FALSE)
> matrix[,2:3] <- jitter(matrix[,2:3], .001)
> lda <- lda(matrix[,2:3],matrix[,1], prior = c(5,5)/10)
> 
> I added some jitter to obtain a little within-group variance. The LDA would
> fail otherwise. When trying to predict the probabilities for new values, I
> get some strange results:
> testmatrix <- matrix(c(0,0,1,1,0,1,1,0), ncol = 2, byrow = TRUE)
> predict(lda,testmatrix)$posterior
>> predict(lda,testmatrix)$posterior
>      0 1
> [1,] 1 0
> [2,] 0 1
> [3,] 0 1
> [4,] 1 0
> 
> Rows 1 and 2 are about right, although the probabilities should be close to
> 1 rather than exactly 1. But rows 3 and 4 really bother me. The
> probabilities should be .5 for both groups. Additionally, the coefficients
> seem to be way too high:
>> lda[["scaling"]]
>           LD1
> [1,] 5835.805
> [2,] 7000.393
> 
> When I insert 1 error per group, the results are quite right (jitter is not
> needed in this case):
> matrix <- matrix(rep(c(0,0,0,0,0,1,1,1,1,1),3), ncol = 3, byrow = FALSE)
> matrix[3,2] <- c(1)
> matrix[8,3] <- c(0)
> lda <- lda(matrix[,2:3],matrix[,1], prior = c(5,5)/10)
> predict(lda,testmatrix)$posterior
>> predict(lda,testmatrix)$posterior
>                 0            1
> [1,] 0.9996646499 0.0003353501
> [2,] 0.0003353501 0.9996646499
> [3,] 0.5000000000 0.5000000000
> [4,] 0.5000000000 0.5000000000
> 
> 
> My question is now: Is my data "too good" or did I make a mistake in my
> code?

Your training data have a within-group variance close to 0, and hence the 
pooled variance is also almost 0.
Hence even a minimal deviation from the midpoint between the group centers 
makes the posterior almost 1 for the group in the corresponding direction.
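
To see why rows 3 and 4 do not come out at .5, here is a rough sketch in base R that reuses the LD1 coefficients reported in the question. It assumes the group means are exactly (0, 0) and (1, 1), which ignores the tiny jitter, and that LD1 is scaled to unit within-group variance. The helpers `ld` and `log_odds` are names made up for this illustration, not part of MASS.

```r
# LD1 coefficients as reported in the question (lda[["scaling"]]).
scaling <- c(5835.805, 7000.393)

# Assumption: group means are (0, 0) and (1, 1), ignoring the jitter.
mu0 <- c(0, 0)
mu1 <- c(1, 1)
center <- (mu0 + mu1) / 2   # midpoint between groups (equal priors)

# Hypothetical helper: project a point onto LD1 relative to the midpoint.
ld <- function(x) sum((x - center) * scaling)

# With unit within-group variance on LD1 and equal priors, the log-odds of
# group 1 versus group 0 is half the difference of squared distances.
log_odds <- function(x) {
  z <- ld(x)
  ((z - ld(mu0))^2 - (z - ld(mu1))^2) / 2
}

testmatrix <- matrix(c(0,0,1,1,0,1,1,0), ncol = 2, byrow = TRUE)
lo <- apply(testmatrix, 1, log_odds)
post1 <- plogis(lo)         # posterior probability of group 1
print(cbind(LD1 = apply(testmatrix, 1, ld), log_odds = lo, posterior_1 = post1))
```

Because the two estimated coefficients differ, the points (0,1) and (1,0) land roughly 582 units to either side of the midpoint on LD1, where one unit is one within-group standard deviation. That is far enough to push the posterior to 0 or 1.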

In your second example, the added errors increase the within-group variance 
by orders of magnitude, so small deviations no longer dominate the posterior.
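
The difference in variance can be checked with a small base R sketch for a single variable. `pooled_var` is a hypothetical helper written for this illustration, and the seed is arbitrary, so the exact jittered value will differ from the one behind the output in the question.

```r
# Hypothetical helper: pooled within-group variance of one variable.
pooled_var <- function(x, g) {
  ss <- tapply(x, g, function(v) sum((v - mean(v))^2))
  sum(ss) / (length(x) - length(ss))
}

g <- rep(c(0, 1), each = 5)   # group labels as in the question

set.seed(1)                   # arbitrary seed, only for reproducibility
x_jit <- jitter(g, factor = 0.001)   # first example: jittered, perfectly separated

x_err <- g
x_err[3] <- 1                 # second example: one "error" in group 0

v_jit <- pooled_var(x_jit, g)
v_err <- pooled_var(x_err, g)
print(c(jittered = v_jit, with_error = v_err, ratio = v_err / v_jit))
```

With one misplaced observation the pooled variance of that variable is 0.1, while the jittered version only has the variance of the added noise.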

Best,
Uwe Ligges

> 
> 
> Best regards,
> Arne Schulz