[R] mda and kmeans

avanisco at univ-fcomte.fr avanisco at univ-fcomte.fr
Wed Aug 15 11:24:35 CEST 2007


I am using the function mda() from the mda package to discriminate 
4 groups with 8 explanatory variables. I only have 66 observations.
I tested all possible combinations of those variables and ran a 
Mixture Discriminant Analysis for each.

For some iterations, I got an error message: "error in kmeans(xx, 
start): initial centers are not distinct".

I understand that the function kmeans(), called by mda(), randomly 
chooses the initial centers to start the clustering procedure.
As I aim to bootstrap this function and need many random selections, 
I'd like to avoid the effect of duplicated centers by keeping the 
initial centers constant.
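
To illustrate the error itself, here is a minimal sketch on made-up data (the matrix xx and the row indices below are hypothetical, not my data set). It assumes only that kmeans() accepts a matrix of starting centers, and shows that it stops as soon as two of those rows are identical:

```r
set.seed(1)                          # toy data only, not the real data set
xx <- matrix(rnorm(40), ncol = 2)    # 20 observations, 2 variables

# Three distinct rows as starting centers: the clustering runs.
ok <- kmeans(xx, centers = xx[c(1, 5, 9), ])

# The same row used twice as a center: kmeans() refuses to start.
bad <- tryCatch(kmeans(xx, centers = xx[c(1, 1, 9), ]),
                error = function(e) conditionMessage(e))
print(bad)
```

So any start matrix with repeated rows is enough to trigger the message, whatever the reason for the repetition.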

When debugging, it seems that mda() is linked with kmeans() by the 
following condition:
  if (inherits(weights, "mda")) {
        if (is.null(weights$weights))
            weights <- predict(weights, x, type = "weights", g = fg)
        else weights <- weights$weights

This condition calls mda.start() if "weights" is NULL.
kmeans() is called in mda.start() by starter(), where the arguments for 
kmeans() (xx and start) are computed.

The problem arises in the call to sample() in starter(), which randomly 
samples rows of the data set.
For example, I can obtain duplicated rows such as the following:

debug: start <- xx[sample(1:nrow(xx), size = nc), ]
debug: TT <- kmeans(xx, start)
Browse[1]> start
          etm5      etm6  elevation    slope         SI     NDVI        EVI
28   0.7746975 0.4611835 -0.5566161 1.646738  4.5260250 1.519095  0.2501180
28.1 0.7746975 0.4611835 -0.5566161 1.646738  4.5260250 1.519095  0.2501180
30.1 0.4137596 0.2615745 -0.5367707 1.889310 -0.2040883 0.824643 -0.1526292

For the sample() function, sampling without replacement seems to be the 
default. But in the case above, it actually selected the same row (28) 
twice.
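
A small check on toy data (the data frame df and the indices are made up for illustration) may help narrow this down. It shows that sample() with its default settings never repeats an index, and that row names such as 28 and 28.1 are what R produces when a data frame is resampled with replacement. This is only a guess about my case, which I have not verified in the mda code: the two rows could already be separate, identical rows in the bootstrapped data, so distinct indices would still give identical centers.

```r
set.seed(2)
df <- data.frame(a = rnorm(5), b = rnorm(5))   # hypothetical toy data

# Default sample(): without replacement, so no index appears twice.
idx <- sample(1:nrow(df), size = 3)
any(duplicated(idx))

# A bootstrap-style resample of rows (with replacement) repeats rows;
# R makes the row names unique by appending ".1", ".2", ...
boot <- df[c(2, 2, 4, 5, 4), ]
nms <- rownames(boot)
print(nms)

# Distinct positions in boot can therefore hold identical values.
any(duplicated(boot))
```

If this is what happens, sampling the starting centers from unique(xx) rather than from xx would be one possible way around it.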

So, this is still a black box for me.
Even if, as mentioned in the help page of mda(), "the 'weights' 
argument need never be accessed", do you think it is possible to avoid 
this duplicated sampling?
          Thanks in advance for your ideas,

Amelie Vaniscotte
University of Franche-comté
25000 Besançon
