[RsR] Singular covariance in plot.lmrob

Matias Salibian-Barrera m@t|@@ @end|ng |rom @t@t@ubc@c@
Fri Nov 28 18:08:51 CET 2008


Thanks Christian for reminding me of this issue. It was discussed in 
Banff last year, and it may in principle happen any time you have a 
categorical explanatory variable in your model, as the design matrix 
becomes sparse and sub-sampling search algorithms tend to produce too 
many singular subsamples of size p+1.
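
To illustrate the point about singular subsamples, here is a minimal sketch in base R. It uses simulated data (not the Auto-mpg set), and the variable names, the category probabilities, and the helper is_degenerate() are all assumptions made for the illustration. It estimates how often a random subsample of size p+1 has a singular sample covariance matrix when one of the columns is discrete.

```r
# Sketch only: simulated data with one discrete column and two continuous ones.
set.seed(1)
n <- 392
x <- cbind(cyl = sample(c(4, 6, 8), n, replace = TRUE, prob = c(0.5, 0.2, 0.3)),
           z1  = rnorm(n),
           z2  = rnorm(n))
p <- ncol(x)

# Hypothetical helper: TRUE if the sample covariance of the rows in idx
# has rank below p, i.e. the subsample lies on a lower-dimensional hyperplane.
is_degenerate <- function(idx) {
  qr(cov(x[idx, , drop = FALSE]))$rank < p
}

# Fraction of degenerate subsamples of size p + 1 among 2000 random draws.
frac <- mean(replicate(2000, is_degenerate(sample(n, p + 1))))
frac
```

A subsample is degenerate here whenever all of its p+1 rows share the same value of the discrete column, so the fraction grows quickly as the categories become more unbalanced or as more 0/1 columns are added.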

I am not sure that this can be fixed by lowering the BP in the current 
MCD algorithm. Note how your example fails with a message that 14 (out 
of 392) obs. are on a lower-dimensional hyperplane. Shouldn't we be 
considering samples of size ~ 200? I believe this error message may be 
more related to the random subsampling search than the BP of the target 
estimator. Maybe Valentin can help me understand what is happening here.

For the linear regression case, I would argue the following: since 
Mahalanobis distances can be hard to interpret for categorical 
variables, one possibility would be to simply remove these "factor" 
variables when calculating the distances for the plot. Sometimes, 
however, the user may have already "coded" the factors into columns of 0's 
and 1's (instead of using proper factor variables in the formula), which 
would be a more difficult case to protect against.
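
A rough sketch of the first idea, assuming the robustbase package is installed: the helper robust_md_numeric() below is hypothetical (it is not part of robustbase, and it is not how plot.lmrob works). It builds the model frame, keeps only the numeric predictors, and computes robust distances from covMcd() on those columns alone. The data are simulated.

```r
library(robustbase)  # provides covMcd()

# Hypothetical helper: robust Mahalanobis distances from non-factor predictors.
robust_md_numeric <- function(formula, data, alpha = 0.5) {
  mf    <- model.frame(formula, data)
  preds <- mf[, -1, drop = FALSE]            # drop the response column
  keep  <- vapply(preds, is.numeric, logical(1))  # factors are not numeric
  x     <- as.matrix(preds[, keep, drop = FALSE])
  mcd   <- covMcd(x, alpha = alpha)
  sqrt(mahalanobis(x, mcd$center, mcd$cov))
}

# Simulated example: one proper factor and two continuous predictors.
set.seed(2)
d <- data.frame(y  = rnorm(100),
                g  = factor(sample(letters[1:3], 100, replace = TRUE)),
                x1 = rnorm(100),
                x2 = rnorm(100))
rd <- robust_md_numeric(y ~ g + x1 + x2, d)
```

This only catches variables declared as factors. A column that the user has already coded as 0's and 1's, or a discrete numeric variable, passes the is.numeric() check and can still make the MCD step fail.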

For the more general multivariate location/scatter problem, I believe 
the default "failing" behaviour of the MCD algorithm may need to be 
revisited, since, as you mention, one may still want to get a (singular) 
covariance matrix estimator when half the data are lying on a 
lower-dimensional hyperplane. While we've had this conversation in the 
past, we never reached much of a consensus. Maybe it is time to try again.

Matias


Christian Hennig wrote:
> Dear list,
> 
> I have come across several situations in which the robust Mahalanobis 
> distance vs. residuals plot, the first default plot in plot.lmrob, gave 
> an error like this:
> 
> # recomputing robust Mahalanobis distances
> # The covariance matrix has become singular during
> # the iterations of the MCD algorithm.
> # There are 14 observations (in the entire dataset of 392 obs.) lying on
> # the hyperplane with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip - m_p)
> # = 0 with (m_1,...,m_p) the mean of these observations and coefficients
> # a_i from the vector a <- c(-0.0102123, 0, 0, 0, 0, -0.9999479)
> # Error in solve.default(cov, ...) :
> #   system is computationally singular: reciprocal condition number = 2.33304e-3
> 
> This particular error has been produced with the Auto-mpg dataset from
> http://archive.ics.uci.edu/ml/datasets.html
> 
> library(robustbase)  # provides lmrob()
> autod <- read.table("auto-mpg.data",col.names=c("mpg","cylinders",
>                 "displacement","horsepower","weight","acceleration",
>                 "modelyear","origin","carname"),na.strings="?")
> autoc <- autod[complete.cases(autod),]
> auto17 <- autoc[,1:7]
> rautolm <- lmrob(mpg~cylinders+displacement+horsepower+weight+acceleration+
>              modelyear,data=auto17)
> plot(rautolm)
> (I don't claim that this is the most reasonable thing to do with these 
> data because of nonlinearity, anyway...)
> 
> This problem happens easily if at least one of the variables is discrete 
> and there are several observations with the same value.
> Such a situation is by no means atypical and therefore I think that it's 
> worthwhile that something is done about this, for example checking 
> singularity internally and in that case trying a different initial 
> sample. It may 
> also make sense to give the option that the robust covariance matrix is 
> tuned down to 25% breakdown, say, because one may still want to see a 
> bit if half of the data lie on a lower dimensional hyperplane (in case 
> of a binary x-variable) but regression still makes sense.
> 
> Best regards,
> Christian
> 
> *** --- ***
> Christian Hennig
> University College London, Department of Statistical Science
> Gower St., London WC1E 6BT, phone +44 207 679 1698
> chrish using stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
> 
> _______________________________________________
> R-SIG-Robust using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-robust



