[RsR] Singular covariance in plot.lmrob

Wed Dec 3 14:27:38 CET 2008

Hi Christian, Matias,

With this example covMcd() fails before starting the actual algorithm - the
first to do is to standardize the data by (for each variable) subtracting
med and dividing by mad (actually not mad but the h-quantile) . Of course
this is not possible for h= (n+p+1)/2=199 since 204 observations in variable
'cylinders' (unfortunately most of the cars are with 4 cylinders). You could
avoid this by increasing alpha (reducing the breakdown point).

Then comes the __false__ error message that 14 observations lie on a
hyperplane - they are 204 out of 392, I am going to investigate the FORTRAN
code to find out what is happening.

But even in case of of more than h observations lying on a hyperplane
covMcd, apart from issuing an warning message, returns the MCD location and
scatter matrix (of course singular).

In ltsReg() you can specify mcd=FALSE if you do not want to investigate the
leverage information (and you do not want in this case, don't you?) and this
will save you from the failure in the plot function.

Best regards,
Valentin

On Fri, Nov 28, 2008 at 9:31 PM, Matias Salibian-Barrera <matias using stat.ubc.ca
> wrote:

>
> Good evening Christian,
>
>  yes, I remember that something like this was discussed in Banff.
>> Actually I think that there are two different issues. (snip)
>>
>
> Yes, I agree.
>
>  there is a binary variable with 80% of the data on one value, then I'd
>> rather go for a 10% breakdown point
>> and not 50%).
>>
>
> This will indeed prevent you from finding a singular covariance matrix. But
> I would go a bit further and question what information (regarding leverage)
> can be expected to be gained by keeping this variable in the design matrix.
> I am inclined to remove categorical variables from the design matrix when
> looking for high-leverage points. Having said this, however, I would like to
> hear other opinions.
>
>  But this is not the issue with the dataset cited below, where all
>> variables are quantitative but two of them are discret with a much lower
>> number of distinct values than observations so that there may be many, but
>> much fewer than 50% observations with the same x-value.
>> The problem with such data is not with the definition, but with the
>> algorithm.
>>
>
> I agree as well. And will wait for the experts' comments on this.
>
>  It should be pretty straightforward to prevent plot.lmrob from producing
>> an error in this case with the try command and just make it giving out the
>> other plots and a warning. However, I don't know whether there is a
>>
>
> Yes, I hope I find time for this in the coming weeks after our teaching
> Term ends. This will also be partially solved if we do not use binary
> variables (particularly when they result from coding categorical
> covariates).
>
> Matias
>
>
>
>
>  reliable method to construct a better initialization of the covariance
>> estimator.
>>
>> Christian
>>
>> On Fri, 28 Nov 2008, Matias Salibian-Barrera wrote:
>>
>>
>>> Thanks Christian for reminding me of this issue. It was discussed in
>>> Banff last year, and it may in principle happen any time you have a
>>> categorical explanatory variable in your model, as the design matrix becomes
>>> sparse and sub-sampling search algorithms tend to produce too many singular
>>> subsamples of size p+1.
>>>
>>> I am not sure that this can be fixed by lowering the BP in the current
>>> MCD algorithm. Note how your example fails with a message that 14 (out of
>>> 392) obs. are on a lower-dimensional hyperplane. Shouldn't we be considering
>>> samples of size ~ 200? I believe this error message may be more related to
>>> the random subsampling search than the BP of the target estimator. Maybe
>>> Valentin can help me understand what is happening here.
>>>
>>> For the linear regression case, I would argue the following: since
>>> Mahalanobis distances can be hard to interpret for categorical variables,
>>> one possibility would be to simply remove these "factor" variables when
>>> calculating the distances for the plot. Sometimes, however, the user may
>>> have already "coded" the factors into rows of 0's and 1's (instead of using
>>> proper factor variables in the formula), which would be a more difficult
>>> case to protect against.
>>>
>>> For the more general multivariate location/scatter problem, I believe the
>>> default "failing" behaviour of the MCD algorithm may need to be revisited,
>>> since, as you mention, one may still want to get a (singular) covariance
>>> matrix estimator when half the data are lying on a lower-dimensional
>>> hyperplane. While we've had this conversation in the past, we never reached
>>> much of an consensus. Maybe it is time to try again.
>>>
>>> Matias
>>>
>>>
>>> Christian Hennig wrote:
>>>
>>>> Dear list,
>>>>
>>>> I have come across several situations in which the robust Mahalanobis
>>>> distance vs. residuals plot, the first default plot in plot.lmrob, gave an
>>>> error like this:
>>>>
>>>> # recomputing robust Mahalanobis distances
>>>> # The covariance matrix has become singular during
>>>> # the iterations of the MCD algorithm.
>>>> # There are 14 observations (in the entire dataset of 392 obs.) lying on
>>>> # the hyperplane with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip - m_p)
>>>> # = 0 with (m_1,...,m_p) the mean of these observations and coefficients
>>>> # a_i from the vector a <- c(-0.0102123, 0, 0, 0, 0, -0.9999479)
>>>> # Error in solve.default(cov, ...) :
>>>> #   system is computationally singular: reciprocal condition number =
>>>> 2.33304e-3
>>>>
>>>> This particular error has been produced with the Auto-mpg dataset from
>>>> http://archive.ics.uci.edu/ml/datasets.html
>>>>
>>>> autod <- read.table("auto-mpg.data",col.names=c("mpg","cylinders",
>>>>                "displacement","horsepower","weight","acceleration",
>>>>                "modelyear","origin","carname"),na.strings="?")
>>>> autoc <- autod[complete.cases(autod),]
>>>> auto17 <- autoc[,1:7]
>>>> rautolm <-
>>>> lmrob(mpg~cylinders+displacement+horsepower+weight+acceleration+
>>>>             modelyear,data=auto17)
>>>> plot(rautolm)
>>>> (I don't claim that this is the most reasonable thing to do with these
>>>> data because of nonlinearity, anyway...)
>>>>
>>>> This problem happens easily if at least one of the variables is discrete
>>>> and there are several observations with the same value.
>>>> Such a situation is by no means atypical and therefore I think that it's
>>>> worthwhile that something is done about this, for example checking
>>>> singularity
>>>> internally and in that case trying a different initial sample. It may
>>>> also make sense to give the option that the robust covariance matrix is
>>>> tuned down to 25% breakdown, say, because one may still want to see a bit if
>>>> half of the data lie on a lower dimensional hyperplane (in case of a binary
>>>> x-variable) but regression still makes sense.
>>>>
>>>> Best regards,
>>>> Christian
>>>>
>>>> *** --- ***
>>>> Christian Hennig
>>>> University College London, Department of Statistical Science
>>>> Gower St., London WC1E 6BT, phone +44 207 679 1698
>>>> chrish using stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche<http://www.homepages.ucl.ac.uk/%7Eucakche>
>>>>
>>>> _______________________________________________
>>>> R-SIG-Robust using r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-robust
>>>>
>>>
>>> _______________________________________________
>>> R-SIG-Robust using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-robust
>>>
>>>
>> *** --- ***
>> Christian Hennig
>> University College London, Department of Statistical Science
>> Gower St., London WC1E 6BT, phone +44 207 679 1698
>> chrish using stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche<http://www.homepages.ucl.ac.uk/%7Eucakche>
>>
>
> _______________________________________________
> R-SIG-Robust using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-robust
>

	[[alternative HTML version deleted]]