[RsR] MASS::rlm() and robustbase::lmrob() are both breaking

Agostinelli Claudio claudio at dst.unive.it
Mon Sep 8 18:57:56 CEST 2008


To add noise to the discussion: I tried running the wle.lm function  
from package wle on the dataset (you can download the code and the  
R image from www.dst.unive.it/~claudio/shah.[R,Rimage]). The method  
finds two roots. Although they assign different weights to a small  
part of the observations, the roots are exactly the same except for  
one coefficient. Hence I think the right root to choose is the first  
one. Would you mind comparing this result with the ones you have from  
rlm and quantreg?
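[Editor's note: a hedged sketch of the weighted-likelihood check Claudio describes, assuming the 'long' data frame from further down in the thread; the wle.lm argument names (num.sol) and the tot.sol component are from the wle package documentation as I recall it, so treat this as illustrative, not a verified reproduction of his run:]

```r
## Sketch: weighted-likelihood estimation with multiple-root search.
## wle.lm looks for several roots of the weighted likelihood
## equations; num.sol sets how many distinct roots to search for.
library(wle)
fit <- wle.lm(da.g1 ~ -1 + f.year + major.industry + i.x +
              lta.l1 + I(lta.l1^2),
              data = long, num.sol = 2)
fit$tot.sol    # number of distinct roots actually found
summary(fit)   # coefficients and weights for each root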

Claudio


On Sep 8, 2008, at 2:06 PM, Martin Maechler wrote:

>>>>>> "GaGr" == Gabor Grothendieck <ggrothendieck using gmail.com>
>>>>>>     on Mon, 8 Sep 2008 06:55:12 -0400 writes:
>
>     GaGr> Did you actually try it?
>
> yes,
> however, with a slightly cleaned version of the data:
> I actually worked with 'd.long',
> which I produced with the following
>
> ## simplify it a bit
> d.long <- long
> rn <- abbreviate(row.names(long), 8, method="both")
> row.names(d.long) <- rn
> ## Remove the (previous) na.action -- it is not relevant here
> attr(d.long, "na.action") <- NULL
> head(d.long, 20)
> summary(d.long) # f.year has level '2001' with *no* entry
> d.long <- within(d.long,
>                  f.year <- factor(f.year))
>
> --- so the important part was removing the factor level with no entry
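[Editor's note: in current R the same cleanup can be done with droplevels(); a minimal sketch, assuming the 'long' data frame from the thread with its empty '2001' level in f.year:]

```r
## Sketch: drop the unused factor level that makes the design singular.
d.long <- long
attr(d.long, "na.action") <- NULL            # remove the stale na.action
d.long$f.year <- droplevels(d.long$f.year)   # same effect as factor(f.year)
## the refitted factor now has strictly fewer levels than before
stopifnot(nlevels(d.long$f.year) < nlevels(long$f.year))
```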
>
> I apologize for having omitted that part --
> which I'd find crucial in a pre-analysis "data step"
> but erroneously thought would not matter much here.
>
> Martin
>
>>> library(MASS)
>>> rlm(fo, data = long)
>     GaGr> Error in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
>     GaGr> 'x' is singular: singular fits are not implemented in rlm
>
>     GaGr> However, with rq:
>
>>> library(quantreg)
>>> rq(fo, data = long, method = "fn")
>     GaGr> Call:
>     GaGr> rq(formula = fo, data = long, method = "fn")
>
>     GaGr> Coefficients:
>     GaGr>                f.year2002                 f.year2003                 f.year2004
>     GaGr>                8.77440508                10.01745274                11.51253770
>     GaGr>                f.year2005                 f.year2006                 f.year2007
>     GaGr>               16.41821284                19.58352037                23.11792593
>     GaGr> major.industryDiversified  major.industryElectricity         major.industryFood
>     GaGr>                0.82687643                 0.28608694                 0.87899802
>     GaGr>   major.industryMachinery       major.industryMetals    major.industryMiscManuf
>     GaGr>                1.98080897                 5.44824581                 0.09976173
>     GaGr> major.industryNonMetalMin      major.industryServ.IT   major.industryServ.Other
>     GaGr>                2.53125111                 4.44879221                 2.19568600
>     GaGr>    major.industryTextiles  major.industryTransportEq                    i.xTRUE
>     GaGr>                0.07032491                 5.34869366                 2.94512801
>     GaGr>                    lta.l1                I(lta.l1^2)
>     GaGr>               -3.20155385                 0.25836972
>
>     GaGr> Degrees of freedom: 6849 total; 6829 residual
>
>
>     GaGr> On Mon, Sep 8, 2008 at 6:07 AM, Martin Maechler
>     GaGr> <maechler using stat.math.ethz.ch> wrote:
>>>>>>>> "GaGr" == Gabor Grothendieck <ggrothendieck using gmail.com>
>>>>>>>> on Sun, 7 Sep 2008 11:06:33 -0400 writes:
>>>
>     GaGr> Try quantreg (using the interior point method):
>>>
>     GaGr> fo <- da.g1 ~ -1 + f.year + major.industry + i.x + lta.l1 + I(lta.l1^2)
>     GaGr> library(quantreg)
>     GaGr> rq(fo, data = long, method = "fn")
>>>
>>> Well, no:
>>> Note that quantreg  (with default tau = 0.5)
>>> does so-called "L1" regression, which is a Huber M-estimate (the
>>> most-robust, least-efficient among the Hubers).
>>> BUT the influence function of such an estimate is still linear
>>> in the predictor (x \in \R^p) and hence *un*bounded (in x).
>>> In (too) short: Such estimates are robust against outliers in y
>>> but *NOT* robust against outliers in x.
>>>
>>> For this reason, more than 20 years ago, so-called
>>> bounded-influence regression estimators were invented and later
>>> complemented by high-breakdown and then MM-estimates (which
>>> combine high breakdown with high efficiency).
>>>
>>> Now to the specific case
>>> (which also contains much more about the general problem):
>>>
>     GaGr> On Sun, Sep 7, 2008 at 10:13 AM, Ajay Shah  
> <ajayshah using mayin.org> wrote:
>>>>> I have a regression where the lm() goes through fine. This mailing
>>>>> list has always encouraged me to worry about how a robust  
>>>>> regression
>>>>> might do things differently. I tried two approaches and both don't
>>>>> work.
>>>>>
>>>>> First I need to give you the dataset:
>>>>>
>>>>>> load(url("http://www.mayin.org/ajayshah/tmp/long.rda"))
>>>>>
>>>>> This gives you a data frame named "long". Here's the simple lm():
>>>>>
>>>>>> summary(lm(da.g1 ~ -1 + f.year +
>>>>> +            major.industry +
>>>>> +            i.x +
>>>>> +            lta.l1 + I(lta.l1^2), data=long))
>>>>>
>>>
>>> [............]
>>>
>>>>> In this, f.year is a factor and major.industry is a factor.
>>>>> i.x is a boolean. lta.l1 is a real number. The left hand side
>>>>> variable (da.g1) is a real number. I put a -1 on the regression
>>>>> to make space for the dummy variables.
>>>>>
>>>>> On to my woes with robust regressions. MASS::rlm() breaks:
>>>>>
>>>>>> library(MASS)
>>>>>> summary(rlm(da.g1 ~ -1 + f.year +
>>>>> +            major.industry +
>>>>> +            i.x +
>>>>> +            lta.l1 + I(lta.l1^2), method="MM", data=long))
>>>>> Error in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
>>>>> 'x' is singular: singular fits are not implemented in rlm
>>>
>>> If you do *NOT* use  method="MM", i.e. use the default,
>>> then  MASS::rlm() works
>>> *and* is (in my biased view) to be preferred over L1 (quantreg),
>>> even though the fits will probably be quite similar here.
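[Editor's note: Martin's suggestion, spelled out. The default method in rlm() is a Huber M-estimate computed by IWLS, which needs no resampling-based high-breakdown initial fit and so does not hit the singular-subsample problem. A minimal sketch, assuming the formula 'fo' from the thread and a cleaned data frame 'd.long' with the empty f.year level dropped:]

```r
library(MASS)
## Default method = "M" with the Huber psi: iterated weighted least
## squares, no high-breakdown initial estimate, hence no singular
## elemental subsamples even with many factor dummies.
fo  <- da.g1 ~ -1 + f.year + major.industry + i.x + lta.l1 + I(lta.l1^2)
fit <- rlm(fo, data = d.long)   # works where method = "MM" fails
summary(fit)
```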
>>>
>>>>> robustbase::lmrob() breaks:
>>>>>
>>>>>> library(robustbase)
>>>>>> summary(lmrob(da.g1 ~ -1 + f.year +
>>>>> +            major.industry +
>>>>> +            i.x +
>>>>> +            lta.l1 + I(lta.l1^2), data=long))
>>>>>
>>>>> Too many singular resamples
>>>>> Aborting fast_s_w_mem()
>>>>>
>>>>> Error in lmrob.S(x = x, y = y, control = control) :
>>>>> C function R_lmrob_S() exited prematurely
>>>
>>> Yes; the reason for robustbase::lmrob's failure is basically the
>>> same as for MASS::rlm(., method="MM")'s:
>>>
>>> The MM-methods of both require a high-breakdown initial estimate,
>>> *and* the high-breakdown initial estimates are based on algorithms
>>> that use resampling methods which are not at all adapted to this
>>> situation (and built on a theory of affine-invariance which does
>>> *not* apply in the case of factor predictors...)
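[Editor's note: the failure mode can be made concrete with a toy simulation. With a rare factor level, an elemental subsample of p rows (as drawn by fast-S-type algorithms) usually misses every observation carrying that level, so the p x p submatrix is singular. The setup below is illustrative only, not the thread's data:]

```r
## Toy demonstration: rare dummy column => mostly singular subsamples.
set.seed(1)
n <- 200
d <- data.frame(f = factor(c(rep("a", 195), rep("b", 5))))
x <- model.matrix(~ f, d)          # columns: intercept, dummy for "b"
p <- ncol(x)                       # p = 2
singular <- replicate(1000, {
  idx <- sample(n, p)              # one elemental subsample
  abs(det(x[idx, ])) < 1e-12       # singular iff no "b" row was drawn
})
## fraction of singular subsamples: about 0.95, since the chance that
## both drawn rows are level "a" is choose(195,2)/choose(200,2) ~ 0.95
mean(singular)
```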
>>>
>>> Within the small community of experts this is well-known.
>>> For this reason,  Maronna & Yohai (1999) proposed the so-called
>>> "S/M - estimate" which is supposedly implemented in Insightful's
>>> "Robust library" and consequently in the 'robust' R package.
>>> The 'robust.pdf' manual (of Insightful's original) says this on
>>> page 35:
>>>
>>>> The Robust Library deals with this problem by using a new
>>>> alternate S/M-estimate computing algorithm due to Maronna and
>>>> Yohai (1999) for robustly fitting linear models which contain
>>>> factor variables with possibly many levels. S-PLUS automatically
>>>> uses the new S/M-estimate algorithm whenever your linear model
>>>> data contains at least one factor variable. For further
>>>> information see the Theoretical Details section Alternating S
>>>> and M Initial Estimate.
>>>
>>> *AND* at last year's "robust stats in R" meeting in Banff,
>>> we decided that this algorithm should really become an
>>> option in 'robustbase' ((but it seems we didn't assign that
>>> work well enough there ... ;-))
>>>
>>> Anyway, I've tried
>>>
>>> library(robust)
>>> mLMR. <- lmRob(fo, data = long)
>>>
>>> and unfortunately, that failed too, even though I'd have
>>> expected it to work thanks to its use of the S/M algorithm;
>>> but *ALAS* it runs into problems as well.
>>>
>>> I'm interested in more feedback, particularly from other experts
>>> about S/M and/or its implementation in 'robust'.
>>>
>>>
>>>>> If you could guide me on what I'm doing wrong, that'll be great.
>>>>> How would I do the above specification as a robust regression?
>>>>> I googled around and found a few others asking these same
>>>>> questions in the past, but it didn't look like there was a
>>>>> clear answer.
>>>
>>> In short: The  "MM"  method hasn't been well implemented for
>>> this situation,
>>> and you need to stick with the default
>>>
>>> rlm( )  for now.
>>>
>>> Martin Maechler, ETH Zurich
>>>
>>>>>
>>>>> --
>>>>> Ajay Shah                          http://www.mayin.org/ajayshah
>>>>> ajayshah using mayin.org           http://ajayshahblog.blogspot.com
>>>
>
>     GaGr> _______________________________________________
>     GaGr> R-SIG-Robust using r-project.org mailing list
>     GaGr> https://stat.ethz.ch/mailman/listinfo/r-sig-robust
>

------------------------------------------------------------
Claudio Agostinelli
Dipartimento di Statistica
Universita' Ca' Foscari di Venezia
San Giobbe, Cannaregio 873
30121 Venezia
Tel: 041 2347446, Fax: 041 2347444
email: claudio using unive.it, www: www.dst.unive.it/~claudio
skype: claudio.agostinelli
------------------------------------------------------------
Visit the website of Pino Masciari: www.pinomasciari.org
------------------------------------------------------------

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html



