[RsR] MASS:rlm() and robustbase::lmrob() are both breaking

Mon Sep 8 12:55:12 CEST 2008

Did you actually try it?

> library(MASS)
> rlm(fo, data = long)
Error in rlm.default(x, y, weights, method = method, wt.method = wt.method,  :
  'x' is singular: singular fits are not implemented in rlm

However, with rq:

> library(quantreg)
> rq(fo, data = long, method = "fn")
Call:
rq(formula = fo, data = long, method = "fn")

Coefficients:
               f.year2002                f.year2003                f.year2004
               8.77440508               10.01745274               11.51253770
               f.year2005                f.year2006                f.year2007
              16.41821284               19.58352037               23.11792593
major.industryDiversified major.industryElectricity        major.industryFood
               0.82687643                0.28608694                0.87899802
  major.industryMachinery      major.industryMetals   major.industryMiscManuf
               1.98080897                5.44824581                0.09976173
major.industryNonMetalMin     major.industryServ.IT  major.industryServ.Other
               2.53125111                4.44879221                2.19568600
   major.industryTextiles major.industryTransportEq                   i.xTRUE
               0.07032491                5.34869366                2.94512801
                   lta.l1               I(lta.l1^2)
              -3.20155385                0.25836972

Degrees of freedom: 6849 total; 6829 residual

On Mon, Sep 8, 2008 at 6:07 AM, Martin Maechler
<maechler using stat.math.ethz.ch> wrote:
>>>>>> "GaGr" == Gabor Grothendieck <ggrothendieck using gmail.com>
>>>>>>     on Sun, 7 Sep 2008 11:06:33 -0400 writes:
>
>    GaGr> Try quantreg (using the interior point method):
>
>    GaGr> fo <- da.g1 ~ -1 + f.year + major.industry + i.x + lta.l1 + I(lta.l1^2)
>    GaGr> library(quantreg)
>    GaGr> rq(fo, data = long, method = "fn")
>
> Well, no:
> Note that quantreg  (with default tau = 0.5)
> does so-called "L1" regression, which is a Huber M-estimate (the
> most-robust, least-efficient among the Hubers).
> BUT the influcence function of such an estimate is still linear
> in the predictor (x \in \R^p) and hence *un*bounded (in x).
> In (too) short: Such estimates are robust against outliers in y
> but *NOT* robust against outliers in x.
>
> For this reason, more than 20 years ago, so called bounded
> influence regression estimators were invented and later
> complemented by high-breakdown and then MM-estimates (which
> combine high breakdown with high efficiency).
>
> Now to the specific case
> (which also contains much more about the general problem):
>
>    GaGr> On Sun, Sep 7, 2008 at 10:13 AM, Ajay Shah <ajayshah using mayin.org> wrote:
>    >> I have a regression where the lm() goes through fine. This mailing
>    >> list has always encouraged me to worry about how a robust regression
>    >> might do things differently. I tried two approaches and both don't
>    >> work.
>    >>
>    >> First I need to give you the dataset:
>    >>
>    >>> load(url("http://www.mayin.org/ajayshah/tmp/long.rda"))
>    >>
>    >> This gives you a data frame named "long". Here's the simple lm():
>    >>
>    >>> summary(lm(da.g1 ~ -1 + f.year +
>    >> +            major.industry +
>    >> +            i.x +
>    >> +            lta.l1 + I(lta.l1^2), data=long))
>    >>
>
> [............]
>
>    >> In this, f.year is a factor and major.industry is a factor. i.x is a
>    >> boolean. lta.l1 is a real number. The left hand side variable (da.g1)
>    >> is a real number. I put a -1 on the regression to make space for the
>    >> dummy variables.
>    >>
>    >> On to my woes with robust regressions. MASS:rlm() breaks:
>    >>
>    >>> library(MASS)
>    >>> summary(rlm(da.g1 ~ -1 + f.year +
>    >> +            major.industry +
>    >> +            i.x +
>    >> +            lta.l1 + I(lta.l1^2), method="MM", data=long))
>    >> Error in rlm.default(x, y, weights, method = method, wt.method = wt.method,  :
>    >> 'x' is singular: singular fits are not implemented in rlm
>
> If you do *NOT* use  method="MM", i.e. use the default,
> then  MASS::rlm() works
> *and* is (in my biased view) to be preferred over  L1(quantreg),
> even though the fits will probably be quite similar here.
>
>    >> robustbase::lmrob() breaks:
>    >>
>    >>> library(robustbase)
>    >>> summary(lmrob(da.g1 ~ -1 + f.year +
>    >> +            major.industry +
>    >> +            i.x +
>    >> +            lta.l1 + I(lta.l1^2), data=long))
>    >>
>    >> Too many singular resamples
>    >> Aborting fast_s_w_mem()
>    >>
>    >> Error in lmrob.S(x = x, y = y, control = control) :
>    >> C function R_lmrob_S() exited prematurely
>
> Yes; the reason for robustbase::lmrob's failure is basically the
> same as for  MASS::rlm(., method="MM")'s one:
>
> The MM-method of both require a high-breakdown initial estimate,
> *and* the high-breakdown initial estimates are based algorithms
> that use resampling methods which are not at all adapted to this
> (and based on a theory of affine-invariance which does *not*
>  apply in the case of factor-predictors...)
>
> Within the small community of experts this is well-known.
> For this reason,  Maronna & Yohai (1999) proposed the so-called
> "S/M - estimate" which is supposedly implemented in Insightful's
> "Robust library" and consequently in the 'robust' R package.
> The 'robust.pdf' manual (of Insightful's original) says this on
> page 35:
>
>>   The Robust Library deals with this problem by using a new
>>   alternate S/M-estimate computing algorithm due to Maronna and
>>   Yohai (1999) for robustly fitting linear models which contain factor
>>   variables with possibly many levels. S-PLUS automatically uses the
>>   new S/M-estimate algorithm whenever your linear model data
>>   contains at least one factor variable. For further information see the
>>   Theoretical Details section Alternating S and M Initial Estimate.
>
> *AND* in last year's "robust stats in R" meeting in Banff,
> we had decided that this algorithm should really become an
> option in 'robustbase' ((but it seems we hadn't made good
> enough work assignments there ... ;-))
>
> Anyway, I've tried
>
>  library(robust)
>  mLMR. <- lmRob(fo, data = long)
>
> and unfortunately, that failed too, even though I'd have
> expected it would work thanks to using the S/M algorithm
> but *ALAS* it gets into problems as well..
>
> I'm interested in more feedback, particularly from other experts
> about S/M and/or its implementation in 'robust'.
>
>
>    >> If you could guide me on what I'm doing wrong, that'll be great. How
>    >> would I do the above specification as a robust regression? I googled
>    >> around and I found a few others asking these same questions in the
>    >> past, but it didn't look like there was a clear answer.
>
> In short: The  "MM"  method hasn't been well implemented for
>          this situation,
> and you need to stick with the default
>
> rlm( )  for now.
>
> Martin Maechler, ETH Zurich
>
>    >>
>    >> --
>    >> Ajay Shah                                      http://www.mayin.org/ajayshah
>    >> ajayshah using mayin.org                             http://ajayshahblog.blogspot.com
>