[RsR] MASS::rlm() and robustbase::lmrob() are both breaking
Martin Maechler
maechler at stat.math.ethz.ch
Mon Sep 8 14:06:11 CEST 2008
>>>>> "GaGr" == Gabor Grothendieck <ggrothendieck using gmail.com>
>>>>> on Mon, 8 Sep 2008 06:55:12 -0400 writes:
GaGr> Did you actually try it?
yes --
though with a slightly cleaned version of the data:
I actually worked with 'd.long',
which I produced with the following:
## simplify it a bit
d.long <- long
rn <- abbreviate(row.names(long), 8, method="both")
row.names(d.long) <- rn
## Remove the (previous) na.action -- it is not relevant here
attr(d.long, "na.action") <- NULL
head(d.long, 20)
summary(d.long) # f.year has level '2001' with *no* entry
d.long <- within(d.long,
f.year <- factor(f.year))
--- so the important part was removing the factor level with no entry.
I apologize for having omitted that part,
which I'd find crucial in a pre-analysis "data step"
but erroneously thought not to matter much here.
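The cleaning step matters because an empty factor level produces an
all-zero column in the (no-intercept) design matrix, hence a singular
fit. A minimal illustration with a toy factor (not the thread's data):

```r
## Toy factor with an empty level '2001', as in summary(d.long):
f <- factor(c("2002", "2003", "2004"),
            levels = c("2001", "2002", "2003", "2004"))
X <- model.matrix(~ -1 + f)       # column 'f2001' is all zeros
qr(X)$rank                        # 3 < ncol(X) = 4 : rank-deficient
f2 <- factor(f)                   # re-factoring drops the empty level
qr(model.matrix(~ -1 + f2))$rank  # equals ncol : full rank again
```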
Martin
>> library(MASS)
>> rlm(fo, data = long)
GaGr> Error in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
GaGr> 'x' is singular: singular fits are not implemented in rlm
GaGr> However, with rq:
>> library(quantreg)
>> rq(fo, data = long, method = "fn")
GaGr> Call:
GaGr> rq(formula = fo, data = long, method = "fn")
GaGr> Coefficients:
GaGr> f.year2002 f.year2003 f.year2004
GaGr> 8.77440508 10.01745274 11.51253770
GaGr> f.year2005 f.year2006 f.year2007
GaGr> 16.41821284 19.58352037 23.11792593
GaGr> major.industryDiversified major.industryElectricity major.industryFood
GaGr> 0.82687643 0.28608694 0.87899802
GaGr> major.industryMachinery major.industryMetals major.industryMiscManuf
GaGr> 1.98080897 5.44824581 0.09976173
GaGr> major.industryNonMetalMin major.industryServ.IT major.industryServ.Other
GaGr> 2.53125111 4.44879221 2.19568600
GaGr> major.industryTextiles major.industryTransportEq i.xTRUE
GaGr> 0.07032491 5.34869366 2.94512801
GaGr> lta.l1 I(lta.l1^2)
GaGr> -3.20155385 0.25836972
GaGr> Degrees of freedom: 6849 total; 6829 residual
GaGr> On Mon, Sep 8, 2008 at 6:07 AM, Martin Maechler
GaGr> <maechler using stat.math.ethz.ch> wrote:
>>>>>>> "GaGr" == Gabor Grothendieck <ggrothendieck using gmail.com>
>>>>>>> on Sun, 7 Sep 2008 11:06:33 -0400 writes:
>>
GaGr> Try quantreg (using the interior point method):
>>
GaGr> fo <- da.g1 ~ -1 + f.year + major.industry + i.x + lta.l1 + I(lta.l1^2)
GaGr> library(quantreg)
GaGr> rq(fo, data = long, method = "fn")
>>
>> Well, no:
>> Note that quantreg (with default tau = 0.5)
>> does so-called "L1" regression, which is a Huber M-estimate (the
>> most-robust, least-efficient among the Hubers).
>> BUT the influence function of such an estimate is still linear
>> in the predictor (x \in \R^p) and hence *un*bounded (in x).
>> In (too) short: Such estimates are robust against outliers in y
>> but *NOT* robust against outliers in x.
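>> To see the difference in practice, here is a small simulated sketch
>> (toy data, not the thread's dataset) where a single high-leverage
>> point drags the L1 fit but not a bounded-influence MM-fit:

```r
library(MASS)       # rlm()
library(quantreg)   # rq()
set.seed(1)
x <- c(rnorm(50), 100)              # one gross outlier in x (leverage)
y <- c(2 * x[1:50] + rnorm(50), 0)  # ... paired with a wildly wrong y
coef(rq(y ~ x))                  # L1: slope dragged far from 2
coef(rlm(y ~ x, method = "MM"))  # MM: slope stays close to 2
```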
>>
>> For this reason, more than 20 years ago, so called bounded
>> influence regression estimators were invented and later
>> complemented by high-breakdown and then MM-estimates (which
>> combine high breakdown with high efficiency).
>>
>> Now to the specific case
>> (which also contains much more about the general problem):
>>
GaGr> On Sun, Sep 7, 2008 at 10:13 AM, Ajay Shah <ajayshah using mayin.org> wrote:
>> >> I have a regression where the lm() goes through fine. This mailing
>> >> list has always encouraged me to worry about how a robust regression
>> >> might do things differently. I tried two approaches and both don't
>> >> work.
>> >>
>> >> First I need to give you the dataset:
>> >>
>> >>> load(url("http://www.mayin.org/ajayshah/tmp/long.rda"))
>> >>
>> >> This gives you a data frame named "long". Here's the simple lm():
>> >>
>> >>> summary(lm(da.g1 ~ -1 + f.year +
>> >> + major.industry +
>> >> + i.x +
>> >> + lta.l1 + I(lta.l1^2), data=long))
>> >>
>>
>> [............]
>>
>> >> In this, f.year is a factor and major.industry is a factor. i.x is a
>> >> boolean. lta.l1 is a real number. The left hand side variable (da.g1)
>> >> is a real number. I put a -1 on the regression to make space for the
>> >> dummy variables.
>> >>
>> >> On to my woes with robust regressions. MASS::rlm() breaks:
>> >>
>> >>> library(MASS)
>> >>> summary(rlm(da.g1 ~ -1 + f.year +
>> >> + major.industry +
>> >> + i.x +
>> >> + lta.l1 + I(lta.l1^2), method="MM", data=long))
>> >> Error in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
>> >> 'x' is singular: singular fits are not implemented in rlm
>>
>> If you do *NOT* use method="MM", i.e. use the default,
>> then MASS::rlm() works
>> *and* is (in my biased view) to be preferred over L1(quantreg),
>> even though the fits will probably be quite similar here.
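>> Concretely, a minimal sketch of that default call, assuming the
>> formula 'fo' and the cleaned data frame 'd.long' from above:

```r
library(MASS)
fo <- da.g1 ~ -1 + f.year + major.industry + i.x + lta.l1 + I(lta.l1^2)
## default method = "M" with Huber psi -- no resampling involved,
## so no singularity problems from the factor columns:
summary(rlm(fo, data = d.long))
```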
>>
>> >> robustbase::lmrob() breaks:
>> >>
>> >>> library(robustbase)
>> >>> summary(lmrob(da.g1 ~ -1 + f.year +
>> >> + major.industry +
>> >> + i.x +
>> >> + lta.l1 + I(lta.l1^2), data=long))
>> >>
>> >> Too many singular resamples
>> >> Aborting fast_s_w_mem()
>> >>
>> >> Error in lmrob.S(x = x, y = y, control = control) :
>> >> C function R_lmrob_S() exited prematurely
>>
>> Yes; the reason for robustbase::lmrob's failure is basically the
>> same as for MASS::rlm(., method="MM")'s:
>>
>> The MM-method in both requires a high-breakdown initial estimate,
>> *and* the high-breakdown initial estimates are based on algorithms
>> that use resampling methods which are not at all adapted to this
>> situation (and on a theory of affine invariance which does *not*
>> apply in the case of factor predictors...)
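>> How badly resampling fails with many-level factors can be seen in a
>> small simulation (toy data, not the thread's): an elemental
>> subsample of size p must contain every factor level to be
>> non-singular, which almost never happens:

```r
set.seed(2)
f <- factor(sample(letters[1:15], 200, replace = TRUE))
X <- model.matrix(~ f)                 # p = 15 columns
p <- ncol(X)
singular <- replicate(1000, {
    idx <- sample(nrow(X), p)          # elemental subsample of p rows
    qr(X[idx, ])$rank < p              # misses a level => rank-deficient
})
mean(singular)   # essentially 1: almost every subsample is singular
```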
>>
>> Within the small community of experts this is well-known.
>> For this reason, Maronna & Yohai (1999) proposed the so-called
>> "S/M - estimate" which is supposedly implemented in Insightful's
>> "Robust library" and consequently in the 'robust' R package.
>> The 'robust.pdf' manual (of Insightful's original) says this on
>> page 35:
>>
>>> The Robust Library deals with this problem by using a new
>>> alternate S/M-estimate computing algorithm due to Maronna and
>>> Yohai (1999) for robustly fitting linear models which contain factor
>>> variables with possibly many levels. S-PLUS automatically uses the
>>> new S/M-estimate algorithm whenever your linear model data
>>> contains at least one factor variable. For further information see the
>>> Theoretical Details section Alternating S and M Initial Estimate.
>>
>> *AND* in last year's "robust stats in R" meeting in Banff,
>> we had decided that this algorithm should really become an
>> option in 'robustbase' ((but it seems we hadn't made good
>> enough work assignments there ... ;-))
>>
>> Anyway, I've tried
>>
>> library(robust)
>> mLMR. <- lmRob(fo, data = long)
>>
>> and unfortunately, that failed too: even though I'd have
>> expected it to work thanks to using the S/M algorithm,
>> *ALAS* it gets into problems as well...
>>
>> I'm interested in more feedback, particularly from other experts
>> about S/M and/or its implementation in 'robust'.
>>
>>
>> >> If you could guide me on what I'm doing wrong, that'll be great. How
>> >> would I do the above specification as a robust regression? I googled
>> >> around and I found a few others asking these same questions in the
>> >> past, but it didn't look like there was a clear answer.
>>
>> In short: the "MM" method hasn't been well implemented for
>> this situation, so you need to stick with the default
>> rlm() for now.
>>
>> Martin Maechler, ETH Zurich
>>
>> >>
>> >> --
>> >> Ajay Shah http://www.mayin.org/ajayshah
>> >> ajayshah using mayin.org http://ajayshahblog.blogspot.com
>>
GaGr> _______________________________________________
GaGr> R-SIG-Robust using r-project.org mailing list
GaGr> https://stat.ethz.ch/mailman/listinfo/r-sig-robust