[RsR] MASS::rlm() and robustbase::lmrob() are both breaking
Martin Maechler
maechler at stat.math.ethz.ch
Mon Sep 8 14:06:11 CEST 2008
>>>>> "GaGr" == Gabor Grothendieck <ggrothendieck using gmail.com>
>>>>> on Mon, 8 Sep 2008 06:55:12 -0400 writes:
GaGr> Did you actually try it?
yes --
though with a slightly cleaned version of the data:
I actually worked with 'd.long',
which I produced with the following:
## simplify it a bit
d.long <- long
rn <- abbreviate(row.names(long), 8, method="both")
row.names(d.long) <- rn
## Remove the (previous) na.action -- it is not relevant here
attr(d.long, "na.action") <- NULL
head(d.long, 20)
summary(d.long) # f.year has level '2001' with *no* entry
d.long <- within(d.long,
f.year <- factor(f.year))
--- so the important part was removing the factor level with no entry.
I apologize for having omitted that part,
which I'd find crucial in a pre-analysis "data step"
but erroneously thought not to matter much here.
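The cleaning step matters because an empty factor level produces an
all-zero column in the (no-intercept) design matrix, hence a singular
fit. A minimal illustration with a toy factor (not the thread's data):

```r
## Toy factor with an empty level '2001', as in summary(d.long):
f <- factor(c("2002", "2003", "2004"),
            levels = c("2001", "2002", "2003", "2004"))
X <- model.matrix(~ -1 + f)       # column 'f2001' is all zeros
qr(X)$rank                        # 3 < ncol(X) = 4 : rank-deficient
f2 <- factor(f)                   # re-factoring drops the empty level
qr(model.matrix(~ -1 + f2))$rank  # equals ncol : full rank again
```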
Martin
>> library(MASS)
>> rlm(fo, data = long)
GaGr> Error in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
GaGr> 'x' is singular: singular fits are not implemented in rlm
GaGr> However, with rq:
>> library(quantreg)
>> rq(fo, data = long, method = "fn")
GaGr> Call:
GaGr> rq(formula = fo, data = long, method = "fn")
GaGr> Coefficients:
GaGr> f.year2002 f.year2003 f.year2004
GaGr> 8.77440508 10.01745274 11.51253770
GaGr> f.year2005 f.year2006 f.year2007
GaGr> 16.41821284 19.58352037 23.11792593
GaGr> major.industryDiversified major.industryElectricity major.industryFood
GaGr> 0.82687643 0.28608694 0.87899802
GaGr> major.industryMachinery major.industryMetals major.industryMiscManuf
GaGr> 1.98080897 5.44824581 0.09976173
GaGr> major.industryNonMetalMin major.industryServ.IT major.industryServ.Other
GaGr> 2.53125111 4.44879221 2.19568600
GaGr> major.industryTextiles major.industryTransportEq i.xTRUE
GaGr> 0.07032491 5.34869366 2.94512801
GaGr> lta.l1 I(lta.l1^2)
GaGr> -3.20155385 0.25836972
GaGr> Degrees of freedom: 6849 total; 6829 residual
GaGr> On Mon, Sep 8, 2008 at 6:07 AM, Martin Maechler
GaGr> <maechler using stat.math.ethz.ch> wrote:
>>>>>>> "GaGr" == Gabor Grothendieck <ggrothendieck using gmail.com>
>>>>>>> on Sun, 7 Sep 2008 11:06:33 -0400 writes:
>>
GaGr> Try quantreg (using the interior point method):
>>
GaGr> fo <- da.g1 ~ -1 + f.year + major.industry + i.x + lta.l1 + I(lta.l1^2)
GaGr> library(quantreg)
GaGr> rq(fo, data = long, method = "fn")
>>
>> Well, no:
>> Note that quantreg (with default tau = 0.5)
>> does so-called "L1" regression, which is a Huber M-estimate (the
>> most-robust, least-efficient among the Hubers).
>> BUT the influence function of such an estimate is still linear
>> in the predictor (x \in \R^p) and hence *un*bounded (in x).
>> In (too) short: Such estimates are robust against outliers in y
>> but *NOT* robust against outliers in x.
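>> To see the difference in practice, here is a small simulated sketch
>> (toy data, not the thread's dataset) where a single high-leverage
>> point drags the L1 fit but not a bounded-influence MM-fit:

```r
library(MASS)       # rlm()
library(quantreg)   # rq()
set.seed(1)
x <- c(rnorm(50), 100)              # one gross outlier in x (leverage)
y <- c(2 * x[1:50] + rnorm(50), 0)  # ... paired with a wildly wrong y
coef(rq(y ~ x))                  # L1: slope dragged far from 2
coef(rlm(y ~ x, method = "MM"))  # MM: slope stays close to 2
```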
>>
>> For this reason, more than 20 years ago, so called bounded
>> influence regression estimators were invented and later
>> complemented by high-breakdown and then MM-estimates (which
>> combine high breakdown with high efficiency).
>>
>> Now to the specific case
>> (which also contains much more about the general problem):
>>
GaGr> On Sun, Sep 7, 2008 at 10:13 AM, Ajay Shah <ajayshah using mayin.org> wrote:
>> >> I have a regression where the lm() goes through fine. This mailing
>> >> list has always encouraged me to worry about how a robust regression
>> >> might do things differently. I tried two approaches and both don't
>> >> work.
>> >>
>> >> First I need to give you the dataset:
>> >>
>> >>> load(url("http://www.mayin.org/ajayshah/tmp/long.rda"))
>> >>
>> >> This gives you a data frame named "long". Here's the simple lm():
>> >>
>> >>> summary(lm(da.g1 ~ -1 + f.year +
>> >> + major.industry +
>> >> + i.x +
>> >> + lta.l1 + I(lta.l1^2), data=long))
>> >>
>>
>> [............]
>>
>> >> In this, f.year is a factor and major.industry is a factor. i.x is a
>> >> boolean. lta.l1 is a real number. The left hand side variable (da.g1)
>> >> is a real number. I put a -1 on the regression to make space for the
>> >> dummy variables.
>> >>
>> >> On to my woes with robust regressions. MASS::rlm() breaks:
>> >>
>> >>> library(MASS)
>> >>> summary(rlm(da.g1 ~ -1 + f.year +
>> >> + major.industry +
>> >> + i.x +
>> >> + lta.l1 + I(lta.l1^2), method="MM", data=long))
>> >> Error in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
>> >> 'x' is singular: singular fits are not implemented in rlm
>>
>> If you do *NOT* use method="MM", i.e. use the default,
>> then MASS::rlm() works
>> *and* is (in my biased view) to be preferred over L1(quantreg),
>> even though the fits will probably be quite similar here.
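>> Concretely, a minimal sketch of that default call, assuming the
>> formula 'fo' and the cleaned data frame 'd.long' from above:

```r
library(MASS)
fo <- da.g1 ~ -1 + f.year + major.industry + i.x + lta.l1 + I(lta.l1^2)
## default method = "M" with Huber psi -- no resampling involved,
## so no singularity problems from the factor columns:
summary(rlm(fo, data = d.long))
```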
>>
>> >> robustbase::lmrob() breaks:
>> >>
>> >>> library(robustbase)
>> >>> summary(lmrob(da.g1 ~ -1 + f.year +
>> >> + major.industry +
>> >> + i.x +
>> >> + lta.l1 + I(lta.l1^2), data=long))
>> >>
>> >> Too many singular resamples
>> >> Aborting fast_s_w_mem()
>> >>
>> >> Error in lmrob.S(x = x, y = y, control = control) :
>> >> C function R_lmrob_S() exited prematurely
>>
>> Yes; the reason for robustbase::lmrob's failure is basically the
>> same as for MASS::rlm(., method="MM")'s:
>>
>> The MM-method in both requires a high-breakdown initial estimate,
>> *and* the high-breakdown initial estimates are based on algorithms
>> that use resampling methods which are not at all adapted to this
>> situation (and on a theory of affine invariance which does *not*
>> apply in the case of factor predictors...)
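>> How badly resampling fails with many-level factors can be seen in a
>> small simulation (toy data, not the thread's): an elemental
>> subsample of size p must contain every factor level to be
>> non-singular, which almost never happens:

```r
set.seed(2)
f <- factor(sample(letters[1:15], 200, replace = TRUE))
X <- model.matrix(~ f)                 # p = 15 columns
p <- ncol(X)
singular <- replicate(1000, {
    idx <- sample(nrow(X), p)          # elemental subsample of p rows
    qr(X[idx, ])$rank < p              # misses a level => rank-deficient
})
mean(singular)   # essentially 1: almost every subsample is singular
```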
>>
>> Within the small community of experts this is well-known.
>> For this reason, Maronna & Yohai (1999) proposed the so-called
>> "S/M - estimate" which is supposedly implemented in Insightful's
>> "Robust library" and consequently in the 'robust' R package.
>> The 'robust.pdf' manual (of Insightful's original) says this on
>> page 35:
>>
>>> The Robust Library deals with this problem by using a new
>>> alternate S/M-estimate computing algorithm due to Maronna and
>>> Yohai (1999) for robustly fitting linear models which contain factor
>>> variables with possibly many levels. S-PLUS automatically uses the
>>> new S/M-estimate algorithm whenever your linear model data
>>> contains at least one factor variable. For further information see the
>>> Theoretical Details section Alternating S and M Initial Estimate.
>>
>> *AND* in last year's "robust stats in R" meeting in Banff,
>> we had decided that this algorithm should really become an
>> option in 'robustbase' ((but it seems we hadn't made good
>> enough work assignments there ... ;-))
>>
>> Anyway, I've tried
>>
>> library(robust)
>> mLMR. <- lmRob(fo, data = long)
>>
>> and unfortunately, that failed too: even though I'd have
>> expected it to work thanks to using the S/M algorithm,
>> *ALAS* it gets into problems as well...
>>
>> I'm interested in more feedback, particularly from other experts
>> about S/M and/or its implementation in 'robust'.
>>
>>
>> >> If you could guide me on what I'm doing wrong, that'll be great. How
>> >> would I do the above specification as a robust regression? I googled
>> >> around and I found a few others asking these same questions in the
>> >> past, but it didn't look like there was a clear answer.
>>
>> In short: the "MM" method hasn't been well implemented for
>> this situation, so you need to stick with the default
>> rlm() for now.
>>
>> Martin Maechler, ETH Zurich
>>
>> >>
>> >> --
>> >> Ajay Shah http://www.mayin.org/ajayshah
>> >> ajayshah using mayin.org http://ajayshahblog.blogspot.com
>>
GaGr> _______________________________________________
GaGr> R-SIG-Robust using r-project.org mailing list
GaGr> https://stat.ethz.ch/mailman/listinfo/r-sig-robust