[RsR] praise for taking good care of ANOVA in the robust libraries in R
Olivier Renaud
Olivier.Renaud at unige.ch
Tue Dec 8 13:57:39 CET 2009
Re,
With this mail, I would like to make a plea for taking good care of ANOVA in
the robust libraries in R. May I recall that in fields like psychology,
more than 80% of the articles contain an ANOVA, whereas less than 10% of
them contain a regression. Having done statistical consulting with
psychologists for 10 years, here are the points I find essential to
convince psychologists to use R and robust procedures for ANOVA:
(A) If I understand correctly, lmrob will eventually supersede lmRob. I
strongly suggest that real reflection be given to X variables that are
factors. As I understand it, lmrob cannot handle such data, since the
initial algorithm will very likely fail with such covariates. I suggest
adding the possibility of an L1-type initial algorithm (or similar) for
these covariates, or of using separate initial algorithms for the
continuous covariates and the factors, as in lmRob.
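The idea of an L1-type initial fit is easy to sketch in base R. The toy function below is purely illustrative (it is NOT the lmRob/lmrob internal algorithm, and the name lad_fit is made up): a least-absolute-deviations criterion is minimised with optim, and it poses no problem when the design matrix contains factor dummies.

```r
## Toy L1 (least absolute deviations) fit: illustrates the kind of initial
## estimate that remains usable when covariates are factors. This is NOT
## the lmRob/lmrob initial algorithm; lad_fit is a made-up name.
lad_fit <- function(X, y) {
  obj <- function(beta) sum(abs(y - X %*% beta))
  start <- qr.solve(X, y)        # least-squares starting values
  optim(start, obj)$par          # Nelder-Mead minimisation of the L1 criterion
}

## One-way design with a factor covariate (made-up data)
set.seed(42)
g <- factor(rep(c("g1", "g2", "g3"), each = 10))
X <- model.matrix(~ g)
y <- drop(X %*% c(1, 2, 3)) + rt(30, df = 2)  # heavy-tailed errors
lad_fit(X, y)
```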
(B) Provide so-called "Type III sums of squares", i.e. effects tested
marginally, in anova.lmrob or anova.lmRob (and implement anova.lmrob
for a single model). I know it can be done by hand, but for an average
user, having it as an optional argument to anova.lm(R/r)ob would be a
strong incentive to use R and robust ANOVA. Since this is an extremely
hot topic within the S/R community, I give below what I believe to be
convincing arguments given by (other) prominent members of the
statistics community. By the way, "marginal" or "Type III sums of
squares" are available in several important R libraries, such as the car
library (used by R Commander; see the function Anova (with a capital A)
with its type="III" argument) and the nlme library (see anova.lme with
its type="marginal" argument).
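For reference, the "by hand" route in base R looks roughly as follows (a minimal sketch on made-up unbalanced data): with sum-to-zero contrasts, drop1() tests each term against the full model, which should correspond to the marginal / "Type III" tests, while anova() gives the sequential Type I table.

```r
## Sequential (Type I) vs marginal (Type III-style) tests in base R,
## on a small made-up unbalanced two-way design.
set.seed(1)
d <- data.frame(
  A = factor(rep(c("a1", "a2"), times = c(9, 5))),
  B = factor(c(rep("b1", 3), rep("b2", 6), rep("b1", 4), rep("b2", 1))),
  Y = rnorm(14)
)
## Sum-to-zero contrasts are needed for the marginal tests to be meaningful
op <- options(contrasts = c("contr.sum", "contr.poly"))
fit <- lm(Y ~ A * B, data = d)
anova(fit)                    # Type I: terms added sequentially
drop1(fit, ~ ., test = "F")   # each term dropped from the full model
options(op)
```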
(C) Since ANOVA is so much used, why not write a small function aovrob
that just calls lmrob with the appropriate arguments for the initial
algorithm and returns anova.lmrob of the object, with marginal as the
default value?
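Such a wrapper need only be a few lines. The sketch below is hypothetical: the name aovrob and its marginal Wald table are my own assumptions, not an existing API, and MASS::rlm (an M-estimator shipped with R) stands in for the robust fitter, since lmrob cannot yet take factor covariates; ideally lmrob would be slotted in instead.

```r
## Hypothetical aovrob(): robust fit + marginal (Type III-style) Wald tests.
## The name and interface are assumptions; MASS::rlm is only a stand-in
## for lmrob here.
aovrob <- function(formula, data, ...) {
  op <- options(contrasts = c("contr.sum", "contr.poly"))
  on.exit(options(op))
  fit <- MASS::rlm(formula, data = data, ...)
  b    <- coef(fit)
  V    <- vcov(fit)
  asgn <- attr(fit$x, "assign")            # term index of each design column
  labs <- attr(terms(fit), "term.labels")
  tab <- t(vapply(seq_along(labs), function(i) {
    idx <- which(asgn == i)
    ## Wald chi-square for dropping this whole term from the full model
    W <- drop(t(b[idx]) %*% solve(V[idx, idx]) %*% b[idx])
    c(Df = length(idx), Wald = W,
      `Pr(>W)` = pchisq(W, length(idx), lower.tail = FALSE))
  }, c(Df = 0, Wald = 0, `Pr(>W)` = 0)))
  rownames(tab) <- labs
  list(fit = fit, anova = tab)
}

aovrob(breaks ~ wool * tension, data = warpbreaks)$anova
```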
Arguments for "marginal" or "Type III sums of squares" (for part (B))
Some prominent members of the S/R community have over the years made
many negative comments about "Type III sums of squares" or effects
tested marginally. However, their examples were often in regression
(e.g. polynomial). In the context of unbalanced ANOVA, other prominent
members of the statistics community give extremely convincing arguments.
The big difference comes from the fact that in almost all real examples,
if the design is unbalanced, this is due to (hopefully MAR) missing
values, and not due to an underlying population distribution that is
unbalanced. In regression, on the contrary, the distribution of X is
supposed to be fixed (or, loosely speaking, to reflect the population
distribution, with computations conditioned on the sample values).
Suppose you have data with two factors but unfortunately an unbalanced
design. You want to test the two main effects and the interaction. The
model is
$Y_{ijk}=\mu +\gamma_{i} +\theta_{j} +(\gamma\theta)_{ij} +E_{ijk}$,
with $i=1, \ldots, a$, $j=1, \ldots, b$, and $k=1, \ldots, n_{ij}$
Your favorite software proposes several ANOVA tables, called Type I,
II, III, etc. Which one should you choose?
Let's concentrate on Type I, where terms are added sequentially, and
Type III, where terms are tested marginally (to the full model).
To decide,
* One might argue about uniquely explained variance and use this
argument to favor one given Type.
* One might argue that, for testing a main effect, Type III makes
no sense, since the "null" model contains the interaction but not
the main effect.
* Searle (1987), Milliken & Johnson (1992) and others, however,
simply argue that, as statisticians, we should not look at explained
variances or philosophical arguments about what a model should
contain; one should simply look at which null hypothesis each test
corresponds to. They clearly show that
with the Type III SS, the corresponding H0 are exactly what we expect:
$\gamma_1=\gamma_2= \cdots = \gamma_a (=0)$,
$\theta_1=\theta_2= \cdots = \theta_b (=0)$, and
$(\gamma\theta)_{11}=(\gamma\theta)_{12}= \cdots = (\gamma\theta)_{ab} (=0)$,
whereas for Type I SS, the corresponding H0 for the first factor is
(see Searle p. 112 and 114 for an example):
$\rho'_1=\rho'_2= \cdots = \rho'_a (=0)$, where
$\rho'_i = \sum_j n_{ij} \mu_{ij} / n_{i.}$,
and even more complex for the second factor, where we do not even
test that some parameters are 0:
$\delta'_j = \sum_i n_{ij} \rho'_i / n_{.j} \;\forall j$, where
$\delta'_j = \sum_i n_{ij} \mu_{ij} / n_{.j}$.
In 10 years of consulting, I have never seen a psychologist willing to
test such an odd hypothesis!
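To see concretely how odd this hypothesis is, here is a tiny numeric sketch (cell means and counts made up for illustration): the population has no first-factor effect at all, yet the weighted means $\rho'_i$ that Type I compares are unequal.

```r
## Made-up 2x2 population cell means with NO effect of the first factor,
## and unbalanced cell counts n_ij.
mu <- matrix(c(10, 20,
               10, 20), nrow = 2, byrow = TRUE)   # mu_ij
n  <- matrix(c( 8,  2,
                2,  8), nrow = 2, byrow = TRUE)   # n_ij
rowMeans(mu)                  # 15 15: the Type III H0 for the first factor holds
## Type I instead compares rho'_i = sum_j n_ij * mu_ij / n_i. :
rho <- rowSums(n * mu) / rowSums(n)
rho                           # 12 18: the Type I H0 is false!
```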
Just looking at the corresponding null hypotheses will hopefully
convince some of you that, for unbalanced ANOVA, the fact that "Type
III" is recommended in many applied fields and used as the default by
e.g. SAS and SPSS is not so surprising.
Finally, a technical detail: in the presence of interactions, the exact
definition of Type III for the classical method is slightly more
involved (from the help file of Statistica):
"The Type III sums of squares attributable to an effect is computed as
the sums of squares for the effect controlling for any effects of equal
or lower degree and orthogonal to any higher-order interaction effects
(if any) that contain it. The orthogonality to higher-order containing
interaction is what gives Type III sums of squares the desirable
properties associated with linear combinations of least squares means in
ANOVA designs with no missing cells."
Also, if programmed correctly, it is "invariant to the choice of the
coding of effects for categorical predictor variables (e.g., the use of
the sigma-restricted or overparameterized model) and to the choice of
the particular g2 inverse of X'X used to solve the normal equations".
References
Searle, S. R. (1987). Linear models for unbalanced data. New York: Wiley.
Milliken, G. A., & Johnson, D. E. (1992). Analysis of messy data: Vol.
I. Designed experiments. New York: Chapman & Hall.
Sorry for the long mail, but it is in the hope that more and more users
will turn to robust procedures and to R.
Cheers,
Olivier
--
Olivier.Renaud using unige.ch http://www.unige.ch/fapse/mad/
Methodology & Data Analysis - Psychology Dept - University of Geneva
UniMail, Office 4164 - 40, Bd du Pont d'Arve - CH-1211 Geneva 4
More information about the R-SIG-Robust mailing list