[RsR] praise for taking good care of ANOVA in the robust libraries in R

Olivier Renaud  Olivier.Renaud at unige.ch
Tue Dec 8 13:57:39 CET 2009


Hello,
With this mail, I would like to make a plea for taking good care of ANOVA in 
the robust libraries in R. May I recall that in fields like psychology, 
more than 80% of the articles contain an ANOVA, whereas less than 10% of 
them contain a regression. Having done statistical consulting with 
psychologists for 10 years, here are the points I find essential to 
convince psychologists to use R and robust procedures for ANOVA:

(A) If I understand correctly, lmrob will eventually supersede lmRob. I 
strongly suggest that some real thought be given to X variables that are 
factors. As I understand it, lmrob cannot handle such data, since the 
initial algorithm will very likely fail with such covariates. I suggest 
adding the possibility of an L1-type initial algorithm (or similar) for 
these covariates, or of using separate initial algorithms for the 
continuous covariates and the factors, as in lmRob.

(B) Provide so-called "Type III sums of squares", i.e. effects tested 
marginally, in anova.lmrob or anova.lmRob (and implement anova.lmrob 
for a single model). I know it can be done by hand, but for an average 
user, having it as an optional argument to anova.lm(R/r)ob would be an 
important argument for using R and robust ANOVA. Since this is an extremely 
hot topic within the S/R community, I give below what I believe to be 
convincing arguments given by (other) prominent members of the 
statistics community. By the way, "marginal" or "Type III sums of 
squares" are available in several important R libraries, like the car 
library (used by Rcommander; see the function Anova (with a capital A) 
and its type="III" argument) and the nlme library (see anova.lme and its 
type="marginal" argument).

(C) Since ANOVA is used so much, why not write a small function aovrob 
that just calls lmrob with the appropriate arguments for the initial 
algorithm and returns anova.lmrob of the object, with marginal tests as 
the default?
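Such a wrapper might look like the following sketch. To be clear, the name aovrob and the whole design are hypothetical (this is exactly what I am asking for); I assume robustbase::lmrob as the default fitter, and any fitter providing coef() and vcov() (e.g. lm) follows the same pattern. Marginal ("Type III") tests are done here as Wald tests on each term's block of coefficients, under sum-to-zero contrasts:

```r
## Hypothetical sketch of the proposed `aovrob` (not an existing function).
## Fit with sum-to-zero contrasts, then test each term marginally with a
## Wald test on its block of coefficients.
aovrob <- function(formula, data, fitter = robustbase::lmrob, ...) {
  op <- options(contrasts = c("contr.sum", "contr.poly"))
  on.exit(options(op))
  fit  <- fitter(formula, data = data, ...)
  asgn <- attr(model.matrix(formula, data), "assign")  # term index of each column
  tl   <- attr(terms(fit), "term.labels")
  b    <- coef(fit)
  V    <- vcov(fit)
  tab  <- t(sapply(seq_along(tl), function(i) {
    j <- which(asgn == i)                      # columns belonging to term i
    w <- drop(t(b[j]) %*% solve(V[j, j, drop = FALSE]) %*% b[j])
    c(Df = length(j), Chisq = w,
      "Pr(>Chisq)" = pchisq(w, df = length(j), lower.tail = FALSE))
  }))
  rownames(tab) <- tl
  tab
}
```

With fitter = lm this reduces to classical marginal Wald tests, which makes the pattern easy to check before a robust fitter is plugged in.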


Arguments for "marginal" or "Type III sums of squares" (for part (B))

Some prominent members of the S/R community have, over the years, made 
many negative comments about "Type III sums of squares" or effects 
tested marginally. However, their examples were often from regression 
(e.g. polynomial regression). In the context of unbalanced ANOVA, other 
prominent members of the statistics community give extremely convincing 
arguments.

The big difference comes from the fact that in almost all real examples, 
an unbalanced design is due to (hopefully MAR) missing values, not to an 
underlying population distribution that is unbalanced. In regression, by 
contrast, the distribution of X is supposed to be fixed (or, loosely 
speaking, to reflect the population distribution, with computations 
conditioned on the sample values).

Suppose you have data with two factors but unfortunately an unbalanced 
design. You want to test the two main effects and the interaction. The 
model is
$Y_{ijk}=\mu +\gamma _{i} +\theta _{j} +(\gamma\theta )_{ij} +E_{ijk}$ 
with $i=1,\ldots,a$, $j=1,\ldots,b$ and $k=1, \ldots, n_{ij}$.
Your favorite software proposes several ANOVA tables, called Type I, 
II, III, etc. Which one should you choose?

Let's concentrate on Type I, where terms are added sequentially, and 
Type III, where terms are tested marginally (relative to the full model).
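Concretely, both tables can be obtained in base R; the following sketch uses the classical lm (not a robust fit) purely to show the mechanics, and the data are simulated for illustration. Sum-to-zero contrasts are needed for the marginal tests to correspond to the usual Type III hypotheses:

```r
## Sequential (Type I) vs marginal (Type III-style) tables in base R
## for an unbalanced two-way design (simulated data for illustration).
set.seed(1)
d <- data.frame(A = factor(rep(c("a1", "a2"), c(7, 5))),
                B = factor(c(rep(c("b1", "b2", "b3"), c(3, 2, 2)),
                             rep(c("b1", "b2", "b3"), c(1, 2, 2)))))
d$y <- rnorm(12)
fit <- lm(y ~ A * B, data = d,
          contrasts = list(A = "contr.sum", B = "contr.sum"))
anova(fit)                    # Type I: each term adjusted for earlier terms only
drop1(fit, ~ ., test = "F")   # marginal: each term adjusted for all others
```

Note that the two tables agree on the highest-order term (the interaction), since the last sequential test is also the marginal one; the main-effect rows differ in general.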

To decide,

    * One might argue about uniquely explained variance and use this
      argument to favor one given Type.

    * One might argue that for testing a main effect, Type III makes
      no sense, since the "null" model contains the interaction but not
      the main effect.

    * Searle (1987), Milliken & Johnson (1992) and others, however,
      simply argue that as statisticians we should not look at explained
      variances or at philosophical arguments about what a model should
      contain, but simply at the null hypothesis to which each test
      corresponds. They clearly show that

    with the Type III SS, the corresponding H0 are exactly what we expect:
    $\gamma_1=\gamma_2= \cdots = \gamma_a (=0)$,
    $\theta_1=\theta_2= \cdots = \theta_b (=0)$, and
    $(\gamma\theta)_{11}=(\gamma\theta)_{12}= \cdots =
    (\gamma\theta)_{ab} (=0)$

    whereas for Type I SS, the corresponding H0 for the first factor is
    (see Searle p. 112 and 114 for an example):
    $\rho'_1=\rho'_2= \cdots = \rho'_a (=0)$, where $\rho'_i = \sum_j
    n_{ij} \mu_{ij} / n_{i.}$

    and even more complex for the second factor, where we do not even
    test that some parameters are 0:
    $\delta'_j = \sum_i n_{ij} \rho'_i / n_{.j} \; \forall j$, where
    $\delta'_j = \sum_i n_{ij} \mu_{ij} / n_{.j}$
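The difference between these null hypotheses can be made concrete with a deliberately noiseless example in base R (again with classical lm, purely to illustrate the hypotheses, not robustness). The cell means are chosen so that the unweighted row means are equal (no A effect in the usual sense), but the cell counts are unbalanced, so the weighted means $\rho'_i$ differ:

```r
## Cell means: (a1,b1)=0, (a1,b2)=2, (a2,b1)=2, (a2,b2)=0 with counts 4,1,4,1.
## Unweighted row means: both 1 (no A effect in the Type III sense).
## Weighted row means: rho'_1 = 0.4, rho'_2 = 1.6 (the Type I null fails).
d <- data.frame(A = factor(rep(c("a1", "a1", "a2", "a2"), c(4, 1, 4, 1))),
                B = factor(rep(c("b1", "b2", "b1", "b2"), c(4, 1, 4, 1))),
                y =        rep(c(  0,    2,    2,    0 ), c(4, 1, 4, 1)))
fit <- lm(y ~ A * B, data = d,
          contrasts = list(A = "contr.sum", B = "contr.sum"))
anova(fit)["A", "Sum Sq"]          # Type I SS for A: 3.6, despite equal row means
drop1(fit, ~ .)["A", "Sum of Sq"]  # marginal (Type III) SS for A: 0
```

So the sequential test for A picks up a pure artifact of the unbalanced counts, while the marginal test correctly finds nothing.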

In 10 years of consulting, I have never seen a psychologist willing to 
test such an odd hypothesis!

Just looking at the corresponding null hypotheses will hopefully 
convince some of you that it is not so surprising that, for unbalanced 
ANOVA, "Type III" is recommended in many applied fields and used as the 
default by e.g. SAS and SPSS.

Finally, a technical detail: in the presence of an interaction, the exact 
definition of Type III for the classical method is slightly more involved 
(from the help file of Statistica):
"The Type III sums of squares attributable to an effect is computed as 
the sums of squares for the effect controlling for any effects of equal 
or lower degree and orthogonal to any higher-order interaction effects 
(if any) that contain it. The orthogonality to higher-order containing 
interaction is what gives Type III sums of squares the desirable 
properties associated with linear combinations of least squares means in 
ANOVA designs with no missing cells."
Also, if programmed correctly, it is "invariant to the choice of the 
coding of effects for categorical predictor variables (e.g., the use of 
the sigma-restricted or overparameterized model) and to the choice of 
the particular g2 inverse of X'X used to solve the normal equations".
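This coding caveat can be illustrated in base R (again with classical lm, on simulated data): the "drop the term's columns" recipe gives a coding-invariant main-effect test only under sum-to-zero contrasts; with the default treatment (dummy) coding, the same recipe tests a simple effect at the reference level instead, while the highest-order interaction row is identical under both codings:

```r
## Same model, two codings: main-effect rows differ, interaction row does not.
set.seed(42)
d <- data.frame(A = factor(rep(c("a1", "a2"), each = 6)),
                B = factor(c("b1", "b1", "b1", "b2", "b2", "b2",
                             "b1", "b2", "b2", "b2", "b2", "b1")))
d$y <- rnorm(12) + 2 * (d$A == "a2" & d$B == "b2")
f_sum <- lm(y ~ A * B, data = d,
            contrasts = list(A = "contr.sum", B = "contr.sum"))
f_trt <- lm(y ~ A * B, data = d)   # default treatment (dummy) coding
drop1(f_sum, ~ ., test = "F")["A", "Sum of Sq"]  # Type III SS for A
drop1(f_trt, ~ ., test = "F")["A", "Sum of Sq"]  # a different quantity!
```

This is exactly why "programmed correctly" matters: software that simply drops dummy-coded columns will silently report something other than Type III.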

References
Searle, S. R. (1987). Linear Models for Unbalanced Data. New York: Wiley.
Milliken, G. A., & Johnson, D. E. (1992). Analysis of Messy Data, Vol. I: 
Designed Experiments. New York: Chapman & Hall.

Sorry for the long mail, but it is written in the hope that more and more 
users will turn to robust procedures and to R.

Cheers,
Olivier

-- 
!!! New e-mail, please update your address book !!!
Olivier.Renaud using unige.ch               http://www.unige.ch/fapse/mad/
Methodology & Data Analysis - Psychology Dept - University of Geneva
UniMail, Office 4164  -  40, Bd du Pont d'Arve   -  CH-1211 Geneva 4






More information about the R-SIG-Robust mailing list