[R] Strange degrees of freedom and SS from car::Anova with type II SS?
Fox, John
jfox @ending from mcm@@ter@c@
Thu Dec 6 03:43:06 CET 2018
Dear R.,
The problem you constructed is too ill-conditioned for the method that Anova() uses to compute type-II sums of squares and the associated degrees of freedom, with an immense condition number of the coefficient covariance matrix:
> library(car)
Loading required package: carData
> mod <- lm(prestige ~ women * type * income * education, data=Prestige)
> e <- eigen(vcov(mod))$values
> max(e)/min(e)
[1] 2.776205e+17
Simply centering the numerical predictors reduces the condition number by a factor of 10^3, which allows Anova() to work, even though the problem is still extremely ill-conditioned:
> Prestige.c <- within(Prestige, {
+ income <- income - mean(income)
+ education <- education - mean(education)
+ women <- women - mean(women)
+ })
> mod.c <- lm(prestige ~ women * type * income * education, data=Prestige.c)
> e.c <- eigen(vcov(mod.c))$values
> max(e)/min(e)
[1] 2.776205e+17
> Anova(mod.c)
Anova Table (Type II tests)
Response: prestige
Sum Sq Df F value Pr(>F)
women 167.29 1 4.9516 0.0291142 *
type 744.30 2 11.0150 6.494e-05 ***
income 789.00 1 23.3529 7.112e-06 ***
education 699.54 1 20.7050 2.057e-05 ***
women:type 140.32 2 2.0766 0.1326023
women:income 33.14 1 0.9807 0.3252424
type:income 653.40 2 9.6697 0.0001859 ***
women:education 30.36 1 0.8986 0.3462316
type:education 0.72 2 0.0107 0.9893462
income:education 7.88 1 0.2331 0.6306681
women:type:income 136.80 2 2.0245 0.1393087
women:type:education 140.18 2 2.0745 0.1328633
women:income:education 100.42 1 2.9722 0.0888832 .
type:income:education 82.02 2 1.2138 0.3029069
women:type:income:education 2.05 2 0.0303 0.9701334
Residuals 2500.16 74
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> mod.c.2 <- update(mod.c, . ~ . - women:type:income:education)
> sum(residuals(mod.c.2)^2) - sum(residuals(mod.c)^2)
[1] 2.049735
Beyond demonstrating that the algorithm that Anova() uses can be made to fail if the coefficient covariance matrix is sufficiently ill-conditioned problem, I’m not sure what the point of this is. I suppose that we could try to detect this condition, which falls in the small region between where lm() detects a singularity and the projections used by Anova() break down.
Best,
John
-------------------------------------------------
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: http::/socserv.mcmaster.ca/jfox
> On Dec 5, 2018, at 7:33 PM, Ramon Diaz-Uriarte <rdiaz02 using gmail.com> wrote:
>
>
> Dear All,
>
> I do not understand the degrees of freedom returned by car::Anova under
> some models. They seem to be too many (e.g., numerical variables getting
> more than 1 df, factors getting more df than levels there are).
>
> This is a reproducible example:
>
> library(car)
> data(Prestige)
>
> ## Make sure no issues from NAs in comparisons of SS below
> prestige_nona <- na.omit(Prestige)
>
> Anova(lm(prestige ~ women * type * income * education,
> data = prestige_nona))
>
> ## Notice how women, a numerical variable, has 3 df
> ## and type (factor with 3 levels) has 4 df.
>
>
> ## In contrast this seems to get the df right:
> Anova(lm(prestige ~ women * type * income * education,
> data = prestige_nona), type = "III")
>
> ## And also gives the df I'd expect
> anova(lm(prestige ~ women * type * income * education,
> data = prestige_nona))
>
>
>
> ## Type II SS for women in the above model I do not understand either.
> m_1 <- lm(prestige ~ type * income * education, data = prestige_nona)
> m_2 <- lm(prestige ~ type * income * education + women, data = prestige_nona)
> ## Does not match women SS
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>
> ## See [1] below for examples where they match.
>
>
> Looking at the code, I do not understand what the call from
> linearHypothesis returns here (specially compared to other models), and the
> problem seems to be in the return from ConjComp, possibly due to the the
> vcov of the model? (But this is over my head).
>
>
> I understand this is not a reasonable model to fit, and there are possibly
> serious collinearity problems. But I was surprised by the dfs in the
> absence of any warning of something gone wrong. So I think there is
> something very basic I do not understand.
>
>
>
> Thanks,
>
>
> R.
>
>
> [1] In contrast, in other models I see what I'd expect. For example:
>
> ## 1 df for women, 2 for type
> Anova(lm(prestige ~ type * income * women, data = prestige_nona))
> m_1 <- lm(prestige ~ type * income, data = prestige_nona)
> m_2 <- lm(prestige ~ type * income + women, data = prestige_nona)
> ## Type II SS for women
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>
> ## 1 df for women, income, education
> Anova(lm(prestige ~ education * income * women, data = prestige_nona))
> m_1 <- lm(prestige ~ education * income, data = prestige_nona)
> m_2 <- lm(prestige ~ education * income + women, data = prestige_nona)
> ## Type II SS for women
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>
>
>
>
> --
> Ramon Diaz-Uriarte
> Department of Biochemistry, Lab B-25
> Facultad de Medicina
> Universidad Autónoma de Madrid
> Arzobispo Morcillo, 4
> 28029 Madrid
> Spain
>
> Phone: +34-91-497-2412
>
> Email: rdiaz02 using gmail.com
> ramon.diaz using iib.uam.es
>
> http://ligarto.org/rdiaz
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list