[R-sig-ME] Nested fixed effects --- redux.

Thu Jan 2 23:32:58 CET 2014

This question is (a) lengthy, and (b) a bit vague.  Just delete this 
posting if you don't feel like confronting these defects.

Last March there was some discussion on this list in respect of
nested fixed effects in linear models.  I expressed the opinion that
nested fixed effects don't really make much sense, but others disagreed,
and pointed out that one might be interested in testing hypotheses
about such effects.

Recently I have undertaken a consulting project in which this issue has
arisen.  I managed to "test for" a nested fixed effect, using anova()
applied to the output of lm().  I also conducted that test using glht()
from the "multcomp" package, applied the output of lm() with the nested
fixed effect omitted from the model.

The outputs of the two tests were "close, but no cigar".  After puzzling
about this for a while, I guess that the discrepancy might be due to the
fact that my data were (somewhat) unbalanced.  I constructed a balanced
data set, applied the two tests, and bingo!  Exact agreement.

I would appreciate it if some learned and knowledgeable person or 
persons could offer some insight into what is going on here.  Presumably 
slightly different hypotheses are being tested by the two tests in the 
case of unbalanced data.  Has anyone a solid understanding (which they 
could express in simple terms that a Bear of Very Little Brain could 
understand) of just what hypotheses are being tested?

The "real" data set is rather large, unwieldy, and complicated.  To 
provide a reproducible example I offer a small and simple data set which 
is taken from Devore's textbook on intro stats for engineers.
This data set consists (I *think*; I don't have the book to hand) of 
percentages of polyunsaturated fat in various brands of margarine.  Two 
of the brands (Mazola and Fleischman) are corn-based; the others are 
soy-based.  We thus have the fixed effect "brand" nested within "base". 
  One might be interested in testing whether there is a difference 
between the two bases.  Devore sets the exercise of estimating the 
difference (with confidence interval) between (mu_1 + mu_2 + mu_3 + 
mu_4)/4 and (mu_5 + mu_6)/2, where mu_i is the mean of the response 
variable corresponding to the i-th brand.
I conducted a test for the equality of these two quantities using both
anova() and glht().

The data are unbalanced in that 4 of the brands have 4 observations each 
and 2 of them have 5 observations.  The results of the tests are not 
*quite* identical.  When I threw away an observation from each of the 
5-observation brands so that the data became balanced, the results of 
the two tests *were* identical.

I have attached the data set in "dput" form in the file "oleo.txt" and a 
script, that effects my analysis of these data in the file, 
"scrDemo.txt". The data are also available in the "Devore7" package as 
"ex10.26".

I *suspect* that what is going is that the glht() test is doing 
*something like* using "Type III" sums of squares whereas the anova() 
test is using sequential ("Type I") sums of squares.  But I am not sure 
about this and don't know how to work out what is really happening.  I 
did try to apply the Anova() function from the "car" package with "Type" 
set equal to "III", but it threw an error to the effect that
"there are aliased coefficients in the model".  (Not surprising.)

I would be grateful for any insight that anyone can offer.

	cheers,

	Rolf Turner
-------------- next part --------------
structure(list(Imperial = c(14.1, 13.6, 14.4, 14.3, NA), Parkay = c(12.8, 
12.5, 13.4, 13, 12.3), Blue.Bonnet = c(13.5, 13.4, 14.1, 14.3, 
NA), Chiffon = c(13.2, 12.7, 12.6, 13.9, NA), Mazola = c(16.8, 
17.2, 16.4, 17.3, 18), Fleischmann = c(18.1, 17.2, 18.7, 18.4, 
NA)), .Names = c("Imperial", "Parkay", "Blue.Bonnet", "Chiffon", 
"Mazola", "Fleischmann"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))
-------------- next part --------------
#
# Script scrDemo.
#
require(multcomp)

# Prepare the data.
Oleo  <- dget("oleo.txt")
brand <- factor(rep(names(Oleo),each=5),levels=names(Oleo))
base  <- factor(c(rep("soy",20),rep("corn",10)),levels=c("soy","corn"))
Z     <- data.frame(papfua=as.vector(as.matrix(Oleo)),brand=brand,base=base)
Zb    <- Z[-5*(1:6),]
Zu    <- Z[-5*c(1,3,4,6),]

# Balanced data (Zb).
cat("Balanced data:\n")
fitb1 <- lm(papfua ~ brand,data=Zb)
fitb2 <- lm(papfua ~ base + brand%in%base,data=Zb)
cat("Checking that the two fits are identical.\n")
print(anova(fitb1,fitb2))
cat("Coefficients of the fit *without* the nested fixed effect:\n")
print(coef(fitb1))
ctrst <- matrix(c(0,0.25,0.25,0.25,-0.5,-0.5),1)
rsltb <- glht(fitb1,ctrst)
cat("Test of the nested fixed effect using glht():\n")
print(summary(rsltb))
cat("Test of the nested fixed effect using anova():\n")
print(anova(fitb2))
cat("The p-values look identical and the F statistic is\n")
cat("the square of the \"t value\".  But check more carefully:\n")
A <- as.vector(summary(rsltb)$test$pvalues)
B <- anova(fitb2)[1,5]
cat("Difference in the p-values:\n")
print(A-B)
A <- unname(summary(rsltb)$test$tstat^2)
B <- anova(fitb2)[1,4]
cat("Difference in the statistics:\n")
print(A-B)

# Unbalanced data (Zu).
cat("Slightly unbalanced data:\n")
fitu1 <- lm(papfua ~ brand,data=Zu)
fitu2 <- lm(papfua ~ base + brand%in%base,data=Zu)
cat("Checking that the two fits are identical.\n")
print(anova(fitu1,fitu2))
rsltu <- glht(fitu1,ctrst)
cat("Test of the nested fixed effect using glht():\n")
print(summary(rsltu))
cat("Test of the nested fixed effect using anova():\n")
print(anova(fitu2))
cat("The p-values are not quite identical and the F statistic is\n")
cat("not quite the square of the \"t value\". Checking more carefully:\n")
A <- as.vector(summary(rsltu)$test$pvalues)
B <- anova(fitu2)[1,5]
cat("Difference in the p-values:\n")
print(A-B)
# Hard to tell if they really are different.  But ...
A <- unname(summary(rsltu)$test$tstat^2)
B <- anova(fitu2)[1,4]
cat("Difference in the statistics:\n")
print(A-B)