[R] Binomial multi-level (hierarchical) modelling [partly stats question, not completely R related]
Mark Wardle
mark at wardle.org
Sat Jul 21 22:43:06 CEST 2007
Dear all,
This question is partly statistics and partly R and I apologise in
advance for my (usual) verbosity! My data is a little more complicated
that this suggests, but essentially I have proportion data from
different studies (id), each from a specific country and region of the
World. I would like to examine the variables that affect the
proportion, but these factors are hierarchical. In case I am not
explaining the problem correctly, I have created a simple analysis
with simulated data below!
My question is: How does one perform logistic regression using
interdependent (hieraerchical) variables and decide how much each
factor contributes to the variability?
Originally I wanted to look at whether there was only a regional
difference between studies, and simply aggregated the data and
performed a standard contingency analysis. However, I wanted to make
sure I wasn't aggregating and losing valuable covariates (?Simpson's
paradox...). Hence I thought I would be better off using binomial
regression instead, just to check other factors are not possible
confounders:
# At present I have a data frame with nested groups with proportion
data (counts and totals)
# make up some equivalent simulated data:
d1 <- data.frame(
id = factor(seq(1,20)),
count = round(c(rnorm(10, mean=10, sd=4),rnorm(10, mean=20, sd=2)),0),
total = round(rnorm(20, mean=50, sd=5),0),
country = c('UK','UK','Fr','Fr','Sp','Sp',rep('Other',4),
rep('Japan',8),rep('China',2)),
region = c(rep('Europe',10), rep('Asia',10))
)
d1$remain <- with(d1, total-count)
summary(d1) # show what we've done
# current analysis - simple aggregate and perform assessment of contigency table
d2 <- aggregate(d1[,2:3], by=list(d1$region), FUN=sum )
d2$remain <- with(d2, total-count)
d2.f <- fisher.test(matrix(c(d2$count, d2$remain), ncol=2))
print(d2.f) # highly significant difference between europe and asia
# I liked the result of this - highly significant for both simulated
and real data, and easy to interpret for a stats-idiot like me!
# a better (?) analysis?
m1 <- glm(cbind(count, remain) ~ region, "binomial", data=d1)
summary(m1)
# with real data:
# Call:
# glm(formula = cbind(count, remain) ~ region, family = "binomial",
# data = d1)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -5.2427766 -2.4016616 -1.0373422 -0.0004486 4.9658610
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.7402 0.1007 -17.287 < 2e-16 ***
# regionAsia, Americas -0.6781 0.2128 -3.187 0.00144 **
# regionEurope -0.8692 0.1665 -5.219 1.80e-07 ***
# regionIndia -16.6492 956.1380 -0.017 0.98611
# regionOther 0.3107 0.2882 1.078 0.28096
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 187.02 on 18 degrees of freedom
# Residual deviance: 141.37 on 14 degrees of freedom
# (2 observations deleted due to missingness)
# AIC: 209.28
#
# Number of Fisher Scoring iterations: 14
# *** but how does one include nested parameters (ie. decide how much
of the variability is dependent on individual study, country and
region? especially given that these factors are inter-dependent -
hierarchical.)
m2 <- glm(cbind(count, remain) ~ region + country + id, "binomial", data=d1)
# with the fake data, nothing is significant - how does one decide
which effect to remove?
# and with REAL data:
# Null deviance: 1.8702e+02 on 18 degrees of freedom
# Residual deviance: 5.4476e-10 on 0 degrees of freedom
# (2 observations deleted due to missingness)
# AIC: 95.914
#
# Number of Fisher Scoring iterations: 22
# Therefore, take out "id" as this clearly isn't the correct approach!
m3 <- glm(cbind(count, remain) ~ region + country, "binomial", data=d1)
summary(m3)
# and with REAL data:
# Call:
# glm(formula = cbind(count, remain) ~ region + country, family =
"binomial",
# data = d1)
#
# Deviance Residuals:
# Basu:2000 Brusco:2004 Filla:2000
Hadjivassiliou:2003 Juvonen:2005 Leggo:1997
Maruyama:2002 Nagaoka:1999 Onodera:2000
Pujana:1999 Saleem:2000 Schols:1997
# -6.322e-08 -2.978e-01 3.468e-01
7.383e-02 0.000e+00 -9.928e-02
3.328e+00 -1.225e+00 -2.354e+00
-1.718e-07 -2.002e-04 4.215e-08
# Silveira:2002 Soong:2001 Storey:2000
Takano:1998 Tang:2000 Warrenburg:2002
Watanabe:1998
# 0.000e+00 0.000e+00 -1.421e-07
0.000e+00 -2.019e-04 0.000e+00
-3.590e+00
#
# Coefficients: (3 not defined because of singularities)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.1102 0.3744 -5.637 1.73e-08 ***
# regionAsia, Americas -0.3080 0.4187 -0.736 0.46193
# regionEurope -0.8342 0.7007 -1.191 0.23385
# regionIndia -18.1705 4285.1131 -0.004 0.99662
# regionOther 0.6807 0.4616 1.475 0.14026
# countryChina -20.0407 4247.6466 -0.005 0.99624
# countryFinland -0.9268 1.1712 -0.791 0.42877
# countryGermany 1.6833 0.6530 2.578 0.00994 **
# countryIndia -1.1087 1.0863 -1.021 0.30747
# countryItaly -1.3562 0.7773 -1.745 0.08104 .
# countryJapan 0.5989 0.3893 1.538 0.12396
# countryJapan, USA NA NA NA NA
# countryNetherlands 1.2081 0.6209 1.946 0.05167 .
# countryPortugal / Brazil -1.7095 1.1664 -1.466 0.14273
# countrySpain -1.3182 1.1683 -1.128 0.25918
# countryTaiwan NA NA NA NA
# countryUK NA NA NA NA
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 187.019 on 18 degrees of freedom
# Residual deviance: 31.233 on 5 degrees of freedom
# (2 observations deleted due to missingness)
# AIC: 117.15
#
# Number of Fisher Scoring iterations: 17
So I have evidence of marked overdispersion (if I read this
correctly!) with my real data.
It seems to me that this analysis needs a mixed model with
hierarchical terms and a binomial link function (?using nlme). Is this
possible? I tried using lme4's lmer() function like this:
library(lme4)
d1$prop <- with(d1, count/total)
me1 <- lmer(prop ~ country + (1|region) + (0+country|region),
family='binomial', weights=total, data=d1)
But I don't think I am using this properly and can't seem to determine
the variance due to the different hierarchical terms.
Can one use the interaction of country*region instead of using a mixed
model (where presumably country would be a fixed effect and region a
random effct?)
Any pointers much appreciated!
Many thanks,
Mark
P.S. while experimenting while creating the line:
> me1 <- lmer(prop ~ country + (1|region) + (0+country|region), family='binomial', weights=total, data=d1)
I accidentally pressed <RETURN> and issued the following command:
me1 <- lmer(prop ~ country + (0|region), family='binomial',
weights=total, data=d1)
which causes R to crash (and the application closes completely).
This is with R version
> sessionInfo()
R version 2.5.1 (2007-06-27)
i386-apple-darwin8.9.1
locale:
en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] "stats" "graphics" "grDevices" "datasets" "utils"
"methods" "base"
other attached packages:
lme4 Matrix lattice RODBC
"0.99875-4" "0.999375-0" "0.15-11" "1.2-1"
--
Dr. Mark Wardle
Clinical research fellow and specialist registrar, Neurology
Cardiff, UK
More information about the R-help
mailing list