[R-sig-ME] zero-inflation with mixed-effects, glmmADMB, and cloglog

Wed Mar 7 04:37:56 CET 2012

Hi, 

The cplm package may be a good place to look at for the first problem. It handles zero-inflated continuous data using the compound Poisson distribution, and the mixed-model version (cpglmm) is developed based on glmer. 

Regards, 
Wayne Zhang

-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Anne Bjorkman
Sent: Tuesday, March 06, 2012 7:31 PM
To: r-sig-mixed-models at r-project.org
Subject: [R-sig-ME] zero-inflation with mixed-effects, glmmADMB, and cloglog

Dear Mixed Modelers,

I have a few questions about various aspects of a dataset I am working on,
and I was hoping to get some feedback from the experts.  I apologize for
not being able to provide my data, but if anyone is interested I would be
happy to share it off-line.

My overall question is about changes in crop species abundances and
presence/absence over a 40 year period in 154 countries in the world, with
the measurements occurring at 10 year intervals (so, 1967, 1977, 1987,
1997, and 2007).  I would like to analyze these data in terms of both
abundance AND presence/absence, as they answer slightly different questions
(the first about abundance, the second about spread/distribution).  A
subset of my data look like this:

            Country             Item_general Total    Percent_of_Total Year
pres.abs

28465 Albania                Apples  6.45       0.002903365 1967        1

28466 Albania   Bananas_and_plantains  0.00       0.000000000 1967        0

28467 Albania                Barley 17.09      0.007692792 1967        1

28468 Albania                 Beans 41.80      0.018815610 1967        1

28469 Albania   Beverages_Alcoholic  5.60       0.002520751 1967        1

28470 Albania   Beverages_Fermented  0.00       0.000000000 1967        0

The abundance metric ("Total") here is number of Kcalories consumed of that
crop in that country in that year.  I have already looked at multivariate
changes in composition, but now I want to know about the change that occurs
for each crop species (i.e., which crop species have increased or decreased
the most over the past 40 years?).  My goal is to model each crop species
separately and extract slope parameters and standard errors for the slopes,
which would then tell me about the magnitude and direction of change for
each crop.  However, there are three complications:

1) The data for most crops are zero-inflated.  The zero-inflation is not
equal throughout the years, because overall countries have been increasing
in the number of crops they use, so there are more 0's in 1967 than in 2007
for a given crop.  This is true of most crops, but there are some that have
been decreasing.

2) I have repeated measurements on individual countries over the 40 year (5
sampling times) period. Thus I have been trying to account for this by
using "Country" as a random effect in a mixed model.

3) The data for each crop are not distributed in the same way.  Some crops
(such as corn) approach a nearly-normal distribution, while other crops
(e.g., Tea) are extremely zero-inflated.

I have attempted to use glmmADMB to model these data, as I understand this
is one of the few ways you can have random effects and account for
zero-inflation, using a negative binomial distribution and
zeroInflation=TRUE.

admb.model<-glmmadmb(Total_Rnd~I(Year-1967),random=~1|Country,family=
"nbinom",zeroInflation=TRUE,data=mydata)

where Total_Rnd is the "Total" column (total abundance) rounded to the
nearest integer using "ceiling".  This usually works on a subset of the
data that includes only one crop species, but if I try to model the entire
dataset (all crops together) I get the error:

Error in glmmadmb(Rel.Total.Rnd ~ I(Year - 1967), random = ~1 | Country,  :

  The function maximizer failed (couldn't find STD file)

In addition: Warning message:

running command './glmmadmb -maxfn 500 -maxph 5 -noinit -shess' had status
1

Need to increase the maximum number of separable calls allowed to at least
20001

Current value is 20000

Use the -ndi N command line option

The output from just one crop species (Peas in this case) looks like this:

Call:

glmmadmb(formula = Total_Rnd ~ I(Year - 1967), data = subset(kcal.sub.test,

    Item_general == "Pulses_Other"), family = "nbinom", random = ~1 |

    Country, zeroInflation = TRUE)

Coefficients:

               Estimate Std. Error z value Pr(>|z|)

(Intercept)     2.64530    0.13035   20.29   <2e-16 ***

I(Year - 1967) -0.00107    0.00136   -0.78     0.43

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of observations: total=745, Country=154

Random effect variance(s):

$Country

            (Intercept)

(Intercept)      2.3298

Negative binomial dispersion parameter: 5.8899 (std. err.: 0.48002)

Zero-inflation: 0.013045  (std. err.:  0.0050869 )

Log-likelihood: -2788

This seems to match up somewhat with the raw data:

> mean(kcal.sub.test$Total_Rnd[kcal.sub.test$Item_general=="Pulses_Other" &
kcal.sub.test$Year==1967])

[1] 31.81208

> mean(kcal.sub.test$Total_Rnd[kcal.sub.test$Item_general=="Pulses_Other" &
kcal.sub.test$Year==2007])

[1] 29.69799

(so there has been a decrease in mean abundance from 1967 to 2007)

In addition, when I plot fitted values vs. residuals for a given crop to
try to check the assumptions for these models, they do not look good (a
very serious fanning-out from left to right):

admb.model.fit<-fitted(admb.model)

admb.model.res<-resid(admb.model)

plot(admb.model.fit,admb.model.res)

I have read on the mixed-models help list and elsewhere that plotting the
fitted v. residual values is not as straightforward with mixed effects
models, so I'm not entirely sure this is the right code to use, or what is
a better way to check how well this model fits (other than comparing AIC
values of a bunch of different models with the same data).

Of course, as I mentioned above, the data for each crop are distributed
differently, and I would be enormously appreciative if someone could
comment on whether I could use different models (in other words, some of
them with zeroInflation=TRUE and some with just a normal family="nbinom"
without zero inflation) for the different crops and still compare their
slopes.  Is my priority to find the best model for each crop, or is it to
use the same model for each crop so that the slopes are directly
comparable?  I have the same question about transforming the data.  Some of
the crops (not the zero-inflated ones, of course) would be better fit by
transforming to a square root of abundance, but I'm unsure whether I could
then compare the slope parameters with non-square-root-transformed crops. I
remember reading somewhere (but can't remember where) that
back-transforming the parameters is not a simple matter with mixed models,
thus making differently-transformed slope values comparable to each other
might not be possible.

To make matters more complicated, I would really like to model this as *
relative* abundance (or, "Percent of Total") to account for the fact that
people have been eating more over the years (total Kcalories has increased
over time). I would still be dealing with zero-inflation here, and my
understanding is that I could still use the same model described above,
since negative binomial and poisson distributions are for counts, and
proportions can be modeled like counts, even though they have an upper
limit?

The second part of the data, the presence-absence data, are a bit more
straightforward.  I transformed the data to presence-absence to try to get
away from the zero-inflation issue, and have been performing logistic
regression using lmer:

pres.abs.mod<-lmer(pres.abs~I(Year-1967)+(1|Country),data=mydata,family=
binomial)

For one crop, Peas, the output looks like this:

Generalized linear mixed model fit by the Laplace approximation

Formula: pres.abs ~ I(Year - 1967) + (1 | Country)

   Data: subset(pres.abs.kcal, Item_general == "Pulses_Other")

   AIC   BIC logLik deviance

 255.2 269.1 -124.6    249.2

Random effects:

 Groups  Name        Variance Std.Dev.

 Country (Intercept) 60.785   7.7965

Number of obs: 760, groups: Country, 152

Fixed effects:

               Estimate Std. Error z value Pr(>|z|)

(Intercept)     7.64468    1.47113   5.196 2.03e-07 ***

I(Year - 1967)  0.06714    0.01747   3.843 0.000122 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:

            (Intr)

I(Yer-1967) -0.196

When I try to transform the Intercept term using plogis(7.64468) I get a
value of 0.9995216 (which, if I understand correctly, is like the expected
proportion of countries that have that crop) which I know from the data is
much too high.

My data do NOT have equal numbers of 1's and 0's, so if I try
family=binomial(link="cloglog") I get a much more reasonable estimate:

Generalized linear mixed model fit by the Laplace approximation

Formula: pres.abs ~ I(Year - 1967) + (1 | Country)

   Data: subset(pres.abs.kcal, Item_general == "Pulses_Other")

   AIC   BIC logLik deviance

 229.8 243.7 -111.9    223.8

Random effects:

 Groups  Name        Variance Std.Dev.

 Country (Intercept) 10.482   3.2376

Number of obs: 760, groups: Country, 152

Fixed effects:

               Estimate Std. Error z value Pr(>|z|)

(Intercept)    2.458405   0.696478    3.53 0.000416 ***

I(Year - 1967) 0.028910   0.009901    2.92 0.003503 **

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:

            (Intr)

I(Yer-1967) -0.292

which matches up almost perfectly with the mean of the raw data (which is
what I think the Intercept term estimates...):

> plogis(2.458405)

[1] 0.9211739

> mean(pres.abs.kcal$pres.abs[pres.abs.kcal$Item_general=="Pulses_Other"])

[1] 0.9276316

the problem is, I get an error that says

Warning message:

In mer_finalize(ans) : false convergence (8)

Can I still use the output of the link="cloglog" model, since it seems to
fit my model better (incidentally, the AIC is lower and the plot of fitted
v. residuals looks more even - even though of course for logistic models
the plots of fitted v. residuals aren't exactly straightforward to
interpret!).  Or can someone advise me as to what I am doing wrong to get
this error?

Finally, similarly to my question posed above, can I still compare the
slope parameters from my logistic regression models if I use link="logit"
for some and link="cloglog" for others, depending on which one provides a
better model fit, or should I use the same model for all crop species, even
if some are fit better than others.

I sincerely apologize for the novel-like length of this email, I have been
battling these questions for over a month now, searching R-list archives
and reading every (layman's) book I can get my hands on, and I've gotten to
the point now where, for every new thing I learn, I forget something I read
yesterday.

Enormous thanks in advance to anyone who feels willing to tackle my
statistical novel!

Best,

Anne

	[[alternative HTML version deleted]]

NOTICE:  This e-mail message, including any attachments and appended messages, is for the sole use of the intended recipients and may contain confidential and legally privileged information.
If you are not the intended recipient, any review, dissemination, distribution, copying, storage or other use of all or any portion of this message is strictly prohibited.
If you received this message in error, please immediately notify the sender by reply e-mail and delete this message in its entirety.