[R-sig-ME] Modelling overdispersed, high-zero Abundance data

Mon Sep 12 17:10:44 CEST 2011

Dear Gillian: there is a book from Alain Zuur called "Mixed effects models
and extensions in ecology with R" that has a chapter (11) that deals with
zero inflated models. What I would do if I were you I Would study the source
of the zeros, may be you don t find animal species in certain sites because
it is not appropriate to find any animal in those sites, for example looking
for wild animals in an urbanized area, if you have that kind of zeros you
should remove them. I have seen this approach also in a paper from 1996
which described the eucalyptus (a tree species) dispersion in Australia
(Austin M & Meyers J, 1996: Current approaches to modeling the environmental
niche of eucalypts: implication for management of forest biodiversity.
Forest Ecology and Management, 85, 95-106.)
Best regards
Fernando 

-----Mensaje original-----
 En nombre de Gillian Eastwood
Enviado el: lunes, 12 de septiembre de 2011 06:23 a.m.
Para: r-sig-mixed-models at r-project.org
Asunto: [R-sig-ME] Modelling overdispersed, high-zero Abundance data

Dear list,

I was hoping to get some consensus on an analysis Im doing because at
present I seem to going round in circles trying different techniques
depending on who I've last spoken to. Im also abit confused between the
different options of mixed models that there are how to test them. To
explain:

Im trying to model species abundance from dataset that has both spatial
(site, distance to water, distance to village) and temporal (ie.
trap-sites monitored 5+ times on different dates across 5 years, some more
than
others) elements to it.   The ideal
outcome would to be able to predict for additional site that werent
sampled, and to include a visual map of hi/med/lo abundance i.e. some
presentable to management.

 The
     data is count data  therefore Im using Poison family of errors (also
     tried Neg Bin)
 The data seems OverdispersedThere
     are a lot of Zeros (since the species often arent at the same site) 
     therefore I did try zero-inflated (zinb) models, and also
(log(count+1))
     transforming the outcome.
 Some
     covariates seem to have a non-linear relationship to the response var,
so
     I wondered if some of the techniques I've considered aren't suitable  I
     have 1070 observations, 38 sites, 7 habitats, plus climatic covariates.

 Note
     that not all sites were monitored regularly (some very often/some
hardly
     ever). Also I suspect the data is biased depending on the time of year
     when those few monitorings took place, i.e. if it took place in summer
there seem some very large sites averages, and if all in winter, zeros.I
have another 300 observations kept aside which I was using to check model
predictions upon,

Firstly I have tried:

GLM with Neg Bin errors:  (But this fails to reflect Site differences,and
predicted badly)

            N1<-glm.nb( log(Count+1) ~ VegZone + Elevation + Tide + Temp +
RelHum + Rain)  

GLM with Quasipoisson structure: (Recently Ive been trying to investigate
the spatial dependency of the data by then creating an
geoR object and using a variogram / likfit spatial model.   For this I first
had to convert my dataset to get averages per site (rather than per
obervation) and did this separately for summer / winter
seasons.  For one season there was
spatial dependency, but I believe this method fell down because of the high
amount of zeros in the data)

WinterC<-glm(CountWinter~LAT+LONG+VegZone+Elevation+TempW+RainW,family="quas
ipoisson")

Zero-inflated Neg bin GLM (ZINB)      

(this seemed to work but I wouldn't mind some advice as to how to best
decide which variables should go in the zero component. At the moment i'm
just guessing factors that make biological sense, but several other
combinations seem to work ok also)    The problem here seems that the
predictions based on
another dataset were a fair bit higher than observed in the field.

Library(pscl)f1<-formula(Count~VegZone+Elevation+Month+
Tide+Temp | RelHum+Rain) Zip1<-zeroinfl(f1,dist="poisson",
link="logit")  Znb1<-zeroinfl(f1,dist="negbin", link="logit")(Vuong
test showed * for ZeroNegBin over StdNegBin and  lrtest(Zip1,Znb1) showed
NegBin better than Poisson)            >   summary(Znb1)
Call:zeroinfl(formula
= f1, dist = "negbin", link = "logit") Pearson
residuals:     Min      
1Q   Median       3Q     
Max -0.58531
-0.49502 -0.40727 -0.07085 18.12045  Count
model coefficients (negbin with log link):                   Estimate Std.
Error z value
Pr(>|z|)    (Intercept)       -4.721674   1.354796 
-3.485 0.000492 ***VegZone1        0.404217   0.537177  
0.752 0.451760    VegZone2       2.221035   0.567481  
3.914 9.08e-05 ***VegZone3        1.539401   0.567190  
2.714 0.006646 ** VegZone4       -2.680181  
0.602345  -4.450 8.60e-06 ***VegZone5       -0.288254  
0.229131  -1.258 0.208381    VegZone6       2.301781  
0.596701   3.858 0.000115 ***Elevation          0.008885   0.001920  
4.626 3.72e-06 ***Month             -0.178041   0.026461 
-6.728 1.71e-11 ***Tide               0.832042   0.305550  
2.723 0.006467 ** Temp          0.182989   0.039613  
4.619 3.85e-06 ***Log(theta)        -1.038962   0.087121 -11.926  < 2e-16
*** Zero-inflation
model coefficients (binomial with logit link):             Estimate Std.
Error z value
Pr(>|z|)   (Intercept)   7.73138   
3.05750   2.529  0.01145 * RelHum       -0.10398    0.03650 
-2.848  0.00439 **Rain             -0.05182   
0.02607  -1.987  0.04689 * ---Signif.
codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  Theta = 0.3538
Number of iterations in BFGS optimization: 32 Log-likelihood:
-2606 on 15 Df                  lrtest(Znb1) -
Likelihood ratio test #Df  LogLik 
Df  Chisq Pr(>Chisq)    1  15
-2605.6                          2   3 -2773.4
-12 335.53  < 2.2e-16 ***

LMER  (with
Site and Year as a random factor, plus an Obs level to deal with
Overdispersion)

(As there isn't a predict function in lmer I wrote my own based on
coefficients of fixef(); the model didnt predict well even against itself,
and a plot(fitted,residuals) indicated clustering peaking around zero.

Obs<-1:nrow(X)Mod1<-lmer(Count~VegZone+Tide+RelHum+Rain+Temp+(1|Obs)+(1|SITE
)+(1|Year),family=poisson)summary(Mod1)

Generalized
linear mixed model fit by the Laplace
approximation Formula:
Count ~ VegZone + Elevation + Tide + RelHum + Rain + Temp +      (1 | Obs) +
(1 | SITE) + (1 | Year)   AIC 
BIC logLik deviance 2959 3034 
-1465     2929Random
effects: Groups Name        Variance Std.Dev. Obs   
(Intercept) 2.80118  1.67367  SITE
(Intercept) 0.77605  0.88094  Year
(Intercept) 1.20464  1.09756 Number
of obs: 1073, groups: Obs, 1073; SITE, 38; Year, 6 Fixed
effects:                    Estimate Std. Error z value
Pr(>|z|)    (Intercept)       -19.675884   2.387821 
-8.240  < 2e-16 ***VegZone1        -0.758027   1.226631 
-0.618    0.537    VegZone2        1.380153   1.393533  
0.990    0.322    VegZone3              1.010145   1.344949  
0.751    0.453    VegZone4                 
-1.432985   1.228785  -1.166   
0.244    VegZone5                
0.099095   0.661508   0.150   
0.881    VegZone6              -0.736065   1.312398 
-0.561    0.575    Elevation           0.004610   0.004037  
1.142    0.254    Tide                0.479253   0.342998  
1.397    0.162    RelHum              0.063506   0.016140  
3.935 8.33e-05 ***Rain               
0.003029   0.000576   5.259 1.45e-07 ***Temp           0.527903   0.039350 
13.416  < 2e-16 ***---Signif.
codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1  Correlation
of Fixed Effects:            (Intr) Veg1  Veg2  
Veg3   Veg4   Veg5  
Veg6    Elevtn Tide   RelHum  
RainVeg1  -0.496

Veg2  -0.456 
0.852                                  
                            Veg3
 -0.490 
0.887  0.857                                                       
Veg4
  0.395 -0.660 -0.649 -0.676                                                
Veg5  -0.311 
0.526  0.497  0.518 -0.198                                          Veg6
 -0.498
0.878  0.848  0.883 -0.665 
0.515                                  
Elevation   -0.518 
0.896  0.870  0.907 -0.826 
0.422  0.894                            Tide        -0.194 
0.006  0.008  0.008 -0.027 
0.037  0.007  0.019                     RelHum     -0.693 
0.012 -0.014  0.001 -0.024
-0.014  0.017  0.009 -0.095              Rain        0.138
-0.002 -0.004 -0.002  0.014 -0.014 -0.002
-0.030 -0.155 -0.053       Temp   -0.567 -0.039 -0.058 -0.046  0.025 
0.042 -0.036 -0.017 -0.021  0.321
-0.164 

glmmPQL (because of the Overdispersion, but then the
plots, as in the Owls data egs confirm the zero-inflatedness)   

QP1<-
glmmPQL(Count ~ x+y+z, random = ~1 |SITE, family =quasipoisson, data = X)

INLA  (So to try and deal with Zero-inflatedness and Overdispersion and
Random factors - even more recently I am looking at this as I was advised
this could cope with the zero-inflated aspect; the NegBin model wont run,
but the Poisson will.  I understand I have to check that the quantile
intervals include 0 to be
significant.   With this INLA model should the random factors be those to
represent the Zero's)                                  

formula1<-Count~Elevation+Tide+RelHum+Rain+Temp+f(SITE,model="iid")+f(Year,m
odel="iid")

Z1<-inla(formula1,family="zeroinflatednbinomial1",data=X)
Warning
message:  In is.vector(object) : NAs
introduced by coercion     
Z2<-inla(formula1,family="zeroinflatedpoisson0",data=X)summary(Z2)
Call:"inla(formula = formula1, family =
\"zeroinflatedpoisson0\", data = X)" Time used: Pre-processing    Running
inla Post-processing           Total      
0.6250000       7.3440001       0.2029998       8.1719999  Fixed effects:
mean           sd   
0.025quant     0.5quant    0.975quant          kld(Intercept)  -4.8645490672
0.5116398678 -5.872449e+00
-4.864323731 -3.8591231353 1.441839e-03Elevation     0.0019088062
0.0009895353
-4.055322e-05  0.001906647  0.0038696403 3.746575e-04Tide
0.4133481865 0.0583345762  2.987825e-01 
0.413348241  0.5277415863
2.357353e-06RelHum        0.0201337858 0.0027565909  1.472445e-02 
0.020132161  0.0255446762
3.705232e-05Rain          0.0002081704 0.0000749166  5.988189e-05 
0.000208579  0.0003539284
1.432702e-07Temp    0.1722034238 0.0073888403  1.577326e-01 
0.172188544  0.1867410550
5.566846e-05 Random effects:Name     
Model         Max KLD SITE   IID model
Year   IID model
 Model hyperparameters:                   
                                   mean    sd     
0.025quant 0.5quant 0.975quantZero-Probability parameter for zero-inflated
Poisson_0
0.46385 0.01501 0.43458    0.46376  0.49352  
Precision for SITE                                     0.96005
0.23775 0.56946    0.93505  1.49789  
Precision for Year                                     3.04348
1.63155 0.89671    2.72199  7.09648  
 Expected number of effective parameters(std dev):
44.03(0.195)Number of equivalent replicates : 24.37  Marginal Likelihood: 
-8250.01 Warning: Interpret the marginal likelihood with care if the prior
model is improper. 

This INLA model seems a good option
but Im unsure how to check the quality or make predictions.

And then GLMER and MCMCglmm seem options also.

I have a rather short timeframe by which to complete this.  I think my data
is quite hard to explain on the available factors anyway but I was hoping
the list might be able indicate which of these I seem to be going in the
right direction to persevere with. 

Many thanks in advance for any advice that you might offer

Gillian

	[[alternative HTML version deleted]]