[R-sig-ME] it is difficult to deal with within-group errors heteroscedasticity

Maria Jose Juan Jorda mjuanjorda at gmail.com
Mon Dec 1 18:32:01 CET 2008


Dear Mixed list:

I am trying to fit a mixed linear model, particularly, a longitudinal study
with the lme function and I am really need some advise..... My question is
about how to deal with the within-group errors heteroscedasticity. This is
my first mixed model so the learning curve is high.


First I give you background with my model.

1)  I am trying to quantify how the mean length of different species of fish
are changing over time. However, I do not have one time series of main size
over time for each specie. What I have it is many time series for each
specie because those species are fished with different gears. So each time
series belongs to a different fishing gear and then many time series of
fishing gears belong to a specie. Not all the species are fished with the
same gears. So there are different number of time series for each specie.

I have attached figure 1 with the data if you want to see.

This is my basic model:

model3<-lme(meansize~year*stock,random=~year|id, data=cas)

Variables:
meansize: mean length of the fish, continuous variable
year: the years, continuous variable
stock: 3 species of fish, categorical variable.
id: it is the variable identifying each of the time series.

So in this model I am interested in estimating a mean intercept and a mean
slope for each specie and quantify how much the mean size of the specie has
change over time and see what species are decreasing more. I am treating
each of the time series of the gears as random effects. In my head I am
imaging an experiment where I have random fishing gears fishing in the ocean
and they are a random sample which will tell me if mean size of fish are
decreasing or increasing over time. Well, in reality, they are not random
samples, this is the real data that exist but i am treating them as random.
Any comments about this?

Now there are not predictors in  the model, because first I want to quantify
how much the species have decreae/increase over time, but in the future i
will add predictors  such us level of fishing exploitation or the biology of
the specie to explain why some species decrease more than others.

1)  This is the sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32
[1] nlme_3.1-89


2) The first modification of the model is to account for the fact that some
gear catch more quantity of fish than others so I want to give more weights
to the gears that fish more and less weights to the gears that fish less
fish. So I have a variable the it is number of fish for each year, gear and
specie.
-the variable called w are the weights from 0 to 1, to give more importance
to the gear catching more fish.

this will be my model:
model3.1<-lme(meansize~year*stock,random=~year|id,
data=cas,weights=varFixed(~1/w), method="ML")

So this is the model I would choose following my logic and my needs.

So now my problems:

1) I compared model3 with model3.1  which have the same fix effects.

> anova(model3,model3.1)
         Model df      AIC      BIC    logLik
model3       1 10 6532.240 6580.253 -3256.120
model3.1     2 10 7876.156 7924.169 -3928.078

and the AIC  suggest selecting model 3.

But the research question tells me to choose model3.1 because for me it is
important to give more weights to the gears fishing more.

So what do I do?

I imagine to follow my research question but...it gets more complicated....


2) Problem 2
I explored the residuals and the normality assuption of model 3 and
model3.1.

plot(model3, form=resid(., type="p")~fitted(.)| stock,abline=0,
ylim=c(-10,10))
plot(model3.1, form=resid(., type="p")~fitted(.)| stock,abline=0,
ylim=c(-10,10))
qqnorm(resid(model3,type="p"),cex=1,main="")
qqnorm(resid(model3.1,type="p"),cex=1,main="")

and it is clear, the variance is not the same across groups (here stocks).
The residuals suggest to use different variance components for each species,
and the qq plots are horrible, it does not follow the normality assumption
for any of the models.


So I did two things:

-I added to model 3 the varIdent function to account for the different
variance across groups.
model3.6<-lme(meansize~year*stock,random=~year|id,
data=cas,weights=varIdent(form=~1|stock),
control=list(maxIter=500, msMaxIter=500, niterEM=500))

-I added to model3.1 the varIdent function too. So now I am using the
function VarComb to take into account two things 1) give more importance to
gears fishing more wth varFixed and 2) to take into account the variance
across the 3 species (3 groups)

model3.9<-lme(meansize~year*stock,random=~year|id,
data=cas,weights=varComb(varFixed(~1/w),varIdent(form=~1|stock)),control=list(maxIter=500,
msMaxIter=500, niterEM=500))


Now I explore the residuals for both models (model3.6 and model3.9). the
residuals for each group look better for model 3.9 than model 3.6, in model
3.9 they do not follow any clear trend. But the qq plot still looks bad for
both models,the normality plot is distinctly curved for both models. I have
attached QQplot model3.9. Basically I don´t know what to do or how to
follow.

I compare the two models (below) and the AIC suggest to choose the model 3.6
and looking at my residual plots I would chose model3.9 over model3.6.

anova(model3.6,model3.9)> anova(model3.6,model3.9)
         Model df      AIC      BIC    logLik
model3.6     1 12 5893.051 5950.667 -2934.526
model3.9     2 12 7599.363 7656.978 -3787.681


So I am confused here, can anybody give me some advise?


I would choose model3.9, one because the residuals look better (but still I
have the problem with the QQplot that I don´t know how to solve the problem)
and two because following my research question I NEED and I WANT to give
more weights to the gears catching more fish.

But when I compare the two models, the AIC and the BIC and the loglik
suggest to choose the mode3.6 where I am not taking into account the weights
for the gears and the residuals look bad.


I have attached the data and the code to reproduce everything in a fast way.

I would appreciate any help of how to proceed, I don´t know who else to ask.

Thanks  and thanks in advance.

Maria Jose









The right-hand plot
shows the highly non-normal distribution of errors.
-- 
>))):)   >))):)   >))):)   >))):)   >))):)   >))):)   >))):)   >))):)

Maria Jose Juan Jorda

AZTI - Tecnalia / Unidad de Investigación Marina
Herrera Kaia Portualde z/g
20110 Pasaia, Gipuzkoa, Spain

Recursos Marinos y Pesquerias
Depart. Biologia Animal, Vegetal y Ecologia
Universidade A Coruña
Campus A Zapateira s/n
15071, A Coruña, Spain

Tel. Oficina +34-981167000 ext. 2204
Tel. Mobil + 34-671072900
Fax. +34981167065
mjuan at pas.azti.es
mjuanjorda at gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Figure_one.pdf
Type: application/pdf
Size: 82812 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20081201/1d683428/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: QQplot_model3.9.pdf
Type: application/pdf
Size: 64370 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20081201/1d683428/attachment-0001.pdf>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Forum_linear_mixed_model.r
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20081201/1d683428/attachment.pl>


More information about the R-sig-mixed-models mailing list