[R-sig-ME] Help with Linear Model

Fri Nov 4 14:17:31 CET 2016

(Try to remember to keep the list in CC -- I'm really bad at it, too!)

Looking at your QQ plot (consider posting it somewhere online for
posteriority as the list strips attachments), it seems that the
deviation from normality occurs in the form of heavy tails of the sort
you would find with a t-distribution. If you were doing Bayesian
modelling, I would suggest that you just use a t-likelihood instead of
a normal one .... but that may not even be necessary here.

(For one thing, it's not the data that have to be normally distributed,
but rather the residuals or equivalently, the reponse *conditioned* on
the predictors. The linear model is after all given by Y = B*X + B0 +
e, with e ~ N(0,sigma). And in the mixed-model case, you have
additional normal distributions mixing in there via the random
effects.)

I would check your model fit by checking e.g. 

 - fitted vs. actual (you can fake this by using faceting in ggplot or
the | operator in lattice, or you can do it for real using the
predict() function) 
 - fitted vs. residuals ( plot(lme.model) does this for you )

The former will tell you whether your model accurately represent your
data (and places where it fails to do so), while the latter will give
you a visual impression about how bad the violation of normality of the
residuals is.

More important than fulfilling distributional assumptions for me is
seeing how well the model actually fits the data and how good it is at
predicting new data. (Techniques like cross-validation or posterior
predictive check in the Bayesian framework are based on this idea as
well.) Violating the testing assumptions does mean that not all the
 frequentist guarantees hold, but if your model does a good job at
describing old and predicitng new data in practice, then that may be
enough for many applications. 

Now, if you're focussed on significance tests or traditional confidence
intervals, violations of model assumptions can ruin your day, as you
can no longer trust that the null distribution is correct. However, you
can still use bootstrapped confidence intervals, altough those take
much longer to compute.

One final thing: I see you're using Type-III tests. Even though it's
the default in certain popular commercial statistical packages, I would
encourage you to think about twice about doing so. Read Venables'
Exegeses on Linear Models and see some of the stuff John Fox (author of
the car packages) has written on this topic (e.g. https://stat.ethz.ch/
pipermail/r-help/2006-August/111927.html). A lot of ink has been
spilled on this topic and it's not always black and white, but I
generally find that Type-II tests are actually the thing that I want.

Best,
Phillip

On Fri, 2016-11-04 at 06:16 -0600, Joseph Aidoo wrote:
> Yield data needs to be transformed... Boxcox doesnt seem to be
> working for me. Here is the output for
> FAT.lme<-
> lmer(Yield.kg_ha~N.rate*Variety*Year+(Year|Site:Block),data=Jo_data14
> 15.dat)
> Anova(FAT.lme,type=3,test="F")
> 
> 
> 
> Response: Yield.kg_ha
> 
> 
> 
>                            F Df Df.res    Pr(>F)    
> (Intercept)         526.2004  1  49.26 < 2.2e-16 ***
> N.rate               11.5523  3 393.00 2.845e-07 ***
> Variety               7.7304  4 393.70 5.270e-06 ***
> Year                 34.7598  2  22.02 1.537e-07 ***
> N.rate:Variety        1.9513 12 393.54   0.02746 *  
> N.rate:Year           2.4641  6 393.30   0.02373 *  
> Variety:Year          1.8262  8 393.25   0.07068 .  
> N.rate:Variety:Year   0.8795 24 393.25   0.63104    
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> > 
> 
> 
> Shapiro-Wilk normality test
> 
> data:  resslme
> W = 0.98449, p-value = 5.986e-05
> 
> 
> 
> 
> 
> On Thu, Nov 3, 2016 at 11:38 PM, Phillip Alday <Phillip.Alday at unisa.e
> du.au> wrote:
> > One possible model would be:
> > 
> >  Yield ~ nitrogen*variety*Year + (Year|Site:block)
> > 
> > You won't explicitly model the variation between sites, but you
> > should still capture at least some of it via the Site:block
> > interaction. (If your blocks are labelled uniquely across sites,
> > then you can just use block instead of Site:block). You have a
> > total of 8 levels for blocks, which still isn't a lot but should
> > work.
> > 
> > You could also include a fixed effect for Site:
> > 
> >  Yield ~ nitrogen*variety*Year*Site + (Year|Site:block)
> > 
> > but this will add a fair amount of complexity to the model and I'm
> > not sure you have enough data for that.
> > 
> > I'm not familiar with oat yields, so I don't know if you should
> > transform Yield, the example in the oats dataset (?oats in R) don't
> > transform Yield.
> > 
> > Best,
> > Phillip
> > 
> > > On 4 Nov 2016, at 15:58, Joseph Aidoo <jaidoo at ualberta.ca> wrote:
> > >
> > > Thank you Philip  for your quick response. I am treating Year as
> > a factor.
> > >
> > >  I am not interested in the year effect. However I wish to
> > account for it.
> > > At each site we had four blocks for the experiment so I assumed
> > blocks will be in sites.
> > > Given this clarification how would have out my model?
> > >
> > > I have read some papers which combines site*years = environment. 
> > I don't know if its suitable in this experiment.
> > >
> > > Thank you
> > >
> > >
> > > On Thu, Nov 3, 2016 at 11:06 PM, Phillip Alday <Phillip.Alday at uni
> > sa.edu.au> wrote:
> > > Hi Joseph,
> > >
> > > Are you treating year as a continuous variable or a categorical?
> > If you only have three years and one is outlier-isn, it may not
> > make sense to treat them as a continuous variable (whether linear,
> > quadratic or some other form).
> > >
> > > Also, 2 sites is not enough levels for a random effect --
> > remember that RE are variance estimates and it doesn't really make
> > sense to make a variance estimate of two levels. How many blocks do
> > you have per site? Maybe Site:block is sufficient and if need be
> > you can include Site as a nuisance parameter in the fixed effects?
> > >
> > > Best,
> > > Phillip
> > >
> > > > On 3 Nov 2016, at 05:07, Joseph Aidoo <jaidoo at ualberta.ca>
> > wrote:
> > > >
> > > > My experiment was set up as a RCBD with
> > > > 5 oat varieties
> > > > 4 nitrogen rates
> > > > at 2 sites
> > > > over 3 years .
> > > >
> > > > I am investigating the effect of Variety and Nitrogen on Yield.
> > > >
> > > > In the second year of the experiment we had a drought so the
> > data was
> > > > awful. So i expect differences in Years
> > > >
> > > > I am using Variety, Nitrogen and Year as my fixed effects.
> > > > and Sites as my random effect.
> > > >
> > > > I expect and interaction between Variety X Nitrogen.
> > > > This is the model i came up with
> > > >
> > > > Yield ~ nitrogen*variety*Year + (Year|Site/block)
> > > >
> > > > My data doesnt seem normal because of the odd year with the
> > drought.
> > > > I have tried to normally transform the data but nothing seems
> > to work.
> > > >
> > > > I need advice on what to do next? Is it ideal to use a glmm?
> > What could the
> > > > best model be?
> > > >
> > > > Thanks
> > > >
> > > >       [[alternative HTML version deleted]]
> > > >
> > > > _______________________________________________
> > > > R-sig-mixed-models at r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> > >
> > >
> > 
> >