[R-sig-ME] Help with Linear Model
Phillip Alday
Phillip.Alday at unisa.edu.au
Sun Nov 6 05:10:22 CET 2016
Whether or not you're interested in a parameter has little to no
influence on whether it should be in your model, nor on whether it is
a fixed or random effect. Instead, that is determined by the
structure of your data. In your case, it would in many ways be nicer to
have year as a random effect (since you only care about the *variance*
due to year and not any particular fixed year), but with so few levels
you simply don't have the data structure to do so.
In your case, the year parameter is simply a nuisance parameter -- it
captures important variation, even if it's variation you don't care
about. You can marginalize over that effect before modelling
(e.g. by aggregating / averaging over the time dimension), after
modelling (via your choice of test type (II, III), or via predicted
marginal / least-squares means, e.g. with the lsmeans package), or
implicitly (by omitting it from the model specification entirely). I
would tend to leave the year effect in the model if possible -- I'm a
big fan of modelling as much as you can so that your model is as good
as possible, even if there are only a few predictors you actually care
about.
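As a sketch of the after-modelling route with lsmeans: the data below
are simulated stand-ins (every variable name and the data-generating
step are made up for the example), so treat this as an illustration of
the call pattern, not of the real analysis.

```r
library(lme4)
library(lsmeans)

set.seed(1)
## Hypothetical stand-in for the oat data:
## 2 sites x 4 blocks x 3 years x 5 varieties x 4 N rates
d <- expand.grid(Site = c("S1", "S2"), Block = paste0("B", 1:4),
                 Year = factor(2013:2015), Variety = paste0("V", 1:5),
                 N.rate = factor(c(0, 40, 80, 120)))
## Fake yields with an N-rate trend plus a year-to-year shift and noise
d$Yield <- 3000 + 200 * as.numeric(d$N.rate) +
  rnorm(nlevels(d$Year), sd = 400)[d$Year] + rnorm(nrow(d), sd = 300)

## Year stays in the model as a nuisance factor
fit <- lmer(Yield ~ N.rate * Variety * Year + (1 | Site:Block), data = d)

## Predicted marginal (least-squares) means for the factors of
## interest, averaged over the levels of Year
lsmeans(fit, ~ N.rate * Variety)
```

The lsmeans call will note that results are averaged over the levels
of Year -- that warning is exactly the marginalization being asked for.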
That said, if you don't have enough data for such a complex model, then
you should leave out less-interesting dimensions in your model
specification (and potentially aggregate across them beforehand). The
Bates et al. preprint on parsimonious mixed models really emphasizes
this point -- it doesn't do much good to model everything and have a
degenerate and/or overfitted model. You know your data best and so you
know what tradeoffs are best for your case.
Best,
Phillip
On Fri, 2016-11-04 at 09:51 -0600, Joseph Aidoo wrote:
> Hi Phillip, I am going to bug you one last time.
> I'm considering focusing on the effect of Variety and Nitrogen only
> on yield.
> I am not interested in the effect of the years. How do I account for
> variation in years in the model now? Perhaps by using it as a random
> factor?
> How would you lay it out in a model?
>
> Regards
>
> Sent from my iPhone
>
> On Nov 4, 2016, at 7:17 AM, Phillip Alday
> <Phillip.Alday at unisa.edu.au> wrote:
>
> > (Try to remember to keep the list in CC -- I'm really bad at it,
> > too!)
> >
> > Looking at your QQ plot (consider posting it somewhere online for
> > posterity, as the list strips attachments), it seems that the
> > deviation from normality takes the form of heavy tails of the sort
> > you would find with a t-distribution. If you were doing Bayesian
> > modelling, I would suggest that you just use a t-likelihood instead
> > of a normal one ... but that may not even be necessary here.
> >
> > (For one thing, it's not the data that have to be normally
> > distributed, but rather the residuals -- or equivalently, the
> > response *conditioned* on the predictors. The linear model is after
> > all given by Y = B*X + B0 + e, with e ~ N(0, sigma). And in the
> > mixed-model case, you have additional normal distributions mixing
> > in there via the random effects.)
> >
> > I would check your model fit by plotting e.g.
> >
> > - fitted vs. actual (you can fake this by using faceting in ggplot
> >   or the | operator in lattice, or you can do it for real using the
> >   predict() function)
> > - fitted vs. residuals ( plot(lme.model) does this for you )
> >
> > The former will tell you whether your model accurately represents
> > your data (and where it fails to do so), while the latter will give
> > you a visual impression of how bad the violation of normality of
> > the residuals is.
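Those two diagnostics might be sketched as follows, using lme4's
built-in sleepstudy data as a stand-in for the oat data:

```r
library(lme4)
library(lattice)

## sleepstudy ships with lme4; Subject plays the role of the grouping
## factor here
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

## Fitted vs. actual, one panel per subject, with a y = x reference
## line: points near the line mean the model tracks that subject well
xyplot(fitted(fit) ~ Reaction | Subject, data = sleepstudy,
       panel = function(...) { panel.xyplot(...); panel.abline(0, 1) })

## Fitted vs. residuals -- plot() on a merMod object does this for you
plot(fit)
```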
> >
> > More important to me than fulfilling distributional assumptions is
> > seeing how well the model actually fits the data and how good it is
> > at predicting new data. (Techniques like cross-validation, or
> > posterior predictive checks in the Bayesian framework, are based on
> > this idea as well.) Violating the testing assumptions does mean
> > that not all the frequentist guarantees hold, but if your model
> > does a good job at describing old and predicting new data in
> > practice, then that may be enough for many applications.
> >
> > Now, if you're focussed on significance tests or traditional
> > confidence intervals, violations of model assumptions can ruin
> > your day, as you can no longer trust that the null distribution is
> > correct. However, you can still use bootstrapped confidence
> > intervals, although those take much longer to compute.
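A minimal sketch of bootstrapped confidence intervals via lme4's
confint() method, again on the built-in sleepstudy data (nsim is kept
small here for speed; use something like 1000+ in practice):

```r
library(lme4)
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

## Parametric bootstrap CIs for the fixed effects only ("beta_");
## method = "boot" resimulates and refits the model nsim times
ci <- confint(fit, method = "boot", nsim = 100, parm = "beta_")
ci
```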
> >
> > One final thing: I see you're using Type-III tests. Even though
> > that's the default in certain popular commercial statistical
> > packages, I would encourage you to think twice about doing so.
> > Read Venables' Exegeses on Linear Models and see some of the
> > things John Fox (author of the car package) has written on this
> > topic (e.g.
> > https://stat.ethz.ch/pipermail/r-help/2006-August/111927.html).
> > A lot of ink has been spilled on this topic and it's not always
> > black and white, but I generally find that Type-II tests are
> > actually the thing that I want.
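For a concrete comparison, here is a sketch of Type-II vs. Type-III
tests with car::Anova() on the built-in warpbreaks data, a small
two-factor stand-in for the oat design:

```r
library(car)  # provides Anova() with Type-II/III tests

## warpbreaks: built-in two-factor dataset (wool x tension)
fit <- lm(breaks ~ wool * tension, data = warpbreaks)

## Type-II (the car default): each term tested after the terms it is
## marginal to, respecting the principle of marginality
Anova(fit, type = 2)

## Type-III results depend on the contrast coding; with R's default
## treatment contrasts they are rarely what you want, so refit with
## sum-to-zero contrasts first
fit3 <- lm(breaks ~ wool * tension, data = warpbreaks,
           contrasts = list(wool = contr.sum, tension = contr.sum))
Anova(fit3, type = 3)
```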
> >
> > Best,
> > Phillip
> >
> >
> > On Fri, 2016-11-04 at 06:16 -0600, Joseph Aidoo wrote:
> > > Yield data needs to be transformed... Boxcox doesn't seem to be
> > > working for me. Here is the output for
>
> FAT.lme <- lmer(Yield.kg_ha ~ N.rate*Variety*Year + (Year|Site:Block),
>                 data = Jo_data1415.dat)
> Anova(FAT.lme, type = 3, test = "F")
>
> Response: Yield.kg_ha
>                             F Df Df.res    Pr(>F)
> (Intercept)         526.2004  1  49.26 < 2.2e-16 ***
> N.rate               11.5523  3 393.00 2.845e-07 ***
> Variety               7.7304  4 393.70 5.270e-06 ***
> Year                 34.7598  2  22.02 1.537e-07 ***
> N.rate:Variety        1.9513 12 393.54   0.02746 *
> N.rate:Year           2.4641  6 393.30   0.02373 *
> Variety:Year          1.8262  8 393.25   0.07068 .
> N.rate:Variety:Year   0.8795 24 393.25   0.63104
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>
>         Shapiro-Wilk normality test
>
> data:  resslme
> W = 0.98449, p-value = 5.986e-05
>
>
> On Thu, Nov 3, 2016 at 11:38 PM, Phillip Alday
> <Phillip.Alday at unisa.edu.au> wrote:
>
> > One possible model would be:
> >
> > Yield ~ nitrogen*variety*Year + (Year|Site:block)
> >
> > You won't explicitly model the variation between sites, but you
> > should still capture at least some of it via the Site:block
> > interaction. (If your blocks are labelled uniquely across sites,
> > then you can just use block instead of Site:block.) You have a
> > total of 8 levels for blocks, which still isn't a lot but should
> > work.
> >
> > You could also include a fixed effect for Site:
> >
> > Yield ~ nitrogen*variety*Year*Site + (Year|Site:block)
> >
> > but this will add a fair amount of complexity to the model and I'm
> > not sure you have enough data for that.
> >
> > I'm not familiar with oat yields, so I don't know if you should
> > transform Yield; the example with the oats dataset (?oats in R)
> > doesn't transform Yield.
> >
> > Best,
> > Phillip
> >
> > > On 4 Nov 2016, at 15:58, Joseph Aidoo <jaidoo at ualberta.ca> wrote:
> > >
> > > Thank you Phillip for your quick response. I am treating Year as
> > > a factor.
> > >
> > > I am not interested in the year effect. However, I wish to
> > > account for it.
> > > At each site we had four blocks for the experiment, so I assumed
> > > blocks will be nested in sites.
> > > Given this clarification, how would you lay out my model?
> > >
> > > I have read some papers that combine site*years = environment.
> > > I don't know if it's suitable in this experiment.
> > >
> > > Thank you
> > >
> > > On Thu, Nov 3, 2016 at 11:06 PM, Phillip Alday
> > > <Phillip.Alday at unisa.edu.au> wrote:
> > >
> > > Hi Joseph,
> > >
> > > Are you treating year as a continuous variable or a categorical
> > > one? If you only have three years and one is outlier-ish, it may
> > > not make sense to treat year as continuous (whether linear,
> > > quadratic or some other form).
> > >
> > > Also, 2 sites is not enough levels for a random effect --
> > > remember that REs are variance estimates, and it doesn't really
> > > make sense to estimate a variance from two levels. How many
> > > blocks do you have per site? Maybe Site:block is sufficient, and
> > > if need be you can include Site as a nuisance parameter in the
> > > fixed effects?
> > >
> > > Best,
> > > Phillip
> > >
> > > > On 3 Nov 2016, at 05:07, Joseph Aidoo <jaidoo at ualberta.ca>
> > > > wrote:
> > > >
> > > > My experiment was set up as an RCBD with
> > > > 5 oat varieties
> > > > 4 nitrogen rates
> > > > at 2 sites
> > > > over 3 years.
> > > >
> > > > I am investigating the effect of Variety and Nitrogen on Yield.
> > > >
> > > > In the second year of the experiment we had a drought, so the
> > > > data was awful. So I expect differences between Years.
> > > >
> > > > I am using Variety, Nitrogen and Year as my fixed effects,
> > > > and Sites as my random effect.
> > > >
> > > > I expect an interaction between Variety x Nitrogen.
> > > > This is the model I came up with:
> > > >
> > > > Yield ~ nitrogen*variety*Year + (Year|Site/block)
> > > >
> > > > My data doesn't seem normal because of the odd year with the
> > > > drought. I have tried to normally transform the data, but
> > > > nothing seems to work.
> > > >
> > > > I need advice on what to do next. Is it ideal to use a GLMM?
> > > > What could the best model be?
> > > >
> > > > Thanks
> > > >
> > > > _______________________________________________
> > > > R-sig-mixed-models at r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
More information about the R-sig-mixed-models mailing list