[R-sig-ME] lme vs. lmer

Paul Johnson pauljohn32 at gmail.com
Wed Oct 7 20:37:32 CEST 2009


On Wed, Oct 7, 2009 at 6:03 AM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
> Hi Paul,
>
> Thanks for your response - it's much appreciated! Yes, I confess I'm new to
> R and regression modeling, so it has been a challenge trying to decipher the
> 'R- and modeling-speak' of the experts! I do understand what you've said
> below (which is great because it confirms that I'm on the right track...),
> however, my question relates more to how one decides which terms to drop.
>
> As I understand it, one can use the p-values to determine which terms to drop
> first (i.e. drop the term with the highest p-value, then re-run the model
> and compare the two results with anova; if there's a significant difference,
> then accept the new model. Correct?)

No, that is backwards.  If the anova says the two models are
different, keep the full, unrestricted model. The significance of the
difference indicates you lost "explanatory power" when you removed a
variable or recoded to remove a parameter.
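
To make that concrete, here is a minimal sketch of the workflow, borrowing
the model and variable names from the exchange quoted below (and assuming
your version of lme4 supports update() on these fits):

## Fit the full model, drop one term to get a restricted model, compare.
## A significant difference in anova() means the dropped term carried
## explanatory power, so keep the full model.
full <- glmer(Counts ~ N + G + Year + N:Year + (Year | Site),
              data = ex4o, family = quasipoisson)
restricted <- update(full, . ~ . - N:Year)
anova(restricted, full)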


> This is done since one wants a model
> that explains the data well, and non-significant terms don't contribute to
> explaining the data, right?
>
> The challenge with using glmer with quasipoisson is that it does not give
> p-values for each term or interaction - so how does one decide which term to
> drop...if any. From what I've gleaned from previous advice, one can drop
> any term and compare the result with the full model by anova to get the
> p-value for that term. So should one do that for all the
> terms first (one at a time), then drop the term with the highest p-value,
> and then repeat the whole process again?

It's not a problem that "the program does not give you a p-value."
The problem is that the statistical theory is tenuous enough that
nobody is sure what the p-value ought to be.  The quasi model is based
on the idea that nobody knows the sampling distribution of the random
component very well, so the standard errors are, well, sorta
not-really-standard.  I think you'd probably want to calculate some
kind of robust standard error, but I've not done it with a
quasi-poisson model.
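
For a plain (non-mixed) quasi-poisson glm, the sandwich package gives the
flavor of it.  This is only a sketch, and it ignores the random effects
entirely:

## Robust (sandwich) standard errors for an ordinary quasi-Poisson glm.
library(sandwich)
library(lmtest)
fit <- glm(Counts ~ N + G + Year, data = ex4o, family = quasipoisson)
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # robust z tests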

Instead of a quasi-poisson, you could look around for a mixed model
program that has a negative binomial option.  For negative binomial,
we have a more exact model of randomness, and the standard errors are
better understood.  lmer does not have a negative binomial family, but
I'm pretty sure I've seen one somewhere.
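
As a non-mixed baseline, MASS has glm.nb.  A sketch, again ignoring the
random effects:

## Negative binomial fit without random effects, as a sanity check.
## "Theta" in the output is the estimated nb dispersion parameter.
library(MASS)
fit.nb <- glm.nb(Counts ~ N + G + Year, data = ex4o)
summary(fit.nb)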

If you really believe you want a p-value, you can calculate one
yourself.  Inspect the "summary" output and you'll see there are
columns of b's and standard errors.  Divide away for yourself.
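
Something like this sketch, where "fit" stands for your fitted model.  It
assumes coef(summary(fit)) returns the coefficient table (older lme4
versions may store it differently), and it leans on the normal
approximation whose validity is exactly what is in dispute:

## Hand-rolled Wald tests: estimate divided by standard error, looked
## up against a normal distribution.
ctab <- coef(summary(fit))           # columns: Estimate, Std. Error, ...
z <- ctab[, "Estimate"] / ctab[, "Std. Error"]
p <- 2 * pnorm(-abs(z))              # two-sided p-values
cbind(ctab, z = z, p.value = p)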

The problem of deciding which variable to drop is a hard one; it is
the same in ordinary regression.

I'm trained in a tradition that says you should try to choose
variables by theory, and don't commit the sin of dropping variables
just because they are "not significant".  If there is any
multicollinearity, the dropping process may lead to mistaken
conclusions.  This is the flaw in so-called "stepwise" regression. You
are a "bonehead" if you let the model tell you which parameters to
include.  Models will lie to you.  Model pruning of that sort--the
search for "significant" estimates--produces bad t-tests and a lot of
silly articles getting published.  I've seen economists and political
scientists publish survey articles arguing that just about everything
we publish is misleading or wrong because of the model-pruning
approach.

I've wondered whether we could work out a "regression tree" or
"forest" framework to choose which variables are needed in your glm.
I read a lot about it, but concluded it did not exactly fit my needs.
If you are looking for a rigorous justification to include or exclude
variables, I think you have to look in that more exotic direction.  I
saw a beautiful presentation about the LASSO, which selects and
estimates and accounts for shrinkage as well.
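
For what it's worth, here is a sketch of what a LASSO pass might look like
with the glmnet package.  Note that it treats observations as independent
and knows nothing about the Site random effect:

## LASSO on the fixed-effect design matrix; coefficients shrunk exactly
## to zero are "selected out" of the model.
library(glmnet)
X <- model.matrix(~ N * G * Year, data = ex4o)[, -1]  # drop intercept
cvfit <- cv.glmnet(X, ex4o$Counts, family = "poisson")
coef(cvfit, s = "lambda.min")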

There's an article by Ed Leamer in the AER, "Let's Take the Con Out
of Econometrics" (1983).  It deals with the variable selection
problem.  I *think* his suggested approach would be the one we call
Bayesian Model Averaging today.  If you fit a lot of models, dropping
variables in and out, then the final result should somehow summarize
the variety of estimates you observed.
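
The BMA package does something in that spirit for glms.  A sketch, again
with no notion of the random effect:

## Bayesian Model Averaging over subsets of the candidate terms.  The
## summary reports the posterior probability that each term belongs in
## the model.
library(BMA)
bma.fit <- bic.glm(Counts ~ N + G + Year, data = ex4o,
                   glm.family = "poisson")
summary(bma.fit)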

Sorry, this is preaching in the wrong context.  You don't really have
an r-sig-mixed problem here; you have a more general (somewhat
religious) question about regression modeling.

pj

>
> Any further advice would be greatly appreciated.
> Thanks,
> Raldo
>
>
> PS. I had attached the data in a Text file in the previous e-mail; is that
> not acceptable for R Help? (I've attached the file again and it is also
> available here).
>
>
>
> On Tue, Oct 6, 2009 at 10:19 PM, Paul Johnson <pauljohn32 at gmail.com> wrote:
>> On Wed, Sep 30, 2009 at 2:40 PM, Raldo Kruger <raldo.kruger at gmail.com>
>> wrote:
>>> Chris,
>>> Thanks for that - I should probably have mentioned that I'm using
>>> family=quasipoisson  in glmer since my data has Poisson distribution
>>> as well as being overdispersed. I'm unsure how one decides which term
>>> to drop without being informed by p-values, and so don't quite
>>> understand how the "Likelihood ratio test using anova()" , or the AIC
>>> or BIC model comparison will work in this case (I thought one's
>>> supposed to remove the term with the highest p-value from the model,
>>> and compare it with the model with the term included to see if there's
>>> a difference, not so?).
>>
>> Dear Raldo:
>>
>> Finally there is a question that I can help with!  It appears to me
>> you don't have much experience with regression modeling in R and the
>> other people who answer you are talking a bit "over your head".
>>
>> In your case, call this fit the "full" or "unrestricted" model:
>>
>> ex4o_r2 <- glmer(Counts ~ N + G + Year + N:Year + G:Year + N:G:Year + (Year | Site),
>>                  data = ex4o, family = quasipoisson)
>>
>> (I'm not commenting on the specification.)
>>
>> Suppose you wonder "Should I leave out the multiplicative effect
>> between N and Year?"  Then you fit the model that excludes that one
>> element:
>>
>> ex4o_new <- glmer(Counts ~ N + G + Year + G:Year + N:G:Year + (Year | Site),
>>                   data = ex4o, family = quasipoisson)
>>
>> And then use the anova function to compare the two models:
>>
>> anova(ex4o_r2, ex4o_new)
>>
>> This is a "pretty standard" R regression thing, similar to what people
>> do with all sorts of models in R.  It is what the "drop1" function
>> does for lm models, I believe.
>>
>> You may have seen regression models where an F test is done comparing
>> a full model against a restricted model?  This is following the same
>> line of thought.
>>
>> To test your question about the factor levels, here is what you should
>> do.  Suppose the initial factor has 5 levels, and you wonder "do I
>> really need 5 levels, or can I drop the separate estimates for 3
>> of the levels?"  Create a new factor with the simpler structure, run
>> it through the model in place of the original factor, and run anova to
>> compare the 2 models.  I wrote down some of those ideas in a paper
>> last spring (http://pj.freefaculty.org/Papers/MidWest09/Midwest09.pdf),
>> but when I was done it seemed so obvious to me (& my friends) that I
>> did not try to publish it.
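>>
>> A sketch of the mechanics (the new factor and level names here are
>> hypothetical):
>>
>> ## Merge 5 levels down to 3 by assigning duplicate level labels, then
>> ## fit both versions and compare.
>> ex4o$G2 <- ex4o$G
>> levels(ex4o$G2) <- c("a", "a", "a", "b", "c")  # first 3 levels merged
>> m5 <- glmer(Counts ~ G + (Year | Site), data = ex4o, family = quasipoisson)
>> m3 <- glmer(Counts ~ G2 + (Year | Site), data = ex4o, family = quasipoisson)
>> anova(m3, m5)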
>>
>> With anova, there is a test= option where you can specify whether you
>> want a chisq or an F test.
>>
>> And, for future reference, when the experts ask you for a data example
>> to work on, they do not mean a copy of your printout, although that
>> may help.  What they want is the actual data and commands that you
>> use.  Usually, you have to upload the data somewhere for us to see,
>> along with the code.
>>
>> Good luck with your project.
>>
>> pj
>>
>> --
>> Paul E. Johnson
>> Professor, Political Science
>> 1541 Lilac Lane, Room 504
>> University of Kansas
>>
>
>
>
> --
> Raldo
>



-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas



