[R-sig-ME] is multicollinearity of fixed effects resolved by random effects

Tue May 20 09:28:58 CEST 2008

Disclaimer: The following recommendation of a sequence of steps is not
the only one, perhaps not even the best one. It has worked for me in
the past.

First, I assume
- that centering was done with SCALE=FALSE,
- that you checked that linearity is defensible for the relation
between X13C and your three predictors (i.e., that you do not need
quadratic terms for some of them),
- that you used method = "ML" for comparison of models with identical
random but different fixed effect parts and you read up on the some of
the complications associated with such comparisons discussed on this
list. (You can use method="REML" for comparison of models differing
only in the random effects, again there are some qualifications.)

Second, I recommend that now you focus on the fixed-effects part. It
still is a bit random (pun intended).
> d.mod1: X13C ~ cMAT * cMAP * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
> d.mod1a: X13C ~ (cMAT+ cMAP+ cLAT)^2 + TYPE + (1 | SITE) + (1 | SPECIES)
If taking out the three.factor interaction does no harm, you may want
to remove the non-significant two-factor interactions. After those are
gone, you may want to check whether there is a non-significant main
effect that is not part of an interaction (cMAP?). Then you may want
to take this one out, too. Such a hiearchical dropping out of effects
may lead you to your current favorite:
X13C ~ cMAT * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
which expands to:
X13C ~ cMAT+ cLAT + cMAT:cLAT+ TYPE + (1 | SITE) + (1 | SPECIES)
Note there are often good arguments for keeping theoretically
interesting effects in the model, even if they are not significant!

Third, you should plot your effects to get a good idea about the
source of the interaction.

Finally, you specify various random-effects parts and try to
understand what they mean (see previous posts to this list), for
example:
(1|SITE)
(1|GENUS)
(1|GENUS) + (1|SITE)
(1|GENUS) + (1|GENUS:SITE)   equivalent to  (1|GENUS/SITE)

Then allow the significant fixed effects to vary for the random effects.

Reinhold Kliegl

On Tue, May 20, 2008 at 1:55 AM, Jordan Mayor <clavulina at gmail.com> wrote:
> #Thanks for all of your comments.  I have done three new things to my fungal
> data since last posting my comments:
>
> 1) I realized that in my previous email I had centered my explanatory
> variables, not predictors (sorry - I had previously centered the isotope
> data for discriminant analyses),  which I have now fixed  (e.g. cMAT, cMAP,
> cLAT),
> 2) I have include TYPE as a fixed effect instead of creating separate models
> for both mycorrhizal and saprotrophic fungi to simplify, and
> 3) I have decided to compare model sets using the GENUS instead of SPECIES
> as random effects in the following lmer models and both sets converged on
> the same model according to AIC scores.
>
> Justifications for the above actions:
> 1) Fixed previous error.   The centered predictors have a much lower
> correlation with one another now suggesting I may have finally solved the
> severe multicollinearity effect???? Not sure if this is a correct
> interpretation however.
>> center=lm(cMAT~cLAT,data=dGen)
>> summary(center)
> Call:
> lm(formula = cMAT ~ cLAT, data = dGen)
> Residuals:
>     Min      1Q         Median      3Q       Max
> -17.817  -3.206   1.229         2.229  10.313
> Coefficients:
>                     Estimate    Std. Error    t value      Pr(>|t|)
> (Intercept) -0.920757   0.181696    -5.068      4.88e-07 ***
> cLAT         -0.213731   0.009673    -22.097    < 2e-16 ***
> Residual standard error: 5.466 on 911 degrees of freedom
> Multiple R-squared: 0.3489,     Adjusted R-squared: 0.3482
> F-statistic: 488.3 on 1 and 911 DF,  p-value: < 2.2e-16
>
> 2) TYPE provides good a priori knowledge about the physiology of fungi (i.e.
> trophic role in the ecosystem) and this is reflected in my dependent
> variables whose variance I am t -> the carbon and nitrogen isotopes in fungi
>
> 3) Fungal genera are much easier to taxonomically identify and thus are less
> prone to collector misidentification, regional biases, or spelling errors!
> The family level may be too coarse for ecological comparisons because many
> genera within families can form both of the trophic roles I am trying to
> model.  I don't see any reason to use GENERA as a fixed effect however,
> because individual life histories (host plant/tissue, fungal age, site
> fertility, site stress) at each SITE could modify isotope values in very
> unknown (random) ways.  I will defiantly look into using the "ape" package
> in future research, thanks for pointing this out Simon, and yes there is now
> a phylogeny for fungi to which this could be applied - see (Blackwell et al.
> 2006 Mycologia 98:829-837, Hibbett et al. 2007Mycological Research
> 111:509-547) if anyone is interested.
>
> Using TYPE as fixed and GENERA as random provided somewhat better dispersion
> across sites as evidenced in the following table previously requested by
> Douglas Bates:
>
>> table(table(dGen$GENUS))
>
>  0    1      2      3     4     5    6    7    8    9   10   11   13   15
> 16   17   19    25    28    35    54   86   97
>  1    63   21   15    6     5    9    7    1    3    2     2      1
> 2     1     2     1       1      1      2       1     1     1
>> xtabs(~ SITE, dGen)
> SITE
>                              Aheden                            Aheden 2
>                                  33                                  53
>    Ashiu (temp deciduous broadleaf)         Betsele
>                                  40                                   5
>                      Breuil, France                         Chiba
>                                  47                                   9
>                                   d                          Flakaliden
>                                  87                                  21
>                 Glacier Bay, Alaska                   Guyana
>                                   8                                  49
>                                   h                              heath
> tundra, subarctic Sweden
>                                  92                                  14
>                           Kagoshima                       Kulbacksliden
>                                   1                                   8
>                               Kyoto                               Lamar
> Haines
>                                  54                                  25
>           Lambir (lowland tropical)                  Miyajima
>                                  31                                   2
>                            Norikura                           Norrliden
>                                  12                                   1
>                             Okinawa                       Ontake (subalpine
> coniferous)
>                                   1                                  17
>                               Oodai                           pine forests
> in CA
>                                   2                                  43
>                           Shirahama                         Snowbowl
>                                   3                                  22
>                   Spruce plantation                      Stadsskogen
>                                  37                                 123
>                         Svartberget                         Tanigawa
>                                   5                                   8
> tussock tundra near Toolik Lake, AK       Vilan
>                                   8                                   7
>                         Woods Creek
>                                  45
>
> The new models are as follows:
>
> Models with SPECIES included:
> d.mod0: X13C ~ TYPE + (1 | SITE) + (1 | SPECIES)
> d.mod5: X13C ~ cMAT + TYPE + (1 | SITE) + (1 | SPECIES)
> d.mod6: X13C ~ cMAP + TYPE + (1 | SITE) + (1 | SPECIES)
> d.mod7: X13C ~ cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
> d.mod8: X13C ~ cMAT * cLAT + (1 | SITE) + (1 | SPECIES)
> d.mod2: X13C ~ cMAT * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
> d.mod3: X13C ~ cMAT * cMAP + TYPE + (1 | SITE) + (1 | SPECIES)
> d.mod4: X13C ~ cMAP * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
> d.mod1: X13C ~ cMAT * cMAP * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
>                    Df     AIC     BIC         logLik    Chisq Chi Df
> Pr(>Chisq)
> d.mod0.p   6   2594.8  2622.8  -1291.4
> d.mod5.p   7   2596.4  2629.1  -1291.2   0.4031          1  0.5255178
> d.mod6.p   7   2596.7  2629.5  -1291.3   0.0000          0  < 2.2e-16 ***
> d.mod7.p   7   2596.7  2629.5  -1291.4   0.0000          0  < 2.2e-16 ***
> d.mod8.p   7   2926.1  2958.9  -1456.1   0.0000          0  < 2.2e-16 ***
> d.mod2.p  9   2563.2  2605.4  -1272.6   366.8690      2  < 2.2e-16 ***
>>>Best fit
> d.mod3.p   9   2585.3  2627.4  -1283.6   0.0000          0  < 2.2e-16 ***
> d.mod4.p   9   2584.6  2626.7  -1283.3   0.6785          0  < 2.2e-16 ***
> d.mod1.p 13  2569.9  2630.8  -1272.0  22.6361         4  0.0001497 ***
>
> Those same models with GENERA instead of SPECIES:
>                           Df     AIC     BIC          logLik    Chisq Chi
> Df  Pr(>Chisq)
> dGen.mod0.p   5   2428.6  2451.8   -1209.3
> dGen.mod5.p   6   2430.0  2457.7   -1209.0   0.6296       1    0.427500
> dGen.mod6.p   6   2430.6  2458.4   -1209.3   0.0000       0    < 2.2e-16 ***
> dGen.mod7.p   6   2430.6  2458.4   -1209.3   0.0000       0    < 2.2e-16 ***
> dGen.mod8.p   7   2515.3  2547.6   -1250.6   0.0000       1    1.000000
> dGen.mod2.p  8   2401.6  2438.6   -1192.8 115.6470    1    < 2.2e-16 ***
>>>>Best fit again
> dGen.mod3.p   8   2418.6  2455.5   -1201.3   0.0000       0    < 2.2e-16 ***
> dGen.mod4.p   8   2418.4  2455.3   -1201.2   0.2163       0    < 2.2e-16 ***
> dGen.mod1.p 12   2408.8  2464.3   -1192.4  17.5047      4   0.001542 **
>
> I am still uncertain if I have modeled my random effects properly however
> but is seems that Model 2 is robust regardless if I model Random effects as
> (1|SITE/GENUS) or (1|SITE) + (1|GENUS).  BUT when comparing the two best fit
> models to each other, the one with more df is significantly different
> suggesting I should choose the simpler of the two without strong evidence to
> support nesting.
>
> Models:
> dGen.mod2.1:   X13C ~ cMAT * cLAT + TYPE + (1 | SITE:GENUS)
> dGen.mod2:      X13C ~ cMAT * cLAT + TYPE + (1 | SITE/GENUS)
>                               Df     AIC        BIC           logLik
> Chisq Chi     Df     Pr(>Chisq)
> dGen.mod2.1.p  7       2421.2  2453.6      -1203.6
> dGen.mod2.p     8       2417.9  2454.8       -1200.9    5.3363
> 1       0.02089 *
>
> Sorry if I have overloaded you all with output - I just thought some would
> be interested and I wanted to follow up.  I would be very interested in
> knowing if anyone has any comments on my interpretations or could suggest
> further model configurations.  Thank you very much.
>
> Jordan Mayor
>
>
> On Sun, May 18, 2008 at 8:33 PM, Simon Blomberg <s.blomberg1 at uq.edu.au>
> wrote:
>>
>> It is easy to incorporate phylogenetic correlations among species
>> ("taxonomic factors"), using the ape package and lme. This is far better
>> than combining species into families or genera, as taxonomic heirarchies
>> are subjective, artificial, and rarely represent the true phylogeny. I
>> strongly disagree that taxonomic factors necessarily function as fixed
>> effects. The phylogeny represents what we think we know about the
>> covariance among species in all traits, measured or unmeasured. I can't
>> see how unmeasured traits or poorly-defined "taxonomic factors" can
>> possibly be included as fixed effects. If particular measured traits are
>> thought to be important in determining the mean response, they should be
>> included as fixed effects, but phylogenies are used to model the
>> covariance, not the mean. Unusual species should stand out by
>> examination of the normalized residuals.
>>
>> My approach would be to ditch species as a factor altogether, and
>> incorporate phylogenetic effects through the correlation argument in lme
>> (or gls if there are no other random effects). This assumes there is an
>> available phylogeny for the fungi, which may not be true.
>>
>> There is a very large literature on incorporating phylogeny into
>> analyses (to which I am afraid I am a small contributor).
>>
>> Simon.
>>
>> On Mon, 2008-05-19 at 10:02 +1000, John Maindonald wrote:
>> > I think grouping them into families is a very good idea.
>> > This makes use of prior insight, reduces the number of
>> > parameters to an extent that it becomes more reasonable
>> > to think about fixed effects, and you can look for individual
>> > species that stray from the path laid out for them by their
>> > families.  For the fixed effects analyses that I suggested,
>> > you might do these by families.
>> >
>> > Conceptually, there are taxonomic factors that surely
>> > function, no doubt in interaction with location variables,
>> > as fixed effects.  I consider that one ought to start by thinking
>> > of them as fixed effects, unless it can be demonstrated that
>> > data are indistinguishable from random variation.
>> >
>> > Maybe however those effects operate more as the level of
>> > genera or families than at the level of species.  Responding
>> > to this point may be a useful aim for the study.
>> >
>> > John.
>> >
>> > John Maindonald             email: john.maindonald at anu.edu.au
>> > phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
>> > Centre for Mathematics & Its Applications, Room 1194,
>> > John Dedman Mathematical Sciences Building (Building 27)
>> > Australian National University, Canberra ACT 0200.
>> >
>> >
>> > On 19 May 2008, at 4:04 AM, Jordan Mayor wrote:
>> >
>> > > Thank you JM, RK, and DB for your thoughtful responses to my
>> > > questions.  I am working on posting the file on the web and will
>> > > respond as soon as it is publicly available - (I could email it
>> > > direct to those concerned as well).  Until then here is the output
>> > > requested by DB and RK as well as more information regarding sites,
>> > > species, and the data:
>> > >
>> > > >ds=d[d$TYPE=="s",] #separates into just sap fungi as used in the
>> > > example below
>> > > >dm=d[d$TYPE=="m",] #separates into just myc fungi (treated as
>> > > distinct populations in my analyses)
>> > >
>> > > > table(table(ds$SPECIES))
>> > >
>> > >     0      1    2   3   5   6
>> > > 393 161  45   7   1   1
>> > > # As seen all species are not present at all the sites - this is
>> > > partly why I thought to nest SPECIES within SITE.  The sites range
>> > > widely from boreal tundra, to temperate California, to tropical
>> > > forests of Borneo and Guyana.  Most fungi are not so cosmopolitan as
>> > > to even potentially be present everywhere - let alone be collected
>> > > during brief collecting forays.  One site (pine forest in CA) didn't
>> > > even list  fungal names.  All fungi were listed as mycorrhizal (m),
>> > > saprotrophic (s), or of unknown ecological roles (unk).  Most of the
>> > > genera (or families) are present at all sites but that biologically
>> > > coarseness concerns me.  If, as pointed out by RK the ideal
>> > > situation would be every species present at every site then perhaps
>> > > creating a new column containing families or genera will move the
>> > > dataset toward that direction.
>> > >
>> > > > xtabs(~SITE,ds)
>> > > SITE
>> > >                              Aheden
>> > > Aheden 2
>> > >
>> > > 4                                   0
>> > >    Ashiu (temp deciduous broadleaf)        Betsele
>> > >
>> > > 21                                   0
>> > >                      Breuil, France                       Chiba
>> > >
>> > > 14                                   6
>> > >                                   d
>> > > Flakaliden
>> > >
>> > > 23                                   0
>> > >                 Glacier Bay, Alaska                Guyana
>> > >                                   4
>> > > 20
>> > >                                   h                     heath
>> > > tundra, subarctic Sweden
>> > >
>> > > 38                                   4
>> > >                           Kagoshima
>> > > Kulbacksliden
>> > >
>> > > 1                                   0
>> > >                               Kyoto                        Lamar
>> > > Haines
>> > >                                  26
>> > > 13
>> > >           Lambir (lowland tropical)               Miyajima
>> > >
>> > > 14                                   1
>> > >                            Norikura                        Norrliden
>> > >
>> > > 3                                   0
>> > >                             Okinawa       Ontake (subalpine
>> > > coniferous)
>> > >
>> > > 0                                   8
>> > >                               Oodai                  pine forests in
>> > > CA
>> > >                                   2
>> > > 25
>> > >                           Shirahama                     nowbowl
>> > >                                   2
>> > > 13
>> > >                   Spruce plantation                   Stadsskogen
>> > >                                  17
>> > > 13
>> > >                         Svartberget                     Tanigawa
>> > >
>> > > 0                                   6
>> > > tussock tundra near Toolik Lake, AK      Vilan
>> > >
>> > > 5                                   0
>> > >                         Woods Creek
>> > >                                  25                     # the zeros
>> > > here are because some sites have no X13C data - only X15N
>> > >
>> > > I did try running models with centered predictors, as suggested by
>> > > RK - see below.  I found that the AIC scores were much larger
>> > > however when the predictors were centered using my method.  I
>> > > centered by taking the mean of mycorrhizal and saprotrophic fungal
>> > > groups at each site - then I subtracted each (m) or (s) fungus from
>> > > those same group means within each site.  Because the (m) and (s)
>> > > fungi have unique sources of carbon, and nitrogen and these are
>> > > reflected in their isotope values (X13C, X15N), I deemed this
>> > > centering level to be appropriatein order to preserve the magnitude
>> > > of difference between the groups.
>> > >
>> > > > mix.model1  # Raw isotope values used
>> > > Linear mixed model fit by maximum likelihood
>> > > Formula: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
>> > >    Data: ds
>> > >   AIC   BIC    logLik     deviance  REMLdev
>> > >  1016 1041 -500.9     1002         1031
>> > > Random effects:
>> > >  Groups   Name        Variance Std.Dev.
>> > >  SPECIES  (Intercept) 0.62149  0.78835
>> > >  SITE     (Intercept) 0.34885  0.59063
>> > >  Residual             1.28911  1.13539
>> > > Number of obs: 283, groups: SPECIES, 215; SITE, 24
>> > >
>> > > Fixed effects:
>> > >               Estimate Std. Error t value
>> > > (Intercept) -2.207e+01  1.121e+00 -19.695
>> > > MAT         -6.418e-02  3.833e-02  -1.675
>> > > MAP          3.137e-05  2.100e-04   0.149
>> > > LAT         -8.690e-03  1.778e-02  -0.489
>> > >
>> > > Correlation of Fixed Effects:
>> > >           (Intr)       MAT     MAP
>> > > MAT  -0.707
>> > > MAP  -0.401  -0.247
>> > > LAT   -0.959  0.692   0.272
>> > >
>> > > > mix.model1a # Group centered within each site
>> > > Linear mixed model fit by maximum likelihood
>> > > Formula: STND_13c ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
>> > >    Data: ds
>> > >   AIC   BIC    logLik     deviance  REMLdev
>> > >  2623 2649  -1305     2609         2620
>> > > Random effects:
>> > >  Groups   Name        Variance Std.Dev.
>> > >  SPECIES  (Intercept) 301.9720 17.3773
>> > >  SITE     (Intercept)   5.1294  2.2648
>> > >  Residual             327.3857 18.0938
>> > > Number of obs: 283, groups: SPECIES, 215; SITE, 24
>> > >
>> > > Fixed effects:
>> > >              Estimate Std. Error t value
>> > > (Intercept) 3.463e+01  1.166e+01  2.9713
>> > > MAT         3.899e-01  4.330e-01  0.9006
>> > > MAP         2.543e-05  1.811e-03  0.0140
>> > > LAT         3.235e-01  1.867e-01  1.7324
>> > >
>> > > Correlation of Fixed Effects:
>> > >          (Intr)       MAT     MAP
>> > > MAT -0.779
>> > > MAP -0.230  -0.300
>> > > LAT  -0.955   0.732   0.116
>> > >
>> > > > anova(mix.model1,mix.model1a)
>> > > Data: ds
>> > > Models:
>> > > mix.model1: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
>> > > mix.model1a: STND_13c ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
>> > >                             Df      AIC         BIC          logLik
>> > > Chisq Chi Df  Pr(>Chisq)
>> > > mix.model1.p    7      1015.85  1041.37  -500.93
>> > > mix.model1a.p  7      2623.49  2649.00 -1304.74     0      0       <
>> > > 2.2e-16 ***
>> > >
>> > > # As seen above, the model using my centered values performed
>> > > poorly.  My interpretation is that the centering removed the very
>> > > variability associated with climate I am trying to predict!
>> > >
>> > > #In addition, I compared the other models mentioned by RK using the
>> > > raw isotope values:
>> > >
>> > > > anova(mix.model1,mix.model2,mix.model3)
>> > > Data: ds
>> > > Models:
>> > > mix.model1: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
>> > > mix.model2: X13C ~ (MAT + MAP + LAT)^2 + (1 | SPECIES) + (1 | SITE)
>> > > mix.model3: X13C ~ MAT * MAP * LAT + (1 | SPECIES) + (1 | SITE)
>> > >                          Df     AIC         BIC
>> > > logLik      Chisq    Chi Df   Pr(>Chisq)
>> > > mix.model1.p  7     1015.85  1041.37    -500.93
>> > > mix.model2.p 10    978.21    1014.66    -479.10   43.645
>> > > 3       1.796e-09 ***
>> > > mix.model3.p 11    980.18    1020.28    -479.09   0.023
>> > > 1        0.8794
>> > >
>> > > # I have refrained from trying the models mentioned by DB in order
>> > > to protect his rights to not commit the "capital mistake" ;)
>> > > however, the three predictors (MAT, MAP, LAT) do indeed seem to
>> > > overplot - I will try to select only one perhaps in my final models.
>> > >
>> > > # Again - thank you all for your help on this.
>> > >
>> > > --
>> > > Jordan Mayor, Ph.D. Candidate
>> > > Ecosystem Dynamics Research Lab
>> > > Department of Botany, University of Florida
>> > > Gainesville, FL 32611
>> >
>> > _______________________________________________
>> > R-sig-mixed-models at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>> --
>> Simon Blomberg, BSc (Hons), PhD, MAppStat.
>> Lecturer and Consultant Statistician
>> Faculty of Biological and Chemical Sciences
>> The University of Queensland
>> St. Lucia Queensland 4072
>> Australia
>> Room 320 Goddard Building (8)
>> T: +61 7 3365 2506
>> http://www.uq.edu.au/~uqsblomb
>> email: S.Blomberg1_at_uq.edu.au
>>
>> Policies:
>> 1.  I will NOT analyse your data for you.
>> 2.  Your deadline is your problem.
>>
>> The combination of some data and an aching desire for
>> an answer does not ensure that a reasonable answer can
>> be extracted from a given body of data. - John Tukey.
>>
>
>