[R-sig-ME] is multicollinearity of fixed effects resolved by random effects

Mon May 19 02:33:01 CEST 2008

It is easy to incorporate phylogenetic correlations among species
("taxonomic factors"), using the ape package and lme. This is far better
than combining species into families or genera, as taxonomic heirarchies
are subjective, artificial, and rarely represent the true phylogeny. I
strongly disagree that taxonomic factors necessarily function as fixed
effects. The phylogeny represents what we think we know about the
covariance among species in all traits, measured or unmeasured. I can't
see how unmeasured traits or poorly-defined "taxonomic factors" can
possibly be included as fixed effects. If particular measured traits are
thought to be important in determining the mean response, they should be
included as fixed effects, but phylogenies are used to model the
covariance, not the mean. Unusual species should stand out by
examination of the normalized residuals.

My approach would be to ditch species as a factor altogether, and
incorporate phylogenetic effects through the correlation argument in lme
(or gls if there are no other random effects). This assumes there is an
available phylogeny for the fungi, which may not be true.

There is a very large literature on incorporating phylogeny into
analyses (to which I am afraid I am a small contributor).

Simon.

On Mon, 2008-05-19 at 10:02 +1000, John Maindonald wrote:
> I think grouping them into families is a very good idea.
> This makes use of prior insight, reduces the number of
> parameters to an extent that it becomes more reasonable
> to think about fixed effects, and you can look for individual
> species that stray from the path laid out for them by their
> families.  For the fixed effects analyses that I suggested,
> you might do these by families.
> 
> Conceptually, there are taxonomic factors that surely
> function, no doubt in interaction with location variables,
> as fixed effects.  I consider that one ought to start by thinking
> of them as fixed effects, unless it can be demonstrated that
> data are indistinguishable from random variation.
> 
> Maybe however those effects operate more as the level of
> genera or families than at the level of species.  Responding
> to this point may be a useful aim for the study.
> 
> John.
> 
> John Maindonald             email: john.maindonald at anu.edu.au
> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
> Centre for Mathematics & Its Applications, Room 1194,
> John Dedman Mathematical Sciences Building (Building 27)
> Australian National University, Canberra ACT 0200.
> 
> 
> On 19 May 2008, at 4:04 AM, Jordan Mayor wrote:
> 
> > Thank you JM, RK, and DB for your thoughtful responses to my  
> > questions.  I am working on posting the file on the web and will  
> > respond as soon as it is publicly available - (I could email it  
> > direct to those concerned as well).  Until then here is the output  
> > requested by DB and RK as well as more information regarding sites,  
> > species, and the data:
> >
> > >ds=d[d$TYPE=="s",] #separates into just sap fungi as used in the  
> > example below
> > >dm=d[d$TYPE=="m",] #separates into just myc fungi (treated as  
> > distinct populations in my analyses)
> >
> > > table(table(ds$SPECIES))
> >
> >     0      1    2   3   5   6
> > 393 161  45   7   1   1
> > # As seen all species are not present at all the sites - this is  
> > partly why I thought to nest SPECIES within SITE.  The sites range  
> > widely from boreal tundra, to temperate California, to tropical  
> > forests of Borneo and Guyana.  Most fungi are not so cosmopolitan as  
> > to even potentially be present everywhere - let alone be collected  
> > during brief collecting forays.  One site (pine forest in CA) didn't  
> > even list  fungal names.  All fungi were listed as mycorrhizal (m),  
> > saprotrophic (s), or of unknown ecological roles (unk).  Most of the  
> > genera (or families) are present at all sites but that biologically  
> > coarseness concerns me.  If, as pointed out by RK the ideal  
> > situation would be every species present at every site then perhaps  
> > creating a new column containing families or genera will move the  
> > dataset toward that direction.
> >
> > > xtabs(~SITE,ds)
> > SITE
> >                              Aheden                             
> > Aheden 2
> >                                    
> > 4                                   0
> >    Ashiu (temp deciduous broadleaf)        Betsele
> >                                   
> > 21                                   0
> >                      Breuil, France                       Chiba
> >                                   
> > 14                                   6
> >                                   d                              
> > Flakaliden
> >                                   
> > 23                                   0
> >                 Glacier Bay, Alaska                Guyana
> >                                   4                                   
> > 20
> >                                   h                     heath  
> > tundra, subarctic Sweden
> >                                   
> > 38                                   4
> >                           Kagoshima                        
> > Kulbacksliden
> >                                    
> > 1                                   0
> >                               Kyoto                        Lamar  
> > Haines
> >                                  26                                   
> > 13
> >           Lambir (lowland tropical)               Miyajima
> >                                   
> > 14                                   1
> >                            Norikura                        Norrliden
> >                                    
> > 3                                   0
> >                             Okinawa       Ontake (subalpine  
> > coniferous)
> >                                    
> > 0                                   8
> >                               Oodai                  pine forests in  
> > CA
> >                                   2                                   
> > 25
> >                           Shirahama                     nowbowl
> >                                   2                                   
> > 13
> >                   Spruce plantation                   Stadsskogen
> >                                  17                                   
> > 13
> >                         Svartberget                     Tanigawa
> >                                    
> > 0                                   6
> > tussock tundra near Toolik Lake, AK      Vilan
> >                                    
> > 5                                   0
> >                         Woods Creek
> >                                  25                     # the zeros  
> > here are because some sites have no X13C data - only X15N
> >
> > I did try running models with centered predictors, as suggested by  
> > RK - see below.  I found that the AIC scores were much larger  
> > however when the predictors were centered using my method.  I  
> > centered by taking the mean of mycorrhizal and saprotrophic fungal  
> > groups at each site - then I subtracted each (m) or (s) fungus from  
> > those same group means within each site.  Because the (m) and (s)  
> > fungi have unique sources of carbon, and nitrogen and these are  
> > reflected in their isotope values (X13C, X15N), I deemed this  
> > centering level to be appropriatein order to preserve the magnitude  
> > of difference between the groups.
> >
> > > mix.model1  # Raw isotope values used
> > Linear mixed model fit by maximum likelihood
> > Formula: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
> >    Data: ds
> >   AIC   BIC    logLik     deviance  REMLdev
> >  1016 1041 -500.9     1002         1031
> > Random effects:
> >  Groups   Name        Variance Std.Dev.
> >  SPECIES  (Intercept) 0.62149  0.78835
> >  SITE     (Intercept) 0.34885  0.59063
> >  Residual             1.28911  1.13539
> > Number of obs: 283, groups: SPECIES, 215; SITE, 24
> >
> > Fixed effects:
> >               Estimate Std. Error t value
> > (Intercept) -2.207e+01  1.121e+00 -19.695
> > MAT         -6.418e-02  3.833e-02  -1.675
> > MAP          3.137e-05  2.100e-04   0.149
> > LAT         -8.690e-03  1.778e-02  -0.489
> >
> > Correlation of Fixed Effects:
> >           (Intr)       MAT     MAP
> > MAT  -0.707
> > MAP  -0.401  -0.247
> > LAT   -0.959  0.692   0.272
> >
> > > mix.model1a # Group centered within each site
> > Linear mixed model fit by maximum likelihood
> > Formula: STND_13c ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
> >    Data: ds
> >   AIC   BIC    logLik     deviance  REMLdev
> >  2623 2649  -1305     2609         2620
> > Random effects:
> >  Groups   Name        Variance Std.Dev.
> >  SPECIES  (Intercept) 301.9720 17.3773
> >  SITE     (Intercept)   5.1294  2.2648
> >  Residual             327.3857 18.0938
> > Number of obs: 283, groups: SPECIES, 215; SITE, 24
> >
> > Fixed effects:
> >              Estimate Std. Error t value
> > (Intercept) 3.463e+01  1.166e+01  2.9713
> > MAT         3.899e-01  4.330e-01  0.9006
> > MAP         2.543e-05  1.811e-03  0.0140
> > LAT         3.235e-01  1.867e-01  1.7324
> >
> > Correlation of Fixed Effects:
> >          (Intr)       MAT     MAP
> > MAT -0.779
> > MAP -0.230  -0.300
> > LAT  -0.955   0.732   0.116
> >
> > > anova(mix.model1,mix.model1a)
> > Data: ds
> > Models:
> > mix.model1: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
> > mix.model1a: STND_13c ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
> >                             Df      AIC         BIC          logLik   
> > Chisq Chi Df  Pr(>Chisq)
> > mix.model1.p    7      1015.85  1041.37  -500.93
> > mix.model1a.p  7      2623.49  2649.00 -1304.74     0      0       <  
> > 2.2e-16 ***
> >
> > # As seen above, the model using my centered values performed  
> > poorly.  My interpretation is that the centering removed the very  
> > variability associated with climate I am trying to predict!
> >
> > #In addition, I compared the other models mentioned by RK using the  
> > raw isotope values:
> >
> > > anova(mix.model1,mix.model2,mix.model3)
> > Data: ds
> > Models:
> > mix.model1: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
> > mix.model2: X13C ~ (MAT + MAP + LAT)^2 + (1 | SPECIES) + (1 | SITE)
> > mix.model3: X13C ~ MAT * MAP * LAT + (1 | SPECIES) + (1 | SITE)
> >                          Df     AIC         BIC            
> > logLik      Chisq    Chi Df   Pr(>Chisq)
> > mix.model1.p  7     1015.85  1041.37    -500.93
> > mix.model2.p 10    978.21    1014.66    -479.10   43.645       
> > 3       1.796e-09 ***
> > mix.model3.p 11    980.18    1020.28    -479.09   0.023         
> > 1        0.8794
> >
> > # I have refrained from trying the models mentioned by DB in order  
> > to protect his rights to not commit the "capital mistake" ;)  
> > however, the three predictors (MAT, MAP, LAT) do indeed seem to  
> > overplot - I will try to select only one perhaps in my final models.
> >
> > # Again - thank you all for your help on this.
> >
> > -- 
> > Jordan Mayor, Ph.D. Candidate
> > Ecosystem Dynamics Research Lab
> > Department of Botany, University of Florida
> > Gainesville, FL 32611
> 
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- 
Simon Blomberg, BSc (Hons), PhD, MAppStat. 
Lecturer and Consultant Statistician 
Faculty of Biological and Chemical Sciences 
The University of Queensland 
St. Lucia Queensland 4072 
Australia
Room 320 Goddard Building (8)
T: +61 7 3365 2506
http://www.uq.edu.au/~uqsblomb
email: S.Blomberg1_at_uq.edu.au

Policies:
1.  I will NOT analyse your data for you.
2.  Your deadline is your problem.

The combination of some data and an aching desire for 
an answer does not ensure that a reasonable answer can 
be extracted from a given body of data. - John Tukey.