[R-sig-eco] Question About Syntax For Complex ANOVA Design

Mon Nov 10 11:12:37 CET 2008

Hi Joe

Is time a continuous variable or a factor?

The thing is that the terms ARE nested. The nesting is defined by the random effects structure. The fixed effects slot into that. They way this happens is defined by the coding in the data. So I assume you have something like (simplified):

site	coast	MBL	HSP
A	E	U	..
A	E	L	..
A	E	U	..
A	E	L	..
B	W	U	..
B	W	L	..
B	W	U	..
B	W	L	..

>From this it follows that coast is a site level covariate, in that each site can only have one coast. But MBL is nested within site as the levels change within both A and B. lme is clever enough to work this out.

On the hypotheses, I'd be worried about the single factor H0's. Ignoring time as a fixed effect for a moment, and hence treating the measures through time as replicates, you are basically interested in:

 lme(HSP~coast*MBL, random= ~1|site) 

The coast * ML term tests for HSP high/low dependent on coast. To test this fit the full model with method = ML and compare it to  lme(HSP~coast+MBL, random= ~1|site, method ="ML") using anova(model1, model2). There are alot of technical issues with testing both fixed and random effects in mixed models, for details see past posts on the R-sig-ME list and also on the R Wiki (http://wiki.r-project.org/rwiki/doku.php?id=guides:lmer-tests). But lets ignore those, this should do OK.  

The if coast*ML is significant then no need to go any further. If it isn't then repeat from the coast+MBL, deleting one of those fixed terms and repeating.

To test whether the effects are conditional on site, test for the significance of the site random effect, again by fitting models with and without it (no need to use method = "ML" here).

With the time effect, things partly depend on whether it can be characterised as a factor or continuous covariate. But whatever, I don't think it makes sense to have time in random component without it being in the fixed part. The easiest rationale is where you have time as continuous. Then it could make some sense to have

 lme(HSP~coast+MBL+time, random= ~time|site) 

Here you have an overall time trend common to all sites, and then each site is allowed to vary around that. If time is a factor, then that formulation would indicate that there were different variances around the different time factor levels ie some times were more variable than others: I think there's an example in Pinhero and Bates relating to growth curves of boys and girls but boys are more variable than girls. But as I said before, it's difficult enough fitting random slope models, and cross-levle interactions between fixed effects when you have alot of data so I worry whether there are enough data in this instance. The added on top of this is whether there are degrees of freedom left to fit the models at all when you include time as a covariate. 

In terms of how the models are fitted, they are different from anovas, they're fitted by maximum likelihood techniques and are more akin to glms. This makes model comparison simpler in some respects, although the whole degrees of freedom issue which seems so simple in some packages ends up as a bit of a minefield, basically there's no "right" answer.

I think the best thing might be to have a go at fitting the models as I've described and let us know how you get on. More heavyweight and authoritative advice can also be sought from the R-sig-ME list, but again, I'd suggest fitting the models first and posting a reproducible example, you can save the commands, and include the date easily with the dput command.

regards

Mike

>>> Joe Simonis <jls468 at cornell.edu> 07/11/2008 19:13 >>>
Hi Mike,

    Thanks for the response, hopefully we can get this thing figured 
out.  And I hope things will be easier than I think--that's usually a 
good thing!  However, I do wonder if I didn't explain myself fully (too 
often the case).  My apologies for not clearly stating the hypotheses.  
Basically, the question is 'does HSP differ between high and low 
intertidal locations, and does that relationship depend on the site, the 
coast, and the time of year?'  That can be broken up into a number of 
single-factor H0s. 

    I'm not a marine person, but I'm pretty sure that the high-low MBL 
difference is well understood, since mussels in the high intertidal are 
exposed more to temperature stress, so they should be producing more 
heat shock protein (HSP).  However, this relationship likely varies 
depending on a number of things that vary across sites (geology, tidal 
action, etc).  There is almost certainly a time signature in production 
related to climatic cycles, so that's why they sampled 3 times during 
the year.  So, I guess they're mostly interested in how the MBL 
difference changes across space and time. 

    So, yes, I do want to know if time trends are different at the 
different sites (sorry for not articulating that).  So, in that case, I 
should have 'time' in the random part of the equation, correct?

    As for the other terms, I am confused by the way you suggested 
including them in the model, especially with regards to their not being 
nested (in your formulation).  The sampling regime (MBL within site 
within coast) is nested, and so the data should be analyzed as such.  
 From the way I understand nestedness in R, I have to write it out as 
c/b/a where a is nested in b is nested in c.  So I guess I am confused 
as to if     lme(HSP~coast+MBL, random= ~time|site)  would account for 
that or not.  And I'm aware that it's kind of silly to worry about 
random vs fixed effects with few levels.  However, from my understanding 
of nested ANOVA, that makes a big difference in the way that the F 
ratios are calculated (i.e. what is used to construct the 
denominator).   Since site is random, MS coast should be compared to MS 
site, rather than MS residual.  Again, that's at least from my 
understanding of how it works in the old-school way, perhaps the REML 
fitting does it differently, and so it wouldn't matter? 

    Thanks!

--Joe

Mike Dunbar wrote:
> Hi Joe
>
> I think the command you want is probably simpler than you think:
>
>  lme(HSP~coast*MBL, random= ~1|site)     
> or
>  lme(HSP~coast+MBL, random= ~1|site)     
>
> coast and MBL have distinct levels so are fixed and site is random as you say.
> Having site as random will take into account that there are repeated measures through time at each point (MBL).
> Each site has two points (MBLs). lme will treat coast and MBL correctly providing that they are coded correctly. You can check this by running the analysis and looking at the degrees of freedom for the fixed effects. It's best not to thing to hard about the structure in terms of fixed/random/fixed: even though other stats packages might encourage this. Think of the random effects (including the error term)  providing the structure and the fixed effects slotting into that structure accordingly.
> If you write random = ~time|site then you are saying random slopes for the time fixed effect, i.e. there is an overall time trend and each site responds differently around that trend. I don't think this is what you want as you don't specifically mention time trends.
>
> Or if time of year is a factor, something like
> lme(HSP~coast+MBL+time, random= ~1|site)    
> But the problem here is that you may run out of replication to estimate any of the fixed effects, each combination of coast, MBL, site and time is unique. 
>
> Also BUT BUT, even if you allow the three time samples to be replicates:
> You are potentially going to have an issue if you only have four sites. This is not alot to estimate a random effect. One option is to treat site as fixed. There is an argument that site is indeed random, so it should be treated as random, in which case I'm not sure that lme (or the newer lmer) will handle the full uncertainty for small sample sizes correctly). To do that would need a more fully Bayesian approach. But I'm writing that from memory, I don't have the reference to hand. 
>
> Finally, it all depends on what the hypothesis you are trying to test: what's the hypothesis?
>
> regards
>
> Mike
>
>
>
>
>
>   
>>>> Joe Simonis <jls468 at cornell.edu> 07/11/2008 15:53 >>>
>>>>         
> Hey Everyone,
>
>     I'm helping a friend out with analyzing some of her data, and I 
> haven't run an ANOVA like this in a while, and especially not in R.  I'm 
> having a bit of trouble figuring out the correct syntax and so I was 
> hoping to get feedback.  Any input would be welcomed.  As of now, I also 
> don't have the data, but I've been told that sample size should be equal 
> for all of the combinations (although that may not be true).  In any 
> case, for now, let's assume all sample sizes are equal.
>
>     The basics of the mensurative experiment are as follows:
>
>     The study was looking at variation in physiological values (HSP) of 
> intertidal mussels across a few different sites at three different times 
> of year.  The sampling was done in New Zealand, with 2 sites sampled on 
> each of the East and West coasts, and within each site, there were two 
> sampling points (mussel bed location, MBL), one low in the intertidal 
> one high in the intertidal.  There are two levels of MBL (low and high) 
> at each site and there are two sites for each cost.  I see this as MBL 
> nested in site, nested in coast.  However, it seems to me that only site 
> is a random factor.  Both MBLs were picked specifically at that site and 
> were done so in a way to compare high to low locations, so that seems 
> fixed to me.  Site was picked more to look at site-to-site variation 
> (i.e random factor).  And coasts were explicitly being compared (i.e. 
> fixed factor). 
>
>     So, I see that as a fixed factor nested in a random factor  nested 
> in a fixed factor.  Does that make sense?  And then there's the bit 
> about repeated measures, since they sampled mussels from each MBL 3 
> times.  I don't think that necessarily complicates things too much, but 
> maybe it does?  I can put together the ANOVA table on paper in the way 
> I'd like to analyze the data (working off of examples in Quinn and 
> Keough on pg 314, but with an additional level of nestedness).  However, 
> I am pretty lost on how to code the syntax for the analysis in R.  I've 
> had a few different ideas, but none of them really seem correct to me.  
> I think the biggest problem for me is figuring out how to keep the 
> structure of the nestedness in tact despite the fact that some factors 
> are fixed and some are random. 
>
>     The best I could come up with so far is     lme(HSP~MBL, random= 
> ~time|coast/site)     
>
>     But it doesn't seem really right to me.  I have MBL in the fixed 
> part of the model, but when it's like that, I think the nestedness gets 
> lost (since there's no cost/site/MBL anywhere). 
>
>     Again, I would appreciate any insights into the syntax or the 
> general way I am approaching this analysis.  I've been trying to piece 
> the stuff together from nested analyses and time-series analyses in 
> Crawley's book, but I'm just not getting it.
>
>     Thanks a bunch in advance!!
>
> --Joe
>
>   

-- 
Joseph L. Simonis

Cornell University
Department of Ecology & Evolutionary Biology
E231 Corson Hall
Ithaca, NY 14853 USA
email:  jls468 at cornell.edu 

http://www.people.cornell.edu/pages/jls468/ 

-- 
This message (and any attachments) is for the recipient ...{{dropped:6}}