[R-sig-eco] Regression with few observations per factor level

Fri Oct 24 09:11:49 CEST 2014

On 24/10/2014, at 09:03 AM, V. Coudrain wrote:

> Thank you all for the good discussion. To recenter the debate, in case you are interested, my data are actually 20 sample locations distributed across 5 treatments (4 locations per treatment). Each sample location has been surveyed for 4 years, so at the end of the experiment I will have a grand total of 80 samples. Including Year as an additional factor alongside treatment would, however, increase the complexity of the model and leave fewer DF, even more so if I have to account for autocorrelation. If I consider the distribution of the pooled response variable (80 points), it does not deviate much from a normal distribution, but this is not the case if I consider its distribution within each treatment.
> 
Valerie,

This is a nice description of the structure of your data. When you model your data, you should use the same structure in your model. If you ignore some features of this structure, you should have good reasons for your decision. Reaching those decisions first requires analysing the data as they are structured. Collapsing these data into, say, five (or four? how?) means does not solve any of the problems with this structure -- among other things, means ignore the temporal autocorrelation structure. (The temporal autocorrelation may be a more important aspect than Year-as-a-factor if you are absolutely uninterested in random years.) With averaging, you really lose degrees of freedom, and are easily lured to wrong conclusions. If you have five means, you can order them in 120 ways (and four means in 24 ways). Two of these are perfectly ordered (proportion 2/120 = 1/60 = 0.017 of all permutations of five points), and many more are nearly perfectly or "significantly" ordered and trick you into thinking that a linear regression would be a good solution. With five data points you just can't know.
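
The counting argument above can be checked by brute force. The sketch below is in Python (standard library only) rather than R; it enumerates every ordering of five distinct means along positions 1 to 5 and counts the perfectly monotone ones. As an added assumption that is not in the message, it also counts orderings whose Pearson correlation with position reaches 0.878 in absolute value, the usual tabulated two-sided 5% critical value for n = 5. The function names are illustrative only.

```python
from itertools import permutations
from math import sqrt


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def count_orderings(k, r_crit):
    """Enumerate all orderings of k distinct means along positions 1..k.

    Returns (total, perfectly_ordered, strongly_ordered), where
    'strongly_ordered' counts orderings with |r| >= r_crit.
    """
    positions = list(range(1, k + 1))
    total = perfect = strong = 0
    for perm in permutations(positions):
        total += 1
        r = pearson_r(positions, perm)
        if abs(r) > 1 - 1e-9:  # strictly increasing or strictly decreasing
            perfect += 1
        if abs(r) >= r_crit:
            strong += 1
    return total, perfect, strong


# Assumed threshold: tabulated two-sided 5% critical value of r for n = 5.
total5, perfect5, strong5 = count_orderings(5, r_crit=0.878)
print(total5, perfect5, perfect5 / total5, strong5)
```

Under this assumed threshold, ten of the 120 orderings reach |r| >= 0.878: the two perfect ones and the eight that differ from them by swapping one pair of neighbours (r = 0.9 or -0.9). So about one ordering in twelve of purely shuffled means would look like a "significant" linear trend.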

Cheers, Jari Oksanen

PS. I hope this threading pleases Gav -- this certainly hurts all Outlook users.
> 
> 
> 
> > Message of 24/10/14 at 04:37
> > From: "Chris Howden" 
> > To: "Gavin Simpson" , "Jari Oksanen" 
> > Cc: r-sig-ecology at r-project.org, "V. Coudrain" 
> > Subject: RE: [R-sig-eco] Regression with few observations per factor level
> > 
> >
> I don’t think the data set has only 4 data points; it has more than that, but some factor being fitted has only 4 data points per level. So it should be possible to do various residual checks at the overall model level to determine whether the model is fitting well overall and whether the normality assumptions are met overall. However, it would be quite hard to check the individual factor levels to see how they are fitted, i.e. whether a given level is under- or over-fitted, whether it is a good or bad fit, etc.
>  
> I think we need to be very careful about recommending that people consider the response and not look at the residuals, though. Some people might take this to mean they should look at the response, rather than consider its likely distribution. There are all sorts of reasons a response may not look normal while the residuals do, meaning the normality assumptions are met and the model is OK. So if one does decide to start with an LM, what’s the harm in making it a habit to always look at the residuals, and going from there if they aren’t normal?
>  
> All that said, I still wouldn’t feel comfortable using a model with only 4 data points per factor level, even if the residuals did look normal.
>  
> Chris Howden B.Sc. (Hons) GStat.
> Founding Partner
> Data Analysis, Modelling and Training
> Evidence Based Strategy/Policy Development, IP Commercialisation and Innovation
> (mobile) +61 (0) 410 689 945
> (skype) chris.howden
> chris at trickysolutions.com.au
>  
>  
> From: Gavin Simpson [mailto:ucfagls at gmail.com]
> Sent: Friday, 24 October 2014 7:15 AM
> To: Jari Oksanen
> Cc: Chris Howden; r-sig-ecology at r-project.org; V. Coudrain
> Subject: Re: [R-sig-eco] Regression with few observations per factor level
>  
> I think there are actually 4 data points per level of some factor (after seeing some of the other non-threaded emails - why can't people use email clients that preserve threads?**); but yes, either way this is a small data set, and trying to decide whether residuals are normal or not is going to be nigh on impossible.
>  
> 
> I like the suggestion that someone made to actually do some simulation to work out whether you have any power to detect an effect of a given size; it seems pointless doing the analysis if your conclusion would be "well, I didn't detect an effect, but I have no power so I don't even know if I should have been able to detect an effect if one were present". You'd be in the same position as if you hadn't run the analysis or collected the data.
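
A minimal sketch of such a power simulation, in Python (standard library only) rather than R, and under loudly stated assumptions that are not from the thread: 5 treatments with 4 independent observations each, normal errors with unit SD, a one-way ANOVA F statistic, and an effect defined as one treatment mean shifted by a given number of SDs. It ignores the repeated yearly surveys and any autocorrelation, so it illustrates the idea and is not an analysis of the actual design. The critical value is taken from the simulated null distribution instead of an F table.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of equal-sized groups."""
    k = len(groups)
    n = len(groups[0])
    group_means = [sum(g) / n for g in groups]
    grand = sum(group_means) / k
    ss_between = n * sum((m - grand) ** 2 for m in group_means)
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))


def simulate_f(true_means, n_per_group, sd, n_sim, rng):
    """Draw n_sim data sets with the given true means; return their F values."""
    out = []
    for _ in range(n_sim):
        groups = [[rng.gauss(m, sd) for _ in range(n_per_group)]
                  for m in true_means]
        out.append(f_statistic(groups))
    return out


def estimate_power(effect, n_groups=5, n_per_group=4, sd=1.0,
                   n_sim=2000, alpha=0.05, seed=1):
    """Estimated power when one treatment mean is shifted by `effect`."""
    rng = random.Random(seed)
    # Empirical critical value from the simulated null distribution.
    null_f = sorted(simulate_f([0.0] * n_groups, n_per_group, sd, n_sim, rng))
    f_crit = null_f[int((1 - alpha) * n_sim)]
    # Alternative (assumed): all treatments at 0 except the last one.
    alt_means = [0.0] * (n_groups - 1) + [effect]
    alt_f = simulate_f(alt_means, n_per_group, sd, n_sim, rng)
    return sum(f > f_crit for f in alt_f) / n_sim


for effect in (0.0, 1.0, 2.0, 3.0):
    print(effect, estimate_power(effect))
```

With an effect of 0 the estimate should sit near the nominal 5%, which is a quick sanity check on the simulation. The effect size at which the estimated power becomes acceptable shows how large a treatment difference such a design could plausibly detect.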
>  
> 
> G
>  
> 
> ** He says, hoping to heck that GMail preserves the threading information...
> 
>  
> On 23 October 2014 14:00, Jari Oksanen <jari.oksanen at oulu.fi> wrote:
> 
> > On 23/10/2014, at 18:17 PM, Gavin Simpson wrote:
> > 
> > > On 22 October 2014 17:24, Chris Howden <chris at trickysolutions.com.au> wrote:
> > >
> > >> A good place to start is by looking at your residuals  to determine if
> > >> the normality assumptions are being met, if not then some form of glm
> > >> that correctly models the residuals or a non parametric method should
> > >> be used.
> > >>
> > >
> > > Doing that could be very tricky indeed; I defy anyone, without knowledge of
> > > how the data were generated, to detect departures from normality in such a
> > > small data set. Try qqnorm(rnorm(4)) a few times and you'll see what I mean.
> > >
> > > Second, one usually considers the distribution of the response when fitting
> > > a GLM, rather than deciding that residuals from an LM are non-Gaussian and then moving on.
> > > The decision to use the GLM should be motivated directly from the data and
> > > question to hand. Perhaps sometimes we can get away with fitting the LM,
> > > but that usually involves some thought, in which case one has probably
> > > already thought about the GLM as well.
> > 
> > I agree completely with Gavin. If you have four data points and fit a two-parameter linear model and in addition select a one-parameter exponential family distribution (as implied in selecting a GLM family) you don't have many degrees of freedom left. I don't think you get such models accepted in many journals. Forget the regression and get more data. Some people suggested here that an acceptable model could be possible if your data points are not single observations but means from several observations. That is true: then you can proceed, but consult a statistician on the way to proceed.
> > 
> > Cheers, Jari Oksanen
> 
> 
> 
> >
>  
> 
> --
> Gavin Simpson, PhD