[R-sig-eco] Multiple comparisons among predictors generated from same data

Bob O'Hara bohara at senckenberg.de
Fri May 25 10:41:30 CEST 2012


On 05/25/2012 10:18 AM, Gavin Simpson wrote:
> On Thu, 2012-05-24 at 15:00 -0700, J Straka wrote:
>> Hello,
>>
>> I'm planning on using a regression model to describe seed set of plants (my
>> response) using some sort of predictor based on temperature.  I have a
>> number of temperature variables calculated from the same set of data
>> (hourly temperatures for the growing season, converted to variables such as
>> average temperature, maximum temperature, minimum temperature, degree-days
>> above zero Celsius, degree days above ten Celsius, etc...), and I want to
>> decide which one should be included in my model. I know that I would
>> ideally select one based on "prior knowledge" of the system (e.g. so-called
>> "planned comparisons" or choosing a temperature threshold that is known to
>> be important for the development of seeds), but not much is known about
>> this system.
> What is the model for? Understanding, so that you want to interpret the
> coefficients directly as something meaningful, or prediction?
>
> If the latter I would say it doesn't really matter; choose the model
> that gives the best out-of-sample predictions (lowest error etc), or
> average predictions over a set of best/good models. Simply choosing the
> best model via some sort of selection procedure may result in a model
> with high variance (change the data a bit and different variables would
> be selected). If so, consider a regression method that applies shrinkage
> to the coefficients such as the lasso or the elastic net; this will lead
> to a small bit of bias in the estimates of the coefficients but should
> reduce the variance of the final model because you are considering the
> selection of variables as part of the model itself.
>
> If you want to interpret the model coefficients as something real then
> you have to be very careful doing any form of selection; the stepwise
> procedures and best subsets all can potentially lead to strong bias in
> the model coefficients. By removing a variable from the model, in effect
> you are saying that the sample estimate of the effect of that variable
> on the response is 0, not some small (statistically insignificant)
> value.
>
> This is a very tricky thing to get right and I'm not sure I know the
> right answer (or even if there is one!?).
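If prediction is the goal, the out-of-sample comparison Gavin describes could be sketched roughly as below. This is only an illustration in Python with numpy: the data are simulated, and the names (mean_temp, max_temp, min_temp, seed_set, cv_rmse) are made up for the example, not taken from the original data. It uses plain k-fold cross-validation of single-predictor linear models, not the lasso or elastic net.

```python
import numpy as np

def cv_rmse(x, y, k=5, seed=0):
    """K-fold cross-validated RMSE of the simple linear regression y ~ x."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    sq_err = []
    for test in folds:
        train = np.setdiff1d(idx, test)  # all rows not in the held-out fold
        X_train = np.column_stack([np.ones(len(train)), x[train]])
        beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        pred = beta[0] + beta[1] * x[test]
        sq_err.append((y[test] - pred) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_err))))

# Simulated data (an assumption for illustration only): three correlated
# temperature summaries and a response driven by the mean temperature.
rng = np.random.default_rng(42)
n = 60
mean_temp = rng.normal(12, 3, n)
max_temp = mean_temp + rng.normal(8, 1.5, n)
min_temp = mean_temp - rng.normal(6, 1.5, n)
seed_set = 5 + 2.0 * mean_temp + rng.normal(0, 2, n)

candidates = {"mean_temp": mean_temp, "max_temp": max_temp, "min_temp": min_temp}
scores = {name: cv_rmse(x, seed_set) for name, x in candidates.items()}
best = min(scores, key=scores.get)  # candidate with the lowest CV error
print(scores, best)
```

With real data the differences between candidates may well be small relative to the fold-to-fold noise, which is the instability Gavin warns about.
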
An additional complication here is that the variables are going to be
correlated, so a model with all or most of them in it could be unstable.
If a single temperature variable is enough, then I'd suggest either
trying your best to pick one, or using what everyone else uses (GDD5?),
so the study is comparable with others.
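
To get a feel for how strongly the candidate variables are correlated, something like the following could be used. It is a rough sketch in Python with numpy on simulated data; the variable names and the vif helper are made up for the illustration. It computes the correlation matrix and a variance inflation factor for each variable (1 / (1 - R^2) from regressing it on the others).

```python
import numpy as np

# Simulated temperature summaries (an assumption for illustration only).
rng = np.random.default_rng(1)
n = 60
mean_temp = rng.normal(12, 3, n)
max_temp = mean_temp + rng.normal(8, 1, n)
min_temp = mean_temp - rng.normal(6, 1, n)
X = np.column_stack([mean_temp, max_temp, min_temp])

# Pairwise correlations between the candidate predictors.
corr = np.corrcoef(X, rowvar=False)

def vif(X):
    """Variance inflation factor of each column of X."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()  # R^2 of column j on the other columns
        out.append(1 / (1 - r2))
    return np.array(out)

vifs = vif(X)
print(np.round(corr, 2))
print(np.round(vifs, 1))
```

Large values mean the coefficient for that variable is poorly determined once the others are in the model; where to draw the line is a judgement call.
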

Once you have a model, it might be worth checking to see if the other 
variables tell a different story. If it's the same story but with 
different p-values, you might as well stick to the original analysis.
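
One rough way to check whether the other variables tell the same story is to refit the model with each variable in turn, on a standardized scale, and compare the direction and size of the slopes. The sketch below is in Python with numpy on simulated data; the names (gdd0, gdd5, fit_one and so on) are placeholders for the illustration, and agreeing signs is only one possible reading of "the same story".

```python
import numpy as np

def fit_one(x, y):
    """OLS of y on a standardized predictor; returns (slope per SD, R^2)."""
    z = (x - x.mean()) / x.std(ddof=1)
    X = np.column_stack([np.ones(len(y)), z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return float(beta[1]), float(r2)

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(7)
n = 60
mean_temp = rng.normal(12, 3, n)
gdd0 = 150 * mean_temp + rng.normal(0, 100, n)
gdd5 = 150 * (mean_temp - 5) + rng.normal(0, 120, n)
seed_set = 5 + 2.0 * mean_temp + rng.normal(0, 2, n)

candidates = {"mean_temp": mean_temp, "gdd0": gdd0, "gdd5": gdd5}
results = {name: fit_one(x, seed_set) for name, x in candidates.items()}

# Crude check: do all the slopes point in the same direction?
same_direction = len({np.sign(slope) for slope, _ in results.values()}) == 1
for name, (slope, r2) in results.items():
    print(f"{name}: slope per SD = {slope:.2f}, R^2 = {r2:.2f}")
print("same direction:", same_direction)
```
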

Bob

-- 

Bob O'Hara

Biodiversity and Climate Research Centre
Senckenberganlage 25
D-60325 Frankfurt am Main,
Germany

Tel: +49 69 798 40226
Mobile: +49 1515 888 5440
WWW:   http://www.bik-f.de/root/index.php?page_id=219
Blog: http://blogs.nature.com/boboh
Journal of Negative Results - EEB: www.jnr-eeb.org


