[R-sig-eco] Cross validate model with calibrate

Jari Oksanen jari.oksanen at oulu.fi
Sat Nov 2 07:40:48 CET 2013


Dear Philipp,

I cannot see anything wrong in your approach. The only problem is that RDA may not do this job very well. It is very common that the calibrated values fall outside the original range of the data, and if the predictions are poor they can also take "unrealistic" values: RDA has no idea what is realistic for the given data.

You get cross-validated values outside the data range even in the Dune meadow data you used. Here is an example that applies 4-fold cross-validation (in line with the 3/4 subsetting in your example):

library(vegan)
data(dune, dune.env)
## assign each site to one of four folds at random
vec <- rep(1:4, length.out = nrow(dune))
pred <- rep(NA, nrow(dune))
set.seed(4711)
take <- sample(vec)
## fit RDA on three folds, calibrate A1 for the left-out fold
for (i in 1:4)
    pred[take == i] <- calibrate(rda(dune ~ A1, dune.env, subset = take != i),
                                 newdata = subset(dune, take == i))
range(pred)
## [1]  0.9104486 12.3583826
with(dune.env, range(A1))
## [1]  2.8 11.5

The predicted values extend nearly 2 cm below and almost 1 cm above the observed data range. In this case none of the predictions were negative, but that could happen, too. The tool (RDA) may not work well in all cases, but the RMSE will tell you how good RDA is for the job: if it is bad, do not use it. RDA seems to be very poor in this dune example: the RMSE is larger than the standard deviation of A1 (you could almost guess this from the low constrained eigenvalue of the model).
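
For instance, the cross-validated RMSE can be computed directly from 'pred' above and compared with the standard deviation of A1 (a minimal sketch using only objects defined in the example):

## cross-validated RMSE of the predictions from the loop above
rmse <- sqrt(mean((pred - dune.env$A1)^2))
rmse
## benchmark: the trivial "predict the mean" model has RMSE close to
## sd(A1), so an RMSE above this means RDA does worse than no model
sd(dune.env$A1)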

The cross-validation loop above uses the 'subset' argument of vegan::rda and R's subset() function to make the code a bit easier to write.
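
For comparison, the same fold loop can be written with explicit row indexing instead of the 'subset' shortcuts (an equivalent sketch, only more verbose):

## equivalent to the loop above, without 'subset'
for (i in 1:4) {
    train <- take != i
    m <- rda(dune[train, ] ~ A1, dune.env[train, ])
    pred[!train] <- calibrate(m, newdata = dune[!train, ])
}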

Cheers, Jari Oksanen


On 01/11/2013, at 15:53, Philipp01 wrote:

> Hello all,
> 
> I would like to cross-validate my rda() derived model with the
> calibrate() function (vegan package) and calculate the RMSE as a
> performance measure.
> For simplicity I use the example from the predict.cca {vegan} help.
> 
> library(ade4)
> library(vegan)
> 
> data(dune)
> data(dune.env)
> 
> nbr <- as.numeric(rownames(dune))
> 
> library(caret)
> inTrain <- createDataPartition(y=nbr, p=3/4, list=FALSE, times=1)
> 
> train.dune <- dune[inTrain[,1], ]
> test.dune  <- dune[-inTrain[,1], ]
> 
> train.dune.env <- dune.env[inTrain[,1], ]
> test.dune.env  <- dune.env[-inTrain[,1], ]
> 
> 
> mod  <- rda(train.dune ~ A1, train.dune.env)
> cal <- calibrate(mod, newdata=test.dune)
> 
> with(test.dune.env, plot(A1, cal[,"A1"] - A1, ylab="Prediction Error"))
> abline(h=0)
> 
> error <- cal - test.dune.env$A1
> (rmse <- sqrt(mean(error^2)))
> 
> When I apply this code snippet to my own data I get positive and
> negative "cal" values, which would be unrealistic for parameters such
> as tree height. Therefore, I doubt that my approach is correct. How do
> you compute the RMSE for the rda() derived model?
> 
> Regards, Philipp 


