[Rd] crossvalidation in svm regression in e1071 gives incorrect results (PR#8554)
Liaw, Andy
andy_liaw at merck.com
Thu Feb 2 17:28:40 CET 2006
1. This is _not_ a bug in R itself. Please don't use R's bug reporting
system for contributed packages.
2. This is _not_ a bug in svm() in `e1071'. I believe you forgot to take the
square root: the `MSE' component holds mean squared errors, not RMSEs.
3. You really should use the `tot.MSE' component rather than the mean of
the `MSE' component, although the difference is very small.
So, instead of spread[i] <- mean(mysvm$MSE), you should have spread[i] <-
sqrt(mysvm$tot.MSE). I get:
> spread <- rep(0,20)
> for (i in 1:20) {
+ spread[i] <- svm(y ~ x,data, cross=10)$tot.MSE
+ }
> summary(sqrt(spread[i]))  # after the loop i is 20, so this summarises only the last run
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2679  0.2679  0.2679  0.2679  0.2679  0.2679
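In the transcript above, spread[i] evaluated after the loop is just the last of the 20 values, which is why all six summary statistics coincide. A minimal sketch that summarises every run might look like the following. It reuses the data-generating code from the original report and assumes `e1071' is installed; the names `rmse_tot' and `rmse_fold' are made up for illustration.

```r
library(e1071)

set.seed(42)
# same example data as in the original report
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(length(x), sd = 0.2)
data <- data.frame(y = y, x = x)

rmse_tot  <- numeric(20)  # sqrt of the overall CV mean squared error (tot.MSE)
rmse_fold <- numeric(20)  # sqrt of the mean of the per-fold MSE values
for (i in 1:20) {
  fit <- svm(y ~ x, data, cross = 10)
  rmse_tot[i]  <- sqrt(fit$tot.MSE)
  rmse_fold[i] <- sqrt(mean(fit$MSE))
}

# summarise all 20 runs, not just the last one
summary(rmse_tot)
summary(rmse_fold)
```

Both vectors are on the RMSE scale, so they can be compared directly with the errorest() figures in the quoted report.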
Andy
From: no228 at cam.ac.uk
>
> Full_Name: Noel O'Boyle
> Version: 2.1.0
> OS: Debian GNU/Linux Sarge
> Submission from: (NULL) (131.111.8.96)
>
>
> (1) Description of error
>
> The 10-fold CV option for the svm function in e1071 appears to give
> incorrect results for the rmse.
>
> The example code in (3) uses the example regression data in the svm
> documentation. The rmse for internal prediction is 0.24. It is expected
> the 10-fold CV rmse should be bigger, but the result obtained using the
> "cross=10" option is 0.07. When the 10-fold CV is conducted either 'by
> hand' (not shown below) or using the errorest function in ipred (shown
> below) the answer is closer to 0.27, a more reasonable value.
>
> (2) Description of system
>
> I'm using the Debian Sarge version of R:
> R : Copyright 2005, The R Foundation for Statistical Computing
> Version 2.1.0 (2005-04-18), ISBN 3-900051-07-0
>
> svm is in the e1071 package from CRAN:
> Version: 1.5-11
> Date: 2005-09-19
>
> (3) Example code illustrating the problem
>
> library(e1071)
>
> set.seed(42)
> # create data
> x <- seq(0.1, 5, by = 0.05)
> y <- log(x) + rnorm(x, sd = 0.2)
> data <- as.data.frame(cbind(y,x))
>
> # estimate model and predict input values
> mysvm <- svm(y ~ x,data)
> result <- predict(mysvm, data)
> (rmse <- sqrt(mean((result-data[,1])**2)))
> # 0.2390489
>
> # built-in 10-fold CV estimate of prediction error
> spread <- rep(0,20)
> for (i in 1:20) {
>   mysvm <- svm(y ~ x,data,cross=10)
>   spread[i] <- mean(mysvm$MSE)
> }
> summary(spread)
> #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> # 0.06789 0.07089 0.07236 0.07310 0.07411 0.08434 (or something similar)
>
> # 10-fold CV using errorest
> library(ipred)
> mysvm <- function(formula,data) {
>   model <- svm(formula,data)
>   function(newdata) predict(model,newdata)
> }
> spread <- rep(0,20)
> for (i in 1:20) {
>   spread[i] <- errorest(y ~ x, data, model=mysvm)$error
> }
> summary(spread)
> #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> # 0.2601  0.2649  0.2673  0.2696  0.2741  0.2927
>
>
> Regards,
> Noel O'Boyle.
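
The 'by hand' cross-validation mentioned in the report was not shown. A minimal sketch of one way to do it, assuming a random assignment of rows to 10 folds and pooling the squared errors before taking the square root, could be the following. The variable names (`fold', `sq_err', `cv_rmse') are illustrative and are not taken from the original poster's code.

```r
library(e1071)

set.seed(42)
# same example data as in the report
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(length(x), sd = 0.2)
data <- data.frame(y = y, x = x)

k <- 10
# randomly assign each row to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = nrow(data)))

sq_err <- numeric(nrow(data))
for (f in seq_len(k)) {
  held_out <- fold == f
  # fit on the other k-1 folds, predict the held-out fold
  fit <- svm(y ~ x, data = data[!held_out, ])
  pred <- predict(fit, data[held_out, ])
  sq_err[held_out] <- (pred - data$y[held_out])^2
}

# pooled cross-validated RMSE: square root taken after averaging
cv_rmse <- sqrt(mean(sq_err))
cv_rmse
```

Taking the square root only at the end keeps the result on the same scale as the internal-prediction rmse computed earlier in the report.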