[R] help: pls package

Fri Jul 22 12:50:22 CEST 2005

wu sz writes:

> trainSet = as.data.frame(scale(trainSet, center = T, scale = T))
> trainSet.plsr = mvr(formula, ncomp = 14, data = trainSet, method = "kernelpls",
>                     CV = TRUE, validation = "LOO", model = TRUE, x = TRUE,
>                     y = TRUE)

[Two side notes here:
 1) Scaling of the data (with its sd) should be performed inside the
 cross-validation.  In the current version of 'pls', one can use
 cvplsr <- crossval(plsr(y ~ scale(X), ncomp = 14, data = mydata),
                    length.seg = 1)
 (However, 'crossval' is slower than the built-in cross-validation on
 'mvr'/'plsr'.  In the development version of the package, scaling
 within the cross-validation has been implemented in the built-in
 cross-validation.  This will hopefully be published shortly.)

 2) The 'CV' argument is from the earlier 'pls.pcr' package, and is no
 longer used.  It is silently ignored.]
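To make side note 1 concrete, here is a minimal sketch of what "scaling inside the cross-validation" means, done by hand for leave-one-out. It is an illustration under assumptions, not what 'pls' does internally: the data are simulated, and lm.fit() is only a base-R stand-in for the PLSR fit. (For ordinary least squares the scaling does not change the predictions; for PLSR it does, which is why the placement matters there.)

```r
## Hypothetical toy data; lm.fit() stands in for a PLSR fit.
set.seed(1)
n <- 20
X <- matrix(rnorm(n * 3), n, 3)
y <- drop(X %*% c(1, -1, 0.5)) + rnorm(n, sd = 0.3)

p <- numeric(n)                       # leave-one-out predictions
for (i in seq_len(n)) {
  Xtr <- X[-i, , drop = FALSE]
  ctr <- colMeans(Xtr)                # centre estimated without sample i
  sds <- apply(Xtr, 2, sd)            # sd estimated without sample i
  ## Apply the *training* centre and sd to both parts.
  Ztr <- scale(Xtr, center = ctr, scale = sds)
  Zte <- scale(X[i, , drop = FALSE], center = ctr, scale = sds)
  b <- lm.fit(cbind(1, Ztr), y[-i])$coefficients
  p[i] <- sum(c(1, Zte) * b)
}
```

Scaling the whole data set once, before the loop, would let the held-out sample influence the centre and sd used to predict it.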

> i = 1; msep_element = c()
> while(i <= length(p)){
>    msep_element[,i] = (p[i]-y)^2
>    i = i + 1
> }

Hmm...  I don't see how you got that code to run.  This should work, though:

msep_element <- (p - y)^2
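A small illustration of why no loop is needed, assuming 'p' is a matrix with one row per sample and one column per number of components, and 'y' is the response vector (the numbers are made up):

```r
y <- c(1, 2, 3, 4)
## Hypothetical predictions: column 1 and column 2 are two different models.
p <- cbind(c(1.5, 2, 2.5, 4),
           c(1,   2, 3,   5))

## y is recycled down each column, so every column gets its own residuals.
msep_element <- (p - y)^2
msep <- colMeans(msep_element)   # one MSEP per column: 0.125 and 0.25
```

This relies on length(y) being equal to nrow(p); otherwise the recycling would silently misalign the residuals.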

> msep = colMeans(msep_element)
> msep_sd = sd(msep_element)

You will get much closer to the true value with

sd(msep_element) / sqrt(length(y))
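A sketch of this naive standard error when 'msep_element' is a matrix with one column per number of components (simulated numbers, purely for illustration). As far as I know, sd() in current versions of R treats a matrix as one long vector instead of working column by column, so the sketch uses apply() to be explicit:

```r
set.seed(2)
n <- 25
y <- rnorm(n)
## Hypothetical predictions for two models of different quality.
p <- cbind(y + rnorm(n, sd = 0.5),
           y + rnorm(n, sd = 0.2))

msep_element <- (p - y)^2
msep <- colMeans(msep_element)
## Column-wise sd of the squared residuals, divided by sqrt(n).
msep_se_naive <- apply(msep_element, 2, sd) / sqrt(length(y))
```

The name 'msep_se_naive' is my own; "naive" because it treats the squared residuals as independent.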

However, this will not produce an unbiased estimate of the sd of the
estimated MSEP, because it ignores the dependencies between the
residuals.  E.g., the residual when sample 1 is predicted is not
independent of the residual when sample 2 is predicted.  In general, I
think, it will produce underestimated sds.  The effect should be
largest for small data sets.
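One way to look at this on simulated data is sketched below. It is only a rough check under assumptions: ordinary least squares replaces PLSR (so that the leave-one-out residuals can be computed as e_i / (1 - h_ii)), the response is pure noise, and the reference value is the sd of the MSEP over freshly drawn training sets, which is just one possible choice of reference.

```r
set.seed(42)

## One simulated data set: LOO MSEP and its naive standard error.
loo_msep <- function(n, k) {
  X <- cbind(1, matrix(rnorm(n * k), n, k))
  y <- rnorm(n)                        # hypothetical pure-noise response
  fit <- lm.fit(X, y)
  h <- rowSums(qr.Q(fit$qr)^2)         # leverages h_ii
  e2 <- (fit$residuals / (1 - h))^2    # squared LOO residuals
  c(msep = mean(e2), se_naive = sd(e2) / sqrt(n))
}

sims <- replicate(500, loo_msep(n = 15, k = 3))
empirical_sd  <- sd(sims["msep", ])        # spread of MSEP across data sets
mean_naive_se <- mean(sims["se_naive", ])  # what the naive formula reports
```

Comparing 'empirical_sd' with 'mean_naive_se' for a few values of n gives a feeling for how far off the naive formula is in this particular setting.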

This is the reason the pls package currently doesn't estimate the se of
cross-validated MSEPs.  There is also the question of what the
estimate should be conditioned on: for leave-one-out
cross-validation, sd(MSEP | trainData) = 0.
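The conditioning point can be seen in a small sketch: given the training data, leave-one-out has only one possible segmentation, so repeating it gives the same MSEP every time, while randomly segmented k-fold does not. The data are simulated and lm.fit() is again only a stand-in for the PLSR fit; 'cv_msep' is a helper written for this illustration.

```r
set.seed(7)
n <- 20
X <- cbind(1, matrix(rnorm(n * 2), n, 2))
y <- rnorm(n)

## seg: integer vector assigning each sample to a segment.
cv_msep <- function(seg) {
  p <- numeric(n)
  for (s in unique(seg)) {
    test <- seg == s
    b <- lm.fit(X[!test, , drop = FALSE], y[!test])$coefficients
    p[test] <- X[test, , drop = FALSE] %*% b
  }
  mean((p - y)^2)
}

loo   <- replicate(5, cv_msep(seq_len(n)))            # same segments each time
kfold <- replicate(5, cv_msep(sample(rep(1:5, 4))))   # random 5-fold segments
sd(loo)     # no variation between repeats
sd(kfold)   # varies with the random segmentation
```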

[If someone knows how to calculate unbiased estimates of the sd of
cross-validated MSEPs, please let me know. :-)]

-- 
Bjørn-Helge Mevik