smooth.spline {stats} | R Documentation |
Fits a cubic smoothing spline to the supplied data.
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, lambda = NULL, cv = FALSE,
all.knots = FALSE, nknots = .nknots.smspl,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x), keep.stuff = FALSE)
.nknots.smspl(n)
x |
a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y. |
y |
responses. If |
w |
optional vector of weights of the same length as |
df |
the desired equivalent number of degrees of freedom (trace of
the smoother matrix). Must be in |
spar |
smoothing parameter, typically (but not necessarily) in
|
lambda |
if desired, the internal (design-dependent) smoothing
parameter |
cv |
ordinary leave-one-out ( |
all.knots |
if Alternatively, a strictly increasing |
nknots |
integer or |
keep.data |
logical specifying if the input data should be kept
in the result. If |
df.offset |
allows the degrees of freedom to be increased by
|
penalty |
the coefficient of the penalty for degrees of freedom in the GCV criterion. |
control.spar |
optional list with named components controlling the
root finding when the smoothing parameter Note that this is partly experimental and may change with general spar computation improvements!
Note that |
tol |
a tolerance for same-ness or uniqueness of the |
keep.stuff |
an experimental |
n |
for |
Neither x
nor y
are allowed to containing missing or
infinite values.
The x
vector should contain at least four distinct values.
‘Distinct’ here is controlled by tol
: values which are
regarded as the same are replaced by the first of their values and the
corresponding y
and w
are pooled accordingly.
Unless lambda
has been specified instead of spar
,
the computational \lambda
used (as a function of
s=spar
) is
\lambda = r \cdot 256^{3 s - 1}
where
r = tr(X' W X) / tr(\Sigma)
,
\Sigma
is the matrix given by
\Sigma_{ij} = \int B_i''(t) B_j''(t) dt
,
X
is given by X_{ij} = B_j(x_i)
,
W
is the diagonal matrix of weights (scaled such that
its trace is n
, the original number of observations)
and B_k(.)
is the k
-th B-spline.
Note that with these definitions, f_i = f(x_i)
, and the B-spline
basis representation f = X c
(i.e., c
is
the vector of spline coefficients), the penalized log likelihood is
L = (y - f)' W (y - f) + \lambda c' \Sigma c
, and hence
c
is the solution of the (ridge regression)
(X' W X + \lambda \Sigma) c = X' W y
.
If spar
and lambda
are missing or NULL
, the value
of df
is used to determine the degree of smoothing. If
df
is missing as well, leave-one-out cross-validation (ordinary
or ‘generalized’ as determined by cv
) is used to
determine \lambda
.
Note that from the above relation,
spar
is s = s0 + 0.0601 \cdot \log\lambda
,
which is intentionally different from the S-PLUS implementation
of smooth.spline
(where spar
is proportional to
\lambda
). In R's (\log \lambda
) scale, it makes more
sense to vary spar
linearly.
Note however that currently the results may become very unreliable
for spar
values smaller than about -1 or -2. The same may
happen for values larger than 2 or so. Don't think of setting
spar
or the controls low
and high
outside such a
safe range, unless you know what you are doing!
Similarly, specifying lambda
instead of spar
is
delicate, notably as the range of “safe” values for
lambda
is not scale-invariant and hence entirely data dependent.
The ‘generalized’ cross-validation method GCV will work correctly when
there are duplicated points in x
. However, it is ambiguous what
leave-one-out cross-validation means with duplicated points, and the
internal code uses an approximation that involves leaving out groups
of duplicated points. cv = TRUE
is best avoided in that case.
An object of class "smooth.spline"
with components
x |
the distinct |
y |
the fitted values corresponding to |
w |
the weights used at the unique values of |
yin |
the y values used at the unique |
tol |
the |
data |
only if |
n |
an integer; the (original) sample size. |
lev |
(when |
cv |
the |
cv.crit |
cross-validation score, ‘generalized’ or true, depending
on |
pen.crit |
the penalized criterion, a non-negative number; simply
the (weighted) residual sum of squares (RSS), |
crit |
the criterion value minimized in the underlying
|
df |
equivalent degrees of freedom used. Note that (currently)
this value may become quite imprecise when the true |
spar |
the value of |
ratio |
(when |
lambda |
the value of |
iparms |
named integer(3) vector where |
auxMat |
experimental; when |
fit |
list for use by
|
call |
the matched call. |
method(class = "smooth.spline")
shows a
hatvalues()
method based on the lev
vector above.
The number of unique x
values, \code{nx} = n_x
, are
determined by the tol
argument, equivalently to
nx <- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))
The default all.knots = FALSE
and nknots = .nknots.smspl
,
entails using only O({n_x}^{0.2})
knots instead of n_x
for n_x > 49
. This cuts
speed and memory requirements, but not drastically anymore since R
version 1.5.1 where it is only O(n_k) + O(n)
where
n_k
is the number of knots.
In this case where not all unique x
values are
used as knots, the result is a regression spline rather than a
smoothing spline in the strict
sense, but very close unless a small smoothing parameter (or large
df
) is used.
R implementation by B. D. Ripley and Martin Maechler
(spar/lambda
, etc).
This function is based on code in the GAMFIT
Fortran program by
T. Hastie and R. Tibshirani (originally taken from
http://lib.stat.cmu.edu/general/gamfit)
which makes use of spline code by Finbarr O'Sullivan. Its design
parallels the smooth.spline
function of Chambers & Hastie (1992).
Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.
Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall.
Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.
predict.smooth.spline
for evaluating the spline
and its derivatives.
require(graphics)
plot(dist ~ speed, data = cars, main = "data(cars) & smoothing splines")
cars.spl <- with(cars, smooth.spline(speed, dist))
cars.spl
## This example has duplicate points, so avoid cv = TRUE
lines(cars.spl, col = "blue")
ss10 <- smooth.spline(cars[,"speed"], cars[,"dist"], df = 10)
lines(ss10, lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
"s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
bg = 'bisque')
## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")
## consistency check:
stopifnot(all.equal(cars$dist,
fitted(cars.spl) + residuals(cars.spl)))
## The chosen inner knots in original x-scale :
with(cars.spl$fit, min + range * knot[-c(1:3, nk+1 +1:3)]) # == unique(cars$speed)
## Visualize the behavior of .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)
n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)
##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx <- seq(1, length(y18), length.out = 201)
(s2 <- smooth.spline(y18)) # GCV
(s02 <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)
## Specifying 'lambda' instead of usual spar :
(s2. <- smooth.spline(y18, lambda = s2$lambda, tol = s2$tol))
## The following shows the problematic behavior of 'spar' searching:
(s2 <- smooth.spline(y18, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)