smooth.spline {stats} | R Documentation |
Fit a Smoothing Spline
Description
Fits a cubic smoothing spline to the supplied data.
Usage
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, lambda = NULL, cv = FALSE,
all.knots = FALSE, nknots = .nknots.smspl,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x), keep.stuff = FALSE)
.nknots.smspl(n)
Arguments
x |
a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y. |
y |
responses. If |
w |
optional vector of weights of the same length as |
df |
the desired equivalent number of degrees of freedom (trace of
the smoother matrix). Must be in |
spar |
smoothing parameter, typically (but not necessarily) in
|
lambda |
if desired, the internal (design-dependent) smoothing
parameter |
cv |
ordinary leave-one-out ( |
all.knots |
if Alternatively, a strictly increasing |
nknots |
integer or |
keep.data |
logical specifying if the input data should be kept
in the result. If |
df.offset |
allows the degrees of freedom to be increased by
|
penalty |
the coefficient of the penalty for degrees of freedom in the GCV criterion. |
control.spar |
optional list with named components controlling the
root finding when the smoothing parameter Note that this is partly experimental and may change with general spar computation improvements!
Note that |
tol |
a tolerance for sameness or uniqueness of the |
keep.stuff |
an experimental |
n |
for |
Details
Neither x
nor y
are allowed to containing missing or
infinite values.
The x
vector should contain at least four distinct values.
‘Distinct’ here is controlled by tol
: values which are
regarded as the same are replaced by the first of their values and the
corresponding y
and w
are pooled accordingly.
Unless lambda
has been specified instead of spar
,
the computational \lambda
used (as a function of
s=spar
) is
\lambda = r \cdot 256^{3 s - 1}
where
r = tr(X' W X) / tr(\Sigma)
,
\Sigma
is the matrix given by
\Sigma_{ij} = \int B_i''(t) B_j''(t) dt
,
X
is given by X_{ij} = B_j(x_i)
,
W
is the diagonal matrix of weights (scaled such that
its trace is n
, the original number of observations)
and B_k(.)
is the k
-th B-spline.
Note that with these definitions, f_i = f(x_i)
, and the B-spline
basis representation f = X c
(i.e., c
is
the vector of spline coefficients), the penalized log likelihood is
L = (y - f)' W (y - f) + \lambda c' \Sigma c
, and hence
c
is the solution of the (ridge regression)
(X' W X + \lambda \Sigma) c = X' W y
.
If spar
and lambda
are missing or NULL
, the value
of df
is used to determine the degree of smoothing. If
df
is missing as well, leave-one-out cross-validation (ordinary
or ‘generalized’ as determined by cv
) is used to
determine \lambda
.
Note that from the above relation,
spar
is s = s0 + 0.0601 \cdot \log\lambda
.
Note however that currently the results may become very unreliable
for spar
values smaller than about -1 or -2. The same may
happen for values larger than 2 or so. Don't think of setting
spar
or the controls low
and high
outside such a
safe range, unless you know what you are doing!
Similarly, specifying lambda
instead of spar
is
delicate, notably as the range of “safe” values for
lambda
is not scale-invariant and hence entirely data dependent.
The ‘generalized’ cross-validation method GCV will work correctly when
there are duplicated points in x
. However, it is ambiguous what
leave-one-out cross-validation means with duplicated points, and the
internal code uses an approximation that involves leaving out groups
of duplicated points. cv = TRUE
is best avoided in that case.
Value
An object of class "smooth.spline"
with components
x |
the distinct |
y |
the fitted values corresponding to |
w |
the weights used at the unique values of |
yin |
the y values used at the unique |
tol |
the |
data |
only if |
n |
an integer; the (original) sample size. |
lev |
(when |
cv |
the |
cv.crit |
cross-validation score, ‘generalized’ or true, depending
on |
pen.crit |
the penalized criterion, a non-negative number; simply
the (weighted) residual sum of squares (RSS), |
crit |
the criterion value minimized in the underlying
|
df |
equivalent degrees of freedom used. Note that (currently)
this value may become quite imprecise when the true |
spar |
the value of |
ratio |
(when |
lambda |
the value of |
iparms |
named integer(3) vector where |
auxMat |
experimental; when |
fit |
list for use by
|
call |
the matched call. |
method(class = "smooth.spline")
shows a
hatvalues()
method based on the lev
vector above.
Note
The number of unique x
values, \code{nx} = n_x
, are
determined by the tol
argument, equivalently to
nx <- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))
The default all.knots = FALSE
and nknots = .nknots.smspl
,
entails using only O({n_x}^{0.2})
knots instead of n_x
for n_x > 49
. This cuts
speed and memory requirements, but not drastically anymore since R
version 1.5.1 where it is only O(n_k) + O(n)
where
n_k
is the number of knots.
In this case where not all unique x
values are
used as knots, the result is a regression spline rather than a
smoothing spline in the strict
sense, but very close unless a small smoothing parameter (or large
df
) is used.
Author(s)
R implementation by B. D. Ripley and Martin Maechler
(spar/lambda
, etc).
Source
This function is based on code in the GAMFIT
Fortran program by
T. Hastie and R. Tibshirani (originally taken from
https://lib.stat.cmu.edu/general/gamfit)
which makes use of spline code by Finbarr O'Sullivan. Its design
parallels the smooth.spline
function of Chambers & Hastie (1992).
References
Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.
Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall.
Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.
See Also
predict.smooth.spline
for evaluating the spline
and its derivatives.
Examples
require(graphics)
plot(dist ~ speed, data = cars, main = "data(cars) & smoothing splines")
cars.spl <- with(cars, smooth.spline(speed, dist))
cars.spl
## This example has duplicate points, so avoid cv = TRUE
lines(cars.spl, col = "blue")
ss10 <- smooth.spline(cars[,"speed"], cars[,"dist"], df = 10)
lines(ss10, lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
"s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
bg = 'bisque')
## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")
## consistency check:
stopifnot(all.equal(cars$dist,
fitted(cars.spl) + residuals(cars.spl)))
## The chosen inner knots in original x-scale :
with(cars.spl$fit, min + range * knot[-c(1:3, nk+1 +1:3)]) # == unique(cars$speed)
## Visualize the behavior of .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)
n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)
##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx <- seq(1, length(y18), length.out = 201)
(s2 <- smooth.spline(y18)) # GCV
(s02 <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)
## Specifying 'lambda' instead of usual spar :
(s2. <- smooth.spline(y18, lambda = s2$lambda, tol = s2$tol))
## The following shows the problematic behavior of 'spar' searching:
(s2 <- smooth.spline(y18, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)