R: Fit a Smoothing Spline

smooth.spline {stats}

R Documentation

Fit a Smoothing Spline

Description

Fits a cubic smoothing spline to the supplied data.

Usage

smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, lambda = NULL, cv = FALSE,
              all.knots = FALSE, nknots = .nknots.smspl,
              keep.data = TRUE, df.offset = 0, penalty = 1,
              control.spar = list(), tol = 1e-6 * IQR(x), keep.stuff = FALSE)

.nknots.smspl(n)

Arguments

x

a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.

y

responses. If y is missing or NULL, the responses are assumed to be specified by x, with x the index vector.

w

optional vector of weights of the same length as x; defaults to all 1.

df

the desired equivalent number of degrees of freedom (trace of the smoother matrix). Must be in (1,n_x], n_x the number of unique x values, see below.

spar

smoothing parameter, typically (but not necessarily) in (0,1]. When spar is specified, the coefficient \lambda of the integral of the squared second derivative in the fit (penalized log likelihood) criterion is a monotone function of spar, see the details below. Alternatively lambda may be specified instead of the scale free spar=s.

lambda

if desired, the internal (design-dependent) smoothing parameter \lambda can be specified instead of spar. This may be desirable for resampling algorithms such as cross validation or the bootstrap.

cv

ordinary leave-one-out (TRUE) or ‘generalized’ cross-validation (GCV) when FALSE; is used for smoothing parameter computation only when both spar and df are not specified; it is used however to determine cv.crit in the result. Setting it to NA for speedup skips the evaluation of leverages and any score.

all.knots

if TRUE, all distinct points in x are used as knots. If FALSE (default), a subset of x[] is used, specifically x[j] where the nknots indices are evenly spaced in 1:n, see also the next argument nknots.

Alternatively, a strictly increasing numeric vector specifying “all the knots” to be used; must be rescaled to [0, 1] already such that it corresponds to the ans $ fit$knots sequence returned, not repeating the boundary knots.

nknots

integer or function giving the number of knots to use when all.knots = FALSE. If a function (as by default), the number of knots is nknots(nx). By default using .nknots.smspl(), for n_x > 49 this is less than n_x, the number of unique x values, see the Note.

keep.data

logical specifying if the input data should be kept in the result. If TRUE (as per default), fitted values and residuals are available from the result.

df.offset

allows the degrees of freedom to be increased by df.offset in the GCV criterion.

penalty

the coefficient of the penalty for degrees of freedom in the GCV criterion.

control.spar

optional list with named components controlling the root finding when the smoothing parameter spar is computed, i.e., missing or NULL, see below.

Note that this is partly experimental and may change with general spar computation improvements!

low:: lower bound for spar; defaults to -1.5 (used to implicitly default to 0 in R versions earlier than 1.4).
high:: upper bound for spar; defaults to +1.5.
tol:: the absolute precision (tolerance) used; defaults to 1e-4 (formerly 1e-3).
eps:: the relative precision used; defaults to 2e-8 (formerly 0.00244).
trace:: logical indicating if iterations should be traced.
maxit:: integer giving the maximal number of iterations; defaults to 500.

Note that spar is only searched for in the interval [low, high].

tol

a tolerance for sameness or uniqueness of the x values. The values are binned into bins of size tol and values which fall into the same bin are regarded as the same. Must be strictly positive (and finite).

keep.stuff

an experimental logical indicating if the result should keep extras from the internal computations. Should allow to reconstruct the X matrix and more.

n

for .nknots.smspl; typically the number of unique x values (aka n_x).

Details

Neither x nor y are allowed to containing missing or infinite values.

The x vector should contain at least four distinct values. ‘Distinct’ here is controlled by tol: values which are regarded as the same are replaced by the first of their values and the corresponding y and w are pooled accordingly.

Unless lambda has been specified instead of spar, the computational \lambda used (as a function of s=spar) is \lambda = r \cdot 256^{3 s - 1} where r = tr(X' W X) / tr(\Sigma), \Sigma is the matrix given by \Sigma_{ij} = \int B_i''(t) B_j''(t) dt, X is given by X_{ij} = B_j(x_i), W is the diagonal matrix of weights (scaled such that its trace is n, the original number of observations) and B_k(.) is the k-th B-spline.

Note that with these definitions, f_i = f(x_i), and the B-spline basis representation f = X c (i.e., c is the vector of spline coefficients), the penalized log likelihood is L = (y - f)' W (y - f) + \lambda c' \Sigma c, and hence c is the solution of the (ridge regression) (X' W X + \lambda \Sigma) c = X' W y.

If spar and lambda are missing or NULL, the value of df is used to determine the degree of smoothing. If df is missing as well, leave-one-out cross-validation (ordinary or ‘generalized’ as determined by cv) is used to determine \lambda.

Note that from the above relation, spar is s = s0 + 0.0601 \cdot \log\lambda.

Note however that currently the results may become very unreliable for spar values smaller than about -1 or -2. The same may happen for values larger than 2 or so. Don't think of setting spar or the controls low and high outside such a safe range, unless you know what you are doing! Similarly, specifying lambda instead of spar is delicate, notably as the range of “safe” values for lambda is not scale-invariant and hence entirely data dependent.

The ‘generalized’ cross-validation method GCV will work correctly when there are duplicated points in x. However, it is ambiguous what leave-one-out cross-validation means with duplicated points, and the internal code uses an approximation that involves leaving out groups of duplicated points. cv = TRUE is best avoided in that case.

Value

An object of class "smooth.spline" with components

x

the distinct x values in increasing order, see the ‘Details’ above.

y

the fitted values corresponding to x.

w

the weights used at the unique values of x.

yin

the y values used at the unique y values.

tol

the tol argument (whose default depends on x).

data

only if keep.data = TRUE: itself a list with components x, y and w of the same length. These are the original (x_i,y_i,w_i), i = 1, \dots, n, values where data$x may have repeated values and hence be longer than the above x component; see details.

n

an integer; the (original) sample size.

lev

(when cv was not NA) leverages, the diagonal values of the smoother matrix.

cv

the cv argument used; i.e., FALSE, TRUE, or NA.

cv.crit

cross-validation score, ‘generalized’ or true, depending on cv. The CV score is often called “PRESS” (and labeled on print()), for ‘PREdiction Sum of Squares’. Note that this is not the same as the (CV or GCV) score which is minimized during fitting (and returned in crit), e.g., in the case of nx < n (where nx=n_x is the number of unique x values).

pen.crit

the penalized criterion, a non-negative number; simply the (weighted) residual sum of squares (RSS), sum(.$w * residuals(.)^2) .

crit

the criterion value minimized in the underlying .Fortran routine ‘sslvrg’. When df has been specified, the criterion is 3 + (tr(S_\lambda) - df)^2, where the 3 + is there for numerical (and historical) reasons.

df

equivalent degrees of freedom used. Note that (currently) this value may become quite imprecise when the true df is between and 1 and 2.

spar

the value of spar computed or given, unless it has been given as c(lambda = *), when it set to NA here.

ratio

(when spar above is not NA), the value r, the ratio of two matrix traces.

lambda

the value of \lambda corresponding to spar, see the details above.

iparms

named integer(3) vector where ..$ipars["iter"] gives number of spar computing iterations used.

auxMat

experimental; when keep.stuff was true, a “flat” numeric vector containing parts of the internal computations.

fit

list for use by predict.smooth.spline, with components

knot:: the knot sequence (including the repeated boundary knots), scaled into [0, 1] (via min and range).
nk:: number of coefficients or number of ‘proper’ knots plus 2.
coef:: coefficients for the spline basis used.
min, range:: numbers giving the corresponding quantities of x.

call

the matched call.

method(class = "smooth.spline") shows a hatvalues() method based on the lev vector above.

Note

The number of unique x values, \code{nx} = n_x, are determined by the tol argument, equivalently to

    nx <- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))

The default all.knots = FALSE and nknots = .nknots.smspl, entails using only O({n_x}^{0.2}) knots instead of n_x for n_x > 49. This cuts speed and memory requirements, but not drastically anymore since R version 1.5.1 where it is only O(n_k) + O(n) where n_k is the number of knots.

In this case where not all unique x values are used as knots, the result is a regression spline rather than a smoothing spline in the strict sense, but very close unless a small smoothing parameter (or large df) is used.

Author(s)

R implementation by B. D. Ripley and Martin Maechler (spar/lambda, etc).

Source

This function is based on code in the GAMFIT Fortran program by ⁠Hastie and Tibshirani (1990) (originally taken from https://lib.stat.cmu.edu/general/gamfit) which makes use of spline code by Finbarr O'Sullivan. Its design parallels the smooth.spline function of ⁠Chambers and Hastie (1992).

References

⁠Chambers JM, Hastie TJ (1992). Statistical Models in S. Chapman & Hall, London. ISBN 9780412830402.

⁠Green PJ, Silverman BW (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London. ISBN 0412300400.

⁠Hastie TJ, Tibshirani RJ (1990). Generalized Additive Models. Chapman and Hall, London. ISBN 0412343908.

Examples

require(graphics)
plot(dist ~ speed, data = cars, main = "data(cars)  &  smoothing splines")
cars.spl <- with(cars, smooth.spline(speed, dist))
cars.spl
## This example has duplicate points, so avoid cv = TRUE

lines(cars.spl, col = "blue")
ss10 <- smooth.spline(cars[,"speed"], cars[,"dist"], df = 10)
lines(ss10, lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
               "s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
       bg = 'bisque')


## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")

## consistency check:
stopifnot(all.equal(cars$dist,
                    fitted(cars.spl) + residuals(cars.spl)))
## The chosen inner knots in original x-scale :
with(cars.spl$fit, min + range * knot[-c(1:3, nk+1 +1:3)]) # == unique(cars$speed)

## Visualize the behavior of  .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)

n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)

##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx  <- seq(1, length(y18), length.out = 201)
(s2   <- smooth.spline(y18)) # GCV
(s02  <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)

## Specifying 'lambda' instead of usual spar :
(s2. <- smooth.spline(y18, lambda = s2$lambda, tol = s2$tol))



## The following shows the problematic behavior of 'spar' searching:
(s2  <- smooth.spline(y18, control =
                      list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
                      list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)

[Package stats version 4.6.0 Index]