NCV {mgcv} | R Documentation |
Neighbourhood Cross Validation
Description
NCV estimates smoothing parameters by optimizing the average ability of a model to predict subsets of data when subsets of data are omitted from fitting. Usually the predicted subset is a subset of the omitted subset. If both subsets are the same single datapoint, and the average is over all datapoints, then NCV is leave-one-out cross validation. QNCV is a quadratic approximation to NCV, guaranteed finite for any family link combination.
In detail, suppose that a model is estimated by minimizing a penalized loss
\sum_i D(y_i,\theta_i) + \sum_j \lambda_j \beta^{\sf T} {S}_j \beta
where D
is a loss (such as a negative log likelihood), dependent on response y_i
and parameter vector \theta_i
, which in turn depends on covariates via one or more smooth linear predictors with coefficients \beta
. The quadratic penalty terms penalize model complexity: S_j
is a known matrix and \lambda_j
an unknown smoothing parameter. Given smoothing parameters the penalized loss is readily minimized to estimate \beta
.
The smoothing parameters also have to be estimated. To this end, choose k = 1,\ldots,m
subsets \alpha(k)\subset \{1,\ldots,n\}
and \delta(k)\subset \{1,\ldots,n\}
. Usually \delta(k)
is a subset of (or equal to) \alpha(k)
. Let \theta_i^{\alpha(k)}
denote the estimate of \theta_i
when the points indexed by \alpha(k)
are omitted from fitting. Then the NCV criterion
V = \sum_{k=1}^m \sum_{i \in \alpha(k)} D(y_i,\theta_i^{\alpha(k)})
is minimized w.r.t. the smoothing parameters, \lambda_j
. If m=n
and \alpha(k)=\delta(k)=k
then ordinary leave-one-out cross validation is recovered. This formulation covers many of the variants of cross validation reviewed in Arlot and Celisse (2010), for example.
Except for a quadratic loss, V
can not be computed exactly, but it can be computed to O(n^{-2})
accuracy (fixed basis size), by taking single Newton optimization steps from the full data \beta
estimates to the equivalent when each \alpha(k)
is dropped. This is what mgcv
does. The Newton steps require update of the full model Hessian to the equivalent when each datum is dropped. This can be achieved at O(p^2)
cost, where p
is the dimension of \beta
. Hence, for example, the ordinary cross validation criterion is computable at the O(np^2)
cost of estimating the model given smoothing parameters.
The NCV score computed in this way is optimized using a BFGS quasi-Newton method, adapted to the case in which smoothing parameters tending to infinity may cause indefiniteness.
Spatial and temporal short range autocorrelation
A routine applied problem is that smoothing parameters tend to be underestimated in the presence of un-modelled short range autocorrelation, as the smooths try to fit the local excursions in the data caused by the local autocorrelation. Cross validation will tend to 'fit the noise' when there is autocorellation, since a model that fits the noise in the data correlated with an omitted datum, will also tend to closely fit the noise in the omitted datum, because of the correlation. That is autocorrelation works against the avoidance of overfit that cross validation seeks to achieve.
For short range autocorrelation the problems can be avoided, or at least mitigated, by predicting each datum when all the data in its ‘local’ neighbourhood are omitted. The neighbourhoods being constructed in order that un-modelled correlation is minimized between the point of interest and points outside its neighbourhood. That is we set m=n
, \delta(k)=k
and \alpha(k) = {\tt nei}(k)
, where nei(k)
are the indices of the neighbours of point k
. This approach has been known for a long time (e.g. Chu and Marron, 1991; Robert et al. 2017), but was previously rather too expensive for regular use for smoothing parameter estimation.
Specifying the neighbourhoods
The neighbourhood subsets \alpha(k)
and \delta(k)
have to be supplied to gam
, and the nei
argument does this. It is a list with the following arguments.
-
k
is the vector of indices to be dropped for each neighbourhood. -
m
gives the end of each neighbourhood. Sonei$k[(nei$m[j-1]+1):nei$m[j]]
gives the points dropped for the neighbourhoodj
: that is\alpha(j)
. -
i
is the vector of indices of points to predict. -
mi
gives the corresponding endpointsmi
. Sonei$i[(nei$mi[j-1]+1):nei$mi[j]]
indexes the points to predict for neighbourhood j: that is\delta(j)
. -
jackknife
is an optional element. If supplied andTRUE
then variance estimates are based on the raw Jackkife estimate, ifFALSE
then on the standard Bayesian results. If not supplied (usual) then an estimator accounting for the neighbourhood structure is used, that largely accounts for any correlation present within neighbourhoods.jackknife
is ignored if NCV is being calculated for a model where another method is used for smoothing parameter selection.
If nei==NULL
(or k
or m
are missing) then leave-one-out cross validation is used. If nei
is supplied but NCV is not selected as the smoothing parameter estimation method, then it is simply computed (but not optimized).
Numerical issues
If a model is specified in which some coefficient values, \beta
, have non-finite likelihood then the NCV criterion computed with single Newton steps could also be non-finite. A simple fix replaces the NCV criterion with a quadratic approximation to the criterion around the full data fit. The quadratic approximation is always finite. This 'QNCV' is essential for some families, such as gevlss
.
Although the leading order cost of NCV is the same as REML or GCV, the actual cost is higher because the dominant operations costs are in matrix-vector, rather than matrix-matrix, operations, so BLAS speed ups are small. However multi-core computing is worthwhile for NCV. See the option ncv.threads
in gam.control
.
Author(s)
Simon N. Wood simon.wood@r-project.org
References
Chu and Marron (1991) Comparison of two bandwidth selectors with dependent errors. The Annals of Statistics. 19, 1906-1918
Arlot, S. and A. Celisse (2010). A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40-79
Roberts et al. (2017) Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40(8), 913-929.
Wood S.N. (2023) On Neighbourhood Cross Validation. in prep.
Examples
require(mgcv)
nei.cor <- function(h,n) { ## construct nei structure
nei <- list(mi=1:n,i=1:n)
nei$m <- cumsum(c((h+1):(2*h+1),rep(2*h+1,n-2*h-2),(2*h+1):(h+1)))
k0 <- rep(0,0); if (h>0) for (i in 1:h) k0 <- c(k0,1:(h+i))
k1 <- n-k0[length(k0):1]+1
nei$k <- c(k0,1:(2*h+1)+rep(0:(n-2*h-1),each=2*h+1),k1)
nei
}
set.seed(1)
n <- 500;sig <- .6
x <- 0:(n-1)/(n-1)
f <- sin(4*pi*x)*exp(-x*2)*5/2
e <- rnorm(n,0,sig)
for (i in 2:n) e[i] <- 0.6*e[i-1] + e[i]
y <- f + e ## autocorrelated data
nei <- nei.cor(4,n) ## construct neighbourhoods to mitigate
b0 <- gam(y~s(x,k=40)) ## GCV based fit
gc <- gam.control(ncv.threads=2)
b1 <- gam(y~s(x,k=40),method="NCV",nei=nei,control=gc)
## use "QNCV", which is identical here...
b2 <- gam(y~s(x,k=40),method="QNCV",nei=nei,control=gc)
## plot GCV and NCV based fits...
f <- f - mean(f)
par(mfrow=c(1,2))
plot(b0,rug=FALSE,scheme=1);lines(x,f,col=2)
plot(b1,rug=FALSE,scheme=1);lines(x,f,col=2)