[Rd] Bug in glm.fit? (PR#1331)
ripley@stats.ox.ac.uk
ripley@stats.ox.ac.uk
Fri, 5 Apr 2002 16:13:31 +0100 (BST)
Berwin,
Thank you for this.
Unfortunately, although the analysis is mainly correct, but the solution
has almost as many problems as the current state. Suppose pivoting
occurs. Then for the internal code, coef needs to contain zeroes at the
omitted cols. But you have
coef <- coefold <- start # initialise coef here in order to
# be able to subset it below
so it will contain whatever start had for those cols. Further, the test
if (is.na(dev) || any(is.na(coef))) {
is followed by
coef <- (coef + coefold)/2
eta <- drop(x %*% coef)
which will ensure that if any NAs occur in coef, they will be propogated
through eta.
A related point is that you assume that coefold <- rep(0,nvars) is
a valid starting value to shrink towards, but that need not be the case.
I think one can only shrink towards the previous value after the first
iteration. (Only if start is specified is the validity checked:
an initial etastart or mustart is normally not in the space of
possible values under the model.)
There seems to me no point to test if `start' is valid if it is going to
be unused (as it will be if a valid eta is specified), so I believe the
original test logic is the correct one, not the unravelled one.
There are clearly problems in the glm.fit code, but in the absence of a
bug that has actually bitten someone, it seems only worth changing if we
have a 100% watertight solution. As I said to Brett Presnell, I believe
the time would be better spent re-writing glm ab initio.
Also, the code you submitted had all the comments removed, and so was much
harder to read than the code in glm.R, as well as being subtly
reformatted.
Brian
On Wed, 27 Feb 2002 berwin@maths.uwa.edu.au wrote:
> G'day all,
>
> I had a look at the GLM code of R (1.4.1) and I believe that there are
> problems with the function "glm.fit" that may bite in rare
> circumstances. Note, I have no data set with which I ran into
> trouble. This report is solely based on having a look at the code.
>
> Below I append a listing of the glm.fit function as produced by my
> system. I have added line numbers so that I can easily refer to the
> code that is, in my opinion, problematic. I also append a modified
> version that, hopefully, solves these problems. At least on
> "example(glm)" both functions produce the same output (if the random
> number generator is set to the same seed).
>
> Short summary: I envisage problems with glm.fit if two conditions are
> met simultaneously. Namely, (1) the design matrix is such that the QR
> algorithm used in the weighted least squares (WLS) step has to pivot
> columns (lines 81-86) and (2) during the iterations the step size has
> to be truncated (either lines 98-114 or lines 115-130).
>
> More details:
> In line 90, the variables "start" and "coef" are set to the
> coefficients of the WLS and "start" is then permuted (line 91) to take
> a possible pivoting of the QR algorithm into account. Note that
> "coef" is not permutated. Also, since the weights in the iterative
> weighted least squares algorithm change in each iteration (lines
> 60-139) it is theoretically possible that on some executions of this
> chunk pivoting occurs and on others there is no pivoting.
>
> Hoever, if the new coefficients are unsuitable and either the block at
> lines 98-114 or lines 115-130 is entered, then "start" is modified by
> shrinking it toward "coefold". There are two problems:
> 1) If this happens the first time the loop in lines 60-139 is executed,
> then coefold is initialised to the value of "start" that was passed
> down to the routine (line 58). This might well be NULL (usually is?).
> In this case line 105 and line 122 would produce a new "start" which
> is numeric(0) and glm.fit would stop execution pretty soon thereafter.
> 2) If it happens during a later execution of the loop in lines 60-139,
> then "coefold" was set to the value "coef" of the prior execution
> of the loop (line 137). Depending on whether either of the code
> chunks at lines 98-114 or lines 115-130 have been executed on that
> previous occastion, "coefold" may or may not be permuted to the same
> order as "start" in this moment. Hence, it could happen that "start"
> is shrunken towards something that doesn't make sense in line 105 or
> line 122.
>
> Relating to point 2) above. Theoretically, during the main loop
> (lines 60-139) either of the chunks in lines 98-114 or lines 115-130
> could be entered. In these chunks "coef" is set to the shrunken
> "start" at the end (line 111 or line 126). If, with the ``deviance
> for the shrunken "start"'' the convergence criterion in line 131 is
> fulfilled, then the main loop would finish. In other words, at line
> 140 there is no way to tell whether "coef" is in the order of
> "fit$coefficients" or has been permutated to have the same order as
> "start". In the later case, the additional permutation in line 157
> would garble the solution.
>
> I hope that the revised version that I append below, works around all
> these potential problems without breaking anything. As I said, I ran
> "example(glm)" with the revised version and got the same results. But
> I guess you guys have far more exhaustive tests and may want to do
> some further test to ensure that nothing is broken.
>
> The revised version also makes some cosmetic changes that reflect my
> preferences :-) :
> 1) Given the default values of the parameters "weights" and "offset" I
> was surprised about the tests in lines 17 and 19. They seemed
> superfluous. Until I noticed that "glm" can call "glm.fit" with
> actual arguments for these parameters that are NULL. Hence, I suggest
> that the default values for these arguments should be NULL. This
> makes it easier to understand the code of "glm.fit" (if one looks only
> at "glm.fit" and not at all the code "glm", "glm.fit" &c
> simultaneously).
> 2) As mentioned above, "coefold" may be initialised to NULL in line 58
> if the parameter "start" is not specified. (Which typically would be
> the case?) But furthermore, in lines 44-53 the test for the parameter
> "start" is mangled into the test for the parameter "etastart". If
> "etastart" is not specified but a bad "start" (not the correct
> length), the bad "start" would be caught. But if a valid "etastart"
> is specified, then a bad "start" would not be caught since those tests
> are not executed. (Are there users that specify both?). Anyway, I
> have untangled those sanity check. Also to get a clean initialisation
> for "coefold".
> 3) Given that the column names of fit$qr are (correctly) set in line
> 171, I didn't see the purpose of line 155. That line anyhow sets
> the column names wrongly if pivoting has occured. So I deleted that
> line.
>
> That's all for now. Thanks for the great package and I hope that this
> bug report will be useful.
>
> Cheers,
>
> Berwin
>
> ========================= original glm.fit =========================
> 1 glm.fit <-
> 2 function (x, y, weights = rep(1, nobs), start = NULL, etastart = NULL,
> 3 mustart = NULL, offset = rep(0, nobs), family = gaussian(),
> 4 control = glm.control(), intercept = TRUE)
> 5 {
> 6 x <- as.matrix(x)
> 7 xnames <- dimnames(x)[[2]]
> 8 ynames <- names(y)
> 9 conv <- FALSE
> 10 nobs <- NROW(y)
> 11 nvars <- NCOL(x)
> 12 if (nvars == 0) {
> 13 cc <- match.call()
> 14 cc[[1]] <- as.name("glm.fit.null")
> 15 return(eval(cc, parent.frame()))
> 16 }
> 17 if (is.null(weights))
> 18 weights <- rep(1, nobs)
> 19 if (is.null(offset))
> 20 offset <- rep(0, nobs)
> 21 variance <- family$variance
> 22 dev.resids <- family$dev.resids
> 23 aic <- family$aic
> 24 linkinv <- family$linkinv
> 25 mu.eta <- family$mu.eta
> 26 if (!is.function(variance) || !is.function(linkinv))
> 27 stop("illegal `family' argument")
> 28 valideta <- family$valideta
> 29 if (is.null(valideta))
> 30 valideta <- function(eta) TRUE
> 31 validmu <- family$validmu
> 32 if (is.null(validmu))
> 33 validmu <- function(mu) TRUE
> 34 if (is.null(mustart)) {
> 35 eval(family$initialize)
> 36 }
> 37 else {
> 38 mukeep <- mustart
> 39 eval(family$initialize)
> 40 mustart <- mukeep
> 41 }
> 42 if (NCOL(y) > 1)
> 43 stop("y must be univariate unless binomial")
> 44 eta <- if (!is.null(etastart) && valideta(etastart))
> 45 etastart
> 46 else if (!is.null(start))
> 47 if (length(start) != nvars)
> 48 stop(paste("Length of start should equal", nvars,
> 49 "and correspond to initial coefs for", deparse(xnames)))
> 50 else as.vector(if (NCOL(x) == 1)
> 51 x * start
> 52 else x %*% start)
> 53 else family$linkfun(mustart)
> 54 mu <- linkinv(eta)
> 55 if (!(validmu(mu) && valideta(eta)))
> 56 stop("Can't find valid starting values: please specify some")
> 57 devold <- sum(dev.resids(y, mu, weights))
> 58 coefold <- start
> 59 boundary <- FALSE
> 60 for (iter in 1:control$maxit) {
> 61 good <- weights > 0
> 62 varmu <- variance(mu)[good]
> 63 if (any(is.na(varmu)))
> 64 stop("NAs in V(mu)")
> 65 if (any(varmu == 0))
> 66 stop("0s in V(mu)")
> 67 mu.eta.val <- mu.eta(eta)
> 68 if (any(is.na(mu.eta.val[good])))
> 69 stop("NAs in d(mu)/d(eta)")
> 70 good <- (weights > 0) & (mu.eta.val != 0)
> 71 if (all(!good)) {
> 72 conv <- FALSE
> 73 warning(paste("No observations informative at iteration",
> 74 iter))
> 75 break
> 76 }
> 77 z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
> 78 w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
> 79 ngoodobs <- as.integer(nobs - sum(!good))
> 80 ncols <- as.integer(1)
> 81 fit <- .Fortran("dqrls", qr = x[good, ] * w, n = as.integer(ngoodobs),
> 82 p = nvars, y = w * z, ny = ncols, tol = min(1e-07,
> 83 control$epsilon/1000), coefficients = numeric(nvars),
> 84 residuals = numeric(ngoodobs), effects = numeric(ngoodobs),
> 85 rank = integer(1), pivot = 1:nvars, qraux = double(nvars),
> 86 work = double(2 * nvars), PACKAGE = "base")
> 87 if (nobs < fit$rank)
> 88 stop(paste("X matrix has rank", fit$rank, "but only",
> 89 nobs, "observations"))
> 90 start <- coef <- fit$coefficients
> 91 start[fit$pivot] <- coef
> 92 eta <- drop(x %*% start)
> 93 mu <- linkinv(eta <- eta + offset)
> 94 dev <- sum(dev.resids(y, mu, weights))
> 95 if (control$trace)
> 96 cat("Deviance =", dev, "Iterations -", iter, "\n")
> 97 boundary <- FALSE
> 98 if (is.na(dev) || any(is.na(coef))) {
> 99 warning("Step size truncated due to divergence")
> 100 ii <- 1
> 101 while ((is.na(dev) || any(is.na(start)))) {
> 102 if (ii > control$maxit)
> 103 stop("inner loop 1; can't correct step size")
> 104 ii <- ii + 1
> 105 start <- (start + coefold)/2
> 106 eta <- drop(x %*% start)
> 107 mu <- linkinv(eta <- eta + offset)
> 108 dev <- sum(dev.resids(y, mu, weights))
> 109 }
> 110 boundary <- TRUE
> 111 coef <- start
> 112 if (control$trace)
> 113 cat("New Deviance =", dev, "\n")
> 114 }
> 115 if (!(valideta(eta) && validmu(mu))) {
> 116 warning("Step size truncated: out of bounds.")
> 117 ii <- 1
> 118 while (!(valideta(eta) && validmu(mu))) {
> 119 if (ii > control$maxit)
> 120 stop("inner loop 2; can't correct step size")
> 121 ii <- ii + 1
> 122 start <- (start + coefold)/2
> 123 mu <- linkinv(eta <- eta + offset)
> 124 }
> 125 boundary <- TRUE
> 126 coef <- start
> 127 dev <- sum(dev.resids(y, mu, weights))
> 128 if (control$trace)
> 129 cat("New Deviance =", dev, "\n")
> 130 }
> 131 if (abs(dev - devold)/(0.1 + abs(dev)) < control$epsilon) {
> 132 conv <- TRUE
> 133 break
> 134 }
> 135 else {
> 136 devold <- dev
> 137 coefold <- coef
> 138 }
> 139 }
> 140 if (!conv)
> 141 warning("Algorithm did not converge")
> 142 if (boundary)
> 143 warning("Algorithm stopped at boundary value")
> 144 eps <- 10 * .Machine$double.eps
> 145 if (family$family == "binomial") {
> 146 if (any(mu > 1 - eps) || any(mu < eps))
> 147 warning("fitted probabilities numerically 0 or 1 occurred")
> 148 }
> 149 if (family$family == "poisson") {
> 150 if (any(mu < eps))
> 151 warning("fitted rates numerically 0 occurred")
> 152 }
> 153 if (fit$rank != nvars) {
> 154 coef[seq(fit$rank + 1, nvars)] <- NA
> 155 dimnames(fit$qr) <- list(NULL, xnames)
> 156 }
> 157 coef[fit$pivot] <- coef
> 158 xxnames <- xnames[fit$pivot]
> 159 residuals <- rep(NA, nobs)
> 160 residuals[good] <- z - (eta - offset)[good]
> 161 fit$qr <- as.matrix(fit$qr)
> 162 nr <- min(sum(good), nvars)
> 163 if (nr < nvars) {
> 164 Rmat <- diag(nvars)
> 165 Rmat[1:nr, 1:nvars] <- fit$qr[1:nr, 1:nvars]
> 166 }
> 167 else Rmat <- fit$qr[1:nvars, 1:nvars]
> 168 Rmat <- as.matrix(Rmat)
> 169 Rmat[row(Rmat) > col(Rmat)] <- 0
> 170 names(coef) <- xnames
> 171 colnames(fit$qr) <- xxnames
> 172 dimnames(Rmat) <- list(xxnames, xxnames)
> 173 names(residuals) <- ynames
> 174 names(mu) <- ynames
> 175 names(eta) <- ynames
> 176 wt <- rep(0, nobs)
> 177 wt[good] <- w^2
> 178 names(wt) <- ynames
> 179 names(weights) <- ynames
> 180 names(y) <- ynames
> 181 names(fit$effects) <- c(xxnames[seq(fit$rank)], rep("", sum(good) -
> 182 fit$rank))
> 183 wtdmu <- if (intercept)
> 184 sum(weights * y)/sum(weights)
> 185 else linkinv(offset)
> 186 nulldev <- sum(dev.resids(y, wtdmu, weights))
> 187 n.ok <- nobs - sum(weights == 0)
> 188 nulldf <- n.ok - as.integer(intercept)
> 189 resdf <- n.ok - fit$rank
> 190 aic.model <- aic(y, n, mu, weights, dev) + 2 * fit$rank
> 191 list(coefficients = coef, residuals = residuals, fitted.values = mu,
> 192 effects = fit$effects, R = Rmat, rank = fit$rank, qr = fit[c("qr",
> 193 "rank", "qraux", "pivot", "tol")], family = family,
> 194 linear.predictors = eta, deviance = dev, aic = aic.model,
> 195 null.deviance = nulldev, iter = iter, weights = wt, prior.weights = weights,
> 196 df.residual = resdf, df.null = nulldf, y = y, converged = conv,
> 197 boundary = boundary)
> 198 }
> ========================= modified glm.fit =========================
> glm.fit <-
> function (x, y, weights = NULL, start = NULL, etastart = NULL,
> mustart = NULL, offset = NULL, family = gaussian(),
> control = glm.control(), intercept = TRUE)
> {
> x <- as.matrix(x)
> xnames <- dimnames(x)[[2]]
> ynames <- names(y)
> conv <- FALSE
> nobs <- NROW(y)
> nvars <- NCOL(x)
> if (nvars == 0) {
> cc <- match.call()
> cc[[1]] <- as.name("glm.fit.null")
> return(eval(cc, parent.frame()))
> }
> if (is.null(weights))
> weights <- rep(1, nobs)
> if (is.null(offset))
> offset <- rep(0, nobs)
> variance <- family$variance
> dev.resids <- family$dev.resids
> aic <- family$aic
> linkinv <- family$linkinv
> mu.eta <- family$mu.eta
> if (!is.function(variance) || !is.function(linkinv))
> stop("illegal `family' argument")
> valideta <- family$valideta
> if (is.null(valideta))
> valideta <- function(eta) TRUE
> validmu <- family$validmu
> if (is.null(validmu))
> validmu <- function(mu) TRUE
> if (is.null(mustart)) {
> eval(family$initialize)
> }
> else {
> mukeep <- mustart
> eval(family$initialize)
> mustart <- mukeep
> }
> if (NCOL(y) > 1)
> stop("y must be univariate unless binomial")
> if (!is.null(start)){
> if (length(start) != nvars)
> stop(paste("Length of start should equal", nvars,
> "and correspond to initial coefs for", deparse(xnames)))
> coef <- coefold <- start # initialise coef here in order to
> # be able to subset it below
> }
> else
> coef <- coefold <- rep(0,nvars)
> eta <- if (!is.null(etastart) && valideta(etastart))
> etastart
> else if (!is.null(start))
> as.vector(if (NCOL(x) == 1)
> x * start
> else x %*% start)
> else family$linkfun(mustart)
> mu <- linkinv(eta)
> if (!(validmu(mu) && valideta(eta)))
> stop("Can't find valid starting values: please specify some")
> devold <- sum(dev.resids(y, mu, weights))
> boundary <- FALSE
> for (iter in 1:control$maxit) {
> good <- weights > 0
> varmu <- variance(mu)[good]
> if (any(is.na(varmu)))
> stop("NAs in V(mu)")
> if (any(varmu == 0))
> stop("0s in V(mu)")
> mu.eta.val <- mu.eta(eta)
> if (any(is.na(mu.eta.val[good])))
> stop("NAs in d(mu)/d(eta)")
> good <- (weights > 0) & (mu.eta.val != 0)
> if (all(!good)) {
> conv <- FALSE
> warning(paste("No observations informative at iteration",
> iter))
> break
> }
> z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
> w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
> ngoodobs <- as.integer(nobs - sum(!good))
> ncols <- as.integer(1)
> fit <- .Fortran("dqrls", qr = x[good, ] * w, n = as.integer(ngoodobs),
> p = nvars, y = w * z, ny = ncols, tol = min(1e-07,
> control$epsilon/1000), coefficients = numeric(nvars),
> residuals = numeric(ngoodobs), effects = numeric(ngoodobs),
> rank = integer(1), pivot = 1:nvars, qraux = double(nvars),
> work = double(2 * nvars), PACKAGE = "base")
> if (nobs < fit$rank)
> stop(paste("X matrix has rank", fit$rank, "but only",
> nobs, "observations"))
> coef[fit$pivot] <- fit$coefficients
> eta <- drop(x %*% coef)
> mu <- linkinv(eta <- eta + offset)
> dev <- sum(dev.resids(y, mu, weights))
> if (control$trace)
> cat("Deviance =", dev, "Iterations -", iter, "\n")
> boundary <- FALSE
> if (is.na(dev) || any(is.na(coef))) {
> warning("Step size truncated due to divergence")
> ii <- 1
> while ((is.na(dev) || any(is.na(coef)))) {
> if (ii > control$maxit)
> stop("inner loop 1; can't correct step size")
> ii <- ii + 1
> coef <- (coef + coefold)/2
> eta <- drop(x %*% coef)
> mu <- linkinv(eta <- eta + offset)
> dev <- sum(dev.resids(y, mu, weights))
> }
> boundary <- TRUE
> if (control$trace)
> cat("New Deviance =", dev, "\n")
> }
> if (!(valideta(eta) && validmu(mu))) {
> warning("Step size truncated: out of bounds.")
> ii <- 1
> while (!(valideta(eta) && validmu(mu))) {
> if (ii > control$maxit)
> stop("inner loop 2; can't correct step size")
> ii <- ii + 1
> coef <- (coef + coefold)/2
> mu <- linkinv(eta <- eta + offset)
> }
> boundary <- TRUE
> dev <- sum(dev.resids(y, mu, weights))
> if (control$trace)
> cat("New Deviance =", dev, "\n")
> }
> if (abs(dev - devold)/(0.1 + abs(dev)) < control$epsilon) {
> conv <- TRUE
> break
> }
> else {
> devold <- dev
> coefold <- coef
> }
> }
> if (!conv)
> warning("Algorithm did not converge")
> if (boundary)
> warning("Algorithm stopped at boundary value")
> eps <- 10 * .Machine$double.eps
> if (family$family == "binomial") {
> if (any(mu > 1 - eps) || any(mu < eps))
> warning("fitted probabilities numerically 0 or 1 occurred")
> }
> if (family$family == "poisson") {
> if (any(mu < eps))
> warning("fitted rates numerically 0 occurred")
> }
> if (fit$rank != nvars) {
> coef[fit$pivot][seq(fit$rank + 1, nvars)] <- NA
> }
> xxnames <- xnames[fit$pivot]
> residuals <- rep(NA, nobs)
> residuals[good] <- z - (eta - offset)[good]
> fit$qr <- as.matrix(fit$qr)
> nr <- min(sum(good), nvars)
> if (nr < nvars) {
> Rmat <- diag(nvars)
> Rmat[1:nr, 1:nvars] <- fit$qr[1:nr, 1:nvars]
> }
> else Rmat <- fit$qr[1:nvars, 1:nvars]
> Rmat <- as.matrix(Rmat)
> Rmat[row(Rmat) > col(Rmat)] <- 0
> names(coef) <- xnames
> colnames(fit$qr) <- xxnames
> dimnames(Rmat) <- list(xxnames, xxnames)
> names(residuals) <- ynames
> names(mu) <- ynames
> names(eta) <- ynames
> wt <- rep(0, nobs)
> wt[good] <- w^2
> names(wt) <- ynames
> names(weights) <- ynames
> names(y) <- ynames
> names(fit$effects) <- c(xxnames[seq(fit$rank)], rep("", sum(good) -
> fit$rank))
> wtdmu <- if (intercept)
> sum(weights * y)/sum(weights)
> else linkinv(offset)
> nulldev <- sum(dev.resids(y, wtdmu, weights))
> n.ok <- nobs - sum(weights == 0)
> nulldf <- n.ok - as.integer(intercept)
> resdf <- n.ok - fit$rank
> aic.model <- aic(y, n, mu, weights, dev) + 2 * fit$rank
> list(coefficients = coef, residuals = residuals, fitted.values = mu,
> effects = fit$effects, R = Rmat, rank = fit$rank, qr = fit[c("qr",
> "rank", "qraux", "pivot", "tol")], family = family,
> linear.predictors = eta, deviance = dev, aic = aic.model,
> null.deviance = nulldev, iter = iter, weights = wt, prior.weights = weights,
> df.residual = resdf, df.null = nulldf, y = y, converged = conv,
> boundary = boundary)
> }
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
>
--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._