[R] weighted regression: Osius & Rojek's test for logistic regression models

Sun Jan 13 20:48:47 CET 2002

Dear all,

I am trying to implement goodness-of-fit tests for logistic regression
models described in : Hosmer, D.W., Lemeshow, S., 2000. Applied logistic
regression. New-York, John Wiley & Sons, Inc., 373 p. (pp. 152-154)

Namely I would like to reproduce Osius & Rojek's test (Osius, G., Rojek,
D., 1992. Normal goodness-of-fit tests for multinomial models with large
degrees of freedom. JASA, 87: 1145:1152.). This test involves a weighted
linear regression of the transformed fitted probabilities against
covariates, from which residual sum of squares is extracted. Using the
UIS dataset provided at
http://www-unix.oit.umass.edu/~statdata/stat-logistic.html and the same
model than in the book, I find a large discrepancy between "my" RSS and
the one reported by Hosmer and Lemeshow: 392.454 vs. 189.685.

Weighted linear regression results, weights in "nu":

> summary(fm)
[snip]
Coefficients:
                Value Std. Error  t value Pr(>|t|) 
 (Intercept)  16.5302   0.7327    22.5616   0.0000
         age  -0.2175   0.0158   -13.7293   0.0000
     ndrgfp1  -4.1627   0.2578   -16.1472   0.0000
     ndrgfp2  -1.5025   0.0977   -15.3835   0.0000
ivhxPrevious   2.1504   0.2595     8.2872   0.0000
  ivhxRecent   2.4429   0.2290    10.6676   0.0000
        race  -0.8732   0.2016    -4.3316   0.0000
       treat  -1.4424   0.1779    -8.1096   0.0000
        site  -0.6621   0.1977    -3.3486   0.0009

Residual standard error: 0.8755 on 512 degrees of freedom
[snip]

> sum( nu * resid(fm)^2 )
[1] 392.454

> sqrt(392.454 / 512)
[1] 0.876

Other intermediate calculations are the same (Pearson's chi square,
etc.) so I don't think the problem comes from the dataset.

The whole code of the function and its application on the UIS data set
are given below.

I would highly appreciate any insight on this problem.

Renaud

###################

or.test <- function(object)
{
### ancillary function
## ags adapted from agg.sum provided by Bill Dunlap	
  ags <- function(x, by){
    by <- data.frame(by)
    ord <- do.call("order", unname(by))
    x <- x[ord]
    by <- by[ord,  ]
    logical.diff <- function(group) group[-1] != group[ - length(group)]
    change <- logical.diff(by[[1]])
    for(i in seq(along = by)[-1])
      change <- change | logical.diff(by[[i]])
    by <- by[c(T, change),  , drop = F]
    by$x <- diff(c(0, cumsum(x)[c(change, T)]))
    by
    }
###
### computations
###
  mf <- model.frame(object)
## collapse the original data by covariate pattern
  xx <- ags(rep(1, nrow(mf)), mf[-1])
## observed number of cases by covariate pattern
  yy <- unname(unlist(ags(mf[ , 1], mf[-1])[ncol(xx)]))
## fitted proba
  pp <- predict(object, newdata = xx, type = "response")
## number of rows with the same covariate pattern
  mm <- unname(unlist(xx[ncol(xx)]))
## new model frame
  xx <- xx[ , - ncol(xx)]
## weights
  nu <- mm * pp * (1 - pp)
## new response
  cc <- (1 - 2 * pp) / nu
### Pearson's X2
  X2 <- sum( (yy - mm * pp)^2 / nu)
### weighted regression
  mod <- lm(cc ~ . , weights = nu, data = xx)
  rss <- sum( nu * resid(mod)^2 )
### compute the stat.
  J <- nrow(xx)
  A <- 2 * (J - sum( 1 / mm))
  z <- abs( (X2 - (J - length( coef(object) ) ) ) / sqrt(A + rss) )
### report results
  print(object$call)
  cat("Osius & Rojek's goodness-of-fit test for logistic models.\n")
  cat("Null hypothesis: model fits the data well.\n")
  cat("z =", round(z, 3), "; P =", round(2 * (1 - pnorm(z)), 3), "\n")
  }

> mod2 <- glm(dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race + treat + site + age:ndrgfp1 + race:site,
    family = binomial(logit), data = uis, epsilon = 1e-010)
> summary(mod2, cor = 0)

Call: glm(formula = dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race +
treat + site + age:ndrgfp1 + race:site,
    family = binomial(logit), data = uis, epsilon = 1e-010)
Deviance Residuals:
      Min         1Q     Median        3Q      Max 
 -1.30291 -0.7884782 -0.5787787 0.9882528 2.616352

Coefficients:
                   Value Std. Error   t value 
 (Intercept) -6.84386380 1.21931572 -5.612873
         age  0.11663848 0.02887493  4.039437
     ndrgfp1  1.66903502 0.40715174  4.099295
     ndrgfp2  0.43368849 0.11690510  3.709748
ivhxPrevious -0.63463065 0.29871912 -2.124506
  ivhxRecent -0.70494755 0.26158047 -2.694955
        race  0.68410676 0.26413543  2.589985
       treat  0.43492548 0.20375957  2.134503
        site  0.51620102 0.25488809  2.025207
 age:ndrgfp1 -0.01526971 0.00602675 -2.533656
   race:site -1.42945705 0.52978052 -2.698206

(Dispersion Parameter for Binomial family taken to be 1 )

    Null Deviance: 653.7289 on 574 degrees of freedom

Residual Deviance: 597.9629 on 564 degrees of freedom

Number of Fisher Scoring Iterations: 5 

> or.test(mod2)
glm(formula = dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race + treat +
site + age:ndrgfp1 + race:site, family = 
	binomial(logit), data = uis, epsilon = 1e-010)
Osius & Rojek's goodness-of-fit test for logistic models.
Null hypothesis: model fits the data well.
z = 0.085 ; P = 0.933 

-- 
Dr Renaud Lancelot, vétérinaire
CIRAD, Département Elevage et Médecine Vétérinaire (CIRAD-Emvt)
Programme Productions Animales
http://www.cirad.fr/presentation/programmes/prod-ani.shtml

ISRA-LNERV                      tel    (221) 832 49 02
BP 2057 Dakar-Hann              fax    (221) 821 18 79 (CIRAD)
Senegal                         e-mail renaud.lancelot at cirad.fr
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._