[R] weighted regression: Osius & Rojek's test for logistic regression models
Renaud Lancelot
renaud.lancelot at cirad.fr
Sun Jan 13 20:48:47 CET 2002
Dear all,
I am trying to implement goodness-of-fit tests for logistic regression
models described in : Hosmer, D.W., Lemeshow, S., 2000. Applied logistic
regression. New-York, John Wiley & Sons, Inc., 373 p. (pp. 152-154)
Namely I would like to reproduce Osius & Rojek's test (Osius, G., Rojek,
D., 1992. Normal goodness-of-fit tests for multinomial models with large
degrees of freedom. JASA, 87: 1145:1152.). This test involves a weighted
linear regression of the transformed fitted probabilities against
covariates, from which residual sum of squares is extracted. Using the
UIS dataset provided at
http://www-unix.oit.umass.edu/~statdata/stat-logistic.html and the same
model than in the book, I find a large discrepancy between "my" RSS and
the one reported by Hosmer and Lemeshow: 392.454 vs. 189.685.
Weighted linear regression results, weights in "nu":
> summary(fm)
[snip]
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 16.5302 0.7327 22.5616 0.0000
age -0.2175 0.0158 -13.7293 0.0000
ndrgfp1 -4.1627 0.2578 -16.1472 0.0000
ndrgfp2 -1.5025 0.0977 -15.3835 0.0000
ivhxPrevious 2.1504 0.2595 8.2872 0.0000
ivhxRecent 2.4429 0.2290 10.6676 0.0000
race -0.8732 0.2016 -4.3316 0.0000
treat -1.4424 0.1779 -8.1096 0.0000
site -0.6621 0.1977 -3.3486 0.0009
Residual standard error: 0.8755 on 512 degrees of freedom
[snip]
> sum( nu * resid(fm)^2 )
[1] 392.454
> sqrt(392.454 / 512)
[1] 0.876
Other intermediate calculations are the same (Pearson's chi square,
etc.) so I don't think the problem comes from the dataset.
The whole code of the function and its application on the UIS data set
are given below.
I would highly appreciate any insight on this problem.
Renaud
###################
or.test <- function(object)
{
### ancillary function
## ags adapted from agg.sum provided by Bill Dunlap
ags <- function(x, by){
by <- data.frame(by)
ord <- do.call("order", unname(by))
x <- x[ord]
by <- by[ord, ]
logical.diff <- function(group) group[-1] != group[ - length(group)]
change <- logical.diff(by[[1]])
for(i in seq(along = by)[-1])
change <- change | logical.diff(by[[i]])
by <- by[c(T, change), , drop = F]
by$x <- diff(c(0, cumsum(x)[c(change, T)]))
by
}
###
### computations
###
mf <- model.frame(object)
## collapse the original data by covariate pattern
xx <- ags(rep(1, nrow(mf)), mf[-1])
## observed number of cases by covariate pattern
yy <- unname(unlist(ags(mf[ , 1], mf[-1])[ncol(xx)]))
## fitted proba
pp <- predict(object, newdata = xx, type = "response")
## number of rows with the same covariate pattern
mm <- unname(unlist(xx[ncol(xx)]))
## new model frame
xx <- xx[ , - ncol(xx)]
## weights
nu <- mm * pp * (1 - pp)
## new response
cc <- (1 - 2 * pp) / nu
### Pearson's X2
X2 <- sum( (yy - mm * pp)^2 / nu)
### weighted regression
mod <- lm(cc ~ . , weights = nu, data = xx)
rss <- sum( nu * resid(mod)^2 )
### compute the stat.
J <- nrow(xx)
A <- 2 * (J - sum( 1 / mm))
z <- abs( (X2 - (J - length( coef(object) ) ) ) / sqrt(A + rss) )
### report results
print(object$call)
cat("Osius & Rojek's goodness-of-fit test for logistic models.\n")
cat("Null hypothesis: model fits the data well.\n")
cat("z =", round(z, 3), "; P =", round(2 * (1 - pnorm(z)), 3), "\n")
}
> mod2 <- glm(dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race + treat + site + age:ndrgfp1 + race:site,
family = binomial(logit), data = uis, epsilon = 1e-010)
> summary(mod2, cor = 0)
Call: glm(formula = dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race +
treat + site + age:ndrgfp1 + race:site,
family = binomial(logit), data = uis, epsilon = 1e-010)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.30291 -0.7884782 -0.5787787 0.9882528 2.616352
Coefficients:
Value Std. Error t value
(Intercept) -6.84386380 1.21931572 -5.612873
age 0.11663848 0.02887493 4.039437
ndrgfp1 1.66903502 0.40715174 4.099295
ndrgfp2 0.43368849 0.11690510 3.709748
ivhxPrevious -0.63463065 0.29871912 -2.124506
ivhxRecent -0.70494755 0.26158047 -2.694955
race 0.68410676 0.26413543 2.589985
treat 0.43492548 0.20375957 2.134503
site 0.51620102 0.25488809 2.025207
age:ndrgfp1 -0.01526971 0.00602675 -2.533656
race:site -1.42945705 0.52978052 -2.698206
(Dispersion Parameter for Binomial family taken to be 1 )
Null Deviance: 653.7289 on 574 degrees of freedom
Residual Deviance: 597.9629 on 564 degrees of freedom
Number of Fisher Scoring Iterations: 5
> or.test(mod2)
glm(formula = dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race + treat +
site + age:ndrgfp1 + race:site, family =
binomial(logit), data = uis, epsilon = 1e-010)
Osius & Rojek's goodness-of-fit test for logistic models.
Null hypothesis: model fits the data well.
z = 0.085 ; P = 0.933
--
Dr Renaud Lancelot, vétérinaire
CIRAD, Département Elevage et Médecine Vétérinaire (CIRAD-Emvt)
Programme Productions Animales
http://www.cirad.fr/presentation/programmes/prod-ani.shtml
ISRA-LNERV tel (221) 832 49 02
BP 2057 Dakar-Hann fax (221) 821 18 79 (CIRAD)
Senegal e-mail renaud.lancelot at cirad.fr
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
More information about the R-help
mailing list