[R] GLM/GAM and unobserved heterogeneity
Kyle G. Lundstedt
kylelundstedt at hotmail.com
Wed Aug 17 23:51:00 CEST 2005
Hello,
I'm interested in correcting for and measuring unobserved
heterogeneity ("missing variables") using R. In particular, I'm
searching for a simple way to measure the amount of unobserved
heterogeneity remaining in a series of increasingly complex models
(adding additional variables to each new model) on the same data.
I have a static database of 400,000 or so individual mortgage
loans, each of which is observed monthly from origination (t=0) until
termination (a binary yes/no variable). In my update database, there
are up to 60 months of observed data for each loan in the static
database, and an individual loan has an "average life" of roughly 36
months.
Each loan has static covariates observed at origination, such as
original loan amount and credit score, as well as time-varying
covariates (TVC) such as age, interest rates, and house prices.
Because these TVC change each month, I've constructed a modeling
database that merges the static database with the update database.
The resulting "loan-month" modeling database has one observation
for every loan-month, and the static covariates remain the same for
all loan-months for a given loan. With 400,000 loans and an average
life of 36 months, the modeling database has roughly 14.4 million
loan-month records. A loan is considered
"active" as long as it has not yet terminated or been censored; my
interest is in predicting termination.
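
To make the loan-month layout concrete, here is a minimal base R
sketch. All column names (loan_id, orig_amount, credit_score,
last_month, terminated, age, event) and the three toy loans are
hypothetical, and it assumes months are counted 1, 2, ... from
origination.

```r
# Hypothetical static table: one row per loan.
# 'last_month' is the last month observed; 'terminated' is 1 if the
# loan ended in that month and 0 if it was censored.
static <- data.frame(
  loan_id      = 1:3,
  orig_amount  = c(150000, 220000, 90000),
  credit_score = c(700, 640, 760),
  last_month   = c(3, 2, 4),
  terminated   = c(1, 0, 1)
)

# Repeat each loan's row once per month it was active.
idx <- rep(seq_len(nrow(static)), times = static$last_month)
loan_month <- static[idx, c("loan_id", "orig_amount", "credit_score")]

# Loan age in months: 1, 2, ..., last_month within each loan.
loan_month$age <- sequence(static$last_month)

# The event indicator is 1 only in the final month of a terminated loan.
loan_month$event <- as.integer(
  loan_month$age == static$last_month[idx] & static$terminated[idx] == 1
)
rownames(loan_month) <- NULL
loan_month
```

Time-varying covariates such as interest rates would then be joined
onto this table by loan and month, for example with merge().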
This type of data is often referred to as "event history" or
"discrete hazard" data. The standard R package to apply to such data
is "survival", with which I could estimate a Cox proportional hazard
model using coxph. The advantage of such an approach is that
unobserved heterogeneity is easily addressed using the "frailty" term.
The disadvantages, at least for my purposes, are two-fold.
First, my audience is unfamiliar with hazard models. Second, my
monthly data has many "ties" (many terminations in the same month),
so I've been told that coxph won't work well on a large dataset with
many ties.
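
One way to gauge the ties concern on a small scale is to fit the same
Cox model under two tie-handling rules and compare the coefficients.
The sketch below uses simulated data (the variables score, months and
status are hypothetical, not from my database) and assumes a version
of the survival package whose coxph accepts the "ties" argument.

```r
library(survival)

set.seed(3)
n     <- 1000
score <- rnorm(n)   # hypothetical standardized covariate

# Months to termination, rounded up so many loans share the same month.
months <- ceiling(rexp(n, rate = 0.05 * exp(0.5 * score)))
status <- as.integer(months <= 60)   # 1 = terminated, 0 = censored
months <- pmin(months, 60)           # observation window of 60 months

# Efron is the coxph default; Breslow is the cruder approximation.
fit_efron   <- coxph(Surv(months, status) ~ score, ties = "efron")
fit_breslow <- coxph(Surv(months, status) ~ score, ties = "breslow")

rbind(efron = coef(fit_efron), breslow = coef(fit_breslow))
```

If the two estimates differ noticeably, that suggests the ties matter
for the data at hand.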
On the other hand, because the data is measured discretely each
month, many references suggest applying generalized linear models
(GLM, "logit"-type models) or even generalized additive models
(GAM, "logit"-type models that incorporate nonlinearity in individual
covariates). The advantage to this approach is that GLM and GAM are
readily available in R, and my audience is very familiar with logit-
type models.
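
As an illustration of the logit approach on loan-month data, the
following sketch simulates a small data set and fits a discrete-time
hazard model with glm. The data and the names loan_id, age, score and
event are hypothetical, and the linear age term is only a placeholder
for whatever baseline shape fits the real data.

```r
set.seed(1)
n_loans <- 500
max_age <- 24
score   <- rnorm(n_loans)   # hypothetical standardized credit score

# Simulate monthly termination with a logit hazard and keep each
# loan's rows up to and including its first termination month.
rows <- lapply(seq_len(n_loans), function(i) {
  p    <- plogis(-3 + 0.5 * score[i])
  ev   <- rbinom(max_age, 1, p)
  last <- if (any(ev == 1)) which(ev == 1)[1] else max_age
  data.frame(loan_id = i, age = seq_len(last), score = score[i],
             event = ev[seq_len(last)])
})
loan_month <- do.call(rbind, rows)

# Each loan-month is one Bernoulli trial: did the loan terminate?
fit_glm <- glm(event ~ age + score,
               family = binomial(link = "logit"),
               data = loan_month)
summary(fit_glm)$coefficients
```

Replacing link = "logit" with link = "cloglog" gives the grouped-time
counterpart of a proportional hazards model, which may be easier to
compare with coxph output.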
The disadvantage, however, is that I am totally unfamiliar with
ways to correct for and measure unobserved heterogeneity using GLM/
GAM-type models. I've been told that unobserved heterogeneity in the
hazard framework is analogous to random effects in the GLM/GAM
framework, but there seem to be a number of R packages that address
this issue in different ways.
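
One possible form of the random-effects idea, sketched here on
simulated data, is a GAM with a Gaussian random intercept per loan
fitted with mgcv. The data, the names loan_id, age, score, event and
frailty, and the choice of mgcv rather than another mixed-model
package are all assumptions made for illustration. With only one
termination per loan, the random-intercept variance may be weakly
identified, so the estimate should be read with caution.

```r
library(mgcv)   # recommended package, shipped with standard R

set.seed(2)
n_loans <- 300
max_age <- 24
score   <- rnorm(n_loans)            # hypothetical observed covariate
frailty <- rnorm(n_loans, sd = 1)    # simulated unobserved loan effect

rows <- lapply(seq_len(n_loans), function(i) {
  p    <- plogis(-3 + 0.5 * score[i] + frailty[i])
  ev   <- rbinom(max_age, 1, p)
  last <- if (any(ev == 1)) which(ev == 1)[1] else max_age
  data.frame(loan_id = i, age = seq_len(last), score = score[i],
             event = ev[seq_len(last)])
})
loan_month <- do.call(rbind, rows)
loan_month$loan_id <- factor(loan_month$loan_id)

# Smooth baseline in age plus a random intercept for each loan.
fit_re <- gam(event ~ s(age) + score + s(loan_id, bs = "re"),
              family = binomial, data = loan_month, method = "REML")

# Same model without the random intercept, for comparison.
fit_fx <- gam(event ~ s(age) + score,
              family = binomial, data = loan_month, method = "REML")

gam.vcomp(fit_re)     # standard deviations of the smooth/random terms
AIC(fit_fx, fit_re)   # rough check of whether the random term helps
```

Refitting after each new block of covariates and watching the
loan_id standard deviation shrink would be one rough way to track
remaining heterogeneity. On millions of loan-months this fit would
likely be too slow, so a random sample of loans may be needed.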
So, I'd greatly appreciate suggestions on a simple way to
incorporate unobserved heterogeneity into a GLM/GAM-type model. I'm
not much of a statistician, so simple examples are always helpful.
I'm also happy to track down specific article/book references, if
folks think those might be of help.
Many thanks,
Kyle
---
kyle at hotmail . com
(email altered in obvious ways)