[R] GLM/GAM and unobserved heterogeneity

Kyle G. Lundstedt kylelundstedt at hotmail.com
Wed Aug 17 23:51:00 CEST 2005

     I'm interested in correcting for and measuring unobserved  
heterogeneity ("missing variables") using R.  In particular, I'm  
searching for a simple way to measure the amount of unobserved  
heterogeneity remaining in a series of increasingly complex models  
(adding additional variables to each new model) on the same data.
     I have a static database of 400,000 or so individual mortgage  
loans, each of which is observed monthly from origination (t=0) until  
termination (a binary yes/no variable).  In my update database, there  
are up to 60 months of observed data for each loan in the static  
database, and an individual loan has an "average life" of roughly 36  
months.
     Each loan has static covariates observed at origination, such as  
original loan amount and credit score, as well as time-varying  
covariates (TVC) such as age, interest rates, and house prices.   
Because these TVC change each month, I've constructed a modeling  
database that merges the static database with the update database.
     The resulting "loan-month" modeling database has one observation  
for every loan-month, and the static covariates remain the same for  
all loan-months for a given loan.  Thus, the modeling database has  
roughly 14.4 million loan-month records.  A loan is considered  
"active" as long as it has not yet terminated or been censored; my  
interest is in predicting termination.
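
     As a minimal sketch of that merge, assuming hypothetical column  
names (loan_id, orig_amount, credit_score, month, rate, terminated)  
and three made-up loans rather than my real data:

```r
# Hypothetical static table: one row per loan (column names are assumptions).
static <- data.frame(
  loan_id      = c(1, 2, 3),
  orig_amount  = c(150000, 220000, 90000),
  credit_score = c(720, 680, 640)
)

# Hypothetical update table: one row per loan per observed month.
# Loan 1 terminates in month 3, loan 2 is censored after month 4,
# and loan 3 terminates in month 2.
updates <- data.frame(
  loan_id    = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  month      = c(1, 2, 3, 1, 2, 3, 4, 1, 2),
  rate       = c(6.0, 5.9, 5.7, 6.0, 5.9, 5.7, 5.6, 6.0, 5.9),
  terminated = c(0, 0, 1, 0, 0, 0, 0, 0, 1)
)

# Merge so the static covariates repeat on every loan-month row.
loan_month <- merge(static, updates, by = "loan_id")
loan_month <- loan_month[order(loan_month$loan_id, loan_month$month), ]

nrow(loan_month)   # one row per loan-month
400000 * 36        # approximate size of the full modeling database
```

The last line just reproduces the 14.4 million figure from 400,000  
loans at an average of 36 observed months each.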
     This type of data is often referred to as "event history" or  
"discrete hazard" data.  The standard R package to apply to such data  
is "survival", with which I could estimate a Cox proportional hazard  
model using coxph.  The advantage of such an approach is that  
unobserved heterogeneity is easily addressed using the "frailty" term.
     The disadvantages, at least for my purposes, are two-fold.   
First, my audience is unfamiliar with hazard models.  Second, my  
monthly data has many "ties" (many terminations in the same month),  
so I've been told that coxph won't work well on a large dataset with  
many ties.
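
     For reference, the kind of call I have in mind looks roughly like  
the sketch below.  It uses the small "kidney" example dataset bundled  
with the survival package, not my mortgage data, and the choice of  
covariates is only illustrative:

```r
library(survival)  # recommended package shipped with R

# 'kidney' has repeated infection times per patient, so 'id' defines
# the clusters that share a frailty (random effect).
fit_cox <- coxph(Surv(time, status) ~ age + sex + frailty(id),
                 data = kidney, ties = "efron")

# The printed fit includes the estimated variance of the frailty term.
print(fit_cox)
```

The ties argument selects how tied event times are handled; "efron"  
is the package default, with "breslow" and "exact" as alternatives.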
     On the other hand, because the data is measured discretely each  
month, many references suggest applying generalized linear models  
(GLM, "logit"-type models) or even generalized additive models  
(GAM, "logit"-type models that incorporate nonlinearity in individual  
covariates).  The advantage to this approach is that GLM and GAM are  
readily available in R, and my audience is very familiar with logit- 
type models.
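
     To make the GLM version concrete, here is a minimal sketch on  
simulated loan-month data.  The variable names, sample sizes, and  
coefficients are all assumptions chosen for illustration:

```r
set.seed(1)

# Simulate a small hypothetical loan-month table.
n_loans <- 500
max_age <- 24
score   <- rnorm(n_loans)   # standardized credit score (assumed)

rows <- lapply(seq_len(n_loans), function(i) {
  age <- seq_len(max_age)
  # Monthly termination probability depends on loan age and score.
  p <- plogis(-3.5 + 0.03 * age + 0.5 * score[i])
  y <- rbinom(max_age, 1, p)
  # Keep rows up to and including the first termination, if any;
  # loans that never terminate are censored at max_age.
  last <- if (any(y == 1)) which(y == 1)[1] else max_age
  data.frame(loan_id = i, age = age[1:last], score = score[i],
             terminated = y[1:last])
})
loan_month <- do.call(rbind, rows)

# Discrete-time hazard fitted as a logit on the loan-month rows.
fit_glm <- glm(terminated ~ age + score, family = binomial,
               data = loan_month)
summary(fit_glm)$coefficients
```

Each row contributes one Bernoulli trial for "terminated this month,  
given active at the start of the month", which is what lets an  
ordinary logit stand in for a discrete hazard model.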
     The disadvantage, however, is that I am totally unfamiliar with  
ways to correct for and measure unobserved heterogeneity using GLM/ 
GAM-type models.  I've been told that unobserved heterogeneity in the  
hazard framework is analogous to random effects in the GLM/GAM  
framework, but there seem to be a number of R packages that address  
this issue in different ways.
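
     One of those ways, as far as I can tell, is a random intercept  
per loan fitted with the mgcv package.  The sketch below is only my  
guess at what that would look like, on simulated data with made-up  
names and parameters, and I don't know whether it is the appropriate  
approach:

```r
library(mgcv)  # recommended package shipped with R

set.seed(2)
n_loans <- 200
max_age <- 24
frail   <- rnorm(n_loans, sd = 1)   # simulated unobserved loan effect

rows <- lapply(seq_len(n_loans), function(i) {
  age  <- seq_len(max_age)
  y    <- rbinom(max_age, 1, plogis(-3 + 0.03 * age + frail[i]))
  last <- if (any(y == 1)) which(y == 1)[1] else max_age
  data.frame(loan_id = i, age = age[1:last], terminated = y[1:last])
})
d <- do.call(rbind, rows)
d$loan_id <- factor(d$loan_id)      # "re" smooths need a factor

# Smooth effect of age plus a random intercept for each loan.
fit_re <- gam(terminated ~ s(age) + s(loan_id, bs = "re"),
              family = binomial, data = d, method = "REML")

# Effective degrees of freedom used by the random intercepts.
re_idx <- grep("loan_id", names(coef(fit_re)))
re_edf <- sum(fit_re$edf[re_idx])
re_edf
```

A value of re_edf near zero would mean the random intercepts have been  
shrunk away.  With at most one termination per loan, I suspect the  
loan-level variance may be only weakly identified, so I would treat  
such a number with caution.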
     So, I'd greatly appreciate suggestions on a simple way to  
incorporate unobserved heterogeneity into a GLM/GAM-type model.  I'm  
not much of a statistician, so simple examples are always helpful.   
I'm also happy to track down specific article/book references, if  
folks think those might be of help.

Many thanks,
kyle  at  hotmail . com
(email altered in obvious ways)
