[R] Checking the assumptions for a proper GLM model

Fri Feb 19 21:35:26 CET 2010

The best validation of assumptions is a good knowledge of the origin of your data.  With 18 bullet points below, if you run all of these checks every time you are going to end up with a lot of false positives even when all your assumptions are met.  It is important to understand your data well enough to know which assumptions are most likely to be violated, so that you can focus on those.  It also helps to understand which violations your technique is robust against.

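Rather than using strict tests whose hypotheses may not exactly match what you want to test, it may be more appropriate to use the vis.test function from the TeachingDemos package, which has you try to pick the plot of the real data out of a set of plots simulated under the null hypothesis.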

Specific comments inline below:

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Jay
> Sent: Thursday, February 18, 2010 6:33 AM
> To: r-help at r-project.org
> Subject: Re: [R] Checking the assumptions for a proper GLM model
> 
> So what I'm looking for is readily available tools/packages that could
> produce some of the following:
> 
> 3.6 Summary of Useful Commands (STATA: Source:
> http://www.ats.ucla.edu/stat/Stata/webbooks/logistic/chapter3/statalog3
> .htm)
> 
>     * linktest--performs a link test for model specification, in our
> case to check if logit is the right link function to use. This command
> is issued after the logit or logistic command.

The scglm function in the forward package looks like it may do this.

>     * lfit--performs goodness-of-fit test, calculates either Pearson
> chi-square goodness-of-fit statistic or Hosmer-Lemeshow chi-square
> goodness-of-fit depending on if the group option is used.

The chisq.test function in stats does goodness-of-fit tests.  Several other functions that show up when searching for "goodness" may also help (I did not see any specific to glm models).

The book Regression Modeling Strategies (which the rms package supports) talks a bit about the Hosmer-Lemeshow test and supports my claim that you really need to understand the data before using this test.  It also presents an alternative.

>     * fitstat -- is a post-estimation command that computes a variety
> of measures of fit.

It is hard to find equivalents without knowing what the measures are. Many can probably be computed from the glm summary information, others may be included in the output from lrm (rms package).
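
As a rough sketch of what can be pulled from a fitted glm: the data below are simulated, the object names are made up, and the choice of measures is my own guess, not a list of what fitstat actually reports.

```r
# Simulated logistic data; all names are illustrative only
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

fit_stats <- c(
  deviance      = deviance(fit),
  null_deviance = fit$null.deviance,
  AIC           = AIC(fit),
  BIC           = BIC(fit),
  logLik        = as.numeric(logLik(fit)),
  # McFadden's pseudo R-squared: 1 - logLik(model) / logLik(intercept only)
  mcfadden_R2   = 1 - as.numeric(logLik(fit)) /
                      as.numeric(logLik(update(fit, . ~ 1)))
)
print(round(fit_stats, 3))
```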

>     * lsens -- graphs sensitivity and specificity versus probability
> cutoff.

There are a couple of packages that do ROC curves, but I find that they are easy to do by hand.
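
By hand it could look something like this.  This is a minimal sketch on simulated data; the cutoff grid and all object names are my own choices.

```r
# Simulated logistic data; all names are illustrative only
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
p <- fitted(fit)

# Classify as positive when the fitted probability is >= the cutoff
cutoffs <- seq(0, 1, by = 0.01)
sens <- sapply(cutoffs, function(k) mean(p[y == 1] >= k))
spec <- sapply(cutoffs, function(k) mean(p[y == 0] <  k))

# Sensitivity and specificity against the cutoff
matplot(cutoffs, cbind(sens, spec), type = "l", lty = 1:2, col = 1,
        xlab = "Probability cutoff", ylab = "Proportion")
legend("right", c("Sensitivity", "Specificity"), lty = 1:2)

# The same numbers drawn as an ROC curve
plot(1 - spec, sens, type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 3)
```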

>     * lstat -- displays summary statistics, including the
> classification table, sensitivity, and specificity.

These can be computed fairly easily by hand; they are probably also available in packages like epicalc or ROC.  Their value as diagnostics is another matter.
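
For example, a classification table at a single cutoff might be built like this (simulated data; the 0.5 cutoff and the object names are assumptions for illustration):

```r
# Simulated logistic data; all names are illustrative only
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

# Predicted class at an arbitrary cutoff of 0.5
pred <- as.integer(fitted(fit) >= 0.5)

# Fix the factor levels so the table is always 2 x 2
tab <- table(observed  = factor(y,    levels = 0:1),
             predicted = factor(pred, levels = 0:1))
print(tab)

sensitivity <- tab["1", "1"] / sum(tab["1", ])  # true positives / observed positives
specificity <- tab["0", "0"] / sum(tab["0", ])  # true negatives / observed negatives
c(sensitivity = sensitivity, specificity = specificity)
```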

>     * lroc -- graphs and calculates the area under the ROC curve based
> on the model.

I believe that lrm (rms package) computes the area under the curve.  It is also an easy one to calculate by hand.
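
One way to do it by hand is through the rank (Mann-Whitney) form of the area under the curve.  A sketch on simulated data, with made-up object names:

```r
# Simulated logistic data; all names are illustrative only
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
p <- fitted(fit)

n1 <- sum(y == 1)
n0 <- sum(y == 0)

# AUC = Mann-Whitney U statistic / (n1 * n0); rank() gives midranks for ties
auc <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

This is the proportion of (event, non-event) pairs in which the event has the higher fitted probability, with ties counted as one half.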

>     * listcoef--lists the estimated coefficients for a variety of
> regression models, including logistic regression.

The coef and summary functions provide the coefficients for glm and other models.

>     * predict dbeta --  Pregibon delta beta influence statistic

I don't know about this one, but see below if it is based on leave-one-out statistics.

>     * predict deviance -- deviance residual

The resid (residuals) function has an option to return deviance residuals (it looks like it is the default).
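
A quick check on simulated data (object names are made up for illustration):

```r
# Simulated logistic data; all names are illustrative only
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

r_default <- resid(fit)                        # no type given
r_dev     <- residuals(fit, type = "deviance") # asked for explicitly

all.equal(r_default, r_dev)  # the default type is "deviance"
sum(r_dev^2)                 # squared deviance residuals add up to ...
deviance(fit)                # ... the residual deviance of the model
```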

>     * predict dx2 -- Hosmer and Lemeshow change in chi-square
> influence statistic
>     * predict dd -- Hosmer and Lemeshow change in deviance statistic
>     * predict hat -- Pregibon leverage

Don't know these ones

>     * predict residual -- Pearson residuals; adjusted for the
> covariate pattern
>     * predict rstandard -- standardized Pearson residuals; adjusted
> for the covariate pattern

See ?residuals.glm to see if any of those options work for you.
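
A sketch of the per-observation versions, on simulated data with made-up names.  Note the assumption here: these are computed for each observation, not grouped by covariate pattern as the Stata versions are described above.

```r
# Simulated logistic data; all names are illustrative only
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

r_pearson <- residuals(fit, type = "pearson")  # Pearson residuals
h         <- hatvalues(fit)                    # leverages (hat matrix diagonal)
r_std     <- rstandard(fit, type = "pearson")  # standardized Pearson residuals

# For a binomial fit (dispersion 1) the standardized version is just
# the Pearson residual divided by sqrt(1 - leverage)
manual <- r_pearson / sqrt(1 - h)
all.equal(unname(r_std), unname(manual))
```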

>     * ldfbeta -- influence of each individual observation on the
> coefficient estimate ( not adjusted for the covariate pattern)

For linear models there is a nice computational shortcut for leave-one-out statistics; for glms you need to refit the model each time.  With a fast computer this is still fairly quick and easy.  There may be existing functions that do this, but it would only take a couple of lines of code to do it manually.
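
The manual refit could be sketched like this (simulated data; the data frame d and the other names are hypothetical):

```r
# Simulated logistic data in a data frame; all names are illustrative only
set.seed(1)
n <- 60
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$x))
fit <- glm(y ~ x, family = binomial, data = d)

# Refit the model n times, each time leaving out one observation;
# the result is a (coefficients x n) matrix
loo <- sapply(seq_len(nrow(d)), function(i) {
  coef(update(fit, data = d[-i, ]))
})

# Change in each coefficient: full-data estimate minus leave-one-out estimate
change <- t(coef(fit) - loo)
head(change)

# Observations that move the slope the most
head(order(abs(change[, "x"]), decreasing = TRUE))
```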

>     * graph with [weight=some_variable] option
>     * scatlog--produces scatter plot for logistic regression.

Try ?plot.lm (there is no separate plot.glm; calling plot on a glm object uses the plot.lm method).

>     * boxtid--performs power transformation of independent variables
> and performs nonlinearity test.
> 

If potential nonlinearity is an issue, splines may work better for this.  There are some good examples of testing and using splines in RMS (the book) and rms (the package).
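
A minimal sketch of the idea, using ns from the splines package that ships with R instead of the rms functions, on simulated data where the true effect is curved (the 4 degrees of freedom and all names are arbitrary choices of mine):

```r
library(splines)

# Simulated data with a deliberately nonlinear effect of x on the logit
set.seed(1)
n <- 300
x <- runif(n, -2, 2)
y <- rbinom(n, 1, plogis(-1 + x^2))

fit_lin <- glm(y ~ x, family = binomial)              # linear in x
fit_spl <- glm(y ~ ns(x, df = 4), family = binomial)  # natural cubic spline in x

# A natural spline basis contains the straight line as a special case,
# so the two models are nested and can be compared by a likelihood ratio test
cmp <- anova(fit_lin, fit_spl, test = "Chisq")
print(cmp)
```

A small p-value here suggests that the straight-line term is not enough; plotting the fitted spline shows what shape the relationship takes.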

> But, since I'm new to GLM, I would greatly appreciate how you/others
> go about and test the validity of a GLM model.
> 
> 

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111