[R] What is the most useful way to detect nonlinearity in logistic regression?

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Sun Dec 5 13:59:08 CET 2004

```On 05-Dec-04 Patrick Foley wrote:
> It is easy to spot response nonlinearity in normal linear
> models using plot(something.lm).
> However plot(something.glm) produces artifactual peculiarities
> since the diagnostic residuals are constrained  by the fact
> that y can only take values 0 or 1.
> What do R users find most useful in checking the linearity
> assumption of logistic regression (i.e. log-odds =a+bx)?
>
> Patrick Foley
> patfoley at csus.edu

The "most useful way to detect nonlinearity in logistic
regression" is:

a) have an awful lot of data
b) have the x (covariate) values judiciously placed.

The information, especially about non-linearity, in the binary
responses is often a lot less than people intuitively expect.

This is an area where R can be especially useful for
self-education by exploring possibilities and simulation.

For example, define the function (for quadratic nonlinearity):

testlin2 <- function(a, b, N) {
  ## N binomial observations at each of five design points
  x  <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
  lp <- a*x + b*x^2                 # linear predictor on the logit scale
  p  <- exp(lp)/(1 + exp(lp))       # success probabilities
  n  <- rep(N, 5)
  r  <- rbinom(5, n, p)             # simulated successes at each x
  resp <- cbind(r, n - r)           # two-column (successes, failures) response
  X <- cbind(x, x^2); colnames(X) <- c("x", "x2")
  summary(glm(resp ~ X - 1, family = binomial), correlation = TRUE)
}

This places N observations at each of (-1.0, -0.5, 0.0, 0.5, 1.0),
generates the N binary responses with probability p(x)
where log(p/(1-p)) = a*x + b*x^2, fits a logistic regression
forcing the "intercept" term to be 0 (so that you're not
diluting the info by estimating a parameter you know to be 0),
and returns the summary(glm(...)) from which the p-values
can be extracted:

The p-value for x^2 is testlin2(a,b,N)$coefficients[2,4]

You can run this function as a one-off for various values of
a, b, N to get a feel for what happens. You can run a simulation
on the lines of

pvals <- numeric(1000)
for (i in 1:1000) {
  pvals[i] <- testlin2(1, 0.1, 500)$coefficients[2, 4]
}

so that you can test how often you get a "significant" result.

For example, adopting the ritual "significant == P<0.05,
power = 80%", you can see a histogram of the p-values
over the conventional "significance breaks" with

hist(pvals,breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1),freq=TRUE)

and you can see your probability of getting a "significant" result
as e.g. sum(pvals < 0.05)/1000
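For convenience, the loop above can be packaged into a small
power-estimating wrapper (a sketch only; the name power.testlin2
is mine, not part of any standard function):

power.testlin2 <- function(a, b, N, nsim = 1000, alpha = 0.05) {
  ## proportion of simulated runs in which the x^2 term is "significant"
  ## (uses testlin2() as defined above)
  pvals <- replicate(nsim, testlin2(a, b, N)$coefficients[2, 4])
  mean(pvals < alpha)
}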

I found that, with testlin2(1,0.1,N), i.e. a = 1.0, b = 0.1
corresponding to log(p/(1-p)) = x + 0.1*x^2 (a possibly
interesting degree of nonlinearity), I had to go up to N=2000
before I was getting more than 80% of the p-values < 0.05.
That corresponds to 2000 observations at each value of x, or
10,000 observations in all.

Compare this with a similar test for non-linearity with
normally-distributed responses [exercise for the reader].
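For concreteness, one version of that exercise might look like the
following (a sketch under my own assumptions, not code from the
original analysis: same five design points, normal errors with
sd = 1, and the p-value taken from the x^2 coefficient in lm()):

testlin2.norm <- function(a, b, N, sd = 1) {
  ## N normally-distributed observations at each design point
  x <- rep(c(-1.0, -0.5, 0.0, 0.5, 1.0), each = N)
  y <- a*x + b*x^2 + rnorm(length(x), sd = sd)
  X <- cbind(x, x^2); colnames(X) <- c("x", "x2")
  ## return the p-value for the x^2 coefficient
  summary(lm(y ~ X - 1))$coefficients[2, 4]
}

Running this in a simulation loop like the one above lets you compare
directly how much more cheaply the normal model detects curvature.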

You can write functions similar to testlin2 for higher-order
nonlinearities, e.g. testlin3 for a*x + b*x^3, testlin23 for
a*x + b*x^2 + c*x^3, etc. (the modifications required are
obvious) and see how you get on. As I say, don't be optimistic!
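For example, a testlin3 on those lines might read (again a sketch,
differing from testlin2 only in the cubic term):

testlin3 <- function(a, b, N) {
  x  <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
  lp <- a*x + b*x^3                 # cubic instead of quadratic term
  p  <- exp(lp)/(1 + exp(lp))
  n  <- rep(N, 5)
  r  <- rbinom(5, n, p)
  resp <- cbind(r, n - r)
  X <- cbind(x, x^3); colnames(X) <- c("x", "x3")
  summary(glm(resp ~ X - 1, family = binomial), correlation = TRUE)
}

Note that over these x values the columns x and x^3 are very nearly
proportional, so the two coefficient estimates are strongly confounded.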

In particular, run testlin3 a few times and see the sort of
mess that can come out -- in particular gruesome correlations,
which is why "correlation=TRUE" is set in the call to
summary(glm(...),correlation=TRUE).

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 05-Dec-04                                       Time: 12:59:08
------------------------------ XFMail ------------------------------

```