[R] logistic regression with constrained coefficients?

Richard A. O'Keefe ok at cs.otago.ac.nz
Thu Dec 8 23:43:31 CET 2005


I am trying to automatically construct a distance function from
a training set in order to use it to cluster another data set.
The variables are nominal.  One variable is a "class" variable
having two values; it is kept separate from the others.

I have a method which constructs a distance matrix for the levels
of a nominal variable in the context of the other variables.

I want to construct a linear combination of these which gives me
a distance between whole cases that is well associated with the
class variable, in that
    "combined distance between two cases large =>
     they most likely belong to different classes."

So from my training set I construct a set of
    (d1(x1,y1), ..., dn(xn,yn), x_class != y_class)
rows bound together as a data frame (actually I construct it by
columns), and then the obvious thing to try was

    glm(different.class ~ ., family = binomial(), data = distance.frame)
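For concreteness, here is a minimal sketch of how such a frame might be built column by column. The names `train` and `level.dist`, the toy values, and the choice to use every unordered pair of cases are all assumptions made for illustration; they are not the actual data or method.

```r
# Hypothetical training set: two nominal variables plus the class variable.
train <- data.frame(
  v1    = factor(c("a", "b", "a", "c")),
  v2    = factor(c("x", "x", "y", "y")),
  class = factor(c("p", "q", "p", "q"))
)

# Hypothetical per-variable distance matrices over the levels
# (symmetric, zero diagonal); in practice these come from the
# distance-construction method described above.
level.dist <- list(
  v1 = matrix(c(0, 1, 2,
                1, 0, 1,
                2, 1, 0), 3, 3,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c"))),
  v2 = matrix(c(0, 1,
                1, 0), 2, 2,
              dimnames = list(c("x", "y"), c("x", "y")))
)

# All unordered pairs of cases (an assumption; a sample of pairs would also do).
pairs <- t(combn(nrow(train), 2))
i <- pairs[, 1]
j <- pairs[, 2]

# One column per variable: look up the distance between the two levels.
cols <- lapply(setNames(nm = names(level.dist)), function(v) {
  level.dist[[v]][cbind(as.character(train[[v]][i]),
                        as.character(train[[v]][j]))]
})
distance.frame <- as.data.frame(cols)
distance.frame$different.class <- train$class[i] != train$class[j]
```

With n cases this gives n(n-1)/2 rows, so for a large training set some sampling of pairs would presumably be needed.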

The thing is that this gives me both positive and negative coefficients,
whereas the linear combination is only guaranteed to be a metric if the
coefficients are all non-negative.

There are four fairly obvious ways to deal with that:
(1) just force the negative coefficients to 0 and hope.
    This turns out to work rather well, but still...
(2) keep all the coefficients but take max(0, linear combination of distances).
    This turns out to work rather well, but still...
(3) Drop the variables with negative coefficients from the model,
    refit, and iterate until no negative coefficients remain.
    This can hardly be said to work; sometimes nearly all the variables
    are dropped.
(4) Use a version of glm() that will let me constrain the coefficients
    to be non-negative.
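Options (1) and (2) can be sketched roughly as follows. The simulated `distance.frame` is a hypothetical stand-in for the real one, and leaving the intercept out of the combined distance is an assumption.

```r
# Simulated stand-in for distance.frame (illustration only).
set.seed(1)
n <- 500
distance.frame <- data.frame(d1 = runif(n), d2 = runif(n), d3 = runif(n))
eta <- -1 + 2 * distance.frame$d1 - 0.5 * distance.frame$d2
distance.frame$different.class <- runif(n) < plogis(eta)

fit <- glm(different.class ~ ., family = binomial(), data = distance.frame)
w <- coef(fit)[-1]                          # weights, intercept dropped
D <- as.matrix(distance.frame[, names(w)])  # the distance columns

# (1) Force negative coefficients to 0, then combine.
w1 <- pmax(w, 0)
dist1 <- drop(D %*% w1)

# (2) Keep all coefficients, but clamp the combination at 0.
dist2 <- pmax(drop(D %*% w), 0)
```

Both results are non-negative by construction, but only (1) is a non-negative combination of the component distances.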

I *have* searched the R-help archives, and I see that the question about
logistic regression with constrained coefficients has come up before, but
it didn't really get a satisfactory answer.  I've also searched the
documentation of more contributed packages than I could possibly understand.

There is obviously some way to do this using R's general non-linear
optimisation functions.  However, I don't know how to formulate logistic
regression that way.
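One possible formulation, offered as a sketch and not a definitive answer: minimise the negative binomial log-likelihood directly with optim(), using method "L-BFGS-B" so that box constraints can be placed on the coefficients. The simulated data are a hypothetical stand-in for the real distance frame, and leaving the intercept unconstrained is an assumption.

```r
# Simulated stand-in for distance.frame (illustration only).
set.seed(1)
n <- 500
distance.frame <- data.frame(d1 = runif(n), d2 = runif(n), d3 = runif(n))
eta <- -1 + 2 * distance.frame$d1 - 0.5 * distance.frame$d2
distance.frame$different.class <- runif(n) < plogis(eta)

# Design matrix with an intercept column, and a 0/1 response.
X <- cbind(1, as.matrix(distance.frame[, c("d1", "d2", "d3")]))
y <- as.numeric(distance.frame$different.class)

# Negative log-likelihood of the logistic model:
#   -sum( y * eta - log(1 + exp(eta)) ),  eta = X %*% beta
negloglik <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(y * eta - log1p(exp(eta)))
}

# Its gradient: -t(X) %*% (y - p), with p = plogis(eta).
neggrad <- function(beta) {
  eta <- drop(X %*% beta)
  -drop(crossprod(X, y - plogis(eta)))
}

# Intercept unconstrained, all distance weights bounded below by 0.
fit <- optim(par = rep(0, ncol(X)), fn = negloglik, gr = neggrad,
             method = "L-BFGS-B",
             lower = c(-Inf, rep(0, ncol(X) - 1)))
beta <- setNames(fit$par, c("(Intercept)", "d1", "d2", "d3"))
```

Without the `lower` argument this should give essentially the same estimates as glm(); with it, coefficients that would have been negative end up on the boundary at 0. The usual glm() standard errors do not carry over to the constrained fit.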

This whole thing is heuristic.  I am not hell-bent on (ab?)using logistic
regression this way.  It was just an obvious thing to try.  Suggestions
for other means to the same end will be welcome.



