[R] Questions regarding the lasso and glmnet

Ben Haller rhelp at sticksoftware.com
Sat May 28 18:54:34 CEST 2011


  Hi all.  Sorry for the long email.  I have been trying to find someone local to work on this with me, without much luck.  I went to our local stats consulting service here, and the guy there told me that I already know more about model selection than he does.  :-<  He pointed me towards another professor who can perhaps help, but that prof is busy until mid-June, so I want to get as much figured out as I can before I eventually meet with him.  And that prof may turn out not to be much help anyway, in which case all I've got is you folks.  :->

  So I've got a big dataset (300,000 rows) in which the dependent variable is binary, and there are five independent variables, each continuous, and each uniformly distributed between 0 and 1.  I'm trying to use binary logistic regression to get an explanatory model with decent predictive value (but being simple, for explanation, is more important than being optimally predictive).  Once I add the squares of those variables (but no higher powers; this list recently talked me out of those because of the collinearity problems with polynomial fits) and many interactions among the variables and their squares, I have a full model formula of 116 terms; most of those terms have small effects (although they are often still significant in a glm fit, since 300,000 rows is a lot of data).  So I have a model selection problem: I want to find a simple explanatory model containing just the large-effect terms that matter most for understanding what strongly influences the dependent variable.
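
  (For concreteness, the formula is built along these lines; the variable names below are placeholders and my real term set is larger -- 116 terms in all -- but this shows the general shape:)

vars <- c("x1", "x2", "x3", "x4", "x5")        # placeholders for my five variables
sqs  <- paste0(vars, "_sq")                    # columns holding their squares
f <- as.formula(paste("outcome ~ (", paste(c(vars, sqs), collapse=" + "), ")^2"))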

  I was previously trying to do model selection using a step-down approach with a large per-term penalty (much larger than the standard BIC penalty) to force more terms to drop out.  I looked at a wide range of penalty values, got a corresponding set of models with different numbers of terms, looked at the correct prediction rate for each of those models, and basically chose the simplest model that still gave me a pretty good prediction rate (for some subjective definition of "pretty good").  Typically the model I chose would have about 20 terms, out of the original 116.  (A standard BIC step-down would retain more like 100 terms, with only a very slightly better prediction rate.)
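
  (In case it matters, a minimal sketch of that step-down procedure; the penalty value here is just a placeholder, not one I actually used -- k = 2 is AIC, k = log(n) is BIC, and I went well beyond that:)

full <- glm(f, data=df, family=binomial)                  # the full 116-term fit
reduced <- step(full, direction="backward", k=50, trace=0) # step-down with a large per-term penalty (placeholder value)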

  So that was working OK, but in discussions with the folks on this list (thanks everybody for your help!), I have been exploring using the lasso for this instead, to avoid the problems with step-down, gain the benefits of shrinkage, and so forth; clearly it should be much better than my homegrown model selection procedure.  I've been reading about the lasso in Tibshirani (1996) and in The Elements of Statistical Learning.  I'm using glmnet() at present; I have seen that lars() also exists, but I don't understand its documentation as well, so I'm starting with glmnet.  So there's one question:

1. Is my choice of glmnet() ok?  On what basis should I choose glmnet() vs. lars()?

  The lasso wants variables to be centered and scaled, I gather.  glmnet() can do this for me, but I want to understand exactly which variables the fit is done on, so that I can interpret the coefficients properly; for that reason I'd rather do it myself.  (I'm also concerned that glmnet() might not do it correctly, since it doesn't know that some of the terms in the formula are squares and interactions.)  So I'm passing standardize=FALSE to glmnet(), and doing my own scaling beforehand, like this (where df is my data frame):

df$Cx <- as.numeric(scale(df$x))   # an independent variable, centered and scaled (scale() returns a one-column matrix, so coerce it back to a vector)
df$Cx_sq <- df$Cx^2                # the square of the scaled variable; not scaled again

  I do this for each of the five independent variables.  So the variables themselves are centered and scaled, while each squared term is exactly the square of the corresponding scaled variable; I do not scale the squares again.  So question 2:

2. Is the way I'm scaling the variables before calling glmnet() correct?  Or should the squares themselves be centered and scaled?
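
  (Concretely, the loop I'm running over the five variables looks roughly like this -- x1..x5 are placeholders for my real column names:)

for (v in c("x1", "x2", "x3", "x4", "x5")) {
  df[[paste0("C", v)]] <- as.numeric(scale(df[[v]]))     # centered and scaled variable
  df[[paste0("C", v, "_sq")]] <- df[[paste0("C", v)]]^2  # square of the scaled variable, not re-scaled
}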

  Having scaled the variables in this way, I then construct a model matrix and call glmnet() (where f is the 116-term formula and df is my data frame):

mf <- model.frame(f, df)                   # evaluate the formula's terms against the data
mm <- model.matrix(formula(f), mf)[, -1]   # design matrix, with the intercept column dropped
lasso <- glmnet(mm, y=df$outcome, family="binomial", standardize=FALSE)

  I do it this way because glmnet() doesn't accept a formula and a data frame.  I think this is doing the right thing.  The model.matrix() call constructs new columns for all of the interactions in the formula, which of course act as separate independent variables in the regression.  One worry I have is that those columns, like the squares discussed above, are not themselves centered and scaled.  If there's an interaction between, say, Cx and Cy, then the model matrix column for Cx:Cy is just the product of the Cx column and the Cy column, and so it is not centered/scaled.  I don't know whether this is correct.  So question 3:

3. Is my model matrix correct, or do I have a problem with the scale of the interaction variables?
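
  (If the interaction columns do need to be on the same scale, I suppose I could standardize every column of the model matrix after building it; a sketch of what I mean, though I don't know whether it's appropriate:)

mm_std <- scale(mm)   # center and scale every design-matrix column, interaction and squared-term columns included
lasso2 <- glmnet(mm_std, y=df$outcome, family="binomial", standardize=FALSE)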

  Having run glmnet(), I now have lots of coefficients for the different values of lambda.  One thing that seems problematic is that the coefficient for an interaction (Cx:Cy, say) will become non-zero while the coefficients for the underlying terms Cx and Cy are still zero.  Of course glmnet() has no way of preventing this, since it doesn't know the formula or which matrix columns are interactions, etc.  From my previous work on model selection via step-down, my understanding is that this state of affairs should not be allowed.  Using step-down, Cx and Cy will not be dropped as long as Cx:Cy is retained, for example.  As I understand it, this is for essentially the same conceptual reasons that one ought to retain the intercept in a regression even if it is non-significant; the estimated slopes depend upon it, and so dropping it is not correct.  However, I don't see any way to tell glmnet() about these relationships, since it doesn't have a formula interface.  So:

4. Is it a problem that the lasso fit gives non-zero coefficients for interactions whose underlying terms have zero coefficients?
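
  (For what it's worth, this is how I'm inspecting which terms are non-zero at a given lambda; the index 20 is arbitrary, just some point along the path:)

cf <- as.matrix(coef(lasso, s=lasso$lambda[20]))   # coefficients at one value of lambda
rownames(cf)[cf[, 1] != 0]                          # the non-zero terms; "Cx:Cy" can show up here without "Cx" or "Cy"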

  And finally comes the actual model selection problem.  The best lambda value given by cv.glmnet() is for a model that contains most of the 116 terms; I suppose that, as with the BIC step-down, this is because 300,000 rows is a lot of data, and so most of those terms are in fact well-supported by the data.  But I want a simple explanatory model with only the terms of large effect, as described before.  I'm thinking of finding the prediction rate for each lambda value used by glmnet, and using the same subjective "simplest model with a pretty good prediction rate" criterion that I was using before.  But I'm wondering:

5. Is there any way to choose a simple explanatory model, smaller than the best predictive model supported by the data, that is less arbitrary / subjective?
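
  (What I have in mind is roughly the following: the in-sample correct-prediction rate and the number of non-zero terms at each lambda along the path, from which I'd pick the simplest "pretty good" model by eye.  This assumes the outcome is coded so that its values match the class labels predict() returns:)

pred <- predict(lasso, newx=mm, type="class")                  # predicted class at every lambda (one column per lambda)
rates <- colMeans(pred == as.character(df$outcome))            # in-sample correct-prediction rate per lambda
data.frame(lambda=lasso$lambda, nterms=lasso$df, rate=rates)   # number of non-zero terms vs. prediction rate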

  And finally, I've got sort of a philosophical question that I'm struggling with, one that makes me wonder whether the lasso is really the right tool for my job in the first place.  As lambda decreases, the coefficients of terms diverge from zero, one by one, until all terms have non-zero coefficients and you've arrived at the full model.  The funny thing is, for most values of lambda, one term or another in my formula will have only just recently diverged from zero, and so that term will have a very small coefficient -- the shrinkage aspect of the lasso will have shrunk it almost, but not quite, to zero.  This is weird, given that my goal is a simple explanatory model; having terms with very small but non-zero coefficients is contrary to the goals of my statistical analysis.  The lasso is considered better for producing explanatory models than some other shrinkage methods (ridge regression, say), because coefficients do eventually go to zero instead of just getting smaller and smaller; but still, I wonder whether I ought to be using some more extreme method of model selection that, like the step-down I was using before, either keeps a term in or drops it, with no in-between.  On the other hand, I gather that the shrinkage the lasso does is considered to be a good thing.  I don't really understand enough about all this to know what to do.  So:

6. Should I be looking at a method other than the lasso, so that I would not get some terms with very small coefficients?  Or should I use the lasso, and just ignore, in my explanation of the model, terms with tiny coefficients?  Or is there some other way of handling this problem?

  Thanks to anybody who has even read this far, and many, many thanks for any responses!

Ben Haller
McGill University

http://biology.mcgill.ca/grad/ben/


