[R] LASSO: glmpath and cv.glmpath

Mon Aug 24 02:02:38 CEST 2009

The train() function in the caret package can help automate this
process. There are 3 package vignettes and a JSS paper with
documentation. See

http://cran.r-project.org/web/packages/caret/index.html

and

www.jstatsoft.org/v28/i05/

If I remember correctly, one of the earlier papers on the lasso by
Efron didn't think that cross-validation was the best way of tuning
these models (the details escape me).
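
As a rough sketch of what tuning with train() could look like (not run
against the original data: the simulated X and target, the lambda grid and
the 10-fold setup are all assumptions made for illustration, and glmnet is
used as the underlying lasso fitter rather than glmpath):

```r
library(caret)   # the glmnet package must also be installed

set.seed(1)
# Simulated stand-in for the real data: 100 samples, 20 "proteins"
X <- matrix(rnorm(100 * 20), nrow = 100,
            dimnames = list(NULL, paste0("prot", 1:20)))
target <- factor(ifelse(X[, 1] - X[, 2] + rnorm(100) > 0,
                        "Cancer", "Noncancer"))

# alpha = 1 is the lasso penalty; the lambda grid is an arbitrary choice
grid <- expand.grid(alpha = 1,
                    lambda = 10^seq(-3, 0, length.out = 25))

fit <- train(x = X, y = target, method = "glmnet",
             trControl = trainControl(method = "cv", number = 10),
             tuneGrid = grid)

fit$bestTune   # the (alpha, lambda) pair picked by cross-validation

# Coefficients of the final model (refit on all the data) at that lambda
b <- as.matrix(coef(fit$finalModel, s = fit$bestTune$lambda))
selected <- rownames(b)[b[, 1] != 0]
selected
```

By default train() picks the lambda with the best resampled accuracy; a
different metric or resampling scheme can be set through the metric
argument and trainControl().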

Max

2009/8/21 Steve Lianoglou <mailinglist.honeypot at gmail.com>:
> Hi,
>
> On Aug 21, 2009, at 9:47 AM, Peter Schüffler wrote:
>
>> Hi,
>>
>> perhaps you can help me find out how to choose the best lambda in a
>> LASSO model.
>>
>> I have a feature selection problem with 150 proteins potentially
>> predicting Cancer or Noncancer. With a lasso model
>>
>> fit.glm <- glmpath(x=as.matrix(X), y=target, family="binomial")
>>
>> (target is coded 0/1 for Cancer/Noncancer, and X holds the proteins as
>> numeric expression values), I get the following path (PICTURE 1).
>> One of these models is the best according to its cross-validation (PICTURE
>> 2); the red line marks the best cross-validation error. It's produced by
>>
>> cv <- cv.glmpath(x=as.matrix(X), y=unclass(T)-1, family="binomial",
>>                  type="response", plot.it=TRUE, se=TRUE)
>> abline(v=cv$fraction[max(which(cv$cv.error==min(cv$cv.error)))],
>>        col="red", lty=2, lwd=3)
>>
>>
>> Does anyone know how to get from the norm fraction in PICTURE 2 to
>> the corresponding model in PICTURE 1? What is the best model? Which
>> coefficients does it have? I can only see the best model's cross-validation
>> error, but not the actual model. How can I see it?
>
> None of your pictures came through, so I'm not sure exactly what you're
> trying to point out, but in general the cross validation will help you find
> the best value for lambda for the lasso. I think it's the value of lambda
> that you'll use for your downstream analysis.
>
> I haven't used the glmpath package, but I have been using the glmnet package
> which is also by Hastie, newer, and I believe covers the same use cases as
> the glmpath library (though, to be honest, I'm not quite familiar w/ the Cox
> proportional hazards model). Perhaps you might want to look into it.
>
> Anyway, speaking from my experience w/ the glmnet package, you might try
> this:
>
> 1. Determine the best value of lambda using CV. I guess you can use MSE or
> R^2 as you see fit as your yardstick of "best."
>
> 2. Train a model over all of your data and ask it for the coefficients at
> the given value of lambda from 1.
>
> 3. See which proteins have non-zero coefficients.
>
> <tongue-in-cheek>
> 4. Divine a biological story that is explained by your statistical findings.
>
> 5. Publish.
> </tongue-in-cheek>
>
> I guess there are many ways to do model selection, and I'm not sure it's
> clear how effective they are (which isn't to say that you shouldn't do
> them)[1] ... you might want to further divide your data into
> training/tuning/test (somewhere between steps 1 and 2) as another means of
> scoring models.
>
> HTH,
> -steve
>
> [1] http://hunch.net/?p=29
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  |  Memorial Sloan-Kettering Cancer Center
>  |  Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
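
Steve's steps 1 to 3 quoted above could be sketched with glmnet roughly as
follows. This is only an illustration: the data are simulated, the choice
of binomial deviance as the CV yardstick and of 10 folds is an assumption,
and lambda.min is just one of the values cv.glmnet() reports (lambda.1se is
a more conservative alternative).

```r
library(glmnet)

set.seed(1)
# Simulated stand-in for the real data: 100 samples, 20 "proteins"
X <- matrix(rnorm(100 * 20), nrow = 100,
            dimnames = list(NULL, paste0("prot", 1:20)))
target <- as.numeric(X[, 1] - X[, 2] + rnorm(100) > 0)   # 0/1 outcome

# Step 1: pick lambda by 10-fold cross-validation
cvfit <- cv.glmnet(X, target, family = "binomial",
                   type.measure = "deviance", nfolds = 10)
best.lambda <- cvfit$lambda.min

# Step 2: fit on all the data and extract coefficients at that lambda
fit <- glmnet(X, target, family = "binomial")
b <- as.matrix(coef(fit, s = best.lambda))

# Step 3: proteins with non-zero coefficients (intercept dropped)
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
selected
```

plot(cvfit) shows the CV curve with lambda.min and lambda.1se marked, which
plays the same role as the red line in the cv.glmpath plot.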

-- 

Max