[R] How to validate model?

Frank E Harrell Jr f.harrell at vanderbilt.edu
Wed Oct 8 00:31:34 CEST 2008


Ajay ohri wrote:
> This is an approach
> 
> Score the hold-out sample with the model.
> 
> Compare ROC curves between the build and validation datasets.
> 
> Check for changes in parameter estimates (variable coefficients), their 
> p-values, and signs.
> 
> Check for binning (response versus deciles of individual variables).
> 
> Check concordance and the KS statistic.
> A decile-wise comparison of predicted versus actual performance, with 
> rank ordering of the deciles, helps in explaining the model to a 
> business audience, who generally have business-specific input that may 
> require the scoring model to be tweaked.
> 
> This assumes multicollinearity, outliers and missing values have 
> already been treated, and that the holdout sample checks for 
> overfitting. You can always rebuild the model using a different random 
> holdout sample.
> 
> A stable model would not change too much.
> 
> In actual implementation, try to build real-time triggers for 
> percentage deviations between predicted and actual values.
> 
> Regards,
> 
> Ajay
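
Ajay's holdout steps above can be sketched in base R. This is an illustrative sketch on simulated data, not a specific package's implementation: the concordance (AUC) uses the Mann-Whitney rank formula, and the KS statistic is the maximum gap between the score distributions of the two outcome classes.

```r
## Simulated build/validation split (in practice these are your real samples)
set.seed(2)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))
idx     <- sample(n, 0.7 * n)
train   <- data.frame(x = x[idx],  y = y[idx])
holdout <- data.frame(x = x[-idx], y = y[-idx])

fit  <- glm(y ~ x, family = binomial, data = train)
phat <- predict(fit, newdata = holdout, type = "response")

## Concordance (AUC) via the Mann-Whitney rank formula
auc <- function(p, y) {
  r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

## KS statistic: largest gap between the ECDFs of scores for goods vs bads
ks <- function(p, y) {
  max(abs(ecdf(p[y == 1])(sort(p)) - ecdf(p[y == 0])(sort(p))))
}

auc(phat, holdout$y)  # concordance on the holdout sample
ks(phat, holdout$y)   # KS statistic on the holdout sample
```

The same functions applied to the build sample give the comparison Ajay describes; a large drop from build to holdout suggests overfitting.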

I wouldn't recommend that approach, but legitimate differences of opinion 
exist on the subject.  In particular, I fail to see the purpose of 
validating indirect measures such as ROC curves.

Frank

> 
> www.decisionstats.com
> 
> On Wed, Oct 8, 2008 at 1:33 AM, Frank E Harrell Jr 
> <f.harrell at vanderbilt.edu> wrote:
> 
>     Pedro.Rodriguez at sungard.com wrote:
> 
>         Hi Frank,
> 
>         Thanks for your feedback! But I think we are talking about
>         two different things.
> 
>         1) Validation: The generalization performance of the
>         classifier. See, for example, "Studies on the Validation of
>         Internal Rating Systems" by BIS.
> 
> 
>     I didn't think the desire was for a classifier but instead was for a
>     risk predictor.  If prediction is the goal, classification methods
>     or accuracy indexes based on classifications do not work very well.
> 
> 
> 
>         2) Calibration: Correct calibration of a PD rating system
>         means that the calibrated PD estimates are accurate and
>         conform to the observed default rates. See, for instance,
>         "An Overview and Framework for PD Backtesting and
>         Benchmarking" by Castermans et al.
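
A minimal sketch of the backtesting idea described here, on simulated data: compare the observed default rate against the mean predicted PD (calibration-in-the-large) with a binomial test. All data and numbers below are illustrative assumptions, not from any cited framework.

```r
## Hypothetical portfolio: assumed-calibrated PD estimates and simulated
## default outcomes drawn from those same PDs
set.seed(3)
n        <- 1000
pd       <- runif(n, 0.01, 0.10)   # predicted one-year PDs
defaults <- rbinom(n, 1, pd)       # observed defaults

## Binomial test of observed default count against the mean predicted PD;
## a small p-value would flag a calibration problem at the portfolio level
binom.test(sum(defaults), n, p = mean(pd))
```

Rating-grade-level backtests apply the same test within each grade rather than to the whole portfolio.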
> 
> 
>     I'm unclear on what you mean here.  Correct calibration of a
>     predictive system means that the UNcalibrated estimates are accurate
>     (i.e., they don't need any calibration).  (What is PD?)
> 
> 
> 
>         Frank, you are referring to #1 and I am referring to #2.
>         Nonetheless, I would never create a rating system if my
>         model doesn't discriminate better than a coin toss.
> 
> 
>     For sure
>     Frank
> 
> 
> 
>         Regards,
> 
>         Pedro
> 
> 
> 
> 
> 
> 
>         -----Original Message-----
>         From: Frank E Harrell Jr [mailto:f.harrell at vanderbilt.edu]
>         Sent: Tuesday, October 07, 2008 11:02 AM
>         To: Rodriguez, Pedro
>         Cc: maithili_shiva at yahoo.com; r-help at r-project.org
>         Subject: Re: [R] How to validate model?
> 
>         Pedro.Rodriguez at sungard.com wrote:
> 
>             Usually one validates scorecards with the ROC curve,
>             Pietra Index, KS test, etc. You may be interested in
>             WP 14 from BIS (www.bis.org).
> 
>             Regards,
> 
>             Pedro
> 
> 
>         No, the validation should be done using an absolute
>         reliability (calibration) curve.  You need to verify that at
>         all levels of predicted risk there is agreement with the true
>         probability of failure.  An ROC curve does not do that, and I
>         doubt the others do.  A resampling-corrected loess
>         calibration curve is a good approach, as implemented in the
>         Design package's calibrate function.
> 
>         Frank
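
The idea behind the recommended calibration curve can be sketched in base R on simulated data. This shows only the uncorrected loess curve; the resampling (overfitting) correction mentioned above is what the calibrate function provides (the Design package has since been continued as rms).

```r
## Simulated predictors and binary outcomes
set.seed(1)
n  <- 2000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + x1 + 0.5 * x2))

fit  <- glm(y ~ x1 + x2, family = binomial)
phat <- predict(fit, type = "response")   # predicted risks

## Smooth the observed 0/1 outcome against predicted risk; a
## well-calibrated model tracks the 45-degree line
cal <- loess(y ~ phat)
ord <- order(phat)
plot(phat[ord], predict(cal)[ord], type = "l",
     xlab = "Predicted risk", ylab = "Observed proportion")
abline(0, 1, lty = 2)   # line of perfect calibration
```

Because the curve is estimated on the same data the model was fit to, it is optimistic; the bootstrap correction in calibrate addresses that.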
> 
>             -----Original Message-----
>             From: r-help-bounces at r-project.org
>             [mailto:r-help-bounces at r-project.org]
> 
>             On Behalf Of Maithili Shiva
>             Sent: Tuesday, October 07, 2008 8:22 AM
>             To: r-help at r-project.org
>             Subject: [R] How to validate model?
> 
>             Hi!
> 
>             I am working on a scorecard model and have arrived at the
>             regression equation, using logistic regression in R.
> 
>             My question is: how do I validate this model? I have a
>             hold-out sample of 5,000 customers.
> 
>             Please guide me. The problem is that I have never used
>             logistic regression before, nor am I familiar with credit
>             scoring models.
> 
>             Thanks in advance
> 
>             Maithili
> 
>             ______________________________________________
>             R-help at r-project.org mailing list
>             https://stat.ethz.ch/mailman/listinfo/r-help
>             PLEASE do read the posting guide
>             http://www.R-project.org/posting-guide.html
>             and provide commented, minimal, self-contained, reproducible
>             code.
> 
> 
> 
> 
> 
> 
>     -- 
>     Frank E Harrell Jr   Professor and Chair           School of Medicine
>                         Department of Biostatistics   Vanderbilt University
> 
> 
> 
> 
> 
> -- 
> Regards,
> 
> Ajay Ohri
> http://tinyurl.com/liajayohri
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University


