[R] an off-topic question -> model validation

Frank E Harrell Jr f.harrell at vanderbilt.edu
Fri Nov 12 05:51:12 CET 2004


Wensui Liu wrote:
> Currently, I am working on a data mining project and plan to divide
> the data table into 2 parts, one for modeling and the other for
> validation to compare several models.
> 
> But I am not sure about the percentage of data I should use to build
> the model and the one I should keep to validate the model.
> 
> Is there any literature reference about this topic? 
> 
> Thank you so much!

Data splitting is a very inefficient way to validate a model unless the
sample size is extremely large, because the model is fit on only part of
the data and tested on only the remainder.  Consider Efron's "optimism"
bootstrap instead, as implemented in the validate function in the Design
package.  If you still want them, validate can also do data splitting
and cross-validation.
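
To make the mechanics of the optimism bootstrap concrete, here is a
minimal sketch in plain Python.  It is not the code behind validate in
the Design package.  The helper names (fit_line, r_squared,
optimism_bootstrap), the use of R^2 as the accuracy index, and the
one-predictor least-squares model are all assumptions chosen for
illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return my - b * mx, b


def r_squared(model, xs, ys):
    """R^2 of an already-fitted model (a, b), evaluated on (xs, ys)."""
    a, b = model
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1.0 - sse / sst


def optimism_bootstrap(xs, ys, n_boot=200, seed=0):
    """Return (apparent, optimism, corrected) R^2.

    Hypothetical helper: assumes the y values are not all identical.
    """
    rng = random.Random(seed)
    n = len(xs)
    # Apparent performance: fit and evaluate on the full sample.
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    total = 0.0
    for _ in range(n_boot):
        # Resample rows with replacement; redraw the rare degenerate
        # sample that repeats a single observation n times.
        idx = [rng.randrange(n) for _ in range(n)]
        while len(set(idx)) < 2:
            idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Optimism of this replicate: performance on the bootstrap
        # sample minus performance of the same fit on the original data.
        total += r_squared(model, bx, by) - r_squared(model, xs, ys)
    optimism = total / n_boot
    return apparent, optimism, apparent - optimism


if __name__ == "__main__":
    # Synthetic data, made up purely for the demonstration.
    rng = random.Random(1)
    xs = [rng.uniform(0, 10) for _ in range(40)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    apparent, optimism, corrected = optimism_bootstrap(xs, ys)
    print("apparent R^2: ", round(apparent, 3))
    print("optimism:     ", round(optimism, 3))
    print("corrected R^2:", round(corrected, 3))
```

Every observation is used both to fit the model and to estimate its
accuracy, which is why this approach wastes less information than
holding out a validation sample.  Any model-selection steps would need
to be repeated inside the loop for each resample; the sketch has none,
so it only refits the line.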

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University



