[R] How to estimate whether overfitting?

Frank E Harrell Jr f.harrell at Vanderbilt.Edu
Mon May 10 14:58:51 CEST 2010

```On 05/10/2010 12:32 AM, bbslover wrote:
>
> Many thanks. I can try to use a test set with 100 samples.
>
> Another question: how can I rationally split my data into a training set
> and a test set (training set with 108 samples, test set with 100 samples)?
>
> As far as I know, the test set should have the same distribution as the
> training set. What method can produce such a rational split?
>
> And which R packages can split data into training and test sets
> rationally?
>
>
> If the split is random, it seems that many repeated splits are needed, with
> the averaged results taken as the final results.
>
> However, I would like to use some method that gives a fixed training set
> and test set instead of a random split.
>
> The training and test sets should look like this: ideally, the division
> must be performed such that points representing both the training and test
> sets are distributed within the whole feature space occupied by the entire
> dataset, and each point of the test set is close to at least one point of
> the training set. This approach ensures that the similarity principle can
> be employed for the output prediction of the test set. Certainly, this
> condition cannot always be satisfied.
>
> So, in general, which algorithms are usually used to split the data, and
> which are more rational? Some papers say they split the data set randomly;
> what does "randomly" mean here? Just random selection, or some definite
> method, e.g. ordering by output? I would really like to know which package
> can split data rationally.
>
> Also, if one wants better results, some "tricks" can be used, e.g.
> selecting the test set again and again, keeping the test set with the best
> results as the final test set, and saying that it was selected randomly.
> But that is not truly random; it is false.
>
> Thank you, and sorry for so many questions, but this has always puzzled me.
> Up to now I have found no good method to split my data rationally into a
> training set and a test set.
>
> Finally, splitting into training and test sets should be done before
> modeling, and it seems this can be done from the features alone (SOM), from
> the features and the output (the SPXY algorithm; paper: "A method for
> calibration and validation subset partitioning"), or from the output alone
> (ordering by output)?
>
> But often there are many features to be calculated, and some features are
> all zero or have a low standard deviation (sd < 0.5). Should we delete
> these features before splitting the whole data set?
>
> And then use the remaining features to split the data, use only the
> training set to build the regression model, perform feature selection, and
> do cross-validation, and use the independent test set only to test the
> built model. Is that right?
>
> Maybe my thinking about the whole modeling process is not clear, but I
> think it goes like this:
> 1) Get samples.
> 2) Calculate features.
> 3) Preprocess the calculated features (e.g. remove all-zero features).
> 4) Rationally split the data into training and test sets (this always
> puzzles me: how on earth to split?).
> 5) Build the model and at the same time tune its parameters using
> resampling methods on the training set only, and get the final model.
> 6) Test the model performance on the independent test set (unseen samples).
> 7) Evaluate the model. Good or bad? Overfitting? (In general, what counts
> as overfitting? Can you give me an example? As far as I know, a model is
> overfitting when it fits the training set well but predicts the independent
> test set badly. But what is good, and what is bad? With r2=0.94 in the
> training set and r2=0.70 in the test set, is the model overfitting? Can the
> model be accepted? And in general, what kind of model is acceptable?)
> 8) Conclusion: how good is the model?
>
> That is my thinking, and there are many questions waiting for answers.
>
> thanks
>
> kevin
>
>

Kevin: I'm sorry I don't have time to deal with such a long note, but
briefly, data splitting is not a good idea no matter how you do it unless
N > perhaps 20,000.  I suggest resampling, e.g., either the bootstrap
with 300 resamples or 50 repeats of 10-fold cross-validation.
Among other places, these are implemented in my rms package.

Frank

--
Frank E Harrell Jr   Professor and Chairman        School of Medicine
Department of Biostatistics   Vanderbilt University

```
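
For readers who want to see what "the bootstrap with 300 resamples" can look like in practice, here is a minimal sketch of the general optimism-bootstrap idea, written in plain Python with only the standard library. It is not the rms code and makes no claim about how rms does it internally. The data are simulated (108 samples to echo the training-set size in the question, 8 features of which only the first carries signal), and all function names are hypothetical. The model is refit on each bootstrap resample; the drop in R^2 when that refit model is scored on the original data estimates the optimism, which is then subtracted from the apparent R^2.

```python
import random
from statistics import fmean


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for c in reversed(range(n)):
        s = sum(M[c][k] * x[k] for k in range(c + 1, n))
        x[c] = (M[c][n] - s) / M[c][c]
    return x


def fit_ols(X, y):
    """Least squares with an intercept, via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(A, b)


def r_squared(beta, X, y):
    """R^2 of the fixed coefficients beta evaluated on (X, y)."""
    pred = [beta[0] + sum(bj * v for bj, v in zip(beta[1:], row)) for row in X]
    my = fmean(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1.0 - sse / sst


def optimism_corrected_r2(X, y, n_boot=300, seed=0):
    """Return (apparent R^2, optimism-corrected R^2)."""
    rng = random.Random(seed)
    n = len(y)
    apparent = r_squared(fit_ols(X, y), X, y)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample rows with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        beta = fit_ols(Xb, yb)                       # refit on the resample
        # optimism = performance on the resample minus performance on the original data
        optimism.append(r_squared(beta, Xb, yb) - r_squared(beta, X, y))
    return apparent, apparent - fmean(optimism)


# Simulated data (an assumption for illustration only): 7 of the 8 features are noise.
rng = random.Random(1)
n, p = 108, 8
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] + rng.gauss(0, 1) for row in X]

apparent, corrected = optimism_corrected_r2(X, y, n_boot=300, seed=0)
print(f"apparent R^2 = {apparent:.3f}, optimism-corrected R^2 = {corrected:.3f}")
```

The gap between the two printed numbers is the estimated overfitting. Note that every step that used the outcome (feature selection, tuning) would have to be repeated inside the loop for the correction to be honest; this sketch has no such steps, so only the fit itself is repeated.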