[R] prediction error for test set-cross validation
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Wed Mar 11 14:02:11 CET 2009
Uwe Ligges wrote:
>
>
> Mehmet U Ayvaci wrote:
>> Hi,
>>
>>
>> I have a database of 2211 rows with 31 entries each, and I manually
>> split the data into 10 folds for cross-validation. I build the logistic
>> regression model as:
>>
>>
>>> model <- glm(qual ~ AgGr + FaHx + PrHx + PrSr + PaLp + SvD + IndExam +
>>                 Rad + BrDn + BRDS + PrinFin + SkRtr + NpRtr + SkThck + TrThkc +
>>                 SkLes + AxAdnp + ArcDst + MaDen + CaDt + MaMG +
>>                 MaMrp + MaSh + SCTub + SCFoc + MaSz,
>>               data = trainSet, family = binomial(link = "logit"))
>>
>>
>>
>> The variables are taken from trainSet, which is 1989 x 31; the test set
>> is 222 x 31. My question is that when I try to predict on the test set,
>> I get the error:
>>
>>
>>
>>> predict(model, newdata = testSet, type = "response")
>>
>> Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
>>   subscript out of bounds
>>
>>
>>
>> It works fine on trainSet, so it must be something about testSet. I also
>> noticed that some independent variables, e.g. "MaSz", take 3 distinct
>> values in trainSet but 4 in testSet. I am not sure whether this is the
>> cause. If so, what would be the remedy?
>>
>>
>>
>> Since I can retrieve the coefficients of the logistic regression, I could
>> manually calculate the response for each entry in testSet. That would
>> solve my problem, although it would be burdensome, and I don't know an
>> easy way of doing it since my logistic regression has 80+ coefficients.
>
>
> Well, if "MaSz takes 3 different values in the trainset vs. 4 different
> ones in the testSet", then you won't even be able to calculate it by
> hand, because you got no coefficients for the 4th level of that factor.
> Either you need the data to estimate coefficients from or you cannot
> predict.
>
> Uwe Ligges
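If the problem is only that the test data were coded with an extra, unused
level (rather than rows that truly carry a fourth value), one way to make
the prediction step work is to check for unseen levels and force the test
factor onto the training levels. A rough sketch, not run on your data; it
assumes trainSet and testSet are data frames and that MaSz is a factor in
both:

  ## levels observed in the test set but not in the training set
  setdiff(levels(factor(testSet$MaSz)), levels(factor(trainSet$MaSz)))

  ## map the test factor onto the training levels; rows with a level the
  ## model has never seen become NA and cannot be predicted
  testSet$MaSz <- factor(testSet$MaSz, levels = levels(trainSet$MaSz))

  ## then use the generic predict() with an explicit newdata argument
  p <- predict(model, newdata = testSet, type = "response")

If the fourth level really does occur in the test rows, then, as Uwe says,
the model has to be refitted on data that contains that level before any
prediction is possible.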
And note that your test sample is far too small to yield reliable
results. You need to use resampling (e.g., the bootstrap or 50 repeats
of 10-fold cross-validation). See the validate function in the Design
package. Note that validate does not report the proportion classified
correctly, because that is an improper scoring rule with the lowest
information, precision, and power.
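With Design this looks roughly as follows (a sketch only, not a tested
recipe; the data frame name dat and the shortened formula are placeholders,
and the fit must keep the design matrix and response via x=TRUE, y=TRUE
for validate to work):

  library(Design)                        # provides lrm() and validate()
  f <- lrm(qual ~ AgGr + FaHx + MaSz,    # formula shortened for illustration
           data = dat,                   # dat: your full data set (name assumed)
           x = TRUE, y = TRUE)
  validate(f, method = "boot", B = 200)  # bootstrap-validated indexes (Dxy, slope, ...)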
Frank Harrell
>>
>> Could somebody advise?
>>
>> Thanks,
>> M
--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University