[R-sig-Geo] To validate logistic regression

Wed Apr 27 09:01:09 CEST 2011

Dear Komine,

I have another, more sophisticated approach for you.
If you really want to validate your logistic model
with x-fold internal cross-validation, you should not
partition your data only once. I recommend
repeating the split 100 to 999 times to get an estimate of
how stable your data and model quality are.

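####Repeated random 80/20 split (test sets the size of one fold in 5-fold CV), 100 permutations
##Assumes: Dataset is your data frame with the response in column "Y",
##and yourFormula is your model formula, e.g. Y ~ X
k		<- 20 			##20% of the dataset as test data
N		<- 100			##100 permutations (random splits)
permu		<- paste("Permut_",1:N,sep="")
AUC_Results 	<- matrix(NA, 1, N, dimnames=list("AUC",permu))
numrows 	<- nrow(Dataset)
learnDataSize	<- round(numrows*(1-0.01*k))
testDataSize	<- numrows-learnDataSize
##loop: new random split, refit, and score on the held-out rows
for (j in 1:N){
		cat("calculating",((j/N)*100),"% \n")
		learnIndex	<-sample(nrow(Dataset))[1:learnDataSize]
		learnData	<-Dataset[learnIndex,]
		testData	<-Dataset[-learnIndex,]
		mg		<-glm(formula =yourFormula,
				family = binomial(link = "logit"),data=learnData)
		##backward selection is redone on every learning set
		bestmod_cv	<-step(mg,direction="backward",trace=0)
		predicted_cv	<-predict(bestmod_cv, newdata=testData, type="response")
		observed_cv		<-testData[,"Y"]
		##roc.auc() is not in base R; load the package that provides it first
		##(roc.area() from the verification package takes the same two
		##arguments and also returns the area as $A)
		AUC_result		<-roc.auc(observed_cv, predicted_cv)
		AUC_Results[1,j]	<-AUC_result$A
				}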
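
If you want to try the idea without your own data, here is a minimal
self-contained sketch of the same repeated-split scheme. Everything in it
is an assumption for illustration: the data are simulated, the model is a
plain Y ~ X without the step() selection, and auc_rank() is a hypothetical
helper that computes the AUC from ranks (the Mann-Whitney statistic) so no
extra package is needed. It also records the percentage of correct
predictions at a 0.5 cutoff, which was part of the original question.

```r
# Simulated stand-in for `Dataset` (assumption: binary response Y, one predictor X)
set.seed(42)
n_obs   <- 200
X       <- rnorm(n_obs)
Dataset <- data.frame(Y = rbinom(n_obs, 1, plogis(0.5 + 1.2 * X)), X = X)

# Hypothetical helper: rank-based AUC (Mann-Whitney U divided by n_pos * n_neg)
auc_rank <- function(observed, predicted) {
  r     <- rank(predicted)            # average ranks are used for ties
  n_pos <- sum(observed == 1)
  n_neg <- sum(observed == 0)
  (sum(r[observed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

N          <- 100                     # number of random splits
test_share <- 0.20                    # 20% of the rows held out each time
n_learn    <- round(nrow(Dataset) * (1 - test_share))
auc        <- numeric(N)
pct_ok     <- numeric(N)

for (j in seq_len(N)) {
  learn_idx <- sample(nrow(Dataset), n_learn)
  fit       <- glm(Y ~ X, family = binomial, data = Dataset[learn_idx, ])
  test      <- Dataset[-learn_idx, ]
  p         <- predict(fit, newdata = test, type = "response")
  auc[j]    <- auc_rank(test$Y, p)
  # percentage of test rows classified correctly at a 0.5 cutoff
  pct_ok[j] <- 100 * mean((p > 0.5) == (test$Y == 1))
}

summary(auc)                          # centre and range of the AUC over all splits
quantile(auc, c(0.025, 0.975))        # spread: how stable the model quality is
mean(pct_ok)                          # average percentage of correct predictions
```

The spread of the AUC values over the splits is the interesting part: a
narrow range suggests the model quality does not depend much on which rows
happen to land in the test set.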

Cheers,
Tobias Erik Reiners
Mammalian Ecology Group

Quoting Dylan Beaudette <debeaudette at ucdavis.edu>:

> Another approach:
>
> See ?lrm, ?validate, and ?calibrate from the rms package.
>
> Dylan
>
> On Tuesday, April 26, 2011, Bram Van Moorter wrote:
>> Dear Komine,
>> Not sure whether this is the easiest way, but it has worked for me:
>>
>> set.seed(0)
>> head(tab <- data.frame(Y=as.numeric(runif(100)>0.5), X=rnorm(100)))
>> # the 66% of data you want in one sample
>> subs <- sample(c(1:nrow(tab)), round(nrow(tab)*0.66), replace=F)
>> tab1 <- tab[subs, ] # the one sample
>> # the other sample: the data that do not fall in the first sample
>> tab2 <- tab[!c(1:nrow(tab)) %in% subs, ]
>>
>> rlog1 <- glm(Y~X,family=binomial,data=tab1)
>> summary(rlog1)
>> tab2$pred <-predict(rlog1, newdata=tab2, type="response")
>> hist(tab2$pred)
>>
>> # ROCR makes it easy to draw ROC curves for assessing your prediction
>> library(ROCR)
>> pred <- prediction(tab2$pred, tab2$Y)
>> perf <- performance(pred,"tpr","fpr")
>> # the red diagonal is what a random prediction gives; the curve lies
>> # close to it here, which you would expect in this example
>> plot(perf); abline(0, 1, col="red")
>>
>> Best,
>> Bram
>>
>>
>> > Hi,
>> > I would like your help to validate my logistic regression. I know how to
>> > do logistic regression.
>> >
>> > rlog<-glm(Y~X,family=binomial,data=tab)
>> > summary(rlog)
>> > HLgof.test(fit = fitted(rlog), obs=Y)
>> >
>> > However, I would like to validate my model. For example, to divide my data
>> > into a sample for training (66%) and a sample for validation (34%).
>> > e.g. for my table
>> > Area   Y     X
>> > 1       1     135
>> > 1       0     200
>> > 1       1      97
>> > 1       1     160
>> > 1       0     201
>> > 1       1     144
>> > 1       0     100
>> >
>> > But I don't know how to validate it.
>> > 1- My first problem: How to create my 2 samples from my variables Y and X
>> > using percentages of 66 and 34%?
>> >
>> > 2- How to get the percentage of good predictions and bad predictions?
>> >
>> > Thanks for your Help
>> > Komine
>> >
>>
>>
>> --
>> Bram Van Moorter
>> Centre for Conservation Biology (NTNU),
>> Norwegian Institute for Nature Research (NINA)
>> Trondheim (Norway)
>> email:  Bram.Van.Moorter at gmail.com
>> website: http://ase-research.org/moorter
>> phone: +47 73596060
>>
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>
>
> --
> Dylan E. Beaudette
> USDA-NRCS Soil Scientist
> California Soil Resource Lab
> http://casoilresource.lawr.ucdavis.edu/