[R-sig-Geo] To validate logistic regression

Tue Apr 26 12:24:00 CEST 2011

Dear Komine,
Not sure whether this is the easiest way, but it has worked for me:

set.seed(0)
head(tab <- data.frame(Y=as.numeric(runif(100)>0.5), X=rnorm(100)))
# row indices of the 66% of the data used to fit the model
subs <- sample(seq_len(nrow(tab)), round(nrow(tab)*0.66), replace=FALSE)
tab1 <- tab[subs, ]  # the training sample
# the validation sample: the rows that do not fall in the training sample
tab2 <- tab[!seq_len(nrow(tab)) %in% subs, ]

rlog1 <- glm(Y~X,family=binomial,data=tab1)
summary(rlog1)
tab2$pred <-predict(rlog1, newdata=tab2, type="response")
hist(tab2$pred)

# ROCR makes it easy to draw ROC curves, which let you assess your prediction
library(ROCR)
pred <- prediction(tab2$pred, tab2$Y)
perf <- performance(pred,"tpr","fpr")
plot(perf); abline(0, 1, col="red")
# the red diagonal is what a random prediction would give; the ROC curve
# should lie close to it in this example, because Y was simulated
# independently of X
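
For the percentage of good and bad predictions that you asked about, one option is to classify the validation sample at a cut-off and cross-tabulate the result against the observed values. A minimal sketch on the same simulated data, assuming a cut-off of 0.5 (an arbitrary choice of mine, you may want another one); the names train, valid, cutoff and conf_mat are only illustrative:

```r
# Simulated data as above: Y is independent of X, so expect roughly 50% correct
set.seed(0)
tab <- data.frame(Y = as.numeric(runif(100) > 0.5), X = rnorm(100))

# 66% / 34% split into training and validation samples
subs  <- sample(seq_len(nrow(tab)), round(nrow(tab) * 0.66))
train <- tab[subs, ]
valid <- tab[-subs, ]

# Fit on the training sample, predict probabilities for the validation sample
fit <- glm(Y ~ X, family = binomial, data = train)
p   <- predict(fit, newdata = valid, type = "response")

# Classify at an assumed cut-off of 0.5 and cross-tabulate with the observations
cutoff     <- 0.5
pred_class <- factor(as.numeric(p > cutoff), levels = c(0, 1))
obs_class  <- factor(valid$Y, levels = c(0, 1))
conf_mat   <- table(observed = obs_class, predicted = pred_class)
print(conf_mat)

# Percentage of good (diagonal) and bad (off-diagonal) predictions
pct_correct <- 100 * sum(diag(conf_mat)) / sum(conf_mat)
pct_wrong   <- 100 - pct_correct
cat(sprintf("correct: %.1f%%, wrong: %.1f%%\n", pct_correct, pct_wrong))
```

The percentages depend on the cut-off, so it can be worth trying a few values; the ROC curve above shows the same trade-off across all cut-offs at once.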

Best,
Bram

> Hi,
> I would like your help to validate my logistic regression. I know how to do
> logistic regression.
>
> rlog<-glm(Y~X,family=binomial,data=tab)
> summary(rlog)
> HLgof.test(fit = fitted(rlog), obs = tab$Y)
>
> However, I would like to validate my model, for example by dividing my data
> into a sample for training (66%) and a sample for validation (34%).
> E.g., for my table:
> Area   Y     X
> 1       1     135
> 1       0     200
> 1       1      97
> 1       1     160
> 1       0     201
> 1       1     144
> 1       0     100
>
> But I don't know how to validate it.
> 1- My first problem: how do I create my 2 samples from my variables Y and X
> using the percentages 66% and 34%?
>
> 2- How do I get the percentage of good predictions and bad predictions?
>
> Thanks for your Help
> Komine
>

-- 
Bram Van Moorter
Centre for Conservation Biology (NTNU),
Norwegian Institute for Nature Research (NINA)
Trondheim (Norway)
email:  Bram.Van.Moorter at gmail.com
website: http://ase-research.org/moorter
phone: +47 73596060