[R-sig-ME] choice of prediction error calculations for zero inflated model

D. Rizopoulos d.rizopoulos at erasmusmc.nl
Mon Apr 1 13:09:24 CEST 2019


Andras,

Thanks for sharing the code.

I actually realized that there is a mistake in my implementation for 
zero-inflated models. Namely, the call to the predict() method in 
scoring_rules() provides the mean response, which in the case of 
zero-inflated models is

(1 - pi) * mean_response_non-zero_part,

where pi is the probability of being in the zero part. But in the 
specification of prob_fun(), 'mean' should be the 
"mean_response_non-zero_part", not multiplied by (1 - pi).

You could also check that in the version with pscl.
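For the pscl version, the adjustment can be sketched as follows (a minimal sketch, assuming pscl's predict() types "response", "count", and "zero" behave as documented; the model formula is just an illustration):

```r
library(pscl)
data("bioChemists", package = "pscl")
fm <- zeroinfl(art ~ . | ., data = bioChemists)

pi_hat     <- predict(fm, type = "zero")      # pi: probability of the zero part
mu_hat     <- predict(fm, type = "response")  # overall mean: (1 - pi) * lambda
lambda_hat <- predict(fm, type = "count")     # lambda: non-zero-part mean

# the non-zero-part mean can also be recovered from the overall mean:
all.equal(lambda_hat, mu_hat / (1 - pi_hat))  # should be TRUE (up to numerical error)

# it is lambda_hat, not mu_hat, that should be passed as 'mean' to prob_fun()
```

So with pscl one can either use type = "count" directly, or divide the type = "response" predictions by (1 - pi).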

Best,
Dimitris



On 3/31/2019 4:20 PM, Andras Farkas wrote:
> Dimitris,
> 
> thank you, very helpful. I took a look at your scoring_rules() function and attempted to rewrite it for a zero-inflated Poisson based on the output of pscl predict methods. Would you mind taking a quick look? Maybe it will become useful for someone else eventually:
> 
> library(caret)
> library(pscl)
> data("bioChemists", package = "pscl")
> split <- 0.7
> inTrain <- createDataPartition(bioChemists[, 1], p = split, list = FALSE)
> training <- bioChemists[ inTrain,]
> test <- bioChemists[-inTrain,]
> fm_zip2 <- zeroinfl(art ~ . | ., data = training)
> 
> y <- test$art
> n <- length(y)
> max_count <- 5000
> 
> max_count <- rep(max_count, length.out = n)
> prob_fun <- function (x, mean, pis) {
>      ind0 <- x == 0
>      out <- (1 - pis) * dpois(x, lambda = mean)
>      out[ind0] <- pis + out[ind0]
>      out
>    }
> 
> max_count_seq <- lapply(max_count, seq, from = 0)
> pred <- predict(fm_zip2, newdata = test, type = "response")
> pred_zi <- predict(fm_zip2, newdata = test, type = "zero")  # zero probs; equal to the zi_probs attribute generated by predict.MixMod?
> 
> logarithmic <- quadratic <- spherical <- numeric(n)
> for (i in seq_len(n)) {
>    p_y <- prob_fun(y[i], mean = pred[i], pis = pred_zi[i])
>    quadrat_p <- sum(prob_fun(max_count_seq[[i]], mean = pred[i],
>                              pis = pred_zi[i])^2)
>    logarithmic[i] <- log(p_y)
>    quadratic[i] <- 2 * p_y - quadrat_p  # quadratic score: 2 * p(y) - sum(p^2)
>    spherical[i] <- p_y / sqrt(quadrat_p)
> }
> result <- data.frame(logarithmic = logarithmic, quadratic = quadratic,
>                       spherical = spherical)
> 
> thanks
> 
> Andras
> 
> 
> 
> 
> On Saturday, March 30, 2019, 9:25:25 AM EDT, D. Rizopoulos <d.rizopoulos using erasmusmc.nl> wrote:
> 
> You could have a look at proper scoring rules: https://en.m.wikipedia.org/wiki/Scoring_rule
> 
> For an example in mixed models check this example in the GLMMadaptive package: https://drizopoulos.github.io/GLMMadaptive/articles/Dynamic_Predictions.html
> 
> Best,
> 
> Dimitris
> 
> 
> Sent with my iPhone - apologies for typos
> 
> From: Andras Farkas via R-sig-mixed-models <r-sig-mixed-models using r-project.org>
> 
> Date: Saturday, 30 Mar 2019, 13:49
> 
> To: R-sig-mixed-models <r-sig-mixed-models using r-project.org>
> 
> Subject: [R-sig-ME] choice of prediction error calculations for zero inflated model
> 
> Hello All,
> 
> thought I would reach out to see if you have some guidance on the following: I am working with a zero-inflated data set, fitting models that should be reasonable for such data (zeroinfl() and hurdle() from pscl, mixtures from flexmix, glm.nb, etc.), and trying to compare model predictive performance on a validation data set (70% of the data was used to train and the remaining 30% for validation). This is not necessarily a coding question but rather a stats-oriented one, although a working example would be helpful, if there is one.
> 
> I have looked extensively for measures of predictive performance of zero-inflated models in the literature, based on comparing observed vs. predicted responses for count data in a validation data set, but could not come up with much. I am familiar with general measures of performance, like RMSE, MAE, etc., but am finding little on their appropriateness in my setting. Some references point towards adjusted/pseudo R-squared approaches, but most of those evaluate model fit during model development rather than predictive performance on a validation set. Any thoughts or directions you may be able to help me with?
> 
> much appreciate your input...
> 
> thanks,
> 
> Andras
> 
> 
> _______________________________________________
> R-sig-mixed-models using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> 

-- 
Dimitris Rizopoulos
Professor of Biostatistics
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
Web (personal): http://www.drizopoulos.com/
Web (work): http://www.erasmusmc.nl/biostatistiek/
Blog: http://iprogn.blogspot.nl/

