[R] Complex Survey MSE for prediction with ML

Ganz, Carl carlganz at ucla.edu
Thu Dec 1 00:14:01 CET 2016


Hello,

I have been toying with the survey package's withReplicates function, which lets users easily extend the survey package to support any weighted statistic. There are a number of ML algorithms in various packages that accept weights, and it is fairly easy to use them with withReplicates. Below is a naïve example:

library(survey)
library(rpart)
library(gbm)

data(api)

# create survey object
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

rstrat<-as.svrepdesign(dstrat)

# try rpart
predr <- as.data.frame(withReplicates(rstrat, function(w, data) {
  predict(rpart(api00~ell+meals+mobility,data=data,weights=w))
}))

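# try gbm (distribution and n.trees stated explicitly; gbm subsamples at
# random by default, so results vary from run to run unless a seed is set)
predg <- as.data.frame(withReplicates(rstrat, function(w, data) {
  predict(gbm(api00~ell+meals+mobility,data=data,weights=w,
              distribution="gaussian",n.trees=100),
          n.trees=100)
}))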

# try regular svyglm
preds <- as.data.frame(predict(svyglm(api00~ell+meals+mobility,rstrat)))

head(data.frame(predr,predg,preds))

With rpart, the standard errors are absurdly large and clearly incorrect. With gbm, the results seem reasonable.
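
One way to dig into the rpart result, offered as a rough sketch rather than a diagnosis, is to ask withReplicates to keep the per-replicate predictions (return.replicates=TRUE) and look at how they vary. I am assuming here that the replicates element comes back as a matrix with one row per replicate and one column per observation, which is my reading of the help page. The names res, reps, n_leaf_values and spread are only illustrative, and the design setup is repeated so the snippet runs on its own:

```r
library(survey)
library(rpart)

data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
rstrat <- as.svrepdesign(dstrat)

# keep the replicate estimates themselves, not just the variance
res <- withReplicates(rstrat, function(w, data) {
  predict(rpart(api00 ~ ell + meals + mobility, data = data, weights = w))
}, return.replicates = TRUE)

# assumed layout: one row per replicate, one column per observation
reps <- res$replicates

# distinct fitted values per replicate, a rough proxy for the number of leaves
n_leaf_values <- apply(reps, 1, function(r) length(unique(r)))
table(n_leaf_values)

# spread of each observation's prediction across replicates
spread <- apply(reps, 2, sd)
summary(spread)
```

If the number of distinct fitted values changes from replicate to replicate, the fitted tree is probably changing shape as the weights change, which might be part of what inflates the SEs.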

I see in this extremely old post that you can't use quantile regression with withReplicates for some survey designs and expect to get reasonable results: https://stat.ethz.ch/pipermail/r-help/2008-August/171620.html

Quantiles and survey statistics are a messy business, so that issue may be unique to quantile regression. Based on that post, though, it seems that both the function and the survey design need to have certain properties for withReplicates to produce valid SEs. This is not documented in the withReplicates help page.
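
For contrast, here is a small sketch of a case where I would expect withReplicates to behave well: a weighted mean, which is a smooth function of totals, checked against svymean on the same replicate design. The object names m_rep and m_svy are illustrative only:

```r
library(survey)

data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
rstrat <- as.svrepdesign(dstrat)

# weighted mean written by hand as a function of the weights
m_rep <- withReplicates(rstrat, function(w, data) {
  sum(w * data$api00) / sum(w)
})

# the same statistic from the built-in estimator
m_svy <- svymean(~api00, rstrat)

# point estimates and standard errors side by side
c(withReplicates = as.numeric(coef(m_rep)), svymean = as.numeric(coef(m_svy)))
c(withReplicates = as.numeric(SE(m_rep)), svymean = as.numeric(SE(m_svy)))
```

I would expect the two rows to agree here, which is the kind of agreement that has no obvious counterpart for tree predictions.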

So my question is, what properties does an ML algorithm/survey design need for withReplicates to generate valid SEs?

Kind Regards,
Carl Ganz


