%setwd("S:\\synthpop 14112017") %## Produce .tex file from .Rnw file %Sweave("inference.Rnw") %## Produce .pdf file from .tex file %tools::texi2pdf("inference.tex") \documentclass[12pt]{article} \usepackage{amssymb,amsmath,latexsym,graphics,authblk,url,setspace,caption} \usepackage[pdftex]{graphicx} \usepackage{epstopdf} \usepackage{enumerate} \usepackage{float} \usepackage{color} \usepackage{amssymb} \usepackage{setspace} \usepackage{amsmath} \usepackage{hyperref} \usepackage{verbatim} \usepackage{algorithm} \usepackage{algpseudocode} \usepackage{color} \usepackage{footnote} \usepackage{threeparttable} \usepackage{caption} \usepackage{setspace} %\usepackage{breqn} \usepackage[square,sort,comma,numbers]{natbib} \usepackage[portrait, margin=0.8in]{geometry} \title{Inference from fitted models in \textbf{synthpop}} \date{} \author{Gillian M Raab \& Beata Nowok} \renewcommand{\baselinestretch}{1.5} % 1.5 denotes double spacing. Changing it will change the spacing % \VignetteIndexEntry{Inference in synthpop} \begin{document} <>= options(prompt="R> ", width=77, digits=4, useFancyQuotes=FALSE) @ \maketitle \begin{abstract} This paper describes the methods used for inference from models fitted to synthetic data generated by the \textbf{synthpop} package. Since earlier versions of \textbf{synthpop} new methodology for making inferences from synthetic data has been published \cite{RaabNowokDibben_2017}. These new methods have the advantage over previously proposed methods \cite{RRR_2003} of usually requiring only a single synthetic data set to be produced. We explain how these methods have been implemented in \textbf{synthpop} functions. Inference from synthetic data is only valid if the model used for synthesis has reproduced the relationships between variables that influence the fit to the model. Thus it is important that the results from synthetic data are compared to those from the original data. These methods, including some new, recently developed tests, are fully explained here. \end{abstract} \section{Introduction} The \textbf{synthpop} package for creating synthetic data allows the user to produce synthetic version(s) of confidential data and also provides functions (e.g. \verb|glm.synds()|) to make inferences from statistical models fitted to the synthetic data and to compare the results with those from a fit to the original data (e.g. \verb|compare.fit.synds()|). This vignette explains the statistics computed by these functions. The main focus of the \textbf{synthpop} package \cite{synthpop_jss} is to produce data for exploratory analysis with results for publication produced when the code developed on the synthetic data is run on the original, confidential data. This use of \textbf{synthpop} will usually require only a single synthetic data set (\verb|m = 1|) with the same number of records (\verb|k|) as the original (\verb|n|), the default setting in \textbf{synthpop}. For a user who simply wants to estimate the reults that would be obtained from the original data, if they were available,an analysis of the synthetic data set, as if it were the original, is all that is required. No special methods of estimation are needed for the parameters of models fitted to synthetic data. When \verb|m > 1| the results from the different syntheses need to be combined and when the size of the synthetic data differs from the real data (\verb|k| $ \neq $ \verb|n|) the standard errors of the coefficients from the synthetic fit require adjustment to estimate the results that would be obtained from the original data. The methods implemented in \textbf{synthpop} for these situations are described in Section \ref{sec:notpopinf}. The other situation is when the user wishes to make inferences to the population parameters directly from the synthetic data. For this case the standard errors of the estimates must include the contributions to the uncertainty of the estimates from two sources: \begin{enumerate} \item{The difference between the parameters estimated from the synthetic data and from the original data;} \item{The difference between the estimates from the original data and the parameters of the model assumed to have generated the synthetic data.} \end{enumerate} The contribution from the first source decreases as the number of synthetic data sets increases, while that from the second source is unchanged. The second corresponds to the estimated standard error that would have been obtained from fitting the model to the original data. Methods for population inference are dependent on the details of how the synthetic data have been produced, as discussed in \cite{Drechsler_book} and recently developed in \cite{RaabNowokDibben_2017}. The functions \verb|glm.synds()|, \verb|lm.synds()|, \verb|polr.synds()| and \verb|multinom.synds()| in \textbf{synthpop}, and their summary function \verb|summary.fit.synds()| implement these methods . If synthesis has used samples from the posterior predictive distribution of the population, given the observed data (parameter \texttt{proper} of the function \verb|syn()|) this will be recognized by \verb|summary.fit.synds()| and appropriate calculations performed. If the model being fitted includes some variables that have not been synthesised and have not been used in synthesising models for all other variables (not in the visit.sequence or not at the start of the visit.sequence with \verb|method == ""|), the fitting function will return a component \verb|$incomplete| as TRUE. This causes the functions \verb|summary.fit.synds()| and \verb|compare.fit.synds()|) used on the fit to use methods appropriate to incomplete/partially synthetic data \cite{reiter_partial}. Details of how to carry out population inference with \textbf{synthpop} functions are in Section \ref{sec:popinf}. To emphasize the difference between inference to the expectation from the real data and inference to the population quantities the estimates and their standard errors are labelled differently in \textbf{synthpop} inference, as follows \begin{table*}[h] \centering \begin{tabular}{llll} % & \multicolumn{6}{c}{Labels for inference to } \\ % & \multicolumn{3}{c}{Fit to original } & \multicolumn{3}{c}{Population parameters } \\ Inference for & Coefficient & Standard error & z statistic \\ \hline Coefficients expected from original data & \texttt{xpct(Beta)} & \texttt{xpct(se.Beta)} & \texttt{xpct(z)} \\ Population coefficients & \texttt{Beta.syn} & \texttt{se.Beta.syn} & \texttt{z.syn} \\ \hline \end{tabular} \end{table*} The final Section \ref{sec:compare} explains the statistics calculated when the results from a fit of a model to synthetic data are compared to those from original with the function \verb|compare.fit.synds()|. The synthetic data estimates may differ from what will be obtained from the real data if the model used for the synthesis has not captured all of the relationships between variables that influence the fitted parameters. The function \verb|compare.fit.synds()| requires access to the original data as well as the synthetic data. It is designed to be used by the staff producing the synthetic data, or by the user at a validation step, with the long-term goal of evaluating and improving synthesis methods. The following notation is used for quantities used in the calculations explained below. \begin{itemize} \item{$Q$ - the ``true'' vector of coefficients in the population from which the original data are assumed to be a sample.} \item{$p$ - the number of coefficients and length of the vector $Q$.} \item{$q_{1},\cdots,q_{i},\cdots, q_{m}$ - vectors of estimated coefficients, where $q_{i}$ is the estimate from the fit to the $i^{th}$ synthesis and their mean vector is $\bar{q}_{m} = \sum_{i=1}^{m}{q}_{i}/m$.} \item{$\hat{V}_{orig}$ - the estimated variance-covariance matrix of the parameter estimates from a fit of the model to the original data; its diagonal vector is denoted by $v_{orig}$.} \item{$\hat{V}_{1},\cdots,\hat{V}_{i},\cdots, \hat{V}_{m}$ - the estimated variance-covariance matrices of the coefficients $q_{i}$, calculated from each synthetic data set as if it were the original, and their mean matrix $\bar{V}_{m} = \sum_{i=1}^{m}{\hat{V}}_{i}/m$.} \item{$v_{1},\cdots,v_{i},\cdots, v_{m}$ - the diagonal vectors of $\hat{V}_{i}$ which give the estimated variances of the coefficients $q_{i}$ and their mean vector $\bar{v}_{m} = \sum_{i=1}^{m}{v_{i}/m}$.} \item{ $b_m = \sum_{i=1}^{m}{{({q}_i-\bar{q}_{m})}^2}/(m-1)$ - the between-synthesis variance of the $q_i$.} \item{ $B_m =b_{mat}'b_{mat}/(m-1)$, where $b_{mat}$ is a matrix of differences of the coefficients from their mean with $i^{th}$ row $b_{mat}[i,\,]= q_i-\bar{q}_{m}$. The diagonal of $B_m$ is $b_m$.} \end{itemize} \noindent{In all the situations discussed here $Q$ is estimated by $\bar{q}_{m}$ but its variance and standard errors will depend on what type of inference is required and how the synthesis has been carried out.} The results here apply when both the original and synthetic data are analysed by methods appropriate for simple random sampling. They could also be used when both analyses use the same methods for complex samples, as discussed in \cite{RaabNowokDibben_2017}, but methods for complex samples are not currently implemented in \textbf{synthpop}. \section{Inference to results from the original data\label{sec:notpopinf}} Here $\bar{q}_{m}$ and $\sqrt{\bar{v}_{m}}$ estimate $\hat{Q}$ and $\sqrt{v_{orig}}$, the coefficients and their standard errors from a fit to the original data. For a single synthetic data set produced by \texttt{syn()} with \texttt{m = 1} and \texttt{k = n} and the fit summarised with the default setting \texttt{population.inference = FALSE} there is no need to use the special functions for inference from synthetic data. These are the default settings in \texttt{syn()} and in \texttt{summary.fit.synds()}, used to carry out inference from an object produced by \texttt{glm.synds()} or \texttt{lm.synds()}. \renewcommand{\baselinestretch}{1.0} <<>>= library(synthpop) ods <- SD2011[, c("smoke", "sex", "age", "edu")] levels(ods$edu) <- c("NONE", "VOC", "SEC", "HIGH") s1 <- syn(ods, seed = 1234) summary(glm(smoke ~ sex + age + edu + sex * edu, data = s1$syn, family = "binomial")) summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s1, family = "binomial")) @ \renewcommand{\baselinestretch}{1.5} Running this code shows that the results from \texttt{glm()} and \texttt{glm.synds()} are identical. This is also the case if the synthesis is carried out using the parameter \texttt{proper = TRUE} of \texttt{syn()} or if some of the variables in the model have not been synthesised. One small advantage of using \texttt{glm.synds()} is that it checks and gives warnings if the model includes unsynthesised variables that are not at the start of the visit sequence, which can produce results which are wrong, as in the next example. \renewcommand{\baselinestretch}{1.0} <<>>= s2 <- syn(ods, seed = 1234,visit.sequence=c("smoke","edu","age")) summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s2, family = "binomial")) @ \renewcommand{\baselinestretch}{1.5} \noindent{The variable \texttt{sex} has been omitted from the \texttt{visit.sequence}, so that its relationships to other variables are not maintained in the synthetic data. Warnings appear and the results show that the coefficients involving \texttt{sex} are no longer significant and the other coefficients have changed.} The only time when \texttt{glm.synds()} will give different results from \texttt{glm()} with \texttt{population.inference = FALSE} is when \texttt{k} and \texttt{n} differ. In this case the standard errors of the coefficients are adjusted to what would be expected from the original data. \renewcommand{\baselinestretch}{1.0} <<>>= s3 <- syn(ods, seed = 1234, k = 500) ## analysing synthetic data with just glm() summary(glm(smoke ~ sex + age + edu + sex * edu, data = s3$syn, family = "binomial")) ## using glm.synds() summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s3, family = "binomial")) @ \renewcommand{\baselinestretch}{1.5} Here \texttt{glm()} fitted to a synthetic sample of size \texttt{k = 500} that is smaller than the original \texttt{n = 5000} gives larger standard errors, with only a few coefficients appearing significant, whereas \texttt{glm.synds()} estimates what would be found from the original larger sample. If \texttt{m > 1} then using \texttt{glm.synds()} will give you the average of these results for all \texttt{m} syntheses, as well as results from selected individual syntheses if you specify the parameter \texttt{msel}. Results from individual syntheses are simply what the standard functions \texttt{lm()} or \texttt{glm()} would report. \section{Inference to population parameters\label{sec:popinf}} This code gets inference to population parameters for our example with all variables synthesised. \renewcommand{\baselinestretch}{1.0} <<>>= summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s1, family = "binomial"), population.inference = TRUE) @ \renewcommand{\baselinestretch}{1.5} You will find that, as expected, the standard errors are larger than when \texttt{population.inference = FALSE}. When, as for \texttt{s1}, synthesis has been carried out with the default \texttt{proper = FALSE} the standard errors of the average estimated coefficients, $\bar{q}_m$ are calculated as $\sqrt{\bar{v}_m/m+\bar{v}_mk/n)}$. If the data had been synthesised with the \texttt{syn()} option \texttt{proper = TRUE} this will be recognized by \texttt{glm.fit.synds()} and its \texttt{summary()} function and appropriate standard errors for $\bar{q}_m$ calculated as $\sqrt{\bar{v}_m(1+k/n)/m+\bar{v}_mk/n}$. In each case the first term under the square root sign gives the contribution from the differences between $\bar{q}_m$ and $\hat{Q}$ while the second provides the contribution from the differences between $\hat{Q}$ and the population parameters $Q$. The first term is larger when synthesis is done with \texttt{proper = TRUE} as we see from the next analysis. \renewcommand{\baselinestretch}{1.0} <<>>= s4 <- syn(ods, seed = 5678, proper = TRUE) summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s4, family = "binomial"), population.inference = TRUE) @ \renewcommand{\baselinestretch}{1.5} \noindent{We see here that inference is possible when only one synthetic data set is produced, but increasing \texttt{m} improves the precision of the estimates as you will see from the smaller \texttt{se(Beta.syn)} values in the example below.} \renewcommand{\baselinestretch}{1.0} <>= s5 <- syn(ods, seed = 5678, m = 10, proper = TRUE, print.flag = FALSE) summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s5, family = "binomial"), population.inference = TRUE) @ \begin{verbatim} Fit to synthetic data set with 10 syntheses. Inference to population coefficients when all variables in the model are synthesised. Call: glm.synds(formula = smoke ~ sex + age + edu + sex * edu, family = "binomial", data = s5) Combined estimates: Beta.syn se.Beta.syn z.syn Pr(>|z.syn|) (Intercept) 0.12996 0.17508 0.74 0.4579 sexFEMALE 1.02161 0.16885 6.05 1.4e-09 *** age 0.00530 0.00213 2.48 0.0130 * eduVOC 0.14992 0.14708 1.02 0.3080 eduSEC 0.48663 0.15828 3.07 0.0021 ** eduHIGH 0.98525 0.19689 5.00 5.6e-07 *** sexFEMALE:eduVOC -0.51278 0.20700 -2.48 0.0132 * sexFEMALE:eduSEC -0.41484 0.21636 -1.92 0.0552 . sexFEMALE:eduHIGH -0.63809 0.25982 -2.46 0.0141 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 \end{verbatim} \renewcommand{\baselinestretch}{1.5} The \textbf{synthpop} package was written with completely synthesised data in mind but, with care, it can also be used for population inference for some incompletely synthesised data, when all values for one or more variable are unchanged. The synthesis must have been conditional on the values of all unsynthesised variables. This can be done by placing them at the start of the visit sequence and setting their parameter \texttt{method = ""}. The functions \texttt{lm.synds()} and \texttt{glm.synds()} check this condition and print warnings, as we saw for the synthesis \texttt{s2} above. When some of the variables in the formula are unchanged the between synthesis variance can be estimated from multiple syntheses \cite{reiter_partial} with standard errors calculated as $\sqrt{b_m/m + \bar{v}_mk/n}$. This standard error estimate requires $m > 1$ and larger values of $m$ such as $m > 5$ are recommended for reliable results. \renewcommand{\baselinestretch}{1.0} In the next example only one of the variables in the formula has been synthesised. \renewcommand{\baselinestretch}{1.0} <>= s6 <- syn(ods, seed = 910011, m = 12, method = c("", "", "", "cart"), print.flag = FALSE) summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s6, family = "binomial"), population.inference = TRUE) @ \begin{verbatim} Fit to synthetic data set with 12 syntheses. Inference to population coefficients when all variables in the model are synthesised. Call: glm.synds(formula = smoke ~ sex + age + edu + sex * edu, family = "binomial", data = s6) Combined estimates: Beta.syn se.Beta.syn z.syn Pr(>|z.syn|) (Intercept) 0.08885 0.16709 0.53 0.59491 sexFEMALE 1.05974 0.15927 6.65 2.9e-11 *** age 0.00582 0.00204 2.86 0.00429 ** eduVOC 0.16052 0.13947 1.15 0.24977 eduSEC 0.53571 0.15147 3.54 0.00041 *** eduHIGH 1.15776 0.19147 6.05 1.5e-09 *** sexFEMALE:eduVOC -0.56293 0.19543 -2.88 0.00397 ** sexFEMALE:eduSEC -0.44036 0.20591 -2.14 0.03247 * sexFEMALE:eduHIGH -0.94733 0.24693 -3.84 0.00012 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 \end{verbatim} \renewcommand{\baselinestretch}{1.5} If a user attempts an incomplete fit with \texttt{population.inference = TRUE} from a synthesis with \texttt{m = 1}, the program will detect this and revert to inference from a complete synthesis with a resulting increase in the standard errors as we show here. \renewcommand{\baselinestretch}{1.0} <<>>= s7 <- syn(ods, seed = 910011, m = 1, method = c("", "", "", "cart"), print.flag = FALSE) summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s7, family = "binomial"), population.inference = TRUE) @ \renewcommand{\baselinestretch}{1.5} In addition to these combined estimates from all the syntheses, the user can choose to print out the results from selected individual syntheses from the \texttt{m} with the parameter \texttt{(msel)}. These are simply what the original functions (\texttt{lm()} or \texttt{glm()}) would report. \renewcommand{\baselinestretch}{1.0} <>= summary(glm.synds(smoke ~ sex + age + edu + sex * edu, data = s6, family = "binomial"), population.inference = TRUE, incomplete = TRUE, msel = 1:2) @ \begin{verbatim} Fit to synthetic data set with 12 syntheses. Inference to population coefficients when all variables in the model are synthesised. Call: glm.synds(formula = smoke ~ sex + age + edu + sex * edu, family = "binomial", data = s6) Combined estimates: Beta.syn se.Beta.syn z.syn Pr(>|z.syn|) (Intercept) 0.08885 0.16709 0.53 0.59491 sexFEMALE 1.05974 0.15927 6.65 2.9e-11 *** age 0.00582 0.00204 2.86 0.00429 ** eduVOC 0.16052 0.13947 1.15 0.24977 eduSEC 0.53571 0.15147 3.54 0.00041 *** eduHIGH 1.15776 0.19147 6.05 1.5e-09 *** sexFEMALE:eduVOC -0.56293 0.19543 -2.88 0.00397 ** sexFEMALE:eduSEC -0.44036 0.20591 -2.14 0.03247 * sexFEMALE:eduHIGH -0.94733 0.24693 -3.84 0.00012 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Estimates for selected syntheses contributing to the combined estimates: Coefficients: syn=1 syn=2 (Intercept) 0.019362 0.062836 sexFEMALE 1.266456 1.008554 age 0.005587 0.006237 eduVOC 0.334849 0.189847 eduSEC 0.502420 0.501606 eduHIGH 1.258915 1.085128 sexFEMALE:eduVOC -0.959960 -0.645905 sexFEMALE:eduSEC -0.585671 -0.222160 sexFEMALE:eduHIGH -1.105639 -0.773811 z values: syn=1 syn=2 (Intercept) 0.1232 0.4002 sexFEMALE 8.2142 6.7988 age 2.8658 3.1928 eduVOC 2.5384 1.4429 eduSEC 3.5570 3.4912 eduHIGH 6.9420 6.3346 sexFEMALE:eduVOC -5.0669 -3.5127 sexFEMALE:eduSEC -2.9878 -1.1305 sexFEMALE:eduHIGH -4.6306 -3.4075 \end{verbatim} \renewcommand{\baselinestretch}{1.5} \section{Comparing fits to the original and synthesised data\label{sec:compare}} \subsection{Overview of comparisons\label{sec:overview}} The inferences from synthetic data, described above, all depend on the synthesising model being correct. If the original data can be accessed then the results of a model fitted to the original and synthetic data can be compared with the function \texttt{compare.fit.synds()} which provides a graphical representation of the differences as well as statistics to evaluate their importance. The methods we implement for evaluating these biases are standardised differences in the coefficients and a combined lack-of-fit test for the differences between the vector of coefficients estimated from the original and the synthetic data. These can be used to evaluate the specific utility of synthetic data for this analysis and they incorporate statistical tests of the null hypothesis that the synthesising model is the same as the model that is presumed to have generated the original data. Section \ref{sec:bias} explains the details of these methods. The graphical output from \texttt{compare.fit.synds()} consists of a comparison of the confidence intervals for the coefficients calculated from the original and the synthetic data. Results also include a confidence interval overlap measure for each coefficient: the ratio of the overlap of the intervals to an average of their lengths. Previous use of confidence-interval-overlap measures \cite{WooProp,DandR2011,Drechsler_book} have compared the intervals for population inference with synthetic data to the interval from the original data. The parameter \texttt{population.inference = TRUE} can be used to plot and calculate these intervals. But in our default case with \texttt{population.inference = FALSE} plots and intervals for the estimate of $\hat{Q}$ expected from synthetic data are calculated. Details of each method are explained in Sections \ref{sec:ci1} and \ref{sec:ci2}. \subsection{Assessing the bias in estimates from synthetic data \label{sec:bias}} Function $compare.fit.synds()$ calculates the standardised difference between the original and synthetic coefficients where the standardisation uses the standard errors of the original fit, $\sqrt{v_{orig}}$. In Section \ref{sec:notpopinf} it was necessary to calculate the standard errors from an estimate $\bar{V}_{m}$ of $\hat{V}_{orig}$, but here we have access to the original data and so we can use its known value. The vector of the means of the standardised differences, $z_j$, over $ m $ synthetic data sets is $z=(\bar{q}_m-\hat{Q})/\sqrt{v_{orig}/m}$. The mean of the absolute values of $z_j$ for each coefficient, the $|\bar{z}|=\Sigma_{j=1}^p{|z_j|}/p$ value, gives the first summary measure of the difference between the observed and synthetic data. The standardised differences simply summarise the bias in the estimates in relation to the precision of the estimates from the original data. If the model used to synthesise the data was the one which generated the original data, for completely synthesised data with \texttt{proper = FALSE} the expected value of each $|z_j|$ and of $|\bar{z}|$ will be $\sqrt{2/(m\pi)}$\footnote{From the expected value of a half-Normal distribution} taking the values 0.798, 0.356, 0.252, 0.056 for $m = 1, 5, 10, 200$. For completely synthesised data with \texttt{proper = TRUE} the expectations of $|z_j|$ and of $|\bar{z}|$ become $\sqrt{(1+2/\pi)/m}$, giving larger absolute differences. As $m$ increases the bias expected in the null case when the synthesising model is correct decreases and the test described below will have greater power to detect a poor synthesising model. If the synthesising model is correct the coefficients from the synthetic data $\bar{q}_m$ will have expectations $\hat{Q}$ and variances $v_{orig}(k/n)/m$ when the synthesis parameter \texttt{proper = FALSE} or $v_{orig}(1+k/n)/m$ for synthesis with \texttt{proper = TRUE} \cite{RaabNowokDibben_2017}. These results allow a test of the bias of each coefficient to be calculated as a p-value giving the probability of a difference as large as that observed under the null hypothesis that the synthesising model is correct. A composite lack-of-fit test for the vector of differences between the coefficients from the fit to the original and synthetic data can be obtained from the quadratic form \begin{equation*} LOF = {(\bar{q}_{m}-\hat{Q})'}{{V_{diff}}}^{-1}(\bar{q}_{m}-\hat{Q)} \end{equation*} \noindent{where ${V_{diff}}$ is the between synthesis variance-covariance matrix for the differences, which depends on the synthesis method used. For completely synthesised data ${V_{diff}}=\hat{V}_{orig}(k/n)/m$ if the synthesis parameter \texttt{proper = FALSE} or ${V_{diff}}=\hat{V}_{orig}(1+k/n)/m$ for synthesis with \texttt{proper = TRUE} \cite{RaabNowokDibben_2017}. If the synthesis model is correct these $LOF$ measures will have a $\chi^2$ distribution with degrees of freedom equal to the number of coefficients in the model. We illustrate these results with three different syntheses and three different logistic regression models. For the first synthesis \texttt{s7} and the first fit \texttt{f7} the synthesis model is compatible with the model fitted. By using the new function \texttt{syn.strata} the synthesis preserves all the relationships between the stratifying variable, synthesised first, and the within-stratum relationships. Here stratification is by the dependent variable ``smoke" and within each smoking group education is modelled as a multinomial distribution conditional on sex, giving the correct model for \texttt{f7}. \renewcommand{\baselinestretch}{1.0} <>= ods <- ods[!is.na(ods$smoke), ] # remove 10 observations with missing "smoke" s8 <- syn.strata(ods, m = 5, method = "parametric", strata = "smoke", seed = 5678, visit.sequence = c("smoke", "sex", "edu", "age"), print.flag = FALSE, tab.strataobs = FALSE) f8 <- glm.synds(smoke ~ sex + edu + sex * edu, data = s8, family = "binomial") compare(f8, ods, plot.intercept = TRUE, plot = "coef") @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + edu + sex * edu, family = "binomial", data = s8) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 0.45921 0.4078 0.05142 0.47653 0.8784 sexFEMALE 1.03712 1.0592 -0.02211 -0.14727 0.9624 eduVOC 0.08504 0.1139 -0.02884 -0.22471 0.9427 eduSEC 0.44279 0.5074 -0.06457 -0.46064 0.8825 eduHIGH 0.90379 0.9864 -0.08262 -0.46585 0.8812 sexFEMALE:eduVOC -0.59573 -0.6058 0.01004 0.05417 0.9862 sexFEMALE:eduSEC -0.36813 -0.4497 0.08159 0.41637 0.8938 sexFEMALE:eduHIGH -0.76002 -0.7986 0.03854 0.16529 0.9578 Measures for 5 syntheses and 8 coefficients Mean confidence interval overlap: 0.9231 Mean absolute std. coef diff: 0.3014 Mahalanobis distance ratio for lack-of-fit (target 1.0): 0.49 Lack-of-fit test: 3.948; p-value 0.8618 for test that synthesis model is compatible with a chi-squared test with 8 degrees of freedom. \end{verbatim} \renewcommand{\baselinestretch}{1.5} \noindent{As expected, there is no evidence of any lack of fit and $|\bar{z}|$ is not out of line with what we expect with \texttt{m = 5}.} The second example is more typical of what a user of \textbf{synthpop} might use with the default method \texttt{cart} for all variables. \renewcommand{\baselinestretch}{1.0} <>= ods <- ods[!is.na(ods$smoke), ] # remove 10 observationswith missing "smoke" s9 <- syn(ods, m = 5, seed = 5678, visit.sequence = c("sex", "edu", "age", "smoke"), print.flag = FALSE) f9 <- glm.synds(smoke ~ sex + age + edu + sex * age, data = s9, family = "binomial") compare(f9, ods) @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + age + edu + sex * age, family = "binomial", data = s9) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 0.197102 0.400261 -0.203160 -1.2463 0.6821 sexFEMALE 0.593279 0.224701 0.368578 2.0424 0.4790 age 0.008956 0.003104 0.005852 2.2534 0.4252 eduVOC -0.030571 -0.050985 0.020414 0.2058 0.9475 eduSEC 0.448646 0.393783 0.054862 0.5315 0.8644 eduHIGH 0.659803 0.688391 -0.028588 -0.2374 0.9394 sexFEMALE:age -0.003031 0.007702 -0.010733 -2.9877 0.2378 Measures for 5 syntheses and 7 coefficients Mean confidence interval overlap: 0.6536 Mean absolute std. coef diff: 1.358 Mahalanobis distance ratio for lack-of-fit (target 1.0): 10.15 Lack-of-fit test: 71.02; p-value 0 for test that synthesis model is compatible with a chi-squared test with 7 degrees of freedom. \end{verbatim} \begin{figure}[!ht] \begin{center} \includegraphics[width=0.9\textwidth]{Figure1inference.pdf} % first figure itself \caption{Comparison of intervals for fit \texttt{f8} and original data.}\label{fig1} \end{center} \end{figure} \renewcommand{\baselinestretch}{1.5} \noindent{While the lack-of-fit test indicates differences and there is clear evidence of bias for two of the seven coefficients and some evidence for a third one, the general pattern of coefficients, illustrated in Figure 1, is the same fore both intervals and would not mislead any decisions to be made on the basis of preliminary analysis with the synthetic data. This is an example of a general point that we should not expect that these tests of fit should show no differences since it is sufficient for the synthetic data to be a good approximation to the original. The p-values tell us that there is evidence of a difference but not how important it may be. The effect size, measured here by $z_j$ (\texttt{Std. coef diff}), gives a better measure of importance. In this example the largest absolute effect size is only \texttt{1.4}. The third synthesis uses the method \texttt{"sample"} for all variables, a bootstrap sample that does not preserve any of the relationships between variables, and the fitted model is clearly wrong with every coefficient giving evidence of a difference and the large absolute effect sizes, with three over \texttt{5}.} \renewcommand{\baselinestretch}{1.0} <>= s10 <- syn(ods, m = 5, seed = 5678, method = "sample", visit.sequence = c("sex", "edu", "age", "smoke"), print.flag = FALSE) f10 <- glm.synds(smoke ~ sex + age + edu + sex * edu, data = s10, family = "binomial") compare(f10, ods) @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + age + edu + sex * edu, family = "binomial", data = s10) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 1.102e+00 0.028147 1.073657 6.757 -0.72377 sexFEMALE -1.165e-01 1.033275 -1.149825 -7.641 -0.94926 age 9.724e-05 0.006355 -0.006258 -3.200 0.18369 eduVOC -6.982e-02 0.219490 -0.289315 -2.182 0.44343 eduSEC 3.337e-02 0.613072 -0.579698 -4.020 -0.02552 eduHIGH -2.542e-02 1.078920 -1.104339 -6.137 -0.56569 sexFEMALE:eduVOC 1.778e-01 -0.576577 0.754354 4.063 -0.03639 sexFEMALE:eduSEC 1.574e-02 -0.442560 0.458298 2.336 0.40406 sexFEMALE:eduHIGH 1.463e-01 -0.752495 0.898754 3.844 0.01931 Measures for 5 syntheses and 9 coefficients Mean confidence interval overlap: -0.1389 Mean absolute std. coef diff: 4.464 Mahalanobis distance ratio for lack-of-fit (target 1.0): 108.4 Lack-of-fit test: 975.3; p-value 0 for test that synthesis model is compatible with a chi-squared test with 9 degrees of freedom. \end{verbatim} \renewcommand{\baselinestretch}{1.5} When some variables in the model are not synthesised we need to set \texttt{incomplete = TRUE} for correct inference. This leads to the calculation of the variance matrix of the differences as $V_{diff}=B_m/m$. This variance matrix will be singular if the \texttt{m} is smaller than then number of coefficients ($p$) and \texttt{m} considerably larger than $p$ is recommended. In this case the lack-of-fit test needs to use Hotelling's $T^2$ statistic referred to an $F$ distribution with degrees of freedom $p$ and $m - p$. In the next example two variables are unsynthesised so methods for incomplete/partial synthesis are used and $m$ increased to 20. \begin{small} \renewcommand{\baselinestretch}{1.0} <>= s11 <- syn(ods, seed = 910011, m = 20, method = c("", "", "cart", "cart"), print.flag = FALSE) f11 <- glm.synds(smoke ~ sex + age + edu + sex * edu, data = s11, family = "binomial") compare(f11, ods) @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + age + edu + sex * edu, family = "binomial", data = s11) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 0.039589 0.028147 1.144e-02 0.07201 0.9816 sexFEMALE 1.066542 1.033275 3.327e-02 0.22107 0.9436 age 0.006438 0.006355 8.294e-05 0.04241 0.9892 eduVOC 0.191791 0.219490 -2.770e-02 -0.20888 0.9467 eduSEC 0.571649 0.613072 -4.142e-02 -0.28726 0.9267 eduHIGH 1.148183 1.078920 6.926e-02 0.38493 0.9018 sexFEMALE:eduVOC -0.559032 -0.576577 1.754e-02 0.09449 0.9759 sexFEMALE:eduSEC -0.462752 -0.442560 -2.019e-02 -0.10292 0.9737 sexFEMALE:eduHIGH -0.951924 -0.752495 -1.994e-01 -0.85302 0.7824 Measures for 20 syntheses and 9 coefficients Mean confidence interval overlap: 0.9357 Mean absolute std. coef diff: 0.2519 Mahalanobis distance ratio for lack-of-fit (target 1.0): 3.38 Lack-of-fit test: 30.42; p-value 4e-04 for test that synthesis model is compatible with a chi-squared test with 9 degrees of freedom. \end{verbatim} \renewcommand{\baselinestretch}{1.5} \end{small} \noindent{This gives a fit that, judged by the values of $z_j$, (\texttt{Std. coef diff}) and the overlaps is acceptable, but with \texttt{m = 20} there is evidence of lack-of-fit for two coefficients.} \subsection{Confidence interval overlap when \texttt{population.inference = FALSE} \label{sec:ci1}} In this case confidence interval overlaps are calculated using the standard errors from the original fit. The overlap for the $i^{th}$ coefficient is \begin{equation} IO_{j}= \left[\frac{min(u_o,u_s)-max(l_o,l_s)}{u_o-l_o}\right] \label{2} \end{equation} \noindent{ \noindent{where the confidence interval for the original coefficient is $(l_{o},u_{o})$ and for that from the synthesised data is $(l_{s},u_{s})$. When the intervals do not overlap $IO_{j}$ becomes negative. Note that this interval is a simplification of Equation \ref{1}, below, proposed by \cite{KarrUtil06} because both intervals are the same length. The two intervals are offset by $|z_j|$ so that standardised difference and the interval overlap are linearly related by ${IO}_j=1-|z_j|/(2{q_N}_{(1-\alpha/2)})$, where ${q_N}_{(1-\alpha/2)}$ is the quantile of the Normal distribution used in calculating the confidence intervals, i.e. 1.96 for 95\% intervals. The average of the overlaps can then be used as a second summary measure of utility. The expected value of $IO_j$ for 95\% intervals will be $1-\sqrt{2/(m\pi)}/(2 \cdot 1.96)$, giving expected overlaps of 79.6\%, 90.9\%, 93.5\% and 98.6\% for $m = 1, 5, 10, 200$. To illustrate these methods we select three different fitted models. In the first case the synthesis model is compatible with the model being fitted. A model with smoking predicted from sex, education and its interactions is evaluated. The synthesis is stratified by the dependent variable, smoking, with missing values removed and a parametric model with sex and education is fitted in each stratum, giving a synthesis compatible with the model. \renewcommand{\baselinestretch}{1.0} <>= ods <- ods[!is.na(ods$smoke), ] s12 <- syn.strata(ods, m = 5, visit.sequence = c(4, 1, 2), method = "parametric", strata = "smoke", seed = 5678, print.flag = FALSE) s13 <- syn.strata(ods, m = 5, visit.sequence = c(4, 1, 2), method = "parametric", strata = "smoke", seed = 1234, proper = TRUE, print.flag = FALSE, tab.strataobs = FALSE) @ <>= f12 <- glm.synds(smoke ~ sex + edu + sex * edu, data = s12, family = "binomial") compare(f12, ods, plot.intercept = TRUE, plot = "coef") @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + edu + sex * edu, family = "binomial", data = s12) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 0.45676 0.4078 0.04897 0.4538 0.8842 sexFEMALE 1.01348 1.0592 -0.04574 -0.3047 0.9223 eduVOC 0.09649 0.1139 -0.01739 -0.1355 0.9654 eduSEC 0.48782 0.5074 -0.01953 -0.1393 0.9645 eduHIGH 0.94450 0.9864 -0.04191 -0.2363 0.9397 sexFEMALE:eduVOC -0.55190 -0.6058 0.05387 0.2908 0.9258 sexFEMALE:eduSEC -0.48621 -0.4497 -0.03649 -0.1862 0.9525 sexFEMALE:eduHIGH -0.74999 -0.7986 0.04858 0.2083 0.9469 Measures for 5 syntheses and 8 coefficients Mean confidence interval overlap: 0.9377 Mean absolute std. coef diff: 0.2444 Mahalanobis distance ratio for lack-of-fit (target 1.0): 0.7 Lack-of-fit test: 5.57; p-value 0.6953 for test that synthesis model is compatible with a chi-squared test with 8 degrees of freedom. \end{verbatim} <>= f13 <- glm.synds(smoke ~ sex + edu + sex * edu, data = s13, family = "binomial") compare(f13, ods, plot.intercept = TRUE) @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + edu + sex * edu, family = "binomial", data = s13) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 0.2855 0.4078 -0.12230 -1.13337 0.7109 sexFEMALE 1.2375 1.0592 0.17828 1.18760 0.6970 eduVOC 0.2531 0.1139 0.13917 1.08434 0.7234 eduSEC 0.6540 0.5074 0.14666 1.04631 0.7331 eduHIGH 0.9508 0.9864 -0.03564 -0.20096 0.9487 sexFEMALE:eduVOC -0.8695 -0.6058 -0.26370 -1.42346 0.6369 sexFEMALE:eduSEC -0.5814 -0.4497 -0.13170 -0.67214 0.8285 sexFEMALE:eduHIGH -0.7774 -0.7986 0.02114 0.09067 0.9769 Measures for 5 syntheses and 8 coefficients Mean confidence interval overlap: 0.7819 Mean absolute std. coef diff: 0.8549 Mahalanobis distance ratio for lack-of-fit (target 1.0): 2.76 Lack-of-fit test: 22.12; p-value 0.0047 for test that synthesis model is compatible with a chi-squared test with 8 degrees of freedom. \end{verbatim} \renewcommand{\baselinestretch}{1.5} \noindent{The first synthesis gives results in line with what would be expected for standardised differences and overlaps from a correct model with \texttt{m = 5}, and the second with synthesis from the posterior predictive distribution of the parameters gives larger standardised differences and lower overlaps. In neither case does the lack-of-fit test suggest any problem with the synthesising model.} The next example uses parametric models which capture some of the relationships but largely miss the interaction between education and sex. \renewcommand{\baselinestretch}{1.0} <>= s14 <- syn(ods, m = 5, seed = 9101112, method = "parametric", print.flag = FALSE) s15 <- syn(ods, m = 5, seed = 1415, method = "parametric", proper = TRUE, print.flag = FALSE) @ <>= f14 <- glm.synds(smoke ~ sex + edu + age + sex * edu, data = s14, family = "binomial") compare(f14, ods, plot.intercept = TRUE, plot = "coef") @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + edu + age + sex * edu, family = "binomial", data = s14) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 0.085176 0.028147 0.057029 0.35891 0.9084 sexFEMALE 0.697431 1.033275 -0.335844 -2.23180 0.4307 eduVOC 0.103703 0.219490 -0.115786 -0.87314 0.7773 eduSEC 0.609909 0.613072 -0.003163 -0.02193 0.9944 eduHIGH 0.960800 1.078920 -0.118120 -0.65646 0.8325 age 0.006173 0.006355 -0.000182 -0.09307 0.9763 sexFEMALE:eduVOC -0.051487 -0.576577 0.525089 2.82787 0.2786 sexFEMALE:eduSEC -0.091203 -0.442560 0.351357 1.79093 0.5431 sexFEMALE:eduHIGH -0.161805 -0.752495 0.590690 2.52655 0.3555 Measures for 5 syntheses and 9 coefficients Mean confidence interval overlap: 0.6774 Mean absolute std. coef diff: 1.265 Mahalanobis distance ratio for lack-of-fit (target 1.0): 8.25 Lack-of-fit test: 74.28; p-value 0 for test that synthesis model is compatible with a chi-squared test with 9 degrees of freedom. \end{verbatim} <>= f15 <- glm.synds(smoke ~ sex + edu + age + sex * edu, data = s15, family = "binomial") compare(f15, ods, plot.intercept = TRUE) @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + edu + age + sex * edu, family = "binomial", data = s15) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) -0.094687 0.028147 -0.122834 -0.7731 0.8028 sexFEMALE 0.659187 1.033275 -0.374088 -2.4859 0.3658 eduVOC 0.233921 0.219490 0.014431 0.1088 0.9722 eduSEC 0.674266 0.613072 0.061194 0.4244 0.8917 eduHIGH 0.961463 1.078920 -0.117457 -0.6528 0.8335 age 0.008807 0.006355 0.002451 1.2534 0.6802 sexFEMALE:eduVOC -0.031396 -0.576577 0.545180 2.9361 0.2510 sexFEMALE:eduSEC -0.003785 -0.442560 0.438775 2.2365 0.4294 sexFEMALE:eduHIGH -0.186075 -0.752495 0.566420 2.4227 0.3819 Measures for 5 syntheses and 9 coefficients Mean confidence interval overlap: 0.6232 Mean absolute std. coef diff: 1.477 Mahalanobis distance ratio for lack-of-fit (target 1.0): 10.71 Lack-of-fit test: 96.42; p-value 0 for test that synthesis model is compatible with a chi-squared test with 9 degrees of freedom. \end{verbatim} \renewcommand{\baselinestretch}{1.5} \noindent{Both syntheses identify the problems with the synthesis model, but the first analysis with \texttt{proper = FALSE} shows it more clearly.} \subsection{Confidence intervals for \texttt{population.inference = TRUE}. \label{sec:ci2}} When the parameter \texttt{population.inference} of the \texttt{compare.fit.synds()} function is set to \texttt{TRUE} interval overlaps are calculated as \begin{equation} IO_{j}= 0.5\left[\frac{min(u_o,u_s)-max(l_o,l_s)}{u_o-l_o}+\frac{min(u_o,u_s)-max(l_o,l_s)}{u_s-l_s} \right] \label{1} \end{equation} \noindent{}as proposed by \cite{KarrUtil06}. Here the proportion of overlap is calculated with the geometric mean of the interval lengths in the denominator. When the intervals are calculated on the original scales (\texttt{plot = "coef"}) they differ in length, with that from the synthetic data usually being wider. If the intervals are printed and plotted as z statistics (\texttt{plot = "z"})} these differences in width of intervals are not apparent because each one has been standardised by its standard error. The differences are most pronounced for small \texttt{m} and when the data have been synthesised with \texttt{proper = TRUE} as in \texttt{s16} below. \renewcommand{\baselinestretch}{1.0} <>= s16 <- syn(ods, proper = TRUE, print.flag = FALSE) f16 <- glm.synds(smoke ~ sex + edu + sex * edu, data = s16, family = "binomial") @ <>= compare(f16, ods, plot.intercept = TRUE, plot = "coef") @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + edu + sex * edu, family = "binomial", data = s16) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 0.36822 0.4078 -0.03957 -0.3667 0.9064 sexFEMALE 1.24320 1.0592 0.18397 1.2255 0.6874 eduVOC 0.01954 0.1139 -0.09434 -0.7350 0.8125 eduSEC 0.74137 0.5074 0.23401 1.6695 0.5741 eduHIGH 1.10822 0.9864 0.12182 0.6869 0.8248 sexFEMALE:eduVOC -0.56713 -0.6058 0.03863 0.2085 0.9468 sexFEMALE:eduSEC -0.81774 -0.4497 -0.36803 -1.8782 0.5208 sexFEMALE:eduHIGH -1.18534 -0.7986 -0.38677 -1.6586 0.5769 Measures for one synthesis and 8 coefficients Mean confidence interval overlap: 0.7312 Mean absolute std. coef diff: 1.054 Mahalanobis distance ratio for lack-of-fit (target 1.0): 1.65 Lack-of-fit test: 13.18; p-value 0.1058 for test that synthesis model is compatible with a chi-squared test with 8 degrees of freedom. \end{verbatim} <>= compare(f16, ods, plot.intercept = TRUE, population.inference = TRUE, plot = "coef") @ \begin{verbatim} Call used to fit models to the data: glm.synds(formula = smoke ~ sex + edu + sex * edu, family = "binomial", data = s16) Differences between results based on synthetic and observed data: Synthetic Observed Diff Std. coef diff CI overlap (Intercept) 0.36822 0.4078 -0.03957 -0.3667 0.7887 sexFEMALE 1.24320 1.0592 0.18397 1.2255 0.7887 eduVOC 0.01954 0.1139 -0.09434 -0.7350 0.7887 eduSEC 0.74137 0.5074 0.23401 1.6695 0.7414 eduHIGH 1.10822 0.9864 0.12182 0.6869 0.7887 sexFEMALE:eduVOC -0.56713 -0.6058 0.03863 0.2085 0.7887 sexFEMALE:eduSEC -0.81774 -0.4497 -0.36803 -1.8782 0.6995 sexFEMALE:eduHIGH -1.18534 -0.7986 -0.38677 -1.6586 0.7436 Measures for one synthesis and 8 coefficients Mean confidence interval overlap: 0.766 Mean absolute std. coef diff: 1.054 Mahalanobis distance ratio for lack-of-fit (target 1.0): 1.65 Lack-of-fit test: 13.18; p-value 0.1058 for test that synthesis model is compatible with a chi-squared test with 8 degrees of freedom. \end{verbatim} \renewcommand{\baselinestretch}{1.5} \section{Acknowledgement} We are most grateful to Dr. J\"org Drechsler for his constructive criticisms of a previous draft of this vignette and to other \texttt{synthpop} users for their feedback. \bibliographystyle{acm} \bibliography{inference} \end{document}