\documentclass[11pt]{article}
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
%\VignetteIndexEntry{Population contrasts}
\SweaveOpts{prefix.string=tests,width=6,height=4, keep.source=TRUE, fig=FALSE}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

\SweaveOpts{width=6,height=4}
\setkeys{Gin}{width=\textwidth}

<<echo=FALSE>>=
options(continue=" ", width=70)
options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1))))
pdf.options(pointsize=8) #text in graph about the same as regular text
options(contrasts=c("contr.treatment", "contr.poly")) #reset default
library(survival)
library(splines)
@

\title{Population contrasts}
\author{Terry M Therneau \\ \emph{Mayo Clinic}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\myfig}[1]{\includegraphics[height=!, width=\textwidth]
  {tests-#1.pdf}}
\newcommand{\ybar}{\overline{y}}

\begin{document}
\maketitle
\tableofcontents

\section{Introduction}
Statisticians and their clients have always been fond of single number
summaries for a data set, perhaps too much so.
Consider the hypothetical data shown in figure \ref{fig1} comparing
treatments A and B with age as a confounder.
What is a succinct but useful summary of the difference between treatment
arms A and B?
One approach is to select a fixed \emph{population} for the age
distribution, and then compute the mean effect over that population.

\begin{figure}
<<fig=TRUE, echo=FALSE>>=
plot(c(50,85), c(2,4.5), type='n', xlab="Age", ylab="Effect")
#abline(.645, .042, lty=1, col=1, lwd=2)
#abline(.9, .027, lty=1, col=2, lwd=2)
abline(.35, .045, lty=1, col=1, lwd=2)
abline(1.1, .026, lty=1, col=2, lwd=2)
legend(50, 4.2, c("Treatment A", "Treatment B"), col=c(1,2), lty=1,
       lwd=2, cex=1.3, bty='n')
@
\caption{Treatment effects for a hypothetical study.}
\label{fig1}
\end{figure}

More formally, assume we have a fitted model.
We want to compute the conditional expectation
\begin{equation*}
  m_A = E_F\left( \hat y \mid trt=A \right)
\end{equation*}
where $F$ is some chosen population for the covariates other than
treatment.
Important follow-up questions are: what population should be used, what
statistic should be averaged, what computational algorithm should be used,
and what are the statistical properties of the resulting estimate?
Neither the statistic nor the population question should be taken lightly,
and both need to be closely linked to the scientific question.
If, for instance, the model of figure \ref{fig1} were used to inform a
nursing home formulary, then the distribution $F$ might be focused on
higher ages.

Four common populations are
\begin{itemize}
  \item Empirical: The data set itself.  For the simple example above, this
    would be the distribution of all $n$ ages in the data set, irrespective
    of treatment.
  \item Factorial or Yates: This is only applicable if the adjusting
    variables are all categorical, and consists of all unique combinations
    of them.  That is, the data set one would envision for a balanced
    factorial experiment.
  \item External: An external reference such as the age/sex distribution of
    the US census.  This is common in epidemiology.
  \item SAS type 3: A factorial distribution for the categorical predictors
    and the data distribution for the others.
    More will be said about this in section \ref{sect:SAS}.
\end{itemize}
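To make the idea concrete, here is a minimal sketch using simulated data;
the data set \code{toy} and its variables are invented for this
illustration only and play no role in the later analyses.
The empirical-population average for a treatment is simply the mean
prediction when every subject keeps their own age but the treatment is
forced to a single value.

<<eval=FALSE>>=
# a toy population marginal mean (PMM); simulated data, invented names
set.seed(1)
toy <- data.frame(trt = rep(c("A", "B"), each=50),
                  age = runif(100, 50, 85))
toy$y <- 2 + .04*toy$age + .5*(toy$trt=="B") + rnorm(100, sd=.3)
tfit <- lm(y ~ trt + age, data=toy)
# empirical population: all 100 observed ages, treatment forced to A, then B
pmm <- sapply(c("A", "B"), function(z)
    mean(predict(tfit, newdata=data.frame(trt=z, age=toy$age))))
pmm["B"] - pmm["A"]     # the population averaged treatment contrast
@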
The \code{yates} function is designed to compute such population averages
from a fitted model, along with desired contrasts on the resultant
estimates, e.g., whether the population average effects for treatment A and
treatment B are equal.
The function has been tested with the results of \code{lm}, \code{glm}, and
\code{coxph} fits, and can easily be extended to any R model that includes
a standard set of objects in the result, i.e., \code{terms},
\code{contrasts}, \code{xlevels}, and \code{assign}.

The routine's name is a nod to the 1934 paper by Yates \cite{Yates34},
\emph{The analysis of multiple classifications with unequal numbers in
different classes}.
In dealing with an unbalanced 2-way layout he states that
\begin{quote}
  \ldots in the absence of any further assumptions the efficient estimates
  of the average A effects are obtained by taking the means of the observed
  sub-class means for like A over all B sub-classes.
\end{quote}
In his example these sub-class means are the predicted values for each A*B
combination; thus his estimate for each level of A is the mean predicted
value over B, i.e., the mean prediction for A over a factorial population
for B.
Yates then develops formulas for calculating these quantities and testing
their equality that are practical for the manual methods of the time; these
are now easily accomplished directly using matrix computations.
(It is interesting that Yates' paper focuses far more on estimands than
tests, while later references to his work focus almost exclusively on the
latter, e.g., the ``Yates sum of squares'', reflecting, again, our
profession's singular focus on p values.)

This concept of population averages is actually a common one in statistics.
Taking an average is, after all, nearly the first thing a statistician will
do.
Yates' weighted means analysis, the g-estimates of causal models, direct
adjusted survival curves, and least squares means are but a small sample of
the idea's continual rediscovery.
Searle et al.\ \cite{Searle80} use the term population marginal mean (PMM),
which we will adopt as the acronym, though they deal only with linear
models, assume a factorial population, and spend most of their energy
writing out explicit formulas for particular cases.
They also use a separate acronym to distinguish the estimated PMM based on
a model fit from the ideal, which we will not do.

\section{Solder Example}
\subsection{Data}
In 1988 an experiment was designed and implemented at one of AT\&T's
factories to investigate alternatives in the wave soldering procedure for
mounting electronic components to printed circuit boards.
The experiment varied a number of factors relevant to the process.
The response, measured by eye, is the number of visible solder skips.
The data set was used in the book \emph{Statistical Models in S}
\cite{Chambers93} and is included in the \code{survival} package.

<<>>=
summary(solder)
length(unique(solder$PadType))
@

A perfectly balanced experiment would have $3 \times 2 \times 10 \times 3 =
180$ observations for each Mask, corresponding to all combinations of
Opening, Solder thickness, PadType and Panel.
The A3 Mask has extra replicates for a subset of the Opening*Thickness
combinations, however, while Mask A6 lacks observations for these sets.
Essentially, one extra run of 180 was done with a mixture of Masks.
Figure \ref{fig:solder} gives an overview of univariate results for each
factor.
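The imbalance is easy to see in a simple cross-tabulation of Mask by
Opening; the check below is a sketch and is not evaluated here.

<<eval=FALSE>>=
# cross-tabulate to see the extra A3 replicates and the missing A6 cells
with(solder, table(Opening, Mask))
@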
\begin{figure}
<<fig=TRUE, echo=FALSE>>=
# reproduce their figure 1.1
temp <- lapply(1:5, function(x) tapply(solder$skips, solder[[x]], mean))
plot(c(0.5, 5.5), range(unlist(temp)), type='n', xaxt='n',
     xlab="Factors", ylab="mean number of skips")
axis(1, 1:5, names(solder)[1:5])
for (i in 1:5) {
    y <- temp[[i]]
    x <- rep(i, length(y))
    text(x-.1, y, names(y), adj=1)
    segments(i, min(y), i, max(y))
    segments(x-.05, y, x+.05, y)
}
@
\caption{Overview of the solder data.}
\label{fig:solder}
\end{figure}

\subsection{Linear model}
A subset of the solder data that excludes Mask A6 is exactly the type of
data set considered in Yates' paper: a factorial design whose data set is
not quite balanced.
Start with a simple fit and then obtain the Yates predictions.

<<>>=
with(solder[solder$Mask!='A6',], ftable(Opening, Solder, PadType))
fit1 <- lm(skips ~ Opening + Solder + Mask + PadType + Panel, data=solder,
           subset= (Mask != 'A6'))
y1 <- yates(fit1, ~Opening, population = "factorial")
print(y1, digits=2)   # (fewer digits, for vignette page width)
@

The printout has two parts: the left-hand columns are population marginal
mean (PMM) values and the right-hand columns are tests on those predicted
values.
The default is a single global test that these PMMs are all equal.
Under a factorial population these are the Yates' weighted means
\cite{Yates34} and the corresponding test is the Yates' sum of squares for
that term.
These would be labeled as ``least squares means'' and ``type III SS'',
respectively, by the GLM procedure of SAS.
More on this correspondence appears in the section on the SGTT algorithm.

Now we repeat this using the default population, which is the set of all
810 combinations for Solder, Mask, PadType and Panel found in the non-A6
data.
The \code{pairwise} option requests tests on all pairs of openings.

<<>>=
y2 <- yates(fit1, "Opening", population = "data", test="pairwise")
print(y2, digits=2)
#
# compare the two results
temp <- rbind(diff(y1$estimate[,"pmm"]), diff(y2$estimate[,"pmm"]))
dimnames(temp) <- list(c("factorial", "empirical"), c("2 vs 1", "3 vs 2"))
round(temp,5)
@

Although the PMM values shift with the new population, the difference in
PMM values between any two pairs is unchanged.
This is because we have fit a model with no interactions.
Referring to figure \ref{fig1}, this is a model where all of the
predictions are parallel lines; shifting the population left or right will
change the PMM, but has no effect on the difference between two lines.
For a linear model with no interactions, the test statistics created by the
\code{yates} function are thus not very interesting, since they will be no
different from simple comparisons of the model coefficients.
Here are results from a more interesting fit that includes interactions.

<<>>=
fit2 <- lm(skips ~ Opening + Mask*PadType + Panel, solder,
           subset= (Mask != "A6"))
temp <- yates(fit2, ~Opening, population="factorial")
print(temp, digits=2)
@

This solder data set is close to being balanced, and the means change only
a small amount.
(One hallmark of a completely balanced experiment is that any PMM values
are unaffected by the addition of interaction terms.)
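Because Opening does not take part in any interaction in \code{fit2}, its
pairwise PMM differences are again identical under the data and factorial
populations, even though the PMM values themselves shift with the
population; the sketch below (not evaluated here) can be used to check
this.

<<eval=FALSE>>=
# the Opening PMMs under the default data population, for comparison
print(yates(fit2, ~Opening, population="data"), digits=2)
@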
\subsection{Missing cells}
Models that involve factors and interactions can have an issue with missing
cells, as shown by the example below using the full version of the solder
data.

<<>>=
fit3 <- lm(skips ~ Opening * Mask + Solder + PadType + Panel, solder)
temp <- yates(fit3, ~Mask, test="pairwise")
print(temp, digits=2)
@

The population predictions for each Mask include all combinations of
Opening, Solder, PadType, and Panel that are found in the data.
The above call implicitly uses the default value of
\code{population=`data'}, which is the option that users will normally
select.
The underlying algorithm amounts to:
\begin{enumerate}
  \item Make a copy of the data set (900 obs), and set Mask to A1.5 for all
    observations.
  \item Get the 900 resulting predicted values from the model, and take
    their average.
  \item Repeat 1 and 2 for each mask type.
\end{enumerate}

However, there were no observations in the data set with Mask = A6 and
Opening = Large.
Formally, predictions for the A6/Large combination are \emph{not
estimable}, and as a consequence neither are any population averages that
include those predicted values, nor any tests that involve those population
averages.
This lack of estimability is entirely due to the inclusion of a Mask by
Opening interaction term in the model: the interaction states that each
Mask/Opening combination has a unique effect, which in turn implies that we
need an estimate for every Mask*Opening pair in order to compute population
predictions for all levels of the Mask variable.
If you do the above steps `by hand' using the R \code{predict} function, it
will return a value for all 900 observations along with a warning message
that the results may not be reliable, and the warning is correct in this
case.
The result of \code{coef(fit3)} reveals that the fit generated an NA as one
of the coefficients.
The presence of a missing value shows that some predictions will not be
estimable, but it is not possible to determine \emph{which} ones are
estimable from the coefficients alone.
The predict function knows that some predictions will be wrong, but not
which ones.

A formal definition of estimability for a given prediction is that it can
be written as a linear combination of the rows of $X$, the design matrix
for the fit.
The \code{yates} function performs the necessary calculations to verify
formal estimability of each predicted value, and thus is able to correctly
identify the deficient terms.

\section{Generalized linear models}
\label{sect:glm}
Since the solder response is a count of the number of skips, Poisson
regression is a more natural modeling approach than linear regression.
For a glm we need to consider more carefully both the population and the
statistic to be averaged.

<<>>=
gfit2 <- glm(skips ~ Opening * Mask + PadType + Solder, data=solder,
             family=poisson)
y1 <- yates(gfit2, ~ Mask, predict = "link")
print(y1, digits=2)
print(yates(gfit2, ~ Mask, predict = "response"), digits=2)
@

Mean predicted values for the number of skips using \code{type=`response'}
in \code{predict.glm} are similar to the estimates from the linear
regression model.
Prediction of type `link' yields a population average of the linear
predictor $X\beta$.
Though perfectly legal (you can, after all, take the mean of anything you
want), the linear predictor PMM values can be more difficult to interpret.
Since the default link function for Poisson regression is log(), the PMM of
the linear predictor equals the mean of the log(predicted values).
Because $\exp({\rm mean}(\log(x)))$ defines the geometric mean of a random
variable $x$, one can view the exponentiated version of the link estimate
as a geometric mean of predicted values over the population.
Given the skewness of Poisson counts, this may actually be an advantageous
summary.
Arguments for other links are much less clear, however: the authors have,
for instance, never been able to come to a working understanding of what an
``average log odds'' would represent, i.e., the link from logistic
regression.
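The geometric mean interpretation is easy to verify by hand for one mask
level; the sketch below (not evaluated here) builds the data population
explicitly, mimicking the brute force construction described above.

<<eval=FALSE>>=
# exp(mean of the link) equals the geometric mean of the predicted counts,
# here for a population with Mask set to A1.5 in every row
pop <- solder
pop$Mask <- factor("A1.5", levels=levels(solder$Mask))
eta <- predict(gfit2, newdata=pop, type="link")      # linear predictors
lam <- predict(gfit2, newdata=pop, type="response")  # predicted counts
c(exp(mean(eta)), exp(mean(log(lam))))               # the two agree
@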
One computational advantage of the linear predictor lies in creating the
variance.
Since \code{mean(Z \%*\% b) = colMeans(Z) \%*\% b} the computation can be
done in three steps: create $Z$ with one row for each observation in the
population, obtain the vector of column means $d = 1'Z/m$, and then the PMM
estimate is $d'\hat\beta$ and its variance is $d'Vd$, where $V$ is the
variance matrix of $\hat\beta$.
For other PMM quantities the routine samples \code{nsim} vectors from
$b \sim N(\hat\beta, V)$, then computes PMM estimates separately for each
vector $b$ and forms an empirical variance of the PMM estimates from the
results.

For nonlinear predictors such as the response, the population choice
matters even for an additive model.
The two results below have different estimates of the between-PMM
differences, and different tests.

<<>>=
gfit1 <- glm(skips ~ Opening + Mask + PadType + Solder, data=solder,
             family=poisson)
yates(gfit1, ~ Opening, test="pairwise", predict = "response",
      population='data')
yates(gfit1, ~ Opening, test="pairwise", predict = "response",
      population="yates")
@

\section{Free Light Chain}
As an example of the more usual case, a data set that does \emph{not} arise
from a balanced factorial experiment, we will look at the free light chain
data.
In 2012, Dispenzieri and colleagues examined the distribution and
consequences of the free light chain value, a laboratory test, on a large
fraction of the 1995 population of Olmsted County, Minnesota, aged 50 or
older \cite{Kyle06, Dispenzieri12}.
The R data set \code{flchain} contains a 50\% random sample of this larger
study and is included as a part of the \code{survival} package.

The primary purpose of the study was to measure the amount of plasma
immunoglobulin and its components.
Intact immunoglobulins are composed of a heavy chain and a light chain
portion.
In normal subjects there is overproduction of the light chain component by
the immune cells, leading to a small amount of \emph{free light chain} in
the circulation.
Excessive amounts of free light chain (FLC) are thought to be a marker of
dysregulation in the immune system.
An important medical question is whether high levels of FLC have an impact
on survival, which will be explored using a Cox model.
Free light chains have two major forms denoted as kappa and lambda; we will
use the sum of the two.

A confounding factor is that FLC values rise with age, in part because FLC
is eliminated by the kidneys and renal function declines with age.
The age distribution of males and females differs, so we will adjust any
comparisons for both age and sex.
The impact of age on mortality is overwhelmingly large, and so correction
for the age imbalance is critical when exploring the impact of FLC on
survival.
Figure \ref{fig:flc} shows the trend in FLC values as a function of age.
For illustration of linear models using factors, we have also created a
categorical age value based on decades of age.
\begin{figure}
<<fig=TRUE, echo=FALSE>>=
male <- (flchain$sex=='M')
flchain$flc <- flchain$kappa + flchain$lambda
mlow <- with(flchain[male,], smooth.spline(age, flc))
flow <- with(flchain[!male,], smooth.spline(age, flc))
plot(flow, type='l', ylim=range(flow$y, mlow$y), xlab="Age", ylab="FLC")
lines(mlow, col=2, lwd=2)
legend(60, 6, c("Female", "Male"), lty=1, col=1:2, lwd=2, bty='n')
@
\caption{Free light chain values as a function of age.}
\label{fig:flc}
\end{figure}

The table of counts shows that the sex distribution becomes increasingly
unbalanced at the older ages, from about 1/2 females in the youngest group
to a 4:1 ratio in the oldest.

<<>>=
flchain$flc <- flchain$kappa + flchain$lambda
age2 <- cut(flchain$age, c(49, 59, 69, 79, 89, 120),
            labels=c("50-59", "60-69", "70-79", "80-89", "90+"))
fgroup <- cut(flchain$flc, quantile(flchain$flc, c(0, .5, .75, .9, 1)),
              include.lowest=TRUE, labels=c("<50", "50-75", "75-90", ">90"))
counts <- with(flchain, table(sex, age2))
counts
#
# Mean FLC in each age/sex group
cellmean <- with(flchain, tapply(flc, list(sex, age2), mean))
round(cellmean,1)
@

Notice that the male/female difference in FLC varies with age,
\Sexpr{round(cellmean[1,1],1)} versus \Sexpr{round(cellmean[2,1],1)} at age
50--59 years and \Sexpr{round(cellmean[1,5],1)} versus
\Sexpr{round(cellmean[2,5],1)} at ages 90 and above, as is also shown in
figure \ref{fig:flc}.
The data does not fit a simple additive model; there are ``interactions'',
to use statistical parlance.
Men and women simply do not age in quite the same way.

\subsection{Linear models}
Compare the mean FLC for males versus females, with and without adjusting
for age.

<<>>=
library(splines)
flc1 <- lm(flc ~ sex, flchain)
flc2a <- lm(flc ~ sex + ns(age, 3), flchain)
flc2b <- lm(flc ~ sex + age2, flchain)
flc3a <- lm(flc ~ sex * ns(age, 3), flchain)
flc3b <- lm(flc ~ sex * age2, flchain)
#
# prediction at age 65 (which is near the mean)
tdata <- data.frame(sex=c("F", "M"), age=65, age2="60-69")
temp <- rbind("unadjusted"               = predict(flc1, tdata),
              "additive, continuous age" = predict(flc2a, tdata),
              "additive, discrete age"   = predict(flc2b, tdata),
              "interaction, cont age"    = predict(flc3a, tdata),
              "interaction, discrete"    = predict(flc3b, tdata))
temp <- cbind(temp, temp[,2] - temp[,1])
colnames(temp) <- c("Female", "Male", "M - F")
round(temp,2)
@

The between-sex difference is underestimated without adjustment for age.
The females are over-represented at the high ages, which inflates their
estimate.
For this particular data set, both continuous and categorical age
adjustment are able to recover the true size of the increment for males.
Now look at population adjustment.

<<>>=
yates(flc3a, ~sex)    # interaction, continuous age
#
yates(flc3b, ~sex)    # interaction, categorical age
#
yates(flc3b, ~sex, population="factorial")
@

The population average values are just a bit higher than the prediction at
the mean due to the upward curvature of the age versus FLC curve.
The average for a factorial population is larger yet, however.
This is because it is the average for an unusual population which has as
many 90+ year old subjects as 50--59 year olds; i.e., it is the correct
answer to a rather odd question, since this is a population that will never
be encountered in real life.
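A user-chosen reference population can also be supplied directly as a data
frame of control-variable values, just as is done for the Cox model later
in this vignette.
The sketch below (not evaluated here) uses a hypothetical \code{older}
population restricted to subjects aged 70 and over, in the spirit of the
nursing home example from the introduction.

<<eval=FALSE>>=
# a user-supplied reference population: the observed ages of subjects 70+
older <- flchain[flchain$age >= 70, "age", drop=FALSE]
yates(flc3a, ~sex, population=older)
@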
We can also reverse the question and examine age effects after adjusting
for sex.
For the continuous model the age values for the PMM need to be specified
using the \code{levels} argument; otherwise the routine will not know which
ages are ``of interest'' to the reader.
(With a factor the routine will assume that you want all the levels, but a
subset can be chosen using the \code{levels} argument.)

<<>>=
yates(flc3a, ~ age, levels=c(65, 75, 85))
yates(flc3b, ~ age2)
@

Not surprisingly, the prediction at age 65 years from the continuous model
is quite close to that for the 60--69 year age group from the discrete
model.

\section{Cox Models}
Finally we come to Cox models, which are, after all, the point of this
vignette and the question that prompted creation of the \code{yates}
function.
Here the question of what to predict is more serious.
To get a feel for the data, look at three simple models.

<<>>=
options(show.signif.stars=FALSE)  # show statistical intelligence
coxfit1 <- coxph(Surv(futime, death) ~ sex, flchain)
coxfit2 <- coxph(Surv(futime, death) ~ sex + age, flchain)
coxfit3 <- coxph(Surv(futime, death) ~ sex * age, flchain)
anova(coxfit1, coxfit2, coxfit3)
#
exp(c(coef(coxfit1), coef(coxfit2)[1]))  # sex estimate without and with age
@

The model with an age*sex interaction does not fit substantially better
than the additive model.
This is actually not a surprise: in a plot of the log US death rates versus
age, the curves for males and females are essentially parallel after age 50
years.
(See the \code{survexp.us} data set, for instance.)
The sex coefficients for models 1 and 2 differ substantially.
Males in this data set have almost 1.5 times the death rate of females at
any given age, but when age is ignored the fact that females dominate the
oldest ages almost completely cancels this out, and males appear to have
the same `overall' mortality.

Adjustment for both age and sex is critical to understanding the potential
effect of FLC on survival.
Dispenzieri \cite{Dispenzieri12} looked at the impact of FLC by dividing
the sample into those above and below the 90th percentile of FLC; for
illustration we will use 4 groups consisting of the lowest 50\%, the 50th
to 75th percentile, the 75th to 90th percentile, and above the 90th
percentile.

<<>>=
coxfit4 <- coxph(Surv(futime, death) ~ fgroup*age + sex, flchain)
yates(coxfit4, ~ fgroup, predict="linear")
yates(coxfit4, ~ fgroup, predict="risk")
@

We see that after adjustment for age and sex, FLC is a strong predictor of
survival.
Since the Cox model is a model of relative risk, any constant term is
arbitrary: one could add 100 to all of the log rates (type `linear' above)
and have as valid an answer.
To keep the results on a sensible scale, the \code{yates} function centers
the mean linear predictor of the original data at zero.
Because of Jensen's inequality this does not precisely center the risks at
$\exp(0) = 1$, but it suffices to keep values in a sensible range for
display.

A similar argument to that found in section \ref{sect:glm} about the
arithmetic versus geometric mean can be made here, but a more fundamental
issue is that the overall hazard function for a population is not the
average of the hazards for each of its members, and in fact will change
over time as higher-risk members of the population die.
Though computable, the mean hazard ratio only applies to the study at time
0, before selective death begins to change the structure of the remaining
population, and likewise for the average linear predictor, i.e., the mean
log hazard ratio.
A PMM based on either of these estimates is hard to interpret.
Survival curves, however, do lead to a proper average: the survival curve
of a population is the mean of the individual survival curves of its
members.
Functions computed from the survival curve, such as the mean time until
event, will also be proper and interpretable.
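Such a population-averaged (``direct adjusted'') curve can be computed by
hand from a Cox model: predict an individual curve for every row of the
population and average them.
The sketch below (not evaluated here) does this for the highest FLC group
on a small subsample of the data; the \code{yates} call with
\code{predict="survival"} used below automates the same calculation for
every group and adds a variance.

<<eval=FALSE>>=
# direct adjusted survival for the fgroup = ">90" level, by hand
pop <- flchain[seq(1, nrow(flchain), by=100), c("age", "sex")]
pop$fgroup <- factor(">90", levels=levels(fgroup))
csurv <- survfit(coxfit4, newdata=pop)   # one curve per population row
adjsurv <- rowMeans(csurv$surv)          # mean of the individual curves
@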
The longest death time for the FLC data set is
\Sexpr{round(with(flchain, max(futime[death==1])/365.25),1)} years; as a
one-number summary of each PMM curve we will use the restricted mean
survival time with a threshold of 13 years.

<<>>=
# longest time to death
round(max(flchain$futime[flchain$death==1]) / 365.25, 1)
# compute naive survival curve
flkm <- survfit(Surv(futime, death) ~ fgroup, data=flchain)
print(flkm, rmean=13*365.25, scale=365.25)
@

Straightforward survival prediction takes longer than recommended for a
CRAN vignette: there are \Sexpr{nrow(flchain)} subjects in the study and 4
FLC groups, which leads to just over 30 thousand predicted survival curves
when using the default \code{population=`data'}, and each curve has over
2000 time points (the number of unique death times).
To compute a variance, this is then repeated the default \code{nsim = 200}
times.
We can use this as an opportunity to demonstrate a user-supplied
population, i.e., a data set containing a population of values for the
control variables.
We'll use every 20th observation in the \code{flchain} data as the
population and also reduce the number of simulations to limit the run time.

<<>>=
mypop <- flchain[seq(1, nrow(flchain), by=20), c("age", "sex")]
ysurv <- yates(coxfit4, ~fgroup, predict="survival", nsim=50,
               population = mypop, options=list(rmean=365.25*13))
ysurv
# display side by side
temp <- rbind("simple KM" = summary(flkm, rmean=13*365.25)$table[, "rmean"],
              "population adjusted" = ysurv$estimate[,"pmm"])
round(temp/365.25, 2)
@

The spread in restricted mean values between the different FLC groups is
considerably less in the marginal (i.e., PMM) survival curves than in the
unadjusted Kaplan-Meier, with the biggest change for those above the 90th
percentile of FLC.
Males dominate the highest FLC values.
Without adjustment for sex, the survival for the high FLC group is biased
downward by the male predominance.

The \code{ysurv} object also contains an optional \code{summary} component,
which in this case is the set of 4 PMM survival curves.
Plot these along with the unadjusted curves, with solid lines for the PMM
estimates and dashed for the unadjusted curves.
This shows the difference between adjusted and unadjusted even more
clearly.
Adjustment for age and sex has pulled the curves together, though the
$>90$th percentile group still stands apart from the rest.
(It is not a one-number summary with a simple p value, however, and so is
spurned by users and journals ;-).

<<fig=TRUE>>=
plot(flkm, xscale=365.25, fun="event", col=1:4, lty=2,
     xlab="Years", ylab="Death")
lines(ysurv$summary, fun="event", col=1:4, lty=1, lwd=2)
legend(0, .65, levels(fgroup), lty=1, lwd=2, col=1:4, bty='n')
@

\section{Mathematics}
The underlying code uses a simple brute force algorithm.
It first builds a population data set for the control variables that
includes a placeholder for the variable of interest.
Then, one at a time, it places each level of the variable(s) of interest in
all rows of the data (e.g., sex=F for all rows).
It computes the model's prediction for all rows, and then computes the mean
prediction.
Since predicted values are independent of how the variables in a model are
coded, the result is also independent of coding.

When the prediction is the simple linear predictor $X\beta$, we take
advantage of the fact that
$\mbox{\rm mean}(X \beta) = [\mbox{\rm column means}(X)]\, \beta = c\beta$.
If $C$ is the matrix of said column means, one row for each of the groups
of interest, then $C\beta$ is the vector of PMM values and $CVC'$ is the
variance--covariance matrix of those values, where $V$ is the variance
matrix of $\hat \beta$.
The lion's share of the work is building the individual $X$ matrices, and
that is unchanged.

For other than the linear case, the variance is obtained by simulation.
Assume that $\hat\beta \sim N(\beta, V)$, and draw \code{nsim} independent
samples $b$ from $N(\hat\beta, V)$.
PMM values are computed for each instance of $b$, and an empirical variance
matrix for the PMM values is then computed.
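As an illustration of the simulation step, the sketch below (not evaluated
here) reproduces the idea for the response-scale PMM of
\code{Opening = "L"} in the Poisson fit \code{gfit1}; it uses
\code{mvrnorm} from the MASS package for the multivariate normal draws and
builds the prediction matrix directly.

<<eval=FALSE>>=
# simulation variance for a nonlinear (response scale) PMM, by hand
library(MASS)                          # for mvrnorm
pop <- solder                          # the 'data' population
pop$Opening <- factor("L", levels=levels(solder$Opening))
tt <- delete.response(terms(gfit1))
X  <- model.matrix(tt, model.frame(tt, pop), contrasts.arg=gfit1$contrasts)
bmat <- mvrnorm(200, coef(gfit1), vcov(gfit1))  # draws from N(beta-hat, V)
pmm  <- apply(bmat, 1, function(b) mean(exp(X %*% b)))
c(estimate = mean(exp(X %*% coef(gfit1))), std.err = sd(pmm))
@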
\section{SAS GLM type III (SGTT) algorithm}
\label{sect:SAS}
Earlier in this document reference was made to the SAS ``type 3''
estimates, and we now delve into that topic.
It is placed at the end because it is a side issue with respect to
population averages.
However, whatever one's opinion on the wisdom or folly of the SAS
estimator, one cannot ignore its ubiquity, and showing how it fits into
this framework is an important part of the picture.

As groundwork we start with some computational facts about ANOVA.
Assume an $X$ matrix for the model in standard order: the intercept, then
main effects, then first-order interactions, second-order interactions, and
so on as one proceeds from the leftmost column.
\begin{itemize}
  \item Let $LDL' = (X'X)$ be the generalized Cholesky decomposition of the
    $X'X$ matrix, where $L$ is lower triangular with $L_{ii}=1$ and $D$ is
    diagonal.  $D_{ii}$ will be zero if column $i$ of $X$ can be written as
    a linear combination of prior columns and will be positive otherwise.
  \item Let $L'_k$ refer to the rows of $L'$ that correspond to the $k$th
    term in the model, e.g., a main effect.  Then the test statistics for
    $L'_1\beta = 0$, $L'_2\beta = 0$, \ldots, are the sums of squares for
    the standard ANOVA table, often referred to as type I or sequential
    sums of squares.
  \item If $X$ corresponds to a balanced factorial design, then
    $L_{ij} \ne 0$ only if $i=j$ or the term for column $j$ is contained in
    the term for column $i$ (the intercept within a main effect, or a main
    effect within one of its interactions, for example).  In that case the
    sequential sums of squares do not depend on the order of the terms, and
    they reproduce Yates' weighted means analysis.
\end{itemize}

The SAS type III computation, referred to here as the SGTT, works in an
expanded (overparameterized) basis and constructs, for each term, a set of
contrasts that is orthogonal to the contrasts for every term that contains
it; in examples without missing cells the resulting sums of squares have
agreed with Yates' weighted means analysis.
(Further notes on this construction appear in the appendix.)
A quite different computation, which we will refer to as the NSTT, is also
widely used: keep the interaction in the model and directly test the main
effect coefficients, e.g., with \code{drop1}.

<<>>=
options(contrasts = c("contr.treatment", "contr.poly"))  # default
nfit1 <- lm(skips ~ Solder*Opening + PadType, solder)
drop1(nfit1, ~Solder)
@

This gives a ``type III'' sum of squares of 389.88 for Solder.
However, if a different coding is used then we get a very different SS: it
increases more than 25-fold.

<<>>=
options(contrasts = c("contr.SAS", "contr.poly"))
nfit2 <- lm(skips ~ Solder*Opening + PadType, solder)
drop1(nfit2, ~Solder)
@

The example shows a primary problem with the NSTT: the answer that you get
depends on how the contrasts were coded.
For a simple two-way interaction like the above, it turns out that the NSTT
actually tests the effect of Solder within the reference cell for Opening;
it is not a global test at all.

<<>>=
with(solder, tapply(skips, list(Solder, Opening), mean))
@

Looking at the simple cell means shown above, it is no surprise that the
\code{contr.SAS} fit, which uses Opening=S as the reference, will yield a
large NSTT SS, since it is a comparison of 17.4 and 5.5, while the
\code{contr.treatment} version using Opening=L as reference has a much
smaller NSTT.
In fact, re-running a particular analysis with different reference levels
for one or more of the adjusting variables is a quick way to diagnose
probable use of the NSTT algorithm by a program.
Several R packages that advertise type 3 computations actually use the
NSTT.
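For example, a quick check of a suspect routine might look like the sketch
below (not evaluated here): refit with a different reference level for an
adjusting variable and see whether the reported ``type 3'' result changes;
a genuine type III (Yates) result would be unaffected.

<<eval=FALSE>>=
# change the reference level of Opening and repeat the computation;
# the drop1 result changes, which flags an NSTT implementation
solder2 <- solder
solder2$Opening <- relevel(solder2$Opening, ref="M")
nfit1b <- lm(skips ~ Solder*Opening + PadType, solder2)
drop1(nfit1b, ~Solder)
@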
The biggest problem with the NSTT is that it sometimes gives the correct
answer.
If one uses summation constraints, a form of the model that most of us have
not seen since graduate school:
\begin{align*}
  y_{ijk} &= \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk} \\
  \sum_i \alpha_i & = 0 \\
  \sum_j \beta_j & = 0 \\
  \sum_i \gamma_{ij} &= \sum_j \gamma_{ij} = 0
\end{align*}
then the `reference cell' for Opening is the mean Opening effect, and the
NSTT for Solder will correspond to a PMM contrast over the factorial
population; the PMM itself is invariant to the chosen coding.

<<>>=
options(contrasts = c("contr.sum", "contr.poly"))
nfit3 <- lm(skips ~ Solder*Opening + PadType, solder)
drop1(nfit3, ~Solder)
yates(nfit1, ~Solder, population='factorial')
@

Thus our acronym for this method: not-safe type three (NSTT), since the
method does work if one is particularly careful.
Given the number of incorrect analyses that have arisen from this approach,
`nonsense type 3' would also be a valid interpretation, however.

<<>>=
options(contrasts = c("contr.treatment", "contr.poly"))  # restore
@

\section{Conclusion}
The population average predicted value or marginal estimate is a very
useful statistical concept.
The \code{yates} function computes these using a direct approach: create
the relevant population, get \emph{all} the predicted values, and average
them.
An advantage of this simple approach is that because predicted values do
not depend on how a model is parameterized, the results are naturally
invariant to how categorical variables are represented.
But like many good ideas in statistics, proper application of the idea
requires some thought with respect to the choice of a statistic to be
averaged, and the population over which to take the expectation.

For linear models the simple linear predictor $X\beta$ is an obvious choice
as a statistic, but for other models the choice is more nuanced.
For a Cox model we prefer the restricted mean, but this is an area that
needs more research for a firm recommendation.
In terms of the population choice, the population should match the research
question at hand.
The factorial population in particular has been badly overused.
This population is appropriate for the use cases described in Yates
\cite{Yates34} and Goodnight \cite{Goodnight78} that relate to designed
experiments; the solder data is a good example.
But in observational or clinical data the factorial population result will
too often be the ``answer to a question that nobody asked''.
In the FLC data set, for instance, the factorial population has equal
numbers of subjects in every age group: a population that no one will ever
observe.
Using the result from such computations to inform medical decisions begins
to resemble medieval debates about angels dancing on the head of a pin.

This suboptimal population choice is the primary damning criticism of
``type III'' estimates.
Because the detailed algorithm and criteria for type III tests are not well
documented, many pseudo type 3 estimates have arisen, of which the NSTT is
perhaps the most common.
Statistical software that claims to produce a ``type III'' estimate while
doing something else is indisputably wrong, even dangerous, and should not
be tolerated.

\appendix
\section{Miscellaneous notes}
None of the material below is presented with proof.
Comments to the authors that fill in any of these gaps are welcome.

\subsection{Subsampling}
One reaction that I have seen to the solder data is to instead analyse a
balanced subset of the data, throwing away extra observations, because then
``the result is simple to understand''.
Let's take this desire at face value, but then add statistics to it: rerun
the two steps of ``select a subset'' and ``compute the balanced two-way
solution'' multiple times, with different random subsets each time.
In fact, we could be more compulsive and tabulate the solution for
\emph{all} the possible balanced subsets, and then take an average.
This will give the Yates estimate.

\subsection{Type 3}
Based on the $L$ matrix argument above, one natural definition of type III
contrasts would be an upper triangular matrix, with appropriate zeros,
whose terms were orthogonal with respect to $(Z'Z)^-$, where $Z$ is the
design matrix for a balanced data set.
The SAS algorithm creates contrasts in an expanded basis, and these
contrasts are instead made orthogonal with respect to the identity matrix
$I$.
Why does this work?
Multiple example cases have shown that in data with no missing cells the
resulting sum of squares agrees with the Yates' SS, but a formal proof of
equivalence has been elusive.

It would be preferable to use an algorithm that does not require rebuilding
the $X$ matrix in the extended basis set used by the SAS approach, but
instead directly uses the $X$ matrix of the user, as it would be simpler
and more portable code.
For the case of no missing cells the Cholesky decomposition of $Z'Z$, for
instance, satisfies this requirement.
The original fit may use treatment, SAS, Helmert, or any other dummy
variable coding for the factors without affecting the computation or
result.
We have not been able to discover an approach that replicates the SAS
algorithm when there are missing cells, however.
Is there a unique set of contrasts that fulfills the type III requirements
for any given model and data set pair, or many?
How might they be constructed?
Since tests of $L'\beta = 0$ and $cL'\beta = 0$ are equivalent for any
constant $c$, without loss of generality we can assume that the diagonal of
$L$ is 0 or 1.

\subsection{Additive models}
Consider the free light chain data and a linear model containing age group
and sex as predictors.
We have spoken of factorial, data, and external populations, each of which
leads to a global test comparing the PMM estimates for males versus
females.
What population would give the smallest variance for that comparison?
(A question that perhaps only an academic statistician could love.)
Here are the sample sizes of each cell.

<<>>=
with(flchain, table(sex, age2))
@

The optimal population for comparing male to female will have population
sizes in each age group that are proportional to the harmonic means of the
two cell sizes in each column: $(1/n_{11} + 1/n_{21})^{-1}$ for age group
1, and so on.
These are the familiar denominators found in the t-test.
A PMM built using these population weights will yield the same sex
difference and test as a simple additive model with sex and age group.
Linear regression is, after all, the minimum variance unbiased estimate.

\bibliographystyle{plain}
\bibliography{refer}
\end{document}