\documentclass[12pt]{article}
\usepackage{amsmath,amssymb}
%\usepackage{mydef2}
\usepackage[round]{natbib}
\usepackage{parskip,url}
\usepackage{graphicx,subfig}
\setlength{\topmargin}{0cm}
\addtolength{\textheight}{2cm}
%\lhead{}
%\input{title}
%\VignetteIndexEntry{Comparison of methods in the metap package}
\title{Comparison of methods in the \pkg{metap} package}
\author{Michael Dewey}
\newcommand{\pkg}[1]{\texttt{#1}}
\newcommand{\func}[1]{\texttt{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\codefont}{\footnotesize}
\newcommand{\mygraph}[3]{%
\begin{figure}[htbp]
\includegraphics[height=6cm,width=10cm]{compare-#1}
\caption{#2}
\label{#3}
\end{figure}
}
% in mypdf fourth parameter needs []
\newcommand{\mypdf}[4]{%
\begin{figure}[htbp]
\includegraphics#4{#1.pdf}
\caption{#2}
\label{#3}
\end{figure}
}
\newcommand{\twograph}[8]{%
\begin{figure}[htbp]
\subfloat[#2\label{#3}]{\includegraphics[height=6cm,width=7cm]{compare-#1}}%
\subfloat[#5\label{#6}]{\includegraphics[height=6cm,width=7cm]{compare-#4}}
\caption{#7}
\label{#8}
\end{figure}
}
\begin{document}
\maketitle
\section{Introduction}
\subsection{What is this document for?}
This document describes the methods for the meta--analysis of $p$--values
(significance values) provided in the package \pkg{metap}, comments on the
performance of the various algorithms under a small number of different
scenarios, and offers hints on the choice of method.
\subsection{Notation}
The $k$ studies give rise to $p$--values, $p_i,\;i = 1, \dots, k$.
These are assumed to be independent.
We shall also need the ordered $p$--values
$p_{[1]} \le p_{[2]} \le \dots \le p_{[k]}$
and weights $w_i,\;i = 1, \dots, k$.
Logarithms are natural.
A function for combining $p$--values is denoted $g$.
The size of the test is $\alpha$.
We may also need degrees of freedom $\nu_i,\;i = 1, \dots, k$.
The methods are referred to by the name of the function in \func{metap}.
Table \ref{funcs} shows other descriptions of each method.
\begin{table}[htbp]
\begin{tabular}{lll}
Function name & \multicolumn{2}{c}{Description(s)} \\[1ex]
 & \multicolumn{1}{c}{Eponym} & \multicolumn{1}{c}{Other} \\
\func{invchisq} & Lancaster's method & Inverse chi square \\
\func{invt} & & Inverse t \\
\func{logitp} & & Logistic\\
\func{meanp} \\
\func{meanz} \\
\func{maximump} \\
\func{minimump} & Tippett's method \\
\func{sumlog} & Fisher's method & Chi square (2 df)\\
\func{sump} & Edgington's method & Uniform\\
\func{sumz} & Stouffer's method & Normal\\
\func{truncated} & & Truncated Fisher\\
\func{truncated} & & Rank--truncated Fisher\\
\func{votep} \\
\func{wilkinsonp} & Wilkinson's method \\
\end{tabular}
\caption{Methods considered in this document}
\label{funcs}
\end{table}
\section{Theoretical results}
There have been various attempts to clarify the problem and to discuss
optimality of the methods.
A detailed account was provided by \citet{liptak58}.
\citet{birnbaum54} considered the property of admissibility.
A method is admissible if, whenever it rejects $H_0$ for a set of $p_i$,
it also rejects $H_0$ for any $p^*_i$ with $p^*_i \le p_i$ for all $i$.
He considered that Fisher's and Tippett's methods were admissible.
See also \citet{owen09}.
Birnbaum also points out that the problem is poorly specified.
This may account for the number of methods available and their differing
behaviour.
The null hypothesis $H_0$ is well defined: all $p_i$ have a uniform
distribution on the unit interval.
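As a quick illustration of this, a small and purely illustrative
simulation (the null data and the choice of $t$--test are arbitrary) shows
$p$--values behaving uniformly under $H_0$:
<<>>=
set.seed(1)
## p-values of t-tests applied to null (zero-mean) data; these are
## uniform under H0 so the quartiles should be near 0.25, 0.5, 0.75
pnull <- replicate(1000, t.test(rnorm(20))$p.value)
round(quantile(pnull, c(0.25, 0.5, 0.75)), 2)
@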
There are two classes of alternative hypothesis:
\begin{itemize}
\item $H_A$: all $p_i$ have the same (unknown) non--uniform,
non--increasing density,
\item $H_B$: at least one $p_i$ has an (unknown) non--uniform,
non--increasing density.
\end{itemize}
If all the tests being combined are essentially replicates of one another
then $H_A$ is appropriate, whereas if they are different kinds of test, or
were carried out under different conditions, then $H_B$ is appropriate.
Note that Birnbaum specifically considers the possibility that the tests
being combined may be very different, for instance some tests of means,
some of variances, and so on.
\section{The methods}
\subsection{Comparison scenarios}
To provide a standard of comparison we shall use the following two
situations.
Some authors have also used the case of exactly two $p_i$.
\begin{description}
\item[What if all $p_i = p$?]\label{twopisection}
Perhaps surprisingly there are substantial differences here as we shall
see when we look at each method.
We shall describe how the returned value varies with $p$ and $k$.
\item[Cancellation]
When the collection of primary studies contains a number of values
significant in both directions the methods can give very different
results.
If the intention of the synthesis is to examine a directional hypothesis
one would want a method where these cancelled out.
The decision between methods should of course be made on theoretical
grounds.
We shall use the following four values as our example.
\end{description}
{\codefont
<<>>=
cancel <- c(0.001, 0.001, 0.999, 0.999)
@
}
<<echo=FALSE>>=
library(metap)
data(dat.metap)
validity <- dat.metap$validity$p
## evaluate fun for every combination of p and k and
## return the results as a long data frame for plotting
genvec <- function(pvals, kvals, fun, name) {
   ps <- length(pvals)
   ks <- length(kvals)
   temp <- matrix(-1, nrow = ps, ncol = ks)
   for(i in 1:ps)
      for(j in 1:ks) {
         temp[i, j] <- fun(rep(pvals[i], kvals[j]))$p
      }
   temp2 <- as.vector(temp)
   res <- data.frame(method = rep(name, length(temp2)),
      p = rep(pvals, ks),
      k = rep(kvals, each = ps),
      g = temp2
   )
   res
}
@
\subsection{Methods using transformation of the $p$--values}
One class of methods relies on transforming the $p$--values first.
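As a small illustration of the idea (a simulation sketch, not part of the
package), the transformation underlying Fisher's method, described below,
turns a uniform $p$--value into a $\chi^2_2$ variate:
<<>>=
set.seed(2)
## -2 log p of a uniform p-value is chi-squared on 2 df, so the
## sample mean and variance should be close to 2 and 4 respectively
u <- runif(10000)
c(mean(-2 * log(u)), var(-2 * log(u)))
@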
\begin{table}[htbp]
\begin{tabular}{lll}
Function name & Definition & Critical value \\[1ex]
\func{invchisq} & $\sum_{i=1}^k \chi^2_{\nu_i}(p_i)$ &
   $\chi^2_{\sum{\nu_i}}(\alpha)$ \\[1ex]
\func{invt} & $\frac{\sum_{i=1}^k t_{\nu_i}(p_i)}%
{\sqrt{\sum_{i=1}^k \frac{\nu_i}{\nu_i - 2}}}$ & $z(\alpha)$ \\[1ex]
\func{logitp} & $\frac{\sum_{i=1}^k \log\frac{p_i}{1 - p_i}}{C}$ &
   $t_{5k+4}(\alpha)$ \\
 & where $C = \sqrt\frac{k \pi^2 (5 k + 2)}{3(5 k + 4)}$ & \\[1ex]
\func{meanz} & $\frac{\bar{z}}{s_{\bar{z}}}$ & $t_{k-1}(\alpha)$ \\
 & where $\bar{z} = \sum_{i=1}^k \frac{z(p_i)}{k}$ \\
 & and $s_{\bar{z}} = \frac{s_z}{\sqrt{k}}$ & \\[1ex]
\func{sumlog} & $\sum_{i=1}^{k} - 2 \log p_i$ & $\chi^2_{2k}(\alpha)$ \\[1ex]
\func{sumz} & $\frac{\sum_{i=1}^k z(p_i)}{\sqrt{k}}$ & $z(\alpha)$\\
\end{tabular}
\caption{Definitions of methods using transformation of the $p$ values}
\label{transdefs}
\end{table}
<<echo=FALSE>>=
kvals <- c(4, 5, 6, 8, 10, 15, 20)
pvals <- c(0.2, 0.3, 0.3679, 0.4, 0.5, 0.6)
dat <- rbind(
   genvec(pvals, kvals, logitp, "logitp"),
   genvec(pvals, kvals, meanz, "meanz"),
   genvec(pvals, kvals, sumlog, "sumlog"),
   genvec(pvals, kvals, sumz, "sumz")
)
@
<<transeqp, fig=TRUE, echo=FALSE>>=
print(lattice::xyplot(g ~ k | method, groups = p, type = "l", data = dat,
   auto.key = list(space = "left", lines = TRUE, title = "p"),
   ylab = "g(p)"
))
@
\subsubsection{The method of summation of logs, Fisher's method}
See Table \ref{transdefs} for the definition.
This works because $- 2 \log p_i$ is a $\chi^2_2$ and the sum of
independent $\chi^2$ variates is itself a $\chi^2$ with degrees of freedom
equal to the sum of the degrees of freedom of the individual $\chi^2$.
Of course the sum of the logs of the $p_i$ is also the log of the product
of the $p_i$.
Fisher's method \citep{fisher25} is provided in \func{sumlog}.
<<echo=FALSE>>=
set.seed(18122019)
temp <- matrix(runif(10000), nrow = 100)
fisher <- apply(temp, 1, function(x) sumlog(x)$p)
lanc4 <- apply(temp, 1, function(x) invchisq(x, 4)$p)
lanc16 <- apply(temp, 1, function(x) invchisq(x, 16)$p)
lanc256 <- apply(temp, 1, function(x) invchisq(x, 256)$p)
## sum and difference of two sets of combined p-values
## for Bland--Altman style plots
banda <- function(x, y) {
   res <- data.frame(sum = x + y, diff = (x - y))
   res
}
dat <- data.frame(rbind(banda(fisher, lanc4),
   banda(fisher, lanc16),
   banda(fisher, lanc256),
   banda(lanc4, lanc16),
   banda(lanc4, lanc256),
   banda(lanc16, lanc256)
   ),
   name = factor(c(rep("FL4", 100), rep("FL16", 100), rep("FL256", 100),
      rep("L4L16", 100), rep("L4L256", 100), rep("L16L256", 100)),
      levels = c("FL4", "FL16", "FL256", "L4L16", "L4L256", "L16L256")
   )
)
@
<<fishlanc, fig=TRUE, echo=FALSE>>=
print(lattice::xyplot(diff ~ sum | name, data = dat,
   panel = function(x, y, ...) {
      lattice::panel.xyplot(x, y, ...)
      lattice::panel.abline(h = mean(y), lty = 2)
      lattice::panel.abline(h = mean(y) + 1.96 * sd(y), lty = 3)
      lattice::panel.abline(h = mean(y) - 1.96 * sd(y), lty = 3)
   }
))
@
<<echo=FALSE>>=
stouff <- apply(temp, 1, function(x) sumz(x)$p)
invt4 <- apply(temp, 1, function(x) invt(x, 4)$p)
invt16 <- apply(temp, 1, function(x) invt(x, 16)$p)
invt256 <- apply(temp, 1, function(x) invt(x, 256)$p)
banda <- function(x, y) {
   res <- data.frame(sum = x + y, diff = (x - y))
   res
}
dat <- data.frame(rbind(banda(stouff, invt4),
   banda(stouff, invt16),
   banda(stouff, invt256),
   banda(invt4, invt16),
   banda(invt4, invt256),
   banda(invt16, invt256)
   ),
   name = factor(c(rep("St4", 100), rep("St16", 100), rep("St256", 100),
      rep("t4t16", 100), rep("t4t256", 100), rep("t16t256", 100)),
      levels = c("St4", "St16", "St256", "t4t16", "t4t256", "t16t256")
   )
)
@
<<stouffinvt, fig=TRUE, echo=FALSE>>=
print(lattice::xyplot(diff ~ sum | name, data = dat,
   panel = function(x, y, ...) {
      lattice::panel.xyplot(x, y, ...)
      lattice::panel.abline(h = mean(y), lty = 2)
      lattice::panel.abline(h = mean(y) + 1.96 * sd(y), lty = 3)
      lattice::panel.abline(h = mean(y) - 1.96 * sd(y), lty = 3)
   }
))
@
As can be seen in Figure \ref{equalp}, when all the $p_i=p$ \func{sumlog}
returns a value which decreases with $k$ when $p<0.32$, increases with $k$
when $p>0.37$, and in between first increases with $k$ and then decreases.
Some detailed algebra provided in a post to
\url{https://stats.stackexchange.com/questions/243003} by Christoph Hanck
suggests that the breakpoint is $e^{-1} = 0.3679$.
Where the $p_i$ are less than this the result will be significant for a
sufficiently large $k$ (several hundred); where they are above it, it will
not.
Over the range of $k$ we are plotting this bound is not yet closely
approached.
\mygraph{transeqp}{Behaviour of the methods using transformed $p$ values
as $k$ varies when all $p_i=p$}{equalp}
\subsubsection{Inverse $\chi^2$, Lancaster's method}
It would of course be possible to generalise Fisher's method by
transforming to $\chi^2$ with some number of degrees of freedom other than
2.
\citet{lancaster61} suggests that this is highly correlated with
\func{sumlog}.
Lancaster's method is provided in \func{invchisq}.
In fact the resemblance to \func{sumlog} becomes less as the number of
degrees of freedom increases.
Figure \ref{fishlanc} shows, for a small number of selected degrees of
freedom, how it compares to Fisher's method.
\twograph{fishlanc}{Fisher's method and Lancaster's method}{fishlanc}%
{stouffinvt}{Stouffer's method and inverse $t$}{stouffinvt}%
{Sum and difference plots of Fisher v Lancaster and Stouffer v inverse $t$}%
{compfishstouff}
\subsubsection{The method of summation of $z$ values, Stouffer's method}
The method of summation of $z$ values is provided in \func{sumz}
\citep{stouffer49}.
See Table \ref{transdefs} for the definition.
As can be seen in Figure \ref{equalp} it returns a value for our $p_i=p$
example which decreases with $k$ when $p$ is below 0.5 and increases with
$k$ when it is above.
A weighted version of Stouffer's method is available,
%\begin{equation}
$\frac{\sum_{i=1}^k w_i z(p_i)}{\sqrt {\sum_{i=1}^k w_i ^ 2}}$,
%\end{equation}
where the $w_i$ are the weights.
In the absence of effect sizes (in which case a method using effect sizes
would be more appropriate anyway) best results are believed to be obtained
with weights proportional to the square root of the sample sizes
\citep{zaykin11} following \citet{liptak58}.
\subsubsection{Mean of normals method}
There is also a method closely related to Stouffer's using the mean of
normals, provided in \func{meanz} and also defined in
Table \ref{transdefs}, which has very similar properties except that when
all the $p_i$ are equal it returns either 0 or 1, as can be seen in
Figure \ref{equalp}.
<<>>=
meanz(c(0.3, 0.31))$p
meanz(c(0.1, 0.2))$p
@
The method of \func{meanz} also has the unusual property that a set of
$p$--values which are all less than those in another set can still give
rise to a larger overall $p$, as the example above shows; it is therefore
not admissible in Birnbaum's sense.
This is the only method considered here which has this property so if it
is a desirable one then that is the only method to consider.
\subsubsection{The inverse $t$ method}
A closely related method is the inverse $t$ method.
See Table \ref{transdefs} for the definition.
This method is provided in \func{invt}.
As is clear from the definition this method tends to Stouffer's method as
$\nu_i \to \infty$.
Figure \ref{stouffinvt} shows this for selected degrees of freedom.
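The convergence can also be seen numerically; the three $p$--values here
are purely illustrative:
<<>>=
p <- c(0.02, 0.15, 0.40)
## with large degrees of freedom invt is almost indistinguishable from sumz
invt(p, 1000)$p
sumz(p)$p
@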
\subsubsection{The method of summation of logits}
See Table \ref{transdefs} for the definition.
This method is provided in \func{logitp}.
The constant $C$ was arrived at by equating skewness and kurtosis with
those of the $t$--distribution \citep{loughin04}.
As can be seen in Figure \ref{equalp} this method returns a value for our
$p_i=p$ example which decreases with $k$ when $p$ is below 0.5 and
increases with $k$ when it is above.
\subsubsection{Examples for methods using transformations of the $p$ values}
<<echo=FALSE>>=
## express a p-value as -log10(p) rounded to two places for tabulation
log10p <- function(x) {
   res <- round(-log(x, base = 10), 2)
   res
}
@
\begin{table}[htbp]
\begin{tabular}{lll}
Function name & \multicolumn{1}{c}{validity} & \multicolumn{1}{c}{cancel} \\
 & \multicolumn{1}{c}{value expressed} \\
 & \multicolumn{1}{c}{as $-\log_{10}p$} \\[1ex]
\func{logitp} & \Sexpr{log10p(logitp(validity)$p)} & \Sexpr{logitp(cancel)$p} \\
\func{meanz} & \Sexpr{log10p(meanz(validity)$p)} & \Sexpr{meanz(cancel)$p} \\
\func{sumlog} & \Sexpr{log10p(sumlog(validity)$p)} & \Sexpr{round(sumlog(cancel)$p, 5)} \\
\func{sumz} & \Sexpr{log10p(sumz(validity)$p)} & \Sexpr{sumz(cancel)$p}\\
\end{tabular}
\caption{Examples of methods using transformation of the $p$ values}
\label{transexamples}
\end{table}
Using the same example dataset which we have already plotted and our
cancellation dataset we have the values in Table \ref{transexamples}.
As can be seen all the methods cancel except for \func{sumlog}.
The agreement for the validity dataset is close except for \func{meanz}
which gives a value several orders of magnitude greater than the other
three.
Lancaster's method and inverse $t$ are not shown as each is an infinite
family of possible methods and in any event they are similar to Fisher's
method and Stouffer's method respectively.
\subsection{Methods using untransformed $p$--values}
\begin{table}[htbp]
\begin{tabular}{lll}
Function name & Definition & Critical value \\[1ex]
\func{meanp} & $\bar p = \frac{\sum_{i=1}^k p_i}{k}$ \\
 & $z = (0.5 - \bar{p}) \sqrt{12k}$ & $z(\alpha)$ \\
\func{minimump} & $p_{[1]}$ & $1 - (1 - \alpha)^{\frac{1}{k}}$ \\
\func{maximump} & $p_{[k]}$ & $\alpha^{\frac{1}{k}}$ \\
\func{wilkinsonp} & $p_{[r]}$ & $c$ where
   $\sum_{s=r}^k {k \choose s} c^s (1 - c)^{k-s} = \alpha$\\[1ex]
\func{sump} & $\frac{(S)^k}{k!}%
 - {k \choose 1}\frac{(S - 1)^k}{k!}%
 + {k \choose 2}\frac{(S - 2)^k}{k!} - \dots$ & $\alpha$ \\
 & where $S = \sum_{i=1}^k p_i$ \\
\end{tabular}
\caption{Definitions of methods not using transformation of the $p$ values;
the series for \func{sump} continues until the quantity $(S-i)$ in the
numerator becomes negative}
\label{untransdefs}
\end{table}
<<echo=FALSE>>=
kvals <- c(4, 5, 6, 8, 10, 15, 20)
pvals <- c(0.2, 0.3, 0.3679, 0.4, 0.5, 0.6)
dat <- rbind(
   genvec(pvals, kvals, meanp, "meanp"),
   genvec(pvals, kvals, maximump, "maximump"),
   genvec(pvals, kvals, minimump, "minimump"),
   genvec(pvals, kvals, sump, "sump"),
   genvec(pvals, kvals, votep, "votep")
)
@
<<untranseqp, fig=TRUE, echo=FALSE>>=
print(lattice::xyplot(g ~ k | method, groups = p, type = "l", data = dat,
   auto.key = list(space = "left", lines = TRUE, title = "p"),
   ylab = "g(p)"
))
@
\mygraph{untranseqp}{Behaviour of the methods using untransformed $p$
values as $k$ varies when all $p_i=p$}{unequalp}
\subsubsection{The method of minimum $p$, maximum $p$, and Wilkinson's method}
The methods of minimum $p$ \citep{tippett31}, maximum $p$ and Wilkinson
\citep{wilkinson51} are defined in Table \ref{untransdefs}.
Wilkinson's method depends on which value (the $r$th) of the ordered
$p_{[i]}$ is selected.
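Since under $H_0$ the $r$th smallest of $k$ uniform $p$--values follows a
Beta$(r, k + 1 - r)$ distribution, the value returned by \func{wilkinsonp}
(described below) can be checked directly with \func{pbeta}; the
$p$--values here are purely illustrative:
<<>>=
p <- c(0.01, 0.04, 0.20, 0.50, 0.70)
## combined p for the 2nd smallest p-value and its beta equivalent
wilkinsonp(p, r = 2)$p
pbeta(sort(p)[2], 2, length(p) + 1 - 2)
@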
% p is pbeta(p[r], r, k+1-r)
% critical p is qbeta(alpha, r, k+1-r)
Wilkinson's method is provided in \func{wilkinsonp} and a convenience
function \func{minimump} with its own \code{print} method is provided for
the minimum $p$ method ($r=1$).
It is also possible to use the method for the maximum $p$ (that is $r=k$)
and a convenience function \func{maximump} is provided for that purpose.
As can be seen in Figure \ref{unequalp}, for our $p_i=p$ example
\func{minimump} returns a value which always increases with $k$ whereas
\func{maximump} returns one which always decreases with $k$.
\subsubsection{The method of summation of $p$--values, Edgington's
method\label{sump}}
Defined in Table \ref{untransdefs} \citep{edgington72a}.
This method is provided in \func{sump}.
As can be seen in Figure \ref{unequalp} this method returns a value for
our $p_i=p$ example which decreases with $k$ when $p$ is below 0.5 and
increases with $k$ when it is above.
Some authors use a simpler version, $\frac{(\sum p)^k}{k!}$, for instance
\citet{rosenthal78} in the text, although compare his Table 4.
This can be very conservative when $\sum p > 1$.
There seems no particular need to use this method but it is returned by
\func{sump} as the value of \code{conservativep} for use in checking
published values.
Note also that there can be numerical problems for extreme values of $S$
and in that case recourse might be made to \func{meanp} which has similar
properties.
\subsubsection{The mean $p$ method}
Defined in Table \ref{untransdefs}.
Although this method is attributed to Edgington \citep{edgington72b},
when the phrase Edgington's method is used it refers to the method of
summation of $p$--values described above in Section \ref{sump}.
As can be seen in Figure \ref{unequalp} this method returns a value for
our $p_i=p$ example which decreases with $k$ when $p$ is below 0.5 and
increases with $k$ when it is above.
Not surprisingly this method gives very similar results to Edgington's
other method implemented in \func{sump}, and since it does not have the
numerical problems of that method it might perhaps be preferred.
\subsubsection{Examples for methods using untransformed $p$--values}
Using the same example dataset which we have already plotted and our
cancellation dataset we have the values in Table \ref{untransexamples}.
As can be seen \func{meanp} and \func{sump} cancel but the other two do
not.
Agreement here is not so good, especially for the maximum $p$ method.
Wilkinson's method is not shown as it depends on the value of $r$.
\begin{table}[htbp]
\begin{tabular}{lrl}
Function name & \multicolumn{1}{c}{validity} & \multicolumn{1}{c}{cancel} \\
 & \multicolumn{1}{c}{value expressed} \\
 & \multicolumn{1}{c}{as $-\log_{10}p$} \\[1ex]
\func{minimump} & \Sexpr{log10p(minimump(validity)$p)} & \Sexpr{round(minimump(cancel)$p, 5)} \\
\func{maximump} & \Sexpr{log10p(maximump(validity)$p)} & \Sexpr{round(maximump(cancel)$p, 5)} \\
\func{meanp} & \Sexpr{log10p(meanp(validity)$p)} & \Sexpr{meanp(cancel)$p}\\
\func{sump} & \Sexpr{log10p(sump(validity)$p)} & \Sexpr{sump(cancel)$p} \\
\end{tabular}
\caption{Examples for methods using the untransformed $p$ values}
\label{untransexamples}
\end{table}
\subsection{Other methods}
\subsubsection{The method of vote--counting}
A simple way of looking at the problem is vote counting.
Strictly speaking this is not a method which combines $p$--values in the
same sense as the other methods.
If most of the studies have produced results in favour of the alternative
hypothesis, irrespective of whether any of them is individually
significant, then that might be regarded as evidence for that alternative.
The numbers for and against may be compared with what would be expected
under the null using the binomial distribution.
A variation on this would allow for a neutral zone of studies which are
considered neither for nor against.
For instance one might only count studies which have reached some
conventional level of statistical significance in the two different
directions.
This method returns a value for our $p_i=p$ example which is 1 for values
of $p$ above 0.5 and which otherwise does not vary with $p$ but decreases
with $k$.
This method does cancel significant values in both directions.
\begin{table}[htbp]
\begin{tabular}{lll}
Function name & validity & cancel \\[1ex]
\func{votep} & \Sexpr{round(votep(validity)$p, 6)} & \Sexpr{round(votep(cancel)$p, 5)} \\
\end{tabular}
\caption{Examples for vote counting}
\label{votepexamples}
\end{table}
\subsubsection{Methods not using all $p$--values}
If there is a hypothesis that the signal will be concentrated in only a
few $p$--values then alternative methods are available in \func{truncated}.
This is a wrapper to two packages available on CRAN: \pkg{TFisher} which
provides the truncated Fisher method \citep{zaykin07,zhang18} and
\pkg{mutoss} which provides the rank--truncated Fisher method
\citep{dudbridge03}.
Note that Table \ref{truncatedexamples} only shows results for the
validity data--set: since the methods explicitly consider results in one
direction only, the cancellation issue does not arise.
\begin{table}[htbp]
\begin{tabular}{lll}
Function name & truncated at $p$ = 0.5 & truncated at rank = 5 \\[1ex]
\func{truncated} & \Sexpr{log10p(truncated(validity, ptrunc = 0.5)$p)} & \Sexpr{log10p(truncated(validity, rtrunc = 5)$p)} \\
\end{tabular}
\caption{Examples for truncated using the validity data--set expressed as
$-\log_{10}p$}
\label{truncatedexamples}
\end{table}
These methods are appropriate for the situation where it is known that
many of the $p$--values are noise and there will only be a few signals.
\section{Loughin's recommendations}
In his simulation study \citet{loughin04} carried out extensive
comparisons.
Note that he did not consider all the methods implemented here.
These omissions are not too important for our purposes.
The methods implemented here as \func{invchisq}, \func{invt}, \func{meanp}
and \func{meanz} are all very similar to ones which he did study.
The truncation methods appeared at about the same time as his work but in
any case are fundamentally different.
Vote counting is arguably not a method of the same sort.
As Loughin points out the first thing to consider is whether large
$p$--values should cancel small ones.
If this is not desired then the only methods to consider are those in
\func{sumlog} (Fisher), \func{minimump} (Tippett) and \func{maximump}.
\mypdf{struct}{Loughin's recommendations based on structure}{struct}%
{[height=6cm,width=8cm]}
He bases his recommendations on criteria of structure and the arrangement
of evidence against $H_0$.
Figure \ref{struct} shows a summary of his recommendations about the
structure of the evidence.
\mypdf{strength}{Loughin's recommendations based on where the strength of
the evidence is located}{strength}{[height=8cm,width=12cm]}
Figure \ref{strength} summarises his recommendations about the arrangement
of evidence.
Overall he considered the choice to lie between Stouffer's method,
Fisher's method and the logistic method implemented in \func{logitp}.
As has already been mentioned Fisher's method does not cancel whereas the
other two do, so if the evidence in a small number of $p$--values is not
to be overwhelmed by the others then Fisher is the best choice.
However where the evidence is more evenly spread Stouffer's method may be
preferred.
The logistic method represents a compromise between them and is perhaps
best suited where the pattern of evidence is not clear in advance.
The other methods are not universally ruled out and may be helpful in the
specific circumstances outlined in his summaries.
\bibliography{metap}
\bibliographystyle{plainnat}
\end{document}