library(SPCompute)
When the trait of interest is binary, we assume the following logistic regression model: \[\log\bigg(\frac{\text{P}(Y_i=1|X)}{1-\text{P}(Y_i=1|X)}\bigg) = X\beta,\] where the design matrix contains a column of \(1's\) for the intercept, a column of genotypes \(G\) and a column for non-genetic covariate \(E\) (optional).
The regression parameter vector \(\beta\) contains \(\beta_0\), \(\beta_G\) and \(\beta_E\), respectively represent intercept parameter, genetic effect and covariate effect.
To compute power or sample size, the user will need to specify the following information:
preva
, the prevalence rate of the disease in the
population, defined as \(\text{P}(Y=1)\).betaG
, the true effect size of genetic effect.pG
, the minor allele frequency of the SNP.If there exists non-genetic covariate \(E\) in the model, the user will also need to specify the following parameters:
betaE
, the true effect size of non-genetic covariate
effect.gammaG
, the parameter that specifies the dependency
between \(E\) and \(G\).If the non-genetic covariate \(E\) is binary, the following marginal information on \(E\) should be specified:
pE
, the population prevalence rate of \(E\), defined as \(\text{P}(E=1)\). Otherwise if it is
continuous, the required marginal information should be:
muE
, the population mean of \(E\).
sigmaE
, the population SD of \(E\).
These parameters should be summarized into a list, with appropriate names, such as the following when covariate is binary:
<- list(preva = 0.2, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0) para
To compute power given a sample size n
, the user can use
the function Compute_Size
, after specifying the argument
for:
parameters
, a list of true parameter values defined as
above.n
, the given sample sizecovariate
, the type of covariate, should be “binary”,
“continuous” or “none”.mode
, the genetic mode, should be “additive”,
“dominant” or “recessive”.alpha
, the significance level.method
, the method used to do the computation. Should
be “semi-sim” (faster for large sample) or “expand” (better for smaller
sample).For example:
Compute_Power(parameters = para, n = 2e4, covariate = "binary", mode = "additive", alpha = 0.05, method = "semi-sim")
#> [1] 0.6908458
or:
Compute_Power(parameters = para, n = 2e4, covariate = "binary", mode = "additive", alpha = 0.05, method = "expand")
#> [1] 0.6884145
Similarly, to compute the required sample size to achieve a certain
power, one just needs to change the argument n
to
PowerAim
, which defines the target power, and uses the
function Compute_Size
:
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "additive", alpha = 0.05, method = "semi-sim"))
#> [1] 25978
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "additive", alpha = 0.05, method = "expand"))
#> [1] 26128
or if the genetic mode is dominant:
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "dominant", alpha = 0.05, method = "semi-sim"))
#> [1] 30872
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "dominant", alpha = 0.05, method = "expand"))
#> [1] 30863
If one wants to have more accurate estimate of sample size or power
when using the method semi-sim
, the parameter
B
can be set to larger value, which will take longer
run-time:
round(Compute_Size(parameters = para, PowerAim = 0.8, covariate = "binary", mode = "dominant", alpha = 0.05, method = "semi-sim", B = 5e5))
#> [1] 30817
Unlike the case to compute power, when computing sample size, it is
always recommended to use the method semi-sim
, since the
method expand
will not work when the sample size is
extremely small, and will work at a much slower speed when the sample
size is extremely large.
When the trait of interest is continuous, we assume the following linear regression model: \[Y = X\beta + \epsilon,\] where the noise \(\epsilon\sim N(0,\sigma_\epsilon^2)\).
To compute power or sample size, the procedures will be always the
same as in the binary case, except that the parameter preva
will be replaced by the set of parameters:
TraitMean
, specifying the population mean of the
continuous trait, i.e. \(\mathbb{E}(Y)\).TraitSD
, specifying the population standard deviation
of the continuous trait.The R function Compute_Size
or
Compute_Power
will then compute quantities such as \(\sigma_\epsilon\) using these marginal
information automatically.
Alternatively, the user may also inputs the values of \(\sigma_\epsilon\) directly instead of
inputting the value of TraitSD
, by replacing
TraitSD
with
ResidualSD
, the value of \(\sigma_\epsilon\) in the model.Now the computation can be proceeded by specifying
response = "continuous"
:
<- list(TraitMean = 3, TraitSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8580452
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4270
Or:
<- list(TraitMean = 3, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
Note that for continuous trait, the value of TraitMean
is only used to do parameter conversion, which will not affect the
result of power or sample size computation. So if the purpose is not to
compute the converted parameter such as \(\beta_0\), the value of
TraitMean
can be set to arbitrary numeric value, as shown
in the following example:
<- list(TraitMean = 3, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
<- list(TraitMean = 30000, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.3, pE = 0.3, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
Similarly, when the covariate \(E\)
and the SNP \(G\) are independent
(i.e. gammaG = 0
), power or sample size computation will
not depend on the covariate information at all:
<- list(TraitMean = 0, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 1000, pE = 0.5, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
<- list(TraitMean = 0, ResidualSD = 1, pG = 0.1, betaG = 0.1, betaE = 0.001, pE = 0.1, gammaG = 0)
para Compute_Power(parameters = para, n = 5e3, covariate = "binary", response = "continuous", mode = "additive", alpha = 0.05)
#> [1] 0.8508388
round(Compute_Size(parameters = para, PowerAim = 0.8, response = "continuous", mode = "additive", alpha = 0.05))
#> [1] 4360
When there are multiple covariates (\(E_1\) and \(E_2\)) simultaneously affecting the trait \(Y\), we assume the following (glm) model: \[g_1[\mathbb{E}(Y|E_1,E_2,G)] = \beta_0 + \beta_GG + \beta_{E_1}E_1 + \beta_{E_2}E_2.\]
The dependency between \(E_1, E_2\) and \(G\) are specified through the following (nested) second stage regressions: \[\begin{equation} \begin{aligned} g_2[\mathbb{E}(E_1|G)] &= \gamma_{01} + \gamma_{G_1}G\\ g_3[\mathbb{E}(E_2|G,E_1)] &= \gamma_{02} + \gamma_{G_2}G + \gamma_E E_1.\\ \end{aligned} \end{equation}\]
The covariates and the trait could be either continuous or binary, where the link functions \(g_1,g_2\) and \(g_3\) correspond to linear and logistic regression respectively as described in previous sections.
For example when \(Y\) is binary, \(E_1\) is binary (with \(\beta_{E_1} = 0.3,\ \text{P}(E_1=1)=0.3\)) and \(E_2\) is continuous (with \(\beta_{E_2} = 0.2,\ \mathbb{E}(E_2) = 0, \ \text{Var}(E_2) = 0.4\)), the power can be computed as:
<- list(preva = 0.2, pG = 0.1, betaG = 0.1, betaE = c(0.3, 0.2),
para pE = 0.3, gammaG = c(0.15,0.25), gammaE = 0.1,
muE = 0, sigmaE = sqrt(0.4))
Compute_Power_multi(parameters = para, n = 3000, mode = "additive",
covariate = c("binary", "continuous"), response = "binary")
#> [1] 0.159002
In other words, this correspond to the models: \[\begin{equation} \begin{aligned} \log\bigg(\frac{\text{P}(Y=1|E_1,E_2,G)}{1-\text{P}(Y=1|E_1,E_2,G)}\bigg) &= \beta_0 + 0.1 G + 0.3 E_1 + 0.2 E_2 \\ \log\bigg(\frac{\text{P}(E_1=1|G)}{1-\text{P}(E_1=1|G)}\bigg) &= \gamma_{01} + 0.15 G\\ \mathbb{E}(E_2|G,E_1) &= \gamma_{02} + 0.25 G + 0.1 E_1,\\ \end{aligned} \end{equation}\] where all the remaining parameters \(\bigg(\beta_0, \gamma_{01}, \gamma_{02}, \text{Var}[E_2|G,E_1] \bigg)\) are automatically solved using the marginal information on \(Y\), \(E_1\), \(E_2\) and \(G\).
Note that when gammaE
is unspecified, by the default it
is assumed the two covariates are conditionally independent given \(G\).