The DYNATE package accompanies the paper “Localizing Rare-Variant Association Regions via Multiple Testing Embedded in an Aggregation Tree”. The package is designed to pinpoint the disease-associated rare variant sets with a controlled false discovery rate in case-control study.
library(DYNATE)Here, we show how to apply DYNATE.
We require the input data is a data frame with a long format: each row is a rare variant (SNP) per sample. Specifically, the input data should contains 6 variables with name Sample.Name, Sample.Type, snpID, domainID, X1, X2. Where variables Sample.Name, snpID and domainID indicate the Sample ID, SNP ID, and domain ID, respectively; Variable Sample.Type indicates the case/control status of each sample; Variables X1 and X2 are covariates that could be considered in the analysis. The snp_dat below is a toy simulated data with 6 variables and 210,454 rows. The data contains 2,000 samples (1,000 cases and 1,000 controls). In total 16,281 SNPs reside in 2,000 domains are considered in snp_dat.
str(snp_dat)
#> 'data.frame': 54282 obs. of 6 variables:
#> $ snpID : num 1 1 1 1 1 1 2 2 2 2 ...
#> $ Sample.Type: chr "case" "case" "ctrl" "ctrl" ...
#> $ Sample.Name: int 58 695 1373 1504 1898 1938 110 199 292 326 ...
#> $ domainID : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ X1 : num 0.38242 -0.46531 -0.23738 -0.0889 0.00982 ...
#> $ X2 : int 1 1 1 0 1 1 1 0 1 0 ...First, we set the tuning parameters as follows. Please refer to the paper for detailed tuning parameters selection procedure.
M <- 5 # leaf size
L <- 3 # layer number
alpha <- 0.05 # desired FDRSecond, we use Test_Leaf function to construct leaves and generate leaf P-values for the case-control study.
# Model consider covariates effect:
t1 <- Sys.time()
p_leaf <- Test_Leaf(snp_dat=snp_dat,thresh_val=M)
t2 <- Sys.time()
t2-t1
#> Time difference of 1.350046 secsIn the output data frames p_leaf, each row links to a rare variant (SNPs), and the number of rows equals the number of rare variants (SNPs) we considered (SNPs that link to a leaf with p-value=1 are excluded for maintaining the algorithm stability). The data frame includes 5 variables. In the data frame, variable L1 is leaf ID; variable pvals is the leaf level p values; variable Test indicates the name of the statistical test to generate the leaf level p values (FET or score).
str(p_leaf)
#> Classes 'data.table' and 'data.frame': 4214 obs. of 5 variables:
#> $ L1 : int 1 2 3 3 4 5 5 6 7 8 ...
#> $ pvals : num 0.453 0.887 0.548 0.548 0.895 ...
#> $ snpID : num 1 2 3 4 5 6 7 8 9 10 ...
#> $ domainID: int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Test : chr "FET" "FET" "FET" "FET" ...
#> - attr(*, ".internal.selfref")=<externalptr>Finally, we use the function DYNATE to conduct dynamic and hierarchical testing based on the leaf level p values.
out <- DYNATE(struct_map=p_leaf,L=L,alpha=alpha)In the output data frames out, each row links to a unique SNP that is detected by DYNATE. The variables snpID, L1, and domainID link to the detected SNP ID, leaf ID, and domain ID, respectively; Variable Test links to the name of the statistical test we applied (FET or score); Variable pvals1 links to the leaf level p-values; Variable Layer indicates in which layer the SNP is detected.
str(out)
#> tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
#> $ L1 : int [1:9] 205 1357 1357 1462 2085 2479 3178 546 547
#> $ snpID : num [1:9] 277 1811 1812 1957 2785 ...
#> $ domainID: int [1:9] 20 168 168 181 312 376 493 66 66
#> $ Test : chr [1:9] "FET" "FET" "FET" "FET" ...
#> $ pvals1 : num [1:9] 1.25e-08 4.76e-34 4.76e-34 6.65e-05 2.09e-13 ...
#> $ Layer : int [1:9] 1 1 1 1 1 1 1 2 2