Using HTCondor for bootstrap analysis

Frederick Boehm

2020-12-01

Overview

Performing a bootstrap analysis with qtl2pleio requires first sampling a (bivariate) phenotype from the inferred bivariate normal distribution, then a two-dimensional, two-QTL QTL scan over the defined scan region. Often, one wants to include more than 100 markers in a scan region. Each two-dimensional scan can take tens of minutes with a single core. Thus, performing a bootstrap analysis with, say, 1000 bootstrap samples is prohibitively time-consuming on most computers. For this reason, we used a high-throughput computing cluster at the University of Wisconsin-Madison (where qtl2pleio was developed). The UW-Madison Center for High-Throughput Computing (CHTC) is available to all UW-Madison researchers. While we recognize that qtl2pleio users may not have access to the CHTC, we anticipate that some will have access to computing clusters. Hopefully this vignette can be adapted to your needs.

Directory structure, files & setup for using Condor

We placed on github a repository that contains the results from our bootstrap analysis of the Recla data.

Here is the repository’s url: https://github.com/fboehm/qtl2pleio-manuscript-chtc/

Within this repo are three subdirectories. We’ll examine in detail the “Recla-bootstrap” subdirectory.

Within this directory, we created four subdirectories:

  1. Rscript
  2. data
  3. shell_scripts
  4. submit_files

The Github repository also contains a fifth subdirectory, squid and a sixth subdirectory results, that are included for completeness. In practice, files in squid were in a distinct directory. We describe it below.

Files in Rscript

The Rscript subdirectory contains a file, boot-Recla-10-22.R that contains R code for our bootstrap analysis.

fn <- file.path("https://raw.githubusercontent.com", 
                "fboehm/qtl2pleio-manuscript/master/chtc", 
                "Recla-bootstrap/Rscript/boot-Recla-10-22.R"
                )
foo <- readLines(fn)

Here we display the contents of our boot-Recla-10-22.R file.

##
pleio_peak_index <- 788
##



library(qtl2pleio)
library(stringr)
library(dplyr)
library(readr)
args <- R.utils::commandArgs(trailingOnly = TRUE, asValues = TRUE)
# translate command line args
print(args)
print(args$argname)
(proc_num <- as.numeric(args$argname))
(run_num <- as.numeric(args$run_num))
(nboot_per_job <- args$nboot_per_job) # think 1 boot per job
(nsnp <- as.integer(args$nsnp))
(s1 <- as.integer(args$s1))
###############
library(qtl2)
recla <- readRDS("data/recla.rds")
# make sex a covariate for use in pvl_scan
recla[[6]][ , 1, drop = FALSE] -> sex
sex <- sex == "female"
# insert pseudomarkers
insert_pseudomarkers(recla, step = 0.1) -> pseudomap

## ------------------------------------------------------------------------
## ------------------------------------------------------------------------
pp <- readRDS("data/recla-aprobs-chr8.rds")

## ------------------------------------------------------------------------
kinship <- readRDS("data/recla-kinship.rds")

## ------------------------------------------------------------------------
recla$pheno -> ph
log(ph) -> lph
apply(FUN = broman::winsorize, X = lph, MARGIN = 2) -> wlph

## ------------------------------------------------------------------------
phe <- wlph[, c(10, 22)]
gm <- pseudomap$`8`
k <- kinship[[8]]
## ------------------------------------------------------------------------
library(qtl2pleio)


###############
# simulate a phenotype
X1 <- pp[ , , pleio_peak_index]
cbind(X1, unlist(sex)) -> Xpre
## remove subjects with missing values of phenotype
is.na(phe[ , 1]) | is.na(phe[ , 2]) -> missing_indic
phe_nona <- phe[!missing_indic, ]
Xpre_nona <- Xpre[!missing_indic, ]
k_nona <- k[!missing_indic, !missing_indic]
##
gemma2::stagger_mats(Xpre_nona, Xpre_nona) -> X
set.seed(proc_num)
calc_covs(pheno = phe_nona, kinship = k_nona) -> cc_out
(cc_out$Vg -> Vg)
(cc_out$Ve -> Ve)
# calculate Sigma
calc_Sigma(Vg = Vg, Ve = Ve, K = k_nona) -> Sigma
solve(Sigma) -> Sigma_inv
# calc Bhat 
B <- calc_Bhat(X = X, Sigma_inv = Sigma_inv, Y = phe_nona)
# Start loop
lrt <- numeric()
for (i in 1:nboot_per_job){
  sim1(X = X, B = B, Vg = Vg, Ve = Ve, kinship = k_nona) -> foo
  matrix(foo, ncol = 2, byrow = FALSE) -> Ysim
  rownames(Ysim) <- rownames(phe_nona)
  colnames(Ysim) <- c("t1", "t2")
  scan_pvl(probs = pp[!missing_indic, , ], pheno = Ysim, covariates = sex[!missing_indic, , drop = FALSE], kinship = k_nona, start_snp1 = s1, n_snp = nsnp) -> loglik
# in above call, s1 & nsnp come from command line args
  calc_lrt_tib(loglik) -> lrt[i]
}

fn_out <- paste0("recla-boot-run", run_num, "_", proc_num, ".txt")
write.table(lrt, fn_out)
q("no")

Files in data

The data subdirectory contains three rds files, one for each of the three input components.

  1. founder allele dosages for Chr 8
  2. phenotypes
  3. kinship matrix (derived via LOCO method)

Files in shell_scripts

We examine only the file that is needed for the Recla analysis, boot-Recla-10-22.sh.

Here is the contents of the file.

#!/bin/bash

# untar your R installation
tar -xzf R13.tar.gz
tar -xzf SLIBS.tar.gz
# make sure the script will use your R installation
export PATH=$(pwd)/R/bin:$PATH
export LD_LIBRARY_PATH=$(pwd)/SS:$LD_LIBRARY_PATH
# run R, with the name of your  R script
R CMD BATCH '--args argname='$1' nsnp='$2' s1='$3' nboot_per_job='$4' run_num='$5'' Rscript/boot-Recla-10-22.R 'boot400_run'$5'-'$1'.Rout'

We need to unzip the R installation, which we’ve named R13.tar.gz, and SLIBS file on the remote computer where the actual computing will occur. Following that, and needed adjustments to the paths, we execute R. Note that we have, in this file, specified the variable values by numbers, like $1 for argname. The values assigned to each of these is specified in the “submit” file.

Files in submit_files

The HTCondor “submit” file provides instructions for controlling the interaction with the high-throughput resources. Below is the text in the submit file boot-Recla-10-22.sub

# hello-chtc.sub
# My very first HTCondor submit file
s1=650
# start scan at this SNP index value
nsnp=350
# length of scan in number of SNPs
# which run in the experimental design?
run_num=561
#
nboot_per_job=1
#  for almost all jobs), the desired name of the HTCondor log file,
#  and the desired name of the standard error file.  
#  Wherever you see $(Cluster), HTCondor will insert the queue number
#  assigned to this set of jobs at the time of submission.
universe = vanilla
log = boot-$(Process)-run$(run_num).log
error = boot-$(Process)-run$(run_num).err
#
# Specify your executable (single binary or a script that runs several
#  commands), arguments, and a files for HTCondor to store standard
#  output (or "screen output").
#  $(Process) will be a integer number for each job, starting with "0"
#  and increasing for the relevant number of jobs.
executable = ../shell_scripts/boot-Recla-10-22.sh
arguments = $(Process) $(nsnp) $(s1) $(nboot_per_job) $(run_num)
output = boot-$(Process)-run$(run_num).out
#
# Specify that HTCondor should transfer files to and from the
#  computer where each job runs. The last of these lines *would* be
when_to_transfer_output = ON_EXIT
transfer_input_files = ../data,../Rscript,../shell_scripts,http://proxy.chtc.wisc.edu/SQUID/fjboehm/R13.tar.gz,http://proxy.chtc.wisc.edu/SQUID/SLIBS.tar.gz
#
# Tell HTCondor what amount of compute resources
#  each job will need on the computer where it runs.
request_cpus = 1
request_memory = 3GB
request_disk = 1GB
#
requirements = (OpSysMajorVer == 6) || (OpSysMajorVer == 7)

# which computer grids to use:
+WantFlocking = true
+WantGlideIn = true

# Tell HTCondor to run instances of our job:
queue 1000

The ordering of the arguments on in the line that starts with arguments = dictates the numbers that are assigned to each argument. These numbers are used in the shell script, above.

Files on SQUID

The UW-Madison CHTC provides for each user disk space on a web proxy. More information about SQUID is available online.

In practice, we placed files containing founder allele dosages (for chromosomes of interest) and our (compressed) R installation on SQUID.

Submitting jobs to Condor (at UW-Madison CHTC)

We followed the usual procedure for submitting R jobs with the CHTC. To submit the file boot-Recla-10-22.sub, we first changed into its directory. We then typed

condor_submit boot-Recla-10-22.sub

By using this command from within the submit_files subdirectory, we ensure that our outputted files are returned to the submit_files subdirectory. We then manually moved the outputted results files to the results subdirectory.

We found that up to ten percent of jobs would initially fail and require re-submission. There are a variety of reasons that may explain the need for re-submission. Because of this need, we include in the git repository the file boot-Recla-10-22-fix.sub, which gives the code that we used for resubmission. The file bad-jobs-run561 gives the ids of the jobs that failed the initial runs.

Session info

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       Ubuntu 20.10                
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  C                           
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2020-12-01                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version    date       lib source                       
##  assertthat    0.2.1      2019-03-21 [3] CRAN (R 4.0.0)               
##  callr         3.5.1      2020-10-13 [3] CRAN (R 4.0.3)               
##  cli           2.2.0      2020-11-20 [3] CRAN (R 4.0.3)               
##  crayon        1.3.4.9000 2020-10-22 [3] Github (r-lib/crayon@6b3f0c6)
##  desc          1.2.0      2018-05-01 [3] CRAN (R 4.0.0)               
##  devtools      2.3.2      2020-09-18 [3] CRAN (R 4.0.3)               
##  digest        0.6.27     2020-10-24 [3] CRAN (R 4.0.3)               
##  ellipsis      0.3.1      2020-05-15 [3] CRAN (R 4.0.0)               
##  evaluate      0.14       2019-05-28 [3] CRAN (R 4.0.0)               
##  fansi         0.4.1      2020-01-08 [3] CRAN (R 4.0.0)               
##  fs            1.5.0      2020-07-31 [3] CRAN (R 4.0.3)               
##  glue          1.4.2      2020-08-27 [3] CRAN (R 4.0.2)               
##  htmltools     0.5.0      2020-06-16 [3] CRAN (R 4.0.0)               
##  knitr         1.30       2020-09-22 [3] CRAN (R 4.0.3)               
##  magrittr      2.0.1      2020-11-17 [3] CRAN (R 4.0.3)               
##  memoise       1.1.0      2017-04-21 [3] CRAN (R 4.0.0)               
##  pkgbuild      1.1.0      2020-07-13 [3] CRAN (R 4.0.2)               
##  pkgload       1.1.0      2020-05-29 [3] CRAN (R 4.0.0)               
##  prettyunits   1.1.1      2020-01-24 [3] CRAN (R 4.0.0)               
##  processx      3.4.4      2020-09-03 [3] CRAN (R 4.0.3)               
##  ps            1.4.0      2020-10-07 [3] CRAN (R 4.0.3)               
##  R6            2.5.0      2020-10-28 [3] CRAN (R 4.0.3)               
##  remotes       2.2.0      2020-07-21 [3] CRAN (R 4.0.3)               
##  rlang         0.4.9      2020-11-26 [3] CRAN (R 4.0.3)               
##  rmarkdown     2.5        2020-10-21 [3] CRAN (R 4.0.3)               
##  rprojroot     2.0.2      2020-11-15 [3] CRAN (R 4.0.3)               
##  sessioninfo   1.1.1      2018-11-05 [3] CRAN (R 4.0.0)               
##  stringi       1.5.3      2020-09-09 [3] CRAN (R 4.0.2)               
##  stringr       1.4.0      2019-02-10 [3] CRAN (R 4.0.0)               
##  testthat      3.0.0      2020-10-31 [3] CRAN (R 4.0.3)               
##  usethis       1.6.3      2020-09-17 [3] CRAN (R 4.0.3)               
##  withr         2.3.0      2020-09-22 [3] CRAN (R 4.0.3)               
##  xfun          0.19       2020-10-30 [3] CRAN (R 4.0.3)               
##  yaml          2.2.1      2020-02-01 [3] CRAN (R 4.0.0)               
## 
## [1] /tmp/RtmpbYyM8o/Rinst8229dd84ff4f
## [2] /tmp/RtmpQ3N82V/temp_libpath803da596e6e5d
## [3] /home/fred/R/x86_64-pc-linux-gnu-library/4.0
## [4] /usr/local/lib/R/site-library
## [5] /usr/lib/R/site-library
## [6] /usr/lib/R/library