[BioC] DMR identification: minfi speed, memory problems
Allegra Petti [guest]
guest at bioconductor.org
Mon Feb 24 19:41:43 CET 2014
Dear All,
I am trying to identify Differentially Methylated Regions (DMRs) using a tab-delimited file of beta values obtained with Illumina's 450k methylation arrays. I am using minfi and its implementation of bumphunter (development version 1.9.11).
However, I run out of memory when I use a mere 100 permutations, and have tried different amounts of memory to no avail. Of further concern is the fact that the bumphunter reference (Jaffe, et al (2011) International J. of Epidemiology: 41:200-209) states that each permutation takes two hours (see p. 205). At that rate, a standard run with 1000 permutations would take over 83 days.
My questions are as follows (code and output included below):
1. Is there a better way to run minfi/bumphunter, or specific resources I should allocate? (I am currently using parallel processing with four cores, and have tried various amounts of memory.)
2. Does minfi/bumphunter include a non-permutation-based method for estimating statistical significance? Jaffe et al state that they implemented a second, faster, p-value calculation following the method of Effron and Tibshirani (see p. 205). Does anyone know if this method is implemented in minfi?
3. Is there a better method for identifying DMRs? (I am also experimenting with methyAnalysis, which has nice visualization options, but (a) is poorly documented and (b) surprisingly found no DMRs in my data set using default settings.)
Many thanks in advance for any insight and/or suggestions!
Sincerely,
Allegra
_________________________________________________________________
Code:
# This R script reads in a tab-delimited matrix of Beta values (with a header line)
# and finds differentially-methylated regions using minfi
# sample command: bsub -oo beta2dmr.140224.out -e beta2dmr.140224.err -q long -M 16000000 -R 'select[type==LINUX64 && mem>16000] span[hosts=1] rusage[mem=16000]' -n 4 -J beta2dmr "~/gsc/R3.0.2/bin/Rscript beta2dmr_140220.r"
library(minfi); # currently requires development version 1.9.11
library(IlluminaHumanMethylation450kanno.ilmn12.hg19); # Note: had to be installed manually
library(foreach); # used for parallel processing
library(doRNG); # is this actually used??
library(doParallel); # "backend" for foreach package
registerDoParallel(cores = 4);
#betas.frame <- read.table(file="testbetas.txt",sep="\t",header=TRUE,na.strings=c(NA),row.names=1,quote="");
betas.frame <- read.table(file="betas_140220.txt",sep="\t",header=TRUE,na.strings=c(NA),row.names=1,quote="");
betas <- as.matrix(betas.frame); # convert data frame to matrix
data.rs <- RatioSet(Beta = betas,annotation=c(array= "IlluminaHumanMethylation450k", annotation = "ilmn12.hg19")); # create RatioSet
data.grs <- mapToGenome(data.rs); # create GenomicRatioSet
sampleNames(data.rs); # display the sample names (ie column headings)
targets <- read.450k.sheet('/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis',recursive=FALSE); # read Sample Sheet
samples <- sampleNames(data.rs); # get sample names (ie column headers) from input file of beta values
j <- match(samples,targets$Sample); # get row indices from sample sheet that correspond to sample names (column headers) in beta value input file
pheno <- targets$Sample_Group[j]; # extract corresponding phenotype (e.g. "tumor" or "normal") from Sample_Group column of Sample Sheet
designMatrix <- model.matrix(~ pheno); # specify design matrix for regression
dmr <- bumphunter(data.grs, design = designMatrix, cutoff = 0.05, B=100); # run bumphunter with B permutations
save(dmr, file="dmr.minfi.100_140222");
__________________________________________________________________
Standard Output:
[1] "TCGA.AB.2940.03A.01D.0742.05" "TCGA.AB.2944.03A.01D.0743.05"
[3] "TCGA.AB.2866.03A.01D.0741.05" "TCGA.AB.2871.03A.01D.0742.05"
[5] "TCGA.AB.2934.03A.01D.0743.05" "TCGA.AB.2807.03A.01D.0741.05"
[7] "TCGA.AB.2809.03A.01D.0741.05" "TCGA.AB.2984.03A.01D.0742.05"
[9] "TCGA.AB.2921.03A.01D.0743.05" "TCGA.AB.2952.03A.01D.0742.05"
[11] "TCGA.AB.2829.03A.01D.0741.05" "TCGA.AB.2901.03A.01D.0742.05"
[13] "TCGA.AB.2827.03A.01D.0741.05" "TCGA.AB.2884.03A.01D.0742.05"
[15] "TCGA.AB.3011.03A.01D.0742.05" "TCGA.AB.2878.03A.01D.0742.05"
[17] "TCGA.AB.2933.03A.01D.0742.05" "TCGA.AB.2843.03A.01D.0741.05"
[19] "TCGA.AB.2814.03A.01D.0741.05" "TCGA.AB.2856.03A.01D.0741.05"
[21] "TCGA.AB.2855.03A.01D.0741.05" "TCGA.AB.2919.03A.01D.0743.05"
[23] "TCGA.AB.2967.03A.01D.0742.05" "TCGA.AB.2817.03A.01D.0742.05"
[25] "TCGA.AB.2887.03A.01D.0742.05" "TCGA.AB.2894.03A.01D.0742.05"
[27] "TCGA.AB.2895.03A.01D.0742.05" "TCGA.AB.2947.03A.01D.0743.05"
[29] "TCGA.AB.2930.03A.01D.0743.05" "TCGA.AB.2896.03A.01D.0742.05"
[31] "TCGA.AB.2983.03A.01D.0742.05" "TCGA.AB.2945.03A.01D.0741.05"
[33] "TCGA.AB.2918.03A.01D.0743.05" "TCGA.AB.2928.03A.01D.0743.05"
[35] "TCGA.AB.2979.03A.01D.0742.05" "TCGA.AB.2969.03A.01D.0741.05"
[37] "TCGA.AB.2891.03A.01D.0742.05" "TCGA.AB.2971.03A.01D.0741.05"
[39] "TCGA.AB.2964.03A.01D.0741.05" "TCGA.AB.2835.03A.01D.0741.05"
[41] "TCGA.AB.3002.03A.01D.0742.05" "TCGA.AB.2853.03A.01D.0741.05"
[43] "TCGA.AB.2836.03A.01D.0741.05" "TCGA.AB.2909.03A.01D.0743.05"
[45] "TCGA.AB.2932.03A.01D.0743.05" "TCGA.AB.2926.03A.01D.0742.05"
[47] "TCGA.AB.2826.03A.01D.0741.05" "TCGA.AB.2911.03A.01D.0741.05"
[49] "TCGA.AB.2920.03A.01D.0742.05" "TCGA.AB.2917.03A.01D.0741.05"
[51] "TCGA.AB.2869.03A.01D.0742.05" "TCGA.AB.2873.03A.01D.0742.05"
[53] "TCGA.AB.2879.03A.01D.0742.05" "TCGA.AB.2831.03A.01D.0741.05"
[55] "TCGA.AB.2816.03A.01D.0742.05" "TCGA.AB.2990.03A.01D.0742.05"
[read.450k.sheet] Found the following CSV files:
[1] "/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv"
[2] "/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv"
------------------------------------------------------------
Sender: LSF System <lsfadmin at blade14-4-11.gsc.wustl.edu>
Subject: Job 5329542: <beta2dmr> Exited
Job <beta2dmr> was submitted from host <linus2091.gsc.wustl.edu> by user <apetti> in cluster <lsfcluster1>.
Job was executed on host(s) <4*blade14-4-11.gsc.wustl.edu>, in queue <long>, as user <apetti> in cluster <lsfcluster1>.
</gscuser/apetti> was used as the home directory.
__________________________________________________________________
Standard Error:
Loading required package: methods
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: âBiocGenericsâ
The following objects are masked from âpackage:parallelâ:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from âpackage:statsâ:
xtabs
The following objects are masked from âpackage:baseâ:
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, as.vector, cbind, colnames, duplicated, eval, evalq,
get, intersect, is.unsorted, lapply, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rep.int,
rownames, sapply, setdiff, sort, table, tapply, union, unique,
unlist
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: lattice
Loading required package: GenomicRanges
Loading required package: IRanges
Loading required package: XVector
Loading required package: Biostrings
Loading required package: bumphunter
Loading required package: foreach
Loading required package: iterators
Loading required package: locfit
locfit 1.5-9.1 2013-03-22
Attaching package: âlocfitâ
The following objects are masked from âpackage:GenomicRangesâ:
left, right
Loading required package: rngtools
Loading required package: pkgmaker
Loading required package: registry
Attaching package: âpkgmakerâ
The following object is masked from âpackage:IRangesâ:
new2
Warning messages:
1: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv", :
Could not infer array name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv
2: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv", :
Could not infer slide name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv
3: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv", :
Could not infer array name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv
4: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv", :
Could not infer slide name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv
[bumphunterEngine] Parallelizing using 4 workers/cores (backend: doParallel, version: 1.0.6).
[bumphunterEngine] Computing coefficients.
[bumphunterEngine] Performing 100 permutations.
[bumphunterEngine] Computing marginal permutation p-values.
[bumphunterEngine] cutoff: 0.05
[bumphunterEngine] Finding regions.
[bumphunterEngine] Found 233469 bumps.
[bumphunterEngine] Computing regions for each permutation.
Warning message:
In regionFinder(x = beta, chr = chr, pos = pos, cluster = cluster, :
NAs found and removed. ind changed.
Execution halted
-- output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=C
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] doParallel_1.0.6
[2] doRNG_1.5.5
[3] rngtools_1.2.3
[4] pkgmaker_0.17.4
[5] registry_0.2
[6] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
[7] minfi_1.9.11
[8] bumphunter_1.2.0
[9] locfit_1.5-9.1
[10] iterators_1.0.6
[11] foreach_1.4.1
[12] Biostrings_2.30.1
[13] GenomicRanges_1.14.4
[14] XVector_0.2.0
[15] IRanges_1.20.6
[16] lattice_0.20-24
[17] Biobase_2.22.0
[18] BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.24.0 DBI_0.2-7 MASS_7.3-29
[4] R.methodsS3_1.6.1 RColorBrewer_1.0-5 RSQLite_0.11.4
[7] XML_3.98-1.1 annotate_1.40.0 beanplot_1.1
[10] codetools_0.2-8 digest_0.6.4 genefilter_1.44.0
[13] grid_3.0.2 illuminaio_0.2.0 itertools_0.1-1
[16] limma_3.18.4 matrixStats_0.8.14 mclust_4.2
[19] multtest_2.18.0 nlme_3.1-113 nor1mix_1.1-4
[22] preprocessCore_1.24.0 reshape_0.8.4 siggenes_1.36.0
[25] splines_3.0.2 stats4_3.0.2 stringr_0.6.2
[28] survival_2.37-7 tools_3.0.2 xtable_1.7-1
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list