[BioC] DMR identification: minfi speed, memory problems

Mon Feb 24 19:41:43 CET 2014

Dear All,

I am trying to identify Differentially Methylated Regions (DMRs) using a tab-delimited file of beta values obtained with Illumina's 450k methylation arrays. I am using minfi and its implementation of bumphunter (development version 1.9.11).

However, I run out of memory when I use a mere 100 permutations, and have tried different amounts of memory to no avail. Of further concern is the fact that the bumphunter reference (Jaffe, et al (2011) International J. of Epidemiology: 41:200-209) states that each permutation takes two hours (see p. 205). At that rate, a standard run with 1000 permutations would take over 83 days.

My questions are as follows (code and output included below):
1. Is there a better way to run minfi/bumphunter, or specific resources I should allocate? (I am currently using parallel processing with four cores, and have tried various amounts of memory.)
2. Does minfi/bumphunter include a non-permutation-based method for estimating statistical significance? Jaffe et al state that they implemented a second, faster, p-value calculation following the method of Effron and Tibshirani (see p. 205). Does anyone know if this method is implemented in minfi?
3. Is there a better method for identifying DMRs? (I am also experimenting with methyAnalysis, which has nice visualization options, but (a) is poorly documented and (b) surprisingly found no DMRs in my data set using default settings.)

Many thanks in advance for any insight and/or suggestions!

Sincerely,
Allegra
_________________________________________________________________
Code:

# This R script reads in a tab-delimited matrix of Beta values (with a header line)                           
# and finds differentially-methylated regions using minfi                                                     

# sample command: bsub -oo beta2dmr.140224.out -e beta2dmr.140224.err -q long -M 16000000 -R 'select[type==LINUX64 && mem>16000] span[hosts=1] rusage[mem=16000]' -n 4 -J beta2dmr "~/gsc/R3.0.2/bin/Rscript beta2dmr_140220.r"                                                                                                        

library(minfi); # currently requires development version 1.9.11                                               
library(IlluminaHumanMethylation450kanno.ilmn12.hg19); # Note: had to be installed manually                   
library(foreach); # used for parallel processing                                                              
library(doRNG); # is this actually used??                                                                     
library(doParallel); # "backend" for foreach package                                                          
registerDoParallel(cores = 4);

#betas.frame <- read.table(file="testbetas.txt",sep="\t",header=TRUE,na.strings=c(NA),row.names=1,quote="");  
betas.frame <- read.table(file="betas_140220.txt",sep="\t",header=TRUE,na.strings=c(NA),row.names=1,quote="");
betas <- as.matrix(betas.frame); # convert data frame to matrix                                               
data.rs <- RatioSet(Beta = betas,annotation=c(array= "IlluminaHumanMethylation450k",  annotation = "ilmn12.hg19")); # create RatioSet                                                                                      
data.grs <- mapToGenome(data.rs); # create GenomicRatioSet                                                    
sampleNames(data.rs); # display the sample names (ie column headings)                                         
targets <- read.450k.sheet('/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis',recursive=FALSE); # read Sample Sheet                                                                                            
samples <- sampleNames(data.rs); # get sample names (ie column headers) from input file of beta values        
j <- match(samples,targets$Sample); # get row indices from sample sheet that correspond to sample names (column headers) in beta value input file                                                                          
pheno <- targets$Sample_Group[j]; # extract corresponding phenotype (e.g. "tumor" or "normal") from Sample_Group column of Sample Sheet                                                                                    
designMatrix <- model.matrix(~ pheno); # specify design matrix for regression                                 
dmr <- bumphunter(data.grs, design = designMatrix, cutoff = 0.05, B=100); # run bumphunter with B permutations                                                                                                             
save(dmr, file="dmr.minfi.100_140222");

__________________________________________________________________
Standard Output:

 [1] "TCGA.AB.2940.03A.01D.0742.05" "TCGA.AB.2944.03A.01D.0743.05"
 [3] "TCGA.AB.2866.03A.01D.0741.05" "TCGA.AB.2871.03A.01D.0742.05"
 [5] "TCGA.AB.2934.03A.01D.0743.05" "TCGA.AB.2807.03A.01D.0741.05"
 [7] "TCGA.AB.2809.03A.01D.0741.05" "TCGA.AB.2984.03A.01D.0742.05"
 [9] "TCGA.AB.2921.03A.01D.0743.05" "TCGA.AB.2952.03A.01D.0742.05"
[11] "TCGA.AB.2829.03A.01D.0741.05" "TCGA.AB.2901.03A.01D.0742.05"
[13] "TCGA.AB.2827.03A.01D.0741.05" "TCGA.AB.2884.03A.01D.0742.05"
[15] "TCGA.AB.3011.03A.01D.0742.05" "TCGA.AB.2878.03A.01D.0742.05"
[17] "TCGA.AB.2933.03A.01D.0742.05" "TCGA.AB.2843.03A.01D.0741.05"
[19] "TCGA.AB.2814.03A.01D.0741.05" "TCGA.AB.2856.03A.01D.0741.05"
[21] "TCGA.AB.2855.03A.01D.0741.05" "TCGA.AB.2919.03A.01D.0743.05"
[23] "TCGA.AB.2967.03A.01D.0742.05" "TCGA.AB.2817.03A.01D.0742.05"
[25] "TCGA.AB.2887.03A.01D.0742.05" "TCGA.AB.2894.03A.01D.0742.05"
[27] "TCGA.AB.2895.03A.01D.0742.05" "TCGA.AB.2947.03A.01D.0743.05"
[29] "TCGA.AB.2930.03A.01D.0743.05" "TCGA.AB.2896.03A.01D.0742.05"
[31] "TCGA.AB.2983.03A.01D.0742.05" "TCGA.AB.2945.03A.01D.0741.05"
[33] "TCGA.AB.2918.03A.01D.0743.05" "TCGA.AB.2928.03A.01D.0743.05"
[35] "TCGA.AB.2979.03A.01D.0742.05" "TCGA.AB.2969.03A.01D.0741.05"
[37] "TCGA.AB.2891.03A.01D.0742.05" "TCGA.AB.2971.03A.01D.0741.05"
[39] "TCGA.AB.2964.03A.01D.0741.05" "TCGA.AB.2835.03A.01D.0741.05"
[41] "TCGA.AB.3002.03A.01D.0742.05" "TCGA.AB.2853.03A.01D.0741.05"
[43] "TCGA.AB.2836.03A.01D.0741.05" "TCGA.AB.2909.03A.01D.0743.05"
[45] "TCGA.AB.2932.03A.01D.0743.05" "TCGA.AB.2926.03A.01D.0742.05"
[47] "TCGA.AB.2826.03A.01D.0741.05" "TCGA.AB.2911.03A.01D.0741.05"
[49] "TCGA.AB.2920.03A.01D.0742.05" "TCGA.AB.2917.03A.01D.0741.05"
[51] "TCGA.AB.2869.03A.01D.0742.05" "TCGA.AB.2873.03A.01D.0742.05"
[53] "TCGA.AB.2879.03A.01D.0742.05" "TCGA.AB.2831.03A.01D.0741.05"
[55] "TCGA.AB.2816.03A.01D.0742.05" "TCGA.AB.2990.03A.01D.0742.05"
[read.450k.sheet] Found the following CSV files:
[1] "/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv"  
[2] "/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv"

------------------------------------------------------------
Sender: LSF System <lsfadmin at blade14-4-11.gsc.wustl.edu>
Subject: Job 5329542: <beta2dmr> Exited

Job <beta2dmr> was submitted from host <linus2091.gsc.wustl.edu> by user <apetti> in cluster <lsfcluster1>.
Job was executed on host(s) <4*blade14-4-11.gsc.wustl.edu>, in queue <long>, as user <apetti> in cluster <lsfcluster1>.
</gscuser/apetti> was used as the home directory.

__________________________________________________________________
Standard Error:

Loading required package: methods
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: â€˜BiocGenericsâ€™

The following objects are masked from â€˜package:parallelâ€™:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from â€˜package:statsâ€™:

    xtabs

The following objects are masked from â€˜package:baseâ€™:

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, as.vector, cbind, colnames, duplicated, eval, evalq,
    get, intersect, is.unsorted, lapply, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rep.int,
    rownames, sapply, setdiff, sort, table, tapply, union, unique,
    unlist

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: lattice
Loading required package: GenomicRanges
Loading required package: IRanges
Loading required package: XVector
Loading required package: Biostrings
Loading required package: bumphunter
Loading required package: foreach
Loading required package: iterators
Loading required package: locfit
locfit 1.5-9.1   2013-03-22

Attaching package: â€˜locfitâ€™

The following objects are masked from â€˜package:GenomicRangesâ€™:

    left, right

Loading required package: rngtools
Loading required package: pkgmaker
Loading required package: registry

Attaching package: â€˜pkgmakerâ€™

The following object is masked from â€˜package:IRangesâ€™:

    new2

Warning messages:
1: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv",  :
  Could not infer array name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv
2: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv",  :
  Could not infer slide name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv
3: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv",  :
  Could not infer array name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv
4: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv",  :
  Could not infer slide name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv
[bumphunterEngine] Parallelizing using 4 workers/cores (backend: doParallel, version: 1.0.6).
[bumphunterEngine] Computing coefficients.
[bumphunterEngine] Performing 100 permutations.
[bumphunterEngine] Computing marginal permutation p-values.
[bumphunterEngine] cutoff: 0.05
[bumphunterEngine] Finding regions.
[bumphunterEngine] Found 233469 bumps.
[bumphunterEngine] Computing regions for each permutation.

Warning message:
In regionFinder(x = beta, chr = chr, pos = pos, cluster = cluster,  :
  NAs found and removed. ind changed.
Execution halted

 -- output of sessionInfo(): 

R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=C             
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] doParallel_1.0.6                                  
 [2] doRNG_1.5.5                                       
 [3] rngtools_1.2.3                                    
 [4] pkgmaker_0.17.4                                   
 [5] registry_0.2                                      
 [6] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
 [7] minfi_1.9.11                                      
 [8] bumphunter_1.2.0                                  
 [9] locfit_1.5-9.1                                    
[10] iterators_1.0.6                                   
[11] foreach_1.4.1                                     
[12] Biostrings_2.30.1                                 
[13] GenomicRanges_1.14.4                              
[14] XVector_0.2.0                                     
[15] IRanges_1.20.6                                    
[16] lattice_0.20-24                                   
[17] Biobase_2.22.0                                    
[18] BiocGenerics_0.8.0                                

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.24.0  DBI_0.2-7             MASS_7.3-29          
 [4] R.methodsS3_1.6.1     RColorBrewer_1.0-5    RSQLite_0.11.4       
 [7] XML_3.98-1.1          annotate_1.40.0       beanplot_1.1         
[10] codetools_0.2-8       digest_0.6.4          genefilter_1.44.0    
[13] grid_3.0.2            illuminaio_0.2.0      itertools_0.1-1      
[16] limma_3.18.4          matrixStats_0.8.14    mclust_4.2           
[19] multtest_2.18.0       nlme_3.1-113          nor1mix_1.1-4        
[22] preprocessCore_1.24.0 reshape_0.8.4         siggenes_1.36.0      
[25] splines_3.0.2         stats4_3.0.2          stringr_0.6.2        
[28] survival_2.37-7       tools_3.0.2           xtable_1.7-1   

--
Sent via the guest posting facility at bioconductor.org.