[BioC] featureCounts segfault
Wei Shi
shi at wehi.EDU.AU
Mon May 12 07:31:16 CEST 2014
Hi Ryan,
On May 10, 2014, at 6:13 AM, Ryan C. Thompson wrote:
> Hi Wei,
>
> I've been testing out the new arguments for featureCounts on my use case, and I'm getting a segfault when I try to run it. I suspect that I may be using a too large annotation and featureCounts is running out of memory. Specifically, I'm counting ChIP-Seq reads in bins placed every 50 bp across the entire human genome, so that's about 60 million bins, which is a lot. Below is the log output for the crash. Note that "featureCounts.fragments is my function that simply calls featureCounts with the appropriate arguments. In any case, the crash seems to happen while loading the annotation file.
>
> Should I try splitting the annotation into smaller pieces, calling featureCounts on each one, and then rbinding them together?
>
Yes, your suspicion is right. The problem was caused by your very large annotation file. We have now improved the featureCounts code to make it be able to read in an annotation file as large as 3GB. Changes have just been committed to bioc devel (Rsubread 1.15.4).
However, there is an issue here. It looks like you are going to summarize 142 BAM files at the same time. After summarization, featureCounts returns a List object which includes a count table in which rows are features and columns are samples. The count table to be generated from your data will be extremely big (possibly 10s-100s gigabyes). Does your computer have enough memory to save these data? featureCounts may crash due to insufficient memory.
You may run chromosome-wise summarization to try to reduce memory use, or you may run SourceForge version of featureCounts (a C program running on unix/mac shell) which saves data to a file.
> Also, an unrelated question: when the read is reduced to its 5-prime end, is this reducing its length to one or zero? I ask because I want to know whether to extend by the fragment length or the fragment length minus 1.
>
> Thanks,
>
> -Ryan
>
It is reduced to length one.
Best,
Wei
> Error output:
>
> ========== _____ _ _ ____ _____ ______ _____
> ===== / ____| | | | _ \| __ \| ____| /\ | __ \
> ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
> ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
> ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
> ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
> Rsubread 1.15.2
>
> //========================== featureCounts setting ===========================\\
> || ||
> || Input files : 142 BAM files ||
> || S /gpfs/home/rcthomps/Projects/sarah-cd4/bam ... ||
> ...
> || S /gpfs/home/rcthomps/Projects/sarah-cd4/bam ... ||
> || ||
> || Output file : ./.Rsubread_featureCounts_pid17803 ||
> || Annotations : ./.Rsubread_UserProvidedAnnotation_pid17803 ... ||
> || ||
> || Threads : 8 ||
> || Level : meta-feature level ||
> || Paired-end : no ||
> || Strand specific : no ||
> || Multimapping reads : not counted ||
> || Multi-overlapping reads : counted ||
> || Read extensions : 0 on 5' and 147 on 3' ends ||
> || Read reduction to : 5' end ||
> || ||
> \\===================== http://subread.sourceforge.net/ ======================//
>
> //================================= Running ==================================\\
> || ||
> || Load annotation file ./.Rsubread_UserProvidedAnnotation_pid17803 ... ||
>
> *** caught segfault ***
> address 0x7f428c4fd38d, cause 'memory not mapped'
>
> Traceback:
> 1: .C("R_readSummary_wrapper", as.integer(n), as.character(cmd), PACKAGE = "Rsubread")
> 2: featureCounts(bam, annot.ext = saf, isPairedEnd = FALSE, read2pos = 5, readExtension3 = fraglength - 1, allowMultiOverlap = TRUE, strandSpecific = 0, nthreads = nthreads, ...)
> 3: featureCounts.fragments(bamfiles, windows, fraglength = histone.width)
>
> > sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
> [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
>
> attached base packages:
> [1] grDevices datasets parallel graphics utils stats methods
> [8] base
>
> other attached packages:
> [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 BSgenome_1.30.0
> [3] BiocParallel_0.4.1 xlsx_0.5.5
> [5] xlsxjars_0.6.0 rJava_0.9-6
> [7] plyr_1.8 doParallel_1.0.8
> [9] iterators_1.0.7 foreach_1.4.2
> [11] Rsubread_1.15.2 Rsamtools_1.14.3
> [13] Biostrings_2.30.1 GenomicRanges_1.14.4
> [15] XVector_0.2.0 BiocInstaller_1.12.1
> [17] stringr_0.6.2 IRanges_1.20.6
> [19] BiocGenerics_0.8.0 R.utils_1.29.8
> [21] R.oo_1.17.0 R.methodsS3_1.6.1
> [23] ggplot2_0.9.3.1
>
> loaded via a namespace (and not attached):
> [1] BatchJobs_1.2 BBmisc_1.5 bitops_1.0-6 brew_1.0-6
> [5] codetools_0.2-8 colorspace_1.2-4 DBI_0.2-7 dichromat_2.0-0
> [9] digest_0.6.4 fail_1.2 grid_3.0.2 gtable_0.1.2
> [13] labeling_0.2 MASS_7.3-30 munsell_0.4.2 proto_0.3-10
> [17] RColorBrewer_1.0-5 reshape2_1.2.2 RSQLite_0.11.4 scales_0.2.3
> [21] sendmailR_1.1-2 stats4_3.0.2 tools_3.0.2 zlibbioc_1.8.0
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}
More information about the Bioconductor
mailing list