[BioC] featureCounts segfault

Tue May 13 08:04:17 CEST 2014

Hi Ryan,

That sounds a better way to do it. Let us know if you ran into any problems.

Cheers,

Wei

On May 13, 2014, at 4:02 PM, Ryan wrote:

> Hi Wei,
> 
> Thanks for making the change. I think I will split my annotation into chunks that R and featureCounts can both handle, and then merge them all into a bigmatrix object or something like that which can handle the large size of the object.
> 
> -Ryan
> 
> On Sun May 11 22:31:16 2014, Wei Shi wrote:
>> Hi Ryan,
>> 
>> On May 10, 2014, at 6:13 AM, Ryan C. Thompson wrote:
>> 
>>> Hi Wei,
>>> 
>>> I've been testing out the new arguments for featureCounts on my use case, and I'm getting a segfault when I try to run it. I suspect that I may be using a too large annotation and featureCounts is running out of memory. Specifically, I'm counting ChIP-Seq reads in bins placed every 50 bp across the entire human genome, so that's about 60 million bins, which is a lot. Below is the log output for the crash. Note that "featureCounts.fragments is my function that simply calls featureCounts with the appropriate arguments. In any case, the crash seems to happen while loading the annotation file.
>>> 
>>> Should I try splitting the annotation into smaller pieces, calling featureCounts on each one, and then rbinding them together?
>>> 
>> 
>> Yes, your suspicion is right. The problem was caused by your very large annotation file. We have now improved the featureCounts code to make it be able to read in an annotation file as large as 3GB. Changes have just been committed to bioc devel (Rsubread 1.15.4).
>> 
>> However, there is an issue here. It looks like you are going to summarize 142 BAM files at the same time. After summarization, featureCounts returns a List object which includes a count table in which rows are features and columns are samples. The count table to be generated from your data will be extremely big (possibly 10s-100s gigabyes). Does your computer have enough memory to save these data? featureCounts may crash due to insufficient memory.
>> 
>> You may run chromosome-wise summarization to try to reduce memory use, or you may run SourceForge version of featureCounts (a C program running on unix/mac shell) which saves data to a file.
>> 
>> 
>>> Also, an unrelated question: when the read is reduced to its 5-prime end, is this reducing its length to one or zero? I ask because I want to know whether to extend by the fragment length or the fragment length minus 1.
>>> 
>>> Thanks,
>>> 
>>> -Ryan
>>> 
>> 
>> It is reduced to length one.
>> 
>> 
>> Best,
>> 
>> Wei
>> 
>> 
>>> Error output:
>>> 
>>>        ==========     _____ _    _ ____  _____ ______          _____
>>>        =====         / ____| |  | |  _ \|  __ \|  ____| /\   |  __ \
>>>          =====      | (___ | |  | | |_) | |__) | |__ /  \  | |  | |
>>>            ====      \___ \| |  | |  _ <|  _  /|  __| / /\ \ | |  | |
>>>              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
>>>        ==========   |_____/ \____/|____/|_| \_\______/_/    \_\_____/
>>>       Rsubread 1.15.2
>>> 
>>> //========================== featureCounts setting ===========================\\
>>> || ||
>>> ||             Input files : 142 BAM files                                    ||
>>> ||                           S /gpfs/home/rcthomps/Projects/sarah-cd4/bam ... ||
>>> ...
>>> ||                           S /gpfs/home/rcthomps/Projects/sarah-cd4/bam ... ||
>>> || ||
>>> ||             Output file : ./.Rsubread_featureCounts_pid17803               ||
>>> ||             Annotations : ./.Rsubread_UserProvidedAnnotation_pid17803  ... ||
>>> || ||
>>> ||                 Threads : 8                                                ||
>>> ||                   Level : meta-feature level                               ||
>>> ||              Paired-end : no                                               ||
>>> ||         Strand specific : no                                               ||
>>> ||      Multimapping reads : not counted                                      ||
>>> || Multi-overlapping reads : counted                                          ||
>>> ||         Read extensions : 0 on 5' and 147 on 3' ends                       ||
>>> ||       Read reduction to : 5' end                                           ||
>>> || ||
>>> \\===================== http://subread.sourceforge.net/ ======================//
>>> 
>>> //================================= Running ==================================\\
>>> || ||
>>> || Load annotation file ./.Rsubread_UserProvidedAnnotation_pid17803 ...       ||
>>> 
>>> *** caught segfault ***
>>> address 0x7f428c4fd38d, cause 'memory not mapped'
>>> 
>>> Traceback:
>>> 1: .C("R_readSummary_wrapper", as.integer(n), as.character(cmd),     PACKAGE = "Rsubread")
>>> 2: featureCounts(bam, annot.ext = saf, isPairedEnd = FALSE, read2pos = 5,     readExtension3 = fraglength - 1, allowMultiOverlap = TRUE,     strandSpecific = 0, nthreads = nthreads, ...)
>>> 3: featureCounts.fragments(bamfiles, windows, fraglength = histone.width)
>>> 
>>>> sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>> 
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
>>> [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
>>> 
>>> attached base packages:
>>> [1] grDevices datasets  parallel  graphics  utils stats     methods
>>> [8] base
>>> 
>>> other attached packages:
>>> [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 BSgenome_1.30.0
>>> [3] BiocParallel_0.4.1 xlsx_0.5.5
>>> [5] xlsxjars_0.6.0 rJava_0.9-6
>>> [7] plyr_1.8 doParallel_1.0.8
>>> [9] iterators_1.0.7 foreach_1.4.2
>>> [11] Rsubread_1.15.2 Rsamtools_1.14.3
>>> [13] Biostrings_2.30.1 GenomicRanges_1.14.4
>>> [15] XVector_0.2.0 BiocInstaller_1.12.1
>>> [17] stringr_0.6.2 IRanges_1.20.6
>>> [19] BiocGenerics_0.8.0 R.utils_1.29.8
>>> [21] R.oo_1.17.0 R.methodsS3_1.6.1
>>> [23] ggplot2_0.9.3.1
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] BatchJobs_1.2      BBmisc_1.5 bitops_1.0-6       brew_1.0-6
>>> [5] codetools_0.2-8    colorspace_1.2-4 DBI_0.2-7          dichromat_2.0-0
>>> [9] digest_0.6.4       fail_1.2 grid_3.0.2         gtable_0.1.2
>>> [13] labeling_0.2       MASS_7.3-30 munsell_0.4.2      proto_0.3-10
>>> [17] RColorBrewer_1.0-5 reshape2_1.2.2 RSQLite_0.11.4     scales_0.2.3
>>> [21] sendmailR_1.1-2    stats4_3.0.2 tools_3.0.2        zlibbioc_1.8.0
>> 
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the addressee.
>> You must not disclose, forward, print or use it without the permission of the sender.
>> ______________________________________________________________________

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}