[BioC] Thread safety of BSgenome getSeq()
Hervé Pagès
hpages at fhcrc.org
Tue Jul 1 21:14:23 CEST 2014
Hi Jeff,
On 07/01/2014 10:40 AM, Johnston, Jeffrey wrote:
> Thanks for looking into this Hervé and Martin, and for the quick fix. It is working as expected now. I did have to install the updated packages from source as the binaries do not seem to be updating for Mac or Linux.
Glad it works for you now. Yes we still need to generate the binaries
for Windows and Mac. Will do soon. Thanks for the reminder!
Cheers,
H.
>
> Thanks,
> Jeff
>
> On Jun 18, 2014, at 1:14 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> Hi Jeff,
>>
>> Sorry for the delay in answering.
>>
>> Thanks for reporting this. Martin Morgan was able to reproduce and
>> found out that the problem is originating in the razf C code from
>> Samtools (which is included in the Rsamtools package). The razf code
>> provides the low level functionality for writing/reading RAZipp'ed
>> compressed FASTA files, which is the format used in BioC 2.14 for
>> storing the genomic sequences contained in the BSgenome data packages.
>>
>> Other issues have been reported with these RAZip based BSgenome
>> packages, and, more importantly, it seems that the original Samtools
>> developers have moved to other projects and are not going to maintain
>> the razf code anymore.
>>
>> So this new issue you found convinced me that we should stop using
>> the RAZip format in the BSgenome data packages. I'm currently in the
>> process of rolling them back to the old on-disk format (i.e. 1
>> serialized DNAString object per chromosome). Expect them to show up in
>> the BioC package repos in the next couple of days. They'll be at
>> version 1.3.1000 (source packages will show up today, binary packages
>> will follow).
>>
>> Also note that in devel (BioC 3.0), the BSgenome data packages use the
>> 2bit format (from UCSC), which is superior in any aspect to the
>> previous on-disk formats (i.e. in terms of speed, memory usage, and
>> size on disk). Unfortunately I cannot use this format for the BSgenome
>> data packages in release because that relies on functionalities in
>> the BSgenome *software* package that are only available in BioC devel.
>>
>> Let me know if you have any questions or concerns about this.
>>
>> Cheers,
>> H.
>>
>> On 06/06/2014 01:15 PM, Johnston, Jeffrey wrote:
>>> Hi,
>>>
>>> I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU:
>>>
>>> #------------------
>>> library(GenomicRanges)
>>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>>> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+")
>>> gr.list <- lapply(1:6, function(i) gr )
>>>
>>> seqs.list <- mclapply(gr.list, function(gr) {
>>> message("getSeq() started")
>>> s <- getSeq(Dmelanogaster, gr) # does not reliably return if mc.cores > 1
>>> message("getSeq() returned")
>>> s
>>> }, mc.cores=2)
>>> #------------------
>>>
>>> If I instead load the BSgenome package inside the parallelized function everything is fine:
>>>
>>> #------------------
>>> library(GenomicRanges)
>>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>>> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+")
>>> detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE)
>>> gr.list <- lapply(1:6, function(i) gr )
>>>
>>> seqs.list <- mclapply(gr.list, function(gr) {
>>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>>> message("getSeq() started")
>>> s <- getSeq(Dmelanogaster, gr) # always works
>>> message("getSeq() returned")
>>> s
>>> }, mc.cores=2)
>>> #------------------
>>>
>>> I can reproduce this issue on both Mac and Linux (both 64-bit).
>>>
>>> Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()?
>>>
>>> Thanks,
>>> Jeff Johnston
>>> Zeitlinger Lab
>>> Stowers Institute for Medical Research
>>>
>>> #------------------
>>>> sessionInfo()
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0 Biostrings_2.32.0 XVector_0.4.0
>>> [5] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.8 BiocGenerics_0.10.0
>>> [9] setwidth_1.0-3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0 zlibbioc_1.10.0
>>> #------------------
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list