[BioC] Thread safety of BSgenome getSeq()

Hervé Pagès hpages at fhcrc.org
Wed Jun 18 20:14:52 CEST 2014


Hi Jeff,

Sorry for the delay in answering.

Thanks for reporting this. Martin Morgan was able to reproduce and
found out that the problem is originating in the razf C code from
Samtools (which is included in the Rsamtools package). The razf code
provides the low level functionality for writing/reading RAZipp'ed
compressed FASTA files, which is the format used in BioC 2.14 for
storing the genomic sequences contained in the BSgenome data packages.

Other issues have been reported with these RAZip based BSgenome
packages, and, more importantly, it seems that the original Samtools
developers have moved to other projects and are not going to maintain
the razf code anymore.

So this new issue you found convinced me that we should stop using
the RAZip format in the BSgenome data packages. I'm currently in the
process of rolling them back to the old on-disk format (i.e. 1
serialized DNAString object per chromosome). Expect them to show up in
the BioC package repos in the next couple of days. They'll be at
version 1.3.1000 (source packages will show up today, binary packages
will follow).

Also note that in devel (BioC 3.0), the BSgenome data packages use the
2bit format (from UCSC), which is superior in any aspect to the
previous on-disk formats (i.e. in terms of speed, memory usage, and
size on disk). Unfortunately I cannot use this format for the BSgenome
data packages in release because that relies on functionalities in
the BSgenome *software* package that are only available in BioC devel.

Let me know if you have any questions or concerns about this.

Cheers,
H.

On 06/06/2014 01:15 PM, Johnston, Jeffrey wrote:
> Hi,
>
> I have encountered some issues using getSeq() on a BSgenome object inside a function parallelized with mclapply(). When calling getSeq() from multiple threads simultaneously, at least one will hang indefinitely using 100% CPU:
>
> #------------------
> library(GenomicRanges)
> library(BSgenome.Dmelanogaster.UCSC.dm3)
> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+")
> gr.list <- lapply(1:6, function(i) gr )
>
> seqs.list <- mclapply(gr.list, function(gr) {
>    message("getSeq() started")
>    s <- getSeq(Dmelanogaster, gr)  # does not reliably return if mc.cores > 1
>    message("getSeq() returned")
>    s
> }, mc.cores=2)
> #------------------
>
> If I instead load the BSgenome package inside the parallelized function everything is fine:
>
> #------------------
> library(GenomicRanges)
> library(BSgenome.Dmelanogaster.UCSC.dm3)
> gr <- GRanges(ranges=IRanges(start=sample(seqlengths(Dmelanogaster)["chr2L"] - 20, 10000), width=20), seqnames="chr2L", strand="+")
> detach(name="package:BSgenome.Dmelanogaster.UCSC.dm3", unload=TRUE)
> gr.list <- lapply(1:6, function(i) gr )
>
> seqs.list <- mclapply(gr.list, function(gr) {
>    library(BSgenome.Dmelanogaster.UCSC.dm3)
>    message("getSeq() started")
>    s <- getSeq(Dmelanogaster, gr)  # always works
>    message("getSeq() returned")
>    s
> }, mc.cores=2)
> #------------------
>
> I can reproduce this issue on both Mac and Linux (both 64-bit).
>
> Is this just a limitation of BSgenome? Is there a better workaround than making sure the package is not loaded before the call to mclapply()?
>
> Thanks,
> Jeff Johnston
> Zeitlinger Lab
> Stowers Institute for Medical Research
>
> #------------------
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.99 BSgenome_1.32.0                        Biostrings_2.32.0                      XVector_0.4.0
> [5] GenomicRanges_1.16.3                   GenomeInfoDb_1.0.2                     IRanges_1.22.8                         BiocGenerics_0.10.0
> [9] setwidth_1.0-3
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6     Rsamtools_1.16.0 stats4_3.1.0     zlibbioc_1.10.0
> #------------------
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list