[Bioc-devel] C++ parallel computing

Thu May 27 01:21:18 CEST 2021

Thank you for the information. I guess I'll try to stick to the R-level
parallelization whenever possible.

Best,
Oleksii

On Wed, 26 May 2021 at 13:47, Martin Morgan <mtmorgan.bioc using gmail.com> wrote:

> The best way to process large files is in chunks using BamFile(…,
> yieldSize = …) and by using ScanBamParam() to select just the components of
> the bam files of interest. The number of cores is basically irrelevant for
> input -- you'll be using just one, so choose yieldSize to use modest
> amounts of memory for primary data, e.g, 4 GB per file, and process each
> file separately.
>
> Figure the iteration-by-chunk solution for one file; the simplest example
> is in ?Rsamtools::BamFile
>
>
>      ## Use 'yieldSize' to iterate through a file in chunks.
>      bf <- open(BamFile(fl, yieldSize=1000))
>      while (nrec <- length(scanBam(bf)[[1]][[1]]))
>          cat("records:", nrec, "\n")
>      close(bf)
>
> but you'd likely want the convenience of
> GenomicAlignments::readGAlignments() / readGAlignmentPairs().
>
> Once this is working, write this as a proper function, specifying all
> packages required for the function to complete, e.g.,
>
> fun = function(fl, yieldSize) {
>     library(Rsamtools)
>     nrec <- 0L
>     bf <- open(BamFile(fl, yieldSize=yieldSize))
>     repeat {
>         len <- length(scanBam(bf)[[1]][[1]])
>         if (len == 0L)
>             break
>         nrec = nrec + len
>     }
>     close(bf)
>     nrec
> }
>
> try to minimize the size of the inputs (here just the file name) and the
> outputs (nrec, a single integer), perhaps using the file system to
> temporarily store large results. Use BiocParallel::bplapply to apply this
> to all files
>
>     bplapply(fls, fun, yieldSize = 1000000)
>
> I would actually recommend BiocParallel::SnowParam() (separate processes)
> because (a) this enforces the discipline that the function does not rely
> implicitly on the state of the parent process and (b) ensures operation
> across all OS, and easier transition to, e.g., an HPC cluster. The fixed
> cost of starting separate processes for each file are outweighed by the
> time spent processing the file in the process.
>
> GenomicFiles::reduceByYield() or reduceByFile() might be relevant.
>
> I am not totally current (others on this list probably know more) but I
> don't think openMP is supported on MacOS (
> https://mac.r-project.org/openmp/) so would be a poor choice at the C
> level if cross-platform utility were important. If it were me, and again I
> do not have enough recent experience, I might aim for Intel Threaded
> Building Blocks, using RcppParallel for inspiration.
>
> Martin
>
> From: Oleksii Nikolaienko <oleksii.nikolaienko using gmail.com>
> Date: Tuesday, May 25, 2021 at 6:28 PM
> To: Martin Morgan <mtmorgan.bioc using gmail.com>
> Cc: "bioc-devel using r-project.org" <bioc-devel using r-project.org>
> Subject: Re: [Bioc-devel] C++ parallel computing
>
> Hi Martin,
> thanks for your answer. The goal is to speed up my package (epialleleR),
> where most of the functions are already written in C++, but the code is
> single-threaded. Tasks include: apply analog of
> GenomicAlignments::sequenceLayer to SEQ, QUAL and XM strings, calculate
> per-read methylation beta values, create methylation cytosine reports with
> prefiltering of sequence reads. Probably all of them I could parallelize
> at the level of R, but even in this case I'd maybe like to use OpenMP SIMD
> directives.
> And yes, the plan is to use Rhtslib. Current backend for reading BAM
> is Rsamtools, however I believe I could speed things up significantly by
> avoiding unnecessary type conversions and cutting other corners. It doesn't
> hurt much when the BAM file is smaller than 1GB, but for 20-40GB file
> loading takes more than an hour (24 cores, 378GB RAM workstation).
>
> Best,
> Oleksii
>
>
> On Tue, 25 May 2021 at 19:39, Martin Morgan <mailto:
> mtmorgan.bioc using gmail.com> wrote:
> If the BAM files are each processed independently, and each processing
> task takes a while, then it is probably 'good enough' to use R-level
> parallel evaluation using BiocParallel (currently the recommendation for
> Bioconductor packages) or other evaluation framework. Also, presumably you
> will use Rhtslib, which provides C-level access to the hts library. This
> will requiring writing C / C++ code to interface between R and the hts
> library, and will of course be a significant underataking.
>
> It might be worth outlining in a bit more detail what your task is and how
> (not too much detail!) you've tried to implement this in Rsamtools.
>
> Martin Morgan
>
> On 5/24/21, 10:01 AM, "Bioc-devel on behalf of Oleksii Nikolaienko"
> <mailto:bioc-devel-bounces using r-project.org on behalf of mailto:
> oleksii.nikolaienko using gmail.com> wrote:
>
>     Dear Bioc team,
>     I'd like to ask for your advice on the parallelization within a Bioc
>     package. Please point me to a better place if this mailing list is not
>     appropriate.
>     After a bit of thinking I decided that I'd like to parallelize
> processing
>     at the level of C++ code. Would you strongly recommend not to and use
> an R
>     approach instead (e.g. "future")?
>     If parallel C++ is ok, what would be the best solution for all major
> OSs?
>     My initial choice was OpenMP, but then it seems that Apple has
> something
>     against it (https://mac.r-project.org/openmp/). My own dev
> environment is
>     mostly Big Sur/ARM64, but I wouldn't want to drop its support anyway.
>
>     (On the actual task: loading and specific processing of very large BAM
>     files, ideally significantly faster than by means of Rsamtools as a
> backend)
>
>     Best,
>     Oleksii Nikolaienko
>
>         [[alternative HTML version deleted]]
>
>     _______________________________________________
>     mailto:Bioc-devel using r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]