[Bioc-devel] C++ parallel computing

Martin Morgan mtmorgan.bioc at gmail.com
Wed May 26 13:47:18 CEST 2021


The best way to process large files is in chunks, using BamFile(…, yieldSize = …) and ScanBamParam() to select just the components of the BAM files of interest. The number of cores is basically irrelevant for input -- you'll be using just one, so choose yieldSize to use a modest amount of memory for primary data, e.g., 4 GB per file, and process each file separately.
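For example, a minimal sketch of restricting input to a few fields (the fields named in 'what' are only illustrative, and 'fl' stands for a BAM file path):

    library(Rsamtools)
    ## read only the fields the analysis needs, one chunk at a time
    param <- ScanBamParam(what = c("rname", "pos", "cigar"))
    bf <- open(BamFile(fl, yieldSize = 1000000))
    chunk <- scanBam(bf, param = param)   # first chunk only, for illustration
    close(bf)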

Figure out the iteration-by-chunk solution for one file first; the simplest example is in ?Rsamtools::BamFile:

     
     ## Use 'yieldSize' to iterate through a file in chunks.
     bf <- open(BamFile(fl, yieldSize=1000)) 
     while (nrec <- length(scanBam(bf)[[1]][[1]]))
         cat("records:", nrec, "\n")
     close(bf)

but you'd likely want the convenience of GenomicAlignments::readGAlignments() / readGAlignmentPairs().
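A minimal sketch of the same chunk-wise iteration using readGAlignments() (again, 'fl' stands for a BAM file path; the per-chunk processing is left as a placeholder):

    library(GenomicAlignments)
    bf <- open(BamFile(fl, yieldSize = 1000000))
    repeat {
        gal <- readGAlignments(bf)   # next chunk, as a GAlignments object
        if (length(gal) == 0L)
            break
        ## ... process the chunk of alignments here ...
    }
    close(bf)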

Once this is working, write it as a proper function that loads all of the packages it needs (so it can run in a fresh worker process), e.g.,

fun <- function(fl, yieldSize) {
    library(Rsamtools)
    nrec <- 0L
    bf <- open(BamFile(fl, yieldSize = yieldSize))
    repeat {
        ## number of records in the current chunk
        len <- length(scanBam(bf)[[1]][[1]])
        if (len == 0L)
            break
        nrec <- nrec + len
    }
    close(bf)
    nrec
}

Try to minimize the size of the inputs (here just the file name) and of the outputs (nrec, a single integer), perhaps using the file system to temporarily store large results. Then use BiocParallel::bplapply() to apply the function to all files

    bplapply(fls, fun, yieldSize = 1000000)

I would actually recommend BiocParallel::SnowParam() (separate processes) because (a) it enforces the discipline that the function does not rely implicitly on the state of the parent process and (b) it ensures operation across all operating systems, and eases the transition to, e.g., an HPC cluster. The fixed cost of starting a separate process for each file is outweighed by the time spent processing the file in that process.
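A sketch of what that might look like (the worker count here is arbitrary):

    library(BiocParallel)
    ## separate R processes, communicating over sockets
    param <- SnowParam(workers = 4, type = "SOCK")
    res <- bplapply(fls, fun, yieldSize = 1000000, BPPARAM = param)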

GenomicFiles::reduceByYield() or reduceByFile() might be relevant.
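For instance, counting records with reduceByYield() might look like this (a sketch following the pattern in the GenomicFiles vignette):

    library(GenomicFiles)
    library(GenomicAlignments)
    bf <- BamFile(fl, yieldSize = 1000000)
    nrec <- reduceByYield(bf,
        YIELD = function(x) readGAlignments(x),  # read the next chunk
        MAP = length,                            # records in this chunk
        REDUCE = `+`)                            # running total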

I am not totally current (others on this list probably know more), but I don't think OpenMP is supported on macOS (https://mac.r-project.org/openmp/), so it would be a poor choice at the C level if cross-platform utility were important. If it were me, and again I do not have enough recent experience, I might aim for Intel Threading Building Blocks (TBB), using RcppParallel for inspiration.

Martin

From: Oleksii Nikolaienko <oleksii.nikolaienko at gmail.com>
Date: Tuesday, May 25, 2021 at 6:28 PM
To: Martin Morgan <mtmorgan.bioc at gmail.com>
Cc: "bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] C++ parallel computing

Hi Martin,
thanks for your answer. The goal is to speed up my package (epialleleR), where most of the functions are already written in C++ but the code is single-threaded. The tasks include: applying an analog of GenomicAlignments::sequenceLayer to SEQ, QUAL and XM strings, calculating per-read methylation beta values, and creating methylation cytosine reports with prefiltering of sequence reads. I could probably parallelize all of them at the R level, but even then I'd maybe like to use OpenMP SIMD directives.
And yes, the plan is to use Rhtslib. The current backend for reading BAM is Rsamtools; however, I believe I could speed things up significantly by avoiding unnecessary type conversions and cutting other corners. It doesn't hurt much when the BAM file is smaller than 1 GB, but for a 20-40 GB file, loading takes more than an hour (on a 24-core, 378 GB RAM workstation).

Best,
Oleksii


On Tue, 25 May 2021 at 19:39, Martin Morgan <mtmorgan.bioc at gmail.com> wrote:
If the BAM files are each processed independently, and each processing task takes a while, then it is probably 'good enough' to use R-level parallel evaluation with BiocParallel (currently the recommendation for Bioconductor packages) or another evaluation framework. Also, presumably you will use Rhtslib, which provides C-level access to the hts library. This will require writing C / C++ code to interface between R and the hts library, and will of course be a significant undertaking.

It might be worth outlining in a bit more detail what your task is and how (not too much detail!) you've tried to implement this in Rsamtools.

Martin Morgan

On 5/24/21, 10:01 AM, "Bioc-devel on behalf of Oleksii Nikolaienko" <bioc-devel-bounces at r-project.org on behalf of oleksii.nikolaienko at gmail.com> wrote:

    Dear Bioc team,
    I'd like to ask for your advice on the parallelization within a Bioc
    package. Please point me to a better place if this mailing list is not
    appropriate.
    After a bit of thinking I decided that I'd like to parallelize processing
    at the level of C++ code. Would you strongly recommend against that, in
    favor of an R approach instead (e.g. "future")?
    If parallel C++ is ok, what would be the best solution for all major OSs?
    My initial choice was OpenMP, but then it seems that Apple has something
    against it (https://mac.r-project.org/openmp/). My own dev environment is
    mostly Big Sur/ARM64, but I wouldn't want to drop its support anyway.

    (On the actual task: loading and specific processing of very large BAM
    files, ideally significantly faster than with Rsamtools as a backend.)

    Best,
    Oleksii Nikolaienko


    _______________________________________________
    Bioc-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/bioc-devel

