[Bioc-devel] C++ parallel computing

Oleksii Nikolaienko oleksii.nikolaienko using gmail.com
Wed May 26 00:28:32 CEST 2021


Hi Martin,
thanks for your answer. The goal is to speed up my package (epialleleR),
where most of the functions are already written in C++, but the code is
single-threaded. Tasks include: applying an analog of
GenomicAlignments::sequenceLayer to SEQ, QUAL and XM strings, calculating
per-read methylation beta values, and creating methylation cytosine reports
with prefiltering of sequence reads. I could probably parallelize all of
them at the R level, but even then I might still like to use OpenMP SIMD
directives.
And yes, the plan is to use Rhtslib. The current backend for reading BAM
files is Rsamtools; however, I believe I could speed things up significantly
by avoiding unnecessary type conversions and cutting other corners. The
overhead doesn't hurt much when the BAM file is smaller than 1GB, but for a
20-40GB file loading takes more than an hour (24 cores, 378GB RAM workstation).
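
To make the SIMD idea concrete, here is a minimal sketch of a per-read beta
calculation. It is an illustration only, not the code in epialleleR: the
names read_beta and read_betas are hypothetical, and it assumes Bismark-style
XM strings in which Z/X/H mark methylated and z/x/h unmethylated cytosines
(the unknown-context codes U/u are ignored here). A read without any
cytosine calls yields NaN. The pragmas are plain hints, so a compiler
without OpenMP support (e.g. the stock Apple toolchain) should simply skip
them and run the same loops serially.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of epialleleR): beta value of one read,
// i.e. methylated / (methylated + unmethylated) calls in its XM string.
double read_beta(const std::string& xm) {
    const char* s = xm.data();
    const std::size_t n = xm.size();
    int meth = 0;
    int unmeth = 0;
    // Request vectorization of the counting loop; ignored without OpenMP.
    #pragma omp simd reduction(+ : meth, unmeth)
    for (std::size_t i = 0; i < n; ++i) {
        const char c = s[i];
        meth += (c == 'Z') | (c == 'X') | (c == 'H');    // methylated calls
        unmeth += (c == 'z') | (c == 'x') | (c == 'h');  // unmethylated calls
    }
    const int total = meth + unmeth;
    // No cytosine calls in the read: beta is undefined, return NaN.
    return total > 0 ? static_cast<double>(meth) / total : std::nan("");
}

// Hypothetical wrapper: one beta value per read. Reads are independent,
// so the outer loop can be split across threads when OpenMP is enabled.
std::vector<double> read_betas(const std::vector<std::string>& xms) {
    std::vector<double> out(xms.size());
    const long n = static_cast<long>(xms.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        out[k] = read_beta(xms[k]);
    }
    return out;
}
```

Whether the SIMD hint actually pays off for short reads would need
benchmarking; the thread-level loop over reads is likely the bigger win.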

Best,
Oleksii


On Tue, 25 May 2021 at 19:39, Martin Morgan <mtmorgan.bioc using gmail.com> wrote:

> If the BAM files are each processed independently, and each processing
> task takes a while, then it is probably 'good enough' to use R-level
> parallel evaluation using BiocParallel (currently the recommendation for
> Bioconductor packages) or another evaluation framework. Also, presumably you
> will use Rhtslib, which provides C-level access to the hts library. This
> will require writing C / C++ code to interface between R and the hts
> library, and will of course be a significant undertaking.
>
> It might be worth outlining in a bit more detail what your task is and how
> (not too much detail!) you've tried to implement this in Rsamtools.
>
> Martin Morgan
>
> On 5/24/21, 10:01 AM, "Bioc-devel on behalf of Oleksii Nikolaienko" <
> bioc-devel-bounces using r-project.org on behalf of
> oleksii.nikolaienko using gmail.com> wrote:
>
>     Dear Bioc team,
>     I'd like to ask for your advice on the parallelization within a Bioc
>     package. Please point me to a better place if this mailing list is not
>     appropriate.
>     After a bit of thinking I decided that I'd like to parallelize processing
>     at the level of C++ code. Would you strongly recommend not to and use an R
>     approach instead (e.g. "future")?
>     If parallel C++ is ok, what would be the best solution for all major OSs?
>     My initial choice was OpenMP, but then it seems that Apple has something
>     against it (https://mac.r-project.org/openmp/). My own dev environment is
>     mostly Big Sur/ARM64, but I wouldn't want to drop its support anyway.
>
>     (On the actual task: loading and specific processing of very large BAM
>     files, ideally significantly faster than by means of Rsamtools as a backend)
>
>     Best,
>     Oleksii Nikolaienko
>
