[Bioc-devel] BiocParallel load balancing and runtime

Anna Plaxienko
Tue Aug 8 15:21:23 CEST 2023


I chose distributed memory because my package also needs to run on
Windows. Would it be better to default to shared memory, check the user's
system, and switch to sockets only when necessary?
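
To make the question concrete, here is a minimal sketch of what I have in mind, not a tested implementation. The helper names pick_backend() and make_param() are hypothetical; the only assumption about the platform is that forking is unavailable on Windows, so sockets are used there.

```r
# Hypothetical helper: decide on a backend from the OS type alone.
pick_backend <- function(os_type = .Platform$OS.type) {
  if (identical(os_type, "windows")) "snow" else "multicore"
}

# Hypothetical helper: build the matching BiocParallel parameter object.
# BiocParallel is only touched when this function is called.
make_param <- function(workers, tasks = 0L, backend = pick_backend()) {
  switch(
    backend,
    snow = BiocParallel::SnowParam(workers = workers, type = "SOCK",
                                   tasks = tasks),
    multicore = BiocParallel::MulticoreParam(workers = workers,
                                             tasks = tasks),
    stop("unknown backend: ", backend)
  )
}

# Intended use (commented out so the sketch runs without a cluster):
# res <- BiocParallel::bplapply(my_list, FUN,
#                               BPPARAM = make_param(ncores, length(my_list)))
```

The decision is kept separate from the construction so that a user could still pass their own BPPARAM and override the default.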

Regarding the real data: I have 68 samples (rows) of methylation EPIC array
data (850K columns), which I split by chromosome. That gives 22 matrices
ranging from 10K to 80K columns each, which is why I need load balancing.
With *clusterApplyLB*, my method runs in 38 minutes; with *bplapply*, it
takes 42 minutes. In other examples the difference is similar, 10-15%. That
is not dramatic, of course: if you've already waited 38 minutes, you can
wait an extra 4 :) But I'm curious why it happens and whether it's
something I can fix.
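
To illustrate why chunks of such different sizes call for load balancing, here is a small simulation in base R. It is only a sketch under stated assumptions: the chunk sizes are made up (22 values spread between 80 and 10, not my real chromosome sizes), the cost of a task is assumed proportional to its number of columns, the worker count of 4 is arbitrary, and the round-robin rule is just one simple static scheme chosen for comparison.

```r
# Hypothetical chunk sizes in thousands of columns (not the real data).
sizes <- round(seq(80, 10, length.out = 22))
workers <- 4L  # arbitrary worker count for the illustration

# Static scheduling: task i is assigned up front to worker
# ((i - 1) %% workers) + 1, regardless of how long tasks take.
static_makespan <- function(cost, workers) {
  owner <- (seq_along(cost) - 1L) %% workers + 1L
  max(tapply(cost, owner, sum))
}

# Dynamic scheduling: each task goes to the worker that becomes free
# first, which is the idea behind load-balanced dispatch.
dynamic_makespan <- function(cost, workers) {
  busy_until <- numeric(workers)
  for (x in cost) {
    w <- which.min(busy_until)
    busy_until[w] <- busy_until[w] + x
  }
  max(busy_until)
}

print(c(static = static_makespan(sizes, workers),
        dynamic = dynamic_makespan(sizes, workers)))
```

With these made-up sizes the dynamic schedule finishes sooner than the static one, because no worker is stuck with the largest chunk of every round.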

On Tue, 8 Aug 2023 at 15:04, Waldir Leoncio Netto <w.l.netto using medisin.uio.no> wrote:

> Dear Anna,
>
> According to the documentation of "BiocParallelParam", SnowParam() is a
> subclass suitable for distributed memory (e.g. cluster) computing. If
> you're running your code on a simpler machine with shared memory (e.g. your
> PC), you're probably better off using MulticoreParam() instead. Here's a
> modified example based on yours:
>
> # Setup
> library(parallel)
> library(BiocParallel)
> my_list <- list(1:10, 11:20, 21:30, 31:40, 41:50, 51:60, 61:70, 71:80,
>                 81:90)
> FUN <- function(x) x ^ 10
> # Leave one core free, but never drop below one worker
> ncores <- max(1L, min(detectCores() - 1L, 10L))
>
> # Parallel
> cl <- makeCluster(ncores)
> print(system.time(res <- clusterApplyLB(cl, my_list, FUN)))
> stopCluster(cl)
>
> # BiocParallel
> # tasks = length(my_list) makes each list element its own task
> parallel_param_1 <- SnowParam(workers = ncores, tasks = length(my_list))
> print(system.time(res2 <- bplapply(my_list, FUN,
>                                    BPPARAM = parallel_param_1)))
> parallel_param_2 <- MulticoreParam(workers = ncores,
>                                    tasks = length(my_list))
> print(system.time(res3 <- bplapply(my_list, FUN,
>                                    BPPARAM = parallel_param_2)))
>
> On my machine, the output is as follows, in the order clusterApplyLB(),
> SnowParam(), MulticoreParam(). Notice that the last column, the elapsed
> time, shows MulticoreParam() performing better than parallel:
>
>    user  system elapsed
>   0.000   0.004   0.088
>    user  system elapsed
>   0.114   0.001   1.336
>    user  system elapsed
>   0.074   0.124   0.060
>
> How does that work on your actual data?
>
> Best,
> Waldir
>
> On Tue, 08.08.2023 at 13:10 +0200, Anna Plaxienko wrote:
>
> Hi all!
>
> I'm switching from the base R *parallel* package to *BiocParallel* for my
> Bioconductor submission and I have two questions. First, I wanted advice on
> whether I've implemented load balancing correctly. Second, I've noticed
> that the running time is about 15% longer with BiocParallel. Any ideas why?
>
>
> Parallel code
>
> cl <- makeCluster(ncores)
> res <- clusterApplyLB(cl, my_list, FUN)
> stopCluster(cl)
>
> BiocParallel
>
> parallel_param <- SnowParam(workers = ncores, type = "SOCK",
>                             tasks = length(my_list))
> res2 <- bplapply(my_list, FUN, BPPARAM = parallel_param)
>
> Thank you!
>
> Best regards,
> Anna Plaksienko
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel