[R-sig-hpc] Why pure computation time in parallel is longer than the serial version?

beleites,claudia claudia.beleites at ipht-jena.de
Sat Feb 22 12:43:08 CET 2014


Hi Xuening,

2 physical vs 2 physical * 2 logical threads: See e.g. here:  http://unix.stackexchange.com/a/88290

You say you have 2 *physical* cores. That's the number you want to use for the parallel execution. Logical cores are just 2 (or more) threads running on the same physical core. IIRC, this can speed up things mainly if the 2 threads run very different operations.

I don't think this helps the BLAS, because there you have massive amounts of the *same* operation, which is easily parallelized across the physical cores. Instead, you get overhead and maybe caching/scheduling "conflicts".

I think if you want to profit from the 2 phys. * 2 log. thread architecture, you'd need to optimize the compilation to be aware of it. But even then I wouldn't expect too much here: 2 threads on the physical cores probably don't leave much room for other calculations to be done "meanwhile".

All in all, I think it is just the same behaviour you see when scheduling more threads than cores in general (e.g. on a machine that has 1 logical core per physical core).
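If it helps, the physical-core count can be queried directly in R, so the worker count is pinned to it rather than to the logical count. A minimal sketch using the parallel package (not a benchmark; detectCores() can return NA on some platforms, hence the fallback):

```r
library(parallel)

## Physical cores only: on the 2 phys. * 2 log. machine discussed here
## this returns 2, while detectCores() would return 4.
phys <- detectCores(logical = FALSE)
if (is.na(phys)) phys <- 1L   # detectCores() can return NA

## Schedule exactly one worker per physical core.
res <- mclapply(1:8, function(i) i^2, mc.cores = phys)
unlist(res)
```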

HTH,

Claudia




--
Claudia Beleites, Chemist
Spectroscopy/Imaging
Leibniz Institute of Photonic Technology
Albert-Einstein-Str. 9
07745 Jena
Germany

email: claudia.beleites at ipht-jena.de
phone: +49 3641 206-133
fax:   +49 2641 206-399


________________________________________
From: r-sig-hpc-bounces at r-project.org [r-sig-hpc-bounces at r-project.org] on behalf of Xuening Zhu [puddingnnn529 at gmail.com]
Sent: Saturday, 22 February 2014 11:30
To: Roger Bivand
Cc: r-sig-hpc at r-project.org
Subject: Re: [R-sig-hpc] Why pure computation time in parallel is longer than the serial version?

Roger,
Many thanks! I've done some further experiments with OpenBLAS for parallel
computation.
My CPU is *Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz*. There are 2 physical
cores and 2 additional logical cores. The memory size is 8 GB, and my
operating system is Ubuntu 12.04.

I chose a 10^3 * 10^4 matrix and want to time its multiplication
(t(m) %*% m). I don't consider tcrossprod() because I just want to make
the computation take longer. Maybe more cases can be compared later.
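As an aside, base R's fused primitives compute these products in one BLAS call, without materialising the transpose: crossprod(m) is t(m) %*% m, and tcrossprod(m) is m %*% t(m). A small sanity check on a toy matrix:

```r
set.seed(1)
m <- matrix(rnorm(12), nrow = 4)   # small 4 x 3 example

all.equal(t(m) %*% m, crossprod(m))    # TRUE
all.equal(m %*% t(m), tcrossprod(m))   # TRUE
```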

R (3.0.2) was re-compiled against OpenBLAS. An inline function, defined
below, is used to change the number of threads:
require(inline)
openblas.set.num.threads <- cfunction(
  signature(ipt = "integer"),
  body       = 'openblas_set_num_threads(*ipt);',
  otherdefs  = c('extern void openblas_set_num_threads(int);'),
  libargs    = c('-L/home/pudding/OpenBLAS/ -lopenblas'),
  language   = "C",
  convention = ".C")
##################################################################################

First I compare OpenBLAS with the default R BLAS:
(1) multiplication with the default R BLAS:

> mat = matrix(1:1e7, ncol = 1e4)
> system.time(t(mat) %*% mat)
   user  system elapsed
 84.517   0.320  85.090


(2) open-blas with 2 threads specified:

> openblas.set.num.threads(2)
$ipt
[1] 2
> system.time(t(mat)%*%mat)
   user  system elapsed
 10.164   0.512   5.549

(3) open-blas with 4 threads specified:

> openblas.set.num.threads(4)
$ipt
[1] 4

> system.time(t(mat)%*%mat)
   user  system elapsed
 26.954   1.556   8.147

It is a little strange that 4 threads are even slower than 2 threads!

##################################################################
Then I want to mix multicore with OpenBLAS. I try to change the implicit
parallelism of the matrix multiplication into an explicit version, so I
split the data into several partitions and multiply them with mclapply().
Things become very weird here.

(1) First I specify the number of threads to be 1 in open-blas, and 2 cores
are used in mclapply:
> openblas.set.num.threads(1)
$ipt
[1] 1

> system.time({
+   group = sample(rep(1:8,length.out=ncol(mat)))
+   mm = lapply(split(seq(ncol(mat)),group),function(i) mat[,i])
+   #mcaffinity(1:8)
+   #system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
+   #cores = detectCores()
+   a = mclapply(mm,function(m){
+     cat('Running!!\n')
+     t(m)%*%m
+     #tcrossprod(m)
+   },mc.cores=2)
+   b = Reduce("+",a)
+ })
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
   user  system elapsed
  0.352   0.168   1.363

(2) Then I change the number of cores in mclapply (in the parallel
package) from 2 to 4. The time gets even longer, as shown below.
(Sometimes it even gives a segfault, but unfortunately I have no
reliable way to reproduce it.)

> system.time({
+   group = sample(rep(1:8,length.out=ncol(mat)))
+   mm = lapply(split(seq(ncol(mat)),group),function(i) mat[,i])
+   #mcaffinity(1:8)
+   #system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
+   #cores = detectCores()
+   a = mclapply(mm,function(m){
+     cat('Running!!\n')
+     t(m)%*%m
+     #tcrossprod(m)
+   },mc.cores=4)
+   b = Reduce("+",a)
+ })
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
   user  system elapsed
  0.400   0.148   1.597

(3) When I change the number of OpenBLAS threads from 1 to anything
greater than 1, I no longer get results back from mclapply. I guess
there is some conflict between mclapply's forking and OpenBLAS's
multi-threading.
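One workaround sketch (an assumption on my part, not a benchmark): force BLAS down to a single thread before forking, so each mclapply() worker owns exactly one core, and use a row-wise split, for which the per-block crossproducts really do sum to the full product:

```r
library(parallel)

## Keep BLAS single-threaded inside the forks so the mclapply() workers
## do not oversubscribe the cores -- e.g. with the inline wrapper above:
##   openblas.set.num.threads(1)

mat   <- matrix(rnorm(1e4), ncol = 10)
group <- rep(1:2, length.out = nrow(mat))
mm    <- lapply(split(seq_len(nrow(mat)), group),
                function(i) mat[i, , drop = FALSE])

## Row-wise split: the per-block crossproducts sum to t(mat) %*% mat.
a <- mclapply(mm, crossprod, mc.cores = 2)
b <- Reduce(`+`, a)

all.equal(b, crossprod(mat))   # TRUE
```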


2014-02-18 18:38 GMT+08:00 Roger Bivand <Roger.Bivand at nhh.no>:

> wesley goi <wesley at ...> writes:
>
> >
> > Hi xuening,
> >
> > I use multicore's mclapply() function extensively and have recently
> > changed the BLAS lib to openblas to help with running a PCA on a big
> > matrix, everything ran fine. However, I was wondering if the openblas
> > lib will interfere with multicore.
> >
> > So I guess so far there's no way to assign the threads which
> > openblas uses, hence it shouldn't be used in a multicore script to be
> > submitted to a cluster, else it'll consume all the cores?
>
> Please do use the list archives; the thread:
>
> https://stat.ethz.ch/pipermail/r-sig-hpc/2012-April/001339.html
>
> provides much insight into the AFFINITY issue - see also mcaffinity()
> in the parallel package. If your BLAS is trying to use all available
> cores anyway, and you then try to run in parallel on top of that, your
> high-level processes will compete across the available cores for
> resources with BLAS, as each BLAS call on each core will try to
> spread work across the same set of cores. Please also see:
>
> http://www.jstatsoft.org/v31/i01/
>
> and perhaps also:
>
> http://ideas.repec.org/p/hhs/nhheco/2010_025.html
>
> Neither are new, but are based on trying things out rather than
> speculating. As pointed out before, Brian's comment tells you what you
> need to know:
>
> https://stat.ethz.ch/pipermail/r-sig-hpc/attachments/20140213/662780c9/attachment.pl
>
> Hope this clarifies,
>
> Roger
>
> >
> > On 18 Feb, 2014, at 11:25 am, Xuening Zhu <puddingnnn529 at ...> wrote:
> >
> > > Hi Wesley:
> > > I installed open-blas before. It went well when I ran serial
> > operations. 2 threads can be seen in 'top'. But
> > I can't change the thread number through the methods it provided.
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>



--
Xuening Zhu
--------------------------------------------------------
Master of Business Statistics
Guanghua School of Management, Peking University



