[R-sig-hpc] Why pure computation time in parallel is longer than the serial version?
beleites,claudia
claudia.beleites at ipht-jena.de
Sat Feb 22 12:43:08 CET 2014
Hi Xuening,
On 2 physical cores vs. 2 physical * 2 logical threads, see e.g. http://unix.stackexchange.com/a/88290
You say you have 2 *physical* cores. That's the number you want to use for the parallel execution. Logical cores are just 2 (or more) threads running on the same physical core. IIRC, this can speed up things mainly if the 2 threads run very different operations.
I think this does not help the BLAS because there you have massive amounts of the *same* operations and they can easily be parallelized. Instead, you get overhead and maybe caching/scheduling "conflicts".
I think if you want to profit from the 2 phys * 2 log. thread architecture, you'd need to optimize compilation to be aware of this. But even then I'd not expect too much here: 2 threads on the physical cores probably don't leave much space for other calculations to be done "meanwhile".
All in all, I think it is just the same behaviour you see when scheduling more threads than cores in general (e.g. on a machine that has 1 logical core per physical core).
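As a minimal sketch of "use the number of physical cores", assuming the parallel package is available: detectCores(logical = FALSE) asks for the physical count, but on some platforms it may return NA or simply the logical count, so the result is worth checking against what you know about the machine. The variable names (n_logical, n_physical, n_workers) are only illustrative.

```r
library(parallel)

# Logical CPUs (hardware threads) vs. physical cores, as far as R can tell.
n_logical  <- detectCores(logical = TRUE)
n_physical <- detectCores(logical = FALSE)

# detectCores() can return NA; fall back to a single worker in that case.
n_workers <- if (is.na(n_physical)) 1L else n_physical

# Never ask for more workers than there are logical CPUs.
if (!is.na(n_logical)) n_workers <- min(n_workers, n_logical)
n_workers <- max(1L, as.integer(n_workers))

cat("logical:", n_logical, " physical:", n_physical,
    " workers to use:", n_workers, "\n")
```

On the i5-3210M discussed here one would hope to see 4 logical and 2 physical, and n_workers would then be the value to pass as the thread count or as mc.cores.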
HTH,
Claudia
--
Claudia Beleites, Chemist
Spectroscopy/Imaging
Leibniz Institute of Photonic Technology
Albert-Einstein-Str. 9
07745 Jena
Germany
email: claudia.beleites at ipht-jena.de
phone: +49 3641 206-133
fax: +49 2641 206-399
________________________________________
From: r-sig-hpc-bounces at r-project.org [r-sig-hpc-bounces at r-project.org] on behalf of Xuening Zhu [puddingnnn529 at gmail.com]
Sent: Saturday, 22 February 2014 11:30
To: Roger Bivand
Cc: r-sig-hpc at r-project.org
Subject: Re: [R-sig-hpc] Why pure computation time in parallel is longer than the serial version?
Roger,
Many thanks! I've done some further experiments using OpenBLAS for
parallel computation.
My CPU is an *Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz*, with 2 physical
cores and 2 additional logical cores. The memory size is 8 GB, and my
operating system is Ubuntu 12.04.
I chose a 10^3 * 10^4 matrix and want to time its
multiplication (t(m)%*%m). I don't consider tcrossprod() because I just
want to make the computation longer. More cases can be compared later.
R (3.0.2) is recompiled with OpenBLAS. The inline package is used to
define a function that changes the number of threads, as below:
require(inline)
openblas.set.num.threads <- cfunction(
    signature(ipt = "integer"),
    body = 'openblas_set_num_threads(*ipt);',
    otherdefs = c('extern void openblas_set_num_threads(int);'),
    libargs = c('-L/home/pudding/OpenBLAS/ -lopenblas'),
    language = "C",
    convention = ".C")
##################################################################################
First I only compare open-blas with default R BLAS in the experiment:
(1) multiplication with the default R BLAS:
> mat = matrix(1:1e7,ncol=1e4)
> system.time(t(mat)%*%mat)
user system elapsed
84.517 0.320 85.090
(2) open-blas with 2 threads specified:
> openblas.set.num.threads(2)
$ipt
[1] 2
> system.time(t(mat)%*%mat)
user system elapsed
10.164 0.512 5.549
(3) open-blas with 4 threads specified:
> openblas.set.num.threads(4)
$ipt
[1] 4
> system.time(t(mat)%*%mat)
user system elapsed
26.954 1.556 8.147
It is a little strange that 4 threads are even slower than 2 threads!
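The comparison above can be repeated more systematically with a small timing helper. This is only a sketch: time_matmul and its set_threads argument are hypothetical names, and set_threads defaults to a no-op so the code runs even without OpenBLAS. With the setup described above one would pass openblas.set.num.threads as set_threads.

```r
# Time t(mat) %*% mat for several BLAS thread counts.
# set_threads: a function taking one integer that changes the BLAS thread
# count (assumption: something like openblas.set.num.threads above).
time_matmul <- function(mat, thread_counts,
                        set_threads = function(n) invisible(NULL),
                        reps = 3) {
  timings <- sapply(thread_counts, function(n) {
    set_threads(as.integer(n))
    # Repeat the measurement and keep the median elapsed time,
    # since a single run can be noisy.
    elapsed <- replicate(reps, system.time(t(mat) %*% mat)[["elapsed"]])
    median(elapsed)
  })
  setNames(timings, thread_counts)
}

# Small example matrix so the sketch finishes quickly.
small <- matrix(rnorm(200 * 100), nrow = 200)
res <- time_matmul(small, c(1, 2, 4))
print(res)
```

For the matrix in this thread the call would be something like time_matmul(mat, c(1, 2, 4), set_threads = openblas.set.num.threads).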
##################################################################
Then I want to mix multicore with OpenBLAS. I try to change the implicit
parallelism of matrix multiplication into an explicit version. So I just
split the data into several partitions, and things become very weird here.
(1) First I specify the number of threads to be 1 in open-blas, and 2 cores
are used in mclapply:
> openblas.set.num.threads(1)
$ipt
[1] 1
> system.time({
+ group = sample(rep(1:8,length.out=ncol(mat)))
+ mm = lapply(split(seq(ncol(mat)),group),function(i) mat[,i])
+ #mcaffinity(1:8)
+ #system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
+ #cores = detectCores()
+ a = mclapply(mm,function(m){
+ cat('Running!!\n')
+ t(m)%*%m
+ #tcrossprod(m)
+ },mc.cores=2)
+ b = Reduce("+",a)
+ })
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
user system elapsed
0.352 0.168 1.363
(2) Then I change cores in mclapply (in parallel package) from 2 to 4. The
time is even longer as below. (Sometimes it even gives me a segfault error
but unfortunately I have no means to reproduce it ><.)
> system.time({
+ group = sample(rep(1:8,length.out=ncol(mat)))
+ mm = lapply(split(seq(ncol(mat)),group),function(i) mat[,i])
+ #mcaffinity(1:8)
+ #system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
+ #cores = detectCores()
+ a = mclapply(mm,function(m){
+ cat('Running!!\n')
+ t(m)%*%m
+ #tcrossprod(m)
+ },mc.cores=4)
+ b = Reduce("+",a)
+ })
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
user system elapsed
0.400 0.148 1.597
(3) When I change the number of OpenBLAS threads from 1 to some number > 1,
mclapply no longer returns any results. I guess there is some conflict
between mclapply and the multi-threading in OpenBLAS.
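A note on the partitioning used in (1) and (2): mat[,i] selects columns, so each t(m)%*%m is a 1250 x 1250 block, and the sum of the eight blocks is not the same matrix as t(mat)%*%mat. It also does only about one eighth of the arithmetic, so those timings are not directly comparable with the full product. If the aim is to reproduce t(mat)%*%mat, one option is to split by rows instead, since the full product equals the sum of t(m_k)%*%m_k over row blocks m_k. The sketch below assumes that aim and uses a small random matrix so the result can be checked; row_blocks, n_cores, partial, parallel_result and serial_result are illustrative names.

```r
library(parallel)

# Small stand-in for the 10^3 x 10^4 matrix so the check runs quickly.
set.seed(1)
mat <- matrix(rnorm(200 * 50), nrow = 200)

# Assign each ROW (not column) to one of 8 groups.
group <- rep(1:8, length.out = nrow(mat))
row_blocks <- split(seq_len(nrow(mat)), group)

# mclapply relies on forking, which is unavailable on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each worker computes the partial product for its block of rows.
partial <- mclapply(row_blocks, function(i) {
  m <- mat[i, , drop = FALSE]
  t(m) %*% m
}, mc.cores = n_cores)

# Summing the partial products gives the full ncol x ncol result.
parallel_result <- Reduce(`+`, partial)
serial_result <- t(mat) %*% mat

print(all.equal(parallel_result, serial_result))
```

With row blocks every worker returns a full ncol(mat) x ncol(mat) matrix, so for the 10^4-column case the memory and transfer cost per worker is large and may outweigh any gain.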
2014-02-18 18:38 GMT+08:00 Roger Bivand <Roger.Bivand at nhh.no>:
> wesley goi <wesley at ...> writes:
>
> >
> > Hi xuening,
> >
> > I use multicore's mclapply() function extensively and have recently
> > changed the BLAS lib to openblas to help with running a PCA on a big
> > matrix, everything ran fine. However, I was wondering if the openblas
> > lib will interfere with multicore.
> >
> > So i guess so far there's no way to assigned the threads which
> > openblas uses hence it shdnt be used in a multicore script to be
> > submitted to a cluster else it'll consume all the cores?
>
> Please do use the list archives; the thread:
>
> https://stat.ethz.ch/pipermail/r-sig-hpc/2012-April/001339.html
>
> provides much insight into the AFFINITY issue - see also mcaffinity()
> in the parallel package. If your BLAS is trying to use all available
> cores anyway, and you then try to run in parallel on top of that, your
> high-level processes will compete across the available cores for
> resources with BLAS, as each BLAS call on each core will try to
> spread work across the same set of cores. Please also see:
>
> http://www.jstatsoft.org/v31/i01/
>
> and perhaps also:
>
> http://ideas.repec.org/p/hhs/nhheco/2010_025.html
>
> Neither are new, but are based on trying things out rather than
> speculating. As pointed out before, Brian's comment tells you what you
> need to know:
>
> https://stat.ethz.ch/pipermail/r-sig-hpc/attachments/20140213/662780c9/attachment.pl
>
> Hope this clarifies,
>
> Roger
>
> >
> > On 18 Feb, 2014, at 11:25 am, Xuening Zhu <puddingnnn529 at ...> wrote:
> >
> > > Hi Wesley:
> > > I installed open-blas before. It went well when I run serial
> > operations. 2 threads can be seen in 'top'. But
> > I can't change thread number through the methods it provided.
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>
--
Xuening Zhu
--------------------------------------------------------
Master of Business Statistics
Guanghua School of Management, Peking University