[R-sig-hpc] Why pure computation time in parallel is longer than the serial version?
Norm Matloff
matloff at cs.ucdavis.edu
Wed Feb 26 06:49:34 CET 2014

I think our thread here (pun intended) has become somewhat unclear.
I believe Simon was using the term "hyperthreading" in a broader sense
than George does below.

What Simon was describing is the general notion of running more threads
than there are cores. This is motivated, as he said, by memory
hierarchy considerations, e.g. the need to keep the cores busy in case
of a page fault. The thread that had been running in such a situation
is suspended pending the paging operation, but with extra threads
available the OS can schedule another one and keep the machine fully
utilized. This is sometimes termed "oversubscription" of the cores.
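
As a rough illustration of oversubscription, here is a minimal R sketch.
It assumes a Unix-like system (mclapply() forks, so mc.cores > 1 is not
available on Windows), and it uses Sys.sleep() purely as a stand-in for a
thread stalled on a page fault. The names stalled_task and time_workers
are made up for this example.

```r
library(parallel)

# Each task "stalls" for a fixed time. Sys.sleep() is only a stand-in
# (an assumption of this sketch) for a thread blocked on a paging operation.
stalled_task <- function(i, stall = 0.2) {
  Sys.sleep(stall)
  i
}

# Wall-clock time to run n_tasks stalled tasks on n_workers forked workers.
time_workers <- function(n_workers, n_tasks = 8) {
  unname(system.time(
    mclapply(seq_len(n_tasks), stalled_task, mc.cores = n_workers)
  )["elapsed"])
}

workers <- c(2, 4, 8)
elapsed <- vapply(workers, time_workers, numeric(1))
print(data.frame(workers = workers, elapsed = elapsed))
```

Because the sleeping tasks use no CPU at all, this exaggerates the
benefit. Real workloads only partly stall, so the gain from extra
workers has to be measured case by case.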

The strict use of the term "hyperthreading," which I believe George (and
Claudia) meant, is a hardware issue. Modern CPUs are highly pipelined,
with multiple arithmetic/logic units, so it is possible to run more than
one thread at a time (though possibly at different stages of the pipeline)
on a core. Thus a quad-core machine with hyperthreading degree 2 acts
somewhat like an eight-core machine, though not quite, due to the
delicate nature of the pipeline scheduling.

Combining Simon's and George's posts, one might profitably run more
than 8 threads on that quad-core machine. And by the way, the OS, at
least in Unix-family systems, will show the machine as having 8 cores.
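
To see what R itself reports, something like the following sketch can be
used. It relies on detectCores() from the parallel package. Note that the
'logical' argument is documented as being honoured only on some
platforms, so on Linux both calls may well return the same number, and
either may be NA.

```r
library(parallel)

# Logical CPUs as reported by the OS (includes hyperthreads where enabled).
logical_cpus <- detectCores(logical = TRUE)

# Physical cores, where the platform can report them. On platforms that
# ignore the 'logical' argument this may just repeat the logical count.
physical_cores <- detectCores(logical = FALSE)

cat("logical CPUs:  ", logical_cpus, "\n")
cat("physical cores:", physical_cores, "\n")
```

If the two numbers differ, their ratio gives a hint of the hyperthreading
degree. If they agree, that does not by itself show that hyperthreading
is off.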
Norm
On Wed, Feb 26, 2014 at 12:17:11AM -0500, George Ostrouchov wrote:
> A nice explanation of hyperthreading is at
> http://lifehacker.com/how-hyper-threading-really-works-and-when-its-actuall-1394216262
>
> While HT generally does not hurt, there is more going on because the
> matrices are partitioned differently for 2 threads than for 4 threads,
> resulting in essentially a slightly different algorithm. This is what
> we see in distributed matrix algorithms in pbdDMAT, where the defaults
> would be a 1x2 partition vs. 2x2 partition of the matrix, giving
> slightly different algorithms. Most of the same principles apply in
> shared memory.
>
> And yes, HT is used in high-end HPC. Placing high on the top500 usually
> takes using all resources as efficiently as possible. See for example
> https://www.olcf.ornl.gov/kb_articles/hyper-threading/
>
> --George
>
> On 2/25/14 1:41 PM, Jim Gattiker wrote:
> > I second that hyperthreading can be valuable in scientific computations.
> > If one is after the maximum performance (on a desktop or workgroup cluster)
> > in my experience there's no alternative to doing a little benchmarking of
> > application scaling characteristics before kicking off a 'production' run.
> > Just a fact of life at this time, and I don't see it changing anytime soon.
> > FWIW, I suspect that because HT is often not used in high-end HPC
> > (because there's enough money to meet spec with physical cores), and lore
> > is HT doesn't help much for the specific application of gaming, some come
> > to an incorrect generalization.
> >
> > --j
> >
> >
> > On Tue, Feb 25, 2014 at 7:31 AM, Simon Urbanek
> > <simon.urbanek at r-project.org> wrote:
> >
> >> On Feb 22, 2014, at 1:33 PM, Roger Bivand <Roger.Bivand at nhh.no> wrote:
> >>
> >>> On Sat, 22 Feb 2014, beleites,claudia wrote:
> >>>
> >>>> Hi Xuening,
> >>>>
> >>>> 2 physical vs 2 physical * 2 logical threads: See e.g. here:
> >>>> http://unix.stackexchange.com/a/88290
> >>>> You say you have 2 *physical* cores. That's the number you want to use
> >>>> for the parallel execution. Logical cores are just 2 (or more) threads
> >>>> running on the same physical core. IIRC, this can speed up things mainly if
> >>>> the 2 threads run very different operations.
> >>> Yes, this is my experience - I turn off Intel hyperthreads in BIOS to
> >>> prevent software getting confused. BLAS sees available compute resources,
> >>> so your BLAS may be installed to see 4 cores, but doesn't know that two are
> >>> hyperthreads and compete for physical resources. It may be that by limiting
> >>> BLAS to 2, it gets privileged access to the two real cores, and other OS
> >>> (or other) tasks running at the same time use the hyperthreads.
> >> I disagree - you can easily get more performance than the number of
> >> physical cores because there are other components like memory
> >> allocation/access. For example, on Nehalem 8 cores I get more than 13x
> >> speed up (relative to single core) when using 16 HTs on mt BLAS operations,
> >> because other HTs can compute while non-computing parts of the operation
> >> are requested and a dedicated core would just wait. It is generally
> >> recommended to use more threads than the number of physical cores. HTs do
> >> make a significant difference (here over 60% faster than without). That
> >> said, as everything in the parallel world, this really depends on the
> >> actual use case (we have seen operations that are actually faster when run
> >> serially than on a mt BLAS but that's another story).
> >>
> >> Cheers,
> >> Simon
> >>
> >>
> >>
> >>> Roger
> >>>
> >>>> I think this does not help the BLAS because there you have massive
> >>>> amounts of the *same* operations and they can easily be parallelized.
> >>>> Instead, you get overhead and maybe caching/scheduling "conflicts".
> >>>> I think if you want to profit from the 2 phys * 2 log. thread
> >>>> architecture, you'd need to optimize compilation to be aware of this. But
> >>>> even then I'd not expect too much here: 2 threads on the physical cores
> >>>> probably don't leave much space for other calculations to be done
> >>>> "meanwhile".
> >>>> All in all, I think it is just the same behaviour you see when
> >>>> scheduling more threads than cores in general (e.g. on a machine that has 1
> >>>> logical core per physical core).
> >>>> HTH,
> >>>>
> >>>> Claudia
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Claudia Beleites, Chemist
> >>>> Spectroscopy/Imaging
> >>>> Leibniz Institute of Photonic Technology
> >>>> Albert-Einstein-Str. 9
> >>>> 07745 Jena
> >>>> Germany
> >>>>
> >>>> email: claudia.beleites at ipht-jena.de
> >>>> phone: +49 3641 206-133
> >>>> fax: +49 2641 206-399
> >>>>
> >>>>
> >>>> ________________________________________
> >>>> From: r-sig-hpc-bounces at r-project.org [r-sig-hpc-bounces at r-project.org]
> >>>> on behalf of Xuening Zhu [puddingnnn529 at gmail.com]
> >>>> Sent: Saturday, 22 February 2014 11:30
> >>>> To: Roger Bivand
> >>>> Cc: r-sig-hpc at r-project.org
> >>>> Subject: Re: [R-sig-hpc] Why pure computation time in parallel is
> >>>> longer than the serial version?
> >>>> Roger,
> >>>> Many thanks! I've done some further experiments to exploit something
> >>>> like OpenBLAS for parallelism.
> >>>> My CPU is an *Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz.* It has 2 physical
> >>>> cores and 2 additional logical cores. The memory size is 8 GB, and my
> >>>> operating system is Ubuntu 12.04.
> >>>>
> >>>> I chose a 10^3 * 10^4 matrix and want to time its
> >>>> multiplication (t(m)%*%m). I don't consider tcrossprod() because I just
> >>>> want to make the computation longer. Maybe more cases can be compared
> >>>> later.
> >>>> R (3.0.2) is recompiled with OpenBLAS. A function built with the inline
> >>>> package, defined below, is used to change the number of threads:
> >>>> require(inline)
> >>>> openblas.set.num.threads <- cfunction(signature(ipt = "integer"),
> >>>>     body = 'openblas_set_num_threads(*ipt);',
> >>>>     otherdefs = c('extern void openblas_set_num_threads(int);'),
> >>>>     libargs = c('-L/home/pudding/OpenBLAS/ -lopenblas'),
> >>>>     language = "C",
> >>>>     convention = ".C")
> >>>>
> >>>> ##################################################################################
> >>>> First I only compare open-blas with default R BLAS in the experiment:
> >>>> (1) multiplication with the default R BLAS:
> >>>>
> >>>>> mat = matrix(1:1e7,ncol=1e4)
> >>>>> system.time(t(mat)%*%mat)
> >>>> user system elapsed
> >>>> 84.517 0.320 85.090
> >>>>
> >>>>
> >>>> (2) open-blas with 2 threads specified:
> >>>>
> >>>>> openblas.set.num.threads(2)
> >>>> $ipt
> >>>> [1] 2
> >>>>> system.time(t(mat)%*%mat)
> >>>> user system elapsed
> >>>> 10.164 0.512 5.549
> >>>>
> >>>> (3) open-blas with 4 threads specified:
> >>>>
> >>>>> openblas.set.num.threads(4)
> >>>> $ipt
> >>>> [1] 4
> >>>>
> >>>>> system.time(t(mat)%*%mat)
> >>>> user system elapsed
> >>>> 26.954 1.556 8.147
> >>>>
> >>>> It is a little strange that 4 threads are even slower than 2 threads!
> >>>>
> >>>> ##################################################################
> >>>> Then I wanted to mix multicore with OpenBLAS. I tried to change the
> >>>> implicit parallelism of matrix multiplication into an explicit version,
> >>>> so I just split the data into several partitions, and things become very
> >>>> weird here.
> >>>>
> >>>> (1) First I set the number of threads to 1 in OpenBLAS, and 2 cores
> >>>> are used in mclapply:
> >>>>> openblas.set.num.threads(1)
> >>>> $ipt
> >>>> [1] 1
> >>>>
> >>>>> system.time({
> >>>> + group = sample(rep(1:8,length.out=ncol(mat)))
> >>>> + mm = lapply(split(seq(ncol(mat)),group),function(i) mat[,i])
> >>>> + #mcaffinity(1:8)
> >>>> + #system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
> >>>> + #cores = detectCores()
> >>>> + a = mclapply(mm,function(m){
> >>>> + cat('Running!!\n')
> >>>> + t(m)%*%m
> >>>> + #tcrossprod(m)
> >>>> + },mc.cores=2)
> >>>> + b = Reduce("+",a)
> >>>> + })
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> user system elapsed
> >>>> 0.352 0.168 1.363
> >>>>
> >>>> (2) Then I changed the cores in mclapply (in the parallel package) from 2
> >>>> to 4. The time is even longer, as shown below. (Sometimes it even gives me
> >>>> a segfault error, but unfortunately I have no means to reproduce it.)
> >>>>
> >>>>> system.time({
> >>>> + group = sample(rep(1:8,length.out=ncol(mat)))
> >>>> + mm = lapply(split(seq(ncol(mat)),group),function(i) mat[,i])
> >>>> + #mcaffinity(1:8)
> >>>> + #system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
> >>>> + #cores = detectCores()
> >>>> + a = mclapply(mm,function(m){
> >>>> + cat('Running!!\n')
> >>>> + t(m)%*%m
> >>>> + #tcrossprod(m)
> >>>> + },mc.cores=4)
> >>>> + b = Reduce("+",a)
> >>>> + })
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> Running!!
> >>>> user system elapsed
> >>>> 0.400 0.148 1.597
> >>>>
> >>>> (3) When I change the number of threads from 1 to some number > 1, I
> >>>> can no longer get results back from mclapply. There are some
> >>>> conflicts between mclapply and OpenBLAS's multi-threading, I guess.
> >>>>
> >>>>
> >>>> 2014-02-18 18:38 GMT+08:00 Roger Bivand <Roger.Bivand at nhh.no>:
> >>>>
> >>>>> wesley goi <wesley at ...> writes:
> >>>>>
> >>>>>> Hi xuening,
> >>>>>>
> >>>>>> I use multicore's mclapply() function extensively and have recently
> >>>>>> changed the BLAS lib to openblas to help with running a PCA on a big
> >>>>>> matrix, everything ran fine. However, I was wondering if the openblas
> >>>>>> lib will interfere with multicore.
> >>>>>>
> >>>>>> So I guess so far there's no way to assign the threads which
> >>>>>> openblas uses, hence it shouldn't be used in a multicore script to be
> >>>>>> submitted to a cluster, else it'll consume all the cores?
> >>>>> Please do use the list archives; the thread:
> >>>>>
> >>>>> https://stat.ethz.ch/pipermail/r-sig-hpc/2012-April/001339.html
> >>>>>
> >>>>> provides much insight into the AFFINITY issue - see also mcaffinity()
> >>>>> in the parallel package. If your BLAS is trying to use all available
> >>>>> cores anyway, and you then try to run in parallel on top of that, your
> >>>>> high-level processes will compete across the available cores for
> >>>>> resources with BLAS, as each BLAS call on each core will try to
> >>>>> spread work across the same set of cores. Please also see:
> >>>>>
> >>>>> http://www.jstatsoft.org/v31/i01/
> >>>>>
> >>>>> and perhaps also:
> >>>>>
> >>>>> http://ideas.repec.org/p/hhs/nhheco/2010_025.html
> >>>>>
> >>>>> Neither is new, but both are based on trying things out rather than
> >>>>> speculating. As pointed out before, Brian's comment tells you what you
> >>>>> need to know:
> >>>>>
> >>>>>
> >>>>> https://stat.ethz.ch/pipermail/r-sig-hpc/attachments/20140213/662780c9/attachment.pl
> >>>>>
> >>>>> Hope this clarifies,
> >>>>>
> >>>>> Roger
> >>>>>
> >>>>>> On 18 Feb, 2014, at 11:25 am, Xuening Zhu <puddingnnn529 at ...> wrote:
> >>>>>>
> >>>>>>> Hi Wesley:
> >>>>>>> I installed open-blas before. It went well when I run serial
> >>>>>> operations. 2 threads can be seen in 'top'. But
> >>>>>> I can't change thread number through the methods it provided.
> >>>>> _______________________________________________
> >>>>> R-sig-hpc mailing list
> >>>>> R-sig-hpc at r-project.org
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Xuening Zhu
> >>>> --------------------------------------------------------
> >>>> Master of Business Statistics
> >>>> Guanghua School of Management, Peking University
> >>>>
> >>>> [[alternative HTML version deleted]]
> >>>>
> >>>>
> >>> --
> >>> Roger Bivand
> >>> Department of Economics, Norwegian School of Economics,
> >>> Helleveien 30, N-5045 Bergen, Norway.
> >>> voice: +47 55 95 93 55; fax +47 55 95 95 43
> >>> e-mail: Roger.Bivand at nhh.no