[R-sig-hpc] Why pure computation time in parallel is longer than the serial version?

Mon Feb 24 05:44:25 CET 2014

Hi,

2014-02-22 19:30 GMT+09:00 Xuening Zhu <puddingnnn529 at gmail.com>:
>
> My cpu is *Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz.* There are 2 physical
> cores and additional 2 logical cores. The memory size is 8G. And my

Theoretical peak performance of your CPU:

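SSE (128-bit SIMD): 4 FLOPs per clock x 2.5 GHz x 2 physical cores = 20 GFLOPS
AVX (256-bit SIMD): 8 FLOPs per clock x 2.5 GHz x 2 physical cores = 40 GFLOPS
# Only the 2 physical cores count: the 2 extra logical (Hyper-Threading) cores share the same floating-point units, so there are only two compute units.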
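The peak figures above are simply FLOPs per clock x clock rate x number of physical cores. Here is a minimal Python sketch of that arithmetic. It assumes double precision and the per-clock figures quoted above (4 for SSE, 8 for AVX). The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(flops_per_clock: int, clock_ghz: float, physical_cores: int) -> float:
    """Theoretical peak in GFLOPS: FLOPs per clock x GHz x physical cores.

    Logical (Hyper-Threading) cores are deliberately not counted, because
    they share the floating-point units of their physical core.
    """
    return flops_per_clock * clock_ghz * physical_cores


# i5-3210M: 2.5 GHz, 2 physical cores (double-precision rates assumed).
sse_peak = peak_gflops(4, 2.5, 2)
avx_peak = peak_gflops(8, 2.5, 2)
print(f"SSE peak: {sse_peak:.0f} GFLOPS")
print(f"AVX peak: {avx_peak:.0f} GFLOPS")
```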

The operation count of DGEMM (the BLAS matrix multiply) is O(N^3).
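To see where O(N^3) comes from, consider a product of an (m x k) matrix and a (k x n) matrix. Each of the m*n output entries needs k multiplications and (roughly) k additions, so the conventional count is 2*m*k*n. With all dimensions equal to N, that is 2N^3. The sketch below shows the cubic growth: doubling N multiplies the work by 8. The helper name `gemm_flops` is hypothetical.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """Conventional FLOP count for an (m x k) by (k x n) matrix product."""
    return 2 * m * k * n


for size in (1000, 2000, 4000):
    print(size, gemm_flops(size, size, size) / 1e9, "GFLOP")

# Doubling every dimension multiplies the work by 2**3 = 8.
ratio = gemm_flops(2000, 2000, 2000) / gemm_flops(1000, 1000, 1000)
print(ratio)
```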

>
> I chose a 10^3 * 10^4 matrix and want to evaluate the time of its
> multiplication (t(m)%*%m). I don't consider tcrossprod() because I just
> want to make the computation longer. Maybe more cases can be compared later.

The operation count of your computation is likewise O(N^3).
It requires 2*(3e3 * 4e3 * (3e3+4e3)/2) = 84 GFLOP.

With SSE: 84 / 20 = 4.2 sec
With AVX: 84 / 40 = 2.1 sec

So at the theoretical peak performance of your CPU, it takes 2.1 seconds.
Real-world efficiency is usually about 80% to 90% of peak, so it should
normally take about 2.3 to 2.6 seconds.
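The estimate above divides the required work by the peak rate, then allows for real-world efficiency. The sketch below reproduces those numbers. It takes the work from the expression given earlier and assumes the stated 80% to 90% efficiency range. The variable names are illustrative only.

```python
# Work estimated above: 2*(3e3 * 4e3 * (3e3+4e3)/2) FLOPs, in GFLOP.
work_gflop = 2 * (3e3 * 4e3 * (3e3 + 4e3) / 2) / 1e9

peaks = {"SSE": 20.0, "AVX": 40.0}  # GFLOPS, from the peak figures above

for name, peak in peaks.items():
    print(f"{name}: {work_gflop / peak:.1f} s at peak")

# At 80-90% of the AVX peak, the expected time is roughly 2.3-2.6 s.
for efficiency in (0.9, 0.8):
    seconds = work_gflop / (peaks["AVX"] * efficiency)
    print(f"AVX at {efficiency:.0%}: {seconds:.1f} s")
```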

>    user  system elapsed
>  10.164   0.512   5.549

Maybe this is Nehalem-class performance (Nehalem has no AVX), as if only
the SSE rate were being reached. Also, Hyper-Threading decreases cache
efficiency in this computation.
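The reported timings can be turned into an achieved rate in the same way. The rough sketch below assumes that the 84 GFLOP work estimate applies to the quoted run. It gives about 15 GFLOPS, which is below even the 20 GFLOPS SSE peak. That fits the guess that AVX is not being used. The user/elapsed ratio (about 1.8) also suggests that roughly two threads were busy on average.

```python
# Timings from the R output quoted above (seconds).
user, sys_time, elapsed = 10.164, 0.512, 5.549

work_gflop = 84.0               # assumed work for the multiplication, in GFLOP
achieved = work_gflop / elapsed  # sustained GFLOPS over wall-clock time
busy_threads = user / elapsed    # average number of threads doing user work

print(f"achieved: {achieved:.1f} GFLOPS")
print(f"fraction of SSE peak (20 GFLOPS): {achieved / 20:.0%}")
print(f"fraction of AVX peak (40 GFLOPS): {achieved / 40:.0%}")
print(f"average busy threads: {busy_threads:.2f}")
```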

Best Regards,
-- 
EI-JI Nakama  <nakama (a) ki.rim.or.jp>
"中間栄治"  <nakama (a) ki.rim.or.jp>