[Rd] MKL Acceleration encouraging; need adjust package builds?
Paul Johnson
pauljohn32 at gmail.com
Mon Nov 23 18:27:30 CET 2015
Dear R-devel:
The cluster administrators at KU got enthusiastic about testing
R-3.2.2 with Intel MKL when I asked for some BLAS integration. Below
I forward their performance report, which is encouraging; I thought you
would like to know the numbers. To my untrained eye, there are
some extraordinary speedups on Cholesky decomposition, determinants,
and matrix inversion.
They had difficulty getting R to compile with R's shared BLAS (I don't
know what went wrong there), so they went the other direction and
configured R with --disable-BLAS-shlib.
1. In his message to me, the technician says that I should consider
adjusting the compilation flags on the packages that use BLAS. Do you
think that is needed? R is compiled with non-shared BLAS libraries;
won't packages know where to look for BLAS headers?
2. If I need to do that, I wonder how to do it and which packages need
attention: the Eigen and Armadillo packages, possibly the ones that
depend on them, lme4, and anything flowing through Rcpp.
Here's the build for some packages. Are they finding MKL BLAS? How
would I know?
* installing *source* package 'RcppArmadillo' ...
** package 'RcppArmadillo' successfully unpacked and MD5 sums checked
* checking LAPACK_LIBS: divide-and-conquer complex SVD available via
system LAPACK
** libs
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
-I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -c RcppArmadillo.cpp -o RcppArmadillo.o
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
-I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -c RcppExports.cpp -o RcppExports.o
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
-I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -c fastLm.cpp -o fastLm.o
g++ -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib
-L/usr/local/lib64 -o RcppArmadillo.so RcppArmadillo.o RcppExports.o
fastLm.o -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread
-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm
-L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR
installing to /panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/RcppArmadillo/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (RcppArmadillo)
* installing *source* package 'RcppEigen' ...
** package 'RcppEigen' successfully unpacked and MD5 sums checked
** libs
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
-I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -c RcppEigen.cpp -o RcppEigen.o
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
-I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -c RcppExports.cpp -o RcppExports.o
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
-I../inst/include -fpic -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -c fastLm.cpp -o fastLm.o
g++ -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib
-L/usr/local/lib64 -o RcppEigen.so RcppEigen.o RcppExports.o fastLm.o
-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread
-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm
-L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR
installing to /panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/RcppEigen/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (RcppEigen)
* installing *source* package 'MatrixModels' ...
** package 'MatrixModels' successfully unpacked and MD5 sums checked
** R
** preparing package for lazy loading
Creating a generic function for 'resid' from package 'stats' in
package 'MatrixModels'
Creating a generic function for 'fitted.values' from package 'stats'
in package 'MatrixModels'
Creating a generic function for 'coefficients' from package 'stats' in
package 'MatrixModels'
Creating a generic function for 'formula' from package 'stats' in
package 'MatrixModels'
Creating a generic function for 'coef' from package 'stats' in package
'MatrixModels'
Creating a generic function for 'fitted' from package 'stats' in
package 'MatrixModels'
Creating a generic function for 'residuals' from package 'stats' in
package 'MatrixModels'
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (MatrixModels)
* installing *source* package 'quantreg' ...
** package 'quantreg' successfully unpacked and MD5 sums checked
** libs
gfortran -fpic -g -O2 -c akj.f -o akj.o
gfortran -fpic -g -O2 -c boot.f -o boot.o
gfortran -fpic -g -O2 -c brute.f -o brute.o
gcc -std=gnu99 -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include -fpic
-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64 -c chlfct.c -o
chlfct.o
gfortran -fpic -g -O2 -c cholesky.f -o cholesky.o
gfortran -fpic -g -O2 -c combos.f -o combos.o
gfortran -fpic -g -O2 -c crq.f -o crq.o
gfortran -fpic -g -O2 -c crqfnb.f -o crqfnb.o
gfortran -fpic -g -O2 -c dsel05.f -o dsel05.o
gfortran -fpic -g -O2 -c etime.f -o etime.o
gfortran -fpic -g -O2 -c extract.f -o extract.o
gfortran -fpic -g -O2 -c idmin.f -o idmin.o
gfortran -fpic -g -O2 -c iswap.f -o iswap.o
gfortran -fpic -g -O2 -c kuantile.f -o kuantile.o
gcc -std=gnu99 -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include -fpic
-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64 -c mcmb.c -o
mcmb.o
gfortran -fpic -g -O2 -c penalty.f -o penalty.o
gfortran -fpic -g -O2 -c powell.f -o powell.o
gfortran -fpic -g -O2 -c rls.f -o rls.o
gfortran -fpic -g -O2 -c rq0.f -o rq0.o
gfortran -fpic -g -O2 -c rq1.f -o rq1.o
gfortran -fpic -g -O2 -c rqbr.f -o rqbr.o
gfortran -fpic -g -O2 -c rqfn.f -o rqfn.o
gfortran -fpic -g -O2 -c rqfnb.f -o rqfnb.o
gfortran -fpic -g -O2 -c rqfnc.f -o rqfnc.o
gfortran -fpic -g -O2 -c rqs.f -o rqs.o
gfortran -fpic -g -O2 -c sparskit2.f -o sparskit2.o
gcc -std=gnu99 -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include -fpic
-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64 -c srqfn.c -o
srqfn.o
gcc -std=gnu99 -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include -fpic
-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64 -c srqfnc.c -o
srqfnc.o
gfortran -fpic -g -O2 -c srtpai.f -o srtpai.o
gcc -std=gnu99 -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib
-L/usr/local/lib64 -o quantreg.so akj.o boot.o brute.o chlfct.o
cholesky.o combos.o crq.o crqfnb.o dsel05.o etime.o extract.o idmin.o
iswap.o kuantile.o mcmb.o penalty.o powell.o rls.o rq0.o rq1.o rqbr.o
rqfn.o rqfnb.o rqfnc.o rqs.o sparskit2.o srqfn.o srqfnc.o srtpai.o
-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread
-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm
-lgfortran -lm -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR
installing to /panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/quantreg/libs
** R
** data
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (quantreg)
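On the question of how one would know: one rough check is to read the final link command for each package. Below is a minimal sketch in Python (not part of the original exchange) that pulls the -l libraries out of the RcppArmadillo link line quoted above and reports the ones whose names start with mkl_. The helper names are made up for illustration.

```python
import shlex

# Final link command for RcppArmadillo.so, copied from the install log above.
LINK_CMD = (
    "g++ -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib "
    "-L/usr/local/lib64 -o RcppArmadillo.so RcppArmadillo.o RcppExports.o "
    "fastLm.o -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64 "
    "-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread "
    "-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm "
    "-L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR"
)


def linked_libraries(link_cmd):
    """Return the library names passed with -l, in order, without duplicates."""
    libs = []
    for token in shlex.split(link_cmd):
        if token.startswith("-l") and token[2:] not in libs:
            libs.append(token[2:])
    return libs


def mkl_libraries(link_cmd):
    """Return only the -l libraries whose name starts with 'mkl_'."""
    return [lib for lib in linked_libraries(link_cmd) if lib.startswith("mkl_")]


print(mkl_libraries(LINK_CMD))
```

If the list is non-empty, the package's shared object was at least linked against MKL. Running ldd on the installed .so is another common way to see which libraries it resolves to at load time.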
pj
Hi PJ,
We're still running the benchmarks to quantify the performance increase.
The R benchmarks for the MKL version are promising. The performance increase
varies from test to test, but there isn't any degradation in performance from
using the MKL version. You can expect a 2x to 10x performance increase
depending on the matrix calculations you are performing. Here are the
configure arguments we used to compile R with MKL:
--disable-BLAS-shlib
--with-blas="-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core
-Wl,--end-group -fopenmp -ldl -lpthread -lm" --with-lapack
You may want to include these while recompiling R packages which use BLAS.
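For context, packages that call BLAS normally do not hard-code the library. By convention their src/Makevars sets PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS), and R fills those variables in from the values recorded in its etc/Makeconf at configure time (R CMD config BLAS_LIBS prints the current value). The sketch below, in Python, parses a Makeconf-style text and checks whether BLAS_LIBS mentions MKL. The sample text is a hypothetical excerpt built from the --with-blas value above, not the actual file from this installation, and the function names are illustrative.

```python
import re

# Hypothetical Makeconf-style excerpt; the flags are the --with-blas value
# quoted in the message, assumed here to be what configure recorded.
SAMPLE_MAKECONF = (
    "# etc/Makeconf (excerpt, assumed)\n"
    "BLAS_LIBS = -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64 "
    "-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread "
    "-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm\n"
)

ASSIGNMENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$")


def makeconf_vars(text):
    """Collect simple 'NAME = value' assignments from make-style text."""
    values = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            values[match.group(1)] = match.group(2).strip()
    return values


def blas_is_mkl(text):
    """True if the recorded BLAS_LIBS value links any mkl_ library."""
    return "-lmkl_" in makeconf_vars(text).get("BLAS_LIBS", "")


print(blas_is_mkl(SAMPLE_MAKECONF))
```

If BLAS_LIBS already carries the MKL flags, packages that follow the PKG_LIBS convention should pick them up without per-package changes, which is consistent with the link lines in the install log.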
Here are the results of the benchmark for the standard R 3.2.2:
R Benchmark 2.5
===============
Number of times each test is run__________________________: 3
I. Matrix calculation
---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec): 2.69466666666667
2400x2400 normal distributed random matrix ^1000____ (sec): 1.42433333333333
Sorting of 7,000,000 random values__________________ (sec): 2.34466666666667
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 33.187
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 14.52
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 4.51008013606039
II. Matrix functions
--------------------
FFT over 2,400,000 random values____________________ (sec): 1.203
Eigenvalues of a 640x640 random matrix______________ (sec): 1.60599999999999
Determinant of a 2500x2500 random matrix____________ (sec): 7.64266666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec): 8.05900000000001
Inverse of a 1600x1600 random matrix________________ (sec): 8.64166666666667
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 4.62477425061321
III. Programmation
------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec): 1.25633333333335
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 0.894999999999982
Grand common divisors of 400,000 pairs (recursion)__ (sec): 1.714
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 1.4013333333333
Escoufier's method on a 45x45 matrix (mixed)________ (sec): 2.041
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.44505946077978
Total time for all 15 tests_________________________ (sec): 88.6306666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec): 3.11209972260597
--- End of test ---
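The "Trimmed geom. mean (2 extremes eliminated)" lines can be reproduced from the per-test timings: drop the fastest and the slowest of the five tests and take the geometric mean of the remaining three. The sketch below (Python, with a function name chosen for illustration) does this for the standard R 3.2.2 figures. It also suggests that the "Overall mean" line, despite its label, matches the geometric mean of the three section values rather than their sum divided by 3.

```python
import math


def trimmed_geometric_mean(times):
    """Geometric mean after dropping the smallest and largest value."""
    kept = sorted(times)[1:-1]
    return math.exp(sum(math.log(t) for t in kept) / len(kept))


# Per-test timings (sec) from the standard R 3.2.2 report above.
section_1 = [2.69466666666667, 1.42433333333333, 2.34466666666667, 33.187, 14.52]
section_2 = [1.203, 1.60599999999999, 7.64266666666667,
             8.05900000000001, 8.64166666666667]
section_3 = [1.25633333333335, 0.894999999999982, 1.714, 1.4013333333333, 2.041]

section_means = [trimmed_geometric_mean(s) for s in (section_1, section_2, section_3)]

# Two candidate definitions of the "Overall mean" line.
arithmetic_overall = sum(section_means) / 3
geometric_overall = math.exp(sum(math.log(m) for m in section_means) / 3)

print(section_means)
print(arithmetic_overall, geometric_overall)
```

The section values come out at about 4.51, 4.62 and 1.45, as reported. The geometric version of the overall figure is about 3.11, which agrees with the report, while the arithmetic version is about 3.53.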
Here are the results for the MKL version:
R Benchmark 2.5
===============
Number of times each test is run__________________________: 3
I. Matrix calculation
---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec): 2.88466666666667
2400x2400 normal distributed random matrix ^1000____ (sec): 1.45933333333333
Sorting of 7,000,000 random values__________________ (sec): 2.35166666666667
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 3.37233333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 1.68666666666666
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 2.25337542617509
II. Matrix functions
--------------------
FFT over 2,400,000 random values____________________ (sec): 1.232
Eigenvalues of a 640x640 random matrix______________ (sec): 0.823333333333333
Determinant of a 2500x2500 random matrix____________ (sec): 1.752
Cholesky decomposition of a 3000x3000 matrix________ (sec): 1.417
Inverse of a 1600x1600 random matrix________________ (sec): 1.33833333333334
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.32693082905282
III. Programmation
------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec): 1.28600000000001
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 1.00833333333334
Grand common divisors of 400,000 pairs (recursion)__ (sec): 1.82266666666666
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 1.40533333333334
Escoufier's method on a 45x45 matrix (mixed)________ (sec): 1.91199999999998
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.48790723568791
Total time for all 15 tests_________________________ (sec): 25.7516666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec): 1.64469699141649
--- End of test ---
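To put the "2x to 10x" estimate next to the numbers, the ratio of standard to MKL timings can be computed directly. This is a sketch in Python using the figures from the two reports above for the matrix-heavy tests, plus the total. The dictionary keys are shortened labels, not the benchmark's own wording.

```python
# Timings in seconds, copied from the two benchmark reports above.
standard = {
    "crossprod": 33.187,
    "linear_regression": 14.52,
    "eigenvalues": 1.60599999999999,
    "determinant": 7.64266666666667,
    "cholesky": 8.05900000000001,
    "inverse": 8.64166666666667,
    "total": 88.6306666666667,
}
mkl = {
    "crossprod": 3.37233333333333,
    "linear_regression": 1.68666666666666,
    "eigenvalues": 0.823333333333333,
    "determinant": 1.752,
    "cholesky": 1.417,
    "inverse": 1.33833333333334,
    "total": 25.7516666666667,
}


def speedups(before, after):
    """Ratio before/after for every test present in both reports."""
    return {name: before[name] / after[name] for name in before if name in after}


ratios = speedups(standard, mkl)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:18s} {ratio:5.2f}x")
```

On these figures the cross-product test gains the most (roughly 9.8x), the eigenvalue test the least of the matrix tests (just under 2x), and the total run time drops by a factor of about 3.4.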
--
Paul E. Johnson
Professor, Political Science        Director
1541 Lilac Lane, Room 504           Center for Research Methods
University of Kansas                University of Kansas
http://pj.freefaculty.org           http://crmda.ku.edu