[R-pkg-devel] using portable simd instructions

Vladimir Dergachev vo|ody@ @end|ng |rom m|nd@pr|ng@com
Wed Mar 27 15:02:23 CET 2024


I like assembler, and I do use SIMD intrinsics in some of my code (not 
R), but sparingly.

The issue is not only portability between platforms, but also portability 
between processors: if you write your optimized code using AVX, it might 
not take advantage of newer AVX512 CPUs.

In many cases your compiler will do the right thing and optimize your 
code.

I suggest:

    * write your code in plain C, test it with some long computation and 
use "perf top" on Linux to observe the code hotspots and which assembler 
instructions are being used.

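    * If you see instructions like "addps" (packed: several floats per 
instruction), these are vectorized. If you see instructions like "addss" 
(scalar: one float per instruction), these are *not* vectorized.

    * If you see a few instructions as hotspots with arguments in 
parentheses, such as "vmovaps %xmm1,(%r8)", where the parentheses mark a 
memory operand, then you are likely limited by memory access.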

    * If you are not limited by memory access and the compiler produces a 
lot of "addss" or similar that are hotspots, then you need to look at your 
code and make it more parallelizable.

    * How to make your C code more parallelizable:

    You want loops that are easy for the compiler to interpret, like

          for(i=start;i<stop;i++) {
                  a[i]=b[i]+c[i];
                  }

    You can help the compiler by using the "restrict" keyword to indicate 
that arrays do not overlap, or (as a sledgehammer) "#pragma ivdep" (gcc 
spells it "#pragma GCC ivdep"). But before using keywords, check with 
"perf top" which code is actually a hotspot, as the compiler can generate 
good code without restrict keywords by using multiple code paths.
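
A minimal sketch of the loop above with "restrict" added. The function 
name add_arrays and its signature are made up for illustration; they are 
not from any particular package.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element-wise sum of two float arrays.
   "restrict" promises the compiler that a, b and c never overlap, so it
   does not need a runtime overlap check or a second code path. Calling
   this with overlapping arrays would be undefined behaviour. */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Whether this actually changes the generated instructions depends on the 
compiler and optimization level, so it is worth checking the hotspot with 
"perf top" before and after.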

    * You can create small temporary arrays to make your algorithm look 
more like the loop above. The small arrays should be at least 16 wide, 
because AVX512 has instructions that operate on 16 floats at a time.

     * To allow use of small arrays you can unroll your loops. Note that 
compilers do unrolling themselves, so doing it manually is only helpful if 
this makes the inner body of the loop more parallelizable.
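
One way the small-array and unrolling ideas could look, sketched for a 
sum of squares. The names sum_squares and LANES are invented for this 
example. A plain "total += x[i]*x[i]" loop carries a dependency from one 
iteration to the next; keeping 16 independent partial sums removes it.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16  /* 16 floats fill one AVX512 register */

/* Hypothetical helper: sum of squares of x[0..n-1], accumulated in
   LANES independent partial sums held in a small temporary array. */
float sum_squares(const float *x, size_t n)
{
    float acc[LANES] = {0.0f};
    float total = 0.0f;
    size_t i = 0;

    assert(x != NULL || n == 0);

    /* Main loop, manually blocked by LANES: the inner loop has the same
       shape as a[i] = b[i] + c[i], with no dependency between lanes. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t k = 0; k < LANES; k++) {
            acc[k] += x[i + k] * x[i + k];
        }
    }

    /* Leftover elements that do not fill a whole block. */
    for (; i < n; i++) {
        total += x[i] * x[i];
    }

    /* Combine the partial sums. */
    for (size_t k = 0; k < LANES; k++) {
        total += acc[k];
    }
    return total;
}
```

Note that this adds the terms in a different order than the plain loop, 
so the floating-point result can differ in the last digits.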

     * You can debug why the compiler does not vectorize your code by 
turning on diagnostics. For gcc the flag is 
"-fopt-info-vec-missed=vec_info.txt", which writes the missed 
vectorization reports to the file vec_info.txt.

     * Only in very rare cases should you use intrinsics. For me this is 
typically a situation when I need to find the value and the index of a 
maximum or minimum in an array - compilers do not optimize this well, at 
least not for any of the many ways of coding it in C that I tried many 
years ago.
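
For illustration only, here is a sketch of a maximum-with-index search 
using SSE2 intrinsics, with a plain C fallback when SSE2 is not 
available. This is not the code referred to above; the name argmax_float, 
the choice of SSE2 (4 floats wide), and the assumptions that the input 
has no NaNs and fewer than 2^31 elements are all mine.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns the index of the first maximum of
   x[0..n-1] and stores its value in *max_out. Requires n >= 1. */
size_t argmax_float(const float *x, size_t n, float *max_out)
{
    assert(x != NULL && max_out != NULL && n >= 1 && n <= INT_MAX);

    float best = x[0];
    size_t best_i = 0;
    size_t i = 0;

#if defined(__SSE2__)
    if (n >= 4) {
        __m128 vbest = _mm_loadu_ps(x);             /* best value per lane */
        __m128i vidx = _mm_setr_epi32(0, 1, 2, 3);  /* its index per lane  */
        __m128i vcur = vidx;                        /* indices of the block */
        const __m128i step = _mm_set1_epi32(4);

        for (i = 4; i + 4 <= n; i += 4) {
            __m128 v = _mm_loadu_ps(x + i);
            vcur = _mm_add_epi32(vcur, step);
            /* Lanes where the new value is strictly larger. */
            __m128 gt = _mm_cmpgt_ps(v, vbest);
            __m128i gti = _mm_castps_si128(gt);
            /* Select new value/index where gt is set, keep old otherwise. */
            vbest = _mm_or_ps(_mm_and_ps(gt, v), _mm_andnot_ps(gt, vbest));
            vidx = _mm_or_si128(_mm_and_si128(gti, vcur),
                                _mm_andnot_si128(gti, vidx));
        }

        /* Reduce the 4 lanes; on ties prefer the smaller index. */
        float vals[4];
        int idxs[4];
        _mm_storeu_ps(vals, vbest);
        _mm_storeu_si128((__m128i *)idxs, vidx);
        best = vals[0];
        best_i = (size_t)idxs[0];
        for (int k = 1; k < 4; k++) {
            if (vals[k] > best ||
                (vals[k] == best && (size_t)idxs[k] < best_i)) {
                best = vals[k];
                best_i = (size_t)idxs[k];
            }
        }
    }
#endif

    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++) {
        if (x[i] > best) {
            best = x[i];
            best_i = i;
        }
    }

    *max_out = best;
    return best_i;
}
```

The compile-time guard keeps the file building on non-x86 platforms, 
which is the minimum needed for portability; it does nothing to exploit 
wider registers such as AVX512.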

     * If, after all your work, you get a factor of 2 speedup, you are 
doing fine. If you want a larger speedup, change your algorithm.

best

Vladimir Dergachev

On Wed, 27 Mar 2024, Dirk Eddelbuettel wrote:

>
> On 27 March 2024 at 08:48, jesse koops wrote:
> | Thank you, I was not aware of the easy way to search CRAN. I looked at
> | rcppsimdjson of course, but couldn't figure it out since it is done in
> | the simdjson library if I interpret it correctly, not within the R
> | ecosystem and I didn't know how that would change things. Writing R
> | extensions assumes a lot of  prior knowledge so I will have to work my
> | way up to there first.
>
> I think I have (at least) one other package doing something like this _in the
> library layer too_ as suggested by Tomas, namely crc32c as used by digest.
> You could study how crc32c [0] does this for x86_64 and arm64 to get hardware
> optimization. (This may be more specific cpu hardware optimization but at
> least the library and cmake files are small.)
>
> I decided as a teenager that assembler wasn't for me and haven't looked back,
> but I happily take advantage of it when bundled well. So strong second for
> the recommendation by Tomas to rely on this being done in an external and
> tested library.
>
> (Another interesting one there is highway [1]. Just packaging that would
> likely be an excellent contribution.)
>
> Dirk
>
> [0] repo: https://github.com/google/crc32c
> [1] repo: https://github.com/google/highway
>    docs: https://google.github.io/highway/en/master/
>
>
> |
> | On Tue 26 Mar 2024 at 15:41, Dirk Eddelbuettel <edd using debian.org> wrote:
> | >
> | >
> | > On 26 March 2024 at 10:53, jesse koops wrote:
> | > | How can I make this portable and CRAN-acceptable?
> | >
> | > By writing (or borrowing ?) some hardware detection via either configure /
> | > autoconf or cmake. This is no different than other tasks decided at install-time.
> | >
> | > Start with 'Writing R Extensions', as always, and work your way up from
> | > there. And if memory serves there are already a few other packages with SIMD
> | > at CRAN so you can also try to take advantage of the search for a 'token'
> | > (here: 'SIMD') at the (unofficial) CRAN mirror at GitHub:
> | >
> | >    https://github.com/search?q=org%3Acran%20SIMD&type=code
> | >
> | > Hth, Dirk
> | >
> | > --
> | > dirk.eddelbuettel.com | @eddelbuettel | edd using debian.org
>
> -- 
> dirk.eddelbuettel.com | @eddelbuettel | edd using debian.org
>
> ______________________________________________
> R-package-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>


