[R] x86 SSE* Pointer Favors

Fri Jun 13 08:42:41 CEST 2008

Hi Ivo,

On Friday 13 June 2008 12:23:06 am ivo welch wrote:
> Dear Statisticians--- This is not even an R question, so please
> forgive me.  I have so much ignorance in this matter that I do not
> know where to begin.  I hope someone can point me to documentation
> and/or a sample.

You will surely find some answers to your questions in the "Building from
source" section of the R-admin.html file. Search it for BLAS and you will be
presented with some options. Searching the R web site for the same keyword
will give you even more food for thought.

> I want to compute a covariance as quickly as non-humanly possible on
> an Intel core processor (up to SSE4) under linux.  Alas, I have no
> idea how to engage CPU vectorization.  Do I need to use special data
> types, or is "double" correct?  Does SSE* understand NaN?  Should I
> rely on gcc autodetection of the vectorized meaning of my code, or are
> there specific libraries that I should call?

I use the Goto BLAS library and it works great. It usually runs 3 to 30 times
faster than the stock R BLAS library, depending on your code. You can also
enable SSE instructions while building R (yes, you have to enable them
explicitly, see man gcc), but this does not help much since most of the maths
is done in BLAS.

That said, optimized BLAS libraries give the biggest speed increase with older
processors. The newer crop of multi-core CPUs with large shared caches is much
more difficult to hand-tune code for. You may want to subscribe to the Goto
BLAS mailing list for an in-depth discussion. The ATLAS community is also very
helpful (I use their code with our AMD CPUs).

> What I want to learn about is as simple as it gets:
>   typedef double Double;  // or whatever SSE* needs as close equivalent
>   Double vector1[N], vector2[N];
>   // then fill them with stuff.

R does not have a single-precision type: any numeric value that is not an
integer is stored as a double, and all floating-point arithmetic is done in
double precision.
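
Since R numeric vectors are doubles, a C routine meant to be called from R
can simply take double pointers. Here is a minimal sketch of a two-pass sample
covariance, assuming it is called through R's .C() interface, where every
argument arrives as a pointer. The name cov_double is made up for
illustration; it is not part of R or of any BLAS library.

```c
#include <math.h>   /* NAN */

/* Hypothetical routine for R's .C() interface: all arguments are pointers,
 * numeric vectors arrive as double*, integer vectors as int*.
 * Writes the sample covariance of x and y (divisor n - 1) into *result. */
void cov_double(double *x, double *y, int *n, double *result)
{
    int len = *n;
    double mean_x = 0.0, mean_y = 0.0, acc = 0.0;
    int i;

    if (len < 2) {          /* covariance is undefined for fewer than 2 points */
        *result = NAN;
        return;
    }

    /* First pass: the two means. */
    for (i = 0; i < len; i++) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= len;
    mean_y /= len;

    /* Second pass: sum of products of deviations from the means. */
    for (i = 0; i < len; i++)
        acc += (x[i] - mean_x) * (y[i] - mean_y);

    *result = acc / (len - 1);
}
```

After building a shared object with R CMD SHLIB and loading it with
dyn.load(), a call along the lines of
.C("cov_double", as.double(x), as.double(y), as.integer(length(x)),
result = double(1))$result should return the value. The two-pass form is
chosen here for numerical stability rather than speed.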

>   vector3= vector_mult(vector1,vector2, N);
>   vector4= sum(vector1, N);
>
> I just need a pointer and/or primer.  PS: If someone knows of a
> superfast vectorized implementation of Gentleman's WLS algorithm,
> please point me to it, too.  I am still using my old non-vectorized C
> routines.
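
For the two operations in your pseudocode, here is a rough sketch of what
hand-written SSE2 code could look like in C with plain double arrays. It
assumes gcc's <emmintrin.h> intrinsics, which process two doubles per 128-bit
register, and falls back to an ordinary loop when SSE2 is not available. The
function names vector_mult and vector_sum are taken from your example and are
not library routines.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: two doubles per 128-bit register */
#endif

/* out[i] = a[i] * b[i] for i in [0, n). */
void vector_mult(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Unaligned loads/stores, so the arrays need no special alignment. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_mul_pd(va, vb));
    }
#endif
    for (; i < n; i++)      /* leftover element, or the whole array without SSE2 */
        out[i] = a[i] * b[i];
}

/* Returns a[0] + ... + a[n-1]. */
double vector_sum(const double *a, size_t n)
{
    size_t i = 0;
    double total = 0.0;
#if defined(__SSE2__)
    __m128d acc = _mm_setzero_pd();   /* two running partial sums */
    double lanes[2];
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(a + i));
    _mm_storeu_pd(lanes, acc);
    total = lanes[0] + lanes[1];      /* combine the two partial sums */
#endif
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Note that the SSE2 sum adds the elements in a different order than a plain
loop, so the last bits of the result can differ. SSE2 arithmetic follows
IEEE 754, so a NaN in the input propagates to the result as usual.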

HTH,
Ivan