[R-SIG-Mac]Altivec and BLAS

Thu, 7 Feb 2002 10:46:46 -0800

The statement
      "Apple provides an AltiVec tuned BLAS library. Please see
vBLAS.h in vecLib.framework."
appeared in a message on Apple's "scitech" mailing list. I thought
this might be of interest to R Mac users/developers. I quote the
entire email below.

-Don

----------------------- start of quote of email -----------------------
Subject: Re: scitech digest, Vol 2 #20 - 10 msgs
From: Ian Ollmann <iano@apple.com>
To: scitech@lists.apple.com
Date: Thu, 7 Feb 2002 10:26:33 -0800

Hi all,

	Just a few comments from someone who does Altivec and performance
computing for a living:

>>  Cube 450 MHz G4:        341 sec.
>>  PM G4e 867 MHz:         234 sec.
>>  Compaq Alpha 666 MHz:   124 sec.
>>  Pentium 4 Xeon 1.7 GHz: 174 sec.
>>  Athlon 1.4 GHz/Win2000: 142 sec.
>>  IBM Power 3, 375 MHz:   194 sec. (single node of SP supercomputer)
>>  MIPS R14000, 500 MHz:   104 sec. (single node of Origin 3800)

In this case, the 867 MHz performance is not scaling linearly with the
Cube performance. It should. When it doesn't, that is usually an
indication that your code does not properly saturate the FPU pipelines.
The G4/867 has a 5-stage FPU, whereas the G4/450 has a 3-stage FPU.
Older compilers don't always schedule for this fact. Also, some programs
are written as if the processor has a one-stage pipeline, which can tie
the hands of even an educated compiler. Assuming the Cube results are
optimal, one should expect a time of about 177 sec with this code
(341 sec x 450/867) if it scaled linearly with clock speed. In my work,
it is not surprising to see a factor of two FPU performance improvement
if you pay attention to pipelines.
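
As a rough sketch of the pipeline point above (not taken from the
benchmark being discussed): a loop whose every add depends on the
previous add can keep only one operation in flight, while independent
accumulators give a multi-stage FPU something to overlap. The function
names and the choice of four accumulators are assumptions made for
illustration; the best count depends on the pipeline depth and on what
the compiler already does.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add must wait for the previous add to finish,
   so a multi-stage FPU pipeline is mostly idle. */
double sum_serial(const double *x, size_t n)
{
    assert(n == 0 || x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Four independent accumulators: up to four adds can be in flight.
   This reorders the additions, so the result may differ from
   sum_serial in the last bits for general floating-point data. */
double sum_unrolled4(const double *x, size_t n)
{
    assert(n == 0 || x != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)   /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Whether the unrolled version is actually faster has to be measured on
the target machine; an optimizing compiler may or may not perform this
transformation itself, since it changes floating-point rounding.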

If you are a member of Apple's Select or Premier developer programs,
you might take a look at your code execution with Sim_G4. That will
document, on an instruction-by-instruction basis, where the
inefficiency is. Usually the problems are fairly straightforward to fix
once you know what they are.

>  At extra cost, one can purchase a package called VAST from Absoft
>  which reportedly automatically vectorizes one's Fortran code for
>  AltiVec. If your code is very dependent on BLAS, one can build an
>  AltiVec-tuned BLAS library with the ATLAS package using gcc. Note that
>  the AltiVec BLAS library currently shipped with v7.0 of the Absoft
>  compiler for OS X is not (!) optimized with AltiVec. One must build the
>  ATLAS library.

Apple provides an AltiVec tuned BLAS library. Please see vBLAS.h in
vecLib.framework.

>  Now, if only those of us dependent on double precision floating point
>  could take advantage of AltiVec for scientific computing....

If it were me, I'd lobby for two FPUs instead. The vector register is
only 128 bits wide, which is room for just two doubles. Two-way
parallelism is not enough to justify adopting the overhead of a SIMD
unit, in my opinion. Also, with two FPUs you can use existing code with
no additional training. Where a vector unit really shines is when
parallelism is much higher -- 4, 8 or 16 way parallelism.
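
The parallelism figures above follow from dividing the 128-bit AltiVec
register width by the element size. A minimal sketch of that arithmetic
(the function and constant names are made up for illustration; AltiVec
itself has no double-precision vector type, so the 8-byte case is the
hypothetical one being discussed):

```c
#include <assert.h>
#include <stddef.h>

enum { VECTOR_REGISTER_BITS = 128 };  /* AltiVec register width */

/* Number of elements of the given size (in bytes) that fit in one
   vector register, i.e. the available N-way parallelism. */
unsigned lanes_per_register(size_t element_bytes)
{
    assert(element_bytes > 0);
    return (unsigned)(VECTOR_REGISTER_BITS / (8 * element_bytes));
}
```

With this, lanes_per_register(8) is 2 (64-bit doubles),
lanes_per_register(4) is 4 (32-bit floats or ints),
lanes_per_register(2) is 8 and lanes_per_register(1) is 16.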

>  2) Cache prefetching (C only, but it would be easy to write some
>  wrapper libs for Fortran -might impact performance though): I found
>  that adding some cache prefetching ("vec_dst" etc.) to these same two
>  key routines also gave an improvement of about 5-10% -not to be sniffed
>  at for only an hour's work...

One can actually write benchmark cases that show up to a fourfold
improvement due to prefetching. In real code, 20-30% is much more
common.
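
For readers who have not used prefetch hints, here is a minimal
sketch of the idea. It does not use the AltiVec "vec_dst" stream
instructions mentioned above; it substitutes the generic
__builtin_prefetch hint provided by GCC and Clang, which is a different
mechanism with a similar purpose. The function name and the prefetch
distance of 64 elements are assumptions for illustration, and the
distance in particular would need tuning on real hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 64

/* y[i] += a * x[i], with hints asking the cache to start fetching
   data that will be needed a little later. */
void axpy_prefetch(size_t n, float a, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        /* Hint only; real code would issue one hint per cache line
           rather than one per element. */
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 0); /* read  */
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 1, 0); /* write */
        }
#endif
        y[i] += a * x[i];
    }
}
```

The hints never change the result, only (possibly) the timing, so any
gain has to be confirmed by measurement.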

>  One of the problems I found with this on OSX is that there do not seem
>  to be libs for some single precision functions in OS X. For example,
>  trig functions, sqrt, exp, log, etc. All of these are actually
>  calculated in DP, then 'put back' into the SP vars.

On the G4 FPU, single and double precision are equally expensive, except
for divide. In some cases, single precision is more expensive due to the
need to round double precision results back down to single precision.

>  4) 'Altivectorisation' ;-)
>  I did spend some time fiddling with these same two routines (they take
>  up about 75-85% of total CPU) trying to add altivec instructions
>  (vec_madd, mainly), but it actually ended up slightly slower than the
>  non-altivec code. (It all still gave the same results, though, so I'm
>  assuming it worked OK.) I presume this has something to do with being
>  'memory-bound' (i.e. cache misses etc., even with the cache
>  prefetching)...? Has anyone had any experience of dealing with this
>  sort of thing, especially in the context of altivec?

Yes. You might find the materials here a good read:

	http://idisk.mac.com/simd/Public

Also, the Altivec Forum <altivec_forum@forum.altivec.org> is probably
one of the most sane, informative, generous and helpful mailing lists on
the internet.

>  Lnx   PIII (Coppermine?) 800   256K      g77/gcc        449
>  DUX   Alpha ev5          500   1MB?      DIGITAL Fort   422
>  X04   G4 (7400?)         400   1MB       g77/gcc        317
>  DUX   Alpha ev6          500   1MB       DIGITAL Fort   210
>  X11   G4 (7450?)         867   2MB(L3)   g77/gcc        207
>  W2K   Pentium IV        1700   1-2MB?    Absoft Fort   ~180(*)
>  DUX   Alpha ev6          667   1MB?      DIGITAL Fort   178
>  T64   Alpha ev67(?)      833   2MB       DIGITAL Fort   163(+)

Here again, the 867 MHz results don't scale with the 400 MHz results.
With linear scaling you should be seeing about 146 seconds
(317 x 400/867).
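
The linear-scaling estimates in this message (about 177 sec for the
first table and about 146 sec here) both come from the same simple
rule: if run time is inversely proportional to clock speed, multiply
the reference time by the ratio of the clock speeds. A small sketch,
with a function name invented for illustration:

```c
#include <assert.h>

/* Expected run time at new_mhz, given a measured time at ref_mhz,
   assuming run time scales as 1 / clock frequency. */
double expected_seconds(double ref_seconds, double ref_mhz, double new_mhz)
{
    assert(ref_mhz > 0.0 && new_mhz > 0.0);
    return ref_seconds * ref_mhz / new_mhz;
}
```

expected_seconds(341.0, 450.0, 867.0) gives roughly 177, and
expected_seconds(317.0, 400.0, 867.0) gives roughly 146. This ignores
differences in cache, memory speed and pipeline depth between the two
machines, which is exactly why a shortfall points to something worth
investigating.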

>  The BIG issue is Altivec...
>  If you can get a moderate improvement with Altivec, that will make all
>  the difference  -potentially putting the G4 well out in front. Since
>  Absoft's f90 compiler automatically makes use of Altivec for some
>  operations, it's worth checking out...

"Properly done" you should in my experience see about a 3x speed
improvement with AltiVec over rigorously optimized, hand tuned scalar
FPU code. Ideally it should be 4x, but there is always some overhead to
deal with in real-world problems, mostly to do with rearranging data
in registers, handling alignment, loading constants, stack setup, etc.
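
To make the "handling alignment" item a little more concrete: vector
loads and stores work on 16-byte (128-bit) blocks, so code usually
wants its arrays to start on a 16-byte boundary. The sketch below shows
one portable way to check and obtain such alignment using C11
aligned_alloc. The helper names are made up for illustration, and this
is only one possible approach, not something prescribed in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { VECTOR_ALIGNMENT = 16 };  /* bytes in one 128-bit register */

/* Nonzero if p sits on a 16-byte boundary. */
int is_vector_aligned(const void *p)
{
    return ((uintptr_t)p % VECTOR_ALIGNMENT) == 0;
}

/* Allocate room for n floats starting on a 16-byte boundary.
   The byte count is rounded up to a multiple of the alignment,
   as aligned_alloc requires. Release the memory with free(). */
float *alloc_vector_floats(size_t n)
{
    assert(n <= (SIZE_MAX - VECTOR_ALIGNMENT) / sizeof(float));
    size_t bytes = n * sizeof(float);
    size_t padded = (bytes + VECTOR_ALIGNMENT - 1)
                    / VECTOR_ALIGNMENT * VECTOR_ALIGNMENT;
    if (padded == 0)
        padded = VECTOR_ALIGNMENT;
    return aligned_alloc(VECTOR_ALIGNMENT, padded);
}
```

A buffer from alloc_vector_floats passes is_vector_aligned, while a
pointer 4 bytes into it does not; the latter is the kind of case that
needs extra work before vector code can load it.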

Achieving a real-world 3-4x performance improvement requires that you
learn something about vector programming thought patterns and a little
bit more about the processor. AltiVec is a bit fussier than the scalar
units about data layout and data flow patterns. If you choose to ignore
that and continue to practice standard scalar programming methods in
the context of the vector unit, AltiVec can actually be slower.

Please take a look at the website and tutorial reference above for
specifics. This is a much larger subject than I can cover here.

Best Regards,

Ian Ollmann
CoreOS / AltiVec

------------------------ end of quote of email ------------------------
-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
--------------------------------------