[Rd] Cell or PS3 Port
Ed Knutson
ed at sixfoursystems.com
Fri Nov 2 17:51:41 CET 2007
The main core of the Cell (the PPE) uses IBM's version of hyperthreading
to expose two logical CPUs to the OS, so code that is "simply"
multi-threaded should still see an advantage. In addition, IBM provides
an SDK that includes workflow management as well as libraries
supporting common linear algebra and other math functions on the
sub-processors (called SPEs). They also provide an interface to a
hardware RNG as well as three software generators (two pseudo-random,
one quasi-random) coded for the SPE.
Each SPE has its own small, local memory store and communicates with
main memory through a DMA queue. It seems to be a question of breaking
each task into units small enough to offload to an SPE. My initial
direction will be to set up a rudimentary workflow manager. When an
optimized function is encountered, a sufficient number of SPE threads
will be spawned and execution of the main thread will wait for all
results. As for the optimized functions, I intend to start with the
ones that already have an analogous implementation in the IBM math
libraries.
MPI has been employed by some Cell developers to let multiple SPEs
working on sections of the same task communicate with each other. I
like this approach, since it lays the groundwork for clustering
multiple Cell (or really any) processors.
Luke Tierney wrote:
> I have been experimenting with ways of parallelizing many of the
> functions in the math library. There are two experimental packages
> available in http://www.stat.uiowa.edu/~luke/R/experimental: pnmath,
> based on OpenMP, and pnmath0, based on basic pthreads. I'm not sure
> to what degree the approach there would carry over to GPUs or Cell
> where the additional processors are different from the main processor
> and may not share memory (I forget how that works on Cell).
>
> The first issue is that you need some modifications to some of the
> functions to ensure they are thread-safe. For the most part these are
> minor; a few functions would require major changes and I have not
> tackled them for now (Bessel functions, wilcox, signrank I believe).
> RNG functions are also not suitable for parallelization given the
> dependence on the sequential underlying RNG.
>
> It is not too hard to get parallel versions to use all available
> processor cores. The challenge is to make sure that the parallel
> versions don't run slower than the serial versions. They may if the
> amount of data is too small. What is too small for each function
> depends on the OS and the processor/memory architecture; if memory is
> not shared this gets more complicated still. For some very simple
> functions (floor, ceiling, sign) I could not see any reliable benefit
> of parallelization for reasonable data sizes on the systems I was
> using so I left those alone for now.