[Rd] Mere chat on vectorisation matters

Wed May 10 20:16:52 CEST 2006

Hi, people.  Allow me to chat a tiny bit on two vectorisation-related 
matters, in the context of R.  I'm curious whether the following ideas 
have already been considered and rejected.

First is about using the so-called Duff's device for partially unrolling 
loops.  I did not check the R sources thoroughly, and am not familiar 
with them anyway, but the only use I saw is within 
"src/gnuwin32/malloc.c".  Maybe it could be put to good use in 
"src/main/arithmetic.c" and elsewhere.  Second is about what is called 
"chaining" on some vector 
computers, in which one vector operation uses, as an operand, the result 
of another vector operation, even before that result is sent for 
register or memory storage; R could use this technique to save memory 
when it "knows" that the result is going to be discarded anyway.
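
To make the first idea concrete, here is a minimal sketch of Duff's 
device applied to an element-wise addition of two equal-length vectors.  
It is not taken from the R sources: the name vec_add_duff is 
hypothetical, vector recycling is deliberately left out, and whether it 
beats what an optimising compiler already does with a plain loop would 
have to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the R sources: z[i] = x[i] + y[i] for
   i in 0..n-1, unrolled eight times with Duff's device.  The switch
   jumps into the middle of the do-while body to handle the n % 8
   leftover elements on the first pass. */
static void vec_add_duff(double *z, const double *x, const double *y,
                         size_t n)
{
    if (n == 0)
        return;
    assert(z != NULL && x != NULL && y != NULL);

    size_t passes = (n + 7) / 8;    /* trips through the do-while */

    switch (n % 8) {
    case 0: do { *z++ = *x++ + *y++;
    case 7:      *z++ = *x++ + *y++;
    case 6:      *z++ = *x++ + *y++;
    case 5:      *z++ = *x++ + *y++;
    case 4:      *z++ = *x++ + *y++;
    case 3:      *z++ = *x++ + *y++;
    case 2:      *z++ = *x++ + *y++;
    case 1:      *z++ = *x++ + *y++;
            } while (--passes > 0);
    }
}
```

The case labels fall through on purpose; some compilers warn about 
that, and the warning would have to be silenced for code like this.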

I used and abused Duff's device a good while ago, when I was working
in computer graphics; it was routinely used to speed up image-wide 
operations.  With a few properly devised C pre-processor macros, it was 
made easy to use (I threw mine away a few years ago, recognizing I had 
lost interest in low-level coding matters; the macros could easily be 
rethought anyway).  Questions existed at the time about whether unrolled 
loops would fit within the specialised instruction caches of some CPUs, 
but nowadays memory caches are much bigger than they used to be, and 
I have the prejudice that it is just not a problem anymore.  Maybe more of 
a concern might be the conditionals implementing vector recycling 
(already hidden in macros), as they may disrupt the speed of merely 
falling through linear code.  One could probably do without jumps by 
using clever masking operations, yet I wonder how far we could safely 
benchmark at configuration time to decide which code is best to 
generate, and how well suited C is to writing masked conditionals.  I'm 
not familiar enough with modern CPUs to judge whether this really needs 
to be addressed.
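
As an illustration of the masking idea, the following sketch advances 
recycled indices without an explicit jump in the source.  It is an 
assumption about what such code could look like, not what 
"src/main/arithmetic.c" does; the name vec_add_recycle is hypothetical.  
A compiler may well turn the plain "if" form into a conditional move on 
its own, so any gain would need to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the R sources:
   z[i] = x[i mod nx] + y[i mod ny] for i in 0..n-1, with the wrap-around
   of each recycled index done by masking instead of a conditional. */
static void vec_add_recycle(double *z, size_t n,
                            const double *x, size_t nx,
                            const double *y, size_t ny)
{
    assert(nx > 0 && ny > 0);

    size_t ix = 0, iy = 0;
    for (size_t i = 0; i < n; i++) {
        z[i] = x[ix] + y[iy];

        /* Branching form would be:  if (++ix == nx) ix = 0;
           Masked form: the comparison yields 1 or 0; negating it as an
           unsigned value gives all-ones or zero, which either keeps
           the incremented index or resets it to zero. */
        ix = (ix + 1) & -(size_t)(ix + 1 != nx);
        iy = (iy + 1) & -(size_t)(iy + 1 != ny);
    }
}
```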

I do not doubt that hardware chaining is worth all the effort the 
engineers put in so that the hardware recognises and activates it on the 
fly.  Vectorised chaining implemented in software, as a way to save 
memory, may be quite a challenge, as it requires a sort of 
half-compilation.  
On one hand, it might alleviate memory problems which are often the 
subject of discussions on R-help; going beyond real memory and into 
paging leads to thrashing, which may considerably slow down an R 
application.  On the other hand, unless very carefully implemented, 
chaining overhead might slow down all non-thrashing applications, which 
is most of them.  
Nevertheless, being softer on memory requirements is already a concern 
in R: I vaguely remember having read that R "tries to prove" that 
a vector being modified will not be needed anymore in its original form, 
and when the proof succeeds, the original vector gets modified without 
prior copying.  Chaining, although difficult to implement, might be 
a significant further step, and so be worth a discussion.
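
A hand-written sketch may help show what software chaining would buy.  
Take an expression like sum((x - y)^2): evaluated one vector operation 
at a time, it needs a temporary as long as the operands, while a chained 
evaluation feeds each element straight into the next operation.  Both 
functions below are hypothetical and written by hand; the hard part, 
having the evaluator produce the second form by itself, is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Unchained: every intermediate vector is stored before the next
   operation starts.  tmp must have room for n doubles. */
static double sum_sq_diff_unchained(const double *x, const double *y,
                                    double *tmp, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL && tmp != NULL));

    for (size_t i = 0; i < n; i++)      /* tmp <- x - y   */
        tmp[i] = x[i] - y[i];
    for (size_t i = 0; i < n; i++)      /* tmp <- tmp ^ 2 */
        tmp[i] = tmp[i] * tmp[i];

    double s = 0.0;
    for (size_t i = 0; i < n; i++)      /* s <- sum(tmp)  */
        s += tmp[i];
    return s;
}

/* Chained: each element flows through subtraction, squaring and
   summation in one pass, so no temporary vector is allocated. */
static double sum_sq_diff_chained(const double *x, const double *y,
                                  size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));

    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - y[i];
        s += d * d;
    }
    return s;
}
```

The two versions do the same arithmetic in the same order; only the 
storage of intermediate results differs.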

-- 
François Pinard   http://pinard.progiciels-bpi.ca