[Rd] Workaround very slow NAN/Infinities arithmetic?

Thu Sep 30 18:06:43 CEST 2021

Brodie Gaslam wrote:
> André,

> I'm not an R core member, but happen to have looked a little bit at this
> issue myself.  I've seen similar things on Skylake and Coffee Lake 2
> (9700, one generation past your latest) too.  I think it would make sense
> to have some handling of this, although I would want to show the trade-off
> with performance impacts on CPUs that are not affected by this, and on
> vectors that don't actually have NAs and similar.  I think the performance
> impact is likely to be small so long as branch prediction is active, but
> since branch prediction is involved you might need to check with different
> ratios of NAs (not for your NA bailout branch, but for e.g. interaction
> of what you add and the existing `na.rm=TRUE` logic).

For operators such as '+', randomly placed NAs could slow down AMD processor due to many branch mispredictions. However, the functions using long doubles are mainly "cumulative" functions such as sum, mean, cumsum, etc. They can stop at the first NA found.

> You'll also need to think of cases such as c(Inf, NA), c(NaN, NA), etc.,
> which might complicate the logic a fair bit.

When an infinity is found, the rest of the vector must be searched for infinities of the opposite side or NAs.
Mixing NaN and NAs has always been platform-dependent in R, but, on x87, it seems that the first NA/NaN met wins. Being consistent with that behavior is easy.

I wrote a first patch for the sum() function, taking in account all those special +Inf/-Inf/NA/NaN cases. Without any NA, the sum() function was 1.6% slower with my first patch on a Celeron J1900. Not a big deal, in my opinion.

However, I could easily compensate that loss (and much more) by improving the code.
By splitting the na.rm=TRUE and na.rm=FALSE code paths, I could save a useless FP comparison (for NAN), a useless CMOV (for the 'updated' variable) and useless SSE->memory->FP87 moves (high latency).

My new code is x1.75 times faster, for sum(big_vector_without_NAs, na.rm=FALSE).
It is x1.35 times faster for sum(big_vector_without_NAs, na.rm=TRUE)

Of course, it is much faster if there are any NA, because it stops at the first NA found. For infinities on AMD CPUs, it may not necessarily be faster.

-- 
Sincerely
André GILLIBERT