[Rd] Possible page inefficiency in do_matrix in array.c

Tue Sep 4 23:05:14 CEST 2012

On Sep 4, 2012, at 8:07 AM, "Matthew Dowle" <mdowle at mdowle.plus.com> wrote:

> 
>> Actually, my apologies, I was assuming that your example was based on the
>> SO question while it is not at all (the code is not involved in that test
>> case). Reversing the order does indeed cause a delay. Switching to a
>> single index doesn't seem to have any impact. R-devel has the faster
>> version now (which now also works with large vectors).
>> 
>> Cheers,
>> Simon
> 
> I was intrigued why the compiler doesn't swap the loops when you thought
> it should, though. You're not usually wrong! From GCC's documentation (end
> of last paragraph is the most significant) :
> 
> ====
> 
> -floop-interchange
> Perform loop interchange transformations on loops. Interchanging two
> nested loops switches the inner and outer loops. For example, given a loop
> like:
>          DO J = 1, M
>            DO I = 1, N
>              A(J, I) = A(J, I) * C
>            ENDDO
>          ENDDO
> 
> loop interchange transforms the loop as if it were written:
> 
>          DO I = 1, N
>            DO J = 1, M
>              A(J, I) = A(J, I) * C
>            ENDDO
>          ENDDO
> 
> which can be beneficial when N is larger than the caches, because in
> Fortran, the elements of an array are stored in memory contiguously by
> column, and the original loop iterates over rows, potentially creating at
> each access a cache miss. This optimization applies to all the languages
> supported by GCC and is not limited to Fortran. To use this code
> transformation, GCC has to be configured with --with-ppl and --with-cloog
> to enable the Graphite loop transformation infrastructure.
> 
> ====
> 
> Could R build scripts be configured to set these gcc flags to turn on
> "Graphite", then? I guess one downside could be the time to compile.
> 

The is something odd happening - when I use stand-alone code it works: 

$ gcc -o t2 -O3 t2.c

$ time ./t2
0x7fbae3b4f010
real	0m1.045s
user	0m0.784s
sys	0m0.260s

$ gcc -o t2 -floop-interchange -O3 t2.c

$ time ./t2
0x7f4e516f2010
real	0m0.418s
user	0m0.044s
sys	0m0.372s

However, when I split off the loop into a parametrized function it doesn't:

$ gcc -floop-interchange -O3 t.c tt.c -o t && ./t
0x7fdd37cca010
loop time = 1772.085ms
$ gcc -O3 t.c tt.c -o t && ./t
0x7f3aa8777010
loop time = 1763.888ms

For comparison , manually swapping i and j:
$ gcc -floop-interchange -O3 t.c tt.c -o t && ./t
0x7feecd4c9010
loop time = 451.744ms

For the same reason, it doesn't work for the old R code. I wonder what's happening there - I guess the optimizer is not smart enough to realize the coverage is the entire m*n span despite the fact that m and n are parameters ... But it's certainly something it should optimize as it did when used directly in a function... Odd ...

Cheers,
Simon

--

PS:  Note this has nothing to do with "R builds scripts" - it is user's responsibility to add optimization flags since they are very much compiler and architecture-dependent. Also optimizations are occasionally known to backfire, so the default is always more conservative.

FWIW this is my latest incarnation of configuring optimized R on E5 machines (minus BLAS and prefix flags which will vary by system):

'--enable-lto' '--enable-R-shlib' 'CFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'CXXFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FCFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 

-- test code (test compiled with gcc (Ubuntu/Linaro 4.7.1-7ubuntu1) 4.7.1 20120814 (prerelease))

t.c:

void foo(int *x, const unsigned int m, const unsigned int n) {
    int i, j;
    for (i = 0; i < m; i++)
	for(j = 0; j < n; j++)
	    x[i + j * m] = 1;
}

tt.c:

#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>

void foo(int *, int, int);

static double ts() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (double)tv.tv_sec + ((double) tv.tv_usec) / 1000000.0;
}

int main() {
    int m = 10000, n = 10000;
    int *x = (int*) malloc(m * n * sizeof(int));
    double a = ts(), b;
    foo(x, m , n);
    b = ts();
    printf("%p\nloop time = %.3fms\n", x, (b - a) * 1000);
    return 0;
}

t2.c:

#include <stdio.h>
#include <stdlib.h>

int *foo() {
    int m = 10000, n = 10000;
    int *x = (int*) malloc(m * n * sizeof(int));
    int i, j;
    for (i = 0; i < m; i++)
	for(j = 0; j < n; j++)
	    x[i + j * m] = 1;
    return x;
}

int main() {
    printf("%p", foo());
    return 0;
}

> Matthew
> 
> 
>> 
>> On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote:
>> 
>>> On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:
>>> 
>>>> 
>>>> In do_matrix in src/array.c there is a type switch containing :
>>>> 
>>>> case LGLSXP :
>>>>  for (i = 0; i < nr; i++)
>>>>  for (j = 0; j < nc; j++)
>>>>      LOGICAL(ans)[i + j * NR] = NA_LOGICAL;
>>>> 
>>>> That seems page inefficient, iiuc. Think it should be :
>>>> 
>>>> case LGLSXP :
>>>>  for (j = 0; j < nc; j++)
>>>>  for (i = 0; i < nr; i++)
>>>>      LOGICAL(ans)[i + j * NR] = NA_LOGICAL;
>>>> 
>>>> or more simply :
>>>> 
>>>> case LGLSXP :
>>>>  for (i = 0; i < nc*nr; i++)
>>>>      LOGICAL(ans)[i] = NA_LOGICAL;
>>>> 
>>>> ( with some fine tuning required since NR is type R_xlen_t whilst i, nc
>>>> and nr are type int ).
>>>> 
>>>> Same goes for all the other types in that switch.
>>>> 
>>>> This came up on Stack Overflow here :
>>>> http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r
>>>> 
>>> 
>>> That is completely irrelevant - modern compilers will optimize the loops
>>> accordingly and there is no difference in speed. If you don't believe
>>> it, run benchmarks ;)
>>> 
>>> original
>>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10)
>>> Unit: milliseconds
>>>                              expr      min       lq  median       uq
>>>  max
>>> 1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196
>>> 1409.901
>>> 
>>> 
>>> swapped
>>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10)
>>> Unit: milliseconds
>>>                              expr      min       lq   median      uq
>>>  max
>>> 1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001
>>> 1246.573
>>> 
>>> Cheers,
>>> Simon
>>> 
>>> 
>>>> Matthew
>>>> 
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>> 
>>>> 
>>> 
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>>> 
>> 
>> 
> 
> 
>