[Rd] Possible page inefficiency in do_matrix in array.c
Simon Urbanek
simon.urbanek at r-project.org
Tue Sep 4 23:05:14 CEST 2012
On Sep 4, 2012, at 8:07 AM, "Matthew Dowle" <mdowle at mdowle.plus.com> wrote:
>
>> Actually, my apologies, I was assuming that your example was based on the
>> SO question while it is not at all (the code is not involved in that test
>> case). Reversing the order does indeed cause a delay. Switching to a
>> single index doesn't seem to have any impact. R-devel has the faster
>> version now (which now also works with large vectors).
>>
>> Cheers,
>> Simon
>
> I was intrigued why the compiler doesn't swap the loops when you thought
> it should, though. You're not usually wrong! From GCC's documentation (end
> of last paragraph is the most significant) :
>
> ====
>
> -floop-interchange
> Perform loop interchange transformations on loops. Interchanging two
> nested loops switches the inner and outer loops. For example, given a loop
> like:
> DO J = 1, M
> DO I = 1, N
> A(J, I) = A(J, I) * C
> ENDDO
> ENDDO
>
> loop interchange transforms the loop as if it were written:
>
> DO I = 1, N
> DO J = 1, M
> A(J, I) = A(J, I) * C
> ENDDO
> ENDDO
>
> which can be beneficial when N is larger than the caches, because in
> Fortran, the elements of an array are stored in memory contiguously by
> column, and the original loop iterates over rows, potentially creating at
> each access a cache miss. This optimization applies to all the languages
> supported by GCC and is not limited to Fortran. To use this code
> transformation, GCC has to be configured with --with-ppl and --with-cloog
> to enable the Graphite loop transformation infrastructure.
>
> ====
>
> Could R build scripts be configured to set these gcc flags to turn on
> "Graphite", then? I guess one downside could be the time to compile.
>
The is something odd happening - when I use stand-alone code it works:
$ gcc -o t2 -O3 t2.c
$ time ./t2
0x7fbae3b4f010
real 0m1.045s
user 0m0.784s
sys 0m0.260s
$ gcc -o t2 -floop-interchange -O3 t2.c
$ time ./t2
0x7f4e516f2010
real 0m0.418s
user 0m0.044s
sys 0m0.372s
However, when I split off the loop into a parametrized function it doesn't:
$ gcc -floop-interchange -O3 t.c tt.c -o t && ./t
0x7fdd37cca010
loop time = 1772.085ms
$ gcc -O3 t.c tt.c -o t && ./t
0x7f3aa8777010
loop time = 1763.888ms
For comparison , manually swapping i and j:
$ gcc -floop-interchange -O3 t.c tt.c -o t && ./t
0x7feecd4c9010
loop time = 451.744ms
For the same reason, it doesn't work for the old R code. I wonder what's happening there - I guess the optimizer is not smart enough to realize the coverage is the entire m*n span despite the fact that m and n are parameters ... But it's certainly something it should optimize as it did when used directly in a function... Odd ...
Cheers,
Simon
--
PS: Note this has nothing to do with "R builds scripts" - it is user's responsibility to add optimization flags since they are very much compiler and architecture-dependent. Also optimizations are occasionally known to backfire, so the default is always more conservative.
FWIW this is my latest incarnation of configuring optimized R on E5 machines (minus BLAS and prefix flags which will vary by system):
'--enable-lto' '--enable-R-shlib' 'CFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'CXXFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native' 'FCFLAGS=-g -O3 -fgcse-las -fgcse-sm -fgraphite-identity -floop-interchange -floop-strip-mine -floop-block -ftree-loop-distribution -mavx -march=native -mtune=native'
-- test code (test compiled with gcc (Ubuntu/Linaro 4.7.1-7ubuntu1) 4.7.1 20120814 (prerelease))
t.c:
void foo(int *x, const unsigned int m, const unsigned int n) {
int i, j;
for (i = 0; i < m; i++)
for(j = 0; j < n; j++)
x[i + j * m] = 1;
}
tt.c:
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
void foo(int *, int, int);
static double ts() {
struct timeval tv;
gettimeofday(&tv, 0);
return (double)tv.tv_sec + ((double) tv.tv_usec) / 1000000.0;
}
int main() {
int m = 10000, n = 10000;
int *x = (int*) malloc(m * n * sizeof(int));
double a = ts(), b;
foo(x, m , n);
b = ts();
printf("%p\nloop time = %.3fms\n", x, (b - a) * 1000);
return 0;
}
t2.c:
#include <stdio.h>
#include <stdlib.h>
int *foo() {
int m = 10000, n = 10000;
int *x = (int*) malloc(m * n * sizeof(int));
int i, j;
for (i = 0; i < m; i++)
for(j = 0; j < n; j++)
x[i + j * m] = 1;
return x;
}
int main() {
printf("%p", foo());
return 0;
}
> Matthew
>
>
>>
>> On Sep 2, 2012, at 10:32 PM, Simon Urbanek wrote:
>>
>>> On Sep 2, 2012, at 10:04 PM, Matthew Dowle wrote:
>>>
>>>>
>>>> In do_matrix in src/array.c there is a type switch containing :
>>>>
>>>> case LGLSXP :
>>>> for (i = 0; i < nr; i++)
>>>> for (j = 0; j < nc; j++)
>>>> LOGICAL(ans)[i + j * NR] = NA_LOGICAL;
>>>>
>>>> That seems page inefficient, iiuc. Think it should be :
>>>>
>>>> case LGLSXP :
>>>> for (j = 0; j < nc; j++)
>>>> for (i = 0; i < nr; i++)
>>>> LOGICAL(ans)[i + j * NR] = NA_LOGICAL;
>>>>
>>>> or more simply :
>>>>
>>>> case LGLSXP :
>>>> for (i = 0; i < nc*nr; i++)
>>>> LOGICAL(ans)[i] = NA_LOGICAL;
>>>>
>>>> ( with some fine tuning required since NR is type R_xlen_t whilst i, nc
>>>> and nr are type int ).
>>>>
>>>> Same goes for all the other types in that switch.
>>>>
>>>> This came up on Stack Overflow here :
>>>> http://stackoverflow.com/questions/12220128/reason-for-faster-matrix-allocation-in-r
>>>>
>>>
>>> That is completely irrelevant - modern compilers will optimize the loops
>>> accordingly and there is no difference in speed. If you don't believe
>>> it, run benchmarks ;)
>>>
>>> original
>>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10)
>>> Unit: milliseconds
>>> expr min lq median uq
>>> max
>>> 1 matrix(nrow = 10000, ncol = 9999) 940.5519 940.6644 941.136 954.7196
>>> 1409.901
>>>
>>>
>>> swapped
>>>> microbenchmark(matrix(nrow=10000, ncol=9999), times=10)
>>> Unit: milliseconds
>>> expr min lq median uq
>>> max
>>> 1 matrix(nrow = 10000, ncol = 9999) 949.9638 950.6642 952.7497 961.001
>>> 1246.573
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>>> Matthew
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>>
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>>
>>
>
>
>
More information about the R-devel
mailing list