[Rd] Measuring and comparing .C and .Call overhead

Simon Urbanek simon.urbanek at r-project.org
Tue Jul 19 20:47:46 CEST 2011


On Jul 19, 2011, at 11:07 AM, Alireza Mahani wrote:

> Further pursuing my curiosity to measure the efficiency of R/C++ interface, I
> conducted a simple matrix-vector multiplication test using .C and .Call
> functions in R. In each case, I measured the execution time in R, as well as
> inside the C++ function. Subtracting the two, I came up with a measure of
> overhead associated with each call. I assume that this overhead would be
> non-existent if the entire code were implemented in C++. [Please let me know
> if my approach is somehow faulty.] The results of this simple test have
> significant implications for my software development approach. I hope others
> find this information useful as well. I have attached the code files in the
> bottom. Below I provide a summary output, and interpret the results. [All
> times in table below are in microseconds]
> 
>    m    n   .Call R   .Call C++   .Call ovh     .C R   .C C++    .C ovh
>  100   10      3.23        1.33      *1.90*    16.89     0.96   *15.93*
>  200   15      5.38        3.44      *1.94*    30.77     2.81   *27.96*
>  500   20     12.48       10.40      *2.08*    78.12     9.37   *68.75*
> 
> 

Your "overhead" calculations are anything but, since they include a loop, value replacement, arithmetic and unnecessary function calls - you don't need as.double() for .Call; it's more efficient to call coerceVector() in C, or to throw an error, depending on what you want - none of which has anything to do with .Call.
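To illustrate the coercion point, a minimal sketch of the C side (using R's Rinternals.h API; the body is elided and the error wording is my own):

```c
#include <Rinternals.h>

SEXP matvecMultiply(SEXP A, SEXP x)
{
    /* Option 1: coerce in C. coerceVector() returns its argument
       unchanged when it is already a double vector, so nothing is
       copied on the fast path. */
    PROTECT(A = coerceVector(A, REALSXP));
    PROTECT(x = coerceVector(x, REALSXP));

    /* Option 2: refuse anything that is not already a double vector:
       if (TYPEOF(A) != REALSXP || TYPEOF(x) != REALSXP)
           error("A and x must be double vectors");
    */

    double *rA = REAL(A), *rx = REAL(x);
    /* ... multiply as before ... */

    UNPROTECT(2);
    return R_NilValue; /* placeholder; return the real result here */
}
```

Either way, the repeated as.double() calls (and the copies they can force) disappear from the R-level loop.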

In addition, if you really cared about overhead, you would avoid symbol lookup millions of times:

foo.c:
#include <Rinternals.h>
SEXP foo() { return R_NilValue; }

> system.time(for(i in 1:1e6) .Call("foo"))
   user  system elapsed 
  4.943   0.016   4.960 
> foo = getNativeSymbolInfo("foo")$address
> system.time(for(i in 1:1e6) .Call(foo))
   user  system elapsed 
  0.363   0.001   0.365 

This is very close to just calling a constant closure with the same result:

> f=function() NULL
> system.time(for(i in 1:1e6) f())
   user  system elapsed 
  0.238   0.001   0.239 

So .Call overhead is essentially negligible - and not what you were measuring ;).

There is only a tiny overhead in forcing the argument promise, but it's constant:

> x=rnorm(1e6)
> system.time(for(i in 1:1e6) .Call(foo,x))
   user  system elapsed 
  0.455   0.001   0.456 
> x=rnorm(1e9)
> system.time(for(i in 1:1e6) .Call(foo,x))
   user  system elapsed 
  0.454   0.000   0.455 

and it's again no different than for a closure:

> f=function(x) NULL
> system.time(for(i in 1:1e6) f())
   user  system elapsed 
  0.259   0.001   0.259 
> system.time(for(i in 1:1e6) f(x))
   user  system elapsed 
  0.329   0.001   0.329 

so this is inherent to any call. .Call is the fastest you can get; there is nothing faster (other than hacking your code into R internals...).

Cheers,
Simon


> Interpretation:
> 
> 1- .Call overhead holds nearly constant, i.e. independent of data size. This
> is expected since .Call works by passing pointers. (How can we explain the
> slight increase in overhead?)
> 2- C++ times for .C are somewhat better than .Call. This is likely to be due
> to the overhead associated with unpacking the SEXP pointers in a .Call
> function.
> 3- The overhead for .C dominates the execution time. For a 500x20 matrix,
> the overhead is ~90% of total time. This means that whenever we need to make
> repeated calls to a C/C++ function from R, and when performance is important
> to us, .Call is much preferred to .C, even at modest data sizes.
> 4- Overhead for .C scales sub-linearly with data size. I imagine that this
> overhead can be reduced through registering the functions and using the
> style field in R_CMethodDef to optimize data transfer (per Section 5.4 of
> "Writing R Extensions"), but perhaps not by more than a half.
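[For reference, the registration that point 4 alludes to would look roughly like the sketch below, appended inside the extern "C" block of mvMultiply.cc. This assumes the 2011-era R_CMethodDef layout from R_ext/Rdynload.h; note that whether R actually uses the style information to avoid copies has varied, so check the manual for your version before expecting a gain.

```c
#include <R_ext/Rdynload.h>

void matvecMultiply2(double *A, double *x, double *y,
                     int *m_ptr, int *n_ptr, double *execTime);

/* argument types and in/out styles for matvecMultiply2 */
static R_NativePrimitiveArgType mv2_types[6] =
    {REALSXP, REALSXP, REALSXP, INTSXP, INTSXP, REALSXP};
static R_NativeArgStyle mv2_styles[6] =
    {R_ARG_IN, R_ARG_IN, R_ARG_OUT, R_ARG_IN, R_ARG_IN, R_ARG_OUT};

static const R_CMethodDef cMethods[] = {
    {"matvecMultiply2", (DL_FUNC) &matvecMultiply2, 6, mv2_types, mv2_styles},
    {NULL, NULL, 0, NULL, NULL}
};

/* called automatically when dyn.load()ing mvMultiply.so */
void R_init_mvMultiply(DllInfo *dll)
{
    R_registerRoutines(dll, cMethods, NULL, NULL, NULL);
}
```
]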
> 5- Even using .Call, the overhead is still a significant percentage of total
> time, though the contribution decreases with data size (and compute load of
> function). However, in the context of parallelization benefits, this
> overhead puts an unpleasant cap on achievable gains. For example, even if
> the compute time goes down by 100x, if overhead was even 10% of the time,
> our effective gain will saturate at 10x. In other words, effective
> parallelization can only be achieved by moving big chunks of the code to
> C/C++ so as to avoid repeated calls from R to C.
> 
> I look forward to your comments.
> 
> -Alireza
> --------------------------------
> #code file: mvMultiply.r
> #measuring execution time for a matrix-vector multiplication (A %*% x) using .C and .Call
> #execution time is measured both in R and inside the C++ functions; subtracting the two
> #is used as an indicator of the overhead associated with each call
> 
> rm(list = ls())
> system("R CMD SHLIB mvMultiply.cc")
> dyn.load("mvMultiply.so")
> 
> m <- 100 #number of rows in matrix A
> n <- 10 #number of columns in matrix A (= number of rows in vector x)
> N <- 1000000
> 
> A <- runif(m*n)
> x <- runif(n)
> 
> # measuring .Call
> tCall_c <- 0.0
> t1 <- proc.time()[3]
> for (i in 1:N) {
> 	tCall_c <- tCall_c + .Call("matvecMultiply", as.double(A), as.double(x))
> }
> tCall_R <- proc.time()[3] - t1
> cat(".Call - Time measured in R: ", round(tCall_R,2), "sec\n")
> cat(".Call - Time measured in C++: ", round(tCall_c,2), "sec\n")
> cat(".Call - Implied overhead: ", round(tCall_R,2) - round(tCall_c,2), "sec  ->  per call: ", 1000000*(round(tCall_R,2) - round(tCall_c,2))/N, "usec\n")
> 
> # measuring .C
> tC_c <- 0.0
> t1 <- proc.time()[3]
> for (i in 1:N) {
> 	tC_c <- tC_c + .C("matvecMultiply2", as.double(A), as.double(x), double(m), as.integer(m), as.integer(n), t = double(1))$t
> }
> tC_R <- proc.time()[3] - t1
> cat(".C - Time measured in R: ", round(tC_R,2), "sec\n")
> cat(".C - Time measured in C++: ", round(tC_c,2), "sec\n")
> cat(".C - Implied overhead: ", round(tC_R,2) - round(tC_c,2), "sec  ->  per call: ", 1000000*(round(tC_R,2) - round(tC_c,2))/N, "usec\n")
> 
> dyn.unload("mvMultiply.so")
> --------------------------------
> #code file: mvMultiply.cc
> #include <Rinternals.h>
> #include <R.h>
> #include <sys/time.h>
> 
> extern "C" {
> 
> SEXP matvecMultiply(SEXP A, SEXP x) {
> 	timeval time1, time2;
> 	gettimeofday(&time1, NULL);
> 	SEXP execTime_sxp;
> 	PROTECT(execTime_sxp = allocVector(REALSXP, 1));
> 	double *execTime = REAL(execTime_sxp);
> 	double *rA = REAL(A), *rx = REAL(x), *ry;
> 	int n = length(x);
> 	int m = length(A) / n;
> 	SEXP y;
> 	PROTECT(y = allocVector(REALSXP, m));
> 	ry = REAL(y);
> 	for (int i = 0; i < m; i++) {
> 		ry[i] = 0.0;
> 		for (int j = 0; j < n; j++) {
> 			ry[i] += rA[j * m + i] * rx[j];
> 		}
> 	}
> 	UNPROTECT(1);
> 	gettimeofday(&time2, NULL);
> 	*execTime = (time2.tv_sec - time1.tv_sec) + (time2.tv_usec - time1.tv_usec)/1000000.0;
> 	UNPROTECT(1);
> 	return execTime_sxp;
> }
> 
> void matvecMultiply2(double *A, double *x, double *y, int *m_ptr, int *n_ptr, double *execTime) {
> 	timeval time1, time2;
> 	gettimeofday(&time1, NULL);
> 	int m = *m_ptr;
> 	int n = *n_ptr;
> 	for (int i = 0; i < m; i++) {
> 		y[i] = 0.0;
> 		for (int j = 0; j < n; j++) {
> 			y[i] += A[j * m + i] * x[j];
> 		}
> 	}
> 	gettimeofday(&time2, NULL);
> 	*execTime = (time2.tv_sec - time1.tv_sec) + (time2.tv_usec - time1.tv_usec)/1000000.0;
> }
> 
> }
> --------------------------------
> 
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/Measuring-and-comparing-C-and-Call-overhead-tp3678361p3678361.html
> Sent from the R devel mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 


