performance of apply

Andreas Weingessel Andreas.Weingessel@ci.tuwien.ac.at
Thu, 28 May 1998 15:25:59 +0200 (CEST)


I noticed that apply is VERY SLOW when applied over a "large"
dimension, for example when computing the row sums of a matrix with
thousands of rows.

To demonstrate this, I did some benchmarking for different methods of
computing the row sums of an n x 10 matrix with n = 2000, ..., 10000.

The first method (M1) I used is the normal apply command:
	y <- apply(x,1,sum)
The second method (M2) uses a for-loop, where the memory for the
result vector is allocated in advance. That is, for n = 2000:
	z <- numeric(2000); for (i in 1:2000) z[i] <- sum(x[i,])
The third method (M3) also uses a for-loop, but the result vector is
grown by concatenation in every iteration, i.e.
	z1 <- NULL; for (i in 1:2000) z1 <- c(z1,sum(x[i,]))
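
For reference, here is a self-contained version of the three methods.
The test matrix x of random normals is my own placeholder, not the
data used for the timings below:

	n <- 2000
	x <- matrix(rnorm(n * 10), nrow = n)    # placeholder test data

	## M1: apply over the row dimension
	y <- apply(x, 1, sum)

	## M2: for-loop, result vector preallocated
	z <- numeric(n)
	for (i in 1:n) z[i] <- sum(x[i, ])

	## M3: for-loop, result grown by c() in every iteration
	z1 <- NULL
	for (i in 1:n) z1 <- c(z1, sum(x[i, ]))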

All computations were made on a Pentium II 233 MHz with 256 MB RAM, R
started as R -v 50. The following table shows the minimum, mean, and
maximum CPU time in seconds, as measured by system.time over 10 runs,
for each method and for different values of n.
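
The summaries can be reproduced with a loop of roughly the following
form (a sketch of the procedure described above; taking element [1]
of system.time's result as the user CPU time is my assumption):

	times <- numeric(10)
	for (r in 1:10)
		times[r] <- system.time(apply(x, 1, sum))[1]   # user CPU time
	c(min(times), mean(times), max(times))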

        M1 (apply)              M2 (prealloc. loop)    M3 (growing loop)
n        min   mean    max       min  mean  max         min  mean  max
2000     4.03  4.16   4.34      0.27  0.40  0.47       0.51  0.63  0.71
4000    12.65 13.40  14.68      0.73  0.81  0.94       1.78  1.86  1.98
6000    26.51 28.14  29.50      1.19  1.22  1.38       3.79  3.80  3.80
8000    52.06 63.43  67.61      1.46  1.63  1.69       6.38  6.41  6.58
10000   84.06 98.17 118.94      1.93  2.01  2.13       9.78  9.79  9.81

That is, computing the row sums of a 10000 x 10 matrix with apply
takes about 100 sec on average, whereas a simple for-loop does the
same job in about 2 sec.

The next table shows, for each method, the relative increase in time
(normalized to the n = 2000 case); the first column is the factor
n/2000 by which the number of rows grows.

n/2000	M1	M2	M3
1	 1	1	 1
2	 3.221	2.025	 2.952
3	 6.764	3.050	 6.032
4	15.247	4.075	10.175
5	23.599	5.025	15.540

You can see that the time for M2 (for-loop with memory allocated at
the start) increases linearly with n, which is to be expected,
whereas apply (M1) scales roughly quadratically with the number of
rows. M3 (for-loop with a new allocation in every iteration) lies
somewhere in between, which makes sense: each c(z1, ...) copies the
whole vector built so far, about n^2/2 element copies in total.


So obviously, R becomes slow when new memory has to be allocated
repeatedly. I do not see a quick fix for apply, because apply can be
used with arbitrary functions which might yield different results for
different input data. But as long as memory allocation is a
performance bottleneck, one should be aware of this problem, use
simple for-loops with preallocated result vectors, and use apply only
on small data (small dimensions) or where it is really necessary.
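
For the special case of row sums, a fully vectorized alternative
avoids the per-row allocations altogether. The matrix-multiplication
trick below is my own suggestion and was not part of the benchmark:

	## row sums as a single matrix product;
	## one allocation for the result vector
	y <- drop(x %*% rep(1, ncol(x)))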

	Andreas

************************************************************************
*                          Andreas Weingessel                          *
************************************************************************
* Institut für Statistik      *                Tel: (+43 1) 58801 4541 *
* Technische Universität Wien *                Fax: (+43 1)  504 14 98 *
* Wiedner Hauptstr. 8-10/1071 *     Andreas.Weingessel@ci.tuwien.ac.at *
* A-1040 Wien, Austria        * http://www.ci.tuwien.ac.at/~weingessel *
************************************************************************
