[Rd] suggestion how to use memcpy in duplicate.c

Sat Apr 24 03:21:59 CEST 2010

Follow up...

Hervé Pagès wrote:
> Hi Matthew,
> 
> Matthew Dowle wrote:
>> Just to add some clarification, the suggestion wasn't motivated by 
>> speeding up a length 3 vector being recycled 3.3 million times.  But 
>> its a good point that any change should not make that case slower.  I 
>> don't know how much vectorCopy is called really,  
>> DUPLICATE_ATOMIC_VECTOR seems more significant, which doesn't recycle, 
>> and already had the FIXME next to it.
>>
>> Where copyVector is passed a large source though, then memcpy should 
>> be faster than any of the methods using a for loop through each 
>> element (whether recycling or not),  allowing for the usual caveats. 
>> What are the timings like if you repeat the for loop 100 times to get 
>> a more robust timing ?  It needs to be a repeat around the for loop 
>> only, not the allocVector whose variance looks to be included in those 
>> timings below. Then increase the size of the source vector,  and 
>> compare to memcpy.
> 
> On my system (DELL LATITUDE laptop with 64-bit 9.04 Ubuntu):
> 
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> 
> void *memcpy2(char *dest, const char *src, size_t n)
> {
>         int i;
> 
>         for (i = 0; i < n; i++) *(dest++) = *(src++);
>         return dest;
> }
> 
> int main()
> {
>         int n, kmax, k;
>         char *x, *y;
> 
>         n = 25000000;
>     kmax = 100;
>         x = (char *) malloc(n);
>         y = (char *) malloc(n);
>         for (k = 0; k < kmax; k++)
>                 //memcpy2(y, x, n);
>                 memcpy(y, x, n);
>         return 0;
> }
> 
> Benchmarks:
> 
> n = 25000000, kmax = 100, memcpy2:
> 
>   real    0m8.123s
>   user    0m8.077s
>   sys    0m0.040s
> 
> n = 25000000, k = 100, memcpy:
> 
>   real    0m1.076s
>   user    0m1.004s
>   sys    0m0.060s
> 
> n = 25000, kmax = 100000, memcpy2:
> 
>   real    0m8.033s
>   user    0m8.005s
>   sys    0m0.012s
> 
> n = 25000, kmax = 100000, memcpy:
> 
>   real    0m0.353s
>   user    0m0.352s
>   sys    0m0.000s
> 
> n = 25, kmax = 100000000, memcpy2:
> 
>   real    0m8.351s
>   user    0m8.313s
>   sys    0m0.008s
> 
> n = 25, kmax = 100000000, memcpy:
> 
>   real    0m0.628s
>   user    0m0.624s
>   sys    0m0.004s
> 
> So depending on the size of the memory area to copy, GNU memcpy() is
> between 7.5x and 22x faster than using a for() loop. You can reasonably
> expect that the authors of memcpy() have done their best to optimize
> the code for most platforms they support, for big and small memory
> areas, and that if there was a need to branch based on the size of the
> area, that's already done *inside* memcpy() (I'm just speculating here,
> I didn't look at memcpy's source code).

So for copying a vector of integer (with recycling of the source),
yes, a memcpy-based implementation is much faster, for long and small
vectors (even for a length 3 vector being recycled 3.3 million
times ;-) ), at least on my system:

nt = 3; ns = 10000000; kmax = 100; copy_ints:

   real	0m1.206s
   user	0m1.168s
   sys	0m0.040s

nt = 3; ns = 10000000; kmax = 100; copy_ints2:

   real	0m6.326s
   user	0m6.264s
   sys	0m0.052s

Code:
=======================================================================
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void memcpy_with_recycling_of_src(char *dest, size_t dest_nblocks,
				  const char *src, size_t src_nblocks,
				  size_t blocksize)
{
	int i, imax, q;
	size_t src_size;

	imax = dest_nblocks - src_nblocks;
	src_size = src_nblocks * blocksize;
	for (i = 0; i <= imax; i += src_nblocks) {
		memcpy(dest, src, src_size);
		dest += src_size;
		i += src_nblocks;
	}
	q = dest_nblocks - i;
	if (q > 0)
		memcpy(dest, src, q * blocksize);
	return;
}

void copy_ints(int *dest, int dest_length,
                const int *src, int src_length)
{
	memcpy_with_recycling_of_src((char *) dest, dest_length,
				     (char *) src, src_length,
				     sizeof(int));
}

/* the copyVector() way */
void copy_ints2(int *dest, int dest_length,
                 const int *src, int src_length)
{
	int i;

	for (i = 0; i < dest_length; i++)
		dest[i] = src[i % src_length];
}

int main()
{
	int nt, ns, kmax, k;
	int *t, *s;

	nt = 3;
	ns = 10000000;
	kmax = 100;
	t = (int *) malloc(nt * sizeof(int));
	s = (int *) malloc(ns * sizeof(int));
	for (k = 0; k < kmax; k++)
		//copy_ints(s, ns, t, nt);
		copy_ints2(s, ns, t, nt);
	return 0;
}

Note that the function that actually does the job is
memcpy_with_recycling_of_src(). It can be reused for copying
vectors with elements of an arbitrary size.

Cheers,
H.

> 
> Cheers,
> H.
> 
>>
>> Matthew
>>
>> "William Dunlap" <wdunlap at tibco.com> wrote in message 
>> news:77EB52C6DD32BA4D87471DCD70C8D70002CE6BFA at NA-PA-VBE03.na.tibco.com...
>> If I were worried about the time this loop takes,
>> I would avoid using i%nt.  For the attached C code
>> compile with gcc 4.3.3 with -O2 I get
>>   > # INTEGER() in loop
>>   > system.time( r1 <- .Call("my_rep1", 1:3, 1e7) )
>>      user  system elapsed
>>     0.060   0.012   0.071
>>
>>   > # INTEGER() before loop
>>   > system.time( r2 <- .Call("my_rep2", 1:3, 1e7) )
>>      user  system elapsed
>>     0.076   0.008   0.086
>>
>>   > # replace i%src_length in loop with j=0 before loop and
>>   > #    if(++j==src_length) j=0 ;
>>   > # in the loop.
>>   > system.time( r3 <- .Call("my_rep3", 1:3, 1e7) )
>>      user  system elapsed
>>     0.024   0.028   0.050
>>   > identical(r1,r2) && identical(r2,r3)
>>   [1] TRUE
>>
>> The C code is:
>> #define USE_RINTERNALS /* pretend we are in the R kernel */
>> #include <R.h>
>> #include <Rinternals.h>
>>
>>
>> SEXP my_rep1(SEXP s_src, SEXP s_dest_length)
>> {
>>     int src_length = length(s_src) ;
>>     int dest_length = asInteger(s_dest_length) ;
>>     int i,j ;
>>     SEXP s_dest ;
>>     PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
>>     if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
>>     for(i=0;i<dest_length;i++) {
>>         INTEGER(s_dest)[i] = INTEGER(s_src)[i % src_length] ;
>>     }
>>     UNPROTECT(1) ;
>>     return s_dest ;
>> }
>> SEXP my_rep2(SEXP s_src, SEXP s_dest_length)
>> {
>>     int src_length = length(s_src) ;
>>     int dest_length = asInteger(s_dest_length) ;
>>     int *psrc = INTEGER(s_src) ;
>>     int *pdest ;
>>     int i ;
>>     SEXP s_dest ;
>>     PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
>>     pdest = INTEGER(s_dest) ;
>>     if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
>>     /* end of boilerplate */
>>     for(i=0;i<dest_length;i++) {
>>         pdest[i] = psrc[i % src_length] ;
>>     }
>>     UNPROTECT(1) ;
>>     return s_dest ;
>> }
>> SEXP my_rep3(SEXP s_src, SEXP s_dest_length)
>> {
>>     int src_length = length(s_src) ;
>>     int dest_length = asInteger(s_dest_length) ;
>>     int *psrc = INTEGER(s_src) ;
>>     int *pdest ;
>>     int i,j ;
>>     SEXP s_dest ;
>>     PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
>>     pdest = INTEGER(s_dest) ;
>>     if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
>>     /* end of boilerplate */
>>     for(j=0,i=0;i<dest_length;i++) {
>>         *pdest++ = psrc[j++] ;
>>         if (j==src_length) {
>>             j = 0 ;
>>         }
>>     }
>>     UNPROTECT(1) ;
>>     return s_dest ;
>> }
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>> -----Original Message-----
>>> From: r-devel-bounces at r-project.org
>>> [mailto:r-devel-bounces at r-project.org] On Behalf Of Romain Francois
>>> Sent: Wednesday, April 21, 2010 12:32 PM
>>> To: Matthew Dowle
>>> Cc: r-devel at stat.math.ethz.ch
>>> Subject: Re: [Rd] suggestion how to use memcpy in duplicate.c
>>>
>>> Le 21/04/10 17:54, Matthew Dowle a écrit :
>>>>> From copyVector in duplicate.c :
>>>> void copyVector(SEXP s, SEXP t)
>>>> {
>>>>      int i, ns, nt;
>>>>      nt = LENGTH(t);
>>>>      ns = LENGTH(s);
>>>>      switch (TYPEOF(s)) {
>>>> ...
>>>>      case INTSXP:
>>>>      for (i = 0; i<  ns; i++)
>>>>          INTEGER(s)[i] = INTEGER(t)[i % nt];
>>>>      break;
>>>> ...
>>>>
>>>> could that be replaced with :
>>>>
>>>>      case INTSXP:
>>>>      for (i=0; i<ns/nt; i++)
>>>>          memcpy((char *)DATAPTR(s)+i*nt*sizeof(int), (char
>>> *)DATAPTR(t),
>>>> nt*sizeof(int));
>>>>      break;
>>> or at least with something like this:
>>>
>>> int* p_s = INTEGER(s) ;
>>> int* p_t = INTEGER(t) ;
>>> for( i=0 ; i < ns ; i++){
>>> p_s[i] = p_t[i % nt];
>>> }
>>>
>>> since expanding the INTEGER macro over and over has a price.
>>>
>>>> and similar for the other types in copyVector.  This won't
>>> help regular
>>>> vector copies, since those seem to be done by the
>>> DUPLICATE_ATOMIC_VECTOR
>>>> macro, see next suggestion below, but it should help
>>> copyMatrix which calls
>>>> copyVector, scan.c which calls copyVector on three lines,
>>> dcf.c (once) and
>>>> dounzip.c (once).
>>>>
>>>> For the DUPLICATE_ATOMIC_VECTOR macro there is already a
>>> comment next to it
>>>> :
>>>>
>>>>      <FIXME>: surely memcpy would be faster here?
>>>>
>>>> which seems to refer to the for loop  :
>>>>
>>>>      else { \
>>>>      int __i__; \
>>>>      type *__fp__ = fun(from), *__tp__ = fun(to); \
>>>>      for (__i__ = 0; __i__<  __n__; __i__++) \
>>>>        __tp__[__i__] = __fp__[__i__]; \
>>>>    } \
>>>>
>>>> Could that loop be replaced by the following ?
>>>>
>>>>     else { \
>>>>     memcpy((char *)DATAPTR(to), (char *)DATAPTR(from),
>>> __n__*sizeof(type)); \
>>>>     }\
>>>>
>>>> In the data.table package, dogroups.c uses this technique,
>>> so the principle
>>>> is tested and works well so far.
>>>>
>>>> Are there any road blocks preventing this change, or is
>>> anyone already
>>>> working on it ?  If not then I'll try and test it (on
>>> Ubuntu 32bit) and
>>>> submit patch with timings, as before.  Comments/pointers
>>> much appreciated.
>>>> Matthew
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>>
>>>
>>> -- 
>>> Romain Francois
>>> Professional R Enthusiast
>>> +33(0) 6 28 91 30 30
>>> http://romainfrancois.blog.free.fr
>>> |- http://bit.ly/9aKDM9 : embed images in Rd documents
>>> |- http://tr.im/OIXN : raster images and RImageJ
>>> |- http://tr.im/OcQe : Rcpp 0.7.7
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319