[Rd] suggestion how to use memcpy in duplicate.c

Sat Apr 24 17:20:02 CEST 2010

Herve,

I think you code just confirms what I said -- for small nt for() wins, otherwise memcpy wins. Taking your measurements (they are a bit crude since they measure overhead as well):

ginaz:sandbox$ time ./hmc.mc 1

real	0m7.294s
user	0m7.239s
sys	0m0.054s
ginaz:sandbox$ time ./hmc 1

real	0m3.773s
user	0m3.746s
sys	0m0.024s

so for() is about 2x faster

ginaz:sandbox$ time ./hmc 3

real	0m4.751s
user	0m4.718s
sys	0m0.023s
ginaz:sandbox$ time ./hmc.mc 3

real	0m3.098s
user	0m3.051s
sys	0m0.045s

memcpy is about 50% faster.

It also proves me right when I said we should only special-case the common case of scalar recycling and use memcpy for everything else.

Cheers,
Simon

On Apr 23, 2010, at 9:21 PM, Hervé Pagès wrote:

> Follow up...
> 
> Hervé Pagès wrote:
>> Hi Matthew,
>> Matthew Dowle wrote:
>>> Just to add some clarification, the suggestion wasn't motivated by speeding up a length 3 vector being recycled 3.3 million times.  But its a good point that any change should not make that case slower.  I don't know how much vectorCopy is called really,  DUPLICATE_ATOMIC_VECTOR seems more significant, which doesn't recycle, and already had the FIXME next to it.
>>> 
>>> Where copyVector is passed a large source though, then memcpy should be faster than any of the methods using a for loop through each element (whether recycling or not),  allowing for the usual caveats. What are the timings like if you repeat the for loop 100 times to get a more robust timing ?  It needs to be a repeat around the for loop only, not the allocVector whose variance looks to be included in those timings below. Then increase the size of the source vector,  and compare to memcpy.
>> On my system (DELL LATITUDE laptop with 64-bit 9.04 Ubuntu):
>> #include <stdio.h>
>> #include <string.h>
>> #include <stdlib.h>
>> void *memcpy2(char *dest, const char *src, size_t n)
>> {
>>        int i;
>>        for (i = 0; i < n; i++) *(dest++) = *(src++);
>>        return dest;
>> }
>> int main()
>> {
>>        int n, kmax, k;
>>        char *x, *y;
>>        n = 25000000;
>>    kmax = 100;
>>        x = (char *) malloc(n);
>>        y = (char *) malloc(n);
>>        for (k = 0; k < kmax; k++)
>>                //memcpy2(y, x, n);
>>                memcpy(y, x, n);
>>        return 0;
>> }
>> Benchmarks:
>> n = 25000000, kmax = 100, memcpy2:
>>  real    0m8.123s
>>  user    0m8.077s
>>  sys    0m0.040s
>> n = 25000000, k = 100, memcpy:
>>  real    0m1.076s
>>  user    0m1.004s
>>  sys    0m0.060s
>> n = 25000, kmax = 100000, memcpy2:
>>  real    0m8.033s
>>  user    0m8.005s
>>  sys    0m0.012s
>> n = 25000, kmax = 100000, memcpy:
>>  real    0m0.353s
>>  user    0m0.352s
>>  sys    0m0.000s
>> n = 25, kmax = 100000000, memcpy2:
>>  real    0m8.351s
>>  user    0m8.313s
>>  sys    0m0.008s
>> n = 25, kmax = 100000000, memcpy:
>>  real    0m0.628s
>>  user    0m0.624s
>>  sys    0m0.004s
>> So depending on the size of the memory area to copy, GNU memcpy() is
>> between 7.5x and 22x faster than using a for() loop. You can reasonably
>> expect that the authors of memcpy() have done their best to optimize
>> the code for most platforms they support, for big and small memory
>> areas, and that if there was a need to branch based on the size of the
>> area, that's already done *inside* memcpy() (I'm just speculating here,
>> I didn't look at memcpy's source code).
> 
> So for copying a vector of integer (with recycling of the source),
> yes, a memcpy-based implementation is much faster, for long and small
> vectors (even for a length 3 vector being recycled 3.3 million
> times ;-) ), at least on my system:
> 
> nt = 3; ns = 10000000; kmax = 100; copy_ints:
> 
>  real	0m1.206s
>  user	0m1.168s
>  sys	0m0.040s
> 
> nt = 3; ns = 10000000; kmax = 100; copy_ints2:
> 
>  real	0m6.326s
>  user	0m6.264s
>  sys	0m0.052s
> 
> 
> Code:
> =======================================================================
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> 
> void memcpy_with_recycling_of_src(char *dest, size_t dest_nblocks,
> 				  const char *src, size_t src_nblocks,
> 				  size_t blocksize)
> {
> 	int i, imax, q;
> 	size_t src_size;
> 
> 	imax = dest_nblocks - src_nblocks;
> 	src_size = src_nblocks * blocksize;
> 	for (i = 0; i <= imax; i += src_nblocks) {
> 		memcpy(dest, src, src_size);
> 		dest += src_size;
> 		i += src_nblocks;
> 	}
> 	q = dest_nblocks - i;
> 	if (q > 0)
> 		memcpy(dest, src, q * blocksize);
> 	return;
> }
> 
> void copy_ints(int *dest, int dest_length,
>               const int *src, int src_length)
> {
> 	memcpy_with_recycling_of_src((char *) dest, dest_length,
> 				     (char *) src, src_length,
> 				     sizeof(int));
> }
> 
> /* the copyVector() way */
> void copy_ints2(int *dest, int dest_length,
>                const int *src, int src_length)
> {
> 	int i;
> 
> 	for (i = 0; i < dest_length; i++)
> 		dest[i] = src[i % src_length];
> }
> 
> int main()
> {
> 	int nt, ns, kmax, k;
> 	int *t, *s;
> 
> 	nt = 3;
> 	ns = 10000000;
> 	kmax = 100;
> 	t = (int *) malloc(nt * sizeof(int));
> 	s = (int *) malloc(ns * sizeof(int));
> 	for (k = 0; k < kmax; k++)
> 		//copy_ints(s, ns, t, nt);
> 		copy_ints2(s, ns, t, nt);
> 	return 0;
> }
> 
> Note that the function that actually does the job is
> memcpy_with_recycling_of_src(). It can be reused for copying
> vectors with elements of an arbitrary size.
> 
> Cheers,
> H.
> 
>> Cheers,
>> H.
>>> 
>>> Matthew
>>> 
>>> "William Dunlap" <wdunlap at tibco.com> wrote in message news:77EB52C6DD32BA4D87471DCD70C8D70002CE6BFA at NA-PA-VBE03.na.tibco.com...
>>> If I were worried about the time this loop takes,
>>> I would avoid using i%nt.  For the attached C code
>>> compile with gcc 4.3.3 with -O2 I get
>>>  > # INTEGER() in loop
>>>  > system.time( r1 <- .Call("my_rep1", 1:3, 1e7) )
>>>     user  system elapsed
>>>    0.060   0.012   0.071
>>> 
>>>  > # INTEGER() before loop
>>>  > system.time( r2 <- .Call("my_rep2", 1:3, 1e7) )
>>>     user  system elapsed
>>>    0.076   0.008   0.086
>>> 
>>>  > # replace i%src_length in loop with j=0 before loop and
>>>  > #    if(++j==src_length) j=0 ;
>>>  > # in the loop.
>>>  > system.time( r3 <- .Call("my_rep3", 1:3, 1e7) )
>>>     user  system elapsed
>>>    0.024   0.028   0.050
>>>  > identical(r1,r2) && identical(r2,r3)
>>>  [1] TRUE
>>> 
>>> The C code is:
>>> #define USE_RINTERNALS /* pretend we are in the R kernel */
>>> #include <R.h>
>>> #include <Rinternals.h>
>>> 
>>> 
>>> SEXP my_rep1(SEXP s_src, SEXP s_dest_length)
>>> {
>>>    int src_length = length(s_src) ;
>>>    int dest_length = asInteger(s_dest_length) ;
>>>    int i,j ;
>>>    SEXP s_dest ;
>>>    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
>>>    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
>>>    for(i=0;i<dest_length;i++) {
>>>        INTEGER(s_dest)[i] = INTEGER(s_src)[i % src_length] ;
>>>    }
>>>    UNPROTECT(1) ;
>>>    return s_dest ;
>>> }
>>> SEXP my_rep2(SEXP s_src, SEXP s_dest_length)
>>> {
>>>    int src_length = length(s_src) ;
>>>    int dest_length = asInteger(s_dest_length) ;
>>>    int *psrc = INTEGER(s_src) ;
>>>    int *pdest ;
>>>    int i ;
>>>    SEXP s_dest ;
>>>    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
>>>    pdest = INTEGER(s_dest) ;
>>>    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
>>>    /* end of boilerplate */
>>>    for(i=0;i<dest_length;i++) {
>>>        pdest[i] = psrc[i % src_length] ;
>>>    }
>>>    UNPROTECT(1) ;
>>>    return s_dest ;
>>> }
>>> SEXP my_rep3(SEXP s_src, SEXP s_dest_length)
>>> {
>>>    int src_length = length(s_src) ;
>>>    int dest_length = asInteger(s_dest_length) ;
>>>    int *psrc = INTEGER(s_src) ;
>>>    int *pdest ;
>>>    int i,j ;
>>>    SEXP s_dest ;
>>>    PROTECT(s_dest = allocVector(INTSXP, dest_length)) ;
>>>    pdest = INTEGER(s_dest) ;
>>>    if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ;
>>>    /* end of boilerplate */
>>>    for(j=0,i=0;i<dest_length;i++) {
>>>        *pdest++ = psrc[j++] ;
>>>        if (j==src_length) {
>>>            j = 0 ;
>>>        }
>>>    }
>>>    UNPROTECT(1) ;
>>>    return s_dest ;
>>> }
>>> 
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>> 
>>>> -----Original Message-----
>>>> From: r-devel-bounces at r-project.org
>>>> [mailto:r-devel-bounces at r-project.org] On Behalf Of Romain Francois
>>>> Sent: Wednesday, April 21, 2010 12:32 PM
>>>> To: Matthew Dowle
>>>> Cc: r-devel at stat.math.ethz.ch
>>>> Subject: Re: [Rd] suggestion how to use memcpy in duplicate.c
>>>> 
>>>> Le 21/04/10 17:54, Matthew Dowle a écrit :
>>>>>> From copyVector in duplicate.c :
>>>>> void copyVector(SEXP s, SEXP t)
>>>>> {
>>>>>     int i, ns, nt;
>>>>>     nt = LENGTH(t);
>>>>>     ns = LENGTH(s);
>>>>>     switch (TYPEOF(s)) {
>>>>> ...
>>>>>     case INTSXP:
>>>>>     for (i = 0; i<  ns; i++)
>>>>>         INTEGER(s)[i] = INTEGER(t)[i % nt];
>>>>>     break;
>>>>> ...
>>>>> 
>>>>> could that be replaced with :
>>>>> 
>>>>>     case INTSXP:
>>>>>     for (i=0; i<ns/nt; i++)
>>>>>         memcpy((char *)DATAPTR(s)+i*nt*sizeof(int), (char
>>>> *)DATAPTR(t),
>>>>> nt*sizeof(int));
>>>>>     break;
>>>> or at least with something like this:
>>>> 
>>>> int* p_s = INTEGER(s) ;
>>>> int* p_t = INTEGER(t) ;
>>>> for( i=0 ; i < ns ; i++){
>>>> p_s[i] = p_t[i % nt];
>>>> }
>>>> 
>>>> since expanding the INTEGER macro over and over has a price.
>>>> 
>>>>> and similar for the other types in copyVector.  This won't
>>>> help regular
>>>>> vector copies, since those seem to be done by the
>>>> DUPLICATE_ATOMIC_VECTOR
>>>>> macro, see next suggestion below, but it should help
>>>> copyMatrix which calls
>>>>> copyVector, scan.c which calls copyVector on three lines,
>>>> dcf.c (once) and
>>>>> dounzip.c (once).
>>>>> 
>>>>> For the DUPLICATE_ATOMIC_VECTOR macro there is already a
>>>> comment next to it
>>>>> :
>>>>> 
>>>>>     <FIXME>: surely memcpy would be faster here?
>>>>> 
>>>>> which seems to refer to the for loop  :
>>>>> 
>>>>>     else { \
>>>>>     int __i__; \
>>>>>     type *__fp__ = fun(from), *__tp__ = fun(to); \
>>>>>     for (__i__ = 0; __i__<  __n__; __i__++) \
>>>>>       __tp__[__i__] = __fp__[__i__]; \
>>>>>   } \
>>>>> 
>>>>> Could that loop be replaced by the following ?
>>>>> 
>>>>>    else { \
>>>>>    memcpy((char *)DATAPTR(to), (char *)DATAPTR(from),
>>>> __n__*sizeof(type)); \
>>>>>    }\
>>>>> 
>>>>> In the data.table package, dogroups.c uses this technique,
>>>> so the principle
>>>>> is tested and works well so far.
>>>>> 
>>>>> Are there any road blocks preventing this change, or is
>>>> anyone already
>>>>> working on it ?  If not then I'll try and test it (on
>>>> Ubuntu 32bit) and
>>>>> submit patch with timings, as before.  Comments/pointers
>>>> much appreciated.
>>>>> Matthew
>>>>> 
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>> 
>>>>> 
>>>> 
>>>> -- 
>>>> Romain Francois
>>>> Professional R Enthusiast
>>>> +33(0) 6 28 91 30 30
>>>> http://romainfrancois.blog.free.fr
>>>> |- http://bit.ly/9aKDM9 : embed images in Rd documents
>>>> |- http://tr.im/OIXN : raster images and RImageJ
>>>> |- http://tr.im/OcQe : Rcpp 0.7.7
>>>> 
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>> 
>>> 
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
>