[R] Need very fast application of 'diff' - ideas?
Hans W Borchers
hwborchers at googlemail.com
Sat Jan 28 17:20:22 CET 2012
R. Michael Weylandt <michael.weylandt <at> gmail.com> writes:
>
> I'd write your own diff() that eliminates the method dispatch and
> argument checking that diff -> diff.default does.
>
> x[-1] - x[-len(x)] # is all you really need.
> (# you could also try something like c(x[-1], NA) - x which may be
> marginally faster as it only subsets x once but you should profile to
> find out)
>
> is probably about as fast as you can get within pure R code (the
> function overhead will add a little bit of time as well, so if speed
> is truly the only thing that matters, best not to use it. If you wanna
> go for even more speed, you'll have to go to compiled code; I'd
> suggest inline+Rcpp as the easiest way to do so. That could get it
> down to a single pass through the vector in pure C (or nice C++) which
> seems to be a lower bound for speed.
>
> Michael
Python has become astonishingly fast during the last years. On an iMAc with
3.06 GHz I can see the following timings (though I do feel a bit suspicious
about the timings Python reports):
Python 0.040 s Version 2.6.1, 1e7 integer elements
Matlab 0.095 s Matlab's diff function (Version R2011b)
Matlab 0.315 s Matlab using x(2:N)-x(1:(N-1))
R 2.14.1 0.375 s R's diff() function
R 0.365 s R using x[-1]-x[-N]
R 0.270 s R using c(x[-1],NA)-x)
R+Fortran 0.180 s R function calling .Fortran
R+C 0.180 s R function calling .C
where---as an example---the C code looks like:
void diff(int *n, int *x, int *d)
{ for (long i=0; i<*n-2; i++) d[i] = x[i+1] - x[i]; }
There appears to be a factor of 4 between R+compiled code and Python code.
It is also interesting to see that in Matlab 'diff' is considerably faster
than differencing vectors, while in R it is slower.
P. S.: To make the comparison fair I have used the following Python call:
python -m timeit -n 1 -r 1
-s 'import numpy'
-s 'arr = numpy.random.randint(0, 1000, (10000000,1)).astype("int32")'
'diff = arr[1:] - arr[:-1]'
i.e., used 32-bit integers and included the indexing in the loop.
> On Fri, Jan 27, 2012 at 7:15 PM, Kevin Ummel <kevinummel <at> gmail.com> wrote:
> > Hi everyone,
> >
> > Speed is the key here.
> >
> > I need to find the difference between a vector and its one-period lag
> > (i.e. the difference between each value and the subsequent one in the
> > vector). Let's say the vector contains 10 million random integers
> > between 0 and 1,000. The solution vector will have 9,999,999 values,
> > since their is no lag for the 1st observation.
> >
> > In R we have:
> >
> > #Set up input vector
> > x = runif(n=10e6, min=0, max=1000)
> > x = round(x)
> >
> > #Find one-period difference
> > y = diff(x)
> >
> > Question is: How can I get the 'diff(x)' part as fast as absolutely
> > possible? I queried some colleagues who work with other languages, and
> > they provided equivalent solutions in Python and Clojure that, on their
> > machines, appear to be potentially much faster
> > (I've put the code below in case anyone is interested).
> > However, they mentioned that the overhead in passing the data between
> > languages could kill any improvements. I don't have much experience
> > integrating other languages, so I'm hoping the community has some ideas
> > about how to approach this particular problem...
> >
> > Many thanks,
> > Kevin
> >
> > In iPython:
> >
> > In [3]: import numpy as np
> > In [4]: arr = np.random.randint(0, 1000, (10000000,1)).astype("int16")
> > In [5]: arr1 = arr[1:].view()
> > In [6]: timeit arr2 = arr1 - arr[:-1]
> > 10 loops, best of 3: 20.1 ms per loop
> >
> > In Clojure:
> >
> > (defn subtract-lag
> > [n]
> > (let [v (take n (repeatedly rand))]
> > (time (dorun (map - v (cons 0 v))))))
>
More information about the R-help
mailing list