[R] Is R the right choice for simulating first passage times of random walks?

Tue Aug 2 13:41:12 CEST 2011

Dear Dennis and Steve,

Am Sonntag, den 31.07.2011, 23:32 -0400 schrieb Steve Lianoglou:

[…]

> How about trying to write the of this `f4` function below using the
> rcpp/inline combo. The C/C++ you will need to write looks to be quite
> trivial, let's change f4 to accept an x argument as a vector:
> 
> I've defined f4 in the same way as Dennis did:
> 
> > f4 <- function()
> >  {
> >     x <- sample(c(-1L,1L), 1)
> >
> >      if (x >= 0 ) {return(1)} else {
> >           csum <- x
> >           len <- 1
> >               while(csum < 0) {
> >                   csum <- csum + sample(c(-1, 1), 1)
> >                   len <- len + 1
> >                                          }     }
> >      len
> >  }
> 
> Now, let's do some inline/c++ mojo:
> 
> library(inline)
> inc <- "
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
> "
> 
> fxx <-cxxfunction(includes=inc, plugin="Rcpp", body="
>   int len = 1;
>   int x = ((rand() % 2 ) == 0) ? 1 : -1;
>   int csum = x;
> 
>   while (csum < 0) {
>     x = ((rand() % 2 ) == 0) ? 1 : -1;
>     len++;
>     csum = csum + x;
>   }
> 
>   return wrap(len);
> ")
> 
> Assuming I've faithfully translated this into c++, the timings aren't
> all that comparable.
> 
> Doing 500 replicates with the pure R version:
> 
> set.seed(123)
> system.time(out <- replicate(500, f4()))
>    user  system elapsed
>  31.525   0.120  32.510
> 
> Doing 10,000 replicates using the fxx function doesn't even break a sweat:
> 
> system.time(outxx <- replicate(10000, fxx()))
>    user  system elapsed
>   0.371   0.001   0.373
> 
> range(out)
> [1]       1 1994308
> 
> range(outxx)
> [1]        1 11909394

thank you very much for your suggestions.

This is indeed a nice speed.

1. I first had that implemented in FORTRAN (and Python) too, but turned
to R for two reasons. First I wanted to use also other distributions
later on and thought that it would be easier with R and that R would
have that implemented as fast as possible. Secondly I thought that R
would also operate faster having the right vectorization and using
`csum()`. But I guess it is difficult to find a good model to use the
advantages of R.

Especially looking at `top` when running this example CPU is used 100 %
but memory only 40 MB from 2 GB. So if one could use another data
structure maybe the calculations could be done on more walks at once.

2. It is indeed possible that the walk never returns to zero, so I
should make sure, that I abort the while loop after a certain length.

3. Looking at the data types I am wondering if some integer overflow(?)
could happen. I could make the length variable unsigned I suppose [1].
But still `csum` could go from `-len` to 0 and for the normal random
walk unsigned should not be a problem too besides that the logic/checks
have to be adapted.

For integrated random walks, `ccsum += csum`, `ccsum` would go from
-(ccsum**2)/2 up to 0. So later on I should use probably the 64 bit data
type (unsigned) `long` for `ccsum`, `csum` and `length` to avoid those
problems. Memory does not seem to be a problem. Also I need to add an
additional check for the height and length in the while loop like the
following.

	(csum < 0) && (csum > -ULONG_MAX) && (len =< ULONG_MAX)

So I came up with the following and to use unsigned I only consider that
the random walk stays positive instead of negative.

-------- 8< -------- code -------- >8 --------
library(inline)
inc <- "
#include <climits>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
"

f9 <-cxxfunction(includes=inc, plugin="Rcpp", body="
  unsigned long len = 1;

  if ((rand() % 2 ) == 0) {
    return wrap(len);
  }

  unsigned long x = 1;

  for (unsigned long csum = x; csum > 0; csum = ((rand() % 2 ) == 0) ? csum + 1: csum - 1) {
    len++;
    if ((csum == ULONG_MAX) && (len == ULONG_MAX)) {
      return wrap(len);
    }
  }

  return wrap(len);
")
-------- 8< -------- code -------- >8 --------

I do not know if the compiler would have optimized it that way anyway
and if there is any difference (besides the overflow checks).

> set.seed(1); system.time( z9_1 <- replicate(1000, f9()) )
       User      System verstrichen 
      0.076       0.004       0.084 
> range(z9_1)
[1]       1 1449034
> length(z9_1)
[1] 1000

Thanks,

Paul

[1] https://secure.wikimedia.org/wikipedia/en/wiki/Integer_(computer_science)#Common_integral_data_types
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110802/378b9b6b/attachment.bin>