[R] Is R the right choice for simulating first passage times of random walks?

Mon Aug 1 14:49:42 CEST 2011

Am Sonntag, den 31.07.2011, 23:32 -0500 schrieb R. Michael Weylandt :
> Glad to help -- I haven't taken a look at Dennis' solution (which may be far
> better than mine), but if you do want to keep going down the path outlined
> below you might consider the following:

I will try Dennis’ solution right away but looked at your suggestions
first. Thank you very much.

> Instead of throwing away a simulation if something starts negative, why not
> just multiply the entire sample by -1: that lets you still use the sample
> and saves you some computations: of course you'll have to remember to adjust
> your final results accordingly.

That is a nice suggestion. For a symmetric random walk this is indeed
possible and equivalent to looking when the walk first hits zero.

> This might avoid the loop:
> 
> x = ## Whatever x is.
> xLag = c(0,x[-length(x)]) # 'lag' x by 1 step.
> which.max((x>=0) & (xLag <0)) + 1 # Depending on how you've decided to count
> things, this +1 may be extraneous.
>
> The inner expression sets a 0 except where there is a switch from negative
> to positive and a one there: the which.max function returns the location of
> the first maximum, which is the first 1, in the vector. If you are
> guaranteed the run starts negative, then the location of the first positive
> should give you the length of the negative run.

That is the same idea as from Bill [1]. The problem is, when the walk
never returns to zero in a sample, `which.max(»everything FALSE)`
returns 1 [2]. That is no problem though, when we do not have to worry
about a walk starting with a positive value and adding 1 (+1) can be
omitted when we count the epochs of first hitting 0 instead of the time
of how long the walk stayed negative, which is always one less.

Additionally my check `(x>=0) & (xLag <0)` is redundant when we know we
start with a negative value. `(x>=0)` should be good enough in this
case.

> This all gives you,
> 
> f4 <- function(n = 100000, # number of simulations
>                length = 100000) # length of iterated sum
> {
>        R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)
> 
> >        R = apply(R,1,cumsum)
> >
>                       R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row

The line above seems to look the columns instead of rows. I think the
following is correct since after the `apply()` above the random walks
are in the columns.

	R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)]

> >        fTemp <- function(x) {
> >
>                             xLag = c(0,x[-length(x)])
>                             return(which.max((x>=0) & (xLag <0))+1)
> 
> >        countNegative = apply(R,2,fTemp)
> >        tabulate(as.vector(countNegative), length)
> > }
> 
> That just crashed my computer though, so I wouldn't recommend it for large
> n,length.

Welcome to my world. I would have never thought that simulating random
walks with a length of say a million would create that much data and
push common desktop systems with let us say 4 GB of RAM to their limits.

> Instead, you can help a little by combining the lagging and the &
> all in one.
> 
> f4 <- function(n = 100000, llength = 100000)
> {
>     R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)
>     R = apply(R,1,cumsum)
>     R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row
>     R = (cbind(rep(0,NROW(R)),R)<0)&(cbind(R,rep(0,NROW(R)))>=0)
>     countNegative = apply(R,1,which.max) + 1
>     return (tabulate(as.vector(countNegative), length) )
> 
> > }

I left that one out, because as written above the check can be
shortened.

> Of course, this is all starting to approach a very specific question that
> could actually be approached much more efficiently if it's your end goal
> (though I think I remember from your first email a different end goal):

That is true. But to learn some optimization techniques on a simple
example is much appreciated and will hopefully help me later on for the
iterated random walk cases.

> We can use the symmetry and "restart"ability of RW to do the following:
> 
> x = cumsum(sample(c(-1L,1L),BIGNUMBER,replace=T)
> D  = diff(which(x == 0))

Nice!

> This will give you a vector of how long x stays positive or negative at a
> time. Thinking through some simple translations lets you see that this set
> has the same distribution as how long a RW that starts negative stays
> negative.

I have to write those translations down. On first sight though we need
again to handle the case where it stays negative the whole time. `D`
then has length 0 and we have to count that for a walk longer than
`BIGNUMBER`.

> Again, this is only good for answering a very specific question
> about random walks and may not be useful if you have other more complicated
> questions in sight.

Just testing for 0 for the iterated cases will not be enough for
iterated random walks since an iterated random walk can go from negative
to non-negative without being zero at this time/epoch.

I implemented all your suggestions and got the following.

-------- 8< -------- code -------- >8 --------
f4 <- function(n = 100000, # number of simulations
               length = 100000) # length of iterated sum
{
	R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n)
	R = apply(R,1,cumsum) ## this applies cumsum `row-wise' to R and will make your life INFINITELY better
	fTemp <- function(x) {
		if (x[1] >= 0 ) {
			return(1)
		}

		for (i in 1:length-1) {
			if (x[i] < 0 && x[i + 1] >= 0) {
				return(as.integer(i/2 + 2))
			}
		}
	}
	countNegative = apply(R,2,fTemp)
	tabulate(as.vector(countNegative), length)
}

f5 <- function(n = 100000, # number of simulations
               length = 100000) # length of iterated sum
{
	R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)

	R = apply(R,1,cumsum)
	R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)] # If the first element in the row is positive, flip the entire row

	R <- R>=0
	countNegative = apply(R,2,which.max)
	tabulate(as.vector(countNegative), length)
}

f6 <- function(n = 100000, # number of simulations
               length = 100000) # length of iterated sum
{
	x = cumsum(sample(c(-1L,1L), length*n,replace=T))
	D = diff(which(c(0, x) == 0))
	tabulate(D, max(D))
}
-------- 8< -------- code -------- >8 --------

The timings differ quite much which is expected though.

> # f1 is using only for loops but only does half the calculations
> # and does not yet flip random walks starting with a positive value.
> set.seed(1) ; system.time( z1 <- f1(300, 1e5) )
       User      System verstrichen
      2.700       0.008       2.729
> # f1 adapted with flips
> set.seed(1) ; system.time( z1f <- f1withflip(300, 1e5) )
       User      System verstrichen 
      4.457       0.004       4.475
> set.seed(1) ; system.time( z4 <- f4(300, 1e5) )
       User      System verstrichen
      8.033       0.380       8.739
> set.seed(1) ; system.time( z5 <- f5(300, 1e5) )
       User      System verstrichen
      9.640       0.812      10.588
> set.seed(1) ; system.time( z6 <- f6(300, 1e5) )
       User      System verstrichen
      4.208       0.328       4.606

So `f6` seems to be the most efficient setting right now and even is
slightly faster than `f1` with the for loops. But we have to keep in
mind that both operate on different data sets although `set.seed(1)` is
used and `f6` treats the problem totally different.

One other thought is that when reusing the walks starting with a positiv
term and flipping those we can probably also take the backward/reverse
walk (dual problem). I will try that too.

Thank you very much,

Paul

[1] https://stat.ethz.ch/pipermail/r-help/2011-July/285015.html
[2] https://stat.ethz.ch/pipermail/r-help/2011-July/285396.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110801/d4411d79/attachment.bin>