[R] Simulation

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Wed May 13 22:56:22 CEST 2009


Barry Rowlingson wrote:
> On Wed, May 13, 2009 at 5:36 PM, Wacek Kusnierczyk
> <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>> Barry Rowlingson wrote:
>
>>>  Soln - "for" loop:
>>>
>>>  > z=list()
>>>  > for(i in 1:1000){z[[i]]=rnorm(100,0,1)}
>>>
>>> now inspect the individual bits:
>>>
>>>  > hist(z[[1]])
>>>  > hist(z[[545]])
>>>
>>> If that's the problem, then I suggest she reads an introduction to R...

>> i'd suggest reading the r inferno by pat burns [1], where he deals with
>> this sort of for-looping lists the way it deserves ;)
>
>  I don't think extending a list this way is too expensive. Not like

indeed, for some, but only for some, values of m and n, it can actually
be half a hair faster than the matrix and the replicate approaches,
proposed earlier by others:

    n=100; m=100

    library(rbenchmark)
    benchmark(replications=100, columns=c('test', 'elapsed'), order=NULL,
       list={ l=list(); for (i in 1:n) l[[i]] = rnorm(m) },
       liist={ l=vector('list', n); for (i in 1:n) l[[i]] = rnorm(m) },
       matrix=matrix(rnorm(n*m), n, m),
       replicate=replicate(m, rnorm(n)) )
    #        test elapsed
    # 1      list   0.247
    # 2     liist   0.235
    # 3    matrix   0.172
    # 4 replicate   0.247

    n=10; m=1000
    ...
    #        test elapsed
    # 1      list   0.169
    # 2     liist   0.169
    # 3    matrix   0.173
    # 4 replicate   0.771
    
    n=1000; m=10
    ...
    #        test elapsed
    # 1      list   1.428
    # 2     liist   0.902
    # 3    matrix   0.172
    # 4 replicate   0.185

but note that when the number n of replicate m-samples is sufficiently large
(relative to the sample size m), the list approach becomes an order of
magnitude or more less efficient than the other approaches (roughly one order
of magnitude in the last case above).  (preallocation seems only partially
helpful.)
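
a rough way to see how the gap scales (a sketch; the parameter values are
arbitrary, rbenchmark as loaded above):

    # rerun the comparison for a few values of n at fixed m, to see how the
    # gap between the growing-list loop and the matrix approach changes
    m = 10
    for (n in c(100, 1000, 5000)) {
        print(benchmark(replications=10, columns=c('test', 'elapsed'), order=NULL,
            list={ l = list(); for (i in 1:n) l[[i]] = rnorm(m) },
            matrix=matrix(rnorm(n*m), n, m)))
    }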

which approach to choose -- if efficiency is an issue -- would be a
matter of the actual parameter values.


> doing 1000 foo=rbind(foo,bar)s to a matrix. The overhead for extending
> a list should really only be adding a single new pointer to the list
> pointer structure. The existing list data isn't copied.

i don't think that's accurate, as far as i understand lists in r.  a list
does, internally, hold just pointers, and extending it does not cause the
pointed-to structures to be reallocated; but the list itself is backed by a
fixed-length array (in the sense of c arrays), not by a pairlist, so the
backing array does need to be reallocated.  you don't rewrite the values, but
you do rewrite the pointers.
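
a rough check of this (a sketch; the sizes are arbitrary, rbenchmark as loaded
above): appending one element to a list of large vectors should take about as
long as appending one to a list of scalars, consistent with only the pointer
array being rewritten, not the pointed-to data:

    n = 2000
    big = replicate(n, rnorm(1000), simplify=FALSE)    # n pointers to 1000-element vectors
    small = replicate(n, 0, simplify=FALSE)            # n pointers to scalars
    benchmark(replications=100, columns=c('test', 'elapsed'), order=NULL,
        append.big={ b = big; b[[n+1]] = 0 },
        append.small={ s = small; s[[n+1]] = 0 })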

on the other hand, an appropriate implementation of pairlists would
indeed allow you to extend a (pair)list in O(1) time. 

    benchmark(columns=c('test', 'elapsed'),
        list={ l = list(); for (i in 1:1000) l[[i]] = 0 },
        pairlist={ p = pairlist(); for (i in 1:1000) p[[i]] = 0 })
    #       test elapsed
    # 1     list   0.743
    # 2 pairlist   0.523
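
(as an aside, here is a hypothetical sketch -- not r's actual pairlist, just
an illustration of the point -- of how caching the tail node makes appending
O(1), with no traversal and no reallocation of earlier nodes; r's pairlists
apparently don't do this, hence the traversal costs discussed below:)

    linked.list <- function() {
        # a closure holding the first and last node of a singly-linked list;
        # nodes are environments, so they can be modified in place
        head <- NULL
        tail <- NULL
        append <- function(value) {
            node <- new.env(parent = emptyenv())
            node$value <- value
            node$nxt <- NULL
            if (is.null(head))
                head <<- node          # first node: becomes both head and tail
            else
                tail$nxt <- node       # link the new node after the current tail
            tail <<- node              # cached tail pointer: no traversal needed
            invisible(NULL)
        }
        list(append = append, head = function() head)
    }

    ll <- linked.list()
    for (i in 1:1000) ll$append(rnorm(10))    # each append is O(1)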

the benchmark results above are, of course, dependent on the actual
implementation details (dynamic arrays, pairlists with a cached end pointer
and size, etc.).  you may want to use the profiler (i haven't), but my guess
is that extending an r list does require the pointer data (though not the
pointed-to data) to be rewritten, because the backing store is a fixed-length
c array, while extending a pairlist requires list traversal -- presumably
because there is no cached end pointer: as far as i could see in a quick look
at the sources, pairlists are represented as chains of SEXPs, each node
holding its value and a pointer to the next node, but no end-of-list pointer,
hence the traversal.
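
(for the record, a profiling run -- which i haven't done -- would go roughly
like this sketch; the output file name is arbitrary:)

    Rprof("grow-list.prof")
    l = list(); for (i in 1:10000) l[[i]] = 0
    Rprof(NULL)
    summaryRprof("grow-list.prof")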

for example, timing the extension of short and long lists directly:

    n = 100000
    benchmark(columns=c('test', 'elapsed'), order=NULL,
        short.control={ l = list() },
        short={ l = list(); l[[1]] = 0 },
        long.control={ l = vector('list', n) },
        long={ l = vector('list', n); l[[n+1]] = 0 })
    #            test elapsed
    # 1 short.control   0.001
    # 2         short   0.001
    # 3  long.control   0.027
    # 4          long   0.138
        
apparently, extending a short list by one is much more efficient than
extending a long list.  likewise:
    
    n = 100000
    benchmark(columns=c('test', 'elapsed'), order=NULL,
        short.control={ p = pairlist() },
        short={ p = pairlist(); p[[1]] = 0 },
        long.control={ p = as.pairlist(1:n) },
        long={ p = as.pairlist(1:n); p[[n+1]] = 0 })
    #            test elapsed
    # 1 short.control   0.001
    # 2         short   0.001
    # 3  long.control   2.665
    # 4          long   3.002

apparently, extending a short pairlist by one element is much more efficient
than extending a long one -- plausibly due to the cost of traversal.  note
also that reading element i of a list is O(1) (it is array-backed), while
reading element i of a pairlist presumably requires traversing i nodes, so
reads get progressively more costly towards the end of the pairlist.
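
a rough check of the reading cost (a sketch; n is arbitrary, rbenchmark as
above): the first elements should be cheap to read in both cases, but the
last element of a long pairlist should be far more expensive to read than the
last element of an equally long list:

    n = 100000
    l = as.list(1:n)
    p = as.pairlist(1:n)
    benchmark(replications=100, columns=c('test', 'elapsed'), order=NULL,
        list.first=l[[1]],
        list.last=l[[n]],
        pairlist.first=p[[1]],
        pairlist.last=p[[n]])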

a look at the sources, or a comment from an r-corer, would provide
further details and/or corrections to my naive guesses.

>
> Plus lists are more flexible. You can do:
>
>  z=list()
>  for(i in 1:1000){
>   z[[i]]=rnorm(i,0,1)  # generate 'i' samples
> }
>
> and then you can see how the properties of samples of rnorm differ
> with increasing numbers of samples.

yes, but here you solve a different problem.  and here again there is an
alternative solution -- equally 'flexible', slightly faster, much cleaner:

    n = 1000
    benchmark(columns=c('test', 'elapsed'), order=NULL,
       'for'={ l = list(); for (i in 1:n) l[[i]] = rnorm(i, 0, 1) },
       lapply=lapply(1:n, rnorm, 0, 1) )
    #     test elapsed
    # 1    for   9.855
    # 2 lapply   8.923


> Yes, you can probably vectorize this with lapply or something, but I
> prefer clarity over concision when dealing with beginners...

but where's the preferred clarity in the for loop solution?
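
for the original fixed-size problem, the lapply version is a one-liner (a
sketch, reusing the names from the code quoted above):

    z = lapply(1:1000, function(i) rnorm(100, 0, 1))
    hist(z[[1]])
    hist(z[[545]])

no explicit index bookkeeping, and no growing of z one element at a time.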

vQ



