[R] Simulation

Dimitris Rizopoulos d.rizopoulos at erasmusmc.nl
Thu May 14 08:00:28 CEST 2009


Wacek Kusnierczyk wrote:
> Barry Rowlingson wrote:
>> On Wed, May 13, 2009 at 5:36 PM, Wacek Kusnierczyk
>> <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>>> Barry Rowlingson wrote:
>>>>  Soln - "for" loop:
>>>>
>>>>  > z=list()
>>>>  > for(i in 1:1000){z[[i]]=rnorm(100,0,1)}
>>>>
>>>> now inspect the individual bits:
>>>>
>>>>  > hist(z[[1]])
>>>>  > hist(z[[545]])
>>>>
>>>> If that's the problem, then I suggest she read an introduction to R...
> 
>>> i'd suggest reading the r inferno by pat burns [1], where he deals
>>> with this sort of growing lists in for loops the way it deserves ;)
>>  I don't think extending a list this way is too expensive. Not like
> 
> indeed, for some, but only for some, values of m and n, it can actually
> be half a hair faster than the matrix and the replicate approaches,
> proposed earlier by others:

another approach to creating a matrix, a bit more efficient than 
matrix() but still clean enough for beginners IMO, is to assign 
dimensions directly to a vector.
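for instance, a minimal sketch of the idiom (the values here are 
arbitrary, just to show the mechanics):

x <- rnorm(6)       # a plain numeric vector of length 6
dim(x) <- c(2, 3)   # now a 2 x 3 matrix holding the same values
is.matrix(x)        # TRUE

this is the 'matrix2' variant in the benchmark below: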

library(rbenchmark)

n=100; m=100
benchmark(replications=100, columns=c('test', 'elapsed'), order=NULL,
     list={ l=list(); for (i in 1:n) l[[i]] = rnorm(m) },
     liist={ l=vector('list', n); for (i in 1:n) l[[i]] = rnorm(m) },
     matrix=matrix(rnorm(n*m), n, m),
     matrix2 = {mat <- rnorm(n*m); dim(mat) <- c(n, m); mat},
     replicate=replicate(m, rnorm(n))
)

#        test elapsed
# 1      list    0.25
# 2     liist    0.25
# 3    matrix    0.22
# 4   matrix2    0.17
# 5 replicate    0.23

 > n=10; m=1000
...

#        test elapsed
# 1      list    0.17
# 2     liist    0.17
# 3    matrix    0.20
# 4   matrix2    0.15
# 5 replicate    0.75

 > n=1000; m=10
...

#        test elapsed
# 1      list    1.37
# 2     liist    0.92
# 3    matrix    0.21
# 4   matrix2    0.17
# 5 replicate    0.19


Best,
Dimitris

>     n=100; m=100
> 
>     library(rbenchmark)
>     benchmark(replications=100, columns=c('test', 'elapsed'), order=NULL,
>        list={ l=list(); for (i in 1:n) l[[i]] = rnorm(m) },
>        liist={ l=vector('list', n); for (i in 1:n) l[[i]] = rnorm(m) },
>        matrix=matrix(rnorm(n*m), n, m),
>        replicate=replicate(m, rnorm(n)) )
>     #        test elapsed
>     # 1      list   0.247
>     # 2     liist   0.235
>     # 3    matrix   0.172
>     # 4 replicate   0.247
> 
>     n=10; m=1000
>     ...
>     #        test elapsed
>     # 1      list   0.169
>     # 2     liist   0.169
>     # 3    matrix   0.173
>     # 4 replicate   0.771
>     
>     n=1000; m=10
>     ...
>     #        test elapsed
>     # 1      list   1.428
>     # 2     liist   0.902
>     # 3    matrix   0.172
>     # 4 replicate   0.185
> 
> but note that when the number of replicated m-samples is sufficiently
> large relative to the sample size, the list approach is orders of
> magnitude less efficient than the other approaches (here, one order of
> magnitude in the last case above).  preallocation seems only partially
> helpful.
> 
> which approach to choose -- if efficiency is an issue -- would be a
> matter of the actual parameter values.
> 
> 
>> doing 1000 foo=rbind(foo,bar)s to a matrix. The overhead for extending
>> a list should really only be adding a single new pointer to the list
>> pointer structure. The existing list data isn't copied.
> 
> i don't think that's accurate, as far as i understand lists in r.
> while a list internally holds just pointers, and extending a list does
> not cause the pointed-to structures to be reallocated, the list itself
> is backed by a fixed-length array (in the c sense), not by a pairlist,
> so the array needs to be reallocated.  you don't rewrite the values,
> but you do rewrite the pointers.
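> 
> a quick check consistent with this: if the pointer array is rewritten
> on each extension, the total time to grow a list should scale worse
> than linearly in its final length.  a minimal sketch (timings are
> machine-dependent; no output shown):
> 
>     n = c(1000, 2000, 4000, 8000)
>     sapply(n, function(k)
>         system.time({ l = list(); for (i in 1:k) l[[i]] = 0 })['elapsed'])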
> 
> on the other hand, an appropriate implementation of pairlists would
> indeed allow you to extend a (pair)list in O(1) time. 
> 
>     benchmark(columns=c('test', 'elapsed'),
>         list={ l = list(); for (i in 1:1000) l[[i]] = 0 },
>         pairlist={ p = pairlist(); for (i in 1:1000) p[[i]] = 0 })
>     #       test elapsed
>     # 1     list   0.743
>     # 2 pairlist   0.523
> 
> the result is, of course, dependent on the actual implementation
> details (dynamic arrays, pairlists with a cached end pointer and size,
> etc.).  you may want to use the profiler (i haven't), but my guess is
> that extending an r list does require the pointer data (though not the
> pointed-to data) to be rewritten, because it is a fixed-length c array,
> while extending a pairlist does require list traversal (because,
> presumably, there is no cached end pointer; as far as i could see in a
> quick look at the sources, pairlists are represented as chained SEXPs,
> which have next pointers but no end-of-list pointer, hence the
> traversal).
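> 
> a minimal sketch of how one might check this with the profiler, using
> Rprof() and summaryRprof() from the standard utils package (the file
> name is arbitrary):
> 
>     Rprof("grow.out")
>     l = list(); for (i in 1:50000) l[[i]] = 0
>     Rprof(NULL)
>     summaryRprof("grow.out")$by.self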
> 
> for example:
> 
>     n = 100000
>     benchmark(columns=c('test', 'elapsed'), order=NULL,
>         short.control={ l = list() },
>         short={ l = list(); l[[1]] = 0 },
>         long.control={ l = vector('list', n) },
>         long={ l = vector('list', n); l[[n+1]] = 0 })
>     #            test elapsed
>     # 1 short.control   0.001
>     # 2         short   0.001
>     # 3  long.control   0.027
>     # 4          long   0.138
>         
> apparently, extending a short list by one is much more efficient than
> extending a long list.  likewise:
>     
>     n = 100000
>     benchmark(columns=c('test', 'elapsed'), order=NULL,
>         short.control={ p = pairlist() },
>         short={ p = pairlist(); p[[1]] = 0 },
>         long.control={ p = as.pairlist(1:n) },
>         long={ p = as.pairlist(1:n); p[[n+1]] = 0 })
>     #            test elapsed
>     # 1 short.control   0.001
>     # 2         short   0.001
>     # 3  long.control   2.665
>     # 4          long   3.002
> 
> apparently, extending a short pairlist by one element is much more
> efficient than extending a long one -- possibly due to the cost of
> traversal.  reading an element of a list takes O(1) time (because the
> list is array-backed), while reading elements of a pairlist is much
> more costly, even the first ones.
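> 
> a quick way to compare the read costs (a sketch; absolute timings will
> vary, and no output is shown):
> 
>     n = 100000
>     l = vector('list', n); p = as.pairlist(seq_len(n))
>     benchmark(columns=c('test', 'elapsed'), order=NULL,
>         list.first=l[[1]], list.last=l[[n]],
>         pair.first=p[[1]], pair.last=p[[n]])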
> 
> a look at the sources, or a comment from someone on r-core, would
> provide further details and/or corrections to my naive guesses.
> 
>> Plus lists are more flexible. You can do:
>>
>>  z=list()
>>  for(i in 1:1000){
>>   z[[i]]=rnorm(i,0,1)  # generate 'i' samples
>> }
>>
>> and then you can see how the properties of rnorm samples differ
>> with increasing sample size.
> 
> yes, but here you solve a different problem.  and here again there is an
> alternative solution -- equally 'flexible', slightly faster, much cleaner:
> 
>     n = 1000
>     benchmark(columns=c('test', 'elapsed'), order=NULL,
>        'for'={ l = list(); for (i in 1:n) l[[i]] = rnorm(i, 0, 1) },
>        lapply=lapply(1:n, rnorm, 0, 1) )
>     #     test elapsed
>     # 1    for   9.855
>     # 2 lapply   8.923
> 
> 
>> Yes, you can probably vectorize this with lapply or something, but I
>> prefer clarity over concision when dealing with beginners...
> 
> but where's the preferred clarity in the for loop solution?
> 
> vQ
> 

-- 
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014



