[R] Simulation
Dimitris Rizopoulos
d.rizopoulos at erasmusmc.nl
Thu May 14 08:00:28 CEST 2009
Wacek Kusnierczyk wrote:
> Barry Rowlingson wrote:
>> On Wed, May 13, 2009 at 5:36 PM, Wacek Kusnierczyk
>> <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>>> Barry Rowlingson wrote:
>>>> Soln - "for" loop:
>>>>
>>>> > z=list()
>>>> > for(i in 1:1000){z[[i]]=rnorm(100,0,1)}
>>>>
>>>> now inspect the individual bits:
>>>>
>>>> > hist(z[[1]])
>>>> > hist(z[[545]])
>>>>
>>>> If that's the problem, then I suggest she reads an introduction to R...
>
>>> i'd suggest reading The R Inferno by Pat Burns [1], where he deals with
>>> this sort of for-looping over lists the way it deserves ;)
>> I don't think extending a list this way is too expensive. Not like
>
> indeed, for some, but only for some, values of m and n, it can actually
> be half a hair faster than the matrix and the replicate approaches,
> proposed earlier by others:
Another approach to create a matrix, a bit more efficient than using
matrix() and, IMO, still clean enough for beginners, is to assign
dimensions directly to a vector, e.g.,
library(rbenchmark)
n = 100; m = 100
benchmark(replications = 100, columns = c('test', 'elapsed'), order = NULL,
    list      = { l = list(); for (i in 1:n) l[[i]] = rnorm(m) },
    liist     = { l = vector('list', n); for (i in 1:n) l[[i]] = rnorm(m) },
    matrix    = matrix(rnorm(n * m), n, m),
    matrix2   = { mat <- rnorm(n * m); dim(mat) <- c(n, m); mat },
    replicate = replicate(m, rnorm(n))
)
# test elapsed
# 1 list 0.25
# 2 liist 0.25
# 3 matrix 0.22
# 4 matrix2 0.17
# 5 replicate 0.23
> n=10; m=1000
...
# test elapsed
# 1 list 0.17
# 2 liist 0.17
# 3 matrix 0.20
# 4 matrix2 0.15
# 5 replicate 0.75
> n=1000; m=10
...
# test elapsed
# 1 list 1.37
# 2 liist 0.92
# 3 matrix 0.21
# 4 matrix2 0.17
# 5 replicate 0.19
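For completeness, the dim<- assignment on its own (a minimal sketch; the
variable names are just illustrative):

```r
# Draw all n*m values in one call, then reshape the vector in place
# by assigning dimensions -- this avoids the extra copy that matrix()
# makes internally, which is presumably why it comes out faster.
n <- 100; m <- 100
x <- rnorm(n * m)    # plain numeric vector of length n*m
dim(x) <- c(n, m)    # x is now an n-by-m matrix, same data
is.matrix(x)         # TRUE
```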
Best,
Dimitris
> n=100; m=100
>
> library(rbenchmark)
> benchmark(replications=100, columns=c('test', 'elapsed'), order=NULL,
> list={ l=list(); for (i in 1:n) l[[i]] = rnorm(m) },
> liist={ l=vector('list', n); for (i in 1:n) l[[i]] = rnorm(m) },
> matrix=matrix(rnorm(n*m), n, m),
> replicate=replicate(m, rnorm(n)) )
> # test elapsed
> # 1 list 0.247
> # 2 liist 0.235
> # 3 matrix 0.172
> # 4 replicate 0.247
>
> n=10; m=1000
> ...
> # test elapsed
> # 1 list 0.169
> # 2 liist 0.169
> # 3 matrix 0.173
> # 4 replicate 0.771
>
> n=1000; m=10
> ...
> # test elapsed
> # 1 list 1.428
> # 2 liist 0.902
> # 3 matrix 0.172
> # 4 replicate 0.185
>
> but note that when the number of replicate m-samples is sufficiently
> large (relative to the sample size), the list approach is orders of
> magnitude (here, one order in the last case above) less efficient than
> the other approaches. (preallocation seems only partially helpful.)
>
> which approach to choose -- if efficiency is an issue -- would be a
> matter of the actual parameter values.
>
>
>> doing 1000 foo=rbind(foo,bar)s to a matrix. The overhead for extending
>> a list should really only be adding a single new pointer to the list
>> pointer structure. The existing list data isn't copied.
>
> i don't think it's accurate, as far as i understand lists in r. while a
> list is, internally, holding just pointers, and extending a list does
> not cause the pointed-to structures to be reallocated, the list itself
> is backed with a fixed-length array (sensu c arrays), not as a pairlist,
> so it needs to be reallocated. you don't rewrite the values, but you do
> rewrite the pointers.
>
> on the other hand, an appropriate implementation of pairlists would
> indeed allow you to extend a (pair)list in O(1) time.
>
> benchmark(columns=c('test', 'elapsed'),
> list={ l = list(); for (i in 1:1000) l[[i]] = 0 },
> pairlist={ p = pairlist(); for (i in 1:1000) p[[i]] = 0 })
> # test elapsed
> # 1 list 0.743
> # 2 pairlist 0.523
>
> the result is, of course, dependent on the actual implementation details
> (dynamic arrays, pairlists with cached end pointer and size, etc.) you
> may want to use the profiler (i haven't), but my guess is that extending
> an r list does require the pointer data (though not the pointed-to data)
> to be rewritten -- because it is a fixed-length c array, while extending
> a pairlist does require list traversal (because, presumably, there is no
> end pointer cache; as far as i could see in a quick look at the sources,
> pairlists are represented with chained SEXPs, which have previous/next
> pointers, but no end-of-list pointers, hence the traversal).
>
> for example:
>
> n = 100000
> benchmark(columns=c('test', 'elapsed'), order=NULL,
> short.control={ l = list() },
> short={ l = list(); l[[1]] = 0 },
> long.control={ l = vector('list', n) },
> long={ l = vector('list', n); l[[n+1]] = 0 })
> # test elapsed
> # 1 short.control 0.001
> # 2 short 0.001
> # 3 long.control 0.027
> # 4 long 0.138
>
> apparently, extending a short list by one is much more efficient than
> extending a long list. likewise:
>
> n = 100000
> benchmark(columns=c('test', 'elapsed'), order=NULL,
> short.control={ p = pairlist() },
> short={ p = pairlist(); p[[1]] = 0 },
> long.control={ p = as.pairlist(1:n) },
> long={ p = as.pairlist(1:n); p[[n+1]] = 0 })
> # test elapsed
> # 1 short.control 0.001
> # 2 short 0.001
> # 3 long.control 2.665
> # 4 long 3.002
>
> apparently, extending a short pairlist by one element is much more
> efficient than extending a long one -- possibly due to the cost of
> traversal. reading the elements of a list is O(1) time (because it is
> array-backed), while reading the elements of a pairlist is much more
> costly even for the first ones.
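As a rough illustration of that access-cost asymmetry (a sketch only;
absolute timings depend on the machine and R version):

```r
n <- 50000
l <- vector('list', n)        # array-backed: l[[n]] is direct indexing
p <- as.pairlist(seq_len(n))  # chained cells: p[[n]] walks n links

# repeatedly read the last element of each structure
t.list <- system.time(for (k in 1:200) l[[n]])['elapsed']
t.pair <- system.time(for (k in 1:200) p[[n]])['elapsed']
# expect t.pair to dwarf t.list, since each p[[n]] traverses the chain
```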
>
> a look at the sources, or a comment from an r-corer, would provide
> further details and/or corrections to my naive guesses.
>
>> Plus lists are more flexible. You can do:
>>
>> z=list()
>> for(i in 1:1000){
>> z[[i]]=rnorm(i,0,1) # generate 'i' samples
>> }
>>
>> and then you can see how the properties of samples of rnorm differ
>> with increasing numbers of samples.
>
> yes, but here you solve a different problem. and here again there is an
> alternative solution -- equally 'flexible', slightly faster, much cleaner:
>
> n = 1000
> benchmark(columns=c('test', 'elapsed'), order=NULL,
> 'for'={ l = list(); for (i in 1:n) l[[i]] = rnorm(i, 0, 1) },
> lapply=lapply(1:n, rnorm, 0, 1) )
> # test elapsed
> # 1 for 9.855
> # 2 lapply 8.923
>
>
>> Yes, you can probably vectorize this with lapply or something, but I
>> prefer clarity over concision when dealing with beginners...
>
> but where's the preferred clarity in the for loop solution?
>
> vQ
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center
Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014