[R] Memory Problems with a Simple Bootstrap - Part II
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sat Aug 2 16:15:25 CEST 2008
The following version of boot:::ordinary.array will enable this to run in
300Mb:
ordinary.array <- function(n, R, strata)
{
    inds <- as.integer(names(table(strata)))
    if (length(inds) == 1) {
        output <- sample(n, n*R, replace=TRUE)
        dim(output) <- c(R, n)
    } else {
        output <- matrix(as.integer(0), R, n)
        for(is in inds) {
            gp <- (1:n)[strata == is]
            output[, gp] <- sample(gp, R*length(gp), replace=TRUE)
        }
    }
    output
}
Note that you will have to replace the function in the 'boot' name space
(either re-install boot after editing the sources or use fixInNamespace).
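
As a rough check on the 300Mb figure (this sketch is an addition, not part of the original message): the function above keeps the index array in integer storage, and a 7500 x 10000 array of 4-byte integers comes to about 286.1 Mb, the same size that appears in the reported allocation error. The same array held as 8-byte doubles would need twice that. The variable names below are illustrative only.

```r
# Size of the bootstrap index array for the example in this thread.
n <- 7500    # number of observations in x
R <- 10000   # number of bootstrap replicates

cells <- n * R               # one index per observation per replicate

int_mb <- cells * 4 / 2^20   # R integers occupy 4 bytes each
dbl_mb <- cells * 8 / 2^20   # R doubles occupy 8 bytes each

round(c(integer = int_mb, double = dbl_mb), 1)

# sample() on a single count returns integer indices, so the array built
# from it stays in integer storage after dim<- is applied.
idx <- sample(5, 5 * 3, replace = TRUE)
dim(idx) <- c(3, 5)
typeof(idx)
```

These figures cover the index array alone; the peak usage of a full boot() call will be higher because of the statistic's own working copies.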
On Sat, 2 Aug 2008, Prof Brian Ripley wrote:
> On Sat, 2 Aug 2008, Tom La Bone wrote:
>
>> I have distilled my bootstrap problem down to this bit of code, which
>> calculates an estimate of the 95th percentile of 7500 random numbers drawn
>> from a standard normal distribution:
>>
>> library(boot)
>> per95 <- function( annual.data, b.index) {
>>     sample.data <- annual.data[b.index]
>>     return(quantile(sample.data,probs=c(0.95))) }
>> m <- 10000
>> x <- rnorm(7500,0,1)
>> B <- boot(data=x,statistic=per95,R=m)
>>
>> Error: cannot allocate vector of size 286.1 Mb
>>
>> This result was observed with R 2.7.1 and 2.7.1patched when run on a
>> Windows XP computer with 4Gb of memory.
>>
>> This does not seem to be an excessively large and complicated calculation,
>> so is this an intentional limitation of the boot function, a result of bad
>> choices on my part, or a bug?
>
> Use of a 32-bit OS was a bad choice on your part. On 64-bit Linux it runs
> fine in
> > gc()
>           used (Mb) gc trigger   (Mb)  max used   (Mb)
> Ncells  146670  7.9     350000   18.7    350000   18.7
> Vcells 3189171 24.4  168442002 1285.2 193746905 1478.2
>
> That's too much usage for a 2Gb address space.
>
> boot() sets up an index array, in your case of size 7500x10000 or 600Mb.
> That dominates a 2Gb address space.
>
> What you could do is
>
> B <- replicate(10, boot(data=x,statistic=per95,R=1000), FALSE)
> Ball <- B[[1]]
> Ball$t <- do.call("rbind", lapply(B, "[[", "t"))
>
> that is, combine 10 independent runs (and that runs in ca 200Mb).
>
> BTW to Jim Holtman: adding a gc() call is not very helpful. R will run gc to
> get memory if it is running out, and whereas the pattern of gc calls can
> affect the fragmentation, it is pretty much random whether adding gc calls
> helps or hinders.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595