[R] Memory Problems with a Simple Bootstrap - Part II

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Aug 2 16:15:25 CEST 2008


The following version of boot:::ordinary.array will enable this to run in 
300Mb:

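ordinary.array <- function(n, R, strata)
{
     ## distinct stratum labels (strata is assumed to be integer-coded)
     inds <- as.integer(names(table(strata)))
     if (length(inds) == 1) {
         ## one stratum: draw all n*R indices at once, then set the dims
         output <- sample(n, n*R, replace=TRUE)
         dim(output) <- c(R, n)
     } else {
         ## integer storage: 4 bytes per index rather than 8 for doubles
         output <- matrix(as.integer(0), R, n)
         for(is in inds) {
             ## resample with replacement within each stratum
             gp <- (1:n)[strata == is]
             output[, gp] <- sample(gp, R*length(gp), replace=TRUE)
         }
     }
     output
}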
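
To see where the saving comes from, here is a rough size check (a sketch, 
not part of boot).  The index array holds R*n entries, at 4 bytes each 
when stored as integers and 8 bytes each when stored as doubles.  The small 
R and n used for the object.size() comparison are illustrative values 
chosen only so that the check runs quickly.

```r
# Illustrative dimensions, much smaller than in the thread.
R <- 1000L
n <- 750L

dbl <- matrix(0, R, n)    # double storage: 8 bytes per element
int <- matrix(0L, R, n)   # integer storage: 4 bytes per element

size_dbl <- as.numeric(object.size(dbl))
size_int <- as.numeric(object.size(int))
ratio <- size_dbl / size_int   # close to 2, apart from a small fixed header

# Arithmetic for the case in the thread: 10000 resamples of 7500 values.
mb_int <- 10000 * 7500 * 4 / 1024^2   # size in Mb with integer indices
mb_dbl <- 10000 * 7500 * 8 / 1024^2   # size in Mb with double indices
```

With integer indices the array comes to about 286.1 Mb, which is the size 
in the reported allocation error; with doubles it is about twice that.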

Note that you will have to replace the function in the 'boot' name space 
(either re-install boot after editing the sources or use fixInNamespace).
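
For a non-interactive session, one possible way to do the replacement is 
assignInNamespace() from the utils package, which is documented alongside 
fixInNamespace().  This is only a sketch: it assumes the boot package is 
installed, and it repeats the definition above so that it runs on its own.  
Newer releases of boot may already contain an equivalent change, in which 
case none of this is needed.

```r
library(boot)

# Same definition as above, repeated so this block is self-contained.
ordinary.array <- function(n, R, strata)
{
    inds <- as.integer(names(table(strata)))
    if (length(inds) == 1) {
        output <- sample(n, n*R, replace=TRUE)
        dim(output) <- c(R, n)
    } else {
        output <- matrix(as.integer(0), R, n)
        for(is in inds) {
            gp <- (1:n)[strata == is]
            output[, gp] <- sample(gp, R*length(gp), replace=TRUE)
        }
    }
    output
}

# Overwrite the copy inside the 'boot' name space for this session only.
assignInNamespace("ordinary.array", ordinary.array, ns = "boot")

# Quick check on a tiny case: 3 resamples of 4 observations, one stratum.
idx <- boot:::ordinary.array(4, 3, rep(1, 4))
```

The change lasts only for the current R session; re-installing boot from 
edited sources makes it permanent.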


On Sat, 2 Aug 2008, Prof Brian Ripley wrote:

> On Sat, 2 Aug 2008, Tom La Bone wrote:
>
>> I have distilled my bootstrap problem down to this bit of code, which
>> calculates an estimate of the 95th percentile of 7500 random numbers drawn
>> from a standard normal distribution:
>> 
>> library(boot)
>> per95 <- function( annual.data, b.index) {
>>  sample.data <- annual.data[b.index]
>>  return(quantile(sample.data,probs=c(0.95))) }
>> m <- 10000
>> x <- rnorm(7500,0,1)
>> B <- boot(data=x,statistic=per95,R=m)
>> 
>> Error: cannot allocate vector of size 286.1 Mb
>> 
>> This result was observed with R 2.7.1 and 2.7.1patched when run on a
>> Windows XP computer with 4Gb of memory.
>> 
>> This does not seem to be an excessively large and complicated calculation,
>> so is this an intentional limitation of the boot function, a result of bad
>> choices on my part, or a bug?
>
> Use of a 32-bit OS was a bad choice on your part.  On 64-bit Linux it runs 
> fine in
> > gc()
>           used (Mb) gc trigger   (Mb)  max used   (Mb)
> Ncells  146670  7.9     350000   18.7    350000   18.7
> Vcells 3189171 24.4  168442002 1285.2 193746905 1478.2
>
> That's too much usage for a 2Gb address space.
>
> boot() sets up an index array, in your case of size 7500x10000 or 600Mb. 
> That dominates a 2Gb address space.
>
> What you could do is
>
> B <- replicate(10, boot(data=x,statistic=per95,R=1000), FALSE)
> Ball <- B[[1]]
> Ball$t <- do.call("rbind", lapply(B, "[[", "t"))
>
> that is, combine 10 independent runs (and that runs in ca 200Mb).
>
> BTW to Jim Holtman: adding a gc() call is not very helpful.  R will run gc to 
> get memory if it is running out, and whereas the pattern of gc calls can 
> affect the fragmentation, it is pretty much random whether adding gc calls 
> helps or hinders.
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
