[R] A difficulty with boot package

Sat Dec 31 01:18:50 CET 2005

On Fri, 2005-12-30 at 18:37 +0100, justin bem wrote:
> Hi,
>    
>   I have a difficulty with the bootstrap procedure in boot package.
> How can I specify the size of sample at each bootstrap ? 
>   I use 
>         >myboot<-boot(data,boot.fun,R=300)
>   when I display myboot$t I get a vector with just one value the than
> the compute statistic in my data file after reading about bootstrap in
> the book MASS I add 
>        >set.seed(101)
>   But I get the same result. than mean than at each boot the sample
> draw is exactly the data file. How can I do resolve this ?
>    
>   Sincerly !

Justin,

First, when creating a new post, please do not do so by replying to an
existing post. This screws up the threading in the list archive, making
it difficult for folks to search for your post and replies, when the
thread ends up being embedded in another unrelated one.

Second, with bootstrapping, the bootstrap samples are done using
sampling with replacement from the original sample. Thus, the bootstrap
replicates are, by default, the same size as the original sample.

You might want to look at the examples in ?sample, specifically the
resample() function created there, for some additional insight.

Using set.seed() simply allows for the ability to repeat the same
sequence of sampling randomizations. It has no effect on the replicate
sample size.

In terms of what you are seeing with respect to only getting a single
statistic value, I suspect that there is a problem in how you have
defined your statistic function.

Using the example from page 134 in MASS4:

  library(MASS)
  library(boot)

  set.seed(101)
  gal.boot <- boot(galaxies, function(x, i) median(x[i]), R = 1000)

  > str(gal.boot$t)
   num [1:1000, 1] 20930 21492 20196 20179 21184 ...

Here 't' is of length 1000, the same as R.

The median function is passed the galaxies data as 'x' and a set of
sampled indices 'i'. Thus, x[i] is the randomly selected replicate
passed to median(). Each replicate has 82 elements (the length of
galaxies) and this is repeated 1000 times.

Note that by default, the 'statistic' function in boot() is defined to
take two arguments, 'data' and 'i'. Since median() does not use these,
we create the boot() function call as above. We could have done it as:

my.median <- function(data, i)
{ 
  median(data[i])
}

set.seed(101)
my.boot <- boot(galaxies, my.median, R = 1000)

# Now let's compare the resultant boot objects
> all.equal(gal.boot, my.boot)
[1] "Component 6: target, current don't match when deparsed"
[2] "Component 8: target, current don't match when deparsed"

They are identical except for the 'statistic' and 'call' elements, which
simply reflects the different statistic function expressions used.

The key is that the statistic function (in a routine bootstrap such as
these) takes two arguments. It could take more (see below).

If you want to get a feel for how each of the 1000 replicates looks, you
could use the boot.array() function:

  boot.array(gal.boot, indices = TRUE)

boot.array() with 'indices = TRUE' will return a matrix of values
representing each replicate as a row (for 1000 rows in this case) and
the _indices_ of the values from galaxies (1:82) that was used in each
replicate (row).

With 'indices = FALSE' (the default here), boot.array() will give you
information pertaining to how frequently each of the 82 elements in
galaxies was used in each replicate. Keep in mind that this is sampling
WITH replacement, so some values will be used more than once in each
replicate.

For example:

  > table(boot.array(gal.boot, indices = FALSE))

      0     1     2     3     4     5     6     7
  30027 30317 15134  4998  1248   233    37     6

This shows that the 82 values in galaxies were used anywhere from 0 to 7
times in any given replicate. The sum() of the above is 82000 of course
and each replicate does have 82 elements:

# Get the sum of each row from the boot.array() call
# above and create a table
> table(rowSums(boot.array(gal.boot, indices = FALSE)))

  82
1000

If you actually wanted to generate the replicates used, you could do
something like:

  matrix(galaxies[boot.array(gal.boot, indices = TRUE)], 
         ncol = 82, byrow = TRUE)

which would return a 1000 x 82 matrix with the actual galaxies data as
sampled in the 1000 replicates.

In terms of defining the sample size in each replicate, there is an
example in Davison and Hinkley's book (cited in ?boot) and for which the
boot package is supporting software. The example is on page 528 and
provides an approach for specifying the sample size for the replicates,
if that it what you truly want to do. This example uses the 'city' data:

> city
     u   x
1  138 143
2   93 104
3   61  69
4  179 260
5   48  75
6   37  63
7   29  50
8   23  48
9   30 111
10   2  50

city.subset <- function(data, i, n = 10)
{
  # This step takes the sampled indices
  # and returns the first 'n' of them.
  # Then subsets 'data' using these, in 
  # this case, using the first 'n' rows of
  # all of the sampled rows in the city dataset
  d <- data[i[1:n], ]

  # Now the statistic is calculated on a
  # replicate with a sample size of 'n'
  mean(d[, 2]) / mean(d[, 1])
}

# Let's do the boot with 200 replicates
# each with a sample size of 5. 'n' is passed as
# part of the "..." argument in boot().
city.boot <- boot(city, city.subset, R = 200, n = 5)

Note that the 'statistic' function here takes three arguments:

  data: this will be 'city'

  i: sampled indices

  n: replicate sample size

As pointed out in D&H, the boot.array() function would need to be
adjusted here to work properly:

  boot.array(city.boot, indices = TRUE)[, 1:5]

so that you only return the first 5 values for each replicate.

There are important considerations if your replicates are sized less
than the original sample size, so I would not do this lightly and would
recommend securing a copy of D&H to cover some of this area.

HTH,

Marc Schwartz