[R] bootstrap sample for clustered data

Mon Sep 17 05:22:44 CEST 2018

Hi there,

I posted this message before but there may be some confusion in my previous post. So here is a clearer version:

I'd like to draw bootstrap samples from clustered data and then fit some complicated models (say, mixed effects models) to each bootstrapped sample. Here id identifies the cluster. Note that clusters have different numbers of observations, e.g., id 2 has 2 observations and id 3 has 3.

id=c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
y=c(.5, .6, .4, .3, .4, 1, .9, 1, .5, 2, 2.2, 3)
x=c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1 )

xx=data.frame(id, x, y)

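boot.cluster <- function(x, id){

  # draw cluster ids with replacement (one draw per distinct cluster)
  boot.id <- sample(unique(id), replace=TRUE)
  # collect all rows of each drawn cluster; repeated draws give repeated rows
  out <- lapply(boot.id, function(i) x[id%in%i,])

  return( do.call("rbind",out) )

}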

boot.xx=boot.cluster(xx, xx$id)

Here is the generated boot.xx dataset:

  id x   y
   3 0 0.4
   3 0 1.0
   3 0 0.9
   1 0 0.5
   1 0 0.6
   5 1 2.2
   5 1 3.0
   2 1 0.4
   2 1 0.3
   1 0 0.5
   1 0 0.6

You can see that some clusters (ids) appear multiple times (e.g., id 1 appears in two places, 4 rows in total): since the bootstrap samples with replacement, the same cluster can be drawn more than once. Thus, I cannot fit a mixed effects model to this data as is, because each drawn cluster should be treated as a distinct cluster in the new data. Instead, I would like to reorganize the data as below (the ids are relabeled 1, 2, 3, ... in the order the clusters appear in boot.xx above). This is the step I need help with:

  id x   y
   1 0 0.4
   1 0 1.0
   1 0 0.9
   2 0 0.5
   2 0 0.6
   3 1 2.2
   3 1 3.0
   4 1 0.4
   4 1 0.3
   5 0 0.5
   5 0 0.6
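
To make the target concrete, here is a rough sketch of the relabeling I have in mind. It is only an assumption about how this could be done, not a tested solution: the function name boot.cluster.relabel and the extra column orig.id are made up for illustration, and each drawn cluster is simply numbered by its draw position.

```r
# Sketch: draw clusters with replacement and give every draw a fresh id.
# boot.cluster.relabel and orig.id are hypothetical names for illustration.
boot.cluster.relabel <- function(x, id) {
  ids <- unique(id)
  # sample positions rather than values, so a single numeric id is not
  # interpreted by sample() as the range 1:id
  boot.id <- ids[sample.int(length(ids), replace = TRUE)]
  out <- lapply(seq_along(boot.id), function(k) {
    d <- x[id == boot.id[k], , drop = FALSE]
    d$orig.id <- d$id   # keep the original cluster label for reference
    d$id <- k           # the k-th draw becomes cluster k
    d
  })
  res <- do.call("rbind", out)
  rownames(res) <- NULL
  res
}

id <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
y  <- c(.5, .6, .4, .3, .4, 1, .9, 1, .5, 2, 2.2, 3)
x  <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)
xx <- data.frame(id, x, y)

set.seed(1)   # arbitrary seed, only to make the draw reproducible
boot.xx <- boot.cluster.relabel(xx, xx$id)
```

With this, a cluster drawn twice would show up under two different values of id, while orig.id still records which original cluster the rows came from.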

Can someone help me with it? Thanks!

Lei Liu
Professor of Biostatistics
Washington University in St. Louis
