[R] bootstrap subject resampling: resampled subject codes surface as list/vector indices

Sat Aug 19 19:15:02 CEST 2017

I didn't have the patience to go through your missive in detail, but do
note that it is not reproducible, as you have not provided a "data"
object. You **are** asked to provide a small reproducible example by
the posting guide.

Of course, others with more patience and/or more smarts may not need
the reprex to figure out what's going on. But if not ...
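
A reprex for this question could be as small as the sketch below. The
subject codes and per-subject counts are taken from the `count.lookup`
printout quoted further down; the `y` column and the choice to store `subj`
as a factor are assumptions, since the original post does not show the data.

```r
# Hypothetical stand-in for the missing `data` object.
set.seed(1)
subj.codes  <- c(5, 6, 13, 18, 20)           # codes from the quoted printout
subj.counts <- c(337, 202, 311, 740, 152)    # counts from the quoted printout

data <- data.frame(
  subj = factor(rep(subj.codes, times = subj.counts)),  # assumption: a factor
  y    = rnorm(sum(subj.counts))                        # hypothetical response
)

# Same lookup construction as in the quoted post.
count.lookup <- rbind(levels(data$subj), as.numeric(table(data$subj)))

print(nrow(data))
print(count.lookup)
```

With these counts the data frame has as many rows as the `target.sample.size`
set in the quoted code.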

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sat, Aug 19, 2017 at 7:39 AM, Aleksander Główka <aglowka at stanford.edu> wrote:
> I'm implementing a custom bootstrap resampling procedure in R. This
> procedure resamples clusters of data points obtained by different subjects
> in an experiment. Since the bootstrap samples need to have the same size as
> the original dataset, `target.sample.size`, I select subjects and sum their
> data point contributions to make sure I have a set of the right size.
>
>     set.seed(1)
>     target.sample.size = 1742
>     count.lookup = rbind(levels(data$subj), as.numeric(table(data$subj)))
>
> To this end, I create a dynamic list of resampled subjects,
> `sample.subjects`. Subjects keep being selected and appended to the list as
> long as their summed data point contributions stay below
> `target.sample.size`. To conveniently retrieve the number of data points that
> a given subject contributes, I constructed a reference matrix, `count.lookup`,
> where the first row contains subject codes and the second row contains their
> respective data point counts.
>
>     > count.lookup
>          [,1]  [,2]  [,3]  [,4]  [,5]
>     [1,] "5"   "6"   "13"  "18"  "20"
>     [2,] "337" "202" "311" "740" "152"
>
> This is how the resampling works:
>
>     for (iter in 1:1000){
>
>       #select first subject
>       #empty list overwrites sample subjects from previous iteration
>       sample.subjects = list()
>       sample.subjects[1] = sample(unique(data$subj), 1, replace=TRUE, prob=NULL)
>
>       #determine subject position in data point count lookup
>       first.subj.pos = which(count.lookup[1,]==sample.subjects, arr.ind=TRUE)
>
>       #add contribution of first subject to data point count
>       sample.size = as.numeric(count.lookup[2,first.subj.pos])
>
>       #select subject clusters until you exceed target sample size
>       while(sample.size < target.sample.size){
>
>         #add another subject
>         current.subject = sample(unique(data$subj), 1, replace=TRUE, prob=NULL)
>         sample.subjects[length(sample.subjects)+1] = current.subject
>
>         #determine subject's position in data point lookup
>         curr.subj.pos = which(count.lookup[1,]==current.subject, arr.ind=TRUE)
>
>         #add subject contribution to the data point count
>         sample.size = sample.size + as.numeric(count.lookup[2,curr.subj.pos])
>       }
>
>       #initialize intermediate data frame; intermediate because it will be
>       #shortened to fit target size
>       inter.set = data.frame(matrix(, nrow = 0, ncol = ncol(data)))
>
>       #build the bootstrap sample from the selected subjects
>       for(j in 1:length(sample.subjects)){
>
>         inter.set = rbind(inter.set, data[data$subj == sample.subjects[j],])
>
>       }
>
>       #procrustean bed of target sample size
>       final.set = inter.set[1:target.sample.size,]
>
>       write.csv(final.set, paste("bootstrap_sample_", iter,".csv", sep=""), row.names=FALSE)
>       cat("Bootstrap Iteration", iter, "completed\n")
>
>       #clean up sample.size for next bootstrap iteration
>       sample.size = 0
>
>     }
>
> My problem is that when I sample the second subject onward and add it to
> `sample.subjects` (regardless of whether it is a list or a vector), what
> actually gets added to `sample.subjects` seems to be the index of that
> subject in `count.lookup`! When I select the first subject code and create a
> list consisting of just that subject code as the only element, everything is
> fine.
>
>     > sample.subjects[1] = sample(unique(tt1$subj), 1, replace=TRUE, prob=NULL)
>     > sample.subjects
>     [[1]]
>     [1] 5
>
> I know this is the actual subject number because when I check the number of
> data points that this subject contributes in `count.lookup`, it is the
> number that corresponds to subject 5.
>
>     > sample.size = as.numeric(tt1.lookup[2,first.subj.pos])
>     > sample.size
>
> However, when I append further sampled subject codes to the list, for some
> reason they surface as their index number in count.lookup.
>
>     > sample.subjects
>     [[1]]
>     [1] 5
>
>     [[2]]
>     [1] 5
>
>     [[3]]
>     [1] 1
>
>     [[4]]
>     [1] 2
>
>     [[5]]
>     [1] 5
>
>     [[6]]
>     [1] 2
>
>     [[7]]
>     [1] 2
>
>     [[8]]
>     [1] 3
>
>     [[9]]
>     [1] 3
>
> The third element, for example, is 1. This coincides with none of the
> subject codes in count.lookup.
>
> It seems the problem lies in how I append to `sample.subjects`. I tried both
> vectors and list as data structures in which to store sampled subject codes.
> For each data type, I tried two ways of appending: the one I present above,
> and one that is more idiomatic in R:
>
> sampled.subjects = [current.subject, sampled.subjects] (for lists)
>
> and
>
> sampled.subjects = c(current.subject, sampled.subjects) (for vectors)
>
> Are these appending strategies flawed here, or is there some stupid error I'm
> making somewhere else that causes the indices to surface instead of the
> subject codes?
>
> I'd appreciate all your help!
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
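
One possible cause, offered as a sketch rather than a diagnosis: if
`data$subj` is a factor (an assumption, since the post does not show the
data), then `sample(unique(data$subj), 1)` returns a factor, and assigning a
factor into a list slot with `[` stores its internal integer level code
instead of its label. The toy `subj` vector, the names `codes` and `fixed`,
and the `as.character()` workaround below are illustrative, not taken from
the original code.

```r
# Assumption: subject codes are stored as a factor.
subj <- factor(c(5, 6, 13, 18, 20))
count.lookup <- rbind(levels(subj), c(337, 202, 311, 740, 152))

picked <- subj[3]    # subject "13", which is level number 3

# Assigning a factor into a list slot with `[` keeps only the integer code.
codes <- list()
codes[1] <- picked
print(codes[[1]])    # the level code, not the subject code

# Converting to character first keeps the subject code itself.
fixed <- list()
fixed[1] <- as.character(picked)
print(fixed[[1]])

# The character value can be matched against the lookup row directly.
pos <- which(count.lookup[1, ] == fixed[[1]])
print(as.numeric(count.lookup[2, pos]))
```

If this is what is happening, wrapping each `sample(...)` call in
`as.character()`, or converting `data$subj` to character once up front,
should keep the subject codes intact in `sample.subjects`.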