[R] bootstrap subject resampling: resampled subject codes surface as list/vector indices

Sat Aug 19 19:41:14 CEST 2017

Thank you and apologies for not having posted the data along with the code.

After poking some more, I found the bug.

I first initialize sample.subjects as an empty list:

sample.subjects = list()

And then I try to assign to the first element of that empty list:

sample.subjects[1] = sample(unique(data$subj), 1, replace=TRUE,prob=NULL)

Needless to say, an empty list has no elements.

After changing this last line to:

sample.subjects = sample(unique(data$subj), 1, replace=TRUE,prob=NULL)

the code runs without issues. I actually don't need the initialization line. It only caused unnecessary confusion.
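
For anyone who finds this thread later, here is a minimal sketch of one way to keep subject codes, not level indices, in sample.subjects. The `data` frame below is a made-up stand-in (the real data were not posted), and the sketch assumes `data$subj` is a factor, in which case `as.integer()` returns the internal level codes and `as.character()` returns the subject labels. The named vector `counts` is a hypothetical substitute for `count.lookup`, not the original object.

```r
# Toy stand-in for the unposted `data` object (hypothetical values).
set.seed(1)
data <- data.frame(
  subj = factor(rep(c(5, 6, 13), times = c(4, 3, 5))),
  y    = seq_len(12)
)
target.sample.size <- nrow(data)

# A factor stores integer level codes internally:
# as.integer() exposes the codes, as.character() recovers the labels.
codes  <- as.integer(unique(data$subj))
labels <- as.character(unique(data$subj))

# Per-subject row counts, looked up by subject label (assumed helper).
counts <- setNames(as.integer(table(data$subj)), levels(data$subj))

# Sample labels (character), so nothing can fall back to level codes.
sample.subjects <- character(0)
sample.size <- 0
while (sample.size < target.sample.size) {
  current.subject <- sample(labels, 1)
  sample.subjects <- c(sample.subjects, current.subject)
  sample.size <- sample.size + counts[[current.subject]]
}
```

Working with character labels throughout also means the comparison `data$subj == current.subject` matches on the subject code itself when the rows are pulled out later.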

Thank you!

On 8/19/2017 7:15 PM, Bert Gunter wrote:
> I didn't have the patience to go through your missive in detail, but do
> note that it is not reproducible, as you have not provided a "data"
> object. You **are** asked to provide a small reproducible example by
> the posting guide.
>
> Of course, others with more patience and/or more smarts may not need
> the reprex to figure out what's going on. But if not ...
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sat, Aug 19, 2017 at 7:39 AM, Aleksander Główka <aglowka at stanford.edu> wrote:
>> I'm implementing a custom bootstrap resampling procedure in R. This
>> procedure resamples clusters of data points obtained by different subjects
>> in an experiment. Since the bootstrap samples need to have the same size as
>> the original dataset, `target.sample.size`, I select speakers and compute their
>> data point contributions to make sure I have a set of the right size.
>>
>>      set.seed(1)
>>      target.sample.size = 1742
>>      count.lookup = rbind(levels(data$subj), as.numeric(table(data$subj)))
>>
>> To this end, I create a dynamic list of resampled subjects,
>> `sample.subjects`, that keep on being selected and appended to the list as
>> long as their summed data point contributions do not exceed
>> `target.sample.size`. To conveniently retrieve the number of data points that a
>> given subject contributes I constructed a reference matrix, `count.lookup`,
>> where the first row contains subject codes and the second row contains their
>> respective data point counts.
>>
>>      > count.lookup
>>
>>           [,1]  [,2]  [,3]  [,4]  [,5]
>>      [1,] "5"   "6"   "13"  "18"  "20"
>>      [2,] "337" "202" "311" "740" "152"
>>
>> This is how the resampling works:
>>
>>      for (iter in 1:1000){
>>
>>        #select first subject
>>        #empty list overwrites sample subjects from previous iteration
>>        sample.subjects = list()
>>        sample.subjects[1] = sample(unique(data$subj), 1, replace=TRUE,
>> prob=NULL)
>>
>>        #determine subject position in data point count lookup
>>        first.subj.pos = which(count.lookup[1,]==sample.subjects,
>> arr.ind=TRUE)
>>
>>        #add contribution of first subject to data point count
>>        sample.size = as.numeric(count.lookup[2,first.subj.pos])
>>
>>        #select subject clusters until you exceed target sample size
>>        while(sample.size < target.sample.size){
>>
>>          #add another subject
>>          current.subject = sample(unique(data$subj), 1, replace=TRUE,
>> prob=NULL)
>>          sample.subjects[length(sample.subjects)+1] = current.subject
>>
>>          #determine subject's position in data point lookup
>>          curr.subj.pos = which(count.lookup[1,]==current.subject,
>> arr.ind=TRUE)
>>
>>          #add subject contribution to the data point count
>>          sample.size = sample.size +
>> as.numeric(count.lookup[2,curr.subj.pos])
>>        }
>>
>>        #initialize intermediate data frame; intermediate because it will be
>>        #shortened to fit target size
>>        inter.set = data.frame(matrix(, nrow = 0, ncol = ncol(data)))
>>
>>        #build the bootstrap sample from the selected subjects
>>        for(j in 1:length(sample.subjects)){
>>
>>          inter.set = rbind(inter.set, data[data$subj == sample.subjects[j],])
>>
>>        }
>>
>>        #Procrustean bed of target sample size
>>        final.set = inter.set[1:target.sample.size,]
>>
>>        write.csv(final.set, paste("bootstrap_sample_", iter,".csv", sep=""),
>> row.names=FALSE)
>>        cat("Bootstrap Iteration", iter, "completed\n")
>>
>>        #clean up sample.size for next bootstrap iteration
>>        sample.size = 0
>>
>>      }
>>
>> My problem is that when I sample the second subject onward and add it to
>> `sample.subjects` (regardless of whether it is a list or a vector), what
>> actually gets added to `sample.subjects` seems to be the index of that
>> subject in `count.lookup`! When I select the first subject code and create a
>> list consisting of just that subject code as the only element, everything is
>> fine.
>>
>>      > sample.subjects[1] = sample(unique(tt1$subj), 1, replace=TRUE,
>> prob=NULL)
>>      > sample.subjects
>>      [[1]]
>>      [1] 5
>>
>> I know this is the actual subject number because when I check the number of
>> data points that this subject contributes in `count.lookup`, it is the
>> number that corresponds to subject 5.
>>
>>      > sample.size = as.numeric(tt1.lookup[2,first.subj.pos])
>>      > sample.size
>>
>> However, when I append further sampled subject codes to the list, for some
>> reason they surface as their index number in count.lookup.
>>
>>      > sample.subjects
>>      [[1]]
>>      [1] 5
>>
>>      [[2]]
>>      [1] 5
>>
>>      [[3]]
>>      [1] 1
>>
>>      [[4]]
>>      [1] 2
>>
>>      [[5]]
>>      [1] 5
>>
>>      [[6]]
>>      [1] 2
>>
>>      [[7]]
>>      [1] 2
>>
>>      [[8]]
>>      [1] 3
>>
>>      [[9]]
>>      [1] 3
>>
>> The third element, for example, is 1. This coincides with none of the
>> subject codes in count.lookup.
>>
>> It seems the problem lies in how I append to `sample.subjects`. I tried both
>> vectors and list as data structures in which to store sampled subject codes.
>> For each data type, I tried two ways of appending: the one I present above,
>> and one that is more idiomatic in R:
>>
>> sampled.subjects = [current.subject, sampled.subjects] (for lists)
>>
>> and
>>
>> sampled.subjects = c(current.subject, sampled.subjects) (for vectors)
>>
>> Are these appending strategies flawed here, or is there some stupid error I'm
>> making somewhere else that causes the indices to surface instead of
>> subject codes?
>>
>> I'd appreciate all your help!
>>