[R] bootstrap subject resampling: resampled subject codes surface as list/vector indices
Aleksander Główka
aglowka at stanford.edu
Sat Aug 19 19:41:14 CEST 2017
Thank you and apologies for not having posted the data along with the code.
After poking some more, I found the bug.
I first initialize sample.subjects as an empty list:
sample.subjects = list()
And then I try to assign to the first element of that empty list:
sample.subjects[1] = sample(unique(data$subj), 1, replace=TRUE,prob=NULL)
Needless to say, an empty list has no elements.
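To illustrate the symptom from the original post, here is a minimal sketch, assuming `data$subj` is a factor (the original code calls `levels(data$subj)`, which suggests so). The toy factor `subj` is a hypothetical stand-in for the real data. In this sketch R does grow the empty list on assignment, but what gets stored is the factor's underlying integer code rather than its label:

```r
# Hypothetical stand-in for data$subj: a factor of subject codes
subj <- factor(c("5", "6", "13", "18", "20"),
               levels = c("5", "6", "13", "18", "20"))

sample.subjects <- list()
sample.subjects[1] <- subj[3]      # try to store subject "13"

length(sample.subjects)            # the list now has one element
sample.subjects[[1]]               # the integer code of "13", not the label
is.factor(sample.subjects[[1]])    # the factor attributes were dropped
```

The stored value is the position of "13" among the factor levels, which matches the "index instead of subject code" behaviour described in the original post.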
After changing this last line to:
sample.subjects = sample(unique(data$subj), 1, replace=TRUE,prob=NULL)
the code runs without issues. I actually don't need the initialization line. It only caused unnecessary confusion.
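A sketch of one way to keep subject codes intact while appending, again assuming `data$subj` is a factor. The names `subj` and `counts` are hypothetical; the counts are copied from the `count.lookup` printout in the original post. Converting each draw with `as.character()` means labels rather than integer codes are stored, and a named vector can stand in for the `which()` lookup:

```r
# Hypothetical stand-ins for data$subj and the per-subject row counts
subj <- factor(c("5", "6", "13", "18", "20"),
               levels = c("5", "6", "13", "18", "20"))
counts <- c("5" = 337, "6" = 202, "13" = 311, "18" = 740, "20" = 152)

set.seed(1)

# Draw the first subject and store its label as a character string
sample.subjects <- as.character(sample(unique(subj), 1))

# Append a second draw; c() on character vectors keeps the labels
current.subject <- as.character(sample(unique(subj), 1))
sample.subjects <- c(sample.subjects, current.subject)

# Look up contributions by name instead of by position
sample.size <- sum(counts[sample.subjects])

sample.subjects
sample.size
```

Because the stored values are labels, a comparison such as `data$subj == sample.subjects[j]` would also match on subject code rather than on level position.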
Thank you!
On 8/19/2017 7:15 PM, Bert Gunter wrote:
> I didn't have the patience to go through your missive in detail, but do
> note that it is not reproducible, as you have not provided a "data"
> object. You **are** asked to provide a small reproducible example by
> the posting guide.
>
> Of course, others with more patience and/or more smarts may not need
> the reprex to figure out what's going on. But if not ...
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sat, Aug 19, 2017 at 7:39 AM, Aleksander Główka <aglowka at stanford.edu> wrote:
>> I'm implementing a custom bootstrap resampling procedure in R. This
>> procedure resamples clusters of data points obtained by different subjects
>> in an experiment. Since the bootstrap samples need to have the same size as
>> the original dataset, `target.sample.size`, I select speakers and compute their
>> data point contributions to make sure I have a set of the right size.
>>
>> set.seed(1)
>> target.sample.size = 1742
>> count.lookup = rbind(levels(data$subj), as.numeric(table(data$subj)))
>>
>> To this end, I create a dynamic list of resampled subjects,
>> `sample.subjects`, that keep on being selected and appended to the list as
>> long as their summed data point contributions do not exceed
>> `target.sample.size`. To conveniently retrieve the number of data points that a
>> given subject contributes I constructed a reference matrix, `count.lookup`,
>> where the first row contains subject codes and the second row contains their
>> respective data point counts.
>>
>> > count.lookup
>>
>> [,1] [,2] [,3] [,4] [,5]
>> [1,] "5" "6" "13" "18" "20"
>> [2,] "337" "202" "311" "740" "152"
>>
>> This is how the resampling works:
>>
>> for (iter in 1:1000){
>>
>> #select first subject
>> #empty list overwrites sample subjects from previous iteration
>> sample.subjects = list()
>> sample.subjects[1] = sample(unique(data$subj), 1, replace=TRUE,
>> prob=NULL)
>>
>> #determine subject position in data point count lookup
>> first.subj.pos = which(count.lookup[1,]==sample.subjects,
>> arr.ind=TRUE)
>>
>> #add contribution of first subject to data point count
>> sample.size = as.numeric(count.lookup[2,first.subj.pos])
>>
>> #select subject clusters until you exceed target sample size
>> while(sample.size < target.sample.size){
>>
>> #add another subject
>> current.subject = sample(unique(data$subj), 1, replace=TRUE,
>> prob=NULL)
>> sample.subjects[length(sample.subjects)+1] = current.subject
>>
>> #determine subject's position in data point lookup
>> curr.subj.pos = which(count.lookup[1,]==current.subject,
>> arr.ind=TRUE)
>>
>> #add subject contribution to the data point count
>> sample.size = sample.size +
>> as.numeric(count.lookup[2,curr.subj.pos])
>> }
>>
>> #initialize intermediate data frame; intermediate because it will be
>> #shortened to fit target size
>> inter.set = data.frame(matrix(, nrow = 0, ncol = ncol(data)))
>>
>> #build the bootstrap sample from the selected subjects
>> for(j in 1:length(sample.subjects)){
>>
>> inter.set = rbind(inter.set, data[data$subj == sample.subjects[j],])
>>
>> }
>>
>> #Procrustean bed of target sample size: keep only the first target.sample.size rows
>> final.set = inter.set[1:target.sample.size,]
>>
>> write.csv(final.set, paste("bootstrap_sample_", iter,".csv", sep=""),
>> row.names=FALSE)
>> cat("Bootstrap Iteration", iter, "completed\n")
>>
>> #clean up sample.size for next bootstrap iteration
>> sample.size = 0
>>
>> }
>>
>> My problem is that when I sample the second subject onward and add it to
>> `sample.subjects` (regardless of whether it is a list or a vector), what
>> actually gets added to `sample.subjects` seems to be the index of that
>> subject in `count.lookup`! When I select the first subject code and create a
>> list consisting of just that subject code as the only element, everything is
>> fine.
>>
>> > sample.subjects[1] = sample(unique(tt1$subj), 1, replace=TRUE,
>> prob=NULL)
>> > sample.subjects
>> [[1]]
>> [1] 5
>>
>> I know this is the actual subject number because when I check the number of
>> data points that this subject contributes in `count.lookup`, it is the
>> number that corresponds to subject 5.
>>
>> > sample.size = as.numeric(tt1.lookup[2,first.subj.pos])
>> > sample.size
>>
>> However, when I append further sampled subject codes to the list, for some
>> reason they surface as their index number in count.lookup.
>>
>> > sample.subjects
>> [[1]]
>> [1] 5
>>
>> [[2]]
>> [1] 5
>>
>> [[3]]
>> [1] 1
>>
>> [[4]]
>> [1] 2
>>
>> [[5]]
>> [1] 5
>>
>> [[6]]
>> [1] 2
>>
>> [[7]]
>> [1] 2
>>
>> [[8]]
>> [1] 3
>>
>> [[9]]
>> [1] 3
>>
>> The third element, for example, is 1. This coincides with none of the
>> subject codes in count.lookup.
>>
>> It seems the problem lies in how I append to `sample.subjects`. I tried both
>> vectors and list as data structures in which to store sampled subject codes.
>> For each data type, I tried two ways of appending: the one I present above,
>> and one that is more idiomatic in R:
>>
>> sampled.subjects = [current.subject, sampled.subjects] (for lists)
>>
>> and
>>
>> sampled.subjects = c(current.subject, sampled.subjects) (for vectors)
>>
>> Are these appending strategies flawed here, or is there some stupid error I'm
>> making somewhere else that is causing the indices to surface instead of
>> subject codes?
>>
>> I'd appreciate all your help!
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.