[R] bootstrap subject resampling: resampled subject codes surface as list/vector indices
Aleksander Główka
aglowka at stanford.edu
Sat Aug 19 16:39:55 CEST 2017
I'm implementing a custom bootstrap resampling procedure in R. This
procedure resamples clusters of data points contributed by different
subjects in an experiment. Since the bootstrap samples need to have the
same size as the original dataset, `target.sample.size`, I select subjects
and sum their data point contributions to make sure I end up with a set
of the right size.
set.seed(1)
target.sample.size = 1742
count.lookup = rbind(levels(data$subj), as.numeric(table(data$subj)))
To this end, I build a dynamic list of resampled subjects,
`sample.subjects`: subjects keep being sampled and appended to the list
for as long as their summed data point contributions stay below
`target.sample.size`. To conveniently retrieve the number of data points
that a given subject contributes, I constructed a reference matrix,
`count.lookup`, where the first row contains subject codes and the
second row contains their respective data point counts.
> count.lookup
     [,1]  [,2]  [,3]  [,4]  [,5]
[1,] "5"   "6"   "13"  "18"  "20"
[2,] "337" "202" "311" "740" "152"
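As a minimal sketch of the lookup described above, assuming a toy data frame `toy` (hypothetical, standing in for the real `data`) whose `subj` column is a factor, the matrix can be rebuilt as follows. Because `rbind()` combines a character vector with a numeric one, the whole matrix becomes character, which is why the counts print in quotes. The named vector `counts` is only a suggested alternative, not something the original code uses.

```r
# Toy stand-in for the real data (hypothetical; counts mirror the output above)
toy <- data.frame(
  subj = factor(rep(c(5, 6, 13, 18, 20), times = c(337, 202, 311, 740, 152)))
)

# Same construction as in the post: row 1 = subject codes, row 2 = counts
count.lookup <- rbind(levels(toy$subj), as.numeric(table(toy$subj)))

# rbind() coerces everything to character, so counts must be converted back
is.character(count.lookup)
as.numeric(count.lookup[2, 1])

# Alternative (an assumption about convenience): a named numeric vector
# keeps the counts numeric and allows lookup by subject code
counts <- setNames(as.numeric(table(toy$subj)), levels(toy$subj))
counts[["18"]]
```

Looking a count up by name, as in `counts[["18"]]`, avoids the `which()` step needed to find a column position in the matrix.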
This is how the resampling works:
for (iter in 1:1000){
  #select first subject
  #empty list overwrites sample subjects from previous iteration
  sample.subjects = list()
  sample.subjects[1] = sample(unique(data$subj), 1, replace=TRUE,
                              prob=NULL)
  #determine subject position in data point count lookup
  first.subj.pos = which(count.lookup[1,]==sample.subjects,
                         arr.ind=TRUE)
  #add contribution of first subject to data point count
  sample.size = as.numeric(count.lookup[2,first.subj.pos])
  #select subject clusters until you reach or exceed target sample size
  while(sample.size < target.sample.size){
    #add another subject
    current.subject = sample(unique(data$subj), 1, replace=TRUE,
                             prob=NULL)
    sample.subjects[length(sample.subjects)+1] = current.subject
    #determine subject's position in data point lookup
    curr.subj.pos = which(count.lookup[1,]==current.subject,
                          arr.ind=TRUE)
    #add subject contribution to the data point count
    sample.size = sample.size +
      as.numeric(count.lookup[2,curr.subj.pos])
  }
  #initialize intermediate data frame; intermediate because it will
  #be shortened to fit target size
  inter.set = data.frame(matrix(, nrow = 0, ncol = ncol(data)))
  #build the bootstrap sample from the selected subjects
  for(j in 1:length(sample.subjects)){
    inter.set = rbind(inter.set, data[data$subj ==
                                      sample.subjects[j],])
  }
  #Procrustean bed of target sample size
  final.set = inter.set[1:target.sample.size,]
  write.csv(final.set, paste("bootstrap_sample_", iter,".csv",
                             sep=""), row.names=FALSE)
  cat("Bootstrap Iteration", iter, "completed\n")
  #clean up sample.size for next bootstrap iteration
  sample.size = 0
}
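The selection step of the loop above (draw subjects until their summed contributions reach the target) can be sketched in isolation. This is only an illustration under assumptions: `draw.subjects` and `counts` are hypothetical names, and the subject codes are held as character strings rather than taken from `data$subj` directly.

```r
# Hypothetical helper: draw subject codes until their summed counts reach `target`.
# `counts` is assumed to be a named numeric vector (names = subject codes).
draw.subjects <- function(counts, target) {
  chosen <- character(0)
  total <- 0
  while (total < target) {
    code <- sample(names(counts), 1)   # a character code, sampled with equal probability
    chosen <- c(chosen, code)          # append the code itself
    total <- total + counts[[code]]    # look the contribution up by name
  }
  chosen
}

set.seed(1)
counts <- c("5" = 337, "6" = 202, "13" = 311, "18" = 740, "20" = 152)
picked <- draw.subjects(counts, target = 1742)

# The loop only stops once the running total has reached the target
sum(counts[picked]) >= 1742
```

Keeping the codes as character strings means the values stored in `chosen` are the subject codes themselves, so they can be compared directly against the names of the lookup.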
My problem is that when I sample the second subject onward and add it to
`sample.subjects` (regardless of whether it is a list or a vector), what
actually gets added to `sample.subjects` seems to be the index of that
subject in `count.lookup`! When I select the first subject code and
create a list consisting of just that subject code as the only element,
everything is fine.
> sample.subjects[1] = sample(unique(tt1$subj), 1, replace=TRUE,
prob=NULL)
> sample.subjects
[[1]]
[1] 5
I know this is the actual subject number because when I check the number
of data points that this subject contributes in `count.lookup`, it is
the number that corresponds to subject 5.
> sample.size = as.numeric(tt1.lookup[2,first.subj.pos])
> sample.size
However, when I append further sampled subject codes to the list, for
some reason they surface as their index number in count.lookup.
> sample.subjects
[[1]]
[1] 5
[[2]]
[1] 5
[[3]]
[1] 1
[[4]]
[1] 2
[[5]]
[1] 5
[[6]]
[1] 2
[[7]]
[1] 2
[[8]]
[1] 3
[[9]]
[1] 3
The third element, for example, is 1. This coincides with none of the
subject codes in count.lookup.
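One possible explanation, assuming `data$subj` is stored as a factor (the post does not say so explicitly), is that a factor holds integer level indices internally and only displays the labels. The sketch below uses a hypothetical vector `subj` with the same codes as `count.lookup` to show the difference between the internal codes and the labels.

```r
# Hypothetical subject column stored as a factor (codes taken from count.lookup)
subj <- factor(c(5, 6, 13, 18, 20))

levels(subj)                     # the labels: "5" "6" "13" "18" "20"
codes <- as.integer(subj)        # internal level indices 1..5, not subject codes

# Converting via character recovers the actual subject codes
labels <- as.numeric(as.character(subj))

# Storing a single factor value in a list element with `[<-`;
# inspect the result with str(stored) to see what was actually kept
stored <- list()
stored[1] <- subj[3]
```

If this is what is happening, wrapping the sampled value in `as.character()` before storing it would keep the subject code rather than its level index.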
It seems the problem lies in how I append to `sample.subjects`. I tried
both vectors and lists as data structures in which to store sampled
subject codes. For each data type, I tried two ways of appending: the
one I present above, and one that is more idiomatic in R:
sample.subjects = c(list(current.subject), sample.subjects) (for lists)
and
sample.subjects = c(current.subject, sample.subjects) (for vectors)
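For comparison, here is a minimal sketch of both appending idioms applied to plain character codes. The vector `codes` is hypothetical, and the values are deliberately character strings rather than factor values, so the example isolates the appending step from any factor-related behaviour.

```r
codes <- c("18", "5")    # hypothetical subject codes, as character strings

# Vector: c() adds the new value at the end
vec <- character(0)
for (code in codes) vec <- c(vec, code)

# List: wrap the new element in list() before combining
lst <- list()
for (code in codes) lst <- c(lst, list(code))

# Both structures now hold the same codes in the same order
identical(vec, unlist(lst))
```

With character values, both idioms store exactly what was appended, which suggests checking the type of the sampled value rather than the appending strategy.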
Are these appending strategies flawed here, or is there some stupid error
I'm making somewhere else that is causing the indices to surface instead
of the subject codes?
I'd appreciate all your help!