[R-sig-eco] Bootstrapping binary data - feel like I'm missing something!
Clare Embling
clare.embling at plymouth.ac.uk
Fri Mar 4 17:49:15 CET 2011
Hi,
I have point-sample data from various sites throughout the year of presence/absence of a species, and I'm trying to determine the optimum number of samples to detect differences in occurrence between sites/times of year.
So for example I have presence absence at location 1: 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1 (occurrence/sample = 4/12)
and for location 2: 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0 (occurrence/sample = 4/13)
Some locations (or times of year) are likely to have many more occurrences (i.e. higher occurrence probability) than other sites (or times of year) since they are preferred. But I thought I'd bootstrap to work out how many samples I need to be able to detect statistical differences in occurrence probability.
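For comparison with the resampling approach, here is a minimal sketch of the standard analytic sample-size calculation for two proportions, using power.prop.test from base R's stats package. It is a different technique from the bootstrap described in this post, and the two occurrence probabilities (0.20 and 0.35), the 5% significance level and the 80% power are made-up illustration values, not estimates from my data.

```r
# Hypothetical occurrence probabilities at two sites (illustration only)
p_site1 <- 0.20
p_site2 <- 0.35

# Solve for the number of point samples per site needed to detect this
# difference with a two-sided test at the 5% level and 80% power
res <- power.prop.test(p1 = p_site1, p2 = p_site2,
                       sig.level = 0.05, power = 0.80)

# Round up to a whole number of samples per site
n_per_site <- ceiling(res$n)
print(n_per_site)
```

The smaller the difference between the two probabilities, the more samples per site this calculation asks for.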
So I have binary (0/1) data that I'm resampling without replacement, using different numbers of point-surveys, and I'm looking at the mean & 95% confidence interval of the occurrence probability calculated from the resamples.
Here's my code (sorry it's probably long-winded - I never know the quickest way to do things! I just figure out what I want to do in R with the knowledge I've got):
# The data is contained in SB2010$CetPres - there are over 1900 samples of presence absence data for this location & year, so basically SB2010$CetPres is 0, 1, 0, 0, 1, 1, etc.
# This is setting up the number of point samples I want to take - so starting with only 100 out of the 1900 I have, and incrementing by 100 each time
Num2010Samples <- seq(100,1900,100)
# Setting up the variables
SPres <- 0
SPresRate <- 0
SB2010MeanRate <- 0
SB2010LowerCIRate <- 0
SB2010UpperCIRate <- 0
# a double loop - which is probably a slow way of doing it
# so for each number of point surveys (100 up to 1900)
# resample from the distribution 1000 times without replacement
for(i in 1:length(Num2010Samples)){
for(j in 1:1000) {
SPres <- sample(SB2010$CetPres,Num2010Samples[i])
# calculate the occurrence probability based on this resample
SPresRate[j] <- sum(SPres)/Num2010Samples[i]
}
# mean and 95% interval (2.5th and 97.5th percentiles) of the resampled occurrence probability
SB2010MeanRate[i] <- mean(SPresRate)
SB2010LowerCIRate[i] <- quantile(SPresRate, 0.025)
SB2010UpperCIRate[i] <- quantile(SPresRate, 0.975)
}
# and plot the bootstrapped data
plot(Num2010Samples,SB2010MeanRate,type="l",ylim=c(min(SB2010LowerCIRate),max(SB2010UpperCIRate)),xlab="Number of samples",ylab="Occurrence probability",main="2010 resampling")
lines(Num2010Samples,SB2010LowerCIRate,type="l",col="blue")
lines(Num2010Samples,SB2010UpperCIRate,type="l",col="blue")
It seems logical, but I feel as though I must be missing something vital: if I have only 200 point samples, the 95% confidence intervals converge in the same way as if I have 2000 point samples. That would suggest that with the 200-sample data set I need fewer point samples to obtain good precision on my occurrence rate, whereas with the 2000-sample data set I need a lot more. So I must be doing something wrong. Any suggestions?
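To make the pattern concrete, here is a minimal reproducible sketch that uses simulated 0/1 data in place of the real SB2010$CetPres. The occurrence probability of 0.3, the data set sizes of 200 and 2000, and the helper name interval_width are all assumptions for illustration. It resamples without replacement, as in my code above, and reports the width of the central 95% interval at the same fractions of each data set.

```r
set.seed(1)

# Width of the central 95% interval of the resampled occurrence rate
# when n point samples are drawn WITHOUT replacement from the 0/1 vector x
interval_width <- function(x, n, n_resamples = 1000) {
  rates <- replicate(n_resamples, mean(sample(x, n)))
  unname(diff(quantile(rates, c(0.025, 0.975))))
}

# Simulated stand-ins for the real data (assumed occurrence probability 0.3)
small <- rbinom(200, 1, 0.3)    # hypothetical data set of 200 point samples
large <- rbinom(2000, 1, 0.3)   # hypothetical data set of 2000 point samples

# Resample the same fractions of each data set
fractions <- c(0.25, 0.5, 0.75, 1)
widths <- data.frame(
  fraction = fractions,
  small = sapply(fractions * length(small), function(n) interval_width(small, n)),
  large = sapply(fractions * length(large), function(n) interval_width(large, n))
)
print(widths)
```

When the fraction is 1, every resample is just a reordering of the whole data set, so the interval width is exactly zero for both data sets, whatever their size.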
Thanks in advance
Clare
p.s. I can provide a couple of examples of the actual SB2010$CetPres data if that makes it easier.