[R-sig-eco] Bootstrapping binary data - feel like I'm missing something!

Clare Embling clare.embling at plymouth.ac.uk
Fri Mar 4 17:49:15 CET 2011


Hi,

I have point-sample data from various sites throughout the year of presence/absence of a species, and I'm trying to determine the optimum number of samples to detect differences in occurrence between sites/times of year.

So for example I have presence absence at location 1: 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1 (occurrence/sample = 4/12)
and for location 2: 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0 (occurrence/sample = 4/13)

Some locations (or times of year) are likely to have many more occurrences (i.e. higher occurrence probability) than other sites (or times of year) since they are preferred.  But I thought I'd bootstrap to work out how many samples I need to be able to detect statistical differences in occurrence probability.
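
As a rough cross-check on the sample-size question, an analytic power calculation for comparing two proportions can be sketched with power.prop.test() from base R's stats package. This is a different technique from the resampling described in this post, and the two occurrence probabilities below are hypothetical values chosen only for illustration, not estimates from the real data.

```r
# Hypothetical occurrence probabilities at two sites (assumed values,
# not taken from the actual survey data)
p_site1 <- 0.35
p_site2 <- 0.20

# Solve for the number of samples per site needed to detect this
# difference with 80% power at the 5% significance level
pp <- power.prop.test(p1 = p_site1, p2 = p_site2,
                      power = 0.80, sig.level = 0.05)

# Round up to whole samples per site
n_per_site <- ceiling(pp$n)
print(n_per_site)
```

The smaller the difference between the two assumed probabilities, the larger the required n per site becomes.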

So I have a binary distribution that I'm resampling (without replacement) using different numbers of point surveys, and for each number of surveys I'm looking at the mean and 95% confidence interval of the occurrence probability calculated from the resamples.

Here's my code (sorry it's probably long-winded - I never know the quickest way to do things!  I just figure out what I want to do in R with the knowledge I've got):

# The data is contained in SB2010$CetPres - there are over 1900 samples of presence absence data for this location & year, so basically SB2010$CetPres is 0, 1, 0, 0, 1, 1, etc.

# This is setting up the number of point samples I want to take - so starting with only 100 out of the 1900 I have, and incrementing by 100 each time
Num2010Samples <- seq(100,1900,100)
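# Setting up the variables (result vectors preallocated to their final lengths)
SPres <- 0
SPresRate <- numeric(1000)
SB2010MeanRate <- numeric(length(Num2010Samples))
SB2010LowerCIRate <- numeric(length(Num2010Samples))
SB2010UpperCIRate <- numeric(length(Num2010Samples))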

# a double loop - which is probably a slow way of doing it
# so for each number of point surveys (100 up to 1900)
# resample from the distribution 1000 times without replacement
for(i in 1:length(Num2010Samples)){
   for(j in 1:1000) {
      SPres <- sample(SB2010$CetPres,Num2010Samples[i])
      # calculate the occurrence probability based on this resample
      SPresRate[j] <- sum(SPres)/Num2010Samples[i]
   }
   # summarise the 1000 resampled rates: the mean, plus the 2.5% and 97.5%
   # quantiles as the lower and upper limits of a 95% interval
   # (elements 50 and 950 of the sorted rates would only give a 90% interval)
   SB2010MeanRate[i] <- mean(SPresRate)
   SB2010LowerCIRate[i] <- quantile(SPresRate, 0.025)
   SB2010UpperCIRate[i] <- quantile(SPresRate, 0.975)
}

# and plot the bootstrapped data
plot(Num2010Samples,SB2010MeanRate,type="l",ylim=c(min(SB2010LowerCIRate),max(SB2010UpperCIRate)),xlab="Number of samples",ylab="Occurrence probability",main="2010 resampling")
lines(Num2010Samples,SB2010LowerCIRate,type="l",col="blue")
lines(Num2010Samples,SB2010UpperCIRate,type="l",col="blue")
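
The resampling loop above can also be written more compactly. The following is only a minimal sketch: cet_pres is simulated stand-in data (an assumption, since the real SB2010$CetPres is not shown here), and resample_rate, n_rep and replace are hypothetical names introduced for illustration. It reports the mean and the 2.5% and 97.5% quantiles of the resampled occurrence rates.

```r
# Simulated stand-in for SB2010$CetPres: 1900 presence/absence values
# (hypothetical data with an assumed occurrence probability of 0.3)
set.seed(1)
cet_pres <- rbinom(1900, size = 1, prob = 0.3)

# Draw n values from x, n_rep times, and summarise the occurrence rate.
# replace = FALSE mirrors the loop above; replace = TRUE would give a
# conventional bootstrap (sampling with replacement).
resample_rate <- function(x, n, n_rep = 1000, replace = FALSE) {
  rates <- replicate(n_rep, mean(sample(x, n, replace = replace)))
  c(mean  = mean(rates),
    lower = unname(quantile(rates, 0.025)),
    upper = unname(quantile(rates, 0.975)))
}

n_samples <- seq(100, 1900, 100)

# One row per sample size, with columns "mean", "lower" and "upper"
res <- t(sapply(n_samples, function(n) resample_rate(cet_pres, n)))
head(res)
```

The columns of res can be plotted against n_samples in the same way as the three vectors above.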

It seems logical, but I feel as though I must be missing something vital. If I start with only 200 point samples, the 95% confidence intervals converge in the same way as when I start with 2000 point samples. That would suggest that with the 200-point data I need fewer point samples to obtain good precision on my occurrence rate, whereas with the 2000-point data I need a lot more. So I must be doing something wrong. Any suggestions?
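
One possible explanation worth checking (an assumption, not something confirmed against the actual data) is the effect of sampling without replacement from a finite set. In that case the standard error of a sample proportion carries the finite population correction sqrt((N - n)/(N - 1)), so it shrinks to zero as the number drawn n approaches the size N of the data set, whatever N is. The sketch below compares the two cases; the occurrence probability p and the function names are hypothetical.

```r
# Standard error of a sample proportion when n values are drawn
# without replacement from a finite data set of size N
se_without_replacement <- function(p, n, N) {
  sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))
}

# Standard error when values are drawn with replacement
se_with_replacement <- function(p, n) {
  sqrt(p * (1 - p) / n)
}

p <- 0.3  # hypothetical occurrence probability

# Same fractions of the data set (25%, 50%, 75%, 100%) for N = 200 and N = 2000
se_small <- se_without_replacement(p, n = c(50, 100, 150, 200), N = 200)
se_large <- se_without_replacement(p, n = c(500, 1000, 1500, 2000), N = 2000)

# With replacement, the standard error depends on n only and never reaches zero
se_boot <- se_with_replacement(p, n = c(50, 100, 150, 200))

print(rbind(se_small, se_large, se_boot))
```

Under this assumption, both without-replacement curves collapse to zero at n = N, which would make a 200-point data set and a 2000-point data set appear to converge in the same way.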

Thanks in advance
Clare

p.s. I can provide a couple of examples of the actual SB2010$CetPres data if that makes it easier


