[R-sig-ME] proportion data based on finite population size

Wed Jun 3 17:40:56 CEST 2015

Hi,

I need to fit a mixed model for the following data situation:
A number of locations have been sampled for (plant) species. Species 
were classified as either belonging to category A or B (strictly 
binary), where, roughly speaking, A represents 'was previously there', 
and B represents 'arrived recently'. For each location the total number 
of species per category is recorded, and the main question is how 
several predictor variables influence the proportions. A mixed model is 
used due to pronounced spatial clustering of sampled locations. Species 
numbers in both A and B range from very low to relatively high.
Colleagues have suggested a plain binomial GLMM, with the number of 
species in A and B comprising the two response-matrix columns. My 
concern here is that I don't really see underlying independent Bernoulli 
trials that gave rise to the data. At each location the total number of 
species occurring is effectively fixed a priori at a finite value, and 
only then are the species grouped into the two categories. I.e. for a 
given location I cannot take a hypothetical new species and evaluate 
whether it belongs to A or B (as a new Bernoulli trial). In practice I 
suppose that fitting such data with a binomial GLMM will artificially 
inflate the df, and I wouldn't be surprised to see pronounced 
overdispersion. Do you agree with these concerns?
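To make the overdispersion concern concrete, here is a minimal sketch in plain Python of the Pearson dispersion statistic for an intercept-only binomial fit (no covariates, no random effects). The counts are made up for illustration and the function name is hypothetical; a real check would use the residuals of the fitted model.

```python
# Hypothetical (A, B) species counts per location; NOT real data.
counts = [(12, 3), (40, 22), (5, 9), (30, 4), (8, 8), (55, 10)]


def pearson_dispersion(counts):
    """Pearson chi-square divided by residual df, intercept-only binomial."""
    total_a = sum(a for a, _ in counts)
    total_n = sum(a + b for a, b in counts)
    p = total_a / total_n  # pooled proportion in category A (the MLE)
    chi2 = 0.0
    for a, b in counts:
        n = a + b
        # squared Pearson residual: (observed - expected)^2 / binomial variance
        chi2 += (a - n * p) ** 2 / (n * p * (1 - p))
    return chi2 / (len(counts) - 1)  # one parameter (p) was estimated


dispersion = pearson_dispersion(counts)
print(round(dispersion, 2))
```

Values near 1 are what a binomial model predicts; values well above 1 would point to the kind of overdispersion I am worried about.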
If so, now on to possible solutions:
Is there a finite-population correction, or some other appropriate 
correction, available for GLMMs?
For a new random draw I'd have to sample a new location. So a candidate 
response could be A / (A + B) calculated per site (and thus one df per 
site, which is very conservative given that A or B may actually be quite 
large). For a GLM, given the lack of a Beta-distributed response, a 
quasi-likelihood fit might do it, but what would be the approach 
(options / function / package) for a mixed model? I am afraid that 
transforming the response ratio and assuming a normally distributed 
response might not do it.
I am also thinking of using a Poisson GLMM with, say, B as the response 
and log(A) as an offset on the right-hand side. This respects the count 
nature of the data while still relating B to A (which makes sense 
biologically in this case since, ignoring the effects of other 
covariates, A and B should be well correlated).
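
To spell out what the offset would do, here is a sketch in notation of my own (not from any package): i indexes locations, x_i are the covariates, beta their coefficients, and u_{j[i]} a random intercept for the spatial cluster j containing location i.

```latex
\begin{align*}
B_i &\sim \mathrm{Poisson}(\mu_i), \\
\log \mu_i &= \log A_i + x_i^\top \beta + u_{j[i]}
\quad\Longleftrightarrow\quad
\log \frac{\mu_i}{A_i} = x_i^\top \beta + u_{j[i]}.
\end{align*}
```

So the linear predictor describes the log of the expected ratio B/A, with A treated as fixed. Note that log(A_i) is undefined at any location where A_i = 0, which could matter given the very low counts at some sites.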

Thanks!