[R-sig-ME] proportion data based on finite population size
Thomas M
firespot71 at gmail.com
Wed Jun 3 17:40:56 CEST 2015
Hi,
I need to fit a mixed model for the following data situation:
A number of locations have been sampled for (plant) species. Species
were classified as either belonging to category A or B (strictly
binary), where, roughly speaking, A represents 'was previously there',
and B represents 'arrived recently'. For each location the total number
of species per category is recorded, and the main question is how
several predictor variables influence the proportions. A mixed model is
used due to pronounced spatial clustering of sampled locations. Species
numbers in both A and B range from very low to relatively high.
Colleagues have suggested a plain binomial GLMM, with the number of
species in A and B comprising the two response-matrix columns. My
concern here is that I don't really see underlying independent Bernoulli
trials which gave rise to the data. At each location the total number of
species occurring is effectively an a priori fixed, finite value, and
only then are the species grouped into the two categories. I.e., for a
given location I cannot take a hypothetical new species and evaluate it
for belonging to A or B (as a new Bernoulli trial). In practice I suppose that
fitting such data by a Binomial-GLMM will artificially inflate the df,
and I wouldn't be surprised to see pronounced overdispersion. Do you
agree with these concerns?
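One rough way to look at the overdispersion concern, sketched here in
Python with made-up counts (the data and the helper name
pearson_dispersion are illustrative assumptions, not part of the actual
study): compute the Pearson chi-square divided by the residual df for an
intercept-only binomial model without random effects. Values well above
1 would point to extra-binomial variation.

```python
# Hypothetical per-site species counts as (A, B) pairs:
# A = "was previously there", B = "arrived recently".
counts = [(12, 3), (40, 22), (5, 0), (18, 15), (60, 9), (7, 6)]


def pearson_dispersion(counts):
    """Pearson chi-square / residual df for a common-proportion binomial model."""
    total_b = sum(b for _, b in counts)
    total_n = sum(a + b for a, b in counts)
    p_hat = total_b / total_n  # pooled estimate of P(species is in B)

    chi2 = 0.0
    for a, b in counts:
        n = a + b
        expected = n * p_hat                # binomial mean for this site
        variance = n * p_hat * (1 - p_hat)  # binomial variance for this site
        chi2 += (b - expected) ** 2 / variance

    df = len(counts) - 1  # one parameter (p_hat) was estimated
    return chi2 / df


print(round(pearson_dispersion(counts), 2))
```

This ignores covariates and the spatial clustering, so it only
illustrates the idea; for a fitted GLMM the same ratio would be computed
from that model's Pearson residuals and residual df.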
If so, now on to possible solutions:
Is there some finite-sample-size, or otherwise appropriate, correction
available for GLMMs?
For a new random draw I'd have to sample a new location. So a candidate
response could be the proportion A / (A + B) per site (and thus one df
per site, which is very conservative given that A or B may actually be
quite large). For a GLM, given the lack of a Beta-distributed response
option, a quasi-likelihood fit might do, but what would be the approach
(options / function / package) for a mixed model? Transforming the
response ratio and using a normally distributed response might not do
it, I am afraid.
I am also thinking of using a Poisson-GLMM with, say, B as the response
and log(A) as an offset on the right-hand side. This accounts for the
count nature of the data while still relating A and B (which makes sense
biologically in this case since, ignoring the effects of other
covariates, A and B should be well correlated).
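To make the relation between the two candidate models explicit, here is
a sketch of their linear predictors. The notation is introduced only for
illustration: site i, covariates x_i, a random intercept u_{j(i)} for
the spatial cluster j containing site i, and A_i > 0 assumed at every
site.

```latex
\begin{align*}
\text{Binomial:}\quad
  & B_i \mid n_i \sim \mathrm{Binomial}(n_i, p_i), \qquad n_i = A_i + B_i, \\
  & \operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i}
    = x_i^\top \beta + u_{j(i)}, \\[4pt]
\text{Poisson with offset:}\quad
  & B_i \sim \mathrm{Poisson}(\mu_i), \\
  & \log \mu_i = \log A_i + x_i^\top \beta + u_{j(i)}
    \;\Longleftrightarrow\;
    \log\frac{\mu_i}{A_i} = x_i^\top \beta + u_{j(i)}.
\end{align*}
```

Since p_i / (1 - p_i) is the ratio of the expected B count to the
expected A count given n_i, both formulations put the linear predictor
on the scale of a log B-to-A ratio. The offset version treats A_i as
fixed and known, and it is undefined for sites where A_i = 0.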
Thanks!