[R] glm and percentage data with many zero values

Thu Jan 20 17:02:35 CET 2005

Dear all,

I am interested in correctly testing effects of continuous environmental 
variables and ordered factors on bacterial abundance. Bacterial 
abundance is derived from counts and expressed as percentage. My problem 
is that the abundance data contain many zero values:
Bacteria <- 
c(2.23,0,0.03,0.71,2.34,0,0.2,0.2,0.02,2.07,0.85,0.12,0,0.59,0.02,2.3,0.29,0.39,1.32,0.07,0.52,1.2,0,0.85,1.09,0,0.5,1.4,0.08,0.11,0.05,0.17,0.31,0,0.12,0,0.99,1.11,1.78,0,0,0,2.33,0.07,0.66,1.03,0.15,0.15,0.59,0,0.03,0.16,2.86,0.2,1.66,0.12,0.09,0.01,0,0.82,0.31,0.2,0.48,0.15)

First I tried transforming the data (e.g., logit), but the logit of zero 
is undefined, so the zeros made this unsatisfactory. Next I converted the 
percentages into integer values by round(Bacteria*10) or 
ceiling(Bacteria*10) and fitted a glm with a Poisson error structure; 
however, I am not very happy with this approach because it changes the 
original percentage data substantially (e.g., 0.03 becomes either 0 or 
1). The same is true for converting the percentages into factors and 
fitting a multinomial or proportional-odds model (and in any case I do 
not know whether this would be a meaningful approach).
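
To make the rounding problem concrete, here is a small sketch using a few 
of the values from Bacteria above (the subset and the names x, rounded and 
ceiled are only for illustration):

```r
# A few of the Bacteria percentages from above, including very small ones
x <- c(2.23, 0, 0.03, 0.71, 0.02, 0.07, 0.01)

# The two integer conversions I tried
rounded <- round(x * 10)
ceiled  <- ceiling(x * 10)

# Side-by-side comparison: round() turns small non-zero values into zeros,
# ceiling() turns every non-zero value below 0.1 into 1
data.frame(percent = x, rounded = rounded, ceiled = ceiled)
```

With round(), 0.03, 0.02 and 0.01 all become 0 and can no longer be 
distinguished from true zeros; with ceiling(), 0.01 and 0.07 both become 1.
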
I searched the web, and the best answer I could find was 
http://www.biostat.wustl.edu/archives/html/s-news/1998-12/msg00010.html 
in which several people suggested quasi-likelihood. Would it be 
reasonable to use a glm with quasipoisson? If yes, how can I find the 
appropriate variance function? Any other suggestions?
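
For concreteness, the kind of quasi-likelihood fit I have in mind is 
sketched below. This is only a sketch under assumptions: envvar is a 
HYPOTHETICAL predictor, simulated so that the code runs, and stands in for 
my real environmental variables. The two families are simply two candidate 
variance functions (variance proportional to mu, or to mu(1-mu) on the 
proportion scale), not a claim that either one is the right choice.

```r
# Bacterial abundance in percent (same values as above)
Bacteria <- c(2.23,0,0.03,0.71,2.34,0,0.2,0.2,0.02,2.07,0.85,0.12,0,0.59,
              0.02,2.3,0.29,0.39,1.32,0.07,0.52,1.2,0,0.85,1.09,0,0.5,1.4,
              0.08,0.11,0.05,0.17,0.31,0,0.12,0,0.99,1.11,1.78,0,0,0,2.33,
              0.07,0.66,1.03,0.15,0.15,0.59,0,0.03,0.16,2.86,0.2,1.66,0.12,
              0.09,0.01,0,0.82,0.31,0.2,0.48,0.15)

# HYPOTHETICAL predictor, simulated for illustration only (not real data)
set.seed(1)
envvar <- rnorm(length(Bacteria))

# Candidate 1: quasi-Poisson, log link, variance = dispersion * mu.
# Zeros and non-integer responses are accepted.
fit_qp <- glm(Bacteria ~ envvar, family = quasipoisson)

# Candidate 2: quasi-binomial on the proportion scale (percent / 100),
# logit link, variance = dispersion * mu * (1 - mu)
fit_qb <- glm(I(Bacteria / 100) ~ envvar, family = quasibinomial)

# Estimated dispersion parameters of the two fits
disp_qp <- summary(fit_qp)$dispersion
disp_qb <- summary(fit_qb)$dispersion

# Absolute Pearson residuals against fitted values; a clear trend would
# suggest that the assumed variance function does not match the data
plot(fitted(fit_qp), abs(residuals(fit_qp, type = "pearson")),
     xlab = "fitted", ylab = "|Pearson residual|")
```

My question is whether comparing such residual plots across candidate 
variance functions is a sensible way to choose between them.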

Many thanks in advance, Christian

================================

Christian Kamenik
Institute of Plant Sciences
University of Bern
Altenbergrain 21
3013 Bern
Switzerland