[R-sig-eco] Opinion on Analysis. glm() and nls()

Thu Jul 12 17:48:47 CEST 2012

Another way to think about the "number of studies" would be in the
sense of greater effort devoted to looking for parasites in a
particular host species. you could stick with the poisson glm
formulation using an offset, so

log(lambda*Studies) ~ MASS

so lambda estimates the average parasite species per study. You'd fit
this model like this

glm(Helminths~MASS + offset(log(Studies)), family=poisson)

Call:
glm(formula = Helminths ~ Mass + offset(log(Studies)), family = poisson,
    data = real.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.0154  -0.7162  -0.0456   1.5223   5.1601

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.876050   0.131105   6.682 2.36e-11 ***
Mass        -0.008331   0.001485  -5.610 2.03e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 194.22  on 45  degrees of freedom
Residual deviance: 160.18  on 44  degrees of freedom
AIC: 306.78

Number of Fisher Scoring iterations: 5

with your real.data data.frame. This suggests that once you account
for the number of studies helminth richness actually decreases with
mass, but the model is overdispersed (compare residual deviance with
df). Using family quasipoisson for a first cut at fixing that:

Call:
glm(formula = Helminths ~ Mass + offset(log(Studies)), family = quasipoisson,
    data = real.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.0154  -0.7162  -0.0456   1.5223   5.1601

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.876050   0.291488   3.005  0.00437 **
Mass        -0.008331   0.003302  -2.523  0.01532 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 4.943114)

    Null deviance: 194.22  on 45  degrees of freedom
Residual deviance: 160.18  on 44  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

And even with the inflated SE on the mass coefficient, it is still
indicating a negative effect.

However, there is some non-linearity indicated in the residuals, and
it appears to be associated with the number of studies rather than
Mass

helminths = glm(Helminths~Mass+offset(log(Studies)),data=real.data,family=quasipoisson)
plot(fitted(helminths),resid(helminths))
plot(fitted(helminths),real.data$Helminths)
abline(a=0,b=1)
plot(real.data$Mass,resid(helminths)) # pretty flat
plot(real.data$Studies,resid(helminths)) # negative residuals
associated with large # studies

So it's at least plausible that your "saturation" idea is operating
here; I'm not sure what the best way to use nls() is in this case. I
would log-transform Helminths before using it to get a more "normal"
distribution of variance.  Looking at table(real.data$Helminths) I see
you have no zeros, so that suggests that the distribution isn't really
poisson, because you don't have studies that find no Helminths. In
JAGS (similar to WinBUGS) you can fit a truncated poisson
distribution.

hth
On Thu, Jul 12, 2012 at 10:20 AM, Augusto Ribas <ribas.aca at gmail.com> wrote:
> Hello everybody.
>
> I would like to get an opinian, if someone could spare a moment.
> I'll first describe my question then show some code of the problem, if
> someone could check for me.
>
> My ideia is to measure how the Mass of frogs species increase the
> parasites richness (Helminths). Big frogs have many parasites, little
> ones have few.
> I have 3 continues variables, Host species Mass mean, Helminths
> species richness and number or studies used to get to this data.
> As the firsts 2 are simple to think, the third is not.
> The studies used are like, description of one parasite and its host,
> but sometimes, many parasites and its hosts and still one host and its
> parasites.
> And usually it just tell the presence, no counts. But all the adult
> parasite on final host.
> The studies in the table are in the perspective of the host, the
> number of studies that cite the host name, on these i got the
> parasites names.
> I know that more things are important for the parasites richness, like
> variability in the mass, coevolution, habitat, but for this simple
> question.
>
> Relation of Helminths Richness ~ Host Mass
> I thought about to use a glm with poisson error as the response as its a count.
>
> But many species have the helminths richness subestimated due to low
> number of studies on them (like 1 or 2).
> I saw on some papers that people do a regression of Richness ~ Studies
> and use the residuals to correlacionate with
> others things. But studies dont add more richeness forever, some frogs
> are more studied, but like a Species discovery curve,
> studies stop to increase the helminths richness. I dont think i can
> use a rarefaction or something on the package SPECIES
> as i dont have helminths counts on each frog specie.
> So i thought about useing a curve that represent the this. Reading
> things i thought about using a michelis menten curve
> to describe the Richness ~ Studies.
>
> So Richness ~  c * Studies / ( d + Studies) + error
>
> Where "d" would mean, biologicaly in my case, the contribution of the
> studies and "c" the maximum species richness.
> More studies = more richeness, until a moment that it would stop,
> which d tell me when it will stop.
> But c is what i expect on Helminths Richness ~ Host Mass, so i could
> put this in the place of c
> so
>
>  Richness ~  ( exp(a+b*MASS) + error ) * Studies / ( d + Studies) +error
>
> I see some problems with this, like i would expect that Studies
> contribut equaly to species richeness, what should not be true, but
> still appear better
> than use a regression of Richness ~ Studies.
>
> All said, my questions are.
>
> 01. Is this wrong? Why?
> 02. Are there other ways to try to analyse this case? How?
>
> >From here on some examples in R.
>
> #################################################################
> #Look on what the real data look like
> #################################################################
>
> real.data<-structure(list(tamanho = c(31.5, 95, 79.5, 48.5, 53.5, 45.5,
> 173.5, 88.5, 140, 80.5, 82, 64, 57.5, 62.5, 21.5, 47.5, 109.5,
> 95, 34.5, 57.5, 55, 26, 35.5, 39.5, 48.5, 34.5, 72, 50, 83, 49,
> 24, 53, 71, 46, 46, 145, 41.5, 51.5, 62.5, 134, 37.5, 76, 70,
> 61, 91, 18), estudos = c(3L, 1L, 28L, 3L, 13L, 2L, 21L, 1L, 22L,
> 11L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 7L,
> 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 4L, 30L, 1L, 2L, 2L,
> 2L, 2L, 1L, 1L, 1L, 2L, 8L), helmintos = c(5L, 1L, 19L, 14L,
> 5L, 4L, 18L, 1L, 13L, 3L, 3L, 6L, 4L, 2L, 2L, 4L, 1L, 7L, 1L,
> 1L, 1L, 1L, 1L, 5L, 2L, 1L, 3L, 1L, 2L, 3L, 4L, 1L, 7L, 6L, 13L,
> 18L, 3L, 8L, 5L, 2L, 17L, 3L, 6L, 1L, 1L, 7L)), .Names = c("Mass",
> "Studies", "Helminths"), row.names = c(1L, 2L, 3L, 4L, 5L, 7L,
> 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L,
> 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L,
> 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L,
> 47L), class = "data.frame")
> real.data
>
> plot(Helminths~Mass,data=real.data,cex=log2(real.data$Studies+1),
> xlab="Host species mean Mass",ylab="Helminths
> Richness",xlim=c(0,200),ylim=c(0,50))
> legend("topleft",legend=c("Point size are log2 of the number of
> studies on host species +1"))
>
>
> #################################################################
> #Example Helminths Richness ~ Host Mass
> #################################################################
> #example data
> Mass<- runif(50, 18, 174)
> #parameters
> a<-2
> b<-0.004
> #richeness
> richeness<-round(exp(a+b*Mass)+rnorm(length(Mass),0,1),0)
> #graph
> plot(richeness~Mass,xlab="Mass",ylab="Richeness")
> curve(exp(a+b*x),0,200,add=T)
> #model fit
> model.mass<-glm(richeness~Mass,poisson)
> summary(model.mass)
>
> points(exp(c(predict(model.mass)))[order(exp(c(predict(model.mass))))]~Mass[order(Mass)],type="l",lty=2,col="red")
> legend("topleft",col=c("black","red"),lty=c(1,2),legend=c("True","GLM fit"))
>
> #################################################################
> #Example Richness ~ Studies
> #################################################################
> #example data
> Studies<- rpois(50,5)
> #parameters c is just ilustrative here, suposse that all species have
> #the same Mass
> c<-10
> d<-2.5
> richeness<-round((c*Studies/(d+Studies))+rnorm(length(Studies),0,1),0)
>
> plot(richeness~Studies,ylab="Richeness",xlab="Studies")
> curve(c*x/(d+x),0,15,add=T,col="black")
>
> model.menten<- glm(richeness~c(1/Studies),family=gaussian(link = "inverse"))
> summary(model.menten)
> c1<-1/model.menten$coef[1]
> d1<-c1*model.menten$coef[2]
> c1
> d1
> curve(c1*x/(d1+x),0,15,add=T,col="red",lty=2)
> legend("topleft",col=c("black","red"),lty=c(1,2),legend=c("True","GLM fit"))
>
> #################################################################
> #Example Richness ~ Mass and Studies
> #################################################################
> Mass<- runif(50, 18, 174)
> a<-2
> b<-0.004
>
> #Trying to copy what the data look like
> Studies<- rpois(50,1)+1
> d<-3
>
> richeness<-round((((exp(a+b*Mass)+rnorm(length(Mass),0,1))*Studies)/(d+Studies))+rnorm(length(Studies),0,1),0)
>
> plot(richeness~Mass,cex=log2(Studies+1),xlab="Host species mean
> Mass",ylab="Helminths Richness",
> xlim=c(0,200),ylim=c(0,15))
> legend("topleft",legend=c("Point size are log2 of the number of
> studies on host species +1"))
> #depending on the set.seed, this plot resemble the real data
> #but i dont know if i'm fooling myself
> #also the Mass is not uniform, its more commom
> #small species than big ones
>
> #trying to recover the parameters with nls
> modelo.nls<-nls(richeness~exp(a+b*Mass)*Studies/(d+Studies),start=list(a=1.5,b=0.01,d=2))
> summary(modelo.nls)
> #most times recover the parameters fairly well
>
> #trying to do the samething with winbugs
> library(R2WinBUGS)
>
> #Model
> sink("test.model.txt")
> cat("
> model {
>
> # Priors
> alpha ~ dnorm(0,0.001)
> beta ~ dnorm(0,0.001)
> gamma ~ dnorm(0,0.001)
>
> # Likelihood
>  for (i in 1:n) {
>     richeness[i] ~ dpois(lambda[i])
>     lambda[i] <- exp(alpha + beta*Mass[i])*Studies[i] / ( gamma + Studies[i] )
>     }
> }
> ",fill=TRUE)
> sink()
>
> win.data <- list(richeness=richeness,Studies=Studies,Mass=Mass,n=length(Mass))
> inits <- function(){ list( alpha=rlnorm(1), beta=rlnorm(1), gamma=rlnorm(1)) }
> params <- c("alpha","beta","gamma")
>
> # MCMC settings
> nc <- 3
> ni <- 3000
> nb <- 1000
> nt <- 2
>
> #run
> out <- bugs(data=win.data, inits=inits, parameters.to.save=params,
> model.file="test.model.txt", n.thin=nt, n.chains=nc, n.burnin=nb,
> n.iter=ni, debug = TRUE)
>
> #look
> print(out, dig = 3)
> #probably i still dont understand everything i'm doing here
> #but how could i put on the model 2 errors, like when i
> #assemble the data?
> #could not make that neither for bayesian norlikehood way
>
> If you read this far. thanks for your patiance.
> Best wishes
> Augusto Ribas
>
> --
> Grato
> Augusto C. A. Ribas
>
> Site Pessoal: http://augustoribas.heliohost.org
> Lattes: http://lattes.cnpq.br/7355685961127056
>
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>

--
Drew Tyre

School of Natural Resources
University of Nebraska-Lincoln
416 Hardin Hall, East Campus
3310 Holdrege Street
Lincoln, NE 68583-0974

phone: +1 402 472 4054
fax: +1 402 472 2946
email: atyre2 at unl.edu
http://snr.unl.edu/tyre
http://aminpractice.blogspot.com
http://www.flickr.com/photos/atiretoo

-- 
Drew Tyre

School of Natural Resources
University of Nebraska-Lincoln
416 Hardin Hall, East Campus
3310 Holdrege Street
Lincoln, NE 68583-0974

phone: +1 402 472 4054
fax: +1 402 472 2946
email: atyre2 at unl.edu
http://snr.unl.edu/tyre
http://aminpractice.blogspot.com
http://www.flickr.com/photos/atiretoo