[R-sig-ME] A mixed effects model for variances of binomial probability as a response

nimrod.rubinstein nimrod.rubinstein at gmail.com
Sat Mar 29 07:22:04 CET 2014


I'm reattaching my data as an RData object. Here's what it looks like:

> head(df)
  subject group environment     n     k
1      S1    G1          E1  1419  1169
2      S1    G1          E2  5992   734
3      S1    G1          E3  5133   519
4      S2    G1          E1 23187 16038
5      S2    G1          E2 93255 46592
6      S2    G1          E3 35382 24486

This is probably much easier to work with now.

I defined the response as
y = cbind(df$k, df$n - df$k)
(following "The R Book" by Michael Crawley, Ch. 16).
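
(As an aside, I understand that glmer also accepts the success/failure
matrix written directly in the formula, which keeps everything inside the
data frame; I'm assuming the following is equivalent to using y, but I
haven't checked it carefully:)

# response specified inside the formula instead of as an external object
library(lme4)
glmer(cbind(k, n - k) ~ log(n) + (1 | subject) + (1 | environment),
      data = df, family = binomial(link = "logit"))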

I tried fitting this model:
model.null = glmer(y ~ log(n) + (1 | subject) + (1 | environment),
                   data = df, family = binomial(link = "logit"))

but I'm getting these warning messages:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0937283 (tol = 0.001)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

Are there any parameters that can be tuned in order to achieve convergence?

Also, what should I rescale? Taking log(n) instead of n as a fixed effect
removed this warning:
1: In checkScaleX(X, ctrl = control) :
  Some predictor variables are on very different scales: consider rescaling

Any idea what else I should rescale?
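
For what it's worth, here is what I was planning to try next, based only on
my reading of the warning messages (so these are guesses on my part, not
something I know to be the right fix): standardize log(n), and give the
optimizer a different algorithm and a larger iteration budget via
glmerControl:

# centred, unit-variance version of the only continuous predictor
# (this is what I understand "rescaling" to mean here)
df$log.n <- as.numeric(scale(log(df$n)))

model.null <- glmer(y ~ log.n + (1 | subject) + (1 | environment),
                    data = df, family = binomial(link = "logit"),
                    control = glmerControl(optimizer = "bobyqa",
                                           optCtrl = list(maxfun = 2e5)))

Is that the right approach, or am I missing something more fundamental?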

Also, if my alternative model is:
alt.null = glmer(y ~ group + log(n) + (1 | subject) + (1 | environment),
                 data = df, family = binomial(link = "logit"))

Is there a way I can test whether the over-dispersion in G1 is
significantly smaller than in G2?
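
The only idea I could come up with myself (and I am not at all sure it is
statistically valid, so please correct me) is to model the over-dispersion
explicitly with an observation-level random effect, let its variance differ
by group using lme4's dummy() helper, and then compare against a model with
a single common variance by a likelihood-ratio test:

# one factor level per row, so (1 | obs) is an observation-level random
# effect (OLRE) that absorbs extra-binomial variation
df$obs <- factor(seq_len(nrow(df)))

# a single over-dispersion variance shared by G1 and G2
od.common <- glmer(y ~ group + log(n) + (1 | subject) + (1 | environment) +
                     (1 | obs),
                   data = df, family = binomial(link = "logit"))

# separate over-dispersion variances for G1 and G2
od.bygroup <- glmer(y ~ group + log(n) + (1 | subject) + (1 | environment) +
                      (0 + dummy(group, "G1") | obs) +
                      (0 + dummy(group, "G2") | obs),
                    data = df, family = binomial(link = "logit"))

anova(od.common, od.bygroup)   # 1-df LRT: one dispersion variance vs. two
VarCorr(od.bygroup)            # inspect the two obs-level variances directly

Since my question is one-sided (is G1's dispersion smaller than G2's?), I
suppose I would also check the direction of the two estimated variances and
perhaps halve the LRT p-value, but I am not sure that is correct either.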


Thanks a lot


On Fri, Mar 28, 2014 at 10:56 AM, nimrod.rubinstein <
nimrod.rubinstein at gmail.com> wrote:

> Hi Ramon and others,
>
> Thanks a lot for the advice.
>
> I think I need to rephrase my hypothesis, as follows:
> The data are binomial - for each subject, from each group, in each
> environment, I have replicates of trials and successes in these trials.
> I would like to know:
> 1. Are the data for each group (proportion or probability of success being
> the response, environment and perhaps subject being random effects)
> significantly over-dispersed?
> 2. If 1 is true, then is the over-dispersion in group G1 significantly
> smaller than in group G2?
>
> It seems that glmer might be able to achieve that but I'm not entirely
> sure how.
>
> Advice would be greatly appreciated.
>
>
>
> On Thu, Mar 27, 2014 at 10:06 PM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com>wrote:
>
>>
>>
>> On Thu, 01-01-1970, at 01:00, nimrod.rubinstein <
>> nimrod.rubinstein at gmail.com> wrote:
>> > Hi Ramon (and others reading this),
>> >
>> > Thanks for the response.
>> >
>> > I'm attaching my real data - it has these columns:
>> > subject - subject name (S1, S2,...)
>> > group - group name (G1, G2) - a subject belongs to only one of the two
>> > groups
>> > environment - environment name (E1, E2, E3).
>> > the next 27 columns (named n.1-n.27) are the number of trials in each
>> > replicate
>> > the following 27 columns (named k.1-k.27) are the number of successes
>> > out of the number of trials in each corresponding replicate (so k.1 is
>> > the number of successes out of n.1 trials, and so on).
>> >
>> > Note that there are NA values in corresponding n and k cells in the
>> table.
>> > Also, not all subjects were sampled from all 3 environments due to
>> > technical problems, so the data are not fully balanced.
>> >
>> > I know this table is a bit different from my earlier description of
>> > the data. It is so only because I wanted to keep it simple.
>> >
>> > Let me try and clarify, and hopefully all of the below will answer your
>> > questions.
>> >
>> > What may affect the variances of the success probabilities
>> (proportions)?
>> > 1. The proportion itself (natural to binomially distributed data). See
>> > attached Fig1 that plots sample variances of success probabilities vs.
>> > their sample mean.
>>
>> But what are those variances? How are you computing them? (See below,
>> though)
>>
>>
>> > 2. The number of trials in each replicate. As I said, the number of
>> > trials is inherent to the subject and the environment it was measured
>> > in; the experimenter has no control over it. As the attached Fig2
>> > shows, subjects with a high average number of trials have lower sample
>> > variances of their success probabilities. The opposite, however, is
>> > not clear: the sample variances of subjects with a low average number
>> > of trials vary greatly. So obviously, the number of trials has an
>> > effect on the variance of the proportion, but there is still a sizable
>> > fraction of variance (of the proportion variances) that is not
>> > explained by it.
>> > 3. The group - that's the effect I'm interested in. Fig1 and Fig2 show
>> that
>> > the G1 dots have lower sample variances of success probabilities than G2
>> > dots.
>> > 4. The environment. These environments were chosen at random and
>> > therefore I think they should be REs. (In all attached figures the
>> > environment isn't indicated.)
>> >
>> > I've also attached a figure of the success probabilities vs. the
>> > number of trials (Fig3).
>>
>> Yes, I think the scenario I was concerned about is not here. Good.
>>
>> >
>> > I think the figures show pretty clearly that the number of trials, the
>> > success probability, and the group all affect the variance of the
>> > success probability, and therefore the model I'm looking for should
>> > estimate all of these.
>> >
>>
>> It is way too late here, but maybe we are using the same name (variance)
>> for two different things: the variance of p, which is just a simple
>> function of the proportion p itself, and the variance in the collection
>> of p's, where each p is not necessarily estimating the same probability.
>> And I think you care about the second problem, something like saying
>> that individuals in G1 and G2 have different dispersion in their p's.
>> So you have a distribution for the p's of G1 and a distribution for the
>> p's of G2, and you want to know whether those distributions have the
>> same variance, while also possibly controlling for other factors. I
>> think the book by Gelman and Hill has some examples of this kind (I do
>> not have it here now, though).
>>
>>
>>
>> Best,
>>
>>
>> R.
>>
>>
>>
>> >
>> > Thanks a lot,
>> > Nimrod
>> >
>> >
>> > On Thu, Mar 27, 2014 at 4:39 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com
>> >wrote:
>> >
>> >> Dear Nimrod,
>> >>
>> >> Just my 2 cents (even less than 2):
>> >>
>> >> On Thu, 01-01-1970, at 01:00, nimrod.rubinstein <
>> >> nimrod.rubinstein at gmail.com> wrote:
>> >> > Hi all,
>> >> >
>> >> > I have the following data:
>> >> >
>> >> > 50 subjects that belong to 2 groups (25 in each group). For each
>> >> > subject I have replicated measurements (in the range of 15-30
>> >> > replicates per subject) of the number of trials as well as the
>> >> > number of successes out of these trials. Each single trial in each
>> >> > single replicate for each subject is assumed to be a Bernoulli
>> >> > experiment; therefore the number of successes in each replicate is
>> >> > assumed to be a binomially distributed random variable.
>> >> > To complicate things even more, for each subject from each group I
>> have
>> >> not
>> >> > 1 but 3 series of these replicated measurements, from 3 different
>> >> > environments (the 3 environments are the same for all subjects).
>> >> >
>> >> > A few example lines of what my data looks like:
>> >> > subject group environment     number.trials  number.successes
>> >> >
>> >> >     S1    G1          E1 10100,13209,17621  8500,11622,14684
>> >> >
>> >> >     S1    G1          E2 13322,12225,10024  10351,10222,7951
>> >> >
>> >> >     S1    G1          E3  9812,13500,13214  7245,10622,10684
>> >> >
>> >> >     .
>> >> >
>> >> >     .
>> >> >
>> >> >     .
>> >> >
>> >> >     S50    G2          E1 60547,75345,68412 12014,13214,15244
>> >> >
>> >> >     S50    G2          E2 50897,62541,61247 10141,13251,30645
>> >> >
>> >> >     S50    G2          E3 59214,84264,74156 17245,16425,11123
>> >> >
>> >> >
>> >> > Where the comma-separated numbers of successes correspond to the
>> >> > comma-separated numbers of trials in each replicate.
>> >> >
>> >> > I want to test whether the variances of the success probabilities of
>> >> > subjects from group G1 are significantly smaller than the variances
>> of
>> >> the
>> >> > success probabilities of subjects from group G2.
>> >> >
>> >>
>> >> But, as you say in point 1. below (and in another reply), if this is
>> really
>> >> binomial, then you already know that if probabilities in G1 are
>> different
>> >> from probabilities in G2 your variances will differ.
>> >>
>> >> So I am not sure I even understand this hypothesis/question.
>> >>
>> >>
>> >> > In other words, for each subject in each replicate a success
>> probability
>> >> > can be estimated from the observed data. Then the variance of those
>> >> success
>> >> > probabilities over all replicates from a specific environment can be
>> >> > estimated. I would like to know whether these variances are smaller
>> in
>> >> > subjects from G1 vs. subjects from G2.
>> >> >
>> >> > The issues are:
>> >> >
>> >> >
>> >> >    1. The variance of a success probability depends on the success
>> >> >    probability itself (natural to binomially distributed data). I
>> >> >    need to account for this effect.
>> >> >    2. The number of trials (although at least in the hundreds for
>> >> >    all subjects) varies by several orders of magnitude among
>> >> >    subjects. That by itself affects the accuracy of estimating the
>> >> >    success probability, but in addition, I cannot rule out an
>> >> >    inherent dependency between the number of observed trials and
>> >> >    the success probability (there is no control over the number of
>> >> >    trials in the experiment - it is observed). In other words,
>> >> >    having a lower number of trials by itself may affect the success
>> >> >    probability.
>> >> >
>> >>
>> >> That last issue seems very problematic to me. If I understand
>> correctly, we
>> >> could think of a perverse situation where, say, you can never observe
>> low
>> >> success probabilities if those are related to number of trials around
>> 0.
>> >>
>> >>
>> >> Can you do a simple plot of probability of success (or its logit) vs
>> number
>> >> of trials, to rule out perverse scenarios? Is probability of success
>> always
>> >> larger than 0? And number of trials always sufficiently different from
>> 0?
>> >>
>> >>
>> >> And you say "varies by subject", so the number of trials is always the
>> same
>> >> for a subject? And the potential relationship prob <-> number of
>> trials is
>> >> due to among-subject variation? So you could approach the problem via a
>> >> random effect for subject? If you compare G1 and G2 for number of
>> trials:
>> >> are there differences?
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> >
>> >> > I thought a simple way to test my hypothesis is to compute the
>> >> > success probability in each replicate as
>> >> > number.successes/number.trials, and then the sample mean and sample
>> >> > variance of these success probabilities over all replicates, as
>> >> > well as the mean number of trials over all replicates (added to my
>> >> > data as sample.mean, sample.var, and mean.trials fields). Then I
>> >> > can use the lmer function in the lme4 package this way:
>> >> >
>> >> > null.model = lmer(sample.var ~ sample.mean + mean.trials +
>> >> >                   (1|environment), my.data, REML = F)
>> >> > alt.model = lmer(sample.var ~ sample.mean + mean.trials + group +
>> >> >                  (1|environment), my.data, REML = F)
>> >> >
>> >> > So the fixed effects are sample.mean, mean.trials, and group, and
>> >> > the random effect is environment, since the three environments in
>> >> > my data are simply a random sample from many other possible
>> >> > environments.
>> >>
>> >>
>> >> I do not see how this would answer your questions, given the above
>> >> concerns/issues and other possible approaches (just compare prob.
>> between
>> >> G1 and G2, accounting for environment).
>> >>
>> >>
>> >> Best,
>> >>
>> >>
>> >> R.
>> >>
>> >>
>> >> >
>> >> > To test whether group has a significant effect I'll do:
>> >> >
>> >> > anova(null.model, alt.model).
>> >> >
>> >> > My concerns are:
>> >> >
>> >> >
>> >> >    1. Should I add subject as a random effect? My intuition is not
>> >> >    to, since subjects are unique to groups, which is the effect I'm
>> >> >    after. However, I do want to account for the fact that sample
>> >> >    variances of the same subject from the three different
>> >> >    environments are correlated.
>> >> >    2. The response (sample variances) cannot be assumed to follow a
>> >> >    normal distribution (they only take positive values to begin
>> >> >    with, and as far as I know
>> >> >    (n-1)*sample.variance/population.variance ~ chi.square with
>> >> >    df = n-1), and therefore the assumptions of a linear mixed
>> >> >    effects model do not strictly hold. Is there a generalized
>> >> >    linear model that would be appropriate for this?
>> >> >
>> >> > Thanks a lot,
>> >> > nimrod
>> >> >
>> >>
>>
>> --
>> Ramon Diaz-Uriarte
>> Department of Biochemistry, Lab B-25
>> Facultad de Medicina
>> Universidad Autónoma de Madrid
>> Arzobispo Morcillo, 4
>> 28029 Madrid
>> Spain
>>
>> Phone: +34-91-497-2412
>>
>> Email: rdiaz02 at gmail.com
>>        ramon.diaz at iib.uam.es
>>
>> http://ligarto.org/rdiaz
>>
>>
>

