[R-sig-ME] A mixed effects model for variances of binomialprobability as a responsee

Fri Mar 28 03:06:52 CET 2014

On Thu, 01-01-1970, at 01:00, nimrod.rubinstein <nimrod.rubinstein at gmail.com> wrote:
> Hi Ramon (and others reading this),
>
> Thanks for the response.
>
> I'm attaching my real data - it has these columns:
> subject - subject name (S1, S2,...)
> group - group name (G1, G2) - a subject can only belong to either of the
> two groups
> environment - environment name (E1, E2, E3).
> the next 27 columns (named n.1-n.27) are the number of trails in each
> replicate
> the following 27 columns (named k.1-k.27) are the number of successes out
> of the number of trials in each corresponding replicate (so k.1 is the
> number of success out of n.1 trials, and so on).
>
> Note that there are NA values in corresponding n and k cells in the table.
> Also, not all subjects were sampled from all 3 environments due to
> technical problems, so the data are not fully balanced.
>
> I know this table is a bit different from my earlier description of the
> data. It is so only and it is because I wanted to make it simple.
>
> Let me try and clarify, and hopefully all of the below will answer your
> questions.
>
> What may affect the variances of the success probabilities (proportions)?
> 1. The proportion itself (natural to binomially distributed data). See
> attached Fig1 that plots sample variances of success probabilities vs.
> their sample mean.

But what are those variances? How are you computing them? (See below, though)

> 2. The number of trails in each replicate. As I said, the number of trails
> is inherent to the subject and the environment it was measured in, the
> experimenter has no control over it. As the attached Fig2 shows, subjects
> with a high average number of trials have lower sample variances of their
> success probabilities. The opposite however is not clear. The sample
> variances of subjects with a low average number of trials vary greatly. So
> obviously, the number of trails has an effect on the variance of the
> proportion but there's still a sizable fraction of variance (of the
> proportion variances) that are not explained by that.
> 3. The group - that's the effect I'm interested in. Fig1 and Fig2 show that
> the G1 dots have lower sample variances of success probabilities than G2
> dots.
> 4. The environment. These environments were chosen at random and therefore
> I think they should be REs. (In all attached figures the environments isn't
> indicated)
>
> I've also attached a figure of the success probabilities vs, the number of
> trials (Fig3).

Yes, I think the scenario I was concerned about is not here. Good.

>
> I think the figures show pretty clearly that both number of trails, success
> probability, and group affect the variance of success probability, and
> therefore the model I'm looking for should estimate all of these.
>

It is way too late here, but maybe we are using the same name (variance)
for two different things here: the variance of p, which is just a simple
function of the proportion, of the p itself, and the variance in the
collection of p's, where each p is not necessarily estimating the same
probability. And I think you care about the second problem, something like
saying that individuals in G1 and G2 have different dispersion in their
p's.  So you have a distribution for the p's of G1 and a distribution for
the p's of G2, and you want to know whether those distributions have the
same variance while also possibly controlling for other factors. I think
the book by Gelman and Hill has some examples of this kind (I do not have
it here now, though).

Best,

R.

>
> Thanks a lot,
> Nimrod
>
>
> On Thu, Mar 27, 2014 at 4:39 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com>wrote:
>
>> Dear Nimrod,
>>
>> Just my 2 cents (even less than 2):
>>
>> On Thu, 01-01-1970, at 01:00, nimrod.rubinstein <
>> nimrod.rubinstein at gmail.com> wrote:
>> > Hi all,
>> >
>> > I have the following data:
>> >
>> > 50 subjects that belong to 2 groups (25 in each group). For each subject
>> I
>> > have replicated measurements (in the range of 15-30 replicates per
>> subject)
>> > of the number of trails as well as the number successes out of these
>> > trails. Each single trial in each single replicate for each subject is
>> > assumed to be a Bernoulli experiment, therefore the number of successes
>> in
>> > each replicate is assumed to be a binomially distributed random variable.
>> > To complicate things even more, for each subject from each group I have
>> not
>> > 1 but 3 series of these replicated measurements, from 3 different
>> > environments (the 3 environments are the same for all subjects).
>> >
>> > A few example lines of how my data looks like:
>> > subject group environment     number.trails  number.successes
>> >
>> >     S1    G1          E1 10100,13209,17621  8500,11622,14684
>> >
>> >     S1    G1          E2 13322,12225,10024  10351,10222,7951
>> >
>> >     S1    G1          E3  9812,13500,13214  7245,10622,10684
>> >
>> >     .
>> >
>> >     .
>> >
>> >     .
>> >
>> >     S50    G2          E1 60547,75345,68412 12014,13214,15244
>> >
>> >     S50    G2          E2 50897,62541,61247 10141,13251,30645
>> >
>> >     S50    G2          E3 59214,84264,74156 17245,16425,11123
>> >
>> >
>> > Where the comma separated number of successes correspond to the comma
>> > separated number of trials in each replicate.
>> >
>> > I want to test whether the variances of the success probabilities of
>> > subjects from group G1 are significantly smaller than the variances of
>> the
>> > success probabilities of subjects from group G2.
>> >
>>
>> But, as you say in point 1. below (and in another reply), if this is really
>> binomial, then you already know that if probabilities in G1 are different
>> from probabilities in G2 your variances will differ.
>>
>> So I am not sure I even understand this hypothesis/question.
>>
>>
>> > In other words, for each subject in each replicate a success probability
>> > can be estimated from the observed data. Then the variance of those
>> success
>> > probabilities over all replicates from a specific environment can be
>> > estimated. I would like to know whether these variances are smaller in
>> > subjects from G1 vs. subjects from G2.
>> >
>> > The issues are:
>> >
>> >
>> >    1. The variance of a success probability depends on the success
>> >    probability itself (natural to binomially distributed data). I need to
>> >    account for this effect.
>> >    2. The number of trails (although at least in the hundreds for all
>> >    subjects) varies several orders of magnitude among subjects. That by
>> itself
>> >    affects the accuracy of estimating the success probability, but in
>> >    addition, I cannot rule out an inherent dependency between the number
>> of
>> >    observed trials and the success probability (there is no control over
>> the
>> >    number of trials in the experiment - it is observed). In other words,
>> >    having a lower number of trails by itself may affect the success
>> >    probability.
>> >
>>
>> That last issue seems very problematic to me. If I understand correctly, we
>> could think of a perverse situation where, say, you can never observe low
>> success probabilities if those are related to number of trials around 0.
>>
>>
>> Can you do a simple plot of probability of success (or its logit) vs number
>> of trials, to rule out perverse scenarios? Is probability of success always
>> larger than 0? And number of trials always sufficiently different from 0?
>>
>>
>> And you say "varies by subject", so the number of trials is always the same
>> for a subject? And the potential relationship prob <-> number of trials is
>> due to among-subject variation? So you could approach the problem via a
>> random effect for subject? If you compare G1 and G2 for number of trials:
>> are there differences?
>>
>>
>>
>>
>>
>> >
>> > I though a simple way to test my hypothesis is to compute the success
>> > probability in each replicate as number.successes/number.trails, and then
>> > the sample mean and sample variance of these success probabilities over
>> all
>> > replicates, as well as the mean number of trials over all replicates
>> (added
>> > to my data as sample.mean, sample.var and mean.trials fields). Then I can
>> > use the lmer function in the lme4 package this way:
>> >
>> > null.model = lmer(sample.var ~ sample.mean + mean.trials +
>> (1|environment),
>> > my.data, REML = F)
>> > alt.model = lmer(sample.var ~ sample.mean + mean.trials + group +
>> > (1|environment), my.data, REML = F)
>> >
>> > So fixed effects are: sample.mean, mean.trials and group, and random
>> > effects is environment since the three environments in my data are
>> simply a
>> > random sample from many other possible environments.
>>
>>
>> I do not see how this would answer your questions, given the above
>> concerns/issues and other possible approaches (just compare prob. between
>> G1 and G2, accounting for environment).
>>
>>
>> Best,
>>
>>
>> R.
>>
>>
>> >
>> > To test whether group has a significant effect I'll do:
>> >
>> > anova(null.model, alt.model).
>> >
>> > My concerns are:
>> >
>> >
>> >    1. Should I add subject as a random effect? My intuition is not to
>> since
>> >    subjects are unique to groups, which is the effect I'm after.
>> However, I do
>> >    want to account for the fact that sample variances of the same
>> subject from
>> >    the three different environments are correlated.
>> >    2. The response - sample variances, cannot be assumed to follow a
>> normal
>> >    distribution (they only take positive values to begin with and as far
>> as I
>> >    know (n-1)*sample.variance/population.variance ~ chi.square with df =
>> n-1)
>> >    and therefore the assumptions of a linear mixed effects model do not
>> >    strictly hold. Is there a generalized linear model that would be
>> >    appropriate for this?
>> >
>> > Thanks a lot,
>> > nimrod
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > R-sig-mixed-models at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
>> --
>> Ramon Diaz-Uriarte
>> Department of Biochemistry, Lab B-25
>> Facultad de Medicina
>> Universidad Autónoma de Madrid
>> Arzobispo Morcillo, 4
>> 28029 Madrid
>> Spain
>>
>> Phone: +34-91-497-2412
>>
>> Email: rdiaz02 at gmail.com
>>        ramon.diaz at iib.uam.es
>>
>> http://ligarto.org/rdiaz
>>
>>

-- 
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid 
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz