[R] comparing proportions
Robert A LaBudde
ral at lcfltd.com
Fri Feb 11 07:15:53 CET 2011
You need to change models 2 and 3 to use ~ group
+ subject. You left subject out as a fixed factor.
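
That is, keeping the column names from your test.txt, something like (a
sketch of the corrected calls, not run against your data):

fit2 <- glm(cbind(success, failure) ~ group + subject,
            data = test, family = binomial)
fit3 <- lm(prop.success ~ group + subject,
           weights = success + failure, data = test)
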
At 05:17 PM 2/10/2011, array chip wrote:
>Robert, thank you!
>
>I tried all 3 models you suggested. Since each
>subject only has one line of data in my dataset,
>would including Subject as a factor in glm() or
>lm() lead to 0 df for residuals?
>
>Attached is my dataset and here is my version of the 3 models:
>
>test<-read.table("test.txt",sep='\t',header=T)
>library(lme4)
>
>## model 1: generalized mixed model
>fit1<-glmer(cbind(success,failure)~group+(1|subject),data=test,family=binomial)
>## model 2: generalized linear model
>fit2<-glm(cbind(success,failure)~group,data=test,family=binomial)
>## model 3: linear model
>fit3<-lm(prop.success~group,weights=success+failure,data=test)
> > fit1
>Generalized linear mixed model fit by the Laplace approximation
>Formula: cbind(success, failure) ~ group + (1 | subject)
> Data: test
> AIC BIC logLik deviance
> 54.75 57.89 -24.38 48.75
>Random effects:
> Groups Name Variance Std.Dev.
> subject (Intercept) 1.8256 1.3511
>Number of obs: 21, groups: subject, 21
>Fixed effects:
> Estimate Std. Error z value Pr(>|z|)
>(Intercept) -0.6785 0.4950 -1.371 0.170
>groupgroup2 -0.7030 0.6974 -1.008 0.313
>
>
> > summary(fit2)
>
>Call:
>glm(formula = cbind(success, failure) ~ group, family = binomial,
> data = test)
>
>Deviance Residuals:
> Min 1Q Median 3Q Max
>-2.8204 -2.0789 -0.5407 1.0403 4.0539
>
>Coefficients:
> Estimate Std. Error z value Pr(>|z|)
>(Intercept) -0.4400 0.2080 -2.115 0.0344 *
>groupgroup2 -0.6587 0.3108 -2.119 0.0341 *
>---
>Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
>(Dispersion parameter for binomial family taken to be 1)
>
> Null deviance: 78.723 on 20 degrees of freedom
>Residual deviance: 74.152 on 19 degrees of freedom
>
> > summary(fit3)
>
>Call:
>lm(formula = prop.success ~ group, data = test, weights = success +
> failure)
>
>Residuals:
> Min 1Q Median 3Q Max
>-1.1080 -0.7071 -0.2261 0.4754 1.9157
>
>Coefficients:
> Estimate Std. Error t value Pr(>|t|)
>(Intercept) 0.39175 0.08809 4.447 0.000276 ***
>groupgroup2 -0.14175 0.12364 -1.146 0.265838
>---
>Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
>Residual standard error: 0.8676 on 19 degrees of freedom
>Multiple R-squared: 0.0647, Adjusted R-squared: 0.01548
>F-statistic: 1.314 on 1 and 19 DF, p-value: 0.2658
>
>
>
>As you can see, the 3 models gave quite
>different results: model 2 (the generalized
>linear model) gave a significant p-value, while
>model 1 (the generalized mixed model) and model 3
>(the regular linear model) did not. Also, as you can
>see from the data, prop.success has some values
>outside the range 0.15 to 0.85, so maybe the regular
>linear model is not appropriate?
>
>Thank you,
>
>John
>
>From: Robert A LaBudde <ral at lcfltd.com>
>To: array chip <arrayprofile at yahoo.com>
>Cc: Bert Gunter <gunter.berton at gene.com>; R-help at stat.math.ethz.ch
>Sent: Thu, February 10, 2011 12:54:44 PM
>Subject: Re: [R] comparing proportions
>
>1. If you use a random effects model, you should
>make Subject the random factor. I.e., a random
>intercepts model with 1|Subject. Group is a
>fixed effect: You have only 2 groups. Even if
>you had more than 2 groups, treating Group as
>random would return a standard deviation, not a
>P-value as you wanted. Finally, I doubt you
>believe the groups used are meaningless, and
>only the population of groups is of interest.
>Instead you consider them special, so Group is a fixed effect.
>
>2. The number of observations for each Subject
>is the number of trials, which you previously
>indicated were 7 to 10 in the cases listed.
>
>3. If you have no interest in the Subject
>effect, you can use a fixed Subject factor
>instead with glm() instead of glmer() or other
>mixed model function. This is a good idea so
>long as the number of subjects is, say, less
>than 10. Otherwise a mixed model would be a better idea.
>
>I suggest you fit all three models to learn
>about what you're doing: 1) glmer() or
>equivalent, with cbind(successes, failures) ~
>Group + (1|Subject); 2) glm() with
>cbind(successes, failures) ~ Subject + Group;
>and 3) lm(p ~ Subject + Group), where p is the
>proportion of successes for a particular subject and group.
>
>Then compare the results. They will probably all
>three give the same conclusion for the hypothesis
>about Group. I would guess the glmer()
>P-value will be the largest, then the glm(), and
>finally the lm(), but the last two may reverse.
>The lm() model may actually perform fairly well,
>as the Edgeworth series converges rapidly to
>normal for binomial distributions with p within
>0.15 to 0.85 and 10+ replicates, as I stated before.
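
A quick way to put the three Group p-values side by side, assuming the
fit1/fit2/fit3 objects defined in the code above and a recent lme4 (a
sketch, untested):

coef(summary(fit1))["groupgroup2", "Pr(>|z|)"]   # glmer
coef(summary(fit2))["groupgroup2", "Pr(>|z|)"]   # glm
coef(summary(fit3))["groupgroup2", "Pr(>|t|)"]   # lm
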
>
>I'd be interested in seeing the results of these
>3 fits myself just for curiosity.
>
>At 01:21 PM 2/10/2011, array chip wrote:
> > Robert and Bert, thank you both very much for
> > the response, really appreciated. I agree that
> > using a regular ANOVA (or regular t test) may not
> > be wise given the normality issue. So I am
> > debating between a generalized linear model using
> > glm(..., family=binomial) and a generalized linear
> > mixed-effects model using glmer(...,
> > family=binomial). I will forward to Robert the
> > off-list email I sent to Bert about whether to
> > use (1|subject) versus (1|group) in the mixed
> > model specification. If I use (1|group), both
> > models give me the same test for the fixed
> > effects, which is what I am mainly interested
> > in. So do I really need a mixed model here?
> >
> > Thanks again
> >
> > John
> >
> >
> > From: Bert Gunter <gunter.berton at gene.com>
> > To: Robert A LaBudde <ral at lcfltd.com>
> > Cc: array chip <arrayprofile at yahoo.com>
> > Sent: Thu, February 10, 2011 10:04:06 AM
> > Subject: Re: [R] comparing proportions
> >
> > Robert:
> >
> > Yes, exactly. In an off-list email exchange, he clarified this for me,
> > and I suggested exactly what you did, also with the caution that his
> > initial ad hoc suggestions were unwise. His subsequent posts to R-help
> > and the sig-mixed-models list were the result, although he appears to
> > have specified the model incorrectly in his glmer function (as
> > (1|Group) instead of (1|subject)).
> >
> > Cheers,
> > Bert
> >
> > On Thu, Feb 10, 2011 at 9:55 AM, Robert A LaBudde <ral at lcfltd.com> wrote:
> > > prop.test() is applicable to a binomial
> experiment in each of two classes.
> > >
> > > Your experiment is binomial only at the subject level. You then have
> > > multiple subjects in each of your groups.
> > >
> > > You have a random factor "Subjects" that must be accounted for.
> > >
> > > The best way to analyze is a generalized
> linear mixed model with a binomial
> > > distribution family and a logit or probit link. You will probably have to
> > > investigate overdispersion. If you have a small number of subjects, and
> > > don't care about the among-subject effect, you can model them as fixed
> > > effects and use glm() instead.
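
For the overdispersion check, one rough approach (a sketch, assuming a
data frame with the success/failure/group/subject columns used elsewhere
in this thread) is the Pearson chi-square over the residual degrees of
freedom:

library(lme4)
fit <- glmer(cbind(success, failure) ~ group + (1 | subject),
             data = test, family = binomial)
## a ratio much larger than 1 suggests overdispersion
sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
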
> > >
> > > Your original question, I believe, related to doing an ANOVA assuming
> > > normality. In order for this to work with
> this kind of proportion problem,
> > > you generally won't get good results unless the number of replicates per
> > > subject is 12 or more, and the proportions
> involved are within 0.15 to 0.85.
> > > Otherwise you will have biased confidence
> intervals and significance tests.
> > >
> > >
> > >
> > > At 07:51 PM 2/9/2011, array chip wrote:
> > >>
> > >> Hi Bert,
> > >>
> > >> Thanks for your reply. If I understand correctly, prop.test() is not
> > >> suitable for my situation. The input to prop.test() is 2 numbers for
> > >> each group (# of successes and # of trials; for example, group 1 has
> > >> 5 successes out of 10 trials, group 2 has 3 successes out of 7 trials,
> > >> etc.). prop.test() tests whether the probability of success is the
> > >> same across groups.
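
(In that aggregate form the call would be something like

prop.test(x = c(5, 3), n = c(10, 7))

i.e. one successes/trials pair per group, using the example counts above.)
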
> > >>
> > >> In my case, each group has several subjects and each subject has 2
> > >> numbers (# successes and # trials). So:
> > >>
> > >> for group 1:
> > >> subject 1: 5 success, 10 trials
> > >> subject 2: 3 success, 8 trials
> > >> :
> > >> :
> > >>
> > >> for group 2:
> > >> subject a: 7 success, 9 trials
> > >> subject b: 6 success, 7 trials
> > >> :
> > >> :
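
Written out as the kind of per-subject data frame used elsewhere in this
thread (a sketch; the subject labels and group levels are made up, the
counts are the examples above):

test <- data.frame(
  subject = c("s1", "s2", "sa", "sb"),
  group   = c("group1", "group1", "group2", "group2"),
  success = c(5, 3, 7, 6),
  trials  = c(10, 8, 9, 7)
)
test$failure      <- test$trials - test$success
test$prop.success <- test$success / test$trials
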
> > >>
> > >> I want to test whether the probability of success in group 1 is the
> > >> same as in group 2. It's like comparing 2 groups of samples using a
> > >> t test; what I am uncertain about is whether a regular t test (or
> > >> non-parametric test) is still appropriate here when the response
> > >> variable is actually a proportion.
> > >>
> > >> I guess prop.test() cannot be used with my dataset, or am I wrong?
> > >>
> > >> Thanks
> > >>
> > >> John
> > >>
> > >> ________________________________
> > >> From: Bert Gunter <gunter.berton at gene.com>
> > >>
> > >> Sent: Wed, February 9, 2011 3:58:05 PM
> > >> Subject: Re: [R] comparing proportions
> > >>
> > >> 1. Is this a homework problem?
> > >>
> > >> 2. ?prop.test
> > >>
> > >> 3. If you haven't done so already, get and consult a basic statistical
> > >> methods book to help you with questions such as this.
> > >>
> > >> -- Bert
> > >>
> > >>
> > >> > Hi, I have a dataset that has 2 groups of samples. For each sample,
> > >> > the response measured is the number of successes (no.success)
> > >> > obtained out of the number of trials (no.trials). So a proportion of
> > >> > successes (prop.success) can be computed as no.success/no.trials.
> > >> > Now the objective is to test if there is a statistically significant
> > >> > difference in the proportion of successes between the 2 groups of
> > >> > samples (say n1=20, n2=30).
> > >> >
> > >> > I can think of 3 ways to do the test:
> > >> >
> > >> > 1. regular t test based on the variable prop.success
> > >> > 2. Mann-Whitney test based on the variable prop.success
> > >> > 3. do a binomial regression as:
> > >> > fit<-glm(cbind(no.success,no.trials-no.success) ~ group, data=data,
> > >> > family=binomial)
> > >> > anova(fit, test='Chisq')
> > >> >
> > >> > My questions are:
> > >> > 1. Is a t test appropriate for comparing 2 groups of proportions?
> > >> > 2. How about the Mann-Whitney non-parametric test?
> > >> > 3. Among the 3, which technique is most appropriate?
> > >> > 4. Any other technique you can suggest?
> > >> >
> > >> > Thank you,
> > >> >
> > >> > John
> > >> >
> > >> >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Bert Gunter
> > >> Genentech Nonclinical Biostatistics
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
> >
> >
> > --
> > Bert Gunter
> > Genentech Nonclinical Biostatistics
> > 467-7374
> >
> > http://devo.gene.com/groups/devo/depts/ncb/home.shtml
>
>
>
================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd. URL: http://lcfltd.com/
824 Timberlake Drive Tel: 757-467-0954
Virginia Beach, VA 23464-3239 Fax: 757-467-2947
"Vere scire est per causas scire"
================================================================