[R] repeated measures: multiple comparisons with pairwise.t.test and multcomp disagree

Wed Jun 24 19:30:37 CEST 2015

Thank you, Thierry. And yes, Bert, it turns out that it is more of a statistical question after all, but again, since my question used specific R functions, R experts are well placed to help me.

As pairewise.t.test was recommended in a few tutorials about repeated-measure Anovas, I assumed it took into account the fact that the measures were indeed repeated, so thank you for pointing out that it does not.

But my reason for not accepting the result of multcomp went further than this. Before deciding to test 4 different durations, I had tested only two of them, corresponding to sets 1 and 2 of my example. I used a paired t test (as in t test for paired samples). I had a very significant effect, i.e. the mean of the differences calculated for each subject was significantly different from zero.

After adding two other durations and switching from my paired t test to a repeated measures design, these same 2 sets are no longer different. I think the explanation is lack of homogeneity of variances. I thought a log transformation of the raw data had been sufficient to fix this, and a Levene test on the variances of the 4 sets found no problem in this regard.

But maybe it is the variance of all the possible differences (set 1 vs 2, etc, for a total of 6 differences calculated for each subject) that matters.  I just calculated these and they range from 1.788502e-05 to 1.462171e-03. A Levene test on these 6 "groups" showed that their variances were heterogeneous.

I think I'll stay away from  the "repeated measures followed by multiple comparisons" and just report my 6 t tests for paired samples, correcting the p-level for the number of comparisons with, say, the Sidak method (p for significance is then 0.0085).

Thanks for your help. 

Denis

> Le 2015-06-23 à 08:15, Thierry Onkelinx <thierry.onkelinx at inbo.be> a écrit :
> 
> Dear Denis,
> 
> It's not multcomp which is too conservative, it is the pairwise t-test
> which is too liberal. The pairwise t-test doesn't take the random
> effect of Case into account.
> 
> Best regards,
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
> and Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
> 
> To call in the statistician after the experiment is done may be no
> more than asking him to perform a post-mortem examination: he may be
> able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does
> not ensure that a reasonable answer can be extracted from a given body
> of data. ~ John Tukey
> 
> 
> 2015-06-23 5:17 GMT+02:00 Denis Chabot <denis.chabot at me.com>:
>> Hi,
>> 
>> I am working on a problem which I think can be handled as a repeated measures analysis, and I have read many tutorials about how to do this with R. This part goes well, but I get stuck with the multiple comparisons I'd like to run afterward. I tried two methods that I have seen in my readings, but their results are quite different and I don't know which one to trust.
>> 
>> The two approaches are pairwise.t.test() and multcomp, although the latter is not available after a repeated-measures aov model, but it is after a lme.
>> 
>> I have a physiological variable measured frequently on each of 67 animals. These are then summarized with a quantile for each animal. To check the effect of experiment duration, I recalculated the quantile for each animal 4 times, using different subset of the data (so the shortest subset is part of all other subsets, the second subset is included in the 2 others, etc.). I handle this as 4 repeated (non-independent) measurements for each animal, and want to see if the average value (for 67 animals) differs for the 4 different durations.
>> 
>> Because animals with high values for this physiological trait have larger differences between the 4 durations than animals with low values, the observations were log transformed.
>> 
>> I attach the small data set (Rda format) here, but it can be obtained here if the attachment gets stripped:
>> <https://dl.dropboxusercontent.com/u/612902/RepMeasData.Rda>
>> 
>> The data.frame is simply called Data.
>> My code is
>> 
>> load("RepMeasData.Rda")
>> Data_Long = melt(Data, id="Case")
>> names(Data_Long) = c("Case","Duration", "SMR")
>> Data_Long$SMR = log10(Data_Long$SMR)
>> 
>> # I only show essential code to reproduce my opposing results
>> mixmod = lme(SMR ~ Duration, data = Data_Long, random = ~ 1 | Case)
>> anova(mixmod)
>> posthoc <- glht(mixmod, linfct = mcp(Duration = "Tukey"))
>> summary(posthoc)
>>        Simultaneous Tests for General Linear Hypotheses
>> 
>> Multiple Comparisons of Means: Tukey Contrasts
>> 
>> 
>> Fit: lme.formula(fixed = SMR ~ Duration, data = Data_Long, random = ~1 |
>>   Case)
>> 
>> Linear Hypotheses:
>>                 Estimate Std. Error z value Pr(>|z|)
>> Set2 - Set1 == 0 -0.006135   0.003375  -1.818    0.265
>> Set3 - Set1 == 0 -0.002871   0.003375  -0.851    0.830
>> Set4 - Set1 == 0  0.015395   0.003375   4.561   <1e-04 ***
>> Set3 - Set2 == 0  0.003264   0.003375   0.967    0.768
>> Set4 - Set2 == 0  0.021530   0.003375   6.379   <1e-04 ***
>> Set4 - Set3 == 0  0.018266   0.003375   5.412   <1e-04 ***
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> (Adjusted p values reported -- single-step method)
>> 
>> with(Data_Long, pairwise.t.test(SMR, Duration, p.adjust.method="holm", paired=T))
>>       Pairwise comparisons using paired t tests
>> 
>> data:  SMR and Duration
>> 
>>    Set1    Set2    Set3
>> Set2 < 2e-16 -       -
>> Set3 0.11118 0.10648 -
>> Set4 0.00475 7.9e-05 0.00034
>> 
>> P value adjustment method: holm
>> 
>> So the difference between sets 1 and 2 goes from non significant to very significant, depending on method.
>> 
>> I have other examples with essentially the same type of data and sometimes the two approches differ in the opposing way. In the example shown here, multcomp was more conservative, in some others it yielded a larger number of significant differences.
>> 
>> I admit not mastering all the intricacies of multcomp, but I have used multcomp and other methods of doing multiple comparisons many times before (but never with a repeated measures design), and always found the results very similar. When there were small differences, I trusted multcomp. This time, I get rather large differences and I am worried that I am doing something wrong.
>> 
>> Thanks in advance,
>> 
>> Denis Chabot
>> Fisheries & Oceans Canada
>> 
>> sessionInfo()
>> R version 3.2.0 (2015-04-16)
>> Platform: x86_64-apple-darwin13.4.0 (64-bit)
>> Running under: OS X 10.10.3 (Yosemite)
>> 
>> locale:
>> [1] fr_CA.UTF-8/fr_CA.UTF-8/fr_CA.UTF-8/C/fr_CA.UTF-8/fr_CA.UTF-8
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] multcomp_1.4-0  TH.data_1.0-6   survival_2.38-1 mvtnorm_1.0-2   nlme_3.1-120    car_2.0-25      reshape2_1.4.1
>> 
>> loaded via a namespace (and not attached):
>> [1] Rcpp_0.11.5      magrittr_1.5     splines_3.2.0    MASS_7.3-40      lattice_0.20-31  minqa_1.2.4      stringr_1.0.0
>> [8] plyr_1.8.2       tools_3.2.0      nnet_7.3-9       pbkrtest_0.4-2   parallel_3.2.0   grid_3.2.0       mgcv_1.8-6
>> [15] quantreg_5.11    lme4_1.1-7       Matrix_1.2-0     nloptr_1.0.4     codetools_0.2-11 sandwich_2.3-3   stringi_0.4-1
>> [22] SparseM_1.6      zoo_1.7-12
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.