[R-meta] Multilevel with dataset where lines are treatment means

Sat Sep 23 23:42:26 CEST 2017

Regarding question 1:

Generally speaking, you want to use 'random = ~1|experiment/study/line'. The standard random effects model includes random effects for each estimate/row. Multilevel models are an extension of this, where we add random effects for higher level clustering levels. Those are *added* to the model -- they don't replace the random effects per estimate/row. See also:

http://www.metafor-project.org/doku.php/analyses:konstantopoulos2011#a_common_mistake_in_the_three-level_model

Using 'random = ~1|experiment/study' would assume that true outcomes are homogeneous within each level of 'study within experiment'. That may not be the case, so I would always use (or at least start with) 'random = ~1|experiment/study/line'.

Regarding question 2:

I don't think there is any general consensus on how to deal with outliers/influential estimates/clusters, especially in more complex models. But I think most people would agree that we should not just delete them (unless they are clearly data entry/coding errors). 

And yes, you can use 'cluster=dat$study' (or any other higher level clustering variables, such as dat$experiment) in rstudent() and cooks.distance() for 'rma.mv' objects. Again, there isn't one best strategy here. A particular estimate could be an outlier/influential, an entire study could be, or even an whole experiment. You might as well examine all of those possibilities.

Best,
Wolfgang

-----Original Message-----
From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces at r-project.org] On Behalf Of Martineau, Roger
Sent: Thursday, 14 September, 2017 23:08
To: r-sig-meta-analysis at r-project.org
Subject: [R-meta] Multilevel with dataset where lines are treatment means

Hi all,

I have a large data set to analyze and I want to get my model (and modus operandi) right before publication. Therefore I will provide a fictive example much smaller (n=28) but having a similar structure (i.e., experiment/study).

Thanks in advance,

Roger ☺

The fictive data are:

dat <- data.frame(
          line=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28),
          experiment=c("A","A","A","A","B","B","B","B","C","C","C","C",
                       "D","D","D","D","D","D","D","D","E","E","E","E","E","E","E","E"),
          study=c("A1","A1","A2","A2","B1","B1","B2","B2","C1","C1","C1","C1",
                  "D1","D1","D1","D1","D2","D2","D2","D2","E1","E1","E1","E1","E2","E2","E2","E2"),
          outcome=c(10,18,5,25,13,48,7,25,6,13,20,35,65,45,30,15,55,45,30,10,45,40,35,30,35,26,18,8),
          SEM=c(5,5,6,6,7,7,3,3,7,7,7,7,6,6,6,6,5,5,5,5,6,6,6,6,5,5,5,5),
          moderator1=c(5,10,25,45,8,30,5,35,5,10,15,20,10,20,25,30,20,30,40,50,10,15,20,25,5,6,9,12),
          moderator2=c("cat1","cat1","cat1","cat1","cat1","cat1","cat1","cat1",
                       "cat1","cat1","cat1","cat1","cat2","cat2","cat2","cat2",
                       "cat2","cat2","cat2","cat2","cat2","cat2","cat2","cat2",
                       "cat2","cat2","cat2","cat2"))
dat
str(dat)

In this data set, "experiment" represents different publications (n=5) and one can fin 1 or 2 studies in each publication. Each study reports at least 2 treatment means (i.e., outcome) and up to 4 treatment means per study (see E2). In animal science, milk yield is an example of such an oucome. Indeed SEM of outcome is identical per study.

I included a continuous (moderator1) and a categorical moderator (moderator2). I deliberately set data such as there is an interaction (the outcome increase with increasing values of moderator1 for cat1 and vice-versa for cat2).

I have a couple of questions on the proper way to analyse such a data set reporting treatment means (and associated SEM).

**** Question 1 concerns the correct random argument structure to use:

# centering moderator:
dat <- within(dat, {
  cmoderator1   <- moderator1 - mean(moderator1,na.rm = TRUE)
})

# model with 2 options for random argument:
res <- rma.mv(outcome,SEM^2, data=dat,
              mods = ~
                cmoderator1*factor(moderator2),
              random = ~1|experiment/study, # option 1
              #random = ~1|experiment/study/line, # option 2
              method = "REML")
summary(res)

Results with option 1 and 2 are close but which one is correct ? and why ?

Based on previous posts, I assume that SE and P-values are correctly estimated using the robust function and the upper level cluster (i.e. experiment):

robust(res, cluster=dat$experiment)

In my large data set, I look for outliers and influential cases with studentitized residuals and Cook's distance values.

**** My second question is: is it OK to identify (and remove from the data set) outliers and influential cases on a study basis (i.e., using the cluster argument in the functions) ? or is it better to identify idividual treatment means (by removing the cluster argument) ?

#### .. Student residuals ####
tmp.stud.res <- rstudent(res,
                         cluster=dat$study, # is this OK ?
                         progbar=TRUE)
order(-tmp.stud.res$cluster$X2)
tmp.stud.res$cluster[c(xxx)]

#### .. Cook's distance ####
par(mfrow=c(1,1))
tmp.cook <- cooks.distance(res,
                           cluster=dat$study,  # is this OK ?
                           progbar=TRUE)
plot(tmp.cook, type="o", pch=19)
which(tmp.cook > xxx)

Indeed, I do the other functions:

### Likelihood profile plots ###
par(mfrow=c(2,1))
profile.rma.mv(res)

### Normal Q-Q Plot
par(mfrow=c(1,1))
qqnorm(rstandard(res)$z, pch=19)
qqline(rstandard(res)$z)

### Funnel plots (publication bias) ###
par(mfrow=c(2,2))
funnel(res, main="Standard Error")
funnel(res, yaxis="vi", main="Sampling Variance")
funnel(res, yaxis="seinv", main="Inverse Standard Error")
funnel(res, yaxis="vinv", main="Inverse Sampling Variance")

This is the graph which shows the interaction:

#### .. Graph ####
(res <- rma.mv(outcome, SEM^2, data=dat,
                  mods = ~
                    factor(moderator2)-1 +
                    factor(moderator2):cmoderator1,
                  random = ~1|experiment/study,
                  method = "REML"))

tmp.coeff_cat <- round(coef(summary(res)), 4)
tmp <- ddply(dat, .(moderator2), summarise,
             min = min(cmoderator1, na.rm = TRUE),
             max = max(cmoderator1, na.rm = TRUE))

f1 <- function(x)  { #  function for tmp.casein studies
  ifelse(x<tmp[1,3]& x>tmp[1,2],
         tmp.coeff_cat[1,1] +
           (tmp.coeff_cat[3,1]) *x,NA)}
f2 <- function(x)  { #  function for tmp.dietary studies
  ifelse(x<tmp[2,3]& x>tmp[2,2],
         tmp.coeff_cat[2,1] +
           (tmp.coeff_cat[4,1]) *x,NA)}

tmp.plot       <- ggplot(dat, aes(cmoderator1, outcome), na.rm = TRUE) +
  geom_line(aes(group = study, colour = moderator2, linetype = moderator2), size = 1) +
  geom_point(aes(group = study, colour = moderator2), size = 2)

tmp.plot + labs(
  x = "Moderator1 (centered values)")+
  scale_linetype_manual(
    name="Category:",
    values=c("solid","solid"),
    breaks=c("cat1","cat2"),
    labels=c("cat1","cat2")) +
  scale_colour_manual(
    name="Category:",
    values=c("brown","royalblue"),
    breaks=c("cat1","cat2"),
    labels=c("cat1","cat2")) +
  stat_function(fun=f1,  colour = "red", size = 2, linetype="solid") +
  stat_function(fun=f2,  colour = "blue", size = 2, linetype="solid") +
  ggtitle("Test")+
  theme(
    plot.title = element_text(hjust = 0.5, family="serif", size = 25),
    axis.title=element_text(family= "serif", colour="black", size=18),
    axis.line=element_line(color="black"),
    axis.ticks = element_blank(),
    axis.text=element_text(family= "serif", colour="black", size=15) ,
    legend.title=element_text(family="serif", size=15),
    legend.text=element_text(family="serif", size=15),
    legend.position="bottom")