[R-sig-ME] negatively biased variance estimates of random factors with few levels

kalman.toth kalman.toth at protonmail.com
Sun Feb 18 19:23:19 CET 2024


Dear John,

Thanks! I also noticed this while 'playing' with the simulation.

Regards,
Kalman

On Sunday, February 18th, 2024 at 7:15 PM, John Fox <jfox at mcmaster.ca> wrote:

> Dear Kalman,
> 
> I forgot to mention that I think that when, as here, it's possible to
> get 0 (or negative) variance component estimates, it's not quite right
> to censor or remove these values when estimating the expectation of the
> estimator. That's not much of an issue here, where the only 0 estimates
> are for n = 3, and even in this case they're rare.
> 
> Best,
> John
> 
> 
> On 2024-02-18 11:22 a.m., kalman.toth wrote:
> 
> > Dear John,
> > 
> > This seems quite convincing. ;)
> > Thanks a lot!
> > 
> > Kalman
> > 
> > On Sunday, February 18th, 2024 at 4:38 PM, John Fox <jfox at mcmaster.ca> wrote:
> > 
> > > Dear Kalman,
> > > 
> > > Could it be as simple as the variance being an unbiased estimator, so
> > > the standard deviation is biased?
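> 
> For intuition, a minimal sketch with plain normal samples (standing in for
> the mixed-model fit): the usual variance estimator is unbiased, but sqrt()
> is concave, so by Jensen's inequality its square root underestimates the SD:
> 
> set.seed(1)
> v <- replicate(1e5, var(rnorm(3)))  # unbiased variance estimates; true variance 1
> mean(v)        # ~1.00: unbiased on the variance scale
> mean(sqrt(v))  # ~0.89: biased downward on the SD scale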
> > > 
> > > I reran your simulation with two small modifications: I saved the
> > > variances and I increased the number of samples (nsim) from 500 to 5000,
> > > obtaining
> > > 
> > > > bias_pct
> > > 
> > > lme.20   lme.10    lme.5    lme.3 anova.20 anova.10  anova.5  anova.3
> > >     -2        0        0        0       -2        0        0        0
> > > 
> > > > bias_pct2 # bias without zero SD estimates
> > > 
> > > lme.20   lme.10    lme.5    lme.3 anova.20 anova.10  anova.5  anova.3
> > >     -1        0        0        1       -1        0        0        1
> > > 
> > > Does that help?
> > > 
> > > John
> > > 
> > > --
> > > John Fox, Professor Emeritus
> > > McMaster University
> > > Hamilton, Ontario, Canada
> > > web: https://www.john-fox.ca/
> > > On 2024-02-18 9:25 a.m., kalman.toth via R-sig-mixed-models wrote:
> > > 
> > > > Dear List Members,
> > > > 
> > > > On the glmmFAQ webpage Ben Bolker writes:
> > > > 
> > > > “Treating factors with small numbers of levels as random will in the best case lead to very small and/or imprecise estimates of random effects; in the worst case it will lead to various numerical difficulties such as lack of convergence, zero variance estimates, etc.. (A small simulation exercise shows that at least the estimates of the standard deviation are downwardly biased in this case; it’s not clear whether/how this bias would affect the point estimates of fixed effects or their estimated confidence intervals.) In the classical method-of-moments approach these problems may not arise (because the sums of squares are always well defined as long as there are at least two units), but the underlying problems of lack of power are there nevertheless.”
> > > > 
> > > > I am struggling to understand the reasons behind the potential bias in REML estimates when dealing with a small number of factor levels. To explore this issue, I modified Ben Bolker’s simulation as follows:
> > > > 
> > > > - set the sd1/sd2 ratio to 5/1 (to reduce the number of zero variance estimates and rule them out as a source of bias)
> > > > 
> > > > - compared the REML estimates with ANOVA (method-of-moments) estimates, because 1) the ANOVA variance estimates are expected to be unbiased and 2) for these balanced datasets the two estimates should be the same (see the note below this list)
> > > > 
> > > > - extended the simulation to sample sizes (numbers of f1 levels) of 3, 5, 10 and 20
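> > > > 
> > > > (A short note on point 2, using the standard balanced one-way results and the MSb/MSw notation from the code below: E[MSb] = sd2^2 + nrep*sd1^2 and E[MSw] = sd2^2, so with sd1 = 1, sd2 = 0.2 and nrep = 5, E[(MSb - MSw)/nrep] = (5.04 - 0.04)/5 = 1 = sd1^2. The moment estimator is therefore unbiased on the variance scale; any bias must enter through the square root.)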
> > > > 
> > > > My conclusions based on the extended simulation are:
> > > > 
> > > > - Consistency of REML and ANOVA estimates: the estimates are indeed the same
> > > > 
> > > > - Negative bias: Both methods give negatively biased estimates!?
> > > > 
> > > > - Impact of sample size: the magnitude of the bias decreases as the sample size increases (11, 7, 4 and 1% for sample sizes 3, 5, 10 and 20, respectively), but sample sizes as large as 20 or 50 might still give negatively biased estimates
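> > > > 
> > > > These percentages track the known downward bias of a plain sample SD of n normal values, E[s] = c4(n)*sigma with c4(n) = sqrt(2/(n-1)) * gamma(n/2) / gamma((n-1)/2). The estimator here is not exactly a sample SD, but with sd1/sd2 = 5/1 it behaves much like one over the n1 level means (a minimal sketch):
> > > > 
> > > > c4 <- function(n) sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
> > > > round((c4(c(3, 5, 10, 20)) - 1) * 100)  # -11 -6 -3 -1, close to the simulated bias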
> > > > 
> > > > I am particularly interested in how to handle factors that are conceptually random effects but have only a few levels. This scenario is common in the datasets I work with. I understand that there can be issues with power and that the SEs can be large when the sample size is small. But why would both the ANOVA and REML estimates be biased? And what could be done about the bias, apart from treating the factor as fixed? (I prefer to treat a factor as fixed when it really has few levels, e.g. 3-4, but not when it has 8-10 levels.)
> > > > 
> > > > Regards,
> > > > 
> > > > Kalman
> > > > 
> > > > Please, find here the modified simulation:
> > > > 
> > > > nrep <- 5  # replicates per level of f1
> > > > 
> > > > ## simulate a balanced one-way layout: n1 levels of f1, n2 replicates each
> > > > simfun <- function(n1 = 20, n2 = nrep, sd1 = 1, sd2 = 0.2) {
> > > >     d <- expand.grid(f1 = factor(seq(n1)), f2 = factor(seq(n2)))
> > > >     u1 <- rnorm(n1, mean = 0, sd = sd1)        # random level effects
> > > >     d$y <- rnorm(n1 * n2, mean = u1, sd = sd2) # u1 recycles in step with f1
> > > >     d
> > > > }
> > > > 
> > > > require(lme4)
> > > > 
> > > > ## REML SD estimate and ANOVA (method-of-moments) SD estimate
> > > > fitfun <- function(d = simfun()) {
> > > >     lme <- sqrt(unlist(VarCorr(lmer(y ~ (1 | f1), data = d))))
> > > >     a <- summary(aov(y ~ f1, data = d))[[1]]  # compute the ANOVA table once
> > > >     MSb <- a["f1", "Mean Sq"]         # row names are blank-padded, but
> > > >     MSw <- a["Residuals", "Mean Sq"]  # partial matching still finds them
> > > >     an <- sqrt((MSb - MSw) / nrep)    # NaN (with a warning) when MSb < MSw
> > > >     list(lme = lme, anova = an)
> > > > }
> > > > 
> > > > set.seed(101)
> > > > nsim <- 500
> > > > sd_dist0 <- replicate(nsim, fitfun())                # n1 = 20
> > > > sd_dist1 <- replicate(nsim, fitfun(simfun(n1 = 10)))
> > > > sd_dist2 <- replicate(nsim, fitfun(simfun(n1 = 5)))
> > > > sd_dist3 <- replicate(nsim, fitfun(simfun(n1 = 3)))
> > > > sd_List <- list(lme.20 = unlist(sd_dist0[1, ]), lme.10 = unlist(sd_dist1[1, ]),
> > > >                 lme.5 = unlist(sd_dist2[1, ]), lme.3 = unlist(sd_dist3[1, ]),
> > > >                 anova.20 = unlist(sd_dist0[2, ]), anova.10 = unlist(sd_dist1[2, ]),
> > > >                 anova.5 = unlist(sd_dist2[2, ]), anova.3 = unlist(sd_dist3[2, ]))
> > > > 
> > > > plotfun <- function(x, title) {
> > > >     par(las = 1, bty = "l")
> > > >     hist(x, breaks = 50, col = "gray", main = title, xlab = "est. sd", freq = FALSE)
> > > > }
> > > > 
> > > > par(mfrow = c(2, 4))
> > > > invisible(mapply(plotfun, sd_List, names(sd_List)))
> > > > 
> > > > ## proportion of zero (or failed) SD estimates in each setting
> > > > sapply(sd_List, function(x) mean(x == 0 | is.na(x)))
> > > > 
> > > > sfun <- function(x) {
> > > >     r <- list(mean = mean(x, na.rm = TRUE),
> > > >               mean2 = mean(x[which(x > 1e-5)]),  # mean excluding (near-)zero estimates
> > > >               sem = sd(x, na.rm = TRUE) / sqrt(length(x)))
> > > >     r <- with(r, c(r, list(lwr = mean - 2 * sem, upr = mean + 2 * sem)))
> > > >     unlist(r)
> > > > }
> > > > print(s_tab <- sapply(sd_List, sfun), digits = 3)
> > > > 
> > > > bias_pct <- round((s_tab["mean", ] - 1) * 100)
> > > > bias_pct2 <- round((s_tab["mean2", ] - 1) * 100)
> > > > bias_pct
> > > > bias_pct2 # bias without zero SD estimates
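> > > > 
> > > > To check that the bias lives on the SD scale rather than the variance scale, here is a minimal variance-scale version of the estimators (fitfun_var is a hypothetical helper, not part of the script above):
> > > > 
> > > > ## both estimators on the variance (squared) scale
> > > > fitfun_var <- function(d = simfun()) {
> > > >     a <- summary(aov(y ~ f1, data = d))[[1]]
> > > >     c(lme = unname(unlist(VarCorr(lmer(y ~ (1 | f1), data = d)))),
> > > >       anova = (a["f1", "Mean Sq"] - a["Residuals", "Mean Sq"]) / nrep)
> > > > }
> > > > var_dist3 <- replicate(nsim, fitfun_var(simfun(n1 = 3)))
> > > > rowMeans(var_dist3)  # both near sd1^2 = 1: (near-)unbiased on the variance scale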


