[R-sig-ME] Model diagnostics show slope in residuals plot and slope on the observed vs fitted plot is different than y = x

Fox, John jfox at mcmaster.ca
Mon Oct 3 20:21:47 CEST 2016


Dear Carlos,

> -----Original Message-----
> From: Carlos Familia [mailto:carlosfamilia at gmail.com]
> Sent: Monday, October 3, 2016 2:03 PM
> To: Fox, John <jfox at mcmaster.ca>
> Cc: r-sig-mixed-models at r-project.org; Thierry Onkelinx
> <thierry.onkelinx at inbo.be>
> Subject: Re: [R-sig-ME] Model diagnostics show slope in residuals plot
> and slope on the observed vs fitted plot is different than y = x
> 
> Dear John,
> 
> Do you suggest, then, that I should fit a separate linear regression
> model per condition and, instead of analysing the overall effect of W
> across conditions, analyse it for each condition separately?

No, I'm not recommending that, or anything in particular beyond the observation that randomly discarding data in order to fit a fixed-effects model to the remaining independent observations is probably not a good idea. I also don't see what's to be gained by fitting separate models to the conditions; if you expect interactions with condition, why not model them? I don't think it would be responsible of me to give you statistical advice by email knowing next to nothing about your research.
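
As a minimal sketch (assuming the variable names Y, W, C and ID from the original question), the interaction with condition can be kept in the mixed model and assessed directly, for example with a likelihood-ratio test, rather than fitting one model per condition:

library(lme4)

# full model with the W-by-condition interaction vs. a reduced model without it
fit_full    <- lmer(Y ~ W * C + (1 | ID), data = df, REML = FALSE)
fit_reduced <- lmer(Y ~ W + C + (1 | ID), data = df, REML = FALSE)
anova(fit_reduced, fit_full)  # likelihood-ratio test of the W:C interaction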

> Do you think this would be fine for a referee?

I have no idea. I probably wouldn't feel that I could predict an unknown referee's response to your work even if I were in a position to recommend what you should do.

I'm sorry that I can't be more helpful.

Best,
 John

> Please note that I have no problem with this.
> 
> 
> Many thanks,
> Carlos Família
> 
> 
> 
> 	On 3 Oct 2016, at 14:19, Fox, John <jfox at mcmaster.ca> wrote:
> 
> 	Dear Carlos,
> 
> 	If I understand properly what's troubling you and the nature of the
> 	suggestion, I wouldn't do it -- that is, randomly discard data to get
> 	one observation per case.
> 
> 	Paul Johnson explained clearly why the pattern you noticed in the
> 	residuals vs. fitted values plot occurred, as a consequence of
> 	shrinkage. One way of thinking about this is that using a mixed model
> 	is *more* important when you have few observations per case, where
> 	shrinkage will be greater, than when you have many observations per
> 	case, where the "estimates" of the random effects will nearly coincide
> 	with the case means (or, in a more complex model, the within-case
> 	regressions).
> 
> 	Best,
> 	John
> 
> 	-----------------------------
> 	John Fox, Professor
> 	McMaster University
> 	Hamilton, Ontario
> 	Canada L8S 4M4
> 	Web: socserv.mcmaster.ca/jfox
> 
> 
> 
> 
> 
> 		-----Original Message-----
> 		From: R-sig-mixed-models [mailto:r-sig-mixed-models-bounces at r-project.org]
> 		On Behalf Of Carlos Familia
> 		Sent: October 3, 2016 8:20 AM
> 		To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
> 		Cc: r-sig-mixed-models at r-project.org
> 		Subject: Re: [R-sig-ME] Model diagnostics show slope in residuals plot
> 		and slope on the observed vs fitted plot is different than y = x
> 
> 		Do you think this approach is sound and easily justifiable in a paper?
> 
> 		Many thanks,
> 		Carlos
> 
> 
> 			On 3 Oct 2016, at 13:10, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
> 
> 			In case you have more than one measurement per ID, select one of
> 			them at random. Something like
>
> 			library(dplyr)
> 			# keep a single randomly chosen row per id
> 			df %>%
> 			  group_by(id) %>%
> 			  sample_n(1)
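>
> 			As a minimal sketch (assuming the variable names Y, W, C and ID from
> 			the original question, and a hypothetical subsampled data frame
> 			called df_one), the single-observation-per-subject data could then
> 			be analysed with an ordinary linear model:
>
> 			library(dplyr)
> 			df_one <- df %>%
> 			  group_by(ID) %>%   # one randomly chosen row per subject
> 			  sample_n(1)
> 			fit_lm <- lm(Y ~ W * C, data = df_one)  # independent observations only
> 			summary(fit_lm)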
> 
> 
> 
> 			ir. Thierry Onkelinx
> 			Instituut voor natuur- en bosonderzoek / Research Institute for
> 			Nature and Forest
> 			team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> 			Kliniekstraat 25
> 			1070 Anderlecht
> 			Belgium
> 
> 			To call in the statistician after the experiment is done may be no
> 			more than asking him to perform a post-mortem examination: he may
> 			be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
> 			The plural of anecdote is not data. ~ Roger Brinner
> 			The combination of some data and an aching desire for an answer does
> 			not ensure that a reasonable answer can be extracted from a given
> 			body of data. ~ John Tukey
> 
> 			2016-10-03 14:06 GMT+02:00 Carlos Familia <carlosfamilia at gmail.com>:
> 
> 
> 			Dear Thierry,
> 
> 			When you say sampling the data to have only one observation per ID,
> 			do you mean reducing the dataset to cases where the Y variable was
> 			measured in only one condition?
> 
> 
> 			I have thought about going with lm for this, but it just didn't
> 			feel right...
> 
> 			Many thanks,
> 			Carlos
> 
> 
> 
> 				On 3 Oct 2016, at 13:02, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
> 
> 
> 
> 				Dear Carlos,
> 
> 				I concur with Paul. After playing a bit with the data you sent me
> 				privately, I see a few things that cause problems:
>
> 				1) the number of measurements per ID is low: 1/3 of the IDs have a
> 				measurement in each level of C, 1/3 in two out of three levels of
> 				C, and 1/3 in only one level of C.
> 				2) the variance of ID is larger than the residual variance.
> 				3) the effect of W is small compared to the variance of ID.
>
> 				If possible, try to add extra covariates. If not, I'd fall back on
> 				a simple lm, either ignoring the repeated measurements or sampling
> 				the data so that you have only one observation per ID.
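>
> 				A minimal sketch (assuming an lme4 fit with the formula from the
> 				original question, stored in an object called fit) of how the
> 				variance components behind points 2) and 3) can be inspected:
>
> 				library(lme4)
> 				fit <- lmer(Y ~ W * C + (1 | ID), data = df)
> 				VarCorr(fit)                 # ID and residual standard deviations
> 				as.data.frame(VarCorr(fit))  # same information, as variances
> 				fixef(fit)["W"]              # slope of W at the reference level of C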
> 
> 
> 
> 				Best regards,
> 
> 
> 				ir. Thierry Onkelinx
> 				Instituut voor natuur- en bosonderzoek / Research Institute for
> 				Nature and Forest
> 				team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> 				Kliniekstraat 25
> 				1070 Anderlecht
> 				Belgium
> 
> 				To call in the statistician after the experiment is done may be no
> 				more than asking him to perform a post-mortem examination: he may
> 				be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
> 				The plural of anecdote is not data. ~ Roger Brinner
> 				The combination of some data and an aching desire for an answer
> 				does not ensure that a reasonable answer can be extracted from a
> 				given body of data. ~ John Tukey
> 
> 				2016-10-03 11:17 GMT+02:00 Paul Johnson <paul.johnson at glasgow.ac.uk>:
> 
> 
> 				Hi Carlos,
> 
> 				Your plot didn't come through, as Thierry noted. However, it is
> 				expected that, unlike a standard linear regression model, an LMM
> 				with a nested structure such as yours will show a positive linear
> 				relationship between the residuals and the fitted values (this
> 				might also be true for other structures?), provided the residual
> 				and random-effect variances are > 0. Somebody will hopefully chip
> 				in with a formal explanation, but it's basically a similar
> 				phenomenon to regression to the mean.
> 
> 
> 
> 				Imagine a group of students each taking an aptitude test three
> 				times. There are two random factors: the difference in underlying
> 				aptitude between the students, modelled by the student ID random
> 				effect; and random variation between time points within each
> 				student (e.g. how good a particular student is feeling on a
> 				particular day). I'm ignoring variation between tests -- let's
> 				unrealistically assume the tests are all the same and that
> 				students completely forget about them between tests.
> 
> 
> 
> 				The students with the best mean scores will be a mixture of
> 				excellent students having three so-so (some good, some bad) days
> 				and moderately good students having the good luck to have three
> 				good days, and the very best scores will come from students who
> 				were both excellent and lucky (although this category will be
> 				small). An important point is that there is no way of using the
> 				data to separate the moderate-student-lucky-days high scores from
> 				the excellent-student-average-days scores. If we simply took the
> 				mean of the scores, we would be overestimating the performance of
> 				the students on average (we'd have good estimates of the excellent
> 				students and overestimates of the moderate ones), so the best
> 				estimate is achieved by shrinking the scores towards the mean.
> 
> 
> 
> 				This is what happens when the model is fitted. Each student is
> 				given a residual (random effect) at the student level (how good
> 				the student is relative to the value predicted by the fixed
> 				effects) and three residuals at the observation
> 				(between-test-within-student) level. For students with good
> 				scores, this will be a compromise between the inseparable
> 				excellent-student-average-days scenario and the
> 				moderate-student-lucky-days scenario. As a result, students with
> 				high student-level residuals (the student random effects) will
> 				also tend to have high inter-test residuals. The same holds in the
> 				negative direction for poor students and students having three bad
> 				days. So the student random effects (which are part of the fitted
> 				values) and the residuals will be positively correlated.
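>
> 				A minimal simulation sketch of this point (the grouping structure
> 				and parameter values are invented for illustration): data are
> 				generated from a simple random-intercept model, and the residuals
> 				of the fitted model show a positive association with the fitted
> 				values purely because of shrinkage.
>
> 				library(lme4)
> 				set.seed(1)
> 				n_student <- 100                       # number of students
> 				n_test    <- 3                         # tests per student
> 				id        <- rep(seq_len(n_student), each = n_test)
> 				b         <- rnorm(n_student, sd = 2)  # true student effects
> 				score     <- 50 + b[id] + rnorm(length(id), sd = 2)
> 				sim       <- data.frame(id = factor(id), score = score)
> 				m         <- lmer(score ~ 1 + (1 | id), data = sim)
> 				plot(fitted(m), resid(m))              # positive slope appears
> 				abline(lm(resid(m) ~ fitted(m)))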
> 
> 
> 
> 				You can check this by simulating new data from the fitted model,
> 				re-fitting the model, and comparing the residuals-vs-fitted plot
> 				(which will be "perfect") to the one from your data. Here's a
> 				function that does this for lme4 fits:
>
> 				devtools::install_github("pcdjohnson/GLMMmisc")
> 				library(GLMMmisc)
> 				# repeat a few times to account for sampling error
> 				sim.residplot(fit)
>
> 				If all is well, you should see a similar slope between the real
> 				and the simulated plots; in fact, the general pattern of the
> 				residuals should be similar.
> 
> 
> 
> 				(The new package DHARMa --
> 				https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-package-for-residual-diagnostics-of-glmms/
> 				-- takes a similar approach to assessing residuals, but in a less
> 				quick-and-dirty, more formally justified way.)
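>
> 				As a minimal sketch (assuming the lme4 model object is called
> 				fit), the DHARMa workflow looks roughly like this:
>
> 				library(DHARMa)
> 				# simulate scaled residuals from the fitted model and plot the
> 				# standard diagnostic panels
> 				simulationOutput <- simulateResiduals(fittedModel = fit, n = 250)
> 				plot(simulationOutput)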
> 
> 				All the best,
> 				Paul
> 
> 
> 
> 
> 
> 
> 					On 2 Oct 2016, at 16:57, Carlos Familia <carlosfamilia at gmail.com> wrote:
> 
> 
> 
> 					Hello,
> 
> 					I have at hand a quite large and unbalanced dataset, in which a
> 					continuous dependent variable (Y) was measured in 3 different
> 					conditions (C) for about 3000 subjects (ID) (although not all
> 					subjects have Y values for all 3 conditions). Additionally,
> 					there is a continuous measure (W) which was measured for all
> 					subjects.
> 
> 
> 
> 					I am interested in testing the following:
>
> 					- Is the effect of W significant overall?
> 					- Is the effect of W significant at each level of C?
> 					- Is the effect of C significant?
> 
> 					In order to try to answer this, I have specified the following
> 					model with lmer:
>
> 					lmer(Y ~ W * C + (1 | ID), data = df)
>
> 					which seems to properly reflect the structure of the data (I
> 					might be wrong here; any suggestions would be welcome).
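>
> 					A minimal sketch (the use of lmerTest and emmeans here is an
> 					assumption, not something prescribed in the thread) of how the
> 					three questions above might be approached:
>
> 					library(lmerTest)  # lmer() plus Satterthwaite tests for fixed effects
> 					library(emmeans)
>
> 					fit <- lmer(Y ~ W * C + (1 | ID), data = df)
> 					anova(fit)                     # tests for W, C and the W:C interaction
> 					emtrends(fit, ~ C, var = "W")  # estimated slope of W within each level of C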
> 
> 
> 					However, when running the diagnostic plots I noticed a slope in
> 					the residuals plot and a slope different from y = x in the
> 					observed vs fitted plot (as shown below), which made me question
> 					the validity of the model for inference.
> 
> 
> 
> 					Could I still use this model for inference? Should I specify a
> 					different formula? Should I turn to lme and try to include
> 					different variances for each level of condition (C)? Any ideas?
> 
> 
> 
> 					I would really appreciate it if anyone could help me with this.
> 
> 					Thanks in advance,
> 					Carlos Família
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 		_______________________________________________
> 		R-sig-mixed-models at r-project.org mailing list
> 		https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> 



More information about the R-sig-mixed-models mailing list