[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?

Mon Dec 27 02:35:17 CET 2021

Hello R folks,
Today I noticed that using the subset argument in lm() with an orthogonal polynomial term gives a different result than fitting the same model to data that has already been subset. This was not at all intuitive to me. You can see an example here: https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i
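
For concreteness, a minimal sketch of the kind of comparison I mean. The data frame `d`, the cutoff `x <= 10`, and the object names are made up for illustration; this is not the example from the Stack Overflow post.

```r
# Hypothetical data (an assumption for illustration only)
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 1 + 0.5 * d$x + 0.1 * d$x^2 + rnorm(20)

# subset= argument: poly(x, 2) is evaluated on all 20 rows,
# and only afterwards are the rows with x <= 10 kept
fit_subset_arg <- lm(y ~ poly(x, 2), data = d, subset = x <= 10)

# Subsetting in advance: poly(x, 2) only ever sees the 10 retained rows
fit_pre_subset <- lm(y ~ poly(x, 2), data = d[d$x <= 10, ])

# Both fits use the same 10 observations ...
nobs(fit_subset_arg)
nobs(fit_pre_subset)

# ... but the reported coefficients are not the same
coef(fit_subset_arg)
coef(fit_pre_subset)
```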

If this is a design feature that you don’t think should be fixed, can you please include it in the documentation and explain why it makes sense to compute the orthogonal polynomials on the entire dataset? This feels like a serious leak of information when evaluating train and test datasets in a statistical learning framework.

Ray

Raymond R. Balise, PhD
Assistant Professor
Department of Public Health Sciences, Biostatistics

University of Miami, Miller School of Medicine
1120 N.W. 14th Street
Don Soffer Clinical Research Center - Room 1061
Miami, Florida 33136
