[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?
Martin Maechler
m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Mon Jan 3 16:54:26 CET 2022
>>>>> Ben Bolker
>>>>> on Mon, 27 Dec 2021 09:43:42 -0500 writes:
> I agree that it seems non-intuitive (I can't think of a
> design reason for it to look this way), but I'd like to
> stress that it's *not* an information leak; the
> predictions of the model are independent of the
> parameterization, which is all this issue affects. In a
> worst case there might be some unfortunate effects on
> numerical stability if the data-dependent bases are
> computed on a very different set of data than the model
> fitting actually uses.
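
As an illustrative sketch of the point above (simulated data; the data frame `d`, its columns, and the cutoff `x < 5` are made up for this example and are not the Stack Overflow example), one can check that the two ways of subsetting give different coefficients but the same fitted values and the same predictions on held-out rows:

```r
# Simulated data; names and the subset rule are assumptions for illustration.
set.seed(1)
d <- data.frame(x = runif(100, 0, 10))
d$y <- 1 + 2 * d$x - 0.3 * d$x^2 + rnorm(100)

# (a) subset= argument: poly() is evaluated on all rows, then rows are dropped
fit_arg <- lm(y ~ poly(x, 2), data = d, subset = x < 5)

# (b) subset in advance: poly() only ever sees the retained rows
fit_pre <- lm(y ~ poly(x, 2), data = d[d$x < 5, ])

# The coefficients differ, because the two orthogonal bases differ ...
coefs_equal <- isTRUE(all.equal(unname(coef(fit_arg)), unname(coef(fit_pre))))

# ... but both bases span the same quadratic space, so the fits agree
fitted_equal <- isTRUE(all.equal(unname(fitted(fit_arg)),
                                 unname(fitted(fit_pre))))

# Predictions for rows that were left out of the fit agree as well
held_out <- d[d$x >= 5, ]
pred_equal <- isTRUE(all.equal(unname(predict(fit_arg, held_out)),
                               unname(predict(fit_pre, held_out))))

print(c(coefs_equal = coefs_equal,
        fitted_equal = fitted_equal,
        pred_equal = pred_equal))
```

Only the parameterization changes between (a) and (b); the fitted quadratic is the same function of x in both cases.
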
> I've attached a suggested documentation patch (I hope
> it makes it through to the list, if not I can add it to
> the body of a message.)
It did make it through; thank you, Ben!
(After adding two forgotten '}') I've committed the help file
additions to the R sources (R-devel) in svn r81434 .
Thanks again and
"Happy New Year"
to all readers,
Martin
> On 12/26/21 8:35 PM, Balise, Raymond R wrote:
>> Hello R folks, Today I noticed that using the subset
>> argument in lm() with a polynomial gives a different
>> result than using the polynomial when the data has
>> already been subsetted. This was not at all intuitive for
>> me. You can see an example here:
>> https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i
>>
>> If this is a design feature that you don’t think should
>> be fixed, can you please include it in the documentation
>> and explain why it makes sense to figure out the
>> orthogonal polynomials on the entire dataset? This feels
>> like a serious leak of information when evaluating train
>> and test datasets in a statistical learning framework.
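
A small sketch of the behaviour described here, under assumed simulated data (the names `d`, `keep`, and the cutoff `x < 5` are hypothetical): with `subset=`, the basis columns that lm() uses are the full-data poly() columns restricted to the kept rows. It also shows one possible way to sidestep the difference, namely raw polynomials via `raw = TRUE`, which do not depend on the data they are computed on:

```r
# Simulated data; all names below are assumptions for illustration.
set.seed(42)
d <- data.frame(x = seq(0, 10, length.out = 50))
d$y <- sin(d$x) + rnorm(50, sd = 0.1)
keep <- d$x < 5

fit_arg <- lm(y ~ poly(x, 2), data = d, subset = x < 5)

# Basis columns used in the fit (intercept column dropped) ...
basis_used <- unname(model.matrix(fit_arg)[, -1])
# ... versus the orthogonal basis computed on ALL rows, then restricted
basis_full <- unname(unclass(poly(d$x, 2))[keep, ])
basis_same <- isTRUE(all.equal(basis_used, basis_full))

# Raw polynomials (x, x^2) are not data-dependent, so the coefficients
# match whether rows are dropped via subset= or in advance.
raw_arg <- lm(y ~ poly(x, 2, raw = TRUE), data = d, subset = x < 5)
raw_pre <- lm(y ~ poly(x, 2, raw = TRUE), data = d[keep, ])
raw_coefs_equal <- isTRUE(all.equal(coef(raw_arg), coef(raw_pre)))

print(c(basis_same = basis_same, raw_coefs_equal = raw_coefs_equal))
```

Raw polynomials can be less stable numerically than orthogonal ones, so this is a trade-off rather than a general recommendation.
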
>>
>> Ray
>>
>> Raymond R. Balise, PhD Assistant Professor Department of
>> Public Health Sciences, Biostatistics
>>
>> University of Miami, Miller School of Medicine 1120
>> N.W. 14th Street Don Soffer Clinical Research Center -
>> Room 1061 Miami, Florida 33136
>>
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
> --
> Dr. Benjamin Bolker Professor, Mathematics & Statistics
> and Biology, McMaster University Director, School of
> Computational Science and Engineering Graduate chair,
> Mathematics & Statistics
> [DELETED ATTACHMENT external: BenB_lm-subset.patch, plain text]