[R] Question re predict.glm & predict.lm in STATS

Thu Feb 17 01:09:26 CET 2022

Ok, I looked at what you sent me privately and saw your error. I'll
reproduce and fix it just using a trivial example with lm(), for which
the predict() semantics are identical. Before I do, I note that your
claim:

"The predict.glm documentation says a warning will be given if the
length of newdata is not the same as the training set used to create
the model." is **completely wrong**. What the Help page for
predict.glm (and predict.lm) actually says is:

"Variables are first looked for in newdata and then searched for in
the usual way (which will include the environment of the formula used
in the fit). A warning will be given if the variables found are not of
the same length as those in newdata if it was supplied."

This is *NOT AT ALL* what you claimed. The key point that you are
missing is the phrase 'searched for in the usual way.'  The details
are a bit technical but in many ways fundamental. They can be found in
any good tutorial or perhaps by searching on "scoping in R" or
"function environments in R". It's about how R finds the objects that
variable names point to. Section 10.7, "Scope", of the "An
Introduction to R" manual that ships with R (and is therefore
available to you) gives a brief overview.
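
As a rough illustration of "searched for in the usual way", here is a
minimal sketch of the lookup order for formula variables: the data
argument first, then the environment of the formula. The objects d, x
and z are made up for this illustration and are not from your code.

```r
# Hypothetical objects, for illustration only
x <- 1:3                                   # an 'x' outside the data frame
d <- data.frame(x = c(10, 20, 30), y = c(1, 2, 3))

# 'x' is a column of d, so the data argument wins over the outer 'x'
mf1 <- model.frame(y ~ x, data = d)
mf1$x

# 'z' is not a column of d, so it is found "in the usual way",
# i.e. in the environment where the formula was created
z <- c(7, 8, 9)
mf2 <- model.frame(y ~ z, data = d)
mf2$z
```

The same lookup is what lm(), glm() and their predict() methods rely
on when they build a model frame.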

Anyway, here's the example that explains your error:

> train <- data.frame( y = runif(10), x = runif(10)) ## 10 rows
> test <- data.frame(x = runif(5))  ## 5 rows

## The following line is the source of your error
## You have specified your model incorrectly

> mdl <- lm(train$y ~train$x, data = train)

## The model is still fitted properly because the variables in it,
"train$y" and "train$x", although they are not columns of train (so
the data argument does nothing here), are found "in the usual way" in
the global environment, the "enclosing environment" of the formula.
(This is the technical bit.) This leads to the sort of problem you saw
with the predict call:

> predict(mdl, newdata = test)
        1         2         3         4         5         6         7
0.6089476 0.6385268 0.9075589 0.3403276 0.2709199 0.5876634 0.8668307
        8         9        10
0.4689961 0.2571259 0.3281054
Warning message:
'newdata' had 5 rows but variables found have 10 rows

## Explanation: predict() has to evaluate the term 'train$x', but
test only contains a variable 'x'; there is no 'train$x' in it. Since
it doesn't find it in newdata, it goes looking for 'train$x' "in the
usual way" in the global environment and finds it there, with all 10
values as before. The prediction is therefore computed from the
original training data (i.e., it just reproduces the fitted values),
and the warning message is emitted as per the documentation.
Predicting without the newdata argument gives the same 10 values.
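
If you want to check this yourself, here is a minimal sketch along
the lines of the example above. The seed and the object names bad and
bad_pred are my own choices for illustration; the numbers will differ
from those shown above.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
train <- data.frame(y = runif(10), x = runif(10))  # 10 rows
test  <- data.frame(x = runif(5))                  # 5 rows

# The mis-specified model from above
bad <- lm(train$y ~ train$x, data = train)

# The model frame stores the variables under the names "train$y" and
# "train$x", so a column called "x" in newdata can never match them
names(model.frame(bad))

# Suppress the row-count warning just to inspect the result
bad_pred <- suppressWarnings(predict(bad, newdata = test))
length(bad_pred)   # one value per training row, not per test row

# The "predictions" are just the fitted values of the original fit
all.equal(unname(bad_pred), unname(fitted(bad)))
```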

The correct syntax for fitting the original model is:
> mdl <- lm(y ~ x, data = train)

## and then the predict() call works fine using the newdata argument
(as 'x' is found there)
> predict(mdl, newdata = test)
        1         2         3         4         5
0.5134899 0.4619013 0.2458162 0.0446871 0.3146897

All of this is documented, with examples, in ?glm or even ?lm, and in
any tutorial on their use. Please spend the time to study these
carefully. Trying to mimic examples you find, which seems to be what
you are doing, is rarely sufficient.
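
For the glm() case you originally asked about, a sketch of the same
pattern might look like the following. I don't have your auto_mpg
data, so this uses the built-in mtcars data as a stand-in; the column
names (wt, qsec, etc.), the seed and the 70% split are assumptions
for illustration, not your actual code.

```r
set.seed(42)  # arbitrary seed for a repeatable split
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]    # about 70% of the rows
test  <- mtcars[-idx, ]   # the remaining rows

# Bare column names in the formula; the data frame goes in 'data'
fit <- glm(mpg ~ disp + hp + wt + qsec + cyl,
           data = train, family = gaussian())

# Every variable in the formula is a column of 'test', so the
# predictions have one value per row of 'test'
pred <- predict(fit, newdata = test, type = "response")
length(pred) == nrow(test)
```

The point is only that the names in the formula must match column
names in both the training and the test data frames.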

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Wed, Feb 16, 2022 at 7:24 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>
> You should (almost) always reply to the list to maximize your opportunity for useful help. Also, I don't do private consulting.
>
> See ?dput and ?str for ways to put code and data as plain text into a post via copying and pasting from the R Console. You can also just type the code directly, of course. The R-help server will strip most attachments (I think .png is OK for graphs, though; you can ask on list if necessary). I don't recall whether Word makes it through, but you really shouldn't need such attachments anyway.
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Wed, Feb 16, 2022 at 3:39 AM STEPHEN KAISLER <skaisler1 using comcast.net> wrote:
>>
>> Bert:
>>
>> Please see the attached file which shows the approach I used.
>> Thanks for any assistance that you can offer.
>>
>> Steve Kaisler
>>
>> On 02/15/2022 4:05 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>>
>>
>> ??
>> Show us the error. Show us the call.
>>
>>
>> On Tue, Feb 15, 2022, 12:14 PM STEPHEN KAISLER <skaisler1 using comcast.net> wrote:
>>
>> Folks:
>>
>> I have used glm/lm to build a model on a training set of 274 records (70% sampling) derived from the auto_mpg data.
>>
>> The test data set has 118 records.
>>
>> I am trying to use predict.glm or predict.lm to predict the values of mpg from disp, hp, weight, accel, and cyl.
>>
>> However I get the following message:
>>
>>
>> So, the resulting vector has 274 rows, when I believe it should have just 118 rows - the size of the test data set.
>>
>> I would appreciate it if someone could explain whether I am making
>> the call in error.
>>
>> Steve Kaisler
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.