[Rd] predict (PR#2686)
Mark.Bravington at csiro.au
Tue Apr 1 10:48:49 MEST 2003
>> <Bravington wrote:>
>>>> `predict' complains about new factor levels, even if the "new" levels
>>>> are merely levels in the original that didn't occur in the original
>>>> fit and were sensibly dropped, and that don't occur in the prediction
>>>> data either.
>> <Ripley replied:>
>>> This is intentional. The coding for factors is based on the full set
>>> of levels, and should be comparable for different prediction sets.
>>>
>>> If you are using factors with fictitious levels the fix is obvious:
>>> improve the design.
>> <Bravington again:>
>> There is still an inconsistency bug between `lm' and `predict.lm', though.
>> `lm' intentionally overlooks inactive levels of a factor,
> <Ripley again:>
> Only if an argument is set, and originally lm did not do so.
<Bravington again:>
But `lm' always does this now, doesn't it? -- even if it didn't originally.
I think you can't not drop unused levels, even if you wanted to.
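A minimal sketch of what I mean, on made-up data (the data frame `d`, its values and the factor name `f` are illustrative assumptions, not taken from the original report). It only inspects which levels the fit records, so it should not depend on how `predict.lm' treats `newdata':

```r
# Hypothetical toy data: factor `f` declares level "c" but never uses it.
d <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2),
  f = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
)

fit <- lm(y ~ f, data = d)

levels(d$f)       # levels declared in the data, including the inactive one
fit$xlevels$f     # levels the fit actually recorded for `f`
names(coef(fit))  # one coefficient per active non-baseline level
```

Comparing the first two results shows whether the inactive level survived into the fitted object.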
>> but `predict.lm' doesn't, even when it legitimately could.
>> In particular, it is a bit odd to
>> have no problem predicting without a `newdata' argument even when the
>> original data had inactive factor levels, but then to get an error if
>> `newdata=<<original data>>' is supplied explicitly! (See example.)
>
> <Ripley:>
> Read again. predict.lm is consistent across its inputs: unlike lm it can
> take variable `newdata'. As I said the intention is to be consistent
> across *prediction sets*. Omitting newdata is not giving a prediction
> set.
<Bravington again:>
Mmm-- that's getting a bit metaphysical for me-- when is a prediction not a
prediction, and what is ``predict'' actually doing if it is not predicting?!
Anyhow, according to the help page for `predict.lm':

     If the fit is rank-deficient, some of the columns of the design
     matrix will have been dropped.  Prediction from such a fit only
     makes sense if `newdata' is contained in the same subspace as the
     original data. That cannot be checked accurately, so a warning is
     issued.

The subspace condition is obviously satisfied if the prediction data is the
same as the original data-- so prediction does "make sense" in that context
according to the documentation (as well as common sense. Normally I am no
fan of slavish adherence to documentation, but in my own interests I'll make
an exception...). And yet there's an error message, not even a warning.
Prediction from the original data was just an example, of course; my general
proposal is that inactive factor levels in the prediction set should be
dropped. I don't see how this could ever cause inconsistent behaviour across
prediction sets-- have I missed something?
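
To make the proposal concrete, here is a rough user-level sketch of the behaviour I have in mind. The helper `drop_inactive_levels' is hypothetical (it is not part of R), the toy data are made up, and it assumes each factor appears in the formula under its plain column name. It is meant as an illustration of the idea, not as a patch to `predict.lm':

```r
# Hypothetical helper (not part of R): discard factor levels that do not
# occur in `newdata`, then re-code each factor on the levels the fit recorded.
drop_inactive_levels <- function(fit, newdata) {
  for (nm in names(fit$xlevels)) {
    if (is.null(newdata[[nm]])) next  # term is not a plain column; skip it
    x <- factor(newdata[[nm]])        # re-factoring drops unused levels
    unseen <- setdiff(levels(x), fit$xlevels[[nm]])
    if (length(unseen) > 0)           # levels that really are new stay an error
      stop("factor ", nm, " has genuinely new levels: ",
           paste(unseen, collapse = ", "))
    newdata[[nm]] <- factor(as.character(x), levels = fit$xlevels[[nm]])
  }
  newdata
}

# Made-up data: level "c" is declared but never occurs.
d <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2),
  f = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
)
fit <- lm(y ~ f, data = d)

p_default  <- predict(fit)                                       # no newdata
p_explicit <- predict(fit, newdata = drop_inactive_levels(fit, d))
all.equal(unname(p_default), unname(p_explicit))
```

Because every prediction set ends up coded on the same recorded levels, the coding stays comparable across prediction sets, and a level that actually occurs in `newdata' but not in the fit is still rejected.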
cheers
Mark
*******************************
Mark Bravington
CSIRO (CMIS)
PO Box 1538
Castray Esplanade
Hobart
TAS 7001
phone (61) 3 6232 5118
fax (61) 3 6232 5012
Mark.Bravington at csiro.au