[Rd] predict (PR#2686)

Mark.Bravington at csiro.au Mark.Bravington at csiro.au
Thu Mar 27 02:59:35 MET 2003


<Bravington wrote:>
#> `predict' complains about new factor levels, even if the "new" levels are
#> merely levels in the original that didn't occur in the original fit and were
#> sensibly dropped, and that don't occur in the prediction data either.

<Ripley replied:>
#This is intentional.  The coding for factors is based on the full set of
#levels, and should be comparable for different prediction sets.
#
#If you are using factors with fictitious levels the fix is obvious: 
#improve the design.

There is still an inconsistency bug between `lm' and `predict.lm', though.
`lm' intentionally overlooks inactive levels of a factor, but `predict.lm'
doesn't, even when it legitimately could. In particular, it is a bit odd to
have no problem predicting without a `newdata' argument even when the
original data had inactive factor levels, but then to get an error if
`newdata=<<original data>>' is supplied explicitly! (See example.)

Given that the (IMHO sensible) decision has been taken for `lm' to drop
inactive levels, deliberately so that users don't have to change their
designs when they don't really need to, surely it's inconsistent for
`predict' not to do the same when it's statistically OK?
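
To make the `lm' side of this concrete, here is a minimal sketch using the
same layout as the example data in this report, but with fixed `y' values so
it is reproducible. The names `d', `fit' and `res' are illustrative only. The
`predict' call is wrapped in `try' because whether it fails depends on the R
version in use; the error was reported against 1.6.2.

```r
# Same layout as the example in this report, with fixed y values.
d <- data.frame(y = c(0.2, 0.6, 0.3),
                disc = factor(c("cat", "dog", "cat"),
                              levels = c("cat", "dog", "earwig")))
fit <- lm(y ~ disc, data = d)

levels(d$disc)     # all three declared levels, including "earwig"
fit$xlevels$disc   # levels recorded by the fit: only "cat" and "dog"
names(coef(fit))   # "(Intercept)" "discdog": no coefficient for "earwig"

# Whether this call errors depends on the R version in use.
res <- try(predict(fit, newdata = d), silent = TRUE)
```

The fitted object never mentions `earwig', which is why rejecting it at
prediction time looks inconsistent when no row actually uses that level.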

[When it's not OK-- i.e. when there are levels in the prediction data that
didn't appear in the fitting data-- the cleanest solution would perhaps be
for `predict' to return NA values and a warning, rather than an error. But
that's a separate issue.]
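
As a rough illustration of the NA-plus-warning idea, the user-level wrapper
below is one possible sketch. It is not part of R and not a proposed patch:
the function name `predict_na_new_levels' is hypothetical, and it assumes the
factor predictors appear in `newdata' as plain columns named as in
`fit$xlevels', and that `droplevels' is available (which requires a much
later R than 1.6.2).

```r
# Hypothetical helper, not part of R: return NA (with a warning) for rows
# whose factor levels were not seen when the model was fitted.
predict_na_new_levels <- function(fit, newdata) {
    ok <- rep(TRUE, nrow(newdata))
    for (nm in names(fit$xlevels)) {
        # Compare values as character, so declared-but-unused levels
        # in newdata do not count as "new".
        ok <- ok & (as.character(newdata[[nm]]) %in% fit$xlevels[[nm]])
    }
    out <- rep(NA_real_, nrow(newdata))
    if (any(ok)) {
        # Keep only predictable rows and drop levels that no longer occur.
        keep <- droplevels(newdata[ok, , drop = FALSE])
        out[ok] <- predict(fit, newdata = keep)
    }
    if (!all(ok))
        warning(sum(!ok), " row(s) have unseen factor levels; returning NA")
    out
}

# Small usage example with made-up data.
d <- data.frame(y = c(1, 2, 3),
                disc = factor(c("cat", "dog", "cat"),
                              levels = c("cat", "dog", "earwig")))
fit <- lm(y ~ disc, data = d)
nd <- data.frame(disc = factor(c("cat", "earwig"),
                               levels = c("cat", "dog", "earwig")))
p <- suppressWarnings(predict_na_new_levels(fit, nd))
```

Here the first row of `nd' gets an ordinary prediction and the second, which
really does use an unseen level, comes back as NA.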

cheers
Mark

mark.bravington at csiro.au

Slightly expanded example, and suggested fix to `model.frame.default':

test> scrunge.data.2 <- data.frame( y=runif( 3),
+   disc=factor( c( 'cat', 'dog', 'cat'), levels=c( 'cat', 'dog', 'earwig')))
test> lm.predbug.2 <- lm( y~disc, data=scrunge.data.2)

test> predict( lm.predbug.2) # uses original data
         1         2         3 
 0.2185388 0.5843139 0.2185388 

test> predict(lm.predbug.2, newdata=scrunge.data.2) # newdata = original data
Error in model.frame.default(object, data, xlev = xlev) : 
        factor disc has new level(s) earwig


A cure for this seems to be to add the commented line below, towards the end
of `model.frame.default':

    <<...>>
    if (length(xlev) > 0) {
        for (nm in names(xlev)) if (!is.null(xl <- xlev[[nm]])) {
            xi <- data[[nm]]
            if (is.null(nxl <- levels(xi))) 
                warning(paste("variable", nm, "is not a factor"))
            else {
                xi <- xi[, drop = TRUE]
                nxl <- levels( xi) # MVB: remove droppees
                if (any(m <- is.na(match(nxl, xl)))) 
                  stop(paste("factor", nm, "has new level(s)", nxl[m]))
            }
        }
    }
    else if (drop.unused.levels) {
    <<...>>
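
The effect of recomputing the levels after the drop can be seen in isolation
with a small sketch. The variables `xl' and `xi' mirror the names used in the
fragment above, but the values are made up for illustration.

```r
# xl: levels recorded at fit time; xi: the factor as found in newdata.
xl <- c("cat", "dog")
xi <- factor(c("cat", "dog", "cat"), levels = c("cat", "dog", "earwig"))

# Checking the declared levels flags "earwig", though no value uses it.
declared_new <- setdiff(levels(xi), xl)               # "earwig"

# Checking the levels after dropping unused ones flags nothing.
used_new <- setdiff(levels(xi[, drop = TRUE]), xl)    # character(0)
```

A level that genuinely occurs in the data but not in `xl' would still survive
the drop and so would still be reported.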

--please do not edit the information below--

Version:
 platform = i386-pc-mingw32
 arch = i386
 os = mingw32
 system = i386, mingw32
 status = 
 major = 1
 minor = 6.2
 year = 2003
 month = 01
 day = 10
 language = R

Windows 2000 Professional (build 2195) Service Pack 3.0

Search Path:
 .GlobalEnv, ROOT, package:handy, package:debug, mvb.session.info,
package:mvbutils, package:tcltk, Autoloads, package:base


