[R] can predict ignore rows with insufficient info

Wed Sep 17 08:24:22 CEST 2003

On Tue, 16 Sep 2003, Peter Whiting wrote:

> On Tue, Sep 16, 2003 at 04:31:29PM -0400, Thomas W Blackwell wrote:
> > Corrected and re-named version of function:
> > 
> > unsupported <- function(i,y,d)  {
> >    # default return value, used when d[[i]] is not a factor
> >    result <- rep(FALSE, dim(d)[1])
> >    if (is.factor(d[[i]]))
> >      result <- !(d[[i]] %in% unique(d[[i]][ !is.na(d[[y]]) ]))
> >    result  }
> > 
> > tmp.1 <- lapply(seq(along=const), unsupported, "days", const)
> > tmp.2 <- matrix(unlist(tmp.1[ names(const) != "days" ]), nrow=dim(const)[1])
> > tmp.3 <- as.logical(as.vector(tmp.2 %*% rep(1, dim(tmp.2)[2])))
> > 
> > x <- predict(g, const[ is.na(const$days) & !tmp.3, ])
> 
> Here is an approach I came up with that appears to work:

(One I sent privately to Peter.)

> predict2 <- function(g,data,...)  
> {
>   for(nm in names(g$xlevels)) { 
>     cat(paste(nm,"\n"))
>     data[[nm]]<- factor(data[[nm]],levels=g$xlevels[[nm]])
>   }
>   predict(g,data,...)
> }
> 
> It works by re-creating each factor predictor with factor()'s
> "levels=" argument set to the levels stored in g$xlevels. Any element
> whose level is not in g$xlevels becomes NA, which predict handles
> correctly.
> 
> I'm not sure why predict doesn't do something like this by
> default, but I am just a newbie.

Because it is thought more common for additional levels to be a mistake
that the user would want to be alerted to.  Note also that here you are
talking about the "lm" method of predict(); by no means all methods
handle NAs in the model matrix (and for those that do, the support is
rather recent).
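
To make the two behaviours concrete, here is a minimal sketch on toy data.
The data frames and the names `train`, `new` and `fit` are made up for
illustration, and the behaviour shown is what I would expect from the "lm"
method in a current version of R; other predict() methods may differ.

```r
# Toy data: 'grp' is a factor predictor with levels a, b, c.
train <- data.frame(y   = c(1, 2, 3, 4, 5, 6),
                    grp = factor(c("a", "a", "b", "b", "c", "c")))
fit <- lm(y ~ grp, data = train)

# New data containing a level ("d") that the fit has never seen.
new <- data.frame(grp = factor(c("a", "c", "d")))

# Default behaviour: predict() stops with an error about the new level.
res <- try(predict(fit, new), silent = TRUE)
failed <- inherits(res, "try-error")

# Peter's approach: rebuild the factor using only the levels stored in
# the fit, so the unseen level becomes NA.
new$grp <- factor(new$grp, levels = fit$xlevels[["grp"]])

# The "lm" method passes NAs through (na.action = na.pass is its default),
# so the row with the unseen level gets an NA prediction.
pred <- predict(fit, new)
```

The error in the first call is the alert mentioned above. The re-levelling
step trades that alert for a silent NA, so it is worth checking
is.na(new$grp) afterwards to see which rows were affected.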

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595