[R] can predict ignore rows with insufficient info

Tue Sep 16 23:09:00 CEST 2003

On Tue, Sep 16, 2003 at 04:17:59PM -0400, Thomas W Blackwell wrote:
> Peter  -
> 
> Your subsequent email seems just right.  You have to determine
> ahead of time which rows can be estimated.

It seems that predict removes rows with insufficient information
(i.e., if I replace "ALBANY" with NA and rebuild the factor,
everything works). I wonder why it doesn't behave the same way when
it encounters a new level: just eliminate the row and go on...

Somewhat related: I had been assuming (incorrectly)
that length(x) would equal length(const$days) after
x<-predict(g,const) - this isn't the case if any of the rows of
const don't contain enough info for the model.  Those rows are
eliminated - I'd have expected them to just be NAs in the result.
I'll go back and look through the documentation to see if there is
a straightforward way to convert:

> x
  1   3   4
1.5 1.5 1.5

to
> x
  1  2  3   4  5
1.5 NA 1.5 1.5 NA
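
One possible way to do this conversion, as a minimal sketch: build a vector of NAs named after every row, then fill in the predictions by name. The vector `x` below is typed in by hand to match the example above, and `all_rows` and `full` are names invented for this illustration; with real data `all_rows` would presumably come from the row names of the data frame passed to predict().

```r
# Predictions as printed above: only rows 1, 3 and 4 survived.
x <- c("1" = 1.5, "3" = 1.5, "4" = 1.5)

# Assumed: the complete set of row names we want in the result.
all_rows <- as.character(1:5)

# Start with NA for every row, then fill in by name where a prediction exists.
full <- setNames(rep(NA_real_, length(all_rows)), all_rows)
full[names(x)] <- x

full
```

Indexing by name rather than by position is what keeps each prediction attached to the row it came from.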

slowly learning,
pete

> Here's a strategy,
> and possibly some code to implement it.
> 
> Let  supported(i,y,d)  be a user-written function which returns
> a logical vector indicating rows which should be omitted from
> the prediction on account of a non-covered covariate in column i
> of data frame d with outcome variable y.  Apply this function to
> all columns in your data frame using  lapply().  Then do the "or"
> of all the logical vectors by calculating the row sums of the
> numeric (0 or 1) equivalents.  Last, convert back to logical,
> and subscript your data frame with this in the call to  predict().
> 
> Here's some rough code:
> 
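> supported <- function(i,y,d)  {
>    # TRUE marks rows to omit; the default (FALSE) applies when
>    # d[[i]] is not a factor.
>    result <- rep(FALSE, nrow(d))
>    if (is.factor(d[[i]]))
>      # omit rows whose level never occurs with a non-missing outcome
>      result <- !(d[[i]] %in% unique(d[[i]][ !is.na(d[[y]]) ]))
>    result  }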
> 
> tmp.1 <- lapply(seq(along=const), supported, "days", const)
> tmp.2 <- matrix(unlist(tmp.1[ names(const) != "days" ]), nrow=dim(const)[1])
> tmp.3 <- as.logical(as.vector(tmp.2 %*% rep(1, dim(tmp.2)[2])))
> 
> x <- predict(g, const[ is.na(const$days) & !tmp.3, ])
> 
> This code uses a few arcane maneuvers.  Look at help pages for
> the relevant functions to dope out what it is doing.  Particularly
> for  lapply(), seq(), rep(), unlist(), unique(), "%*%", "%in%".
> (The last two must be quoted in order to see the help).
> 
> However, the code might work for you right out of the box !
> 
> -  tom blackwell  -  u michigan medical school  -  ann arbor  -
> 
> On Tue, 16 Sep 2003, Peter Whiting wrote:
> 
> > I need predict to ignore rows that contain levels not in the
> > model.
> >
> > Consider a data frame, "const", that has columns for the number of
> > days required to construct a site and the city and state the site
> > was constructed in.
> >
> > g<-lm(days~city,data=const)
> >
> > Some of the sites in const have not yet been completed, and therefore
> > they have days==NA. I want to predict how many days these sites
> > will take to complete (I've simplified the above discussion to
> > remove many of the other factors involved.)
> >
> > nconst<-subset(const,is.na(const$days))
> > x<-predict(g,nconst)
> > Error in model.frame.default(object, data, xlev = xlev) :
> >         factor city has new level(s) ALBANY
> >
> > This is because we haven't yet completed a site in Albany.
> > If I just had one to worry about I could easily fix it (choose
> > a nearby market with similar characteristics) but I am dealing
> > with several hundred cities. Instead, for the cities not
> > modeled by g I'd simply like to use the state, even though I
> > don't expect it to be as good:
> >
> > g<-lm(days~state,data=const)
> > x<-predict(g,nconst)
> >
> > I'm not sure how to identify the cities in nconst that are not
> > modeled by g (my actual model has many more predictors in the
> > formula). Is there a way to instruct predict to only predict the
> > rows for which it has enough information and not complain about
> > the others?
> >
> > g<-lm(days~city,data=const)
> > x<-predict(g,nconst) ## the rows of x with city=ALBANY will be NA
> > g<-lm(days~state,data=const)
> > y<-predict(g,nconst)
> > x[is.na(x)]<-y[is.na(x)]
> >
> > thanks,
> > pete
> >
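
The fallback asked about in the original question (city model where possible, state model otherwise) might be sketched as follows. This is only an illustration under stated assumptions, not code from the thread: the toy `const` data is invented, `covered()` is a hypothetical helper, and it relies on the `xlevels` component that lm() stores for factor predictors. It only handles predictors that appear in the formula as plain column names.

```r
# Toy data standing in for the thread's `const` (all values invented).
const <- data.frame(
  days  = c(10, 12, 20, 22, 30, 32, NA, NA, NA),
  city  = c("BOSTON", "BOSTON", "TROY", "TROY", "UTICA", "UTICA",
            "BOSTON", "ALBANY", "TROY"),
  state = c("MA", "MA", "NY", "NY", "NY", "NY", "MA", "NY", "NY"),
  stringsAsFactors = TRUE
)
nconst <- subset(const, is.na(days))

g_city  <- lm(days ~ city,  data = const)
g_state <- lm(days ~ state, data = const)

# Hypothetical helper: TRUE for rows whose factor values were all
# seen when the model was fitted (compared against model$xlevels).
covered <- function(model, newdata) {
  ok <- rep(TRUE, nrow(newdata))
  for (v in names(model$xlevels)) {
    ok <- ok & as.character(newdata[[v]]) %in% model$xlevels[[v]]
  }
  ok
}

ok <- covered(g_city, nconst)

# Full-length result: NA where the city model has no information.
x <- rep(NA_real_, nrow(nconst))
names(x) <- rownames(nconst)
x[ok] <- predict(g_city, droplevels(nconst[ok, , drop = FALSE]))

# Fall back to the state model for the rows still missing.
y <- predict(g_state, nconst)
x[is.na(x)] <- y[is.na(x)]
x
```

Checking the levels before calling predict() avoids the "new level(s)" error entirely, and keeping `x` at full length means the final `x[is.na(x)] <- y[is.na(x)]` step lines up row for row.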