[R] can predict ignore rows with insufficient info
Peter Whiting
pete at sprint.net
Tue Sep 16 23:09:00 CEST 2003
On Tue, Sep 16, 2003 at 04:17:59PM -0400, Thomas W Blackwell wrote:
> Peter -
>
> Your subsequent email seems just right. You have to determine
> ahead of time which rows can be estimated.
It seems that predict removes rows with insufficient information
(i.e., if I replace "ALBANY" with NA and rebuild the factor, everything
works) - I wonder why it doesn't exhibit the same behavior when it
encounters a new level - just eliminate the row and go on...
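
One way to find the rows with a new level ahead of time is to compare
the new data against the levels stored in the fitted model. This is only
a minimal sketch: the toy data frame below is a hypothetical stand-in for
const, city is assumed to be a character column, and g$xlevels is assumed
to hold only the cities that actually had a non-missing days value.

```r
# Hypothetical toy data standing in for the real `const`.
const <- data.frame(
  days = c(10, 12, NA, 20, NA),
  city = c("BOSTON", "BOSTON", "BOSTON", "DALLAS", "ALBANY"),
  stringsAsFactors = FALSE
)

g <- lm(days ~ city, data = const)      # rows with days == NA are dropped
nconst <- subset(const, is.na(days))    # sites still to be predicted

# TRUE for rows whose city was seen when the model was fitted.
ok <- nconst$city %in% g$xlevels$city

# Cities the model cannot handle, and predictions for the rest.
unseen <- unique(nconst$city[!ok])
x <- predict(g, nconst[ok, , drop = FALSE])
```

Here x only covers the rows that passed the check, so it is shorter than
nconst, and its names are the row names of the rows that were kept.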
Somewhat related: I had been assuming (incorrectly)
that length(x) would equal length(const$days) after
x<-predict(g,const) - this isn't the case if any of the rows of
const don't contain enough info for the model. Those rows are
eliminated - I'd have expected them to just be NAs in the result.
I'll go back and look through the documentation to see if there is a
straightforward way to convert:
  > x
    1   3   4
  1.5 1.5 1.5
to
  > x
    1   2   3   4   5
  1.5  NA 1.5 1.5  NA
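
A possible way to do this conversion, sketched under the assumption that
the names of x are the row names of the rows that were predicted, is to
start from an all-NA vector and fill it in by name. The vector all_rows
is a hypothetical stand-in for rownames(const).

```r
# Shortened result, as in the example above.
x <- c("1" = 1.5, "3" = 1.5, "4" = 1.5)

# Assumed row names of the full data frame (rownames(const) in practice).
all_rows <- as.character(1:5)

# Start with NA everywhere, then fill in the predictions by name.
full <- setNames(rep(NA_real_, length(all_rows)), all_rows)
full[names(x)] <- x
full
```

Indexing by name rather than by position means the rows that were
dropped simply keep their NA.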
slowly learning,
pete
> Here's a strategy,
> and possibly some code to implement it.
>
> Let supported(i,y,d) be a user-written function which returns
> a logical vector indicating rows which should be omitted from
> the prediction on account of a non-covered covariate in column i
> of data frame d with outcome variable y. Apply this function to
> all columns in your data frame using lapply(). Then do the "or"
> of all the logical vectors by calculating the row sums of the
> numeric (0 or 1) equivalents. Last, convert back to logical,
> and subscript your data frame with this in the call to predict().
>
> Here's some rough code:
>
> supported <- function(i,y,d) {
>   result <- rep(FALSE, dim(d)[1])   # default return value when
>   if (is.factor(d[[i]]))            #   d[[i]] is not a factor.
>     # TRUE marks rows to omit: level never seen with a non-missing y.
>     result <- !(d[[i]] %in% unique(d[[i]][ !is.na(d[[y]]) ]))
>   result }
>
> tmp.1 <- lapply(seq(along=const), supported, "days", const)
> tmp.2 <- matrix(unlist(tmp.1[ names(const) != "days" ]), nrow=dim(const)[1])
> tmp.3 <- as.logical(as.vector(tmp.2 %*% rep(1, dim(tmp.2)[2])))
>
> x <- predict(g, const[ is.na(const$days) & !tmp.3, ])
>
> This code uses a few arcane maneuvers. Look at help pages for
> the relevant functions to dope out what it is doing. Particularly
> for lapply(), seq(), rep(), unlist(), unique(), "%*%", "%in%".
> (The last two must be quoted in order to see the help).
>
> However, the code might work for you right out of the box !
>
> - tom blackwell - u michigan medical school - ann arbor -
>
> On Tue, 16 Sep 2003, Peter Whiting wrote:
>
> > I need predict to ignore rows that contain levels not in the
> > model.
> >
> > Consider a data frame, "const", that has columns for the number of
> > days required to construct a site and the city and state the site
> > was constructed in.
> >
> > g<-lm(days~city,data=const)
> >
> > Some of the sites in const have not yet been completed, and therefore
> > they have days==NA. I want to predict how many days these sites
> > will take to complete (I've simplified the above discussion to
> > remove many of the other factors involved.)
> >
> > nconst<-subset(const,is.na(const$days))
> > x<-predict(g,nconst)
> > Error in model.frame.default(object, data, xlev = xlev) :
> > factor city has new level(s) ALBANY
> >
> > This is because we haven't yet completed a site in Albany.
> > If I just had one to worry about I could easily fix it (choose
> > a nearby market with similar characteristics) but I am dealing
> > with several hundred cities. Instead, for the cities not
> > modeled by g I'd simply like to use the state, even though I
> > don't expect it to be as good:
> >
> > g<-lm(days~state,data=const)
> > x<-predict(g,nconst)
> >
> > I'm not sure how to identify the cities in nconst that are not
> > modeled by g (my actual model has many more predictors in the
> > formula). Is there a way to instruct predict to only predict the
> > rows for which it has enough information and not complain about
> > the others?
> >
> > g<-lm(days~city,data=const)
> > x<-predict(g,nconst) ## the rows of x with city=ALBANY will be NA
> > g<-lm(days~state,data=const)
> > y<-predict(g,nconst)
> > x[is.na(x)]<-y[is.na(x)]
> >
> > thanks,
> > pete
> >
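
The city-then-state fallback from the original question could be sketched
roughly as follows. This is not tested against the real data: the toy
data frame and the helper predict_known() are hypothetical, and city and
state are assumed to be character columns.

```r
# Hypothetical toy data standing in for the real `const`.
const <- data.frame(
  days  = c(10, 12, 20, NA, NA),
  city  = c("BOSTON", "BOSTON", "BUFFALO", "BOSTON", "ALBANY"),
  state = c("MA", "MA", "NY", "MA", "NY"),
  stringsAsFactors = FALSE
)
nconst <- subset(const, is.na(days))

# Hypothetical helper: predict only the rows whose value of `var` was
# seen when `fit` was estimated, and return NA for all other rows.
predict_known <- function(fit, newdata, var) {
  ok  <- newdata[[var]] %in% fit$xlevels[[var]]
  out <- setNames(rep(NA_real_, nrow(newdata)), rownames(newdata))
  if (any(ok)) out[ok] <- predict(fit, newdata[ok, , drop = FALSE])
  out
}

x <- predict_known(lm(days ~ city,  data = const), nconst, "city")
y <- predict_known(lm(days ~ state, data = const), nconst, "state")

# Keep the city-level prediction where there is one, else use the state.
x[is.na(x)] <- y[is.na(x)]
```

With more predictors in the formula, the same check would have to be
repeated for every factor in the model before calling predict.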