[Rd] Model forecasts with new factor levels - predict.warn

Wed Aug 24 22:56:54 CEST 2005

predict.warn() -- a function to display factor levels in new data
	for linear model prediction that do not exist in the
	estimating data.

Date: 2005-8-24
From: John C. Nash (with thanks to Uwe Ligges for suggestions)
	nashjc at uottawa.ca

Motivation: In computing predictions from a linear model using factors,
it is possible to introduce new factor levels. This was encountered on
a practical level in forecasting retail sales where certain events
had not occurred during the estimation period but did occur during
the validation (i.e., trial forecast) period. predict() then gives
an error message that it cannot continue because the new data has
new levels of a factor. Indeed, it will give the new level of the
first such factor found. predict.warn is designed to give more
detailed information. for example, if there are other factor levels
that are also "new".

The underlying interest is to find a suitable method for introducing
estimates of the effects of such "new" factor levels into the model,
and I welcome exchanges on how that may be done. Indeed, one of my
main interests in submitting this tool is to learn how that may be
done appropriately and with suitable information to the user to avoid
egregious errors. For example, a new sales promotion may have many
similarities with existing promotions, so a rough estimate for the
effect may be possible, allowing for rudimentary validation testing
and for "working" values of forecasts.

Unfortunately, the lack of some facility to provide "guesstimates"
in this way leads users to ignore R or other tools and to work with
Excel. I have experience of this situation arising in practise, and
we may anticipate that these users will be difficult to convert to
using better tools. I believe we must find ways to allow users to
carry out such operations in as safe and informed manner possible.
Ideally, the follow-on to predict.warn should be a function called
something like predict.expand or predict.force that asks the user
to supply values (or possibly sets or distributions) of the "new"
factor level estimates.

Sample function:

predict.warn<-function (object, newdata,
      ...)
# this is function predict.warn.R
# Tries to warn if we do not have appropriate factor levels in 
estimation data
# that appear in the newdata.
# It does NOT yet test for non-factor variables, but that would also be 
nice.
# However, we usually assume that numerical variables can be 
interpolated and
# extrapolated.
{   message("predict.warn -- check for factor levels not present")
     message("in linear models with factors")
     message(" ")
     if (missing(newdata) || is.null(newdata)) {
# Without newdata, issue warning and return
	message("You do not have any new data.")
     }
     else {
         xlev <- object$xlevels
         nn<-names(xlev)
	dnn<-length(nn)
	message("there are ",dnn," factor names ")
	dostop<-FALSE # Ensure the return value is defined
     	for (i in seq(along=nn)) {
		message("Factor ",nn[i]," in estimation data")
		ofac <- object[["model"]][[nn[i]]]
		print(ofac)  # diagnostic
		oldlev<-levels(factor(ofac))
		nfac <- newdata[[nn[i]]]  # ?? note need for double brackets. Why?
		message("Factor ",nn[i], " in new data")
		print(nfac) # diagnostic
		message("about to try levels on newfac")
		newlev<-levels(factor(nfac))
     		for (j in seq(along=newlev)) {
			if (!(newlev[j] %in% oldlev)) {
				message("New level ", newlev[j]," not found in estimation levels of 
factor ", nn[i])
				dostop = TRUE
			}
		}
	} # end loop over names
     } # end part where we DO have new data
     return(dostop) # TRUE if we should stop
}

# end of predict.warn code

Test example:

# A simple test of predict.warn.R for 2 factors
# J C Nash 20050822
rm(list=ls())
source("predict.warn.R")
x<-c(1, 2, 3, 4, 5, 6, 7, 8)
fac1<-c("A", "A", "", "", "", "B", "")
fac1<-c(fac1, "A")
fac1
y<-c( 2, 4, 5, 8, 9.5, 12, 13, 17)
fac2<-c(1,1,1,2,2,1,2,3)
fac2<-as.factor(fac2) # Note: or else defaults to num
joe<-data.frame(x, y, fac1, fac2)
bill<-subset(joe, x<=5)
mary<-subset(joe, x>5)
message("bill")
str(bill)
print(bill)
message("mary")
str(mary)
print(mary)
est<-lm(y ~ fac1 + fac2 + x, bill)
print(est)
test<-predict.warn(est, newdata = mary)
message('predict.warn returns ',test)
message("\n")
message("Now try predict()")
predict(est, newdata = mary)

# ======================= end of msg ===========================

-- 
John C. Nash, School of Management, University of Ottawa,
Vanier Hall 451, 136 Jean-Jacques Lussier Private,
P.O. Box 450, Stn A, Ottawa, Ontario, K1N 6N5 Canada
email: nashjc on mail server uottawa.ca, voice mail: 613 562 5800 X 4796
fax 613 562 5164,  Web URL = http://macnash.admin.uottawa.ca
"Practical Forecasting for Managers" web site is at
http://www.arnoldpublishers.com/support/nash/