[R] Out-of-sample predictions with boosting model

Fri Jul 30 11:48:44 CEST 2010

Hi Travis,

I'll try to give you some hints that might bring you closer to a solution.
The key to your problem (as far as I understand it) might simply be to 
use the predict function of mboost appropriately. You can pass it a new 
data set (e.g. a part of your original data set not used for estimation) 
via the newdata argument:

 > predict(model, newdata = newdata)

which gives you a vector of predictions as you wanted. Thus, you could, 
for example, specify newdata such that you get your one-step ahead 
predictions.
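
As a rough sketch of the rolling scheme you describe (re-estimate on a 
moving window, predict the next row, store the result in a vector), 
something along these lines might work. The simulated data, the window 
length of 50 and the helper name roll_onestep() are my own assumptions 
for illustration and not part of mboost. I also assume that rows are 
sorted by time within country:

```r
library(mboost)

## Simulated stand-in for the panel (assumption: rows are sorted by time
## within country and x1, x2, x3 are already lagged).
set.seed(1)
n <- 60                                    # months per country
dat <- data.frame(country = rep(c("AUS", "CAN"), each = n),
                  x1 = rnorm(2 * n), x2 = rnorm(2 * n), x3 = rnorm(2 * n))
dat$y <- sin(dat$x1) + 0.5 * dat$x2 + rnorm(2 * n, sd = 0.3)

## roll_onestep() is a hypothetical helper, not part of mboost.
roll_onestep <- function(df, window) {
    targets <- (window + 1):nrow(df)       # rows to be predicted
    out <- numeric(length(targets))        # one-step-ahead predictions
    for (i in seq_along(targets)) {
        t <- targets[i]
        train <- df[(t - window):(t - 1), ]   # rolling estimation window
        fit <- mboost(y ~ x1 + x2 + x3, data = train)
        out[i] <- as.numeric(predict(fit, newdata = df[t, , drop = FALSE]))
    }
    out
}

## One vector of one-step-ahead predictions per country
bycountry <- lapply(split(dat, dat$country), roll_onestep, window = 50)
```

Each element of bycountry is then a plain numeric vector with one 
prediction per rolled-forward month. Note that the sketch refits with the 
default mstop in every window, which you would want to tune.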

To estimate the model only on a subset of the data you could either use

 > mboost(y ~ x1 + x2 + x3, data = some_part_of_your_dataset)

or you can apply weights

 > model <- mboost(y ~ x1 + x2 + x3, data = data,
+                 weights = c(rep(1, 100), rep(0, nrow(data) - 100)))
 > predict(model) ## gives you predictions for all observations in data

Now you can extract the subset of out-of-bag predictions, i.e., 
predictions for observations with weight 0.
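
A minimal sketch of that extraction, using simulated data as a stand-in 
for yours (the data, the object names w, pred_all and pred_oob, and the 
split after row 100 are assumptions for illustration):

```r
library(mboost)

## Simulated stand-in data (assumption, replace with your own data frame)
set.seed(2)
n <- 150
data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
data$y <- data$x1^2 + data$x2 + rnorm(n, sd = 0.3)

## Weight 1: used for estimation; weight 0: held out
w <- c(rep(1, 100), rep(0, nrow(data) - 100))
model <- mboost(y ~ x1 + x2 + x3, data = data, weights = w)

pred_all <- predict(model)          # one prediction per row of data
pred_oob <- pred_all[w == 0]        # predictions for held-out rows only

## Out-of-sample mean squared error on the held-out rows
mse_oob <- mean((data$y[w == 0] - pred_oob)^2)
```

Keeping the weight vector w around is the simplest way to index the 
held-out rows later on.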

One further thing to mention:
You call your model a black box. Note, however, that you do NOT fit a 
black-box model but an additive model with P-spline base-learners (which 
is the default). You can see this if you type, e.g.,

 > coef(model)

and look at the names.

Another idea for your data problem might be to fit ONE model with 
country as an effect modifier, specified via the "by" argument in all 
base-learners. A call could look like

 > mboost(y ~ bbs(x1, by = country) + bbs(x2, by = country)
+            + bbs(x3, by = country), data = data)

Or you could use random effects via brandom() base-learners. Oh, and 
please note that you need to tune your mstop value (e.g. via cvrisk)!
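
A minimal sketch of tuning mstop with cvrisk, again on simulated data 
(the data and the choice of 5-fold cross-validation are assumptions for 
illustration). Keep in mind that the folds are drawn at random and so 
ignore the time ordering of your observations:

```r
library(mboost)

## Simulated stand-in data (assumption, replace with your own data frame)
set.seed(3)
n <- 150
data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
data$y <- 2 * sin(data$x1) + data$x2 + rnorm(n, sd = 0.3)

model <- mboost(y ~ x1 + x2 + x3, data = data)

## 5-fold cross-validation of the empirical risk over the boosting iterations
folds <- cv(model.weights(model), type = "kfold", B = 5)
cvr <- cvrisk(model, folds = folds)

best <- mstop(cvr)        # iteration with the smallest cross-validated risk
model <- model[best]      # set the model to that number of iterations
```

Calling plot(cvr) shows the cross-validated risk paths, which helps to 
check whether the initial mstop was large enough.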

HTH
  Benjamin

Travis Berge <travisrhelp at gmail.com> wrote:
> Hi UseRs -
>
> I am new to R, and could use some help making out-of-sample predictions
> using a boosting model (the mboost command). The issue is complicated by the
> fact that I have panel data (time by country), and am estimating the model
> separately for each country. FYI, this is monthly data and I have 1986m1 -
> 2009m12 for 9 countries.
>
> To give you a flavor of what I am doing, here is a simple example to show
> how I make in-sample predictions:
>
> # data has following columns: country year month y x1 x2 x3
> dat = read.csv("data.csv")
>
> # Create function that estimates model, produces in-sample predictions
> bbox = function(df)
> {
> blackbox = mboost(y ~ x1 + x2 + x3, data = df)
> predict(blackbox)
> }
>
> # Use lapply to estimate by country
> bycountry = lapply(split(dat, dat$country), bbox)
>
>
> So that in the end I have an object bycountry that contains the in-sample
> predictions of the model, estimated for each country separately. What I
> would like to do is take this model and estimate it for each country using
> some initial data. I.e., estimate Australia with 1986m1-2003m12, make
> prediction about 2004m1, roll data forward. Estimate AUS with 1986m2-2004m1,
> predict 2004m2, etc for all data points. Now do the same for Canada,
> Denmark, etc.
>
> So I guess my problem is twofold. 1) How to make these out-of-sample
> predictions, by country, when my data has not been declared as time-series?
> (I do not think that mboost can handle time-series data...x1 x2 and x3 have
> been lagged appropriately). 2) How to save the one-step ahead predictions
> into a vector?
>
> Any thoughts would be greatly appreciated. Many thanks!
>
> -Travis