[R-sig-Geo] Prediction variance (map) for predictions derived using RandomForest package

Sun Jun 23 15:08:35 CEST 2013

Hi Tom, I've done something similar in the past to visualize the
distribution of the predictions attained for each observation across
the many trees within a random forest while looking at various aspects
of those ranges and correlating that with cross-validated prediction
errors.  It's relatively easy to generate and keep the predictions for
every tree for each observation (pixel in your case) using the
predict.all=TRUE argument:

predictions <- predict(random_forest, newdata=x_data_new, predict.all=TRUE)

Then to extract all of the individual trees' predictions for the first
observation:

predictions$individual[1, ]

You can do this to get the mean, SD and CV for each observation (note
the mean should match the value in predictions$aggregate):

y_data$rf_mean <- apply(predictions$individual, MARGIN=1, mean)
y_data$rf_sd <- apply(predictions$individual, MARGIN=1, sd)
y_data$rf_cv <- apply(predictions$individual, MARGIN=1, function(p) sd(p) / mean(p))
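
For a self-contained check of the steps above, here is a minimal sketch
on simulated data.  The data, the seed and the tree_* names are made up
for illustration; only randomForest() and predict(..., predict.all=TRUE)
come from the randomForest package.

```r
library(randomForest)

set.seed(42)

# Simulated training data (hypothetical): two predictors, one response.
# The response is shifted well away from zero so the CV is well behaved.
n_train <- 200
x_data <- data.frame(x1 = runif(n_train), x2 = runif(n_train))
y <- 5 + sin(2 * pi * x_data$x1) + x_data$x2 + rnorm(n_train, sd = 0.2)

# Simulated "new" observations (the pixels in the mapping case)
x_data_new <- data.frame(x1 = runif(50), x2 = runif(50))

random_forest <- randomForest(x = x_data, y = y, ntree = 500)

# predict.all=TRUE keeps the prediction of every tree
predictions <- predict(random_forest, newdata = x_data_new, predict.all = TRUE)

# One row per observation, one column per tree
dim(predictions$individual)

tree_mean <- rowMeans(predictions$individual)
tree_sd   <- apply(predictions$individual, 1, sd)
tree_cv   <- tree_sd / abs(tree_mean)

# The per-tree mean should agree with the aggregated prediction
mean_matches <- isTRUE(all.equal(unname(tree_mean),
                                 unname(predictions$aggregate)))
mean_matches
```

Note that the CV becomes unstable when the mean prediction is close to
zero, so it is only meaningful for responses that stay away from zero.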

In practice I've found during testing that the distribution of the
per-tree values (assuming the continuous regression case, since you're
looking at SD in the first place) is highly skewed.  The range, SD, CV
and other measures of spread across the individual trees do not
correlate well at all with prediction errors in my work.  It kind of
makes intuitive sense, since the power of the random forest algorithm
lies in the ensemble nature of the technique and in the randomness
injected via variable sampling at each node.  The measures of variation
in the predictions I've looked at quickly become irrelevant as you
scale up the number of trees in the forest.  So your mileage may vary,
but I'd be interested to know what you find.
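
One way to run this kind of check on your own data is sketched below.
It is a simplified variant that uses a single hold-out split rather
than full cross-validation, and the simulated data and all object names
are assumptions for illustration only.  It compares the per-tree SD of
each held-out observation with its absolute prediction error.

```r
library(randomForest)

set.seed(1)

# Simulated data (hypothetical)
n <- 400
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 5 + sin(2 * pi * dat$x1) + dat$x2 + rnorm(n, sd = 0.3)

# Single hold-out split: 300 rows for fitting, 100 for checking
train_idx <- sample(n, 300)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

rf <- randomForest(y ~ x1 + x2, data = train, ntree = 500)

# Per-tree predictions for the held-out rows
p <- predict(rf, newdata = test, predict.all = TRUE)

tree_sd <- apply(p$individual, 1, sd)    # spread across trees
abs_err <- abs(test$y - p$aggregate)     # actual hold-out error

# Rank correlation between spread and error; a value near zero means
# the spread across trees says little about the size of the error
rho <- cor(tree_sd, abs_err, method = "spearman")
rho
```

The result depends on the data and on the number of trees, so this only
shows how to make the comparison, not what the answer will be.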

You may also want to look at the excellent quantregForest package as
it produces a randomForest object but also produces information on the
quantiles and quantile range for each observation's prediction for
you, including some nice plots that I've found useful.
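
A minimal sketch of the quantregForest route, again on simulated data
with made-up object names.  It assumes a recent version of the package,
where predict() takes the quantiles through the what argument; older
versions used a different argument name, so check the help page for the
version you have installed.

```r
library(quantregForest)

set.seed(7)

# Simulated data (hypothetical)
n <- 300
x_data <- data.frame(x1 = runif(n), x2 = runif(n))
y <- 5 + sin(2 * pi * x_data$x1) + x_data$x2 + rnorm(n, sd = 0.3)
x_data_new <- data.frame(x1 = runif(40), x2 = runif(40))

# Extra arguments such as ntree are passed on to randomForest()
qrf <- quantregForest(x = x_data, y = y, ntree = 500)

# One row per new observation, one column per requested quantile
q <- predict(qrf, newdata = x_data_new, what = c(0.05, 0.5, 0.95))

# Width of the 90% prediction interval for each observation; this is
# the kind of quantity that could be mapped per pixel
interval_width <- q[, 3] - q[, 1]
summary(interval_width)
```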

Sincerely,
Forrest

On Sun, Jun 23, 2013 at 5:51 AM, Tomislav Hengl
<hengl at spatial-analyst.net> wrote:
>
> Dear list,
>
> I have a question about the randomForest models. I'm trying to figure out a
> way to estimate the prediction variance (spatially) for the randomForest
> function (http://cran.r-project.org/web/packages/randomForest/).
>
> If I run a GLM I can also derive the prediction variance using:
>
>> demo(meuse, echo=FALSE)
>> meuse.ov <- over(meuse, meuse.grid)
>> meuse.ov <- cbind(meuse.ov, meuse@data)
>> omm0 <- glm(log1p(om)~dist+ffreq, meuse.ov, family=gaussian())
>> om.glm <- predict.glm(omm0, meuse.grid, se.fit=TRUE)
>> str(om.glm)
> List of 3
>  $ fit           : Named num [1:3103] 2.34 2.34 2.32 2.29 2.34 ...
>   ..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>  $ se.fit        : Named num [1:3103] 0.0491 0.0491 0.0481 0.046 0.0491 ...
>   ..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>  $ residual.scale: num 0.357
>
> when I fit a randomForest model, I do not get any estimate of the model
> uncertainty (for each pixel) but just the predictions:
>
>> meuse.ov <- meuse.ov[-omm0$na.action,]
>> x <- randomForest(log1p(om)~dist+ffreq, meuse.ov)
>> om.rf <- predict(x, meuse.grid)
>> str(om.rf)
>  Named num [1:3103] 2.49 2.49 2.51 2.44 2.49 ...
>  - attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>
> Does anyone have an idea how to map the prediction variance (i.e. estimated
> or propagated error) for the randomForest models spatially?
>
> I've tried deriving a propagated error for the randomForest models (every
> fit gives another model due to random component):
>
>> l.rfk <- data.frame(om_1 = rep(NA, nrow(meuse.grid)))
>> for(i in 1:50){
> +   suppressWarnings(suppressMessages(x <- randomForest(log1p(om)~dist+ffreq, meuse.ov)))
> +   l.rfk[,paste("om",i,sep="_")] <- predict(x, meuse.grid)
> + } ## takes ca 1 minute
>> meuse.grid$om.rfkvar <- om.rfk@predicted$var1.var + apply(l.rfk, 1, var)
>
> but the prediction variance I get is rather small (much smaller than e.g.
> the GLM variance). Here is the complete code with some plots:
>
> R code:
> https://code.google.com/p/gsif/source/browse/trunk/meuse/RK_vs_RandomForestK.R
>
> Predictions UK vs randomForest-kriging:
> https://gsif.googlecode.com/svn/trunk/meuse/Fig_meuse_RK_vs_RFK.png
>
> thanx,
>
> T. (Tom) Hengl
> Url: http://www.wageningenur.nl/en/Persons/dr.-T-Tom-Hengl.htm
> Network: http://profiles.google.com/tom.hengl
> Publications: http://scholar.google.com/citations?user=2oYU7S8AAAAJ

-- 
Forrest R. Stevens
Ph.D. Candidate, QSE3 IGERT Fellow
Department of Geography
Land Use and Environmental Change Institute
University of Florida
www.clas.ufl.edu/users/forrest