[R-sig-Geo] Prediction variance (map) for predictions derived using RandomForest package

Mon Jun 24 18:44:48 CEST 2013

Dear Forrest,

Thanks a lot for your tip. I think quantregForest is what we were 
looking for. It takes much more time to compute, but the method looks 
sound 
(http://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf). In the 
end I simplify things: I take the predicted quantiles at probabilities 
0.15866 and 1-0.15866 as the limits for -/+ 1 s.d. (assuming a roughly 
normal distribution) and derive the prediction variance from them. This is 
probably as good as it gets. Here is the revised code:

https://code.google.com/p/gsif/source/browse/trunk/meuse/RK_vs_RandomForestK.R
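
A minimal sketch of the simplification described above, in base R only. It 
is not the code from the linked script: the object names (q_lower, q_upper, 
pred_sd, pred_var) are hypothetical, and the quantile values are made-up 
numbers standing in for the two quantiles a quantile regression forest 
would predict for each pixel.

```r
# Probabilities that bracket the mean -/+ 1 s.d. of a normal distribution
p_lower <- pnorm(-1)      # approx. 0.15866
p_upper <- 1 - p_lower    # approx. 0.84134

# Made-up "true" values for five pixels (illustration only, not Meuse data)
true_mean <- c(2.3, 2.4, 2.5, 2.2, 2.6)
true_sd   <- c(0.10, 0.20, 0.30, 0.15, 0.25)

# Stand-ins for the predicted lower and upper quantiles per pixel
q_lower <- qnorm(p_lower, mean = true_mean, sd = true_sd)
q_upper <- qnorm(p_upper, mean = true_mean, sd = true_sd)

# Under the normality assumption the two quantiles lie 2 s.d. apart
pred_sd  <- (q_upper - q_lower) / 2
pred_var <- pred_sd^2
```

The half-width only equals the standard deviation when the predictive 
distribution is close to normal. If it is skewed, as Forrest reports for 
the per-tree predictions, pred_var is a rough summary at best.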

Thank you all for your suggestions / opinions (very useful as usual).

cheers,

T. (Tom) Hengl
Url: http://www.wageningenur.nl/en/Persons/dr.-T-Tom-Hengl.htm
Network: http://profiles.google.com/tom.hengl
Publications: http://scholar.google.com/citations?user=2oYU7S8AAAAJ

On 23/06/2013 15:08, Forrest Stevens wrote:
> Hi Tom, I've done something similar in the past to visualize the
> distribution of the predictions attained for each observation across
> the many trees within a random forest while looking at various aspects
> of those ranges and correlating that with cross-validated prediction
> errors.  It's relatively easy to generate and keep the predictions for
> every tree for each observation (pixel in your case) using the
> predict.all=TRUE argument:
>
> predictions <- predict(random_forest, newdata=x_data_new, predict.all=TRUE)
>
> Then to extract all of the individual trees' predictions for the first
> observation:
>
> predictions$individual[1, ]
>
> You can do this to get the mean, SD and CV for each observation (note the
> mean should match the value in predictions$aggregate):
>
> y_data$rf_mean <- apply(predictions$individual, MARGIN=1, mean)
> y_data$rf_sd <- apply(predictions$individual, MARGIN=1, sd)
> y_data$rf_cv <- apply(predictions$individual, MARGIN=1,
>                       function(p) sd(p) / mean(p))
>
>
> In practice I've found during testing that the distribution of values
> (assuming the continuous regression case since you're looking at SD in
> the first place) is highly skewed.  The range, SD, CV and other
> measures of the distribution of the individual trees do not correlate
> well at all with prediction errors in my work. It kind of makes
> intuitive sense, since the power of the random forest algorithm lies
> in the ensemble nature of the technique and in the randomness injected
> via variable sampling at each node. Those measures of variation in
> the predictions I've looked at quickly become irrelevant as you scale
> up the number of trees in the forest.  So your mileage may vary, but
> I'd be interested to know what you find.
>
> You may also want to look at the excellent quantregForest package as
> it produces a randomForest object but also produces information on the
> quantiles and quantile range for each observation's prediction for
> you, including some nice plots that I've found useful.
>
> Sincerely,
> Forrest
>
> On Sun, Jun 23, 2013 at 5:51 AM, Tomislav Hengl
> <hengl at spatial-analyst.net> wrote:
>>
>> Dear list,
>>
>> I have a question about the randomForest models. I'm trying to figure out a
>> way to estimate the prediction variance (spatially) for the randomForest
>> function (http://cran.r-project.org/web/packages/randomForest/).
>>
>> If I run a GLM I can also derive the prediction variance using:
>>
>>> demo(meuse, echo=FALSE)
>>> meuse.ov <- over(meuse, meuse.grid)
>>> meuse.ov <- cbind(meuse.ov, meuse@data)
>>> omm0 <- glm(log1p(om)~dist+ffreq, meuse.ov, family=gaussian())
>>> om.glm <- predict.glm(omm0, meuse.grid, se.fit=TRUE)
>>> str(om.glm)
>> List of 3
>>   $ fit           : Named num [1:3103] 2.34 2.34 2.32 2.29 2.34 ...
>>    ..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>>   $ se.fit        : Named num [1:3103] 0.0491 0.0491 0.0481 0.046 0.0491 ...
>>    ..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>>   $ residual.scale: num 0.357
>>
>> When I fit a randomForest model, I do not get any estimate of the model
>> uncertainty (for each pixel) but just the predictions:
>>
>>> meuse.ov <- meuse.ov[-omm0$na.action,]
>>> x <- randomForest(log1p(om)~dist+ffreq, meuse.ov)
>>> om.rf <- predict(x, meuse.grid)
>>> str(om.rf)
>>   Named num [1:3103] 2.49 2.49 2.51 2.44 2.49 ...
>>   - attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>>
>> Does anyone have an idea how to map the prediction variance (i.e. estimated
>> or propagated error) for the randomForest models spatially?
>>
>> I've tried deriving a propagated error for the randomForest models (every
>> fit gives another model due to the random component):
>>
>>> l.rfk <- data.frame(om_1 = rep(NA, nrow(meuse.grid)))
>>> for(i in 1:50){
>> +   suppressWarnings(suppressMessages(x <-
>> +     randomForest(log1p(om)~dist+ffreq, meuse.ov)))
>> +   l.rfk[,paste("om",i,sep="_")] <- predict(x, meuse.grid)
>> + } ## takes ca 1 minute
>>> meuse.grid$om.rfkvar <- om.rfk@predicted$var1.var + apply(l.rfk, 1, var)
>>
>> but the prediction variance I get is rather small (much smaller than e.g.
>> the GLM variance). Here is the complete code with some plots:
>>
>> R code:
>> https://code.google.com/p/gsif/source/browse/trunk/meuse/RK_vs_RandomForestK.R
>>
>> Predictions UK vs randomForest-kriging:
>> https://gsif.googlecode.com/svn/trunk/meuse/Fig_meuse_RK_vs_RFK.png
>>
>> thanx,
>>
>> T. (Tom) Hengl
>> Url: http://www.wageningenur.nl/en/Persons/dr.-T-Tom-Hengl.htm
>> Network: http://profiles.google.com/tom.hengl
>> Publications: http://scholar.google.com/citations?user=2oYU7S8AAAAJ
>>
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo