[R-sig-Geo] how do I properly reply if in digest mode? and re: Prediction variance (map) for predictions derived using RandomForest package

Sun Jun 23 15:40:56 CEST 2013

On Sun, 23 Jun 2013, Seth Myers wrote:

> 1. What is the best method to reply to a single message if in digest mode?
> I just send an email to r-sig-geo at r-project.org and use the original
> title with an re: in front.  Am I missing the obvious here or is this fine?

Seth:

All helpful contributions are always welcome!

I use Gmane to reply in-thread when I'm using digest-mode, as it is 
generally beneficial for those searching the archives to have follow-ups 
in-thread (your comments would benefit others with the same question as 
Tom's):

http://dir.gmane.org/gmane.comp.lang.r.geo

It is a bit picky: Gmane requires you to reply within the quoted text
rather than above it, and protests if you add little in proportion to the
length of the posting you are following up. I usually delete all but the
essential parts of the posting; patience helps. Nabble is an alternative,
but there you have to log in to post, and Nabble postings appear to be
more open to abuse, so the list filters work harder on them. Gmane posts
in-thread to the list on your behalf, so you simply provide your
subscribed email address in the header of the form.

Hope this helps,

Roger

>
> 2.  It appears Tom is trying to bootstrap a random forest model to get a
> range of predictions so as to determine the variance or uncertainty of
> the predictions.  I have used random forest for predictive work before.
> I do not understand the theory behind it extremely well (such as the
> proofs of it not overfitting), but I know the flow of operations within
> the algorithm well enough.  The algorithm resamples the data already.
> There are two random components for each tree grown in the forest.
> First, the data points are randomly selected, and then at each node in
> the tree a subset of the possible predictor variables is randomly
> selected (if you have it set up that way; the argument that controls
> this is a relatively important tuning parameter).  So, I believe this is
> why you are seeing little variance in your predictions: you are just
> creating a "shell" that repeats what the algorithm already does
> internally.  Plus, given that random forest really isn't a statistical
> technique, let alone a parametric one, I would be a bit uncomfortable
> applying standard statistical reasoning to it and shoehorning it into a
> bootstrap technique (which is admittedly rather general).  The
> literature on random forest is rather large.  I would look there for any
> published methods that do what you would ultimately like to accomplish.
>
> Seth Myers
> FSU
>
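
The two random components described above (the resampled data points and
the per-node predictor subset, which randomForest controls through its
mtry argument) are already recorded inside a fitted forest, so the spread
among individual trees can be inspected without refitting. A minimal
sketch, assuming the sp copies of meuse and meuse.grid used as plain data
frames (not the overlay from the original post) and an arbitrary seed;
predict.all = TRUE is a documented option of predict() for randomForest
objects. The per-tree variance is only a rough indicator: it describes
disagreement among trees, not the error of the forest average, and it
leaves out residual noise.

```r
library(sp)            # provides the meuse and meuse.grid example data
library(randomForest)

data(meuse)
data(meuse.grid)
set.seed(1)            # arbitrary seed, only for reproducibility

# keep only the rows with an observed organic matter value
meuse.cc <- meuse[!is.na(meuse$om), ]
rf <- randomForest(log1p(om) ~ dist + ffreq, data = meuse.cc, ntree = 500)

# predict.all = TRUE returns the forest prediction ($aggregate) and the
# prediction of every single tree ($individual, one column per tree)
p <- predict(rf, meuse.grid, predict.all = TRUE)
dim(p$individual)

# spread of the individual tree predictions at each grid cell
tree.var <- apply(p$individual, 1, var)
summary(tree.var)
```

tree.var has one value per grid cell, so it can be attached to the grid
and mapped in the same way as the predictions.
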
> Message: 7
> Date: Sun, 23 Jun 2013 11:51:18 +0200
> From: Tomislav Hengl <hengl at spatial-analyst.net>
> To: R-sig-Geo at r-project.org
> Cc: andy_liaw at merck.com
> Subject: [R-sig-Geo] Prediction variance (map) for predictions derived
>        using RandomForest package
> Message-ID: <51C6C516.3090504 at spatial-analyst.net>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
>
> Dear list,
>
> I have a question about the randomForest models. I'm trying to figure
> out a way to estimate the prediction variance (spatially) for the
> randomForest function
> (http://cran.r-project.org/web/packages/randomForest/).
>
> If I run a GLM I can also derive the prediction variance using:
>
> > demo(meuse, echo=FALSE)
> > meuse.ov <- over(meuse, meuse.grid)
> > meuse.ov <- cbind(meuse.ov, meuse@data)
> > omm0 <- glm(log1p(om)~dist+ffreq, meuse.ov, family=gaussian())
> > om.glm <- predict.glm(omm0, meuse.grid, se.fit=TRUE)
> > str(om.glm)
> List of 3
>  $ fit           : Named num [1:3103] 2.34 2.34 2.32 2.29 2.34 ...
>   ..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>  $ se.fit        : Named num [1:3103] 0.0491 0.0491 0.0481 0.046 0.0491 ...
>   ..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>  $ residual.scale: num 0.357
>
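
For reference when comparing against the GLM: se.fit is the standard error
of the fitted mean, so the variance for a new observation in a Gaussian
model also includes the residual variance. A minimal sketch, assuming the
sp copies of meuse and meuse.grid used as plain data frames (not the
overlay from the original post); var.mean and var.pred are illustrative
names only.

```r
library(sp)   # provides the meuse and meuse.grid example data

data(meuse)
data(meuse.grid)

# glm() drops the rows with missing om (default na.action is na.omit)
omm0 <- glm(log1p(om) ~ dist + ffreq, data = meuse, family = gaussian())
om.glm <- predict(omm0, meuse.grid, se.fit = TRUE)

# variance of the fitted mean at each grid cell
var.mean <- om.glm$se.fit^2

# variance for a new observation: mean variance plus residual variance
var.pred <- var.mean + om.glm$residual.scale^2
summary(sqrt(var.pred))
```

Which of the two quantities is compared with a random forest spread
matters, because the residual term usually dominates.
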
> when I fit a randomForest model, I do not get any estimate of the model
> uncertainty (for each pixel) but just the predictions:
>
> > meuse.ov <- meuse.ov[-omm0$na.action,]
> > x <- randomForest(log1p(om)~dist+ffreq, meuse.ov)
> > om.rf <- predict(x, meuse.grid)
> > str(om.rf)
>  Named num [1:3103] 2.49 2.49 2.51 2.44 2.49 ...
>  - attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
>
> Does anyone have an idea how to map the prediction variance (i.e.
> estimated or propagated error) for the randomForest models spatially?
>
> I've tried deriving a propagated error for the randomForest models
> (every fit gives another model due to random component):
>
> > l.rfk <- data.frame(om_1 = rep(NA, nrow(meuse.grid)))
> > for(i in 1:50){
> +   suppressWarnings(suppressMessages(x <-
> +     randomForest(log1p(om)~dist+ffreq, meuse.ov)))
> +   l.rfk[,paste("om",i,sep="_")] <- predict(x, meuse.grid)
> + } ## takes ca 1 minute
> > meuse.grid$om.rfkvar <- om.rfk@predicted$var1.var + apply(l.rfk, 1, var)
>
> but the prediction variance I get is rather small (much smaller than
> e.g. the GLM variance). Here is the complete code with some plots:
>
> R code:
> https://code.google.com/p/gsif/source/browse/trunk/meuse/RK_vs_RandomForestK.R
>
> Predictions UK vs randomForest-kriging:
> https://gsif.googlecode.com/svn/trunk/meuse/Fig_meuse_RK_vs_RFK.png
>
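
On why the refit loop above gives such a small variance: every refit sees
the same data, so the spread between refits reflects only the random
component of the algorithm, and that component averages out as more trees
are grown. A minimal sketch, assuming the sp copies of meuse and
meuse.grid used as plain data frames; refit.var is a hypothetical helper
written for this illustration, and the seed and tree counts are arbitrary.

```r
library(sp)            # provides the meuse and meuse.grid example data
library(randomForest)

data(meuse)
data(meuse.grid)
meuse.cc <- meuse[!is.na(meuse$om), ]

# variance across n.fit refits on identical data, averaged over the grid
refit.var <- function(ntree, n.fit = 10) {
  preds <- sapply(seq_len(n.fit), function(i) {
    rf <- randomForest(log1p(om) ~ dist + ffreq, data = meuse.cc,
                       ntree = ntree)
    predict(rf, meuse.grid)
  })
  mean(apply(preds, 1, var))
}

set.seed(1)            # arbitrary seed, only for reproducibility
v.small <- refit.var(ntree = 25)
v.large <- refit.var(ntree = 500)
c(ntree25 = v.small, ntree500 = v.large)
```

The between-refit variance should be clearly lower for the larger forest,
which suggests it measures Monte Carlo noise of the fitting procedure
and not the uncertainty of the predictions themselves.
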
> thanx,
>
> T. (Tom) Hengl
> Url: http://www.wageningenur.nl/en/Persons/dr.-T-Tom-Hengl.htm
> Network: http://profiles.google.com/tom.hengl
> Publications: http://scholar.google.com/citations?user=2oYU7S8AAAAJ
>

-- 
Roger Bivand
Department of Economics, NHH Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no