[R] Random Forest for Ecological Prediction under presence of Spatial Autocorrelation

Mon May 24 16:10:05 CEST 2010

You could also try the Boruta package, a random-forest-based wrapper for
variable selection, for your option 3.
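
A minimal sketch of what that could look like, on made-up data rather than
your grid. The data frame `d`, its columns `x1`..`x5`, the response
`presence`, and `maxRuns = 100` are all assumptions for illustration only.
Note that Boruta aims at "all relevant" variables, so it will probably not
pick one variable out of a correlated group for you; that step would still
be a manual decision.

```r
library(Boruta)

set.seed(1)

# Hypothetical toy data: 'presence' depends on x1 and x2 only;
# x3, x4 and x5 are pure noise.
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
d$presence <- factor(d$x1 + d$x2 + rnorm(n, sd = 0.5) > 0)

# Boruta compares each variable's importance against shuffled "shadow"
# copies and labels it Confirmed, Tentative or Rejected.
b <- Boruta(presence ~ ., data = d, maxRuns = 100)
print(b)

# Names of the variables Boruta confirmed as relevant.
selected <- getSelectedAttributes(b, withTentative = FALSE)
selected
```

`attStats(b)` gives the per-variable importance summary if you want to see
how close the rejected variables were to the shadow threshold.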

2010/5/24 Andreas Béguin <chaudeau at gmail.com>:
> Dear R-help list members,
>
> I have a statistical question regarding the Random Forest function (RF) as
> applied to ecological prediction of species presences and absences.
>
> RF seems to perform very well for prediction of species ranges or
> prevalences. However, the problem with my dataset is a high degree of
> spatial autocorrelation and therefore a low effective sample size compared
> to the full number of gridpoints (0.5 degree grid extending over all land
> areas north of 55 deg. south, ~60000 grid points). My variables are to a
> high degree correlated in x and y direction. When using the entire dataset
> in the RF function, the misclassification rate is unbelievably low,
> suggesting overfitting. The noisy marginal probability plots (see attached
> example) seem to support this idea. My question is: Is there a way to make
> the decision trees in RF more generalizable without modelling the spatial
> autocorrelation explicitly? Here are four ways of doing this I have thought
> about:
> 1. Spatially clustering observations into training and test datasets and
> averaging the predicted class probability values to approximate "real"
> certainty - This could be done on country level or in a chessboard-like
> pattern
> 2. Requiring a higher minimal nodesize to prevent the creation of
> overfitted, maximal trees - Which value of "nodesize" might be appropriate?
> 3. Reducing the number of variables involved in the model by just taking one
> out of a group of correlated variables (say, for example, only winter
> temperature instead of temperatures from all seasons) - This variable
> selection would be based on the Variable Importance plots. I was considering
> using the Gini measure ranking instead of the accuracy ranking to produce
> simpler, more "biological" trees; please comment on this.
> 4. Requiring RF to choose only a certain number of "TRUE" and "FALSE"
> ("presence"-"absence") observations using the "sampsize" option, thereby
> increasing the distance between the gridpoints chosen to build the model so
> as to reduce correlation between observations.
>
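
To make options 1, 2 and 4 concrete, here is a rough sketch that combines a
chessboard-style train/test split with the `nodesize` and `sampsize`
arguments of randomForest(). Everything about the data is invented: the
`grid` data frame, the predictors `temp` and `prec`, the 5-degree block
width, `nodesize = 20` and 200 points per class are placeholders, not
recommendations. Also note that `sampsize` thins the points at random per
tree; it does not enforce a minimum distance between them.

```r
library(randomForest)

set.seed(42)

# Hypothetical stand-in for a 0.5-degree grid with two smooth predictors.
grid <- expand.grid(lon = seq(-10, 29.5, by = 0.5),
                    lat = seq(35, 69.5, by = 0.5))
grid$temp <- 30 - 0.5 * grid$lat + rnorm(nrow(grid), sd = 1)
grid$prec <- 50 + 2 * grid$lon + rnorm(nrow(grid), sd = 5)
score <- grid$temp + 0.05 * grid$prec + rnorm(nrow(grid))
grid$presence <- factor(ifelse(score > 7, "presence", "absence"))

# Option 1: chessboard split. Cells fall into 5 x 5 degree blocks;
# "black" blocks are used for training, "white" blocks for testing.
block <- 5
fold  <- (floor(grid$lon / block) + floor(grid$lat / block)) %% 2
train <- grid[fold == 0, ]
test  <- grid[fold == 1, ]

# Options 2 and 4: larger terminal nodes, and a balanced subsample of
# presences and absences drawn for each tree.
n_per_class <- min(table(train$presence), 200)
rf <- randomForest(presence ~ temp + prec, data = train,
                   ntree    = 200,
                   nodesize = 20,
                   strata   = train$presence,
                   sampsize = c(n_per_class, n_per_class))

# Compare the out-of-bag error with the error on the held-out blocks.
oob_err     <- rf$err.rate[rf$ntree, "OOB"]
blocked_err <- mean(predict(rf, test) != test$presence)
c(oob = oob_err, blocked = blocked_err)
```

With autocorrelated data the out-of-bag error is computed on points whose
neighbours were in the training sample, so the blocked error is the more
cautious of the two numbers to report.
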
> Which of these approaches would you suggest pursuing? Surely some of you
> have faced and tackled the problem of spatial autocorrelation in ecological
> prediction. I am aware of the works of Araujo et al. (2005) and Koenig
> (1999), any further suggested reading (especially examples of how spatial
> autocorrelation can be dealt with practically) would be highly welcome.
>
> Kind regards,
>
> Andreas Beguin
> ##########################################
> Division of Epidemiology and Global Health
> Department of Public Health and Clinical Medicine
> Umea University
> 907 31 Umea Sweden