[R-sig-Geo] Species Distribution by Random Forest using a multi core pc

Giuseppe Amatulli giuseppe.amatulli at gmail.com
Mon Mar 21 17:11:45 CET 2011


Hi
First of all, sorry for this late answer; I was struggling to compile all the
software on my new PC (16 processors, 16 GB of memory) and to set up the
environment for the large computation.

Herein I will try to reply to the input that I received more than one month ago.
I will insert my questions and the replies in order to track back the history.
Moreover, I will also insert code/observations and how I solved the problems.

I'm trying to model forest species distribution at the European level (1 km
resolution) by means of randomForest, running it on a cluster
computer. I'm using several predictors from different data sources. All
of them are rasters in GRASS format, so I was using spgrass6
to import the data into R and apply the randomForest prediction to the
layers. At the same time, reading the help pages of the raster
package carefully, it seems to me that its row-by-row processing copes
better with memory limitations than spgrass6 does. Is this
the case?
If the raster package is more efficient, how can I use it to import GRASS
data? I suppose by reading the rasters under the cellhd folder:
>  maps <- stack(c('LOCATION/PERMANENT/cellhd/grid1', 'LOCATION/PERMANENT/cellhd/grid2'))

On 10 January 2011 20:25, Roger Bivand <Roger.Bivand at nhh.no> wrote:

> In principle, the GRASS GDAL plugin should work in this way, but you can
> also use g.region in GRASS to set the region for readRAST6() to read, which
> could be in tiles or rows at your convenience. This might be easier if the
> predicted tiles are to be written back to GRASS as part of the process. It
> would be overkill to think of this kind of iterated region support in
> raster, which uses the region.dim= and offset= features of the GDAL
> interface, I think. It would be fun to see whether SAGA could be used in the same way with
> raster, as there is a SAGA GDAL driver.
>
> Roger

After several tests I decided to use the raster package to read the
GRASS data and write the results back. My decision was driven by its
faster performance in reading the data and making the prediction. The
drawbacks were:
1) the raster package cannot read the GRASS region. Anyway, I easily
solved this by "cutting" all the GRASS data to the region that was set;
2) the geo-output is a TIFF, which I re-import into GRASS for further analysis.

I implemented the function for the prediction in this way:

f6 <- function(x, rf, filename='') {
        out <- raster(x)
        dataType(out) <- 'FLT4S'
        small <- canProcessInMemory(out, 1)
        filename <- trim(filename)
        if (!small & filename == '') {
                filename <- rasterTmpFile()
        }
        if (filename != '') {
                out <- writeStart(out, filename, overwrite=TRUE)
                todisk <- TRUE
        } else {
                vv <- matrix(ncol=nrow(out), nrow=ncol(out))
                todisk <- FALSE
        }
        bs <- blockSize(out)
        for (i in 1:bs$n) {
                # deal with no-data values (unknownToNA/NAToUnknown are
                # from the gdata package); I will fix this in a cleverer way
                block_rasters <- getValues(x, row=bs$row[i], nrows=bs$nrows[i])
                block_rasters <- unknownToNA(block_rasters, NaN)
                block_rasters <- NAToUnknown(block_rasters, 0, force=TRUE)
                print(paste("predict random forest for block", i))
                pred <- predict(rf, newdata=block_rasters)

                if (todisk) {
                        writeValues(out, pred, bs$row[i])
                } else {
                        cols <- bs$row[i]:(bs$row[i]+bs$nrows[i]-1)
                        vv[,cols] <- matrix(pred, nrow=out@ncols)
                }
        }
        if (todisk) {
                out <- writeStop(out)
        } else {
                out <- setValues(out, as.vector(vv))
        }
        return(out)
}

This function uses a raster stack of 30 layers (Europe at 1x1 km) and applies
the random forest prediction. It is very fast: just 10 minutes.
In order to use the multi-core power and run predictions for several
different species, I inserted the R script into a bash script using the
EOF (here-document) syntax, and I run it in this way:

for sp in /forest/sp_* ; do echo $sp ; done | xargs -n 1 -P 4 sh ./randomForest_prediction.sh

xargs allows the use of 4 processors (-P 4), passing one argument at a time (-n 1).
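For anyone who wants to try the dispatch pattern without the full prediction script, a minimal sketch looks like this (here `echo` stands in for the real per-species script, and the species names are invented):

```shell
# Sketch of the parallel dispatch: each species name is handed,
# one at a time (-n 1), to at most 2 concurrent workers (-P 2).
# 'sh -c echo ...' stands in for sh ./randomForest_prediction.sh.
printf '%s\n' sp_a sp_b sp_c | xargs -n 1 -P 2 sh -c 'echo "processing $0"'
```

Note that with `sh -c` the species name arrives inside the one-liner as `$0`, whereas a real script receives it as `$1`; also, the output order may vary between runs because the workers run concurrently.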

Of course, the bash variables need to be exported and then read in R via
Sys.getenv().

I always try to do all the GIS and data preprocessing in bash using
GRASS, awk and sed, and to enter R with the data completely clean and
ready for the real spatial statistics. This saves memory and computation
time, and gets the best out of each piece of software.
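As an illustration of that style (a hypothetical cleaning step, not the actual preprocessing used here, with invented column names): awk can drop incomplete rows from a predictor table before R ever sees it.

```shell
# Hypothetical cleaning step: keep the header plus only the rows in which
# every column is non-empty, so R receives a complete-cases table.
printf 'x,y,ndvi\n1,2,0.5\n3,,0.7\n' \
  | awk -F, 'NR==1 || ($1!="" && $2!="" && $3!="")'
```

The row with the missing y value is filtered out, so only the header and the first data row survive.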

Now that the whole processing chain is working, I can start to play
around with the random forest tuning parameters and use Robert's
suggestions:

On 11 January 2011 06:51, Robert Hijmans <r.hijmans at gmail.com> wrote:
> You can also have a look at the 'dismo' package for species distribution
> modeling.  The vignette (under development) has a brief example with
> randomForest.
>
> To apply weights in randomForest (I have not tried this and could very well
> be wrong), you can perhaps use regression rather than classification (in my
> experience randomForest regression works much better in this context) and
> adjust your presence/absence data based on certainty (highly certain absence
> = -1, highly certain presence = 1, everything else scaled in between).

Indeed, I also prefer to use randomForest in regression mode. The only
problem is the rather low value of the explained variance, due to the
binary values of the response versus the continuous values of the prediction.
So my questions to the community are:
How can I calculate the performance of the model in this particular
case?
Can I still consider the maximization of the explained variance as the
driving factor for the tuning?
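One simple option, staying in the awk-based workflow described above (a sketch with invented numbers, under the assumption that plain error magnitude is an acceptable measure for binary observations against continuous predictions), is to report RMSE alongside the explained variance; threshold-independent measures such as AUC are another common choice for presence/absence data:

```shell
# Hedged sketch: column 1 = binary observation (0/1), column 2 = continuous
# random forest prediction; compute the root mean squared error.
printf '1 0.8\n0 0.3\n1 0.6\n' \
  | awk '{d = $1 - $2; s += d*d; n++} END {printf "RMSE %.3f\n", sqrt(s/n)}'
# prints: RMSE 0.311
```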

> A perhaps better alternative would be to use the cforest function (another
> random forest implementation) in 'party', because this function has a
> weights argument.

Yes, I will test it and let you know how it performs.

Thanks again for the suggestions
best regards
Giuseppe Amatulli


