[R-sig-Geo] Parallel predict now in spatial.tools

Tim Howard tghoward at gw.dec.state.ny.us
Fri Mar 28 14:58:33 CET 2014


Jonathan,
Thanks again for your reply. 
I'll reply inline ... please see below

>>>> Jonathan Greenberg <jgrn at illinois.edu> 03/26/14 10:55 AM >>>
>Tim:
>
>One thing to make sure is if the stack has the bands named properly
>>(they need to match the formula), e.g.:
>>names(mystack) <- c("b1","b2",...)

Yes, I think I'm good on layer names, as well as maintaining all the
levels in categorical data (a classic gotcha for RF). 

>
>If you can also pass along the error you are getting, that will go a
>long way towards figuring out the issue! Cheers!

I was a little reluctant to pass on the error message until I knew a
little more. The main reason being that the normal predict function with
raster *does* work even though you might not expect it with the error
message. Here is the error coming out of predict_rasterEngine:
 
> sfQuickInit(cpus=6)
socket cluster with 6 nodes on host ‘localhost’
> prediction_rf_prob <- predict_rasterEngine(object=rf.full,
newdata=envStack, newtype='prob')
Error in predict.randomForest(object = object, newdata = newdata_df,
mget(model_parameters)) : 
  missing values in newdata
> sfQuickStop()
 

Again, I can run these exact data using raster:::predict and I get good
output.  To me that means I do not have 'missing values in newdata'. I
welcome thoughts otherwise, however. The predict call is structured like
this:

prediction_rf_prob <- predict(object=envStack, model=rf.full,
type='prob', progress = 'text')

So, I've tried recreating the problem by tweaking the Tahoe data. Below,
I've added a categorical layer to the brick. Unfortunately, it errors
out with a new error. (Null external pointer).  (Am I right that
raster:::predict and predict_rasterEngine use the term "object"
differently?)

I'll paste a replacement set of code for the Tahoe data below our
string.

I welcome comments on any of this!

Cheers, 
Tim

>--j
>
>On Wed, Mar 26, 2014 at 9:48 AM, Tim Howard
<tghoward at gw.dec.state.ny.us> wrote:
>> Jonathan,
>> Thanks for your quick reply. I just tried your Tahoe example using x
and y
>> in the call rather than a formula and it worked fine, so I was
barking up
>> the wrong tree. Sorry. I'll try to delve deeper and put together a
subset
>> example for you to check out if I can't get anywhere.
>>
>> Thanks,
>> Tim
>>
>>>>> Jonathan Greenberg <jgrn at illinois.edu> 3/26/2014 10:23 AM >>>
>>
>> Hi Tim:
>>
>> Re: stack, yep, it should work. In general, you get a decreased
>> performance from stacks since you are having to read from multiple
>> files rather than a single one, but it will still benefit from
>> parallel processing.
>>
>> Re: formula -- the best way for me to test this is to crop out a
piece
>> of your image and send me the random forest model (use ?save), and
the
>> call you were using. In theory you should be able to use anything you
>> would normally use on a data frame, but I'd have to play with it to
>> confirm! If you can set up those on e.g. google drive I can test it
>> out. Cheers!
>>
>> --j
>>
>> On Wed, Mar 26, 2014 at 6:49 AM, Tim Howard
<tghoward at gw.dec.state.ny.us>
>> wrote:
>>> Jonathan,
>>> Thank you for putting this together and for the example. I'm doing
two
>>> things differently with randomForest ... I think perhaps one of them
the
>>> function isn't handling.
>>>
>>> First, based on recommendations from Andy Liaw (and ?randomForest),
I
>>> don't
>>> use the formula interface but use x=<many columns>, y=<a column> in
the
>>> call. Does predict_rasterEngine handle the absence of a formula in
the
>>> object?
>>>
>>> Second, I have many large rasters I want to run the predict on, so
making
>>> a
>>> brick would be difficult. I use a rasterStack instead. Does your
example
>>> work with a rasterStack?
>>>
>>> I can dive deeper if any of this isn't clear or if these two tweaks
work
>>> just fine for you. I was just trying to swap out this version of
predict
>>> with another parallel version to evaluate speed and, while the
alternate
>>> version works>>> Date: Tue, 18 Mar 2014 22:14:23 -0500
>>> From: Jonathan Greenberg <jgrn at illinois.edu>
>>> To: "r-sig-geo at r-project.org" <R-sig-Geo at r-project.org>
>>> Subject: [R-sig-Geo] Parallel predict now in spatial.tools
>>> Message-ID:
>>> <CABG0rfseg+p0h4HdYOK+_Za=OLMeTKHAT+TQn7g_FkEdYiunFQ at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> R-sig-geo'ers:
>>>
>>> I finally got around to building a parallel predict statement that
>>> I've included in version 1.3.7 (or later) of spatial.tools (check
>>> http://r-forge.r-project.org/R/?group_id=1492 for the status of the
>>> build), "predict_rasterEngine". It should, in theory, be a direct
>>> swap-in for the standard generic predict() statement. Currently, it
>>> will work on any predict.* statement that has the following
features:
>>> 1) The data is passed to the predict as a data frame via a newdata
>>> parameter, and
>>> 2) The data is returned from the predict statement as a
vector/matrix.
>>>
>>> When using predict_rasterEngine, the object= parameter is your
model,
>>> and the newdata= parameter is the raster/brick/stack to apply the
>>> model to on a pixel-by-pixel basis (note that the names of the
layers
>>> must match the names of the predictor variables, in most cases).
>>>
>>> I was hoping to get some stress-testing on this, since it is a
fairly
>>> oft-requested function. If a predict.* function you'd like to use
>>> doesn't work, let me know which function it is (with some test data)
>>> and I'll see if I can tweak it to work.
>>>
>>> Right now, I have confirmed this works with randomForest. Here's an
>>> example:
>>>
>>> ######################
>>>
>>> packages_required <- c("spatial.tools","doParallel","randomForest")
>>> lapply(packages_required, require, character.only=T)
>>>
>>> # Load up a 3-band image:
>>> tahoe_highrez <- setMinMax(
>>> brick(system.file("external/tahoe_highrez.tif",
package="spatial.tools")))
>>> tahoe_highrez
>>> plotRGB(tahoe_highrez)
>>>
>>> # Load up some training points:
>>> tahoe_highrez_training_points <- readOGR(
>>> dsn=system.file("external", package="spatial.tools"),
>>> layer="tahoe_highrez_training_points")
>>>
>>> # Extract data to train the randomForest model:
>>> tahoe_highrez_training_extract <- extract(
>>> tahoe_highrez,
>>> tahoe_highrez_training_points,
>>> df=TRUE)
>>>
>>> # Fuse it back with the SPECIES info:
>>> tahoe_highrez_training_extract$SPECIES <-
>>> tahoe_highrez_training_points$SPECIES
>>>
>>> # Note the names of the bands:
>>> names(tahoe_highrez_training_extract) # the extracted data
>>> names(tahoe_highrez) # the brick
>>>
>>> # Generate a randomForest model:
>>> tahoe_rf <-
>>>
randomForest(SPECIES~tahoe_highrez.1+tahoe_highrez.2+tahoe_highrez.3,
>>> data=tahoe_highrez_training_extract)
>>>
>>> tahoe_rf
>>>
>>> # This will run the predict in parallel:
>>> sfQuickInit()
>>> prediction_rf_class <-
>>>
>>>
predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="response")
>>> prediction_rf_prob <-
>>>
predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="prob")
>>> sfQuickStop()
>>>
>>> ###############
>>>
>>> --j
>>
>>
>>
>> --
>> Jonathan A. Greenberg, PhD
>> Assistant Professor
>> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
>> Department of Geography and Geographic Information Science
>> University of Illinois at Urbana-Champaign
>> 259 Computing Applications Building, MC-150
>> 605 East Springfield Avenue
>> Champaign, IL 61820-6371
>> Phone: 217-300-1924
>> http://www.geog.illinois.edu/~jgrn/
>> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype:
jgrn3007

#here is a script that adds a categorical layer to the tahoe brick,
creates a RF model
#and then tries both raster:::predict and  predict_rasterEngine.

packages_required <- c("spatial.tools","doParallel","randomForest")
lapply(packages_required, require, character.only=T)

# Load up a 3-band image:
tahoe_highrez <- setMinMax(
brick(system.file("external/tah#create a categorical layer from band 1
mat <- matrix(c(0,50,1,50,150,2,150,255,3),ncol=3,byrow=TRUE)
bnd1cat <- reclassify(tahoe_highrez[[1]], rcl=mat)
bnd1cat <- ratify(bnd1cat)
rat <- levels(bnd1cat)[[1]]
rat$types <- c('type1', 'type2', 'type3', 'type4')
rat$code <- c(1,2,3,4)
levels(bnd1cat) <- rat

#library(rasterVis) #if you want a categorical plot
#levelplot(bnd1_rc2)

tahoe_highrez <- addLayer(tahoe_highrez, bnd1cat)
names(tahoe_highrez) <-
c("tahoeOne","tahoeTwo","tahoeThree","tahoeCatFour")

# Load up some training points:
tahoe_highrez_training_points <- readOGR(
    dsn=system.file("external", package="spatial.tools"),
    layer="tahoe_highrez_training_points")

# Extract data to train the randomForest model:
tahoe_highrez_training_extract <-
extract(tahoe_highrez,tahoe_highrez_training_points,df=TRUE)

# Fuse it back with the SPECIES info:
tahoe_highrez_training_extract$SPECIES <-
tahoe_highrez_training_points$SPECIES

# Note the names of the bands:
names(tahoe_highrez_training_extract) # the extracted data
names(tahoe_highrez) # the brick

# convert to factor, ensure all the levels are there
tahoe_highrez_training_extract$tahoeCatFour <-
factor(tahoe_highrez_training_extract$tahoeCatFour)
levels(tahoe_highrez_training_extract$tahoeCatFour) <- c(1,2,3,4)
str(tahoe_highrez_training_extract)

# Generate a randomForest model:
tahoe_rf <- randomForest(y=tahoe_highrez_training_extract$SPECIES, 
                            x=tahoe_highrez_training_extract[,2:5],
                            data=tahoe_highrez_training_extract)


# try it with standard predict call -- this works                       
    
predict1_rf_prob <- predict(object=tahoe_highrez, model=tahoe_rf,
type="prob")

# try it with rasterEngine -- this fails
sfQuickInit()
predict2_rf_prob <-
predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="prob")
sfQuickStop()

plot(predict1_rf_prob)




-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20140328/5bb48236/attachment.html>


More information about the R-sig-Geo mailing list