[R-sig-Geo] Parallel predict now in spatial.tools
Jonathan Greenberg
jgrn at illinois.edu
Fri Mar 28 15:38:46 CET 2014
Tim:
I notice you have:
predict_rasterEngine(object=rf.full, newdata=envStack, newtype='prob')
Was that a typo in your email? If not, this might be your issue -- the
correct parameter is "type", not "newtype" -- I noticed you are using
"type" in the raster:::predict:
?predict.randomForest
If you still get that error when you use type="prob", I think the
issue is that the raster probably has NA values (that is what usually
triggers the "missing values in newdata" error for
predict.randomForest) -- this is a valid issue, so I'll need to fix
the code a bit to deal with it -- can you confirm if its the newtype
issue first? I can probably push out a fix later this weekend if it
is, indeed, the NA issue.
Another important issue is that I don't, at present, support the RAT,
but I can also work on a fix for that.
You are correct re: the object is the model, and the newdata parameter
is the raster/brick/stack.
Cheers!
--j
On Fri, Mar 28, 2014 at 8:58 AM, Tim Howard <tghoward at gw.dec.state.ny.us> wrote:
> Jonathan,
> Thanks again for your reply.
> I'll reply inline ... please see below
>
>>>>> Jonathan Greenberg <jgrn at illinois.edu> 03/26/14 10:55 AM >>>
>
>>Tim:
>>
>>One thing to make sure is if the stack has the bands named properly
>>>(they need to match the formula), e.g.:
>>>names(mystack) <- c("b1","b2",...)
>
> Yes, I think I'm good on layer names, as well as maintaining all the levels
> in categorical data (a classic gotcha for RF).
>
>>
>>If you can also pass along the error you are getting, that will go a
>>long way towards figuring out the issue! Cheers!
>
> I was a little reluctant to pass on the error message until I knew a little
> more. The main reason being that the normal predict function with raster
> *does* work even though you might not expect it with the error message. Here
> is the error coming out of predict_rasterEngine:
>
>> sfQuickInit(cpus=6)
> socket cluster with 6 nodes on host 'localhost'
>> prediction_rf_prob <- predict_rasterEngine(object=rf.full,
>> newdata=envStack, newtype='prob')
> Error in predict.randomForest(object = object, newdata = newdata_df,
> mget(model_parameters)) :
> missing values in newdata
>> sfQuickStop()
>
>
> Again, I can run these exact data using raster:::predict and I get good
> output. To me that means I do not have 'missing values in newdata'. I
> welcome thoughts otherwise, however. The predict call is structured like
> this:
>
> prediction_rf_prob <- predict(object=envStack, model=rf.full, type='prob',
> progress = 'text')
>
> So, I've tried recreating the problem by tweaking the Tahoe data. Below,
> I've added a categorical layer to the brick. Unfortunately, it errors out
> with a new error. (Null external pointer). (Am I right that
> raster:::predict and predict_rasterEngine use the term "object"
> differently?)
>
> I'll paste a replacement set of code for the Tahoe data below our string.
>
> I welcome comments on any of this!
>
> Cheers,
> Tim
>
>
>>--j
>>
>>On Wed, Mar 26, 2014 at 9:48 AM, Tim Howard <tghoward at gw.dec.state.ny.us>
>> wrote:
>>> Jonathan,
>>> Thanks for your quick reply. I just tried your Tahoe example using x and
>>> y
>>> in the call rather than a formula and it worked fine, so I was barking up
>>> the wrong tree. Sorry. I'll try to delve deeper and put together a subset
>>> example for you to check out if I can't get anywhere.
>>>
>>> Thanks,
>>> Tim
>>>
>>>>>> Jonathan Greenberg <jgrn at illinois.edu> 3/26/2014 10:23 AM >>>
>>>
>>> Hi Tim:
>>>
>>> Re: stack, yep, it should work. In general, you get a decreased
>>> performance from stacks since you are having to read from multiple
>>> files rather than a single one, but it will still benefit from
>>> parallel processing.
>>>
>>> Re: formula -- the best way for me to test this is to crop out a piece
>>> of your image and send me the random forest model (use ?save), and the
>>> call you were using. In theory you should be able to use anything you
>>> would normally use on a data frame, but I'd have to play with it to
>>> confirm! If you can set up those on e.g. google drive I can test it
>>> out. Cheers!
>>>
>>> --j
>>>
>>> On Wed, Mar 26, 2014 at 6:49 AM, Tim Howard <tghoward at gw.dec.state.ny.us>
>>> wrote:
>>>> Jonathan,
>>>> Thank you for putting this together and for the example. I'm doing two
>>>> things differently with randomForest ... I think perhaps one of them the
>>>> function isn't handling.
>>>>
>>>> First, based on recommendations from Andy Liaw (and ?randomForest), I
>>>> don't
>>>> use the formula interface but use x=<many columns>, y=<a column> in the
>>>> call. Does predict_rasterEngine handle the absence of a formula in the
>>>> object?
>>>>
>>>> Second, I have many large rasters I want to run the predict on, so
>>>> making
>>>> a
>>>> brick would be difficult. I use a rasterStack instead. Does your example
>>>> work with a rasterStack?
>>>>
>>>> I can dive deeper if any of this isn't clear or if these two tweaks work
>>>> just fine for you. I was just trying to swap out this version of predict
>>>> with another parallel version to evaluate speed and, while the alternate
>>>> version works fine, predict_rasterEngine bailed on me.
>>>>
>>>> Thanks in advance.
>>>> Tim Howard
>>>>
>>>>
>>>>
>>>>>>>>>>
>>>> Date: Tue, 18 Mar 2014 22:14:23 -0500
>>>> From: Jonathan Greenberg <jgrn at illinois.edu>
>>>> To: "r-sig-geo at r-project.org" <R-sig-Geo at r-project.org>
>>>> Subject: [R-sig-Geo] Parallel predict now in spatial.tools
>>>> Message-ID:
>>>> <CABG0rfseg+p0h4HdYOK+_Za=OLMeTKHAT+TQn7g_FkEdYiunFQ at mail.gmail.com>
>>>> Content-Type: text/plain; charset=ISO-8859-1
>>>>
>>>> R-sig-geo'ers:
>>>>
>>>> I finally got around to building a parallel predict statement that
>>>> I've included in version 1.3.7 (or later) of spatial.tools (check
>>>> http://r-forge.r-project.org/R/?group_id=1492 for the status of the
>>>> build), "predict_rasterEngine". It should, in theory, be a direct
>>>> swap-in for the standard generic predict() statement. Currently, it
>>>> will work on any predict.* statement that has the following features:
>>>> 1) The data is passed to the predict as a data frame via a newdata
>>>> parameter, and
>>>> 2) The data is returned from the predict statement as a vector/matrix.
>>>>
>>>> When using predict_rasterEngine, the object= parameter is your model,
>>>> and the newdata= parameter is the raster/brick/stack to apply the
>>>> model to on a pixel-by-pixel basis (note that the names of the layers
>>>> must match the names of the predictor variables, in most cases).
>>>>
>>>> I was hoping to get some stress-testing on this, since it is a fairly
>>>> oft-requested function. If a predict.* function you'd like to use
>>>> doesn't work, let me know which function it is (with some test data)
>>>> and I'll see if I can tweak it to work.
>>>>
>>>> Right now, I have confirmed this works with randomForest. Here's an
>>>> example:
>>>>
>>>> ######################
>>>>
>>>> packages_required <- c("spatial.tools","doParallel","randomForest")
>>>> lapply(packages_required, require, character.only=T)
>>>>
>>>> # Load up a 3-band image:
>>>> tahoe_highrez <- setMinMax(
>>>> brick(system.file("external/tahoe_highrez.tif",
>>>> package="spatial.tools")))
>>>> tahoe_highrez
>>>> plotRGB(tahoe_highrez)
>>>>
>>>> # Load up some training points:
>>>> tahoe_highrez_training_points <- readOGR(
>>>> dsn=system.file("external", package="spatial.tools"),
>>>> layer="tahoe_highrez_training_points")
>>>>
>>>> # Extract data to train the randomForest model:
>>>> tahoe_highrez_training_extract <- extract(
>>>> tahoe_highrez,
>>>> tahoe_highrez_training_points,
>>>> df=TRUE)
>>>>
>>>> # Fuse it back with the SPECIES info:
>>>> tahoe_highrez_training_extract$SPECIES <-
>>>> tahoe_highrez_training_points$SPECIES
>>>>
>>>> # Note the names of the bands:
>>>> names(tahoe_highrez_training_extract) # the extracted data
>>>> names(tahoe_highrez) # the brick
>>>>
>>>> # Generate a randomForest model:
>>>> tahoe_rf <-
>>>> randomForest(SPECIES~tahoe_highrez.1+tahoe_highrez.2+tahoe_highrez.3,
>>>> data=tahoe_highrez_training_extract)
>>>>
>>>> tahoe_rf
>>>>
>>>> # This will run the predict in parallel:
>>>> sfQuickInit()
>>>> prediction_rf_class <-
>>>>
>>>>
>>>> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="response")
>>>> prediction_rf_prob <-
>>>> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="prob")
>>>> sfQuickStop()
>>>>
>>>> ###############
>>>>
>>>> --j
>>>
>>>
>>>
>>> --
>>> Jonathan A. Greenberg, PhD
>>> Assistant Professor
>>> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
>>> Department of Geography and Geographic Information Science
>>> University of Illinois at Urbana-Champaign
>>> 259 Computing Applications Building, MC-150
>>> 605 East Springfield Avenue
>>> Champaign, IL 61820-6371
>>> Phone: 217-300-1924
>>> http://www.geog.illinois.edu/~jgrn/
>>> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
>
> #here is a script that adds a categorical layer to the tahoe brick, creates
> a RF model
> #and then tries both raster:::predict and predict_rasterEngine.
>
>
> packages_required <- c("spatial.tools","doParallel","randomForest")
> lapply(packages_required, require, character.only=T)
>
> # Load up a 3-band image:
> tahoe_highrez <- setMinMax(
> brick(system.file("external/tahoe_highrez.tif", package="spatial.tools")))
> tahoe_highrez
> plotRGB(tahoe_highrez)
>
> #create a categorical layer from band 1
> mat <- matrix(c(0,50,1,50,150,2,150,255,3),ncol=3,byrow=TRUE)
> bnd1cat <- reclassify(tahoe_highrez[[1]], rcl=mat)
> bnd1cat <- ratify(bnd1cat)
> rat <- levels(bnd1cat)[[1]]
> rat$types <- c('type1', 'type2', 'type3', 'type4')
> rat$code <- c(1,2,3,4)
> levels(bnd1cat) <- rat
>
> #library(rasterVis) #if you want a categorical plot
> #levelplot(bnd1_rc2)
>
> tahoe_highrez <- addLayer(tahoe_highrez, bnd1cat)
> names(tahoe_highrez) <- c("tahoeOne","tahoeTwo","tahoeThree","tahoeCatFour")
>
>
> # Load up some training points:
> tahoe_highrez_training_points <- readOGR(
> dsn=system.file("external", package="spatial.tools"),
> layer="tahoe_highrez_training_points")
>
> # Extract data to train the randomForest model:
> tahoe_highrez_training_extract <-
> extract(tahoe_highrez,tahoe_highrez_training_points,df=TRUE)
>
> # Fuse it back with the SPECIES info:
> tahoe_highrez_training_extract$SPECIES <-
> tahoe_highrez_training_points$SPECIES
>
> # Note the names of the bands:
> names(tahoe_highrez_training_extract) # the extracted data
> names(tahoe_highrez) # the brick
>
> # convert to factor, ensure all the levels are there
> tahoe_highrez_training_extract$tahoeCatFour <-
> factor(tahoe_highrez_training_extract$tahoeCatFour)
> levels(tahoe_highrez_training_extract$tahoeCatFour) <- c(1,2,3,4)
> str(tahoe_highrez_training_extract)
>
>
> # Generate a randomForest model:
> tahoe_rf <- randomForest(y=tahoe_highrez_training_extract$SPECIES,
> x=tahoe_highrez_training_extract[,2:5],
> data=tahoe_highrez_training_extract)
>
>
> # try it with standard predict call -- this works
> predict1_rf_prob <- predict(object=tahoe_highrez, model=tahoe_rf,
> type="prob")
>
> # try it with rasterEngine -- this fails
> sfQuickInit()
> predict2_rf_prob <-
> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="prob")
> sfQuickStop()
>
> plot(predict1_rf_prob)
>
>
>
--
Jonathan A. Greenberg, PhD
Assistant Professor
Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
259 Computing Applications Building, MC-150
605 East Springfield Avenue
Champaign, IL 61820-6371
Phone: 217-300-1924
http://www.geog.illinois.edu/~jgrn/
AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
More information about the R-sig-Geo
mailing list