[R-sig-Geo] Parallel predict now in spatial.tools
Jonathan Greenberg
jgrn at illinois.edu
Fri Mar 28 17:36:30 CET 2014
Tim:
Would you mind crop()'ing out a piece of your stack and also save()
the randomforest model so I can do some tests? you can email them to
me directly or place them in a shared location. Thanks!
--j
On Fri, Mar 28, 2014 at 9:54 AM, Tim Howard <tghoward at gw.dec.state.ny.us> wrote:
> Yes, I caught that typo after sending the email. The same happens with
> "type" rather than "newtype". Sorry.
>
> Yes, absolutely, there are NA values in the newdata. My area of interest
> isn't rectangular and so the edges of all layers have NA values. Yes, it is
> definitely more efficient to skip all rows with any NA cells before sending
> it into RF. It is pretty critical to handle that for increased speed.
>
> I don't use the RAT either ... my first exposure to it was this morning
> trying to create a clearly categorical layer for the Tahoe data set. That's
> all. But I do have categorical data that are handled appropriately (without
> a RAT) in my data flow. I was just trying to add categorical data to the
> data flow in case that was what was throwing your function. It sounds like
> that issue might be moot. That's what I get for guessing!
>
>
> Thanks for your quick reply.
> Cheers,
> Tim
>
>>>> Jonathan Greenberg <jgrn at illinois.edu> 03/28/14 10:38 AM >>>
>
> Tim:
>
> I notice you have:
>
> predict_rasterEngine(object=rf.full, newdata=envStack, newtype='prob')
>
> Was that a typo in your email? If not, this might be your issue -- the
> correct parameter is "type", not "newtype" -- I noticed you are using
> "type" in the raster:::predict:
>
> ?predict.randomForest
>
> If you still get that error when you use type="prob", I think the
> issue is that the raster probably has NA values (that is what usually
> triggers the "missing values in newdata" error for
> predict.randomForest) -- this is a valid issue, so I'll need to fix
> the code a bit to deal with it -- can you confirm if its the newtype
> issue first? I can probably push out a fix later this weekend if it
> is, indeed, the NA issue.
>
> Another important issue is that I don't, at present, support the RAT,
> but I can also work on a fix for that.
>
> You are correct re: the object is the model, and the newdata parameter
> is the raster/brick/stack.
>
> Cheers!
>
> --j
>
> On Fri, Mar 28, 2014 at 8:58 AM, Tim Howard <tghoward at gw.dec.state.ny.us>
> wrote:
>> Jonathan,
>> Thanks again for your reply.
>> I'll reply inline ... please see below
>>
>>>>>> Jonathan Greenberg <jgrn at illinois.edu> 03/26/14 10:55 AM >>>
>>
>>>Tim:
>>>
>>>One thing to make sure is if the stack has the bands named properly
>>>>(they need to match the formula), e.g.:
>>>>names(mystack) <- c("b1","b2",...)
>>
>> Yes, I think I'm good on layer names, as well as maintaining all the
>> levels
>> in categorical data (a classic gotcha for RF).
>>
>>>
>>>If you can also pass along the error you are getting, that will go a
>>>long way towards figuring out the issue! Cheers!
>>
>> I was a little reluctant to pass on the error message until I knew a
>> little
>> more. The main reason being that the normal predict function with raster
>> *does* work even though you might not expect it with the error message.
>> Here
>> is the error coming out of predict_rasterEngine:
>>
>>> sfQuickInit(cpus=6)
>> socket cluster with 6 nodes on host 'localhost'
>>> prediction_rf_prob <- predict_rasterEngine(object=rf.full,
>>> newdata=envStack, newtype='prob')
>> Error in predict.randomForest(object = object, newdata = newdata_df,
>> mget(model_parameters)) :
>> missing values in newdata
>>> sfQuickStop()
>>
>>
>> Again, I can run these exact data using raster:::predict and I get good
>> output. To me that means I do not have 'missing values in newdata'. I
>> welcome thoughts otherwise, however. The predict call is structured like
>> this:
>>
>> prediction_rf_prob <- predict(object=envStack, model=rf.full, type='prob',
>> progress = 'text')
>>
>> So, I've tried recreating the problem by tweaking the Tahoe data. Below,
>> I've added a categorical layer to the brick. Unfortunately, it errors out
>> with a new error. (Null external pointer). (Am I right that
>> raster:::predict and predict_rasterEngine use the term "object"
>> differently?)
>>
>> I'll paste a replacement set of code for the Tahoe data below our string.
>>
>> I welcome comments on any of this!
>>
>> Cheers,
>> Tim
>>
>>
>>>--j
>>>
>>>On Wed, Mar 26, 2014 at 9:48 AM, Tim Howard <tghoward at gw.dec.state.ny.us>
>>> wrote:
>>>> Jonathan,
>>>> Thanks for your quick reply. I just tried your Tahoe example using x and
>>>> y
>>>> in the call rather than a formula and it worked fine, so I was barking
>>>> up
>>>> the wrong tree. Sorry. I'll try to delve deeper and put together a
>>>> subset
>>>> example for you to check out if I can't get anywhere.
>>>>
>>>> Thanks,
>>>> Tim
>>>>
>>>>>>> Jonathan Greenberg <jgrn at illinois.edu> 3/26/2014 10:23 AM >>>
>>>>
>>>> Hi Tim:
>>>>
>>>> Re: stack, yep, it should work. In general, you get a decreased
>>>> performance from stacks since you are having to read from multiple
>>>> files rather than a single one, but it will still benefit from
>>>> parallel processing.
>>>>
>>>> Re: formula -- the best way for me to test this is to crop out a piece
>>>> of your image and send me the random forest model (use ?save), and the
>>>> call you were using. In theory you should be able to use anything you
>>>> would normally use on a data frame, but I'd have to play with it to
>>>> confirm! If you can set up those on e.g. google drive I can test it
>>>> out. Cheers!
>>>>
>>>> --j
>>>>
>>>> On Wed, Mar 26, 2014 at 6:49 AM, Tim Howard
>>>> <tghoward at gw.dec.state.ny.us>
>>>> wrote:
>>>>> Jonathan,
>>>>> Thank you for putting this together and for the example. I'm doing two
>>>>> things differently with randomForest ... I think perhaps one of them
>>>>> the
>>>>> function isn't handling.
>>>>>
>>>>> First, based on recommendations from Andy Liaw (and ?randomForest), I
>>>>> don't
>>>>> use the formula interface but use x=<many columns>, y=<a column> in the
>>>>> call. Does predict_rasterEngine handle the absence of a formula in the
>>>>> object?
>>>>>
>>>>> Second, I have many large rasters I want to run the predict on, so
>>>>> making
>>>>> a
>>>>> brick would be difficult. I use a rasterStack instead. Does your
>>>>> example
>>>>> work with a rasterStack?
>>>>>
>>>>> I can dive deeper if any of this isn't clear or if these two tweaks
>>>>> work
>>>>> just fine for you. I was just trying to swap out this version of
>>>>> predict
>>>>> with another parallel version to evaluate speed and, while the
>>>>> alternate
>>>>> version works fine, predict_rasterEngine bailed on me.
>>>>>
>>>>> Thanks in advance.
>>>>> Tim Howard
>>>>>
>>>>>
>>>>>
>>>>>>>>>>>
>>>>> Date: Tue, 18 Mar 2014 22:14:23 -0500
>>>>> From: Jonathan Greenberg <jgrn at illinois.edu>
>>>>> To: "r-sig-geo at r-project.org" <R-sig-Geo at r-project.org>
>>>>> Subject: [R-sig-Geo] Parallel predict now in spatial.tools
>>>>> Message-ID:
>>>>> <CABG0rfseg+p0h4HdYOK+_Za=OLMeTKHAT+TQn7g_FkEdYiunFQ at mail.gmail.com>
>>>>> Content-Type: text/plain; charset=ISO-8859-1
>>>>>
>>>>> R-sig-geo'ers:
>>>>>
>>>>> I finally got around to building a parallel predict statement that
>>>>> I've included in version 1.3.7 (or later) of spatial.tools (check
>>>>> http://r-forge.r-project.org/R/?group_id=1492 for the status of the
>>>>> build), "predict_rasterEngine". It should, in theory, be a direct
>>>>> swap-in for the standard generic predict() statement. Currently, it
>>>>> will work on any predict.* statement that has the following features:
>>>>> 1) The data is passed to the predict as a data frame via a newdata
>>>>> parameter, and
>>>>> 2) The data is returned from the predict statement as a vector/matrix.
>>>>>
>>>>> When using predict_rasterEngine, the object= parameter is your model,
>>>>> and the newdata= parameter is the raster/brick/stack to apply the
>>>>> model to on a pixel-by-pixel basis (note that the names of the layers
>>>>> must match the names of the predictor variables, in most cases).
>>>>>
>>>>> I was hoping to get some stress-testing on this, since it is a fairly
>>>>> oft-requested function. If a predict.* function you'd like to use
>>>>> doesn't work, let me know which function it is (with some test data)
>>>>> and I'll see if I can tweak it to work.
>>>>>
>>>>> Right now, I have confirmed this works with randomForest. Here's an
>>>>> example:
>>>>>
>>>>> ######################
>>>>>
>>>>> packages_required <- c("spatial.tools","doParallel","randomForest")
>>>>> lapply(packages_required, require, character.only=T)
>>>>>
>>>>> # Load up a 3-band image:
>>>>> tahoe_highrez <- setMinMax(
>>>>> brick(system.file("external/tahoe_highrez.tif",
>>>>> package="spatial.tools")))
>>>>> tahoe_highrez
>>>>> plotRGB(tahoe_highrez)
>>>>>
>>>>> # Load up some training points:
>>>>> tahoe_highrez_training_points <- readOGR(
>>>>> dsn=system.file("external", package="spatial.tools"),
>>>>> layer="tahoe_highrez_training_points")
>>>>>
>>>>> # Extract data to train the randomForest model:
>>>>> tahoe_highrez_training_extract <- extract(
>>>>> tahoe_highrez,
>>>>> tahoe_highrez_training_points,
>>>>> df=TRUE)
>>>>>
>>>>> # Fuse it back with the SPECIES info:
>>>>> tahoe_highrez_training_extract$SPECIES <-
>>>>> tahoe_highrez_training_points$SPECIES
>>>>>
>>>>> # Note the names of the bands:
>>>>> names(tahoe_highrez_training_extract) # the extracted data
>>>>> names(tahoe_highrez) # the brick
>>>>>
>>>>> # Generate a randomForest model:
>>>>> tahoe_rf <-
>>>>> randomForest(SPECIES~tahoe_highrez.1+tahoe_highrez.2+tahoe_highrez.3,
>>>>> data=tahoe_highrez_training_extract)
>>>>>
>>>>> tahoe_rf
>>>>>
>>>>> # This will run the predict in parallel:
>>>>> sfQuickInit()
>>>>> prediction_rf_class <-
>>>>>
>>>>>
>>>>>
>>>>> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="response")
>>>>> prediction_rf_prob <-
>>>>> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="prob")
>>>>> sfQuickStop()
>>>>>
>>>>> ###############
>>>>>
>>>>> --j
>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan A. Greenberg, PhD
>>>> Assistant Professor
>>>> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
>>>> Department of Geography and Geographic Information Science
>>>> University of Illinois at Urbana-Champaign
>>>> 259 Computing Applications Building, MC-150
>>>> 605 East Springfield Avenue
>>>> Champaign, IL 61820-6371
>>>> Phone: 217-300-1924
>>>> http://www.geog.illinois.edu/~jgrn/
>>>> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
>>
>> #here is a script that adds a categorical layer to the tahoe brick,
>> creates
>> a RF model
>> #and then tries both raster:::predict and predict_rasterEngine.
>>
>>
>> packages_required <- c("spatial.tools","doParallel","randomForest")
>> lapply(packages_required, require, character.only=T)
>>
>> # Load up a 3-band image:
>> tahoe_highrez <- setMinMax(
>> brick(system.file("external/tahoe_highrez.tif", package="spatial.tools")))
>> tahoe_highrez
>> plotRGB(tahoe_highrez)
>>
>> #create a categorical layer from band 1
>> mat <- matrix(c(0,50,1,50,150,2,150,255,3),ncol=3,byrow=TRUE)
>> bnd1cat <- reclassify(tahoe_highrez[[1]], rcl=mat)
>> bnd1cat <- ratify(bnd1cat)
>> rat <- levels(bnd1cat)[[1]]
>> rat$types <- c('type1', 'type2', 'type3', 'type4')
>> rat$code <- c(1,2,3,4)
>> levels(bnd1cat) <- rat
>>
>> #library(rasterVis) #if you want a categorical plot
>> #levelplot(bnd1_rc2)
>>
>> tahoe_highrez <- addLayer(tahoe_highrez, bnd1cat)
>> names(tahoe_highrez) <-
>> c("tahoeOne","tahoeTwo","tahoeThree","tahoeCatFour")
>>
>>
>> # Load up some training points:
>> tahoe_highrez_training_points <- readOGR(
>> dsn=system.file("external", package="spatial.tools"),
>> layer="tahoe_highrez_training_points")
>>
>> # Extract data to train the randomForest model:
>> tahoe_highrez_training_extract <-
>> extract(tahoe_highrez,tahoe_highrez_training_points,df=TRUE)
>>
>> # Fuse it back with the SPECIES info:
>> tahoe_highrez_training_extract$SPECIES <-
>> tahoe_highrez_training_points$SPECIES
>>
>> # Note the names of the bands:
>> names(tahoe_highrez_training_extract) # the extracted data
>> names(tahoe_highrez) # the brick
>>
>> # convert to factor, ensure all the levels are there
>> tahoe_highrez_training_extract$tahoeCatFour <-
>> factor(tahoe_highrez_training_extract$tahoeCatFour)
>> levels(tahoe_highrez_training_extract$tahoeCatFour) <- c(1,2,3,4)
>> str(tahoe_highrez_training_extract)
>>
>>
>> # Generate a randomForest model:
>> tahoe_rf <- randomForest(y=tahoe_highrez_training_extract$SPECIES,
>> x=tahoe_highrez_training_extract[,2:5],
>> data=tahoe_highrez_training_extract)
>>
>>
>> # try it with standard predict call -- this works
>> predict1_rf_prob <- predict(object=tahoe_highrez, model=tahoe_rf,
>> type="prob")
>>
>> # try it with rasterEngine -- this fails
>> sfQuickInit()
>> predict2_rf_prob <-
>> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="prob")
>> sfQuickStop()
>>
>> plot(predict1_rf_prob)
>>
>>
>>
>
>
>
> --
> Jonathan A. Greenberg, PhD
> Assistant Professor
> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
> Department of Geography and Geographic Information Science
> University of Illinois at Urbana-Champaign
> 259 Computing Applications Building, MC-150
> 605 East Springfield Avenue
> Champaign, IL 61820-6371
> Phone: 217-300-1924
> http://www.geog.illinois.edu/~jgrn/
> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
>
--
Jonathan A. Greenberg, PhD
Assistant Professor
Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
259 Computing Applications Building, MC-150
605 East Springfield Avenue
Champaign, IL 61820-6371
Phone: 217-300-1924
http://www.geog.illinois.edu/~jgrn/
AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
More information about the R-sig-Geo
mailing list