<html><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><META name="Author" content="Novell GroupWise WebAccess"></head><body style='font-family: Helvetica, Arial, sans-serif; font-size: 13px; '>Yes, I caught that typo after sending the email. The same happens with "type" rather than "newtype". Sorry. <br><br>Yes, absolutely, there are NA values in the newdata. My area of interest isn't rectangular and so the edges of all layers have NA values. Yes, it is definitely more efficient to skip all rows with any NA cells before sending it into RF. It is pretty critical to handle that for increased speed. <br><br>I don't use the RAT either ... my first exposure to it was this morning trying to create a clearly categorical layer for the Tahoe data set. That's all. But I do have categorical data that are handled appropriately (without a RAT) in my data flow. I was just trying to add categorical data to the data flow in case that was what was throwing your function. It sounds like that issue might be moot. That's what I get for guessing!<br><br>Thanks for your quick reply. <br>Cheers, <br>Tim<br><br>>>> Jonathan Greenberg <jgrn@illinois.edu> 03/28/14 10:38 AM >>><br>Tim:<br><br>I notice you have:<br><br>predict_rasterEngine(object=rf.full, newdata=envStack, newtype='prob')<br><br>Was that a typo in your email? If not, this might be your issue -- the<br>correct parameter is "type", not "newtype" -- I noticed you are using<br>"type" in the raster:::predict:<br><br>?predict.randomForest<br><br>If you still get that error when you use type="prob", I think the<br>issue is that the raster probably has NA values (that is what usually<br>triggers the "missing values in newdata" error for<br>predict.randomForest) -- this is a valid issue, so I'll need to fix<br>the code a bit to deal with it -- can you confirm if its the newtype<br>issue first? I can probably push out a fix later this weekend if it<br>is, indeed, the NA issue.<br><br>Another important issue is that I don't, at present, support the RAT,<br>but I can also work on a fix for that.<br><br>You are correct re: the object is the model, and the newdata parameter<br>is the raster/brick/stack.<br><br>Cheers!<br><br>--j<br><br>On Fri, Mar 28, 2014 at 8:58 AM, Tim Howard <tghoward@gw.dec.state.ny.us> wrote:<br>> Jonathan,<br>> Thanks again for your reply.<br>> I'll reply inline ... please see below<br>><br>>>>>> Jonathan Greenberg <jgrn@illinois.edu> 03/26/14 10:55 AM >>><br>><br>>>Tim:<br>>><br>>>One thing to make sure is if the stack has the bands named properly<br>>>>(they need to match the formula), e.g.:<br>>>>names(mystack) <- c("b1","b2",...)<br>><br>> Yes, I think I'm good on layer names, as well as maintaining all the levels<br>> in categorical data (a classic gotcha for RF).<br>><br>>><br>>>If you can also pass along the error you are getting, that will go a<br>>>long way towards figuring out the issue! Cheers!<br>><br>> I was a little reluctant to pass on the error message until I knew a little<br>> more. The main reason being that the normal predict function with raster<br>> *does* work even though you might not expect it with the error message. Here<br>> is the error coming out of predict_rasterEngine:<br>><br>>> sfQuickInit(cpus=6)<br>> socket cluster with 6 nodes on host 'localhost'<br>>> prediction_rf_prob <- predict_rasterEngine(object=rf.full,<br>>> newdata=envStack, newtype='prob')<br>> Error in predict.randomForest(object = object, newdata = newdata_df,<br>> mget(model_parameters)) :<br>> missing values in newdata<br>>> sfQuickStop()<br>><br>><br>> Again, I can run these exact data using raster:::predict and I get good<br>> output. To me that means I do not have 'missing values in newdata'. I<br>> welcome thoughts otherwise, however. The predict call is structured like<br>> this:<br>><br>> prediction_rf_prob <- predict(object=envStack, model=rf.full, type='prob',<br>> progress = 'text')<br>><br>> So, I've tried recreating the problem by tweaking the Tahoe data. Below,<br>> I've added a categorical layer to the brick. Unfortunately, it errors out<br>> with a new error. (Null external pointer). (Am I right that<br>> raster:::predict and predict_rasterEngine use the term "object"<br>> differently?)<br>><br>> I'll paste a replacement set of code for the Tahoe data below our string.<br>><br>> I welcome comments on any of this!<br>><br>> Cheers,<br>> Tim<br>><br>><br>>>--j<br>>><br>>>On Wed, Mar 26, 2014 at 9:48 AM, Tim Howard <tghoward@gw.dec.state.ny.us><br>>> wrote:<br>>>> Jonathan,<br>>>> Thanks for your quick reply. I just tried your Tahoe example using x and<br>>>> y<br>>>> in the call rather than a formula and it worked fine, so I was barking up<br>>>> the wrong tree. Sorry. I'll try to delve deeper and put together a subset<br>>>> example for you to check out if I can't get anywhere.<br>>>><br>>>> Thanks,<br>>>> Tim<br>>>><br>>>>>>> Jonathan Greenberg <jgrn@illinois.edu> 3/26/2014 10:23 AM >>><br>>>><br>>>> Hi Tim:<br>>>><br>>>> Re: stack, yep, it should work. In general, you get a decreased<br>>>> performance from stacks since you are having to read from multiple<br>>>> files rather than a single one, but it will still benefit from<br>>>> parallel processing.<br>>>><br>>>> Re: formula -- the best way for me to test this is to crop out a piece<br>>>> of your image and send me the random forest model (use ?save), and the<br>>>> call you were using. In theory you should be able to use anything you<br>>>> would normally use on a data frame, but I'd have to play with it to<br>>>> confirm! If you can set up those on e.g. google drive I can test it<br>>>> out. Cheers!<br>>>><br>>>> --j<br>>>><br>>>> On Wed, Mar 26, 2014 at 6:49 AM, Tim Howard <tghoward@gw.dec.state.ny.us><br>>>> wrote:<br>>>>> Jonathan,<br>>>>> Thank you for putting this together and for the example. I'm doing two<br>>>>> things differently with randomForest ... I think perhaps one of them the<br>>>>> function isn't handling.<br>>>>><br>>>>> First, based on recommendations from Andy Liaw (and ?randomForest), I<br>>>>> don't<br>>>>> use the formula interface but use x=<many columns>, y=<a column> in the<br>>>>> call. Does predict_rasterEngine handle the absence of a formula in the<br>>>>> object?<br>>>>><br>>>>> Second, I have many large rasters I want to run the predict on, so<br>>>>> making<br>>>>> a<br>>>>> brick would be difficult. I use a rasterStack instead. Does your example<br>>>>> work with a rasterStack?<br>>>>><br>>>>> I can dive deeper if any of this isn't clear or if these two tweaks work<br>>>>> just fine for you. I was just trying to swap out this version of predict<br>>>>> with another parallel version to evaluate speed and, while the alternate<br>>>>> version works fine, predict_rasterEngine bailed on me.<br>>>>><br>>>>> Thanks in advance.<br>>>>> Tim Howard<br>>>>><br>>>>><br>>>>><br>>>>>>>>>>><br>>>>> Date: Tue, 18 Mar 2014 22:14:23 -0500<br>>>>> From: Jonathan Greenberg <jgrn@illinois.edu><br>>>>> To: "r-sig-geo@r-project.org" <R-sig-Geo@r-project.org><br>>>>> Subject: [R-sig-Geo] Parallel predict now in spatial.tools<br>>>>> Message-ID:<br>>>>> <CABG0rfseg+p0h4HdYOK+_Za=OLMeTKHAT+TQn7g_FkEdYiunFQ@mail.gmail.com><br>>>>> Content-Type: text/plain; charset=ISO-8859-1<br>>>>><br>>>>> R-sig-geo'ers:<br>>>>><br>>>>> I finally got around to building a parallel predict statement that<br>>>>> I've included in version 1.3.7 (or later) of spatial.tools (check<br>>>>> http://r-forge.r-project.org/R/?group_id=1492 for the status of the<br>>>>> build), "predict_rasterEngine". It should, in theory, be a direct<br>>>>> swap-in for the standard generic predict() statement. Currently, it<br>>>>> will work on any predict.* statement that has the following features:<br>>>>> 1) The data is passed to the predict as a data frame via a newdata<br>>>>> parameter, and<br>>>>> 2) The data is returned from the predict statement as a vector/matrix.<br>>>>><br>>>>> When using predict_rasterEngine, the object= parameter is your model,<br>>>>> and the newdata= parameter is the raster/brick/stack to apply the<br>>>>> model to on a pixel-by-pixel basis (note that the names of the layers<br>>>>> must match the names of the predictor variables, in most cases).<br>>>>><br>>>>> I was hoping to get some stress-testing on this, since it is a fairly<br>>>>> oft-requested function. If a predict.* function you'd like to use<br>>>>> doesn't work, let me know which function it is (with some test data)<br>>>>> and I'll see if I can tweak it to work.<br>>>>><br>>>>> Right now, I have confirmed this works with randomForest. Here's an<br>>>>> example:<br>>>>><br>>>>> ######################<br>>>>><br>>>>> packages_required <- c("spatial.tools","doParallel","randomForest")<br>>>>> lapply(packages_required, require, character.only=T)<br>>>>><br>>>>> # Load up a 3-band image:<br>>>>> tahoe_highrez <- setMinMax(<br>>>>> brick(system.file("external/tahoe_highrez.tif",<br>>>>> package="spatial.tools")))<br>>>>> tahoe_highrez<br>>>>> plotRGB(tahoe_highrez)<br>>>>><br>>>>> # Load up some training points:<br>>>>> tahoe_highrez_training_points <- readOGR(<br>>>>> dsn=system.file("external", package="spatial.tools"),<br>>>>> layer="tahoe_highrez_training_points")<br>>>>><br>>>>> # Extract data to train the randomForest model:<br>>>>> tahoe_highrez_training_extract <- extract(<br>>>>> tahoe_highrez,<br>>>>> tahoe_highrez_training_points,<br>>>>> df=TRUE)<br>>>>><br>>>>> # Fuse it back with the SPECIES info:<br>>>>> tahoe_highrez_training_extract$SPECIES <-<br>>>>> tahoe_highrez_training_points$SPECIES<br>>>>><br>>>>> # Note the names of the bands:<br>>>>> names(tahoe_highrez_training_extract) # the extracted data<br>>>>> names(tahoe_highrez) # the brick<br>>>>><br>>>>> # Generate a randomForest model:<br>>>>> tahoe_rf <-<br>>>>> randomForest(SPECIES~tahoe_highrez.1+tahoe_highrez.2+tahoe_highrez.3,<br>>>>> data=tahoe_highrez_training_extract)<br>>>>><br>>>>> tahoe_rf<br>>>>><br>>>>> # This will run the predict in parallel:<br>>>>> sfQuickInit()<br>>>>> prediction_rf_class <-<br>>>>><br>>>>><br>>>>> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="response")<br>>>>> prediction_rf_prob <-<br>>>>> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="prob")<br>>>>> sfQuickStop()<br>>>>><br>>>>> ###############<br>>>>><br>>>>> --j<br>>>><br>>>><br>>>><br>>>> --<br>>>> Jonathan A. Greenberg, PhD<br>>>> Assistant Professor<br>>>> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory<br>>>> Department of Geography and Geographic Information Science<br>>>> University of Illinois at Urbana-Champaign<br>>>> 259 Computing Applications Building, MC-150<br>>>> 605 East Springfield Avenue<br>>>> Champaign, IL 61820-6371<br>>>> Phone: 217-300-1924<br>>>> http://www.geog.illinois.edu/~jgrn/<br>>>> AIM: jgrn307, MSN: jgrn307@hotmail.com, Gchat: jgrn307, Skype: jgrn3007<br>><br>> #here is a script that adds a categorical layer to the tahoe brick, creates<br>> a RF model<br>> #and then tries both raster:::predict and predict_rasterEngine.<br>><br>><br>> packages_required <- c("spatial.tools","doParallel","randomForest")<br>> lapply(packages_required, require, character.only=T)<br>><br>> # Load up a 3-band image:<br>> tahoe_highrez <- setMinMax(<br>> brick(system.file("external/tahoe_highrez.tif", package="spatial.tools")))<br>> tahoe_highrez<br>> plotRGB(tahoe_highrez)<br>><br>> #create a categorical layer from band 1<br>> mat <- matrix(c(0,50,1,50,150,2,150,255,3),ncol=3,byrow=TRUE)<br>> bnd1cat <- reclassify(tahoe_highrez[[1]], rcl=mat)<br>> bnd1cat <- ratify(bnd1cat)<br>> rat <- levels(bnd1cat)[[1]]<br>> rat$types <- c('type1', 'type2', 'type3', 'type4')<br>> rat$code <- c(1,2,3,4)<br>> levels(bnd1cat) <- rat<br>><br>> #library(rasterVis) #if you want a categorical plot<br>> #levelplot(bnd1_rc2)<br>><br>> tahoe_highrez <- addLayer(tahoe_highrez, bnd1cat)<br>> names(tahoe_highrez) <- c("tahoeOne","tahoeTwo","tahoeThree","tahoeCatFour")<br>><br>><br>> # Load up some training points:<br>> tahoe_highrez_training_points <- readOGR(<br>> dsn=system.file("external", package="spatial.tools"),<br>> layer="tahoe_highrez_training_points")<br>><br>> # Extract data to train the randomForest model:<br>> tahoe_highrez_training_extract <-<br>> extract(tahoe_highrez,tahoe_highrez_training_points,df=TRUE)<br>><br>> # Fuse it back with the SPECIES info:<br>> tahoe_highrez_training_extract$SPECIES <-<br>> tahoe_highrez_training_points$SPECIES<br>><br>> # Note the names of the bands:<br>> names(tahoe_highrez_training_extract) # the extracted data<br>> names(tahoe_highrez) # the brick<br>><br>> # convert to factor, ensure all the levels are there<br>> tahoe_highrez_training_extract$tahoeCatFour <-<br>> factor(tahoe_highrez_training_extract$tahoeCatFour)<br>> levels(tahoe_highrez_training_extract$tahoeCatFour) <- c(1,2,3,4)<br>> str(tahoe_highrez_training_extract)<br>><br>><br>> # Generate a randomForest model:<br>> tahoe_rf <- randomForest(y=tahoe_highrez_training_extract$SPECIES,<br>> x=tahoe_highrez_training_extract[,2:5],<br>> data=tahoe_highrez_training_extract)<br>><br>><br>> # try it with standard predict call -- this works<br>> predict1_rf_prob <- predict(object=tahoe_highrez, model=tahoe_rf,<br>> type="prob")<br>><br>> # try it with rasterEngine -- this fails<br>> sfQuickInit()<br>> predict2_rf_prob <-<br>> predict_rasterEngine(object=tahoe_rf,newdata=tahoe_highrez,type="prob")<br>> sfQuickStop()<br>><br>> plot(predict1_rf_prob)<br>><br>><br>><br><br><br><br>-- <br>Jonathan A. Greenberg, PhD<br>Assistant Professor<br>Global Environmental Analysis and Remote Sensing (GEARS) Laboratory<br>Department of Geography and Geographic Information Science<br>University of Illinois at Urbana-Champaign<br>259 Computing Applications Building, MC-150<br>605 East Springfield Avenue<br>Champaign, IL 61820-6371<br>Phone: 217-300-1924<br>http://www.geog.illinois.edu/~jgrn/<br>AIM: jgrn307, MSN: jgrn307@hotmail.com, Gchat: jgrn307, Skype: jgrn3007<br><br></body></html>