[R-sig-Geo] problem with predict() in package raster and factor variables

Gonzalez-Mirelis Genoveva genoveva.gonzalez-mirelis at imr.no
Sun May 1 11:01:51 CEST 2016


Dear Frede and list,

My sincere apologies for not providing sufficient information or a reproducible example. As a matter of fact, you are right: when I looked further into the data I used to estimate the random forest model, I solved the problem myself! In case it's of interest, the problem was that the variable had to be converted to a factor *before* fitting the model, so the result of str(v) should not be what I showed in my original mail, but instead:

'data.frame':	1257 obs. of  15 variables:
 $ RefNo      : int  16 16 16 16 17 17 17 17 18 18 ...
 $ PointID    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Count      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PA         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ split      : chr  "T" "T" "T" "T" ...
 $ bathy20_1  : num  256 260 252 266 281 ...
 $ TerClass   : Factor w/ 6 levels "1","2","3","4",..: 2 2 1 1 1 2 1 1 3 3 ...

etc

Note that before, $TerClass was num, and now it's Factor w/ 6 levels

And f should look like this:

$TerClass
[1] "1" "2" "3" "4" "5" "6"

Then the predict() function works without any problems!
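
For readers who want to see the fix in isolation, here is a minimal sketch in base R. The small data frame 'v' below is a made-up stand-in for the real training data (only two of its columns, with invented values), so it only illustrates the conversion step and the shape of 'f', not the actual model fit:

```r
# Toy stand-in for the training data; values are invented for illustration.
v <- data.frame(PA       = c(0, 0, 1, 1, 0, 1),
                TerClass = c(2, 2, 1, 3, 1, 3))

# Convert the categorical predictor to a factor BEFORE fitting the model.
v$TerClass <- as.factor(v$TerClass)
str(v$TerClass)

# Build the list passed as 'factors' from the factor levels.
# levels() returns character strings, not numbers.
f <- list(TerClass = levels(v$TerClass))
f
```

The model would then be fitted on this converted data, and 'f' passed as the factors argument of predict(), as in the call from the original mail.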

Furthermore, there is an example in the help file that exactly represents my case, namely this bit:

library(raster)
# create a RasterStack or RasterBrick with a set of predictor layers
logo <- brick(system.file("external/rlogo.grd", package="raster"))
# known presence and absence points
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85, 
              66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31, 
              22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
              99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
              37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
# extract values for points
xy <- rbind(cbind(1, p), cbind(0, a))
v1 <- data.frame(cbind(pa=xy[,1], extract(logo, xy[,2:3])))
# cforest (other Random Forest implementation) example with factors argument
v1$red <- as.factor(round(v1$red/100))
logo$red <- round(logo[[1]]/100)
library(party)
m <- cforest(pa~., control=cforest_unbiased(mtry=3), data=v1)
f1 <- list(levels(v1$red))
names(f1) <- 'red'
pc <- predict(logo, m, OOB=TRUE, factors=f1)

Thank you all very much, and my apologies for wasting anyone's time.

Genoveva

 
________________________________________
From: Frede Aakmann Tøgersen <frtog at vestas.com>
Sent: Sunday, May 1, 2016 6:42 AM
To: Gonzalez-Mirelis Genoveva; r-sig-geo at r-project.org
Subject: RE: [R-sig-Geo] problem with predict() in package raster and factor variables

Hi Genoveva

You haven't got a response to your question mainly due to a) missing information and b) missing reproducible example.

If you had provided the missing information I guess you would have solved the problem yourself.

I have never used raster::predict(), but having looked at the man page for that function and at your error message, there are probably some differences between the data used to estimate the random forest model (you call that a subset of the object 'v') and the data in 'subbrick'. You should provide the structure of the data used to fit the random forest model and of 'subbrick':


> str(v)
> str(subbrick)

Please also show all the relevant R code needed to obtain what you want, in case the error message is not related to a difference in how the subset of 'v' and 'subbrick' were created.
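
A comparison of this kind can be sketched in base R by tabulating the class of each shared column in the two data sets. The data frames 'train' and 'newdata' below are hypothetical toy objects, not the real 'v' or 'subbrick'. The TerClass column of 'newdata' is deliberately made a factor to mimic a mismatch; this rests on the assumption that supplying a 'factors' list causes that column of the new data to be treated as a factor at prediction time:

```r
# Hypothetical toy data: numeric TerClass in the training data,
# factor TerClass in the new data.
train   <- data.frame(bathy20_1 = c(256, 260, 252),
                      TerClass  = c(2, 2, 1))
newdata <- data.frame(bathy20_1 = c(270, 281, 266),
                      TerClass  = factor(c(3, 1, 2)))

# Compare the (first) class of every column present in both data sets.
shared <- intersect(names(train), names(newdata))
cmp <- data.frame(variable = shared,
                  train    = sapply(train[shared], function(x) class(x)[1]),
                  newdata  = sapply(newdata[shared], function(x) class(x)[1]),
                  row.names = NULL, stringsAsFactors = FALSE)
cmp$match <- cmp$train == cmp$newdata

# Show only the variables whose classes differ.
cmp[!cmp$match, ]
```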


Yours sincerely / Med venlig hilsen

Frede Aakmann Tøgersen
Specialist, M.Sc., Ph.D.
Plant Performance & Modeling

Technology & Service Solutions
T +45 9730 5135
M +45 2547 6050
frtog at vestas.com
http://www.vestas.com




-----Original Message-----
From: R-sig-Geo [mailto:r-sig-geo-bounces at r-project.org] On Behalf Of Gonzalez-Mirelis Genoveva
Sent: 30. april 2016 12:33
To: r-sig-geo at r-project.org
Subject: [R-sig-Geo] problem with predict() in package raster and factor variables

Dear list,
I am trying to use the function predict() (in package raster), where I supply: the new data as a RasterBrick, the model (as fit in previous steps and using a different dataset), and a few more arguments including the levels of my only categorical variable. Here is the code I'm using:

 r1 <- predict(subbrick,
              CIF.pa,
              type="response", OOB=T, factors=f)

But I keep getting the following error:

Error in checkData(oldData, RET) :

  Classes of new data do not match original data

Here are more details:

> CIF.pa

         Random Forest using Conditional Inference Trees

Number of trees:  1000

Response:  PA
Inputs:  bathy20_1, TerClass, Smax_ann, Smean_ann, Smin_ann, SPDmax_ann, SPDmean_ann, Tmax_ann, Tmean_ann, Tmin_ann
Number of observations:  986

Where 'TerClass' is a categorical variable.

Here is the data used to train CIF.pa:


> str(v)
'data.frame':   1257 obs. of  15 variables:
 $ RefNo      : int  16 16 16 16 17 17 17 17 18 18 ...
 $ PointID    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Count      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PA         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ split      : chr  "T" "T" "T" "T" ...
 $ bathy20_1  : num  256 260 252 266 281 ...
 $ TerClass   : num  2 2 1 1 1 2 1 1 3 3 ...
 $ Smax_ann   : num  35.1 35.1 35.1 35.1 35.1 ...
 $ Smean_ann  : num  35.1 35.1 35.1 35.1 35.1 ...
 $ Smin_ann   : num  34.9 34.9 34.9 34.9 35 ...
 $ SPDmax_ann : num  0.379 0.376 0.378 0.372 0.352 ...
 $ SPDmean_ann: num  0.14 0.137 0.14 0.132 0.12 ...
 $ Tmax_ann   : num  6.97 6.92 7.04 6.87 6.68 ...
 $ Tmean_ann  : num  5.76 5.73 5.79 5.71 5.54 ...
 $ Tmin_ann   : num  4.41 4.32 4.52 4.25 4.07 ...

But actually, I used a subset of v to train the model, namely the rows where v$split=='T'

Below are the values and class for TerClass for that subset

> unique(v[v$split=='T',7])
[1] 2 1 3 4 6 5
> class(v$TerClass)
[1] "numeric"

And below are the values and class for the corresponding layer of the RasterBrick:

> unique(values(subbrick$TerClass))
[1] 3 1 2 4 5 6

> class(values(subbrick$TerClass))
[1] "numeric"

And finally, here is what f looks like:

> f
$TerClass
[1] 2 1 3 4 6 5
> class(f)
[1] "list"

As far as I can see the classes in OldData and NewData should be the same, but the error persists. Any ideas on what I could be missing?

Unfortunately I am unable to reproduce the problem (I only encounter it when using my data), but any help will be hugely appreciated

Also, I am aware that I asked this question before (Apr 04, 2013; 1:22pm). Unfortunately I haven't gotten very far since then!

Many thanks in advance for any pointers.

Genoveva

_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo


