[R-sig-eco] Getting output from predict.randomForest
Griffith.Michael at epamail.epa.gov
Fri Sep 26 16:58:54 CEST 2008
I have been trying to use randomForest, and specifically
predict.randomForest, as follows:
for (y in 7:42) {
  # Columns 1:5 hold the sample ID and predictors; column y is the response
  data1 <- indata[c(1:5, y)]
  test1 <- test[c(1:5, y)]
  # Drop rows with missing values
  data1 <- na.omit(data1)
  test1 <- na.omit(test1)
  set.seed(1234)
  tree <- randomForest(x = data1[, 2:5], y = data1[, 6], ntree = 1000, mtry = 3,
                       importance = TRUE, keep.forest = TRUE)
  summary(tree)
  print(tree)
  tree.predict <- predict(tree, test1[, 2:6], type = "response",
                          nodes = TRUE)
  table(observed = test1, predicted = tree.predict)
  varUsed(tree, count = TRUE)
}
The data set, data1, has the following form, with ERClass and ChanClass
being factors:
  FieldNum ERClass ChanClass DrainageArea PctFines Clinger
1  04LM099       5         1   10.2791962 0.000000      10
2  04LM127       5         1   44.9838181 0.000000      10
3  96SC002       3         1  668.9939004 0.000000      29
4  96SC037       3         1  241.9048792 0.000000      23
5  97LS051       3         1  342.3964136 0.000000      17
...
In this example, FieldNum is a sample identifier that is not used in the
analysis, Clinger is the dependent variable, and the other variables are
the independent variables. The data set, test1, has the same variables
and is a subset of 12 samples that were removed from data1 prior to the
analysis.
What I would like is a prediction of the characteristics of the end nodes
where the majority of the trees place each of these 12 samples (i.e.,
something like ERClass = 3, ChanClass = 2 or 3, DrainageArea > 400,
PctFines < 10, although I have not found an example for a similar
problem, so I am not sure exactly what it will look like).
However, the output I am currently getting is:
Call:
 randomForest(x = data1[, 2:5], y = data1[, 6], ntree = 1000, mtry = 3, importance = TRUE, keep.forest = TRUE)
               Type of random forest: regression
                     Number of trees: 1000
No. of variables tried at each split: 3
          Mean of squared residuals: 17.6679
                    % Var explained: 49.65
Error in predict.randomForest(tree, test1[, 1:6], type = "response", nodes = TRUE) :
  Type of predictors in new data do not match that of the training data.
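One common cause of this kind of mismatch is that a factor in the new data
carries a different set of levels than the same factor in the training data,
for example when the two data frames were read in separately. The following
is a minimal base-R sketch of one way to check for and remove that
difference. The data frames `train` and `newdat` and the helper
`align_factor_levels` are hypothetical stand-ins, not part of the original
post or of the randomForest package, and this may not be the only cause of
the error.

```r
# Hypothetical stand-ins for the training and test predictors
train <- data.frame(
  ERClass      = factor(c(5, 5, 3, 3, 3, 4)),
  ChanClass    = factor(c(1, 1, 1, 2, 1, 3)),
  DrainageArea = c(10.3, 45.0, 669.0, 241.9, 342.4, 120.0),
  PctFines     = c(0, 0, 0, 0, 0, 5)
)
newdat <- data.frame(
  ERClass      = factor(c(3, 5)),   # only two of the three training levels
  ChanClass    = factor(c(1, 2)),
  DrainageArea = c(400.0, 20.0),
  PctFines     = c(2, 0)
)

# Compare factor levels column by column
sapply(names(train), function(nm) identical(levels(train[[nm]]), levels(newdat[[nm]])))

# Hypothetical helper: re-code each factor in newdata using the training levels.
# Values not seen in training become NA.
align_factor_levels <- function(newdata, training) {
  for (nm in intersect(names(training), names(newdata))) {
    if (is.factor(training[[nm]])) {
      newdata[[nm]] <- factor(as.character(newdata[[nm]]),
                              levels = levels(training[[nm]]))
    }
  }
  newdata
}

newdat_aligned <- align_factor_levels(newdat, train)
str(newdat_aligned)
```

After alignment, both data frames describe each factor with the same level
set, so a model fitted on one should see the same predictor types in the
other.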
Clearly, something is wrong with my predict statement, but what? Do I
need to re-identify which variables are x and which variable is y? If
so, how? Also, am I going to get the result I am looking for? If not,
how do I need to write this to get that? The help pages I have found
have been very inadequate.
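As a rough illustration of what `nodes = TRUE` returns, the sketch below
fits a small regression forest on made-up data and looks up the terminal
node that one tree assigns to one test sample. The toy data and the object
names (`train`, `newdat`, `rf`, `node_ids`) are assumptions for
illustration only. The node IDs are returned in the "nodes" attribute of
the prediction, and `getTree` shows the structure of a single tree.

```r
library(randomForest)

set.seed(1234)
n <- 60
# Hypothetical toy data standing in for data1 (not the real samples)
train <- data.frame(
  ERClass      = factor(sample(c(3, 4, 5), n, replace = TRUE)),
  ChanClass    = factor(sample(1:3, n, replace = TRUE)),
  DrainageArea = runif(n, 10, 700),
  PctFines     = runif(n, 0, 30)
)
train$Clinger <- round(10 + 0.02 * train$DrainageArea + rnorm(n))

# Reuse training rows as "new" data so factor levels match by construction
newdat <- train[1:5, ]

rf <- randomForest(x = train[, 1:4], y = train$Clinger,
                   ntree = 50, keep.forest = TRUE)

# Predictions, with terminal node IDs attached as an attribute
pred <- predict(rf, newdat[, 1:4], nodes = TRUE)
node_ids <- attr(pred, "nodes")   # one row per sample, one column per tree
dim(node_ids)

# Structure of the first tree; row numbers are node IDs
tree1 <- getTree(rf, k = 1, labelVar = TRUE)

# Terminal node reached by the first sample in the first tree
tree1[node_ids[1, 1], ]
```

Note that this gives node IDs rather than rules such as DrainageArea > 400.
Recovering the split conditions for a terminal node would mean tracing the
path from the root of that tree down to the node, and each of the trees in
the forest generally has a different path.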
Thanks for your help.
Michael
Michael B. Griffith, Ph.D.
Research Ecologist
USEPA, NCEA (MS A-110)
26 W. Martin Luther King Dr.
Cincinnati, OH 45268
telephone: 513 569-7034
e-mail: griffith.michael at epa.gov