[R-sig-eco] Getting output from predict.randomForest

Fri Sep 26 16:58:54 CEST 2008

I have been trying to use randomForest and specifically predict for
randomForest as follows:

library(randomForest)

for (y in 7:42) {
  # Keep the ID, the four predictors, and the y-th response column
  data1 <- indata[c(1:5, y)]
  test1 <- test[c(1:5, y)]
  data1 <- na.omit(data1)
  test1 <- na.omit(test1)
  set.seed(1234)
  tree <- randomForest(x = data1[, 2:5], y = data1[, 6], ntree = 1000,
                       mtry = 3, importance = TRUE, keep.forest = TRUE)
  summary(tree)
  print(tree)
  tree.predict <- predict(tree, test1[, 2:6], type = "response",
                          nodes = TRUE)
  table(observed = test1, predicted = tree.predict)
  varUsed(tree, count = TRUE)
}

The data set, data1, has the following form, with ERClass and ChanClass
being factors:

    FieldNum ERClass ChanClass DrainageArea PctFines Clinger
1    04LM099       5         1   10.2791962 0.000000      10
2    04LM127       5         1   44.9838181 0.000000      10
3    96SC002       3         1  668.9939004 0.000000      29
4    96SC037       3         1  241.9048792 0.000000      23
5    97LS051       3         1  342.3964136 0.000000      17
.
.
.

In this example, FieldNum is a sample identifier that is not used in the
analysis, and Clinger is the dependent variable.  The other variables
are the independent variables.  The data set, test1, has the same
variables and consists of 12 samples that were removed from data1 prior
to the analysis.

What I would like is a description of the end nodes where the majority
of the trees place each of these 12 samples, expressed in terms of the
predictors (i.e., something like ERClass = 3, ChanClass = 2 or 3,
DrainageArea > 400, PctFines < 10).  I have not found an example for a
similar problem, so I am not sure exactly what the output will look
like.

However, the output I am currently getting is:

Call:
 randomForest(x = data1[, 2:5], y = data1[, 6], ntree = 1000,
     mtry = 3, importance = TRUE, keep.forest = TRUE)
               Type of random forest: regression
                     Number of trees: 1000
No. of variables tried at each split: 3

          Mean of squared residuals: 17.6679
                    % Var explained: 49.65
Error in predict.randomForest(tree, test1[, 1:6], type = "response",
    nodes = TRUE) :
  Type of predictors in new data do not match that of the training data.

Clearly, something is wrong with my predict statement, but what?  Do I
need to re-identify which variables are x and which variable is y?  If
so, how?  Also, am I going to get the result I am looking for?  If not,
how do I need to write this to get that?  The help pages I have found
have been very inadequate.
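
One possibility I have not been able to rule out is that the new data
must contain only the predictor columns, in the same order as in the
training data, with every factor carrying the same levels.  The sketch
below uses base R only and shows what such an alignment might look like.
The data frames train and newdat and all of their values are invented
for illustration; they stand in for data1[, 2:5] and the matching
columns of test1, and I do not know whether this is actually the cause
of the error.

```r
# Toy stand-ins for the training and test predictors (values invented).
train <- data.frame(
  ERClass      = factor(c(3, 3, 5, 5)),
  ChanClass    = factor(c(1, 2, 1, 2)),
  DrainageArea = c(668.99, 241.90, 10.28, 44.98),
  PctFines     = c(0, 0, 0, 0)
)
newdat <- data.frame(
  ERClass      = factor(c(3, 3)),   # only one of the two training levels
  ChanClass    = factor(c(1, 1)),
  DrainageArea = c(342.40, 500.00),
  PctFines     = c(0, 5)
)

# Re-code each factor in the new data using the training levels.
for (v in names(train)) {
  if (is.factor(train[[v]])) {
    newdat[[v]] <- factor(newdat[[v]], levels = levels(train[[v]]))
  }
}

# Check: same columns in the same order, same levels for every factor.
same_names  <- identical(names(newdat), names(train))
same_levels <- all(sapply(names(train), function(v)
  identical(levels(newdat[[v]]), levels(train[[v]]))))
```

If both checks come back TRUE, the new data at least has the same layout
as the training predictors; whether predict() then accepts it is
something I would still need to test.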

Thanks for your help.

Michael

Michael B. Griffith, Ph.D.
Research Ecologist

USEPA, NCEA (MS A-110)
26 W. Martin Luther King Dr.
Cincinnati, OH  45268

telephone:  513 569-7034
e-mail:  griffith.michael at epa.gov