[R-sig-eco] Ordering of nominal or categorical variables in randomForest?

Griffith.Michael at epamail.epa.gov Griffith.Michael at epamail.epa.gov
Tue Jul 8 19:57:52 CEST 2008


I am attempting to use randomForest to do classification and regression
tree analysis.  After importing the data set, I use the following
statement:

  tree <- randomForest(POET ~ DrainageArea + PctFines + L3_ER + ChanCon,
                       data=data, ntree=500, mtry=2,
                       replace=TRUE, importance=TRUE, do.trace=TRUE,
                       keep.forest=TRUE)

The first two independent variables are continuous numerical variables,
while the last two are categorical variables with more than two classes.
The package seems to handle this mixture of numerical and categorical
variables, but I am unclear how to interpret the splits for the
categorical variables.

The table describing the splits has a column, split point, which for
numerical variables is the value of the indicated variable at which the
node is split: observations below the value go into the left daughter
group and observations above it go into the right daughter group.

The documentation states that for categorical variables, split point is
an integer, whose binary expansion identifies which categories go into
the left and right daughter groups.  It gives an example of a variable
with three classes and a split value of 5, which expands to 1 0 1.  In
this case, the first and third classes go into the left daughter group
and the second class goes into the right daughter group.
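
To make the binary expansion concrete, here is a minimal sketch in base R that decodes a split value into left/right assignments. The helper name decode_split and the level names "A", "B", "C" are hypothetical, and the sketch assumes that the lowest-order bit corresponds to the first level of the factor. For the documented example (three classes, split value 5) the expansion is symmetric, so that assumption does not affect the result.

```r
# Decode a categorical split value into per-level left/right assignments.
# Assumption: bit k, counted from the least significant bit, belongs to level k.
# decode_split is a hypothetical helper, not part of randomForest.
decode_split <- function(split_value, level_names) {
  n <- length(level_names)
  # Integer-divide by 1, 2, 4, ... and keep the remainder mod 2 to get each bit.
  bits <- (split_value %/% 2^(0:(n - 1))) %% 2
  # A bit of 1 means the level is sent to the left daughter group.
  setNames(ifelse(bits == 1, "left", "right"), level_names)
}

# Documented example: three classes, split value 5, binary expansion 1 0 1.
decoded <- decode_split(5, c("A", "B", "C"))
print(decoded)
```

Under these assumptions the first and third levels are labelled "left" and the second "right", which matches the description above.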

My question now is:  How does the package order the classes of a
categorical variable?  This is not clear in the documentation, and if
this is something basic to R, I have not found it in the help files.  In
my example, the variable, L3_ER, has five classes, DRAR, NCHF, NGPI,
NoLF, and WCBP.  These levels are not ordered in any particular way in
the data set.  I can think of two ways the package might order the
classes:  1. alphabetically or 2. in the order that they are first
encountered in the data set.  Is either of these correct, or might there
be some other way of ordering the levels that I have not thought of?
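
One way to check the two candidate orderings is to look at how base R itself orders factor levels. The sketch below uses a hypothetical character vector named ecoregion, not the real data. By default, factor() sorts the unique values, while passing levels = unique(x) keeps the order of first appearance. It seems reasonable to assume, though the sketch does not verify it, that randomForest counts categories in the order returned by levels(), since a factor is stored as integer codes indexing that vector.

```r
# Hypothetical values, deliberately listed in non-alphabetical order.
ecoregion <- c("NoLF", "DRAR", "WCBP", "NGPI", "NCHF", "DRAR")

# Default behaviour: levels are the sorted unique values.
f_default <- factor(ecoregion)

# Explicit alternative: levels in order of first appearance.
f_by_appearance <- factor(ecoregion, levels = unique(ecoregion))

print(levels(f_default))
print(levels(f_by_appearance))

# The integer codes stored in the factor index into levels().
print(as.integer(f_default))
```

Running levels() on the actual column of the imported data frame would show which ordering applies there.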

A colleague suggested that I might use is.ordered(), but I get the error
message:

  Error in inherits(x, "factor") : object "L3_ER" not found

Any other suggestions are appreciated.  Thanks.
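
A possible reading of the "object not found" error is that L3_ER is a column of the data frame rather than an object in the workspace, so it has to be reached through the data frame. The sketch below illustrates this with a small hypothetical stand-in for the imported data set; the values are invented for illustration only.

```r
# Hypothetical stand-in for the imported data set.
data <- data.frame(
  DrainageArea = c(12.5, 40.1, 7.3),
  L3_ER = factor(c("NGPI", "DRAR", "WCBP"))
)

# Refer to the column through the data frame, not as a bare name.
ordered_flag <- is.ordered(data$L3_ER)
level_order <- levels(data$L3_ER)

print(ordered_flag)
print(level_order)

# with() evaluates the expression inside the data frame as an alternative.
print(with(data, is.ordered(L3_ER)))
```

Note that is.ordered() only reports whether the column is an ordered factor; levels() is what shows the order in which the categories are stored.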

Michael

Michael B. Griffith, Ph.D.
Research Ecologist

USEPA, NCEA (MS A-110)
26 W. Martin Luther King Dr.
Cincinnati, OH  45268

telephone:  513 569-7034
e-mail:  griffith.michael at epa.gov


