[R-sig-eco] Ordering of nominal or categorical variables in randomForest?
cottenie at uoguelph.ca
Tue Jul 8 20:18:02 CEST 2008
My guess is that the tree analysis uses the internal order of the factor
levels. You can see this order by printing the variable. If it is stored as
a factor, R first prints all the individual values, followed
by a "Levels:" line (5 levels in your case) showing the order in which the
distinct levels are stored. You can also get this information with the
See this example from ?factor
> factor(letters[1:20], labels="letter")
 [1] letter1  letter2  letter3  letter4  letter5  letter6  letter7  letter8
 [9] letter9  letter10 letter11 letter12 letter13 letter14 letter15 letter16
[17] letter17 letter18 letter19 letter20
20 Levels: letter1 letter2 letter3 letter4 letter5 letter6 letter7 ... letter20
## The "Levels:" line above is the one you are interested in
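
To see how this plays out with the five class names from the question, here is a minimal sketch. The data frame `dat` and the values in it are made up for illustration; only the level names come from the original post. By default factor() sorts the distinct values, so the levels should come out in alphabetical order rather than in the order they are first encountered. Whether randomForest uses exactly this order is still the guess stated above, not something confirmed here.

```r
# Hypothetical data: the classes appear in a non-alphabetical order
dat <- data.frame(
  L3_ER = c("WCBP", "NGPI", "DRAR", "NoLF", "NCHF", "NGPI"),
  stringsAsFactors = TRUE
)

# levels() returns the stored level order; the internal integer
# codes (1, 2, 3, ...) index into this vector
lev <- levels(dat$L3_ER)
print(lev)

# Integer code stored behind each observation
codes <- as.integer(dat$L3_ER)
print(codes)
```

If a different order is wanted, it can be set explicitly when the factor is built, for example with factor(x, levels = c(...)).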
On Tue, 2008-07-08 at 13:57 -0400, Griffith.Michael at epamail.epa.gov wrote:
> I am attempting to use randomForest to do classification and regression
> tree analysis. After importing the data set, I use the following
> tree <- randomForest(POET ~ DrainageArea + PctFines + L3_ER + ChanCon,
> data=data, ntree=500, mtry=2,
> replace=TRUE, importance=TRUE, do.trace=TRUE,
> The first two independent variables are continuous numerical variables,
> while the last two are categorical variables with more than two classes.
> The package seems to handle this mixture of numerical and categorical
> variables, but I am unclear how to interpret the splits for the
> categorical variables.
> The table describing the splits has a column, split point, which for
> numerical variables is the value of the indicated variable where the
> left daughter group is less than the value and the right daughter group
> is greater than the value.
> The documentation states that for categorical variables, split point is
> an integer, whose binary expansion identifies which categories go into
> the left and right daughter groups. It gives an example of a variable
> with three classes and a split value of 5, which expands to 1 0 1. In
> this case, the first and third classes go into the left daughter group
> and the second class goes into the right daughter group.
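
The binary expansion described above can be worked through by hand. The sketch below is only an illustration: `decode_split` is a hypothetical helper, not part of randomForest, and the level names "A", "B", "C" are placeholders. It assumes that the lowest-order bit refers to the first level and that a 1 bit sends a class to the left daughter group, which matches the documentation example quoted above (5 expands to 1 0 1) but should be checked against the package before relying on it.

```r
# Hypothetical helper: turn a categorical split point into left/right groups.
# Assumption: bit i, counted from the least significant bit, refers to
# level i, and a 1 bit means "goes to the left daughter".
decode_split <- function(split_point, lev) {
  powers <- 2^(seq_along(lev) - 1)        # 1, 2, 4, 8, ...
  bits <- (split_point %/% powers) %% 2   # one 0/1 flag per level
  list(left = lev[bits == 1], right = lev[bits == 0])
}

# The documentation example: three classes, split value 5
res3 <- decode_split(5, c("A", "B", "C"))
print(res3)

# The same idea with the five classes from the question, in sorted order
res5 <- decode_split(10, c("DRAR", "NCHF", "NGPI", "NoLF", "WCBP"))
print(res5)
```

Because 5 reads the same forwards and backwards in binary, the documentation example alone does not settle which end of the expansion is the first class; a split value such as 1 or 2 in a fitted tree would.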
> My question now is: How does the package order the classes of a
> categorical variable? This is not clear in the documentation, and if
> this is something basic to R, I have not found it in the help files. In
> my example, the variable, L3_ER, has five classes, DRAR, NCHF, NGPI,
> NoLF, and WCBP. These levels are not ordered in any particular way in
> the data set. I can think of two ways the package might order the
> classes: 1. alphabetically or 2. in the order that they are first
> encountered in the data set. Is either of these correct or might there
> be some other way of ordering the levels I have not thought of?
> A colleague suggested that I might use is.ordered(), but I get an error
> message, "Error in inherits(x, "factor") : object "L3_ER" not found."
> Any other suggestions are appreciated. Thanks.
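
On the is.ordered() error: the message suggests that L3_ER was looked up as a stand-alone object, whereas it is a column inside the data frame, so it has to be referenced through the data frame. A small sketch, with `dat` as a made-up stand-in for the imported data set:

```r
# Hypothetical stand-in for the imported data set
dat <- data.frame(L3_ER = factor(c("NGPI", "DRAR", "WCBP")))

# Refer to the column through the data frame, not by its bare name
is_fac <- is.factor(dat$L3_ER)   # is the column stored as a factor?
is_ord <- is.ordered(dat$L3_ER)  # is it an *ordered* factor?
print(c(is_fac = is_fac, is_ord = is_ord))
```

Note that is.ordered() only reports whether the factor was declared as ordered. An unordered factor still stores its levels in a fixed sequence, and levels() is what shows that sequence.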
> Michael B. Griffith, Ph.D.
> Research Ecologist
> USEPA, NCEA (MS A-110)
> 26 W. Martin Luther King Dr.
> Cincinnati, OH 45268
> telephone: 513 569-7034
> e-mail: griffith.michael at epa.gov