[R-sig-eco] Ordering of nominal or categorical variables in randomForest?

Karl Cottenie cottenie at uoguelph.ca
Tue Jul 8 20:18:02 CEST 2008


My guess is that the tree analysis uses the internal order of the factor
levels. You can see this order by printing the variable. If it is stored
as a factor, R first prints the individual values, followed by a line of
the form "n Levels:" (n = 5 in your case) that lists the distinct levels
in the order in which they are stored. You can also get this information
with the levels() function.
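
As a minimal sketch of that check, the snippet below builds a small made-up
data frame (the name dat and its values are hypothetical; only the five
L3_ER class names come from the question) and inspects the stored level
order. By default factor() sorts the distinct values, so the levels come
out alphabetically rather than in order of first appearance. Note that the
column is reached through the data frame; a bare L3_ER is not visible
unless the data frame is attached, which would explain an "object not
found" error.

```r
# Hypothetical data: the five L3_ER classes, deliberately entered in a
# non-alphabetical order of first appearance.
dat <- data.frame(
  L3_ER = factor(c("WCBP", "NGPI", "DRAR", "NoLF", "NCHF", "NGPI"))
)

# Stored level order: factor() sorts the distinct values by default.
levels(dat$L3_ER)
# [1] "DRAR" "NCHF" "NGPI" "NoLF" "WCBP"

# The factor is nominal (unordered), so this returns FALSE.
is.ordered(dat$L3_ER)
```

If a different order is wanted, it can be set explicitly with the levels
argument of factor().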

See this example from ?factor

> factor(letters[1:20], labels="letter")
 [1] letter1  letter2  letter3  letter4  letter5  letter6  letter7  letter8
 [9] letter9  letter10 letter11 letter12 letter13 letter14 letter15 letter16
[17] letter17 letter18 letter19 letter20
20 Levels: letter1 letter2 letter3 letter4 letter5 letter6 letter7 ... letter20
## The "Levels:" line above is the one you are interested in


On Tue, 2008-07-08 at 13:57 -0400, Griffith.Michael at epamail.epa.gov wrote:
> I am attempting to use randomForest to do classification and regression
> tree analysis.  After importing the data set, I use the following
> statement:
>   tree <- randomForest(POET ~ DrainageArea + PctFines + L3_ER + ChanCon,
>                        data=data, ntree=500, mtry=2,
>                        replace=TRUE, importance=TRUE, do.trace=TRUE,
>                        keep.forest=TRUE)
> The first two independent variables are continuous numerical variables,
> while the last two are categorical variables with more than two classes.
> The package seems to handle this mixture of numerical and categorical
> variables, but I am unclear how to interpret the splits for the
> categorical variables.
> The table describing the splits has a column, split point, which for
> numerical variables is the value of the indicated variable where the
> left daughter group is less than the value and the right daughter group
> is greater than the value.
> The documentation states that for categorical variables, split point is
> an integer, whose binary expansion identifies which categories go into
> the left and right daughter groups.  It gives an example of a variable
> with three classes and a split value of 5, which expands to 1 0 1.  In
> this case, the first and third classes go into the left daughter group
> and the second class goes into the right daughter group.
> My question now is:  How does the package order the classes of a
> categorical variable?  This is not clear in the documentation, and if
> this is something basic to R, I have not found it in the help files.  In
> my example, the variable, L3_ER, has five classes, DRAR, NCHF, NGPI,
> NoLF, and WCBP.  These levels are not ordered in any particular way in
> the data set.  I can think of two ways the package might order the
> classes:  1.  alphabetically or 2. in the order in which they are first
> encountered in the data set.  Is either of these correct or might there
> be some other way of ordering the levels I have not thought of?
> A colleague suggested that I might use is.ordered(), but I get an error
> message, "Error in inherits(x, "factor") : object "L3_ER" not found."
> Any other suggestions are appreciated.  Thanks.
> Michael
> Michael B. Griffith, Ph.D.
> Research Ecologist
> USEPA, NCEA (MS A-110)
> 26 W. Martin Luther King Dr.
> Cincinnati, OH  45268
> telephone:  513 569-7034
> e-mail:  griffith.michael at epa.gov
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
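
The split-point encoding described in the question can be decoded by
hand. The helper below is only a sketch: decode_split is a hypothetical
name, not part of randomForest, and it assumes that the least significant
bit of the split point corresponds to the first factor level and that a 1
bit sends a category to the left daughter. That is how I read the getTree
help page, so please check it against your installed version. For the
three-class example with split value 5 the expansion 1 0 1 is symmetric,
so the bit order makes no difference there.

```r
# Hypothetical helper: map an integer split point to level names.
# Assumption: bit k (counting from the least significant bit, k = 1)
# belongs to level k, and a 1 bit means "goes to the left daughter".
decode_split <- function(split_point, lev) {
  k <- seq_along(lev)
  bits <- (split_point %/% 2^(k - 1)) %% 2
  list(left = lev[bits == 1], right = lev[bits == 0])
}

# The five L3_ER classes in their default (sorted) level order.
lev <- c("DRAR", "NCHF", "NGPI", "NoLF", "WCBP")

# A split point of 5 expands to 1 0 1 0 0 under this assumption.
decode_split(5, lev)
# $left
# [1] "DRAR" "NGPI"
#
# $right
# [1] "NCHF" "NoLF" "WCBP"
```

In practice lev would come from levels() applied to the predictor in the
data frame that was passed to randomForest.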
