[R-sig-eco] Ordering of nominal or categorical variables in randomForest?

Tue Jul 8 20:18:02 CEST 2008

Michael,

My guess is that the tree analysis uses the internal order of the factor
levels. You can see this order by printing the variable. If it is stored
as a factor, R first prints the individual values, followed by a line of
the form "n Levels:" (n is 5 in your case) that lists the distinct levels
in the order in which they are stored. The levels() function returns the
same information.

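As a minimal sketch of how to check this on your own data (the data frame
"dat" and its contents are made up here; only the column name L3_ER and
its five class labels come from your message):

```r
# Hypothetical stand-in for the real data set; only the column name
# L3_ER and its five class labels are taken from the original post.
dat <- data.frame(
  L3_ER = factor(c("NoLF", "DRAR", "WCBP", "NGPI", "NCHF", "DRAR"))
)

# factor() sorts the distinct values by default, so the levels are
# stored in sorted order, not in order of first appearance.
lev <- levels(dat$L3_ER)
print(lev)
# [1] "DRAR" "NCHF" "NGPI" "NoLF" "WCBP"

# The column lives inside the data frame, so refer to it as dat$L3_ER.
# A bare L3_ER is not visible at top level, which is the likely cause
# of the "object not found" error.
ord <- is.ordered(dat$L3_ER)
print(ord)
# [1] FALSE
```

If the levels were set explicitly when the factor was created (via the
levels= argument of factor()), that explicit order is what levels()
reports instead.
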
See this example from ?factor

> factor(letters[1:20], labels="letter")
 [1] letter1  letter2  letter3  letter4  letter5  letter6  letter7  letter8
 [9] letter9  letter10 letter11 letter12 letter13 letter14 letter15 letter16
[17] letter17 letter18 letter19 letter20
20 Levels: letter1 letter2 letter3 letter4 letter5 letter6 letter7 ... letter20
## The "20 Levels:" line is the one you are interested in
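
For the split point itself, here is a rough sketch of decoding the integer
into level names. It assumes that bit k of the binary expansion (least
significant bit first) corresponds to the k-th entry of levels(), which is
my reading of the documentation example quoted below, not something I have
verified in the package source. The helper name decode_split is made up and
is not part of randomForest.

```r
# Hypothetical helper (not part of randomForest): decode a categorical
# split point into the levels sent to each daughter node.
# Assumption: bit k (least significant first) is 1 when the k-th level
# of the factor goes to the left daughter.
decode_split <- function(split_point, lev) {
  k <- seq_along(lev)
  goes_left <- (split_point %/% 2^(k - 1)) %% 2 == 1
  list(left = lev[goes_left], right = lev[!goes_left])
}

# The five classes from the original post, in sorted (default) order.
lev <- c("DRAR", "NCHF", "NGPI", "NoLF", "WCBP")

# A split point of 5 expands to 1 0 1 0 0, reading the lowest bit first.
res <- decode_split(5, lev)
print(res$left)
# [1] "DRAR" "NGPI"
print(res$right)
# [1] "NCHF" "NoLF" "WCBP"
```

In practice you would pass levels(data$L3_ER) as the second argument, so
that the decoding uses whatever order the factor actually stores.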

Karl

On Tue, 2008-07-08 at 13:57 -0400, Griffith.Michael at epamail.epa.gov
wrote:
> I am attempting to use randomForest to do classification and regression
> tree analysis.  After importing the data set, I use the following
> statement:
> 
>   tree <- randomForest(POET ~ DrainageArea + PctFines + L3_ER + ChanCon,
>                        data=data, ntree=500, mtry=2,
>                        replace=TRUE, importance=TRUE, do.trace=TRUE,
>                        keep.forest=TRUE)
> 
> The first two independent variables are continuous numerical variables,
> while the last two are categorical variables with more than two classes.
> The package seems to handle this mixture of numerical and categorical
> variables, but I am unclear how to interpret the splits for the
> categorical variables.
> 
> The table describing the splits has a column, split point, which for
> numerical variables is the value of the indicated variable where the
> left daughter group is less than the value and the right daughter group
> is greater than the value.
> 
> The documentation states that for categorical variables, split point is
> an integer, whose binary expansion identifies which categories go into
> the left and right daughter groups.  It gives an example of a variable
> with three classes and a split value of 5, which expands to 1 0 1.  In
> this case, the first and third classes go into the left daughter group
> and the second class goes into the right daughter group.
> 
> My question now is:  How does the package order the classes of a
> categorical variable?  This is not clear in the documentation, and if
> this is something basic to R, I have not found it in the help files.  In
> my example, the variable, L3_ER, has five classes, DRAR, NCHF, NGPI,
> NoLF, and WCBP.  These levels are not ordered in any particular way in
> the data set.  I can think of two ways the package might order the
> classes:  1. alphabetically or 2. in the order that they are first
> encountered in the data set.  Is either of these correct or might there
> be some other way of ordering the levels I have not thought of?
> 
> A colleague suggested that I might use is.ordered(), but I get an error
> message, "Error in inherits(x, "factor") : object "L3_ER" not found."
> Any other suggestions are appreciated.  Thanks.
> 
> Michael
> 
> Michael B. Griffith, Ph.D.
> Research Ecologist
> 
> USEPA, NCEA (MS A-110)
> 26 W. Martin Luther King Dr.
> Cincinnati, OH  45268
> 
> telephone:  513 569-7034
> e-mail:  griffith.michael at epa.gov
> 
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology