[R] R, ctree and categorical variables
Achim.Zeileis at uibk.ac.at
Fri Jul 29 11:05:50 CEST 2011
On Thu, 28 Jul 2011, seanstclair at verizon.net wrote:
> I am running the ctree function in R.
> My data has about 10 variables, many of which are categorical. 2 of the
> categorical variables have many levels (one has 900 levels, another has
> 1,000 levels). As an example, 1 of these variables is disease code and is
> structured as A, B, C, ...., AA, AB, AC....
> Each time I've tried to run the ctree function, including these 2 variables
> in the data, the function never stops running. When I remove these 2
> variables from the data and run without them, the function returns in about
> 3 seconds.
> Q: Is there a limit to the number of levels that a categorical variable can
> contain? Is there something else that I may be overlooking?
ctree() tries to split such a variable into two groups: a left and a right
daughter node. There are 2^(k-1) - 1 possible groupings for a
categorical variable with k levels. For k=1000 this is roughly 5e300, far
too many to search in any feasible amount of time.
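
To get a feel for how quickly this count grows, here is a minimal sketch in
base R. The helper name n_binary_splits is made up for illustration and is
not part of party or any other package; it only evaluates the formula above.

```r
# Hypothetical helper: number of ways to split k unordered levels
# into two non-empty groups (left and right daughter node).
n_binary_splits <- function(k) 2^(k - 1) - 1

n_binary_splits(4)     # 7
n_binary_splits(10)    # 511
n_binary_splits(30)    # 536870911
n_binary_splits(1000)  # about 5.36e300, still representable as a double
```

Even at k=30 there are already more than half a billion candidate groupings
for a single variable at a single node.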
You can try to break it down to a coarser classification of levels that is
still computable. Or, if the categorical variable is in fact ordered, it
needs to be declared as an ordered factor; then only k-1 splits are possible,
which is a small number by comparison.
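
Both suggestions can be sketched in base R as follows. The data are
simulated, the variable names (code, coarse, ord) and the choice of keeping
the 10 most frequent levels are assumptions made for illustration only, and
whether an alphabetical ordering is meaningful depends entirely on the real
coding scheme.

```r
set.seed(1)
# Simulated stand-in for a many-level variable such as a disease code
code <- factor(sample(sprintf("D%03d", 1:200), 1000, replace = TRUE))

# Option 1: coarser classification. Keep the 10 most frequent levels
# and lump everything else into a single "Other" level.
keep   <- names(sort(table(code), decreasing = TRUE))[1:10]
coarse <- factor(ifelse(code %in% keep, as.character(code), "Other"))
nlevels(coarse)  # 11

# Option 2: declare the variable as ordered (only sensible if the
# levels really have a natural order; alphabetical order is assumed here).
ord <- factor(code, levels = sort(levels(code)), ordered = TRUE)
is.ordered(ord)   # TRUE
nlevels(ord) - 1  # number of candidate cut points

# Either version could then replace the original variable in the model,
# e.g. party::ctree(y ~ coarse + ..., data = ...) with your own response y.
```

In practice a grouping based on subject knowledge (for example disease
categories) is usually preferable to lumping purely by frequency.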