[R] R, ctree and categorical variables

Achim Zeileis Achim.Zeileis at uibk.ac.at
Fri Jul 29 11:05:50 CEST 2011


On Thu, 28 Jul 2011, seanstclair at verizon.net wrote:

> I am running the ctree function in R.
>
> My data has about 10 variables, many of which are categorical. 2 of the
> categorical variables have many levels (one has 900 levels, another has
> 1,000 levels). As an example, 1 of these variables is disease code and is
> structured as A, B, C, ..., AA, AB, AC, ...
>
> Each time I've tried to run the ctree function, including these 2 variables
> in the data, the function never stops running. When I remove these 2
> variables from the data and run without them, the function returns in about
> 3 seconds.
>
> Q: Is there a limit to the number of levels that a categorical variable can
> contain? Is there something else that I may be overlooking?

ctree() tries to split such a variable into two groups, one for the left
and one for the right daughter node. For a categorical variable with k
levels there are 2^(k-1) - 1 possible groupings. For k = 1000 that is
roughly 5e300 candidate splits, far too many to evaluate in any feasible
amount of time.
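
To get a feel for how quickly the number of groupings grows, here is a small
sketch in base R. The helper names n_binary_splits() and n_ordered_splits()
are made up for illustration and are not part of party or ctree(); they only
evaluate the two counting formulas.

```r
# Hypothetical helpers (not part of any package): count candidate splits.

# Unordered factor with k levels: every way of dividing the levels into
# two non-empty groups is a candidate split.
n_binary_splits <- function(k) 2^(k - 1) - 1

# Ordered factor with k levels: only cut points between adjacent levels.
n_ordered_splits <- function(k) k - 1

n_binary_splits(4)      # 7
n_binary_splits(10)     # 511
n_binary_splits(30)     # 536870911
n_binary_splits(1000)   # on the order of 5e300

n_ordered_splits(1000)  # 999
```

Even at k = 30 there are already more than half a billion groupings to
evaluate for a single variable in a single node.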

You can try to break it down to a coarser classification of levels that is
still computable. Or, if the levels of the categorical variable have a
natural order, declare it as an ordered factor; then only k-1 splits are
possible, which is small enough.
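
Both options can be sketched in base R as follows. The data frame d, the
column names and the helper collapse_levels() are invented for this example,
and lumping rare levels into "other" is just one possible way of coarsening;
grouping by domain knowledge (e.g. disease chapters) may well be better.
Declaring the factor as ordered only makes sense if the levels really have a
meaningful order.

```r
set.seed(1)

# Made-up data: a code variable with 52 levels (A..Z, AA..AZ).
all_codes <- c(LETTERS, paste0("A", LETTERS))
d <- data.frame(code = factor(sample(all_codes, 500, replace = TRUE),
                              levels = all_codes))

# Option 1: coarser classification. Hypothetical helper that keeps the
# `keep` most frequent levels and merges all remaining ones into `other`.
collapse_levels <- function(f, keep = 10, other = "other") {
  f <- factor(f)
  top <- names(sort(table(f), decreasing = TRUE))[seq_len(min(keep, nlevels(f)))]
  lv <- levels(f)
  lv[!(lv %in% top)] <- other
  levels(f) <- lv   # assigning duplicated labels merges those levels
  f
}

d$code_coarse <- collapse_levels(d$code, keep = 10)
nlevels(d$code_coarse)   # at most 11: ten kept levels plus "other"

# Option 2: declare the factor as ordered (assumes the level order above
# is meaningful for the problem at hand).
d$code_ord <- factor(d$code, levels = all_codes, ordered = TRUE)
is.ordered(d$code_ord)   # TRUE

# Either column could then be used in place of the original one, e.g.
# ctree(y ~ code_coarse, data = d) given a response y in the data.
```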

hth,
Z

> Thanks.