[R] multi-class classification using rpart

Wed Jan 26 08:42:10 CET 2005

Liaw, Andy wrote:

>>From: Uwe Ligges 
>>
>>WeiWei Shi wrote:
>>
>>
>>>Hi, Andy:
>>>Thanks. It works after I removed the variable. I think I got a similar
>>>problem when I used randomForest, and I am not sure if they were due
>>>to the same reason.
>>>
>>>Practically, and unfortunately, that variable is very important to the
>>>accuracy. I am wondering if there is another way besides collapsing
>>>it. BTW, I remember you mentioned some alternative implementation of
>>>randomForest (provided by the author) that avoids the upper limit (32,
>>>if I am correct) on the number of levels of a factor that can be used
>>>in the R version of randomForest.
>>>
>>>Thanks for further assistance!
>>
>>
>>So you *really* want it to be a factor?! I thought it was a mistake not
>>to have it numerical....
>>Amazing! Maybe computers are sometimes even too fast these days.
>>
>>Uwe
> 
> 
> [Uwe: Not sure if you meant to keep this off-list.  If so, my most sincere
> apologies.]

Andy, *you* do not need to apologize (yes, I meant to keep it off list, 
but WeiWei Shi posted it anyway).

> Er... not really.  Currently (classification) randomForest encodes splits
> on categorical variables by a binary expansion of the levels that go to the
> left.  Such a split is stored in a (4-byte) integer, hence the 32-level
> restriction.  In the newer version of Breiman & Cutler's Fortran code, that
> restriction is removed by storing the entire indicator matrix (# of nodes by
> max. number of levels, then by number of trees in the forest).  For the
> stand-alone Fortran, each tree is written to file as soon as it's grown, so
> it doesn't need to store the entire forest in memory.  The R version has no
> such luxury (if you can call it that).
> 
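A minimal sketch of the encoding described in the paragraph above, assuming 0-based level indices and one bit per level. The helper names `encode_left_levels` and `goes_left` are hypothetical and are not taken from randomForest or the Fortran code; the sketch only illustrates why a 4-byte integer caps the number of levels at 32.

```python
# Illustrative only: these helpers are hypothetical, not part of randomForest.
WORD_BITS = 32  # number of bits in a 4-byte integer


def encode_left_levels(left_levels):
    """Pack the 0-based level indices that go left into one integer bitmask."""
    mask = 0
    for level in left_levels:
        # Python integers are unbounded, so the 32-bit limit is enforced here.
        if not 0 <= level < WORD_BITS:
            raise ValueError(
                f"level index {level} does not fit in {WORD_BITS} bits"
            )
        mask |= 1 << level
    return mask


def goes_left(mask, level):
    """Return True if the given level is sent to the left child."""
    return (mask >> level) & 1 == 1


split = encode_left_levels([0, 3, 5])  # bits 0, 3 and 5 set
print(split, goes_left(split, 3), goes_left(split, 4))
```

A factor with a 33rd level (index 32) has no bit left to occupy, which is the restriction in question.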
> The way the new RF Fortran code deals with categorical variables with more
> than 10 categories is by randomly sampling some number (say 512) of
> candidate splits and picking the best among them.  That's probably a good
> strategy for random forests, but may not be what one would do to grow a
> single tree.
> 
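The sampling strategy in the paragraph above could look roughly like the sketch below. It is written under assumptions: the weighted Gini impurity of the two children is used as the split criterion, each level is sent left with probability 1/2, and the function names are made up for illustration. None of this is taken from the Fortran code itself.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(levels, labels, left_set):
    """Weighted Gini impurity of the two children produced by left_set."""
    left = [y for x, y in zip(levels, labels) if x in left_set]
    right = [y for x, y in zip(levels, labels) if x not in left_set]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_random_split(levels, labels, n_candidates=512, seed=0):
    """Draw random subsets of levels and keep the one with lowest impurity."""
    rng = random.Random(seed)
    distinct = sorted(set(levels))
    best_set, best_score = None, float("inf")
    for _ in range(n_candidates):
        left_set = {lv for lv in distinct if rng.random() < 0.5}
        if not left_set or len(left_set) == len(distinct):
            continue  # skip splits that leave one child empty
        score = split_impurity(levels, labels, left_set)
        if score < best_score:
            best_set, best_score = left_set, score
    return best_set, best_score


# Toy data: a 12-level factor whose first six levels belong to class "a".
levels = list(range(12)) * 5
labels = ["a" if lv < 6 else "b" for lv in levels]
left_set, score = best_random_split(levels, labels)
print(sorted(left_set), round(score, 4))
```

With 12 levels there are 2047 distinct binary partitions, so 512 random draws will usually find a good split but not necessarily the best one, which is the trade-off being described.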
> When growing a single tree with data containing categorical variables with
> a large number of categories, one should also be mindful of the problem
> that, because of the greedy nature of the algorithm, it will tend to split
> on variables with larger numbers of possible splits, even if those
> variables are less `informative'.
> 
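To put numbers on "larger numbers of possible splits": an unordered factor with k levels admits 2^(k-1) - 1 distinct binary partitions, whereas a numeric variable with m distinct values admits only m - 1 threshold splits. The short sketch below (function names are illustrative) compares the two counts.

```python
def n_categorical_splits(k):
    """Binary partitions of k unordered levels into two non-empty groups."""
    if k < 2:
        return 0
    return 2 ** (k - 1) - 1


def n_numeric_splits(m):
    """Threshold splits between m distinct sorted values."""
    return max(m - 1, 0)


for k in (2, 10, 32, 88):
    print(k, n_categorical_splits(k), n_numeric_splits(k))
```

A greedy search that keeps the best of that many candidate splits has far more chances to find a split that looks good by accident on the factor than on the numeric variable.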
> Andy

Certainly you are right - I don't know all those details about
randomForest, but the point I tried to make is a different one:
Be careful not to be called a professional overfitter: a variable named
"V141", and at least one of those variables is a factor with 88 levels...!!!

Uwe