[R] multi-class classification using rpart
Uwe Ligges
ligges at statistik.uni-dortmund.de
Wed Jan 26 08:42:10 CET 2005
Liaw, Andy wrote:
>>From: Uwe Ligges
>>
>>WeiWei Shi wrote:
>>
>>
>>>Hi, Andy:
>>>Thanks. It works after I removed the variable. I think I got a similar
>>>problem when I used randomForest. And I am not sure if they were due
>>>to the same reason.
>>>
>>>Practically and Unfortunately, that variable is very important to the
>>>accuracy. I am wondering if there is another way besides collapsing
>>>it. BTW, I remember you mentioned some alternative implementation to
>>>randomForest (the author provided) to avoid the upper limit (32, if I
>>>am correct) for the level of factor which can be used in the R
>>>version's randomForest.
>>>
>>>Thanks for further assistance!
>>
>>
>>So you *really* want it to be factor?! Thought it was a mistake not to
>>have it numerical....
>>Amazing! Maybe computers are sometimes even too fast these days.
>>
>>Uwe
>
>
> [Uwe: Not sure if you meant to keep this off-list. If so, my most sincere
> apologies.]
Andy, *you* do not need to apologize (yes, I meant to keep it off list,
but WeiWei Shi posted it anyway).
> Er... not really. Currently (classification) randomForest encodes splits on
> categorical variables by binary expansion of the levels that go to the left.
> Each such split is stored in a (4-byte) integer, thus the 32-level restriction.
> In the newer version of Breiman & Cutler's Fortran code, that restriction is
> removed by storing the entire indicator matrix (# of nodes by max. number of
> levels, then by number of trees in the forest). For the stand-alone
> Fortran, each tree is written to file as soon as it's grown, so it doesn't
> need to store the entire forest in memory. The R version has no such luxury
> (if you can call it that).
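
To make the 32-level limit concrete, here is a minimal sketch of packing
"which levels go left" into the bits of one integer. It is only an
illustration in Python, not the randomForest source: the names
encode_split and goes_left are hypothetical, and the bit order (level i
maps to bit i) is an assumption.

```python
# Illustrative sketch only, not the randomForest implementation.
# Assumption: level i (0-based) is represented by bit i of the code.

MAX_LEVELS = 32  # number of bits in a 4-byte integer


def encode_split(left_levels, n_levels):
    """Pack the set of level indices that go to the left child into one integer."""
    if n_levels > MAX_LEVELS:
        raise ValueError(
            "factor has %d levels; only %d fit in a 4-byte integer"
            % (n_levels, MAX_LEVELS)
        )
    code = 0
    for level in left_levels:
        if not 0 <= level < n_levels:
            raise ValueError("level index %d out of range" % level)
        code |= 1 << level  # set the bit belonging to this level
    return code


def goes_left(code, level):
    """Return True if the given level index is sent to the left child."""
    return (code >> level) & 1 == 1


# Levels 0, 2 and 3 of a 5-level factor go left: bits 0, 2, 3 are set.
code = encode_split([0, 2, 3], n_levels=5)
```

With this layout a factor with 88 levels simply has no room in one such
integer, which is why it is rejected up front.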
>
> The way the new RF Fortran code deals with categorical variables with more
> than 10 categories is by randomly sampling some number (say 512) of
> splits and picking the best among them. That's probably a good strategy for
> random forests, but may not be what one would do to grow a single tree.
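
The sampling idea can be sketched roughly as follows. This is a Python
illustration under my own assumptions, not the Fortran code: each
candidate split sends every level left with probability 1/2, candidates
are scored by weighted Gini impurity, and the names gini and
best_random_split are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_random_split(levels, labels, n_candidates=512, seed=0):
    """Sample random level subsets and keep the one with the lowest weighted Gini.

    levels: factor level of each observation; labels: class of each observation.
    Returns (left_levels, score), or (None, None) if no valid split was drawn.
    """
    rng = random.Random(seed)
    distinct = sorted(set(levels))
    n = len(labels)
    best_left, best_score = None, None
    for _ in range(n_candidates):
        # Assumption: each level goes left independently with probability 1/2.
        left = {lv for lv in distinct if rng.random() < 0.5}
        if not left or len(left) == len(distinct):
            continue  # skip splits that leave one child empty
        left_y = [y for lv, y in zip(levels, labels) if lv in left]
        right_y = [y for lv, y in zip(levels, labels) if lv not in left]
        score = (len(left_y) * gini(left_y) + len(right_y) * gini(right_y)) / n
        if best_score is None or score < best_score:
            best_left, best_score = left, score
    return best_left, best_score


# Toy data: levels "a"/"b" always carry class 0, levels "c"/"d" class 1.
toy_levels = ["a", "b", "c", "d", "a", "c", "b", "d"]
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
left, score = best_random_split(toy_levels, toy_labels)
```

With only four levels the 512 draws cover every possible split many times
over; with 88 levels they cover a vanishing fraction, which is the trade-off
being described.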
>
> When growing a single tree with data containing categorical variables with
> a large number of categories, one should also be mindful of the problem that,
> because of the greedy nature of the algorithm, it will tend to split on
> variables with larger numbers of possible splits, even if those variables
> are less `informative'.
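
A rough way to see the size of that effect is to count candidate splits.
The sketch below uses the standard counts for binary splits: an unordered
factor with k levels admits 2^(k-1) - 1 distinct splits, while a numeric
variable with m distinct values admits m - 1. The function names are
hypothetical, and Python is used only for illustration.

```python
def n_categorical_splits(k):
    """Distinct binary splits of an unordered factor with k levels."""
    if k < 2:
        return 0
    # Every nonempty proper subset of levels, counted once per left/right pair.
    return 2 ** (k - 1) - 1


def n_numeric_splits(m):
    """Distinct binary splits of a numeric variable with m distinct values."""
    return max(m - 1, 0)


# Compare a numeric variable and a factor that both take 88 distinct values.
numeric_88 = n_numeric_splits(88)
factor_88 = n_categorical_splits(88)
```

A greedy search that simply takes the best of all candidates gives the
factor vastly more chances to look good by luck alone.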
>
> Andy
Certainly you are right - I don't know all those details about
randomForest, but the point I tried to make is different:
Take care not to be called a professional overfitter: a variable named
"V141", and at least one of those variables is a factor with 88 levels...!!!
Uwe