[R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results
Achim.Zeileis at uibk.ac.at
Mon Sep 14 16:52:24 CEST 2015
Thanks for your interest.
> I'm currently exploring a dataset with the help of conditional inference
> trees (still very much a beginner with this technique & log. reg.
> methods as a whole t.b.h.), since they explained more variation in my
> dataset than a binary logistic regression with glm. I started out with
> the party package, but after a while I ran into the 'updated'
> partykit package and tried this out, too.
If you want to use individual trees (as opposed to forests), then the
"partykit" package is recommended because it contains much improved
re-implementations of ctree() and mob() as well as the mob() convenience
interfaces lmtree() and glmtree(). For forests see below.
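
For illustration only, a minimal sketch of the partykit interfaces named above. The data set (mtcars), the recoded response and the choice of variables are assumptions made for this example and are not taken from the original question.

```r
# Sketch only: runs when partykit is installed, otherwise does nothing.
if (requireNamespace("partykit", quietly = TRUE)) {
  # Hypothetical binary response built from a data set that ships with R
  d <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

  # Conditional inference tree with three candidate partitioning variables
  ct <- partykit::ctree(am ~ wt + hp + mpg, data = d)
  print(ct)

  # Logistic-regression tree: am ~ wt is fitted in each node,
  # hp and mpg are the candidate partitioning variables
  gt <- partykit::glmtree(am ~ wt | hp + mpg, data = d, family = binomial)
  print(gt)

  # Fitted class probabilities from the conditional inference tree
  print(head(predict(ct, type = "prob")))
}
```

With so few observations either tree may consist of the root node only; the point here is the calling convention, not the result.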
> Now, the strange thing is that both trees look quite different -
> actually even the very first split is different.
This might be due to several partitioning variables being associated with
tiny p-values in the root node. The re-implementation in partykit
internally computes with log-p-values and hence should be numerically more
stable. In the old implementation it could happen that from several highly
significant variables, always the first is chosen because the p-values
were essentially indistinguishable for the computer.
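
The numerical issue can be sketched in base R. The two test statistics below are made up for illustration, and the snippet only mimics the effect described above; it is not the code used inside either package.

```r
# Two hypothetical chi-squared(1) test statistics, both extremely significant
stat <- c(x1 = 150, x2 = 300)

# On the "1 - p-value" scale both criteria round to exactly 1 in double
# precision, so the two variables can no longer be told apart
pval <- pchisq(stat, df = 1, lower.tail = FALSE)
crit <- 1 - pval
print(crit[["x1"]] == crit[["x2"]])

# Ties are then broken by position: the first variable wins
print(names(which.max(crit)))

# On the log scale the p-values stay clearly distinguishable,
# and the variable with the larger statistic is selected
logp <- pchisq(stat, df = 1, lower.tail = FALSE, log.p = TRUE)
print(names(which.min(logp)))
```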
If you think that this is not the problem, then please contact the package
maintainer with a reproducible example.
Except for bug fixes like the one above, the trees grown by
partykit::ctree and party::ctree should be the same.
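
A reproducible comparison of the two implementations could look roughly like the following sketch. The iris data and the default settings are assumptions chosen for illustration; with your own data, substitute your formula and data frame.

```r
# Sketch only: runs when both packages are installed.
if (requireNamespace("party", quietly = TRUE) &&
    requireNamespace("partykit", quietly = TRUE)) {
  # Same formula and data for both implementations, default settings
  ct_old <- party::ctree(Species ~ ., data = iris)
  ct_new <- partykit::ctree(Species ~ ., data = iris)

  # Share of observations that receive the same predicted class
  agreement <- mean(as.character(predict(ct_old)) ==
                    as.character(predict(ct_new)))
  print(agreement)

  # Printing both trees shows the selected split variables side by side
  print(ct_old)
  print(ct_new)
}
```

The packages are called via `::` rather than attached with library(), because both export a function named ctree().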
> So I did some research and came across the 'forest' concept. However, it
> seems that the varImp function does not yet work in the partykit
Correct. While the ctree() implementation in partykit is better than that
in party, the same is _not_ true for cforest(). The new partykit::cforest
is currently still a basic implementation which doesn't offer as many
features as the party::cforest implementation. More work is needed
especially for variable importance measures and different kinds of
> which raises the question for me how I should evaluate the partykit
> forest - how can I find out whether the variables are important in the
> forest as in my partykit tree? Is there some way to do this or some
> other solution for this problem? I'd prefer to continue the partykit
> implementation of ctree, since it allows more settings for the final
> plot, which I'd need to get the final (large) plot into a readable form.
> Related to this project, I'd also like to give statistics for the overall
> model, e.g. overall significance, Nagelkerke's R², a C-value. After a
> 'regular' binary log. reg., I would use the lrm function to get these
> values, but I am unsure whether it would be correct to also apply this
> method to my tree data.
Overall significance is difficult because you have done model selection
when growing the tree. As for pseudo R-squared or information criteria
etc., it is relatively easy to compute these "by hand" based on the
observed and fitted responses. An example for this is provided at:
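
Separately from the linked example, a minimal sketch of such a "by hand" computation in base R follows. The helper name pseudo_fit, the mtcars data and the two-leaf partition are assumptions made for illustration; for a fitted tree, the probabilities p would instead come from something like predict(tree, type = "prob").

```r
# Hypothetical helper: fit statistics from observed 0/1 responses y
# and fitted probabilities p for the event y == 1
pseudo_fit <- function(y, p) {
  n <- length(y)
  # Bernoulli log-likelihoods of the fitted and the intercept-only model
  ll1 <- sum(dbinom(y, size = 1, prob = p, log = TRUE))
  ll0 <- sum(dbinom(y, size = 1, prob = mean(y), log = TRUE))
  # Cox-Snell R-squared, rescaled to a maximum of 1 (Nagelkerke)
  r2_cs <- 1 - exp(2 * (ll0 - ll1) / n)
  r2_n <- r2_cs / (1 - exp(2 * ll0 / n))
  # C-value (area under the ROC curve) via the rank-sum formula
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  c_val <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  c(logLik = ll1, nagelkerke = r2_n, c_value = c_val)
}

# Stand-in for a tree with two terminal nodes: the fitted probability
# is the observed share of y == 1 within each leaf
y <- mtcars$am
leaf <- mtcars$wt < 3
p <- ave(y, leaf)

fit_stats <- pseudo_fit(y, p)
print(fit_stats)
```

These are in-sample measures, so they share the caveat stated above: the tree was selected on the same data, which makes the values optimistic.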
> Any help would be greatly appreciated!
> -- Christopher
> View this message in context: http://r.789695.n4.nabble.com/Trees-and-Forests-with-packages-party-vs-partykit-Different-results-tp4712214.html