[R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results

Mon Sep 14 16:52:24 CEST 2015

Christopher,

thanks for your interest.

> I'm currently exploring a dataset with the help of conditional inference 
> trees (still very much a beginner with this technique & log. reg. 
> methods as a whole t.b.h.), since they explained more variation in my 
> dataset than a binary logistic regression with /glm/. I started out with 
> the /party/ package, but after a while I ran into the 'updated' 
> /partykit/ package and tried this out, too.

If you want to use individual trees (as opposed to forests), then the 
"partykit" package is recommended because it contains much improved 
re-implementations of ctree() and mob() as well as the mob() convenience 
interfaces lmtree() and glmtree(). For forests see below.

> Now, the strange thing is that both trees look quite different - 
> actually even the very first split is different.

This might be due to several partitioning variables being associated with 
tiny p-values in the root node. The re-implementation in partykit 
internally computes with log-p-values and hence should be numerically more 
stable. In the old implementation it could happen that, among several 
highly significant variables, the first one was always chosen because 
their p-values were numerically indistinguishable for the computer.
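
To illustrate the numerical issue (this is only a sketch in base R, not the 
packages' internal code): the two chi-squared statistics below are 
hypothetical values chosen so that both p-values underflow to zero on the 
ordinary scale, while they remain clearly distinguishable on the log scale.

```r
# Hypothetical test statistics for two partitioning variables (assumed values,
# chi-squared with 1 degree of freedom); the second is clearly "more significant".
stat <- c(first = 2000, second = 3000)

# Ordinary p-values: both are far below the smallest representable double.
p_plain <- pchisq(stat, df = 1, lower.tail = FALSE)

# Log-p-values: computed directly on the log scale, so they stay finite.
p_log <- pchisq(stat, df = 1, lower.tail = FALSE, log.p = TRUE)

print(p_plain)
print(p_log)

# With tied (zero) p-values, which.min() simply returns the first variable;
# on the log scale the variable with the larger statistic is selected.
print(which.min(p_plain))
print(which.min(p_log))
```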

If you think that this is not the problem, then please contact the package 
maintainer with a reproducible example.

Except for bug fixes like the one above, the trees grown by 
partykit::ctree and party::ctree should be the same.
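
One way to check this on your own data is to fit both implementations and 
cross-tabulate their predictions. The following is a minimal sketch that 
uses the built-in iris data as a stand-in (an assumption, not your dataset) 
and only runs if both packages are installed:

```r
# Fit the same conditional inference tree with both implementations and
# compare the fitted class predictions. iris is only a placeholder dataset.
if (requireNamespace("party", quietly = TRUE) &&
    requireNamespace("partykit", quietly = TRUE)) {
  tree_old <- party::ctree(Species ~ ., data = iris)
  tree_new <- partykit::ctree(Species ~ ., data = iris)

  pred_old <- predict(tree_old)  # fitted classes from party
  pred_new <- predict(tree_new)  # fitted classes from partykit

  # Off-diagonal counts indicate observations where the two trees disagree.
  print(table(party = pred_old, partykit = pred_new))
}
```

If the table has off-diagonal entries, printing both trees and comparing 
the root node splits is a natural next step.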

> So I did some research and came across the 'forest' concept. However, it 
> seems that the /varImp/ function does not yet work in the /partykit/ 
> implementation,

Correct. While the ctree() implementation in partykit is better than that 
in party, the same is _not_ true for cforest(). The new partykit::cforest 
is currently still a basic implementation which doesn't offer as many 
features as the party::cforest implementation. More work is needed 
especially for variable importance measures and different kinds of 
predictions.
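
For the time being, variable importance can be obtained from the forest in 
the party package. A minimal sketch, again with iris as a placeholder 
dataset and with arbitrarily chosen ntree and mtry values (both are 
assumptions for illustration only):

```r
# Permutation variable importance from party::cforest.
# ntree = 50 and mtry = 2 are arbitrary illustrative settings.
if (requireNamespace("party", quietly = TRUE)) {
  set.seed(1)  # forests are random, so fix the seed for reproducibility
  forest <- party::cforest(
    Species ~ ., data = iris,
    controls = party::cforest_unbiased(ntree = 50, mtry = 2)
  )

  # One importance value per predictor; larger means more important.
  importance <- party::varimp(forest)
  print(sort(importance, decreasing = TRUE))
}
```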

> which raises the question for me how I should evaluate the /partykit/ 
> forest - how can I find out whether the variables are important in the 
> forest as in my /partykit/ tree? Is there some way to do this or some 
> other solution for this problem? I'd prefer to continue the /partykit/ 
> implementation of ctree, since it allows more settings for the final 
> plot, which I'd need to get the final (large) plot into a readable form.
>
> Related to this project, I'd also like to give statistics for the overall
> model, e.g. overall significance, Nagelkerke's R², a C-value. After a
> 'regular' binary log. reg., I would use the lrm function to get these
> values, but I am unsure whether it would be correct to also apply this
> method to my tree data.

Overall significance is difficult because growing the tree already 
involves model selection, which a subsequent overall test would not 
account for. As for pseudo R-squared or information criteria etc., it is 
relatively easy to compute these "by hand" based on the observed and 
fitted responses. An example for this is provided at:
http://stackoverflow.com/questions/29524670/how-to-find-the-the-deviance-of-an-as-party-object-converted-from-rpart-tree-in/29693223#29693223
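
As a rough base-R sketch of such a "by hand" computation for a binary 
response: the helper functions nagelkerke_r2() and c_index() below are 
hypothetical names written for this illustration, the toy data are made 
up, and the clipping constant eps is an assumption to avoid log(0) when a 
terminal node is pure.

```r
# y: observed 0/1 response; p: fitted probability of y == 1.
# For a partykit tree with a two-level factor response, p could be taken
# from predict(tree, type = "prob")[, 2].

nagelkerke_r2 <- function(y, p, eps = 1e-10) {
  p <- pmin(pmax(p, eps), 1 - eps)      # guard against log(0) in pure nodes
  n <- length(y)
  ll1 <- sum(y * log(p) + (1 - y) * log(1 - p))                # fitted model
  p0 <- mean(y)
  ll0 <- sum(y * log(p0) + (1 - y) * log(1 - p0))              # null model
  cox_snell <- 1 - exp(2 / n * (ll0 - ll1))
  cox_snell / (1 - exp(2 / n * ll0))    # rescale so the maximum is 1
}

c_index <- function(y, p) {
  # Area under the ROC curve via the rank-sum formula; ties count as 1/2.
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: a tree with two terminal nodes of four observations each,
# with fitted probabilities equal to the node-wise proportions of y == 1.
y <- c(0, 0, 0, 1, 1, 1, 1, 0)
p <- c(rep(0.25, 4), rep(0.75, 4))

r2 <- nagelkerke_r2(y, p)
cval <- c_index(y, p)
print(r2)
print(cval)  # 0.75 for this toy example
```

Note that these are in-sample measures, so they will tend to be optimistic 
for a tree that was selected on the same data.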

> Any help would be greatly appreciated! 
>
> -- Christopher