[R] R package recommendation - recursively partition data frame, calculate summaries of node data frames, plot and print summaries

Ross Gayler r.gayler at gmail.com
Tue May 30 13:51:23 CEST 2017


I am after R package recommendations.

I have a data frame with ~5 million rows and ~50 columns. (I could do what
I want with a sample of the rows, but ideally I would use all the rows.)

(1) I want to recursively partition the rows of the data frame in a way
that I manually specify. That is, I want to generate a tree structure such
that each node of the tree represents a subset of the rows of the data
frame and the child nodes of any parent node represent a partition of the
rows represented by the parent node. This is the sort of thing that tree
induction algorithms like CART and ID3 do, but I want to manually specify
the tree structure rather than have some algorithm decide it for me.

(2) I want the means for specifying the tree structure to be as simple as
possible, because the users will be trying out different tree structures.

(3) Each node (internal or terminal) of the tree represents a row subset of
the root data frame. I want to be able to specify a function to be applied
to each node that takes the node data frame as input and calculates a set
of summary statistics. I will probably write this node summary function as
a dplyr pipeline. I will want to be able to associate the summaries with
the nodes so that I keep track of the summaries in terms of the tree
structure.

(4) I want to be able to print and plot the tree of summaries in a way that
shows the summaries in the context of the tree structure. Inevitably, there
will be fiddling with the formatting of the prints and plots, so I expect I
will need user-definable print/plot formatting functions that are applied
to each node of the tree.
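
To make requirements (1) to (4) concrete, here is a rough base-R sketch of
the kind of thing I have in mind. It is only an illustration under my own
assumptions, not a proposed design: the toy data, the nested-list tree
specification, and the names spec, node_summary, build_tree and print_tree
are all hypothetical. Each internal node supplies a function that labels
every row, and rows sharing a label go to the same child, so the children
always partition the parent's rows.

```r
# Toy data standing in for the real ~5 million row data frame (hypothetical).
df <- data.frame(
  region = c("N", "N", "S", "S", "S", "N"),
  age    = c(25, 40, 31, 58, 45, 62),
  spend  = c(10, 20, 30, 40, 50, 60)
)

# Hypothetical tree specification: a nested list. An internal node has a
# `split` function returning one label per row; `children` optionally gives
# the specification for the child with a given label.
spec <- list(
  split = function(d) d$region,
  children = list(
    N = list(split = function(d) ifelse(d$age < 50, "under50", "50plus")),
    S = list()  # terminal node: no further split
  )
)

# Hypothetical node summary function; a dplyr pipeline could be used instead.
node_summary <- function(d) {
  c(n = nrow(d), mean_spend = mean(d$spend))
}

# Recursively build the tree. Each node stores row indices into the root
# data frame plus its summary, rather than a copy of the node data frame.
build_tree <- function(data, spec, rows = seq_len(nrow(data)), name = "root") {
  d <- data[rows, , drop = FALSE]
  node <- list(name = name, rows = rows,
               summary = node_summary(d), children = list())
  if (!is.null(spec$split)) {
    labels <- as.character(spec$split(d))
    for (lab in sort(unique(labels))) {
      child_spec <- spec$children[[lab]]
      if (is.null(child_spec)) child_spec <- list()  # unspecified => terminal
      node$children[[lab]] <-
        build_tree(data, child_spec, rows[labels == lab], lab)
    }
  }
  node
}

# Print the tree with indentation by depth; `fmt` is a user-supplied
# formatter applied to each node's summary.
print_tree <- function(node,
                       fmt = function(s) paste(names(s), s, sep = "=",
                                               collapse = ", "),
                       depth = 0) {
  cat(strrep("  ", depth), node$name, ": ", fmt(node$summary), "\n", sep = "")
  for (child in node$children) print_tree(child, fmt, depth + 1)
  invisible(node)
}

tree <- build_tree(df, spec)
print_tree(tree)
```

This sketch ignores missing values in the split labels and does no plotting;
it is the specification, bookkeeping and formatting hooks that I would like a
package to handle for me.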

What I am looking for is an R package that provides the best starting point
for me to implement this. I am not a particularly good programmer, so
getting a package that minimises what I have to write is important to me.

So far, the most likely packages appear to be:

   - partykit <http://partykit.r-forge.r-project.org/partykit/>
   - data.tree <https://github.com/gluc/data.tree>

I would appreciate any recommendations for R packages that would serve as a
good base; any comments on the relative merits of the packages for my
purposes; and any pointers to example code of people doing similar things.

Thanks

Ross
