[R] R package recommendation - recursively partition data frame, calculate summaries of node data frames, plot and print summaries

Tue May 30 17:38:07 CEST 2017

Hello Ross,

> On 30 May 2017, at 14:51, Ross Gayler <r.gayler at gmail.com> wrote:
> 
> I am after R package recommendations.
> 
> I have a data frame with ~5 million rows and ~50 columns. (I could do what
> I want with a sample of the rows, but ideally i would use all the rows.)

Very nice question for a mental gymnastics. First thing that comes in my mind is SPEED/PERFORMANCE when you mention ~5 million and _recursive_ words as you did. That’s why, the functions that have heavy work load are written in C/FORTRAN (i.e. partykit[1]). Most of the time, You are only limited to function arguments in R if you don’t want speed lost. For instance, lmtree function in partykit package decides the tree structure by its own partitioning model. So, you don’t have any authority (based on you are not a good programmer) to change the basic decision algorithm but only function arguments. Hence, the ready-to-go R functions only let you change function arguments but not the basic internal decision algorithm written in C/FORTRAN in backstage.

> 
> (1) I want to recursively partition the rows of the data frame in a way
> that I manually specify. That is, I want to generate a tree structure such
> that each node of the tree represents a subset of the rows of the data
> frame and the child nodes of any parent node represent a partition of the
> rows represented by the parent node. This is the sort of thing that tree
> induction algorithms like CART and ID3 do, but I want to manually specify
> the tree structure rather than have some algorithm decide it for me.

At this point, “manually specify” requires you write your own decision mechanism (seems as your only chance). Scaring part here is a matrix with ~5milx50 cells and _recursive_ part. These are the first two of the things that R is not very good. Let’s create a very simple partitioning algorithm to see what we will encounter in the future.

part <- function(x, stop_rule = 1000) {
  set.seed(1)
  n <- round(runif(1, 1, nrow(x)))
  n1 <- 1:n; n2 <- (n + 1):nrow(x)
  part1 <- if (length(n1) < stop_rule)
    x[n1,] else part(x[n1,], stop_rule)
  part2 <- if (length(n2) < stop_rule)
    x[n2,] else part(x[n2,], stop_rule)
  return(list(p1 = part1, p2 = part2))
}

The _part_ function is a very simple partitioning algorithm by recursive function calling. It randomly parts a matrix till number of rows is less than 1000. Only 10 lines of code. Very nice huh? 
Now let’s create a data to play.

x <- matrix(runif(5e6), nrow = 5e6, ncol = 50)

Wow. please check the size of x. 1.9 GB! (Anyway, I have 8 core + 16 GB huge 27 inch desktop, I can handle it, I hope…) Now let’s apply the part function over x.

p <- part(x)

Please, note the execution time and watch the system memory! I used more than 16GB RAM and elapsed time is 46 sec for a very simple partitioning. And check the size of _p_ (1.9 GB). This is not good, mate. _p_ is a list of parts containing other subnodes. Terminal node contains partitioned rows of x matrix. also note that I didn’t use any information hidden in x matrix in calculations. only random length. So, in real life, if you use pure R to part things, you will end up with very long running functions (I mean days).

I think I can increase the performance a bit. Let’s try!

node <- 0
partf <- function(x, stop_rule = 1000) {
  set.seed(1)
  n <- round(runif(1, 1, nrow(x)))
  n1 <- 1:n; n2 <- (n + 1):nrow(x)
  part1 <- if (length(n1) < stop_rule) {
    node <<- node + 1
    rep(node, length(n1))
  } else partf(x[n1,], stop_rule)
  part2 <- if (length(n2) < stop_rule) {
    node <<- node + 1
    rep(node, length(n2))
  } else partf(x[n2,], stop_rule)
  return(c(part1, part2))
}

_partf_ function is the same as _part_ except it returns only an index of partitions. So, I can get rid of 1.9 GB _p_ variable.  Run the code below:

pf <- partf(x)

> 
> (2) I want the means for specifying the tree structure to be as simple as
> possible, because the users will be trying out different tree structures.

Now, excecution was completed in 32 sec and size of _pf_ is only 38 Mb. I can use _pf_ to index x matrix and calculate required summary statistics that you mention in (2). But monitor your memory pressure at this stage. _partf_function use still huge amount of RAM to complete a simple calculation. Accept these examples as a start point and decide the future of the project in your mind. 

> 
> (3) Each node (internal or terminal) of the tree represents a row subset of
> the root data frame. I want to be able to specify a function to be applied
> to each node that takes the node data frame as input and calculates a set
> of summary statistics. I will probably write this node summary function as
> a dplyr pipeline. I will want to be able to associate the summaries with
> the nodes so that I keep track of the summaries in terms of the tree
> structure.
> 
> (4) I want to be able to print and plot the tree of summaries in a way that
> shows the summaries in the context of the tree structure. Inevitably, there
> will be fiddling with the formatting of the prints and plots, so I expect i
> will need user definable print/plot formatting functions that are applied
> to each node of the tree.

Item 3 and 4 will be up to your R programming skills. Because a ready-to-go package won’t help you much as you want to construct your own partitioning rule.

> 
> What I am looking for is an R package that provides the best starting point
> for me to implement this. I am not a particularly good programmer, so
> getting a package that minimises what I have to write is important to me.

If you still insist on, I suggest you start to learn C/C++ and Rcpp and friends. So, you can write your own partitioning algorithm, your own decision mechanism. And your functions will work super duper fast and will use incredibly less RAM. Also Rcpp has parallel support. So, you can run your code faster on multicore machines. I hope my explanations make things clear in your mind.

> 
> So far, the most likely packages appear to be:
> 
>   - partykit <http://partykit.r-forge.r-project.org/partykit/>
>   - data.tree <https://github.com/gluc/data.tree>
> 
> I would appreciate any recommendations for R packages that would serve as a
> good base; any comments on the relative merits of the packages for my
> purposes; and any pointers to example code of people doing similar things.
> 
> Thanks
> 
> Ross

1- https://r-forge.r-project.org/scm/viewvc.php/pkg/partykit/src/?root=partykit <https://r-forge.r-project.org/scm/viewvc.php/pkg/partykit/src/?root=partykit>

	[[alternative HTML version deleted]]