[R] Sorting in trees problem

Wed Feb 24 17:57:26 CET 2016

"%in%" is a wrapper for match(), which uses hashing I believe
(correction welcome!),and so is generally very fast.

See ?Rprof  for profiling R code to get timings. (Haven't used it
myself, so not sure how useful it would be for this situation).

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Wed, Feb 24, 2016 at 8:45 AM, Axel Urbiz <axel.urbiz at gmail.com> wrote:
> Hello,
>
> As decision trees require sorting the variable used for splitting a given
> node, I'm trying to avoid having this recurrent sorting by only sorting all
> numeric variable first (and only once).
>
> My attempt in doing this is shown in "Solution 2" below, but although I get
> the desired result I think the %in% operation may be a costly one (and may
> even offset the benefits of pre-sorting).
>
> Any alternative solutions would be highly appreciated.
>
> ### Sample data
>
> set.seed(1)
> df <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
> w <- rep(1L, nrow(df))
> # w == 1L denote observation present in the current node
> w[c(1, 8, 10)] <- 0L
>
> ### The problem: sort x1 within observations present in the current node
>
> ### Solution 1: slow for repeated sorting
> nodeObsInd <- which(w == 1L)
> sol1 <- df[nodeObsInd, ]
> sol1 <- sol1[order(sol1$x1), ]$x1
>
> ### Solution 2: sort all variables initially only.
>
> sort_fun <- function(x)
> {
>   index <- order(x)
>   x <- x[index]
>   data.frame(x, index) # the index gives original position of the obs
> }
>
> s_df <- lapply(df, function(x) sort_fun(x))
> sol2 <- s_df[[1]][s_df$x1$index %in% nodeObsInd, ]
>
> ### check same result
>
> all.equal(sol1, sol2$x)
>
>
>
>
> Regards,
> Axel.
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.