[R] Complex sort problem

Mon May 21 21:43:59 CEST 2012

On Fri, May 18, 2012 at 09:20:59PM -0400, Axel Urbiz wrote:
[...]
> Petr: I kind of see your line of thought, but still cannot see how it works
> on a specific example like this one.

I did not have access to email for the last few days.

The previous suggestion from

  https://stat.ethz.ch/pipermail/r-help/2012-May/313197.html

was meant for the situation where we want to keep the result of
sorting by several variables, so that a subset can later be
sorted using only a single variable. Now I see that all the
sorts are already by a single variable, so this is not helpful.

Try the following, which uses the example from your code.
In particular, it uses a matrix (not a data frame) and
there are no duplicates in the data.

  set.seed(1)

  dframe <- matrix(runif(250), 50, 5)

  ### store sort indexes

  sort_matrix <- matrix(ncol = ncol(dframe), nrow = nrow(dframe))

  for (i in 1:ncol(dframe)) {
    xtemp <- dframe[, i]
    sort_matrix[, i] <- sort.list(xtemp, method = "shell")
  }

  ### take a bootstrap sample

  nr_samples <- nrow(dframe)
  b.ind <- sample(1:nr_samples, nr_samples*0.5, replace = TRUE)
  freq <- tabulate(b.ind, nbins=nr_samples)

  ### create bootstrap sample sorted with respect to an arbitrary variable

  var1 <- 1
  ind <- sort_matrix[, var1]
  DF1 <- dframe[ind, ]    # this can be computed in advance (before b.ind)
  NDF1 <- DF1[rep(1:nrow(DF1), times=freq[ind]), ]

  ### compare with a straightforward method

  subDF <- dframe[b.ind, ]
  subDF1 <- subDF[order(subDF[, var1]), ]
  identical(NDF1, subDF1)

  [1] TRUE
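
The same check can be repeated for every column, reusing the stored sort
indexes. The following is only a sketch under the same assumptions as above
(a numeric matrix with no duplicated values). The names "dat", "sort_idx",
"sorted_copies" and "sorted_boot" are made up for this illustration and do
not come from your code.

```r
set.seed(1)
dat <- matrix(runif(250), 50, 5)   # same shape as the example data
n <- nrow(dat)

# Precompute, once, the sort index and the sorted copy for every column.
sort_idx <- apply(dat, 2, order)   # n x p matrix of sort indexes
sorted_copies <- lapply(seq_len(ncol(dat)),
                        function(j) dat[sort_idx[, j], , drop = FALSE])

# One bootstrap draw, summarised as a frequency per original row.
b.ind <- sample.int(n, n %/% 2, replace = TRUE)
freq <- tabulate(b.ind, nbins = n)

# Hypothetical helper: sorted bootstrap sample for column j, without a new sort.
sorted_boot <- function(j) {
  ind <- sort_idx[, j]
  sorted_copies[[j]][rep(seq_len(n), times = freq[ind]), , drop = FALSE]
}

# Compare with sorting the resampled rows directly, for every column.
ok <- vapply(seq_len(ncol(dat)), function(j) {
  sub <- dat[b.ind, , drop = FALSE]
  identical(sorted_boot(j), sub[order(sub[, j]), , drop = FALSE])
}, logical(1))
```

Only "freq" changes between bootstrap draws. The sort indexes and the sorted
copies are computed once, at the cost of keeping one sorted copy per column
in memory.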

The key step is that "ind" is used to reorder both the data and
the frequency table. The two therefore stay aligned, and the
reordered frequencies can be applied directly to the reordered data.
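
A small example may make this easier to follow. It is only an illustration
with made-up values: the vector "x" and the draw "b.ind" below are chosen by
hand and are not taken from your data.

```r
# Toy data: five distinct values, so there are no ties.
x <- c(0.4, 0.1, 0.5, 0.3, 0.2)

# Sort index computed once, before any resampling.
ind <- order(x)
x_sorted <- x[ind]

# A hand-picked bootstrap draw of indexes and its frequency table.
b.ind <- c(3, 1, 3, 5)
freq <- tabulate(b.ind, nbins = length(x))

# Reorder the frequencies with the same index, then replicate.
freq_sorted <- freq[ind]
boot_sorted <- x_sorted[rep(seq_along(x_sorted), times = freq_sorted)]

# Same result as sorting the resampled values directly.
stopifnot(identical(boot_sorted, sort(x[b.ind])))
```

Here freq[k] counts how often the original element k was drawn, and
freq[ind][k] counts how often the k-th smallest element was drawn, which is
exactly the repetition count needed for x_sorted[k].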

Hope this helps.

Petr Savicky.