[R] Complex sort problem

Fri May 18 13:48:58 CEST 2012

On Fri, May 18, 2012 at 06:37:03AM -0400, Axel Urbiz wrote:
> Would I be able to accomplish the same if x.sample was created from x
> instead of x.sorted. The problem is that in my real problem, I have to sort
> with respect to many variables and thus keep the sample indexes consistent
> across variables. So I need to first take the sample and then sort it
> with respect to potentially any variable.

The suggestion

  set.seed(12345)
  x <- sample(0:100, 10)
  x.order <- order(x)
  x.sorted <- x[x.order]

  sample.ind <- sample(seq_along(x), 5, replace = TRUE)  # sample of half the size, with replacement

  x.sample <- x.sorted[sample.ind]
  freq <- tabulate(sample.ind, nbins=length(x))
  x.sample.sorted <- rep(x.sorted, times=freq)

relies on the fact that rep(x.sorted, times=freq) preserves the order
of x.sorted, so the result is the sample in sorted order. If x and
x.sorted are data frames, use

  sample.ind <- sample(1:nrow(x), 5, replace = TRUE)
  x.sample <- x.sorted[sample.ind, ]
  freq <- tabulate(sample.ind, nbins=nrow(x))
  x.sample.sorted <- x.sorted[rep(1:nrow(x.sorted), times=freq), ]

It is possible to have several x.sorted data frames, each sorted by
a different variable. Each of them then yields a pair x.sample and
x.sample.sorted containing the same sample, once unsorted and once
sorted. However, the same sample.ind selects different cases in each
x.sorted, so we get a different sample for each sorting variable.
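
As a rough illustration of this point, here is a small sketch. The
data frame dat and its columns a and b are made up for the example
and are not from the original question. The same sample.ind is
applied to two differently sorted copies of dat, and each result is
sorted by its own variable, but the sampled cases generally differ.

```r
set.seed(12345)
# Hypothetical data: two variables to sort by (names are illustrative)
dat <- data.frame(a = sample(0:100, 10), b = sample(0:100, 10))

# One sorted copy of the data per sorting variable
sorted.a <- dat[order(dat$a), ]
sorted.b <- dat[order(dat$b), ]

# A single vector of sampled positions, reused for both sorted copies
sample.ind <- sample(seq_len(nrow(dat)), 5, replace = TRUE)
freq <- tabulate(sample.ind, nbins = nrow(dat))

# Each result is already in sorted order for its own variable ...
sample.sorted.a <- sorted.a[rep(seq_len(nrow(dat)), times = freq), ]
sample.sorted.b <- sorted.b[rep(seq_len(nrow(dat)), times = freq), ]

# ... but position i refers to a different case in sorted.a and in
# sorted.b, so the two samples generally contain different cases.
```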

If the same sample must be sortable by several variables, the
following saves CPU time. For each relevant variable, compute the
order of the original data once and store it as a rank vector, which
gives the position of each case in that order. Then, instead of
sorting the data frame representing a sample, determine the order
from the corresponding subset of the rank vector. This may be faster
and produces the same order.
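
A minimal sketch of the rank vector idea, under the assumption that
the data are in a data frame dat with columns a and b (both names
are invented for the illustration). Here the sample is taken from
the unsorted data, so one sample.ind serves every sorting variable.

```r
set.seed(12345)
# Hypothetical data: two variables to sort by (names are illustrative)
dat <- data.frame(a = sample(0:100, 10), b = sample(0:100, 10))

# Computed once: the rank of each case under each sorting variable.
# ties.method = "first" makes the ranks a permutation of 1..n.
rank.a <- rank(dat$a, ties.method = "first")
rank.b <- rank(dat$b, ties.method = "first")

# One sample of row indices, drawn from the original (unsorted) data
sample.ind <- sample(seq_len(nrow(dat)), 5, replace = TRUE)
dat.sample <- dat[sample.ind, ]

# Sort the same sample by either variable, using only the
# corresponding subset of the precomputed rank vector
sample.by.a <- dat.sample[order(rank.a[sample.ind]), ]
sample.by.b <- dat.sample[order(rank.b[sample.ind]), ]
```

Both sample.by.a and sample.by.b contain exactly the cases in
dat.sample, so the sample indexes stay consistent across variables.
With integer ranks, order() works on a short integer vector, which
is where a speed gain may come from; whether it is noticeable
depends on the size of the data and the type of the variables.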

Petr Savicky.