[R] Complex sort problem
Petr Savicky
savicky at cs.cas.cz
Thu May 17 19:43:27 CEST 2012
On Thu, May 17, 2012 at 06:45:52AM -0400, Axel Urbiz wrote:
> Dear List,
>
> Is there a way I can sort a sample based on a sort index constructed from
> the data from which the sample is taken? Basically, I need to take 'many'
> samples from the same source data and sort them. This can be very time
> consuming for long vectors. Is there any way I can sort the data only once
> initially, and use that sort order for the samples?
>
> I believe that idea is what is implemented in tree-based classifiers, so
> the data is sorted only once initially and that sort order is used for the
> child nodes.
>
>
> set.seed(12345)
> x <- sample(0:100, 10)
> x.order <- order(x)
> x.sorted <- x[x.order]
>
> sample.ind <- sample(1:length(x), 5, replace = TRUE) #sample 1/2 size with replacement
> x.sample <- x[sample.ind]
>
> x.sample.sorted <- #??? (without sorting again)
Hi.
Formally, it is possible to avoid sorting using tabulate() and rep().
However, i am not sure, whether this approach is more efficient.
set.seed(12345)
x <- sample(0:100, 10)
x.order <- order(x)
x.sorted <- x[x.order]
sample.ind <- sample(1:length(x), 5, replace = TRUE) #sample 1/2 size with replacement
x.sample <- x.sorted[sample.ind]
freq <- tabulate(sample.ind, nbins=length(x))
x.sample.sorted <- rep(x.sorted, times=freq)
identical(sort(x.sample), x.sample.sorted) # [1] TRUE
Note that x.sample is created from x.sorted in order to make x.sample
and x.sample.sorted consistent. Since sample.ind has random order, the
distributions of x[sample.ind] and x.sorted[sample.ind] are the same.
Computing the frequencies of indices, whose range is known in advance,
can be done in linear time, so theoretically more efficiently than
sorting. However, only a test may determine, what is more efficient in
your situation.
Hope this helps.
Petr Savicky.
More information about the R-help
mailing list