[R] Complex sort problem

Fri May 18 15:50:26 CEST 2012

On May 18, 2012, at 6:37 AM, Axel Urbiz wrote:

> Would I be able to accomplish the same if x.sample was created from x
> instead of x.sorted. The problem is that in my real problem, I have  
> to sort
> with respect to many variables and thus keep the sample indexes  
> consistent
> across variables. So I need to first take the sample and then sort it
> with respect to potentially any variable.

Either of the strategies I suggested should be generalizable to many  
sort criteria. It should be possible to work with indices that point  
back to the source of the sample. (Feel free to post an example. At  
the moment your requirements seem a bit vague.)

-- 
David.

> Thanks again,
> Axel.
>
> On Thu, May 17, 2012 at 1:43 PM, Petr Savicky <savicky at cs.cas.cz>  
> wrote:
>
>> On Thu, May 17, 2012 at 06:45:52AM -0400, Axel Urbiz wrote:
>>> Dear List,
>>>
>>> Is there a way I can sort a sample based on a sort index  
>>> constructed from
>>> the data from which the sample is taken? Basically, I need to take  
>>> 'many'
>>> samples from the same source data and sort them. This can be very  
>>> time
>>> consuming for long vectors. Is there any way I can sort the data  
>>> only
>> once
>>> initially, and use that sort order for the samples?
>>>
>>> I believe that idea is what is implemented in tree-based  
>>> classifiers, so
>>> the data is sorted only once initially and that sort order is used  
>>> for
>> the
>>> child nodes.
>>>
>>>
>>> set.seed(12345)
>>> x <- sample(0:100, 10)
>>> x.order <- order(x)
>>> x.sorted <- x[x.order]
>>>
>>> sample.ind <- sample(1:length(x), 5, replace = TRUE)  #sample 1/2  
>>> size
>> with replacement
>>> x.sample <- x[sample.ind]
>>>
>>> x.sample.sorted <-   #??? (without sorting again)
>>
>> Hi.
>>
>> Formally, it is possible to avoid sorting using tabulate() and rep().
>> However, i am not sure, whether this approach is more efficient.
>>
>> set.seed(12345)
>> x <- sample(0:100, 10)
>> x.order <- order(x)
>> x.sorted <- x[x.order]
>>
>> sample.ind <- sample(1:length(x), 5, replace = TRUE)  #sample 1/2  
>> size
>> with replacement
>>
>>  x.sample <- x.sorted[sample.ind]
>> freq <- tabulate(sample.ind, nbins=length(x))
>> x.sample.sorted <- rep(x.sorted, times=freq)
>>
>> identical(sort(x.sample), x.sample.sorted) # [1] TRUE
>>
>> Note that x.sample is created from x.sorted in order to make x.sample
>> and x.sample.sorted consistent. Since sample.ind has random order,  
>> the
>> distributions of x[sample.ind] and x.sorted[sample.ind] are the same.
>>
>> Computing the frequencies of indices, whose range is known in  
>> advance,
>> can be done in linear time, so theoretically more efficiently than
>> sorting. However, only a test may determine, what is more efficient  
>> in
>> your situation.
>>
>> Hope this helps.
>>
>> Petr Savicky.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT