[Rd] Couldn't (and shouldn't) is.unsorted() be faster?

Fri Apr 18 07:17:38 CEST 2008

On Thu, 17 Apr 2008, Herve Pages wrote:

> Hi,
>
> Thanks for your answers!
>
> No need to change anything. In my case, 'x' is guaranteed to be an integer
> vector with no NAs so I can call .Internal(is.unsorted(x)) directly.
>
> BTW, why not make is.unsorted() a little bit more prepared to silly user
> input:

Because R is a volunteer project and resources spent on trapping misuse 
are resources not available to be spent on other things.  (Same as for bug 
reports on fixed issues ....)

>  > is.unsorted(c(2:5, NA), na.rm=NA)
>  Error in if (!is.atomic(x) || (!na.rm && any(is.na(x)))) return(NA) :
>    missing value where TRUE/FALSE needed
>
> (or at least silently coerce those silly values to TRUE or FALSE like
> max()/min() do, following some obscure logic though).
>
> Also it's arguable that a length-1 vector cannot be considered sorted:
>  > is.unsorted(NA)
>  [1] NA
>
> Cheers,
> H.
>
>
> Prof Brian Ripley wrote:
>> I wouldn't say 'easy', and so I think we need a business case for this 
>> change.  (One of the issues is that the internals are used elsewhere and 
>> optimized for inputs without NAs.  So we would need to write separate code 
>> if we pass NAs down to C level.  As I recall, is.unsorted was a cheap R 
>> interface to existing C code.)
>> 
>> What real-world problems are being affected by this, and what would be 
>> proportional speedup of the whole analysis be from this change?
>> 
>> On Thu, 17 Apr 2008, Bill Dunlap wrote:
>> 
>>> On Thu, 17 Apr 2008, Herve Pages wrote:
>>> 
>>>> Couldn't is.unsorted() bail out immediately here (after comparing
>>>> the first 2 elements):
>>>> 
>>>> > x <- 20000000:1
>>>> > system.time(is.unsorted(x), gcFirst=TRUE)
>>>>     user  system elapsed
>>>>    0.084   0.040   0.124
>>>> 
>>>> > x <- 200000000:1
>>>> > system.time(is.unsorted(x), gcFirst=TRUE)
>>>>     user  system elapsed
>>>>    0.772   0.440   1.214
>>> 
>>> The C code does bail out upon seeing the first out- of-order pair, but
>>> before calling the C code, the S code does any(is.na(x)), forcing a
>>> scan of the entire data.  If you remove the is.na calls from
>>> is.unsorted's S code you will see the timings improve in your example.
>>> (It looks easy to do the NA checks in the C code.)
>>>
>>>   is.unsorted.no.nacheck <- function (x, na.rm = FALSE) {
>>>       if (is.null(x))
>>>           return(FALSE)
>>>       if (!is.atomic(x))
>>>           return(NA)
>>>       .Internal(is.unsorted(x))
>>>   }
>>>   > x <- 20000000:1
>>>   > system.time(is.unsorted(x), gcFirst=TRUE)
>>>   user  system elapsed
>>>     0.356   0.157   0.514
>>>   > system.time(is.unsorted.no.nacheck(x), gcFirst=TRUE)
>>>   user  system elapsed
>>>      0       0       0
>>>   > revx <- rev(x)
>>>   > system.time(is.unsorted(revx), gcFirst=TRUE)
>>>      user  system elapsed
>>>     0.500   0.170   0.672
>>>   > system.time(is.unsorted.no.nacheck(revx),gcFirst=TRUE)
>>>      user  system elapsed
>>>     0.131   0.000   0.132
>>>
>>> 
>>> ---------------------------------------------------------------------------- 
>>> Bill Dunlap
>>> Insightful Corporation
>>> bill at insightful dot com
>>> 360-428-8146
>>> 
>>> "All statements in this message represent the opinions of the author and 
>>> do
>>> not necessarily reflect Insightful Corporation policy or position."
>>> 
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>> 
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595