[Rd] Couldn't (and shouldn't) is.unsorted() be faster?
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Apr 18 07:17:38 CEST 2008
On Thu, 17 Apr 2008, Herve Pages wrote:
> Hi,
>
> Thanks for your answers!
>
> No need to change anything. In my case, 'x' is guaranteed to be an integer
> vector with no NAs so I can call .Internal(is.unsorted(x)) directly.
>
> BTW, why not make is.unsorted() a little bit more prepared to silly user
> input:
Because R is a volunteer project and resources spent on trapping misuse
are resources not available to be spent on other things. (Same as for bug
reports on fixed issues ....)
> > is.unsorted(c(2:5, NA), na.rm=NA)
> Error in if (!is.atomic(x) || (!na.rm && any(is.na(x)))) return(NA) :
> missing value where TRUE/FALSE needed
>
> (or at least silently coerce those silly values to TRUE or FALSE like
> max()/min() do, following some obscure logic though).
>
> Also it's arguable that a length-1 vector cannot be considered sorted:
> > is.unsorted(NA)
> [1] NA
>
> Cheers,
> H.
>
>
> Prof Brian Ripley wrote:
>> I wouldn't say 'easy', and so I think we need a business case for this
>> change. (One of the issues is that the internals are used elsewhere and
>> optimized for inputs without NAs. So we would need to write separate code
>> if we pass NAs down to C level. As I recall, is.unsorted was a cheap R
>> interface to existing C code.)
>>
>> What real-world problems are being affected by this, and what would be
>> proportional speedup of the whole analysis be from this change?
>>
>> On Thu, 17 Apr 2008, Bill Dunlap wrote:
>>
>>> On Thu, 17 Apr 2008, Herve Pages wrote:
>>>
>>>> Couldn't is.unsorted() bail out immediately here (after comparing
>>>> the first 2 elements):
>>>>
>>>> > x <- 20000000:1
>>>> > system.time(is.unsorted(x), gcFirst=TRUE)
>>>> user system elapsed
>>>> 0.084 0.040 0.124
>>>>
>>>> > x <- 200000000:1
>>>> > system.time(is.unsorted(x), gcFirst=TRUE)
>>>> user system elapsed
>>>> 0.772 0.440 1.214
>>>
>>> The C code does bail out upon seeing the first out- of-order pair, but
>>> before calling the C code, the S code does any(is.na(x)), forcing a
>>> scan of the entire data. If you remove the is.na calls from
>>> is.unsorted's S code you will see the timings improve in your example.
>>> (It looks easy to do the NA checks in the C code.)
>>>
>>> is.unsorted.no.nacheck <- function (x, na.rm = FALSE) {
>>> if (is.null(x))
>>> return(FALSE)
>>> if (!is.atomic(x))
>>> return(NA)
>>> .Internal(is.unsorted(x))
>>> }
>>> > x <- 20000000:1
>>> > system.time(is.unsorted(x), gcFirst=TRUE)
>>> user system elapsed
>>> 0.356 0.157 0.514
>>> > system.time(is.unsorted.no.nacheck(x), gcFirst=TRUE)
>>> user system elapsed
>>> 0 0 0
>>> > revx <- rev(x)
>>> > system.time(is.unsorted(revx), gcFirst=TRUE)
>>> user system elapsed
>>> 0.500 0.170 0.672
>>> > system.time(is.unsorted.no.nacheck(revx),gcFirst=TRUE)
>>> user system elapsed
>>> 0.131 0.000 0.132
>>>
>>>
>>> ----------------------------------------------------------------------------
>>> Bill Dunlap
>>> Insightful Corporation
>>> bill at insightful dot com
>>> 360-428-8146
>>>
>>> "All statements in this message represent the opinions of the author and
>>> do
>>> not necessarily reflect Insightful Corporation policy or position."
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
More information about the R-devel
mailing list