[R] Possible Improvement to sapply

Wed Mar 14 10:14:57 CET 2018

Well thanks, Martin, and glad to see there is some potential here. This
wasn't reported as a bug, but as you note really as a question originally
and with an invitation to critique my code.

On 3/14/18, 5:11 AM, "Martin Maechler" <maechler at stat.math.ethz.ch> wrote:

>>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>>     on Tue, 13 Mar 2018 10:12:55 -0700 writes:
>
>> FYI, in R devel (to become 3.5.0), there's isFALSE() which will cut
>> some corners compared to identical():
>
>> > microbenchmark::microbenchmark(identical(FALSE, FALSE), isFALSE(FALSE))
>> Unit: nanoseconds
>>                     expr min   lq    mean median     uq   max neval
>>  identical(FALSE, FALSE) 984 1138 1694.13 1218.0 1337.5 13584   100
>>           isFALSE(FALSE) 713  761 1133.53  809.5  871.5 18619   100
>
>> > microbenchmark::microbenchmark(identical(TRUE, FALSE), isFALSE(TRUE))
>> Unit: nanoseconds
>>                    expr  min     lq    mean median   uq   max neval
>>  identical(TRUE, FALSE) 1009 1103.5 2228.20 1170.5 1357 14346   100
>>           isFALSE(TRUE)  718  760.0 1298.98  798.0  898 17782   100
>
>> > microbenchmark::microbenchmark(identical("array", FALSE), isFALSE("array"))
>> Unit: nanoseconds
>>                       expr min     lq    mean median     uq  max neval
>>  identical("array", FALSE) 975 1058.5 1257.95 1119.5 1250.0 9299   100
>>           isFALSE("array") 409  433.5  658.76  446.0  476.5 9383   100
>
>Thank you Henrik!
>
>The speed of the new isTRUE() and isFALSE() is indeed amazing
>compared to identical() which was written to be fast itself.
>
>Note that the new code goes back to a proposal by  Hervé Pagès
>(of Bioconductor fame) in a thread with R core in April 2017.
>The goal of the new code actually *was* to allow calls like
>
>  isTRUE(c(a = TRUE))
>
>to become TRUE rather than improving speed.
>The new source code is at the end of  R/src/library/base/R/identical.R
>
>## NB:  is.logical(.) will never dispatch:
>## --                 base::is.logical(x)  <==>  typeof(x) == "logical"
>isTRUE  <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && x
>isFALSE <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && !x
>
>and one *reason* this is so fast is that all  6  functions which
>are called are primitives :
>
>> sapply(codetools::findGlobals(isTRUE), function(fn) is.primitive(get(fn)))
>         !         &&         == is.logical      is.na     length
>      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
>
>and a 2nd reason is probably with the many recent improvements of the
>byte compiler.
>
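A minimal sketch of how the two definitions quoted above behave on edge cases. The names is_true and is_false are hypothetical local copies of the quoted source, used only so the snippet does not depend on running R >= 3.5.0; they are not part of base R.

```r
# Hypothetical local copies of the definitions quoted above.
is_true  <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && x
is_false <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && !x

# Names are ignored, unlike with identical():
r_named_true  <- is_true(c(a = TRUE))
r_named_false <- is_false(c(a = FALSE))
r_identical   <- identical(c(a = FALSE), FALSE)

# Anything that is not a single non-NA logical gives FALSE:
r_na     <- is_false(NA)               # fails the !is.na(x) test
r_len2   <- is_false(c(FALSE, FALSE))  # fails the length test
r_number <- is_false(0)                # fails is.logical(x)
r_string <- is_false("array")          # fails is.logical(x)

print(c(r_named_true, r_named_false, r_identical,
        r_na, r_len2, r_number, r_string))
```

Because && short-circuits, a non-logical argument such as "array" is rejected by the first test alone, which may be part of why that case is the fastest in the timings quoted earlier.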
>
>> That could probably be used also in sapply().  The difference is that
>> isFALSE() is a bit more liberal than identical(x, FALSE), e.g.
>
>> > isFALSE(c(a = FALSE))
>> [1] TRUE
>> > identical(c(a = FALSE), FALSE)
>> [1] FALSE
>
>> Assuming the latter is not an issue, there are 69 places in base R
>> where isFALSE() could be used:
>
>> $ grep -E "identical[(][^,]+,[ ]*FALSE[)]" -r --include="*.R" | grep -F "/R/" | wc
>>      69     326    5472
>
>> and another 59 where isTRUE() can be used:
>
>> $ grep -E "identical[(][^,]+,[ ]*TRUE[)]" -r --include="*.R" | grep -F "/R/" | wc
>>      59     307    5021
>
>Beautiful use of  'grep' -- thank you for those above, as well.
>It does need a quick manual check, but if I use the above grep
>from Emacs (via  'M-x grep')  or even better via a TAGS table
>and M-x tags-query-replace  I should be able to do the changes
>pretty quickly... and will start looking into that later today.
>
>Interestingly and to my great pleasure, the first part of the
>'Subject' of this mailing list thread, "Possible Improvement",
>*has* become true after all --
>
>-- thanks to Henrik !
>
>Martin Maechler
>ETH Zurich
>
>
>
>> On Tue, Mar 13, 2018 at 9:21 AM, Doran, Harold <HDoran at air.org> wrote:
>> > Quite possibly, and I'll look into that. Aside from the work I was
>> > doing, however, I wonder if there is a way such that sapply could avoid
>> > the overhead of having to call the identical function to determine the
>> > conditional path.
>> >
>> >
>> >
>> > From: William Dunlap [mailto:wdunlap at tibco.com]
>> > Sent: Tuesday, March 13, 2018 12:14 PM
>> > To: Doran, Harold <HDoran at air.org>
>> > Cc: Martin Morgan <martin.morgan at roswellpark.org>;
>>r-help at r-project.org
>> > Subject: Re: [R] Possible Improvement to sapply
>> >
>> > Could your code use vapply instead of sapply?  vapply forces you to
>> > declare the type and dimensions of FUN's output and stops if any call
>> > to FUN does not match the declaration.  It can use much less memory
>> > and time than sapply because it fills in the output array as it goes
>> > instead of calling lapply() and seeing how it could be simplified.
>> >
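A minimal sketch of the vapply suggestion above, under stated assumptions: the responses list is hypothetical toy data standing in for per-student item scores, and sum stands in for whatever FUN the real scoring code uses.

```r
# Hypothetical toy data: one numeric vector of item scores per student.
responses <- list(a = c(1, 0, 1), b = c(0, 0, 1, 1), c = numeric(0))

# sapply() calls lapply() and then tries to simplify the result.
s1 <- sapply(responses, sum)

# vapply() is told up front that each call returns one double.
v1 <- vapply(responses, sum, FUN.VALUE = numeric(1))

# With zero-length input, sapply() falls back to an empty list,
# while vapply() still returns the declared type.
s0 <- sapply(list(), sum)
v0 <- vapply(list(), sum, FUN.VALUE = numeric(1))

# A result that does not match the declaration is an error, not a
# silently different shape: range() returns two values, not one.
bad <- tryCatch(vapply(responses, range, FUN.VALUE = numeric(1)),
                error = function(e) "error")

print(identical(s1, v1))
print(class(s0))
print(class(v0))
print(bad)
```

Whether this is faster in the real scoring code would have to be measured; the sketch only shows the declared-type behaviour described above.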
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Tue, Mar 13, 2018 at 7:06 AM, Doran, Harold <HDoran at air.org> wrote:
>> > Martin
>> >
>> > In terms of context of the actual problem, sapply is called millions
>> > of times because the work involves scoring individual students who took
>> > a test. A score for student A is generated and then student B and such
>> > and there are millions of students. The psychometric process of scoring
>> > students is complex and our code makes use of sapply many times for each
>> > student.
>> >
>> > The toy example used length just to illustrate; our actual code
>> > doesn't do that. But your point is well taken: there may be a very good
>> > counterexample showing why my proposal doesn't achieve the goal in a
>> > generalizable way.
>> >
>
>
>[.................]
>