[R] Possible Improvement to sapply

Wed Mar 14 10:11:26 CET 2018

>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>     on Tue, 13 Mar 2018 10:12:55 -0700 writes:

> FYI, in R devel (to become 3.5.0), there's isFALSE() which will cut
> some corners compared to identical():

> > microbenchmark::microbenchmark(identical(FALSE, FALSE), isFALSE(FALSE))
> Unit: nanoseconds
>                     expr min   lq    mean median     uq   max neval
>  identical(FALSE, FALSE) 984 1138 1694.13 1218.0 1337.5 13584   100
>           isFALSE(FALSE) 713  761 1133.53  809.5  871.5 18619   100

> > microbenchmark::microbenchmark(identical(TRUE, FALSE), isFALSE(TRUE))
> Unit: nanoseconds
>                    expr  min     lq    mean median   uq   max neval
>  identical(TRUE, FALSE) 1009 1103.5 2228.20 1170.5 1357 14346   100
>           isFALSE(TRUE)  718  760.0 1298.98  798.0  898 17782   100

> > microbenchmark::microbenchmark(identical("array", FALSE), isFALSE("array"))
> Unit: nanoseconds
>                       expr min     lq    mean median     uq  max neval
>  identical("array", FALSE) 975 1058.5 1257.95 1119.5 1250.0 9299   100
>           isFALSE("array") 409  433.5  658.76  446.0  476.5 9383   100

Thank you Henrik!

The speed of the new isTRUE() and isFALSE() is indeed amazing
compared to identical() which was written to be fast itself.

Note that the new code goes back to a proposal by  Hervé Pagès
(of Bioconductor fame) in a thread with R core in April 2017.
The goal of the new code actually *was*  to allow calls like

  isTRUE(c(a = TRUE))   

to return TRUE, rather than to improve speed.
The new source code is at the end of  R/src/library/base/R/identical.R

## NB:  is.logical(.) will never dispatch:
## --                 base::is.logical(x)  <==>  typeof(x) == "logical"
isTRUE  <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && x
isFALSE <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && !x

and one *reason* this is so fast is that all  6  functions which
are called are primitives :

> sapply(codetools::findGlobals(isTRUE), function(fn) is.primitive(get(fn)))
         !         &&         == is.logical      is.na     length 
      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 

and a 2nd reason is probably the many recent improvements to the
byte compiler.
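
To make the semantics of the two definitions above concrete, here is a
small sketch of how they behave on a few edge cases. It assumes
R >= 3.5.0 (where isFALSE() exists and isTRUE() has the definition shown
above); the object name 'x' is just for illustration.

```r
## Assumes R >= 3.5.0; 'x' is an illustrative name, not from the thread.
x <- c(a = TRUE)                 # a named logical vector of length one

r_named     <- isTRUE(x)         # names are ignored by the new isTRUE()
r_identical <- identical(x, TRUE)  # identical() also compares attributes

## Anything that is not a single non-NA logical gives FALSE:
r_na   <- c(isTRUE(NA), isFALSE(NA))          # missing value
r_len2 <- isTRUE(c(TRUE, TRUE))               # length 2
r_type <- c(isFALSE(0L), isFALSE("FALSE"))    # not of type "logical"

print(c(named = r_named, identical = r_identical))
print(r_na); print(r_len2); print(r_type)
```

Only the first result is TRUE; the named vector is where isTRUE() and
identical(., TRUE) disagree.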

> That could probably be used also in sapply().  The difference is that
> isFALSE() is a bit more liberal than identical(x, FALSE), e.g.

> > isFALSE(c(a = FALSE))
> [1] TRUE
> > identical(c(a = FALSE), FALSE)
> [1] FALSE

> Assuming the latter is not an issue, there are 69 places in base R
> where isFALSE() could be used:

> $ grep -E "identical[(][^,]+,[ ]*FALSE[)]" -r --include="*.R" | grep -F "/R/" | wc
>      69     326    5472

> and another 59 where isTRUE() can be used:

> $ grep -E "identical[(][^,]+,[ ]*TRUE[)]" -r --include="*.R" | grep -F "/R/" | wc
>      59     307    5021

Beautiful use of  'grep' -- thank you for those above, as well.
It does need a quick manual check, but if I use the above grep
from Emacs (via  'M-x grep')  or even better via a TAGS table
and M-x tags-query-replace  I should be able to do the changes
pretty quickly... and will start looking into that later today.
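
For those without grep at hand, roughly the same count could be obtained
from within R.  This is only a sketch: count_identical_calls() is a
hypothetical helper (not part of base R) that reuses the regular
expression from the grep above and, like it, only flags candidates that
still need the manual check.

```r
## Hypothetical helper, for illustration only: count lines in *.R files
## below 'dir' (restricted to paths containing "/R/") that look like
## identical(<expr>, FALSE) or identical(<expr>, TRUE).
count_identical_calls <- function(dir, value = c("FALSE", "TRUE")) {
    value   <- match.arg(value)
    pattern <- sprintf("identical[(][^,]+,[ ]*%s[)]", value)
    files   <- list.files(dir, pattern = "[.]R$",
                          recursive = TRUE, full.names = TRUE)
    files   <- files[grepl("/R/", files, fixed = TRUE)]
    ## one integer per file: the number of matching lines
    hits <- vapply(files,
                   function(f) sum(grepl(pattern, readLines(f, warn = FALSE))),
                   integer(1))
    sum(hits)
}
```

It would be called as count_identical_calls(<source dir>, "FALSE"), where
the directory is whatever local checkout of the R sources is being
searched.  Note that it counts matching lines, as 'grep | wc' does, not
individual calls.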

Interestingly and to my great pleasure, the first part of the
'Subject' of this mailing list thread, "Possible Improvement",
*has* become true after all --

-- thanks to Henrik !

Martin Maechler
ETH Zurich

> On Tue, Mar 13, 2018 at 9:21 AM, Doran, Harold <HDoran at air.org> wrote:
> > Quite possibly, and I’ll look into that. Aside from the work I was doing, however, I wonder if there is a way such that sapply could avoid the overhead of having to call the identical function to determine the conditional path.
> >
> >
> >
> > From: William Dunlap [mailto:wdunlap at tibco.com]
> > Sent: Tuesday, March 13, 2018 12:14 PM
> > To: Doran, Harold <HDoran at air.org>
> > Cc: Martin Morgan <martin.morgan at roswellpark.org>; r-help at r-project.org
> > Subject: Re: [R] Possible Improvement to sapply
> >
> > Could your code use vapply instead of sapply?  vapply forces you to declare the type and dimensions
> > of FUN's output and stops if any call to FUN does not match the declaration.  It can use much less
> > memory and time than sapply because it fills in the output array as it goes instead of calling lapply()
> > and seeing how it could be simplified.
> >
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com
> >
> > On Tue, Mar 13, 2018 at 7:06 AM, Doran, Harold <HDoran at air.org> wrote:
> > Martin
> >
> > In terms of context of the actual problem, sapply is called millions of times because the work involves scoring individual students who took a test. A score for student A is generated and then student B and such and there are millions of students. The psychometric process of scoring students is complex and our code makes use of sapply many times for each student.
> >
> > The toy example used length just to illustrate; our actual code doesn't do that. But your point is well taken: there may be a very good counterexample showing why my proposal doesn't achieve the goal in a generalizable way.
> >
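
The vapply() suggestion quoted above can be illustrated with a small
sketch.  The data ('scores') are made up for illustration and have
nothing to do with the actual scoring code discussed in the thread.

```r
## Made-up data: three "students", each with a numeric vector of item scores.
scores <- list(a = c(1, 0, 1), b = c(0, 0, 1), c = c(1, 1, 1))

## sapply() calls lapply() and only afterwards works out how to simplify
s <- sapply(scores, mean)

## vapply() is told up front that each call returns one double
v <- vapply(scores, mean, FUN.VALUE = numeric(1))

same <- identical(s, v)   # both are named numeric vectors

## If FUN's result does not match the declaration, vapply() stops:
## range() returns two values, but one was declared.
res <- tryCatch(vapply(scores, range, FUN.VALUE = numeric(1)),
                error = function(e) "error")
```

Here 'same' is TRUE and 'res' is "error"; whether vapply() gives a
noticeable speed-up depends on the real FUN and data.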

[.................]