[Rd] findInterval

Tue Sep 17 19:27:35 CEST 2024

The other problem in this example is setting NA's.

   replace(x, x == 0, NA)

requires two instances of x, which makes it not very pipe friendly.  dplyr
has na_if to address that problem, and base R might gain something similar
so that we don't have to define our own zero2na, now that base R has pipes.
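
To make the point concrete, here is a minimal sketch. The helper name
na_if_value is hypothetical (it is not a base R function); it only mimics
what dplyr::na_if(x, 0) would do for this simple case. The anonymous-function
form shows what is needed today if no helper is defined.

```r
# Hypothetical helper; the name na_if_value is an assumption, not base R.
na_if_value <- function(x, value) replace(x, x == value, NA)

idx <- c(0, 2, 3, 0, 6)

# Without a helper, x must appear twice, so the pipe needs an
# anonymous function (R >= 4.1) to bind the piped value to a name:
res_lambda <- idx |> (\(v) replace(v, v == 0, NA))()

# With the helper, the piped value appears only once:
res_helper <- idx |> na_if_value(0)

print(res_helper)
```

Both forms give the same vector; the second one just reads left to right.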

On Tue, Sep 17, 2024 at 12:14 PM Martin Maechler
<maechler using stat.math.ethz.ch> wrote:
>
> >>>>> Gabor Grothendieck
> >>>>>     on Mon, 16 Sep 2024 11:21:55 -0400 writes:
>
>     > Suppose we have `dat` shown below and we want to find the `y` value
>     > corresponding to the last value in `x` equal to the corresponding component
>     > of `seek` and we wish to return an output the same length as `seek` using
>     > `findInterval` to perform  the search.  This returns the correct result:
>
>     > dat <- data.frame(x = c(2, 2, 3, 4, 4, 4),
>     >                   y = c(37, 12, 19, 30, 6, 15),
>     >                  seek = 1:6)
>
>     > zero2na <- function(x) replace(x, x == 0, NA)
>
>     > dat |>
>     > transform(result = y[ zero2na(findInterval(seek, x)) ] ) |>
>     > _$result
>     > ## [1] NA 12 19 15 15 15
>
> I'd write that as
>
>     with(dat, y[ zero2na(findInterval(seek, x)) ] )
>
> so I can read it without jumping through hoops and standing on my head ...
>
>     > Since `findInterval` returns an index it is natural that the next step be
>     > to use the index and it is also common that we want a result that is the
>     > same length as the input.
>
> I think your example where x and y are of the same length
> is not typical.
>
> Note that the design of   findInterval(x, vec, ..)  is indeed to always return
> an index, but there isn't any "nomatch", but rather a
> - "left of the leftmost", i.e.,  an x[i] < vec[1]  (as 'vec' must be
>   sorted increasingly) or
> - "right of rightmost"  , i.e.,  an x[i] > vec[length(vec)]
>
> and these should give *different* results (and not both the
> same).
>
> I don't think 'nomatch' would improve the relatively clean  findInterval()
> behavior.
>
> There are  three logical switches  ... which allow   2^3
> variants of which I now guess only 6  differ:
>
> Here's some R code showing the possibilities:
>
>
> (argsTF <- names(formals(findInterval))[-(1:2)]) # "rightmost.closed"  "all.inside" "left.open"
> FT <- c(FALSE, TRUE)
> allFT <- as.matrix(expand.grid(rightmost.closed = FT,
>                                all.inside       = FT,
>                                left.open        = FT))
> allFT
> (cn <- substr(colnames(allFT), 1,1)) #  "r" "a" "l"
>
> x <- 2:18
> v <- c(5, 10, 15) # create two bins [5,10) and [10,15)
>
> fiAll <- apply(allFT, 1, function(r.a.f)
>     do.call(findInterval, c(list(x, v), as.list(r.a.f))))
>
> cbind(x, fiAll) # has all info
>
> ## must find cool 'column names' for fiAll: construct from r.., a.., l.. = F / T
> (cn1 <- apply(`dim<-`(c(".","|")[allFT+1L], dim(allFT)), 1, paste0, collapse=""))
> ##  "..." "|.." ".|." "||." "..|" "|.|" ".||" "|||"
> colnames(fiAll) <- cn1
> cbind(x, fiAll) ## --> col. 3 == 4  and  7 == 8
> ##==> show only unique columns:
> cbind(x, t(unique(t(fiAll))))
>  ##  x ... |.. .|. ..| |.| .||
>  ##  2   0   0   1   0   0   1
>  ##  3   0   0   1   0   0   1
>  ##  4   0   0   1   0   0   1
>  ##  5   1   1   1   0   1   1
>  ##  6   1   1   1   1   1   1
>  ##  7   1   1   1   1   1   1
>  ##  8   1   1   1   1   1   1
>  ##  9   1   1   1   1   1   1
>  ## 10   2   2   2   1   1   1
>  ## 11   2   2   2   2   2   2
>  ## 12   2   2   2   2   2   2
>  ## 13   2   2   2   2   2   2
>  ## 14   2   2   2   2   2   2
>  ## 15   3   2   2   2   2   2
>  ## 16   3   3   2   3   3   2
>  ## 17   3   3   2   3   3   2
>  ## 18   3   3   2   3   3   2
>
>
> Martin

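The two out-of-range cases Martin describes can be checked directly. This is
only a small sketch with made-up data (v, x and y below are illustrative, not
taken from the example above); it shows that the default call keeps "left of
the leftmost" (0) and "right of the rightmost" (length(v)) distinct, while
all.inside = TRUE clamps both into 1..(length(v) - 1).

```r
v <- c(5, 10, 15)        # sorted break points
x <- c(1, 5, 12, 20)     # 1 lies left of v[1], 20 lies right of v[3]

# Default: 0 marks "left of the leftmost", length(v) marks "right of the rightmost"
default_idx <- findInterval(x, v)

# all.inside = TRUE forces every result into 1..(length(v) - 1)
inside_idx <- findInterval(x, v, all.inside = TRUE)

print(rbind(x, default_idx, inside_idx))

# Only the 0 case cannot be used directly as an index into a vector
# of length(v); mapping it to NA keeps the result the same length as x.
y <- c("a", "b", "c")
looked_up <- y[replace(default_idx, default_idx == 0, NA)]
print(looked_up)
```

So a single "nomatch" value would have to collapse two cases that
findInterval currently keeps apart.
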
-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com