[Rd] Development version of R: Improved nchar(), nzchar() but changed API

Martin Maechler maechler at lynne.stat.math.ethz.ch
Fri Apr 24 12:06:23 CEST 2015

Those of you who track R development closely will have noticed
yesterday's commit of enhanced versions of nchar() and nzchar().

r68254 | maechler | 2015-04-23 18:06:37 +0200 (Thu, 23 Apr 2015) | 1 line
Changed paths:
   M doc/NEWS.Rd
   M src/library/base/R/New-Internal.R
   M src/library/base/R/zzz.R
   M src/library/base/man/nchar.Rd
   M src/main/character.c
   M src/main/names.c
   M tests/reg-tests-1a.R

nchar(x) now gives NA for character NAs, configurably via nchar(x, keepNA=*); analogously for nzchar()

Both functions are enhanced via the new argument 'keepNA' (a logical, i.e.,
TRUE/FALSE/NA), but the current implementation is also *not* backward
compatible.  Here's how it works [currently], showing the input and output
of the slightly abridged example(nchar):

> x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
> x[3] <- NA; x
[1] "asfef"           "qwerty"          NA                "b"              
[5] "stuff.blah.yech"
> nchar(x, keepNA= TRUE) #  5  6 NA  1 15
[1]  5  6 NA  1 15
> nchar(x, keepNA=FALSE) #  5  6  2  1 15
[1]  5  6  2  1 15
> stopifnot(identical(nchar(x     ), nchar(x, keepNA= TRUE)),
            identical(nchar(x, "w"), nchar(x, keepNA=FALSE)))

The main reason for the change: it is more logical
that  NA_character_ entries in x  are transformed to  NA_integer_ in the result.
That is what happens with 'keepNA = TRUE', which can be
translated as "keep/preserve the NA's that were in x (the main argument)".
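
A minimal sketch of that reading for both functions, passing keepNA
explicitly so that the result does not depend on whichever default a given
revision uses.  It assumes an R version in which both nchar() and nzchar()
accept 'keepNA'; the vector 'y' is only an illustration:

```r
y <- c("abc", NA, "")

# keepNA = TRUE: character NAs in the input stay NA in the result
n_keep  <- nchar(y,  keepNA = TRUE)
nz_keep <- nzchar(y, keepNA = TRUE)

# keepNA = FALSE: a missing string is treated like the 2 characters "NA"
n_drop  <- nchar(y,  keepNA = FALSE)
nz_drop <- nzchar(y, keepNA = FALSE)

print(rbind(n_keep, n_drop))
print(rbind(nz_keep, nz_drop))
```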

If you use  nchar(x, type = "width"),  or its short form  nchar(x, "w"),
you implicitly ask for  'keepNA = FALSE',
because "width" is about output / formatting / etc, and there,
you'd typically want 

        nchar(c("ABC", NA),  "width") 

to give      3  2   -- which is what happens unconditionally in R <= 3.2.0.
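
To illustrate why a width of 2 is the useful answer for formatting, here is
a small sketch that pads strings to a common display width.  It is only an
example of mine, not code from R itself; it passes keepNA = FALSE explicitly
and uses strrep(), so it assumes an R version that provides both:

```r
z <- c("ABC", NA)

# Display width of each element; a missing string counts as the
# 2 characters of "NA", which is how it would be printed.
w <- nchar(z, type = "width", keepNA = FALSE)

# Pad every element on the right to the widest one.
pad    <- strrep(" ", max(w) - w)
padded <- paste0(ifelse(is.na(z), "NA", z), pad)

print(w)
print(padded)
```

With an NA width here, max(w) would be NA as well and the padding
computation would fail, which is why formatting code wants keepNA = FALSE.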

We've found quite a few CRAN packages to "break" (R CMD check)
for R-devel r68254, because I had clearly underestimated the
number of places where existing R code relies on the
"pre-R-devel" (aka "current R") semantics of nchar() and
nzchar(), which the help page for R <= 3.2.0 describes as

       For ‘nchar’, an integer vector giving the sizes of each element,
       __currently__ always ‘2’ for missing values (for ‘NA’).

 (my emphasis added to  "currently").

As a package author, when you see problems with R-devel
(that you don't see with R 3.2.0), you may wait a day,
but you should become aware of the slightly changed semantics of
nchar() and nzchar().
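
For package code that must give the same answer under both semantics, one
possible approach is to set the result for missing strings explicitly
instead of relying on what nchar() returns for them.  The two helpers below
are hypothetical (the names nchar_na2 and nchar_naNA are mine, not part of
base R) and are meant as a sketch only:

```r
# Hypothetical helper: always the R <= 3.2.0 result, i.e. 2 for
# missing strings, whatever nchar() itself returns for them.
nchar_na2 <- function(x, ...) {
  x <- as.character(x)
  n <- nchar(x, ...)
  n[is.na(x)] <- 2L
  n
}

# Hypothetical counterpart: always propagate NA, also on older R.
nchar_naNA <- function(x, ...) {
  x <- as.character(x)
  n <- nchar(x, ...)
  n[is.na(x)] <- NA_integer_
  n
}

x <- c("asfef", NA, "b")
print(nchar_na2(x))
print(nchar_naNA(x))
```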

In the longer term, the change should make R more "internally coherent",
namely with vectorized R functions preserving NA's by default.

Martin Maechler,
ETH Zurich

