[Rd] complex NA's match(), etc: not back-compatible change proposal

Mon May 23 18:06:52 CEST 2016

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Fri, 13 May 2016 16:33:05 +0000 writes:

    > That, for example, complex(real=NaN) and complex(imaginary=NaN) are regarded as equal makes it possible that 

    >  length(unique(as.character(x))) > length(unique(x)) 

    > (current code of function 'factor' doesn't expect it). 

Thank you, that is an interesting remark - but it is already true
in earlier versions of R!

.. and of course this is because we do *print*   0+NaNi  etc,
i.e., we differentiate the  non-NA-but-NaN complex values in
formatting / printing but not in match(), unique() ...

and indeed, with the  'z'  example below,
  fz <- factor(z,z)
gives a warning about duplicated levels, and gives such warnings
also in current (and previous) versions of R, at least for the slightly
larger z  I've used in the tests/reg-tests-1c.R example.

For the moment I can live with that warning, as I don't think
factor()s are constructed from complex numbers "often"...
and the performance of factor() in the more regular cases is important.

    > Yes, an argument for the behavior is that NA and NaN are of one kind.
    > On my system, using 32-bit R for Windows from binary from CRAN, the
    > result of sapply(z, match, table = z) (not in current R-devel) may be
    > different from below:
    > 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.10.1, different from below
    > 1 2 3 4 1 3 7 8 2 4 8 12  # R 3.2.5, different from below

interesting, thank you... and another reason why the change
(currently only in R-devel) may have been a good one: More uniformity.

    > I noticed that, by function 'cequal' in unique.c, a complex number that has both NA and NaN matches NA and also matches NaN.

    >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >> (z <- z[is.na(z)])
    > [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
    > [9]   0+NaNi   1+NaNi       NA NaN+NaNi

    >> sapply(z, match, table = z[8])
    > [1] 1 1 1 1 1 1 1 1 1 1 1 1
    >> match(z, z[8])
    > [1] 1 1 1 1 1 1 1 1 1 1 1 1

Yes, I see the same.  But isn't that what we expect?

Each of our z[] entries has at least one NA or NaN in its real
or imaginary part, and since z[8] has both, it matches all
z[]'s, either because of the NA or because of the NaN in common.

Hence, currently, I don't think this needs to be changed...
but if there are other reasons / arguments ...
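
The reasoning above can be modelled at the R level. This is only a sketch of
the rule as described in this message, not the C code of cequal(), and it
says nothing about what match() returns in any particular R version. The
helper names has_NA(), has_NaN() and shares_kind() are made up for
illustration, and z is rebuilt directly with complex() rather than via outer():

```r
## Rebuild the 12 values of 'z' (same order as in the session output).
x0 <- c(0, 1, NA, NaN)
z <- complex(real = rep(x0, times = 4), imaginary = rep(x0, each = 4))
z <- z[is.na(z)]

## Hypothetical helpers (not part of R):
## TRUE where the real or imaginary part is a "true" NA (NA but not NaN)
has_NA <- function(z) {
    (is.na(Re(z)) & !is.nan(Re(z))) | (is.na(Im(z)) & !is.nan(Im(z)))
}
## TRUE where the real or imaginary part is NaN
has_NaN <- function(z) is.nan(Re(z)) | is.nan(Im(z))
## The rule as described: a and b have an NA in common or a NaN in common
shares_kind <- function(a, b) {
    (has_NA(a) & has_NA(b)) | (has_NaN(a) & has_NaN(b))
}

has_NA(z[8]); has_NaN(z[8])   # z[8] has both kinds
shares_kind(z, z[8])          # so it shares a kind with every entry of z
shares_kind(z[7], z[12])      # NA+NAi versus NaN+NaNi: nothing in common
```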

Thank you again,
Martin Maechler

    >> sessionInfo()
    > R Under development (unstable) (2016-05-12 r70604)
    > Platform: i386-w64-mingw32/i386 (32-bit)
    > Running under: Windows XP (build 2600) Service Pack 2

    > locale:
    > [1] LC_COLLATE=English_United States.1252
    > [2] LC_CTYPE=English_United States.1252
    > [3] LC_MONETARY=English_United States.1252
    > [4] LC_NUMERIC=C
    > [5] LC_TIME=English_United States.1252

    > attached base packages:
    > [1] stats     graphics  grDevices utils     datasets  methods   base

    > -----------------
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 10 May 2016 16:08:39 +0200 writes:

    >> This is an RFC / announcement related to the 2nd part of PR#16885
    >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
    >> about  complex NA's.

    >> The (somewhat rare) incompatibility in R 3.3.0's match() behavior for the
    >> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
    >> patched in the meantime} triggered some more comprehensive "research".

    >> I found that we have had a long-standing inconsistency at least between the
    >> documented and the real behavior.  I am claiming that the documented
    >> behavior is desirable and hence R's current "real" behavior is buggy, and
    >> I am proposing to change it, in R-devel (to be 3.4.0) for now.

    > After the  "roaring unanimous" assent  (one private msg
    > encouraging me to go forward, no dissenting voice, hence an
    > "odds ratio" of  +Inf  in favor ;-)

    > I have now committed my proposal to R-devel (svn rev. 70597) and
    > some of us will be seeing the effect in package space within a
    > day or so, in the CRAN checks against R-devel (not for
    > Bioconductor AFAIK; their checks use R-devel only when it is less
    > than ca 6 months from release).

    > It's still worthwhile to discuss the issue, if you come late
    > to it, notably as ---paraphrasing Dirk on the R-package-devel list---
    > the release of 3.4.0 is almost a year away, and so now is the
    > best time to tinker with the API, in other words, consider breaking
    > rarely used legacy APIs.

    > Martin

    >> In help(match) we have been saying

    >> |  Exactly what matches what is to some extent a matter of definition.
    >> |  For all types, \code{NA} matches \code{NA} and no other value.
    >> |  For real and complex values, \code{NaN} values are regarded
    >> |  as matching any other \code{NaN} value, but not matching \code{NA}.

    >> for at least 10 years.  But we don't do that at all in the
    >> complex case (and AFAIK never got a bug report about it).
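
For comparison, the documented rule is easy to see on plain double vectors,
where it does hold. A minimal sketch (the vector x is just an illustration):

```r
x <- c(1, NA, NaN, NA, NaN)

match(x, x)            # NA matches NA, NaN matches NaN: 1 2 3 2 3
match(NaN, NA_real_)   # NaN does not match NA: NA
match(NA_real_, NaN)   # NA does not match NaN: NA
unique(x)              # three distinct values: 1, NA, NaN
```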

    >> Also, e.g., print(.) or format(.) do simply use  "NA" for all
    >> the different complex NA-containing numbers, where OTOH,
    >> non-NA NaN's { <=>  !is.nan(z) & is.na(z) }
    >> in format() or print() do show the NaN in real and/or imaginary
    >> parts; for an example, look at the "format" column of the matrix
    >> below, after 'print(cbind' ...
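
A small sketch of the formatting difference described in this paragraph. The
variable names are made up; exact padding of the formatted strings may vary
between R versions, so no literal output is shown:

```r
z_na  <- complex(real = NA_real_, imaginary = 0)         # contains an NA
z_nan <- complex(real = NaN,      imaginary = 0)         # NaN only, no NA
z_mix <- complex(real = NaN,      imaginary = NA_real_)  # both NaN and NA

f <- format(c(z_na, z_nan, z_mix))
f                          # NA-containing entries show as NA,
                           # the NaN-only entry keeps its NaN
is.na(c(z_na, z_nan, z_mix))   # all three count as missing for is.na()
```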

    >> The current match()---and duplicated(), unique() which are based on the same
    >> C code---*do* distinguish almost all complex NA / NaN's which is
    >> NOT according to documentation. I have found that this is just because
    >> our hashing function for the complex case, chash() in R/src/main/unique.c,
    >> is bogus in the sense that it is not compatible with the above documentation
    >> and also not with the cequal() function (in the same file unique.c) for checking
    >> equality of complex numbers.
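
Why a hash function must be compatible with the equality test can be shown
with a toy lookup in plain R. This is an illustration of the general problem
only, not a transcription of chash() or cequal(); all function names below
are hypothetical, and the toy equality rule (any two missing values are
equal) is chosen just to keep the example short:

```r
## Toy illustration only: none of these helpers exist in R itself.
## Equality rule for the toy: any two missing values (NA or NaN) are "equal".
toy_equal <- function(a, b) {
    if (is.na(a) || is.na(b)) is.na(a) && is.na(b) else a == b
}
## A hash that separates NA from NaN, i.e. is finer than toy_equal().
hash_incompatible <- function(a) if (is.nan(a)) 1L else if (is.na(a)) 2L else 3L
## A hash that sends everything toy_equal() calls equal to one bucket.
hash_compatible <- function(a) if (is.na(a)) 1L else 3L

## Return the first position in 'table' whose entry is in the same bucket
## as 'x' and is toy_equal() to it; NA_integer_ if there is none.
toy_match <- function(x, table, hash) {
    for (i in seq_along(table)) {
        if (hash(table[i]) == hash(x) && toy_equal(x, table[i])) return(i)
    }
    NA_integer_
}

tab <- c(5, NaN)
toy_match(NA_real_, tab, hash_compatible)    # 2: found via the shared bucket
toy_match(NA_real_, tab, hash_incompatible)  # NA: equal values, different buckets
```

With the incompatible hash the equality test is never reached for the
matching entry, so two values that compare equal are still reported as
distinct.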

    >> As I have found, a *simplified* version of the chash() function
    >> to make it compatible with cequal() does solve all the problems I've
    >> indicated,  and the current plan is to commit that change --- after some
    >> discussion time, here on R-devel ---  to the code base.

    >> My change passes  'make check-all' fine, but I'm 100% sure that there will
    >> be effects in package-space. ... one reason for this posting.

    >> As mentioned above, note that the chash() function has been in
    >> use for all three functions
    >> match()
    >> duplicated()
    >> unique()
    >> and the change will affect all three --- but just for the case of complex
    >> vectors with NA or NaN's.
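
How the three functions relate can be sketched on a complex vector that
avoids the contested NA / NaN mixtures (only NA_complex_ is used, so the
result should not depend on the change discussed here). The vector x is just
an illustration:

```r
x <- c(1+2i, NA_complex_, 3i, 1+2i, NA_complex_)

m <- match(x, x)      # index of the first equal element: 1 2 3 1 2
d <- duplicated(x)    # TRUE where an equal element occurred earlier
u <- unique(x)        # the elements that are not duplicated

identical(d, m != seq_along(x))   # duplicated() agrees with match()
identical(u, x[!d])               # unique() agrees with duplicated()
```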

    >> To show more, a small R session -- using my version of R-devel
    >> == the proposition: 
    >> The R script ('complex-NA-short.R') for (a bit more than) the
    >> session is attached {{you can attach  text/plain easily}}:

    >>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >>> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
    >>> ##                   similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
    >>> (z <- z[is.na(z)])
    >> [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
    >> [9]   0+NaNi   1+NaNi       NA NaN+NaNi
    >>> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
    >> +     r <- matrix( , length(x), length(y))
    >> +     for(i in seq(along=x))
    >> +         for(j in seq(along=y))
    >> +             r[i,j] <- identical(x[i], y[j], ...)
    >> +     r
    >> + }
    >>> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
    >>> ## a version that works in older versions of R, where identical() had fewer arguments!
    >>> outerID.picky <- function(x,y) {
    >> +     nF <- length(formals(identical)) - 2
    >> +     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
    >> + }
    >>> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
    >>> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]

    >> [1,] | . . . . . . . . . . .
    >> [2,] . | . . . . . . . . . .
    >> [3,] . . | . . . . . . . . .
    >> [4,] . . . | . . . . . . . .
    >> [5,] . . . . | . . . . . . .
    >> [6,] . . . . . | . . . . . .
    >> [7,] . . . . . . | . . . . .
    >> [8,] . . . . . . . | . . . .
    >> [9,] . . . . . . . . | . . .
    >> [10,] . . . . . . . . . | . .
    >> [11,] . . . . . . . . . . | .
    >> [12,] . . . . . . . . . . . |
    >>> try(# for older R versions
    >> + stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
    >> + )
    >>> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
    >> [1] 1 2 1 2 1 1 1 1 2 2 1 2
    >>> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
    >>> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
    >>       format   Re   Im   mz
    >>  [1,]       NA <NA> 0    1
    >>  [2,] NaN+  0i NaN  0    2
    >>  [3,]       NA <NA> 1    1
    >>  [4,] NaN+  1i NaN  1    2
    >>  [5,]       NA 0    <NA> 1
    >>  [6,]       NA 1    <NA> 1
    >>  [7,]       NA <NA> <NA> 1
    >>  [8,]       NA NaN  <NA> 1
    >>  [9,]   0+NaNi 0    NaN  2
    >> [10,]   1+NaNi 1    NaN  2
    >> [11,]       NA <NA> NaN  1
    >> [12,] NaN+NaNi NaN  NaN  2
    >>> 
    >> -------------------------------
    >> Note that 'mz <- match(z, z)' and hence the last column of the matrix above
    >> are very different in current R, 
    >> distinguishing most kinds of NA / NaN  against the documentation (and the
    >> real/numeric case).
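
On the side question in the session ("can we get outer() to work ?"): one
possible way, sketched here under the assumption that only modern R needs to
be supported, is to run outer() over index vectors and wrap the scalar
comparison with Vectorize(). The name outerID2 is made up for illustration:

```r
## Hypothetical alternative to outerID(); not taken from the R sources.
outerID2 <- function(x, y, ...) {
    ## scalar comparison of one element of x with one element of y
    same <- function(i, j) identical(x[i], y[j], ...)
    ## outer() needs a vectorized FUN, hence Vectorize()
    outer(seq_along(x), seq_along(y), Vectorize(same))
}

v <- c(1, NA, NaN)
outerID2(v, v)   # TRUE only on the diagonal: identical() separates NA and NaN
```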

    >> Martin Maechler
    >> R Core Team

    >> ### Basically a shortened version of  the PR#16885 -- complex part b)
    >> ### of  R/tests/reg-tests-1c.R

    >> ## b) complex 'x' with different kinds of NaN
    >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
    >> ##                   similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
    >> (z <- z[is.na(z)])
    >> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
    >> r <- matrix( , length(x), length(y))
    >> for(i in seq(along=x))
    >> for(j in seq(along=y))
    >> r[i,j] <- identical(x[i], y[j], ...)
    >> r
    >> }
    >> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
    >> ## a version that works in older versions of R, where identical() had fewer arguments!
    >> outerID.picky <- function(x,y) {
    >> nF <- length(formals(identical)) - 2
    >> do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
    >> }
    >> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
    >> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
    >> try(# for older R versions
    >> stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
    >> )
    >> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
    >> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
    >> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)

    >> ## compute  match(z[i], z) , for  i = 1,2,..,12  :
    >> (m1z <- sapply(z, match, table = z))
    >> ## 1 2 1 2 2 2 1 2 2 2 1 2   # R 1.2.3  (2001-04-26)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 7   # R 1.4.1  (2002-01-30)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.5.1  (2002-06-17)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.8.1  (2003-11-21)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.0.1  (2004-11-15)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.1.1  (2005-06-20)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.3.1  (2006-06-01)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.5.1  (2007-06-27)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.10.1 (2009-12-14)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.1.1  (2014-07-10)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.2.5 -- and 3.3.0 patched
    >> ## 1 2 1 2 1 1 1 1 2 2 1 2   # <<-- Martin's R-devel and proposed future R

    >> if(!exists("anyNA", mode="function")) anyNA <- function(x) any(is.na(x))
    >> stopifnot(apply(zRI, 2, anyNA)) # *all* are  NA *or* NaN (or both)
    >> is.NA <- function(.) is.na(.) & !is.nan(.)
    >> (iNaN <- apply(zRI, 2, function(.) any(is.nan(.))))
    >> (iNA <-  apply(zRI, 2, function(.) any(is.NA (.)))) # has non-NaN NA's
    >> ## In Martin's version of R-devel :
    >> stopifnot(identical(m1z == 1, iNA),
    >> identical(m1z == 2, !iNA))
    >> ## m1z uses match(x, *) with length(x) == 1 and failed in R 3.3.0
    >> stopifnot(identical(m1z, mz))
    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel
