[Rd] complex NA's match(), etc: not back-compatible change proposal

Fri Jun 3 21:13:11 CEST 2016

With 'z' of length 8 below, or of length 12 previously, one may try
sapply(rev(z), match, table = rev(z))
match(rev(z), rev(z))

I found that the two results were different in R devel r70604.

A shorter one:

> z <- complex(real = c(0,NaN,NaN), imaginary = c(NA,NA,0))
> sapply(z, match, table = z)
[1] 1 1 2
> match(z, z)
[1] 1 1 3

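An explanation of the behavior: with an ordinary equality relation, if z[2] is equal to z[1] and z[3] is not equal to z[1], then z[3] is not equal to z[2]. That does not hold for 'cequal': here z[2] matches both z[1] (through the NA) and z[3] (through the NaN), while z[1] and z[3] do not match each other. The usual code path of 'match', however, seems to assume this property.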

To fix this, it is enough to change 'cequal' so that a complex number containing both NA and NaN matches NA but does not match NaN. That change also makes length(unique(.)) independent of order.
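
The non-transitivity can be reproduced without R's C code. What follows is only a sketch: 'is_NA', 'kind', 'ceq_model', 'ceq_fixed' and 'first_match' are hypothetical helper names, complex numbers are represented as plain c(re, im) pairs, and both rules are my reading of the behavior described in this thread, not the actual 'cequal' source.

```r
## Hypothetical pure-R model; NOT the C code of cequal() in unique.c.
## A complex number is represented as a numeric pair c(re, im).
is_NA <- function(x) is.na(x) & !is.nan(x)      # NA, but not NaN

## Assumed r70604 behavior: numbers containing NA or NaN match
## if both contain an NA, or both contain a NaN.
ceq_model <- function(a, b) {
    if (any(is.na(a)) || any(is.na(b))) {
        (any(is_NA(a)) && any(is_NA(b))) || (any(is.nan(a)) && any(is.nan(b)))
    } else {
        all(a == b)
    }
}

## Proposed rule: a number containing an NA counts as NA only,
## even if it also contains a NaN.
kind <- function(a) {
    if (any(is_NA(a))) "NA" else if (any(is.nan(a))) "NaN" else "regular"
}
ceq_fixed <- function(a, b) {
    ka <- kind(a); kb <- kind(b)
    if (ka != kb) FALSE else if (ka == "regular") all(a == b) else TRUE
}

## Sequential search: index of the first element of 'table' equal to 'x'.
first_match <- function(x, table, eq) {
    for (i in seq_along(table)) if (eq(x, table[[i]])) return(i)
    NA_integer_
}

z <- list(c(0, NA), c(NaN, NA), c(NaN, 0))      # z[1], z[2], z[3] from above
res_model <- sapply(z, first_match, table = z, eq = ceq_model)  # 1 1 2
res_fixed <- sapply(z, first_match, table = z, eq = ceq_fixed)  # 1 1 3
```

With 'ceq_model', z[2] equals both z[1] and z[3] although those two differ, so the result of a search depends on which candidate is compared first. With 'ceq_fixed' the relation is an equivalence, and a sequential and a hashed search can agree.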

As for further changes, I am fine with '1 A'.
--------------------------------------------
On Mon, 30/5/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

 Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal

 Cc: R-devel at r-project.org
 Date: Monday, 30 May, 2016, 5:48 PM

 >>>>> Suharto Anggono
 >>>>>     on Sat, 28 May 2016 09:34:08 +0000 writes:

     > On 'factor', I meant the case where 'levels' is not
     > specified, where 'unique' is called.

 I see, thank you.

     >> factor(c(complex(real=NaN), complex(imaginary=NaN)))
     > [1] NaN+0i <NA>
     > Levels: NaN+0i

     > Look at <NA> in the result above. Yes, it happens in
     > earlier versions of R, too.

 Yes; let's call this "problem 1"

     > On matching both NA and NaN, another consequence is that
     > length(unique(.)) may depend on order.
     > Example using R devel r70604:

     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
     >> (z <- z[is.na(z)])
     > [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
     > [9]   0+NaNi   1+NaNi       NA NaN+NaNi
     >> length(print(unique(z)))
     > [1]     NA NaN+0i
     > [1] 2
     >> length(print(unique(c(z[8], z[-8]))))
     > [1] NA
     > [1] 1
     >
 --------------------------------------------

 Thank you, Suharto. I agree these are even more convincing
 reasons to consider changing.
 Let's call this ("matching both NA and NaN") "problem 2".

 I think we agree that R-devel -- compared to previous
 versions -- *is* consistent in its (C level) functions cequal()
 and chash() and also is consistent with the documentation
 of match()/unique()/duplicated().

 Hence I think a change would have to affect all of the above,
 including a change of documentation.

 Also, the resolutions of "problem 1" and "problem 2" are related, but
 --I think-- almost separate.
 For the following, let's use a vector notation for complex
 numbers, say
     (a, b) :== complex(real = a, imaginary = b)

 With R  (showing relevant examples):
 ##------------------------------------------------------------------------------
 options(width = max(85, getOption("width"))) # so 'z' prints in one line
 p.z <- function(z) print(noquote(paste0("(",Re(z),",",Im(z),")")))
 z <- c(1,NA,NaN); z <- outer(z,z, complex, length.out=1); (z <- z[is.na(z)])
 ##     NA NaN+  1i       NA       NA       NA   1+NaNi       NA NaN+NaNi
 p.z(z)
 ##  (NA,1)  (NaN,1)  (1,NA)  (NA,NA)  (NaN,NA)  (1,NaN)  (NA,NaN)  (NaN,NaN)
 length(p.z(unique(z[ 1:8 ])))
 ## [1] (NA,1)  (NaN,1)
 ## [1] 2
 length(p.z(unique(z[ c(8,1:7) ])))
 ## [1] (NaN,NaN) (NA,1)
 ## [1] 2
 length(p.z(unique(z[ c(7:8,1:6) ])))
 ## [1] (NA,NaN)
 ## [1] 1
 ##------------------------------------------------------------------------------

 Problem 1:
   To me, at the moment, it would seem most "natural" to consider a
   change where the match()/unique()/duplicated() behavior matched
   the behavior of print()/format()/as.character() for such
   complex vectors.
   I think this would automatically solve the issue that sometimes

     length(unique(as.character(x))) > length(unique(x))

   There are principally two solutions to this:

   A: change  match()/unique()/duplicated()
   B: change  print()/format()/as.character()

   For A -- which seems "less disruptive" and more desirable to
   me -- we would have to change cequal() {and chash()!} and say
   that complex numbers with NA|NaN "match" if they have any NA, but
   otherwise, both the regular (r,i) and the NaN must be at the
   exact same places (and *different* NaNs should match, of course).
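
 A rough sketch of what rule 'A' would mean for unique(), under stated
 assumptions: 'is_NA' and 'key_1A' are hypothetical helpers, the complex
 numbers are given by separate 're' and 'im' vectors, and the key function
 only models the rule in words above; it is not what cequal()/chash() do.

```r
## Hypothetical sketch of rule '1 A'; not R's implementation.
is_NA <- function(x) is.na(x) & !is.nan(x)      # NA, but not NaN

## One key per equivalence class: anything containing an NA collapses to "NA";
## otherwise real and imaginary part (NaN included) must agree in place.
key_1A <- function(re, im) ifelse(is_NA(re) | is_NA(im), "NA", paste(re, im))

x  <- c(1, NA, NaN)
re <- rep(x, times = 3); im <- rep(x, each = 3)  # all 9 (re, im) combinations
keep <- is.na(re) | is.na(im)                    # drop the regular (1, 1)
keys <- key_1A(re[keep], im[keep])               # same order as p.z(z) above

n_orig <- length(unique(keys))                   # 4
n_perm <- length(unique(keys[c(7:8, 1:6)]))      # 4 as well
```

 The four classes correspond to the four distinct printed forms of 'z'
 (NA, NaN+1i, 1+NaNi, NaN+NaNi), and because every element has exactly one
 key, the number of unique values cannot depend on the permutation.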

 Problem 2:   unique(z[i])  depends on the permutation 'i'

   What should a change be here ...  notably after the "proposed"
   (rather only "considered") change '1 A' above ?

   Can "the" new behavior easily be described in words (if '1 A'
   above is already assumed)?

 At the moment, I would not tackle Problem 2.
 It would become less problematic once Problem 1 is solved
 according to '1 A', because at least length(unique(.)) would
 not change:  It would contain *one* z[] with an NA, and all the
 other z[]s.

 Opinions ?  Thank you in advance for chiming in..

 Martin Maechler,
 ETH Zurich

     > On Mon, 23/5/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

     > Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal

     > Cc: R-devel at r-project.org
     > Date: Monday, 23 May, 2016, 11:06 PM

     >>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
     >>>>>>      on Fri, 13 May 2016 16:33:05 +0000 writes:

     >     > That, for example, complex(real=NaN) and complex(imaginary=NaN)
     >     > are regarded as equal makes it possible that

     >     >     length(unique(as.character(x))) > length(unique(x))

     >     > (current code of function 'factor' doesn't expect it).

     > Thank you, that is an interesting remark - but is already true, in
     > [[elided Yahoo spam]]

     > ..
     > and of course this is because we do *print*  0+NaNi  etc, i.e., we
     > differentiate the non-NA-but-NaN complex values in
     > formatting / printing but not in match(), unique() ...

     > and indeed, with the 'z' example below,
     >      fz <- factor(z,z)
     > gives a warning about duplicated levels and gives such warnings
     > also in current (and previous) versions of R, at least for the slightly
     > larger z I've used in the tests/reg-tests-1c.R example.

     > For the moment I can live with that warning, as I don't think factor()s
     > are constructed from complex numbers "often"...
     > and the performance of factor() in the more regular cases is important.

     >     > Yes, an argument for the behavior is that NA and NaN are of one kind.
     >     > On my system, using 32-bit R for Windows from binary from CRAN,
     >     > the result of sapply(z, match, table = z) (not in current
     >     > R-devel) may be different from below:
     >     > 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.10.1, different from below
     >     > 1 2 3 4 1 3 7 8 2 4 8 12  # R 3.2.5, different from below

     > interesting, thank you... and another reason why the change
     > (currently only in R-devel) may have been a good one: More uniformity.

     >     > I noticed that, by function 'cequal' in unique.c, a complex number that
     >     > has both NA and NaN matches NA and also matches NaN.

     >     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
     >     >> (z <- z[is.na(z)])
     >     > [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
     >     > [9]   0+NaNi   1+NaNi       NA NaN+NaNi

     >     >> sapply(z, match, table = z[8])
     >     > [1] 1 1 1 1 1 1 1 1 1 1 1 1
     >     >> match(z, z[8])
     >     > [1] 1 1 1 1 1 1 1 1 1 1 1 1

     > Yes, I see the same. But isn't it what we expect:

     > All of our z[] entries have at least one NA or a NaN in their real
     > or imaginary part, and since z[8] has both, it does match with all z[]'s
     > either because of the NA or because of the NaN in common.

     > Hence, currently, I don't think this needs to be changed...
     > but if there are other reasons / arguments ...

     > Thank you again,
     > Martin Maechler

     >     >> sessionInfo()
     >     > R Under development (unstable) (2016-05-12 r70604)
     >     > Platform: i386-w64-mingw32/i386 (32-bit)
     >     > Running under: Windows XP (build 2600) Service Pack 2

     >     > locale:
     >     > [1] LC_COLLATE=English_United States.1252
     >     > [2] LC_CTYPE=English_United States.1252
     >     > [3] LC_MONETARY=English_United States.1252
     >     > [4] LC_NUMERIC=C
     >     > [5] LC_TIME=English_United States.1252

     >     > attached base packages:
     >     > [1] stats     graphics  grDevices utils     datasets  methods   base

     >     > -----------------
     >>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
     >>>>>>      on Tue, 10 May 2016 16:08:39 +0200 writes:

     >     >> This is an RFC / announcement related to the 2nd part of PR#16885
     >     >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
     >     >> about  complex NA's.

     >     >> The (somewhat rare) incompatibility in R's 3.3.0 match() behavior for the
     >     >> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
     >     >> patched in the mean time} triggered some more comprehensive "research".

     >     >> I found that we have had a long-standing inconsistency at least between the
     >     >> documented and the real behavior.  I am claiming that the documented
     >     >> behavior is desirable and hence R's current "real" behavior is buggy, and
     >     >> I am proposing to change it, in R-devel (to be 3.4.0) for now.

     >     > After the  "roaring unanimous" assent  (one private msg
     >     > encouraging me to go forward, no dissenting voice, hence an
     >     > "odds ratio" of  +Inf  in favor ;-)

     >     > I have now committed my proposal to R-devel (svn rev. 70597) and
     >     > some of us will be seeing the effect in package space within a
     >     > day or so, in the CRAN checks against R-devel (not for
     >     > bioconductor AFAIK; their checks use R-devel only when it is less
     >     > than ca 6 months from release).

     >     > It's still worthwhile to discuss the issue, if you come late
     >     > to it, notably as ---paraphrasing Dirk on the R-package-devel list---
     >     > the release of 3.4.0 is almost a year away, and so now is the
     >     > best time to tinker with the API, in other words, consider breaking
     >     > rarely used legacy APIs..

     >     > Martin
     >     >> In help(match) we have been saying

     >     >> |  Exactly what matches what is to some extent a matter of definition.
     >     >> |  For all types, \code{NA} matches \code{NA} and no other value.
     >     >> |  For real and complex values, \code{NaN} values are regarded
     >     >> |  as matching any other \code{NaN} value, but not matching \code{NA}.

     >     >> for at least 10 years.  But we don't do that at all in the
     >     >> complex case (and AFAIK never got a bug report about it).

     >     >> Also, e.g., print(.) or format(.) do simply use  "NA" for all
     >     >> the different complex NA-containing numbers, where OTOH,
     >     >> non-NA NaN's { <=>  !is.nan(z) & is.na(z) }
     >     >> in format() or print() do show the NaN in real and/or imaginary
     >     >> parts; for an example, look at the "format" column of the matrix
     >     >> below, after 'print(cbind' ...

     >     >> The current match()---and duplicated(), unique() which are based on the same
     >     >> C code---*do* distinguish almost all complex NA / NaN's which is
     >     >> NOT according to documentation. I have found that this is just because
     >     >> our hashing function for the complex case, chash() in R/src/main/unique.c,
     >     >> is bogus in the sense that it is not compatible with the above documentation
     >     >> and also not with the cequal() function (in the same file unique.c) for checking
     >     >> equality of complex numbers.

     >     >> As I have found, a *simplified* version of the chash() function
     >     >> to make it compatible with cequal() does solve all the problems I've
     >     >> indicated,  and the current plan is to commit that change --- after some
     >     >> discussion time, here on R-devel ---  to the code base.

     >     >> My change passes  'make check-all' fine, but I'm 100% sure that there will
     >     >> be effects in package-space. ... one reason for this posting.

     >     >> As mentioned above, note that the chash() function has been in
     >     >> use for all three functions
     >     >> match()
     >     >> duplicated()
     >     >> unique()
     >     >> and the change will affect all three --- but just for the case of complex
     >     >> vectors with NA or NaN's.

     >     >> To show more, a small R session -- using my version of R-devel
     >     >> == the proposition:
     >     >> The R script ('complex-NA-short.R') for (a bit more than) the
     >     >> session is attached {{you can attach  text/plain easily}}:

     >     >>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
     >     >>> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
     >     >>> ##               similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
     >     >>> (z <- z[is.na(z)])
     >     >> [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
     >     >> [9]   0+NaNi   1+NaNi       NA NaN+NaNi
     >     >>> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
     >     >> +     r <- matrix( , length(x), length(y))
     >     >> +     for(i in seq(along=x))
     >     >> +         for(j in seq(along=y))
     >     >> +             r[i,j] <- identical(z[i], z[j], ...)
     >     >> +     r
     >     >> + }
     >     >>> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
     >     >>> ## a version that works in older versions of R, where identical() had fewer
[[elided Yahoo spam]]
     >     >>> outerID.picky <- function(x,y) {
     >     >> +     nF <- length(formals(identical)) - 2
     >     >> +     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
     >     >> + }
     >     >>> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
     >     >>> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
     >     >>
     >     >>  [1,] | . . . . . . . . . . .
     >     >>  [2,] . | . . . . . . . . . .
     >     >>  [3,] . . | . . . . . . . . .
     >     >>  [4,] . . . | . . . . . . . .
     >     >>  [5,] . . . . | . . . . . . .
     >     >>  [6,] . . . . . | . . . . . .
     >     >>  [7,] . . . . . . | . . . . .
     >     >>  [8,] . . . . . . . | . . . .
     >     >>  [9,] . . . . . . . . | . . .
     >     >> [10,] . . . . . . . . . | . .
     >     >> [11,] . . . . . . . . . . | .
     >     >> [12,] . . . . . . . . . . . |
     >     >>> try(# for older R versions
     >     >> + stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
     >     >> + )
     >     >>> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
     >     >> [1] 1 2 1 2 1 1 1 1 2 2 1 2
     >     >>> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
     >     >>> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
     >     >>       format   Re   Im   mz
     >     >>  [1,]       NA <NA> 0    1
     >     >>  [2,] NaN+  0i NaN  0    2
     >     >>  [3,]       NA <NA> 1    1
     >     >>  [4,] NaN+  1i NaN  1    2
     >     >>  [5,]       NA 0    <NA> 1
     >     >>  [6,]       NA 1    <NA> 1
     >     >>  [7,]       NA <NA> <NA> 1
     >     >>  [8,]       NA NaN  <NA> 1
     >     >>  [9,]   0+NaNi 0    NaN  2
     >     >> [10,]   1+NaNi 1    NaN  2
     >     >> [11,]       NA <NA> NaN  1
     >     >> [12,] NaN+NaNi NaN  NaN  2
     >     >>>

     >     >> -------------------------------
     >     >> Note that 'mz <- match(z, z)' and hence the last column of the matrix above
     >     >> are very different in current R,
     >     >> distinguishing most kinds of NA / NaN  against the documentation (and the
     >     >> real/numeric case).

     >     >> Martin Maechler
     >     >> R Core Team

     >     >> ### Basically a shortened version of  the PR#16885 -- complex part b)
     >     >> ### of  R/tests/reg-tests-1c.R

     >     >> ## b) complex 'x' with different kinds of NaN
     >     >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
     >     >> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
     >     >> ##               similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
     >     >> (z <- z[is.na(z)])
     >     >> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
     >     >>     r <- matrix( , length(x), length(y))
     >     >>     for(i in seq(along=x))
     >     >>         for(j in seq(along=y))
     >     >>             r[i,j] <- identical(z[i], z[j], ...)
     >     >>     r
     >     >> }
     >     >> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
     >     >> ## a version that works in older versions of R,
     > [[elided Yahoo spam]]
     >     >> outerID.picky <- function(x,y) {
     >     >>     nF <- length(formals(identical)) - 2
     >     >>     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
     >     >> }
     >     >> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
     >     >> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
     >     >> try(# for older R versions
     >     >>     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
     >     >> )
     >     >> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
     >     >> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
     >     >> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)

     >     >> ## compute  match(z[i], z) , for  i = 1,2,..,12  :
     >     >> (m1z <- sapply(z, match, table = z))
     >     >> ## 1 2 1 2 2 2 1 2 2 2 1 2   # R 1.2.3  (2001-04-26)
     >     >> ## 1 2 3 4 1 3 7 8 2 4 8 7   # R 1.4.1  (2002-01-30)
     >     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.5.1  (2002-06-17)
     >     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.8.1  (2003-11-21)
     >     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.0.1  (2004-11-15)
     >     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.1.1  (2005-06-20)
     >     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.3.1  (2006-06-01)
     >     >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.5.1  (2007-06-27)
     >     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.10.1 (2009-12-14)
     >     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.1.1  (2014-07-10)
     >     >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.2.5 -- and 3.3.0 patched
     >     >> ## 1 2 1 2 1 1 1 1 2 2 1 2   # <<-- Martin's R-devel and proposed future R
     >     >> if(!exists("anyNA", mode="function")) anyNA <- function(x) any(is.na(x))
     >     >> stopifnot(apply(zRI, 2, anyNA)) # *all* are  NA *or* NaN (or both)
     >     >> is.NA <- function(.) is.na(.) & !is.nan(.)
     >     >> (iNaN <- apply(zRI, 2, function(.) any(is.nan(.))))
     >     >> (iNA <-  apply(zRI, 2, function(.) any(is.NA (.)))) # has non-NaN NA's
     >     >> ## In Martin's version of R-devel :
     >     >> stopifnot(identical(m1z == 1, iNA),
     >     >>           identical(m1z == 2, !iNA))
     >     >> ## m1z uses match(x, *) with length(x) == 1 and failed in R 3.3.0
     >     >> stopifnot(identical(m1z, mz))
     >     >> ______________________________________________
     >     >> R-devel at r-project.org mailing list
     >     >> https://stat.ethz.ch/mailman/listinfo/r-devel
