[Rd] complex NA's match(), etc: not back-compatible change proposal
Suharto Anggono Suharto Anggono
suharto_anggono at yahoo.com
Sat May 28 11:34:08 CEST 2016
On 'factor', I meant the case where 'levels' is not specified, where 'unique' is called.
> factor(c(complex(real=NaN), complex(imaginary=NaN)))
[1] NaN+0i <NA>
Levels: NaN+0i
Look at <NA> in the result above. Yes, it happens in earlier versions of R, too.
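The mismatch behind that <NA> can be inspected by comparing the number of distinct values with the number of distinct printed labels. This is a minimal sketch, not the code of 'factor' itself; the names n_values and n_labels are illustrative, and the counts depend on the R version, so no output is shown.

```r
## Two complex values that print differently but that unique() may treat
## as equal, depending on the R version.
x <- c(complex(real = NaN), complex(imaginary = NaN))

n_values <- length(unique(x))                # distinct values per unique()
n_labels <- length(unique(as.character(x)))  # distinct printed labels

## In the R versions discussed in this thread, n_labels > n_values means
## that some label has no corresponding level, which shows up as <NA>.
cat("unique values:", n_values, " unique labels:", n_labels, "\n")
```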
On matching both NA and NaN, another consequence is that length(unique(.)) may depend on order. Example using R devel r70604:
> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
> (z <- z[is.na(z)])
[1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA
[9] 0+NaNi 1+NaNi NA NaN+NaNi
> length(print(unique(z)))
[1] NA NaN+0i
[1] 2
> length(print(unique(c(z[8], z[-8]))))
[1] NA
[1] 1
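A rough way to probe the order dependence on any given R build is to count the unique values over several random orderings of the same 'z'. This is only a sketch: the seed, the number of orderings (20) and the name 'counts' are arbitrary choices, and the resulting counts differ between R versions.

```r
## Rebuild the 12 complex values with NA and/or NaN parts (as in the post).
x0 <- c(0, 1, NA, NaN)
z  <- outer(x0, x0, complex, length.out = 1)
z  <- z[is.na(z)]

## Count unique values for 20 random orderings of the same set.
## A fixed seed keeps the sampled orderings reproducible.
set.seed(1)
counts <- vapply(seq_len(20), function(k) {
  length(unique(z[sample(length(z))]))
}, integer(1))

## More than one distinct count means the result depends on the order.
print(table(counts))
```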
--------------------------------------------
On Mon, 23/5/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
Cc: R-devel at r-project.org
Date: Monday, 23 May, 2016, 11:06 PM
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Fri, 13 May 2016 16:33:05 +0000 writes:
> That, for example, complex(real=NaN) and complex(imaginary=NaN) are
> regarded as equal makes it possible that
>     length(unique(as.character(x))) > length(unique(x))
> (current code of function 'factor' doesn't expect it).
Thank you, that is an interesting remark - but is already true, in
[[elided Yahoo spam]]
..
and of course this is because we do *print* 0+NaNi etc,
i.e., we differentiate the non-NA-but-NaN complex values in
formatting / printing but not in match(), unique() ...
and indeed, with the 'z' example below,
    fz <- factor(z,z)
gives a warning about duplicated levels and gives such warnings
also in current (and previous) versions of R, at least for the slightly
larger z I've used in the tests/reg-tests-1c.R example.
For the moment I can live with that warning, as I don't think
factor()s are constructed from complex numbers "often"...
and the performance of factor() in the more regular cases is important.
> Yes, an argument for the behavior is that NA and NaN are of one kind.
> On my system, using 32-bit R for Windows from binary from CRAN,
> the result of sapply(z, match, table = z) (not in current R-devel)
> may be different from below:
> 1 2 3 4 1 3 7 8 2 4 8 12 # R 2.10.1, different from below
> 1 2 3 4 1 3 7 8 2 4 8 12 # R 3.2.5, different from below
interesting, thank you... and another reason why the change
(currently only in R-devel) may have been a good one: More uniformity.
> I noticed that, by function 'cequal' in unique.c, a complex number that
> has both NA and NaN matches NA and also matches NaN.
>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>> (z <- z[is.na(z)])
> [1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA
> [9] 0+NaNi 1+NaNi NA NaN+NaNi
>> sapply(z, match, table = z[8])
> [1] 1 1 1 1 1 1 1 1 1 1 1 1
>> match(z, z[8])
> [1] 1 1 1 1 1 1 1 1 1 1 1 1
Yes, I see the same. But isn't it what we expect:
All of our z[] entries have at least one NA or a NaN in the real
or imaginary part, and since z[8] has both, it does match with all
z[]'s either because of the NA or because of the NaN in common.
Hence, currently, I don't think this needs to be changed...
but if there are other reasons / arguments ...
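The matching rule described above can be modelled in a few lines of R. This is a toy model written for illustration only, not the C code of cequal(); the names is_NA, toy_equal, pure_NA, pure_NaN and mixed are made up here.

```r
## Toy model of the matching rule described above; NOT R's C code.
is_NA <- function(x) is.na(x) & !is.nan(x)   # NA but not NaN (for doubles)

toy_equal <- function(a, b) {
  pa <- c(Re(a), Im(a))
  pb <- c(Re(b), Im(b))
  if (!anyNA(c(pa, pb))) return(all(pa == pb))     # ordinary values
  both_NA  <- any(is_NA(pa))  && any(is_NA(pb))    # an NA on each side
  both_NaN <- any(is.nan(pa)) && any(is.nan(pb))   # a NaN on each side
  both_NA || both_NaN
}

pure_NA  <- complex(real = NA,  imaginary = 0)
pure_NaN <- complex(real = NaN, imaginary = 0)
mixed    <- complex(real = NaN, imaginary = NA)    # same pattern as z[8]

toy_equal(mixed, pure_NA)     # TRUE:  NA in common
toy_equal(mixed, pure_NaN)    # TRUE:  NaN in common
toy_equal(pure_NA, pure_NaN)  # FALSE: nothing in common
```

Under this model the relation is not transitive: 'mixed' equals both other values, which do not equal each other. That is one plausible reading of why length(unique(.)) can depend on the order of the elements.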
Thank you again,
Martin Maechler
>> sessionInfo()
> R Under development (unstable) (2016-05-12 r70604)
> Platform: i386-w64-mingw32/i386 (32-bit)
> Running under: Windows XP (build 2600) Service Pack 2
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> -----------------
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 10 May 2016 16:08:39 +0200 writes:
>> This is an RFC / announcement related to the 2nd part of PR#16885
>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
>> about complex NA's.
>> The (somewhat rare) incompatibility in R's 3.3.0 match() behavior for the
>> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
>> patched in the mean time} triggered some more comprehensive "research".
>> I found that we have had a long-standing inconsistency at least between the
>> documented and the real behavior. I am claiming that the documented
>> behavior is desirable and hence R's current "real" behavior is buggy, and
>> I am proposing to change it, in R-devel (to be 3.4.0) for now.
> After the "roaring unanimous" assent (one private msg
> encouraging me to go forward, no dissenting voice, hence an
> "odds ratio" of +Inf in favor ;-)
> I have now committed my proposal to R-devel (svn rev. 70597) and
> some of us will be seeing the effect in package space within a
> day or so, in the CRAN checks against R-devel (not for
> bioconductor AFAIK; their checks use R-devel only when it is less
> than ca 6 months from release).
>
> It's still worthwhile to discuss the issue, if you come late
> to it, notably as ---paraphrasing Dirk on the R-package-devel list---
> the release of 3.4.0 is almost a year away, and so now is the
> best time to tinker with the API, in other words, consider breaking
> rarely used legacy APIs..
>
> Martin
>> In help(match) we have been saying
>> | Exactly what matches what is to some extent a matter of definition.
>> | For all types, \code{NA} matches \code{NA} and no other value.
>> | For real and complex values, \code{NaN} values are regarded
>> | as matching any other \code{NaN} value, but not matching \code{NA}.
>> for at least 10 years. But we don't do that at all in the
>> complex case (and AFAIK never got a bug report about it).
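For the real (double) case the documented rule can be checked directly. A small sketch; the variable names are illustrative and only the non-complex case is shown, since the complex case is exactly what this thread is about.

```r
## Documented rule for doubles: NA matches only NA, NaN matches only NaN.
x     <- c(NA, NaN, 1)
table <- c(NaN, NA)

m <- match(x, table)
m   # 2 1 NA: NA found at position 2, NaN at position 1, 1 not found
```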
>> Also, e.g., print(.) or format(.) do simply use "NA" for all
>> the different complex NA-containing numbers, where OTOH,
>> non-NA NaN's { <=> !is.nan(z) & is.na(z) }
>> in format() or print() do show the NaN in real and/or imaginary
>> parts; for an example, look at the "format" column of the matrix
>> below, after 'print(cbind' ...
>> The current match()---and duplicated(), unique() which are based on the same
>> C code---*do* distinguish almost all complex NA / NaN's which is
>> NOT according to documentation. I have found that this is just because
>> our hashing function for the complex case, chash() in R/src/main/unique.c,
>> is bogus in the sense that it is not compatible with the above documentation
>> and also not with the cequal() function (in the same file unique.c) for
>> checking equality of complex numbers.
>> As I have found, a *simplified* version of the chash() function
>> to make it compatible with cequal() does solve all the problems I've
>> indicated, and the current plan is to commit that change --- after some
>> discussion time, here on R-devel --- to the code base.
>> My change passes 'make check-all' fine, but I'm 100% sure that there will
>> be effects in package-space. ... one reason for this posting.
>> As mentioned above, note that the chash() function has been in
>> use for all three functions
>>     match()
>>     duplicated()
>>     unique()
>> and the change will affect all three --- but just for the case of complex
>> vectors with NA or NaN's.
>> To show more, a small R session -- using my version of R-devel
>> == the proposition:
>> The R script ('complex-NA-short.R') for (a bit more than) the
>> session is attached {{you can attach text/plain easily}}:
>>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>>> ## --- = NA_real_ but that does not exist e.g., in R 2.3.1
>>> ## similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1
>>> (z <- z[is.na(z)])
>> [1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA
>> [9] 0+NaNi 1+NaNi NA NaN+NaNi
>>> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
>> +     r <- matrix( , length(x), length(y))
>> +     for(i in seq(along=x))
>> +         for(j in seq(along=y))
>> +             r[i,j] <- identical(x[i], y[j], ...) # compare the arguments, not the global z
>> +     r
>> + }
>>> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
>>> ## a version that works in older versions of R, where identical() had fewer arguments!
>>> outerID.picky <- function(x,y) {
>> +     nF <- length(formals(identical)) - 2
>> +     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
>> + }
>>> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
>>> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
>>  [1,] | . . . . . . . . . . .
>>  [2,] . | . . . . . . . . . .
>>  [3,] . . | . . . . . . . . .
>>  [4,] . . . | . . . . . . . .
>>  [5,] . . . . | . . . . . . .
>>  [6,] . . . . . | . . . . . .
>>  [7,] . . . . . . | . . . . .
>>  [8,] . . . . . . . | . . . .
>>  [9,] . . . . . . . . | . . .
>> [10,] . . . . . . . . . | . .
>> [11,] . . . . . . . . . . | .
>> [12,] . . . . . . . . . . . |
>>> try(# for older R versions
>> +     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
>> + )
>>> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
>> [1] 1 2 1 2 1 1 1 1 2 2 1 2
>>> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
>>> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
>>       format   Re   Im   mz
>>  [1,] NA       <NA> 0    1
>>  [2,] NaN+ 0i  NaN  0    2
>>  [3,] NA       <NA> 1    1
>>  [4,] NaN+ 1i  NaN  1    2
>>  [5,] NA       0    <NA> 1
>>  [6,] NA       1    <NA> 1
>>  [7,] NA       <NA> <NA> 1
>>  [8,] NA       NaN  <NA> 1
>>  [9,] 0+NaNi   0    NaN  2
>> [10,] 1+NaNi   1    NaN  2
>> [11,] NA       <NA> NaN  1
>> [12,] NaN+NaNi NaN  NaN  2
>>>
>> -------------------------------
>> Note that 'mz <- match(z, z)' and hence the last column of the matrix above
>> are very different in current R,
>> distinguishing most kinds of NA / NaN against the documentation (and the
>> real/numeric case).
>> Martin Maechler
>> R Core Team
>> ### Basically a shortened version of the PR#16885 -- complex part b)
>> ### of R/tests/reg-tests-1c.R
>> ## b) complex 'x' with different kinds of NaN
>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
>> ## --- = NA_real_ but that does not exist e.g., in R 2.3.1
>> ## similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1
>> (z <- z[is.na(z)])
>> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
>>     r <- matrix( , length(x), length(y))
>>     for(i in seq(along=x))
>>         for(j in seq(along=y))
>>             r[i,j] <- identical(x[i], y[j], ...) # compare the arguments, not the global z
>>     r
>> }
>> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
>> ## a version that works in older versions of R,
[[elided Yahoo spam]]
>> outerID.picky <- function(x,y) {
>>     nF <- length(formals(identical)) - 2
>>     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
>> }
>> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
>> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
>> try(# for older R versions
>>     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
>> )
>> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
>> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
>> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
>> ## compute match(z[i], z) , for i = 1,2,..,12 :
>> (m1z <- sapply(z, match, table = z))
>> ## 1 2 1 2 2 2 1 2 2 2 1 2   # R 1.2.3 (2001-04-26)
>> ## 1 2 3 4 1 3 7 8 2 4 8 7   # R 1.4.1 (2002-01-30)
>> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.5.1 (2002-06-17)
>> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.8.1 (2003-11-21)
>> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.0.1 (2004-11-15)
>> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.1.1 (2005-06-20)
>> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.3.1 (2006-06-01)
>> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.5.1 (2007-06-27)
>> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.10.1 (2009-12-14)
>> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.1.1 (2014-07-10)
>> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.2.5 -- and 3.3.0 patched
>> ## 1 2 1 2 1 1 1 1 2 2 1 2   # <<-- Martin's R-devel and proposed future R
>> if(!exists("anyNA", mode="function")) anyNA <- function(x) any(is.na(x))
>> stopifnot(apply(zRI, 2, anyNA)) # *all* are NA *or* NaN (or both)
>> is.NA <- function(.) is.na(.) & !is.nan(.)
>> (iNaN <- apply(zRI, 2, function(.) any(is.nan(.))))
>> (iNA <- apply(zRI, 2, function(.) any(is.NA (.)))) # has non-NaN NA's
>> ## In Martin's version of R-devel :
>> stopifnot(identical(m1z == 1, iNA),
>>           identical(m1z == 2, !iNA))
>> ## m1z uses match(x, *) with length(x) == 1 and failed in R 3.3.0
>> stopifnot(identical(m1z, mz))
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel