[R] precision, incorrect(?) tapply() NA's
David James
dj at research.bell-labs.com
Mon Nov 20 16:16:35 CET 2000
Hi,
Summary:
I ran into some unexpected behavior in approx() and tapply()
that introduced NA's in "clean" data due to (?) numerical accuracy/round off.
The culprit seems to be in match() that coerces it's arguments to character,
loosing precision in the process.
[R development version 1.2.0, 08 Nov 2000, on Linux]
Example:
> r
> [1] 0.6931472 0.6931472 0.6931472 0.6931472 0.6931472 1.3373177 1.8052529
> [8] 3.3902202 4.5529401
the 5th element is not exactly equal to the first 4 elements:
> diff(r)
> [1] 0.000000e+00 0.000000e+00 0.000000e+00 1.110223e-16 6.441705e-01
> [6] 4.679352e-01 1.584967e+00 1.162720e+00
which agrees with unique(r)
> unique(r)
> [1] 0.6931472 0.6931472 1.3373177 1.8052529 3.3902202 4.5529401
(note that .Machine$double.eps is [1] 2.220446e-16, twice the 4th diff.)
Now the following call to approx() erroneously introduces an NA:
> approx(r, rec.p, ties = min)
Error in approx(r, rec.p, ties = "min") : approx(): attempted to interpolate NA
values
where the value of rec.p is
> rec.p
[1] 0.050 0.100 0.250 0.500 0.750 0.900 0.950 0.990 0.999
Looking at approx we see a call to tapply, which is producing the NA
> tapply(rec.p, r, min)
0.693147180559945 0.693147180559945 1.33731770200756 1.80525289211574
0.050 NA 0.900 0.950
3.39022015484073 4.55294011989587
0.990 0.999
Upon inspection, we see that tapply coerces its INDEX (second) argument
to factors, which in turn does a match():
> match(r, unique(r))
[1] 1 1 1 1 1 3 4 5 6
which shows that match() incorrectly assigned the 5th entry in r to the first
element of unique(r) -- not the 2nd one. (We also see that index 2 does not
appear on the output of match() at all.)
Thus match() seems to be introducing the round off as it coerces its
arguments to character:
> match
function (x, table, nomatch = NA)
.Internal(match(as.character(x), as.character(table), nomatch))
I'm enclosing the "r" and "rec.p" vectors as binary R data. I wish I could
send the ASCII versions, but then I would be introducing rounding
and the example above could not be reproduced.
By the way, I *do* realize that it's my (i.e., user) responsibility
to deal myself with such accuracy/precision issues (which I wishfully
thought I was by specifying "ties = min" to approx). Nevertheless, there
seems to be some inconsistency somewhere.
With regards,
David A. James
Statistics Research, Room 2C-253 Phone: (908) 582-3082
Bell Labs, Lucent Technologies Fax: (908) 582-3340
Murray Hill, NJ 09794-0636
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/octet-stream
Size: 381 bytes
Desc: tmp.Rdata
Url : https://stat.ethz.ch/pipermail/r-help/attachments/20001120/8241c4c3/attachment.obj
More information about the R-help
mailing list