[Rd] RFC: hexadecimal constants and decimal points
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sun Apr 17 13:38:10 CEST 2005
These are some points stimulated by reading about C history (and
related in their implementation).
1) On some platforms
> as.integer("0xA")
[1] 10
but not all (not on Solaris nor Windows). We do not define what is
allowed, and rely on the OS's implementation of strtod (yes, not strtol).
It seems that glibc does allow hex: C99 mandates it but C89 seems not to
allow it.
I think that was a mistake, and strtol should have been used. C89 does
mandate that strtol (with base 0) handles hex constants, and also octal
ones. So changing to strtol would change the meaning of as.integer("011")
from 11 to 9.
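The difference between the two routines can be seen in a minimal C sketch.
The helper names via_strtod and via_strtol are hypothetical and are not R
internals; the first only approximates the current coercion path.

```c
#include <stdlib.h>

/* Convert with strtod and truncate, roughly the current behaviour.
   Whether "0xA" is accepted here depends on the C library
   (C99 requires it, C89 does not). */
long via_strtod(const char *s)
{
    return (long) strtod(s, NULL);
}

/* Convert with strtol and base 0: a leading "0x" or "0X" selects hex
   and any other leading "0" selects octal. */
long via_strtol(const char *s)
{
    return strtol(s, NULL, 0);
}
```

With these, via_strtod("011") is 11 whereas via_strtol("011") is 9, and
via_strtol("0xA") is 10 on any conforming C89 library.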
Proposal: we handle this ourselves and define what values are acceptable,
namely for as.integer:
[+|-][0-9]+
NA
0[x|X][0-9A-Fa-f]+
in all cases such that the converted value is in-range. (This does mean
as.integer("1e+05") would be invalid, but is it clear that is allowed
now?)
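A sketch of how the three accepted forms might be checked in C follows. The
function parse_int_strict is hypothetical. It assumes that the sign is
optional, that the hex form takes no sign, and that INT_MIN is out of range
because it is reserved for the integer NA.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of R. Returns 1 if s matches one of the
   proposed forms and is in range, 0 otherwise. For "NA" it returns 1 and
   sets *is_na; otherwise the converted value is stored in *out. */
int parse_int_strict(const char *s, int *out, int *is_na)
{
    const char *p = s;
    int base = 10;
    size_t ndigits;
    long v;

    *is_na = 0;
    if (strcmp(s, "NA") == 0) {
        *is_na = 1;
        return 1;
    }
    if (p[0] == '0' && (p[1] == 'x' || p[1] == 'X')) {
        /* Hex form: 0x or 0X followed by at least one hex digit. */
        base = 16;
        p += 2;
        ndigits = strspn(p, "0123456789ABCDEFabcdef");
    } else {
        /* Decimal form: optional sign, then ASCII digits only. */
        if (*p == '+' || *p == '-') p++;
        ndigits = strspn(p, "0123456789");
    }
    /* Reject an empty digit string and any trailing characters. */
    if (ndigits == 0 || p[ndigits] != '\0') return 0;

    /* The syntax is already validated, so strtol only does the arithmetic.
       Base 10 is passed explicitly so that "011" is 11, not octal 9. */
    errno = 0;
    v = strtol(s, NULL, base);
    if (errno == ERANGE || v > INT_MAX || v <= INT_MIN) return 0;
    *out = (int) v;
    return 1;
}
```

Under these assumptions "0xA" gives 10 and "011" gives 11, while "1e+05" and
"99999999999" are rejected.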
For as.numeric(), probably the C99 rules (which include NaN, Inf,
Infinity, and we need to add NA).
Alternatively, make and document the semantics to be
as.integer(as.numeric(char_string)) (which is effectively what we have
now, although it causes surprises).
[As a side point, some locales may accept non-Roman digits. I think we
need to exclude those everywhere, not just some places like parsing.]
2) R does not have integer constants. It would be convenient if it did,
and I can see no difficulty in allowing the same conversions when parsing
as when coercing. This would have the side effect that 100 would be
integer (but the coercion rules would come into play) but
200000000000000000 would be double. And x <- 0xce80 would be valid.
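The parsing rule in point 2 could be sketched in C as below. The names
classify_literal and lit_kind are hypothetical, the token is assumed to be
unsigned (the sign being a separate operator), and treating an out-of-range
hex constant as double is an assumption that the proposal does not settle.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

enum lit_kind { LIT_INVALID, LIT_INTEGER, LIT_DOUBLE };

/* Hypothetical classifier for an unsigned numeric token. */
enum lit_kind classify_literal(const char *s)
{
    int is_hex = (s[0] == '0' && (s[1] == 'x' || s[1] == 'X'));
    const char *digits = is_hex ? s + 2 : s;
    size_t n = strspn(digits, is_hex ? "0123456789ABCDEFabcdef"
                                     : "0123456789");
    const char *p = s;
    size_t nint, nfrac = 0;

    if (n > 0 && digits[n] == '\0') {
        /* Integer syntax: integer if it fits in an int, else double. */
        long v;
        errno = 0;
        v = strtol(s, NULL, is_hex ? 16 : 10);
        if (errno != ERANGE && v <= INT_MAX) return LIT_INTEGER;
        return LIT_DOUBLE;
    }

    /* Otherwise accept digits, optional ".digits", optional exponent,
       always with "." as the decimal point. */
    nint = strspn(p, "0123456789");
    p += nint;
    if (*p == '.') {
        p++;
        nfrac = strspn(p, "0123456789");
        p += nfrac;
    }
    if (nint + nfrac == 0) return LIT_INVALID;
    if (*p == 'e' || *p == 'E') {
        size_t nexp;
        p++;
        if (*p == '+' || *p == '-') p++;
        nexp = strspn(p, "0123456789");
        if (nexp == 0) return LIT_INVALID;
        p += nexp;
    }
    return *p == '\0' ? LIT_DOUBLE : LIT_INVALID;
}
```

Here "100" and "0xce80" classify as integer, "200000000000000000" and "3.12"
as double, and "3,12" as invalid.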
3) We do allow setting LC_NUMERIC, but it partially breaks R if the
decimal point is not ".". (I know of no locale in which it is not "." or
",", and we cannot allow "," as part of numeric constants when parsing.)
E.g.:
> Sys.setlocale("LC_NUMERIC", "fr_FR")
[1] "fr_FR"
Warning message:
setting 'LC_NUMERIC' may cause R to function strangely in:
setlocale(category, locale)
> x <- 3.12
> x
[1] 3
> as.numeric("3,12")
[1] 3,12
> as.numeric("3.12")
[1] NA
Warning message:
NAs introduced by coercion
We could do better by insisting that "." was the decimal point in all
internal conversions _to_ numeric. Then the effect of setting LC_NUMERIC
would primarily be on conversions _from_ numeric, especially printing and
graphical output. (One issue would be what to do with scan(), which has a
`dec' argument but is implemented assuming LC_NUMERIC=C. I would hope to
continue to have `dec' but perhaps with a locale-dependent default.) The
resulting asymmetry (R would not be able to parse its own output) would be
unhappy, but seems inevitable. (This could be implemented easily by having
a `dec' arg to EncodeReal and EncodeComplex, and using LC_NUMERIC to
control that rather than actually setting the locale category. For
example, deparsing needs to be done in LC_NUMERIC=C.)
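The idea of a `dec' argument on output can be sketched in C as follows. The
function encode_real is a hypothetical stand-in and does not reproduce the
actual EncodeReal interface. It assumes the process's LC_NUMERIC is left at
"C", so that snprintf always writes ".".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an EncodeReal-like routine with a `dec'
   argument. Formats x with the given number of significant digits and
   then replaces the "." by dec. Returns the length written, or -1 on
   error or truncation. */
int encode_real(char *buf, size_t size, double x, int digits, char dec)
{
    int n = snprintf(buf, size, "%.*g", digits, x);
    char *point;

    if (n < 0 || (size_t) n >= size) return -1;
    point = strchr(buf, '.');   /* at most one "." in %g output */
    if (point != NULL) *point = dec;
    return n;
}
```

Printing would pass the decimal point taken from the user's locale, while
deparsing would always pass '.', so that the deparsed text can be parsed
again.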
All of these could be implemented by customized versions of
strtod/strtol.
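One possible shape for such a customized strtod is sketched below. The name
strtod_dot is hypothetical. The sketch covers plain decimal constants only
(not hex, NaN, Inf or NA), and assumes the locale's decimal point is a single
byte and the input is shorter than 128 characters.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: accepts [+-]digits[.digits][e[+-]digits] with "."
   as the decimal point and ASCII digits only, whatever LC_NUMERIC is.
   Returns 1 and stores the value in *out on success, 0 otherwise. */
int strtod_dot(const char *s, double *out)
{
    char buf[128];
    const char *p = s;
    const char *dot = NULL;
    size_t nint, nfrac = 0, len;

    if (*p == '+' || *p == '-') p++;
    nint = strspn(p, "0123456789");
    p += nint;
    if (*p == '.') {
        dot = p++;
        nfrac = strspn(p, "0123456789");
        p += nfrac;
    }
    if (nint + nfrac == 0) return 0;
    if (*p == 'e' || *p == 'E') {
        size_t nexp;
        p++;
        if (*p == '+' || *p == '-') p++;
        nexp = strspn(p, "0123456789");
        if (nexp == 0) return 0;
        p += nexp;
    }
    len = (size_t) (p - s);
    if (*p != '\0' || len >= sizeof buf) return 0;

    /* The system strtod expects the current locale's decimal point, so
       substitute it for "." in a copy before converting. */
    memcpy(buf, s, len + 1);
    if (dot != NULL) buf[dot - s] = localeconv()->decimal_point[0];
    *out = strtod(buf, NULL);
    return 1;
}
```

With this, "3.12" converts to 3.12 and "3,12" is rejected in every locale,
which is the behaviour wanted for conversions _to_ numeric.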
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595