[R] Weird and changed as.roman() behavior
Richard O'Keefe
r@oknz @end|ng |rom gm@||@com
Sat Jan 18 04:03:51 CET 2025
"Roman numerals" is actually a tricky subject, since there were
different versions at different times.
It is worth noting that the Unicode character set (and R does support
Unicode, does it not?) includes
the Roman numeral characters for 5,000, 10,000, 50,000 and 100,000, so
the idea that 3999 is an acceptable limit
no longer makes much sense.
It is also worth noting that Unicode includes single-character
versions of 1-12.
The characters are U+2160 to U+2188.
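To make that concrete, here is a small sketch that builds some of those
characters from their code points in R. The vector names are my own labels,
chosen for illustration; whether the glyphs display properly depends on the
font and locale in use.

```r
# Code points in the Unicode "Number Forms" block.
# The names on the left are illustrative labels, not official identifiers.
cp <- c(one              = 0x2160,
        twelve           = 0x216B,
        five_thousand    = 0x2181,
        ten_thousand     = 0x2182,
        fifty_thousand   = 0x2187,
        hundred_thousand = 0x2188)

# One string per code point
chars <- intToUtf8(cp, multiple = TRUE)
names(chars) <- names(cp)
print(chars)

# The twelve single-character forms of 1..12 (U+2160 to U+216B)
singles <- intToUtf8(0x2160:0x216B, multiple = TRUE)
print(singles)
```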
For what it's worth, the Romans could express fractions that were
multiples of 1/12.
Converting between numbers and their Roman forms is not something that
regular expressions are a good tool for.
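As a rough illustration of what I mean, here is a minimal sketch of a
converter that scans the symbols instead of matching a pattern. The name
roman_to_int is hypothetical and this is not the code used by as.roman();
it is deliberately permissive (it accepts "IIII" and "IIIII", and it does
not reject oddly ordered input such as "IM"), so real code would need
extra validation.

```r
# Hypothetical helper for illustration only; not utils:::.roman2numeric.
roman_to_int <- function(s) {
  values <- c(I = 1L, V = 5L, X = 10L, L = 50L, C = 100L, D = 500L, M = 1000L)
  ch <- strsplit(toupper(s), "", fixed = TRUE)[[1]]
  # Empty input or any unknown symbol gives NA
  if (length(ch) == 0L || !all(ch %in% names(values))) return(NA_integer_)
  v <- unname(values[ch])
  # Value of the symbol to the right of each symbol (0 after the last one)
  nxt <- c(v[-1L], 0L)
  # A symbol smaller than its right neighbour is subtracted (IV = 4),
  # otherwise it is added
  sum(ifelse(v < nxt, -v, v))
}

roman_to_int("MMMMCMXCIX")
roman_to_int("IIIII")
```

Because no repetition counts are involved, there is no upper limit of 3999
built into this approach.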
On Fri, 17 Jan 2025 at 00:05, Martin Maechler
<maechler using stat.math.ethz.ch> wrote:
>
> >>>>> Stephanie Evert
> >>>>> on Wed, 15 Jan 2025 13:18:03 +0100 writes:
>
> > Well, the real issue then seems to be that .roman2numeric uses an invalid regular expression:
> >>> grepl("^M{,3}D?C{,4}L?X{,4}V?I{,4}$", cc)
> >> [1] TRUE TRUE TRUE TRUE TRUE
>
> > or
>
> >>> grepl("^I{,2}$", c("II", "III", "IIII"))
> >> [1] TRUE TRUE FALSE
>
>
> > Both the TRE and the PCRE specification only allow repetition quantifiers of the form
>
> > {a}
> > {a,b}
> > {a,}
>
> > https://laurikari.net/tre/documentation/regex-syntax/
> > https://www.pcre.org/original/doc/html/pcrepattern.html#SEC17
>
> > {,2} and {,4} are thus invalid and seem to result in undefined behaviour (which PCRE and TRE fill in different ways, but consistently not what was intended).
>
> >> > grepl("^I{,2}$", c("II", "III", "IIII"))
> >> [1] TRUE TRUE FALSE
>
> >> > grepl("^I{,2}$", c("II", "III", "IIII"), perl=TRUE)
> >> [1] FALSE FALSE FALSE
>
> > Fix thus is easy: {,4} => {0,4}
>
> > Best,
> > Stephanie
>
> Thanks a lot, Stephanie -- indeed, I think I would not have searched in
> this direction at all.
> ( To me it seemed "obvious" that if {3,} is well defined, {,3}
> would be so, too... But I was *wrong*, and actually I also
> understand that {,3} is not needed, and {0,3} is clearer,
> whereas {3,} is not easy to re-express ( '{0,inf}' or similar
> would make the code considerably more complicated and probably slower.. ) )
>
> Actually, to remain backward compatible (see Jani's original report:
> he'd like "IIIII" to work, as it did for many/most of us),
> we should replace {,4} by {0,5}.
>
> But there's more: our current help page
> https://search.r-project.org/R/refmans/utils/html/roman.html
> says
>
> > Only numbers between 1 and 3999 have a unique representation
> > as roman numbers, and hence others result in as.roman(NA).
>
> which is really not quite true, in more than one sense:
>
> 1. as.roman(3899:3999) # works fine
>
> not producing any NA
>
> 2. I think, e.g., "MMMM"
> is a pretty unique representation of 4000.
>
> Also, another piece of (online) software
> https://www.rapidtables.com/convert/number/date-to-roman-numerals.html
>
> does convert _dates_ up to the year 4999, see,
> https://www.rapidtables.com/convert/number/date-to-roman-numerals.html?msel=January&dsel=1&year=4999&fmtsel=MM.DD.YYYY
>
> giving MMMMCMXCIX for 4999.
>
> Hence, I also think we should enlarge the valid range from current
> {1 .. 3999} to
> {1 .. 4999}
>
> Martin
>