[Rd] Feature Request: Allow Underscore Separated Numbers

Fri Jul 15 23:05:13 CEST 2022

Yes, Ivan, obviously someone can try out a change and check if it causes
problems.

And although I would think that most delayed evaluation either is never
invoked or is done as you describe, with internal functions operating on
parse trees, I suspect there exist some cases that work differently.

For example, I can write code in a mini-language of my own that I will
then analyze. What stops me from leaving the quotes off a regular
expression, because I am going to read the text exactly as is and manipulate
it, as long as the RE contains nothing that keeps it from being accepted as
an argument to a function, such as a comma? Inside I may have something like
a pattern to match a file name starting with anything, then an underscore,
then a digit or two, and finally a file suffix. I would not want anything to
parse that and remove the underscore that is part of a filename. The
argument is meant to be atomic.

Many things in the tidyverse do variations on delayed evaluation and some
seem to be a piece at a time. An example would be how mutate() allows
multiple clauses for new=f(old) where later lines use columns created in
earlier lines that did not exist before and can only be used if the
preceding part went well. I may be wrong on how it is done, but it strikes
me as possible they read in the raw text till they match an end of some kind
like a top-level comma or top level close-parenthesis. My GUESS is only then
might they evaluate that chunk after substitutions or other ploys to use
some namespace. Will a column name like evil__666__ survive?

Again, I am not AGAINST any proposal but the people who have to pay the
price in terms of needing to arrange or pay for development, documentation,
testing and so on, are the ones who need to be convinced. My point is that
in some ways R is a different kind of programming language than, say,
Python. I experimented briefly in Python and note that its implementation of
this feature is fairly robust. For instance, converting a string to an int
works as expected: a = int("1" "_" "122") sets a to 1122, since the adjacent
string literals are concatenated into "1_122" first.

Be warned though that the current python implementation generates an error
if you have two or more underscores in a row as in:

a=1__1
SyntaxError: invalid decimal literal
a=1___1
SyntaxError: invalid decimal literal

It also does not tolerate one or more underscores at the end, giving the
same error, and it really gets mad at an initial underscore like _1, where
it asks if you mean "_". That is because a single underscore is a valid
variable name, as are multiple consecutive underscores, and it is often used
as an I DON'T CARE in code like this, albeit any variable can be used, as
the last assignment wins:

(_,_,a) = (1,2,3)
_
2
a
3

(In the above, you are seeing commands and output alternating, if not
clear.)
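
For anyone who wants to reproduce the rules above without typing at the
prompt, here is a minimal sketch. The name literal_status is a hypothetical
helper, not part of Python; it relies on the standard ast.literal_eval,
which parses a string as a literal and refuses anything else.

```python
import ast

def literal_status(src):
    """Return the parsed value, or the name of the exception raised."""
    try:
        return ast.literal_eval(src)
    except (SyntaxError, ValueError) as exc:
        return type(exc).__name__

# One valid literal, then doubled, trailing and leading underscores.
for src in ["1_122", "1__1", "1_", "_1"]:
    print(src, "->", literal_status(src))
```

The first should come back as 1122 and the next two as SyntaxError. The
last, _1, is reported as ValueError because it parses as a name rather than
as a number.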

And as it happens, a great many Python names contain runs of underscores, to
the point where special names like __name__ and __init__ are called dunder
name and dunder init, as in double underscore. And note that Python is not
that much younger than R/S and this feature was added fairly late, in
version 3.6, about 5 years ago, long after version 3.0 made many programs
for version 2.x incompatible.

My point is not about Python itself. Someone may want to see how the
underscore-in-a-number feature is actually implemented in any of the
languages that now allow it, and carefully document exactly in what
circumstances it is allowed in R and where, if anywhere, it differs from
those other languages.

If it can be done with a very few localized changes, great. My objection
that regular expressions become more complex when they need to handle
underscores is likely not a major obstacle, as Python supports those too.
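
To give a feel for the added complexity, here is a rough Python sketch of a
pattern for plain decimal integers next to one that also allows single
underscores strictly between digits, which is roughly the rule Python
enforces. The names PLAIN, SEPARATED and accepts are my own for
illustration; this is not what any R or Python tokenizer actually uses.

```python
import re

# Plain decimal integers: one or more ASCII digits.
PLAIN = re.compile(r"[0-9]+")

# Same, but groups of digits may be joined by single underscores.
# Doubled, leading and trailing underscores are all rejected.
SEPARATED = re.compile(r"[0-9]+(?:_[0-9]+)*")

def accepts(pattern, text):
    """True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

for text in ["1000000", "1_000_000", "1__1", "1_", "_1"]:
    print(text, accepts(PLAIN, text), accepts(SEPARATED, text))
```

The second pattern is only slightly longer than the first, which is why I
do not see this as a serious obstacle.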

Luckily, my opinion is just my own as I have no direct stake in the outcome.
I personally handle large numbers fine.

Avi




-----Original Message-----
From: Ivan Krylov <krylov.r00t using gmail.com> 
Sent: Friday, July 15, 2022 1:22 PM
To: avi.e.gross using gmail.com
Cc: r-devel using r-project.org
Subject: Re: [Rd] Feature Request: Allow Underscore Separated Numbers

On Fri, 15 Jul 2022 11:25:32 -0400
<avi.e.gross using gmail.com> wrote:

> R normally delays evaluation so chunks of code are handed over 
> untouched to functions that often play with the text directly without 
> evaluating it until, perhaps, much later.

Do they play with the text, or with the syntax tree after it went through
the parser? While it's true that R saves the source text of the functions
for ease of debugging, it's not guaranteed that a given object will have
source references, and typical NSE functions operate on language objects
which are tree-like structures containing R values, not source text.

You are, of course, right that any changes to the syntax of the language
must be carefully considered, but if anyone wants to play with this idea, it
can be implemented in a very simple manner:

--- src/main/gram.y	(revision 82598)
+++ src/main/gram.y	(working copy)
@@ -2526,7 +2526,7 @@
     YYTEXT_PUSH(c, yyp);
     /* We don't care about other than ASCII digits */
     while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
-	   || c == 'x' || c == 'X' || c == 'L')
+	   || c == 'x' || c == 'X' || c == 'L' || c == '_')
     {
 	count++;
 	if (c == 'L') /* must be at the end.  Won't allow 1Le3 (at present). */
@@ -2533,6 +2533,9 @@
 	{   YYTEXT_PUSH(c, yyp);
 	    break;
 	}
+	if (c == '_') { /* allow an underscore anywhere inside the literal */
+	    continue;
+	}
 	
 	if (c == 'x' || c == 'X') {
 	    if (count > 2 || last != '0') break;  /* 0x must be first */

To an NSE function, the underscored literals are indistinguishable from
normal ones, because they don't see the literals:

stopifnot(all.equal(\() 1000000, \() 1_000_000))
f <- function(x, y) stopifnot(all.equal(substitute(x), substitute(y)))
f(1e6, 1_000_000)

Although it's true that the source references change as a result:

lapply(
 list(\() 1000000, \() 1_000_000),
 \(.) as.character(getSrcref(.))
)
# [[1]]
# [1] "\\() 1000000"
#
# [[2]]
# [1] "\\() 1_000_000"

This patch is somewhat simplistic: it allows both multiple underscores in
succession and underscores at the end of the number literal. Perl does so
too, but with a warning:

perl -wE'say "true" if 1__000_ == 1000'
# Misplaced _ in number at -e line 1.
# Misplaced _ in number at -e line 1.
# true

--
Best regards,
Ivan