[Rd] string concatenation operator (revisited)

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Wed Dec 8 03:10:12 CET 2021

Taras and Duncan and others do make a point about things not needing to be built in to the base R distribution if something similar can already be found elsewhere.

To an extent, that is quite true. But what exactly should be in the core of a language that has this kind of extensibility? 

I note how annoying it can be to load a package that then loads all kinds of other packages it depends on, often ones you personally know nothing about and will mostly never use directly. If core R were minimal, this could get worse and there could be serious overhead.

Obviously some code belongs there that directly interacts with the operating system or that implements major parts of the language. But clearly there was more put into S/R than the minimum even from early days based on how the language was expected to be used. And it has grown further over the years. The recent addition of a modified form of a pipe operator, along with a new way to declare a function so it can be added into a pipeline, are examples. Ideally, any feature that becomes used heavily that is already in a package, let alone a package with many such useful features, can be a candidate for inclusion directly or by emulation.

Back to string concatenation: I think it is fair to suggest S began as a statistical language of sorts, with a heavy emphasis on numeric data and on vectorized data, which led to vectors and data.frames being "built-in". Doing lots more with text was a secondary consideration, one that functions like paste() could not only easily handle but handle with vectorized input. It works pretty well and arguably overloading '+' is not needed. And note, underneath it all, R programs can largely be written using functions rather than operators. You can type:

`+`(5, `*`(2, 3))

and it evaluates to 11, since it means 5+(2*3).
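
To make the point about functions versus operators concrete, here is a small sketch using only base R. The variable names are just for illustration.

```r
# Operators in R are ordinary functions that can be called by name.
prefix_form <- `+`(5, `*`(2, 3))
infix_form  <- 5 + (2 * 3)
same_result <- identical(prefix_form, infix_form)

# Because `+` is a function, it can be handed to other functions.
shifted <- sapply(1:3, `+`, 10)   # adds 10 to each of 1, 2, 3
total   <- Reduce(`+`, 1:4)       # folds 1 + 2 + 3 + 4
```

So anything an operator does can, in principle, be spelled as a plain function call.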

And paste() is not the only function you can use to do string concatenation. Consider one trivial use of sprintf() which also does much more:

> first <- "Avi"
> last <- "Gross"
> combined <- sprintf("%s%s", first, last)
> print(combined)
[1] "AviGross"

Obviously this also supports including a space between the %s copies and so on.
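
As a rough illustration of that flexibility, and of how it compares with paste() and paste0(), here is a sketch in base R. The variable names are only for the example.

```r
first <- "Avi"
last  <- "Gross"

# A space (or any other literal text) simply goes into the format string.
spaced <- sprintf("%s %s", first, last)

# paste() puts a space between arguments by default; paste0() puts nothing.
via_paste  <- paste(first, last)
via_paste0 <- paste0(first, last)

# Both sprintf() and paste() are vectorized over their arguments.
labels <- sprintf("%s_%d", c("a", "b"), 1:2)
```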

I note other languages also keep trying to expand to be everything for everybody. I could use examples from many, but Python is easy to see in many ways and is a bit of a competitor to R for some purposes. Python too has packages, called modules, that extend the interpreted language, and tons of modules have been added over the years, including some to deal with items not included when the language was created. One reason R has done so well is that Python had things like lists but no vectorized methods and other components of the kind R had, so lots of programs must first import modules like numpy and pandas to be able to create Series and DataFrames and manipulate them efficiently. Many modules have now been built on top of these extensions for various kinds of scientific programming, and at some point you wonder why such things are not built in to the language to fill the gap. Lists are slow and dictionaries have limited use for many things. Tasks like machine learning can use huge amounts of data and do complex calculations repeatedly, so Python has had to be extended. Yet there too, most things have to be imported at runtime.

I am not a fanatic in R about the tidyverse set of packages and often do some things using the built-in ways, or use the tidyverse, or mix and match. Both have value for me and some things remain easier than others depending on circumstances. Of course, reusing function names that other packages already use makes it hard to incorporate. But I don't think it would be hard to create a base R that includes a subset of the tidyverse as part of the base and leaves other parts to be brought in only as needed.

The talk about string concatenation also mentions the glue package, which I sometimes use too. Concatenating strings and other types into a bigger string is done in many languages, and I note I have used five different built-in methods in Python, as people keep wanting to bring in the way it is already done in some other language they like. I am talking not so much about concatenation as about variants on the printf() family that format a string from many components, and some look a bit like glue. Potentially, a package like glue could also qualify as worth including in base R, but let me clarify. There is a difference between being in the minimal core of a language and being in a list of packages that are included by default when R is built. Even if you include a package by default, it should not be an error to say library(name) if it is already loaded on your machine. So even after you make something part of the base distribution, people may continue to invoke it as if it were not there, lest the code be run on an older version.
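
The point about library() can be checked with a package that already ships with R. This is only a sketch, and it uses utils as a stand-in for any package that is included by default.

```r
# utils ships with every R installation and is normally attached at startup.
library(utils)

# A second call is not an error; the already-attached package is simply reused.
library(utils)

# The package appears on the search path once, however often it is requested.
attached  <- "package:utils" %in% search()
n_entries <- sum(search() == "package:utils")
```

So code written for an older R that still calls library() on something now bundled would keep working unchanged.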

The reality is that there is a tradeoff with significant costs: ease of use with many choices on one side, and on the other the expense of running a bloated application that takes longer to load, uses more memory, spends more time searching namespaces and so on.

Does adding a properly designed "+" cause much bloat? Maybe not. But the guardians of the language get so many requests that realistically they can only approve a small number for each release, and often then have to spend more time fixing bugs after getting complaints about code that does not work the same anymore!

-----Original Message-----
From: R-devel <r-devel-bounces using r-project.org> On Behalf Of Taras Zakharko
Sent: Tuesday, December 7, 2021 4:09 AM
To: r-devel <r-devel using r-project.org>
Subject: Re: [Rd] string concatenation operator (revisited)

Great summary, Avi. 

String concatenation could be trivially added to R, but it probably should not be. You will notice that modern languages tend not to use “+” to do string concatenation (they either have a custom operator or a special kind of pattern to do it) due to practical issues such an approach brings (implicit type casting, lack of commutativity, performance etc.). These issues will be felt even more so in R with its weak typing, idiosyncratic casting behavior and NAs. 

As others have pointed out, any kind of behavior one wants from string concatenation can be implemented by custom operators as needed. This is not something that needs to be in base R. I would rather like the efforts to be directed at improving string formatting (such as glue-style built-in string interpolation).
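
A minimal sketch of such a custom operator follows. The name %+% is hypothetical, chosen only for illustration, and the NA handling is one possible choice (the propagation rule argued for elsewhere in this thread), not established behavior of any base function.

```r
# Hypothetical operator name; any %name% form can be user-defined in R.
`%+%` <- function(lhs, rhs) {
  out <- paste0(lhs, rhs)
  # Unlike paste0(), propagate missing values instead of producing "NA" text.
  out[is.na(lhs) | is.na(rhs)] <- NA_character_
  out
}

joined     <- "hello" %+% "world"    # plain concatenation
vectorized <- c("a", "b") %+% 1:2    # element-wise, with coercion to character
missing    <- "abc" %+% NA           # NA propagates rather than giving "abcNA"
```

Someone who prefers paste0() semantics for NA would just drop the one line that overwrites the missing positions.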

— Taras

> On 7 Dec 2021, at 02:27, Avi Gross via R-devel <r-devel using r-project.org> wrote:
> After seeing what others are saying, it is clear that you need to 
> carefully think things out before designing any implementation of a 
> more native concatenation operator whether it is called "+" or 
> anything else. There may not be any ONE right solution but unlike a 
> function version like paste() there is nowhere to place any options that specify what you mean.
> You can obviously expand paste() to accept arguments like 
> replace.NA="" or replace.NA="<NA>" and similar arguments on what to do 
> if you see a NaN, an Inf or -Inf, a NULL or even an NA_character_ and 
> so on. Heck, you might tell it to make other substitutions as in 
> substitute=list(100=99, D=F) or any other nonsense you can come up with.
> But you have nowhere to put options when saying:
> c <- a + b
> Sure, you could set various global options before the addition and 
> maybe reset them after, but that is not a way I like to go for 
> something this basic.
> And enough such tinkering makes me wonder if it is easier to ask a 
> user to use a slightly different function like this:
> paste.no.na <- function(...) do.call(paste, Filter(Negate(is.na),
> list(...)))
> The above one-line function removes any NA from the argument list to 
> make a potentially shorter list before calling the real paste() using it.
> Variations can, of course, be made that allow functionality as above. 
> If R was a true object-oriented language in the same sense as others 
> like Python, operator overloading of "+" might be doable in more 
> complex ways but we can only work with what we have. I tend to agree 
> with others that in some places R is so lenient that all kinds of 
> errors can happen because it makes a guess on how to correct it.
> Generally, if you really want to mix numeric and character, many 
> languages require you to transform any arguments to make all of 
> compatible types. The paste() function is clearly stated to coerce all 
> arguments to be of type character for you. Whereas a+b makes no such 
> promises and also is not properly defined even if a and b are both of 
> type character. Sure, we can expand the language but it may still do 
> things some find not to be quite what they wanted as in "2"+"3"
> becoming "23" rather than 5. Right now, I can use
> as.numeric("2")+as.numeric("3") and get the intended result after making very clear to anyone reading the code that I wanted strings converted to floating point before the addition.
> As has been pointed out, the plus operator if used to concatenate does 
> not have a cognate for other operations like -*/ and R has used most 
> other special symbols for other purposes. So, sure, we can use something like ....
> (4 periods) if it is not already being used for something but using + 
> here is a tad confusing. Having said that, the makers of Python did 
> make that choice.
> -----Original Message-----
> From: R-devel <r-devel-bounces using r-project.org> On Behalf Of Gabriel Becker
> Sent: Monday, December 6, 2021 7:21 PM
> To: Bill Dunlap <williamwdunlap using gmail.com>
> Cc: Radford Neal <radford using cs.toronto.edu>; r-devel 
> <r-devel using r-project.org>
> Subject: Re: [Rd] string concatenation operator (revisited)
> As I recall, there was a large discussion related to that which 
> resulted in the recycle0 argument being added (but defaulting to
> FALSE) for paste/paste0.
> I think a lot of these things ultimately mean that if there were to be 
> a string concatenation operator, it probably shouldn't have behavior 
> identical to paste0. Was that what you were getting at as well, Bill?
> ~G
> On Mon, Dec 6, 2021 at 4:11 PM Bill Dunlap <williamwdunlap using gmail.com> wrote:
>> Should paste0(character(0), c("a","b")) give character(0)?
>> There is a fair bit of code that assumes that paste("X",NULL) gives "X"
>> but c(1,2)+NULL gives numeric(0).
>> -Bill
>> On Mon, Dec 6, 2021 at 1:32 PM Duncan Murdoch 
>> <murdoch.duncan using gmail.com>
>> wrote:
>>> On 06/12/2021 4:21 p.m., Avraham Adler wrote:
>>>> Gabe, I agree that missingness is important to factor in. To 
>>>> somewhat abuse the terminology, NA is often used to represent 
>>>> missingness. Perhaps concatenating character something with 
>>>> character something missing should result in the original character?
>>> I think that's a bad idea.  If you wanted to represent an empty 
>>> string, you should use "" or NULL, not NA.
>>> I'd agree with Gabe, paste0("abc", NA) shouldn't give "abcNA", it 
>>> should give NA.
>>> Duncan Murdoch
>>>> Avi
>>>> On Mon, Dec 6, 2021 at 3:35 PM Gabriel Becker <gabembecker using gmail.com> wrote:
>>>>> Hi All,
>>>>> Seeing this and the other thread (and admittedly not having 
>>>>> clicked through to the linked r-help thread), I wonder about NAs.
>>>>> Should NA <concat> "hi there" not result in NA_character_? This 
>>>>> is not what any of the paste functions do, but in my opinion, 
>>>>> NA + <non_na_value> seems like it should be NA (not "NA"), 
>>>>> particularly if we are talking about `+` overloading, but 
>>>>> potentially even in the case of a distinct concatenation operator?
>>>>> I guess what I'm saying is that in my head missingness propagation 
>>>>> rules should take priority in such an operator (ie NA + <anything> 
>>>>> should *always* be NA).
>>>>> Is that something others disagree with, or has it just not come up 
>>>>> yet in (the parts I have read) of this discussion?
>>>>> Best,
>>>>> ~G
>>>>> On Mon, Dec 6, 2021 at 10:03 AM Radford Neal 
>>>>> <radford using cs.toronto.edu>
>>>>> wrote:
>>>>>>>> In pqR (see pqR-project.org), I have implemented ! and !! as 
>>>>>>>> binary string concatenation operators, equivalent to paste0 and 
>>>>>>>> paste, respectively.
>>>>>>>> For instance,
>>>>>>>>> "hello" ! "world"
>>>>>>>>      [1] "helloworld"
>>>>>>>>> "hello" !! "world"
>>>>>>>>      [1] "hello world"
>>>>>>>>> "hello" !! 1:4
>>>>>>>>      [1] "hello 1" "hello 2" "hello 3" "hello 4"
>>>>>>> I'm curious about the details:
>>>>>>> Would `1 ! 2` convert both to strings?
>>>>>> They're equivalent to paste0 and paste, so 1 ! 2 produces "12", 
>>>>>> just like paste0(1,2) does.  Of course, they wouldn't have to be 
>>>>>> exactly equivalent to paste0 and paste - one could impose 
>>>>>> stricter requirements if that seemed better for error detection.
>>>>>> Off hand, though, I think automatically converting is more in 
>>>>>> keeping with the rest of R.  Explicitly converting with 
>>>>>> as.character could be tedious.
>>>>>> I suppose disallowing logical arguments might make sense to guard 
>>>>>> against typos where ! was meant to be the unary-not operator, but 
>>>>>> ended up being a binary operator, after some sort of typo.  I 
>>>>>> doubt that this would be a common error, though.
>>>>>> (Note that there's no ambiguity when there are no typos, except 
>>>>>> that when negation is involved a space may be needed - so, for 
>>>>>> example, "x" !  !TRUE is "xFALSE", but "x"!!TRUE is "x TRUE".
>>>>>> Existing uses of double negation are still fine - eg, a <- !!TRUE
>>>>>> still sets a to TRUE.
>>>>>> Parsing of operators is greedy, so "x"!!!TRUE is "x FALSE", not
>>>>>> "xTRUE".)
>>>>>>> Where does the binary ! fit in the operator priority?  E.g. how 
>>>>>>> is
>>>>>>>   a ! b > c
>>>>>>> parsed?
>>>>>> As (a ! b) > c.
>>>>>> Their precedence is between that of + and - and that of < and >.
>>>>>> So "x" ! 1+2 evaluates to "x3" and "x" ! 1+2 < "x4" is TRUE.
>>>>>> (Actually, pqR also has a .. operator that fixes the problems 
>>>>>> with generating sequences with the : operator, and it has 
>>>>>> precedence lower than + and - and higher than ! and !!, but 
>>>>>> that's not relevant if you don't have the .. operator.)
>>>>>>    Radford Neal
>>>>>> ______________________________________________
>>>>>> R-devel using r-project.org mailing list 
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel