[Rd] "+" for character method...

Duncan Murdoch murdoch at stats.uwo.ca
Sat Aug 26 17:33:30 CEST 2006


On 8/26/2006 10:26 AM, John Chambers wrote:
> Well, two comments, in two non-compatible directions.
> 
> 1.  I have to say that I find the idea of using "+" to paste character 
> strings together aesthetically ugly.
> 
> IMO, one thing that makes functional object-based languages attractive 
> is that the generic function retains a consistent _function_, that is, 
> purpose and meaning, of which the methods are implementations.
> 
> It escapes me totally why I should think of pasting strings as addition 
> in the mathematical or intuitive sense (as Brian points out re 
> commutativity, it fails a number of axiomatic properties).  And if so, 
> what about "-", "*",  "/" and so on?  The mind boggles.

Assuming that your "totally" is literally true:

Strings don't form a commutative group under concatenation, but the 
operation is associative, and there's a zero element "".  This makes 
them a monoid or unitary semigroup.  The natural numbers (including 
zero) are another example of a monoid under addition.  It's not that 
weird to have addition defined without negatives.

Concatenation seems to me to be the most natural interpretation of 
addition for strings.
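
As a minimal sketch of the monoid properties above, using paste(a, b, 
sep="") as the concatenation operation (the variable names are only 
illustrative):

```r
# Concatenation via paste(), as used elsewhere in this thread.
cat2 <- function(a, b) paste(a, b, sep = "")

x <- "foo"; y <- "bar"; z <- "baz"

# Associative: the grouping of the operands does not matter.
assoc <- identical(cat2(cat2(x, y), z), cat2(x, cat2(y, z)))

# Identity element: "" leaves a string unchanged on either side.
ident <- identical(cat2(x, ""), x) && identical(cat2("", x), x)

# Not commutative: the order of the operands does matter.
commut <- identical(cat2(x, y), cat2(y, x))

print(c(associative = assoc, identity = ident, commutative = commut))
```

The first two checks come out TRUE and the third FALSE, which is the 
monoid-but-not-commutative-group structure described above.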

According to Wikipedia, the "+" operator is used for concatenation in 
BASIC, Pascal, Delphi, JavaScript, Java, Python, C++ and Ruby.  These 
are probably the most commonly used modern languages other than C (which 
has no concatenation operator) or Fortran (which I just discovered today 
uses "//").

Other possibilities on the Wikipedia page that don't conflict with 
something else in R are:

Visual Basic and VHDL use the "&" sign.

Standard SQL, PL/I, and Maple from version 6 use double pipe signs ("||").

OCaml uses "^".

So it seems to me that defining addition of strings to be concatenation 
is a reasonably widespread convention.

I don't think there are widespread conventions for subtraction, 
multiplication or division of strings, so I can't see any argument for 
implementing them.

> Its excuse presumably is to save typing, but I would favor using some 
> %thing% operator at the cost of a couple of extra key strokes.

I think consistency with other common languages is a stronger reason. 
Other than that, I'd be perfectly happy with %+%.
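
For illustration only, here is one way a user-level %+% could be written 
today.  This is a sketch, not a proposal for the actual implementation; 
it simply delegates to paste(), so non-character operands are coerced 
with as.character() rather than rejected:

```r
# Hypothetical user-defined operator; not part of base R.
`%+%` <- function(e1, e2) {
  # paste() coerces its arguments to character and recycles
  # the shorter vector, so mixed Char/Num operands are accepted.
  paste(e1, e2, sep = "")
}

"a" %+% "b"    # "ab"
"x" %+% 1:3    # "x1" "x2" "x3"
```

A stricter variant could call stop() unless both operands satisfy 
is.character(), which corresponds to the "both arguments must be 
character" rule suggested in the quoted text further down.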

Duncan Murdoch


> 
> 2.  Having said that,  it's a reasonable hope that efficiency of 
> dispatch will not be a serious problem.  There are a bunch of fixes, for 
> semantic correctness and efficiency, nearly ready to commit (the 
> Bioconductor folks have been doing some valuable testing).  These should 
> help, and more important perhaps it's fairly easy to see how dispatch in 
> this form can be tuned for performance if necessary.
> 
> John
> 
> Bill Dunlap wrote:
>>>>     >> There have been propositions to make "+" work in S (and
>>>>     >> R) like in some other languages, namely for character
>>>>     >> (vectors),
>>>>     >>
>>>>     >> a + b := paste(a,b, sep="")
>>>> ...
>>>> yes.  I think however if we keep speed and clarity and catching
>>>> user errors all in mind, it would be enough - and better - to
>>>> only dispatch to paste(.,.) when both arguments are character
>>>> (vectors), i.e., the above case needed
>>>>  "a" + as.character(1:7) or "a" + paste(1:7) or "a" + format(1:7)
>>>> which after all is really much clearer, even more so for cases of
>>>>  "1" + 2  which I'd rather want keeping to give errors.
>>>>
>>>> If  Char + Num  should work like above, then also
>>>>     Num + Char  should (since after all,  "+" should be commutative
>>>> 			apart from floating point precision issues).
>>>>
>>>> and so the internal C code gets a bit more complicated and slightly
>>>> slower..  something we had in mind we should strongly avoid...
>>>>       
>>> I doubt that it would be measurably slower, but I agree that requiring
>>> both args to be Char could be done in fewer operations than just
>>> requiring one.
>>>
>>> However, I think the consistency argument is stronger.  We have a rule
>>> that operations on mixed types promote the more restrictive type to the
>>> less restrictive one, and I don't think we should handle this case
>>> differently.
>>>
>>> So I'd say we should allow all of Char + Num, Num + Char, and Char +
>>> Char, or, if this costs too much at evaluation time, we shouldn't allow
>>> any of them.
>>>     
>> Currently doing arithmetic on mixed class data.frames
>> produces useful warnings and errors.  E.g.,
>>
>>   > z <- data.frame(Factor=factor(c("Lo","Med","High")),
>>                   Char=letters[1:3],
>>                   Num1=exp(0:2),
>>                   Num2=(1:3)*pi,
>>                   stringsAsFactors=FALSE)
>>   > z+1
>>   Error in FUN(left, right) : non-numeric argument to binary operator
>>   In addition: Warning message:
>>   + not meaningful for factors in: Ops.factor(left, right)
>>   > z[,-2] + 1
>>     Factor     Num1      Num2
>>   1     NA 2.000000  4.141593
>>   2     NA 3.718282  7.283185
>>   3     NA 8.389056 10.424778
>>   Warning message:
>>   + not meaningful for factors in: Ops.factor(left, right)
>>
>> If we made + do paste(sep="") for character+number then
>> we would lose the messages and let garbage flow further
>> down the pipe.
>>
>> Should factor data be treated as character data in this
>> case (e.g., pasting to the levels)?  That would be weird,
>> but many users confound character and factor data when
>> they are buried in data.frames.
>>
>> ----------------------------------------------------------------------------
>> Bill Dunlap
>> Insightful Corporation
>> bill at insightful dot com
>> 360-428-8146
>>
>>  "All statements in this message represent the opinions of the author and do
>>  not necessarily reflect Insightful Corporation policy or position."
>>
>>
>>   
>


