[Rd] Unicode whitespace

Mon Jan 7 19:03:10 CET 2008

On Sat, 5 Jan 2008, hadley wickham wrote:

> On Jan 5, 2008 1:40 AM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>> I presume you want this only in a UTF-8 locale?
>
> Yes, although my assumption is that this will become an increasingly
> common locale as time goes by.

Probably, except on Windows.  UTF-8 support in commercial Unixen is often 
poor (i.e. it is nominally there but does not work well).  The need for 
other non-8-bit encodings is diminishing, e.g. the various shift encodings 
for Japanese will likely die out, but only over a long time.

>> Currently this is done by
>>
>> static int SkipSpace(void)
>> {
>>      int c;
>>      while ((c = xxgetc()) == ' ' || c == '\t' || c == '\f')
>>         /* nothing */;
>>      return c;
>> }
>>
>> in gram.c.  We could make use of isspace and its wide-char equivalent
>> iswspace.  However:
>>
>>
>> - there is the perennial debate over whether \v is whitespace.
>>
>> R-lang says
>>
>>    Although not strictly tokens, stretches of whitespace characters
>>    (spaces and tabs) serve to delimit tokens in case of ambiguity,
>>
>> which suggests it has a minimal view of whitespace.
>>
>>
>> - iswspace is often rather unreliable.  E.g. glibc says
>>
>>      The wide character class "space" always contains  at  least  the  space
>>      character and the control characters '\f', '\n', '\r', '\t', '\v'.
>>
>> and I think it usually does not contain other forms of spaces.  More
>> seriously
>>
>>      The  behaviour  of  iswspace()  depends on the LC_CTYPE category of the
>>      current locale.
>>
>> so what is a space will depend on the encoding (hence my question about
>> UTF-8).  And Ei-ji Nakama has replaced iswspace on Mac OS X, because
>> apparently it is wrongly implemented there.
>>
>>
>> - it would complicate the parser as look-ahead would be needed (you would
>> need to read the next mbcs, check if it were whitespace and push it back
>> if not).  We do that elsewhere, though.
>
> I had assumed the parser would be unicode/mb aware already and so
> would be an easy fix.

It's not, because it has to work on non-Unicode platforms (e.g. Windows 9x 
until R 2.7.0), and even platforms in which wchar_t is not Unicode.  (More 
precisely, we need to avoid making assumptions we can't verify on such 
platforms.)
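
To illustrate the look-ahead mentioned in the quoted text (read the next 
multibyte character, test it, push it back if it is not whitespace), here 
is a minimal sketch.  It is not the code in gram.c: it assumes UTF-8 input 
held in a NUL-terminated memory buffer, and the names src, next_cp, 
is_blank_cp and skip_space are made up for the example.  The blank test is 
a stand-in that knows only space, tab, form feed and NBSP.

```c
#include <stddef.h>

/* Hypothetical input source: a NUL-terminated UTF-8 buffer and a cursor. */
struct src { const unsigned char *buf; size_t pos; };

/* Decode one UTF-8 character at the cursor and advance past it.
   Returns the code point, or -1 at end of input or on a malformed
   sequence.  Simplified: overlong forms and surrogates are not rejected. */
static long next_cp(struct src *s)
{
    unsigned char b = s->buf[s->pos];
    int extra;
    long cp;

    if (b == 0) return -1;
    if (b < 0x80) { cp = b; extra = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else return -1;
    s->pos++;
    while (extra-- > 0) {
        unsigned char c = s->buf[s->pos];
        if ((c & 0xC0) != 0x80) return -1;   /* not a continuation byte */
        cp = (cp << 6) | (c & 0x3F);
        s->pos++;
    }
    return cp;
}

/* Stand-in blank test (an assumption): space, tab, form feed, NBSP. */
static int is_blank_cp(long cp)
{
    return cp == ' ' || cp == '\t' || cp == '\f' || cp == 0xA0;
}

/* Skip blanks, leaving the cursor on the first non-blank character. */
static void skip_space(struct src *s)
{
    for (;;) {
        size_t mark = s->pos;        /* where this character began */
        long cp = next_cp(s);
        if (cp < 0 || !is_blank_cp(cp)) {
            s->pos = mark;           /* push back: not whitespace */
            return;
        }
    }
}
```

The point of the mark is that several bytes may have been consumed before 
we know the character is not whitespace, so a single-byte pushback is not 
enough.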

There's a problem with 'whitespace':  \n is whitespace but is a command 
terminator -- so what should Unicode line and para separators map to?  I 
decided to use only blanks (in the sense of iswblank) as whitespace, and 
further only to use the table that Ei-ji Nakama provided for us in 
rlocale_data.h (adding NBSP).  So the new rules are that 'whitespace' in 
parsing is

space, \t, \f (not \v, for historical reasons, I presume)
NBSP in 8-bit Windows locales
Unicode blanks in UTF-8 locales on internally Unicode machines (and I 
doubt UTF-8 locales exist anywhere else).
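
As a rough sketch of the rules above for the UTF-8 case: the function 
below treats a code point as parser whitespace if it is a blank, and 
deliberately excludes \v, \n and the Unicode line and paragraph 
separators.  The table is written out by hand for the example (tab, plus 
the space separators as I understand them) and is an assumption, not the 
contents of rlocale_data.h; is_parse_blank is a made-up name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table of blank code points: tab plus the Unicode space
   separators.  Not the actual table in rlocale_data.h. */
static const unsigned int blank_table[] = {
    0x0009, 0x0020, 0x00A0, 0x1680,
    0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
    0x2007, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F, 0x3000
};

/* True if the code point should be skipped as whitespace by the parser. */
static bool is_parse_blank(unsigned int cp)
{
    size_t i;

    if (cp == '\f')
        return true;    /* form feed has always been skipped */
    for (i = 0; i < sizeof blank_table / sizeof blank_table[0]; i++)
        if (blank_table[i] == cp)
            return true;
    /* \v, \n, U+2028 and U+2029 are not in the table, so they end up here */
    return false;
}
```

A linear scan is enough for a table this small; a real table with many 
ranges would more likely be searched by bisection.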

> The locale issues are clearly important and
> can't easily be swept under the rug.
>
>> The only one of these 'spaces' I have much sympathy for is NBSP (which is
>> also fairly easy to generate in CP1252).  It would be easy to add that.
>> Otherwise I am not convinced it is worth the work (and added uncertainty).
>
> That's reasonable.  Another related request would be treating curly
> quotes (single and double) the same way as normal quotes, but I'd
> imagine similar caveats would apply there.

And a bit more: they are directional so you would (I presume) only want to 
match \u2019 to \u2018 etc.  That would be a lot of extra work.
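
To make the directional point concrete, a hypothetical helper (nothing 
like this exists in the parser, and the name closing_quote_for is made up) 
could map each opening quote to the one closing quote allowed to end the 
string.  The curly quote code points used are U+2018/U+2019 (single) and 
U+201C/U+201D (double).

```c
/* Return the closing quote that must pair with `open`, or 0 if `open`
   is not an opening quote.  Code points are plain integers. */
static unsigned int closing_quote_for(unsigned int open)
{
    switch (open) {
    case 0x0022: return 0x0022;   /* " closes " */
    case 0x0027: return 0x0027;   /* ' closes ' */
    case 0x2018: return 0x2019;   /* left single -> right single */
    case 0x201C: return 0x201D;   /* left double -> right double */
    default:     return 0;        /* includes the right-hand curly quotes */
    }
}
```

The extra work is everything around this: the tokenizer would have to 
remember which quote opened the string and decode multibyte characters 
while scanning for the matching one.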

>  You could also imagine using unicode arrows in place of <- and ->, but 
> that's probably heading too far down the apl/fortress road!

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595