[Rd] Unicode whitespace
Prof Brian Ripley
ripley at stats.ox.ac.uk
Mon Jan 7 19:03:10 CET 2008
On Sat, 5 Jan 2008, hadley wickham wrote:
> On Jan 5, 2008 1:40 AM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>> I presume you want this only in a UTF-8 locale?
>
> Yes, although my assumption is that this will become an increasingly
> common locale as time goes by.
Probably, except on Windows. UTF-8 support in commercial Unixen is often
poor (i.e. it is nominally there but does not work well). The need for
other non-8-bit encodings is diminishing, e.g. the various shift encodings
for Japanese will likely die out, but only over a long time.
>> Currently this is done by
>>
>> static int SkipSpace(void)
>> {
>>     int c;
>>     while ((c = xxgetc()) == ' ' || c == '\t' || c == '\f')
>>         /* nothing */;
>>     return c;
>> }
>>
>> in gram.c. We could make use of isspace and its wide-char equivalent
>> iswspace. However:
>>
>>
>> - there is the perennial debate over whether \v is whitespace.
>>
>> R-lang says
>>
>> Although not strictly tokens, stretches of whitespace characters
>> (spaces and tabs) serve to delimit tokens in case of ambiguity,
>>
>> which suggests it has a minimal view of whitespace.
>>
>>
>> - iswspace is often rather unreliable. E.g. glibc says
>>
>> The wide character class "space" always contains at least the space
>> character and the control characters '\f', '\n', '\r', '\t', '\v'.
>>
>> and I think it usually does not contain other forms of spaces. More
>> seriously
>>
>> The behaviour of iswspace() depends on the LC_CTYPE category of the
>> current locale.
>>
>> so what is a space will depend on the encoding (hence my question about
>> UTF-8). And Ei-ji Nakama has replaced iswspace on MacOS X, because
>> apparently it is wrongly implemented.
>>
>>
>> - it would complicate the parser as look-ahead would be needed (you would
>> need to read the next mbcs, check whether it is whitespace and push it
>> back if needed). We do that elsewhere, though.
>
> I had assumed the parser would be unicode/mb aware already and so
> would be an easy fix.
It's not, because it has to work on non-Unicode platforms (e.g. Windows 9x
until R 2.7.0), and even platforms in which wchar_t is not Unicode. (More
precisely, we need to avoid making assumptions we can't verify on such
platforms.)
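
As an illustration of the look-ahead described in the quoted text (read the next multibyte character, test it, push it back if it is not whitespace), here is a minimal sketch. It is not the code in gram.c: the Reader type, next_utf8, demo_is_blank and skip_blanks are hypothetical names, the input is assumed to be a UTF-8 buffer already in memory, and the decoder is hand-written so that nothing depends on the platform's wchar_t or locale tables.

```c
#include <stddef.h>

/* Hypothetical in-memory reader; the real parser reads through xxgetc(). */
typedef struct {
    const unsigned char *buf;
    size_t pos, len;
} Reader;

/* Decode one UTF-8 sequence starting at r->pos WITHOUT advancing r->pos.
   Stores the code point in *cp and returns the number of bytes it occupies,
   or 0 at end of input or on a truncated/malformed sequence.
   (Sketch only: overlong forms and surrogates are not rejected.) */
static size_t next_utf8(const Reader *r, unsigned int *cp)
{
    if (r->pos >= r->len) return 0;
    const unsigned char *p = r->buf + r->pos;
    size_t avail = r->len - r->pos, n;

    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; n = 4; }
    else return 0;

    if (n > avail) return 0;
    for (size_t i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return n;
}

/* Assumed, deliberately tiny blank set for the demonstration only. */
static int demo_is_blank(unsigned int cp)
{
    return cp == ' ' || cp == '\t' || cp == '\f' ||
           cp == 0x00A0 /* NBSP */ || cp == 0x2003 /* EM SPACE */;
}

/* Peek at the next character and consume it only if it is blank.
   Leaving r->pos untouched for a non-blank is the "pushback".
   Returns the number of characters (not bytes) skipped. */
size_t skip_blanks(Reader *r)
{
    unsigned int cp;
    size_t n, skipped = 0;
    while ((n = next_utf8(r, &cp)) > 0 && demo_is_blank(cp)) {
        r->pos += n;
        skipped++;
    }
    return skipped;
}
```

Because the peek never moves the read position, no separate pushback buffer is needed here; a parser that reads from a stream one byte at a time would instead have to un-read every byte of the rejected character.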
There's a problem with 'whitespace': \n is whitespace but is a command
terminator -- so what should Unicode line and para separators map to? I
decided to use only blanks (in the sense of iswblank) as whitespace, and
further only to use the table that Ei-ji Nakama provided for us in
rlocale_data.h (adding NBSP). So the new rules are that 'whitespace' in
parsing is
- space, \t, \f (not \v, for historical reasons, I presume)
- NBSP in 8-bit Windows locales
- Unicode blanks in UTF-8 locales on internally Unicode machines (and I
  doubt UTF-8 locales exist anywhere else).
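
The rules listed above, together with the plain space that SkipSpace already skips, can be sketched as a single predicate. This is only an illustration under stated assumptions: is_parse_blank and its two flag parameters are hypothetical, and the Unicode code points below are the Zs (space separator) characters, which may not match the table in rlocale_data.h entry for entry.

```c
/* Hypothetical predicate, not R's actual code.
   cp              : character as a code point (or byte value in an 8-bit locale)
   utf8_locale     : non-zero in a UTF-8 locale on an internally Unicode machine
   win_8bit_locale : non-zero in an 8-bit Windows locale such as CP1252
   Returns 1 if the character would be skipped as whitespace when parsing. */
int is_parse_blank(unsigned int cp, int utf8_locale, int win_8bit_locale)
{
    /* Always blank; note that \v and \n are deliberately excluded. */
    if (cp == ' ' || cp == '\t' || cp == '\f')
        return 1;

    /* NBSP: byte 0xA0 in CP1252, U+00A0 in Unicode. */
    if (cp == 0x00A0)
        return utf8_locale || win_8bit_locale;

    if (!utf8_locale)
        return 0;

    /* Assumed table: Unicode space separators. Line and paragraph
       separators (U+2028, U+2029) are not blanks and are left out. */
    switch (cp) {
    case 0x1680:   /* OGHAM SPACE MARK */
    case 0x202F:   /* NARROW NO-BREAK SPACE */
    case 0x205F:   /* MEDIUM MATHEMATICAL SPACE */
    case 0x3000:   /* IDEOGRAPHIC SPACE */
        return 1;
    default:
        return cp >= 0x2000 && cp <= 0x200A;   /* EN QUAD .. HAIR SPACE */
    }
}
```

Passing the locale as explicit flags keeps the sketch independent of the C library's iswblank, whose answer depends on LC_CTYPE.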
> The locale issues are clearly important and
> can't easily be swept under the rug.
>
>> The only one of these 'spaces' I have much sympathy for is NBSP (which is
>> also fairly easy to generate in CP1252). It would be easy to add that.
>> Otherwise I am not convinced it is worth the work (and added uncertainty).
>
> That's reasonable. Another related request would be treating curly
> quotes (single and double) the same way as normal quotes, but I'd
> imagine similar caveats would apply there.
And a bit more: they are directional so you would (I presume) only want to
match \u2019 to \u2018 etc. That would be a lot of extra work.
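
To make the directionality point concrete: in Unicode the curly single quotes are U+2018 (left) and U+2019 (right), and the curly double quotes are U+201C and U+201D. A tokenizer that accepted them would need to know which closer pairs with which opener. The function below is a hypothetical sketch of that lookup only; closing_quote_for is an invented name and nothing like it is claimed to exist in R's parser.

```c
/* Hypothetical lookup: given the character that opened a string, return
   the only character allowed to close it, or 0 if `open` cannot open a
   string. Straight quotes close themselves; curly quotes are directional,
   so a right-hand quote is never accepted as an opener. */
unsigned int closing_quote_for(unsigned int open)
{
    switch (open) {
    case '"':    return '"';
    case '\'':   return '\'';
    case 0x2018: return 0x2019;   /* LEFT -> RIGHT SINGLE QUOTATION MARK */
    case 0x201C: return 0x201D;   /* LEFT -> RIGHT DOUBLE QUOTATION MARK */
    default:     return 0;
    }
}
```

The lookup itself is small; the extra work lies in threading the expected closer through string scanning, escapes and error messages.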
> You could also imagine using unicode arrows in place of <- and ->, but
> that's probably heading too far down the apl/fortress road!
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595