[R] Substring and strsplit
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Sep 1 08:22:33 CEST 2006
On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:
> If you are using 'only' English then
> str <- "dog"
> works perfectly and it is fast.
It does also work 'perfectly' and fast in 'Unicode' in all major European
and CJK languages (and many others): extending the iconv example
> strsplit(xx, NULL)
[[1]]
[1] "f" "a" "ç" "i" "l" "e"

> charToRaw(strsplit(xx, NULL)[[1]][3])
[1] c3 a7
on a UTF-8 system.
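The same check can be run without the iconv example at hand. This is only a minimal sketch: it assumes xx holds the six-character string "façile", and it builds that string with a \u escape so that R marks it as UTF-8. That construction is an assumption made for illustration, not necessarily how the help page defines xx.

```r
# Assumption: xx is "façile"; the \u escape makes R mark the string as UTF-8.
xx <- "fa\u00e7ile"

# With split = NULL, strsplit() returns one element per character, not per byte.
chars <- strsplit(xx, NULL)[[1]]
n_chars <- length(chars)               # number of characters
n_bytes <- nchar(xx, type = "bytes")   # number of bytes in the UTF-8 encoding

# The third character is c-cedilla, which occupies two bytes in UTF-8.
cedilla_bytes <- charToRaw(chars[3])
```

Here n_chars is 6 while n_bytes is 7, and cedilla_bytes holds the two bytes c3 a7, so the split follows characters rather than bytes.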
> But if you also dealing with Unicode character have a look at
That is a misleading reference (to your own opinion, and it is usual in
science to make clear what your source is when citing, especially if it is
yourself). Unicode itself has combining diacritical marks as separate
entries in the 'character code tables' at e.g.
http://www.unicode.org/charts/, so your understanding of 'character' seems
to differ from Unicode's.
You write about 'combined Unicode diacritics (accents)', which is
misleading, as these are not accents (and it is 'combining' not
'combined', a crucial difference). To quote Alan Wood:
   The _characters_ in this range are designed to be used in combination
   with alphanumeric _characters_, to produce a character+diacritic that
   is not present in any of the Unicode ranges. For example, ả
   to produce a lower case "a" with a hook above.
So they are used for very rare glyphs made up from two Unicode characters,
and R correctly views them as two characters. (Actually R relies on the
OS services to correctly identify characters, but that appears to have
happened on the example on the RWiki page.)
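A small sketch of what 'two characters' means in practice. It assumes the quoted example is the letter "a" followed by U+0309 (COMBINING HOOK ABOVE), and it uses the single code point U+1EA3 purely for comparison. The variable names are illustrative only.

```r
# Assumption: the quoted example is "a" followed by U+0309 (COMBINING HOOK ABOVE).
combining <- "a\u0309"
# For comparison only: U+1EA3 is a single code point for "a with hook above".
precomposed <- "\u1ea3"

# Count Unicode characters (code points) in each spelling.
n_combining <- nchar(combining, type = "chars")
n_precomposed <- nchar(precomposed, type = "chars")

# strsplit() separates the base letter from the combining mark.
parts <- strsplit(combining, NULL)[[1]]
mark_bytes <- charToRaw(parts[2])

# The two spellings may be displayed alike but are different strings.
same <- identical(combining, precomposed)
```

The combining form counts as two characters and splits into "a" plus the mark (bytes cc 89 in UTF-8), whereas the single code point counts as one, and the two strings do not compare as identical.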
You could have just thanked the R developers for ensuring that strsplit()
does work as documented even in Unicode locales.
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595