[R] Substring and strsplit

Fri Sep 1 08:22:33 CEST 2006

On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:

> If you are using 'only' English then
> 
> str <- "dog"
> strsplit(str,NULL)[[1]]
> 
> works perfectly and it is fast.

It also works 'perfectly' and fast in 'Unicode' for all major European 
and CJK languages (and many others).  Extending the iconv example:

> xx
[1] "façile"
> strsplit(xx, NULL)
[[1]]
[1] "f" "a" "ç" "i" "l" "e"
> charToRaw(strsplit(xx, NULL)[[1]][3])
[1] c3 a7

on a UTF-8 system.
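
The same check can be sketched in a self-contained form.  This assumes a 
UTF-8 locale; xx is rebuilt with a \u escape (the original value came from 
the iconv example, so the assignment is an assumption), and 'chars' is only 
an illustrative name.

```r
# Assumption: UTF-8 locale. U+00E7 is LATIN SMALL LETTER C WITH CEDILLA.
xx <- "fa\u00e7ile"

# Split into single characters, not bytes
chars <- strsplit(xx, NULL)[[1]]

n_chars <- nchar(xx, type = "chars")   # number of characters
n_bytes <- nchar(xx, type = "bytes")   # number of bytes in the UTF-8 encoding

print(chars)
print(c(chars = n_chars, bytes = n_bytes))
print(charToRaw(chars[3]))             # the two bytes of the cedilla character
```

The character and byte counts differ by one because the third character 
occupies two bytes in UTF-8, yet strsplit() still returns it as one element.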

> But if you also dealing with Unicode character have a look at

http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring

That is a misleading reference (to your own opinion, and it is usual in 
science to make clear what your source is when citing, especially if it is 
yourself).  Unicode itself has combining diacritical marks as separate 
entries in the 'character code tables' at e.g. 
http://www.unicode.org/charts/, so your understanding of 'character' seems 
to differ from Unicode's.

You write about 'combined Unicode diacritics (accents)', which is 
misleading, as these are not accents (and it is 'combining' not 
'combined', a crucial difference).  To quote Alan Wood 
(http://www.alanwood.net/unicode/combining_diacritical_marks.html)

  The _characters_ in this range are designed to be used in combination 
  with alphanumeric _characters_, to produce a character+diacritic that
  is not present in any of the Unicode ranges. For example, ả 
  to produce a lower case "a" with a hook above.

So they are used for very rare glyphs made up from two Unicode characters, 
and R correctly views them as two characters.  (Strictly, R relies on the 
OS services to identify characters correctly, and that appears to have 
worked for the example on the RWiki page.)
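
A small sketch of that behaviour, assuming a UTF-8 locale.  The string is 
built from "a" followed by U+0309 (COMBINING HOOK ABOVE), matching the 
example in the quotation; the variable names are illustrative only.

```r
# Assumption: UTF-8 locale. "a" followed by U+0309 COMBINING HOOK ABOVE.
decomposed <- "a\u0309"

# strsplit() returns one element per Unicode character, so the base
# letter and the combining mark come back separately.
parts <- strsplit(decomposed, NULL)[[1]]

print(length(parts))                      # number of characters
print(nchar(decomposed, type = "chars"))  # same count via nchar()
print(charToRaw(parts[2]))                # bytes of the combining mark alone
```

The two elements display as a single glyph when printed together, but each 
is a separate entry in the Unicode code tables.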

You could have just thanked the R developers for ensuring that strsplit() 
does work as documented even in Unicode locales.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595