[R] Substring and strsplit

Fri Sep 1 14:19:27 CEST 2006

On 1 Sep 2006, at 08:22, Prof Brian Ripley wrote:

> On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:
>
>> If you are using 'only' English then
>>
>> str <- "dog"
>> strsplit(str,NULL)[[1]]
>>
>> works perfectly and it is fast.
>
> It does also work 'perfectly' and fast in 'Unicode' in all major  
> European
> and CJK languages (and many others): extending the iconv example
>

YES, of course, you are right. R supports Unicode and other encodings  
very well. This is one of the reasons why I've chosen R for my purposes.

If you look at my first example at this Rwiki-site, it contains  
Russian, German, and two Chinese characters to illustrate that the R  
function strsplit can handle this perfectly.

When I wrote about 'English' and 'Unicode', my only intention was to put
it simply.
In my experience, if I write about 'combining diacritics' or
'combining vowels' etc., some people don't understand these topics.
If I write about 'Unicode', some at least have a vague idea of what I'm
writing about.
Of course, in a scientific context this is absolutely wrong and
misleading!

> http://www.unicode.org/charts/, so your understanding of  
> 'character' seems
> to differ from Unicode's.
>

Well, the term 'character' is highly ambiguous. A better term
would be 'glyph', to emphasise that I mean a representation of a grapheme.
But even the terms 'glyph', 'grapheme', 'phoneme', etc. are
also ambiguous.
Of course, my fault was that I didn't clarify my terminology
beforehand.

> You write about 'combined Unicode diacritics (accents)', which is
> misleading, as these are not accents (and it is 'combining' not
> 'combined', a crucial difference).

This was my grammatical mistake. Sorry. I have corrected it.

> To quote Alan Wood
> (http://www.alanwood.net/unicode/combining_diacritical_marks.html)

>   The _characters_ in this range are designed to be used in  
> combination
>   with alphanumeric _characters_, to produce a character+diacritic  
> that
>   is not present in any of the Unicode ranges. For example, ả
>   to produce a lower case "a" with a hook above.
>

Yes! This is right, but ...

To illustrate MY problem I use your French example with 'façile'.

>> xx
> [1] "façile"
>> strsplit(xx, NULL)
> [[1]]
> [1] "f" "a" "ç" "i" "l" "e"
>> charToRaw(strsplit(xx, NULL)[[1]][3])
> [1] c3 a7
>
> on a UTF-8 system.
>

Unicode offers two ways to write 'façile':
1) precomposed: "f" "a" "ç" (U+00E7) "i" "l" "e"
2) decomposed: "f" "a" "c" "combining cedilla (U+0327)" "i" "l" "e"

Now I apply the R function strsplit to both forms (a holds form 1, b
holds form 2) and get two different results.

 > a <- "façile"
 > strsplit(a,NULL)
[[1]]
[1] "f" "a" "ç" "i" "l" "e"

 > b <- "façile"
 > strsplit(b,NULL)
[[1]]
[1] "f" "a" "c" "̧"   "i" "l" "e"

On the computer screen you don't see any difference between 1) and 2) (if
your system supports this rendering).
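
Because the two spellings look alike on screen, a minimal sketch like the
following can make the difference visible. It builds both strings from
explicit \u escapes and prints their code points. The variable names
cp_a and cp_b are only illustrative, and a UTF-8 capable R session is
assumed.

```r
# Build both spellings from explicit escapes so they cannot be confused.
a <- "fa\u00e7ile"    # form 1: precomposed c with cedilla (U+00E7)
b <- "fac\u0327ile"   # form 2: plain "c" followed by combining cedilla (U+0327)

# Code points of each string; the two vectors differ in length.
cp_a <- utf8ToInt(a)
cp_b <- utf8ToInt(b)
print(sprintf("U+%04X", cp_a))
print(sprintf("U+%04X", cp_b))

# strsplit() splits per code point, so the element counts differ as well.
print(length(strsplit(a, NULL)[[1]]))
print(length(strsplit(b, NULL)[[1]]))

# The strings are not identical, even though they may render the same.
print(identical(a, b))
```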

The questions are always: 'What do I want to split?' 'What is a
character/glyph in my context?'

I added another nice example to the wiki site:
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring

> So they are used for very rare glyphs made up from two Unicode  
> characters,
> and R correctly views them as two characters.

R views them correctly if a character is defined as a single code point.
On the other hand, my research involves hundreds of languages
that use these 'rare' glyphs!
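
If a 'character' should instead mean a base code point together with the
combining marks that follow it, one possible approach is sketched below.
This is only an illustration, not the code from the wiki page: the
function name split_with_marks is hypothetical, and it assumes that only
the Combining Diacritical Marks block (U+0300 to U+036F) matters. Other
combining ranges and the full Unicode grapheme cluster rules are not
covered.

```r
# Hypothetical helper: split a single string so that each combining mark
# stays attached to the code point before it.
# Assumption: only U+0300..U+036F are treated as combining marks.
split_with_marks <- function(x) {
  cp <- utf8ToInt(x)
  is_mark <- cp >= 0x0300 & cp <= 0x036F
  # Every non-mark starts a new group; a mark joins the group before it.
  group <- cumsum(!is_mark)
  unname(vapply(split(cp, group), intToUtf8, character(1)))
}

b <- "fac\u0327ile"                   # decomposed spelling
print(length(strsplit(b, NULL)[[1]])) # number of code points
print(length(split_with_marks(b)))    # number of base+mark units
```

With this definition the decomposed spelling yields six units, the same
count that strsplit gives for the precomposed spelling.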

To summarise:
- My intention was only to keep it short and simple.
- It was NOT my intention to state that the R function strsplit
doesn't support Unicode.
   The R developers did and are still doing a great job! Thank you so much!
- Last but not least, SORRY for my incompleteness!

With regards,

Hans