[R] Substring and strsplit
Hans-Joerg Bibiko
bibiko at eva.mpg.de
Fri Sep 1 14:19:27 CEST 2006
On 1 Sep 2006, at 08:22, Prof Brian Ripley wrote:
> On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:
>
>> If you are using 'only' English then
>>
>> str <- "dog"
>> strsplit(str,NULL)[[1]]
>>
>> works perfectly and it is fast.
>
> It does also work 'perfectly' and fast in 'Unicode' in all major
> European
> and CJK languages (and many others): extending the iconv example
>
YES, of course, you are right. R supports Unicode and other encodings
very well. This is one of the reasons why I've chosen R for my purposes.
If you look at my first example at this Rwiki-site, it contains
Russian, German, and two Chinese characters to illustrate that the R
function strsplit can handle this perfectly.
When I wrote about 'English' and 'Unicode', my only intention was to
keep things simple.
In my experience, when I write about 'combining diacritics' or
'combining vowels' etc., some people don't understand these topics.
When I write about 'Unicode', some at least have a vague idea of what
I mean.
Of course, in a scientific context this is absolutely wrong and
misleading!
> http://www.unicode.org/charts/, so your understanding of
> 'character' seems
> to differ from Unicode's.
>
Well, the term 'character' is highly ambiguous. A better term
would be 'glyph', to emphasise that I mean a representation of a grapheme.
But still, even the terms 'glyph', 'grapheme', 'phoneme', etc. are
also ambiguous.
Of course, my fault was that I didn't clarify my terminology
beforehand.
> You write about 'combined Unicode diacritics (accents)', which is
> misleading, as these are not accents (and it is 'combining' not
> 'combined', a crucial difference).
This was my grammatical fault. Sorry. I corrected this.
> To quote Alan Wood
> (http://www.alanwood.net/unicode/combining_diacritical_marks.html)
> The _characters_ in this range are designed to be used in
> combination
> with alphanumeric _characters_, to produce a character+diacritic
> that
> is not present in any of the Unicode ranges. For example, ả
> to produce a lower case "a" with a hook above.
>
Yes! This is right, but ...
To illustrate MY problem I use your French example with 'façile'.
>> xx
> [1] "façile"
>> strsplit(xx, NULL)
> [[1]]
> [1] "f" "a" "ç" "i" "l" "e"
>> charToRaw(strsplit(xx, NULL)[[1]][3])
> [1] c3 a7
>
> on a UTF-8 system.
>
Unicode offers two ways to write 'façile':
1) "f" "a" "ç" "i" "l" "e"
2) "f" "a" "c" "combining cedilla (\u0327)" "i" "l" "e"
If I now apply the R function strsplit, I get two different results.
> a <- "façile"
> strsplit(a,NULL)
[[1]]
[1] "f" "a" "ç" "i" "l" "e"
> b <- "façile"
> strsplit(b,NULL)
[[1]]
[1] "f" "a" "c" "̧" "i" "l" "e"
On the computer screen you don't see any difference between 1) and 2)
(if your system supports this rendering).
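
Because the two spellings look identical on screen, it may help to write
them with explicit \u escapes and inspect the code points. The following
is only a minimal sketch; the variable names cp_a and cp_b are
illustrative and not part of the original example.

```r
# Precomposed form: U+00E7 (c with cedilla) is a single code point.
a <- "fa\u00e7ile"
# Decomposed form: plain "c" followed by U+0327 (combining cedilla).
b <- "fac\u0327ile"

# Number of code points in each string.
nchar(a, type = "chars")   # 6
nchar(b, type = "chars")   # 7

# utf8ToInt() returns the code points as integers;
# sprintf() formats them in the usual U+XXXX notation.
cp_a <- utf8ToInt(a)
cp_b <- utf8ToInt(b)
sprintf("U+%04X", cp_a)
sprintf("U+%04X", cp_b)

# The strings render alike but are not the same sequence of code points.
identical(a, b)            # FALSE
```

The extra element in cp_b is the combining cedilla, which is the
element strsplit returns on its own.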
The questions are always: 'What do I want to split?' 'What is a
character/glyph in my context?'
Another nice example, which I added to the wiki site:
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring
> So they are used for very rare glyphs made up from two Unicode
> characters,
> and R correctly views them as two characters.
R views them correctly if a character is defined as a single code point.
On the other hand, in my research I work with hundreds of languages
that use these 'rare' glyphs!
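
If the unit of interest is a base character together with its combining
marks rather than a single code point, one possible approach is a
regular expression based on the Unicode "Mark" property. This is a
sketch under assumptions: split_with_marks is a hypothetical helper
name, it assumes an R build whose PCRE supports Unicode properties
(\p{M}), and it is a simplification rather than full Unicode grapheme
segmentation (a mark at the very start of a string is dropped, for
instance).

```r
# Hypothetical helper: split a string into units consisting of one
# non-mark code point plus any combining marks that follow it.
split_with_marks <- function(x) {
  # \P{M}   one code point that is not a combining mark
  # \p{M}*  zero or more combining marks attached to it
  regmatches(x, gregexpr("\\P{M}\\p{M}*", x, perl = TRUE))[[1]]
}

a <- "fa\u00e7ile"    # precomposed c with cedilla
b <- "fac\u0327ile"   # "c" + combining cedilla

length(strsplit(b, NULL)[[1]])   # 7 code points
length(split_with_marks(b))      # 6 units
length(split_with_marks(a))      # 6 units
```

With this helper both spellings yield six units, although the third
unit of b still consists of two code points.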
To summarise:
- My intention was only to keep it short and simple.
- It was NOT my intention to state that the R function strsplit
doesn't support Unicode.
The R developers did, and are still doing, a great job! Thank you so much!
- Last but not least, SORRY for my incompleteness!
With regards,
Hans