[Rd] A question about the API mkchar()

Mon Nov 3 22:07:40 CET 2008

1) 2.7.0 is rather old, and you were asked to update your R before 
posting.

2) No file was attached.  But how to handle encodings is covered in the
'R Internals' manual.  This is a tricky, advanced topic in C-level R
programming.  It is your responsibility, not ours, to get yourself up to
the level of understanding required.  Sorry, but it is not reasonable to
expect a personal tutorial in this forum.

On Mon, 3 Nov 2008, 王永智 wrote:

> Hi, Simon
>
>
> Thanks for your detailed explanation of mkCharCE.
>
> Concerning the UTF-8 encoding, mkCharCE(X, CE_UTF8) is the correct way to pass the Unicode string.
>
> However, I met another question:
>
> My program is intended to read the contents of a text file, r.tmp, which is encoded in UTF-8. After reading it, every line is sent to another C function, ext_show(const char** text, int* length, int* errLevel), for further handling. Attached is the text file "r.tmp".
>
> I tried to use the following R code to accomplish the process:
>
> checkoutput <- scan("r.tmp",
>                     what = 'character',
>                     blank.lines.skip = FALSE,
>                     sep = '\n',
>                     skip = 0,
>                     quiet = TRUE,
>                     encoding = "unknown")
> lines <- length(checkoutput)
> print(checkoutput)
> for (i in 1:lines)
> {
>   inputstring = checkoutput[i]
>   out <- .C('ext_show', as.character(inputstring),
>             as.integer(nchar(inputstring)),
>             as.integer(err),
>             PACKAGE="mypkg")
> }
>
> I don't know why, but if I type the commands in the R GUI, the Japanese characters are shown correctly. Also, if I sink inputstring into another text file, the content of that file is written correctly.
>
> But if I use the above code to pass inputstring to the function ext_show(), the string has been changed by the time it arrives in ext_show().
>
> My current environment is Windows XP, R 2.7.0, and the R encoding is "UTF-8":
>
>> getOption("encoding")
> [1] "UTF-8"
>
>> Sys.getlocale()
> [1] "LC_COLLATE=Chinese_People's Republic of China.936;LC_CTYPE=Chinese_People's Republic of China.936;LC_MONETARY=Chinese_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese_People's Republic of China.936"
>
>
> Since the current encoding is UTF-8, I don't think the Chinese locale will prevent the correct result.
>
> The ext_show is defined as below:
>
>    void ext_show(
>        const char** text,
>        int* length,
>        int* errLevel)
>    {
>        *errLevel = LoadLib();
>        int real_length = strlen(*text);
>        if( LOAD_SUCCESS == *errLevel )
>            *errLevel = ShowInScreen(*text, real_length);
>    }
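
One thing the two listings above make visible: the R code passes nchar(inputstring) as length, and nchar() counts characters by default, whereas ext_show() recomputes the length with strlen(), which counts bytes. For UTF-8 text containing non-ASCII characters the two numbers differ. A minimal sketch of that difference in plain C follows (utf8_codepoints is a hypothetical helper, not part of R or of the poster's library, and it assumes well-formed UTF-8 input):

```c
#include <stddef.h>
#include <string.h>

/* Count the code points in a NUL-terminated UTF-8 string.
 * Every byte that is NOT a continuation byte (bit pattern 10xxxxxx)
 * starts a new code point.  Assumes the input is well-formed UTF-8;
 * this helper is hypothetical and only meant as an illustration. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the three bytes 0xe3 0x81 0xb5 followed by "abc", strlen() reports 6 while the helper reports 4. This only illustrates the byte/character distinction; it is not a diagnosis of why the string arrives changed in ext_show().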
>
> I am new to R programming and not very familiar with encoding handling in R. I suspect it may be necessary to convert the encoding of inputstring before passing it to the function ext_show().
>
> Many Thanks!
>
> Joey
>
>
> On 2008-10-28, "Simon Urbanek" <simon.urbanek at r-project.org> wrote:
>> On Oct 28, 2008, at 6:26 , Fán Lóng wrote:
>>
>>> Hi guys,
>>>
>>
>> Hey guy :)
>>
>>
>>> I've got a question about the API mkchar(). I have met some
>>> difficulty in passing a UTF-8 string to mkchar() in R-2.7.0.
>>>
>>
>> There is no mkchar() in R. Did you perhaps mean mkChar()?
>>
>>
>>> I was intending to pass a UTF-8 string str_jan (some Japanese
>>> characters such as ふ, whose utf-8 code is E381B5
>>
>> There is no such "UTF-8" code. I'm not sure if you meant Unicode, but
>> that would be \u3075 (Hiragana hu) for that character. The UTF-8
>> encoding of that character is a three-byte sequence 0xe3 0x81 0xb5 if
>> that's what you meant.
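
To make the relation between the code point U+3075 and the byte sequence 0xe3 0x81 0xb5 concrete, here is a minimal sketch of the UTF-8 encoding rule in plain C. The function name utf8_encode is hypothetical; it is not part of R's API and is shown only to illustrate the arithmetic:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is not a valid scalar value (a surrogate, or above U+10FFFF).
 * Hypothetical helper for illustration only. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x3075, buf) fills buf with 0xe3, 0x81, 0xb5 and returns 3, which is the three-byte sequence described above.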
>>
>>
>>> ) to R API SEXP
>>> mkChar(const char *name) , we only need to create the SEXP using the
>>> string that we parsed.
>>>
>>>
>>>
>>> Unfortunately, I found that when passing the variable str_jan, R will
>>> automatically convert str_jan according to the current locale
>>> setting,
>>
>> That is not true - it will be kept as-is regardless of the encoding.
>> Note that mkChar(x) is equivalent to mkCharCE(x, CE_NATIVE); No
>> conversion takes place when the string is created, but you have told R
>> that it is in the native encoding. If that is not true (which in your
>> case it probably isn't), all bets are off since you're lying to R ;).
>>
>>
>>> so only in the English locale could the function work correctly,
>>> under other locales, such as Japanese or Chinese, the string will be
>>> converted incorrectly.
>>
>> That is clearly nonsense, since the encoding has nothing to do with
>> the locale language itself (Japanese, Chinese, ..). We are talking
>> about the encoding (note that both English and Japanese locales can
>> use UTF-8 encoding, but don't have to). I think you'll need to get the
>> concepts right here - for each string you must define the encoding in
>> order to be able to reproduce the Unicode sequence that the string
>> represents. At this point it has nothing to do with the language.
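
The point that the same bytes only become a definite Unicode sequence once an encoding is attached can be sketched in plain C. This is an illustration under stated assumptions: encoding_t, DECODE_ERROR and decode_bytes are hypothetical names (loosely in the spirit of cetype_t, but not R's API), Latin-1 is used as the "other" encoding only because it is trivial to decode, and the UTF-8 branch is minimal (it does not reject overlong forms or surrogates):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding tag, for illustration only. */
typedef enum { ENC_LATIN1, ENC_UTF8 } encoding_t;

#define DECODE_ERROR ((size_t)-1)

/* Decode len bytes into Unicode code points according to enc.
 * Returns the number of code points stored in out (at most max),
 * or DECODE_ERROR if the bytes are not valid under ENC_UTF8. */
size_t decode_bytes(const unsigned char *b, size_t len, encoding_t enc,
                    uint32_t *out, size_t max)
{
    size_t n = 0, i = 0;
    while (i < len && n < max) {
        uint32_t cp;
        size_t extra;                      /* continuation bytes expected */
        unsigned char c = b[i++];

        if (enc == ENC_LATIN1 || c < 0x80) {
            out[n++] = c;                  /* Latin-1 maps 1:1 to U+0000..U+00FF */
            continue;
        }
        if ((c & 0xE0) == 0xC0)      { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else return DECODE_ERROR;          /* stray continuation or bad lead byte */

        if (len - i < extra)
            return DECODE_ERROR;           /* sequence is truncated */
        while (extra-- > 0) {
            c = b[i++];
            if ((c & 0xC0) != 0x80)
                return DECODE_ERROR;       /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        out[n++] = cp;
    }
    return n;
}
```

With the bytes 0xe3 0x81 0xb5, ENC_UTF8 yields the single code point U+3075, while ENC_LATIN1 yields three code points, U+00E3 U+0081 U+00B5. The bytes are identical in both cases; only the declared encoding differs.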
>>
>>
>>> As a matter of fact, that UTF-8 code is already a Unicode string, and
>>> doesn't need to be converted at all.
>>>
>>> I also tried to use SEXP Rf_mkCharCE(const char *, cetype_t);,
>>> passing CE_UTF8 as the cetype_t argument, but the result is
>>> worse. It returned the result as UCS code, a kind of Unicode on the
>>> Windows platform.
>>>
>>
>> Well, that's exactly what you want, isn't it? The string is correctly
>> flagged as UTF-8 so R is finally able to find out what exactly is
>> represented by that string. However, your locale apparently doesn't
>> support such characters so it cannot be displayed. If you use a locale
>> that supports it, it works just fine, for example if you use a locale
>> with SJIS encoding R will still know how to convert it from UTF-8 to
>> SJIS *for display*. The actual string is not touched.
>>
>> Here is a small piece of code that shows you the difference between
>> native encoding and UTF8-strings:
>>
>> #include <R.h>
>> #include <Rinternals.h>
>>
>> SEXP me() {
>>   const char c[] = { 0xe3, 0x81, 0xb5, 0 };
>>   SEXP a = allocVector(STRSXP, 2);
>>   PROTECT(a);
>>   SET_STRING_ELT(a, 0, mkCharCE(c, CE_NATIVE));
>>   SET_STRING_ELT(a, 1, mkCharCE(c, CE_UTF8));
>>   UNPROTECT(1);
>>   return a;
>> }
>>
>> In a UTF-8 locale it doesn't matter:
>>
>> ginaz:sandbox$ LANG=ja_JP.UTF-8 R
>>> .Call("me")
>> [1] "ふ" "ふ"
>>
>> But in any other, let's say SJIS, it does:
>>
>> ginaz:sandbox$ LANG=ja_JP.SJIS R
>>> .Call("me")
>> [1] "縺ｵ" "ふ"
>>
>> Note that the first string is wrong, because we have supplied UTF-8
>> encoding but the current one is SJIS. The second one is correct since
>> we told R that it's UTF-8 encoded.
>>
>> Finally, if the character cannot be displayed in the given encoding:
>>
>> ginaz:sandbox$ LANG=en_US.US-ASCII R
>>> .Call("me")
>> [1] "\343\201\265" "
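
The first element of the ASCII-locale output quoted above is the same three UTF-8 bytes (0xe3 0x81 0xb5) written as octal escapes. A minimal sketch in plain C that produces that notation follows; escape_non_ascii is a hypothetical helper written for this illustration, and it is not a description of how R itself implements printing:

```c
#include <stddef.h>
#include <stdio.h>

/* Copy s into out, replacing every non-ASCII byte with a backslash
 * followed by three octal digits.  Returns the number of characters
 * written (excluding the terminating NUL), or -1 if out is too small.
 * Hypothetical helper for illustration only. */
int escape_non_ascii(const char *s, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return -1;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        size_t need = (c < 0x80) ? 1 : 4;   /* "\ooo" takes 4 characters */

        if (n + need + 1 > outsize)         /* +1 for the final NUL */
            return -1;
        if (c < 0x80)
            out[n++] = (char)c;
        else
            n += (size_t)snprintf(out + n, outsize - n, "\\%03o", (unsigned)c);
    }
    out[n] = '\0';
    return (int)n;
}
```

Applied to the bytes 0xe3 0x81 0xb5, the helper produces the 12-character string \343\201\265, matching the escaped element in the output above.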

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595