[Rd] A question about the API mkchar()

Mon Nov 3 14:03:04 CET 2008

 Hi, Simon

Thanks for your detailed explanation of mkCharCE.

 Concerning the UTF-8 encoding, mkCharCE(X, CE_UTF8) is indeed the correct way to pass in a UTF-8 string.

 However, I have run into another problem:

 My program is meant to read the contents of a text file, r.tmp, which is encoded in UTF-8. After reading it, every line is sent to a C function, ext_show(const char** text, int* length, int* errLevel), for further handling. The text file "r.tmp" is attached.

 I tried to use the following R code to accomplish the process:

checkoutput <- scan("r.tmp",
                    what = 'character',
                    blank.lines.skip = FALSE,
                    sep = '\n',
                    skip = 0,
                    quiet = TRUE,
                    encoding = "unknown")
lines <- length(checkoutput)
print(checkoutput)
for (i in 1:lines)
{
    inputstring = checkoutput[i]
    out <- .C('ext_show', as.character(inputstring),
              as.integer(nchar(inputstring)),
              as.integer(err),
              PACKAGE = "mypkg")
}

I don't know why, but if I type the commands in the R GUI, the Japanese characters are displayed correctly. Also, if I sink inputstring into another text file, the content of that file is written correctly as well.

 But if I use the above code to pass inputstring to ext_show, the string that ext_show() receives has been changed.

My current environment is Windows XP with R 2.7.0, and R's encoding option is "UTF-8":

> getOption("encoding")
[1] "UTF-8"

> Sys.getlocale()
[1] "LC_COLLATE=Chinese_People's Republic of China.936;LC_CTYPE=Chinese_People's Republic of China.936;LC_MONETARY=Chinese_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese_People's Republic of China.936"

Since the current encoding is UTF-8, I don't think the Chinese locale should prevent a correct result.

 ext_show is defined as follows:

    void ext_show(
        const char** text,
        int* length,
        int* errLevel)
    {
        *errLevel = LoadLib();
        int real_length = strlen(*text);
        if( LOAD_SUCCESS == *errLevel )
            *errLevel = ShowInScreen(*text, real_length);
    }

 I am new to R programming and not very familiar with how R handles encodings. I wonder whether it is necessary to convert the encoding of inputstring before passing it to ext_show().
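
 One way to check whether the string really arrives changed might be to dump the bytes that ext_show() receives and compare them with the UTF-8 bytes in r.tmp. The following is only a minimal sketch: dump_hex is a hypothetical helper name, not part of R or of the code above, and it assumes a NUL-terminated string.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical debugging helper (not part of R or of ext_show):
 * writes the bytes of s into out as space-separated hex pairs.
 * Returns the number of bytes dumped; stops early if out is too small. */
size_t dump_hex(const char *s, char *out, size_t out_size)
{
    size_t n = 0;   /* bytes of s consumed so far */
    size_t pos = 0; /* current write position in out */

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    while (s[n] != '\0') {
        /* worst case per byte: separator + two hex digits + NUL */
        if (pos + 4 > out_size)
            break;
        /* go through unsigned char so bytes >= 0x80 are not sign-extended */
        snprintf(out + pos, out_size - pos, "%s%02x",
                 n ? " " : "", (unsigned char)s[n]);
        pos += n ? 3 : 2;
        n++;
    }
    return n;
}
```

 Called on *text inside ext_show(), this would produce "e3 81 b5" for the character ふ if the bytes are still UTF-8; any other byte sequence would suggest that the string was re-encoded somewhere before the call.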

 Many Thanks!

Joey

On 2008-10-28, "Simon Urbanek" <simon.urbanek at r-project.org> wrote:
>On Oct 28, 2008, at 6:26 , Fán Lóng wrote:
>
>> Hi guys,
>>
>
>Hey guy :)
>
>
>> I've got a question about the API mkchar(). I have met some  
>> difficulty in parsing utf-8 string to mkchar() in R-2.7.0.
>>
>
>There is no mkchar() in R. Did you perhaps mean mkChar()?
>
>
>> I was intending to parse an utf-8 string str_jan (some Japanese
>> characters such as ふ, whose utf-8 code is E381B5
>
>There is no such "UTF-8" code. I'm not sure if you meant Unicode, but  
>that would be \u3075 (Hiragana hu) for that character. The UTF-8  
>encoding of that character is a three-byte sequence 0xe3 0x81 0xb5 if  
>that's what you meant.
>
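
The relation between the code point \u3075 and the three bytes 0xe3 0x81 0xb5 can be reproduced with a few shifts and masks. This is a sketch only; utf8_encode is a hypothetical helper name and is not part of the R API.

```c
#include <stddef.h>

/* Hypothetical helper: encodes the Unicode code point cp as UTF-8 into
 * out, which must have room for at least 4 bytes. Returns the number of
 * bytes written, or 0 if cp is not a valid Unicode scalar value. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx + 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For cp = 0x3075 this writes the bytes 0xe3, 0x81, 0xb5 and returns 3, which is the three-byte sequence mentioned above.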
>
>> ) to R API SEXP
>> mkChar(const char *name) , we only need to create the SEXP using the
>> string that we parsed.
>>
>>
>>
>> Unfortunately, I found when parsing the variable str_jan, R will
>> automatically convert the str_jan according to the current locale
>> setting,
>
>That is not true - it will be kept as-is regardless of the encoding.  
>Note that mkChar(x) is equivalent to mkCharCE(x, CE_NATIVE); No  
>conversion takes place when the string is created, but you have told R  
>that it is in the native encoding. If that is not true (which in your  
>case it probably isn't), all bets are off since you're lying to R ;).
>
>
>> so only in the English locale could the function work correctly,  
>> under other locale, such as Japanese or Chinese, the string will be  
>> convert incorrectly.
>
>That is clearly nonsense since the encoding has nothing to do with  
>the locale language itself (Japanese, Chinese, ..). We are talking  
>about the encoding (note that both English and Japanese locales can  
>use UTF-8 encoding, but don't have to). I think you'll need to get the  
>concepts right here - for each string you must define the encoding in  
>order to be able to reproduce the unicode sequence that the string  
>represents. At this point it has nothing to do with the language.
>
>
>> As a matter of fact, those utf-8 code already is Unicode string, and  
>> don't need to be converted at all.
>>
>> I also tried to use the SEXP Rf_mkCharCE(const char *, cetype_t);,  
>> passing CE_UTF8 as the cetype_t argument, but the result is
>> worse. It returned the result as ucs code, a kind of Unicode under  
>> windows platform.
>>
>
>Well, that's exactly what you want, isn't it? The string is correctly  
>flagged as UTF-8 so R is finally able to find out what exactly is  
>represented by that string. However, your locale apparently doesn't  
>support such characters so it cannot be displayed. If you use a locale  
>that supports it, it works just fine, for example if you use a locale  
>with SJIS encoding R will still know how to convert it from UTF-8 to  
>SJIS *for display*. The actual string is not touched.
>
>Here is a small piece of code that shows you the difference between  
>native encoding and UTF8-strings:
>
>#include <R.h>
>#include <Rinternals.h>
>
>SEXP me() {
>   const char c[] = { 0xe3, 0x81, 0xb5, 0 };
>   SEXP a = allocVector(STRSXP, 2);
>   PROTECT(a);
>   SET_STRING_ELT(a, 0, mkCharCE(c, CE_NATIVE));
>   SET_STRING_ELT(a, 1, mkCharCE(c, CE_UTF8));
>   UNPROTECT(1);
>   return a;
>}
>
>In a UTF-8 locale it doesn't matter:
>
>ginaz:sandbox$ LANG=ja_JP.UTF-8 R
> > .Call("me")
>[1] "ふ" "ふ"
>
>But in any other, let's say SJIS, it does:
>
>ginaz:sandbox$ LANG=ja_JP.SJIS R
> > .Call("me")
>[1] "縺ｵ" "ふ"
>
>Note that the first string is wrong, because we have supplied UTF-8  
>encoding but the current one is SJIS. The second one is correct since  
>we told R that it's UTF-8 encoded.
>
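
Why the mislabelled string shows up as two characters can be illustrated by walking the same three bytes under both encodings. This is a rough sketch with hypothetical function names; it assumes well-formed input and uses the usual Shift_JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC), and it is not how R itself does the conversion.

```c
#include <stddef.h>

/* Counts characters when the bytes are read as UTF-8: every byte that is
 * not a continuation byte (10xxxxxx) starts a new character.
 * Assumes well-formed UTF-8. */
size_t count_chars_utf8(const unsigned char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if ((*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Counts characters when the same bytes are read as Shift_JIS:
 * a lead byte in 0x81-0x9F or 0xE0-0xFC starts a two-byte character;
 * anything else (ASCII, half-width katakana 0xA1-0xDF) is one byte. */
size_t count_chars_sjis(const unsigned char *s)
{
    size_t n = 0;
    while (*s) {
        int lead = (*s >= 0x81 && *s <= 0x9F) || (*s >= 0xE0 && *s <= 0xFC);
        /* do not step over the terminator if a lead byte ends the string */
        s += (lead && s[1] != 0) ? 2 : 1;
        n++;
    }
    return n;
}
```

For the bytes 0xe3 0x81 0xb5, the UTF-8 reading yields one character, while the Shift_JIS reading yields two: the pair 0xe3 0x81 followed by the single half-width katakana byte 0xb5, which matches the two-character garbage in the first element above.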
>Finally, if the character cannot be displayed in the given encoding:
>
>ginaz:sandbox$ LANG=en_US.US-ASCII R
> > .Call("me")
>[1] "\343\201\265" "