[Rd] A question about the API mkchar()
王永智
voyager1983 at 163.com
Mon Nov 3 14:03:04 CET 2008
Hi, Simon
Thanks for your detailed explanation of mkCharCE.
Concerning the UTF-8 encoding, mkCharCE(X, CE_UTF8) is indeed the correct way to handle the Unicode string.
However, I have run into another problem:
My program is intended to read the content of a text file, r.tmp, which is encoded in UTF-8. After reading it, every line will be sent to a C function, ext_show(const char** text, int* length, int* errLevel), for further handling. Attached is the text file "r.tmp".
I tried to use the following R code to accomplish the process:
checkoutput <- scan("r.tmp",
                    what = 'character',
                    blank.lines.skip = FALSE,
                    sep = '\n',
                    skip = 0,
                    quiet = TRUE,
                    encoding = "unknown")
lines <- length(checkoutput)
print(checkoutput)
# err is assumed to be defined earlier in the session
for (i in 1:lines)
{
    inputstring = checkoutput[i]
    out <- .C('ext_show', as.character(inputstring),
              as.integer(nchar(inputstring)),
              as.integer(err),
              PACKAGE = "mypkg")
}
I don't understand why: if I type the commands in the R GUI, the Japanese characters are displayed correctly. Also, if I sink() inputstring to another text file, the content of that file is written correctly.
But if I use the above code to pass inputstring to ext_show, the string that arrives inside ext_show() has been changed.
My current environment is Windows XP with R 2.7.0, and the R encoding option is "UTF-8":
> getOption("encoding")
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Chinese_People's Republic of China.936;LC_CTYPE=Chinese_People's Republic of China.936;LC_MONETARY=Chinese_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese_People's Republic of China.936"
Since the current encoding is UTF-8, I don't think the Chinese locale should prevent a correct result.
ext_show is defined as follows:
#include <string.h>  /* strlen */

void ext_show(
    const char** text,
    int* length,       /* nchar(inputstring) from the R side; unused here */
    int* errLevel)
{
    int real_length = (int) strlen(*text);  /* length in bytes */
    *errLevel = LoadLib();
    if( LOAD_SUCCESS == *errLevel )
        *errLevel = ShowInScreen(*text, real_length);
}
I am new to R programming and not very familiar with how R handles encodings. I suspect it may be necessary to convert the encoding of inputstring before passing it to ext_show().
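One way to narrow this down is to look at the raw bytes that actually arrive in ext_show() before anything is displayed. The following is only a minimal sketch in plain C; dump_hex is a hypothetical debugging helper, not part of R or of the library behind ext_show. For a UTF-8 "ふ" the expected dump is "e3 81 b5"; if other bytes show up, the string was re-encoded somewhere before the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical debugging helper (not part of R or of ext_show):
 * writes the bytes of `text` into `out` as lowercase hex pairs separated
 * by single spaces. Returns the number of input bytes that were dumped;
 * the dump stops early if `out` is too small. */
size_t dump_hex(const char *text, char *out, size_t out_size)
{
    size_t n, i, pos = 0;

    assert(text != NULL && out != NULL && out_size > 0);
    n = strlen(text);
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        /* worst case per byte: separator + two hex digits + NUL */
        if (pos + 4 > out_size)
            break;
        pos += (size_t) snprintf(out + pos, out_size - pos,
                                 i ? " %02x" : "%02x",
                                 (unsigned char) text[i]);
    }
    return i;
}
```

Inside ext_show() one could call dump_hex(*text, buf, sizeof buf) with a local char buf[256] and log buf before calling ShowInScreen().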
Many Thanks!
Joey
On 2008-10-28, "Simon Urbanek" <simon.urbanek at r-project.org> wrote:
>On Oct 28, 2008, at 6:26 , Fán Lóng wrote:
>
>> Hi guys,
>>
>
>Hey guy :)
>
>
>> I've got a question about the API mkchar(). I have met some
>> difficulty in passing a UTF-8 string to mkchar() in R-2.7.0.
>>
>
>There is no mkchar() in R. Did you perhaps mean mkChar()?
>
>
>> I was intending to pass a UTF-8 string str_jan (some Japanese
>> characters such as ふ, whose utf-8 code is E381B5
>
>There is no such "UTF-8" code. I'm not sure if you meant Unicode, but
>that would be \u3075 (Hiragana hu) for that character. The UTF-8
>encoding of that character is a three-byte sequence 0xe3 0x81 0xb5 if
>that's what you meant.
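For reference, the relationship between the code point U+3075 and the byte sequence 0xe3 0x81 0xb5 mentioned above can be reproduced with a small encoder. This is a minimal sketch in plain C; utf8_encode is a hypothetical helper name and is not part of the R API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper (hypothetical name): encode one Unicode code point
 * as UTF-8 into out[0..3]. Returns the number of bytes written, or 0 if
 * cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x3075, buf) fills buf with 0xe3, 0x81, 0xb5 and returns 3.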
>
>
>> ) to the R API function SEXP mkChar(const char *name); we only need
>> to create the SEXP from the string that we passed.
>>
>>
>>
>> Unfortunately, I found that when passing the variable str_jan, R will
>> automatically convert str_jan according to the current locale
>> setting,
>
>That is not true - it will be kept as-is regardless of the encoding.
>Note that mkChar(x) is equivalent to mkCharCE(x, CE_NATIVE). No
>conversion takes place when the string is created, but you have told R
>that it is in the native encoding. If that is not true (which in your
>case it probably isn't), all bets are off since you're lying to R ;).
>
>
>> so the function works correctly only in the English locale;
>> under other locales, such as Japanese or Chinese, the string will be
>> converted incorrectly.
>
>That is clearly nonsense since the encoding has nothing to do with
>the locale language itself (Japanese, Chinese, ..). We are talking
>about the encoding (note that both English and Japanese locales can
>use UTF-8 encoding, but don't have to). I think you'll need to get the
>concepts right here - for each string you must define the encoding in
>order to be able to reproduce the Unicode sequence that the string
>represents. At this point it has nothing to do with the language.
>
>
>> As a matter of fact, that utf-8 code already is a Unicode string, and
>> doesn't need to be converted at all.
>>
>> I also tried to use SEXP Rf_mkCharCE(const char *, cetype_t),
>> passing CE_UTF8 as the cetype_t argument, but the result is
>> worse. It returned the result as ucs code, a kind of Unicode on
>> the Windows platform.
>>
>
>Well, that's exactly what you want, isn't it? The string is correctly
>flagged as UTF-8 so R is finally able to find out what exactly is
>represented by that string. However, your locale apparently doesn't
>support such characters so it cannot be displayed. If you use a locale
>that supports it, it works just fine, for example if you use a locale
>with SJIS encoding R will still know how to convert it from UTF-8 to
>SJIS *for display*. The actual string is not touched.
>
>Here is a small piece of code that shows you the difference between
>native encoding and UTF8-strings:
>
>#include <R.h>
>#include <Rinternals.h>
>
>SEXP me() {
> const char c[] = { 0xe3, 0x81, 0xb5, 0 };
> SEXP a = allocVector(STRSXP, 2);
> PROTECT(a);
> SET_STRING_ELT(a, 0, mkCharCE(c, CE_NATIVE));
> SET_STRING_ELT(a, 1, mkCharCE(c, CE_UTF8));
> UNPROTECT(1);
> return a;
>}
>
>In a UTF-8 locale it doesn't matter:
>
>ginaz:sandbox$ LANG=ja_JP.UTF-8 R
> > .Call("me")
>[1] "ふ" "ふ"
>
>But in any other, let's say SJIS, it does:
>
>ginaz:sandbox$ LANG=ja_JP.SJIS R
> > .Call("me")
>[1] "縺オ" "ふ"
>
>Note that the first string is wrong, because we have supplied UTF-8
>encoding but the current one is SJIS. The second one is correct since
>we told R that it's UTF-8 encoded.
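A rough way to see why the first element comes out as two wrong characters: a Shift_JIS decoder treats 0xe3 as the lead byte of a double-byte character, so it consumes 0xe3 0x81 together, and then reads 0xb5 as a single-byte half-width katakana. The sketch below models only the byte classes. It assumes the lead-byte ranges 0x81-0x9F and 0xE0-0xFC and does not validate trail bytes, so it is not a real decoder; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified Shift_JIS byte class (assumption of this sketch):
 * bytes in 0x81-0x9F and 0xE0-0xFC start a double-byte character. */
int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count how many characters a Shift_JIS decoder would see in the
 * first n bytes of s. A lead byte at the very end counts as one. */
size_t sjis_count_chars(const unsigned char *s, size_t n)
{
    size_t count = 0, i = 0;

    assert(s != NULL);
    while (i < n) {
        i += (sjis_is_lead(s[i]) && i + 1 < n) ? 2 : 1;
        count++;
    }
    return count;
}
```

For the three UTF-8 bytes of "ふ" this yields 2, matching the two characters printed in the SJIS session above.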
>
>Finally, if the character cannot be displayed in the given encoding:
>
>ginaz:sandbox$ LANG=en_US.US-ASCII R
> > .Call("me")
>[1] "\343\201\265" "
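The first element printed above, "\343\201\265", is simply the same three bytes written as octal escapes: 0xe3 is 343, 0x81 is 201 and 0xb5 is 265 in octal. A minimal sketch of that kind of escaping in plain C follows; escape_octal is a hypothetical helper and is not how R implements its printing. As a simplification it leaves a literal backslash unescaped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy `text` to `out`, keeping printable ASCII
 * as-is and writing every other byte as a three-digit octal escape.
 * Returns the length of the escaped string; stops early if `out`
 * is too small. */
size_t escape_octal(const char *text, char *out, size_t out_size)
{
    size_t n, i, pos = 0;

    assert(text != NULL && out != NULL && out_size > 0);
    n = strlen(text);
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        unsigned char b = (unsigned char) text[i];
        if (b >= 0x20 && b < 0x7F) {
            if (pos + 2 > out_size)      /* one char + NUL */
                break;
            out[pos++] = (char) b;
            out[pos] = '\0';
        } else {
            if (pos + 5 > out_size)      /* backslash + 3 digits + NUL */
                break;
            pos += (size_t) snprintf(out + pos, out_size - pos,
                                     "\\%03o", (unsigned) b);
        }
    }
    return pos;
}
```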