[R] Converting two byte encoding to UTF-8
Duncan Murdoch
murdoch@dunc@n @end|ng |rom gm@||@com
Sat Mar 19 13:35:14 CET 2022
I have solved it!
First, each byte I have is offset by 0x80 from what it should contain:
my data has 0xb0 0xa1, but the actual encoding of 亜 is 0x30 0x21.
Subtracting 0x80 isn't enough on its own, though; the two bytes are
still treated as two characters:
> iconv(as.raw(result[[1]]$kanji - 0x80), from = "JIS_X0208-1990", to = "UTF-8")
[1] "外" "憶"
However, if I put those bytes in a list entry, it works:
> iconv(list(as.raw(result[[1]]$kanji - 0x80)), from = "JIS_X0208-1990", to = "UTF-8")
[1] "亜"
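
A self-contained sketch of the fix above, for readers without the original file. The byte values b0 a1 are the ones shown in the earlier message; the helper name decode_jis_hibit() is hypothetical. The sketch also suggests why the bare raw vector was read as two characters: per ?iconv, an input that is neither a character vector nor a list of raw vectors is converted with as.character(), and as.character() on a raw vector returns one two-digit hex string per byte. Whether "JIS_X0208-1990" is available depends on the platform's iconv, so the conversion is guarded with iconvlist().

```r
# The two bytes shown in the original question (stand-in for result[[1]]$kanji).
kanji <- c(0xb0, 0xa1)

# A bare raw vector goes through as.character(), giving one hex string per
# byte; iconv() then decodes each of those strings separately.
as.character(as.raw(kanji))

# Hypothetical helper: subtract the 0x80 offset from every byte, then wrap the
# raw vector in a list so iconv() treats it as a single byte string.
# Assumes every input byte is >= 0x80.
decode_jis_hibit <- function(bytes, from = "JIS_X0208-1990") {
  jis <- as.raw(bytes - 0x80)
  iconv(list(jis), from = from, to = "UTF-8")
}

# The corrected bytes on their own, for inspection.
jis_bytes <- as.raw(kanji - 0x80)
jis_bytes

# Only attempt the conversion where this platform's iconv knows the encoding.
if ("JIS_X0208-1990" %in% iconvlist()) {
  print(decode_jis_hibit(kanji))
}
```

Because the list entry is decoded as one byte string, passing the bytes of several characters at once should give a single string containing all of them. Separately, the unmodified bytes b0 a1 match the EUC-JP encoding of 亜, so from = "EUC-JP" on the original bytes may be worth trying as an alternative; that is an assumption about the file, not something confirmed in this thread.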
Duncan Murdoch
On 19/03/2022 6:52 a.m., Duncan Murdoch wrote:
> I have a file that includes Japanese characters encoded using the
> "JIS_X0208-1997" encoding. According to iconvlist(), an earlier
> revision "JIS_X0208-1990" is supported, so I'd like to try that to
> decode them.
>
> However, I can't seem to find how to provide input to iconv() to do it.
> This is a two-byte encoding, so one character has bytes
>
> > as.raw(result[[1]]$kanji)
> [1] b0 a1
>
> But this is being interpreted as two characters by iconv():
>
> > iconv(as.raw(result[[1]]$kanji), from = "JIS_X0208-1990", to = "UTF-8")
> [1] "皸" "甕"
>
> I can't seem to find any input that iconv() will accept to treat this as
> a single character. (I believe the answer should be 亜, if that helps.)
> How do I tell it to use 0xb0a1 (or 0xa1b0, if that's the right byte
> order)? I just see NA:
>
> > iconv(0xb0a1, from = "JIS_X0208-1990", to = "UTF-8")
> [1] NA
> > iconv(0xa1b0, from = "JIS_X0208-1990", to = "UTF-8")
> [1] NA
>
> Duncan Murdoch
>