[Rd] Error in substring: invalid multibyte string
Ivan Krylov
kry|ov@r00t @end|ng |rom gm@||@com
Sat Jun 27 11:12:42 CEST 2020
On Fri, 26 Jun 2020 15:57:06 -0700
Toby Hocking <tdhock5 using gmail.com> wrote:
>invalid multibyte string at '<e4>gel-A<6b>iyoshi'
>https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html
The server says that the text is UTF-8:
curl -sI \
https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html | \
grep Content-Type
# Content-Type: text/html; charset=UTF-8
But it's not, at least not all of it. If you ask readLines to mark
the text as Latin-1, you get Jens Oehlschlägel-Akiyoshi without the
mojibake and invalid multi-byte characters:
x <- readLines(
'https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html',
encoding = 'latin1'
)[28]
substr(x, 1, 100)
# [1] "<I>Jens Oehlschlägel-Akiyoshi"
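The same effect can be reproduced without the network. This is only a
minimal sketch: the byte vector below is made up for illustration (it
is "schlägel" typed out as Latin-1 bytes, not data fetched from the
archive), and it sets the encoding mark with Encoding<- instead of the
encoding= argument of readLines().

```r
# "schlägel" in Latin-1: the single byte 0xE4 encodes "ä".
bytes <- as.raw(c(0x73, 0x63, 0x68, 0x6c, 0xe4, 0x67, 0x65, 0x6c))
x <- rawToChar(bytes)
Encoding(x)              # "unknown": the string carries no encoding mark

# Declare the bytes to be Latin-1. This changes the mark, not the bytes.
Encoding(x) <- "latin1"
y <- substr(x, 4, 6)     # characters 4..6: "l", "ä", "g"

# Translate to UTF-8, where "ä" becomes the two bytes c3 a4.
z <- enc2utf8(y)
charToRaw(z)             # 6c c3 a4 67
```

Marking the string only tells R how to interpret bytes it already
has; the conversion to UTF-8 happens in enc2utf8().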
When encoding = 'latin1' is not specified, readLines() returns lines
whose encoding is marked "unknown", and that is what causes the error.
The substr() implementation interprets such strings according to the
multi-byte rules of the current C locale (using mbrtowc(3)). On my
system (yours too, probably, if it's GNU/Linux or macOS), the
multi-byte C locale encoding is UTF-8, and this Latin-1 string does not
result in valid code points when decoded as UTF-8.
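One way to check for this situation before calling substr() is to
test the strings with validUTF8() and convert them explicitly. Again
this is only a sketch under assumptions: the bytes are a hand-written
stand-in for the offending line, and converting with iconv() is an
alternative to the encoding= argument above, not something the
original report used.

```r
# Latin-1 bytes for "ägel" with no encoding mark, similar to what
# readLines() returns by default for this page.
x <- rawToChar(as.raw(c(0xe4, 0x67, 0x65, 0x6c)))

# 0xE4 starts a three-byte UTF-8 sequence, but 0x67 is not a
# continuation byte, so the string is not valid UTF-8.
validUTF8(x)             # FALSE

# Convert explicitly from Latin-1 to UTF-8.
u <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(u)             # TRUE
nchar(u)                 # 4
first <- substr(u, 1, 1) # "ä", now a valid two-byte UTF-8 character
```

Whether substr() on the unconverted x fails depends on the locale R
is running in, so the sketch does not try to trigger the error itself.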
--
Best regards,
Ivan