utf8Conversion {base} | R Documentation |
Convert Integer Vectors to or from UTF-8-encoded Character Vectors
Description
Conversion of UTF-8 encoded character vectors to and from integer vectors representing a UTF-32 encoding.
Usage
utf8ToInt(x)
intToUtf8(x, multiple = FALSE, allow_surrogate_pairs = FALSE)
Arguments
x | object to be converted. |
multiple | logical: should the conversion be to a single character string or multiple individual characters? |
allow_surrogate_pairs | logical: should interpretation of surrogate pairs be attempted? (See ‘Details’.) Only supported for multiple = FALSE. |
Details
These functions work in any locale, including on platforms that do not otherwise support multi-byte character sets.
Unicode defines a name and a number for each of the glyphs it
encompasses. The numbers are called code points: since RFC 3629
they run from 0 to 0x10FFFF (with about 5% being assigned by
version 13.0 of the Unicode standard and 7% reserved for
‘private use’).
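As a small illustration of the mapping between characters and code points, the following sketch converts an ASCII string to its code points and back. The variable name cp is only illustrative.

```r
# Code points of a short ASCII string
cp <- utf8ToInt("abc")
print(cp)

# The reverse conversion, back to a single string
print(intToUtf8(cp))

# Code points are often written in hexadecimal, as U+XXXX
print(sprintf("U+%04X", cp))
```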
intToUtf8 does not by default handle surrogate pairs: inputs in
the surrogate ranges are mapped to NA. They might occur if a
UTF-16 byte stream has been read as 2-byte integers (in the correct
byte order), in which case allow_surrogate_pairs = TRUE will
try to interpret them (with unmatched surrogate values still treated
as NA).
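The arithmetic behind a surrogate pair can be sketched as follows. The helper surrogate_to_code_point is hypothetical (it is not part of base R) and applies the standard UTF-16 decoding formula to one high/low pair; the result should agree with what allow_surrogate_pairs = TRUE produces.

```r
# Hypothetical helper, not part of base R: decode one UTF-16 surrogate pair.
surrogate_to_code_point <- function(hi, lo) {
  # hi must be a high surrogate, lo a low surrogate
  stopifnot(hi >= 0xD800, hi <= 0xDBFF, lo >= 0xDC00, lo <= 0xDFFF)
  0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
}

cp <- surrogate_to_code_point(0xD801, 0xDC37)
print(sprintf("U+%X", as.integer(cp)))

# Compare with the built-in interpretation of the same pair
same <- identical(intToUtf8(cp),
                  intToUtf8(c(0xD801, 0xDC37), allow_surrogate_pairs = TRUE))
print(same)
```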
Value
utf8ToInt converts a length-one character string encoded in
UTF-8 to an integer vector of Unicode code points.
intToUtf8 converts a numeric vector of Unicode code points
either (default) to a single character string or a character vector of
single characters. Non-integral numeric values are truncated to
integers. For output to a single character string 0 is
silently omitted: otherwise 0 is mapped to "". The
Encoding of a non-NA return value is declared as "UTF-8".
Invalid and NA inputs are mapped to NA output.
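The rules above can be seen together in a short sketch. The input vector is made up for illustration: it contains a non-integral value and a 0 so that truncation and the two treatments of 0 are visible.

```r
# alpha, beta (given as 946.9, truncated to 946), a zero, gamma
cp <- c(0x3B1, 0x3B2 + 0.9, 0, 0x3B3)

one  <- intToUtf8(cp)                   # single string; the 0 is omitted
many <- intToUtf8(cp, multiple = TRUE)  # one element per input; 0 becomes ""

print(utf8ToInt(one))      # code points that survived, after truncation
print(length(many))        # one element per input value
print(many[3] == "")       # the 0 was mapped to the empty string
print(Encoding(one))       # declared encoding of the non-ASCII result
print(is.na(intToUtf8(NA_integer_)))  # NA input gives NA output
```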
Validity
Which code points are regarded as valid has changed over the lifetime
of UTF-8. Originally all 32-bit unsigned integers were potentially
valid and could be converted to up to 6 bytes in UTF-8. Since 2003 it
has been stated that there will never be valid code points larger than
0x10FFFF, and so valid UTF-8 encodings are never more than 4 bytes.
The code points in the surrogate-pair range 0xD800 to 0xDFFF
are prohibited in UTF-8 and so are regarded as invalid
by utf8ToInt and by default by intToUtf8.
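A minimal sketch of these validity rules, assuming a version of R that behaves as described in this section. The sample code points are chosen for illustration so that each needs a different number of bytes in UTF-8.

```r
# One code point for each encoded length, 1 to 4 bytes
cps <- c(0x41, 0xDF, 0x20AC, 0x10437)
bytes <- vapply(cps,
                function(cp) nchar(intToUtf8(cp), type = "bytes"),
                integer(1))
print(bytes)

# Values above 0x10FFFF and lone surrogates are treated as invalid
too_big   <- intToUtf8(0x110000)
surrogate <- intToUtf8(0xD800)
print(is.na(too_big))
print(is.na(surrogate))
```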
The position of ‘noncharacters’ (notably 0xFFFE and 0xFFFF)
was clarified by ‘Corrigendum 9’ in 2013. These
are valid but will never be given an official interpretation. (In some
earlier versions of R utf8ToInt treated them as invalid.)
References
https://www.rfc-editor.org/rfc/rfc3629, the current standard for UTF-8.
https://www.unicode.org/versions/corrigendum9.html for non-characters.
Examples
## will only display in some locales and fonts
intToUtf8(0x03B2L) # Greek beta
utf8ToInt("bi\u00dfchen")
utf8ToInt("\xfa\xb4\xbf\xbf\x9f") # a 5-byte sequence, not valid UTF-8 under RFC 3629
## A valid UTF-16 surrogate pair (for U+10437)
x <- c(0xD801, 0xDC37)
intToUtf8(x)
intToUtf8(x, TRUE)
(xx <- intToUtf8(x, , TRUE)) # will only display in some locales and fonts
charToRaw(xx)
## An example of how surrogate pairs might occur
x <- "\U10437"
charToRaw(x)
foo <- tempfile()
writeLines(x, file(foo, encoding = "UTF-16LE"))
## next two are OS-specific, but are mandated by POSIX
system(paste("od -x", foo)) # 2-byte units, correct on little-endian platforms
system(paste("od -t x1", foo)) # single bytes as hex
y <- readBin(foo, "integer", 2, 2, FALSE, endian = "little")
sprintf("%X", y)
intToUtf8(y, , TRUE)