iconv {base} | R Documentation
Convert Character Vector between Encodings
Description
This uses system facilities to convert a character vector between encodings: the ‘i’ stands for ‘internationalization’.
Usage
iconv(x, from = "", to = "", sub = NA, mark = TRUE, toRaw = FALSE)
iconvlist()
Arguments
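x | a character vector, or an object to be converted to a character vector by as.character, or a list with NULL and raw elements as returned by iconv(toRaw = TRUE).
from | a character string describing the current encoding.
to | a character string describing the target encoding.
sub | character string. If not NA, it is used to replace any non-convertible bytes in the input. If "byte", each such byte is shown as "<xx>" with its hex code. If "Unicode" or "c99" and converting from UTF-8, a non-convertible character is shown as "<U+xxxx>" or as a "\uxxxx" (or "\Uxxxxxxxx") escape respectively.
mark | logical, for expert use. Should encodings be marked?
toRaw | logical. Should a list of raw vectors be returned rather than a character vector?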
Details
The names of encodings and which ones are available are
platform-dependent. All R platforms support "" (for the encoding of
the current locale), "latin1" and "UTF-8". Generally case is ignored
when specifying an encoding.
On most platforms iconvlist provides an alphabetical list of the
supported encodings. On others, the information is on the man page
for iconv(5) or elsewhere in the man pages (but beware that the
system command iconv may not support the same set of encodings as
the C functions R calls). Unfortunately, the names are rarely
supported across all platforms.
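Because names vary by platform, it can help to check a name before relying on it. The following is a minimal sketch: has_encoding is a hypothetical helper, not part of base R, and it assumes that a case-insensitive match against iconvlist() is a reasonable proxy for support on the current platform.

```r
# Hypothetical helper (not part of base R): report, ignoring case, whether
# each name appears in iconvlist() on this platform. If iconvlist() is not
# available, every name is reported as FALSE.
has_encoding <- function(name) {
  known <- tryCatch(iconvlist(), error = function(e) character())
  toupper(name) %in% toupper(known)
}

res <- has_encoding(c("UTF-8", "latin1", "no-such-encoding"))
print(res)
```

A FALSE result is only a hint: some implementations accept aliases that they do not list.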
Elements of x which cannot be converted (perhaps because they are
invalid or because they cannot be represented in the target encoding)
will be returned as NA (or NULL for toRaw = TRUE) unless sub is
specified.
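As a rough illustration of the three outcomes described above (NA, substitution, NULL), the sketch below converts a two-element vector to ASCII. Whether the second element fails to convert is implementation-specific, so no output is claimed for the call without sub.

```r
x <- c("plain", "fran\xE7ais")
Encoding(x) <- "latin1"

# No 'sub': an element that fails to convert comes back as NA.
res_default <- iconv(x, "latin1", "ASCII")

# With 'sub': non-convertible bytes are replaced, so no element is NA.
res_sub <- iconv(x, "latin1", "ASCII", sub = "?")

# With toRaw = TRUE a failed element is NULL instead of NA.
res_raw <- iconv(x, "latin1", "ASCII", toRaw = TRUE)

print(res_default)
print(res_sub)
str(res_raw)
```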
Most versions of iconv will allow transliteration by appending
‘//TRANSLIT’ to the to encoding: see the examples.
Encoding "ASCII" is accepted, and on most systems "C" and "POSIX" are
synonyms for ASCII. Where "ASCII//TRANSLIT" is unsupported by the OS,
"ASCII" is used with sub = "c99" if from UTF-8, else sub = "?".
(However, musl's version of "ASCII" substitutes *.)
Elements of x with a declared encoding (UTF-8 or latin1, see
Encoding) are converted from that encoding if from = "", otherwise
they are taken as being in the encoding specified by from.
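A small sketch of the from = "" case: the input is declared as latin1, so that declared encoding is used as the source regardless of the current locale.

```r
x <- "fran\xE7ais"
Encoding(x) <- "latin1"   # declare the encoding of the bytes

# from = "": the declared encoding (latin1) is used as the source.
y <- iconv(x, from = "", to = "UTF-8")

charToRaw(y)   # the c-cedilla is now the two bytes c3 a7
Encoding(y)    # marked as UTF-8, since mark = TRUE by default
```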
Note that implementations of iconv typically do not do much validity
checking and will often mis-convert inputs which are invalid in
encoding from.
If sub = "Unicode" or sub = "c99" is used for a non-UTF-8 input it is
the same as sub = "byte".
Value
If toRaw = FALSE (the default), the value is a character vector of
the same length and the same attributes as x (after conversion to a
character vector). If conversion fails for an element that element of
the result is set to NA_character_. (NB: whether conversion fails is
implementation-specific.) NA_character_ inputs give NA_character_
outputs.
If mark = TRUE (the default) the elements of the result have a
declared encoding if to is "latin1" or "UTF-8", or if to = "" and the
current locale's encoding is detected as Latin-1 (or its superset
CP1252 on Windows) or UTF-8.
If toRaw = TRUE, the value is a list of the same length and the same
attributes as x whose elements are either NULL (if conversion fails
or the input was NA_character_) or a raw vector.
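The two return shapes can be compared side by side. This is an illustrative sketch only: the names "a" and "b" are arbitrary, and "UTF-16LE" is assumed to be available (it is supported by the main implementations described in this page).

```r
x <- c("fran\xE7ais", NA)
Encoding(x) <- "latin1"
names(x) <- c("a", "b")

# Character result: same length and names, NA stays NA.
chr <- iconv(x, "latin1", "UTF-8")
names(chr)
is.na(chr)
Encoding(chr)   # the converted element is marked as UTF-8

# Raw result: a list with one raw vector per element, NULL for the NA.
raw_list <- iconv(x, "latin1", "UTF-16LE", toRaw = TRUE)
str(raw_list)
```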
For iconvlist(), a character vector (typically of a few hundred
elements) of known encoding names.
Implementation Details
There are three main implementations of iconv in use. Linux's most
common C runtime, ‘glibc’, contains one. Several platforms supply
versions or emulations of GNU ‘libiconv’, including previous versions
of macOS and FreeBSD, in some cases with additional encodings. On
Windows we use a version of Yukihiro Nakadaira's ‘win_iconv’, which
is based on Windows' codepages. (We have added many encoding names
for compatibility with other systems.) All three have iconvlist,
ignore case in encoding names and support ‘//TRANSLIT’ (but with
different results, and for ‘win_iconv’ currently a ‘best fit’
strategy is used except for to = "ASCII").
The macOS 14 implementation is attributed to the ‘Citrus Project’:
the Apple headers declare it as ‘compatible’ with GNU ‘libiconv’ 1.11
from 2006. However, it differs in significant ways including using
transliteration for conversions which cannot be represented exactly
in the target encoding. (It seems this implementation is also used in
recent versions of FreeBSD. Earlier versions of macOS used GNU
‘libiconv’ 1.11 and some CRAN builds still do.) For a failing
conversion macOS 14 generally translated character(s) to ? but 14.1
gives an error (so an NA result in R).
Most commercial Unixes contain an implementation of iconv but none we
have encountered have supported the encoding names we need: the ‘R
Installation and Administration’ manual recommended installing GNU
‘libiconv’ on Solaris and AIX.
Some Linux distributions use ‘musl’ as their C runtime. This is less comprehensive than ‘glibc’: it does not support ‘//TRANSLIT’ but does inexact conversions (currently using ‘*’).
There are other implementations, e.g. NetBSD has used one from the Citrus project (which does not support ‘//TRANSLIT’) and there is an older FreeBSD port.
Note that you cannot rely on invalid inputs being detected,
especially for to = "ASCII" where some implementations allow 8-bit
characters and pass them through unchanged or with transliteration or
substitution.
Some of the implementations have interesting extra encodings: for
example GNU ‘libiconv’ and macOS 14 allow to = "C99" to use ‘\uxxxx’
escapes (or if needed ‘\Uxxxxxxxx’) for non-ASCII characters.
Byte Order Marks
Byte Order Marks are most commonly known as ‘BOMs’.
Encodings using character units which are more than one byte in size
can be written on a file in either big-endian or little-endian order:
this applies most commonly to UCS-2, UTF-16 and UTF-32/UCS-4
encodings. Some systems will write the Unicode character U+FEFF at
the beginning of a file in these encodings and perhaps also in UTF-8.
In that usage the character is known as a BOM, and should be handled
during input (see the ‘Encodings’ section under connection:
re-encoded connections have some special handling of BOMs). The rest
of this section applies when this has not been done so x starts with
a BOM.
Implementations will generally interpret a BOM for from given as one
of "UCS-2", "UTF-16" and "UTF-32". Implementations differ in how they
treat BOMs in x in other from encodings: they may be discarded,
returned as character U+FEFF or regarded as invalid.
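Since the treatment of a BOM in other from encodings varies, one option is to remove it before calling iconv. The sketch below is one possible approach, not a prescribed one: it works on raw bytes, handles only the UTF-8 form of the BOM (EF BB BF), and uses variable names chosen purely for illustration.

```r
# UTF-8 bytes for "hello" preceded by a UTF-8-encoded BOM (EF BB BF),
# as might be read from a file with readBin().
bom   <- as.raw(c(0xEF, 0xBB, 0xBF))
bytes <- c(bom, charToRaw("hello"))

# Detect and drop the BOM so the result does not depend on how the
# platform's iconv treats it.
has_bom <- length(bytes) >= 3L && identical(bytes[1:3], bom)
if (has_bom) bytes <- bytes[-(1:3)]

txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"

out <- iconv(txt, "UTF-8", "latin1")
print(out)
```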
Note
The most portable name for the ISO 8859-15 encoding, commonly known
as ‘Latin 9’, is "iso885915": most platforms support both "latin-9"
and "latin9" but GNU ‘libiconv’ does not support the latter. ‘musl’
(as used by Alpine Linux and other lightweight Linux distributions)
supports neither, but R remaps there to "iso885915".
Encoding names "utf8", "mac" and "macroman" are not portable. "utf8"
is converted to "UTF-8" for from and to by iconv, but not for e.g.
fileEncoding arguments. "macintosh" is the official (and most widely
supported) name for ‘Mac Roman’
(https://en.wikipedia.org/wiki/Mac_OS_Roman).
Using sub substitutes each non-convertible byte in the input, so when
converting from UTF-8 a non-convertible character may be replaced by
two or more bytes. Using sub = "c99" or sub = "Unicode" will be
clearer.
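To see the per-byte behaviour, the sketch below converts a UTF-8 string in which one character occupies two bytes. The exact results are platform-dependent (some implementations transliterate rather than fail), so none are asserted here.

```r
xx <- "fran\u00e7ais"    # UTF-8: the c-cedilla occupies two bytes
length(charToRaw(xx))    # 9 bytes for 8 characters

# Per-byte substitution: where the conversion fails, each of the two
# bytes is replaced separately.
per_byte <- iconv(xx, "UTF-8", "ASCII", sub = "?")
as_hex   <- iconv(xx, "UTF-8", "ASCII", sub = "byte")

# Per-character escapes are usually easier to read.
as_c99     <- iconv(xx, "UTF-8", "ASCII", sub = "c99")
as_unicode <- iconv(xx, "UTF-8", "ASCII", sub = "Unicode")

print(c(per_byte, as_hex, as_c99, as_unicode))
```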
See Also
Examples
## In principle, as not all systems have iconvlist
try(utils::head(iconvlist(), n = 50))
## Not run:
## convert from Latin-2 to UTF-8: two of the glibc iconv variants.
iconv(x, "ISO_8859-2", "UTF-8")
iconv(x, "LATIN2", "UTF-8")
## End(Not run)
## Both x below are in latin1 and will only display correctly in a
## locale that can represent and display latin1.
x <- "fran\xE7ais"
Encoding(x) <- "latin1"
x
charToRaw(xx <- iconv(x, "latin1", "UTF-8"))
xx
## The results in the comments are those from glibc and GNU libiconv
iconv(x, "latin1", "ASCII") # NA
iconv(x, "latin1", "ASCII", "?") # "fran?ais"
iconv(x, "latin1", "ASCII", "") # "franais"
iconv(x, "latin1", "ASCII", "byte") # "fran<e7>ais"
iconv(xx, "UTF-8", "ASCII", "Unicode") # "fran<U+00E7>ais"
iconv(xx, "UTF-8", "ASCII", "c99") # "fran\\u00e7ais"
## Extracts from old R help files (they are nowadays in UTF-8)
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x) <- "latin1"
x
try(iconv(x, "latin1", "ASCII//TRANSLIT")) # platform-dependent
## glibc gives "Ekstroem" "Joreskog" "bisschen Zurcher"
## macOS 14 gives "Ekstrom" "J\"oreskog" "bisschen Z\"urcher"
## musl gives "Ekstr*m" "J*reskog" "bi*chen Z*rcher"
iconv(x, "latin1", "ASCII", sub = "byte")
## and for Windows' 'Unicode'
str(xx <- iconv(x, "latin1", "UTF-16LE", toRaw = TRUE))
iconv(xx, "UTF-16LE", "UTF-8")
emoji <- "\U0001f604"
iconv(emoji,, "latin1", sub = "Unicode") # "<U+1F604>"
iconv(emoji,, "latin1", sub = "c99")