[Rd] ASCIIfy() - a proposal for package:tools
Arni Magnusson
arnima at hafro.is
Tue Apr 15 19:48:33 CEST 2014
Hi all,
I would like to propose the attached function ASCIIfy() to be added to the
'tools' package.
Non-ASCII characters in character vectors can be problematic for R
packages, but sometimes they cannot be avoided. To make packages portable
and build without 'R CMD check' warnings, my solution has been to convert
problematic characters in functions and datasets to escaped ASCII, so
plot(1,main="São Paulo") becomes plot(1,main="S\u00e3o Paulo").
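A minimal sketch of why the escaped spelling is a safe replacement: the \u escape denotes the same character, so R ends up with the same string either way. The variable names here are only for illustration, and intToUtf8() stands in for typing the character directly so that the snippet itself stays ASCII.

```r
# Escaped spelling, as it would appear in ASCII-only package source
escaped <- "S\u00e3o Paulo"

# The same string built from the code point of a-tilde (U+00E3 = 227)
literal <- paste0("S", intToUtf8(227L), "o Paulo")

identical(escaped, literal)         # both spellings give the same string
nchar(escaped)                      # counted as characters, not bytes
utf8ToInt(substr(escaped, 2, 2))    # code point of the second character
```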
The showNonASCII() function in package:tools is helpful to identify R
source files where characters should be converted to ASCII one way or
another, but I could not find a function to actually perform the
conversion to ASCII.
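For context, the identification step might look like the sketch below. The vector x and the helper has_non_ascii are made up for illustration; the second part is a base-R alternative that returns a logical index, not something showNonASCII() itself provides.

```r
# Illustrative input: one pure-ASCII element and two with non-ASCII characters
x <- c("Reykjavik", "Reykjav\u00edk", "S\u00e3o Paulo")

# Prints the offending elements with their non-ASCII bytes marked
tools::showNonASCII(x)

# Base-R alternative: TRUE where any code point lies above the ASCII range
has_non_ascii <- vapply(enc2utf8(x),
                        function(s) any(utf8ToInt(s) > 127L),
                        logical(1), USE.NAMES = FALSE)
has_non_ascii
```

Either approach only locates the problem; the conversion itself still has to be done by hand or by a function such as the one proposed here.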
I have written the function ASCIIfy() to convert character vectors to
ASCII. I imagine other R package developers might be looking for a similar
tool, and it seems to me that package:tools is the first place they would
look, where the R Core Team has provided a variety of tools for handling
non-ASCII characters in package development.
I hope the R Core Team will adopt ASCIIfy() into the 'tools' package, to
make life easier for package developers outside the English-speaking
world. Of course, I have no objection to them renaming or rewriting the
function in any way.
See the attached examples - all in flat ASCII that was prepared using the
function itself! The main objective, though, is to ASCIIfy functions and
datasets, not help pages.
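As a rough sketch of that workflow, assume the escaped text has the form "S\\u00e3o Paulo" (one literal backslash followed by u00e3). Writing it into a source file gives pure ASCII on disk, and R restores the original character when the file is parsed. The file and the variable city are hypothetical.

```r
# Assumed escaped form: a literal backslash, then u00e3
escaped <- "S\\u00e3o Paulo"

print(escaped)        # print() shows the backslash doubled
writeLines(escaped)   # writeLines() shows the single backslash

# Write a one-line R source file containing the escaped string
f <- tempfile(fileext = ".R")
writeLines(paste0('city <- "', escaped, '"'), f)

# Every byte in the file is 7-bit ASCII
file_is_ascii <- all(readBin(f, "raw", file.size(f)) <= as.raw(127))

# Parsing the file turns the escape back into the real character
env <- new.env()
source(f, local = env)
identical(env$city, "S\u00e3o Paulo")
```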
Arni
-------------- next part --------------
ASCIIfy <- function(x, bytes=2, fallback="?")
{
  bytes <- match.arg(as.character(bytes), 1:2)
  convert <- function(char)  # convert to ASCII, e.g. "z", "\xfe", or "\u00fe"
  {
    raw <- charToRaw(char)
    if(length(raw)==1 && raw<=127)  # 7-bit
      ascii <- char
    else if(length(raw)==1 && bytes==1)  # 8-bit to \x00
      ascii <- paste0("\\x", raw)
    else if(length(raw)==1 && bytes==2)  # 8-bit to \u0000
      ascii <- paste0("\\u", chartr(" ","0",formatC(as.character(raw),width=4)))
    else if(length(raw)==2 && bytes==1)  # 16-bit to \x00, if possible
      if(utf8ToInt(char) <= 255)
        ascii <- paste0("\\x", format.hexmode(utf8ToInt(char)))
      else {
        ascii <- fallback; warning(char, " could not be converted to 1 byte")}
    else if(length(raw)==2 && bytes==2)  # UTF-8 to \u0000
      ascii <- paste0("\\u", format.hexmode(utf8ToInt(char),width=4))
    else {
      ascii <- fallback
      warning(char, " could not be converted to ", bytes, " byte")}
    return(ascii)
  }
  if(length(x) > 1)
  {
    # vectorize by calling ASCIIfy on each element
    sapply(x, ASCIIfy, bytes=bytes, fallback=fallback, USE.NAMES=FALSE)
  }
  else
  {
    input <- unlist(strsplit(x,""))       # "c" "a" "f" "<\'e>"
    output <- character(length(input))    # "" "" "" ""
    for(i in seq_along(input))
      output[i] <- convert(input[i])      # "c" "a" "f" "\\u00e9"
    output <- paste(output, collapse="")  # "caf\\u00e9"
    return(output)
  }
}
-------------- next part --------------
\name{ASCIIfy}
\alias{ASCIIfy}
\title{Convert Characters to ASCII}
\description{
Convert character vector to ASCII, replacing non-ASCII characters with
single-byte (\samp{\x00}) or two-byte (\samp{\u0000}) codes.
}
\usage{
ASCIIfy(x, bytes = 2, fallback = "?")
}
\arguments{
\item{x}{a character vector, possibly containing non-ASCII
characters.}
\item{bytes}{either \code{1} or \code{2}, for single-byte
(\samp{\x00}) or two-byte (\samp{\u0000}) codes.}
\item{fallback}{an output character to use, when input characters
cannot be converted.}
}
\value{
A character vector like \code{x}, except non-ASCII characters have
been replaced with \samp{\x00} or \samp{\u0000} codes.
}
\author{Arni Magnusson.}
\note{
To render single backslashes, use these or similar techniques:
\verb{
write(ASCIIfy(x), "file.txt")
cat(paste(ASCIIfy(x), collapse="\n"), "\n", sep="")}
The resulting strings are plain ASCII and can be used in R functions
and datasets to improve package portability.
}
\seealso{
\code{\link[tools]{showNonASCII}} identifies non-ASCII characters in
a character vector.
}
\examples{
cities <- c("S\u00e3o Paulo", "Reykjav\u00edk")
print(cities)
ASCIIfy(cities, 1)
ASCIIfy(cities, 2)
athens <- "\u0391\u03b8\u03ae\u03bd\u03b1"
print(athens)
ASCIIfy(athens)
}
\keyword{}