[Rd] ASCIIfy() - a proposal for package:tools

Arni Magnusson arnima at hafro.is
Tue Apr 15 19:48:33 CEST 2014

Hi all,

I would like to propose the attached function ASCIIfy() to be added to the 
'tools' package.

Non-ASCII characters in character vectors can be problematic for R 
packages, but sometimes they cannot be avoided. To keep my packages 
portable and building without 'R CMD check' warnings, I convert the 
problematic characters in functions and datasets to escaped ASCII, so 
plot(1,main="São Paulo") becomes plot(1,main="S\u00e3o Paulo").
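
To illustrate the point, here is a minimal sketch (the variable names are 
only for illustration, they are not part of the proposal). It checks that 
the escaped call is pure ASCII as source text, while R still parses the 
escaped literal into the accented string.

```r
# The source text of the call, as it would appear in a package file.
# The doubled backslash stands for a single backslash in that file.
src <- 'plot(1, main="S\\u00e3o Paulo")'
src_is_ascii <- all(utf8ToInt(src) < 128L)   # every source character is 7-bit

# What R holds after parsing the escaped literal.
parsed <- "S\u00e3o Paulo"
second_char <- utf8ToInt(parsed)[2]          # code point of the a-tilde
n_chars <- nchar(parsed)                     # counts characters, not bytes

print(src_is_ascii)
print(sprintf("U+%04X", second_char))
print(n_chars)
```

The escape only changes how the string is written in the file, not the 
string that the plot title receives.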

The showNonASCII() function in package:tools is helpful to identify R 
source files where characters should be converted to ASCII one way or 
another, but I could not find a function to actually perform the 
conversion to ASCII.
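
For readers who have not used it, a short sketch of the detection step 
follows. The input lines are made up for illustration, and the base-R 
check at the end is only one assumed way to get a logical flag per line; 
it is not how showNonASCII() is implemented.

```r
library(tools)

# Hypothetical lines of R source; the second contains a non-ASCII character.
lines <- c('x <- "plain ASCII"', 'main <- "S\u00e3o Paulo"')

# Prints the offending lines, with non-ASCII bytes shown as hex codes.
showNonASCII(lines)

# A base-R check that flags the same lines: any code point above 127.
has_non_ascii <- vapply(lines, function(s) any(utf8ToInt(s) > 127L),
                        logical(1), USE.NAMES = FALSE)
print(has_non_ascii)
```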

I have written the function ASCIIfy() to convert character vectors to 
ASCII. I imagine other R package developers might be looking for a similar 
tool, and it seems to me that package:tools is the first place they would 
look, where the R Core Team has provided a variety of tools for handling 
non-ASCII characters in package development.
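
The core idea can be sketched in a few lines. This is a simplified 
illustration and not the attached implementation: escape_non_ascii() is a 
hypothetical helper, it assumes UTF-8 input, it always writes the \u0000 
form, and it does not handle code points above U+FFFF.

```r
# Hypothetical helper: escape every non-ASCII character as \uXXXX.
# Assumes code points up to U+FFFF, which fit in four hex digits.
escape_non_ascii <- function(s) {
  codes <- utf8ToInt(s)                       # one integer per character
  chars <- intToUtf8(codes, multiple = TRUE)  # the characters themselves
  pieces <- ifelse(codes < 128L,
                   chars,                      # keep 7-bit ASCII as is
                   sprintf("\\u%04x", codes))  # escape everything else
  paste(pieces, collapse = "")
}

escaped <- escape_non_ascii("caf\u00e9")
cat(escaped, "\n")   # cat() shows the single backslash
```

The attached ASCIIfy() goes further: it is vectorized, offers the 
single-byte \x00 form, and falls back to a replacement character with a 
warning when a character cannot be converted.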

I hope the R Core Team will adopt ASCIIfy() into the 'tools' package, to 
make life easier for package developers outside the English-speaking 
world. Of course, I have no problem with them renaming or rewriting the 
function in any way.

See the attached examples - all in flat ASCII that was prepared using the 
function itself! The main objective, though, is to ASCIIfy functions and 
datasets, not help pages.

-------------- next part --------------
ASCIIfy <- function(string, bytes=2, fallback="?")
{
  bytes <- match.arg(as.character(bytes), 1:2)
  convert <- function(char)  # convert to ASCII, e.g. "z", "\xfe", or "\u00fe"
  {
    raw <- charToRaw(char)
    if(length(raw)==1 && raw<=127)  # 7-bit
      ascii <- char
    else if(length(raw)==1 && bytes==1)  # 8-bit to \x00
      ascii <- paste0("\\x", raw)
    else if(length(raw)==1 && bytes==2)  # 8-bit to \u0000
      ascii <- paste0("\\u", chartr(" ","0",formatC(as.character(raw),width=4)))
    else if(length(raw)==2 && bytes==1)  # 16-bit to \x00, if possible
    {
      if(utf8ToInt(char) <= 255)
        ascii <- paste0("\\x", format.hexmode(utf8ToInt(char)))
      else {
        ascii <- fallback; warning(char, " could not be converted to 1 byte")}
    }
    else if(length(raw)==2 && bytes==2)  # UTF-8 to \u0000
      ascii <- paste0("\\u", format.hexmode(utf8ToInt(char),width=4))
    else {
      ascii <- fallback
      warning(char, " could not be converted to ", bytes, " byte")}
    return(ascii)
  }

  if(length(string) > 1)
  {
    # vector input: convert each element in turn
    sapply(string, ASCIIfy, bytes=bytes, fallback=fallback, USE.NAMES=FALSE)
  }
  else
  {
    input <- unlist(strsplit(string,""))  # "c"  "a"  "f"  "<\'e>"
    output <- character(length(input))    # ""   ""   ""   ""
    for(i in seq_along(input))
      output[i] <- convert(input[i])      # "c"  "a"  "f"  "\\u00e9"
    output <- paste(output, collapse="")  # "caf\\u00e9"
    return(output)
  }
}
-------------- next part --------------
\name{ASCIIfy}
\alias{ASCIIfy}
\title{Convert Characters to ASCII}
\description{
  Convert character vector to ASCII, replacing non-ASCII characters with
  single-byte (\samp{\x00}) or two-byte (\samp{\u0000}) codes.
}
\usage{
ASCIIfy(x, bytes = 2, fallback = "?")
}
\arguments{
  \item{x}{a character vector, possibly containing non-ASCII
    characters.}
  \item{bytes}{either \code{1} or \code{2}, for single-byte
    (\samp{\x00}) or two-byte (\samp{\u0000}) codes.}
  \item{fallback}{an output character to use, when input characters
    cannot be converted.}
}
\value{
  A character vector like \code{x}, except non-ASCII characters have
  been replaced with \samp{\x00} or \samp{\u0000} codes.
}
\author{Arni Magnusson.}
\note{
  To render single backslashes, use these or similar techniques:
  \preformatted{
    write(ASCIIfy(x), "file.txt")
    cat(paste(ASCIIfy(x), collapse="\n"), "\n", sep="")}

  The resulting strings are plain ASCII and can be used in R functions
  and datasets to improve package portability.
}
\seealso{
  \code{\link[tools]{showNonASCII}} identifies non-ASCII characters in
  a character vector.
}
\examples{
cities <- c("S\u00e3o Paulo", "Reykjav\u00edk")
ASCIIfy(cities, 1)
ASCIIfy(cities, 2)

athens <- "\u0391\u03b8\u03ae\u03bd\u03b1"
}
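
The note about single backslashes can be illustrated with a short sketch. 
It assumes the escaped form of one city name has already been produced 
(written here by hand rather than by calling the function) and shows 
that print() and cat() display it differently, and that parsing the 
escaped text as an R string literal gives back the original.

```r
# The escaped form, written by hand: one backslash, then u00ed.
escaped <- "Reykjav\\u00edk"

print(escaped)       # print() shows the backslash doubled
cat(escaped, "\n")   # cat() shows the text as it should appear in a file
n_chars <- nchar(escaped)

# Wrapping the text in quotes and parsing it recovers the accented string.
restored <- eval(parse(text = paste0('"', escaped, '"')))
round_trip_ok <- identical(restored, "Reykjav\u00edk")
print(round_trip_ok)
```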
