[R] Find the dataset(s) that contain(s) non-ASCII characters
Christophe Dutang
dutangc at gmail.com
Mon Apr 4 21:19:20 CEST 2016
Dear list,
I’m maintaining a package containing only datasets (152 of them): http://dutangc.free.fr/pub/RRepos/web/CASdatasets-index.html
When running R CMD check on the package, I get the following NOTE:
* checking data for non-ASCII characters ... NOTE
Note: found 4 marked UTF-8 strings
I wonder how to find which dataset(s) (all recorded as rda files) contain(s) non-ASCII characters.
The iconv function lets us find or replace non-ASCII characters:
iconv(x, "UTF-8", "ASCII", sub="I_WAS_NOT_ASCII")
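As a small illustration of that call (the vector x below is made up for the example and is not taken from the package), the substitution marker can then be searched for with grep:

```r
# Hypothetical sample vector: the second element contains a non-ASCII character.
x <- c("Paris", "Orl\u00e9ans", "Le Mans")

# Bytes that cannot be represented in ASCII are replaced by the marker string.
converted <- iconv(x, "UTF-8", "ASCII", sub = "I_WAS_NOT_ASCII")

# Indices of the elements in which the marker was substituted.
flagged <- grep("I_WAS_NOT_ASCII", converted, fixed = TRUE)
print(x[flagged])
```

Only the elements whose conversion needed a substitution are reported, so plain ASCII strings pass through untouched.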
I use the following function to detect non-ASCII characters.
testASCII <- function(idata)
{
  # indices of the factor columns, then of the character columns
  col <- (1:NCOL(idata))[sapply(idata, is.factor)]
  col <- c(col, (1:NCOL(idata))[sapply(idata, is.character)])
  for(i in col)
  {
    x <- idata[, i]
    cat(colnames(idata)[i], "\n")
    # try both source encodings; flagged entries carry the marker string
    res <- grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
    res <- c(res, grep("I_WAS_NOT_ASCII", iconv(x, "UTF-8", "ASCII", sub="I_WAS_NOT_ASCII")))
    # print the row indices of the flagged entries, if any
    if(length(res) > 0)
      cat(res, "\n")
  }
}
Unfortunately, I have not yet found which rda file contains non-ASCII characters among the 56 most recent datasets. Is there a faster way to detect non-ASCII characters than manually loading each dataset and running testASCII() on it, for example by working directly on the rda files?
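One direction I can imagine, given here only as a rough sketch: since the NOTE counts strings marked as UTF-8, each rda file could be loaded into its own environment and the result of Encoding() inspected. The names hasMarkedUTF8 and scanRda are hypothetical helpers, not existing functions, and the sketch only looks at character vectors and factor levels (including those nested in lists and data frames), so it may not cover everything R CMD check examines, such as attributes.

```r
# Hypothetical helper: TRUE if an object holds at least one string marked as UTF-8.
hasMarkedUTF8 <- function(obj) {
  if (is.factor(obj)) obj <- levels(obj)
  if (is.character(obj)) return(any(Encoding(obj) == "UTF-8"))
  # data frames and lists are searched recursively
  if (is.list(obj)) return(any(vapply(obj, hasMarkedUTF8, logical(1))))
  FALSE
}

# Hypothetical helper: names of the rda files in dataDir with marked UTF-8 strings.
scanRda <- function(dataDir) {
  files <- list.files(dataDir, pattern = "\\.rda$", full.names = TRUE,
                      ignore.case = TRUE)
  hits <- vapply(files, function(f) {
    env <- new.env()                 # keep the global workspace clean
    objs <- load(f, envir = env)     # load() returns the names of the loaded objects
    any(vapply(objs, function(n) hasMarkedUTF8(get(n, envir = env)), logical(1)))
  }, logical(1))
  basename(files[hits])
}
```

It would be called on the data directory of the package sources, for instance scanRda("CASdatasets/data"), where that path is an assumption about the layout.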
Any comment is welcome.
Regards, Christophe
> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
---------------------------------------
Christophe Dutang
LMM, UdM, Le Mans, France
web: http://dutangc.free.fr