[R] Request for advice on character set conversions (those damn Excel files, again ...)
Emmanuel Charpentier
charpent at bacbuc.dyndns.org
Mon Sep 8 00:02:06 CEST 2008
Dear list,
I have to read a not-so-small bunch of not-so-small Excel files, which
seem to have traversed the Windows 3.1, Windows 95 and Windows NT versions
of the thing (with maybe a Mac or two thrown in for good measure...).
The problem is that 1) I need to read strings, and 2) those
strings may have various encodings. In the same sheet of the same file,
some cells may be latin1, some UTF-8 and some CP437 (!).
read.xls() allows me to read those things into sets of data frames. My
problem is to convert the encodings to UTF-8 without clobbering the
strings that already are (or at least look like) UTF-8.
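The crux is telling which strings are already valid UTF-8. One usable test is that iconv() returns NA when its input is not valid in the claimed source encoding; a minimal sketch (the sample strings are made up for illustration):

```r
## iconv(v, to, to) yields NA exactly when an element is not valid in
## 'to', so it flags which strings still need re-encoding.
v <- c("d\xe9j\xe0", "d\u00e9j\u00e0")    # latin1 bytes vs. valid UTF-8
bad <- is.na(iconv(v, "UTF-8", "UTF-8"))  # flags the latin1 entry only
v[bad] <- iconv(v[bad], "latin1", "UTF-8")
```

After the conditional conversion, both elements hold the same UTF-8 text.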
I came to the following solution:

foo <- function(d, from = "latin1", to = "UTF-8") {
    ## Semi-smart conversion of a data frame between charsets.
    ## Needed to ease use of those [@!] Excel files that have survived
    ## the Win3.1 --> Win95 --> NT transition, usually in poor shape...
    condconv <- function(v, from, to) {
        ## Convert only the elements not already valid in 'to':
        ## iconv(v, to, to) yields NA exactly for those.
        cnv <- is.na(iconv(v, to, to))
        v[cnv] <- iconv(v[cnv], from, to)
        return(v)
    }
    conv1 <- function(v, from, to) {
        if (is.factor(v)) {
            ## For factors, converting the levels is enough.
            levels(v) <- condconv(levels(v), from, to)
            return(v)
        }
        else if (is.character(v)) return(condconv(v, from, to))
        else return(v)
    }
    for (i in names(d)) d[, i] <- conv1(d[, i], from, to)
    return(d)
}
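A quick standalone check on a mock data frame (a condensed but behaviourally equivalent copy of the function is included so the snippet runs on its own; the sample strings are made up):

```r
## Condensed copy of foo(): conditionally re-encode character and
## factor columns, leave everything else untouched.
foo <- function(d, from = "latin1", to = "UTF-8") {
    condconv <- function(v) {
        cnv <- is.na(iconv(v, to, to))   # NA marks strings not valid in 'to'
        v[cnv] <- iconv(v[cnv], from, to)
        v
    }
    conv1 <- function(v) {
        if (is.factor(v)) { levels(v) <- condconv(levels(v)); v }
        else if (is.character(v)) condconv(v)
        else v
    }
    for (i in names(d)) d[, i] <- conv1(d[, i])
    d
}

## Mock data: one latin1 cell, one already-UTF-8 cell, one numeric column.
d <- data.frame(txt = c("d\xe9j\xe0", "d\u00e9j\u00e0"),
                n = 1:2, stringsAsFactors = FALSE)
d2 <- foo(d)
## Both text cells are now valid UTF-8; the numeric column is untouched.
```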
Any advice for enhancement is welcome...
Sincerely yours,
Emmanuel Charpentier