[R] Request for advice on character set conversions (those damn Excel files, again ...)
Emmanuel Charpentier
charpent at bacbuc.dyndns.org
Mon Sep 8 00:02:06 CEST 2008
Dear list,
I have to read a not-so-small bunch of not-so-small Excel files, which
seem to have traversed the Windows 3.1, Windows 95 and Windows NT versions
of the thing (with maybe a Mac or two thrown in for good measure...).
The problem is that 1) I need to read strings, and 2) those
strings may have various encodings. In the same sheet of the same file,
some cells may be latin1, some UTF-8 and some CP437 (!).
read.xls() allows me to read those things into sets of data frames. My
problem is to convert the encodings to UTF-8 without clobbering the
strings that already are (or at least look like) UTF-8.
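The crux is telling which strings are already valid UTF-8. One usable test is that iconv() returns NA when its input is not valid in the claimed source encoding; a minimal sketch (the sample strings are made up for illustration):

```r
## iconv(v, to, to) yields NA exactly when an element is not valid in
## 'to', so it flags which strings still need re-encoding.
v <- c("d\xe9j\xe0", "d\u00e9j\u00e0")    # latin1 bytes vs. valid UTF-8
bad <- is.na(iconv(v, "UTF-8", "UTF-8"))  # flags the latin1 entry only
v[bad] <- iconv(v[bad], "latin1", "UTF-8")
```

After the conditional conversion, both elements hold the same UTF-8 text.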
I came to the following solution:

foo <- function(d, from = "latin1", to = "UTF-8") {
    ## Semi-smart conversion of a data frame between charsets.
    ## Needed to ease use of those [@!] Excel files that have survived
    ## the Win3.1 --> Win95 --> NT transition, usually in poor shape...
    condconv <- function(v, from, to) {
        ## Convert only the elements not already valid in 'to':
        ## iconv(v, to, to) yields NA exactly for those.
        cnv <- is.na(iconv(v, to, to))
        v[cnv] <- iconv(v[cnv], from, to)
        return(v)
    }
    conv1 <- function(v, from, to) {
        if (is.factor(v)) {
            ## For factors, converting the levels is enough.
            levels(v) <- condconv(levels(v), from, to)
            return(v)
        }
        else if (is.character(v)) return(condconv(v, from, to))
        else return(v)
    }
    for (i in names(d)) d[, i] <- conv1(d[, i], from, to)
    return(d)
}
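A quick standalone check on a mock data frame (a condensed but behaviourally equivalent copy of the function is included so the snippet runs on its own; the sample strings are made up):

```r
## Condensed copy of foo(): conditionally re-encode character and
## factor columns, leave everything else untouched.
foo <- function(d, from = "latin1", to = "UTF-8") {
    condconv <- function(v) {
        cnv <- is.na(iconv(v, to, to))   # NA marks strings not valid in 'to'
        v[cnv] <- iconv(v[cnv], from, to)
        v
    }
    conv1 <- function(v) {
        if (is.factor(v)) { levels(v) <- condconv(levels(v)); v }
        else if (is.character(v)) condconv(v)
        else v
    }
    for (i in names(d)) d[, i] <- conv1(d[, i])
    d
}

## Mock data: one latin1 cell, one already-UTF-8 cell, one numeric column.
d <- data.frame(txt = c("d\xe9j\xe0", "d\u00e9j\u00e0"),
                n = 1:2, stringsAsFactors = FALSE)
d2 <- foo(d)
## Both text cells are now valid UTF-8; the numeric column is untouched.
```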
Any advice for enhancement is welcome...
Sincerely yours,
Emmanuel Charpentier