[R] Request for advice on character set conversions (those damn Excel files, again ...)

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Mon Sep 8 22:50:12 CEST 2008


On Mon, 08 Sep 2008 01:45:51 +0200, Peter Dalgaard wrote:

> Emmanuel Charpentier wrote:
>> Dear list,
>>         

[ Snip ... ]

> This looks reasonably sane, I think. The last loop could be d[] <-
> lapply(d, conv1, from, to), but I think that is cosmetic. You can't
> really do much better because there is no simple way of distinguishing
> between the various 8-bit character sets.
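
A minimal sketch of the one-liner suggested above. The original conv1()
was in the snipped part of the thread, so the version below is only a
hypothetical stand-in (it re-encodes character and factor columns with
iconv() and leaves everything else untouched); the data frame d and the
latin1 -> UTF-8 direction are likewise assumptions made for illustration.

```r
# Hypothetical stand-in for the snipped conv1(): re-encode text columns,
# pass every other column through unchanged.
conv1 <- function(x, from, to) {
  if (is.factor(x)) {
    levels(x) <- iconv(levels(x), from = from, to = to)
    x
  } else if (is.character(x)) {
    iconv(x, from = from, to = to)
  } else {
    x
  }
}

# Example strings built from raw latin1 bytes, so the sketch does not
# depend on the encoding of the script file itself.
cafe  <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))        # "cafe" with e-acute
naive <- rawToChar(as.raw(c(0x6e, 0x61, 0xef, 0x76, 0x65)))  # "naive" with i-diaeresis

d <- data.frame(id = 1:2, name = c(cafe, naive), stringsAsFactors = FALSE)

# Assigning into d[] keeps d a data frame; a bare lapply() result would be
# a plain list.
d[] <- lapply(d, conv1, from = "latin1", to = "UTF-8")
```

The explicit loop over columns and this form should behave the same way;
as noted above, the difference is cosmetic.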

Thank you, Peter!

Could you point me to some not-so-simple (or even doubleplusunsimple) 
ways? I run into this problem fairly often, and I'd like to pull this 
splinter out of my poor tired foot once and for all... and I suppose 
that I am not alone in this predicament.

>                                           You could presumably set up
> some heuristics, like the fact that the occurrence of 0x82 or 0x8a
> probably indicates cp437, but it gets tricky. (At least, in French, you
> don't have the Danish/Norwegian peculiarity that upper/lowercase o-slash
> were missing in cp437, and therefore often replaced yen and cent symbols
> in matrix printer ROMs. We still get the occasional parcel addressed to
> "¥ster Farimagsgade".)
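
One way such a heuristic might be sketched in R, with all the caveats
above still applying. The function name guess_encoding(), its arguments,
and the latin1 fallback are assumptions made for illustration, not an
existing API. The only rule taken from the message is that bytes 0x82 and
0x8a (e-acute and e-grave in cp437) hint at cp437; the ASCII and UTF-8
checks are added so that those byte values are not misread when they
occur inside a valid UTF-8 sequence.

```r
# Guess an encoding label for each string from its raw bytes.
# A crude heuristic, not a detector: anything 8-bit that is neither valid
# UTF-8 nor carries a cp437 marker byte simply gets the fallback label.
guess_encoding <- function(x, cp437_markers = c(0x82, 0x8a),
                           fallback = "latin1") {
  vapply(x, function(s) {
    bytes <- as.integer(charToRaw(s))
    if (all(bytes < 0x80)) return("ASCII")      # pure 7-bit text
    if (validUTF8(s)) return("UTF-8")           # multi-byte sequences check out
    if (any(bytes %in% cp437_markers)) return("CP437")
    fallback
  }, character(1), USE.NAMES = FALSE)
}

# The same French word encoded three ways, built from raw bytes.
ete_cp437  <- rawToChar(as.raw(c(0x82, 0x74, 0x82)))
ete_latin1 <- rawToChar(as.raw(c(0xe9, 0x74, 0xe9)))
ete_utf8   <- rawToChar(as.raw(c(0xc3, 0xa9, 0x74, 0xc3, 0xa9)))

guesses <- guess_encoding(c("ete", ete_cp437, ete_latin1, ete_utf8))
```

The guessed label could then be passed as the from argument of iconv(),
provided the local iconv implementation knows that encoding name.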

Peter, you're gravely underestimating the ingenuity of some Excel 
l^Husers... (and your story is a possible candidate for a fortune() 
entry...).

					Emmanuel Charpentier
