[R] Request for advice on character set conversions (those damn Excel files, again ...)
Emmanuel Charpentier
charpent at bacbuc.dyndns.org
Mon Sep 8 22:50:12 CEST 2008
On Mon, 08 Sep 2008 01:45:51 +0200, Peter Dalgaard wrote :
> Emmanuel Charpentier wrote:
>> Dear list,
>>
[ Snip ... ]
> This looks reasonably sane, I think. The last loop could be d[] <-
> lapply(d, conv1, from, to), but I think that is cosmetic. You can't
> really do much better because there is no simple way of distinguishing
> between the various 8-bit character sets.
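For readers without the snipped code, the lapply() form suggested above can be sketched as follows. This is only an illustration: the conv1() shown here is a hypothetical stand-in for the snipped original, and the toy data frame d is invented for the example.

```r
# Hypothetical stand-in for the snipped conv1(): re-encode character
# vectors and factor levels, and leave every other column type alone.
conv1 <- function(x, from, to) {
  if (is.character(x)) {
    iconv(x, from = from, to = to)
  } else if (is.factor(x)) {
    levels(x) <- iconv(levels(x), from = from, to = to)
    x
  } else {
    x
  }
}

# Toy data frame (an assumption, not the original data): one label stored
# as raw Latin-1 bytes, where 0xE9 is e-acute.
d <- data.frame(
  label = rawToChar(as.raw(c(0xe9, 0x74, 0xe9))),
  n     = 1L,
  stringsAsFactors = FALSE
)

# Assigning into d[] replaces every column but keeps d a data frame.
d[] <- lapply(d, conv1, from = "latin1", to = "UTF-8")
```

The d[] on the left-hand side matters: a plain d <- lapply(...) would return a list rather than a data frame.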
Thank you Peter !
Could you point me to some not-so-simple (or even doubleplusunsimple)
ways ? I get the problem not so rarely, and I'd like to pull this shard
outta my poor tired foot once and for all... and I suppose that I am not
alone in this predicament.
> You could presumably set up
> some heuristics, like the fact that the occurrence of 0x82 or 0x8a
> probably indicates cp437, but it gets tricky. (At least, in French, you
> don't have the Danish/Norwegian peculiarity that upper/lowercase o-slash
> were missing in cp437, and therefore often replaced yen and cent symbols
> in matrix printer ROMs. We still get the occasional parcel addressed to
> "¥ster Farimagsgade".)
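A byte-level heuristic of the kind hinted at above might look like the sketch below. It is a rough guess, not a reliable detector: guess_encoding() is a hypothetical helper, the order of the checks is an assumption, and the "latin1" fallback is arbitrary. The UTF-8 test comes first because 0x82 and 0x8a also occur as continuation bytes in valid UTF-8. In cp437 those two bytes are e-acute and e-grave, while in Windows-1252 they are a low quotation mark and S-caron, both unlikely in French text.

```r
# Hypothetical helper: guess the encoding of a character vector from its
# raw bytes. validUTF8() requires R >= 3.3.0.
guess_encoding <- function(x) {
  x <- x[!is.na(x)]
  bytes <- as.integer(unlist(lapply(x, charToRaw)))
  # No byte above 0x7F: plain ASCII, nothing to convert.
  if (!any(bytes >= 0x80)) return("ASCII")
  # Non-ASCII bytes that all form valid UTF-8 sequences.
  if (all(validUTF8(x))) return("UTF-8")
  # The hint from the thread: 0x82 or 0x8a probably means cp437.
  if (any(bytes %in% c(0x82, 0x8a))) return("CP437")
  # Arbitrary fallback for any other 8-bit text.
  "latin1"
}

# The same three letters (e-acute, t, e-acute) in three byte encodings.
s437   <- rawToChar(as.raw(c(0x82, 0x74, 0x82)))
slatin <- rawToChar(as.raw(c(0xe9, 0x74, 0xe9)))
sutf8  <- rawToChar(as.raw(c(0xc3, 0xa9, 0x74, 0xc3, 0xa9)))

guesses <- c(guess_encoding(s437), guess_encoding(slatin),
             guess_encoding(sutf8), guess_encoding("abc"))
```

The guess can then be passed as the from argument of iconv(), provided the name appears in iconvlist() on the platform at hand; "CP437" in particular is not available in every iconv build.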
Peter, you're gravely underestimating the ingenuity of some Excel
l^Husers... (and your story is a possible candidate for a fortune()
entry...).
Emmanuel Charpentier