[R] read.spss and encodings
Peter Dalgaard
P.Dalgaard at biostat.ku.dk
Thu Feb 1 14:18:30 CET 2007
Thomas Friedrichsmeier wrote:
> Hi!
>
> I'm having trouble with importing spss files containing non-ascii characters
> (R 2.4.1, debian linux, i386). To reproduce:
>
> Download the following file:
> http://statmath.wu-wien.ac.at/data/spss/de/comphomeneu.sav
>
> require (foreign)
> Sys.setlocale (locale="C")
> read.spss("comphomeneu.sav")$ARBEIT[1]
> # prints:
> # [1] im B\374ro
> # Levels: im B\374ro zuhause
>
> \374 of course is actually a u-umlaut. However, I guess in the C locale it's
> not expected to print as such. But now try this (use any UTF-8 locale you may
> have installed):
>
> Sys.setlocale (locale="de_DE.UTF-8")
> read.spss("comphomeneu.sav")$ARBEIT[1]
> # prints:
> # [1]Error in print.default(xx, quote = quote, ...) :
> # invalid multibyte string
>
> To me it looks like read.spss() probably needs an encoding parameter,
> and/or some iconv() magic. Now, locale conversion always makes my head
> spin, so I thought I'd better post here before calling this a bug in
> R. Two questions:
>
> 1) Is there some way to work around this, i.e. make sure it is converted to
> proper UTF-8 while importing? Am I missing something obvious?
>
> 2) Should I submit this as a bug report?
>
1) Yes, 2) No
The problem is not really in read.spss, but in R itself. The short version is
that in released versions, we have
> "Im B\374ro"
[1]Error: invalid multibyte string
which is indeed a buglet, since it is not good if you cannot output what
you can input (notice that there is no problem until you try to print).
In r-devel, this has become
> "Im B\374ro"
[1] "Im B\xfcro"
so that invalid multibyte strings at least no longer cause an error. However,
the real issue is that the string is in the wrong encoding for your locale
(latin1 rather than UTF-8), so you should convert it:
> iconv("Im B\xfcro", from="latin1", to="UTF-8")
[1] "Im Büro"
> iconv("Im B\374ro",from="latin1", to="UTF-8")
[1] "Im Büro"
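To apply the same conversion to everything read.spss returns, something along
these lines might do. This is only a sketch: recode_latin1 is a made-up helper
name, not part of foreign, and it assumes that every string in the file really
is latin1, which the file itself does not guarantee.

```r
# Hypothetical helper (not part of the foreign package): re-encode all
# factor levels and character vectors in an imported list or data frame.
# Assumes the source strings are latin1; adjust 'from' if they are not.
recode_latin1 <- function(x, from = "latin1", to = "UTF-8") {
  for (nm in names(x)) {
    col <- x[[nm]]
    if (is.factor(col)) {
      # Only the level labels hold text; the integer codes stay untouched.
      levels(x[[nm]]) <- iconv(levels(col), from = from, to = to)
    } else if (is.character(col)) {
      x[[nm]] <- iconv(col, from = from, to = to)
    }
  }
  x
}

# Intended use with the file from the original post:
# dat <- recode_latin1(read.spss("comphomeneu.sav"))
# dat$ARBEIT[1]
```

Numeric variables are passed through unchanged, so the helper can be applied
to the whole result without picking out the string variables first.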
-p
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907