[R] read.spss and umlaut
Thomas Kuster
r at fam-kuster.ch
Thu Aug 3 10:38:13 CEST 2006
Hello
On Wednesday, 2 August 2006 at 17:11, Thomas Lumley wrote:
> This sounds like a conflict between encodings -- e.g. if R is assuming UTF-8
> and the file is encoded in Latin-1 then the sequence
> U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
> U+0072 : LATIN SMALL LETTER R
> is coded as FC72 in the file, which is an illegal byte sequence in UTF-8.
Hex:  74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
Text: t  e     f  ü  r     a  l  l  e  S  E  /  1  6
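The conflict described in the quote can be reproduced directly in R. This is only a minimal sketch: it assumes a reasonably recent R (`validUTF8()` does not exist in 2.1.0) and it assumes the bytes in the file are Latin-1, which has not been confirmed for projets.por.

```r
# The bytes "f", 0xFC, "r" as they appear in the hex dump
bytes <- as.raw(c(0x66, 0xfc, 0x72))
x <- rawToChar(bytes)

# 0xFC followed by 0x72 is not a valid UTF-8 sequence
is_valid <- validUTF8(x)

# Treat the same bytes as Latin-1 (assumption) and convert to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")

# In UTF-8 the umlaut takes two bytes (c3 bc) instead of one (fc)
y_bytes <- charToRaw(y)
```

If `is_valid` is FALSE while the conversion from Latin-1 gives a readable string, the file content and the session encoding most likely disagree.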
> The underlying C code (being written in the US quite a long time ago)
> doesn't know about encodings, and I don't know what the rules are in SPSS
> for valid characters (I suspect that in these old portable file formats it
> probably just reads and writes bytes, leaving it up to the OS to interpret
> them).
But why does the C code stop reading? Is "/" not the end mark of the string?
What would the problem be if I changed that in the source?
> You could try running R in a non-UTF-8 locale to see if it helps.
I think my locale is non-UTF-8 (de_CH, isolatin). How can I check that, and
set another one temporarily?
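For reference, the locale can be inspected and changed from inside an R session. The sketch below uses only base R functions; the locale name "de_CH.ISO-8859-1" mentioned in the comment is an assumed example and has to be installed on the system, so the code itself switches to "C", which is always available.

```r
# Show the character-type locale R is currently using
old_ctype <- Sys.getlocale("LC_CTYPE")
print(old_ctype)

# l10n_info() reports whether the session is running in UTF-8
info <- l10n_info()
print(info[["UTF-8"]])

# Temporarily switch to a non-UTF-8 locale. "C" exists everywhere;
# a Latin-1 locale such as "de_CH.ISO-8859-1" is an assumed name.
Sys.setlocale("LC_CTYPE", "C")
in_c_locale <- Sys.getlocale("LC_CTYPE")

# Restore the original setting afterwards
Sys.setlocale("LC_CTYPE", old_ctype)
```

Any call to read.spss() placed between the two `Sys.setlocale()` calls would run under the temporary locale.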
A dirty hack like this:
sed s/ä/ae/g | sed s/ö/oe/g | sed s/ü/ue/g | sed s/Ä/Ae/g | sed s/Ö/Oe/g | sed s/Ü/Ue/g
didn't work (file 'projets_non_umlaut.por' is not in any supported SPSS format).
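One possible reason for this failure, stated as an assumption rather than a confirmed diagnosis: as the portable format is described in the PSPP documentation, a string is preceded by its length written in base 30 and terminated by "/". Under that reading, the "N/" in front of "EV Postdienste für alle" is a length prefix and not an end mark, and replacing "ü" with "ue" makes the string longer than the prefix says. A small sketch of the arithmetic:

```r
# Assumption: .por strings carry a base-30 length prefix ending in "/"
# (digits 0-9, then A-T), so "N" would stand for 23
prefix_len <- strtoi("N", base = 30L)

# The label as it appears in the file, with the umlaut written as an escape
label <- "EV Postdienste f\u00fcr alle"
len_before <- nchar(label, type = "chars")

# The sed-style substitution makes the string one character longer
len_after <- nchar(gsub("\u00fc", "ue", label), type = "chars")
```

Here the prefix and the original label length agree, while the substituted label no longer matches the prefix, which would be consistent with the reader rejecting the modified file.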
Thomas
> If anyone has definitive information about how SPSS represents strings and
> decides on valid characters that might be useful too.
>
> -thomas
>
> >> library("foreign")
> >> spssdaten <- read.spss("projets.por")
> >> attr(spssdaten$PROJETX, "value.labels")[1:20]
> >
> > Bg Stammzellenforschung Bb
> > 863
> > 862 Bb Neugestaltung des Finanzausgleichs
> > 861
> > 854 EV Postdienste f Bb 853
> > 852 Bb Bg Steuerpaket 851
> > 843 Bb Anhebung der Mehrwertsteuer s
> > 11. AHV-Revision 842
> > 841 Volkinitiative Lebenslange Verwahrung
> > 833
> > 832 Gegenentwurf zur Avanti EV Lehrstellen-Initiative 831
> > 824 EV Moratorium Plus
> > EV Strom ohne Atom 823 822 EV Ja zu
> > fairen Mieten EV Gleiche Rechte f 821
> > 815 EV Gesundheitsinitiative EV
> > Sonntags-Initiative 814 813
> >
> > The SPSS-File is okay:
> >> system("cat projets.por |grep Postdienste")
> >
> > echtserwerb 3. GenerationSD/N/EV Postdienste für alleSE/16/Änderrung Bg
> > EOG Mut
> >
> > How can I read the SPSS-File with the Umlaut?
> >
> > Bye
> > Thomas Kuster
> >
> > R: 2.1.0 (2005-04-18)
> > OS: Debian Linux, 2.6.10-isgee-neptun-1
> >
>
> Thomas Lumley Assoc. Professor, Biostatistics
> tlumley at u.washington.edu University of Washington, Seattle