[R] WG: AW: Another problem with encoding
Matthias Wendel
office at matthiaswendel.de
Fri Jan 4 19:27:39 CET 2008
Hello, Peter,
I talked with SPSS: there is a known bug with character encoding. They will fix it in the next release.
Regards,
Matthias
-----Original Message-----
From: Peter Dalgaard [mailto:P.Dalgaard at biostat.ku.dk]
Sent: Wednesday, 2 January 2008 16:55
To: Matthias Wendel
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] WG: AW: Another problem with encoding
Matthias Wendel wrote:
> Hello, Peter,
> I tried it out: iconv(names(attributes(spss[,'Y6'])[[1]][14]),
> "UTF-8", "LATIN1", sub='byte') yielded
>
> [1] "<c4>rzte Chirurgie"
>
> and c4 corresponds in most encodings to Ä. What can I do next? I
> wonder whether there is a more comfortable way than replacing each occurrence of <..> with the appropriate character.
>
Not sure what you want here. Isn't it just the reverse conversion, iconv(...., from="latin1", to="UTF-8")?
Notice that c4 is not Ä in UTF-8:
> iconv("Ä", to="ascii", sub="byte")
[1] "<c3><84>"
In fact, a c4 byte on its own is not valid UTF-8 at all (it is a lead byte that must be followed by a continuation byte), hence the "invalid string" message.
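
A minimal sketch of that reverse conversion, assuming the label really is latin1 (which is only a guess in this thread). The names label and fixed are made up for illustration, and the bytes are typed in by hand rather than read from the SPSS file:

```r
# Build "Ärzte Chirurgie" with a single 0xc4 byte for the Ä,
# i.e. the latin1 form that read.spss is assumed to have returned.
label <- rawToChar(as.raw(c(0xc4, utf8ToInt("rzte Chirurgie"))))

# 0xc4 followed by an ASCII byte is not a valid UTF-8 sequence.
validUTF8(label)

# Convert from the assumed source encoding to UTF-8 for the XML output.
fixed <- iconv(label, from = "latin1", to = "UTF-8")
validUTF8(fixed)

# The Ä is now the two-byte sequence c3 84.
charToRaw(fixed)[1:2]
```

If the conversion is right, utf8ToInt(fixed) no longer fails and its first element is the code point of Ä. Should the labels turn out to be in some other encoding, only the from= argument would need to change.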
> Regards,
> Matthias
>
> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk]
> Sent: Tuesday, 1 January 2008 20:21
> To: Matthias Wendel
> Subject: Re: AW: [R] Another problem with encoding
>
> Matthias Wendel wrote:
>
>> Happy new year and my apologies, Peter. Here are the missing facts:
>> I'm reading in a spss-file, doing some calculations and putting the
>> results in a xml file. The xml-file is UTF-8 encoded and so should the results and their labels (eg Ärzte Chirurgie):
>> Here is part of the R session:
>>
>>
>>
> As a matter of principle: Requests for more information are not offers that I will solve your problems personally. Stay on the list!
>
> The characters seem to travel OK in email, so latin1 is a guess. Have you tried the sub="byte" argument to iconv()?
>
>
>
>
>>
>>
>>> Sys.getlocale()
>> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
>>
>>
>>> spss[,'Y6']
>>   [1]  6  3  8 11  8  9  6  8  3  5 10 15 NA  9  8  3  8 16  6  6 NA 10  5  2  7  7  6 16  7 15  7 10 12
>>  [34]  8  7 12 12 16  7  6  8  8 15  6 NA  8 99  7 12  8  9 16  7 16  8  7  7  1 15 12  8  7 10  7  8  7
>>  [67]  8  9  8  6  6  8  6 16 11  5 11 11  1 11  3  7  7 10 10 10  6 11 16 NA  1  3  2 10 99 10  3  3  9
>> [100]  7 16 99 16  1 10  2 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 NA 10 16 16 NA  6 10  5 11
>> [133] 11  1  1  1  1 16  1 16  1  1  1  1  6  6  6 16  8 16 16 16 16  5  6 10 99 11 11 10  6  6  1  1  6
>> [166]  1 11 11 16  9 11 16  6  8  8 16 16  8  6 16 16 12 12 12 12 12 12 12 16  9 16 15 12 12 15 10 16 15
>> [199]  4  1  2 14  4  4  2  5 NA  1  5  5  7  9  5 12 12 NA 16 12 12 12 12 12 12 12 12 12 99 NA 12 12 NA
>> [232]  1 16  1  7 11  5  6  7  1 13  6  8 16  2  1  5 16 16  9  8  8  8  7 16  8  8  2  8  5  4  6 14  5
>> [265] 14  8  8 14  4  4  8 14  8 14  6  2  3 14  3 16  5 15 15 15 15 15 15 15 15 15 15 15 13 13 13 13 13
>> [298] 13 13 13 13 13 13 13 13 15  6 NA 12  3  9  9 NA 10 16
>> attr(,"value.labels")
>> Verwaltung                 Servicegesellschaft Waldfriede (SKW)
>> 16                         15
>> Kurzzeitpflege Waldfriede  Sozialstation
>> 14                         13
>> Krankenpflegeschule        Med. Technischer Dienst
>> 12                         11
>> Pflege OP                  Funktionsdienst
>> 10                         9
>> Pflege Gynäkologie         Pflege Chirurgie
>> 8                          7
>> Pflege Innere              Ärzte Anästhesie, Röntgen
>> 6                          5
>> Ärzte Gynäkologie          Ärzte Chirurgie
>> 4                          3
>> Ärzte Innere               Patientenberatung/-betreuung
>> 2                          1
>>
>>
>>> names(attributes(spss[,'Y6'])[[1]][14])
>> [1] "Ärzte Chirurgie"
>>
>>> iconv(names(attributes(spss[,'Y6'])[[1]][14]), "UTF-8", "LATIN1")
>> [1] NA
>>
>>> utf8ToInt(names(attributes(spss[,'Y6'])[[1]][14]))
>> Error in utf8ToInt(names(attributes(spss[, "Y6"])[[1]][14])) :
>> invalid UTF-8 string
>>
>>
>> Cheers,
>> Matthias
>>
>>
>> -----Original Message-----
>> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk]
>> Sent: Monday, 31 December 2007 10:45
>> To: Matthias Wendel
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] Another problem with encoding
>>
>> Matthias Wendel wrote:
>>
>>
>>> Hi
>>> I've imported an spss-file using read.spss. One variable has
>>> values like 'Ärzte'. I thought this was UTF-8 encoded, but it is not
>>> (as the results of iconv and utf8ToInt suggest). Is there any way to
>>> find out how these spss-values are encoded?
>>
>> You are assuming a bit much of your readers.
>>
>> What exactly are you doing? Is it a value, a value label, or perhaps
>> a variable name? How do the results of read.spss look on the R
>> side? How did you apply iconv and utf8ToInt? What is your locale?
>>
>> I mean, we could try and guess all those details, but you are the one with the hard info, and the motivation...
>>
>>
>>
>
>
>
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907