[R] WG: AW: Another problem with encoding

Fri Jan 4 19:27:39 CET 2008

Hello, Peter,  
	I talked with SPSS: there is a known bug with character encoding. They will fix it in the next release.
Regards,
Matthias

-----Original Message-----
From: Peter Dalgaard [mailto:P.Dalgaard at biostat.ku.dk] 
Sent: Wednesday, 2 January 2008 16:55
To: Matthias Wendel
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] WG: AW: Another problem with encoding

Matthias Wendel wrote:
> Hello, Peter,
> 	I tried it out: iconv(names(attributes(spss[,'Y6'])[[1]][14]), 
> "UTF-8", "LATIN1", sub='byte') yielded
>
> [1] "<c4>rzte Chirurgie" 
>
> and c4 corresponds to Ä in most encodings. What can I do next? I 
> wonder whether there is a more convenient way than replacing the occurrences of <..> with the appropriate character.
>   
Not sure what you want here. Isn't it just the reverse conversion, iconv(...., from="latin1", to="UTF-8")?

Notice that c4 is not Ä in UTF-8:

> iconv("Ä", to="ascii", sub="byte")
[1] "<c3><84>"

In fact, a lone c4 byte is not a valid character in UTF-8 (it can only start a two-byte sequence), hence the "invalid UTF-8 string" message.
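
A minimal sketch of that reverse conversion, assuming the labels really are latin1 (or Windows-1252) bytes. The string `x` and the vector `labels` are hypothetical stand-ins built from raw bytes, not the poster's data, and `validUTF8()` assumes a reasonably recent R (3.3.0 or later):

```r
# Hypothetical stand-in for one value label as read.spss delivered it:
# "Ärzte Chirurgie" stored as latin1 bytes (Ä = 0xc4). Built from raw bytes
# so the example does not depend on the encoding of this script.
x <- rawToChar(c(as.raw(0xc4), charToRaw("rzte Chirurgie")))

# A lone 0xc4 byte followed by an ASCII letter is not valid UTF-8.
validUTF8(x)

# Reverse conversion: declare the input as latin1 and ask for UTF-8.
utf8 <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(utf8)
charToRaw(utf8)[1:2]   # the two bytes that now encode Ä
utf8ToInt(utf8)[1]     # code point of Ä

# The same call is vectorised, so all names of a (hypothetical)
# value.labels vector can be converted in one step.
labels <- c(3)
names(labels) <- x
names(labels) <- iconv(names(labels), from = "latin1", to = "UTF-8")
```

If the conversion direction is right, no `<..>` placeholders need to be replaced by hand; `sub="byte"` is only useful for diagnosing which bytes are present.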
> Regards,
> Matthias
>
> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk]
> Sent: Tuesday, 1 January 2008 20:21
> To: Matthias Wendel
> Subject: Re: AW: [R] Another problem with encoding
>
> Matthias Wendel wrote:
>   
>> Happy new year and my apologies, Peter. Here are the missing facts:
>> I'm reading in an SPSS file, doing some calculations and putting the 
>> results in an XML file. The XML file is UTF-8 encoded, and so should the results and their labels be (e.g. Ärzte Chirurgie).
>> Here is part of the R session:
>>
>>   
>>     
> As a matter of principle: requests for more information are not offers to solve your problems personally. Stay on the list!
>
> The characters seem to travel OK in email, so latin1 is a guess. Have you tried the sub="byte" argument to iconv()?
>
>
>
>   
>>   
>>     
>>> Sys.getlocale()
>> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
>>   
>>     
>>> spss[,'Y6']
>>   [1]  6  3  8 11  8  9  6  8  3  5 10 15 NA  9  8  3  8 16  6  6 NA 10  5  2  7  7  6 16  7 15  7 10 12
>>  [34]  8  7 12 12 16  7  6  8  8 15  6 NA  8 99  7 12  8  9 16  7 16  8  7  7  1 15 12  8  7 10  7  8  7
>>  [67]  8  9  8  6  6  8  6 16 11  5 11 11  1 11  3  7  7 10 10 10  6 11 16 NA  1  3  2 10 99 10  3  3  9
>> [100]  7 16 99 16  1 10  2 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 NA 10 16 16 NA  6 10  5 11
>> [133] 11  1  1  1  1 16  1 16  1  1  1  1  6  6  6 16  8 16 16 16 16  5  6 10 99 11 11 10  6  6  1  1  6
>> [166]  1 11 11 16  9 11 16  6  8  8 16 16  8  6 16 16 12 12 12 12 12 12 12 16  9 16 15 12 12 15 10 16 15
>> [199]  4  1  2 14  4  4  2  5 NA  1  5  5  7  9  5 12 12 NA 16 12 12 12 12 12 12 12 12 12 99 NA 12 12 NA
>> [232]  1 16  1  7 11  5  6  7  1 13  6  8 16  2  1  5 16 16  9  8  8  8  7 16  8  8  2  8  5  4  6 14  5
>> [265] 14  8  8 14  4  4  8 14  8 14  6  2  3 14  3 16  5 15 15 15 15 15 15 15 15 15 15 15 13 13 13 13 13
>> [298] 13 13 13 13 13 13 13 13 15  6 NA 12  3  9  9 NA 10 16
>> attr(,"value.labels")
>>                           Verwaltung Servicegesellschaft Waldfriede (SKW) 
>>                                   16                                   15 
>>            Kurzzeitpflege Waldfriede                        Sozialstation 
>>                                   14                                   13 
>>                  Krankenpflegeschule              Med. Technischer Dienst 
>>                                   12                                   11 
>>                            Pflege OP                      Funktionsdienst 
>>                                   10                                    9 
>>                   Pflege Gynäkologie                     Pflege Chirurgie 
>>                                    8                                    7 
>>                        Pflege Innere            Ärzte Anästhesie, Röntgen 
>>                                    6                                    5 
>>                    Ärzte Gynäkologie                      Ärzte Chirurgie 
>>                                    4                                    3 
>>                         Ärzte Innere         Patientenberatung/-betreuung 
>>                                    2                                    1 
>>   
>>     
>>> names(attributes(spss[,'Y6'])[[1]][14])
>>>     
>>>       
>> [1] "Ärzte Chirurgie"
>>   
>>     
>>> iconv(names(attributes(spss[,'Y6'])[[1]][14]), "UTF-8", "LATIN1")
>>>     
>>>       
>> [1] NA
>>   
>>     
>>> utf8ToInt(names(attributes(spss[,'Y6'])[[1]][14]))
>>>     
>>>       
>> Error in utf8ToInt(names(attributes(spss[, "Y6"])[[1]][14])) : 
>>   invalid UTF-8 string
>>   
>>
>> Cheers,
>> Matthias
>>
>>
>> -----Original Message-----
>> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk]
>> Sent: Monday, 31 December 2007 10:45
>> To: Matthias Wendel
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] Another problem with encoding
>>
>> Matthias Wendel wrote:
>>   
>>     
>>> Hi
>>>     I've imported an SPSS file using read.spss. One variable has 
>>> values like 'Ärzte'. I thought these were UTF-8 encoded, but they are not 
>>> (as the results of iconv and utf8ToInt suggest). Is there any way to
>>> find out how these SPSS values are encoded?
>>   
>>     
>>>   
>>>     
>>>       
>> You are assuming a bit much of your readers.
>>
>> What exactly are you doing? Is it a value, a value label, or perhaps 
>> a variable name? How do the results of read.spss look on the R
>> side? How did you apply iconv and utf8ToInt? What is your locale?
>>
>> I mean, we could try and guess all those details, but you are the one with the hard info, and the motivation...
>>
>>   
>>     
>
>
>   

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907