[R] Umlaut read from csv-file

Heinz Tuechler tuechler at gmx.at
Sun Nov 9 10:53:41 CET 2008


At 06:25 09.11.2008, Prof Brian Ripley wrote:
>On Sat, 8 Nov 2008, Heinz Tuechler wrote:
>
>>At 08:01 08.11.2008, Prof Brian Ripley wrote:
>>>We have no idea what you understood (you didn't tell us), but the help says
>>>encoding: character vector.  The encoding(s) to be assumed when 'file'
>>>           is a character string: see 'file'.  A possible value is
>>>           '"unknown"': see the 'Details'.
>>>...
>>>      This paragraph applies if 'file' is a filename (rather than a
>>>      connection).  If 'encoding = "unknown"', an attempt is made to
>>>      guess the encoding.  The result of 'localeToCharset()' is used as
>>>      a guide.  If 'encoding' has two or more elements, they are tried
>>>      in turn until the file/URL can be read without error in the trial
>>>      encoding.
>>>So source(encoding="latin1") says the file is 
>>>encoded in Latin-1 and should be re-encoded if 
>>>necessary (e.g. in a UTF-8 locale).
>>>Setting the Encoding of parsed character strings is not mentioned.
>>>You could have written out a data frame with 
>>>write.csv() and re-read it with 
>>>read.csv(encoding = "latin1"): that was the 
>>>workaround you were given earlier (not to use source).
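
A minimal sketch of the read.csv(encoding = "latin1") step described above. Assumptions: the file name comes from tempfile(), the column names id and name are made up for illustration, and the file is built byte by byte (instead of with write.csv()) so that the example does not depend on the current locale.

```r
# Build a small Latin-1 encoded CSV byte by byte; 0xe4 is "ä" in Latin-1.
csv_bytes <- c(charToRaw("id,name\n1,a\n2,"), as.raw(0xe4), charToRaw("\n"))
tmp <- tempfile(fileext = ".csv")
writeBin(csv_bytes, tmp)

# encoding = "latin1" marks the strings that are read as Latin-1;
# it does not re-encode the input.
d <- read.csv(tmp, encoding = "latin1", stringsAsFactors = FALSE)
Encoding(d$name)

# Once a string is marked, converting it to UTF-8 is well defined.
name_utf8 <- enc2utf8(d$name[2])
utf8ToInt(name_utf8)

unlink(tmp)
```

Pure ASCII strings such as "a" are never marked, so only the element containing the umlaut should report an encoding other than "unknown".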
>>
>>Thank you for this explanation. I felt that I 
>>did not understand the help page of source() 
>>and I hoped, encoding='latin1' would have the 
>>same effect as in read.csv(), but rethinking 
>>it, I see that it would conflict with the primary functionality of source().
>>Earlier I tried writing the data.frame with 
>>write.csv and re-reading it. This works, but 
>>additional information like labels(), I have to transfer in a second step.
>>The best way I could imagine, would be some 
>>function, which marks every character string in 
>>the whole structure of a data.frame, including all attributes, as latin1.
>
>I think it is possible that
>
>con <- file("foo")
>source(con, encoding="latin1")
>close(foo)
>
>will also do what you want, although that's an undocumented side effect.

You are right. It does work in my real data problem. Thank you.

(minor remark: I think close(foo) should be close(con))
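
Regarding the function wished for in the quoted text above (one that marks every character string in a data.frame, including all attributes, as latin1), a minimal sketch might look like the following. The name mark_latin1 is hypothetical and not part of R or of any package. The sketch assumes that every non-ASCII string in the object really holds Latin-1 bytes, because `Encoding<-` only declares the encoding and converts nothing.

```r
# Hypothetical helper (not part of base R): declare every character string
# in an object, including strings stored in attributes, as Latin-1.
mark_latin1 <- function(x) {
  att <- attributes(x)
  if (is.character(x)) {
    # Only sets the mark; bytes are unchanged and pure ASCII stays "unknown".
    Encoding(x) <- "latin1"
  } else if (is.list(x)) {
    # Covers data frames and plain lists; attributes are restored below.
    x <- lapply(unclass(x), mark_latin1)
  }
  if (!is.null(att)) {
    # Names, factor levels, labels and the like live in attributes.
    attributes(x) <- lapply(att, mark_latin1)
  }
  x
}

# Latin-1 bytes built from raw values, so the example does not depend on
# the encoding of this message: 0xfc is "ü", 0xe4 is "ä".
u_umlaut <- rawToChar(as.raw(0xfc))
label    <- rawToChar(c(charToRaw("N"), as.raw(0xe4), charToRaw("me")))

df <- data.frame(id = 1:2, stringsAsFactors = FALSE)
df$name <- c("a", u_umlaut)
attr(df$name, "label") <- label

marked <- mark_latin1(df)
Encoding(marked$name)
Encoding(attr(marked$name, "label"))
```

Since the mark is only a declaration, such a helper would mislabel strings that are actually UTF-8; it only fits a situation where the source is known to be Latin-1.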


>But all of this should be unnecessary in 
>R-patched (although it is possible that there 
>are other quirks with unmarked strings lurking 
>in the shadows, there are no other obvious changes from 2.7.2).
>
>>
>>>On Sat, 8 Nov 2008, Heinz Tuechler wrote:
>>>
>>>>At 16:52 07.11.2008, Prof Brian Ripley wrote:
>>>>>On Fri, 7 Nov 2008, Peter Dalgaard wrote:
>>>>>
>>>>>>Heinz Tuechler wrote:
>>>>>>>Dear Prof.Ripley!
>>>>>>>Thank you very much for your attention. In the given example Encoding(),
>>>>>>>or the encoding parameter of read.csv solve the problem. I hope your
>>>>>>>patch will solve also the problem, when I read a spss file by
>>>>>>>spss.get(), since this function has no encoding parameter and my real
>>>>>>>problem originated there.
>>>>>>read.spss() (package foreign) does have a reencode argument, though; and
>>>>>>this is called by spss.get(), so it looks like an easy hack to add it
>>>>>>there.
>>>>>Yes, older software like spss.get needs to 
>>>>>get updated for the internationalization 
>>>>>age.  Modifying it to have a ... argument 
>>>>>passed to read.spss would be a good idea (and future-proofing).
>>>>>In cases like this it is likely that the 
>>>>>SPSS file does contain its encoding 
>>>>>(although sometimes it does not and 
>>>>>occasionally it is wrong), so it is helpful 
>>>>>to make use of the info if it is 
>>>>>there.  However, the default is 
>>>>>read.spss(reencode=NA) because the 
>>>>>problems of assuming that the info is correct when it is not are worse.
>>>>The cause, why I tried the example below was 
>>>>to solve the encoding by dumping and then 
>>>>re-sourcing a data.frame with the encoding 
>>>>parameter set to latin1. As you can see, 
>>>>source(x, encoding='latin1') does not have 
>>>>the effect I expected. Unfortunately I do not 
>>>>have any idea, what I understood wrong 
>>>>regarding the meaning of encoding='latin1'.
>>>>Heinz Tüchler
>>>>
>>>>us <- c("a", "b", "c", "ä", "ö", "ü")
>>>>Encoding(us)
>>>>[1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
>>>>dump('us', 'us_dump.txt')
>>>>rm(us)
>>>>source('us_dump.txt', encoding='latin1')
>>>>us
>>>>[1] "a" "b" "c" "ä" "ö" "ü"
>>>>Encoding(us)
>>>>[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
>>>>unlink('us_dump.txt')
>>>>
>>>>
>>>>>--
>>>>>Brian D. Ripley,                  ripley at stats.ox.ac.uk
>>>>>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>>>>>University of Oxford,             Tel:  +44 1865 272861 (self)
>>>>>1 South Parks Road,                     +44 1865 272866 (PA)
>>>>>Oxford OX1 3TG, UK                Fax:  +44 1865 272595


