[R] Umlaut read from csv-file
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sat Nov 8 08:01:34 CET 2008
We have no idea what you understood (you didn't tell us), but the help
says
encoding: character vector. The encoding(s) to be assumed when 'file'
is a character string: see 'file'. A possible value is
'"unknown"': see the ‘Details’.
...
This paragraph applies if 'file' is a filename (rather than a
connection). If 'encoding = "unknown"', an attempt is made to
guess the encoding. The result of 'localeToCharset()' is used as
a guide. If 'encoding' has two or more elements, they are tried
in turn until the file/URL can be read without error in the trial
encoding.
So source(encoding="latin1") says the file is encoded in Latin-1 and
should be re-encoded if necessary (e.g. in a UTF-8 locale).
Setting the Encoding of parsed character strings is not mentioned.
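
The difference between re-encoding a string and merely marking its
encoding can be seen in a small sketch. It uses only base R; the byte
0xe4 (a-umlaut in Latin-1) and the variable names are chosen purely for
illustration.

```r
# One byte, 0xe4, which is a-umlaut in Latin-1.
x <- "\xe4"
Encoding(x) <- "latin1"      # marking: the bytes are left untouched
raw_marked <- charToRaw(x)   # still the single byte e4

# Re-encoding converts the bytes themselves.
y <- iconv(x, from = "latin1", to = "UTF-8")
raw_utf8 <- charToRaw(y)     # two bytes, c3 a4
enc_y <- Encoding(y)         # iconv() marks its UTF-8 output as "UTF-8"
```

Marking changes only what R believes about the bytes; re-encoding
changes the bytes.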
You could have written out a data frame with write.csv() and re-read it
with read.csv(encoding = "latin1"): that was the workaround you were given
earlier (not to use source).
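
A minimal sketch of that workaround follows. It assumes the file on disk
really holds Latin-1 bytes; to keep the example independent of the
current locale, the file is written byte-for-byte with
writeLines(useBytes = TRUE) as a stand-in for write.csv() run in a
Latin-1 locale. The file name and column name are illustrative only.

```r
# Strings holding the Latin-1 bytes for a-, o- and u-umlaut, marked as such.
us <- c("a", "b", "c", "\xe4", "\xf6", "\xfc")
Encoding(us) <- "latin1"

# Write a one-column CSV with a header line, keeping the bytes as they are.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("us", us), csv_file, useBytes = TRUE)

# Declare the file's encoding when reading it back.
df <- read.csv(csv_file, encoding = "latin1", stringsAsFactors = FALSE)
enc <- Encoding(df$us)   # inspect how the strings read back are marked
unlink(csv_file)
```

Inspecting enc shows how the non-ASCII entries were marked on input;
pure ASCII strings are always reported as "unknown".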
On Sat, 8 Nov 2008, Heinz Tuechler wrote:
> At 16:52 07.11.2008, Prof Brian Ripley wrote:
>> On Fri, 7 Nov 2008, Peter Dalgaard wrote:
>>
>>> Heinz Tuechler wrote:
>>>> Dear Prof.Ripley!
>>>>
>>>> Thank you very much for your attention. In the given example, Encoding()
>>>> or the encoding parameter of read.csv solves the problem. I hope your
>>>> patch will also solve the problem when I read an SPSS file with
>>>> spss.get(), since this function has no encoding parameter and my real
>>>> problem originated there.
>>>
>>> read.spss() (package foreign) does have a reencode argument, though; and
>>> this is called by spss.get(), so it looks like an easy hack to add it
>>> there.
>>
>> Yes, older software like spss.get needs to get updated for the
>> internationalization age. Modifying it to have a ... argument passed to
>> read.spss would be a good idea (and future-proofing).
>>
>> In cases like this it is likely that the SPSS file does contain its
>> encoding (although sometimes it does not and occasionally it is wrong), so
>> it is helpful to make use of the info if it is there. However, the default
>> is read.spss(reencode=NA) because the problems of assuming that the info
>> is correct when it is not are worse.
>
> The reason I tried the example below was to solve the encoding problem by
> dumping and then re-sourcing a data.frame with the encoding parameter set to
> latin1. As you can see, source(x, encoding='latin1') does not have the effect
> I expected. Unfortunately I have no idea what I misunderstood about the
> meaning of encoding='latin1'.
>
> Heinz Tüchler
>
>
> us <- c("a", "b", "c", "ä", "ö", "ü")
> Encoding(us)
> [1] "unknown" "unknown" "unknown" "latin1" "latin1" "latin1"
> dump('us', 'us_dump.txt')
> rm(us)
> source('us_dump.txt', encoding='latin1')
> us
> [1] "a" "b" "c" "ä" "ö" "ü"
> Encoding(us)
> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> unlink('us_dump.txt')
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595