[R] Strange characters that block import
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Oct 14 17:02:38 CEST 2009
On Wed, 14 Oct 2009, Duncan Murdoch wrote:
> On 10/14/2009 8:25 AM, arnaud Mosnier wrote:
>> Dear useRs,
>>
>> I am trying to import a text file that contains some strange characters,
>> which come from another program's misinterpretation of foreign-language
>> characters (see below).
>>
>> ----------------------------------------
>> Here is an example of text with a line containing characters that break
>> the import:
>>
>> name;number
>> zdsfbg;2
>> ;3
>> dtryjh;4
>>
>> ----------------------------------------
>>
>> R does not import the lines after those strange characters (i.e. it imports
>> only the first two lines: the header and the first line of data).
>>
>> I have already tried importing with other encodings such as latin1 or UTF-8,
>> but that does not solve the problem.
If these are control characters (that is, if ^Z means Ctrl-Z, though we
have no real information), then they are encoded identically in every
encoding that uses bytes (or at least in those known to iconv), so
changing the encoding cannot help.
>> Replacing those characters in a text editor before importing solves the
>> problem, but I do not want the users of my script to have to edit the text
>> before the analysis in R.
>>
>> Any hints?
>
> Those funny characters are octal 032, Ctrl-Z. Years ago that was defined on
> DOS/Windows as an end of file marker, and I guess our code still honours
> that.
More to the point, the Windows C run-time does (AFAIK Ctrl-Z is still
current as EOF under Windows, and Wikipedia says so too). But nothing
in the original posting mentioned that this was on Windows, and Ctrl-Z
has no effect on the two other OSes I tried, both of which read such a
file successfully.
So without a single piece of the 'at a minimum' information requested
in the posting guide, we are guessing (and I am guessing your example
was done under Windows, too).
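Rather than guessing, the bytes in the file can be inspected directly. What follows is only a minimal sketch and not something from the original posts: `find_control_bytes` is a hypothetical helper name, and the temporary file merely imitates the posted example on the assumption that the strange characters are Ctrl-Z (0x1A). Reading with `readBin` avoids any text-mode handling by the C run-time.

```r
# Hypothetical helper: tabulate the control bytes found in a file.
# The file is read as raw bytes, so no text-mode translation is involved.
find_control_bytes <- function(path) {
  codes <- as.integer(readBin(path, what = "raw", n = file.size(path)))
  # Control characters are 0-31; tab (9), LF (10) and CR (13) are expected in text
  is_ctrl <- codes < 32L & !(codes %in% c(9L, 10L, 13L))
  table(sprintf("0x%02X", codes[is_ctrl]))
}

# Assumed test data: the posted example with Ctrl-Z (octal 032) bytes in line 3
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("name;number\nzdsfbg;2\n\032\032 \032\032;3\ndtryjh;4\n"), tmp)

found <- find_control_bytes(tmp)
print(found)
```

An empty table would suggest that the problem lies elsewhere, for example in the encoding.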
> You can work around it by stating that you're reading from a binary file, not
> a text file:
>
> f <- file("text.txt", "rb")
>
> Then read.csv2(f) fails, but readLines(f) succeeds, so this works:
>
>> f <- file("c:/temp/test.txt", "rb")
>> read.csv2(textConnection(readLines(f)))
>                name number
> 1            zdsfbg      2
> 2 \032\032 \032\032      3
> 3            dtryjh      4
>
>> close(f)
>
> I don't know if there are any characters that would cause readLines to fail,
> but there might be, so I'd suggest replacing the buggy software that caused
> all the problems in the first place.
This is all a function of the OS's C runtime: I suspect Ctrl-D (EOT)
is interpreted as end-of-file on some OSes. NUL (\0) will terminate
strings (that's standard in C, and enforced in recent versions of R).
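If the aim is for users of a script not to have to edit the file by hand, one possibility is to read the raw bytes, drop the offending ones, and parse what remains. This is a sketch under stated assumptions, not a tested recommendation: `read_csv2_no_ctrl` is a hypothetical name, the bytes to drop (NUL and Ctrl-Z) are a guess based on this thread, and no attempt is made to handle non-native encodings. It relies on the `text` argument of `read.csv2`, which is available in current versions of R.

```r
# Hypothetical helper: parse a semicolon-separated file after removing
# NUL (0) and Ctrl-Z (26) bytes. Binary reading sidesteps the C run-time's
# text-mode end-of-file handling.
read_csv2_no_ctrl <- function(path, drop = c(0L, 26L), ...) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  keep <- !(as.integer(bytes) %in% drop)
  # rawToChar() is safe here because NUL bytes have been removed
  read.csv2(text = rawToChar(bytes[keep]), ...)
}

# Assumed test data: the posted example with Ctrl-Z (octal 032) bytes in line 3
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("name;number\nzdsfbg;2\n\032\032 \032\032;3\ndtryjh;4\n"), tmp)

dat <- read_csv2_no_ctrl(tmp, stringsAsFactors = FALSE)
print(dat)
```

Dropping bytes silently changes the data (here the name in the affected row becomes a single space), so replacing them with a visible placeholder may be preferable in some cases.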
> Duncan Murdoch
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595