[R-sig-Debian] invalid multibyte string at '<a0>'
Paul Johnson
pauljohn32 at gmail.com
Tue Aug 16 19:53:46 CEST 2011
On Tue, Aug 16, 2011 at 9:30 AM, Matthieu Stigler
<matthieu.stigler at gmail.com> wrote:
> Thanks a lot Anne!
>
> I'm somehow glad to see the problem is confirmed... indeed one can
> manipulate the file, but at best I would have wished to have a more direct
> "R solution"... (there are quite many files).
>
> Anyone would have other suggestion?
I suggest you clean up the most obvious sources of trouble--get rid
of spaces in the variable names, and put quotation marks around the
text values in the dataset where spaces are present.
I don't think the problem in that file is line endings. It is
multibyte characters. The output from "file" cited earlier is just
wrong, I think. There are definitely other characters in there.
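One way to check for such characters from inside R, rather than trusting "file", is sketched below. It is only an illustration: the temporary file is a hypothetical stand-in for prob.csv (the original file is not reproduced here), built so that one line contains the byte 0xA0 named in the error message. In Latin-1 that byte is a no-break space; on its own it is not valid UTF-8. Note that validUTF8() needs R 3.3.0 or later, so it was not available when this thread was written.

```r
# Hypothetical stand-in for prob.csv: the second line contains the raw
# byte 0xA0, which is not a valid UTF-8 sequence by itself.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Location,X,Time Period\n"),
           charToRaw("Albania,"), as.raw(0xA0), charToRaw(",2009\n"),
           charToRaw("Algeria,,2006\n")),
         tmp)

# readLines() keeps the bytes as they are; nothing is re-encoded here.
lines <- readLines(tmp, warn = FALSE)

# Indices of the lines that are not valid UTF-8 (needs R >= 3.3.0).
bad <- which(!validUTF8(lines))
bad   # 2
```

Running the same check over a directory of files would show which ones need attention before read.table() is tried.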
I fiddled around with that file for a while. If you open it in Emacs,
you see lots of trouble signs. The last character on many of the lines
shows as a long underscore. It is not a regular underscore; it is a
multi-byte character. If you replace that character by NA, then you
can open the file with read.table("prob.csv", header=T, sep=",").
So that makes me wonder what is special about your Windows system that
it will overlook these problems.
If you change that last symbol to NA, you will get quite a different
response from R.
Note that it still can't make sense of your variable names, but it is
no longer an outright error:
> read.csv("prob2.csv", header=T, sep=",")
              Location  X Time.Period
1          Afghanistan NA        2004
2              Albania NA        2009
3              Algeria NA        2006
4               Angola NA        2001
5  Antigua and Barbuda NA        2006
6            Argentina NA        2008
7              Armenia NA        2007
8            Australia NA        2000
9              Austria NA        2006
10          Azerbaijan NA        2006
11             Bahamas NA        2006
12          Bangladesh NA        2007
13            Barbados NA        2006
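Since there are many files, the same replacement can be done in R instead of by hand in an editor. The following is a minimal sketch under stated assumptions, not a tested fix for the real data: the temporary file and its header are hypothetical stand-ins for prob.csv, and the stray character is assumed to be the single byte 0xA0 from the error message.

```r
# Hypothetical stand-in for prob.csv: the middle field of each data
# line holds only the raw byte 0xA0.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Location,X,Time Period\n"),
           charToRaw("Albania,"), as.raw(0xA0), charToRaw(",2009\n"),
           charToRaw("Algeria,"), as.raw(0xA0), charToRaw(",2006\n")),
         tmp)

raw_lines <- readLines(tmp, warn = FALSE)

# Build the one-byte pattern from its raw value, then replace it byte
# by byte. fixed = TRUE and useBytes = TRUE keep gsub() from trying to
# interpret the input as multibyte characters.
stray <- rawToChar(as.raw(0xA0))
clean <- gsub(stray, "NA", raw_lines, fixed = TRUE, useBytes = TRUE)

# The cleaned lines are plain ASCII and can be parsed directly.
dat <- read.csv(text = clean, header = TRUE)
dat
```

Wrapped in a function and applied with lapply() over list.files(), this would handle a whole directory, assuming every file has the same problem.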
I don't think this is an R-sig-Debian question. This is a general R
question. I'm certain that if you can get Brian Ripley's attention on
r-help, you will get the best information. He did the encoding
work in R 2.0.
I'm pretty sure his first thought will be "did you read the manual?"
?read.table
fileEncoding: character string: if non-empty declares the encoding used
          on a file (not a connection) so the character data can be
          re-encoded. See the ‘Encoding’ section of the help for
          ‘file’, the ‘R Data Import/Export Manual’ and ‘Note’.
encoding: encoding to be assumed for input strings. It is used to mark
          character strings as known to be in Latin-1 or UTF-8 (see
          ‘Encoding’): it is not used to re-encode the input, but
          allows R to handle encoded strings in their native encoding
          (if one of those two). See ‘Value’.
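The difference between marking and re-encoding is easy to miss in that help text, so here is a small illustration on a single string rather than a file. It is a sketch only: the example word is made up, Encoding<-() stands in for what "encoding" does (attach a label), and iconv() to UTF-8 stands in for the re-encoding that "fileEncoding" requests (read.table itself re-encodes to the native encoding, which is assumed to be UTF-8 here).

```r
# "Curacao" with a c-cedilla, written as Latin-1 bytes (0xE7 is the cedilla).
x <- rawToChar(as.raw(c(0x43, 0x75, 0x72, 0x61, 0xE7, 0x61, 0x6F)))

# Marking: the bytes stay the same, the string only gets a label.
marked <- x
Encoding(marked) <- "latin1"
charToRaw(marked)      # still 7 bytes, 0xE7 untouched

# Re-encoding: the bytes themselves are converted.
converted <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(converted)   # 8 bytes, 0xE7 has become 0xC3 0xA7
```

If the file really is Latin-1, re-encoding is the option that produces strings the rest of a UTF-8 session can handle without further thought.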
But if you demonstrate effort to understand ?Encoding, I expect he
will help you.
read.table("prob.csv", header=T, sep=",", fileEncoding="latin1")
I tried several variations, I expect you will find the file itself is
flawed, but also the encoding is obscuring the problem.
Here's a splat:
> dat <- read.table("prob.csv", header=T, sep=",", fileEncoding="latin1")
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
> head(dat)
             Location  X Time.Period Low.birth.weight.newborns....
1         Afghanistan NA        2004
2             Albania NA        2009
3             Algeria NA        2006                             6
4              Angola NA        2001
5 Antigua and Barbuda NA        2006                             5
6           Argentina NA        2008                             7
>
--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas