[R] read.csv error: invalid multibyte string
Dennis Fisher
fisher at plessthan.com
Sat Dec 31 16:05:47 CET 2011
R version: 2.13.1
OS X
Colleagues,
I am working with a CSV file; for testing purposes, I created an XLS version of the file.
When I read these files using read.xls (gdata) or read.csv, I encounter an error:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<b0>C'
The error occurs whether or not I invoke the "as.is" option of read.csv.
The trigger for this error is a "degree C" string (\xb0). The offending line is:
[1] "\"DD4A14\",\"VITALS\",\"SITE038\",\"038-501\",\"SCREENING\",\"\",\"Temperature\",\"37.8\",\"\xb0C\",\"1005_TS\",\"e2\",\"1005/cla\",\"\",5/25/2011,-1,2,0,0,0,0,0,0,1,7/20/2011 16:48:25,240,1"
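For reference, the byte in question can be examined on its own. The sketch below rests on an assumption that the post does not confirm: that the file is Latin-1 (ISO-8859-1) encoded, where the single byte 0xb0 is the degree sign. Under that assumption, the byte can be converted with iconv, or the encoding can be declared through the fileEncoding argument of read.csv. The small test file and its column names are hypothetical stand-ins for the real data.

```r
# Assumption: the data are Latin-1, where byte 0xb0 is the degree sign.
deg_c <- rawToChar(as.raw(c(0xb0, 0x43)))   # the two bytes behind "\xb0C"
utf8  <- iconv(deg_c, from = "latin1", to = "UTF-8")
charToRaw(deg_c)   # the original bytes
charToRaw(utf8)    # the same text re-encoded as UTF-8

# Hypothetical miniature file containing the same byte sequence.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(charToRaw("TEST,VALUE,UNIT\n"), con)
writeBin(c(charToRaw("Temperature,37.8,"), as.raw(0xb0), charToRaw("C\n")), con)
close(con)

# Declare the file's encoding so the text is converted while it is read.
dat <- read.csv(tmp, fileEncoding = "latin1", as.is = TRUE)
str(dat)
```

Whether fileEncoding = "latin1" is the right declaration depends on how the file was actually produced; if the guess is wrong, the characters will be misread rather than rejected.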
I can get around the error by reading the file with readLines, then editing out that character:
PATH <- textConnection(sub("\xb0", "degrees", readLines(PATH)))
read.csv(PATH, header=TRUE, as.is=TRUE)
This alternate approach is successful. This leads to two questions:
1. Why can readLines handle that character string whereas read.csv cannot?
2. Reading the text connection is slow: it takes ~11 seconds to read a file with 11K rows. I edited the file to replace the offending character with "degree". read.csv reads the 11K rows of the new file in a fraction of a second. Can someone explain why reading the text connection is so much slower than reading a file?
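The timing difference in question 2 can be measured with a small, self-contained comparison. The sketch below uses synthetic data (the column names and values are made up, and only the row count matches the 11K figure above), so the absolute timings will differ from those reported; it only shows one way to compare the two routes, and writing the edited lines to a temporary file is offered as a possible alternative to the text connection, not as a confirmed fix.

```r
# Synthetic stand-in for the real file: a header plus 11,000 data rows.
n <- 11000
lines <- c("ID,VALUE,UNIT", sprintf("%d,37.8,degrees C", seq_len(n)))

# Route 1: parse the in-memory lines through a text connection.
t_conn <- system.time(d1 <- read.csv(textConnection(lines), as.is = TRUE))

# Route 2: write the lines to a temporary file, then parse the file.
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)
t_file <- system.time(d2 <- read.csv(tmp, as.is = TRUE))

# Compare elapsed times; both routes should produce the same data frame.
print(rbind(textConnection = t_conn, file = t_file))
identical(d1, d2)
```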
Dennis
Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com