[R] The behaviour of read.csv().

Duncan Murdoch murdoch.duncan at gmail.com
Fri Dec 3 03:33:28 CET 2010


On 02/12/2010 9:18 PM, David Winsemius wrote:
>
> On Dec 2, 2010, at 8:33 PM, Duncan Murdoch wrote:
>
> snipped
>>
>> I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
>> The comment in the NEWS file suggests it was in response to some
>> strange csv file coming out of Excel.
>>
>> The real problem with the CSV format is that there really isn't a
>> well defined standard for it.  The first RFC about it was published
>> in 2005, and it doesn't claim to be authoritative.  Excel is kind of
>> a standard, but it does some very weird things.  (For example:
>> enter the string 01 into a field.  To keep the leading 0, you need
>> to type it as '01.  Save the file, read it back:  goodbye 0.  At
>> least that's what a website I was just on says about Excel, and what
>> OpenOffice does.)
>
> In both Excel and in OO.org you can select a column (or any other
> range) and set its format to text. (The default is numeric, not that
> different from read.table()'s default behavior.) Once a format has
> been set, you then do not need leading quotes. I just created a small
> example in OO.org Calc, entering values with leading zeros and no
> leading quotes, and this code runs as desired after copying the three
> cells to the clipboard:
>
>   >  read.table(pipe("pbpaste"), colClasses="character")
>       V1
> 1   01
> 2  004
> 3 0005
>
> The same applies to date fields in both OO.org and Excel. In this
> regard, it is simply a matter of understanding what is the defined
> behavior of your software and how one can manipulate it. This is no
> different than learning R's classes, coercing them to your ends, and
> dealing with other formatting issues.

You're right, I shouldn't have picked on Excel particularly here, but it 
really is a bizarre format that says the default way to read a file 
containing

"V1"
"01"
"004"
"0005"

is to assume that the column contains numeric values.  (Yes, read.csv() 
makes this same assumption.)  My main complaint is with the format.
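
A minimal sketch of that behaviour, assuming a version of R in which 
read.csv() accepts the text= argument (the variable names here are 
purely illustrative):

```r
# The four-line file shown above, supplied inline rather than from disk.
csv_text <- '"V1"\n"01"\n"004"\n"0005"'

# Default behaviour: the quotes are stripped and the column is then
# type-converted, so the leading zeros are lost.
default_read <- read.csv(text = csv_text)
print(class(default_read$V1))

# Asking for character columns skips the conversion and keeps the
# strings exactly as they appear in the file.
char_read <- read.csv(text = csv_text, colClasses = "character")
print(char_read$V1)
```

With colClasses = "character" the values should come back as "01", 
"004" and "0005"; without it the column is read as numbers.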

Duncan Murdoch


>
>>
>> I've been burned so many times by storing data in .csv files, that I
>> just avoid them whenever I can.
>
> No argument there. I know one physician whose weapon of choice is
> Stata who always uses "|" as his separator, but that's perhaps because
> he works entirely in Windows. I imagine that might not be the most
> uncommon character in *NIXen.
>
> --
>
> David Winsemius, MD
> West Hartford, CT
>


