[R] The behaviour of read.csv().
Duncan Murdoch
murdoch.duncan at gmail.com
Fri Dec 3 03:33:28 CET 2010
On 02/12/2010 9:18 PM, David Winsemius wrote:
>
> On Dec 2, 2010, at 8:33 PM, Duncan Murdoch wrote:
>
> snipped
>>
>> I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
>> The comment in the NEWS file suggests it was in response to some
>> strange csv file coming out of Excel.
>>
>> The real problem with the CSV format is that there really isn't a
>> well defined standard for it. The first RFC about it was published
>> in 2005, and it doesn't claim to be authoritative. Excel is kind of
>> a standard, but it does some very weird things. (For example:
>> enter the string 01 into a field. To keep the leading 0, you need
>> to type it as '01. Save the file, read it back: goodbye 0. At
>> least that's what a website I was just on says about Excel, and what
>> OpenOffice does.)
>
> In both Excel and OO.org you can select a column (or any other
> range) and set its format to text. (The default is numeric, not that
> different from read.table()'s default behavior.) Once a format has
> been set, you then do not need leading quotes. I just created a small
> example in OO.org Calc, entering values with a leading "0" and no
> leading quotes, and this code runs as desired after copying the three
> cells to the clipboard:
>
> > read.table(pipe("pbpaste"), colClasses="character")
> V1
> 1 01
> 2 004
> 3 0005
>
> The same applies to date fields in both OO.org and Excel. In this
> regard, it is simply a matter of understanding what is the defined
> behavior of your software and how one can manipulate it. This is no
> different than learning R's classes, coercing them to your ends, and
> dealing with other formatting issues.
You're right, I shouldn't have picked on Excel particularly here, but it
really is a bizarre format that says the default way to read a file
containing
"V1"
"01"
"004"
"0005"
is to assume that the column contains numeric values. (Yes, read.csv()
makes this same assumption.) My main complaint is with the format.
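A minimal sketch of that default, assuming the four lines above are the
whole file. The names csv_lines, default_read and char_read are only
illustrative, and textConnection() stands in for a file on disk:

```r
# The four lines of the example file, one element per line.
csv_lines <- c('"V1"', '"01"', '"004"', '"0005"')

# Default behaviour: the quotes are stripped and the column is
# converted to a numeric type, so the leading zeros are lost.
default_read <- read.csv(textConnection(csv_lines))
class(default_read$V1)   # "integer"
default_read$V1          # 1 4 5

# Asking for a character column keeps the fields exactly as written.
char_read <- read.csv(textConnection(csv_lines), colClasses = "character")
char_read$V1             # "01" "004" "0005"
```

Nothing in the file itself says which of the two readings is intended;
the choice has to be made by the reader through colClasses.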
Duncan Murdoch
>
>>
>> I've been burned so many times by storing data in .csv files, that I
>> just avoid them whenever I can.
>
> No argument there. I know one physician whose weapon of choice is
> Stata who always uses "|" as his separator, but that's perhaps because
> he works entirely in Windows. I imagine that might not be the most
> uncommon character in *NIXen.
>
> --
>
> David Winsemius, MD
> West Hartford, CT
>