[R] The behaviour of read.csv().

David Winsemius dwinsemius at comcast.net
Fri Dec 3 04:43:58 CET 2010


On Dec 2, 2010, at 9:33 PM, Duncan Murdoch wrote:

> On 02/12/2010 9:18 PM, David Winsemius wrote:
>>
>> On Dec 2, 2010, at 8:33 PM, Duncan Murdoch wrote:
>>
>> snipped
>>>
>>> I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0.
>>> The comment in the NEWS file suggests it was in response to some
>>> strange csv file coming out of Excel.
>>>
>>> The real problem with the CSV format is that there really isn't a
>>> well defined standard for it.  The first RFC about it was published
>>> in 2005, and it doesn't claim to be authoritative.  Excel is kind of
>>> a standard, but it does some very weird things.  (For example:
>>> enter the string 01 into a field.  To keep the leading 0, you need
>>> to type it as '01.  Save the file, read it back:  goodbye 0.  At
>>> least that's what a website I was just on says about Excel, and what
>>> OpenOffice does.)
>>
>> In both Excel and in OO,org you can select a column (or any other
>> range) and set its format to text. (The default is numeric, not that
>> different that read.table()'s default behavior.) Once a format has
>> been set, you then do not need leading quotes. I just created a small
>> example with OO.org Calc entered leading "0" without leading quotes
>> and this code runs as desired after copying the three cells to the
>> clipboard:
>>
>>  >  read.table(pipe("pbpaste"), colClasses="character")
>>      V1
>> 1   01
>> 2  004
>> 3 0005
>>
>> The same applies to date field in both OO.org and Excel. In this
>> regard, it is simply a matter of understanding what is the defined
>> behavior of your software and how one can manipulate it. This is no
>> different than learning R's classes, coercing them to your ends, and
>> dealing with other formatting issues.
>
> You're right, I shouldn't have picked on Excel particularly here,  
> but it really is a bizarre format that says the default way to read  
> a file containing
>
> "V1"  # minor quibble. The V1 was added by read.table()
> "01"
> "004"
> "0005"
>
> is to assume that the column contains numeric values.

I'm a bit puzzled. Or maybe not. If you are criticizing the default  
behavior of R's read.table then I do understand (but have been taught  
by my reading of the FM that "numeric" happens iff all first <n> _are_  
coercible to "numeric" without NA generation is what one should  
expect). Excel is offering text exactly in the instances it has been  
told that the cell format is "text".


>  (Yes, read.csv() makes this same assumption.)  My main complaint is  
> with the format.

Meaning the defaults chosen for read.csv()?

-- 
David.


>
> Duncan Murdoch
>
>
>>
>>>
>>> I've been burned so many times by storing data in .csv files, that I
>>> just avoid them whenever I can.
>>
>> No argument there. I know one physician whose weapon of choice is
>> Stata who always uses "|" as his separator, but that's perhaps  
>> because
>> he works entirely in Windows. I imagine that might not be the most
>> uncommon character in *NIXen.
>>
>> --
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>

David Winsemius, MD
West Hartford, CT



More information about the R-help mailing list