[Rd] read.csv trap

Ben Bolker bbolker at gmail.com
Fri Feb 11 23:00:30 CET 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/11/2011 03:37 PM, Laurent Gatto wrote:
> On 11 February 2011 19:39, Ben Bolker <bbolker at gmail.com> wrote:
>>
> [snip]
>>
>> What is dangerous/confusing is that R silently **wraps** longer lines if
>> fill=TRUE (which is the default for read.csv).  I encountered this when
>> working with a colleague on a long, messy CSV file that had some phantom
>> extra fields in some rows, which then turned into empty lines in the
>> data frame.
>>
> 
> As a matter of fact, this is exactly what happened to a colleague of
> mine yesterday and caused her quite a bit of trouble. On the other
> hand, it could also be considered as a 'bug' in the csv file. Although
> no formal specification exist for the csv format, RFC 4180 [1]
> indicates that 'each line should contain the same number of fields
> throughout the file'.
> 
> [1] http://tools.ietf.org/html/rfc4180
> 
> Best wishes,
> 
> Laurent

  Asserting that the bug is in the CSV file is logically consistent, but
if this is true then the "fill=TRUE" argument (which is only needed when
the lines contain different numbers of fields) should not be allowed.

 I had never seen RFC4180 before -- interesting!  I note especially
points 5-7 which define the handling of double quotation marks (but says
nothing about single quotes or using backslashes as escape characters).

  Dealing with read.[table|csv] seems a bit of an Augean task
<http://en.wikipedia.org/wiki/Augeas> (hmmm, maybe I should write a
parallel document to Burns's _Inferno_ ...)

  cheers
    Ben

> 
>>  Here is an example and a workaround that runs count.fields on the
>> whole file to find the maximum column length and set col.names
>> accordingly.  (It assumes you don't already have a file named "test.csv"
>> in your working directory ...)
>>
>>  I haven't dug in to try to write a patch for this -- I wanted to test
>> the waters and see what people thought first, and I realize that
>> read.table() is a very complicated piece of code that embodies a lot of
>> tradeoffs, so there could be lots of different approaches to trying to
>> mitigate this problem. I appreciate very much how hard it is to write a
>> robust and general function to read data files, but I also think it's
>> really important to minimize the number of traps in read.table(), which
>> will often be the first part of R that new users encounter ...
>>
>>  A quick fix for this might be to allow the number of lines analyzed
>> for length to be settable by the user, or to allow a settable 'maxcols'
>> parameter, although those would only help in the case where the user
>> already knows there is a problem.
>>
>>  cheers
>>    Ben Bolker
>>
>> ===============
>> writeLines(c("A,B,C,D",
>>             "1,a,b,c",
>>             "2,f,g,c",
>>             "3,a,i,j",
>>             "4,a,b,c",
>>             "5,d,e,f",
>>             "6,g,h,i,j,k,l,m,n"),
>>           con=file("test.csv"))
>>
>>
>> read.csv("test.csv")
>> try(read.csv("test.csv",fill=FALSE))
>>
>> ## assumes header=TRUE, fill=TRUE; should be a little more careful
>> ##  with comment, quote arguments (possibly explicit)
>> ## ... contains information about quote, comment.char, sep
>> Read.csv <- function(fn,sep=",",...) {
>>  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
>>  ncolnames <- length(colnames)
>>  maxcols <- max(count.fields(fn,sep=sep,...))
>>  if (maxcols>ncolnames) {
>>    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
>>  }
>>  ## assumes you don't have any other columns labeled "V[large number]"
>>  read.csv(fn,...,col.names=colnames)
>> }
>>
>> Read.csv("test.csv")
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
> 
> 
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1VsX4ACgkQc5UpGjwzenPwsgCfTtGo0kJSXhUTPcY+p7cgaiuq
zHAAnikRORUhqLP9O+6M5SwyZcFEW9uT
=Rb2R
-----END PGP SIGNATURE-----



More information about the R-devel mailing list