[Rd] read.table() with quoted integers

Mon Sep 30 17:19:47 CEST 2013

On Monday, 30 September 2013 at 10:07 -0500, Joshua Ulrich wrote:
> On Mon, Sep 30, 2013 at 9:45 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> > On Monday, 30 September 2013 at 08:38 -0500, Joshua Ulrich wrote:
> >> On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> >> > Hi!
> >> >
> >> >
> >> > It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> >> > quoted integers as an acceptable value for columns for which
> >> > colClasses="integer". But when colClasses is omitted, these columns are
> >> > read as integer anyway.
> >> >
> >> > For example, let's consider a file named file.dat, containing:
> >> > "1"
> >> > "2"
> >> >
> >> >> read.table("file.dat", colClasses="integer")
> >> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> >> >   scan() expected 'an integer' and got '"1"'
> >> >
> >> > But:
> >> >> str(read.table("file.dat"))
> >> > 'data.frame':   2 obs. of  1 variable:
> >> >  $ V1: int  1 2
> >> >
> >> > The latter result is indeed documented in ?read.table:
> >> >      Unless ‘colClasses’ is specified, all columns are read as
> >> >      character columns and then converted using ‘type.convert’ to
> >> >      logical, integer, numeric, complex or (depending on ‘as.is’)
> >> >      factor as appropriate.  Quotes are (by default) interpreted in all
> >> >      fields, so a column of values like ‘"42"’ will result in an
> >> >      integer column.
> >> >
> >> >
> >> > Should the former behavior be considered a bug?
> >> >
> >> No. If you tell read.table the column is integer and it's actually
> >> character on disk, it should be an error.
> > All values in a CSV file are stored as characters on disk, regardless
> > of whether they are surrounded by quotes. 1 is saved as
> > 00110001 (ASCII character #49), not 00000001, nor 00000000 00000000
> > 00000000 00000001 (as a 32-bit integer representation would imply,
> > for example).
> >
> Yes, I'm aware that write.table creates a character representation of
> the data on disk.  That's its purpose.  writeBin is for writing actual
> binary representations.  I thought you would understand that by
> "actually character on disk" I meant "actually a quoted value".  I
> assumed you would understand my intent.
> 
> read.table uses scan to read the file.  ?scan says:
> 
>      The allowed input for a numeric field is optional whitespace
>      followed either ‘NA’ or an optional sign followed by a decimal or
>      hexadecimal constant (see NumericConstants), or ‘NaN’, ‘Inf’ or
>      ‘infinity’ (ignoring case).  Out-of-range values are recorded as
>      ‘Inf’, ‘-Inf’ or ‘0’.
> 
>      For an integer field the allowed input is optional whitespace,
>      followed by either ‘NA’ or an optional sign and one or more digits
>      (‘0-9’): all out-of-range values are converted to ‘NA_integer_’.
> 
> There's no mention of quotes being allowed.
> 
> > So, with all due respect, please refrain from formulating such blatantly
> > erroneous statements.
> >
> So, with all due respect, please refrain from formulating such
> blatantly pedantic responses to someone trying to help you.
Sorry, your reply came across as quite abrupt for somebody trying to
help. ;-)

And I'm not really looking for help, honestly, as I found a workaround
some time ago already. I'd just like to know how we could make
read.csv.ffdf() work better in this case, and possibly improve R too.
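
For the record, here is a minimal sketch of one possible way around the
problem. It is only an illustration under my own assumptions, not
necessarily the workaround mentioned above, and not what read.csv.ffdf()
does internally. The idea is to read the column as character, so that
the quotes are stripped, and convert it afterwards. The file and the
variable name `dat` are made up for the example.

```r
# Recreate the example file from the original report: two quoted integers.
tmp <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), tmp)

# Passing colClasses = "integer" directly is the call reported to fail in
# R 3.0.1, so it is deliberately not used here.
# Instead, read everything as character: quotes are interpreted for
# character fields, so the values arrive as "1" and "2".
dat <- read.table(tmp, colClasses = "character")

# Convert each column afterwards, the same way read.table() does when
# colClasses is omitted. as.is = TRUE keeps non-numeric columns as
# character instead of turning them into factors.
dat[] <- lapply(dat, type.convert, as.is = TRUE)

str(dat)
unlink(tmp)
```

The cost is an extra pass over each chunk, and every column is held as
character before conversion, which may matter for very large files.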

Regards

> >
> > Regards
> >
> >
> >> > This creates problems when combined with read.table.ffdf from package
> >> > ff, since this function tries to guess the column classes by reading the
> >> > first rows of the file, and then passes colClasses to read.table to read
> >> > the remaining rows by chunks. A column of quoted integers is correctly
> >> > detected as integer in the first read, but read.table() fails in
> >> > subsequent reads.
> >> >
> >> This sounds like an issue with read.table.ffdf.  The column of quoted
> >> integers is *incorrectly* detected as integer because they're actually
> >> character on disk.  read.table.ffdf should rely on how the data are
> >> actually stored on disk (via as.is=TRUE), not how read.table might
> >> convert them once they're read into R.
> >>
> >> >
> >> > Regards
> >> >
> >> > ______________________________________________
> >> > R-devel at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >> --
> >> Joshua Ulrich  |  about.me/joshuaulrich
> >> FOSS Trading  |  www.fosstrading.com
> >