[Rd] read.table() with quoted integers
Joshua Ulrich
josh.m.ulrich at gmail.com
Mon Sep 30 17:07:19 CEST 2013
On Mon, Sep 30, 2013 at 9:45 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> On Monday, 30 September 2013 at 08:38 -0500, Joshua Ulrich wrote:
>> On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
>> > Hi!
>> >
>> >
>> > It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
>> > quoted integers as an acceptable value for columns for which
>> > colClasses="integer". But when colClasses is omitted, these columns are
>> > read as integer anyway.
>> >
>> > For example, let's consider a file named file.dat, containing:
>> > "1"
>> > "2"
>> >
>> >> read.table("file.dat", colClasses="integer")
>> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>> > scan() expected 'an integer' and got '"1"'
>> >
>> > But:
>> >> str(read.table("file.dat"))
>> > 'data.frame': 2 obs. of 1 variable:
>> > $ V1: int 1 2
>> >
>> > The latter result is indeed documented in ?read.table:
>> > Unless ‘colClasses’ is specified, all columns are read as
>> > character columns and then converted using ‘type.convert’ to
>> > logical, integer, numeric, complex or (depending on ‘as.is’)
>> > factor as appropriate. Quotes are (by default) interpreted in all
>> > fields, so a column of values like ‘"42"’ will result in an
>> > integer column.
>> >
>> >
>> > Should the former behavior be considered a bug?
>> >
>> No. If you tell read.table the column is integer and it's actually
>> character on disk, it should be an error.
> All values in a CSV file are stored as characters on disk, regardless
> of whether they are surrounded by quotes. 1 is saved as
> 00110001 (ASCII character #49), not 00000001, nor 00000000 00000000
> 00000000 00000001 (as 32-bit storage of integers would imply, for
> example).
>
Yes, I'm aware that write.table creates a character representation of
the data on disk. That's its purpose. writeBin is for writing actual
binary representations. I thought you would understand that by
"actually character on disk" I meant "actually a quoted value". I
assumed you would understand my intent.
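
The distinction between a character representation and a binary one can be checked directly. The following is a minimal sketch, not part of the original exchange; the variable names are illustrative, and the byte order is fixed explicitly so the result does not depend on the platform.

```r
# Text representation: the character "1" is the single byte 0x31 (ASCII 49).
txt <- charToRaw("1")
print(txt)

# A quoted value on disk carries two extra bytes, one per quote character (0x22).
quoted <- charToRaw('"1"')
print(quoted)

# Binary representation: writeBin() on an integer emits 4 bytes.
# endian = "little" is set explicitly for a platform-independent result.
bin <- writeBin(1L, raw(), endian = "little")
print(bin)
```
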

read.table uses scan to read the file. ?scan says:

     The allowed input for a numeric field is optional whitespace
     followed either ‘NA’ or an optional sign followed by a decimal or
     hexadecimal constant (see NumericConstants), or ‘NaN’, ‘Inf’ or
     ‘infinity’ (ignoring case). Out-of-range values are recorded as
     ‘Inf’, ‘-Inf’ or ‘0’.

     For an integer field the allowed input is optional whitespace,
     followed by either ‘NA’ or an optional sign and one or more digits
     (‘0-9’): all out-of-range values are converted to ‘NA_integer_’.

There's no mention of quotes being allowed.

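
One way to work with such a file, sketched here under the assumption that reading the column as character and converting afterwards is acceptable, is shown below. The temporary file stands in for file.dat from the original report, and the names tmp, res and df are illustrative only.

```r
# Recreate the two-line file from the report in a temporary location.
tmp <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), tmp)

# Requesting an integer column directly: per the report above, scan() rejects
# the quoted field. The error is captured instead of stopping the script.
res <- tryCatch(
  read.table(tmp, colClasses = "integer"),
  error = function(e) conditionMessage(e)
)
print(res)

# Reading the column as character strips the quotes; convert explicitly after.
df <- read.table(tmp, colClasses = "character")
df$V1 <- as.integer(df$V1)
str(df)

unlink(tmp)
```

The explicit as.integer() step keeps the conversion visible, and values that are not valid integers become NA with a warning.
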
> So, with all due respect, please refrain from formulating such blatantly
> erroneous statements.
>
So, with all due respect, please refrain from formulating such
blatantly pedantic responses to someone trying to help you.
>
> Regards
>
>
>> > This creates problems when combined with read.table.ffdf from package
>> > ff, since this function tries to guess the column classes by reading the
>> > first rows of the file, and then passes colClasses to read.table to read
>> > the remaining rows by chunks. A column of quoted integers is correctly
>> > detected as integer in the first read, but read.table() fails in
>> > subsequent reads.
>> >
>> This sounds like an issue with read.table.ffdf. The column of quoted
>> integers is *incorrectly* detected as integer because they're actually
>> character on disk. read.table.ffdf should rely on how the data are
>> actually stored on disk (via as.is=TRUE), not how read.table might
>> convert them once they're read into R.
>>
>> >
>> > Regards
>> >
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> --
>> Joshua Ulrich | about.me/joshuaulrich
>> FOSS Trading | www.fosstrading.com
>