[Rd] read.table() with quoted integers

Fri Oct 4 17:10:52 CEST 2013

On Fri, Oct 4, 2013 at 4:55 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 13-10-04 7:31 AM, Joshua Ulrich wrote:
>>
>> On Tue, Oct 1, 2013 at 11:29 AM, David Winsemius <dwinsemius at comcast.net>
>> wrote:
>>>
>>>
>>> On Sep 30, 2013, at 6:38 AM, Joshua Ulrich wrote:
>>>
>>>> On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr>
>>>> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>>
>>>>> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
>>>>> quoted integers as an acceptable value for columns for which
>>>>> colClasses="integer". But when colClasses is omitted, these columns are
>>>>> read as integer anyway.
>>>>>
>>>>> For example, let's consider a file named file.dat, containing:
>>>>> "1"
>>>>> "2"
>>>>>
>>>>>> read.table("file.dat", colClasses="integer")
>>>>>
>>>>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>>>> na.strings, :
>>>>>   scan() expected 'an integer' and got '"1"'
>>>>>
>>>>> But:
>>>>>>
>>>>>> str(read.table("file.dat"))
>>>>>
>>>>> 'data.frame':   2 obs. of  1 variable:
>>>>> $ V1: int  1 2
>>>>>
>>>>> The latter result is indeed documented in ?read.table:
>>>>>      Unless ‘colClasses’ is specified, all columns are read as
>>>>>      character columns and then converted using ‘type.convert’ to
>>>>>      logical, integer, numeric, complex or (depending on ‘as.is’)
>>>>>      factor as appropriate.  Quotes are (by default) interpreted in all
>>>>>      fields, so a column of values like ‘"42"’ will result in an
>>>>>      integer column.
>>>>>
>>>>>
>>>>> Should the former behavior be considered a bug?
>>>>>
>>>> No. If you tell read.table the column is integer and it's actually
>>>> character on disk, it should be an error.
>>>
>>>
>>> My reading of the `read.table` help page is that one should expect that
>>> when
>>> there is an 'integer'-class and an  `as.integer` function and  "integer"
>>> is the
>>> argument to colClasses, that `as.integer` will be applied to the values
>>> in the
>>> column. Should I be reading elsewhere?
>>>
>> I assume you're referring to the paragraph below.
>>
>>    Possible values are ‘NA’ (the default, when ‘type.convert’ is
>>    used), ‘"NULL"’ (when the column is skipped), one of the
>>    atomic vector classes (logical, integer, numeric, complex,
>>    character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.
>>    Otherwise there needs to be an ‘as’ method (from package
>>    ‘methods’) for conversion from ‘"character"’ to the specified
>>    formal class.
>>
>> I read that as meaning that an "as" method is required for classes not
>> already listed in the prior sentence.  It doesn't say an "as" method
>> will be applied if colClasses is one of the atomic, factor, Date, or
>> POSIXct classes; but I can see how you might assume that, since all
>> the atomic, factor, Date, and POSIXct classes already have "as"
>> methods...
>
>
> And this does suggest a workaround for ffdf:  instead of declaring the class
> to be "integer", declare a class "ffdf_integer", and write a conversion
> method.  Or simply read everything as character and call as.integer()
> explicitly.

Just a note of concern, since several people proposed it:
colClasses="character" followed by as.integer() or strtoi() skips
the validation, e.g. "foo" will be turned into NA_integer_.  Using
read.table() or scan() with colClasses="integer" gives an error instead.
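
To make the difference concrete, here is a minimal sketch. The helper
as_integer_strict() and the class name "quotedInteger" are hypothetical
names made up for illustration, not part of base R; the regular expression
is just one possible definition of what counts as a valid integer.

```r
library(methods)  # setClass()/setAs(); attached explicitly for Rscript

# Temporary file like the one in the thread, plus one invalid value.
bad_file <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"', '"foo"'), bad_file)

# Reading as character strips the quotes; as.integer() then turns "foo"
# into NA_integer_ with only a warning, so the bad value goes unnoticed.
x <- read.table(bad_file, colClasses = "character")$V1
ints <- suppressWarnings(as.integer(x))

# Hypothetical strict converter: stop on anything that is not a plain
# integer string. The pattern check also rejects "1.5", which
# as.integer() would otherwise truncate to 1 without complaint.
as_integer_strict <- function(x) {
  out <- suppressWarnings(as.integer(x))
  ok_form <- grepl("^[[:space:]]*[+-]?[0-9]+[[:space:]]*$", x)
  bad <- !is.na(x) & (is.na(out) | !ok_form)
  if (any(bad))
    stop("not an integer: ", paste(sQuote(x[bad]), collapse = ", "))
  out
}

# Duncan's workaround: declare a custom class and let read.table()
# convert from character through an 'as' method.
setClass("quotedInteger")
setAs("character", "quotedInteger", function(from) as_integer_strict(from))

good_file <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), good_file)
df <- read.table(good_file, colClasses = "quotedInteger")
str(df)
```

With this in place, read.table(bad_file, colClasses = "quotedInteger")
should stop with an error rather than silently returning NA, which keeps
the validation while still accepting quoted integers.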

/Henrik

>
> Duncan Murdoch
>
>