[Rd] read.table() with quoted integers

peter dalgaard pdalgd at gmail.com
Fri Oct 4 18:15:14 CEST 2013


On Oct 4, 2013, at 17:10 , Henrik Bengtsson wrote:

> On Fri, Oct 4, 2013 at 4:55 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>> On 13-10-04 7:31 AM, Joshua Ulrich wrote:
>>> 
>>> On Tue, Oct 1, 2013 at 11:29 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>> 
>>>> 
>>>> On Sep 30, 2013, at 6:38 AM, Joshua Ulrich wrote:
>>>> 
>>>>> On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
>>>>>> 
>>>>>> Hi!
>>>>>> 
>>>>>> 
>>>>>> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
>>>>>> quoted integers as an acceptable value for columns for which
>>>>>> colClasses="integer". But when colClasses is omitted, these columns are
>>>>>> read as integer anyway.
>>>>>> 
>>>>>> For example, let's consider a file named file.dat, containing:
>>>>>> "1"
>>>>>> "2"
>>>>>> 
>>>>>> > read.table("file.dat", colClasses="integer")
>>>>>> 
>>>>>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>>>>> na.strings, :
>>>>>>  scan() expected 'an integer' and got '"1"'
>>>>>> 
>>>>>> But:
>>>>>> > str(read.table("file.dat"))
>>>>>> 
>>>>>> 'data.frame':   2 obs. of  1 variable:
>>>>>> $ V1: int  1 2
>>>>>> 
>>>>>> The latter result is indeed documented in ?read.table:
>>>>>>     Unless ‘colClasses’ is specified, all columns are read as
>>>>>>     character columns and then converted using ‘type.convert’ to
>>>>>>     logical, integer, numeric, complex or (depending on ‘as.is’)
>>>>>>     factor as appropriate.  Quotes are (by default) interpreted in all
>>>>>>     fields, so a column of values like ‘"42"’ will result in an
>>>>>>     integer column.
>>>>>> 
>>>>>> 
>>>>>> Should the former behavior be considered a bug?
>>>>>> 
>>>>> No. If you tell read.table the column is integer and it's actually
>>>>> character on disk, it should be an error.
>>>> 
>>>> 
>>>> My reading of the `read.table` help page is that one should expect that
>>>> when there is an 'integer'-class and an `as.integer` function and
>>>> "integer" is the argument to colClasses, that `as.integer` will be
>>>> applied to the values in the column. Should I be reading elsewhere?
>>>> 
>>> I assume you're referring to the paragraph below.
>>> 
>>>   Possible values are ‘NA’ (the default, when ‘type.convert’ is
>>>   used), ‘"NULL"’ (when the column is skipped), one of the
>>>   atomic vector classes (logical, integer, numeric, complex,
>>>   character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.
>>>   Otherwise there needs to be an ‘as’ method (from package
>>>   ‘methods’) for conversion from ‘"character"’ to the specified
>>>   formal class.
>>> 
>>> I read that as meaning that an "as" method is required for classes not
>>> already listed in the prior sentence.  It doesn't say an "as" method
>>> will be applied if colClasses is one of the atomic, factor, Date, or
>>> POSIXct classes; but I can see how you might assume that, since all
>>> the atomic, factor, Date, and POSIXct classes already have "as"
>>> methods...
>> 
>> 
>> And this does suggest a workaround for ffdf:  instead of declaring the class
>> to be "integer", declare a class "ffdf_integer", and write a conversion
>> method.  Or simply read everything as character and call as.integer()
>> explicitly.
> 
> Just a note of concert since several proposed it:

concerN?

> colClasses="character" followed by as.integer() or strtoi() misses
> the validation, e.g. "foo" will be turned into NA_integer_.  Using
> read.table() or scan() gives an error.

The obvious fix for that would seem to be to use scan() on the character vector:

> y <- c("1","2",3,4,5)
> y
[1] "1" "2" "3" "4" "5"
> scan(text=y)
Read 5 items
[1] 1 2 3 4 5
> y <- c("1","2",3,4,"NA")
> scan(text=y)
Read 5 items
[1]  1  2  3  4 NA
> y <- c("1","2",3,4,"foo")
> scan(text=y)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got 'foo'
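
Note that scan() defaults to what = double(), which is why the error above mentions 'a real' rather than an integer. A minimal sketch of the same idea with what = integer(), applied to a column first read as character, might look like the following. The helper name read_quoted_int is hypothetical (not part of base R), and the length check is an added assumption to catch fields that scan() would split or drop, such as values with embedded whitespace or empty strings.

```r
# Hypothetical helper: validate a character vector as integers via scan().
# Unlike as.integer(), scan() stops with an error on values such as "foo".
read_quoted_int <- function(x) {
  # what = integer() makes scan() reject anything that is not an integer;
  # quiet = TRUE suppresses the "Read N items" message.
  out <- scan(text = x, what = integer(), quiet = TRUE)
  # Assumption: one value per element. scan() splits on whitespace and
  # skips blank lines, so guard against a silent change in length.
  stopifnot(length(out) == length(x))
  out
}

# Read the quoted column as character (quotes are stripped by read.table),
# then convert with validation.
df <- read.table(text = '"1"\n"2"', colClasses = "character")
df$V1 <- read_quoted_int(df$V1)
str(df)  # V1 is now an integer column
```

With an input such as c("1", "foo") the helper fails inside scan() instead of quietly producing NA_integer_, while a literal "NA" is still read as a missing value.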


> 
> /Henrik
> 
>> 
>> Duncan Murdoch
>> 
>> 
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


