[Rd] Bug in read.table?

peter dalgaard pdalgd at gmail.com
Tue Nov 16 14:04:16 CET 2010


On Nov 16, 2010, at 02:59 , Ben Bolker wrote:

> Ben Bolker <bbolker <at> gmail.com> writes:
> 
>> 
>> Ben Bolker <bbolker <at> gmail.com> writes:
>> 
>>> 
>>> 
>> 
>>   Can simplify this still further:
>> 
>> a b'c
>> d e'f
>> g h'i
> 
>  This example file leads to duplicate lines.
> Arguably it should have behavior analogous to:
> 
>> scan(what="")
> 1: a b'c
> 3: d e'f
> 5: g h'i
> 7: Read 6 items
> [1] "a"   "b'c" "d"   "e'f" "g"   "h'i"
> 
> 
>> 
>>> One of the first things that happens in read.table is that
>>> the first few lines are read with readTableHead:
>>> 
>>>  lines <- .Internal(readTableHead(file, nlines, comment.char, 
>>>       blank.lines.skip, quote, sep))
>>> 
>>  In this case, readTableHead reads the first two lines as one line:
>> the single quote at position 4 of the first line is closed by the one
>> at position 4 of the second line, so the first newline does not
>> end a line.
>> 
>>  R then pushes back two copies of the lines that have
>> been read (this is normal behavior; I don't quite follow the
>> logic).
>> 
>>  The rest of the file is read with scan(), one line at a time.
>> However, there is a discrepancy between the way
>> that readTableHead interprets newlines in the middle of
>> quoted strings (it ignores them) and the way that scan()
>> interprets them (it takes them as the end of the quoted string).
> 
> 
>  Ping?
>  I think this counts as a small, but real, bug. Should I go ahead
> and report it as such, or would someone explain why it's not a bug?
> 

I think filing it as a bug can be defended, but it is tricky to pinpoint exactly what the issue is. E.g., notice that adding a few spaces changes the behaviour of scan() considerably:

> scan(what="")
1:  a b 'c
1: d e' f
5: g h' i
8: 
Read 7 items
[1] "a"      "b"      "c\nd e" "f"      "g"      "h'"     "i"     

(I'm confused... What is it that we really want here?)

Also, as you noted originally, beware the "Well don't do that then" aspect...
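
For anyone who wants to poke at this, here is a minimal sketch that writes the three-line example quoted above to a temporary file and reads it back with quote processing switched off. It assumes that literal apostrophes are what is wanted in the data, and the variable names are made up for illustration. With quote = "", neither read.table() nor scan() treats ' as a delimiter, so the question of where a quoted string ends does not arise.

```r
# Write the three-line example from the thread to a temporary file.
example_file <- tempfile(fileext = ".txt")
writeLines(c("a b'c", "d e'f", "g h'i"), example_file)

# Disable quote processing so that ' is treated as an ordinary character.
tab <- read.table(example_file, quote = "", stringsAsFactors = FALSE)
print(tab)

# scan() with the same setting returns the six whitespace-separated fields.
items <- scan(example_file, what = "", quote = "", quiet = TRUE)
print(items)

# For comparison, the default quote set ("\"'") is the case under discussion.
# Its result may depend on the R version, so nothing is assumed about it here.
default_result <- try(suppressWarnings(read.table(example_file)), silent = TRUE)
```

This only sidesteps the problem for files without genuinely quoted fields; it says nothing about what the default behaviour ought to be.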

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


