[Rd] read.table messes up stdin upon small, erroneous input (PR#7722)

Fri Mar 11 14:49:08 CET 2005

Full_Name: Jan T. Kim
Version: 2.0.1, devel-2005-02-24
OS: Linux 2.6.x
Submission from: (NULL) (139.222.3.229)

Run read.table(stdin()) and type in the broken table
1 2
1

terminating the input by pressing Ctrl-D on the 3rd line of input. An error
message from scan, complaining that "line 2 did not have 2 elements", appears
as expected. However, after this, three empty lines are left buffered in
stdin:

> readLines(stdin())
[1] "" "" ""

Repeated attempts to read.table the broken input from stdin lead to even
stranger results:

> read.table(stdin())
0: 1 2
1: 1
2: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
        line 2 did not have 2 elements
> read.table(stdin())
3: 1 2
4: 1
[1] V1 V2
<0 rows> (or 0-length row.names)
> 

Analysis: These effects are due to a combination of (1) the fact that
there appear to be various routes of accessing the standard input,
depending on context, and (2) the use of pushback in the process of
automatically figuring out the table format:

    * read.table uses .Internal(readTableHead(...)) to get the first
      nlines lines of the table (nlines = 5).

    * .Internal(readTableHead(...)) always returns nlines lines, adding
      empty lines if EOF comes before nlines lines are read.

    * These lines, including any empty ones not originating from the
      file in the first place, are then pushed back twice.

    * The first set of lines is always consumed by the subsequent
      code that figures out the number of columns.

    * The second set is intended to be consumed by the regular operation
      of scan.

    * However, if scan chokes before it can consume these lines, including
      the blank ones, these will be left in the pushback buffer.

    * R's interactive fetch-parse-evaluate loop does not use the connection
      provided by stdin(), and therefore, the buffered stuff is not
      noticed until the next attempt to read from the stdin connection.
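
The sequence above can be imitated outside read.table. The following is
a minimal sketch only: it uses a textConnection as a stand-in for stdin()
and hard-codes the padded lines, so it illustrates the pushback mechanism
rather than reproducing read.table's actual code. The names con,
head_lines, res and leftover are made up for the illustration.

```r
# Illustrative only: imitates the pushback sequence on a textConnection
# instead of stdin(); this is not read.table's actual code.
con <- textConnection(character(0))   # stands in for stdin() already at EOF

# Two real lines plus three blank padding lines (nlines = 5), as
# readTableHead is described to return them.
head_lines <- c("1 2", "1", "", "", "")

# "Pushed back twice": both copies are queued on the connection.
pushBack(c(head_lines, head_lines), con)

# The column-counting pass consumes the first copy.
first_pass <- readLines(con, n = 5)

# scan fails on line 2 of the second copy, as in the report.
res <- try(scan(con, what = list(0, 0), multi.line = FALSE, quiet = TRUE),
           silent = TRUE)

# Lines that scan never reached are still queued on the connection.
leftover <- pushBackLength(con)
rest <- readLines(con)   # this is what a later read from the connection sees
close(con)
```

If the analysis is right, leftover should be nonzero and rest should
consist of the blank padding lines only.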

The strange effects reported above could probably be fixed by modifying
the internal readTableHead function such that it does not produce empty
lines merely to return the number of lines "requested" by the nlines
parameter.
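
For comparison, readLines already behaves this way at R level: it returns
only the lines that are actually available. The helper below is
hypothetical (read_head is not an existing function) and ignores anything
else readTableHead may do internally; it only shows the non-padding
behaviour.

```r
# Hypothetical helper: returns at most nlines lines and never pads with "".
read_head <- function(con, nlines = 5) {
  readLines(con, n = nlines)
}

con <- textConnection(c("1 2", "1"))   # only two lines before EOF
head_lines <- read_head(con)
close(con)

length(head_lines)   # 2, not 5
```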

A more fundamental approach would be to avoid pushing back lines
altogether. The repeated scanning of the first few lines could be
done by using a textConnection instead. Some additional work will
probably be necessary to combine the first few and the remaining
lines, acquired by regular operation of scan, into the complete
table.
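
A rough sketch of this second approach, under several assumptions:
read_table_sketch is a made-up name, whitespace-separated fields without
a header are assumed, and none of read.table's other options are handled.
The point is only that the first few lines are examined and scanned
through a textConnection, so nothing is ever pushed back onto the
original connection.

```r
# Hypothetical sketch, not a replacement for read.table.
read_table_sketch <- function(con, nlines = 5) {
  head_lines <- readLines(con, n = nlines)      # no padding at EOF
  if (length(head_lines) == 0L) stop("no lines available in input")

  # First pass over the head: determine the number of columns.
  tc_count <- textConnection(head_lines)
  ncols <- max(count.fields(tc_count), na.rm = TRUE)
  close(tc_count)

  what <- rep(list(""), ncols)                  # read everything as character

  # Second pass over the head, again via a textConnection.
  tc_scan <- textConnection(head_lines)
  on.exit(close(tc_scan))
  head_part <- scan(tc_scan, what = what, multi.line = FALSE, quiet = TRUE)

  # Remaining lines come from the original connection by regular scan.
  rest_part <- scan(con, what = what, multi.line = FALSE, quiet = TRUE)

  # Combine the two parts column by column and convert the types.
  cols <- Map(c, head_part, rest_part)
  names(cols) <- paste0("V", seq_len(ncols))
  as.data.frame(lapply(cols, type.convert, as.is = TRUE))
}

con <- textConnection(c("1 2", "3 4", "5 6", "7 8", "9 10", "11 12", "13 14"))
df <- read_table_sketch(con)
close(con)
```

With the broken two-line input from the report, scan still raises its
error on the head lines, but since no pushBack is involved, no stray
lines should remain queued on the connection afterwards.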