[Rd] scan(..., skip=1e11): infinite loop; cannot interrupt

Ivan Krylov kry|ov@r00t @end|ng |rom gm@||@com
Sat Feb 11 09:33:16 CET 2023


On Fri, 10 Feb 2023 23:38:55 -0600
Spencer Graves <spencer.graves using prodsyse.com> wrote:

> I have a 4.54 GB file that I'm trying to read in chunks using 
> "scan(..., skip=__)".  It works as expected for small values of
> "skip" but goes into an infinite loop for "skip=1e11" and similar
> large values of skip:  I cannot even interrupt it;  I must kill R.

Skipping lines is done by two nested loops. The outer loop counts the
lines to skip; the inner loop reads characters until it encounters a
newline or end of file. The outer loop doesn't check for EOF, so once
the file is exhausted it still runs the inner loop once for every
remaining line it wants to skip. With skip=1e11 that is not an infinite
loop, just a very long one. The following patch should avoid the wait
in such cases:

--- src/main/scan.c	(revision 83797)
+++ src/main/scan.c	(working copy)
@@ -835,7 +835,7 @@
 attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
 {
     SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
-    int c, flush, fill, blskip, multiline, escapes, skipNul;
+    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
     R_xlen_t nmax, nlines, nskip;
     const char *p, *encoding;
     RCNTXT cntxt;
@@ -952,7 +952,7 @@
 	    if(!data.con->canread)
 		error(_("cannot read from this connection"));
 	}
-	for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
+	for (R_xlen_t i = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
 	    while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
     }
 

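Outside of R, the same idea can be tried with plain stdio. This is only
a minimal sketch: skip_lines() is a hypothetical helper that uses getc()
in place of R's scanchar(), and it is not the code in scan.c. The point
it illustrates is the extra c != EOF test in the outer loop.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the skipping loop, using getc() instead of
 * R's scanchar(). Skips up to nskip lines and returns how many lines
 * were actually consumed (a last line without a trailing newline still
 * counts as one). */
long long skip_lines(FILE *fp, long long nskip)
{
    assert(fp != NULL);
    long long skipped = 0;
    int c = 0;

    /* Outer loop: one iteration per line, but give up once EOF was seen. */
    while (skipped < nskip && c != EOF) {
        long long nchars = 0;
        /* Inner loop: consume characters up to and including the newline. */
        while ((c = getc(fp)) != '\n' && c != EOF)
            nchars++;
        /* Count the line unless we hit EOF without reading anything. */
        if (c == '\n' || nchars > 0)
            skipped++;
    }
    return skipped;
}
```

Without the c != EOF test, the outer loop of this sketch would keep
calling getc() on an exhausted stream until its counter reached nskip.
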
Making it interruptible is a bit more work: we need to ensure that a
valid context is set up and check regularly for an interrupt.

--- src/main/scan.c	(revision 83797)
+++ src/main/scan.c	(working copy)
@@ -835,7 +835,7 @@
 attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
 {
     SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
-    int c, flush, fill, blskip, multiline, escapes, skipNul;
+    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
     R_xlen_t nmax, nlines, nskip;
     const char *p, *encoding;
     RCNTXT cntxt;
@@ -952,8 +952,6 @@
 	    if(!data.con->canread)
 		error(_("cannot read from this connection"));
 	}
-	for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
-	    while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
     }
 
     ans = R_NilValue;		/* -Wall */
@@ -966,6 +964,10 @@
     cntxt.cend = &scan_cleanup;
     cntxt.cenddata = &data;
 
+    if (ii) for (R_xlen_t i = 0, j = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
+	while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF)
+	    if (j++ % 10000 == 9999) R_CheckUserInterrupt();
+
     switch (TYPEOF(what)) {
     case LGLSXP:
     case INTSXP:

This way, even if you pour a Decanter of Endless Lines (e.g. mkfifo
LINES; perl -E'print "A"x42 while 1;' > LINES) into scan(), it can
still be interrupted, even if neither newline nor EOF ever arrives.
(We never skip lines when reading from the console? I suppose it makes
sense. I think this needs to be documented, and I can write a
documentation patch.)
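
The periodic check can also be sketched outside of R. Everything named
here is an assumption made for illustration: the interrupt_check
callback type, the stop_when_zero() demo callback and
skip_lines_interruptible() are hypothetical. In R itself,
R_CheckUserInterrupt() does not return a flag but jumps out of the
loop. Unlike the patch above, this sketch counts every character read,
including newlines.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical callback type: returns nonzero when the caller should stop. */
typedef int (*interrupt_check)(void *ctx);

/* Hypothetical demo callback: ctx points to an int holding the number of
 * polls to let through before requesting a stop. */
int stop_when_zero(void *ctx)
{
    int *remaining = ctx;
    return (*remaining)-- <= 0;
}

/* Skip up to nskip lines, polling check() once every `period` characters.
 * Returns 0 when done (nskip lines skipped or EOF reached), -1 if the
 * callback asked to stop. */
int skip_lines_interruptible(FILE *fp, long long nskip, long long period,
                             interrupt_check check, void *ctx)
{
    assert(fp != NULL && period > 0);
    long long nread = 0;
    int c = 0;

    for (long long i = 0; i < nskip && c != EOF; i++) {
        do {
            c = getc(fp);
            /* Poll rarely: the check itself may be expensive. */
            if (++nread % period == 0 && check != NULL && check(ctx))
                return -1;
        } while (c != '\n' && c != EOF);
    }
    return 0;
}
```

Calling skip_lines_interruptible(fp, nskip, 10000, stop_when_zero, &polls)
on a stream that never delivers a newline returns -1 once the callback
asks to stop, instead of reading forever.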

If you actually have 1e11 lines in your file and would like to read it
in chunks, it may help to use

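# open the connection explicitly: scan() opens and closes an unopened
# connection on every call, which would restart from the beginning
f <- file('...', open = 'r')
chunk1 <- scan(f, n = n1, skip = nskip1)
# the following will continue reading where chunk1 had ended
chunk2 <- scan(f, n = n2, skip = nskip2)
close(f)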

...in order to avoid having to skip over chunks you have already read,
which otherwise makes the algorithm quadratic in the number of lines
instead of linear. (I couldn't determine whether you're already doing
this, sorry.)

Skipping a fixed number of lines is hard: since lines have variable
length, every character has to be read to find out where each line
ends. If the chunks were given as byte ranges, seek() could jump
straight to them, but that is not possible with line counts.
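
For contrast, here is a minimal sketch of the byte-range case in C,
assuming (unlike the file in question) that every record occupies
exactly record_size bytes, newline included. seek_to_record() is a
hypothetical helper. Plain fseek() takes a long offset, which may be too
narrow for a 4.54 GB file on some platforms, where fseeko() or
_fseeki64() would be needed instead.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: position fp at the start of record number `index`
 * (0-based), assuming fixed-size records of record_size bytes each.
 * Returns 0 on success, -1 if the seek fails. */
int seek_to_record(FILE *fp, long index, long record_size)
{
    assert(fp != NULL && index >= 0 && record_size > 0);
    /* No characters are read: the offset is computed, not searched for. */
    return fseek(fp, index * record_size, SEEK_SET) == 0 ? 0 : -1;
}
```

The cost of the jump does not depend on how many records are skipped,
which is exactly what variable-length lines rule out.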

-- 
Best regards,
Ivan


