[Rd] scan(..., skip=1e11): infinite loop; cannot interrupt

Fri Mar 10 13:54:43 CET 2023

On 2/11/23 09:33, Ivan Krylov wrote:
> On Fri, 10 Feb 2023 23:38:55 -0600
> Spencer Graves <spencer.graves using prodsyse.com> wrote:
>
>> I have a 4.54 GB file that I'm trying to read in chunks using
>> "scan(..., skip=__)".  It works as expected for small values of
>> "skip" but goes into an infinite loop for "skip=1e11" and similar
>> large values of skip:  I cannot even interrupt it;  I must kill R.
> Skipping lines is done by two nested loops. The outer loop counts the
> lines to skip; the inner loop reads characters until it encounters a
> newline or end of file. The outer loop doesn't check for EOF and keeps
> asking for more characters until the inner loop runs at least once for
> every line it wants to skip. The following patch should avoid the
> wait in such cases:
>
> --- src/main/scan.c	(revision 83797)
> +++ src/main/scan.c	(working copy)
> @@ -835,7 +835,7 @@
>   attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
>   {
>       SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
> -    int c, flush, fill, blskip, multiline, escapes, skipNul;
> +    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
>       R_xlen_t nmax, nlines, nskip;
>       const char *p, *encoding;
>       RCNTXT cntxt;
> @@ -952,7 +952,7 @@
>   	    if(!data.con->canread)
>   		error(_("cannot read from this connection"));
>   	}
> -	for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
> +	for (R_xlen_t i = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
>   	    while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
>       }
>   
>
> Making it interruptible is a bit more work: we need to ensure that a
> valid context is set up and check regularly for an interrupt.
>
> --- src/main/scan.c	(revision 83797)
> +++ src/main/scan.c	(working copy)
> @@ -835,7 +835,7 @@
>   attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
>   {
>       SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
> -    int c, flush, fill, blskip, multiline, escapes, skipNul;
> +    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
>       R_xlen_t nmax, nlines, nskip;
>       const char *p, *encoding;
>       RCNTXT cntxt;
> @@ -952,8 +952,6 @@
>   	    if(!data.con->canread)
>   		error(_("cannot read from this connection"));
>   	}
> -	for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
> -	    while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
>       }
>   
>       ans = R_NilValue;		/* -Wall */
> @@ -966,6 +964,10 @@
>       cntxt.cend = &scan_cleanup;
>       cntxt.cenddata = &data;
>   
> +    if (ii) for (R_xlen_t i = 0, j = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
> +	while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF)
> +	    if (j++ % 10000 == 9999) R_CheckUserInterrupt();
> +
>       switch (TYPEOF(what)) {
>       case LGLSXP:
>       case INTSXP:
>
> This way, even if you pour a Decanter of Endless Lines (e.g. mkfifo
> LINES; perl -E'print "A"x42 while 1;' > LINES) into scan(), it can
> still be interrupted, even if neither newline nor EOF ever arrives.

Thanks, I've updated the implementation of scan() in R-devel to be 
interruptible while skipping lines.

I've done it slightly differently as I found there already was a memory 
leak, which could be fixed by creating the context a bit earlier.

I've also avoided the modulo on the fast path, as I saw a 13% performance
overhead on my mailbox file. Decrementing a counter and checking it
against zero didn't have measurable overhead.
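
As a rough illustration of both points (stopping at EOF and polling via
a countdown rather than a modulo), a standalone sketch might look like
the following. It is not the code committed to R-devel: it uses plain
stdio instead of scanchar(), and skip_lines(), poll_interrupt(),
poll_calls and POLL_EVERY are hypothetical names, with poll_interrupt()
standing in for R_CheckUserInterrupt().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for R_CheckUserInterrupt(); it only counts calls. */
static long poll_calls = 0;
static void poll_interrupt(void) { poll_calls++; }

#define POLL_EVERY 10000   /* characters between polls; an arbitrary choice */

/* Skip up to nskip lines of fp and return how many newlines were consumed.
 * The outer loop stops as soon as EOF has been seen, so a huge nskip
 * cannot spin once the input is exhausted. */
static long long skip_lines(FILE *fp, long long nskip)
{
    assert(fp != NULL);
    int c = 0;
    long long skipped = 0;
    int countdown = POLL_EVERY;

    for (long long i = 0; i < nskip && c != EOF; i++) {
        while ((c = getc(fp)) != '\n' && c != EOF) {
            /* decrement and compare with zero: no division on the fast path */
            if (--countdown == 0) {
                poll_interrupt();
                countdown = POLL_EVERY;
            }
        }
        if (c == '\n')
            skipped++;
    }
    return skipped;
}
```

Calling skip_lines(fp, 100000000000LL) on a short file returns after a
single pass over it, and a line that never ends still reaches
poll_interrupt() once every POLL_EVERY characters.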

Best
Tomas

> (We never skip lines when reading from the console? I suppose it makes
> sense. I think this needs to be documented, and I can write a
> documentation patch.)
>
> If you actually have 1e11 lines in your file and would like to read it
> in chunks, it may help to use
>
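> # open the connection explicitly: scan() closes a connection that it
> # had to open itself, so chunk2 would start from the top of the file
> f <- file('...', open = 'r')
> chunk1 <- scan(f, n = n1, skip = nskip1)
> # the following will continue reading where chunk1 had ended
> chunk2 <- scan(f, n = n2, skip = nskip2)
> close(f)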
>
> ...in order to avoid having to skip over chunks you have already read,
> which otherwise makes the algorithm quadratic in number of lines
> instead of linear. (I couldn't determine whether you're already doing
> this, sorry.)
>
> Skipping a fixed number of lines is hard: since they have variable
> length, every character must be read in order to find where each
> line ends. With byte ranges, it would have been possible to use
> seek(), but not here.
>
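
The contrast between line offsets and byte offsets can be sketched in
plain C. This is only an illustration with stdio, not R's connection
code; seek_to_line() and seek_to_byte() are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Position fp just past the lineno-th newline (so at the start of
 * 0-based line `lineno`). Every preceding character has to be read,
 * so the cost grows with the number of bytes before that line.
 * Returns 0 on success, -1 if EOF is reached first. */
static int seek_to_line(FILE *fp, long lineno)
{
    assert(fp != NULL);
    int c = 0;
    rewind(fp);
    for (long i = 0; i < lineno; i++) {
        while ((c = getc(fp)) != '\n' && c != EOF)
            ;                       /* discard the rest of this line */
        if (c == EOF)
            return -1;
    }
    return 0;
}

/* Position fp at an absolute byte offset: one call, nothing is read. */
static int seek_to_byte(FILE *fp, long offset)
{
    assert(fp != NULL);
    return fseek(fp, offset, SEEK_SET);
}
```

If the byte offset at the end of each chunk is remembered (for example
via ftell()), the next chunk can start with seek_to_byte() instead of
re-reading all earlier lines, which is the same saving that keeping the
connection open gives in R.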