[Rd] scan(..., skip=1e11): infinite loop; cannot interrupt
Suharto Anggono
@uh@rto_@nggono @end|ng |rom y@hoo@com
Mon Mar 13 19:42:06 CET 2023
With
    if (!j--) {
        R_CheckUserInterrupt();
        j = 10000;
    }
as in current R devel (r83976), j goes negative (-1) and the interrupt is checked every 10001 iterations instead of every 10000. I prefer
    if (!--j) {
        R_CheckUserInterrupt();
        j = 10000;
    }
.
In current R devel (r83976), if EOF is reached, the outer loop keeps going: i keeps incrementing until it reaches nskip.
The outer loop could be made to stop on EOF as well.
Alternatively, the nested loop can be avoided altogether, as in the following.
    if (nskip) for (R_xlen_t i = 0, j = 10000; ; ) { /* MBCS-safe */
        c = scanchar(FALSE, &data);
        if (!j--) {
            R_CheckUserInterrupt();
            j = 10000;
        }
        if ((c == '\n' && ++i == nskip) || c == R_EOF)
            break;
    }
-----------
On 2/11/23 09:33, Ivan Krylov wrote:
> On Fri, 10 Feb 2023 23:38:55 -0600
> Spencer Graves <spencer.graves using prodsyse.com> wrote:
>
>> I have a 4.54 GB file that I'm trying to read in chunks using
>> "scan(..., skip=__)". It works as expected for small values of
>> "skip" but goes into an infinite loop for "skip=1e11" and similar
>> large values of skip: I cannot even interrupt it; I must kill R.
> Skipping lines is done by two nested loops. The outer loop counts the
> lines to skip; the inner loop reads characters until it encounters a
> newline or end of file. The outer loop doesn't check for EOF and keeps
> asking for more characters until the inner loop runs at least once for
> every line it wants to skip. The following patch should avoid the
> wait in such cases:
>
> --- src/main/scan.c (revision 83797)
> +++ src/main/scan.c (working copy)
> @@ -835,7 +835,7 @@
> attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
> {
> SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
> - int c, flush, fill, blskip, multiline, escapes, skipNul;
> + int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
> R_xlen_t nmax, nlines, nskip;
> const char *p, *encoding;
> RCNTXT cntxt;
> @@ -952,7 +952,7 @@
> if(!data.con->canread)
> error(_("cannot read from this connection"));
> }
> - for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
> + for (R_xlen_t i = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
> while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
> }
>
>
> Making it interruptible is a bit more work: we need to ensure that a
> valid context is set up and check regularly for an interrupt.
>
> --- src/main/scan.c (revision 83797)
> +++ src/main/scan.c (working copy)
> @@ -835,7 +835,7 @@
> attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
> {
> SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
> - int c, flush, fill, blskip, multiline, escapes, skipNul;
> + int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
> R_xlen_t nmax, nlines, nskip;
> const char *p, *encoding;
> RCNTXT cntxt;
> @@ -952,8 +952,6 @@
> if(!data.con->canread)
> error(_("cannot read from this connection"));
> }
> - for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
> - while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
> }
>
> ans = R_NilValue; /* -Wall */
> @@ -966,6 +964,10 @@
> cntxt.cend = &scan_cleanup;
> cntxt.cenddata = &data;
>
> + if (ii) for (R_xlen_t i = 0, j = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
> + while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF)
> + if (j++ % 10000 == 9999) R_CheckUserInterrupt();
> +
> switch (TYPEOF(what)) {
> case LGLSXP:
> case INTSXP:
>
> This way, even if you pour a Decanter of Endless Lines (e.g. mkfifo
> LINES; perl -E'print "A"x42 while 1;' > LINES) into scan(), it can
> still be interrupted, even if neither newline nor EOF ever arrives.
Thanks, I've updated the implementation of scan() in R-devel to be
interruptible while skipping lines.
I've done it slightly differently as I found there already was a memory
leak, which could be fixed by creating the context a bit earlier.
I've also avoided modulo on the fast path as I saw 13% performance
overhead on my mailbox file. Decrementing and checking against zero
didn't have measurable overhead.
Best
Tomas
[snip]