[Rd] New version of the R parser in pqR
Radford Neal
radford at cs.toronto.edu
Sat Sep 19 16:07:18 CEST 2015
I have rewritten the R parser in the new version of pqR that I
recently released (pqR-2015-09-14, at pqR-project.org). The new
version of the parser is much cleaner, is faster (sometimes quite
substantially faster), has a better interface to the read-eval-print
loop, and provides a better basis for future extensions. The deparser
has also been substantially revised in pqR, and is better coordinated
with the parser. The new parser and deparser also fix a number of
bugs, as can be seen in the NEWS file at pqR-project.org.
I believe the new parser is almost completely compatible with R-3.2.2,
with just a few slight differences in the result of getParseData, and
a few deliberate changes, which could easily be undone if one really
wanted to. It works with RStudio and with the Windows GUI (I haven't
tested with the Mac GUI). Internally, there is some slight interaction
with pqR's use of read-only constants for some frequent sub-expressions,
but this also is easily removed if desired.
The new parser operates top-down, by recursive descent, rather than
using a bottom-up parser generated by Bison. This allows for much
cleaner operation in several respects. Use of PROTECT can now follow
the usual conventions, with no need to use UNPROTECT_PTR. Creation of
parse data records is relatively straightforward. The special
handling of newlines is also comparatively easy, without the need to
embed a rudimentary top-down parser within the lexical analyser, as was
done in the Bison parser.
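
For readers unfamiliar with the technique, here is a minimal sketch of top-down parsing by recursive descent, applied to a toy arithmetic grammar. It is not pqR's parser (which is written in C and handles the full R grammar); the names `tokenize`, `Parser`, and `parse` are hypothetical and chosen only for illustration. Each grammar rule becomes one function, and the call stack tracks the parse state that a Bison-generated parser would keep in its own tables.

```python
import re

# Toy grammar (an assumption for illustration, not R's grammar):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split text into integer and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Return the next token without consuming it, or None at the end."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        """Consume and return the next token (None at the end)."""
        token = self.peek()
        self.pos += 1
        return token

    def parse_expr(self):
        # Left-associative chain of '+' and '-'.
        node = self.parse_term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.parse_term())
        return node

    def parse_term(self):
        # Left-associative chain of '*' and '/', binding tighter than '+'.
        node = self.parse_factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.parse_factor())
        return node

    def parse_factor(self):
        token = self.advance()
        if token == "(":
            node = self.parse_expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    """Parse text into a nested tuple tree of the form (op, left, right)."""
    parser = Parser(tokenize(text))
    tree = parser.parse_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

Because each rule is an ordinary function, intermediate results live in local variables, which is roughly why stack-like protection of partial results becomes natural in a parser structured this way.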
The old read-eval-print loop operates by calling the parser repeatedly
with one line of input, two lines of input, etc. until it gets a
return code other than PARSE_INCOMPLETE. This results in the time
taken growing as the square of the number of source lines. The new
read-eval-print loop simply provides the parser with a character input
routine that reads new lines as required, while the parser avoids
looking at a character until really needed, to avoid spurious requests
for more input lines. A quadratic time growth relating to parse data
is also avoided in the new parser.
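
The difference in cost can be illustrated with a small sketch that only counts how many lines a parser would have to scan under each scheme. Everything here is a simplifying assumption: the "parser" is a toy that treats input as complete once braces balance, and the function names are hypothetical rather than taken from R or pqR.

```python
def is_complete(lines, counter):
    """Toy stand-in for a parser: input is complete once braces balance.

    counter is a one-element list used to count every line scanned.
    """
    depth = 0
    for line in lines:
        counter[0] += 1
        depth += line.count("{") - line.count("}")
    return depth == 0


def old_style_lines_scanned(source_lines):
    """Re-parse the whole buffer from the start after each new line."""
    counter = [0]
    buffered = []
    for line in source_lines:
        buffered.append(line)
        if is_complete(buffered, counter):
            break
    return counter[0]


def new_style_lines_scanned(source_lines):
    """Pull lines on demand through an input routine, scanning each once."""
    scanned = 0
    depth = 0
    for line in source_lines:  # the iterator plays the role of the input routine
        scanned += 1
        depth += line.count("{") - line.count("}")
        if depth == 0:
            break
    return scanned


def make_function_source(n):
    """Build an n-line function definition (n >= 2) as a list of lines."""
    return ["f <- function() {"] + ["x <- 1"] * (n - 2) + ["}"]
```

For an expression spanning n lines, the first scheme scans 1 + 2 + ... + n = n(n+1)/2 lines in total, while the second scans n. With `make_function_source(100)` that is 5050 lines versus 100.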
I suggest that R Core consider adopting this new parser in future
releases of R, along with the associated changes to the read-eval-print
loop, and the revised version of the deparser.
The new parser is better documented than the old parser. I am also
willing to provide assistance to anyone trying to understand the code.
I have tested the new parser on the 5018 packages in the pqR repository,
but of course it's possible that problems might show up in some other
CRAN packages. I'm willing to help in resolving any such problems as
well.
Radford Neal