[Rd] Why does the lexical analyzer drop comments?
Yihui Xie
xieyihui at gmail.com
Tue Mar 31 14:16:47 CEST 2009
Hi Romain,
I've been thinking for quite a long time on how to keep comments when
parsing R code and finally got a trick with inspiration from one of my
friends, i.e. to mask the comments in special assignments to "cheat" R
parser:
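In miniature, the trick works like this (a sketch with fixed, illustrative marker names; the real function below samples random identifiers instead):

```r
# Sketch of the masking trick: a comment line is wrapped in a string
# assignment so that parse() treats it as ordinary code.
# BEGIN.COMMENT / END.COMMENT are illustrative names only.
masked <- 'BEGIN.COMMENT = "# a comment END.COMMENT"'
# parse() accepts the line, because the comment is now just a string
dep <- deparse(parse(text = masked)[[1]])
# stripping the markers afterwards recovers the comment text
comment <- gsub('BEGIN.COMMENT = "|END.COMMENT"', "", dep)
cat(comment, "\n")
```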
# keep.comment: whether to keep the comments or not
# keep.blank.line: whether to preserve blank lines or not
# begin.comment and end.comment: special identifiers that mask the original
#   comments as 'begin.comment = "#[ comments ]end.comment"';
#   these marks are removed after the modified code is parsed
tidy.source <- function(source = "clipboard", keep.comment = TRUE,
    keep.blank.line = FALSE, begin.comment, end.comment, ...) {
    # parse and deparse the code
    tidy.block = function(block.text) {
        exprs = parse(text = block.text)
        n = length(exprs)
        res = character(n)
        for (i in 1:n) {
            dep = paste(deparse(exprs[i]), collapse = "\n")
            res[i] = substring(dep, 12, nchar(dep) - 1)
        }
        return(res)
    }
    text.lines = readLines(source, warn = FALSE)
    if (keep.comment) {
        # identifier for comments
        identifier = function() paste(sample(LETTERS), collapse = "")
        if (missing(begin.comment))
            begin.comment = identifier()
        if (missing(end.comment))
            end.comment = identifier()
        # remove leading and trailing white spaces
        text.lines = gsub("^[[:space:]]+|[[:space:]]+$", "", text.lines)
        # make sure the identifiers are not in the code,
        # or the original code might be modified
        while (length(grep(sprintf("%s|%s", begin.comment, end.comment),
            text.lines))) {
            begin.comment = identifier()
            end.comment = identifier()
        }
        head.comment = substring(text.lines, 1, 1) == "#"
        # add identifiers to comment lines to cheat R parser
        if (any(head.comment)) {
            text.lines[head.comment] = gsub("\"", "\'",
                text.lines[head.comment])
            text.lines[head.comment] = sprintf("%s=\"%s%s\"",
                begin.comment, text.lines[head.comment], end.comment)
        }
        # keep blank lines?
        blank.line = text.lines == ""
        if (any(blank.line) & keep.blank.line)
            text.lines[blank.line] = sprintf("%s=\"%s\"", begin.comment,
                end.comment)
        text.tidy = tidy.block(text.lines)
        # remove the identifiers
        text.tidy = gsub(sprintf("%s = \"|%s\"", begin.comment,
            end.comment), "", text.tidy)
    }
    else {
        text.tidy = tidy.block(text.lines)
    }
    cat(paste(text.tidy, collapse = "\n"), "\n", ...)
    invisible(text.tidy)
}
The above function can deal with comments that occupy whole lines, e.g.
f = tempfile()
writeLines('
# rotation of the word "Animation"
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
# redraw the plot again and again
plot(1,ann=FALSE,type="n",axes=FALSE)
# rotate; use rainbow() colors
text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
# pause for a while
Sys.sleep(0.01)}
', f)
Then parse the code file 'f':
> tidy.source(f)
# rotation of the word 'Animation'
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
# redraw the plot again and again
plot(1, ann = FALSE, type = "n", axes = FALSE)
# rotate; use rainbow() colors
text(1, 1, "Animation", srt = i, col = rainbow(360)[i], cex = 7 *
i/360)
# pause for a while
Sys.sleep(0.01)
}
Of course this function has some limitations: it does not support
inline comments or comments inside incomplete lines of code.
Peter's example
f #here
( #here
a #here (possibly)
= #here
1 #this one belongs to the argument, though
) #but here as well
will be parsed as
f
(a = 1)
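This limitation is inherent to the approach: the masking only rewrites lines that *start* with "#", and parse()/deparse() silently discard a trailing comment, as this small sketch shows:

```r
# An inline comment cannot survive the parse/deparse round trip:
# the parser drops everything after "#" on a code line.
dep <- deparse(parse(text = "plot(1)  # an inline comment")[[1]])
cat(dep, "\n")  # only the call remains; the comment is gone
```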
I'm quite interested in syntax highlighting of R code and saw your
previous discussions in other posts (with Jose Quesada, etc.). I'd
like to contribute something to your package if I can be of help.
Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: +86-(0)10-82509086 Fax: +86-(0)10-82509086
Mobile: +86-15810805877
Homepage: http://www.yihui.name
School of Statistics, Room 1037, Mingde Main Building,
Renmin University of China, Beijing, 100872, China
2009/3/21 <romain.francois at dbmail.com>:
>
> It happens in the token function in gram.c:
>
>     c = SkipSpace();
>     if (c == '#') c = SkipComment();
>
> and then SkipComment goes like that:
>
> static int SkipComment(void)
> {
>     int c;
>     while ((c = xxgetc()) != '\n' && c != R_EOF) ;
>     if (c == R_EOF) EndOfFile = 2;
>     return c;
> }
>
> which effectively drops comments.
>
> Would it be possible to keep the information somewhere?
>
> The source code says this:
>
>  *  The function yylex() scans the input, breaking it into
>  *  tokens which are then passed to the parser.  The lexical
>  *  analyser maintains a symbol table (in a very messy fashion).
>
> so my question is: could we use this symbol table to keep track of, say, COMMENT tokens?
>
> Why would I even care about that? I'm writing a package that will
> perform syntax highlighting of R source code based on the output of the
> parser, and it seems a waste to drop the comments.
>
> And also, when you print a function to the R console, you don't get the comments, and some of them might be useful to the user.
>
> Am I mad if I contemplate looking into this?
>
> Romain
>
> --
> Romain Francois
> Independent R Consultant
> +33(0) 6 28 91 30 30
> http://romainfrancois.blog.free.fr
>
More information about the R-devel mailing list