[Rd] Why does the lexical analyzer drop comments?
Yihui Xie
xieyihui at gmail.com
Tue Mar 31 14:16:47 CEST 2009
Hi Romain,
I've been thinking for quite a long time on how to keep comments when
parsing R code and finally got a trick with inspiration from one of my
friends, i.e. to mask the comments in special assignments to "cheat" R
parser:
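In miniature, the trick works like this (a sketch with fixed, illustrative marker names; the real function below samples random identifiers instead):

```r
# Sketch of the masking trick: a comment line is wrapped in a string
# assignment so that parse() treats it as ordinary code.
# BEGIN.COMMENT / END.COMMENT are illustrative names only.
masked <- 'BEGIN.COMMENT = "# a comment END.COMMENT"'
# parse() accepts the line, because the comment is now just a string
dep <- deparse(parse(text = masked)[[1]])
# stripping the markers afterwards recovers the comment text
comment <- gsub('BEGIN.COMMENT = "|END.COMMENT"', "", dep)
cat(comment, "\n")
```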
# keep.comment: whether to keep the comments or not
# keep.blank.line: whether to preserve blank lines or not
# begin.comment and end.comment: special identifiers that mask the original
#   comments as 'begin.comment = "#[ comments ]end.comment"';
#   these marks are removed after the modified code is parsed
tidy.source <- function(source = "clipboard", keep.comment = TRUE,
    keep.blank.line = FALSE, begin.comment, end.comment, ...) {
    # parse and deparse the code
    tidy.block = function(block.text) {
        exprs = parse(text = block.text)
        n = length(exprs)
        res = character(n)
        for (i in 1:n) {
            dep = paste(deparse(exprs[i]), collapse = "\n")
            res[i] = substring(dep, 12, nchar(dep) - 1)
        }
        return(res)
    }
    text.lines = readLines(source, warn = FALSE)
    if (keep.comment) {
        # identifier for comments
        identifier = function() paste(sample(LETTERS), collapse = "")
        if (missing(begin.comment))
            begin.comment = identifier()
        if (missing(end.comment))
            end.comment = identifier()
        # remove leading and trailing white spaces
        text.lines = gsub("^[[:space:]]+|[[:space:]]+$", "", text.lines)
        # make sure the identifiers are not in the code,
        # or the original code might be modified
        while (length(grep(sprintf("%s|%s", begin.comment, end.comment),
            text.lines))) {
            begin.comment = identifier()
            end.comment = identifier()
        }
        head.comment = substring(text.lines, 1, 1) == "#"
        # add identifiers to comment lines to cheat R parser
        if (any(head.comment)) {
            text.lines[head.comment] = gsub("\"", "\'",
                text.lines[head.comment])
            text.lines[head.comment] = sprintf("%s=\"%s%s\"",
                begin.comment, text.lines[head.comment], end.comment)
        }
        # keep blank lines?
        blank.line = text.lines == ""
        if (any(blank.line) & keep.blank.line)
            text.lines[blank.line] = sprintf("%s=\"%s\"", begin.comment,
                end.comment)
        text.tidy = tidy.block(text.lines)
        # remove the identifiers
        text.tidy = gsub(sprintf("%s = \"|%s\"", begin.comment,
            end.comment), "", text.tidy)
    }
    else {
        text.tidy = tidy.block(text.lines)
    }
    cat(paste(text.tidy, collapse = "\n"), "\n", ...)
    invisible(text.tidy)
}
The above function can deal with comments that occupy whole lines, e.g.
f = tempfile()
writeLines('
# rotation of the word "Animation"
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
# redraw the plot again and again
plot(1,ann=FALSE,type="n",axes=FALSE)
# rotate; use rainbow() colors
text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
# pause for a while
Sys.sleep(0.01)}
', f)
Then parse the code file 'f':
> tidy.source(f)
# rotation of the word 'Animation'
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
# redraw the plot again and again
plot(1, ann = FALSE, type = "n", axes = FALSE)
# rotate; use rainbow() colors
text(1, 1, "Animation", srt = i, col = rainbow(360)[i], cex = 7 *
i/360)
# pause for a while
Sys.sleep(0.01)
}
Of course this function has some limitations: it does not support
inline comments or comments inside incomplete lines of code.
Peter's example
f #here
( #here
a #here (possibly)
= #here
1 #this one belongs to the argument, though
) #but here as well
will be parsed as
f
(a = 1)
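This limitation is inherent to the approach: the masking only rewrites lines that *start* with "#", and parse()/deparse() silently discard a trailing comment, as this small sketch shows:

```r
# An inline comment cannot survive the parse/deparse round trip:
# the parser drops everything after "#" on a code line.
dep <- deparse(parse(text = "plot(1)  # an inline comment")[[1]])
cat(dep, "\n")  # only the call remains; the comment is gone
```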
I'm quite interested in syntax highlighting of R code and saw your
previous discussions in other posts (with Jose Quesada, etc.). I'd
like to contribute something to your package if I can be of help.
Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: +86-(0)10-82509086 Fax: +86-(0)10-82509086
Mobile: +86-15810805877
Homepage: http://www.yihui.name
School of Statistics, Room 1037, Mingde Main Building,
Renmin University of China, Beijing, 100872, China
2009/3/21 <romain.francois at dbmail.com>:
>
> It happens in the token function in gram.c:
>
>     c = SkipSpace();
>     if (c == '#') c = SkipComment();
>
> and then SkipComment goes like that:
>
> static int SkipComment(void)
> {
>     int c;
>     while ((c = xxgetc()) != '\n' && c != R_EOF) ;
>     if (c == R_EOF) EndOfFile = 2;
>     return c;
> }
>
> which effectively drops comments.
>
> Would it be possible to keep the information somewhere?
>
> The source code says this:
>
>  *  The function yylex() scans the input, breaking it into
>  *  tokens which are then passed to the parser.  The lexical
>  *  analyser maintains a symbol table (in a very messy fashion).
>
> so my question is: could we use this symbol table to keep track of, say, COMMENT tokens?
>
> Why would I even care about that? I'm writing a package that will
> perform syntax highlighting of R source code based on the output of the
> parser, and it seems a waste to drop the comments.
>
> And also, when you print a function to the R console, you don't get the comments, and some of them might be useful to the user.
>
> Am I mad if I contemplate looking into this?
>
> Romain
>
> --
> Romain Francois
> Independent R Consultant
> +33(0) 6 28 91 30 30
> http://romainfrancois.blog.free.fr
>
More information about the R-devel mailing list