[Rd] Why does the lexical analyzer drop comments?
Duncan Murdoch
murdoch at stats.uwo.ca
Mon Mar 23 01:04:24 CET 2009
On 22/03/2009 4:50 PM, Romain Francois wrote:
> Romain Francois wrote:
>> Peter Dalgaard wrote:
>>> Duncan Murdoch wrote:
>>>> On 3/20/2009 2:56 PM, romain.francois at dbmail.com wrote:
>>>>> It happens in the token function in gram.c:
>>>>>     c = SkipSpace();
>>>>>     if (c == '#') c = SkipComment();
>>>>>
>>>>> and then SkipComment goes like that:
>>>>> static int SkipComment(void)
>>>>> {
>>>>>     int c;
>>>>>     while ((c = xxgetc()) != '\n' && c != R_EOF) ;
>>>>>     if (c == R_EOF) EndOfFile = 2;
>>>>>     return c;
>>>>> }
>>>>>
>>>>> which effectively drops comments.
>>>>>
>>>>> Would it be possible to keep the information somewhere?
>>>>> The source code says this:
>>>>>  *  The function yylex() scans the input, breaking it into
>>>>>  *  tokens which are then passed to the parser.  The lexical
>>>>>  *  analyser maintains a symbol table (in a very messy fashion).
>>>>>
>>>>> so my question is: could we use this symbol table to keep track
>>>>> of, say, COMMENT tokens?
>>>>> Why would I even care about that? I'm writing a package that will
>>>>> perform syntax highlighting of R source code based on the output
>>>>> of the parser, and it seems a waste to drop the comments.
>>>>> And also, when you print a function to the R console, you don't
>>>>> get the comments, and some of them might be useful to the user.
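>>>>>
>>>>> For example (a small sketch; with keep.source turned off, as it is
>>>>> in packages, printing falls back on deparse() and the comment is
>>>>> gone):
>>>>>
>>>>> options(keep.source = FALSE)
>>>>> f <- function(x) {
>>>>>     # double the input
>>>>>     2 * x
>>>>> }
>>>>> f    # prints the deparsed body, without the comment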
>>>>>
>>>>> Am I mad if I contemplate looking into this?
>>>> Comments are syntactically the same as whitespace. You don't want
>>>> them to affect the parsing.
>>> Well, you might, but there is quite some madness lying that way.
>>>
>>> Back in the bronze age, we did actually try to keep comments attached
>>> to (AFAIR) the preceding token. One problem is that the elements of
>>> the parse tree typically involve multiple tokens, and if comments
>>> after different tokens get stored in the same place, something is
>>> not going back where it came from when deparsing. So we had problems
>>> with comments moving from one end of a loop to the other, and the like.
>> Ouch. That helps in picturing the kind of madness ...
>>
>> Another way could be to record comments separately (similarly to the
>> srcfile attribute, for example) instead of dropping them entirely, but
>> I guess this is the same as Duncan's idea, which is easier to set up.
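>>
>> Something along these lines, purely hypothetical:
>>
>> p <- parse("/tmp/parsing.R")
>> attr(p, "comments")   # imagined: a list of (srcref, text) pairs,
>>                       # kept apart from the parse tree itself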
>>
>>> You could try extending the scheme by encoding which part of a
>>> syntactic structure the comment belongs to, but consider for instance
>>> how many places in a function call you can stick in a comment.
>>>
>>> f #here
>>> ( #here
>>> a #here (possibly)
>>> = #here
>>> 1 #this one belongs to the argument, though
>>> ) #but here as well
> Coming back to this. I actually get two expressions:
>
> > p <- parse( "/tmp/parsing.R")
> > str( p )
> length 2 expression(f, (a = 1))
> - attr(*, "srcref")=List of 2
> ..$ :Class 'srcref' atomic [1:6] 1 1 1 1 1 1
> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
> ..$ :Class 'srcref' atomic [1:6] 2 1 6 1 1 1
> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
> - attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
>
> But anyway, if I drop the first comment, then I get one expression with
> some srcref information:
>
> > p <- parse( "/tmp/parsing.R")
> > str( p )
> length 1 expression(f(a = 1))
> - attr(*, "srcref")=List of 1
> ..$ :Class 'srcref' atomic [1:6] 1 1 5 1 1 1
> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x9bca314>
> - attr(*, "srcfile")=Class 'srcfile' <environment: 0x9bca314>
>
> but as far as I can see, there is only srcref information for the
> expression as a whole; it does not go any deeper. So I am not sure I
> can implement Duncan's proposal without more detailed information from
> the parser, since I will only have a chance to check whether a piece
> of whitespace is actually a comment if it sits between two expressions
> with a srcref.
Currently srcrefs are only attached to whole statements. Since your
source only included one or two statements, you only get one or two
srcrefs. It would not be hard to attach a srcref to every
subexpression; there just hasn't been a need for that before, so for
the sake of efficiency I didn't do it.
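
For instance (a minimal sketch; the file path is only for illustration):

options(keep.source = TRUE)     # srcrefs are only recorded with this on
p <- parse("/tmp/parsing.R")
refs <- attr(p, "srcref")       # one srcref per top-level statement
# first and last source line of each statement
t(sapply(refs, function(r) as.integer(r)[c(1, 3)]))
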
However, it might make sense for you to have your own parser, based on
the grammar in R's parser, but handling whitespace differently.
Certainly it would make sense to do that before making changes to the
base R one. The whole source is in src/main/gram.y; if you're not
familiar with Bison, I can give you a hand.
Duncan Murdoch
>
> Would it be sensible then to retain the comments and their srcref
> information, but separate from the tokens used for the actual parsing,
> in some other attribute of the output of parse?
>
> Romain
>
>>>> If you're doing syntax highlighting, you can determine the
>>>> whitespace by looking at the srcref records, and then parse that to
>>>> determine what isn't being counted as tokens. (I think you'll find
>>>> a few things there besides whitespace, but it is a fairly limited
>>>> set, so it shouldn't be too hard to recognize.)
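>>>>
>>>> Roughly like this (an untested sketch; it only catches whole lines
>>>> between statements, and would miss trailing comments and a '#'
>>>> inside a string):
>>>>
>>>> options(keep.source = TRUE)
>>>> src  <- readLines("/tmp/parsing.R")
>>>> p    <- parse("/tmp/parsing.R")
>>>> refs <- attr(p, "srcref")
>>>> # lines covered by top-level statements
>>>> covered <- unlist(lapply(refs, function(r) seq(r[1], r[3])))
>>>> gaps <- setdiff(seq_along(src), covered)
>>>> grep("#", src[gaps], value = TRUE)   # candidate comment lines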
>>>>
>>>> The Rd parser is different, because in an Rd file, whitespace is
>>>> significant, so it gets kept.
>>>>
>>>> Duncan Murdoch