[Rd] Why does the lexical analyzer drop comments?
Romain Francois
romain.francois at dbmail.com
Mon Mar 23 08:10:52 CET 2009
Duncan Murdoch wrote:
> On 22/03/2009 4:50 PM, Romain Francois wrote:
>> Romain Francois wrote:
>>> Peter Dalgaard wrote:
>>>> Duncan Murdoch wrote:
>>>>> On 3/20/2009 2:56 PM, romain.francois at dbmail.com wrote:
>>>>>> It happens in the token function in gram.c:
>>>>>>     c = SkipSpace();
>>>>>>     if (c == '#') c = SkipComment();
>>>>>>
>>>>>> and then SkipComment goes like that:
>>>>>> static int SkipComment(void)
>>>>>> {
>>>>>>     int c;
>>>>>> Â Â Â while ((c = xxgetc()) != '\n' && c != R_EOF) ;
>>>>>> Â Â Â if (c == R_EOF) EndOfFile = 2;
>>>>>> Â Â Â return c;
>>>>>> }
>>>>>>
>>>>>> which effectively drops comments.
>>>>>>
>>>>>> Would it be possible to keep the information somewhere?
>>>>>> The source code says this:
>>>>>>  * The function yylex() scans the input, breaking it into
>>>>>>  * tokens which are then passed to the parser. The lexical
>>>>>>  * analyser maintains a symbol table (in a very messy fashion).
>>>>>>
>>>>>> so my question is: could we use this symbol table to keep track
>>>>>> of, say, COMMENT tokens?
>>>>>> Why would I even care about that? I'm writing a package that
>>>>>> will perform syntax highlighting of R source code based on the
>>>>>> output of the parser, and it seems a waste to drop the comments.
>>>>>> Also, when you print a function to the R console, you don't get
>>>>>> the comments, and some of them might be useful to the user.
>>>>>>
>>>>>> Am I mad if I contemplate looking into this?
>>>>> Comments are syntactically the same as whitespace. You don't want
>>>>> them to affect the parsing.
>>>> Well, you might, but there is quite some madness lying that way.
>>>>
>>>> Back in the bronze age, we did actually try to keep comments
>>>> attached to (AFAIR) the preceding token. One problem is that the
>>>> elements of the parse tree typically involve multiple tokens, and
>>>> if comments after different tokens get stored in the same place
>>>> something is not going back where it came from when deparsing. So
>>>> we had problems with comments moving from one end of a loop to the
>>>> other, and the like.
>>> Ouch. That helps me picture the kind of madness ...
>>>
>>> Another way could be to record comments separately (similarly to
>>> the srcfile attribute, for example) instead of dropping them
>>> entirely, but I guess this is the same as Duncan's idea, which is
>>> easier to set up.
>>>
>>>> You could try extending the scheme by encoding which part of a
>>>> syntactic structure the comment belongs to, but consider for
>>>> instance how many places in a function call you can stick in a
>>>> comment.
>>>>
>>>> f #here
>>>> ( #here
>>>> a #here (possibly)
>>>> = #here
>>>> 1 #this one belongs to the argument, though
>>>> ) #but here as well
>> Coming back to this. I actually get two expressions:
>>
>> > p <- parse( "/tmp/parsing.R")
>> > str( p )
>> length 2 expression(f, (a = 1))
>> - attr(*, "srcref")=List of 2
>> ..$ :Class 'srcref' atomic [1:6] 1 1 1 1 1 1
>> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
>> ..$ :Class 'srcref' atomic [1:6] 2 1 6 1 1 1
>> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
>> - attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
>>
>> But anyway, if I drop the first comment, then I get one expression
>> with some srcref information:
>>
>> > p <- parse( "/tmp/parsing.R")
>> > str( p )
>> length 1 expression(f(a = 1))
>> - attr(*, "srcref")=List of 1
>> ..$ :Class 'srcref' atomic [1:6] 1 1 5 1 1 1
>> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x9bca314>
>> - attr(*, "srcfile")=Class 'srcfile' <environment: 0x9bca314>
>>
>> but as far as I can see, there is only srcref information for that
>> expression as a whole; it does not go any deeper. So I am not sure I
>> can implement Duncan's proposal without more detailed information
>> from the parser, since I will only have the chance to check whether
>> a piece of whitespace is actually a comment if it lies between two
>> expressions with a srcref.
>
> Currently srcrefs are only attached to whole statements. Since your
> source only included one or two statements, you only get one or two
> srcrefs. It would not be hard to attach a srcref to every
> subexpression; there hasn't been a need for that before, so I didn't
> do it just for the sake of efficiency.
I understand that. I wanted to make sure I did not miss something.
> However, it might make sense for you to have your own parser, based on
> the grammar in R's parser, but handling white space differently.
> Certainly it would make sense to do that before making changes to the
> base R one. The whole source is in src/main/gram.y; if you're not
> familiar with Bison, I can give you a hand.
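For what a derived parser might do differently, here is a toy flex-style lexer fragment. It is illustrative only, not taken from R's gram.y: record_comment() is a made-up hook, and the catch-all rule glosses over R's real tokenization.

```
%{
/* Toy lexer fragment: hand comments to a side channel instead of
 * discarding them, so a derived parser can keep them for highlighting.
 * record_comment() is a hypothetical hook, not part of R's gram.y. */
%}
%%
"#"[^\n]*      { record_comment(yytext, yylineno); /* no token emitted */ }
[ \t\r]+       { /* whitespace: ignore, as the real lexer does */ }
\n             { yylineno++; }
.              { return yytext[0]; /* everything else: simplified away */ }
%%
```

Keeping comments out of the token stream but in a side table avoids the deparse-placement problems Peter describes above.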
Thank you, I appreciate your help. Having my own parser is the option I
am slowly converging toward.
I'll start by reading the Bison documentation. Besides the Bison
documents, is there any R-specific documentation on how the R parser
was written?
>
> Duncan Murdoch
>
>>
>> Would it be sensible then to retain the comments and their srcref
>> information, but separate from the tokens used for the actual
>> parsing, in some other attribute of the output of parse ?
>>
>> Romain
>>
>>>>> If you're doing syntax highlighting, you can determine the
>>>>> whitespace by looking at the srcref records, and then parse that
>>>>> to determine what isn't being counted as tokens. (I think you'll
>>>>> find a few things there besides whitespace, but it is a fairly
>>>>> limited set, so it shouldn't be too hard to recognize.)
>>>>>
>>>>> The Rd parser is different, because in an Rd file, whitespace is
>>>>> significant, so it gets kept.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr