[Rd] Why does the lexical analyzer drop comments?
Romain Francois
romain.francois at dbmail.com
Mon Mar 23 08:10:52 CET 2009
Duncan Murdoch wrote:
> On 22/03/2009 4:50 PM, Romain Francois wrote:
>> Romain Francois wrote:
>>> Peter Dalgaard wrote:
>>>> Duncan Murdoch wrote:
>>>>> On 3/20/2009 2:56 PM, romain.francois at dbmail.com wrote:
>>>>>> It happens in the token function in gram.c:
>>>>>>     c = SkipSpace();
>>>>>>     if (c == '#') c = SkipComment();
>>>>>>
>>>>>> and then SkipComment goes like that:
>>>>>> static int SkipComment(void)
>>>>>> {
>>>>>>     int c;
>>>>>> Â Â Â while ((c = xxgetc()) != '\n' && c != R_EOF) ;
>>>>>> Â Â Â if (c == R_EOF) EndOfFile = 2;
>>>>>> Â Â Â return c;
>>>>>> }
>>>>>>
>>>>>> which effectively drops comments.
>>>>>>
>>>>>> Would it be possible to keep the information somewhere?
>>>>>> The source code says this:
>>>>>>  * The function yylex() scans the input, breaking it into
>>>>>>  * tokens which are then passed to the parser. The lexical
>>>>>>  * analyser maintains a symbol table (in a very messy fashion).
>>>>>>
>>>>>> so my question is: could we use this symbol table to keep track
>>>>>> of, say, COMMENT tokens?
>>>>>> Why would I even care about that? I'm writing a package that
>>>>>> will perform syntax highlighting of R source code based on the
>>>>>> output of the parser, and it seems a waste to drop the comments.
>>>>>> Also, when you print a function to the R console, you don't get
>>>>>> the comments, and some of them might be useful to the user.
>>>>>>
>>>>>> Am I mad if I contemplate looking into this?
>>>>> Comments are syntactically the same as whitespace. You don't want
>>>>> them to affect the parsing.
>>>> Well, you might, but there is quite some madness lying that way.
>>>>
>>>> Back in the bronze age, we did actually try to keep comments
>>>> attached to (AFAIR) the preceding token. One problem is that the
>>>> elements of the parse tree typically involve multiple tokens, and
>>>> if comments after different tokens get stored in the same place
>>>> something is not going back where it came from when deparsing. So
>>>> we had problems with comments moving from one end of a loop to the
>>>> other, and the like.
>>> Ouch. That helps me picture the kind of madness ...
>>>
>>> Another way could be to record comments separately (similarly to
>>> the srcfile attribute, for example) instead of dropping them
>>> entirely, but I guess this is the same as Duncan's idea, which is
>>> easier to set up.
>>>
>>>> You could try extending the scheme by encoding which part of a
>>>> syntactic structure the comment belongs to, but consider for
>>>> instance how many places in a function call you can stick in a
>>>> comment.
>>>>
>>>> f #here
>>>> ( #here
>>>> a #here (possibly)
>>>> = #here
>>>> 1 #this one belongs to the argument, though
>>>> ) #but here as well
>> Coming back to this. I actually get two expressions:
>>
>> > p <- parse( "/tmp/parsing.R")
>> > str( p )
>> length 2 expression(f, (a = 1))
>> - attr(*, "srcref")=List of 2
>> ..$ :Class 'srcref' atomic [1:6] 1 1 1 1 1 1
>> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
>> ..$ :Class 'srcref' atomic [1:6] 2 1 6 1 1 1
>> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
>> - attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
>>
>> But anyway, if I drop the first comment, then I get one expression
>> with some srcref information:
>>
>> > p <- parse( "/tmp/parsing.R")
>> > str( p )
>> length 1 expression(f(a = 1))
>> - attr(*, "srcref")=List of 1
>> ..$ :Class 'srcref' atomic [1:6] 1 1 5 1 1 1
>> .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x9bca314>
>> - attr(*, "srcfile")=Class 'srcfile' <environment: 0x9bca314>
>>
>> but as far as I can see, there is only srcref information for that
>> expression as a whole; it does not go any deeper. So I am not sure I
>> can implement Duncan's proposal without more detailed information
>> from the parser, since I will only have the chance to check whether
>> a piece of whitespace is actually a comment if it lies between two
>> expressions with a srcref.
>
> Currently srcrefs are only attached to whole statements. Since your
> source only included one or two statements, you only get one or two
> srcrefs. It would not be hard to attach a srcref to every
> subexpression; there hasn't been a need for that before, so I didn't
> do it just for the sake of efficiency.
I understand that. I wanted to make sure I did not miss something.
> However, it might make sense for you to have your own parser, based on
> the grammar in R's parser, but handling white space differently.
> Certainly it would make sense to do that before making changes to the
> base R one. The whole source is in src/main/gram.y; if you're not
> familiar with Bison, I can give you a hand.
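For what a derived parser might do differently, here is a toy flex-style lexer fragment. It is illustrative only, not taken from R's gram.y: record_comment() is a made-up hook, and the catch-all rule glosses over R's real tokenization.

```
%{
/* Toy lexer fragment: hand comments to a side channel instead of
 * discarding them, so a derived parser can keep them for highlighting.
 * record_comment() is a hypothetical hook, not part of R's gram.y. */
%}
%%
"#"[^\n]*      { record_comment(yytext, yylineno); /* no token emitted */ }
[ \t\r]+       { /* whitespace: ignore, as the real lexer does */ }
\n             { yylineno++; }
.              { return yytext[0]; /* everything else: simplified away */ }
%%
```

Keeping comments out of the token stream but in a side table avoids the deparse-placement problems Peter describes above.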
Thank you, I appreciate your help. Having my own parser is the option I
am slowly converging toward.
I'll start by reading the Bison documentation. Besides the Bison
documents, is there any R-specific documentation on how the R parser
was written?
>
> Duncan Murdoch
>
>>
>> Would it be sensible then to retain the comments and their srcref
>> information, but separate from the tokens used for the actual
>> parsing, in some other attribute of the output of parse ?
>>
>> Romain
>>
>>>>> If you're doing syntax highlighting, you can determine the
>>>>> whitespace by looking at the srcref records, and then parse that
>>>>> to determine what isn't being counted as tokens. (I think you'll
>>>>> find a few things there besides whitespace, but it is a fairly
>>>>> limited set, so it shouldn't be too hard to recognize.)
>>>>>
>>>>> The Rd parser is different, because in an Rd file, whitespace is
>>>>> significant, so it gets kept.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr