[Rd] Mapping parse tree elements to tokens

Thu Jul 30 01:24:35 CEST 2015

On 29/07/2015 5:33 PM, Jim Hester wrote:
> As Michael guessed my main use cases was code analysis.  A concrete
> example where this would help is with my test code coverage tool covr. 
> There is currently a bug when tracking coverage for if / else statements
> when the clauses do not contain brackets
> (https://github.com/jimhester/covr/issues/39).  Because only one source
> reference is generated in this case (because it is parsed as a single
> expression), it is not possible to track each of the clauses
> separately.  While I can get the source reference for the entire
> statement, in order to extract the if/else clauses I need to either use
> the tokenized information from getParseData(), or re-parse the entire if
> / else expression by hand (which seems prone to error to me).

Re-parsing may be inefficient, but it isn't error prone.  Just use
getSrcLines (or as.character(srcref), with a bit more work) to get the
source code from the original file, and re-parse it.
> 
> Another example of where this would help is linking comments to
> expressions.  While I know this topic has been discussed previously
> (https://stat.ethz.ch/pipermail/r-devel/2009-March/052731.html) and I am
> fine with the default parser dropping comments, having the ability to
> map the more detailed tokens back to the parse tree would allow the
> comments to be annotated to their closest expression.

getParseData does record comments.

> 
> Of the three options you propose I think simply supplying the index as
> an additional column from the getParseData() output would be the most
> straightforward to implement and use.
> 
> While it is true that you can get most of the way there with the current
> source references as Michael mentions in some cases having more fine
> grained location information is useful and there is no great way to get
> there currently without re-parsing the full expressions from the source
> reference.

That's duplication of effort, but it's not really that slow.

> The current getParseData output is already very implementation specific
> so I don't think it would be a great additional support burden to add
> the indexing information.  Likely the whole function would have to be
> removed if a different parsing method was used.

Luke may wish to comment, but I think the issue is that the parse tree
is not uniquely defined by the source, there are some arbitrary
decisions being made.  Our parser groups things in a certain way
(especially comments, which are not part of the parse tree, but are
recorded nonetheless), and a different implementation would necessarily
be different.  The more we put into getParseData, the harder it becomes
to change those arbitrary decisions without breaking other people's
code.  We do break things sometimes, but we don't like to do it.

Duncan Murdoch

> 
> Regardless I am glad others have shown some interest in this issue,
> thank you for taking the time to read and respond!
> 
> Jim
> 
> On Wed, Jul 29, 2015 at 2:47 PM, Duncan Murdoch
> <murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>> wrote:
> 
>     On 29/07/2015 2:30 PM, Michael Lawrence wrote:
> 
>         Probably need a generic tree based on "ParseNode" objects that
>         associate the line information with the symbol (for leaf nodes). As
>         Duncan notes, it should be possible to gather that from the table.
> 
>         But it would be nice if there was an "expr" column in the parse data
>         column in addition to "text". It would contain the parsed object.
>         Otherwise, to use the table, one is often reparsing the text, which
>         just seems redundant and inconvenient.
> 
> 
>     Can you (both Jim and Michael) describe the uses you might have for
>     this?  There are lots of possible changes that could make this
>     information available:
> 
>      - attach to each item in the parse tree, as the parser package
>     did.  (Bad idea for general use which is why I dropped it, but
>     it could be done as a special option to parse, if you aren't
>     planning to evaluate the expression.)
>      - give the index into the parse tree of each item, i.e. c(1,1),
>     c(1,2), c(1,3) in the example below, or just 1,2,3 along with a
>     function to reconstruct the full path.
>      - give a copy of the branch of the parse tree, as Michael suggests.
> 
>     etc.  Which is best for your purposes?
> 
>     Duncan Murdoch
> 
> 
>         Michael
> 
>         On Wed, Jul 29, 2015 at 9:43 AM, Duncan Murdoch
>         <murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>> wrote:
>         > On 29/07/2015 12:13 PM, Jim Hester wrote:
>         >>
>         >> I would like to map the parsed tokens obtained from
>         utils::getParseData()
>         >> to the parse tree and elements obtained by base::parse().
>         >>
>         >> It looks like back when this code was in the parser package
>         the parse()
>         >> function annotated the elements in the tree with their id,
>         which would
>         >> allow you to perform this mapping.  However when the code was
>         included in
>         >> R
>         >> this functionality was removed.
>         >
>         >
>         > Yes, not all elements of the parse tree can legally have
>         attributes
>         > attached.
>         >>
>         >>
>         >> ?getParseData states
>         >>    The ‘id’ values are not attached to the elements of the parse
>         >>            tree, they are only retained in the table returned by
>         >>            ‘getParseData’.
>         >>
>         >> Is there another way you can map between the getParseData()
>         tokens and
>         >> elements of the parse tree that makes this additional annotation
>         >> unnecessary?  Or is this simply not possible?
>         >
>         >
>         > I think you can't get to it, though you can get close by
>         looking at the id &
>         > parent values in the table.  For example,
>         >
>         >  code <- "x + (y + 1)"
>         >  p <- parse(text=code)
>         >
>         > getParseData(p)
>         >    line1 col1 line2 col2 id parent     token terminal text
>         > 15     1    1     1   11 15      0      expr    FALSE
>         > 1      1    1     1    1  1      3    SYMBOL     TRUE    x
>         > 3      1    1     1    1  3     15      expr    FALSE
>         > 2      1    3     1    3  2     15       '+'     TRUE    +
>         > 13     1    5     1   11 13     15      expr    FALSE
>         > 4      1    5     1    5  4     13       '('     TRUE    (
>         > 11     1    6     1   10 11     13      expr    FALSE
>         > 5      1    6     1    6  5      7    SYMBOL     TRUE    y
>         > 7      1    6     1    6  7     11      expr    FALSE
>         > 6      1    8     1    8  6     11       '+'     TRUE    +
>         > 8      1   10     1   10  8      9 NUM_CONST     TRUE    1
>         > 9      1   10     1   10  9     11      expr    FALSE
>         > 10     1   11     1   11 10     13       ')'     TRUE    )
>         >
>         >
>         > Now p is an expression, with the parse tree in p[[1]].  From
>         the table, we
>         > can see that the root node has id 15, and 3 nodes have that as
>         a parent.
>         > Those would be p[[c(1,1)]], p[[c(1,2)]], p[[c(1,3)]].  The
>         tricky part is
>         > the re-ordering:  those correspond to `+`, x, and (y+1)
>         respectively, not
>         > the order they appear in the original source or in the table. 
>         Generally the
>         > function call appears first in the parse tree, but I'm not
>         sure you could
>         > always recognize which is the function call by looking at the
>         table.
>         >
>         > Duncan Murdoch
>         >
>         > ______________________________________________
>         > R-devel at r-project.org <mailto:R-devel at r-project.org> mailing list
>         > https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 
>