[PATCH 07/22] Implement token range tracking within libcpp and C/C++ FEs

Thu Sep 17 16:54:00 GMT 2015

On Wed, 2015-09-16 at 16:21 -0400, David Malcolm wrote:
> On Tue, 2015-09-15 at 12:48 +0200, Jakub Jelinek wrote:
> > On Tue, Sep 15, 2015 at 12:33:58PM +0200, Richard Biener wrote:
> > > On Tue, Sep 15, 2015 at 12:20 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> > > > On Tue, Sep 15, 2015 at 12:14:22PM +0200, Richard Biener wrote:
> > > >> > diff --git a/gcc/cp/parser.h b/gcc/cp/parser.h
> > > >> > index 760467c..c7558a0 100644
> > > >> > --- a/gcc/cp/parser.h
> > > >> > +++ b/gcc/cp/parser.h
> > > >> > @@ -61,6 +61,8 @@ struct GTY (()) cp_token {
> > > >> >    BOOL_BITFIELD purged_p : 1;
> > > >> >    /* The location at which this token was found.  */
> > > >> >    location_t location;
> > > >> > +  /* The source range at which this token was found.  */
> > > >> > +  source_range range;
> > > >>
> > > >> Is it just me or does location now feel somewhat redundant with range?  Can't we
> > > >> compress that somehow?
> > > >
> > > > For a token I'd expect it is redundant, I don't see how it would be useful
> > > > for a single preprocessing token to have more than start and end locations.
> > > > But generally, for expressions, 3 locations make sense.
> > > > If you have
> > > > abc + def
> > > > ~~~~^~~~~
> > > > then having a range is useful.  In any case, I'm surprised that the ranges aren't encoded in
> > > > location_t (the data structures behind it, where we already stick also
> > > > BLOCK pointer).
> > > 
> > > Probably lack of encoding space ... I suppose upping location_t to
> > > 64bits coud solve
> > > some of that (with its own drawback on increasing size of core structures).
> > 
> > What I had in mind was just add
> >   source_location start, end;
> > to location_adhoc_data struct and use !IS_ADHOC_LOC locations to represent
> > just plain locations without block and without range (including the cases
> > where the range has both start and end equal to the locus) and IS_ADHOC_LOC
> > locations for the cases where either we have non-NULL block, or we have
> > some other range, or both.  But I haven't spent any time on that, so just
> > wondering if such an encoding has been considered.
> 
> I've been attempting to implement that.
> 
> Am I right in thinking that the ad-hoc locations never get purged? i.e.
> that once we've registered an ad-hoc location, then is that slot within
> location_adhoc_data_map is forever associated with that (locus, block)
> pair?  [or in the proposed model, the (locus, src_range, block)
> triple?].
> 
> If so, it may make more sense to put the ranges into ad-hoc locations,
> but only *after tokenization*: in this approach, the src_range would be
> a field within the tokens (like in patch 07/22), in the hope that the
> tokens are short-lived  (which AIUI is the case for libcpp and C, though
> not for C++), presumably also killing the "location" field within
> tokens.  We then stuff the range into the location_t when building trees
> (maybe putting a src_range into c_expr to further delay populating
> location_adhoc_data_map).
> 
> That way we avoid bloating the location_adhoc_data_map during lexing,
> whilst preserving the range information, and we can stuff the ranges
> into the 32-bit location_t within tree/gimple etc (albeit paying a cost
> within the location_adhoc_data_map).
> 
> Thoughts?  Hope this sounds sane.
> Dave

FWIW, I have a (very messy) implementation of this working for the C
frontend, which gives us source ranges on expressions without needing to
add any new tree nodes, or add any fields to existing tree structs.

The approach I'm using:

* ranges are stored directly as fields within cpp_token and c_token
(maybe we can ignore cp_token for now)

* ranges are stashed in the C FE, both (a) within the "struct c_expr"
and (b) within the location_t of each tree expression node as a new
field in the adhoc map.

Doing it twice may seem slightly redundant, but I think both are needed:
  (a) doing it within c_expr allows us to support ranges for constants
and VAR_DECL etc during parsing, without needing any kind of new tree
wrapper node
  (b) doing it in the ad-hoc map allows the ranges for expressions to
survive the parse and be usable in diagnostics later.

So this gives us (in the C FE): ranges for everything during parsing,
and ranges for expressions afterwards, with no new tree nodes or new
fields within tree nodes.

I'm working on cleaning it up into a much more minimal set of patches
that I hope are reviewable.

Hopefully this sounds like a good approach

I've also been looking at ways to mitigate bloat of the ad-hoc map, by
using some extra bits of location_t for representing short ranges
directly.

Dave