mbox series

[v4,0/8] diagnostics: libcpp: Overhaul locations for _Pragma tokens

Message ID 20230809221414.2849878-1-lhyatt@gmail.com
Headers show
Series diagnostics: libcpp: Overhaul locations for _Pragma tokens | expand

Message

Lewis Hyatt Aug. 9, 2023, 10:14 p.m. UTC
On Mon, Jul 31, 2023 at 06:39:15PM -0400, Lewis Hyatt wrote:
> On Fri, Jul 28, 2023 at 6:58 PM David Malcolm <dmalcolm@redhat.com> wrote:
> >
> > On Fri, 2023-07-21 at 19:08 -0400, Lewis Hyatt wrote:
> > > Add a new linemap reason LC_GEN which enables encoding the location
> > > of data
> > > that was generated during compilation and does not appear in any
> > > source file.
> > > There could be many use cases, such as, for instance, referring to
> > > the content
> > > of builtin macros (not yet implemented, but an easy lift after this
> > > one.) The
> > > first intended application is to create a place to store the input to
> > > a
> > > _Pragma directive, so that proper locations can be assigned to those
> > > tokens. This will be done in a subsequent commit.
> > >
> > > The actual change needed to the line-maps API in libcpp is not too
> > > large and
> > > requires no space overhead in the line map data structures (on 64-bit
> > > systems
> > > that is; one newly added data member to class line_map_ordinary sits
> > > inside
> > > former padding bytes.) An LC_GEN map is just an ordinary map like any
> > > other,
> > > but the TO_FILE member that normally points to the file name points
> > > instead to
> > > the actual data.  This works automatically with PCH as well, for the
> > > same
> > > reason that the file name makes its way into a PCH.  In order to
> > > avoid
> > > confusion, the member has been renamed from TO_FILE to DATA, and
> > > associated
> > > accessors adjusted.
> > >
> > > Outside libcpp, there are many small changes but most of them are to
> > > selftests, which are necessarily more sensitive to implementation
> > > details. From the perspective of the user (the "user", here, being a
> > > frontend
> > > using line maps or else the diagnostics infrastructure), the chief
> > > visible
> > > change is that the function location_get_source_line() should be
> > > passed an
> > > expanded_location object instead of a separate filename and line
> > > number.  This
> > > is not a big change because in most cases, this information came
> > > anyway from a
> > > call to expand_location and the needed expanded_location object is
> > > readily
> > > available. The new overload of location_get_source_line() uses the
> > > extra
> > > information in the expanded_location object to obtain the data from
> > > the
> > > in-memory buffer when it originated from an LC_GEN map.
> > >
> > > Until the subsequent patch that starts using LC_GEN maps, none are
> > > yet
> > > generated within GCC, hence nothing is added to the testsuite here;
> > > but all
> > > relevant selftests have been extended to cover generated data maps in
> > > addition
> > > to normal files.
> >
> > [..snip...]
> >
> > Thanks for the updated patch.
> >
> > Reading this patch, it felt a bit unnatural to me to have an
> >   (exploded location, source line)
> > pair where the exploded location seems to be representing "which source
> > file or generated buffer", but the line/column info in that
> > exploded_location is to be ignored in favor of the 2nd source line.
> >
> > I think we're missing a class: something that identifies either a
> > specific source file, or a specific generated buffer.
> >
> > How about something like either:
> >
> > class source_id
> > {
> > public:
> >   source_id (const char *filename)
> >   : m_filename_or_buffer (filename),
> >     m_len (0)
> >   {
> >   }
> >
> >   explicit source_id (const char *buffer, unsigned buffer_len)
> >   : m_filename_or_buffer (buffer),
> >     m_len (buffer_len)
> >   {
> >     linemap_assert (buffer_len > 0);
> >   }
> >
> > private:
> >   const char *m_filename_or_buffer;
> >   unsigned m_len;  // where 0 means "it's a filename"
> > };
> >
> > or:
> >
> > class source_id
> > {
> > public:
> >   source_id (const char *filename)
> >   : m_ptr (filename),
> >     m_is_buffer (false)
> >   {
> >   }
> >
> >   explicit source_id (const linemap_ordinary *buffer_linemap)
> >   : m_ptr (buffer_linemap),
> >     m_is_buffer (true)
> >   {
> >   }
> >
> > private:
> >   const void *m_ptr;
> >   bool m_is_buffer;
> > };
> >
> > and use one of these "source_id file" in place of "const char *file",
> > rather than replacing such things with expanded_location?
> >
> > > diff --git a/gcc/c-family/c-indentation.cc b/gcc/c-family/c-indentation.cc
> > > index e8d3dece770..4164fa0b1ba 100644
> > > --- a/gcc/c-family/c-indentation.cc
> > > +++ b/gcc/c-family/c-indentation.cc
> > > @@ -50,7 +50,7 @@ get_visual_column (expanded_location exploc,
> > >                  unsigned int *first_nws,
> > >                  unsigned int tab_width)
> > >  {
> > > -  char_span line = location_get_source_line (exploc.file, exploc.line);
> > > +  char_span line = location_get_source_line (exploc);
> >
> > ...so this might contine to be:
> >
> >   char_span line = location_get_source_line (exploc.file, exploc.line);
> >
> > ...but expanded_location's "file" field would become a source_id,
> > rather than a const char *.  It looks like doing do might make a lot of
> > "is this the same file or buffer?"  turn into comparisons of source_id
> > instances.
> >
> > So I think expanded_location would become:
> >
> > typedef struct
> > {
> >   /* Either the name of the source file involved, or the
> >      specific generated buffer.  */
> >   source_id file;
> >
> >   /* The line-location in the source file.  */
> >   int line;
> >
> >   int column;
> >
> >   void *data;
> >
> >   /* In a system header?. */
> >   bool sysp;
> > } expanded_location;
> >
> > and we wouldn't need to add these extra fields:
> >
> > > +
> > > +  /* If generated data, the data and its length.  The data may contain embedded
> > > +   nulls and need not be null-terminated.  */
> > > +  unsigned int generated_data_len;
> > > +  const char *generated_data;
> > >  } expanded_location;
> >
> > and we could pass around source_id instances when identifying specific
> > filenames/generated buffers.
> >
> > Does this idea simplify/clarify the patch, or make it more complicated?
> >
> > [...snip...]
> >
> > Thoughts?
> > Dave
> >
> 
> Thanks, this makes sense and I think on balance it makes the interface
> nicer to do it this way. In the last patch of this series (for SARIF
> output) I had found it necessary to use a
>     typdef std::pair<const char*, unsigned int> filename_or_buffer;
> which was the same thing in spirit as your source_id. It makes sense
> to promote that to a real class and use it more widely.
> 
> I will send out an updated series with that change later after testing.
> 
> I don't think we can simply change "file" in expanded_location to be a
> source_id, because this field is used in lots of places that don't
> care about generated data buffers, and that interface change would
> necessitate touching all of them. For example, gengtype.cc uses libcpp
> and expanded_location, and there are a bunch of call sites in the
> middle end and backend that do as well, plus e.g. the custom
> diagnostics hooks in the C++ front end that print "inlined from
> xyz.cc". I thought about making source_id implicitly convertible to
> char*, but I think that is too error prone, plus it doesn't help with
> the most common use of this field, which is to pass it to printf().
> The approach I am thinking to take is to leave "file" as it is in
> expanded_location, but to also add a "source_id src" field too. This
> way, the only call sites that need to be touched are those that care
> about this distinction, and so the number of changes is much more
> limited. But it does still achieve the goal that we don't need to use
> an expanded_location to communicate with input.cc, we can use a
> source_id instead, and that makes the interface more natural.
> 
> With the new interface, the main change needed for diagnostics code
> would be that instead of calling location_get_source_line(file_name,
> line), you would need to call location_get_source_line(src_id, line).
> In case both the source_id and the line number come from an
> expanded_location, there could be a convenience overload like
> location_get_source_line(exploc) also, but it wouldn't be necessary to
> involved an expanded_location if the source_id and line are better
> handled separately.
> 

Hello-

Here is the updated patch series with the interface change we
discussed.

I also identified an issue with the previous version. I had intended for
struct line_map_ordinary to incur 0 space overhead with this change, but
with the prior approach, the size was increasing by 8 bytes (from 24 to
32). In this round, I changed the approach to use a union so there is no
size overhead. Rather, there is an extra level of indirection incurred only
when LC_GEN maps are used, so that the impact on current usage is minimal to
none.

While working on this version I felt that the first patch is really too
large. In this iteration, I split it into 6 patches to (hopefully)
make it easier to review. If that is inconvenient for any reason, please let
me know and I can send it the old way.

Thanks again for taking a look at it.

-Lewis