Message ID | 1442957171-22904-4-git-send-email-dmalcolm@redhat.com |
---|---|
State | New |
Headers | show |
Hello, David Malcolm <dmalcolm@redhat.com> a écrit: > This is an updated version of: > https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00736.html > > Changes in V2 of the patch: > * c_lex_with_flags: don't write through the range ptr if it's NULL > * don't add any fields to the C++ frontend's cp_token for now > * libcpp/lex.c: prevent usage of stale/uninitialized data in > _cpp_temp_token and _cpp_lex_direct. > > This patch adds source *range* information to libcpp's cpp_token, and to > c_token in the C frontend. > > As noted before, to minimize churn, I kept the existing location_t > fields, though in theory these are always just equal to the start of > the source range. Funny; I first overlooked this comment of you, and then when reading the patch, I asked myself "why keep the existing location_t" ? I mean, in here: /* A preprocessing token. This has been carefully packed and should - occupy 16 bytes on 32-bit hosts and 24 bytes on 64-bit hosts. */ + occupy 16 bytes on 32-bit hosts and 24 bytes on 64-bit hosts. + FIXME: the above comment is no longer true with this patch. */ struct GTY(()) cpp_token { source_location src_loc; /* Location of first char of token. */ + source_range src_range; /* Source range covered by the token. */ ENUM_BITFIELD(cpp_ttype) type : CHAR_BIT; /* token type */ unsigned short flags; /* flags - see above */ You could just change the type of src_loc and make it be a source_range. Source range could take a converting constructor, that takes a source_location, so that the existing code that does "cpp_token.src_loc = a_source_location;" keeps working. But then, in the previous patch of the series, I see: +/* A range of source locations. + + Ranges are closed: + m_start is the first location within the range, + m_finish is the last location within the range. + + We may need a more compact way to store these, but for now, + let's do it the simple way, as a pair. */ +struct GTY(()) source_range +{ + source_location m_start; + source_location m_finish; + + void debug (const char *msg) const; + + /* We avoid using constructors, since various structs that + don't yet have constructors will embed instances of + source_range. */ + But what if we define a default constructor for that one (as well as one that takes a source_location)? Wouldn't that work for the embedding case that you talk about in that comment? The reason why I am asking all this is, memory consumption. Maybe you've measured it and it's not relevant, but otherwise, if we could do away with duplicating the start location and still miminize the churn, maybe we should try.
On Fri, 2015-09-25 at 11:13 +0200, Dodji Seketeli wrote: > Hello, > > David Malcolm <dmalcolm@redhat.com> a écrit: > > > This is an updated version of: > > https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00736.html > > > > Changes in V2 of the patch: > > * c_lex_with_flags: don't write through the range ptr if it's NULL > > * don't add any fields to the C++ frontend's cp_token for now > > * libcpp/lex.c: prevent usage of stale/uninitialized data in > > _cpp_temp_token and _cpp_lex_direct. > > > > This patch adds source *range* information to libcpp's cpp_token, and to > > c_token in the C frontend. > > > > As noted before, to minimize churn, I kept the existing location_t > > fields, though in theory these are always just equal to the start of > > the source range. > > Funny; I first overlooked this comment of you, and then when reading the > patch, I asked myself "why keep the existing location_t" ? I mean, in > here: > > /* A preprocessing token. This has been carefully packed and should > - occupy 16 bytes on 32-bit hosts and 24 bytes on 64-bit hosts. */ > + occupy 16 bytes on 32-bit hosts and 24 bytes on 64-bit hosts. > + FIXME: the above comment is no longer true with this patch. */ > struct GTY(()) cpp_token { > source_location src_loc; /* Location of first char of token. */ > + source_range src_range; /* Source range covered by the token. */ > ENUM_BITFIELD(cpp_ttype) type : CHAR_BIT; /* token type */ > unsigned short flags; /* flags - see above */ > > You could just change the type of src_loc and make it be a source_range. Interesting idea. For the general case of expressions, I want a location to mean a caret/point plus a range that contains it, but for tokens, the caret/point is always at the start of the range. So maybe a src_range would do here. That said, in patches 3 and 4 of this kit I'm experimenting with representation; as I said in the blurb for patch 3: "See the cover-letter for this patch kit which describes how we might go back to using just a location_t, and stashing the range inside the location_t. I'm doing it this way for now to allow for more flexibility as I benchmark and explore implementation options." So patches 3 and 4 are rather more experimental than patches 1,2 and 5, as I find out what the different representations do to the time&memory consumption of the compiler. I like the idea of "source_location" and "location_t" becoming a representation of "(point/caret + range)" rather than just a point/caret, since this means that we can pass location_t around as before, but then we can extract ranges from them, and all of the existing diagnostics ought to "automagically" gain underlines "for free". I'm experimenting with ways to try to do that efficiently, with strategies for packing things into the 32-bits compactly; see e.g.: https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01826.html If so, then cpp_token's src_loc would remain a source_location; source_location itself becomes richer. > Source range could take a converting constructor, that takes a > source_location, so that the existing code that does "cpp_token.src_loc > = a_source_location;" keeps working. > > But then, in the previous patch of the series, I see: > > +/* A range of source locations. > + > + Ranges are closed: > + m_start is the first location within the range, > + m_finish is the last location within the range. > + > + We may need a more compact way to store these, but for now, > + let's do it the simple way, as a pair. */ > +struct GTY(()) source_range > +{ > + source_location m_start; > + source_location m_finish; > + > + void debug (const char *msg) const; > + > + /* We avoid using constructors, since various structs that > + don't yet have constructors will embed instances of > + source_range. */ > + > > But what if we define a default constructor for that one (as well as one > that takes a source_location)? Wouldn't that work for the embedding > case that you talk about in that comment? Perhaps, but I worry that it would lead to a cascade, where suddenly we'd need constructors for various other types. I can try it, I guess. > The reason why I am asking all this is, memory consumption. Maybe > you've measured it and it's not relevant, but otherwise, if we could do > away with duplicating the start location and still miminize the churn, > maybe we should try. (nods) I'm experimenting with some different approaches here. Thanks Dave [BTW, I'm about to disappear on a vacation from tomorrow until October 6th, and away from computers]
David Malcolm <dmalcolm@redhat.com> a écrit: > On Fri, 2015-09-25 at 11:13 +0200, Dodji Seketeli wrote: [...] >> Funny; I first overlooked this comment of you, and then when reading the >> patch, I asked myself "why keep the existing location_t" ? I mean, in >> here: >> >> /* A preprocessing token. This has been carefully packed and should >> - occupy 16 bytes on 32-bit hosts and 24 bytes on 64-bit hosts. */ >> + occupy 16 bytes on 32-bit hosts and 24 bytes on 64-bit hosts. >> + FIXME: the above comment is no longer true with this patch. */ >> struct GTY(()) cpp_token { >> source_location src_loc; /* Location of first char of token. */ >> + source_range src_range; /* Source range covered by the token. */ >> ENUM_BITFIELD(cpp_ttype) type : CHAR_BIT; /* token type */ >> unsigned short flags; /* flags - see above */ >> >> You could just change the type of src_loc and make it be a source_range. > > Interesting idea. > > For the general case of expressions, I want a location to mean a > caret/point plus a range that contains it, but for tokens, the > caret/point is always at the start of the range. So maybe a src_range > would do here. > > That said, in patches 3 and 4 of this kit I'm experimenting with > representation; as I said in the blurb for patch 3: "See the > cover-letter for this patch kit which describes how we might go back to > using just a location_t, and stashing the range inside the location_t. > I'm doing it this way for now to allow for more flexibility as I > benchmark and explore implementation options." Right. > So patches 3 and 4 are rather more experimental than patches 1,2 and 5, > as I find out what the different representations do to the time&memory > consumption of the compiler. Understood. > I like the idea of "source_location" and "location_t" becoming a > representation of "(point/caret + range)" rather than just a > point/caret, since this means that we can pass location_t around as > before, but then we can extract ranges from them, and all of the > existing diagnostics ought to "automagically" gain underlines "for > free". Me too. > I'm experimenting with ways to try to do that efficiently, with > strategies for packing things into the 32-bits compactly; see e.g.: > https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01826.html > > If so, then cpp_token's src_loc would remain a source_location; > source_location itself becomes richer. > >> Source range could take a converting constructor, that takes a >> source_location, so that the existing code that does "cpp_token.src_loc >> = a_source_location;" keeps working. >> >> But then, in the previous patch of the series, I see: >> >> +/* A range of source locations. >> + >> + Ranges are closed: >> + m_start is the first location within the range, >> + m_finish is the last location within the range. >> + >> + We may need a more compact way to store these, but for now, >> + let's do it the simple way, as a pair. */ >> +struct GTY(()) source_range >> +{ >> + source_location m_start; >> + source_location m_finish; >> + >> + void debug (const char *msg) const; >> + >> + /* We avoid using constructors, since various structs that >> + don't yet have constructors will embed instances of >> + source_range. */ >> + >> >> But what if we define a default constructor for that one (as well as one >> that takes a source_location)? Wouldn't that work for the embedding >> case that you talk about in that comment? > > Perhaps, but I worry that it would lead to a cascade, where suddenly > we'd need constructors for various other types. I can try it, I > guess. If it leads to a cascade, then don't bother. My point was precisely to try to avoid the churn, while limiting the amount of data size increase for cpp_token in general. As you implied, if we can just stay with a source_location that carries the information of a pointer plus a range, even better. > [BTW, I'm about to disappear on a vacation from tomorrow until October > 6th, and away from computers] Thanks for the heads-up.
diff --git a/gcc/c-family/c-lex.c b/gcc/c-family/c-lex.c index 55ceb20..57a626e 100644 --- a/gcc/c-family/c-lex.c +++ b/gcc/c-family/c-lex.c @@ -380,11 +380,14 @@ c_common_has_attribute (cpp_reader *pfile) } /* Read a token and return its type. Fill *VALUE with its value, if - applicable. Fill *CPP_FLAGS with the token's flags, if it is + applicable. Fill *LOC with the source location of the token. + If non-NULL, fill *RANGE with the source range of the token. + Fill *CPP_FLAGS with the token's flags, if it is non-NULL. */ enum cpp_ttype -c_lex_with_flags (tree *value, location_t *loc, unsigned char *cpp_flags, +c_lex_with_flags (tree *value, location_t *loc, source_range *range, + unsigned char *cpp_flags, int lex_flags) { static bool no_more_pch; @@ -397,6 +400,8 @@ c_lex_with_flags (tree *value, location_t *loc, unsigned char *cpp_flags, retry: tok = cpp_get_token_with_location (parse_in, loc); type = tok->type; + if (range) + *range = tok->src_range; retry_after_at: switch (type) diff --git a/gcc/c-family/c-pragma.h b/gcc/c-family/c-pragma.h index f6e1090..3b94e44 100644 --- a/gcc/c-family/c-pragma.h +++ b/gcc/c-family/c-pragma.h @@ -225,8 +225,8 @@ extern enum cpp_ttype pragma_lex (tree *, location_t *loc = NULL); /* This is not actually available to pragma parsers. It's merely a convenient location to declare this function for c-lex, after having enum cpp_ttype declared. */ -extern enum cpp_ttype c_lex_with_flags (tree *, location_t *, unsigned char *, - int); +extern enum cpp_ttype c_lex_with_flags (tree *, location_t *, source_range *, + unsigned char *, int); extern void c_pp_lookup_pragma (unsigned int, const char **, const char **); diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c index 2fab3f0..5edf563 100644 --- a/gcc/c/c-parser.c +++ b/gcc/c/c-parser.c @@ -170,6 +170,8 @@ struct GTY (()) c_token { ENUM_BITFIELD (pragma_kind) pragma_kind : 8; /* The location at which this token was found. */ location_t location; + /* The source range at which this token was found. */ + source_range range; /* The value associated with this token, if any. */ tree value; }; @@ -239,7 +241,9 @@ c_lex_one_token (c_parser *parser, c_token *token) { timevar_push (TV_LEX); - token->type = c_lex_with_flags (&token->value, &token->location, NULL, + token->type = c_lex_with_flags (&token->value, &token->location, + &token->range, + NULL, (parser->lex_untranslated_string ? C_LEX_STRING_NO_TRANSLATE : 0)); token->id_kind = C_ID_NONE; diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c index 0134189..9423755 100644 --- a/gcc/cp/parser.c +++ b/gcc/cp/parser.c @@ -764,7 +764,8 @@ cp_lexer_get_preprocessor_token (cp_lexer *lexer, cp_token *token) /* Get a new token from the preprocessor. */ token->type - = c_lex_with_flags (&token->u.value, &token->location, &token->flags, + = c_lex_with_flags (&token->u.value, &token->location, + NULL, &token->flags, lexer == NULL ? 0 : C_LEX_STRING_NO_JOIN); token->keyword = RID_MAX; token->pragma_kind = PRAGMA_NONE; diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h index a2bdfa0..0b1a403 100644 --- a/libcpp/include/cpplib.h +++ b/libcpp/include/cpplib.h @@ -235,9 +235,11 @@ struct GTY(()) cpp_identifier { }; /* A preprocessing token. This has been carefully packed and should - occupy 16 bytes on 32-bit hosts and 24 bytes on 64-bit hosts. */ + occupy 16 bytes on 32-bit hosts and 24 bytes on 64-bit hosts. + FIXME: the above comment is no longer true with this patch. */ struct GTY(()) cpp_token { source_location src_loc; /* Location of first char of token. */ + source_range src_range; /* Source range covered by the token. */ ENUM_BITFIELD(cpp_ttype) type : CHAR_BIT; /* token type */ unsigned short flags; /* flags - see above */ diff --git a/libcpp/lex.c b/libcpp/lex.c index 0aa1090..a6f16b2 100644 --- a/libcpp/lex.c +++ b/libcpp/lex.c @@ -2169,6 +2169,8 @@ _cpp_temp_token (cpp_reader *pfile) result = pfile->cur_token++; result->src_loc = old->src_loc; + result->src_range.m_start = old->src_loc; + result->src_range.m_finish = old->src_loc; return result; } @@ -2365,6 +2367,13 @@ _cpp_lex_direct (cpp_reader *pfile) result->src_loc = linemap_position_for_column (pfile->line_table, CPP_BUF_COLUMN (buffer, buffer->cur)); + /* The token's src_range begins here. */ + result->src_range.m_start = result->src_loc; + + /* Ensure m_finish is also initialized, in case we bail out above + via a "goto fresh_line;" below. */ + result->src_range.m_finish = result->src_loc; + switch (c) { case ' ': case '\t': case '\f': case '\v': case '\0': @@ -2723,6 +2732,11 @@ _cpp_lex_direct (cpp_reader *pfile) break; } + /* The token's src_range ends here. */ + result->src_range.m_finish = + linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (buffer, buffer->cur)); + return result; }