diff mbox series

c++/modules: optimize tree flag streaming

Message ID 20240215183735.244505-1-ppalka@redhat.com
State New
Headers show
Series c++/modules: optimize tree flag streaming | expand

Commit Message

Patrick Palka Feb. 15, 2024, 6:37 p.m. UTC
Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look
OK for trunk?

-- >8 --

One would expect consecutive calls to bytes_in/out::b for streaming
adjacent bits, as we do for tree flag streaming, to at least be
optimized by the compiler into individual bit operations using
statically known bit positions (and ideally merged into larger sized
reads/writes).

Unfortunately this doesn't happen because the compiler has trouble
tracking the values of this->bit_pos and this->bit_val across such
calls, likely because the compiler doesn't know 'this' and so it's
treated as global memory.  This means for each consecutive bit stream
operation, bit_pos and bit_val are loaded from memory, checked if
buffering is needed, and finally the bit is extracted from bit_val
according to the (unknown) bit_pos, even though relative to the previous
operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
is just 1 larger.  This ends up being quite slow, with tree_node_bools
taking 10% of time when streaming in parts of the std module.

This patch optimizes this by making tracking of bit_pos and bit_val
easier for the compiler.  Rather than bit_pos and bit_val being members
of the (effectively global) bytes_in/out objects, this patch factors out
the bit streaming code/state into separate classes bits_in/out that get
constructed locally as needed for bit streaming.  Since these objects
are now clearly local, the compiler can more easily track their values.

And since bit streaming is intended to be batched it's natural for these
new classes to be RAII-enabled such that the bit stream is flushed upon
destruction.

In order to make the most of this improved tracking of bit position,
this patch changes parts where we conditionally stream a tree flag
to unconditionally stream (the flag or a dummy value).  That way
the number of bits streamed and the respective bit positions are as
statically known as reasonably possible.  In lang_decl_bools and
lang_type_bools we flush the current bit buffer at the start so that
subsequent bit positions are statically known.  And in core bools, we
can add explicit early exits utilizing invariants that the compiler
can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
than 4 bits, it's more space efficient to stream them as individual
bytes rather than as packed bits (due to the 32-bit buffer).

This patch also moves the definitions of the relevant streaming classes
into anonymous namespaces so that the compiler can make more informed
decisions about inlining their member functions.

After this patch, compile time for a simple Hello World using the std
module is reduced by 7% with a release compiler.  The on-disk size of
the std module increases by 0.7% (presumably due to the extra flushing
done in lang_decl_bools and lang_type_bools).

The bit stream out performance isn't improved as much as the stream in
due to the spans/lengths instrumentation performed on stream out (which
probably should be e.g. removed for release builds?)

gcc/cp/ChangeLog:

	* module.cc
	(class data): Enclose in an anonymous namespace.
	(data::calc_crc): Moved from bytes::calc_crc.
	(class bytes): Remove.  Move bit_flush to namespace scope.
	(class bytes_in): Enclose in an anonymous namespace.  Inherit
	directly from data and adjust accordingly.  Move b and bflush
	members to bits_in.
	(class bytes_out): As above.  Remove is_set static data member.
	(bit_flush): Moved from class bytes.
	(struct bits_in): Define.
	(struct bits_out): Define.
	(bytes_out::bflush): Moved to bits_out/in.
	(bytes_in::bflush): Likewise
	(bytes_in::bfill): Removed.
	(bytes_out::b): Moved to bits_out/in.
	(bytes_in::b): Likewise.
	(class trees_in): Enclose in an anonymous namespace.
	(class trees_out): Enclose in an anonymous namespace.
	(trees_out::core_bools): Add bits_out/in parameter and use it.
	Unconditionally stream a bit for public_flag.  Add early exits
	as appropriate.
	(trees_out::core_bools): Likewise.
	(trees_out::lang_decl_bools): Add bits_out/in parameter and use
	it.  Flush the current bit buffer at the start.  Unconditionally
	stream a bit for module_keyed_decls_p.
	(trees_in::lang_decl_bools): Likewise.
	(trees_out::lang_type_bools): Add bits_out/in parameter and use
	it.  Flush the current bit buffer at the start.
	(trees_in::lang_type_bools): Likewise.
	(trees_out::tree_node_bools): Construct a bits_out object and
	use/pass it.
	(trees_in::tree_node_bools): Likewise.
	(trees_out::decl_value): Stream stray bit values as bytes.
	(trees_in::decl_value): Likewise.
	(module_state::write_define): Likewise.
	(module_state::read_define): Likewise.
---
 gcc/cp/module.cc | 401 ++++++++++++++++++++++++++---------------------
 1 file changed, 219 insertions(+), 182 deletions(-)

Comments

Richard Biener Feb. 16, 2024, 9:34 a.m. UTC | #1
On Thu, Feb 15, 2024 at 7:38 PM Patrick Palka <ppalka@redhat.com> wrote:
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look
> OK for trunk?

Btw, there's the "bitpack" streaming support in data-streamer.h also
added for exactly the same reason, it's likely not easily re-usable
but this kind of approach is indeed important for performance.

The whole "data-streamer" was supposed to be re-usable though.

Richard.

> -- >8 --
>
> One would expect consecutive calls to bytes_in/out::b for streaming
> adjacent bits, as we do for tree flag streaming, to at least be
> optimized by the compiler into individual bit operations using
> statically known bit positions (and ideally merged into larger sized
> reads/writes).
>
> Unfortunately this doesn't happen because the compiler has trouble
> tracking the values of this->bit_pos and this->bit_val across such
> calls, likely because the compiler doesn't know 'this' and so it's
> treated as global memory.  This means for each consecutive bit stream
> operation, bit_pos and bit_val are loaded from memory, checked if
> buffering is needed, and finally the bit is extracted from bit_val
> according to the (unknown) bit_pos, even though relative to the previous
> operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
> is just 1 larger.  This ends up being quite slow, with tree_node_bools
> taking 10% of time when streaming in parts of the std module.
>
> This patch optimizes this by making tracking of bit_pos and bit_val
> easier for the compiler.  Rather than bit_pos and bit_val being members
> of the (effectively global) bytes_in/out objects, this patch factors out
> the bit streaming code/state into separate classes bits_in/out that get
> constructed locally as needed for bit streaming.  Since these objects
> are now clearly local, the compiler can more easily track their values.
>
> And since bit streaming is intended to be batched it's natural for these
> new classes to be RAII-enabled such that the bit stream is flushed upon
> destruction.
>
> In order to make the most of this improved tracking of bit position,
> this patch changes parts where we conditionally stream a tree flag
> to unconditionally stream (the flag or a dummy value).  That way
> the number of bits streamed and the respective bit positions are as
> statically known as reasonably possible.  In lang_decl_bools and
> lang_type_bools we flush the current bit buffer at the start so that
> subsequent bit positions are statically known.  And in core bools, we
> can add explicit early exits utilizing invariants that the compiler
> can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
> and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
> it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
> than 4 bits, it's more space efficient to stream them as individual
> bytes rather than as packed bits (due to the 32-bit buffer).
>
> This patch also moves the definitions of the relevant streaming classes
> into anonymous namespaces so that the compiler can make more informed
> decisions about inlining their member functions.
>
> After this patch, compile time for a simple Hello World using the std
> module is reduced by 7% with a release compiler.  The on-disk size of
> the std module increases by 0.7% (presumably due to the extra flushing
> done in lang_decl_bools and lang_type_bools).
>
> The bit stream out performance isn't improved as much as the stream in
> due to the spans/lengths instrumentation performed on stream out (which
> probably should be e.g. removed for release builds?)
>
> gcc/cp/ChangeLog:
>
>         * module.cc
>         (class data): Enclose in an anonymous namespace.
>         (data::calc_crc): Moved from bytes::calc_crc.
>         (class bytes): Remove.  Move bit_flush to namespace scope.
>         (class bytes_in): Enclose in an anonymous namespace.  Inherit
>         directly from data and adjust accordingly.  Move b and bflush
>         members to bits_in.
>         (class bytes_out): As above.  Remove is_set static data member.
>         (bit_flush): Moved from class bytes.
>         (struct bits_in): Define.
>         (struct bits_out): Define.
>         (bytes_out::bflush): Moved to bits_out/in.
>         (bytes_in::bflush): Likewise
>         (bytes_in::bfill): Removed.
>         (bytes_out::b): Moved to bits_out/in.
>         (bytes_in::b): Likewise.
>         (class trees_in): Enclose in an anonymous namespace.
>         (class trees_out): Enclose in an anonymous namespace.
>         (trees_out::core_bools): Add bits_out/in parameter and use it.
>         Unconditionally stream a bit for public_flag.  Add early exits
>         as appropriate.
>         (trees_out::core_bools): Likewise.
>         (trees_out::lang_decl_bools): Add bits_out/in parameter and use
>         it.  Flush the current bit buffer at the start.  Unconditionally
>         stream a bit for module_keyed_decls_p.
>         (trees_in::lang_decl_bools): Likewise.
>         (trees_out::lang_type_bools): Add bits_out/in parameter and use
>         it.  Flush the current bit buffer at the start.
>         (trees_in::lang_type_bools): Likewise.
>         (trees_out::tree_node_bools): Construct a bits_out object and
>         use/pass it.
>         (trees_in::tree_node_bools): Likewise.
>         (trees_out::decl_value): Stream stray bit values as bytes.
>         (trees_in::decl_value): Likewise.
>         (module_state::write_define): Likewise.
>         (module_state::read_define): Likewise.
> ---
>  gcc/cp/module.cc | 401 ++++++++++++++++++++++++++---------------------
>  1 file changed, 219 insertions(+), 182 deletions(-)
>
> diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
> index 0291d456ff5..9477e4f2e35 100644
> --- a/gcc/cp/module.cc
> +++ b/gcc/cp/module.cc
> @@ -153,9 +153,11 @@ Classes used:
>
>     data - buffer
>
> -   bytes - data streamer
> -   bytes_in : bytes - scalar reader
> -   bytes_out : bytes - scalar writer
> +   bytes_in - scalar reader
> +   bytes_out - scalar writer
> +
> +   bits_in - bit stream reader
> +   bits_out - bit stream writer
>
>     elf - ELROND format
>     elf_in : elf - ELROND reader
> @@ -346,6 +348,7 @@ typedef hash_map<void *,signed,ptr_int_traits> ptr_int_hash_map;
>
>  /* Variable length buffer.  */
>
> +namespace {
>  class data {
>  public:
>    class allocator {
> @@ -393,6 +396,8 @@ protected:
>      return res;
>    }
>
> +  unsigned calc_crc (unsigned) const;
> +
>  public:
>    void unuse (unsigned count)
>    {
> @@ -402,6 +407,7 @@ public:
>  public:
>    static allocator simple_memory;
>  };
> +} // anon namespace
>
>  /* The simple data allocator.  */
>  data::allocator data::simple_memory;
> @@ -447,46 +453,11 @@ data::allocator::shrink (char *ptr)
>    XDELETEVEC (ptr);
>  }
>
> -/* Byte streamer base.   Buffer with read/write position and smarts
> -   for single bits.  */
> -
> -class bytes : public data {
> -public:
> -  typedef data parent;
> -
> -protected:
> -  uint32_t bit_val;    /* Bit buffer.  */
> -  unsigned bit_pos;    /* Next bit in bit buffer.  */
> -
> -public:
> -  bytes ()
> -    :parent (), bit_val (0), bit_pos (0)
> -  {}
> -  ~bytes ()
> -  {
> -  }
> -
> -protected:
> -  unsigned calc_crc (unsigned) const;
> -
> -protected:
> -  /* Finish bit packet.  Rewind the bytes not used.  */
> -  unsigned bit_flush ()
> -  {
> -    gcc_assert (bit_pos);
> -    unsigned bytes = (bit_pos + 7) / 8;
> -    unuse (4 - bytes);
> -    bit_pos = 0;
> -    bit_val = 0;
> -    return bytes;
> -  }
> -};
> -
>  /* Calculate the crc32 of the buffer.  Note the CRC is stored in the
>     first 4 bytes, so don't include them.  */
>
>  unsigned
> -bytes::calc_crc (unsigned l) const
> +data::calc_crc (unsigned l) const
>  {
>    return crc32 (0, (unsigned char *)buffer + 4, l - 4);
>  }
> @@ -495,8 +466,9 @@ class elf_in;
>
>  /* Byte stream reader.  */
>
> -class bytes_in : public bytes {
> -  typedef bytes parent;
> +namespace {
> +class bytes_in : public data {
> +  typedef data parent;
>
>  protected:
>    bool overrun;  /* Sticky read-too-much flag.  */
> @@ -531,7 +503,6 @@ public:
>      if (offset > size)
>        set_overrun ();
>      pos = offset;
> -    bit_pos = bit_val = 0;
>    }
>
>  public:
> @@ -572,13 +543,6 @@ public:
>  public:
>    unsigned u32 ();     /* Read uncompressed integer.  */
>
> -public:
> -  bool b ();           /* Read a bool.  */
> -  void bflush ();      /* Completed a block of bools.  */
> -
> -private:
> -  void bfill ();       /* Get the next block of bools.  */
> -
>  public:
>    int c ();            /* Read a char.  */
>    int i ();            /* Read a signed int.  */
> @@ -590,6 +554,7 @@ public:
>    const void *buf (size_t); /* Read a fixed-length buffer.  */
>    cpp_hashnode *cpp_node (); /* Read a cpp node.  */
>  };
> +} // anon namespace
>
>  /* Verify the buffer's CRC is correct.  */
>
> @@ -610,8 +575,9 @@ class elf_out;
>
>  /* Byte stream writer.  */
>
> -class bytes_out : public bytes {
> -  typedef bytes parent;
> +namespace {
> +class bytes_out : public data {
> +  typedef data parent;
>
>  public:
>    allocator *memory;   /* Obtainer of memory.  */
> @@ -658,10 +624,6 @@ public:
>  public:
>    void u32 (unsigned);  /* Write uncompressed integer.  */
>
> -public:
> -  void b (bool);       /* Write bool.  */
> -  void bflush ();      /* Finish block of bools.  */
> -
>  public:
>    void c (unsigned char); /* Write unsigned char.  */
>    void i (int);                /* Write signed int.  */
> @@ -694,13 +656,127 @@ protected:
>    /* Instrumentation.  */
>    static unsigned spans[4];
>    static unsigned lengths[4];
> -  static int is_set;
> +  friend struct bits_out;
> +};
> +} // anon namespace
> +
> +/* Finish bit packet.  Rewind the bytes not used.  */
> +static unsigned
> +bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
> +{
> +  gcc_assert (bit_pos);
> +  unsigned bytes = (bit_pos + 7) / 8;
> +  bits.unuse (4 - bytes);
> +  bit_pos = 0;
> +  bit_val = 0;
> +  return bytes;
> +}
> +
> +/* RAII-enabled bit stream reader.  Bools are packed into bytes.  You
> +   cannot mix bools and non-bools.  Use bflush to flush the current stream
> +   of bools on demand.  Upon destruction bflush is called.
> +
> +   When reading, we don't know how many bools we'll read in.  So read
> +   4 bytes-worth, and then rewind when flushing if we didn't need them
> +   all.  You can't have a block of bools closer than 4 bytes to the
> +   end of the buffer.  */
> +
> +namespace {
> +struct bits_in {
> +  bytes_in& in;
> +  uint32_t bit_val = 0;
> +  unsigned bit_pos = 0;
> +
> +  bits_in (bytes_in& in)
> +    : in (in)
> +  {
> +  }
> +
> +  ~bits_in ()
> +  {
> +    bflush ();
> +  }
> +
> +  bits_in(const bits_in&) = delete;
> +  bits_in& operator=(const bits_in&) = delete;
> +
> +  /* Completed a block of bools.  */
> +  void bflush ()
> +  {
> +    if (bit_pos)
> +      bit_flush (in, bit_val, bit_pos);
> +  }
> +
> +  /* Read one bit.  */
> +  bool b ()
> +  {
> +    if (!bit_pos)
> +      bit_val = in.u32 ();
> +    bool x = (bit_val >> bit_pos) & 1;
> +    bit_pos = (bit_pos + 1) % 32;
> +    return x;
> +  }
>  };
> +} // anon namespace
> +
> +/* RAII-enabled bit stream writer, counterpart to bits_in.  */
> +
> +namespace {
> +struct bits_out {
> +  bytes_out& out;
> +  uint32_t bit_val = 0;
> +  unsigned bit_pos = 0;
> +  char is_set = -1;
> +
> +  bits_out (bytes_out& out)
> +    : out (out)
> +  { }
> +
> +  ~bits_out ()
> +  {
> +    bflush ();
> +  }
> +
> +  bits_out(const bits_out&) = delete;
> +  bits_out& operator=(const bits_out&) = delete;
> +
> +  /* Completed a block of bools.  */
> +  void bflush ()
> +  {
> +    if (bit_pos)
> +      {
> +       out.u32 (bit_val);
> +       out.lengths[2] += bit_flush (out, bit_val, bit_pos);
> +      }
> +    out.spans[2]++;
> +    is_set = -1;
> +  }
> +
> +  /* Write one bit.
> +
> +     It may be worth optimizing for most bools being zero.  Some kind of
> +     run-length encoding?  */
> +  void b (bool x)
> +  {
> +    if (is_set != x)
> +      {
> +       is_set = x;
> +       out.spans[x]++;
> +      }
> +    out.lengths[x]++;
> +    bit_val |= unsigned (x) << bit_pos++;
> +    if (bit_pos == 32)
> +      {
> +       out.u32 (bit_val);
> +       out.lengths[2] += bit_flush (out, bit_val, bit_pos);
> +      }
> +  }
> +};
> +} // anon namespace
>
>  /* Instrumentation.  */
>  unsigned bytes_out::spans[4];
>  unsigned bytes_out::lengths[4];
> -int bytes_out::is_set = -1;
>
>  /* If CRC_PTR non-null, set the CRC of the buffer.  Mix the CRC into
>     that pointed to by CRC_PTR.  */
> @@ -723,73 +799,6 @@ bytes_out::set_crc (unsigned *crc_ptr)
>      }
>  }
>
> -/* Finish a set of bools.  */
> -
> -void
> -bytes_out::bflush ()
> -{
> -  if (bit_pos)
> -    {
> -      u32 (bit_val);
> -      lengths[2] += bit_flush ();
> -    }
> -  spans[2]++;
> -  is_set = -1;
> -}
> -
> -void
> -bytes_in::bflush ()
> -{
> -  if (bit_pos)
> -    bit_flush ();
> -}
> -
> -/* When reading, we don't know how many bools we'll read in.  So read
> -   4 bytes-worth, and then rewind when flushing if we didn't need them
> -   all.  You can't have a block of bools closer than 4 bytes to the
> -   end of the buffer.  */
> -
> -void
> -bytes_in::bfill ()
> -{
> -  bit_val = u32 ();
> -}
> -
> -/* Bools are packed into bytes.  You cannot mix bools and non-bools.
> -   You must call bflush before emitting another type.  So batch your
> -   bools.
> -
> -   It may be worth optimizing for most bools being zero.  Some kind of
> -   run-length encoding?  */
> -
> -void
> -bytes_out::b (bool x)
> -{
> -  if (is_set != x)
> -    {
> -      is_set = x;
> -      spans[x]++;
> -    }
> -  lengths[x]++;
> -  bit_val |= unsigned (x) << bit_pos++;
> -  if (bit_pos == 32)
> -    {
> -      u32 (bit_val);
> -      lengths[2] += bit_flush ();
> -    }
> -}
> -
> -bool
> -bytes_in::b ()
> -{
> -  if (!bit_pos)
> -    bfill ();
> -  bool v = (bit_val >> bit_pos++) & 1;
> -  if (bit_pos == 32)
> -    bit_flush ();
> -  return v;
> -}
> -
>  /* Exactly 4 bytes.  Used internally for bool packing and a few other
>     places.  We can't simply use uint32_t because (a) alignment and
>     (b) we need little-endian for the bool streaming rewinding to make
> @@ -2846,6 +2855,7 @@ struct post_process_data {
>     read trees with TREE_VISITED.  Thus it's quite safe to have
>     multiple concurrent readers.  Which is good, because lazy
>     loading. */
> +namespace {
>  class trees_in : public bytes_in {
>    typedef bytes_in parent;
>
> @@ -2870,15 +2880,15 @@ private:
>
>  public:
>    /* Needed for binfo writing  */
> -  bool core_bools (tree);
> +  bool core_bools (tree, bits_in&);
>
>  private:
>    /* Stream tree_core, lang_decl_specific and lang_type_specific
>       bits.  */
>    bool core_vals (tree);
> -  bool lang_type_bools (tree);
> +  bool lang_type_bools (tree, bits_in&);
>    bool lang_type_vals (tree);
> -  bool lang_decl_bools (tree);
> +  bool lang_decl_bools (tree, bits_in&);
>    bool lang_decl_vals (tree);
>    bool lang_vals (tree);
>    bool tree_node_bools (tree);
> @@ -2965,6 +2975,7 @@ private:
>  private:
>    void assert_definition (tree, bool installing);
>  };
> +} // anon namespace
>
>  trees_in::trees_in (module_state *state)
>    :parent (), state (state), unused (0)
> @@ -2982,6 +2993,7 @@ trees_in::~trees_in ()
>  }
>
>  /* Tree stream writer.  */
> +namespace {
>  class trees_out : public bytes_out {
>    typedef bytes_out parent;
>
> @@ -3043,11 +3055,11 @@ public:
>    }
>
>  private:
> -  void core_bools (tree);
> +  void core_bools (tree, bits_out&);
>    void core_vals (tree);
> -  void lang_type_bools (tree);
> +  void lang_type_bools (tree, bits_out&);
>    void lang_type_vals (tree);
> -  void lang_decl_bools (tree);
> +  void lang_decl_bools (tree, bits_out&);
>    void lang_decl_vals (tree);
>    void lang_vals (tree);
>    void tree_node_bools (tree);
> @@ -3126,6 +3138,7 @@ private:
>    static unsigned back_ref_count;
>    static unsigned null_count;
>  };
> +} // anon namespace
>
>  /* Instrumentation counters.  */
>  unsigned trees_out::tree_val_count;
> @@ -5292,9 +5305,9 @@ trees_in::start (unsigned code)
>  /* Read & write the core boolean flags.  */
>
>  void
> -trees_out::core_bools (tree t)
> +trees_out::core_bools (tree t, bits_out& bits)
>  {
> -#define WB(X) (b (X))
> +#define WB(X) (bits.b (X))
>    tree_code code = TREE_CODE (t);
>
>    WB (t->base.side_effects_flag);
> @@ -5314,6 +5327,8 @@ trees_out::core_bools (tree t)
>    if (TREE_CODE_CLASS (code) != tcc_type)
>      /* This is TYPE_CACHED_VALUES_P for types.  */
>      WB (t->base.public_flag);
> +  else
> +    WB (false);
>    WB (t->base.private_flag);
>    WB (t->base.protected_flag);
>    WB (t->base.deprecated_flag);
> @@ -5327,7 +5342,7 @@ trees_out::core_bools (tree t)
>      case TARGET_MEM_REF:
>      case TREE_VEC:
>        /* These use different base.u fields.  */
> -      break;
> +      return;
>
>      default:
>        WB (t->base.u.bits.lang_flag_0);
> @@ -5374,6 +5389,7 @@ trees_out::core_bools (tree t)
>        WB (t->type_common.lang_flag_5);
>        WB (t->type_common.lang_flag_6);
>        WB (t->type_common.typeless_storage);
> +      return;
>      }
>
>    if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
> @@ -5439,6 +5455,8 @@ trees_out::core_bools (tree t)
>        WB (t->decl_common.decl_nonshareable_flag);
>        WB (t->decl_common.decl_not_flexarray);
>      }
> +  else
> +    return;
>
>    if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
>      {
> @@ -5459,6 +5477,8 @@ trees_out::core_bools (tree t)
>        WB (t->decl_with_vis.final);
>        WB (t->decl_with_vis.regdecl_flag);
>      }
> +  else
> +    return;
>
>    if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
>      {
> @@ -5485,13 +5505,16 @@ trees_out::core_bools (tree t)
>        WB ((kind >> 0) & 1);
>        WB ((kind >> 1) & 1);
>      }
> +  else
> +    return;
>  #undef WB
>  }
>
>  bool
> -trees_in::core_bools (tree t)
> +trees_in::core_bools (tree t, bits_in& bits)
>  {
> -#define RB(X) ((X) = b ())
> +#define RB(X) ((X) = bits.b ())
> +
>    tree_code code = TREE_CODE (t);
>
>    RB (t->base.side_effects_flag);
> @@ -5507,6 +5530,8 @@ trees_in::core_bools (tree t)
>    RB (t->base.static_flag);
>    if (TREE_CODE_CLASS (code) != tcc_type)
>      RB (t->base.public_flag);
> +  else
> +    bits.b ();
>    RB (t->base.private_flag);
>    RB (t->base.protected_flag);
>    RB (t->base.deprecated_flag);
> @@ -5520,7 +5545,7 @@ trees_in::core_bools (tree t)
>      case TARGET_MEM_REF:
>      case TREE_VEC:
>        /* These use different base.u fields.  */
> -      break;
> +      goto done;
>
>      default:
>        RB (t->base.u.bits.lang_flag_0);
> @@ -5554,6 +5579,7 @@ trees_in::core_bools (tree t)
>        RB (t->type_common.lang_flag_5);
>        RB (t->type_common.lang_flag_6);
>        RB (t->type_common.typeless_storage);
> +      goto done;
>      }
>
>    if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
> @@ -5584,6 +5610,8 @@ trees_in::core_bools (tree t)
>        RB (t->decl_common.decl_nonshareable_flag);
>        RB (t->decl_common.decl_not_flexarray);
>      }
> +  else
> +    goto done;
>
>    if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
>      {
> @@ -5604,6 +5632,8 @@ trees_in::core_bools (tree t)
>        RB (t->decl_with_vis.final);
>        RB (t->decl_with_vis.regdecl_flag);
>      }
> +  else
> +    goto done;
>
>    if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
>      {
> @@ -5627,20 +5657,22 @@ trees_in::core_bools (tree t)
>
>        /* decl_type is a (misnamed) 2 bit discriminator.         */
>        unsigned kind = 0;
> -      kind |= unsigned (b ()) << 0;
> -      kind |= unsigned (b ()) << 1;
> +      kind |= unsigned (bits.b ()) << 0;
> +      kind |= unsigned (bits.b ()) << 1;
>        t->function_decl.decl_type = function_decl_type (kind);
>      }
>  #undef RB
> +done:
>    return !get_overrun ();
>  }
>
>  void
> -trees_out::lang_decl_bools (tree t)
> +trees_out::lang_decl_bools (tree t, bits_out& bits)
>  {
> -#define WB(X) (b (X))
> +#define WB(X) (bits.b (X))
>    const struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
>
> +  bits.bflush ();
>    WB (lang->u.base.language == lang_cplusplus);
>    WB ((lang->u.base.use_template >> 0) & 1);
>    WB ((lang->u.base.use_template >> 1) & 1);
> @@ -5664,6 +5696,8 @@ trees_out::lang_decl_bools (tree t)
>    WB (lang->u.base.module_attach_p);
>    if (VAR_OR_FUNCTION_DECL_P (t))
>      WB (lang->u.base.module_keyed_decls_p);
> +  else
> +    WB (false);
>    switch (lang->u.base.selector)
>      {
>      default:
> @@ -5714,15 +5748,16 @@ trees_out::lang_decl_bools (tree t)
>  }
>
>  bool
> -trees_in::lang_decl_bools (tree t)
> +trees_in::lang_decl_bools (tree t, bits_in& bits)
>  {
> -#define RB(X) ((X) = b ())
> +#define RB(X) ((X) = bits.b ())
>    struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
>
> -  lang->u.base.language = b () ? lang_cplusplus : lang_c;
> +  bits.bflush ();
> +  lang->u.base.language = bits.b () ? lang_cplusplus : lang_c;
>    unsigned v;
> -  v = b () << 0;
> -  v |= b () << 1;
> +  v = bits.b () << 0;
> +  v |= bits.b () << 1;
>    lang->u.base.use_template = v;
>    /* lang->u.base.not_really_extern is not streamed.  */
>    RB (lang->u.base.initialized_in_class);
> @@ -5738,6 +5773,8 @@ trees_in::lang_decl_bools (tree t)
>    RB (lang->u.base.module_attach_p);
>    if (VAR_OR_FUNCTION_DECL_P (t))
>      RB (lang->u.base.module_keyed_decls_p);
> +  else
> +    bits.b ();
>    switch (lang->u.base.selector)
>      {
>      default:
> @@ -5786,11 +5823,12 @@ trees_in::lang_decl_bools (tree t)
>  }
>
>  void
> -trees_out::lang_type_bools (tree t)
> +trees_out::lang_type_bools (tree t, bits_out& bits)
>  {
> -#define WB(X) (b (X))
> +#define WB(X) (bits.b (X))
>    const struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
>
> +  bits.bflush ();
>    WB (lang->has_type_conversion);
>    WB (lang->has_copy_ctor);
>    WB (lang->has_default_ctor);
> @@ -5852,11 +5890,12 @@ trees_out::lang_type_bools (tree t)
>  }
>
>  bool
> -trees_in::lang_type_bools (tree t)
> +trees_in::lang_type_bools (tree t, bits_in& bits)
>  {
> -#define RB(X) ((X) = b ())
> +#define RB(X) ((X) = bits.b ())
>    struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
>
> +  bits.bflush ();
>    RB (lang->has_type_conversion);
>    RB (lang->has_copy_ctor);
>    RB (lang->has_default_ctor);
> @@ -5864,8 +5903,8 @@ trees_in::lang_type_bools (tree t)
>    RB (lang->ref_needs_init);
>    RB (lang->has_const_copy_assign);
>    unsigned v;
> -  v = b () << 0;
> -  v |= b () << 1;
> +  v = bits.b () << 0;
> +  v |= bits.b () << 1;
>    lang->use_template = v;
>
>    RB (lang->has_mutable);
> @@ -5877,8 +5916,8 @@ trees_in::lang_type_bools (tree t)
>    RB (lang->has_new);
>    RB (lang->has_array_new);
>
> -  v = b () << 0;
> -  v |= b () << 1;
> +  v = bits.b () << 0;
> +  v |= bits.b () << 1;
>    lang->gets_delete = v;
>    RB (lang->interface_only);
>    RB (lang->interface_unknown);
> @@ -7106,18 +7145,19 @@ trees_out::tree_node_bools (tree t)
>    gcc_checking_assert (TREE_CODE (t) != NAMESPACE_DECL
>                        || DECL_NAMESPACE_ALIAS (t));
>
> -  core_bools (t);
> +  bits_out bits (*this);
> +  core_bools (t, bits);
>
>    switch (TREE_CODE_CLASS (TREE_CODE (t)))
>      {
>      case tcc_declaration:
>        {
>         bool specific = DECL_LANG_SPECIFIC (t) != NULL;
> -       b (specific);
> +       bits.b (specific);
>         if (specific && VAR_P (t))
> -         b (DECL_DECOMPOSITION_P (t));
> +         bits.b (DECL_DECOMPOSITION_P (t));
>         if (specific)
> -         lang_decl_bools (t);
> +         lang_decl_bools (t, bits);
>        }
>        break;
>
> @@ -7128,9 +7168,9 @@ trees_out::tree_node_bools (tree t)
>         gcc_assert (TYPE_LANG_SPECIFIC (t)
>                     == TYPE_LANG_SPECIFIC (TYPE_MAIN_VARIANT (t)));
>
> -       b (specific);
> +       bits.b (specific);
>         if (specific)
> -         lang_type_bools (t);
> +         lang_type_bools (t, bits);
>        }
>        break;
>
> @@ -7138,34 +7178,35 @@ trees_out::tree_node_bools (tree t)
>        break;
>      }
>
> -  bflush ();
> +  bits.bflush ();
>  }
>
>  bool
>  trees_in::tree_node_bools (tree t)
>  {
> -  bool ok = core_bools (t);
> +  bits_in bits (*this);
> +  bool ok = core_bools (t, bits);
>
>    if (ok)
>      switch (TREE_CODE_CLASS (TREE_CODE (t)))
>        {
>        case tcc_declaration:
> -       if (b ())
> +       if (bits.b ())
>           {
> -           bool decomp = VAR_P (t) && b ();
> +           bool decomp = VAR_P (t) && bits.b ();
>
>             ok = maybe_add_lang_decl_raw (t, decomp);
>             if (ok)
> -             ok = lang_decl_bools (t);
> +             ok = lang_decl_bools (t, bits);
>         }
>         break;
>
>        case tcc_type:
> -       if (b ())
> +       if (bits.b ())
>           {
>             ok = maybe_add_lang_type_raw (t);
>             if (ok)
> -             ok = lang_type_bools (t);
> +             ok = lang_type_bools (t, bits);
>           }
>         break;
>
> @@ -7173,7 +7214,7 @@ trees_in::tree_node_bools (tree t)
>         break;
>        }
>
> -  bflush ();
> +  bits.bflush ();
>    if (!ok || get_overrun ())
>      return false;
>
> @@ -7699,8 +7740,7 @@ trees_out::decl_value (tree decl, depset *dep)
>           if (!(mk & MK_template_mask) && !state->is_header ())
>             {
>               /* Tell the importer whether this is a global module entity,
> -                or a module entity.  This bool merges into the next block
> -                of bools.  Sneaky.  */
> +                or a module entity.  */
>               tree o = get_originating_module_decl (decl);
>               bool is_attached = false;
>
> @@ -7709,9 +7749,9 @@ trees_out::decl_value (tree decl, depset *dep)
>                   && DECL_MODULE_ATTACH_P (not_tmpl))
>                 is_attached = true;
>
> -             b (is_attached);
> +             c (is_attached);
>             }
> -         b (dep && dep->has_defn ());
> +         c (dep && dep->has_defn ());
>         }
>        tree_node_bools (decl);
>      }
> @@ -7982,10 +8022,9 @@ trees_in::decl_value ()
>        if (mk != MK_unique)
>         {
>           if (!(mk & MK_template_mask) && !state->is_header ())
> -           /* See note in trees_out about where this bool is sequenced.  */
> -           is_attached = b ();
> +           is_attached = c ();
>
> -         has_defn = b ();
> +         has_defn = c ();
>         }
>
>        if (!tree_node_bools (decl))
> @@ -16682,10 +16721,9 @@ module_state::write_define (bytes_out &sec, const cpp_macro *macro)
>  {
>    sec.u (macro->count);
>
> -  sec.b (macro->fun_like);
> -  sec.b (macro->variadic);
> -  sec.b (macro->syshdr);
> -  sec.bflush ();
> +  sec.c (macro->fun_like);
> +  sec.c (macro->variadic);
> +  sec.c (macro->syshdr);
>
>    write_location (sec, macro->line);
>    if (macro->fun_like)
> @@ -16780,10 +16818,9 @@ module_state::read_define (bytes_in &sec, cpp_reader *reader) const
>    macro->kind = cmk_macro;
>    macro->imported_p = true;
>
> -  macro->fun_like = sec.b ();
> -  macro->variadic = sec.b ();
> -  macro->syshdr = sec.b ();
> -  sec.bflush ();
> +  macro->fun_like = sec.c ();
> +  macro->variadic = sec.c ();
> +  macro->syshdr = sec.c ();
>
>    macro->line = read_location (sec);
>
> --
> 2.44.0.rc1.15.g4fc51f00ef
>
Patrick Palka Feb. 16, 2024, 3:06 p.m. UTC | #2
On Thu, 15 Feb 2024, Patrick Palka wrote:

> Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look
> OK for trunk?
> 
> -- >8 --
> 
> One would expect consecutive calls to bytes_in/out::b for streaming
> adjacent bits, as we do for tree flag streaming, to at least be
> optimized by the compiler into individual bit operations using
> statically known bit positions (and ideally merged into larger sized
> reads/writes).
> 
> Unfortunately this doesn't happen because the compiler has trouble
> tracking the values of this->bit_pos and this->bit_val across such
> calls, likely because the compiler doesn't know 'this' and so it's
> treated as global memory.  This means for each consecutive bit stream
> operation, bit_pos and bit_val are loaded from memory, checked if
> buffering is needed, and finally the bit is extracted from bit_val
> according to the (unknown) bit_pos, even though relative to the previous
> operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
> is just 1 larger.  This ends up being quite slow, with tree_node_bools
> taking 10% of time when streaming in parts of the std module.
> 
> This patch optimizes this by making tracking of bit_pos and bit_val
> easier for the compiler.  Rather than bit_pos and bit_val being members
> of the (effectively global) bytes_in/out objects, this patch factors out
> the bit streaming code/state into separate classes bits_in/out that get
> constructed locally as needed for bit streaming.  Since these objects
> are now clearly local, the compiler can more easily track their values.
> 
> And since bit streaming is intended to be batched it's natural for these
> new classes to be RAII-enabled such that the bit stream is flushed upon
> destruction.
> 
> In order to make the most of this improved tracking of bit position,
> this patch changes parts where we conditionally stream a tree flag
> to unconditionally stream (the flag or a dummy value).  That way
> the number of bits streamed and the respective bit positions are as
> statically known as reasonably possible.  In lang_decl_bools and
> lang_type_bools we flush the current bit buffer at the start so that
> subsequent bit positions are statically known.  And in core bools, we
> can add explicit early exits utilizing invariants that the compiler
> can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
> and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
> it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
> than 4 bits, it's more space efficient to stream them as individual
> bytes rather than as packed bits (due to the 32-bit buffer).

Oops, this last sentence is wrong.  Although the size of the bit buffer
is 32 bits, upon flushing we rewind unused bytes within the buffer,
which means streaming 2-8 bits ends up using only one byte not all four.
So v2 below undoes this pessimization.

> This patch also moves the definitions of the relevant streaming classes
> into anonymous namespaces so that the compiler can make more informed
> decisions about inlining their member functions.
> 
> After this patch, compile time for a simple Hello World using the std
> module is reduced by 7% with a release compiler.  The on-disk size of
> the std module increases by 0.7% (presumably due to the extra flushing
> done in lang_decl_bools and lang_type_bools).

The on-disk std module now only grows 0.4% instead of 0.7%.

> 
> The bit stream out performance isn't improved as much as the stream in
> due to the spans/lengths instrumentation performed on stream out (which
> probably should be e.g. removed for release builds?)

-- >8 --

gcc/cp/ChangeLog:

	* module.cc: Update comment about classes defined.
	(class data): Enclose in an anonymous namespace.
	(data::calc_crc): Moved from bytes::calc_crc.
	(class bytes): Remove.  Move bit_flush to namespace scope.
	(class bytes_in): Enclose in an anonymous namespace.  Inherit
	directly from data and adjust accordingly.  Move b and bflush
	members to bits_in.
	(class bytes_out): As above.  Remove is_set static data member.
	(bit_flush): Moved from class bytes.
	(struct bits_in): Define.
	(struct bits_out): Define.
	(bytes_out::bflush): Moved to bits_out/in.
	(bytes_in::bflush): Likewise
	(bytes_in::bfill): Removed.
	(bytes_out::b): Moved to bits_out/in.
	(bytes_in::b): Likewise.
	(class trees_in): Enclose in an anonymous namespace.
	(class trees_out): Enclose in an anonymous namespace.
	(trees_out::core_bools): Add bits_out/in parameter and use it.
	Unconditionally stream a bit for public_flag.  Add early exits
	as appropriate.
	(trees_out::core_bools): Likewise.
	(trees_out::lang_decl_bools): Add bits_out/in parameter and use
	it.  Flush the current bit buffer at the start.  Unconditionally
	stream a bit for module_keyed_decls_p.
	(trees_in::lang_decl_bools): Likewise.
	(trees_out::lang_type_bools): Add bits_out/in parameter and use
	it.  Flush the current bit buffer at the start.
	(trees_in::lang_type_bools): Likewise.
	(trees_out::tree_node_bools): Construct a bits_out object and
	use/pass it.
	(trees_in::tree_node_bools): Likewise.
	(trees_out::decl_value): Likewise.
	(trees_in::decl_value): Likewise.
	(module_state::write_define): Likewise.
	(module_state::read_define): Likewise.
---
 gcc/cp/module.cc | 418 ++++++++++++++++++++++++++---------------------
 1 file changed, 231 insertions(+), 187 deletions(-)

diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
index 0291d456ff5..2ecee007e8a 100644
--- a/gcc/cp/module.cc
+++ b/gcc/cp/module.cc
@@ -153,9 +153,11 @@ Classes used:
 
    data - buffer
 
-   bytes - data streamer
-   bytes_in : bytes - scalar reader
-   bytes_out : bytes - scalar writer
+   bytes_in - scalar reader
+   bytes_out - scalar writer
+
+   bits_in - bit stream reader
+   bits_out - bit stream writer
 
    elf - ELROND format
    elf_in : elf - ELROND reader
@@ -346,6 +348,7 @@ typedef hash_map<void *,signed,ptr_int_traits> ptr_int_hash_map;
 
 /* Variable length buffer.  */
 
+namespace {
 class data {
 public:
   class allocator {
@@ -393,6 +396,8 @@ protected:
     return res;
   }
 
+  unsigned calc_crc (unsigned) const;
+
 public:
   void unuse (unsigned count)
   {
@@ -402,6 +407,7 @@ public:
 public:
   static allocator simple_memory;
 };
+} // anon namespace
 
 /* The simple data allocator.  */
 data::allocator data::simple_memory;
@@ -447,46 +453,11 @@ data::allocator::shrink (char *ptr)
   XDELETEVEC (ptr);
 }
 
-/* Byte streamer base.   Buffer with read/write position and smarts
-   for single bits.  */
-
-class bytes : public data {
-public:
-  typedef data parent;
-
-protected:
-  uint32_t bit_val;	/* Bit buffer.  */
-  unsigned bit_pos;	/* Next bit in bit buffer.  */
-
-public:
-  bytes ()
-    :parent (), bit_val (0), bit_pos (0)
-  {}
-  ~bytes () 
-  {
-  }
-
-protected:
-  unsigned calc_crc (unsigned) const;
-
-protected:
-  /* Finish bit packet.  Rewind the bytes not used.  */
-  unsigned bit_flush ()
-  {
-    gcc_assert (bit_pos);
-    unsigned bytes = (bit_pos + 7) / 8;
-    unuse (4 - bytes);
-    bit_pos = 0;
-    bit_val = 0;
-    return bytes;
-  }
-};
-
 /* Calculate the crc32 of the buffer.  Note the CRC is stored in the
    first 4 bytes, so don't include them.  */
 
 unsigned
-bytes::calc_crc (unsigned l) const
+data::calc_crc (unsigned l) const
 {
   return crc32 (0, (unsigned char *)buffer + 4, l - 4);
 }
@@ -495,8 +466,9 @@ class elf_in;
 
 /* Byte stream reader.  */
 
-class bytes_in : public bytes {
-  typedef bytes parent;
+namespace {
+class bytes_in : public data {
+  typedef data parent;
 
 protected:
   bool overrun;  /* Sticky read-too-much flag.  */
@@ -531,7 +503,6 @@ public:
     if (offset > size)
       set_overrun ();
     pos = offset;
-    bit_pos = bit_val = 0;
   }
 
 public:
@@ -573,14 +544,7 @@ public:
   unsigned u32 ();  	/* Read uncompressed integer.  */
 
 public:
-  bool b ();	    	/* Read a bool.  */
-  void bflush ();	/* Completed a block of bools.  */
-
-private:
-  void bfill ();	/* Get the next block of bools.  */
-
-public:
-  int c ();		/* Read a char.  */
+  int c () ATTRIBUTE_UNUSED;		/* Read a char.  */
   int i ();		/* Read a signed int.  */
   unsigned u ();	/* Read an unsigned int.  */
   size_t z ();		/* Read a size_t.  */
@@ -590,6 +554,7 @@ public:
   const void *buf (size_t); /* Read a fixed-length buffer.  */
   cpp_hashnode *cpp_node (); /* Read a cpp node.  */
 };
+} // anon namespace
 
 /* Verify the buffer's CRC is correct.  */
 
@@ -610,8 +575,9 @@ class elf_out;
 
 /* Byte stream writer.  */
 
-class bytes_out : public bytes {
-  typedef bytes parent;
+namespace {
+class bytes_out : public data {
+  typedef data parent;
 
 public:
   allocator *memory;	/* Obtainer of memory.  */
@@ -659,11 +625,7 @@ public:
   void u32 (unsigned);  /* Write uncompressed integer.  */
 
 public:
-  void b (bool);	/* Write bool.  */
-  void bflush ();	/* Finish block of bools.  */
-
-public:
-  void c (unsigned char); /* Write unsigned char.  */
+  void c (unsigned char) ATTRIBUTE_UNUSED; /* Write unsigned char.  */
   void i (int);		/* Write signed int.  */
   void u (unsigned);	/* Write unsigned int.  */
   void z (size_t s);	/* Write size_t.  */
@@ -694,13 +656,126 @@ protected:
   /* Instrumentation.  */
   static unsigned spans[4];
   static unsigned lengths[4];
-  static int is_set;
+  friend struct bits_out;
 };
+} // anon namespace
+
+/* Finish bit packet.  Rewind the bytes not used.  */
+static unsigned
+bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
+{
+  gcc_assert (bit_pos);
+  unsigned bytes = (bit_pos + 7) / 8;
+  bits.unuse (4 - bytes);
+  bit_pos = 0;
+  bit_val = 0;
+  return bytes;
+}
+
+/* Bit stream reader (RAII-enabled).  Bools are packed into bytes.  You
+   cannot mix bools and non-bools.  Use bflush to flush the current stream
+   of bools on demand.  Upon destruction bflush is called.
+
+   When reading, we don't know how many bools we'll read in.  So read
+   4 bytes-worth, and then rewind when flushing if we didn't need them
+   all.  You can't have a block of bools closer than 4 bytes to the
+   end of the buffer.  */
+
+namespace {
+struct bits_in {
+  bytes_in& in;
+  uint32_t bit_val = 0;
+  unsigned bit_pos = 0;
+
+  bits_in (bytes_in& in)
+    : in (in)
+  { }
+
+  ~bits_in ()
+  {
+    bflush ();
+  }
+
+  bits_in(const bits_in&) = delete;
+  bits_in& operator=(const bits_in&) = delete;
+
+  /* Completed a block of bools.  */
+  void bflush ()
+  {
+    if (bit_pos)
+      bit_flush (in, bit_val, bit_pos);
+  }
+
+  /* Read one bit.  */
+  bool b ()
+  {
+    if (!bit_pos)
+      bit_val = in.u32 ();
+    bool x = (bit_val >> bit_pos) & 1;
+    bit_pos = (bit_pos + 1) % 32;
+    return x;
+  }
+};
+} // anon namespace
+
+/* Bit stream writer (RAII-enabled), counterpart to bits_in.  */
+
+namespace {
+struct bits_out {
+  bytes_out& out;
+  uint32_t bit_val = 0;
+  unsigned bit_pos = 0;
+  char is_set = -1;
+
+  bits_out (bytes_out& out)
+    : out (out)
+  { }
+
+  ~bits_out ()
+  {
+    bflush ();
+  }
+
+  bits_out(const bits_out&) = delete;
+  bits_out& operator=(const bits_out&) = delete;
+
+  /* Completed a block of bools.  */
+  void bflush ()
+  {
+    if (bit_pos)
+      {
+	out.u32 (bit_val);
+	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
+      }
+    out.spans[2]++;
+    is_set = -1;
+  }
+
+  /* Write one bit.
+
+     It may be worth optimizing for most bools being zero.  Some kind of
+     run-length encoding?  */
+  void b (bool x)
+  {
+    if (is_set != x)
+      {
+	is_set = x;
+	out.spans[x]++;
+      }
+    out.lengths[x]++;
+    bit_val |= unsigned (x) << bit_pos++;
+    if (bit_pos == 32)
+      {
+	out.u32 (bit_val);
+	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
+      }
+  }
+};
+} // anon namespace
 
 /* Instrumentation.  */
 unsigned bytes_out::spans[4];
 unsigned bytes_out::lengths[4];
-int bytes_out::is_set = -1;
 
 /* If CRC_PTR non-null, set the CRC of the buffer.  Mix the CRC into
    that pointed to by CRC_PTR.  */
@@ -723,73 +798,6 @@ bytes_out::set_crc (unsigned *crc_ptr)
     }
 }
 
-/* Finish a set of bools.  */
-
-void
-bytes_out::bflush ()
-{
-  if (bit_pos)
-    {
-      u32 (bit_val);
-      lengths[2] += bit_flush ();
-    }
-  spans[2]++;
-  is_set = -1;
-}
-
-void
-bytes_in::bflush ()
-{
-  if (bit_pos)
-    bit_flush ();
-}
-
-/* When reading, we don't know how many bools we'll read in.  So read
-   4 bytes-worth, and then rewind when flushing if we didn't need them
-   all.  You can't have a block of bools closer than 4 bytes to the
-   end of the buffer.  */
-
-void
-bytes_in::bfill ()
-{
-  bit_val = u32 ();
-}
-
-/* Bools are packed into bytes.  You cannot mix bools and non-bools.
-   You must call bflush before emitting another type.  So batch your
-   bools.
-
-   It may be worth optimizing for most bools being zero.  Some kind of
-   run-length encoding?  */
-
-void
-bytes_out::b (bool x)
-{
-  if (is_set != x)
-    {
-      is_set = x;
-      spans[x]++;
-    }
-  lengths[x]++;
-  bit_val |= unsigned (x) << bit_pos++;
-  if (bit_pos == 32)
-    {
-      u32 (bit_val);
-      lengths[2] += bit_flush ();
-    }
-}
-
-bool
-bytes_in::b ()
-{
-  if (!bit_pos)
-    bfill ();
-  bool v = (bit_val >> bit_pos++) & 1;
-  if (bit_pos == 32)
-    bit_flush ();
-  return v;
-}
-
 /* Exactly 4 bytes.  Used internally for bool packing and a few other
    places.  We can't simply use uint32_t because (a) alignment and
    (b) we need little-endian for the bool streaming rewinding to make
@@ -2846,6 +2854,7 @@ struct post_process_data {
    read trees with TREE_VISITED.  Thus it's quite safe to have
    multiple concurrent readers.  Which is good, because lazy
    loading. */
+namespace {
 class trees_in : public bytes_in {
   typedef bytes_in parent;
 
@@ -2870,15 +2879,15 @@ private:
 
 public:
   /* Needed for binfo writing  */
-  bool core_bools (tree);
+  bool core_bools (tree, bits_in&);
 
 private:
   /* Stream tree_core, lang_decl_specific and lang_type_specific
      bits.  */
   bool core_vals (tree);
-  bool lang_type_bools (tree);
+  bool lang_type_bools (tree, bits_in&);
   bool lang_type_vals (tree);
-  bool lang_decl_bools (tree);
+  bool lang_decl_bools (tree, bits_in&);
   bool lang_decl_vals (tree);
   bool lang_vals (tree);
   bool tree_node_bools (tree);
@@ -2965,6 +2974,7 @@ private:
 private:
   void assert_definition (tree, bool installing);
 };
+} // anon namespace
 
 trees_in::trees_in (module_state *state)
   :parent (), state (state), unused (0)
@@ -2982,6 +2992,7 @@ trees_in::~trees_in ()
 }
 
 /* Tree stream writer.  */
+namespace {
 class trees_out : public bytes_out {
   typedef bytes_out parent;
 
@@ -3043,11 +3054,11 @@ public:
   }
 
 private:
-  void core_bools (tree);
+  void core_bools (tree, bits_out&);
   void core_vals (tree);
-  void lang_type_bools (tree);
+  void lang_type_bools (tree, bits_out&);
   void lang_type_vals (tree);
-  void lang_decl_bools (tree);
+  void lang_decl_bools (tree, bits_out&);
   void lang_decl_vals (tree);
   void lang_vals (tree);
   void tree_node_bools (tree);
@@ -3126,6 +3137,7 @@ private:
   static unsigned back_ref_count;
   static unsigned null_count;
 };
+} // anon namespace
 
 /* Instrumentation counters.  */
 unsigned trees_out::tree_val_count;
@@ -5292,9 +5304,9 @@ trees_in::start (unsigned code)
 /* Read & write the core boolean flags.  */
 
 void
-trees_out::core_bools (tree t)
+trees_out::core_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
   tree_code code = TREE_CODE (t);
 
   WB (t->base.side_effects_flag);
@@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
   if (TREE_CODE_CLASS (code) != tcc_type)
     /* This is TYPE_CACHED_VALUES_P for types.  */
     WB (t->base.public_flag);
+  else
+    WB (false);
   WB (t->base.private_flag);
   WB (t->base.protected_flag);
   WB (t->base.deprecated_flag);
@@ -5327,7 +5341,7 @@ trees_out::core_bools (tree t)
     case TARGET_MEM_REF:
     case TREE_VEC:
       /* These use different base.u fields.  */
-      break;
+      return;
 
     default:
       WB (t->base.u.bits.lang_flag_0);
@@ -5359,7 +5373,7 @@ trees_out::core_bools (tree t)
       break;
     }
 
-  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
+  if (TREE_CODE_CLASS (code) == tcc_type)
     {
       WB (t->type_common.no_force_blk_flag);
       WB (t->type_common.needs_constructing_flag);
@@ -5374,7 +5388,10 @@ trees_out::core_bools (tree t)
       WB (t->type_common.lang_flag_5);
       WB (t->type_common.lang_flag_6);
       WB (t->type_common.typeless_storage);
+      return;
     }
+  else if (TREE_CODE_CLASS (code) != tcc_declaration)
+    return;
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
     {
@@ -5439,6 +5456,8 @@ trees_out::core_bools (tree t)
       WB (t->decl_common.decl_nonshareable_flag);
       WB (t->decl_common.decl_not_flexarray);
     }
+  else
+    return;
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
     {
@@ -5459,6 +5478,8 @@ trees_out::core_bools (tree t)
       WB (t->decl_with_vis.final);
       WB (t->decl_with_vis.regdecl_flag);
     }
+  else
+    return;
 
   if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
     {
@@ -5489,9 +5510,10 @@ trees_out::core_bools (tree t)
 }
 
 bool
-trees_in::core_bools (tree t)
+trees_in::core_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
+
   tree_code code = TREE_CODE (t);
 
   RB (t->base.side_effects_flag);
@@ -5507,6 +5529,8 @@ trees_in::core_bools (tree t)
   RB (t->base.static_flag);
   if (TREE_CODE_CLASS (code) != tcc_type)
     RB (t->base.public_flag);
+  else
+    bits.b ();
   RB (t->base.private_flag);
   RB (t->base.protected_flag);
   RB (t->base.deprecated_flag);
@@ -5520,7 +5544,7 @@ trees_in::core_bools (tree t)
     case TARGET_MEM_REF:
     case TREE_VEC:
       /* These use different base.u fields.  */
-      break;
+      goto done;
 
     default:
       RB (t->base.u.bits.lang_flag_0);
@@ -5539,7 +5563,7 @@ trees_in::core_bools (tree t)
       break;
     }
 
-  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
+  if (TREE_CODE_CLASS (code) == tcc_type)
     {
       RB (t->type_common.no_force_blk_flag);
       RB (t->type_common.needs_constructing_flag);
@@ -5554,7 +5578,10 @@ trees_in::core_bools (tree t)
       RB (t->type_common.lang_flag_5);
       RB (t->type_common.lang_flag_6);
       RB (t->type_common.typeless_storage);
+      goto done;
     }
+  else if (TREE_CODE_CLASS (code) != tcc_declaration)
+    goto done;
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
     {
@@ -5584,6 +5611,8 @@ trees_in::core_bools (tree t)
       RB (t->decl_common.decl_nonshareable_flag);
       RB (t->decl_common.decl_not_flexarray);
     }
+  else
+    goto done;
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
     {
@@ -5604,6 +5633,8 @@ trees_in::core_bools (tree t)
       RB (t->decl_with_vis.final);
       RB (t->decl_with_vis.regdecl_flag);
     }
+  else
+    goto done;
 
   if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
     {
@@ -5627,20 +5658,22 @@ trees_in::core_bools (tree t)
 
       /* decl_type is a (misnamed) 2 bit discriminator.	 */
       unsigned kind = 0;
-      kind |= unsigned (b ()) << 0;
-      kind |= unsigned (b ()) << 1;
+      kind |= unsigned (bits.b ()) << 0;
+      kind |= unsigned (bits.b ()) << 1;
       t->function_decl.decl_type = function_decl_type (kind);
     }
 #undef RB
+done:
   return !get_overrun ();
 }
 
 void
-trees_out::lang_decl_bools (tree t)
+trees_out::lang_decl_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
   const struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   WB (lang->u.base.language == lang_cplusplus);
   WB ((lang->u.base.use_template >> 0) & 1);
   WB ((lang->u.base.use_template >> 1) & 1);
@@ -5664,6 +5697,8 @@ trees_out::lang_decl_bools (tree t)
   WB (lang->u.base.module_attach_p);
   if (VAR_OR_FUNCTION_DECL_P (t))
     WB (lang->u.base.module_keyed_decls_p);
+  else
+    WB (false);
   switch (lang->u.base.selector)
     {
     default:
@@ -5714,15 +5749,16 @@ trees_out::lang_decl_bools (tree t)
 }
 
 bool
-trees_in::lang_decl_bools (tree t)
+trees_in::lang_decl_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
   struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
 
-  lang->u.base.language = b () ? lang_cplusplus : lang_c;
+  bits.bflush ();
+  lang->u.base.language = bits.b () ? lang_cplusplus : lang_c;
   unsigned v;
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->u.base.use_template = v;
   /* lang->u.base.not_really_extern is not streamed.  */
   RB (lang->u.base.initialized_in_class);
@@ -5738,6 +5774,8 @@ trees_in::lang_decl_bools (tree t)
   RB (lang->u.base.module_attach_p);
   if (VAR_OR_FUNCTION_DECL_P (t))
     RB (lang->u.base.module_keyed_decls_p);
+  else
+    bits.b ();
   switch (lang->u.base.selector)
     {
     default:
@@ -5786,11 +5824,12 @@ trees_in::lang_decl_bools (tree t)
 }
 
 void
-trees_out::lang_type_bools (tree t)
+trees_out::lang_type_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
   const struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   WB (lang->has_type_conversion);
   WB (lang->has_copy_ctor);
   WB (lang->has_default_ctor);
@@ -5852,11 +5891,12 @@ trees_out::lang_type_bools (tree t)
 }
 
 bool
-trees_in::lang_type_bools (tree t)
+trees_in::lang_type_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
   struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   RB (lang->has_type_conversion);
   RB (lang->has_copy_ctor);
   RB (lang->has_default_ctor);
@@ -5864,8 +5904,8 @@ trees_in::lang_type_bools (tree t)
   RB (lang->ref_needs_init);
   RB (lang->has_const_copy_assign);
   unsigned v;
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->use_template = v;
 
   RB (lang->has_mutable);
@@ -5877,8 +5917,8 @@ trees_in::lang_type_bools (tree t)
   RB (lang->has_new);
   RB (lang->has_array_new);
 
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->gets_delete = v;
   RB (lang->interface_only);
   RB (lang->interface_unknown);
@@ -7106,18 +7146,19 @@ trees_out::tree_node_bools (tree t)
   gcc_checking_assert (TREE_CODE (t) != NAMESPACE_DECL
 		       || DECL_NAMESPACE_ALIAS (t));
 
-  core_bools (t);
+  bits_out bits (*this);
+  core_bools (t, bits);
 
   switch (TREE_CODE_CLASS (TREE_CODE (t)))
     {
     case tcc_declaration:
       {
 	bool specific = DECL_LANG_SPECIFIC (t) != NULL;
-	b (specific);
+	bits.b (specific);
 	if (specific && VAR_P (t))
-	  b (DECL_DECOMPOSITION_P (t));
+	  bits.b (DECL_DECOMPOSITION_P (t));
 	if (specific)
-	  lang_decl_bools (t);
+	  lang_decl_bools (t, bits);
       }
       break;
 
@@ -7128,9 +7169,9 @@ trees_out::tree_node_bools (tree t)
 	gcc_assert (TYPE_LANG_SPECIFIC (t)
 		    == TYPE_LANG_SPECIFIC (TYPE_MAIN_VARIANT (t)));
 
-	b (specific);
+	bits.b (specific);
 	if (specific)
-	  lang_type_bools (t);
+	  lang_type_bools (t, bits);
       }
       break;
 
@@ -7138,34 +7179,35 @@ trees_out::tree_node_bools (tree t)
       break;
     }
 
-  bflush ();
+  bits.bflush ();
 }
 
 bool
 trees_in::tree_node_bools (tree t)
 {
-  bool ok = core_bools (t);
+  bits_in bits (*this);
+  bool ok = core_bools (t, bits);
 
   if (ok)
     switch (TREE_CODE_CLASS (TREE_CODE (t)))
       {
       case tcc_declaration:
-	if (b ())
+	if (bits.b ())
 	  {
-	    bool decomp = VAR_P (t) && b ();
+	    bool decomp = VAR_P (t) && bits.b ();
 
 	    ok = maybe_add_lang_decl_raw (t, decomp);
 	    if (ok)
-	      ok = lang_decl_bools (t);
-	}
+	      ok = lang_decl_bools (t, bits);
+	  }
 	break;
 
       case tcc_type:
-	if (b ())
+	if (bits.b ())
 	  {
 	    ok = maybe_add_lang_type_raw (t);
 	    if (ok)
-	      ok = lang_type_bools (t);
+	      ok = lang_type_bools (t, bits);
 	  }
 	break;
 
@@ -7173,7 +7215,7 @@ trees_in::tree_node_bools (tree t)
 	break;
       }
 
-  bflush ();
+  bits.bflush ();
   if (!ok || get_overrun ())
     return false;
 
@@ -7696,11 +7738,11 @@ trees_out::decl_value (tree decl, depset *dep)
 
       if (mk != MK_unique)
 	{
+	  bits_out bits (*this);
 	  if (!(mk & MK_template_mask) && !state->is_header ())
 	    {
 	      /* Tell the importer whether this is a global module entity,
-		 or a module entity.  This bool merges into the next block
-		 of bools.  Sneaky.  */
+		 or a module entity.  */
 	      tree o = get_originating_module_decl (decl);
 	      bool is_attached = false;
 
@@ -7709,9 +7751,9 @@ trees_out::decl_value (tree decl, depset *dep)
 		  && DECL_MODULE_ATTACH_P (not_tmpl))
 		is_attached = true;
 
-	      b (is_attached);
+	      bits.b (is_attached);
 	    }
-	  b (dep && dep->has_defn ());
+	  bits.b (dep && dep->has_defn ());
 	}
       tree_node_bools (decl);
     }
@@ -7981,11 +8023,11 @@ trees_in::decl_value ()
     {
       if (mk != MK_unique)
 	{
+	  bits_in bits (*this);
 	  if (!(mk & MK_template_mask) && !state->is_header ())
-	    /* See note in trees_out about where this bool is sequenced.  */
-	    is_attached = b ();
+	    is_attached = bits.b ();
 
-	  has_defn = b ();
+	  has_defn = bits.b ();
 	}
 
       if (!tree_node_bools (decl))
@@ -16682,10 +16724,11 @@ module_state::write_define (bytes_out &sec, const cpp_macro *macro)
 {
   sec.u (macro->count);
 
-  sec.b (macro->fun_like);
-  sec.b (macro->variadic);
-  sec.b (macro->syshdr);
-  sec.bflush ();
+  bits_out bits (sec);
+  bits.b (macro->fun_like);
+  bits.b (macro->variadic);
+  bits.b (macro->syshdr);
+  bits.bflush ();
 
   write_location (sec, macro->line);
   if (macro->fun_like)
@@ -16780,10 +16823,11 @@ module_state::read_define (bytes_in &sec, cpp_reader *reader) const
   macro->kind = cmk_macro;
   macro->imported_p = true;
 
-  macro->fun_like = sec.b ();
-  macro->variadic = sec.b ();
-  macro->syshdr = sec.b ();
-  sec.bflush ();
+  bits_in bits (sec);
+  macro->fun_like = bits.b ();
+  macro->variadic = bits.b ();
+  macro->syshdr = bits.b ();
+  bits.bflush ();
 
   macro->line = read_location (sec);
Patrick Palka Feb. 27, 2024, 5:57 p.m. UTC | #3
On Fri, 16 Feb 2024, Patrick Palka wrote:

> On Thu, 15 Feb 2024, Patrick Palka wrote:
> 
> > Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look
> > OK for trunk?
> > 
> > -- >8 --
> > 
> > One would expect consecutive calls to bytes_in/out::b for streaming
> > adjacent bits, as we do for tree flag streaming, to at least be
> > optimized by the compiler into individual bit operations using
> > statically known bit positions (and ideally merged into larger sized
> > reads/writes).
> > 
> > Unfortunately this doesn't happen because the compiler has trouble
> > tracking the values of this->bit_pos and this->bit_val across such
> > calls, likely because the compiler doesn't know 'this' and so it's
> > treated as global memory.  This means for each consecutive bit stream
> > operation, bit_pos and bit_val are loaded from memory, checked if
> > buffering is needed, and finally the bit is extracted from bit_val
> > according to the (unknown) bit_pos, even though relative to the previous
> > operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
> > is just 1 larger.  This ends up being quite slow, with tree_node_bools
> > taking 10% of time when streaming in parts of the std module.
> > 
> > This patch optimizes this by making tracking of bit_pos and bit_val
> > easier for the compiler.  Rather than bit_pos and bit_val being members
> > of the (effectively global) bytes_in/out objects, this patch factors out
> > the bit streaming code/state into separate classes bits_in/out that get
> > constructed locally as needed for bit streaming.  Since these objects
> > are now clearly local, the compiler can more easily track their values.
> > 
> > And since bit streaming is intended to be batched it's natural for these
> > new classes to be RAII-enabled such that the bit stream is flushed upon
> > destruction.
> > 
> > In order to make the most of this improved tracking of bit position,
> > this patch changes parts where we conditionally stream a tree flag
> > to unconditionally stream (the flag or a dummy value).  That way
> > the number of bits streamed and the respective bit positions are as
> > statically known as reasonably possible.  In lang_decl_bools and
> > lang_type_bools we flush the current bit buffer at the start so that
> > subsequent bit positions are statically known.  And in core bools, we
> > can add explicit early exits utilizing invariants that the compiler
> > can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
> > and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
> > it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
> > than 4 bits, it's more space efficient to stream them as individual
> > bytes rather than as packed bits (due to the 32-bit buffer).
> 
> Oops, this last sentence is wrong.  Although the size of the bit buffer
> is 32 bits, upon flushing we rewind unused bytes within the buffer,
> which means streaming 2-8 bits ends up using only one byte not all four.
> So v2 below undoes this pessimization.
> 
> > This patch also moves the definitions of the relevant streaming classes
> > into anonymous namespaces so that the compiler can make more informed
> > decisions about inlining their member functions.
> > 
> > After this patch, compile time for a simple Hello World using the std
> > module is reduced by 7% with a release compiler.  The on-disk size of
> > the std module increases by 0.7% (presumably due to the extra flushing
> > done in lang_decl_bools and lang_type_bools).
> 
> The on-disk std module now only grows 0.4% instead of 0.7%.
> 
> > 
> > The bit stream out performance isn't improved as much as the stream in
> > due to the spans/lengths instrumentation performed on stream out (which
> > probably should be e.g. removed for release builds?)
> 
> -- >8 --
> 
> gcc/cp/ChangeLog:
> 
> 	* module.cc: Update comment about classes defined.
> 	(class data): Enclose in an anonymous namespace.
> 	(data::calc_crc): Moved from bytes::calc_crc.
> 	(class bytes): Remove.  Move bit_flush to namespace scope.
> 	(class bytes_in): Enclose in an anonymous namespace.  Inherit
> 	directly from data and adjust accordingly.  Move b and bflush
> 	members to bits_in.
> 	(class bytes_out): As above.  Remove is_set static data member.
> 	(bit_flush): Moved from class bytes.
> 	(struct bits_in): Define.
> 	(struct bits_out): Define.
> 	(bytes_out::bflush): Moved to bits_out/in.
> 	(bytes_in::bflush): Likewise
> 	(bytes_in::bfill): Removed.
> 	(bytes_out::b): Moved to bits_out/in.
> 	(bytes_in::b): Likewise.
> 	(class trees_in): Enclose in an anonymous namespace.
> 	(class trees_out): Enclose in an anonymous namespace.
> 	(trees_out::core_bools): Add bits_out/in parameter and use it.
> 	Unconditionally stream a bit for public_flag.  Add early exits
> 	as appropriate.
> 	(trees_out::core_bools): Likewise.
> 	(trees_out::lang_decl_bools): Add bits_out/in parameter and use
> 	it.  Flush the current bit buffer at the start.  Unconditionally
> 	stream a bit for module_keyed_decls_p.
> 	(trees_in::lang_decl_bools): Likewise.
> 	(trees_out::lang_type_bools): Add bits_out/in parameter and use
> 	it.  Flush the current bit buffer at the start.
> 	(trees_in::lang_type_bools): Likewise.
> 	(trees_out::tree_node_bools): Construct a bits_out object and
> 	use/pass it.
> 	(trees_in::tree_node_bools): Likewise.
> 	(trees_out::decl_value): Likewise.
> 	(trees_in::decl_value): Likewise.
> 	(module_state::write_define): Likewise.
> 	(module_state::read_define): Likewise.

Ping.

> ---
>  gcc/cp/module.cc | 418 ++++++++++++++++++++++++++---------------------
>  1 file changed, 231 insertions(+), 187 deletions(-)
> 
> diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
> index 0291d456ff5..2ecee007e8a 100644
> --- a/gcc/cp/module.cc
> +++ b/gcc/cp/module.cc
> @@ -153,9 +153,11 @@ Classes used:
>  
>     data - buffer
>  
> -   bytes - data streamer
> -   bytes_in : bytes - scalar reader
> -   bytes_out : bytes - scalar writer
> +   bytes_in - scalar reader
> +   bytes_out - scalar writer
> +
> +   bits_in - bit stream reader
> +   bits_out - bit stream writer
>  
>     elf - ELROND format
>     elf_in : elf - ELROND reader
> @@ -346,6 +348,7 @@ typedef hash_map<void *,signed,ptr_int_traits> ptr_int_hash_map;
>  
>  /* Variable length buffer.  */
>  
> +namespace {
>  class data {
>  public:
>    class allocator {
> @@ -393,6 +396,8 @@ protected:
>      return res;
>    }
>  
> +  unsigned calc_crc (unsigned) const;
> +
>  public:
>    void unuse (unsigned count)
>    {
> @@ -402,6 +407,7 @@ public:
>  public:
>    static allocator simple_memory;
>  };
> +} // anon namespace
>  
>  /* The simple data allocator.  */
>  data::allocator data::simple_memory;
> @@ -447,46 +453,11 @@ data::allocator::shrink (char *ptr)
>    XDELETEVEC (ptr);
>  }
>  
> -/* Byte streamer base.   Buffer with read/write position and smarts
> -   for single bits.  */
> -
> -class bytes : public data {
> -public:
> -  typedef data parent;
> -
> -protected:
> -  uint32_t bit_val;	/* Bit buffer.  */
> -  unsigned bit_pos;	/* Next bit in bit buffer.  */
> -
> -public:
> -  bytes ()
> -    :parent (), bit_val (0), bit_pos (0)
> -  {}
> -  ~bytes () 
> -  {
> -  }
> -
> -protected:
> -  unsigned calc_crc (unsigned) const;
> -
> -protected:
> -  /* Finish bit packet.  Rewind the bytes not used.  */
> -  unsigned bit_flush ()
> -  {
> -    gcc_assert (bit_pos);
> -    unsigned bytes = (bit_pos + 7) / 8;
> -    unuse (4 - bytes);
> -    bit_pos = 0;
> -    bit_val = 0;
> -    return bytes;
> -  }
> -};
> -
>  /* Calculate the crc32 of the buffer.  Note the CRC is stored in the
>     first 4 bytes, so don't include them.  */
>  
>  unsigned
> -bytes::calc_crc (unsigned l) const
> +data::calc_crc (unsigned l) const
>  {
>    return crc32 (0, (unsigned char *)buffer + 4, l - 4);
>  }
> @@ -495,8 +466,9 @@ class elf_in;
>  
>  /* Byte stream reader.  */
>  
> -class bytes_in : public bytes {
> -  typedef bytes parent;
> +namespace {
> +class bytes_in : public data {
> +  typedef data parent;
>  
>  protected:
>    bool overrun;  /* Sticky read-too-much flag.  */
> @@ -531,7 +503,6 @@ public:
>      if (offset > size)
>        set_overrun ();
>      pos = offset;
> -    bit_pos = bit_val = 0;
>    }
>  
>  public:
> @@ -573,14 +544,7 @@ public:
>    unsigned u32 ();  	/* Read uncompressed integer.  */
>  
>  public:
> -  bool b ();	    	/* Read a bool.  */
> -  void bflush ();	/* Completed a block of bools.  */
> -
> -private:
> -  void bfill ();	/* Get the next block of bools.  */
> -
> -public:
> -  int c ();		/* Read a char.  */
> +  int c () ATTRIBUTE_UNUSED;		/* Read a char.  */
>    int i ();		/* Read a signed int.  */
>    unsigned u ();	/* Read an unsigned int.  */
>    size_t z ();		/* Read a size_t.  */
> @@ -590,6 +554,7 @@ public:
>    const void *buf (size_t); /* Read a fixed-length buffer.  */
>    cpp_hashnode *cpp_node (); /* Read a cpp node.  */
>  };
> +} // anon namespace
>  
>  /* Verify the buffer's CRC is correct.  */
>  
> @@ -610,8 +575,9 @@ class elf_out;
>  
>  /* Byte stream writer.  */
>  
> -class bytes_out : public bytes {
> -  typedef bytes parent;
> +namespace {
> +class bytes_out : public data {
> +  typedef data parent;
>  
>  public:
>    allocator *memory;	/* Obtainer of memory.  */
> @@ -659,11 +625,7 @@ public:
>    void u32 (unsigned);  /* Write uncompressed integer.  */
>  
>  public:
> -  void b (bool);	/* Write bool.  */
> -  void bflush ();	/* Finish block of bools.  */
> -
> -public:
> -  void c (unsigned char); /* Write unsigned char.  */
> +  void c (unsigned char) ATTRIBUTE_UNUSED; /* Write unsigned char.  */
>    void i (int);		/* Write signed int.  */
>    void u (unsigned);	/* Write unsigned int.  */
>    void z (size_t s);	/* Write size_t.  */
> @@ -694,13 +656,126 @@ protected:
>    /* Instrumentation.  */
>    static unsigned spans[4];
>    static unsigned lengths[4];
> -  static int is_set;
> +  friend struct bits_out;
>  };
> +} // anon namespace
> +
> +/* Finish bit packet.  Rewind the bytes not used.  */
> +static unsigned
> +bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
> +{
> +  gcc_assert (bit_pos);
> +  unsigned bytes = (bit_pos + 7) / 8;
> +  bits.unuse (4 - bytes);
> +  bit_pos = 0;
> +  bit_val = 0;
> +  return bytes;
> +}
> +
> +/* Bit stream reader (RAII-enabled).  Bools are packed into bytes.  You
> +   cannot mix bools and non-bools.  Use bflush to flush the current stream
> +   of bools on demand.  Upon destruction bflush is called.
> +
> +   When reading, we don't know how many bools we'll read in.  So read
> +   4 bytes-worth, and then rewind when flushing if we didn't need them
> +   all.  You can't have a block of bools closer than 4 bytes to the
> +   end of the buffer.  */
> +
> +namespace {
> +struct bits_in {
> +  bytes_in& in;
> +  uint32_t bit_val = 0;
> +  unsigned bit_pos = 0;
> +
> +  bits_in (bytes_in& in)
> +    : in (in)
> +  { }
> +
> +  ~bits_in ()
> +  {
> +    bflush ();
> +  }
> +
> +  bits_in(const bits_in&) = delete;
> +  bits_in& operator=(const bits_in&) = delete;
> +
> +  /* Completed a block of bools.  */
> +  void bflush ()
> +  {
> +    if (bit_pos)
> +      bit_flush (in, bit_val, bit_pos);
> +  }
> +
> +  /* Read one bit.  */
> +  bool b ()
> +  {
> +    if (!bit_pos)
> +      bit_val = in.u32 ();
> +    bool x = (bit_val >> bit_pos) & 1;
> +    bit_pos = (bit_pos + 1) % 32;
> +    return x;
> +  }
> +};
> +} // anon namespace
> +
> +/* Bit stream writer (RAII-enabled), counterpart to bits_in.  */
> +
> +namespace {
> +struct bits_out {
> +  bytes_out& out;
> +  uint32_t bit_val = 0;
> +  unsigned bit_pos = 0;
> +  char is_set = -1;
> +
> +  bits_out (bytes_out& out)
> +    : out (out)
> +  { }
> +
> +  ~bits_out ()
> +  {
> +    bflush ();
> +  }
> +
> +  bits_out(const bits_out&) = delete;
> +  bits_out& operator=(const bits_out&) = delete;
> +
> +  /* Completed a block of bools.  */
> +  void bflush ()
> +  {
> +    if (bit_pos)
> +      {
> +	out.u32 (bit_val);
> +	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
> +      }
> +    out.spans[2]++;
> +    is_set = -1;
> +  }
> +
> +  /* Write one bit.
> +
> +     It may be worth optimizing for most bools being zero.  Some kind of
> +     run-length encoding?  */
> +  void b (bool x)
> +  {
> +    if (is_set != x)
> +      {
> +	is_set = x;
> +	out.spans[x]++;
> +      }
> +    out.lengths[x]++;
> +    bit_val |= unsigned (x) << bit_pos++;
> +    if (bit_pos == 32)
> +      {
> +	out.u32 (bit_val);
> +	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
> +      }
> +  }
> +};
> +} // anon namespace
>  
>  /* Instrumentation.  */
>  unsigned bytes_out::spans[4];
>  unsigned bytes_out::lengths[4];
> -int bytes_out::is_set = -1;
>  
>  /* If CRC_PTR non-null, set the CRC of the buffer.  Mix the CRC into
>     that pointed to by CRC_PTR.  */
> @@ -723,73 +798,6 @@ bytes_out::set_crc (unsigned *crc_ptr)
>      }
>  }
>  
> -/* Finish a set of bools.  */
> -
> -void
> -bytes_out::bflush ()
> -{
> -  if (bit_pos)
> -    {
> -      u32 (bit_val);
> -      lengths[2] += bit_flush ();
> -    }
> -  spans[2]++;
> -  is_set = -1;
> -}
> -
> -void
> -bytes_in::bflush ()
> -{
> -  if (bit_pos)
> -    bit_flush ();
> -}
> -
> -/* When reading, we don't know how many bools we'll read in.  So read
> -   4 bytes-worth, and then rewind when flushing if we didn't need them
> -   all.  You can't have a block of bools closer than 4 bytes to the
> -   end of the buffer.  */
> -
> -void
> -bytes_in::bfill ()
> -{
> -  bit_val = u32 ();
> -}
> -
> -/* Bools are packed into bytes.  You cannot mix bools and non-bools.
> -   You must call bflush before emitting another type.  So batch your
> -   bools.
> -
> -   It may be worth optimizing for most bools being zero.  Some kind of
> -   run-length encoding?  */
> -
> -void
> -bytes_out::b (bool x)
> -{
> -  if (is_set != x)
> -    {
> -      is_set = x;
> -      spans[x]++;
> -    }
> -  lengths[x]++;
> -  bit_val |= unsigned (x) << bit_pos++;
> -  if (bit_pos == 32)
> -    {
> -      u32 (bit_val);
> -      lengths[2] += bit_flush ();
> -    }
> -}
> -
> -bool
> -bytes_in::b ()
> -{
> -  if (!bit_pos)
> -    bfill ();
> -  bool v = (bit_val >> bit_pos++) & 1;
> -  if (bit_pos == 32)
> -    bit_flush ();
> -  return v;
> -}
> -
>  /* Exactly 4 bytes.  Used internally for bool packing and a few other
>     places.  We can't simply use uint32_t because (a) alignment and
>     (b) we need little-endian for the bool streaming rewinding to make
> @@ -2846,6 +2854,7 @@ struct post_process_data {
>     read trees with TREE_VISITED.  Thus it's quite safe to have
>     multiple concurrent readers.  Which is good, because lazy
>     loading. */
> +namespace {
>  class trees_in : public bytes_in {
>    typedef bytes_in parent;
>  
> @@ -2870,15 +2879,15 @@ private:
>  
>  public:
>    /* Needed for binfo writing  */
> -  bool core_bools (tree);
> +  bool core_bools (tree, bits_in&);
>  
>  private:
>    /* Stream tree_core, lang_decl_specific and lang_type_specific
>       bits.  */
>    bool core_vals (tree);
> -  bool lang_type_bools (tree);
> +  bool lang_type_bools (tree, bits_in&);
>    bool lang_type_vals (tree);
> -  bool lang_decl_bools (tree);
> +  bool lang_decl_bools (tree, bits_in&);
>    bool lang_decl_vals (tree);
>    bool lang_vals (tree);
>    bool tree_node_bools (tree);
> @@ -2965,6 +2974,7 @@ private:
>  private:
>    void assert_definition (tree, bool installing);
>  };
> +} // anon namespace
>  
>  trees_in::trees_in (module_state *state)
>    :parent (), state (state), unused (0)
> @@ -2982,6 +2992,7 @@ trees_in::~trees_in ()
>  }
>  
>  /* Tree stream writer.  */
> +namespace {
>  class trees_out : public bytes_out {
>    typedef bytes_out parent;
>  
> @@ -3043,11 +3054,11 @@ public:
>    }
>  
>  private:
> -  void core_bools (tree);
> +  void core_bools (tree, bits_out&);
>    void core_vals (tree);
> -  void lang_type_bools (tree);
> +  void lang_type_bools (tree, bits_out&);
>    void lang_type_vals (tree);
> -  void lang_decl_bools (tree);
> +  void lang_decl_bools (tree, bits_out&);
>    void lang_decl_vals (tree);
>    void lang_vals (tree);
>    void tree_node_bools (tree);
> @@ -3126,6 +3137,7 @@ private:
>    static unsigned back_ref_count;
>    static unsigned null_count;
>  };
> +} // anon namespace
>  
>  /* Instrumentation counters.  */
>  unsigned trees_out::tree_val_count;
> @@ -5292,9 +5304,9 @@ trees_in::start (unsigned code)
>  /* Read & write the core boolean flags.  */
>  
>  void
> -trees_out::core_bools (tree t)
> +trees_out::core_bools (tree t, bits_out& bits)
>  {
> -#define WB(X) (b (X))
> +#define WB(X) (bits.b (X))
>    tree_code code = TREE_CODE (t);
>  
>    WB (t->base.side_effects_flag);
> @@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
>    if (TREE_CODE_CLASS (code) != tcc_type)
>      /* This is TYPE_CACHED_VALUES_P for types.  */
>      WB (t->base.public_flag);
> +  else
> +    WB (false);
>    WB (t->base.private_flag);
>    WB (t->base.protected_flag);
>    WB (t->base.deprecated_flag);
> @@ -5327,7 +5341,7 @@ trees_out::core_bools (tree t)
>      case TARGET_MEM_REF:
>      case TREE_VEC:
>        /* These use different base.u fields.  */
> -      break;
> +      return;
>  
>      default:
>        WB (t->base.u.bits.lang_flag_0);
> @@ -5359,7 +5373,7 @@ trees_out::core_bools (tree t)
>        break;
>      }
>  
> -  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
> +  if (TREE_CODE_CLASS (code) == tcc_type)
>      {
>        WB (t->type_common.no_force_blk_flag);
>        WB (t->type_common.needs_constructing_flag);
> @@ -5374,7 +5388,10 @@ trees_out::core_bools (tree t)
>        WB (t->type_common.lang_flag_5);
>        WB (t->type_common.lang_flag_6);
>        WB (t->type_common.typeless_storage);
> +      return;
>      }
> +  else if (TREE_CODE_CLASS (code) != tcc_declaration)
> +    return;
>  
>    if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
>      {
> @@ -5439,6 +5456,8 @@ trees_out::core_bools (tree t)
>        WB (t->decl_common.decl_nonshareable_flag);
>        WB (t->decl_common.decl_not_flexarray);
>      }
> +  else
> +    return;
>  
>    if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
>      {
> @@ -5459,6 +5478,8 @@ trees_out::core_bools (tree t)
>        WB (t->decl_with_vis.final);
>        WB (t->decl_with_vis.regdecl_flag);
>      }
> +  else
> +    return;
>  
>    if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
>      {
> @@ -5489,9 +5510,10 @@ trees_out::core_bools (tree t)
>  }
>  
>  bool
> -trees_in::core_bools (tree t)
> +trees_in::core_bools (tree t, bits_in& bits)
>  {
> -#define RB(X) ((X) = b ())
> +#define RB(X) ((X) = bits.b ())
> +
>    tree_code code = TREE_CODE (t);
>  
>    RB (t->base.side_effects_flag);
> @@ -5507,6 +5529,8 @@ trees_in::core_bools (tree t)
>    RB (t->base.static_flag);
>    if (TREE_CODE_CLASS (code) != tcc_type)
>      RB (t->base.public_flag);
> +  else
> +    bits.b ();
>    RB (t->base.private_flag);
>    RB (t->base.protected_flag);
>    RB (t->base.deprecated_flag);
> @@ -5520,7 +5544,7 @@ trees_in::core_bools (tree t)
>      case TARGET_MEM_REF:
>      case TREE_VEC:
>        /* These use different base.u fields.  */
> -      break;
> +      goto done;
>  
>      default:
>        RB (t->base.u.bits.lang_flag_0);
> @@ -5539,7 +5563,7 @@ trees_in::core_bools (tree t)
>        break;
>      }
>  
> -  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
> +  if (TREE_CODE_CLASS (code) == tcc_type)
>      {
>        RB (t->type_common.no_force_blk_flag);
>        RB (t->type_common.needs_constructing_flag);
> @@ -5554,7 +5578,10 @@ trees_in::core_bools (tree t)
>        RB (t->type_common.lang_flag_5);
>        RB (t->type_common.lang_flag_6);
>        RB (t->type_common.typeless_storage);
> +      goto done;
>      }
> +  else if (TREE_CODE_CLASS (code) != tcc_declaration)
> +    goto done;
>  
>    if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
>      {
> @@ -5584,6 +5611,8 @@ trees_in::core_bools (tree t)
>        RB (t->decl_common.decl_nonshareable_flag);
>        RB (t->decl_common.decl_not_flexarray);
>      }
> +  else
> +    goto done;
>  
>    if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
>      {
> @@ -5604,6 +5633,8 @@ trees_in::core_bools (tree t)
>        RB (t->decl_with_vis.final);
>        RB (t->decl_with_vis.regdecl_flag);
>      }
> +  else
> +    goto done;
>  
>    if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
>      {
> @@ -5627,20 +5658,22 @@ trees_in::core_bools (tree t)
>  
>        /* decl_type is a (misnamed) 2 bit discriminator.	 */
>        unsigned kind = 0;
> -      kind |= unsigned (b ()) << 0;
> -      kind |= unsigned (b ()) << 1;
> +      kind |= unsigned (bits.b ()) << 0;
> +      kind |= unsigned (bits.b ()) << 1;
>        t->function_decl.decl_type = function_decl_type (kind);
>      }
>  #undef RB
> +done:
>    return !get_overrun ();
>  }
>  
>  void
> -trees_out::lang_decl_bools (tree t)
> +trees_out::lang_decl_bools (tree t, bits_out& bits)
>  {
> -#define WB(X) (b (X))
> +#define WB(X) (bits.b (X))
>    const struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
>  
> +  bits.bflush ();
>    WB (lang->u.base.language == lang_cplusplus);
>    WB ((lang->u.base.use_template >> 0) & 1);
>    WB ((lang->u.base.use_template >> 1) & 1);
> @@ -5664,6 +5697,8 @@ trees_out::lang_decl_bools (tree t)
>    WB (lang->u.base.module_attach_p);
>    if (VAR_OR_FUNCTION_DECL_P (t))
>      WB (lang->u.base.module_keyed_decls_p);
> +  else
> +    WB (false);
>    switch (lang->u.base.selector)
>      {
>      default:
> @@ -5714,15 +5749,16 @@ trees_out::lang_decl_bools (tree t)
>  }
>  
>  bool
> -trees_in::lang_decl_bools (tree t)
> +trees_in::lang_decl_bools (tree t, bits_in& bits)
>  {
> -#define RB(X) ((X) = b ())
> +#define RB(X) ((X) = bits.b ())
>    struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
>  
> -  lang->u.base.language = b () ? lang_cplusplus : lang_c;
> +  bits.bflush ();
> +  lang->u.base.language = bits.b () ? lang_cplusplus : lang_c;
>    unsigned v;
> -  v = b () << 0;
> -  v |= b () << 1;
> +  v = bits.b () << 0;
> +  v |= bits.b () << 1;
>    lang->u.base.use_template = v;
>    /* lang->u.base.not_really_extern is not streamed.  */
>    RB (lang->u.base.initialized_in_class);
> @@ -5738,6 +5774,8 @@ trees_in::lang_decl_bools (tree t)
>    RB (lang->u.base.module_attach_p);
>    if (VAR_OR_FUNCTION_DECL_P (t))
>      RB (lang->u.base.module_keyed_decls_p);
> +  else
> +    bits.b ();
>    switch (lang->u.base.selector)
>      {
>      default:
> @@ -5786,11 +5824,12 @@ trees_in::lang_decl_bools (tree t)
>  }
>  
>  void
> -trees_out::lang_type_bools (tree t)
> +trees_out::lang_type_bools (tree t, bits_out& bits)
>  {
> -#define WB(X) (b (X))
> +#define WB(X) (bits.b (X))
>    const struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
>  
> +  bits.bflush ();
>    WB (lang->has_type_conversion);
>    WB (lang->has_copy_ctor);
>    WB (lang->has_default_ctor);
> @@ -5852,11 +5891,12 @@ trees_out::lang_type_bools (tree t)
>  }
>  
>  bool
> -trees_in::lang_type_bools (tree t)
> +trees_in::lang_type_bools (tree t, bits_in& bits)
>  {
> -#define RB(X) ((X) = b ())
> +#define RB(X) ((X) = bits.b ())
>    struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
>  
> +  bits.bflush ();
>    RB (lang->has_type_conversion);
>    RB (lang->has_copy_ctor);
>    RB (lang->has_default_ctor);
> @@ -5864,8 +5904,8 @@ trees_in::lang_type_bools (tree t)
>    RB (lang->ref_needs_init);
>    RB (lang->has_const_copy_assign);
>    unsigned v;
> -  v = b () << 0;
> -  v |= b () << 1;
> +  v = bits.b () << 0;
> +  v |= bits.b () << 1;
>    lang->use_template = v;
>  
>    RB (lang->has_mutable);
> @@ -5877,8 +5917,8 @@ trees_in::lang_type_bools (tree t)
>    RB (lang->has_new);
>    RB (lang->has_array_new);
>  
> -  v = b () << 0;
> -  v |= b () << 1;
> +  v = bits.b () << 0;
> +  v |= bits.b () << 1;
>    lang->gets_delete = v;
>    RB (lang->interface_only);
>    RB (lang->interface_unknown);
> @@ -7106,18 +7146,19 @@ trees_out::tree_node_bools (tree t)
>    gcc_checking_assert (TREE_CODE (t) != NAMESPACE_DECL
>  		       || DECL_NAMESPACE_ALIAS (t));
>  
> -  core_bools (t);
> +  bits_out bits (*this);
> +  core_bools (t, bits);
>  
>    switch (TREE_CODE_CLASS (TREE_CODE (t)))
>      {
>      case tcc_declaration:
>        {
>  	bool specific = DECL_LANG_SPECIFIC (t) != NULL;
> -	b (specific);
> +	bits.b (specific);
>  	if (specific && VAR_P (t))
> -	  b (DECL_DECOMPOSITION_P (t));
> +	  bits.b (DECL_DECOMPOSITION_P (t));
>  	if (specific)
> -	  lang_decl_bools (t);
> +	  lang_decl_bools (t, bits);
>        }
>        break;
>  
> @@ -7128,9 +7169,9 @@ trees_out::tree_node_bools (tree t)
>  	gcc_assert (TYPE_LANG_SPECIFIC (t)
>  		    == TYPE_LANG_SPECIFIC (TYPE_MAIN_VARIANT (t)));
>  
> -	b (specific);
> +	bits.b (specific);
>  	if (specific)
> -	  lang_type_bools (t);
> +	  lang_type_bools (t, bits);
>        }
>        break;
>  
> @@ -7138,34 +7179,35 @@ trees_out::tree_node_bools (tree t)
>        break;
>      }
>  
> -  bflush ();
> +  bits.bflush ();
>  }
>  
>  bool
>  trees_in::tree_node_bools (tree t)
>  {
> -  bool ok = core_bools (t);
> +  bits_in bits (*this);
> +  bool ok = core_bools (t, bits);
>  
>    if (ok)
>      switch (TREE_CODE_CLASS (TREE_CODE (t)))
>        {
>        case tcc_declaration:
> -	if (b ())
> +	if (bits.b ())
>  	  {
> -	    bool decomp = VAR_P (t) && b ();
> +	    bool decomp = VAR_P (t) && bits.b ();
>  
>  	    ok = maybe_add_lang_decl_raw (t, decomp);
>  	    if (ok)
> -	      ok = lang_decl_bools (t);
> -	}
> +	      ok = lang_decl_bools (t, bits);
> +	  }
>  	break;
>  
>        case tcc_type:
> -	if (b ())
> +	if (bits.b ())
>  	  {
>  	    ok = maybe_add_lang_type_raw (t);
>  	    if (ok)
> -	      ok = lang_type_bools (t);
> +	      ok = lang_type_bools (t, bits);
>  	  }
>  	break;
>  
> @@ -7173,7 +7215,7 @@ trees_in::tree_node_bools (tree t)
>  	break;
>        }
>  
> -  bflush ();
> +  bits.bflush ();
>    if (!ok || get_overrun ())
>      return false;
>  
> @@ -7696,11 +7738,11 @@ trees_out::decl_value (tree decl, depset *dep)
>  
>        if (mk != MK_unique)
>  	{
> +	  bits_out bits (*this);
>  	  if (!(mk & MK_template_mask) && !state->is_header ())
>  	    {
>  	      /* Tell the importer whether this is a global module entity,
> -		 or a module entity.  This bool merges into the next block
> -		 of bools.  Sneaky.  */
> +		 or a module entity.  */
>  	      tree o = get_originating_module_decl (decl);
>  	      bool is_attached = false;
>  
> @@ -7709,9 +7751,9 @@ trees_out::decl_value (tree decl, depset *dep)
>  		  && DECL_MODULE_ATTACH_P (not_tmpl))
>  		is_attached = true;
>  
> -	      b (is_attached);
> +	      bits.b (is_attached);
>  	    }
> -	  b (dep && dep->has_defn ());
> +	  bits.b (dep && dep->has_defn ());
>  	}
>        tree_node_bools (decl);
>      }
> @@ -7981,11 +8023,11 @@ trees_in::decl_value ()
>      {
>        if (mk != MK_unique)
>  	{
> +	  bits_in bits (*this);
>  	  if (!(mk & MK_template_mask) && !state->is_header ())
> -	    /* See note in trees_out about where this bool is sequenced.  */
> -	    is_attached = b ();
> +	    is_attached = bits.b ();
>  
> -	  has_defn = b ();
> +	  has_defn = bits.b ();
>  	}
>  
>        if (!tree_node_bools (decl))
> @@ -16682,10 +16724,11 @@ module_state::write_define (bytes_out &sec, const cpp_macro *macro)
>  {
>    sec.u (macro->count);
>  
> -  sec.b (macro->fun_like);
> -  sec.b (macro->variadic);
> -  sec.b (macro->syshdr);
> -  sec.bflush ();
> +  bits_out bits (sec);
> +  bits.b (macro->fun_like);
> +  bits.b (macro->variadic);
> +  bits.b (macro->syshdr);
> +  bits.bflush ();
>  
>    write_location (sec, macro->line);
>    if (macro->fun_like)
> @@ -16780,10 +16823,11 @@ module_state::read_define (bytes_in &sec, cpp_reader *reader) const
>    macro->kind = cmk_macro;
>    macro->imported_p = true;
>  
> -  macro->fun_like = sec.b ();
> -  macro->variadic = sec.b ();
> -  macro->syshdr = sec.b ();
> -  sec.bflush ();
> +  bits_in bits (sec);
> +  macro->fun_like = bits.b ();
> +  macro->variadic = bits.b ();
> +  macro->syshdr = bits.b ();
> +  bits.bflush ();
>  
>    macro->line = read_location (sec);
>  
> -- 
> 2.44.0.rc1.15.g4fc51f00ef
> 
>
Patrick Palka March 26, 2024, 2:25 p.m. UTC | #4
On Tue, 27 Feb 2024, Patrick Palka wrote:

> On Fri, 16 Feb 2024, Patrick Palka wrote:
> 
> > On Thu, 15 Feb 2024, Patrick Palka wrote:
> > 
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look
> > > OK for trunk?
> > > 
> > > -- >8 --
> > > 
> > > One would expect consecutive calls to bytes_in/out::b for streaming
> > > adjacent bits, as we do for tree flag streaming, to at least be
> > > optimized by the compiler into individual bit operations using
> > > statically known bit positions (and ideally merged into larger sized
> > > reads/writes).
> > > 
> > > Unfortunately this doesn't happen because the compiler has trouble
> > > tracking the values of this->bit_pos and this->bit_val across such
> > > calls, likely because the compiler doesn't know 'this' and so it's
> > > treated as global memory.  This means for each consecutive bit stream
> > > operation, bit_pos and bit_val are loaded from memory, checked if
> > > buffering is needed, and finally the bit is extracted from bit_val
> > > according to the (unknown) bit_pos, even though relative to the previous
> > > operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
> > > is just 1 larger.  This ends up being quite slow, with tree_node_bools
> > > taking 10% of time when streaming in parts of the std module.
> > > 
> > > This patch optimizes this by making tracking of bit_pos and bit_val
> > > easier for the compiler.  Rather than bit_pos and bit_val being members
> > > of the (effectively global) bytes_in/out objects, this patch factors out
> > > the bit streaming code/state into separate classes bits_in/out that get
> > > constructed locally as needed for bit streaming.  Since these objects
> > > are now clearly local, the compiler can more easily track their values.
> > > 
> > > And since bit streaming is intended to be batched it's natural for these
> > > new classes to be RAII-enabled such that the bit stream is flushed upon
> > > destruction.
> > > 
> > > In order to make the most of this improved tracking of bit position,
> > > this patch changes parts where we conditionally stream a tree flag
> > > to unconditionally stream (the flag or a dummy value).  That way
> > > the number of bits streamed and the respective bit positions are as
> > > statically known as reasonably possible.  In lang_decl_bools and
> > > lang_type_bools we flush the current bit buffer at the start so that
> > > subsequent bit positions are statically known.  And in core bools, we
> > > can add explicit early exits utilizing invariants that the compiler
> > > can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
> > > and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
> > > it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
> > > than 4 bits, it's more space efficient to stream them as individual
> > > bytes rather than as packed bits (due to the 32-bit buffer).
> > 
> > Oops, this last sentence is wrong.  Although the size of the bit buffer
> > is 32 bits, upon flushing we rewind unused bytes within the buffer,
> > which means streaming 2-8 bits ends up using only one byte not all four.
> > So v2 below undoes this pessimization.
> > 
> > > This patch also moves the definitions of the relevant streaming classes
> > > into anonymous namespaces so that the compiler can make more informed
> > > decisions about inlining their member functions.
> > > 
> > > After this patch, compile time for a simple Hello World using the std
> > > module is reduced by 7% with a release compiler.  The on-disk size of
> > > the std module increases by 0.7% (presumably due to the extra flushing
> > > done in lang_decl_bools and lang_type_bools).
> > 
> > The on-disk std module now only grows 0.4% instead of 0.7%.
> > 
> > > 
> > > The bit stream out performance isn't improved as much as the stream in
> > > due to the spans/lengths instrumentation performed on stream out (which
> > > probably should be e.g. removed for release builds?)
> > 
> > -- >8 --
> > 
> > gcc/cp/ChangeLog:
> > 
> > 	* module.cc: Update comment about classes defined.
> > 	(class data): Enclose in an anonymous namespace.
> > 	(data::calc_crc): Moved from bytes::calc_crc.
> > 	(class bytes): Remove.  Move bit_flush to namespace scope.
> > 	(class bytes_in): Enclose in an anonymous namespace.  Inherit
> > 	directly from data and adjust accordingly.  Move b and bflush
> > 	members to bits_in.
> > 	(class bytes_out): As above.  Remove is_set static data member.
> > 	(bit_flush): Moved from class bytes.
> > 	(struct bits_in): Define.
> > 	(struct bits_out): Define.
> > 	(bytes_out::bflush): Moved to bits_out/in.
> > 	(bytes_in::bflush): Likewise
> > 	(bytes_in::bfill): Removed.
> > 	(bytes_out::b): Moved to bits_out/in.
> > 	(bytes_in::b): Likewise.
> > 	(class trees_in): Enclose in an anonymous namespace.
> > 	(class trees_out): Enclose in an anonymous namespace.
> > 	(trees_out::core_bools): Add bits_out/in parameter and use it.
> > 	Unconditionally stream a bit for public_flag.  Add early exits
> > 	as appropriate.
> > 	(trees_out::core_bools): Likewise.
> > 	(trees_out::lang_decl_bools): Add bits_out/in parameter and use
> > 	it.  Flush the current bit buffer at the start.  Unconditionally
> > 	stream a bit for module_keyed_decls_p.
> > 	(trees_in::lang_decl_bools): Likewise.
> > 	(trees_out::lang_type_bools): Add bits_out/in parameter and use
> > 	it.  Flush the current bit buffer at the start.
> > 	(trees_in::lang_type_bools): Likewise.
> > 	(trees_out::tree_node_bools): Construct a bits_out object and
> > 	use/pass it.
> > 	(trees_in::tree_node_bools): Likewise.
> > 	(trees_out::decl_value): Likewise.
> > 	(trees_in::decl_value): Likewise.
> > 	(module_state::write_define): Likewise.
> > 	(module_state::read_define): Likewise.
> 
> Ping.

Ping.

> 
> > ---
> >  gcc/cp/module.cc | 418 ++++++++++++++++++++++++++---------------------
> >  1 file changed, 231 insertions(+), 187 deletions(-)
> > 
> > diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
> > index 0291d456ff5..2ecee007e8a 100644
> > --- a/gcc/cp/module.cc
> > +++ b/gcc/cp/module.cc
> > @@ -153,9 +153,11 @@ Classes used:
> >  
> >     data - buffer
> >  
> > -   bytes - data streamer
> > -   bytes_in : bytes - scalar reader
> > -   bytes_out : bytes - scalar writer
> > +   bytes_in - scalar reader
> > +   bytes_out - scalar writer
> > +
> > +   bits_in - bit stream reader
> > +   bits_out - bit stream writer
> >  
> >     elf - ELROND format
> >     elf_in : elf - ELROND reader
> > @@ -346,6 +348,7 @@ typedef hash_map<void *,signed,ptr_int_traits> ptr_int_hash_map;
> >  
> >  /* Variable length buffer.  */
> >  
> > +namespace {
> >  class data {
> >  public:
> >    class allocator {
> > @@ -393,6 +396,8 @@ protected:
> >      return res;
> >    }
> >  
> > +  unsigned calc_crc (unsigned) const;
> > +
> >  public:
> >    void unuse (unsigned count)
> >    {
> > @@ -402,6 +407,7 @@ public:
> >  public:
> >    static allocator simple_memory;
> >  };
> > +} // anon namespace
> >  
> >  /* The simple data allocator.  */
> >  data::allocator data::simple_memory;
> > @@ -447,46 +453,11 @@ data::allocator::shrink (char *ptr)
> >    XDELETEVEC (ptr);
> >  }
> >  
> > -/* Byte streamer base.   Buffer with read/write position and smarts
> > -   for single bits.  */
> > -
> > -class bytes : public data {
> > -public:
> > -  typedef data parent;
> > -
> > -protected:
> > -  uint32_t bit_val;	/* Bit buffer.  */
> > -  unsigned bit_pos;	/* Next bit in bit buffer.  */
> > -
> > -public:
> > -  bytes ()
> > -    :parent (), bit_val (0), bit_pos (0)
> > -  {}
> > -  ~bytes () 
> > -  {
> > -  }
> > -
> > -protected:
> > -  unsigned calc_crc (unsigned) const;
> > -
> > -protected:
> > -  /* Finish bit packet.  Rewind the bytes not used.  */
> > -  unsigned bit_flush ()
> > -  {
> > -    gcc_assert (bit_pos);
> > -    unsigned bytes = (bit_pos + 7) / 8;
> > -    unuse (4 - bytes);
> > -    bit_pos = 0;
> > -    bit_val = 0;
> > -    return bytes;
> > -  }
> > -};
> > -
> >  /* Calculate the crc32 of the buffer.  Note the CRC is stored in the
> >     first 4 bytes, so don't include them.  */
> >  
> >  unsigned
> > -bytes::calc_crc (unsigned l) const
> > +data::calc_crc (unsigned l) const
> >  {
> >    return crc32 (0, (unsigned char *)buffer + 4, l - 4);
> >  }
> > @@ -495,8 +466,9 @@ class elf_in;
> >  
> >  /* Byte stream reader.  */
> >  
> > -class bytes_in : public bytes {
> > -  typedef bytes parent;
> > +namespace {
> > +class bytes_in : public data {
> > +  typedef data parent;
> >  
> >  protected:
> >    bool overrun;  /* Sticky read-too-much flag.  */
> > @@ -531,7 +503,6 @@ public:
> >      if (offset > size)
> >        set_overrun ();
> >      pos = offset;
> > -    bit_pos = bit_val = 0;
> >    }
> >  
> >  public:
> > @@ -573,14 +544,7 @@ public:
> >    unsigned u32 ();  	/* Read uncompressed integer.  */
> >  
> >  public:
> > -  bool b ();	    	/* Read a bool.  */
> > -  void bflush ();	/* Completed a block of bools.  */
> > -
> > -private:
> > -  void bfill ();	/* Get the next block of bools.  */
> > -
> > -public:
> > -  int c ();		/* Read a char.  */
> > +  int c () ATTRIBUTE_UNUSED;		/* Read a char.  */
> >    int i ();		/* Read a signed int.  */
> >    unsigned u ();	/* Read an unsigned int.  */
> >    size_t z ();		/* Read a size_t.  */
> > @@ -590,6 +554,7 @@ public:
> >    const void *buf (size_t); /* Read a fixed-length buffer.  */
> >    cpp_hashnode *cpp_node (); /* Read a cpp node.  */
> >  };
> > +} // anon namespace
> >  
> >  /* Verify the buffer's CRC is correct.  */
> >  
> > @@ -610,8 +575,9 @@ class elf_out;
> >  
> >  /* Byte stream writer.  */
> >  
> > -class bytes_out : public bytes {
> > -  typedef bytes parent;
> > +namespace {
> > +class bytes_out : public data {
> > +  typedef data parent;
> >  
> >  public:
> >    allocator *memory;	/* Obtainer of memory.  */
> > @@ -659,11 +625,7 @@ public:
> >    void u32 (unsigned);  /* Write uncompressed integer.  */
> >  
> >  public:
> > -  void b (bool);	/* Write bool.  */
> > -  void bflush ();	/* Finish block of bools.  */
> > -
> > -public:
> > -  void c (unsigned char); /* Write unsigned char.  */
> > +  void c (unsigned char) ATTRIBUTE_UNUSED; /* Write unsigned char.  */
> >    void i (int);		/* Write signed int.  */
> >    void u (unsigned);	/* Write unsigned int.  */
> >    void z (size_t s);	/* Write size_t.  */
> > @@ -694,13 +656,126 @@ protected:
> >    /* Instrumentation.  */
> >    static unsigned spans[4];
> >    static unsigned lengths[4];
> > -  static int is_set;
> > +  friend struct bits_out;
> >  };
> > +} // anon namespace
> > +
> > +/* Finish bit packet.  Rewind the bytes not used.  */
> > +static unsigned
> > +bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
> > +{
> > +  gcc_assert (bit_pos);
> > +  unsigned bytes = (bit_pos + 7) / 8;
> > +  bits.unuse (4 - bytes);
> > +  bit_pos = 0;
> > +  bit_val = 0;
> > +  return bytes;
> > +}
> > +
> > +/* Bit stream reader (RAII-enabled).  Bools are packed into bytes.  You
> > +   cannot mix bools and non-bools.  Use bflush to flush the current stream
> > +   of bools on demand.  Upon destruction bflush is called.
> > +
> > +   When reading, we don't know how many bools we'll read in.  So read
> > +   4 bytes-worth, and then rewind when flushing if we didn't need them
> > +   all.  You can't have a block of bools closer than 4 bytes to the
> > +   end of the buffer.  */
> > +
> > +namespace {
> > +struct bits_in {
> > +  bytes_in& in;
> > +  uint32_t bit_val = 0;
> > +  unsigned bit_pos = 0;
> > +
> > +  bits_in (bytes_in& in)
> > +    : in (in)
> > +  { }
> > +
> > +  ~bits_in ()
> > +  {
> > +    bflush ();
> > +  }
> > +
> > +  bits_in(const bits_in&) = delete;
> > +  bits_in& operator=(const bits_in&) = delete;
> > +
> > +  /* Completed a block of bools.  */
> > +  void bflush ()
> > +  {
> > +    if (bit_pos)
> > +      bit_flush (in, bit_val, bit_pos);
> > +  }
> > +
> > +  /* Read one bit.  */
> > +  bool b ()
> > +  {
> > +    if (!bit_pos)
> > +      bit_val = in.u32 ();
> > +    bool x = (bit_val >> bit_pos) & 1;
> > +    bit_pos = (bit_pos + 1) % 32;
> > +    return x;
> > +  }
> > +};
> > +} // anon namespace
> > +
> > +/* Bit stream writer (RAII-enabled), counterpart to bits_in.  */
> > +
> > +namespace {
> > +struct bits_out {
> > +  bytes_out& out;
> > +  uint32_t bit_val = 0;
> > +  unsigned bit_pos = 0;
> > +  char is_set = -1;
> > +
> > +  bits_out (bytes_out& out)
> > +    : out (out)
> > +  { }
> > +
> > +  ~bits_out ()
> > +  {
> > +    bflush ();
> > +  }
> > +
> > +  bits_out(const bits_out&) = delete;
> > +  bits_out& operator=(const bits_out&) = delete;
> > +
> > +  /* Completed a block of bools.  */
> > +  void bflush ()
> > +  {
> > +    if (bit_pos)
> > +      {
> > +	out.u32 (bit_val);
> > +	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
> > +      }
> > +    out.spans[2]++;
> > +    is_set = -1;
> > +  }
> > +
> > +  /* Write one bit.
> > +
> > +     It may be worth optimizing for most bools being zero.  Some kind of
> > +     run-length encoding?  */
> > +  void b (bool x)
> > +  {
> > +    if (is_set != x)
> > +      {
> > +	is_set = x;
> > +	out.spans[x]++;
> > +      }
> > +    out.lengths[x]++;
> > +    bit_val |= unsigned (x) << bit_pos++;
> > +    if (bit_pos == 32)
> > +      {
> > +	out.u32 (bit_val);
> > +	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
> > +      }
> > +  }
> > +};
> > +} // anon namespace
> >  
> >  /* Instrumentation.  */
> >  unsigned bytes_out::spans[4];
> >  unsigned bytes_out::lengths[4];
> > -int bytes_out::is_set = -1;
> >  
> >  /* If CRC_PTR non-null, set the CRC of the buffer.  Mix the CRC into
> >     that pointed to by CRC_PTR.  */
> > @@ -723,73 +798,6 @@ bytes_out::set_crc (unsigned *crc_ptr)
> >      }
> >  }
> >  
> > -/* Finish a set of bools.  */
> > -
> > -void
> > -bytes_out::bflush ()
> > -{
> > -  if (bit_pos)
> > -    {
> > -      u32 (bit_val);
> > -      lengths[2] += bit_flush ();
> > -    }
> > -  spans[2]++;
> > -  is_set = -1;
> > -}
> > -
> > -void
> > -bytes_in::bflush ()
> > -{
> > -  if (bit_pos)
> > -    bit_flush ();
> > -}
> > -
> > -/* When reading, we don't know how many bools we'll read in.  So read
> > -   4 bytes-worth, and then rewind when flushing if we didn't need them
> > -   all.  You can't have a block of bools closer than 4 bytes to the
> > -   end of the buffer.  */
> > -
> > -void
> > -bytes_in::bfill ()
> > -{
> > -  bit_val = u32 ();
> > -}
> > -
> > -/* Bools are packed into bytes.  You cannot mix bools and non-bools.
> > -   You must call bflush before emitting another type.  So batch your
> > -   bools.
> > -
> > -   It may be worth optimizing for most bools being zero.  Some kind of
> > -   run-length encoding?  */
> > -
> > -void
> > -bytes_out::b (bool x)
> > -{
> > -  if (is_set != x)
> > -    {
> > -      is_set = x;
> > -      spans[x]++;
> > -    }
> > -  lengths[x]++;
> > -  bit_val |= unsigned (x) << bit_pos++;
> > -  if (bit_pos == 32)
> > -    {
> > -      u32 (bit_val);
> > -      lengths[2] += bit_flush ();
> > -    }
> > -}
> > -
> > -bool
> > -bytes_in::b ()
> > -{
> > -  if (!bit_pos)
> > -    bfill ();
> > -  bool v = (bit_val >> bit_pos++) & 1;
> > -  if (bit_pos == 32)
> > -    bit_flush ();
> > -  return v;
> > -}
> > -
> >  /* Exactly 4 bytes.  Used internally for bool packing and a few other
> >     places.  We can't simply use uint32_t because (a) alignment and
> >     (b) we need little-endian for the bool streaming rewinding to make
> > @@ -2846,6 +2854,7 @@ struct post_process_data {
> >     read trees with TREE_VISITED.  Thus it's quite safe to have
> >     multiple concurrent readers.  Which is good, because lazy
> >     loading. */
> > +namespace {
> >  class trees_in : public bytes_in {
> >    typedef bytes_in parent;
> >  
> > @@ -2870,15 +2879,15 @@ private:
> >  
> >  public:
> >    /* Needed for binfo writing  */
> > -  bool core_bools (tree);
> > +  bool core_bools (tree, bits_in&);
> >  
> >  private:
> >    /* Stream tree_core, lang_decl_specific and lang_type_specific
> >       bits.  */
> >    bool core_vals (tree);
> > -  bool lang_type_bools (tree);
> > +  bool lang_type_bools (tree, bits_in&);
> >    bool lang_type_vals (tree);
> > -  bool lang_decl_bools (tree);
> > +  bool lang_decl_bools (tree, bits_in&);
> >    bool lang_decl_vals (tree);
> >    bool lang_vals (tree);
> >    bool tree_node_bools (tree);
> > @@ -2965,6 +2974,7 @@ private:
> >  private:
> >    void assert_definition (tree, bool installing);
> >  };
> > +} // anon namespace
> >  
> >  trees_in::trees_in (module_state *state)
> >    :parent (), state (state), unused (0)
> > @@ -2982,6 +2992,7 @@ trees_in::~trees_in ()
> >  }
> >  
> >  /* Tree stream writer.  */
> > +namespace {
> >  class trees_out : public bytes_out {
> >    typedef bytes_out parent;
> >  
> > @@ -3043,11 +3054,11 @@ public:
> >    }
> >  
> >  private:
> > -  void core_bools (tree);
> > +  void core_bools (tree, bits_out&);
> >    void core_vals (tree);
> > -  void lang_type_bools (tree);
> > +  void lang_type_bools (tree, bits_out&);
> >    void lang_type_vals (tree);
> > -  void lang_decl_bools (tree);
> > +  void lang_decl_bools (tree, bits_out&);
> >    void lang_decl_vals (tree);
> >    void lang_vals (tree);
> >    void tree_node_bools (tree);
> > @@ -3126,6 +3137,7 @@ private:
> >    static unsigned back_ref_count;
> >    static unsigned null_count;
> >  };
> > +} // anon namespace
> >  
> >  /* Instrumentation counters.  */
> >  unsigned trees_out::tree_val_count;
> > @@ -5292,9 +5304,9 @@ trees_in::start (unsigned code)
> >  /* Read & write the core boolean flags.  */
> >  
> >  void
> > -trees_out::core_bools (tree t)
> > +trees_out::core_bools (tree t, bits_out& bits)
> >  {
> > -#define WB(X) (b (X))
> > +#define WB(X) (bits.b (X))
> >    tree_code code = TREE_CODE (t);
> >  
> >    WB (t->base.side_effects_flag);
> > @@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
> >    if (TREE_CODE_CLASS (code) != tcc_type)
> >      /* This is TYPE_CACHED_VALUES_P for types.  */
> >      WB (t->base.public_flag);
> > +  else
> > +    WB (false);
> >    WB (t->base.private_flag);
> >    WB (t->base.protected_flag);
> >    WB (t->base.deprecated_flag);
> > @@ -5327,7 +5341,7 @@ trees_out::core_bools (tree t)
> >      case TARGET_MEM_REF:
> >      case TREE_VEC:
> >        /* These use different base.u fields.  */
> > -      break;
> > +      return;
> >  
> >      default:
> >        WB (t->base.u.bits.lang_flag_0);
> > @@ -5359,7 +5373,7 @@ trees_out::core_bools (tree t)
> >        break;
> >      }
> >  
> > -  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
> > +  if (TREE_CODE_CLASS (code) == tcc_type)
> >      {
> >        WB (t->type_common.no_force_blk_flag);
> >        WB (t->type_common.needs_constructing_flag);
> > @@ -5374,7 +5388,10 @@ trees_out::core_bools (tree t)
> >        WB (t->type_common.lang_flag_5);
> >        WB (t->type_common.lang_flag_6);
> >        WB (t->type_common.typeless_storage);
> > +      return;
> >      }
> > +  else if (TREE_CODE_CLASS (code) != tcc_declaration)
> > +    return;
> >  
> >    if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
> >      {
> > @@ -5439,6 +5456,8 @@ trees_out::core_bools (tree t)
> >        WB (t->decl_common.decl_nonshareable_flag);
> >        WB (t->decl_common.decl_not_flexarray);
> >      }
> > +  else
> > +    return;
> >  
> >    if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
> >      {
> > @@ -5459,6 +5478,8 @@ trees_out::core_bools (tree t)
> >        WB (t->decl_with_vis.final);
> >        WB (t->decl_with_vis.regdecl_flag);
> >      }
> > +  else
> > +    return;
> >  
> >    if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
> >      {
> > @@ -5489,9 +5510,10 @@ trees_out::core_bools (tree t)
> >  }
> >  
> >  bool
> > -trees_in::core_bools (tree t)
> > +trees_in::core_bools (tree t, bits_in& bits)
> >  {
> > -#define RB(X) ((X) = b ())
> > +#define RB(X) ((X) = bits.b ())
> > +
> >    tree_code code = TREE_CODE (t);
> >  
> >    RB (t->base.side_effects_flag);
> > @@ -5507,6 +5529,8 @@ trees_in::core_bools (tree t)
> >    RB (t->base.static_flag);
> >    if (TREE_CODE_CLASS (code) != tcc_type)
> >      RB (t->base.public_flag);
> > +  else
> > +    bits.b ();
> >    RB (t->base.private_flag);
> >    RB (t->base.protected_flag);
> >    RB (t->base.deprecated_flag);
> > @@ -5520,7 +5544,7 @@ trees_in::core_bools (tree t)
> >      case TARGET_MEM_REF:
> >      case TREE_VEC:
> >        /* These use different base.u fields.  */
> > -      break;
> > +      goto done;
> >  
> >      default:
> >        RB (t->base.u.bits.lang_flag_0);
> > @@ -5539,7 +5563,7 @@ trees_in::core_bools (tree t)
> >        break;
> >      }
> >  
> > -  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
> > +  if (TREE_CODE_CLASS (code) == tcc_type)
> >      {
> >        RB (t->type_common.no_force_blk_flag);
> >        RB (t->type_common.needs_constructing_flag);
> > @@ -5554,7 +5578,10 @@ trees_in::core_bools (tree t)
> >        RB (t->type_common.lang_flag_5);
> >        RB (t->type_common.lang_flag_6);
> >        RB (t->type_common.typeless_storage);
> > +      goto done;
> >      }
> > +  else if (TREE_CODE_CLASS (code) != tcc_declaration)
> > +    goto done;
> >  
> >    if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
> >      {
> > @@ -5584,6 +5611,8 @@ trees_in::core_bools (tree t)
> >        RB (t->decl_common.decl_nonshareable_flag);
> >        RB (t->decl_common.decl_not_flexarray);
> >      }
> > +  else
> > +    goto done;
> >  
> >    if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
> >      {
> > @@ -5604,6 +5633,8 @@ trees_in::core_bools (tree t)
> >        RB (t->decl_with_vis.final);
> >        RB (t->decl_with_vis.regdecl_flag);
> >      }
> > +  else
> > +    goto done;
> >  
> >    if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
> >      {
> > @@ -5627,20 +5658,22 @@ trees_in::core_bools (tree t)
> >  
> >        /* decl_type is a (misnamed) 2 bit discriminator.	 */
> >        unsigned kind = 0;
> > -      kind |= unsigned (b ()) << 0;
> > -      kind |= unsigned (b ()) << 1;
> > +      kind |= unsigned (bits.b ()) << 0;
> > +      kind |= unsigned (bits.b ()) << 1;
> >        t->function_decl.decl_type = function_decl_type (kind);
> >      }
> >  #undef RB
> > +done:
> >    return !get_overrun ();
> >  }
> >  
> >  void
> > -trees_out::lang_decl_bools (tree t)
> > +trees_out::lang_decl_bools (tree t, bits_out& bits)
> >  {
> > -#define WB(X) (b (X))
> > +#define WB(X) (bits.b (X))
> >    const struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
> >  
> > +  bits.bflush ();
> >    WB (lang->u.base.language == lang_cplusplus);
> >    WB ((lang->u.base.use_template >> 0) & 1);
> >    WB ((lang->u.base.use_template >> 1) & 1);
> > @@ -5664,6 +5697,8 @@ trees_out::lang_decl_bools (tree t)
> >    WB (lang->u.base.module_attach_p);
> >    if (VAR_OR_FUNCTION_DECL_P (t))
> >      WB (lang->u.base.module_keyed_decls_p);
> > +  else
> > +    WB (false);
> >    switch (lang->u.base.selector)
> >      {
> >      default:
> > @@ -5714,15 +5749,16 @@ trees_out::lang_decl_bools (tree t)
> >  }
> >  
> >  bool
> > -trees_in::lang_decl_bools (tree t)
> > +trees_in::lang_decl_bools (tree t, bits_in& bits)
> >  {
> > -#define RB(X) ((X) = b ())
> > +#define RB(X) ((X) = bits.b ())
> >    struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
> >  
> > -  lang->u.base.language = b () ? lang_cplusplus : lang_c;
> > +  bits.bflush ();
> > +  lang->u.base.language = bits.b () ? lang_cplusplus : lang_c;
> >    unsigned v;
> > -  v = b () << 0;
> > -  v |= b () << 1;
> > +  v = bits.b () << 0;
> > +  v |= bits.b () << 1;
> >    lang->u.base.use_template = v;
> >    /* lang->u.base.not_really_extern is not streamed.  */
> >    RB (lang->u.base.initialized_in_class);
> > @@ -5738,6 +5774,8 @@ trees_in::lang_decl_bools (tree t)
> >    RB (lang->u.base.module_attach_p);
> >    if (VAR_OR_FUNCTION_DECL_P (t))
> >      RB (lang->u.base.module_keyed_decls_p);
> > +  else
> > +    bits.b ();
> >    switch (lang->u.base.selector)
> >      {
> >      default:
> > @@ -5786,11 +5824,12 @@ trees_in::lang_decl_bools (tree t)
> >  }
> >  
> >  void
> > -trees_out::lang_type_bools (tree t)
> > +trees_out::lang_type_bools (tree t, bits_out& bits)
> >  {
> > -#define WB(X) (b (X))
> > +#define WB(X) (bits.b (X))
> >    const struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
> >  
> > +  bits.bflush ();
> >    WB (lang->has_type_conversion);
> >    WB (lang->has_copy_ctor);
> >    WB (lang->has_default_ctor);
> > @@ -5852,11 +5891,12 @@ trees_out::lang_type_bools (tree t)
> >  }
> >  
> >  bool
> > -trees_in::lang_type_bools (tree t)
> > +trees_in::lang_type_bools (tree t, bits_in& bits)
> >  {
> > -#define RB(X) ((X) = b ())
> > +#define RB(X) ((X) = bits.b ())
> >    struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
> >  
> > +  bits.bflush ();
> >    RB (lang->has_type_conversion);
> >    RB (lang->has_copy_ctor);
> >    RB (lang->has_default_ctor);
> > @@ -5864,8 +5904,8 @@ trees_in::lang_type_bools (tree t)
> >    RB (lang->ref_needs_init);
> >    RB (lang->has_const_copy_assign);
> >    unsigned v;
> > -  v = b () << 0;
> > -  v |= b () << 1;
> > +  v = bits.b () << 0;
> > +  v |= bits.b () << 1;
> >    lang->use_template = v;
> >  
> >    RB (lang->has_mutable);
> > @@ -5877,8 +5917,8 @@ trees_in::lang_type_bools (tree t)
> >    RB (lang->has_new);
> >    RB (lang->has_array_new);
> >  
> > -  v = b () << 0;
> > -  v |= b () << 1;
> > +  v = bits.b () << 0;
> > +  v |= bits.b () << 1;
> >    lang->gets_delete = v;
> >    RB (lang->interface_only);
> >    RB (lang->interface_unknown);
> > @@ -7106,18 +7146,19 @@ trees_out::tree_node_bools (tree t)
> >    gcc_checking_assert (TREE_CODE (t) != NAMESPACE_DECL
> >  		       || DECL_NAMESPACE_ALIAS (t));
> >  
> > -  core_bools (t);
> > +  bits_out bits (*this);
> > +  core_bools (t, bits);
> >  
> >    switch (TREE_CODE_CLASS (TREE_CODE (t)))
> >      {
> >      case tcc_declaration:
> >        {
> >  	bool specific = DECL_LANG_SPECIFIC (t) != NULL;
> > -	b (specific);
> > +	bits.b (specific);
> >  	if (specific && VAR_P (t))
> > -	  b (DECL_DECOMPOSITION_P (t));
> > +	  bits.b (DECL_DECOMPOSITION_P (t));
> >  	if (specific)
> > -	  lang_decl_bools (t);
> > +	  lang_decl_bools (t, bits);
> >        }
> >        break;
> >  
> > @@ -7128,9 +7169,9 @@ trees_out::tree_node_bools (tree t)
> >  	gcc_assert (TYPE_LANG_SPECIFIC (t)
> >  		    == TYPE_LANG_SPECIFIC (TYPE_MAIN_VARIANT (t)));
> >  
> > -	b (specific);
> > +	bits.b (specific);
> >  	if (specific)
> > -	  lang_type_bools (t);
> > +	  lang_type_bools (t, bits);
> >        }
> >        break;
> >  
> > @@ -7138,34 +7179,35 @@ trees_out::tree_node_bools (tree t)
> >        break;
> >      }
> >  
> > -  bflush ();
> > +  bits.bflush ();
> >  }
> >  
> >  bool
> >  trees_in::tree_node_bools (tree t)
> >  {
> > -  bool ok = core_bools (t);
> > +  bits_in bits (*this);
> > +  bool ok = core_bools (t, bits);
> >  
> >    if (ok)
> >      switch (TREE_CODE_CLASS (TREE_CODE (t)))
> >        {
> >        case tcc_declaration:
> > -	if (b ())
> > +	if (bits.b ())
> >  	  {
> > -	    bool decomp = VAR_P (t) && b ();
> > +	    bool decomp = VAR_P (t) && bits.b ();
> >  
> >  	    ok = maybe_add_lang_decl_raw (t, decomp);
> >  	    if (ok)
> > -	      ok = lang_decl_bools (t);
> > -	}
> > +	      ok = lang_decl_bools (t, bits);
> > +	  }
> >  	break;
> >  
> >        case tcc_type:
> > -	if (b ())
> > +	if (bits.b ())
> >  	  {
> >  	    ok = maybe_add_lang_type_raw (t);
> >  	    if (ok)
> > -	      ok = lang_type_bools (t);
> > +	      ok = lang_type_bools (t, bits);
> >  	  }
> >  	break;
> >  
> > @@ -7173,7 +7215,7 @@ trees_in::tree_node_bools (tree t)
> >  	break;
> >        }
> >  
> > -  bflush ();
> > +  bits.bflush ();
> >    if (!ok || get_overrun ())
> >      return false;
> >  
> > @@ -7696,11 +7738,11 @@ trees_out::decl_value (tree decl, depset *dep)
> >  
> >        if (mk != MK_unique)
> >  	{
> > +	  bits_out bits (*this);
> >  	  if (!(mk & MK_template_mask) && !state->is_header ())
> >  	    {
> >  	      /* Tell the importer whether this is a global module entity,
> > -		 or a module entity.  This bool merges into the next block
> > -		 of bools.  Sneaky.  */
> > +		 or a module entity.  */
> >  	      tree o = get_originating_module_decl (decl);
> >  	      bool is_attached = false;
> >  
> > @@ -7709,9 +7751,9 @@ trees_out::decl_value (tree decl, depset *dep)
> >  		  && DECL_MODULE_ATTACH_P (not_tmpl))
> >  		is_attached = true;
> >  
> > -	      b (is_attached);
> > +	      bits.b (is_attached);
> >  	    }
> > -	  b (dep && dep->has_defn ());
> > +	  bits.b (dep && dep->has_defn ());
> >  	}
> >        tree_node_bools (decl);
> >      }
> > @@ -7981,11 +8023,11 @@ trees_in::decl_value ()
> >      {
> >        if (mk != MK_unique)
> >  	{
> > +	  bits_in bits (*this);
> >  	  if (!(mk & MK_template_mask) && !state->is_header ())
> > -	    /* See note in trees_out about where this bool is sequenced.  */
> > -	    is_attached = b ();
> > +	    is_attached = bits.b ();
> >  
> > -	  has_defn = b ();
> > +	  has_defn = bits.b ();
> >  	}
> >  
> >        if (!tree_node_bools (decl))
> > @@ -16682,10 +16724,11 @@ module_state::write_define (bytes_out &sec, const cpp_macro *macro)
> >  {
> >    sec.u (macro->count);
> >  
> > -  sec.b (macro->fun_like);
> > -  sec.b (macro->variadic);
> > -  sec.b (macro->syshdr);
> > -  sec.bflush ();
> > +  bits_out bits (sec);
> > +  bits.b (macro->fun_like);
> > +  bits.b (macro->variadic);
> > +  bits.b (macro->syshdr);
> > +  bits.bflush ();
> >  
> >    write_location (sec, macro->line);
> >    if (macro->fun_like)
> > @@ -16780,10 +16823,11 @@ module_state::read_define (bytes_in &sec, cpp_reader *reader) const
> >    macro->kind = cmk_macro;
> >    macro->imported_p = true;
> >  
> > -  macro->fun_like = sec.b ();
> > -  macro->variadic = sec.b ();
> > -  macro->syshdr = sec.b ();
> > -  sec.bflush ();
> > +  bits_in bits (sec);
> > +  macro->fun_like = bits.b ();
> > +  macro->variadic = bits.b ();
> > +  macro->syshdr = bits.b ();
> > +  bits.bflush ();
> >  
> >    macro->line = read_location (sec);
> >  
> > -- 
> > 2.44.0.rc1.15.g4fc51f00ef
> > 
> > 
>
Patrick Palka April 9, 2024, 8:28 p.m. UTC | #5
On Tue, 26 Mar 2024, Patrick Palka wrote:

> On Tue, 27 Feb 2024, Patrick Palka wrote:
> 
> > On Fri, 16 Feb 2024, Patrick Palka wrote:
> > 
> > > On Thu, 15 Feb 2024, Patrick Palka wrote:
> > > 
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look
> > > > OK for trunk?
> > > > 
> > > > -- >8 --
> > > > 
> > > > One would expect consecutive calls to bytes_in/out::b for streaming
> > > > adjacent bits, as we do for tree flag streaming, to at least be
> > > > optimized by the compiler into individual bit operations using
> > > > statically known bit positions (and ideally merged into larger sized
> > > > reads/writes).
> > > > 
> > > > Unfortunately this doesn't happen because the compiler has trouble
> > > > tracking the values of this->bit_pos and this->bit_val across such
> > > > calls, likely because the compiler doesn't know 'this' and so it's
> > > > treated as global memory.  This means for each consecutive bit stream
> > > > operation, bit_pos and bit_val are loaded from memory, checked if
> > > > buffering is needed, and finally the bit is extracted from bit_val
> > > > according to the (unknown) bit_pos, even though relative to the previous
> > > > operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
> > > > is just 1 larger.  This ends up being quite slow, with tree_node_bools
> > > > taking 10% of time when streaming in parts of the std module.
> > > > 
> > > > This patch optimizes this by making tracking of bit_pos and bit_val
> > > > easier for the compiler.  Rather than bit_pos and bit_val being members
> > > > of the (effectively global) bytes_in/out objects, this patch factors out
> > > > the bit streaming code/state into separate classes bits_in/out that get
> > > > constructed locally as needed for bit streaming.  Since these objects
> > > > are now clearly local, the compiler can more easily track their values.
> > > > 
> > > > And since bit streaming is intended to be batched it's natural for these
> > > > new classes to be RAII-enabled such that the bit stream is flushed upon
> > > > destruction.
> > > > 
> > > > In order to make the most of this improved tracking of bit position,
> > > > this patch changes parts where we conditionally stream a tree flag
> > > > to unconditionally stream (the flag or a dummy value).  That way
> > > > the number of bits streamed and the respective bit positions are as
> > > > statically known as reasonably possible.  In lang_decl_bools and
> > > > lang_type_bools we flush the current bit buffer at the start so that
> > > > subsequent bit positions are statically known.  And in core bools, we
> > > > can add explicit early exits utilizing invariants that the compiler
> > > > can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
> > > > and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
> > > > it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
> > > > than 4 bits, it's more space efficient to stream them as individual
> > > > bytes rather than as packed bits (due to the 32-bit buffer).
> > > 
> > > Oops, this last sentence is wrong.  Although the size of the bit buffer
> > > is 32 bits, upon flushing we rewind unused bytes within the buffer,
> > > which means streaming 2-8 bits ends up using only one byte not all four.
> > > So v2 below undoes this pessimization.
> > > 
> > > > This patch also moves the definitions of the relevant streaming classes
> > > > into anonymous namespaces so that the compiler can make more informed
> > > > decisions about inlining their member functions.
> > > > 
> > > > After this patch, compile time for a simple Hello World using the std
> > > > module is reduced by 7% with a release compiler.  The on-disk size of
> > > > the std module increases by 0.7% (presumably due to the extra flushing
> > > > done in lang_decl_bools and lang_type_bools).
> > > 
> > > The on-disk std module now only grows 0.4% instead of 0.7%.
> > > 
> > > > 
> > > > The bit stream out performance isn't improved as much as the stream in
> > > > due to the spans/lengths instrumentation performed on stream out (which
> > > > probably should be e.g. removed for release builds?)
> > > 
> > > -- >8 --
> > > 
> > > gcc/cp/ChangeLog:
> > > 
> > > 	* module.cc: Update comment about classes defined.
> > > 	(class data): Enclose in an anonymous namespace.
> > > 	(data::calc_crc): Moved from bytes::calc_crc.
> > > 	(class bytes): Remove.  Move bit_flush to namespace scope.
> > > 	(class bytes_in): Enclose in an anonymous namespace.  Inherit
> > > 	directly from data and adjust accordingly.  Move b and bflush
> > > 	members to bits_in.
> > > 	(class bytes_out): As above.  Remove is_set static data member.
> > > 	(bit_flush): Moved from class bytes.
> > > 	(struct bits_in): Define.
> > > 	(struct bits_out): Define.
> > > 	(bytes_out::bflush): Moved to bits_out/in.
> > > 	(bytes_in::bflush): Likewise
> > > 	(bytes_in::bfill): Removed.
> > > 	(bytes_out::b): Moved to bits_out/in.
> > > 	(bytes_in::b): Likewise.
> > > 	(class trees_in): Enclose in an anonymous namespace.
> > > 	(class trees_out): Enclose in an anonymous namespace.
> > > 	(trees_out::core_bools): Add bits_out/in parameter and use it.
> > > 	Unconditionally stream a bit for public_flag.  Add early exits
> > > 	as appropriate.
> > > 	(trees_out::core_bools): Likewise.
> > > 	(trees_out::lang_decl_bools): Add bits_out/in parameter and use
> > > 	it.  Flush the current bit buffer at the start.  Unconditionally
> > > 	stream a bit for module_keyed_decls_p.
> > > 	(trees_in::lang_decl_bools): Likewise.
> > > 	(trees_out::lang_type_bools): Add bits_out/in parameter and use
> > > 	it.  Flush the current bit buffer at the start.
> > > 	(trees_in::lang_type_bools): Likewise.
> > > 	(trees_out::tree_node_bools): Construct a bits_out object and
> > > 	use/pass it.
> > > 	(trees_in::tree_node_bools): Likewise.
> > > 	(trees_out::decl_value): Likewise.
> > > 	(trees_in::decl_value): Likewise.
> > > 	(module_state::write_define): Likewise.
> > > 	(module_state::read_define): Likewise.
> > 
> > Ping.
> 
> Ping.

Ping.

> 
> > 
> > > ---
> > >  gcc/cp/module.cc | 418 ++++++++++++++++++++++++++---------------------
> > >  1 file changed, 231 insertions(+), 187 deletions(-)
> > > 
> > > diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
> > > index 0291d456ff5..2ecee007e8a 100644
> > > --- a/gcc/cp/module.cc
> > > +++ b/gcc/cp/module.cc
> > > @@ -153,9 +153,11 @@ Classes used:
> > >  
> > >     data - buffer
> > >  
> > > -   bytes - data streamer
> > > -   bytes_in : bytes - scalar reader
> > > -   bytes_out : bytes - scalar writer
> > > +   bytes_in - scalar reader
> > > +   bytes_out - scalar writer
> > > +
> > > +   bits_in - bit stream reader
> > > +   bits_out - bit stream writer
> > >  
> > >     elf - ELROND format
> > >     elf_in : elf - ELROND reader
> > > @@ -346,6 +348,7 @@ typedef hash_map<void *,signed,ptr_int_traits> ptr_int_hash_map;
> > >  
> > >  /* Variable length buffer.  */
> > >  
> > > +namespace {
> > >  class data {
> > >  public:
> > >    class allocator {
> > > @@ -393,6 +396,8 @@ protected:
> > >      return res;
> > >    }
> > >  
> > > +  unsigned calc_crc (unsigned) const;
> > > +
> > >  public:
> > >    void unuse (unsigned count)
> > >    {
> > > @@ -402,6 +407,7 @@ public:
> > >  public:
> > >    static allocator simple_memory;
> > >  };
> > > +} // anon namespace
> > >  
> > >  /* The simple data allocator.  */
> > >  data::allocator data::simple_memory;
> > > @@ -447,46 +453,11 @@ data::allocator::shrink (char *ptr)
> > >    XDELETEVEC (ptr);
> > >  }
> > >  
> > > -/* Byte streamer base.   Buffer with read/write position and smarts
> > > -   for single bits.  */
> > > -
> > > -class bytes : public data {
> > > -public:
> > > -  typedef data parent;
> > > -
> > > -protected:
> > > -  uint32_t bit_val;	/* Bit buffer.  */
> > > -  unsigned bit_pos;	/* Next bit in bit buffer.  */
> > > -
> > > -public:
> > > -  bytes ()
> > > -    :parent (), bit_val (0), bit_pos (0)
> > > -  {}
> > > -  ~bytes () 
> > > -  {
> > > -  }
> > > -
> > > -protected:
> > > -  unsigned calc_crc (unsigned) const;
> > > -
> > > -protected:
> > > -  /* Finish bit packet.  Rewind the bytes not used.  */
> > > -  unsigned bit_flush ()
> > > -  {
> > > -    gcc_assert (bit_pos);
> > > -    unsigned bytes = (bit_pos + 7) / 8;
> > > -    unuse (4 - bytes);
> > > -    bit_pos = 0;
> > > -    bit_val = 0;
> > > -    return bytes;
> > > -  }
> > > -};
> > > -
> > >  /* Calculate the crc32 of the buffer.  Note the CRC is stored in the
> > >     first 4 bytes, so don't include them.  */
> > >  
> > >  unsigned
> > > -bytes::calc_crc (unsigned l) const
> > > +data::calc_crc (unsigned l) const
> > >  {
> > >    return crc32 (0, (unsigned char *)buffer + 4, l - 4);
> > >  }
> > > @@ -495,8 +466,9 @@ class elf_in;
> > >  
> > >  /* Byte stream reader.  */
> > >  
> > > -class bytes_in : public bytes {
> > > -  typedef bytes parent;
> > > +namespace {
> > > +class bytes_in : public data {
> > > +  typedef data parent;
> > >  
> > >  protected:
> > >    bool overrun;  /* Sticky read-too-much flag.  */
> > > @@ -531,7 +503,6 @@ public:
> > >      if (offset > size)
> > >        set_overrun ();
> > >      pos = offset;
> > > -    bit_pos = bit_val = 0;
> > >    }
> > >  
> > >  public:
> > > @@ -573,14 +544,7 @@ public:
> > >    unsigned u32 ();  	/* Read uncompressed integer.  */
> > >  
> > >  public:
> > > -  bool b ();	    	/* Read a bool.  */
> > > -  void bflush ();	/* Completed a block of bools.  */
> > > -
> > > -private:
> > > -  void bfill ();	/* Get the next block of bools.  */
> > > -
> > > -public:
> > > -  int c ();		/* Read a char.  */
> > > +  int c () ATTRIBUTE_UNUSED;		/* Read a char.  */
> > >    int i ();		/* Read a signed int.  */
> > >    unsigned u ();	/* Read an unsigned int.  */
> > >    size_t z ();		/* Read a size_t.  */
> > > @@ -590,6 +554,7 @@ public:
> > >    const void *buf (size_t); /* Read a fixed-length buffer.  */
> > >    cpp_hashnode *cpp_node (); /* Read a cpp node.  */
> > >  };
> > > +} // anon namespace
> > >  
> > >  /* Verify the buffer's CRC is correct.  */
> > >  
> > > @@ -610,8 +575,9 @@ class elf_out;
> > >  
> > >  /* Byte stream writer.  */
> > >  
> > > -class bytes_out : public bytes {
> > > -  typedef bytes parent;
> > > +namespace {
> > > +class bytes_out : public data {
> > > +  typedef data parent;
> > >  
> > >  public:
> > >    allocator *memory;	/* Obtainer of memory.  */
> > > @@ -659,11 +625,7 @@ public:
> > >    void u32 (unsigned);  /* Write uncompressed integer.  */
> > >  
> > >  public:
> > > -  void b (bool);	/* Write bool.  */
> > > -  void bflush ();	/* Finish block of bools.  */
> > > -
> > > -public:
> > > -  void c (unsigned char); /* Write unsigned char.  */
> > > +  void c (unsigned char) ATTRIBUTE_UNUSED; /* Write unsigned char.  */
> > >    void i (int);		/* Write signed int.  */
> > >    void u (unsigned);	/* Write unsigned int.  */
> > >    void z (size_t s);	/* Write size_t.  */
> > > @@ -694,13 +656,126 @@ protected:
> > >    /* Instrumentation.  */
> > >    static unsigned spans[4];
> > >    static unsigned lengths[4];
> > > -  static int is_set;
> > > +  friend struct bits_out;
> > >  };
> > > +} // anon namespace
> > > +
> > > +/* Finish bit packet.  Rewind the bytes not used.  */
> > > +static unsigned
> > > +bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
> > > +{
> > > +  gcc_assert (bit_pos);
> > > +  unsigned bytes = (bit_pos + 7) / 8;
> > > +  bits.unuse (4 - bytes);
> > > +  bit_pos = 0;
> > > +  bit_val = 0;
> > > +  return bytes;
> > > +}
> > > +
> > > +/* Bit stream reader (RAII-enabled).  Bools are packed into bytes.  You
> > > +   cannot mix bools and non-bools.  Use bflush to flush the current stream
> > > +   of bools on demand.  Upon destruction bflush is called.
> > > +
> > > +   When reading, we don't know how many bools we'll read in.  So read
> > > +   4 bytes-worth, and then rewind when flushing if we didn't need them
> > > +   all.  You can't have a block of bools closer than 4 bytes to the
> > > +   end of the buffer.  */
> > > +
> > > +namespace {
> > > +struct bits_in {
> > > +  bytes_in& in;
> > > +  uint32_t bit_val = 0;
> > > +  unsigned bit_pos = 0;
> > > +
> > > +  bits_in (bytes_in& in)
> > > +    : in (in)
> > > +  { }
> > > +
> > > +  ~bits_in ()
> > > +  {
> > > +    bflush ();
> > > +  }
> > > +
> > > +  bits_in(const bits_in&) = delete;
> > > +  bits_in& operator=(const bits_in&) = delete;
> > > +
> > > +  /* Completed a block of bools.  */
> > > +  void bflush ()
> > > +  {
> > > +    if (bit_pos)
> > > +      bit_flush (in, bit_val, bit_pos);
> > > +  }
> > > +
> > > +  /* Read one bit.  */
> > > +  bool b ()
> > > +  {
> > > +    if (!bit_pos)
> > > +      bit_val = in.u32 ();
> > > +    bool x = (bit_val >> bit_pos) & 1;
> > > +    bit_pos = (bit_pos + 1) % 32;
> > > +    return x;
> > > +  }
> > > +};
> > > +} // anon namespace
> > > +
> > > +/* Bit stream writer (RAII-enabled), counterpart to bits_in.  */
> > > +
> > > +namespace {
> > > +struct bits_out {
> > > +  bytes_out& out;
> > > +  uint32_t bit_val = 0;
> > > +  unsigned bit_pos = 0;
> > > +  char is_set = -1;
> > > +
> > > +  bits_out (bytes_out& out)
> > > +    : out (out)
> > > +  { }
> > > +
> > > +  ~bits_out ()
> > > +  {
> > > +    bflush ();
> > > +  }
> > > +
> > > +  bits_out(const bits_out&) = delete;
> > > +  bits_out& operator=(const bits_out&) = delete;
> > > +
> > > +  /* Completed a block of bools.  */
> > > +  void bflush ()
> > > +  {
> > > +    if (bit_pos)
> > > +      {
> > > +	out.u32 (bit_val);
> > > +	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
> > > +      }
> > > +    out.spans[2]++;
> > > +    is_set = -1;
> > > +  }
> > > +
> > > +  /* Write one bit.
> > > +
> > > +     It may be worth optimizing for most bools being zero.  Some kind of
> > > +     run-length encoding?  */
> > > +  void b (bool x)
> > > +  {
> > > +    if (is_set != x)
> > > +      {
> > > +	is_set = x;
> > > +	out.spans[x]++;
> > > +      }
> > > +    out.lengths[x]++;
> > > +    bit_val |= unsigned (x) << bit_pos++;
> > > +    if (bit_pos == 32)
> > > +      {
> > > +	out.u32 (bit_val);
> > > +	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
> > > +      }
> > > +  }
> > > +};
> > > +} // anon namespace
> > >  
> > >  /* Instrumentation.  */
> > >  unsigned bytes_out::spans[4];
> > >  unsigned bytes_out::lengths[4];
> > > -int bytes_out::is_set = -1;
> > >  
> > >  /* If CRC_PTR non-null, set the CRC of the buffer.  Mix the CRC into
> > >     that pointed to by CRC_PTR.  */
> > > @@ -723,73 +798,6 @@ bytes_out::set_crc (unsigned *crc_ptr)
> > >      }
> > >  }
> > >  
> > > -/* Finish a set of bools.  */
> > > -
> > > -void
> > > -bytes_out::bflush ()
> > > -{
> > > -  if (bit_pos)
> > > -    {
> > > -      u32 (bit_val);
> > > -      lengths[2] += bit_flush ();
> > > -    }
> > > -  spans[2]++;
> > > -  is_set = -1;
> > > -}
> > > -
> > > -void
> > > -bytes_in::bflush ()
> > > -{
> > > -  if (bit_pos)
> > > -    bit_flush ();
> > > -}
> > > -
> > > -/* When reading, we don't know how many bools we'll read in.  So read
> > > -   4 bytes-worth, and then rewind when flushing if we didn't need them
> > > -   all.  You can't have a block of bools closer than 4 bytes to the
> > > -   end of the buffer.  */
> > > -
> > > -void
> > > -bytes_in::bfill ()
> > > -{
> > > -  bit_val = u32 ();
> > > -}
> > > -
> > > -/* Bools are packed into bytes.  You cannot mix bools and non-bools.
> > > -   You must call bflush before emitting another type.  So batch your
> > > -   bools.
> > > -
> > > -   It may be worth optimizing for most bools being zero.  Some kind of
> > > -   run-length encoding?  */
> > > -
> > > -void
> > > -bytes_out::b (bool x)
> > > -{
> > > -  if (is_set != x)
> > > -    {
> > > -      is_set = x;
> > > -      spans[x]++;
> > > -    }
> > > -  lengths[x]++;
> > > -  bit_val |= unsigned (x) << bit_pos++;
> > > -  if (bit_pos == 32)
> > > -    {
> > > -      u32 (bit_val);
> > > -      lengths[2] += bit_flush ();
> > > -    }
> > > -}
> > > -
> > > -bool
> > > -bytes_in::b ()
> > > -{
> > > -  if (!bit_pos)
> > > -    bfill ();
> > > -  bool v = (bit_val >> bit_pos++) & 1;
> > > -  if (bit_pos == 32)
> > > -    bit_flush ();
> > > -  return v;
> > > -}
> > > -
> > >  /* Exactly 4 bytes.  Used internally for bool packing and a few other
> > >     places.  We can't simply use uint32_t because (a) alignment and
> > >     (b) we need little-endian for the bool streaming rewinding to make
> > > @@ -2846,6 +2854,7 @@ struct post_process_data {
> > >     read trees with TREE_VISITED.  Thus it's quite safe to have
> > >     multiple concurrent readers.  Which is good, because lazy
> > >     loading. */
> > > +namespace {
> > >  class trees_in : public bytes_in {
> > >    typedef bytes_in parent;
> > >  
> > > @@ -2870,15 +2879,15 @@ private:
> > >  
> > >  public:
> > >    /* Needed for binfo writing  */
> > > -  bool core_bools (tree);
> > > +  bool core_bools (tree, bits_in&);
> > >  
> > >  private:
> > >    /* Stream tree_core, lang_decl_specific and lang_type_specific
> > >       bits.  */
> > >    bool core_vals (tree);
> > > -  bool lang_type_bools (tree);
> > > +  bool lang_type_bools (tree, bits_in&);
> > >    bool lang_type_vals (tree);
> > > -  bool lang_decl_bools (tree);
> > > +  bool lang_decl_bools (tree, bits_in&);
> > >    bool lang_decl_vals (tree);
> > >    bool lang_vals (tree);
> > >    bool tree_node_bools (tree);
> > > @@ -2965,6 +2974,7 @@ private:
> > >  private:
> > >    void assert_definition (tree, bool installing);
> > >  };
> > > +} // anon namespace
> > >  
> > >  trees_in::trees_in (module_state *state)
> > >    :parent (), state (state), unused (0)
> > > @@ -2982,6 +2992,7 @@ trees_in::~trees_in ()
> > >  }
> > >  
> > >  /* Tree stream writer.  */
> > > +namespace {
> > >  class trees_out : public bytes_out {
> > >    typedef bytes_out parent;
> > >  
> > > @@ -3043,11 +3054,11 @@ public:
> > >    }
> > >  
> > >  private:
> > > -  void core_bools (tree);
> > > +  void core_bools (tree, bits_out&);
> > >    void core_vals (tree);
> > > -  void lang_type_bools (tree);
> > > +  void lang_type_bools (tree, bits_out&);
> > >    void lang_type_vals (tree);
> > > -  void lang_decl_bools (tree);
> > > +  void lang_decl_bools (tree, bits_out&);
> > >    void lang_decl_vals (tree);
> > >    void lang_vals (tree);
> > >    void tree_node_bools (tree);
> > > @@ -3126,6 +3137,7 @@ private:
> > >    static unsigned back_ref_count;
> > >    static unsigned null_count;
> > >  };
> > > +} // anon namespace
> > >  
> > >  /* Instrumentation counters.  */
> > >  unsigned trees_out::tree_val_count;
> > > @@ -5292,9 +5304,9 @@ trees_in::start (unsigned code)
> > >  /* Read & write the core boolean flags.  */
> > >  
> > >  void
> > > -trees_out::core_bools (tree t)
> > > +trees_out::core_bools (tree t, bits_out& bits)
> > >  {
> > > -#define WB(X) (b (X))
> > > +#define WB(X) (bits.b (X))
> > >    tree_code code = TREE_CODE (t);
> > >  
> > >    WB (t->base.side_effects_flag);
> > > @@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
> > >    if (TREE_CODE_CLASS (code) != tcc_type)
> > >      /* This is TYPE_CACHED_VALUES_P for types.  */
> > >      WB (t->base.public_flag);
> > > +  else
> > > +    WB (false);
> > >    WB (t->base.private_flag);
> > >    WB (t->base.protected_flag);
> > >    WB (t->base.deprecated_flag);
> > > @@ -5327,7 +5341,7 @@ trees_out::core_bools (tree t)
> > >      case TARGET_MEM_REF:
> > >      case TREE_VEC:
> > >        /* These use different base.u fields.  */
> > > -      break;
> > > +      return;
> > >  
> > >      default:
> > >        WB (t->base.u.bits.lang_flag_0);
> > > @@ -5359,7 +5373,7 @@ trees_out::core_bools (tree t)
> > >        break;
> > >      }
> > >  
> > > -  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
> > > +  if (TREE_CODE_CLASS (code) == tcc_type)
> > >      {
> > >        WB (t->type_common.no_force_blk_flag);
> > >        WB (t->type_common.needs_constructing_flag);
> > > @@ -5374,7 +5388,10 @@ trees_out::core_bools (tree t)
> > >        WB (t->type_common.lang_flag_5);
> > >        WB (t->type_common.lang_flag_6);
> > >        WB (t->type_common.typeless_storage);
> > > +      return;
> > >      }
> > > +  else if (TREE_CODE_CLASS (code) != tcc_declaration)
> > > +    return;
> > >  
> > >    if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
> > >      {
> > > @@ -5439,6 +5456,8 @@ trees_out::core_bools (tree t)
> > >        WB (t->decl_common.decl_nonshareable_flag);
> > >        WB (t->decl_common.decl_not_flexarray);
> > >      }
> > > +  else
> > > +    return;
> > >  
> > >    if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
> > >      {
> > > @@ -5459,6 +5478,8 @@ trees_out::core_bools (tree t)
> > >        WB (t->decl_with_vis.final);
> > >        WB (t->decl_with_vis.regdecl_flag);
> > >      }
> > > +  else
> > > +    return;
> > >  
> > >    if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
> > >      {
> > > @@ -5489,9 +5510,10 @@ trees_out::core_bools (tree t)
> > >  }
> > >  
> > >  bool
> > > -trees_in::core_bools (tree t)
> > > +trees_in::core_bools (tree t, bits_in& bits)
> > >  {
> > > -#define RB(X) ((X) = b ())
> > > +#define RB(X) ((X) = bits.b ())
> > > +
> > >    tree_code code = TREE_CODE (t);
> > >  
> > >    RB (t->base.side_effects_flag);
> > > @@ -5507,6 +5529,8 @@ trees_in::core_bools (tree t)
> > >    RB (t->base.static_flag);
> > >    if (TREE_CODE_CLASS (code) != tcc_type)
> > >      RB (t->base.public_flag);
> > > +  else
> > > +    bits.b ();
> > >    RB (t->base.private_flag);
> > >    RB (t->base.protected_flag);
> > >    RB (t->base.deprecated_flag);
> > > @@ -5520,7 +5544,7 @@ trees_in::core_bools (tree t)
> > >      case TARGET_MEM_REF:
> > >      case TREE_VEC:
> > >        /* These use different base.u fields.  */
> > > -      break;
> > > +      goto done;
> > >  
> > >      default:
> > >        RB (t->base.u.bits.lang_flag_0);
> > > @@ -5539,7 +5563,7 @@ trees_in::core_bools (tree t)
> > >        break;
> > >      }
> > >  
> > > -  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
> > > +  if (TREE_CODE_CLASS (code) == tcc_type)
> > >      {
> > >        RB (t->type_common.no_force_blk_flag);
> > >        RB (t->type_common.needs_constructing_flag);
> > > @@ -5554,7 +5578,10 @@ trees_in::core_bools (tree t)
> > >        RB (t->type_common.lang_flag_5);
> > >        RB (t->type_common.lang_flag_6);
> > >        RB (t->type_common.typeless_storage);
> > > +      goto done;
> > >      }
> > > +  else if (TREE_CODE_CLASS (code) != tcc_declaration)
> > > +    goto done;
> > >  
> > >    if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
> > >      {
> > > @@ -5584,6 +5611,8 @@ trees_in::core_bools (tree t)
> > >        RB (t->decl_common.decl_nonshareable_flag);
> > >        RB (t->decl_common.decl_not_flexarray);
> > >      }
> > > +  else
> > > +    goto done;
> > >  
> > >    if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
> > >      {
> > > @@ -5604,6 +5633,8 @@ trees_in::core_bools (tree t)
> > >        RB (t->decl_with_vis.final);
> > >        RB (t->decl_with_vis.regdecl_flag);
> > >      }
> > > +  else
> > > +    goto done;
> > >  
> > >    if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
> > >      {
> > > @@ -5627,20 +5658,22 @@ trees_in::core_bools (tree t)
> > >  
> > >        /* decl_type is a (misnamed) 2 bit discriminator.	 */
> > >        unsigned kind = 0;
> > > -      kind |= unsigned (b ()) << 0;
> > > -      kind |= unsigned (b ()) << 1;
> > > +      kind |= unsigned (bits.b ()) << 0;
> > > +      kind |= unsigned (bits.b ()) << 1;
> > >        t->function_decl.decl_type = function_decl_type (kind);
> > >      }
> > >  #undef RB
> > > +done:
> > >    return !get_overrun ();
> > >  }
> > >  
> > >  void
> > > -trees_out::lang_decl_bools (tree t)
> > > +trees_out::lang_decl_bools (tree t, bits_out& bits)
> > >  {
> > > -#define WB(X) (b (X))
> > > +#define WB(X) (bits.b (X))
> > >    const struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
> > >  
> > > +  bits.bflush ();
> > >    WB (lang->u.base.language == lang_cplusplus);
> > >    WB ((lang->u.base.use_template >> 0) & 1);
> > >    WB ((lang->u.base.use_template >> 1) & 1);
> > > @@ -5664,6 +5697,8 @@ trees_out::lang_decl_bools (tree t)
> > >    WB (lang->u.base.module_attach_p);
> > >    if (VAR_OR_FUNCTION_DECL_P (t))
> > >      WB (lang->u.base.module_keyed_decls_p);
> > > +  else
> > > +    WB (false);
> > >    switch (lang->u.base.selector)
> > >      {
> > >      default:
> > > @@ -5714,15 +5749,16 @@ trees_out::lang_decl_bools (tree t)
> > >  }
> > >  
> > >  bool
> > > -trees_in::lang_decl_bools (tree t)
> > > +trees_in::lang_decl_bools (tree t, bits_in& bits)
> > >  {
> > > -#define RB(X) ((X) = b ())
> > > +#define RB(X) ((X) = bits.b ())
> > >    struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
> > >  
> > > -  lang->u.base.language = b () ? lang_cplusplus : lang_c;
> > > +  bits.bflush ();
> > > +  lang->u.base.language = bits.b () ? lang_cplusplus : lang_c;
> > >    unsigned v;
> > > -  v = b () << 0;
> > > -  v |= b () << 1;
> > > +  v = bits.b () << 0;
> > > +  v |= bits.b () << 1;
> > >    lang->u.base.use_template = v;
> > >    /* lang->u.base.not_really_extern is not streamed.  */
> > >    RB (lang->u.base.initialized_in_class);
> > > @@ -5738,6 +5774,8 @@ trees_in::lang_decl_bools (tree t)
> > >    RB (lang->u.base.module_attach_p);
> > >    if (VAR_OR_FUNCTION_DECL_P (t))
> > >      RB (lang->u.base.module_keyed_decls_p);
> > > +  else
> > > +    bits.b ();
> > >    switch (lang->u.base.selector)
> > >      {
> > >      default:
> > > @@ -5786,11 +5824,12 @@ trees_in::lang_decl_bools (tree t)
> > >  }
> > >  
> > >  void
> > > -trees_out::lang_type_bools (tree t)
> > > +trees_out::lang_type_bools (tree t, bits_out& bits)
> > >  {
> > > -#define WB(X) (b (X))
> > > +#define WB(X) (bits.b (X))
> > >    const struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
> > >  
> > > +  bits.bflush ();
> > >    WB (lang->has_type_conversion);
> > >    WB (lang->has_copy_ctor);
> > >    WB (lang->has_default_ctor);
> > > @@ -5852,11 +5891,12 @@ trees_out::lang_type_bools (tree t)
> > >  }
> > >  
> > >  bool
> > > -trees_in::lang_type_bools (tree t)
> > > +trees_in::lang_type_bools (tree t, bits_in& bits)
> > >  {
> > > -#define RB(X) ((X) = b ())
> > > +#define RB(X) ((X) = bits.b ())
> > >    struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
> > >  
> > > +  bits.bflush ();
> > >    RB (lang->has_type_conversion);
> > >    RB (lang->has_copy_ctor);
> > >    RB (lang->has_default_ctor);
> > > @@ -5864,8 +5904,8 @@ trees_in::lang_type_bools (tree t)
> > >    RB (lang->ref_needs_init);
> > >    RB (lang->has_const_copy_assign);
> > >    unsigned v;
> > > -  v = b () << 0;
> > > -  v |= b () << 1;
> > > +  v = bits.b () << 0;
> > > +  v |= bits.b () << 1;
> > >    lang->use_template = v;
> > >  
> > >    RB (lang->has_mutable);
> > > @@ -5877,8 +5917,8 @@ trees_in::lang_type_bools (tree t)
> > >    RB (lang->has_new);
> > >    RB (lang->has_array_new);
> > >  
> > > -  v = b () << 0;
> > > -  v |= b () << 1;
> > > +  v = bits.b () << 0;
> > > +  v |= bits.b () << 1;
> > >    lang->gets_delete = v;
> > >    RB (lang->interface_only);
> > >    RB (lang->interface_unknown);
> > > @@ -7106,18 +7146,19 @@ trees_out::tree_node_bools (tree t)
> > >    gcc_checking_assert (TREE_CODE (t) != NAMESPACE_DECL
> > >  		       || DECL_NAMESPACE_ALIAS (t));
> > >  
> > > -  core_bools (t);
> > > +  bits_out bits (*this);
> > > +  core_bools (t, bits);
> > >  
> > >    switch (TREE_CODE_CLASS (TREE_CODE (t)))
> > >      {
> > >      case tcc_declaration:
> > >        {
> > >  	bool specific = DECL_LANG_SPECIFIC (t) != NULL;
> > > -	b (specific);
> > > +	bits.b (specific);
> > >  	if (specific && VAR_P (t))
> > > -	  b (DECL_DECOMPOSITION_P (t));
> > > +	  bits.b (DECL_DECOMPOSITION_P (t));
> > >  	if (specific)
> > > -	  lang_decl_bools (t);
> > > +	  lang_decl_bools (t, bits);
> > >        }
> > >        break;
> > >  
> > > @@ -7128,9 +7169,9 @@ trees_out::tree_node_bools (tree t)
> > >  	gcc_assert (TYPE_LANG_SPECIFIC (t)
> > >  		    == TYPE_LANG_SPECIFIC (TYPE_MAIN_VARIANT (t)));
> > >  
> > > -	b (specific);
> > > +	bits.b (specific);
> > >  	if (specific)
> > > -	  lang_type_bools (t);
> > > +	  lang_type_bools (t, bits);
> > >        }
> > >        break;
> > >  
> > > @@ -7138,34 +7179,35 @@ trees_out::tree_node_bools (tree t)
> > >        break;
> > >      }
> > >  
> > > -  bflush ();
> > > +  bits.bflush ();
> > >  }
> > >  
> > >  bool
> > >  trees_in::tree_node_bools (tree t)
> > >  {
> > > -  bool ok = core_bools (t);
> > > +  bits_in bits (*this);
> > > +  bool ok = core_bools (t, bits);
> > >  
> > >    if (ok)
> > >      switch (TREE_CODE_CLASS (TREE_CODE (t)))
> > >        {
> > >        case tcc_declaration:
> > > -	if (b ())
> > > +	if (bits.b ())
> > >  	  {
> > > -	    bool decomp = VAR_P (t) && b ();
> > > +	    bool decomp = VAR_P (t) && bits.b ();
> > >  
> > >  	    ok = maybe_add_lang_decl_raw (t, decomp);
> > >  	    if (ok)
> > > -	      ok = lang_decl_bools (t);
> > > -	}
> > > +	      ok = lang_decl_bools (t, bits);
> > > +	  }
> > >  	break;
> > >  
> > >        case tcc_type:
> > > -	if (b ())
> > > +	if (bits.b ())
> > >  	  {
> > >  	    ok = maybe_add_lang_type_raw (t);
> > >  	    if (ok)
> > > -	      ok = lang_type_bools (t);
> > > +	      ok = lang_type_bools (t, bits);
> > >  	  }
> > >  	break;
> > >  
> > > @@ -7173,7 +7215,7 @@ trees_in::tree_node_bools (tree t)
> > >  	break;
> > >        }
> > >  
> > > -  bflush ();
> > > +  bits.bflush ();
> > >    if (!ok || get_overrun ())
> > >      return false;
> > >  
> > > @@ -7696,11 +7738,11 @@ trees_out::decl_value (tree decl, depset *dep)
> > >  
> > >        if (mk != MK_unique)
> > >  	{
> > > +	  bits_out bits (*this);
> > >  	  if (!(mk & MK_template_mask) && !state->is_header ())
> > >  	    {
> > >  	      /* Tell the importer whether this is a global module entity,
> > > -		 or a module entity.  This bool merges into the next block
> > > -		 of bools.  Sneaky.  */
> > > +		 or a module entity.  */
> > >  	      tree o = get_originating_module_decl (decl);
> > >  	      bool is_attached = false;
> > >  
> > > @@ -7709,9 +7751,9 @@ trees_out::decl_value (tree decl, depset *dep)
> > >  		  && DECL_MODULE_ATTACH_P (not_tmpl))
> > >  		is_attached = true;
> > >  
> > > -	      b (is_attached);
> > > +	      bits.b (is_attached);
> > >  	    }
> > > -	  b (dep && dep->has_defn ());
> > > +	  bits.b (dep && dep->has_defn ());
> > >  	}
> > >        tree_node_bools (decl);
> > >      }
> > > @@ -7981,11 +8023,11 @@ trees_in::decl_value ()
> > >      {
> > >        if (mk != MK_unique)
> > >  	{
> > > +	  bits_in bits (*this);
> > >  	  if (!(mk & MK_template_mask) && !state->is_header ())
> > > -	    /* See note in trees_out about where this bool is sequenced.  */
> > > -	    is_attached = b ();
> > > +	    is_attached = bits.b ();
> > >  
> > > -	  has_defn = b ();
> > > +	  has_defn = bits.b ();
> > >  	}
> > >  
> > >        if (!tree_node_bools (decl))
> > > @@ -16682,10 +16724,11 @@ module_state::write_define (bytes_out &sec, const cpp_macro *macro)
> > >  {
> > >    sec.u (macro->count);
> > >  
> > > -  sec.b (macro->fun_like);
> > > -  sec.b (macro->variadic);
> > > -  sec.b (macro->syshdr);
> > > -  sec.bflush ();
> > > +  bits_out bits (sec);
> > > +  bits.b (macro->fun_like);
> > > +  bits.b (macro->variadic);
> > > +  bits.b (macro->syshdr);
> > > +  bits.bflush ();
> > >  
> > >    write_location (sec, macro->line);
> > >    if (macro->fun_like)
> > > @@ -16780,10 +16823,11 @@ module_state::read_define (bytes_in &sec, cpp_reader *reader) const
> > >    macro->kind = cmk_macro;
> > >    macro->imported_p = true;
> > >  
> > > -  macro->fun_like = sec.b ();
> > > -  macro->variadic = sec.b ();
> > > -  macro->syshdr = sec.b ();
> > > -  sec.bflush ();
> > > +  bits_in bits (sec);
> > > +  macro->fun_like = bits.b ();
> > > +  macro->variadic = bits.b ();
> > > +  macro->syshdr = bits.b ();
> > > +  bits.bflush ();
> > >  
> > >    macro->line = read_location (sec);
> > >  
> > > -- 
> > > 2.44.0.rc1.15.g4fc51f00ef
> > > 
> > > 
> > 
>
Jason Merrill April 10, 2024, 1:27 a.m. UTC | #6
On 2/16/24 10:06, Patrick Palka wrote:
> On Thu, 15 Feb 2024, Patrick Palka wrote:
>
>> One would expect consecutive calls to bytes_in/out::b for streaming
>> adjacent bits, as we do for tree flag streaming, to at least be
>> optimized by the compiler into individual bit operations using
>> statically known bit positions (and ideally merged into larger sized
>> reads/writes).

Did you have any thoughts about how feasible it would be to use 
data-streamer.h?  I didn't see a response to richi's mail.

>> Unfortunately this doesn't happen because the compiler has trouble
>> tracking the values of this->bit_pos and this->bit_val across such
>> calls, likely because the compiler doesn't know 'this' and so it's
>> treated as global memory.  This means for each consecutive bit stream
>> operation, bit_pos and bit_val are loaded from memory, checked if
>> buffering is needed, and finally the bit is extracted from bit_val
>> according to the (unknown) bit_pos, even though relative to the previous
>> operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
>> is just 1 larger.  This ends up being quite slow, with tree_node_bools
>> taking 10% of time when streaming in parts of the std module.
>>
>> This patch optimizes this by making tracking of bit_pos and bit_val
>> easier for the compiler.  Rather than bit_pos and bit_val being members
>> of the (effectively global) bytes_in/out objects, this patch factors out
>> the bit streaming code/state into separate classes bits_in/out that get
>> constructed locally as needed for bit streaming.  Since these objects
>> are now clearly local, the compiler can more easily track their values.

Please add this rationale to the bits_in comment.

>> And since bit streaming is intended to be batched it's natural for these
>> new classes to be RAII-enabled such that the bit stream is flushed upon
>> destruction.
>>
>> In order to make the most of this improved tracking of bit position,
>> this patch changes parts where we conditionally stream a tree flag
>> to unconditionally stream (the flag or a dummy value).  That way
>> the number of bits streamed and the respective bit positions are as
>> statically known as reasonably possible.  In lang_decl_bools and
>> lang_type_bools we flush the current bit buffer at the start so that
>> subsequent bit positions are statically known.  And in core bools, we
>> can add explicit early exits utilizing invariants that the compiler
>> can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
>> and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
>> it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
>> than 4 bits, it's more space efficient to stream them as individual
>> bytes rather than as packed bits (due to the 32-bit buffer).
> 
> Oops, this last sentence is wrong.  Although the size of the bit buffer
> is 32 bits, upon flushing we rewind unused bytes within the buffer,
> which means streaming 2-8 bits ends up using only one byte not all four.
> So v2 below undoes this pessimization.
> 
>> This patch also moves the definitions of the relevant streaming classes
>> into anonymous namespaces so that the compiler can make more informed
>> decisions about inlining their member functions.

I'm curious why you decided to put namespace { } around each class 
rather than a larger part of the file?  Not asking for a change, just 
curious.

I'm also surprised that this would make a difference for inline 
functions?  Is this just to allow the compiler to inline member 
functions defined outside the class body without marking them inline?

In any case, please add a rationale comment to the (first) anonymous 
namespace.

>> After this patch, compile time for a simple Hello World using the std
>> module is reduced by 7% with a release compiler.  The on-disk size of
>> the std module increases by 0.7% (presumably due to the extra flushing
>> done in lang_decl_bools and lang_type_bools).
> 
> The on-disk std module now only grows 0.4% instead of 0.7%.
>>
>> The bit stream out performance isn't improved as much as the stream in
>> due to the spans/lengths instrumentation performed on stream out (which
>> probably should be e.g. removed for release builds?)

Based on CHECKING_P, sure.

> diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
> index 0291d456ff5..2ecee007e8a 100644
> --- a/gcc/cp/module.cc
> +++ b/gcc/cp/module.cc
> @@ -694,13 +656,126 @@ protected:
>     /* Instrumentation.  */
>     static unsigned spans[4];
>     static unsigned lengths[4];
> -  static int is_set;
> +  friend struct bits_out;
>   };
> +} // anon namespace
> +
> +/* Finish bit packet.  Rewind the bytes not used.  */

Missing blank line.

> +static unsigned
> +bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
> +{
> +  gcc_assert (bit_pos);
> +  unsigned bytes = (bit_pos + 7) / 8;
> +  bits.unuse (4 - bytes);
> +  bit_pos = 0;
> +  bit_val = 0;
> +  return bytes;
> +}
> +
> @@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
>     if (TREE_CODE_CLASS (code) != tcc_type)
>       /* This is TYPE_CACHED_VALUES_P for types.  */
>       WB (t->base.public_flag);
> +  else
> +    WB (false);

Can we simplify the pattern for conditionally writing/reading?  It looks 
easy to forget to add the else.  Perhaps a COND_WB macro with rationale 
comment?

Jason
Patrick Palka April 10, 2024, 3:25 p.m. UTC | #7
On Tue, 9 Apr 2024, Jason Merrill wrote:

> On 2/16/24 10:06, Patrick Palka wrote:
> > On Thu, 15 Feb 2024, Patrick Palka wrote:
> > 
> > > One would expect consecutive calls to bytes_in/out::b for streaming
> > > adjacent bits, as we do for tree flag streaming, to at least be
> > > optimized by the compiler into individual bit operations using
> > > statically known bit positions (and ideally merged into larger sized
> > > reads/writes).
> 
> Did you have any thoughts about how feasible it would be to use
> data-streamer.h?  I didn't see a response to richi's mail.

IIUC the workhorses of data-streamer.h are

  for streaming out: bitpack_create / bp_pack_value / streamer_write_bitpack
  for streaming in:  streamer_read_bitpack / bp_unpack_value

which use a locally constructed bitpack_d struct for state management,
much like what this patch proposes, which is a sign that this is a good
approach I suppose.

The bit twiddling code is unsurprisingly pretty similar except
data-streamer.h can stream more than one bit at a time whereas
bits_in/out::b from this patch can only handle one bit at a time
(which is by far the common case).  Another difference is that the
data-streamer.h buffer is a HOST_WIDE_INT while the modules bit buffer
is uint32_t (this patch doesn't change that).

Unfortunately it seems data-streamer.h is currently hardcoded for
LTO streaming since bitpack_d::stream must be an lto_input_block and it
uses streamer_write_uhwi_stream and streamer_read_uhwi under the hood.
So we can't use it for modules streaming currently without abstracting
away this hardcoding AFAICT.

> 
> > > Unfortunately this doesn't happen because the compiler has trouble
> > > tracking the values of this->bit_pos and this->bit_val across such
> > > calls, likely because the compiler doesn't know 'this' and so it's
> > > treated as global memory.  This means for each consecutive bit stream
> > > operation, bit_pos and bit_val are loaded from memory, checked if
> > > buffering is needed, and finally the bit is extracted from bit_val
> > > according to the (unknown) bit_pos, even though relative to the previous
> > > operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
> > > is just 1 larger.  This ends up being quite slow, with tree_node_bools
> > > taking 10% of time when streaming in parts of the std module.
> > > 
> > > This patch optimizes this by making tracking of bit_pos and bit_val
> > > easier for the compiler.  Rather than bit_pos and bit_val being members
> > > of the (effectively global) bytes_in/out objects, this patch factors out
> > > the bit streaming code/state into separate classes bits_in/out that get
> > > constructed locally as needed for bit streaming.  Since these objects
> > > are now clearly local, the compiler can more easily track their values.
> 
> Please add this rationale to the bits_in comment.

Will do.

> 
> > > And since bit streaming is intended to be batched it's natural for these
> > > new classes to be RAII-enabled such that the bit stream is flushed upon
> > > destruction.
> > > 
> > > In order to make the most of this improved tracking of bit position,
> > > this patch changes parts where we conditionally stream a tree flag
> > > to unconditionally stream (the flag or a dummy value).  That way
> > > the number of bits streamed and the respective bit positions are as
> > > statically known as reasonably possible.  In lang_decl_bools and
> > > lang_type_bools we flush the current bit buffer at the start so that
> > > subsequent bit positions are statically known.  And in core bools, we
> > > can add explicit early exits utilizing invariants that the compiler
> > > can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
> > > and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
> > > it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
> > > than 4 bits, it's more space efficient to stream them as individual
> > > bytes rather than as packed bits (due to the 32-bit buffer).
> > 
> > Oops, this last sentence is wrong.  Although the size of the bit buffer
> > is 32 bits, upon flushing we rewind unused bytes within the buffer,
> > which means streaming 2-8 bits ends up using only one byte not all four.
> > So v2 below undoes this pessimization.
> > 
> > > This patch also moves the definitions of the relevant streaming classes
> > > into anonymous namespaces so that the compiler can make more informed
> > > decisions about inlining their member functions.
> 
> I'm curious why you decided to put namespace { } around each class rather than
> a larger part of the file?  Not asking for a change, just curious.

I don't feel strongly about i, but to me using a separate namespace { }
for each class is consistent with how we use 'static' instead of
namespace { } to give (consecutively defined) free functions internal
linkage, i.e. instead of

    namespace {
      void f() { }
      void g() { }
    }

we do

   static void f() { }
   static void g() { }

Using a separate namespace { } for each class is the closest thing to
'static' for types.  And it makes it obvious whether a class is TU-local
or not.

> 
> I'm also surprised that this would make a difference for inline functions?  Is
> this just to allow the compiler to inline member functions defined outside the
> class body without marking them inline?

The motivation was initially to help ensure trees_in/out::core_bools,
::lang_type_bools and ::lang_decl_bools get inlined into their only
caller ::tree_node_bools.  This is crucial because these functions take
bits_in/out parameters by reference and if they're not inlined then the
compiler can't track the bit state, and we generate bad code again with
a buffering check after every single bit read/write.

Only by giving them internal linkage can the compiler see they have just
one caller, which guarantees they get inlined.  They're otherwise fairly
large functions which are not clearly profitable to inline.

And I reckoned it's good code hygiene to give TU-local types internal
linkage much like how we declare TU-local free functions 'static' so
I gave the base classes of trees_in/out internal linkage as well.

> 
> In any case, please add a rationale comment to the (first) anonymous
> namespace.

Sure, I opted to add a rationale to trees_in since it and trees_out are
the classes that most benefit from this change.

> 
> > > After this patch, compile time for a simple Hello World using the std
> > > module is reduced by 7% with a release compiler.  The on-disk size of
> > > the std module increases by 0.7% (presumably due to the extra flushing
> > > done in lang_decl_bools and lang_type_bools).
> > 
> > The on-disk std module now only grows 0.4% instead of 0.7%.
> > > 
> > > The bit stream out performance isn't improved as much as the stream in
> > > due to the spans/lengths instrumentation performed on stream out (which
> > > probably should be e.g. removed for release builds?)
> 
> Based on CHECKING_P, sure.

I can do that in a follow-up patch.

> 
> > diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
> > index 0291d456ff5..2ecee007e8a 100644
> > --- a/gcc/cp/module.cc
> > +++ b/gcc/cp/module.cc
> > @@ -694,13 +656,126 @@ protected:
> >     /* Instrumentation.  */
> >     static unsigned spans[4];
> >     static unsigned lengths[4];
> > -  static int is_set;
> > +  friend struct bits_out;
> >   };
> > +} // anon namespace
> > +
> > +/* Finish bit packet.  Rewind the bytes not used.  */
> 
> Missing blank line.

Fixed.

> 
> > +static unsigned
> > +bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
> > +{
> > +  gcc_assert (bit_pos);
> > +  unsigned bytes = (bit_pos + 7) / 8;
> > +  bits.unuse (4 - bytes);
> > +  bit_pos = 0;
> > +  bit_val = 0;
> > +  return bytes;
> > +}
> > +
> > @@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
> >     if (TREE_CODE_CLASS (code) != tcc_type)
> >       /* This is TYPE_CACHED_VALUES_P for types.  */
> >       WB (t->base.public_flag);
> > +  else
> > +    WB (false);
> 
> Can we simplify the pattern for conditionally writing/reading?  It looks easy
> to forget to add the else.  Perhaps a COND_WB macro with rationale comment?

Fixed.

Here's an incremental diff of the changes.  Will send updated patch as a
follow-up email.

-- >8 --

diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
index cb3803a6a12..765d7dde715 100644
--- a/gcc/cp/module.cc
+++ b/gcc/cp/module.cc
@@ -661,6 +661,7 @@ protected:
 } // anon namespace
 
 /* Finish bit packet.  Rewind the bytes not used.  */
+
 static unsigned
 bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
 {
@@ -679,7 +680,12 @@ bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
    When reading, we don't know how many bools we'll read in.  So read
    4 bytes-worth, and then rewind when flushing if we didn't need them
    all.  You can't have a block of bools closer than 4 bytes to the
-   end of the buffer.  */
+   end of the buffer.
+
+   Both bits_in and bits_out maintain the necessary state for bit packing,
+   and since these objects are locally constructed the compiler can more
+   easily track their state across consecutive reads/writes and yield
+   optimized code with a minimal number of buffering checks.  */
 
 namespace {
 struct bits_in {
@@ -2839,7 +2845,11 @@ struct post_process_data {
 /* Tree stream reader.  Note that reading a stream doesn't mark the
    read trees with TREE_VISITED.  Thus it's quite safe to have
    multiple concurrent readers.  Which is good, because lazy
-   loading. */
+   loading.
+
+   It's important that trees_in/out have internal linkage so that the compiler
+   knows core_bools, lang_type_bools and lang_decl_bools have only a single
+   caller (tree_node_bools) and inlines them appropriately.  */
 namespace {
 class trees_in : public bytes_in {
   typedef bytes_in parent;
@@ -5282,6 +5292,10 @@ void
 trees_out::core_bools (tree t, bits_out& bits)
 {
 #define WB(X) (bits.b (X))
+/* Stream X if COND holds, and if !COND stream a dummy value so that the
+   overall number of bits streamed is independent of the runtime value
+   of COND, which allows the compiler to optimize this function better.  */
+#define COND_WB(COND, X) WB ((COND) ? (X) : false)
   tree_code code = TREE_CODE (t);
 
   WB (t->base.side_effects_flag);
@@ -5298,11 +5312,8 @@ trees_out::core_bools (tree t, bits_out& bits)
      decls they use.  */
   WB (t->base.nothrow_flag);
   WB (t->base.static_flag);
-  if (TREE_CODE_CLASS (code) != tcc_type)
-    /* This is TYPE_CACHED_VALUES_P for types.  */
-    WB (t->base.public_flag);
-  else
-    WB (false);
+  /* This is TYPE_CACHED_VALUES_P for types.  */
+  COND_WB (TREE_CODE_CLASS (code) != tcc_type, t->base.public_flag);
   WB (t->base.private_flag);
   WB (t->base.protected_flag);
   WB (t->base.deprecated_flag);
@@ -5499,6 +5510,8 @@ bool
 trees_in::core_bools (tree t, bits_in& bits)
 {
 #define RB(X) ((X) = bits.b ())
+/* See the comment for COND_WB in trees_out::core_bools.  */
+#define COND_RB(COND, X) ((COND) ? RB (X) : bits.b ())
 
   tree_code code = TREE_CODE (t);
 
@@ -5513,10 +5526,7 @@ trees_in::core_bools (tree t, bits_in& bits)
   /* base.used_flag is not streamed.  */
   RB (t->base.nothrow_flag);
   RB (t->base.static_flag);
-  if (TREE_CODE_CLASS (code) != tcc_type)
-    RB (t->base.public_flag);
-  else
-    bits.b ();
+  COND_RB (TREE_CODE_CLASS (code) != tcc_type, t->base.public_flag);
   RB (t->base.private_flag);
   RB (t->base.protected_flag);
   RB (t->base.deprecated_flag);
Patrick Palka April 10, 2024, 3:26 p.m. UTC | #8
On Wed, 10 Apr 2024, Patrick Palka wrote:

> 
> On Tue, 9 Apr 2024, Jason Merrill wrote:
> 
> > On 2/16/24 10:06, Patrick Palka wrote:
> > > On Thu, 15 Feb 2024, Patrick Palka wrote:
> > > 
> > > > One would expect consecutive calls to bytes_in/out::b for streaming
> > > > adjacent bits, as we do for tree flag streaming, to at least be
> > > > optimized by the compiler into individual bit operations using
> > > > statically known bit positions (and ideally merged into larger sized
> > > > reads/writes).
> > 
> > Did you have any thoughts about how feasible it would be to use
> > data-streamer.h?  I didn't see a response to richi's mail.
> 
> IIUC the workhorses of data-streamer.h are
> 
>   for streaming out: bitpack_create / bp_pack_value / streamer_write_bitpack
>   for streaming in:  streamer_read_bitpack / bp_unpack_value
> 
> which use a locally constructed bitpack_d struct for state management,
> much like what this patch proposes, which is a sign that this is a good
> approach I suppose.
> 
> The bit twiddling code is unsurprisingly pretty similar except
> data-streamer.h can stream more than one bit at a time whereas
> bits_in/out::b from this patch can only handle one bit at a time
> (which is by far the common case).  Another difference is that the
> data-streamer.h buffer is a HOST_WIDE_INT while the modules bit buffer
> is uint32_t (this patch doesn't change that).
> 
> Unfortunately it seems data-streamer.h is currently hardcoded for
> LTO streaming since bitpack_d::stream must be an lto_input_block and it
> uses streamer_write_uhwi_stream and streamer_read_uhwi under the hood.
> So we can't use it for modules streaming currently without abstracting
> away this hardcoding AFAICT.
> 
> > 
> > > > Unfortunately this doesn't happen because the compiler has trouble
> > > > tracking the values of this->bit_pos and this->bit_val across such
> > > > calls, likely because the compiler doesn't know 'this' and so it's
> > > > treated as global memory.  This means for each consecutive bit stream
> > > > operation, bit_pos and bit_val are loaded from memory, checked if
> > > > buffering is needed, and finally the bit is extracted from bit_val
> > > > according to the (unknown) bit_pos, even though relative to the previous
> > > > operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
> > > > is just 1 larger.  This ends up being quite slow, with tree_node_bools
> > > > taking 10% of time when streaming in parts of the std module.
> > > > 
> > > > This patch optimizes this by making tracking of bit_pos and bit_val
> > > > easier for the compiler.  Rather than bit_pos and bit_val being members
> > > > of the (effectively global) bytes_in/out objects, this patch factors out
> > > > the bit streaming code/state into separate classes bits_in/out that get
> > > > constructed locally as needed for bit streaming.  Since these objects
> > > > are now clearly local, the compiler can more easily track their values.
> > 
> > Please add this rationale to the bits_in comment.
> 
> Will do.
> 
> > 
> > > > And since bit streaming is intended to be batched it's natural for these
> > > > new classes to be RAII-enabled such that the bit stream is flushed upon
> > > > destruction.
> > > > 
> > > > In order to make the most of this improved tracking of bit position,
> > > > this patch changes parts where we conditionally stream a tree flag
> > > > to unconditionally stream (the flag or a dummy value).  That way
> > > > the number of bits streamed and the respective bit positions are as
> > > > statically known as reasonably possible.  In lang_decl_bools and
> > > > lang_type_bools we flush the current bit buffer at the start so that
> > > > subsequent bit positions are statically known.  And in core bools, we
> > > > can add explicit early exits utilizing invariants that the compiler
> > > > can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
> > > > and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
> > > > it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
> > > > than 4 bits, it's more space efficient to stream them as individual
> > > > bytes rather than as packed bits (due to the 32-bit buffer).
> > > 
> > > Oops, this last sentence is wrong.  Although the size of the bit buffer
> > > is 32 bits, upon flushing we rewind unused bytes within the buffer,
> > > which means streaming 2-8 bits ends up using only one byte not all four.
> > > So v2 below undoes this pessimization.
> > > 
> > > > This patch also moves the definitions of the relevant streaming classes
> > > > into anonymous namespaces so that the compiler can make more informed
> > > > decisions about inlining their member functions.
> > 
> > I'm curious why you decided to put namespace { } around each class rather than
> > a larger part of the file?  Not asking for a change, just curious.
> 
> I don't feel strongly about i, but to me using a separate namespace { }
> for each class is consistent with how we use 'static' instead of
> namespace { } to give (consecutively defined) free functions internal
> linkage, i.e. instead of
> 
>     namespace {
>       void f() { }
>       void g() { }
>     }
> 
> we do
> 
>    static void f() { }
>    static void g() { }
> 
> Using a separate namespace { } for each class is the closest thing to
> 'static' for types.  And it makes it obvious whether a class is TU-local
> or not.
> 
> > 
> > I'm also surprised that this would make a difference for inline functions?  Is
> > this just to allow the compiler to inline member functions defined outside the
> > class body without marking them inline?
> 
> The motivation was initially to help ensure trees_in/out::core_bools,
> ::lang_type_bools and ::lang_decl_bools get inlined into their only
> caller ::tree_node_bools.  This is crucial because these functions take
> bits_in/out parameters by reference and if they're not inlined then the
> compiler can't track the bit state, and we generate bad code again with
> a buffering check after every single bit read/write.
> 
> Only by giving them internal linkage can the compiler see they have just
> one caller, which guarantees they get inlined.  They're otherwise fairly
> large functions which are not clearly profitable to inline.
> 
> And I reckoned it's good code hygiene to give TU-local types internal
> linkage much like how we declare TU-local free functions 'static' so
> I gave the base classes of trees_in/out internal linkage as well.
> 
> > 
> > In any case, please add a rationale comment to the (first) anonymous
> > namespace.
> 
> Sure, I opted to add a rationale to trees_in since it and trees_out are
> the classes that most benefit from this change.
> 
> > 
> > > > After this patch, compile time for a simple Hello World using the std
> > > > module is reduced by 7% with a release compiler.  The on-disk size of
> > > > the std module increases by 0.7% (presumably due to the extra flushing
> > > > done in lang_decl_bools and lang_type_bools).
> > > 
> > > The on-disk std module now only grows 0.4% instead of 0.7%.
> > > > 
> > > > The bit stream out performance isn't improved as much as the stream in
> > > > due to the spans/lengths instrumentation performed on stream out (which
> > > > probably should be e.g. removed for release builds?)
> > 
> > Based on CHECKING_P, sure.
> 
> I can do that in a follow-up patch.
> 
> > 
> > > diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
> > > index 0291d456ff5..2ecee007e8a 100644
> > > --- a/gcc/cp/module.cc
> > > +++ b/gcc/cp/module.cc
> > > @@ -694,13 +656,126 @@ protected:
> > >     /* Instrumentation.  */
> > >     static unsigned spans[4];
> > >     static unsigned lengths[4];
> > > -  static int is_set;
> > > +  friend struct bits_out;
> > >   };
> > > +} // anon namespace
> > > +
> > > +/* Finish bit packet.  Rewind the bytes not used.  */
> > 
> > Missing blank line.
> 
> Fixed.
> 
> > 
> > > +static unsigned
> > > +bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
> > > +{
> > > +  gcc_assert (bit_pos);
> > > +  unsigned bytes = (bit_pos + 7) / 8;
> > > +  bits.unuse (4 - bytes);
> > > +  bit_pos = 0;
> > > +  bit_val = 0;
> > > +  return bytes;
> > > +}
> > > +
> > > @@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
> > >     if (TREE_CODE_CLASS (code) != tcc_type)
> > >       /* This is TYPE_CACHED_VALUES_P for types.  */
> > >       WB (t->base.public_flag);
> > > +  else
> > > +    WB (false);
> > 
> > Can we simplify the pattern for conditionally writing/reading?  It looks easy
> > to forget to add the else.  Perhaps a COND_WB macro with rationale comment?
> 
> Fixed.
> 
> Here's an incremental diff of the changes.  Will send updated patch as a
> follow-up email.

Updated patch:

Subject: [PATCH] c++/modules: optimize tree flag streaming

One would expect consecutive calls to bytes_in/out::b for streaming
adjacent bits, as we do for tree flag streaming, to at least be
optimized by the compiler into individual bit operations using
statically known bit positions (and ideally combined into larger sized
reads/writes).

Unfortunately this doesn't happen because the compiler has trouble
tracking the values of this->bit_pos and this->bit_val across the
calls, likely because the compiler doesn't know the value of 'this'.
Thus for each consecutive bit stream operation, bit_pos and bit_val are
loaded from 'this', checked if buffering is needed, and finally the bit
is extracted from bit_val according to the (unknown) bit_pos, even
though relative to the previous operation (if we didn't need to buffer)
bit_val is unchanged and bit_pos is just 1 larger.  This ends up being
quite slow, with tree_node_bools taking 10% of time when streaming in
the std module.

This patch optimizes this by making tracking of bit_pos and bit_val
easier for the compiler.  Rather than bit_pos and bit_val being members
of the (effectively global) bytes_in/out objects, this patch factors out
the bit streaming code/state into separate classes bits_in/out that get
constructed locally as needed for bit streaming.  Since these objects
are now clearly local, the compiler can more easily track their values.

And since bit streaming is intended to be batched it's natural for these
new classes to be RAII-enabled so that the bit stream is flushed upon
destruction.

In order to make the most of this improved tracking of bit position,
this patch changes parts where we conditionally stream a tree flag
to unconditionally stream (the flag or a dummy value).  That way
the number of bits streamed and the respective bit positions are as
statically known as reasonably possible.  In lang_decl_bools and
lang_type_bools this patch makes us flush the current bit buffer at the
start so that subsequent bit positions are in turn statically known.
And in core_bools, we can add explicit early exits utilizing invariants
that the compiler can't figure out itself (e.g. a tree code can't have
both TS_TYPE_COMMON and TS_DECL_COMMON, and if a tree code doesn't have
TS_DECL_COMMON then it doesn't have TS_DECL_WITH_VIS).

This patch also moves the definitions of the relevant streaming classes
into anonymous namespaces so that the compiler can make more informed
decisions about inlining their member functions.

After this patch, compile time for a simple Hello World using the std
module is reduced by 7% with a release compiler.  The on-disk size of
the std module increases by 0.4% (presumably due to the extra flushing
done in lang_decl_bools and lang_type_bools).

The bit stream out performance isn't improved as much as the stream in
due to the spans/lengths instrumentation performed on stream out (which
maybe should be disabled for release builds?)

gcc/cp/ChangeLog:

	* module.cc: Update comment about classes defined within.
	(class data): Enclose in an anonymous namespace.
	(data::calc_crc): Moved from bytes::calc_crc.
	(class bytes): Remove.  Move bit_flush to namespace scope.
	(class bytes_in): Enclose in an anonymous namespace.  Inherit
	directly from data and adjust accordingly.  Move b and bflush
	members to bits_in.
	(class bytes_out): As above.  Remove is_set static data member.
	(bit_flush): Moved from class bytes.
	(struct bits_in): Define.
	(struct bits_out): Define.
	(bytes_out::bflush): Moved to bits_out/in.
	(bytes_in::bflush): Likewise
	(bytes_in::bfill): Removed.
	(bytes_out::b): Moved to bits_out/in.
	(bytes_in::b): Likewise.
	(class trees_in): Enclose in an anonymous namespace.
	(class trees_out): Enclose in an anonymous namespace.
	(trees_out::core_bools): Add bits_out/in parameter and use it.
	Unconditionally stream a bit for public_flag.  Add early exits
	as appropriate.
	(trees_out::core_bools): Likewise.
	(trees_out::lang_decl_bools): Add bits_out/in parameter and use
	it.  Flush the current bit buffer at the start.  Unconditionally
	stream a bit for module_keyed_decls_p.
	(trees_in::lang_decl_bools): Likewise.
	(trees_out::lang_type_bools): Add bits_out/in parameter and use
	it.  Flush the current bit buffer at the start.
	(trees_in::lang_type_bools): Likewise.
	(trees_out::tree_node_bools): Construct a bits_out object and
	use/pass it.
	(trees_in::tree_node_bools): Likewise.
	(trees_out::decl_value): Likewise.
	(trees_in::decl_value): Likewise.
	(module_state::write_define): Likewise.
	(module_state::read_define): Likewise.
---
 gcc/cp/module.cc | 436 ++++++++++++++++++++++++++---------------------
 1 file changed, 243 insertions(+), 193 deletions(-)

diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
index 1be024cedd2..765d7dde715 100644
--- a/gcc/cp/module.cc
+++ b/gcc/cp/module.cc
@@ -153,9 +153,11 @@ Classes used:
 
    data - buffer
 
-   bytes - data streamer
-   bytes_in : bytes - scalar reader
-   bytes_out : bytes - scalar writer
+   bytes_in - scalar reader
+   bytes_out - scalar writer
+
+   bits_in - bit stream reader
+   bits_out - bit stream writer
 
    elf - ELROND format
    elf_in : elf - ELROND reader
@@ -346,6 +348,7 @@ typedef hash_map<void *,signed,ptr_int_traits> ptr_int_hash_map;
 
 /* Variable length buffer.  */
 
+namespace {
 class data {
 public:
   class allocator {
@@ -393,6 +396,8 @@ protected:
     return res;
   }
 
+  unsigned calc_crc (unsigned) const;
+
 public:
   void unuse (unsigned count)
   {
@@ -402,6 +407,7 @@ public:
 public:
   static allocator simple_memory;
 };
+} // anon namespace
 
 /* The simple data allocator.  */
 data::allocator data::simple_memory;
@@ -447,46 +453,11 @@ data::allocator::shrink (char *ptr)
   XDELETEVEC (ptr);
 }
 
-/* Byte streamer base.   Buffer with read/write position and smarts
-   for single bits.  */
-
-class bytes : public data {
-public:
-  typedef data parent;
-
-protected:
-  uint32_t bit_val;	/* Bit buffer.  */
-  unsigned bit_pos;	/* Next bit in bit buffer.  */
-
-public:
-  bytes ()
-    :parent (), bit_val (0), bit_pos (0)
-  {}
-  ~bytes () 
-  {
-  }
-
-protected:
-  unsigned calc_crc (unsigned) const;
-
-protected:
-  /* Finish bit packet.  Rewind the bytes not used.  */
-  unsigned bit_flush ()
-  {
-    gcc_assert (bit_pos);
-    unsigned bytes = (bit_pos + 7) / 8;
-    unuse (4 - bytes);
-    bit_pos = 0;
-    bit_val = 0;
-    return bytes;
-  }
-};
-
 /* Calculate the crc32 of the buffer.  Note the CRC is stored in the
    first 4 bytes, so don't include them.  */
 
 unsigned
-bytes::calc_crc (unsigned l) const
+data::calc_crc (unsigned l) const
 {
   return crc32 (0, (unsigned char *)buffer + 4, l - 4);
 }
@@ -495,8 +466,9 @@ class elf_in;
 
 /* Byte stream reader.  */
 
-class bytes_in : public bytes {
-  typedef bytes parent;
+namespace {
+class bytes_in : public data {
+  typedef data parent;
 
 protected:
   bool overrun;  /* Sticky read-too-much flag.  */
@@ -531,7 +503,6 @@ public:
     if (offset > size)
       set_overrun ();
     pos = offset;
-    bit_pos = bit_val = 0;
   }
 
 public:
@@ -573,14 +544,7 @@ public:
   unsigned u32 ();  	/* Read uncompressed integer.  */
 
 public:
-  bool b ();	    	/* Read a bool.  */
-  void bflush ();	/* Completed a block of bools.  */
-
-private:
-  void bfill ();	/* Get the next block of bools.  */
-
-public:
-  int c ();		/* Read a char.  */
+  int c () ATTRIBUTE_UNUSED;		/* Read a char.  */
   int i ();		/* Read a signed int.  */
   unsigned u ();	/* Read an unsigned int.  */
   size_t z ();		/* Read a size_t.  */
@@ -590,6 +554,7 @@ public:
   const void *buf (size_t); /* Read a fixed-length buffer.  */
   cpp_hashnode *cpp_node (); /* Read a cpp node.  */
 };
+} // anon namespace
 
 /* Verify the buffer's CRC is correct.  */
 
@@ -610,8 +575,9 @@ class elf_out;
 
 /* Byte stream writer.  */
 
-class bytes_out : public bytes {
-  typedef bytes parent;
+namespace {
+class bytes_out : public data {
+  typedef data parent;
 
 public:
   allocator *memory;	/* Obtainer of memory.  */
@@ -659,11 +625,7 @@ public:
   void u32 (unsigned);  /* Write uncompressed integer.  */
 
 public:
-  void b (bool);	/* Write bool.  */
-  void bflush ();	/* Finish block of bools.  */
-
-public:
-  void c (unsigned char); /* Write unsigned char.  */
+  void c (unsigned char) ATTRIBUTE_UNUSED; /* Write unsigned char.  */
   void i (int);		/* Write signed int.  */
   void u (unsigned);	/* Write unsigned int.  */
   void z (size_t s);	/* Write size_t.  */
@@ -694,13 +656,132 @@ protected:
   /* Instrumentation.  */
   static unsigned spans[4];
   static unsigned lengths[4];
-  static int is_set;
+  friend struct bits_out;
+};
+} // anon namespace
+
+/* Finish bit packet.  Rewind the bytes not used.  */
+
+static unsigned
+bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
+{
+  gcc_assert (bit_pos);
+  unsigned bytes = (bit_pos + 7) / 8;
+  bits.unuse (4 - bytes);
+  bit_pos = 0;
+  bit_val = 0;
+  return bytes;
+}
+
+/* Bit stream reader (RAII-enabled).  Bools are packed into bytes.  You
+   cannot mix bools and non-bools.  Use bflush to flush the current stream
+   of bools on demand.  Upon destruction bflush is called.
+
+   When reading, we don't know how many bools we'll read in.  So read
+   4 bytes-worth, and then rewind when flushing if we didn't need them
+   all.  You can't have a block of bools closer than 4 bytes to the
+   end of the buffer.
+
+   Both bits_in and bits_out maintain the necessary state for bit packing,
+   and since these objects are locally constructed the compiler can more
+   easily track their state across consecutive reads/writes and yield
+   optimized code with a minimal number of buffering checks.  */
+
+namespace {
+struct bits_in {
+  bytes_in& in;
+  uint32_t bit_val = 0;
+  unsigned bit_pos = 0;
+
+  bits_in (bytes_in& in)
+    : in (in)
+  { }
+
+  ~bits_in ()
+  {
+    bflush ();
+  }
+
+  bits_in(const bits_in&) = delete;
+  bits_in& operator=(const bits_in&) = delete;
+
+  /* Completed a block of bools.  */
+  void bflush ()
+  {
+    if (bit_pos)
+      bit_flush (in, bit_val, bit_pos);
+  }
+
+  /* Read one bit.  */
+  bool b ()
+  {
+    if (!bit_pos)
+      bit_val = in.u32 ();
+    bool x = (bit_val >> bit_pos) & 1;
+    bit_pos = (bit_pos + 1) % 32;
+    return x;
+  }
+};
+} // anon namespace
+
+/* Bit stream writer (RAII-enabled), counterpart to bits_in.  */
+
+namespace {
+struct bits_out {
+  bytes_out& out;
+  uint32_t bit_val = 0;
+  unsigned bit_pos = 0;
+  char is_set = -1;
+
+  bits_out (bytes_out& out)
+    : out (out)
+  { }
+
+  ~bits_out ()
+  {
+    bflush ();
+  }
+
+  bits_out(const bits_out&) = delete;
+  bits_out& operator=(const bits_out&) = delete;
+
+  /* Completed a block of bools.  */
+  void bflush ()
+  {
+    if (bit_pos)
+      {
+	out.u32 (bit_val);
+	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
+      }
+    out.spans[2]++;
+    is_set = -1;
+  }
+
+  /* Write one bit.
+
+     It may be worth optimizing for most bools being zero.  Some kind of
+     run-length encoding?  */
+  void b (bool x)
+  {
+    if (is_set != x)
+      {
+	is_set = x;
+	out.spans[x]++;
+      }
+    out.lengths[x]++;
+    bit_val |= unsigned (x) << bit_pos++;
+    if (bit_pos == 32)
+      {
+	out.u32 (bit_val);
+	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
+      }
+  }
 };
+} // anon namespace
 
 /* Instrumentation.  */
 unsigned bytes_out::spans[4];
 unsigned bytes_out::lengths[4];
-int bytes_out::is_set = -1;
 
 /* If CRC_PTR non-null, set the CRC of the buffer.  Mix the CRC into
    that pointed to by CRC_PTR.  */
@@ -723,73 +804,6 @@ bytes_out::set_crc (unsigned *crc_ptr)
     }
 }
 
-/* Finish a set of bools.  */
-
-void
-bytes_out::bflush ()
-{
-  if (bit_pos)
-    {
-      u32 (bit_val);
-      lengths[2] += bit_flush ();
-    }
-  spans[2]++;
-  is_set = -1;
-}
-
-void
-bytes_in::bflush ()
-{
-  if (bit_pos)
-    bit_flush ();
-}
-
-/* When reading, we don't know how many bools we'll read in.  So read
-   4 bytes-worth, and then rewind when flushing if we didn't need them
-   all.  You can't have a block of bools closer than 4 bytes to the
-   end of the buffer.  */
-
-void
-bytes_in::bfill ()
-{
-  bit_val = u32 ();
-}
-
-/* Bools are packed into bytes.  You cannot mix bools and non-bools.
-   You must call bflush before emitting another type.  So batch your
-   bools.
-
-   It may be worth optimizing for most bools being zero.  Some kind of
-   run-length encoding?  */
-
-void
-bytes_out::b (bool x)
-{
-  if (is_set != x)
-    {
-      is_set = x;
-      spans[x]++;
-    }
-  lengths[x]++;
-  bit_val |= unsigned (x) << bit_pos++;
-  if (bit_pos == 32)
-    {
-      u32 (bit_val);
-      lengths[2] += bit_flush ();
-    }
-}
-
-bool
-bytes_in::b ()
-{
-  if (!bit_pos)
-    bfill ();
-  bool v = (bit_val >> bit_pos++) & 1;
-  if (bit_pos == 32)
-    bit_flush ();
-  return v;
-}
-
 /* Exactly 4 bytes.  Used internally for bool packing and a few other
    places.  We can't simply use uint32_t because (a) alignment and
    (b) we need little-endian for the bool streaming rewinding to make
@@ -2831,7 +2845,12 @@ struct post_process_data {
 /* Tree stream reader.  Note that reading a stream doesn't mark the
    read trees with TREE_VISITED.  Thus it's quite safe to have
    multiple concurrent readers.  Which is good, because lazy
-   loading. */
+   loading.
+
+   It's important that trees_in/out have internal linkage so that the compiler
+   knows core_bools, lang_type_bools and lang_decl_bools have only a single
+   caller (tree_node_bools) and inlines them appropriately.  */
+namespace {
 class trees_in : public bytes_in {
   typedef bytes_in parent;
 
@@ -2856,15 +2875,15 @@ private:
 
 public:
   /* Needed for binfo writing  */
-  bool core_bools (tree);
+  bool core_bools (tree, bits_in&);
 
 private:
   /* Stream tree_core, lang_decl_specific and lang_type_specific
      bits.  */
   bool core_vals (tree);
-  bool lang_type_bools (tree);
+  bool lang_type_bools (tree, bits_in&);
   bool lang_type_vals (tree);
-  bool lang_decl_bools (tree);
+  bool lang_decl_bools (tree, bits_in&);
   bool lang_decl_vals (tree);
   bool lang_vals (tree);
   bool tree_node_bools (tree);
@@ -2952,6 +2971,7 @@ private:
 private:
   void assert_definition (tree, bool installing);
 };
+} // anon namespace
 
 trees_in::trees_in (module_state *state)
   :parent (), state (state), unused (0)
@@ -2969,6 +2989,7 @@ trees_in::~trees_in ()
 }
 
 /* Tree stream writer.  */
+namespace {
 class trees_out : public bytes_out {
   typedef bytes_out parent;
 
@@ -3030,11 +3051,11 @@ public:
   }
 
 private:
-  void core_bools (tree);
+  void core_bools (tree, bits_out&);
   void core_vals (tree);
-  void lang_type_bools (tree);
+  void lang_type_bools (tree, bits_out&);
   void lang_type_vals (tree);
-  void lang_decl_bools (tree);
+  void lang_decl_bools (tree, bits_out&);
   void lang_decl_vals (tree);
   void lang_vals (tree);
   void tree_node_bools (tree);
@@ -3114,6 +3135,7 @@ private:
   static unsigned back_ref_count;
   static unsigned null_count;
 };
+} // anon namespace
 
 /* Instrumentation counters.  */
 unsigned trees_out::tree_val_count;
@@ -5267,9 +5289,13 @@ trees_in::start (unsigned code)
 /* Read & write the core boolean flags.  */
 
 void
-trees_out::core_bools (tree t)
+trees_out::core_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
+/* Stream X if COND holds, and if !COND stream a dummy value so that the
+   overall number of bits streamed is independent of the runtime value
+   of COND, which allows the compiler to optimize this function better.  */
+#define COND_WB(COND, X) WB ((COND) ? (X) : false)
   tree_code code = TREE_CODE (t);
 
   WB (t->base.side_effects_flag);
@@ -5286,9 +5312,8 @@ trees_out::core_bools (tree t)
      decls they use.  */
   WB (t->base.nothrow_flag);
   WB (t->base.static_flag);
-  if (TREE_CODE_CLASS (code) != tcc_type)
-    /* This is TYPE_CACHED_VALUES_P for types.  */
-    WB (t->base.public_flag);
+  /* This is TYPE_CACHED_VALUES_P for types.  */
+  COND_WB (TREE_CODE_CLASS (code) != tcc_type, t->base.public_flag);
   WB (t->base.private_flag);
   WB (t->base.protected_flag);
   WB (t->base.deprecated_flag);
@@ -5302,7 +5327,7 @@ trees_out::core_bools (tree t)
     case TARGET_MEM_REF:
     case TREE_VEC:
       /* These use different base.u fields.  */
-      break;
+      return;
 
     default:
       WB (t->base.u.bits.lang_flag_0);
@@ -5335,7 +5360,7 @@ trees_out::core_bools (tree t)
       break;
     }
 
-  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
+  if (TREE_CODE_CLASS (code) == tcc_type)
     {
       WB (t->type_common.no_force_blk_flag);
       WB (t->type_common.needs_constructing_flag);
@@ -5352,6 +5377,9 @@ trees_out::core_bools (tree t)
       WB (t->type_common.typeless_storage);
     }
 
+  if (TREE_CODE_CLASS (code) != tcc_declaration)
+    return;
+
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
     {
       WB (t->decl_common.nonlocal_flag);
@@ -5425,6 +5453,8 @@ trees_out::core_bools (tree t)
       WB (t->decl_common.decl_nonshareable_flag);
       WB (t->decl_common.decl_not_flexarray);
     }
+  else
+    return;
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
     {
@@ -5445,6 +5475,8 @@ trees_out::core_bools (tree t)
       WB (t->decl_with_vis.final);
       WB (t->decl_with_vis.regdecl_flag);
     }
+  else
+    return;
 
   if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
     {
@@ -5475,9 +5507,12 @@ trees_out::core_bools (tree t)
 }
 
 bool
-trees_in::core_bools (tree t)
+trees_in::core_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
+/* See the comment for COND_WB in trees_out::core_bools.  */
+#define COND_RB(COND, X) ((COND) ? RB (X) : bits.b ())
+
   tree_code code = TREE_CODE (t);
 
   RB (t->base.side_effects_flag);
@@ -5491,8 +5526,7 @@ trees_in::core_bools (tree t)
   /* base.used_flag is not streamed.  */
   RB (t->base.nothrow_flag);
   RB (t->base.static_flag);
-  if (TREE_CODE_CLASS (code) != tcc_type)
-    RB (t->base.public_flag);
+  COND_RB (TREE_CODE_CLASS (code) != tcc_type, t->base.public_flag);
   RB (t->base.private_flag);
   RB (t->base.protected_flag);
   RB (t->base.deprecated_flag);
@@ -5506,7 +5540,7 @@ trees_in::core_bools (tree t)
     case TARGET_MEM_REF:
     case TREE_VEC:
       /* These use different base.u fields.  */
-      break;
+      goto done;
 
     default:
       RB (t->base.u.bits.lang_flag_0);
@@ -5526,7 +5560,7 @@ trees_in::core_bools (tree t)
       break;
     }
 
-  if (CODE_CONTAINS_STRUCT (code, TS_TYPE_COMMON))
+  if (TREE_CODE_CLASS (code) == tcc_type)
     {
       RB (t->type_common.no_force_blk_flag);
       RB (t->type_common.needs_constructing_flag);
@@ -5543,6 +5577,9 @@ trees_in::core_bools (tree t)
       RB (t->type_common.typeless_storage);
     }
 
+  if (TREE_CODE_CLASS (code) != tcc_declaration)
+    goto done;
+
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
     {
       RB (t->decl_common.nonlocal_flag);
@@ -5571,6 +5608,8 @@ trees_in::core_bools (tree t)
       RB (t->decl_common.decl_nonshareable_flag);
       RB (t->decl_common.decl_not_flexarray);
     }
+  else
+    goto done;
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
     {
@@ -5591,6 +5630,8 @@ trees_in::core_bools (tree t)
       RB (t->decl_with_vis.final);
       RB (t->decl_with_vis.regdecl_flag);
     }
+  else
+    goto done;
 
   if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
     {
@@ -5614,20 +5655,22 @@ trees_in::core_bools (tree t)
 
       /* decl_type is a (misnamed) 2 bit discriminator.	 */
       unsigned kind = 0;
-      kind |= unsigned (b ()) << 0;
-      kind |= unsigned (b ()) << 1;
+      kind |= unsigned (bits.b ()) << 0;
+      kind |= unsigned (bits.b ()) << 1;
       t->function_decl.decl_type = function_decl_type (kind);
     }
 #undef RB
+done:
   return !get_overrun ();
 }
 
 void
-trees_out::lang_decl_bools (tree t)
+trees_out::lang_decl_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
   const struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   WB (lang->u.base.language == lang_cplusplus);
   WB ((lang->u.base.use_template >> 0) & 1);
   WB ((lang->u.base.use_template >> 1) & 1);
@@ -5700,15 +5743,16 @@ trees_out::lang_decl_bools (tree t)
 }
 
 bool
-trees_in::lang_decl_bools (tree t)
+trees_in::lang_decl_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
   struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
 
-  lang->u.base.language = b () ? lang_cplusplus : lang_c;
+  bits.bflush ();
+  lang->u.base.language = bits.b () ? lang_cplusplus : lang_c;
   unsigned v;
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->u.base.use_template = v;
   /* lang->u.base.not_really_extern is not streamed.  */
   RB (lang->u.base.initialized_in_class);
@@ -5771,11 +5815,12 @@ trees_in::lang_decl_bools (tree t)
 }
 
 void
-trees_out::lang_type_bools (tree t)
+trees_out::lang_type_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
   const struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   WB (lang->has_type_conversion);
   WB (lang->has_copy_ctor);
   WB (lang->has_default_ctor);
@@ -5837,11 +5882,12 @@ trees_out::lang_type_bools (tree t)
 }
 
 bool
-trees_in::lang_type_bools (tree t)
+trees_in::lang_type_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
   struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   RB (lang->has_type_conversion);
   RB (lang->has_copy_ctor);
   RB (lang->has_default_ctor);
@@ -5849,8 +5895,8 @@ trees_in::lang_type_bools (tree t)
   RB (lang->ref_needs_init);
   RB (lang->has_const_copy_assign);
   unsigned v;
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->use_template = v;
 
   RB (lang->has_mutable);
@@ -5862,8 +5908,8 @@ trees_in::lang_type_bools (tree t)
   RB (lang->has_new);
   RB (lang->has_array_new);
 
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->gets_delete = v;
   RB (lang->interface_only);
   RB (lang->interface_unknown);
@@ -7140,18 +7186,19 @@ trees_out::tree_node_bools (tree t)
   gcc_checking_assert (TREE_CODE (t) != NAMESPACE_DECL
 		       || DECL_NAMESPACE_ALIAS (t));
 
-  core_bools (t);
+  bits_out bits (*this);
+  core_bools (t, bits);
 
   switch (TREE_CODE_CLASS (TREE_CODE (t)))
     {
     case tcc_declaration:
       {
 	bool specific = DECL_LANG_SPECIFIC (t) != NULL;
-	b (specific);
+	bits.b (specific);
 	if (specific && VAR_P (t))
-	  b (DECL_DECOMPOSITION_P (t));
+	  bits.b (DECL_DECOMPOSITION_P (t));
 	if (specific)
-	  lang_decl_bools (t);
+	  lang_decl_bools (t, bits);
       }
       break;
 
@@ -7162,9 +7209,9 @@ trees_out::tree_node_bools (tree t)
 	gcc_assert (TYPE_LANG_SPECIFIC (t)
 		    == TYPE_LANG_SPECIFIC (TYPE_MAIN_VARIANT (t)));
 
-	b (specific);
+	bits.b (specific);
 	if (specific)
-	  lang_type_bools (t);
+	  lang_type_bools (t, bits);
       }
       break;
 
@@ -7172,34 +7219,35 @@ trees_out::tree_node_bools (tree t)
       break;
     }
 
-  bflush ();
+  bits.bflush ();
 }
 
 bool
 trees_in::tree_node_bools (tree t)
 {
-  bool ok = core_bools (t);
+  bits_in bits (*this);
+  bool ok = core_bools (t, bits);
 
   if (ok)
     switch (TREE_CODE_CLASS (TREE_CODE (t)))
       {
       case tcc_declaration:
-	if (b ())
+	if (bits.b ())
 	  {
-	    bool decomp = VAR_P (t) && b ();
+	    bool decomp = VAR_P (t) && bits.b ();
 
 	    ok = maybe_add_lang_decl_raw (t, decomp);
 	    if (ok)
-	      ok = lang_decl_bools (t);
-	}
+	      ok = lang_decl_bools (t, bits);
+	  }
 	break;
 
       case tcc_type:
-	if (b ())
+	if (bits.b ())
 	  {
 	    ok = maybe_add_lang_type_raw (t);
 	    if (ok)
-	      ok = lang_type_bools (t);
+	      ok = lang_type_bools (t, bits);
 	  }
 	break;
 
@@ -7207,7 +7255,7 @@ trees_in::tree_node_bools (tree t)
 	break;
       }
 
-  bflush ();
+  bits.bflush ();
   if (!ok || get_overrun ())
     return false;
 
@@ -7730,11 +7778,11 @@ trees_out::decl_value (tree decl, depset *dep)
 
       if (mk != MK_unique)
 	{
+	  bits_out bits (*this);
 	  if (!(mk & MK_template_mask) && !state->is_header ())
 	    {
 	      /* Tell the importer whether this is a global module entity,
-		 or a module entity.  This bool merges into the next block
-		 of bools.  Sneaky.  */
+		 or a module entity.  */
 	      tree o = get_originating_module_decl (decl);
 	      bool is_attached = false;
 
@@ -7743,9 +7791,9 @@ trees_out::decl_value (tree decl, depset *dep)
 		  && DECL_MODULE_ATTACH_P (not_tmpl))
 		is_attached = true;
 
-	      b (is_attached);
+	      bits.b (is_attached);
 	    }
-	  b (dep && dep->has_defn ());
+	  bits.b (dep && dep->has_defn ());
 	}
       tree_node_bools (decl);
     }
@@ -8014,11 +8062,11 @@ trees_in::decl_value ()
     {
       if (mk != MK_unique)
 	{
+	  bits_in bits (*this);
 	  if (!(mk & MK_template_mask) && !state->is_header ())
-	    /* See note in trees_out about where this bool is sequenced.  */
-	    is_attached = b ();
+	    is_attached = bits.b ();
 
-	  has_defn = b ();
+	  has_defn = bits.b ();
 	}
 
       if (!tree_node_bools (decl))
@@ -16775,10 +16823,11 @@ module_state::write_define (bytes_out &sec, const cpp_macro *macro)
 {
   sec.u (macro->count);
 
-  sec.b (macro->fun_like);
-  sec.b (macro->variadic);
-  sec.b (macro->syshdr);
-  sec.bflush ();
+  bits_out bits (sec);
+  bits.b (macro->fun_like);
+  bits.b (macro->variadic);
+  bits.b (macro->syshdr);
+  bits.bflush ();
 
   write_location (sec, macro->line);
   if (macro->fun_like)
@@ -16873,10 +16922,11 @@ module_state::read_define (bytes_in &sec, cpp_reader *reader) const
   macro->kind = cmk_macro;
   macro->imported_p = true;
 
-  macro->fun_like = sec.b ();
-  macro->variadic = sec.b ();
-  macro->syshdr = sec.b ();
-  sec.bflush ();
+  bits_in bits (sec);
+  macro->fun_like = bits.b ();
+  macro->variadic = bits.b ();
+  macro->syshdr = bits.b ();
+  bits.bflush ();
 
   macro->line = read_location (sec);
Jason Merrill April 10, 2024, 4:33 p.m. UTC | #9
On 4/10/24 11:26, Patrick Palka wrote:
> On Wed, 10 Apr 2024, Patrick Palka wrote:
> 
>>
>> On Tue, 9 Apr 2024, Jason Merrill wrote:
>>
>>> On 2/16/24 10:06, Patrick Palka wrote:
>>>> On Thu, 15 Feb 2024, Patrick Palka wrote:
>>>>
>>>>> One would expect consecutive calls to bytes_in/out::b for streaming
>>>>> adjacent bits, as we do for tree flag streaming, to at least be
>>>>> optimized by the compiler into individual bit operations using
>>>>> statically known bit positions (and ideally merged into larger sized
>>>>> reads/writes).
>>>
>>> Did you have any thoughts about how feasible it would be to use
>>> data-streamer.h?  I didn't see a response to richi's mail.
>>
>> IIUC the workhorses of data-streamer.h are
>>
>>    for streaming out: bitpack_create / bp_pack_value / streamer_write_bitpack
>>    for streaming in:  streamer_read_bitpack / bp_unpack_value
>>
>> which use a locally constructed bitpack_d struct for state management,
>> much like what this patch proposes, which is a sign that this is a good
>> approach I suppose.
>>
>> The bit twiddling code is unsurprisingly pretty similar except
>> data-streamer.h can stream more than one bit at a time whereas
>> bits_in/out::b from this patch can only handle one bit at a time
>> (which is by far the common case).  Another difference is that the
>> data-streamer.h buffer is a HOST_WIDE_INT while the modules bit buffer
>> is uint32_t (this patch doesn't change that).
>>
>> Unfortunately it seems data-streamer.h is currently hardcoded for
>> LTO streaming since bitpack_d::stream must be an lto_input_block and it
>> uses streamer_write_uhwi_stream and streamer_read_uhwi under the hood.
>> So we can't use it for modules streaming currently without abstracting
>> away this hardcoding AFAICT.
>>
>>>
>>>>> Unfortunately this doesn't happen because the compiler has trouble
>>>>> tracking the values of this->bit_pos and this->bit_val across such
>>>>> calls, likely because the compiler doesn't know 'this' and so it's
>>>>> treated as global memory.  This means for each consecutive bit stream
>>>>> operation, bit_pos and bit_val are loaded from memory, checked if
>>>>> buffering is needed, and finally the bit is extracted from bit_val
>>>>> according to the (unknown) bit_pos, even though relative to the previous
>>>>> operation (if we didn't need to buffer) bit_val is unchanged and bit_pos
>>>>> is just 1 larger.  This ends up being quite slow, with tree_node_bools
>>>>> taking 10% of time when streaming in parts of the std module.
>>>>>
>>>>> This patch optimizes this by making tracking of bit_pos and bit_val
>>>>> easier for the compiler.  Rather than bit_pos and bit_val being members
>>>>> of the (effectively global) bytes_in/out objects, this patch factors out
>>>>> the bit streaming code/state into separate classes bits_in/out that get
>>>>> constructed locally as needed for bit streaming.  Since these objects
>>>>> are now clearly local, the compiler can more easily track their values.
>>>
>>> Please add this rationale to the bits_in comment.
>>
>> Will do.
>>
>>>
>>>>> And since bit streaming is intended to be batched it's natural for these
>>>>> new classes to be RAII-enabled such that the bit stream is flushed upon
>>>>> destruction.
>>>>>
>>>>> In order to make the most of this improved tracking of bit position,
>>>>> this patch changes parts where we conditionally stream a tree flag
>>>>> to unconditionally stream (the flag or a dummy value).  That way
>>>>> the number of bits streamed and the respective bit positions are as
>>>>> statically known as reasonably possible.  In lang_decl_bools and
>>>>> lang_type_bools we flush the current bit buffer at the start so that
>>>>> subsequent bit positions are statically known.  And in core bools, we
>>>>> can add explicit early exits utilizing invariants that the compiler
>>>>> can't figure out itself (e.g. a tree code can't have both TS_TYPE_COMMON
>>>>> and TS_DECL_COMMON, and if a tree code doesn't have TS_DECL_COMMON then
>>>>> it doesn't have TS_DECL_WITH_VIS).  Finally if we're streaming fewer
>>>>> than 4 bits, it's more space efficient to stream them as individual
>>>>> bytes rather than as packed bits (due to the 32-bit buffer).
>>>>
>>>> Oops, this last sentence is wrong.  Although the size of the bit buffer
>>>> is 32 bits, upon flushing we rewind unused bytes within the buffer,
>>>> which means streaming 2-8 bits ends up using only one byte not all four.
>>>> So v2 below undoes this pessimization.
>>>>
>>>>> This patch also moves the definitions of the relevant streaming classes
>>>>> into anonymous namespaces so that the compiler can make more informed
>>>>> decisions about inlining their member functions.
>>>
>>> I'm curious why you decided to put namespace { } around each class rather than
>>> a larger part of the file?  Not asking for a change, just curious.
>>
>> I don't feel strongly about i, but to me using a separate namespace { }
>> for each class is consistent with how we use 'static' instead of
>> namespace { } to give (consecutively defined) free functions internal
>> linkage, i.e. instead of
>>
>>      namespace {
>>        void f() { }
>>        void g() { }
>>      }
>>
>> we do
>>
>>     static void f() { }
>>     static void g() { }
>>
>> Using a separate namespace { } for each class is the closest thing to
>> 'static' for types.  And it makes it obvious whether a class is TU-local
>> or not.
>>
>>>
>>> I'm also surprised that this would make a difference for inline functions?  Is
>>> this just to allow the compiler to inline member functions defined outside the
>>> class body without marking them inline?
>>
>> The motivation was initially to help ensure trees_in/out::core_bools,
>> ::lang_type_bools and ::lang_decl_bools get inlined into their only
>> caller ::tree_node_bools.  This is crucial because these functions take
>> bits_in/out parameters by reference and if they're not inlined then the
>> compiler can't track the bit state, and we generate bad code again with
>> a buffering check after every single bit read/write.
>>
>> Only by giving them internal linkage can the compiler see they have just
>> one caller, which guarantees they get inlined.  They're otherwise fairly
>> large functions which are not clearly profitable to inline.
>>
>> And I reckoned it's good code hygiene to give TU-local types internal
>> linkage much like how we declare TU-local free functions 'static' so
>> I gave the base classes of trees_in/out internal linkage as well.
>>
>>>
>>> In any case, please add a rationale comment to the (first) anonymous
>>> namespace.
>>
>> Sure, I opted to add a rationale to trees_in since it and trees_out are
>> the classes that most benefit from this change.
>>
>>>
>>>>> After this patch, compile time for a simple Hello World using the std
>>>>> module is reduced by 7% with a release compiler.  The on-disk size of
>>>>> the std module increases by 0.7% (presumably due to the extra flushing
>>>>> done in lang_decl_bools and lang_type_bools).
>>>>
>>>> The on-disk std module now only grows 0.4% instead of 0.7%.
>>>>>
>>>>> The bit stream out performance isn't improved as much as the stream in
>>>>> due to the spans/lengths instrumentation performed on stream out (which
>>>>> probably should be e.g. removed for release builds?)
>>>
>>> Based on CHECKING_P, sure.
>>
>> I can do that in a follow-up patch.
>>
>>>
>>>> diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
>>>> index 0291d456ff5..2ecee007e8a 100644
>>>> --- a/gcc/cp/module.cc
>>>> +++ b/gcc/cp/module.cc
>>>> @@ -694,13 +656,126 @@ protected:
>>>>      /* Instrumentation.  */
>>>>      static unsigned spans[4];
>>>>      static unsigned lengths[4];
>>>> -  static int is_set;
>>>> +  friend struct bits_out;
>>>>    };
>>>> +} // anon namespace
>>>> +
>>>> +/* Finish bit packet.  Rewind the bytes not used.  */
>>>
>>> Missing blank line.
>>
>> Fixed.
>>
>>>
>>>> +static unsigned
>>>> +bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
>>>> +{
>>>> +  gcc_assert (bit_pos);
>>>> +  unsigned bytes = (bit_pos + 7) / 8;
>>>> +  bits.unuse (4 - bytes);
>>>> +  bit_pos = 0;
>>>> +  bit_val = 0;
>>>> +  return bytes;
>>>> +}
>>>> +
>>>> @@ -5314,6 +5326,8 @@ trees_out::core_bools (tree t)
>>>>      if (TREE_CODE_CLASS (code) != tcc_type)
>>>>        /* This is TYPE_CACHED_VALUES_P for types.  */
>>>>        WB (t->base.public_flag);
>>>> +  else
>>>> +    WB (false);
>>>
>>> Can we simplify the pattern for conditionally writing/reading?  It looks easy
>>> to forget to add the else.  Perhaps a COND_WB macro with rationale comment?
>>
>> Fixed.
>>
>> Here's an incremental diff of the changes.  Will send updated patch as a
>> follow-up email.
> 
> Updated patch:
> 
> Subject: [PATCH] c++/modules: optimize tree flag streaming
> 
> One would expect consecutive calls to bytes_in/out::b for streaming
> adjacent bits, as we do for tree flag streaming, to at least be
> optimized by the compiler into individual bit operations using
> statically known bit positions (and ideally combined into larger sized
> reads/writes).
> 
> Unfortunately this doesn't happen because the compiler has trouble
> tracking the values of this->bit_pos and this->bit_val across the
> calls, likely because the compiler doesn't know the value of 'this'.
> Thus for each consecutive bit stream operation, bit_pos and bit_val are
> loaded from 'this', checked if buffering is needed, and finally the bit
> is extracted from bit_val according to the (unknown) bit_pos, even
> though relative to the previous operation (if we didn't need to buffer)
> bit_val is unchanged and bit_pos is just 1 larger.  This ends up being
> quite slow, with tree_node_bools taking 10% of time when streaming in
> the std module.
> 
> This patch optimizes this by making tracking of bit_pos and bit_val
> easier for the compiler.  Rather than bit_pos and bit_val being members
> of the (effectively global) bytes_in/out objects, this patch factors out
> the bit streaming code/state into separate classes bits_in/out that get
> constructed locally as needed for bit streaming.  Since these objects
> are now clearly local, the compiler can more easily track their values.
> 
> And since bit streaming is intended to be batched it's natural for these
> new classes to be RAII-enabled so that the bit stream is flushed upon
> destruction.
> 
> In order to make the most of this improved tracking of bit position,
> this patch changes parts where we conditionally stream a tree flag
> to unconditionally stream (the flag or a dummy value).  That way
> the number of bits streamed and the respective bit positions are as
> statically known as reasonably possible.  In lang_decl_bools and
> lang_type_bools this patch makes us flush the current bit buffer at the
> start so that subsequent bit positions are in turn statically known.
> And in core_bools, we can add explicit early exits utilizing invariants
> that the compiler can't figure out itself (e.g. a tree code can't have
> both TS_TYPE_COMMON and TS_DECL_COMMON, and if a tree code doesn't have
> TS_DECL_COMMON then it doesn't have TS_DECL_WITH_VIS).
> 
> This patch also moves the definitions of the relevant streaming classes
> into anonymous namespaces so that the compiler can make more informed
> decisions about inlining their member functions.
> 
> After this patch, compile time for a simple Hello World using the std
> module is reduced by 7% with a release compiler.  The on-disk size of
> the std module increases by 0.4% (presumably due to the extra flushing
> done in lang_decl_bools and lang_type_bools).
> 
> The bit stream out performance isn't improved as much as the stream in
> due to the spans/lengths instrumentation performed on stream out (which
> maybe should be disabled for release builds?)
> 
> gcc/cp/ChangeLog:
> 
> 	* module.cc: Update comment about classes defined within.
> 	(class data): Enclose in an anonymous namespace.
> 	(data::calc_crc): Moved from bytes::calc_crc.
> 	(class bytes): Remove.  Move bit_flush to namespace scope.
> 	(class bytes_in): Enclose in an anonymous namespace.  Inherit
> 	directly from data and adjust accordingly.  Move b and bflush
> 	members to bits_in.
> 	(class bytes_out): As above.  Remove is_set static data member.
> 	(bit_flush): Moved from class bytes.
> 	(struct bits_in): Define.
> 	(struct bits_out): Define.
> 	(bytes_out::bflush): Moved to bits_out/in.
> 	(bytes_in::bflush): Likewise
> 	(bytes_in::bfill): Removed.
> 	(bytes_out::b): Moved to bits_out/in.
> 	(bytes_in::b): Likewise.
> 	(class trees_in): Enclose in an anonymous namespace.
> 	(class trees_out): Enclose in an anonymous namespace.
> 	(trees_out::core_bools): Add bits_out/in parameter and use it.
> 	Unconditionally stream a bit for public_flag.  Add early exits
> 	as appropriate.
> 	(trees_out::core_bools): Likewise.
> 	(trees_out::lang_decl_bools): Add bits_out/in parameter and use
> 	it.  Flush the current bit buffer at the start.  Unconditionally
> 	stream a bit for module_keyed_decls_p.
> 	(trees_in::lang_decl_bools): Likewise.
> 	(trees_out::lang_type_bools): Add bits_out/in parameter and use
> 	it.  Flush the current bit buffer at the start.
> 	(trees_in::lang_type_bools): Likewise.
> 	(trees_out::tree_node_bools): Construct a bits_out object and
> 	use/pass it.
> 	(trees_in::tree_node_bools): Likewise.
> 	(trees_out::decl_value): Likewise.
> 	(trees_in::decl_value): Likewise.
> 	(module_state::write_define): Likewise.
> 	(module_state::read_define): Likewise.
> ---
>   gcc/cp/module.cc | 436 ++++++++++++++++++++++++++---------------------
>   1 file changed, 243 insertions(+), 193 deletions(-)
> 
> diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
> index 1be024cedd2..765d7dde715 100644
> --- a/gcc/cp/module.cc
> +++ b/gcc/cp/module.cc
> @@ -153,9 +153,11 @@ Classes used:
>   
>      data - buffer
>   
> -   bytes - data streamer
> -   bytes_in : bytes - scalar reader
> -   bytes_out : bytes - scalar writer
> +   bytes_in - scalar reader
> +   bytes_out - scalar writer

Let's keep documenting the inheritance relationship here, i.e.

   bytes_in : data

> @@ -694,13 +656,132 @@ protected:
>     /* Instrumentation.  */
>     static unsigned spans[4];
>     static unsigned lengths[4];
> -  static int is_set;
> +  friend struct bits_out;

It might be a little more elegant for bits_in/out to be nested classes 
of bytes_in/out, returned from member functions, rather than friends 
constructed directly?  OK either way, with the above comment tweak.

Jason
Iain Sandoe April 13, 2024, 7:40 p.m. UTC | #10
Hi Patrick,

> On 10 Apr 2024, at 17:33, Jason Merrill <jason@redhat.com> wrote:
> 
> On 4/10/24 11:26, Patrick Palka wrote:
>> On Wed, 10 Apr 2024, Patrick Palka wrote:
>>> 
>>> On Tue, 9 Apr 2024, Jason Merrill wrote:
>>> 
>>>> On 2/16/24 10:06, Patrick Palka wrote:
>>>>> On Thu, 15 Feb 2024, Patrick Palka wrote:
>>>>> 

<snip>

> Let's keep documenting the inheritance relationship here, i.e.
> 
>  bytes_in : data
> 
>> @@ -694,13 +656,132 @@ protected:
>>    /* Instrumentation.  */
>>    static unsigned spans[4];
>>    static unsigned lengths[4];
>> -  static int is_set;
>> +  friend struct bits_out;
> 
> It might be a little more elegant for bits_in/out to be nested classes of bytes_in/out, returned from member functions, rather than friends constructed directly?  OK either way, with the above comment tweak.

Unfortunately, this seems to break x86_64 Darwin bootstrap with fails as below - I did not yet have a chance to look in any morre detail, so this is a head’s up - unless you have any immediate ideas?

thanks
Iain


/src-local/gcc-master/gcc/cp/module.cc: In member function ‘{anonymous}::bytes_in::bits_in {anonymous}::bytes_in::stream_bits()’:
/src-local/gcc-master/gcc/cp/module.cc:735:24: error: use of deleted function ‘{anonymous}::bytes_in::bits_in::bits_in(const {anonymous}::bytes_in::bits_in&)’
  735 |   return bits_in (*this);
      |                        ^
/src-local/gcc-master/gcc/cp/module.cc:709:3: note: declared here
  709 |   bits_in(const bits_in&) = delete;
      |   ^~~~~~~
/src-local/gcc-master/gcc/cp/module.cc: In member function ‘{anonymous}::bytes_out::bits_out {anonymous}::bytes_out::stream_bits()’:
/src-local/gcc-master/gcc/cp/module.cc:796:25: error: use of deleted function ‘{anonymous}::bytes_out::bits_out::bits_out(const {anonymous}::bytes_out::bits_out&)’
  796 |   return bits_out (*this);
      |                         ^
/src-local/gcc-master/gcc/cp/module.cc:755:3: note: declared here
  755 |   bits_out(const bits_out&) = delete;
      |   ^~~~~~~~
Patrick Palka April 13, 2024, 7:54 p.m. UTC | #11
On Sat, 13 Apr 2024, Iain Sandoe wrote:

> Hi Patrick,
> 
> > On 10 Apr 2024, at 17:33, Jason Merrill <jason@redhat.com> wrote:
> > 
> > On 4/10/24 11:26, Patrick Palka wrote:
> >> On Wed, 10 Apr 2024, Patrick Palka wrote:
> >>> 
> >>> On Tue, 9 Apr 2024, Jason Merrill wrote:
> >>> 
> >>>> On 2/16/24 10:06, Patrick Palka wrote:
> >>>>> On Thu, 15 Feb 2024, Patrick Palka wrote:
> >>>>> 
> 
> <snip>
> 
> > Let's keep documenting the inheritance relationship here, i.e.
> > 
> >  bytes_in : data
> > 
> >> @@ -694,13 +656,132 @@ protected:
> >>    /* Instrumentation.  */
> >>    static unsigned spans[4];
> >>    static unsigned lengths[4];
> >> -  static int is_set;
> >> +  friend struct bits_out;
> > 
> > It might be a little more elegant for bits_in/out to be nested classes of bytes_in/out, returned from member functions, rather than friends constructed directly?  OK either way, with the above comment tweak.
> 
> Unfortunately, this seems to break x86_64 Darwin bootstrap with fails as below - I did not yet have a chance to look in any morre detail, so this is a head’s up - unless you have any immediate ideas?
> 
> thanks
> Iain
> 
> 
> /src-local/gcc-master/gcc/cp/module.cc: In member function ‘{anonymous}::bytes_in::bits_in {anonymous}::bytes_in::stream_bits()’:
> /src-local/gcc-master/gcc/cp/module.cc:735:24: error: use of deleted function ‘{anonymous}::bytes_in::bits_in::bits_in(const {anonymous}::bytes_in::bits_in&)’
>   735 |   return bits_in (*this);
>       |                        ^
> /src-local/gcc-master/gcc/cp/module.cc:709:3: note: declared here
>   709 |   bits_in(const bits_in&) = delete;
>       |   ^~~~~~~
> /src-local/gcc-master/gcc/cp/module.cc: In member function ‘{anonymous}::bytes_out::bits_out {anonymous}::bytes_out::stream_bits()’:
> /src-local/gcc-master/gcc/cp/module.cc:796:25: error: use of deleted function ‘{anonymous}::bytes_out::bits_out::bits_out(const {anonymous}::bytes_out::bits_out&)’
>   796 |   return bits_out (*this);
>       |                         ^
> /src-local/gcc-master/gcc/cp/module.cc:755:3: note: declared here
>   755 |   bits_out(const bits_out&) = delete;
>       |   ^~~~~~~~

Drat, sorry for the breakage.  We need to define defaulted move ctors
for these classes I think.  I'll take care of it ASAP.
diff mbox series

Patch

diff --git a/gcc/cp/module.cc b/gcc/cp/module.cc
index 0291d456ff5..9477e4f2e35 100644
--- a/gcc/cp/module.cc
+++ b/gcc/cp/module.cc
@@ -153,9 +153,11 @@  Classes used:
 
    data - buffer
 
-   bytes - data streamer
-   bytes_in : bytes - scalar reader
-   bytes_out : bytes - scalar writer
+   bytes_in - scalar reader
+   bytes_out - scalar writer
+
+   bits_in - bit stream reader
+   bits_out - bit stream writer
 
    elf - ELROND format
    elf_in : elf - ELROND reader
@@ -346,6 +348,7 @@  typedef hash_map<void *,signed,ptr_int_traits> ptr_int_hash_map;
 
 /* Variable length buffer.  */
 
+namespace {
 class data {
 public:
   class allocator {
@@ -393,6 +396,8 @@  protected:
     return res;
   }
 
+  unsigned calc_crc (unsigned) const;
+
 public:
   void unuse (unsigned count)
   {
@@ -402,6 +407,7 @@  public:
 public:
   static allocator simple_memory;
 };
+} // anon namespace
 
 /* The simple data allocator.  */
 data::allocator data::simple_memory;
@@ -447,46 +453,11 @@  data::allocator::shrink (char *ptr)
   XDELETEVEC (ptr);
 }
 
-/* Byte streamer base.   Buffer with read/write position and smarts
-   for single bits.  */
-
-class bytes : public data {
-public:
-  typedef data parent;
-
-protected:
-  uint32_t bit_val;	/* Bit buffer.  */
-  unsigned bit_pos;	/* Next bit in bit buffer.  */
-
-public:
-  bytes ()
-    :parent (), bit_val (0), bit_pos (0)
-  {}
-  ~bytes () 
-  {
-  }
-
-protected:
-  unsigned calc_crc (unsigned) const;
-
-protected:
-  /* Finish bit packet.  Rewind the bytes not used.  */
-  unsigned bit_flush ()
-  {
-    gcc_assert (bit_pos);
-    unsigned bytes = (bit_pos + 7) / 8;
-    unuse (4 - bytes);
-    bit_pos = 0;
-    bit_val = 0;
-    return bytes;
-  }
-};
-
 /* Calculate the crc32 of the buffer.  Note the CRC is stored in the
    first 4 bytes, so don't include them.  */
 
 unsigned
-bytes::calc_crc (unsigned l) const
+data::calc_crc (unsigned l) const
 {
   return crc32 (0, (unsigned char *)buffer + 4, l - 4);
 }
@@ -495,8 +466,9 @@  class elf_in;
 
 /* Byte stream reader.  */
 
-class bytes_in : public bytes {
-  typedef bytes parent;
+namespace {
+class bytes_in : public data {
+  typedef data parent;
 
 protected:
   bool overrun;  /* Sticky read-too-much flag.  */
@@ -531,7 +503,6 @@  public:
     if (offset > size)
       set_overrun ();
     pos = offset;
-    bit_pos = bit_val = 0;
   }
 
 public:
@@ -572,13 +543,6 @@  public:
 public:
   unsigned u32 ();  	/* Read uncompressed integer.  */
 
-public:
-  bool b ();	    	/* Read a bool.  */
-  void bflush ();	/* Completed a block of bools.  */
-
-private:
-  void bfill ();	/* Get the next block of bools.  */
-
 public:
   int c ();		/* Read a char.  */
   int i ();		/* Read a signed int.  */
@@ -590,6 +554,7 @@  public:
   const void *buf (size_t); /* Read a fixed-length buffer.  */
   cpp_hashnode *cpp_node (); /* Read a cpp node.  */
 };
+} // anon namespace
 
 /* Verify the buffer's CRC is correct.  */
 
@@ -610,8 +575,9 @@  class elf_out;
 
 /* Byte stream writer.  */
 
-class bytes_out : public bytes {
-  typedef bytes parent;
+namespace {
+class bytes_out : public data {
+  typedef data parent;
 
 public:
   allocator *memory;	/* Obtainer of memory.  */
@@ -658,10 +624,6 @@  public:
 public:
   void u32 (unsigned);  /* Write uncompressed integer.  */
 
-public:
-  void b (bool);	/* Write bool.  */
-  void bflush ();	/* Finish block of bools.  */
-
 public:
   void c (unsigned char); /* Write unsigned char.  */
   void i (int);		/* Write signed int.  */
@@ -694,13 +656,127 @@  protected:
   /* Instrumentation.  */
   static unsigned spans[4];
   static unsigned lengths[4];
-  static int is_set;
+  friend struct bits_out;
+};
+} // anon namespace
+
+/* Finish bit packet.  Rewind the bytes not used.  */
+static unsigned
+bit_flush (data& bits, uint32_t& bit_val, unsigned& bit_pos)
+{
+  gcc_assert (bit_pos);
+  unsigned bytes = (bit_pos + 7) / 8;
+  bits.unuse (4 - bytes);
+  bit_pos = 0;
+  bit_val = 0;
+  return bytes;
+}
+
+/* RAII-enabled bit stream reader.  Bools are packed into bytes.  You
+   cannot mix bools and non-bools.  Use bflush to flush the current stream
+   of bools on demand.  Upon destruction bflush is called.
+
+   When reading, we don't know how many bools we'll read in.  So read
+   4 bytes-worth, and then rewind when flushing if we didn't need them
+   all.  You can't have a block of bools closer than 4 bytes to the
+   end of the buffer.  */
+
+namespace {
+struct bits_in {
+  bytes_in& in;
+  uint32_t bit_val = 0;
+  unsigned bit_pos = 0;
+
+  bits_in (bytes_in& in)
+    : in (in)
+  {
+  }
+
+  ~bits_in ()
+  {
+    bflush ();
+  }
+
+  bits_in(const bits_in&) = delete;
+  bits_in& operator=(const bits_in&) = delete;
+
+  /* Completed a block of bools.  */
+  void bflush ()
+  {
+    if (bit_pos)
+      bit_flush (in, bit_val, bit_pos);
+  }
+
+  /* Read one bit.  */
+  bool b ()
+  {
+    if (!bit_pos)
+      bit_val = in.u32 ();
+    bool x = (bit_val >> bit_pos) & 1;
+    bit_pos = (bit_pos + 1) % 32;
+    return x;
+  }
 };
+} // anon namespace
+
+/* RAII-enabled bit stream writer, counterpart to bits_in.  */
+
+namespace {
+struct bits_out {
+  bytes_out& out;
+  uint32_t bit_val = 0;
+  unsigned bit_pos = 0;
+  char is_set = -1;
+
+  bits_out (bytes_out& out)
+    : out (out)
+  { }
+
+  ~bits_out ()
+  {
+    bflush ();
+  }
+
+  bits_out(const bits_out&) = delete;
+  bits_out& operator=(const bits_out&) = delete;
+
+  /* Completed a block of bools.  */
+  void bflush ()
+  {
+    if (bit_pos)
+      {
+	out.u32 (bit_val);
+	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
+      }
+    out.spans[2]++;
+    is_set = -1;
+  }
+
+  /* Write one bit.
+
+     It may be worth optimizing for most bools being zero.  Some kind of
+     run-length encoding?  */
+  void b (bool x)
+  {
+    if (is_set != x)
+      {
+	is_set = x;
+	out.spans[x]++;
+      }
+    out.lengths[x]++;
+    bit_val |= unsigned (x) << bit_pos++;
+    if (bit_pos == 32)
+      {
+	out.u32 (bit_val);
+	out.lengths[2] += bit_flush (out, bit_val, bit_pos);
+      }
+  }
+};
+} // anon namespace
 
 /* Instrumentation.  */
 unsigned bytes_out::spans[4];
 unsigned bytes_out::lengths[4];
-int bytes_out::is_set = -1;
 
 /* If CRC_PTR non-null, set the CRC of the buffer.  Mix the CRC into
    that pointed to by CRC_PTR.  */
@@ -723,73 +799,6 @@  bytes_out::set_crc (unsigned *crc_ptr)
     }
 }
 
-/* Finish a set of bools.  */
-
-void
-bytes_out::bflush ()
-{
-  if (bit_pos)
-    {
-      u32 (bit_val);
-      lengths[2] += bit_flush ();
-    }
-  spans[2]++;
-  is_set = -1;
-}
-
-void
-bytes_in::bflush ()
-{
-  if (bit_pos)
-    bit_flush ();
-}
-
-/* When reading, we don't know how many bools we'll read in.  So read
-   4 bytes-worth, and then rewind when flushing if we didn't need them
-   all.  You can't have a block of bools closer than 4 bytes to the
-   end of the buffer.  */
-
-void
-bytes_in::bfill ()
-{
-  bit_val = u32 ();
-}
-
-/* Bools are packed into bytes.  You cannot mix bools and non-bools.
-   You must call bflush before emitting another type.  So batch your
-   bools.
-
-   It may be worth optimizing for most bools being zero.  Some kind of
-   run-length encoding?  */
-
-void
-bytes_out::b (bool x)
-{
-  if (is_set != x)
-    {
-      is_set = x;
-      spans[x]++;
-    }
-  lengths[x]++;
-  bit_val |= unsigned (x) << bit_pos++;
-  if (bit_pos == 32)
-    {
-      u32 (bit_val);
-      lengths[2] += bit_flush ();
-    }
-}
-
-bool
-bytes_in::b ()
-{
-  if (!bit_pos)
-    bfill ();
-  bool v = (bit_val >> bit_pos++) & 1;
-  if (bit_pos == 32)
-    bit_flush ();
-  return v;
-}
-
 /* Exactly 4 bytes.  Used internally for bool packing and a few other
    places.  We can't simply use uint32_t because (a) alignment and
    (b) we need little-endian for the bool streaming rewinding to make
@@ -2846,6 +2855,7 @@  struct post_process_data {
    read trees with TREE_VISITED.  Thus it's quite safe to have
    multiple concurrent readers.  Which is good, because lazy
    loading. */
+namespace {
 class trees_in : public bytes_in {
   typedef bytes_in parent;
 
@@ -2870,15 +2880,15 @@  private:
 
 public:
   /* Needed for binfo writing  */
-  bool core_bools (tree);
+  bool core_bools (tree, bits_in&);
 
 private:
   /* Stream tree_core, lang_decl_specific and lang_type_specific
      bits.  */
   bool core_vals (tree);
-  bool lang_type_bools (tree);
+  bool lang_type_bools (tree, bits_in&);
   bool lang_type_vals (tree);
-  bool lang_decl_bools (tree);
+  bool lang_decl_bools (tree, bits_in&);
   bool lang_decl_vals (tree);
   bool lang_vals (tree);
   bool tree_node_bools (tree);
@@ -2965,6 +2975,7 @@  private:
 private:
   void assert_definition (tree, bool installing);
 };
+} // anon namespace
 
 trees_in::trees_in (module_state *state)
   :parent (), state (state), unused (0)
@@ -2982,6 +2993,7 @@  trees_in::~trees_in ()
 }
 
 /* Tree stream writer.  */
+namespace {
 class trees_out : public bytes_out {
   typedef bytes_out parent;
 
@@ -3043,11 +3055,11 @@  public:
   }
 
 private:
-  void core_bools (tree);
+  void core_bools (tree, bits_out&);
   void core_vals (tree);
-  void lang_type_bools (tree);
+  void lang_type_bools (tree, bits_out&);
   void lang_type_vals (tree);
-  void lang_decl_bools (tree);
+  void lang_decl_bools (tree, bits_out&);
   void lang_decl_vals (tree);
   void lang_vals (tree);
   void tree_node_bools (tree);
@@ -3126,6 +3138,7 @@  private:
   static unsigned back_ref_count;
   static unsigned null_count;
 };
+} // anon namespace
 
 /* Instrumentation counters.  */
 unsigned trees_out::tree_val_count;
@@ -5292,9 +5305,9 @@  trees_in::start (unsigned code)
 /* Read & write the core boolean flags.  */
 
 void
-trees_out::core_bools (tree t)
+trees_out::core_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
   tree_code code = TREE_CODE (t);
 
   WB (t->base.side_effects_flag);
@@ -5314,6 +5327,8 @@  trees_out::core_bools (tree t)
   if (TREE_CODE_CLASS (code) != tcc_type)
     /* This is TYPE_CACHED_VALUES_P for types.  */
     WB (t->base.public_flag);
+  else
+    WB (false);
   WB (t->base.private_flag);
   WB (t->base.protected_flag);
   WB (t->base.deprecated_flag);
@@ -5327,7 +5342,7 @@  trees_out::core_bools (tree t)
     case TARGET_MEM_REF:
     case TREE_VEC:
       /* These use different base.u fields.  */
-      break;
+      return;
 
     default:
       WB (t->base.u.bits.lang_flag_0);
@@ -5374,6 +5389,7 @@  trees_out::core_bools (tree t)
       WB (t->type_common.lang_flag_5);
       WB (t->type_common.lang_flag_6);
       WB (t->type_common.typeless_storage);
+      return;
     }
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
@@ -5439,6 +5455,8 @@  trees_out::core_bools (tree t)
       WB (t->decl_common.decl_nonshareable_flag);
       WB (t->decl_common.decl_not_flexarray);
     }
+  else
+    return;
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
     {
@@ -5459,6 +5477,8 @@  trees_out::core_bools (tree t)
       WB (t->decl_with_vis.final);
       WB (t->decl_with_vis.regdecl_flag);
     }
+  else
+    return;
 
   if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
     {
@@ -5485,13 +5505,16 @@  trees_out::core_bools (tree t)
       WB ((kind >> 0) & 1);
       WB ((kind >> 1) & 1);
     }
+  else
+    return;
 #undef WB
 }
 
 bool
-trees_in::core_bools (tree t)
+trees_in::core_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
+
   tree_code code = TREE_CODE (t);
 
   RB (t->base.side_effects_flag);
@@ -5507,6 +5530,8 @@  trees_in::core_bools (tree t)
   RB (t->base.static_flag);
   if (TREE_CODE_CLASS (code) != tcc_type)
     RB (t->base.public_flag);
+  else
+    bits.b ();
   RB (t->base.private_flag);
   RB (t->base.protected_flag);
   RB (t->base.deprecated_flag);
@@ -5520,7 +5545,7 @@  trees_in::core_bools (tree t)
     case TARGET_MEM_REF:
     case TREE_VEC:
       /* These use different base.u fields.  */
-      break;
+      goto done;
 
     default:
       RB (t->base.u.bits.lang_flag_0);
@@ -5554,6 +5579,7 @@  trees_in::core_bools (tree t)
       RB (t->type_common.lang_flag_5);
       RB (t->type_common.lang_flag_6);
       RB (t->type_common.typeless_storage);
+      goto done;
     }
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_COMMON))
@@ -5584,6 +5610,8 @@  trees_in::core_bools (tree t)
       RB (t->decl_common.decl_nonshareable_flag);
       RB (t->decl_common.decl_not_flexarray);
     }
+  else
+    goto done;
 
   if (CODE_CONTAINS_STRUCT (code, TS_DECL_WITH_VIS))
     {
@@ -5604,6 +5632,8 @@  trees_in::core_bools (tree t)
       RB (t->decl_with_vis.final);
       RB (t->decl_with_vis.regdecl_flag);
     }
+  else
+    goto done;
 
   if (CODE_CONTAINS_STRUCT (code, TS_FUNCTION_DECL))
     {
@@ -5627,20 +5657,22 @@  trees_in::core_bools (tree t)
 
       /* decl_type is a (misnamed) 2 bit discriminator.	 */
       unsigned kind = 0;
-      kind |= unsigned (b ()) << 0;
-      kind |= unsigned (b ()) << 1;
+      kind |= unsigned (bits.b ()) << 0;
+      kind |= unsigned (bits.b ()) << 1;
       t->function_decl.decl_type = function_decl_type (kind);
     }
 #undef RB
+done:
   return !get_overrun ();
 }
 
 void
-trees_out::lang_decl_bools (tree t)
+trees_out::lang_decl_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
   const struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   WB (lang->u.base.language == lang_cplusplus);
   WB ((lang->u.base.use_template >> 0) & 1);
   WB ((lang->u.base.use_template >> 1) & 1);
@@ -5664,6 +5696,8 @@  trees_out::lang_decl_bools (tree t)
   WB (lang->u.base.module_attach_p);
   if (VAR_OR_FUNCTION_DECL_P (t))
     WB (lang->u.base.module_keyed_decls_p);
+  else
+    WB (false);
   switch (lang->u.base.selector)
     {
     default:
@@ -5714,15 +5748,16 @@  trees_out::lang_decl_bools (tree t)
 }
 
 bool
-trees_in::lang_decl_bools (tree t)
+trees_in::lang_decl_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
   struct lang_decl *lang = DECL_LANG_SPECIFIC (t);
 
-  lang->u.base.language = b () ? lang_cplusplus : lang_c;
+  bits.bflush ();
+  lang->u.base.language = bits.b () ? lang_cplusplus : lang_c;
   unsigned v;
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->u.base.use_template = v;
   /* lang->u.base.not_really_extern is not streamed.  */
   RB (lang->u.base.initialized_in_class);
@@ -5738,6 +5773,8 @@  trees_in::lang_decl_bools (tree t)
   RB (lang->u.base.module_attach_p);
   if (VAR_OR_FUNCTION_DECL_P (t))
     RB (lang->u.base.module_keyed_decls_p);
+  else
+    bits.b ();
   switch (lang->u.base.selector)
     {
     default:
@@ -5786,11 +5823,12 @@  trees_in::lang_decl_bools (tree t)
 }
 
 void
-trees_out::lang_type_bools (tree t)
+trees_out::lang_type_bools (tree t, bits_out& bits)
 {
-#define WB(X) (b (X))
+#define WB(X) (bits.b (X))
   const struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   WB (lang->has_type_conversion);
   WB (lang->has_copy_ctor);
   WB (lang->has_default_ctor);
@@ -5852,11 +5890,12 @@  trees_out::lang_type_bools (tree t)
 }
 
 bool
-trees_in::lang_type_bools (tree t)
+trees_in::lang_type_bools (tree t, bits_in& bits)
 {
-#define RB(X) ((X) = b ())
+#define RB(X) ((X) = bits.b ())
   struct lang_type *lang = TYPE_LANG_SPECIFIC (t);
 
+  bits.bflush ();
   RB (lang->has_type_conversion);
   RB (lang->has_copy_ctor);
   RB (lang->has_default_ctor);
@@ -5864,8 +5903,8 @@  trees_in::lang_type_bools (tree t)
   RB (lang->ref_needs_init);
   RB (lang->has_const_copy_assign);
   unsigned v;
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->use_template = v;
 
   RB (lang->has_mutable);
@@ -5877,8 +5916,8 @@  trees_in::lang_type_bools (tree t)
   RB (lang->has_new);
   RB (lang->has_array_new);
 
-  v = b () << 0;
-  v |= b () << 1;
+  v = bits.b () << 0;
+  v |= bits.b () << 1;
   lang->gets_delete = v;
   RB (lang->interface_only);
   RB (lang->interface_unknown);
@@ -7106,18 +7145,19 @@  trees_out::tree_node_bools (tree t)
   gcc_checking_assert (TREE_CODE (t) != NAMESPACE_DECL
 		       || DECL_NAMESPACE_ALIAS (t));
 
-  core_bools (t);
+  bits_out bits (*this);
+  core_bools (t, bits);
 
   switch (TREE_CODE_CLASS (TREE_CODE (t)))
     {
     case tcc_declaration:
       {
 	bool specific = DECL_LANG_SPECIFIC (t) != NULL;
-	b (specific);
+	bits.b (specific);
 	if (specific && VAR_P (t))
-	  b (DECL_DECOMPOSITION_P (t));
+	  bits.b (DECL_DECOMPOSITION_P (t));
 	if (specific)
-	  lang_decl_bools (t);
+	  lang_decl_bools (t, bits);
       }
       break;
 
@@ -7128,9 +7168,9 @@  trees_out::tree_node_bools (tree t)
 	gcc_assert (TYPE_LANG_SPECIFIC (t)
 		    == TYPE_LANG_SPECIFIC (TYPE_MAIN_VARIANT (t)));
 
-	b (specific);
+	bits.b (specific);
 	if (specific)
-	  lang_type_bools (t);
+	  lang_type_bools (t, bits);
       }
       break;
 
@@ -7138,34 +7178,35 @@  trees_out::tree_node_bools (tree t)
       break;
     }
 
-  bflush ();
+  bits.bflush ();
 }
 
 bool
 trees_in::tree_node_bools (tree t)
 {
-  bool ok = core_bools (t);
+  bits_in bits (*this);
+  bool ok = core_bools (t, bits);
 
   if (ok)
     switch (TREE_CODE_CLASS (TREE_CODE (t)))
       {
       case tcc_declaration:
-	if (b ())
+	if (bits.b ())
 	  {
-	    bool decomp = VAR_P (t) && b ();
+	    bool decomp = VAR_P (t) && bits.b ();
 
 	    ok = maybe_add_lang_decl_raw (t, decomp);
 	    if (ok)
-	      ok = lang_decl_bools (t);
+	      ok = lang_decl_bools (t, bits);
 	}
 	break;
 
       case tcc_type:
-	if (b ())
+	if (bits.b ())
 	  {
 	    ok = maybe_add_lang_type_raw (t);
 	    if (ok)
-	      ok = lang_type_bools (t);
+	      ok = lang_type_bools (t, bits);
 	  }
 	break;
 
@@ -7173,7 +7214,7 @@  trees_in::tree_node_bools (tree t)
 	break;
       }
 
-  bflush ();
+  bits.bflush ();
   if (!ok || get_overrun ())
     return false;
 
@@ -7699,8 +7740,7 @@  trees_out::decl_value (tree decl, depset *dep)
 	  if (!(mk & MK_template_mask) && !state->is_header ())
 	    {
 	      /* Tell the importer whether this is a global module entity,
-		 or a module entity.  This bool merges into the next block
-		 of bools.  Sneaky.  */
+		 or a module entity.  */
 	      tree o = get_originating_module_decl (decl);
 	      bool is_attached = false;
 
@@ -7709,9 +7749,9 @@  trees_out::decl_value (tree decl, depset *dep)
 		  && DECL_MODULE_ATTACH_P (not_tmpl))
 		is_attached = true;
 
-	      b (is_attached);
+	      c (is_attached);
 	    }
-	  b (dep && dep->has_defn ());
+	  c (dep && dep->has_defn ());
 	}
       tree_node_bools (decl);
     }
@@ -7982,10 +8022,9 @@  trees_in::decl_value ()
       if (mk != MK_unique)
 	{
 	  if (!(mk & MK_template_mask) && !state->is_header ())
-	    /* See note in trees_out about where this bool is sequenced.  */
-	    is_attached = b ();
+	    is_attached = c ();
 
-	  has_defn = b ();
+	  has_defn = c ();
 	}
 
       if (!tree_node_bools (decl))
@@ -16682,10 +16721,9 @@  module_state::write_define (bytes_out &sec, const cpp_macro *macro)
 {
   sec.u (macro->count);
 
-  sec.b (macro->fun_like);
-  sec.b (macro->variadic);
-  sec.b (macro->syshdr);
-  sec.bflush ();
+  sec.c (macro->fun_like);
+  sec.c (macro->variadic);
+  sec.c (macro->syshdr);
 
   write_location (sec, macro->line);
   if (macro->fun_like)
@@ -16780,10 +16818,9 @@  module_state::read_define (bytes_in &sec, cpp_reader *reader) const
   macro->kind = cmk_macro;
   macro->imported_p = true;
 
-  macro->fun_like = sec.b ();
-  macro->variadic = sec.b ();
-  macro->syshdr = sec.b ();
-  sec.bflush ();
+  macro->fun_like = sec.c ();
+  macro->variadic = sec.c ();
+  macro->syshdr = sec.c ();
 
   macro->line = read_location (sec);