[v2] PR tree-optimization/98335: Improvements to DSE's compute_trims.

Hi Richard,
Many thanks.  Yes, your proposed ao_ref_alignment is exactly what I was looking for.
Here's the second revision of my patch for PR tree-optimization/98335 that both uses/
introduces ao_ref_alignment and more intelligently aligns/trims both head and tail,
for example handling the case discussed by Richard and Jeff Law, of a 16 byte-aligned
object where we wish to avoid trimming (just) the last three bytes.  It uses the useful
property that writing N consecutive bytes, typically requires popcount(N) store
instructions, so we wish to align (if we can) that we begin/end with a store of N' bytes
where popcount(N') is one, if that isn't already the case.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
make -k check with no new failures.  Is this revised version ok for mainline?

2022-03-09  Roger Sayle  <roger@nextmovesoftware.com>
            Richard Biener  <rguenther@suse.de>

gcc/ChangeLog
	PR tree-optimization/98335
	* builtins.cc (get_object_alignment_2): Export.
	* builtins.h (get_object_alignment_2): Likewise.
	* tree-ssa-alias.cc (ao_ref_alignment): New.
	* tree-ssa-alias.h (ao_ref_alignment): Declare.

	* tree-ssa-dse.cc (compute_trims): Improve logic deciding whether
	to align head/tail, writing more bytes but using fewer store insns.
	(maybe_trim_memstar_call): Silence compiler warnings by using
	memset to initialize lendata.

gcc/testsuite/ChangeLog
	PR tree-optimization/98335
	* g++.dg/pr98335.C: New test case.
	* gcc.dg/pr86010.c: New test case.
	* gcc.dg/pr86010-2.c: New test case.

Thanks again for your help.
Roger
--

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: 08 March 2022 10:44
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [PATCH] PR tree-optimization/98335: Improvements to DSE's
> compute_trims.
> 
> On Tue, Mar 8, 2022 at 11:10 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Mar 7, 2022 at 11:04 AM Roger Sayle
> <roger@nextmovesoftware.com> wrote:
> > >
> > >
> > > This patch is the main middle-end piece of a fix for PR
> > > tree-opt/98335, which is a code-quality regression affecting
> > > mainline.  The issue occurs in DSE's (dead store elimination's)
> > > compute_trims function that determines where a store to memory can
> > > be trimmed.  In the testcase given in the PR, this function notices
> > > that the first byte of a DImode store is dead, and replaces the
> > > 8-byte store at (aligned) offset zero, with a 7-byte store at
> > > (unaligned) offset one.  Most architectures can store a power-of-two
> > > bytes (up to a maximum) in single instruction, so writing 7 bytes
> > > requires more instructions than writing 8 bytes.  This patch follows Jakub
> Jelinek's suggestion in comment 5, that compute_trims needs improved
> heuristics.
> > >
> > > In this patch, decision of whether and how to align trim_head is
> > > based on the number of bytes being written, the alignment of the
> > > start of the object and where within the object the first byte is
> > > written.  The first tests check whether we're already writing to the
> > > start of the object, and that we're writing three or more bytes.  If
> > > we're only writing one or two bytes, there's no benefit from providing
> additional alignment.
> > > Then we determine the alignment of the object, which is either 1, 2,
> > > 4, 8 or 16 byte aligned (capping at 16 guarantees that we never
> > > write more than 7 bytes beyond the minimum required).  If the buffer
> > > is only
> > > 1 or 2 byte aligned there's no benefit from additional alignment.
> > > For the remaining cases, alignment of trim_head is based upon where
> > > within each aligned block (word) the first byte is written.  For
> > > example, storing the last byte (or last half-word) of a word can be
> > > performed with a single insn.
> > >
> > > On x86_64-pc-linux-gnu with -O2 the new test case in the PR goes from:
> > >
> > >         movl    $0, -24(%rsp)
> > >         movabsq $72057594037927935, %rdx
> > >         movl    $0, -21(%rsp)
> > >         andq    -24(%rsp), %rdx
> > >         movq    %rdx, %rax
> > >         salq    $8, %rax
> > >         movb    c(%rip), %al
> > >         ret
> > >
> > > to
> > >
> > >         xorl    %eax, %eax
> > >         movb    c(%rip), %al
> > >         ret
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check with no new failures.  I've also added
> > > new testcases for the original motivating PR
> > > tree-optimization/86010, to ensure that those remain optimized (in future).
> Ok for mainline?
> >
> > diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc index
> > 2b22a61..080e406 100644
> > --- a/gcc/tree-ssa-dse.cc
> > +++ b/gcc/tree-ssa-dse.cc
> > @@ -405,10 +405,36 @@ compute_trims (ao_ref *ref, sbitmap live, int
> > *trim_head, int *trim_tail,
> >    int first_live = bitmap_first_set_bit (live);
> >    *trim_head = first_live - first_orig;
> >
> > -  /* If more than a word remains, then make sure to keep the
> > -     starting point at least word aligned.  */
> > -  if (last_live - first_live > UNITS_PER_WORD)
> > -    *trim_head &= ~(UNITS_PER_WORD - 1);
> > +  /* If REF is aligned, try to maintain this alignment if it reduces
> > +     the number of (power-of-two sized aligned) writes to memory.
> > +     First check that we're writing >= 3 bytes at a non-zero offset.
> > + */  if (first_live
> > +      && last_live - first_live >= 2)
> > +    {
> > +      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (ref->base));
> >
> > you can't simply use TYPE_ALIGN_* on ref->base.  You can use
> > get_object_alignment on ref->ref, but ref->ref can be NULL in case the
> > ref was initialized from a builtin call like memcpy.
> >
> > Also ref->base is offsetted by ref->offset which you don't seem to
> > account for.  In theory one could export get_object_alignment_2 and if
> > ref->ref is NULL, use that on ref->base, passing addr_p = true, and
> > then adjust the resulting bitpos by ref->offset and fix align
> > accordingly (trimming might also align an access if the original
> > access was offsetted from known alignment).
> >
> > That said, a helper like ao_ref_alignment () might be useful here.
> 
> Like the attached - free feel to use that.
> 
> Richard.
> 
> >
> > I wonder if we can apply good heuristics to compute_trims without
> > taking into account context, like maybe_trimp_complex_store is already
> > limiting itself to useful subsets and the constructor and memstar
> > cases will only benefit if they end up being expanded inline via
> > *_by_pieces, not if expanded as a call.
> >
> > You don't seem to adjust *trim_tail at all, if an aligned 16 byte
> > region is trimmed there by 3 that will result in two extra stores as well, no?
> >

Message ID	007601d833dd$38977a70$a9c66f50$@nextmovesoftware.com
State	New
Headers	show Return-Path: <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org> DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 14E973858C74 From: "Roger Sayle" <roger@nextmovesoftware.com> To: "'Richard Biener'" <richard.guenther@gmail.com> Subject: [PATCH v2] PR tree-optimization/98335: Improvements to DSE's compute_trims. Date: Wed, 9 Mar 2022 17:43:42 -0000 Message-ID: <007601d833dd$38977a70$a9c66f50$@nextmovesoftware.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_0077_01D833DD.38999D50" Thread-Index: Adgz2uNiPThzHEJ7SX+xfeiGKtthAg== Content-Language: en-gb Precedence: list Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org> Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org>
Series	[v2] PR tree-optimization/98335: Improvements to DSE's compute_trims. \| expand [v2] PR tree-optimization/98335: Improvements to DSE's compute_trims.

[v2] PR tree-optimization/98335: Improvements to DSE's compute_trims.

Commit Message

Comments

Patch