diff mbox series

[AVX512,PR87767] Optimize memory broadcast for constant vector under AVX512

Message ID CAMZc-bxdgUxqd=ZRyQ0ciPpLr=Deg=U8vhERov0f+B2Rgm17yA@mail.gmail.com
State New
Headers show
Series [AVX512,PR87767] Optimize memory broadcast for constant vector under AVX512 | expand

Commit Message

Hongtao Liu July 9, 2020, 8:33 a.m. UTC
Hi:
  For a constant vector having one duplicated value, there's no need
to put the whole vector in the constant pool, using embedded broadcast
instead.

  Bootstrap test is Ok, regression test for i386/x86-64 backend is ok.

gcc/ChangeLog:

        PR target/87767
        * config/i386/i386-features.c
        (replace_constant_pool_with_broadcast): New function.
        (constant_pool_broadcast): Ditto.
        (class pass_constant_pool_broadcast): New pass.
        (make_pass_constant_pool_broadcast): Ditto.
        * config/i386/i386-passes.def: Insert new pass after combine.
        * config/i386/i386-protos.h
        (make_pass_constant_pool_broadcast): Declare.
        * config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
        *avx512f_mul<mode>3<mask_name>_bcst): New define_insn.

gcc/testsuite/ChangeLog:

        PR target/87767
        * gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
        * gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
        * gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
        * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
        * gcc.target/i386/pr92865-1.c: Adjust testcase.

Comments

Hongtao Liu July 10, 2020, 9:24 a.m. UTC | #1
+ maintainer.
cc H.J

On Thu, Jul 9, 2020 at 4:33 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Hi:
>   For a constant vector having one duplicated value, there's no need
> to put the whole vector in the constant pool, using embedded broadcast
> instead.
>
>   Bootstrap test is Ok, regression test for i386/x86-64 backend is ok.
>
> gcc/ChangeLog:
>
>         PR target/87767
>         * config/i386/i386-features.c
>         (replace_constant_pool_with_broadcast): New function.
>         (constant_pool_broadcast): Ditto.
>         (class pass_constant_pool_broadcast): New pass.
>         (make_pass_constant_pool_broadcast): Ditto.
>         * config/i386/i386-passes.def: Insert new pass after combine.
>         * config/i386/i386-protos.h
>         (make_pass_constant_pool_broadcast): Declare.
>         * config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
>         *avx512f_mul<mode>3<mask_name>_bcst): New define_insn.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/87767
>         * gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
>         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
>         * gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
>         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
>         * gcc.target/i386/pr92865-1.c: Adjust testcase.
>
>
>
>
> --
> BR,
> Hongtao
Hongtao Liu July 17, 2020, 7:24 a.m. UTC | #2
ping!

On Fri, Jul 10, 2020 at 5:24 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> + maintainer.
> cc H.J
>
> On Thu, Jul 9, 2020 at 4:33 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > Hi:
> >   For a constant vector having one duplicated value, there's no need
> > to put the whole vector in the constant pool, using embedded broadcast
> > instead.
> >
> >   Bootstrap test is Ok, regression test for i386/x86-64 backend is ok.
> >
> > gcc/ChangeLog:
> >
> >         PR target/87767
> >         * config/i386/i386-features.c
> >         (replace_constant_pool_with_broadcast): New function.
> >         (constant_pool_broadcast): Ditto.
> >         (class pass_constant_pool_broadcast): New pass.
> >         (make_pass_constant_pool_broadcast): Ditto.
> >         * config/i386/i386-passes.def: Insert new pass after combine.
> >         * config/i386/i386-protos.h
> >         (make_pass_constant_pool_broadcast): Declare.
> >         * config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
> >         *avx512f_mul<mode>3<mask_name>_bcst): New define_insn.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/87767
> >         * gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
> >         * gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
> >         * gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
> >         * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
> >         * gcc.target/i386/pr92865-1.c: Adjust testcase.
> >
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao
Jan Hubicka July 23, 2020, 8:39 a.m. UTC | #3
Hello,
sorry for taking so long to get to this.
> diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> index 535fc7e981d..8f81d101382 100644
> --- a/gcc/config/i386/i386-features.c
> +++ b/gcc/config/i386/i386-features.c
> @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
>    return new pass_remove_partial_avx_dependency (ctxt);
>  }
>  
> +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> +   with embedded broadcast. i.e.transform
> +
> +     vpaddq .LC0(%rip), %zmm0, %zmm0
> +     ret
> +  .LC0:
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +    .quad 3
> +
> +    to
> +
> +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0

It seems to me that having a special purpose pass for this is bit 
overzelaous.  It seems to me that you can do same pattern matching via 
splitter and fit it into the usual insn splitting pass?

Honza
> +     ret
> +  .LC0:
> +    .quad 3  */
> +static void
> +replace_constant_pool_with_broadcast (rtx_insn* insn)
> +{
> +  subrtx_ptr_iterator::array_type array;
> +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> +    {
> +      rtx *loc = *iter;
> +      rtx x = *loc;
> +      rtx broadcast_mem, vec_dup, constant, first;
> +      machine_mode mode;
> +      if (GET_CODE (x) != MEM
> +	  || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
> +	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> +	continue;
> +
> +      mode = GET_MODE (x);
> +      if (!VECTOR_MODE_P (mode))
> +	return;
> +
> +      constant = get_pool_constant (XEXP (x, 0));
> +      first = XVECEXP (constant, 0, 0);
> +      /* There could be some rtx like
> +	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> +	 but with "*.LC1" refer to V2DI constant vector.  */
> +      if (GET_MODE (constant) != mode)
> +	return;
> +
> +      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
> +	{
> +	  rtx tmp = XVECEXP (constant, 0, i);
> +	  /* Only handle one-value const vector.  */
> +	  if (!rtx_equal_p (tmp, first))
> +	    return;
> +	}
> +
> +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> +      *loc = vec_dup;
> +      INSN_CODE (insn) = -1;
> +      /* Revert change if there's no corresponding pattern.  */
> +      if (recog_memoized (insn) < 0)
> +      	{
> +      	  *loc = x;
> +      	  recog_memoized (insn);
> +      	}
> +      /* At most 1 memory_operand in an insn.  */
> +      return;
> +    }
> +}
> +
> +/* For const vector having one duplicated value, there's no need to put
> +   whole vector in the constant pool when target supports embedded broadcast. */
> +static unsigned int
> +constant_pool_broadcast (void)
> +{
> +  timevar_push (TV_MACH_DEP);
> +  rtx_insn *insn;
> +
> +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> +    {
> +      if (!INSN_P (insn))
> +	continue;
> +
> +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> +	 insns, not any notes that may be attached.  We don't want to mark
> +	 a constant just because it happens to appear in a REG_EQUIV note.  */
> +      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
> +	{
> +	  int i, n = seq->len ();
> +	  for (i = 0; i < n; ++i)
> +	    {
> +	      rtx subinsn = seq->element (i);
> +	      if (INSN_P (subinsn))
> +		replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
> +	    }
> +	}
> +      else
> +	replace_constant_pool_with_broadcast (insn);
> +    }
> +  timevar_pop (TV_MACH_DEP);
> +  return 0;
> +}
> +
> +namespace {
> +
> +const pass_data pass_data_constant_pool_broadcast =
> +{
> +  RTL_PASS, /* type */
> +  "cpb", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_MACH_DEP, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_constant_pool_broadcast : public rtl_opt_pass
> +{
> +public:
> +  pass_constant_pool_broadcast (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      return TARGET_AVX512F;
> +    }
> +
> +  virtual unsigned int execute (function *)
> +    {
> +      return constant_pool_broadcast ();
> +    }
> +}; // class pass_cpb
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_constant_pool_broadcast (gcc::context *ctxt)
> +{
> +  return new pass_constant_pool_broadcast (ctxt);
> +}
> +
>  /* This compares the priority of target features in function DECL1
>     and DECL2.  It returns positive value if DECL1 is higher priority,
>     negative value if DECL2 is higher priority and 0 if they are the
> diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> index d83c7b956b1..07ecf8e790f 100644
> --- a/gcc/config/i386/i386-passes.def
> +++ b/gcc/config/i386/i386-passes.def
> @@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
>    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
>  
>    INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> +  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 7c2ce618f3f..6c6909b41dd 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
>    (gcc::context *);
>  extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
>    (gcc::context *);
> +extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 431571a4bc1..fbfb459c5bf 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12127,6 +12127,19 @@
>     (set_attr "prefix" "evex")
>     (set_attr "mode" "<sseinsnmode>")])
>  
> +(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
> +  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
> +	(mult:VI8_AVX512VL
> +	  (vec_duplicate:VI8_AVX512VL
> +	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> +	  (match_operand:VI8_AVX512VL 2 "register_operand" "v")
> +))]
> +  "TARGET_AVX512DQ"
> +  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> +  [(set_attr "type" "sseimul")
> +   (set_attr "prefix" "evex")
> +   (set_attr "mode" "<sseinsnmode>")])
> +
>  (define_expand "mul<mode>3<mask_name>"
>    [(set (match_operand:VI4_AVX512F 0 "register_operand")
>  	(mult:VI4_AVX512F
> @@ -12167,6 +12180,18 @@
>     (set_attr "btver2_decode" "vector,vector,vector")
>     (set_attr "mode" "<sseinsnmode>")])
>  
> +(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
> +  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
> +	(mult:VI4_AVX512VL
> +	  (vec_duplicate:VI4_AVX512VL
> +	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> +	  (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
> +  "TARGET_AVX512F"
> +   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> +  [(set_attr "type" "sseimul")
> +   (set_attr "prefix" "evex")
> +   (set_attr "mode" "<sseinsnmode>")])
> +
>  (define_expand "mul<mode>3"
>    [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
>  	(mult:VI8_AVX2_AVX512F
> diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> new file mode 100644
> index 00000000000..800ef1f957e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> @@ -0,0 +1,40 @@
> +/* PR target/87767 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx2" } */
> +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> +
> +typedef int v4si  __attribute__ ((vector_size (16)));
> +typedef int v8si  __attribute__ ((vector_size (32)));
> +typedef long long v2di  __attribute__ ((vector_size (16)));
> +typedef long long v4di  __attribute__ ((vector_size (32)));
> +typedef float v4sf  __attribute__ ((vector_size (16)));
> +typedef float v8sf  __attribute__ ((vector_size (32)));
> +typedef double v2df  __attribute__ ((vector_size (16)));
> +typedef double v4df  __attribute__ ((vector_size (32)));
> +
> +#define FOO(VTYPE, OP_NAME, OP)			\
> +VTYPE						\
> + __attribute__ ((noipa))			\
> +foo_##OP_NAME##_##VTYPE (VTYPE a)		\
> +{						\
> +  return a OP 101;				\
> +}						\
> +
> +FOO (v4si, add, +);
> +FOO (v8si, add, +);
> +FOO (v2di, add, +);
> +FOO (v4di, add, +);
> +FOO (v4sf, add, +);
> +FOO (v8sf, add, +);
> +FOO (v2df, add, +);
> +FOO (v4df, add, +);
> +
> +FOO (v4si, mul, *);
> +FOO (v8si, mul, *);
> +FOO (v2di, mul, *);
> +FOO (v4di, mul, *);
> +FOO (v4sf, mul, *);
> +FOO (v8sf, mul, *);
> +FOO (v2df, mul, *);
> +FOO (v4df, mul, *);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> new file mode 100644
> index 00000000000..21249bc0cf9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> @@ -0,0 +1,66 @@
> +/* PR target/87767 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> +
> +typedef int v4si  __attribute__ ((vector_size (16)));
> +typedef int v8si  __attribute__ ((vector_size (32)));
> +typedef int v16si  __attribute__ ((vector_size (64)));
> +typedef long long v2di  __attribute__ ((vector_size (16)));
> +typedef long long v4di  __attribute__ ((vector_size (32)));
> +typedef long long v8di  __attribute__ ((vector_size (64)));
> +typedef float v4sf  __attribute__ ((vector_size (16)));
> +typedef float v8sf  __attribute__ ((vector_size (32)));
> +typedef float v16sf  __attribute__ ((vector_size (64)));
> +typedef double v2df  __attribute__ ((vector_size (16)));
> +typedef double v4df  __attribute__ ((vector_size (32)));
> +typedef double v8df  __attribute__ ((vector_size (64)));
> +
> +#define FOO(VTYPE, OP_NAME, OP)			\
> +VTYPE						\
> + __attribute__ ((noipa))			\
> +foo_##OP_NAME##_##VTYPE (VTYPE a)		\
> +{						\
> +  return a OP 101;				\
> +}						\
> +
> +FOO (v4si, add, +);
> +FOO (v8si, add, +);
> +FOO (v16si, add, +);
> +FOO (v2di, add, +);
> +FOO (v4di, add, +);
> +FOO (v8di, add, +);
> +FOO (v4sf, add, +);
> +FOO (v8sf, add, +);
> +FOO (v16sf, add, +);
> +FOO (v2df, add, +);
> +FOO (v4df, add, +);
> +FOO (v8df, add, +);
> +
> +FOO (v4si, mul, *);
> +FOO (v8si, mul, *);
> +FOO (v16si, mul, *);
> +FOO (v2di, mul, *);
> +FOO (v4di, mul, *);
> +FOO (v8di, mul, *);
> +FOO (v4sf, mul, *);
> +FOO (v8sf, mul, *);
> +FOO (v16sf, mul, *);
> +FOO (v2df, mul, *);
> +FOO (v4df, mul, *);
> +FOO (v8df, mul, *);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> new file mode 100644
> index 00000000000..938346743c2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> @@ -0,0 +1,54 @@
> +/* PR target/87767 */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> +
> +#include<stdlib.h>
> +#include<stdio.h>
> +#include "avx512f-broadcast-pr87767-1.c"
> +
> +#define TEST(VTYPE, TYPE, N, OP_NAME, OP)		\
> +  do							\
> +    {							\
> +      TYPE exp[N], src[N];				\
> +      VTYPE res;					\
> +      for (int i = 0; i < N; i++)			\
> +	src[i] = i * i * 107;				\
> +      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
> +      for (int i = 0; i < N; i ++)			\
> +	exp[i] = src[i] OP 101;				\
> +      for (int j = 0; j < N; j++)			\
> +	{						\
> +	  if (res[j] != exp[j])				\
> +	    abort();					\
> +	}						\
> +    }							\
> +  while (0)
> +
> +int main()
> +{
> +  TEST (v4si, int, 4, add, +);
> +  TEST (v8si, int, 8, add, +);
> +  TEST (v16si, int, 16, add, +);
> +  TEST (v2di, long long, 2, add, +);
> +  TEST (v4di, long long, 4, add, +);
> +  TEST (v8di, long long, 8, add, +);
> +  TEST (v4sf, float, 4, add, +);
> +  TEST (v8sf, float, 8, add, +);
> +  TEST (v16sf, float, 16, add, +);
> +  TEST (v2df, double, 2, add, +);
> +  TEST (v4df, double, 4, add, +);
> +  TEST (v8df, double, 8, add, +);
> +
> +  TEST (v4si, int, 4, mul, *);
> +  TEST (v8si, int, 8, mul, *);
> +  TEST (v16si, int, 16, mul, *);
> +  TEST (v2di, long long, 2, mul, *);
> +  TEST (v4di, long long, 4, mul, *);
> +  TEST (v8di, long long, 8, mul, *);
> +  TEST (v4sf, float, 4, mul, *);
> +  TEST (v8sf, float, 8, mul, *);
> +  TEST (v16sf, float, 16, mul, *);
> +  TEST (v2df, double, 2, mul, *);
> +  TEST (v4df, double, 4, mul, *);
> +  TEST (v8df, double, 8, mul, *);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> new file mode 100644
> index 00000000000..ec159a68158
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> @@ -0,0 +1,40 @@
> +/* PR target/87767 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512f" } */
> +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> +
> +typedef int v4si  __attribute__ ((vector_size (16)));
> +typedef int v8si  __attribute__ ((vector_size (32)));
> +typedef long long v2di  __attribute__ ((vector_size (16)));
> +typedef long long v4di  __attribute__ ((vector_size (32)));
> +typedef float v4sf  __attribute__ ((vector_size (16)));
> +typedef float v8sf  __attribute__ ((vector_size (32)));
> +typedef double v2df  __attribute__ ((vector_size (16)));
> +typedef double v4df  __attribute__ ((vector_size (32)));
> +
> +#define FOO(VTYPE, OP_NAME, OP)			\
> +VTYPE						\
> + __attribute__ ((noipa))			\
> +foo_##OP_NAME##_##VTYPE (VTYPE a)		\
> +{						\
> +  return a OP 101;				\
> +}						\
> +
> +FOO (v4si, add, +);
> +FOO (v8si, add, +);
> +FOO (v2di, add, +);
> +FOO (v4di, add, +);
> +FOO (v4sf, add, +);
> +FOO (v8sf, add, +);
> +FOO (v2df, add, +);
> +FOO (v4df, add, +);
> +
> +FOO (v4si, mul, *);
> +FOO (v8si, mul, *);
> +FOO (v2di, mul, *);
> +FOO (v4di, mul, *);
> +FOO (v4sf, mul, *);
> +FOO (v8sf, mul, *);
> +FOO (v2df, mul, *);
> +FOO (v4df, mul, *);
> diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> index 49b5778a067..a37487d9af7 100644
> --- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
> +++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> @@ -3,10 +3,11 @@
>  /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
>  /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
>  /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
> -/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
> -/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
> -/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
> -/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
> +/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
> +/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
> +/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
>  
>  extern char arraysb[64];
>  extern short arraysw[32];
> -- 
> 2.18.1
>
Hongtao Liu July 23, 2020, 1:53 p.m. UTC | #4
On Thu, Jul 23, 2020 at 4:39 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hello,
> sorry for taking so long to get to this.
> > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > index 535fc7e981d..8f81d101382 100644
> > --- a/gcc/config/i386/i386-features.c
> > +++ b/gcc/config/i386/i386-features.c
> > @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> >    return new pass_remove_partial_avx_dependency (ctxt);
> >  }
> >
> > +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> > +   with embedded broadcast. i.e.transform
> > +
> > +     vpaddq .LC0(%rip), %zmm0, %zmm0
> > +     ret
> > +  .LC0:
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +    .quad 3
> > +
> > +    to
> > +
> > +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
>
> It seems to me that having a special purpose pass for this is bit
> overzelaous.  It seems to me that you can do same pattern matching via
> splitter and fit it into the usual insn splitting pass?
>

From an implementation perspective, there could be lots of work, since
memory embedding broadcast is available for nearly every instruction
in AVX512. And for new added AVX512 instructions, we also need to add
a define_split for them.

> Honza
> > +     ret
> > +  .LC0:
> > +    .quad 3  */
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
> > +      /* There could be some rtx like
> > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > +      but with "*.LC1" refer to V2DI constant vector.  */
> > +      if (GET_MODE (constant) != mode)
> > +     return;
> > +
> > +      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
> > +     {
> > +       rtx tmp = XVECEXP (constant, 0, i);
> > +       /* Only handle one-value const vector.  */
> > +       if (!rtx_equal_p (tmp, first))
> > +         return;
> > +     }
> > +
> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
> > +      /* At most 1 memory_operand in an insn.  */
> > +      return;
> > +    }
> > +}
> > +
> > +/* For const vector having one duplicated value, there's no need to put
> > +   whole vector in the constant pool when target supports embedded broadcast. */
> > +static unsigned int
> > +constant_pool_broadcast (void)
> > +{
> > +  timevar_push (TV_MACH_DEP);
> > +  rtx_insn *insn;
> > +
> > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > +    {
> > +      if (!INSN_P (insn))
> > +     continue;
> > +
> > +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> > +      insns, not any notes that may be attached.  We don't want to mark
> > +      a constant just because it happens to appear in a REG_EQUIV note.  */
> > +      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
> > +     {
> > +       int i, n = seq->len ();
> > +       for (i = 0; i < n; ++i)
> > +         {
> > +           rtx subinsn = seq->element (i);
> > +           if (INSN_P (subinsn))
> > +             replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
> > +         }
> > +     }
> > +      else
> > +     replace_constant_pool_with_broadcast (insn);
> > +    }
> > +  timevar_pop (TV_MACH_DEP);
> > +  return 0;
> > +}
> > +
> > +namespace {
> > +
> > +const pass_data pass_data_constant_pool_broadcast =
> > +{
> > +  RTL_PASS, /* type */
> > +  "cpb", /* name */
> > +  OPTGROUP_NONE, /* optinfo_flags */
> > +  TV_MACH_DEP, /* tv_id */
> > +  0, /* properties_required */
> > +  0, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  TODO_df_finish, /* todo_flags_finish */
> > +};
> > +
> > +class pass_constant_pool_broadcast : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_constant_pool_broadcast (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return TARGET_AVX512F;
> > +    }
> > +
> > +  virtual unsigned int execute (function *)
> > +    {
> > +      return constant_pool_broadcast ();
> > +    }
> > +}; // class pass_cpb
> > +
> > +} // anon namespace
> > +
> > +rtl_opt_pass *
> > +make_pass_constant_pool_broadcast (gcc::context *ctxt)
> > +{
> > +  return new pass_constant_pool_broadcast (ctxt);
> > +}
> > +
> >  /* This compares the priority of target features in function DECL1
> >     and DECL2.  It returns positive value if DECL1 is higher priority,
> >     negative value if DECL2 is higher priority and 0 if they are the
> > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> > index d83c7b956b1..07ecf8e790f 100644
> > --- a/gcc/config/i386/i386-passes.def
> > +++ b/gcc/config/i386/i386-passes.def
> > @@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
> >    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
> >
> >    INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> > +  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 7c2ce618f3f..6c6909b41dd 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
> >    (gcc::context *);
> >  extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
> >    (gcc::context *);
> > +extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 431571a4bc1..fbfb459c5bf 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -12127,6 +12127,19 @@
> >     (set_attr "prefix" "evex")
> >     (set_attr "mode" "<sseinsnmode>")])
> >
> > +(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
> > +  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
> > +     (mult:VI8_AVX512VL
> > +       (vec_duplicate:VI8_AVX512VL
> > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > +       (match_operand:VI8_AVX512VL 2 "register_operand" "v")
> > +))]
> > +  "TARGET_AVX512DQ"
> > +  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > +  [(set_attr "type" "sseimul")
> > +   (set_attr "prefix" "evex")
> > +   (set_attr "mode" "<sseinsnmode>")])
> > +
> >  (define_expand "mul<mode>3<mask_name>"
> >    [(set (match_operand:VI4_AVX512F 0 "register_operand")
> >       (mult:VI4_AVX512F
> > @@ -12167,6 +12180,18 @@
> >     (set_attr "btver2_decode" "vector,vector,vector")
> >     (set_attr "mode" "<sseinsnmode>")])
> >
> > +(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
> > +  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
> > +     (mult:VI4_AVX512VL
> > +       (vec_duplicate:VI4_AVX512VL
> > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > +       (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
> > +  "TARGET_AVX512F"
> > +   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > +  [(set_attr "type" "sseimul")
> > +   (set_attr "prefix" "evex")
> > +   (set_attr "mode" "<sseinsnmode>")])
> > +
> >  (define_expand "mul<mode>3"
> >    [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
> >       (mult:VI8_AVX2_AVX512F
> > diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > new file mode 100644
> > index 00000000000..800ef1f957e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > @@ -0,0 +1,40 @@
> > +/* PR target/87767 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx2" } */
> > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > +
> > +typedef int v4si  __attribute__ ((vector_size (16)));
> > +typedef int v8si  __attribute__ ((vector_size (32)));
> > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > +typedef double v2df  __attribute__ ((vector_size (16)));
> > +typedef double v4df  __attribute__ ((vector_size (32)));
> > +
> > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > +VTYPE                                                \
> > + __attribute__ ((noipa))                     \
> > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > +{                                            \
> > +  return a OP 101;                           \
> > +}                                            \
> > +
> > +FOO (v4si, add, +);
> > +FOO (v8si, add, +);
> > +FOO (v2di, add, +);
> > +FOO (v4di, add, +);
> > +FOO (v4sf, add, +);
> > +FOO (v8sf, add, +);
> > +FOO (v2df, add, +);
> > +FOO (v4df, add, +);
> > +
> > +FOO (v4si, mul, *);
> > +FOO (v8si, mul, *);
> > +FOO (v2di, mul, *);
> > +FOO (v4di, mul, *);
> > +FOO (v4sf, mul, *);
> > +FOO (v8sf, mul, *);
> > +FOO (v2df, mul, *);
> > +FOO (v4df, mul, *);
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > new file mode 100644
> > index 00000000000..21249bc0cf9
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > @@ -0,0 +1,66 @@
> > +/* PR target/87767 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > +
> > +typedef int v4si  __attribute__ ((vector_size (16)));
> > +typedef int v8si  __attribute__ ((vector_size (32)));
> > +typedef int v16si  __attribute__ ((vector_size (64)));
> > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > +typedef long long v8di  __attribute__ ((vector_size (64)));
> > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > +typedef float v16sf  __attribute__ ((vector_size (64)));
> > +typedef double v2df  __attribute__ ((vector_size (16)));
> > +typedef double v4df  __attribute__ ((vector_size (32)));
> > +typedef double v8df  __attribute__ ((vector_size (64)));
> > +
> > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > +VTYPE                                                \
> > + __attribute__ ((noipa))                     \
> > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > +{                                            \
> > +  return a OP 101;                           \
> > +}                                            \
> > +
> > +FOO (v4si, add, +);
> > +FOO (v8si, add, +);
> > +FOO (v16si, add, +);
> > +FOO (v2di, add, +);
> > +FOO (v4di, add, +);
> > +FOO (v8di, add, +);
> > +FOO (v4sf, add, +);
> > +FOO (v8sf, add, +);
> > +FOO (v16sf, add, +);
> > +FOO (v2df, add, +);
> > +FOO (v4df, add, +);
> > +FOO (v8df, add, +);
> > +
> > +FOO (v4si, mul, *);
> > +FOO (v8si, mul, *);
> > +FOO (v16si, mul, *);
> > +FOO (v2di, mul, *);
> > +FOO (v4di, mul, *);
> > +FOO (v8di, mul, *);
> > +FOO (v4sf, mul, *);
> > +FOO (v8sf, mul, *);
> > +FOO (v16sf, mul, *);
> > +FOO (v2df, mul, *);
> > +FOO (v4df, mul, *);
> > +FOO (v8df, mul, *);
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > new file mode 100644
> > index 00000000000..938346743c2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > @@ -0,0 +1,54 @@
> > +/* PR target/87767 */
> > +/* { dg-do run } */
> > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > +
> > +#include<stdlib.h>
> > +#include<stdio.h>
> > +#include "avx512f-broadcast-pr87767-1.c"
> > +
> > +#define TEST(VTYPE, TYPE, N, OP_NAME, OP)            \
> > +  do                                                 \
> > +    {                                                        \
> > +      TYPE exp[N], src[N];                           \
> > +      VTYPE res;                                     \
> > +      for (int i = 0; i < N; i++)                    \
> > +     src[i] = i * i * 107;                           \
> > +      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);      \
> > +      for (int i = 0; i < N; i ++)                   \
> > +     exp[i] = src[i] OP 101;                         \
> > +      for (int j = 0; j < N; j++)                    \
> > +     {                                               \
> > +       if (res[j] != exp[j])                         \
> > +         abort();                                    \
> > +     }                                               \
> > +    }                                                        \
> > +  while (0)
> > +
> > +int main()
> > +{
> > +  TEST (v4si, int, 4, add, +);
> > +  TEST (v8si, int, 8, add, +);
> > +  TEST (v16si, int, 16, add, +);
> > +  TEST (v2di, long long, 2, add, +);
> > +  TEST (v4di, long long, 4, add, +);
> > +  TEST (v8di, long long, 8, add, +);
> > +  TEST (v4sf, float, 4, add, +);
> > +  TEST (v8sf, float, 8, add, +);
> > +  TEST (v16sf, float, 16, add, +);
> > +  TEST (v2df, double, 2, add, +);
> > +  TEST (v4df, double, 4, add, +);
> > +  TEST (v8df, double, 8, add, +);
> > +
> > +  TEST (v4si, int, 4, mul, *);
> > +  TEST (v8si, int, 8, mul, *);
> > +  TEST (v16si, int, 16, mul, *);
> > +  TEST (v2di, long long, 2, mul, *);
> > +  TEST (v4di, long long, 4, mul, *);
> > +  TEST (v8di, long long, 8, mul, *);
> > +  TEST (v4sf, float, 4, mul, *);
> > +  TEST (v8sf, float, 8, mul, *);
> > +  TEST (v16sf, float, 16, mul, *);
> > +  TEST (v2df, double, 2, mul, *);
> > +  TEST (v4df, double, 4, mul, *);
> > +  TEST (v8df, double, 8, mul, *);
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > new file mode 100644
> > index 00000000000..ec159a68158
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > @@ -0,0 +1,40 @@
> > +/* PR target/87767 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx512f" } */
> > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > +
> > +typedef int v4si  __attribute__ ((vector_size (16)));
> > +typedef int v8si  __attribute__ ((vector_size (32)));
> > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > +typedef double v2df  __attribute__ ((vector_size (16)));
> > +typedef double v4df  __attribute__ ((vector_size (32)));
> > +
> > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > +VTYPE                                                \
> > + __attribute__ ((noipa))                     \
> > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > +{                                            \
> > +  return a OP 101;                           \
> > +}                                            \
> > +
> > +FOO (v4si, add, +);
> > +FOO (v8si, add, +);
> > +FOO (v2di, add, +);
> > +FOO (v4di, add, +);
> > +FOO (v4sf, add, +);
> > +FOO (v8sf, add, +);
> > +FOO (v2df, add, +);
> > +FOO (v4df, add, +);
> > +
> > +FOO (v4si, mul, *);
> > +FOO (v8si, mul, *);
> > +FOO (v2di, mul, *);
> > +FOO (v4di, mul, *);
> > +FOO (v4sf, mul, *);
> > +FOO (v8sf, mul, *);
> > +FOO (v2df, mul, *);
> > +FOO (v4df, mul, *);
> > diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > index 49b5778a067..a37487d9af7 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > @@ -3,10 +3,11 @@
> >  /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
> >  /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
> >  /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
> > -/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
> > -/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
> > -/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
> > -/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
> > +/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
> > +/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
> > +/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
> > +/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
> > +/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
> >
> >  extern char arraysb[64];
> >  extern short arraysw[32];
> > --
> > 2.18.1
> >
>
Hongtao Liu July 24, 2020, 2:37 a.m. UTC | #5
On Thu, Jul 23, 2020 at 9:53 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 4:39 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > Hello,
> > sorry for taking so long to get to this.
> > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > index 535fc7e981d..8f81d101382 100644
> > > --- a/gcc/config/i386/i386-features.c
> > > +++ b/gcc/config/i386/i386-features.c
> > > @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> > >    return new pass_remove_partial_avx_dependency (ctxt);
> > >  }
> > >
> > > +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> > > +   with embedded broadcast. i.e.transform
> > > +
> > > +     vpaddq .LC0(%rip), %zmm0, %zmm0
> > > +     ret
> > > +  .LC0:
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +    .quad 3
> > > +
> > > +    to
> > > +
> > > +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
> >
> > It seems to me that having a special purpose pass for this is bit
> > overzelaous.  It seems to me that you can do same pattern matching via
> > splitter and fit it into the usual insn splitting pass?
> >
>
> From an implementation perspective, there could be lots of work, since
> memory embedding broadcast is available for nearly every instruction
> in AVX512. And for new added AVX512 instructions, we also need to add
> a define_split for them.
>

I'll add more tests to show my point.

> > Honza
> > > +     ret
> > > +  .LC0:
> > > +    .quad 3  */
> > > +static void
> > > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > > +{
> > > +  subrtx_ptr_iterator::array_type array;
> > > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > > +    {
> > > +      rtx *loc = *iter;
> > > +      rtx x = *loc;
> > > +      rtx broadcast_mem, vec_dup, constant, first;
> > > +      machine_mode mode;
> > > +      if (GET_CODE (x) != MEM
> > > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
> > > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > > +     continue;
> > > +
> > > +      mode = GET_MODE (x);
> > > +      if (!VECTOR_MODE_P (mode))
> > > +     return;
> > > +
> > > +      constant = get_pool_constant (XEXP (x, 0));
> > > +      first = XVECEXP (constant, 0, 0);
> > > +      /* There could be some rtx like
> > > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > > +      but with "*.LC1" refer to V2DI constant vector.  */
> > > +      if (GET_MODE (constant) != mode)
> > > +     return;
> > > +
> > > +      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
> > > +     {
> > > +       rtx tmp = XVECEXP (constant, 0, i);
> > > +       /* Only handle one-value const vector.  */
> > > +       if (!rtx_equal_p (tmp, first))
> > > +         return;
> > > +     }
> > > +
> > > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > > +      *loc = vec_dup;
> > > +      INSN_CODE (insn) = -1;
> > > +      /* Revert change if there's no corresponding pattern.  */
> > > +      if (recog_memoized (insn) < 0)
> > > +             {
> > > +               *loc = x;
> > > +               recog_memoized (insn);
> > > +             }
> > > +      /* At most 1 memory_operand in an insn.  */
> > > +      return;
> > > +    }
> > > +}
> > > +
> > > +/* For const vector having one duplicated value, there's no need to put
> > > +   whole vector in the constant pool when target supports embedded broadcast. */
> > > +static unsigned int
> > > +constant_pool_broadcast (void)
> > > +{
> > > +  timevar_push (TV_MACH_DEP);
> > > +  rtx_insn *insn;
> > > +
> > > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > > +    {
> > > +      if (!INSN_P (insn))
> > > +     continue;
> > > +
> > > +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> > > +      insns, not any notes that may be attached.  We don't want to mark
> > > +      a constant just because it happens to appear in a REG_EQUIV note.  */
> > > +      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
> > > +     {
> > > +       int i, n = seq->len ();
> > > +       for (i = 0; i < n; ++i)
> > > +         {
> > > +           rtx subinsn = seq->element (i);
> > > +           if (INSN_P (subinsn))
> > > +             replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
> > > +         }
> > > +     }
> > > +      else
> > > +     replace_constant_pool_with_broadcast (insn);
> > > +    }
> > > +  timevar_pop (TV_MACH_DEP);
> > > +  return 0;
> > > +}
> > > +
> > > +namespace {
> > > +
> > > +const pass_data pass_data_constant_pool_broadcast =
> > > +{
> > > +  RTL_PASS, /* type */
> > > +  "cpb", /* name */
> > > +  OPTGROUP_NONE, /* optinfo_flags */
> > > +  TV_MACH_DEP, /* tv_id */
> > > +  0, /* properties_required */
> > > +  0, /* properties_provided */
> > > +  0, /* properties_destroyed */
> > > +  0, /* todo_flags_start */
> > > +  TODO_df_finish, /* todo_flags_finish */
> > > +};
> > > +
> > > +class pass_constant_pool_broadcast : public rtl_opt_pass
> > > +{
> > > +public:
> > > +  pass_constant_pool_broadcast (gcc::context *ctxt)
> > > +    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
> > > +  {}
> > > +
> > > +  /* opt_pass methods: */
> > > +  virtual bool gate (function *)
> > > +    {
> > > +      return TARGET_AVX512F;
> > > +    }
> > > +
> > > +  virtual unsigned int execute (function *)
> > > +    {
> > > +      return constant_pool_broadcast ();
> > > +    }
> > > +}; // class pass_cpb
> > > +
> > > +} // anon namespace
> > > +
> > > +rtl_opt_pass *
> > > +make_pass_constant_pool_broadcast (gcc::context *ctxt)
> > > +{
> > > +  return new pass_constant_pool_broadcast (ctxt);
> > > +}
> > > +
> > >  /* This compares the priority of target features in function DECL1
> > >     and DECL2.  It returns positive value if DECL1 is higher priority,
> > >     negative value if DECL2 is higher priority and 0 if they are the
> > > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> > > index d83c7b956b1..07ecf8e790f 100644
> > > --- a/gcc/config/i386/i386-passes.def
> > > +++ b/gcc/config/i386/i386-passes.def
> > > @@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
> > >    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
> > >
> > >    INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> > > +  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
> > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > > index 7c2ce618f3f..6c6909b41dd 100644
> > > --- a/gcc/config/i386/i386-protos.h
> > > +++ b/gcc/config/i386/i386-protos.h
> > > @@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
> > >    (gcc::context *);
> > >  extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
> > >    (gcc::context *);
> > > +extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
> > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > index 431571a4bc1..fbfb459c5bf 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -12127,6 +12127,19 @@
> > >     (set_attr "prefix" "evex")
> > >     (set_attr "mode" "<sseinsnmode>")])
> > >
> > > +(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
> > > +  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
> > > +     (mult:VI8_AVX512VL
> > > +       (vec_duplicate:VI8_AVX512VL
> > > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > > +       (match_operand:VI8_AVX512VL 2 "register_operand" "v")
> > > +))]
> > > +  "TARGET_AVX512DQ"
> > > +  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > > +  [(set_attr "type" "sseimul")
> > > +   (set_attr "prefix" "evex")
> > > +   (set_attr "mode" "<sseinsnmode>")])
> > > +
> > >  (define_expand "mul<mode>3<mask_name>"
> > >    [(set (match_operand:VI4_AVX512F 0 "register_operand")
> > >       (mult:VI4_AVX512F
> > > @@ -12167,6 +12180,18 @@
> > >     (set_attr "btver2_decode" "vector,vector,vector")
> > >     (set_attr "mode" "<sseinsnmode>")])
> > >
> > > +(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
> > > +  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
> > > +     (mult:VI4_AVX512VL
> > > +       (vec_duplicate:VI4_AVX512VL
> > > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > > +       (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
> > > +  "TARGET_AVX512F"
> > > +   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > > +  [(set_attr "type" "sseimul")
> > > +   (set_attr "prefix" "evex")
> > > +   (set_attr "mode" "<sseinsnmode>")])
> > > +
> > >  (define_expand "mul<mode>3"
> > >    [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
> > >       (mult:VI8_AVX2_AVX512F
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > > new file mode 100644
> > > index 00000000000..800ef1f957e
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > > @@ -0,0 +1,40 @@
> > > +/* PR target/87767 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx2" } */
> > > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > > +
> > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > +
> > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > +VTYPE                                                \
> > > + __attribute__ ((noipa))                     \
> > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > +{                                            \
> > > +  return a OP 101;                           \
> > > +}                                            \
> > > +
> > > +FOO (v4si, add, +);
> > > +FOO (v8si, add, +);
> > > +FOO (v2di, add, +);
> > > +FOO (v4di, add, +);
> > > +FOO (v4sf, add, +);
> > > +FOO (v8sf, add, +);
> > > +FOO (v2df, add, +);
> > > +FOO (v4df, add, +);
> > > +
> > > +FOO (v4si, mul, *);
> > > +FOO (v8si, mul, *);
> > > +FOO (v2di, mul, *);
> > > +FOO (v4di, mul, *);
> > > +FOO (v4sf, mul, *);
> > > +FOO (v8sf, mul, *);
> > > +FOO (v2df, mul, *);
> > > +FOO (v4df, mul, *);
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > > new file mode 100644
> > > index 00000000000..21249bc0cf9
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > > @@ -0,0 +1,66 @@
> > > +/* PR target/87767 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > +
> > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > +typedef int v16si  __attribute__ ((vector_size (64)));
> > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > +typedef long long v8di  __attribute__ ((vector_size (64)));
> > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > +typedef float v16sf  __attribute__ ((vector_size (64)));
> > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > +typedef double v8df  __attribute__ ((vector_size (64)));
> > > +
> > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > +VTYPE                                                \
> > > + __attribute__ ((noipa))                     \
> > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > +{                                            \
> > > +  return a OP 101;                           \
> > > +}                                            \
> > > +
> > > +FOO (v4si, add, +);
> > > +FOO (v8si, add, +);
> > > +FOO (v16si, add, +);
> > > +FOO (v2di, add, +);
> > > +FOO (v4di, add, +);
> > > +FOO (v8di, add, +);
> > > +FOO (v4sf, add, +);
> > > +FOO (v8sf, add, +);
> > > +FOO (v16sf, add, +);
> > > +FOO (v2df, add, +);
> > > +FOO (v4df, add, +);
> > > +FOO (v8df, add, +);
> > > +
> > > +FOO (v4si, mul, *);
> > > +FOO (v8si, mul, *);
> > > +FOO (v16si, mul, *);
> > > +FOO (v2di, mul, *);
> > > +FOO (v4di, mul, *);
> > > +FOO (v8di, mul, *);
> > > +FOO (v4sf, mul, *);
> > > +FOO (v8sf, mul, *);
> > > +FOO (v16sf, mul, *);
> > > +FOO (v2df, mul, *);
> > > +FOO (v4df, mul, *);
> > > +FOO (v8df, mul, *);
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > > new file mode 100644
> > > index 00000000000..938346743c2
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > > @@ -0,0 +1,54 @@
> > > +/* PR target/87767 */
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > > +
> > > +#include<stdlib.h>
> > > +#include<stdio.h>
> > > +#include "avx512f-broadcast-pr87767-1.c"
> > > +
> > > +#define TEST(VTYPE, TYPE, N, OP_NAME, OP)            \
> > > +  do                                                 \
> > > +    {                                                        \
> > > +      TYPE exp[N], src[N];                           \
> > > +      VTYPE res;                                     \
> > > +      for (int i = 0; i < N; i++)                    \
> > > +     src[i] = i * i * 107;                           \
> > > +      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);      \
> > > +      for (int i = 0; i < N; i ++)                   \
> > > +     exp[i] = src[i] OP 101;                         \
> > > +      for (int j = 0; j < N; j++)                    \
> > > +     {                                               \
> > > +       if (res[j] != exp[j])                         \
> > > +         abort();                                    \
> > > +     }                                               \
> > > +    }                                                        \
> > > +  while (0)
> > > +
> > > +int main()
> > > +{
> > > +  TEST (v4si, int, 4, add, +);
> > > +  TEST (v8si, int, 8, add, +);
> > > +  TEST (v16si, int, 16, add, +);
> > > +  TEST (v2di, long long, 2, add, +);
> > > +  TEST (v4di, long long, 4, add, +);
> > > +  TEST (v8di, long long, 8, add, +);
> > > +  TEST (v4sf, float, 4, add, +);
> > > +  TEST (v8sf, float, 8, add, +);
> > > +  TEST (v16sf, float, 16, add, +);
> > > +  TEST (v2df, double, 2, add, +);
> > > +  TEST (v4df, double, 4, add, +);
> > > +  TEST (v8df, double, 8, add, +);
> > > +
> > > +  TEST (v4si, int, 4, mul, *);
> > > +  TEST (v8si, int, 8, mul, *);
> > > +  TEST (v16si, int, 16, mul, *);
> > > +  TEST (v2di, long long, 2, mul, *);
> > > +  TEST (v4di, long long, 4, mul, *);
> > > +  TEST (v8di, long long, 8, mul, *);
> > > +  TEST (v4sf, float, 4, mul, *);
> > > +  TEST (v8sf, float, 8, mul, *);
> > > +  TEST (v16sf, float, 16, mul, *);
> > > +  TEST (v2df, double, 2, mul, *);
> > > +  TEST (v4df, double, 4, mul, *);
> > > +  TEST (v8df, double, 8, mul, *);
> > > +}
> > > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > > new file mode 100644
> > > index 00000000000..ec159a68158
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > > @@ -0,0 +1,40 @@
> > > +/* PR target/87767 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx512f" } */
> > > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > > +
> > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > +
> > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > +VTYPE                                                \
> > > + __attribute__ ((noipa))                     \
> > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > +{                                            \
> > > +  return a OP 101;                           \
> > > +}                                            \
> > > +
> > > +FOO (v4si, add, +);
> > > +FOO (v8si, add, +);
> > > +FOO (v2di, add, +);
> > > +FOO (v4di, add, +);
> > > +FOO (v4sf, add, +);
> > > +FOO (v8sf, add, +);
> > > +FOO (v2df, add, +);
> > > +FOO (v4df, add, +);
> > > +
> > > +FOO (v4si, mul, *);
> > > +FOO (v8si, mul, *);
> > > +FOO (v2di, mul, *);
> > > +FOO (v4di, mul, *);
> > > +FOO (v4sf, mul, *);
> > > +FOO (v8sf, mul, *);
> > > +FOO (v2df, mul, *);
> > > +FOO (v4df, mul, *);
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > index 49b5778a067..a37487d9af7 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > @@ -3,10 +3,11 @@
> > >  /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
> > >  /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
> > >  /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
> > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
> > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
> > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
> > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
> > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
> > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
> > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
> > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
> > > +/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
> > >
> > >  extern char arraysb[64];
> > >  extern short arraysw[32];
> > > --
> > > 2.18.1
> > >
> >
>
>
> --
> BR,
> Hongtao
Hongtao Liu Aug. 4, 2020, 6:05 a.m. UTC | #6
Update patch.

There are a lot of avx512 define_insns which lack corresponding memory
broadcast version, i only add *avx512f_mul<mode>3<mask_name>_bcst and
*avx512dq_mul<mode>3<mask_name>_bcst in this patch.

On Fri, Jul 24, 2020 at 10:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 9:53 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Thu, Jul 23, 2020 at 4:39 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> > >
> > > Hello,
> > > sorry for taking so long to get to this.
> > > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > > index 535fc7e981d..8f81d101382 100644
> > > > --- a/gcc/config/i386/i386-features.c
> > > > +++ b/gcc/config/i386/i386-features.c
> > > > @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> > > >    return new pass_remove_partial_avx_dependency (ctxt);
> > > >  }
> > > >
> > > > +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> > > > +   with embedded broadcast. i.e.transform
> > > > +
> > > > +     vpaddq .LC0(%rip), %zmm0, %zmm0
> > > > +     ret
> > > > +  .LC0:
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +    .quad 3
> > > > +
> > > > +    to
> > > > +
> > > > +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
> > >
> > > It seems to me that having a special purpose pass for this is bit
> > > overzelaous.  It seems to me that you can do same pattern matching via
> > > splitter and fit it into the usual insn splitting pass?
> > >
> >
> > From an implementation perspective, there could be lots of work, since
> > memory embedding broadcast is available for nearly every instruction
> > in AVX512. And for new added AVX512 instructions, we also need to add
> > a define_split for them.
> >
>
> I'll add more tests to show my point.
>
> > > Honza
> > > > +     ret
> > > > +  .LC0:
> > > > +    .quad 3  */
> > > > +static void
> > > > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > > > +{
> > > > +  subrtx_ptr_iterator::array_type array;
> > > > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > > > +    {
> > > > +      rtx *loc = *iter;
> > > > +      rtx x = *loc;
> > > > +      rtx broadcast_mem, vec_dup, constant, first;
> > > > +      machine_mode mode;
> > > > +      if (GET_CODE (x) != MEM
> > > > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
> > > > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > > > +     continue;
> > > > +
> > > > +      mode = GET_MODE (x);
> > > > +      if (!VECTOR_MODE_P (mode))
> > > > +     return;
> > > > +
> > > > +      constant = get_pool_constant (XEXP (x, 0));
> > > > +      first = XVECEXP (constant, 0, 0);
> > > > +      /* There could be some rtx like
> > > > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > > > +      but with "*.LC1" refer to V2DI constant vector.  */
> > > > +      if (GET_MODE (constant) != mode)
> > > > +     return;
> > > > +
> > > > +      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
> > > > +     {
> > > > +       rtx tmp = XVECEXP (constant, 0, i);
> > > > +       /* Only handle one-value const vector.  */
> > > > +       if (!rtx_equal_p (tmp, first))
> > > > +         return;
> > > > +     }
> > > > +
> > > > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > > > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > > > +      *loc = vec_dup;
> > > > +      INSN_CODE (insn) = -1;
> > > > +      /* Revert change if there's no corresponding pattern.  */
> > > > +      if (recog_memoized (insn) < 0)
> > > > +             {
> > > > +               *loc = x;
> > > > +               recog_memoized (insn);
> > > > +             }
> > > > +      /* At most 1 memory_operand in an insn.  */
> > > > +      return;
> > > > +    }
> > > > +}
> > > > +
> > > > +/* For const vector having one duplicated value, there's no need to put
> > > > +   whole vector in the constant pool when target supports embedded broadcast. */
> > > > +static unsigned int
> > > > +constant_pool_broadcast (void)
> > > > +{
> > > > +  timevar_push (TV_MACH_DEP);
> > > > +  rtx_insn *insn;
> > > > +
> > > > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > > > +    {
> > > > +      if (!INSN_P (insn))
> > > > +     continue;
> > > > +
> > > > +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> > > > +      insns, not any notes that may be attached.  We don't want to mark
> > > > +      a constant just because it happens to appear in a REG_EQUIV note.  */
> > > > +      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
> > > > +     {
> > > > +       int i, n = seq->len ();
> > > > +       for (i = 0; i < n; ++i)
> > > > +         {
> > > > +           rtx subinsn = seq->element (i);
> > > > +           if (INSN_P (subinsn))
> > > > +             replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
> > > > +         }
> > > > +     }
> > > > +      else
> > > > +     replace_constant_pool_with_broadcast (insn);
> > > > +    }
> > > > +  timevar_pop (TV_MACH_DEP);
> > > > +  return 0;
> > > > +}
> > > > +
> > > > +namespace {
> > > > +
> > > > +const pass_data pass_data_constant_pool_broadcast =
> > > > +{
> > > > +  RTL_PASS, /* type */
> > > > +  "cpb", /* name */
> > > > +  OPTGROUP_NONE, /* optinfo_flags */
> > > > +  TV_MACH_DEP, /* tv_id */
> > > > +  0, /* properties_required */
> > > > +  0, /* properties_provided */
> > > > +  0, /* properties_destroyed */
> > > > +  0, /* todo_flags_start */
> > > > +  TODO_df_finish, /* todo_flags_finish */
> > > > +};
> > > > +
> > > > +class pass_constant_pool_broadcast : public rtl_opt_pass
> > > > +{
> > > > +public:
> > > > +  pass_constant_pool_broadcast (gcc::context *ctxt)
> > > > +    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
> > > > +  {}
> > > > +
> > > > +  /* opt_pass methods: */
> > > > +  virtual bool gate (function *)
> > > > +    {
> > > > +      return TARGET_AVX512F;
> > > > +    }
> > > > +
> > > > +  virtual unsigned int execute (function *)
> > > > +    {
> > > > +      return constant_pool_broadcast ();
> > > > +    }
> > > > +}; // class pass_cpb
> > > > +
> > > > +} // anon namespace
> > > > +
> > > > +rtl_opt_pass *
> > > > +make_pass_constant_pool_broadcast (gcc::context *ctxt)
> > > > +{
> > > > +  return new pass_constant_pool_broadcast (ctxt);
> > > > +}
> > > > +
> > > >  /* This compares the priority of target features in function DECL1
> > > >     and DECL2.  It returns positive value if DECL1 is higher priority,
> > > >     negative value if DECL2 is higher priority and 0 if they are the
> > > > diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> > > > index d83c7b956b1..07ecf8e790f 100644
> > > > --- a/gcc/config/i386/i386-passes.def
> > > > +++ b/gcc/config/i386/i386-passes.def
> > > > @@ -33,3 +33,4 @@ along with GCC; see the file COPYING3.  If not see
> > > >    INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
> > > >
> > > >    INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
> > > > +  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
> > > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > > > index 7c2ce618f3f..6c6909b41dd 100644
> > > > --- a/gcc/config/i386/i386-protos.h
> > > > +++ b/gcc/config/i386/i386-protos.h
> > > > @@ -386,3 +386,4 @@ extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
> > > >    (gcc::context *);
> > > >  extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
> > > >    (gcc::context *);
> > > > +extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
> > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > index 431571a4bc1..fbfb459c5bf 100644
> > > > --- a/gcc/config/i386/sse.md
> > > > +++ b/gcc/config/i386/sse.md
> > > > @@ -12127,6 +12127,19 @@
> > > >     (set_attr "prefix" "evex")
> > > >     (set_attr "mode" "<sseinsnmode>")])
> > > >
> > > > +(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
> > > > +  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
> > > > +     (mult:VI8_AVX512VL
> > > > +       (vec_duplicate:VI8_AVX512VL
> > > > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > > > +       (match_operand:VI8_AVX512VL 2 "register_operand" "v")
> > > > +))]
> > > > +  "TARGET_AVX512DQ"
> > > > +  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > > > +  [(set_attr "type" "sseimul")
> > > > +   (set_attr "prefix" "evex")
> > > > +   (set_attr "mode" "<sseinsnmode>")])
> > > > +
> > > >  (define_expand "mul<mode>3<mask_name>"
> > > >    [(set (match_operand:VI4_AVX512F 0 "register_operand")
> > > >       (mult:VI4_AVX512F
> > > > @@ -12167,6 +12180,18 @@
> > > >     (set_attr "btver2_decode" "vector,vector,vector")
> > > >     (set_attr "mode" "<sseinsnmode>")])
> > > >
> > > > +(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
> > > > +  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
> > > > +     (mult:VI4_AVX512VL
> > > > +       (vec_duplicate:VI4_AVX512VL
> > > > +         (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
> > > > +       (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
> > > > +  "TARGET_AVX512F"
> > > > +   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
> > > > +  [(set_attr "type" "sseimul")
> > > > +   (set_attr "prefix" "evex")
> > > > +   (set_attr "mode" "<sseinsnmode>")])
> > > > +
> > > >  (define_expand "mul<mode>3"
> > > >    [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
> > > >       (mult:VI8_AVX2_AVX512F
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > > > new file mode 100644
> > > > index 00000000000..800ef1f957e
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
> > > > @@ -0,0 +1,40 @@
> > > > +/* PR target/87767 */
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx2" } */
> > > > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > > > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > > > +
> > > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > > +
> > > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > > +VTYPE                                                \
> > > > + __attribute__ ((noipa))                     \
> > > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > > +{                                            \
> > > > +  return a OP 101;                           \
> > > > +}                                            \
> > > > +
> > > > +FOO (v4si, add, +);
> > > > +FOO (v8si, add, +);
> > > > +FOO (v2di, add, +);
> > > > +FOO (v4di, add, +);
> > > > +FOO (v4sf, add, +);
> > > > +FOO (v8sf, add, +);
> > > > +FOO (v2df, add, +);
> > > > +FOO (v4df, add, +);
> > > > +
> > > > +FOO (v4si, mul, *);
> > > > +FOO (v8si, mul, *);
> > > > +FOO (v2di, mul, *);
> > > > +FOO (v4di, mul, *);
> > > > +FOO (v4sf, mul, *);
> > > > +FOO (v8sf, mul, *);
> > > > +FOO (v2df, mul, *);
> > > > +FOO (v4df, mul, *);
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > > > new file mode 100644
> > > > index 00000000000..21249bc0cf9
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* PR target/87767 */
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
> > > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
> > > > +/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
> > > > +
> > > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > > +typedef int v16si  __attribute__ ((vector_size (64)));
> > > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > > +typedef long long v8di  __attribute__ ((vector_size (64)));
> > > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > > +typedef float v16sf  __attribute__ ((vector_size (64)));
> > > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > > +typedef double v8df  __attribute__ ((vector_size (64)));
> > > > +
> > > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > > +VTYPE                                                \
> > > > + __attribute__ ((noipa))                     \
> > > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > > +{                                            \
> > > > +  return a OP 101;                           \
> > > > +}                                            \
> > > > +
> > > > +FOO (v4si, add, +);
> > > > +FOO (v8si, add, +);
> > > > +FOO (v16si, add, +);
> > > > +FOO (v2di, add, +);
> > > > +FOO (v4di, add, +);
> > > > +FOO (v8di, add, +);
> > > > +FOO (v4sf, add, +);
> > > > +FOO (v8sf, add, +);
> > > > +FOO (v16sf, add, +);
> > > > +FOO (v2df, add, +);
> > > > +FOO (v4df, add, +);
> > > > +FOO (v8df, add, +);
> > > > +
> > > > +FOO (v4si, mul, *);
> > > > +FOO (v8si, mul, *);
> > > > +FOO (v16si, mul, *);
> > > > +FOO (v2di, mul, *);
> > > > +FOO (v4di, mul, *);
> > > > +FOO (v8di, mul, *);
> > > > +FOO (v4sf, mul, *);
> > > > +FOO (v8sf, mul, *);
> > > > +FOO (v16sf, mul, *);
> > > > +FOO (v2df, mul, *);
> > > > +FOO (v4df, mul, *);
> > > > +FOO (v8df, mul, *);
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > > > new file mode 100644
> > > > index 00000000000..938346743c2
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
> > > > @@ -0,0 +1,54 @@
> > > > +/* PR target/87767 */
> > > > +/* { dg-do run } */
> > > > +/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
> > > > +
> > > > +#include<stdlib.h>
> > > > +#include<stdio.h>
> > > > +#include "avx512f-broadcast-pr87767-1.c"
> > > > +
> > > > +#define TEST(VTYPE, TYPE, N, OP_NAME, OP)            \
> > > > +  do                                                 \
> > > > +    {                                                        \
> > > > +      TYPE exp[N], src[N];                           \
> > > > +      VTYPE res;                                     \
> > > > +      for (int i = 0; i < N; i++)                    \
> > > > +     src[i] = i * i * 107;                           \
> > > > +      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);      \
> > > > +      for (int i = 0; i < N; i ++)                   \
> > > > +     exp[i] = src[i] OP 101;                         \
> > > > +      for (int j = 0; j < N; j++)                    \
> > > > +     {                                               \
> > > > +       if (res[j] != exp[j])                         \
> > > > +         abort();                                    \
> > > > +     }                                               \
> > > > +    }                                                        \
> > > > +  while (0)
> > > > +
> > > > +int main()
> > > > +{
> > > > +  TEST (v4si, int, 4, add, +);
> > > > +  TEST (v8si, int, 8, add, +);
> > > > +  TEST (v16si, int, 16, add, +);
> > > > +  TEST (v2di, long long, 2, add, +);
> > > > +  TEST (v4di, long long, 4, add, +);
> > > > +  TEST (v8di, long long, 8, add, +);
> > > > +  TEST (v4sf, float, 4, add, +);
> > > > +  TEST (v8sf, float, 8, add, +);
> > > > +  TEST (v16sf, float, 16, add, +);
> > > > +  TEST (v2df, double, 2, add, +);
> > > > +  TEST (v4df, double, 4, add, +);
> > > > +  TEST (v8df, double, 8, add, +);
> > > > +
> > > > +  TEST (v4si, int, 4, mul, *);
> > > > +  TEST (v8si, int, 8, mul, *);
> > > > +  TEST (v16si, int, 16, mul, *);
> > > > +  TEST (v2di, long long, 2, mul, *);
> > > > +  TEST (v4di, long long, 4, mul, *);
> > > > +  TEST (v8di, long long, 8, mul, *);
> > > > +  TEST (v4sf, float, 4, mul, *);
> > > > +  TEST (v8sf, float, 8, mul, *);
> > > > +  TEST (v16sf, float, 16, mul, *);
> > > > +  TEST (v2df, double, 2, mul, *);
> > > > +  TEST (v4df, double, 4, mul, *);
> > > > +  TEST (v8df, double, 8, mul, *);
> > > > +}
> > > > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > > > new file mode 100644
> > > > index 00000000000..ec159a68158
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
> > > > @@ -0,0 +1,40 @@
> > > > +/* PR target/87767 */
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx512f" } */
> > > > +/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
> > > > +/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
> > > > +
> > > > +typedef int v4si  __attribute__ ((vector_size (16)));
> > > > +typedef int v8si  __attribute__ ((vector_size (32)));
> > > > +typedef long long v2di  __attribute__ ((vector_size (16)));
> > > > +typedef long long v4di  __attribute__ ((vector_size (32)));
> > > > +typedef float v4sf  __attribute__ ((vector_size (16)));
> > > > +typedef float v8sf  __attribute__ ((vector_size (32)));
> > > > +typedef double v2df  __attribute__ ((vector_size (16)));
> > > > +typedef double v4df  __attribute__ ((vector_size (32)));
> > > > +
> > > > +#define FOO(VTYPE, OP_NAME, OP)                      \
> > > > +VTYPE                                                \
> > > > + __attribute__ ((noipa))                     \
> > > > +foo_##OP_NAME##_##VTYPE (VTYPE a)            \
> > > > +{                                            \
> > > > +  return a OP 101;                           \
> > > > +}                                            \
> > > > +
> > > > +FOO (v4si, add, +);
> > > > +FOO (v8si, add, +);
> > > > +FOO (v2di, add, +);
> > > > +FOO (v4di, add, +);
> > > > +FOO (v4sf, add, +);
> > > > +FOO (v8sf, add, +);
> > > > +FOO (v2df, add, +);
> > > > +FOO (v4df, add, +);
> > > > +
> > > > +FOO (v4si, mul, *);
> > > > +FOO (v8si, mul, *);
> > > > +FOO (v2di, mul, *);
> > > > +FOO (v4di, mul, *);
> > > > +FOO (v4sf, mul, *);
> > > > +FOO (v8sf, mul, *);
> > > > +FOO (v2df, mul, *);
> > > > +FOO (v4df, mul, *);
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > > index 49b5778a067..a37487d9af7 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
> > > > @@ -3,10 +3,11 @@
> > > >  /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
> > > >  /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
> > > >  /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
> > > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
> > > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
> > > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
> > > > -/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
> > > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
> > > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
> > > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
> > > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
> > > > +/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
> > > >
> > > >  extern char arraysb[64];
> > > >  extern short arraysw[32];
> > > > --
> > > > 2.18.1
> > > >
> > >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao
Li, Pan2 via Gcc-patches Aug. 26, 2020, 9:23 p.m. UTC | #7
On Tue, 2020-08-04 at 14:05 +0800, Hongtao Liu via Gcc-patches wrote:
> Update patch.
> 
> There are a lot of avx512 define_insns which lack corresponding memory
> broadcast version, i only add *avx512f_mul<mode>3<mask_name>_bcst and
> *avx512dq_mul<mode>3<mask_name>_bcst in this patch.
> 
> On Fri, Jul 24, 2020 at 10:37 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > On Thu, Jul 23, 2020 at 9:53 PM Hongtao Liu <crazylht@gmail.com> wrote:
> > > On Thu, Jul 23, 2020 at 4:39 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> > > > Hello,
> > > > sorry for taking so long to get to this.
> > > > > diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
> > > > > index 535fc7e981d..8f81d101382 100644
> > > > > --- a/gcc/config/i386/i386-features.c
> > > > > +++ b/gcc/config/i386/i386-features.c
> > > > > @@ -2379,6 +2379,152 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> > > > >    return new pass_remove_partial_avx_dependency (ctxt);
> > > > >  }
> > > > > 
> > > > > +/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
> > > > > +   with embedded broadcast. i.e.transform
> > > > > +
> > > > > +     vpaddq .LC0(%rip), %zmm0, %zmm0
> > > > > +     ret
> > > > > +  .LC0:
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +    .quad 3
> > > > > +
> > > > > +    to
> > > > > +
> > > > > +     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
> > > > 
> > > > It seems to me that having a special purpose pass for this is bit
> > > > overzelaous.  It seems to me that you can do same pattern matching via
> > > > splitter and fit it into the usual insn splitting pass?
> > > > 
> > > 
> > > From an implementation perspective, there could be lots of work, since
> > > memory embedding broadcast is available for nearly every instruction
> > > in AVX512. And for new added AVX512 instructions, we also need to add
> > > a define_split for them.
> > > 
> > 
> > 


> > +/* For const vector having one duplicated value, there's no need to put
> > > > > +   whole vector in the constant pool when target supports embedded broadcast. */
> > > > > +static unsigned int
> > > > > +constant_pool_broadcast (void)
> > > > > +{
> > > > > +  timevar_push (TV_MACH_DEP);
> > > > > +  rtx_insn *insn;
> > > > > +
> > > > > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > > > > +    {
> > > > > +      if (!INSN_P (insn))
> > > > > +     continue;
> > > > > +
> > > > > +      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
> > > > > +      insns, not any notes that may be attached.  We don't want to mark
> > > > > +      a constant just because it happens to appear in a REG_EQUIV note.  */
Under what circumstances are we seeing a SEQUENCE in the x86 backend?  I'm
surprised we need to handle that case.

So your pass modifies the insn in place, which is fine.  But do we actually
remove the original constant pool entry if it's no longer used?  If not, does
this patch actually save anything (memory bandwidth perhaps?)

Is there an existing pass over the RTL chain where this would work so that it's
more compile-time efficient?

jeff
Jan Hubicka Aug. 27, 2020, 11:09 a.m. UTC | #8
> Under what circumstances are we seeing a SEQUENCE in the x86 backend?  I'm
> surprised we need to handle that case.
> 
> So your pass modifies the insn in place, which is fine.  But do we actually
> remove the original constant pool entry if it's no longer used?  If not, does
> this patch actually save anything (memory bandwidth perhaps?)

Constant pool entries are output only if actually used by asm output, so
this could just work.
> 
> Is there an existing pass over the RTL chain where this would work so that it's
> more compile-time efficient?

I was also concerned about adding yet another pass and wanted to look
bit more into posibility to make this a part of peephole pass.  While it
is true that the usual way to write it (adding extra pattern for every
instruction)  is a lot of work I was thinking if we can perhaps just add
quite generic define_peephole which will match everything containing
broadcast via predicate, call into the expander that will try to build
mathcing instruction and fail otherwise.  While it is still bit of a
hack I think it may be less intrusive then yet another machine specific
pass.

Honza
> 
> jeff
>
Jakub Jelinek Aug. 27, 2020, 12:24 p.m. UTC | #9
On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> +static void
> +replace_constant_pool_with_broadcast (rtx_insn* insn)
> +{
> +  subrtx_ptr_iterator::array_type array;
> +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> +    {
> +      rtx *loc = *iter;
> +      rtx x = *loc;
> +      rtx broadcast_mem, vec_dup, constant, first;
> +      machine_mode mode;
> +      if (GET_CODE (x) != MEM

MEM_P

> +	  || GET_CODE (XEXP (x, 0)) != SYMBOL_REF

SYMBOL_REF_P

> +	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> +	continue;
> +
> +      mode = GET_MODE (x);
> +      if (!VECTOR_MODE_P (mode))
> +	return;
> +
> +      constant = get_pool_constant (XEXP (x, 0));
> +      first = XVECEXP (constant, 0, 0);

Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
and punt otherwise?

> +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> +      *loc = vec_dup;
> +      INSN_CODE (insn) = -1;
> +      /* Revert change if there's no corresponding pattern.  */
> +      if (recog_memoized (insn) < 0)
> +      	{
> +      	  *loc = x;
> +      	  recog_memoized (insn);
> +      	}

The usual way of doing this would be through
  validate_change (insn, loc, vec_dup, 0);

Also, isn't the pass also useful for TARGET_AVX and above (but in that case
only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
full vector loads?

As Jeff wrote, I wonder if when successfully replacing those pool constants
the old constant pool entries will be omitted.

Another thing I wonder about is whether more analysis shouldn't be used.
E.g. if the constant pool entry is already emitted into .rodata anyway
(e.g. some earlier function needed it), using the broadcast will mean
actually larger .rodata.  If {1to8} and similar is as fast as reading all
the same elements from memory (or faster), perhaps in that case it should
broadcast from the first element of the existing constant pool full vector
rather than creating a new one.
And similarly, perhaps the function should look at all constant pool entries
in the current function (not yet emitted into .rodata) and if it would
succeed for some and not for others, either use broadcast from its first
element or not perform it for the others too.

	Jakub
Richard Biener Aug. 27, 2020, 1:07 p.m. UTC | #10
On Thu, Aug 27, 2020 at 2:25 PM Jakub Jelinek via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
>
> MEM_P
>
> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P
>
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>
> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
>
> The usual way of doing this would be through
>   validate_change (insn, loc, vec_dup, 0);
>
> Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> full vector loads?
>
> As Jeff wrote, I wonder if when successfully replacing those pool constants
> the old constant pool entries will be omitted.
>
> Another thing I wonder about is whether more analysis shouldn't be used.
> E.g. if the constant pool entry is already emitted into .rodata anyway
> (e.g. some earlier function needed it), using the broadcast will mean
> actually larger .rodata.  If {1to8} and similar is as fast as reading all
> the same elements from memory (or faster), perhaps in that case it should
> broadcast from the first element of the existing constant pool full vector
> rather than creating a new one.
> And similarly, perhaps the function should look at all constant pool entries
> in the current function (not yet emitted into .rodata) and if it would
> succeed for some and not for others, either use broadcast from its first
> element or not perform it for the others too.

IIRC I once implemented this (re-using vector constant components
for non-vector pool entries) but it was quite hackish and never merged
it seems.

Richard.

>         Jakub
>
Jakub Jelinek Aug. 27, 2020, 1:20 p.m. UTC | #11
On Thu, Aug 27, 2020 at 03:07:59PM +0200, Richard Biener wrote:
> > Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> > only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> > full vector loads?
> >
> > As Jeff wrote, I wonder if when successfully replacing those pool constants
> > the old constant pool entries will be omitted.
> >
> > Another thing I wonder about is whether more analysis shouldn't be used.
> > E.g. if the constant pool entry is already emitted into .rodata anyway
> > (e.g. some earlier function needed it), using the broadcast will mean
> > actually larger .rodata.  If {1to8} and similar is as fast as reading all
> > the same elements from memory (or faster), perhaps in that case it should
> > broadcast from the first element of the existing constant pool full vector
> > rather than creating a new one.
> > And similarly, perhaps the function should look at all constant pool entries
> > in the current function (not yet emitted into .rodata) and if it would
> > succeed for some and not for others, either use broadcast from its first
> > element or not perform it for the others too.
> 
> IIRC I once implemented this (re-using vector constant components
> for non-vector pool entries) but it was quite hackish and never merged
> it seems.

If the generic constant pool code could do it, it would of course simplify
this pass.  Not sure if the case where earlier function emits already some
smaller constant and later function needs a CONST_VECTOR containing that can
be handled at all (probably not), but if the same function has both scalar
pool entries and CONST_VECTOR ones that contain those, or already emitted
CONST_VECTOR pool entry has them, it shouldn't be that hard, at least for
targets with symbol aliases, e.g. by using .LC33 = .LC24 or .LC34 = .LC24 + 8
where .LC33 or .LC34 would be the scalar pool entry label and .LC24
CONST_VECTOR containing those.

Seems constant pool marking is performed during
mark_constant_pool called during final from assemble_start_function or
assemble_end_function, so if the pass replaces the constants before final
and the constants are unused, they won't be emitted.

	Jakub
Richard Biener Aug. 28, 2020, 6:47 a.m. UTC | #12
On Thu, Aug 27, 2020 at 3:20 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Aug 27, 2020 at 03:07:59PM +0200, Richard Biener wrote:
> > > Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> > > only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> > > full vector loads?
> > >
> > > As Jeff wrote, I wonder if when successfully replacing those pool constants
> > > the old constant pool entries will be omitted.
> > >
> > > Another thing I wonder about is whether more analysis shouldn't be used.
> > > E.g. if the constant pool entry is already emitted into .rodata anyway
> > > (e.g. some earlier function needed it), using the broadcast will mean
> > > actually larger .rodata.  If {1to8} and similar is as fast as reading all
> > > the same elements from memory (or faster), perhaps in that case it should
> > > broadcast from the first element of the existing constant pool full vector
> > > rather than creating a new one.
> > > And similarly, perhaps the function should look at all constant pool entries
> > > in the current function (not yet emitted into .rodata) and if it would
> > > succeed for some and not for others, either use broadcast from its first
> > > element or not perform it for the others too.
> >
> > IIRC I once implemented this (re-using vector constant components
> > for non-vector pool entries) but it was quite hackish and never merged
> > it seems.
>
> If the generic constant pool code could do it, it would of course simplify
> this pass.  Not sure if the case where earlier function emits already some
> smaller constant and later function needs a CONST_VECTOR containing that can
> be handled at all (probably not), but if the same function has both scalar
> pool entries and CONST_VECTOR ones that contain those, or already emitted
> CONST_VECTOR pool entry has them, it shouldn't be that hard, at least for
> targets with symbol aliases, e.g. by using .LC33 = .LC24 or .LC34 = .LC24 + 8
> where .LC33 or .LC34 would be the scalar pool entry label and .LC24
> CONST_VECTOR containing those.
>
> Seems constant pool marking is performed during
> mark_constant_pool called during final from assemble_start_function or
> assemble_end_function, so if the pass replaces the constants before final
> and the constants are unused, they won't be emitted.

IIRC elsewhere it was discussed to use ld to perform merging by
emitting separate rodata sections for constant sizes (4, 8, 16, 32, 64
byte sizes).
ld could always direct 8 byte constant refs to the larger pools (sub-)entry.

As for GCCs constant pool code the issue is in the way lookup works
(hashing) and my earlier patch was recording an additional descriptor
for the first vector element IIRC.  So indeed if first the scalar is emitted
and then the vector this won't work easily (we'd need to be able to
associate multiple labes with a constant), but it could be made work, too.

I guess it would be interesting to experiment with function local pools
ordered by accesses to reduce memory bandwith cost (and pressure
on prefetchers) for memory bandwidth starved code.

Richard.

>         Jakub
>
Jakub Jelinek Aug. 28, 2020, 8:52 a.m. UTC | #13
On Fri, Aug 28, 2020 at 08:47:06AM +0200, Richard Biener via Gcc-patches wrote:
> IIRC elsewhere it was discussed to use ld to perform merging by
> emitting separate rodata sections for constant sizes (4, 8, 16, 32, 64
> byte sizes).

ld does that already, and gcc too.

> ld could always direct 8 byte constant refs to the larger pools (sub-)entry.

But there is no way to express in ELF that something like that would be
acceptable.

I meant something like the following, which on e.g. a dumb:

typedef float V __attribute__((vector_size (4 * sizeof (float))));

void
foo (V *p, float *q)
{
  p[0] += (V) { 1.0f, 2.0f, 3.0f, 4.0f };
  q[0] += 4.0f;
  q[1] -= 3.0f;
  q[17] -= 2.0f;
  q[31] += 1.0f;
}

testcase merges all the 4 scalar constant pool entries into the CONST_VECTOR
one.

I'm punting for section anchors and not doing it in the per-function (i.e.
non-shared) constant pools simply because I don't know them well enough,
don't know whether backends use the offsets for something etc.
For section anchors, I guess it would need to be done before (re)computing the
offsets and arrange for the desc->mark < 0 entries not to be considered as
objects in the object block, for non-shared pools, perhaps it would be
enough to call the new function from output_constant_pool before calling
recompute_pool_offsets and adjust recompute_pool_offsets to ignore
desc->mark < 0.

2020-08-28  Jakub Jelinek  <jakub@redhat.com>

	* varasm.c (output_constant_pool_contents): Emit desc->mark < 0
	entries as aliases.
	(optimize_constant_pool): New function.
	(output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.

--- gcc/varasm.c.jj	2020-07-28 15:39:10.091755086 +0200
+++ gcc/varasm.c	2020-08-28 10:38:10.207636849 +0200
@@ -4198,7 +4198,27 @@ output_constant_pool_contents (struct rt
   class constant_descriptor_rtx *desc;
 
   for (desc = pool->first; desc ; desc = desc->next)
-    if (desc->mark)
+    if (desc->mark < 0)
+      {
+#ifdef ASM_OUTPUT_DEF
+        const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
+        char label[256];
+        char buffer[256 + 32];
+        const char *p;
+
+        ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
+	p = targetm.strip_name_encoding (label);
+	if (desc->offset)
+	  {
+	    sprintf (buffer, "%s+%ld", p, (long) (desc->offset));
+	    p = buffer;
+	  }
+	ASM_OUTPUT_DEF (asm_out_file, name, p);
+#else
+	gcc_unreachable ();
+#endif
+      }
+    else if (desc->mark)
       {
 	/* If the constant is part of an object_block, make sure that
 	   the constant has been positioned within its block, but do not
@@ -4216,6 +4236,52 @@ output_constant_pool_contents (struct rt
       }
 }
 
+/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
+   constants and scalar constants with the values of CONST_VECTOR elements,
+   try to alias the scalar constants with the CONST_VECTOR elements.  */
+
+static void
+optimize_constant_pool (struct rtx_constant_pool *pool)
+{
+  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
+    if (desc->mark > 0
+	&& GET_CODE (desc->constant) == CONST_VECTOR
+	&& VECTOR_MODE_P (desc->mode)
+	&& GET_MODE_CLASS (desc->mode) != MODE_VECTOR_BOOL
+	&& !(SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
+	     && SYMBOL_REF_BLOCK (desc->sym))
+	&& desc->labelno >= 0)
+      {
+	scalar_mode submode = GET_MODE_INNER (desc->mode);
+	unsigned int subalign = MIN (desc->align, GET_MODE_BITSIZE (submode));
+	int units = GET_MODE_NUNITS (desc->mode);
+
+	for (int i = 0; i < units; i++)
+	  {
+	    if (i != 0
+		&& rtx_equal_p (CONST_VECTOR_ELT (desc->constant, i),
+				CONST_VECTOR_ELT (desc->constant, i - 1)))
+	      continue;
+
+	    constant_descriptor_rtx tmp;
+	    tmp.constant = CONST_VECTOR_ELT (desc->constant, i);
+	    tmp.mode = submode;
+	    hashval_t hash = const_rtx_hash (tmp.constant);
+	    constant_descriptor_rtx *eldesc
+	      = pool->const_rtx_htab->find_with_hash (&tmp, hash);
+	    if (eldesc
+		&& eldesc->mark > 0
+		&& eldesc->align <= subalign
+		&& !(SYMBOL_REF_HAS_BLOCK_INFO_P (eldesc->sym)
+		     && SYMBOL_REF_BLOCK (eldesc->sym)))
+	      {
+		eldesc->mark = ~desc->labelno;
+		eldesc->offset = i * GET_MODE_SIZE (submode);
+	      }
+	  }
+      }
+}
+
 /* Mark all constants that are used in the current function, then write
    out the function's private constant pool.  */
 
@@ -4251,6 +4317,9 @@ output_constant_pool (const char *fnname
 void
 output_shared_constant_pool (void)
 {
+  if (TARGET_SUPPORTS_ALIASES)
+    optimize_constant_pool (shared_constant_pool);
+
   output_constant_pool_contents (shared_constant_pool);
 }
 


	Jakub
Richard Biener Aug. 28, 2020, 10:36 a.m. UTC | #14
On Fri, Aug 28, 2020 at 10:52 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Fri, Aug 28, 2020 at 08:47:06AM +0200, Richard Biener via Gcc-patches wrote:
> > IIRC elsewhere it was discussed to use ld to perform merging by
> > emitting separate rodata sections for constant sizes (4, 8, 16, 32, 64
> > byte sizes).
>
> ld does that already, and gcc too.
>
> > ld could always direct 8 byte constant refs to the larger pools (sub-)entry.
>
> But there is no way to express in ELF that something like that would be
> acceptable.

Hmm, I see.

> I meant something like the following, which on e.g. a dumb:
>
> typedef float V __attribute__((vector_size (4 * sizeof (float))));
>
> void
> foo (V *p, float *q)
> {
>   p[0] += (V) { 1.0f, 2.0f, 3.0f, 4.0f };
>   q[0] += 4.0f;
>   q[1] -= 3.0f;
>   q[17] -= 2.0f;
>   q[31] += 1.0f;
> }
>
> testcase merges all the 4 scalar constant pool entries into the CONST_VECTOR
> one.
>
> I'm punting for section anchors and not doing it in the per-function (i.e.
> non-shared) constant pools simply because I don't know them well enough,
> don't know whether backends use the offsets for something etc.
> For section anchors, I guess it would need to be done before (re)computing the
> offsets and arrange for the desc->mark < 0 entries not to be considered as
> objects in the object block, for non-shared pools, perhaps it would be
> enough to call the new function from output_constant_pool before calling
> recompute_pool_offsets and adjust recompute_pool_offsets to ignore
> desc->mark < 0.

Guess this would work indeed.  It's probably quite common to have
both vector and non-vector constants because of vectorization
and scalar epilogues.  But note that elsewhere we're using
the largest component mode to emit vector constant pool entries
to share { -1, -1, -1, -1 } and {-1l, -1l } for example so while the
below code works for FP modes it likely will break down for
integer modes?

Richard.

> 2020-08-28  Jakub Jelinek  <jakub@redhat.com>
>
>         * varasm.c (output_constant_pool_contents): Emit desc->mark < 0
>         entries as aliases.
>         (optimize_constant_pool): New function.
>         (output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.
>
> --- gcc/varasm.c.jj     2020-07-28 15:39:10.091755086 +0200
> +++ gcc/varasm.c        2020-08-28 10:38:10.207636849 +0200
> @@ -4198,7 +4198,27 @@ output_constant_pool_contents (struct rt
>    class constant_descriptor_rtx *desc;
>
>    for (desc = pool->first; desc ; desc = desc->next)
> -    if (desc->mark)
> +    if (desc->mark < 0)
> +      {
> +#ifdef ASM_OUTPUT_DEF
> +        const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
> +        char label[256];
> +        char buffer[256 + 32];
> +        const char *p;
> +
> +        ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
> +       p = targetm.strip_name_encoding (label);
> +       if (desc->offset)
> +         {
> +           sprintf (buffer, "%s+%ld", p, (long) (desc->offset));
> +           p = buffer;
> +         }
> +       ASM_OUTPUT_DEF (asm_out_file, name, p);
> +#else
> +       gcc_unreachable ();
> +#endif
> +      }
> +    else if (desc->mark)
>        {
>         /* If the constant is part of an object_block, make sure that
>            the constant has been positioned within its block, but do not
> @@ -4216,6 +4236,52 @@ output_constant_pool_contents (struct rt
>        }
>  }
>
> +/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
> +   constants and scalar constants with the values of CONST_VECTOR elements,
> +   try to alias the scalar constants with the CONST_VECTOR elements.  */
> +
> +static void
> +optimize_constant_pool (struct rtx_constant_pool *pool)
> +{
> +  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
> +    if (desc->mark > 0
> +       && GET_CODE (desc->constant) == CONST_VECTOR
> +       && VECTOR_MODE_P (desc->mode)
> +       && GET_MODE_CLASS (desc->mode) != MODE_VECTOR_BOOL
> +       && !(SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
> +            && SYMBOL_REF_BLOCK (desc->sym))
> +       && desc->labelno >= 0)
> +      {
> +       scalar_mode submode = GET_MODE_INNER (desc->mode);
> +       unsigned int subalign = MIN (desc->align, GET_MODE_BITSIZE (submode));
> +       int units = GET_MODE_NUNITS (desc->mode);
> +
> +       for (int i = 0; i < units; i++)
> +         {
> +           if (i != 0
> +               && rtx_equal_p (CONST_VECTOR_ELT (desc->constant, i),
> +                               CONST_VECTOR_ELT (desc->constant, i - 1)))
> +             continue;
> +
> +           constant_descriptor_rtx tmp;
> +           tmp.constant = CONST_VECTOR_ELT (desc->constant, i);
> +           tmp.mode = submode;
> +           hashval_t hash = const_rtx_hash (tmp.constant);
> +           constant_descriptor_rtx *eldesc
> +             = pool->const_rtx_htab->find_with_hash (&tmp, hash);
> +           if (eldesc
> +               && eldesc->mark > 0
> +               && eldesc->align <= subalign
> +               && !(SYMBOL_REF_HAS_BLOCK_INFO_P (eldesc->sym)
> +                    && SYMBOL_REF_BLOCK (eldesc->sym)))
> +             {
> +               eldesc->mark = ~desc->labelno;
> +               eldesc->offset = i * GET_MODE_SIZE (submode);
> +             }
> +         }
> +      }
> +}
> +
>  /* Mark all constants that are used in the current function, then write
>     out the function's private constant pool.  */
>
> @@ -4251,6 +4317,9 @@ output_constant_pool (const char *fnname
>  void
>  output_shared_constant_pool (void)
>  {
> +  if (TARGET_SUPPORTS_ALIASES)
> +    optimize_constant_pool (shared_constant_pool);
> +
>    output_constant_pool_contents (shared_constant_pool);
>  }
>
>
>
>         Jakub
>
Jakub Jelinek Aug. 28, 2020, 10:47 a.m. UTC | #15
On Fri, Aug 28, 2020 at 12:36:00PM +0200, Richard Biener wrote:
> Guess this would work indeed.  It's probably quite common to have
> both vector and non-vector constants because of vectorization
> and scalar epilogues.  But note that elsewhere we're using
> the largest component mode to emit vector constant pool entries
> to share { -1, -1, -1, -1 } and {-1l, -1l } for example so while the
> below code works for FP modes it likely will break down for
> integer modes?

I don't see why it would break, it will not optimize { -1LL, -1LL }
vs. -1 scalar, sure, but it uses the hash and equality function the
rtl constant pool uses, which means it compares both the constants
(rtx_equal_p) and mode we have recorded for it.
Of course, on x86_64 integer scalar constants will pretty much never appear
in the constant pool, so guess we'll need a different target for testing
that.

	Jakub
Richard Biener Aug. 28, 2020, 11:06 a.m. UTC | #16
On Fri, Aug 28, 2020 at 12:47 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Fri, Aug 28, 2020 at 12:36:00PM +0200, Richard Biener wrote:
> > Guess this would work indeed.  It's probably quite common to have
> > both vector and non-vector constants because of vectorization
> > and scalar epilogues.  But note that elsewhere we're using
> > the largest component mode to emit vector constant pool entries
> > to share { -1, -1, -1, -1 } and {-1l, -1l } for example so while the
> > below code works for FP modes it likely will break down for
> > integer modes?
>
> I don't see why it would break, it will not optimize { -1LL, -1LL }
> vs. -1 scalar, sure, but it uses the hash and equality function the
> rtl constant pool uses, which means it compares both the constants
> (rtx_equal_p) and mode we have recorded for it.

Oh, I thought my patch for PR54201 was installed but it was not
(I grepped my patch folder...).  PR54201 complains about something
similar but then different ;)  Guess post-processing that would also be
possible but also a bit awkward.  Maybe we should hash the
full byte representation instead of the components.

Richard.

> Of course, on x86_64 integer scalar constants will pretty much never appear
> in the constant pool, so guess we'll need a different target for testing
> that.
>
>         Jakub
>
Jakub Jelinek Aug. 28, 2020, 11:26 a.m. UTC | #17
On Fri, Aug 28, 2020 at 01:06:40PM +0200, Richard Biener wrote:
> > On Fri, Aug 28, 2020 at 12:36:00PM +0200, Richard Biener wrote:
> > > Guess this would work indeed.  It's probably quite common to have
> > > both vector and non-vector constants because of vectorization
> > > and scalar epilogues.  But note that elsewhere we're using
> > > the largest component mode to emit vector constant pool entries
> > > to share { -1, -1, -1, -1 } and {-1l, -1l } for example so while the
> > > below code works for FP modes it likely will break down for
> > > integer modes?
> >
> > I don't see why it would break, it will not optimize { -1LL, -1LL }
> > vs. -1 scalar, sure, but it uses the hash and equality function the
> > rtl constant pool uses, which means it compares both the constants
> > (rtx_equal_p) and mode we have recorded for it.
> 
> Oh, I thought my patch for PR54201 was installed but it was not
> (I grepped my patch folder...).  PR54201 complains about something
> similar but then different ;)  Guess post-processing that would also be
> possible but also a bit awkward.  Maybe we should hash the
> full byte representation instead of the components.

I think we in general can't, because the constants can contain even things
like LABEL_REFs, so not everything is actually representable as host byte
stream.
And I'm not convinced doing it in force_constant_mem is best, because at
that point we really don't know if the constants will not be optimized away.

We could extend what my patch does though, by sorting the pool entries
(those with mark > 0 only) by descreasing size and for those for which
native_encode_rtx is successful, enter into a hash table an entry for their
whole byte representation, then half of it until each bytes (possibly taking
into account the alignment too).

As for section anchors, handling that seems far less important, as e.g. on
both powerpc64 and aarch64 I see (unless -fno-merge-constants) the constant
pool emitted outside of the blocks and so optimized by the posted patch.

	Jakub
Jakub Jelinek Aug. 28, 2020, 2:53 p.m. UTC | #18
On Fri, Aug 28, 2020 at 01:06:40PM +0200, Richard Biener via Gcc-patches wrote:
> > I don't see why it would break, it will not optimize { -1LL, -1LL }
> > vs. -1 scalar, sure, but it uses the hash and equality function the
> > rtl constant pool uses, which means it compares both the constants
> > (rtx_equal_p) and mode we have recorded for it.
> 
> Oh, I thought my patch for PR54201 was installed but it was not
> (I grepped my patch folder...).  PR54201 complains about something
> similar but then different ;)  Guess post-processing that would also be
> possible but also a bit awkward.  Maybe we should hash the
> full byte representation instead of the components.

Here is an adjusted patch that ought to merge even the same sized different
mode vectors with the same byte representation, etc.
It won't really help with avoiding the multiple reads of the constant in the
same function, but as you found, your patch doesn't help with that either.
Your patch isn't really incompatible with what the patch below does, though
I wonder whether a) it wouldn't be better to always canonicalize to an
integral mode with as few elts as possible even e.g. for floats b) whether
asserting that it simplify_rtx succeeds is safe, whether it shouldn't just
canonicalize if the canonicalization works and just do what it previously
did otherwise.

The following patch puts all pool entries which can be natively encoded
into a vector, sorts it by decreasing size, determines minimum size
of a pool entry and adds hash elts for each (aligned) min_size or wider
power of two-ish portion of the pool constant in addition to the whole pool
constant byte representation.

2020-08-28  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/54201
	* varasm.c: Include alloc-pool.h.
	(output_constant_pool_contents): Emit desc->mark < 0 entries as
	aliases.
	(struct constant_descriptor_rtx_data): New type.
	(constant_descriptor_rtx_data_cmp): New function.
	(struct const_rtx_data_hasher): New type.
	(const_rtx_data_hasher::hash, const_rtx_data_hasher::equal): New
	methods.
	(optimize_constant_pool): New function.
	(output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.

--- gcc/varasm.c.jj	2020-07-28 15:39:10.091755086 +0200
+++ gcc/varasm.c	2020-08-28 15:37:30.605076961 +0200
@@ -57,6 +57,7 @@ along with GCC; see the file COPYING3.
 #include "asan.h"
 #include "rtl-iter.h"
 #include "file-prefix-map.h" /* remap_debug_filename()  */
+#include "alloc-pool.h"
 
 #ifdef XCOFF_DEBUGGING_INFO
 #include "xcoffout.h"		/* Needed for external data declarations.  */
@@ -4198,7 +4199,27 @@ output_constant_pool_contents (struct rt
   class constant_descriptor_rtx *desc;
 
   for (desc = pool->first; desc ; desc = desc->next)
-    if (desc->mark)
+    if (desc->mark < 0)
+      {
+#ifdef ASM_OUTPUT_DEF
+	const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
+	char label[256];
+	char buffer[256 + 32];
+	const char *p;
+
+	ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
+	p = targetm.strip_name_encoding (label);
+	if (desc->offset)
+	  {
+	    sprintf (buffer, "%s+%ld", p, desc->offset);
+	    p = buffer;
+	  }
+	ASM_OUTPUT_DEF (asm_out_file, name, p);
+#else
+	gcc_unreachable ();
+#endif
+      }
+    else if (desc->mark)
       {
 	/* If the constant is part of an object_block, make sure that
 	   the constant has been positioned within its block, but do not
@@ -4216,6 +4237,159 @@ output_constant_pool_contents (struct rt
       }
 }
 
+struct constant_descriptor_rtx_data {
+  constant_descriptor_rtx *desc;
+  target_unit *bytes;
+  unsigned short size;
+  unsigned short offset;
+  unsigned int hash;
+};
+
+/* qsort callback to sort constant_descriptor_rtx_data * vector by
+   decreasing size.  */
+
+static int
+constant_descriptor_rtx_data_cmp (const void *p1, const void *p2)
+{
+  constant_descriptor_rtx_data *const data1
+    = *(constant_descriptor_rtx_data * const *) p1;
+  constant_descriptor_rtx_data *const data2
+    = *(constant_descriptor_rtx_data * const *) p2;
+  if (data1->size > data2->size)
+    return -1;
+  if (data1->size < data2->size)
+    return 1;
+  if (data1->hash < data2->hash)
+    return -1;
+  gcc_assert (data1->hash > data2->hash);
+  return 1;
+}
+
+struct const_rtx_data_hasher : nofree_ptr_hash<constant_descriptor_rtx_data>
+{
+  static hashval_t hash (constant_descriptor_rtx_data *);
+  static bool equal (constant_descriptor_rtx_data *,
+		     constant_descriptor_rtx_data *);
+};
+
+/* Hash and compare functions for const_rtx_data_htab.  */
+
+hashval_t
+const_rtx_data_hasher::hash (constant_descriptor_rtx_data *data)
+{
+  return data->hash;
+}
+
+bool
+const_rtx_data_hasher::equal (constant_descriptor_rtx_data *x,
+			      constant_descriptor_rtx_data *y)
+{
+  if (x->hash != y->hash || x->size != y->size)
+    return 0;
+  unsigned int align1 = x->desc->align;
+  unsigned int align2 = y->desc->align;
+  unsigned int offset1 = (x->offset * BITS_PER_UNIT) & (align1 - 1);
+  unsigned int offset2 = (y->offset * BITS_PER_UNIT) & (align2 - 1);
+  if (offset1)
+    align1 = least_bit_hwi (offset1);
+  if (offset2)
+    align2 = least_bit_hwi (offset2);
+  if (align2 > align1)
+    return 0;
+  if (memcmp (x->bytes, y->bytes, x->size * sizeof (target_unit)) != 0)
+    return 0;
+  return 1;
+}
+
+/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
+   constants and scalar constants with the values of CONST_VECTOR elements,
+   try to alias the scalar constants with the CONST_VECTOR elements.  */
+
+static void
+optimize_constant_pool (struct rtx_constant_pool *pool)
+{
+  auto_vec<target_unit, 128> buffer;
+  auto_vec<constant_descriptor_rtx_data *, 128> vec;
+  object_allocator<constant_descriptor_rtx_data>
+    data_pool ("constant_descriptor_rtx_data_pool");
+  int idx = 0;
+  size_t size = 0;
+  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
+    if (desc->mark > 0
+	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
+	      && SYMBOL_REF_BLOCK (desc->sym)))
+      {
+	buffer.truncate (0);
+	if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
+			       GET_MODE_SIZE (desc->mode)))
+	  {
+	    constant_descriptor_rtx_data *data = data_pool.allocate ();
+	    data->desc = desc;
+	    data->bytes = NULL;
+	    data->size = GET_MODE_SIZE (desc->mode);
+	    data->offset = 0;
+	    data->hash = idx++;
+	    size += data->size;
+	    vec.safe_push (data);
+	  }
+      }
+  if (idx)
+    {
+      vec.qsort (constant_descriptor_rtx_data_cmp);
+      unsigned min_size = vec.last ()->size;
+      target_unit *bytes = XNEWVEC (target_unit, size);
+      unsigned int i;
+      constant_descriptor_rtx_data *data;
+      hash_table<const_rtx_data_hasher> * htab
+	= new hash_table<const_rtx_data_hasher> (31);
+      size = 0;
+      FOR_EACH_VEC_ELT (vec, i, data)
+	{
+	  buffer.truncate (0);
+	  native_encode_rtx (data->desc->mode, data->desc->constant,
+			     buffer, 0, data->size);
+	  memcpy (bytes + size, buffer.address (), data->size);
+	  data->bytes = bytes + size;
+	  data->hash = iterative_hash (data->bytes,
+				       data->size * sizeof (target_unit), 0);
+	  size += data->size;
+	  constant_descriptor_rtx_data **slot
+	    = htab->find_slot_with_hash (data, data->hash, INSERT);
+	  if (*slot)
+	    {
+	      data->desc->mark = ~(*slot)->desc->labelno;
+	      data->desc->offset = (*slot)->offset;
+	    }
+	  else
+	    {
+	      unsigned int sz = 1 << floor_log2 (data->size);
+
+	      *slot = data;
+	      for (sz >>= 1; sz >= min_size; sz >>= 1)
+		for (unsigned off = 0; off + sz <= data->size; off += sz)
+		  {
+		    constant_descriptor_rtx_data tmp;
+		    tmp.desc = data->desc;
+		    tmp.bytes = data->bytes + off;
+		    tmp.size = sz;
+		    tmp.offset = off;
+		    tmp.hash = iterative_hash (tmp.bytes,
+					       sz * sizeof (target_unit), 0);
+		    slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
+		    if (*slot == NULL)
+		      {
+			*slot = data_pool.allocate ();
+			**slot = tmp;
+		      }
+		  }
+	    }
+	}
+      delete htab;
+      XDELETE (bytes);
+    }
+  data_pool.release ();
+}
+
 /* Mark all constants that are used in the current function, then write
    out the function's private constant pool.  */
 
@@ -4251,6 +4425,10 @@ output_constant_pool (const char *fnname
 void
 output_shared_constant_pool (void)
 {
+  if (optimize
+      && TARGET_SUPPORTS_ALIASES)
+    optimize_constant_pool (shared_constant_pool);
+
   output_constant_pool_contents (shared_constant_pool);
 }
 


	Jakub
Richard Sandiford Aug. 28, 2020, 4:07 p.m. UTC | #19
Thanks for doing this.  I don't feel qualified to review the full
patch, but one thing:

Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> +  auto_vec<target_unit, 128> buffer;
> +  auto_vec<constant_descriptor_rtx_data *, 128> vec;
> +  object_allocator<constant_descriptor_rtx_data>
> +    data_pool ("constant_descriptor_rtx_data_pool");
> +  int idx = 0;
> +  size_t size = 0;
> +  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
> +    if (desc->mark > 0
> +	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
> +	      && SYMBOL_REF_BLOCK (desc->sym)))
> +      {
> +	buffer.truncate (0);

128 isn't big enough for all targets (e.g. aarch64 with
-msve-vector-bits=2048), so I think we still need a reserve
call here.

Thanks,
Richard

> +	if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
> +			       GET_MODE_SIZE (desc->mode)))
> +	  {
> +	    constant_descriptor_rtx_data *data = data_pool.allocate ();
> +	    data->desc = desc;
> +	    data->bytes = NULL;
> +	    data->size = GET_MODE_SIZE (desc->mode);
> +	    data->offset = 0;
> +	    data->hash = idx++;
> +	    size += data->size;
> +	    vec.safe_push (data);
> +	  }
> +      }
> +  if (idx)
> +    {
> +      vec.qsort (constant_descriptor_rtx_data_cmp);
> +      unsigned min_size = vec.last ()->size;
> +      target_unit *bytes = XNEWVEC (target_unit, size);
> +      unsigned int i;
> +      constant_descriptor_rtx_data *data;
> +      hash_table<const_rtx_data_hasher> * htab
> +	= new hash_table<const_rtx_data_hasher> (31);
> +      size = 0;
> +      FOR_EACH_VEC_ELT (vec, i, data)
> +	{
> +	  buffer.truncate (0);
> +	  native_encode_rtx (data->desc->mode, data->desc->constant,
> +			     buffer, 0, data->size);
> +	  memcpy (bytes + size, buffer.address (), data->size);
> +	  data->bytes = bytes + size;
> +	  data->hash = iterative_hash (data->bytes,
> +				       data->size * sizeof (target_unit), 0);
> +	  size += data->size;
> +	  constant_descriptor_rtx_data **slot
> +	    = htab->find_slot_with_hash (data, data->hash, INSERT);
> +	  if (*slot)
> +	    {
> +	      data->desc->mark = ~(*slot)->desc->labelno;
> +	      data->desc->offset = (*slot)->offset;
> +	    }
> +	  else
> +	    {
> +	      unsigned int sz = 1 << floor_log2 (data->size);
> +
> +	      *slot = data;
> +	      for (sz >>= 1; sz >= min_size; sz >>= 1)
> +		for (unsigned off = 0; off + sz <= data->size; off += sz)
> +		  {
> +		    constant_descriptor_rtx_data tmp;
> +		    tmp.desc = data->desc;
> +		    tmp.bytes = data->bytes + off;
> +		    tmp.size = sz;
> +		    tmp.offset = off;
> +		    tmp.hash = iterative_hash (tmp.bytes,
> +					       sz * sizeof (target_unit), 0);
> +		    slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
> +		    if (*slot == NULL)
> +		      {
> +			*slot = data_pool.allocate ();
> +			**slot = tmp;
> +		      }
> +		  }
> +	    }
> +	}
> +      delete htab;
> +      XDELETE (bytes);
> +    }
> +  data_pool.release ();
> +}
> +
>  /* Mark all constants that are used in the current function, then write
>     out the function's private constant pool.  */
>  
> @@ -4251,6 +4425,10 @@ output_constant_pool (const char *fnname
>  void
>  output_shared_constant_pool (void)
>  {
> +  if (optimize
> +      && TARGET_SUPPORTS_ALIASES)
> +    optimize_constant_pool (shared_constant_pool);
> +
>    output_constant_pool_contents (shared_constant_pool);
>  }
>  
>
>
> 	Jakub
Jakub Jelinek Aug. 28, 2020, 4:25 p.m. UTC | #20
On Fri, Aug 28, 2020 at 05:07:11PM +0100, Richard Sandiford wrote:
> Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > +  auto_vec<target_unit, 128> buffer;
> > +  auto_vec<constant_descriptor_rtx_data *, 128> vec;
> > +  object_allocator<constant_descriptor_rtx_data>
> > +    data_pool ("constant_descriptor_rtx_data_pool");
> > +  int idx = 0;
> > +  size_t size = 0;
> > +  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
> > +    if (desc->mark > 0
> > +	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
> > +	      && SYMBOL_REF_BLOCK (desc->sym)))
> > +      {
> > +	buffer.truncate (0);
> 
> 128 isn't big enough for all targets (e.g. aarch64 with
> -msve-vector-bits=2048), so I think we still need a reserve
> call here.

You're right, thanks for spotting it, I've missed native_encode_rtx will do
quick_push rather than safe_push.

Updated patch below, it shouldn't be needed in the second loop, because
the first loop should already grow it to the largest size.

2020-08-28  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/54201
	* varasm.c: Include alloc-pool.h.
	(output_constant_pool_contents): Emit desc->mark < 0 entries as
	aliases.
	(struct constant_descriptor_rtx_data): New type.
	(constant_descriptor_rtx_data_cmp): New function.
	(struct const_rtx_data_hasher): New type.
	(const_rtx_data_hasher::hash, const_rtx_data_hasher::equal): New
	methods.
	(optimize_constant_pool): New function.
	(output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.

--- gcc/varasm.c.jj	2020-07-28 15:39:10.091755086 +0200
+++ gcc/varasm.c	2020-08-28 18:21:58.943759578 +0200
@@ -57,6 +57,7 @@ along with GCC; see the file COPYING3.
 #include "asan.h"
 #include "rtl-iter.h"
 #include "file-prefix-map.h" /* remap_debug_filename()  */
+#include "alloc-pool.h"
 
 #ifdef XCOFF_DEBUGGING_INFO
 #include "xcoffout.h"		/* Needed for external data declarations.  */
@@ -4198,7 +4199,27 @@ output_constant_pool_contents (struct rt
   class constant_descriptor_rtx *desc;
 
   for (desc = pool->first; desc ; desc = desc->next)
-    if (desc->mark)
+    if (desc->mark < 0)
+      {
+#ifdef ASM_OUTPUT_DEF
+	const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
+	char label[256];
+	char buffer[256 + 32];
+	const char *p;
+
+	ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
+	p = targetm.strip_name_encoding (label);
+	if (desc->offset)
+	  {
+	    sprintf (buffer, "%s+%ld", p, desc->offset);
+	    p = buffer;
+	  }
+	ASM_OUTPUT_DEF (asm_out_file, name, p);
+#else
+	gcc_unreachable ();
+#endif
+      }
+    else if (desc->mark)
       {
 	/* If the constant is part of an object_block, make sure that
 	   the constant has been positioned within its block, but do not
@@ -4216,6 +4237,160 @@ output_constant_pool_contents (struct rt
       }
 }
 
+struct constant_descriptor_rtx_data {
+  constant_descriptor_rtx *desc;
+  target_unit *bytes;
+  unsigned short size;
+  unsigned short offset;
+  unsigned int hash;
+};
+
+/* qsort callback to sort constant_descriptor_rtx_data * vector by
+   decreasing size.  */
+
+static int
+constant_descriptor_rtx_data_cmp (const void *p1, const void *p2)
+{
+  constant_descriptor_rtx_data *const data1
+    = *(constant_descriptor_rtx_data * const *) p1;
+  constant_descriptor_rtx_data *const data2
+    = *(constant_descriptor_rtx_data * const *) p2;
+  if (data1->size > data2->size)
+    return -1;
+  if (data1->size < data2->size)
+    return 1;
+  if (data1->hash < data2->hash)
+    return -1;
+  gcc_assert (data1->hash > data2->hash);
+  return 1;
+}
+
+struct const_rtx_data_hasher : nofree_ptr_hash<constant_descriptor_rtx_data>
+{
+  static hashval_t hash (constant_descriptor_rtx_data *);
+  static bool equal (constant_descriptor_rtx_data *,
+		     constant_descriptor_rtx_data *);
+};
+
+/* Hash and compare functions for const_rtx_data_htab.  */
+
+hashval_t
+const_rtx_data_hasher::hash (constant_descriptor_rtx_data *data)
+{
+  return data->hash;
+}
+
+bool
+const_rtx_data_hasher::equal (constant_descriptor_rtx_data *x,
+			      constant_descriptor_rtx_data *y)
+{
+  if (x->hash != y->hash || x->size != y->size)
+    return 0;
+  unsigned int align1 = x->desc->align;
+  unsigned int align2 = y->desc->align;
+  unsigned int offset1 = (x->offset * BITS_PER_UNIT) & (align1 - 1);
+  unsigned int offset2 = (y->offset * BITS_PER_UNIT) & (align2 - 1);
+  if (offset1)
+    align1 = least_bit_hwi (offset1);
+  if (offset2)
+    align2 = least_bit_hwi (offset2);
+  if (align2 > align1)
+    return 0;
+  if (memcmp (x->bytes, y->bytes, x->size * sizeof (target_unit)) != 0)
+    return 0;
+  return 1;
+}
+
+/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
+   constants and scalar constants with the values of CONST_VECTOR elements,
+   try to alias the scalar constants with the CONST_VECTOR elements.  */
+
+static void
+optimize_constant_pool (struct rtx_constant_pool *pool)
+{
+  auto_vec<target_unit, 128> buffer;
+  auto_vec<constant_descriptor_rtx_data *, 128> vec;
+  object_allocator<constant_descriptor_rtx_data>
+    data_pool ("constant_descriptor_rtx_data_pool");
+  int idx = 0;
+  size_t size = 0;
+  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
+    if (desc->mark > 0
+	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
+	      && SYMBOL_REF_BLOCK (desc->sym)))
+      {
+	buffer.truncate (0);
+	buffer.reserve (GET_MODE_SIZE (desc->mode));
+	if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
+			       GET_MODE_SIZE (desc->mode)))
+	  {
+	    constant_descriptor_rtx_data *data = data_pool.allocate ();
+	    data->desc = desc;
+	    data->bytes = NULL;
+	    data->size = GET_MODE_SIZE (desc->mode);
+	    data->offset = 0;
+	    data->hash = idx++;
+	    size += data->size;
+	    vec.safe_push (data);
+	  }
+      }
+  if (idx)
+    {
+      vec.qsort (constant_descriptor_rtx_data_cmp);
+      unsigned min_size = vec.last ()->size;
+      target_unit *bytes = XNEWVEC (target_unit, size);
+      unsigned int i;
+      constant_descriptor_rtx_data *data;
+      hash_table<const_rtx_data_hasher> * htab
+	= new hash_table<const_rtx_data_hasher> (31);
+      size = 0;
+      FOR_EACH_VEC_ELT (vec, i, data)
+	{
+	  buffer.truncate (0);
+	  native_encode_rtx (data->desc->mode, data->desc->constant,
+			     buffer, 0, data->size);
+	  memcpy (bytes + size, buffer.address (), data->size);
+	  data->bytes = bytes + size;
+	  data->hash = iterative_hash (data->bytes,
+				       data->size * sizeof (target_unit), 0);
+	  size += data->size;
+	  constant_descriptor_rtx_data **slot
+	    = htab->find_slot_with_hash (data, data->hash, INSERT);
+	  if (*slot)
+	    {
+	      data->desc->mark = ~(*slot)->desc->labelno;
+	      data->desc->offset = (*slot)->offset;
+	    }
+	  else
+	    {
+	      unsigned int sz = 1 << floor_log2 (data->size);
+
+	      *slot = data;
+	      for (sz >>= 1; sz >= min_size; sz >>= 1)
+		for (unsigned off = 0; off + sz <= data->size; off += sz)
+		  {
+		    constant_descriptor_rtx_data tmp;
+		    tmp.desc = data->desc;
+		    tmp.bytes = data->bytes + off;
+		    tmp.size = sz;
+		    tmp.offset = off;
+		    tmp.hash = iterative_hash (tmp.bytes,
+					       sz * sizeof (target_unit), 0);
+		    slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
+		    if (*slot == NULL)
+		      {
+			*slot = data_pool.allocate ();
+			**slot = tmp;
+		      }
+		  }
+	    }
+	}
+      delete htab;
+      XDELETE (bytes);
+    }
+  data_pool.release ();
+}
+
 /* Mark all constants that are used in the current function, then write
    out the function's private constant pool.  */
 
@@ -4251,6 +4426,10 @@ output_constant_pool (const char *fnname
 void
 output_shared_constant_pool (void)
 {
+  if (optimize
+      && TARGET_SUPPORTS_ALIASES)
+    optimize_constant_pool (shared_constant_pool);
+
   output_constant_pool_contents (shared_constant_pool);
 }
 


	Jakub
Hongtao Liu Aug. 28, 2020, 5:18 p.m. UTC | #21
On Thu, Aug 27, 2020 at 8:24 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
>
> MEM_P
>
> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P
>
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>
> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
>
> The usual way of doing this would be through
>   validate_change (insn, loc, vec_dup, 0);
>
> Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> full vector loads?
>

Yes, broadcast is insufficient. refer to [1]
[1]. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87767#c5

Since pass_combine won't combine load instruction into memory operand,
i.e. it wouldn't combine *mov mem xmm* and *vaddps xmm, xmm, xmm* into
*vaddps mem, xmm, xmm*
It just left this to ira, broadcast under avx512f isn't excluded.

Maybe patch should be rewritten by using post reload splitter, then
there would be no concern of an extra pass, but still have the latter
issue as you mentioned.

> As Jeff wrote, I wonder if when successfully replacing those pool constants
> the old constant pool entries will be omitted.
>
> Another thing I wonder about is whether more analysis shouldn't be used.
> E.g. if the constant pool entry is already emitted into .rodata anyway
> (e.g. some earlier function needed it), using the broadcast will mean

Yes, some later function may need it either, so we need a global view
to decide the replacement, hope it could be done by generic constant
pool code.

> actually larger .rodata.  If {1to8} and similar is as fast as reading all
> the same elements from memory (or faster), perhaps in that case it should
> broadcast from the first element of the existing constant pool full vector
> rather than creating a new one.
> And similarly, perhaps the function should look at all constant pool entries
> in the current function (not yet emitted into .rodata) and if it would
> succeed for some and not for others, either use broadcast from its first
> element or not perform it for the others too.
>
>         Jakub
>
Jakub Jelinek Aug. 30, 2020, 9:24 a.m. UTC | #22
On Fri, Aug 28, 2020 at 06:25:46PM +0200, Jakub Jelinek via Gcc-patches wrote:
> You're right, thanks for spotting it, I've missed native_encode_rtx will do
> quick_push rather than safe_push.
> 
> Updated patch below, it shouldn't be needed in the second loop, because
> the first loop should already grow it to the largest size.

Testing beyond a bug in i386.md revealed also that I've lost a cast to long
to avoid breaking 32-bit bootstrap.

This is the version that passed bootstrap/regtest on both x86_64-linux and
i686-linux.  In both bootstraps/regtests together, it saved (from the
statistics I've gathered) 63104 .rodata bytes (before constant merging),
in 6814 hits of the data->desc->mark = ~(*slot)->desc->labelno;.

Ok for trunk?

2020-08-30  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/54201
	* varasm.c: Include alloc-pool.h.
	(output_constant_pool_contents): Emit desc->mark < 0 entries as
	aliases.
	(struct constant_descriptor_rtx_data): New type.
	(constant_descriptor_rtx_data_cmp): New function.
	(struct const_rtx_data_hasher): New type.
	(const_rtx_data_hasher::hash, const_rtx_data_hasher::equal): New
	methods.
	(optimize_constant_pool): New function.
	(output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.

--- gcc/varasm.c.jj	2020-07-28 15:39:10.091755086 +0200
+++ gcc/varasm.c	2020-08-28 18:21:58.943759578 +0200
@@ -57,6 +57,7 @@ along with GCC; see the file COPYING3.
 #include "asan.h"
 #include "rtl-iter.h"
 #include "file-prefix-map.h" /* remap_debug_filename()  */
+#include "alloc-pool.h"
 
 #ifdef XCOFF_DEBUGGING_INFO
 #include "xcoffout.h"		/* Needed for external data declarations.  */
@@ -4198,7 +4199,27 @@ output_constant_pool_contents (struct rt
   class constant_descriptor_rtx *desc;
 
   for (desc = pool->first; desc ; desc = desc->next)
-    if (desc->mark)
+    if (desc->mark < 0)
+      {
+#ifdef ASM_OUTPUT_DEF
+	const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
+	char label[256];
+	char buffer[256 + 32];
+	const char *p;
+
+	ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
+	p = targetm.strip_name_encoding (label);
+	if (desc->offset)
+	  {
+	    sprintf (buffer, "%s+%ld", p, (long) (desc->offset));
+	    p = buffer;
+	  }
+	ASM_OUTPUT_DEF (asm_out_file, name, p);
+#else
+	gcc_unreachable ();
+#endif
+      }
+    else if (desc->mark)
       {
 	/* If the constant is part of an object_block, make sure that
 	   the constant has been positioned within its block, but do not
@@ -4216,6 +4237,160 @@ output_constant_pool_contents (struct rt
       }
 }
 
+struct constant_descriptor_rtx_data {
+  constant_descriptor_rtx *desc;
+  target_unit *bytes;
+  unsigned short size;
+  unsigned short offset;
+  unsigned int hash;
+};
+
+/* qsort callback to sort constant_descriptor_rtx_data * vector by
+   decreasing size.  */
+
+static int
+constant_descriptor_rtx_data_cmp (const void *p1, const void *p2)
+{
+  constant_descriptor_rtx_data *const data1
+    = *(constant_descriptor_rtx_data * const *) p1;
+  constant_descriptor_rtx_data *const data2
+    = *(constant_descriptor_rtx_data * const *) p2;
+  if (data1->size > data2->size)
+    return -1;
+  if (data1->size < data2->size)
+    return 1;
+  if (data1->hash < data2->hash)
+    return -1;
+  gcc_assert (data1->hash > data2->hash);
+  return 1;
+}
+
+struct const_rtx_data_hasher : nofree_ptr_hash<constant_descriptor_rtx_data>
+{
+  static hashval_t hash (constant_descriptor_rtx_data *);
+  static bool equal (constant_descriptor_rtx_data *,
+		     constant_descriptor_rtx_data *);
+};
+
+/* Hash and compare functions for const_rtx_data_htab.  */
+
+hashval_t
+const_rtx_data_hasher::hash (constant_descriptor_rtx_data *data)
+{
+  return data->hash;
+}
+
+bool
+const_rtx_data_hasher::equal (constant_descriptor_rtx_data *x,
+			      constant_descriptor_rtx_data *y)
+{
+  if (x->hash != y->hash || x->size != y->size)
+    return 0;
+  unsigned int align1 = x->desc->align;
+  unsigned int align2 = y->desc->align;
+  unsigned int offset1 = (x->offset * BITS_PER_UNIT) & (align1 - 1);
+  unsigned int offset2 = (y->offset * BITS_PER_UNIT) & (align2 - 1);
+  if (offset1)
+    align1 = least_bit_hwi (offset1);
+  if (offset2)
+    align2 = least_bit_hwi (offset2);
+  if (align2 > align1)
+    return 0;
+  if (memcmp (x->bytes, y->bytes, x->size * sizeof (target_unit)) != 0)
+    return 0;
+  return 1;
+}
+
+/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
+   constants and scalar constants with the values of CONST_VECTOR elements,
+   try to alias the scalar constants with the CONST_VECTOR elements.  */
+
+static void
+optimize_constant_pool (struct rtx_constant_pool *pool)
+{
+  auto_vec<target_unit, 128> buffer;
+  auto_vec<constant_descriptor_rtx_data *, 128> vec;
+  object_allocator<constant_descriptor_rtx_data>
+    data_pool ("constant_descriptor_rtx_data_pool");
+  int idx = 0;
+  size_t size = 0;
+  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
+    if (desc->mark > 0
+	&& ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
+	      && SYMBOL_REF_BLOCK (desc->sym)))
+      {
+	buffer.truncate (0);
+	buffer.reserve (GET_MODE_SIZE (desc->mode));
+	if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
+			       GET_MODE_SIZE (desc->mode)))
+	  {
+	    constant_descriptor_rtx_data *data = data_pool.allocate ();
+	    data->desc = desc;
+	    data->bytes = NULL;
+	    data->size = GET_MODE_SIZE (desc->mode);
+	    data->offset = 0;
+	    data->hash = idx++;
+	    size += data->size;
+	    vec.safe_push (data);
+	  }
+      }
+  if (idx)
+    {
+      vec.qsort (constant_descriptor_rtx_data_cmp);
+      unsigned min_size = vec.last ()->size;
+      target_unit *bytes = XNEWVEC (target_unit, size);
+      unsigned int i;
+      constant_descriptor_rtx_data *data;
+      hash_table<const_rtx_data_hasher> * htab
+	= new hash_table<const_rtx_data_hasher> (31);
+      size = 0;
+      FOR_EACH_VEC_ELT (vec, i, data)
+	{
+	  buffer.truncate (0);
+	  native_encode_rtx (data->desc->mode, data->desc->constant,
+			     buffer, 0, data->size);
+	  memcpy (bytes + size, buffer.address (), data->size);
+	  data->bytes = bytes + size;
+	  data->hash = iterative_hash (data->bytes,
+				       data->size * sizeof (target_unit), 0);
+	  size += data->size;
+	  constant_descriptor_rtx_data **slot
+	    = htab->find_slot_with_hash (data, data->hash, INSERT);
+	  if (*slot)
+	    {
+	      data->desc->mark = ~(*slot)->desc->labelno;
+	      data->desc->offset = (*slot)->offset;
+	    }
+	  else
+	    {
+	      unsigned int sz = 1 << floor_log2 (data->size);
+
+	      *slot = data;
+	      for (sz >>= 1; sz >= min_size; sz >>= 1)
+		for (unsigned off = 0; off + sz <= data->size; off += sz)
+		  {
+		    constant_descriptor_rtx_data tmp;
+		    tmp.desc = data->desc;
+		    tmp.bytes = data->bytes + off;
+		    tmp.size = sz;
+		    tmp.offset = off;
+		    tmp.hash = iterative_hash (tmp.bytes,
+					       sz * sizeof (target_unit), 0);
+		    slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
+		    if (*slot == NULL)
+		      {
+			*slot = data_pool.allocate ();
+			**slot = tmp;
+		      }
+		  }
+	    }
+	}
+      delete htab;
+      XDELETE (bytes);
+    }
+  data_pool.release ();
+}
+
 /* Mark all constants that are used in the current function, then write
    out the function's private constant pool.  */
 
@@ -4251,6 +4426,10 @@ output_constant_pool (const char *fnname
 void
 output_shared_constant_pool (void)
 {
+  if (optimize
+      && TARGET_SUPPORTS_ALIASES)
+    optimize_constant_pool (shared_constant_pool);
+
   output_constant_pool_contents (shared_constant_pool);
 }
 


	Jakub
Richard Biener Aug. 31, 2020, 8:18 a.m. UTC | #23
On Sun, Aug 30, 2020 at 11:24 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Fri, Aug 28, 2020 at 06:25:46PM +0200, Jakub Jelinek via Gcc-patches wrote:
> > You're right, thanks for spotting it, I've missed native_encode_rtx will do
> > quick_push rather than safe_push.
> >
> > Updated patch below, it shouldn't be needed in the second loop, because
> > the first loop should already grow it to the largest size.
>
> Testing beyond a bug in i386.md revealed also that I've lost a cast to long
> to avoid breaking 32-bit bootstrap.
>
> This is the version that passed bootstrap/regtest on both x86_64-linux and
> i686-linux.  In both bootstraps/regtests together, it saved (from the
> statistics I've gathered) 63104 .rodata bytes (before constant merging),
> in 6814 hits of the data->desc->mark = ~(*slot)->desc->labelno;.
>
> Ok for trunk?

OK.

Thanks,
Richard.

> 2020-08-30  Jakub Jelinek  <jakub@redhat.com>
>
>         PR middle-end/54201
>         * varasm.c: Include alloc-pool.h.
>         (output_constant_pool_contents): Emit desc->mark < 0 entries as
>         aliases.
>         (struct constant_descriptor_rtx_data): New type.
>         (constant_descriptor_rtx_data_cmp): New function.
>         (struct const_rtx_data_hasher): New type.
>         (const_rtx_data_hasher::hash, const_rtx_data_hasher::equal): New
>         methods.
>         (optimize_constant_pool): New function.
>         (output_shared_constant_pool): Call it if TARGET_SUPPORTS_ALIASES.
>
> --- gcc/varasm.c.jj     2020-07-28 15:39:10.091755086 +0200
> +++ gcc/varasm.c        2020-08-28 18:21:58.943759578 +0200
> @@ -57,6 +57,7 @@ along with GCC; see the file COPYING3.
>  #include "asan.h"
>  #include "rtl-iter.h"
>  #include "file-prefix-map.h" /* remap_debug_filename()  */
> +#include "alloc-pool.h"
>
>  #ifdef XCOFF_DEBUGGING_INFO
>  #include "xcoffout.h"          /* Needed for external data declarations.  */
> @@ -4198,7 +4199,27 @@ output_constant_pool_contents (struct rt
>    class constant_descriptor_rtx *desc;
>
>    for (desc = pool->first; desc ; desc = desc->next)
> -    if (desc->mark)
> +    if (desc->mark < 0)
> +      {
> +#ifdef ASM_OUTPUT_DEF
> +       const char *name = targetm.strip_name_encoding (XSTR (desc->sym, 0));
> +       char label[256];
> +       char buffer[256 + 32];
> +       const char *p;
> +
> +       ASM_GENERATE_INTERNAL_LABEL (label, "LC", ~desc->mark);
> +       p = targetm.strip_name_encoding (label);
> +       if (desc->offset)
> +         {
> +           sprintf (buffer, "%s+%ld", p, (long) (desc->offset));
> +           p = buffer;
> +         }
> +       ASM_OUTPUT_DEF (asm_out_file, name, p);
> +#else
> +       gcc_unreachable ();
> +#endif
> +      }
> +    else if (desc->mark)
>        {
>         /* If the constant is part of an object_block, make sure that
>            the constant has been positioned within its block, but do not
> @@ -4216,6 +4237,160 @@ output_constant_pool_contents (struct rt
>        }
>  }
>
> +struct constant_descriptor_rtx_data {
> +  constant_descriptor_rtx *desc;
> +  target_unit *bytes;
> +  unsigned short size;
> +  unsigned short offset;
> +  unsigned int hash;
> +};
> +
> +/* qsort callback to sort constant_descriptor_rtx_data * vector by
> +   decreasing size.  */
> +
> +static int
> +constant_descriptor_rtx_data_cmp (const void *p1, const void *p2)
> +{
> +  constant_descriptor_rtx_data *const data1
> +    = *(constant_descriptor_rtx_data * const *) p1;
> +  constant_descriptor_rtx_data *const data2
> +    = *(constant_descriptor_rtx_data * const *) p2;
> +  if (data1->size > data2->size)
> +    return -1;
> +  if (data1->size < data2->size)
> +    return 1;
> +  if (data1->hash < data2->hash)
> +    return -1;
> +  gcc_assert (data1->hash > data2->hash);
> +  return 1;
> +}
> +
> +struct const_rtx_data_hasher : nofree_ptr_hash<constant_descriptor_rtx_data>
> +{
> +  static hashval_t hash (constant_descriptor_rtx_data *);
> +  static bool equal (constant_descriptor_rtx_data *,
> +                    constant_descriptor_rtx_data *);
> +};
> +
> +/* Hash and compare functions for const_rtx_data_htab.  */
> +
> +hashval_t
> +const_rtx_data_hasher::hash (constant_descriptor_rtx_data *data)
> +{
> +  return data->hash;
> +}
> +
> +bool
> +const_rtx_data_hasher::equal (constant_descriptor_rtx_data *x,
> +                             constant_descriptor_rtx_data *y)
> +{
> +  if (x->hash != y->hash || x->size != y->size)
> +    return 0;
> +  unsigned int align1 = x->desc->align;
> +  unsigned int align2 = y->desc->align;
> +  unsigned int offset1 = (x->offset * BITS_PER_UNIT) & (align1 - 1);
> +  unsigned int offset2 = (y->offset * BITS_PER_UNIT) & (align2 - 1);
> +  if (offset1)
> +    align1 = least_bit_hwi (offset1);
> +  if (offset2)
> +    align2 = least_bit_hwi (offset2);
> +  if (align2 > align1)
> +    return 0;
> +  if (memcmp (x->bytes, y->bytes, x->size * sizeof (target_unit)) != 0)
> +    return 0;
> +  return 1;
> +}
> +
> +/* Attempt to optimize constant pool POOL.  If it contains both CONST_VECTOR
> +   constants and scalar constants with the values of CONST_VECTOR elements,
> +   try to alias the scalar constants with the CONST_VECTOR elements.  */
> +
> +static void
> +optimize_constant_pool (struct rtx_constant_pool *pool)
> +{
> +  auto_vec<target_unit, 128> buffer;
> +  auto_vec<constant_descriptor_rtx_data *, 128> vec;
> +  object_allocator<constant_descriptor_rtx_data>
> +    data_pool ("constant_descriptor_rtx_data_pool");
> +  int idx = 0;
> +  size_t size = 0;
> +  for (constant_descriptor_rtx *desc = pool->first; desc; desc = desc->next)
> +    if (desc->mark > 0
> +       && ! (SYMBOL_REF_HAS_BLOCK_INFO_P (desc->sym)
> +             && SYMBOL_REF_BLOCK (desc->sym)))
> +      {
> +       buffer.truncate (0);
> +       buffer.reserve (GET_MODE_SIZE (desc->mode));
> +       if (native_encode_rtx (desc->mode, desc->constant, buffer, 0,
> +                              GET_MODE_SIZE (desc->mode)))
> +         {
> +           constant_descriptor_rtx_data *data = data_pool.allocate ();
> +           data->desc = desc;
> +           data->bytes = NULL;
> +           data->size = GET_MODE_SIZE (desc->mode);
> +           data->offset = 0;
> +           data->hash = idx++;
> +           size += data->size;
> +           vec.safe_push (data);
> +         }
> +      }
> +  if (idx)
> +    {
> +      vec.qsort (constant_descriptor_rtx_data_cmp);
> +      unsigned min_size = vec.last ()->size;
> +      target_unit *bytes = XNEWVEC (target_unit, size);
> +      unsigned int i;
> +      constant_descriptor_rtx_data *data;
> +      hash_table<const_rtx_data_hasher> * htab
> +       = new hash_table<const_rtx_data_hasher> (31);
> +      size = 0;
> +      FOR_EACH_VEC_ELT (vec, i, data)
> +       {
> +         buffer.truncate (0);
> +         native_encode_rtx (data->desc->mode, data->desc->constant,
> +                            buffer, 0, data->size);
> +         memcpy (bytes + size, buffer.address (), data->size);
> +         data->bytes = bytes + size;
> +         data->hash = iterative_hash (data->bytes,
> +                                      data->size * sizeof (target_unit), 0);
> +         size += data->size;
> +         constant_descriptor_rtx_data **slot
> +           = htab->find_slot_with_hash (data, data->hash, INSERT);
> +         if (*slot)
> +           {
> +             data->desc->mark = ~(*slot)->desc->labelno;
> +             data->desc->offset = (*slot)->offset;
> +           }
> +         else
> +           {
> +             unsigned int sz = 1 << floor_log2 (data->size);
> +
> +             *slot = data;
> +             for (sz >>= 1; sz >= min_size; sz >>= 1)
> +               for (unsigned off = 0; off + sz <= data->size; off += sz)
> +                 {
> +                   constant_descriptor_rtx_data tmp;
> +                   tmp.desc = data->desc;
> +                   tmp.bytes = data->bytes + off;
> +                   tmp.size = sz;
> +                   tmp.offset = off;
> +                   tmp.hash = iterative_hash (tmp.bytes,
> +                                              sz * sizeof (target_unit), 0);
> +                   slot = htab->find_slot_with_hash (&tmp, tmp.hash, INSERT);
> +                   if (*slot == NULL)
> +                     {
> +                       *slot = data_pool.allocate ();
> +                       **slot = tmp;
> +                     }
> +                 }
> +           }
> +       }
> +      delete htab;
> +      XDELETE (bytes);
> +    }
> +  data_pool.release ();
> +}
> +
>  /* Mark all constants that are used in the current function, then write
>     out the function's private constant pool.  */
>
> @@ -4251,6 +4426,10 @@ output_constant_pool (const char *fnname
>  void
>  output_shared_constant_pool (void)
>  {
> +  if (optimize
> +      && TARGET_SUPPORTS_ALIASES)
> +    optimize_constant_pool (shared_constant_pool);
> +
>    output_constant_pool_contents (shared_constant_pool);
>  }
>
>
>
>         Jakub
>
Hongtao Liu Sept. 1, 2020, 9:55 a.m. UTC | #24
On Thu, Aug 27, 2020 at 8:24 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
>
> MEM_P
>

Changed.

> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P

Changed.

>
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>

Changed.

> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
>
> The usual way of doing this would be through
>   validate_change (insn, loc, vec_dup, 0);
>

Changed.

> Under what circumstances are we seeing a SEQUENCE in the x86 backend?  I'm
> surprised we need to handle that case.
>

Remove handling of SEQUENCE.

>Is there an existing pass over the RTL chain where this would work so that it's
>more compile-time efficient?
>

I tried define_split, but there's too many of them(considering usage
of define_subst for mask).
Also for new added instructions which support embedded broadcast,
corresponding define_split needs to be added.

Update patch.

--
BR,
Hongtao
Jakub Jelinek Sept. 1, 2020, 10:11 a.m. UTC | #25
On Tue, Sep 01, 2020 at 05:55:18PM +0800, Hongtao Liu wrote:
> I tried define_split, but there's too many of them(considering usage
> of define_subst for mask).
> Also for new added instructions which support embedded broadcast,
> corresponding define_split needs to be added.

One pass that could (sometimes) handle it is the rpad pass which is also
added right after the combiner like your pass.
So, couldn't you call replace_constant_pool_with_broadcast from both
your pass body loop and remove_partial_avx_dependency, the latter guarded
with if (TARGET_AVX512F), and change the new pass gate to be false when
the rpad pass gate is true?

> +static void
> +replace_constant_pool_with_broadcast (rtx_insn* insn)

Formatting, * should be after space, not before it.

> +      first = XVECEXP (constant, 0, 0);
> +      /* There could be some rtx like
> +	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> +	 but with "*.LC1" refer to V2DI constant vector.  */
> +      if (GET_MODE (constant) != mode)
> +	return;

Is there a reason why don't you want to handle that?
You could do instead:
      if (GET_MODE (constant) != mode)
	{
	  constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
	  if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
	    return;
	}
      first = XVECEXP (constant, 0, 0);

Other than that LGTM.

	Jakub
Hongtao Liu Sept. 2, 2020, 1:57 a.m. UTC | #26
On Tue, Sep 1, 2020 at 6:11 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Tue, Sep 01, 2020 at 05:55:18PM +0800, Hongtao Liu wrote:
> > I tried define_split, but there's too many of them(considering usage
> > of define_subst for mask).
> > Also for new added instructions which support embedded broadcast,
> > corresponding define_split needs to be added.
>
> One pass that could (sometimes) handle it is the rpad pass which is also
> added right after the combiner like your pass.
> So, couldn't you call replace_constant_pool_with_broadcast from both
> your pass body loop and remove_partial_avx_dependency, the latter guarded
> with if (TARGET_AVX512F), and change the new pass gate to be false when
> the rpad pass gate is true?
>

Yes, it could save compile time when both this pass and rpad pass are available.
Changed.

> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
>
> Formatting, * should be after space, not before it.
>

Changed.

> > +      first = XVECEXP (constant, 0, 0);
> > +      /* There could be some rtx like
> > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > +      but with "*.LC1" refer to V2DI constant vector.  */
> > +      if (GET_MODE (constant) != mode)
> > +     return;
>
> Is there a reason why don't you want to handle that?

just avoid error caused by different element number.

> You could do instead:
>       if (GET_MODE (constant) != mode)
>         {
>           constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
>           if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
>             return;
>         }
>       first = XVECEXP (constant, 0, 0);
>

Changed.

> Other than that LGTM.
>
>         Jakub
>
Jakub Jelinek Sept. 2, 2020, 9:58 a.m. UTC | #27
On Wed, Sep 02, 2020 at 09:57:08AM +0800, Hongtao Liu via Gcc-patches wrote:
> +
> +      first = XVECEXP (constant, 0, 0);
> +      /* There could be some rtx like
> +	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> +	 but with "*.LC1" refer to V2DI constant vector.  */
> +      if (GET_MODE (constant) != mode)
> +	{
> +	  constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
> +	  if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
> +	    return;
> +	}

The
      first = XVECEXP (constant, 0, 0);
line needs to be after this if, not before it, otherwise it will miscompile
things or just ICE.

> @@ -2197,6 +2272,10 @@ remove_partial_avx_dependency (void)
>  	  if (!NONDEBUG_INSN_P (insn))
>  	    continue;
>  
> +	  /* Hanlde AVX512 embedded broadcast here to save compile time.  */

s/Hanlde/Handle/

> +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> +    {
> +      if (!INSN_P (insn))
> +	continue;
> +      replace_constant_pool_with_broadcast (insn);
> +    }

Perhaps instead do:
  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
    if (INSN_P (insn))
      replace_constant_pool_with_broadcast (insn);
?

> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      /* Return false if rpad pass gate is true.
> +	 replace_constant_pool_with_broadcast is called
> +	 from both this pass and rpad pass.  */
> +      return (TARGET_AVX512F
> +	      && !(TARGET_AVX
> +		   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> +		   && TARGET_SSE_MATH
> +		   && optimize
> +		   && optimize_function_for_speed_p (cfun)));

I think this could be a maintainance nightmare.
Perhaps instead add

static bool
remove_partial_avx_dependency_gate ()
{
  return (TARGET_AVX
	  && TARGET_SSE_PARTIAL_REG_DEPENDENCY
	  && TARGET_SSE_MATH
	  && optimize
	  && optimize_function_for_speed_p (cfun));
}
after the remove_partial_avx_dependency function definition,
change pass_remove_partial_avx_dependency gate body to
      return remove_partial_avx_dependency_gate ();
and in pass_constant_pool_broadcast::gate do
      return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ();
(with the comment you have there)?

LGTM with those changes.

	Jakub
Hongtao Liu Sept. 3, 2020, 2:11 a.m. UTC | #28
On Wed, Sep 2, 2020 at 5:58 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Sep 02, 2020 at 09:57:08AM +0800, Hongtao Liu via Gcc-patches wrote:
> > +
> > +      first = XVECEXP (constant, 0, 0);
> > +      /* There could be some rtx like
> > +      (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
> > +      but with "*.LC1" refer to V2DI constant vector.  */
> > +      if (GET_MODE (constant) != mode)
> > +     {
> > +       constant = simplify_subreg (mode, constant, GET_MODE (constant), 0);
> > +       if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
> > +         return;
> > +     }
>
> The
>       first = XVECEXP (constant, 0, 0);
> line needs to be after this if, not before it, otherwise it will miscompile
> things or just ICE.
>

Changed.

> > @@ -2197,6 +2272,10 @@ remove_partial_avx_dependency (void)
> >         if (!NONDEBUG_INSN_P (insn))
> >           continue;
> >
> > +       /* Hanlde AVX512 embedded broadcast here to save compile time.  */
>
> s/Hanlde/Handle/
>

Changed, sorry for the typo.

> > +  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
> > +    {
> > +      if (!INSN_P (insn))
> > +     continue;
> > +      replace_constant_pool_with_broadcast (insn);
> > +    }
>
> Perhaps instead do:
>   for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
>     if (INSN_P (insn))
>       replace_constant_pool_with_broadcast (insn);
> ?
>

Changed.

> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      /* Return false if rpad pass gate is true.
> > +      replace_constant_pool_with_broadcast is called
> > +      from both this pass and rpad pass.  */
> > +      return (TARGET_AVX512F
> > +           && !(TARGET_AVX
> > +                && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> > +                && TARGET_SSE_MATH
> > +                && optimize
> > +                && optimize_function_for_speed_p (cfun)));
>
> I think this could be a maintainance nightmare.
> Perhaps instead add
>

Yes, a common interface should be added as bellow, changed.

> static bool
> remove_partial_avx_dependency_gate ()
> {
>   return (TARGET_AVX
>           && TARGET_SSE_PARTIAL_REG_DEPENDENCY
>           && TARGET_SSE_MATH
>           && optimize
>           && optimize_function_for_speed_p (cfun));
> }
> after the remove_partial_avx_dependency function definition,
> change pass_remove_partial_avx_dependency gate body to
>       return remove_partial_avx_dependency_gate ();
> and in pass_constant_pool_broadcast::gate do
>       return (TARGET_AVX512F && !remove_partial_avx_dependency_gate ();
> (with the comment you have there)?
>
> LGTM with those changes.
>
>         Jakub
>

Thanks for the review, update patch.
Jakub Jelinek Sept. 3, 2020, 7:27 a.m. UTC | #29
On Thu, Sep 03, 2020 at 10:11:14AM +0800, Hongtao Liu wrote:
> Thanks for the review, update patch.

Ok for trunk, thanks.

> From acf3825279190ca0540bb4704f66568fdbe06ce8 Mon Sep 17 00:00:00 2001
> From: liuhongt <hongtao.liu@intel.com>
> Date: Wed, 8 Jul 2020 17:14:36 +0800
> Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.
> 
> For constant vector having one duplicated value, there's no need to put
> whole vector in the constant pool, using embedded broadcast instead.
> 
> 2020-07-09  Hongtao Liu  <hongtao.liu@intel.com>
> 
> gcc/ChangeLog:
> 
> 	PR target/87767
> 	* config/i386/i386-features.c
> 	(replace_constant_pool_with_broadcast): New function.
> 	(constant_pool_broadcast): Ditto.
> 	(class pass_constant_pool_broadcast): New pass.
> 	(make_pass_constant_pool_broadcast): Ditto.
> 	(remove_partial_avx_dependency): Call
> 	replace_constant_pool_with_broadcast under TARGET_AVX512F, it
> 	would save compile time when both pass rpad and cpb are
> 	available.
> 	(remove_partial_avx_dependency_gate): New function.
> 	(class pass_remove_partial_avx_dependency::gate): Call
> 	remove_partial_avx_dependency_gate.
> 	* config/i386/i386-passes.def: Insert new pass after combine.
> 	* config/i386/i386-protos.h
> 	(make_pass_constant_pool_broadcast): Declare.
> 	* config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst):
> 	New define_insn.
> 	(*avx512f_mul<mode>3<mask_name>_bcst): Ditto.
> 	* config/i386/avx512fintrin.h (_mm512_set1_ps,
> 	_mm512_set1_pd,_mm512_set1_epi32, _mm512_set1_epi64): Adjusted.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR target/87767
> 	* gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-3.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-4.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-5.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-6.c: New test.
> 	* gcc.target/i386/avx512f-broadcast-pr87767-7.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-2.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-3.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-4.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-5.c: New test.
> 	* gcc.target/i386/avx512vl-broadcast-pr87767-6.c: New test.

	Jakub
diff mbox series

Patch

From b8f49299e3d23f927a659cd394e3099e3291a76f Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 8 Jul 2020 17:14:36 +0800
Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.

For constant vector having one duplicated value, there's no need to put
whole vector in the constant pool, using embedded broadcast instead.

2020-07-09  Hongtao Liu  <hongtao.liu@intel.com>

gcc/ChangeLog:

	PR target/87767
	* config/i386/i386-features.c
	(replace_constant_pool_with_broadcast): New function.
	(constant_pool_broadcast): Ditto.
	(class pass_constant_pool_broadcast): New pass.
	(make_pass_constant_pool_broadcast): Ditto.
	* config/i386/i386-passes.def: Insert new pass after combine.
	* config/i386/i386-protos.h
	(make_pass_constant_pool_broadcast): Declare.
	* config/i386/sse.md (*avx512dq_mul<mode>3<mask_name>_bcst,
	*avx512f_mul<mode>3<mask_name>_bcst): New define_insn.

gcc/testsuite/ChangeLog:

	PR target/87767
	* gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
	* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
	* gcc.target/i386/pr92865-1.c: Adjust testcase.
---
 gcc/config/i386/i386-features.c               | 146 ++++++++++++++++++
 gcc/config/i386/i386-passes.def               |   1 +
 gcc/config/i386/i386-protos.h                 |   1 +
 gcc/config/i386/sse.md                        |  25 +++
 .../i386/avx2-broadcast-pr87767-1.c           |  40 +++++
 .../i386/avx512f-broadcast-pr87767-1.c        |  66 ++++++++
 .../i386/avx512f-broadcast-pr87767-2.c        |  54 +++++++
 .../i386/avx512vl-broadcast-pr87767-1.c       |  40 +++++
 gcc/testsuite/gcc.target/i386/pr92865-1.c     |   9 +-
 9 files changed, 378 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 535fc7e981d..8f81d101382 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -2379,6 +2379,152 @@  make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
   return new pass_remove_partial_avx_dependency (ctxt);
 }
 
+/* Replace all one-value const vector that are referenced by SYMBOL_REFs in x
+   with embedded broadcast. i.e.transform
+
+     vpaddq .LC0(%rip), %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+    .quad 3
+
+    to
+
+     vpaddq .LC0(%rip){1to8}, %zmm0, %zmm0
+     ret
+  .LC0:
+    .quad 3  */
+static void
+replace_constant_pool_with_broadcast (rtx_insn* insn)
+{
+  subrtx_ptr_iterator::array_type array;
+  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
+    {
+      rtx *loc = *iter;
+      rtx x = *loc;
+      rtx broadcast_mem, vec_dup, constant, first;
+      machine_mode mode;
+      if (GET_CODE (x) != MEM
+	  || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
+	  || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
+	continue;
+
+      mode = GET_MODE (x);
+      if (!VECTOR_MODE_P (mode))
+	return;
+
+      constant = get_pool_constant (XEXP (x, 0));
+      first = XVECEXP (constant, 0, 0);
+      /* There could be some rtx like
+	 (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
+	 but with "*.LC1" refer to V2DI constant vector.  */
+      if (GET_MODE (constant) != mode)
+	return;
+
+      for (int i = 1; i < GET_MODE_NUNITS (mode); ++i)
+	{
+	  rtx tmp = XVECEXP (constant, 0, i);
+	  /* Only handle one-value const vector.  */
+	  if (!rtx_equal_p (tmp, first))
+	    return;
+	}
+
+      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
+      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
+      *loc = vec_dup;
+      INSN_CODE (insn) = -1;
+      /* Revert change if there's no corresponding pattern.  */
+      if (recog_memoized (insn) < 0)
+      	{
+      	  *loc = x;
+      	  recog_memoized (insn);
+      	}
+      /* At most 1 memory_operand in an insn.  */
+      return;
+    }
+}
+
+/* For const vector having one duplicated value, there's no need to put
+   whole vector in the constant pool when target supports embedded broadcast. */
+static unsigned int
+constant_pool_broadcast (void)
+{
+  timevar_push (TV_MACH_DEP);
+  rtx_insn *insn;
+
+  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
+    {
+      if (!INSN_P (insn))
+	continue;
+
+      /* Insns may appear inside a SEQUENCE.  Only check the patterns of
+	 insns, not any notes that may be attached.  We don't want to mark
+	 a constant just because it happens to appear in a REG_EQUIV note.  */
+      if (rtx_sequence *seq = dyn_cast <rtx_sequence *> (PATTERN (insn)))
+	{
+	  int i, n = seq->len ();
+	  for (i = 0; i < n; ++i)
+	    {
+	      rtx subinsn = seq->element (i);
+	      if (INSN_P (subinsn))
+		replace_constant_pool_with_broadcast (dyn_cast <rtx_insn *> (subinsn));
+	    }
+	}
+      else
+	replace_constant_pool_with_broadcast (insn);
+    }
+  timevar_pop (TV_MACH_DEP);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_constant_pool_broadcast =
+{
+  RTL_PASS, /* type */
+  "cpb", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_MACH_DEP, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_constant_pool_broadcast : public rtl_opt_pass
+{
+public:
+  pass_constant_pool_broadcast (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_constant_pool_broadcast, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return TARGET_AVX512F;
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      return constant_pool_broadcast ();
+    }
+}; // class pass_cpb
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_constant_pool_broadcast (gcc::context *ctxt)
+{
+  return new pass_constant_pool_broadcast (ctxt);
+}
+
 /* This compares the priority of target features in function DECL1
    and DECL2.  It returns positive value if DECL1 is higher priority,
    negative value if DECL2 is higher priority and 0 if they are the
diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
index d83c7b956b1..07ecf8e790f 100644
--- a/gcc/config/i386/i386-passes.def
+++ b/gcc/config/i386/i386-passes.def
@@ -33,3 +33,4 @@  along with GCC; see the file COPYING3.  If not see
   INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_endbr_and_patchable_area);
 
   INSERT_PASS_AFTER (pass_combine, 1, pass_remove_partial_avx_dependency);
+  INSERT_PASS_AFTER (pass_combine, 1, pass_constant_pool_broadcast);
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 7c2ce618f3f..6c6909b41dd 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -386,3 +386,4 @@  extern rtl_opt_pass *make_pass_insert_endbr_and_patchable_area
   (gcc::context *);
 extern rtl_opt_pass *make_pass_remove_partial_avx_dependency
   (gcc::context *);
+extern rtl_opt_pass *make_pass_constant_pool_broadcast (gcc::context *);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 431571a4bc1..fbfb459c5bf 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -12127,6 +12127,19 @@ 
    (set_attr "prefix" "evex")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512dq_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI8_AVX512VL 0 "register_operand" "=v")
+	(mult:VI8_AVX512VL
+	  (vec_duplicate:VI8_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI8_AVX512VL 2 "register_operand" "v")
+))]
+  "TARGET_AVX512DQ"
+  "vpmullq\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3<mask_name>"
   [(set (match_operand:VI4_AVX512F 0 "register_operand")
 	(mult:VI4_AVX512F
@@ -12167,6 +12180,18 @@ 
    (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_insn "*avx512f_mul<mode>3<mask_name>_bcst"
+  [(set (match_operand:VI4_AVX512VL 0 "register_operand" "=v")
+	(mult:VI4_AVX512VL
+	  (vec_duplicate:VI4_AVX512VL
+	    (match_operand:<ssescalarmode> 1 "memory_operand" "m"))
+	  (match_operand:VI4_AVX512VL 2 "register_operand" "v")))]
+  "TARGET_AVX512F"
+   "vpmulld\t{%1<avx512bcst>, %2, %0<mask_operand3>|%0<mask_operand3>, %2, %1<avx512bcst>}"
+  [(set_attr "type" "sseimul")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "<sseinsnmode>")])
+
 (define_expand "mul<mode>3"
   [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand")
 	(mult:VI8_AVX2_AVX512F
diff --git a/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..800ef1f957e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx2-broadcast-pr87767-1.c
@@ -0,0 +1,40 @@ 
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..21249bc0cf9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-1.c
@@ -0,0 +1,66 @@ 
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vpmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vadd\[^\n\]*\\\{1to16\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to2\\\}" 1 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to4\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to8\\\}" 2 } }  */
+/* { dg-final { scan-assembler-times "vmul\[^\n\]*\\\{1to16\\\}" 1 } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef int v16si  __attribute__ ((vector_size (64)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef long long v8di  __attribute__ ((vector_size (64)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef float v16sf  __attribute__ ((vector_size (64)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+typedef double v8df  __attribute__ ((vector_size (64)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v16si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v8di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v16sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+FOO (v8df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v16si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v8di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v16sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
+FOO (v8df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
new file mode 100644
index 00000000000..938346743c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512f-broadcast-pr87767-2.c
@@ -0,0 +1,54 @@ 
+/* PR target/87767 */
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mavx512dq" } */
+
+#include<stdlib.h>
+#include<stdio.h>
+#include "avx512f-broadcast-pr87767-1.c"
+
+#define TEST(VTYPE, TYPE, N, OP_NAME, OP)		\
+  do							\
+    {							\
+      TYPE exp[N], src[N];				\
+      VTYPE res;					\
+      for (int i = 0; i < N; i++)			\
+	src[i] = i * i * 107;				\
+      res = foo_##OP_NAME##_##VTYPE (*(VTYPE*)&src[0]);	\
+      for (int i = 0; i < N; i ++)			\
+	exp[i] = src[i] OP 101;				\
+      for (int j = 0; j < N; j++)			\
+	{						\
+	  if (res[j] != exp[j])				\
+	    abort();					\
+	}						\
+    }							\
+  while (0)
+
+int main()
+{
+  TEST (v4si, int, 4, add, +);
+  TEST (v8si, int, 8, add, +);
+  TEST (v16si, int, 16, add, +);
+  TEST (v2di, long long, 2, add, +);
+  TEST (v4di, long long, 4, add, +);
+  TEST (v8di, long long, 8, add, +);
+  TEST (v4sf, float, 4, add, +);
+  TEST (v8sf, float, 8, add, +);
+  TEST (v16sf, float, 16, add, +);
+  TEST (v2df, double, 2, add, +);
+  TEST (v4df, double, 4, add, +);
+  TEST (v8df, double, 8, add, +);
+
+  TEST (v4si, int, 4, mul, *);
+  TEST (v8si, int, 8, mul, *);
+  TEST (v16si, int, 16, mul, *);
+  TEST (v2di, long long, 2, mul, *);
+  TEST (v4di, long long, 4, mul, *);
+  TEST (v8di, long long, 8, mul, *);
+  TEST (v4sf, float, 4, mul, *);
+  TEST (v8sf, float, 8, mul, *);
+  TEST (v16sf, float, 16, mul, *);
+  TEST (v2df, double, 2, mul, *);
+  TEST (v4df, double, 4, mul, *);
+  TEST (v8df, double, 8, mul, *);
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
new file mode 100644
index 00000000000..ec159a68158
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-broadcast-pr87767-1.c
@@ -0,0 +1,40 @@ 
+/* PR target/87767 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f" } */
+/* { dg-final { scan-assembler-not "\\\{1to\[248\]\\\}" } }  */
+/* { dg-final { scan-assembler-not "\\\{1to16\\\}" } }  */
+
+typedef int v4si  __attribute__ ((vector_size (16)));
+typedef int v8si  __attribute__ ((vector_size (32)));
+typedef long long v2di  __attribute__ ((vector_size (16)));
+typedef long long v4di  __attribute__ ((vector_size (32)));
+typedef float v4sf  __attribute__ ((vector_size (16)));
+typedef float v8sf  __attribute__ ((vector_size (32)));
+typedef double v2df  __attribute__ ((vector_size (16)));
+typedef double v4df  __attribute__ ((vector_size (32)));
+
+#define FOO(VTYPE, OP_NAME, OP)			\
+VTYPE						\
+ __attribute__ ((noipa))			\
+foo_##OP_NAME##_##VTYPE (VTYPE a)		\
+{						\
+  return a OP 101;				\
+}						\
+
+FOO (v4si, add, +);
+FOO (v8si, add, +);
+FOO (v2di, add, +);
+FOO (v4di, add, +);
+FOO (v4sf, add, +);
+FOO (v8sf, add, +);
+FOO (v2df, add, +);
+FOO (v4df, add, +);
+
+FOO (v4si, mul, *);
+FOO (v8si, mul, *);
+FOO (v2di, mul, *);
+FOO (v4di, mul, *);
+FOO (v4sf, mul, *);
+FOO (v8sf, mul, *);
+FOO (v2df, mul, *);
+FOO (v4df, mul, *);
diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
index 49b5778a067..a37487d9af7 100644
--- a/gcc/testsuite/gcc.target/i386/pr92865-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
@@ -3,10 +3,11 @@ 
 /* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
 /* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
 /* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } *
-/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
-/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 2 } } *
+/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 2 } } */
+/* { dg-final { scan-assembler-times "vpbroadcast\[bwqd\]\[\t ]" 16 } } */
 
 extern char arraysb[64];
 extern short arraysw[32];
-- 
2.18.1