diff mbox series

RFC: x87 reduc_plus_scal_* AVX (and AVX512?) expanders

Message ID alpine.LSU.2.20.1810011637470.16707@zhemvz.fhfr.qr
State New
Series RFC: x87 reduc_plus_scal_* AVX (and AVX512?) expanders

Commit Message

Richard Biener Oct. 1, 2018, 2:47 p.m. UTC
I notice that for Zen we create

  0.00 │       vhaddpd %ymm3,%ymm3,%ymm3
  1.41 │       vperm2f128 $0x1,%ymm3,%ymm3,%ymm1
  1.45 │       vaddpd %ymm1,%ymm2,%ymm2

from reduc_plus_scal_v4df, which uses the cross-lane permute vperm2f128
even though the upper half of the result is unused in the end
(we only use double-precision element zero).  Much better would
be to use vextractf128, which is well-pipelined and has good throughput.
(Using vhaddpd at all is quite bad for Zen in itself; I haven't yet
benchmarked it against open-coding the reduction, aka disabling the
expander.)  I can generate

        vhaddpd %ymm3, %ymm3, %ymm3
        vextractf128    $0x1, %ymm3, %xmm1
        vaddpd  %xmm1, %xmm3, %xmm3

easily with the patch below, though even using scalar operations for the
add would be possible.
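For illustration, here is a scalar C model of that three-instruction
sequence (the helper name and array layout are my own, not part of the
patch):

```c
/* Scalar model of: vhaddpd %ymm,%ymm,%ymm;
   vextractf128 $1; vaddpd %xmm,%xmm,%xmm; take element 0.  */
double
reduc_v4df (const double v[4])
{
  /* vhaddpd a,a,a: pairwise sums within each 128-bit lane.  */
  double hadd[4] = { v[0] + v[1], v[0] + v[1], v[2] + v[3], v[2] + v[3] };
  /* vextractf128 $0x1: the upper 128-bit lane.  */
  double hi[2] = { hadd[2], hadd[3] };
  /* vaddpd %xmm: add upper lane onto lower lane.  */
  double sum[2] = { hadd[0] + hi[0], hadd[1] + hi[1] };
  /* vec_extract element 0.  */
  return sum[0];
}
```

Element 0 of the final xmm add is v[0]+v[1]+v[2]+v[3], so no result bits
from the discarded upper ymm half are ever needed.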

reduc_plus_scal_v8df uses ix86_expand_reduc, which seems to use
full-width instructions throughout.  I recently changed the vectorizer
to open-code tem%ymm = lowpart(%zmm) + highpart(%zmm);
tem%xmm = lowpart(tem%ymm) + highpart(tem%ymm); reduce(tem%xmm), which
is better for all cores, so I wonder if ix86_expand_reduc should follow
that scheme.  As said in a related PR, the backend is in full control
of the final reduction sequence used when defining reduc_plus_scal_<mode>,
which IMHO is good; we likely should have tuning-specific patterns
in case there isn't a one-fits-all one.
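The successive-halving scheme the vectorizer now open-codes can likewise
be sketched in scalar C (again, the helper name and layout are mine, for
illustration only):

```c
/* Scalar model of the open-coded v8df reduction: fold the zmm into
   a ymm, the ymm into an xmm, then reduce the last two elements.  */
double
reduc_v8df (const double v[8])
{
  double t4[4], t2[2];
  for (int i = 0; i < 4; i++)
    t4[i] = v[i] + v[i + 4];    /* tem%ymm = lowpart(%zmm) + highpart(%zmm) */
  for (int i = 0; i < 2; i++)
    t2[i] = t4[i] + t4[i + 2];  /* tem%xmm = lowpart(tem%ymm) + highpart(tem%ymm) */
  return t2[0] + t2[1];         /* reduce(tem%xmm) */
}
```

Each step halves the vector width, so no full-width (zmm) instruction
survives past the first add.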

Thanks,
Richard.

Patch

Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md      (revision 264758)
+++ gcc/config/i386/sse.md      (working copy)
@@ -2474,12 +2474,12 @@  (define_expand "reduc_plus_scal_v4df"
   "TARGET_AVX"
 {
   rtx tmp = gen_reg_rtx (V4DFmode);
-  rtx tmp2 = gen_reg_rtx (V4DFmode);
-  rtx vec_res = gen_reg_rtx (V4DFmode);
+  rtx tmp2 = gen_reg_rtx (V2DFmode);
+  rtx vec_res = gen_reg_rtx (V2DFmode);
   emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
-  emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
-  emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
-  emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx));
+  emit_insn (gen_vec_extract_hi_v4df (tmp2, tmp));
+  emit_insn (gen_addv2df3 (vec_res, gen_lowpart (V2DFmode, tmp), tmp2));
+  emit_insn (gen_vec_extractv2dfdf (operands[0], vec_res, const0_rtx));
   DONE;
 })