
[RFC,x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Message ID alpine.LSU.2.20.1907231548290.30921@zhemvz.fhfr.qr
State New

Commit Message

Richard Biener July 23, 2019, 2 p.m. UTC
The following fixes the runtime regression of 456.hmmer caused
by matching ICC in code generation and using cmov more aggressively
(through GIMPLE level MAX_EXPR usage).  Apparently (discovered
by manual assembler editing) using the SSE unit for performing
SImode loads, adds and then two signed max operations plus stores
is quite a bit faster than cmovs - even faster than the original
single cmov plus branchy second max.  Even more so for AMD CPUs
than Intel CPUs.
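
For reference, a minimal sketch of the kind of chain in question
(illustrative only, not the actual 456.hmmer source; all names are
made up):

--cut here--
void
chain (int *mc, const int *a, const int *b, const int *c, int n)
{
  for (int k = 1; k < n; k++)
    {
      /* SImode loads and adds.  */
      int v = a[k-1] + b[k-1];
      int w = c[k-1] + b[k];
      /* Two signed max operations (GIMPLE MAX_EXPRs, currently
         expanded as cmov).  */
      v = v > w ? v : w;
      v = v > mc[k] ? v : mc[k];
      /* SImode store.  */
      mc[k] = v;
    }
}
--cut here--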

Instead of hacking up some pattern recognition pass to transform
integer mode memory-to-memory computation chains involving
conditional moves to "vector" code (similar to what STV does
for TImode ops on x86_64) the following simply allows SImode
into SSE registers (support for this is already there in some
important places like move patterns!).  For the particular
case of 456.hmmer the required support is loads/stores
(already implemented), SImode adds and SImode smax.

So the patch adds a smax pattern for SImode (we don't have any
for scalar modes but currently expand via a conditional move sequence)
emitting as SSE vector max or cmp/cmov depending on the alternative.
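
For illustration, a scalar signed max like

--cut here--
int smax (int a, int b) { return a > b ? a : b; }
--cut here--

currently goes through the conditional move expander, yielding
roughly movl %edi, %eax; cmpl %esi, %edi; cmovl %esi, %eax,
while the SSE alternatives of the new pattern emit pmaxsd/vpmaxsd.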

And it amends the *add<mode>_1 pattern with SSE alternatives
(which have to come before the memory alternative as IRA otherwise
doesn't consider reloading a memory operand to a register).

With this in place the runtime of 456.hmmer improves by 10%
on Haswell, which is back to the pre-regression speed but not
to the level seen when manually editing just the single
important loop.

I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
interesting is probably Zen where moves crossing the
integer - vector domain are excessively expensive (they get
done via the stack).

Clearly this approach will run into register allocation issues
but it looks cleaner than writing yet another STV-like pass
(STV itself is quite awkwardly structured so I refrain from
touching it...).

Anyway - comments?  It seems to me that MMX-in-SSE does
something very similar.

Bootstrapped on x86_64-unknown-linux-gnu, previous testing
revealed some issue.  Forgot that *add<mode>_1 also handles
DImode..., fixed below, re-testing in progress.

Thanks,
Richard.

2019-07-23  Richard Biener  <rguenther@suse.de>

	PR target/91154
	* config/i386/i386.md (smaxsi3): New.
	(*add<mode>_1): Add SSE and AVX variants.
	* config/i386/i386.c (ix86_lea_for_add_ok): Do not allow
	SSE registers.

Comments

Richard Biener July 24, 2019, 8:42 a.m. UTC | #1
On Tue, 23 Jul 2019, Richard Biener wrote:

> 
> The following fixes the runtime regression of 456.hmmer caused
> by matching ICC in code generation and using cmov more aggressively
> (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> by manual assembler editing) using the SSE unit for performing
> SImode loads, adds and then two signed max operations plus stores
> is quite a bit faster than cmovs - even faster than the original
> single cmov plus branchy second max.  Even more so for AMD CPUs
> than Intel CPUs.
> 
> Instead of hacking up some pattern recognition pass to transform
> integer mode memory-to-memory computation chains involving
> conditional moves to "vector" code (similar to what STV does
> for TImode ops on x86_64) the following simply allows SImode
> into SSE registers (support for this is already there in some
> important places like move patterns!).  For the particular
> case of 456.hmmer the required support is loads/stores
> (already implemented), SImode adds and SImode smax.
> 
> So the patch adds a smax pattern for SImode (we don't have any
> for scalar modes but currently expand via a conditional move sequence)
> emitting as SSE vector max or cmp/cmov depending on the alternative.
> 
> And it amends the *add<mode>_1 pattern with SSE alternatives
> (which have to come before the memory alternative as IRA otherwise
> doesn't consider reloading a memory operand to a register).
> 
> With this in place the runtime of 456.hmmer improves by 10%
> on Haswell, which is back to the pre-regression speed but not
> to the level seen when manually editing just the single
> important loop.
> 
> I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> interesting is probably Zen where moves crossing the
> integer - vector domain are excessively expensive (they get
> done via the stack).
> 
> Clearly this approach will run into register allocation issues
> but it looks cleaner than writing yet another STV-like pass
> (STV itself is quite awkwardly structured so I refrain from
> touching it...).
> 
> Anyway - comments?  It seems to me that MMX-in-SSE does
> something very similar.
> 
> Bootstrapped on x86_64-unknown-linux-gnu, previous testing
> revealed some issue.  Forgot that *add<mode>_1 also handles
> DImode..., fixed below, re-testing in progress.

Bootstrapped/tested on x86_64-unknown-linux-gnu.  A 3-run of
SPEC CPU 2006 on a Haswell machine completed and results
are in the noise besides the 456.hmmer improvement:

456.hmmer        9330        184       50.7 S    9330        162        57.4 S
456.hmmer        9330        182       51.2 *    9330        162        57.7 *
456.hmmer        9330        182       51.2 S    9330        162        57.7 S

The peak binaries (patched) are all slightly bigger; the
smaxsi3 pattern triggers 6840 times, every time using SSE
registers and never expanding to the cmov variant.  The
*add<mode>_1 pattern ends up using SSE regs 264 times
(out of undoubtedly many more, uncounted, times).

I do see cases where the RA ends up moving sources of
the max from GPR to XMM when the destination is stored
to memory and used in other SSE ops, even though
it could have used XMM regs for the sources as well:

        movl    -208(%rbp), %r8d
        addl    (%r9,%rax), %r8d
        vmovd   %r8d, %xmm2
        movq    -120(%rbp), %r8
        # MAX WITH SSE
        vpmaxsd %xmm4, %xmm2, %xmm2

Amending the *add<mode>_1 pattern was of course the trickiest part,
mostly because the GPR case has memory alternatives while
the SSE part does not (since we have to use a whole-vector
add we can't use a memory operand, which would be wider
than SImode - AVX512 might come to the rescue with
using {splat} from scalar/immediate or masking,
but that might come at a runtime cost as well).  Allowing
memory and splitting after reload, adding a match_scratch,
might work as well.  But I'm not sure whether that wouldn't
make using SSE regs too obvious a choice if it's not all in the
same alternative.  While the above code isn't too bad
on Core, both Bulldozer and Zen take a big hit.
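
To illustrate the missing memory alternative (a sketch; the assembly
is hand-written and approximate):

--cut here--
/* As a GPR add this can use the memory operand directly:
     movl %esi, %eax
     addl (%rdi), %eax
   In the SSE alternative the SImode value has to be loaded with a
   separate movd first, since paddd with a memory operand would read
   a full 16 bytes:
     vmovd %esi, %xmm0
     vmovd (%rdi), %xmm1
     vpaddd %xmm1, %xmm0, %xmm0  */
int add_mem (int *p, int x) { return x + *p; }
--cut here--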

Another case from 400.perlbench:

        vmovd   .LC45(%rip), %xmm7
        vmovd   %ebp, %xmm5
        # MAX WITH SSE
        vpmaxsd %xmm7, %xmm5, %xmm4
        vmovd   %xmm4, %ecx

eh?  I can't see why the RA would ever choose the second
alternative.  It looks like it prefers SSE_REGS for the
operand set from a constant.  A testcase like

int foo (int a)
{
  return a > 5 ? a : 5;
}

produces the above with -mavx2; possibly IRA thinks
the missing matching constraint for the 2nd alternative
makes it win?  The dumps aren't too verbose here, just
showing the costs, not how we arrive at them.

Generally using SSE for scalar integer ops shouldn't be
bad; especially in loops it might free GPRs for induction variables.
The cons are larger instruction encodings, inefficient or missing
handling of immediates, and no memory operands.
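
The immediate handling issue can be seen in a case like (a sketch;
the assembly is approximate):

--cut here--
/* As a GPR op this is a single addl $5, %eax, while in SSE regs the
   constant has to come from the constant pool first, e.g.
     vmovd .LC0(%rip), %xmm1
     vpaddd %xmm1, %xmm0, %xmm0  */
int add5 (int a) { return a + 5; }
--cut here--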

Of course in the end it's just that for some unknown
reason cmp + cmov is so much slower than pmaxsd
(OK, it's a lot fewer uops, but...) and that pmaxsd
is quite a bit faster than the variant with a
(very well predicted) branch.

Richard.

> Thanks,
> Richard.
> 
> 2019-07-23  Richard Biener  <rguenther@suse.de>
> 
> 	PR target/91154
> 	* config/i386/i386.md (smaxsi3): New.
> 	(*add<mode>_1): Add SSE and AVX variants.
> 	* config/i386/i386.c (ix86_lea_for_add_ok): Do not allow
> 	SSE registers.
> 
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md	(revision 273732)
> +++ gcc/config/i386/i386.md	(working copy)
> @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>"
>    ""
>    "ix86_expand_move (<MODE>mode, operands); DONE;")
>  
> +(define_insn "smaxsi3"
> + [(set (match_operand:SI 0 "register_operand" "=r,v,x")
> +       (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0")
> +                (match_operand:SI 2 "register_operand" "r,v,x")))
> +  (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_SSE4_1"
> +{
> +  switch (get_attr_type (insn))
> +    {
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
> +      else
> +        return "pmaxsd\t{%2, %0|%0, %2}";
> +    case TYPE_ICMOV:
> +      /* ???  Instead split this after reload?  */
> +      return "cmpl\t{%2, %0|%0, %2}\n"
> +           "\tcmovl\t{%2, %0|%0, %2}";
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +  [(set_attr "isa" "noavx,avx,noavx")
> +   (set_attr "prefix" "orig,vex,orig")
> +   (set_attr "memory" "none")
> +   (set_attr "type" "icmov,sseadd,sseadd")])
> +
>  (define_insn "*mov<mode>_xor"
>    [(set (match_operand:SWI48 0 "register_operand" "=r")
>  	(match_operand:SWI48 1 "const0_operand"))
> @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl
>  })
>  
>  (define_insn "*add<mode>_1"
> -  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
> +  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r")
>  	(plus:SWI48
> -	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
> -	  (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
> +	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r")
> +	  (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le")))
>     (clobber (reg:CC FLAGS_REG))]
>    "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
>  {
> @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1"
>            return "dec{<imodesuffix>}\t%0";
>  	}
>  
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        {
> +          if (<MODE>mode == SImode)
> +	    return "%vpaddd\t{%2, %1, %0|%0, %1, %2}";
> +	  else
> +	    return "%vpaddq\t{%2, %1, %0|%0, %1, %2}";
> +	}
> +      else if (<MODE>mode == SImode)
> +	return "paddd\t{%2, %0|%0, %2}";
> +      else
> +	return "paddq\t{%2, %0|%0, %2}";
> +
>      default:
>        /* For most processors, ADD is faster than LEA.  This alternative
>  	 was added to use ADD as much as possible.  */
> -      if (which_alternative == 2)
> +      if (which_alternative == 4)
>          std::swap (operands[1], operands[2]);
>          
>        gcc_assert (rtx_equal_p (operands[0], operands[1]));
> @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1"
>        return "add{<imodesuffix>}\t{%2, %0|%0, %2}";
>      }
>  }
> -  [(set (attr "type")
> -     (cond [(eq_attr "alternative" "3")
> +  [(set_attr "isa" "*,avx,noavx,*,*,*")
> +   (set (attr "type")
> +     (cond [(eq_attr "alternative" "5")
>                (const_string "lea")
> +	    (eq_attr "alternative" "1")
> +	      (const_string "sseadd")
> +	    (eq_attr "alternative" "2")
> +	      (const_string "sseadd")
>  	    (match_operand:SWI48 2 "incdec_operand")
>  	      (const_string "incdec")
>  	   ]
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c	(revision 273732)
> +++ gcc/config/i386/i386.c	(working copy)
> @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx
>    unsigned int regno1 = true_regnum (operands[1]);
>    unsigned int regno2 = true_regnum (operands[2]);
>  
> +  if (SSE_REGNO_P (regno1))
> +    return false;
> +
>    /* If a = b + c, (a!=b && a!=c), must use lea form. */
>    if (regno0 != regno1 && regno0 != regno2)
>      return true;
>
Richard Biener July 24, 2019, 11:11 a.m. UTC | #2
On Wed, 24 Jul 2019, Richard Biener wrote:

> On Tue, 23 Jul 2019, Richard Biener wrote:
> 
> > 
> > The following fixes the runtime regression of 456.hmmer caused
> > by matching ICC in code generation and using cmov more aggressively
> > (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> > by manual assembler editing) using the SSE unit for performing
> > SImode loads, adds and then two signed max operations plus stores
> > is quite a bit faster than cmovs - even faster than the original
> > single cmov plus branchy second max.  Even more so for AMD CPUs
> > than Intel CPUs.
> > 
> > Instead of hacking up some pattern recognition pass to transform
> > integer mode memory-to-memory computation chains involving
> > conditional moves to "vector" code (similar to what STV does
> > for TImode ops on x86_64) the following simply allows SImode
> > into SSE registers (support for this is already there in some
> > important places like move patterns!).  For the particular
> > case of 456.hmmer the required support is loads/stores
> > (already implemented), SImode adds and SImode smax.
> > 
> > So the patch adds a smax pattern for SImode (we don't have any
> > for scalar modes but currently expand via a conditional move sequence)
> > emitting as SSE vector max or cmp/cmov depending on the alternative.
> > 
> > And it amends the *add<mode>_1 pattern with SSE alternatives
> > (which have to come before the memory alternative as IRA otherwise
> > doesn't consider reloading a memory operand to a register).
> > 
> > With this in place the runtime of 456.hmmer improves by 10%
> > on Haswell, which is back to the pre-regression speed but not
> > to the level seen when manually editing just the single
> > important loop.
> > 
> > I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> > interesting is probably Zen where moves crossing the
> > integer - vector domain are excessively expensive (they get
> > done via the stack).
> > 
> > Clearly this approach will run into register allocation issues
> > but it looks cleaner than writing yet another STV-like pass
> > (STV itself is quite awkwardly structured so I refrain from
> > touching it...).
> > 
> > Anyway - comments?  It seems to me that MMX-in-SSE does
> > something very similar.
> > 
> > Bootstrapped on x86_64-unknown-linux-gnu, previous testing
> > revealed some issue.  Forgot that *add<mode>_1 also handles
> > DImode..., fixed below, re-testing in progress.
> 
> Bootstrapped/tested on x86_64-unknown-linux-gnu.  A 3-run of
> SPEC CPU 2006 on a Haswell machine completed and results
> are in the noise besides the 456.hmmer improvement:
> 
> 456.hmmer        9330        184       50.7 S    9330        162        57.4 S
> 456.hmmer        9330        182       51.2 *    9330        162        57.7 *
> 456.hmmer        9330        182       51.2 S    9330        162        57.7 S
> 
> The peak binaries (patched) are all slightly bigger; the
> smaxsi3 pattern triggers 6840 times, every time using SSE
> registers and never expanding to the cmov variant.  The
> *add<mode>_1 pattern ends up using SSE regs 264 times
> (out of undoubtedly many more, uncounted, times).
> 
> I do see cases where the RA ends up moving sources of
> the max from GPR to XMM when the destination is stored
> to memory and used in other SSE ops, even though
> it could have used XMM regs for the sources as well:
> 
>         movl    -208(%rbp), %r8d
>         addl    (%r9,%rax), %r8d
>         vmovd   %r8d, %xmm2
>         movq    -120(%rbp), %r8
>         # MAX WITH SSE
>         vpmaxsd %xmm4, %xmm2, %xmm2
> 
> Amending the *add<mode>_1 pattern was of course the trickiest part,
> mostly because the GPR case has memory alternatives while
> the SSE part does not (since we have to use a whole-vector
> add we can't use a memory operand, which would be wider
> than SImode - AVX512 might come to the rescue with
> using {splat} from scalar/immediate or masking,
> but that might come at a runtime cost as well).  Allowing
> memory and splitting after reload, adding a match_scratch,
> might work as well.  But I'm not sure whether that wouldn't
> make using SSE regs too obvious a choice if it's not all in the
> same alternative.  While the above code isn't too bad
> on Core, both Bulldozer and Zen take a big hit.
> 
> Another case from 400.perlbench:
> 
>         vmovd   .LC45(%rip), %xmm7
>         vmovd   %ebp, %xmm5
>         # MAX WITH SSE
>         vpmaxsd %xmm7, %xmm5, %xmm4
>         vmovd   %xmm4, %ecx
> 
> eh?  I can't see why the RA would ever choose the second
> alternative.  It looks like it prefers SSE_REGS for the
> operand set from a constant.  A testcase like
> 
> int foo (int a)
> {
>   return a > 5 ? a : 5;
> }
> 
> produces the above with -mavx2; possibly IRA thinks
> the missing matching constraint for the 2nd alternative
> makes it win?  The dumps aren't too verbose here, just
> showing the costs, not how we arrive at them.

Eh, this is due to my use of the "isa" attribute for smaxsi3, which
makes it enable this alternative only for -mavx.  Removing that,
we fail to consider SSE regs for the original and this testcase :/

Oh well.

RA needs some more pixie dust it seems ...

Richard.
Jeff Law July 24, 2019, 3:03 p.m. UTC | #3
On 7/23/19 8:00 AM, Richard Biener wrote:
> 
> The following fixes the runtime regression of 456.hmmer caused
> by matching ICC in code generation and using cmov more aggressively
> (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> by manual assembler editing) using the SSE unit for performing
> SImode loads, adds and then two signed max operations plus stores
> is quite a bit faster than cmovs - even faster than the original
> single cmov plus branchy second max.  Even more so for AMD CPUs
> than Intel CPUs.
> 
> Instead of hacking up some pattern recognition pass to transform
> integer mode memory-to-memory computation chains involving
> conditional moves to "vector" code (similar to what STV does
> for TImode ops on x86_64) the following simply allows SImode
> into SSE registers (support for this is already there in some
> important places like move patterns!).  For the particular
> case of 456.hmmer the required support is loads/stores
> (already implemented), SImode adds and SImode smax.
> 
> So the patch adds a smax pattern for SImode (we don't have any
> for scalar modes but currently expand via a conditional move sequence)
> emitting as SSE vector max or cmp/cmov depending on the alternative.
> 
> And it amends the *add<mode>_1 pattern with SSE alternatives
> (which have to come before the memory alternative as IRA otherwise
> doesn't consider reloading a memory operand to a register).
> 
> With this in place the runtime of 456.hmmer improves by 10%
> on Haswell, which is back to the pre-regression speed but not
> to the level seen when manually editing just the single
> important loop.
> 
> I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> interesting is probably Zen where moves crossing the
> integer - vector domain are excessively expensive (they get
> done via the stack).
> 
> Clearly this approach will run into register allocation issues
> but it looks cleaner than writing yet another STV-like pass
> (STV itself is quite awkwardly structured so I refrain from
> touching it...).
> 
> Anyway - comments?  It seems to me that MMX-in-SSE does
> something very similar.
> 
> Bootstrapped on x86_64-unknown-linux-gnu, previous testing
> revealed some issue.  Forgot that *add<mode>_1 also handles
> DImode..., fixed below, re-testing in progress.
Certainly simpler than most of the options and seems effective.

FWIW, I think all the STV code is still disabled and has been for
several releases.  One could make an argument it should get dropped.  If
someone wants to make something like STV work, they can try again and
hopefully learn from the problems with the first implementation.
jeff
Martin Jambor July 25, 2019, 9:13 a.m. UTC | #4
Hello,

On Tue, Jul 23 2019, Richard Biener wrote:
> The following fixes the runtime regression of 456.hmmer caused
> by matching ICC in code generation and using cmov more aggressively
> (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> by manual assembler editing) using the SSE unit for performing
> SImode loads, adds and then two signed max operations plus stores
> is quite a bit faster than cmovs - even faster than the original
> single cmov plus branchy second max.  Even more so for AMD CPUs
> than Intel CPUs.
>
> Instead of hacking up some pattern recognition pass to transform
> integer mode memory-to-memory computation chains involving
> conditional moves to "vector" code (similar to what STV does
> for TImode ops on x86_64) the following simply allows SImode
> into SSE registers (support for this is already there in some
> important places like move patterns!).  For the particular
> case of 456.hmmer the required support is loads/stores
> (already implemented), SImode adds and SImode smax.
>
> So the patch adds a smax pattern for SImode (we don't have any
> for scalar modes but currently expand via a conditional move sequence)
> emitting as SSE vector max or cmp/cmov depending on the alternative.
>
> And it amends the *add<mode>_1 pattern with SSE alternatives
> (which have to come before the memory alternative as IRA otherwise
> doesn't consider reloading a memory operand to a register).
>
> With this in place the runtime of 456.hmmer improves by 10%
> on Haswell, which is back to the pre-regression speed but not
> to the level seen when manually editing just the single
> important loop.
>
> I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> interesting is probably Zen where moves crossing the
> integer - vector domain are excessively expensive (they get
> done via the stack).

There was a znver2 CPU machine not doing anything useful overnight here
so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top
of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning
patches, I am afraid).  All benchmarks were run only once with options
-Ofast -march=native -mtune=native.

By far the biggest change was indeed 456.hmmer, which improved by
an incredible 35%.  There was no other change bigger than +- 1.5% in SPEC
2006, so the SPECint score grew by almost 3.4%.

I understand this patch fixes a regression in that benchmark but even
so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9
and with the patch is 20% faster than gcc 9.

In SPEC 2017, there were two changes worth mentioning although they
probably need to be confirmed and re-measured on top of the new tuning
changes.  525.x264_r regressed by 3.37% and 511.povray_r improved by
3.04%.

Martin


>
> Clearly this approach will run into register allocation issues
> but it looks cleaner than writing yet another STV-like pass
> (STV itself is quite awkwardly structured so I refrain from
> touching it...).
>
> Anyway - comments?  It seems to me that MMX-in-SSE does
> something very similar.
>
> Bootstrapped on x86_64-unknown-linux-gnu, previous testing
> revealed some issue.  Forgot that *add<mode>_1 also handles
> DImode..., fixed below, re-testing in progress.
>
> Thanks,
> Richard.
>
> 2019-07-23  Richard Biener  <rguenther@suse.de>
>
> 	PR target/91154
> 	* config/i386/i386.md (smaxsi3): New.
> 	(*add<mode>_1): Add SSE and AVX variants.
> 	* config/i386/i386.c (ix86_lea_for_add_ok): Do not allow
> 	SSE registers.
>
Richard Biener July 25, 2019, 12:21 p.m. UTC | #5
On Thu, 25 Jul 2019, Martin Jambor wrote:

> Hello,
> 
> On Tue, Jul 23 2019, Richard Biener wrote:
> > The following fixes the runtime regression of 456.hmmer caused
> > by matching ICC in code generation and using cmov more aggressively
> > (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> > by manual assembler editing) using the SSE unit for performing
> > SImode loads, adds and then two signed max operations plus stores
> > is quite a bit faster than cmovs - even faster than the original
> > single cmov plus branchy second max.  Even more so for AMD CPUs
> > than Intel CPUs.
> >
> > Instead of hacking up some pattern recognition pass to transform
> > integer mode memory-to-memory computation chains involving
> > conditional moves to "vector" code (similar to what STV does
> > for TImode ops on x86_64) the following simply allows SImode
> > into SSE registers (support for this is already there in some
> > important places like move patterns!).  For the particular
> > case of 456.hmmer the required support is loads/stores
> > (already implemented), SImode adds and SImode smax.
> >
> > So the patch adds a smax pattern for SImode (we don't have any
> > for scalar modes but currently expand via a conditional move sequence)
> > emitting as SSE vector max or cmp/cmov depending on the alternative.
> >
> > And it amends the *add<mode>_1 pattern with SSE alternatives
> > (which have to come before the memory alternative as IRA otherwise
> > doesn't consider reloading a memory operand to a register).
> >
> > With this in place the runtime of 456.hmmer improves by 10%
> > on Haswell, which is back to the pre-regression speed but not
> > to the level seen when manually editing just the single
> > important loop.
> >
> > I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> > interesting is probably Zen where moves crossing the
> > integer - vector domain are excessively expensive (they get
> > done via the stack).
> 
> There was a znver2 CPU machine not doing anything useful overnight here
> so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top
> of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning
> patches, I am afraid).  All benchmarks were run only once with options
> -Ofast -march=native -mtune=native.
> 
> By far the biggest change was indeed 456.hmmer, which improved by
> an incredible 35%.  There was no other change bigger than +- 1.5% in SPEC
> 2006, so the SPECint score grew by almost 3.4%.
> 
> I understand this patch fixes a regression in that benchmark but even
> so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9
> and with the patch is 20% faster than gcc 9.
> 
> In SPEC 2017, there were two changes worth mentioning although they
> probably need to be confirmed and re-measured on top of the new tuning
> changes.  525.x264_r regressed by 3.37% and 511.povray_r improved by
> 3.04%.

Thanks for checking.  Meanwhile I figured out how to restore the
effects of the patch without disabling the GPR alternative in
smaxsi3.  The additional trick I need is to avoid register-class
preferencing from moves, so *movsi_internal gets a few more *s
(in the end we'd need to split the r = g alternative because
for r = C we _do_ want to prefer general regs - unless it's
a special constant that can be loaded into an SSE reg?  Or maybe
that's not needed and reload costs will take care of that).

I've also needed to pessimize the GPR alternative in smaxsi3
because that instruction is supposed to drive the whole decision,
as it is cheaper when done in SSE regs.

Tunings still wreck things: using -march=bdver2 will
give you one vpmaxsd and the rest in integer regs, including
an inter-unit move via the stack.

Still this is the best I can get to with my limited .md / LRA
skills.

Is avoiding register-class preferencing from moves good?  I think
it makes sense at least.

How would one write smaxsi3 as a splitter to be split after
reload in case LRA assigned the GPR alternative?  Is it
even worth doing?  Even the SSE reg alternative can be split
to remove the unneeded CC clobber.

Finally I'm unsure about the add where I needed to place
the SSE alternative before the 2nd op memory one since it
otherwise gets the same cost and wins.

So - how to go forward with this?

Thanks,
Richard.

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 273792)
+++ gcc/config/i386/i386.md	(working copy)
@@ -1881,6 +1881,33 @@ (define_expand "mov<mode>"
   ""
   "ix86_expand_move (<MODE>mode, operands); DONE;")
 
+(define_insn "smaxsi3"
+ [(set (match_operand:SI 0 "register_operand" "=?r,v,x")
+       (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0")
+                (match_operand:SI 2 "register_operand" "r,v,x")))
+  (clobber (reg:CC FLAGS_REG))]
+  ""
+{
+  switch (get_attr_type (insn))
+    {
+    case TYPE_SSEADD:
+      if (which_alternative == 1)
+        return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
+      else
+        return "pmaxsd\t{%2, %0|%0, %2}";
+    case TYPE_ICMOV:
+      /* ???  Instead split this after reload?  */
+      return "cmpl\t{%2, %0|%0, %2}\n"
+           "\tcmovl\t{%2, %0|%0, %2}";
+    default:
+      gcc_unreachable ();
+    }
+}
+  [(set_attr "isa" "*,avx,sse4_noavx")
+   (set_attr "prefix" "orig,vex,orig")
+   (set_attr "memory" "none")
+   (set_attr "type" "icmov,sseadd,sseadd")])
+
 (define_insn "*mov<mode>_xor"
   [(set (match_operand:SWI48 0 "register_operand" "=r")
 	(match_operand:SWI48 1 "const0_operand"))
@@ -2342,9 +2369,9 @@ (define_peephole2
 
 (define_insn "*movsi_internal"
   [(set (match_operand:SI 0 "nonimmediate_operand"
-    "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
+    "=*r,m  ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
 	(match_operand:SI 1 "general_operand"
-    "g ,re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,*v,*v,r  ,*r,*km,*k ,CBC"))]
+    "g  ,*re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,*v,*v,r  ,*r,*km,*k ,CBC"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))"
 {
   switch (get_attr_type (insn))
@@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl
 })
 
 (define_insn "*add<mode>_1"
-  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
+  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r")
 	(plus:SWI48
-	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
-	  (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
+	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r")
+	  (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le")))
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
 {
@@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1"
           return "dec{<imodesuffix>}\t%0";
 	}
 
+    case TYPE_SSEADD:
+      if (which_alternative == 1)
+        {
+          if (<MODE>mode == SImode)
+	    return "%vpaddd\t{%2, %1, %0|%0, %1, %2}";
+	  else
+	    return "%vpaddq\t{%2, %1, %0|%0, %1, %2}";
+	}
+      else if (<MODE>mode == SImode)
+	return "paddd\t{%2, %0|%0, %2}";
+      else
+	return "paddq\t{%2, %0|%0, %2}";
+
     default:
       /* For most processors, ADD is faster than LEA.  This alternative
 	 was added to use ADD as much as possible.  */
-      if (which_alternative == 2)
+      if (which_alternative == 4)
         std::swap (operands[1], operands[2]);
         
       gcc_assert (rtx_equal_p (operands[0], operands[1]));
@@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1"
       return "add{<imodesuffix>}\t{%2, %0|%0, %2}";
     }
 }
-  [(set (attr "type")
-     (cond [(eq_attr "alternative" "3")
+  [(set_attr "isa" "*,avx,sse2,*,*,*")
+   (set (attr "type")
+     (cond [(eq_attr "alternative" "5")
               (const_string "lea")
+	    (eq_attr "alternative" "1")
+	      (const_string "sseadd")
+	    (eq_attr "alternative" "2")
+	      (const_string "sseadd")
 	    (match_operand:SWI48 2 "incdec_operand")
 	      (const_string "incdec")
 	   ]
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 273792)
+++ gcc/config/i386/i386.c	(working copy)
@@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx
   unsigned int regno1 = true_regnum (operands[1]);
   unsigned int regno2 = true_regnum (operands[2]);
 
+  if (SSE_REGNO_P (regno1))
+    return false;
+
   /* If a = b + c, (a!=b && a!=c), must use lea form. */
   if (regno0 != regno1 && regno0 != regno2)
     return true;
Uros Bizjak July 27, 2019, 9:22 a.m. UTC | #6
On Wed, Jul 24, 2019 at 5:03 PM Jeff Law <law@redhat.com> wrote:

> > Clearly this approach will run into register allocation issues
> > but it looks cleaner than writing yet another STV-like pass
> > (STV itself is quite awkwardly structured so I refrain from
> > touching it...).
> >
> > Anyway - comments?  It seems to me that MMX-in-SSE does
> > something very similar.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, previous testing
> > revealed some issue.  Forgot that *add<mode>_1 also handles
> > DImode..., fixed below, re-testing in progress.
> Certainly simpler than most of the options and seems effective.
>
> FWIW, I think all the STV code is still disabled and has been for
> several releases.  One could make an argument it should get dropped.  If
> someone wants to make something like STV work, they can try again and
> hopefully learn from the problems with the first implementation.

Huh?

STV code is *enabled by default* on 32bit SSE2 targets, and works
surprisingly well (*) for DImode arithmetic, logic and constant shift
operations. Even 32bit multilib on x86_64 is built with STV.

I am indeed surprised that the perception of the developers is that
STV doesn't work. Maybe I'm missing something obvious here?

(*) The infrastructure includes:
  - cost analysis of the whole STV chain, including moves from integer
registers, loading and storing DImode values
  - preloading of arguments into vector registers to avoid duplicate
int-vec moves
  - different strategies to move arguments between int and vector
registers (e.g. respects TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC flag)

Uros.
Uros Bizjak July 27, 2019, 10:07 a.m. UTC | #7
On Thu, Jul 25, 2019 at 2:21 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Thu, 25 Jul 2019, Martin Jambor wrote:
>
> > Hello,
> >
> > On Tue, Jul 23 2019, Richard Biener wrote:
> > > The following fixes the runtime regression of 456.hmmer caused
> > > by matching ICC in code generation and using cmov more aggressively
> > > (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> > > by manual assembler editing) using the SSE unit for performing
> > > SImode loads, adds and then two signed max operations plus stores
> > > is quite a bit faster than cmovs - even faster than the original
> > > single cmov plus branchy second max.  Even more so for AMD CPUs
> > > than Intel CPUs.
> > >
> > > Instead of hacking up some pattern recognition pass to transform
> > > integer mode memory-to-memory computation chains involving
> > > conditional moves to "vector" code (similar to what STV does
> > > for TImode ops on x86_64) the following simply allows SImode
> > > into SSE registers (support for this is already there in some
> > > important places like move patterns!).  For the particular
> > > case of 456.hmmer the required support is loads/stores
> > > (already implemented), SImode adds and SImode smax.
> > >
> > > So the patch adds a smax pattern for SImode (we don't have any
> > > for scalar modes but currently expand via a conditional move sequence)
> > > emitting as SSE vector max or cmp/cmov depending on the alternative.
> > >
> > > And it amends the *add<mode>_1 pattern with SSE alternatives
> > > (which have to come before the memory alternative as IRA otherwise
> > > doesn't consider reloading a memory operand to a register).
> > >
> > > With this in place the runtime of 456.hmmer improves by 10%
> > > on Haswell, which is back to the pre-regression speed but not
> > > to the level seen when manually editing just the single
> > > important loop.
> > >
> > > I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> > > interesting is probably Zen where moves crossing the
> > > integer - vector domain are excessively expensive (they get
> > > done via the stack).
> >
> > There was a znver2 CPU machine not doing anything useful overnight here
> > so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top
> > of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning
> > patches, I am afraid).  All benchmarks were run only once with options
> > -Ofast -march=native -mtune=native.
> >
> > By far the biggest change was indeed 456.hmmer, which improved by
> > an incredible 35%.  There was no other change bigger than +- 1.5% in SPEC
> > 2006, so the SPECint score grew by almost 3.4%.
> >
> > I understand this patch fixes a regression in that benchmark but even
> > so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9
> > and with the patch is 20% faster than gcc 9.
> >
> > In SPEC 2017, there were two changes worth mentioning although they
> > probably need to be confirmed and re-measured on top of the new tuning
> > changes.  525.x264_r regressed by 3.37% and 511.povray_r improved by
> > 3.04%.
>
> Thanks for checking.  Meanwhile I figured out how to restore the
> effects of the patch without disabling the GPR alternative in
> smaxsi3.  The additional trick I need is to avoid register-class
> preferencing from moves, so *movsi_internal gets a few more *s
> (in the end we'd need to split the r = g alternative because
> for r = C we _do_ want to prefer general regs - unless it's
> a special constant that can be loaded into an SSE reg?  Or maybe
> that's not needed and reload costs will take care of that).
>
> I've also needed to pessimize the GPR alternative in smaxsi3
> because that instruction is supposed to drive the whole decision,
> as it is cheaper when done in SSE regs.
>
> Tunings still wreck things: using -march=bdver2 will
> give you one vpmaxsd and the rest in integer regs, including
> an inter-unit move via the stack.
>
> Still this is the best I can get to with my limited .md / LRA
> skills.
>
> Is avoiding register-class preferencing from moves good?  I think
> it makes sense at least.
>
> How would one write smaxsi3 as a splitter to be split after
> reload in case LRA assigned the GPR alternative?  Is it
> even worth doing?  Even the SSE reg alternative can be split
> to remove the unneeded CC clobber.
>
> Finally I'm unsure about the add where I needed to place
> the SSE alternative before the 2nd op memory one since it
> otherwise gets the same cost and wins.
>
> So - how to go forward with this?

Sorry to come a bit late to the discussion.

We have been aware of the CMOV issue for quite some time, but the issue
is not understood in detail yet (I was hoping for Intel people to look
at this). However, you demonstrated that using PMAX and PMIN instead of
scalar CMOV can bring us big gains, and this thread now deals with how
to best implement PMAX/PMIN for scalar code.

I think that the way to go forward is with the STV infrastructure.
Currently, the implementation only deals with DImode on SSE2 32bit
targets, but I see no issues with using the STV pass also for SImode (on
32bit and 64bit targets). There are actually two STV passes: the first
one (currently run on 64bit targets) is run before cse2, and the
second (which currently runs on 32bit SSE2 only) is run after combine
and before the split1 pass. The second pass is interesting to us.

The base idea of the second STV pass (for 32bit targets!) is that we
introduce DImode _doubleword instructions that otherwise do not exist
with integer registers. Now, the passes up to and including the combine
pass can use these instructions to simplify and optimize the insn
flow. Later, based on cost analysis, the STV pass either converts the
_doubleword instructions to real vector ones (e.g. V2DImode
patterns) or leaves them intact, and a follow-up split pass splits
them into scalar SImode instruction pairs. STV pass also takes care to
move and preload values from their scalar form to a vector
representation (using SUBREGs). Please note that all this happens on
pseudos, and register allocator will later simply use scalar (integer)
registers in scalar patterns and vector registers with vector insn
patterns.
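
For example, doubleword logic like the following is a typical
candidate for the existing pass (a sketch, assuming -O2 -m32 -msse2
-mstv):

--cut here--
long long f (long long a, long long b)
{
  /* DImode AND: becomes either a single pand or a pair of andl's,
     depending on the chain's cost analysis.  */
  return a & b;
}
--cut here--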

Your approach of amending existing scalar SImode patterns with vector
registers will introduce no end of problems. The register allocator will
do funny things under register pressure, where values will take a
trip to a vector register before being stored to memory (and vice
versa, you already found some of them). The current RA simply can't
distinguish clearly between the two register sets.

So, my advice would be to use the STV pass also for SImode values, on
64bit and 32bit targets. On both targets, we will be able to use
instructions that operate on the vector register set, and for 32bit
targets (and to some extent on 64bit targets), we would perhaps be
able to relax register pressure in a kind of controlled way.

So, to demonstrate the benefits of the existing STV pass, it should be
relatively easy to introduce a 64bit max/min pattern on 32bit targets to
handle 64bit values. For 32bit values, the pass should be re-run to
convert SImode scalar operations to vector operations in a controlled
way, based on various cost functions.

Uros.

> Thanks,
> Richard.
>
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md     (revision 273792)
> +++ gcc/config/i386/i386.md     (working copy)
> @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>"
>    ""
>    "ix86_expand_move (<MODE>mode, operands); DONE;")
>
> +(define_insn "smaxsi3"
> + [(set (match_operand:SI 0 "register_operand" "=?r,v,x")
> +       (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0")
> +                (match_operand:SI 2 "register_operand" "r,v,x")))
> +  (clobber (reg:CC FLAGS_REG))]
> +  ""
> +{
> +  switch (get_attr_type (insn))
> +    {
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
> +      else
> +        return "pmaxsd\t{%2, %0|%0, %2}";
> +    case TYPE_ICMOV:
> +      /* ???  Instead split this after reload?  */
> +      return "cmpl\t{%2, %0|%0, %2}\n"
> +           "\tcmovl\t{%2, %0|%0, %2}";
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +  [(set_attr "isa" "*,avx,sse4_noavx")
> +   (set_attr "prefix" "orig,vex,orig")
> +   (set_attr "memory" "none")
> +   (set_attr "type" "icmov,sseadd,sseadd")])
> +
>  (define_insn "*mov<mode>_xor"
>    [(set (match_operand:SWI48 0 "register_operand" "=r")
>         (match_operand:SWI48 1 "const0_operand"))
> @@ -2342,9 +2369,9 @@ (define_peephole2
>
>  (define_insn "*movsi_internal"
>    [(set (match_operand:SI 0 "nonimmediate_operand"
> -    "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
> +    "=*r,m  ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
>         (match_operand:SI 1 "general_operand"
> -    "g ,re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,*v,*v,r  ,*r,*km,*k ,CBC"))]
> +    "g  ,*re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,*v,*v,r  ,*r,*km,*k ,CBC"))]
>    "!(MEM_P (operands[0]) && MEM_P (operands[1]))"
>  {
>    switch (get_attr_type (insn))
> @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl
>  })
>
>  (define_insn "*add<mode>_1"
> -  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
> +  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r")
>         (plus:SWI48
> -         (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
> -         (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
> +         (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r")
> +         (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le")))
>     (clobber (reg:CC FLAGS_REG))]
>    "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
>  {
> @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1"
>            return "dec{<imodesuffix>}\t%0";
>         }
>
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        {
> +          if (<MODE>mode == SImode)
> +           return "%vpaddd\t{%2, %1, %0|%0, %1, %2}";
> +         else
> +           return "%vpaddq\t{%2, %1, %0|%0, %1, %2}";
> +       }
> +      else if (<MODE>mode == SImode)
> +       return "paddd\t{%2, %0|%0, %2}";
> +      else
> +       return "paddq\t{%2, %0|%0, %2}";
> +
>      default:
>        /* For most processors, ADD is faster than LEA.  This alternative
>          was added to use ADD as much as possible.  */
> -      if (which_alternative == 2)
> +      if (which_alternative == 4)
>          std::swap (operands[1], operands[2]);
>
>        gcc_assert (rtx_equal_p (operands[0], operands[1]));
> @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1"
>        return "add{<imodesuffix>}\t{%2, %0|%0, %2}";
>      }
>  }
> -  [(set (attr "type")
> -     (cond [(eq_attr "alternative" "3")
> +  [(set_attr "isa" "*,avx,sse2,*,*,*")
> +   (set (attr "type")
> +     (cond [(eq_attr "alternative" "5")
>                (const_string "lea")
> +           (eq_attr "alternative" "1")
> +             (const_string "sseadd")
> +           (eq_attr "alternative" "2")
> +             (const_string "sseadd")
>             (match_operand:SWI48 2 "incdec_operand")
>               (const_string "incdec")
>            ]
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c      (revision 273792)
> +++ gcc/config/i386/i386.c      (working copy)
> @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx
>    unsigned int regno1 = true_regnum (operands[1]);
>    unsigned int regno2 = true_regnum (operands[2]);
>
> +  if (SSE_REGNO_P (regno1))
> +    return false;
> +
>    /* If a = b + c, (a!=b && a!=c), must use lea form. */
>    if (regno0 != regno1 && regno0 != regno2)
>      return true;
Uros Bizjak July 27, 2019, 11:14 a.m. UTC | #8
On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote:

> > How would one write smaxsi3 as a splitter to be split after
> > reload in case LRA assigned the GPR alternative?  Is it
> > even worth doing?  Even the SSE reg alternative can be split
> > to remove the unneeded CC clobber.
> >
> > Finally I'm unsure about the add where I needed to place
> > the SSE alternative before the 2nd op memory one since it
> > otherwise gets the same cost and wins.
> >
> > So - how to go forward with this?
>
> Sorry to come a bit late to the discussion.
>
> We have been aware of the CMOV issue for quite some time, but the issue
> is not understood in detail yet (I was hoping for Intel people to look
> at this). However, you demonstrated that using PMAX and PMIN instead of
> scalar CMOV can bring us big gains, and this thread now deals with how
> to best implement PMAX/PMIN for scalar code.
>
> I think that the way to go forward is with the STV infrastructure.
> Currently, the implementation only deals with DImode on SSE2 32bit
> targets, but I see no issues with using the STV pass also for SImode (on
> 32bit and 64bit targets). There are actually two STV passes: the first
> one (currently run on 64bit targets) is run before cse2, and the
> second (which currently runs on 32bit SSE2 only) is run after combine
> and before the split1 pass. The second pass is interesting to us.
>
> The base idea of the second STV pass (for 32bit targets!) is that we
> introduce DImode _doubleword instructions that otherwise do not exist
> with integer registers. Now, the passes up to and including the combine
> pass can use these instructions to simplify and optimize the insn
> flow. Later, based on cost analysis, the STV pass either converts the
> _doubleword instructions to real vector ones (e.g. V2DImode
> patterns) or leaves them intact, and a follow-up split pass splits
> them into scalar SImode instruction pairs. STV pass also takes care to
> move and preload values from their scalar form to a vector
> representation (using SUBREGs). Please note that all this happens on
> pseudos, and register allocator will later simply use scalar (integer)
> registers in scalar patterns and vector registers with vector insn
> patterns.
>
> Your approach of amending existing scalar SImode patterns with vector
> registers will introduce no end of problems. The register allocator will
> do funny things under register pressure, where values will take a
> trip to a vector register before being stored to memory (and vice
> versa, you already found some of them). The current RA simply can't
> distinguish clearly between the two register sets.
>
> So, my advice would be to use the STV pass also for SImode values, on
> 64bit and 32bit targets. On both targets, we will be able to use
> instructions that operate on the vector register set, and for 32bit
> targets (and to some extent on 64bit targets), we would perhaps be
> able to relax register pressure in a kind of controlled way.
>
> So, to demonstrate the benefits of the existing STV pass, it should be
> relatively easy to introduce a 64bit max/min pattern on 32bit targets to
> handle 64bit values. For 32bit values, the pass should be re-run to
> convert SImode scalar operations to vector operations in a controlled
> way, based on various cost functions.

Please find attached patch to see STV in action. The compilation will
crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
dump, you will be able to see chain building, cost calculation and
conversion insertion.

The testcase:

--cut here--
long long test (long long a, long long b)
{
  return (a > b) ? a : b;
}
--cut here--

gcc -O2 -m32 -msse2 (-mstv):

_.268r.stv2 dump:

Searching for mode conversion candidates...
  insn 2 is marked as a candidate
  insn 3 is marked as a candidate
  insn 7 is marked as a candidate
Created a new instruction chain #1
Building chain #1...
  Adding insn 2 to chain #1
  Adding insn 7 into chain's #1 queue
  Adding insn 7 to chain #1
  r85 use in insn 12 isn't convertible
  Mark r85 def in insn 7 as requiring both modes in chain #1
  Adding insn 3 into chain's #1 queue
  Adding insn 3 to chain #1
Collected chain #1...
  insns: 2, 3, 7
  defs to convert: r85
Computing gain for chain #1...
  Instruction conversion gain: 24
  Registers conversion cost: 6
  Total gain: 18
Converting chain #1...

...

(insn 2 5 3 2 (set (reg/v:DI 83 [ a ])
        (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66
{*movdi_internal}
     (nil))
(insn 3 2 4 2 (set (reg/v:DI 84 [ b ])
        (mem/c:DI (plus:SI (reg/f:SI 16 argp)
                (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66
{*movdi_internal}
     (nil))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0)
        (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0)
            (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1
     (expr_list:REG_DEAD (reg/v:DI 84 [ b ])
        (expr_list:REG_DEAD (reg/v:DI 83 [ a ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 15 7 16 2 (set (reg:V2DI 87)
        (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1
     (nil))
(insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0)
        (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
     (nil))
(insn 17 16 18 2 (set (reg:V2DI 87)
        (lshiftrt:V2DI (reg:V2DI 87)
            (const_int 32 [0x20]))) "max.c":3:22 -1
     (nil))
(insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4)
        (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
     (nil))
(insn 12 18 13 2 (set (reg/i:DI 0 ax)
        (reg:DI 86)) "max.c":4:1 66 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 86)
        (nil)))
(insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1
     (nil))

Uros.
Index: i386-features.c
===================================================================
--- i386-features.c	(revision 273844)
+++ i386-features.c	(working copy)
@@ -531,6 +531,9 @@
 	  if (CONST_INT_P (XEXP (src, 1)))
 	    gain -= vector_const_cost (XEXP (src, 1));
 	}
+      else if (GET_CODE (src) == SMAX
+	       || (GET_CODE (src) == SMIN))
+	gain += COSTS_N_INSNS (3);
       else if (GET_CODE (src) == NEG
 	       || GET_CODE (src) == NOT)
 	gain += ix86_cost->add - COSTS_N_INSNS (1);
@@ -907,6 +910,8 @@
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
       PUT_MODE (src, V2DImode);
@@ -1285,6 +1290,8 @@
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
       if (!REG_P (XEXP (src, 1))
 	  && !MEM_P (XEXP (src, 1))
 	  && !CONST_INT_P (XEXP (src, 1)))
Index: i386.md
===================================================================
--- i386.md	(revision 273844)
+++ i386.md	(working copy)
@@ -17489,6 +17489,14 @@
     gcc_unreachable ();
 })
 
+(define_insn "smaxdi3"
+  [(set (match_operand:DI 0 "register_operand")
+        (smax:DI (match_operand:DI 1 "register_operand")
+                 (match_operand:DI 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_64BIT && TARGET_STV && TARGET_SSE2"
+  "#")
+
 (define_expand "mov<mode>cc"
   [(set (match_operand:X87MODEF 0 "register_operand")
 	(if_then_else:X87MODEF
Richard Biener July 31, 2019, 11:21 a.m. UTC | #9
On Sat, 27 Jul 2019, Uros Bizjak wrote:

> On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> > > How would one write smaxsi3 as a splitter to be split after
> > > reload in case LRA assigned the GPR alternative?  Is it
> > > even worth doing?  Even the SSE reg alternative can be split
> > > to remove the unneeded CC clobber.
> > >
> > > Finally I'm unsure about the add where I needed to place
> > > the SSE alternative before the 2nd op memory one since it
> > > otherwise gets the same cost and wins.
> > >
> > > So - how to go forward with this?
> >
> > Sorry to come a bit late to the discussion.
> >
> > We have been aware of the CMOV issue for quite some time, but the issue
> > is not understood in detail yet (I was hoping for Intel people to look
> > at this). However, you demonstrated that using PMAX and PMIN instead of
> > scalar CMOV can bring us big gains, and this thread now deals with how
> > to best implement PMAX/PMIN for scalar code.
> >
> > I think that the way to go forward is with the STV infrastructure.
> > Currently, the implementation only deals with DImode on SSE2 32bit
> > targets, but I see no issues with using the STV pass also for SImode (on
> > 32bit and 64bit targets). There are actually two STV passes: the first
> > one (currently run on 64bit targets) is run before cse2, and the
> > second (which currently runs on 32bit SSE2 only) is run after combine
> > and before the split1 pass. The second pass is interesting to us.
> >
> > The base idea of the second STV pass (for 32bit targets!) is that we
> > introduce DImode _doubleword instructions that otherwise do not exist
> > with integer registers. Now, the passes up to and including the combine
> > pass can use these instructions to simplify and optimize the insn
> > flow. Later, based on cost analysis, the STV pass either converts the
> > _doubleword instructions to real vector ones (e.g. V2DImode
> > patterns) or leaves them intact, and a follow-up split pass splits
> > them into scalar SImode instruction pairs. STV pass also takes care to
> > move and preload values from their scalar form to a vector
> > representation (using SUBREGs). Please note that all this happens on
> > pseudos, and register allocator will later simply use scalar (integer)
> > registers in scalar patterns and vector registers with vector insn
> > patterns.
> >
> > Your approach of amending existing scalar SImode patterns with vector
> > registers will introduce no end of problems. The register allocator will
> > do funny things under register pressure, where values will take a
> > trip to a vector register before being stored to memory (and vice
> > versa, you already found some of them). The current RA simply can't
> > distinguish clearly between the two register sets.
> >
> > So, my advice would be to use the STV pass also for SImode values, on
> > 64bit and 32bit targets. On both targets, we will be able to use
> > instructions that operate on the vector register set, and for 32bit
> > targets (and to some extent on 64bit targets), we would perhaps be
> > able to relax register pressure in a kind of controlled way.
> >
> > So, to demonstrate the benefits of the existing STV pass, it should be
> > relatively easy to introduce a 64bit max/min pattern on 32bit targets to
> > handle 64bit values. For 32bit values, the pass should be re-run to
> > convert SImode scalar operations to vector operations in a controlled
> > way, based on various cost functions.

I've looked at STV before trying to use the RA to solve the issue but
quickly stepped away because of its structure, which seems to be
tied to particular modes, duplicating things for TImode and DImode,
so it looked like I'd have to write up everything again for SImode...

It really should be possible to run the pass once, handling a set
of modes, rather than re-running it for the SImode case I am after.
See also a recent PR about STV slowness and its tendency to hog memory
because it seems to enable every DF problem that is around...

> Please find attached patch to see STV in action. The compilation will
> crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
> dump, you will be able to see chain building, cost calculation and
> conversion insertion.

So you unconditionally add a smaxdi3 pattern - indeed this looks
necessary even when going the STV route.  The actual regression
for the testcase could also be solved by turning the smaxsi3
back into a compare and jump rather than a conditional move sequence.
So I wonder how you'd do that given that there's pass_if_after_reload
after pass_split_after_reload and I'm not sure we can split
as late as pass_split_before_sched2 (there's also a split _after_
sched2 on x86 it seems).

So how would you go implement {s,u}{min,max}{si,di}3 for the
case STV doesn't end up doing any transform?

You could save me some guesswork here if you can come up with
a reasonably complete final set of patterns (ok, I only care
about smaxsi3) so I can have a look at the STV approach again
(you may remember I simply "split" at assembler emission time).

Thanks,
Richard.

> The testcase:
> 
> --cut here--
> long long test (long long a, long long b)
> {
>   return (a > b) ? a : b;
> }
> --cut here--
> 
> gcc -O2 -m32 -msse2 (-mstv):
> 
> _.268r.stv2 dump:
> 
> Searching for mode conversion candidates...
>   insn 2 is marked as a candidate
>   insn 3 is marked as a candidate
>   insn 7 is marked as a candidate
> Created a new instruction chain #1
> Building chain #1...
>   Adding insn 2 to chain #1
>   Adding insn 7 into chain's #1 queue
>   Adding insn 7 to chain #1
>   r85 use in insn 12 isn't convertible
>   Mark r85 def in insn 7 as requiring both modes in chain #1
>   Adding insn 3 into chain's #1 queue
>   Adding insn 3 to chain #1
> Collected chain #1...
>   insns: 2, 3, 7
>   defs to convert: r85
> Computing gain for chain #1...
>   Instruction conversion gain: 24
>   Registers conversion cost: 6
>   Total gain: 18
> Converting chain #1...
> 
> ...
> 
> (insn 2 5 3 2 (set (reg/v:DI 83 [ a ])
>         (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66
> {*movdi_internal}
>      (nil))
> (insn 3 2 4 2 (set (reg/v:DI 84 [ b ])
>         (mem/c:DI (plus:SI (reg/f:SI 16 argp)
>                 (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66
> {*movdi_internal}
>      (nil))
> (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
> (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0)
>         (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0)
>             (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1
>      (expr_list:REG_DEAD (reg/v:DI 84 [ b ])
>         (expr_list:REG_DEAD (reg/v:DI 83 [ a ])
>             (expr_list:REG_UNUSED (reg:CC 17 flags)
>                 (nil)))))
> (insn 15 7 16 2 (set (reg:V2DI 87)
>         (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1
>      (nil))
> (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0)
>         (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
>      (nil))
> (insn 17 16 18 2 (set (reg:V2DI 87)
>         (lshiftrt:V2DI (reg:V2DI 87)
>             (const_int 32 [0x20]))) "max.c":3:22 -1
>      (nil))
> (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4)
>         (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
>      (nil))
> (insn 12 18 13 2 (set (reg/i:DI 0 ax)
>         (reg:DI 86)) "max.c":4:1 66 {*movdi_internal}
>      (expr_list:REG_DEAD (reg:DI 86)
>         (nil)))
> (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1
>      (nil))
> 
> Uros.
>
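For comparison with the DImode testcase above, the SImode situation the
thread is ultimately after (derived from the hmmer loop; the array names
below are made up, but the shape of SImode loads, adds, two signed maxes
and a store follows the earlier description) reduces to roughly:

--cut here--
void
loop (int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int n)
{
  for (int k = 1; k < n; k++)
    {
      int sc = mpp[k - 1] + tpmm[k - 1];   /* SImode load + add */
      int ic = ip[k - 1] + tpim[k - 1];    /* SImode load + add */
      sc = sc > ic ? sc : ic;              /* first signed max */
      sc = sc > mc[k] ? sc : mc[k];        /* second signed max */
      mc[k] = sc;                          /* SImode store */
    }
}
--cut here--

With SImode chains supported this should end up as paddd/pmaxsd (pmaxsd
needs SSE4.1) instead of a cmp/cmov sequence.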
Uros Bizjak Aug. 1, 2019, 8:54 a.m. UTC | #10
On Wed, Jul 31, 2019 at 1:21 PM Richard Biener <rguenther@suse.de> wrote:
>
> I've looked at STV before trying to use RA to solve the issue but
> quickly stepped away because of its structure, which seems to be
> tied to particular modes, duplicating things for TImode and DImode,
> so it looked like I'd have to write everything up again for SImode...

ATM, DImode is used exclusively for x86_32 while TImode is used
exclusively for x86_64. Also, TImode is used for a different purpose
before combine, while DImode is used after combine. I don't remember
the details, but IIRC it made sense for the intended purpose.
>
> It really should be possible to run the pass once, handling a set
> of modes rather than re-running it for the SImode case I am after.
> See also a recent PR about STV slowness and tendency to hog memory
> because it seems to enable every DF problem that is around...

Huh, I was not aware of these implementation details...

> > Please find attached patch to see STV in action. The compilation will
> > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
> > dump, you will be able to see chain building, cost calculation and
> > conversion insertion.
>
> So you unconditionally add a smaxdi3 pattern - indeed this looks
> necessary even when going the STV route.  The actual regression
> for the testcase could also be solved by turning the smaxsi3
> back into a compare and jump rather than a conditional move sequence.
> So I wonder how you'd do that given that there's pass_if_after_reload
> after pass_split_after_reload and I'm not sure we can split
> as late as pass_split_before_sched2 (there's also a split _after_
> sched2 on x86 it seems).
>
> So how would you go implement {s,u}{min,max}{si,di}3 for the
> case STV doesn't end up doing any transform?

If STV doesn't transform the insn, then a pre-reload splitter splits
the insn back to compare+cmove. However, considering the SImode move
from/to int/xmm register is relatively cheap, the cost function should
be tuned so that STV always converts smaxsi3 pattern. (As said before,
the fix for the slowdown with consecutive cmov insns is a side effect
of the transformation to the smax insn that helps in this particular
case; I think that this issue should be fixed in a general way, as
there are already a couple of PRs reported.)

> You could save me some guesswork here if you can come up with
> a reasonably complete final set of patterns (ok, I only care
> about smaxsi3) so I can have a look at the STV approach again
> (you may remember I simply "split" at assembler emission time).

I think that the cost function should always enable smaxsi3
generation. To further optimize the STV chain (to avoid unnecessary
xmm<->int transitions) we could add all integer logic, arithmetic and
constant shifts to the candidates (the ones that DImode STV converts).

Uros.
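A sketch of the kind of chain the last paragraph means (hypothetical
example): if only the max were converted, the neighbouring SImode ops
would force xmm<->int crossings, whereas converting the whole chain keeps
everything in vector registers, since pand, paddd, pmaxsd and psrad all
exist.

--cut here--
void
chain (int *d, int *a, int *b)
{
  int t = (*a & 0x7fffffff) + *b;   /* logic + arithmetic */
  t = t > *b ? t : *b;              /* signed max */
  *d = t >> 2;                      /* constant (arithmetic) shift */
}
--cut here--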

Richard Biener Aug. 1, 2019, 9:28 a.m. UTC | #11
On Thu, 1 Aug 2019, Uros Bizjak wrote:

> On Wed, Jul 31, 2019 at 1:21 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > I've looked at STV before trying to use RA to solve the issue but
> > quickly stepped away because of its structure, which seems to be
> > tied to particular modes, duplicating things for TImode and DImode,
> > so it looked like I'd have to write everything up again for SImode...
> 
> ATM, DImode is used exclusively for x86_32 while TImode is used
> exclusively for x86_64. Also, TImode is used for a different purpose
> before combine, while DImode is used after combine. I don't remember
> the details, but IIRC it made sense for the intended purpose.
> >
> > It really should be possible to run the pass once, handling a set
> > of modes rather than re-running it for the SImode case I am after.
> > See also a recent PR about STV slowness and tendency to hog memory
> > because it seems to enable every DF problem that is around...
> 
> Huh, I was not aware of these implementation details...
> 
> > > Please find attached patch to see STV in action. The compilation will
> > > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
> > > dump, you will be able to see chain building, cost calculation and
> > > conversion insertion.
> >
> > So you unconditionally add a smaxdi3 pattern - indeed this looks
> > necessary even when going the STV route.  The actual regression
> > for the testcase could also be solved by turning the smaxsi3
> > back into a compare and jump rather than a conditional move sequence.
> > So I wonder how you'd do that given that there's pass_if_after_reload
> > after pass_split_after_reload and I'm not sure we can split
> > as late as pass_split_before_sched2 (there's also a split _after_
> > sched2 on x86 it seems).
> >
> > So how would you go implement {s,u}{min,max}{si,di}3 for the
> > case STV doesn't end up doing any transform?
> 
> If STV doesn't transform the insn, then a pre-reload splitter splits
> the insn back to compare+cmove.

OK, that would work.  But then there's no way to force a jumpy
sequence, which we know is faster than compare+cmove, because later RTL
if-conversion passes happily re-discover the smax (or conditional move)
sequence.

> However, considering the SImode move
> from/to int/xmm register is relatively cheap, the cost function should
> be tuned so that STV always converts smaxsi3 pattern.

Note that on both Zen and even more so bdverN the int/xmm transition
makes it not just unprofitable but a _lot_ slower than the cmp/cmov
sequence... (for the loop in hmmer, which is the only one where I see
any effect from any of my patches).  So identifying chains that
start/end in memory is important for cost reasons.

So I think the splitting has to happen after the last if-conversion
pass (and thus we may need to allocate a scratch register for this
purpose?)
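Schematically, the two extremes for the chain cost look like this
(illustrative example, not from the thread):

--cut here--
/* f1: the chain starts and ends in memory, so doing the max in xmm
   registers costs no int<->xmm transition at all.  f2: inputs and
   result live in GPRs, so pmaxsd would need two moves in and one move
   out, a clear loss on CPUs where such moves are expensive.  */
void f1 (int *d, int *a, int *b) { *d = *a > *b ? *a : *b; }
int  f2 (int a, int b)           { return a > b ? a : b; }
--cut here--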

Uros Bizjak Aug. 1, 2019, 9:37 a.m. UTC | #12
On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:

> > > So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > necessary even when going the STV route.  The actual regression
> > > for the testcase could also be solved by turning the smaxsi3
> > > back into a compare and jump rather than a conditional move sequence.
> > > So I wonder how you'd do that given that there's pass_if_after_reload
> > > after pass_split_after_reload and I'm not sure we can split
> > > as late as pass_split_before_sched2 (there's also a split _after_
> > > sched2 on x86 it seems).
> > >
> > > So how would you go implement {s,u}{min,max}{si,di}3 for the
> > > case STV doesn't end up doing any transform?
> >
> > If STV doesn't transform the insn, then a pre-reload splitter splits
> > the insn back to compare+cmove.
>
> OK, that would work.  But then there's no way to force a jumpy
> sequence, which we know is faster than compare+cmove, because later RTL
> if-conversion passes happily re-discover the smax (or conditional move)
> sequence.
>
> > However, considering the SImode move
> > from/to int/xmm register is relatively cheap, the cost function should
> > be tuned so that STV always converts smaxsi3 pattern.
>
> Note that on both Zen and even more so bdverN the int/xmm transition
> makes it not just unprofitable but a _lot_ slower than the cmp/cmov
> sequence... (for the loop in hmmer, which is the only one where I see
> any effect from any of my patches).  So identifying chains that
> start/end in memory is important for cost reasons.

Please note that the cost function also considers the cost of move
from/to xmm. So, the cost of the whole chain would disable the
transformation.

> So I think the splitting has to happen after the last if-conversion
> pass (and thus we may need to allocate a scratch register for this
> purpose?)

I really hope that the underlying issue will be solved by a machine
dependent pass inserted somewhere after the pre-reload split. This
way, we can split unconverted smax to the cmove, and this later pass
would handle jcc and cmove instructions. Until then... yes your
proposed approach is one of the ways to avoid unwanted if-conversion,
although sometimes we would like to split to cmove instead.

Uros.
Richard Biener Aug. 3, 2019, 5:26 p.m. UTC | #13
On Thu, 1 Aug 2019, Uros Bizjak wrote:

> On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
>
>>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
>>>> necessary even when going the STV route.  The actual regression
>>>> for the testcase could also be solved by turning the smaxsi3
>>>> back into a compare and jump rather than a conditional move sequence.
>>>> So I wonder how you'd do that given that there's pass_if_after_reload
>>>> after pass_split_after_reload and I'm not sure we can split
>>>> as late as pass_split_before_sched2 (there's also a split _after_
>>>> sched2 on x86 it seems).
>>>>
>>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
>>>> case STV doesn't end up doing any transform?
>>>
>>> If STV doesn't transform the insn, then a pre-reload splitter splits
>>> the insn back to compare+cmove.
>>
>> OK, that would work.  But then there's no way to force a jumpy
>> sequence, which we know is faster than compare+cmove, because later RTL
>> if-conversion passes happily re-discover the smax (or conditional move)
>> sequence.
>>
>>> However, considering the SImode move
>>> from/to int/xmm register is relatively cheap, the cost function should
>>> be tuned so that STV always converts smaxsi3 pattern.
>>
>> Note that on both Zen and even more so bdverN the int/xmm transition
>> makes it not just unprofitable but a _lot_ slower than the cmp/cmov
>> sequence... (for the loop in hmmer, which is the only one where I see
>> any effect from any of my patches).  So identifying chains that
>> start/end in memory is important for cost reasons.
>
> Please note that the cost function also considers the cost of move
> from/to xmm. So, the cost of the whole chain would disable the
> transformation.
>
>> So I think the splitting has to happen after the last if-conversion
>> pass (and thus we may need to allocate a scratch register for this
>> purpose?)
>
> I really hope that the underlying issue will be solved by a machine
> dependent pass inserted somewhere after the pre-reload split. This
> way, we can split unconverted smax to the cmove, and this later pass
> would handle jcc and cmove instructions. Until then... yes your
> proposed approach is one of the ways to avoid unwanted if-conversion,
> although sometimes we would like to split to cmove instead.

So the following makes STV also consider SImode chains, re-using the
DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
and also did not alter the {SI,DI}mode chain cost function - it's
quite off for TARGET_64BIT.  With this I get the expected conversion
for the testcase derived from hmmer.

No further testing so far.

Is it OK to re-use the DImode chain code this way?  I'll clean things
up some more of course.

Still need help with the actual patterns for minmax and what the splitters
should look like.

Richard.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274037)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@

  /* Initialize new chain.  */

-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
  {
+  smode = smode_;
+  vmode = vmode_;
+
    chain_id = ++max_id;

     if (dump_file)
@@ -473,7 +476,7 @@
  {
    gcc_assert (CONST_INT_P (exp));

-  if (standard_sse_constant_p (exp, V2DImode))
+  if (standard_sse_constant_p (exp, vmode))
      return COSTS_N_INSNS (1);
    return ix86_cost->sse_load[1];
  }
@@ -534,6 +537,9 @@
        else if (GET_CODE (src) == NEG
  	       || GET_CODE (src) == NOT)
  	gain += ix86_cost->add - COSTS_N_INSNS (1);
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN)
+	gain += COSTS_N_INSNS (3);
        else if (GET_CODE (src) == COMPARE)
  	{
  	  /* Assume comparison cost is the same.  */
@@ -573,7 +579,7 @@
  dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
  {
    if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return gen_rtx_SUBREG (vmode, new_reg, 0);

    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
    int i, j;
@@ -707,7 +713,7 @@
    bitmap_copy (conv, insns);

    if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);

    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
      {
@@ -750,6 +756,10 @@
  		  gen_rtx_VEC_SELECT (SImode,
  				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
  	    }
+	  else if (smode == SImode)
+	    {
+	      emit_move_insn (scopy, gen_rtx_SUBREG (SImode, reg, 0));
+	    }
  	  else
  	    {
  	      rtx vcopy = gen_reg_rtx (V2DImode);
@@ -816,14 +826,14 @@
    if (GET_CODE (*op) == NOT)
      {
        convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
      }
    else if (MEM_P (*op))
      {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));

        emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);

        if (dump_file)
  	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +851,30 @@
  	    gcc_assert (!DF_REF_CHAIN (ref));
  	    break;
  	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      *op = gen_rtx_SUBREG (vmode, *op, 0);
      }
    else if (CONST_INT_P (*op))
      {
        rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);

        /* Prefer all ones vector in case of -1.  */
        if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
        else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}

-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
  	{
  	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
  	  rtx_insn *seq = get_insns ();
  	  end_sequence ();
  	  emit_insn_before (seq, insn);
@@ -870,7 +886,7 @@
    else
      {
        gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
      }
  }

@@ -888,9 +904,9 @@
      {
        /* There are no scalar integer instructions and therefore
  	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
      }

    switch (GET_CODE (src))
@@ -899,7 +915,7 @@
      case ASHIFTRT:
      case LSHIFTRT:
        convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
        break;

      case PLUS:
@@ -907,25 +923,27 @@
      case IOR:
      case XOR:
      case AND:
+    case SMAX:
+    case SMIN:
        convert_op (&XEXP (src, 0), insn);
        convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
        break;

      case NEG:
        src = XEXP (src, 0);
        convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
        break;

      case NOT:
        src = XEXP (src, 0);
        convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
        break;

      case MEM:
@@ -939,17 +957,17 @@
        break;

      case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
        break;

      case COMPARE:
        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));

-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));

        if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
        else
  	subreg = copy_rtx_if_shared (src);
        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -1186,7 +1204,7 @@
  		     (const_int 0 [0])))  */

  static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
  {
    if (!TARGET_SSE4_1)
      return false;
@@ -1219,12 +1237,12 @@

    if (!SUBREG_P (op1)
        || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
        || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
  	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
      return false;

    op1 = SUBREG_REG (op1);
@@ -1232,7 +1250,7 @@

    if (op1 != op2
        || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
      return false;

    return true;
@@ -1241,7 +1259,7 @@
  /* The DImode version of scalar_to_vector_candidate_p.  */

  static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
  {
    rtx def_set = single_set (insn);

@@ -1255,12 +1273,12 @@
    rtx dst = SET_DEST (def_set);

    if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);

    /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
         && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
      return false;

    if (!REG_P (dst) && !MEM_P (dst))
@@ -1285,12 +1303,14 @@
      case IOR:
      case XOR:
      case AND:
+    case SMAX:
+    case SMIN:
        if (!REG_P (XEXP (src, 1))
  	  && !MEM_P (XEXP (src, 1))
  	  && !CONST_INT_P (XEXP (src, 1)))
  	return false;

-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
  	  && !CONST_INT_P (XEXP (src, 1)))
  	return false;
        break;
@@ -1319,7 +1339,7 @@
  	  || !REG_P (XEXP (XEXP (src, 0), 0))))
        return false;

-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
        && !CONST_INT_P (XEXP (src, 0)))
      return false;

@@ -1392,7 +1412,7 @@
    if (TARGET_64BIT)
      return timode_scalar_to_vector_candidate_p (insn);
    else
-    return dimode_scalar_to_vector_candidate_p (insn);
+    return dimode_scalar_to_vector_candidate_p (insn, DImode);
  }

  /* The DImode version of remove_non_convertible_regs.  */
@@ -1577,11 +1597,12 @@
  convert_scalars_to_vector ()
  {
    basic_block bb;
-  bitmap candidates;
+  bitmap candidates, sicandidates;
    int converted_insns = 0;

    bitmap_obstack_initialize (NULL);
    candidates = BITMAP_ALLOC (NULL);
+  sicandidates = BITMAP_ALLOC (NULL);

    calculate_dominance_info (CDI_DOMINATORS);
    df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1605,28 +1626,43 @@

  	    bitmap_set_bit (candidates, INSN_UID (insn));
  	  }
+	else if (dimode_scalar_to_vector_candidate_p (insn, SImode))
+	  {
+	    if (dump_file)
+	      fprintf (dump_file, "  insn %d is marked as a SI candidate\n",
+		       INSN_UID (insn));
+
+	    bitmap_set_bit (sicandidates, INSN_UID (insn));
+	  }
      }

    remove_non_convertible_regs (candidates);
+  dimode_remove_non_convertible_regs (sicandidates);

-  if (bitmap_empty_p (candidates))
+  if (bitmap_empty_p (candidates)
+      && bitmap_empty_p (sicandidates))
      if (dump_file)
        fprintf (dump_file, "There are no candidates for optimization.\n");

-  while (!bitmap_empty_p (candidates))
+  bitmap cand = candidates;
+  do
      {
-      unsigned uid = bitmap_first_set_bit (candidates);
+  while (!bitmap_empty_p (cand))
+    {
+      unsigned uid = bitmap_first_set_bit (cand);
        scalar_chain *chain;

-      if (TARGET_64BIT)
+      if (TARGET_64BIT && cand == candidates)
  	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+      else if (cand == candidates)
+	chain = new dimode_scalar_chain (DImode, V2DImode);
+      else if (cand == sicandidates)
+	chain = new dimode_scalar_chain (SImode, V4SImode);

        /* Find instructions chain we want to convert to vector mode.
  	 Check all uses and definitions to estimate all required
  	 conversions.  */
-      chain->build (candidates, uid);
+      chain->build (cand, uid);

        if (chain->compute_convert_gain () > 0)
  	converted_insns += chain->convert ();
@@ -1637,11 +1673,17 @@

        delete chain;
      }
+  if (cand == sicandidates)
+    break;
+  cand = sicandidates;
+    }
+  while (1);

    if (dump_file)
      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);

    BITMAP_FREE (candidates);
+  BITMAP_FREE (sicandidates);
    bitmap_obstack_release (NULL);
    df_process_deferred_rescans ();

Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274037)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@
  class scalar_chain
  {
   public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
    virtual ~scalar_chain ();

    static unsigned max_id;

+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
    /* ID of a chain.  */
    unsigned int chain_id;
    /* A queue of instructions to be included into a chain.  */
@@ -162,6 +167,8 @@
  class dimode_scalar_chain : public scalar_chain
  {
   public:
+  dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
    int compute_convert_gain ();
   private:
    void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@
  class timode_scalar_chain : public scalar_chain
  {
   public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
    /* Convert from TImode to V1TImode is always faster.  */
    int compute_convert_gain () { return 1; }

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274037)
+++ gcc/config/i386/i386.md	(working copy)
@@ -5325,6 +5325,16 @@
         (const_string "SI")
         (const_string "<MODE>")))])

+;; min/max patterns
+
+(define_insn "smaxsi3"
+  [(set (match_operand:SI 0 "register_operand")
+  	(smax:SI (match_operand:SI 1 "register_operand")
+		 (match_operand:SI 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1"
+  "#")
+
  ;; Add instructions

  (define_expand "add<mode>3"
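A minimal testcase for the above (my reduction, not part of the patch):
with -O2 -msse4.1 and STV enabled, the MAX_EXPR should reach RTL as the
new smaxsi3 and the memory-to-memory chain should convert to
paddd/pmaxsd.

--cut here--
void
foo (int *d, int *a, int *b)
{
  int t = *a + *b;
  *d = t > *b ? t : *b;
}
--cut here--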
Uros Bizjak Aug. 4, 2019, 5:11 p.m. UTC | #14
On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Thu, 1 Aug 2019, Uros Bizjak wrote:
>
> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
> >
> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> >>>> necessary even when going the STV route.  The actual regression
> >>>> for the testcase could also be solved by turning the smaxsi3
> >>>> back into a compare and jump rather than a conditional move sequence.
> >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> >>>> after pass_split_after_reload and I'm not sure we can split
> >>>> as late as pass_split_before_sched2 (there's also a split _after_
> >>>> sched2 on x86 it seems).
> >>>>
> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> >>>> case STV doesn't end up doing any transform?
> >>>
> >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> >>> the insn back to compare+cmove.
> >>
> >> OK, that would work.  But then there's no way to force a jumpy
> >> sequence, which we know is faster than compare+cmove, because later RTL
> >> if-conversion passes happily re-discover the smax (or conditional move)
> >> sequence.
> >>
> >>> However, considering the SImode move
> >>> from/to int/xmm register is relatively cheap, the cost function should
> >>> be tuned so that STV always converts smaxsi3 pattern.
> >>
> >> Note that on both Zen and even more so bdverN the int/xmm transition
> >> makes it not just unprofitable but a _lot_ slower than the cmp/cmov
> >> sequence... (for the loop in hmmer, which is the only one where I see
> >> any effect from any of my patches).  So identifying chains that
> >> start/end in memory is important for cost reasons.
> >
> > Please note that the cost function also considers the cost of move
> > from/to xmm. So, the cost of the whole chain would disable the
> > transformation.
> >
> >> So I think the splitting has to happen after the last if-conversion
> >> pass (and thus we may need to allocate a scratch register for this
> >> purpose?)
> >
> > I really hope that the underlying issue will be solved by a machine
> > dependant pass inserted somewhere after the pre-reload split. This
> > way, we can split unconverted smax to the cmove, and this later pass
> > would handle jcc and cmove instructions. Until then... yes your
> > proposed approach is one of the ways to avoid unwanted if-conversion,
> > although sometimes we would like to split to cmove instead.
>
> So the following makes STV also consider SImode chains, re-using the
> DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> and also did not alter the {SI,DI}mode chain cost function - it's
> quite off for TARGET_64BIT.  With this I get the expected conversion
> for the testcase derived from hmmer.
>
> No further testing so far.
>
> Is it OK to re-use the DImode chain code this way?  I'll clean things
> up some more of course.

Yes, the approach looks OK to me. It makes chain building mode
agnostic, and the chain building can be used for
a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
b) SImode x86_32 and x86_64 (this will be mainly used for SImode
minmax and surrounding SImode operations)
c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
DImode operations)

> Still need help with the actual patterns for minmax and what the splitters
> should look like.

Please look at the attached patch. Maybe we can add memory_operand as
operand 1 and operand 2 predicate, but let's keep things simple for
now.

Uros.
Index: i386.md
===================================================================
--- i386.md	(revision 274008)
+++ i386.md	(working copy)
@@ -17721,6 +17721,27 @@
     std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+	(smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+		       (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:CCGC FLAGS_REG)
+	(compare:CCGC (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+   	(if_then_else:SWI48
+	  (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
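For clarity, the <smaxmin_rel> mapping above (smax -> "ge", smin -> "le")
makes the split select operand 1 when the relation holds and operand 2
otherwise, i.e. the reference semantics are simply (sketch):

--cut here--
static inline int smax_ref (int a, int b) { return a >= b ? a : b; }
static inline int smin_ref (int a, int b) { return a <= b ? a : b; }
--cut here--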
Jakub Jelinek Aug. 4, 2019, 5:22 p.m. UTC | #15
On Sun, Aug 04, 2019 at 07:11:01PM +0200, Uros Bizjak wrote:
> Yes, the approach looks OK to me. It makes chain building mode
> agnostic, and the chain building can be used for
> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> minmax and surrounding SImode operations)
> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> DImode operations)
> 
> > Still need help with the actual patterns for minmax and what the splitters
> > should look like.
> 
> Please look at the attached patch. Maybe we can add memory_operand as
> operand 1 and operand 2 predicate, but let's keep things simple for
> now.

Shouldn't it be used also for p{min,max}ud rather than just p{min,max}sd?
What about p{min,max}{s,u}{b,w,q}?  Some of those are already in SSE.

If the conversion of the chain fails, couldn't the STV pass split those
SImode etc. min/max patterns into code with branches, rather than turning
them into cmovs?

	Jakub
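For reference (my summary of the ISA coverage behind the question, not
from the thread): plain SSE/SSE2 only has pminub/pmaxub (unsigned byte)
and pminsw/pmaxsw (signed word); SSE4.1 adds the signed byte, unsigned
word and all four dword variants; the qword forms only arrive with
AVX-512.  In C terms the pre-SSE4.1 cases are:

--cut here--
unsigned char ub_max (unsigned char a, unsigned char b) { return a > b ? a : b; }
short         sw_max (short a, short b)                 { return a > b ? a : b; }
--cut here--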
Uros Bizjak Aug. 4, 2019, 5:35 p.m. UTC | #16
On Sun, Aug 4, 2019 at 7:23 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Sun, Aug 04, 2019 at 07:11:01PM +0200, Uros Bizjak wrote:
> > Yes, the approach looks OK to me. It makes chain building mode
> > agnostic, and the chain building can be used for
> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > minmax and surrounding SImode operations)
> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > DImode operations)
> >
> > > Still need help with the actual patterns for minmax and how the splitters
> > > should look like.
> >
> > Please look at the attached patch. Maybe we can add memory_operand as
> > operand 1 and operand 2 predicate, but let's keep things simple for
> > now.
>
> Shouldn't it be used also for p{min,max}ud rather than just p{min,max}sd?
> What about p{min,max}{s,u}{b,w,q}?  Some of those are already in SSE.

Sure, unsigned ops will also be added. I just went through Richard's
patch and looked for the RTXes it handles. I'm not sure about HImode
and QImode minmax operations. While these can be added, we would need
to re-run STV in HImode and QImode; I wonder if it is worth it.

> If the conversion of the chain fails, couldn't the STV pass split those
> SImode etc. min/max patterns into code with branches, rather than turning
> them into cmovs?

Since these patterns require SSE4.1, we are sure that we can split
back to cmov. But IMO the cmov/jcc issue is orthogonal to the minmax
conversion and should be handled by some other machine-specific pass
that would analyse cmove insertion and eventually split unwanted
cmoves back to jcc (based on some yet-unknown metrics). Please note
that there is no definite proof that it is beneficial to convert
cmoves to jcc for all x86 targets.

Uros.
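An unsigned counterpart for completeness (illustrative): once umax/umin
are among the candidates, the function below could use pmaxud under
SSE4.1.

--cut here--
unsigned
umax (unsigned *a, unsigned *b)
{
  return *a > *b ? *a : *b;
}
--cut here--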
Richard Biener Aug. 5, 2019, 8:47 a.m. UTC | #17
On Sun, 4 Aug 2019, Uros Bizjak wrote:

> On Sun, Aug 4, 2019 at 7:23 PM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Sun, Aug 04, 2019 at 07:11:01PM +0200, Uros Bizjak wrote:
> > > Yes, the approach looks OK to me. It makes chain building mode
> > > agnostic, and the chain building can be used for
> > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > > minmax and surrounding SImode operations)
> > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > > DImode operations)
> > >
> > > > Still need help with the actual patterns for minmax and how the splitters
> > > > should look like.
> > >
> > > Please look at the attached patch. Maybe we can add memory_operand as
> > > operand 1 and operand 2 predicate, but let's keep things simple for
> > > now.
> >
> > Shouldn't it be used also for p{min,max}ud rather than just p{min,max}sd?
> > What about p{min,max}{s,u}{b,w,q}?  Some of those are already in SSE.
> 
> Sure, unsigned ops will also be added. I just went through Richard's
> patch and looked for the RTXes it handles. I'm not sure about HImode
> and QImode minmax operations. While these can be added, we would need
> to re-run STV in HImode and QImode; I wonder if it is worth it.

I think we can always extend later; for now I'm trying to do {SI,DI}mode
only, but yes, u{min,max} would be nice not to miss.

> > If the conversion of the chain fails, couldn't the STV pass split those
> > SImode etc. min/max patterns into code with branches, rather than turning
> > them into cmovs?
> 
> Since these patterns require SSE4.1, we are sure that we can split
> back to cmov. But IMO the cmov/jcc issue is orthogonal to the minmax
> conversion and should be handled by some other machine-specific pass
> that would analyse cmove insertion and eventually split unwanted
> cmoves back to jcc (based on some yet-unknown metrics). Please note
> that there is no definite proof that it is beneficial to convert
> cmoves to jcc for all x86 targets.

I guess a tunable plus (micro-)benchmarking could make this decision.
But yes, this is largely independent - and if we split to jumps
then RTL if-conversion will happily turn it back to cmoves anyway.

Richard.
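One illustration of why a static answer is hard (hypothetical example):
whether cmov or jcc wins depends on branch predictability, which no cost
model sees.  In the loop below a branchy max is typically faster when
a[i] has mostly one sign and the branch predicts well, while cmov (or
pmaxsd) wins on random signs.

--cut here--
int
sum_clamped (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i] > 0 ? a[i] : 0;   /* smax (a[i], 0) accumulated */
  return s;
}
--cut here--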
Richard Sandiford Aug. 5, 2019, 9:13 a.m. UTC | #18
Uros Bizjak <ubizjak@gmail.com> writes:
> On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
>>
>> So the following makes STV also consider SImode chains, re-using the
>> DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
>> and also did not alter the {SI,DI}mode chain cost function - it's
>> quite off for TARGET_64BIT.  With this I get the expected conversion
>> for the testcase derived from hmmer.
>>
>> No further testing so far.
>>
>> Is it OK to re-use the DImode chain code this way?  I'll clean things
>> up some more of course.
>
> Yes, the approach looks OK to me. It makes chain building mode
> agnostic, and the chain building can be used for
> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> minmax and surrounding SImode operations)
> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> DImode operations)
>
>> Still need help with the actual patterns for minmax and what the splitters
>> should look like.
>
> Please look at the attached patch. Maybe we can add memory_operand as
> operand 1 and operand 2 predicate, but let's keep things simple for
> now.
>
> Uros.
>
> Index: i386.md
> ===================================================================
> --- i386.md	(revision 274008)
> +++ i386.md	(working copy)
> @@ -17721,6 +17721,27 @@
>      std::swap (operands[4], operands[5]);
>  })
>  
> +;; min/max patterns
> +
> +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> +
> +(define_insn_and_split "<code><mode>3"
> +  [(set (match_operand:SWI48 0 "register_operand")
> +	(smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> +		       (match_operand:SWI48 2 "register_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_STV && TARGET_SSE4_1
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:CCGC FLAGS_REG)
> +	(compare:CCGC (match_dup 1)(match_dup 2)))
> +   (set (match_dup 0)
> +   	(if_then_else:SWI48
> +	  (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> +	  (match_dup 1)
> +	  (match_dup 2)))])
> +

The pattern could in theory be matched after the last pre-RA split pass
has run, so I think the pattern still needs to have constraints and be
matchable even without can_create_pseudo_p.  It looks like the split
above should work post-RA.

A bit pedantic, because the pattern's probably fine in practice...

Thanks,
Richard
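One possible shape addressing this (an untested sketch, not a revision
from the thread): add constraints and tie operand 1 to operand 0, so
that the cmov produced by the split still matches the existing
*mov<mode>cc_noc alternatives even after reload.

(define_insn_and_split "<code><mode>3"
  [(set (match_operand:SWI48 0 "register_operand" "=r")
	(smaxmin:SWI48 (match_operand:SWI48 1 "register_operand" "0")
		       (match_operand:SWI48 2 "register_operand" "r")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_STV && TARGET_SSE4_1"
  "#"
  "&& 1"
  [(set (reg:CCGC FLAGS_REG)
	(compare:CCGC (match_dup 1) (match_dup 2)))
   (set (match_dup 0)
	(if_then_else:SWI48
	  (<smaxmin_rel> (reg:CCGC FLAGS_REG) (const_int 0))
	  (match_dup 1)
	  (match_dup 2)))])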

Uros Bizjak Aug. 5, 2019, 10:07 a.m. UTC | #19
On Mon, Aug 5, 2019 at 11:13 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> [...]
>
> The pattern could in theory be matched after the last pre-RA split pass
> has run, so I think the pattern still needs to have constraints and be
> matchable even without can_create_pseudo_p.  It looks like the split
> above should work post-RA.
>
> A bit pedantic, because the pattern's probably fine in practice...

Currently, all unmatched STV patterns split before reload, and there
were no problems. If the pattern matches after the last pre-RA split, then
the post-reload splitter will fail, since can_create_pseudo_p also
applies to the part that splits the insn. In any case, thanks for the
heads-up, hopefully we didn't assume something that doesn't hold.
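
(To make the mechanism concrete - an illustrative note, not text from
the patch: in a define_insn_and_split, a split condition starting with
"&&" has the insn condition prepended, so the effective split condition
above is

  "TARGET_STV && TARGET_SSE4_1 && can_create_pseudo_p ()"

and can_create_pseudo_p () returns false once reload has started, which
is why a post-reload split of this insn would be rejected.)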

Thanks,
Uros.

Richard Sandiford Aug. 5, 2019, 10:12 a.m. UTC | #20
Uros Bizjak <ubizjak@gmail.com> writes:
> [...]
>>
>> The pattern could in theory be matched after the last pre-RA split pass
>> has run, so I think the pattern still needs to have constraints and be
>> matchable even without can_create_pseudo_p.  It looks like the split
>> above should work post-RA.
>>
>> A bit pedantic, because the pattern's probably fine in practice...
>
> Currently, all unmatched STV patterns split before reload, and there
> were no problems. If the pattern matches after the last pre-RA split, then
> the post-reload splitter will fail, since can_create_pseudo_p also
> applies to the part that splits the insn.

But what I meant was: you should be able to remove the
can_create_pseudo_p () and add constraints.  (You'd have to remove
can_create_pseudo_p () with constraints anyway, since the insn
wouldn't match after RA otherwise.)

Thanks,
Richard

Uros Bizjak Aug. 5, 2019, 10:24 a.m. UTC | #21
On Mon, Aug 5, 2019 at 12:12 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> [...]
>
> But what I meant was: you should be able to remove the
> can_create_pseudo_p () and add constraints.  (You'd have to remove
> can_create_pseudo_p () with constraints anyway, since the insn
> wouldn't match after RA otherwise.)

I was under the impression that it is better to split pseudo->pseudo,
so reload has more freedom in choosing registers, especially with
matched and earlyclobbered DImode regs in x86_32 DImode patterns.
There were some complications with the andn pattern (which needed an
earlyclobber on a register to avoid clobbering registers in a memory
address), and it was necessary to clobber the whole DImode register
pair, wasting an SImode register. We can avoid all these complications
by splitting before RA, where a pseudo can also be allocated.
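
(As a hedged illustration of the earlyclobber issue described above -
the pattern below is a made-up sketch, not the actual i386.md andn
pattern:

;; Sketch only: "=&r" marks operand 0 as an earlyclobber output, i.e.
;; it is written before all inputs are consumed, so the allocator must
;; not assign it a hard register used by operand 2, including any
;; register appearing inside operand 2's memory address.
(define_insn "*andn<mode>3_sketch"
  [(set (match_operand:SWI48 0 "register_operand" "=&r")
	(and:SWI48
	  (not:SWI48 (match_operand:SWI48 1 "register_operand" "r"))
	  (match_operand:SWI48 2 "nonimmediate_operand" "rm")))
   (clobber (reg:CC FLAGS_REG))]
  ""
  "#")

Splitting before RA sidesteps this because the intermediate value can
simply live in a fresh pseudo from gen_reg_rtx.)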

Uros.
Richard Sandiford Aug. 5, 2019, 10:39 a.m. UTC | #22
Uros Bizjak <ubizjak@gmail.com> writes:
> [...]
>>
>> But what I meant was: you should be able to remove the
>> can_create_pseudo_p () and add constraints.  (You'd have to remove
>> can_create_pseudo_p () with constraints anyway, since the insn
>> wouldn't match after RA otherwise.)
>
> I was under the impression that it is better to split pseudo->pseudo,
> so reload has more freedom in choosing registers, especially with
> matched and earlyclobbered DImode regs in x86_32 DImode patterns.
> There were some complications with the andn pattern (which needed an
> earlyclobber on a register to avoid clobbering registers in a memory
> address), and it was necessary to clobber the whole DImode register
> pair, wasting an SImode register. We can avoid all these complications
> by splitting before RA, where a pseudo can also be allocated.

Yeah, splitting before RA is fine.  All I meant was that:

(define_insn_and_split "<code><mode>3"
  [(set (match_operand:SWI48 0 "register_operand" "=r")
	(smaxmin:SWI48 (match_operand:SWI48 1 "register_operand" "r")
		       (match_operand:SWI48 2 "register_operand" "r")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_STV && TARGET_SSE4_1"
  "#"
  "&& 1"
  [(set (reg:CCGC FLAGS_REG)
	(compare:CCGC (match_dup 1) (match_dup 2)))
   (set (match_dup 0)
	(if_then_else:SWI48
	  (<smaxmin_rel> (reg:CCGC FLAGS_REG) (const_int 0))
	  (match_dup 1)
	  (match_dup 2)))])

seems like it should be correct too and avoids the theoretical
problem I mentioned.  If the instruction does survive until RA then
the split should work correctly on the reloaded instruction.

Thanks,
Richard
Richard Biener Aug. 5, 2019, 11:50 a.m. UTC | #23
On Sun, 4 Aug 2019, Uros Bizjak wrote:

> [...]
>
> Yes, the approach looks OK to me. It makes chain building mode
> agnostic, and the chain building can be used for
> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> minmax and surrounding SImode operations)
> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> DImode operations)
>
> [...]
>
> Please look at the attached patch. Maybe we can add memory_operand as
> operand 1 and operand 2 predicate, but let's keep things simple for
> now.

Thanks.  Your pattern makes the patch cleaner, and it survives "some"
bare-bones testing.  It also touches the cost function to avoid being
overly trigger-happy.  I've also ended up using ix86_cost->sse_op
instead of COSTS_N_INSNS-based magic.  In particular, we estimated a
GPR reg-reg move as COSTS_N_INSNS (2), while move costs shouldn't be
wrapped in COSTS_N_INSNS at all.
IMHO we should probably disregard any reg-reg moves for costing pre-RA.
At least with the current code every reg-reg move biases in favor of
SSE...

And we're simply adding move and non-move costs in 'gain', somewhat
mixing apples and oranges?  We could separate those and require
both to be a net positive win?
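
A minimal sketch of that separation (made-up local names, not code from
the attached patch):

  /* Sketch: decide profitability from the two components separately.  */
  static bool
  chain_profitable_p (int op_gain, int move_cost)
  {
    /* Require a win on the operations themselves and a net win after
       accounting for reg<->xmm transition costs.  */
    return op_gain > 0 && op_gain - move_cost > 0;
  }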

Still, using -mtune=bdverN exposes that some cost tables have xmm and
gpr costs that are apples and oranges... (so it never triggers for
Bulldozer).

I now run into

/space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1: 
error: unrecognizable insn:
(insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
        (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
            (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0))) 
-1
     (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
during RTL pass: stv

where even with -mavx2 we do not have s{min,max}v2di3.  We do have
an expander here but it seems only AVX512F has the DImode min/max
ops.  I have adjusted dimode_scalar_to_vector_candidate_p
accordingly.

I'm considering renaming the
dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
functions to drop the dimode_ prefix - is that OK, or do you
prefer some other prefix?

So - bootstrap with --with-arch=skylake in progress.

It detects quite a few chains (unsurprisingly), so I guess we need
to address compile-time issues in the pass before enabling this
enhancement (maybe as a followup?).

Further comments on the actual patch are welcome; I consider it
"finished" if testing reveals no issues.  The ChangeLog still needs
to be written and testcases added.

Thanks,
Richard.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274111)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
    if (dump_file)
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ???  The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost (
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
     	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
-	  gain += ix86_cost->shift_const;
+	    igain -= vector_const_cost (XEXP (src, 0));
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
 	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    gain -= COSTS_N_INSNS (1);
+	    igain -= COSTS_N_INSNS (1);
 	}
       else if (GET_CODE (src) == PLUS
 	       || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
 	       || GET_CODE (src) == XOR
 	       || GET_CODE (src) == AND)
 	{
-	  gain += ix86_cost->add;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
 	  /* Additional gain for andnot for targets without BMI.  */
 	  if (GET_CODE (XEXP (src, 0)) == NOT
 	      && !TARGET_BMI)
-	    gain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
 	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
+	    igain -= vector_const_cost (XEXP (src, 0));
 	  if (CONST_INT_P (XEXP (src, 1)))
-	    gain -= vector_const_cost (XEXP (src, 1));
+	    igain -= vector_const_cost (XEXP (src, 1));
 	}
       else if (GET_CODE (src) == NEG
 	       || GET_CODE (src) == NOT)
-	gain += ix86_cost->add - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
 	{
 	  /* Assume comparison cost is the same.  */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
 	{
 	  if (REG_P (dst))
-	    gain += COSTS_N_INSNS (2);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += COSTS_N_INSNS (m);
 	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-	  gain -= vector_const_cost (src);
+	    igain += (m * ix86_cost->int_store[2]
+		     - ix86_cost->sse_store[sse_cost_idx]);
+	  igain -= vector_const_cost (src);
 	}
       else
 	gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ???  What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -573,7 +611,7 @@ rtx
 dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return gen_rtx_SUBREG (vmode, new_reg, 0);
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
 	start_sequence ();
 	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
 	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
 	    emit_move_insn (vreg, tmp);
 	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
 	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (SImode, reg, 4),
+					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
+			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
 	  }
 	else
-	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
-	  }
+	  emit_move_insn (gen_lowpart (smode, vreg), reg);
 	rtx_insn *seq = get_insns ();
 	end_sequence ();
 	rtx_insn *insn = DF_REF_INSN (ref);
@@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
 	  start_sequence ();
 	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
 	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
 	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
 	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
 	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 0),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 4),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
+				  gen_rtx_LSHIFTRT (V2DImode,
+						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
 	    }
 	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_conversion_insns (seq, insn);
@@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
 	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
 	    gcc_assert (!DF_REF_CHAIN (ref));
 	    break;
 	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      *op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
 	{
 	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_insn_before (seq, insn);
@@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
@@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
 	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
 	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
 		     (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
 	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
 	return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512F)
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
       break;
@@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  || !REG_P (XEXP (XEXP (src, 0), 0))))
       return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instruction and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
 dimode_remove_non_convertible_regs (bitmap candidates)
@@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1638,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
 	  {
 	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
 		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
 	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    dimode_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
+	   Check all uses and definitions to estimate all required
+	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274111)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -162,6 +167,8 @@ class scalar_chain
 class dimode_scalar_chain : public scalar_chain
 {
  public:
+  dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
  private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17721,6 +17721,27 @@ (define_peephole2
     std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+	(smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+		       (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:CCGC FLAGS_REG)
+	(compare:CCGC (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+   	(if_then_else:SWI48
+	  (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
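
For a concrete feel of what the insn_and_split above is for, here is a
minimal testcase of my own (an illustration only, not taken from PR 91154
or the patch); compiled with something like -O2 -msse4.1 -mstv, the inner
max becomes a GIMPLE MAX_EXPR, expands through smaxsi3, and is then either
converted to pmaxsd by STV or split back to cmp+cmov before reload:

int
max_chain (const int *a, const int *b, int n)
{
  int m = 0;
  for (int i = 0; i < n; i++)
    {
      int t = a[i] + b[i];	/* SImode add that can stay in an SSE reg.  */
      m = t > m ? t : m;	/* SImode signed max.  */
    }
  return m;
}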
Uros Bizjak Aug. 5, 2019, 11:59 a.m. UTC | #24
On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 4 Aug 2019, Uros Bizjak wrote:
>
> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> > >
> > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > >>>> necessary even when going the STV route.  The actual regression
> > > >>>> for the testcase could also be solved by turing the smaxsi3
> > > >>>> back into a compare and jump rather than a conditional move sequence.
> > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > > >>>> after pass_split_after_reload and I'm not sure we can split
> > > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > > >>>> sched2 on x86 it seems).
> > > >>>>
> > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> > > >>>> case STV doesn't end up doing any transform?
> > > >>>
> > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > > >>> the insn back to compare+cmove.
> > > >>
> > > >> OK, that would work.  But there's no way to force a jumpy sequence then
> > > >> which we know is faster than compare+cmove because later RTL
> > > >> if-conversion passes happily re-discover the smax (or conditional move)
> > > >> sequence.
> > > >>
> > > >>> However, considering the SImode move
> > > >>> from/to int/xmm register is relatively cheap, the cost function should
> > > >>> be tuned so that STV always converts smaxsi3 pattern.
> > > >>
> > > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > > >> sequence... (for the loop in hmmer which is the only one I see
> > > >> any effect of any of my patches).  So identifying chains that
> > > >> start/end in memory is important for cost reasons.
> > > >
> > > > Please note that the cost function also considers the cost of move
> > > > from/to xmm. So, the cost of the whole chain would disable the
> > > > transformation.
> > > >
> > > >> So I think the splitting has to happen after the last if-conversion
> > > >> pass (and thus we may need to allocate a scratch register for this
> > > >> purpose?)
> > > >
> > > > I really hope that the underlying issue will be solved by a machine
> > > > dependent pass inserted somewhere after the pre-reload split. This
> > > > way, we can split unconverted smax to the cmove, and this later pass
> > > > would handle jcc and cmove instructions. Until then... yes your
> > > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > > although sometimes we would like to split to cmove instead.
> > >
> > > So the following makes STV also consider SImode chains, re-using the
> > > DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> > > and also did not alter the {SI,DI}mode chain cost function - it's
> > > quite off for TARGET_64BIT.  With this I get the expected conversion
> > > for the testcase derived from hmmer.
> > >
> > > No further testing so far.
> > >
> > > Is it OK to re-use the DImode chain code this way?  I'll clean things
> > > up some more of course.
> >
> > Yes, the approach looks OK to me. It makes chain building mode
> > agnostic, and the chain building can be used for
> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > minmax and surrounding SImode operations)
> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > DImode operations)
> >
> > > Still need help with the actual patterns for minmax and how the splitters
> > > should look like.
> >
> > Please look at the attached patch. Maybe we can add memory_operand as
> > operand 1 and operand 2 predicate, but let's keep things simple for
> > now.
>
> Thanks.  The attached patch makes the patch cleaner and it survives
> "some" barebone testing.  It also touches the cost function to
> avoid being too overly trigger-happy.  I've also ended up using
> ix86_cost->sse_op instead of COSTS_N_INSNS-based magic.  In
> particular we estimated GPR reg-reg move as COSTS_N_INSNS (2) while
> move costs shouldn't be wrapped in COSTS_N_INSNS.
> IMHO we should probably disregard any reg-reg moves for costing pre-RA.
> At least with the current code every reg-reg move biases in favor of
> SSE...
>
> And we're simply adding move and non-move costs in 'gain', somewhat
> mixing apples and oranges?  We could separate those and require
> both to be a net positive win?
>
> Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
> costs as apples and oranges... (so it never triggers for Bulldozer)
>
> I now run into
>
> /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
> error: unrecognizable insn:
> (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>         (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>             (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
> -1
>      (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> during RTL pass: stv
>
> where even with -mavx2 we do not have s{min,max}v2di3.  We do have
> an expander here but it seems only AVX512F has the DImode min/max
> ops.  I have adjusted dimode_scalar_to_vector_candidate_p
> accordingly.
>
> I'm considering to rename the
> dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> functions to drop the dimode_ prefix - is that OK or do you
> prefer some other prefix?

No, please just drop the prefix.

> So - bootstrap with --with-arch=skylake in progress.
>
> It detects quite a few chains (unsurprisingly) so I guess we need
> to address compile-time issues in the pass before enabling this
> enhancement (maybe as followup?).
>
> Further comments on the actual patch welcome, I consider it
> "finished" if testing reveals no issues.  ChangeLog still needs
> to be written and testcases to be added.

I'll look at the patch later today from the x86 target PoV, maybe an
opinion of the RTL expert would also come in handy here.

Uros.

>
> Thanks,
> Richard.
>
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c     (revision 274111)
> +++ gcc/config/i386/i386-features.c     (working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>
>  /* Initialize new chain.  */
>
> -scalar_chain::scalar_chain ()
> +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
>  {
> +  smode = smode_;
> +  vmode = vmode_;
> +
>    chain_id = ++max_id;
>
>     if (dump_file)
> @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
>        && !HARD_REGISTER_P (SET_DEST (def_set)))
>      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>
> +  /* ???  The following is quadratic since analyze_register_chain
> +     iterates over all refs to look for dual-mode regs.  Instead this
> +     should be done separately for all regs mentioned in the chain once.  */
>    df_ref ref;
>    df_ref def;
>    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> @@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost (
>  {
>    gcc_assert (CONST_INT_P (exp));
>
> -  if (standard_sse_constant_p (exp, V2DImode))
> -    return COSTS_N_INSNS (1);
> -  return ix86_cost->sse_load[1];
> +  if (standard_sse_constant_p (exp, vmode))
> +    return ix86_cost->sse_op;
> +  /* We have separate costs for SImode and DImode, use SImode costs
> +     for smaller modes.  */
> +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
>  }
>
>  /* Compute a gain for chain conversion.  */
> @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
>    if (dump_file)
>      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
>
> +  /* SSE costs distinguish between SImode and DImode loads/stores, for
> +     int costs factor in the number of GPRs involved.  When supporting
> +     smaller modes than SImode the int load/store costs need to be
> +     adjusted as well.  */
> +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
>        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
>        rtx def_set = single_set (insn);
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
> +      int igain = 0;
>
>        if (REG_P (src) && REG_P (dst))
> -       gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> +       igain += 2 * m - ix86_cost->xmm_move;
>        else if (REG_P (src) && MEM_P (dst))
> -       gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +       igain
> +         += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
>        else if (MEM_P (src) && REG_P (dst))
> -       gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> +       igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
>        else if (GET_CODE (src) == ASHIFT
>                || GET_CODE (src) == ASHIFTRT
>                || GET_CODE (src) == LSHIFTRT)
>         {
>           if (CONST_INT_P (XEXP (src, 0)))
> -           gain -= vector_const_cost (XEXP (src, 0));
> -         gain += ix86_cost->shift_const;
> +           igain -= vector_const_cost (XEXP (src, 0));
> +         igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
>           if (INTVAL (XEXP (src, 1)) >= 32)
> -           gain -= COSTS_N_INSNS (1);
> +           igain -= COSTS_N_INSNS (1);
>         }
>        else if (GET_CODE (src) == PLUS
>                || GET_CODE (src) == MINUS
> @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
>                || GET_CODE (src) == XOR
>                || GET_CODE (src) == AND)
>         {
> -         gain += ix86_cost->add;
> +         igain += m * ix86_cost->add - ix86_cost->sse_op;
>           /* Additional gain for andnot for targets without BMI.  */
>           if (GET_CODE (XEXP (src, 0)) == NOT
>               && !TARGET_BMI)
> -           gain += 2 * ix86_cost->add;
> +           igain += m * ix86_cost->add;
>
>           if (CONST_INT_P (XEXP (src, 0)))
> -           gain -= vector_const_cost (XEXP (src, 0));
> +           igain -= vector_const_cost (XEXP (src, 0));
>           if (CONST_INT_P (XEXP (src, 1)))
> -           gain -= vector_const_cost (XEXP (src, 1));
> +           igain -= vector_const_cost (XEXP (src, 1));
>         }
>        else if (GET_CODE (src) == NEG
>                || GET_CODE (src) == NOT)
> -       gain += ix86_cost->add - COSTS_N_INSNS (1);
> +       igain += m * ix86_cost->add - ix86_cost->sse_op;
> +      else if (GET_CODE (src) == SMAX
> +              || GET_CODE (src) == SMIN
> +              || GET_CODE (src) == UMAX
> +              || GET_CODE (src) == UMIN)
> +       {
> +         /* We do not have any conditional move cost, estimate it as a
> +            reg-reg move.  Comparisons are costed as adds.  */
> +         igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> +         /* Integer SSE ops are all costed the same.  */
> +         igain -= ix86_cost->sse_op;
> +       }
>        else if (GET_CODE (src) == COMPARE)
>         {
>           /* Assume comparison cost is the same.  */
> @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
>        else if (CONST_INT_P (src))
>         {
>           if (REG_P (dst))
> -           gain += COSTS_N_INSNS (2);
> +           /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> +           igain += COSTS_N_INSNS (m);
>           else if (MEM_P (dst))
> -           gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> -         gain -= vector_const_cost (src);
> +           igain += (m * ix86_cost->int_store[2]
> +                    - ix86_cost->sse_store[sse_cost_idx]);
> +         igain -= vector_const_cost (src);
>         }
>        else
>         gcc_unreachable ();
> +
> +      if (igain != 0 && dump_file)
> +       {
> +         fprintf (dump_file, "  Instruction gain %d for ", igain);
> +         dump_insn_slim (dump_file, insn);
> +       }
> +      gain += igain;
>      }
>
>    if (dump_file)
>      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
>
> +  /* ???  What about integer to SSE?  */
>    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
>      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
>
> @@ -573,7 +611,7 @@ rtx
>  dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
>  {
>    if (x == reg)
> -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> +    return gen_rtx_SUBREG (vmode, new_reg, 0);
>
>    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
>    int i, j;
> @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
>         start_sequence ();
>         if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
>           {
> -           rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> -           emit_move_insn (adjust_address (tmp, SImode, 0),
> -                           gen_rtx_SUBREG (SImode, reg, 0));
> -           emit_move_insn (adjust_address (tmp, SImode, 4),
> -                           gen_rtx_SUBREG (SImode, reg, 4));
> +           rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> +           if (smode == DImode && !TARGET_64BIT)
> +             {
> +               emit_move_insn (adjust_address (tmp, SImode, 0),
> +                               gen_rtx_SUBREG (SImode, reg, 0));
> +               emit_move_insn (adjust_address (tmp, SImode, 4),
> +                               gen_rtx_SUBREG (SImode, reg, 4));
> +             }
> +           else
> +             emit_move_insn (tmp, reg);
>             emit_move_insn (vreg, tmp);
>           }
> -       else if (TARGET_SSE4_1)
> +       else if (!TARGET_64BIT && smode == DImode)
>           {
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 0)));
> -           emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                         gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                         gen_rtx_SUBREG (SImode, reg, 4),
> -                                         GEN_INT (2)));
> +           if (TARGET_SSE4_1)
> +             {
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 0)));
> +               emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                             gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                             gen_rtx_SUBREG (SImode, reg, 4),
> +                                             GEN_INT (2)));
> +             }
> +           else
> +             {
> +               rtx tmp = gen_reg_rtx (DImode);
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 0)));
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 4)));
> +               emit_insn (gen_vec_interleave_lowv4si
> +                          (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                           gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                           gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +             }
>           }
>         else
> -         {
> -           rtx tmp = gen_reg_rtx (DImode);
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 0)));
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 4)));
> -           emit_insn (gen_vec_interleave_lowv4si
> -                      (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                       gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                       gen_rtx_SUBREG (V4SImode, tmp, 0)));
> -         }
> +         emit_move_insn (gen_lowpart (smode, vreg), reg);
>         rtx_insn *seq = get_insns ();
>         end_sequence ();
>         rtx_insn *insn = DF_REF_INSN (ref);
> @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
>    bitmap_copy (conv, insns);
>
>    if (scalar_copy)
> -    scopy = gen_reg_rtx (DImode);
> +    scopy = gen_reg_rtx (smode);
>
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
>      {
> @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
>           start_sequence ();
>           if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
>             {
> -             rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> +             rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
>               emit_move_insn (tmp, reg);
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                             adjust_address (tmp, SImode, 0));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                             adjust_address (tmp, SImode, 4));
> +             if (!TARGET_64BIT && smode == DImode)
> +               {
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                                 adjust_address (tmp, SImode, 0));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                                 adjust_address (tmp, SImode, 4));
> +               }
> +             else
> +               emit_move_insn (scopy, tmp);
>             }
> -         else if (TARGET_SSE4_1)
> +         else if (!TARGET_64BIT && smode == DImode)
>             {
> -             rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> -             emit_insn
> -               (gen_rtx_SET
> -                (gen_rtx_SUBREG (SImode, scopy, 0),
> -                 gen_rtx_VEC_SELECT (SImode,
> -                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> -
> -             tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> -             emit_insn
> -               (gen_rtx_SET
> -                (gen_rtx_SUBREG (SImode, scopy, 4),
> -                 gen_rtx_VEC_SELECT (SImode,
> -                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> +             if (TARGET_SSE4_1)
> +               {
> +                 rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> +                                             gen_rtvec (1, const0_rtx));
> +                 emit_insn
> +                   (gen_rtx_SET
> +                      (gen_rtx_SUBREG (SImode, scopy, 0),
> +                       gen_rtx_VEC_SELECT (SImode,
> +                                           gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                           tmp)));
> +
> +                 tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> +                 emit_insn
> +                   (gen_rtx_SET
> +                      (gen_rtx_SUBREG (SImode, scopy, 4),
> +                       gen_rtx_VEC_SELECT (SImode,
> +                                           gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                           tmp)));
> +               }
> +             else
> +               {
> +                 rtx vcopy = gen_reg_rtx (V2DImode);
> +                 emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                                 gen_rtx_SUBREG (SImode, vcopy, 0));
> +                 emit_move_insn (vcopy,
> +                                 gen_rtx_LSHIFTRT (V2DImode,
> +                                                   vcopy, GEN_INT (32)));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                                 gen_rtx_SUBREG (SImode, vcopy, 0));
> +               }
>             }
>           else
> -           {
> -             rtx vcopy = gen_reg_rtx (V2DImode);
> -             emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                             gen_rtx_SUBREG (SImode, vcopy, 0));
> -             emit_move_insn (vcopy,
> -                             gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                             gen_rtx_SUBREG (SImode, vcopy, 0));
> -           }
> +           emit_move_insn (scopy, reg);
> +
>           rtx_insn *seq = get_insns ();
>           end_sequence ();
>           emit_conversion_insns (seq, insn);
> @@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op
>    if (GET_CODE (*op) == NOT)
>      {
>        convert_op (&XEXP (*op, 0), insn);
> -      PUT_MODE (*op, V2DImode);
> +      PUT_MODE (*op, vmode);
>      }
>    else if (MEM_P (*op))
>      {
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
>
>        emit_insn_before (gen_move_insn (tmp, *op), insn);
> -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
>
>        if (dump_file)
>         fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
>             gcc_assert (!DF_REF_CHAIN (ref));
>             break;
>           }
> -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> +      *op = gen_rtx_SUBREG (vmode, *op, 0);
>      }
>    else if (CONST_INT_P (*op))
>      {
>        rtx vec_cst;
> -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
>
>        /* Prefer all ones vector in case of -1.  */
>        if (constm1_operand (*op, GET_MODE (*op)))
> -       vec_cst = CONSTM1_RTX (V2DImode);
> +       vec_cst = CONSTM1_RTX (vmode);
>        else
> -       vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> -                                       gen_rtvec (2, *op, const0_rtx));
> +       {
> +         unsigned n = GET_MODE_NUNITS (vmode);
> +         rtx *v = XALLOCAVEC (rtx, n);
> +         v[0] = *op;
> +         for (unsigned i = 1; i < n; ++i)
> +           v[i] = const0_rtx;
> +         vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> +       }
>
> -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> +      if (!standard_sse_constant_p (vec_cst, vmode))
>         {
>           start_sequence ();
> -         vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> +         vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
>           rtx_insn *seq = get_insns ();
>           end_sequence ();
>           emit_insn_before (seq, insn);
> @@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op
>    else
>      {
>        gcc_assert (SUBREG_P (*op));
> -      gcc_assert (GET_MODE (*op) == V2DImode);
> +      gcc_assert (GET_MODE (*op) == vmode);
>      }
>  }
>
> @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>      {
>        /* There are no scalar integer instructions and therefore
>          temporary register usage is required.  */
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (dst));
>        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
>      }
>
>    switch (GET_CODE (src))
> @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case ASHIFTRT:
>      case LSHIFTRT:
>        convert_op (&XEXP (src, 0), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>
>      case PLUS:
> @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case IOR:
>      case XOR:
>      case AND:
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
>        convert_op (&XEXP (src, 0), insn);
>        convert_op (&XEXP (src, 1), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>
>      case NEG:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> +      src = gen_rtx_MINUS (vmode, subreg, src);
>        break;
>
>      case NOT:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> -      src = gen_rtx_XOR (V2DImode, src, subreg);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> +      src = gen_rtx_XOR (vmode, src, subreg);
>        break;
>
>      case MEM:
> @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
>        break;
>
>      case SUBREG:
> -      gcc_assert (GET_MODE (src) == V2DImode);
> +      gcc_assert (GET_MODE (src) == vmode);
>        break;
>
>      case COMPARE:
>        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
>
> -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> -                 || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> +                 || (SUBREG_P (src) && GET_MODE (src) == vmode));
>
>        if (REG_P (src))
> -       subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> +       subreg = gen_rtx_SUBREG (vmode, src, 0);
>        else
>         subreg = copy_rtx_if_shared (src);
>        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>    PATTERN (insn) = def_set;
>
>    INSN_CODE (insn) = -1;
> -  recog_memoized (insn);
> +  int patt = recog_memoized (insn);
> +  if  (patt == -1)
> +    fatal_insn_not_found (insn);
>    df_insn_rescan (insn);
>  }
>
> @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
>                      (const_int 0 [0])))  */
>
>  static bool
> -convertible_comparison_p (rtx_insn *insn)
> +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    if (!TARGET_SSE4_1)
>      return false;
> @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
>
>    if (!SUBREG_P (op1)
>        || !SUBREG_P (op2)
> -      || GET_MODE (op1) != SImode
> -      || GET_MODE (op2) != SImode
> +      || GET_MODE (op1) != mode
> +      || GET_MODE (op2) != mode
>        || ((SUBREG_BYTE (op1) != 0
> -          || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> +          || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
>           && (SUBREG_BYTE (op2) != 0
> -             || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> +             || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
>      return false;
>
>    op1 = SUBREG_REG (op1);
> @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
>
>    if (op1 != op2
>        || !REG_P (op1)
> -      || GET_MODE (op1) != DImode)
> +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
>      return false;
>
>    return true;
> @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
>  /* The DImode version of scalar_to_vector_candidate_p.  */
>
>  static bool
> -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> +dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    rtx def_set = single_set (insn);
>
> @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
>    rtx dst = SET_DEST (def_set);
>
>    if (GET_CODE (src) == COMPARE)
> -    return convertible_comparison_p (insn);
> +    return convertible_comparison_p (insn, mode);
>
>    /* We are interested in DImode promotion only.  */
> -  if ((GET_MODE (src) != DImode
> +  if ((GET_MODE (src) != mode
>         && !CONST_INT_P (src))
> -      || GET_MODE (dst) != DImode)
> +      || GET_MODE (dst) != mode)
>      return false;
>
>    if (!REG_P (dst) && !MEM_P (dst))
> @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
>         return false;
>        break;
>
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
> +      if ((mode == DImode && !TARGET_AVX512F)
> +         || (mode == SImode && !TARGET_SSE4_1))
> +       return false;
> +      /* Fallthru.  */
> +
>      case PLUS:
>      case MINUS:
>      case IOR:
> @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>           && !CONST_INT_P (XEXP (src, 1)))
>         return false;
>
> -      if (GET_MODE (XEXP (src, 1)) != DImode
> +      if (GET_MODE (XEXP (src, 1)) != mode
>           && !CONST_INT_P (XEXP (src, 1)))
>         return false;
>        break;
> @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>           || !REG_P (XEXP (XEXP (src, 0), 0))))
>        return false;
>
> -  if (GET_MODE (XEXP (src, 0)) != DImode
> +  if (GET_MODE (XEXP (src, 0)) != mode
>        && !CONST_INT_P (XEXP (src, 0)))
>      return false;
>
> @@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx
>    return false;
>  }
>
> -/* Return 1 if INSN may be converted into vector
> -   instruction.  */
> -
> -static bool
> -scalar_to_vector_candidate_p (rtx_insn *insn)
> -{
> -  if (TARGET_64BIT)
> -    return timode_scalar_to_vector_candidate_p (insn);
> -  else
> -    return dimode_scalar_to_vector_candidate_p (insn);
> -}
> +/* For a given bitmap of insn UIDs scans all instruction and
> +   remove insn from CANDIDATES in case it has both convertible
> +   and not convertible definitions.
>
> -/* The DImode version of remove_non_convertible_regs.  */
> +   All insns in a bitmap are conversion candidates according to
> +   scalar_to_vector_candidate_p.  Currently it implies all insns
> +   are single_set.  */
>
>  static void
>  dimode_remove_non_convertible_regs (bitmap candidates)
> @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
>    BITMAP_FREE (regs);
>  }
>
> -/* For a given bitmap of insn UIDs scans all instruction and
> -   remove insn from CANDIDATES in case it has both convertible
> -   and not convertible definitions.
> -
> -   All insns in a bitmap are conversion candidates according to
> -   scalar_to_vector_candidate_p.  Currently it implies all insns
> -   are single_set.  */
> -
> -static void
> -remove_non_convertible_regs (bitmap candidates)
> -{
> -  if (TARGET_64BIT)
> -    timode_remove_non_convertible_regs (candidates);
> -  else
> -    dimode_remove_non_convertible_regs (candidates);
> -}
> -
>  /* Main STV pass function.  Find and convert scalar
>     instructions into vector mode when profitable.  */
>
> @@ -1577,11 +1638,14 @@ static unsigned int
>  convert_scalars_to_vector ()
>  {
>    basic_block bb;
> -  bitmap candidates;
>    int converted_insns = 0;
>
>    bitmap_obstack_initialize (NULL);
> -  candidates = BITMAP_ALLOC (NULL);
> +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> +  for (unsigned i = 0; i < 3; ++i)
> +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
>
>    calculate_dominance_info (CDI_DOMINATORS);
>    df_set_flags (DF_DEFER_INSN_RESCAN);
> @@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
>      {
>        rtx_insn *insn;
>        FOR_BB_INSNS (bb, insn)
> -       if (scalar_to_vector_candidate_p (insn))
> +       if (TARGET_64BIT
> +           && timode_scalar_to_vector_candidate_p (insn))
>           {
>             if (dump_file)
> -             fprintf (dump_file, "  insn %d is marked as a candidate\n",
> +             fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
>                        INSN_UID (insn));
>
> -           bitmap_set_bit (candidates, INSN_UID (insn));
> +           bitmap_set_bit (&candidates[2], INSN_UID (insn));
> +         }
> +       else
> +         {
> +           /* Check {SI,DI}mode.  */
> +           for (unsigned i = 0; i <= 1; ++i)
> +             if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> +               {
> +                 if (dump_file)
> +                   fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> +                            INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> +
> +                 bitmap_set_bit (&candidates[i], INSN_UID (insn));
> +                 break;
> +               }
>           }
>      }
>
> -  remove_non_convertible_regs (candidates);
> +  if (TARGET_64BIT)
> +    timode_remove_non_convertible_regs (&candidates[2]);
> +  for (unsigned i = 0; i <= 1; ++i)
> +    dimode_remove_non_convertible_regs (&candidates[i]);
>
> -  if (bitmap_empty_p (candidates))
> -    if (dump_file)
> +  for (unsigned i = 0; i <= 2; ++i)
> +    if (!bitmap_empty_p (&candidates[i]))
> +      break;
> +    else if (i == 2 && dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
>
> -  while (!bitmap_empty_p (candidates))
> -    {
> -      unsigned uid = bitmap_first_set_bit (candidates);
> -      scalar_chain *chain;
> +  for (unsigned i = 0; i <= 2; ++i)
> +    while (!bitmap_empty_p (&candidates[i]))
> +      {
> +       unsigned uid = bitmap_first_set_bit (&candidates[i]);
> +       scalar_chain *chain;
>
> -      if (TARGET_64BIT)
> -       chain = new timode_scalar_chain;
> -      else
> -       chain = new dimode_scalar_chain;
> +       if (cand_mode[i] == TImode)
> +         chain = new timode_scalar_chain;
> +       else
> +         chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]);
>
> -      /* Find instructions chain we want to convert to vector mode.
> -        Check all uses and definitions to estimate all required
> -        conversions.  */
> -      chain->build (candidates, uid);
> +       /* Find instructions chain we want to convert to vector mode.
> +          Check all uses and definitions to estimate all required
> +          conversions.  */
> +       chain->build (&candidates[i], uid);
>
> -      if (chain->compute_convert_gain () > 0)
> -       converted_insns += chain->convert ();
> -      else
> -       if (dump_file)
> -         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> -                  chain->chain_id);
> +       if (chain->compute_convert_gain () > 0)
> +         converted_insns += chain->convert ();
> +       else
> +         if (dump_file)
> +           fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> +                    chain->chain_id);
>
> -      delete chain;
> -    }
> +       delete chain;
> +      }
>
>    if (dump_file)
>      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
>
> -  BITMAP_FREE (candidates);
> +  for (unsigned i = 0; i <= 2; ++i)
> +    bitmap_release (&candidates[i]);
>    bitmap_obstack_release (NULL);
>    df_process_deferred_rescans ();
>
> Index: gcc/config/i386/i386-features.h
> ===================================================================
> --- gcc/config/i386/i386-features.h     (revision 274111)
> +++ gcc/config/i386/i386-features.h     (working copy)
> @@ -127,11 +127,16 @@ namespace {
>  class scalar_chain
>  {
>   public:
> -  scalar_chain ();
> +  scalar_chain (enum machine_mode, enum machine_mode);
>    virtual ~scalar_chain ();
>
>    static unsigned max_id;
>
> +  /* Scalar mode.  */
> +  enum machine_mode smode;
> +  /* Vector mode.  */
> +  enum machine_mode vmode;
> +
>    /* ID of a chain.  */
>    unsigned int chain_id;
>    /* A queue of instructions to be included into a chain.  */
> @@ -162,6 +167,8 @@ class scalar_chain
>  class dimode_scalar_chain : public scalar_chain
>  {
>   public:
> +  dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> +    : scalar_chain (smode_, vmode_) {}
>    int compute_convert_gain ();
>   private:
>    void mark_dual_mode_def (df_ref def);
> @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
>  class timode_scalar_chain : public scalar_chain
>  {
>   public:
> +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> +
>    /* Convert from TImode to V1TImode is always faster.  */
>    int compute_convert_gain () { return 1; }
>
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md     (revision 274111)
> +++ gcc/config/i386/i386.md     (working copy)
> @@ -17721,6 +17721,27 @@ (define_peephole2
>      std::swap (operands[4], operands[5]);
>  })
>
> +;; min/max patterns
> +
> +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> +
> +(define_insn_and_split "<code><mode>3"
> +  [(set (match_operand:SWI48 0 "register_operand")
> +       (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> +                      (match_operand:SWI48 2 "register_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_STV && TARGET_SSE4_1
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:CCGC FLAGS_REG)
> +       (compare:CCGC (match_dup 1)(match_dup 2)))
> +   (set (match_dup 0)
> +       (if_then_else:SWI48
> +         (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> +         (match_dup 1)
> +         (match_dup 2)))])
> +
>  ;; Conditional addition patterns
>  (define_expand "add<mode>cc"
>    [(match_operand:SWI 0 "register_operand")
Richard Biener Aug. 5, 2019, 12:16 p.m. UTC | #25
On Mon, 5 Aug 2019, Uros Bizjak wrote:

> > dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> > functions to drop the dimode_ prefix - is that OK or do you
> > prefer some other prefix?
> 
> No, please just drop the prefix.

just noticed this applies to the derived dimode_scalar_chain class
as well where I can't simply drop the prefix.  So would
general_scalar_chain / 
general_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
be OK?

Richard.
Uros Bizjak Aug. 5, 2019, 12:22 p.m. UTC | #26
On Mon, Aug 5, 2019 at 2:16 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Mon, 5 Aug 2019, Uros Bizjak wrote:
>
> > > dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> > > functions to drop the dimode_ prefix - is that OK or do you
> > > prefer some other prefix?
> >
> > No, please just drop the prefix.
>
> just noticed this applies to the derived dimode_scalar_chain class
> as well where I can't simply drop the prefix.  So would
> general_scalar_chain /
> general_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> be OK?

I don't want to bikeshed too much here ;) Whatever fits you best.

Uros.
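
For reference, a sketch of my own (not an actual patch) of how the class
layout from the i386-features.h hunk would look with the proposed rename
applied; member lists are trimmed to the pieces under discussion:

class scalar_chain
{
 public:
  scalar_chain (machine_mode smode_, machine_mode vmode_);
  virtual ~scalar_chain ();
  machine_mode smode;	/* Scalar mode of the chain.  */
  machine_mode vmode;	/* Vector mode it converts to.  */
};

/* Was dimode_scalar_chain; now handles SImode and DImode chains.  */
class general_scalar_chain : public scalar_chain
{
 public:
  general_scalar_chain (machine_mode smode_, machine_mode vmode_)
    : scalar_chain (smode_, vmode_) {}
  int compute_convert_gain ();
};

class timode_scalar_chain : public scalar_chain
{
 public:
  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
  /* Conversion from TImode to V1TImode is always profitable.  */
  int compute_convert_gain () { return 1; }
};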
Uros Bizjak Aug. 5, 2019, 12:32 p.m. UTC | #27
On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 4 Aug 2019, Uros Bizjak wrote:
>
> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> > >
> > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > >>>> necessary even when going the STV route.  The actual regression
> > > >>>> for the testcase could also be solved by turing the smaxsi3
> > > >>>> back into a compare and jump rather than a conditional move sequence.
> > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > > >>>> after pass_split_after_reload and I'm not sure we can split
> > > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > > >>>> sched2 on x86 it seems).
> > > >>>>
> > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> > > >>>> case STV doesn't end up doing any transform?
> > > >>>
> > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > > >>> the insn back to compare+cmove.
> > > >>
> > > >> OK, that would work.  But there's no way to force a jumpy sequence then
> > > >> which we know is faster than compare+cmove because later RTL
> > > >> if-conversion passes happily re-discover the smax (or conditional move)
> > > >> sequence.
> > > >>
> > > >>> However, considering the SImode move
> > > >>> from/to int/xmm register is relatively cheap, the cost function should
> > > >>> be tuned so that STV always converts smaxsi3 pattern.
> > > >>
> > > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > > >> sequence... (for the loop in hmmer which is the only one I see
> > > >> any effect of any of my patches).  So identifying chains that
> > > >> start/end in memory is important for cost reasons.
> > > >
> > > > Please note that the cost function also considers the cost of move
> > > > from/to xmm. So, the cost of the whole chain would disable the
> > > > transformation.
> > > >
> > > >> So I think the splitting has to happen after the last if-conversion
> > > >> pass (and thus we may need to allocate a scratch register for this
> > > >> purpose?)
> > > >
> > > > I really hope that the underlying issue will be solved by a machine
> > > > dependent pass inserted somewhere after the pre-reload split. This
> > > > way, we can split unconverted smax to the cmove, and this later pass
> > > > would handle jcc and cmove instructions. Until then... yes your
> > > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > > although sometimes we would like to split to cmove instead.
> > >
> > > So the following makes STV also consider SImode chains, re-using the
> > > DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> > > and also did not alter the {SI,DI}mode chain cost function - it's
> > > quite off for TARGET_64BIT.  With this I get the expected conversion
> > > for the testcase derived from hmmer.
> > >
> > > No further testing so far.
> > >
> > > Is it OK to re-use the DImode chain code this way?  I'll clean things
> > > up some more of course.
> >
> > Yes, the approach looks OK to me. It makes chain building mode
> > agnostic, and the chain building can be used for
> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > minmax and surrounding SImode operations)
> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > DImode operations)
> >
> > > Still need help with the actual patterns for minmax and how the splitters
> > > should look like.
> >
> > Please look at the attached patch. Maybe we can add memory_operand as
> > operand 1 and operand 2 predicate, but let's keep things simple for
> > now.
>
> Thanks.  The attached patch makes the patch cleaner and it survives
> "some" barebone testing.  It also touches the cost function to
> avoid being too overly trigger-happy.  I've also ended up using
> ix86_cost->sse_op instead of COSTS_N_INSNS-based magic.  In
> particular we estimated GPR reg-reg move as COSTS_N_INSNS (2) while
> move costs shouldn't be wrapped in COSTS_N_INSNS.
> IMHO we should probably disregard any reg-reg moves for costing pre-RA.
> At least with the current code every reg-reg move biases in favor of
> SSE...

This is currently a bit of a mixed-up area in x86 target support. HJ is
looking into this [1] and I hope Honza can review the patch.

> And we're simply adding move and non-move costs in 'gain', somewhat
> mixing apples and oranges?  We could separate those and require
> both to be a net positive win?
>
> Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
> costs as apples and oranges... (so it never triggers for Bulldozer)
>
> I now run into
>
> /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
> error: unrecognizable insn:
> (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>         (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>             (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
> -1
>      (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> during RTL pass: stv
>
> where even with -mavx2 we do not have s{min,max}v2di3.  We do have
> an expander here but it seems only AVX512F has the DImode min/max
> ops.  I have adjusted dimode_scalar_to_vector_candidate_p
> accordingly.
>
> I'm considering to rename the
> dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> functions to drop the dimode_ prefix - is that OK or do you
> prefer some other prefix?
>
> So - bootstrap with --with-arch=skylake in progress.
>
> It detects quite a few chains (unsurprisingly) so I guess we need
> to address compile-time issues in the pass before enabling this
> enhancement (maybe as followup?).
>
> Further comments on the actual patch welcome, I consider it
> "finished" if testing reveals no issues.  ChangeLog still needs
> to be written and testcases to be added.

> +;; min/max patterns
> +
> +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> +
> +(define_insn_and_split "<code><mode>3"
> +  [(set (match_operand:SWI48 0 "register_operand")
> +       (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> +                      (match_operand:SWI48 2 "register_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_STV && TARGET_SSE4_1
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:CCGC FLAGS_REG)
> +       (compare:CCGC (match_dup 1)(match_dup 2)))
> +   (set (match_dup 0)
> +       (if_then_else:SWI48
> +         (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> +         (match_dup 1)
> +         (match_dup 2)))])
> +
>  ;; Conditional addition patterns
>  (define_expand "add<mode>cc"
>    [(match_operand:SWI 0 "register_operand")

Please find attached an (untested) i386.md patch that defines signed and
unsigned min/max patterns.

[1] https://gcc.gnu.org/ml/gcc-patches/2019-07/msg01542.html

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index e19a591fa9d..8a492626103 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -17721,6 +17721,30 @@
     std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr maxmin_rel
+  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
+(define_code_attr maxmin_cmpmode
+  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+	(maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+		      (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
+	(compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+	(if_then_else:SWI48
+	  (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
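
As a quick sanity check of my own (not part of the attached patch): built
with -O2 -msse4.1 -mstv, the functions below should go through the new
<code><mode>3 expanders and either be converted by STV or be split back
to cmp+cmov by the "&& 1" splitter:

/* Signed max, the (smax "ge") row of maxmin_rel.  */
int
smax_si (int a, int b)
{
  return a >= b ? a : b;
}

/* Unsigned min, the (umin "leu") row of maxmin_rel.  */
unsigned int
umin_si (unsigned int a, unsigned int b)
{
  return a <= b ? a : b;
}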
Uros Bizjak Aug. 5, 2019, 12:43 p.m. UTC | #28
On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 4 Aug 2019, Uros Bizjak wrote:
>
> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> > >
> > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > >>>> necessary even when going the STV route.  The actual regression
> > > >>>> for the testcase could also be solved by turning the smaxsi3
> > > >>>> back into a compare and jump rather than a conditional move sequence.
> > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > > >>>> after pass_split_after_reload and I'm not sure we can split
> > > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > > >>>> sched2 on x86 it seems).
> > > >>>>
> > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> > > >>>> case STV doesn't end up doing any transform?
> > > >>>
> > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > > >>> the insn back to compare+cmove.
> > > >>
> > > >> OK, that would work.  But there's no way to force a jumpy sequence then
> > > >> which we know is faster than compare+cmove because later RTL
> > > >> if-conversion passes happily re-discover the smax (or conditional move)
> > > >> sequence.
> > > >>
> > > >>> However, considering the SImode move
> > > >>> from/to int/xmm register is relatively cheap, the cost function should
> > > >>> be tuned so that STV always converts smaxsi3 pattern.
> > > >>
> > > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > > >> sequence... (for the loop in hmmer which is the only one I see
> > > >> any effect of any of my patches).  So identifying chains that
> > > >> start/end in memory is important for cost reasons.
> > > >
> > > > Please note that the cost function also considers the cost of move
> > > > from/to xmm. So, the cost of the whole chain would disable the
> > > > transformation.
> > > >
> > > >> So I think the splitting has to happen after the last if-conversion
> > > >> pass (and thus we may need to allocate a scratch register for this
> > > >> purpose?)
> > > >
> > > > I really hope that the underlying issue will be solved by a machine
> > > > dependent pass inserted somewhere after the pre-reload split. This
> > > > way, we can split unconverted smax to the cmove, and this later pass
> > > > would handle jcc and cmove instructions. Until then... yes your
> > > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > > although sometimes we would like to split to cmove instead.
> > >
> > > So the following makes STV also consider SImode chains, re-using the
> > > DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> > > and also did not alter the {SI,DI}mode chain cost function - it's
> > > quite off for TARGET_64BIT.  With this I get the expected conversion
> > > for the testcase derived from hmmer.
> > >
> > > No further testing so far.
> > >
> > > Is it OK to re-use the DImode chain code this way?  I'll clean things
> > > up some more of course.
> >
> > Yes, the approach looks OK to me. It makes chain building mode
> > agnostic, and the chain building can be used for
> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > minmax and surrounding SImode operations)
> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > DImode operations)
> >
> > > Still need help with the actual patterns for minmax and how the splitters
> > > should look like.
> >
> > Please look at the attached patch. Maybe we can add memory_operand as
> > operand 1 and operand 2 predicate, but let's keep things simple for
> > now.
>
> Thanks.  The attached patch makes the patch cleaner and it survives
> "some" barebone testing.  It also touches the cost function to
> avoid being too overly trigger-happy.  I've also ended up using
> ix86_cost->sse_op instead of COSTS_N_INSNS-based magic.  In
> particular we estimated GPR reg-reg move as COSTS_N_INSNS (2) while
> move costs shouldn't be wrapped in COSTS_N_INSNS.
> IMHO we should probably disregard any reg-reg moves for costing pre-RA.
> At least with the current code every reg-reg move biases in favor of
> SSE...
>
> And we're simply adding move and non-move costs in 'gain', somewhat
> mixing apples and oranges?  We could separate those and require
> both to be a net positive win?
>
> Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
> costs as apples and oranges... (so it never triggers for Bulldozer)
>
> I now run into
>
> /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
> error: unrecognizable insn:
> (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>         (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>             (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
> -1
>      (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> during RTL pass: stv
>
> where even with -mavx2 we do not have s{min,max}v2di3.  We do have
> an expander here but it seems only AVX512F has the DImode min/max
> ops.  I have adjusted dimode_scalar_to_vector_candidate_p
> accordingly.

Uh, you need to use some other mode iterator than SWI48 then, like:

(define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])

and then we need to split DImode for 32-bit targets, too.

Uros.

> I'm considering to rename the
> dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> functions to drop the dimode_ prefix - is that OK or do you
> prefer some other prefix?
>
> So - bootstrap with --with-arch=skylake in progress.
>
> It detects quite a few chains (unsurprisingly) so I guess we need
> to address compile-time issues in the pass before enabling this
> enhancement (maybe as followup?).
>
> Further comments on the actual patch welcome, I consider it
> "finished" if testing reveals no issues.  ChangeLog still needs
> to be written and testcases to be added.
>
> Thanks,
> Richard.
>
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c     (revision 274111)
> +++ gcc/config/i386/i386-features.c     (working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>
>  /* Initialize new chain.  */
>
> -scalar_chain::scalar_chain ()
> +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
>  {
> +  smode = smode_;
> +  vmode = vmode_;
> +
>    chain_id = ++max_id;
>
>     if (dump_file)
> @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
>        && !HARD_REGISTER_P (SET_DEST (def_set)))
>      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>
> +  /* ???  The following is quadratic since analyze_register_chain
> +     iterates over all refs to look for dual-mode regs.  Instead this
> +     should be done separately for all regs mentioned in the chain once.  */
>    df_ref ref;
>    df_ref def;
>    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> @@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost (
>  {
>    gcc_assert (CONST_INT_P (exp));
>
> -  if (standard_sse_constant_p (exp, V2DImode))
> -    return COSTS_N_INSNS (1);
> -  return ix86_cost->sse_load[1];
> +  if (standard_sse_constant_p (exp, vmode))
> +    return ix86_cost->sse_op;
> +  /* We have separate costs for SImode and DImode, use SImode costs
> +     for smaller modes.  */
> +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
>  }
>
>  /* Compute a gain for chain conversion.  */
> @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
>    if (dump_file)
>      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
>
> +  /* SSE costs distinguish between SImode and DImode loads/stores, for
> +     int costs factor in the number of GPRs involved.  When supporting
> +     smaller modes than SImode the int load/store costs need to be
> +     adjusted as well.  */
> +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
>        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
>        rtx def_set = single_set (insn);
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
> +      int igain = 0;
>
>        if (REG_P (src) && REG_P (dst))
> -       gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> +       igain += 2 * m - ix86_cost->xmm_move;
>        else if (REG_P (src) && MEM_P (dst))
> -       gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +       igain
> +         += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
>        else if (MEM_P (src) && REG_P (dst))
> -       gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> +       igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
>        else if (GET_CODE (src) == ASHIFT
>                || GET_CODE (src) == ASHIFTRT
>                || GET_CODE (src) == LSHIFTRT)
>         {
>           if (CONST_INT_P (XEXP (src, 0)))
> -           gain -= vector_const_cost (XEXP (src, 0));
> -         gain += ix86_cost->shift_const;
> +           igain -= vector_const_cost (XEXP (src, 0));
> +         igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
>           if (INTVAL (XEXP (src, 1)) >= 32)
> -           gain -= COSTS_N_INSNS (1);
> +           igain -= COSTS_N_INSNS (1);
>         }
>        else if (GET_CODE (src) == PLUS
>                || GET_CODE (src) == MINUS
> @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
>                || GET_CODE (src) == XOR
>                || GET_CODE (src) == AND)
>         {
> -         gain += ix86_cost->add;
> +         igain += m * ix86_cost->add - ix86_cost->sse_op;
>           /* Additional gain for andnot for targets without BMI.  */
>           if (GET_CODE (XEXP (src, 0)) == NOT
>               && !TARGET_BMI)
> -           gain += 2 * ix86_cost->add;
> +           igain += m * ix86_cost->add;
>
>           if (CONST_INT_P (XEXP (src, 0)))
> -           gain -= vector_const_cost (XEXP (src, 0));
> +           igain -= vector_const_cost (XEXP (src, 0));
>           if (CONST_INT_P (XEXP (src, 1)))
> -           gain -= vector_const_cost (XEXP (src, 1));
> +           igain -= vector_const_cost (XEXP (src, 1));
>         }
>        else if (GET_CODE (src) == NEG
>                || GET_CODE (src) == NOT)
> -       gain += ix86_cost->add - COSTS_N_INSNS (1);
> +       igain += m * ix86_cost->add - ix86_cost->sse_op;
> +      else if (GET_CODE (src) == SMAX
> +              || GET_CODE (src) == SMIN
> +              || GET_CODE (src) == UMAX
> +              || GET_CODE (src) == UMIN)
> +       {
> +         /* We do not have any conditional move cost, estimate it as a
> +            reg-reg move.  Comparisons are costed as adds.  */
> +         igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> +         /* Integer SSE ops are all costed the same.  */
> +         igain -= ix86_cost->sse_op;
> +       }
>        else if (GET_CODE (src) == COMPARE)
>         {
>           /* Assume comparison cost is the same.  */
> @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
>        else if (CONST_INT_P (src))
>         {
>           if (REG_P (dst))
> -           gain += COSTS_N_INSNS (2);
> +           /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> +           igain += COSTS_N_INSNS (m);
>           else if (MEM_P (dst))
> -           gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> -         gain -= vector_const_cost (src);
> +           igain += (m * ix86_cost->int_store[2]
> +                    - ix86_cost->sse_store[sse_cost_idx]);
> +         igain -= vector_const_cost (src);
>         }
>        else
>         gcc_unreachable ();
> +
> +      if (igain != 0 && dump_file)
> +       {
> +         fprintf (dump_file, "  Instruction gain %d for ", igain);
> +         dump_insn_slim (dump_file, insn);
> +       }
> +      gain += igain;
>      }
>
>    if (dump_file)
>      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
>
> +  /* ???  What about integer to SSE?  */
>    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
>      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
>
> @@ -573,7 +611,7 @@ rtx
>  dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
>  {
>    if (x == reg)
> -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> +    return gen_rtx_SUBREG (vmode, new_reg, 0);
>
>    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
>    int i, j;
> @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
>         start_sequence ();
>         if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
>           {
> -           rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> -           emit_move_insn (adjust_address (tmp, SImode, 0),
> -                           gen_rtx_SUBREG (SImode, reg, 0));
> -           emit_move_insn (adjust_address (tmp, SImode, 4),
> -                           gen_rtx_SUBREG (SImode, reg, 4));
> +           rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> +           if (smode == DImode && !TARGET_64BIT)
> +             {
> +               emit_move_insn (adjust_address (tmp, SImode, 0),
> +                               gen_rtx_SUBREG (SImode, reg, 0));
> +               emit_move_insn (adjust_address (tmp, SImode, 4),
> +                               gen_rtx_SUBREG (SImode, reg, 4));
> +             }
> +           else
> +             emit_move_insn (tmp, reg);
>             emit_move_insn (vreg, tmp);
>           }
> -       else if (TARGET_SSE4_1)
> +       else if (!TARGET_64BIT && smode == DImode)
>           {
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 0)));
> -           emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                         gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                         gen_rtx_SUBREG (SImode, reg, 4),
> -                                         GEN_INT (2)));
> +           if (TARGET_SSE4_1)
> +             {
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 0)));
> +               emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                             gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                             gen_rtx_SUBREG (SImode, reg, 4),
> +                                             GEN_INT (2)));
> +             }
> +           else
> +             {
> +               rtx tmp = gen_reg_rtx (DImode);
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 0)));
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 4)));
> +               emit_insn (gen_vec_interleave_lowv4si
> +                          (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                           gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                           gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +             }
>           }
>         else
> -         {
> -           rtx tmp = gen_reg_rtx (DImode);
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 0)));
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 4)));
> -           emit_insn (gen_vec_interleave_lowv4si
> -                      (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                       gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                       gen_rtx_SUBREG (V4SImode, tmp, 0)));
> -         }
> +         emit_move_insn (gen_lowpart (smode, vreg), reg);
>         rtx_insn *seq = get_insns ();
>         end_sequence ();
>         rtx_insn *insn = DF_REF_INSN (ref);
> @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
>    bitmap_copy (conv, insns);
>
>    if (scalar_copy)
> -    scopy = gen_reg_rtx (DImode);
> +    scopy = gen_reg_rtx (smode);
>
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
>      {
> @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
>           start_sequence ();
>           if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
>             {
> -             rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> +             rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
>               emit_move_insn (tmp, reg);
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                             adjust_address (tmp, SImode, 0));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                             adjust_address (tmp, SImode, 4));
> +             if (!TARGET_64BIT && smode == DImode)
> +               {
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                                 adjust_address (tmp, SImode, 0));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                                 adjust_address (tmp, SImode, 4));
> +               }
> +             else
> +               emit_move_insn (scopy, tmp);
>             }
> -         else if (TARGET_SSE4_1)
> +         else if (!TARGET_64BIT && smode == DImode)
>             {
> -             rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> -             emit_insn
> -               (gen_rtx_SET
> -                (gen_rtx_SUBREG (SImode, scopy, 0),
> -                 gen_rtx_VEC_SELECT (SImode,
> -                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> -
> -             tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> -             emit_insn
> -               (gen_rtx_SET
> -                (gen_rtx_SUBREG (SImode, scopy, 4),
> -                 gen_rtx_VEC_SELECT (SImode,
> -                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> +             if (TARGET_SSE4_1)
> +               {
> +                 rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> +                                             gen_rtvec (1, const0_rtx));
> +                 emit_insn
> +                   (gen_rtx_SET
> +                      (gen_rtx_SUBREG (SImode, scopy, 0),
> +                       gen_rtx_VEC_SELECT (SImode,
> +                                           gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                           tmp)));
> +
> +                 tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> +                 emit_insn
> +                   (gen_rtx_SET
> +                      (gen_rtx_SUBREG (SImode, scopy, 4),
> +                       gen_rtx_VEC_SELECT (SImode,
> +                                           gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                           tmp)));
> +               }
> +             else
> +               {
> +                 rtx vcopy = gen_reg_rtx (V2DImode);
> +                 emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                                 gen_rtx_SUBREG (SImode, vcopy, 0));
> +                 emit_move_insn (vcopy,
> +                                 gen_rtx_LSHIFTRT (V2DImode,
> +                                                   vcopy, GEN_INT (32)));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                                 gen_rtx_SUBREG (SImode, vcopy, 0));
> +               }
>             }
>           else
> -           {
> -             rtx vcopy = gen_reg_rtx (V2DImode);
> -             emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                             gen_rtx_SUBREG (SImode, vcopy, 0));
> -             emit_move_insn (vcopy,
> -                             gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                             gen_rtx_SUBREG (SImode, vcopy, 0));
> -           }
> +           emit_move_insn (scopy, reg);
> +
>           rtx_insn *seq = get_insns ();
>           end_sequence ();
>           emit_conversion_insns (seq, insn);
> @@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op
>    if (GET_CODE (*op) == NOT)
>      {
>        convert_op (&XEXP (*op, 0), insn);
> -      PUT_MODE (*op, V2DImode);
> +      PUT_MODE (*op, vmode);
>      }
>    else if (MEM_P (*op))
>      {
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
>
>        emit_insn_before (gen_move_insn (tmp, *op), insn);
> -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
>
>        if (dump_file)
>         fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
>             gcc_assert (!DF_REF_CHAIN (ref));
>             break;
>           }
> -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> +      *op = gen_rtx_SUBREG (vmode, *op, 0);
>      }
>    else if (CONST_INT_P (*op))
>      {
>        rtx vec_cst;
> -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
>
>        /* Prefer all ones vector in case of -1.  */
>        if (constm1_operand (*op, GET_MODE (*op)))
> -       vec_cst = CONSTM1_RTX (V2DImode);
> +       vec_cst = CONSTM1_RTX (vmode);
>        else
> -       vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> -                                       gen_rtvec (2, *op, const0_rtx));
> +       {
> +         unsigned n = GET_MODE_NUNITS (vmode);
> +         rtx *v = XALLOCAVEC (rtx, n);
> +         v[0] = *op;
> +         for (unsigned i = 1; i < n; ++i)
> +           v[i] = const0_rtx;
> +         vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> +       }
>
> -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> +      if (!standard_sse_constant_p (vec_cst, vmode))
>         {
>           start_sequence ();
> -         vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> +         vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
>           rtx_insn *seq = get_insns ();
>           end_sequence ();
>           emit_insn_before (seq, insn);
> @@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op
>    else
>      {
>        gcc_assert (SUBREG_P (*op));
> -      gcc_assert (GET_MODE (*op) == V2DImode);
> +      gcc_assert (GET_MODE (*op) == vmode);
>      }
>  }
>
> @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>      {
>        /* There are no scalar integer instructions and therefore
>          temporary register usage is required.  */
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (dst));
>        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
>      }
>
>    switch (GET_CODE (src))
> @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case ASHIFTRT:
>      case LSHIFTRT:
>        convert_op (&XEXP (src, 0), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>
>      case PLUS:
> @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case IOR:
>      case XOR:
>      case AND:
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
>        convert_op (&XEXP (src, 0), insn);
>        convert_op (&XEXP (src, 1), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>
>      case NEG:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> +      src = gen_rtx_MINUS (vmode, subreg, src);
>        break;
>
>      case NOT:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> -      src = gen_rtx_XOR (V2DImode, src, subreg);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> +      src = gen_rtx_XOR (vmode, src, subreg);
>        break;
>
>      case MEM:
> @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
>        break;
>
>      case SUBREG:
> -      gcc_assert (GET_MODE (src) == V2DImode);
> +      gcc_assert (GET_MODE (src) == vmode);
>        break;
>
>      case COMPARE:
>        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
>
> -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> -                 || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> +                 || (SUBREG_P (src) && GET_MODE (src) == vmode));
>
>        if (REG_P (src))
> -       subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> +       subreg = gen_rtx_SUBREG (vmode, src, 0);
>        else
>         subreg = copy_rtx_if_shared (src);
>        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>    PATTERN (insn) = def_set;
>
>    INSN_CODE (insn) = -1;
> -  recog_memoized (insn);
> +  int patt = recog_memoized (insn);
> +  if  (patt == -1)
> +    fatal_insn_not_found (insn);
>    df_insn_rescan (insn);
>  }
>
> @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
>                      (const_int 0 [0])))  */
>
>  static bool
> -convertible_comparison_p (rtx_insn *insn)
> +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    if (!TARGET_SSE4_1)
>      return false;
> @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
>
>    if (!SUBREG_P (op1)
>        || !SUBREG_P (op2)
> -      || GET_MODE (op1) != SImode
> -      || GET_MODE (op2) != SImode
> +      || GET_MODE (op1) != mode
> +      || GET_MODE (op2) != mode
>        || ((SUBREG_BYTE (op1) != 0
> -          || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> +          || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
>           && (SUBREG_BYTE (op2) != 0
> -             || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> +             || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
>      return false;
>
>    op1 = SUBREG_REG (op1);
> @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
>
>    if (op1 != op2
>        || !REG_P (op1)
> -      || GET_MODE (op1) != DImode)
> +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
>      return false;
>
>    return true;
> @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
>  /* The DImode version of scalar_to_vector_candidate_p.  */
>
>  static bool
> -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> +dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    rtx def_set = single_set (insn);
>
> @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
>    rtx dst = SET_DEST (def_set);
>
>    if (GET_CODE (src) == COMPARE)
> -    return convertible_comparison_p (insn);
> +    return convertible_comparison_p (insn, mode);
>
>    /* We are interested in DImode promotion only.  */
> -  if ((GET_MODE (src) != DImode
> +  if ((GET_MODE (src) != mode
>         && !CONST_INT_P (src))
> -      || GET_MODE (dst) != DImode)
> +      || GET_MODE (dst) != mode)
>      return false;
>
>    if (!REG_P (dst) && !MEM_P (dst))
> @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
>         return false;
>        break;
>
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
> +      if ((mode == DImode && !TARGET_AVX512F)
> +         || (mode == SImode && !TARGET_SSE4_1))
> +       return false;
> +      /* Fallthru.  */
> +
>      case PLUS:
>      case MINUS:
>      case IOR:
> @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>           && !CONST_INT_P (XEXP (src, 1)))
>         return false;
>
> -      if (GET_MODE (XEXP (src, 1)) != DImode
> +      if (GET_MODE (XEXP (src, 1)) != mode
>           && !CONST_INT_P (XEXP (src, 1)))
>         return false;
>        break;
> @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>           || !REG_P (XEXP (XEXP (src, 0), 0))))
>        return false;
>
> -  if (GET_MODE (XEXP (src, 0)) != DImode
> +  if (GET_MODE (XEXP (src, 0)) != mode
>        && !CONST_INT_P (XEXP (src, 0)))
>      return false;
>
> @@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx
>    return false;
>  }
>
> -/* Return 1 if INSN may be converted into vector
> -   instruction.  */
> -
> -static bool
> -scalar_to_vector_candidate_p (rtx_insn *insn)
> -{
> -  if (TARGET_64BIT)
> -    return timode_scalar_to_vector_candidate_p (insn);
> -  else
> -    return dimode_scalar_to_vector_candidate_p (insn);
> -}
> +/* For a given bitmap of insn UIDs scans all instruction and
> +   remove insn from CANDIDATES in case it has both convertible
> +   and not convertible definitions.
>
> -/* The DImode version of remove_non_convertible_regs.  */
> +   All insns in a bitmap are conversion candidates according to
> +   scalar_to_vector_candidate_p.  Currently it implies all insns
> +   are single_set.  */
>
>  static void
>  dimode_remove_non_convertible_regs (bitmap candidates)
> @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
>    BITMAP_FREE (regs);
>  }
>
> -/* For a given bitmap of insn UIDs scans all instruction and
> -   remove insn from CANDIDATES in case it has both convertible
> -   and not convertible definitions.
> -
> -   All insns in a bitmap are conversion candidates according to
> -   scalar_to_vector_candidate_p.  Currently it implies all insns
> -   are single_set.  */
> -
> -static void
> -remove_non_convertible_regs (bitmap candidates)
> -{
> -  if (TARGET_64BIT)
> -    timode_remove_non_convertible_regs (candidates);
> -  else
> -    dimode_remove_non_convertible_regs (candidates);
> -}
> -
>  /* Main STV pass function.  Find and convert scalar
>     instructions into vector mode when profitable.  */
>
> @@ -1577,11 +1638,14 @@ static unsigned int
>  convert_scalars_to_vector ()
>  {
>    basic_block bb;
> -  bitmap candidates;
>    int converted_insns = 0;
>
>    bitmap_obstack_initialize (NULL);
> -  candidates = BITMAP_ALLOC (NULL);
> +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> +  for (unsigned i = 0; i < 3; ++i)
> +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
>
>    calculate_dominance_info (CDI_DOMINATORS);
>    df_set_flags (DF_DEFER_INSN_RESCAN);
> @@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
>      {
>        rtx_insn *insn;
>        FOR_BB_INSNS (bb, insn)
> -       if (scalar_to_vector_candidate_p (insn))
> +       if (TARGET_64BIT
> +           && timode_scalar_to_vector_candidate_p (insn))
>           {
>             if (dump_file)
> -             fprintf (dump_file, "  insn %d is marked as a candidate\n",
> +             fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
>                        INSN_UID (insn));
>
> -           bitmap_set_bit (candidates, INSN_UID (insn));
> +           bitmap_set_bit (&candidates[2], INSN_UID (insn));
> +         }
> +       else
> +         {
> +           /* Check {SI,DI}mode.  */
> +           for (unsigned i = 0; i <= 1; ++i)
> +             if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> +               {
> +                 if (dump_file)
> +                   fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> +                            INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> +
> +                 bitmap_set_bit (&candidates[i], INSN_UID (insn));
> +                 break;
> +               }
>           }
>      }
>
> -  remove_non_convertible_regs (candidates);
> +  if (TARGET_64BIT)
> +    timode_remove_non_convertible_regs (&candidates[2]);
> +  for (unsigned i = 0; i <= 1; ++i)
> +    dimode_remove_non_convertible_regs (&candidates[i]);
>
> -  if (bitmap_empty_p (candidates))
> -    if (dump_file)
> +  for (unsigned i = 0; i <= 2; ++i)
> +    if (!bitmap_empty_p (&candidates[i]))
> +      break;
> +    else if (i == 2 && dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
>
> -  while (!bitmap_empty_p (candidates))
> -    {
> -      unsigned uid = bitmap_first_set_bit (candidates);
> -      scalar_chain *chain;
> +  for (unsigned i = 0; i <= 2; ++i)
> +    while (!bitmap_empty_p (&candidates[i]))
> +      {
> +       unsigned uid = bitmap_first_set_bit (&candidates[i]);
> +       scalar_chain *chain;
>
> -      if (TARGET_64BIT)
> -       chain = new timode_scalar_chain;
> -      else
> -       chain = new dimode_scalar_chain;
> +       if (cand_mode[i] == TImode)
> +         chain = new timode_scalar_chain;
> +       else
> +         chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]);
>
> -      /* Find instructions chain we want to convert to vector mode.
> -        Check all uses and definitions to estimate all required
> -        conversions.  */
> -      chain->build (candidates, uid);
> +       /* Find instructions chain we want to convert to vector mode.
> +          Check all uses and definitions to estimate all required
> +          conversions.  */
> +       chain->build (&candidates[i], uid);
>
> -      if (chain->compute_convert_gain () > 0)
> -       converted_insns += chain->convert ();
> -      else
> -       if (dump_file)
> -         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> -                  chain->chain_id);
> +       if (chain->compute_convert_gain () > 0)
> +         converted_insns += chain->convert ();
> +       else
> +         if (dump_file)
> +           fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> +                    chain->chain_id);
>
> -      delete chain;
> -    }
> +       delete chain;
> +      }
>
>    if (dump_file)
>      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
>
> -  BITMAP_FREE (candidates);
> +  for (unsigned i = 0; i <= 2; ++i)
> +    bitmap_release (&candidates[i]);
>    bitmap_obstack_release (NULL);
>    df_process_deferred_rescans ();
>
> Index: gcc/config/i386/i386-features.h
> ===================================================================
> --- gcc/config/i386/i386-features.h     (revision 274111)
> +++ gcc/config/i386/i386-features.h     (working copy)
> @@ -127,11 +127,16 @@ namespace {
>  class scalar_chain
>  {
>   public:
> -  scalar_chain ();
> +  scalar_chain (enum machine_mode, enum machine_mode);
>    virtual ~scalar_chain ();
>
>    static unsigned max_id;
>
> +  /* Scalar mode.  */
> +  enum machine_mode smode;
> +  /* Vector mode.  */
> +  enum machine_mode vmode;
> +
>    /* ID of a chain.  */
>    unsigned int chain_id;
>    /* A queue of instructions to be included into a chain.  */
> @@ -162,6 +167,8 @@ class scalar_chain
>  class dimode_scalar_chain : public scalar_chain
>  {
>   public:
> +  dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> +    : scalar_chain (smode_, vmode_) {}
>    int compute_convert_gain ();
>   private:
>    void mark_dual_mode_def (df_ref def);
> @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
>  class timode_scalar_chain : public scalar_chain
>  {
>   public:
> +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> +
>    /* Convert from TImode to V1TImode is always faster.  */
>    int compute_convert_gain () { return 1; }
>
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md     (revision 274111)
> +++ gcc/config/i386/i386.md     (working copy)
> @@ -17721,6 +17721,27 @@ (define_peephole2
>      std::swap (operands[4], operands[5]);
>  })
>
> +;; min/max patterns
> +
> +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> +
> +(define_insn_and_split "<code><mode>3"
> +  [(set (match_operand:SWI48 0 "register_operand")
> +       (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> +                      (match_operand:SWI48 2 "register_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_STV && TARGET_SSE4_1
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:CCGC FLAGS_REG)
> +       (compare:CCGC (match_dup 1)(match_dup 2)))
> +   (set (match_dup 0)
> +       (if_then_else:SWI48
> +         (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> +         (match_dup 1)
> +         (match_dup 2)))])
> +
>  ;; Conditional addition patterns
>  (define_expand "add<mode>cc"
>    [(match_operand:SWI 0 "register_operand")
Uros Bizjak Aug. 5, 2019, 12:51 p.m. UTC | #29
On Mon, Aug 5, 2019 at 2:43 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Sun, 4 Aug 2019, Uros Bizjak wrote:
> >
> > > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> > > >
> > > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
> > > > >
> > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > > >>>> necessary even when going the STV route.  The actual regression
> > > > >>>> for the testcase could also be solved by turning the smaxsi3
> > > > >>>> back into a compare and jump rather than a conditional move sequence.
> > > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > > > >>>> after pass_split_after_reload and I'm not sure we can split
> > > > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > > > >>>> sched2 on x86 it seems).
> > > > >>>>
> > > > >>>> So how would you go about implementing {s,u}{min,max}{si,di}3 for the
> > > > >>>> case STV doesn't end up doing any transform?
> > > > >>>
> > > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > > > >>> the insn back to compare+cmove.
> > > > >>
> > > > >> OK, that would work.  But there's no way to force a jumpy sequence then
> > > > >> which we know is faster than compare+cmove because later RTL
> > > > >> if-conversion passes happily re-discover the smax (or conditional move)
> > > > >> sequence.
> > > > >>
> > > > >>> However, considering the SImode move
> > > > >>> from/to int/xmm register is relatively cheap, the cost function should
> > > > >>> be tuned so that STV always converts the smaxsi3 pattern.
> > > > >>
> > > > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > > > >> sequence... (for the loop in hmmer, which is the only one where I
> > > > >> see any effect from any of my patches).  So identifying chains that
> > > > >> start/end in memory is important for cost reasons.
> > > > >
> > > > > Please note that the cost function also considers the cost of move
> > > > > from/to xmm. So, the cost of the whole chain would disable the
> > > > > transformation.
> > > > >
> > > > >> So I think the splitting has to happen after the last if-conversion
> > > > >> pass (and thus we may need to allocate a scratch register for this
> > > > >> purpose?)
> > > > >
> > > > > I really hope that the underlying issue will be solved by a
> > > > > machine-dependent pass inserted somewhere after the pre-reload split. This
> > > > > way, we can split unconverted smax to the cmove, and this later pass
> > > > > would handle jcc and cmove instructions. Until then... yes your
> > > > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > > > although sometimes we would like to split to cmove instead.
> > > >
> > > > So the following makes STV also consider SImode chains, re-using the
> > > > DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> > > > and also did not alter the {SI,DI}mode chain cost function - it's
> > > > quite off for TARGET_64BIT.  With this I get the expected conversion
> > > > for the testcase derived from hmmer.
> > > >
> > > > No further testing so far.
> > > >
> > > > Is it OK to re-use the DImode chain code this way?  I'll clean things
> > > > up some more of course.
> > >
> > > Yes, the approach looks OK to me. It makes chain building
> > > mode-agnostic, and the chain building can be used for
> > > a) DImode x86_32 (as is now), but maybe a 64-bit minmax operation can be added.
> > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > > minmax and surrounding SImode operations)
> > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > > DImode operations)
> > >
> > > > Still need help with the actual patterns for minmax and what the
> > > > splitters should look like.
> > >
> > > Please look at the attached patch. Maybe we can add memory_operand as
> > > operand 1 and operand 2 predicate, but let's keep things simple for
> > > now.
> >
> > Thanks.  The attached patch makes things cleaner and it survives
> > "some" barebone testing.  It also touches the cost function to
> > avoid being overly trigger-happy.  I've also ended up using
> > ix86_cost->sse_op instead of COSTS_N_INSNS-based magic.  In
> > particular we estimated a GPR reg-reg move as COSTS_N_INSNS(2) while
> > move costs shouldn't be wrapped in COSTS_N_INSNS.
> > IMHO we should probably disregard any reg-reg moves for costing pre-RA.
> > At least with the current code every reg-reg move biases in favor of
> > SSE...
> >
> > And we're simply adding move and non-move costs in 'gain', somewhat
> > mixing apples and oranges?  We could separate those and require
> > both to be a net positive win?
> >
> > Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
> > costs as apples and oranges... (so it never triggers for Bulldozer)
> >
> > I now run into
> >
> > /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
> > error: unrecognizable insn:
> > (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
> >         (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
> >             (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
> > -1
> >      (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
> >         (expr_list:REG_UNUSED (reg:CC 17 flags)
> >             (nil))))
> > during RTL pass: stv
> >
> > where even with -mavx2 we do not have s{min,max}v2di3.  We do have
> > an expander here but it seems only AVX512F has the DImode min/max
> > ops.  I have adjusted dimode_scalar_to_vector_candidate_p
> > accordingly.
>
> Uh, you need to use some other mode iterator than SWI48 then, like:
>
> (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
>
> and then we need to split DImode for 32-bit targets, too.

For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
condition, I'll provide _doubleword splitter later.

Uros.
Jakub Jelinek Aug. 5, 2019, 12:53 p.m. UTC | #30
On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote:
> > (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> >
> > and then we need to split DImode for 32-bit targets, too.
> 
> For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> condition, I'll provide _doubleword splitter later.

Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
to force use of %zmmN?

	Jakub
Uros Bizjak Aug. 5, 2019, 12:56 p.m. UTC | #31
On Mon, Aug 5, 2019 at 2:54 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote:
> > > (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> > >
> > > and then we need to split DImode for 32-bit targets, too.
> >
> > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > condition, I'll provide _doubleword splitter later.
>
> Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> to force use of %zmmN?

It generates V4SI mode, so - yes, AVX512VL.

Thanks,
Uros.
Richard Biener Aug. 5, 2019, 1:04 p.m. UTC | #32
On Mon, 5 Aug 2019, Uros Bizjak wrote:

> On Mon, Aug 5, 2019 at 2:54 PM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote:
> > > > (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> > > >
> > > > and then we need to split DImode for 32-bit targets, too.
> > >
> > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > condition, I'll provide _doubleword splitter later.
> >
> > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > to force use of %zmmN?
> 
> It generates V4SI mode, so - yes, AVX512VL.

    case SMAX:
    case SMIN:
    case UMAX:
    case UMIN:
      if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
          || (mode == SImode && !TARGET_SSE4_1))
        return false;

so there's no way to use AVX512VL for 32bit?

Richard.
Uros Bizjak Aug. 5, 2019, 1:09 p.m. UTC | #33
On Mon, Aug 5, 2019 at 3:04 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Mon, 5 Aug 2019, Uros Bizjak wrote:
>
> > On Mon, Aug 5, 2019 at 2:54 PM Jakub Jelinek <jakub@redhat.com> wrote:
> > >
> > > On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote:
> > > > > (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> > > > >
> > > > > and then we need to split DImode for 32-bit targets, too.
> > > >
> > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > condition, I'll provide _doubleword splitter later.
> > >
> > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > to force use of %zmmN?
> >
> > It generates V4SI mode, so - yes, AVX512VL.
>
>     case SMAX:
>     case SMIN:
>     case UMAX:
>     case UMIN:
>       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
>           || (mode == SImode && !TARGET_SSE4_1))
>         return false;
>
> so there's no way to use AVX512VL for 32bit?

There is a way, but on 32-bit targets we need to split the DImode
operation into a sequence of SImode operations for the unconverted
pattern.  This is of course doable, but somewhat more complex than
simply emitting a DImode compare + DImode cmove, which is what the
current splitter does.  So, a follow-up task.
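
For illustration only, the semantics such a doubleword splitter would
have to implement for e.g. smax:DI via SImode halves, as a plain C
sketch (all names made up):

/* smax:DI on a 32-bit target: one doubleword signed compare
   (cmp on the low halves, sbb on the high halves, flags only),
   followed by one cmov per half.  */
static inline void
smaxdi3_via_si (unsigned int *lo, int *hi, unsigned int b_lo, int b_hi)
{
  /* Signed (*hi:*lo) < (b_hi:b_lo), as cmp/sbb would compute it.  */
  int lt = *hi < b_hi || (*hi == b_hi && *lo < b_lo);
  if (lt)  /* two cmovs in the real sequence */
    {
      *lo = b_lo;
      *hi = b_hi;
    }
}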

Uros.
Richard Biener Aug. 5, 2019, 1:29 p.m. UTC | #34
On Mon, 5 Aug 2019, Uros Bizjak wrote:

> On Mon, Aug 5, 2019 at 3:04 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Mon, 5 Aug 2019, Uros Bizjak wrote:
> >
> > > On Mon, Aug 5, 2019 at 2:54 PM Jakub Jelinek <jakub@redhat.com> wrote:
> > > >
> > > > On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote:
> > > > > > (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> > > > > >
> > > > > > and then we need to split DImode for 32-bit targets, too.
> > > > >
> > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > condition, I'll provide _doubleword splitter later.
> > > >
> > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > to force use of %zmmN?
> > >
> > > It generates V4SI mode, so - yes, AVX512VL.
> >
> >     case SMAX:
> >     case SMIN:
> >     case UMAX:
> >     case UMIN:
> >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> >           || (mode == SImode && !TARGET_SSE4_1))
> >         return false;
> >
> > so there's no way to use AVX512VL for 32bit?
> 
> There is a way, but on 32-bit targets we need to split the DImode
> operation into a sequence of SImode operations for the unconverted
> pattern.  This is of course doable, but somewhat more complex than
> simply emitting a DImode compare + DImode cmove, which is what the
> current splitter does.  So, a follow-up task.

Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
check; we just need to properly split if we enable the scalar minmax
pattern for DImode on 32-bit, and the STV conversion would go fine.
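
The candidate check would then shrink to something like this (sketch
only, assuming the doubleword splitter is in place):

    case SMAX:
    case SMIN:
    case UMAX:
    case UMIN:
      if ((mode == DImode && !TARGET_AVX512VL)
          || (mode == SImode && !TARGET_SSE4_1))
        return false;
      /* Fallthru.  */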

Richard.
Uros Bizjak Aug. 5, 2019, 7:34 p.m. UTC | #35
On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:

> > > > > > > (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> > > > > > >
> > > > > > > and then we need to split DImode for 32-bit targets, too.
> > > > > >
> > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > condition, I'll provide _doubleword splitter later.
> > > > >
> > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > to force use of %zmmN?
> > > >
> > > > It generates V4SI mode, so - yes, AVX512VL.
> > >
> > >     case SMAX:
> > >     case SMIN:
> > >     case UMAX:
> > >     case UMIN:
> > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > >           || (mode == SImode && !TARGET_SSE4_1))
> > >         return false;
> > >
> > > so there's no way to use AVX512VL for 32bit?
> >
> > There is a way, but on 32-bit targets we need to split the DImode
> > operation into a sequence of SImode operations for the unconverted
> > pattern.  This is of course doable, but somewhat more complex than
> > simply emitting a DImode compare + DImode cmove, which is what the
> > current splitter does.  So, a follow-up task.
>
> Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> check; we just need to properly split if we enable the scalar minmax
> pattern for DImode on 32-bit, and the STV conversion would go fine.

Yes, that is correct.

Uros.
Richard Biener Aug. 7, 2019, 9:31 a.m. UTC | #36
On Mon, 5 Aug 2019, Uros Bizjak wrote:

> On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> 
> > > > > > > > (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> > > > > > > >
> > > > > > > > and then we need to split DImode for 32-bit targets, too.
> > > > > > >
> > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > >
> > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > to force use of %zmmN?
> > > > >
> > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > >
> > > >     case SMAX:
> > > >     case SMIN:
> > > >     case UMAX:
> > > >     case UMIN:
> > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > >         return false;
> > > >
> > > > so there's no way to use AVX512VL for 32bit?
> > >
> > > There is a way, but on 32-bit targets we need to split the DImode
> > > operation into a sequence of SImode operations for the unconverted
> > > pattern.  This is of course doable, but somewhat more complex than
> > > simply emitting a DImode compare + DImode cmove, which is what the
> > > current splitter does.  So, a follow-up task.
> >
> > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > check; we just need to properly split if we enable the scalar minmax
> > pattern for DImode on 32-bit, and the STV conversion would go fine.
> 
> Yes, that is correct.

So I tested the patch below (now with an appropriate ChangeLog) on
x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006, which shows
the obvious hmmer improvement; I'm now checking for out-of-noise results
with a 3-run on those benchmarks that may have one (more than +-1 second
difference in the 1-run).

As-is the patch likely runs into the splitting issue for DImode
on i?86, and it still lacks functional testcases.  I'll do the
hmmer loop with both DImode and SImode plus testcases to trigger
all pattern variants with the different ISAs we have.
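
A functional testcase for the SImode path could look like the
following sketch, modeled on the hmmer loop (not part of the posted
patch):

void
foo (int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int n)
{
  for (int k = 1; k < n; k++)
    {
      int sc;
      mc[k] = mpp[k-1] + tpmm[k-1];
      if ((sc = ip[k-1] + tpim[k-1]) > mc[k])
        mc[k] = sc;  /* MAX_EXPR, expanded via smaxsi3, then STV.  */
    }
}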

Some of the patch could be split out (the cost changes that are
also effective for DImode for example).

AFAICS we could go with adding only SImode, avoiding the DImode
splitting issue, and this would solve the hmmer regression.

Thanks,
Richard.

2019-08-07  Richard Biener  <rguenther@suse.de>

	PR target/91154
	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
	mode arguments.
	(scalar_chain::smode): New member.
	(scalar_chain::vmode): Likewise.
	(dimode_scalar_chain): Rename to...
	(general_scalar_chain): ... this.
	(general_scalar_chain::general_scalar_chain): Take mode arguments.
	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
	base with TImode and V1TImode.
	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
	(general_scalar_chain::vector_const_cost): Adjust for SImode
	chains.
	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
	gain if not zero.
	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
	(general_scalar_chain::make_vector_copies): Likewise.  Handle
	non-DImode chains appropriately.
	(general_scalar_chain::convert_reg): Likewise.
	(general_scalar_chain::convert_op): Likewise.
	(general_scalar_chain::convert_insn): Likewise.  Add
	fatal_insn_not_found if the result is not recognized.
	(convertible_comparison_p): Pass in the scalar mode and use that.
	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
	(scalar_to_vector_candidate_p): Remove by inlining into single
	caller.
	(general_remove_non_convertible_regs): Rename from
	dimode_remove_non_convertible_regs.
	(remove_non_convertible_regs): Remove by inlining into single caller.
	(convert_scalars_to_vector): Handle SImode and DImode chains
	in addition to TImode chains.
	* config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274111)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
    if (dump_file)
@@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
    conversion.  */
 
 void
-dimode_scalar_chain::mark_dual_mode_def (df_ref def)
+general_scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ???  The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
    instead of using a scalar one.  */
 
 int
-dimode_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp)
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
 
 int
-dimode_scalar_chain::compute_convert_gain ()
+general_scalar_chain::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
     	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
-	  gain += ix86_cost->shift_const;
+	    igain -= vector_const_cost (XEXP (src, 0));
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
 	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    gain -= COSTS_N_INSNS (1);
+	    igain -= COSTS_N_INSNS (1);
 	}
       else if (GET_CODE (src) == PLUS
 	       || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
 	       || GET_CODE (src) == XOR
 	       || GET_CODE (src) == AND)
 	{
-	  gain += ix86_cost->add;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
 	  /* Additional gain for andnot for targets without BMI.  */
 	  if (GET_CODE (XEXP (src, 0)) == NOT
 	      && !TARGET_BMI)
-	    gain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
 	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
+	    igain -= vector_const_cost (XEXP (src, 0));
 	  if (CONST_INT_P (XEXP (src, 1)))
-	    gain -= vector_const_cost (XEXP (src, 1));
+	    igain -= vector_const_cost (XEXP (src, 1));
 	}
       else if (GET_CODE (src) == NEG
 	       || GET_CODE (src) == NOT)
-	gain += ix86_cost->add - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
 	{
 	  /* Assume comparison cost is the same.  */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
 	{
 	  if (REG_P (dst))
-	    gain += COSTS_N_INSNS (2);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += COSTS_N_INSNS (m);
 	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-	  gain -= vector_const_cost (src);
+	    igain += (m * ix86_cost->int_store[2]
+		     - ix86_cost->sse_store[sse_cost_idx]);
+	  igain -= vector_const_cost (src);
 	}
       else
 	gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ???  What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return gen_rtx_SUBREG (vmode, new_reg, 0);
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
+general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
 						  rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
@@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
    and replace its uses in a chain.  */
 
 void
-dimode_scalar_chain::make_vector_copies (unsigned regno)
+general_scalar_chain::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
-  rtx vreg = gen_reg_rtx (DImode);
+  rtx vreg = gen_reg_rtx (smode);
   df_ref ref;
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
@@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
 	start_sequence ();
 	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
 	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
 	    emit_move_insn (vreg, tmp);
 	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
 	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (SImode, reg, 4),
+					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
+			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
 	  }
 	else
-	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
-	  }
+	  emit_move_insn (gen_lowpart (smode, vreg), reg);
 	rtx_insn *seq = get_insns ();
 	end_sequence ();
 	rtx_insn *insn = DF_REF_INSN (ref);
@@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies
    in case register is used in not convertible insn.  */
 
 void
-dimode_scalar_chain::convert_reg (unsigned regno)
+general_scalar_chain::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
 	  start_sequence ();
 	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
 	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
 	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
 	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
 	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 0),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 4),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
+				  gen_rtx_LSHIFTRT (V2DImode,
+						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
 	    }
 	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_conversion_insns (seq, insn);
@@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign
    registers conversion.  */
 
 void
-dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
 	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
 	    gcc_assert (!DF_REF_CHAIN (ref));
 	    break;
 	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      *op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
 	{
 	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_insn_before (seq, insn);
@@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
 /* Convert INSN to vector mode.  */
 
 void
-dimode_scalar_chain::convert_insn (rtx_insn *insn)
+general_scalar_chain::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
 	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
 	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }
 
 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
 		     (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
 	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
 	return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512VL)
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
       break;
@@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  || !REG_P (XEXP (XEXP (src, 0), 0))))
       return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs, scan all instructions and
+   remove an insn from CANDIDATES if it has both convertible
+   and non-convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1638,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
 	  {
 	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
 		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
 	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
+	   Check all uses and definitions to estimate all required
+	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274111)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
  private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17721,6 +17721,30 @@ (define_peephole2
     std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr maxmin_rel
+  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
+(define_code_attr maxmin_cmpmode
+  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+	(maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+		      (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
+	(compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+	(if_then_else:SWI48
+	  (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
Richard Biener Aug. 7, 2019, 11:51 a.m. UTC | #37
On Wed, 7 Aug 2019, Richard Biener wrote:

> On Mon, 5 Aug 2019, Uros Bizjak wrote:
> 
> > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> > 
> > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > >
> > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > >
> > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > >
> > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > to force use of %zmmN?
> > > > > >
> > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > >
> > > > >     case SMAX:
> > > > >     case SMIN:
> > > > >     case UMAX:
> > > > >     case UMIN:
> > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > >         return false;
> > > > >
> > > > > so there's no way to use AVX512VL for 32bit?
> > > >
> > > > There is a way, but on 32bit targets we need to split the DImode
> > > > operation into a sequence of SImode operations for the unconverted
> > > > pattern.  This is of course doable, but somewhat more complex than
> > > > simply emitting a DImode compare + DImode cmove, which is what the
> > > > current splitter does.  So, a follow-up task.
> > >
> > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > check; we just need to properly split if we enable the scalar minmax
> > > pattern for DImode on 32bits, and then the STV conversion would go fine.
> > 
> > Yes, that is correct.
> 
> So I tested the patch below (now with appropriate ChangeLog) on
> x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> the obvious hmmer improvement; I am now re-checking with a 3-run
> those results that may be more than noise (more than +-1 second
> difference in the 1-run).
> 
> As-is the patch likely runs into the splitting issue for DImode
> on i?86, and it still misses functional testcases.  I'll do the
> hmmer loop with both DImode and SImode and testcases to trigger
> all pattern variants with the different ISAs we have.
> 
> Some of the patch could be split out (the cost changes that are
> also effective for DImode for example).
> 
> AFAICS we could go with only adding SImode avoiding the DImode
> splitting thing and this would solve the hmmer regression.

I've additionally bootstrapped with --with-arch=nehalem which
reveals

FAIL: gcc.target/i386/minmax-2.c scan-assembler test
FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp

we emit cmp + cmov here now with -msse4.1 (as soon as the max
pattern is enabled I guess)
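
For reference, the test boils down to something like the following
(reconstructed from memory, not the verbatim testsuite file, so take
the exact function body and dg- directives with a grain of salt):

  /* { dg-do compile } */
  /* { dg-options "-O2" } */
  /* { dg-final { scan-assembler "test" } } */
  /* { dg-final { scan-assembler-not "cmp" } } */

  int
  t (int a)
  {
    return a > 0 ? a : 0;
  }

Before, max (a, 0) compiled to test + cmov since comparing against
zero needs no cmp; with the new insn_and_split grabbing the MAX_EXPR
first we split back to an explicit cmp + cmov.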

Otherwise testing is clean, so I suppose this is the net effect
of just doing the SImode chains.  I don't have AVX512 HW handy
to really test the DImode path.
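
(If someone with AVX512VL hardware wants to give it a spin, here is
an untested guess at a kernel that should exercise the DImode chain:

  long long
  foo (long long *a, long long *b)
  {
    long long x = *a + *b;
    return x > *b ? x : *b;
  }

with -O2 -mavx512vl I'd expect STV to convert the chain and the
result to use vpaddq/vpmaxsq instead of add + cmp + cmov.)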

Would you be fine with simplifying the patch down to SImode chain handling?
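
(Spelling that out: drop DImode from the cand_mode[]/cand_vmode[]
tables, or simply reject mode == DImode in
general_scalar_to_vector_candidate_p, and keep the TImode path as-is.)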

Thanks,
Richard.

> Thanks,
> Richard.
> 
> 2019-08-07  Richard Biener  <rguenther@suse.de>
> 
> 	PR target/91154
> 	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
> 	mode arguments.
> 	(scalar_chain::smode): New member.
> 	(scalar_chain::vmode): Likewise.
> 	(dimode_scalar_chain): Rename to...
> 	(general_scalar_chain): ... this.
> 	(general_scalar_chain::general_scalar_chain): Take mode arguments.
> 	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
> 	base with TImode and V1TImode.
> 	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
> 	(general_scalar_chain::vector_const_cost): Adjust for SImode
> 	chains.
> 	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
> 	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
> 	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
> 	gain if not zero.
> 	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
> 	(general_scalar_chain::make_vector_copies): Likewise.  Handle
> 	non-DImode chains appropriately.
> 	(general_scalar_chain::convert_reg): Likewise.
> 	(general_scalar_chain::convert_op): Likewise.
> 	(general_scalar_chain::convert_insn): Likewise.  Add
> 	fatal_insn_not_found if the result is not recognized.
> 	(convertible_comparison_p): Pass in the scalar mode and use that.
> 	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
> 	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
> 	(scalar_to_vector_candidate_p): Remove by inlining into single
> 	caller.
> 	(general_remove_non_convertible_regs): Rename from
> 	dimode_remove_non_convertible_regs.
> 	(remove_non_convertible_regs): Remove by inlining into single caller.
> 	(convert_scalars_to_vector): Handle SImode and DImode chains
> 	in addition to TImode chains.
> 	* config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
> 
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c	(revision 274111)
> +++ gcc/config/i386/i386-features.c	(working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>  
>  /* Initialize new chain.  */
>  
> -scalar_chain::scalar_chain ()
> +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
>  {
> +  smode = smode_;
> +  vmode = vmode_;
> +
>    chain_id = ++max_id;
>  
>     if (dump_file)
> @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
>     conversion.  */
>  
>  void
> -dimode_scalar_chain::mark_dual_mode_def (df_ref def)
> +general_scalar_chain::mark_dual_mode_def (df_ref def)
>  {
>    gcc_assert (DF_REF_REG_DEF_P (def));
>  
> @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
>        && !HARD_REGISTER_P (SET_DEST (def_set)))
>      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>  
> +  /* ???  The following is quadratic since analyze_register_chain
> +     iterates over all refs to look for dual-mode regs.  Instead this
> +     should be done separately for all regs mentioned in the chain once.  */
>    df_ref ref;
>    df_ref def;
>    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
>     instead of using a scalar one.  */
>  
>  int
> -dimode_scalar_chain::vector_const_cost (rtx exp)
> +general_scalar_chain::vector_const_cost (rtx exp)
>  {
>    gcc_assert (CONST_INT_P (exp));
>  
> -  if (standard_sse_constant_p (exp, V2DImode))
> -    return COSTS_N_INSNS (1);
> -  return ix86_cost->sse_load[1];
> +  if (standard_sse_constant_p (exp, vmode))
> +    return ix86_cost->sse_op;
> +  /* We have separate costs for SImode and DImode, use SImode costs
> +     for smaller modes.  */
> +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
>  }
>  
>  /* Compute a gain for chain conversion.  */
>  
>  int
> -dimode_scalar_chain::compute_convert_gain ()
> +general_scalar_chain::compute_convert_gain ()
>  {
>    bitmap_iterator bi;
>    unsigned insn_uid;
> @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
>    if (dump_file)
>      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
>  
> +  /* SSE costs distinguish between SImode and DImode loads/stores, for
> +     int costs factor in the number of GPRs involved.  When supporting
> +     smaller modes than SImode the int load/store costs need to be
> +     adjusted as well.  */
> +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
>        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
>        rtx def_set = single_set (insn);
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
> +      int igain = 0;
>  
>        if (REG_P (src) && REG_P (dst))
> -	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> +	igain += 2 * m - ix86_cost->xmm_move;
>        else if (REG_P (src) && MEM_P (dst))
> -	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +	igain
> +	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
>        else if (MEM_P (src) && REG_P (dst))
> -	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> +	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
>        else if (GET_CODE (src) == ASHIFT
>  	       || GET_CODE (src) == ASHIFTRT
>  	       || GET_CODE (src) == LSHIFTRT)
>  	{
>      	  if (CONST_INT_P (XEXP (src, 0)))
> -	    gain -= vector_const_cost (XEXP (src, 0));
> -	  gain += ix86_cost->shift_const;
> +	    igain -= vector_const_cost (XEXP (src, 0));
> +	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
>  	  if (INTVAL (XEXP (src, 1)) >= 32)
> -	    gain -= COSTS_N_INSNS (1);
> +	    igain -= COSTS_N_INSNS (1);
>  	}
>        else if (GET_CODE (src) == PLUS
>  	       || GET_CODE (src) == MINUS
> @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
>  	       || GET_CODE (src) == XOR
>  	       || GET_CODE (src) == AND)
>  	{
> -	  gain += ix86_cost->add;
> +	  igain += m * ix86_cost->add - ix86_cost->sse_op;
>  	  /* Additional gain for andnot for targets without BMI.  */
>  	  if (GET_CODE (XEXP (src, 0)) == NOT
>  	      && !TARGET_BMI)
> -	    gain += 2 * ix86_cost->add;
> +	    igain += m * ix86_cost->add;
>  
>  	  if (CONST_INT_P (XEXP (src, 0)))
> -	    gain -= vector_const_cost (XEXP (src, 0));
> +	    igain -= vector_const_cost (XEXP (src, 0));
>  	  if (CONST_INT_P (XEXP (src, 1)))
> -	    gain -= vector_const_cost (XEXP (src, 1));
> +	    igain -= vector_const_cost (XEXP (src, 1));
>  	}
>        else if (GET_CODE (src) == NEG
>  	       || GET_CODE (src) == NOT)
> -	gain += ix86_cost->add - COSTS_N_INSNS (1);
> +	igain += m * ix86_cost->add - ix86_cost->sse_op;
> +      else if (GET_CODE (src) == SMAX
> +	       || GET_CODE (src) == SMIN
> +	       || GET_CODE (src) == UMAX
> +	       || GET_CODE (src) == UMIN)
> +	{
> +	  /* We do not have any conditional move cost, estimate it as a
> +	     reg-reg move.  Comparisons are costed as adds.  */
> +	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> +	  /* Integer SSE ops are all costed the same.  */
> +	  igain -= ix86_cost->sse_op;
> +	}
>        else if (GET_CODE (src) == COMPARE)
>  	{
>  	  /* Assume comparison cost is the same.  */
> @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
>        else if (CONST_INT_P (src))
>  	{
>  	  if (REG_P (dst))
> -	    gain += COSTS_N_INSNS (2);
> +	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> +	    igain += COSTS_N_INSNS (m);
>  	  else if (MEM_P (dst))
> -	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> -	  gain -= vector_const_cost (src);
> +	    igain += (m * ix86_cost->int_store[2]
> +		     - ix86_cost->sse_store[sse_cost_idx]);
> +	  igain -= vector_const_cost (src);
>  	}
>        else
>  	gcc_unreachable ();
> +
> +      if (igain != 0 && dump_file)
> +	{
> +	  fprintf (dump_file, "  Instruction gain %d for ", igain);
> +	  dump_insn_slim (dump_file, insn);
> +	}
> +      gain += igain;
>      }
>  
>    if (dump_file)
>      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
>  
> +  /* ???  What about integer to SSE?  */
>    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
>      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
>  
> @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
>  /* Replace REG in X with a V2DI subreg of NEW_REG.  */
>  
>  rtx
> -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
>  {
>    if (x == reg)
> -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> +    return gen_rtx_SUBREG (vmode, new_reg, 0);
>  
>    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
>    int i, j;
> @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
>  /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
>  
>  void
> -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
> +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
>  						  rtx reg, rtx new_reg)
>  {
>    replace_with_subreg (single_set (insn), reg, new_reg);
> @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
>     and replace its uses in a chain.  */
>  
>  void
> -dimode_scalar_chain::make_vector_copies (unsigned regno)
> +general_scalar_chain::make_vector_copies (unsigned regno)
>  {
>    rtx reg = regno_reg_rtx[regno];
> -  rtx vreg = gen_reg_rtx (DImode);
> +  rtx vreg = gen_reg_rtx (smode);
>    df_ref ref;
>  
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
>  	start_sequence ();
>  	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
>  	  {
> -	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> -	    emit_move_insn (adjust_address (tmp, SImode, 0),
> -			    gen_rtx_SUBREG (SImode, reg, 0));
> -	    emit_move_insn (adjust_address (tmp, SImode, 4),
> -			    gen_rtx_SUBREG (SImode, reg, 4));
> +	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> +	    if (smode == DImode && !TARGET_64BIT)
> +	      {
> +		emit_move_insn (adjust_address (tmp, SImode, 0),
> +				gen_rtx_SUBREG (SImode, reg, 0));
> +		emit_move_insn (adjust_address (tmp, SImode, 4),
> +				gen_rtx_SUBREG (SImode, reg, 4));
> +	      }
> +	    else
> +	      emit_move_insn (tmp, reg);
>  	    emit_move_insn (vreg, tmp);
>  	  }
> -	else if (TARGET_SSE4_1)
> +	else if (!TARGET_64BIT && smode == DImode)
>  	  {
> -	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -					CONST0_RTX (V4SImode),
> -					gen_rtx_SUBREG (SImode, reg, 0)));
> -	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -					  gen_rtx_SUBREG (V4SImode, vreg, 0),
> -					  gen_rtx_SUBREG (SImode, reg, 4),
> -					  GEN_INT (2)));
> +	    if (TARGET_SSE4_1)
> +	      {
> +		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +					    CONST0_RTX (V4SImode),
> +					    gen_rtx_SUBREG (SImode, reg, 0)));
> +		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +					      gen_rtx_SUBREG (V4SImode, vreg, 0),
> +					      gen_rtx_SUBREG (SImode, reg, 4),
> +					      GEN_INT (2)));
> +	      }
> +	    else
> +	      {
> +		rtx tmp = gen_reg_rtx (DImode);
> +		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +					    CONST0_RTX (V4SImode),
> +					    gen_rtx_SUBREG (SImode, reg, 0)));
> +		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> +					    CONST0_RTX (V4SImode),
> +					    gen_rtx_SUBREG (SImode, reg, 4)));
> +		emit_insn (gen_vec_interleave_lowv4si
> +			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +			    gen_rtx_SUBREG (V4SImode, vreg, 0),
> +			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +	      }
>  	  }
>  	else
> -	  {
> -	    rtx tmp = gen_reg_rtx (DImode);
> -	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -					CONST0_RTX (V4SImode),
> -					gen_rtx_SUBREG (SImode, reg, 0)));
> -	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> -					CONST0_RTX (V4SImode),
> -					gen_rtx_SUBREG (SImode, reg, 4)));
> -	    emit_insn (gen_vec_interleave_lowv4si
> -		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -			gen_rtx_SUBREG (V4SImode, vreg, 0),
> -			gen_rtx_SUBREG (V4SImode, tmp, 0)));
> -	  }
> +	  emit_move_insn (gen_lowpart (smode, vreg), reg);
>  	rtx_insn *seq = get_insns ();
>  	end_sequence ();
>  	rtx_insn *insn = DF_REF_INSN (ref);
> @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies
>     in case register is used in not convertible insn.  */
>  
>  void
> -dimode_scalar_chain::convert_reg (unsigned regno)
> +general_scalar_chain::convert_reg (unsigned regno)
>  {
>    bool scalar_copy = bitmap_bit_p (defs_conv, regno);
>    rtx reg = regno_reg_rtx[regno];
> @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
>    bitmap_copy (conv, insns);
>  
>    if (scalar_copy)
> -    scopy = gen_reg_rtx (DImode);
> +    scopy = gen_reg_rtx (smode);
>  
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
>      {
> @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
>  	  start_sequence ();
>  	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
>  	    {
> -	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> +	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
>  	      emit_move_insn (tmp, reg);
> -	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -			      adjust_address (tmp, SImode, 0));
> -	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -			      adjust_address (tmp, SImode, 4));
> +	      if (!TARGET_64BIT && smode == DImode)
> +		{
> +		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +				  adjust_address (tmp, SImode, 0));
> +		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +				  adjust_address (tmp, SImode, 4));
> +		}
> +	      else
> +		emit_move_insn (scopy, tmp);
>  	    }
> -	  else if (TARGET_SSE4_1)
> +	  else if (!TARGET_64BIT && smode == DImode)
>  	    {
> -	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> -	      emit_insn
> -		(gen_rtx_SET
> -		 (gen_rtx_SUBREG (SImode, scopy, 0),
> -		  gen_rtx_VEC_SELECT (SImode,
> -				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> -
> -	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> -	      emit_insn
> -		(gen_rtx_SET
> -		 (gen_rtx_SUBREG (SImode, scopy, 4),
> -		  gen_rtx_VEC_SELECT (SImode,
> -				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> +	      if (TARGET_SSE4_1)
> +		{
> +		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> +					      gen_rtvec (1, const0_rtx));
> +		  emit_insn
> +		    (gen_rtx_SET
> +		       (gen_rtx_SUBREG (SImode, scopy, 0),
> +			gen_rtx_VEC_SELECT (SImode,
> +					    gen_rtx_SUBREG (V4SImode, reg, 0),
> +					    tmp)));
> +
> +		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> +		  emit_insn
> +		    (gen_rtx_SET
> +		       (gen_rtx_SUBREG (SImode, scopy, 4),
> +			gen_rtx_VEC_SELECT (SImode,
> +					    gen_rtx_SUBREG (V4SImode, reg, 0),
> +					    tmp)));
> +		}
> +	      else
> +		{
> +		  rtx vcopy = gen_reg_rtx (V2DImode);
> +		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> +		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +				  gen_rtx_SUBREG (SImode, vcopy, 0));
> +		  emit_move_insn (vcopy,
> +				  gen_rtx_LSHIFTRT (V2DImode,
> +						    vcopy, GEN_INT (32)));
> +		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +				  gen_rtx_SUBREG (SImode, vcopy, 0));
> +		}
>  	    }
>  	  else
> -	    {
> -	      rtx vcopy = gen_reg_rtx (V2DImode);
> -	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> -	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -			      gen_rtx_SUBREG (SImode, vcopy, 0));
> -	      emit_move_insn (vcopy,
> -			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> -	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -			      gen_rtx_SUBREG (SImode, vcopy, 0));
> -	    }
> +	    emit_move_insn (scopy, reg);
> +
>  	  rtx_insn *seq = get_insns ();
>  	  end_sequence ();
>  	  emit_conversion_insns (seq, insn);
> @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign
>     registers conversion.  */
>  
>  void
> -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
> +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
>  {
>    *op = copy_rtx_if_shared (*op);
>  
>    if (GET_CODE (*op) == NOT)
>      {
>        convert_op (&XEXP (*op, 0), insn);
> -      PUT_MODE (*op, V2DImode);
> +      PUT_MODE (*op, vmode);
>      }
>    else if (MEM_P (*op))
>      {
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
>  
>        emit_insn_before (gen_move_insn (tmp, *op), insn);
> -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
>  
>        if (dump_file)
>  	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
>  	    gcc_assert (!DF_REF_CHAIN (ref));
>  	    break;
>  	  }
> -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> +      *op = gen_rtx_SUBREG (vmode, *op, 0);
>      }
>    else if (CONST_INT_P (*op))
>      {
>        rtx vec_cst;
> -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
>  
>        /* Prefer all ones vector in case of -1.  */
>        if (constm1_operand (*op, GET_MODE (*op)))
> -	vec_cst = CONSTM1_RTX (V2DImode);
> +	vec_cst = CONSTM1_RTX (vmode);
>        else
> -	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> -					gen_rtvec (2, *op, const0_rtx));
> +	{
> +	  unsigned n = GET_MODE_NUNITS (vmode);
> +	  rtx *v = XALLOCAVEC (rtx, n);
> +	  v[0] = *op;
> +	  for (unsigned i = 1; i < n; ++i)
> +	    v[i] = const0_rtx;
> +	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> +	}
>  
> -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> +      if (!standard_sse_constant_p (vec_cst, vmode))
>  	{
>  	  start_sequence ();
> -	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> +	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
>  	  rtx_insn *seq = get_insns ();
>  	  end_sequence ();
>  	  emit_insn_before (seq, insn);
> @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op
>    else
>      {
>        gcc_assert (SUBREG_P (*op));
> -      gcc_assert (GET_MODE (*op) == V2DImode);
> +      gcc_assert (GET_MODE (*op) == vmode);
>      }
>  }
>  
>  /* Convert INSN to vector mode.  */
>  
>  void
> -dimode_scalar_chain::convert_insn (rtx_insn *insn)
> +general_scalar_chain::convert_insn (rtx_insn *insn)
>  {
>    rtx def_set = single_set (insn);
>    rtx src = SET_SRC (def_set);
> @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>      {
>        /* There are no scalar integer instructions and therefore
>  	 temporary register usage is required.  */
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (dst));
>        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
>      }
>  
>    switch (GET_CODE (src))
> @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case ASHIFTRT:
>      case LSHIFTRT:
>        convert_op (&XEXP (src, 0), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>  
>      case PLUS:
> @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case IOR:
>      case XOR:
>      case AND:
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
>        convert_op (&XEXP (src, 0), insn);
>        convert_op (&XEXP (src, 1), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>  
>      case NEG:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> +      src = gen_rtx_MINUS (vmode, subreg, src);
>        break;
>  
>      case NOT:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> -      src = gen_rtx_XOR (V2DImode, src, subreg);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> +      src = gen_rtx_XOR (vmode, src, subreg);
>        break;
>  
>      case MEM:
> @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
>        break;
>  
>      case SUBREG:
> -      gcc_assert (GET_MODE (src) == V2DImode);
> +      gcc_assert (GET_MODE (src) == vmode);
>        break;
>  
>      case COMPARE:
>        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
>  
> -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> -		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> +		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
>  
>        if (REG_P (src))
> -	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> +	subreg = gen_rtx_SUBREG (vmode, src, 0);
>        else
>  	subreg = copy_rtx_if_shared (src);
>        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>    PATTERN (insn) = def_set;
>  
>    INSN_CODE (insn) = -1;
> -  recog_memoized (insn);
> +  int patt = recog_memoized (insn);
> +  if (patt == -1)
> +    fatal_insn_not_found (insn);
>    df_insn_rescan (insn);
>  }
>  
> @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i
>  }
>  
>  void
> -dimode_scalar_chain::convert_registers ()
> +general_scalar_chain::convert_registers ()
>  {
>    bitmap_iterator bi;
>    unsigned id;
> @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
>  		     (const_int 0 [0])))  */
>  
>  static bool
> -convertible_comparison_p (rtx_insn *insn)
> +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    if (!TARGET_SSE4_1)
>      return false;
> @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
>  
>    if (!SUBREG_P (op1)
>        || !SUBREG_P (op2)
> -      || GET_MODE (op1) != SImode
> -      || GET_MODE (op2) != SImode
> +      || GET_MODE (op1) != mode
> +      || GET_MODE (op2) != mode
>        || ((SUBREG_BYTE (op1) != 0
> -	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> +	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
>  	  && (SUBREG_BYTE (op2) != 0
> -	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> +	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
>      return false;
>  
>    op1 = SUBREG_REG (op1);
> @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
>  
>    if (op1 != op2
>        || !REG_P (op1)
> -      || GET_MODE (op1) != DImode)
> +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
>      return false;
>  
>    return true;
> @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
>  /* The DImode version of scalar_to_vector_candidate_p.  */
>  
>  static bool
> -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    rtx def_set = single_set (insn);
>  
> @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
>    rtx dst = SET_DEST (def_set);
>  
>    if (GET_CODE (src) == COMPARE)
> -    return convertible_comparison_p (insn);
> +    return convertible_comparison_p (insn, mode);
>  
>    /* We are interested in DImode promotion only.  */
> -  if ((GET_MODE (src) != DImode
> +  if ((GET_MODE (src) != mode
>         && !CONST_INT_P (src))
> -      || GET_MODE (dst) != DImode)
> +      || GET_MODE (dst) != mode)
>      return false;
>  
>    if (!REG_P (dst) && !MEM_P (dst))
> @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
>  	return false;
>        break;
>  
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
> +      if ((mode == DImode && !TARGET_AVX512VL)
> +	  || (mode == SImode && !TARGET_SSE4_1))
> +	return false;
> +      /* Fallthru.  */
> +
>      case PLUS:
>      case MINUS:
>      case IOR:
> @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>  	  && !CONST_INT_P (XEXP (src, 1)))
>  	return false;
>  
> -      if (GET_MODE (XEXP (src, 1)) != DImode
> +      if (GET_MODE (XEXP (src, 1)) != mode
>  	  && !CONST_INT_P (XEXP (src, 1)))
>  	return false;
>        break;
> @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>  	  || !REG_P (XEXP (XEXP (src, 0), 0))))
>        return false;
>  
> -  if (GET_MODE (XEXP (src, 0)) != DImode
> +  if (GET_MODE (XEXP (src, 0)) != mode
>        && !CONST_INT_P (XEXP (src, 0)))
>      return false;
>  
> @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx
>    return false;
>  }
>  
> -/* Return 1 if INSN may be converted into vector
> -   instruction.  */
> -
> -static bool
> -scalar_to_vector_candidate_p (rtx_insn *insn)
> -{
> -  if (TARGET_64BIT)
> -    return timode_scalar_to_vector_candidate_p (insn);
> -  else
> -    return dimode_scalar_to_vector_candidate_p (insn);
> -}
> +/* For a given bitmap of insn UIDs, scan all instructions and
> +   remove an insn from CANDIDATES if it has both convertible
> +   and non-convertible definitions.
>  
> -/* The DImode version of remove_non_convertible_regs.  */
> +   All insns in a bitmap are conversion candidates according to
> +   scalar_to_vector_candidate_p.  Currently it implies all insns
> +   are single_set.  */
>  
>  static void
> -dimode_remove_non_convertible_regs (bitmap candidates)
> +general_remove_non_convertible_regs (bitmap candidates)
>  {
>    bitmap_iterator bi;
>    unsigned id;
> @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
>    BITMAP_FREE (regs);
>  }
>  
> -/* For a given bitmap of insn UIDs scans all instruction and
> -   remove insn from CANDIDATES in case it has both convertible
> -   and not convertible definitions.
> -
> -   All insns in a bitmap are conversion candidates according to
> -   scalar_to_vector_candidate_p.  Currently it implies all insns
> -   are single_set.  */
> -
> -static void
> -remove_non_convertible_regs (bitmap candidates)
> -{
> -  if (TARGET_64BIT)
> -    timode_remove_non_convertible_regs (candidates);
> -  else
> -    dimode_remove_non_convertible_regs (candidates);
> -}
> -
>  /* Main STV pass function.  Find and convert scalar
>     instructions into vector mode when profitable.  */
>  
> @@ -1577,11 +1638,14 @@ static unsigned int
>  convert_scalars_to_vector ()
>  {
>    basic_block bb;
> -  bitmap candidates;
>    int converted_insns = 0;
>  
>    bitmap_obstack_initialize (NULL);
> -  candidates = BITMAP_ALLOC (NULL);
> +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> +  for (unsigned i = 0; i < 3; ++i)
> +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
>  
>    calculate_dominance_info (CDI_DOMINATORS);
>    df_set_flags (DF_DEFER_INSN_RESCAN);
> @@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
>      {
>        rtx_insn *insn;
>        FOR_BB_INSNS (bb, insn)
> -	if (scalar_to_vector_candidate_p (insn))
> +	if (TARGET_64BIT
> +	    && timode_scalar_to_vector_candidate_p (insn))
>  	  {
>  	    if (dump_file)
> -	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
> +	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
>  		       INSN_UID (insn));
>  
> -	    bitmap_set_bit (candidates, INSN_UID (insn));
> +	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
> +	  }
> +	else
> +	  {
> +	    /* Check {SI,DI}mode.  */
> +	    for (unsigned i = 0; i <= 1; ++i)
> +	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> +		{
> +		  if (dump_file)
> +		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> +			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> +
> +		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
> +		  break;
> +		}
>  	  }
>      }
>  
> -  remove_non_convertible_regs (candidates);
> +  if (TARGET_64BIT)
> +    timode_remove_non_convertible_regs (&candidates[2]);
> +  for (unsigned i = 0; i <= 1; ++i)
> +    general_remove_non_convertible_regs (&candidates[i]);
>  
> -  if (bitmap_empty_p (candidates))
> -    if (dump_file)
> +  for (unsigned i = 0; i <= 2; ++i)
> +    if (!bitmap_empty_p (&candidates[i]))
> +      break;
> +    else if (i == 2 && dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
>  
> -  while (!bitmap_empty_p (candidates))
> -    {
> -      unsigned uid = bitmap_first_set_bit (candidates);
> -      scalar_chain *chain;
> +  for (unsigned i = 0; i <= 2; ++i)
> +    while (!bitmap_empty_p (&candidates[i]))
> +      {
> +	unsigned uid = bitmap_first_set_bit (&candidates[i]);
> +	scalar_chain *chain;
>  
> -      if (TARGET_64BIT)
> -	chain = new timode_scalar_chain;
> -      else
> -	chain = new dimode_scalar_chain;
> +	if (cand_mode[i] == TImode)
> +	  chain = new timode_scalar_chain;
> +	else
> +	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
>  
> -      /* Find instructions chain we want to convert to vector mode.
> -	 Check all uses and definitions to estimate all required
> -	 conversions.  */
> -      chain->build (candidates, uid);
> +	/* Find instructions chain we want to convert to vector mode.
> +	   Check all uses and definitions to estimate all required
> +	   conversions.  */
> +	chain->build (&candidates[i], uid);
>  
> -      if (chain->compute_convert_gain () > 0)
> -	converted_insns += chain->convert ();
> -      else
> -	if (dump_file)
> -	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> -		   chain->chain_id);
> +	if (chain->compute_convert_gain () > 0)
> +	  converted_insns += chain->convert ();
> +	else
> +	  if (dump_file)
> +	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> +		     chain->chain_id);
>  
> -      delete chain;
> -    }
> +	delete chain;
> +      }
>  
>    if (dump_file)
>      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
>  
> -  BITMAP_FREE (candidates);
> +  for (unsigned i = 0; i <= 2; ++i)
> +    bitmap_release (&candidates[i]);
>    bitmap_obstack_release (NULL);
>    df_process_deferred_rescans ();
>  
> Index: gcc/config/i386/i386-features.h
> ===================================================================
> --- gcc/config/i386/i386-features.h	(revision 274111)
> +++ gcc/config/i386/i386-features.h	(working copy)
> @@ -127,11 +127,16 @@ namespace {
>  class scalar_chain
>  {
>   public:
> -  scalar_chain ();
> +  scalar_chain (enum machine_mode, enum machine_mode);
>    virtual ~scalar_chain ();
>  
>    static unsigned max_id;
>  
> +  /* Scalar mode.  */
> +  enum machine_mode smode;
> +  /* Vector mode.  */
> +  enum machine_mode vmode;
> +
>    /* ID of a chain.  */
>    unsigned int chain_id;
>    /* A queue of instructions to be included into a chain.  */
> @@ -159,9 +164,11 @@ class scalar_chain
>    virtual void convert_registers () = 0;
>  };
>  
> -class dimode_scalar_chain : public scalar_chain
> +class general_scalar_chain : public scalar_chain
>  {
>   public:
> +  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> +    : scalar_chain (smode_, vmode_) {}
>    int compute_convert_gain ();
>   private:
>    void mark_dual_mode_def (df_ref def);
> @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
>  class timode_scalar_chain : public scalar_chain
>  {
>   public:
> +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> +
>    /* Convert from TImode to V1TImode is always faster.  */
>    int compute_convert_gain () { return 1; }
>  
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md	(revision 274111)
> +++ gcc/config/i386/i386.md	(working copy)
> @@ -17721,6 +17721,30 @@ (define_peephole2
>      std::swap (operands[4], operands[5]);
>  })
>  
> +;; min/max patterns
> +
> +(define_code_attr maxmin_rel
> +  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
> +(define_code_attr maxmin_cmpmode
> +  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
> +
> +(define_insn_and_split "<code><mode>3"
> +  [(set (match_operand:SWI48 0 "register_operand")
> +	(maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> +		      (match_operand:SWI48 2 "register_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_STV && TARGET_SSE4_1
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
> +	(compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
> +   (set (match_dup 0)
> +	(if_then_else:SWI48
> +	  (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
> +	  (match_dup 1)
> +	  (match_dup 2)))])
> +
>  ;; Conditional addition patterns
>  (define_expand "add<mode>cc"
>    [(match_operand:SWI 0 "register_operand")
>
Uros Bizjak Aug. 7, 2019, 12:06 p.m. UTC | #38
On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 7 Aug 2019, Richard Biener wrote:
>
> > On Mon, 5 Aug 2019, Uros Bizjak wrote:
> >
> > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > >
> > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > >
> > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > >
> > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > to force use of %zmmN?
> > > > > > >
> > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > >
> > > > > >     case SMAX:
> > > > > >     case SMIN:
> > > > > >     case UMAX:
> > > > > >     case UMIN:
> > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > >         return false;
> > > > > >
> > > > > > so there's no way to use AVX512VL for 32bit?
> > > > >
> > > > > There is a way, but on 32bit targets, we need to split DImode
> > > > > operation to a sequence of SImode operations for unconverted pattern.
> > > > > This is of course doable, but somehow more complex than simply
> > > > > emitting a DImode compare + DImode cmove, which is what current
> > > > > splitter does. So, a follow-up task.
> > > >
> > > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > > check we just need to properly split if we enable the scalar minmax
> > > > pattern for DImode on 32bits, the STV conversion would go fine.
> > >
> > > Yes, that is correct.
> >
> > So I tested the patch below (now with appropriate ChangeLog) on
> > x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> > the obvious hmmer improvement, now checking for off-noise results
> > with a 3-run on those that may have one (with more than +-1 second
> > differences in the 1-run).
> >
> > As-is the patch likely runs into the splitting issue for DImode
> > on i?86 and the patch misses functional testcases.  I'll do the
> > hmmer loop with both DImode and SImode and testcases to trigger
> > all pattern variants with the different ISAs we have.
> >
> > Some of the patch could be split out (the cost changes that are
> > also effective for DImode for example).
> >
> > AFAICS we could go with only adding SImode avoiding the DImode
> > splitting thing and this would solve the hmmer regression.
>
> I've additionally bootstrapped with --with-arch=nehalem which
> reveals
>
> FAIL: gcc.target/i386/minmax-2.c scan-assembler test
> FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp
>
> we emit cmp + cmov here now with -msse4.1 (as soon as the max
> pattern is enabled I guess)
>
> Otherwise testing is clean, so I suppose this is the net effect
> of just doing the SImode chains;  I don't have AVX512 HW handily
> available to really test the DImode path.
>
> Would you be fine to simplify the patch down to SImode chain handling?

Just leave DImode in for a couple of days to see what HJ's autotesters
reveal. I'd just disable DImode for 32-bit targets for now; we know
that the splitters are missing there.

Some remarks below.

Uros.

>
> Thanks,
> Richard.
>
> > Thanks,
> > Richard.
> >
> > 2019-08-07  Richard Biener  <rguenther@suse.de>
> >
> >       PR target/91154
> >       * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
> >       mode arguments.
> >       (scalar_chain::smode): New member.
> >       (scalar_chain::vmode): Likewise.
> >       (dimode_scalar_chain): Rename to...
> >       (general_scalar_chain): ... this.
> >       (general_scalar_chain::general_scalar_chain): Take mode arguments.
> >       (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
> >       base with TImode and V1TImode.
> >       * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
> >       (general_scalar_chain::vector_const_cost): Adjust for SImode
> >       chains.
> >       (general_scalar_chain::compute_convert_gain): Likewise.  Fix
> >       reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
> >       scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
> >       gain if not zero.
> >       (general_scalar_chain::replace_with_subreg): Use vmode/smode.
> >       (general_scalar_chain::make_vector_copies): Likewise.  Handle
> >       non-DImode chains appropriately.
> >       (general_scalar_chain::convert_reg): Likewise.
> >       (general_scalar_chain::convert_op): Likewise.
> >       (general_scalar_chain::convert_insn): Likewise.  Add
> >       fatal_insn_not_found if the result is not recognized.
> >       (convertible_comparison_p): Pass in the scalar mode and use that.
> >       (general_scalar_to_vector_candidate_p): Likewise.  Rename from
> >       dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
> >       (scalar_to_vector_candidate_p): Remove by inlining into single
> >       caller.
> >       (general_remove_non_convertible_regs): Rename from
> >       dimode_remove_non_convertible_regs.
> >       (remove_non_convertible_regs): Remove by inlining into single caller.
> >       (convert_scalars_to_vector): Handle SImode and DImode chains
> >       in addition to TImode chains.
> >       * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
> >
> > Index: gcc/config/i386/i386-features.c
> > ===================================================================
> > --- gcc/config/i386/i386-features.c   (revision 274111)
> > +++ gcc/config/i386/i386-features.c   (working copy)
> > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
> >
> >  /* Initialize new chain.  */
> >
> > -scalar_chain::scalar_chain ()
> > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> >  {
> > +  smode = smode_;
> > +  vmode = vmode_;
> > +
> >    chain_id = ++max_id;
> >
> >     if (dump_file)
> > @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
> >     conversion.  */
> >
> >  void
> > -dimode_scalar_chain::mark_dual_mode_def (df_ref def)
> > +general_scalar_chain::mark_dual_mode_def (df_ref def)
> >  {
> >    gcc_assert (DF_REF_REG_DEF_P (def));
> >
> > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
> >        && !HARD_REGISTER_P (SET_DEST (def_set)))
> >      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
> >
> > +  /* ???  The following is quadratic since analyze_register_chain
> > +     iterates over all refs to look for dual-mode regs.  Instead this
> > +     should be done separately for all regs mentioned in the chain once.  */
> >    df_ref ref;
> >    df_ref def;
> >    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> > @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
> >     instead of using a scalar one.  */
> >
> >  int
> > -dimode_scalar_chain::vector_const_cost (rtx exp)
> > +general_scalar_chain::vector_const_cost (rtx exp)
> >  {
> >    gcc_assert (CONST_INT_P (exp));
> >
> > -  if (standard_sse_constant_p (exp, V2DImode))
> > -    return COSTS_N_INSNS (1);
> > -  return ix86_cost->sse_load[1];
> > +  if (standard_sse_constant_p (exp, vmode))
> > +    return ix86_cost->sse_op;
> > +  /* We have separate costs for SImode and DImode, use SImode costs
> > +     for smaller modes.  */
> > +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
> >  }
> >
> >  /* Compute a gain for chain conversion.  */
> >
> >  int
> > -dimode_scalar_chain::compute_convert_gain ()
> > +general_scalar_chain::compute_convert_gain ()
> >  {
> >    bitmap_iterator bi;
> >    unsigned insn_uid;
> > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
> >    if (dump_file)
> >      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
> >
> > +  /* SSE costs distinguish between SImode and DImode loads/stores, for
> > +     int costs factor in the number of GPRs involved.  When supporting
> > +     smaller modes than SImode the int load/store costs need to be
> > +     adjusted as well.  */
> > +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> > +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> > +
> >    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
> >      {
> >        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
> >        rtx def_set = single_set (insn);
> >        rtx src = SET_SRC (def_set);
> >        rtx dst = SET_DEST (def_set);
> > +      int igain = 0;
> >
> >        if (REG_P (src) && REG_P (dst))
> > -     gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> > +     igain += 2 * m - ix86_cost->xmm_move;
> >        else if (REG_P (src) && MEM_P (dst))
> > -     gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> > +     igain
> > +       += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
> >        else if (MEM_P (src) && REG_P (dst))
> > -     gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> > +     igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
> >        else if (GET_CODE (src) == ASHIFT
> >              || GET_CODE (src) == ASHIFTRT
> >              || GET_CODE (src) == LSHIFTRT)
> >       {
> >         if (CONST_INT_P (XEXP (src, 0)))
> > -         gain -= vector_const_cost (XEXP (src, 0));
> > -       gain += ix86_cost->shift_const;
> > +         igain -= vector_const_cost (XEXP (src, 0));
> > +       igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
> >         if (INTVAL (XEXP (src, 1)) >= 32)
> > -         gain -= COSTS_N_INSNS (1);
> > +         igain -= COSTS_N_INSNS (1);
> >       }
> >        else if (GET_CODE (src) == PLUS
> >              || GET_CODE (src) == MINUS
> > @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
> >              || GET_CODE (src) == XOR
> >              || GET_CODE (src) == AND)
> >       {
> > -       gain += ix86_cost->add;
> > +       igain += m * ix86_cost->add - ix86_cost->sse_op;
> >         /* Additional gain for andnot for targets without BMI.  */
> >         if (GET_CODE (XEXP (src, 0)) == NOT
> >             && !TARGET_BMI)
> > -         gain += 2 * ix86_cost->add;
> > +         igain += m * ix86_cost->add;
> >
> >         if (CONST_INT_P (XEXP (src, 0)))
> > -         gain -= vector_const_cost (XEXP (src, 0));
> > +         igain -= vector_const_cost (XEXP (src, 0));
> >         if (CONST_INT_P (XEXP (src, 1)))
> > -         gain -= vector_const_cost (XEXP (src, 1));
> > +         igain -= vector_const_cost (XEXP (src, 1));
> >       }
> >        else if (GET_CODE (src) == NEG
> >              || GET_CODE (src) == NOT)
> > -     gain += ix86_cost->add - COSTS_N_INSNS (1);
> > +     igain += m * ix86_cost->add - ix86_cost->sse_op;
> > +      else if (GET_CODE (src) == SMAX
> > +            || GET_CODE (src) == SMIN
> > +            || GET_CODE (src) == UMAX
> > +            || GET_CODE (src) == UMIN)
> > +     {
> > +       /* We do not have any conditional move cost, estimate it as a
> > +          reg-reg move.  Comparisons are costed as adds.  */
> > +       igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> > +       /* Integer SSE ops are all costed the same.  */
> > +       igain -= ix86_cost->sse_op;
> > +     }
> >        else if (GET_CODE (src) == COMPARE)
> >       {
> >         /* Assume comparison cost is the same.  */
> > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
> >        else if (CONST_INT_P (src))
> >       {
> >         if (REG_P (dst))
> > -         gain += COSTS_N_INSNS (2);
> > +         /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> > +         igain += COSTS_N_INSNS (m);
> >         else if (MEM_P (dst))
> > -         gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> > -       gain -= vector_const_cost (src);
> > +         igain += (m * ix86_cost->int_store[2]
> > +                  - ix86_cost->sse_store[sse_cost_idx]);
> > +       igain -= vector_const_cost (src);
> >       }
> >        else
> >       gcc_unreachable ();
> > +
> > +      if (igain != 0 && dump_file)
> > +     {
> > +       fprintf (dump_file, "  Instruction gain %d for ", igain);
> > +       dump_insn_slim (dump_file, insn);
> > +     }
> > +      gain += igain;
> >      }
> >
> >    if (dump_file)
> >      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
> >
> > +  /* ???  What about integer to SSE?  */
> >    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
> >      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
> >
> > @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
> >  /* Replace REG in X with a V2DI subreg of NEW_REG.  */
> >
> >  rtx
> > -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> > +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> >  {
> >    if (x == reg)
> > -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> > +    return gen_rtx_SUBREG (vmode, new_reg, 0);
> >
> >    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
> >    int i, j;
> > @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
> >  /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
> >
> >  void
> > -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
> > +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
> >                                                 rtx reg, rtx new_reg)
> >  {
> >    replace_with_subreg (single_set (insn), reg, new_reg);
> > @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
> >     and replace its uses in a chain.  */
> >
> >  void
> > -dimode_scalar_chain::make_vector_copies (unsigned regno)
> > +general_scalar_chain::make_vector_copies (unsigned regno)
> >  {
> >    rtx reg = regno_reg_rtx[regno];
> > -  rtx vreg = gen_reg_rtx (DImode);
> > +  rtx vreg = gen_reg_rtx (smode);
> >    df_ref ref;
> >
> >    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> > @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
> >       start_sequence ();
> >       if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
> >         {
> > -         rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> > -         emit_move_insn (adjust_address (tmp, SImode, 0),
> > -                         gen_rtx_SUBREG (SImode, reg, 0));
> > -         emit_move_insn (adjust_address (tmp, SImode, 4),
> > -                         gen_rtx_SUBREG (SImode, reg, 4));
> > +         rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> > +         if (smode == DImode && !TARGET_64BIT)
> > +           {
> > +             emit_move_insn (adjust_address (tmp, SImode, 0),
> > +                             gen_rtx_SUBREG (SImode, reg, 0));
> > +             emit_move_insn (adjust_address (tmp, SImode, 4),
> > +                             gen_rtx_SUBREG (SImode, reg, 4));
> > +           }
> > +         else
> > +           emit_move_insn (tmp, reg);
> >           emit_move_insn (vreg, tmp);
> >         }
> > -     else if (TARGET_SSE4_1)
> > +     else if (!TARGET_64BIT && smode == DImode)
> >         {
> > -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                                     CONST0_RTX (V4SImode),
> > -                                     gen_rtx_SUBREG (SImode, reg, 0)));
> > -         emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                                       gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                                       gen_rtx_SUBREG (SImode, reg, 4),
> > -                                       GEN_INT (2)));
> > +         if (TARGET_SSE4_1)
> > +           {
> > +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                                         CONST0_RTX (V4SImode),
> > +                                         gen_rtx_SUBREG (SImode, reg, 0)));
> > +             emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                                           gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                                           gen_rtx_SUBREG (SImode, reg, 4),
> > +                                           GEN_INT (2)));
> > +           }
> > +         else
> > +           {
> > +             rtx tmp = gen_reg_rtx (DImode);
> > +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                                         CONST0_RTX (V4SImode),
> > +                                         gen_rtx_SUBREG (SImode, reg, 0)));
> > +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> > +                                         CONST0_RTX (V4SImode),
> > +                                         gen_rtx_SUBREG (SImode, reg, 4)));
> > +             emit_insn (gen_vec_interleave_lowv4si
> > +                        (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                         gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                         gen_rtx_SUBREG (V4SImode, tmp, 0)));
> > +           }
> >         }
> >       else
> > -       {
> > -         rtx tmp = gen_reg_rtx (DImode);
> > -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                                     CONST0_RTX (V4SImode),
> > -                                     gen_rtx_SUBREG (SImode, reg, 0)));
> > -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> > -                                     CONST0_RTX (V4SImode),
> > -                                     gen_rtx_SUBREG (SImode, reg, 4)));
> > -         emit_insn (gen_vec_interleave_lowv4si
> > -                    (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                     gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                     gen_rtx_SUBREG (V4SImode, tmp, 0)));
> > -       }
> > +       emit_move_insn (gen_lowpart (smode, vreg), reg);
> >       rtx_insn *seq = get_insns ();
> >       end_sequence ();
> >       rtx_insn *insn = DF_REF_INSN (ref);
> > @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies
> >     in case register is used in not convertible insn.  */
> >
> >  void
> > -dimode_scalar_chain::convert_reg (unsigned regno)
> > +general_scalar_chain::convert_reg (unsigned regno)
> >  {
> >    bool scalar_copy = bitmap_bit_p (defs_conv, regno);
> >    rtx reg = regno_reg_rtx[regno];
> > @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
> >    bitmap_copy (conv, insns);
> >
> >    if (scalar_copy)
> > -    scopy = gen_reg_rtx (DImode);
> > +    scopy = gen_reg_rtx (smode);
> >
> >    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> >      {
> > @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
> >         start_sequence ();
> >         if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
> >           {
> > -           rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> > +           rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> >             emit_move_insn (tmp, reg);
> > -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> > -                           adjust_address (tmp, SImode, 0));
> > -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> > -                           adjust_address (tmp, SImode, 4));
> > +           if (!TARGET_64BIT && smode == DImode)
> > +             {
> > +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> > +                               adjust_address (tmp, SImode, 0));
> > +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> > +                               adjust_address (tmp, SImode, 4));
> > +             }
> > +           else
> > +             emit_move_insn (scopy, tmp);
> >           }
> > -       else if (TARGET_SSE4_1)
> > +       else if (!TARGET_64BIT && smode == DImode)
> >           {
> > -           rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> > -           emit_insn
> > -             (gen_rtx_SET
> > -              (gen_rtx_SUBREG (SImode, scopy, 0),
> > -               gen_rtx_VEC_SELECT (SImode,
> > -                                   gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> > -
> > -           tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> > -           emit_insn
> > -             (gen_rtx_SET
> > -              (gen_rtx_SUBREG (SImode, scopy, 4),
> > -               gen_rtx_VEC_SELECT (SImode,
> > -                                   gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> > +           if (TARGET_SSE4_1)
> > +             {
> > +               rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> > +                                           gen_rtvec (1, const0_rtx));
> > +               emit_insn
> > +                 (gen_rtx_SET
> > +                    (gen_rtx_SUBREG (SImode, scopy, 0),
> > +                     gen_rtx_VEC_SELECT (SImode,
> > +                                         gen_rtx_SUBREG (V4SImode, reg, 0),
> > +                                         tmp)));
> > +
> > +               tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> > +               emit_insn
> > +                 (gen_rtx_SET
> > +                    (gen_rtx_SUBREG (SImode, scopy, 4),
> > +                     gen_rtx_VEC_SELECT (SImode,
> > +                                         gen_rtx_SUBREG (V4SImode, reg, 0),
> > +                                         tmp)));
> > +             }
> > +           else
> > +             {
> > +               rtx vcopy = gen_reg_rtx (V2DImode);
> > +               emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> > +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> > +                               gen_rtx_SUBREG (SImode, vcopy, 0));
> > +               emit_move_insn (vcopy,
> > +                               gen_rtx_LSHIFTRT (V2DImode,
> > +                                                 vcopy, GEN_INT (32)));
> > +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> > +                               gen_rtx_SUBREG (SImode, vcopy, 0));
> > +             }
> >           }
> >         else
> > -         {
> > -           rtx vcopy = gen_reg_rtx (V2DImode);
> > -           emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> > -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> > -                           gen_rtx_SUBREG (SImode, vcopy, 0));
> > -           emit_move_insn (vcopy,
> > -                           gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> > -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> > -                           gen_rtx_SUBREG (SImode, vcopy, 0));
> > -         }
> > +         emit_move_insn (scopy, reg);
> > +
> >         rtx_insn *seq = get_insns ();
> >         end_sequence ();
> >         emit_conversion_insns (seq, insn);
> > @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign
> >     registers conversion.  */
> >
> >  void
> > -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
> > +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
> >  {
> >    *op = copy_rtx_if_shared (*op);
> >
> >    if (GET_CODE (*op) == NOT)
> >      {
> >        convert_op (&XEXP (*op, 0), insn);
> > -      PUT_MODE (*op, V2DImode);
> > +      PUT_MODE (*op, vmode);
> >      }
> >    else if (MEM_P (*op))
> >      {
> > -      rtx tmp = gen_reg_rtx (DImode);
> > +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
> >
> >        emit_insn_before (gen_move_insn (tmp, *op), insn);
> > -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> > +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
> >
> >        if (dump_file)
> >       fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> > @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
> >           gcc_assert (!DF_REF_CHAIN (ref));
> >           break;
> >         }
> > -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> > +      *op = gen_rtx_SUBREG (vmode, *op, 0);
> >      }
> >    else if (CONST_INT_P (*op))
> >      {
> >        rtx vec_cst;
> > -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> > +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
> >
> >        /* Prefer all ones vector in case of -1.  */
> >        if (constm1_operand (*op, GET_MODE (*op)))
> > -     vec_cst = CONSTM1_RTX (V2DImode);
> > +     vec_cst = CONSTM1_RTX (vmode);
> >        else
> > -     vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> > -                                     gen_rtvec (2, *op, const0_rtx));
> > +     {
> > +       unsigned n = GET_MODE_NUNITS (vmode);
> > +       rtx *v = XALLOCAVEC (rtx, n);
> > +       v[0] = *op;
> > +       for (unsigned i = 1; i < n; ++i)
> > +         v[i] = const0_rtx;
> > +       vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> > +     }
> >
> > -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> > +      if (!standard_sse_constant_p (vec_cst, vmode))
> >       {
> >         start_sequence ();
> > -       vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> > +       vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
> >         rtx_insn *seq = get_insns ();
> >         end_sequence ();
> >         emit_insn_before (seq, insn);
> > @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op
> >    else
> >      {
> >        gcc_assert (SUBREG_P (*op));
> > -      gcc_assert (GET_MODE (*op) == V2DImode);
> > +      gcc_assert (GET_MODE (*op) == vmode);
> >      }
> >  }
> >
> >  /* Convert INSN to vector mode.  */
> >
> >  void
> > -dimode_scalar_chain::convert_insn (rtx_insn *insn)
> > +general_scalar_chain::convert_insn (rtx_insn *insn)
> >  {
> >    rtx def_set = single_set (insn);
> >    rtx src = SET_SRC (def_set);
> > @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
> >      {
> >        /* There are no scalar integer instructions and therefore
> >        temporary register usage is required.  */
> > -      rtx tmp = gen_reg_rtx (DImode);
> > +      rtx tmp = gen_reg_rtx (GET_MODE (dst));
> >        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> > -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> > +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
> >      }
> >
> >    switch (GET_CODE (src))
> > @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
> >      case ASHIFTRT:
> >      case LSHIFTRT:
> >        convert_op (&XEXP (src, 0), insn);
> > -      PUT_MODE (src, V2DImode);
> > +      PUT_MODE (src, vmode);
> >        break;
> >
> >      case PLUS:
> > @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
> >      case IOR:
> >      case XOR:
> >      case AND:
> > +    case SMAX:
> > +    case SMIN:
> > +    case UMAX:
> > +    case UMIN:
> >        convert_op (&XEXP (src, 0), insn);
> >        convert_op (&XEXP (src, 1), insn);
> > -      PUT_MODE (src, V2DImode);
> > +      PUT_MODE (src, vmode);
> >        break;
> >
> >      case NEG:
> >        src = XEXP (src, 0);
> >        convert_op (&src, insn);
> > -      subreg = gen_reg_rtx (V2DImode);
> > -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> > -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> > +      subreg = gen_reg_rtx (vmode);
> > +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> > +      src = gen_rtx_MINUS (vmode, subreg, src);
> >        break;
> >
> >      case NOT:
> >        src = XEXP (src, 0);
> >        convert_op (&src, insn);
> > -      subreg = gen_reg_rtx (V2DImode);
> > -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> > -      src = gen_rtx_XOR (V2DImode, src, subreg);
> > +      subreg = gen_reg_rtx (vmode);
> > +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> > +      src = gen_rtx_XOR (vmode, src, subreg);
> >        break;
> >
> >      case MEM:
> > @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
> >        break;
> >
> >      case SUBREG:
> > -      gcc_assert (GET_MODE (src) == V2DImode);
> > +      gcc_assert (GET_MODE (src) == vmode);
> >        break;
> >
> >      case COMPARE:
> >        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
> >
> > -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> > -               || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> > +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> > +               || (SUBREG_P (src) && GET_MODE (src) == vmode));
> >
> >        if (REG_P (src))
> > -     subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> > +     subreg = gen_rtx_SUBREG (vmode, src, 0);
> >        else
> >       subreg = copy_rtx_if_shared (src);
> >        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> > @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
> >    PATTERN (insn) = def_set;
> >
> >    INSN_CODE (insn) = -1;
> > -  recog_memoized (insn);
> > +  int patt = recog_memoized (insn);
> > +  if  (patt == -1)
> > +    fatal_insn_not_found (insn);
> >    df_insn_rescan (insn);
> >  }
> >
> > @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i
> >  }
> >
> >  void
> > -dimode_scalar_chain::convert_registers ()
> > +general_scalar_chain::convert_registers ()
> >  {
> >    bitmap_iterator bi;
> >    unsigned id;
> > @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
> >                    (const_int 0 [0])))  */
> >
> >  static bool
> > -convertible_comparison_p (rtx_insn *insn)
> > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
> >  {
> >    if (!TARGET_SSE4_1)
> >      return false;
> > @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
> >
> >    if (!SUBREG_P (op1)
> >        || !SUBREG_P (op2)
> > -      || GET_MODE (op1) != SImode
> > -      || GET_MODE (op2) != SImode
> > +      || GET_MODE (op1) != mode
> > +      || GET_MODE (op2) != mode
> >        || ((SUBREG_BYTE (op1) != 0
> > -        || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> > +        || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
> >         && (SUBREG_BYTE (op2) != 0
> > -           || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> > +           || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
> >      return false;
> >
> >    op1 = SUBREG_REG (op1);
> > @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
> >
> >    if (op1 != op2
> >        || !REG_P (op1)
> > -      || GET_MODE (op1) != DImode)
> > +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
> >      return false;
> >
> >    return true;
> > @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
> >  /* The DImode version of scalar_to_vector_candidate_p.  */
> >
> >  static bool
> > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> > +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
> >  {
> >    rtx def_set = single_set (insn);
> >
> > @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
> >    rtx dst = SET_DEST (def_set);
> >
> >    if (GET_CODE (src) == COMPARE)
> > -    return convertible_comparison_p (insn);
> > +    return convertible_comparison_p (insn, mode);
> >
> >    /* We are interested in DImode promotion only.  */
> > -  if ((GET_MODE (src) != DImode
> > +  if ((GET_MODE (src) != mode
> >         && !CONST_INT_P (src))
> > -      || GET_MODE (dst) != DImode)
> > +      || GET_MODE (dst) != mode)
> >      return false;
> >
> >    if (!REG_P (dst) && !MEM_P (dst))
> > @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
> >       return false;
> >        break;
> >
> > +    case SMAX:
> > +    case SMIN:
> > +    case UMAX:
> > +    case UMIN:
> > +      if ((mode == DImode && !TARGET_AVX512VL)

Please enable DImode only for TARGET_64BIT for now.
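
I.e., keep the !TARGET_64BIT check from the earlier variant; a minimal
untested sketch of the whole case, matching the snippet quoted earlier
in the thread:

    case SMAX:
    case SMIN:
    case UMAX:
    case UMIN:
      /* DImode needs 64-bit AVX512VL until a doubleword splitter
	 exists; SImode needs SSE4.1 for pmaxsd and friends.  */
      if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
	  || (mode == SImode && !TARGET_SSE4_1))
	return false;
      /* Fallthru.  */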

> > +       || (mode == SImode && !TARGET_SSE4_1))
> > +     return false;
> > +      /* Fallthru.  */
> > +
> >      case PLUS:
> >      case MINUS:
> >      case IOR:
> > @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
> >         && !CONST_INT_P (XEXP (src, 1)))
> >       return false;
> >
> > -      if (GET_MODE (XEXP (src, 1)) != DImode
> > +      if (GET_MODE (XEXP (src, 1)) != mode
> >         && !CONST_INT_P (XEXP (src, 1)))
> >       return false;
> >        break;
> > @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
> >         || !REG_P (XEXP (XEXP (src, 0), 0))))
> >        return false;
> >
> > -  if (GET_MODE (XEXP (src, 0)) != DImode
> > +  if (GET_MODE (XEXP (src, 0)) != mode
> >        && !CONST_INT_P (XEXP (src, 0)))
> >      return false;
> >
> > @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx
> >    return false;
> >  }
> >
> > -/* Return 1 if INSN may be converted into vector
> > -   instruction.  */
> > -
> > -static bool
> > -scalar_to_vector_candidate_p (rtx_insn *insn)
> > -{
> > -  if (TARGET_64BIT)
> > -    return timode_scalar_to_vector_candidate_p (insn);
> > -  else
> > -    return dimode_scalar_to_vector_candidate_p (insn);
> > -}
> > +/* For a given bitmap of insn UIDs scans all instruction and
> > +   remove insn from CANDIDATES in case it has both convertible
> > +   and not convertible definitions.
> >
> > -/* The DImode version of remove_non_convertible_regs.  */
> > +   All insns in a bitmap are conversion candidates according to
> > +   scalar_to_vector_candidate_p.  Currently it implies all insns
> > +   are single_set.  */
> >
> >  static void
> > -dimode_remove_non_convertible_regs (bitmap candidates)
> > +general_remove_non_convertible_regs (bitmap candidates)
> >  {
> >    bitmap_iterator bi;
> >    unsigned id;
> > @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
> >    BITMAP_FREE (regs);
> >  }
> >
> > -/* For a given bitmap of insn UIDs scans all instruction and
> > -   remove insn from CANDIDATES in case it has both convertible
> > -   and not convertible definitions.
> > -
> > -   All insns in a bitmap are conversion candidates according to
> > -   scalar_to_vector_candidate_p.  Currently it implies all insns
> > -   are single_set.  */
> > -
> > -static void
> > -remove_non_convertible_regs (bitmap candidates)
> > -{
> > -  if (TARGET_64BIT)
> > -    timode_remove_non_convertible_regs (candidates);
> > -  else
> > -    dimode_remove_non_convertible_regs (candidates);
> > -}
> > -
> >  /* Main STV pass function.  Find and convert scalar
> >     instructions into vector mode when profitable.  */
> >
> > @@ -1577,11 +1638,14 @@ static unsigned int
> >  convert_scalars_to_vector ()
> >  {
> >    basic_block bb;
> > -  bitmap candidates;
> >    int converted_insns = 0;
> >
> >    bitmap_obstack_initialize (NULL);
> > -  candidates = BITMAP_ALLOC (NULL);
> > +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> > +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> > +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> > +  for (unsigned i = 0; i < 3; ++i)
> > +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
> >
> >    calculate_dominance_info (CDI_DOMINATORS);
> >    df_set_flags (DF_DEFER_INSN_RESCAN);
> > @@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
> >      {
> >        rtx_insn *insn;
> >        FOR_BB_INSNS (bb, insn)
> > -     if (scalar_to_vector_candidate_p (insn))
> > +     if (TARGET_64BIT
> > +         && timode_scalar_to_vector_candidate_p (insn))
> >         {
> >           if (dump_file)
> > -           fprintf (dump_file, "  insn %d is marked as a candidate\n",
> > +           fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
> >                      INSN_UID (insn));
> >
> > -         bitmap_set_bit (candidates, INSN_UID (insn));
> > +         bitmap_set_bit (&candidates[2], INSN_UID (insn));
> > +       }
> > +     else
> > +       {
> > +         /* Check {SI,DI}mode.  */
> > +         for (unsigned i = 0; i <= 1; ++i)
> > +           if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> > +             {
> > +               if (dump_file)
> > +                 fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> > +                          INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> > +
> > +               bitmap_set_bit (&candidates[i], INSN_UID (insn));
> > +               break;
> > +             }
> >         }
> >      }
> >
> > -  remove_non_convertible_regs (candidates);
> > +  if (TARGET_64BIT)
> > +    timode_remove_non_convertible_regs (&candidates[2]);
> > +  for (unsigned i = 0; i <= 1; ++i)
> > +    general_remove_non_convertible_regs (&candidates[i]);
> >
> > -  if (bitmap_empty_p (candidates))
> > -    if (dump_file)
> > +  for (unsigned i = 0; i <= 2; ++i)
> > +    if (!bitmap_empty_p (&candidates[i]))
> > +      break;
> > +    else if (i == 2 && dump_file)
> >        fprintf (dump_file, "There are no candidates for optimization.\n");
> >
> > -  while (!bitmap_empty_p (candidates))
> > -    {
> > -      unsigned uid = bitmap_first_set_bit (candidates);
> > -      scalar_chain *chain;
> > +  for (unsigned i = 0; i <= 2; ++i)
> > +    while (!bitmap_empty_p (&candidates[i]))
> > +      {
> > +     unsigned uid = bitmap_first_set_bit (&candidates[i]);
> > +     scalar_chain *chain;
> >
> > -      if (TARGET_64BIT)
> > -     chain = new timode_scalar_chain;
> > -      else
> > -     chain = new dimode_scalar_chain;
> > +     if (cand_mode[i] == TImode)
> > +       chain = new timode_scalar_chain;
> > +     else
> > +       chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
> >
> > -      /* Find instructions chain we want to convert to vector mode.
> > -      Check all uses and definitions to estimate all required
> > -      conversions.  */
> > -      chain->build (candidates, uid);
> > +     /* Find instructions chain we want to convert to vector mode.
> > +        Check all uses and definitions to estimate all required
> > +        conversions.  */
> > +     chain->build (&candidates[i], uid);
> >
> > -      if (chain->compute_convert_gain () > 0)
> > -     converted_insns += chain->convert ();
> > -      else
> > -     if (dump_file)
> > -       fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> > -                chain->chain_id);
> > +     if (chain->compute_convert_gain () > 0)
> > +       converted_insns += chain->convert ();
> > +     else
> > +       if (dump_file)
> > +         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> > +                  chain->chain_id);
> >
> > -      delete chain;
> > -    }
> > +     delete chain;
> > +      }
> >
> >    if (dump_file)
> >      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
> >
> > -  BITMAP_FREE (candidates);
> > +  for (unsigned i = 0; i <= 2; ++i)
> > +    bitmap_release (&candidates[i]);
> >    bitmap_obstack_release (NULL);
> >    df_process_deferred_rescans ();
> >
> > Index: gcc/config/i386/i386-features.h
> > ===================================================================
> > --- gcc/config/i386/i386-features.h   (revision 274111)
> > +++ gcc/config/i386/i386-features.h   (working copy)
> > @@ -127,11 +127,16 @@ namespace {
> >  class scalar_chain
> >  {
> >   public:
> > -  scalar_chain ();
> > +  scalar_chain (enum machine_mode, enum machine_mode);
> >    virtual ~scalar_chain ();
> >
> >    static unsigned max_id;
> >
> > +  /* Scalar mode.  */
> > +  enum machine_mode smode;
> > +  /* Vector mode.  */
> > +  enum machine_mode vmode;
> > +
> >    /* ID of a chain.  */
> >    unsigned int chain_id;
> >    /* A queue of instructions to be included into a chain.  */
> > @@ -159,9 +164,11 @@ class scalar_chain
> >    virtual void convert_registers () = 0;
> >  };
> >
> > -class dimode_scalar_chain : public scalar_chain
> > +class general_scalar_chain : public scalar_chain
> >  {
> >   public:
> > +  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> > +    : scalar_chain (smode_, vmode_) {}
> >    int compute_convert_gain ();
> >   private:
> >    void mark_dual_mode_def (df_ref def);
> > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
> >  class timode_scalar_chain : public scalar_chain
> >  {
> >   public:
> > +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> > +
> >    /* Convert from TImode to V1TImode is always faster.  */
> >    int compute_convert_gain () { return 1; }
> >
> > Index: gcc/config/i386/i386.md
> > ===================================================================
> > --- gcc/config/i386/i386.md   (revision 274111)
> > +++ gcc/config/i386/i386.md   (working copy)
> > @@ -17721,6 +17721,30 @@ (define_peephole2
> >      std::swap (operands[4], operands[5]);
> >  })
> >
> > +;; min/max patterns

You should use:

(define_mode_iterator MAXMIN_IMODE
  [(SI "TARGET_SSE4_1") (DI "TARGET_64BIT && TARGET_AVX512F")])

in the pattern below.  Otherwise the middle-end detects and emits
minmax patterns that have no chance of being converted and that
always split back to integer insns.

> > +(define_code_attr maxmin_rel
> > +  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
> > +(define_code_attr maxmin_cmpmode
> > +  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
> > +
> > +(define_insn_and_split "<code><mode>3"
> > +  [(set (match_operand:SWI48 0 "register_operand")
> > +     (maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> > +                   (match_operand:SWI48 2 "register_operand")))
> > +   (clobber (reg:CC FLAGS_REG))]
> > +  "TARGET_STV && TARGET_SSE4_1

Leave only TARGET_STV here if MAXMIN_IMODE is used; the iterator
already carries the ISA conditions.
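
For illustration, the head of the pattern would then shrink to
something like this (a minimal untested sketch, assuming the
MAXMIN_IMODE iterator above; the split body stays as in the patch):

(define_insn_and_split "<code><mode>3"
  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
	(maxmin:MAXMIN_IMODE
	  (match_operand:MAXMIN_IMODE 1 "register_operand")
	  (match_operand:MAXMIN_IMODE 2 "register_operand")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_STV && can_create_pseudo_p ()"
  ;; "#", "&& 1" and the split as in the patch
  ...)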

> > +   && can_create_pseudo_p ()"
> > +  "#"
> > +  "&& 1"
> > +  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
> > +     (compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
> > +   (set (match_dup 0)
> > +     (if_then_else:SWI48
> > +       (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
> > +       (match_dup 1)
> > +       (match_dup 2)))])
> > +
> >  ;; Conditional addition patterns
> >  (define_expand "add<mode>cc"
> >    [(match_operand:SWI 0 "register_operand")
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
Uros Bizjak Aug. 7, 2019, 12:20 p.m. UTC | #39
On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 7 Aug 2019, Richard Biener wrote:
>
> > On Mon, 5 Aug 2019, Uros Bizjak wrote:
> >
> > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > >
> > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > >
> > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > >
> > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > to force use of %zmmN?
> > > > > > >
> > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > >
> > > > > >     case SMAX:
> > > > > >     case SMIN:
> > > > > >     case UMAX:
> > > > > >     case UMIN:
> > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > >         return false;
> > > > > >
> > > > > > so there's no way to use AVX512VL for 32bit?
> > > > >
> > > > > There is a way, but on 32bit targets, we need to split DImode
> > > > > operation to a sequence of SImode operations for unconverted pattern.
> > > > > This is of course doable, but somehow more complex than simply
> > > > > emitting a DImode compare + DImode cmove, which is what current
> > > > > splitter does. So, a follow-up task.
> > > >
> > > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > > check we just need to properly split if we enable the scalar minmax
> > > > pattern for DImode on 32bits, the STV conversion would go fine.
> > >
> > > Yes, that is correct.
> >
> > So I tested the patch below (now with appropriate ChangeLog) on
> > x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> > the obvious hmmer improvement, now checking for off-noise results
> > with a 3-run on those that may have one (with more than +-1 second
> > differences in the 1-run).
> >
> > As-is the patch likely runs into the splitting issue for DImode
> > on i?86 and the patch misses functional testcases.  I'll do the
> > hmmer loop with both DImode and SImode and testcases to trigger
> > all pattern variants with the different ISAs we have.
> >
> > Some of the patch could be split out (the cost changes that are
> > also effective for DImode for example).
> >
> > AFAICS we could go with only adding SImode avoiding the DImode
> > splitting thing and this would solve the hmmer regression.
>
> I've additionally bootstrapped with --with-arch=nehalem which
> reveals
>
> FAIL: gcc.target/i386/minmax-2.c scan-assembler test
> FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp
>
> we emit cmp + cmov here now with -msse4.1 (as soon as the max
> pattern is enabled I guess)

Actually, we have to split using ix86_expand_int_compare. This will
generate an optimized CC mode.
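
Something along these lines could work -- an untested sketch only; it
assumes the maxmin_rel code attribute is switched to uppercase RTX
codes, i.e. (smax "GE") (smin "LE") (umax "GEU") (umin "LEU"), and that
ix86_expand_int_compare, currently static in i386.c, gets exported via
i386-protos.h:

  "#"
  "&& 1"
  [(const_int 0)]
{
  /* Sketch: let ix86_expand_int_compare pick the cheapest CC mode
     instead of hard-coding CCGC/CC in the split pattern.  It emits
     the flags-setting compare and returns the condition rtx.  */
  rtx cond = ix86_expand_int_compare (<maxmin_rel>, operands[1],
				      operands[2]);
  emit_insn (gen_rtx_SET (operands[0],
			  gen_rtx_IF_THEN_ELSE (<MODE>mode, cond,
						operands[1],
						operands[2])));
  DONE;
})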

Uros.

>
> Otherwise testing is clean, so I suppose this is the net effect
> of just doing the SImode chains;  I don't have AVX512 HW handily
> available to really test the DImode path.
>
> Would you be fine to simplify the patch down to SImode chain handling?
>
> Thanks,
> Richard.
>
> > Thanks,
> > Richard.
> >
> > 2019-08-07  Richard Biener  <rguenther@suse.de>
> >
> >       PR target/91154
> >       * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
> >       mode arguments.
> >       (scalar_chain::smode): New member.
> >       (scalar_chain::vmode): Likewise.
> >       (dimode_scalar_chain): Rename to...
> >       (general_scalar_chain): ... this.
> >       (general_scalar_chain::general_scalar_chain): Take mode arguments.
> >       (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
> >       base with TImode and V1TImode.
> >       * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
> >       (general_scalar_chain::vector_const_cost): Adjust for SImode
> >       chains.
> >       (general_scalar_chain::compute_convert_gain): Likewise.  Fix
> >       reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
> >       scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
> >       gain if not zero.
> >       (general_scalar_chain::replace_with_subreg): Use vmode/smode.
> >       (general_scalar_chain::make_vector_copies): Likewise.  Handle
> >       non-DImode chains appropriately.
> >       (general_scalar_chain::convert_reg): Likewise.
> >       (general_scalar_chain::convert_op): Likewise.
> >       (general_scalar_chain::convert_insn): Likewise.  Add
> >       fatal_insn_not_found if the result is not recognized.
> >       (convertible_comparison_p): Pass in the scalar mode and use that.
> >       (general_scalar_to_vector_candidate_p): Likewise.  Rename from
> >       dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
> >       (scalar_to_vector_candidate_p): Remove by inlining into single
> >       caller.
> >       (general_remove_non_convertible_regs): Rename from
> >       dimode_remove_non_convertible_regs.
> >       (remove_non_convertible_regs): Remove by inlining into single caller.
> >       (convert_scalars_to_vector): Handle SImode and DImode chains
> >       in addition to TImode chains.
> >       * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
> >
> > Index: gcc/config/i386/i386-features.c
> > ===================================================================
> > --- gcc/config/i386/i386-features.c   (revision 274111)
> > +++ gcc/config/i386/i386-features.c   (working copy)
> > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
> >
> >  /* Initialize new chain.  */
> >
> > -scalar_chain::scalar_chain ()
> > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> >  {
> > +  smode = smode_;
> > +  vmode = vmode_;
> > +
> >    chain_id = ++max_id;
> >
> >     if (dump_file)
> > @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
> >     conversion.  */
> >
> >  void
> > -dimode_scalar_chain::mark_dual_mode_def (df_ref def)
> > +general_scalar_chain::mark_dual_mode_def (df_ref def)
> >  {
> >    gcc_assert (DF_REF_REG_DEF_P (def));
> >
> > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
> >        && !HARD_REGISTER_P (SET_DEST (def_set)))
> >      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
> >
> > +  /* ???  The following is quadratic since analyze_register_chain
> > +     iterates over all refs to look for dual-mode regs.  Instead this
> > +     should be done separately for all regs mentioned in the chain once.  */
> >    df_ref ref;
> >    df_ref def;
> >    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> > @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
> >     instead of using a scalar one.  */
> >
> >  int
> > -dimode_scalar_chain::vector_const_cost (rtx exp)
> > +general_scalar_chain::vector_const_cost (rtx exp)
> >  {
> >    gcc_assert (CONST_INT_P (exp));
> >
> > -  if (standard_sse_constant_p (exp, V2DImode))
> > -    return COSTS_N_INSNS (1);
> > -  return ix86_cost->sse_load[1];
> > +  if (standard_sse_constant_p (exp, vmode))
> > +    return ix86_cost->sse_op;
> > +  /* We have separate costs for SImode and DImode, use SImode costs
> > +     for smaller modes.  */
> > +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
> >  }
> >
> >  /* Compute a gain for chain conversion.  */
> >
> >  int
> > -dimode_scalar_chain::compute_convert_gain ()
> > +general_scalar_chain::compute_convert_gain ()
> >  {
> >    bitmap_iterator bi;
> >    unsigned insn_uid;
> > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
> >    if (dump_file)
> >      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
> >
> > +  /* SSE costs distinguish between SImode and DImode loads/stores, for
> > +     int costs factor in the number of GPRs involved.  When supporting
> > +     smaller modes than SImode the int load/store costs need to be
> > +     adjusted as well.  */
> > +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> > +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> > +
> >    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
> >      {
> >        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
> >        rtx def_set = single_set (insn);
> >        rtx src = SET_SRC (def_set);
> >        rtx dst = SET_DEST (def_set);
> > +      int igain = 0;
> >
> >        if (REG_P (src) && REG_P (dst))
> > -     gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> > +     igain += 2 * m - ix86_cost->xmm_move;
> >        else if (REG_P (src) && MEM_P (dst))
> > -     gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> > +     igain
> > +       += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
> >        else if (MEM_P (src) && REG_P (dst))
> > -     gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> > +     igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
> >        else if (GET_CODE (src) == ASHIFT
> >              || GET_CODE (src) == ASHIFTRT
> >              || GET_CODE (src) == LSHIFTRT)
> >       {
> >         if (CONST_INT_P (XEXP (src, 0)))
> > -         gain -= vector_const_cost (XEXP (src, 0));
> > -       gain += ix86_cost->shift_const;
> > +         igain -= vector_const_cost (XEXP (src, 0));
> > +       igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
> >         if (INTVAL (XEXP (src, 1)) >= 32)
> > -         gain -= COSTS_N_INSNS (1);
> > +         igain -= COSTS_N_INSNS (1);
> >       }
> >        else if (GET_CODE (src) == PLUS
> >              || GET_CODE (src) == MINUS
> > @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
> >              || GET_CODE (src) == XOR
> >              || GET_CODE (src) == AND)
> >       {
> > -       gain += ix86_cost->add;
> > +       igain += m * ix86_cost->add - ix86_cost->sse_op;
> >         /* Additional gain for andnot for targets without BMI.  */
> >         if (GET_CODE (XEXP (src, 0)) == NOT
> >             && !TARGET_BMI)
> > -         gain += 2 * ix86_cost->add;
> > +         igain += m * ix86_cost->add;
> >
> >         if (CONST_INT_P (XEXP (src, 0)))
> > -         gain -= vector_const_cost (XEXP (src, 0));
> > +         igain -= vector_const_cost (XEXP (src, 0));
> >         if (CONST_INT_P (XEXP (src, 1)))
> > -         gain -= vector_const_cost (XEXP (src, 1));
> > +         igain -= vector_const_cost (XEXP (src, 1));
> >       }
> >        else if (GET_CODE (src) == NEG
> >              || GET_CODE (src) == NOT)
> > -     gain += ix86_cost->add - COSTS_N_INSNS (1);
> > +     igain += m * ix86_cost->add - ix86_cost->sse_op;
> > +      else if (GET_CODE (src) == SMAX
> > +            || GET_CODE (src) == SMIN
> > +            || GET_CODE (src) == UMAX
> > +            || GET_CODE (src) == UMIN)
> > +     {
> > +       /* We do not have any conditional move cost, estimate it as a
> > +          reg-reg move.  Comparisons are costed as adds.  */
> > +       igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> > +       /* Integer SSE ops are all costed the same.  */
> > +       igain -= ix86_cost->sse_op;
> > +     }
> >        else if (GET_CODE (src) == COMPARE)
> >       {
> >         /* Assume comparison cost is the same.  */
> > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
> >        else if (CONST_INT_P (src))
> >       {
> >         if (REG_P (dst))
> > -         gain += COSTS_N_INSNS (2);
> > +         /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> > +         igain += COSTS_N_INSNS (m);
> >         else if (MEM_P (dst))
> > -         gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> > -       gain -= vector_const_cost (src);
> > +         igain += (m * ix86_cost->int_store[2]
> > +                  - ix86_cost->sse_store[sse_cost_idx]);
> > +       igain -= vector_const_cost (src);
> >       }
> >        else
> >       gcc_unreachable ();
> > +
> > +      if (igain != 0 && dump_file)
> > +     {
> > +       fprintf (dump_file, "  Instruction gain %d for ", igain);
> > +       dump_insn_slim (dump_file, insn);
> > +     }
> > +      gain += igain;
> >      }
> >
> >    if (dump_file)
> >      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
> >
> > +  /* ???  What about integer to SSE?  */
> >    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
> >      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
> >
> > @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
> >  /* Replace REG in X with a V2DI subreg of NEW_REG.  */
> >
> >  rtx
> > -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> > +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> >  {
> >    if (x == reg)
> > -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> > +    return gen_rtx_SUBREG (vmode, new_reg, 0);
> >
> >    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
> >    int i, j;
> > @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
> >  /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
> >
> >  void
> > -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
> > +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
> >                                                 rtx reg, rtx new_reg)
> >  {
> >    replace_with_subreg (single_set (insn), reg, new_reg);
> > @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
> >     and replace its uses in a chain.  */
> >
> >  void
> > -dimode_scalar_chain::make_vector_copies (unsigned regno)
> > +general_scalar_chain::make_vector_copies (unsigned regno)
> >  {
> >    rtx reg = regno_reg_rtx[regno];
> > -  rtx vreg = gen_reg_rtx (DImode);
> > +  rtx vreg = gen_reg_rtx (smode);
> >    df_ref ref;
> >
> >    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> > @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
> >       start_sequence ();
> >       if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
> >         {
> > -         rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> > -         emit_move_insn (adjust_address (tmp, SImode, 0),
> > -                         gen_rtx_SUBREG (SImode, reg, 0));
> > -         emit_move_insn (adjust_address (tmp, SImode, 4),
> > -                         gen_rtx_SUBREG (SImode, reg, 4));
> > +         rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> > +         if (smode == DImode && !TARGET_64BIT)
> > +           {
> > +             emit_move_insn (adjust_address (tmp, SImode, 0),
> > +                             gen_rtx_SUBREG (SImode, reg, 0));
> > +             emit_move_insn (adjust_address (tmp, SImode, 4),
> > +                             gen_rtx_SUBREG (SImode, reg, 4));
> > +           }
> > +         else
> > +           emit_move_insn (tmp, reg);
> >           emit_move_insn (vreg, tmp);
> >         }
> > -     else if (TARGET_SSE4_1)
> > +     else if (!TARGET_64BIT && smode == DImode)
> >         {
> > -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                                     CONST0_RTX (V4SImode),
> > -                                     gen_rtx_SUBREG (SImode, reg, 0)));
> > -         emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                                       gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                                       gen_rtx_SUBREG (SImode, reg, 4),
> > -                                       GEN_INT (2)));
> > +         if (TARGET_SSE4_1)
> > +           {
> > +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                                         CONST0_RTX (V4SImode),
> > +                                         gen_rtx_SUBREG (SImode, reg, 0)));
> > +             emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                                           gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                                           gen_rtx_SUBREG (SImode, reg, 4),
> > +                                           GEN_INT (2)));
> > +           }
> > +         else
> > +           {
> > +             rtx tmp = gen_reg_rtx (DImode);
> > +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                                         CONST0_RTX (V4SImode),
> > +                                         gen_rtx_SUBREG (SImode, reg, 0)));
> > +             emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> > +                                         CONST0_RTX (V4SImode),
> > +                                         gen_rtx_SUBREG (SImode, reg, 4)));
> > +             emit_insn (gen_vec_interleave_lowv4si
> > +                        (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                         gen_rtx_SUBREG (V4SImode, vreg, 0),
> > +                         gen_rtx_SUBREG (V4SImode, tmp, 0)));
> > +           }
> >         }
> >       else
> > -       {
> > -         rtx tmp = gen_reg_rtx (DImode);
> > -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                                     CONST0_RTX (V4SImode),
> > -                                     gen_rtx_SUBREG (SImode, reg, 0)));
> > -         emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> > -                                     CONST0_RTX (V4SImode),
> > -                                     gen_rtx_SUBREG (SImode, reg, 4)));
> > -         emit_insn (gen_vec_interleave_lowv4si
> > -                    (gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                     gen_rtx_SUBREG (V4SImode, vreg, 0),
> > -                     gen_rtx_SUBREG (V4SImode, tmp, 0)));
> > -       }
> > +       emit_move_insn (gen_lowpart (smode, vreg), reg);
> >       rtx_insn *seq = get_insns ();
> >       end_sequence ();
> >       rtx_insn *insn = DF_REF_INSN (ref);
> > @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies
> >     in case register is used in not convertible insn.  */
> >
> >  void
> > -dimode_scalar_chain::convert_reg (unsigned regno)
> > +general_scalar_chain::convert_reg (unsigned regno)
> >  {
> >    bool scalar_copy = bitmap_bit_p (defs_conv, regno);
> >    rtx reg = regno_reg_rtx[regno];
> > @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
> >    bitmap_copy (conv, insns);
> >
> >    if (scalar_copy)
> > -    scopy = gen_reg_rtx (DImode);
> > +    scopy = gen_reg_rtx (smode);
> >
> >    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> >      {
> > @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
> >         start_sequence ();
> >         if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
> >           {
> > -           rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> > +           rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> >             emit_move_insn (tmp, reg);
> > -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> > -                           adjust_address (tmp, SImode, 0));
> > -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> > -                           adjust_address (tmp, SImode, 4));
> > +           if (!TARGET_64BIT && smode == DImode)
> > +             {
> > +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> > +                               adjust_address (tmp, SImode, 0));
> > +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> > +                               adjust_address (tmp, SImode, 4));
> > +             }
> > +           else
> > +             emit_move_insn (scopy, tmp);
> >           }
> > -       else if (TARGET_SSE4_1)
> > +       else if (!TARGET_64BIT && smode == DImode)
> >           {
> > -           rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> > -           emit_insn
> > -             (gen_rtx_SET
> > -              (gen_rtx_SUBREG (SImode, scopy, 0),
> > -               gen_rtx_VEC_SELECT (SImode,
> > -                                   gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> > -
> > -           tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> > -           emit_insn
> > -             (gen_rtx_SET
> > -              (gen_rtx_SUBREG (SImode, scopy, 4),
> > -               gen_rtx_VEC_SELECT (SImode,
> > -                                   gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> > +           if (TARGET_SSE4_1)
> > +             {
> > +               rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> > +                                           gen_rtvec (1, const0_rtx));
> > +               emit_insn
> > +                 (gen_rtx_SET
> > +                    (gen_rtx_SUBREG (SImode, scopy, 0),
> > +                     gen_rtx_VEC_SELECT (SImode,
> > +                                         gen_rtx_SUBREG (V4SImode, reg, 0),
> > +                                         tmp)));
> > +
> > +               tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> > +               emit_insn
> > +                 (gen_rtx_SET
> > +                    (gen_rtx_SUBREG (SImode, scopy, 4),
> > +                     gen_rtx_VEC_SELECT (SImode,
> > +                                         gen_rtx_SUBREG (V4SImode, reg, 0),
> > +                                         tmp)));
> > +             }
> > +           else
> > +             {
> > +               rtx vcopy = gen_reg_rtx (V2DImode);
> > +               emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> > +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> > +                               gen_rtx_SUBREG (SImode, vcopy, 0));
> > +               emit_move_insn (vcopy,
> > +                               gen_rtx_LSHIFTRT (V2DImode,
> > +                                                 vcopy, GEN_INT (32)));
> > +               emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> > +                               gen_rtx_SUBREG (SImode, vcopy, 0));
> > +             }
> >           }
> >         else
> > -         {
> > -           rtx vcopy = gen_reg_rtx (V2DImode);
> > -           emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> > -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> > -                           gen_rtx_SUBREG (SImode, vcopy, 0));
> > -           emit_move_insn (vcopy,
> > -                           gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> > -           emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> > -                           gen_rtx_SUBREG (SImode, vcopy, 0));
> > -         }
> > +         emit_move_insn (scopy, reg);
> > +
> >         rtx_insn *seq = get_insns ();
> >         end_sequence ();
> >         emit_conversion_insns (seq, insn);
> > @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign
> >     registers conversion.  */
> >
> >  void
> > -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
> > +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
> >  {
> >    *op = copy_rtx_if_shared (*op);
> >
> >    if (GET_CODE (*op) == NOT)
> >      {
> >        convert_op (&XEXP (*op, 0), insn);
> > -      PUT_MODE (*op, V2DImode);
> > +      PUT_MODE (*op, vmode);
> >      }
> >    else if (MEM_P (*op))
> >      {
> > -      rtx tmp = gen_reg_rtx (DImode);
> > +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
> >
> >        emit_insn_before (gen_move_insn (tmp, *op), insn);
> > -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> > +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
> >
> >        if (dump_file)
> >       fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> > @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
> >           gcc_assert (!DF_REF_CHAIN (ref));
> >           break;
> >         }
> > -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> > +      *op = gen_rtx_SUBREG (vmode, *op, 0);
> >      }
> >    else if (CONST_INT_P (*op))
> >      {
> >        rtx vec_cst;
> > -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> > +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
> >
> >        /* Prefer all ones vector in case of -1.  */
> >        if (constm1_operand (*op, GET_MODE (*op)))
> > -     vec_cst = CONSTM1_RTX (V2DImode);
> > +     vec_cst = CONSTM1_RTX (vmode);
> >        else
> > -     vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> > -                                     gen_rtvec (2, *op, const0_rtx));
> > +     {
> > +       unsigned n = GET_MODE_NUNITS (vmode);
> > +       rtx *v = XALLOCAVEC (rtx, n);
> > +       v[0] = *op;
> > +       for (unsigned i = 1; i < n; ++i)
> > +         v[i] = const0_rtx;
> > +       vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> > +     }
> >
> > -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> > +      if (!standard_sse_constant_p (vec_cst, vmode))
> >       {
> >         start_sequence ();
> > -       vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> > +       vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
> >         rtx_insn *seq = get_insns ();
> >         end_sequence ();
> >         emit_insn_before (seq, insn);
> > @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op
> >    else
> >      {
> >        gcc_assert (SUBREG_P (*op));
> > -      gcc_assert (GET_MODE (*op) == V2DImode);
> > +      gcc_assert (GET_MODE (*op) == vmode);
> >      }
> >  }
> >
> >  /* Convert INSN to vector mode.  */
> >
> >  void
> > -dimode_scalar_chain::convert_insn (rtx_insn *insn)
> > +general_scalar_chain::convert_insn (rtx_insn *insn)
> >  {
> >    rtx def_set = single_set (insn);
> >    rtx src = SET_SRC (def_set);
> > @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
> >      {
> >        /* There are no scalar integer instructions and therefore
> >        temporary register usage is required.  */
> > -      rtx tmp = gen_reg_rtx (DImode);
> > +      rtx tmp = gen_reg_rtx (GET_MODE (dst));
> >        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> > -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> > +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
> >      }
> >
> >    switch (GET_CODE (src))
> > @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
> >      case ASHIFTRT:
> >      case LSHIFTRT:
> >        convert_op (&XEXP (src, 0), insn);
> > -      PUT_MODE (src, V2DImode);
> > +      PUT_MODE (src, vmode);
> >        break;
> >
> >      case PLUS:
> > @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
> >      case IOR:
> >      case XOR:
> >      case AND:
> > +    case SMAX:
> > +    case SMIN:
> > +    case UMAX:
> > +    case UMIN:
> >        convert_op (&XEXP (src, 0), insn);
> >        convert_op (&XEXP (src, 1), insn);
> > -      PUT_MODE (src, V2DImode);
> > +      PUT_MODE (src, vmode);
> >        break;
> >
> >      case NEG:
> >        src = XEXP (src, 0);
> >        convert_op (&src, insn);
> > -      subreg = gen_reg_rtx (V2DImode);
> > -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> > -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> > +      subreg = gen_reg_rtx (vmode);
> > +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> > +      src = gen_rtx_MINUS (vmode, subreg, src);
> >        break;
> >
> >      case NOT:
> >        src = XEXP (src, 0);
> >        convert_op (&src, insn);
> > -      subreg = gen_reg_rtx (V2DImode);
> > -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> > -      src = gen_rtx_XOR (V2DImode, src, subreg);
> > +      subreg = gen_reg_rtx (vmode);
> > +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> > +      src = gen_rtx_XOR (vmode, src, subreg);
> >        break;
> >
> >      case MEM:
> > @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
> >        break;
> >
> >      case SUBREG:
> > -      gcc_assert (GET_MODE (src) == V2DImode);
> > +      gcc_assert (GET_MODE (src) == vmode);
> >        break;
> >
> >      case COMPARE:
> >        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
> >
> > -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> > -               || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> > +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> > +               || (SUBREG_P (src) && GET_MODE (src) == vmode));
> >
> >        if (REG_P (src))
> > -     subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> > +     subreg = gen_rtx_SUBREG (vmode, src, 0);
> >        else
> >       subreg = copy_rtx_if_shared (src);
> >        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> > @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
> >    PATTERN (insn) = def_set;
> >
> >    INSN_CODE (insn) = -1;
> > -  recog_memoized (insn);
> > +  int patt = recog_memoized (insn);
> > +  if (patt == -1)
> > +    fatal_insn_not_found (insn);
> >    df_insn_rescan (insn);
> >  }
> >
> > @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i
> >  }
> >
> >  void
> > -dimode_scalar_chain::convert_registers ()
> > +general_scalar_chain::convert_registers ()
> >  {
> >    bitmap_iterator bi;
> >    unsigned id;
> > @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
> >                    (const_int 0 [0])))  */
> >
> >  static bool
> > -convertible_comparison_p (rtx_insn *insn)
> > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
> >  {
> >    if (!TARGET_SSE4_1)
> >      return false;
> > @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
> >
> >    if (!SUBREG_P (op1)
> >        || !SUBREG_P (op2)
> > -      || GET_MODE (op1) != SImode
> > -      || GET_MODE (op2) != SImode
> > +      || GET_MODE (op1) != mode
> > +      || GET_MODE (op2) != mode
> >        || ((SUBREG_BYTE (op1) != 0
> > -        || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> > +        || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
> >         && (SUBREG_BYTE (op2) != 0
> > -           || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> > +           || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
> >      return false;
> >
> >    op1 = SUBREG_REG (op1);
> > @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
> >
> >    if (op1 != op2
> >        || !REG_P (op1)
> > -      || GET_MODE (op1) != DImode)
> > +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
> >      return false;
> >
> >    return true;
> > @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
> >  /* The DImode version of scalar_to_vector_candidate_p.  */
> >
> >  static bool
> > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> > +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
> >  {
> >    rtx def_set = single_set (insn);
> >
> > @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
> >    rtx dst = SET_DEST (def_set);
> >
> >    if (GET_CODE (src) == COMPARE)
> > -    return convertible_comparison_p (insn);
> > +    return convertible_comparison_p (insn, mode);
> >
> >    /* We are interested in DImode promotion only.  */
> > -  if ((GET_MODE (src) != DImode
> > +  if ((GET_MODE (src) != mode
> >         && !CONST_INT_P (src))
> > -      || GET_MODE (dst) != DImode)
> > +      || GET_MODE (dst) != mode)
> >      return false;
> >
> >    if (!REG_P (dst) && !MEM_P (dst))
> > @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
> >       return false;
> >        break;
> >
> > +    case SMAX:
> > +    case SMIN:
> > +    case UMAX:
> > +    case UMIN:
> > +      if ((mode == DImode && !TARGET_AVX512VL)
> > +       || (mode == SImode && !TARGET_SSE4_1))
> > +     return false;
> > +      /* Fallthru.  */
> > +
> >      case PLUS:
> >      case MINUS:
> >      case IOR:
> > @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
> >         && !CONST_INT_P (XEXP (src, 1)))
> >       return false;
> >
> > -      if (GET_MODE (XEXP (src, 1)) != DImode
> > +      if (GET_MODE (XEXP (src, 1)) != mode
> >         && !CONST_INT_P (XEXP (src, 1)))
> >       return false;
> >        break;
> > @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
> >         || !REG_P (XEXP (XEXP (src, 0), 0))))
> >        return false;
> >
> > -  if (GET_MODE (XEXP (src, 0)) != DImode
> > +  if (GET_MODE (XEXP (src, 0)) != mode
> >        && !CONST_INT_P (XEXP (src, 0)))
> >      return false;
> >
> > @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx
> >    return false;
> >  }
> >
> > -/* Return 1 if INSN may be converted into vector
> > -   instruction.  */
> > -
> > -static bool
> > -scalar_to_vector_candidate_p (rtx_insn *insn)
> > -{
> > -  if (TARGET_64BIT)
> > -    return timode_scalar_to_vector_candidate_p (insn);
> > -  else
> > -    return dimode_scalar_to_vector_candidate_p (insn);
> > -}
> > +/* For a given bitmap of insn UIDs scans all instruction and
> > +   remove insn from CANDIDATES in case it has both convertible
> > +   and not convertible definitions.
> >
> > -/* The DImode version of remove_non_convertible_regs.  */
> > +   All insns in a bitmap are conversion candidates according to
> > +   scalar_to_vector_candidate_p.  Currently it implies all insns
> > +   are single_set.  */
> >
> >  static void
> > -dimode_remove_non_convertible_regs (bitmap candidates)
> > +general_remove_non_convertible_regs (bitmap candidates)
> >  {
> >    bitmap_iterator bi;
> >    unsigned id;
> > @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
> >    BITMAP_FREE (regs);
> >  }
> >
> > -/* For a given bitmap of insn UIDs scans all instruction and
> > -   remove insn from CANDIDATES in case it has both convertible
> > -   and not convertible definitions.
> > -
> > -   All insns in a bitmap are conversion candidates according to
> > -   scalar_to_vector_candidate_p.  Currently it implies all insns
> > -   are single_set.  */
> > -
> > -static void
> > -remove_non_convertible_regs (bitmap candidates)
> > -{
> > -  if (TARGET_64BIT)
> > -    timode_remove_non_convertible_regs (candidates);
> > -  else
> > -    dimode_remove_non_convertible_regs (candidates);
> > -}
> > -
> >  /* Main STV pass function.  Find and convert scalar
> >     instructions into vector mode when profitable.  */
> >
> > @@ -1577,11 +1638,14 @@ static unsigned int
> >  convert_scalars_to_vector ()
> >  {
> >    basic_block bb;
> > -  bitmap candidates;
> >    int converted_insns = 0;
> >
> >    bitmap_obstack_initialize (NULL);
> > -  candidates = BITMAP_ALLOC (NULL);
> > +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> > +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> > +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> > +  for (unsigned i = 0; i < 3; ++i)
> > +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
> >
> >    calculate_dominance_info (CDI_DOMINATORS);
> >    df_set_flags (DF_DEFER_INSN_RESCAN);
> > @@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
> >      {
> >        rtx_insn *insn;
> >        FOR_BB_INSNS (bb, insn)
> > -     if (scalar_to_vector_candidate_p (insn))
> > +     if (TARGET_64BIT
> > +         && timode_scalar_to_vector_candidate_p (insn))
> >         {
> >           if (dump_file)
> > -           fprintf (dump_file, "  insn %d is marked as a candidate\n",
> > +           fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
> >                      INSN_UID (insn));
> >
> > -         bitmap_set_bit (candidates, INSN_UID (insn));
> > +         bitmap_set_bit (&candidates[2], INSN_UID (insn));
> > +       }
> > +     else
> > +       {
> > +         /* Check {SI,DI}mode.  */
> > +         for (unsigned i = 0; i <= 1; ++i)
> > +           if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> > +             {
> > +               if (dump_file)
> > +                 fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> > +                          INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> > +
> > +               bitmap_set_bit (&candidates[i], INSN_UID (insn));
> > +               break;
> > +             }
> >         }
> >      }
> >
> > -  remove_non_convertible_regs (candidates);
> > +  if (TARGET_64BIT)
> > +    timode_remove_non_convertible_regs (&candidates[2]);
> > +  for (unsigned i = 0; i <= 1; ++i)
> > +    general_remove_non_convertible_regs (&candidates[i]);
> >
> > -  if (bitmap_empty_p (candidates))
> > -    if (dump_file)
> > +  for (unsigned i = 0; i <= 2; ++i)
> > +    if (!bitmap_empty_p (&candidates[i]))
> > +      break;
> > +    else if (i == 2 && dump_file)
> >        fprintf (dump_file, "There are no candidates for optimization.\n");
> >
> > -  while (!bitmap_empty_p (candidates))
> > -    {
> > -      unsigned uid = bitmap_first_set_bit (candidates);
> > -      scalar_chain *chain;
> > +  for (unsigned i = 0; i <= 2; ++i)
> > +    while (!bitmap_empty_p (&candidates[i]))
> > +      {
> > +     unsigned uid = bitmap_first_set_bit (&candidates[i]);
> > +     scalar_chain *chain;
> >
> > -      if (TARGET_64BIT)
> > -     chain = new timode_scalar_chain;
> > -      else
> > -     chain = new dimode_scalar_chain;
> > +     if (cand_mode[i] == TImode)
> > +       chain = new timode_scalar_chain;
> > +     else
> > +       chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
> >
> > -      /* Find instructions chain we want to convert to vector mode.
> > -      Check all uses and definitions to estimate all required
> > -      conversions.  */
> > -      chain->build (candidates, uid);
> > +     /* Find instructions chain we want to convert to vector mode.
> > +        Check all uses and definitions to estimate all required
> > +        conversions.  */
> > +     chain->build (&candidates[i], uid);
> >
> > -      if (chain->compute_convert_gain () > 0)
> > -     converted_insns += chain->convert ();
> > -      else
> > -     if (dump_file)
> > -       fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> > -                chain->chain_id);
> > +     if (chain->compute_convert_gain () > 0)
> > +       converted_insns += chain->convert ();
> > +     else
> > +       if (dump_file)
> > +         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> > +                  chain->chain_id);
> >
> > -      delete chain;
> > -    }
> > +     delete chain;
> > +      }
> >
> >    if (dump_file)
> >      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
> >
> > -  BITMAP_FREE (candidates);
> > +  for (unsigned i = 0; i <= 2; ++i)
> > +    bitmap_release (&candidates[i]);
> >    bitmap_obstack_release (NULL);
> >    df_process_deferred_rescans ();
> >
> > Index: gcc/config/i386/i386-features.h
> > ===================================================================
> > --- gcc/config/i386/i386-features.h   (revision 274111)
> > +++ gcc/config/i386/i386-features.h   (working copy)
> > @@ -127,11 +127,16 @@ namespace {
> >  class scalar_chain
> >  {
> >   public:
> > -  scalar_chain ();
> > +  scalar_chain (enum machine_mode, enum machine_mode);
> >    virtual ~scalar_chain ();
> >
> >    static unsigned max_id;
> >
> > +  /* Scalar mode.  */
> > +  enum machine_mode smode;
> > +  /* Vector mode.  */
> > +  enum machine_mode vmode;
> > +
> >    /* ID of a chain.  */
> >    unsigned int chain_id;
> >    /* A queue of instructions to be included into a chain.  */
> > @@ -159,9 +164,11 @@ class scalar_chain
> >    virtual void convert_registers () = 0;
> >  };
> >
> > -class dimode_scalar_chain : public scalar_chain
> > +class general_scalar_chain : public scalar_chain
> >  {
> >   public:
> > +  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> > +    : scalar_chain (smode_, vmode_) {}
> >    int compute_convert_gain ();
> >   private:
> >    void mark_dual_mode_def (df_ref def);
> > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
> >  class timode_scalar_chain : public scalar_chain
> >  {
> >   public:
> > +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> > +
> >    /* Convert from TImode to V1TImode is always faster.  */
> >    int compute_convert_gain () { return 1; }
> >
> > Index: gcc/config/i386/i386.md
> > ===================================================================
> > --- gcc/config/i386/i386.md   (revision 274111)
> > +++ gcc/config/i386/i386.md   (working copy)
> > @@ -17721,6 +17721,30 @@ (define_peephole2
> >      std::swap (operands[4], operands[5]);
> >  })
> >
> > +;; min/max patterns
> > +
> > +(define_code_attr maxmin_rel
> > +  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
> > +(define_code_attr maxmin_cmpmode
> > +  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
> > +
> > +(define_insn_and_split "<code><mode>3"
> > +  [(set (match_operand:SWI48 0 "register_operand")
> > +     (maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> > +                   (match_operand:SWI48 2 "register_operand")))
> > +   (clobber (reg:CC FLAGS_REG))]
> > +  "TARGET_STV && TARGET_SSE4_1
> > +   && can_create_pseudo_p ()"
> > +  "#"
> > +  "&& 1"
> > +  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
> > +     (compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
> > +   (set (match_dup 0)
> > +     (if_then_else:SWI48
> > +       (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
> > +       (match_dup 1)
> > +       (match_dup 2)))])
> > +
> >  ;; Conditional addition patterns
> >  (define_expand "add<mode>cc"
> >    [(match_operand:SWI 0 "register_operand")
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
Uros Bizjak Aug. 7, 2019, 12:43 p.m. UTC | #40
On Wed, Aug 7, 2019 at 2:20 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Wed, 7 Aug 2019, Richard Biener wrote:
> >
> > > On Mon, 5 Aug 2019, Uros Bizjak wrote:
> > >
> > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > > >
> > > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > > >
> > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > > >
> > > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > > to force use of %zmmN?
> > > > > > > >
> > > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > > >
> > > > > > >     case SMAX:
> > > > > > >     case SMIN:
> > > > > > >     case UMAX:
> > > > > > >     case UMIN:
> > > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > > >         return false;
> > > > > > >
> > > > > > > so there's no way to use AVX512VL for 32bit?
> > > > > >
> > > > > > There is a way, but on 32bit targets, we need to split DImode
> > > > > > operation to a sequence of SImode operations for unconverted pattern.
> > > > > > This is of course doable, but somehow more complex than simply
> > > > > > emitting a DImode compare + DImode cmove, which is what current
> > > > > > splitter does. So, a follow-up task.
> > > > >
> > > > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > > > check we just need to properly split if we enable the scalar minmax
> > > > > pattern for DImode on 32bits, the STV conversion would go fine.
> > > >
> > > > Yes, that is correct.
> > >
> > > So I tested the patch below (now with appropriate ChangeLog) on
> > > x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> > > the obvious hmmer improvement, now checking for off-noise results
> > > with a 3-run on those that may have one (with more than +-1 second
> > > differences in the 1-run).
> > >
> > > As-is the patch likely runs into the splitting issue for DImode
> > > on i?86 and the patch misses functional testcases.  I'll do the
> > > hmmer loop with both DImode and SImode and testcases to trigger
> > > all pattern variants with the different ISAs we have.
> > >
> > > Some of the patch could be split out (the cost changes that are
> > > also effective for DImode for example).
> > >
> > > AFAICS we could go with only adding SImode avoiding the DImode
> > > splitting thing and this would solve the hmmer regression.
> >
> > I've additionally bootstrapped with --with-arch=nehalem which
> > reveals
> >
> > FAIL: gcc.target/i386/minmax-2.c scan-assembler test
> > FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp
> >
> > we emit cmp + cmov here now with -msse4.1 (as soon as the max
> > pattern is enabled I guess)
>
> Actually, we have to split using ix86_expand_int_compare. This will
> generate optimized CC mode.

So, this only matters for comparisons against zero. Currently, the
insn_and_split pattern allows only registers, but we can add other
types, too. I'd say that this is a benign issue.
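
A minimal sketch of such a split (not the posted pattern; it assumes a
code attribute mapping each max/min code to the matching comparison
code and open-codes what ix86_expand_int_compare does, so the CC mode
selection is visible):

(define_code_attr maxmin_relcode
  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])

(define_insn_and_split "<code><mode>3"
  [(set (match_operand:SWI48 0 "register_operand")
	(maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
		      (match_operand:SWI48 2 "register_operand")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_STV && TARGET_SSE4_1
   && can_create_pseudo_p ()"
  "#"
  "&& 1"
  [(set (match_dup 0)
	(if_then_else:SWI48 (match_dup 3)
	  (match_dup 1)
	  (match_dup 2)))]
{
  /* As in ix86_expand_int_compare: let SELECT_CC_MODE pick the CC
     mode for this comparison (CCNO/CCZ etc. for compares against
     zero), emit the flags-setting compare and feed the cmov the
     comparison against that flags register.  */
  enum rtx_code code = <maxmin_relcode>;
  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
  emit_insn (gen_rtx_SET (flags,
			  gen_rtx_COMPARE (cmpmode,
					   operands[1], operands[2])));
  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
})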

Uros.
Richard Biener Aug. 7, 2019, 12:52 p.m. UTC | #41
On Wed, 7 Aug 2019, Uros Bizjak wrote:

> On Wed, Aug 7, 2019 at 2:20 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Wed, 7 Aug 2019, Richard Biener wrote:
> > >
> > > > On Mon, 5 Aug 2019, Uros Bizjak wrote:
> > > >
> > > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> > > > >
> > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > > > >
> > > > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > > > >
> > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > > > >
> > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > > > to force use of %zmmN?
> > > > > > > > >
> > > > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > > > >
> > > > > > > >     case SMAX:
> > > > > > > >     case SMIN:
> > > > > > > >     case UMAX:
> > > > > > > >     case UMIN:
> > > > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > > > >         return false;
> > > > > > > >
> > > > > > > > so there's no way to use AVX512VL for 32bit?
> > > > > > >
> > > > > > > There is a way, but on 32bit targets, we need to split DImode
> > > > > > > operation to a sequence of SImode operations for unconverted pattern.
> > > > > > > This is of course doable, but somehow more complex than simply
> > > > > > > emitting a DImode compare + DImode cmove, which is what current
> > > > > > > splitter does. So, a follow-up task.
> > > > > >
> > > > > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > > > > check we just need to properly split if we enable the scalar minmax
> > > > > > pattern for DImode on 32bits, the STV conversion would go fine.
> > > > >
> > > > > Yes, that is correct.
> > > >
> > > > So I tested the patch below (now with appropriate ChangeLog) on
> > > > x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> > > > the obvious hmmer improvement, now checking for off-noise results
> > > > with a 3-run on those that may have one (with more than +-1 second
> > > > differences in the 1-run).
> > > >
> > > > As-is the patch likely runs into the splitting issue for DImode
> > > > on i?86 and the patch misses functional testcases.  I'll do the
> > > > hmmer loop with both DImode and SImode and testcases to trigger
> > > > all pattern variants with the different ISAs we have.
> > > >
> > > > Some of the patch could be split out (the cost changes that are
> > > > also effective for DImode for example).
> > > >
> > > > AFAICS we could go with only adding SImode avoiding the DImode
> > > > splitting thing and this would solve the hmmer regression.
> > >
> > > I've additionally bootstrapped with --with-arch=nehalem which
> > > reveals
> > >
> > > FAIL: gcc.target/i386/minmax-2.c scan-assembler test
> > > FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp
> > >
> > > we emit cmp + cmov here now with -msse4.1 (as soon as the max
> > > pattern is enabled I guess)
> >
> > Actually, we have to split using ix86_expand_int_compare. This will
> > generate optimized CC mode.
> 
> So, this only matters for comparisons against zero. Currently, the
> insn_and_split pattern allows only registers, but we can add other
> types, too. I'd say that this is a benign issue.

OK.  So this is with your suggestions applied, plus testcases as
promised.  If we remove DImode support, minmax-5.c has to be
adjusted at least.
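
For illustration, the shape of chain the SImode tests want to trigger
looks like the following (a hypothetical hmmer-style reduction, not
the exact committed testcase; function and array names are made up):

/* { dg-do compile } */
/* { dg-options "-O2 -msse4.1 -mstv" } */

void
foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
{
  int k, sc;
  for (k = 1; k < M; k++)
    {
      /* SImode loads and adds feeding a signed max and a store - a
	 chain STV can keep entirely in SSE registers.  */
      dc[k] = dc[k - 1] + tpdd[k - 1];
      if ((sc = mc[k - 1] + tpmd[k - 1]) > dc[k])
	dc[k] = sc;
    }
}

Whether STV actually converts such a chain depends on
compute_convert_gain coming out positive for it.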

Currently re-bootstrapping / testing on x86_64-unknown-linux-gnu.

I'll follow up with the performance assessment (currently only
testing on Haswell), but I guess it is easy enough to address
issues that pop up with the various auto-testers in follow-ups
by adjusting the cost function (and we may get additional testcases
then as well).

OK if the re-testing shows no issues?

Thanks,
Richard.

2019-08-07  Richard Biener  <rguenther@suse.de>

	PR target/91154
	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
	mode arguments.
	(scalar_chain::smode): New member.
	(scalar_chain::vmode): Likewise.
	(dimode_scalar_chain): Rename to...
	(general_scalar_chain): ... this.
	(general_scalar_chain::general_scalar_chain): Take mode arguments.
	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
	base with TImode and V1TImode.
	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
	(general_scalar_chain::vector_const_cost): Adjust for SImode
	chains.
	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
	gain if not zero.
	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
	(general_scalar_chain::make_vector_copies): Likewise.  Handle
	non-DImode chains appropriately.
	(general_scalar_chain::convert_reg): Likewise.
	(general_scalar_chain::convert_op): Likewise.
	(general_scalar_chain::convert_insn): Likewise.  Add
	fatal_insn_not_found if the result is not recognized.
	(convertible_comparison_p): Pass in the scalar mode and use that.
	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
	(scalar_to_vector_candidate_p): Remove by inlining into single
	caller.
	(general_remove_non_convertible_regs): Rename from
	dimode_remove_non_convertible_regs.
	(remove_non_convertible_regs): Remove by inlining into single caller.
	(convert_scalars_to_vector): Handle SImode and DImode chains
	in addition to TImode chains.
	* config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.

	* gcc.target/i386/pr91154.c: New testcase.
	* gcc.target/i386/minmax-3.c: Likewise.
	* gcc.target/i386/minmax-4.c: Likewise.
	* gcc.target/i386/minmax-5.c: Likewise.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274111)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
    if (dump_file)
@@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
    conversion.  */
 
 void
-dimode_scalar_chain::mark_dual_mode_def (df_ref def)
+general_scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ???  The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
    instead of using a scalar one.  */
 
 int
-dimode_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp)
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
 
 int
-dimode_scalar_chain::compute_convert_gain ()
+general_scalar_chain::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
     	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
-	  gain += ix86_cost->shift_const;
+	    igain -= vector_const_cost (XEXP (src, 0));
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
 	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    gain -= COSTS_N_INSNS (1);
+	    igain -= COSTS_N_INSNS (1);
 	}
       else if (GET_CODE (src) == PLUS
 	       || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
 	       || GET_CODE (src) == XOR
 	       || GET_CODE (src) == AND)
 	{
-	  gain += ix86_cost->add;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
 	  /* Additional gain for andnot for targets without BMI.  */
 	  if (GET_CODE (XEXP (src, 0)) == NOT
 	      && !TARGET_BMI)
-	    gain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
 	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
+	    igain -= vector_const_cost (XEXP (src, 0));
 	  if (CONST_INT_P (XEXP (src, 1)))
-	    gain -= vector_const_cost (XEXP (src, 1));
+	    igain -= vector_const_cost (XEXP (src, 1));
 	}
       else if (GET_CODE (src) == NEG
 	       || GET_CODE (src) == NOT)
-	gain += ix86_cost->add - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
 	{
 	  /* Assume comparison cost is the same.  */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
 	{
 	  if (REG_P (dst))
-	    gain += COSTS_N_INSNS (2);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += COSTS_N_INSNS (m);
 	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-	  gain -= vector_const_cost (src);
+	    igain += (m * ix86_cost->int_store[2]
+		     - ix86_cost->sse_store[sse_cost_idx]);
+	  igain -= vector_const_cost (src);
 	}
       else
 	gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ???  What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return gen_rtx_SUBREG (vmode, new_reg, 0);
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
+general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
 						  rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
@@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
    and replace its uses in a chain.  */
 
 void
-dimode_scalar_chain::make_vector_copies (unsigned regno)
+general_scalar_chain::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
-  rtx vreg = gen_reg_rtx (DImode);
+  rtx vreg = gen_reg_rtx (smode);
   df_ref ref;
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
@@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
 	start_sequence ();
 	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
 	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
 	    emit_move_insn (vreg, tmp);
 	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
 	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (SImode, reg, 4),
+					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
+			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
 	  }
 	else
-	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
-	  }
+	  emit_move_insn (gen_lowpart (smode, vreg), reg);
 	rtx_insn *seq = get_insns ();
 	end_sequence ();
 	rtx_insn *insn = DF_REF_INSN (ref);
@@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies
    in case register is used in not convertible insn.  */
 
 void
-dimode_scalar_chain::convert_reg (unsigned regno)
+general_scalar_chain::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
 	  start_sequence ();
 	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
 	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
 	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
 	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
 	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 0),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 4),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
+				  gen_rtx_LSHIFTRT (V2DImode,
+						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
 	    }
 	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_conversion_insns (seq, insn);
@@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign
    registers conversion.  */
 
 void
-dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
 	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
 	    gcc_assert (!DF_REF_CHAIN (ref));
 	    break;
 	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      *op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
 	{
 	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_insn_before (seq, insn);
@@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
 /* Convert INSN to vector mode.  */
 
 void
-dimode_scalar_chain::convert_insn (rtx_insn *insn)
+general_scalar_chain::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
 	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
 	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }
 
 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
 		     (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
 	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
 	return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
       break;
@@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  || !REG_P (XEXP (XEXP (src, 0), 0))))
       return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs, scan all instructions and
+   remove an insn from CANDIDATES if it has both convertible
+   and non-convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1638,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
 	  {
 	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
 		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
 	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
+	   Check all uses and definitions to estimate all required
+	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274111)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
  private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17721,6 +17721,31 @@ (define_peephole2
     std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_mode_iterator MAXMIN_IMODE
+  [(SI "TARGET_SSE4_1") (DI "TARGET_64BIT && TARGET_AVX512VL")])
+(define_code_attr maxmin_rel
+  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
+(define_code_attr maxmin_cmpmode
+  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	(maxmin:MAXMIN_IMODE (match_operand:MAXMIN_IMODE 1 "register_operand")
+		      (match_operand:MAXMIN_IMODE 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
+	(compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+	(if_then_else:MAXMIN_IMODE
+	  (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")
Index: gcc/testsuite/gcc.target/i386/pr91154.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr91154.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr91154.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1 -mstv" } */
+
+void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
+{
+  int sc;
+  int k;
+  for (k = 1; k <= M; k++)
+    {
+      dc[k] = dc[k-1] + tpdd[k-1];
+      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
+      if (dc[k] < -987654321) dc[k] = -987654321;
+    }
+}
+
+/* We want to convert the loop to SSE since SSE pmaxsd is faster than
+   compare + conditional move.  */
+/* { dg-final { scan-assembler-not "cmov" } } */
+/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */
+/* { dg-final { scan-assembler-times "paddd" 2 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-3.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int ssi[1024];
+unsigned int usi[1024];
+long long sdi[1024];
+unsigned long long udi[1024];
+
+#define CHECK(FN, VARIANT) \
+void \
+FN ## VARIANT (void) \
+{ \
+  for (int i = 1; i < 1024; ++i) \
+    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
+}
+
+CHECK(max, ssi);
+CHECK(min, ssi);
+CHECK(max, usi);
+CHECK(min, usi);
+CHECK(max, sdi);
+CHECK(min, sdi);
+CHECK(max, udi);
+CHECK(min, udi);
Index: gcc/testsuite/gcc.target/i386/minmax-4.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -msse4.1" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "pminsd" 1 } } */
+/* { dg-final { scan-assembler-times "pminud" 1 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-5.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-5.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-5.c	(working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -mavx512vl" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "vpmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "vpmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "vpminsd" 1 } } */
+/* { dg-final { scan-assembler-times "vpminud" 1 } } */
+/* { dg-final { scan-assembler-times "vpmaxsq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpmaxuq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpminsq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpminuq" 1 { target lp64 } } } */
Uros Bizjak Aug. 7, 2019, 12:59 p.m. UTC | #42
On Wed, Aug 7, 2019 at 2:52 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 7 Aug 2019, Uros Bizjak wrote:
>
> > On Wed, Aug 7, 2019 at 2:20 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > > On Wed, 7 Aug 2019, Richard Biener wrote:
> > > >
> > > > > On Mon, 5 Aug 2019, Uros Bizjak wrote:
> > > > >
> > > > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> > > > > >
> > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > > > > >
> > > > > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > > > > >
> > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > > > > > condition, I'll provide a _doubleword splitter later.
> > > > > > > > > > >
> > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > > > > to force use of %zmmN?
> > > > > > > > > >
> > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > > > > >
> > > > > > > > >     case SMAX:
> > > > > > > > >     case SMIN:
> > > > > > > > >     case UMAX:
> > > > > > > > >     case UMIN:
> > > > > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > > > > >         return false;
> > > > > > > > >
> > > > > > > > > so there's no way to use AVX512VL for 32bit?
> > > > > > > >
> > > > > > > > There is a way, but on 32bit targets we need to split the DImode
> > > > > > > > operation into a sequence of SImode operations for the unconverted
> > > > > > > > pattern.  This is of course doable, but somewhat more complex than
> > > > > > > > simply emitting a DImode compare + DImode cmove, which is what the
> > > > > > > > current splitter does.  So, a follow-up task.
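
[Editor's note: the doubleword split Uros refers to amounts to the
following hand-written C on a 32-bit target: compare the signed high
halves first, and fall back to an unsigned compare of the low halves on
a tie.  The function is illustrative only, not the eventual splitter.]

/* Sketch: 64-bit signed max via 32-bit operations (illustrative).  */
long long
smaxdi_doubleword (long long a, long long b)
{
  int ah = (int) (a >> 32), bh = (int) (b >> 32);  /* signed high parts */
  unsigned al = (unsigned) a, bl = (unsigned) b;   /* unsigned low parts */
  if (ah > bh || (ah == bh && al > bl))
    return a;
  return b;
}
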
> > > > > > >
> > > > > > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > > > > > check we just need to properly split if we enable the scalar minmax
> > > > > > > pattern for DImode on 32bits, the STV conversion would go fine.
> > > > > >
> > > > > > Yes, that is correct.
> > > > >
> > > > > So I tested the patch below (now with appropriate ChangeLog) on
> > > > > x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> > > > > the obvious hmmer improvement, now checking for off-noise results
> > > > > with a 3-run on those that may have one (with more than +-1 second
> > > > > differences in the 1-run).
> > > > >
> > > > > As-is, the patch likely runs into the splitting issue for DImode
> > > > > on i?86, and it still lacks functional testcases.  I'll do the
> > > > > hmmer loop with both DImode and SImode, and add testcases to trigger
> > > > > all pattern variants with the different ISAs we have.
> > > > >
> > > > > Some of the patch could be split out (the cost changes that are
> > > > > also effective for DImode for example).
> > > > >
> > > > > AFAICS we could go with only adding SImode avoiding the DImode
> > > > > splitting thing and this would solve the hmmer regression.
> > > >
> > > > I've additionally bootstrapped with --with-arch=nehalem which
> > > > reveals
> > > >
> > > > FAIL: gcc.target/i386/minmax-2.c scan-assembler test
> > > > FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp
> > > >
> > > > we emit cmp + cmov here now with -msse4.1 (as soon as the max
> > > > pattern is enabled I guess)
> > >
> > > Actually, we have to split using ix86_expand_int_compare. This will
> > > generate an optimized CC mode.
> >
> > So, this only matters for comparisons against zero. Currently, the
> > insn_and_split pattern allows only registers, but we can add other
> > types, too. I'd say that this is a benign issue.
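
[Editor's note: the minmax-2.c FAILs above come from a max against
zero, where the register-only split emits "cmp" while the old expansion
used "test".  A hedged sketch of the direction Uros suggests: route the
split through ix86_expand_int_compare, which emits the flags-setting
insn itself and returns the comparison rtx, choosing an optimized CC
mode when one operand is const0_rtx.  The preparation statements below
are illustrative only, not the committed change.]

/* Illustrative preparation statements for the define_insn_and_split
   (GE shown for smax; the other codes map analogously).  */
rtx cmp = ix86_expand_int_compare (GE, operands[1], operands[2]);
emit_insn (gen_rtx_SET (operands[0],
                        gen_rtx_IF_THEN_ELSE (<MODE>mode, cmp,
                                              operands[1], operands[2])));
DONE;
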
>
> OK.  So this is with your suggestions applied plus testcases as
> promised.  If we remove DImode support, at least minmax-5.c has to
> be adjusted.
>
> Currently re-bootstrapping / testing on x86_64-unknown-linux-gnu.
>
> I'll follow up with the performance assessment (currently only
> testing on Haswell), but I guess it is easy enough to address
> issues that pop up with the various auto-testers as a follow-up
> by adjusting the cost function (and we may get additional testcases
> then as well).
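
[Editor's note: for readers skimming the cost discussion, the
per-instruction gain the patch assigns to the new min/max codes reduces
to the following; this restates the compute_convert_gain hunk from the
patch rather than adding anything new.  Here m is the number of GPR
words per scalar operation, 2 for DImode on a 32-bit target and 1
otherwise.]

/* Scalar side: a compare (costed like an add) plus a conditional move
   estimated as a 2-insn reg-reg move; vector side: one integer SSE op,
   since integer SSE ops are all costed the same.  */
case SMAX: case SMIN: case UMAX: case UMIN:
  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add) - ix86_cost->sse_op;
  break;
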
>
> OK if the re-testing shows no issues?
>
> Thanks,
> Richard.
>
> 2019-08-07  Richard Biener  <rguenther@suse.de>
>
>         PR target/91154
>         * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
>         mode arguments.
>         (scalar_chain::smode): New member.
>         (scalar_chain::vmode): Likewise.
>         (dimode_scalar_chain): Rename to...
>         (general_scalar_chain): ... this.
>         (general_scalar_chain::general_scalar_chain): Take mode arguments.
>         (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
>         base with TImode and V1TImode.
>         * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
>         (general_scalar_chain::vector_const_cost): Adjust for SImode
>         chains.
>         (general_scalar_chain::compute_convert_gain): Likewise.  Fix
>         reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
>         scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
>         gain if not zero.
>         (general_scalar_chain::replace_with_subreg): Use vmode/smode.
>         (general_scalar_chain::make_vector_copies): Likewise.  Handle
>         non-DImode chains appropriately.
>         (general_scalar_chain::convert_reg): Likewise.
>         (general_scalar_chain::convert_op): Likewise.
>         (general_scalar_chain::convert_insn): Likewise.  Add
>         fatal_insn_not_found if the result is not recognized.
>         (convertible_comparison_p): Pass in the scalar mode and use that.
>         (general_scalar_to_vector_candidate_p): Likewise.  Rename from
>         dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
>         (scalar_to_vector_candidate_p): Remove by inlining into single
>         caller.
>         (general_remove_non_convertible_regs): Rename from
>         dimode_remove_non_convertible_regs.
>         (remove_non_convertible_regs): Remove by inlining into single caller.
>         (convert_scalars_to_vector): Handle SImode and DImode chains
>         in addition to TImode chains.
>         * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
>
>         * gcc.target/i386/pr91154.c: New testcase.
>         * gcc.target/i386/minmax-3.c: Likewise.
>         * gcc.target/i386/minmax-4.c: Likewise.
>         * gcc.target/i386/minmax-5.c: Likewise.

LGTM, perhaps someone with an RTL background should also take a look.

(I plan to enhance the new pattern in .md a bit once the patch landing settles.)

Uros.
Richard Biener Aug. 7, 2019, 1:57 p.m. UTC | #43
On Wed, 7 Aug 2019, Richard Biener wrote:

> On Mon, 5 Aug 2019, Uros Bizjak wrote:
> 
> > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> > 
> > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > >
> > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > >
> > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > >
> > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > to force use of %zmmN?
> > > > > >
> > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > >
> > > > >     case SMAX:
> > > > >     case SMIN:
> > > > >     case UMAX:
> > > > >     case UMIN:
> > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > >         return false;
> > > > >
> > > > > so there's no way to use AVX512VL for 32bit?
> > > >
> > > > There is a way, but on 32bit targets, we need to split the DImode
> > > > operation into a sequence of SImode operations for the unconverted pattern.
> > > > This is of course doable, but somehow more complex than simply
> > > > emitting a DImode compare + DImode cmove, which is what the current
> > > > splitter does. So, a follow-up task.
> > >
> > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > check we just need to properly split if we enable the scalar minmax
> > > pattern for DImode on 32bits, the STV conversion would go fine.
> > 
> > Yes, that is correct.
> 
> So I tested the patch below (now with appropriate ChangeLog) on
> x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> the obvious hmmer improvement, now checking for off-noise results
> with a 3-run on those that may have one (with more than +-1 second
> differences in the 1-run).

Update on this one.  On Haswell I see the following (besides hmmer and
the ones within +-1 second in the 1-run); base is unpatched, peak is patched:

401.bzip2        9650        382       25.3 S    9650        380       25.4 S
401.bzip2        9650        381       25.3 *    9650        377       25.6 *
401.bzip2        9650        381       25.3 S    9650        376       25.7 S

458.sjeng       12100        433       28.0 S   12100        433       28.0 S
458.sjeng       12100        428       28.3 S   12100        424       28.5 *
458.sjeng       12100        432       28.0 *   12100        424       28.6 S

464.h264ref     22130        413       53.6 S   22130        422       52.5 S
464.h264ref     22130        413       53.6 *   22130        421       52.5 S
464.h264ref     22130        413       53.6 S   22130        421       52.5 *

473.astar        7020        328       21.4 S    7020        316       22.2 S
473.astar        7020        322       21.8 S    7020        314       22.4 *
473.astar        7020        322       21.8 *    7020        311       22.6 S

416.gamess      19580        593       33.0 S   19580        601       32.6 S
416.gamess      19580        593       33.0 S   19580        601       32.6 *
416.gamess      19580        593       33.0 *   19580        601       32.6 S

so it's a loss for 464.h264ref and 416.gamess from the above numbers
and a slight win for the others (and a big one for 456.hmmer).

I plan to have a look at the two as a followup only, possibly adding
a debug counter to be able to bisect to a specific chain.

Richard.
Jeff Law Aug. 8, 2019, 3:24 p.m. UTC | #44
On 8/5/19 6:32 AM, Uros Bizjak wrote:
> On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote:
>>
>> On Sun, 4 Aug 2019, Uros Bizjak wrote:
>>
>>> On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
>>>>
>>>> On Thu, 1 Aug 2019, Uros Bizjak wrote:
>>>>
>>>>> On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
>>>>>
>>>>>>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
>>>>>>>> necessary even when going the STV route.  The actual regression
>>>>>>>> for the testcase could also be solved by turing the smaxsi3
>>>>>>>> back into a compare and jump rather than a conditional move sequence.
>>>>>>>> So I wonder how you'd do that given that there's pass_if_after_reload
>>>>>>>> after pass_split_after_reload and I'm not sure we can split
>>>>>>>> as late as pass_split_before_sched2 (there's also a split _after_
>>>>>>>> sched2 on x86 it seems).
>>>>>>>>
>>>>>>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
>>>>>>>> case STV doesn't end up doing any transform?
>>>>>>>
>>>>>>> If STV doesn't transform the insn, then a pre-reload splitter splits
>>>>>>> the insn back to compare+cmove.
>>>>>>
>>>>>> OK, that would work.  But there's no way to force a jumpy sequence then
>>>>>> which we know is faster than compare+cmove because later RTL
>>>>>> if-conversion passes happily re-discover the smax (or conditional move)
>>>>>> sequence.
>>>>>>
>>>>>>> However, considering the SImode move
>>>>>>> from/to int/xmm register is relatively cheap, the cost function should
>>>>>>> be tuned so that STV always converts smaxsi3 pattern.
>>>>>>
>>>>>> Note that on both Zen and even more so bdverN the int/xmm transition
>>>>>> makes it no longer profitable but a _lot_ slower than the cmp/cmov
>>>>>> sequence... (for the loop in hmmer which is the only one I see
>>>>>> any effect of any of my patches).  So identifying chains that
>>>>>> start/end in memory is important for cost reasons.
>>>>>
>>>>> Please note that the cost function also considers the cost of move
>>>>> from/to xmm. So, the cost of the whole chain would disable the
>>>>> transformation.
>>>>>
>>>>>> So I think the splitting has to happen after the last if-conversion
>>>>>> pass (and thus we may need to allocate a scratch register for this
>>>>>> purpose?)
>>>>>
>>>>> I really hope that the underlying issue will be solved by a machine
>>>>> dependant pass inserted somewhere after the pre-reload split. This
>>>>> way, we can split unconverted smax to the cmove, and this later pass
>>>>> would handle jcc and cmove instructions. Until then... yes your
>>>>> proposed approach is one of the ways to avoid unwanted if-conversion,
>>>>> although sometimes we would like to split to cmove instead.
>>>>
>>>> So the following makes STV also consider SImode chains, re-using the
>>>> DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
>>>> and also did not alter the {SI,DI}mode chain cost function - it's
>>>> quite off for TARGET_64BIT.  With this I get the expected conversion
>>>> for the testcase derived from hmmer.
>>>>
>>>> No further testing sofar.
>>>>
>>>> Is it OK to re-use the DImode chain code this way?  I'll clean things
>>>> up some more of course.
>>>
>>> Yes, the approach looks OK to me. It makes chain building mode
>>> agnostic, and the chain building can be used for
>>> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
>>> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
>>> minmax and surrounding SImode operations)
>>> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
>>> DImode operations)
>>>
>>>> Still need help with the actual patterns for minmax and how the splitters
>>>> should look like.
>>>
>>> Please look at the attached patch. Maybe we can add memory_operand as
>>> operand 1 and operand 2 predicate, but let's keep things simple for
>>> now.
>>
>> Thanks.  The attached patch makes the patch cleaner and it survives
>> "some" barebone testing.  It also touches the cost function to
>> avoid being too overly trigger-happy.  I've also ended up using
>> ix86_cost->sse_op instead of COSTS_N_INSN-based magic.  In
>> particular we estimated GPR reg-reg move as COST_N_INSNS(2) while
>> move costs shouldn't be wrapped in COST_N_INSNS.
>> IMHO we should probably disregard any reg-reg moves for costing pre-RA.
>> At least with the current code every reg-reg move biases in favor of
>> SSE...
> 
> This is currently a bit of a mixed-up area in x86 target support. HJ is
> looking into this [1] and I hope Honza can review the patch.
Yea, Honza's input on that would be greatly appreciated.

Jeff
>
Uros Bizjak Aug. 9, 2019, 6:05 a.m. UTC | #45
On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote:

> > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > >
> > > > > > and then we need to split DImode for 32bits, too.
> > > > >
> > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > condition, I'll provide _doubleword splitter later.
> > > >
> > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > to force use of %zmmN?
> > >
> > > It generates V4SI mode, so - yes, AVX512VL.
> >
> >     case SMAX:
> >     case SMIN:
> >     case UMAX:
> >     case UMIN:
> >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> >           || (mode == SImode && !TARGET_SSE4_1))
> >         return false;
> >
> > so there's no way to use AVX512VL for 32bit?
>
> There is a way, but on 32bit targets, we need to split the DImode
> operation into a sequence of SImode operations for the unconverted pattern.
> This is of course doable, but somehow more complex than simply
> emitting a DImode compare + DImode cmove, which is what the current
> splitter does. So, a follow-up task.

Please find attached the complete .md part that enables SImode for
TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
targets. The patterns also allow a memory operand 2, so STV has a
chance to create the vector pattern with an implicit load. In case STV
fails, memory operand 2 is loaded into a register first; operand
2 is used in both the compare and the cmove instruction, so pre-loading
the operand should be beneficial.

Also note that splitting should happen rarely. Due to the cost
function, STV should effectively always convert minmax to a vector
insn.
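
To illustrate the 32bit DImode path (a hypothetical example, not part
of the attached patch):

long long
smaxdi (long long a, long long b)
{
  /* Compiled with -O2 -mstv -mavx512vl on ia32: if STV does not
     convert the chain, the *<code>di3_doubleword splitter below
     lowers this smax:DI to a lo-part cmp, a hi-part sbb and two
     SImode cmoves.  */
  return a > b ? a : b;
}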

Uros.
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 274210)
+++ config/i386/i386.md	(working copy)
@@ -17719,6 +17719,110 @@
    (match_operand:SWI 3 "const_int_operand")]
   ""
   "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
+
+;; min/max patterns
+
+(define_mode_iterator MAXMIN_IMODE
+  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
+(define_code_attr maxmin_rel
+  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
+
+(define_expand "<code><mode>3"
+  [(parallel
+    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	  (maxmin:MAXMIN_IMODE
+	    (match_operand:MAXMIN_IMODE 1 "register_operand")
+	    (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+     (clobber (reg:CC FLAGS_REG))])]
+  "TARGET_STV")
+
+(define_insn_and_split "*<code><mode>3_1"
+  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	(maxmin:MAXMIN_IMODE
+	  (match_operand:MAXMIN_IMODE 1 "register_operand")
+	  (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:MAXMIN_IMODE (match_dup 3)
+	  (match_dup 1)
+	  (match_dup 2)))]
+{
+  machine_mode mode = <MODE>mode;
+
+  if (!register_operand (operands[2], mode))
+    operands[2] = force_reg (mode, operands[2]);
+
+  enum rtx_code code = <maxmin_rel>;
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
+
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  emit_insn (gen_rtx_SET (flags, tmp));
+
+  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+})
+
+(define_insn_and_split "*<code>di3_doubleword"
+  [(set (match_operand:DI 0 "register_operand")
+	(maxmin:DI (match_operand:DI 1 "register_operand")
+		   (match_operand:DI 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 1)
+	  (match_dup 2)))
+   (set (match_dup 3)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 4)
+	  (match_dup 5)))]
+{
+  if (!register_operand (operands[2], DImode))
+    operands[2] = force_reg (DImode, operands[2]);
+
+  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+
+  rtx cmplo[2] = { operands[1], operands[2] };
+  rtx cmphi[2] = { operands[4], operands[5] };
+
+  enum rtx_code code = <maxmin_rel>;
+
+  switch (code)
+    {
+    case LE: case LEU:
+      std::swap (cmplo[0], cmplo[1]);
+      std::swap (cmphi[0], cmphi[1]);
+      code = swap_condition (code);
+      /* FALLTHRU */
+
+    case GE: case GEU:
+      {
+	bool uns = (code == GEU);
+	rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
+	  = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
+
+	emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
+
+	rtx tmp = gen_rtx_SCRATCH (SImode);
+	emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
+
+	rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
+	operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+
+	break;
+      }
+
+    default:
+      gcc_unreachable ();
+    }
+})
 
 ;; Misc patterns (?)
Richard Biener Aug. 9, 2019, 9:25 a.m. UTC | #46
On Fri, 9 Aug 2019, Uros Bizjak wrote:

> On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > >
> > > > > > > and then we need to split DImode for 32bits, too.
> > > > > >
> > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > condition, I'll provide _doubleword splitter later.
> > > > >
> > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > to force use of %zmmN?
> > > >
> > > > It generates V4SI mode, so - yes, AVX512VL.
> > >
> > >     case SMAX:
> > >     case SMIN:
> > >     case UMAX:
> > >     case UMIN:
> > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > >           || (mode == SImode && !TARGET_SSE4_1))
> > >         return false;
> > >
> > > so there's no way to use AVX512VL for 32bit?
> >
> > There is a way, but on 32bit targets, we need to split the DImode
> > operation into a sequence of SImode operations for the unconverted pattern.
> > This is of course doable, but somehow more complex than simply
> > emitting a DImode compare + DImode cmove, which is what the current
> > splitter does. So, a follow-up task.
> 
> Please find attached the complete .md part that enables SImode for
> TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
> targets. The patterns also allow a memory operand 2, so STV has a
> chance to create the vector pattern with an implicit load. In case STV
> fails, memory operand 2 is loaded into a register first; operand
> 2 is used in both the compare and the cmove instruction, so pre-loading
> the operand should be beneficial.

Thanks.

> Also note that splitting should happen rarely. Due to the cost
> function, STV should effectively always convert minmax to a vector
> insn.

I've analyzed the 464.h264ref slowdown on Haswell and it is due to
this kind of "simple" conversion:

  5.50 │1d0:   test   %esi,%esi
  0.07 │       mov    $0x0,%eax
       │       cmovs  %eax,%esi
  5.84 │       imul   %r8d,%esi

to

  0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
  0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
 40.45 │       vmovd  %xmm0,%eax
  2.45 │       imul   %r8d,%eax

which looks like a RA artifact in the end.  We spill %esi only
with -mstv here as STV introduces a (subreg:V4SI ...) use
of a pseudo ultimatively set from di.  STV creates an additional
pseudo for this (copy-in) but it places that copy next to the
original def rather than next to the start of the chain it
converts which is probably the issue why we spill.  And this
is because it inserts those at each definition of the pseudo
rather than just at the reaching definition(s) or at the
uses of the pseudo in the chain (that because there may be
defs of that pseudo in the chain itself).  Note that STV emits
such "conversion" copies as simple reg-reg moves:

(insn 1094 3 4 2 (set (reg:SI 777)
        (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
     (nil))

but those do not prevail very long (this one gets removed by CSE2).
So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
and computes

    r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618

so I wonder if STV shouldn't instead emit gpr->xmm moves
here (but I guess again nothing prevents RTL optimizers from
combining that with the single-use in the max instruction...).

So this boils down to STV splitting live-ranges but other
passes undoing that and then RA not considering splitting
live-ranges here, arriving at a suboptimal allocation.

A testcase showing this issue is (simplified from 464.h264ref
UMVLine16Y_11):

unsigned short
UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
{
  if (y != width)
    {
      y = y < 0 ? 0 : y;
      return Pic[y * width];
    }
  return Pic[y];
}

where the condition and the Pic[y] load mimic the other use of y.
Different, even worse spilling is generated by

unsigned short
UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
{
  y = y < 0 ? 0 : y;
  return Pic[y * width] + y;
}

I guess this all shows that STV's "trick" of simply wrapping
integer mode pseudos in (subreg:vector-mode ...) is bad?

I've added a (failing) testcase to reflect the above.

Richard.
Jakub Jelinek Aug. 9, 2019, 10:13 a.m. UTC | #47
On Fri, Aug 09, 2019 at 11:25:30AM +0200, Richard Biener wrote:
>   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
>   0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
>  40.45 │       vmovd  %xmm0,%eax
>   2.45 │       imul   %r8d,%eax

Shouldn't we hoist the vpxor before the loop?  Is it STV being done too late
that we don't do that anymore?  Couldn't e.g. STV itself detect that and put
the clearing instruction before the loop instead of right before the minmax?
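
To make that concrete, a reduced (hypothetical) example where the zero
operand of the max is loop-invariant, so the vpxor materializing it
could be emitted once before the loop rather than once per iteration:

void
clamp_to_zero (int *a, int n)
{
  for (int i = 0; i < n; i++)
    /* max (a[i], 0); the zero vector does not change across
       iterations.  */
    a[i] = a[i] < 0 ? 0 : a[i];
}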

	Jakub
Richard Biener Aug. 9, 2019, 10:59 a.m. UTC | #48
On Fri, 9 Aug 2019, Richard Biener wrote:

> On Fri, 9 Aug 2019, Uros Bizjak wrote:
> 
> > On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> > 
> > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > >
> > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > >
> > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > >
> > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > to force use of %zmmN?
> > > > >
> > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > >
> > > >     case SMAX:
> > > >     case SMIN:
> > > >     case UMAX:
> > > >     case UMIN:
> > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > >         return false;
> > > >
> > > > so there's no way to use AVX512VL for 32bit?
> > >
> > There is a way, but on 32bit targets, we need to split the DImode
> > operation into a sequence of SImode operations for the unconverted pattern.
> > This is of course doable, but somehow more complex than simply
> > emitting a DImode compare + DImode cmove, which is what the current
> > > splitter does. So, a follow-up task.
> > 
> > Please find attached the complete .md part that enables SImode for
> > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
> > targets. The patterns also allow a memory operand 2, so STV has a
> > chance to create the vector pattern with an implicit load. In case STV
> > fails, memory operand 2 is loaded into a register first; operand
> > 2 is used in both the compare and the cmove instruction, so pre-loading
> > the operand should be beneficial.
> 
> Thanks.
> 
> > Also note that splitting should happen rarely. Due to the cost
> > function, STV should effectively always convert minmax to a vector
> > insn.
> 
> I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> this kind of "simple" conversion:
> 
>   5.50 │1d0:   test   %esi,%esi
>   0.07 │       mov    $0x0,%eax
>        │       cmovs  %eax,%esi
>   5.84 │       imul   %r8d,%esi
> 
> to
> 
>   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
>   0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
>  40.45 │       vmovd  %xmm0,%eax
>   2.45 │       imul   %r8d,%eax
> 
> which looks like an RA artifact in the end.  We spill %esi only
> with -mstv here as STV introduces a (subreg:V4SI ...) use
> of a pseudo ultimately set from di.  STV creates an additional
> pseudo for this (copy-in) but it places that copy next to the
> original def rather than next to the start of the chain it
> converts, which is probably why we spill.  And this
> is because it inserts those at each definition of the pseudo
> rather than just at the reaching definition(s) or at the
> uses of the pseudo in the chain (that is because there may be
> defs of that pseudo in the chain itself).  Note that STV emits
> such "conversion" copies as simple reg-reg moves:
> 
> (insn 1094 3 4 2 (set (reg:SI 777)
>         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
>      (nil))
> 
> but those do not prevail very long (this one gets removed by CSE2).
> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> and computes
> 
>     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
>     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> 
> so I wonder if STV shouldn't instead emit gpr->xmm moves
> here (but I guess again nothing prevents RTL optimizers from
> combining that with the single-use in the max instruction...).
> 
> So this boils down to STV splitting live-ranges but other
> passes undoing that and then RA not considering splitting
> live-ranges here, arriving at a suboptimal allocation.
> 
> A testcase showing this issue is (simplified from 464.h264ref
> UMVLine16Y_11):
> 
> unsigned short
> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> {
>   if (y != width)
>     {
>       y = y < 0 ? 0 : y;
>       return Pic[y * width];
>     }
>   return Pic[y];
> }
> 
> where the condition and the Pic[y] load mimic the other use of y.
> Different, even worse spilling is generated by
> 
> unsigned short
> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> {
>   y = y < 0 ? 0 : y;
>   return Pic[y * width] + y;
> }
> 
> I guess this all shows that STV's "trick" of simply wrapping
> integer mode pseudos in (subreg:vector-mode ...) is bad?
> 
> I've added a (failing) testcase to reflect the above.

Experimenting a bit with using V4SImode pseudos just for the
conversion insns, we end up preserving those moves (but I
do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
ends up using movv4si_internal, which only leaves us with
memory for the SImode operand) _plus_ moving the move next
to the actual use has an effect.  Not necessarily a good one
though:

        vpxor   %xmm0, %xmm0, %xmm0
        vmovaps %xmm0, -16(%rsp)
        movl    %esi, -16(%rsp)
        vpmaxsd -16(%rsp), %xmm0, %xmm0
        vmovd   %xmm0, %eax

eh?  I guess the lowpart set is not good (my patch has this
as well, but I got saved by never having vector modes to subset...).
Using

    (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) "t3.c":5:10 -1

for the move ends up with

        vpxor   %xmm1, %xmm1, %xmm1
        vpinsrd $0, %esi, %xmm1, %xmm0

eh?  LRA chooses the correct alternative here but somehow
postreload CSE CSEs the zero with the xmm1 clearing, leading
to the vpinsrd...  (I guess a general issue, not sure if really
worse - definitely a larger instruction).  Unfortunately
postreload-cse doesn't add a reg-equal note.  This happens only
when emitting the reg move before the use, not doing that emits
a vmovd as expected.

At least the spilling is gone here.

I am re-testing as follows; the main change is that
general_scalar_chain::make_vector_copies now generates a
vector pseudo as the destination (and I've fixed up the code
to not generate (subreg:V4SI (reg:V4SI 1234) 0)).

Hope this fixes the observed slowdowns (it fixes the new testcase).

Richard.

The affected functions are mccas.F:twotff_ for 416.gamess and
refbuf.c:UMVLine16Y_11 for 464.h264ref.

2019-08-07  Richard Biener  <rguenther@suse.de>

	PR target/91154
	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
	mode arguments.
	(scalar_chain::smode): New member.
	(scalar_chain::vmode): Likewise.
	(dimode_scalar_chain): Rename to...
	(general_scalar_chain): ... this.
	(general_scalar_chain::general_scalar_chain): Take mode arguments.
	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
	base with TImode and V1TImode.
	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
	(general_scalar_chain::vector_const_cost): Adjust for SImode
	chains.
	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
	gain if not zero.
	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
	Elide the subreg if the reg is already vector.
	(general_scalar_chain::make_vector_copies): Likewise.  Handle
	non-DImode chains appropriately.  Use a vector-mode pseudo as
	destination.
	(general_scalar_chain::convert_reg): Likewise.
	(general_scalar_chain::convert_op): Likewise.  Elide the
	subreg if the reg is already vector.
	(general_scalar_chain::convert_insn): Likewise.  Add
	fatal_insn_not_found if the result is not recognized.
	(convertible_comparison_p): Pass in the scalar mode and use that.
	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
	(scalar_to_vector_candidate_p): Remove by inlining into single
	caller.
	(general_remove_non_convertible_regs): Rename from
	dimode_remove_non_convertible_regs.
	(remove_non_convertible_regs): Remove by inlining into single caller.
	(convert_scalars_to_vector): Handle SImode and DImode chains
	in addition to TImode chains.
	* config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.

	* gcc.target/i386/pr91154.c: New testcase.
	* gcc.target/i386/minmax-3.c: Likewise.
	* gcc.target/i386/minmax-4.c: Likewise.
	* gcc.target/i386/minmax-5.c: Likewise.
	* gcc.target/i386/minmax-6.c: Likewise.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274111)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
    if (dump_file)
@@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
    conversion.  */
 
 void
-dimode_scalar_chain::mark_dual_mode_def (df_ref def)
+general_scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ???  The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
    instead of using a scalar one.  */
 
 int
-dimode_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp)
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
 
 int
-dimode_scalar_chain::compute_convert_gain ()
+general_scalar_chain::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
     	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
-	  gain += ix86_cost->shift_const;
+	    igain -= vector_const_cost (XEXP (src, 0));
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
 	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    gain -= COSTS_N_INSNS (1);
+	    igain -= COSTS_N_INSNS (1);
 	}
       else if (GET_CODE (src) == PLUS
 	       || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
 	       || GET_CODE (src) == XOR
 	       || GET_CODE (src) == AND)
 	{
-	  gain += ix86_cost->add;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
 	  /* Additional gain for andnot for targets without BMI.  */
 	  if (GET_CODE (XEXP (src, 0)) == NOT
 	      && !TARGET_BMI)
-	    gain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
 	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
+	    igain -= vector_const_cost (XEXP (src, 0));
 	  if (CONST_INT_P (XEXP (src, 1)))
-	    gain -= vector_const_cost (XEXP (src, 1));
+	    igain -= vector_const_cost (XEXP (src, 1));
 	}
       else if (GET_CODE (src) == NEG
 	       || GET_CODE (src) == NOT)
-	gain += ix86_cost->add - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
 	{
 	  /* Assume comparison cost is the same.  */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
 	{
 	  if (REG_P (dst))
-	    gain += COSTS_N_INSNS (2);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += COSTS_N_INSNS (m);
 	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-	  gain -= vector_const_cost (src);
+	    igain += (m * ix86_cost->int_store[2]
+		     - ix86_cost->sse_store[sse_cost_idx]);
+	  igain -= vector_const_cost (src);
 	}
       else
 	gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ???  What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return (GET_MODE (new_reg) == vmode
+	    ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0));
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
+general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
 						  rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
@@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx
    and replace its uses in a chain.  */
 
 void
-dimode_scalar_chain::make_vector_copies (unsigned regno)
+general_scalar_chain::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
-  rtx vreg = gen_reg_rtx (DImode);
+  rtx vreg = gen_reg_rtx (vmode);
   df_ref ref;
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
@@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies
 	start_sequence ();
 	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
 	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
-	    emit_move_insn (vreg, tmp);
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
+	    emit_move_insn (vreg,
+			    gen_rtx_VEC_MERGE (vmode,
+					       gen_rtx_VEC_DUPLICATE (vmode,
+								      tmp),
+					       CONST0_RTX (vmode),
+					       GEN_INT (HOST_WIDE_INT_1U)));
+
 	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
 	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (SImode, reg, 4),
+					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
+			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
 	  }
 	else
 	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	    emit_move_insn (vreg,
+			    gen_rtx_VEC_MERGE (vmode,
+					       gen_rtx_VEC_DUPLICATE (vmode,
+								      reg),
+					       CONST0_RTX (vmode),
+					       GEN_INT (HOST_WIDE_INT_1U)));
 	  }
 	rtx_insn *seq = get_insns ();
 	end_sequence ();
@@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies
    in case register is used in not convertible insn.  */
 
 void
-dimode_scalar_chain::convert_reg (unsigned regno)
+general_scalar_chain::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign
 	  start_sequence ();
 	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
 	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
 	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
 	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
 	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 0),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 4),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
+				  gen_rtx_LSHIFTRT (V2DImode,
+						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
 	    }
 	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_conversion_insns (seq, insn);
@@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign
    registers conversion.  */
 
 void
-dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
 	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op
 	    gcc_assert (!DF_REF_CHAIN (ref));
 	    break;
 	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      if (GET_MODE (*op) != vmode)
+	*op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
 	{
 	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_insn_before (seq, insn);
@@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
 /* Convert INSN to vector mode.  */
 
 void
-dimode_scalar_chain::convert_insn (rtx_insn *insn)
+general_scalar_chain::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
 	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -939,17 +1027,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
 	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1065,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if  (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1116,7 +1206,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }
 
 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1186,7 +1276,7 @@ has_non_address_hard_reg (rtx_insn *insn
 		     (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1309,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
 	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1322,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1331,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1345,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1370,15 @@ dimode_scalar_to_vector_candidate_p (rtx
 	return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512VL)
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1389,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
       break;
@@ -1319,7 +1418,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  || !REG_P (XEXP (XEXP (src, 0), 0))))
       return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1383,22 +1482,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instruction and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1553,23 +1646,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1653,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1676,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
 	  {
 	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
 		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
 	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
+	   Check all uses and definitions to estimate all required
+	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274111)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
  private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17729,6 +17729,110 @@ (define_expand "add<mode>cc"
    (match_operand:SWI 3 "const_int_operand")]
   ""
   "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
+
+;; min/max patterns
+
+(define_mode_iterator MAXMIN_IMODE
+  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
+(define_code_attr maxmin_rel
+  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
+
+(define_expand "<code><mode>3"
+  [(parallel
+    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	  (maxmin:MAXMIN_IMODE
+	    (match_operand:MAXMIN_IMODE 1 "register_operand")
+	    (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+     (clobber (reg:CC FLAGS_REG))])]
+  "TARGET_STV")
+
+(define_insn_and_split "*<code><mode>3_1"
+  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	(maxmin:MAXMIN_IMODE
+	  (match_operand:MAXMIN_IMODE 1 "register_operand")
+	  (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:MAXMIN_IMODE (match_dup 3)
+	  (match_dup 1)
+	  (match_dup 2)))]
+{
+  machine_mode mode = <MODE>mode;
+
+  if (!register_operand (operands[2], mode))
+    operands[2] = force_reg (mode, operands[2]);
+
+  enum rtx_code code = <maxmin_rel>;
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
+
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  emit_insn (gen_rtx_SET (flags, tmp));
+
+  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+})
+
+(define_insn_and_split "*<code>di3_doubleword"
+  [(set (match_operand:DI 0 "register_operand")
+	(maxmin:DI (match_operand:DI 1 "register_operand")
+		   (match_operand:DI 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 1)
+	  (match_dup 2)))
+   (set (match_dup 3)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 4)
+	  (match_dup 5)))]
+{
+  if (!register_operand (operands[2], DImode))
+    operands[2] = force_reg (DImode, operands[2]);
+
+  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+
+  rtx cmplo[2] = { operands[1], operands[2] };
+  rtx cmphi[2] = { operands[4], operands[5] };
+
+  enum rtx_code code = <maxmin_rel>;
+
+  switch (code)
+    {
+    case LE: case LEU:
+      std::swap (cmplo[0], cmplo[1]);
+      std::swap (cmphi[0], cmphi[1]);
+      code = swap_condition (code);
+      /* FALLTHRU */
+
+    case GE: case GEU:
+      {
+	bool uns = (code == GEU);
+	rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
+	  = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
+
+	emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
+
+	rtx tmp = gen_rtx_SCRATCH (SImode);
+	emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
+
+	rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
+	operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+
+	break;
+      }
+
+    default:
+      gcc_unreachable ();
+    }
+})
 
 ;; Misc patterns (?)
 
Index: gcc/testsuite/gcc.target/i386/minmax-3.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int ssi[1024];
+unsigned int usi[1024];
+long long sdi[1024];
+unsigned long long udi[1024];
+
+#define CHECK(FN, VARIANT) \
+void \
+FN ## VARIANT (void) \
+{ \
+  for (int i = 1; i < 1024; ++i) \
+    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
+}
+
+CHECK(max, ssi);
+CHECK(min, ssi);
+CHECK(max, usi);
+CHECK(min, usi);
+CHECK(max, sdi);
+CHECK(min, sdi);
+CHECK(max, udi);
+CHECK(min, udi);
Index: gcc/testsuite/gcc.target/i386/minmax-4.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -msse4.1" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "pminsd" 1 } } */
+/* { dg-final { scan-assembler-times "pminud" 1 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-6.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-6.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-6.c	(working copy)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=haswell" } */
+
+unsigned short
+UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
+{
+  if (y != width)
+    {
+      y = y < 0 ? 0 : y;
+      return Pic[y * width];
+    }
+  return Pic[y];
+} 
+
+/* We do not want the RA to spill %esi for its dual-use but using
+   pmaxsd is OK.  */
+/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
+/* { dg-final { scan-assembler "pmaxsd" } } */
Richard Biener Aug. 9, 2019, 11:01 a.m. UTC | #49
On Fri, 9 Aug 2019, Jakub Jelinek wrote:

> On Fri, Aug 09, 2019 at 11:25:30AM +0200, Richard Biener wrote:
> >   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> >   0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
> >  40.45 │       vmovd  %xmm0,%eax
> >   2.45 │       imul   %r8d,%eax
> 
> Shouldn't we hoist the vpxor before the loop?  Is it STV being done too late
> that we don't do that anymore?  Couldn't e.g. STV itself detect that and put
> the clearing instruction before the loop instead of right before the minmax?

This testcase doesn't have a loop.  Since the minmax patterns do not
allow constants, we need to deal with this for the GPR case as well.
And we do, as you can see in the loop testcase.
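
For illustration, a minimal loop sketch (hypothetical, not one of the
testcases in the patch) where the zero operand of the max is
loop-invariant, so the vpxor materializing it can be set up once
before the loop:

/* a[i] = max (a[i], 0); the all-zeros vector feeding vpmaxsd does
   not depend on the loop, so invariant motion can hoist its setup.  */
void
clamp_to_zero (int *a, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = a[i] < 0 ? 0 : a[i];
}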

Richard.
Richard Biener Aug. 9, 2019, 1 p.m. UTC | #50
On Fri, 9 Aug 2019, Richard Biener wrote:

> On Fri, 9 Aug 2019, Richard Biener wrote:
> 
> > On Fri, 9 Aug 2019, Uros Bizjak wrote:
> > 
> > > On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> > > 
> > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > >
> > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > >
> > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > >
> > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > to force use of %zmmN?
> > > > > >
> > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > >
> > > > >     case SMAX:
> > > > >     case SMIN:
> > > > >     case UMAX:
> > > > >     case UMIN:
> > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > >         return false;
> > > > >
> > > > > so there's no way to use AVX512VL for 32bit?
> > > >
> > > > There is a way, but on 32bit targets we need to split the DImode
> > > > operation into a sequence of SImode operations for the unconverted
> > > > pattern. This is of course doable, but somewhat more complex than
> > > > simply emitting a DImode compare + DImode cmove, which is what the
> > > > current splitter does. So, a follow-up task.
> > > 
> > > Please find attached the complete .md part that enables SImode for
> > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
> > > targets. The patterns also allow a memory operand 2, so STV has a
> > > chance to create the vector pattern with an implicit load. In case STV
> > > fails, the memory operand 2 is loaded into a register first; operand
> > > 2 is used in the compare and cmove instructions, so pre-loading the
> > > operand should be beneficial.
> > 
> > Thanks.
> > 
> > > Also note that splitting should happen rarely. Due to the cost
> > > function, STV should effectively always convert minmax to a vector
> > > insn.
> > 
> > I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> > this kind of "simple" conversion:
> > 
> >   5.50 │1d0:   test   %esi,%esi
> >   0.07 │       mov    $0x0,%eax
> >        │       cmovs  %eax,%esi
> >   5.84 │       imul   %r8d,%esi
> > 
> > to
> > 
> >   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> >   0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
> >  40.45 │       vmovd  %xmm0,%eax
> >   2.45 │       imul   %r8d,%eax
> > 
> > which looks like an RA artifact in the end.  We spill %esi only
> > with -mstv here as STV introduces a (subreg:V4SI ...) use
> > of a pseudo ultimately set from di.  STV creates an additional
> > pseudo for this (copy-in) but it places that copy next to the
> > original def rather than next to the start of the chain it
> > converts, which is probably why we spill.  And this
> > is because it inserts those at each definition of the pseudo
> > rather than just at the reaching definition(s) or at the
> > uses of the pseudo in the chain (that is because there may be
> > defs of that pseudo in the chain itself).  Note that STV emits
> > such "conversion" copies as simple reg-reg moves:
> > 
> > (insn 1094 3 4 2 (set (reg:SI 777)
> >         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
> >      (nil))
> > 
> > but those do not prevail very long (this one gets removed by CSE2).
> > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> > and computes
> > 
> >     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
> >     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> > 
> > so I wonder if STV shouldn't instead emit gpr->xmm moves
> > here (but I guess again nothing prevents RTL optimizers from
> > combining that with the single-use in the max instruction...).
> > 
> > So this boils down to STV splitting live-ranges but other
> > passes undoing that and then RA not considering splitting
> > live-ranges here, arriving at suboptimal allocation.
> > 
> > A testcase showing this issue is (simplified from 464.h264ref
> > UMVLine16Y_11):
> > 
> > unsigned short
> > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > {
> >   if (y != width)
> >     {
> >       y = y < 0 ? 0 : y;
> >       return Pic[y * width];
> >     }
> >   return Pic[y];
> > }
> > 
> > where the condition and the Pic[y] load mimic the other use of y.
> > Different, even worse spilling is generated by
> > 
> > unsigned short
> > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > {
> >   y = y < 0 ? 0 : y;
> >   return Pic[y * width] + y;
> > }
> > 
> > I guess this all shows that STV's "trick" of simply wrapping
> > integer mode pseudos in (subreg:vector-mode ...) is bad?
> > 
> > I've added a (failing) testcase to reflect the above.
> 
> Experimenting a bit with using V4SImode pseudos just for the
> conversion insns, we end up preserving those moves (but I
> do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
> ends up using movv4si_internal, which only leaves us with
> memory for the SImode operand), _plus_ moving the move next
> to the actual use has an effect.  Not necessarily a good one
> though:
> 
>         vpxor   %xmm0, %xmm0, %xmm0
>         vmovaps %xmm0, -16(%rsp)
>         movl    %esi, -16(%rsp)
>         vpmaxsd -16(%rsp), %xmm0, %xmm0
>         vmovd   %xmm0, %eax
> 
> eh?  I guess the lowpart set is not good (my patch has this
> as well, but I got saved by never having vector modes to subreg...).
> Using
> 
>     (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
>             (const_vector:V4SI [
>                     (const_int 0 [0]) repeated x4
>                 ])
>             (const_int 1 [0x1]))) "t3.c":5:10 -1
> 
> for the move ends up with
> 
>         vpxor   %xmm1, %xmm1, %xmm1
>         vpinsrd $0, %esi, %xmm1, %xmm0
> 
> eh?  LRA chooses the correct alternative here but somehow
> postreload CSE CSEs the zero with the xmm1 clearing, leading
> to the vpinsrd...  (I guess a general issue, not sure if really
> worse - definitely a larger instruction).  Unfortunately
> postreload-cse doesn't add a reg-equal note.  This happens only
> when emitting the reg move before the use; not doing that emits
> a vmovd as expected.
> 
> At least the spilling is gone here.
> 
> I am re-testing as follows; the main change is that
> general_scalar_chain::make_vector_copies now generates a
> vector pseudo as destination (and I've fixed up the code
> to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
> 
> Hope this fixes the observed slowdowns (it fixes the new testcase).

It fixes the slowdown observed in 416.gamess and 464.h264ref.
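
In short, a minimal sketch of the new copy-in shape (simplified from
general_scalar_chain::make_vector_copies in the quoted patch below,
assuming an SImode chain with vmode == V4SImode; reg is the chain's
SImode pseudo):

/* Emit
     (set (reg:V4SI vreg)
          (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI reg))
                          (const_vector:V4SI [0 0 0 0])
                          (const_int 1)))
   so the RA sees a vector-mode definition instead of a scalar
   reg-reg copy that CSE folds away.  */
rtx vreg = gen_reg_rtx (V4SImode);
emit_move_insn (vreg,
		gen_rtx_VEC_MERGE (V4SImode,
				   gen_rtx_VEC_DUPLICATE (V4SImode, reg),
				   CONST0_RTX (V4SImode),
				   GEN_INT (HOST_WIDE_INT_1U)));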

Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.

CCing Jeff who "knows RTL".

OK?

Thanks,
Richard.

> Richard.
> 
> mccas.F:twotff_ for 416.gamess
> refbuf.c:UMVLine16Y_11 for 464.h264ref
> 
> 2019-08-07  Richard Biener  <rguenther@suse.de>
> 
> 	PR target/91154
> 	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
> 	mode arguments.
> 	(scalar_chain::smode): New member.
> 	(scalar_chain::vmode): Likewise.
> 	(dimode_scalar_chain): Rename to...
> 	(general_scalar_chain): ... this.
> 	(general_scalar_chain::general_scalar_chain): Take mode arguments.
> 	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
> 	base with TImode and V1TImode.
> 	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
> 	(general_scalar_chain::vector_const_cost): Adjust for SImode
> 	chains.
> 	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
> 	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
> 	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
> 	gain if not zero.
> 	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
> 	Elide the subreg if the reg is already vector.
> 	(general_scalar_chain::make_vector_copies): Likewise.  Handle
> 	non-DImode chains appropriately.  Use a vector-mode pseudo as
> 	destination.
> 	(general_scalar_chain::convert_reg): Likewise.
> 	(general_scalar_chain::convert_op): Likewise.  Elide the
> 	subreg if the reg is already vector.
> 	(general_scalar_chain::convert_insn): Likewise.  Add
> 	fatal_insn_not_found if the result is not recognized.
> 	(convertible_comparison_p): Pass in the scalar mode and use that.
> 	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
> 	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
> 	(scalar_to_vector_candidate_p): Remove by inlining into single
> 	caller.
> 	(general_remove_non_convertible_regs): Rename from
> 	dimode_remove_non_convertible_regs.
> 	(remove_non_convertible_regs): Remove by inlining into single caller.
> 	(convert_scalars_to_vector): Handle SImode and DImode chains
> 	in addition to TImode chains.
> 	* config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
> 
> 	* gcc.target/i386/pr91154.c: New testcase.
> 	* gcc.target/i386/minmax-3.c: Likewise.
> 	* gcc.target/i386/minmax-4.c: Likewise.
> 	* gcc.target/i386/minmax-5.c: Likewise.
> 	* gcc.target/i386/minmax-6.c: Likewise.
> 
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c	(revision 274111)
> +++ gcc/config/i386/i386-features.c	(working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>  
>  /* Initialize new chain.  */
>  
> -scalar_chain::scalar_chain ()
> +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
>  {
> +  smode = smode_;
> +  vmode = vmode_;
> +
>    chain_id = ++max_id;
>  
>     if (dump_file)
> @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
>     conversion.  */
>  
>  void
> -dimode_scalar_chain::mark_dual_mode_def (df_ref def)
> +general_scalar_chain::mark_dual_mode_def (df_ref def)
>  {
>    gcc_assert (DF_REF_REG_DEF_P (def));
>  
> @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
>        && !HARD_REGISTER_P (SET_DEST (def_set)))
>      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>  
> +  /* ???  The following is quadratic since analyze_register_chain
> +     iterates over all refs to look for dual-mode regs.  Instead this
> +     should be done separately for all regs mentioned in the chain once.  */
>    df_ref ref;
>    df_ref def;
>    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
>     instead of using a scalar one.  */
>  
>  int
> -dimode_scalar_chain::vector_const_cost (rtx exp)
> +general_scalar_chain::vector_const_cost (rtx exp)
>  {
>    gcc_assert (CONST_INT_P (exp));
>  
> -  if (standard_sse_constant_p (exp, V2DImode))
> -    return COSTS_N_INSNS (1);
> -  return ix86_cost->sse_load[1];
> +  if (standard_sse_constant_p (exp, vmode))
> +    return ix86_cost->sse_op;
> +  /* We have separate costs for SImode and DImode, use SImode costs
> +     for smaller modes.  */
> +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
>  }
>  
>  /* Compute a gain for chain conversion.  */
>  
>  int
> -dimode_scalar_chain::compute_convert_gain ()
> +general_scalar_chain::compute_convert_gain ()
>  {
>    bitmap_iterator bi;
>    unsigned insn_uid;
> @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
>    if (dump_file)
>      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
>  
> +  /* SSE costs distinguish between SImode and DImode loads/stores, for
> +     int costs factor in the number of GPRs involved.  When supporting
> +     smaller modes than SImode the int load/store costs need to be
> +     adjusted as well.  */
> +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
>        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
>        rtx def_set = single_set (insn);
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
> +      int igain = 0;
>  
>        if (REG_P (src) && REG_P (dst))
> -	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> +	igain += 2 * m - ix86_cost->xmm_move;
>        else if (REG_P (src) && MEM_P (dst))
> -	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +	igain
> +	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
>        else if (MEM_P (src) && REG_P (dst))
> -	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> +	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
>        else if (GET_CODE (src) == ASHIFT
>  	       || GET_CODE (src) == ASHIFTRT
>  	       || GET_CODE (src) == LSHIFTRT)
>  	{
>      	  if (CONST_INT_P (XEXP (src, 0)))
> -	    gain -= vector_const_cost (XEXP (src, 0));
> -	  gain += ix86_cost->shift_const;
> +	    igain -= vector_const_cost (XEXP (src, 0));
> +	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
>  	  if (INTVAL (XEXP (src, 1)) >= 32)
> -	    gain -= COSTS_N_INSNS (1);
> +	    igain -= COSTS_N_INSNS (1);
>  	}
>        else if (GET_CODE (src) == PLUS
>  	       || GET_CODE (src) == MINUS
> @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
>  	       || GET_CODE (src) == XOR
>  	       || GET_CODE (src) == AND)
>  	{
> -	  gain += ix86_cost->add;
> +	  igain += m * ix86_cost->add - ix86_cost->sse_op;
>  	  /* Additional gain for andnot for targets without BMI.  */
>  	  if (GET_CODE (XEXP (src, 0)) == NOT
>  	      && !TARGET_BMI)
> -	    gain += 2 * ix86_cost->add;
> +	    igain += m * ix86_cost->add;
>  
>  	  if (CONST_INT_P (XEXP (src, 0)))
> -	    gain -= vector_const_cost (XEXP (src, 0));
> +	    igain -= vector_const_cost (XEXP (src, 0));
>  	  if (CONST_INT_P (XEXP (src, 1)))
> -	    gain -= vector_const_cost (XEXP (src, 1));
> +	    igain -= vector_const_cost (XEXP (src, 1));
>  	}
>        else if (GET_CODE (src) == NEG
>  	       || GET_CODE (src) == NOT)
> -	gain += ix86_cost->add - COSTS_N_INSNS (1);
> +	igain += m * ix86_cost->add - ix86_cost->sse_op;
> +      else if (GET_CODE (src) == SMAX
> +	       || GET_CODE (src) == SMIN
> +	       || GET_CODE (src) == UMAX
> +	       || GET_CODE (src) == UMIN)
> +	{
> +	  /* We do not have any conditional move cost, estimate it as a
> +	     reg-reg move.  Comparisons are costed as adds.  */
> +	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> +	  /* Integer SSE ops are all costed the same.  */
> +	  igain -= ix86_cost->sse_op;
> +	}
>        else if (GET_CODE (src) == COMPARE)
>  	{
>  	  /* Assume comparison cost is the same.  */
> @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
>        else if (CONST_INT_P (src))
>  	{
>  	  if (REG_P (dst))
> -	    gain += COSTS_N_INSNS (2);
> +	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> +	    igain += COSTS_N_INSNS (m);
>  	  else if (MEM_P (dst))
> -	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> -	  gain -= vector_const_cost (src);
> +	    igain += (m * ix86_cost->int_store[2]
> +		     - ix86_cost->sse_store[sse_cost_idx]);
> +	  igain -= vector_const_cost (src);
>  	}
>        else
>  	gcc_unreachable ();
> +
> +      if (igain != 0 && dump_file)
> +	{
> +	  fprintf (dump_file, "  Instruction gain %d for ", igain);
> +	  dump_insn_slim (dump_file, insn);
> +	}
> +      gain += igain;
>      }
>  
>    if (dump_file)
>      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
>  
> +  /* ???  What about integer to SSE?  */
>    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
>      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
>  
> @@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai
>  /* Replace REG in X with a V2DI subreg of NEW_REG.  */
>  
>  rtx
> -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
>  {
>    if (x == reg)
> -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> +    return (GET_MODE (new_reg) == vmode
> +	    ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0));
>  
>    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
>    int i, j;
> @@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg
>  /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
>  
>  void
> -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
> +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
>  						  rtx reg, rtx new_reg)
>  {
>    replace_with_subreg (single_set (insn), reg, new_reg);
> @@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx
>     and replace its uses in a chain.  */
>  
>  void
> -dimode_scalar_chain::make_vector_copies (unsigned regno)
> +general_scalar_chain::make_vector_copies (unsigned regno)
>  {
>    rtx reg = regno_reg_rtx[regno];
> -  rtx vreg = gen_reg_rtx (DImode);
> +  rtx vreg = gen_reg_rtx (vmode);
>    df_ref ref;
>  
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> @@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies
>  	start_sequence ();
>  	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
>  	  {
> -	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> -	    emit_move_insn (adjust_address (tmp, SImode, 0),
> -			    gen_rtx_SUBREG (SImode, reg, 0));
> -	    emit_move_insn (adjust_address (tmp, SImode, 4),
> -			    gen_rtx_SUBREG (SImode, reg, 4));
> -	    emit_move_insn (vreg, tmp);
> +	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> +	    if (smode == DImode && !TARGET_64BIT)
> +	      {
> +		emit_move_insn (adjust_address (tmp, SImode, 0),
> +				gen_rtx_SUBREG (SImode, reg, 0));
> +		emit_move_insn (adjust_address (tmp, SImode, 4),
> +				gen_rtx_SUBREG (SImode, reg, 4));
> +	      }
> +	    else
> +	      emit_move_insn (tmp, reg);
> +	    emit_move_insn (vreg,
> +			    gen_rtx_VEC_MERGE (vmode,
> +					       gen_rtx_VEC_DUPLICATE (vmode,
> +								      tmp),
> +					       CONST0_RTX (vmode),
> +					       GEN_INT (HOST_WIDE_INT_1U)));
> +
>  	  }
> -	else if (TARGET_SSE4_1)
> +	else if (!TARGET_64BIT && smode == DImode)
>  	  {
> -	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -					CONST0_RTX (V4SImode),
> -					gen_rtx_SUBREG (SImode, reg, 0)));
> -	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -					  gen_rtx_SUBREG (V4SImode, vreg, 0),
> -					  gen_rtx_SUBREG (SImode, reg, 4),
> -					  GEN_INT (2)));
> +	    if (TARGET_SSE4_1)
> +	      {
> +		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +					    CONST0_RTX (V4SImode),
> +					    gen_rtx_SUBREG (SImode, reg, 0)));
> +		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +					      gen_rtx_SUBREG (V4SImode, vreg, 0),
> +					      gen_rtx_SUBREG (SImode, reg, 4),
> +					      GEN_INT (2)));
> +	      }
> +	    else
> +	      {
> +		rtx tmp = gen_reg_rtx (DImode);
> +		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +					    CONST0_RTX (V4SImode),
> +					    gen_rtx_SUBREG (SImode, reg, 0)));
> +		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> +					    CONST0_RTX (V4SImode),
> +					    gen_rtx_SUBREG (SImode, reg, 4)));
> +		emit_insn (gen_vec_interleave_lowv4si
> +			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +			    gen_rtx_SUBREG (V4SImode, vreg, 0),
> +			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +	      }
>  	  }
>  	else
>  	  {
> -	    rtx tmp = gen_reg_rtx (DImode);
> -	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -					CONST0_RTX (V4SImode),
> -					gen_rtx_SUBREG (SImode, reg, 0)));
> -	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> -					CONST0_RTX (V4SImode),
> -					gen_rtx_SUBREG (SImode, reg, 4)));
> -	    emit_insn (gen_vec_interleave_lowv4si
> -		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -			gen_rtx_SUBREG (V4SImode, vreg, 0),
> -			gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +	    emit_move_insn (vreg,
> +			    gen_rtx_VEC_MERGE (vmode,
> +					       gen_rtx_VEC_DUPLICATE (vmode,
> +								      reg),
> +					       CONST0_RTX (vmode),
> +					       GEN_INT (HOST_WIDE_INT_1U)));
>  	  }
>  	rtx_insn *seq = get_insns ();
>  	end_sequence ();
> @@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies
>     in case register is used in not convertible insn.  */
>  
>  void
> -dimode_scalar_chain::convert_reg (unsigned regno)
> +general_scalar_chain::convert_reg (unsigned regno)
>  {
>    bool scalar_copy = bitmap_bit_p (defs_conv, regno);
>    rtx reg = regno_reg_rtx[regno];
> @@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign
>    bitmap_copy (conv, insns);
>  
>    if (scalar_copy)
> -    scopy = gen_reg_rtx (DImode);
> +    scopy = gen_reg_rtx (smode);
>  
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
>      {
> @@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign
>  	  start_sequence ();
>  	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
>  	    {
> -	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> +	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
>  	      emit_move_insn (tmp, reg);
> -	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -			      adjust_address (tmp, SImode, 0));
> -	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -			      adjust_address (tmp, SImode, 4));
> +	      if (!TARGET_64BIT && smode == DImode)
> +		{
> +		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +				  adjust_address (tmp, SImode, 0));
> +		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +				  adjust_address (tmp, SImode, 4));
> +		}
> +	      else
> +		emit_move_insn (scopy, tmp);
>  	    }
> -	  else if (TARGET_SSE4_1)
> +	  else if (!TARGET_64BIT && smode == DImode)
>  	    {
> -	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> -	      emit_insn
> -		(gen_rtx_SET
> -		 (gen_rtx_SUBREG (SImode, scopy, 0),
> -		  gen_rtx_VEC_SELECT (SImode,
> -				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> -
> -	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> -	      emit_insn
> -		(gen_rtx_SET
> -		 (gen_rtx_SUBREG (SImode, scopy, 4),
> -		  gen_rtx_VEC_SELECT (SImode,
> -				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> +	      if (TARGET_SSE4_1)
> +		{
> +		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> +					      gen_rtvec (1, const0_rtx));
> +		  emit_insn
> +		    (gen_rtx_SET
> +		       (gen_rtx_SUBREG (SImode, scopy, 0),
> +			gen_rtx_VEC_SELECT (SImode,
> +					    gen_rtx_SUBREG (V4SImode, reg, 0),
> +					    tmp)));
> +
> +		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> +		  emit_insn
> +		    (gen_rtx_SET
> +		       (gen_rtx_SUBREG (SImode, scopy, 4),
> +			gen_rtx_VEC_SELECT (SImode,
> +					    gen_rtx_SUBREG (V4SImode, reg, 0),
> +					    tmp)));
> +		}
> +	      else
> +		{
> +		  rtx vcopy = gen_reg_rtx (V2DImode);
> +		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> +		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +				  gen_rtx_SUBREG (SImode, vcopy, 0));
> +		  emit_move_insn (vcopy,
> +				  gen_rtx_LSHIFTRT (V2DImode,
> +						    vcopy, GEN_INT (32)));
> +		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +				  gen_rtx_SUBREG (SImode, vcopy, 0));
> +		}
>  	    }
>  	  else
> -	    {
> -	      rtx vcopy = gen_reg_rtx (V2DImode);
> -	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> -	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -			      gen_rtx_SUBREG (SImode, vcopy, 0));
> -	      emit_move_insn (vcopy,
> -			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> -	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -			      gen_rtx_SUBREG (SImode, vcopy, 0));
> -	    }
> +	    emit_move_insn (scopy, reg);
> +
>  	  rtx_insn *seq = get_insns ();
>  	  end_sequence ();
>  	  emit_conversion_insns (seq, insn);
> @@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign
>     registers conversion.  */
>  
>  void
> -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
> +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
>  {
>    *op = copy_rtx_if_shared (*op);
>  
>    if (GET_CODE (*op) == NOT)
>      {
>        convert_op (&XEXP (*op, 0), insn);
> -      PUT_MODE (*op, V2DImode);
> +      PUT_MODE (*op, vmode);
>      }
>    else if (MEM_P (*op))
>      {
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
>  
>        emit_insn_before (gen_move_insn (tmp, *op), insn);
> -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
>  
>        if (dump_file)
>  	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> @@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op
>  	    gcc_assert (!DF_REF_CHAIN (ref));
>  	    break;
>  	  }
> -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> +      if (GET_MODE (*op) != vmode)
> +	*op = gen_rtx_SUBREG (vmode, *op, 0);
>      }
>    else if (CONST_INT_P (*op))
>      {
>        rtx vec_cst;
> -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
>  
>        /* Prefer all ones vector in case of -1.  */
>        if (constm1_operand (*op, GET_MODE (*op)))
> -	vec_cst = CONSTM1_RTX (V2DImode);
> +	vec_cst = CONSTM1_RTX (vmode);
>        else
> -	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> -					gen_rtvec (2, *op, const0_rtx));
> +	{
> +	  unsigned n = GET_MODE_NUNITS (vmode);
> +	  rtx *v = XALLOCAVEC (rtx, n);
> +	  v[0] = *op;
> +	  for (unsigned i = 1; i < n; ++i)
> +	    v[i] = const0_rtx;
> +	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> +	}
>  
> -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> +      if (!standard_sse_constant_p (vec_cst, vmode))
>  	{
>  	  start_sequence ();
> -	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> +	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
>  	  rtx_insn *seq = get_insns ();
>  	  end_sequence ();
>  	  emit_insn_before (seq, insn);
> @@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op
>    else
>      {
>        gcc_assert (SUBREG_P (*op));
> -      gcc_assert (GET_MODE (*op) == V2DImode);
> +      gcc_assert (GET_MODE (*op) == vmode);
>      }
>  }
>  
>  /* Convert INSN to vector mode.  */
>  
>  void
> -dimode_scalar_chain::convert_insn (rtx_insn *insn)
> +general_scalar_chain::convert_insn (rtx_insn *insn)
>  {
>    rtx def_set = single_set (insn);
>    rtx src = SET_SRC (def_set);
> @@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>      {
>        /* There are no scalar integer instructions and therefore
>  	 temporary register usage is required.  */
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (dst));
>        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
>      }
>  
>    switch (GET_CODE (src))
> @@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case ASHIFTRT:
>      case LSHIFTRT:
>        convert_op (&XEXP (src, 0), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>  
>      case PLUS:
> @@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case IOR:
>      case XOR:
>      case AND:
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
>        convert_op (&XEXP (src, 0), insn);
>        convert_op (&XEXP (src, 1), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>  
>      case NEG:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> +      src = gen_rtx_MINUS (vmode, subreg, src);
>        break;
>  
>      case NOT:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> -      src = gen_rtx_XOR (V2DImode, src, subreg);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> +      src = gen_rtx_XOR (vmode, src, subreg);
>        break;
>  
>      case MEM:
> @@ -939,17 +1027,17 @@ dimode_scalar_chain::convert_insn (rtx_i
>        break;
>  
>      case SUBREG:
> -      gcc_assert (GET_MODE (src) == V2DImode);
> +      gcc_assert (GET_MODE (src) == vmode);
>        break;
>  
>      case COMPARE:
>        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
>  
> -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> -		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> +		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
>  
>        if (REG_P (src))
> -	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> +	subreg = gen_rtx_SUBREG (vmode, src, 0);
>        else
>  	subreg = copy_rtx_if_shared (src);
>        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> @@ -977,7 +1065,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>    PATTERN (insn) = def_set;
>  
>    INSN_CODE (insn) = -1;
> -  recog_memoized (insn);
> +  int patt = recog_memoized (insn);
> +  if (patt == -1)
> +    fatal_insn_not_found (insn);
>    df_insn_rescan (insn);
>  }
>  
> @@ -1116,7 +1206,7 @@ timode_scalar_chain::convert_insn (rtx_i
>  }
>  
>  void
> -dimode_scalar_chain::convert_registers ()
> +general_scalar_chain::convert_registers ()
>  {
>    bitmap_iterator bi;
>    unsigned id;
> @@ -1186,7 +1276,7 @@ has_non_address_hard_reg (rtx_insn *insn
>  		     (const_int 0 [0])))  */
>  
>  static bool
> -convertible_comparison_p (rtx_insn *insn)
> +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    if (!TARGET_SSE4_1)
>      return false;
> @@ -1219,12 +1309,12 @@ convertible_comparison_p (rtx_insn *insn
>  
>    if (!SUBREG_P (op1)
>        || !SUBREG_P (op2)
> -      || GET_MODE (op1) != SImode
> -      || GET_MODE (op2) != SImode
> +      || GET_MODE (op1) != mode
> +      || GET_MODE (op2) != mode
>        || ((SUBREG_BYTE (op1) != 0
> -	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> +	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
>  	  && (SUBREG_BYTE (op2) != 0
> -	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> +	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
>      return false;
>  
>    op1 = SUBREG_REG (op1);
> @@ -1232,7 +1322,7 @@ convertible_comparison_p (rtx_insn *insn
>  
>    if (op1 != op2
>        || !REG_P (op1)
> -      || GET_MODE (op1) != DImode)
> +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
>      return false;
>  
>    return true;
> @@ -1241,7 +1331,7 @@ convertible_comparison_p (rtx_insn *insn
>  /* The DImode version of scalar_to_vector_candidate_p.  */
>  
>  static bool
> -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    rtx def_set = single_set (insn);
>  
> @@ -1255,12 +1345,12 @@ dimode_scalar_to_vector_candidate_p (rtx
>    rtx dst = SET_DEST (def_set);
>  
>    if (GET_CODE (src) == COMPARE)
> -    return convertible_comparison_p (insn);
> +    return convertible_comparison_p (insn, mode);
>  
>    /* We are interested in DImode promotion only.  */
> -  if ((GET_MODE (src) != DImode
> +  if ((GET_MODE (src) != mode
>         && !CONST_INT_P (src))
> -      || GET_MODE (dst) != DImode)
> +      || GET_MODE (dst) != mode)
>      return false;
>  
>    if (!REG_P (dst) && !MEM_P (dst))
> @@ -1280,6 +1370,15 @@ dimode_scalar_to_vector_candidate_p (rtx
>  	return false;
>        break;
>  
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
> +      if ((mode == DImode && !TARGET_AVX512VL)
> +	  || (mode == SImode && !TARGET_SSE4_1))
> +	return false;
> +      /* Fallthru.  */
> +
>      case PLUS:
>      case MINUS:
>      case IOR:
> @@ -1290,7 +1389,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>  	  && !CONST_INT_P (XEXP (src, 1)))
>  	return false;
>  
> -      if (GET_MODE (XEXP (src, 1)) != DImode
> +      if (GET_MODE (XEXP (src, 1)) != mode
>  	  && !CONST_INT_P (XEXP (src, 1)))
>  	return false;
>        break;
> @@ -1319,7 +1418,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>  	  || !REG_P (XEXP (XEXP (src, 0), 0))))
>        return false;
>  
> -  if (GET_MODE (XEXP (src, 0)) != DImode
> +  if (GET_MODE (XEXP (src, 0)) != mode
>        && !CONST_INT_P (XEXP (src, 0)))
>      return false;
>  
> @@ -1383,22 +1482,16 @@ timode_scalar_to_vector_candidate_p (rtx
>    return false;
>  }
>  
> -/* Return 1 if INSN may be converted into vector
> -   instruction.  */
> -
> -static bool
> -scalar_to_vector_candidate_p (rtx_insn *insn)
> -{
> -  if (TARGET_64BIT)
> -    return timode_scalar_to_vector_candidate_p (insn);
> -  else
> -    return dimode_scalar_to_vector_candidate_p (insn);
> -}
> +/* For a given bitmap of insn UIDs scan all instructions and
> +   remove an insn from CANDIDATES in case it has both convertible
> +   and not convertible definitions.
>  
> -/* The DImode version of remove_non_convertible_regs.  */
> +   All insns in a bitmap are conversion candidates according to
> +   scalar_to_vector_candidate_p.  Currently it implies all insns
> +   are single_set.  */
>  
>  static void
> -dimode_remove_non_convertible_regs (bitmap candidates)
> +general_remove_non_convertible_regs (bitmap candidates)
>  {
>    bitmap_iterator bi;
>    unsigned id;
> @@ -1553,23 +1646,6 @@ timode_remove_non_convertible_regs (bitm
>    BITMAP_FREE (regs);
>  }
>  
> -/* For a given bitmap of insn UIDs scans all instruction and
> -   remove insn from CANDIDATES in case it has both convertible
> -   and not convertible definitions.
> -
> -   All insns in a bitmap are conversion candidates according to
> -   scalar_to_vector_candidate_p.  Currently it implies all insns
> -   are single_set.  */
> -
> -static void
> -remove_non_convertible_regs (bitmap candidates)
> -{
> -  if (TARGET_64BIT)
> -    timode_remove_non_convertible_regs (candidates);
> -  else
> -    dimode_remove_non_convertible_regs (candidates);
> -}
> -
>  /* Main STV pass function.  Find and convert scalar
>     instructions into vector mode when profitable.  */
>  
> @@ -1577,11 +1653,14 @@ static unsigned int
>  convert_scalars_to_vector ()
>  {
>    basic_block bb;
> -  bitmap candidates;
>    int converted_insns = 0;
>  
>    bitmap_obstack_initialize (NULL);
> -  candidates = BITMAP_ALLOC (NULL);
> +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> +  for (unsigned i = 0; i < 3; ++i)
> +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
>  
>    calculate_dominance_info (CDI_DOMINATORS);
>    df_set_flags (DF_DEFER_INSN_RESCAN);
> @@ -1597,51 +1676,73 @@ convert_scalars_to_vector ()
>      {
>        rtx_insn *insn;
>        FOR_BB_INSNS (bb, insn)
> -	if (scalar_to_vector_candidate_p (insn))
> +	if (TARGET_64BIT
> +	    && timode_scalar_to_vector_candidate_p (insn))
>  	  {
>  	    if (dump_file)
> -	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
> +	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
>  		       INSN_UID (insn));
>  
> -	    bitmap_set_bit (candidates, INSN_UID (insn));
> +	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
> +	  }
> +	else
> +	  {
> +	    /* Check {SI,DI}mode.  */
> +	    for (unsigned i = 0; i <= 1; ++i)
> +	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> +		{
> +		  if (dump_file)
> +		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> +			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> +
> +		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
> +		  break;
> +		}
>  	  }
>      }
>  
> -  remove_non_convertible_regs (candidates);
> +  if (TARGET_64BIT)
> +    timode_remove_non_convertible_regs (&candidates[2]);
> +  for (unsigned i = 0; i <= 1; ++i)
> +    general_remove_non_convertible_regs (&candidates[i]);
>  
> -  if (bitmap_empty_p (candidates))
> -    if (dump_file)
> +  for (unsigned i = 0; i <= 2; ++i)
> +    if (!bitmap_empty_p (&candidates[i]))
> +      break;
> +    else if (i == 2 && dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
>  
> -  while (!bitmap_empty_p (candidates))
> -    {
> -      unsigned uid = bitmap_first_set_bit (candidates);
> -      scalar_chain *chain;
> +  for (unsigned i = 0; i <= 2; ++i)
> +    while (!bitmap_empty_p (&candidates[i]))
> +      {
> +	unsigned uid = bitmap_first_set_bit (&candidates[i]);
> +	scalar_chain *chain;
>  
> -      if (TARGET_64BIT)
> -	chain = new timode_scalar_chain;
> -      else
> -	chain = new dimode_scalar_chain;
> +	if (cand_mode[i] == TImode)
> +	  chain = new timode_scalar_chain;
> +	else
> +	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
>  
> -      /* Find instructions chain we want to convert to vector mode.
> -	 Check all uses and definitions to estimate all required
> -	 conversions.  */
> -      chain->build (candidates, uid);
> +	/* Find instructions chain we want to convert to vector mode.
> +	   Check all uses and definitions to estimate all required
> +	   conversions.  */
> +	chain->build (&candidates[i], uid);
>  
> -      if (chain->compute_convert_gain () > 0)
> -	converted_insns += chain->convert ();
> -      else
> -	if (dump_file)
> -	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> -		   chain->chain_id);
> +	if (chain->compute_convert_gain () > 0)
> +	  converted_insns += chain->convert ();
> +	else
> +	  if (dump_file)
> +	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> +		     chain->chain_id);
>  
> -      delete chain;
> -    }
> +	delete chain;
> +      }
>  
>    if (dump_file)
>      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
>  
> -  BITMAP_FREE (candidates);
> +  for (unsigned i = 0; i <= 2; ++i)
> +    bitmap_release (&candidates[i]);
>    bitmap_obstack_release (NULL);
>    df_process_deferred_rescans ();
>  
> Index: gcc/config/i386/i386-features.h
> ===================================================================
> --- gcc/config/i386/i386-features.h	(revision 274111)
> +++ gcc/config/i386/i386-features.h	(working copy)
> @@ -127,11 +127,16 @@ namespace {
>  class scalar_chain
>  {
>   public:
> -  scalar_chain ();
> +  scalar_chain (enum machine_mode, enum machine_mode);
>    virtual ~scalar_chain ();
>  
>    static unsigned max_id;
>  
> +  /* Scalar mode.  */
> +  enum machine_mode smode;
> +  /* Vector mode.  */
> +  enum machine_mode vmode;
> +
>    /* ID of a chain.  */
>    unsigned int chain_id;
>    /* A queue of instructions to be included into a chain.  */
> @@ -159,9 +164,11 @@ class scalar_chain
>    virtual void convert_registers () = 0;
>  };
>  
> -class dimode_scalar_chain : public scalar_chain
> +class general_scalar_chain : public scalar_chain
>  {
>   public:
> +  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> +    : scalar_chain (smode_, vmode_) {}
>    int compute_convert_gain ();
>   private:
>    void mark_dual_mode_def (df_ref def);
> @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
>  class timode_scalar_chain : public scalar_chain
>  {
>   public:
> +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> +
>    /* Convert from TImode to V1TImode is always faster.  */
>    int compute_convert_gain () { return 1; }
>  
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md	(revision 274111)
> +++ gcc/config/i386/i386.md	(working copy)
> @@ -17729,6 +17729,110 @@ (define_expand "add<mode>cc"
>     (match_operand:SWI 3 "const_int_operand")]
>    ""
>    "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
> +
> +;; min/max patterns
> +
> +(define_mode_iterator MAXMIN_IMODE
> +  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
> +(define_code_attr maxmin_rel
> +  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
> +
> +(define_expand "<code><mode>3"
> +  [(parallel
> +    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
> +	  (maxmin:MAXMIN_IMODE
> +	    (match_operand:MAXMIN_IMODE 1 "register_operand")
> +	    (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
> +     (clobber (reg:CC FLAGS_REG))])]
> +  "TARGET_STV")
> +
> +(define_insn_and_split "*<code><mode>3_1"
> +  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
> +	(maxmin:MAXMIN_IMODE
> +	  (match_operand:MAXMIN_IMODE 1 "register_operand")
> +	  (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +	(if_then_else:MAXMIN_IMODE (match_dup 3)
> +	  (match_dup 1)
> +	  (match_dup 2)))]
> +{
> +  machine_mode mode = <MODE>mode;
> +
> +  if (!register_operand (operands[2], mode))
> +    operands[2] = force_reg (mode, operands[2]);
> +
> +  enum rtx_code code = <maxmin_rel>;
> +  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
> +  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
> +
> +  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
> +  emit_insn (gen_rtx_SET (flags, tmp));
> +
> +  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
> +})
> +
> +(define_insn_and_split "*<code>di3_doubleword"
> +  [(set (match_operand:DI 0 "register_operand")
> +	(maxmin:DI (match_operand:DI 1 "register_operand")
> +		   (match_operand:DI 2 "nonimmediate_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +	(if_then_else:SI (match_dup 6)
> +	  (match_dup 1)
> +	  (match_dup 2)))
> +   (set (match_dup 3)
> +	(if_then_else:SI (match_dup 6)
> +	  (match_dup 4)
> +	  (match_dup 5)))]
> +{
> +  if (!register_operand (operands[2], DImode))
> +    operands[2] = force_reg (DImode, operands[2]);
> +
> +  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
> +
> +  rtx cmplo[2] = { operands[1], operands[2] };
> +  rtx cmphi[2] = { operands[4], operands[5] };
> +
> +  enum rtx_code code = <maxmin_rel>;
> +
> +  switch (code)
> +    {
> +    case LE: case LEU:
> +      std::swap (cmplo[0], cmplo[1]);
> +      std::swap (cmphi[0], cmphi[1]);
> +      code = swap_condition (code);
> +      /* FALLTHRU */
> +
> +    case GE: case GEU:
> +      {
> +	bool uns = (code == GEU);
> +	rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
> +	  = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
> +
> +	emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
> +
> +	rtx tmp = gen_rtx_SCRATCH (SImode);
> +	emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
> +
> +	rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
> +	operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
> +
> +	break;
> +      }
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +})
>  
>  ;; Misc patterns (?)
>  
> Index: gcc/testsuite/gcc.target/i386/minmax-3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv" } */
> +
> +#define max(a,b) (((a) > (b))? (a) : (b))
> +#define min(a,b) (((a) < (b))? (a) : (b))
> +
> +int ssi[1024];
> +unsigned int usi[1024];
> +long long sdi[1024];
> +unsigned long long udi[1024];
> +
> +#define CHECK(FN, VARIANT) \
> +void \
> +FN ## VARIANT (void) \
> +{ \
> +  for (int i = 1; i < 1024; ++i) \
> +    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
> +}
> +
> +CHECK(max, ssi);
> +CHECK(min, ssi);
> +CHECK(max, usi);
> +CHECK(min, usi);
> +CHECK(max, sdi);
> +CHECK(min, sdi);
> +CHECK(max, udi);
> +CHECK(min, udi);
> Index: gcc/testsuite/gcc.target/i386/minmax-4.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv -msse4.1" } */
> +
> +#include "minmax-3.c"
> +
> +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
> +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
> +/* { dg-final { scan-assembler-times "pminsd" 1 } } */
> +/* { dg-final { scan-assembler-times "pminud" 1 } } */
> Index: gcc/testsuite/gcc.target/i386/minmax-6.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-6.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-6.c	(working copy)
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=haswell" } */
> +
> +unsigned short
> +UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> +{
> +  if (y != width)
> +    {
> +      y = y < 0 ? 0 : y;
> +      return Pic[y * width];
> +    }
> +  return Pic[y];
> +} 
> +
> +/* We do not want the RA to spill %esi for its dual-use but using
> +   pmaxsd is OK.  */
> +/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> +/* { dg-final { scan-assembler "pmaxsd" } } */
Uros Bizjak Aug. 9, 2019, 1:56 p.m. UTC | #51
On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguenther@suse.de> wrote:

> > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > >
> > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > >
> > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > >
> > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > to force use of %zmmN?
> > > > > > >
> > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > >
> > > > > >     case SMAX:
> > > > > >     case SMIN:
> > > > > >     case UMAX:
> > > > > >     case UMIN:
> > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > >         return false;
> > > > > >
> > > > > > so there's no way to use AVX512VL for 32bit?
> > > > >
> > > > > There is a way, but on 32bit targets we need to split the DImode
> > > > > operation into a sequence of SImode operations for the unconverted
> > > > > pattern. This is of course doable, but somewhat more complex than
> > > > > simply emitting a DImode compare + DImode cmove, which is what the
> > > > > current splitter does. So, a follow-up task.
> > > >
> > > > Please find attached the complete .md part that enables SImode for
> > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
> > > > targets. The patterns also allow a memory operand 2, so STV has a
> > > > chance to create the vector pattern with an implicit load. In case STV
> > > > fails, the memory operand 2 is loaded into a register first; operand
> > > > 2 is used in the compare and cmove instructions, so pre-loading the
> > > > operand should be beneficial.
> > >
> > > Thanks.
> > >
> > > > Also note that splitting should happen rarely. Due to the cost
> > > > function, STV should effectively always convert minmax to a vector
> > > > insn.
> > >
> > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> > > this kind of "simple" conversion:
> > >
> > >   5.50 │1d0:   test   %esi,%es
> > >   0.07 │       mov    $0x0,%ex
> > >        │       cmovs  %eax,%es
> > >   5.84 │       imul   %r8d,%es
> > >
> > > to
> > >
> > >   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> > >   0.32 │       vpmaxs -0x10(%rsp),%xmm0,%xmm0
> > >  40.45 │       vmovd  %xmm0,%eax
> > >   2.45 │       imul   %r8d,%eax
> > >
> > > which looks like a RA artifact in the end.  We spill %esi only
> > > with -mstv here as STV introduces a (subreg:V4SI ...) use
> > > of a pseudo ultimately set from di.  STV creates an additional
> > > pseudo for this (copy-in) but it places that copy next to the
> > > original def rather than next to the start of the chain it
> > > converts, which is probably why we spill.  And this
> > > is because it inserts those at each definition of the pseudo
> > > rather than just at the reaching definition(s) or at the
> > > uses of the pseudo in the chain (that's because there may be
> > > defs of that pseudo in the chain itself).  Note that STV emits
> > > such "conversion" copies as simple reg-reg moves:
> > >
> > > (insn 1094 3 4 2 (set (reg:SI 777)
> > >         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
> > >      (nil))
> > >
> > > but those do not survive very long (this one gets removed by CSE2).
> > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> > > and computes
> > >
> > >     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
> > >     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> > >
> > > so I wonder if STV shouldn't instead emit gpr->xmm moves
> > > here (but I guess, again, nothing prevents RTL optimizers from
> > > combining that with the single use in the max instruction...).
> > >
> > > So this boils down to STV splitting live-ranges but other
> > > passes undoing that and then RA not considering splitting
> > > live-ranges here, arriving at suboptimal allocation.
> > >
> > > A testcase showing this issue is (simplified from 464.h264ref
> > > UMVLine16Y_11):
> > >
> > > unsigned short
> > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > {
> > >   if (y != width)
> > >     {
> > >       y = y < 0 ? 0 : y;
> > >       return Pic[y * width];
> > >     }
> > >   return Pic[y];
> > > }
> > >
> > > where the condition and the Pic[y] load mimic the other use of y.
> > > Different, even worse spilling is generated by
> > >
> > > unsigned short
> > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > {
> > >   y = y < 0 ? 0 : y;
> > >   return Pic[y * width] + y;
> > > }
> > >
> > > I guess this all shows that STV's "trick" of simply wrapping
> > > integer mode pseudos in (subreg:vector-mode ...) is bad?
> > >
> > > I've added a (failing) testcase to reflect the above.
> >
> > Experimenting a bit with using V4SImode pseudos just for the
> > conversion insns, we end up preserving those moves (but I
> > do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
> > ends up using movv4si_internal, which only leaves us with
> > memory for the SImode operand) _plus_ moving the move next
> > to the actual use has an effect.  Not necessarily a good one,
> > though:
> >
> >         vpxor   %xmm0, %xmm0, %xmm0
> >         vmovaps %xmm0, -16(%rsp)
> >         movl    %esi, -16(%rsp)
> >         vpmaxsd -16(%rsp), %xmm0, %xmm0
> >         vmovd   %xmm0, %eax
> >
> > eh?  I guess the lowpart set is not good (my patch has this
> > as well, but I got saved by never having vector modes to subset...).
> > Using
> >
> >     (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
> >             (const_vector:V4SI [
> >                     (const_int 0 [0]) repeated x4
> >                 ])
> >             (const_int 1 [0x1]))) "t3.c":5:10 -1
> >
> > for the move ends up with
> >
> >         vpxor   %xmm1, %xmm1, %xmm1
> >         vpinsrd $0, %esi, %xmm1, %xmm0
> >
> > eh?  LRA chooses the correct alternative here but somehow
> > postreload CSE CSEs the zero with the xmm1 clearing, leading
> > to the vpinsrd...  (I guess a general issue, not sure if really
> > worse - definitely a larger instruction).  Unfortunately
> > postreload-cse doesn't add a reg-equal note.  This happens only
> > when emitting the reg move before the use; not doing that emits
> > a vmovd as expected.
> >
> > At least the spilling is gone here.
> >
> > I am re-testing as follows; the main change is that
> > general_scalar_chain::make_vector_copies now generates a
> > vector pseudo as destination (and I've fixed up the code
> > to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
> >
> > Hope this fixes the observed slowdowns (it fixes the new testcase).
>
> It fixes the slowdown observed in 416.gamess and 464.h264ref.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
>
> CCing Jeff who "knows RTL".
>
> OK?

Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid
spurious test failures on SSE4.1 targets.

Uros.

> Thanks,
> Richard.
>
> > Richard.
> >
> > mccas.F:twotff_ for 416.gamess
> > refbuf.c:UMVLine16Y_11 for 464.h264ref
> >
> > 2019-08-07  Richard Biener  <rguenther@suse.de>
> >
> >       PR target/91154
> >       * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
> >       mode arguments.
> >       (scalar_chain::smode): New member.
> >       (scalar_chain::vmode): Likewise.
> >       (dimode_scalar_chain): Rename to...
> >       (general_scalar_chain): ... this.
> >       (general_scalar_chain::general_scalar_chain): Take mode arguments.
> >       (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
> >       base with TImode and V1TImode.
> >       * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
> >       (general_scalar_chain::vector_const_cost): Adjust for SImode
> >       chains.
> >       (general_scalar_chain::compute_convert_gain): Likewise.  Fix
> >       reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
> >       scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
> >       gain if not zero.
> >       (general_scalar_chain::replace_with_subreg): Use vmode/smode.
> >       Elide the subreg if the reg is already vector.
> >       (general_scalar_chain::make_vector_copies): Likewise.  Handle
> >       non-DImode chains appropriately.  Use a vector-mode pseudo as
> >       destination.
> >       (general_scalar_chain::convert_reg): Likewise.
> >       (general_scalar_chain::convert_op): Likewise.  Elide the
> >       subreg if the reg is already vector.
> >       (general_scalar_chain::convert_insn): Likewise.  Add
> >       fatal_insn_not_found if the result is not recognized.
> >       (convertible_comparison_p): Pass in the scalar mode and use that.
> >       (general_scalar_to_vector_candidate_p): Likewise.  Rename from
> >       dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
> >       (scalar_to_vector_candidate_p): Remove by inlining into single
> >       caller.
> >       (general_remove_non_convertible_regs): Rename from
> >       dimode_remove_non_convertible_regs.
> >       (remove_non_convertible_regs): Remove by inlining into single caller.
> >       (convert_scalars_to_vector): Handle SImode and DImode chains
> >       in addition to TImode chains.
> >       * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
> >
> >       * gcc.target/i386/pr91154.c: New testcase.
> >       * gcc.target/i386/minmax-3.c: Likewise.
> >       * gcc.target/i386/minmax-4.c: Likewise.
> >       * gcc.target/i386/minmax-5.c: Likewise.
> >       * gcc.target/i386/minmax-6.c: Likewise.
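
To make the doubleword splitting discussed above concrete: a 64-bit
signed max on a 32-bit target needs only an unsigned low-word compare
whose borrow feeds a signed high-word compare, which is exactly what a
cmp/sbb pair computes.  A minimal C sketch of the predicate (an
illustration of the idea, not the RTL a splitter emits; it assumes
GCC's arithmetic right shift on signed types):

/* 64-bit signed a >= b from 32-bit halves: true iff the high words
   compare greater, or they are equal and the low-word subtraction
   does not borrow: the cmp (low) + sbb (high) idiom.  */
static long long
smaxdi_split (long long a, long long b)
{
  unsigned int alo = (unsigned int) a, blo = (unsigned int) b;
  int ahi = (int) (a >> 32), bhi = (int) (b >> 32);
  int borrow = alo < blo;                         /* cmp sets CF      */
  int ge = ahi > bhi || (ahi == bhi && !borrow);  /* sbb + flags test */
  return ge ? a : b;                              /* a cmov per word  */
}

A real splitter keeps all of this in the flags register and selects
each half with a cmov instead of materializing ge; the doubleword
pattern in the follow-up below does just that via gen_cmp_1 and
gen_sub3_carry_ccc/gen_sub3_carry_ccgz.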
Jeff Law Aug. 9, 2019, 10:03 p.m. UTC | #52
On 7/27/19 3:22 AM, Uros Bizjak wrote:
> On Wed, Jul 24, 2019 at 5:03 PM Jeff Law <law@redhat.com> wrote:
> 
>>> Clearly this approach will run into register allocation issues
>>> but it looks cleaner than writing yet another STV-like pass
>>> (STV itself is quite awkwardly structured so I refrain from
>>> touching it...).
>>>
>>> Anyway - comments?  It seems to me that MMX-in-SSE does
>>> something very similar.
>>>
>>> Bootstrapped on x86_64-unknown-linux-gnu, previous testing
>>> revealed some issue.  Forgot that *add<mode>_1 also handles
>>> DImode..., fixed below, re-testing in progress.
>> Certainly simpler than most of the options and seems effective.
>>
>> FWIW, I think all the STV code is still disabled and has been for
>> several releases.  One could make an argument it should get dropped.  If
>> someone wants to make something like STV work, they can try again and
>> hopefully learn from the problems with the first implementation.
> 
> Huh?
> 
> STV code is *enabled by default* on 32bit SSE2 targets, and works
> surprisingly well (*) for DImode arithmetic, logic and constant shift
> operations. Even 32bit multilib on x86_64 is built with STV.
I must be mis-remembering or confusing it with something else.  Sorry
for any confusion.

Jeff
Richard Biener Aug. 12, 2019, 12:27 p.m. UTC | #53
On Fri, 9 Aug 2019, Uros Bizjak wrote:

> On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguenther@suse.de> wrote:
> 
> > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > > >
> > > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > > >
> > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > > >
> > > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > > to force use of %zmmN?
> > > > > > > >
> > > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > > >
> > > > > > >     case SMAX:
> > > > > > >     case SMIN:
> > > > > > >     case UMAX:
> > > > > > >     case UMIN:
> > > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > > >         return false;
> > > > > > >
> > > > > > > so there's no way to use AVX512VL for 32bit?
> > > > > >
> > > > > > There is a way, but on 32bit targets we need to split the DImode
> > > > > > operation into a sequence of SImode operations for the unconverted
> > > > > > pattern.  This is of course doable, but somewhat more complex than
> > > > > > simply emitting a DImode compare + DImode cmove, which is what the
> > > > > > current splitter does.  So, a follow-up task.
> > > > >
> > > > > Please find attached the complete .md part that enables SImode for
> > > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
> > > > > targets.  The patterns also allow for a memory operand 2, so STV has
> > > > > a chance to create the vector pattern with an implicit load.  In case
> > > > > STV fails, memory operand 2 is loaded into a register first; operand
> > > > > 2 is used in the compare and cmove instructions, so pre-loading the
> > > > > operand should be beneficial.
> > > >
> > > > Thanks.
> > > >
> > > > > Also note that splitting should happen rarely.  Due to the cost
> > > > > function, STV should effectively always convert minmax to a vector
> > > > > insn.
> > > >
> > > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> > > > this kind of "simple" conversion:
> > > >
> > > >   5.50 │1d0:   test   %esi,%es
> > > >   0.07 │       mov    $0x0,%ex
> > > >        │       cmovs  %eax,%es
> > > >   5.84 │       imul   %r8d,%es
> > > >
> > > > to
> > > >
> > > >   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> > > >   0.32 │       vpmaxs -0x10(%rsp),%xmm0,%xmm0
> > > >  40.45 │       vmovd  %xmm0,%eax
> > > >   2.45 │       imul   %r8d,%eax
> > > >
> > > > which looks like a RA artifact in the end.  We spill %esi only
> > > > with -mstv here as STV introduces a (subreg:V4SI ...) use
> > > > of a pseudo ultimately set from di.  STV creates an additional
> > > > pseudo for this (copy-in) but it places that copy next to the
> > > > original def rather than next to the start of the chain it
> > > > converts, which is probably why we spill.  And this
> > > > is because it inserts those at each definition of the pseudo
> > > > rather than just at the reaching definition(s) or at the
> > > > uses of the pseudo in the chain (that's because there may be
> > > > defs of that pseudo in the chain itself).  Note that STV emits
> > > > such "conversion" copies as simple reg-reg moves:
> > > >
> > > > (insn 1094 3 4 2 (set (reg:SI 777)
> > > >         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
> > > >      (nil))
> > > >
> > > > but those do not survive very long (this one gets removed by CSE2).
> > > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> > > > and computes
> > > >
> > > >     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
> > > >     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> > > >
> > > > so I wonder if STV shouldn't instead emit gpr->xmm moves
> > > > here (but I guess, again, nothing prevents RTL optimizers from
> > > > combining that with the single use in the max instruction...).
> > > >
> > > > So this boils down to STV splitting live-ranges but other
> > > > passes undoing that and then RA not considering splitting
> > > > live-ranges here, arriving at suboptimal allocation.
> > > >
> > > > A testcase showing this issue is (simplified from 464.h264ref
> > > > UMVLine16Y_11):
> > > >
> > > > unsigned short
> > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > > {
> > > >   if (y != width)
> > > >     {
> > > >       y = y < 0 ? 0 : y;
> > > >       return Pic[y * width];
> > > >     }
> > > >   return Pic[y];
> > > > }
> > > >
> > > > where the condition and the Pic[y] load mimic the other use of y.
> > > > Different, even worse spilling is generated by
> > > >
> > > > unsigned short
> > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > > {
> > > >   y = y < 0 ? 0 : y;
> > > >   return Pic[y * width] + y;
> > > > }
> > > >
> > > > I guess this all shows that STV's "trick" of simply wrapping
> > > > integer mode pseudos in (subreg:vector-mode ...) is bad?
> > > >
> > > > I've added a (failing) testcase to reflect the above.
> > >
> > > Experimenting a bit with using V4SImode pseudos just for the
> > > conversion insns, we end up preserving those moves (but I
> > > do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
> > > ends up using movv4si_internal, which only leaves us with
> > > memory for the SImode operand) _plus_ moving the move next
> > > to the actual use has an effect.  Not necessarily a good one,
> > > though:
> > >
> > >         vpxor   %xmm0, %xmm0, %xmm0
> > >         vmovaps %xmm0, -16(%rsp)
> > >         movl    %esi, -16(%rsp)
> > >         vpmaxsd -16(%rsp), %xmm0, %xmm0
> > >         vmovd   %xmm0, %eax
> > >
> > > eh?  I guess the lowpart set is not good (my patch has this
> > > as well, but I got saved by never having vector modes to subset...).
> > > Using
> > >
> > >     (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
> > >             (const_vector:V4SI [
> > >                     (const_int 0 [0]) repeated x4
> > >                 ])
> > >             (const_int 1 [0x1]))) "t3.c":5:10 -1
> > >
> > > for the move ends up with
> > >
> > >         vpxor   %xmm1, %xmm1, %xmm1
> > >         vpinsrd $0, %esi, %xmm1, %xmm0
> > >
> > > eh?  LRA chooses the correct alternative here but somehow
> > > postreload CSE CSEs the zero with the xmm1 clearing, leading
> > > to the vpinsrd...  (I guess a general issue, not sure if really
> > > worse - definitely a larger instruction).  Unfortunately
> > > postreload-cse doesn't add a reg-equal note.  This happens only
> > > when emitting the reg move before the use; not doing that emits
> > > a vmovd as expected.
> > >
> > > At least the spilling is gone here.
> > >
> > > I am re-testing as follows; the main change is that
> > > general_scalar_chain::make_vector_copies now generates a
> > > vector pseudo as destination (and I've fixed up the code
> > > to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
> > >
> > > Hope this fixes the observed slowdowns (it fixes the new testcase).
> >
> > It fixes the slowdown observed in 416.gamess and 464.h264ref.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> >
> > CCing Jeff who "knows RTL".
> >
> > OK?
> 
> Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid
> spurious test failures on SSE4.1 targets.

Done.  I've also adjusted the i386.md changelog as follows:

        * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
        (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
        (*<maxmin>di3_doubleword): Likewise.

I see

FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest
FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest
FAIL: gcc.target/i386/pr78794.c scan-assembler pandn

with the latest patch (this is with -m32).  For pr65105-5.c, -mstv
causes all spills to go away and the cmoves to be replaced (so
clearly better code after the patch); for pr65105-3.c there are no
obvious improvements, and cmov does appear with -mstv.
I'd rather not "fix" those by adding -mno-stv but instead have
the Intel people fix costing for slm and/or decide what to do.
For pr65105-3.c I'm not sure why if-conversion didn't choose
to use cmov, so clearly the enabled minmax patterns expose the
"failure" here.

I've also seen a 32bit ICE for a bogus store we create with the
live-range splitting fix; it is fixed in the patch below (convert_insn's
REG src handling with a MEM dst needs to account for a vector-mode
src case).

Maybe it would help to split out changes unrelated to {DI,SI}mode
chain support from the STV costing and also separately install
the live-range splitting "fix"?  I'm willing to do some more
legwork to make review and approval easier here.

Anyway, bootstrapped & tested on x86_64-unknown-linux-gnu.
I've re-checked SPEC CPU 2006 on Haswell with no changes over the
previous results.

Thanks,
Richard.

2019-08-12  Richard Biener  <rguenther@suse.de>

	PR target/91154
	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
	mode arguments.
	(scalar_chain::smode): New member.
	(scalar_chain::vmode): Likewise.
	(dimode_scalar_chain): Rename to...
	(general_scalar_chain): ... this.
	(general_scalar_chain::general_scalar_chain): Take mode arguments.
	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
	base with TImode and V1TImode.
	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
	(general_scalar_chain::vector_const_cost): Adjust for SImode
	chains.
	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
	gain if not zero.
	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
	Elide the subreg if the reg is already vector.
	(general_scalar_chain::make_vector_copies): Likewise.  Handle
	non-DImode chains appropriately.  Use a vector-mode pseudo as
	destination.
	(general_scalar_chain::convert_reg): Likewise.
	(general_scalar_chain::convert_op): Likewise.  Elide the
	subreg if the reg is already vector.
	(general_scalar_chain::convert_insn): Likewise.  Add
	fatal_insn_not_found if the result is not recognized.
	(convertible_comparison_p): Pass in the scalar mode and use that.
	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
	(scalar_to_vector_candidate_p): Remove by inlining into single
	caller.
	(general_remove_non_convertible_regs): Rename from
	dimode_remove_non_convertible_regs.
	(remove_non_convertible_regs): Remove by inlining into single caller.
	(convert_scalars_to_vector): Handle SImode and DImode chains
	in addition to TImode chains.
	* config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
	(*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
	(*<maxmin>di3_doubleword): Likewise.

	* gcc.target/i386/pr91154.c: New testcase.
	* gcc.target/i386/minmax-3.c: Likewise.
	* gcc.target/i386/minmax-4.c: Likewise.
	* gcc.target/i386/minmax-5.c: Likewise.
	* gcc.target/i386/minmax-6.c: Likewise.
	* gcc.target/i386/minmax-1.c: Add -mno-stv.
	* gcc.target/i386/minmax-2.c: Likewise.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274278)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
    if (dump_file)
@@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
    conversion.  */
 
 void
-dimode_scalar_chain::mark_dual_mode_def (df_ref def)
+general_scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ???  The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
    instead of using a scalar one.  */
 
 int
-dimode_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp)
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
 
 int
-dimode_scalar_chain::compute_convert_gain ()
+general_scalar_chain::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
     	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
-	  gain += ix86_cost->shift_const;
+	    igain -= vector_const_cost (XEXP (src, 0));
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
 	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    gain -= COSTS_N_INSNS (1);
+	    igain -= COSTS_N_INSNS (1);
 	}
       else if (GET_CODE (src) == PLUS
 	       || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
 	       || GET_CODE (src) == XOR
 	       || GET_CODE (src) == AND)
 	{
-	  gain += ix86_cost->add;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
 	  /* Additional gain for andnot for targets without BMI.  */
 	  if (GET_CODE (XEXP (src, 0)) == NOT
 	      && !TARGET_BMI)
-	    gain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
 	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
+	    igain -= vector_const_cost (XEXP (src, 0));
 	  if (CONST_INT_P (XEXP (src, 1)))
-	    gain -= vector_const_cost (XEXP (src, 1));
+	    igain -= vector_const_cost (XEXP (src, 1));
 	}
       else if (GET_CODE (src) == NEG
 	       || GET_CODE (src) == NOT)
-	gain += ix86_cost->add - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
 	{
 	  /* Assume comparison cost is the same.  */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
 	{
 	  if (REG_P (dst))
-	    gain += COSTS_N_INSNS (2);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += COSTS_N_INSNS (m);
 	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-	  gain -= vector_const_cost (src);
+	    igain += (m * ix86_cost->int_store[2]
+		     - ix86_cost->sse_store[sse_cost_idx]);
+	  igain -= vector_const_cost (src);
 	}
       else
 	gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ???  What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return (GET_MODE (new_reg) == vmode
+	    ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0));
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
+general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
 						  rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
@@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx
    and replace its uses in a chain.  */
 
 void
-dimode_scalar_chain::make_vector_copies (unsigned regno)
+general_scalar_chain::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
-  rtx vreg = gen_reg_rtx (DImode);
+  rtx vreg = gen_reg_rtx (vmode);
   df_ref ref;
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
@@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies
 	start_sequence ();
 	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
 	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
-	    emit_move_insn (vreg, tmp);
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
+	    emit_move_insn (vreg,
+			    gen_rtx_VEC_MERGE (vmode,
+					       gen_rtx_VEC_DUPLICATE (vmode,
+								      tmp),
+					       CONST0_RTX (vmode),
+					       GEN_INT (HOST_WIDE_INT_1U)));
+
 	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
 	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (SImode, reg, 4),
+					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
+			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
 	  }
 	else
 	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	    emit_move_insn (vreg,
+			    gen_rtx_VEC_MERGE (vmode,
+					       gen_rtx_VEC_DUPLICATE (vmode,
+								      reg),
+					       CONST0_RTX (vmode),
+					       GEN_INT (HOST_WIDE_INT_1U)));
 	  }
 	rtx_insn *seq = get_insns ();
 	end_sequence ();
@@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies
    in case register is used in not convertible insn.  */
 
 void
-dimode_scalar_chain::convert_reg (unsigned regno)
+general_scalar_chain::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign
 	  start_sequence ();
 	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
 	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
 	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
 	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
 	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 0),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 4),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
+				  gen_rtx_LSHIFTRT (V2DImode,
+						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
 	    }
 	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_conversion_insns (seq, insn);
@@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign
    registers conversion.  */
 
 void
-dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
 	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op
 	    gcc_assert (!DF_REF_CHAIN (ref));
 	    break;
 	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      if (GET_MODE (*op) != vmode)
+	*op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
 	{
 	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_insn_before (seq, insn);
@@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
 /* Convert INSN to vector mode.  */
 
 void
-dimode_scalar_chain::convert_insn (rtx_insn *insn)
+general_scalar_chain::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
 	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -936,20 +1024,22 @@ dimode_scalar_chain::convert_insn (rtx_i
     case REG:
       if (!MEM_P (dst))
 	convert_op (&src, insn);
+      else if (GET_MODE (src) != smode)
+	src = gen_rtx_SUBREG (smode, src, 0);
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
 	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1067,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1116,7 +1208,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }
 
 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1186,7 +1278,7 @@ has_non_address_hard_reg (rtx_insn *insn
 		     (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1311,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
 	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1324,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1333,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1347,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1372,15 @@ dimode_scalar_to_vector_candidate_p (rtx
 	return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512VL)
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1391,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
       break;
@@ -1319,7 +1420,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  || !REG_P (XEXP (XEXP (src, 0), 0))))
       return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1383,22 +1484,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instruction and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1553,23 +1648,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1655,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1678,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
 	  {
 	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
 		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
 	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
+	   Check all uses and definitions to estimate all required
+	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274278)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
  private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274278)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17719,6 +17719,110 @@ (define_expand "add<mode>cc"
    (match_operand:SWI 3 "const_int_operand")]
   ""
   "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
+
+;; min/max patterns
+
+(define_mode_iterator MAXMIN_IMODE
+  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
+(define_code_attr maxmin_rel
+  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
+
+(define_expand "<code><mode>3"
+  [(parallel
+    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	  (maxmin:MAXMIN_IMODE
+	    (match_operand:MAXMIN_IMODE 1 "register_operand")
+	    (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+     (clobber (reg:CC FLAGS_REG))])]
+  "TARGET_STV")
+
+(define_insn_and_split "*<code><mode>3_1"
+  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	(maxmin:MAXMIN_IMODE
+	  (match_operand:MAXMIN_IMODE 1 "register_operand")
+	  (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:MAXMIN_IMODE (match_dup 3)
+	  (match_dup 1)
+	  (match_dup 2)))]
+{
+  machine_mode mode = <MODE>mode;
+
+  if (!register_operand (operands[2], mode))
+    operands[2] = force_reg (mode, operands[2]);
+
+  enum rtx_code code = <maxmin_rel>;
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
+
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  emit_insn (gen_rtx_SET (flags, tmp));
+
+  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+})
+
+(define_insn_and_split "*<code>di3_doubleword"
+  [(set (match_operand:DI 0 "register_operand")
+	(maxmin:DI (match_operand:DI 1 "register_operand")
+		   (match_operand:DI 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 1)
+	  (match_dup 2)))
+   (set (match_dup 3)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 4)
+	  (match_dup 5)))]
+{
+  if (!register_operand (operands[2], DImode))
+    operands[2] = force_reg (DImode, operands[2]);
+
+  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+
+  rtx cmplo[2] = { operands[1], operands[2] };
+  rtx cmphi[2] = { operands[4], operands[5] };
+
+  enum rtx_code code = <maxmin_rel>;
+
+  switch (code)
+    {
+    case LE: case LEU:
+      std::swap (cmplo[0], cmplo[1]);
+      std::swap (cmphi[0], cmphi[1]);
+      code = swap_condition (code);
+      /* FALLTHRU */
+
+    case GE: case GEU:
+      {
+	bool uns = (code == GEU);
+	rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
+	  = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
+
+	emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
+
+	rtx tmp = gen_rtx_SCRATCH (SImode);
+	emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
+
+	rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
+	operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+
+	break;
+      }
+
+    default:
+      gcc_unreachable ();
+    }
+})
 
 ;; Misc patterns (?)
 
Index: gcc/testsuite/gcc.target/i386/pr91154.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr91154.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr91154.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1 -mstv" } */
+
+void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
+{
+  int sc;
+  int k;
+  for (k = 1; k <= M; k++)
+    {
+      dc[k] = dc[k-1] + tpdd[k-1];
+      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
+      if (dc[k] < -987654321) dc[k] = -987654321;
+    }
+}
+
+/* We want to convert the loop to SSE since SSE pmaxsd is faster than
+   compare + conditional move.  */
+/* { dg-final { scan-assembler-not "cmov" } } */
+/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */
+/* { dg-final { scan-assembler-times "paddd" 2 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-1.c	(revision 274278)
+++ gcc/testsuite/gcc.target/i386/minmax-1.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=opteron" } */
+/* { dg-options "-O2 -march=opteron -mno-stv" } */
 /* { dg-final { scan-assembler "test" } } */
 /* { dg-final { scan-assembler-not "cmp" } } */
 #define max(a,b) (((a) > (b))? (a) : (b))
Index: gcc/testsuite/gcc.target/i386/minmax-2.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-2.c	(revision 274278)
+++ gcc/testsuite/gcc.target/i386/minmax-2.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-stv" } */
 /* { dg-final { scan-assembler "test" } } */
 /* { dg-final { scan-assembler-not "cmp" } } */
 #define max(a,b) (((a) > (b))? (a) : (b))
Index: gcc/testsuite/gcc.target/i386/minmax-3.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int ssi[1024];
+unsigned int usi[1024];
+long long sdi[1024];
+unsigned long long udi[1024];
+
+#define CHECK(FN, VARIANT) \
+void \
+FN ## VARIANT (void) \
+{ \
+  for (int i = 1; i < 1024; ++i) \
+    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
+}
+
+CHECK(max, ssi);
+CHECK(min, ssi);
+CHECK(max, usi);
+CHECK(min, usi);
+CHECK(max, sdi);
+CHECK(min, sdi);
+CHECK(max, udi);
+CHECK(min, udi);
Index: gcc/testsuite/gcc.target/i386/minmax-4.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -msse4.1" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "pminsd" 1 } } */
+/* { dg-final { scan-assembler-times "pminud" 1 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-6.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-6.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-6.c	(working copy)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=haswell" } */
+
+unsigned short
+UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
+{
+  if (y != width)
+    {
+      y = y < 0 ? 0 : y;
+      return Pic[y * width];
+    }
+  return Pic[y];
+} 
+
+/* We do not want the RA to spill %esi for its dual use but using
+   pmaxsd is OK.  */
+/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
+/* { dg-final { scan-assembler "pmaxsd" } } */
Uros Bizjak Aug. 12, 2019, 2:15 p.m. UTC | #54
On Mon, Aug 12, 2019 at 2:27 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Fri, 9 Aug 2019, Uros Bizjak wrote:
>
> > On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > > > >
> > > > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > > > >
> > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > > > >
> > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > > > to force use of %zmmN?
> > > > > > > > >
> > > > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > > > >
> > > > > > > >     case SMAX:
> > > > > > > >     case SMIN:
> > > > > > > >     case UMAX:
> > > > > > > >     case UMIN:
> > > > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > > > >         return false;
> > > > > > > >
> > > > > > > > so there's no way to use AVX512VL for 32bit?
> > > > > > >
> > > > > > > There is a way, but on 32bit targets we need to split the DImode
> > > > > > > operation into a sequence of SImode operations for the unconverted
> > > > > > > pattern.  This is of course doable, but somewhat more complex than
> > > > > > > simply emitting a DImode compare + DImode cmove, which is what the
> > > > > > > current splitter does.  So, a follow-up task.
> > > > > >
> > > > > > Please find attached the complete .md part that enables SImode for
> > > > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
> > > > > > targets.  The patterns also allow a memory operand 2, so STV has a
> > > > > > chance to create the vector pattern with an implicit load.  In case
> > > > > > STV fails, memory operand 2 is loaded into a register first; operand
> > > > > > 2 is used in the compare and cmove instructions, so pre-loading the
> > > > > > operand should be beneficial.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > > Also note that splitting should happen rarely. Due to the cost
> > > > > > function, STV should effectively always convert minmax to a vector
> > > > > > insn.
> > > > >
> > > > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> > > > > this kind of "simple" conversion:
> > > > >
> > > > >   5.50 │1d0:   test   %esi,%es
> > > > >   0.07 │       mov    $0x0,%ex
> > > > >        │       cmovs  %eax,%es
> > > > >   5.84 │       imul   %r8d,%es
> > > > >
> > > > > to
> > > > >
> > > > >   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> > > > >   0.32 │       vpmaxs -0x10(%rsp),%xmm0,%xmm0
> > > > >  40.45 │       vmovd  %xmm0,%eax
> > > > >   2.45 │       imul   %r8d,%eax
> > > > >
> > > > > which looks like an RA artifact in the end.  We spill %esi only
> > > > > with -mstv here as STV introduces a (subreg:V4SI ...) use
> > > > > of a pseudo ultimately set from di.  STV creates an additional
> > > > > pseudo for this (copy-in) but it places that copy next to the
> > > > > original def rather than next to the start of the chain it
> > > > > converts, which is probably why we spill.  And this
> > > > > is because it inserts those at each definition of the pseudo
> > > > > rather than just at the reaching definition(s) or at the
> > > > > uses of the pseudo in the chain (that is because there may be
> > > > > defs of that pseudo in the chain itself).  Note that STV emits
> > > > > such "conversion" copies as simple reg-reg moves:
> > > > >
> > > > > (insn 1094 3 4 2 (set (reg:SI 777)
> > > > >         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
> > > > >      (nil))
> > > > >
> > > > > but those do not prevail very long (this one gets removed by CSE2).
> > > > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> > > > > and computes
> > > > >
> > > > >     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
> > > > >     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> > > > >
> > > > > so I wonder if STV shouldn't instead emit gpr->xmm moves
> > > > > here (but I guess nothing again prevents RTL optimizers from
> > > > > combining that with the single-use in the max instruction...).
> > > > >
> > > > > So this boils down to STV splitting live-ranges but other
> > > > > passes undoing that and then RA not considering splitting
> > > > > live-ranges here, arriving at a suboptimal allocation.
> > > > >
> > > > > A testcase showing this issue is (simplified from 464.h264ref
> > > > > UMVLine16Y_11):
> > > > >
> > > > > unsigned short
> > > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > > > {
> > > > >   if (y != width)
> > > > >     {
> > > > >       y = y < 0 ? 0 : y;
> > > > >       return Pic[y * width];
> > > > >     }
> > > > >   return Pic[y];
> > > > > }
> > > > >
> > > > > where the condition and the Pic[y] load mimics the other use of y.
> > > > > Different, even worse spilling is generated by
> > > > >
> > > > > unsigned short
> > > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > > > {
> > > > >   y = y < 0 ? 0 : y;
> > > > >   return Pic[y * width] + y;
> > > > > }
> > > > >
> > > > > I guess this all shows that STVs "trick" of simply wrapping
> > > > > integer mode pseudos in (subreg:vector-mode ...) is bad?
> > > > >
> > > > > I've added a (failing) testcase to reflect the above.
> > > >
> > > > Experimenting a bit with using V4SImode pseudos just for the
> > > > conversion insns, we end up preserving those moves (but I
> > > > do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
> > > > ends up using movv4si_internal, which only leaves us with
> > > > memory for the SImode operand) _plus_ moving the move next
> > > > to the actual use has an effect.  Not necessarily a good one
> > > > though:
> > > >
> > > >         vpxor   %xmm0, %xmm0, %xmm0
> > > >         vmovaps %xmm0, -16(%rsp)
> > > >         movl    %esi, -16(%rsp)
> > > >         vpmaxsd -16(%rsp), %xmm0, %xmm0
> > > >         vmovd   %xmm0, %eax
> > > >
> > > > eh?  I guess the lowpart set is not good (my patch has this
> > > > as well, but I got saved by never having vector modes to subset...).
> > > > Using
> > > >
> > > >     (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
> > > >             (const_vector:V4SI [
> > > >                     (const_int 0 [0]) repeated x4
> > > >                 ])
> > > >             (const_int 1 [0x1]))) "t3.c":5:10 -1
> > > >
> > > > for the move ends up with
> > > >
> > > >         vpxor   %xmm1, %xmm1, %xmm1
> > > >         vpinsrd $0, %esi, %xmm1, %xmm0
> > > >
> > > > eh?  LRA chooses the correct alternative here but somehow
> > > > postreload CSE CSEs the zero with the xmm1 clearing, leading
> > > > to the vpinsrd...  (I guess a general issue, not sure if really
> > > > worse - definitely a larger instruction).  Unfortunately
> > > > postreload-cse doesn't add a reg-equal note.  This happens only
> > > > when emitting the reg move before the use, not doing that emits
> > > > a vmovd as expected.
> > > >
> > > > At least the spilling is gone here.
> > > >
> > > > I am re-testing as follows, the main change is that
> > > > general_scalar_chain::make_vector_copies now generates a
> > > > vector pseudo as destination (and I've fixed up the code
> > > > to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
> > > >
> > > > Hope this fixes the observed slowdowns (it fixes the new testcase).
> > >
> > > It fixes the slowdown observed in 416.gamess and 464.h264ref.
> > >
> > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> > >
> > > CCing Jeff who "knows RTL".
> > >
> > > OK?
> >
> > Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid
> > spurious test failures on SSE4.1 targets.
>
> Done.  I've also adjusted the i386.md changelog as follows:
>
>         * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
>         (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
>         (*<maxmin>di3_doubleword): Likewise.
>
> I see
>
> FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest
> FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest
> FAIL: gcc.target/i386/pr78794.c scan-assembler pandn
>
> with the latest patch (this is with -m32) where -mstv causes
> all spills to go away and the cmoves replaced (so clearly
> better code after the patch) for pr65105-5.c, no obvious
> improvements for pr65105-3.c where cmov does appear with -mstv.
> I'd rather not "fix" those by adding -mno-stv but instead have
> the Intel people fix costing for slm and/or decide what to do.
> For pr65105-3.c I'm not sure why if-conversion didn't choose
> to use cmov, so clearly the enabled minmax patterns expose the
> "failure" here.
>
> I've also seen a 32bit ICE for a bogus store we create with the
> live-range splitting fix, fixed in the patch below (convert_insn
> REG src handling with MEM dst needs to account for a vector-mode
> src case).
>
> Maybe it would help to split out changes unrelated to {DI,SI}mode
> chain support from the STV costing and also separately install
> the live-range splitting "fix"?  I'm willing to do some more
> legwork to make review and approval easier here.

I think this is a good idea. We now have three semi-related changes
here; if these can be split by topic into independent changes, this
would also help bisection. It looks like the generalization from
DImode support to DI/SImode comprises mostly mechanical changes, and
the costing is only tangential to these changes.

Uros.
Jeff Law Aug. 13, 2019, 3:16 p.m. UTC | #55
On 8/9/19 7:00 AM, Richard Biener wrote:
> [... quoted text trimmed; it repeats the exchange quoted in full above ...]
> 
> It fixes the slowdown observed in 416.gamess and 464.h264ref.
> 
> Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> 
> CCing Jeff who "knows RTL".
What specifically do you want me to look at?  I'm not really familiar
with the STV stuff, but can certainly take a peek.


Jeff
Jeff Law Aug. 13, 2019, 3:20 p.m. UTC | #56
On 8/12/19 6:27 AM, Richard Biener wrote:
> [... quoted text trimmed; it repeats the exchange quoted in full above ...]
> I see
> 
> FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest
> FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest
> FAIL: gcc.target/i386/pr78794.c scan-assembler pandn
> 
> with the latest patch (this is with -m32) where -mstv causes
> all spills to go away and the cmoves replaced (so clearly
> better code after the patch) for pr65105-5.c, no obvious
> improvements for pr65105-3.c where cmov does appear with -mstv.
> I'd rather not "fix" those by adding -mno-stv but instead have
> the Intel people fix costing for slm and/or decide what to do.
> For pr65105-3.c I'm not sure why if-conversion didn't choose
> to use cmov, so clearly the enabled minmax patterns expose the
> "failure" here.
I'm not sure how much effort Intel is putting into Silvermont tuning
these days.  So I'd suggest giving HJ a heads-up and a reasonable period
of time to take a looksie, but I wouldn't hold the patch for long due to
a Silvermont tuning issue.

jeff
H.J. Lu Aug. 13, 2019, 7:53 p.m. UTC | #57
On Tue, Aug 13, 2019 at 8:20 AM Jeff Law <law@redhat.com> wrote:
> [... quoted text trimmed; it repeats the exchange quoted in full above ...]
> I'm not sure how much effort Intel is putting into Silvermont tuning
> these days.  So I'd suggest giving HJ a heads-up and a reasonable period
> of time to take a looksie, but I wouldn't hold the patch for long due to
> a Silvermont tuning issue.

Leave pr65105-3.c to fail for now.  We can take a look later.

Thanks.
Richard Biener Aug. 14, 2019, 9:08 a.m. UTC | #58
On Tue, 13 Aug 2019, Jeff Law wrote:

> On 8/9/19 7:00 AM, Richard Biener wrote:
> > 
> > It fixes the slowdown observed in 416.gamess and 464.h264ref.
> > 
> > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> > 
> > CCing Jeff who "knows RTL".
> What specifically do you want me to look at?  I'm not really familiar
> with the STV stuff, but can certainly take a peek.

Below is the updated patch with the already approved and committed
parts taken out.  It is now mostly mechanical apart from the
make_vector_copies and convert_reg changes, which move existing
"patterns" under appropriate conditionals and add handling of the
case where the scalar mode fits in a single GPR (previously it
was -m32 DImode only, now it handles -m32/-m64 SImode and DImode).
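
For reference, a minimal C sketch (my own, modeled on the pr91154
testcase rather than taken from the patch; whether STV actually
converts it depends on -msse4.1 and the cost model) of the kind of
single-GPR SImode chain this now handles - a memory-to-memory add
feeding an smax against a constant:

  void
  clamp_store (int *dst, const int *a, const int *b)
  {
    int t = *a + *b;         /* SImode add, can be done in an SSE reg.  */
    if (t < -987654321)      /* becomes MAX_EXPR -> smax -> pmaxsd.  */
      t = -987654321;
    *dst = t;                /* SImode store straight from the SSE reg.  */
  }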

I'm redoing bootstrap / regtest on x86_64-unknown-linux-gnu now just
to be safe.

OK?

I do expect we need to work on the compile-time issue I placed ???
comments on and more generally try to avoid using DF so much.

Thanks,
Richard.

2019-08-13  Richard Biener  <rguenther@suse.de>

	PR target/91154
	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
	mode arguments.
	(scalar_chain::smode): New member.
	(scalar_chain::vmode): Likewise.
	(dimode_scalar_chain): Rename to...
	(general_scalar_chain): ... this.
	(general_scalar_chain::general_scalar_chain): Take mode arguments.
	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
	base with TImode and V1TImode.
	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
	(general_scalar_chain::vector_const_cost): Adjust for SImode
	chains.
	(general_scalar_chain::compute_convert_gain): Likewise.  Add
	{S,U}{MIN,MAX} support.
	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
	(general_scalar_chain::make_vector_copies): Likewise.  Handle
	non-DImode chains appropriately.
	(general_scalar_chain::convert_reg): Likewise.
	(general_scalar_chain::convert_op): Likewise.
	(general_scalar_chain::convert_insn): Likewise.  Add
	fatal_insn_not_found if the result is not recognized.
	(convertible_comparison_p): Pass in the scalar mode and use that.
	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
	(scalar_to_vector_candidate_p): Remove by inlining into single
	caller.
	(general_remove_non_convertible_regs): Rename from
	dimode_remove_non_convertible_regs.
	(remove_non_convertible_regs): Remove by inlining into single caller.
	(convert_scalars_to_vector): Handle SImode and DImode chains
	in addition to TImode chains.
	* config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
	(*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
	(*<maxmin>di3_doubleword): Likewise.

	* gcc.target/i386/pr91154.c: New testcase.
	* gcc.target/i386/minmax-3.c: Likewise.
	* gcc.target/i386/minmax-4.c: Likewise.
	* gcc.target/i386/minmax-5.c: Likewise.
	* gcc.target/i386/minmax-6.c: Likewise.
	* gcc.target/i386/minmax-1.c: Add -mno-stv.
	* gcc.target/i386/minmax-2.c: Likewise.

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274422)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
    if (dump_file)
@@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
    conversion.  */
 
 void
-dimode_scalar_chain::mark_dual_mode_def (df_ref def)
+general_scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ???  The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
    instead of using a scalar one.  */
 
 int
-dimode_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp)
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
 
 int
-dimode_scalar_chain::compute_convert_gain ()
+general_scalar_chain::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -491,6 +499,13 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
@@ -500,18 +515,19 @@ dimode_scalar_chain::compute_convert_gai
       int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	igain += 2 - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	igain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-	igain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
     	  if (CONST_INT_P (XEXP (src, 0)))
 	    igain -= vector_const_cost (XEXP (src, 0));
-	  igain += 2 * ix86_cost->shift_const - ix86_cost->sse_op;
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
 	  if (INTVAL (XEXP (src, 1)) >= 32)
 	    igain -= COSTS_N_INSNS (1);
 	}
@@ -521,11 +537,11 @@ dimode_scalar_chain::compute_convert_gai
 	       || GET_CODE (src) == XOR
 	       || GET_CODE (src) == AND)
 	{
-	  igain += 2 * ix86_cost->add - ix86_cost->sse_op;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
 	  /* Additional gain for andnot for targets without BMI.  */
 	  if (GET_CODE (XEXP (src, 0)) == NOT
 	      && !TARGET_BMI)
-	    igain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
 	  if (CONST_INT_P (XEXP (src, 0)))
 	    igain -= vector_const_cost (XEXP (src, 0));
@@ -534,7 +550,18 @@ dimode_scalar_chain::compute_convert_gai
 	}
       else if (GET_CODE (src) == NEG
 	       || GET_CODE (src) == NOT)
-	igain += 2 * ix86_cost->add - ix86_cost->sse_op - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op - COSTS_N_INSNS (1);
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
 	{
 	  /* Assume comparison cost is the same.  */
@@ -542,9 +569,11 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
 	{
 	  if (REG_P (dst))
-	    igain += 2 * COSTS_N_INSNS (1);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += m * COSTS_N_INSNS (1);
 	  else if (MEM_P (dst))
-	    igain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	    igain += (m * ix86_cost->int_store[2]
+		     - ix86_cost->sse_store[sse_cost_idx]);
 	  igain -= vector_const_cost (src);
 	}
       else
@@ -561,6 +590,7 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ???  What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -578,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return gen_rtx_SUBREG (vmode, new_reg, 0);
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -601,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
+general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
 						  rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
@@ -632,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
    and replace its uses in a chain.  */
 
 void
-dimode_scalar_chain::make_vector_copies (unsigned regno)
+general_scalar_chain::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
-  rtx vreg = gen_reg_rtx (DImode);
+  rtx vreg = gen_reg_rtx (smode);
   df_ref ref;
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
@@ -644,37 +674,59 @@ dimode_scalar_chain::make_vector_copies
 	start_sequence ();
 	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
 	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
-	    emit_move_insn (vreg, tmp);
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
+	    emit_insn (gen_rtx_SET
+		        (gen_rtx_SUBREG (vmode, vreg, 0),
+			 gen_rtx_VEC_MERGE (vmode,
+					    gen_rtx_VEC_DUPLICATE (vmode,
+								   tmp),
+					    CONST0_RTX (vmode),
+					    GEN_INT (HOST_WIDE_INT_1U))));
 	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
 	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (SImode, reg, 4),
+					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
+			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
 	  }
 	else
-	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
-	  }
+	  emit_insn (gen_rtx_SET
+		       (gen_rtx_SUBREG (vmode, vreg, 0),
+			gen_rtx_VEC_MERGE (vmode,
+					   gen_rtx_VEC_DUPLICATE (vmode,
+								  reg),
+					   CONST0_RTX (vmode),
+					   GEN_INT (HOST_WIDE_INT_1U))));
 	rtx_insn *seq = get_insns ();
 	end_sequence ();
 	rtx_insn *insn = DF_REF_INSN (ref);
@@ -703,7 +755,7 @@ dimode_scalar_chain::make_vector_copies
    in case register is used in not convertible insn.  */
 
 void
-dimode_scalar_chain::convert_reg (unsigned regno)
+general_scalar_chain::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -715,7 +767,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -735,40 +787,55 @@ dimode_scalar_chain::convert_reg (unsign
 	  start_sequence ();
 	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
 	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
 	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
 	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
 	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 0),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 4),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
+				  gen_rtx_LSHIFTRT (V2DImode,
+						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
 	    }
 	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_conversion_insns (seq, insn);
@@ -817,21 +884,21 @@ dimode_scalar_chain::convert_reg (unsign
    registers conversion.  */
 
 void
-dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
 	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -849,24 +916,30 @@ dimode_scalar_chain::convert_op (rtx *op
 	    gcc_assert (!DF_REF_CHAIN (ref));
 	    break;
 	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      *op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
 	{
 	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_insn_before (seq, insn);
@@ -878,14 +951,14 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
 /* Convert INSN to vector mode.  */
 
 void
-dimode_scalar_chain::convert_insn (rtx_insn *insn)
+general_scalar_chain::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -896,9 +969,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
 	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (smode);
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -907,7 +980,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -915,25 +988,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -947,17 +1024,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
 	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -985,7 +1062,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1124,7 +1203,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }
 
 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1194,7 +1273,7 @@ has_non_address_hard_reg (rtx_insn *insn
 		     (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1227,12 +1306,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
 	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1240,7 +1319,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1249,7 +1328,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1263,12 +1342,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1288,6 +1367,15 @@ dimode_scalar_to_vector_candidate_p (rtx
 	return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512VL)
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1298,7 +1386,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
       break;
@@ -1327,7 +1415,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  || !REG_P (XEXP (XEXP (src, 0), 0))))
       return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1391,22 +1479,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instructions and
+   removes an insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1561,23 +1643,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1585,11 +1650,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1605,51 +1673,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
 	  {
 	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
 		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
 	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
+	   Check all uses and definitions to estimate all required
+	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274422)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
  private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274422)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17719,6 +17719,110 @@ (define_expand "add<mode>cc"
    (match_operand:SWI 3 "const_int_operand")]
   ""
   "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
+
+;; min/max patterns
+
+(define_mode_iterator MAXMIN_IMODE
+  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
+(define_code_attr maxmin_rel
+  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
+
+(define_expand "<code><mode>3"
+  [(parallel
+    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	  (maxmin:MAXMIN_IMODE
+	    (match_operand:MAXMIN_IMODE 1 "register_operand")
+	    (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+     (clobber (reg:CC FLAGS_REG))])]
+  "TARGET_STV")
+
+(define_insn_and_split "*<code><mode>3_1"
+  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	(maxmin:MAXMIN_IMODE
+	  (match_operand:MAXMIN_IMODE 1 "register_operand")
+	  (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:MAXMIN_IMODE (match_dup 3)
+	  (match_dup 1)
+	  (match_dup 2)))]
+{
+  machine_mode mode = <MODE>mode;
+
+  if (!register_operand (operands[2], mode))
+    operands[2] = force_reg (mode, operands[2]);
+
+  enum rtx_code code = <maxmin_rel>;
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
+
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  emit_insn (gen_rtx_SET (flags, tmp));
+
+  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+})
+
+(define_insn_and_split "*<code>di3_doubleword"
+  [(set (match_operand:DI 0 "register_operand")
+	(maxmin:DI (match_operand:DI 1 "register_operand")
+		   (match_operand:DI 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 1)
+	  (match_dup 2)))
+   (set (match_dup 3)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 4)
+	  (match_dup 5)))]
+{
+  if (!register_operand (operands[2], DImode))
+    operands[2] = force_reg (DImode, operands[2]);
+
+  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+
+  rtx cmplo[2] = { operands[1], operands[2] };
+  rtx cmphi[2] = { operands[4], operands[5] };
+
+  enum rtx_code code = <maxmin_rel>;
+
+  switch (code)
+    {
+    case LE: case LEU:
+      std::swap (cmplo[0], cmplo[1]);
+      std::swap (cmphi[0], cmphi[1]);
+      code = swap_condition (code);
+      /* FALLTHRU */
+
+    case GE: case GEU:
+      {
+	bool uns = (code == GEU);
+	rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
+	  = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
+
+	emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
+
+	rtx tmp = gen_rtx_SCRATCH (SImode);
+	emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
+
+	rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
+	operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+
+	break;
+      }
+
+    default:
+      gcc_unreachable ();
+    }
+})
 
 ;; Misc patterns (?)
 
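
(An aside, not part of the patch: the doubleword splitter above uses
the classic cmp-low / sbb-high idiom to derive a full 64-bit
comparison from 32-bit flags.  A minimal C sketch of the unsigned GEU
case -- the function name and layout are made up, purely illustrative:

  /* Illustrative only: cmpl on the low halves produces the borrow,
     sbbl folds it into the high-half subtraction, and the carry flag
     ends up clear exactly when a >= b as a 64-bit unsigned value.
     The splitter keys both SImode conditional moves on that flag.  */
  int ge_u64_doubleword (unsigned lo_a, unsigned hi_a,
                         unsigned lo_b, unsigned hi_b)
  {
    unsigned borrow = lo_a < lo_b;                 /* carry out of cmpl */
    unsigned carry = hi_a < hi_b
                     || (hi_a == hi_b && borrow);  /* carry out of sbbl */
    return !carry;                                 /* a >= b unsigned */
  }

The signed GE case is structured the same way but tests the CCGZmode
flags produced by gen_sub3_carry_ccgz instead of the plain carry.)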
Index: gcc/testsuite/gcc.target/i386/minmax-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-1.c	(revision 274422)
+++ gcc/testsuite/gcc.target/i386/minmax-1.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=opteron" } */
+/* { dg-options "-O2 -march=opteron -mno-stv" } */
 /* { dg-final { scan-assembler "test" } } */
 /* { dg-final { scan-assembler-not "cmp" } } */
 #define max(a,b) (((a) > (b))? (a) : (b))
Index: gcc/testsuite/gcc.target/i386/minmax-2.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-2.c	(revision 274422)
+++ gcc/testsuite/gcc.target/i386/minmax-2.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-stv" } */
 /* { dg-final { scan-assembler "test" } } */
 /* { dg-final { scan-assembler-not "cmp" } } */
 #define max(a,b) (((a) > (b))? (a) : (b))
Index: gcc/testsuite/gcc.target/i386/minmax-3.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int ssi[1024];
+unsigned int usi[1024];
+long long sdi[1024];
+unsigned long long udi[1024];
+
+#define CHECK(FN, VARIANT) \
+void \
+FN ## VARIANT (void) \
+{ \
+  for (int i = 1; i < 1024; ++i) \
+    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
+}
+
+CHECK(max, ssi);
+CHECK(min, ssi);
+CHECK(max, usi);
+CHECK(min, usi);
+CHECK(max, sdi);
+CHECK(min, sdi);
+CHECK(max, udi);
+CHECK(min, udi);
Index: gcc/testsuite/gcc.target/i386/minmax-4.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -msse4.1" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "pminsd" 1 } } */
+/* { dg-final { scan-assembler-times "pminud" 1 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-5.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-5.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-5.c	(working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -mavx512vl" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "vpmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "vpmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "vpminsd" 1 } } */
+/* { dg-final { scan-assembler-times "vpminud" 1 } } */
+/* { dg-final { scan-assembler-times "vpmaxsq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpmaxuq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpminsq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpminuq" 1 { target lp64 } } } */
Index: gcc/testsuite/gcc.target/i386/minmax-6.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-6.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-6.c	(working copy)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=haswell" } */
+
+unsigned short
+UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
+{
+  if (y != width)
+    {
+      y = y < 0 ? 0 : y;
+      return Pic[y * width];
+    }
+  return Pic[y];
+} 
+
+/* We do not want the RA to spill %esi for its dual use but using
+   pmaxsd is OK.  */
+/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
+/* { dg-final { scan-assembler "pmaxsd" } } */
Index: gcc/testsuite/gcc.target/i386/pr91154.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr91154.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr91154.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1 -mstv" } */
+
+void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
+{
+  int sc;
+  int k;
+  for (k = 1; k <= M; k++)
+    {
+      dc[k] = dc[k-1] + tpdd[k-1];
+      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
+      if (dc[k] < -987654321) dc[k] = -987654321;
+    }
+}
+
+/* We want to convert the loop to SSE since SSE pmaxsd is faster than
+   compare + conditional move.  */
+/* { dg-final { scan-assembler-not "cmov" } } */
+/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */
+/* { dg-final { scan-assembler-times "paddd" 2 } } */
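
To make the testcase's expectation concrete, here is a rough
intrinsics rendering of what the converted loop body amounts to
(purely illustrative, names made up -- STV rewrites the SImode pseudos
in place and emits no intrinsics; only element 0 of each vector is
meaningful):

  #include <smmintrin.h>  /* SSE4.1: _mm_max_epi32 */

  void foo_sketch (int *dc, int *mc, int *tpdd, int *tpmd, int M)
  {
    const __m128i lower = _mm_cvtsi32_si128 (-987654321);
    for (int k = 1; k <= M; k++)
      {
        __m128i d = _mm_add_epi32 (_mm_cvtsi32_si128 (dc[k-1]),
                                   _mm_cvtsi32_si128 (tpdd[k-1])); /* paddd */
        __m128i s = _mm_add_epi32 (_mm_cvtsi32_si128 (mc[k-1]),
                                   _mm_cvtsi32_si128 (tpmd[k-1])); /* paddd */
        d = _mm_max_epi32 (d, s);       /* pmaxsd replaces cmp + cmov */
        d = _mm_max_epi32 (d, lower);   /* pmaxsd replaces cmp + cmov */
        dc[k] = _mm_cvtsi128_si32 (d);  /* movd back to memory */
      }
  }

This is exactly the two-paddd / two-pmaxsd shape the
scan-assembler-times directives above look for.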
Uros Bizjak Aug. 14, 2019, 9:24 a.m. UTC | #59
On Wed, Aug 14, 2019 at 11:08 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Tue, 13 Aug 2019, Jeff Law wrote:
>
> > On 8/9/19 7:00 AM, Richard Biener wrote:
> > >
> > > It fixes the slowdown observed in 416.gamess and 464.h264ref.
> > >
> > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> > >
> > > CCing Jeff who "knows RTL".
> > What specifically do you want me to look at?  I'm not really familiar
> > with the STV stuff, but can certainly take a peek.
>
> Below is the updated patch with the already approved and committed
> parts taken out.  It is now mostly mechanical apart from the
> make_vector_copies and convert_reg changes, which move existing
> "patterns" under appropriate conditionals and add handling of the
> case where the scalar mode fits in a single GPR (previously it
> was -m32 DImode only; now it handles -m32/-m64 SImode and DImode).
>
> I'm redoing bootstrap / regtest on x86_64-unknown-linux-gnu now just
> to be safe.
>
> OK?
>
> I do expect we need to work on the compile-time issue I placed ???
> comments on and more generally try to avoid using DF so much.
>
> Thanks,
> Richard.
>
> 2019-08-13  Richard Biener  <rguenther@suse.de>
>
>         PR target/91154
>         * config/i386/i386-features.h (scalar_chain::scalar_chain): Add
>         mode arguments.
>         (scalar_chain::smode): New member.
>         (scalar_chain::vmode): Likewise.
>         (dimode_scalar_chain): Rename to...
>         (general_scalar_chain): ... this.
>         (general_scalar_chain::general_scalar_chain): Take mode arguments.
>         (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
>         base with TImode and V1TImode.
>         * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
>         (general_scalar_chain::vector_const_cost): Adjust for SImode
>         chains.
>         (general_scalar_chain::compute_convert_gain): Likewise.  Add
>         {S,U}{MIN,MAX} support.
>         (general_scalar_chain::replace_with_subreg): Use vmode/smode.
>         (general_scalar_chain::make_vector_copies): Likewise.  Handle
>         non-DImode chains appropriately.
>         (general_scalar_chain::convert_reg): Likewise.
>         (general_scalar_chain::convert_op): Likewise.
>         (general_scalar_chain::convert_insn): Likewise.  Add
>         fatal_insn_not_found if the result is not recognized.
>         (convertible_comparison_p): Pass in the scalar mode and use that.
>         (general_scalar_to_vector_candidate_p): Likewise.  Rename from
>         dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
>         (scalar_to_vector_candidate_p): Remove by inlining into single
>         caller.
>         (general_remove_non_convertible_regs): Rename from
>         dimode_remove_non_convertible_regs.
>         (remove_non_convertible_regs): Remove by inlining into single caller.
>         (convert_scalars_to_vector): Handle SImode and DImode chains
>         in addition to TImode chains.
>         * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
>         (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
>         (*<maxmin>di3_doubleword): Likewise.
>
>         * gcc.target/i386/pr91154.c: New testcase.
>         * gcc.target/i386/minmax-3.c: Likewise.
>         * gcc.target/i386/minmax-4.c: Likewise.
>         * gcc.target/i386/minmax-5.c: Likewise.
>         * gcc.target/i386/minmax-6.c: Likewise.
>         * gcc.target/i386/minmax-1.c: Add -mno-stv.
>         * gcc.target/i386/minmax-2.c: Likewise.

OK.

Thanks,
Uros.

> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c     (revision 274422)
> +++ gcc/config/i386/i386-features.c     (working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>
>  /* Initialize new chain.  */
>
> -scalar_chain::scalar_chain ()
> +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
>  {
> +  smode = smode_;
> +  vmode = vmode_;
> +
>    chain_id = ++max_id;
>
>     if (dump_file)
> @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
>     conversion.  */
>
>  void
> -dimode_scalar_chain::mark_dual_mode_def (df_ref def)
> +general_scalar_chain::mark_dual_mode_def (df_ref def)
>  {
>    gcc_assert (DF_REF_REG_DEF_P (def));
>
> @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
>        && !HARD_REGISTER_P (SET_DEST (def_set)))
>      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>
> +  /* ???  The following is quadratic since analyze_register_chain
> +     iterates over all refs to look for dual-mode regs.  Instead this
> +     should be done separately for all regs mentioned in the chain once.  */
>    df_ref ref;
>    df_ref def;
>    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
>     instead of using a scalar one.  */
>
>  int
> -dimode_scalar_chain::vector_const_cost (rtx exp)
> +general_scalar_chain::vector_const_cost (rtx exp)
>  {
>    gcc_assert (CONST_INT_P (exp));
>
> -  if (standard_sse_constant_p (exp, V2DImode))
> -    return COSTS_N_INSNS (1);
> -  return ix86_cost->sse_load[1];
> +  if (standard_sse_constant_p (exp, vmode))
> +    return ix86_cost->sse_op;
> +  /* We have separate costs for SImode and DImode, use SImode costs
> +     for smaller modes.  */
> +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
>  }
>
>  /* Compute a gain for chain conversion.  */
>
>  int
> -dimode_scalar_chain::compute_convert_gain ()
> +general_scalar_chain::compute_convert_gain ()
>  {
>    bitmap_iterator bi;
>    unsigned insn_uid;
> @@ -491,6 +499,13 @@ dimode_scalar_chain::compute_convert_gai
>    if (dump_file)
>      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
>
> +  /* SSE costs distinguish between SImode and DImode loads/stores; for
> +     int costs, factor in the number of GPRs involved.  When supporting
> +     smaller modes than SImode the int load/store costs need to be
> +     adjusted as well.  */
> +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
>        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
> @@ -500,18 +515,19 @@ dimode_scalar_chain::compute_convert_gai
>        int igain = 0;
>
>        if (REG_P (src) && REG_P (dst))
> -       igain += 2 - ix86_cost->xmm_move;
> +       igain += 2 * m - ix86_cost->xmm_move;
>        else if (REG_P (src) && MEM_P (dst))
> -       igain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +       igain
> +         += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
>        else if (MEM_P (src) && REG_P (dst))
> -       igain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> +       igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
>        else if (GET_CODE (src) == ASHIFT
>                || GET_CODE (src) == ASHIFTRT
>                || GET_CODE (src) == LSHIFTRT)
>         {
>           if (CONST_INT_P (XEXP (src, 0)))
>             igain -= vector_const_cost (XEXP (src, 0));
> -         igain += 2 * ix86_cost->shift_const - ix86_cost->sse_op;
> +         igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
>           if (INTVAL (XEXP (src, 1)) >= 32)
>             igain -= COSTS_N_INSNS (1);
>         }
> @@ -521,11 +537,11 @@ dimode_scalar_chain::compute_convert_gai
>                || GET_CODE (src) == XOR
>                || GET_CODE (src) == AND)
>         {
> -         igain += 2 * ix86_cost->add - ix86_cost->sse_op;
> +         igain += m * ix86_cost->add - ix86_cost->sse_op;
>           /* Additional gain for andnot for targets without BMI.  */
>           if (GET_CODE (XEXP (src, 0)) == NOT
>               && !TARGET_BMI)
> -           igain += 2 * ix86_cost->add;
> +           igain += m * ix86_cost->add;
>
>           if (CONST_INT_P (XEXP (src, 0)))
>             igain -= vector_const_cost (XEXP (src, 0));
> @@ -534,7 +550,18 @@ dimode_scalar_chain::compute_convert_gai
>         }
>        else if (GET_CODE (src) == NEG
>                || GET_CODE (src) == NOT)
> -       igain += 2 * ix86_cost->add - ix86_cost->sse_op - COSTS_N_INSNS (1);
> +       igain += m * ix86_cost->add - ix86_cost->sse_op - COSTS_N_INSNS (1);
> +      else if (GET_CODE (src) == SMAX
> +              || GET_CODE (src) == SMIN
> +              || GET_CODE (src) == UMAX
> +              || GET_CODE (src) == UMIN)
> +       {
> +         /* We do not have any conditional move cost, estimate it as a
> +            reg-reg move.  Comparisons are costed as adds.  */
> +         igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> +         /* Integer SSE ops are all costed the same.  */
> +         igain -= ix86_cost->sse_op;
> +       }
>        else if (GET_CODE (src) == COMPARE)
>         {
>           /* Assume comparison cost is the same.  */
> @@ -542,9 +569,11 @@ dimode_scalar_chain::compute_convert_gai
>        else if (CONST_INT_P (src))
>         {
>           if (REG_P (dst))
> -           igain += 2 * COSTS_N_INSNS (1);
> +           /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> +           igain += m * COSTS_N_INSNS (1);
>           else if (MEM_P (dst))
> -           igain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +           igain += (m * ix86_cost->int_store[2]
> +                    - ix86_cost->sse_store[sse_cost_idx]);
>           igain -= vector_const_cost (src);
>         }
>        else
> @@ -561,6 +590,7 @@ dimode_scalar_chain::compute_convert_gai
>    if (dump_file)
>      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
>
> +  /* ???  What about integer to SSE?  */
>    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
>      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
>
> @@ -578,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai
>  /* Replace REG in X with a V2DI subreg of NEW_REG.  */
>
>  rtx
> -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
>  {
>    if (x == reg)
> -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> +    return gen_rtx_SUBREG (vmode, new_reg, 0);
>
>    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
>    int i, j;
> @@ -601,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg
>  /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
>
>  void
> -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
> +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
>                                                   rtx reg, rtx new_reg)
>  {
>    replace_with_subreg (single_set (insn), reg, new_reg);
> @@ -632,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx
>     and replace its uses in a chain.  */
>
>  void
> -dimode_scalar_chain::make_vector_copies (unsigned regno)
> +general_scalar_chain::make_vector_copies (unsigned regno)
>  {
>    rtx reg = regno_reg_rtx[regno];
> -  rtx vreg = gen_reg_rtx (DImode);
> +  rtx vreg = gen_reg_rtx (smode);
>    df_ref ref;
>
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> @@ -644,37 +674,59 @@ dimode_scalar_chain::make_vector_copies
>         start_sequence ();
>         if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
>           {
> -           rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> -           emit_move_insn (adjust_address (tmp, SImode, 0),
> -                           gen_rtx_SUBREG (SImode, reg, 0));
> -           emit_move_insn (adjust_address (tmp, SImode, 4),
> -                           gen_rtx_SUBREG (SImode, reg, 4));
> -           emit_move_insn (vreg, tmp);
> +           rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> +           if (smode == DImode && !TARGET_64BIT)
> +             {
> +               emit_move_insn (adjust_address (tmp, SImode, 0),
> +                               gen_rtx_SUBREG (SImode, reg, 0));
> +               emit_move_insn (adjust_address (tmp, SImode, 4),
> +                               gen_rtx_SUBREG (SImode, reg, 4));
> +             }
> +           else
> +             emit_move_insn (tmp, reg);
> +           emit_insn (gen_rtx_SET
> +                       (gen_rtx_SUBREG (vmode, vreg, 0),
> +                        gen_rtx_VEC_MERGE (vmode,
> +                                           gen_rtx_VEC_DUPLICATE (vmode,
> +                                                                  tmp),
> +                                           CONST0_RTX (vmode),
> +                                           GEN_INT (HOST_WIDE_INT_1U))));
>           }
> -       else if (TARGET_SSE4_1)
> +       else if (!TARGET_64BIT && smode == DImode)
>           {
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 0)));
> -           emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                         gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                         gen_rtx_SUBREG (SImode, reg, 4),
> -                                         GEN_INT (2)));
> +           if (TARGET_SSE4_1)
> +             {
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 0)));
> +               emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                             gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                             gen_rtx_SUBREG (SImode, reg, 4),
> +                                             GEN_INT (2)));
> +             }
> +           else
> +             {
> +               rtx tmp = gen_reg_rtx (DImode);
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 0)));
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 4)));
> +               emit_insn (gen_vec_interleave_lowv4si
> +                          (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                           gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                           gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +             }
>           }
>         else
> -         {
> -           rtx tmp = gen_reg_rtx (DImode);
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 0)));
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 4)));
> -           emit_insn (gen_vec_interleave_lowv4si
> -                      (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                       gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                       gen_rtx_SUBREG (V4SImode, tmp, 0)));
> -         }
> +         emit_insn (gen_rtx_SET
> +                      (gen_rtx_SUBREG (vmode, vreg, 0),
> +                       gen_rtx_VEC_MERGE (vmode,
> +                                          gen_rtx_VEC_DUPLICATE (vmode,
> +                                                                 reg),
> +                                          CONST0_RTX (vmode),
> +                                          GEN_INT (HOST_WIDE_INT_1U))));
>         rtx_insn *seq = get_insns ();
>         end_sequence ();
>         rtx_insn *insn = DF_REF_INSN (ref);
> @@ -703,7 +755,7 @@ dimode_scalar_chain::make_vector_copies
>     in case register is used in not convertible insn.  */
>
>  void
> -dimode_scalar_chain::convert_reg (unsigned regno)
> +general_scalar_chain::convert_reg (unsigned regno)
>  {
>    bool scalar_copy = bitmap_bit_p (defs_conv, regno);
>    rtx reg = regno_reg_rtx[regno];
> @@ -715,7 +767,7 @@ dimode_scalar_chain::convert_reg (unsign
>    bitmap_copy (conv, insns);
>
>    if (scalar_copy)
> -    scopy = gen_reg_rtx (DImode);
> +    scopy = gen_reg_rtx (smode);
>
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
>      {
> @@ -735,40 +787,55 @@ dimode_scalar_chain::convert_reg (unsign
>           start_sequence ();
>           if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
>             {
> -             rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> +             rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
>               emit_move_insn (tmp, reg);
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                             adjust_address (tmp, SImode, 0));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                             adjust_address (tmp, SImode, 4));
> +             if (!TARGET_64BIT && smode == DImode)
> +               {
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                                 adjust_address (tmp, SImode, 0));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                                 adjust_address (tmp, SImode, 4));
> +               }
> +             else
> +               emit_move_insn (scopy, tmp);
>             }
> -         else if (TARGET_SSE4_1)
> +         else if (!TARGET_64BIT && smode == DImode)
>             {
> -             rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> -             emit_insn
> -               (gen_rtx_SET
> -                (gen_rtx_SUBREG (SImode, scopy, 0),
> -                 gen_rtx_VEC_SELECT (SImode,
> -                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> -
> -             tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> -             emit_insn
> -               (gen_rtx_SET
> -                (gen_rtx_SUBREG (SImode, scopy, 4),
> -                 gen_rtx_VEC_SELECT (SImode,
> -                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> +             if (TARGET_SSE4_1)
> +               {
> +                 rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> +                                             gen_rtvec (1, const0_rtx));
> +                 emit_insn
> +                   (gen_rtx_SET
> +                      (gen_rtx_SUBREG (SImode, scopy, 0),
> +                       gen_rtx_VEC_SELECT (SImode,
> +                                           gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                           tmp)));
> +
> +                 tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> +                 emit_insn
> +                   (gen_rtx_SET
> +                      (gen_rtx_SUBREG (SImode, scopy, 4),
> +                       gen_rtx_VEC_SELECT (SImode,
> +                                           gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                           tmp)));
> +               }
> +             else
> +               {
> +                 rtx vcopy = gen_reg_rtx (V2DImode);
> +                 emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                                 gen_rtx_SUBREG (SImode, vcopy, 0));
> +                 emit_move_insn (vcopy,
> +                                 gen_rtx_LSHIFTRT (V2DImode,
> +                                                   vcopy, GEN_INT (32)));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                                 gen_rtx_SUBREG (SImode, vcopy, 0));
> +               }
>             }
>           else
> -           {
> -             rtx vcopy = gen_reg_rtx (V2DImode);
> -             emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                             gen_rtx_SUBREG (SImode, vcopy, 0));
> -             emit_move_insn (vcopy,
> -                             gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                             gen_rtx_SUBREG (SImode, vcopy, 0));
> -           }
> +           emit_move_insn (scopy, reg);
> +
>           rtx_insn *seq = get_insns ();
>           end_sequence ();
>           emit_conversion_insns (seq, insn);
> @@ -817,21 +884,21 @@ dimode_scalar_chain::convert_reg (unsign
>     registers conversion.  */
>
>  void
> -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
> +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
>  {
>    *op = copy_rtx_if_shared (*op);
>
>    if (GET_CODE (*op) == NOT)
>      {
>        convert_op (&XEXP (*op, 0), insn);
> -      PUT_MODE (*op, V2DImode);
> +      PUT_MODE (*op, vmode);
>      }
>    else if (MEM_P (*op))
>      {
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
>
>        emit_insn_before (gen_move_insn (tmp, *op), insn);
> -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
>
>        if (dump_file)
>         fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> @@ -849,24 +916,30 @@ dimode_scalar_chain::convert_op (rtx *op
>             gcc_assert (!DF_REF_CHAIN (ref));
>             break;
>           }
> -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> +      *op = gen_rtx_SUBREG (vmode, *op, 0);
>      }
>    else if (CONST_INT_P (*op))
>      {
>        rtx vec_cst;
> -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
>
>        /* Prefer all ones vector in case of -1.  */
>        if (constm1_operand (*op, GET_MODE (*op)))
> -       vec_cst = CONSTM1_RTX (V2DImode);
> +       vec_cst = CONSTM1_RTX (vmode);
>        else
> -       vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> -                                       gen_rtvec (2, *op, const0_rtx));
> +       {
> +         unsigned n = GET_MODE_NUNITS (vmode);
> +         rtx *v = XALLOCAVEC (rtx, n);
> +         v[0] = *op;
> +         for (unsigned i = 1; i < n; ++i)
> +           v[i] = const0_rtx;
> +         vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> +       }
>
> -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> +      if (!standard_sse_constant_p (vec_cst, vmode))
>         {
>           start_sequence ();
> -         vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> +         vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
>           rtx_insn *seq = get_insns ();
>           end_sequence ();
>           emit_insn_before (seq, insn);
> @@ -878,14 +951,14 @@ dimode_scalar_chain::convert_op (rtx *op
>    else
>      {
>        gcc_assert (SUBREG_P (*op));
> -      gcc_assert (GET_MODE (*op) == V2DImode);
> +      gcc_assert (GET_MODE (*op) == vmode);
>      }
>  }
>
>  /* Convert INSN to vector mode.  */
>
>  void
> -dimode_scalar_chain::convert_insn (rtx_insn *insn)
> +general_scalar_chain::convert_insn (rtx_insn *insn)
>  {
>    rtx def_set = single_set (insn);
>    rtx src = SET_SRC (def_set);
> @@ -896,9 +969,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>      {
>        /* There are no scalar integer instructions and therefore
>          temporary register usage is required.  */
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (smode);
>        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
>      }
>
>    switch (GET_CODE (src))
> @@ -907,7 +980,7 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case ASHIFTRT:
>      case LSHIFTRT:
>        convert_op (&XEXP (src, 0), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>
>      case PLUS:
> @@ -915,25 +988,29 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case IOR:
>      case XOR:
>      case AND:
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
>        convert_op (&XEXP (src, 0), insn);
>        convert_op (&XEXP (src, 1), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>
>      case NEG:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> +      src = gen_rtx_MINUS (vmode, subreg, src);
>        break;
>
>      case NOT:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> -      src = gen_rtx_XOR (V2DImode, src, subreg);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> +      src = gen_rtx_XOR (vmode, src, subreg);
>        break;
>
>      case MEM:
> @@ -947,17 +1024,17 @@ dimode_scalar_chain::convert_insn (rtx_i
>        break;
>
>      case SUBREG:
> -      gcc_assert (GET_MODE (src) == V2DImode);
> +      gcc_assert (GET_MODE (src) == vmode);
>        break;
>
>      case COMPARE:
>        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
>
> -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> -                 || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> +                 || (SUBREG_P (src) && GET_MODE (src) == vmode));
>
>        if (REG_P (src))
> -       subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> +       subreg = gen_rtx_SUBREG (vmode, src, 0);
>        else
>         subreg = copy_rtx_if_shared (src);
>        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> @@ -985,7 +1062,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>    PATTERN (insn) = def_set;
>
>    INSN_CODE (insn) = -1;
> -  recog_memoized (insn);
> +  int patt = recog_memoized (insn);
> +  if  (patt == -1)
> +    fatal_insn_not_found (insn);
>    df_insn_rescan (insn);
>  }
>
> @@ -1124,7 +1203,7 @@ timode_scalar_chain::convert_insn (rtx_i
>  }
>
>  void
> -dimode_scalar_chain::convert_registers ()
> +general_scalar_chain::convert_registers ()
>  {
>    bitmap_iterator bi;
>    unsigned id;
> @@ -1194,7 +1273,7 @@ has_non_address_hard_reg (rtx_insn *insn
>                      (const_int 0 [0])))  */
>
>  static bool
> -convertible_comparison_p (rtx_insn *insn)
> +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    if (!TARGET_SSE4_1)
>      return false;
> @@ -1227,12 +1306,12 @@ convertible_comparison_p (rtx_insn *insn
>
>    if (!SUBREG_P (op1)
>        || !SUBREG_P (op2)
> -      || GET_MODE (op1) != SImode
> -      || GET_MODE (op2) != SImode
> +      || GET_MODE (op1) != mode
> +      || GET_MODE (op2) != mode
>        || ((SUBREG_BYTE (op1) != 0
> -          || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> +          || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
>           && (SUBREG_BYTE (op2) != 0
> -             || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> +             || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
>      return false;
>
>    op1 = SUBREG_REG (op1);
> @@ -1240,7 +1319,7 @@ convertible_comparison_p (rtx_insn *insn
>
>    if (op1 != op2
>        || !REG_P (op1)
> -      || GET_MODE (op1) != DImode)
> +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
>      return false;
>
>    return true;
> @@ -1249,7 +1328,7 @@ convertible_comparison_p (rtx_insn *insn
>  /* The DImode version of scalar_to_vector_candidate_p.  */
>
>  static bool
> -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    rtx def_set = single_set (insn);
>
> @@ -1263,12 +1342,12 @@ dimode_scalar_to_vector_candidate_p (rtx
>    rtx dst = SET_DEST (def_set);
>
>    if (GET_CODE (src) == COMPARE)
> -    return convertible_comparison_p (insn);
> +    return convertible_comparison_p (insn, mode);
>
>    /* We are interested in DImode promotion only.  */
> -  if ((GET_MODE (src) != DImode
> +  if ((GET_MODE (src) != mode
>         && !CONST_INT_P (src))
> -      || GET_MODE (dst) != DImode)
> +      || GET_MODE (dst) != mode)
>      return false;
>
>    if (!REG_P (dst) && !MEM_P (dst))
> @@ -1288,6 +1367,15 @@ dimode_scalar_to_vector_candidate_p (rtx
>         return false;
>        break;
>
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
> +      if ((mode == DImode && !TARGET_AVX512VL)
> +         || (mode == SImode && !TARGET_SSE4_1))
> +       return false;
> +      /* Fallthru.  */
> +
>      case PLUS:
>      case MINUS:
>      case IOR:
> @@ -1298,7 +1386,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>           && !CONST_INT_P (XEXP (src, 1)))
>         return false;
>
> -      if (GET_MODE (XEXP (src, 1)) != DImode
> +      if (GET_MODE (XEXP (src, 1)) != mode
>           && !CONST_INT_P (XEXP (src, 1)))
>         return false;
>        break;
> @@ -1327,7 +1415,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>           || !REG_P (XEXP (XEXP (src, 0), 0))))
>        return false;
>
> -  if (GET_MODE (XEXP (src, 0)) != DImode
> +  if (GET_MODE (XEXP (src, 0)) != mode
>        && !CONST_INT_P (XEXP (src, 0)))
>      return false;
>
> @@ -1391,22 +1479,16 @@ timode_scalar_to_vector_candidate_p (rtx
>    return false;
>  }
>
> -/* Return 1 if INSN may be converted into vector
> -   instruction.  */
> -
> -static bool
> -scalar_to_vector_candidate_p (rtx_insn *insn)
> -{
> -  if (TARGET_64BIT)
> -    return timode_scalar_to_vector_candidate_p (insn);
> -  else
> -    return dimode_scalar_to_vector_candidate_p (insn);
> -}
> +/* For a given bitmap of insn UIDs scans all instructions and
> +   removes an insn from CANDIDATES in case it has both convertible
> +   and not convertible definitions.
>
> -/* The DImode version of remove_non_convertible_regs.  */
> +   All insns in a bitmap are conversion candidates according to
> +   scalar_to_vector_candidate_p.  Currently it implies all insns
> +   are single_set.  */
>
>  static void
> -dimode_remove_non_convertible_regs (bitmap candidates)
> +general_remove_non_convertible_regs (bitmap candidates)
>  {
>    bitmap_iterator bi;
>    unsigned id;
> @@ -1561,23 +1643,6 @@ timode_remove_non_convertible_regs (bitm
>    BITMAP_FREE (regs);
>  }
>
> -/* For a given bitmap of insn UIDs scans all instruction and
> -   remove insn from CANDIDATES in case it has both convertible
> -   and not convertible definitions.
> -
> -   All insns in a bitmap are conversion candidates according to
> -   scalar_to_vector_candidate_p.  Currently it implies all insns
> -   are single_set.  */
> -
> -static void
> -remove_non_convertible_regs (bitmap candidates)
> -{
> -  if (TARGET_64BIT)
> -    timode_remove_non_convertible_regs (candidates);
> -  else
> -    dimode_remove_non_convertible_regs (candidates);
> -}
> -
>  /* Main STV pass function.  Find and convert scalar
>     instructions into vector mode when profitable.  */
>
> @@ -1585,11 +1650,14 @@ static unsigned int
>  convert_scalars_to_vector ()
>  {
>    basic_block bb;
> -  bitmap candidates;
>    int converted_insns = 0;
>
>    bitmap_obstack_initialize (NULL);
> -  candidates = BITMAP_ALLOC (NULL);
> +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> +  for (unsigned i = 0; i < 3; ++i)
> +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
>
>    calculate_dominance_info (CDI_DOMINATORS);
>    df_set_flags (DF_DEFER_INSN_RESCAN);
> @@ -1605,51 +1673,73 @@ convert_scalars_to_vector ()
>      {
>        rtx_insn *insn;
>        FOR_BB_INSNS (bb, insn)
> -       if (scalar_to_vector_candidate_p (insn))
> +       if (TARGET_64BIT
> +           && timode_scalar_to_vector_candidate_p (insn))
>           {
>             if (dump_file)
> -             fprintf (dump_file, "  insn %d is marked as a candidate\n",
> +             fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
>                        INSN_UID (insn));
>
> -           bitmap_set_bit (candidates, INSN_UID (insn));
> +           bitmap_set_bit (&candidates[2], INSN_UID (insn));
> +         }
> +       else
> +         {
> +           /* Check {SI,DI}mode.  */
> +           for (unsigned i = 0; i <= 1; ++i)
> +             if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> +               {
> +                 if (dump_file)
> +                   fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> +                            INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> +
> +                 bitmap_set_bit (&candidates[i], INSN_UID (insn));
> +                 break;
> +               }
>           }
>      }
>
> -  remove_non_convertible_regs (candidates);
> +  if (TARGET_64BIT)
> +    timode_remove_non_convertible_regs (&candidates[2]);
> +  for (unsigned i = 0; i <= 1; ++i)
> +    general_remove_non_convertible_regs (&candidates[i]);
>
> -  if (bitmap_empty_p (candidates))
> -    if (dump_file)
> +  for (unsigned i = 0; i <= 2; ++i)
> +    if (!bitmap_empty_p (&candidates[i]))
> +      break;
> +    else if (i == 2 && dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
>
> -  while (!bitmap_empty_p (candidates))
> -    {
> -      unsigned uid = bitmap_first_set_bit (candidates);
> -      scalar_chain *chain;
> +  for (unsigned i = 0; i <= 2; ++i)
> +    while (!bitmap_empty_p (&candidates[i]))
> +      {
> +       unsigned uid = bitmap_first_set_bit (&candidates[i]);
> +       scalar_chain *chain;
>
> -      if (TARGET_64BIT)
> -       chain = new timode_scalar_chain;
> -      else
> -       chain = new dimode_scalar_chain;
> +       if (cand_mode[i] == TImode)
> +         chain = new timode_scalar_chain;
> +       else
> +         chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
>
> -      /* Find instructions chain we want to convert to vector mode.
> -        Check all uses and definitions to estimate all required
> -        conversions.  */
> -      chain->build (candidates, uid);
> +       /* Find instructions chain we want to convert to vector mode.
> +          Check all uses and definitions to estimate all required
> +          conversions.  */
> +       chain->build (&candidates[i], uid);
>
> -      if (chain->compute_convert_gain () > 0)
> -       converted_insns += chain->convert ();
> -      else
> -       if (dump_file)
> -         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> -                  chain->chain_id);
> +       if (chain->compute_convert_gain () > 0)
> +         converted_insns += chain->convert ();
> +       else
> +         if (dump_file)
> +           fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> +                    chain->chain_id);
>
> -      delete chain;
> -    }
> +       delete chain;
> +      }
>
>    if (dump_file)
>      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
>
> -  BITMAP_FREE (candidates);
> +  for (unsigned i = 0; i <= 2; ++i)
> +    bitmap_release (&candidates[i]);
>    bitmap_obstack_release (NULL);
>    df_process_deferred_rescans ();
>
> Index: gcc/config/i386/i386-features.h
> ===================================================================
> --- gcc/config/i386/i386-features.h     (revision 274422)
> +++ gcc/config/i386/i386-features.h     (working copy)
> @@ -127,11 +127,16 @@ namespace {
>  class scalar_chain
>  {
>   public:
> -  scalar_chain ();
> +  scalar_chain (enum machine_mode, enum machine_mode);
>    virtual ~scalar_chain ();
>
>    static unsigned max_id;
>
> +  /* Scalar mode.  */
> +  enum machine_mode smode;
> +  /* Vector mode.  */
> +  enum machine_mode vmode;
> +
>    /* ID of a chain.  */
>    unsigned int chain_id;
>    /* A queue of instructions to be included into a chain.  */
> @@ -159,9 +164,11 @@ class scalar_chain
>    virtual void convert_registers () = 0;
>  };
>
> -class dimode_scalar_chain : public scalar_chain
> +class general_scalar_chain : public scalar_chain
>  {
>   public:
> +  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> +    : scalar_chain (smode_, vmode_) {}
>    int compute_convert_gain ();
>   private:
>    void mark_dual_mode_def (df_ref def);
> @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
>  class timode_scalar_chain : public scalar_chain
>  {
>   public:
> +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> +
>    /* Convert from TImode to V1TImode is always faster.  */
>    int compute_convert_gain () { return 1; }
>
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md     (revision 274422)
> +++ gcc/config/i386/i386.md     (working copy)
> @@ -17719,6 +17719,110 @@ (define_expand "add<mode>cc"
>     (match_operand:SWI 3 "const_int_operand")]
>    ""
>    "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
> +
> +;; min/max patterns
> +
> +(define_mode_iterator MAXMIN_IMODE
> +  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
> +(define_code_attr maxmin_rel
> +  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
> +
> +(define_expand "<code><mode>3"
> +  [(parallel
> +    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
> +         (maxmin:MAXMIN_IMODE
> +           (match_operand:MAXMIN_IMODE 1 "register_operand")
> +           (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
> +     (clobber (reg:CC FLAGS_REG))])]
> +  "TARGET_STV")
> +
> +(define_insn_and_split "*<code><mode>3_1"
> +  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
> +       (maxmin:MAXMIN_IMODE
> +         (match_operand:MAXMIN_IMODE 1 "register_operand")
> +         (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +       (if_then_else:MAXMIN_IMODE (match_dup 3)
> +         (match_dup 1)
> +         (match_dup 2)))]
> +{
> +  machine_mode mode = <MODE>mode;
> +
> +  if (!register_operand (operands[2], mode))
> +    operands[2] = force_reg (mode, operands[2]);
> +
> +  enum rtx_code code = <maxmin_rel>;
> +  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
> +  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
> +
> +  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
> +  emit_insn (gen_rtx_SET (flags, tmp));
> +
> +  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
> +})
> +
> +(define_insn_and_split "*<code>di3_doubleword"
> +  [(set (match_operand:DI 0 "register_operand")
> +       (maxmin:DI (match_operand:DI 1 "register_operand")
> +                  (match_operand:DI 2 "nonimmediate_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +       (if_then_else:SI (match_dup 6)
> +         (match_dup 1)
> +         (match_dup 2)))
> +   (set (match_dup 3)
> +       (if_then_else:SI (match_dup 6)
> +         (match_dup 4)
> +         (match_dup 5)))]
> +{
> +  if (!register_operand (operands[2], DImode))
> +    operands[2] = force_reg (DImode, operands[2]);
> +
> +  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
> +
> +  rtx cmplo[2] = { operands[1], operands[2] };
> +  rtx cmphi[2] = { operands[4], operands[5] };
> +
> +  enum rtx_code code = <maxmin_rel>;
> +
> +  switch (code)
> +    {
> +    case LE: case LEU:
> +      std::swap (cmplo[0], cmplo[1]);
> +      std::swap (cmphi[0], cmphi[1]);
> +      code = swap_condition (code);
> +      /* FALLTHRU */
> +
> +    case GE: case GEU:
> +      {
> +       bool uns = (code == GEU);
> +       rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
> +         = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
> +
> +       emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
> +
> +       rtx tmp = gen_rtx_SCRATCH (SImode);
> +       emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
> +
> +       rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
> +       operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
> +
> +       break;
> +      }
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +})
>
>  ;; Misc patterns (?)
>
> Index: gcc/testsuite/gcc.target/i386/minmax-1.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-1.c    (revision 274422)
> +++ gcc/testsuite/gcc.target/i386/minmax-1.c    (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -march=opteron" } */
> +/* { dg-options "-O2 -march=opteron -mno-stv" } */
>  /* { dg-final { scan-assembler "test" } } */
>  /* { dg-final { scan-assembler-not "cmp" } } */
>  #define max(a,b) (((a) > (b))? (a) : (b))
> Index: gcc/testsuite/gcc.target/i386/minmax-2.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-2.c    (revision 274422)
> +++ gcc/testsuite/gcc.target/i386/minmax-2.c    (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2" } */
> +/* { dg-options "-O2 -mno-stv" } */
>  /* { dg-final { scan-assembler "test" } } */
>  /* { dg-final { scan-assembler-not "cmp" } } */
>  #define max(a,b) (((a) > (b))? (a) : (b))
> Index: gcc/testsuite/gcc.target/i386/minmax-3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-3.c    (nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-3.c    (working copy)
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv" } */
> +
> +#define max(a,b) (((a) > (b))? (a) : (b))
> +#define min(a,b) (((a) < (b))? (a) : (b))
> +
> +int ssi[1024];
> +unsigned int usi[1024];
> +long long sdi[1024];
> +unsigned long long udi[1024];
> +
> +#define CHECK(FN, VARIANT) \
> +void \
> +FN ## VARIANT (void) \
> +{ \
> +  for (int i = 1; i < 1024; ++i) \
> +    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
> +}
> +
> +CHECK(max, ssi);
> +CHECK(min, ssi);
> +CHECK(max, usi);
> +CHECK(min, usi);
> +CHECK(max, sdi);
> +CHECK(min, sdi);
> +CHECK(max, udi);
> +CHECK(min, udi);
> Index: gcc/testsuite/gcc.target/i386/minmax-4.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-4.c    (nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-4.c    (working copy)
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv -msse4.1" } */
> +
> +#include "minmax-3.c"
> +
> +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
> +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
> +/* { dg-final { scan-assembler-times "pminsd" 1 } } */
> +/* { dg-final { scan-assembler-times "pminud" 1 } } */
> Index: gcc/testsuite/gcc.target/i386/minmax-5.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-5.c    (nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-5.c    (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv -mavx512vl" } */
> +
> +#include "minmax-3.c"
> +
> +/* { dg-final { scan-assembler-times "vpmaxsd" 1 } } */
> +/* { dg-final { scan-assembler-times "vpmaxud" 1 } } */
> +/* { dg-final { scan-assembler-times "vpminsd" 1 } } */
> +/* { dg-final { scan-assembler-times "vpminud" 1 } } */
> +/* { dg-final { scan-assembler-times "vpmaxsq" 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times "vpmaxuq" 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times "vpminsq" 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times "vpminuq" 1 { target lp64 } } } */
> Index: gcc/testsuite/gcc.target/i386/minmax-6.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-6.c    (nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-6.c    (working copy)
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=haswell" } */
> +
> +unsigned short
> +UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> +{
> +  if (y != width)
> +    {
> +      y = y < 0 ? 0 : y;
> +      return Pic[y * width];
> +    }
> +  return Pic[y];
> +}
> +
> +/* We do not want the RA to spill %esi for its dual use, but using
> +   pmaxsd is OK.  */
> +/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> +/* { dg-final { scan-assembler "pmaxsd" } } */
> Index: gcc/testsuite/gcc.target/i386/pr91154.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/pr91154.c     (nonexistent)
> +++ gcc/testsuite/gcc.target/i386/pr91154.c     (working copy)
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse4.1 -mstv" } */
> +
> +void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
> +{
> +  int sc;
> +  int k;
> +  for (k = 1; k <= M; k++)
> +    {
> +      dc[k] = dc[k-1] + tpdd[k-1];
> +      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
> +      if (dc[k] < -987654321) dc[k] = -987654321;
> +    }
> +}
> +
> +/* We want to convert the loop to SSE since SSE pmaxsd is faster than
> +   compare + conditional move.  */
> +/* { dg-final { scan-assembler-not "cmov" } } */
> +/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */
> +/* { dg-final { scan-assembler-times "paddd" 2 } } */
Uros Bizjak Aug. 15, 2019, 8:59 a.m. UTC | #60
On Tue, Aug 13, 2019 at 9:54 PM H.J. Lu <hjl.tools@gmail.com> wrote:

> > > with the latest patch (this is with -m32) where -mstv causes
> > > all spills to go away and the cmoves replaced (so clearly
> > > better code after the patch) for pr65105-5.c, no obvious
> > > improvements for pr65105-3.c where cmov does appear with -mstv.
> > > I'd rather not "fix" those by adding -mno-stv but instead have
> > > the Intel people fix costing for slm and/or decide what to do.
> > > For pr65105-3.c I'm not sure why if-conversion didn't choose
> > > to use cmov, so clearly the enabled minmax patterns expose the
> > > "failure" here.
> > I'm not sure how much effort Intel is putting into Silvermont tuning
> > these days.  So I'd suggest giving HJ a heads-up and a reasonable period
> > of time to take a looksie, but I wouldn't hold the patch for long due to
> > a Silvermont tuning issue.
>
> Leave pr65105-3.c to fail for now.  We can take a look later.

I have a patch for this. The problem is with the conversion of a
COMPARE, which gets assigned to an SImode chain, while in fact we
expect a very specific form of DImode compare.

Uros.
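
As an illustration (a hypothetical sketch, not the actual pr65105-3.c
testcase), the shape in question is a doubleword compare on ia32: the
comparison of two long long values expands to a DImode COMPARE, which
STV must keep in that specific DImode form rather than assign to an
SImode chain.

/* Hypothetical reduction; assumed flags -m32 -msse4.1 -mstv.  The
   comparison below expands to a DImode COMPARE on ia32.  */
long long a, b;

int
foo (void)
{
  return a > b;
}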
Patch

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 273732)
+++ gcc/config/i386/i386.md	(working copy)
@@ -1881,6 +1881,33 @@  (define_expand "mov<mode>"
   ""
   "ix86_expand_move (<MODE>mode, operands); DONE;")
 
+(define_insn "smaxsi3"
+ [(set (match_operand:SI 0 "register_operand" "=r,v,x")
+       (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0")
+                (match_operand:SI 2 "register_operand" "r,v,x")))
+  (clobber (reg:CC FLAGS_REG))]
+  "TARGET_SSE4_1"
+{
+  switch (get_attr_type (insn))
+    {
+    case TYPE_SSEADD:
+      if (which_alternative == 1)
+        return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
+      else
+        return "pmaxsd\t{%2, %0|%0, %2}";
+    case TYPE_ICMOV:
+      /* ???  Instead split this after reload?  */
+      return "cmpl\t{%2, %0|%0, %2}\n"
+           "\tcmovl\t{%2, %0|%0, %2}";
+    default:
+      gcc_unreachable ();
+    }
+}
+  [(set_attr "isa" "noavx,avx,noavx")
+   (set_attr "prefix" "orig,vex,orig")
+   (set_attr "memory" "none")
+   (set_attr "type" "icmov,sseadd,sseadd")])
+
 (define_insn "*mov<mode>_xor"
   [(set (match_operand:SWI48 0 "register_operand" "=r")
 	(match_operand:SWI48 1 "const0_operand"))
@@ -5368,10 +5395,10 @@  (define_insn_and_split "*add<dwi>3_doubl
 })
 
 (define_insn "*add<mode>_1"
-  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
+  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r")
 	(plus:SWI48
-	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
-	  (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
+	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r")
+	  (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le")))
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
 {
@@ -5390,10 +5417,23 @@  (define_insn "*add<mode>_1"
           return "dec{<imodesuffix>}\t%0";
 	}
 
+    case TYPE_SSEADD:
+      if (which_alternative == 1)
+        {
+          if (<MODE>mode == SImode)
+	    return "%vpaddd\t{%2, %1, %0|%0, %1, %2}";
+	  else
+	    return "%vpaddq\t{%2, %1, %0|%0, %1, %2}";
+	}
+      else if (<MODE>mode == SImode)
+	return "paddd\t{%2, %0|%0, %2}";
+      else
+	return "paddq\t{%2, %0|%0, %2}";
+
     default:
       /* For most processors, ADD is faster than LEA.  This alternative
 	 was added to use ADD as much as possible.  */
-      if (which_alternative == 2)
+      if (which_alternative == 4)
         std::swap (operands[1], operands[2]);
         
       gcc_assert (rtx_equal_p (operands[0], operands[1]));
@@ -5403,9 +5443,14 @@  (define_insn "*add<mode>_1"
       return "add{<imodesuffix>}\t{%2, %0|%0, %2}";
     }
 }
-  [(set (attr "type")
-     (cond [(eq_attr "alternative" "3")
+  [(set_attr "isa" "*,avx,noavx,*,*,*")
+   (set (attr "type")
+     (cond [(eq_attr "alternative" "5")
               (const_string "lea")
+	    (eq_attr "alternative" "1")
+	      (const_string "sseadd")
+	    (eq_attr "alternative" "2")
+	      (const_string "sseadd")
 	    (match_operand:SWI48 2 "incdec_operand")
 	      (const_string "incdec")
 	   ]
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 273732)
+++ gcc/config/i386/i386.c	(working copy)
@@ -14616,6 +14616,9 @@  ix86_lea_for_add_ok (rtx_insn *insn, rtx
   unsigned int regno1 = true_regnum (operands[1]);
   unsigned int regno2 = true_regnum (operands[2]);
 
+  if (SSE_REGNO_P (regno1))
+    return false;
+
   /* If a = b + c, (a!=b && a!=c), must use lea form. */
   if (regno0 != regno1 && regno0 != regno2)
     return true;
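
As a usage sketch (not part of the patch), code of the kind the new
patterns are meant to cover; the flags and the exact register
allocation are assumptions:

/* A minimal sketch, assuming -O2 -msse4.1: the MAX_EXPR below can be
   emitted as pmaxsd via the new smaxsi3 pattern, and the SImode add
   as paddd via the new SSE alternatives of *add<mode>_1, when the
   register allocator places the values in SSE registers.  */
int
foo (int a, int b, int c)
{
  int t = a + c;          /* may use an SSE alternative of *add<mode>_1 */
  return t > b ? t : b;   /* may expand via smaxsi3 to pmaxsd */
}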