Patchwork [testsuite,i386] BMI2 support for GCC

Submitter Uros Bizjak
Date Aug. 20, 2011, 9:16 p.m.
Message ID <CAFULd4aJGuetGiM_w=L467SGVOaTct526fzXxA_n7m+hgDU93Q@mail.gmail.com>
Permalink /patch/110788/
State New

Comments

Uros Bizjak - Aug. 20, 2011, 9:16 p.m.
On Sat, Aug 20, 2011 at 2:09 PM, Uros Bizjak <ubizjak@gmail.com> wrote:

> Don't expand RORX through ix86_expand_binary_operator, generate it
> directly from expander. You are complicating things with splitters too
> much!
>
> I will rewrite this part of i386.md.

So, attached RFC patch handles BMI2 mul, shift and ror stuff.

Some remarks:
- M and N register modifiers are added to print low and high register
of a double word register pair. This is needed for mulx insn.
- ishiftx and rotatex instruction type attributes are added.
- "w" mode attribute is added to add register prefix for word mode.
This is needed to output QImode count register of shift insns.

- mulx is expanded directly from expander, IMO it is always a win to
generate this insn if available.

- Yb register constraint is added to conditionally enable generation
of BMI alternatives in generic shift and rotate patterns. The BMI
variant is generated only if RA chooses it as the most profitable
alternative.
- shift and rotate instructions are split post-reload from generic
patterns to strip flags clobber.
- zero-extended 64bit variants are also handled for shift and rotate insns.
- rotate right AND rotate left instructions are handled through rorx.
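
As an editorial illustration (not part of the patch), the last two remarks can be modelled in plain C. The sketch below assumes 32-bit operands; the function names are hypothetical. It shows why a rotate-right-only instruction such as rorx, which takes only an immediate count, is enough to cover rotate-left by a constant, and what "zero-extended" means for a 32-bit shift result held in a 64-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right, the operation rorx performs.  The count is reduced
   modulo the operand width; a zero count returns x unchanged.  */
uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Rotate left by n, expressed as a rotate right by (32 - n).  This is
   how a rotate-left by a constant can be emitted with rorx alone.  */
uint32_t rotl32_via_rotr(uint32_t x, unsigned n)
{
    assert(n < 32);  /* precondition assumed by this sketch only */
    return rotr32(x, (32 - n) & 31);
}

/* Model of a zero-extended 32-bit shift: the count is masked to five
   bits, the 32-bit result lands in the low half of the 64-bit value
   and the upper half is zero.  */
uint64_t shl32_zext(uint32_t x, unsigned count)
{
    return (uint64_t)(uint32_t)(x << (count & 31));
}
```

None of these operations needs to report anything through the flags register, which is what allows the post-reload splitters in the patch to drop the flags clobber.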

2011-08-20  Uros Bizjak  <ubizjak@gmail.com>

	* config/i386/i386.md (type): Add ishiftx and rotatex.
	(length_immediate): Handle ishiftx and rotatex.
	(imm_disp): Ditto.
	(w): New mode attribute.

	(mul<mode><dwi>3): Split from <u>mul<mode><dwi>3.
	(umul<mode><dwi>3): Ditto.  Generate bmi2_umul<mode><dwi>3_1 pattern
	for TARGET_BMI2.
	(bmi2_umul<mode><dwi>3_1): New insn pattern.

	(*bmi2_ashl<mode>3_1): New insn pattern.
	(*ashl<mode>3_1): Add ishiftx BMI2 alternative.
	(*ashl<mode>3_1 splitter): New splitter to avoid flags dependency.
	(*bmi2_ashlsi3_1_zext): New insn pattern.
	(*ashlsi3_1_zext): Add ishiftx BMI2 alternative.
	(*ashlsi3_1_zext splitter): New splitter to avoid flags dependency.

	(*bmi2_<shiftrt_insn><mode>3_1): New insn pattern.
	(*<shiftrt_insn><mode>3_1): Add ishiftx BMI2 alternative.
	(*<shiftrt_insn><mode>3_1 splitter): New splitter to avoid
	flags dependency.
	(*bmi2_<shiftrt_insn>si3_1_zext): New insn pattern.
	(*<shiftrt_insn>si3_1_zext): Add ishiftx BMI2 alternative.
	(*<shiftrt_insn>si3_1_zext splitter): New splitter to avoid
	flags dependency.

	(*bmi2_rorx<mode>3_1): New insn pattern.
	(*<rotate_insn><mode>3_1): Add rotatex BMI2 alternative.
	(*rotate<mode>3_1 splitter): New splitter to avoid flags dependency.
	(*rotatert<mode>3_1 splitter): Ditto.
	(*bmi2_rorxsi3_1_zext): New insn pattern.
	(*<rotate_insn>si3_1_zext): Add rotatex BMI2 alternative.
	(*rotatesi3_1_zext  splitter): New splitter to avoid flags dependency.
	(*rotatertsi3_1_zext splitter): Ditto.

	* config/i386/constraints.md (Yb): New register constraint.
	* config/i386/i386.c (print_reg): Handle 'M' and 'N' modifiers.
	(print_operand): Ditto.

The patch is currently in RFC/RFT state, since I have no way to
properly test it. The patch bootstraps OK and regression test is clean
on x86_64-pc-linux-gnu {,-m32}. I tested the patch lightly on provided
testcases, so expected patterns are generated. Oh, and all insn
constraints should be changed from TARGET_BMI to TARGET_BMI2.

Uros.
H.J. Lu - Aug. 20, 2011, 9:31 p.m.
On Sat, Aug 20, 2011 at 2:16 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sat, Aug 20, 2011 at 2:09 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>
>> Don't expand RORX through ix86_expand_binary_operator, generate it
>> directly from expander. You are complicating things with splitters too
>> much!
>>
>> I will rewrite this part of i386.md.
>
> So, attached RFC patch handles BMI2 mul, shift and ror stuff.
>
> Some remarks:
> - M and N register modifiers are added to print low and high register
> of a double word register pair. This is needed for mulx insn.
> - ishiftx and rotatex instruction type attributes are added.
> - "w" mode attribute is added to add register prefix for word mode.
> This is needed to output QImode count register of shift insns.
>
> - mulx is expanded directly from expander, IMO it is always a win to
> generate this insn if available.
>
> - Yb register constraint is added to conditionally enable generation
> of BMI alternatives in generic shift and rotate patterns. The BMI
> variant is generated only if RA chooses it as the most profitable
> alternative.
> - shift and rotate instructions are split post-reload from generic
> patterns to strip flags clobber.
> - zero-extended 64bit variants are also handled for shift and rotate insns.
> - rotate right AND rotate left instructions are handled through rorx.
>
> 2011-08-20  Uros Bizjak  <ubizjak@gmail.com>
>
>        * config/i386/i386.md (type): Add ishiftx and rotatex.
>        (length_immediate): Handle ishiftx and rotatex.
>        (imm_disp): Ditto.
>        (w): New mode attribute.
>
>        (mul<mode><dwi>3): Split from <u>mul<mode><dwi>3.
>        (umul<mode><dwi>3): Ditto.  Generate bmi2_umul<mode><dwi>3_1 pattern
>        for TARGET_BMI2.
>        (bmi2_umul<mode><dwi>3_1): New insn pattern.
>
>        (*bmi2_ashl<mode>3_1): New insn pattern.
>        (*ashl<mode>3_1): Add ishiftx BMI2 alternative.
>        (*ashl<mode>3_1 splitter): New splitter to avoid flags dependency.
>        (*bmi2_ashlsi3_1_zext): New insn pattern.
>        (*ashlsi3_1_zext): Add ishiftx BMI2 alternative.
>        (*ashlsi3_1_zext splitter): New splitter to avoid flags dependency.
>
>        (*bmi2_<shiftrt_insn><mode>3_1): New insn pattern.
>        (*<shiftrt_insn><mode>3_1): Add ishiftx BMI2 alternative.
>        (*<shiftrt_insn><mode>3_1 splitter): New splitter to avoid
>        flags dependency.
>        (*bmi2_<shiftrt_insn>si3_1_zext): New insn pattern.
>        (*<shiftrt_insn>si3_1_zext): Add ishiftx BMI2 alternative.
>        (*<shiftrt_insn>si3_1_zext splitter): New splitter to avoid
>        flags dependency.
>
>        (*bmi2_rorx<mode>3_1): New insn pattern.
>        (*<rotate_insn><mode>3_1): Add rotatex BMI2 alternative.
>        (*rotate<mode>3_1 splitter): New splitter to avoid flags dependency.
>        (*rotatert<mode>3_1 splitter): Ditto.
>        (*bmi2_rorxsi3_1_zext): New insn pattern.
>        (*<rotate_insn>si3_1_zext): Add rotatex BMI2 alternative.
>        (*rotatesi3_1_zext  splitter): New splitter to avoid flags dependency.
>        (*rotatertsi3_1_zext splitter): Ditto.
>
>        * config/i386/constraints.md (Yb): New register constraint.
>        * config/i386/i386.c (print_reg): Handle 'M' and 'N' modifiers.
>        (print_operand): Ditto.
>
> The patch is currently in RFC/RFT state, since I have no way to
> properly test it. The patch bootstraps OK and regression test is clean

We are using HSW emulator (SDE):

http://software.intel.com/en-us/articles/pre-release-license-agreement-for-intel-software-development-emulator-accept-end-user-license-agreement-and-download/

to test FMA, BMI/BMI2.  I have an SDE sim for dejagnu so that I can run
the GCC testsuite under SDE.

> on x86_64-pc-linux-gnu {,-m32}. I tested the patch lightly on provided
> testcases, so expected patterns are generated. Oh, and all insn
> constraints should be changed from TARGET_BMI to TARGET_BMI2.
>
> Uros.
>

We can also implement MULX with split:

(define_split
  [(parallel [(set (match_operand:<DWI> 0 "register_operand" "")
                   (mult:<DWI>
                     (zero_extend:<DWI>
                       (match_operand:DWIH 1 "nonimmediate_operand" ""))
                     (zero_extend:<DWI>
                       (match_operand:DWIH 2 "nonimmediate_operand" ""))))
              (clobber (reg:CC FLAGS_REG))])]
  "TARGET_BMI2
   && ix86_binary_operator_ok (MULT, <MODE>mode, operands)"
  [(set (match_operand:<DWI> 0 "register_operand" "")
        (mult:<DWI>
          (zero_extend:<DWI>
            (match_operand:DWIH 1 "register_operand" ""))
          (zero_extend:<DWI>
            (match_operand:DWIH 2 "nonimmediate_operand" ""))))])

(define_insn "*bmi2_umul<mode><dwi>3_1"
  [(set (match_operand:<DWI> 0 "register_operand" "=r")
        (mult:<DWI>
          (zero_extend:<DWI>
            (match_operand:DWIH 1 "register_operand" "d"))
          (zero_extend:<DWI>
            (match_operand:DWIH 2 "nonimmediate_operand" "rm"))))]
  "TARGET_BMI2"
{
  if (<MODE>mode == DImode)
    return "mulx\t{%2, %M0, %N0|%N0, %M0, %2}";
  else
    return "mulx\t{%2, %M0, %N0|%N0, %M0, %2}";
}
  [(set_attr "type" "imul")
   (set_attr "prefix" "vex")
   (set_attr "mode" "<MODE>")])
Uros Bizjak - Aug. 20, 2011, 9:44 p.m.
On Sat, Aug 20, 2011 at 11:31 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

> We can also implement MULX with split:
>
> (define_split
>  [(parallel [(set (match_operand:<DWI> 0 "register_operand" "")
>                   (mult:<DWI>
>                     (zero_extend:<DWI>
>                       (match_operand:DWIH 1 "nonimmediate_operand" ""))
>                     (zero_extend:<DWI>
>                       (match_operand:DWIH 2 "nonimmediate_operand" ""))))
>              (clobber (reg:CC FLAGS_REG))])]
>  "TARGET_BMI2
>   && ix86_binary_operator_ok (MULT, <MODE>mode, operands)"
>  [(set (match_operand:<DWI> 0 "register_operand" "")
>        (mult:<DWI>
>          (zero_extend:<DWI>
>            (match_operand:DWIH 1 "register_operand" ""))
>          (zero_extend:<DWI>
>            (match_operand:DWIH 2 "nonimmediate_operand" ""))))])

Well, this is an unconditional splitter, no better than the current approach
where the pattern is expanded directly.

If you want to squeeze out the last 0.005% of performance, you should
add a BMI alternative to the existing umul pattern, leave the choice of
alternative to RA and split the exact alternative (that is, you need
some true_regnum calls in the splitter condition) after reload to the mulx
pattern. Please see the new patterns for how this should be done.

I'm not against this approach, but after 10 hours of hacking, I just
wanted to leave it to an interested reader ;)

Uros.
Richard Henderson - Aug. 20, 2011, 9:47 p.m.
On 08/20/2011 02:16 PM, Uros Bizjak wrote:
> - Yb register constraint is added to conditionally enable generation
> of BMI alternatives in generic shift and rotate patterns. The BMI
> variant is generated only if RA chooses it as the most profitable
> alternative.

We really should use the (relatively new) enabled attribute instead
of adding more and more conditional register constraints.


r~
H.J. Lu - Aug. 20, 2011, 9:49 p.m.
On Sat, Aug 20, 2011 at 2:44 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sat, Aug 20, 2011 at 11:31 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>> We can also implement MULX with split:
>>
>> (define_split
>>  [(parallel [(set (match_operand:<DWI> 0 "register_operand" "")
>>                   (mult:<DWI>
>>                     (zero_extend:<DWI>
>>                       (match_operand:DWIH 1 "nonimmediate_operand" ""))
>>                     (zero_extend:<DWI>
>>                       (match_operand:DWIH 2 "nonimmediate_operand" ""))))
>>              (clobber (reg:CC FLAGS_REG))])]
>>  "TARGET_BMI2
>>   && ix86_binary_operator_ok (MULT, <MODE>mode, operands)"
>>  [(set (match_operand:<DWI> 0 "register_operand" "")
>>        (mult:<DWI>
>>          (zero_extend:<DWI>
>>            (match_operand:DWIH 1 "register_operand" ""))
>>          (zero_extend:<DWI>
>>            (match_operand:DWIH 2 "nonimmediate_operand" ""))))])
>
> Well, this is unconditional splitter, no better than current approach
> where the pattern is expanded directly.
>
> If you want to squeeze out the last 0.005% of performance, you should
> add BMI alternative to existing umul pattern, leave the choice of
> alternative to RA and split the exact alternative (that is, you need
> some true_regnum calls in splitter constraint) after reload to mulx
> pattern. Please, see new patterns for how this should be done.
>
> I'm not against this approach, but after 10 hours of hacking, I just
> wanted to leave it to an interested reader ;)

We won't use split then.

Thanks.
Richard Henderson - Aug. 20, 2011, 9:52 p.m.
On 08/20/2011 02:16 PM, Uros Bizjak wrote:
> +(define_insn "bmi2_umul<mode><dwi>3_1"
> +  [(set (match_operand:<DWI> 0 "register_operand" "=r")
> +	(mult:<DWI>
> +	  (zero_extend:<DWI>
> +	    (match_operand:DWIH 1 "nonimmediate_operand" "%d"))
> +	  (zero_extend:<DWI>
> +	    (match_operand:DWIH 2 "nonimmediate_operand" "rm"))))]
> +  "TARGET_BMI
> +   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
> +  "mulx\t{%2, %M0, %N0|%N0, %M0, %2}"
> +  [(set_attr "type" "imul")
> +   (set_attr "prefix" "vex")
> +   (set_attr "mode" "<MODE>")])

You can do better than this, and avoid the %M %N specifiers.
The outputs are truly independent and do not need to be a pair.

See the mn10300 umulsidi3{,_internal} patterns.


r~
H.J. Lu - Aug. 20, 2011, 10:02 p.m.
On Sat, Aug 20, 2011 at 2:52 PM, Richard Henderson <rth@redhat.com> wrote:
> On 08/20/2011 02:16 PM, Uros Bizjak wrote:
>> +(define_insn "bmi2_umul<mode><dwi>3_1"
>> +  [(set (match_operand:<DWI> 0 "register_operand" "=r")
>> +     (mult:<DWI>
>> +       (zero_extend:<DWI>
>> +         (match_operand:DWIH 1 "nonimmediate_operand" "%d"))
>> +       (zero_extend:<DWI>
>> +         (match_operand:DWIH 2 "nonimmediate_operand" "rm"))))]
>> +  "TARGET_BMI
>> +   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
>> +  "mulx\t{%2, %M0, %N0|%N0, %M0, %2}"
>> +  [(set_attr "type" "imul")
>> +   (set_attr "prefix" "vex")
>> +   (set_attr "mode" "<MODE>")])
>
> You can do better than this, and avoid the %M %N specifiers.
> The outputs are truly independent and do not need to be a pair.
>

Since RA use register pairs for TImode/DImode, should requiring
TI/DI registers in pairs generate better does?
H.J. Lu - Aug. 20, 2011, 10:03 p.m.
On Sat, Aug 20, 2011 at 3:02 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Sat, Aug 20, 2011 at 2:52 PM, Richard Henderson <rth@redhat.com> wrote:
>> On 08/20/2011 02:16 PM, Uros Bizjak wrote:
>>> +(define_insn "bmi2_umul<mode><dwi>3_1"
>>> +  [(set (match_operand:<DWI> 0 "register_operand" "=r")
>>> +     (mult:<DWI>
>>> +       (zero_extend:<DWI>
>>> +         (match_operand:DWIH 1 "nonimmediate_operand" "%d"))
>>> +       (zero_extend:<DWI>
>>> +         (match_operand:DWIH 2 "nonimmediate_operand" "rm"))))]
>>> +  "TARGET_BMI
>>> +   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
>>> +  "mulx\t{%2, %M0, %N0|%N0, %M0, %2}"
>>> +  [(set_attr "type" "imul")
>>> +   (set_attr "prefix" "vex")
>>> +   (set_attr "mode" "<MODE>")])
>>
>> You can do better than this, and avoid the %M %N specifiers.
>> The outputs are truly independent and do not need to be a pair.
>>
>
> Since RA use register pairs for TImode/DImode, should requiring
> TI/DI registers in pairs generate better does?
                                                          ^^^^^^ codes.

Without register pairs, we are generating very strange codes.
Richard Henderson - Aug. 20, 2011, 11:58 p.m.
On 08/20/2011 03:03 PM, H.J. Lu wrote:
> On Sat, Aug 20, 2011 at 3:02 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> You can do better than this, and avoid the %M %N specifiers.
>>> The outputs are truly independent and do not need to be a pair.
>>>
>>
>> Since RA use register pairs for TImode/DImode, should requiring
>> TI/DI registers in pairs generate better does?
>                                                           ^^^^^^ codes.
> 
> Without register pairs, we are generating very strange codes.
> 

We ought to be making better use of the lower-subregs pass.
Representing independent outputs when possible enables that.

Admittedly, the i386 port needs more attention to really make
this happen properly.  But we don't need to make things even
worse in the meantime.


r~
Uros Bizjak - Aug. 21, 2011, 12:47 a.m.
On Sun, Aug 21, 2011 at 1:58 AM, Richard Henderson <rth@redhat.com> wrote:
> On 08/20/2011 03:03 PM, H.J. Lu wrote:
>> On Sat, Aug 20, 2011 at 3:02 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>> You can do better than this, and avoid the %M %N specifiers.
>>>> The outputs are truly independent and do not need to be a pair.
>>>>
>>>
>>> Since RA use register pairs for TImode/DImode, should requiring
>>> TI/DI registers in pairs generate better does?
>>                                                           ^^^^^^ codes.
>>
>> Without register pairs, we are generating very strange codes.
>>
>
> We ought to be making better use of the lower-subregs pass.
> Representing independent outputs when possible enables that.
>
> Admittedly, the i386 port needs more attention to really make
> this happen properly.  But we don't need to make things even
> worse in the meantime.

I will investigate this.

BTW: The latest patch has a small error. The insn mnemonic in the following
pattern should be "mul" instead of "imul", so the correct version
reads:

+(define_insn "*umul<mode><dwi>3_1"
+  [(set (match_operand:<DWI> 0 "register_operand" "=A,r")
+	(mult:<DWI>
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 1 "nonimmediate_operand" "%0,d"))
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 2 "nonimmediate_operand" "rm,rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "!(MEM_P (operands[1]) && MEM_P (operands[2]))"
+  "@
+   mul{<imodesuffix>}\t%2
+   #"

Uros.
H.J. Lu - Aug. 21, 2011, 12:52 a.m.
On Sat, Aug 20, 2011 at 5:47 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sun, Aug 21, 2011 at 1:58 AM, Richard Henderson <rth@redhat.com> wrote:
>> On 08/20/2011 03:03 PM, H.J. Lu wrote:
>>> On Sat, Aug 20, 2011 at 3:02 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>> You can do better than this, and avoid the %M %N specifiers.
>>>>> The outputs are truly independent and do not need to be a pair.
>>>>>
>>>>
>>>> Since RA use register pairs for TImode/DImode, should requiring
>>>> TI/DI registers in pairs generate better does?
>>>                                                           ^^^^^^ codes.
>>>
>>> Without register pairs, we are generating very strange codes.
>>>
>>
>> We ought to be making better use of the lower-subregs pass.
>> Representing independent outputs when possible enables that.
>>
>> Admittedly, the i386 port needs more attention to really make
>> this happen properly.  But we don't need to make things even
>> worse in the meantime.
>
> I will investigate this.
>

One problem is 32bit movdi and 64bit movti.  They require
register pairs.  We may need to split them before RA.
Richard Henderson - Aug. 21, 2011, 1:36 a.m.
On 08/20/2011 05:52 PM, H.J. Lu wrote:
> One problem is 32bit movdi and 64bit movti.  They require
> register pairs.  We may need to split them before RA.

lower-subreg ought to be able to look through plain moves...


r~
Uros Bizjak - Aug. 21, 2011, 9:14 a.m.
On Sat, Aug 20, 2011 at 11:52 PM, Richard Henderson <rth@redhat.com> wrote:
> On 08/20/2011 02:16 PM, Uros Bizjak wrote:
>> +(define_insn "bmi2_umul<mode><dwi>3_1"
>> +  [(set (match_operand:<DWI> 0 "register_operand" "=r")
>> +     (mult:<DWI>
>> +       (zero_extend:<DWI>
>> +         (match_operand:DWIH 1 "nonimmediate_operand" "%d"))
>> +       (zero_extend:<DWI>
>> +         (match_operand:DWIH 2 "nonimmediate_operand" "rm"))))]
>> +  "TARGET_BMI
>> +   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
>> +  "mulx\t{%2, %M0, %N0|%N0, %M0, %2}"
>> +  [(set_attr "type" "imul")
>> +   (set_attr "prefix" "vex")
>> +   (set_attr "mode" "<MODE>")])
>
> You can do better than this, and avoid the %M %N specifiers.
> The outputs are truly independent and do not need to be a pair.
>
> See the mn10300 umulsidi3{,_internal} patterns.

I have tried your suggestion, using patterns like the following:

(define_insn "umulsidi3_1"
  [(set (match_operand:SI 0 "register_operand" "=a,r")
	(mult:SI
	  (match_operand:SI 2 "nonimmediate_operand" "%0,d")
	  (match_operand:SI 3 "nonimmediate_operand" "rm,rm")))
   (set (match_operand:SI 1 "register_operand" "=d,r")
	(truncate:SI
	  (lshiftrt:DI
	    (mult:DI (zero_extend:DI (match_dup 2))
		     (zero_extend:DI (match_dup 3)))
	    (const_int 32))))
   (clobber (reg:CC FLAGS_REG))]
  "!TARGET_64BIT
   && !(MEM_P (operands[2]) && MEM_P (operands[3]))"
  "@
   mull\t%3
   #"
  [(set_attr "isa" "base,bmi2")
   (set_attr "type" "imul,imulx")
   (set_attr "length_immediate" "0,*")
   (set (attr "athlon_decode")
	(cond [(eq_attr "alternative" "0")
		 (if_then_else (eq_attr "cpu" "athlon")
		   (const_string "vector")
		   (const_string "double"))]
	      (const_string "*")))
   (set_attr "amdfam10_decode" "double,*")
   (set_attr "bdver1_decode" "direct,*")
   (set_attr "prefix" "orig,vex")
   (set_attr "mode" "SI")])
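
For readers less used to RTL, the two SETs in the pattern above can be read as the following C sketch. This is an editorial illustration, not part of the patch, and the helper name is hypothetical. Operand 0 receives the low 32 bits of the product and operand 1 the high 32 bits, and neither result has to live in a register paired with the other.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the two SETs of the pattern:
     *lo <- (mult:SI a b)
     *hi <- (truncate:SI (lshiftrt:DI (mult:DI (zero_extend a)
                                                (zero_extend b)) 32))  */
void umul32_lo_hi(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    assert(lo != 0 && hi != 0);  /* both outputs are required */
    *lo = a * b;                 /* low half: product modulo 2^32 */
    *hi = (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);  /* high half */
}
```

Because the two outputs are written through separate pointers, a caller is free to place them anywhere, which is the property the split representation is meant to give the register allocator.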


The compiler works, for a couple of simple testcases it produces the
same code as with register pairs. However, there are a couple of
problems:

- various length calculations look into operand{0,1,2} to determine
instruction length. This is fixable with a little effort.

- patterns that include (const_int N) do not macroize and this leads
to pattern explosion. For this simple example, in addition to
splitting out the any_extend pattern, we also have to split the DWIH
patterns.

In the past, I have tried to use match_operand with const_int INTVAL
predicates, but gcc crashed elsewhere due to additional operand.
Please see [1].

IMO, it is currently too much pain for too little gain to implement split
pairs in existing patterns. I will, however, implement a split to the
mulx pattern after reload in the proposed pattern to avoid %M and %N.

[1] http://gcc.gnu.org/ml/gcc/2010-07/msg00143.html

Uros.
Uros Bizjak - Aug. 21, 2011, 11:02 a.m.
On Sat, Aug 20, 2011 at 11:31 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>> The patch is currently in RFC/RFT state, since I have no way to
>> properly test it. The patch bootstraps OK and regression test is clean
>
> We are using HSW emulator (SDE):
>
> http://software.intel.com/en-us/articles/pre-release-license-agreement-for-intel-software-development-emulator-accept-end-user-license-agreement-and-download/
>
> to test FMA, BMI/BMI2.  I have a SDE sim for dejagnu so that I can run
> GCC testsuite under SDE.

It is not the simulator that is problematic. My binutils-of-the-day
doesn't support BMI2 insns.

Just an idea - is it possible to set up a development environment with a
simulator on one of the gcc compile-farm machines? This way, everything
will work out of the box and, with detailed instructions, I won't
scratch my head on how to set up the simulator every time a new ISA is
introduced ;)

Uros

Patch

Index: i386.md
===================================================================
--- i386.md	(revision 177925)
+++ i386.md	(working copy)
@@ -50,6 +50,8 @@ 
 ;; t --  likewise, print the V8SFmode name of the register.
 ;; h -- print the QImode name for a "high" register, either ah, bh, ch or dh.
 ;; y -- print "st(0)" instead of "st" as a register.
+;; M -- print the low register of a double word register pair.
+;; N -- print the high register of a double word register pair.
 ;; d -- print duplicated register operand for AVX instruction.
 ;; D -- print condition for SSE cmp instruction.
 ;; P -- if PIC, print an @PLT suffix.
@@ -377,7 +379,7 @@ 
 (define_attr "type"
   "other,multi,
    alu,alu1,negnot,imov,imovx,lea,
-   incdec,ishift,ishift1,rotate,rotate1,imul,idiv,
+   incdec,ishift,ishiftx,ishift1,rotate,rotatex,rotate1,imul,idiv,
    icmp,test,ibr,setcc,icmov,
    push,pop,call,callv,leave,
    str,bitmanip,
@@ -414,8 +416,8 @@ 
 	   (const_int 0)
 	 (eq_attr "unit" "i387,sse,mmx")
 	   (const_int 0)
-	 (eq_attr "type" "alu,alu1,negnot,imovx,ishift,rotate,ishift1,rotate1,
-			  imul,icmp,push,pop")
+	 (eq_attr "type" "alu,alu1,negnot,imovx,ishift,ishiftx,ishift1,
+			  rotate,rotatex,rotate1,imul,icmp,push,pop")
 	   (symbol_ref "ix86_attr_length_immediate_default (insn, true)")
 	 (eq_attr "type" "imov,test")
 	   (symbol_ref "ix86_attr_length_immediate_default (insn, false)")
@@ -675,7 +677,7 @@ 
 	      (and (match_operand 0 "memory_displacement_operand" "")
 		   (match_operand 1 "immediate_operand" "")))
 	   (const_string "true")
-	 (and (eq_attr "type" "alu,ishift,rotate,imul,idiv")
+	 (and (eq_attr "type" "alu,ishift,ishiftx,rotate,rotatex,imul,idiv")
 	      (and (match_operand 0 "memory_displacement_operand" "")
 		   (match_operand 2 "immediate_operand" "")))
 	   (const_string "true")
@@ -947,6 +949,9 @@ 
 ;; Instruction suffix for REX 64bit operators.
 (define_mode_attr rex64suffix [(SI "") (DI "{q}")])
 
+;; Register prefix for word mode.
+(define_mode_attr w [(SI "k") (DI "q")])
+
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
 (define_mode_iterator P [(SI "Pmode == SImode") (DI "Pmode == DImode")])
@@ -6830,15 +6835,34 @@ 
    (set_attr "bdver1_decode" "direct")
    (set_attr "mode" "QI")])
 
-(define_expand "<u>mul<mode><dwi>3"
+(define_expand "mul<mode><dwi>3"
   [(parallel [(set (match_operand:<DWI> 0 "register_operand" "")
 		   (mult:<DWI>
-		     (any_extend:<DWI>
+		     (sign_extend:<DWI>
 		       (match_operand:DWIH 1 "nonimmediate_operand" ""))
-		     (any_extend:<DWI>
+		     (sign_extend:<DWI>
 		       (match_operand:DWIH 2 "register_operand" ""))))
 	      (clobber (reg:CC FLAGS_REG))])])
 
+(define_expand "umul<mode><dwi>3"
+  [(parallel [(set (match_operand:<DWI> 0 "register_operand" "")
+		   (mult:<DWI>
+		     (zero_extend:<DWI>
+		       (match_operand:DWIH 1 "nonimmediate_operand" ""))
+		     (zero_extend:<DWI>
+		       (match_operand:DWIH 2 "register_operand" ""))))
+	      (clobber (reg:CC FLAGS_REG))])]
+  ""
+{
+  if (TARGET_BMI)
+    {
+      emit_insn (gen_bmi2_umul<mode><dwi>3_1 (operands[0],
+					      operands[1],
+					      operands[2]));
+      DONE;
+    }
+})
+
 (define_expand "<u>mulqihi3"
   [(parallel [(set (match_operand:HI 0 "register_operand" "")
 		   (mult:HI
@@ -6849,6 +6873,20 @@ 
 	      (clobber (reg:CC FLAGS_REG))])]
   "TARGET_QIMODE_MATH")
 
+(define_insn "bmi2_umul<mode><dwi>3_1"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(mult:<DWI>
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 1 "nonimmediate_operand" "%d"))
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 2 "nonimmediate_operand" "rm"))))]
+  "TARGET_BMI
+   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
+  "mulx\t{%2, %M0, %N0|%N0, %M0, %2}"
+  [(set_attr "type" "imul")
+   (set_attr "prefix" "vex")
+   (set_attr "mode" "<MODE>")])
+
 (define_insn "*<u>mul<mode><dwi>3_1"
   [(set (match_operand:<DWI> 0 "register_operand" "=A")
 	(mult:<DWI>
@@ -9056,16 +9094,26 @@ 
   [(set_attr "type" "ishift")
    (set_attr "mode" "<MODE>")])
 
+(define_insn "*bmi2_ashl<mode>3_1"
+  [(set (match_operand:SWI48 0 "register_operand" "=r")
+	(ashift:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "rm")
+		      (match_operand:QI 2 "register_operand" "r")))]
+  "TARGET_BMI"
+  "salx\t{%<w>2, %1, %0|%0, %1, %<w>2}"
+  [(set_attr "type" "ishiftx")
+   (set_attr "mode" "<MODE>")])
+
 (define_insn "*ashl<mode>3_1"
-  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r")
-	(ashift:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "0,l")
-		      (match_operand:QI 2 "nonmemory_operand" "c<S>,M")))
+  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,Yb")
+	(ashift:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "0,l,mYb")
+		      (match_operand:QI 2 "nonmemory_operand" "c<S>,M,Yb")))
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (ASHIFT, <MODE>mode, operands)"
 {
   switch (get_attr_type (insn))
     {
     case TYPE_LEA:
+    case TYPE_ISHIFTX:
       return "#";
 
     case TYPE_ALU:
@@ -9084,6 +9132,8 @@ 
   [(set (attr "type")
      (cond [(eq_attr "alternative" "1")
 	      (const_string "lea")
+	    (eq_attr "alternative" "2")
+	      (const_string "ishiftx")
             (and (and (ne (symbol_ref "TARGET_DOUBLE_WITH_ADD")
 		          (const_int 0))
 		      (match_operand 0 "register_operand" ""))
@@ -9102,17 +9152,39 @@ 
        (const_string "*")))
    (set_attr "mode" "<MODE>")])
 
+;; Convert shift to the shiftx pattern to avoid flags dependency.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand" "")
+	(ashift:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "")
+		      (match_operand:QI 2 "register_operand" "")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI && reload_completed
+   && true_regnum (operands[0]) != true_regnum (operands[1])"
+  [(set (match_dup 0)
+	(ashift:SWI48 (match_dup 1) (match_dup 2)))])
+
+(define_insn "*bmi2_ashlsi3_1_zext"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(zero_extend:DI
+	  (ashift:SI (match_operand:SI 1 "nonimmediate_operand" "rm")
+		     (match_operand:QI 2 "register_operand" "r"))))]
+  "TARGET_64BIT && TARGET_BMI"
+  "salx\t{%k2, %1, %k0|%k0, %1, %k2}"
+  [(set_attr "type" "ishiftx")
+   (set_attr "mode" "SI")])
+
 (define_insn "*ashlsi3_1_zext"
-  [(set (match_operand:DI 0 "register_operand" "=r,r")
+  [(set (match_operand:DI 0 "register_operand" "=r,r,Yb")
 	(zero_extend:DI
-	  (ashift:SI (match_operand:SI 1 "register_operand" "0,l")
-		     (match_operand:QI 2 "nonmemory_operand" "cI,M"))))
+	  (ashift:SI (match_operand:SI 1 "nonimmediate_operand" "0,l,mYb")
+		     (match_operand:QI 2 "nonmemory_operand" "cI,M,Yb"))))
    (clobber (reg:CC FLAGS_REG))]
   "TARGET_64BIT && ix86_binary_operator_ok (ASHIFT, SImode, operands)"
 {
   switch (get_attr_type (insn))
     {
     case TYPE_LEA:
+    case TYPE_ISHIFTX:
       return "#";
 
     case TYPE_ALU:
@@ -9130,6 +9202,8 @@ 
   [(set (attr "type")
      (cond [(eq_attr "alternative" "1")
 	      (const_string "lea")
+	    (eq_attr "alternative" "2")
+	      (const_string "ishiftx")
             (and (ne (symbol_ref "TARGET_DOUBLE_WITH_ADD")
 		     (const_int 0))
 		 (match_operand 2 "const1_operand" ""))
@@ -9147,6 +9221,18 @@ 
        (const_string "*")))
    (set_attr "mode" "SI")])
 
+;; Convert shift to the shiftx pattern to avoid flags dependency.
+(define_split
+  [(set (match_operand:DI 0 "register_operand" "")
+	(zero_extend:DI
+	  (ashift:SI (match_operand:SI 1 "nonimmediate_operand" "")
+		     (match_operand:QI 2 "register_operand" ""))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT && TARGET_BMI && reload_completed
+   && true_regnum (operands[0]) != true_regnum (operands[1])"
+  [(set (match_dup 0)
+  	(zero_extend:DI (ashift:SI (match_dup 1) (match_dup 2))))])
+
 (define_insn "*ashlhi3_1"
   [(set (match_operand:HI 0 "nonimmediate_operand" "=rm")
 	(ashift:HI (match_operand:HI 1 "nonimmediate_operand" "0")
@@ -9763,20 +9849,37 @@ 
   DONE;
 })
 
+(define_insn "*bmi2_<shiftrt_insn><mode>3_1"
+  [(set (match_operand:SWI48 0 "register_operand" "=r")
+	(any_shiftrt:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "rm")
+			   (match_operand:QI 2 "register_operand" "r")))]
+  "TARGET_BMI"
+  "<shiftrt>x\t{%<w>2, %1, %0|%0, %1, %<w>2}"
+  [(set_attr "type" "ishiftx")
+   (set_attr "mode" "<MODE>")])
+
 (define_insn "*<shiftrt_insn><mode>3_1"
-  [(set (match_operand:SWI 0 "nonimmediate_operand" "=<r>m")
-	(any_shiftrt:SWI (match_operand:SWI 1 "nonimmediate_operand" "0")
-			 (match_operand:QI 2 "nonmemory_operand" "c<S>")))
+  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,Yb")
+	(any_shiftrt:SWI48
+	  (match_operand:SWI48 1 "nonimmediate_operand" "0,mYb")
+	  (match_operand:QI 2 "nonmemory_operand" "c<S>,Yb")))
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
 {
-  if (operands[2] == const1_rtx
-      && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
-    return "<shiftrt>{<imodesuffix>}\t%0";
-  else
-    return "<shiftrt>{<imodesuffix>}\t{%2, %0|%0, %2}";
+  switch (get_attr_type (insn))
+    {
+    case TYPE_ISHIFTX:
+      return "#";
+
+    default:
+      if (operands[2] == const1_rtx
+	  && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
+	return "<shiftrt>{<imodesuffix>}\t%0";
+      else
+	return "<shiftrt>{<imodesuffix>}\t{%2, %0|%0, %2}";
+    }
 }
-  [(set_attr "type" "ishift")
+  [(set_attr "type" "ishift,ishiftx")
    (set (attr "length_immediate")
      (if_then_else
        (and (match_operand 2 "const1_operand" "")
@@ -9786,19 +9889,83 @@ 
        (const_string "*")))
    (set_attr "mode" "<MODE>")])
 
-(define_insn "*<shiftrt_insn>si3_1_zext"
+;; Convert shift to the shiftx pattern to avoid flags dependency.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand" "")
+	(any_shiftrt:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "")
+			   (match_operand:QI 2 "register_operand" "")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI2 && reload_completed
+   && true_regnum (operands[0]) != true_regnum (operands[1])"
+  [(set (match_dup 0)
+	(any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))])
+
+(define_insn "*bmi2_<shiftrt_insn>si3_1_zext"
   [(set (match_operand:DI 0 "register_operand" "=r")
 	(zero_extend:DI
-	  (any_shiftrt:SI (match_operand:SI 1 "register_operand" "0")
-			  (match_operand:QI 2 "nonmemory_operand" "cI"))))
+	  (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "rm")
+			  (match_operand:QI 2 "register_operand" "r"))))]
+  "TARGET_64BIT && TARGET_BMI2"
+  "<shiftrt>x\t{%k2, %1, %k0|%k0, %1, %k2}"
+  [(set_attr "type" "ishiftx")
+   (set_attr "mode" "SI")])
+
+(define_insn "*<shiftrt_insn>si3_1_zext"
+  [(set (match_operand:DI 0 "register_operand" "=r,Yb")
+	(zero_extend:DI
+	  (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "0,mYb")
+			  (match_operand:QI 2 "nonmemory_operand" "cI,Yb"))))
    (clobber (reg:CC FLAGS_REG))]
   "TARGET_64BIT && ix86_binary_operator_ok (<CODE>, SImode, operands)"
 {
+  switch (get_attr_type (insn))
+    {
+    case TYPE_ISHIFTX:
+      return "#";
+
+    default:
+      if (operands[2] == const1_rtx
+	  && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
+	return "<shiftrt>{l}\t%k0";
+      else
+	return "<shiftrt>{l}\t{%2, %k0|%k0, %2}";
+    }
+}
+  [(set_attr "type" "ishift,ishiftx")
+   (set (attr "length_immediate")
+     (if_then_else
+       (and (match_operand 2 "const1_operand" "")
+	    (ne (symbol_ref "TARGET_SHIFT1 || optimize_function_for_size_p (cfun)")
+		(const_int 0)))
+       (const_string "0")
+       (const_string "*")))
+   (set_attr "mode" "SI")])
+
+;; Convert shift to the shiftx pattern to avoid flags dependency.
+(define_split
+  [(set (match_operand:DI 0 "register_operand" "")
+	(zero_extend:DI
+	  (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "")
+			  (match_operand:QI 2 "register_operand" ""))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT && TARGET_BMI2 && reload_completed
+   && true_regnum (operands[0]) != true_regnum (operands[1])"
+  [(set (match_dup 0)
+  	(zero_extend:DI (any_shiftrt:SI (match_dup 1) (match_dup 2))))])
+
+(define_insn "*<shiftrt_insn><mode>3_1"
+  [(set (match_operand:SWI12 0 "nonimmediate_operand" "=<r>m")
+	(any_shiftrt:SWI12
+	  (match_operand:SWI12 1 "nonimmediate_operand" "0")
+	  (match_operand:QI 2 "nonmemory_operand" "c<S>")))
+   (clobber (reg:CC FLAGS_REG))]
+  "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
+{
   if (operands[2] == const1_rtx
       && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
-    return "<shiftrt>{l}\t%k0";
+    return "<shiftrt>{<imodesuffix>}\t%0";
   else
-    return "<shiftrt>{l}\t{%2, %k0|%k0, %2}";
+    return "<shiftrt>{<imodesuffix>}\t{%2, %0|%0, %2}";
 }
   [(set_attr "type" "ishift")
    (set (attr "length_immediate")
@@ -9808,7 +9975,7 @@ 
 		(const_int 0)))
        (const_string "0")
        (const_string "*")))
-   (set_attr "mode" "SI")])
+   (set_attr "mode" "<MODE>")])
 
 (define_insn "*<shiftrt_insn>qi3_1_slp"
   [(set (strict_low_part (match_operand:QI 0 "nonimmediate_operand" "+qm"))
@@ -10060,42 +10227,153 @@ 
   split_double_mode (<DWI>mode, &operands[0], 1, &operands[4], &operands[5]);
 })
 
+(define_insn "*bmi2_rorx<mode>3_1"
+  [(set (match_operand:SWI48 0 "register_operand" "=r")
+	(rotatert:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "rm")
+			(match_operand:QI 2 "immediate_operand" "<S>")))]
+  "TARGET_BMI2"
+  "rorx\t{%2, %1, %0|%0, %1, %2}"
+  [(set_attr "type" "rotatex")
+   (set_attr "mode" "<MODE>")])
+
 (define_insn "*<rotate_insn><mode>3_1"
-  [(set (match_operand:SWI 0 "nonimmediate_operand" "=<r>m")
-	(any_rotate:SWI (match_operand:SWI 1 "nonimmediate_operand" "0")
-			(match_operand:QI 2 "nonmemory_operand" "c<S>")))
+  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,Yb")
+	(any_rotate:SWI48
+	  (match_operand:SWI48 1 "nonimmediate_operand" "0,mYb")
+	  (match_operand:QI 2 "nonmemory_operand" "c<S>,<S>")))
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
 {
-  if (operands[2] == const1_rtx
-      && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
-    return "<rotate>{<imodesuffix>}\t%0";
-  else
-    return "<rotate>{<imodesuffix>}\t{%2, %0|%0, %2}";
+  switch (get_attr_type (insn))
+    {
+    case TYPE_ROTATEX:
+      return "#";
+
+    default:
+      if (operands[2] == const1_rtx
+	  && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
+	return "<rotate>{<imodesuffix>}\t%0";
+      else
+	return "<rotate>{<imodesuffix>}\t{%2, %0|%0, %2}";
+    }
 }
-  [(set_attr "type" "rotate")
+  [(set_attr "type" "rotate,rotatex")
    (set (attr "length_immediate")
      (if_then_else
-       (and (match_operand 2 "const1_operand" "")
-	    (ne (symbol_ref "TARGET_SHIFT1 || optimize_function_for_size_p (cfun)")
-		(const_int 0)))
+       (and (eq_attr "type" "rotate")
+	    (and (match_operand 2 "const1_operand" "")
+		 (ne (symbol_ref "TARGET_SHIFT1 || optimize_function_for_size_p (cfun)")
+		     (const_int 0))))
        (const_string "0")
        (const_string "*")))
    (set_attr "mode" "<MODE>")])
 
-(define_insn "*<rotate_insn>si3_1_zext"
+;; Convert rotate to the rotatex pattern to avoid flags dependency.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand" "")
+	(rotate:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "")
+		      (match_operand:QI 2 "immediate_operand" "")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI2 && reload_completed
+   && true_regnum (operands[0]) != true_regnum (operands[1])"
+  [(set (match_dup 0)
+	(rotatert:SWI48 (match_dup 1) (match_dup 2)))]
+{
+  operands[2]
+    = GEN_INT (GET_MODE_BITSIZE (<MODE>mode) - INTVAL (operands[2]));
+})
+
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand" "")
+	(rotatert:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "")
+			(match_operand:QI 2 "immediate_operand" "")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI2 && reload_completed
+   && true_regnum (operands[0]) != true_regnum (operands[1])"
+  [(set (match_dup 0)
+	(rotatert:SWI48 (match_dup 1) (match_dup 2)))])
+
+(define_insn "*bmi2_rorxsi3_1_zext"
   [(set (match_operand:DI 0 "register_operand" "=r")
 	(zero_extend:DI
-	  (any_rotate:SI (match_operand:SI 1 "register_operand" "0")
-			 (match_operand:QI 2 "nonmemory_operand" "cI"))))
+	  (rotatert:SI (match_operand:SI 1 "nonimmediate_operand" "rm")
+		       (match_operand:QI 2 "immediate_operand" "I"))))]
+  "TARGET_64BIT && TARGET_BMI2"
+  "rorx\t{%2, %1, %k0|%k0, %1, %2}"
+  [(set_attr "type" "rotatex")
+   (set_attr "mode" "SI")])
+
+(define_insn "*<rotate_insn>si3_1_zext"
+  [(set (match_operand:DI 0 "register_operand" "=r,Yb")
+	(zero_extend:DI
+	  (any_rotate:SI (match_operand:SI 1 "nonimmediate_operand" "0,mYb")
+			 (match_operand:QI 2 "nonmemory_operand" "cI,I"))))
    (clobber (reg:CC FLAGS_REG))]
   "TARGET_64BIT && ix86_binary_operator_ok (<CODE>, SImode, operands)"
 {
-    if (operands[2] == const1_rtx
-	&& (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
-    return "<rotate>{l}\t%k0";
+  switch (get_attr_type (insn))
+    {
+    case TYPE_ROTATEX:
+      return "#";
+
+    default:
+      if (operands[2] == const1_rtx
+	  && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
+	return "<rotate>{l}\t%k0";
+      else
+	return "<rotate>{l}\t{%2, %k0|%k0, %2}";
+    }
+}
+  [(set_attr "type" "rotate,rotatex")
+   (set (attr "length_immediate")
+     (if_then_else
+       (and (eq_attr "type" "rotate")
+	    (and (match_operand 2 "const1_operand" "")
+		 (ne (symbol_ref "TARGET_SHIFT1 || optimize_function_for_size_p (cfun)")
+		     (const_int 0))))
+       (const_string "0")
+       (const_string "*")))
+   (set_attr "mode" "SI")])
+
+;; Convert rotate to the rotatex pattern to avoid flags dependency.
+(define_split
+  [(set (match_operand:DI 0 "register_operand" "")
+	(zero_extend:DI
+	  (rotate:SI (match_operand:SI 1 "nonimmediate_operand" "")
+		     (match_operand:QI 2 "immediate_operand" ""))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT && TARGET_BMI2 && reload_completed
+   && true_regnum (operands[0]) != true_regnum (operands[1])"
+  [(set (match_dup 0)
+  	(zero_extend:DI (rotatert:SI (match_dup 1) (match_dup 2))))]
+{
+  operands[2]
+    = GEN_INT (GET_MODE_BITSIZE (SImode) - INTVAL (operands[2]));
+})
+
+(define_split
+  [(set (match_operand:DI 0 "register_operand" "")
+	(zero_extend:DI
+	  (rotatert:SI (match_operand:SI 1 "nonimmediate_operand" "")
+		       (match_operand:QI 2 "immediate_operand" ""))))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT && TARGET_BMI2 && reload_completed
+   && true_regnum (operands[0]) != true_regnum (operands[1])"
+  [(set (match_dup 0)
+  	(zero_extend:DI (rotatert:SI (match_dup 1) (match_dup 2))))])
+
+(define_insn "*<rotate_insn><mode>3_1"
+  [(set (match_operand:SWI12 0 "nonimmediate_operand" "=<r>m")
+	(any_rotate:SWI12 (match_operand:SWI12 1 "nonimmediate_operand" "0")
+			  (match_operand:QI 2 "nonmemory_operand" "c<S>")))
+   (clobber (reg:CC FLAGS_REG))]
+  "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
+{
+  if (operands[2] == const1_rtx
+      && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
+    return "<rotate>{<imodesuffix>}\t%0";
   else
-    return "<rotate>{l}\t{%2, %k0|%k0, %2}";
+    return "<rotate>{<imodesuffix>}\t{%2, %0|%0, %2}";
 }
   [(set_attr "type" "rotate")
    (set (attr "length_immediate")
@@ -10105,7 +10383,7 @@ 
 		(const_int 0)))
        (const_string "0")
        (const_string "*")))
-   (set_attr "mode" "SI")])
+   (set_attr "mode" "<MODE>")])
 
 (define_insn "*<rotate_insn>qi3_1_slp"
   [(set (strict_low_part (match_operand:QI 0 "nonimmediate_operand" "+qm"))
Index: constraints.md
===================================================================
--- constraints.md	(revision 177925)
+++ constraints.md	(working copy)
@@ -92,6 +92,7 @@ 
 ;;  m	MMX inter-unit moves enabled
 ;;  d	Integer register when integer DFmode moves are enabled
 ;;  x	Integer register when integer XFmode moves are enabled
+;;  b	Integer register when BMI2 instructions are enabled
 
 (define_register_constraint "Yz" "TARGET_SSE ? SSE_FIRST_REG : NO_REGS"
  "First SSE register (@code{%xmm0}).")
@@ -123,6 +124,10 @@ 
  "optimize_function_for_speed_p (cfun) ? GENERAL_REGS : NO_REGS"
  "@internal Any integer register when integer XFmode moves are enabled.")
 
+(define_register_constraint "Yb"
+ "TARGET_BMI2 ? GENERAL_REGS : NO_REGS"
+ "@internal Any integer register, when BMI2 is enabled.")
+
 (define_constraint "z"
   "@internal Constant call address operand."
   (match_operand 0 "constant_call_address_operand"))
Index: i386.c
===================================================================
--- i386.c	(revision 177928)
+++ i386.c	(working copy)
@@ -13285,6 +13285,8 @@  put_condition_code (enum rtx_code code, enum machi
    If CODE is 't', pretend the mode is V8SFmode.
    If CODE is 'h', pretend the reg is the 'high' byte register.
    If CODE is 'y', print "st(0)" instead of "st", if the reg is stack op.
+   If CODE is 'M', print the low register of a double word register pair.
+   If CODE is 'N', print the high register of a double word register pair.
    If CODE is 'd', duplicate the operand for AVX instruction.
  */
 
@@ -13327,6 +13329,18 @@  print_reg (rtx x, int code, FILE *file)
     code = 16;
   else if (code == 't')
     code = 32;
+  else if (code == 'M')
+    {
+      gcc_assert (GET_MODE (x) == GET_MODE_WIDER_MODE (word_mode));
+      x = gen_lowpart (word_mode, x);
+      code = GET_MODE_SIZE (word_mode);
+    }
+  else if (code == 'N')
+    {
+      gcc_assert (GET_MODE (x) == GET_MODE_WIDER_MODE (word_mode));
+      x = gen_highpart (word_mode, x);
+      code = GET_MODE_SIZE (word_mode);
+    }
   else
     code = GET_MODE_SIZE (GET_MODE (x));
 
@@ -13472,6 +13486,8 @@  get_some_local_dynamic_name (void)
    t --  likewise, print the V8SFmode name of the register.
    h -- print the QImode name for a "high" register, either ah, bh, ch or dh.
    y -- print "st(0)" instead of "st" as a register.
+   M -- print the low register of a double word register pair.
+   N -- print the high register of a double word register pair.
    d -- print duplicated register operand for AVX instruction.
    D -- print condition for SSE cmp instruction.
    P -- if PIC, print an @PLT suffix.
@@ -13678,6 +13694,8 @@  ix86_print_operand (FILE *file, rtx x, int code)
 	case 'h':
 	case 't':
 	case 'y':
+	case 'M':
+	case 'N':
 	case 'x':
 	case 'X':
 	case 'P':