
[PATCH][RFC] Initial patch for better performance of 64-bit math instructions in 32-bit mode on x86-64

Message ID CAEoMCqRUy94A2w4W-0CfECc-oQOuNb9O6sjsURkDt_9g=08exw@mail.gmail.com
State New

Commit Message

Yuri Rumyantsev May 31, 2016, 3 p.m. UTC
Hi Uros,

Here is initial patch to improve performance of 64-bit integer
arithmetic in 32-bit mode. We discovered that gcc is significantly
behind icc and clang on rsa benchmark from eembc2.0 suite.
The problem function looks like:
typedef unsigned long long ull;
typedef unsigned long ul;
ul mul_add(ul *rp, ul *ap, int num, ul w)
 {
 ul c1=0;
 ull t;
 for (;;)
  {
  { t=(ull)w * ap[0] + rp[0] + c1;
   rp[0]= ((ul)t)&0xffffffffL; c1= ((ul)((t)>>32))&(0xffffffffL); };
  if (--num == 0) break;
  { t=(ull)w * ap[1] + rp[1] + c1;
   rp[1]= ((ul)(t))&(0xffffffffL); c1= (((ul)((t)>>32))&(0xffffffffL)); };
  if (--num == 0) break;
  { t=(ull)w * ap[2] + rp[2] + c1;
   rp[2]= (((ul)(t))&(0xffffffffL)); c1= (((ul)((t)>>32))&(0xffffffffL)); };
  if (--num == 0) break;
  { t=(ull)w * ap[3] + rp[3] + c1;
   rp[3]= (((ul)(t))&(0xffffffffL)); c1= (((ul)((t)>>32))&(0xffffffffL)); };
  if (--num == 0) break;
  ap+=4;
  rp+=4;
  }
 return(c1);
 }

If we apply the patch below, we get a +6% speed-up for rsa on Silvermont.

The patch looks like this (not complete, since other 64-bit instructions,
e.g. subtraction, need the same treatment; the diff is reproduced in the
Patch section below):


What is your opinion?

Comments

Uros Bizjak May 31, 2016, 4:15 p.m. UTC | #1
On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Uros,
>
> Here is initial patch to improve performance of 64-bit integer
> arithmetic in 32-bit mode. We discovered that gcc is significantly
> behind icc and clang on rsa benchmark from eembc2.0 suite.
> [snip]
>
> Index: i386.md
> ===================================================================
> --- i386.md     (revision 236181)
> +++ i386.md     (working copy)
> @@ -5439,7 +5439,7 @@
>     (clobber (reg:CC FLAGS_REG))]
>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>    "#"
> -  "reload_completed"
> +  "1"
>    [(parallel [(set (reg:CCC FLAGS_REG)
>                    (compare:CCC
>                      (plus:DWIH (match_dup 1) (match_dup 2))
>
> What is your opinion?

This splitter doesn't depend on hard registers, so there is no
technical obstacle to the proposed patch. OTOH, this is a very old
splitter; it is possible that it was introduced to work around
reload deficiencies. Maybe Jeff knows something about this approach.
We have LRA now, so perhaps we should rethink the purpose of these
DImode splitters.

A pragmatic approach would be - if the patch shows measurable benefit,
and doesn't introduce regressions, then Stage 1 is the time to try it.

BTW: Use "&& 1" in the split condition of the combined insn_and_split
pattern to copy the enable condition from the insn part. If there is
no condition, you should just use "".

Uros.
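[For context on the "&& 1" convention Uros mentions: in a define_insn_and_split, a split condition that begins with "&&" gets the insn condition prepended automatically, so the split fires only where the insn itself is valid. A schematic sketch in machine-description syntax -- pattern bodies elided, and the name is illustrative, not the exact i386.md text:]

```
(define_insn_and_split "*add<dwi>3_doubleword"
  [... insn pattern ...]
  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"  ; insn condition
  "#"                       ; never emitted as-is, always split
  "&& 1"                    ; split condition: insn condition prepended,
                            ; then "1" = split whenever matched
  [... split pattern ...])
```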
Ilya Enkovich June 1, 2016, 9:57 a.m. UTC | #2
2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubizjak@gmail.com>:
> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> Hi Uros,
>>
>> Here is initial patch to improve performance of 64-bit integer
>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>> behind icc and clang on rsa benchmark from eembc2.0 suite.
>> [snip]
>>
>> Index: i386.md
>> ===================================================================
>> --- i386.md     (revision 236181)
>> +++ i386.md     (working copy)
>> @@ -5439,7 +5439,7 @@
>>     (clobber (reg:CC FLAGS_REG))]
>>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>    "#"
>> -  "reload_completed"
>> +  "1"
>>    [(parallel [(set (reg:CCC FLAGS_REG)
>>                    (compare:CCC
>>                      (plus:DWIH (match_dup 1) (match_dup 2))
>>
>> What is your opinion?
>
> This splitter doesn't depend on hard registers, so there is no
> technical obstacle to the proposed patch. OTOH, this is a very old
> splitter; it is possible that it was introduced to work around
> reload deficiencies. Maybe Jeff knows something about this approach.
> We have LRA now, so perhaps we should rethink the purpose of these
> DImode splitters.

The change doesn't spoil the splitter for the hard-register case, so the
splitter should still be able to handle any reload deficiencies.  I think
we should try to split all instructions working on multiword registers
(not only the PLUS case) in earlier passes, to allow more optimizations on
the split code and to relax register allocation (currently we must allocate
consecutive registers).  Perhaps add a separate split pass right after STV?
This should help with PR70321.

Thanks,
Ilya

Richard Biener June 1, 2016, 10:06 a.m. UTC | #3
On Wed, Jun 1, 2016 at 11:57 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> 2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubizjak@gmail.com>:
>> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> Hi Uros,
>>>
>>> Here is initial patch to improve performance of 64-bit integer
>>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>>> behind icc and clang on rsa benchmark from eembc2.0 suite.
>>> [snip]
>>>
>>> Index: i386.md
>>> ===================================================================
>>> --- i386.md     (revision 236181)
>>> +++ i386.md     (working copy)
>>> @@ -5439,7 +5439,7 @@
>>>     (clobber (reg:CC FLAGS_REG))]
>>>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>>    "#"
>>> -  "reload_completed"
>>> +  "1"
>>>    [(parallel [(set (reg:CCC FLAGS_REG)
>>>                    (compare:CCC
>>>                      (plus:DWIH (match_dup 1) (match_dup 2))
>>>
>>> What is your opinion?
>>
>> This splitter doesn't depend on hard registers, so there is no
>> technical obstacle to the proposed patch. OTOH, this is a very old
>> splitter; it is possible that it was introduced to work around
>> reload deficiencies. Maybe Jeff knows something about this approach.
>> We have LRA now, so perhaps we should rethink the purpose of these
>> DImode splitters.
>
> The change doesn't spoil the splitter for the hard-register case, so the
> splitter should still be able to handle any reload deficiencies.  I think
> we should try to split all instructions working on multiword registers
> (not only the PLUS case) in earlier passes, to allow more optimizations on
> the split code and to relax register allocation (currently we must allocate
> consecutive registers).  Perhaps add a separate split pass right after STV?
> This should help with PR70321.

There are already pass_lower_subreg{,2}; I'm not sure whether x86 uses them
for splitting DImode ops, though.

Richard.

Ilya Enkovich June 1, 2016, 10:40 a.m. UTC | #4
2016-06-01 13:06 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Wed, Jun 1, 2016 at 11:57 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubizjak@gmail.com>:
>>> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>> Hi Uros,
>>>>
>>>> Here is initial patch to improve performance of 64-bit integer
>>>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>>>> behind icc and clang on rsa benchmark from eembc2.0 suite.
>>>> [snip]
>>>>
>>>> Index: i386.md
>>>> ===================================================================
>>>> --- i386.md     (revision 236181)
>>>> +++ i386.md     (working copy)
>>>> @@ -5439,7 +5439,7 @@
>>>>     (clobber (reg:CC FLAGS_REG))]
>>>>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>>>    "#"
>>>> -  "reload_completed"
>>>> +  "1"
>>>>    [(parallel [(set (reg:CCC FLAGS_REG)
>>>>                    (compare:CCC
>>>>                      (plus:DWIH (match_dup 1) (match_dup 2))
>>>>
>>>> What is your opinion?
>>>
>>> This splitter doesn't depend on hard registers, so there is no
>>> technical obstacle to the proposed patch. OTOH, this is a very old
>>> splitter; it is possible that it was introduced to work around
>>> reload deficiencies. Maybe Jeff knows something about this approach.
>>> We have LRA now, so perhaps we should rethink the purpose of these
>>> DImode splitters.
>>
>> The change doesn't spoil the splitter for the hard-register case, so the
>> splitter should still be able to handle any reload deficiencies.  I think
>> we should try to split all instructions working on multiword registers
>> (not only the PLUS case) in earlier passes, to allow more optimizations on
>> the split code and to relax register allocation (currently we must allocate
>> consecutive registers).  Perhaps add a separate split pass right after STV?
>> This should help with PR70321.
>
> There are already pass_lower_subreg{,2}; I'm not sure whether x86 uses them
> for splitting DImode ops, though.

Looking at the pass description, I see it works only when "all the uses of
a multi-word register are via SUBREG, or are copies of the register to
another location".  It doesn't cover cases where we operate on whole
multi-word registers.

Thanks,
Ilya


Patch

Index: i386.md
===================================================================
--- i386.md     (revision 236181)
+++ i386.md     (working copy)
@@ -5439,7 +5439,7 @@ 
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
   "#"
-  "reload_completed"
+  "1"
   [(parallel [(set (reg:CCC FLAGS_REG)
                   (compare:CCC
                     (plus:DWIH (match_dup 1) (match_dup 2))