diff mbox

[PING**2,ARM] Further improve stack usage on sha512 (PR 77308)

Message ID AM4PR0701MB21629F5BB692295C62E09DFCE4120@AM4PR0701MB2162.eurprd07.prod.outlook.com
State New
Headers show

Commit Message

Bernd Edlinger April 29, 2017, 5:45 p.m. UTC
Ping...

I attached a rebased version since there was a merge conflict in
the xordi3 pattern, otherwise the patch is still identical.
It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2
early when the target has no neon or iwmmxt.


Thanks
Bernd.



On 11/28/16 20:42, Bernd Edlinger wrote:
> On 11/25/16 12:30, Ramana Radhakrishnan wrote:

>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger

>> <bernd.edlinger@hotmail.de> wrote:

>>> Hi!

>>>

>>> This improves the stack usage on the sha512 test case for the case

>>> without hardware fpu and without iwmmxt by splitting all di-mode

>>> patterns right while expanding which is similar to what the

>>> shift-pattern

>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as well as

>>> thumb1.

>>>

>>

>> I would go further and do this in the absence of Neon, the VFP unit

>> being there doesn't help with DImode operations i.e. we do not have 64

>> bit integer arithmetic instructions without Neon. The main reason why

>> we have the DImode patterns split so late is to give a chance for

>> folks who want to do 64 bit arithmetic in Neon a chance to make this

>> work as well as support some of the 64 bit Neon intrinsics which IIRC

>> map down to these instructions. Doing this just for soft-float doesn't

>> improve the default case only. I don't usually test iwmmxt and I'm not

>> sure who has the ability to do so, thus keeping this restriction for

>> iwMMX is fine.

>>

>>

>

> Yes I understand, thanks for pointing that out.

>

> I was not aware what iwmmxt exists at all, but I noticed that most

> 64bit expansions work completely different, and would break if we split

> the pattern early.

>

> I can however only look at the assembler outout for iwmmxt, and make

> sure that the stack usage does not get worse.

>

> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as

> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack

> for the test cases, and vfp and soft-float at around 270 bytes stack

> usage.

>

>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).

>>>

>>> Note this also splits many ldrd/strd instructions and therefore I will

>>> post a followup-patch that mitigates this effect by enabling the

>>> ldrd/strd

>>> peephole optimization after the necessary reg-testing.

>>>

>>>

>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.

>>

>> What do you mean by arm-linux-gnueabihf - when folks say that I

>> interpret it as --with-arch=armv7-a --with-float=hard

>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).

>>

>> If you've really bootstrapped and regtested it on armhf, doesn't this

>> patch as it stand have no effect there i.e. no change ?

>> arm-linux-gnueabihf usually means to me someone has configured with

>> --with-float=hard, so there are no regressions in the hard float ABI

>> case,

>>

>

> I know it proves little.  When I say arm-linux-gnueabihf

> I do in fact mean --enable-languages=all,ada,go,obj-c++

> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16

> --with-float=hard.

>

> My main interest in the stack usage is of course not because of linux,

> but because of eCos where we have very small task stacks and in fact

> no fpu support by the O/S at all, so that patch is exactly what we need.

>

>

> Bootstrapped and reg-tested on arm-linux-gnueabihf

> Is it OK for trunk?

>

>

> Thanks

> Bernd.

Comments

Bernd Edlinger May 12, 2017, 4:50 p.m. UTC | #1
Ping...

On 04/29/17 19:45, Bernd Edlinger wrote:
> Ping...

>

> I attached a rebased version since there was a merge conflict in

> the xordi3 pattern, otherwise the patch is still identical.

> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2

> early when the target has no neon or iwmmxt.

>

>

> Thanks

> Bernd.

>

>

>

> On 11/28/16 20:42, Bernd Edlinger wrote:

>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:

>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger

>>> <bernd.edlinger@hotmail.de> wrote:

>>>> Hi!

>>>>

>>>> This improves the stack usage on the sha512 test case for the case

>>>> without hardware fpu and without iwmmxt by splitting all di-mode

>>>> patterns right while expanding which is similar to what the

>>>> shift-pattern

>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as

>>>> well as

>>>> thumb1.

>>>>

>>>

>>> I would go further and do this in the absence of Neon, the VFP unit

>>> being there doesn't help with DImode operations i.e. we do not have 64

>>> bit integer arithmetic instructions without Neon. The main reason why

>>> we have the DImode patterns split so late is to give a chance for

>>> folks who want to do 64 bit arithmetic in Neon a chance to make this

>>> work as well as support some of the 64 bit Neon intrinsics which IIRC

>>> map down to these instructions. Doing this just for soft-float doesn't

>>> improve the default case only. I don't usually test iwmmxt and I'm not

>>> sure who has the ability to do so, thus keeping this restriction for

>>> iwMMX is fine.

>>>

>>>

>>

>> Yes I understand, thanks for pointing that out.

>>

>> I was not aware what iwmmxt exists at all, but I noticed that most

>> 64bit expansions work completely different, and would break if we split

>> the pattern early.

>>

>> I can however only look at the assembler outout for iwmmxt, and make

>> sure that the stack usage does not get worse.

>>

>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as

>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack

>> for the test cases, and vfp and soft-float at around 270 bytes stack

>> usage.

>>

>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).

>>>>

>>>> Note this also splits many ldrd/strd instructions and therefore I will

>>>> post a followup-patch that mitigates this effect by enabling the

>>>> ldrd/strd

>>>> peephole optimization after the necessary reg-testing.

>>>>

>>>>

>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.

>>>

>>> What do you mean by arm-linux-gnueabihf - when folks say that I

>>> interpret it as --with-arch=armv7-a --with-float=hard

>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).

>>>

>>> If you've really bootstrapped and regtested it on armhf, doesn't this

>>> patch as it stand have no effect there i.e. no change ?

>>> arm-linux-gnueabihf usually means to me someone has configured with

>>> --with-float=hard, so there are no regressions in the hard float ABI

>>> case,

>>>

>>

>> I know it proves little.  When I say arm-linux-gnueabihf

>> I do in fact mean --enable-languages=all,ada,go,obj-c++

>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16

>> --with-float=hard.

>>

>> My main interest in the stack usage is of course not because of linux,

>> but because of eCos where we have very small task stacks and in fact

>> no fpu support by the O/S at all, so that patch is exactly what we need.

>>

>>

>> Bootstrapped and reg-tested on arm-linux-gnueabihf

>> Is it OK for trunk?

>>

>>

>> Thanks

>> Bernd.
Bernd Edlinger June 1, 2017, 4:01 p.m. UTC | #2
Ping...

On 05/12/17 18:49, Bernd Edlinger wrote:
> Ping...

> 

> On 04/29/17 19:45, Bernd Edlinger wrote:

>> Ping...

>>

>> I attached a rebased version since there was a merge conflict in

>> the xordi3 pattern, otherwise the patch is still identical.

>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2

>> early when the target has no neon or iwmmxt.

>>

>>

>> Thanks

>> Bernd.

>>

>>

>>

>> On 11/28/16 20:42, Bernd Edlinger wrote:

>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:

>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger

>>>> <bernd.edlinger@hotmail.de> wrote:

>>>>> Hi!

>>>>>

>>>>> This improves the stack usage on the sha512 test case for the case

>>>>> without hardware fpu and without iwmmxt by splitting all di-mode

>>>>> patterns right while expanding which is similar to what the

>>>>> shift-pattern

>>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as

>>>>> well as

>>>>> thumb1.

>>>>>

>>>>

>>>> I would go further and do this in the absence of Neon, the VFP unit

>>>> being there doesn't help with DImode operations i.e. we do not have 64

>>>> bit integer arithmetic instructions without Neon. The main reason why

>>>> we have the DImode patterns split so late is to give a chance for

>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this

>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC

>>>> map down to these instructions. Doing this just for soft-float doesn't

>>>> improve the default case only. I don't usually test iwmmxt and I'm not

>>>> sure who has the ability to do so, thus keeping this restriction for

>>>> iwMMX is fine.

>>>>

>>>>

>>>

>>> Yes I understand, thanks for pointing that out.

>>>

>>> I was not aware what iwmmxt exists at all, but I noticed that most

>>> 64bit expansions work completely different, and would break if we split

>>> the pattern early.

>>>

>>> I can however only look at the assembler outout for iwmmxt, and make

>>> sure that the stack usage does not get worse.

>>>

>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as

>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack

>>> for the test cases, and vfp and soft-float at around 270 bytes stack

>>> usage.

>>>

>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).

>>>>>

>>>>> Note this also splits many ldrd/strd instructions and therefore I will

>>>>> post a followup-patch that mitigates this effect by enabling the

>>>>> ldrd/strd

>>>>> peephole optimization after the necessary reg-testing.

>>>>>

>>>>>

>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.

>>>>

>>>> What do you mean by arm-linux-gnueabihf - when folks say that I

>>>> interpret it as --with-arch=armv7-a --with-float=hard

>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).

>>>>

>>>> If you've really bootstrapped and regtested it on armhf, doesn't this

>>>> patch as it stand have no effect there i.e. no change ?

>>>> arm-linux-gnueabihf usually means to me someone has configured with

>>>> --with-float=hard, so there are no regressions in the hard float ABI

>>>> case,

>>>>

>>>

>>> I know it proves little.  When I say arm-linux-gnueabihf

>>> I do in fact mean --enable-languages=all,ada,go,obj-c++

>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16

>>> --with-float=hard.

>>>

>>> My main interest in the stack usage is of course not because of linux,

>>> but because of eCos where we have very small task stacks and in fact

>>> no fpu support by the O/S at all, so that patch is exactly what we need.

>>>

>>>

>>> Bootstrapped and reg-tested on arm-linux-gnueabihf

>>> Is it OK for trunk?

>>>

>>>

>>> Thanks

>>> Bernd.
Bernd Edlinger June 14, 2017, 12:35 p.m. UTC | #3
Ping...

On 06/01/17 18:01, Bernd Edlinger wrote:
> Ping...

> 

> On 05/12/17 18:49, Bernd Edlinger wrote:

>> Ping...

>>

>> On 04/29/17 19:45, Bernd Edlinger wrote:

>>> Ping...

>>>

>>> I attached a rebased version since there was a merge conflict in

>>> the xordi3 pattern, otherwise the patch is still identical.

>>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2

>>> early when the target has no neon or iwmmxt.

>>>

>>>

>>> Thanks

>>> Bernd.

>>>

>>>

>>>

>>> On 11/28/16 20:42, Bernd Edlinger wrote:

>>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:

>>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger

>>>>> <bernd.edlinger@hotmail.de> wrote:

>>>>>> Hi!

>>>>>>

>>>>>> This improves the stack usage on the sha512 test case for the case

>>>>>> without hardware fpu and without iwmmxt by splitting all di-mode

>>>>>> patterns right while expanding which is similar to what the

>>>>>> shift-pattern

>>>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as

>>>>>> well as

>>>>>> thumb1.

>>>>>>

>>>>>

>>>>> I would go further and do this in the absence of Neon, the VFP unit

>>>>> being there doesn't help with DImode operations i.e. we do not have 64

>>>>> bit integer arithmetic instructions without Neon. The main reason why

>>>>> we have the DImode patterns split so late is to give a chance for

>>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this

>>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC

>>>>> map down to these instructions. Doing this just for soft-float doesn't

>>>>> improve the default case only. I don't usually test iwmmxt and I'm not

>>>>> sure who has the ability to do so, thus keeping this restriction for

>>>>> iwMMX is fine.

>>>>>

>>>>>

>>>>

>>>> Yes I understand, thanks for pointing that out.

>>>>

>>>> I was not aware what iwmmxt exists at all, but I noticed that most

>>>> 64bit expansions work completely different, and would break if we split

>>>> the pattern early.

>>>>

>>>> I can however only look at the assembler outout for iwmmxt, and make

>>>> sure that the stack usage does not get worse.

>>>>

>>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as

>>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack

>>>> for the test cases, and vfp and soft-float at around 270 bytes stack

>>>> usage.

>>>>

>>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).

>>>>>>

>>>>>> Note this also splits many ldrd/strd instructions and therefore I 

>>>>>> will

>>>>>> post a followup-patch that mitigates this effect by enabling the

>>>>>> ldrd/strd

>>>>>> peephole optimization after the necessary reg-testing.

>>>>>>

>>>>>>

>>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.

>>>>>

>>>>> What do you mean by arm-linux-gnueabihf - when folks say that I

>>>>> interpret it as --with-arch=armv7-a --with-float=hard

>>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).

>>>>>

>>>>> If you've really bootstrapped and regtested it on armhf, doesn't this

>>>>> patch as it stand have no effect there i.e. no change ?

>>>>> arm-linux-gnueabihf usually means to me someone has configured with

>>>>> --with-float=hard, so there are no regressions in the hard float ABI

>>>>> case,

>>>>>

>>>>

>>>> I know it proves little.  When I say arm-linux-gnueabihf

>>>> I do in fact mean --enable-languages=all,ada,go,obj-c++

>>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16

>>>> --with-float=hard.

>>>>

>>>> My main interest in the stack usage is of course not because of linux,

>>>> but because of eCos where we have very small task stacks and in fact

>>>> no fpu support by the O/S at all, so that patch is exactly what we 

>>>> need.

>>>>

>>>>

>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf

>>>> Is it OK for trunk?

>>>>

>>>>

>>>> Thanks

>>>> Bernd.
Bernd Edlinger July 5, 2017, 6:14 p.m. UTC | #4
Ping...

The latest version of this patch was here:
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01567.html

Thanks
Bernd.

On 06/14/17 14:34, Bernd Edlinger wrote:
> Ping...

> 

> On 06/01/17 18:01, Bernd Edlinger wrote:

>> Ping...

>>

>> On 05/12/17 18:49, Bernd Edlinger wrote:

>>> Ping...

>>>

>>> On 04/29/17 19:45, Bernd Edlinger wrote:

>>>> Ping...

>>>>

>>>> I attached a rebased version since there was a merge conflict in

>>>> the xordi3 pattern, otherwise the patch is still identical.

>>>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2

>>>> early when the target has no neon or iwmmxt.

>>>>

>>>>

>>>> Thanks

>>>> Bernd.

>>>>

>>>>

>>>>

>>>> On 11/28/16 20:42, Bernd Edlinger wrote:

>>>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:

>>>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger

>>>>>> <bernd.edlinger@hotmail.de> wrote:

>>>>>>> Hi!

>>>>>>>

>>>>>>> This improves the stack usage on the sha512 test case for the case

>>>>>>> without hardware fpu and without iwmmxt by splitting all di-mode

>>>>>>> patterns right while expanding which is similar to what the

>>>>>>> shift-pattern

>>>>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as

>>>>>>> well as

>>>>>>> thumb1.

>>>>>>>

>>>>>>

>>>>>> I would go further and do this in the absence of Neon, the VFP unit

>>>>>> being there doesn't help with DImode operations i.e. we do not 

>>>>>> have 64

>>>>>> bit integer arithmetic instructions without Neon. The main reason why

>>>>>> we have the DImode patterns split so late is to give a chance for

>>>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this

>>>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC

>>>>>> map down to these instructions. Doing this just for soft-float 

>>>>>> doesn't

>>>>>> improve the default case only. I don't usually test iwmmxt and I'm 

>>>>>> not

>>>>>> sure who has the ability to do so, thus keeping this restriction for

>>>>>> iwMMX is fine.

>>>>>>

>>>>>>

>>>>>

>>>>> Yes I understand, thanks for pointing that out.

>>>>>

>>>>> I was not aware what iwmmxt exists at all, but I noticed that most

>>>>> 64bit expansions work completely different, and would break if we 

>>>>> split

>>>>> the pattern early.

>>>>>

>>>>> I can however only look at the assembler outout for iwmmxt, and make

>>>>> sure that the stack usage does not get worse.

>>>>>

>>>>> Thus the new version of the patch keeps only thumb1, neon and 

>>>>> iwmmxt as

>>>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack

>>>>> for the test cases, and vfp and soft-float at around 270 bytes stack

>>>>> usage.

>>>>>

>>>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).

>>>>>>>

>>>>>>> Note this also splits many ldrd/strd instructions and therefore I 

>>>>>>> will

>>>>>>> post a followup-patch that mitigates this effect by enabling the

>>>>>>> ldrd/strd

>>>>>>> peephole optimization after the necessary reg-testing.

>>>>>>>

>>>>>>>

>>>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.

>>>>>>

>>>>>> What do you mean by arm-linux-gnueabihf - when folks say that I

>>>>>> interpret it as --with-arch=armv7-a --with-float=hard

>>>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).

>>>>>>

>>>>>> If you've really bootstrapped and regtested it on armhf, doesn't this

>>>>>> patch as it stand have no effect there i.e. no change ?

>>>>>> arm-linux-gnueabihf usually means to me someone has configured with

>>>>>> --with-float=hard, so there are no regressions in the hard float ABI

>>>>>> case,

>>>>>>

>>>>>

>>>>> I know it proves little.  When I say arm-linux-gnueabihf

>>>>> I do in fact mean --enable-languages=all,ada,go,obj-c++

>>>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16

>>>>> --with-float=hard.

>>>>>

>>>>> My main interest in the stack usage is of course not because of linux,

>>>>> but because of eCos where we have very small task stacks and in fact

>>>>> no fpu support by the O/S at all, so that patch is exactly what we 

>>>>> need.

>>>>>

>>>>>

>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf

>>>>> Is it OK for trunk?

>>>>>

>>>>>

>>>>> Thanks

>>>>> Bernd.
Kyrill Tkachov Sept. 4, 2017, 2:52 p.m. UTC | #5
On 29/04/17 18:45, Bernd Edlinger wrote:
> Ping...
>
> I attached a rebased version since there was a merge conflict in
> the xordi3 pattern, otherwise the patch is still identical.
> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2
> early when the target has no neon or iwmmxt.
>
>
> Thanks
> Bernd.
>
>
>
> On 11/28/16 20:42, Bernd Edlinger wrote:
>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:
>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger
>>> <bernd.edlinger@hotmail.de> wrote:
>>>> Hi!
>>>>
>>>> This improves the stack usage on the sha512 test case for the case
>>>> without hardware fpu and without iwmmxt by splitting all di-mode
>>>> patterns right while expanding which is similar to what the
>>>> shift-pattern
>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as well as
>>>> thumb1.
>>>>
>>> I would go further and do this in the absence of Neon, the VFP unit
>>> being there doesn't help with DImode operations i.e. we do not have 64
>>> bit integer arithmetic instructions without Neon. The main reason why
>>> we have the DImode patterns split so late is to give a chance for
>>> folks who want to do 64 bit arithmetic in Neon a chance to make this
>>> work as well as support some of the 64 bit Neon intrinsics which IIRC
>>> map down to these instructions. Doing this just for soft-float doesn't
>>> improve the default case only. I don't usually test iwmmxt and I'm not
>>> sure who has the ability to do so, thus keeping this restriction for
>>> iwMMX is fine.
>>>
>>>
>> Yes I understand, thanks for pointing that out.
>>
>> I was not aware what iwmmxt exists at all, but I noticed that most
>> 64bit expansions work completely different, and would break if we split
>> the pattern early.
>>
>> I can however only look at the assembler outout for iwmmxt, and make
>> sure that the stack usage does not get worse.
>>
>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as
>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack
>> for the test cases, and vfp and soft-float at around 270 bytes stack
>> usage.
>>
>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).
>>>>
>>>> Note this also splits many ldrd/strd instructions and therefore I will
>>>> post a followup-patch that mitigates this effect by enabling the
>>>> ldrd/strd
>>>> peephole optimization after the necessary reg-testing.
>>>>
>>>>
>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>> What do you mean by arm-linux-gnueabihf - when folks say that I
>>> interpret it as --with-arch=armv7-a --with-float=hard
>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).
>>>
>>> If you've really bootstrapped and regtested it on armhf, doesn't this
>>> patch as it stand have no effect there i.e. no change ?
>>> arm-linux-gnueabihf usually means to me someone has configured with
>>> --with-float=hard, so there are no regressions in the hard float ABI
>>> case,
>>>
>> I know it proves little.  When I say arm-linux-gnueabihf
>> I do in fact mean --enable-languages=all,ada,go,obj-c++
>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
>> --with-float=hard.
>>
>> My main interest in the stack usage is of course not because of linux,
>> but because of eCos where we have very small task stacks and in fact
>> no fpu support by the O/S at all, so that patch is exactly what we need.
>>
>>
>> Bootstrapped and reg-tested on arm-linux-gnueabihf
>> Is it OK for trunk?

The code is ok.
AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code 
as the splits
will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and 
TARGET_IWMMXT2
is irrelevant here).

So this is ok for trunk.
Thanks, and sorry again for the delay.
Kyrill

>>
>> Thanks
>> Bernd.
Christophe Lyon Sept. 5, 2017, 8:45 a.m. UTC | #6
Hi Bernd,


On 4 September 2017 at 16:52, Kyrill  Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> On 29/04/17 18:45, Bernd Edlinger wrote:
>>
>> Ping...
>>
>> I attached a rebased version since there was a merge conflict in
>> the xordi3 pattern, otherwise the patch is still identical.
>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2
>> early when the target has no neon or iwmmxt.
>>
>>
>> Thanks
>> Bernd.
>>
>>
>>
>> On 11/28/16 20:42, Bernd Edlinger wrote:
>>>
>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:
>>>>
>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger
>>>> <bernd.edlinger@hotmail.de> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> This improves the stack usage on the sha512 test case for the case
>>>>> without hardware fpu and without iwmmxt by splitting all di-mode
>>>>> patterns right while expanding which is similar to what the
>>>>> shift-pattern
>>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as well
>>>>> as
>>>>> thumb1.
>>>>>
>>>> I would go further and do this in the absence of Neon, the VFP unit
>>>> being there doesn't help with DImode operations i.e. we do not have 64
>>>> bit integer arithmetic instructions without Neon. The main reason why
>>>> we have the DImode patterns split so late is to give a chance for
>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this
>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC
>>>> map down to these instructions. Doing this just for soft-float doesn't
>>>> improve the default case only. I don't usually test iwmmxt and I'm not
>>>> sure who has the ability to do so, thus keeping this restriction for
>>>> iwMMX is fine.
>>>>
>>>>
>>> Yes I understand, thanks for pointing that out.
>>>
>>> I was not aware what iwmmxt exists at all, but I noticed that most
>>> 64bit expansions work completely different, and would break if we split
>>> the pattern early.
>>>
>>> I can however only look at the assembler outout for iwmmxt, and make
>>> sure that the stack usage does not get worse.
>>>
>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as
>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack
>>> for the test cases, and vfp and soft-float at around 270 bytes stack
>>> usage.
>>>
>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).
>>>>>
>>>>> Note this also splits many ldrd/strd instructions and therefore I will
>>>>> post a followup-patch that mitigates this effect by enabling the
>>>>> ldrd/strd
>>>>> peephole optimization after the necessary reg-testing.
>>>>>
>>>>>
>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>>>
>>>> What do you mean by arm-linux-gnueabihf - when folks say that I
>>>> interpret it as --with-arch=armv7-a --with-float=hard
>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).
>>>>
>>>> If you've really bootstrapped and regtested it on armhf, doesn't this
>>>> patch as it stand have no effect there i.e. no change ?
>>>> arm-linux-gnueabihf usually means to me someone has configured with
>>>> --with-float=hard, so there are no regressions in the hard float ABI
>>>> case,
>>>>
>>> I know it proves little.  When I say arm-linux-gnueabihf
>>> I do in fact mean --enable-languages=all,ada,go,obj-c++
>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
>>> --with-float=hard.
>>>
>>> My main interest in the stack usage is of course not because of linux,
>>> but because of eCos where we have very small task stacks and in fact
>>> no fpu support by the O/S at all, so that patch is exactly what we need.
>>>
>>>
>>> Bootstrapped and reg-tested on arm-linux-gnueabihf
>>> Is it OK for trunk?
>
>
> The code is ok.
> AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as
> the splits
> will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and
> TARGET_IWMMXT2
> is irrelevant here).
>
> So this is ok for trunk.
> Thanks, and sorry again for the delay.
> Kyrill
>

This patch (r251663) causes a regression on armeb-none-linux-gnueabihf
--with-mode arm
--with-cpu cortex-a9
--with-fpu vfpv3-d16-fp16
FAIL:    gcc.dg/vect/vect-singleton_1.c (internal compiler error)
FAIL:    gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects
(internal compiler error)

the test passes if gcc is configured --with-fpu neon-fp16


Christophe

>>>
>>> Thanks
>>> Bernd.
>
>
Bernd Edlinger Sept. 5, 2017, 2:25 p.m. UTC | #7
Hi Christophe,

On 09/05/17 10:45, Christophe Lyon wrote:
> Hi Bernd,

> 

> 

> On 4 September 2017 at 16:52, Kyrill  Tkachov

> <kyrylo.tkachov@foss.arm.com> wrote:

>>

>> On 29/04/17 18:45, Bernd Edlinger wrote:

>>>

>>> Ping...

>>>

>>> I attached a rebased version since there was a merge conflict in

>>> the xordi3 pattern, otherwise the patch is still identical.

>>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2

>>> early when the target has no neon or iwmmxt.

>>>

>>>

>>> Thanks

>>> Bernd.

>>>

>>>

>>>

>>> On 11/28/16 20:42, Bernd Edlinger wrote:

>>>>

>>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:

>>>>>

>>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger

>>>>> <bernd.edlinger@hotmail.de> wrote:

>>>>>>

>>>>>> Hi!

>>>>>>

>>>>>> This improves the stack usage on the sha512 test case for the case

>>>>>> without hardware fpu and without iwmmxt by splitting all di-mode

>>>>>> patterns right while expanding which is similar to what the

>>>>>> shift-pattern

>>>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as well

>>>>>> as

>>>>>> thumb1.

>>>>>>

>>>>> I would go further and do this in the absence of Neon, the VFP unit

>>>>> being there doesn't help with DImode operations i.e. we do not have 64

>>>>> bit integer arithmetic instructions without Neon. The main reason why

>>>>> we have the DImode patterns split so late is to give a chance for

>>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this

>>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC

>>>>> map down to these instructions. Doing this just for soft-float doesn't

>>>>> improve the default case only. I don't usually test iwmmxt and I'm not

>>>>> sure who has the ability to do so, thus keeping this restriction for

>>>>> iwMMX is fine.

>>>>>

>>>>>

>>>> Yes I understand, thanks for pointing that out.

>>>>

>>>> I was not aware what iwmmxt exists at all, but I noticed that most

>>>> 64bit expansions work completely different, and would break if we split

>>>> the pattern early.

>>>>

>>>> I can however only look at the assembler outout for iwmmxt, and make

>>>> sure that the stack usage does not get worse.

>>>>

>>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as

>>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack

>>>> for the test cases, and vfp and soft-float at around 270 bytes stack

>>>> usage.

>>>>

>>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).

>>>>>>

>>>>>> Note this also splits many ldrd/strd instructions and therefore I will

>>>>>> post a followup-patch that mitigates this effect by enabling the

>>>>>> ldrd/strd

>>>>>> peephole optimization after the necessary reg-testing.

>>>>>>

>>>>>>

>>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.

>>>>>

>>>>> What do you mean by arm-linux-gnueabihf - when folks say that I

>>>>> interpret it as --with-arch=armv7-a --with-float=hard

>>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).

>>>>>

>>>>> If you've really bootstrapped and regtested it on armhf, doesn't this

>>>>> patch as it stand have no effect there i.e. no change ?

>>>>> arm-linux-gnueabihf usually means to me someone has configured with

>>>>> --with-float=hard, so there are no regressions in the hard float ABI

>>>>> case,

>>>>>

>>>> I know it proves little.  When I say arm-linux-gnueabihf

>>>> I do in fact mean --enable-languages=all,ada,go,obj-c++

>>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16

>>>> --with-float=hard.

>>>>

>>>> My main interest in the stack usage is of course not because of linux,

>>>> but because of eCos where we have very small task stacks and in fact

>>>> no fpu support by the O/S at all, so that patch is exactly what we need.

>>>>

>>>>

>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf

>>>> Is it OK for trunk?

>>

>>

>> The code is ok.

>> AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as

>> the splits

>> will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and

>> TARGET_IWMMXT2

>> is irrelevant here).

>>

>> So this is ok for trunk.

>> Thanks, and sorry again for the delay.

>> Kyrill

>>

> 

> This patch (r251663) causes a regression on armeb-none-linux-gnueabihf

> --with-mode arm

> --with-cpu cortex-a9

> --with-fpu vfpv3-d16-fp16

> FAIL:    gcc.dg/vect/vect-singleton_1.c (internal compiler error)

> FAIL:    gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects

> (internal compiler error)

> 

> the test passes if gcc is configured --with-fpu neon-fp16

> 


Thank you very much for what you do!

I am able to reproduce this.

Combine creates an invalid insn out of these two insns:

(insn 12 8 18 2 (parallel [
             (set (reg:DI 122)
                 (plus:DI (reg:DI 116 [ aD.5331 ])
                     (reg:DI 119 [ bD.5332 ])))
             (clobber (reg:CC 100 cc))
         ]) "vect-singleton_1.c":28 1 {*arm_adddi3}
      (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ])
         (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ])
             (expr_list:REG_UNUSED (reg:CC 100 cc)
                 (nil)))))
(insn 18 12 19 2 (set (reg/i:DI 16 s0)
         (reg:DI 122)) "vect-singleton_1.c":28 650 {*movdi_vfp}
      (expr_list:REG_DEAD (reg:DI 122)
         (nil)))

=>

(insn 18 12 19 2 (parallel [
             (set (reg/i:DI 16 s0)
                 (plus:DI (reg:DI 116 [ aD.5331 ])
                     (reg:DI 119 [ bD.5332 ])))
             (clobber (reg:CC 100 cc))
         ]) "vect-singleton_1.c":28 1 {*arm_adddi3}
      (expr_list:REG_UNUSED (reg:CC 100 cc)
         (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ])
             (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ])
                 (nil)))))


But after reload this is fixed again, so that is no issue for a
splitter that runs always after reload.

But now the splitter runs into an assertion in gen_highpart(op0).
The same problem should exist with the subdi3 as well, but
not with the other patterns in this patch.

I think the right way to fix this is use a non-vfp register
predicate to prevent combine to create the un-splittable
instruction in the first place.

See attached my proposed fix for this defect.

Christophe, You could do me a favor, and test it in your environment
as I have nothing comparable.


Kyrill, is this patch OK after bootstrap / reg-testing?


Thanks
Bernd.
2017-09-05  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/predicates.md (s_register_operand_nv,
	arm_adddi_operand_nv): Create new non-vfp predicates.
	* config/arm/arm.md (*arm_adddi3, *arm_subdi3): Use new predicates.

Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 251663)
+++ gcc/config/arm/arm.md	(working copy)
@@ -457,14 +457,13 @@
 )
 
 (define_insn_and_split "*arm_adddi3"
-  [(set (match_operand:DI          0 "s_register_operand" "=&r,&r,&r,&r,&r")
-	(plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
-		 (match_operand:DI 2 "arm_adddi_operand"  "r,  0, r, Dd, Dd")))
+  [(set (match_operand:DI          0 "s_register_operand_nv" "=&r,&r,&r,&r,&r")
+	(plus:DI (match_operand:DI 1 "s_register_operand_nv" "%0, 0, r, 0, r")
+		 (match_operand:DI 2 "arm_adddi_operand_nv"  "r,  0, r, Dd, Dd")))
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"
-  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
-   && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
+  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC_C CC_REGNUM)
 		   (compare:CC_C (plus:SI (match_dup 1) (match_dup 2))
 				 (match_dup 1)))
@@ -1263,9 +1262,9 @@
 )
 
 (define_insn_and_split "*arm_subdi3"
-  [(set (match_operand:DI           0 "s_register_operand" "=&r,&r,&r")
-	(minus:DI (match_operand:DI 1 "s_register_operand" "0,r,0")
-		  (match_operand:DI 2 "s_register_operand" "r,0,0")))
+  [(set (match_operand:DI           0 "s_register_operand_nv" "=&r,&r,&r")
+	(minus:DI (match_operand:DI 1 "s_register_operand_nv" "0,r,0")
+		  (match_operand:DI 2 "s_register_operand_nv" "r,0,0")))
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"  ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
Index: gcc/config/arm/predicates.md
===================================================================
--- gcc/config/arm/predicates.md	(revision 251663)
+++ gcc/config/arm/predicates.md	(working copy)
@@ -653,3 +653,11 @@
   (ior (and (match_code "symbol_ref")
 	    (match_test "!arm_is_long_call_p (SYMBOL_REF_DECL (op))"))
        (match_operand 0 "s_register_operand")))
+
+(define_predicate "s_register_operand_nv"
+  (and (match_operand 0 "s_register_operand")
+       (not (match_operand 0 "vfp_hard_register_operand"))))
+
+(define_predicate "arm_adddi_operand_nv"
+  (and (match_operand 0 "arm_adddi_operand")
+       (not (match_operand 0 "vfp_hard_register_operand"))))
Wilco Dijkstra Sept. 5, 2017, 3:02 p.m. UTC | #8
Bernd Edlinger wrote:

> Combine creates an invalid insn out of these two insns:

Yes it looks like a latent bug. We need to use arm_general_register_operand
as arm_adddi3/subdi3 only allow integer registers. You don't need a new predicate
s_register_operand_nv. Also I'd prefer something like arm_general_adddi_operand.

+  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"

The split condition for adddi3 now looks more accurate indeed, although we could
remove the !TARGET_NEON from the split condition as this is always true given
arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON".

Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue.

Wilco
Kyrill Tkachov Sept. 5, 2017, 5:45 p.m. UTC | #9
Hi Bernd,

On 05/09/17 15:25, Bernd Edlinger wrote:
> Hi Christophe,
>
> On 09/05/17 10:45, Christophe Lyon wrote:
>> Hi Bernd,
>>
>>
>> On 4 September 2017 at 16:52, Kyrill  Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>>> On 29/04/17 18:45, Bernd Edlinger wrote:
>>>> Ping...
>>>>
>>>> I attached a rebased version since there was a merge conflict in
>>>> the xordi3 pattern, otherwise the patch is still identical.
>>>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2
>>>> early when the target has no neon or iwmmxt.
>>>>
>>>>
>>>> Thanks
>>>> Bernd.
>>>>
>>>>
>>>>
>>>> On 11/28/16 20:42, Bernd Edlinger wrote:
>>>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:
>>>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger
>>>>>> <bernd.edlinger@hotmail.de> wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> This improves the stack usage on the sha512 test case for the case
>>>>>>> without hardware fpu and without iwmmxt by splitting all di-mode
>>>>>>> patterns right while expanding which is similar to what the
>>>>>>> shift-pattern
>>>>>>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as well
>>>>>>> as
>>>>>>> thumb1.
>>>>>>>
>>>>>> I would go further and do this in the absence of Neon, the VFP unit
>>>>>> being there doesn't help with DImode operations i.e. we do not have 64
>>>>>> bit integer arithmetic instructions without Neon. The main reason why
>>>>>> we have the DImode patterns split so late is to give a chance for
>>>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this
>>>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC
>>>>>> map down to these instructions. Doing this just for soft-float doesn't
>>>>>> improve the default case only. I don't usually test iwmmxt and I'm not
>>>>>> sure who has the ability to do so, thus keeping this restriction for
>>>>>> iwMMX is fine.
>>>>>>
>>>>>>
>>>>> Yes I understand, thanks for pointing that out.
>>>>>
>>>>> I was not aware what iwmmxt exists at all, but I noticed that most
>>>>> 64bit expansions work completely different, and would break if we split
>>>>> the pattern early.
>>>>>
>>>>> I can however only look at the assembler outout for iwmmxt, and make
>>>>> sure that the stack usage does not get worse.
>>>>>
>>>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as
>>>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack
>>>>> for the test cases, and vfp and soft-float at around 270 bytes stack
>>>>> usage.
>>>>>
>>>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).
>>>>>>>
>>>>>>> Note this also splits many ldrd/strd instructions and therefore I will
>>>>>>> post a followup-patch that mitigates this effect by enabling the
>>>>>>> ldrd/strd
>>>>>>> peephole optimization after the necessary reg-testing.
>>>>>>>
>>>>>>>
>>>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>>>>> What do you mean by arm-linux-gnueabihf - when folks say that I
>>>>>> interpret it as --with-arch=armv7-a --with-float=hard
>>>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).
>>>>>>
>>>>>> If you've really bootstrapped and regtested it on armhf, doesn't this
>>>>>> patch as it stand have no effect there i.e. no change ?
>>>>>> arm-linux-gnueabihf usually means to me someone has configured with
>>>>>> --with-float=hard, so there are no regressions in the hard float ABI
>>>>>> case,
>>>>>>
>>>>> I know it proves little.  When I say arm-linux-gnueabihf
>>>>> I do in fact mean --enable-languages=all,ada,go,obj-c++
>>>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
>>>>> --with-float=hard.
>>>>>
>>>>> My main interest in the stack usage is of course not because of linux,
>>>>> but because of eCos where we have very small task stacks and in fact
>>>>> no fpu support by the O/S at all, so that patch is exactly what we need.
>>>>>
>>>>>
>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf
>>>>> Is it OK for trunk?
>>>
>>> The code is ok.
>>> AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as
>>> the splits
>>> will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and
>>> TARGET_IWMMXT2
>>> is irrelevant here).
>>>
>>> So this is ok for trunk.
>>> Thanks, and sorry again for the delay.
>>> Kyrill
>>>
>> This patch (r251663) causes a regression on armeb-none-linux-gnueabihf
>> --with-mode arm
>> --with-cpu cortex-a9
>> --with-fpu vfpv3-d16-fp16
>> FAIL:    gcc.dg/vect/vect-singleton_1.c (internal compiler error)
>> FAIL:    gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects
>> (internal compiler error)
>>
>> the test passes if gcc is configured --with-fpu neon-fp16
>>
> Thank you very much for what you do!
>
> I am able to reproduce this.
>
> Combine creates an invalid insn out of these two insns:
>
> (insn 12 8 18 2 (parallel [
>               (set (reg:DI 122)
>                   (plus:DI (reg:DI 116 [ aD.5331 ])
>                       (reg:DI 119 [ bD.5332 ])))
>               (clobber (reg:CC 100 cc))
>           ]) "vect-singleton_1.c":28 1 {*arm_adddi3}
>        (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ])
>           (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ])
>               (expr_list:REG_UNUSED (reg:CC 100 cc)
>                   (nil)))))
> (insn 18 12 19 2 (set (reg/i:DI 16 s0)
>           (reg:DI 122)) "vect-singleton_1.c":28 650 {*movdi_vfp}
>        (expr_list:REG_DEAD (reg:DI 122)
>           (nil)))
>
> =>
>
> (insn 18 12 19 2 (parallel [
>               (set (reg/i:DI 16 s0)
>                   (plus:DI (reg:DI 116 [ aD.5331 ])
>                       (reg:DI 119 [ bD.5332 ])))
>               (clobber (reg:CC 100 cc))
>           ]) "vect-singleton_1.c":28 1 {*arm_adddi3}
>        (expr_list:REG_UNUSED (reg:CC 100 cc)
>           (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ])
>               (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ])
>                   (nil)))))
>
>
> But after reload this is fixed again, so that is no issue for a
> splitter that runs always after reload.
>
> But now the splitter runs into an assertion in gen_highpart(op0).
> The same problem should exist with the subdi3 as well, but
> not with the other patterns in this patch.
>
> I think the right way to fix this is use a non-vfp register
> predicate to prevent combine to create the un-splittable
> instruction in the first place.
>
> See attached my proposed fix for this defect.
>
> Christophe, You could do me a favor, and test it in your environment
> as I have nothing comparable.
>
>
> Kyrill, is this patch OK after bootstrap / reg-testing?
>

Thanks for tracking this down so quickly.
Like Wilco said in his reply, you can just use arm_general_register_operand
and for the adddi_operand predicate use a name like 
arm_general_adddi_operand
(_nv associates to some with non-volatile or non-valid).

I don't think anddi_notdi_di has the same issue as that one still 
requires splitting after reload
in all cases and the registers are guaranteed to be core registers by then.

So ok with using arm_general_register_operand and renaming the new 
adjusted adddi_operand-based predicate to arm_general_adddi_operand, 
assuming bootstrap and testing goes ok.

Thanks,
Kyrill
Bernd Edlinger Sept. 5, 2017, 5:48 p.m. UTC | #10
On 09/05/17 17:02, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:

> 

>> Combine creates an invalid insn out of these two insns:

> 

> Yes it looks like a latent bug. We need to use arm_general_register_operand

> as arm_adddi3/subdi3 only allow integer registers. You don't need a new predicate

> s_register_operand_nv. Also I'd prefer something like arm_general_adddi_operand.

> 


Thanks, attached is a patch following your suggestion.

> +  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"

> 

> The split condition for adddi3 now looks more accurate indeed, although we could

> remove the !TARGET_NEON from the split condition as this is always true given

> arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON".

> 


No, the split condition does not begin with "&& TARGET_32BIT...".
Therefore the split is enabled in TARGET_NEON after reload_completed.
And it is invoked from adddi3_neon for all alternatives without vfp
registers:

   switch (which_alternative)
     {
     case 0: /* fall through */
     case 3: return "vadd.i64\t%P0, %P1, %P2";
     case 1: return "#";
     case 2: return "#";
     case 4: return "#";
     case 5: return "#";
     case 6: return "#";



> Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue.

> 


Yes, that pattern can be cleaned up in a follow-up patch.
Note this splitter is invoked from bicdi3_neon as well.
However I think anddi_notdi_di should be safe as long as it is enabled
after reload_completed (which is probably a bug).


Bernd.

> Wilco

>
2017-09-05  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/predicates.md (arm_general_adddi_operand): Create new
	non-vfp predicate.
	* config/arm/arm.md (*arm_adddi3, *arm_subdi3): Use new predicates.

Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 251663)
+++ gcc/config/arm/arm.md	(working copy)
@@ -457,14 +457,13 @@
 )
 
 (define_insn_and_split "*arm_adddi3"
-  [(set (match_operand:DI          0 "s_register_operand" "=&r,&r,&r,&r,&r")
-	(plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
-		 (match_operand:DI 2 "arm_adddi_operand"  "r,  0, r, Dd, Dd")))
+  [(set (match_operand:DI          0 "arm_general_register_operand" "=&r,&r,&r,&r,&r")
+	(plus:DI (match_operand:DI 1 "arm_general_register_operand" "%0, 0, r, 0, r")
+		 (match_operand:DI 2 "arm_general_adddi_operand"    "r,  0, r, Dd, Dd")))
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"
-  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
-   && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
+  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC_C CC_REGNUM)
 		   (compare:CC_C (plus:SI (match_dup 1) (match_dup 2))
 				 (match_dup 1)))
@@ -1263,9 +1262,9 @@
 )
 
 (define_insn_and_split "*arm_subdi3"
-  [(set (match_operand:DI           0 "s_register_operand" "=&r,&r,&r")
-	(minus:DI (match_operand:DI 1 "s_register_operand" "0,r,0")
-		  (match_operand:DI 2 "s_register_operand" "r,0,0")))
+  [(set (match_operand:DI           0 "arm_general_register_operand" "=&r,&r,&r")
+	(minus:DI (match_operand:DI 1 "arm_general_register_operand" "0,r,0")
+		  (match_operand:DI 2 "arm_general_register_operand" "r,0,0")))
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"  ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
Index: gcc/config/arm/predicates.md
===================================================================
--- gcc/config/arm/predicates.md	(revision 251663)
+++ gcc/config/arm/predicates.md	(working copy)
@@ -82,6 +82,11 @@
 	      || REGNO (op) >= FIRST_PSEUDO_REGISTER));
 })
 
+(define_predicate "arm_general_adddi_operand"
+  (ior (match_operand 0 "arm_general_register_operand")
+       (and (match_code "const_int")
+	    (match_test "const_ok_for_dimode_op (INTVAL (op), PLUS)"))))
+
 (define_predicate "vfp_register_operand"
   (match_code "reg,subreg")
 {
Kyrill Tkachov Sept. 5, 2017, 5:53 p.m. UTC | #11
On 05/09/17 18:48, Bernd Edlinger wrote:
> On 09/05/17 17:02, Wilco Dijkstra wrote:
>> Bernd Edlinger wrote:
>>
>>> Combine creates an invalid insn out of these two insns:
>> Yes it looks like a latent bug. We need to use arm_general_register_operand
>> as arm_adddi3/subdi3 only allow integer registers. You don't need a new predicate
>> s_register_operand_nv. Also I'd prefer something like arm_general_adddi_operand.
>>
> Thanks, attached is a patch following your suggestion.
>
>> +  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
>>
>> The split condition for adddi3 now looks more accurate indeed, although we could
>> remove the !TARGET_NEON from the split condition as this is always true given
>> arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON".
>>
> No, the split condition does not begin with "&& TARGET_32BIT...".
> Therefore the split is enabled in TARGET_NEON after reload_completed.
> And it is invoked from adddi3_neon for all alternatives without vfp
> registers:
>
>     switch (which_alternative)
>       {
>       case 0: /* fall through */
>       case 3: return "vadd.i64\t%P0, %P1, %P2";
>       case 1: return "#";
>       case 2: return "#";
>       case 4: return "#";
>       case 5: return "#";
>       case 6: return "#";
>
>
>
>> Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue.
>>
> Yes, that pattern can be cleaned up in a follow-up patch.
> Note this splitter is invoked from bicdi3_neon as well.
> However I think anddi_notdi_di should be safe as long as it is enabled
> after reload_completed (which is probably a bug).
>

Thanks, that's what I had in mind in my other reply.
This is ok if testing comes back ok.

Kyrill


> Bernd.
>
>> Wilco
>>
Christophe Lyon Sept. 5, 2017, 6:20 p.m. UTC | #12
On 5 September 2017 at 19:53, Kyrill  Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> On 05/09/17 18:48, Bernd Edlinger wrote:
>>
>> On 09/05/17 17:02, Wilco Dijkstra wrote:
>>>
>>> Bernd Edlinger wrote:
>>>
>>>> Combine creates an invalid insn out of these two insns:
>>>
>>> Yes it looks like a latent bug. We need to use
>>> arm_general_register_operand
>>> as arm_adddi3/subdi3 only allow integer registers. You don't need a new
>>> predicate
>>> s_register_operand_nv. Also I'd prefer something like
>>> arm_general_adddi_operand.
>>>
>> Thanks, attached is a patch following your suggestion.
>>
>>> +  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) ||
>>> reload_completed)"
>>>
>>> The split condition for adddi3 now looks more accurate indeed, although
>>> we could
>>> remove the !TARGET_NEON from the split condition as this is always true
>>> given
>>> arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON".
>>>
>> No, the split condition does not begin with "&& TARGET_32BIT...".
>> Therefore the split is enabled in TARGET_NEON after reload_completed.
>> And it is invoked from adddi3_neon for all alternatives without vfp
>> registers:
>>
>>     switch (which_alternative)
>>       {
>>       case 0: /* fall through */
>>       case 3: return "vadd.i64\t%P0, %P1, %P2";
>>       case 1: return "#";
>>       case 2: return "#";
>>       case 4: return "#";
>>       case 5: return "#";
>>       case 6: return "#";
>>
>>
>>
>>> Also there are more cases, a quick grep suggests *anddi_notdi_di has the
>>> same issue.
>>>
>> Yes, that pattern can be cleaned up in a follow-up patch.
>> Note this splitter is invoked from bicdi3_neon as well.
>> However I think anddi_notdi_di should be safe as long as it is enabled
>> after reload_completed (which is probably a bug).
>>
>
> Thanks, that's what I had in mind in my other reply.
> This is ok if testing comes back ok.
>

I've submitted the patch for testing, I'll let you know about the results.

Christophe

> Kyrill
>
>
>> Bernd.
>>
>>> Wilco
>>>
>
Wilco Dijkstra Sept. 5, 2017, 9:27 p.m. UTC | #13
Bernd Edlinger wrote:
> No, the split condition does not begin with "&& TARGET_32BIT...".
> Therefore the split is enabled in TARGET_NEON after reload_completed.
> And it is invoked from adddi3_neon for all alternatives without vfp
> registers:

Hmm that's a huge mess. I'd argue that any inst_and_split should only split
it's own instruction, never other instructions (especially if they are from
different md files, which is extremely confusing). Otherwise we should use
a separate split and explicitly list which instructions it splits.

So then the next question is whether the neon_adddi3 still needs the
arm_adddi3 splitter in some odd corner cases?

> > Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue. 

> Yes, that pattern can be cleaned up in a follow-up patch.

And there are a lot more instructions that need the same treatment and split
early (probably best at expand time). I noticed none of the zero/sign extends
split before regalloc for example.

> Note this splitter is invoked from bicdi3_neon as well.
> However I think anddi_notdi_di should be safe as long as it is enabled
> after reload_completed (which is probably a bug).

Since we should be splitting and/bic early now I don't think you can get anddi_notdi
anymore. So it could be removed completely assuming Neon already does the right
thing.

It looks like we need to do a full pass over all DI mode instructions and clean up
all the mess.

Wilco
Christophe Lyon Sept. 6, 2017, 7:35 a.m. UTC | #14
On 5 September 2017 at 20:20, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> On 5 September 2017 at 19:53, Kyrill  Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>>
>> On 05/09/17 18:48, Bernd Edlinger wrote:
>>>
>>> On 09/05/17 17:02, Wilco Dijkstra wrote:
>>>>
>>>> Bernd Edlinger wrote:
>>>>
>>>>> Combine creates an invalid insn out of these two insns:
>>>>
>>>> Yes it looks like a latent bug. We need to use
>>>> arm_general_register_operand
>>>> as arm_adddi3/subdi3 only allow integer registers. You don't need a new
>>>> predicate
>>>> s_register_operand_nv. Also I'd prefer something like
>>>> arm_general_adddi_operand.
>>>>
>>> Thanks, attached is a patch following your suggestion.
>>>
>>>> +  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) ||
>>>> reload_completed)"
>>>>
>>>> The split condition for adddi3 now looks more accurate indeed, although
>>>> we could
>>>> remove the !TARGET_NEON from the split condition as this is always true
>>>> given
>>>> arm_adddi3 uses "TARGET_32BIT && !TARGET_NEON".
>>>>
>>> No, the split condition does not begin with "&& TARGET_32BIT...".
>>> Therefore the split is enabled in TARGET_NEON after reload_completed.
>>> And it is invoked from adddi3_neon for all alternatives without vfp
>>> registers:
>>>
>>>     switch (which_alternative)
>>>       {
>>>       case 0: /* fall through */
>>>       case 3: return "vadd.i64\t%P0, %P1, %P2";
>>>       case 1: return "#";
>>>       case 2: return "#";
>>>       case 4: return "#";
>>>       case 5: return "#";
>>>       case 6: return "#";
>>>
>>>
>>>
>>>> Also there are more cases, a quick grep suggests *anddi_notdi_di has the
>>>> same issue.
>>>>
>>> Yes, that pattern can be cleaned up in a follow-up patch.
>>> Note this splitter is invoked from bicdi3_neon as well.
>>> However I think anddi_notdi_di should be safe as long as it is enabled
>>> after reload_completed (which is probably a bug).
>>>
>>
>> Thanks, that's what I had in mind in my other reply.
>> This is ok if testing comes back ok.
>>
>
> I've submitted the patch for testing, I'll let you know about the results.
>

I can confirm the last patch does fix the regression I reported, and causes
no other regression. (The previous previous of the fix, worked, too).

Thanks for the prompt fix.

Christophe

> Christophe
>
>> Kyrill
>>
>>
>>> Bernd.
>>>
>>>> Wilco
>>>>
>>
Bernd Edlinger Sept. 6, 2017, 9:31 a.m. UTC | #15
On 09/05/17 23:27, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:

>> No, the split condition does not begin with "&& TARGET_32BIT...".

>> Therefore the split is enabled in TARGET_NEON after reload_completed.

>> And it is invoked from adddi3_neon for all alternatives without vfp

>> registers:

> 

> Hmm that's a huge mess. I'd argue that any inst_and_split should only split

> it's own instruction, never other instructions (especially if they are from

> different md files, which is extremely confusing). Otherwise we should use

> a separate split and explicitly list which instructions it splits.

> 


This is literally a mine-field.

> So then the next question is whether the neon_adddi3 still needs the

> arm_adddi3 splitter in some odd corner cases?

> 


Yes, I think so.
While *arm_adddi3 and adddi3_neon insn are mutually exclusive,
they share the splitter.

I don't know with which iwmmxt-pattern the *arm_adddi3-splitter might
interfere, therefore I don't know if the !TARGET_IWMMXT can be removed
from the splitter condition.


Other patterns have a iwmmxt-twin that is not mutually exclusive.
For instance *anddi_notdi_di, bicdi3_neon and iwmmxt_nanddi3
The bicdi3_neon insn duplicates the alternatives while iwmmxt does not.

And nobody is able to test iwmmxt.

>>> Also there are more cases, a quick grep suggests *anddi_notdi_di has the same issue.

> 

>> Yes, that pattern can be cleaned up in a follow-up patch.

> 

> And there are a lot more instructions that need the same treatment and split

> early (probably best at expand time). I noticed none of the zero/sign extends

> split before regalloc for example.

> 


I did use the test cases as benchmarks to decide which way to go.

It is quite easy to try different combinations of cpu and inspect
the stack usage of neon, iwmmxt, and vfp etc.

It may be due to interaction with other patterns, but not in every case
the split at expand time produced the best results.

>> Note this splitter is invoked from bicdi3_neon as well.

>> However I think anddi_notdi_di should be safe as long as it is enabled

>> after reload_completed (which is probably a bug).

> 

> Since we should be splitting and/bic early now I don't think you can get anddi_notdi

> anymore. So it could be removed completely assuming Neon already does the right

> thing.

> > It looks like we need to do a full pass over all DI mode instructions 

and clean up
> all the mess.

> 


Yes, but in small steps, and using some benchmarks to make sure that
each step does improve at least something.


Bernd.

> Wilco

>
diff mbox

Patch

2017-04-26  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.md (*arm_adddi3, *arm_subdi3): Split early except for
	TARGET_NEON and TARGET_IWMMXT.
	(anddi3, iordi3, xordi3, one_cmpldi2): Split while expanding except for
	TARGET_NEON and TARGET_IWMMXT.
	(*one_cmpldi2_insn): Moved the body of one_cmpldi2 here.

testsuite:
2017-04-26  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* gcc.target/arm/pr77308-1.c: New test.

Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 247222)
+++ gcc/config/arm/arm.md	(working copy)
@@ -467,7 +467,7 @@ 
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"
-  "TARGET_32BIT && reload_completed
+  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
    && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
   [(parallel [(set (reg:CC_C CC_REGNUM)
 		   (compare:CC_C (plus:SI (match_dup 1) (match_dup 2))
@@ -1273,7 +1273,7 @@ 
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"  ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
-  "&& reload_completed"
+  "&& (!TARGET_IWMMXT || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))])
@@ -2259,7 +2259,24 @@ 
 	(and:DI (match_operand:DI 1 "s_register_operand" "")
 		(match_operand:DI 2 "neon_inv_logic_op2" "")))]
   "TARGET_32BIT"
-  ""
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low  = simplify_gen_binary (AND, SImode,
+				      gen_lowpart (SImode, operands[1]),
+				      gen_lowpart (SImode, operands[2]));
+      rtx high = simplify_gen_binary (AND, SImode,
+				      gen_highpart (SImode, operands[1]),
+				      gen_highpart_mode (SImode, DImode,
+							 operands[2]));
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above.  */
+  "
 )
 
 (define_insn_and_split "*anddi3_insn"
@@ -3132,7 +3149,24 @@ 
 	(ior:DI (match_operand:DI 1 "s_register_operand" "")
 		(match_operand:DI 2 "neon_logic_op2" "")))]
   "TARGET_32BIT"
-  ""
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low  = simplify_gen_binary (IOR, SImode,
+				      gen_lowpart (SImode, operands[1]),
+				      gen_lowpart (SImode, operands[2]));
+      rtx high = simplify_gen_binary (IOR, SImode,
+				      gen_highpart (SImode, operands[1]),
+				      gen_highpart_mode (SImode, DImode,
+							 operands[2]));
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above.  */
+  "
 )
 
 (define_insn_and_split "*iordi3_insn"
@@ -3320,6 +3354,22 @@ 
        no NEON instructions that take an immediate.  */
     if (TARGET_IWMMXT && !REG_P (operands[2]))
       operands[2] = force_reg (DImode, operands[2]);
+    if (!TARGET_NEON && !TARGET_IWMMXT)
+      {
+	rtx low  = simplify_gen_binary (XOR, SImode,
+					gen_lowpart (SImode, operands[1]),
+					gen_lowpart (SImode, operands[2]));
+	rtx high = simplify_gen_binary (XOR, SImode,
+					gen_highpart (SImode, operands[1]),
+					gen_highpart_mode (SImode, DImode,
+							   operands[2]));
+
+	emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+	emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+	DONE;
+      }
+    /* Otherwise expand pattern as above.  */
   }
 )
 
@@ -5031,7 +5081,31 @@ 
   "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP_DOUBLE"
   "")
 
-(define_insn_and_split "one_cmpldi2"
+(define_expand "one_cmpldi2"
+  [(set (match_operand:DI 0 "s_register_operand" "")
+	(not:DI (match_operand:DI 1 "s_register_operand" "")))]
+  "TARGET_32BIT"
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low  = simplify_gen_unary (NOT, SImode,
+				     gen_lowpart (SImode, operands[1]),
+				     SImode);
+      rtx high = simplify_gen_unary (NOT, SImode,
+				     gen_highpart_mode (SImode, DImode,
+							operands[1]),
+				     SImode);
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above.  */
+  "
+)
+
+(define_insn_and_split "*one_cmpldi2_insn"
   [(set (match_operand:DI 0 "s_register_operand"	 "=w,&r,&r,?w")
 	(not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]
   "TARGET_32BIT"
Index: gcc/testsuite/gcc.target/arm/pr77308-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/pr77308-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/pr77308-1.c	(working copy)
@@ -0,0 +1,169 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with bit-not "~" at the Sigma-blocks.
+   It improves the test coverage of one_cmpldi2 and subdi3 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C)     C##ULL
+
+#define SHA_LBLOCK      16
+#define SHA512_CBLOCK   (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s)       (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x)       ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
+#define Sigma1(x)       ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
+#define sigma0(x)       ~(ROTR((x),1)  ^ ROTR((x),8)  ^ ((x)>>7))
+#define sigma1(x)       ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))
+#define Ch(x,y,z)       (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z)      (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h)          do {    \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;                } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X)      do {    \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);               } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                                    unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}