[AArch64] Implement ADD in vector registers for 32-bit scalar values.

Submitter James Greenhalgh
Date March 28, 2014, 9:12 a.m.
Message ID <1395997970-27335-1-git-send-email-james.greenhalgh@arm.com>
Permalink /patch/334621/
State New

Comments

James Greenhalgh - March 28, 2014, 9:12 a.m.
Hi,

There is no way to perform scalar addition in the vector register file,
but with the RTX costs in place we start rewriting (x << 1) to (x + x)
on almost all cores. The code which makes this decision has no idea that
the value will end up in vector registers - it runs well before reload -
and so we get very ugly code generation when addition was selected but
we are operating in vector registers.
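
To make that concrete, here is a minimal illustration of my own (not
taken from the patch; the exact sequences depend on the core and
compiler revision). For a function like

  int
  double_it (int x)
  {
    return x << 1;   /* RTX costs rewrite this to x + x on most cores.  */
  }

if x happens to live in a SIMD register (say, after a vector loop or a
spill), the addition previously had to bounce through the general
registers, along the lines of

  fmov  w0, s0              // leave the vector register file
  add   w0, w0, w0          // scalar add exists only on general registers
  fmov  s0, w0              // and move back again

whereas with the new alternative the value stays put:

  add   v0.2s, v0.2s, v0.2s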

This patch relies on the same gimmick we are already using to allow
shifts on 32-bit scalars in the vector register file: use a 32x2 vector
operation instead, knowing that we can safely ignore the top bits (the
unused second lane).
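
Concretely - the full diff appears in the Patch section at the bottom of
this page - the *addsi3_aarch64 pattern gains a new alternative with all
three operands in "w" registers, emitting a two-lane vector add:

  (match_operand:SI 0 "register_operand" "=rk,rk,w,rk")
  ...
  add\\t%0.2s, %1.2s, %2.2s

Lane 0 holds the 32-bit scalar; lane 1 computes a garbage sum that
nothing ever reads.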

This restores some normality to scalar_shift_1.c; however, the test that
we generate a left shift by one is clearly bogus, so remove it.

This patch is pretty ugly, but it does generate superficially
better-looking code for this testcase.

Tested on aarch64-none-elf with no issues.

OK for stage 1?

Thanks,
James

---
gcc/

2014-03-27  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/aarch64.md (*addsi3_aarch64): Add alternative in
	vector registers.

gcc/testsuite/
2014-03-27  James Greenhalgh  <james.greenhalgh@arm.com>

	* gcc.target/aarch64/scalar_shift_1.c: Fix expected assembler.
Andrew Pinski - March 28, 2014, 11:11 a.m.
> On Mar 28, 2014, at 2:12 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> [...]
> 
> OK for stage 1?

It seems we should also discourage the neon alternatives as there might be extra movement between the two register sets, which we don't want.

Thanks,
Andrew


James Greenhalgh - March 28, 2014, 2:48 p.m.
On Fri, Mar 28, 2014 at 11:11:58AM +0000, pinskia@gmail.com wrote:
> > [...]
> 
> It seems we should also discourage the neon alternatives as there might be
> extra movement between the two register sets, which we don't want.

I see your point, but we've tried to avoid doing that elsewhere in the
AArch64 backend. Our argument has been that, strictly speaking, it isn't
the alternative that is expensive, it is the movement between the register
sets. We do model that elsewhere, and the register allocator should already
be trying to avoid unnecessary moves between register classes.
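
(The modelling referred to is the register-move-cost target hook. A toy
sketch of the idea - illustrative numbers only, not the actual aarch64.c
code:

  /* Toy model of GCC's TARGET_REGISTER_MOVE_COST hook: moves within one
     register file are cheap, crossings between the general and vector
     files cost more, so the allocator avoids them where it can.  */
  enum reg_file { GP_REGS, FP_REGS };

  static int
  toy_register_move_cost (enum reg_file from, enum reg_file to)
  {
    if (from == to)
      return 2;    /* a plain register-to-register move */
    return 5;      /* a GP<->FP fmov, modelled as more expensive */
  }

The real hook is keyed on machine mode and register classes; the point is
only that the cost of the crossing, not of the alternative, is what gets
modelled.)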

If those mechanisms are broken, we should fix them - fixing this by
discouraging valid alternatives would just be gaffer-taping over the
real problem.

Thanks,
James

Andrew Pinski - March 28, 2014, 3:09 p.m.
> On Mar 28, 2014, at 7:48 AM, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> 
> [...]
> 
> I see your point, but we've tried to avoid doing that elsewhere in the
> AArch64 backend. Our argument has been that, strictly speaking, it isn't
> the alternative that is expensive, it is the movement between the register
> sets. We do model that elsewhere, and the register allocator should already
> be trying to avoid unnecessary moves between register classes.
> 

What about a specific core where that alternative is expensive - that is, where the vector instructions are worse than the scalar ones? How are we going to handle that case?

Thanks,
Andrew


James Greenhalgh - March 28, 2014, 3:39 p.m.
On Fri, Mar 28, 2014 at 03:09:22PM +0000, pinskia@gmail.com wrote:
> [...]
> What about a specific core where that alternative is expensive - that is,
> where the vector instructions are worse than the scalar ones? How are we
> going to handle that case?

Certainly not by discouraging the alternative for all cores. We would need
a more nuanced approach which could be tuned on a per-core basis. Otherwise
we are bluntly and inaccurately pessimizing those cases where we can cheaply
perform the operation in the vector register file (e.g. we are cleaning up
loose ends after a vector loop, or we have spilled to the vector register
file). The register preference mechanism feels like the wrong place to
catch this, as it does not allow for that degree of per-core flexibility;
an alternative is simply "disparaged slightly" (?, * in LRA) or
"disparaged severely" (!).

I would think that we don't want to start polluting the machine description
trying to hack around this, as was done with the ARM backend's
neon_for_64_bits/avoid_neon_for_64_bits.
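
(If a per-core gate were genuinely wanted, the generic "enabled" insn
attribute could express one - a sketch only, where TARGET_AVOID_SISD_ALU
is a hypothetical tuning macro, not an existing flag:

  (define_attr "enabled" "no,yes"
    (cond [(and (eq_attr "type" "neon_add")
                (match_test "TARGET_AVOID_SISD_ALU"))
           (const_string "no")]
          (const_string "yes")))

which is roughly the shape of the ARM backend hack being objected to.)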

How have other targets solved this issue?

Thanks,
James


Patch

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index c86a29d8e7f8df21f25e14d22df1c3e8c37c907f..9c544a0a473732ebdf9238205db96d0d0c57de9a 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1063,16 +1063,17 @@  (define_expand "add<mode>3"
 
 (define_insn "*addsi3_aarch64"
   [(set
-    (match_operand:SI 0 "register_operand" "=rk,rk,rk")
+    (match_operand:SI 0 "register_operand" "=rk,rk,w,rk")
     (plus:SI
-     (match_operand:SI 1 "register_operand" "%rk,rk,rk")
-     (match_operand:SI 2 "aarch64_plus_operand" "I,r,J")))]
+     (match_operand:SI 1 "register_operand" "%rk,rk,w,rk")
+     (match_operand:SI 2 "aarch64_plus_operand" "I,r,w,J")))]
   ""
   "@
   add\\t%w0, %w1, %2
   add\\t%w0, %w1, %w2
+  add\\t%0.2s, %1.2s, %2.2s
   sub\\t%w0, %w1, #%n2"
-  [(set_attr "type" "alu_imm,alu_reg,alu_imm")]
+  [(set_attr "type" "alu_imm,alu_reg,neon_add,alu_imm")]
 )
 
 ;; zero_extend version of above
diff --git a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
index 7cb17f8..826bafc 100644
--- a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
@@ -193,7 +193,6 @@  test_corners_sisd_di (Int64x1 b)
   return b;
 }
 /* { dg-final { scan-assembler "sshr\td\[0-9\]+,\ d\[0-9\]+,\ 63" } } */
-/* { dg-final { scan-assembler "shl\td\[0-9\]+,\ d\[0-9\]+,\ 1" } } */
 
 Int32x1
 test_corners_sisd_si (Int32x1 b)
@@ -207,7 +206,6 @@  test_corners_sisd_si (Int32x1 b)
   return b;
 }
 /* { dg-final { scan-assembler "sshr\tv\[0-9\]+\.2s,\ v\[0-9\]+\.2s,\ 31" } } */
-/* { dg-final { scan-assembler "shl\tv\[0-9\]+\.2s,\ v\[0-9\]+\.2s,\ 1" } } */