diff mbox series

, Fix PR target/pr83862: Fix PowerPC long double signbit with -mabi=ieeelongdouble

Message ID 20180117035543.GA1373@ibm-tiger.the-meissners.org
State New
Headers show
Series , Fix PR target/pr83862: Fix PowerPC long double signbit with -mabi=ieeelongdouble | expand

Commit Message

Michael Meissner Jan. 17, 2018, 3:55 a.m. UTC
PR target/83862 pointed out a problem I put into the 128-bit floating point
type signbit optimization.  The issue is we want to avoid doing a load to a
floating point/vector register and then a direct move to do signbit, so we
change the load to load the upper 64-bits of the floating point value to get
the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
addressed with an indexed address on a little endian system, it generates an
illegal address and generates an internal compiler error.

I have tested this on a little endian power8 system, with bootstrap compilers.
There was not regression, and the new test passes.  Can I install this into the
trunk?

The same code is also in GCC 6 and 7.  While, -mabi=ieeelongdouble is not
supported in those releases, you can get a failure if you use an explicit
_Float128 type instead of long double.  Assuming that the bug shows up, can I
apply these patches to those branches as well?

[gcc]
2018-01-16  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/83862
	* config/rs6000/rs6000.c (rs6000_split_signbit): Do not create an
	illegal address on little endian systems if the source value is in
	memory addressed with indexed addressing.
	* config/rs6000/rs6000.md (signbit<mode>2_dm): Likewise.
	(signbit<mode>2_dm_<su>ext): Likewise.

[gcc/testsuite]
2018-01-16  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/83862
	* gcc.target/powerpc/pr83862.c: New test.

Comments

Segher Boessenkool Jan. 17, 2018, 10:09 p.m. UTC | #1
On Tue, Jan 16, 2018 at 10:55:43PM -0500, Michael Meissner wrote:
> PR target/83862 pointed out a problem I put into the 128-bit floating point
> type signbit optimization.  The issue is we want to avoid doing a load to a
> floating point/vector register and then a direct move to do signbit, so we
> change the load to load the upper 64-bits of the floating point value to get
> the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
> addressed with an indexed address on a little endian system, it generates an
> illegal address and generates an internal compiler error.

So all this is caused by these splitters running after reload.  Why do
we have to do that?  Do we?  We should be able to just change it to a
subreg and shift, early already?


Segher
Michael Meissner Jan. 17, 2018, 11:40 p.m. UTC | #2
On Wed, Jan 17, 2018 at 04:09:57PM -0600, Segher Boessenkool wrote:
> On Tue, Jan 16, 2018 at 10:55:43PM -0500, Michael Meissner wrote:
> > PR target/83862 pointed out a problem I put into the 128-bit floating point
> > type signbit optimization.  The issue is we want to avoid doing a load to a
> > floating point/vector register and then a direct move to do signbit, so we
> > change the load to load the upper 64-bits of the floating point value to get
> > the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
> > addressed with an indexed address on a little endian system, it generates an
> > illegal address and generates an internal compiler error.
> 
> So all this is caused by these splitters running after reload.  Why do
> we have to do that?  Do we?  We should be able to just change it to a
> subreg and shift, early already?

The part that is failing is trying to optimize the case:

	x = signbit (*p)

Doing the code with just registers means that you will get:

	vr = load *p
	gr = diret move from vr
	shift

With the optimization you get:

	gr = 'upper' word
	shift

If the address is:

	base + index

In little endian, the compiler tried to generate:

	(base + index) + 8

And it didn't have a temporary register to put base+index.  It only shows up on
a little endian system on a type that does REG+REG addressing.  IBM extended
double does not allow REG+REG addressing, so it wasn't an issue for that.  But
with GLIBC starting to test -mabi=ieeelongdouble, it showed up.

I believe when I worked on it, we were mostly big endian, and the IEEE stuff
wasn't yet completely solid, so it was a thinko on my part.

I did the memory optimization to GPR and avoid the direct move because it is
used frequently in the GLIBC math library (and direct move on power8 being on
the slow side).
Segher Boessenkool Jan. 18, 2018, 6:39 p.m. UTC | #3
On Wed, Jan 17, 2018 at 06:40:12PM -0500, Michael Meissner wrote:
> On Wed, Jan 17, 2018 at 04:09:57PM -0600, Segher Boessenkool wrote:
> > On Tue, Jan 16, 2018 at 10:55:43PM -0500, Michael Meissner wrote:
> > > PR target/83862 pointed out a problem I put into the 128-bit floating point
> > > type signbit optimization.  The issue is we want to avoid doing a load to a
> > > floating point/vector register and then a direct move to do signbit, so we
> > > change the load to load the upper 64-bits of the floating point value to get
> > > the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
> > > addressed with an indexed address on a little endian system, it generates an
> > > illegal address and generates an internal compiler error.
> > 
> > So all this is caused by these splitters running after reload.  Why do
> > we have to do that?  Do we?  We should be able to just change it to a
> > subreg and shift, early already?
> 
> The part that is failing is trying to optimize the case:
> 
> 	x = signbit (*p)
> 
> Doing the code with just registers means that you will get:
> 
> 	vr = load *p
> 	gr = diret move from vr
> 	shift
> 
> With the optimization you get:
> 
> 	gr = 'upper' word
> 	shift
> 
> If the address is:
> 
> 	base + index
> 
> In little endian, the compiler tried to generate:
> 
> 	(base + index) + 8
> 
> And it didn't have a temporary register to put base+index.

Yes, but you won't have this problem if you do this before RA.  Is there
any good reason to do it only after RA?

If you do it before RA you immediately get

	some_di = load upper from *p
	shift

just as you want (and things might even be optimised further).


Segher
Michael Meissner Jan. 18, 2018, 6:51 p.m. UTC | #4
On Thu, Jan 18, 2018 at 12:39:05PM -0600, Segher Boessenkool wrote:
> On Wed, Jan 17, 2018 at 06:40:12PM -0500, Michael Meissner wrote:
> > On Wed, Jan 17, 2018 at 04:09:57PM -0600, Segher Boessenkool wrote:
> > > On Tue, Jan 16, 2018 at 10:55:43PM -0500, Michael Meissner wrote:
> > > > PR target/83862 pointed out a problem I put into the 128-bit floating point
> > > > type signbit optimization.  The issue is we want to avoid doing a load to a
> > > > floating point/vector register and then a direct move to do signbit, so we
> > > > change the load to load the upper 64-bits of the floating point value to get
> > > > the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
> > > > addressed with an indexed address on a little endian system, it generates an
> > > > illegal address and generates an internal compiler error.
> > > 
> > > So all this is caused by these splitters running after reload.  Why do
> > > we have to do that?  Do we?  We should be able to just change it to a
> > > subreg and shift, early already?
> > 
> > The part that is failing is trying to optimize the case:
> > 
> > 	x = signbit (*p)
> > 
> > Doing the code with just registers means that you will get:
> > 
> > 	vr = load *p
> > 	gr = diret move from vr
> > 	shift
> > 
> > With the optimization you get:
> > 
> > 	gr = 'upper' word
> > 	shift
> > 
> > If the address is:
> > 
> > 	base + index
> > 
> > In little endian, the compiler tried to generate:
> > 
> > 	(base + index) + 8
> > 
> > And it didn't have a temporary register to put base+index.
> 
> Yes, but you won't have this problem if you do this before RA.  Is there
> any good reason to do it only after RA?

Yes you will, because it is a memory address not a register.

> If you do it before RA you immediately get
> 
> 	some_di = load upper from *p
> 	shift
> 
> just as you want (and things might even be optimised further).

And in my experience, splitting such loads and changing the type before RA,
often times leads to error.
Michael Meissner Jan. 22, 2018, 12:03 a.m. UTC | #5
I've reworked the patch for PR83864.  This bug is due to the compiler issuing
an internal error for code of the form on a little endian system:

    int sb (_Float128 *p, unsigned long n) { return __builtin_signbit (p[n]); }

The problem is that the memory optimization wants to load the high double word
from memory to isolate the signbit.  In little endian, the high word is at
offset 8 instead of 0.

The bug will also show up if you use the long double type, and you use the
-mabi=ieeelongdouble option.

Previously the code did one UNSPEC (signbit<mode>2_dm) that was done before
register allocation, and combined the value being in vector register, memory,
and GPR register along with the shift to isolate the signbit.  Then after the
split after register allocation, it created a second UNSPEC
(signbit<mode>2_dm2) that was just the direct move, or it did the memory and
GPR code path separately.

With these patches, the generator code issues an UNSPEC (signbit<mode>2_dm)
just to get the high double word, and the shift right is then emitted.  The
UNSPEC only handles the value being in either vector or GPR register.  There is
a second UNSPEC that is created by the combiner if the value is in memory.  On
little endian systems, the first split pass (before register allocation) will
allocate a pseudo register to hold the initial ADD of the base and index
registers for indexed loads, and then forms a LD reg,8(tmp) to load the high
word.  Just in case, some code after register allocation reforms the UNSPEC, it
uses a base register for the load, and it can use the base register as needed
for the temporary.

I have run this code on both little endian and big endian power8 systems, doing
bootstrap and regression test.  There were no regressions.  Can I install this
bug on the trunk?

I don't believe the bug shows up in gcc 6/7 branches, but I will build these
and test to see if the code is needed to be backported.

[gcc]
2018-01-21  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/83862
	* config/rs6000/rs6000-protos.h (rs6000_split_signbit): Delete,
	no longer used.
	* config/rs6000/rs6000.c (rs6000_split_signbit): Likewise.
	* config/rs6000/rs6000.md (signbit<mode>2): Change code for IEEE
	128-bit to produce an UNSPEC move to get the double word with the
	signbit and then a shift directly to do signbit.
	(signbit<mode>2_dm): Replace old IEEE 128-bit signbit
	implementation with a new version that just does either a direct
	move or a regular move.  Move memory interface to separate insns.
	Move insns so they are next to the expander.
	(signbit<mode>2_dm_mem_be): New combiner insns to combine load
	with signbit move.  Split big and little endian case.
	(signbit<mode>2_dm_mem_le): Likewise.
	(signbit<mode>2_dm_<su>ext): Delete, no longer used.
	(signbit<mode>2_dm2): Likewise.

[gcc/testsuite]
2018-01-22  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/83862
	* gcc.target/powerpc/pr83862.c: New test.
Segher Boessenkool Jan. 22, 2018, 6:32 p.m. UTC | #6
Hi Mike,

Thanks for doing this!

On Sun, Jan 21, 2018 at 07:03:58PM -0500, Michael Meissner wrote:
> I've reworked the patch for PR83864.  This bug is due to the compiler issuing
> an internal error for code of the form on a little endian system:
> 
>     int sb (_Float128 *p, unsigned long n) { return __builtin_signbit (p[n]); }
> 
> The problem is that the memory optimization wants to load the high double word
> from memory to isolate the signbit.  In little endian, the high word is at
> offset 8 instead of 0.
> 
> The bug will also show up if you use the long double type, and you use the
> -mabi=ieeelongdouble option.
> 
> Previously the code did one UNSPEC (signbit<mode>2_dm) that was done before
> register allocation, and combined the value being in vector register, memory,
> and GPR register along with the shift to isolate the signbit.  Then after the
> split after register allocation, it created a second UNSPEC
> (signbit<mode>2_dm2) that was just the direct move, or it did the memory and
> GPR code path separately.
> 
> With these patches, the generator code issues an UNSPEC (signbit<mode>2_dm)
> just to get the high double word, and the shift right is then emitted.  The
> UNSPEC only handles the value being in either vector or GPR register.  There is
> a second UNSPEC that is created by the combiner if the value is in memory.  On
> little endian systems, the first split pass (before register allocation) will
> allocate a pseudo register to hold the initial ADD of the base and index
> registers for indexed loads, and then forms a LD reg,8(tmp) to load the high
> word.  Just in case, some code after register allocation reforms the UNSPEC, it
> uses a base register for the load, and it can use the base register as needed
> for the temporary.

But does it have to do an unspec at all?  Can't it just immediately take
the relevant 64-bit half (during expand), no unspec in sight, and that
is that?

> I have run this code on both little endian and big endian power8 systems, doing
> bootstrap and regression test.  There were no regressions.  Can I install this
> bug on the trunk?
> 
> I don't believe the bug shows up in gcc 6/7 branches, but I will build these
> and test to see if the code is needed to be backported.

If it is needed on the branches, okay for there too (once 7 reopens).

> [gcc]
> 2018-01-21  Michael Meissner  <meissner@linux.vnet.ibm.com>
> 
> 	PR target/83862
> 	* config/rs6000/rs6000-protos.h (rs6000_split_signbit): Delete,
> 	no longer used.
> 	* config/rs6000/rs6000.c (rs6000_split_signbit): Likewise.
> 	* config/rs6000/rs6000.md (signbit<mode>2): Change code for IEEE
> 	128-bit to produce an UNSPEC move to get the double word with the
> 	signbit and then a shift directly to do signbit.
> 	(signbit<mode>2_dm): Replace old IEEE 128-bit signbit
> 	implementation with a new version that just does either a direct
> 	move or a regular move.  Move memory interface to separate insns.
> 	Move insns so they are next to the expander.
> 	(signbit<mode>2_dm_mem_be): New combiner insns to combine load
> 	with signbit move.  Split big and little endian case.
> 	(signbit<mode>2_dm_mem_le): Likewise.
> 	(signbit<mode>2_dm_<su>ext): Delete, no longer used.
> 	(signbit<mode>2_dm2): Likewise.
> 
> [gcc/testsuite]
> 2018-01-22  Michael Meissner  <meissner@linux.vnet.ibm.com>
> 
> 	PR target/83862
> 	* gcc.target/powerpc/pr83862.c: New test.

The patch is okay for trunk, but I still think this can be a lot simpler.

> +;;  We can't use SUBREG:DI of the IEEE 128-bit value before register
> +;; allocation, because KF/TFmode aren't tieable with DImode.  Also, having the
> +;; 64-bit scalar part in the high end of the register instead of the low end
> +;; also complicates things.  So instead, we use an UNSPEC to do the move of the
> +;; high double word to isolate the signbit.

I'll think about this.

Either way, thank you for your patience!  And the patch :-)


Segher
Michael Meissner Jan. 22, 2018, 7:22 p.m. UTC | #7
On Mon, Jan 22, 2018 at 12:32:13PM -0600, Segher Boessenkool wrote:
> Hi Mike,
> 
> Thanks for doing this!
> 
> On Sun, Jan 21, 2018 at 07:03:58PM -0500, Michael Meissner wrote:
> > With these patches, the generator code issues an UNSPEC (signbit<mode>2_dm)
> > just to get the high double word, and the shift right is then emitted.  The
> > UNSPEC only handles the value being in either vector or GPR register.  There is
> > a second UNSPEC that is created by the combiner if the value is in memory.  On
> > little endian systems, the first split pass (before register allocation) will
> > allocate a pseudo register to hold the initial ADD of the base and index
> > registers for indexed loads, and then forms a LD reg,8(tmp) to load the high
> > word.  Just in case, some code after register allocation reforms the UNSPEC, it
> > uses a base register for the load, and it can use the base register as needed
> > for the temporary.
> 
> But does it have to do an unspec at all?  Can't it just immediately take
> the relevant 64-bit half (during expand), no unspec in sight, and that
> is that?

Yes it needs the UNSPEC.  As I said, you can't use SUBREG due to a combination
of MODES_TIEABLE_P and the way things are represented in vector registers (with
the scalar part being in the upper part of the register).  I tried it, and
could not get it to work.
Michael Meissner Jan. 22, 2018, 11:28 p.m. UTC | #8
On Mon, Jan 22, 2018 at 12:32:13PM -0600, Segher Boessenkool wrote:
> Hi Mike,
> 
> If it is needed on the branches, okay for there too (once 7 reopens).

It is needed on GCC 7 if you compile the test for -mcpu=power9 and use
an explicit __float128 instead of long double (-mabi=ieeelongdouble is not
really supported in GCC 7).  I'll check GCC 6 as well.
diff mbox series

Patch

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 256753)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -23341,9 +23341,27 @@  rs6000_split_signbit (rtx dest, rtx src)
 
   if (MEM_P (src))
     {
-      rtx mem = (WORDS_BIG_ENDIAN
-		 ? adjust_address (src, DImode, 0)
-		 : adjust_address (src, DImode, 8));
+      rtx addr = XEXP (src, 0);
+      rtx mem;
+
+      if (!WORDS_BIG_ENDIAN)
+	{
+	  /* Do not create an illegal address for indexed addressing when we
+	     add in the 8 to address the second word where the sign bit is.
+	     Instead use the desitnation register as a base register.  */
+	  if (GET_CODE (addr) == PLUS
+	      && !rs6000_legitimate_offset_address_p (DImode, addr, true, true))
+	    {
+	      emit_insn (gen_rtx_SET (dest, addr));
+	      mem = change_address (src, DImode,
+				    gen_rtx_PLUS (Pmode, dest, GEN_INT (8)));
+	    }
+	  else
+	    mem = adjust_address (src, DImode, 8);
+	}
+      else
+	mem = adjust_address (src, DImode, 0);
+
       emit_insn (gen_rtx_SET (dest_di, mem));
     }
 
Index: gcc/config/rs6000/rs6000.md
===================================================================
--- gcc/config/rs6000/rs6000.md	(revision 256753)
+++ gcc/config/rs6000/rs6000.md	(working copy)
@@ -4835,9 +4835,11 @@  (define_expand "copysign<mode>3"
   })
 
 ;; Optimize signbit on 64-bit systems with direct move to avoid doing the store
-;; and load.
+;; and load.  We restrict signbit coming from a load to use a base register for
+;; the destination, in case we need to use the base register as a tempoary
+;; address register.
 (define_insn_and_split "signbit<mode>2_dm"
-  [(set (match_operand:SI 0 "gpc_reg_operand" "=r,r,r")
+  [(set (match_operand:SI 0 "gpc_reg_operand" "=r,&b,r")
 	(unspec:SI
 	 [(match_operand:SIGNBIT 1 "input_operand" "wa,m,r")]
 	 UNSPEC_SIGNBIT))]
@@ -4853,7 +4855,7 @@  (define_insn_and_split "signbit<mode>2_d
   (set_attr "type" "mftgpr,load,integer")])
 
 (define_insn_and_split "*signbit<mode>2_dm_<su>ext"
-  [(set (match_operand:DI 0 "gpc_reg_operand" "=r,r,r")
+  [(set (match_operand:DI 0 "gpc_reg_operand" "=r,&b,r")
 	(any_extend:DI
 	 (unspec:SI
 	  [(match_operand:SIGNBIT 1 "input_operand" "wa,m,r")]
Index: gcc/testsuite/gcc.target/powerpc/pr83862.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr83862.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/pr83862.c	(working copy)
@@ -0,0 +1,21 @@ 
+/* PR target/83862.c */
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-require-effective-target ppc_float128_sw } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mpower8-vector -O2 -mabi=ieeelongdouble -Wno-psabi" } */
+
+/* On little endian systems, optimizing signbit of IEEE 128-bit values from
+   memory could abort if the memory address was indexed (reg+reg).  The
+   optimization is only on 64-bit machines with direct move.
+
+   Compile with -g -O2 -mabi=ieeelongdouble -Wno-psabi.  */
+
+#ifndef TYPE
+#define TYPE long double
+#endif
+
+int sbr (TYPE a) { return __builtin_signbit (a); }
+int sbm (TYPE *a) { return __builtin_signbit (*a); }
+int sbo (TYPE *a) { return __builtin_signbit (a[4]); }
+int sbi (TYPE *a, unsigned long n) { return __builtin_signbit (a[n]); }
+void sbs (int *p, TYPE a) { *p = __builtin_signbit (a); }