diff mbox

Fix ix86_split_long_move collision handling with TLS (PR target/66470)

Message ID 20150609125703.GA10247@tucnak.redhat.com
State New
Headers show

Commit Message

Jakub Jelinek June 9, 2015, 12:57 p.m. UTC
On Tue, Jun 09, 2015 at 02:32:07PM +0200, Uros Bizjak wrote:
> > -         emit_insn (gen_rtx_SET (base, XEXP (part[1][0], 0)));
> > +         addr = XEXP (part[1][0], 0);
> > +         if (TARGET_TLS_DIRECT_SEG_REFS)
> > +           {
> > +             struct ix86_address parts;
> > +             int ok = ix86_decompose_address (addr, &parts);
> > +             gcc_assert (ok);
> > +             if (parts.seg == DEFAULT_TLS_SEG_REG)
> > +               {
> > +                 /* It is not valid to use %gs: or %fs: in
> > +                    lea though, so we need to remove it from the
> > +                    address used for lea and add it to each individual
> > +                    memory loads instead.  */
> > +                 addr = copy_rtx (addr);
> > +                 rtx *x = &addr;
> > +                 while (GET_CODE (*x) == PLUS)
> 
> Why not use RTX iterators here? IMO, it would be much more readable.

Do you mean something like this?  It is larger and don't see readability
advantages there at all (plus the 4.8/4.9 backports can't use that anyway).
But if you prefer it, I can retest it.
I still need to look for a PLUS and scan the individual operands of it,
because the desired change is that the PLUS is replaced with its operand
(one that isn't UNSPEC_TP).  And getting rid of the ix86_decompose_address
would mean either unconditionally performing (wasteful) copy_rtx, or
two RTX iterators cycles.  The PLUS it is looking for has to be
a toplevel PLUS or PLUS in XEXP (x, 0) of that (recursively), otherwise
ix86_decompose_address wouldn't recognize it.

2015-06-09  Jakub Jelinek  <jakub@redhat.com>

	PR target/66470
	* config/i386/i386.c (ix86_split_long_move): For collisions
	involving direct tls segment refs, move the UNSPEC_TP out of
	the address for lea, to each of the memory loads.

	* gcc.dg/tls/pr66470.c: New test.



	Jakub

Comments

Uros Bizjak June 9, 2015, 1:21 p.m. UTC | #1
On Tue, Jun 9, 2015 at 2:57 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jun 09, 2015 at 02:32:07PM +0200, Uros Bizjak wrote:
>> > -         emit_insn (gen_rtx_SET (base, XEXP (part[1][0], 0)));
>> > +         addr = XEXP (part[1][0], 0);
>> > +         if (TARGET_TLS_DIRECT_SEG_REFS)
>> > +           {
>> > +             struct ix86_address parts;
>> > +             int ok = ix86_decompose_address (addr, &parts);
>> > +             gcc_assert (ok);
>> > +             if (parts.seg == DEFAULT_TLS_SEG_REG)
>> > +               {
>> > +                 /* It is not valid to use %gs: or %fs: in
>> > +                    lea though, so we need to remove it from the
>> > +                    address used for lea and add it to each individual
>> > +                    memory loads instead.  */
>> > +                 addr = copy_rtx (addr);
>> > +                 rtx *x = &addr;
>> > +                 while (GET_CODE (*x) == PLUS)
>>
>> Why not use RTX iterators here? IMO, it would be much more readable.
>
> Do you mean something like this?  It is larger and don't see readability
> advantages there at all (plus the 4.8/4.9 backports can't use that anyway).
> But if you prefer it, I can retest it.

I'm afraid that simple scan loop won't work correctly on x32. There
are some issues with UNSPEC_TP for this target, so we have to generate
zero_extend of SImode UNSPEC, e.g.:

(plus:DI (zero_extend:DI (unspec:SI [...] UNSPEC_TP) (reg:DI ...))

as can be seen in get_thread_pointer to construct the address. It
looks that your loop won't find the UNSPEC_TP tag in the above case.

Uros.
Jakub Jelinek June 9, 2015, 1:39 p.m. UTC | #2
On Tue, Jun 09, 2015 at 03:21:55PM +0200, Uros Bizjak wrote:
> I'm afraid that simple scan loop won't work correctly on x32. There
> are some issues with UNSPEC_TP for this target, so we have to generate
> zero_extend of SImode UNSPEC, e.g.:
> 
> (plus:DI (zero_extend:DI (unspec:SI [...] UNSPEC_TP) (reg:DI ...))
> 
> as can be seen in get_thread_pointer to construct the address. It
> looks that your loop won't find the UNSPEC_TP tag in the above case.

You're right, for -m32 it would need to start with
   rtx *x = &addr;
+  while (GET_CODE (*x) == ZERO_EXTEND
+	 || GET_CODE (*x) == AND
+	 || GET_CODE (*x) == SUBREG)
+    x = &XEXP (*x, 0);
to get at the PLUS.  Now, with either the original patch with the above
ammendment, or with the iterators, the question is what to do
with the UNSPEC_TP for -m32.  Is it ok to just use addr32 on the
lea and use normal Pmode (== DImode) addressing for the memory reads
(if I read the code well, that is what it does right now for the non-TLS
ones:
          if (GET_MODE (base) != Pmode)
            base = gen_rtx_REG (Pmode, REGNO (base));
)?  Then we'd need to change tls_base mode to Pmode.

	Jakub
Uros Bizjak June 9, 2015, 2:04 p.m. UTC | #3
On Tue, Jun 9, 2015 at 3:39 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jun 09, 2015 at 03:21:55PM +0200, Uros Bizjak wrote:
>> I'm afraid that simple scan loop won't work correctly on x32. There
>> are some issues with UNSPEC_TP for this target, so we have to generate
>> zero_extend of SImode UNSPEC, e.g.:
>>
>> (plus:DI (zero_extend:DI (unspec:SI [...] UNSPEC_TP) (reg:DI ...))
>>
>> as can be seen in get_thread_pointer to construct the address. It
>> looks that your loop won't find the UNSPEC_TP tag in the above case.
>
> You're right, for -m32 it would need to start with
>    rtx *x = &addr;
> +  while (GET_CODE (*x) == ZERO_EXTEND
> +        || GET_CODE (*x) == AND
> +        || GET_CODE (*x) == SUBREG)
> +    x = &XEXP (*x, 0);
> to get at the PLUS.  Now, with either the original patch with the above
> ammendment, or with the iterators, the question is what to do
> with the UNSPEC_TP for -m32.  Is it ok to just use addr32 on the
> lea and use normal Pmode (== DImode) addressing for the memory reads
> (if I read the code well, that is what it does right now for the non-TLS
> ones:
>           if (GET_MODE (base) != Pmode)
>             base = gen_rtx_REG (Pmode, REGNO (base));
> )?  Then we'd need to change tls_base mode to Pmode.

(I assume you referred to -mx32 above. Please also note that -mx32
uses Pmode == SImode by default).

Yes, please emit zero-extended address load to a DImode register, and
use this register with UNSPEC_TP to form final DImode address. The LEA
that will be generated from address load will use DImode inputs due to
%H operand modifier and will output to (implicitly zero-extended)
SImode register.

FYI, you just hit two special cases here:
- addresses that mention %fs and %gs are always in DImode
- LEA input operands are also always in DImode, but compiler already
takes care of it.

Uros.
Uros Bizjak June 9, 2015, 2:12 p.m. UTC | #4
On Tue, Jun 9, 2015 at 3:39 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jun 09, 2015 at 03:21:55PM +0200, Uros Bizjak wrote:
>> I'm afraid that simple scan loop won't work correctly on x32. There
>> are some issues with UNSPEC_TP for this target, so we have to generate
>> zero_extend of SImode UNSPEC, e.g.:
>>
>> (plus:DI (zero_extend:DI (unspec:SI [...] UNSPEC_TP) (reg:DI ...))
>>
>> as can be seen in get_thread_pointer to construct the address. It
>> looks that your loop won't find the UNSPEC_TP tag in the above case.
>
> You're right, for -m32 it would need to start with
>    rtx *x = &addr;
> +  while (GET_CODE (*x) == ZERO_EXTEND
> +        || GET_CODE (*x) == AND
> +        || GET_CODE (*x) == SUBREG)
> +    x = &XEXP (*x, 0);

Oh, you can use SImode_address_operand predicate here.

Uros.
Jakub Jelinek June 9, 2015, 2:26 p.m. UTC | #5
On Tue, Jun 09, 2015 at 04:12:33PM +0200, Uros Bizjak wrote:
> On Tue, Jun 9, 2015 at 3:39 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Tue, Jun 09, 2015 at 03:21:55PM +0200, Uros Bizjak wrote:
> >> I'm afraid that simple scan loop won't work correctly on x32. There
> >> are some issues with UNSPEC_TP for this target, so we have to generate
> >> zero_extend of SImode UNSPEC, e.g.:
> >>
> >> (plus:DI (zero_extend:DI (unspec:SI [...] UNSPEC_TP) (reg:DI ...))
> >>
> >> as can be seen in get_thread_pointer to construct the address. It
> >> looks that your loop won't find the UNSPEC_TP tag in the above case.
> >
> > You're right, for -m32 it would need to start with

Yeah, I meant -mx32 (which I have no experience with nor spare time for).

> >    rtx *x = &addr;
> > +  while (GET_CODE (*x) == ZERO_EXTEND
> > +        || GET_CODE (*x) == AND
> > +        || GET_CODE (*x) == SUBREG)
> > +    x = &XEXP (*x, 0);
> 
> Oh, you can use SImode_address_operand predicate here.

Do I need to loop, or can there be just one SImode_address_operand
code?  Do you want to use the iterators (as in the second patch) or not
(then is
  if (SImode_address_operand (addr, VOIDmode))
    x = &XEXP (addr, 0);
ok)?  Is Pmode always SImode for -mx32, or depending on some switch or
something?  Would it be acceptable to just guard the changes in the patch
with !TARGET_X32 and let H.J. deal with that target?  I'm afraid I'm lost
when to ZERO_EXTEND addr (if needed at all), etc.

	Jakub
Uros Bizjak June 9, 2015, 4:16 p.m. UTC | #6
On Tue, Jun 9, 2015 at 4:26 PM, Jakub Jelinek <jakub@redhat.com> wrote:

>> >> I'm afraid that simple scan loop won't work correctly on x32. There
>> >> are some issues with UNSPEC_TP for this target, so we have to generate
>> >> zero_extend of SImode UNSPEC, e.g.:
>> >>
>> >> (plus:DI (zero_extend:DI (unspec:SI [...] UNSPEC_TP) (reg:DI ...))
>> >>
>> >> as can be seen in get_thread_pointer to construct the address. It
>> >> looks that your loop won't find the UNSPEC_TP tag in the above case.
>> >
>> > You're right, for -m32 it would need to start with
>
> Yeah, I meant -mx32 (which I have no experience with nor spare time for).
>
>> >    rtx *x = &addr;
>> > +  while (GET_CODE (*x) == ZERO_EXTEND
>> > +        || GET_CODE (*x) == AND
>> > +        || GET_CODE (*x) == SUBREG)
>> > +    x = &XEXP (*x, 0);
>>
>> Oh, you can use SImode_address_operand predicate here.
>
> Do I need to loop, or can there be just one SImode_address_operand

IIRC, apart from the whole address, only UNSPEC_TP can be
zero_extended. It is a hardware "feature" (== HW bug) that addr32
doesn't apply to segment registers.

> code?  Do you want to use the iterators (as in the second patch) or not
> (then is
>   if (SImode_address_operand (addr, VOIDmode))
>     x = &XEXP (addr, 0);
> ok)?  Is Pmode always SImode for -mx32, or depending on some switch or

Nope, it depends on -maddress-mode switch, and can be SImode or DImode.

> something?  Would it be acceptable to just guard the changes in the patch
> with !TARGET_X32 and let H.J. deal with that target?  I'm afraid I'm lost
> when to ZERO_EXTEND addr (if needed at all), etc.

If you wish, I can take your patch and take if further. -mx32 is a
delicate beast...

Uros.
>
>         Jakub
Jakub Jelinek June 9, 2015, 4:21 p.m. UTC | #7
On Tue, Jun 09, 2015 at 06:16:32PM +0200, Uros Bizjak wrote:
> > something?  Would it be acceptable to just guard the changes in the patch
> > with !TARGET_X32 and let H.J. deal with that target?  I'm afraid I'm lost
> > when to ZERO_EXTEND addr (if needed at all), etc.
> 
> If you wish, I can take your patch and take if further. -mx32 is a
> delicate beast...

If you could, it would be appreciated, I'm quite busy with OpenMP 4.1 stuff
now.
Note that for -m64/-mx32 it will be much harder to create a reproducer,
because to trigger the bug one has to convince the register allocator
to allocate the lhs of the load in certain registers (not that hard),
but also the index register (to be scaled, also not that hard) and
also the register holding the tls symbol immediate.  Wonder if one has to
keep all but the two registers live across the load or something similar.

	Jakub
diff mbox

Patch

--- gcc/config/i386/i386.c.jj	2015-06-08 15:41:19.000000000 +0200
+++ gcc/config/i386/i386.c	2015-06-09 14:42:18.357849227 +0200
@@ -22866,7 +22866,7 @@  ix86_split_long_move (rtx operands[])
 	 Do an lea to the last part and use only one colliding move.  */
       else if (collisions > 1)
 	{
-	  rtx base;
+	  rtx base, addr, tls_base = NULL_RTX;
 
 	  collisions = 1;
 
@@ -22877,10 +22877,48 @@  ix86_split_long_move (rtx operands[])
 	  if (GET_MODE (base) != Pmode)
 	    base = gen_rtx_REG (Pmode, REGNO (base));
 
-	  emit_insn (gen_rtx_SET (base, XEXP (part[1][0], 0)));
+	  addr = XEXP (part[1][0], 0);
+	  if (TARGET_TLS_DIRECT_SEG_REFS)
+	    {
+	      struct ix86_address parts;
+	      int ok = ix86_decompose_address (addr, &parts);
+	      gcc_assert (ok);
+	      if (parts.seg == DEFAULT_TLS_SEG_REG)
+		{
+		  /* It is not valid to use %gs: or %fs: in
+		     lea though, so we need to remove it from the
+		     address used for lea and add it to each individual
+		     memory loads instead.  */
+		  addr = copy_rtx (addr);
+		  subrtx_ptr_iterator::array_type array;
+		  FOR_EACH_SUBRTX_PTR (iter, array, &addr, NONCONST)
+		    {
+		      rtx *x = *iter;
+		      if (GET_CODE (*x) == PLUS)
+			{
+			  for (i = 0; i < 2; i++)
+			    if (GET_CODE (XEXP (*x, i)) == UNSPEC
+				&& XINT (XEXP (*x, i), 1) == UNSPEC_TP)
+			      {
+				tls_base = XEXP (*x, i);
+				*x = XEXP (*x, 1 - i);
+				break;
+			      }
+			  if (tls_base)
+			    break;
+			}
+		    }
+		  gcc_assert (tls_base);
+		}
+	    }
+	  emit_insn (gen_rtx_SET (base, addr));
+	  if (tls_base)
+	    base = gen_rtx_PLUS (GET_MODE (base), base, tls_base);
 	  part[1][0] = replace_equiv_address (part[1][0], base);
 	  for (i = 1; i < nparts; i++)
 	    {
+	      if (tls_base)
+		base = copy_rtx (base);
 	      tmp = plus_constant (Pmode, base, UNITS_PER_WORD * i);
 	      part[1][i] = replace_equiv_address (part[1][i], tmp);
 	    }
--- gcc/testsuite/gcc.dg/tls/pr66470.c.jj	2015-06-09 11:59:05.543954781 +0200
+++ gcc/testsuite/gcc.dg/tls/pr66470.c	2015-06-09 11:58:43.000000000 +0200
@@ -0,0 +1,29 @@ 
+/* PR target/66470 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-require-effective-target tls } */
+
+extern __thread unsigned long long a[10];
+extern __thread struct S { int a, b; } b[10];
+
+unsigned long long
+foo (long x)
+{
+  return a[x];
+}
+
+struct S
+bar (long x)
+{
+  return b[x];
+}
+
+#ifdef __SIZEOF_INT128__
+extern __thread unsigned __int128 c[10];
+
+unsigned __int128
+baz (long x)
+{
+  return c[x];
+}
+#endif