hppa: Fix data race in setting function descriptors during lazy binding

Message ID ed8b3116-064a-70b5-8ff1-4f74d6d7da63@bell.net
State New

Commit Message

John David Anglin Oct. 20, 2019, 6:28 p.m. UTC
This patch fixes the data race in setting function descriptors during lazy binding (Bug 23296).

It addresses an issue that is present mainly on SMP machines running threaded code.  In a typical
indirect call or PLT import stub, the target address is loaded first.  Then the global pointer
is loaded into the PIC register in the delay slot of a branch to the target address.  During lazy
binding, the target address is a trampoline which transfers to _dl_runtime_resolve().

_dl_runtime_resolve() uses the relocation offset stored in the global pointer and the linkage map
stored in the trampoline to find the relocation.  Then, the function descriptor is updated.

In a multi-threaded application, it is possible for the global pointer to be updated between the
load of the target address and the global pointer.  When this happens, the relocation offset has
been replaced by the new global pointer.  The function pointer has probably been updated as well
but there is no way to find the address of the function descriptor and to transfer to the target.
So, _dl_runtime_resolve() typically crashes.

HP-UX addressed this problem by adding an extra pc-relative branch to each descriptor.  The
descriptor is initially set up to point to the branch.  The branch then transfers to the trampoline.
This allowed the trampoline code to figure out which descriptor was being used without any
modification to user code.  I didn't use this approach as it is more complex and potentially
changes function pointer canonicalization.

The order of loading the target address and global pointer in indirect calls was not consistent
with the order used in import stubs.  In particular, $$dyncall and some inline versions of it
loaded the global pointer first.  This was inconsistent with the global pointer being updated
first in dl-machine.h.  Assuming the accesses are ordered, we want elf_machine_fixup_plt() to
store the global pointer first and calls to load it last.  Then, the global pointer will be
correct when the target function is entered.

However, just to make things more fun, HP added support for out-of-order execution of accesses
in PA 2.0.  The accesses used by calls are weakly ordered.  So, it's possible under some circumstances
that a function might be entered with the wrong global pointer.  However, HP uses weakly ordered
accesses in 64-bit HP-UX, so I assume that loading the global pointer in the delay slot of the
branch must work consistently.

The basic fix for the race is a combination of modifying user code to preserve the address of the
function descriptor in register %r22 and setting the least-significant bit in the relocation offset.
The latter was suggested by Carlos as a way to distinguish relocation offsets from global pointer
values.  Conventionally, %r22 is used as the address of the function descriptor in calls to $$dyncall.
So, it wasn't hard to preserve the address in %r22.

I have updated gcc trunk and gcc-9 branch to not clobber %r22 in $$dyncall and inline indirect calls.
I have also modified the import stubs in binutils trunk and the 2.33 branch to preserve %r22.  This
required making the stubs one instruction longer but we save one relocation.  I also modified binutils
to align the .plt section on an 8-byte boundary.  This allows descriptors to be updated atomically.

With these changes, _dl_runtime_resolve() can fall back to an alternate mechanism to find the relocation
offset when it has been clobbered.  There's just one additional instruction in the fast path. I tested
the fallback function, _dl_fix_reloc_arg(), by changing the branch to always use the fallback.  Old
code still runs as it did before.
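
A minimal sketch of the fallback idea, not the patch's code: given the descriptor address preserved in %r22, walk the PLT relocation table until an entry points at that descriptor, then return its byte offset.  The struct plt_reloc layout, the r_is_iplt flag and the find_reloc_offset name are simplified, hypothetical stand-ins for Elf32_Rela and the R_PARISC_IPLT check, and the sketch returns a sentinel where the real function aborts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an Elf32_Rela PLT entry.  */
struct plt_reloc
{
  uint32_t r_offset;   /* Descriptor address relative to the load base.  */
  uint32_t r_is_iplt;  /* Nonzero if this entry is an IPLT relocation.  */
};

/* Return the byte offset of the entry whose target is FDESC_ADDR, or
   (uint32_t) -1 if none matches.  */
static uint32_t
find_reloc_offset (const struct plt_reloc *table, size_t count,
                   uint32_t load_base, uint32_t fdesc_addr)
{
  assert (table != NULL || count == 0);

  /* Look for the entry that refers to this descriptor.  */
  for (size_t i = 0; i < count; i++)
    if (table[i].r_is_iplt
        && table[i].r_offset + load_base == fdesc_addr)
      return (uint32_t) (i * sizeof (struct plt_reloc));

  return (uint32_t) -1;
}
```

The search is linear in the number of PLT entries, which seems acceptable only because this path runs when the race has actually been hit.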

_dl_runtime_profile() assembles but I don't think the testsuite exercises it sufficiently.

With this change, I haven't observed any problem with lazy binding in several testsuite runs.  With 2.29,
builds fail almost every time on Debian unstable.

Dave
--
 sysdeps/hppa/dl-fptr.c                        | 18 ++++---
 sysdeps/hppa/dl-machine.h                     | 28 +++++++++--
 sysdeps/hppa/dl-runtime.c                     | 33 +++++++++++++
 sysdeps/hppa/dl-trampoline.S                  | 71 ++++++++++++++++++++++-----
 sysdeps/unix/sysv/linux/hppa/atomic-machine.h |  2 +
 5 files changed, 130 insertions(+), 22 deletions(-)

Comments

Carlos O'Donell Oct. 29, 2019, 8:42 p.m. UTC | #1
On 10/20/19 2:28 PM, John David Anglin wrote:
> This patch fixes the data race in setting function descriptors during lazy binding (Bug 23296).

Please post a v2 with the comments updated. I think this solution looks really
good. We could have used the low bit of the gp to do an interlocked update,
but because we can recompute gp we just do that and put it back into the
right location. We always ensure writers write gp then ip. We always ensure
readers read gp then ip.

> It addresses an issue that is present mainly on SMP machines running threaded code.  In a typical
> indirect call or PLT import stub, the target address is loaded first.  Then the global pointer
> is loaded into the PIC register in the delay slot of a branch to the target address.  During lazy
> binding, the target address is a trampoline which transfers to _dl_runtime_resolve().

Correct.

> _dl_runtime_resolve() uses the relocation offset stored in the global pointer and the linkage map
> stored in the trampoline to find the relocation.  Then, the function descriptor is updated.

OK.

> In a multi-threaded application, it is possible for the global pointer to be updated between the
> load of the target address and the global pointer.  When this happens, the relocation offset has
> been replaced by the new global pointer.  The function pointer has probably been updated as well
> but there is no way to find the address of the function descriptor and to transfer to the target.
> So, _dl_runtime_resolve() typically crashes.

You argue that the current elf_machine_fixup_plt is invalid then?

This argues that the only workable solution is the atomic floating point double word store?

> HP-UX addressed this problem by adding an extra pc-relative branch to each descriptor.  The
> descriptor is initially set up to point to the branch.  The branch then transfers to the trampoline.
> This allowed the trampoline code to figure out which descriptor was being used without any
> modification to user code.  I didn't use this approach as it is more complex and potentially
> changes function pointer canonicalization.

OK.

> The order of loading the target address and global pointer in indirect calls was not consistent
> with the order used in import stubs.  In particular, $$dyncall and some inline versions of it
> loaded the global pointer first.  This was inconsistent with the global pointer being updated
> first in dl-machine.h.  Assuming the accesses are ordered, we want elf_machine_fixup_plt() to
> store the global pointer first and calls to load it last.  Then, the global pointer will be
> correct when the target function is entered.

Correct.

Writers: store gp, store ip.
Readers: load ip, load gp.

If readers see a new ip, they will also be guaranteed to see a new gp. OK.
If readers see an old ip, they -may- see an old gp. OK.
If readers see an old ip, they -may- see a new gp. BAD.
 * If we see a new gp it won't have the low-bit set and we recompute it.
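
The case analysis above can be checked by brute force.  This is a sketch under the assumption that accesses are observed in program order (the PA 1.x model); the step encoding and the function names are made up for illustration.  The writer stores gp and then ip, the reader loads ip and then gp, and the code enumerates every point at which each load can land.

```c
#include <assert.h>
#include <stdbool.h>

enum { OLD = 0, NEW = 1 };

/* Writer progress: 0 = nothing stored, 1 = gp stored, 2 = gp and ip
   stored.  These helpers give the value visible at a given step.  */
static int
ip_at (int step)
{
  assert (step >= 0 && step <= 2);
  return step >= 2 ? NEW : OLD;
}

static int
gp_at (int step)
{
  assert (step >= 0 && step <= 2);
  return step >= 1 ? NEW : OLD;
}

/* True if some interleaving lets a reader that loads ip first and gp
   second observe the pair (IP, GP).  */
static bool
observable (int ip, int gp)
{
  for (int s1 = 0; s1 <= 2; s1++)       /* Step at which ip is loaded.  */
    for (int s2 = s1; s2 <= 2; s2++)    /* gp is loaded no earlier.  */
      if (ip_at (s1) == ip && gp_at (s2) == gp)
        return true;
  return false;
}
```

Under these assumptions the pair (new ip, old gp) is never observable, while (old ip, new gp) is, which is the case the low-bit check has to catch.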

> However, just to make things more fun, HP added support for out-of-order execution of accesses
> in PA 2.0.  The accesses used by calls are weakly ordered.  So, it's possible under some circumstances
> that a function might be entered with the wrong global pointer.  However, HP uses weakly ordered
> accesses in 64-bit HP-UX, so I assume that loading the global pointer in the delay slot of the
> branch must work consistently.

Agreed.

> The basic fix for the race is a combination of modifying user code to preserve the address of the
> function descriptor in register %r22 and setting the least-significant bit in the relocation offset.
> The latter was suggested by Carlos as a way to distinguish relocation offsets from global pointer
> values.  Conventionally, %r22 is used as the address of the function descriptor in calls to $$dyncall.
> So, it wasn't hard to preserve the address in %r22.

OK.

> I have updated gcc trunk and gcc-9 branch to not clobber %r22 in $$dyncall and inline indirect calls.
> I have also modified the import stubs in binutils trunk and the 2.33 branch to preserve %r22.  This
> required making the stubs one instruction longer but we save one relocation.  I also modified binutils
> to align the .plt section on an 8-byte boundary.  This allows descriptors to be updated atomically.

Right, with a floating point double word store.

> With these changes, _dl_runtime_resolve() can fall back to an alternate mechanism to find the relocation
> offset when it has been clobbered.  There's just one additional instruction in the fast path. I tested
> the fallback function, _dl_fix_reloc_arg(), by changing the branch to always use the fallback.  Old
> code still runs as it did before.

OK.

> _dl_runtime_profile() assembles but I don't think the testsuite exercises it sufficiently.
> 
> With this change, I haven't observed any problem with lazy binding in several testsuite runs.  With 2.29,
> builds fail almost every time on Debian unstable.
> 
> Dave
> --
>  sysdeps/hppa/dl-fptr.c                        | 18 ++++---
>  sysdeps/hppa/dl-machine.h                     | 28 +++++++++--
>  sysdeps/hppa/dl-runtime.c                     | 33 +++++++++++++
>  sysdeps/hppa/dl-trampoline.S                  | 71 ++++++++++++++++++++++-----
>  sysdeps/unix/sysv/linux/hppa/atomic-machine.h |  2 +
>  5 files changed, 130 insertions(+), 22 deletions(-)
> 
> diff --git a/sysdeps/hppa/dl-fptr.c b/sysdeps/hppa/dl-fptr.c
> index af1acb0701..c841632906 100644
> --- a/sysdeps/hppa/dl-fptr.c
> +++ b/sysdeps/hppa/dl-fptr.c
> @@ -172,8 +172,8 @@ make_fdesc (ElfW(Addr) ip, ElfW(Addr) gp)
>      }
> 
>   install:
> -  fdesc->ip = ip;
>    fdesc->gp = gp;
> +  fdesc->ip = ip;

OK. make_fdesc is a writer so store to gp then ip.

> 
>    return (ElfW(Addr)) fdesc;
>  }
> @@ -350,7 +350,9 @@ ElfW(Addr)
>  _dl_lookup_address (const void *address)
>  {
>    ElfW(Addr) addr = (ElfW(Addr)) address;
> -  unsigned int *desc, *gptr;
> +  ElfW(Word) reloc_arg;
> +  volatile unsigned int *desc;
> +  unsigned int *gptr;

Why do you use volatile for desc?

To avoid an atomic operation and still tell the compiler
the data is really volatile (updated by another thread)?

> 
>    /* Return ADDR if the least-significant two bits of ADDR are not consistent
>       with ADDR being a linker defined function pointer.  The normal value for
> @@ -367,7 +369,11 @@ _dl_lookup_address (const void *address)
>    if (!_dl_read_access_allowed (desc))
>      return addr;
> 
> -  /* Load first word of candidate descriptor.  It should be a pointer
> +  /* First load the relocation offset.  */
> +  reloc_arg = (ElfW(Word)) desc[1];
> +  atomic_full_barrier();

This is a LoadLoad with a full barrier between it, and it means that load
of reloc_arg is always first, and load of gptr is second, with no reordering.

> +
> +  /* Then load first word of candidate descriptor.  It should be a pointer
>       with word alignment and point to memory that can be read.  */
>    gptr = (unsigned int *) desc[0];

If gptr is resolved, then you know reloc_arg is also resolved.

If gptr is not resolved, then you know nothing about reloc_arg, and you
immediately enter _dl_fixup. The reloc_arg may already have been resolved.
It still leaves you possibly calling _dl_fixup with a new reloc_arg? No
because you can detect a gp without the low-bit set.
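
A sketch of the ordered read described here, assuming a descriptor laid out as two 32-bit words with ip in word 0 and the gp slot in word 1.  The fdesc_snapshot type and read_fdesc name are hypothetical; __sync_synchronize() is the GCC builtin that the patch maps atomic_full_barrier() to.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of one descriptor.  */
struct fdesc_snapshot
{
  uint32_t gp_word;  /* desc[1]: gp, or the tagged relocation offset.  */
  uint32_t ip;       /* desc[0]: entry point, or the trampoline.  */
};

/* Load the gp slot first, then ip, with a full barrier in between so
   neither the compiler nor the CPU may reorder the two loads.  */
static struct fdesc_snapshot
read_fdesc (const volatile uint32_t *desc)
{
  struct fdesc_snapshot s;

  assert (desc != 0);
  s.gp_word = desc[1];
  __sync_synchronize ();
  s.ip = desc[0];
  return s;
}
```

A caller can then test the low bit of gp_word to tell a relocation offset from a resolved gp before passing it on.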

>    if (((unsigned int) gptr & 3) != 0
> @@ -377,8 +383,8 @@ _dl_lookup_address (const void *address)
>    /* See if descriptor requires resolution.  The following trampoline is
>       used in each global offset table for function resolution:
> 
> -		ldw 0(r20),r22
> -		bv r0(r22)
> +		ldw 0(r20),r21
> +		bv r0(r21)

OK. Use r21.

>  		ldw 4(r20),r21
>       tramp:	b,l .-12,r20
>  		depwi 0,31,2,r20
> @@ -389,7 +395,7 @@ _dl_lookup_address (const void *address)
>    if (gptr[0] == 0xea9f1fdd			/* b,l .-12,r20     */
>        && gptr[1] == 0xd6801c1e			/* depwi 0,31,2,r20 */
>        && (ElfW(Addr)) gptr[2] == elf_machine_resolve ())
> -    _dl_fixup ((struct link_map *) gptr[5], (ElfW(Word)) desc[1]);
> +    _dl_fixup ((struct link_map *) gptr[5], reloc_arg);

OK. Use the descriptor copy e.g. reloc_arg.

> 
>    return (ElfW(Addr)) desc[0];

OK.

>  }
> diff --git a/sysdeps/hppa/dl-machine.h b/sysdeps/hppa/dl-machine.h
> index 5aa219a5d4..c3d34717e8 100644
> --- a/sysdeps/hppa/dl-machine.h
> +++ b/sysdeps/hppa/dl-machine.h
> @@ -117,10 +117,28 @@ elf_machine_fixup_plt (struct link_map *map, lookup_t t,
>    volatile Elf32_Addr *rfdesc = reloc_addr;
>    /* map is the link_map for the caller, t is the link_map for the object
>       being called */
> -  rfdesc[1] = value.gp;
> -  /* Need to ensure that the gp is visible before the code
> -     entry point is updated */
> -  rfdesc[0] = value.ip;
> +
> +  /* We would like the function descriptor to be double word aligned.  This
> +     helps performance (ip and gp then reside on the same cache line) and
> +     we can update the pair atomically with a single store.  However, the
> +     linker doesn't currently ensure this alignment.  */
> +  if ((unsigned int)reloc_addr & 7)
> +    {
> +      /* Need to ensure that the gp is visible before the code
> +         entry point is updated */
> +      rfdesc[1] = value.gp;
> +      atomic_full_barrier();

The full fence between the two stores ensures that they are not reordered.

> +      rfdesc[0] = value.ip;

So gp is always stored first followed by ip.

> +    }
> +  else
> +    {
> +      /* Update pair atomically with floating point store.  */
> +      union { ElfW(Word) v[2]; double d; } u;
> +
> +      u.v[0] = value.ip;
> +      u.v[1] = value.gp;
> +      *(volatile double *)rfdesc = u.d;

OK. This makes sense because we update both, and it can't be wrong in any reader.
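
A standalone sketch of the two update paths, assuming GCC or Clang (for __sync_synchronize and the may_alias attribute).  The update_fdesc name and the aliasing_double typedef are hypothetical.  Whether the 8-byte store is actually performed atomically is a property of the target; the C code only arranges for a single aligned double store to be requested.

```c
#include <assert.h>
#include <stdint.h>

/* A double that is allowed to alias other types (GCC/Clang extension),
   so storing it over two uint32_t words is not an aliasing violation.  */
typedef double __attribute__ ((may_alias)) aliasing_double;

/* SLOT[0] is ip, SLOT[1] is gp.  */
static void
update_fdesc (uint32_t *slot, uint32_t ip, uint32_t gp)
{
  assert (slot != 0);

  if (((uintptr_t) slot & 7) != 0)
    {
      /* Not 8-byte aligned: ordered word stores, gp first.  */
      volatile uint32_t *w = slot;
      w[1] = gp;
      __sync_synchronize ();
      w[0] = ip;
    }
  else
    {
      /* Aligned: build the pair in a union and publish both words
         with one double-word store.  */
      union { uint32_t v[2]; double d; } u;
      u.v[0] = ip;
      u.v[1] = gp;
      *(volatile aliasing_double *) slot = u.d;
    }
}
```

The union only moves the bit pattern through a floating point register; no arithmetic is performed on the value.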

> +    }
>    return value;
>  }
> 
> @@ -265,7 +283,7 @@ elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
>  		     here.  The trampoline code will load the proper
>  		     LTP and pass the reloc offset to the fixup
>  		     function.  */
> -		  fptr->gp = iplt - jmprel;
> +		  fptr->gp = (iplt - jmprel) | 1;

Set the low-bit of the reloc offset.

This way you can distinguish between reloc offset, and adjusted gp.

This requires a macro and a huge comment.

e.g.

/* The gp slot in the function descriptor contains the relocation
   offset before resolution.  To distinguish between a resolved
   gp value and an unresolved relocation offset we set an unused
   bit in the relocation offset.  This would allow us to do a
   synchronized two word update using this bit (interlocked
   update), but instead of waiting for the update we simply
   recompute the gp value given that we know the ip.  */
#define PA_GP_RELOC 1

Then:

fptr->gp = (iplt - jmprel) | PA_GP_RELOC;

That way we don't forget what the magic "|1" is.
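
A small sketch of how the tag could be wrapped in helpers.  PA_GP_RELOC is the macro name suggested above; the three helper functions are hypothetical.  The sketch assumes relocation offsets are always even (they are multiples of the 12-byte Elf32_Rela entry size) and that a real gp value never has its low bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PA_GP_RELOC 1u

/* Mark OFFSET as an unresolved relocation offset.  */
static uint32_t
tag_reloc_offset (uint32_t offset)
{
  assert ((offset & PA_GP_RELOC) == 0);
  return offset | PA_GP_RELOC;
}

/* True if GP_WORD still holds a tagged relocation offset.  */
static bool
is_reloc_offset (uint32_t gp_word)
{
  return (gp_word & PA_GP_RELOC) != 0;
}

/* Recover the plain relocation offset from a tagged value.  */
static uint32_t
untag_reloc_offset (uint32_t gp_word)
{
  return gp_word & ~PA_GP_RELOC;
}
```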

>  		} /* r_sym != 0 */
>  	      else
>  		{
> diff --git a/sysdeps/hppa/dl-runtime.c b/sysdeps/hppa/dl-runtime.c
> new file mode 100644
> index 0000000000..189bb32cde
> --- /dev/null
> +++ b/sysdeps/hppa/dl-runtime.c
> @@ -0,0 +1,33 @@
> +/* Clear least-significant bit of relocation offset.  */

Needs a copyright header with the appropriate year.

> +#define reloc_offset (reloc_arg & ~1)
> +#define reloc_index  (reloc_arg & ~1) / sizeof (PLTREL)
> +
> +#include <elf/dl-runtime.c>
> +

Suggest:

/* The caller has encountered a partially relocated function
   descriptor.  The gp of the descriptor has been updated, but
   not the ip.  We find the function descriptor again and compute
   the relocation offset and return that to the caller.  The caller
   will continue on to call _dl_fixup with the relocation offset.  */

> +ElfW(Word)
> +attribute_hidden __attribute ((noinline)) ARCH_FIXUP_ATTRIBUTE
> +_dl_fix_reloc_arg (struct fdesc *fptr, struct link_map *l)
> +{
> +  Elf32_Addr l_addr, iplt, jmprel, end_jmprel, r_type;
> +  const Elf32_Rela *reloc;
> +
> +  l_addr = l->l_addr;
> +  jmprel = D_PTR(l, l_info[DT_JMPREL]);
> +  end_jmprel = jmprel + l->l_info[DT_PLTRELSZ]->d_un.d_val;
> +
> +  /* Process the relocs...  */

Suggest "Look for the entry..."

> +  for (iplt = jmprel; iplt < end_jmprel; iplt += sizeof (Elf32_Rela))
> +    {
> +      reloc = (const Elf32_Rela *) iplt;
> +      r_type = ELF32_R_TYPE (reloc->r_info);
> +
> +      if (__builtin_expect (r_type == R_PARISC_IPLT, 1)
> +	  && fptr == (struct fdesc *) (reloc->r_offset + l_addr))
> +	/* Return reloc offset.  */

Suggest "Found entry. Return the reloc offset."

> +	return iplt - jmprel;
> +    }
> +
> +  /* Crash if we weren't passed a valid function pointer.  */
> +  ABORT_INSTRUCTION;
> +  return 0;
> +}
> diff --git a/sysdeps/hppa/dl-trampoline.S b/sysdeps/hppa/dl-trampoline.S
> index b61a13684a..93e93d9157 100644
> --- a/sysdeps/hppa/dl-trampoline.S
> +++ b/sysdeps/hppa/dl-trampoline.S
> @@ -31,7 +31,7 @@
>     slow down __cffc when it attempts to call fixup to resolve function
>     descriptor references. Please refer to gcc/gcc/config/pa/fptr.c
> 
> -   Enter with r19 = reloc offset, r20 = got-8, r21 = fixup ltp.  */
> +   Enter with r19 = reloc offset, r20 = got-8, r21 = fixup ltp, r22 = fp.  */

OK.

> 
>  	/* RELOCATION MARKER: bl to provide gcc's __cffc with fixup loc. */
>  	.text
> @@ -61,17 +61,19 @@ _dl_runtime_resolve:
>  	copy	%sp, %r1	/* Copy previous sp */
>  	/* Save function result address (on entry) */
>  	stwm	%r28,128(%sp)
> -	/* Fillin some frame info to follow ABI */
> +	/* Fill in some frame info to follow ABI */
>  	stw	%r1,-4(%sp)	/* Previous sp */
>  	stw	%r21,-32(%sp)	/* PIC register value */
> 
>  	/* Save input floating point registers. This must be done
>  	   in the new frame since the previous frame doesn't have
>  	   enough space */
> -	ldo	-56(%sp),%r1
> +	ldo	-64(%sp),%r1
>  	fstd,ma	%fr4,-8(%r1)
>  	fstd,ma	%fr5,-8(%r1)
>  	fstd,ma	%fr6,-8(%r1)
> +
> +	bb,>=	%r19,31,2f		/* branch if not reloc offset */

OK.

>  	fstd,ma	%fr7,-8(%r1)
> 
>  	/* Set up args to fixup func, needs only two arguments  */
> @@ -79,7 +81,7 @@ _dl_runtime_resolve:
>  	copy	%r19,%r25		/* (2) reloc offset  */
> 
>  	/* Call the real address resolver. */
> -	bl	_dl_fixup,%rp
> +3:	bl	_dl_fixup,%rp
>  	copy	%r21,%r19		/* set fixup func ltp */
> 
>  	/* While the linker will set a function pointer to NULL when it
> @@ -102,7 +104,7 @@ _dl_runtime_resolve:
>  	copy	%r29, %r19
> 
>  	/* Reload arguments fp args */
> -	ldo	-56(%sp),%r1
> +	ldo	-64(%sp),%r1
>  	fldd,ma	-8(%r1),%fr4
>  	fldd,ma	-8(%r1),%fr5
>  	fldd,ma	-8(%r1),%fr6
> @@ -129,6 +131,25 @@ _dl_runtime_resolve:
>  	bv	%r0(%rp)
>  	ldo	-128(%sp),%sp
> 
> +2:
> +	/* Set up args for _dl_fix_reloc_arg.  */
> +	copy	%r22,%r26		/* (1) function pointer */
> +	depi	0,31,2,%r26		/* clear least significant bits */
> +	ldw	8+4(%r20),%r25		/* (2) got[1] == struct link_map */
> +
> +	/* Save ltp and link map arg for _dl_fixup.  */
> +	stw	%r21,-56(%sp)		/* ltp */
> +	stw	%r25,-60(%sp)		/* struct link map */
> +
> +	/* Find reloc offset. */
> +	bl	_dl_fix_reloc_arg,%rp
> +	copy	%r21,%r19		/* set func ltp */
> +
> +	/* Set up args for _dl_fixup.  */
> +	ldw	-56(%sp),%r21		/* ltp */
> +	ldw	-60(%sp),%r26		/* (1) struct link map */
> +	b	3b
> +	copy	%ret0,%r25		/* (2) reloc offset */

OK.
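
For readers less used to PA-RISC bit numbering, a C rendering of the two bit operations in the new path.  PA-RISC numbers bits from the most significant end, so bit 31 of a 32-bit register is the least significant bit.  The function names are hypothetical and the register values are passed in as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* bb,>= %r19,31,2f: branch when bit 31 (the low bit) is clear, i.e.
   when r19 is not a tagged relocation offset.  */
static bool
takes_fallback_path (uint32_t r19)
{
  return (r19 & 1u) == 0;
}

/* depi 0,31,2,%r26: clear the two least-significant bits of the
   function pointer to get the descriptor address.  */
static uint32_t
fallback_descriptor (uint32_t r19, uint32_t r22)
{
  assert (takes_fallback_path (r19));
  return r22 & ~3u;
}
```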

>          .EXIT
>          .PROCEND
>  	cfi_endproc
> @@ -153,7 +174,7 @@ _dl_runtime_profile:
>  	copy	%sp, %r1	/* Copy previous sp */
>  	/* Save function result address (on entry) */
>  	stwm	%r28,192(%sp)
> -	/* Fillin some frame info to follow ABI */
> +	/* Fill in some frame info to follow ABI */
>  	stw	%r1,-4(%sp)	/* Previous sp */
>  	stw	%r21,-32(%sp)	/* PIC register value */
> 
> @@ -181,10 +202,9 @@ _dl_runtime_profile:
>  	fstd,ma	%fr5,8(%r1)
>  	fstd,ma	%fr6,8(%r1)
>  	fstd,ma	%fr7,8(%r1)
> -	/* 32-bit stack pointer and return register */
> +	/* 32-bit stack pointer */
> +	bb,>=,n	%r19,31,2f		/* branch if not reloc offset */
>  	stw	%sp,-56(%sp)
> -	stw	%r2,-52(%sp)
> -
> 
>  	/* Set up args to fixup func, needs five arguments  */
>  	ldw	8+4(%r20),%r26		/* (1) got[1] == struct link_map */
> @@ -197,7 +217,7 @@ _dl_runtime_profile:
>  	stw	%r1, -52(%sp)		/* (5) long int *framesizep */
> 
>  	/* Call the real address resolver. */
> -	bl	_dl_profile_fixup,%rp
> +3:	bl	_dl_profile_fixup,%rp
>  	copy	%r21,%r19		/* set fixup func ltp */
> 
>  	/* Load up the returned function descriptor */
> @@ -215,7 +235,9 @@ _dl_runtime_profile:
>  	fldd,ma	8(%r1),%fr5
>  	fldd,ma	8(%r1),%fr6
>  	fldd,ma	8(%r1),%fr7
> -	ldw	-52(%sp),%rp
> +
> +	/* Reload rp register -(192+20) without adjusting stack */
> +	ldw	-212(%sp),%rp
> 
>  	/* Reload static link register -(192+16) without adjusting stack */
>  	ldw	-208(%sp),%r29
> @@ -303,6 +325,33 @@ L(cont):
>          ldw -20(%sp),%rp
>  	/* Return */
>  	bv,n	0(%r2)
> +
> +2:
> +	/* Set up args for _dl_fix_reloc_arg.  */
> +	copy	%r22,%r26		/* (1) function pointer */
> +	depi	0,31,2,%r26		/* clear least significant bits */
> +	ldw	8+4(%r20),%r25		/* (2) got[1] == struct link_map */
> +
> +	/* Save ltp and link map arg for _dl_fixup.  */
> +	stw	%r21,-92(%sp)		/* ltp */
> +	stw	%r25,-116(%sp)		/* struct link map */
> +
> +	/* Find reloc offset. */
> +	bl	_dl_fix_reloc_arg,%rp
> +	copy	%r21,%r19		/* set func ltp */
> +
> +	 /* Restore fixup ltp.  */
> +	ldw	-92(%sp),%r21		/* ltp */
> +
> +	/* Set up args to fixup func, needs five arguments  */
> +	ldw	-116(%sp),%r26		/* (1) struct link map */
> +	copy	%ret0,%r25		/* (2) reloc offset  */
> +	stw	%r25,-120(%sp)		/* Save reloc offset */
> +	ldw	-212(%sp),%r24		/* (3) profile_fixup needs rp */
> +	ldo	-56(%sp),%r23		/* (4) La_hppa_regs */
> +	ldo	-112(%sp), %r1
> +	b	3b
> +	stw	%r1, -52(%sp)		/* (5) long int *framesizep */

OK.

>          .EXIT
>          .PROCEND
>  	cfi_endproc
> diff --git a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
> index 4e83f8f17b..faf303fad0 100644
> --- a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
> +++ b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
> @@ -36,6 +36,8 @@ typedef uintptr_t uatomicptr_t;
>  typedef intmax_t atomic_max_t;
>  typedef uintmax_t uatomic_max_t;
> 
> +#define atomic_full_barrier() __sync_synchronize ()

OK. This impacts all atomic.h functions that eventually use atomic_full_barrier,
but that's fine.

> +
>  #define __HAVE_64B_ATOMICS 0
>  #define USE_ATOMIC_COMPILER_BUILTINS 0
>
John David Anglin Nov. 3, 2019, 6:32 p.m. UTC | #2
On 2019-10-29 4:42 p.m., Carlos O'Donell wrote:
> On 10/20/19 2:28 PM, John David Anglin wrote:
>> This patch fixes the data race in setting function descriptors during lazy binding (Bug 23296).
> Please post a v2 with the comments updated. I think this solution looks really
> good. We could have used the low bit of the gp to do an interlocked update,
> but because we can recompute gp we just do that and put it back into the
> right location. We always ensure writers write gp then ip. We always ensure
> readers read gp then ip.
Okay.

Regarding the ordering of reading and writing gp and ip, I'm not 100% sure we are there yet.  The PA-RISC 1.x architecture
specified that all loads and stores are observed to be performed in order.  The PA-RISC 2.0 architecture introduced
weakly ordered loads and stores.  This was implemented in the first pa8000 processor.  PA-RISC 2.0 also added
the TLB and PSW O bits, which when both set, provide ordered data accesses.  HP provided the PT_PARISC_WEAKORDER
define for ELF objects that's supposed to enable weak ordering.  However, as far as I can tell, this was not implemented.
The PSW O bit is never set in HP-UX and Linux applications.  So, PA 2.0 systems are all weakly ordered.
 
Thus, the ordering of accesses differs between PA 1.x and PA 2.0 systems.  PA 1.x semaphores don't need a barrier before
release, etc.  PA 1.x applications that need synchronization between processors break on PA 2.0 systems, sometimes in random
ways.  The general problem is the majority of our code base was designed assuming the PA 1.x model.

I added a barrier between the writing of gp and ip to ensure ordering.  We also use a floating point store when possible
to update the values together.  However, the architecture doesn't clearly explain the dependencies between memory
accesses and branch instructions.  It is possible that the ip value needs to be read using a PA 2.0 ordered load to ensure
ip is read before gp.  I don't really want to make this change if it's not necessary as it affects a very time critical path.
As I noted previously, 64-bit HP-UX appears to rely on assembly code order even though the O bit is not set.
>
>> It addresses an issue that is present mainly on SMP machines running threaded code.  In a typical
>> indirect call or PLT import stub, the target address is loaded first.  Then the global pointer
>> is loaded into the PIC register in the delay slot of a branch to the target address.  During lazy
>> binding, the target address is a trampoline which transfers to _dl_runtime_resolve().
> Correct.
>
>> _dl_runtime_resolve() uses the relocation offset stored in the global pointer and the linkage map
>> stored in the trampoline to find the relocation.  Then, the function descriptor is updated.
> OK.
>
>> In a multi-threaded application, it is possible for the global pointer to be updated between the
>> load of the target address and the global pointer.  When this happens, the relocation offset has
>> been replaced by the new global pointer.  The function pointer has probably been updated as well
>> but there is no way to find the address of the function descriptor and to transfer to the target.
>> So, _dl_runtime_resolve() typically crashes.
> You argue that the current elf_machine_fixup_plt is invalid then?
Yes.
>
> This argues that the only workable solution is the atomic floating point double word store?
Binutils has been changed to provide 8-byte alignment for the PLT.  The patch uses a floating point double
word store when possible.  But we need to handle old code with a 4-byte aligned PLT.
>
>> HP-UX addressed this problem by adding an extra pc-relative branch to each descriptor.  The
>> descriptor is initially set up to point to the branch.  The branch then transfers to the trampoline.
>> This allowed the trampoline code to figure out which descriptor was being used without any
>> modification to user code.  I didn't use this approach as it is more complex and potentially
>> changes function pointer canonicalization.
> OK.
>
>> The order of loading the target address and global pointer in indirect calls was not consistent
>> with the order used in import stubs.  In particular, $$dyncall and some inline versions of it
>> loaded the global pointer first.  This was inconsistent with the global pointer being updated
>> first in dl-machine.h.  Assuming the accesses are ordered, we want elf_machine_fixup_plt() to
>> store the global pointer first and calls to load it last.  Then, the global pointer will be
>> correct when the target function is entered.
> Correct.
>
> Writers: store gp, store ip.
> Readers: load ip, load gp.
>
> If readers see a new ip, they will also be guaranteed to see a new gp. OK.
> If readers see an old ip, they -may- see an old gp. OK.
> If readers see an old ip, they -may- see a new gp. BAD.
>  * If we see a new gp it won't have the low-bit set and we recompute it.
Yes.
>
>> However, just to make things more fun, HP added support for out-of-order execution of accesses
>> in PA 2.0.  The accesses used by calls are weakly ordered.  So, it's possible under some circumstances
>> that a function might be entered with the wrong global pointer.  However, HP uses weakly ordered
>> accesses in 64-bit HP-UX, so I assume that loading the global pointer in the delay slot of the
>> branch must work consistently.
> Agreed.
>
>> The basic fix for the race is a combination of modifying user code to preserve the address of the
>> function descriptor in register %r22 and setting the least-significant bit in the relocation offset.
>> The latter was suggested by Carlos as a way to distinguish relocation offsets from global pointer
>> values.  Conventionally, %r22 is used as the address of the function descriptor in calls to $$dyncall.
>> So, it wasn't hard to preserve the address in %r22.
> OK.
>
>> I have updated gcc trunk and gcc-9 branch to not clobber %r22 in $$dyncall and inline indirect calls.
>> I have also modified the import stubs in binutils trunk and the 2.33 branch to preserve %r22.  This
>> required making the stubs one instruction longer but we save one relocation.  I also modified binutils
>> to align the .plt section on an 8-byte boundary.  This allows descriptors to be updated atomically.
> Right, with a floating point double word store.
>
>> With these changes, _dl_runtime_resolve() can fall back to an alternate mechanism to find the relocation
>> offset when it has been clobbered.  There's just one additional instruction in the fast path. I tested
>> the fallback function, _dl_fix_reloc_arg(), by changing the branch to always use the fallback.  Old
>> code still runs as it did before.
> OK.
>
>> _dl_runtime_profile() assembles but I don't think the testsuite exercises it sufficiently.
>>
>> With this change, I haven't observed any problem with lazy binding in several testsuite runs.  With 2.29,
>> builds fail almost every time on Debian unstable.
>>
>> Dave
>> --
>>  sysdeps/hppa/dl-fptr.c                        | 18 ++++---
>>  sysdeps/hppa/dl-machine.h                     | 28 +++++++++--
>>  sysdeps/hppa/dl-runtime.c                     | 33 +++++++++++++
>>  sysdeps/hppa/dl-trampoline.S                  | 71 ++++++++++++++++++++++-----
>>  sysdeps/unix/sysv/linux/hppa/atomic-machine.h |  2 +
>>  5 files changed, 130 insertions(+), 22 deletions(-)
>>
>> diff --git a/sysdeps/hppa/dl-fptr.c b/sysdeps/hppa/dl-fptr.c
>> index af1acb0701..c841632906 100644
>> --- a/sysdeps/hppa/dl-fptr.c
>> +++ b/sysdeps/hppa/dl-fptr.c
>> @@ -172,8 +172,8 @@ make_fdesc (ElfW(Addr) ip, ElfW(Addr) gp)
>>      }
>>
>>   install:
>> -  fdesc->ip = ip;
>>    fdesc->gp = gp;
>> +  fdesc->ip = ip;
> OK. make_fdesc is a writer so store to gp then ip.
This is noncritical and is done just for consistency, as this setup happens before the application runs.
>
>>    return (ElfW(Addr)) fdesc;
>>  }
>> @@ -350,7 +350,9 @@ ElfW(Addr)
>>  _dl_lookup_address (const void *address)
>>  {
>>    ElfW(Addr) addr = (ElfW(Addr)) address;
>> -  unsigned int *desc, *gptr;
>> +  ElfW(Word) reloc_arg;
>> +  volatile unsigned int *desc;
>> +  unsigned int *gptr;
> Why do you use volatile for desc?
To ensure gcc doesn't optimize away the reload of the return value desc[0] after the call to _dl_fixup.
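
A sketch of the pattern in question, with hypothetical names (lookup_entry, fake_fixup, UNRESOLVED) and a stand-in for _dl_fixup.  Through a volatile pointer, each desc[0] in the source is a separate load, so the value returned is the one present after the fixup call rather than the one tested before it.

```c
#include <assert.h>
#include <stdint.h>

#define UNRESOLVED 0u  /* Hypothetical marker for "needs fixup".  */

/* Return the entry point in DESC[0], resolving it first if needed.  */
static uint32_t
lookup_entry (volatile uint32_t *desc,
              void (*fixup) (volatile uint32_t *))
{
  assert (desc != 0 && fixup != 0);

  if (desc[0] == UNRESOLVED)
    fixup (desc);

  /* Volatile access: this is a fresh load, not a reuse of the value
     tested above.  */
  return desc[0];
}

/* Hypothetical stand-in for _dl_fixup: writes gp, then ip.  */
static void
fake_fixup (volatile uint32_t *desc)
{
  desc[1] = 0x20000u;
  desc[0] = 0x1234u;
}
```
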
>
> To avoid an atomic operation and still tell the compiler
> the data is really volatile (updated by another thread)?
>
>>    /* Return ADDR if the least-significant two bits of ADDR are not consistent
>>       with ADDR being a linker defined function pointer.  The normal value for
>> @@ -367,7 +369,11 @@ _dl_lookup_address (const void *address)
>>    if (!_dl_read_access_allowed (desc))
>>      return addr;
>>
>> -  /* Load first word of candidate descriptor.  It should be a pointer
>> +  /* First load the relocation offset.  */
>> +  reloc_arg = (ElfW(Word)) desc[1];
>> +  atomic_full_barrier();
> This is a LoadLoad with a full barrier between it, and it means that load
> of reloc_arg is always first, and load of gptr is second, with no reordering.
Yes.
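
To make the reader-side ordering concrete, here is a minimal sketch; it is not the glibc code. The names fdesc_snapshot and read_fdesc_ordered are made up for illustration. The descriptor is modeled as two 32-bit words with ip at index 0 and gp (or the reloc offset) at index 1, and __sync_synchronize() stands in for atomic_full_barrier() as in the atomic-machine.h hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of a two-word function descriptor.  */
struct fdesc_snapshot
{
  uint32_t ip;           /* word 0: entry point or trampoline address */
  uint32_t gp_or_reloc;  /* word 1: gp, or reloc offset before binding */
};

/* Read word 1 first, then word 0, with a full barrier in between so
   neither the compiler nor the CPU can swap the two loads.  */
static struct fdesc_snapshot
read_fdesc_ordered (const volatile uint32_t *desc)
{
  struct fdesc_snapshot s;

  assert (desc != NULL);
  s.gp_or_reloc = desc[1];
  __sync_synchronize ();
  s.ip = desc[0];
  return s;
}
```

If the ip read this way still points at the trampoline, the value read from word 1 was loaded before any writer could have published the new ip, which is why the copy in reloc_arg is the one passed to _dl_fixup.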
>
>> +
>> +  /* Then load first word of candidate descriptor.  It should be a pointer
>>       with word alignment and point to memory that can be read.  */
>>    gptr = (unsigned int *) desc[0];
> If gptr is resolved, then you know reloc_arg is also resolved.
>
> If gptr is not resolved, then you know nothing about reloc_arg, and you
> immediately enter _dl_fixup. The reloc_arg may already have been resolved.
> It still leaves you possibly calling _dl_fixup with a new reloc_arg? No
> because you can detect a gp without the low-bit set.
It looks like we need to call _dl_fix_reloc_arg() if the low bit in the gp is not set.  This
can't be done in __cffc().  Yuck!
>>    if (((unsigned int) gptr & 3) != 0
>> @@ -377,8 +383,8 @@ _dl_lookup_address (const void *address)
>>    /* See if descriptor requires resolution.  The following trampoline is
>>       used in each global offset table for function resolution:
>>
>> -		ldw 0(r20),r22
>> -		bv r0(r22)
>> +		ldw 0(r20),r21
>> +		bv r0(r21)
> OK. Use r21.
>
>>  		ldw 4(r20),r21
>>       tramp:	b,l .-12,r20
>>  		depwi 0,31,2,r20
>> @@ -389,7 +395,7 @@ _dl_lookup_address (const void *address)
>>    if (gptr[0] == 0xea9f1fdd			/* b,l .-12,r20     */
>>        && gptr[1] == 0xd6801c1e			/* depwi 0,31,2,r20 */
>>        && (ElfW(Addr)) gptr[2] == elf_machine_resolve ())
>> -    _dl_fixup ((struct link_map *) gptr[5], (ElfW(Word)) desc[1]);
>> +    _dl_fixup ((struct link_map *) gptr[5], reloc_arg);
> OK. Use the descriptor copy e.g. reloc_arg.
Note that reloc_arg is loaded before the ip value in the descriptor.
>
>>    return (ElfW(Addr)) desc[0];
> OK.
>
>>  }
>> diff --git a/sysdeps/hppa/dl-machine.h b/sysdeps/hppa/dl-machine.h
>> index 5aa219a5d4..c3d34717e8 100644
>> --- a/sysdeps/hppa/dl-machine.h
>> +++ b/sysdeps/hppa/dl-machine.h
>> @@ -117,10 +117,28 @@ elf_machine_fixup_plt (struct link_map *map, lookup_t t,
>>    volatile Elf32_Addr *rfdesc = reloc_addr;
>>    /* map is the link_map for the caller, t is the link_map for the object
>>       being called */
>> -  rfdesc[1] = value.gp;
>> -  /* Need to ensure that the gp is visible before the code
>> -     entry point is updated */
>> -  rfdesc[0] = value.ip;
>> +
>> +  /* We would like the function descriptor to be double word aligned.  This
>> +     helps performance (ip and gp then reside on the same cache line) and
>> +     we can update the pair atomically with a single store.  However, the
>> +     linker doesn't currently ensure this alignment.  */
>> +  if ((unsigned int)reloc_addr & 7)
>> +    {
>> +      /* Need to ensure that the gp is visible before the code
>> +         entry point is updated */
>> +      rfdesc[1] = value.gp;
>> +      atomic_full_barrier();
> The full fence between the two stores ensures they are not reordered.
>
>> +      rfdesc[0] = value.ip;
> So gp is always stored first followed by ip.
>
>> +    }
>> +  else
>> +    {
>> +      /* Update pair atomically with floating point store.  */
>> +      union { ElfW(Word) v[2]; double d; } u;
>> +
>> +      u.v[0] = value.ip;
>> +      u.v[1] = value.gp;
>> +      *(volatile double *)rfdesc = u.d;
> OK. This makes sense because we update both, and it can't be wrong in any reader.
It could still be wrong in a reader if it somehow loads gp before ip (an ordered load of ip may be
needed to prevent this).
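
For reference, the two writer paths can be sketched in portable GNU C as below. This is an illustration under assumptions, not the patch: publish_fdesc and u64_alias_t are made-up names, and the aligned path uses __atomic_store_n on a 64-bit integer in place of the floating-point double store used in the patch. It assumes the target can do an aligned 8-byte atomic store.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit type that is allowed to alias the 32-bit descriptor words.  */
typedef uint64_t __attribute__ ((__may_alias__)) u64_alias_t;

/* Publish ip/gp into a two-word descriptor (word 0 = ip, word 1 = gp).  */
static void
publish_fdesc (volatile uint32_t *words, uint32_t ip, uint32_t gp)
{
  assert (((uintptr_t) words & 3) == 0);

  if (((uintptr_t) words & 7) != 0)
    {
      /* Not double-word aligned: make gp visible before ip.  */
      words[1] = gp;
      __sync_synchronize ();
      words[0] = ip;
    }
  else
    {
      /* Double-word aligned: update both words with one store.  */
      union { uint32_t v[2]; uint64_t both; } u;

      u.v[0] = ip;
      u.v[1] = gp;
      __atomic_store_n ((volatile u64_alias_t *) words, u.both,
                        __ATOMIC_SEQ_CST);
    }
}
```

Filling the union by word index keeps the layout independent of byte order, the same way the patch fills u.v[0] and u.v[1] before storing u.d.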
>
>> +    }
>>    return value;
>>  }
>>
>> @@ -265,7 +283,7 @@ elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
>>  		     here.  The trampoline code will load the proper
>>  		     LTP and pass the reloc offset to the fixup
>>  		     function.  */
>> -		  fptr->gp = iplt - jmprel;
>> +		  fptr->gp = (iplt - jmprel) | 1;
> Set the low-bit of the reloc offset.
>
> This way you can distinguish between reloc offset, and adjusted gp.
>
> This requires a macro and a huge comment.
>
> e.g.
>
> /* The gp slot in the function descriptor contains the relocation
>    offset before resolution.  To distinguish between a resolved
>    gp value and an unresolved relocation offset we set an unused
>    bit in the relocation offset.  This would allow us to do a
>    synchronized two word update using this bit (interlocked
>    update), but instead of waiting for the update we simply
>    recompute the gp value given that we know the ip.  */
> #define PA_GP_RELOC 1
Okay.
>
> Then:
>
> fptr->gp = (iplt - jmprel) | PA_GP_RELOC
>
> That way we don't forget what the magic "|1" is.
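
A small sketch of how the tag bit could be wrapped, using the PA_GP_RELOC name suggested above. The three helper functions are hypothetical and not part of the patch. The sketch assumes reloc offsets are always even (they are multiples of sizeof (Elf32_Rela)) and that a resolved gp never has the low bit set, which is what the discussion above relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bit marks "this gp slot still holds a relocation offset".  */
#define PA_GP_RELOC 1u

/* Tag an (even) relocation offset before storing it in the gp slot.  */
static inline uint32_t
tag_reloc_offset (uint32_t offset)
{
  assert ((offset & PA_GP_RELOC) == 0);
  return offset | PA_GP_RELOC;
}

/* True if the gp slot holds a tagged relocation offset, not a real gp.  */
static inline bool
is_reloc_offset (uint32_t gp_word)
{
  return (gp_word & PA_GP_RELOC) != 0;
}

/* Recover the relocation offset from a tagged gp slot value.  */
static inline uint32_t
untag_reloc_offset (uint32_t gp_word)
{
  return gp_word & ~PA_GP_RELOC;
}
```

The bb,>= test on bit 31 of r19 in the trampoline corresponds to the is_reloc_offset check here, and the reloc_offset/reloc_index macros in dl-runtime.c correspond to the untag step.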
>
>>  		} /* r_sym != 0 */
>>  	      else
>>  		{
>> diff --git a/sysdeps/hppa/dl-runtime.c b/sysdeps/hppa/dl-runtime.c
>> new file mode 100644
>> index 0000000000..189bb32cde
>> --- /dev/null
>> +++ b/sysdeps/hppa/dl-runtime.c
>> @@ -0,0 +1,33 @@
>> +/* Clear least-significant bit of relocation offset.  */
> Needs a copyright header with the appropriate year.
Okay.
>
>> +#define reloc_offset (reloc_arg & ~1)
>> +#define reloc_index  (reloc_arg & ~1) / sizeof (PLTREL)
>> +
>> +#include <elf/dl-runtime.c>
>> +
> Suggest:
>
> /* The caller has encountered a partially relocated function
>    descriptor.  The gp of the descriptor has been updated, but
>    not the ip.  We find the function descriptor again and compute
>    the relocation offset and return that to the caller.  The caller
>    will continue on to call _dl_fixup with the relocation offset.  */
Okay.
>
>> +ElfW(Word)
>> +attribute_hidden __attribute ((noinline)) ARCH_FIXUP_ATTRIBUTE
>> +_dl_fix_reloc_arg (struct fdesc *fptr, struct link_map *l)
>> +{
>> +  Elf32_Addr l_addr, iplt, jmprel, end_jmprel, r_type;
>> +  const Elf32_Rela *reloc;
>> +
>> +  l_addr = l->l_addr;
>> +  jmprel = D_PTR(l, l_info[DT_JMPREL]);
>> +  end_jmprel = jmprel + l->l_info[DT_PLTRELSZ]->d_un.d_val;
>> +
>> +  /* Process the relocs...  */
> Suggest "Look for the entry..."
Okay.
>
>> +  for (iplt = jmprel; iplt < end_jmprel; iplt += sizeof (Elf32_Rela))
>> +    {
>> +      reloc = (const Elf32_Rela *) iplt;
>> +      r_type = ELF32_R_TYPE (reloc->r_info);
>> +
>> +      if (__builtin_expect (r_type == R_PARISC_IPLT, 1)
>> +	  && fptr == (struct fdesc *) (reloc->r_offset + l_addr))
>> +	/* Return reloc offset.  */
> Suggest "Found entry. Return the reloc offset."
Okay.
>
>> +	return iplt - jmprel;
>> +    }
>> +
>> +  /* Crash if we weren't passed a valid function pointer.  */
>> +  ABORT_INSTRUCTION;
>> +  return 0;
>> +}
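
The recovery step is a linear scan of the PLT relocation table for the entry whose target is the descriptor we were called through. A stand-alone sketch with simplified types follows; rela_entry, find_reloc_offset and RELOC_NOT_FOUND are illustrative names only. The low 8 bits of r_info are taken as the relocation type, as ELF32_R_TYPE does, and the sketch returns a sentinel where the real function executes ABORT_INSTRUCTION.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for Elf32_Rela (12 bytes per entry).  */
struct rela_entry
{
  uint32_t r_offset;
  uint32_t r_info;
  int32_t r_addend;
};

#define RELOC_NOT_FOUND ((size_t) -1)

/* Return the byte offset of the table entry of type WANTED_TYPE whose
   relocated target (r_offset + load_addr) equals FDESC_ADDR.  */
static size_t
find_reloc_offset (const struct rela_entry *tab, size_t n,
                   uint32_t load_addr, uint32_t fdesc_addr,
                   uint32_t wanted_type)
{
  assert (tab != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    if ((tab[i].r_info & 0xff) == wanted_type
        && tab[i].r_offset + load_addr == fdesc_addr)
      return i * sizeof tab[0];

  return RELOC_NOT_FOUND;
}
```

The scan is O(n) in the number of PLT relocations, but it only runs on the rare path where a caller saw the new gp together with the old ip.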
>> diff --git a/sysdeps/hppa/dl-trampoline.S b/sysdeps/hppa/dl-trampoline.S
>> index b61a13684a..93e93d9157 100644
>> --- a/sysdeps/hppa/dl-trampoline.S
>> +++ b/sysdeps/hppa/dl-trampoline.S
>> @@ -31,7 +31,7 @@
>>     slow down __cffc when it attempts to call fixup to resolve function
>>     descriptor references. Please refer to gcc/gcc/config/pa/fptr.c
>>
>> -   Enter with r19 = reloc offset, r20 = got-8, r21 = fixup ltp.  */
>> +   Enter with r19 = reloc offset, r20 = got-8, r21 = fixup ltp, r22 = fp.  */
> OK.
>
>>  	/* RELOCATION MARKER: bl to provide gcc's __cffc with fixup loc. */
>>  	.text
>> @@ -61,17 +61,19 @@ _dl_runtime_resolve:
>>  	copy	%sp, %r1	/* Copy previous sp */
>>  	/* Save function result address (on entry) */
>>  	stwm	%r28,128(%sp)
>> -	/* Fillin some frame info to follow ABI */
>> +	/* Fill in some frame info to follow ABI */
>>  	stw	%r1,-4(%sp)	/* Previous sp */
>>  	stw	%r21,-32(%sp)	/* PIC register value */
>>
>>  	/* Save input floating point registers. This must be done
>>  	   in the new frame since the previous frame doesn't have
>>  	   enough space */
>> -	ldo	-56(%sp),%r1
>> +	ldo	-64(%sp),%r1
>>  	fstd,ma	%fr4,-8(%r1)
>>  	fstd,ma	%fr5,-8(%r1)
>>  	fstd,ma	%fr6,-8(%r1)
>> +
>> +	bb,>=	%r19,31,2f		/* branch if not reloc offset */
> OK.
>
>>  	fstd,ma	%fr7,-8(%r1)
>>
>>  	/* Set up args to fixup func, needs only two arguments  */
>> @@ -79,7 +81,7 @@ _dl_runtime_resolve:
>>  	copy	%r19,%r25		/* (2) reloc offset  */
>>
>>  	/* Call the real address resolver. */
>> -	bl	_dl_fixup,%rp
>> +3:	bl	_dl_fixup,%rp
>>  	copy	%r21,%r19		/* set fixup func ltp */
>>
>>  	/* While the linker will set a function pointer to NULL when it
>> @@ -102,7 +104,7 @@ _dl_runtime_resolve:
>>  	copy	%r29, %r19
>>
>>  	/* Reload arguments fp args */
>> -	ldo	-56(%sp),%r1
>> +	ldo	-64(%sp),%r1
>>  	fldd,ma	-8(%r1),%fr4
>>  	fldd,ma	-8(%r1),%fr5
>>  	fldd,ma	-8(%r1),%fr6
>> @@ -129,6 +131,25 @@ _dl_runtime_resolve:
>>  	bv	%r0(%rp)
>>  	ldo	-128(%sp),%sp
>>
>> +2:
>> +	/* Set up args for _dl_fix_reloc_arg.  */
>> +	copy	%r22,%r26		/* (1) function pointer */
>> +	depi	0,31,2,%r26		/* clear least significant bits */
>> +	ldw	8+4(%r20),%r25		/* (2) got[1] == struct link_map */
>> +
>> +	/* Save ltp and link map arg for _dl_fixup.  */
>> +	stw	%r21,-56(%sp)		/* ltp */
>> +	stw	%r25,-60(%sp)		/* struct link map */
>> +
>> +	/* Find reloc offset. */
>> +	bl	_dl_fix_reloc_arg,%rp
>> +	copy	%r21,%r19		/* set func ltp */
>> +
>> +	/* Set up args for _dl_fixup.  */
>> +	ldw	-56(%sp),%r21		/* ltp */
>> +	ldw	-60(%sp),%r26		/* (1) struct link map */
>> +	b	3b
>> +	copy	%ret0,%r25		/* (2) reloc offset */
> OK.
>
>>          .EXIT
>>          .PROCEND
>>  	cfi_endproc
>> @@ -153,7 +174,7 @@ _dl_runtime_profile:
>>  	copy	%sp, %r1	/* Copy previous sp */
>>  	/* Save function result address (on entry) */
>>  	stwm	%r28,192(%sp)
>> -	/* Fillin some frame info to follow ABI */
>> +	/* Fill in some frame info to follow ABI */
>>  	stw	%r1,-4(%sp)	/* Previous sp */
>>  	stw	%r21,-32(%sp)	/* PIC register value */
>>
>> @@ -181,10 +202,9 @@ _dl_runtime_profile:
>>  	fstd,ma	%fr5,8(%r1)
>>  	fstd,ma	%fr6,8(%r1)
>>  	fstd,ma	%fr7,8(%r1)
>> -	/* 32-bit stack pointer and return register */
>> +	/* 32-bit stack pointer */
>> +	bb,>=,n	%r19,31,2f		/* branch if not reloc offset */
>>  	stw	%sp,-56(%sp)
>> -	stw	%r2,-52(%sp)
>> -
>>
>>  	/* Set up args to fixup func, needs five arguments  */
>>  	ldw	8+4(%r20),%r26		/* (1) got[1] == struct link_map */
>> @@ -197,7 +217,7 @@ _dl_runtime_profile:
>>  	stw	%r1, -52(%sp)		/* (5) long int *framesizep */
>>
>>  	/* Call the real address resolver. */
>> -	bl	_dl_profile_fixup,%rp
>> +3:	bl	_dl_profile_fixup,%rp
>>  	copy	%r21,%r19		/* set fixup func ltp */
>>
>>  	/* Load up the returned function descriptor */
>> @@ -215,7 +235,9 @@ _dl_runtime_profile:
>>  	fldd,ma	8(%r1),%fr5
>>  	fldd,ma	8(%r1),%fr6
>>  	fldd,ma	8(%r1),%fr7
>> -	ldw	-52(%sp),%rp
>> +
>> +	/* Reload rp register -(192+20) without adjusting stack */
>> +	ldw	-212(%sp),%rp
>>
>>  	/* Reload static link register -(192+16) without adjusting stack */
>>  	ldw	-208(%sp),%r29
>> @@ -303,6 +325,33 @@ L(cont):
>>          ldw -20(%sp),%rp
>>  	/* Return */
>>  	bv,n	0(%r2)
>> +
>> +2:
>> +	/* Set up args for _dl_fix_reloc_arg.  */
>> +	copy	%r22,%r26		/* (1) function pointer */
>> +	depi	0,31,2,%r26		/* clear least significant bits */
>> +	ldw	8+4(%r20),%r25		/* (2) got[1] == struct link_map */
>> +
>> +	/* Save ltp and link map arg for _dl_fixup.  */
>> +	stw	%r21,-92(%sp)		/* ltp */
>> +	stw	%r25,-116(%sp)		/* struct link map */
>> +
>> +	/* Find reloc offset. */
>> +	bl	_dl_fix_reloc_arg,%rp
>> +	copy	%r21,%r19		/* set func ltp */
>> +
>> +	 /* Restore fixup ltp.  */
>> +	ldw	-92(%sp),%r21		/* ltp */
>> +
>> +	/* Set up args to fixup func, needs five arguments  */
>> +	ldw	-116(%sp),%r26		/* (1) struct link map */
>> +	copy	%ret0,%r25		/* (2) reloc offset  */
>> +	stw	%r25,-120(%sp)		/* Save reloc offset */
>> +	ldw	-212(%sp),%r24		/* (3) profile_fixup needs rp */
>> +	ldo	-56(%sp),%r23		/* (4) La_hppa_regs */
>> +	ldo	-112(%sp), %r1
>> +	b	3b
>> +	stw	%r1, -52(%sp)		/* (5) long int *framesizep */
> OK.
>
>>          .EXIT
>>          .PROCEND
>>  	cfi_endproc
>> diff --git a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
>> index 4e83f8f17b..faf303fad0 100644
>> --- a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
>> +++ b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
>> @@ -36,6 +36,8 @@ typedef uintptr_t uatomicptr_t;
>>  typedef intmax_t atomic_max_t;
>>  typedef uintmax_t uatomic_max_t;
>>
>> +#define atomic_full_barrier() __sync_synchronize ()
> OK. This impacts all atomic.h functions that eventually use atomic_full_barrier,
> but that's fine.
I think it is needed (see above discussion on PA 2.0 ordering).

The alternative is to use an asm in elf_machine_fixup_plt() to use ordered stores when the PLT is
not aligned.
>
>> +
>>  #define __HAVE_64B_ATOMICS 0
>>  #define USE_ATOMIC_COMPILER_BUILTINS 0
>>
>

Dave

Patch

diff --git a/sysdeps/hppa/dl-fptr.c b/sysdeps/hppa/dl-fptr.c
index af1acb0701..c841632906 100644
--- a/sysdeps/hppa/dl-fptr.c
+++ b/sysdeps/hppa/dl-fptr.c
@@ -172,8 +172,8 @@  make_fdesc (ElfW(Addr) ip, ElfW(Addr) gp)
     }

  install:
-  fdesc->ip = ip;
   fdesc->gp = gp;
+  fdesc->ip = ip;

   return (ElfW(Addr)) fdesc;
 }
@@ -350,7 +350,9 @@  ElfW(Addr)
 _dl_lookup_address (const void *address)
 {
   ElfW(Addr) addr = (ElfW(Addr)) address;
-  unsigned int *desc, *gptr;
+  ElfW(Word) reloc_arg;
+  volatile unsigned int *desc;
+  unsigned int *gptr;

   /* Return ADDR if the least-significant two bits of ADDR are not consistent
      with ADDR being a linker defined function pointer.  The normal value for
@@ -367,7 +369,11 @@  _dl_lookup_address (const void *address)
   if (!_dl_read_access_allowed (desc))
     return addr;

-  /* Load first word of candidate descriptor.  It should be a pointer
+  /* First load the relocation offset.  */
+  reloc_arg = (ElfW(Word)) desc[1];
+  atomic_full_barrier();
+
+  /* Then load first word of candidate descriptor.  It should be a pointer
      with word alignment and point to memory that can be read.  */
   gptr = (unsigned int *) desc[0];
   if (((unsigned int) gptr & 3) != 0
@@ -377,8 +383,8 @@  _dl_lookup_address (const void *address)
   /* See if descriptor requires resolution.  The following trampoline is
      used in each global offset table for function resolution:

-		ldw 0(r20),r22
-		bv r0(r22)
+		ldw 0(r20),r21
+		bv r0(r21)
 		ldw 4(r20),r21
      tramp:	b,l .-12,r20
 		depwi 0,31,2,r20
@@ -389,7 +395,7 @@  _dl_lookup_address (const void *address)
   if (gptr[0] == 0xea9f1fdd			/* b,l .-12,r20     */
       && gptr[1] == 0xd6801c1e			/* depwi 0,31,2,r20 */
       && (ElfW(Addr)) gptr[2] == elf_machine_resolve ())
-    _dl_fixup ((struct link_map *) gptr[5], (ElfW(Word)) desc[1]);
+    _dl_fixup ((struct link_map *) gptr[5], reloc_arg);

   return (ElfW(Addr)) desc[0];
 }
diff --git a/sysdeps/hppa/dl-machine.h b/sysdeps/hppa/dl-machine.h
index 5aa219a5d4..c3d34717e8 100644
--- a/sysdeps/hppa/dl-machine.h
+++ b/sysdeps/hppa/dl-machine.h
@@ -117,10 +117,28 @@  elf_machine_fixup_plt (struct link_map *map, lookup_t t,
   volatile Elf32_Addr *rfdesc = reloc_addr;
   /* map is the link_map for the caller, t is the link_map for the object
      being called */
-  rfdesc[1] = value.gp;
-  /* Need to ensure that the gp is visible before the code
-     entry point is updated */
-  rfdesc[0] = value.ip;
+
+  /* We would like the function descriptor to be double word aligned.  This
+     helps performance (ip and gp then reside on the same cache line) and
+     we can update the pair atomically with a single store.  However, the
+     linker doesn't currently ensure this alignment.  */
+  if ((unsigned int)reloc_addr & 7)
+    {
+      /* Need to ensure that the gp is visible before the code
+         entry point is updated */
+      rfdesc[1] = value.gp;
+      atomic_full_barrier();
+      rfdesc[0] = value.ip;
+    }
+  else
+    {
+      /* Update pair atomically with floating point store.  */
+      union { ElfW(Word) v[2]; double d; } u;
+
+      u.v[0] = value.ip;
+      u.v[1] = value.gp;
+      *(volatile double *)rfdesc = u.d;
+    }
   return value;
 }

@@ -265,7 +283,7 @@  elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
 		     here.  The trampoline code will load the proper
 		     LTP and pass the reloc offset to the fixup
 		     function.  */
-		  fptr->gp = iplt - jmprel;
+		  fptr->gp = (iplt - jmprel) | 1;
 		} /* r_sym != 0 */
 	      else
 		{
diff --git a/sysdeps/hppa/dl-runtime.c b/sysdeps/hppa/dl-runtime.c
new file mode 100644
index 0000000000..189bb32cde
--- /dev/null
+++ b/sysdeps/hppa/dl-runtime.c
@@ -0,0 +1,33 @@ 
+/* Clear least-significant bit of relocation offset.  */
+#define reloc_offset (reloc_arg & ~1)
+#define reloc_index  (reloc_arg & ~1) / sizeof (PLTREL)
+
+#include <elf/dl-runtime.c>
+
+ElfW(Word)
+attribute_hidden __attribute ((noinline)) ARCH_FIXUP_ATTRIBUTE
+_dl_fix_reloc_arg (struct fdesc *fptr, struct link_map *l)
+{
+  Elf32_Addr l_addr, iplt, jmprel, end_jmprel, r_type;
+  const Elf32_Rela *reloc;
+
+  l_addr = l->l_addr;
+  jmprel = D_PTR(l, l_info[DT_JMPREL]);
+  end_jmprel = jmprel + l->l_info[DT_PLTRELSZ]->d_un.d_val;
+
+  /* Process the relocs...  */
+  for (iplt = jmprel; iplt < end_jmprel; iplt += sizeof (Elf32_Rela))
+    {
+      reloc = (const Elf32_Rela *) iplt;
+      r_type = ELF32_R_TYPE (reloc->r_info);
+
+      if (__builtin_expect (r_type == R_PARISC_IPLT, 1)
+	  && fptr == (struct fdesc *) (reloc->r_offset + l_addr))
+	/* Return reloc offset.  */
+	return iplt - jmprel;
+    }
+
+  /* Crash if we weren't passed a valid function pointer.  */
+  ABORT_INSTRUCTION;
+  return 0;
+}
diff --git a/sysdeps/hppa/dl-trampoline.S b/sysdeps/hppa/dl-trampoline.S
index b61a13684a..93e93d9157 100644
--- a/sysdeps/hppa/dl-trampoline.S
+++ b/sysdeps/hppa/dl-trampoline.S
@@ -31,7 +31,7 @@ 
    slow down __cffc when it attempts to call fixup to resolve function
    descriptor references. Please refer to gcc/gcc/config/pa/fptr.c

-   Enter with r19 = reloc offset, r20 = got-8, r21 = fixup ltp.  */
+   Enter with r19 = reloc offset, r20 = got-8, r21 = fixup ltp, r22 = fp.  */

 	/* RELOCATION MARKER: bl to provide gcc's __cffc with fixup loc. */
 	.text
@@ -61,17 +61,19 @@  _dl_runtime_resolve:
 	copy	%sp, %r1	/* Copy previous sp */
 	/* Save function result address (on entry) */
 	stwm	%r28,128(%sp)
-	/* Fillin some frame info to follow ABI */
+	/* Fill in some frame info to follow ABI */
 	stw	%r1,-4(%sp)	/* Previous sp */
 	stw	%r21,-32(%sp)	/* PIC register value */

 	/* Save input floating point registers. This must be done
 	   in the new frame since the previous frame doesn't have
 	   enough space */
-	ldo	-56(%sp),%r1
+	ldo	-64(%sp),%r1
 	fstd,ma	%fr4,-8(%r1)
 	fstd,ma	%fr5,-8(%r1)
 	fstd,ma	%fr6,-8(%r1)
+
+	bb,>=	%r19,31,2f		/* branch if not reloc offset */
 	fstd,ma	%fr7,-8(%r1)

 	/* Set up args to fixup func, needs only two arguments  */
@@ -79,7 +81,7 @@  _dl_runtime_resolve:
 	copy	%r19,%r25		/* (2) reloc offset  */

 	/* Call the real address resolver. */
-	bl	_dl_fixup,%rp
+3:	bl	_dl_fixup,%rp
 	copy	%r21,%r19		/* set fixup func ltp */

 	/* While the linker will set a function pointer to NULL when it
@@ -102,7 +104,7 @@  _dl_runtime_resolve:
 	copy	%r29, %r19

 	/* Reload arguments fp args */
-	ldo	-56(%sp),%r1
+	ldo	-64(%sp),%r1
 	fldd,ma	-8(%r1),%fr4
 	fldd,ma	-8(%r1),%fr5
 	fldd,ma	-8(%r1),%fr6
@@ -129,6 +131,25 @@  _dl_runtime_resolve:
 	bv	%r0(%rp)
 	ldo	-128(%sp),%sp

+2:
+	/* Set up args for _dl_fix_reloc_arg.  */
+	copy	%r22,%r26		/* (1) function pointer */
+	depi	0,31,2,%r26		/* clear least significant bits */
+	ldw	8+4(%r20),%r25		/* (2) got[1] == struct link_map */
+
+	/* Save ltp and link map arg for _dl_fixup.  */
+	stw	%r21,-56(%sp)		/* ltp */
+	stw	%r25,-60(%sp)		/* struct link map */
+
+	/* Find reloc offset. */
+	bl	_dl_fix_reloc_arg,%rp
+	copy	%r21,%r19		/* set func ltp */
+
+	/* Set up args for _dl_fixup.  */
+	ldw	-56(%sp),%r21		/* ltp */
+	ldw	-60(%sp),%r26		/* (1) struct link map */
+	b	3b
+	copy	%ret0,%r25		/* (2) reloc offset */
         .EXIT
         .PROCEND
 	cfi_endproc
@@ -153,7 +174,7 @@  _dl_runtime_profile:
 	copy	%sp, %r1	/* Copy previous sp */
 	/* Save function result address (on entry) */
 	stwm	%r28,192(%sp)
-	/* Fillin some frame info to follow ABI */
+	/* Fill in some frame info to follow ABI */
 	stw	%r1,-4(%sp)	/* Previous sp */
 	stw	%r21,-32(%sp)	/* PIC register value */

@@ -181,10 +202,9 @@  _dl_runtime_profile:
 	fstd,ma	%fr5,8(%r1)
 	fstd,ma	%fr6,8(%r1)
 	fstd,ma	%fr7,8(%r1)
-	/* 32-bit stack pointer and return register */
+	/* 32-bit stack pointer */
+	bb,>=,n	%r19,31,2f		/* branch if not reloc offset */
 	stw	%sp,-56(%sp)
-	stw	%r2,-52(%sp)
-

 	/* Set up args to fixup func, needs five arguments  */
 	ldw	8+4(%r20),%r26		/* (1) got[1] == struct link_map */
@@ -197,7 +217,7 @@  _dl_runtime_profile:
 	stw	%r1, -52(%sp)		/* (5) long int *framesizep */

 	/* Call the real address resolver. */
-	bl	_dl_profile_fixup,%rp
+3:	bl	_dl_profile_fixup,%rp
 	copy	%r21,%r19		/* set fixup func ltp */

 	/* Load up the returned function descriptor */
@@ -215,7 +235,9 @@  _dl_runtime_profile:
 	fldd,ma	8(%r1),%fr5
 	fldd,ma	8(%r1),%fr6
 	fldd,ma	8(%r1),%fr7
-	ldw	-52(%sp),%rp
+
+	/* Reload rp register -(192+20) without adjusting stack */
+	ldw	-212(%sp),%rp

 	/* Reload static link register -(192+16) without adjusting stack */
 	ldw	-208(%sp),%r29
@@ -303,6 +325,33 @@  L(cont):
         ldw -20(%sp),%rp
 	/* Return */
 	bv,n	0(%r2)
+
+2:
+	/* Set up args for _dl_fix_reloc_arg.  */
+	copy	%r22,%r26		/* (1) function pointer */
+	depi	0,31,2,%r26		/* clear least significant bits */
+	ldw	8+4(%r20),%r25		/* (2) got[1] == struct link_map */
+
+	/* Save ltp and link map arg for _dl_fixup.  */
+	stw	%r21,-92(%sp)		/* ltp */
+	stw	%r25,-116(%sp)		/* struct link map */
+
+	/* Find reloc offset. */
+	bl	_dl_fix_reloc_arg,%rp
+	copy	%r21,%r19		/* set func ltp */
+
+	 /* Restore fixup ltp.  */
+	ldw	-92(%sp),%r21		/* ltp */
+
+	/* Set up args to fixup func, needs five arguments  */
+	ldw	-116(%sp),%r26		/* (1) struct link map */
+	copy	%ret0,%r25		/* (2) reloc offset  */
+	stw	%r25,-120(%sp)		/* Save reloc offset */
+	ldw	-212(%sp),%r24		/* (3) profile_fixup needs rp */
+	ldo	-56(%sp),%r23		/* (4) La_hppa_regs */
+	ldo	-112(%sp), %r1
+	b	3b
+	stw	%r1, -52(%sp)		/* (5) long int *framesizep */
         .EXIT
         .PROCEND
 	cfi_endproc
diff --git a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
index 4e83f8f17b..faf303fad0 100644
--- a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
+++ b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
@@ -36,6 +36,8 @@  typedef uintptr_t uatomicptr_t;
 typedef intmax_t atomic_max_t;
 typedef uintmax_t uatomic_max_t;

+#define atomic_full_barrier() __sync_synchronize ()
+
 #define __HAVE_64B_ATOMICS 0
 #define USE_ATOMIC_COMPILER_BUILTINS 0