diff mbox series

Protect _dl_profile_fixup data-dependency order [BZ #23690]

Message ID 20180919214829.20551-1-tuliom@linux.ibm.com
State New
Headers show
Series Protect _dl_profile_fixup data-dependency order [BZ #23690] | expand

Commit Message

Tulio Magno Quites Machado Filho Sept. 19, 2018, 9:48 p.m. UTC
The field reloc_result->addr is used to indicate if the rest of the
fields of reloc_result have already been written, creating a
data-dependency order.
Reading reloc_result->addr to the variable value requires to complete
before reading the rest of the fields of reloc_result.
Likewise, the writes to the other fields of the reloc_result must
complete before reloc_result-addr is updated.

2018-09-19  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>

	[BZ #23690]
	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
	modification order when accessing reloc_result->addr.

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
---
 elf/dl-runtime.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Comments

Joseph Myers Sept. 20, 2018, 12:04 a.m. UTC | #1
On Wed, 19 Sep 2018, Tulio Magno Quites Machado Filho wrote:

> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);

> +	atomic_store_release(resultp, value);

Missing spaces before '('.
Carlos O'Donell Sept. 20, 2018, 12:16 a.m. UTC | #2
On 09/19/2018 05:48 PM, Tulio Magno Quites Machado Filho wrote:
> The field reloc_result->addr is used to indicate if the rest of the
> fields of reloc_result have already been written, creating a
> data-dependency order.
> Reading reloc_result->addr to the variable value requires to complete
> before reading the rest of the fields of reloc_result.
> Likewise, the writes to the other fields of the reloc_result must
> complete before reloc_result-addr is updated.

Good catch. This needs just a little more work and it's done.

When you're ready please post a v2 and TO me for review.

> 2018-09-19  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>
> 
> 	[BZ #23690]
> 	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
> 	modification order when accessing reloc_result->addr.
> 
> Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
> ---
>  elf/dl-runtime.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
> index 63bbc89776..6518e66fd6 100644
> --- a/elf/dl-runtime.c
> +++ b/elf/dl-runtime.c
> @@ -183,9 +183,16 @@ _dl_profile_fixup (
>    /* This is the address in the array where we store the result of previous
>       relocations.  */
>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>  
> -  DL_FIXUP_VALUE_TYPE value = *resultp;
> +  /* CONCURRENCY NOTES:
> +
> +     The following code uses reloc_result->addr to indicate if it is the first
> +     time this object is being relocated.
> +     Reading/Writing from/to reloc_result->addr must not happen before previous
> +     writes to reloc_result complete as they could end-up with an incomplete
> +     struct.  */

This is not quite accurate. The following code uses DL_FIXUP_VALUE_CODE_ADDR to
access a potential member of addr to indicate it is the first time the object is
being relocated. The "guard" variable in this access is either the address itself
or the ip of a function descriptor (the only two implementations of "code addr").

Also we don't explain what other data is being accessed? What non-function-locale
data is being updated as part of the guard variable check?

In the case of the function descriptor it's easy to know that the fdesc's gp is 
going to be written to, and without a acquire we won't see that write, or the new 
ip value, so we might have two threads update the same thing, in theory a benign 
data race which we should fix.

In theory elf_ifunc_invoke() is called and that could write to arbitrary memory
with the IFUNC resolver, whose values should be seen by any future thread via
the new load acquire you are putting in place. You might be able to run the IFUNC
resolver twice which could be bad?


> +  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);

You are potentially requiring an atomic load of a structure whose size can be
an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
and 128-bit for 64-bit ia64). These architectures might not have such wide
atomic operations, and ...

>    if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)

... they only actually look at the value returned by DL_FIXUP_VALUE_CODE_ADDR(value).

The "guard" in this case is the IP value in the address or function descriptor
(only two implementations we have in glibc).

You need an atomic load acquire of the guard (ip), but it's hidden behind macros.

>      {
>        /* This is the first time we have to relocate this object.  */
> @@ -346,7 +353,10 @@ _dl_profile_fixup (
>  
>        /* Store the result for later runs.  */
>        if (__glibc_likely (! GLRO(dl_bind_not)))
> -	*resultp = value;
> +	/* Guarantee all previous writes complete before
> +	   resultp (reloc_result->addr) is updated.  See CONCURRENCY NOTES
> +	   earlier  */
> +	atomic_store_release(resultp, value);

Likewise you need an atomic store release of the ip value in the function
descriptor to avoid needing arbitrarily large atomic load/stores.

>      }
>  
>    /* By default we do not call the pltexit function.  */
> 

What I'd like to see:

- Refactor macros to get access to the "code addr" (ip of the fdesc or regular
  addr), and do the atomic loads and stores against that. Or feel free to
  delegate this to more macros at the machine level e.g.
  DL_FIXUP_ATOMIC_LOAD_ACQUIRE_REALLY_LONG_NAME_VALUE_CODE_ADDR (value),
  but I feel this looses out on the fact that we have generic atomic macros
  and so could use something like DL_FIXUP_VALUE_ADDR_CODE which returns
  the address of the code addr to load, and then you operate atomically on
  that to do the load.

- Fix ./sysdeps/hppa/dl-machine.h (elf_machine_fixup_plt) to do an atomic
  store release to update ip. Likewise for ia64. Both should reference the
  concurrency notes in elf/dl-runtime.c.

- Can you write a test for this? A test which starts several threads, barriers
  them all, then starts them, and lets them try to resolve (test is compiled
  lazy binding forced) the plt entries, then repeat for a while, each iteration
  with a different function that hasn't been bound yet (maybe synthetically
  generate a few thousand functions). If we ever see this fail we'll know we
  did something wrong.

And also feel free to tell me I'm wrong :-)
John David Anglin Sept. 20, 2018, 1:59 a.m. UTC | #3
On 2018-09-19 8:16 PM, Carlos O'Donell wrote:
>>    DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
> You are potentially requiring an atomic load of a structure whose size can be
> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
> and 128-bit for 64-bit ia64). These architectures might not have such wide
> atomic operations, and ...
We have implemented 64-bit atomic loads and stores for 32-bit hppa. They 
are not well tested but
they might work.  They use floating point loads and stores, and kernel 
helper.  The code is pretty horrific :-(

Dave
Carlos O'Donell Sept. 20, 2018, 2 a.m. UTC | #4
On 09/19/2018 09:59 PM, John David Anglin wrote:
> On 2018-09-19 8:16 PM, Carlos O'Donell wrote:
>>>    DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>> You are potentially requiring an atomic load of a structure whose size can be
>> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
>> and 128-bit for 64-bit ia64). These architectures might not have such wide
>> atomic operations, and ...
> We have implemented 64-bit atomic loads and stores for 32-bit hppa. They are not well tested but
> they might work.  They use floating point loads and stores, and kernel helper.  The code is pretty horrific :-(

We only need to use the fdesc->ip as the guard, so we don't really need the 64-bit
atomic, but other algorithms like the new pthread condvars can use them effectively
to accelerate and avoid 2 lws kernel helper calls and instead use 1 lws kernel helper
64-bit atomic.
Tulio Magno Quites Machado Filho Sept. 20, 2018, 1:03 p.m. UTC | #5
Joseph Myers <joseph@codesourcery.com> writes:

> On Wed, 19 Sep 2018, Tulio Magno Quites Machado Filho wrote:
>
>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>
>> +	atomic_store_release(resultp, value);
>
> Missing spaces before '('.

Argh!  Fixed both lines.

Thanks!
John David Anglin Sept. 20, 2018, 1:34 p.m. UTC | #6
On 2018-09-19 10:00 PM, Carlos O'Donell wrote:
> On 09/19/2018 09:59 PM, John David Anglin wrote:
>> On 2018-09-19 8:16 PM, Carlos O'Donell wrote:
>>>>     DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>>> You are potentially requiring an atomic load of a structure whose size can be
>>> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
>>> and 128-bit for 64-bit ia64). These architectures might not have such wide
>>> atomic operations, and ...
>> We have implemented 64-bit atomic loads and stores for 32-bit hppa. They are not well tested but
>> they might work.  They use floating point loads and stores, and kernel helper.  The code is pretty horrific :-(
> We only need to use the fdesc->ip as the guard, so we don't really need the 64-bit
> atomic, but other algorithms like the new pthread condvars can use them effectively
> to accelerate and avoid 2 lws kernel helper calls and instead use 1 lws kernel helper
> 64-bit atomic.
Regarding using fdesc->ip as the guard, The gp is loaded both before and 
after the ip on hppa.
For example, $$dyncall loads gp before the branch.  This could be 
changed at the cost of one
instruction.  Stubs load gp after ip.  I don't think this is easy to change.

Dave
Tulio Magno Quites Machado Filho Sept. 20, 2018, 4:42 p.m. UTC | #7
Carlos O'Donell <carlos@redhat.com> writes:

> On 09/19/2018 05:48 PM, Tulio Magno Quites Machado Filho wrote:
>> The field reloc_result->addr is used to indicate if the rest of the
>> fields of reloc_result have already been written, creating a
>> data-dependency order.
>> Reading reloc_result->addr to the variable value requires to complete
>> before reading the rest of the fields of reloc_result.
>> Likewise, the writes to the other fields of the reloc_result must
>> complete before reloc_result-addr is updated.
>
> Good catch. This needs just a little more work and it's done.
>
> When you're ready please post a v2 and TO me for review.

Ack.

>> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
>> index 63bbc89776..6518e66fd6 100644
>> --- a/elf/dl-runtime.c
>> +++ b/elf/dl-runtime.c
>> @@ -183,9 +183,16 @@ _dl_profile_fixup (
>>    /* This is the address in the array where we store the result of previous
>>       relocations.  */
>>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
>> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>  
>> -  DL_FIXUP_VALUE_TYPE value = *resultp;
>> +  /* CONCURRENCY NOTES:
>> +
>> +     The following code uses reloc_result->addr to indicate if it is the first
>> +     time this object is being relocated.
>> +     Reading/Writing from/to reloc_result->addr must not happen before previous
>> +     writes to reloc_result complete as they could end-up with an incomplete
>> +     struct.  */
>
> This is not quite accurate. The following code uses DL_FIXUP_VALUE_CODE_ADDR to
> access a potential member of addr to indicate it is the first time the object is
> being relocated. The "guard" variable in this access is either the address itself
> or the ip of a function descriptor (the only two implementations of "code addr").

Ack.  I'll elaborate the comments in the next version of the patch.

> Also we don't explain what other data is being accessed? What non-function-locale
> data is being updated as part of the guard variable check?

I don't follow you.
Do you mean other data being updated but not in reloc_result?
What do you mean by non-function-locale?

> In the case of the function descriptor it's easy to know that the fdesc's gp is 
> going to be written to, and without a acquire we won't see that write, or the new 
> ip value, so we might have two threads update the same thing, in theory a benign 
> data race which we should fix.

Why should we fix it?

> In theory elf_ifunc_invoke() is called and that could write to arbitrary memory
> with the IFUNC resolver, whose values should be seen by any future thread via
> the new load acquire you are putting in place. You might be able to run the IFUNC
> resolver twice which could be bad?

Good point.  I haven't tested IFUNCs yet.

>> +  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>
> You are potentially requiring an atomic load of a structure whose size can be
> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
> and 128-bit for 64-bit ia64). These architectures might not have such wide
> atomic operations, and ...

Would it help hppa and ia64 if I replaced it with just memory fences? e.g.:

  DL_FIXUP_VALUE_TYPE value = *resultp;
  atomic_thread_fence_acquire ();

Likewise for the atomic_store_release () later in the code.

>>    /* By default we do not call the pltexit function.  */
>> 
>
> What I'd like to see:
>
> - Refactor macros to get access to the "code addr" (ip of the fdesc or regular
>   addr), and do the atomic loads and stores against that. Or feel free to
>   delegate this to more macros at the machine level e.g.
>   DL_FIXUP_ATOMIC_LOAD_ACQUIRE_REALLY_LONG_NAME_VALUE_CODE_ADDR (value),
>   but I feel this looses out on the fact that we have generic atomic macros
>   and so could use something like DL_FIXUP_VALUE_ADDR_CODE which returns
>   the address of the code addr to load, and then you operate atomically on
>   that to do the load.
>
> - Fix ./sysdeps/hppa/dl-machine.h (elf_machine_fixup_plt) to do an atomic
>   store release to update ip. Likewise for ia64. Both should reference the
>   concurrency notes in elf/dl-runtime.c.

If we use memory fences, I believe this won't be necessary.
Do you agree?

> - Can you write a test for this? A test which starts several threads, barriers
>   them all, then starts them, and lets them try to resolve (test is compiled
>   lazy binding forced) the plt entries, then repeat for a while, each iteration
>   with a different function that hasn't been bound yet (maybe synthetically
>   generate a few thousand functions). If we ever see this fail we'll know we
>   did something wrong.

Yes.  There is already something attached to the bug report, but it does
require dozens of threads to reproduce here.
If you're fine with that, I can adapt it.
Florian Weimer Sept. 20, 2018, 5:04 p.m. UTC | #8
* Tulio Magno Quites Machado Filho:

> Yes.  There is already something attached to the bug report, but it does
> require dozens of threads to reproduce here.
> If you're fine with that, I can adapt it.

That's okay (depending on the exact value of “dozens” 8-).

I assume it's burning CPU all the time?  Then you should put it into
nptl/ because those tests run serialized.  If you think it should run
longer than five second, we perhaps should make it an xtest.

Thanks,
Florian
Tulio Magno Quites Machado Filho Sept. 21, 2018, 5:21 p.m. UTC | #9
Florian Weimer <fweimer@redhat.com> writes:

> * Tulio Magno Quites Machado Filho:
>
>> Yes.  There is already something attached to the bug report, but it does
>> require dozens of threads to reproduce here.
>> If you're fine with that, I can adapt it.
>
> That's okay (depending on the exact value of “dozens” 8-).

I'm using 70, but I believe we should limit this according to the machine.
I don't think we need to run this test on processors with sequential con

> I assume it's burning CPU all the time?  Then you should put it into
> nptl/ because those tests run serialized.  If you think it should run
> longer than five second, we perhaps should make it an xtest.

Yes, it does burn CPU for ~30s on a POWER8.
Florian Weimer Sept. 21, 2018, 5:24 p.m. UTC | #10
* Tulio Magno Quites Machado Filho:

>> I assume it's burning CPU all the time?  Then you should put it into
>> nptl/ because those tests run serialized.  If you think it should run
>> longer than five second, we perhaps should make it an xtest.
>
> Yes, it does burn CPU for ~30s on a POWER8.

30 second wall clock time?  That's a bit borderline, but not excessive
considering the overall nptl testing time.  But maybe it should still be
an xtest.

Thanks,
Florian
Tulio Magno Quites Machado Filho Sept. 21, 2018, 5:36 p.m. UTC | #11
Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com> writes:

> Florian Weimer <fweimer@redhat.com> writes:
>
>> * Tulio Magno Quites Machado Filho:
>>
>>> Yes.  There is already something attached to the bug report, but it does
>>> require dozens of threads to reproduce here.
>>> If you're fine with that, I can adapt it.
>>
>> That's okay (depending on the exact value of “dozens” 8-).
>
> I'm using 70, but I believe we should limit this according to the machine.
> I don't think we need to run this test on processors with sequential con

Oops.  I hit send too soon. I meant to say: I don't think we need to run this
test on processors with stronger memory ordering.  Except for the extra time
when testing, I wouldn't hurt, though.
Tulio Magno Quites Machado Filho Sept. 21, 2018, 5:37 p.m. UTC | #12
Florian Weimer <fweimer@redhat.com> writes:

> * Tulio Magno Quites Machado Filho:
>
>>> I assume it's burning CPU all the time?  Then you should put it into
>>> nptl/ because those tests run serialized.  If you think it should run
>>> longer than five second, we perhaps should make it an xtest.
>>
>> Yes, it does burn CPU for ~30s on a POWER8.
>
> 30 second wall clock time?  That's a bit borderline, but not excessive
> considering the overall nptl testing time.  But maybe it should still be
> an xtest.

Yes.
Carlos O'Donell Oct. 11, 2018, 9:15 p.m. UTC | #13
On 9/20/18 9:34 AM, John David Anglin wrote:
> On 2018-09-19 10:00 PM, Carlos O'Donell wrote:
>> On 09/19/2018 09:59 PM, John David Anglin wrote:
>>> On 2018-09-19 8:16 PM, Carlos O'Donell wrote:
>>>>>     DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>>>>> +  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
>>>> You are potentially requiring an atomic load of a structure whose size can be
>>>> an arbitrary size depending on machine and ABI design (64-bit fore 32-bit hppa,
>>>> and 128-bit for 64-bit ia64). These architectures might not have such wide
>>>> atomic operations, and ...
>>> We have implemented 64-bit atomic loads and stores for 32-bit hppa. They are not well tested but
>>> they might work.  They use floating point loads and stores, and kernel helper.  The code is pretty horrific :-(
>> We only need to use the fdesc->ip as the guard, so we don't really need the 64-bit
>> atomic, but other algorithms like the new pthread condvars can use them effectively
>> to accelerate and avoid 2 lws kernel helper calls and instead use 1 lws kernel helper
>> 64-bit atomic.
> Regarding using fdesc->ip as the guard, The gp is loaded both before and after the ip on hppa.
> For example, $$dyncall loads gp before the branch.  This could be changed at the cost of one
> instruction.  Stubs load gp after ip.  I don't think this is easy to change.

This is fine. The point is that ip is the guard, and so all other elements of the fdesc (only
gp in this case) need to be written to before the guard is updated to indicate that the
relocation of the fdesc is complete.

This is now fixed with Tulios changes which just use an atomic_thread_acquire and
atomic_thread_release in the _dl_profile_fixup.
diff mbox series

Patch

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..6518e66fd6 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -183,9 +183,16 @@  _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
-  DL_FIXUP_VALUE_TYPE value = *resultp;
+  /* CONCURRENCY NOTES:
+
+     The following code uses reloc_result->addr to indicate if it is the first
+     time this object is being relocated.
+     Reading/Writing from/to reloc_result->addr must not happen before previous
+     writes to reloc_result complete as they could end-up with an incomplete
+     struct.  */
+  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
+  DL_FIXUP_VALUE_TYPE value = atomic_load_acquire(resultp);
   if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
     {
       /* This is the first time we have to relocate this object.  */
@@ -346,7 +353,10 @@  _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	/* Guarantee all previous writes complete before
+	   resultp (reloc_result->addr) is updated.  See CONCURRENCY NOTES
+	   earlier  */
+	atomic_store_release(resultp, value);
     }
 
   /* By default we do not call the pltexit function.  */