[v2,3/3] Mutex: Optimize adaptive spin algorithm

Message ID 1524624988-29141-3-git-send-email-kemi.wang@intel.com
State New
Series
  • [v2,1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex

Commit Message

kemi April 25, 2018, 2:56 a.m. UTC
The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block. But in the current implementation of
spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (cmpxchg)
while the lock is contended. That is a poor choice on many targets,
because it forces expensive memory synchronization among processors and
penalizes other running threads: it constantly floods the system with
"read for ownership" requests, which are much more expensive to process
than a plain read. Thus, we spin with an MO read until we observe that
the lock is no longer acquired, as suggested by Andi Kleen.
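The difference between the two spin strategies can be sketched with C11 atomics. This is an illustration on a hypothetical standalone lock word, not glibc's code (glibc's spin uses `atomic_spin_nop ()` and its own atomic macros):

```c
#include <stdatomic.h>

/* Old style: every probe is a cmpxchg.  Even a failing cmpxchg issues a
   read-for-ownership request, pulling the cache line into exclusive
   state and bouncing it between spinning CPUs.  */
static void
spin_lock_cas (atomic_int *lock)
{
  int expected;
  do
    expected = 0;
  while (!atomic_compare_exchange_weak_explicit (lock, &expected, 1,
						 memory_order_acquire,
						 memory_order_relaxed));
}

/* New style (test-and-test-and-set): probe with plain relaxed loads,
   which keep the line in shared state, and attempt the cmpxchg only
   once the lock has been observed free.  */
static void
spin_lock_ttas (atomic_int *lock)
{
  for (;;)
    {
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
	;			/* glibc inserts atomic_spin_nop () here.  */
      int expected = 0;
      if (atomic_compare_exchange_strong_explicit (lock, &expected, 1,
						   memory_order_acquire,
						   memory_order_relaxed))
	return;
    }
}
```

Under contention the read-only inner loop keeps the lock's cache line shared among the spinners, so only the final acquisition attempt causes an ownership transfer.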

It is usually pointless to keep spinning when a spinner fails to acquire
the lock at the moment it observed the lock to be available: under severe
lock contention, that spinner is unlikely to get the lock during the rest
of its spin window. Therefore, it is better to call into the kernel and
block the thread, as suggested by Tim Chen. This at least:
a) saves CPU time;
b) saves power budget;
c) reduces the cache line bouncing caused by spinning.

Test machine:
2-socket Skylake platform, 112 cores, 62 GB RAM

Test case: mutex-adaptive-thread (Contended pthread adaptive spin mutex
with global update)
Usage: make bench BENCHSET=mutex-adaptive-thread
Test result:
+----------------+-----------------+-----------------+------------+
|  Configuration |      Base       |      Head       | % Change   |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |           Critical section size: 1x            |
+----------------+------------------------------------------------+
|1 thread        |  7.06542e+08    |  7.08998e+08    |   +0.3%    |
|2 threads       |  5.73018e+07    |  7.20815e+07    |   +25.6%   |
|3 threads       |  3.78511e+07    |  1.15544e+08    |   +205.3%  |
|4 threads       |  2.28214e+07    |  6.57055e+07    |   +187.9%  |
|28 threads      |  1.68839e+07    |  5.19314e+07    |   +207.6%  |
|56 threads      |  1.84983e+07    |  5.06522e+07    |   +173.8%  |
|112 threads     |  2.3568e+07     |  4.95375e+07    |   +110.2%  |
+----------------+------------------------------------------------+
|                |           Critical section size: 10x           |
+----------------+------------------------------------------------+
|1 thread        |  5.40274e+08    |  5.47189e+08    |   +1.3%    |
|2 threads       |  4.55684e+07    |  6.03275e+07    |   +32.4%   |
|3 threads       |  3.05702e+07    |  1.04035e+08    |   +240.3%  |
|4 threads       |  2.17341e+07    |  5.57264e+07    |   +156.4%  |
|28 threads      |  1.39503e+07    |  4.53525e+07    |   +225.1%  |
|56 threads      |  1.50154e+07    |  4.16203e+07    |   +177.2%  |
|112 threads     |  1.90175e+07    |  3.88308e+07    |   +104.2%  |
+----------------+------------------------------------------------+
|                |           Critical section size: 100x          |
+----------------+------------------------------------------------+
|1 thread        |  7.23372e+07    | 7.25654e+07     |   +0.3%    |
|2 threads       |  2.67302e+07    | 2.40265e+07     |   -10.1%   |
|3 threads       |  1.89936e+07    | 2.70759e+07     |   +42.6%   |
|4 threads       |  1.62423e+07    | 2.25097e+07     |   +38.6%   |
|28 threads      |  9.85977e+06    | 1.59003e+07     |   +61.3%   |
|56 threads      |  8.11471e+06    | 1.6344e+07      |   +101.4%  |
|112 threads     |  8.58044e+06    | 1.53827e+07     |   +79.3%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 1000x         |
+----------------+------------------------------------------------+
|1 thread        |  8.16913e+06    |  8.16126e+06    |   -0.1%    |
|2 threads       |  5.82987e+06    |  5.92752e+06    |   +1.7%    |
|3 threads       |  6.05125e+06    |  6.37068e+06    |   +5.3%    |
|4 threads       |  5.91259e+06    |  6.27616e+06    |   +6.1%    |
|28 threads      |  2.40584e+06    |  2.60738e+06    |   +8.4%    |
|56 threads      |  2.32643e+06    |  2.3245e+06     |   -0.1%    |
|112 threads     |  2.32366e+06    |  2.30271e+06    |   -0.9%    |
+----------------+-----------------+-----------------+------------+

    * nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex

ChangeLog:
    V1->V2: fix format issues

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 ChangeLog                 |  4 ++++
 nptl/pthread_mutex_lock.c | 32 ++++++++++++++++++--------------
 2 files changed, 22 insertions(+), 14 deletions(-)

Comments

Florian Weimer May 2, 2018, 8:19 a.m. UTC | #1
On 04/25/2018 04:56 AM, Kemi Wang wrote:
> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>         if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>   	{
>   	  int cnt = 0;
> +	  int max_cnt = MIN (__mutex_aconf.spin_count,
> +			mutex->__data.__spins * 2 + 100);
> +
> +      /* MO read while spinning */
> +      do
> +        {
> +         atomic_spin_nop ();
> +        }
> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
> +            ++cnt < max_cnt);
> +        /* Try to acquire the lock if lock is available or the spin count
> +         * is run out, call into kernel to block if fails
> +         */
> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
> +        LLL_MUTEX_LOCK (mutex);
>
> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
> +    }

The indentation is off.  Comments should end with a ”.  ” (dot and two 
spaces).  Multi-line comments do not start with “*” on subsequent lines. 
  We don't use braces when we can avoid them.  Operators such as “&&” 
should be on the following line when breaking up lines.

Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't an 
unconditional call to LLL_MUTEX_LOCK be sufficient?

But the real question is whether the old way of doing CAS in a loop is 
beneficial on other, non-Intel architectures.  You either need to get 
broad consensus from the large SMP architectures (various aarch64 
implementations, IBM POWER and Z), or somehow make this opt-in at the 
source level.

Ideally, we would also get an ack from AMD for the x86 change, but they 
seem not particularly active on libc-alpha, so that may not be realistic.

Thanks,
Florian
kemi May 2, 2018, 11:04 a.m. UTC | #2
On 2018年05月02日 16:19, Florian Weimer wrote:
> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>         if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>       {
>>         int cnt = 0;
> …
>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>> +            mutex->__data.__spins * 2 + 100);
>> +
>> +      /* MO read while spinning */
>> +      do
>> +        {
>> +         atomic_spin_nop ();
>> +        }
>> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>> +            ++cnt < max_cnt);
>> +        /* Try to acquire the lock if lock is available or the spin count
>> +         * is run out, call into kernel to block if fails
>> +         */
>> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>> +        LLL_MUTEX_LOCK (mutex);
>>   
> …
>> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>> +    }
> 
> The indentation is off.  Comments should end with a ”.  ” (dot and two spaces).  Multi-line comments do not start with “*” on subsequent lines.  We don't use braces when we can avoid them.  Operators such as “&&” should be on the following line when breaking up lines.
> 

Will fold these changes into the next version.
I am not familiar with the glibc coding style; apologies for that.

> Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't be an unconditional call to LLL_MUTEX_LOCK be sufficient?
> 

The purpose of calling LLL_MUTEX_TRYLOCK here is to try to acquire the lock in user
space, without blocking, once we have observed that the lock is available. Thus, when
multiple spinners contend for the lock, only one spinner can acquire the lock
successfully and the others fall into blocking.

I am not sure an unconditional call to LLL_MUTEX_LOCK, as you suggest, can satisfy
this purpose.

> But the real question is if the old way of doing CAS in a loop is beneficial on other, non-Intel architectures.  You either need get broad consensus from the large SMP architectures (various aarch64 implementations, IBM POWER and Z), or somehow make this opt-in at the source level.
> 

That would make it a platform-specific change, with an obvious performance improvement on x86.
And according to Adhemerval, this change also shows some improvement on aarch64.
If you or someone else is still concerned about performance regressions on other architectures,
making this opt-in would address those worries.

"
I checked the change on a 64 cores aarch64 machine, but
differently than previous patch this one seems to show improvements:

nr_threads      base            head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
1               27566206        28776779 (4.206770)  28778073 (4.211078)
2               8498813         9129102 (6.904173)   7042975 (-20.670782)
7               5019434         5832195 (13.935765)  5098511 (1.550982)
14              4379155         6507212 (32.703053)  5200018 (15.785772)
28              4397464         4584480 (4.079329)   4456767 (1.330628)
56              4020956         3534899 (-13.750237) 4096197 (1.836850)
"

Also, I would appreciate it if someone could test this patch on other architectures using
the benchmark provided in the second patch.

> Ideally, we would also get an ack from AMD for the x86 change, but they seem not particularly active on libc-alpha, so that may not be realistic.
> 
> Thanks,
> Florian
Florian Weimer May 8, 2018, 3:08 p.m. UTC | #3
On 05/02/2018 01:04 PM, kemi wrote:
> 
> 
> On 2018年05月02日 16:19, Florian Weimer wrote:
>> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>>> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>          if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>        {
>>>          int cnt = 0;
>> …
>>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>>> +            mutex->__data.__spins * 2 + 100);
>>> +
>>> +      /* MO read while spinning */
>>> +      do
>>> +        {
>>> +         atomic_spin_nop ();
>>> +        }
>>> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>> +            ++cnt < max_cnt);
>>> +        /* Try to acquire the lock if lock is available or the spin count
>>> +         * is run out, call into kernel to block if fails
>>> +         */
>>> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>> +        LLL_MUTEX_LOCK (mutex);
>>>    
>> …
>>> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>> +    }
>>
>> The indentation is off.  Comments should end with a ”.  ” (dot and two spaces).  Multi-line comments do not start with “*” on subsequent lines.  We don't use braces when we can avoid them.  Operators such as “&&” should be on the following line when breaking up lines.
>>
> 
> Will fold these changes in next version.
> I am not familiar with glibc coding style, apologize for that.

No apology needed, it takes some time to get used to.

>> Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't be an unconditional call to LLL_MUTEX_LOCK be sufficient?
>>
> 
> The purpose of calling LLL_MUTEX_TRYLOCK here is to try to acquire the lock at user
> space without block when we observed the lock is available. Thus, in case of multiple
> spinners contending for the lock,  only one spinner can acquire the lock successfully
> and others fall into block.
> 
> I am not sure an unconditional call to LLL_MUTEX_LOCK as you mentioned here can satisfy
> this purpose.

It's what we use for the default case.  It expands to lll_lock, so it 
should try atomic_compare_and_exchange_bool_acq first and only perform a 
futex syscall in case there is contention.  So I do think that 
LLL_MUTEX_TRYLOCK is redundant here.  Perhaps manually review the 
disassembly to make sure?
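Florian's point can be sketched with a simplified model of the lll_* operations in C11 atomics. This is an illustration only, not glibc's code: the real macros are arch-specific assembly, and the real slow path marks the futex contended and sleeps in futex(FUTEX_WAIT) instead of spinning as this sketch does.

```c
#include <stdatomic.h>

/* Model of LLL_MUTEX_TRYLOCK: one CAS, 0 on success.  */
static int
model_trylock (atomic_int *futex)
{
  int expected = 0;
  return atomic_compare_exchange_strong_explicit (futex, &expected, 1,
						  memory_order_acquire,
						  memory_order_relaxed)
	 ? 0 : 1;
}

/* Model of LLL_MUTEX_LOCK: it begins with exactly the same CAS...  */
static void
model_lock (atomic_int *futex)
{
  while (model_trylock (futex) != 0)
    {
      /* ...and only on failure enters the slow path, which in glibc
	 is __lll_lock_wait: set the futex word to "contended" and
	 call futex(FUTEX_WAIT).  Spinning here is a stand-in.  */
    }
}
```

Since `model_lock` starts with the same uncontended CAS that `model_trylock` performs, a trylock immediately followed by lock simply repeats that CAS, which is why the extra LLL_MUTEX_TRYLOCK looks redundant.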

> 
>> But the real question is if the old way of doing CAS in a loop is beneficial on other, non-Intel architectures.  You either need get broad consensus from the large SMP architectures (various aarch64 implementations, IBM POWER and Z), or somehow make this opt-in at the source level.
>>
> 
> That would be a platform-specific change and have obvious performance improvement for x86 architecture.
> And according to Adhemerval, this change could also have some improvement for arrch64 architecture.
> If you or someone else still have some concern of performance regression on other architecture, making
> this opt-in could eliminate people's worries.
> 
> "
> I checked the change on a 64 cores aarch64 machine, but
> differently than previous patch this one seems to show improvements:
> 
> nr_threads      base            head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
> 1               27566206        28776779 (4.206770)  28778073 (4.211078)
> 2               8498813         9129102 (6.904173)   7042975 (-20.670782)
> 7               5019434         5832195 (13.935765)  5098511 (1.550982)
> 14              4379155         6507212 (32.703053)  5200018 (15.785772)
> 28              4397464         4584480 (4.079329)   4456767 (1.330628)
> 56              4020956         3534899 (-13.750237) 4096197 (1.836850)
> "

Ah, nice, I had missed that.  I suppose this means we can risk enabling 
it by default.

Thanks,
Florian
kemi May 14, 2018, 8:09 a.m. UTC | #4
On 2018年05月08日 23:08, Florian Weimer wrote:
> On 05/02/2018 01:04 PM, kemi wrote:
>>
>>
>> On 2018年05月02日 16:19, Florian Weimer wrote:
>>> On 04/25/2018 04:56 AM, Kemi Wang wrote:
>>>> @@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>>          if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>>        {
>>>>          int cnt = 0;
>>> …
>>>> +      int max_cnt = MIN (__mutex_aconf.spin_count,
>>>> +            mutex->__data.__spins * 2 + 100);
>>>> +
>>>> +      /* MO read while spinning */
>>>> +      do
>>>> +        {
>>>> +         atomic_spin_nop ();
>>>> +        }
>>>> +      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>>> +            ++cnt < max_cnt);
>>>> +        /* Try to acquire the lock if lock is available or the spin count
>>>> +         * is run out, call into kernel to block if fails
>>>> +         */
>>>> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>>> +        LLL_MUTEX_LOCK (mutex);
>>>>    
>>> …
>>>> +      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>>> +    }
>>>
>>> The indentation is off.  Comments should end with a ”.  ” (dot and two spaces).  Multi-line comments do not start with “*” on subsequent lines.  We don't use braces when we can avoid them.  Operators such as “&&” should be on the following line when breaking up lines.
>>>
>>
>> Will fold these changes in next version.
>> I am not familiar with glibc coding style, apologize for that.
> 
> No apology needed, it takes some time to get use to.
> 
>>> Why is the LLL_MUTEX_TRYLOCK call still needed?  Shouldn't be an unconditional call to LLL_MUTEX_LOCK be sufficient?
>>>
>>
>> The purpose of calling LLL_MUTEX_TRYLOCK here is to try to acquire the lock at user
>> space without block when we observed the lock is available. Thus, in case of multiple
>> spinners contending for the lock,  only one spinner can acquire the lock successfully
>> and others fall into block.
>>
>> I am not sure an unconditional call to LLL_MUTEX_LOCK as you mentioned here can satisfy
>> this purpose.
> 
> It's what we use for the default case.  It expands to lll_lock, so it should try atomic_compare_and_exchange_bool_acq first and only perform a futex syscall in case there is contention.  So I do think that LLL_MUTEX_TRYLOCK is redundant here.  Perhaps manually review the disassembly to make sure?
> 

Makes sense. I think you are right.

78            /* Normal mutex.  */
   0x00007ffff7989b20 <+80>:    and    $0x80,%edx
   0x00007ffff7989b26 <+86>:    mov    $0x1,%edi
   0x00007ffff7989b2b <+91>:    xor    %eax,%eax
   0x00007ffff7989b2d <+93>:    mov    %edx,%esi
   *0x00007ffff7989b2f <+95>:    lock cmpxchg %edi,(%r8)*
   *0x00007ffff7989b34 <+100>:   je     0x7ffff7989b4c <__GI___pthread_mutex_lock+124>*
   0x00007ffff7989b36 <+102>:   lea    (%r8),%rdi
   0x00007ffff7989b39 <+105>:   sub    $0x80,%rsp
   0x00007ffff7989b40 <+112>:   callq  0x7ffff7990830 <__lll_lock_wait>
   0x00007ffff7989b45 <+117>:   add    $0x80,%rsp

79            LLL_MUTEX_LOCK (mutex);
   0x00007ffff7989b4c <+124>:   mov    0x8(%r8),%edx
   0x00007ffff7989b50 <+128>:   test   %edx,%edx
   0x00007ffff7989b52 <+130>:   jne    0x7ffff7989cc9 <__GI___pthread_mutex_lock+505>
   0x00007ffff7989cc9 <+505>:   lea    0xa090(%rip),%rcx        # 0x7ffff7993d60 <__PRETTY_FUNCTION__.8768>
   0x00007ffff7989cd0 <+512>:   lea    0x9f05(%rip),%rsi        # 0x7ffff7993bdc
   0x00007ffff7989cd7 <+519>:   lea    0x9f1b(%rip),%rdi        # 0x7ffff7993bf9
   0x00007ffff7989cde <+526>:   mov    $0x4f,%edx
   0x00007ffff7989ce3 <+531>:   callq  0x7ffff7985840 <__assert_fail@plt>

80            assert (mutex->__data.__owner == 0);


>>
>>> But the real question is if the old way of doing CAS in a loop is beneficial on other, non-Intel architectures.  You either need get broad consensus from the large SMP architectures (various aarch64 implementations, IBM POWER and Z), or somehow make this opt-in at the source level.
>>>
>>
>> That would be a platform-specific change and have obvious performance improvement for x86 architecture.
>> And according to Adhemerval, this change could also have some improvement for arrch64 architecture.
>> If you or someone else still have some concern of performance regression on other architecture, making
>> this opt-in could eliminate people's worries.
>>
>> "
>> I checked the change on a 64 cores aarch64 machine, but
>> differently than previous patch this one seems to show improvements:
>>
>> nr_threads      base            head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>> 1               27566206        28776779 (4.206770)  28778073 (4.211078)
>> 2               8498813         9129102 (6.904173)   7042975 (-20.670782)
>> 7               5019434         5832195 (13.935765)  5098511 (1.550982)
>> 14              4379155         6507212 (32.703053)  5200018 (15.785772)
>> 28              4397464         4584480 (4.079329)   4456767 (1.330628)
>> 56              4020956         3534899 (-13.750237) 4096197 (1.836850)
>> "
> 
> Ah, nice, I had missed that.  I suppose this means we can risk enabling it by default.
> 
There are two major changes in this patch.
1) Change the spinning from trylock to a read-only loop. This should benefit many
architectures, since it avoids the expensive memory synchronization among processors
caused by read-for-ownership requests. A similar approach is already used by pthread
spinlocks and kernel spinlocks.

2) Block a thread if it fails to acquire the lock after observing that the lock was
available. This is a more radical change. It can bring significant performance
improvement under severe lock contention (many threads contending for one lock),
because in that case a spinner can almost never acquire the lock within its spin
window; putting it on a waiting list saves CPU time and power budget and eliminates
the overhead of the atomic operations.
For workloads where a thread may well acquire the lock while spinning, I understand
it is controversial to call into the kernel to block rather than keep spinning until
the count runs out.

Therefore, I will drop the second change from the V3 series. More investigation and
testing will be done before pushing the second change.
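For reference, the adaptive bookkeeping that is kept in both the old and new versions can be modeled as below. The constants are taken from the patch; `spin_count_cap` stands in for the `__mutex_aconf.spin_count` tunable, and the function names are illustrative:

```c
/* Per-lock spin limit: capped by the tunable, seeded from this mutex's
   recent history (spins * 2 + 100, per the patch).  */
static int
max_spin_count (int spin_count_cap, int spins)
{
  int candidate = spins * 2 + 100;
  return candidate < spin_count_cap ? candidate : spin_count_cap;
}

/* After each contended acquisition, __spins moves 1/8 of the way toward
   the spin count just observed, i.e. it is an exponential moving
   average of recent contention.  */
static int
update_spins (int spins, int cnt)
{
  return spins + (cnt - spins) / 8;
}
```

So a mutex that repeatedly sees long spins gradually raises its own limit toward the tunable cap, while lightly contended mutexes decay back toward the base value.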

I will send V3 for that if no one objects. Thanks.

> Thanks,
> Florian

Patch

diff --git a/ChangeLog b/ChangeLog
index 76d2628..4c81693 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@ 
 2018-04-24  Kemi Wang <kemi.wang@intel.com>
 
+	* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex.
+
+2018-04-24  Kemi Wang <kemi.wang@intel.com>
+
 	* benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for adaptive
 	spin mutex.
 	* benchmark/Makefile: Add adaptive spin mutex benchmark.
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..3442c58 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include <pthread_mutex_conf.h>
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -124,21 +125,24 @@  __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
 	{
 	  int cnt = 0;
-	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
-			     mutex->__data.__spins * 2 + 10);
-	  do
-	    {
-	      if (cnt++ >= max_cnt)
-		{
-		  LLL_MUTEX_LOCK (mutex);
-		  break;
-		}
-	      atomic_spin_nop ();
-	    }
-	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
+	  int max_cnt = MIN (__mutex_aconf.spin_count,
+			mutex->__data.__spins * 2 + 100);
+
+      /* MO read while spinning */
+      do
+        {
+         atomic_spin_nop ();
+        }
+      while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
+            ++cnt < max_cnt);
+        /* Try to acquire the lock if lock is available or the spin count
+         * is run out, call into kernel to block if fails
+         */
+      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+        LLL_MUTEX_LOCK (mutex);
 
-	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
-	}
+      mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+    }
       assert (mutex->__data.__owner == 0);
     }
   else