[2/3] Mutex: Only read while spinning

Message ID 1522394093-9835-2-git-send-email-kemi.wang@intel.com
State New
Series
  • [1/3] Tunables: Add tunables of spin count for adaptive spin mutex

Commit Message

kemi March 30, 2018, 7:14 a.m.
The pthread adaptive spin mutex spins on the lock for a while before going
to sleep. While the lock is contended and we need to wait, going straight
back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many targets, as
that forces expensive memory synchronization among processors and
penalizes other running threads. For example, it constantly floods the
system with "read for ownership" requests, which are much more expensive to
process than a single read. Thus, we use only a relaxed MO read until we
observe that the lock is no longer acquired, as suggested by Andi Kleen.
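(For illustration, a minimal standalone sketch of the resulting pattern, using
GCC __atomic builtins rather than glibc's internal atomic macros;
__builtin_ia32_pause is an x86-only stand-in for atomic_spin_nop:)

---
/* Spin with plain relaxed loads and attempt the expensive cmpxchg only
   once the lock has been observed free.  The relaxed load keeps the
   cache line in shared state instead of issuing a read-for-ownership
   request on every iteration.  */
static void
spin_lock (int *lock)
{
  int expected = 0;
  while (!__atomic_compare_exchange_n (lock, &expected, 1, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
    {
      expected = 0;                  /* cmpxchg overwrote it on failure */
      do
        __builtin_ia32_pause ();     /* x86 'pause'; assumption */
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0);
    }
}
---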

Test machine:
2-socket Skylake platform, 112 cores with 62 GB RAM

Test case: Contended pthread adaptive spin mutex with global update.
Each thread of the workload does the following (sketched in code below):
a) Lock the mutex (adaptive spin type)
b) Increment a global variable
c) Unlock the mutex
in a loop until timeout, and the main thread reports the total iteration
count of all the threads over one second.

This test case is the same as Will-it-scale.pthread_mutex3 except that the
mutex type is changed to PTHREAD_MUTEX_ADAPTIVE_NP.
github: https://github.com/antonblanchard/will-it-scale.git
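(A condensed sketch of this workload with hypothetical names; the real
benchmark is will-it-scale's pthread_mutex3 with the mutex type changed:)

---
#define _GNU_SOURCE                     /* for PTHREAD_MUTEX_ADAPTIVE_NP */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock;
static unsigned long long iterations;   /* the contended global */
static volatile int stop;

static void *
worker (void *arg)
{
  while (!stop)
    {
      pthread_mutex_lock (&lock);       /* a) lock */
      iterations++;                     /* b) global increment */
      pthread_mutex_unlock (&lock);     /* c) unlock */
    }
  return NULL;
}

int
main (void)
{
  enum { NTHREADS = 4 };                /* e.g. 1, nproc/2, nproc */
  pthread_t thr[NTHREADS];
  pthread_mutexattr_t attr;

  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&lock, &attr);

  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&thr[i], NULL, worker, NULL);
  sleep (1);                            /* measure one second */
  stop = 1;
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (thr[i], NULL);

  printf ("total iterations: %llu\n", iterations);
  return 0;
}
---

Run it under GLIBC_TUNABLES="glibc.mutex.spin_count=N" to vary the spin count
via the tunable added in patch 1.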

nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
1               51644585        51307573(-0.7%)    51323778(-0.6%)
2               7914789         10011301(+26.5%)   9867343(+24.7%)
7               1687620         4224135(+150.3%)   3430504(+103.3%)
14              1026555         3784957(+268.7%)   1843458(+79.6%)
28              962001          2886885(+200.1%)   681965(-29.1%)
56              883770          2740755(+210.1%)   364879(-58.7%)
112             1150589         2707089(+135.3%)   415261(-63.9%)

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/pthread_mutex_lock.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

Comments

Adhemerval Zanella April 5, 2018, 8:55 p.m. | #1
On 30/03/2018 04:14, Kemi Wang wrote:
> The pthread adaptive spin mutex spins on the lock for a while before going
> to sleep. While the lock is contended and we need to wait, going straight
> back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many targets, as
> that forces expensive memory synchronization among processors and
> penalizes other running threads. For example, it constantly floods the
> system with "read for ownership" requests, which are much more expensive to
> process than a single read. Thus, we use only a relaxed MO read until we
> observe that the lock is no longer acquired, as suggested by Andi Kleen.
> 
> Test machine:
> 2-socket Skylake platform, 112 cores with 62 GB RAM
> 
> Test case: Contended pthread adaptive spin mutex with global update.
> Each thread of the workload does:
> a) Lock the mutex (adaptive spin type)
> b) Increment a global variable
> c) Unlock the mutex
> in a loop until timeout, and the main thread reports the total iteration
> count of all the threads over one second.
> 
> This test case is the same as Will-it-scale.pthread_mutex3 except that the
> mutex type is changed to PTHREAD_MUTEX_ADAPTIVE_NP.
> github: https://github.com/antonblanchard/will-it-scale.git
> 
> nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
> 1               51644585        51307573(-0.7%)    51323778(-0.6%)
> 2               7914789         10011301(+26.5%)   9867343(+24.7%)
> 7               1687620         4224135(+150.3%)   3430504(+103.3%)
> 14              1026555         3784957(+268.7%)   1843458(+79.6%)
> 28              962001          2886885(+200.1%)   681965(-29.1%)
> 56              883770          2740755(+210.1%)   364879(-58.7%)
> 112             1150589         2707089(+135.3%)   415261(-63.9%)

In pthread_mutex3 the workload is basically updates of a global variable synchronized
with a mutex, so if I am reading the benchmark correctly, a higher value means
less contention. I also assume you used the 'threads' value in this table.

I checked on a 64-core aarch64 machine to see what kind of improvement, if
any, one would get with this change:

nr_threads      base            head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
1               27566206        28778254 (4.211680)   28778467 (4.212389)
2               8498813         7777589 (-9.273105)   7806043 (-8.874791)
7               5019434         2869629 (-74.915782)  3307812 (-51.744839)
14              4379155         2906255 (-50.680343)  2825041 (-55.012087)
28              4397464         3261094 (-34.846282)  3259486 (-34.912805)
56              4020956         3898386 (-3.144122)   4038505 (0.434542)

So I think this change should be platform-specific.
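One rough way to express that (these macro names are hypothetical, not existing
glibc identifiers; the idea is that each architecture's sysdeps directory could
override the defaults):

---
/* Generic fallbacks; an architecture could provide its own definitions
   to tune the relax primitive and the default spin budget.  */
#ifndef ATOMIC_SPIN_RELAX
# define ATOMIC_SPIN_RELAX() atomic_spin_nop ()
#endif
#ifndef MUTEX_ADAPTIVE_SPIN_COUNT_DEFAULT
# define MUTEX_ADAPTIVE_SPIN_COUNT_DEFAULT 100
#endif
---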

> [...]
kemi April 8, 2018, 8:28 a.m. | #2
On 06/04/2018 04:55, Adhemerval Zanella wrote:
> 
> 
> On 30/03/2018 04:14, Kemi Wang wrote:
>> [...]
>> nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>> 1               51644585        51307573(-0.7%)    51323778(-0.6%)
>> 2               7914789         10011301(+26.5%)   9867343(+24.7%)
>> 7               1687620         4224135(+150.3%)   3430504(+103.3%)
>> 14              1026555         3784957(+268.7%)   1843458(+79.6%)
>> 28              962001          2886885(+200.1%)   681965(-29.1%)
>> 56              883770          2740755(+210.1%)   364879(-58.7%)
>> 112             1150589         2707089(+135.3%)   415261(-63.9%)
> 
> In pthread_mutex3 the workload is basically updates of a global variable synchronized
> with a mutex, so if I am reading the benchmark correctly, a higher value means
> less contention. I also assume you used the 'threads' value in this table.
> 

Yes. I used the multi-threaded mode for testing.

> I checked on a 64-core aarch64 machine to see what kind of improvement, if
> any, one would get with this change:
> 
> nr_threads      base            head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
> 1               27566206        28778254 (4.211680)   28778467 (4.212389)
> 2               8498813         7777589 (-9.273105)   7806043 (-8.874791)
> 7               5019434         2869629 (-74.915782)  3307812 (-51.744839)
> 14              4379155         2906255 (-50.680343)  2825041 (-55.012087)
> 28              4397464         3261094 (-34.846282)  3259486 (-34.912805)
> 56              4020956         3898386 (-3.144122)   4038505 (0.434542)
> 

Thanks for testing it on an aarch64 machine. That's a great supplement for us.

> So I think this change should be platform-specific.
> 

It may depend on the platform, especially on the cmpxchg, MO read, and pause instructions.

Well, I think the performance also depends on the spin count. A simple explanation is that
spinning always fails under severe lock contention with large thread counts (this fits case b) below);
thus, the overhead increases with the spin count: the larger the spin count, the more the overhead.
It does not surprise me to see a performance regression with SPIN_COUNT=1000, but I am surprised
that the performance does not increase with the smaller SPIN_COUNT compared to the larger one.
Did you run "export LD_SPIN_COUNT=10" before the second round of tests?

Analysis:
Let's assume the time spent in the critical section is *c*, and the time spent spinning
on the lock is *s*. Then s = k*c, where *k* is a factor.

a) If *k* < 1, which means the spinning time is less than the time a task spends
in the critical section, it is easy to see that the overhead of acquiring the lock
equals spin+wait+wake, because spinning always fails due to timeout (the case of a large critical section).

b) If *k* > 1 && the number of threads *n* >= *k* (small critical section case).
   In our case, the threads do nothing but compete for the lock to enter the critical section in a loop, so we
can simply assume that the critical-section arrival rate of each thread is 1. And, since all the threads start
at the same time, we can assume all the threads compete for the lock simultaneously at time frame T(0):

 T(0)   task(0) <-- task(1) <-- task(2)  ...   task(n)     // original status
 |
 T(1)   task(1) <-- task(2)    ...   task(n) <-- task(0)   // after time c, task 0 releases the lock and competes for it again
 .
 .
 .
 T(k)   task(k) <-- task(k+1)  ...   task(n) <-- task(0)  ...  task(k-1)

After time k*c (time T(k)), task(k) gets the lock, and task(k+1)...task(n) call
futex_wait() and block due to timeout; by then task(0) has been spinning for time (k-1)*c, task(1)
for time (k-2)*c, ..., task(k-1) for time c.

 T(k+1)   task(k+1) <-- task(k+2)  ...   task(n) <-- task(0)  ...  task(k-1)
          sleep         sleep

Task(k) then takes some time to wake up task(k+1), which spends time c in the critical section; this repeats again and again
until the threads exit. Therefore, the spin count affects system performance significantly!

c) If *k* > 1 && the number of threads *n* < *k* (small critical section case).
  Theoretically, each thread can get the lock by spinning, without blocking. No wait/wake is needed, so we
need only consider the spin overhead.
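As a rough worked example (numbers assumed purely for illustration): if c = 200ns and
one spin iteration costs about 2ns, then SPIN_COUNT=1000 gives s ≈ 2000ns, i.e. k ≈ 10.
With n = 112 threads we are deep in case b): only about k threads ever win the lock by
spinning, while the rest burn the full spin budget in coherence traffic and then fall
back to futex_wait() anyway.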

>> [...]
Adhemerval Zanella April 9, 2018, 8:52 p.m. | #3
On 08/04/2018 05:28, kemi wrote:
> 
> 
> On 06/04/2018 04:55, Adhemerval Zanella wrote:
>>
>>
>> On 30/03/2018 04:14, Kemi Wang wrote:
>>> [...]
>>> nr_threads      base         head(SPIN_COUNT=10)  head(SPIN_COUNT=1000)
>>> 1               51644585        51307573(-0.7%)    51323778(-0.6%)
>>> 2               7914789         10011301(+26.5%)   9867343(+24.7%)
>>> 7               1687620         4224135(+150.3%)   3430504(+103.3%)
>>> 14              1026555         3784957(+268.7%)   1843458(+79.6%)
>>> 28              962001          2886885(+200.1%)   681965(-29.1%)
>>> 56              883770          2740755(+210.1%)   364879(-58.7%)
>>> 112             1150589         2707089(+135.3%)   415261(-63.9%)
>>
>> In pthread_mutex3 the workload is basically updates of a global variable synchronized
>> with a mutex, so if I am reading the benchmark correctly, a higher value means
>> less contention. I also assume you used the 'threads' value in this table.
>>
> 
> Yes. I used the multi-threaded mode for testing.
> 
>> I checked on a 64-core aarch64 machine to see what kind of improvement, if
>> any, one would get with this change:
>>
>> nr_threads      base            head(SPIN_COUNT=10)   head(SPIN_COUNT=1000)
>> 1               27566206        28778254 (4.211680)   28778467 (4.212389)
>> 2               8498813         7777589 (-9.273105)   7806043 (-8.874791)
>> 7               5019434         2869629 (-74.915782)  3307812 (-51.744839)
>> 14              4379155         2906255 (-50.680343)  2825041 (-55.012087)
>> 28              4397464         3261094 (-34.846282)  3259486 (-34.912805)
>> 56              4020956         3898386 (-3.144122)   4038505 (0.434542)
>>
> 
> Thanks for testing it on an aarch64 machine. That's a great supplement for us.
> 
>> So I think this change should be platform-specific.
>>
> 
> It may depend on the platform, especially on the cmpxchg, MO read, and pause instructions.

Yes, that's why, for this patch in isolation, it would be better to make the adaptive
algorithm more platform-specific.

> 
> Well, I think the performance also depends on the spin count. A simple explanation is that
> spinning always fails under severe lock contention with large thread counts (this fits case b) below);
> thus, the overhead increases with the spin count: the larger the spin count, the more the overhead.
> It does not surprise me to see a performance regression with SPIN_COUNT=1000, but I am surprised
> that the performance does not increase with the smaller SPIN_COUNT compared to the larger one.
> Did you run "export LD_SPIN_COUNT=10" before the second round of tests?

I used the preferred way, GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which for the
current patch set version is the same).  I redid the numbers to check for some
mishandling on my part, but the numbers are still not OK to set as
default (even more so since 1000 is the default value for spin_count).

nr_threads      base            head(SPIN_COUNT=10)    head(SPIN_COUNT=1000)
1               27563021        28775107 (4.212273)    28775127 (4.212339)
2               8261443         9044577 (8.658603)     7792347 (-6.019958)
7               4697529         4623052 (-1.610992)    3241736 (-44.907821)
14              4951300         5348550 (7.427247)     2854946 (-73.428849)
28              4296920        4551240 (5.587928)     3231684 (-32.962257)
56              4182013         3420937 (-22.247589)   3769944 (-10.930375)


> 
> [...]

I do agree that the spin count plays a role here, but for the machine
I am testing on, I think what is actually happening is that this patch adds another
branch in the critical loop section, and that hurts performance:

Current spin loop version:
---
  do
    { 
      if (cnt++ >= max_cnt)
        { 
          LLL_MUTEX_LOCK (mutex);
          break;
        }
      atomic_spin_nop ();
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
---

This patch changes to:
---
  do
    {   
      if (cnt >= max_cnt)
        {
          LLL_MUTEX_LOCK (mutex);
          break;
        }
      do { 
        atomic_spin_nop ();
      }
      while (atomic_load_relaxed (&mutex->__data.__lock) != 0
             && ++cnt < max_cnt);
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
---

While patch 3 in the set changes it to:
---
  do
    {
      atomic_spin_nop ();
     }
  while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
         ++cnt < max_cnt);

  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
    LLL_MUTEX_LOCK (mutex);
---

That's why I suggested, for the next version, squashing the second and third patches into
a single one aimed at optimizing the adaptive spinning algorithm.

Also, it would be better if you could provide a benchmark to stress this point without
resorting to an external one.

> [...]
kemi April 10, 2018, 1:46 a.m. | #4
On 10/04/2018 04:52, Adhemerval Zanella wrote:
> 
> [...]
> 
> That's why I suggested, for the next version, squashing the second and third patches into
> a single one aimed at optimizing the adaptive spinning algorithm.
> 

Agree, will combine the 2nd and 3rd patches in V2.

> Also, it would be better if you could provide a benchmark to stress this point without
> resorting to an external one.
> 

I will consider doing that. Preliminarily, we can provide several test cases, each with
a different critical-section size, which can be run in multi-threaded mode.
What's your idea?
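(For instance, a sketch of the loop body, with a hypothetical cs_size knob scaling the
critical section:)

---
pthread_mutex_lock (&lock);
iterations++;
for (int i = 0; i < cs_size; i++)   /* cs_size scales the critical section */
  __asm__ __volatile__ ("");        /* dummy work the compiler cannot drop */
pthread_mutex_unlock (&lock);
---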

>> [...]
Adhemerval Zanella April 11, 2018, 1:28 p.m. | #5
On 09/04/2018 22:46, kemi wrote:
> 
> [...]
>> That's why I suggested, for the next version, squashing the second and third patches into
>> a single one aimed at optimizing the adaptive spinning algorithm.
>>
> 
> Agree, will combine the 2nd and 3rd patches in V2.
> 
>> Also, it would be better if you could provide a benchmark to stress this point without
>> resorting to an external one.
>>
> 
> I will consider doing that. Preliminarily, we can provide several test cases, each with
> a different critical-section size, which can be run in multi-threaded mode.
> What's your idea?

I would use the pthread_mutex3/threads test from the project you used as a base; it is
simple enough and gives us a directly comparable metric. I would run it with
1, nproc/2, and nproc threads.

Patch

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..c3aca93 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include <mutex-conf.h>
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -124,16 +125,22 @@  __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
 	{
 	  int cnt = 0;
-	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
-			     mutex->__data.__spins * 2 + 10);
+	  int max_cnt = MIN (__mutex_aconf.spin_count,
+			mutex->__data.__spins * 2 + 100);
 	  do
 	    {
-	      if (cnt++ >= max_cnt)
-		{
-		  LLL_MUTEX_LOCK (mutex);
-		  break;
-		}
-	      atomic_spin_nop ();
+		if (cnt >= max_cnt)
+		  {
+		    LLL_MUTEX_LOCK (mutex);
+		    break;
+		  }
+		/* MO read while spinning */
+		do
+		  {
+		    atomic_spin_nop ();
+		  }
+		while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
+			++cnt < max_cnt);
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);