[RFC] ali_workqueue: Adaptive lock integration on multi-socket/core platform

Message ID 20181128082329.26873-1-ling.ma@MacBook-Pro-7.local
State New

Commit Message

Ma Ling Nov. 28, 2018, 8:23 a.m. UTC
From: "ling.ma" <ling.ml@antfin.com>

  Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes serious cache-line ping-pong, so
the process spends a lot of time and power to complete, especially on
multi-socket/multi-core platforms.

  However, if the serialized work is sent to one core and executed there
ONLY when contention happens, much time and power can be saved,
because all shared data then stays in the private cache of that core.
We call this mechanism Adaptive Lock Integration.
(ali workqueue)
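
To make the idea concrete, here is a minimal single-queue sketch of this kind of delegation, written with ISO C11 <stdatomic.h> so that it compiles outside glibc. It is not the code from this patch: the names `struct work`, `struct combiner`, `combine_submit` and `add_one` are made up for illustration, there is no per-socket level, and nothing limits how long one thread keeps serving the others.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names; a simplified, single-level illustration only. */
struct work {
    _Atomic(struct work *) next;  /* successor in the submission queue */
    atomic_int pending;           /* 1 until some thread has run fn(arg) */
    void (*fn)(void *);
    void *arg;
};

struct combiner {
    _Atomic(struct work *) tail;  /* NULL when the queue is empty */
};

static void combiner_init(struct combiner *c)
{
    atomic_init(&c->tail, NULL);
}

/* Example work item: increments the long that arg points to. */
static void add_one(void *arg)
{
    ++*(long *)arg;
}

static void combine_submit(struct combiner *c, struct work *w)
{
    atomic_store_explicit(&w->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&w->pending, 1, memory_order_relaxed);

    /* acq_rel: publish w's fields and observe the previous owner's writes. */
    struct work *prev =
        atomic_exchange_explicit(&c->tail, w, memory_order_acq_rel);
    if (prev != NULL) {
        /* Contended: link behind prev and wait until the owner ran our work. */
        atomic_store_explicit(&prev->next, w, memory_order_release);
        while (atomic_load_explicit(&w->pending, memory_order_acquire))
            ;  /* spin */
        return;
    }

    /* Queue was empty: this thread is the owner and runs every queued item. */
    struct work *cur = w;
    for (;;) {
        cur->fn(cur->arg);
        struct work *next =
            atomic_load_explicit(&cur->next, memory_order_acquire);
        if (next == NULL) {
            struct work *expected = cur;
            if (!atomic_compare_exchange_strong_explicit(
                    &c->tail, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire)) {
                /* A submitter swapped the tail but has not linked itself yet. */
                while ((next = atomic_load_explicit(
                            &cur->next, memory_order_acquire)) == NULL)
                    ;  /* spin */
            }
        }
        /* The submitter may reuse cur once pending is 0, so clear it only
           after cur->next has been read. */
        atomic_store_explicit(&cur->pending, 0, memory_order_release);
        if (next == NULL)
            return;
        cur = next;
    }
}
```

Without contention the exchange returns NULL and the caller simply runs its own function, which is the "ONLY when contention happens" part; under contention the shared data touched by `fn` is only ever accessed from the owner's core.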

  Multiple CPU sockets currently give us better performance per watt,
but they also bring more complex synchronization requirements.
For example, in a critical-section scenario the lock cache line
ping-pongs among CPU sockets, and lock competition among more cores
adds further overhead. In this version we introduce a distributed
synchronization mechanism that greatly reduces these issues.
Assuming there are 2 CPU sockets:

1.	If(the thread is from socket_0)
		Lock_from_socket_0
2.	If (the thread is from socket_1)
		Lock_from_socket_1

3.	Lock_Global 

4.	Enter critical section

5.	If(the thread is from socket_0)
		UnLock_from_socket_0 

6.	if (the thread is from socket_1)
		UnLock_from_socket_1

7.	The threads from the same socket (socket_0 or socket_1) complete the
	critical section one by one, until no threads are waiting in that
	socket. During this process we also accelerate data and lock
	movement within the same socket.

8.	UnLock_Global:  we allow threads from other sockets to
	enter critical section
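
The steps above can be sketched as a small two-level spinlock. This is only an illustration under assumptions, not the patch's implementation: the names (`two_level_lock`, `two_level_acquire`, `two_level_release`), the use of C11 `atomic_flag` spinlocks, the waiter counter, and the `MAX_BATCH` hand-off limit are all hypothetical, and the caller is assumed to know its socket index.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NSOCKETS  2   /* assumption: two sockets, as in the description */
#define MAX_BATCH 64  /* assumption: bound on same-socket hand-offs */

struct socket_lock {
    atomic_flag held;    /* per-socket lock (steps 1, 2, 5, 6) */
    atomic_int waiters;  /* threads of this socket waiting for 'held' */
    bool owns_global;    /* protected by 'held' */
    int batch;           /* protected by 'held' */
};

struct two_level_lock {
    atomic_flag global;  /* global lock (steps 3 and 8) */
    struct socket_lock socket[NSOCKETS];
};

static void two_level_init(struct two_level_lock *l)
{
    atomic_flag_clear(&l->global);
    for (int s = 0; s < NSOCKETS; s++) {
        atomic_flag_clear(&l->socket[s].held);
        atomic_init(&l->socket[s].waiters, 0);
        l->socket[s].owns_global = false;
        l->socket[s].batch = 0;
    }
}

static void spin_acquire(atomic_flag *f)
{
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire))
        ;  /* spin */
}

static void two_level_acquire(struct two_level_lock *l, int s)
{
    struct socket_lock *sl = &l->socket[s];

    /* Steps 1/2: compete only with threads of the same socket. */
    atomic_fetch_add_explicit(&sl->waiters, 1, memory_order_relaxed);
    spin_acquire(&sl->held);
    atomic_fetch_sub_explicit(&sl->waiters, 1, memory_order_relaxed);

    /* Step 3: take the global lock unless this socket already owns it. */
    if (!sl->owns_global) {
        spin_acquire(&l->global);
        sl->owns_global = true;
        sl->batch = 0;
    }
    /* Step 4: the caller now runs the critical section. */
}

static void two_level_release(struct two_level_lock *l, int s)
{
    struct socket_lock *sl = &l->socket[s];

    /* Steps 5/6/7: if a same-socket thread is waiting, release only the
       socket lock so it enters while the socket still owns the global one. */
    if (atomic_load_explicit(&sl->waiters, memory_order_relaxed) > 0
        && ++sl->batch < MAX_BATCH) {
        atomic_flag_clear_explicit(&sl->held, memory_order_release);
        return;
    }

    /* Step 8: nobody local is waiting (or the batch is full); let other
       sockets in. */
    sl->owns_global = false;
    atomic_flag_clear_explicit(&l->global, memory_order_release);
    atomic_flag_clear_explicit(&sl->held, memory_order_release);
}
```

The global lock is a plain test-and-set flag so that it can be released by a different thread than the one that took it, which the same-socket hand-off relies on.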

Step 1 or 2 helps to relieve pressure on the Global Lock, and only one thread
gets the Global Lock in steps 3 & 4.

Step 5 or 6 helps to reduce Global Lock and shared data movement,
because the lock and the shared data stay within the same socket.
Ali workqueue is very good at step 7, and it also balances the workload
of the Lock Owner compared with the original version. In the end we get
significant results, shown below (we will send the benchmark in this thread soon):

1. Hashwork (higher is better; the benchmark is from kemi.wang@intel.com):
Original Spinlock
Run hashwork in 5 seconds, print statistics below:
1 threads, 10221937 total hashes, 10221937 hashes per thread
2 threads, 18204627 total hashes, 9102313 hashes per thread
4 threads, 21847140 total hashes, 5461785 hashes per thread
8 threads, 13231893 total hashes, 1653986 hashes per thread
16 threads, 9706989 total hashes, 606686 hashes per thread
32 threads, 6096940 total hashes, 190529 hashes per thread
64 threads, 5237120 total hashes, 81830 hashes per thread
80 threads, 5225351 total hashes, 65316 hashes per thread
96 threads, 5345197 total hashes, 55679 hashes per thread

Ali Workqueue
Run hashwork in 5 seconds, print statistics below:
1 threads, 9597719 total hashes, 9597719 hashes per thread
2 threads, 16191658 total hashes, 8095829 hashes per thread
4 threads, 16284311 total hashes, 4071077 hashes per thread
8 threads, 25705715 total hashes, 3213214 hashes per thread
16 threads, 32104276 total hashes, 2006517 hashes per thread
32 threads, 33678957 total hashes, 1052467 hashes per thread
64 threads, 31804354 total hashes, 496943 hashes per thread
80 threads, 34445498 total hashes, 430568 hashes per thread
96 threads, 30523970 total hashes, 317958 hashes per thread

2. Global data benchmark (lower is better;
   the benchmark is from ling.ml@antfin.com, for our real workload):

Original Spinlock

1 threads 50000 num
total time (   1 threads): 32789120
2 threads 50000 num
total time (   2 threads): 208625958
4 threads 50000 num
total time (   4 threads): 1063907644
8 threads 50000 num
total time (   8 threads): 4734218966
16 threads 50000 num
total time (  16 threads): 25088565320
32 threads 50000 num
total time (  32 threads): 149992521624
64 threads 50000 num
total time (  64 threads): 1054508130586
80 threads 50000 num
total time (  80 threads): 1488507826842
96 threads 50000 num
total time (  96 threads): 1787252256456

Ali Workqueue
1 threads 50000 num
total time (   1 threads): 36340476
2 threads 50000 num
total time (   2 threads): 169380062
4 threads 50000 num
total time (   4 threads): 565430140
8 threads 50000 num
total time (   8 threads): 1329263188
16 threads 50000 num
total time (  16 threads): 3385617884
32 threads 50000 num
total time (  32 threads): 10736058730
64 threads 50000 num
total time (  64 threads): 31651343042
80 threads 50000 num
total time (  80 threads): 47133700104
96 threads 50000 num
total time (  96 threads): 62611966622
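
For a rough sense of scale, the ratios between the two sets of figures can be computed directly from the rows above. The helper below is only illustrative arithmetic on the numbers quoted in this message (the function names are hypothetical and it is not part of either benchmark). At 96 threads it gives roughly 5.7x more hashes and roughly 28.5x less total time, while the 1-thread hashwork row comes out slightly below 1x.

```c
#include <stdio.h>

/* How many times larger a is than b. */
static double ratio(double a, double b)
{
    return a / b;
}

static void print_ratios(void)
{
    /* Hashwork, 96 threads: Ali Workqueue hashes / spinlock hashes. */
    printf("hashwork  96 threads: %.2fx\n",
           ratio(30523970.0, 5345197.0));

    /* Hashwork, 1 thread: the uncontended case. */
    printf("hashwork   1 thread : %.2fx\n",
           ratio(9597719.0, 10221937.0));

    /* Global data, 96 threads: spinlock time / Ali Workqueue time. */
    printf("global    96 threads: %.2fx\n",
           ratio(1787252256456.0, 62611966622.0));
}
```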

Any comments are appreciated.

Thanks
Ling
---
 ChangeLog                    |   8 ++++
 include/ali_workqueue.h      |  26 +++++++++++
 nptl/Versions                |   2 +
 nptl/ali_workqueue.c         | 102 +++++++++++++++++++++++++++++++++++++++++++
 sysdeps/x86_64/nptl/Makefile |   1 +
 5 files changed, 139 insertions(+)
 create mode 100644 include/ali_workqueue.h
 create mode 100644 nptl/ali_workqueue.c

Comments

马凌(彦军) Nov. 28, 2018, 8:28 a.m. UTC | #1
Attaching the test cases.

Thanks
Ling

Joseph Myers Nov. 28, 2018, 4:40 p.m. UTC | #2
Please see the contribution checklist 
<https://sourceware.org/glibc/wiki/Contribution%20checklist>.  For 
example:

* FSF copyright assignment (with employer assignment / disclaimer as 
applicable) needed.  People are unlikely to look in detail at code without 
an assignment because it could cause problems if the assignment never 
appears and they wish to implement something similar in future.

* Please format code according to the GNU Coding Standards.

* New features need documentation in the user manual, and to be mentioned 
in the NEWS file.

* APIs need to be architecture-independent, in the absence of a clear 
justification for an architecture-specific API.

* A new interface is not useful without an installed header declaring it 
for users (this patch only has an internal header, not an installed one).

* New interfaces need testcases added to the glibc testsuite in the patch 
adding the interface.

* No "Contributed by" in new files.

* New symbol versions must be the version number of the first glibc 
release to have the feature.  For something added now that would be 
GLIBC_2.29.

* ABI test baselines (for all architectures) must be updated in any patch 
adding new interfaces.

* New code should not use __sync_*, and should have comments explicitly 
explaining the synchronization used (in terms of the C11 memory model) 
when using atomics.
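
As an illustration of the last point, a call such as `__sync_val_compare_and_swap (cpu, old, NULL)` could be written with an explicit memory order and a comment saying what it synchronizes with. The sketch below uses ISO C11 `<stdatomic.h>` so that it compiles standalone; inside glibc the internal `include/atomic.h` macros would be used instead. The function name `try_close_queue` and the chosen memory orders are assumptions for illustration, not a reviewed design.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: reset *head to NULL if it still points at self.
   Returns true if the queue was closed, false if another thread enqueued
   itself in the meantime.

   Old form: __sync_val_compare_and_swap(head, self, NULL) == self,
   which is a full barrier with no stated intent. */
static bool try_close_queue(_Atomic(void *) *head, void *self)
{
    void *expected = self;

    /* Release MO on success: everything done while owning the queue
       happens-before the acquire exchange of the next thread that finds
       *head == NULL.
       Relaxed MO on failure: nothing is published here; the caller is
       expected to read its successor with a separate acquire load. */
    return atomic_compare_exchange_strong_explicit(head, &expected, NULL,
                                                   memory_order_release,
                                                   memory_order_relaxed);
}
```

Unlike the `__sync` builtin, the C11 call returns a success flag and writes the observed value back into `expected`, so the `ali == old` comparison in the patch becomes the return value.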
马凌(彦军) Nov. 29, 2018, 2:29 p.m. UTC | #3
Hi all,

We obtained a copyright assignment from the Free Software Foundation in 2014, so there is no problem with the patches we send out.

Thanks
Ling  

On 2018/11/29 at 9:55 PM, “Joseph Myers”<joseph@codesourcery.com> wrote:

    On Thu, 29 Nov 2018, 马凌(彦军) wrote:
    
    > Hi Joseph S. Myers
    > 
    > Thanks for your reminder. We have an assignment from the Free Software Foundation; see the attachment.
    > So we have the right to send the formal patch, is that correct?
    
    Yes.  (It's a good idea to say explicitly when posting the patch that 
    you're covered by the Alibaba assignment.)
    
    -- 
    Joseph S. Myers
    joseph@codesourcery.com
Patch

diff --git a/ChangeLog b/ChangeLog
index d7ee676..fdfc00a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@ 
+2018-11-08  Ma Ling  <ling.ml@antfin.com>
+
+	* sysdeps/x86_64/nptl/Makefile: Add the ali_workqueue compile command.
+	* nptl/Versions: Export 2 routines of ali_workqueue.
+	* nptl/ali_workqueue.c: New file, the implementation of ali_workqueue by using
+	adaptive lock integration mechanism.
+	* include/ali_workqueue.h: New file, the user API definition.
+
 2018-11-05  Arjun Shankar  <arjun@redhat.com>
 
 	* iconv/gconv_conf.c (__gconv_read_conf): Remove NULL check for
diff --git a/include/ali_workqueue.h b/include/ali_workqueue.h
new file mode 100644
index 0000000..62f3429
--- /dev/null
+++ b/include/ali_workqueue.h
@@ -0,0 +1,26 @@ 
+#ifndef _ALI_WORKQUEUE_H_
+#define _ALI_WORKQUEUE_H_
+
+#define __aligned(x)	__attribute__((aligned(x)))
+struct socket {
+	void *core __aligned(64);
+	char pad __aligned(64);
+};
+
+struct ali_workqueue {
+	struct socket  owner;
+	struct socket  cpu[0];
+} ali_workqueue_t;
+
+
+struct ali_workqueue_info {
+	struct ali_workqueue_info *next  __aligned(64);
+	int pending;
+	void (*fn)(void *);
+	void *para;
+	int socket;
+};
+
+void ali_workqueue_init(struct ali_workqueue *ali_wq, int size);
+void ali_workqueue(struct ali_workqueue *ali_wq, struct ali_workqueue_info *ali);
+#endif
diff --git a/nptl/Versions b/nptl/Versions
index e7f691d..f4afa6d 100644
--- a/nptl/Versions
+++ b/nptl/Versions
@@ -267,6 +267,8 @@  libpthread {
   }
 
   GLIBC_2.22 {
+   ali_workqueue_init;
+   ali_workqueue;
   }
 
   # C11 thread symbols.
diff --git a/nptl/ali_workqueue.c b/nptl/ali_workqueue.c
new file mode 100644
index 0000000..fe34ca0
--- /dev/null
+++ b/nptl/ali_workqueue.c
@@ -0,0 +1,102 @@ 
+/* Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <atomic.h>
+#include "ali_workqueue.h"
+
+static inline void run_workqueue(struct ali_workqueue_info *old, void **cpu)
+{
+
+	struct ali_workqueue_info *next, *ali;
+
+	old->fn(old->para);
+retry:
+	ali = __sync_val_compare_and_swap(cpu, old, NULL);
+
+	if(ali == old)
+		goto end;
+
+	ali =  atomic_exchange_acquire(cpu, old);
+
+repeat:    
+	if(old == ali)
+		goto retry;
+
+	while (!(next = atomic_load_relaxed(&ali->next)))
+   		atomic_spin_nop ();
+
+	ali->fn(ali->para);
+	ali->pending = 0;    
+	ali = next;    
+	goto repeat;
+
+end:
+	atomic_store_release(&ali->pending, 0);
+	return;
+}
+
+void ali_workqueue(struct ali_workqueue *ali_wq, struct ali_workqueue_info *ali)
+{
+
+	struct ali_workqueue_info *old;
+	void **core;
+	ali->next = NULL;
+	ali->pending = 1;
+	core = &ali_wq->cpu[ali->socket].core;
+	old =  atomic_exchange_acquire(core , ali);
+	if(old)	{
+		atomic_store_release(&ali->next, old);
+		while(atomic_load_relaxed(&ali->pending))
+   			atomic_spin_nop ();
+		return;
+	}
+
+	old =  atomic_exchange_acquire(&ali_wq->owner.core, ali);
+	if(old) {
+		atomic_store_release(&old->next, ali);
+		while((atomic_load_relaxed(&ali->pending)))
+   			atomic_spin_nop ();
+	}
+
+	run_workqueue(ali, core);
+	old = ali;
+
+	ali = __sync_val_compare_and_swap(&ali_wq->owner.core, old, NULL);
+	if(ali == old)
+		goto end;
+
+	while (!(ali = atomic_load_relaxed(&old->next)))
+		atomic_spin_nop ();
+
+
+	atomic_store_release(&ali->pending, 0);
+
+end:
+	return;
+
+}
+
+/* Init ali work queue */
+void ali_workqueue_init(struct ali_workqueue *ali_wq, int size)
+{
+	memset(ali_wq, 0, size);
+}
+
diff --git a/sysdeps/x86_64/nptl/Makefile b/sysdeps/x86_64/nptl/Makefile
index 7302403..a5d91e2 100644
--- a/sysdeps/x86_64/nptl/Makefile
+++ b/sysdeps/x86_64/nptl/Makefile
@@ -18,3 +18,4 @@ 
 ifeq ($(subdir),csu)
 gen-as-const-headers += tcb-offsets.sym
 endif
+libpthread-routines += ali_workqueue
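
Because `struct ali_workqueue` ends in a zero-length `cpu[]` array and `ali_workqueue_init` only clears `size` bytes, the caller has to work out the allocation size itself. Below is a minimal sketch of that computation, with the two structs re-declared locally (using C11 `_Alignas` and a flexible array member in place of `__aligned` and `cpu[0]`) so that it compiles standalone. The helper names `ali_wq_size` and `ali_wq_alloc` and the `CACHE_LINE` macro are hypothetical and not part of the patch.

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64  /* assumption: matches the 64 used in the header */

/* Local re-declarations mirroring include/ali_workqueue.h. */
struct socket {
    _Alignas(CACHE_LINE) void *core;
    _Alignas(CACHE_LINE) char pad;
};

struct ali_workqueue {
    struct socket owner;
    struct socket cpu[];  /* one slot per CPU socket */
};

/* Bytes needed for a queue that serves nsockets CPU sockets. */
static size_t ali_wq_size(int nsockets)
{
    return sizeof(struct ali_workqueue)
           + (size_t)nsockets * sizeof(struct socket);
}

/* Allocate a cache-line-aligned, zeroed queue; release it with free(). */
static struct ali_workqueue *ali_wq_alloc(int nsockets)
{
    size_t size = ali_wq_size(nsockets);
    struct ali_workqueue *wq = aligned_alloc(CACHE_LINE, size);

    if (wq != NULL)
        memset(wq, 0, size);  /* what ali_workqueue_init(wq, size) does */
    return wq;
}
```

Each slot occupies two cache lines, so the per-socket queue heads and the global `owner` head never share a line.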