
Remove CPU mask size detection from setaffinity

Message ID 555DC4EC.40508@redhat.com
State New

Commit Message

Florian Weimer May 21, 2015, 11:43 a.m. UTC
On 05/21/2015 01:28 PM, Florian Weimer wrote:
> On 05/20/2015 06:24 PM, Florian Weimer wrote:
>> The code looks quite broken to me and fails to achieve what it tries to
>> do, as explained in the commit message.
> 
> Another advantage is that the magic number 1024 (128 bytes of CPU bits)
> is gone, so testing on smaller machines is possible.
> 
> Well, I did that (144 CPU cores, detected set size 192 bits), and it
> turns out that my pthread test case is completely broken. :-(
> 
> Here's a new one.  It tries to check that the active CPU set returned by
> the kernel has not been truncated, and that threads are not scheduled to
> CPUs outside the set.

Now actually with the patch.

Comments

Carlos O'Donell May 23, 2015, 2:21 a.m. UTC | #1
On 05/21/2015 07:43 AM, Florian Weimer wrote:
> On 05/21/2015 01:28 PM, Florian Weimer wrote:
>> > On 05/20/2015 06:24 PM, Florian Weimer wrote:
>>> >> The code looks quite broken to me and fails to achieve what it tries to
>>> >> do, as explained in the commit message.
>> > 
>> > Another advantage is that the magic number 1024 (128 bytes of CPU bits)
>> > is gone, so testing on smaller machines is possible.
>> > 
>> > Well, I did that (144 CPU cores, detected set size 192 bits), and it
>> > turns out that my pthread test case is completely broken. :-(
>> > 
>> > Here's a new one.  It tries to check that the active CPU set returned by
>> > the kernel has not been truncated, and that threads are not scheduled to
>> > CPUs outside the set.
> Now actually with the patch.
> 
> -- Florian Weimer / Red Hat Product Security
> 
> 
> 0001-Remove-CPU-set-size-checking-from-sched_setaffinity-.patch
> 
> 
> From ee05f3baec0f2d723c1410cd83845e82c7971264 Mon Sep 17 00:00:00 2001
> Message-Id: <ee05f3baec0f2d723c1410cd83845e82c7971264.1432208579.git.fweimer@redhat.com>
> From: Florian Weimer <fweimer@redhat.com>
> Date: Thu, 21 May 2015 12:42:52 +0100
> Subject: [PATCH] Remove CPU set size checking from sched_setaffinity,
>  pthread_setaffinity_np
> To: libc-alpha@sourceware.org
> 
> With current kernel versions, the check does not reliably detect that
> unavailable CPUs are requested, for these reasons:
> 
> (1) The kernel will silently ignore non-allowed CPUs.
> 
> (2) Similarly, CPU bits which lack an online CPU (possible CPUs)
>     are ignored.
> 
> (3) The existing probing code assumes that the CPU mask size is a
>     power of two and at least 1024.  It neither has to be a power
>     of two, nor is 1024 the minimum possible value, so the value
>     determined is often too large, resulting in false negatives.
>
> The kernel will still return EINVAL if no CPU in the requested set
> remains which can run the current thread after the affinity change.
> 
> Applications which care about the exact affinity mask will have to query
> it using sched_getaffinity after setting it.

Please bear with me as we work out the exact details and semantics
of the changes.

Firstly, I'd like to see a clear definition of the problem we are
trying to solve before we remove the existing code. Not to say that
the existing code is great, but it keeps the status quo and expectations
from user programs are preserved. Describing the problem also prevents
others from asking "Oh, what about the >1024 cpu problem?" and "Oh,
what about the proper use of online, possible, present CPUs and how
to express them via sysconf?" etc.

Is this the problem?

"The glibc sched_setaffinity routine incorrectly attempts to compute the
 kernel cpumask_t, and therefore does more harm than good by imposing
 an incorrectly computed cpumask_t size as a limit." With "harm"
 explained in detail.

The glibc API description for sched_setaffinity says EINVAL
will be returned when the affinity bit mask contains no processors
that are currently physically on the system, or are permitted to
run. It does not fail if the bit mask is larger than the kernel
cpumask_t and that extra space contains zeroes. In fact the glibc
manual states that this information can even be used in the *future*
by the scheduler, one presumes, when more cpus come online.

The present code in sysdeps/unix/sysv/linux/sched_setaffinity.c
uses sched_getaffinity to compute what it believes is the maximum
size of the kernel cpumask_t. The loop there will succeed on the
first call, since most systems set NR_CPUS[1] to 1024 or less.
If NR_CPUS were very high, the limit the kernel uses is nr_cpu_ids,
which is still the correct limit and reflects maximum possible cpus
as detected by hardware (as opposed to the compile time max).
The kernel code and glibc code as far as I can tell will not correctly
compute nr_cpu_ids nor NR_CPUS, but will compute the largest power of
two size, starting with 128 bytes, that the kernel will accept as a
cpumask_t. This value need not be nr_cpu_ids, and if nr_cpu_ids is
larger than 1024, then it certainly won't be since the kernel is happy
to copy back a partial set of bits, i.e. len < nr_cpu_ids. This leads
to a bug in glibc when trying to set the 1025th CPU's affinity, since
the function would reject that bit as outside of the kernel cpu mask.


If the kernel ignores non-allowed cpus, that would seem like a bug
in the interface. Either the call succeeds according to your request
or it doesn't.

If the kernel ignores bits for offline cpus, that is also a bug,
it should record them and when the cpus come online and available
they should be used.

However, I have not worked on these interfaces on the kernel side,
and I would appreciate Motohiro-san's experience in this matter.

Thus the only kernel bugs I see are failing to notify, with an error
return value, that a forbidden cpu was specified in the mask and
ignored, and ignoring a set bit for an offline cpu.

The glibc implementation of sched_setaffinity wrongly computes
nr_cpu_ids once, caches it (since it can't change), and then uses
that to implement a sensible API where we reject cpu bits set for
cpus that can't exist given the booted hardware. The API should,
additionally, never fail.

The bugs I see in glibc are:
- Fix nr_cpu_ids computation from glibc side by simply reading
  the sysfs value of possible, or iterating sched_getaffinity
  until it returns the same value twice. This would fix the case
  on systems with lots of cpus.
- Add a new sysconf parameter for _SC_NPROCESSORS_MAX to indicate
  possible, that way a user can read this first and use that value
  to do dynamic cpu set allocation.

Not to mention refactoring all of this code to sysconf, and having
sched_[gs]etaffinity use the refactored functions.

In summary:

Don't remove the existing code, but fix it.


Cheers,
Carlos.

[1] /sys/devices/system/cpu/kernel_max or NR_CPUS or CONFIG_NR_CPUS
    and is a static compile time constant maximum number of cpus
    that can ever be detected and made online by this kernel. It
    is not the possible number of CPUs; that value is <= NR_CPUS
    and limited by the topology of the system. For example,
    possible cpus might be a maximum of 254 cpus on an 8-bit APIC
    system (two IDs reserved, one for broadcast and one for the io-apic).
    In general nr_cpu_ids <= NR_CPUS, and it is usually set to the
    possible hardware number of cpus, since that is lower than
    NR_CPUS and saves time iterating loops.
Florian Weimer June 5, 2015, 2:34 p.m. UTC | #2
On 05/23/2015 04:21 AM, Carlos O'Donell wrote:

> Is this the problem?
> 
> "The glibc sched_setaffinity routine incorrectly attempts to compute the
>  kernel cpumask_t, and therefore does more harm than good by imposing
>  an incorrectly computed cpumask_t size as a limit." With "harm"
>  explained in detail.

The main harm I see is unnecessary complexity, and perhaps misleading
application developers who read the glibc implementation but not the
kernel code.

I think it's also unreasonable to assume that the current state of
CPU affinity handling is the final development in this area.  Things
like CPU hotplug and process migration will likely get more important
over time, and making glibc rely on the present implementation details
could cause problems in the future.

(An example of such a change is the route cache removal.  There is now a
sysctl for backwards compatibility which says the kernel maximum table
size is INT_MAX.  If you have code that reads this sysctl, it will
likely fail in some way.)

> The glibc API description for sched_setaffinity says EINVAL
> will be returned when the affinity bit mask contains no processors
> that are currently physically on the system, or are permitted to
> run. It does not fail if the bit mask is larger than the kernel
> cpumask_t and that extra space contains zeroes. In fact the glibc
> manual states that this information can even be used in the *future*
> by the scheduler, one presumes, when more cpus come online.
> 
> The present code in sysdeps/unix/sysv/linux/sched_setaffinity.c
> uses sched_getaffinity to compute what it believes is the maximum
> size of the kernel cpumask_t. The loop there will succeed on the
> first call, since most systems set NR_CPUS[1] to 1024 or less.

The value here is not NR_CPUS, but its dynamically reduced counterpart,
nr_cpu_ids, which you call possible CPUs below.  NR_CPUS varies wildly
between distributions (I've seen values ranging from 512 to 5120), but
nr_cpu_ids should be fairly consistent for a single piece of hardware.

> If NR_CPUS were very high, the limit the kernel uses is nr_cpu_ids,
> which is still the correct limit and reflects maximum possible cpus
> as detected by hardware (as opposed to the compile time max).
> The kernel code and glibc code as far as I can tell will not correctly
> compute nr_cpu_ids nor NR_CPUS, but will compute the largest power of
> two size, starting with 128 bytes, that the kernel will accept as a
> cpumask_t.

This does not match my testing, both with the Fedora 21 kernel and the
Red Hat Enterprise Linux 7 kernel (x86_64).  I only tested full 64-bit
words, but the kernel implementation of sched_getaffinity returned
successful if passed 3 words of affinity bits (24 bytes, 192 bits) on a
144 processor system.  A power of two does not seem to be required.

> This value need not be nr_cpu_ids, and if nr_cpu_ids is
> larger than 1024, then it certainly won't be since the kernel is happy
> to copy back a partial set of bits, i.e. len < nr_cpu_ids.

Are you sure about that?  That's a clear kernel bug.  The sources I
looked at have this right at the top of the sched_getaffinity
implementation:

	if ((len * BITS_PER_BYTE) < nr_cpu_ids)
		return -EINVAL;

So I think the current glibc is not downright buggy on systems with
more than 1024 CPUs.  It just does unnecessary work.

> If the kernel ignores non-allowed cpus, that would seem like a bug
> in the interface. Either the call succeeds according to your request
> or it doesn't.

Based on the kernel code for sched_setaffinity, this seems the intent.

> However, I have not worked on these interfaces on the kernel side,
> and I would appreciate Motohiro-san's experience in this matter.

Same here, we really need to know what's the intent on the kernel side.

Patch

From ee05f3baec0f2d723c1410cd83845e82c7971264 Mon Sep 17 00:00:00 2001
Message-Id: <ee05f3baec0f2d723c1410cd83845e82c7971264.1432208579.git.fweimer@redhat.com>
From: Florian Weimer <fweimer@redhat.com>
Date: Thu, 21 May 2015 12:42:52 +0100
Subject: [PATCH] Remove CPU set size checking from sched_setaffinity,
 pthread_setaffinity_np
To: libc-alpha@sourceware.org

With current kernel versions, the check does not reliably detect that
unavailable CPUs are requested, for these reasons:

(1) The kernel will silently ignore non-allowed CPUs.

(2) Similarly, CPU bits which lack an online CPU (possible CPUs)
    are ignored.

(3) The existing probing code assumes that the CPU mask size is a
    power of two and at least 1024.  It neither has to be a power
    of two, nor is 1024 the minimum possible value, so the value
    determined is often too large, resulting in false negatives.

The kernel will still return EINVAL if no CPU in the requested set
remains which can run the current thread after the affinity change.

Applications which care about the exact affinity mask will have to query
it using sched_getaffinity after setting it.
---
 ChangeLog                                     |  21 ++
 manual/threads.texi                           |   2 -
 nptl/Makefile                                 |   3 +-
 nptl/check-cpuset.h                           |  32 ---
 nptl/pthread_attr_setaffinity.c               |   6 -
 nptl/pthread_setattr_default_np.c             |   5 -
 nptl/tst-thread-affinity.c                    | 292 ++++++++++++++++++++++++++
 posix/Makefile                                |   3 +-
 posix/tst-affinity.c                          | 254 ++++++++++++++++++++++
 sysdeps/unix/sysv/linux/check-cpuset.h        |  48 -----
 sysdeps/unix/sysv/linux/pthread_setaffinity.c |  48 -----
 sysdeps/unix/sysv/linux/sched_setaffinity.c   |  37 ----
 12 files changed, 571 insertions(+), 180 deletions(-)
 delete mode 100644 nptl/check-cpuset.h
 create mode 100644 nptl/tst-thread-affinity.c
 create mode 100644 posix/tst-affinity.c
 delete mode 100644 sysdeps/unix/sysv/linux/check-cpuset.h

diff --git a/ChangeLog b/ChangeLog
index 4de8a25..efb21aa 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,24 @@ 
+2015-05-18  Florian Weimer  <fweimer@redhat.com>
+
+	* nptl/check-cpuset.h: Remove.
+	* nptl/pthread_attr_setaffinity.c (__pthread_attr_setaffinity_new):
+	Remove CPU set size check.
+	* nptl/pthread_setattr_default_np.c (pthread_setattr_default_np):
+	Likewise.
+	* sysdeps/unix/sysv/linux/check-cpuset.h: Remove.
+	* sysdeps/unix/sysv/linux/pthread_setaffinity.c
+	(__kernel_cpumask_size, __determine_cpumask_size): Remove.
+	(__pthread_setaffinity_new): Remove CPU set size check.
+	* sysdeps/unix/sysv/linux/sched_setaffinity.c
+	(__kernel_cpumask_size): Remove.
+	(__sched_setaffinity_new): Remove CPU set size check.
+	* manual/threads.texi (Default Thread Attributes): Remove stale
+	reference to check_cpuset_attr, determine_cpumask_size in comment.
+	* posix/Makefile (tests): Add tst-affinity.
+	* posix/tst-affinity.c: New file.
+	* nptl/Makefile (tests): Add tst-thread-affinity.
+	* nptl/tst-thread-affinity.c: New file.
+
 2015-05-18  Arjun Shankar  <arjun.is@lostca.se>
 
 	* include/stdio.h: Define __need_wint_t.
diff --git a/manual/threads.texi b/manual/threads.texi
index 4d080d4..00cc725 100644
--- a/manual/threads.texi
+++ b/manual/threads.texi
@@ -111,8 +111,6 @@  failure.
 @c  check_sched_priority_attr ok
 @c   sched_get_priority_min dup ok
 @c   sched_get_priority_max dup ok
-@c  check_cpuset_attr ok
-@c   determine_cpumask_size ok
 @c  check_stacksize_attr ok
 @c  lll_lock @asulock @aculock
 @c  free dup @ascuheap @acsmem
diff --git a/nptl/Makefile b/nptl/Makefile
index d784c8d..2dc5467 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -272,7 +272,8 @@  tests = tst-typesizes \
 	tst-getpid3 \
 	tst-setuid3 \
 	tst-initializers1 $(addprefix tst-initializers1-,c89 gnu89 c99 gnu99) \
-	tst-bad-schedattr
+	tst-bad-schedattr \
+	tst-thread-affinity
 xtests = tst-setuid1 tst-setuid1-static tst-setuid2 \
 	tst-mutexpp1 tst-mutexpp6 tst-mutexpp10
 test-srcs = tst-oddstacklimit
diff --git a/nptl/check-cpuset.h b/nptl/check-cpuset.h
deleted file mode 100644
index 315bdf2..0000000
--- a/nptl/check-cpuset.h
+++ /dev/null
@@ -1,32 +0,0 @@ 
-/* Validate cpu_set_t values for NPTL.  Stub version.
-   Copyright (C) 2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <errno.h>
-
-/* Returns 0 if CS and SZ are valid values for the cpuset and cpuset size
-   respectively.  Otherwise it returns an error number.  */
-static inline int
-check_cpuset_attr (const cpu_set_t *cs, const size_t sz)
-{
-  if (sz == 0)
-    return 0;
-
-  /* This means pthread_attr_setaffinity will return ENOSYS, which
-     is the right thing when the cpu_set_t features are not available.  */
-  return ENOSYS;
-}
diff --git a/nptl/pthread_attr_setaffinity.c b/nptl/pthread_attr_setaffinity.c
index 7a127b8..571835d 100644
--- a/nptl/pthread_attr_setaffinity.c
+++ b/nptl/pthread_attr_setaffinity.c
@@ -23,7 +23,6 @@ 
 #include <string.h>
 #include <pthreadP.h>
 #include <shlib-compat.h>
-#include <check-cpuset.h>
 
 
 int
@@ -43,11 +42,6 @@  __pthread_attr_setaffinity_new (pthread_attr_t *attr, size_t cpusetsize,
     }
   else
     {
-      int ret = check_cpuset_attr (cpuset, cpusetsize);
-
-      if (ret)
-        return ret;
-
       if (iattr->cpusetsize != cpusetsize)
 	{
 	  void *newp = (cpu_set_t *) realloc (iattr->cpuset, cpusetsize);
diff --git a/nptl/pthread_setattr_default_np.c b/nptl/pthread_setattr_default_np.c
index 457a467..1a661f1 100644
--- a/nptl/pthread_setattr_default_np.c
+++ b/nptl/pthread_setattr_default_np.c
@@ -21,7 +21,6 @@ 
 #include <pthreadP.h>
 #include <assert.h>
 #include <string.h>
-#include <check-cpuset.h>
 
 
 int
@@ -48,10 +47,6 @@  pthread_setattr_default_np (const pthread_attr_t *in)
 	return ret;
     }
 
-  ret = check_cpuset_attr (real_in->cpuset, real_in->cpusetsize);
-  if (ret)
-    return ret;
-
   /* stacksize == 0 is fine.  It means that we don't change the current
      value.  */
   if (real_in->stacksize != 0)
diff --git a/nptl/tst-thread-affinity.c b/nptl/tst-thread-affinity.c
new file mode 100644
index 0000000..e489333
--- /dev/null
+++ b/nptl/tst-thread-affinity.c
@@ -0,0 +1,292 @@ 
+/* Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <pthread.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <sys/time.h>
+
+static int
+setaffinity (size_t size, const cpu_set_t *set)
+{
+  int ret = pthread_setaffinity_np (pthread_self (), size, set);
+  if (ret != 0)
+    {
+      errno = ret;
+      return -1;
+    }
+  return 0;
+}
+
+static int
+getaffinity (size_t size, cpu_set_t *set)
+{
+  int ret = pthread_getaffinity_np (pthread_self (), size, set);
+  if (ret != 0)
+    {
+      errno = ret;
+      return -1;
+    }
+  return 0;
+}
+
+struct conf;
+static bool early_test (struct conf *);
+
+#define SETAFFINITY(size, set) setaffinity ((size), (set))
+#define GETAFFINITY(size, set) getaffinity ((size), (set))
+#define EARLY_TEST(conf) early_test (conf)
+
+#include "../posix/tst-affinity.c"
+
+static int still_running;
+static int failed;
+
+static void *
+thread_burn_one_cpu (void *closure)
+{
+  int cpu = (uintptr_t) closure;
+  while (__atomic_load_n (&still_running, __ATOMIC_RELAXED) == 0)
+    {
+      int current = sched_getcpu ();
+      if (current != cpu)
+	{
+	  printf ("FAIL: Pinned thread %d ran on impossible cpu %d\n",
+		  cpu, current);
+	  __atomic_store_n (&failed, 1, __ATOMIC_RELAXED);
+	  __atomic_store_n (&still_running, 1, __ATOMIC_RELAXED);
+	}
+    }
+  return NULL;
+}
+
+struct burn_thread
+{
+  pthread_t self;
+  struct conf *conf;
+  cpu_set_t *initial_set;
+  cpu_set_t *seen_set;
+  int thread;
+};
+
+static void *
+thread_burn_any_cpu (void *closure)
+{
+  struct burn_thread *param = closure;
+
+  /* Schedule this thread around a bit to see if it lands on another
+     CPU.  Run this for about 3 seconds, once with sched_yield, once
+     without.  */
+  for (int pass = 1; pass <= 2; ++pass)
+    {
+      time_t start = time (NULL);
+      while (time (NULL) - start < 3)
+	{
+	  int cpu = sched_getcpu ();
+	  if (cpu > param->conf->last_cpu
+	      || !CPU_ISSET_S (cpu, CPU_ALLOC_SIZE (param->conf->set_size),
+			       param->initial_set))
+	    {
+	      printf ("FAIL: Unpinned thread %d ran on impossible CPU %d\n",
+		      param->thread, cpu);
+	      __atomic_store_n (&failed, 1, __ATOMIC_RELAXED);
+	      return NULL;
+	    }
+	  CPU_SET_S (cpu, CPU_ALLOC_SIZE (param->conf->set_size),
+		     param->seen_set);
+	  if (pass == 1)
+	    sched_yield ();
+	}
+    }
+  return NULL;
+}
+
+static void
+stop_and_join_threads (struct conf *conf, cpu_set_t *set,
+		       pthread_t *pinned_first, pthread_t *pinned_last,
+		       struct burn_thread *other_first,
+		       struct burn_thread *other_last)
+{
+  __atomic_store_n (&still_running, 1, __ATOMIC_RELAXED);
+  for (pthread_t *p = pinned_first; p < pinned_last; ++p)
+    {
+      int cpu = p - pinned_first;
+      if (!CPU_ISSET_S (cpu, CPU_ALLOC_SIZE (conf->set_size), set))
+	continue;
+
+      int ret = pthread_join (*p, NULL);
+      if (ret != 0)
+	{
+	  printf ("Failed to join thread %d: %s\n", cpu, strerror (ret));
+	  fflush (stdout);
+	  /* Cannot shut down cleanly with threads still running.  */
+	  abort ();
+	}
+    }
+
+  for (struct burn_thread *p = other_first; p < other_last; ++p)
+    {
+      int cpu = p - other_first;
+      if (!CPU_ISSET_S (cpu, CPU_ALLOC_SIZE (conf->set_size), set))
+	continue;
+
+      int ret = pthread_join (p->self, NULL);
+      if (ret != 0)
+	{
+	  printf ("Failed to join thread %d: %s\n", cpu, strerror (ret));
+	  fflush (stdout);
+	  /* Cannot shut down cleanly with threads still running.  */
+	  abort ();
+	}
+    }
+}
+
+/* Tries to check that the initial set of CPUs is complete and that
+   the main thread will not run on any other CPUs.  */
+static bool
+early_test (struct conf *conf)
+{
+  pthread_t *pinned_threads
+    = calloc (conf->last_cpu + 1, sizeof (*pinned_threads));
+  struct burn_thread *other_threads
+    = calloc (conf->last_cpu + 1, sizeof (*other_threads));
+  cpu_set_t *initial_set = CPU_ALLOC (conf->set_size);
+  cpu_set_t *scratch_set = CPU_ALLOC (conf->set_size);
+
+  if (pinned_threads == NULL || other_threads == NULL
+      || initial_set == NULL || scratch_set == NULL)
+    {
+      puts ("Memory allocation failure");
+      return false;
+    }
+  if (getaffinity (CPU_ALLOC_SIZE (conf->set_size), initial_set) < 0)
+    {
+      printf ("pthread_getaffinity_np failed: %m\n");
+      return false;
+    }
+  for (int cpu = 0; cpu <= conf->last_cpu; ++cpu)
+    {
+      if (!CPU_ISSET_S (cpu, CPU_ALLOC_SIZE (conf->set_size), initial_set))
+	continue;
+      other_threads[cpu].conf = conf;
+      other_threads[cpu].initial_set = initial_set;
+      other_threads[cpu].thread = cpu;
+      other_threads[cpu].seen_set = CPU_ALLOC (conf->set_size);
+      if (other_threads[cpu].seen_set == NULL)
+	{
+	  puts ("Memory allocation failure");
+	  return false;
+	}
+      CPU_ZERO_S (CPU_ALLOC_SIZE (conf->set_size),
+		  other_threads[cpu].seen_set);
+    }
+
+  pthread_attr_t attr;
+  int ret = pthread_attr_init (&attr);
+  if (ret != 0)
+    {
+      printf ("pthread_attr_init failed: %s\n", strerror (ret));
+      return false;
+    }
+
+  /* Spawn a thread pinned to each available CPU.  */
+  for (int cpu = 0; cpu <= conf->last_cpu; ++cpu)
+    {
+      if (!CPU_ISSET_S (cpu, CPU_ALLOC_SIZE (conf->set_size), initial_set))
+	continue;
+      CPU_ZERO_S (CPU_ALLOC_SIZE (conf->set_size), scratch_set);
+      CPU_SET_S (cpu, CPU_ALLOC_SIZE (conf->set_size), scratch_set);
+      ret = pthread_attr_setaffinity_np
+	(&attr, CPU_ALLOC_SIZE (conf->set_size), scratch_set);
+      if (ret != 0)
+	{
+	  printf ("pthread_attr_setaffinity_np for CPU %d failed: %s\n",
+		  cpu, strerror (ret));
+	  stop_and_join_threads (conf, initial_set,
+				 pinned_threads, pinned_threads + cpu,
+				 NULL, NULL);
+	  return false;
+	}
+      ret = pthread_create (pinned_threads + cpu, &attr,
+			    thread_burn_one_cpu, (void *) (uintptr_t) cpu);
+      if (ret != 0)
+	{
+	  printf ("pthread_create for CPU %d failed: %s\n",
+		  cpu, strerror (ret));
+	  stop_and_join_threads (conf, initial_set,
+				 pinned_threads, pinned_threads + cpu,
+				 NULL, NULL);
+	  return false;
+	}
+    }
+
+  /* Spawn another set of threads running on all CPUs.  */
+  for (int cpu = 0; cpu <= conf->last_cpu; ++cpu)
+    {
+      if (!CPU_ISSET_S (cpu, CPU_ALLOC_SIZE (conf->set_size), initial_set))
+	continue;
+      ret = pthread_create (&other_threads[cpu].self, NULL,
+			    thread_burn_any_cpu, other_threads + cpu);
+      if (ret != 0)
+	{
+	  printf ("pthread_create for thread %d failed: %s\n",
+		  cpu, strerror (ret));
+	  stop_and_join_threads (conf, initial_set,
+				 pinned_threads,
+				 pinned_threads + conf->last_cpu + 1,
+				 other_threads, other_threads + cpu);
+	  return false;
+	}
+    }
+
+  /* Main thread.  */
+  struct burn_thread main_thread;
+  main_thread.conf = conf;
+  main_thread.initial_set = initial_set;
+  main_thread.seen_set = scratch_set;
+  main_thread.thread = -1;
+  CPU_ZERO_S (CPU_ALLOC_SIZE (conf->set_size), main_thread.seen_set);
+  thread_burn_any_cpu (&main_thread);
+  stop_and_join_threads (conf, initial_set,
+			 pinned_threads,
+			 pinned_threads + conf->last_cpu + 1,
+			 other_threads, other_threads + conf->last_cpu + 1);
+
+  printf ("Main thread ran on %d CPU(s) of %d available CPU(s)\n",
+	  CPU_COUNT_S (CPU_ALLOC_SIZE (conf->set_size), scratch_set),
+	  CPU_COUNT_S (CPU_ALLOC_SIZE (conf->set_size), initial_set));
+  CPU_ZERO_S (CPU_ALLOC_SIZE (conf->set_size), scratch_set);
+  for (int cpu = 0; cpu <= conf->last_cpu; ++cpu)
+    {
+      if (!CPU_ISSET_S (cpu, CPU_ALLOC_SIZE (conf->set_size), initial_set))
+	continue;
+      CPU_OR_S (CPU_ALLOC_SIZE (conf->set_size),
+		scratch_set, scratch_set, other_threads[cpu].seen_set);
+      CPU_FREE (other_threads[cpu].seen_set);
+    }
+  printf ("Other threads ran on %d CPU(s)\n",
+	  CPU_COUNT_S (CPU_ALLOC_SIZE (conf->set_size), scratch_set));
+
+
+  pthread_attr_destroy (&attr);
+  CPU_FREE (scratch_set);
+  CPU_FREE (initial_set);
+  free (pinned_threads);
+  free (other_threads);
+  return failed == 0;
+}
diff --git a/posix/Makefile b/posix/Makefile
index 15e8818..5e70a10 100644
--- a/posix/Makefile
+++ b/posix/Makefile
@@ -87,7 +87,8 @@  tests		:= tstgetopt testfnm runtests runptests	     \
 		   bug-getopt1 bug-getopt2 bug-getopt3 bug-getopt4 \
 		   bug-getopt5 tst-getopt_long1 bug-regex34 bug-regex35 \
 		   tst-pathconf tst-getaddrinfo4 tst-rxspencer-no-utf8 \
-		   tst-fnmatch3 bug-regex36 tst-getaddrinfo5
+		   tst-fnmatch3 bug-regex36 tst-getaddrinfo5 \
+		   tst-affinity
 xtests		:= bug-ga2
 ifeq (yes,$(build-shared))
 test-srcs	:= globtest
diff --git a/posix/tst-affinity.c b/posix/tst-affinity.c
new file mode 100644
index 0000000..1078c63
--- /dev/null
+++ b/posix/tst-affinity.c
@@ -0,0 +1,254 @@ 
+/* Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file is included by nptl/tst-thread-affinity.c to test the
+   pthread variants of the functions.  */
+
+#include <errno.h>
+#include <limits.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdio.h>
+
+/* Override this to test other functions.  */
+#ifndef GETAFFINITY
+#define GETAFFINITY(size, set) sched_getaffinity (0, (size), (set))
+#endif
+#ifndef SETAFFINITY
+#define SETAFFINITY(size, set) sched_setaffinity (0, (size), (set))
+#endif
+#ifndef EARLY_TEST
+#define EARLY_TEST(conf) true
+#endif
+
+struct conf
+{
+  int set_size;			/* in bits */
+  int last_cpu;
+};
+
+static int
+find_set_size (void)
+{
+  /* We need to use multiples of 64 because otherwise, CPU_ALLOC
+     over-allocates, and we do not see all bits returned by the
+     kernel.  */
+  for (int num_cpus = 64; num_cpus <= INT_MAX / 2; num_cpus += 64)
+    {
+      cpu_set_t *set = CPU_ALLOC (num_cpus);
+      size_t size = CPU_ALLOC_SIZE (num_cpus);
+
+      if (set == NULL)
+	{
+	  printf ("CPU_ALLOC(%d) failed\n", num_cpus);
+	  return -1;
+	}
+      if (GETAFFINITY (size, set) == 0)
+	{
+	  CPU_FREE (set);
+	  return num_cpus;
+	}
+      if (errno != EINVAL)
+	{
+	  printf ("getaffinity for %d CPUs: %m\n", num_cpus);
+	  CPU_FREE (set);
+	  return -1;
+	}
+      CPU_FREE (set);
+    }
+  puts ("Cannot find maximum CPU number");
+  return -1;
+}
+
+static int
+find_last_cpu (const cpu_set_t *set, size_t size)
+{
+  size_t cpus_found = 0;
+  size_t total_cpus = CPU_COUNT_S (size, set);
+  int last_cpu = -1;
+
+  for (int cpu = 0; cpus_found < total_cpus; ++cpu)
+    {
+      if (CPU_ISSET_S (cpu, size, set))
+	{
+	  last_cpu = cpu;
+	  ++cpus_found;
+	}
+    }
+  return last_cpu;
+}
+
+static void
+setup_conf (struct conf *conf)
+{
+  *conf = (struct conf) {-1, -1};
+  conf->set_size = find_set_size ();
+  if (conf->set_size > 0)
+    {
+      cpu_set_t *set = CPU_ALLOC (conf->set_size);
+
+      if (set == NULL)
+	{
+	  printf ("CPU_ALLOC (%d) failed\n", conf->set_size);
+	  CPU_FREE (set);
+	  return;
+	}
+      if (GETAFFINITY (CPU_ALLOC_SIZE (conf->set_size), set) < 0)
+	{
+	  printf ("getaffinity failed: %m\n");
+	  CPU_FREE (set);
+	  return;
+	}
+      conf->last_cpu = find_last_cpu (set, CPU_ALLOC_SIZE (conf->set_size));
+      if (conf->last_cpu < 0)
+	puts ("No test CPU found");
+      CPU_FREE (set);
+    }
+}
+
+static bool
+test_size (const struct conf *conf, size_t size)
+{
+  if (size < conf->set_size)
+    {
+      printf ("Test not run for CPU set size %zu\n", size);
+      return true;
+    }
+
+  cpu_set_t *initial_set = CPU_ALLOC (size);
+  cpu_set_t *set2 = CPU_ALLOC (size);
+  cpu_set_t *active_cpu_set = CPU_ALLOC (size);
+
+  if (initial_set == NULL || set2 == NULL || active_cpu_set == NULL)
+    {
+      printf ("size %zu: CPU_ALLOC failed\n", size);
+      return false;
+    }
+  size = CPU_ALLOC_SIZE (size);
+
+  if (GETAFFINITY (size, initial_set) < 0)
+    {
+      printf ("size %zu: getaffinity: %m\n", size);
+      return false;
+    }
+  if (SETAFFINITY (size, initial_set) < 0)
+    {
+      printf ("size %zu: setaffinity: %m\n", size);
+      return true;
+    }
+
+  /* Use one-CPU set to test switching between CPUs.  */
+  int last_active_cpu = -1;
+  for (int cpu = 0; cpu <= conf->last_cpu; ++cpu)
+    {
+      int active_cpu = sched_getcpu ();
+      if (last_active_cpu >= 0 && last_active_cpu != active_cpu)
+	{
+	  printf ("Unexpected CPU %d, expected %d\n",
+		  active_cpu, last_active_cpu);
+	  return false;
+	}
+
+      if (!CPU_ISSET_S (cpu, size, initial_set))
+	continue;
+      last_active_cpu = cpu;
+
+      CPU_ZERO_S (size, active_cpu_set);
+      CPU_SET_S (cpu, size, active_cpu_set);
+      if (SETAFFINITY (size, active_cpu_set) < 0)
+	{
+	  printf ("size %zu: setaffinity (%d): %m\n", size, cpu);
+	  return false;
+	}
+      active_cpu = sched_getcpu ();
+      if (active_cpu != cpu)
+	{
+	  printf ("Unexpected CPU %d, expected %d\n", active_cpu, cpu);
+	  return false;
+	}
+      if (GETAFFINITY (size, set2) < 0)
+	{
+	  printf ("size %zu: getaffinity (2): %m\n", size);
+	  return false;
+	}
+      if (!CPU_EQUAL_S (size, active_cpu_set, set2))
+	{
+	  printf ("size %zu: CPU sets do not match\n", size);
+	  return false;
+	}
+    }
+
+  if (SETAFFINITY (size, initial_set) < 0)
+    {
+      printf ("size %zu: setaffinity (3): %m\n", size);
+      return false;
+    }
+  if (GETAFFINITY (size, set2) < 0)
+    {
+      printf ("size %zu: getaffinity (3): %m\n", size);
+      return false;
+    }
+  if (!CPU_EQUAL_S (size, initial_set, set2))
+    {
+      printf ("size %zu: CPU sets do not match (2)\n", size);
+      return false;
+    }
+
+  CPU_FREE (initial_set);
+  CPU_FREE (set2);
+  CPU_FREE (active_cpu_set);
+
+  return true;
+}
+
+static int
+do_test (void)
+{
+  {
+    cpu_set_t set;
+    if (GETAFFINITY (sizeof (set), &set) < 0 && errno == ENOSYS)
+      {
+	puts ("getaffinity not supported");
+	return 0;
+      }
+  }
+
+  struct conf conf;
+  setup_conf (&conf);
+  printf ("Detected CPU set size (in bits): %d\n", conf.set_size);
+  printf ("Maximum test CPU: %d\n", conf.last_cpu);
+  if (conf.set_size < 0 || conf.last_cpu < 0)
+    return 1;
+
+  if (!EARLY_TEST (&conf))
+    return 1;
+
+  if (test_size (&conf, 1024)
+      && test_size (&conf, 2)
+      && test_size (&conf, 32)
+      && test_size (&conf, 40)
+      && test_size (&conf, 64)
+      && test_size (&conf, 96)
+      && test_size (&conf, 128)
+      && test_size (&conf, 256)
+      && test_size (&conf, 1024 * 1024))
+    return 0;
+  return 1;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
diff --git a/sysdeps/unix/sysv/linux/check-cpuset.h b/sysdeps/unix/sysv/linux/check-cpuset.h
deleted file mode 100644
index 1d55e0b..0000000
--- a/sysdeps/unix/sysv/linux/check-cpuset.h
+++ /dev/null
@@ -1,48 +0,0 @@ 
-/* Validate cpu_set_t values for NPTL.  Linux version.
-   Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <pthread.h>
-#include <errno.h>
-
-
-/* Defined in pthread_setaffinity.c.  */
-extern size_t __kernel_cpumask_size attribute_hidden;
-extern int __determine_cpumask_size (pid_t tid);
-
-/* Returns 0 if CS and SZ are valid values for the cpuset and cpuset size
-   respectively.  Otherwise it returns an error number.  */
-static inline int
-check_cpuset_attr (const cpu_set_t *cs, const size_t sz)
-{
-  if (__kernel_cpumask_size == 0)
-    {
-      int res = __determine_cpumask_size (THREAD_SELF->tid);
-      if (res)
-	return res;
-    }
-
-  /* Check whether the new bitmask has any bit set beyond the
-     last one the kernel accepts.  */
-  for (size_t cnt = __kernel_cpumask_size; cnt < sz; ++cnt)
-    if (((char *) cs)[cnt] != '\0')
-      /* Found a nonzero byte.  This means the user request cannot be
-	 fulfilled.  */
-      return EINVAL;
-
-  return 0;
-}
diff --git a/sysdeps/unix/sysv/linux/pthread_setaffinity.c b/sysdeps/unix/sysv/linux/pthread_setaffinity.c
index e891818..2ebf09d 100644
--- a/sysdeps/unix/sysv/linux/pthread_setaffinity.c
+++ b/sysdeps/unix/sysv/linux/pthread_setaffinity.c
@@ -23,62 +23,14 @@ 
 #include <shlib-compat.h>
 
 
-size_t __kernel_cpumask_size attribute_hidden;
-
-
-/* Determine the size of cpumask_t in the kernel.  */
-int
-__determine_cpumask_size (pid_t tid)
-{
-  size_t psize;
-  int res;
-
-  for (psize = 128; ; psize *= 2)
-    {
-      char buf[psize];
-      INTERNAL_SYSCALL_DECL (err);
-
-      res = INTERNAL_SYSCALL (sched_getaffinity, err, 3, tid, psize, buf);
-      if (INTERNAL_SYSCALL_ERROR_P (res, err))
-	{
-	  if (INTERNAL_SYSCALL_ERRNO (res, err) != EINVAL)
-	    return INTERNAL_SYSCALL_ERRNO (res, err);
-	}
-      else
-	break;
-    }
-
-  if (res != 0)
-    __kernel_cpumask_size = res;
-
-  return 0;
-}
-
-
 int
 __pthread_setaffinity_new (pthread_t th, size_t cpusetsize,
 			   const cpu_set_t *cpuset)
 {
   const struct pthread *pd = (const struct pthread *) th;
-
   INTERNAL_SYSCALL_DECL (err);
   int res;
 
-  if (__glibc_unlikely (__kernel_cpumask_size == 0))
-    {
-      res = __determine_cpumask_size (pd->tid);
-      if (res != 0)
-	return res;
-    }
-
-  /* We now know the size of the kernel cpumask_t.  Make sure the user
-     does not request to set a bit beyond that.  */
-  for (size_t cnt = __kernel_cpumask_size; cnt < cpusetsize; ++cnt)
-    if (((char *) cpuset)[cnt] != '\0')
-      /* Found a nonzero byte.  This means the user request cannot be
-	 fulfilled.  */
-      return EINVAL;
-
   res = INTERNAL_SYSCALL (sched_setaffinity, err, 3, pd->tid, cpusetsize,
 			  cpuset);
 
diff --git a/sysdeps/unix/sysv/linux/sched_setaffinity.c b/sysdeps/unix/sysv/linux/sched_setaffinity.c
index b528617..dfddce7 100644
--- a/sysdeps/unix/sysv/linux/sched_setaffinity.c
+++ b/sysdeps/unix/sysv/linux/sched_setaffinity.c
@@ -22,50 +22,13 @@ 
 #include <unistd.h>
 #include <sys/types.h>
 #include <shlib-compat.h>
-#include <alloca.h>
 
 
 #ifdef __NR_sched_setaffinity
-static size_t __kernel_cpumask_size;
-
 
 int
 __sched_setaffinity_new (pid_t pid, size_t cpusetsize, const cpu_set_t *cpuset)
 {
-  if (__glibc_unlikely (__kernel_cpumask_size == 0))
-    {
-      INTERNAL_SYSCALL_DECL (err);
-      int res;
-
-      size_t psize = 128;
-      void *p = alloca (psize);
-
-      while (res = INTERNAL_SYSCALL (sched_getaffinity, err, 3, getpid (),
-				     psize, p),
-	     INTERNAL_SYSCALL_ERROR_P (res, err)
-	     && INTERNAL_SYSCALL_ERRNO (res, err) == EINVAL)
-	p = extend_alloca (p, psize, 2 * psize);
-
-      if (res == 0 || INTERNAL_SYSCALL_ERROR_P (res, err))
-	{
-	  __set_errno (INTERNAL_SYSCALL_ERRNO (res, err));
-	  return -1;
-	}
-
-      __kernel_cpumask_size = res;
-    }
-
-  /* We now know the size of the kernel cpumask_t.  Make sure the user
-     does not request to set a bit beyond that.  */
-  for (size_t cnt = __kernel_cpumask_size; cnt < cpusetsize; ++cnt)
-    if (((char *) cpuset)[cnt] != '\0')
-      {
-        /* Found a nonzero byte.  This means the user request cannot be
-	   fulfilled.  */
-	__set_errno (EINVAL);
-	return -1;
-      }
-
   int result = INLINE_SYSCALL (sched_setaffinity, 3, pid, cpusetsize, cpuset);
 
 #ifdef RESET_VGETCPU_CACHE
-- 
2.1.0