
What *is* the API for sched_getaffinity? Should sched_getaffinity always succeed when using cpu_set_t?

Message ID 558DB0A0.2040707@gmail.com
State New

Commit Message

Michael Kerrisk (man-pages) June 26, 2015, 8:05 p.m. UTC
Sigh.... I forgot much of what I learned as I wrote the CPU_SET(3) 
page many years ago. Revised patch below.

On 06/26/2015 04:28 PM, Michael Kerrisk (man-pages) wrote:
> Carlos,
> 
> On 07/23/2013 12:34 AM, Carlos O'Donell wrote:
>> On 07/22/2013 05:43 PM, Roland McGrath wrote:
>>>> I can fix the glibc manual. A 'configured' CPU is one that the OS
>>>> can bring online.
>>>
>>> Where do you get this definition, in the absence of a standard that
>>> specifies _SC_NPROCESSORS_CONF?  The only definition I've ever known for
>>> _SC_NPROCESSORS_CONF is a value that's constant for at least the life of
>>> the process (and probably until reboot) that is the upper bound for what
>>> _SC_NPROCESSORS_ONLN might ever report.  If the implementation for Linux is
>>> inconsistent with that definition, then it's just a bug in the implementation.
>>
>> Let me reiterate my understanding such that you can help me clarify
>> exactly my interpretation of the glibc manual wording regarding the
>> two existing constants.
>>
>> The reality of the situation is that the linux kernel as an abstraction
>> presents the following:
>>
>> (a) The number of online cpus.
>>     - Changes dynamically.
>>     - Not constant for the life of the process, but pretty constant.
>>
>> (b) The number of configured cpus.
>>     - The number of detected cpus that the OS could access.
>>     - Some of them may be offline for various reasons.
>>     - Changes dynamically with hotplug.
>>
>> (c) The number of possible CPUs the OS or hardware can support.
>>     - The internal software infrastructure is designed to support at
>>       most this many cpus.
>>     - Constant for the uptime of the system.
>>     - May be tied in some way to the hardware.
>>
>> On Linux, glibc currently maps _SC_NPROCESSORS_CONF to (b) via
>> /sys/devices/system/cpu/cpu*, and _SC_NPROCESSORS_ONLN to (a) via
>> /sys/devices/system/cpu/online.
>>
>> The problem is that sched_getaffinity and sched_setaffinity only care
>> about (c) since the size of the kernel affinity mask is of size (c).
>>
>> What Motohiro-san was requesting was that the manual should make it clear
>> that _SC_NPROCESSORS_CONF is distinct from (c) which is an OS limit that
>> the user doesn't know.
>>
>> We need not expose (c) as a new _SC_* constant, since it's not really
>> required: glibc's sched_getaffinity and sched_setaffinity could
>> hide the fact that (c) exists from userspace (and that's what I suggest
>> should happen).
>>
>> Does that clarify my statement?
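
(As a concrete aside -- my addition, not part of Carlos's mail: (a) and (b)
are what glibc exposes through sysconf(); (c) has no _SC_* constant, and on
Linux it can only be inferred, e.g. from /sys/devices/system/cpu/possible or
by probing the raw affinity calls.)

    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* (a) online CPUs, from /sys/devices/system/cpu/online */
        printf("online (a):     %ld\n", sysconf(_SC_NPROCESSORS_ONLN));

        /* (b) configured CPUs, from /sys/devices/system/cpu/cpu* */
        printf("configured (b): %ld\n", sysconf(_SC_NPROCESSORS_CONF));

        /* (c) possible CPUs has no sysconf() constant; see
           /sys/devices/system/cpu/possible or probe sched_getaffinity() */
        return 0;
    }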
> 
> It's a long time since the last activity in this discussion, and I see that
> https://sourceware.org/bugzilla/show_bug.cgi?id=15630
> remains open. I propose to apply the patch below to the
> sched_setaffinity/sched_getaffinity man page. Seem okay?
> 
> Cheers,
> 
> Michael
> 
> 
> --- a/man2/sched_setaffinity.2
> +++ b/man2/sched_setaffinity.2
> @@ -333,6 +334,57 @@ main(int argc, char *argv[])
>      }
>  }
>  .fi
> +.SH BUGS
> +The glibc
> +.BR sched_setaffinity ()
> +and
> +.BR sched_getaffinity ()
> +wrapper functions do not handle systems with more than 1024 CPUs.
> +.\" FIXME . See https://sourceware.org/bugzilla/show_bug.cgi?id=15630
> +.\" and https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html
> +The
> +.I cpu_set_t
> +data type used by glibc has a fixed size of 128 bytes,
> +meaning that the maximum CPU number that can be represented is 1023.
> +If the system has more than 1024 CPUs, then:
> +.IP * 3
> +The
> +.BR sched_setaffinity ()
> +.I mask
> +argument is not capable of representing the excess CPUs.
> +.IP *
> +Calls of the form:
> +
> +    sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
> +
> +will fail with error
> +.BR EINVAL ,
> +the error produced by the underlying system call for the case where the
> +.I mask
> +size specified in
> +.I cpusetsize
> +is smaller than the size of the affinity mask used by the kernel.
> +.PP
> +The workaround for this problem is to fall back to the use of the
> +underlying system call (via
> +.BR syscall (2)),
> +passing
> +.I mask
> +arguments of a sufficient size.
> +Using a value based on the number of online CPUs:
> +
> +    (sysconf(_SC_NPROCESSORS_CONF) / (sizeof(unsigned long) * 8) + 1)
> +                                   * sizeof(unsigned long)
> +
> +is probably sufficient as the size of the mask,
> +although the value returned by the
> +.BR sysconf ()
> +call can in theory change during the lifetime of the process.
> +Alternatively, one can probe for the size of the required mask using raw
> +.BR sched_getaffinity ()
> +system calls with increasing mask sizes
> +until the call does not fail with the error
> +.BR EINVAL .
>  .SH SEE ALSO
>  .ad l
>  .nh

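For illustration only (my addition, not part of the patch), here is a rough
sketch of the raw-syscall fallback that the draft BUGS text above describes.
The sizing heuristic is taken from that draft; as discussed further down the
thread, _SC_NPROCESSORS_CONF is not guaranteed to match the kernel's mask
size, so treat this purely as a sketch of the idea:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Call the raw sched_getaffinity() syscall with a mask sized from
       _SC_NPROCESSORS_CONF, as the draft text above suggests.  On
       success, the raw syscall returns the number of bytes written. */
    static long
    raw_getaffinity(pid_t pid, unsigned long **maskp, size_t *sizep)
    {
        long nconf = sysconf(_SC_NPROCESSORS_CONF);
        size_t size = (nconf / (sizeof(unsigned long) * 8) + 1)
                      * sizeof(unsigned long);
        unsigned long *mask = calloc(1, size);

        if (mask == NULL)
            return -1;

        long ret = syscall(SYS_sched_getaffinity, pid, size, mask);
        if (ret == -1) {
            free(mask);
            return -1;
        }
        *maskp = mask;
        *sizep = size;
        return ret;
    }
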
Okay -- scratch the above. How about the patch below?

Cheers,

Michael

Comments

Tolga Dalman June 29, 2015, 9:40 p.m. UTC | #1
Michael,

given the approach is accepted by Carlos and Roland, I have
some minor textual suggestions for the patch itself.

On 06/26/2015 10:05 PM, Michael Kerrisk (man-pages) wrote:
> --- a/man2/sched_setaffinity.2
> +++ b/man2/sched_setaffinity.2
> @@ -223,6 +223,47 @@ system call returns the size (in bytes) of the
>  .I cpumask_t
>  data type that is used internally by the kernel to
>  represent the CPU set bit mask.
> +.SS Handling systems with more than 1024 CPUs

What if the system has exactly 1024 CPUs?
Suggestion: systems with 1024 or more CPUs

> +The
> +.I cpu_set_t
> +data type used by glibc has a fixed size of 128 bytes,
> +meaning that the maximum CPU number that can be represented is 1023.
> +.\" FIXME . See https://sourceware.org/bugzilla/show_bug.cgi?id=15630
> +.\" and https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html

No objection, although I have never really noticed external references
in man-pages (esp. web refs). Shouldn't these be generally avoided?
(and yes, I have noticed the FIXME)

> +If the system has more than 1024 CPUs, then calls of the form:

1024 or more CPUs.

> +
> +    sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
> +
> +will fail with the error
> +.BR EINVAL ,
> +the error produced by the underlying system call for the case where the
> +.I mask
> +size specified in
> +.I cpusetsize
> +is smaller than the size of the affinity mask used by the kernel.
> +.PP
> +The underlying system calls (which represent CPU masks as bit masks of type
> +.IR "unsigned long\ *" )
> +impose no restriction on the size of the mask.
> +To handle systems with more than 1024 CPUs, one must dynamically allocate the
> +.I mask
> +argument using
> +.BR CPU_ALLOC (3)

I would rewrite the sentence to avoid "one must".

> +and manipulate the mask using the "_S" macros described in

and manipulate the macros ending with "_S" as described in

> +.BR CPU_ALLOC (3).
> +Using an allocation based on the number of online CPUs:
> +
> +    cpu_set_t *mask = CPU_ALLOC(CPU_ALLOC_SIZE(
> +                                sysconf(_SC_NPROCESSORS_CONF)));
> +
> +is probably sufficient, although the value returned by the
> +.BR sysconf ()
> +call can in theory change during the lifetime of the process.
> +Alternatively, one can obtain a value that is guaranteed to be stable for

Like above, I would replace "one can obtain a value" by "a value can be obtained".

> +the lifetime of the process by proby for the size of the required mask using

s/proby/probing/.

> +.BR sched_getaffinity ()
> +calls with increasing mask sizes until the call does not fail with the error
> +.BR EINVAL .

I would replace "until the call does not fail with error ..." by "while the call succeeds".

Also, the sentence is too long, IMHO.

Best regards
Tolga Dalman
Florian Weimer July 1, 2015, 12:37 p.m. UTC | #2
On 06/26/2015 10:05 PM, Michael Kerrisk (man-pages) wrote:

> +.SS Handling systems with more than 1024 CPUs
> +The
> +.I cpu_set_t
> +data type used by glibc has a fixed size of 128 bytes,
> +meaning that the maximum CPU number that can be represented is 1023.
> +.\" FIXME . See https://sourceware.org/bugzilla/show_bug.cgi?id=15630
> +.\" and https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html
> +If the system has more than 1024 CPUs, then calls of the form:
> +
> +    sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
> +
> +will fail with the error
> +.BR EINVAL ,
> +the error produced by the underlying system call for the case where the
> +.I mask
> +size specified in
> +.I cpusetsize
> +is smaller than the size of the affinity mask used by the kernel.

I think it is best to leave this as unspecified as possible.  Kernel
behavior already changed once, and I can imagine it changing again.

Carlos and I tried to get clarification of the future direction of the
kernel interface here:

  <https://sourceware.org/ml/libc-alpha/2015-06/msg00210.html>

No reply so far, unless I missed something.

> +.PP
> +The underlying system calls (which represent CPU masks as bit masks of type
> +.IR "unsigned long\ *" )
> +impose no restriction on the size of the mask.
> +To handle systems with more than 1024 CPUs, one must dynamically allocate the
> +.I mask
> +argument using
> +.BR CPU_ALLOC (3)
> +and manipulate the mask using the "_S" macros described in
> +.BR CPU_ALLOC (3).
> +Using an allocation based on the number of online CPUs:
> +
> +    cpu_set_t *mask = CPU_ALLOC(CPU_ALLOC_SIZE(
> +                                sysconf(_SC_NPROCESSORS_CONF)));

I believe this is incorrect in several ways:

CPU_ALLOC uses the raw CPU counts.  CPU_ALLOC_SIZE converts from the raw
count to the size in bytes.  (This API is misdesigned.)

sysconf(_SC_NPROCESSORS_CONF) is not related to the kernel CPU mask
size, so it is not the correct value.
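
(To make that distinction concrete -- an illustration of mine, not text from
the thread: CPU_ALLOC() takes a CPU count, while CPU_ALLOC_SIZE() converts
that same count into the byte size expected by the *_S macros and by
cpusetsize.  The count below is an arbitrary placeholder, not a claim about
the correct value to use.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stddef.h>

    static void
    cpu_alloc_demo(void)
    {
        int ncpus = 2048;                        /* arbitrary placeholder */
        cpu_set_t *set = CPU_ALLOC(ncpus);       /* argument is a CPU count */
        size_t setsize = CPU_ALLOC_SIZE(ncpus);  /* byte size for that count */

        if (set == NULL)
            return;

        CPU_ZERO_S(setsize, set);
        /* ... pass (setsize, set) to sched_getaffinity(), CPU_ISSET_S(), ... */
        CPU_FREE(set);
    }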

> +is probably sufficient, although the value returned by the
> +.BR sysconf ()
> +call can in theory change during the lifetime of the process.
> +Alternatively, one can obtain a value that is guaranteed to be stable for
> +the lifetime of the process by proby for the size of the required mask using
> +.BR sched_getaffinity ()
> +calls with increasing mask sizes until the call does not fail with the error

This is the only possible way right now if you do not want to read
sysconf values.

It's also worth noting that the system call and the glibc function have
different return values.
Michael Kerrisk (man-pages) July 21, 2015, 3:03 p.m. UTC | #3
Hello Florian,

Thanks for your comments, and sorry for the delayed follow-up.

On 07/01/2015 02:37 PM, Florian Weimer wrote:
> On 06/26/2015 10:05 PM, Michael Kerrisk (man-pages) wrote:
> 
>> +.SS Handling systems with more than 1024 CPUs
>> +The
>> +.I cpu_set_t
>> +data type used by glibc has a fixed size of 128 bytes,
>> +meaning that the maximum CPU number that can be represented is 1023.
>> +.\" FIXME . See https://sourceware.org/bugzilla/show_bug.cgi?id=15630
>> +.\" and https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html
>> +If the system has more than 1024 CPUs, then calls of the form:
>> +
>> +    sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
>> +
>> +will fail with the error
>> +.BR EINVAL ,
>> +the error produced by the underlying system call for the case where the
>> +.I mask
>> +size specified in
>> +.I cpusetsize
>> +is smaller than the size of the affinity mask used by the kernel.
> 
> I think it is best to leave this as unspecified as possible.  Kernel
> behavior already changed once, and I can imagine it changing again.

Hmmm. Something needs to be said about what the kernel is doing though.
Otherwise, it's hard to make sense of this subsection. Did you have a
suggested rewording that removes the piece you find problematic?

> Carlos and I tried to get clarification of the future direction of the
> kernel interface here:
> 
>   <https://sourceware.org/ml/libc-alpha/2015-06/msg00210.html>
> 
> No reply so far, unless I missed something.

Okay

>> +.PP
>> +The underlying system calls (which represent CPU masks as bit masks of type
>> +.IR "unsigned long\ *" )
>> +impose no restriction on the size of the mask.
>> +To handle systems with more than 1024 CPUs, one must dynamically allocate the
>> +.I mask
>> +argument using
>> +.BR CPU_ALLOC (3)
>> +and manipulate the mask using the "_S" macros described in
>> +.BR CPU_ALLOC (3).
>> +Using an allocation based on the number of online CPUs:
>> +
>> +    cpu_set_t *mask = CPU_ALLOC(CPU_ALLOC_SIZE(
>> +                                sysconf(_SC_NPROCESSORS_CONF)));
> 
> I believe this is incorrect in several ways:
> 
> CPU_ALLOC uses the raw CPU counts.  CPU_ALLOC_SIZE converts from the raw
> count to the size in bytes.  (This API is misdesigned.)

D'oh! Yes, the use of CPU_ALLOC_SIZE() was clearly misguided.

> sysconf(_SC_NPROCESSORS_CONF) is not related to the kernel CPU mask
> size, so it is not the correct value.

Yes, I understand now.

>> +is probably sufficient, although the value returned by the
>> +.BR sysconf ()
>> +call can in theory change during the lifetime of the process.
>> +Alternatively, one can obtain a value that is guaranteed to be stable for
>> +the lifetime of the process by proby for the size of the required mask using
>> +.BR sched_getaffinity ()
>> +calls with increasing mask sizes until the call does not fail with the error
> 
> This is the only possible way right now if you do not want to read
> sysconf values.

Okay. I've amended the text to remove the first piece.

> It's also worth noting that the system call and the glibc function have
> different return values.

Yes, I already cover that elsewhere in the page. See the quoted text below.

Okay, so now I have:

   C library/kernel differences
       This manual page describes the  glibc  interface  for  the  CPU
       affinity  calls.   The actual system call interface is slightly
       different, with  the  mask  being  typed  as  unsigned  long *,
       reflecting  the  fact that the underlying implementation of CPU
       sets is a simple bit mask.  On success, the raw sched_getaffin‐
       ity()  system call returns the size (in bytes) of the cpumask_t
       data type that is used internally by the  kernel  to  represent
       the CPU set bit mask.

   Handling systems with more than 1024 CPUs
       The  underlying  system calls (which represent CPU masks as bit
       masks of type unsigned long *) impose  no  restriction  on  the
       size of the CPU mask.  However, the cpu_set_t data type used by
       glibc has a fixed size of 128 bytes, meaning that  the  maximum
       CPU  number that can be represented is 1023.  If the system has
       more than 1024 CPUs, then calls of the form:

           sched_getaffinity(pid, sizeof(cpu_set_t), &mask);

       will fail with the error EINVAL,  the  error  produced  by  the
       underlying  system call for the case where the mask size speci‐
       fied in cpusetsize is smaller than the  size  of  the  affinity
       mask used by the kernel.

       When  working  on  systems  with  more than 1024 CPUs, one must
       dynamically allocate the mask argument.   Currently,  the  only
       way  to do this is by probing for the size of the required mask
       using sched_getaffinity()  calls  with  increasing  mask  sizes
       (until the call does not fail with the error EINVAL).

Better?

Cheers,

Michael
Michael Kerrisk (man-pages) July 21, 2015, 3:03 p.m. UTC | #4
Hello Tolga,

On 06/29/2015 11:40 PM, Tolga Dalman wrote:
> Michael,
> 
> given the approach is accepted by Carlos and Roland, I have
> some minor textual suggestions for the patch itself.
> 
> On 06/26/2015 10:05 PM, Michael Kerrisk (man-pages) wrote:
>> --- a/man2/sched_setaffinity.2
>> +++ b/man2/sched_setaffinity.2
>> @@ -223,6 +223,47 @@ system call returns the size (in bytes) of the
>>  .I cpumask_t
>>  data type that is used internally by the kernel to
>>  represent the CPU set bit mask.
>> +.SS Handling systems with more than 1024 CPUs
> 
> What if the system has exactly 1024 CPUs ?
> Suggestion: systems with 1024 or more CPUs

I think you've missed something here. CPUs are numbered starting at 0.
"more than 1024 CPUs" is correct here, I belive.

> 
>> +The
>> +.I cpu_set_t
>> +data type used by glibc has a fixed size of 128 bytes,
>> +meaning that the maximum CPU number that can be represented is 1023.
>> +.\" FIXME . See https://sourceware.org/bugzilla/show_bug.cgi?id=15630
>> +.\" and https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html
> 
> No objection, although I have never really noticed external references
> in man-pages (esp. web refs). Shouldn't these be generally avoided ?
> (and yes, I have noticed the FIXME)

Those pieces are comments in the page source (not rendered by man(1)).

>> +If the system has more than 1024 CPUs, then calls of the form:
> 
> 1024 or more CPUs.

See above

>> +
>> +    sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
>> +
>> +will fail with the error
>> +.BR EINVAL ,
>> +the error produced by the underlying system call for the case where the
>> +.I mask
>> +size specified in
>> +.I cpusetsize
>> +is smaller than the size of the affinity mask used by the kernel.
>> +.PP
>> +The underlying system calls (which represent CPU masks as bit masks of type
>> +.IR "unsigned long\ *" )
>> +impose no restriction on the size of the mask.
>> +To handle systems with more than 1024 CPUs, one must dynamically allocate the
>> +.I mask
>> +argument using
>> +.BR CPU_ALLOC (3)
> 
> I would rewrite the sentence to avoid "one must".

This is a "voice" thing. I personally find "one must" is okay.

>> +and manipulate the mask using the "_S" macros described in
> 
> and manipulate the macros ending with "_S" as described in

I think you've misread the text. I think it's okay.

>> +.BR CPU_ALLOC (3).
>> +Using an allocation based on the number of online CPUs:
>> +
>> +    cpu_set_t *mask = CPU_ALLOC(CPU_ALLOC_SIZE(
>> +                                sysconf(_SC_NPROCESSORS_CONF)));
>> +
>> +is probably sufficient, although the value returned by the
>> +.BR sysconf ()
>> +call can in theory change during the lifetime of the process.
>> +Alternatively, one can obtain a value that is guaranteed to be stable for
> 
> Like above, I would replace "one can obtain a value" by "a value can be obtained".

See above.

>> +the lifetime of the process by proby for the size of the required mask using
> 
> s/proby/probing/.

Thanks--I'd already spotted that one and fixed it.

>> +.BR sched_getaffinity ()
>> +calls with increasing mask sizes until the call does not fail with the error
>> +.BR EINVAL .
> 
> I would replace "until the call does not fail with error ..." by "while the call succeeds".

I think you've misunderstood the logic here... Take another look at the sentence.

Thanks,

Michael
Florian Weimer July 22, 2015, 4:02 p.m. UTC | #5
On 07/21/2015 05:03 PM, Michael Kerrisk (man-pages) wrote:
> Hello Florian,
> 
> Thanks for your comments, and sorry for the delayed follow-up.
> 
> On 07/01/2015 02:37 PM, Florian Weimer wrote:
>> On 06/26/2015 10:05 PM, Michael Kerrisk (man-pages) wrote:
>>
>>> +.SS Handling systems with more than 1024 CPUs
>>> +The
>>> +.I cpu_set_t
>>> +data type used by glibc has a fixed size of 128 bytes,
>>> +meaning that the maximum CPU number that can be represented is 1023.
>>> +.\" FIXME . See https://sourceware.org/bugzilla/show_bug.cgi?id=15630
>>> +.\" and https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html
>>> +If the system has more than 1024 CPUs, then calls of the form:
>>> +
>>> +    sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
>>> +
>>> +will fail with the error
>>> +.BR EINVAL ,
>>> +the error produced by the underlying system call for the case where the
>>> +.I mask
>>> +size specified in
>>> +.I cpusetsize
>>> +is smaller than the size of the affinity mask used by the kernel.
>>
>> I think it is best to leave this as unspecified as possible.  Kernel
>> behavior already changed once, and I can imagine it changing again.
> 
> Hmmm. Something needs to be said about what the kernel is doing though.
> Otherwise, it's hard to make sense of this subsection. Did you have a
> suggested rewording that removes the piece you find problematic?

What about this?

“If the kernel affinity mask is larger than 1024 then
…
is smaller than the size of the affinity mask used by the kernel.
Depending on the system CPU topology, the kernel affinity mask can
be substantially larger than the number of active CPUs in the system.
”

I.e., make clear that the size of the mask can be quite different from
the CPU count.

>    Handling systems with more than 1024 CPUs
>        The  underlying  system calls (which represent CPU masks as bit
>        masks of type unsigned long *) impose  no  restriction  on  the
>        size of the CPU mask.  However, the cpu_set_t data type used by
>        glibc has a fixed size of 128 bytes, meaning that  the  maximum
>        CPU  number that can be represented is 1023.  If the system has
>        more than 1024 CPUs, then calls of the form:
> 
>            sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
> 
>        will fail with the error EINVAL,  the  error  produced  by  the
>        underlying  system call for the case where the mask size speci‐
>        fied in cpusetsize is smaller than the  size  of  the  affinity
>        mask used by the kernel.
> 
>        When  working  on  systems  with  more than 1024 CPUs, one must
>        dynamically allocate the mask argument.   Currently,  the  only
>        way  to do this is by probing for the size of the required mask
>        using sched_getaffinity()  calls  with  increasing  mask  sizes
>        (until the call does not fail with the error EINVAL).
> 
> Better?

“more than 1024 CPUs” should be “large [kernel CPU] affinity masks”
throughout.
Michael Kerrisk (man-pages) July 22, 2015, 4:43 p.m. UTC | #6
Hello Florian,

On 22 July 2015 at 18:02, Florian Weimer <fweimer@redhat.com> wrote:
> On 07/21/2015 05:03 PM, Michael Kerrisk (man-pages) wrote:
>> Hello Florian,
>>
>> Thanks for your comments, and sorry for the delayed follow-up.
>>
>> On 07/01/2015 02:37 PM, Florian Weimer wrote:
>>> On 06/26/2015 10:05 PM, Michael Kerrisk (man-pages) wrote:
>>>
>>>> +.SS Handling systems with more than 1024 CPUs
>>>> +The
>>>> +.I cpu_set_t
>>>> +data type used by glibc has a fixed size of 128 bytes,
>>>> +meaning that the maximum CPU number that can be represented is 1023.
>>>> +.\" FIXME . See https://sourceware.org/bugzilla/show_bug.cgi?id=15630
>>>> +.\" and https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html
>>>> +If the system has more than 1024 CPUs, then calls of the form:
>>>> +
>>>> +    sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
>>>> +
>>>> +will fail with the error
>>>> +.BR EINVAL ,
>>>> +the error produced by the underlying system call for the case where the
>>>> +.I mask
>>>> +size specified in
>>>> +.I cpusetsize
>>>> +is smaller than the size of the affinity mask used by the kernel.
>>>
>>> I think it is best to leave this as unspecified as possible.  Kernel
>>> behavior already changed once, and I can imagine it changing again.
>>
>> Hmmm. Something needs to be said about what the kernel is doing though.
>> Otherwise, it's hard to make sense of this subsection. Did you have a
>> suggested rewording that removes the piece you find problematic?
>
> What about this?
>
> “If the kernel affinity mask is larger than 1024 then
> …
> is smaller than the size of the affinity mask used by the kernel.
> Depending on the system CPU topology, the kernel affinity mask can
> be substantially larger than the number of active CPUs in the system.
> ”

Looks good. I've taken that.

> I.e., make clear that the size of the mask can be quite different from
> the CPU count.
>
>>    Handling systems with more than 1024 CPUs
>>        The  underlying  system calls (which represent CPU masks as bit
>>        masks of type unsigned long *) impose  no  restriction  on  the
>>        size of the CPU mask.  However, the cpu_set_t data type used by
>>        glibc has a fixed size of 128 bytes, meaning that  the  maximum
>>        CPU  number that can be represented is 1023.  If the system has
>>        more than 1024 CPUs, then calls of the form:
>>
>>            sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
>>
>>        will fail with the error EINVAL,  the  error  produced  by  the
>>        underlying  system call for the case where the mask size speci‐
>>        fied in cpusetsize is smaller than the  size  of  the  affinity
>>        mask used by the kernel.
>>
>>        When  working  on  systems  with  more than 1024 CPUs, one must
>>        dynamically allocate the mask argument.   Currently,  the  only
>>        way  to do this is by probing for the size of the required mask
>>        using sched_getaffinity()  calls  with  increasing  mask  sizes
>>        (until the call does not fail with the error EINVAL).
>>
>> Better?
>
> “more than 1024 CPUs” should be “large [kernel CPU] affinity masks”
> throughout.

Done.

Thanks for your further input. So now we have:

   C library/kernel differences
       This manual page describes the glibc interface for the CPU affin‐
       ity calls.  The actual system call interface is slightly  differ‐
       ent, with the mask being typed as unsigned long *, reflecting the
       fact that the underlying implementation of CPU sets is  a  simple
       bit  mask.   On  success, the raw sched_getaffinity() system call
       returns the size (in bytes) of the cpumask_t data  type  that  is
       used internally by the kernel to represent the CPU set bit mask.

   Handling systems with large CPU affinity masks
       The  underlying  system  calls  (which represent CPU masks as bit
       masks of type unsigned long *) impose no restriction on the  size
       of  the CPU mask.  However, the cpu_set_t data type used by glibc
       has a fixed size of 128 bytes, meaning that the maximum CPU  num‐
       ber  that can be represented is 1023.  If the kernel CPU affinity
       mask is larger than 1024, then calls of the form:

           sched_getaffinity(pid, sizeof(cpu_set_t), &mask);

       will fail with the error EINVAL, the error produced by the under‐
       lying  system  call for the case where the mask size specified in
       cpusetsize is smaller than the size of the affinity mask used  by
       the  kernel.   (Depending  on the system CPU topology, the kernel
       affinity mask can be substantially  larger  than  the  number  of
       active CPUs in the system.)

       When working on systems with large kernel CPU affinity masks, one
       must dynamically allocate the mask argument.  Currently, the only
       way  to  do  this is by probing for the size of the required mask
       using sched_getaffinity() calls with increasing mask sizes (until
       the call does not fail with the error EINVAL).
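
To make the probing idea concrete, a minimal sketch follows (my illustration,
not part of the proposed page text; the starting size and the upper bound are
arbitrary assumptions):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sched.h>
    #include <stddef.h>
    #include <sys/types.h>

    /* Retry sched_getaffinity() with a growing, dynamically allocated
       mask until the call stops failing with EINVAL. */
    static cpu_set_t *
    alloc_affinity(pid_t pid, size_t *sizep)
    {
        for (int n = 1024; n <= 1024 * 1024; n *= 2) {
            cpu_set_t *set = CPU_ALLOC(n);
            size_t size = CPU_ALLOC_SIZE(n);

            if (set == NULL)
                return NULL;
            if (sched_getaffinity(pid, size, set) == 0) {
                *sizep = size;
                return set;             /* caller must CPU_FREE(set) */
            }
            CPU_FREE(set);
            if (errno != EINVAL)        /* some other error: give up */
                return NULL;
            /* EINVAL: mask smaller than the kernel mask; try a larger one */
        }
        return NULL;
    }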

Cheers,

Michael

Patch

--- a/man2/sched_setaffinity.2
+++ b/man2/sched_setaffinity.2
@@ -223,6 +223,47 @@  system call returns the size (in bytes) of the
 .I cpumask_t
 data type that is used internally by the kernel to
 represent the CPU set bit mask.
+.SS Handling systems with more than 1024 CPUs
+The
+.I cpu_set_t
+data type used by glibc has a fixed size of 128 bytes,
+meaning that the maximum CPU number that can be represented is 1023.
+.\" FIXME . See https://sourceware.org/bugzilla/show_bug.cgi?id=15630
+.\" and https://sourceware.org/ml/libc-alpha/2013-07/msg00288.html
+If the system has more than 1024 CPUs, then calls of the form:
+
+    sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
+
+will fail with the error
+.BR EINVAL ,
+the error produced by the underlying system call for the case where the
+.I mask
+size specified in
+.I cpusetsize
+is smaller than the size of the affinity mask used by the kernel.
+.PP
+The underlying system calls (which represent CPU masks as bit masks of type
+.IR "unsigned long\ *" )
+impose no restriction on the size of the mask.
+To handle systems with more than 1024 CPUs, one must dynamically allocate the
+.I mask
+argument using
+.BR CPU_ALLOC (3)
+and manipulate the mask using the "_S" macros described in
+.BR CPU_ALLOC (3).
+Using an allocation based on the number of online CPUs:
+
+    cpu_set_t *mask = CPU_ALLOC(CPU_ALLOC_SIZE(
+                                sysconf(_SC_NPROCESSORS_CONF)));
+
+is probably sufficient, although the value returned by the
+.BR sysconf ()
+call can in theory change during the lifetime of the process.
+Alternatively, one can obtain a value that is guaranteed to be stable for
+the lifetime of the process by proby for the size of the required mask using
+.BR sched_getaffinity ()
+calls with increasing mask sizes until the call does not fail with the error
+.BR EINVAL .
 .SH EXAMPLE
 The program below creates a child process.
 The parent and child then each assign themselves to a specified CPU