diff mbox

[ovs-dev,v9] dpif-netdev: Assign ports to pmds on non-local numa node.

Message ID 1498731416-30374-1-git-send-email-billy.o.mahony@intel.com
State Superseded
Delegated to: Darrell Ball
Headers show

Commit Message

Billy O'Mahony June 29, 2017, 10:16 a.m. UTC
Previously, if there was no available (non-isolated) pmd on the numa node
for a port, the port was not polled at all. This could result in a
non-operational system until such time as the nics were physically
repositioned. It is preferable to operate with a pmd on the 'wrong' numa
node, albeit with lower performance. Local pmds are still chosen when
available.

Signed-off-by: Billy O'Mahony <billy.o.mahony@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
---
v9: v8 missed some comments on v7
v8: Some coding style issues; doc tweak
v7: Incorporate review comments on docs and implementation
v6: Change 'port' to 'queue' in a warning msg
v5: Fix warning msg; Update same in docs
v4: Fix a checkpatch error
v3: Fix warning messages not appearing when using multiqueue
v2: Add details of warning messages into docs

 Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
 lib/dpif-netdev.c                    | 41 +++++++++++++++++++++++++++++++++---
 2 files changed, 56 insertions(+), 6 deletions(-)
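
For readability, here is the core of the change consolidated from the diff
quoted in the reviews below (long quoted lines re-joined). This is only an
illustrative sketch of the v9 hunks, not the final committed code; it assumes
the surrounding dpif-netdev context (struct rr_numa, rr_numa_get_pmd(), the
hmap helpers, and the rxq_scheduling() loop variables 'port', 'q', 'qid',
'numa', 'numa_id' and 'non_local_numa').

    /* Returns next NUMA from rr list in round-robin fashion. Returns the first
     * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is
     * empty. */
    static struct rr_numa *
    rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
    {
        struct hmap_node *node = NULL;

        if (numa) {
            node = hmap_next(&rr->numas, &numa->node);
        }
        if (!node) {
            node = hmap_first(&rr->numas);
        }

        return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
    }

    /* Excerpt from the per-rxq loop in rxq_scheduling(), for queues that are
     * not pinned to a specific core: */
            } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
                if (!numa) {
                    /* There are no pmds on the queue's local NUMA node.
                       Round-robin on the NUMA nodes that do have pmds. */
                    non_local_numa = rr_numa_list_next(&rr, non_local_numa);
                    if (!non_local_numa) {
                        VLOG_ERR("There is no available (non-isolated) pmd "
                                 "thread for port \'%s\' queue %d. This queue "
                                 "will not be polled. Is pmd-cpu-mask set to "
                                 "zero? Or are all PMDs isolated to other "
                                 "queues?", netdev_get_name(port->netdev),
                                 qid);
                        continue;
                    }
                    q->pmd = rr_numa_get_pmd(non_local_numa);
                    VLOG_WARN("There's no available (non-isolated) pmd thread "
                              "on numa node %d. Queue %d on port \'%s\' will "
                              "be assigned to the pmd on core %d "
                              "(numa node %d). Expect reduced performance.",
                              numa_id, qid, netdev_get_name(port->netdev),
                              q->pmd->core_id, q->pmd->numa_id);
                } else {
                    /* Assign queue to the next (round-robin) PMD on its local
                       NUMA node. */
                    q->pmd = rr_numa_get_pmd(numa);
                }
            }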

Comments

Stokes, Ian July 8, 2017, 7:09 p.m. UTC | #1
> Previously if there is no available (non-isolated) pmd on the numa node
> for a port then the port is not polled at all. This can result in a non-
> operational system until such time as nics are physically repositioned. It
> is preferable to operate with a pmd on the 'wrong' numa node albeit with
> lower performance. Local pmds are still chosen when available.
> 
> Signed-off-by: Billy O'Mahony <billy.o.mahony@intel.com>
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
> ---
> v9: v8 missed some comments on v7
> v8: Some coding style issues; doc tweak
> v7: Incorporate review comments on docs and implementation
> v6: Change 'port' to 'queue' in a warning msg
> v5: Fix warning msg; Update same in docs
> v4: Fix a checkpatch error
> v3: Fix warning messages not appearing when using multiqueue
> v2: Add details of warning messages into docs
> 
>  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
>  lib/dpif-netdev.c                    | 41
> +++++++++++++++++++++++++++++++++---
>  2 files changed, 56 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/intro/install/dpdk.rst
> b/Documentation/intro/install/dpdk.rst
> index e83f852..89775d6 100644
> --- a/Documentation/intro/install/dpdk.rst
> +++ b/Documentation/intro/install/dpdk.rst
> @@ -449,7 +449,7 @@ affinitized accordingly.
> 
>    A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
>    assigned to it. A pmd thread shall poll the ports for incoming packets,
> -  switch the packets and send to tx port.  pmd thread is CPU bound, and
> needs
> +  switch the packets and send to tx port.  A pmd thread is CPU bound,
> + and needs
>    to be affinitized to isolated cores for optimum performance.
> 
>    By setting a bit in the mask, a pmd thread is created and pinned to the
> @@ -458,8 +458,23 @@ affinitized accordingly.
>        $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
> 
>    .. note::
> -    pmd thread on a NUMA node is only created if there is at least one
> DPDK
> -    interface from that NUMA node added to OVS.
> +    A pmd thread on a NUMA node is only created if there is at least one
> DPDK
> +    interface from that NUMA node added to OVS.  A pmd thread is created
> by
> +    default on a core of a NUMA node or when a specified pmd-cpu-mask has
> +    indicated so.  Even though a PMD thread may exist, the thread only
> starts
> +    consuming CPU cycles if there is least one receive queue assigned to
> +    the pmd.
> +
> +  .. note::
> +    On NUMA systems PCI devices are also local to a NUMA node.  Unbound
> rx
> +    queues for a PCI device will assigned to a pmd on it's local NUMA

Minor point but should read 'will be assigned'
> node if a
> +    non-isolated PMD exists on that NUMA node.  If not, the queue will be
> +    assigned to a non-isolated pmd on a remote NUMA node.  This will
> result in
> +    reduced maximum throughput on that device and possibly on other
> devices
> +    assigned to that pmd thread. In the case such, a queue assignment is
> made a
> +    warning message will be logged: "There's no available (non-isolated)
> pmd

Above should read 'In the case where such a queue assignment is made, a warning message will be logged'
> +    thread on numa node N. Queue Q on port P will be assigned to the pmd
> on
> +    core C (numa node N'). Expect reduced performance."
> 
>  - QEMU vCPU thread Affinity
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 4e29085..7557f32
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr, int
> numa_id)
>      return NULL;
>  }
> 
> +/* Returns next NUMA from rr list in round-robin fashion. Returns the
> +first
> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is
> +empty. */ static struct rr_numa * rr_numa_list_next(struct rr_numa_list
> +*rr, const struct rr_numa *numa) {

The comment above can be tidied up a little to better clarify the behavior of this function.
I ended up reading the comments for hmap_next() and hmap_first() before it made sense, and even then it's a bit ambiguous; it ends up being the code that explains the comments.

You could clarify the following 2 statements:

(1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I assume you mean the function parameter 'const struct rr_numa *numa' but it's not clear on first reading.

(2) " or the last node passed" - again this makes sense only when you look into the behavior of the call 'hmap_next(&rr->numas, &numa->node)'.

You could say something like:

"Attempt to return the next NUMA from a numa list in a round robin fashion. Return the first NUMA node if the struct rr_numa *numa argument passed to the function is NULL or if the numa node argument passed to hmap_next is already the last node. Return NULL if the numa list is empty."

> +    struct hmap_node *node = NULL;
> +
> +    if (numa) {
> +        node = hmap_next(&rr->numas, &numa->node);
> +    }
> +    if (!node) {
> +        node = hmap_first(&rr->numas);
> +    }
> +
> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL; }
> +
>  static void
>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)  {
> @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned)
> OVS_REQUIRES(dp->port_mutex)  {
>      struct dp_netdev_port *port;
>      struct rr_numa_list rr;
> +    struct rr_numa *non_local_numa = NULL;
> 
>      rr_numa_list_populate(dp, &rr);
> 
> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned)
> OVS_REQUIRES(dp->port_mutex)
>                  }
>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>                  if (!numa) {
> -                    VLOG_WARN("There's no available (non isolated) pmd
> thread "
> +                    /* There are no pmds on the queue's local NUMA node.
> +                       Round-robin on the NUMA nodes that do have pmds.
> */
> +                    non_local_numa = rr_numa_list_next(&rr,
> non_local_numa);
> +                    if (!non_local_numa) {
> +                        VLOG_ERR("There is no available (non-isolated)
> pmd "
> +                                 "thread for port \'%s\' queue %d. This
> queue "
> +                                 "will not be polled. Is pmd-cpu-mask set
> to "
> +                                 "zero? Or are all PMDs isolated to other
> "
> +                                 "queues?", netdev_get_name(port-
> >netdev),
> +                                 qid);
> +                        continue;
> +                    }
> +                    q->pmd = rr_numa_get_pmd(non_local_numa);
> +                    VLOG_WARN("There's no available (non-isolated) pmd
> thread "
>                                "on numa node %d. Queue %d on port \'%s\'
> will "
> -                              "not be polled.",
> -                              numa_id, qid, netdev_get_name(port-
> >netdev));
> +                              "be assigned to the pmd on core %d "
> +                              "(numa node %d). Expect reduced
> performance.",
> +                              numa_id, qid, netdev_get_name(port-
> >netdev),
> +                              q->pmd->core_id, q->pmd->numa_id);
>                  } else {
> +                    /* Assign queue to the next (round-robin) PMD on it's
> local
> +                       NUMA node. */
>                      q->pmd = rr_numa_get_pmd(numa);
>                  }
>              }
> --
> 2.7.4
This tested fine for me: I tested with multiple rxqs distributed and isolated over pmds on 2 different numa nodes with varying pmd masks. It also passed sanity checks (clang, sparse compilation, etc.).

You can add the Tested-by tag for me, but I'd like to see the changes for the documentation and function comments above before acking.

Tested-by: Ian Stokes <ian.stokes@intel.com>
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Darrell Ball July 8, 2017, 8:43 p.m. UTC | #2
This is one of the patches I have looked into today or will be looking into.
Once the comment improvements Ian has requested are made, I'll fold it into an ovs-dpdk merge repo.

On 7/8/17, 12:09 PM, "Stokes, Ian" <ian.stokes@intel.com> wrote:

    [snip]

    > +    assigned to that pmd thread. In the case such, a queue assignment is
    > made a
    > +    warning message will be logged: "There's no available (non-isolated)
    > pmd

    Above should read 'In the case where such a queue assignment is made, a warning message will be logged'

There was a similar suggestion that got incorporated into v7 and then accidentally reverted in v9.


Ilya Maximets July 10, 2017, 8:11 a.m. UTC | #3
On 08.07.2017 22:09, Stokes, Ian wrote:
>> Previously if there is no available (non-isolated) pmd on the numa node
>> for a port then the port is not polled at all. This can result in a non-
>> operational system until such time as nics are physically repositioned. It
>> is preferable to operate with a pmd on the 'wrong' numa node albeit with
>> lower performance. Local pmds are still chosen when available.
>>
>> Signed-off-by: Billy O'Mahony <billy.o.mahony@intel.com>
>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>> Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
>> ---
>> v9: v8 missed some comments on v7
>> v8: Some coding style issues; doc tweak
>> v7: Incorporate review comments on docs and implementation
>> v6: Change 'port' to 'queue' in a warning msg
>> v5: Fix warning msg; Update same in docs
>> v4: Fix a checkpatch error
>> v3: Fix warning messages not appearing when using multiqueue
>> v2: Add details of warning messages into docs
>>
>>  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
>>  lib/dpif-netdev.c                    | 41
>> +++++++++++++++++++++++++++++++++---
>>  2 files changed, 56 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/intro/install/dpdk.rst
>> b/Documentation/intro/install/dpdk.rst
>> index e83f852..89775d6 100644
>> --- a/Documentation/intro/install/dpdk.rst
>> +++ b/Documentation/intro/install/dpdk.rst
>> @@ -449,7 +449,7 @@ affinitized accordingly.
>>
>>    A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
>>    assigned to it. A pmd thread shall poll the ports for incoming packets,
>> -  switch the packets and send to tx port.  pmd thread is CPU bound, and
>> needs
>> +  switch the packets and send to tx port.  A pmd thread is CPU bound,
>> + and needs
>>    to be affinitized to isolated cores for optimum performance.
>>
>>    By setting a bit in the mask, a pmd thread is created and pinned to the
>> @@ -458,8 +458,23 @@ affinitized accordingly.
>>        $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
>>
>>    .. note::
>> -    pmd thread on a NUMA node is only created if there is at least one
>> DPDK
>> -    interface from that NUMA node added to OVS.
>> +    A pmd thread on a NUMA node is only created if there is at least one
>> DPDK
>> +    interface from that NUMA node added to OVS.  A pmd thread is created
>> by
>> +    default on a core of a NUMA node or when a specified pmd-cpu-mask has
>> +    indicated so.  Even though a PMD thread may exist, the thread only
>> starts
>> +    consuming CPU cycles if there is least one receive queue assigned to
>> +    the pmd.
>> +
>> +  .. note::
>> +    On NUMA systems PCI devices are also local to a NUMA node.  Unbound
>> rx
>> +    queues for a PCI device will assigned to a pmd on it's local NUMA
> 
> Minor point but should read 'will be assigned'
>> node if a
>> +    non-isolated PMD exists on that NUMA node.  If not, the queue will be
>> +    assigned to a non-isolated pmd on a remote NUMA node.  This will
>> result in
>> +    reduced maximum throughput on that device and possibly on other
>> devices
>> +    assigned to that pmd thread. In the case such, a queue assignment is
>> made a
>> +    warning message will be logged: "There's no available (non-isolated)
>> pmd
> 
> Above should read 'In the case where such a queue assignment is made, a warning message will be logged'
>> +    thread on numa node N. Queue Q on port P will be assigned to the pmd
>> on
>> +    core C (numa node N'). Expect reduced performance."
>>
>>  - QEMU vCPU thread Affinity
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 4e29085..7557f32
>> 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr, int
>> numa_id)
>>      return NULL;
>>  }
>>
>> +/* Returns next NUMA from rr list in round-robin fashion. Returns the
>> +first
>> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is
>> +empty. */ static struct rr_numa * rr_numa_list_next(struct rr_numa_list
>> +*rr, const struct rr_numa *numa) {
> 
> The comment above can be tidied up a little to better clarify the behavior of this function.
> I ended up reading the comments for hmap_next() and hmap_first() before it made sense, and even then it's a bit ambiguous, it ends up being the code that explains the comments.
> 
> You could clarify the following 2 statements:
> 
> (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I assume you mean the function parameter 'const struct rr_numa *numa' but it's not clear on first reading.
> 
> (2) " or the last node passed" - again this makes sense only when you look into the behavior of the call 'hmap_next(&rr->numas, &numa->node)'.
> 
> You could say something like:
> 
> "Attempt to return the next NUMA from a numa list in a round robin fashion. Return the first NUMA node if the struct rr_numa *numa argument passed to the function is NULL or if the numa node argument passed to hmap_next is already the last node. Return NULL if the numa list is empty."

I'm not sure that referencing the implementation is a good way to
write comments (I mean the 'passed to hmap_next' part).

How about this:
"""
/* Returns the next node in numa list following 'numa' in round-robin fashion.
 * Returns first node if 'numa' is a null pointer or the last node in 'rr'. */
"""

or

"""
/* The next node in numa list following 'numa' in round-robin fashion.
 * Returns:
 *     - 'NULL' if 'rr' is an empty numa list.
 *     - First node in 'rr' if 'numa' is a null pointer.
 *     - First node in 'rr' if 'numa' is the last node in 'rr'.
 *     - Otherwise, the next node in numa list following 'numa'. */
"""

?

> 
>> +    struct hmap_node *node = NULL;
>> +
>> +    if (numa) {
>> +        node = hmap_next(&rr->numas, &numa->node);
>> +    }
>> +    if (!node) {
>> +        node = hmap_first(&rr->numas);
>> +    }
>> +
>> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL; }
>> +
>>  static void
>>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)  {
>> @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned)
>> OVS_REQUIRES(dp->port_mutex)  {
>>      struct dp_netdev_port *port;
>>      struct rr_numa_list rr;
>> +    struct rr_numa *non_local_numa = NULL;
>>
>>      rr_numa_list_populate(dp, &rr);
>>
>> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned)
>> OVS_REQUIRES(dp->port_mutex)
>>                  }
>>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>>                  if (!numa) {
>> -                    VLOG_WARN("There's no available (non isolated) pmd
>> thread "
>> +                    /* There are no pmds on the queue's local NUMA node.
>> +                       Round-robin on the NUMA nodes that do have pmds.
>> */
>> +                    non_local_numa = rr_numa_list_next(&rr,
>> non_local_numa);
>> +                    if (!non_local_numa) {
>> +                        VLOG_ERR("There is no available (non-isolated)
>> pmd "
>> +                                 "thread for port \'%s\' queue %d. This
>> queue "
>> +                                 "will not be polled. Is pmd-cpu-mask set
>> to "
>> +                                 "zero? Or are all PMDs isolated to other
>> "
>> +                                 "queues?", netdev_get_name(port-
>>> netdev),
>> +                                 qid);
>> +                        continue;
>> +                    }
>> +                    q->pmd = rr_numa_get_pmd(non_local_numa);
>> +                    VLOG_WARN("There's no available (non-isolated) pmd
>> thread "
>>                                "on numa node %d. Queue %d on port \'%s\'
>> will "
>> -                              "not be polled.",
>> -                              numa_id, qid, netdev_get_name(port-
>>> netdev));
>> +                              "be assigned to the pmd on core %d "
>> +                              "(numa node %d). Expect reduced
>> performance.",
>> +                              numa_id, qid, netdev_get_name(port-
>>> netdev),
>> +                              q->pmd->core_id, q->pmd->numa_id);
>>                  } else {
>> +                    /* Assign queue to the next (round-robin) PMD on it's
>> local
>> +                       NUMA node. */
>>                      q->pmd = rr_numa_get_pmd(numa);
>>                  }
>>              }
>> --
>> 2.7.4
> This tested fine for me, tested with multiple rxqs distributed and isolated over pmds on 2 different numa nodes with varying pmd masks. Also passed sanity checks (clang, sparse compilation etc.).
> 
> You can add the tested by tag for me but I'd like to see the changes for the documentation and function comments above before acking.
> 
> Tested-by: Ian Stokes <ian.stokes@intel.com>
>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 
> 
>
Stokes, Ian July 10, 2017, 9:41 a.m. UTC | #4
> On 08.07.2017 22:09, Stokes, Ian wrote:
> >> Previously if there is no available (non-isolated) pmd on the numa
> >> node for a port then the port is not polled at all. This can result
> >> in a non- operational system until such time as nics are physically
> >> repositioned. It is preferable to operate with a pmd on the 'wrong'
> >> numa node albeit with lower performance. Local pmds are still chosen
> when available.
> >>
> >> Signed-off-by: Billy O'Mahony <billy.o.mahony@intel.com>
> >> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >> Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
> >> ---
> >> v9: v8 missed some comments on v7
> >> v8: Some coding style issues; doc tweak
> >> v7: Incorporate review comments on docs and implementation
> >> v6: Change 'port' to 'queue' in a warning msg
> >> v5: Fix warning msg; Update same in docs
> >> v4: Fix a checkpatch error
> >> v3: Fix warning messages not appearing when using multiqueue
> >> v2: Add details of warning messages into docs
> >>
> >>  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
> >>  lib/dpif-netdev.c                    | 41
> >> +++++++++++++++++++++++++++++++++---
> >>  2 files changed, 56 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/Documentation/intro/install/dpdk.rst
> >> b/Documentation/intro/install/dpdk.rst
> >> index e83f852..89775d6 100644
> >> --- a/Documentation/intro/install/dpdk.rst
> >> +++ b/Documentation/intro/install/dpdk.rst
> >> @@ -449,7 +449,7 @@ affinitized accordingly.
> >>
> >>    A poll mode driver (pmd) thread handles the I/O of all DPDK
> interfaces
> >>    assigned to it. A pmd thread shall poll the ports for incoming
> >> packets,
> >> -  switch the packets and send to tx port.  pmd thread is CPU bound,
> >> and needs
> >> +  switch the packets and send to tx port.  A pmd thread is CPU
> >> + bound, and needs
> >>    to be affinitized to isolated cores for optimum performance.
> >>
> >>    By setting a bit in the mask, a pmd thread is created and pinned
> >> to the @@ -458,8 +458,23 @@ affinitized accordingly.
> >>        $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
> >>
> >>    .. note::
> >> -    pmd thread on a NUMA node is only created if there is at least one
> >> DPDK
> >> -    interface from that NUMA node added to OVS.
> >> +    A pmd thread on a NUMA node is only created if there is at least
> >> + one
> >> DPDK
> >> +    interface from that NUMA node added to OVS.  A pmd thread is
> >> + created
> >> by
> >> +    default on a core of a NUMA node or when a specified pmd-cpu-mask
> has
> >> +    indicated so.  Even though a PMD thread may exist, the thread
> >> + only
> >> starts
> >> +    consuming CPU cycles if there is least one receive queue assigned
> to
> >> +    the pmd.
> >> +
> >> +  .. note::
> >> +    On NUMA systems PCI devices are also local to a NUMA node.
> >> + Unbound
> >> rx
> >> +    queues for a PCI device will assigned to a pmd on it's local
> >> + NUMA
> >
> > Minor point but should read 'will be assigned'
> >> node if a
> >> +    non-isolated PMD exists on that NUMA node.  If not, the queue will
> be
> >> +    assigned to a non-isolated pmd on a remote NUMA node.  This will
> >> result in
> >> +    reduced maximum throughput on that device and possibly on other
> >> devices
> >> +    assigned to that pmd thread. In the case such, a queue
> >> + assignment is
> >> made a
> >> +    warning message will be logged: "There's no available
> >> + (non-isolated)
> >> pmd
> >
> > Above should read 'In the case where such a queue assignment is made, a
> warning message will be logged'
> >> +    thread on numa node N. Queue Q on port P will be assigned to the
> >> + pmd
> >> on
> >> +    core C (numa node N'). Expect reduced performance."
> >>
> >>  - QEMU vCPU thread Affinity
> >>
> >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> >> 4e29085..7557f32
> >> 100644
> >> --- a/lib/dpif-netdev.c
> >> +++ b/lib/dpif-netdev.c
> >> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr,
> >> int
> >> numa_id)
> >>      return NULL;
> >>  }
> >>
> >> +/* Returns next NUMA from rr list in round-robin fashion. Returns
> >> +the first
> >> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list
> >> +is empty. */ static struct rr_numa * rr_numa_list_next(struct
> >> +rr_numa_list *rr, const struct rr_numa *numa) {
> >
> > The comment above can be tidied up a little to better clarify the
> behavior of this function.
> > I ended up reading the comments for hmap_next() and hmap_first() before
> it made sense, and even then it's a bit ambiguous, it ends up being the
> code that explains the comments.
> >
> > You could clarify the following 2 statements:
> >
> > (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I assume
> you mean the function parameter 'const struct rr_numa *numa' but it's not
> clear on first reading.
> >
> > (2) " or the last node passed" - again this makes sense only when you
> look into the behavior of the call 'hmap_next(&rr->numas, &numa->node)'.
> >
> > You could say something like:
> >
> > "Attempt to return the next NUMA from a numa list in a round robin
> fashion. Return the first NUMA node if the struct rr_numa *numa argument
> passed to the function is NULL or if the numa node argument passed to
> hmap_next is already the last node. Return NULL if the numa list is
> empty."
> 
> I'm not sure that references to implementation is a good way to write
> comments (I mean 'passed to hmap_next' part).
> 
> How about this:
> """
> /* Returns the next node in numa list following 'numa' in round-robin
> fashion.
>  * Returns first node if 'numa' is a null pointer or the last node in
> 'rr'. */ """
> 
> or
> 
> """
> /* The next node in numa list following 'numa' in round-robin fashion.
>  * Returns:
>  *     - 'NULL' if 'rr' is an empty numa list.
>  *     - First node in 'rr' if 'numa' is a null pointer.
>  *     - First node in 'rr' if 'numa' is the last node in 'rr'.
>  *     - Otherwise, the next node in numa list following 'numa'. */
> """
> 
> ?

I'm happy with the first option you provided above. Could you append a line about returning NULL if the list is empty? Then I think we're good.

/* Returns the next node in numa list following 'numa' in round-robin fashion.
 * Returns first node if 'numa' is a null pointer or the last node in 'rr'. 
 * Returns NULL if 'rr' numa list is empty. */

Thanks 
Ian
> 
> >
> >> +    struct hmap_node *node = NULL;
> >> +
> >> +    if (numa) {
> >> +        node = hmap_next(&rr->numas, &numa->node);
> >> +    }
> >> +    if (!node) {
> >> +        node = hmap_first(&rr->numas);
> >> +    }
> >> +
> >> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
> >> + }
> >> +
> >>  static void
> >>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
> >> { @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp, bool
> >> pinned)
> >> OVS_REQUIRES(dp->port_mutex)  {
> >>      struct dp_netdev_port *port;
> >>      struct rr_numa_list rr;
> >> +    struct rr_numa *non_local_numa = NULL;
> >>
> >>      rr_numa_list_populate(dp, &rr);
> >>
> >> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool
> >> pinned)
> >> OVS_REQUIRES(dp->port_mutex)
> >>                  }
> >>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
> >>                  if (!numa) {
> >> -                    VLOG_WARN("There's no available (non isolated) pmd
> >> thread "
> >> +                    /* There are no pmds on the queue's local NUMA
> node.
> >> +                       Round-robin on the NUMA nodes that do have
> pmds.
> >> */
> >> +                    non_local_numa = rr_numa_list_next(&rr,
> >> non_local_numa);
> >> +                    if (!non_local_numa) {
> >> +                        VLOG_ERR("There is no available
> >> + (non-isolated)
> >> pmd "
> >> +                                 "thread for port \'%s\' queue %d.
> >> + This
> >> queue "
> >> +                                 "will not be polled. Is
> >> + pmd-cpu-mask set
> >> to "
> >> +                                 "zero? Or are all PMDs isolated to
> >> + other
> >> "
> >> +                                 "queues?", netdev_get_name(port-
> >>> netdev),
> >> +                                 qid);
> >> +                        continue;
> >> +                    }
> >> +                    q->pmd = rr_numa_get_pmd(non_local_numa);
> >> +                    VLOG_WARN("There's no available (non-isolated)
> >> + pmd
> >> thread "
> >>                                "on numa node %d. Queue %d on port
> \'%s\'
> >> will "
> >> -                              "not be polled.",
> >> -                              numa_id, qid, netdev_get_name(port-
> >>> netdev));
> >> +                              "be assigned to the pmd on core %d "
> >> +                              "(numa node %d). Expect reduced
> >> performance.",
> >> +                              numa_id, qid, netdev_get_name(port-
> >>> netdev),
> >> +                              q->pmd->core_id, q->pmd->numa_id);
> >>                  } else {
> >> +                    /* Assign queue to the next (round-robin) PMD on
> >> + it's
> >> local
> >> +                       NUMA node. */
> >>                      q->pmd = rr_numa_get_pmd(numa);
> >>                  }
> >>              }
> >> --
> >> 2.7.4
> > This tested fine for me, tested with multiple rxqs distributed and
> isolated over pmds on 2 different numa nodes with varying pmd masks. Also
> passed sanity checks (clang, sparse compilation etc.).
> >
> > You can add the tested by tag for me but I'd like to see the changes for
> the documentation and function comments above before acking.
> >
> > Tested-by: Ian Stokes <ian.stokes@intel.com>
> >>
> >> _______________________________________________
> >> dev mailing list
> >> dev@openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> >
> >
Billy O'Mahony July 10, 2017, 10:42 a.m. UTC | #5
> -----Original Message-----
> From: Stokes, Ian
> Sent: Monday, July 10, 2017 10:41 AM
> To: Ilya Maximets <i.maximets@samsung.com>; O Mahony, Billy
> <billy.o.mahony@intel.com>; dev@openvswitch.org
> Cc: dball@vmare.com
> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> > On 08.07.2017 22:09, Stokes, Ian wrote:
> > >> Previously if there is no available (non-isolated) pmd on the numa
> > >> node for a port then the port is not polled at all. This can result
> > >> in a non- operational system until such time as nics are physically
> > >> repositioned. It is preferable to operate with a pmd on the 'wrong'
> > >> numa node albeit with lower performance. Local pmds are still
> > >> chosen
> > when available.
> > >>
> > >> Signed-off-by: Billy O'Mahony <billy.o.mahony@intel.com>
> > >> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> > >> Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
> > >> ---
> > >> v9: v8 missed some comments on v7
> > >> v8: Some coding style issues; doc tweak
> > >> v7: Incorporate review comments on docs and implementation
> > >> v6: Change 'port' to 'queue' in a warning msg
> > >> v5: Fix warning msg; Update same in docs
> > >> v4: Fix a checkpatch error
> > >> v3: Fix warning messages not appearing when using multiqueue
> > >> v2: Add details of warning messages into docs
> > >>
> > >>  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
> > >>  lib/dpif-netdev.c                    | 41
> > >> +++++++++++++++++++++++++++++++++---
> > >>  2 files changed, 56 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/Documentation/intro/install/dpdk.rst
> > >> b/Documentation/intro/install/dpdk.rst
> > >> index e83f852..89775d6 100644
> > >> --- a/Documentation/intro/install/dpdk.rst
> > >> +++ b/Documentation/intro/install/dpdk.rst
> > >> @@ -449,7 +449,7 @@ affinitized accordingly.
> > >>
> > >>    A poll mode driver (pmd) thread handles the I/O of all DPDK
> > interfaces
> > >>    assigned to it. A pmd thread shall poll the ports for incoming
> > >> packets,
> > >> -  switch the packets and send to tx port.  pmd thread is CPU
> > >> bound, and needs
> > >> +  switch the packets and send to tx port.  A pmd thread is CPU
> > >> + bound, and needs
> > >>    to be affinitized to isolated cores for optimum performance.
> > >>
> > >>    By setting a bit in the mask, a pmd thread is created and pinned
> > >> to the @@ -458,8 +458,23 @@ affinitized accordingly.
> > >>        $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
> > >>
> > >>    .. note::
> > >> -    pmd thread on a NUMA node is only created if there is at least one
> > >> DPDK
> > >> -    interface from that NUMA node added to OVS.
> > >> +    A pmd thread on a NUMA node is only created if there is at
> > >> + least one
> > >> DPDK
> > >> +    interface from that NUMA node added to OVS.  A pmd thread is
> > >> + created
> > >> by
> > >> +    default on a core of a NUMA node or when a specified
> > >> + pmd-cpu-mask
> > has
> > >> +    indicated so.  Even though a PMD thread may exist, the thread
> > >> + only
> > >> starts
> > >> +    consuming CPU cycles if there is least one receive queue
> > >> + assigned
> > to
> > >> +    the pmd.
> > >> +
> > >> +  .. note::
> > >> +    On NUMA systems PCI devices are also local to a NUMA node.
> > >> + Unbound
> > >> rx
> > >> +    queues for a PCI device will assigned to a pmd on it's local
> > >> + NUMA
> > >
> > > Minor point but should read 'will be assigned'

[[BO'M]] 
+1

> > >> node if a
> > >> +    non-isolated PMD exists on that NUMA node.  If not, the queue
> > >> + will
> > be
> > >> +    assigned to a non-isolated pmd on a remote NUMA node.  This
> > >> + will
> > >> result in
> > >> +    reduced maximum throughput on that device and possibly on
> > >> + other
> > >> devices
> > >> +    assigned to that pmd thread. In the case such, a queue
> > >> + assignment is
> > >> made a
> > >> +    warning message will be logged: "There's no available
> > >> + (non-isolated)
> > >> pmd
> > >
> > > Above should read 'In the case where such a queue assignment is
> > > made, a
> > warning message will be logged'

[[BO'M]] 
Suggesting a simpler:
'If such a queue assignment is made a warning message ..."

> > >> +    thread on numa node N. Queue Q on port P will be assigned to
> > >> + the pmd
> > >> on
> > >> +    core C (numa node N'). Expect reduced performance."
> > >>
> > >>  - QEMU vCPU thread Affinity
> > >>
> > >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > >> 4e29085..7557f32
> > >> 100644
> > >> --- a/lib/dpif-netdev.c
> > >> +++ b/lib/dpif-netdev.c
> > >> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr,
> > >> int
> > >> numa_id)
> > >>      return NULL;
> > >>  }
> > >>
> > >> +/* Returns next NUMA from rr list in round-robin fashion. Returns
> > >> +the first
> > >> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list
> > >> +is empty. */ static struct rr_numa * rr_numa_list_next(struct
> > >> +rr_numa_list *rr, const struct rr_numa *numa) {
> > >
> > > The comment above can be tidied up a little to better clarify the
> > behavior of this function.
> > > I ended up reading the comments for hmap_next() and hmap_first()
> > > before
> > it made sense, and even then it's a bit ambiguous, it ends up being
> > the code that explains the comments.
> > >
> > > You could clarify the following 2 statements:
> > >
> > > (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I
> > > assume
> > you mean the function parameter 'const struct rr_numa *numa' but it's
> > not clear on first reading.
> > >
> > > (2) " or the last node passed" - again this makes sense only when
> > > you
> > look into the behavior of the call 'hmap_next(&rr->numas, &numa-
> >node)'.
> > >
> > > You could say something like:
> > >
> > > "Attempt to return the next NUMA from a numa list in a round robin
> > fashion. Return the first NUMA node if the struct rr_numa *numa
> > argument passed to the function is NULL or if the numa node argument
> > passed to hmap_next is already the last node. Return NULL if the numa
> > list is empty."
> >
> > I'm not sure that references to implementation is a good way to write
> > comments (I mean 'passed to hmap_next' part).
> >
> > How about this:
> > """
> > /* Returns the next node in numa list following 'numa' in round-robin
> > fashion.
> >  * Returns first node if 'numa' is a null pointer or the last node in
> > 'rr'. */ """
> >
> > or
> >
> > """
> > /* The next node in numa list following 'numa' in round-robin fashion.
> >  * Returns:
> >  *     - 'NULL' if 'rr' is an empty numa list.
> >  *     - First node in 'rr' if 'numa' is a null pointer.
> >  *     - First node in 'rr' if 'numa' is the last node in 'rr'.
> >  *     - Otherwise, the next node in numa list following 'numa'. */
> > """
> >
> > ?
> 
> I'm happy with the first option you provided above, could you append
> returning NULL if the list is empty then I think we're good.
> 
> /* Returns the next node in numa list following 'numa' in round-robin
> fashion.
>  * Returns first node if 'numa' is a null pointer or the last node in 'rr'.
>  * Returns NULL if 'rr' numa list is empty. */

[[BO'M]] 
Sounds good to me. Anyone object to this wording? 

> 
> Thanks
> Ian
> >
> > >
> > >> +    struct hmap_node *node = NULL;
> > >> +
> > >> +    if (numa) {
> > >> +        node = hmap_next(&rr->numas, &numa->node);
> > >> +    }
> > >> +    if (!node) {
> > >> +        node = hmap_first(&rr->numas);
> > >> +    }
> > >> +
> > >> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) :
> > >> + NULL; }
> > >> +
> > >>  static void
> > >>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list
> > >> *rr) { @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp,
> > >> bool
> > >> pinned)
> > >> OVS_REQUIRES(dp->port_mutex)  {
> > >>      struct dp_netdev_port *port;
> > >>      struct rr_numa_list rr;
> > >> +    struct rr_numa *non_local_numa = NULL;
> > >>
> > >>      rr_numa_list_populate(dp, &rr);
> > >>
> > >> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool
> > >> pinned)
> > >> OVS_REQUIRES(dp->port_mutex)
> > >>                  }
> > >>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
> > >>                  if (!numa) {
> > >> -                    VLOG_WARN("There's no available (non isolated) pmd
> > >> thread "
> > >> +                    /* There are no pmds on the queue's local NUMA
> > node.
> > >> +                       Round-robin on the NUMA nodes that do have
> > pmds.
> > >> */
> > >> +                    non_local_numa = rr_numa_list_next(&rr,
> > >> non_local_numa);
> > >> +                    if (!non_local_numa) {
> > >> +                        VLOG_ERR("There is no available
> > >> + (non-isolated)
> > >> pmd "
> > >> +                                 "thread for port \'%s\' queue %d.
> > >> + This
> > >> queue "
> > >> +                                 "will not be polled. Is
> > >> + pmd-cpu-mask set
> > >> to "
> > >> +                                 "zero? Or are all PMDs isolated
> > >> + to other
> > >> "
> > >> +                                 "queues?", netdev_get_name(port-
> > >>> netdev),
> > >> +                                 qid);
> > >> +                        continue;
> > >> +                    }
> > >> +                    q->pmd = rr_numa_get_pmd(non_local_numa);
> > >> +                    VLOG_WARN("There's no available (non-isolated)
> > >> + pmd
> > >> thread "
> > >>                                "on numa node %d. Queue %d on port
> > \'%s\'
> > >> will "
> > >> -                              "not be polled.",
> > >> -                              numa_id, qid, netdev_get_name(port-
> > >>> netdev));
> > >> +                              "be assigned to the pmd on core %d "
> > >> +                              "(numa node %d). Expect reduced
> > >> performance.",
> > >> +                              numa_id, qid, netdev_get_name(port-
> > >>> netdev),
> > >> +                              q->pmd->core_id, q->pmd->numa_id);
> > >>                  } else {
> > >> +                    /* Assign queue to the next (round-robin) PMD
> > >> + on it's
> > >> local
> > >> +                       NUMA node. */
> > >>                      q->pmd = rr_numa_get_pmd(numa);
> > >>                  }
> > >>              }
> > >> --
> > >> 2.7.4
> > > This tested fine for me, tested with multiple rxqs distributed and
> > isolated over pmds on 2 different numa nodes with varying pmd masks.
> > Also passed sanity checks (clang, sparse compilation etc.).
> > >
> > > You can add the tested by tag for me but I'd like to see the changes
> > > for
> > the documentation and function comments above before acking.
> > >
> > > Tested-by: Ian Stokes <ian.stokes@intel.com>
> > >>
> > >> _______________________________________________
> > >> dev mailing list
> > >> dev@openvswitch.org
> > >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > >
> > >
> > >
Ilya Maximets July 10, 2017, 10:53 a.m. UTC | #6
On 10.07.2017 13:42, O Mahony, Billy wrote:
> 
> 
>> -----Original Message-----
>> From: Stokes, Ian
>> Sent: Monday, July 10, 2017 10:41 AM
>> To: Ilya Maximets <i.maximets@samsung.com>; O Mahony, Billy
>> <billy.o.mahony@intel.com>; dev@openvswitch.org
>> Cc: dball@vmare.com
>> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-
>> local numa node.
>>
>>> On 08.07.2017 22:09, Stokes, Ian wrote:
>>>>> Previously if there is no available (non-isolated) pmd on the numa
>>>>> node for a port then the port is not polled at all. This can result
>>>>> in a non- operational system until such time as nics are physically
>>>>> repositioned. It is preferable to operate with a pmd on the 'wrong'
>>>>> numa node albeit with lower performance. Local pmds are still
>>>>> chosen
>>> when available.
>>>>>
>>>>> Signed-off-by: Billy O'Mahony <billy.o.mahony@intel.com>
>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>>> Co-authored-by: Ilya Maximets <i.maximets@samsung.com>
>>>>> ---
>>>>> v9: v8 missed some comments on v7
>>>>> v8: Some coding style issues; doc tweak
>>>>> v7: Incorporate review comments on docs and implementation
>>>>> v6: Change 'port' to 'queue' in a warning msg
>>>>> v5: Fix warning msg; Update same in docs
>>>>> v4: Fix a checkpatch error
>>>>> v3: Fix warning messages not appearing when using multiqueue
>>>>> v2: Add details of warning messages into docs
>>>>>
>>>>>  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
>>>>>  lib/dpif-netdev.c                    | 41
>>>>> +++++++++++++++++++++++++++++++++---
>>>>>  2 files changed, 56 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/intro/install/dpdk.rst
>>>>> b/Documentation/intro/install/dpdk.rst
>>>>> index e83f852..89775d6 100644
>>>>> --- a/Documentation/intro/install/dpdk.rst
>>>>> +++ b/Documentation/intro/install/dpdk.rst
>>>>> @@ -449,7 +449,7 @@ affinitized accordingly.
>>>>>
>>>>>    A poll mode driver (pmd) thread handles the I/O of all DPDK
>>> interfaces
>>>>>    assigned to it. A pmd thread shall poll the ports for incoming
>>>>> packets,
>>>>> -  switch the packets and send to tx port.  pmd thread is CPU
>>>>> bound, and needs
>>>>> +  switch the packets and send to tx port.  A pmd thread is CPU
>>>>> + bound, and needs
>>>>>    to be affinitized to isolated cores for optimum performance.
>>>>>
>>>>>    By setting a bit in the mask, a pmd thread is created and pinned
>>>>> to the @@ -458,8 +458,23 @@ affinitized accordingly.
>>>>>        $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
>>>>>
>>>>>    .. note::
>>>>> -    pmd thread on a NUMA node is only created if there is at least one
>>>>> DPDK
>>>>> -    interface from that NUMA node added to OVS.
>>>>> +    A pmd thread on a NUMA node is only created if there is at
>>>>> + least one
>>>>> DPDK
>>>>> +    interface from that NUMA node added to OVS.  A pmd thread is
>>>>> + created
>>>>> by
>>>>> +    default on a core of a NUMA node or when a specified
>>>>> + pmd-cpu-mask
>>> has
>>>>> +    indicated so.  Even though a PMD thread may exist, the thread
>>>>> + only
>>>>> starts
>>>>> +    consuming CPU cycles if there is least one receive queue
>>>>> + assigned
>>> to
>>>>> +    the pmd.
>>>>> +
>>>>> +  .. note::
>>>>> +    On NUMA systems PCI devices are also local to a NUMA node.
>>>>> + Unbound
>>>>> rx
>>>>> +    queues for a PCI device will assigned to a pmd on it's local
>>>>> + NUMA
>>>>
>>>> Minor point but should read 'will be assigned'
> 
> [[BO'M]] 
> +1
> 
>>>>> node if a
>>>>> +    non-isolated PMD exists on that NUMA node.  If not, the queue
>>>>> + will
>>> be
>>>>> +    assigned to a non-isolated pmd on a remote NUMA node.  This
>>>>> + will
>>>>> result in
>>>>> +    reduced maximum throughput on that device and possibly on
>>>>> + other
>>>>> devices
>>>>> +    assigned to that pmd thread. In the case such, a queue
>>>>> + assignment is
>>>>> made a
>>>>> +    warning message will be logged: "There's no available
>>>>> + (non-isolated)
>>>>> pmd
>>>>
>>>> Above should read 'In the case where such a queue assignment is
>>>> made, a
>>> warning message will be logged'
> 
> [[BO'M]] 
> Suggesting a simpler:
> 'If such a queue assignment is made a warning message ..."
> 
>>>>> +    thread on numa node N. Queue Q on port P will be assigned to
>>>>> + the pmd
>>>>> on
>>>>> +    core C (numa node N'). Expect reduced performance."
>>>>>
>>>>>  - QEMU vCPU thread Affinity
>>>>>
>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>>>> 4e29085..7557f32
>>>>> 100644
>>>>> --- a/lib/dpif-netdev.c
>>>>> +++ b/lib/dpif-netdev.c
>>>>> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr,
>>>>> int
>>>>> numa_id)
>>>>>      return NULL;
>>>>>  }
>>>>>
>>>>> +/* Returns next NUMA from rr list in round-robin fashion. Returns
>>>>> +the first
>>>>> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list
>>>>> +is empty. */ static struct rr_numa * rr_numa_list_next(struct
>>>>> +rr_numa_list *rr, const struct rr_numa *numa) {
>>>>
>>>> The comment above can be tidied up a little to better clarify the
>>> behavior of this function.
>>>> I ended up reading the comments for hmap_next() and hmap_first()
>>>> before
>>> it made sense, and even then it's a bit ambiguous, it ends up being
>>> the code that explains the comments.
>>>>
>>>> You could clarify the following 2 statements:
>>>>
>>>> (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I
>>>> assume
>>> you mean the function parameter 'const struct rr_numa *numa' but it's
>>> not clear on first reading.
>>>>
>>>> (2) " or the last node passed" - again this makes sense only when
>>>> you
>>> look into the behavior of the call 'hmap_next(&rr->numas, &numa-
>>> node)'.
>>>>
>>>> You could say something like:
>>>>
>>>> "Attempt to return the next NUMA from a numa list in a round robin
>>> fashion. Return the first NUMA node if the struct rr_numa *numa
>>> argument passed to the function is NULL or if the numa node argument
>>> passed to hmap_next is already the last node. Return NULL if the numa
>>> list is empty."
>>>
>>> I'm not sure that references to implementation is a good way to write
>>> comments (I mean 'passed to hmap_next' part).
>>>
>>> How about this:
>>> """
>>> /* Returns the next node in numa list following 'numa' in round-robin
>>> fashion.
>>>  * Returns first node if 'numa' is a null pointer or the last node in
>>> 'rr'. */ """
>>>
>>> or
>>>
>>> """
>>> /* The next node in numa list following 'numa' in round-robin fashion.
>>>  * Returns:
>>>  *     - 'NULL' if 'rr' is an empty numa list.
>>>  *     - First node in 'rr' if 'numa' is a null pointer.
>>>  *     - First node in 'rr' if 'numa' is the last node in 'rr'.
>>>  *     - Otherwise, the next node in numa list following 'numa'. */
>>> """
>>>
>>> ?
>>
>> I'm happy with the first option you provided above, could you append
>> returning NULL if the list is empty then I think we're good.
>>
>> /* Returns the next node in numa list following 'numa' in round-robin
>> fashion.
>>  * Returns first node if 'numa' is a null pointer or the last node in 'rr'.
>>  * Returns NULL if 'rr' numa list is empty. */
> 
> [[BO'M]] 
> Sounds good to me. Anyone object to this wording? 

I don't like three 'Returns' in a row but LGTM in general.

>>
>> Thanks
>> Ian
>>>
>>>>
>>>>> +    struct hmap_node *node = NULL;
>>>>> +
>>>>> +    if (numa) {
>>>>> +        node = hmap_next(&rr->numas, &numa->node);
>>>>> +    }
>>>>> +    if (!node) {
>>>>> +        node = hmap_first(&rr->numas);
>>>>> +    }
>>>>> +
>>>>> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) :
>>>>> + NULL; }
>>>>> +
>>>>>  static void
>>>>>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list
>>>>> *rr) { @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp,
>>>>> bool
>>>>> pinned)
>>>>> OVS_REQUIRES(dp->port_mutex)  {
>>>>>      struct dp_netdev_port *port;
>>>>>      struct rr_numa_list rr;
>>>>> +    struct rr_numa *non_local_numa = NULL;
>>>>>
>>>>>      rr_numa_list_populate(dp, &rr);
>>>>>
>>>>> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool
>>>>> pinned)
>>>>> OVS_REQUIRES(dp->port_mutex)
>>>>>                  }
>>>>>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>>>>>                  if (!numa) {
>>>>> -                    VLOG_WARN("There's no available (non isolated) pmd
>>>>> thread "
>>>>> +                    /* There are no pmds on the queue's local NUMA
>>> node.
>>>>> +                       Round-robin on the NUMA nodes that do have
>>> pmds.
>>>>> */
>>>>> +                    non_local_numa = rr_numa_list_next(&rr,
>>>>> non_local_numa);
>>>>> +                    if (!non_local_numa) {
>>>>> +                        VLOG_ERR("There is no available
>>>>> + (non-isolated)
>>>>> pmd "
>>>>> +                                 "thread for port \'%s\' queue %d.
>>>>> + This
>>>>> queue "
>>>>> +                                 "will not be polled. Is
>>>>> + pmd-cpu-mask set
>>>>> to "
>>>>> +                                 "zero? Or are all PMDs isolated
>>>>> + to other
>>>>> "
>>>>> +                                 "queues?", netdev_get_name(port-
>>>>>> netdev),
>>>>> +                                 qid);
>>>>> +                        continue;
>>>>> +                    }
>>>>> +                    q->pmd = rr_numa_get_pmd(non_local_numa);
>>>>> +                    VLOG_WARN("There's no available (non-isolated)
>>>>> + pmd
>>>>> thread "
>>>>>                                "on numa node %d. Queue %d on port
>>> \'%s\'
>>>>> will "
>>>>> -                              "not be polled.",
>>>>> -                              numa_id, qid, netdev_get_name(port-
>>>>>> netdev));
>>>>> +                              "be assigned to the pmd on core %d "
>>>>> +                              "(numa node %d). Expect reduced
>>>>> performance.",
>>>>> +                              numa_id, qid, netdev_get_name(port-
>>>>>> netdev),
>>>>> +                              q->pmd->core_id, q->pmd->numa_id);
>>>>>                  } else {
>>>>> +                    /* Assign queue to the next (round-robin) PMD
>>>>> + on it's
>>>>> local
>>>>> +                       NUMA node. */
>>>>>                      q->pmd = rr_numa_get_pmd(numa);
>>>>>                  }
>>>>>              }
>>>>> --
>>>>> 2.7.4
>>>> This tested fine for me, tested with multiple rxqs distributed and
>>> isolated over pmds on 2 different numa nodes with varying pmd masks.
>>> Also passed sanity checks (clang, sparse compilation etc.).
>>>>
>>>> You can add the tested by tag for me but I'd like to see the changes
>>>> for
>>> the documentation and function comments above before acking.
>>>>
>>>> Tested-by: Ian Stokes <ian.stokes@intel.com>
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> dev@openvswitch.org
>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>
>>>>
>>>>
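
For reference, folding the agreed wording into the v9 hunk gives roughly the
function below. This is only a sketch of how v10 might read; the body is
unchanged from the patch and the comment text may still be smoothed to avoid
the three leading 'Returns'.

/* Returns the next node in numa list following 'numa' in round-robin fashion.
 * Returns first node if 'numa' is a null pointer or the last node in 'rr'.
 * Returns NULL if 'rr' numa list is empty. */
static struct rr_numa *
rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
{
    struct hmap_node *node = NULL;

    if (numa) {
        /* Try the node after 'numa'; hmap_next() returns NULL past the
         * last node. */
        node = hmap_next(&rr->numas, &numa->node);
    }
    if (!node) {
        /* Null argument or end of list: wrap around to the first node. */
        node = hmap_first(&rr->numas);
    }

    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
}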
Stokes, Ian July 10, 2017, 10:56 a.m. UTC | #7
> > > > Above should read 'In the case where such a queue assignment is
> > > > made, a
> > > warning message will be logged'
> 
> [[BO'M]]
> Suggesting a simpler:
> 'If such a queue assignment is made a warning message ..."
+1, LGTM.
Stokes, Ian July 10, 2017, 11:02 a.m. UTC | #8
> > [[BO'M]]
> > Sounds good to me. Anyone object to this wording?
> 
> I don't like three 'Returns' in a row but LGTM in general.
> 

Once these changes are made for the next patch version, feel free to add the Tested-by and Acked-by tags from me.

Thanks
Ian
Billy O'Mahony July 10, 2017, 11:04 a.m. UTC | #9
Thanks everybody. Will post v10 after lunch.

/Billy

diff mbox

Patch

diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
index e83f852..89775d6 100644
--- a/Documentation/intro/install/dpdk.rst
+++ b/Documentation/intro/install/dpdk.rst
@@ -449,7 +449,7 @@  affinitized accordingly.
 
   A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
   assigned to it. A pmd thread shall poll the ports for incoming packets,
-  switch the packets and send to tx port.  pmd thread is CPU bound, and needs
+  switch the packets and send to tx port.  A pmd thread is CPU bound, and needs
   to be affinitized to isolated cores for optimum performance.
 
   By setting a bit in the mask, a pmd thread is created and pinned to the
@@ -458,8 +458,23 @@  affinitized accordingly.
       $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
 
   .. note::
-    pmd thread on a NUMA node is only created if there is at least one DPDK
-    interface from that NUMA node added to OVS.
+    A pmd thread on a NUMA node is only created if there is at least one DPDK
+    interface from that NUMA node added to OVS.  A pmd thread is created by
+    default on a core of a NUMA node or when a specified pmd-cpu-mask has
+    indicated so.  Even though a PMD thread may exist, the thread only starts
+    consuming CPU cycles if there is least one receive queue assigned to
+    the pmd.
+
+  .. note::
+    On NUMA systems PCI devices are also local to a NUMA node.  Unbound rx
+    queues for a PCI device will assigned to a pmd on it's local NUMA node if a
+    non-isolated PMD exists on that NUMA node.  If not, the queue will be
+    assigned to a non-isolated pmd on a remote NUMA node.  This will result in
+    reduced maximum throughput on that device and possibly on other devices
+    assigned to that pmd thread. In the case such, a queue assignment is made a
+    warning message will be logged: "There's no available (non-isolated) pmd
+    thread on numa node N. Queue Q on port P will be assigned to the pmd on
+    core C (numa node N'). Expect reduced performance."
 
 - QEMU vCPU thread Affinity
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 4e29085..7557f32 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -3195,6 +3195,23 @@  rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id)
     return NULL;
 }
 
+/* Returns next NUMA from rr list in round-robin fashion. Returns the first
+ * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is empty. */
+static struct rr_numa *
+rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
+{
+    struct hmap_node *node = NULL;
+
+    if (numa) {
+        node = hmap_next(&rr->numas, &numa->node);
+    }
+    if (!node) {
+        node = hmap_first(&rr->numas);
+    }
+
+    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
+}
+
 static void
 rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
 {
@@ -3249,6 +3266,7 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
 {
     struct dp_netdev_port *port;
     struct rr_numa_list rr;
+    struct rr_numa *non_local_numa = NULL;
 
     rr_numa_list_populate(dp, &rr);
 
@@ -3281,11 +3299,28 @@  rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
                 }
             } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
                 if (!numa) {
-                    VLOG_WARN("There's no available (non isolated) pmd thread "
+                    /* There are no pmds on the queue's local NUMA node.
+                       Round-robin on the NUMA nodes that do have pmds. */
+                    non_local_numa = rr_numa_list_next(&rr, non_local_numa);
+                    if (!non_local_numa) {
+                        VLOG_ERR("There is no available (non-isolated) pmd "
+                                 "thread for port \'%s\' queue %d. This queue "
+                                 "will not be polled. Is pmd-cpu-mask set to "
+                                 "zero? Or are all PMDs isolated to other "
+                                 "queues?", netdev_get_name(port->netdev),
+                                 qid);
+                        continue;
+                    }
+                    q->pmd = rr_numa_get_pmd(non_local_numa);
+                    VLOG_WARN("There's no available (non-isolated) pmd thread "
                               "on numa node %d. Queue %d on port \'%s\' will "
-                              "not be polled.",
-                              numa_id, qid, netdev_get_name(port->netdev));
+                              "be assigned to the pmd on core %d "
+                              "(numa node %d). Expect reduced performance.",
+                              numa_id, qid, netdev_get_name(port->netdev),
+                              q->pmd->core_id, q->pmd->numa_id);
                 } else {
+                    /* Assign queue to the next (round-robin) PMD on it's local
+                       NUMA node. */
                     q->pmd = rr_numa_get_pmd(numa);
                 }
             }
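
To see the effect of the fallback without a two-socket test machine, here is a
minimal standalone sketch (plain C, not OVS code; the two-node list and the
queue count are made up for illustration) of the round-robin wrap-around that
rr_numa_list_next() provides to rxq_scheduling():

#include <stdio.h>

/* Stand-in for the rr list of NUMA nodes that still have non-isolated pmds. */
static const int numa_ids[] = {0, 1};
static const int n_numa = sizeof numa_ids / sizeof numa_ids[0];

/* Round-robin successor: -1 plays the role of the NULL 'numa' argument,
 * i.e. "start from the first node"; the last index wraps back to 0. */
static int
next_numa(int prev)
{
    return (prev < 0 || prev + 1 >= n_numa) ? 0 : prev + 1;
}

int
main(void)
{
    int cur = -1;

    /* Queues whose local NUMA node has no usable pmd are spread over the
     * remote nodes in turn, as in the VLOG_WARN branch of the patch. */
    for (int qid = 0; qid < 5; qid++) {
        cur = next_numa(cur);
        printf("queue %d -> pmd on numa node %d\n", qid, numa_ids[cur]);
    }
    return 0;
}

With two remote nodes the queues alternate 0, 1, 0, 1, ... which matches how
non_local_numa is advanced once per orphaned queue in rxq_scheduling().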