diff mbox

[ovs-dev,RFC,1/1] netdev-dpdk: NUMA Aware vHost

Message ID 1457086119-26981-2-git-send-email-ciara.loftus@intel.com
State Not Applicable
Headers show

Commit Message

Ciara Loftus March 4, 2016, 10:08 a.m. UTC
This commit allows for vHost memory from QEMU, DPDK and OVS, as well
as the servicing PMD, to all come from the same socket.

DPDK v2.2 introduces a new configuration option:
CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
from which a vhost device's memory has been allocated by QEMU, and
accordingly reallocates device memory managed by DPDK to that same
socket.

OVS by default sets the socket id of a vhost port to that of the
master lcore. This commit introduces the ability to update the
socket id of the port if it is detected (during VM boot) that the
port memory is not on the default NUMA node. If this is the case, the
mempool of the port is also changed to the new node, and a PMD
thread currently servicing the port will no longer, in favour of a
thread from the new node (if enabled in the CPU mask).

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 INSTALL.DPDK.md   |  6 +++++-
 acinclude.m4      |  2 +-
 lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
 3 files changed, 29 insertions(+), 4 deletions(-)

Comments

Daniele Di Proietto March 10, 2016, 1:22 a.m. UTC | #1
Thanks for the patch, I'll put this in the use case list for
my series if I need to resend it!

It would be nice to get the numa socket information without
linking OVS with libnuma, maybe using some DPDK api. From
a quick look I didn't find any way, but maybe you know a
better way.

Some preliminary comments inline

On 04/03/2016 02:08, "dev on behalf of Ciara Loftus"
<dev-bounces@openvswitch.org on behalf of ciara.loftus@intel.com> wrote:

>This commit allows for vHost memory from QEMU, DPDK and OVS, as well
>as the servicing PMD, to all come from the same socket.
>
>DPDK v2.2 introduces a new configuration option:
>CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
>from which a vhost device's memory has been allocated by QEMU, and
>accordingly reallocates device memory managed by DPDK to that same
>socket.
>
>OVS by default sets the socket id of a vhost port to that of the
>master lcore. This commit introduces the ability to update the
>socket id of the port if it is detected (during VM boot) that the
>port memory is not on the default NUMA node. If this is the case, the
>mempool of the port is also changed to the new node, and a PMD
>thread currently servicing the port will no longer, in favour of a
>thread from the new node (if enabled in the CPU mask).
>
>Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>---
> INSTALL.DPDK.md   |  6 +++++-
> acinclude.m4      |  2 +-
> lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
> 3 files changed, 29 insertions(+), 4 deletions(-)
>
>diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>index dca79bd..82e6908 100644
>--- a/INSTALL.DPDK.md
>+++ b/INSTALL.DPDK.md
>@@ -33,6 +33,10 @@ on Debian/Ubuntu)
> 
>      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
> 
>+     Enable NUMA-aware vHost by modifying the following in the same file:
>+
>+     `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
>+

I guess we should also update install_dpdk() in ./travis/build.sh to do
this if it's required

>      Then run `make install` to build and install the library.
>      For default install without IVSHMEM:
> 
>@@ -383,7 +387,7 @@ Performance Tuning:
> 
> 	It is good practice to ensure that threads that are in the datapath are
> 	pinned to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs
>-	responsible for forwarding.
>+	responsible for forwarding. This is now default behavior for vHost
>ports.
> 
>   9. Rx Mergeable buffers
> 
>diff --git a/acinclude.m4 b/acinclude.m4
>index 11c7787..432bdbd 100644
>--- a/acinclude.m4
>+++ b/acinclude.m4
>@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [
>     found=false
>     save_LIBS=$LIBS
>     for extras in "" "-ldl"; do
>-        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
>+        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"

I guess we should also list libnuma-dev in .travis.yml and something
similar
in rhel/openvswitch-fedora.spec

>         AC_LINK_IFELSE(
>            [AC_LANG_PROGRAM([#include <rte_config.h>
>                              #include <rte_eal.h>],
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>index 17b8d51..4e1ce53 100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -29,6 +29,7 @@
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
>+#include <numaif.h>
> 
> #include "dirs.h"
> #include "dp-packet.h"
>@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev)
> {
>     struct netdev_dpdk *netdev;
>     bool exists = false;
>+    int newnode = 0;
>+    long err = 0;
> 
>     ovs_mutex_lock(&dpdk_mutex);
>     /* Add device to the vhost port with the same name as that passed
>down. */
>@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev)
>             }
>             ovsrcu_set(&netdev->virtio_dev, dev);
>             exists = true;
>+
>+            /* Get NUMA information */
>+            err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE |
>MPOL_F_ADDR);
>+            if (err) {
>+                VLOG_INFO("Error getting NUMA info for vHost Device
>'%s'",
>+                        dev->ifname);
>+                newnode = netdev->socket_id;
>+            } else if (newnode != netdev->socket_id) {
>+                netdev->socket_id = newnode;
>+                /* Change mempool to new NUMA Node */
>+                dpdk_mp_put(netdev->dpdk_mp);
>+                netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>netdev->mtu);
>+                /* Request netdev reconfiguration. The port may now be
>+                 * serviced by a PMD on the new node if enabled in the
>cpu
>+                 * mask */
>+                netdev_request_reconfigure(&netdev->up);

I think here I would prefer:

1) Remembering the configuration change request
    netdev->requested_socket_id = newnode
2) Calling netdev_request_reconfigure()
3) In the netdev_dpdk_vhost_user_reconfigure() method:
    netdev->socket_id = netdev_requested_socket_id

Otherwise the datapath might be confused, because it assumes that the
socket_id doesn't change while the device is polled by the pmd threads.

It's safe to change almost everything inside the
netdev_dpdk_vhost_user_reconfigure(), because the device will not be
polled by the datapath when the function is called.

The same applies for the mempool: the actual change should be done
in netdev_dpdk_vhost_user_reconfigure(), because there might still be
threads using it.

>+            }
>+
>             dev->flags |= VIRTIO_DEV_RUNNING;
>             /* Disable notifications. */
>             set_irq_status(dev);
>@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev)
>         return -1;
>     }
> 
>-    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added", dev->ifname,
>-              dev->device_fh);
>+    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket %i",
>+              dev->ifname, dev->device_fh, newnode);
>     return 0;
> }
> 
>-- 
>2.4.3
>
>_______________________________________________
>dev mailing list
>dev@openvswitch.org
>http://openvswitch.org/mailman/listinfo/dev
Ben Pfaff March 10, 2016, 10:38 p.m. UTC | #2
How does this numa library relate to lib/ovs-numa.[ch]?

On Thu, Mar 10, 2016 at 01:22:42AM +0000, Daniele Di Proietto wrote:
> Thanks for the patch, I'll put this in the use case list for
> my series if I need to resend it!
> 
> It would be nice to get the numa socket information without
> linking OVS with libnuma, maybe using some DPDK api. From
> a quick look I didn't find any way, but maybe you know a
> better way.
> 
> Some preliminary comments inline
> 
> On 04/03/2016 02:08, "dev on behalf of Ciara Loftus"
> <dev-bounces@openvswitch.org on behalf of ciara.loftus@intel.com> wrote:
> 
> >This commit allows for vHost memory from QEMU, DPDK and OVS, as well
> >as the servicing PMD, to all come from the same socket.
> >
> >DPDK v2.2 introduces a new configuration option:
> >CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
> >from which a vhost device's memory has been allocated by QEMU, and
> >accordingly reallocates device memory managed by DPDK to that same
> >socket.
> >
> >OVS by default sets the socket id of a vhost port to that of the
> >master lcore. This commit introduces the ability to update the
> >socket id of the port if it is detected (during VM boot) that the
> >port memory is not on the default NUMA node. If this is the case, the
> >mempool of the port is also changed to the new node, and a PMD
> >thread currently servicing the port will no longer, in favour of a
> >thread from the new node (if enabled in the CPU mask).
> >
> >Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> >---
> > INSTALL.DPDK.md   |  6 +++++-
> > acinclude.m4      |  2 +-
> > lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
> > 3 files changed, 29 insertions(+), 4 deletions(-)
> >
> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> >index dca79bd..82e6908 100644
> >--- a/INSTALL.DPDK.md
> >+++ b/INSTALL.DPDK.md
> >@@ -33,6 +33,10 @@ on Debian/Ubuntu)
> > 
> >      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
> > 
> >+     Enable NUMA-aware vHost by modifying the following in the same file:
> >+
> >+     `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
> >+
> 
> I guess we should also update install_dpdk() in ./travis/build.sh to do
> this if it's required
> 
> >      Then run `make install` to build and install the library.
> >      For default install without IVSHMEM:
> > 
> >@@ -383,7 +387,7 @@ Performance Tuning:
> > 
> > 	It is good practice to ensure that threads that are in the datapath are
> > 	pinned to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs
> >-	responsible for forwarding.
> >+	responsible for forwarding. This is now default behavior for vHost
> >ports.
> > 
> >   9. Rx Mergeable buffers
> > 
> >diff --git a/acinclude.m4 b/acinclude.m4
> >index 11c7787..432bdbd 100644
> >--- a/acinclude.m4
> >+++ b/acinclude.m4
> >@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [
> >     found=false
> >     save_LIBS=$LIBS
> >     for extras in "" "-ldl"; do
> >-        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
> >+        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"
> 
> I guess we should also list libnuma-dev in .travis.yml and something
> similar
> in rhel/openvswitch-fedora.spec
> 
> >         AC_LINK_IFELSE(
> >            [AC_LANG_PROGRAM([#include <rte_config.h>
> >                              #include <rte_eal.h>],
> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >index 17b8d51..4e1ce53 100644
> >--- a/lib/netdev-dpdk.c
> >+++ b/lib/netdev-dpdk.c
> >@@ -29,6 +29,7 @@
> > #include <stdio.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> >+#include <numaif.h>
> > 
> > #include "dirs.h"
> > #include "dp-packet.h"
> >@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev)
> > {
> >     struct netdev_dpdk *netdev;
> >     bool exists = false;
> >+    int newnode = 0;
> >+    long err = 0;
> > 
> >     ovs_mutex_lock(&dpdk_mutex);
> >     /* Add device to the vhost port with the same name as that passed
> >down. */
> >@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev)
> >             }
> >             ovsrcu_set(&netdev->virtio_dev, dev);
> >             exists = true;
> >+
> >+            /* Get NUMA information */
> >+            err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE |
> >MPOL_F_ADDR);
> >+            if (err) {
> >+                VLOG_INFO("Error getting NUMA info for vHost Device
> >'%s'",
> >+                        dev->ifname);
> >+                newnode = netdev->socket_id;
> >+            } else if (newnode != netdev->socket_id) {
> >+                netdev->socket_id = newnode;
> >+                /* Change mempool to new NUMA Node */
> >+                dpdk_mp_put(netdev->dpdk_mp);
> >+                netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
> >netdev->mtu);
> >+                /* Request netdev reconfiguration. The port may now be
> >+                 * serviced by a PMD on the new node if enabled in the
> >cpu
> >+                 * mask */
> >+                netdev_request_reconfigure(&netdev->up);
> 
> I think here I would prefer:
> 
> 1) Remembering the configuration change request
>     netdev->requested_socket_id = newnode
> 2) Calling netdev_request_reconfigure()
> 3) In the netdev_dpdk_vhost_user_reconfigure() method:
>     netdev->socket_id = netdev_requested_socket_id
> 
> Otherwise the datapath might be confused, because it assumes that the
> socket_id doesn't change while the device is polled by the pmd threads.
> 
> It's safe to change almost everything inside the
> netdev_dpdk_vhost_user_reconfigure(), because the device will not be
> polled by the datapath when the function is called.
> 
> The same applies for the mempool: the actual change should be done
> in netdev_dpdk_vhost_user_reconfigure(), because there might still be
> threads using it.
> 
> >+            }
> >+
> >             dev->flags |= VIRTIO_DEV_RUNNING;
> >             /* Disable notifications. */
> >             set_irq_status(dev);
> >@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev)
> >         return -1;
> >     }
> > 
> >-    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added", dev->ifname,
> >-              dev->device_fh);
> >+    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket %i",
> >+              dev->ifname, dev->device_fh, newnode);
> >     return 0;
> > }
> > 
> >-- 
> >2.4.3
> >
> >_______________________________________________
> >dev mailing list
> >dev@openvswitch.org
> >http://openvswitch.org/mailman/listinfo/dev
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
Daniele Di Proietto March 10, 2016, 11:24 p.m. UTC | #3
ovs-numa reads the CPUs and NUMA nodes configuration from sysfs.
It is only used to decide how to set the affinity of pmd threads.

This new code just needs get_mempolicy() from libnuma: it is used
to get the NUMA node of a virtual address and it is simply a wrapper
to a system call.  Want do you think? Should we implement that in OVS?

If we implement the system call wrapper in ovs-numa we could avoid
depending on libnuma-dev for the build, but (I guess) we would still
need to link with -lnuma, as it is a dependency of the statically linked
DPDK library

Thanks

On 10/03/2016 14:38, "Ben Pfaff" <blp@ovn.org> wrote:

>How does this numa library relate to lib/ovs-numa.[ch]?
>
>On Thu, Mar 10, 2016 at 01:22:42AM +0000, Daniele Di Proietto wrote:
>> Thanks for the patch, I'll put this in the use case list for
>> my series if I need to resend it!
>> 
>> It would be nice to get the numa socket information without
>> linking OVS with libnuma, maybe using some DPDK api. From
>> a quick look I didn't find any way, but maybe you know a
>> better way.
>> 
>> Some preliminary comments inline
>> 
>> On 04/03/2016 02:08, "dev on behalf of Ciara Loftus"
>> <dev-bounces@openvswitch.org on behalf of ciara.loftus@intel.com> wrote:
>> 
>> >This commit allows for vHost memory from QEMU, DPDK and OVS, as well
>> >as the servicing PMD, to all come from the same socket.
>> >
>> >DPDK v2.2 introduces a new configuration option:
>> >CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
>> >from which a vhost device's memory has been allocated by QEMU, and
>> >accordingly reallocates device memory managed by DPDK to that same
>> >socket.
>> >
>> >OVS by default sets the socket id of a vhost port to that of the
>> >master lcore. This commit introduces the ability to update the
>> >socket id of the port if it is detected (during VM boot) that the
>> >port memory is not on the default NUMA node. If this is the case, the
>> >mempool of the port is also changed to the new node, and a PMD
>> >thread currently servicing the port will no longer, in favour of a
>> >thread from the new node (if enabled in the CPU mask).
>> >
>> >Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>> >---
>> > INSTALL.DPDK.md   |  6 +++++-
>> > acinclude.m4      |  2 +-
>> > lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
>> > 3 files changed, 29 insertions(+), 4 deletions(-)
>> >
>> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>> >index dca79bd..82e6908 100644
>> >--- a/INSTALL.DPDK.md
>> >+++ b/INSTALL.DPDK.md
>> >@@ -33,6 +33,10 @@ on Debian/Ubuntu)
>> > 
>> >      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
>> > 
>> >+     Enable NUMA-aware vHost by modifying the following in the same
>>file:
>> >+
>> >+     `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
>> >+
>> 
>> I guess we should also update install_dpdk() in ./travis/build.sh to do
>> this if it's required
>> 
>> >      Then run `make install` to build and install the library.
>> >      For default install without IVSHMEM:
>> > 
>> >@@ -383,7 +387,7 @@ Performance Tuning:
>> > 
>> > 	It is good practice to ensure that threads that are in the datapath
>>are
>> > 	pinned to cores in the same NUMA area. e.g. pmd threads and QEMU
>>vCPUs
>> >-	responsible for forwarding.
>> >+	responsible for forwarding. This is now default behavior for vHost
>> >ports.
>> > 
>> >   9. Rx Mergeable buffers
>> > 
>> >diff --git a/acinclude.m4 b/acinclude.m4
>> >index 11c7787..432bdbd 100644
>> >--- a/acinclude.m4
>> >+++ b/acinclude.m4
>> >@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [
>> >     found=false
>> >     save_LIBS=$LIBS
>> >     for extras in "" "-ldl"; do
>> >-        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
>> >+        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"
>> 
>> I guess we should also list libnuma-dev in .travis.yml and something
>> similar
>> in rhel/openvswitch-fedora.spec
>> 
>> >         AC_LINK_IFELSE(
>> >            [AC_LANG_PROGRAM([#include <rte_config.h>
>> >                              #include <rte_eal.h>],
>> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> >index 17b8d51..4e1ce53 100644
>> >--- a/lib/netdev-dpdk.c
>> >+++ b/lib/netdev-dpdk.c
>> >@@ -29,6 +29,7 @@
>> > #include <stdio.h>
>> > #include <sys/types.h>
>> > #include <sys/stat.h>
>> >+#include <numaif.h>
>> > 
>> > #include "dirs.h"
>> > #include "dp-packet.h"
>> >@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev)
>> > {
>> >     struct netdev_dpdk *netdev;
>> >     bool exists = false;
>> >+    int newnode = 0;
>> >+    long err = 0;
>> > 
>> >     ovs_mutex_lock(&dpdk_mutex);
>> >     /* Add device to the vhost port with the same name as that passed
>> >down. */
>> >@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev)
>> >             }
>> >             ovsrcu_set(&netdev->virtio_dev, dev);
>> >             exists = true;
>> >+
>> >+            /* Get NUMA information */
>> >+            err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE |
>> >MPOL_F_ADDR);
>> >+            if (err) {
>> >+                VLOG_INFO("Error getting NUMA info for vHost Device
>> >'%s'",
>> >+                        dev->ifname);
>> >+                newnode = netdev->socket_id;
>> >+            } else if (newnode != netdev->socket_id) {
>> >+                netdev->socket_id = newnode;
>> >+                /* Change mempool to new NUMA Node */
>> >+                dpdk_mp_put(netdev->dpdk_mp);
>> >+                netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>> >netdev->mtu);
>> >+                /* Request netdev reconfiguration. The port may now be
>> >+                 * serviced by a PMD on the new node if enabled in the
>> >cpu
>> >+                 * mask */
>> >+                netdev_request_reconfigure(&netdev->up);
>> 
>> I think here I would prefer:
>> 
>> 1) Remembering the configuration change request
>>     netdev->requested_socket_id = newnode
>> 2) Calling netdev_request_reconfigure()
>> 3) In the netdev_dpdk_vhost_user_reconfigure() method:
>>     netdev->socket_id = netdev_requested_socket_id
>> 
>> Otherwise the datapath might be confused, because it assumes that the
>> socket_id doesn't change while the device is polled by the pmd threads.
>> 
>> It's safe to change almost everything inside the
>> netdev_dpdk_vhost_user_reconfigure(), because the device will not be
>> polled by the datapath when the function is called.
>> 
>> The same applies for the mempool: the actual change should be done
>> in netdev_dpdk_vhost_user_reconfigure(), because there might still be
>> threads using it.
>> 
>> >+            }
>> >+
>> >             dev->flags |= VIRTIO_DEV_RUNNING;
>> >             /* Disable notifications. */
>> >             set_irq_status(dev);
>> >@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev)
>> >         return -1;
>> >     }
>> > 
>> >-    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added",
>>dev->ifname,
>> >-              dev->device_fh);
>> >+    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket
>>%i",
>> >+              dev->ifname, dev->device_fh, newnode);
>> >     return 0;
>> > }
>> > 
>> >-- 
>> >2.4.3
>> >
>> >_______________________________________________
>> >dev mailing list
>> >dev@openvswitch.org
>> >http://openvswitch.org/mailman/listinfo/dev
>> 
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev
Ben Pfaff March 11, 2016, 12:07 a.m. UTC | #4
On Thu, Mar 10, 2016 at 11:24:36PM +0000, Daniele Di Proietto wrote:
> ovs-numa reads the CPUs and NUMA nodes configuration from sysfs.
> It is only used to decide how to set the affinity of pmd threads.
> 
> This new code just needs get_mempolicy() from libnuma: it is used
> to get the NUMA node of a virtual address and it is simply a wrapper
> to a system call.  Want do you think? Should we implement that in OVS?

Looking at the implementation in libnuma, although it's a wrapper around
a system call, it requires a lot of architecture-specific crud.  I don't
object to using libnuma for this.

> If we implement the system call wrapper in ovs-numa we could avoid
> depending on libnuma-dev for the build, but (I guess) we would still
> need to link with -lnuma, as it is a dependency of the statically linked
> DPDK library
> 
> Thanks
> 
> On 10/03/2016 14:38, "Ben Pfaff" <blp@ovn.org> wrote:
> 
> >How does this numa library relate to lib/ovs-numa.[ch]?
> >
> >On Thu, Mar 10, 2016 at 01:22:42AM +0000, Daniele Di Proietto wrote:
> >> Thanks for the patch, I'll put this in the use case list for
> >> my series if I need to resend it!
> >> 
> >> It would be nice to get the numa socket information without
> >> linking OVS with libnuma, maybe using some DPDK api. From
> >> a quick look I didn't find any way, but maybe you know a
> >> better way.
> >> 
> >> Some preliminary comments inline
> >> 
> >> On 04/03/2016 02:08, "dev on behalf of Ciara Loftus"
> >> <dev-bounces@openvswitch.org on behalf of ciara.loftus@intel.com> wrote:
> >> 
> >> >This commit allows for vHost memory from QEMU, DPDK and OVS, as well
> >> >as the servicing PMD, to all come from the same socket.
> >> >
> >> >DPDK v2.2 introduces a new configuration option:
> >> >CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
> >> >from which a vhost device's memory has been allocated by QEMU, and
> >> >accordingly reallocates device memory managed by DPDK to that same
> >> >socket.
> >> >
> >> >OVS by default sets the socket id of a vhost port to that of the
> >> >master lcore. This commit introduces the ability to update the
> >> >socket id of the port if it is detected (during VM boot) that the
> >> >port memory is not on the default NUMA node. If this is the case, the
> >> >mempool of the port is also changed to the new node, and a PMD
> >> >thread currently servicing the port will no longer, in favour of a
> >> >thread from the new node (if enabled in the CPU mask).
> >> >
> >> >Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> >> >---
> >> > INSTALL.DPDK.md   |  6 +++++-
> >> > acinclude.m4      |  2 +-
> >> > lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
> >> > 3 files changed, 29 insertions(+), 4 deletions(-)
> >> >
> >> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> >> >index dca79bd..82e6908 100644
> >> >--- a/INSTALL.DPDK.md
> >> >+++ b/INSTALL.DPDK.md
> >> >@@ -33,6 +33,10 @@ on Debian/Ubuntu)
> >> > 
> >> >      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
> >> > 
> >> >+     Enable NUMA-aware vHost by modifying the following in the same
> >>file:
> >> >+
> >> >+     `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
> >> >+
> >> 
> >> I guess we should also update install_dpdk() in ./travis/build.sh to do
> >> this if it's required
> >> 
> >> >      Then run `make install` to build and install the library.
> >> >      For default install without IVSHMEM:
> >> > 
> >> >@@ -383,7 +387,7 @@ Performance Tuning:
> >> > 
> >> > 	It is good practice to ensure that threads that are in the datapath
> >>are
> >> > 	pinned to cores in the same NUMA area. e.g. pmd threads and QEMU
> >>vCPUs
> >> >-	responsible for forwarding.
> >> >+	responsible for forwarding. This is now default behavior for vHost
> >> >ports.
> >> > 
> >> >   9. Rx Mergeable buffers
> >> > 
> >> >diff --git a/acinclude.m4 b/acinclude.m4
> >> >index 11c7787..432bdbd 100644
> >> >--- a/acinclude.m4
> >> >+++ b/acinclude.m4
> >> >@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [
> >> >     found=false
> >> >     save_LIBS=$LIBS
> >> >     for extras in "" "-ldl"; do
> >> >-        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
> >> >+        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"
> >> 
> >> I guess we should also list libnuma-dev in .travis.yml and something
> >> similar
> >> in rhel/openvswitch-fedora.spec
> >> 
> >> >         AC_LINK_IFELSE(
> >> >            [AC_LANG_PROGRAM([#include <rte_config.h>
> >> >                              #include <rte_eal.h>],
> >> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >> >index 17b8d51..4e1ce53 100644
> >> >--- a/lib/netdev-dpdk.c
> >> >+++ b/lib/netdev-dpdk.c
> >> >@@ -29,6 +29,7 @@
> >> > #include <stdio.h>
> >> > #include <sys/types.h>
> >> > #include <sys/stat.h>
> >> >+#include <numaif.h>
> >> > 
> >> > #include "dirs.h"
> >> > #include "dp-packet.h"
> >> >@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev)
> >> > {
> >> >     struct netdev_dpdk *netdev;
> >> >     bool exists = false;
> >> >+    int newnode = 0;
> >> >+    long err = 0;
> >> > 
> >> >     ovs_mutex_lock(&dpdk_mutex);
> >> >     /* Add device to the vhost port with the same name as that passed
> >> >down. */
> >> >@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev)
> >> >             }
> >> >             ovsrcu_set(&netdev->virtio_dev, dev);
> >> >             exists = true;
> >> >+
> >> >+            /* Get NUMA information */
> >> >+            err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE |
> >> >MPOL_F_ADDR);
> >> >+            if (err) {
> >> >+                VLOG_INFO("Error getting NUMA info for vHost Device
> >> >'%s'",
> >> >+                        dev->ifname);
> >> >+                newnode = netdev->socket_id;
> >> >+            } else if (newnode != netdev->socket_id) {
> >> >+                netdev->socket_id = newnode;
> >> >+                /* Change mempool to new NUMA Node */
> >> >+                dpdk_mp_put(netdev->dpdk_mp);
> >> >+                netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
> >> >netdev->mtu);
> >> >+                /* Request netdev reconfiguration. The port may now be
> >> >+                 * serviced by a PMD on the new node if enabled in the
> >> >cpu
> >> >+                 * mask */
> >> >+                netdev_request_reconfigure(&netdev->up);
> >> 
> >> I think here I would prefer:
> >> 
> >> 1) Remembering the configuration change request
> >>     netdev->requested_socket_id = newnode
> >> 2) Calling netdev_request_reconfigure()
> >> 3) In the netdev_dpdk_vhost_user_reconfigure() method:
> >>     netdev->socket_id = netdev_requested_socket_id
> >> 
> >> Otherwise the datapath might be confused, because it assumes that the
> >> socket_id doesn't change while the device is polled by the pmd threads.
> >> 
> >> It's safe to change almost everything inside the
> >> netdev_dpdk_vhost_user_reconfigure(), because the device will not be
> >> polled by the datapath when the function is called.
> >> 
> >> The same applies for the mempool: the actual change should be done
> >> in netdev_dpdk_vhost_user_reconfigure(), because there might still be
> >> threads using it.
> >> 
> >> >+            }
> >> >+
> >> >             dev->flags |= VIRTIO_DEV_RUNNING;
> >> >             /* Disable notifications. */
> >> >             set_irq_status(dev);
> >> >@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev)
> >> >         return -1;
> >> >     }
> >> > 
> >> >-    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added",
> >>dev->ifname,
> >> >-              dev->device_fh);
> >> >+    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket
> >>%i",
> >> >+              dev->ifname, dev->device_fh, newnode);
> >> >     return 0;
> >> > }
> >> > 
> >> >-- 
> >> >2.4.3
> >> >
> >> >_______________________________________________
> >> >dev mailing list
> >> >dev@openvswitch.org
> >> >http://openvswitch.org/mailman/listinfo/dev
> >> 
> >> _______________________________________________
> >> dev mailing list
> >> dev@openvswitch.org
> >> http://openvswitch.org/mailman/listinfo/dev
>
Ciara Loftus March 16, 2016, 2:48 p.m. UTC | #5
> 
> Thanks for the patch, I'll put this in the use case list for
> my series if I need to resend it!
> 
> It would be nice to get the numa socket information without
> linking OVS with libnuma, maybe using some DPDK api. From
> a quick look I didn't find any way, but maybe you know a
> better way.
> 
> Some preliminary comments inline

Thanks for the feedback. I've sent out a v2 of the patch that incorporates most of your requests.
I just noticed I misspelled your name in the cover letter, apologies!

Some small comments below.

Thanks,
Ciara
> 
> On 04/03/2016 02:08, "dev on behalf of Ciara Loftus"
> <dev-bounces@openvswitch.org on behalf of ciara.loftus@intel.com>
> wrote:
> 
> >This commit allows for vHost memory from QEMU, DPDK and OVS, as well
> >as the servicing PMD, to all come from the same socket.
> >
> >DPDK v2.2 introduces a new configuration option:
> >CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
> >from which a vhost device's memory has been allocated by QEMU, and
> >accordingly reallocates device memory managed by DPDK to that same
> >socket.
> >
> >OVS by default sets the socket id of a vhost port to that of the
> >master lcore. This commit introduces the ability to update the
> >socket id of the port if it is detected (during VM boot) that the
> >port memory is not on the default NUMA node. If this is the case, the
> >mempool of the port is also changed to the new node, and a PMD
> >thread currently servicing the port will no longer, in favour of a
> >thread from the new node (if enabled in the CPU mask).
> >
> >Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> >---
> > INSTALL.DPDK.md   |  6 +++++-
> > acinclude.m4      |  2 +-
> > lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
> > 3 files changed, 29 insertions(+), 4 deletions(-)
> >
> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> >index dca79bd..82e6908 100644
> >--- a/INSTALL.DPDK.md
> >+++ b/INSTALL.DPDK.md
> >@@ -33,6 +33,10 @@ on Debian/Ubuntu)
> >
> >      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
> >
> >+     Enable NUMA-aware vHost by modifying the following in the same file:
> >+
> >+     `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
> >+
> 
> I guess we should also update install_dpdk() in ./travis/build.sh to do
> this if it's required
I left this out because everything will still work ok (ie. as before) without this option. It can be optionally enabled if the functionality is desired.
However if we think it should be always enabled I can include this in the next revision.

> 
> >      Then run `make install` to build and install the library.
> >      For default install without IVSHMEM:
> >
> >@@ -383,7 +387,7 @@ Performance Tuning:
> >
> > 	It is good practice to ensure that threads that are in the datapath are
> > 	pinned to cores in the same NUMA area. e.g. pmd threads and
> QEMU vCPUs
> >-	responsible for forwarding.
> >+	responsible for forwarding. This is now default behavior for vHost
> >ports.
> >
> >   9. Rx Mergeable buffers
> >
> >diff --git a/acinclude.m4 b/acinclude.m4
> >index 11c7787..432bdbd 100644
> >--- a/acinclude.m4
> >+++ b/acinclude.m4
> >@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [
> >     found=false
> >     save_LIBS=$LIBS
> >     for extras in "" "-ldl"; do
> >-        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
> >+        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"
> 
> I guess we should also list libnuma-dev in .travis.yml and something
> similar
> in rhel/openvswitch-fedora.spec
I updated these in the v2. 

> 
> >         AC_LINK_IFELSE(
> >            [AC_LANG_PROGRAM([#include <rte_config.h>
> >                              #include <rte_eal.h>],
> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >index 17b8d51..4e1ce53 100644
> >--- a/lib/netdev-dpdk.c
> >+++ b/lib/netdev-dpdk.c
> >@@ -29,6 +29,7 @@
> > #include <stdio.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> >+#include <numaif.h>
> >
> > #include "dirs.h"
> > #include "dp-packet.h"
> >@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev)
> > {
> >     struct netdev_dpdk *netdev;
> >     bool exists = false;
> >+    int newnode = 0;
> >+    long err = 0;
> >
> >     ovs_mutex_lock(&dpdk_mutex);
> >     /* Add device to the vhost port with the same name as that passed
> >down. */
> >@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev)
> >             }
> >             ovsrcu_set(&netdev->virtio_dev, dev);
> >             exists = true;
> >+
> >+            /* Get NUMA information */
> >+            err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE |
> >MPOL_F_ADDR);
> >+            if (err) {
> >+                VLOG_INFO("Error getting NUMA info for vHost Device
> >'%s'",
> >+                        dev->ifname);
> >+                newnode = netdev->socket_id;
> >+            } else if (newnode != netdev->socket_id) {
> >+                netdev->socket_id = newnode;
> >+                /* Change mempool to new NUMA Node */
> >+                dpdk_mp_put(netdev->dpdk_mp);
> >+                netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
> >netdev->mtu);
> >+                /* Request netdev reconfiguration. The port may now be
> >+                 * serviced by a PMD on the new node if enabled in the
> >cpu
> >+                 * mask */
> >+                netdev_request_reconfigure(&netdev->up);
> 
> I think here I would prefer:
> 
> 1) Remembering the configuration change request
>     netdev->requested_socket_id = newnode
> 2) Calling netdev_request_reconfigure()
> 3) In the netdev_dpdk_vhost_user_reconfigure() method:
>     netdev->socket_id = netdev_requested_socket_id
> 
> Otherwise the datapath might be confused, because it assumes that the
> socket_id doesn't change while the device is polled by the pmd threads.
> 
> It's safe to change almost everything inside the
> netdev_dpdk_vhost_user_reconfigure(), because the device will not be
> polled by the datapath when the function is called.
> 
> The same applies for the mempool: the actual change should be done
> in netdev_dpdk_vhost_user_reconfigure(), because there might still be
> threads using it.
Makes sense thanks for the suggestion. This is implemented in the v2.

> 
> >+            }
> >+
> >             dev->flags |= VIRTIO_DEV_RUNNING;
> >             /* Disable notifications. */
> >             set_irq_status(dev);
> >@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev)
> >         return -1;
> >     }
> >
> >-    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added", dev-
> >ifname,
> >-              dev->device_fh);
> >+    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket
> %i",
> >+              dev->ifname, dev->device_fh, newnode);
> >     return 0;
> > }
> >
> >--
> >2.4.3
> >
> >_______________________________________________
> >dev mailing list
> >dev@openvswitch.org
> >http://openvswitch.org/mailman/listinfo/dev
diff mbox

Patch

diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index dca79bd..82e6908 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -33,6 +33,10 @@  on Debian/Ubuntu)
 
      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
 
+     Enable NUMA-aware vHost by modifying the following in the same file:
+
+     `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
+
      Then run `make install` to build and install the library.
      For default install without IVSHMEM:
 
@@ -383,7 +387,7 @@  Performance Tuning:
 
 	It is good practice to ensure that threads that are in the datapath are
 	pinned to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs
-	responsible for forwarding.
+	responsible for forwarding. This is now default behavior for vHost ports.
 
   9. Rx Mergeable buffers
 
diff --git a/acinclude.m4 b/acinclude.m4
index 11c7787..432bdbd 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -199,7 +199,7 @@  AC_DEFUN([OVS_CHECK_DPDK], [
     found=false
     save_LIBS=$LIBS
     for extras in "" "-ldl"; do
-        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
+        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"
         AC_LINK_IFELSE(
            [AC_LANG_PROGRAM([#include <rte_config.h>
                              #include <rte_eal.h>],
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 17b8d51..4e1ce53 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -29,6 +29,7 @@ 
 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <numaif.h>
 
 #include "dirs.h"
 #include "dp-packet.h"
@@ -1878,6 +1879,8 @@  new_device(struct virtio_net *dev)
 {
     struct netdev_dpdk *netdev;
     bool exists = false;
+    int newnode = 0;
+    long err = 0;
 
     ovs_mutex_lock(&dpdk_mutex);
     /* Add device to the vhost port with the same name as that passed down. */
@@ -1891,6 +1894,24 @@  new_device(struct virtio_net *dev)
             }
             ovsrcu_set(&netdev->virtio_dev, dev);
             exists = true;
+
+            /* Get NUMA information */
+            err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE | MPOL_F_ADDR);
+            if (err) {
+                VLOG_INFO("Error getting NUMA info for vHost Device '%s'",
+                        dev->ifname);
+                newnode = netdev->socket_id;
+            } else if (newnode != netdev->socket_id) {
+                netdev->socket_id = newnode;
+                /* Change mempool to new NUMA Node */
+                dpdk_mp_put(netdev->dpdk_mp);
+                netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
+                /* Request netdev reconfiguration. The port may now be
+                 * serviced by a PMD on the new node if enabled in the cpu
+                 * mask */
+                netdev_request_reconfigure(&netdev->up);
+            }
+
             dev->flags |= VIRTIO_DEV_RUNNING;
             /* Disable notifications. */
             set_irq_status(dev);
@@ -1907,8 +1928,8 @@  new_device(struct virtio_net *dev)
         return -1;
     }
 
-    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added", dev->ifname,
-              dev->device_fh);
+    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket %i",
+              dev->ifname, dev->device_fh, newnode);
     return 0;
 }