diff mbox series

[ovs-dev,v5,1/3] netdev-dpdk: Add shared mempool config.

Message ID 20220624101325.1792748-2-ktraynor@redhat.com
State Accepted
Commit 3757e9f8e9c3190cf981740fec662e84b12d07ef
Headers show
Series DPDK shared mempool config. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Kevin Traynor June 24, 2022, 10:13 a.m. UTC
Mempools may currently be shared between DPDK ports based
on port MTU and NUMA. With some hint from the user we can
increase the sharing on MTU and hence reduce memory
consumption in many cases.

For example, a port with MTU 9000, uses a mempool with an
mbuf size based on 9000 MTU. A port with MTU 1500, uses a
different mempool with an mbuf size based on 1500 MTU.

In this case, assuming same NUMA, both these ports could
share the 9000 MTU mempool.

The user must give a hint as order of creation of ports and
setting of MTUs may vary and we need to ensure that upgrades
from older OVS versions do not require more memory.

This scheme can also prevent multiple mempools being created
for cases where a port is added picking up a default MTU and
an appropriate mempool, but later has it's MTU changed to a
different value requiring a different mempool.

Example usage:

 $ ovs-vsctl --no-wait set Open_vSwitch . \
   other_config:shared-mempool-config=9000,1500:1,6000:1

Port added on NUMA 0:
* MTU 1500, use mempool based on 9000 MTU
* MTU 5000, use mempool based on 9000 MTU
* MTU 9000, use mempool based on 9000 MTU
* MTU 9300, use mempool based on 9300 MTU (existing behaviour)

Port added on NUMA 1:
* MTU 1500, use mempool based on 1500 MTU
* MTU 5000, use mempool based on 6000 MTU
* MTU 9000, use mempool based on 9000 MTU
* MTU 9300, use mempool based on 9300 MTU (existing behaviour)

Default behaviour is unchanged and mempools are still only created
when needed.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
---
 Documentation/topics/dpdk/memory.rst |  44 +++++++++++
 NEWS                                 |   3 +
 lib/dpdk.c                           |   2 +-
 lib/netdev-dpdk.c                    | 109 ++++++++++++++++++++++++++-
 lib/netdev-dpdk.h                    |   5 +-
 vswitchd/vswitch.xml                 |  37 +++++++++
 6 files changed, 195 insertions(+), 5 deletions(-)

Comments

Pai G, Sunil July 8, 2022, 9:18 p.m. UTC | #1
Hi Kevin, 

Thanks for the patch. 
LGTM in general, couple of minor comments/thoughts inline.

<snipped>

> +struct user_mempool_config {
> +    int adj_mtu;
> +    int socket_id;
> +};
> +
> +static struct user_mempool_config *user_mempools = NULL; static int
> +n_user_mempools;

Should we consider restricting n_user_mempools to some max value ?

> 
>  /* There should be one 'struct dpdk_tx_queue' created for @@ -573,4
> +582,42 @@ dpdk_buf_size(int mtu)  }
> 
> +static int
> +dpdk_get_user_adjusted_mtu(int port_adj_mtu, int port_mtu, int
> +port_socket_id) {
> +    int best_adj_user_mtu = INT_MAX;
> +
> +    for (unsigned i = 0; i < n_user_mempools; i++) {
> +        int user_adj_mtu, user_socket_id;
> +
> +        user_adj_mtu = user_mempools[i].adj_mtu;
> +        user_socket_id = user_mempools[i].socket_id;
> +        if (port_adj_mtu > user_adj_mtu
> +            || (user_socket_id != INT_MAX
> +                && user_socket_id != port_socket_id)) {
> +            continue;
> +        }
> +        if (user_adj_mtu < best_adj_user_mtu) {
> +            /* This is the is the lowest valid user MTU. */
> +            best_adj_user_mtu = user_adj_mtu;
> +            if (best_adj_user_mtu == port_adj_mtu) {
> +                /* Found an exact fit, no need to keep searching. */
> +                break;
> +            }
> +        }


I guess this is more appropriate than sorting etc to find lowest valid user MTU since we assume n_user_mempools will be quite small ?


> +    }
> +    if (best_adj_user_mtu == INT_MAX) {
> +        VLOG_DBG("No user configured shared mempool mbuf sizes found "
> +                 "suitable for port with MTU %d, NUMA %d.", port_mtu,
> +                 port_socket_id);
> +        best_adj_user_mtu = port_adj_mtu;
> +    } else {
> +        VLOG_DBG("Found user configured shared mempool with mbufs "
> +                 "of size %d, suitable for port with MTU %d, NUMA %d.",
> +                 MTU_TO_FRAME_LEN(best_adj_user_mtu), port_mtu,
> +                 port_socket_id);
> +    }
> +    return best_adj_user_mtu;
> +}
> +

<snipped>

Thanks and regards,
Sunil
Kevin Traynor July 11, 2022, 5:54 p.m. UTC | #2
On Fri, Jul 8, 2022 at 10:18 PM Pai G, Sunil <sunil.pai.g@intel.com> wrote:
>
> Hi Kevin,
>

Hi Sunil,

thanks for reviewing the patches.

> Thanks for the patch.
> LGTM in general, couple of minor comments/thoughts inline.
>
> <snipped>
>
> > +struct user_mempool_config {
> > +    int adj_mtu;
> > +    int socket_id;
> > +};
> > +
> > +static struct user_mempool_config *user_mempools = NULL; static int
> > +n_user_mempools;
>
> Should we consider restricting n_user_mempools to some max value ?
>

We could do. There is no issue with having many values, it just might
not make much sense for a user.

Is your concern that someone might maliciously keep adding values?
I'm not sure if there is some protection already within the smap_get
that would prevent this. i'd need to check that.
If there is, then that would be covered already. Otherwise we could
add some arbitrary max number like 100.

> >
> >  /* There should be one 'struct dpdk_tx_queue' created for @@ -573,4
> > +582,42 @@ dpdk_buf_size(int mtu)  }
> >
> > +static int
> > +dpdk_get_user_adjusted_mtu(int port_adj_mtu, int port_mtu, int
> > +port_socket_id) {
> > +    int best_adj_user_mtu = INT_MAX;
> > +
> > +    for (unsigned i = 0; i < n_user_mempools; i++) {
> > +        int user_adj_mtu, user_socket_id;
> > +
> > +        user_adj_mtu = user_mempools[i].adj_mtu;
> > +        user_socket_id = user_mempools[i].socket_id;
> > +        if (port_adj_mtu > user_adj_mtu
> > +            || (user_socket_id != INT_MAX
> > +                && user_socket_id != port_socket_id)) {
> > +            continue;
> > +        }
> > +        if (user_adj_mtu < best_adj_user_mtu) {
> > +            /* This is the is the lowest valid user MTU. */
> > +            best_adj_user_mtu = user_adj_mtu;
> > +            if (best_adj_user_mtu == port_adj_mtu) {
> > +                /* Found an exact fit, no need to keep searching. */
> > +                break;
> > +            }
> > +        }
>
>
> I guess this is more appropriate than sorting etc to find lowest valid user MTU since we assume n_user_mempools will be quite small ?
>

It is relying on MTU *and* socket id (which may be wildcard/any NUMA),
so we couldn't sort it as a flat list.
There would need to be multiple lists created for every numa etc

There is no need to go to that trouble to speed up the look up when we
expect a few entries and we only look this up on the control path
when we are changing MTU. So I think it's fine the way it is.

Kevin.

> > +    }
> > +    if (best_adj_user_mtu == INT_MAX) {
> > +        VLOG_DBG("No user configured shared mempool mbuf sizes found "
> > +                 "suitable for port with MTU %d, NUMA %d.", port_mtu,
> > +                 port_socket_id);
> > +        best_adj_user_mtu = port_adj_mtu;
> > +    } else {
> > +        VLOG_DBG("Found user configured shared mempool with mbufs "
> > +                 "of size %d, suitable for port with MTU %d, NUMA %d.",
> > +                 MTU_TO_FRAME_LEN(best_adj_user_mtu), port_mtu,
> > +                 port_socket_id);
> > +    }
> > +    return best_adj_user_mtu;
> > +}
> > +
>
> <snipped>
>
> Thanks and regards,
> Sunil
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Pai G, Sunil July 11, 2022, 6:37 p.m. UTC | #3
Hi Kevin, 

Responses inline.

> > > +static struct user_mempool_config *user_mempools = NULL; static int
> > > +n_user_mempools;
> >
> > Should we consider restricting n_user_mempools to some max value ?
> >
> 
> We could do. There is no issue with having many values, it just might not
> make much sense for a user.
> 
> Is your concern that someone might maliciously keep adding values?

Yes, this was my concern. 

> I'm not sure if there is some protection already within the smap_get that
> would prevent this. i'd need to check that.
> If there is, then that would be covered already. Otherwise we could add
> some arbitrary max number like 100.

Sure, sounds good.
Otherwise, the patch looks good. I'll be happy to ack!

> 
> > >
> > >  /* There should be one 'struct dpdk_tx_queue' created for @@ -573,4
> > > +582,42 @@ dpdk_buf_size(int mtu)  }
> > >
> > > +static int
> > > +dpdk_get_user_adjusted_mtu(int port_adj_mtu, int port_mtu, int
> > > +port_socket_id) {
> > > +    int best_adj_user_mtu = INT_MAX;
> > > +
> > > +    for (unsigned i = 0; i < n_user_mempools; i++) {
> > > +        int user_adj_mtu, user_socket_id;
> > > +
> > > +        user_adj_mtu = user_mempools[i].adj_mtu;
> > > +        user_socket_id = user_mempools[i].socket_id;
> > > +        if (port_adj_mtu > user_adj_mtu
> > > +            || (user_socket_id != INT_MAX
> > > +                && user_socket_id != port_socket_id)) {
> > > +            continue;
> > > +        }
> > > +        if (user_adj_mtu < best_adj_user_mtu) {
> > > +            /* This is the is the lowest valid user MTU. */
> > > +            best_adj_user_mtu = user_adj_mtu;
> > > +            if (best_adj_user_mtu == port_adj_mtu) {
> > > +                /* Found an exact fit, no need to keep searching. */
> > > +                break;
> > > +            }
> > > +        }
> >
> >
> > I guess this is more appropriate than sorting etc to find lowest valid
> user MTU since we assume n_user_mempools will be quite small ?
> >
> 
> It is relying on MTU *and* socket id (which may be wildcard/any NUMA), so
> we couldn't sort it as a flat list.
> There would need to be multiple lists created for every numa etc
> 
> There is no need to go to that trouble to speed up the look up when we
> expect a few entries and we only look this up on the control path when we
> are changing MTU. So I think it's fine the way it is.
> 
> Kevin.

I agree with your assessment Kevin 😊

> 
> > > +    }
> > > +    if (best_adj_user_mtu == INT_MAX) {
> > > +        VLOG_DBG("No user configured shared mempool mbuf sizes found
> "
> > > +                 "suitable for port with MTU %d, NUMA %d.", port_mtu,
> > > +                 port_socket_id);
> > > +        best_adj_user_mtu = port_adj_mtu;
> > > +    } else {
> > > +        VLOG_DBG("Found user configured shared mempool with mbufs "
> > > +                 "of size %d, suitable for port with MTU %d, NUMA
> %d.",
> > > +                 MTU_TO_FRAME_LEN(best_adj_user_mtu), port_mtu,
> > > +                 port_socket_id);
> > > +    }
> > > +    return best_adj_user_mtu;
> > > +}
> > > +
> >
> > <snipped>
> >
> > Thanks and regards,
> > Sunil
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff mbox series

Patch

diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
index 8b7758e6e..9714d79d4 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -214,2 +214,46 @@  Example 3: (2 rxq, 2 PMD, 9000 MTU)
  Mbuf size = 10176 Bytes
  Memory required = 26656 * 10176 = 271 MB
+
+Shared Mempool Configuration
+----------------------------
+
+In order to increase sharing of mempools, a user can configure the MTUs which
+mempools are based on by using ``shared-mempool-config``.
+
+An MTU configured by the user is adjusted to an mbuf size used for mempool
+creation and stored. If a port is subsequently added that has an MTU which can
+be accommodated by this mbuf size, it will be used for mempool creation/reuse.
+
+This can increase sharing by consolidating mempools for ports with different
+MTUs which would otherwise use separate mempools. It can also help to remove
+the need for mempools being created after a port is added but before it's MTU
+is changed to a different value.
+
+For example, on a 2 NUMA system::
+
+  $ ovs-vsctl ovs-vsctl --no-wait set Open_vSwitch . \
+  other_config:shared-mempool-config=9000,1500:1,6000:1
+
+
+In this case, OVS stores the mbuf sizes based on the following MTUs.
+
+* NUMA 0: 9000
+* NUMA 1: 1500, 6000, 9000
+
+Ports added will use mempools with the mbuf sizes based on the above MTUs where
+possible. If there is more than one suitable, the one closest to the MTU will
+be selected.
+
+Port added on NUMA 0:
+
+* MTU 1500, use mempool based on 9000 MTU
+* MTU 6000, use mempool based on 9000 MTU
+* MTU 9000, use mempool based on 9000 MTU
+* MTU 9300, use mempool based on 9300 MTU (existing behaviour)
+
+Port added on NUMA 1:
+
+* MTU 1500, use mempool based on 1500 MTU
+* MTU 6000, use mempool based on 6000 MTU
+* MTU 9000, use mempool based on 9000 MTU
+* MTU 9300, use mempool based on 9300 MTU (existing behaviour)
diff --git a/NEWS b/NEWS
index 9fe3f44f4..846a8d969 100644
--- a/NEWS
+++ b/NEWS
@@ -33,4 +33,7 @@  Post-v2.17.0
      * OVS validated with DPDK 21.11.1.  It is recommended to use this version
        until further releases.
+     * New configuration knob 'other_config:shared-mempool-config' to set MTU
+       that shared mempool mbuf size is based on. This allows interfaces with
+       different MTU sizes to share mempools.
 
 
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 6886fbd9d..d909974f9 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -519,5 +519,5 @@  dpdk_init__(const struct smap *ovs_other_config)
 
     /* Finally, register the dpdk classes */
-    netdev_dpdk_register();
+    netdev_dpdk_register(ovs_other_config);
     netdev_register_flow_api_provider(&netdev_offload_dpdk);
     return true;
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index f9535bfb4..a4b62ec6d 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -54,4 +54,5 @@ 
 #include "openvswitch/list.h"
 #include "openvswitch/match.h"
+#include "openvswitch/ofp-parse.h"
 #include "openvswitch/ofp-print.h"
 #include "openvswitch/shash.h"
@@ -371,5 +372,13 @@  struct dpdk_mp {
      int refcount;
      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
- };
+};
+
+struct user_mempool_config {
+    int adj_mtu;
+    int socket_id;
+};
+
+static struct user_mempool_config *user_mempools = NULL;
+static int n_user_mempools;
 
 /* There should be one 'struct dpdk_tx_queue' created for
@@ -573,4 +582,42 @@  dpdk_buf_size(int mtu)
 }
 
+static int
+dpdk_get_user_adjusted_mtu(int port_adj_mtu, int port_mtu, int port_socket_id)
+{
+    int best_adj_user_mtu = INT_MAX;
+
+    for (unsigned i = 0; i < n_user_mempools; i++) {
+        int user_adj_mtu, user_socket_id;
+
+        user_adj_mtu = user_mempools[i].adj_mtu;
+        user_socket_id = user_mempools[i].socket_id;
+        if (port_adj_mtu > user_adj_mtu
+            || (user_socket_id != INT_MAX
+                && user_socket_id != port_socket_id)) {
+            continue;
+        }
+        if (user_adj_mtu < best_adj_user_mtu) {
+            /* This is the is the lowest valid user MTU. */
+            best_adj_user_mtu = user_adj_mtu;
+            if (best_adj_user_mtu == port_adj_mtu) {
+                /* Found an exact fit, no need to keep searching. */
+                break;
+            }
+        }
+    }
+    if (best_adj_user_mtu == INT_MAX) {
+        VLOG_DBG("No user configured shared mempool mbuf sizes found "
+                 "suitable for port with MTU %d, NUMA %d.", port_mtu,
+                 port_socket_id);
+        best_adj_user_mtu = port_adj_mtu;
+    } else {
+        VLOG_DBG("Found user configured shared mempool with mbufs "
+                 "of size %d, suitable for port with MTU %d, NUMA %d.",
+                 MTU_TO_FRAME_LEN(best_adj_user_mtu), port_mtu,
+                 port_socket_id);
+    }
+    return best_adj_user_mtu;
+}
+
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
  *
@@ -796,4 +843,8 @@  dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
      * to see if reuse is possible. */
     if (!per_port_mp) {
+        /* If user has provided defined mempools, check if one is suitable
+         * and get new buffer size.*/
+        mtu = dpdk_get_user_adjusted_mtu(mtu, dev->requested_mtu,
+                                         dev->requested_socket_id);
         LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
             if (dmp->socket_id == dev->requested_socket_id
@@ -5335,4 +5386,54 @@  netdev_dpdk_rte_flow_tunnel_item_release(struct netdev *netdev,
 #endif /* ALLOW_EXPERIMENTAL_API */
 
+static void
+parse_user_mempools_list(const char *mtus)
+{
+    char *list, *copy, *key, *value;
+    int error = 0;
+
+    if (!mtus) {
+        return;
+    }
+
+    n_user_mempools = 0;
+    list = copy = xstrdup(mtus);
+
+    while (ofputil_parse_key_value(&list, &key, &value)) {
+        int socket_id, mtu, adj_mtu;
+
+        if (!str_to_int(key, 0, &mtu) || mtu < 0) {
+            error = EINVAL;
+            VLOG_WARN("Invalid user configured shared mempool MTU.");
+            break;
+        }
+
+        if (!str_to_int(value, 0, &socket_id)) {
+            /* No socket specified. It will apply for all numas. */
+            socket_id = INT_MAX;
+        } else if (socket_id < 0) {
+            error = EINVAL;
+            VLOG_WARN("Invalid user configured shared mempool NUMA.");
+            break;
+        }
+
+        user_mempools = xrealloc(user_mempools, (n_user_mempools + 1) *
+                                 sizeof(struct user_mempool_config));
+        adj_mtu = FRAME_LEN_TO_MTU(dpdk_buf_size(mtu));
+        user_mempools[n_user_mempools].adj_mtu = adj_mtu;
+        user_mempools[n_user_mempools].socket_id = socket_id;
+        n_user_mempools++;
+        VLOG_INFO("User configured shared mempool set for: MTU %d, NUMA %s.",
+                  mtu, socket_id == INT_MAX ? "ALL" : value);
+    }
+
+    if (error) {
+        VLOG_WARN("User configured shared mempools will not be used.");
+        n_user_mempools = 0;
+        free(user_mempools);
+        user_mempools = NULL;
+    }
+    free(copy);
+}
+
 #define NETDEV_DPDK_CLASS_COMMON                            \
     .is_pmd = true,                                         \
@@ -5418,6 +5519,10 @@  static const struct netdev_class dpdk_vhost_client_class = {
 
 void
-netdev_dpdk_register(void)
+netdev_dpdk_register(const struct smap *ovs_other_config)
 {
+    const char *mempoolcfg = smap_get(ovs_other_config,
+                                      "shared-mempool-config");
+
+    parse_user_mempools_list(mempoolcfg);
     netdev_register_provider(&dpdk_class);
     netdev_register_provider(&dpdk_vhost_class);
diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
index 699be3fb4..7d2f64af2 100644
--- a/lib/netdev-dpdk.h
+++ b/lib/netdev-dpdk.h
@@ -21,4 +21,5 @@ 
 
 #include "openvswitch/compiler.h"
+#include "smap.h"
 
 struct dp_packet;
@@ -29,5 +30,5 @@  struct netdev;
 #include <rte_flow.h>
 
-void netdev_dpdk_register(void);
+void netdev_dpdk_register(const struct smap *);
 void free_dpdk_buf(struct dp_packet *);
 
@@ -151,5 +152,5 @@  netdev_dpdk_rte_flow_tunnel_item_release(
 
 static inline void
-netdev_dpdk_register(void)
+netdev_dpdk_register(const struct smap *ovs_other_config OVS_UNUSED)
 {
     /* Nothing */
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index cc1dd77ec..98486c009 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -491,4 +491,41 @@ 
       </column>
 
+      <column name="other_config" key="shared-mempool-config">
+              <p>Specifies dpdk shared mempool config.</p>
+              <p>Value should be set in the following form:</p>
+              <p>
+                <code>other_config:shared-mempool-config=&lt;
+                  user-shared-mempool-mtu-list&gt;</code>
+              </p>
+              <p>where</p>
+              <p>
+                <ul>
+                  <li>
+                    &lt;user-shared-mempool-mtu-list&gt; ::=
+                    NULL | &lt;non-empty-list&gt;
+                  </li>
+                  <li>
+                    &lt;non-empty-list&gt; ::= &lt;user-mtus&gt; |
+                                               &lt;user-mtus&gt; ,
+                                               &lt;non-empty-list&gt;
+                  </li>
+                  <li>
+                    &lt;user-mtus&gt; ::= &lt;mtu-all-socket&gt; |
+                                          &lt;mtu-socket-pair&gt;
+                  </li>
+                  <li>
+                    &lt;mtu-all-socket&gt; ::= &lt;mtu&gt;
+                  </li>
+                  <li>
+                    &lt;mtu-socket-pair&gt; ::= &lt;mtu&gt; : &lt;socket-id&gt;
+                  </li>
+                </ul>
+              </p>
+        <p>
+          Changing this value requires restarting the daemon if dpdk-init has
+          already been set to true.
+        </p>
+      </column>
+
       <column name="other_config" key="tx-flush-interval"
               type='{"type": "integer",