diff mbox series

[ovs-dev,RFC,v1] dpdk: Support both shared and per port mempools.

Message ID 1526653836-21599-1-git-send-email-ian.stokes@intel.com
State Superseded
Delegated to: Ian Stokes
Headers show
Series [ovs-dev,RFC,v1] dpdk: Support both shared and per port mempools. | expand

Commit Message

Stokes, Ian May 18, 2018, 2:30 p.m. UTC
This commit re-introduces the concept of shared mempools as the default
memory model for DPDK devices. Per port mempools are still available but
must be enabled explicitly by a user.

OVS previously used a shared mempool model for ports with the same MTU
and socket configuration. This was replaced by a per port mempool model
to address issues flagged by users such as:

https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html

However the per port model potentially requires an increase in memory
resource requirements to support the same number of ports and configuration
as the shared port model.

This is considered a blocking factor for current deployments of OVS
when upgrading to future OVS releases as a user may have to redimension
memory for the same deployment configuration. This may not be possible for
users.

This commit resolves the issue by re-introducing shared mempools as
the default memory behaviour in OVS DPDK but also refactors the memory
configuration code to allow for per port mempools.

This patch adds a new global config option, per-port-mp-enabled, that
controls the enablement of per port mempools for DPDK devices.

    ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true

This value defaults to false; to enable per port mempool support,
this field should be set to true when setting other global parameters
on init (such as "dpdk-socket-mem", for example). Changing the value at
runtime is not supported, and requires restarting the vswitch
daemon.

The mempool sweep functionality is also replaced with the
sweep functionality from OVS 2.9 found in commits

c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)

As this patch is RFC there are a number of TO-DOs including adding a
memory calculation section to the documentation for both models. This is
expected to be completed in the v1 after RFC.

Signed-off-by: Ian Stokes <ian.stokes@intel.com>
---
 Documentation/automake.mk            |   1 +
 Documentation/topics/dpdk/index.rst  |   1 +
 Documentation/topics/dpdk/memory.rst |  67 +++++++
 NEWS                                 |   1 +
 lib/dpdk-stub.c                      |   6 +
 lib/dpdk.c                           |  13 ++
 lib/dpdk.h                           |   1 +
 lib/netdev-dpdk.c                    | 326 +++++++++++++++++++++--------------
 vswitchd/vswitch.xml                 |  16 ++
 9 files changed, 305 insertions(+), 127 deletions(-)
 create mode 100644 Documentation/topics/dpdk/memory.rst

Comments

Kevin Traynor May 25, 2018, 10:57 a.m. UTC | #1
On 05/18/2018 03:30 PM, Ian Stokes wrote:
> This commit re-introduces the concept of shared mempools as the default
> memory model for DPDK devices. Per port mempools are still available but
> must be enabled explicitly by a user.
> 
> OVS previously used a shared mempool model for ports with the same MTU
> and socket configuration. This was replaced by a per port mempool model
> to address issues flagged by users such as:
> 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
> 
> However the per port model potentially requires an increase in memory
> resource requirements to support the same number of ports and configuration
> as the shared port model.
> 
> This is considered a blocking factor for current deployments of OVS
> when upgrading to future OVS releases as a user may have to redimension
> memory for the same deployment configuration. This may not be possible for
> users.
> 

Hi Ian, thanks for addressing this. Few comments below...

> This commit resolves the issue by re-introducing shared mempools as
> the default memory behaviour in OVS DPDK but also refactors the memory
> configuration code to allow for per port mempools.

> This patch adds a new global config option, per-port-mp-enabled, that
> controls the enablement of per port mempools for DPDK devices.
> 
>     ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> 

It doesn't need 'enabled' in the name. If we say
'per-port-mempool=true/false', then it's already obvious that it's enabled.
Actually, I wonder if 'per-port-memory' would be better, as it doesn't
require a user to know what a DPDK mempool is.

> This value defaults to false; to enable per port mempool support,
> this field should be set to true when setting other global parameters
> on init (such as "dpdk-socket-mem", for example). Changing the value at
> runtime is not supported, and requires restarting the vswitch
> daemon.
> 
> The mempool sweep functionality is also replaced with the
> sweep functionality from OVS 2.9 found in commits
> 
> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
> 
> As this patch is RFC there are a number of TO-DOs including adding a
> memory calculation section to the documentation for both models. This is
> expected to be completed in the v1 after RFC.
> 
> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> ---
>  Documentation/automake.mk            |   1 +
>  Documentation/topics/dpdk/index.rst  |   1 +
>  Documentation/topics/dpdk/memory.rst |  67 +++++++
>  NEWS                                 |   1 +
>  lib/dpdk-stub.c                      |   6 +
>  lib/dpdk.c                           |  13 ++
>  lib/dpdk.h                           |   1 +
>  lib/netdev-dpdk.c                    | 326 +++++++++++++++++++++--------------
>  vswitchd/vswitch.xml                 |  16 ++
>  9 files changed, 305 insertions(+), 127 deletions(-)
>  create mode 100644 Documentation/topics/dpdk/memory.rst
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 683ca14..14c2189 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>  	Documentation/topics/dpdk/index.rst \
>  	Documentation/topics/dpdk/bridge.rst \
>  	Documentation/topics/dpdk/jumbo-frames.rst \
> +	Documentation/topics/dpdk/memory.rst \
>  	Documentation/topics/dpdk/pdump.rst \
>  	Documentation/topics/dpdk/phy.rst \
>  	Documentation/topics/dpdk/pmd.rst \
> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> index 181f61a..cf24a7b 100644
> --- a/Documentation/topics/dpdk/index.rst
> +++ b/Documentation/topics/dpdk/index.rst
> @@ -40,3 +40,4 @@ The DPDK Datapath
>     /topics/dpdk/qos
>     /topics/dpdk/pdump
>     /topics/dpdk/jumbo-frames
> +   /topics/dpdk/memory
> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> new file mode 100644
> index 0000000..1198067
> --- /dev/null
> +++ b/Documentation/topics/dpdk/memory.rst
> @@ -0,0 +1,67 @@
> +..
> +        Copyright 2018, Intel, Inc.
> +
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +=========================
> +DPDK Device Memory Models
> +=========================
> +
> +DPDK device memory can be allocated in one of two ways in OVS DPDK:
> +shared mempools or per port mempools. The specifics of both are detailed
> +below.
> +
> +Shared Mempools
> +---------------
> +
> +By default OVS DPDK uses a shared mempool model. This means that multiple
> +ports can share the same mempool. For example, when a port is added it will
> +have a given MTU and socket ID associated with it. If a mempool has been
> +created previously for an existing port that has the same MTU and socket ID,
> +that mempool is used for both ports. If there is no existing mempool
> +supporting these parameters then a new mempool is created.
> +
> +
> +Per Port Mempools
> +-----------------
> +
> +In the per port mempool model, mempools are created per device and are not
> +shared. The benefit of this is a more transparent memory model where mempools
> +will not be exhausted by other DPDK devices. However, this comes at the cost
> +of potentially increased memory requirements for a given deployment. Users should
> +be aware of the memory requirements for their deployment before using this
> +model and allocate the required hugepage memory.
> +
> +Per port mempool support may be enabled via a global config value,
> +``per-port-mp-enabled``. Setting this to true enables the per port mempool
> +model for all DPDK devices in OVS::
> +
> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> +
> +.. important::
> +
> +    Changing this value requires restarting the daemon.
> +

It's a good general guide, but the daemon does not need to be restarted
if the value is set before dpdk-init=true.

> +.. todo::
> +
> +   Add memory calculation section for both mempool models.
> diff --git a/NEWS b/NEWS
> index ec548b0..c9991cf 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -30,6 +30,7 @@ Post-v2.9.0
>       * New 'check-dpdk' Makefile target to run a new system testsuite.
>         See Testing topic for the details.
>       * Add LSC interrupt support for DPDK physical devices.
> +     * Support both shared and per port mempools for DPDK devices.
>     - Userspace datapath:
>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>       * Detailed PMD performance metrics available with new command
> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> index 041cd0c..d2a9718 100644
> --- a/lib/dpdk-stub.c
> +++ b/lib/dpdk-stub.c
> @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void)
>      return false;
>  }
>  
> +bool
> +dpdk_per_port_mempool_enabled(void)
> +{
> +    return false;
> +}
> +
>  void
>  print_dpdk_version(void)
>  {
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 00dd974..e251970 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -43,6 +43,8 @@ static FILE *log_stream = NULL;       /* Stream for DPDK log redirection */
>  
>  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
> +static bool per_port_mp_enabled = false; /* Status of per port mempool
> +                                          * support */
>  
>  static int
>  process_vhost_flags(char *flag, const char *default_val, int size,
> @@ -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>      VLOG_INFO("IOMMU support for vhost-user-client %s.",
>                 vhost_iommu_enabled ? "enabled" : "disabled");
>  
> +    per_port_mp_enabled = smap_get_bool(ovs_other_config,
> +                                        "per-port-mp-enabled", false);
> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
> +               per_port_mp_enabled ? "enabled" : "disabled");
> +
>      argv = grow_argv(&argv, 0, 1);
>      argc = 1;
>      argv[0] = xstrdup(ovs_get_program_name());
> @@ -498,6 +505,12 @@ dpdk_vhost_iommu_enabled(void)
>      return vhost_iommu_enabled;
>  }
>  
> +bool
> +dpdk_per_port_mempool_enabled(void)
> +{
> +    return per_port_mp_enabled;
> +}
> +
>  void
>  dpdk_set_lcore_id(unsigned cpu)
>  {
> diff --git a/lib/dpdk.h b/lib/dpdk.h
> index b041535..243385c 100644
> --- a/lib/dpdk.h
> +++ b/lib/dpdk.h
> @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>  void dpdk_set_lcore_id(unsigned cpu);
>  const char *dpdk_get_vhost_sock_dir(void);
>  bool dpdk_vhost_iommu_enabled(void);
> +bool dpdk_per_port_mempool_enabled(void);
>  void print_dpdk_version(void);
>  
>  #endif /* dpdk.h */
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 87152a7..cda2ace 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
>  
> -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
> - * roughly estimated number of mbufs: if this fails (because the system doesn't
> - * have enough hugepages) we keep halving the number until the allocation
> - * succeeds or we reach MIN_NB_MBUF */
> +/* Max and min number of packets in the mempool.  OVS tries to allocate a
> + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> + * enough hugepages) we keep halving the number until the allocation succeeds
> + * or we reach MIN_NB_MBUF */
> +
> +#define MAX_NB_MBUF          (4096 * 64)
>  #define MIN_NB_MBUF          (4096 * 4)
>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>  
> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
> +                  == 0);
> +
> +/* The smallest possible NB_MBUF that we're going to try should be a multiple
> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
> +                  % MP_CACHE_SZ == 0);
> +
>  /*
>   * DPDK XSTATS Counter names definition
>   */
> @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
>  static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
>      = OVS_MUTEX_INITIALIZER;
>  
> -/* Contains all 'struct dpdk_mp's. */

Not strictly needed, but comment is still valid

> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>  
> -/* Wrapper for a mempool released but not yet freed. */
>  struct dpdk_mp {
>       struct rte_mempool *mp;
> +     int mtu;
> +     int socket_id;
> +     int refcount;
>       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>   };
>  
> @@ -381,7 +393,7 @@ struct netdev_dpdk {
>  
>      PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>          struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
> -        struct rte_mempool *mp;
> +        struct dpdk_mp *dpdk_mp;
>  
>          /* virtio identifier for vhost devices */
>          ovsrcu_index vid;
> @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
>      dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
>  }
>  
> -static int
> -dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> -{
> -    unsigned ring_count;
> -    /* This logic is needed because rte_mempool_full() is not guaranteed to
> -     * be atomic and mbufs could be moved from mempool cache --> mempool ring
> -     * during the call. However, as no mbufs will be taken from the mempool
> -     * at this time, we can work around it by also checking the ring entries
> -     * separately and ensuring that they have not changed.
> -     */
> -    ring_count = rte_mempool_ops_get_count(mp);
> -    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> -        return 1;
> -    }
> -
> -    return 0;
> -}
> -
> -/* Free unused mempools. */
> -static void
> -dpdk_mp_sweep(void)
> +/* Calculating the required number of mbufs differs depending on the
> + * mempool model being used. Check if per port mempools are in use before
> + * calculating.
> + */
> +static uint32_t
> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
> -    struct dpdk_mp *dmp, *next;
> +    uint32_t n_mbufs;
>  
> -    ovs_mutex_lock(&dpdk_mp_mutex);
> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> -        if (dpdk_mp_full(dmp->mp)) {
> -            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> -            ovs_list_remove(&dmp->list_node);
> -            rte_mempool_free(dmp->mp);
> -            rte_free(dmp);
> +    if (!per_port_mp) {
> +        /* Shared mempools are being used.
> +         * XXX: this is a really rough method of provisioning memory.

XXX causes warnings in checkpatch (found that out on rte_mempool_full patch)

> +         * It's impossible to determine what the exact memory requirements are
> +         * when the number of ports and rxqs that utilize a particular mempool
> +         * can change dynamically at runtime. For now, use this rough
> +         * heuristic.
> +         */
> +        if (mtu >= ETHER_MTU) {
> +            n_mbufs = MAX_NB_MBUF;
> +        } else {
> +            n_mbufs = MIN_NB_MBUF;
>          }
> +    } else {
> +        /* Per port mempools are being used.
> +         * XXX: rough estimation of number of mbufs required for this port:
> +         * <packets required to fill the device rxqs>
> +         * + <packets that could be stuck on other ports txqs>
> +         * + <packets in the pmd threads>
> +         * + <additional memory for corner cases>
> +         */
> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> +                  + dev->requested_n_txq * dev->requested_txq_size
> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> +                  + MIN_NB_MBUF;
>      }
> -    ovs_mutex_unlock(&dpdk_mp_mutex);
> -}
> -
> -/* Ensure a mempool will not be freed. */
> -static void
> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> -{
> -    struct dpdk_mp *dmp, *next;
>  
> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> -        if (dmp->mp == mp) {
> -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
> -            ovs_list_remove(&dmp->list_node);
> -            rte_free(dmp);
> -            break;
> -        }
> -    }
> +    return n_mbufs;
>  }
>  
> -/* Returns a valid pointer when either of the following is true:
> - *  - a new mempool was just created;
> - *  - a matching mempool already exists. */
> -static struct rte_mempool *
> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> +static struct dpdk_mp *
> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      const char *netdev_name = netdev_get_name(&dev->up);
>      int socket_id = dev->requested_socket_id;
>      uint32_t n_mbufs;
>      uint32_t hash = hash_string(netdev_name, 0);
> -    struct rte_mempool *mp = NULL;
> -
> -    /*
> -     * XXX: rough estimation of number of mbufs required for this port:
> -     * <packets required to fill the device rxqs>
> -     * + <packets that could be stuck on other ports txqs>
> -     * + <packets in the pmd threads>
> -     * + <additional memory for corner cases>
> -     */
> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> -              + dev->requested_n_txq * dev->requested_txq_size
> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> -              + MIN_NB_MBUF;
> +    struct dpdk_mp *dmp = NULL;
> +    int ret;
> +
> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
> +    if (!dmp) {
> +        return NULL;
> +    }
> +    dmp->socket_id = socket_id;
> +    dmp->mtu = mtu;
> +    dmp->refcount = 1;
> +
> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>  
> -    ovs_mutex_lock(&dpdk_mp_mutex);
>      do {
>          /* Full DPDK memory pool name must be unique and cannot be
>           * longer than RTE_MEMPOOL_NAMESIZE. */
> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>                             "ovs%08x%02d%05d%07u",
>                             hash, socket_id, mtu, n_mbufs);

you're using the hash of one dev in the name, but that mempool can later
be used by another dev, and even be no longer used by the original dev.
OTOH no one is going to know what the hash means anyway, so it shouldn't
cause any more confusion when debugging.

>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
> @@ -623,102 +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>                    netdev_name, n_mbufs, socket_id,
>                    dev->requested_n_rxq, dev->requested_n_txq);
>  
> -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> +                                          MP_CACHE_SZ,
> +                                          sizeof (struct dp_packet)
> +                                          - sizeof (struct rte_mbuf),
> +                                          MBUF_SIZE(mtu)
> +                                          - sizeof(struct dp_packet),
> +                                          socket_id);
>  
> -        if (mp) {
> +        if (dmp->mp) {
>              VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>                       mp_name, n_mbufs);
>              /* rte_pktmbuf_pool_create has done some initialization of the
> -             * rte_mbuf part of each dp_packet. Some OvS specific fields
> -             * of the packet still need to be initialized by
> -             * ovs_rte_pktmbuf_init. */
> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
> +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
> +             * initializes some OVS specific fields of dp_packet.
> +             */
> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
> +            return dmp;
>          } else if (rte_errno == EEXIST) {
>              /* A mempool with the same name already exists.  We just
>               * retrieve its pointer to be returned to the caller. */
> -            mp = rte_mempool_lookup(mp_name);
> +            dmp->mp = rte_mempool_lookup(mp_name);
>              /* As the mempool create returned EEXIST we can expect the
>               * lookup has returned a valid pointer.  If for some reason
>               * that's not the case we keep track of it. */
>              VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
> -                     mp_name, mp);
> -            /* Ensure this reused mempool will not be freed. */
> -            dpdk_mp_do_not_free(mp);

The mempool you are about to reuse could have a refcount of 0 and be
about to be freed by a sweep. So you would need something like the
function above before giving up dpdk_mp_mutex. Maybe you could increase
the refcount for it now and re-adjust later if you need to.
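A minimal self-contained model of that ordering (illustrative names, not the actual OVS structures): the reuse path must claim the pool while the mutex is still held, otherwise a sweep that runs in between can free a pool that is about to be used.

```c
#include <stdbool.h>

/* Stand-in for a mempool wrapper; "freed" marks what a sweep would do. */
struct mp_model {
    int refcount;
    bool freed;
};

/* Reuse path: bump the refcount while dpdk_mp_mutex is still held,
 * before any sweep can run (the role dpdk_mp_do_not_free() played). */
static void mp_model_claim(struct mp_model *mp)
{
    mp->refcount++;
}

/* Sweep path: only pools with no remaining references may be freed. */
static void mp_model_sweep(struct mp_model *mp)
{
    if (mp->refcount == 0) {
        mp->freed = true;
    }
}
```

With the claim taken first, a subsequent sweep leaves the pool alone; without it, the sweep frees it out from under the reuser.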

> +                     mp_name, dmp->mp);
>          } else {
>              VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
>                       mp_name, n_mbufs);
>          }
> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>  
> -    ovs_mutex_unlock(&dpdk_mp_mutex);
> -    return mp;
> +    rte_free(dmp);
> +    return NULL;
>  }
>  
> -/* Release an existing mempool. */
> +static int
> +dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> +{
> +    unsigned ring_count;
> +    /* This logic is needed because rte_mempool_full() is not guaranteed to
> +     * be atomic and mbufs could be moved from mempool cache --> mempool ring
> +     * during the call. However, as no mbufs will be taken from the mempool
> +     * at this time, we can work around it by also checking the ring entries
> +     * separately and ensuring that they have not changed.
> +     */
> +    ring_count = rte_mempool_ops_get_count(mp);
> +    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> +        return 1;
> +    }
> +
> +    return 0;

you'll need to rebase because of the changes here. Won't make any
difference to your operation though

> +}
> +
> +/* Free unused mempools. */
>  static void
> -dpdk_mp_release(struct rte_mempool *mp)
> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>  {
> -    if (!mp) {
> -        return;
> +    struct dpdk_mp *dmp, *next;
> +
> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
> +            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> +            ovs_list_remove(&dmp->list_node);
> +            rte_mempool_free(dmp->mp);
> +            rte_free(dmp);
> +        }
>      }
> +}
> +
> +static struct dpdk_mp *
> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> +{
> +    struct dpdk_mp *dmp;
> +    bool reuse = false;
>  
>      ovs_mutex_lock(&dpdk_mp_mutex);
> -    if (dpdk_mp_full(mp)) {
> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
> -        rte_mempool_free(mp);
> -    } else {
> -        struct dpdk_mp *dmp;
> +    /* Check if shared mempools are being used, if so check existing mempools
> +     * to see if reuse is possible. */
> +    if (!per_port_mp) {
> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
> +            if (dmp->socket_id == dev->requested_socket_id
> +                && dmp->mtu == mtu) {
> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
> +                dmp->refcount++;
> +                reuse = true;
> +                break;
> +            }
> +        }
> +    }
> +    /* Sweep mempools after reuse or before create. */
> +    dpdk_mp_sweep();
>  
> -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
> +    if (!reuse) {
> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>          if (dmp) {
> -            dmp->mp = mp;
> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
> +            ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>          }
>      }
> +
>      ovs_mutex_unlock(&dpdk_mp_mutex);
> +
> +    return dmp;
>  }
>  
> -/* Tries to allocate a new mempool - or re-use an existing one where
> - * appropriate - on requested_socket_id with a size determined by
> - * requested_mtu and requested Rx/Tx queues.
> - * On success - or when re-using an existing mempool - the new configuration
> - * will be applied.
> +/* Decrement reference to a mempool. */
> +static void
> +dpdk_mp_put(struct dpdk_mp *dmp)
> +{
> +    if (!dmp) {
> +        return;
> +    }
> +
> +    ovs_mutex_lock(&dpdk_mp_mutex);
> +    ovs_assert(dmp->refcount);
> +    dmp->refcount--;
> +    ovs_mutex_unlock(&dpdk_mp_mutex);
> +}
> +
> +/* Depending on the memory model being used, this function tries to
> + * identify and reuse an existing mempool or tries to allocate a new
> + * mempool on requested_socket_id with an mbuf size corresponding to
> + * requested_mtu. On success the new configuration will be applied.
>   * On error, device will be left unchanged. */
>  static int
>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>      OVS_REQUIRES(dev->mutex)
>  {
>      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
> -    struct rte_mempool *mp;
> +    struct dpdk_mp *mp;
>      int ret = 0;
> +    bool per_port_mp = dpdk_per_port_mempool_enabled();
>  
> -    dpdk_mp_sweep();
> +    /* With shared mempools we do not need to configure a mempool if the MTU
> +     * and socket ID have not changed; the previous configuration is still
> +     * valid, so return 0. */
> +    if (dev->mtu == dev->requested_mtu
> +        && dev->socket_id == dev->requested_socket_id
> +        && (!per_port_mp)) {

You can remove the brackets, and it's probably more intuitive to check
the memory model first.
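Something like the following, sketched with a stub struct rather than the real netdev_dpdk (hypothetical helper name): testing per_port_mp first makes it obvious that the early return only ever applies to the shared scheme.

```c
#include <stdbool.h>

/* Stub carrying just the fields the check needs. */
struct dev_cfg {
    int mtu;
    int requested_mtu;
    int socket_id;
    int requested_socket_id;
};

/* Per port mempools always reconfigure; a shared mempool can be kept
 * when neither the MTU nor the socket ID changed. */
static bool
keep_existing_mempool(const struct dev_cfg *dev, bool per_port_mp)
{
    return !per_port_mp
           && dev->mtu == dev->requested_mtu
           && dev->socket_id == dev->requested_socket_id;
}
```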

> +        return ret;
> +    }
>  
> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>      if (!mp) {
>          VLOG_ERR("Failed to create memory pool for netdev "
>                   "%s, with MTU %d on socket %d: %s\n",
>                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
> -                 rte_strerror(rte_errno));
> -        ret = rte_errno;
> +        rte_strerror(rte_errno));
> +        return rte_errno;
>      } else {
> -        /* If a new MTU was requested and its rounded value equals the one
> -         * that is currently used, then the existing mempool is returned. */
> -        if (dev->mp != mp) {
> -            /* A new mempool was created, release the previous one. */
> -            dpdk_mp_release(dev->mp);
> -        } else {
> -            ret = EEXIST;
> +        /* Check for any pre-existing dpdk_mp for the device */
> +        if (dev->dpdk_mp != NULL) {

Not sure you need this check; dpdk_mp_put() will handle NULL.

> +            /* If a new MTU was requested and its rounded value equals the
> +             * one that is currently used, then the existing mempool is
> +             * returned. */
> +            if (dev->dpdk_mp->mp != mp->mp) {
> +                /* A new mempool was created, release the previous one. */
> +                dpdk_mp_put(dev->dpdk_mp);
> +            } else {
> +                /* If the mempool already exists in the current dpdk_mp then
> +                 * we need to ensure dpdk_mp that was just created is freed in
> +                 * the next sweep as it will not be used. */
> +                dpdk_mp_put(mp);

This doesn't look right - it will likely lead to the mempool you are
using being freed also. I think you just want to free the dpdk_mp
struct, but not the actual mempool.
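A small model of the distinction (illustrative names): two wrappers can point at the same underlying pool, and discarding the surplus wrapper must not touch the pool's refcount, since a put here would let a later sweep free a pool that is still in use.

```c
#include <stdlib.h>

struct pool { int refcount; };          /* stand-in for rte_mempool */
struct wrapper { struct pool *mp; };    /* stand-in for struct dpdk_mp */

/* Duplicate-wrapper case: free only the wrapper struct; the shared
 * pool's refcount is deliberately left untouched. */
static void
discard_duplicate_wrapper(struct wrapper *dup)
{
    free(dup);    /* rte_free() in the real code */
}
```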

> +                ret = EEXIST;

Setting ret, but returning 0 below

> +            }
>          }
> -        dev->mp = mp;
> +        dev->dpdk_mp = mp;
>          dev->mtu = dev->requested_mtu;
>          dev->socket_id = dev->requested_socket_id;
>          dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>      }
>  
> -    return ret;
> +    return 0;
>  }
>  

I know there were lots of paths already, but adding two different schemes
makes the code here and in create even more confusing. I think there are
a few subtle bugs above, and it's probably down to the complexity. I
wonder could you use per_port_mp a bit more so that at least the paths
for each scheme are more segregated? When reviewing, it's hard to tell
whether a code branch would be taken with one scheme, the other, or both.
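One way that segregation could look, sketched with hypothetical helper names and stub bodies: a single entry point dispatches on per_port_mp, and each helper then holds only the logic for its own scheme.

```c
#include <stdbool.h>

struct mp_handle { int scheme; };       /* stand-in for struct dpdk_mp */

static struct mp_handle shared_pool = { 0 };
static struct mp_handle per_port_pool = { 1 };

/* Shared scheme: would search dpdk_mp_list for a (mtu, socket_id)
 * match before falling back to creation. */
static struct mp_handle *
shared_mp_get(void)
{
    return &shared_pool;
}

/* Per port scheme: would always create a fresh mempool for the device. */
static struct mp_handle *
per_port_mp_get(void)
{
    return &per_port_pool;
}

/* Single dispatch point: each branch belongs to exactly one scheme. */
static struct mp_handle *
mp_get(bool per_port_mp)
{
    return per_port_mp ? per_port_mp_get() : shared_mp_get();
}
```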

Kevin.

>  static void
> @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>  
>          for (i = 0; i < n_rxq; i++) {
>              diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
> -                                          dev->socket_id, NULL, dev->mp);
> +                                          dev->socket_id, NULL,
> +                                          dev->dpdk_mp->mp);
>              if (diag) {
>                  VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>                            dev->up.name, i, rte_strerror(-diag));
> @@ -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>      memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>      rte_eth_link_get_nowait(dev->port_id, &dev->link);
>  
> -    mbp_priv = rte_mempool_get_priv(dev->mp);
> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>      dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>  
>      /* Get the Flow control configuration for DPDK-ETH */
> @@ -1176,7 +1248,7 @@ common_destruct(struct netdev_dpdk *dev)
>      OVS_EXCLUDED(dev->mutex)
>  {
>      rte_free(dev->tx_q);
> -    dpdk_mp_release(dev->mp);
> +    dpdk_mp_put(dev->dpdk_mp);
>  
>      ovs_list_remove(&dev->list_node);
>      free(ovsrcu_get_protected(struct ingress_policer *,
> @@ -1928,7 +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>          return EAGAIN;
>      }
>  
> -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>                                      (struct rte_mbuf **) batch->packets,
>                                      NETDEV_MAX_BURST);
>      if (!nb_rx) {
> @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>              continue;
>          }
>  
> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>          if (OVS_UNLIKELY(!pkts[txcnt])) {
>              dropped += cnt - i;
>              break;
> @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
>          ovs_mutex_lock(&dev->mutex);
>          ovs_mutex_lock(&dpdk_mp_mutex);
>  
> -        rte_mempool_dump(stream, dev->mp);
> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>  
>          ovs_mutex_unlock(&dpdk_mp_mutex);
>          ovs_mutex_unlock(&dev->mutex);
> @@ -3742,7 +3814,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>  
>      err = netdev_dpdk_mempool_configure(dev);
>      if (!err) {
> -        /* A new mempool was created. */
> +        /* A new mempool was created or re-used. */
>          netdev_change_seq_changed(&dev->up);
>      } else if (err != EEXIST){
>          return err;
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 7ab90d5..95520af 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -359,6 +359,22 @@
>          </p>
>        </column>
>  
> +      <column name="other_config" key="per-port-mp-enabled"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          By default OVS DPDK uses a shared mempool memory model wherein
> +          devices that have the same MTU and socket values can share the same
> +          mempool. Setting this value to <code>true</code> changes this
> +          behaviour. Per port mempools allow DPDK devices to use a private
> +          mempool per device. This can provide greater transparency as
> +          regards memory usage but potentially at the cost of greater memory
> +          requirements.
> +        </p>
> +        <p>
> +          Changing this value requires restarting the daemon.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="tx-flush-interval"
>                type='{"type": "integer",
>                       "minInteger": 0, "maxInteger": 1000000}'>
>
Lam, Tiago May 25, 2018, 4:53 p.m. UTC | #2
Hi Ian,

Thanks for bringing this forward.

I've tested this on current master by re-configuring MTUs of existing
ports, adding new ports, deleting existing ones, etc. It seems to be
behaving as expected:
- On the shared model, it allocates a new mempool only if the MTU
changes, otherwise reuses the already existing one. And it frees a
mempool only if it isn't being used by any port;
- On the per-port model, it allocates a new mempool for every port. And
it frees the mempools when the ports are destroyed / MTU is changed.

There's a catch in how mempools are freed, though. I have a comment on
that in-line below*.

(I see Kevin has sent a review as well, so I'll refrain from touching on
the same points, except to re-iterate.)

On 18/05/2018 15:30, Ian Stokes wrote:
> This commit re-introduces the concept of shared mempools as the default
> memory model for DPDK devices. Per port mempools are still available but
> must be enabled explicitly by a user.
> 
> OVS previously used a shared mempool model for ports with the same MTU
> and socket configuration. This was replaced by a per port mempool model
> to address issues flagged by users such as:
> 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
> 
> However the per port model potentially requires an increase in memory
> resource requirements to support the same number of ports and configuration
> as the shared port model.
> 
> This is considered a blocking factor for current deployments of OVS
> when upgrading to future OVS releases as a user may have to redimension
> memory for the same deployment configuration. This may not be possible for
> users.
> 
> This commit resolves the issue by re-introducing shared mempools as
> the default memory behaviour in OVS DPDK but also refactors the memory
> configuration code to allow for per port mempools.
> 
> This patch adds a new global config option, per-port-mp-enabled, that
> controls the enablement of per port mempools for DPDK devices.
> 
>     ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> 
> This value defaults to false; to enable per port mempool support,
> this field should be set to true when setting other global parameters
> on init (such as "dpdk-socket-mem", for example). Changing the value at
> runtime is not supported, and requires restarting the vswitch
> daemon.
> 
> The mempool sweep functionality is also replaced with the
> sweep functionality from OVS 2.9 found in commits
> 
> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
> 
> As this patch is RFC there are a number of TO-DOs including adding a
> memory calculation section to the documentation for both models. This is
> expected to be completed in the v1 after RFC.
> 
> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> ---
>  Documentation/automake.mk            |   1 +
>  Documentation/topics/dpdk/index.rst  |   1 +
>  Documentation/topics/dpdk/memory.rst |  67 +++++++
>  NEWS                                 |   1 +
>  lib/dpdk-stub.c                      |   6 +
>  lib/dpdk.c                           |  13 ++
>  lib/dpdk.h                           |   1 +
>  lib/netdev-dpdk.c                    | 326 +++++++++++++++++++++--------------
>  vswitchd/vswitch.xml                 |  16 ++
>  9 files changed, 305 insertions(+), 127 deletions(-)
>  create mode 100644 Documentation/topics/dpdk/memory.rst
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 683ca14..14c2189 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>  	Documentation/topics/dpdk/index.rst \
>  	Documentation/topics/dpdk/bridge.rst \
>  	Documentation/topics/dpdk/jumbo-frames.rst \
> +	Documentation/topics/dpdk/memory.rst \
>  	Documentation/topics/dpdk/pdump.rst \
>  	Documentation/topics/dpdk/phy.rst \
>  	Documentation/topics/dpdk/pmd.rst \
> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> index 181f61a..cf24a7b 100644
> --- a/Documentation/topics/dpdk/index.rst
> +++ b/Documentation/topics/dpdk/index.rst
> @@ -40,3 +40,4 @@ The DPDK Datapath
>     /topics/dpdk/qos
>     /topics/dpdk/pdump
>     /topics/dpdk/jumbo-frames
> +   /topics/dpdk/memory
> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> new file mode 100644
> index 0000000..1198067
> --- /dev/null
> +++ b/Documentation/topics/dpdk/memory.rst
> @@ -0,0 +1,67 @@
> +..
> +        Copyright 2018, Intel, Inc.
> +
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +=========================
> +DPDK Device Memory Models
> +=========================
> +
> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
> +shared mempools or per port mempools. The specifics of both are detailed
> +below.
> +
> +Shared Mempools
> +---------------
> +
> +By Default OVS DPDK uses a shared mempool model. This means that multiple
> +ports can share the same mempool. For example when a port is added it will
> +have a given MTU and socket ID associated with it. If a mempool has been
> +created previously for an existing port that has the same MTU and socket ID,
> +that mempool is used for both ports. If there is no existing mempool
> +supporting these parameters then a new mempool is created.
> +
> +
> +Per Port Mempools
> +-----------------
> +
> +In the per port mempool model, mempools are created per device and are not
> +shared. The benefit of this is a more transparent memory model where mempools
> +will not be exhausted by other DPDK devices. However this comes at a potential
> +increase in cost for memory dimensioning for a given deployment. Users should
> +be aware of the memory requirements for their deployment before using this
> +model and allocate the required hugepage memory.
> +
> +Per port mempool support may be enabled via a global config value,
> +```per-port-mp-enabled```. Setting this to true enables the per port mempool
> +model for all DPDK devices in OVS::
> +
> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> +
> +.. important::
> +
> +    Changing this value requires restarting the daemon.
> +
> +.. todo::
> +
> +   Add memory calculation section for both mempool models.
> diff --git a/NEWS b/NEWS
> index ec548b0..c9991cf 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -30,6 +30,7 @@ Post-v2.9.0
>       * New 'check-dpdk' Makefile target to run a new system testsuite.
>         See Testing topic for the details.
>       * Add LSC interrupt support for DPDK physical devices.
> +     * Support both shared and per port mempools for DPDK devices.
>     - Userspace datapath:
>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>       * Detailed PMD performance metrics available with new command
> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> index 041cd0c..d2a9718 100644
> --- a/lib/dpdk-stub.c
> +++ b/lib/dpdk-stub.c
> @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void)
>      return false;
>  }
>  
> +bool
> +dpdk_per_port_mempool_enabled(void)
> +{
> +    return false;
> +}
> +
>  void
>  print_dpdk_version(void)
>  {
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 00dd974..e251970 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -43,6 +43,8 @@ static FILE *log_stream = NULL;       /* Stream for DPDK log redirection */
>  
>  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
> +static bool per_port_mp_enabled = false; /* Status of per port mempool
> +                                          * support */
>  
>  static int
>  process_vhost_flags(char *flag, const char *default_val, int size,
> @@ -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>      VLOG_INFO("IOMMU support for vhost-user-client %s.",
>                 vhost_iommu_enabled ? "enabled" : "disabled");
>  
> +    per_port_mp_enabled = smap_get_bool(ovs_other_config,
> +                                        "per-port-mp-enabled", false);
> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
> +               per_port_mp_enabled ? "enabled" : "disabled");
> +
>      argv = grow_argv(&argv, 0, 1);
>      argc = 1;
>      argv[0] = xstrdup(ovs_get_program_name());
> @@ -498,6 +505,12 @@ dpdk_vhost_iommu_enabled(void)
>      return vhost_iommu_enabled;
>  }
>  
> +bool
> +dpdk_per_port_mempool_enabled(void)
> +{
> +    return per_port_mp_enabled;
> +}
> +
>  void
>  dpdk_set_lcore_id(unsigned cpu)
>  {
> diff --git a/lib/dpdk.h b/lib/dpdk.h
> index b041535..243385c 100644
> --- a/lib/dpdk.h
> +++ b/lib/dpdk.h
> @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>  void dpdk_set_lcore_id(unsigned cpu);
>  const char *dpdk_get_vhost_sock_dir(void);
>  bool dpdk_vhost_iommu_enabled(void);
> +bool dpdk_per_port_mempool_enabled(void);
>  void print_dpdk_version(void);
>  
>  #endif /* dpdk.h */
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 87152a7..cda2ace 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
>  
> -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
> - * roughly estimated number of mbufs: if this fails (because the system doesn't
> - * have enough hugepages) we keep halving the number until the allocation
> - * succeeds or we reach MIN_NB_MBUF */
> +/* Max and min number of packets in the mempool.  OVS tries to allocate a
> + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> + * enough hugepages) we keep halving the number until the allocation succeeds
> + * or we reach MIN_NB_MBUF */
> +
> +#define MAX_NB_MBUF          (4096 * 64)
>  #define MIN_NB_MBUF          (4096 * 4)
>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>  
> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
> +                  == 0);
> +
> +/* The smallest possible NB_MBUF that we're going to try should be a multiple
> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
> +                  % MP_CACHE_SZ == 0);
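
(Aside: a quick standalone check that the constants satisfy these
asserts. MP_CACHE_SZ is assumed to be 512 here, the value of
RTE_MEMPOOL_CACHE_MAX_SIZE in the DPDK releases I checked:)

```c
#include <assert.h>

#define MAX_NB_MBUF (4096 * 64)  /* 262144 */
#define MIN_NB_MBUF (4096 * 4)   /*  16384 */
#define MP_CACHE_SZ 512          /* assumed: RTE_MEMPOOL_CACHE_MAX_SIZE */

/* Returns 1 if every candidate the halving loop in dpdk_mp_create() can
 * try (MAX_NB_MBUF, MAX_NB_MBUF / 2, ... down to MIN_NB_MBUF) is a
 * multiple of the mempool cache size, as advised by the DPDK docs. */
int
halving_candidates_ok(void)
{
    unsigned n;

    for (n = MAX_NB_MBUF; n >= MIN_NB_MBUF; n /= 2) {
        if (n % MP_CACHE_SZ != 0) {
            return 0;
        }
    }
    return 1;
}
```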
> +
>  /*
>   * DPDK XSTATS Counter names definition
>   */
> @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
>  static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
>      = OVS_MUTEX_INITIALIZER;
>  
> -/* Contains all 'struct dpdk_mp's. */
> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>  
> -/* Wrapper for a mempool released but not yet freed. */
>  struct dpdk_mp {
>       struct rte_mempool *mp;
> +     int mtu;
> +     int socket_id;
> +     int refcount;
>       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>   };
>  
> @@ -381,7 +393,7 @@ struct netdev_dpdk {
>  
>      PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>          struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
> -        struct rte_mempool *mp;
> +        struct dpdk_mp *dpdk_mp;
>  
>          /* virtio identifier for vhost devices */
>          ovsrcu_index vid;
> @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
>      dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
>  }
>  
> -static int
> -dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> -{
> -    unsigned ring_count;
> -    /* This logic is needed because rte_mempool_full() is not guaranteed to
> -     * be atomic and mbufs could be moved from mempool cache --> mempool ring
> -     * during the call. However, as no mbufs will be taken from the mempool
> -     * at this time, we can work around it by also checking the ring entries
> -     * separately and ensuring that they have not changed.
> -     */
> -    ring_count = rte_mempool_ops_get_count(mp);
> -    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> -        return 1;
> -    }
> -
> -    return 0;
> -}
> -
> -/* Free unused mempools. */
> -static void
> -dpdk_mp_sweep(void)
> +/* Calculating the required number of mbufs differs depending on the
> + * mempool model being used. Check if per port mempools are in use before
> + * calculating.
> + */
> +static uint32_t
> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
> -    struct dpdk_mp *dmp, *next;
> +    uint32_t n_mbufs;
>  
> -    ovs_mutex_lock(&dpdk_mp_mutex);
> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> -        if (dpdk_mp_full(dmp->mp)) {
> -            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> -            ovs_list_remove(&dmp->list_node);
> -            rte_mempool_free(dmp->mp);
> -            rte_free(dmp);
> +    if (!per_port_mp) {
> +        /* Shared mempools are being used.
> +         * XXX: this is a really rough method of provisioning memory.
> +         * It's impossible to determine what the exact memory requirements are
> +         * when the number of ports and rxqs that utilize a particular mempool
> +         * can change dynamically at runtime. For now, use this rough
> +         * heuristic.
> +         */
> +        if (mtu >= ETHER_MTU) {
> +            n_mbufs = MAX_NB_MBUF;

Is this going to be configurable in the future?

> +        } else {
> +            n_mbufs = MIN_NB_MBUF;
>          }
> +    } else {
> +        /* Per port mempools are being used.
> +         * XXX: rough estimation of number of mbufs required for this port:
> +         * <packets required to fill the device rxqs>
> +         * + <packets that could be stuck on other ports txqs>
> +         * + <packets in the pmd threads>
> +         * + <additional memory for corner cases>
> +         */
> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> +                  + dev->requested_n_txq * dev->requested_txq_size
> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> +                  + MIN_NB_MBUF;
>      }
> -    ovs_mutex_unlock(&dpdk_mp_mutex);
> -}
> -
> -/* Ensure a mempool will not be freed. */
> -static void
> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> -{
> -    struct dpdk_mp *dmp, *next;
>  
> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> -        if (dmp->mp == mp) {
> -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
> -            ovs_list_remove(&dmp->list_node);
> -            rte_free(dmp);
> -            break;
> -        }
> -    }
> +    return n_mbufs;
>  }
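
To make the two estimates concrete, here is a standalone mirror of the
calculation. NETDEV_MAX_BURST = 32 and RTE_MAX_LCORE = 128 are assumed
values for the sake of the example (RTE_MAX_LCORE is build-time
configurable in DPDK), not taken from this patch:

```c
#include <assert.h>

#define MAX_NB_MBUF      (4096 * 64)  /* 262144 */
#define MIN_NB_MBUF      (4096 * 4)   /*  16384 */
#define ETHER_MTU        1500
#define NETDEV_MAX_BURST 32           /* assumed value */
#define RTE_MAX_LCORE    128          /* assumed value */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Standalone mirror of dpdk_calculate_mbufs() above. */
unsigned
calculate_mbufs(unsigned n_rxq, unsigned rxq_size,
                unsigned n_txq, unsigned txq_size,
                int mtu, int per_port_mp)
{
    if (!per_port_mp) {
        /* Shared model: coarse, MTU-based sizing. */
        return mtu >= ETHER_MTU ? MAX_NB_MBUF : MIN_NB_MBUF;
    }
    /* Per-port model: rxqs + txqs + in-flight bursts + headroom. */
    return n_rxq * rxq_size + n_txq * txq_size
           + MIN(RTE_MAX_LCORE, n_rxq) * NETDEV_MAX_BURST
           + MIN_NB_MBUF;
}
```

e.g. a port with 2 rxqs and 2 txqs of 2048 descriptors each at the
default MTU needs 8192 + 64 + 16384 = 24640 mbufs under the per-port
model, versus a flat 262144 for its share of the pool under the shared
model.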
>  
> -/* Returns a valid pointer when either of the following is true:
> - *  - a new mempool was just created;
> - *  - a matching mempool already exists. */
> -static struct rte_mempool *
> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> +static struct dpdk_mp *
> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      const char *netdev_name = netdev_get_name(&dev->up);
>      int socket_id = dev->requested_socket_id;
>      uint32_t n_mbufs;
>      uint32_t hash = hash_string(netdev_name, 0);
> -    struct rte_mempool *mp = NULL;
> -
> -    /*
> -     * XXX: rough estimation of number of mbufs required for this port:
> -     * <packets required to fill the device rxqs>
> -     * + <packets that could be stuck on other ports txqs>
> -     * + <packets in the pmd threads>
> -     * + <additional memory for corner cases>
> -     */
> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> -              + dev->requested_n_txq * dev->requested_txq_size
> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> -              + MIN_NB_MBUF;
> +    struct dpdk_mp *dmp = NULL;
> +    int ret;
> +
> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
> +    if (!dmp) {
> +        return NULL;
> +    }
> +    dmp->socket_id = socket_id;
> +    dmp->mtu = mtu;
> +    dmp->refcount = 1;
> +
> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>  
> -    ovs_mutex_lock(&dpdk_mp_mutex);
>      do {
>          /* Full DPDK memory pool name must be unique and cannot be
>           * longer than RTE_MEMPOOL_NAMESIZE. */
> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>                             "ovs%08x%02d%05d%07u",
>                             hash, socket_id, mtu, n_mbufs);
>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
> @@ -623,102 +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>                    netdev_name, n_mbufs, socket_id,
>                    dev->requested_n_rxq, dev->requested_n_txq);
>  
> -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> +                                          MP_CACHE_SZ,
> +                                          sizeof (struct dp_packet)
> +                                          - sizeof (struct rte_mbuf),
> +                                          MBUF_SIZE(mtu)
> +                                          - sizeof(struct dp_packet),
> +                                          socket_id);
>  
> -        if (mp) {
> +        if (dmp->mp) {
>              VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>                       mp_name, n_mbufs);
>              /* rte_pktmbuf_pool_create has done some initialization of the
> -             * rte_mbuf part of each dp_packet. Some OvS specific fields
> -             * of the packet still need to be initialized by
> -             * ovs_rte_pktmbuf_init. */
> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
> +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
> +             * initializes some OVS specific fields of dp_packet.
> +             */
> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
> +            return dmp;
>          } else if (rte_errno == EEXIST) {
>              /* A mempool with the same name already exists.  We just
>               * retrieve its pointer to be returned to the caller. */
> -            mp = rte_mempool_lookup(mp_name);
> +            dmp->mp = rte_mempool_lookup(mp_name);
>              /* As the mempool create returned EEXIST we can expect the
>               * lookup has returned a valid pointer.  If for some reason
>               * that's not the case we keep track of it. */
>              VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
> -                     mp_name, mp);
> -            /* Ensure this reused mempool will not be freed. */
> -            dpdk_mp_do_not_free(mp);
> +                     mp_name, dmp->mp);
>          } else {
>              VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
>                       mp_name, n_mbufs);
>          }

Just a thought, but now with shared mempools, where n_mbufs is initially
set to 4096 * 64, one can see this error log printed a few times before
the memory gets allocated. We could potentially demote this to a WARN
with a friendlier message, and only print the error just before
returning to the caller (at that point we definitely couldn't allocate
the mempool).
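
i.e. something along these lines (uncompiled sketch) - warn on the
intermediate retries inside the loop, and keep a single hard error once
the loop gives up:

```c
        } else {
            VLOG_WARN("Failed mempool \"%s\" create request of %u mbufs; "
                      "halving and retrying", mp_name, n_mbufs);
        }
    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);

    VLOG_ERR("Failed to create mempool \"%s\" for netdev %s",
             mp_name, netdev_name);
    rte_free(dmp);
    return NULL;
```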

> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>  
> -    ovs_mutex_unlock(&dpdk_mp_mutex);
> -    return mp;
> +    rte_free(dmp);
> +    return NULL;
>  }
>  
> -/* Release an existing mempool. */
> +static int
> +dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> +{
> +    unsigned ring_count;
> +    /* This logic is needed because rte_mempool_full() is not guaranteed to
> +     * be atomic and mbufs could be moved from mempool cache --> mempool ring
> +     * during the call. However, as no mbufs will be taken from the mempool
> +     * at this time, we can work around it by also checking the ring entries
> +     * separately and ensuring that they have not changed.
> +     */
> +    ring_count = rte_mempool_ops_get_count(mp);
> +    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +/* Free unused mempools. */
>  static void
> -dpdk_mp_release(struct rte_mempool *mp)
> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>  {
> -    if (!mp) {
> -        return;
> +    struct dpdk_mp *dmp, *next;
> +
> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
> +            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> +            ovs_list_remove(&dmp->list_node);
> +            rte_mempool_free(dmp->mp);
> +            rte_free(dmp);
> +        }
>      }
> +}
> +
> +static struct dpdk_mp *
> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> +{
> +    struct dpdk_mp *dmp;
> +    bool reuse = false;
>  
>      ovs_mutex_lock(&dpdk_mp_mutex);
> -    if (dpdk_mp_full(mp)) {
> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
> -        rte_mempool_free(mp);
> -    } else {
> -        struct dpdk_mp *dmp;
> +    /* Check if shared mempools are being used, if so check existing mempools
> +     * to see if reuse is possible. */
> +    if (!per_port_mp) {
> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
> +            if (dmp->socket_id == dev->requested_socket_id
> +                && dmp->mtu == mtu) {
> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
> +                dmp->refcount++;
> +                reuse = true;
> +                break;
> +            }
> +        }
> +    }
> +    /* Sweep mempools after reuse or before create. */
> +    dpdk_mp_sweep();

*Should dpdk_mp_sweep() be called when destroying an interface as well,
in common_destruct()? While testing, I've noticed that if I add one port
and delete the same port the mempool will still be allocated until you
add another port, since dpdk_mp_sweep() will only be called on the next
call to dpdk_mp_get(). This could potentially create problems in the
per-port model where one deletes a certain number of ports and can't add
any other ports because there's (hanging) mempools holding memory.
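
i.e. (untested sketch) calling the sweep at the end of port destruction,
so an unused mempool can be reclaimed as soon as its last port goes away
- with the caveat that dpdk_mp_full() may still defer the free while
mbufs are in flight:

```c
static void
common_destruct(struct netdev_dpdk *dev)
    OVS_EXCLUDED(dev->mutex)
{
    rte_free(dev->tx_q);
    dpdk_mp_put(dev->dpdk_mp);

    /* Sketch: reclaim any mempool whose refcount just hit zero. */
    ovs_mutex_lock(&dpdk_mp_mutex);
    dpdk_mp_sweep();
    ovs_mutex_unlock(&dpdk_mp_mutex);

    /* ... rest of common_destruct() unchanged ... */
}
```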

>  
> -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
> +    if (!reuse) {
> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>          if (dmp) {
> -            dmp->mp = mp;
> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
> +            ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>          }
>      }
> +
>      ovs_mutex_unlock(&dpdk_mp_mutex);
> +
> +    return dmp;
>  }
>  
> -/* Tries to allocate a new mempool - or re-use an existing one where
> - * appropriate - on requested_socket_id with a size determined by
> - * requested_mtu and requested Rx/Tx queues.
> - * On success - or when re-using an existing mempool - the new configuration
> - * will be applied.
> +/* Decrement reference to a mempool. */
> +static void
> +dpdk_mp_put(struct dpdk_mp *dmp)
> +{
> +    if (!dmp) {
> +        return;
> +    }
> +
> +    ovs_mutex_lock(&dpdk_mp_mutex);
> +    ovs_assert(dmp->refcount);
> +    dmp->refcount--;
> +    ovs_mutex_unlock(&dpdk_mp_mutex);
> +}
> +
> +/* Depending on the memory model being used this function tries to
> + * identify and reuse an existing mempool or tries to allocate a new
> + * mempool on requested_socket_id with mbuf size corresponding to
> + * requested_mtu. On success new configuration will be applied.
>   * On error, device will be left unchanged. */
>  static int
>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>      OVS_REQUIRES(dev->mutex)
>  {
>      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
> -    struct rte_mempool *mp;
> +    struct dpdk_mp *mp;
>      int ret = 0;
> +    bool per_port_mp = dpdk_per_port_mempool_enabled();
>  
> -    dpdk_mp_sweep();
> +    /* With shared mempools we do not need to configure a mempool if the MTU
> +     * and socket ID have not changed, the previous configuration is still
> +     * valid so return 0 */
> +    if (dev->mtu == dev->requested_mtu
> +        && dev->socket_id == dev->requested_socket_id
> +        && (!per_port_mp)) {

I also find that moving the `(!per_port_mp)` condition to the beginning
improves readability here. It even goes hand in hand with your comment
just above - "With shared mempools we do not ...".
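
That would read as (sketch):

```c
    /* With shared mempools we do not need to configure a mempool if the
     * MTU and socket ID have not changed. */
    if (!per_port_mp
        && dev->mtu == dev->requested_mtu
        && dev->socket_id == dev->requested_socket_id) {
        return ret;
    }
```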

> +        return ret;
> +    }
>  
> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>      if (!mp) {
>          VLOG_ERR("Failed to create memory pool for netdev "
>                   "%s, with MTU %d on socket %d: %s\n",
>                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
> -                 rte_strerror(rte_errno));
> -        ret = rte_errno;
> +        rte_strerror(rte_errno));

Missing indentation here.

> +        return rte_errno;
>      } else {
> -        /* If a new MTU was requested and its rounded value equals the one
> -         * that is currently used, then the existing mempool is returned. */
> -        if (dev->mp != mp) {
> -            /* A new mempool was created, release the previous one. */
> -            dpdk_mp_release(dev->mp);
> -        } else {
> -            ret = EEXIST;
> +        /* Check for any pre-existing dpdk_mp for the device */
> +        if (dev->dpdk_mp != NULL) {
> +            /* If a new MTU was requested and its rounded value equals the
> +             * one that is currently used, then the existing mempool is
> +             * returned. */
> +            if (dev->dpdk_mp->mp != mp->mp) {
> +                /* A new mempool was created, release the previous one. */
> +                dpdk_mp_put(dev->dpdk_mp);
> +            } else {
> +                /* If the mempool already exists in the current dpdk_mp then
> +                 * we need to ensure dpdk_mp that was just created is freed in
> +                 * the next sweep as it will not be used. */
The code path around the `else` block here will only be called when
`!per_port_mp`. This is because `dpdk_mp_get()` will only return an
already existing mempool when using the shared model. Otherwise a new
one is always returned, and thus the `if (dev->dpdk_mp->mp != mp->mp)`
will be true. Am I reading this right? If so, then refactoring this a
bit to differentiate on `per_port_mp` might help readability - this goes
in-line with Kevin's comment about making this piece a bit more readable.

On the same note, the comment above misled me into thinking that the
allocated `mp` is being freed, which would result in an error since the
same `mp` is then assigned below. Instead, what it actually does is
decrement the refcount in struct dpdk_mp, which might end up being
freed on the next dpdk_mp_sweep() if `refcount == 0`. But that won't
happen in the shared model unless no port is using the mempool.
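
To illustrate that lifecycle, here is a standalone toy model of the
get/put/sweep behaviour described above (not the patch's actual code;
the real sweep additionally requires dpdk_mp_full() before freeing):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the dpdk_mp reference counting: put() only decrements,
 * and a struct is only reclaimed by a later sweep once no port holds
 * a reference to it. */
struct toy_mp {
    int refcount;
    bool freed;
};

void
toy_get_reuse(struct toy_mp *mp)  /* shared model: reuse existing pool */
{
    mp->refcount++;
}

void
toy_put(struct toy_mp *mp)
{
    assert(mp->refcount > 0);
    mp->refcount--;
}

void
toy_sweep(struct toy_mp *mp)
{
    if (!mp->refcount) {
        mp->freed = true;
    }
}
```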

Thanks,
Tiago.


> +                dpdk_mp_put(mp);
> +                ret = EEXIST;
> +            }
>          }
> -        dev->mp = mp;
> +        dev->dpdk_mp = mp;
>          dev->mtu = dev->requested_mtu;
>          dev->socket_id = dev->requested_socket_id;
>          dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>      }
>  
> -    return ret;
> +    return 0;
>  }
>
>  static void
> @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>  
>          for (i = 0; i < n_rxq; i++) {
>              diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
> -                                          dev->socket_id, NULL, dev->mp);
> +                                          dev->socket_id, NULL,
> +                                          dev->dpdk_mp->mp);
>              if (diag) {
>                  VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>                            dev->up.name, i, rte_strerror(-diag));
> @@ -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>      memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>      rte_eth_link_get_nowait(dev->port_id, &dev->link);
>  
> -    mbp_priv = rte_mempool_get_priv(dev->mp);
> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>      dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>  
>      /* Get the Flow control configuration for DPDK-ETH */
> @@ -1176,7 +1248,7 @@ common_destruct(struct netdev_dpdk *dev)
>      OVS_EXCLUDED(dev->mutex)
>  {
>      rte_free(dev->tx_q);
> -    dpdk_mp_release(dev->mp);
> +    dpdk_mp_put(dev->dpdk_mp);
>  
>      ovs_list_remove(&dev->list_node);
>      free(ovsrcu_get_protected(struct ingress_policer *,
> @@ -1928,7 +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>          return EAGAIN;
>      }
>  
> -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>                                      (struct rte_mbuf **) batch->packets,
>                                      NETDEV_MAX_BURST);
>      if (!nb_rx) {
> @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>              continue;
>          }
>  
> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>          if (OVS_UNLIKELY(!pkts[txcnt])) {
>              dropped += cnt - i;
>              break;
> @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
>          ovs_mutex_lock(&dev->mutex);
>          ovs_mutex_lock(&dpdk_mp_mutex);
>  
> -        rte_mempool_dump(stream, dev->mp);
> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>  
>          ovs_mutex_unlock(&dpdk_mp_mutex);
>          ovs_mutex_unlock(&dev->mutex);
> @@ -3742,7 +3814,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>  
>      err = netdev_dpdk_mempool_configure(dev);
>      if (!err) {
> -        /* A new mempool was created. */
> +        /* A new mempool was created or re-used. */
>          netdev_change_seq_changed(&dev->up);
>      } else if (err != EEXIST){
>          return err;
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 7ab90d5..95520af 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -359,6 +359,22 @@
>          </p>
>        </column>
>  
> +      <column name="other_config" key="per-port-mp-enabled"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          By default OVS DPDK uses a shared mempool memory model wherein
> +          devices that have the same MTU and socket values can share the same
> +          mempool. Setting this value to <code>true</code> changes this
> +          behaviour. Per port mempools allow DPDK devices to use a private
> +          mempool per device. This can provide greater transparency as
> +          regards memory usage but potentially at the cost of greater memory
> +          requirements.
> +        </p>
> +        <p>
> +          Changing this value requires restarting the daemon.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="tx-flush-interval"
>                type='{"type": "integer",
>                       "minInteger": 0, "maxInteger": 1000000}'>
>
Kevin Traynor May 25, 2018, 5:22 p.m. UTC | #3
Hi Tiago,

On 05/25/2018 05:53 PM, Lam, Tiago wrote:
> Hi Ian,
> 
> Thanks for bringing this forward.
> 
> I've tested this on current master by re-configuring MTUs of existing
> ports, adding new ports, deleting existing ones, etc. It seems to be
> behaving as expected:
> - On the shared model, it allocates a new mempool only if the MTU
> changes, otherwise reuses the already existing one. And it frees a
> mempool only if it isn't being used by any port;
> - On the per-port model, it allocates a new mempool for every port. And
> it frees the mempools when the ports are destroyed / MTU is changed.
> 
> There's a catch on how mempools are freed, though. I have a comment on
> that in-line below*.
> 
> (I see Kevin has sent a review as well, so I'll refrain from touching
> on the same points, except to re-iterate.)
> 
> On 18/05/2018 15:30, Ian Stokes wrote:
>> This commit re-introduces the concept of shared mempools as the default
>> memory model for DPDK devices. Per port mempools are still available but
>> must be enabled explicitly by a user.
>>
>> OVS previously used a shared mempool model for ports with the same MTU
>> and socket configuration. This was replaced by a per port mempool model
>> to address issues flagged by users such as:
>>
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>
>> However the per port model potentially requires an increase in memory
>> resource requirements to support the same number of ports and configuration
>> as the shared port model.
>>
>> This is considered a blocking factor for current deployments of OVS
>> when upgrading to future OVS releases as a user may have to redimension
>> memory for the same deployment configuration. This may not be possible for
>> users.
>>
>> This commit resolves the issue by re-introducing shared mempools as
>> the default memory behaviour in OVS DPDK but also refactors the memory
>> configuration code to allow for per port mempools.
>>
>> This patch adds a new global config option, per-port-mp-enabled, that
>> controls the enablement of per port mempools for DPDK devices.
>>
>>     ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
>>
>> This value defaults to false; to enable per port mempool support,
>> this field should be set to true when setting other global parameters
>> on init (such as "dpdk-socket-mem", for example). Changing the value at
>> runtime is not supported, and requires restarting the vswitch
>> daemon.
>>
>> The mempool sweep functionality is also replaced with the
>> sweep functionality from OVS 2.9 found in commits
>>
>> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
>> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>>
>> As this patch is RFC there are a number of TO-DOs including adding a
>> memory calculation section to the documentation for both models. This is
>> expected to be completed in the v1 after RFC.
>>
>> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
>> ---
>>  Documentation/automake.mk            |   1 +
>>  Documentation/topics/dpdk/index.rst  |   1 +
>>  Documentation/topics/dpdk/memory.rst |  67 +++++++
>>  NEWS                                 |   1 +
>>  lib/dpdk-stub.c                      |   6 +
>>  lib/dpdk.c                           |  13 ++
>>  lib/dpdk.h                           |   1 +
>>  lib/netdev-dpdk.c                    | 326 +++++++++++++++++++++--------------
>>  vswitchd/vswitch.xml                 |  16 ++
>>  9 files changed, 305 insertions(+), 127 deletions(-)
>>  create mode 100644 Documentation/topics/dpdk/memory.rst
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>> index 683ca14..14c2189 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>>  	Documentation/topics/dpdk/index.rst \
>>  	Documentation/topics/dpdk/bridge.rst \
>>  	Documentation/topics/dpdk/jumbo-frames.rst \
>> +	Documentation/topics/dpdk/memory.rst \
>>  	Documentation/topics/dpdk/pdump.rst \
>>  	Documentation/topics/dpdk/phy.rst \
>>  	Documentation/topics/dpdk/pmd.rst \
>> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
>> index 181f61a..cf24a7b 100644
>> --- a/Documentation/topics/dpdk/index.rst
>> +++ b/Documentation/topics/dpdk/index.rst
>> @@ -40,3 +40,4 @@ The DPDK Datapath
>>     /topics/dpdk/qos
>>     /topics/dpdk/pdump
>>     /topics/dpdk/jumbo-frames
>> +   /topics/dpdk/memory
>> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
>> new file mode 100644
>> index 0000000..1198067
>> --- /dev/null
>> +++ b/Documentation/topics/dpdk/memory.rst
>> @@ -0,0 +1,67 @@
>> +..
>> +        Copyright 2018, Intel, Inc.
>> +
>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>> +      not use this file except in compliance with the License. You may obtain
>> +      a copy of the License at
>> +
>> +          http://www.apache.org/licenses/LICENSE-2.0
>> +
>> +      Unless required by applicable law or agreed to in writing, software
>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>> +      License for the specific language governing permissions and limitations
>> +      under the License.
>> +
>> +      Convention for heading levels in Open vSwitch documentation:
>> +
>> +      =======  Heading 0 (reserved for the title in a document)
>> +      -------  Heading 1
>> +      ~~~~~~~  Heading 2
>> +      +++++++  Heading 3
>> +      '''''''  Heading 4
>> +
>> +      Avoid deeper levels because they do not render well.
>> +
>> +=========================
>> +DPDK Device Memory Models
>> +=========================
>> +
>> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
>> +shared mempools or per port mempools. The specifics of both are detailed
>> +below.
>> +
>> +Shared Mempools
>> +---------------
>> +
>> +By default OVS DPDK uses a shared mempool model. This means that multiple
>> +ports can share the same mempool. For example when a port is added it will
>> +have a given MTU and socket ID associated with it. If a mempool has been
>> +created previously for an existing port that has the same MTU and socket ID,
>> +that mempool is used for both ports. If there is no existing mempool
>> +supporting these parameters then a new mempool is created.
>> +
>> +
>> +Per Port Mempools
>> +-----------------
>> +
>> +In the per port mempool model, mempools are created per device and are not
>> +shared. The benefit of this is a more transparent memory model where mempools
>> +will not be exhausted by other DPDK devices. However this comes at the cost
>> +of a potential increase in memory requirements for a given deployment. Users
>> +should be aware of the memory requirements for their deployment before using
>> +be aware of the memory requirements for their deployment before using this
>> +model and allocate the required hugepage memory.
>> +
>> +Per port mempool support may be enabled via a global config value,
>> +``per-port-mp-enabled``. Setting this to true enables the per port mempool
>> +model for all DPDK devices in OVS::
>> +
>> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
>> +
>> +.. important::
>> +
>> +    Changing this value requires restarting the daemon.
>> +
>> +.. todo::
>> +
>> +   Add memory calculation section for both mempool models.
>> diff --git a/NEWS b/NEWS
>> index ec548b0..c9991cf 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -30,6 +30,7 @@ Post-v2.9.0
>>       * New 'check-dpdk' Makefile target to run a new system testsuite.
>>         See Testing topic for the details.
>>       * Add LSC interrupt support for DPDK physical devices.
>> +     * Support both shared and per port mempools for DPDK devices.
>>     - Userspace datapath:
>>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>>       * Detailed PMD performance metrics available with new command
>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>> index 041cd0c..d2a9718 100644
>> --- a/lib/dpdk-stub.c
>> +++ b/lib/dpdk-stub.c
>> @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void)
>>      return false;
>>  }
>>  
>> +bool
>> +dpdk_per_port_mempool_enabled(void)
>> +{
>> +    return false;
>> +}
>> +
>>  void
>>  print_dpdk_version(void)
>>  {
>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>> index 00dd974..e251970 100644
>> --- a/lib/dpdk.c
>> +++ b/lib/dpdk.c
>> @@ -43,6 +43,8 @@ static FILE *log_stream = NULL;       /* Stream for DPDK log redirection */
>>  
>>  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>>  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
>> +static bool per_port_mp_enabled = false; /* Status of per port mempool
>> +                                          * support */
>>  
>>  static int
>>  process_vhost_flags(char *flag, const char *default_val, int size,
>> @@ -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>>      VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>                 vhost_iommu_enabled ? "enabled" : "disabled");
>>  
>> +    per_port_mp_enabled = smap_get_bool(ovs_other_config,
>> +                                        "per-port-mp-enabled", false);
>> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
>> +               per_port_mp_enabled ? "enabled" : "disabled");
>> +
>>      argv = grow_argv(&argv, 0, 1);
>>      argc = 1;
>>      argv[0] = xstrdup(ovs_get_program_name());
>> @@ -498,6 +505,12 @@ dpdk_vhost_iommu_enabled(void)
>>      return vhost_iommu_enabled;
>>  }
>>  
>> +bool
>> +dpdk_per_port_mempool_enabled(void)
>> +{
>> +    return per_port_mp_enabled;
>> +}
>> +
>>  void
>>  dpdk_set_lcore_id(unsigned cpu)
>>  {
>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>> index b041535..243385c 100644
>> --- a/lib/dpdk.h
>> +++ b/lib/dpdk.h
>> @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>>  void dpdk_set_lcore_id(unsigned cpu);
>>  const char *dpdk_get_vhost_sock_dir(void);
>>  bool dpdk_vhost_iommu_enabled(void);
>> +bool dpdk_per_port_mempool_enabled(void);
>>  void print_dpdk_version(void);
>>  
>>  #endif /* dpdk.h */
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 87152a7..cda2ace 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
>>  
>> -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
>> - * roughly estimated number of mbufs: if this fails (because the system doesn't
>> - * have enough hugepages) we keep halving the number until the allocation
>> - * succeeds or we reach MIN_NB_MBUF */
>> +/* Max and min number of packets in the mempool.  OVS tries to allocate a
>> + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
>> + * enough hugepages) we keep halving the number until the allocation succeeds
>> + * or we reach MIN_NB_MBUF */
>> +
>> +#define MAX_NB_MBUF          (4096 * 64)
>>  #define MIN_NB_MBUF          (4096 * 4)
>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>  
>> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
>> +                  == 0);
>> +
>> +/* The smallest possible NB_MBUF that we're going to try should be a multiple
>> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
>> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
>> +                  % MP_CACHE_SZ == 0);
>> +
>>  /*
>>   * DPDK XSTATS Counter names definition
>>   */
>> @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
>>  static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
>>      = OVS_MUTEX_INITIALIZER;
>>  
>> -/* Contains all 'struct dpdk_mp's. */
>> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
>> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
>> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
>> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>>  
>> -/* Wrapper for a mempool released but not yet freed. */
>>  struct dpdk_mp {
>>       struct rte_mempool *mp;
>> +     int mtu;
>> +     int socket_id;
>> +     int refcount;
>>       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>>   };
>>  
>> @@ -381,7 +393,7 @@ struct netdev_dpdk {
>>  
>>      PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>          struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>> -        struct rte_mempool *mp;
>> +        struct dpdk_mp *dpdk_mp;
>>  
>>          /* virtio identifier for vhost devices */
>>          ovsrcu_index vid;
>> @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
>>      dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
>>  }
>>  
>> -static int
>> -dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>> -{
>> -    unsigned ring_count;
>> -    /* This logic is needed because rte_mempool_full() is not guaranteed to
>> -     * be atomic and mbufs could be moved from mempool cache --> mempool ring
>> -     * during the call. However, as no mbufs will be taken from the mempool
>> -     * at this time, we can work around it by also checking the ring entries
>> -     * separately and ensuring that they have not changed.
>> -     */
>> -    ring_count = rte_mempool_ops_get_count(mp);
>> -    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
>> -        return 1;
>> -    }
>> -
>> -    return 0;
>> -}
>> -
>> -/* Free unused mempools. */
>> -static void
>> -dpdk_mp_sweep(void)
>> +/* Calculating the required number of mbufs differs depending on the
>> + * mempool model being used. Check if per port mempools are in use before
>> + * calculating.
>> + */
>> +static uint32_t
>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>  {
>> -    struct dpdk_mp *dmp, *next;
>> +    uint32_t n_mbufs;
>>  
>> -    ovs_mutex_lock(&dpdk_mp_mutex);
>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>> -        if (dpdk_mp_full(dmp->mp)) {
>> -            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>> -            ovs_list_remove(&dmp->list_node);
>> -            rte_mempool_free(dmp->mp);
>> -            rte_free(dmp);
>> +    if (!per_port_mp) {
>> +        /* Shared mempools are being used.
>> +         * XXX: this is a really rough method of provisioning memory.
>> +         * It's impossible to determine what the exact memory requirements are
>> +         * when the number of ports and rxqs that utilize a particular mempool
>> +         * can change dynamically at runtime. For now, use this rough
>> +         * heuristic.
>> +         */
>> +        if (mtu >= ETHER_MTU) {
>> +            n_mbufs = MAX_NB_MBUF;
> 
> Is this going to be configurable in the future?
> 
>> +        } else {
>> +            n_mbufs = MIN_NB_MBUF;
>>          }
>> +    } else {
>> +        /* Per port mempools are being used.
>> +         * XXX: rough estimation of number of mbufs required for this port:
>> +         * <packets required to fill the device rxqs>
>> +         * + <packets that could be stuck on other ports txqs>
>> +         * + <packets in the pmd threads>
>> +         * + <additional memory for corner cases>
>> +         */
>> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>> +                  + dev->requested_n_txq * dev->requested_txq_size
>> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>> +                  + MIN_NB_MBUF;
>>      }
>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>> -}
>> -
>> -/* Ensure a mempool will not be freed. */
>> -static void
>> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>> -{
>> -    struct dpdk_mp *dmp, *next;
>>  
>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>> -        if (dmp->mp == mp) {
>> -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
>> -            ovs_list_remove(&dmp->list_node);
>> -            rte_free(dmp);
>> -            break;
>> -        }
>> -    }
>> +    return n_mbufs;
>>  }
>>  
>> -/* Returns a valid pointer when either of the following is true:
>> - *  - a new mempool was just created;
>> - *  - a matching mempool already exists. */
>> -static struct rte_mempool *
>> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>> +static struct dpdk_mp *
>> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>  {
>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>      const char *netdev_name = netdev_get_name(&dev->up);
>>      int socket_id = dev->requested_socket_id;
>>      uint32_t n_mbufs;
>>      uint32_t hash = hash_string(netdev_name, 0);
>> -    struct rte_mempool *mp = NULL;
>> -
>> -    /*
>> -     * XXX: rough estimation of number of mbufs required for this port:
>> -     * <packets required to fill the device rxqs>
>> -     * + <packets that could be stuck on other ports txqs>
>> -     * + <packets in the pmd threads>
>> -     * + <additional memory for corner cases>
>> -     */
>> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>> -              + dev->requested_n_txq * dev->requested_txq_size
>> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>> -              + MIN_NB_MBUF;
>> +    struct dpdk_mp *dmp = NULL;
>> +    int ret;
>> +
>> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
>> +    if (!dmp) {
>> +        return NULL;
>> +    }
>> +    dmp->socket_id = socket_id;
>> +    dmp->mtu = mtu;
>> +    dmp->refcount = 1;
>> +
>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>  
>> -    ovs_mutex_lock(&dpdk_mp_mutex);
>>      do {
>>          /* Full DPDK memory pool name must be unique and cannot be
>>           * longer than RTE_MEMPOOL_NAMESIZE. */
>> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>                             "ovs%08x%02d%05d%07u",
>>                             hash, socket_id, mtu, n_mbufs);
>>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>> @@ -623,102 +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>                    netdev_name, n_mbufs, socket_id,
>>                    dev->requested_n_rxq, dev->requested_n_txq);
>>  
>> -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
>> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
>> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>> +                                          MP_CACHE_SZ,
>> +                                          sizeof (struct dp_packet)
>> +                                          - sizeof (struct rte_mbuf),
>> +                                          MBUF_SIZE(mtu)
>> +                                          - sizeof(struct dp_packet),
>> +                                          socket_id);
>>  
>> -        if (mp) {
>> +        if (dmp->mp) {
>>              VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>>                       mp_name, n_mbufs);
>>              /* rte_pktmbuf_pool_create has done some initialization of the
>> -             * rte_mbuf part of each dp_packet. Some OvS specific fields
>> -             * of the packet still need to be initialized by
>> -             * ovs_rte_pktmbuf_init. */
>> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
>> +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
>> +             * initializes some OVS specific fields of dp_packet.
>> +             */
>> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
>> +            return dmp;
>>          } else if (rte_errno == EEXIST) {
>>              /* A mempool with the same name already exists.  We just
>>               * retrieve its pointer to be returned to the caller. */
>> -            mp = rte_mempool_lookup(mp_name);
>> +            dmp->mp = rte_mempool_lookup(mp_name);
>>              /* As the mempool create returned EEXIST we can expect the
>>               * lookup has returned a valid pointer.  If for some reason
>>               * that's not the case we keep track of it. */
>>              VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
>> -                     mp_name, mp);
>> -            /* Ensure this reused mempool will not be freed. */
>> -            dpdk_mp_do_not_free(mp);
>> +                     mp_name, dmp->mp);
>>          } else {
>>              VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
>>                       mp_name, n_mbufs);
>>          }
> 
> Just a thought, but now with shared memory, where n_mbufs is initially
> set to 4096*64, one can see this error log printed a few times before
> the memory gets allocated. We could potentially demote this to a WARN
> with a friendlier message, and only print the error below, before
> returning to the caller (at that point we surely couldn't allocate the
> mempool).
> 
>> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>>  
>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>> -    return mp;
>> +    rte_free(dmp);
>> +    return NULL;
>>  }
>>  
>> -/* Release an existing mempool. */
>> +static int
>> +dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>> +{
>> +    unsigned ring_count;
>> +    /* This logic is needed because rte_mempool_full() is not guaranteed to
>> +     * be atomic and mbufs could be moved from mempool cache --> mempool ring
>> +     * during the call. However, as no mbufs will be taken from the mempool
>> +     * at this time, we can work around it by also checking the ring entries
>> +     * separately and ensuring that they have not changed.
>> +     */
>> +    ring_count = rte_mempool_ops_get_count(mp);
>> +    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* Free unused mempools. */
>>  static void
>> -dpdk_mp_release(struct rte_mempool *mp)
>> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>  {
>> -    if (!mp) {
>> -        return;
>> +    struct dpdk_mp *dmp, *next;
>> +
>> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
>> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>> +            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>> +            ovs_list_remove(&dmp->list_node);
>> +            rte_mempool_free(dmp->mp);
>> +            rte_free(dmp);
>> +        }
>>      }
>> +}
>> +
>> +static struct dpdk_mp *
>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>> +{
>> +    struct dpdk_mp *dmp;
>> +    bool reuse = false;
>>  
>>      ovs_mutex_lock(&dpdk_mp_mutex);
>> -    if (dpdk_mp_full(mp)) {
>> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>> -        rte_mempool_free(mp);
>> -    } else {
>> -        struct dpdk_mp *dmp;
>> +    /* Check if shared mempools are being used, if so check existing mempools
>> +     * to see if reuse is possible. */
>> +    if (!per_port_mp) {
>> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>> +            if (dmp->socket_id == dev->requested_socket_id
>> +                && dmp->mtu == mtu) {
>> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>> +                dmp->refcount++;
>> +                reuse = true;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    /* Sweep mempools after reuse or before create. */
>> +    dpdk_mp_sweep();
> 
> *Should dpdk_mp_sweep() be called when destroying an interface as well,
> in common_destruct()? While testing, I've noticed that if I add one port
> and delete the same port the mempool will still be allocated until you
> add another port, since dpdk_mp_sweep() will only be called on the next
> call to dpdk_mp_get(). This could potentially create problems in the
> per-port model where one deletes a certain number of ports and can't add
> any other ports because there are (hanging) mempools holding memory.
> 

With the shared model, it was deliberate to leave the dpdk_mp_sweep() as
late as possible to give more time for buffers to be returned to it and
to allow for the possibility that it might be reused. In order to not
require any additional memory, it is done just before a new mempool is
created.

It's not a problem to call dpdk_mp_sweep() at any time, but I'm not sure
I follow the issue you've raised - sweep would still be called before
mempools for any new port are created. Is there a case I've missed?
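
For reference, if eager reclamation on port deletion did turn out to be
wanted, I'd expect it to be a small delta on top of this patch - an
untested sketch, with the locking matching dpdk_mp_sweep()'s
OVS_REQUIRES(dpdk_mp_mutex) annotation:

```diff
 static void
 common_destruct(struct netdev_dpdk *dev)
     OVS_REQUIRES(dpdk_mutex)
     OVS_EXCLUDED(dev->mutex)
 {
     rte_free(dev->tx_q);
     dpdk_mp_put(dev->dpdk_mp);
+
+    /* Reclaim any mempool released by the put above right away,
+     * instead of waiting for the next mempool_configure(). */
+    ovs_mutex_lock(&dpdk_mp_mutex);
+    dpdk_mp_sweep();
+    ovs_mutex_unlock(&dpdk_mp_mutex);
```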

>>  
>> -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
>> +    if (!reuse) {
>> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>>          if (dmp) {
>> -            dmp->mp = mp;
>> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
>> +            ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>          }
>>      }
>> +
>>      ovs_mutex_unlock(&dpdk_mp_mutex);
>> +
>> +    return dmp;
>>  }
>>  
>> -/* Tries to allocate a new mempool - or re-use an existing one where
>> - * appropriate - on requested_socket_id with a size determined by
>> - * requested_mtu and requested Rx/Tx queues.
>> - * On success - or when re-using an existing mempool - the new configuration
>> - * will be applied.
>> +/* Decrement reference to a mempool. */
>> +static void
>> +dpdk_mp_put(struct dpdk_mp *dmp)
>> +{
>> +    if (!dmp) {
>> +        return;
>> +    }
>> +
>> +    ovs_mutex_lock(&dpdk_mp_mutex);
>> +    ovs_assert(dmp->refcount);
>> +    dmp->refcount--;
>> +    ovs_mutex_unlock(&dpdk_mp_mutex);
>> +}
>> +
>> +/* Depending on the memory model being used, this function tries to
>> + * identify and reuse an existing mempool or tries to allocate a new
>> + * mempool on requested_socket_id with mbuf size corresponding to
>> + * requested_mtu. On success new configuration will be applied.
>>   * On error, device will be left unchanged. */
>>  static int
>>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>      OVS_REQUIRES(dev->mutex)
>>  {
>>      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>> -    struct rte_mempool *mp;
>> +    struct dpdk_mp *mp;
>>      int ret = 0;
>> +    bool per_port_mp = dpdk_per_port_mempool_enabled();
>>  
>> -    dpdk_mp_sweep();
>> +    /* With shared mempools we do not need to configure a mempool if the MTU
>> +     * and socket ID have not changed; the previous configuration is still
>> +     * valid, so return 0. */
>> +    if (dev->mtu == dev->requested_mtu
>> +        && dev->socket_id == dev->requested_socket_id
>> +        && (!per_port_mp)) {
> 
> I also find that moving the `(!per_port_mp)` condition to the beginning
> improves readability here. It even goes hand in hand with your comment
> just above - "With shared mempools we do not ...".
> 
>> +        return ret;
>> +    }
>>  
>> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>>      if (!mp) {
>>          VLOG_ERR("Failed to create memory pool for netdev "
>>                   "%s, with MTU %d on socket %d: %s\n",
>>                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
>> -                 rte_strerror(rte_errno));
>> -        ret = rte_errno;
>> +        rte_strerror(rte_errno));
> 
> Missing indentation here.
> 
>> +        return rte_errno;
>>      } else {
>> -        /* If a new MTU was requested and its rounded value equals the one
>> -         * that is currently used, then the existing mempool is returned. */
>> -        if (dev->mp != mp) {
>> -            /* A new mempool was created, release the previous one. */
>> -            dpdk_mp_release(dev->mp);
>> -        } else {
>> -            ret = EEXIST;
>> +        /* Check for any pre-existing dpdk_mp for the device */
>> +        if (dev->dpdk_mp != NULL) {
>> +            /* If a new MTU was requested and its rounded value equals the
>> +             * one that is currently used, then the existing mempool is
>> +             * returned. */
>> +            if (dev->dpdk_mp->mp != mp->mp) {
>> +                /* A new mempool was created, release the previous one. */
>> +                dpdk_mp_put(dev->dpdk_mp);
>> +            } else {
>> +                /* If the mempool already exists in the current dpdk_mp then
>> +                 * we need to ensure dpdk_mp that was just created is freed in
>> +                 * the next sweep as it will not be used. */
> The code path around the `else` block here will only be called when
> `!per_port_mp`. This is because `dpdk_mp_get()` will only return an
> already existing mempool when using the shared model. Otherwise a new
> one is always returned, and thus the `if (dev->dpdk_mp->mp != mp->mp)`
> will be true. Am I reading this right? If so then refactoring this a bit
> to differentiate on `per_port_mp` might help on readability - this goes
> in-line with Kevin's comment about making this piece a bit more readable.
> 
> On the same note, the comment above misled me to think that the
> allocated `mp` is being freed, which would result in error since the
> same `mp` is then assigned below. Instead, what it is doing is
> decrementing the refcount in struct dpdk_mp, which might end up being
> freed on the next dpdk_mp_sweep() if `refcount=0`. But that won't happen
> on the shared model unless no port is using the mempool.
> 

but this port is using it :-)

+            if (dev->dpdk_mp->mp != mp->mp) {

is false, which means they both have the same mempool. The refcount in
dpdk_mp only counts users of the dpdk_mp, not the actual mp, so it could
then be freed while still needed.

Kevin.

> Thanks,
> Tiago.
> 
> 
>> +                dpdk_mp_put(mp);
>> +                ret = EEXIST;
>> +            }
>>          }
>> -        dev->mp = mp;
>> +        dev->dpdk_mp = mp;
>>          dev->mtu = dev->requested_mtu;
>>          dev->socket_id = dev->requested_socket_id;
>>          dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>      }
>>  
>> -    return ret;
>> +    return 0;
>>  }
>>
>>  static void
>> @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>  
>>          for (i = 0; i < n_rxq; i++) {
>>              diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
>> -                                          dev->socket_id, NULL, dev->mp);
>> +                                          dev->socket_id, NULL,
>> +                                          dev->dpdk_mp->mp);
>>              if (diag) {
>>                  VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>>                            dev->up.name, i, rte_strerror(-diag));
>> @@ -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>      memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>>      rte_eth_link_get_nowait(dev->port_id, &dev->link);
>>  
>> -    mbp_priv = rte_mempool_get_priv(dev->mp);
>> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>>      dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>>  
>>      /* Get the Flow control configuration for DPDK-ETH */
>> @@ -1176,7 +1248,7 @@ common_destruct(struct netdev_dpdk *dev)
>>      OVS_EXCLUDED(dev->mutex)
>>  {
>>      rte_free(dev->tx_q);
>> -    dpdk_mp_release(dev->mp);
>> +    dpdk_mp_put(dev->dpdk_mp);
>>  
>>      ovs_list_remove(&dev->list_node);
>>      free(ovsrcu_get_protected(struct ingress_policer *,
>> @@ -1928,7 +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>>          return EAGAIN;
>>      }
>>  
>> -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>>                                      (struct rte_mbuf **) batch->packets,
>>                                      NETDEV_MAX_BURST);
>>      if (!nb_rx) {
>> @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>>              continue;
>>          }
>>  
>> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
>> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>>          if (OVS_UNLIKELY(!pkts[txcnt])) {
>>              dropped += cnt - i;
>>              break;
>> @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
>>          ovs_mutex_lock(&dev->mutex);
>>          ovs_mutex_lock(&dpdk_mp_mutex);
>>  
>> -        rte_mempool_dump(stream, dev->mp);
>> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>>  
>>          ovs_mutex_unlock(&dpdk_mp_mutex);
>>          ovs_mutex_unlock(&dev->mutex);
>> @@ -3742,7 +3814,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>>  
>>      err = netdev_dpdk_mempool_configure(dev);
>>      if (!err) {
>> -        /* A new mempool was created. */
>> +        /* A new mempool was created or re-used. */
>>          netdev_change_seq_changed(&dev->up);
>>      } else if (err != EEXIST){
>>          return err;
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 7ab90d5..95520af 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -359,6 +359,22 @@
>>          </p>
>>        </column>
>>  
>> +      <column name="other_config" key="per-port-mp-enabled"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +          By default OVS DPDK uses a shared mempool memory model wherein
>> +          devices that have the same MTU and socket values can share the same
>> +          mempool. Setting this value to <code>true</code> changes this
>> +          behaviour. Per port mempools allow DPDK devices to use a private
>> +          mempool per device. This can provide greater transparency as
>> +          regards memory usage but potentially at the cost of greater memory
>> +          requirements.
>> +        </p>
>> +        <p>
>> +          Changing this value requires restarting the daemon.
>> +        </p>
>> +      </column>
>> +
>>        <column name="other_config" key="tx-flush-interval"
>>                type='{"type": "integer",
>>                       "minInteger": 0, "maxInteger": 1000000}'>
>>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Kevin Traynor May 25, 2018, 5:33 p.m. UTC | #4
On 05/25/2018 05:53 PM, Lam, Tiago wrote:
>>              VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
>>                       mp_name, n_mbufs);
>>          }
> Just a thought, but now with shared memory where n_mbufs is initially
> set to 4096*64, one can see this error log printed a few times before
> the memory gets allocated. We could potentially demote this to a WARN
> and write a more friendly message and only print the error below, before
> returning to the caller (at that point we surely couldn't allocate the
> mempool).
> 

+1. I'd go further and have it as debug.
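Something like the sketch below is what I have in mind (a standalone model with made-up names and a fake allocator, not the actual dpdk_mp_create() code): retry quietly at debug level as the request is halved, and only log an error once every size down to the minimum has failed:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_NB_MBUF (4096 * 64)   /* initial shared-pool request */
#define MIN_NB_MBUF (4096 * 4)    /* smallest request worth trying */

/* Stand-in for rte_pktmbuf_pool_create(): pretend the system only has
 * memory for 'available_mbufs' mbufs, so larger requests fail. */
static uint32_t available_mbufs = 4096 * 16;

static bool
fake_pool_create(uint32_t n_mbufs)
{
    return n_mbufs <= available_mbufs;
}

/* Halve the request on each failure, logging only a debug message per
 * retry; an error is logged only when even MIN_NB_MBUF cannot be
 * allocated. Returns the number of mbufs allocated, or 0 on failure. */
static uint32_t
create_pool_with_retry(void)
{
    uint32_t n_mbufs = MAX_NB_MBUF;

    do {
        if (fake_pool_create(n_mbufs)) {
            return n_mbufs;
        }
        /* Demoted from ERR: failing at the larger sizes is expected. */
        printf("DBG: mempool create of %u mbufs failed, retrying\n",
               n_mbufs);
        n_mbufs /= 2;
    } while (n_mbufs >= MIN_NB_MBUF);

    printf("ERR: could not allocate a mempool of at least %u mbufs\n",
           MIN_NB_MBUF);
    return 0;
}
```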
Lam, Tiago May 27, 2018, 5:04 p.m. UTC | #5
Hi Kevin,

On 25/05/2018 18:22, Kevin Traynor wrote:
> Hi Tiago,
> 

[snip]

>>> +
>>> +static struct dpdk_mp *
>>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>> +{
>>> +    struct dpdk_mp *dmp;
>>> +    bool reuse = false;
>>>  
>>>      ovs_mutex_lock(&dpdk_mp_mutex);
>>> -    if (dpdk_mp_full(mp)) {
>>> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>>> -        rte_mempool_free(mp);
>>> -    } else {
>>> -        struct dpdk_mp *dmp;
>>> +    /* Check if shared mempools are being used, if so check existing mempools
>>> +     * to see if reuse is possible. */
>>> +    if (!per_port_mp) {
>>> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>> +            if (dmp->socket_id == dev->requested_socket_id
>>> +                && dmp->mtu == mtu) {
>>> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>>> +                dmp->refcount++;
>>> +                reuse = true;
>>> +                break;
>>> +            }
>>> +        }
>>> +    }
>>> +    /* Sweep mempools after reuse or before create. */
>>> +    dpdk_mp_sweep();
>>
>> *Should dpdk_mp_sweep() be called when destroying an interface as well,
>> in common_destruct()? While testing, I've noticed that if I add one port
>> and delete the same port the mempool will still be allocated until you
>> add another port, since dpdk_mp_sweep() will only be called on the next
>> call to dpdk_mp_get(). This could potentially create problems in the
>> per-port model where one deletes a certain number of ports and can't add
>> any other ports because there are (hanging) mempools holding memory.
>>
> 
> With the shared model, it was deliberate to leave the dpdk_mp_sweep() as
> late as possible to give more time for buffers to be returned to it and
> to allow for the possibility that it might be reused. In order to not
> require any additional memory, it is done just before a new mempool is
> created.
> 

Thanks for giving some context, I also found that to be the case while
testing with both models.

> It's not a problem to call dpdk_mp_sweep() anytime but not sure I follow
> the issue you've raised - sweep would still be called before mempools
> for any new port are created. Is there a case missed?
> 

Maybe I haven't explained myself as clearly as I wanted to. I wouldn't
say it is a missed case as much as a consequence of what you just
mentioned - "leave the dpdk_mp_sweep() as late as possible", but I
thought it would be best to flag it just so we are aware.

Let's say you are using the per-port model and you add port X, which
gets mempool A allocated to it. And let's say you immediately delete
that port X. What I was seeing while testing is that mempool A is not
freed right away, only when a new port, let's say port Y, is added. This
also goes in-line with what you just mentioned.

Now, my point is that this could potentially become an issue in the
per-port model, since it doesn't reuse the mempools. If you add a bunch
of ports and then delete them in a batch (or maybe this is not a
use-case?), if you try to add more ports you may find yourself out of
memory, because the mempools haven't actually been freed when deleting
the ports.
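To make the lifecycle I'm describing concrete, here is a minimal standalone model (the toy_* names are made up, single-entry "list", no locking, no mbuf tracking - not the real netdev-dpdk code): put() only drops a reference, and the memory is only reclaimed by the sweep that runs on the next get():

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of a refcounted pool; 'freed' is a flag rather than a real
 * free() so callers can observe when the sweep reclaimed it. */
struct toy_mp {
    int refcount;
    bool freed;
};

static struct toy_mp *mp_list = NULL;   /* at most one pool in this model */

/* Like dpdk_mp_put(): drop a reference but do NOT free; the pool
 * lingers on the list until the next sweep. */
static void
toy_mp_put(struct toy_mp *mp)
{
    mp->refcount--;
}

/* Like dpdk_mp_sweep(): reclaim list entries whose refcount reached 0. */
static void
toy_mp_sweep(void)
{
    if (mp_list && mp_list->refcount == 0) {
        mp_list->freed = true;
        mp_list = NULL;
    }
}

/* Like dpdk_mp_get() in the per-port model: sweep first, then always
 * allocate a fresh pool. */
static struct toy_mp *
toy_mp_get(void)
{
    struct toy_mp *mp;

    toy_mp_sweep();
    mp = calloc(1, sizeof *mp);
    mp->refcount = 1;
    mp_list = mp;
    return mp;
}
```

So deleting port X leaves its pool allocated, and it is only reclaimed when adding port Y triggers the next sweep.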

>>> +
>>> +/* Depending on the memory model being used, this function tries to
>>> + * identify and reuse an existing mempool or tries to allocate a new
>>> + * mempool on requested_socket_id with mbuf size corresponding to
>>> + * requested_mtu. On success new configuration will be applied.
>>>   * On error, device will be left unchanged. */
>>>  static int
>>>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>      OVS_REQUIRES(dev->mutex)
>>>  {
>>>      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>>> -    struct rte_mempool *mp;
>>> +    struct dpdk_mp *mp;
>>>      int ret = 0;
>>> +    bool per_port_mp = dpdk_per_port_mempool_enabled();
>>>  
>>> -    dpdk_mp_sweep();
>>> +    /* With shared mempools we do not need to configure a mempool if the MTU
>>> +     * and socket ID have not changed, the previous configuration is still
>>> +     * valid so return 0 */
>>> +    if (dev->mtu == dev->requested_mtu
>>> +        && dev->socket_id == dev->requested_socket_id
>>> +        && (!per_port_mp)) {
>>
>> I also find that moving the `(!per_port_mp)` condition to the beginning
>> improves readability here. It even goes hand in hand with your comment just
>> above - "With shared mempools we do not ...".
>>
>>> +        return ret;
>>> +    }
>>>  
>>> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>>> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>>>      if (!mp) {
>>>          VLOG_ERR("Failed to create memory pool for netdev "
>>>                   "%s, with MTU %d on socket %d: %s\n",
>>>                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
>>> -                 rte_strerror(rte_errno));
>>> -        ret = rte_errno;
>>> +        rte_strerror(rte_errno));
>>
>> Missing indentation here.
>>
>>> +        return rte_errno;
>>>      } else {
>>> -        /* If a new MTU was requested and its rounded value equals the one
>>> -         * that is currently used, then the existing mempool is returned. */
>>> -        if (dev->mp != mp) {
>>> -            /* A new mempool was created, release the previous one. */
>>> -            dpdk_mp_release(dev->mp);
>>> -        } else {
>>> -            ret = EEXIST;
>>> +        /* Check for any pre-existing dpdk_mp for the device */
>>> +        if (dev->dpdk_mp != NULL) {
>>> +            /* If a new MTU was requested and its rounded value equals the
>>> +             * one that is currently used, then the existing mempool is
>>> +             * returned. */
>>> +            if (dev->dpdk_mp->mp != mp->mp) {
>>> +                /* A new mempool was created, release the previous one. */
>>> +                dpdk_mp_put(dev->dpdk_mp);
>>> +            } else {
>>> +                /* If the mempool already exists in the current dpdk_mp then
>>> +                 * we need to ensure dpdk_mp that was just created is freed in
>>> +                 * the next sweep as it will not be used. */
>> The code path around the `else` block here will only be called when
>> `!per_port_mp`. This is because `dpdk_mp_get()` will only return an
>> already existing mempool when using the shared model. Otherwise a new
>> one is always returned, and thus the `if (dev->dpdk_mp->mp != mp->mp)`
>> will be true. Am I reading this right? If so then refactoring this a bit
>> to differentiate on `per_port_mp` might help on readability - this goes
>> in-line with Kevin's comment about making this piece a bit more readable.
>>
>> On the same note, the comment above misled me to think that the
>> allocated `mp` is being freed, which would result in error since the
>> same `mp` is then assigned below. Instead, what it is doing is
>> decrementing the refcount in struct dpdk_mp, which might end up being
>> freed on the next dpdk_mp_sweep() if `refcount=0`. But that won't happen
>> on the shared model unless no port is using the mempool.
>>
> 
> but this port is using it :-)
> 
> +            if (dev->dpdk_mp->mp != mp->mp) {
> 
> is false, which means they both have the same mempool. The refcount in
> dpdk_mp only counts users of the dpdk_mp, not the actual mp, so it could
> then be freed while still needed.
> 

I'm not sure I understand your point entirely. It is indeed true that
both `mp`s will be equal and thus `dpdk_mp_put()` will be called,
decrementing the `refcount`. But `refcount` shouldn't get to 0 (and thus
the mempool shouldn't be freed during sweep) as the previous call to
`dpdk_mp_get()` is also incrementing `refcount` by 1.

(In fact, the `else` block seems to be here only to decrement the
`refcount` when a port has been reconfigured and ends up reusing the
same mempool it was using before. In that case, the previous call to
`dpdk_mp_get()` is "wrongly" incrementing `refcount` by 1.)

I also tried to crash the patch by exercising this part of the code with
the following:
1. Add one port - `refcount` here was 1;
2. Reconfigure that port (but keep the MTU, so the same mempool gets
reused) - `refcount` here was 1;
3. Add a second port (this should trigger a call to `dpdk_mp_sweep()`) -
`refcount` here was 2;
4. Send some traffic.

And wasn't able to do so. I do agree, though, that this part can be
improved.
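For reference, the accounting in steps 1-3 can be reproduced with a small standalone model (the toy_* names are made up, and a single-slot "list" stands in for dpdk_mp_list - this is a sketch of the shared-model behaviour, not the patch itself):

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy model of a shared, refcounted mempool wrapper. */
struct toy_dmp {
    int socket_id;
    int mtu;
    int refcount;
};

static struct toy_dmp *shared_list = NULL;  /* single-slot "list" */

/* Shared-model dpdk_mp_get(): reuse bumps refcount, else create. */
static struct toy_dmp *
toy_dmp_get(int socket_id, int mtu)
{
    struct toy_dmp *dmp;

    if (shared_list && shared_list->socket_id == socket_id
        && shared_list->mtu == mtu) {
        shared_list->refcount++;
        return shared_list;
    }
    dmp = calloc(1, sizeof *dmp);
    dmp->socket_id = socket_id;
    dmp->mtu = mtu;
    dmp->refcount = 1;
    shared_list = dmp;
    return dmp;
}

static void
toy_dmp_put(struct toy_dmp *dmp)
{
    dmp->refcount--;
}

/* The else-branch of netdev_dpdk_mempool_configure(): a reconfigure
 * that resolves to the same pool must undo the extra reference taken
 * by toy_dmp_get(), leaving the count unchanged overall. */
static void
toy_reconfigure_same_pool(struct toy_dmp *dev_dmp)
{
    struct toy_dmp *dmp = toy_dmp_get(dev_dmp->socket_id, dev_dmp->mtu);

    if (dmp == dev_dmp) {
        toy_dmp_put(dmp);   /* reuse: drop the duplicate reference */
    }
}
```

With this, the refcount stays at 1 after the reconfigure and only rises to 2 when a genuine second user appears, matching what I observed.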

Regards,
Tiago.
Kevin Traynor May 28, 2018, 12:32 p.m. UTC | #6
On 05/27/2018 06:04 PM, Lam, Tiago wrote:
> Hi Kevin,
> 
> On 25/05/2018 18:22, Kevin Traynor wrote:
>> Hi Tiago,
>>
> 
> [snip]
> 
>>>> +
>>>> +static struct dpdk_mp *
>>>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>> +{
>>>> +    struct dpdk_mp *dmp;
>>>> +    bool reuse = false;
>>>>  
>>>>      ovs_mutex_lock(&dpdk_mp_mutex);
>>>> -    if (dpdk_mp_full(mp)) {
>>>> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>>>> -        rte_mempool_free(mp);
>>>> -    } else {
>>>> -        struct dpdk_mp *dmp;
>>>> +    /* Check if shared mempools are being used, if so check existing mempools
>>>> +     * to see if reuse is possible. */
>>>> +    if (!per_port_mp) {
>>>> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>> +            if (dmp->socket_id == dev->requested_socket_id
>>>> +                && dmp->mtu == mtu) {
>>>> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>>>> +                dmp->refcount++;
>>>> +                reuse = true;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +    /* Sweep mempools after reuse or before create. */
>>>> +    dpdk_mp_sweep();
>>>
>>> *Should dpdk_mp_sweep() be called when destroying an interface as well,
>>> in common_destruct()? While testing, I've noticed that if I add one port
>>> and delete the same port the mempool will still be allocated until you
>>> add another port, since dpdk_mp_sweep() will only be called on the next
>>> call to dpdk_mp_get(). This could potentially create problems in the
>>> per-port model where one deletes a certain number of ports and can't add
>>> any other ports because there are (hanging) mempools holding memory.
>>>
>>
>> With the shared model, it was deliberate to leave the dpdk_mp_sweep() as
>> late as possible to give more time for buffers to be returned to it and
>> to allow for the possibility that it might be reused. In order to not
>> require any additional memory, it is done just before a new mempool is
>> created.
>>
> 
> Thanks for giving some context, I also found that to be the case while
> testing with both models.
> 
>> It's not a problem to call dpdk_mp_sweep() anytime but not sure I follow
>> the issue you've raised - sweep would still be called before mempools
>> for any new port are created. Is there a case missed?
>>
> 
> Maybe I haven't explained myself as clearly as I wanted to. I wouldn't
> say it is a missed case as much as a consequence of what you just
> mentioned - "leave the dpdk_mp_sweep() as late as possible", but I
> thought it would be best to flag it just so we are aware.
> 
> Let's say you are using the per-port model and you add port X, which
> gets mempool A allocated to it. And let's say you immediately delete
> that port X. What I was seeing while testing is that mempool A is not
> freed right away, only when a new port, let's say port Y, is added. This
> also goes in-line with what you just mentioned.
> 
> Now, my point is that this could potentially become an issue in the
> per-port model, since it doesn't reuse the mempools. If you add a bunch
> of ports and then delete them in a batch (or maybe this is not a
> use-case?), if you try to add more ports you may find yourself out of
> memory, because the mempools haven't actually been freed when deleting
> the ports.
> 

I think it will be ok, because the sweep should still be called at least
once before creating the next mempool.

>>>> +
>>>> +/* Depending on the memory model being used, this function tries to
>>>> + * identify and reuse an existing mempool or tries to allocate a new
>>>> + * mempool on requested_socket_id with mbuf size corresponding to
>>>> + * requested_mtu. On success new configuration will be applied.
>>>>   * On error, device will be left unchanged. */
>>>>  static int
>>>>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>>      OVS_REQUIRES(dev->mutex)
>>>>  {
>>>>      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>>>> -    struct rte_mempool *mp;
>>>> +    struct dpdk_mp *mp;
>>>>      int ret = 0;
>>>> +    bool per_port_mp = dpdk_per_port_mempool_enabled();
>>>>  
>>>> -    dpdk_mp_sweep();
>>>> +    /* With shared mempools we do not need to configure a mempool if the MTU
>>>> +     * and socket ID have not changed, the previous configuration is still
>>>> +     * valid so return 0 */
>>>> +    if (dev->mtu == dev->requested_mtu
>>>> +        && dev->socket_id == dev->requested_socket_id
>>>> +        && (!per_port_mp)) {
>>>
>>> I also find that moving the `(!per_port_mp)` condition to the beginning
>>> improves readability here. It even goes in hand with your comment just
>>> above - "With shared mempools we do not ...".
>>>
>>>> +        return ret;
>>>> +    }
>>>>  
>>>> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>>>> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>>>>      if (!mp) {
>>>>          VLOG_ERR("Failed to create memory pool for netdev "
>>>>                   "%s, with MTU %d on socket %d: %s\n",
>>>>                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
>>>> -                 rte_strerror(rte_errno));
>>>> -        ret = rte_errno;
>>>> +        rte_strerror(rte_errno));
>>>
>>> Missing indentation here.
>>>
>>>> +        return rte_errno;
>>>>      } else {
>>>> -        /* If a new MTU was requested and its rounded value equals the one
>>>> -         * that is currently used, then the existing mempool is returned. */
>>>> -        if (dev->mp != mp) {
>>>> -            /* A new mempool was created, release the previous one. */
>>>> -            dpdk_mp_release(dev->mp);
>>>> -        } else {
>>>> -            ret = EEXIST;
>>>> +        /* Check for any pre-existing dpdk_mp for the device */
>>>> +        if (dev->dpdk_mp != NULL) {
>>>> +            /* If a new MTU was requested and its rounded value equals the
>>>> +             * one that is currently used, then the existing mempool is
>>>> +             * returned. */
>>>> +            if (dev->dpdk_mp->mp != mp->mp) {
>>>> +                /* A new mempool was created, release the previous one. */
>>>> +                dpdk_mp_put(dev->dpdk_mp);
>>>> +            } else {
>>>> +                /* If the mempool already exists in the current dpdk_mp then
>>>> +                 * we need to ensure dpdk_mp that was just created is freed in
>>>> +                 * the next sweep as it will not be used. */
>>> The code path around the `else` block here will only be called when
>>> `!per_port_mp`. This is because `dpdk_mp_get()` will only return an
>>> already existing mempool when using the shared model. Otherwise a new
>>> one is always returned, and thus the `if (dev->dpdk_mp->mp != mp->mp)`
>>> will be true. Am I reading this right? If so then refactoring this a bit
>>> to differentiate on `per_port_mp` might help on readability - this goes
>>> in-line with Kevin's comment about making this piece a bit more readable.
>>>
>>> On the same note, the comment above misled me to think that the
>>> allocated `mp` is being freed, which would result in error since the
>>> same `mp` is then assigned below. Instead, what it is doing is
>>> decrementing the refcount in struct dpdk_mp, which might end up being
>>> freed on the next dpdk_mp_sweep() if `refcount=0`. But that won't happen
>>> on the shared model unless no port is using the mempool.
>>>
>>
>> but this port is using it :-)
>>
>> +            if (dev->dpdk_mp->mp != mp->mp) {
>>
>> is false, which means they both have the same mempool. The refcount in
>> dpdk_mp only counts users of the dpdk_mp, not the actual mp, so it could
>> then be freed while still needed.
>>
> 
> I'm not sure I understand your point entirely. It is indeed true that
> both `mp`s will be equal and thus `dpdk_mp_put()` will be called,
> decrementing the `refcount`. But `refcount` shouldn't get to 0 (and thus
> the mempool shouldn't be freed during sweep) as the previous call to
> `dpdk_mp_get()` is also incrementing `refcount` by 1.
> 

sure, that is the case if the dpdk_mp structs are guaranteed to be the
same as well, but that is dependent on how mp_get/create works.

> (In fact, the `else` block seems to be here only to decrement the
> `refcount` when a port has been reconfigured and ends up reusing the
> same mempool it was using before. In that case, the previous call to
> `dpdk_mp_get()` is "wrongly" incrementing `refcount` by 1.)
> 
> I also tried to crash the patch by exercising this part of the code with
> the following:
> 1. Add one port - `refcount` here was 1;
> 2. Reconfigure that port (but keep the MTU, so the same mempool gets
> reused) - `refcount` here was 1;
> 3. Add a second port (this should trigger a call to `dpdk_mp_sweep()`) -
> `refcount` here was 2;
> 4. Send some traffic.
> 
> And wasn't able to do so. I do agree, though, that this part can be
> improved.
> 
> Regards,
> Tiago.
>
Stokes, Ian May 30, 2018, 10:25 a.m. UTC | #7
> On 05/18/2018 03:30 PM, Ian Stokes wrote:
> > This commit re-introduces the concept of shared mempools as the
> > default memory model for DPDK devices. Per port mempools are still
> > available but must be enabled explicitly by a user.
> >
> > OVS previously used a shared mempool model for ports with the same MTU
> > and socket configuration. This was replaced by a per port mempool
> > model to address issues flagged by users such as:
> >
> > https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
> >
> > However the per port model potentially requires an increase in memory
> > resource requirements to support the same number of ports and
> > configuration as the shared port model.
> >
> > This is considered a blocking factor for current deployments of OVS
> > when upgrading to future OVS releases as a user may have to
> > redimension memory for the same deployment configuration. This may not
> > be possible for users.
> >
> 
> Hi Ian, thanks for addressing this. Few comments below...
> 
> > This commit resolves the issue by re-introducing shared mempools as
> > the default memory behaviour in OVS DPDK but also refactors the memory
> > configuration code to allow for per port mempools.
> 
> > This patch adds a new global config option, per-port-mp-enabled, that
> > controls the enablement of per port mempools for DPDK devices.
> >
> >     ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> >
> 
> It doesn't need 'enabled' in the name. If we say 'per-port-mempool=true/false',
> then it's already obvious it's enabled.
> Actually, I wonder if 'per-port-memory' would be better as it doesn't
> require a user to know what a DPDK mempool is.
> 

Agree, will change in the v2.

> > This value defaults to false; to enable per port mempool support, this
> > field should be set to true when setting other global parameters on
> > init (such as "dpdk-socket-mem", for example). Changing the value at
> > runtime is not supported, and requires restarting the vswitch daemon.
> >
> > The mempool sweep functionality is also replaced with the sweep
> > functionality from OVS 2.9 found in commits
> >
> > c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
> > a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
> >
> > As this patch is RFC there are a number of TO-DOs including adding a
> > memory calculation section to the documentation for both models. This
> > is expected to be completed in the v1 after RFC.
> >
> > Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> > ---
> >  Documentation/automake.mk            |   1 +
> >  Documentation/topics/dpdk/index.rst  |   1 +
> >  Documentation/topics/dpdk/memory.rst |  67 +++++++
> >  NEWS                                 |   1 +
> >  lib/dpdk-stub.c                      |   6 +
> >  lib/dpdk.c                           |  13 ++
> >  lib/dpdk.h                           |   1 +
> >  lib/netdev-dpdk.c                    | 326 +++++++++++++++++++++--------------
> >  vswitchd/vswitch.xml                 |  16 ++
> >  9 files changed, 305 insertions(+), 127 deletions(-)
> >  create mode 100644 Documentation/topics/dpdk/memory.rst
> >
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index 683ca14..14c2189 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -36,6 +36,7 @@ DOC_SOURCE = \
> >  	Documentation/topics/dpdk/index.rst \
> >  	Documentation/topics/dpdk/bridge.rst \
> >  	Documentation/topics/dpdk/jumbo-frames.rst \
> > +	Documentation/topics/dpdk/memory.rst \
> >  	Documentation/topics/dpdk/pdump.rst \
> >  	Documentation/topics/dpdk/phy.rst \
> >  	Documentation/topics/dpdk/pmd.rst \
> > diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> > index 181f61a..cf24a7b 100644
> > --- a/Documentation/topics/dpdk/index.rst
> > +++ b/Documentation/topics/dpdk/index.rst
> > @@ -40,3 +40,4 @@ The DPDK Datapath
> >     /topics/dpdk/qos
> >     /topics/dpdk/pdump
> >     /topics/dpdk/jumbo-frames
> > +   /topics/dpdk/memory
> > diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> > new file mode 100644
> > index 0000000..1198067
> > --- /dev/null
> > +++ b/Documentation/topics/dpdk/memory.rst
> > @@ -0,0 +1,67 @@
> > +..
> > +        Copyright 2018, Intel, Inc.
> > +
> > +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > +      not use this file except in compliance with the License. You may obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing, software
> > +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > +      License for the specific language governing permissions and limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +=========================
> > +DPDK Device Memory Models
> > +=========================
> > +
> > +DPDK device memory can be allocated in one of two ways in OVS DPDK,
> > +shared mempools or per port mempools. The specifics of both are
> > +detailed below.
> > +
> > +Shared Mempools
> > +---------------
> > +
> > +By default, OVS DPDK uses a shared mempool model. This means that
> > +multiple ports can share the same mempool. For example when a port is
> > +added it will have a given MTU and socket ID associated with it. If a
> > +mempool has been created previously for an existing port that has the
> > +same MTU and socket ID, that mempool is used for both ports. If there
> > +is no existing mempool supporting these parameters then a new mempool
> > +is created.
> > +
> > +
> > +Per Port Mempools
> > +-----------------
> > +
> > +In the per port mempool model, mempools are created per device and
> > +are not shared. The benefit of this is a more transparent memory
> > +model where mempools will not be exhausted by other DPDK devices.
> > +However this comes at a potential increase in cost for memory
> > +dimensioning for a given deployment. Users should be aware of the
> > +memory requirements for their deployment before using this model and
> > +allocate the required hugepage memory.
> > +
> > +Per port mempool support may be enabled via a global config value,
> > +```per-port-mp-enabled```. Setting this to true enables the per port
> > +mempool model for all DPDK devices in OVS::
> > +
> > +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> > +
> > +.. important::
> > +
> > +    Changing this value requires restarting the daemon.
> > +
> 
> It's a good general guide, but the daemon does not need to be restarted
> once it's set before dpdk-init=true.

Good point. On a separate note, I think this would also apply to the vhost iommu option? (It also requests a daemon restart in the documentation/commit messages.) Will look into it.

> 
> > +.. todo::
> > +
> > +   Add memory calculation section for both mempool models.
> > diff --git a/NEWS b/NEWS
> > index ec548b0..c9991cf 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -30,6 +30,7 @@ Post-v2.9.0
> >       * New 'check-dpdk' Makefile target to run a new system testsuite.
> >         See Testing topic for the details.
> >       * Add LSC interrupt support for DPDK physical devices.
> > +     * Support both shared and per port mempools for DPDK devices.
> >     - Userspace datapath:
> >       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a
> >         single PMD
> >       * Detailed PMD performance metrics available with new command
> > diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> > index 041cd0c..d2a9718 100644
> > --- a/lib/dpdk-stub.c
> > +++ b/lib/dpdk-stub.c
> > @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void)
> >      return false;
> >  }
> >
> > +bool
> > +dpdk_per_port_mempool_enabled(void)
> > +{
> > +    return false;
> > +}
> > +
> >  void
> >  print_dpdk_version(void)
> >  {
> > diff --git a/lib/dpdk.c b/lib/dpdk.c
> > index 00dd974..e251970 100644
> > --- a/lib/dpdk.c
> > +++ b/lib/dpdk.c
> > @@ -43,6 +43,8 @@ static FILE *log_stream = NULL;       /* Stream for DPDK log redirection */
> >
> >  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
> >  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU
> > support */
> > +static bool per_port_mp_enabled = false; /* Status of per port mempool
> > +                                          * support */
> >
> >  static int
> >  process_vhost_flags(char *flag, const char *default_val, int size,
> > @@ -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config)
> >      VLOG_INFO("IOMMU support for vhost-user-client %s.",
> >                 vhost_iommu_enabled ? "enabled" : "disabled");
> >
> > +    per_port_mp_enabled = smap_get_bool(ovs_other_config,
> > +                                        "per-port-mp-enabled", false);
> > +    VLOG_INFO("Per port mempool for DPDK devices %s.",
> > +               per_port_mp_enabled ? "enabled" : "disabled");
> > +
> >      argv = grow_argv(&argv, 0, 1);
> >      argc = 1;
> >      argv[0] = xstrdup(ovs_get_program_name());
> > @@ -498,6 +505,12 @@ dpdk_vhost_iommu_enabled(void)
> >      return vhost_iommu_enabled;
> >  }
> >
> > +bool
> > +dpdk_per_port_mempool_enabled(void)
> > +{
> > +    return per_port_mp_enabled;
> > +}
> > +
> >  void
> >  dpdk_set_lcore_id(unsigned cpu)
> >  {
> > diff --git a/lib/dpdk.h b/lib/dpdk.h
> > index b041535..243385c 100644
> > --- a/lib/dpdk.h
> > +++ b/lib/dpdk.h
> > @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config);
> >  void dpdk_set_lcore_id(unsigned cpu);
> >  const char *dpdk_get_vhost_sock_dir(void);
> >  bool dpdk_vhost_iommu_enabled(void);
> > +bool dpdk_per_port_mempool_enabled(void);
> >  void print_dpdk_version(void);
> >
> >  #endif /* dpdk.h */
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index 87152a7..cda2ace 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >  #define NETDEV_DPDK_MBUF_ALIGN      1024
> >  #define NETDEV_DPDK_MAX_PKT_LEN     9728
> >
> > -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
> > - * roughly estimated number of mbufs: if this fails (because the system doesn't
> > - * have enough hugepages) we keep halving the number until the allocation
> > - * succeeds or we reach MIN_NB_MBUF */
> > +/* Max and min number of packets in the mempool.  OVS tries to allocate a
> > + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> > + * enough hugepages) we keep halving the number until the allocation succeeds
> > + * or we reach MIN_NB_MBUF */
> > +
> > +#define MAX_NB_MBUF          (4096 * 64)
> >  #define MIN_NB_MBUF          (4096 * 4)
> >  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
> >
> > +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> > +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
> > +                  == 0);
> > +
> > +/* The smallest possible NB_MBUF that we're going to try should be a multiple
> > + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
> > +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
> > +                  % MP_CACHE_SZ == 0);
> > +
> >  /*
> >   * DPDK XSTATS Counter names definition
> >   */
> > @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
> >  static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
> >      = OVS_MUTEX_INITIALIZER;
> >
> > -/* Contains all 'struct dpdk_mp's. */
> 
> Not strictly needed, but comment is still valid

This was lost in rebasing I think, will return it in v2.

> 
> > -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
> > -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
> > +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
> > +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
> >
> > -/* Wrapper for a mempool released but not yet freed. */
> >  struct dpdk_mp {
> >       struct rte_mempool *mp;
> > +     int mtu;
> > +     int socket_id;
> > +     int refcount;
> >       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
> >   };
> >
> > @@ -381,7 +393,7 @@ struct netdev_dpdk {
> >
> >      PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
> >          struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
> > -        struct rte_mempool *mp;
> > +        struct dpdk_mp *dpdk_mp;
> >
> >          /* virtio identifier for vhost devices */
> >          ovsrcu_index vid;
> > @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
> >      dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
> >  }
> >
> > -static int
> > -dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> > -{
> > -    unsigned ring_count;
> > -    /* This logic is needed because rte_mempool_full() is not guaranteed to
> > -     * be atomic and mbufs could be moved from mempool cache --> mempool ring
> > -     * during the call. However, as no mbufs will be taken from the mempool
> > -     * at this time, we can work around it by also checking the ring entries
> > -     * separately and ensuring that they have not changed.
> > -     */
> > -    ring_count = rte_mempool_ops_get_count(mp);
> > -    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> > -        return 1;
> > -    }
> > -
> > -    return 0;
> > -}
> > -
> > -/* Free unused mempools. */
> > -static void
> > -dpdk_mp_sweep(void)
> > +/* Calculating the required number of mbufs differs depending on the
> > + * mempool model being used. Check if per port mempools are in use before
> > + * calculating.
> > + */
> > +static uint32_t
> > +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> >  {
> > -    struct dpdk_mp *dmp, *next;
> > +    uint32_t n_mbufs;
> >
> > -    ovs_mutex_lock(&dpdk_mp_mutex);
> > -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> > -        if (dpdk_mp_full(dmp->mp)) {
> > -            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> > -            ovs_list_remove(&dmp->list_node);
> > -            rte_mempool_free(dmp->mp);
> > -            rte_free(dmp);
> > +    if (!per_port_mp) {
> > +        /* Shared mempools are being used.
> > +         * XXX: this is a really rough method of provisioning memory.
> 
> XXX causes warnings in checkpatch (found that out on rte_mempool_full
> patch)
> 

Yes, I spotted this myself before sending the patch :). The agreed guideline regarding 'XXX' in patches is that it's allowed in order to flag future work (and to ensure it's not just something the author missed).

There was a fair bit of discussion regarding the memory calculations on the mailing list previously, so I expect they will change in the future, possibly a reduction or removal of the tx-queue-specific calculations for the per port implementation.

For the v1 here I kept the comment just to highlight it could be reworked in the future.
 
> > +         * It's impossible to determine what the exact memory requirements are
> > +         * when the number of ports and rxqs that utilize a particular mempool
> > +         * can change dynamically at runtime. For now, use this rough
> > +         * heuristic.
> > +         */
> > +        if (mtu >= ETHER_MTU) {
> > +            n_mbufs = MAX_NB_MBUF;
> > +        } else {
> > +            n_mbufs = MIN_NB_MBUF;
> >          }
> > +    } else {
> > +        /* Per port mempools are being used.
> > +         * XXX: rough estimation of number of mbufs required for this port:
> > +         * <packets required to fill the device rxqs>
> > +         * + <packets that could be stuck on other ports txqs>
> > +         * + <packets in the pmd threads>
> > +         * + <additional memory for corner cases>
> > +         */
> > +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> > +                  + dev->requested_n_txq * dev->requested_txq_size
> > +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> > +                  + MIN_NB_MBUF;
> >      }
> > -    ovs_mutex_unlock(&dpdk_mp_mutex);
> > -}
> > -
> > -/* Ensure a mempool will not be freed. */
> > -static void
> > -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> > -{
> > -    struct dpdk_mp *dmp, *next;
> >
> > -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> > -        if (dmp->mp == mp) {
> > -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
> > -            ovs_list_remove(&dmp->list_node);
> > -            rte_free(dmp);
> > -            break;
> > -        }
> > -    }
> > +    return n_mbufs;
> >  }
> >
> > -/* Returns a valid pointer when either of the following is true:
> > - *  - a new mempool was just created;
> > - *  - a matching mempool already exists. */
> > -static struct rte_mempool *
> > -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> > +static struct dpdk_mp *
> > +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> >  {
> >      char mp_name[RTE_MEMPOOL_NAMESIZE];
> >      const char *netdev_name = netdev_get_name(&dev->up);
> >      int socket_id = dev->requested_socket_id;
> >      uint32_t n_mbufs;
> >      uint32_t hash = hash_string(netdev_name, 0);
> > -    struct rte_mempool *mp = NULL;
> > -
> > -    /*
> > -     * XXX: rough estimation of number of mbufs required for this port:
> > -     * <packets required to fill the device rxqs>
> > -     * + <packets that could be stuck on other ports txqs>
> > -     * + <packets in the pmd threads>
> > -     * + <additional memory for corner cases>
> > -     */
> > -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> > -              + dev->requested_n_txq * dev->requested_txq_size
> > -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> > -              + MIN_NB_MBUF;
> > +    struct dpdk_mp *dmp = NULL;
> > +    int ret;
> > +
> > +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
> > +    if (!dmp) {
> > +        return NULL;
> > +    }
> > +    dmp->socket_id = socket_id;
> > +    dmp->mtu = mtu;
> > +    dmp->refcount = 1;
> > +
> > +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
> >
> > -    ovs_mutex_lock(&dpdk_mp_mutex);
> >      do {
> >          /* Full DPDK memory pool name must be unique and cannot be
> >           * longer than RTE_MEMPOOL_NAMESIZE. */
> > -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> > +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> >                             "ovs%08x%02d%05d%07u",
> >                             hash, socket_id, mtu, n_mbufs);
> 
> you're using the hash of one dev in the name, but that mempool can later
> be used by other dev, and even be no longer used by the original dev.
> OTOH no one is going to know what the hash means anyway, so it shouldn't
> confuse any more on debug.

I agree; it's just to ensure a unique name for the mempool in the per port case, and it has no impact on the shared model.

I'll add a comment in the v2 to explain this and note that it's expected.

> 
> >          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
> > @@ -623,102 +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> >                    netdev_name, n_mbufs, socket_id,
> >                    dev->requested_n_rxq, dev->requested_n_txq);
> >
> > -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> > -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
> > -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
> > +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> > +                                          MP_CACHE_SZ,
> > +                                          sizeof (struct dp_packet)
> > +                                          - sizeof (struct rte_mbuf),
> > +                                          MBUF_SIZE(mtu)
> > +                                          - sizeof(struct dp_packet),
> > +                                          socket_id);
> >
> > -        if (mp) {
> > +        if (dmp->mp) {
> >              VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
> >                       mp_name, n_mbufs);
> >              /* rte_pktmbuf_pool_create has done some initialization of the
> > -             * rte_mbuf part of each dp_packet. Some OvS specific fields
> > -             * of the packet still need to be initialized by
> > -             * ovs_rte_pktmbuf_init. */
> > -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
> > +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
> > +             * initializes some OVS specific fields of dp_packet.
> > +             */
> > +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
> > +            return dmp;
> >          } else if (rte_errno == EEXIST) {
> >              /* A mempool with the same name already exists.  We just
> >               * retrieve its pointer to be returned to the caller. */
> > -            mp = rte_mempool_lookup(mp_name);
> > +            dmp->mp = rte_mempool_lookup(mp_name);
> >              /* As the mempool create returned EEXIST we can expect the
> >               * lookup has returned a valid pointer.  If for some reason
> >               * that's not the case we keep track of it. */
> >              VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
> > -                     mp_name, mp);
> > -            /* Ensure this reused mempool will not be freed. */
> > -            dpdk_mp_do_not_free(mp);
> 
> The mempool you are about to reuse could have a refcount of 0 and about to
> be freed by a sweep. So you would need something like the function above
> before giving up dpdk_mp_mutex. Maybe you could increase the refcount for
> it now and re-adjust later if you need to.
> 

So my thinking here was that we run the sweep in netdev_dpdk_mempool_configure() just prior to calling dpdk_mp_create().

Running the sweep prior should have removed any mempool with refcount 0 from the dpdk_mp_list.
The list itself is mutex guarded; if there is an existing mempool then I thought the associated refcount would still be 1, but maybe I missed something.

Do you think it could be modified between the sweep call and reaching this point?

> > +                     mp_name, dmp->mp);
> >          } else {
> >              VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
> >                       mp_name, n_mbufs);
> >          }
> > -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
> > +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
> >
> > -    ovs_mutex_unlock(&dpdk_mp_mutex);
> > -    return mp;
> > +    rte_free(dmp);
> > +    return NULL;
> >  }
> >
> > -/* Release an existing mempool. */
> > +static int
> > +dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> > +{
> > +    unsigned ring_count;
> > +    /* This logic is needed because rte_mempool_full() is not guaranteed to
> > +     * be atomic and mbufs could be moved from mempool cache --> mempool ring
> > +     * during the call. However, as no mbufs will be taken from the mempool
> > +     * at this time, we can work around it by also checking the ring entries
> > +     * separately and ensuring that they have not changed.
> > +     */
> > +    ring_count = rte_mempool_ops_get_count(mp);
> > +    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> > +        return 1;
> > +    }
> > +
> > +    return 0;
> 
> you'll need to rebase because of the changes here. Won't make any
> difference to your operation though
> 
> > +}
> > +
> > +/* Free unused mempools. */
> >  static void
> > -dpdk_mp_release(struct rte_mempool *mp)
> > +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
> >  {
> > -    if (!mp) {
> > -        return;
> > +    struct dpdk_mp *dmp, *next;
> > +
> > +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
> > +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
> > +            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> > +            ovs_list_remove(&dmp->list_node);
> > +            rte_mempool_free(dmp->mp);
> > +            rte_free(dmp);
> > +        }
> >      }
> > +}
> > +
> > +static struct dpdk_mp *
> > +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> > +{
> > +    struct dpdk_mp *dmp;
> > +    bool reuse = false;
> >
> >      ovs_mutex_lock(&dpdk_mp_mutex);
> > -    if (dpdk_mp_full(mp)) {
> > -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
> > -        rte_mempool_free(mp);
> > -    } else {
> > -        struct dpdk_mp *dmp;
> > +    /* Check if shared mempools are being used, if so check existing mempools
> > +     * to see if reuse is possible. */
> > +    if (!per_port_mp) {
> > +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
> > +            if (dmp->socket_id == dev->requested_socket_id
> > +                && dmp->mtu == mtu) {
> > +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
> > +                dmp->refcount++;
> > +                reuse = true;
> > +                break;
> > +            }
> > +        }
> > +    }
> > +    /* Sweep mempools after reuse or before create. */
> > +    dpdk_mp_sweep();
> >
> > -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
> > +    if (!reuse) {
> > +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
> >          if (dmp) {
> > -            dmp->mp = mp;
> > -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
> > +            ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
> >          }
> >      }
> > +
> >      ovs_mutex_unlock(&dpdk_mp_mutex);
> > +
> > +    return dmp;
> >  }
> >
> > -/* Tries to allocate a new mempool - or re-use an existing one where
> > - * appropriate - on requested_socket_id with a size determined by
> > - * requested_mtu and requested Rx/Tx queues.
> > - * On success - or when re-using an existing mempool - the new
> > - * configuration will be applied.
> > +/* Decrement reference to a mempool. */
> > +static void
> > +dpdk_mp_put(struct dpdk_mp *dmp)
> > +{
> > +    if (!dmp) {
> > +        return;
> > +    }
> > +
> > +    ovs_mutex_lock(&dpdk_mp_mutex);
> > +    ovs_assert(dmp->refcount);
> > +    dmp->refcount--;
> > +    ovs_mutex_unlock(&dpdk_mp_mutex);
> > +}
> > +
> > +/* Depending on the memory model being used this function tries to
> > + * identify and reuse an existing mempool or tries to allocate a new
> > + * mempool on requested_socket_id with mbuf size corresponding to
> > + * requested_mtu. On success new configuration will be applied.
> >   * On error, device will be left unchanged. */
> >  static int
> >  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
> >      OVS_REQUIRES(dev->mutex)
> >  {
> >      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
> > -    struct rte_mempool *mp;
> > +    struct dpdk_mp *mp;
> >      int ret = 0;
> > +    bool per_port_mp = dpdk_per_port_mempool_enabled();
> >
> > -    dpdk_mp_sweep();
> > +    /* With shared mempools we do not need to configure a mempool if the MTU
> > +     * and socket ID have not changed, the previous configuration is still
> > +     * valid so return 0 */
> > +    if (dev->mtu == dev->requested_mtu
> > +        && dev->socket_id == dev->requested_socket_id
> > +        && (!per_port_mp)) {
> 
> can remove brackets and probably more intuitive to check the memory model
> first

Yes, with your later comment regarding the refactor based on the memory model, I can address this.

> 
> > +        return ret;
> > +    }
> >
> > -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
> > +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
> >      if (!mp) {
> >          VLOG_ERR("Failed to create memory pool for netdev "
> >                   "%s, with MTU %d on socket %d: %s\n",
> >                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
> > -                 rte_strerror(rte_errno));
> > -        ret = rte_errno;
> > +        rte_strerror(rte_errno));
> > +        return rte_errno;
> >      } else {
> > -        /* If a new MTU was requested and its rounded value equals the
> one
> > -         * that is currently used, then the existing mempool is
> returned. */
> > -        if (dev->mp != mp) {
> > -            /* A new mempool was created, release the previous one. */
> > -            dpdk_mp_release(dev->mp);
> > -        } else {
> > -            ret = EEXIST;
> > +        /* Check for any pre-existing dpdk_mp for the device */
> > +        if (dev->dpdk_mp != NULL) {
> 
> Not sure you need this check, dpdk_mp_put will handle NULL
> 

Yes, dpdk_mp_put will handle NULL, but the if statement below that compares mempools accesses the mempools that are part of the dpdk_mp struct.
In the case that a device is newly created there is no previous dpdk_mp, and accessing dpdk_mp->mp would cause a segfault.

> > +            /* If a new MTU was requested and its rounded value equals the
> > +             * one that is currently used, then the existing mempool is
> > +             * returned. */
> > +            if (dev->dpdk_mp->mp != mp->mp) {
> > +                /* A new mempool was created, release the previous one. */
> > +                dpdk_mp_put(dev->dpdk_mp);
> > +            } else {
> > +                /* If the mempool already exists in the current dpdk_mp then
> > +                 * we need to ensure dpdk_mp that was just created is freed in
> > +                 * the next sweep as it will not be used. */
> > +                dpdk_mp_put(mp);
> 
> This doesn't look right - it will likely lead to the mempool you are using
> being freed also. I think you just want to free the dpdk_mp struct, but
> not the actual mempool.

Oops, good catch, you're correct: we only need to free the dpdk_mp in this case, not the mempool itself.

At this point the dpdk_mp has already been added to the list; it could make more sense to do this work in dpdk_mp_create() instead, to avoid touching the list altogether for the EEXIST case. I'll look into this for the v2.
> 
> > +                ret = EEXIST;
> 
> Setting ret, but returning 0 below
> 
> > +            }
> >          }
> > -        dev->mp = mp;
> > +        dev->dpdk_mp = mp;
> >          dev->mtu = dev->requested_mtu;
> >          dev->socket_id = dev->requested_socket_id;
> >          dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> >      }
> >
> > -    return ret;
> > +    return 0;
> >  }
> >
> 
> I know there were lots of paths already, but adding two different schemes
> is making the code here and in create even more confusing. I think there
> are a few subtle bugs above, and it's probably down to the complexity. I
> wonder could you use per_port_mp a bit more so that at least the paths for
> each scheme are more segregated? When reviewing, it's hard to understand
> whether a code branch would happen with one scheme, the other, or both.

+1, I'm looking at refactoring based on per_port_mp in the v2. The original intention was to keep the code footprint as small as possible, but for readability and clarity of code paths I'd agree.

Thanks
Ian
> 
> Kevin.
> 
> >  static void
> > @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
> >
> >          for (i = 0; i < n_rxq; i++) {
> >              diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
> > -                                          dev->socket_id, NULL, dev->mp);
> > +                                          dev->socket_id, NULL,
> > +                                          dev->dpdk_mp->mp);
> >              if (diag) {
> >                  VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
> >                            dev->up.name, i, rte_strerror(-diag));
> > @@ -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> >      memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
> >      rte_eth_link_get_nowait(dev->port_id, &dev->link);
> >
> > -    mbp_priv = rte_mempool_get_priv(dev->mp);
> > +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
> >      dev->buf_size = mbp_priv->mbuf_data_room_size -
> > RTE_PKTMBUF_HEADROOM;
> >
> >      /* Get the Flow control configuration for DPDK-ETH */
> > @@ -1176,7 +1248,7 @@ common_destruct(struct netdev_dpdk *dev)
> >      OVS_EXCLUDED(dev->mutex)
> >  {
> >      rte_free(dev->tx_q);
> > -    dpdk_mp_release(dev->mp);
> > +    dpdk_mp_put(dev->dpdk_mp);
> >
> >      free(ovsrcu_get_protected(struct ingress_policer *,
> > @@ -1928,7 +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> > +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> >          return EAGAIN;
> >      }
> >
> > -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
> > +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
> >                                      (struct rte_mbuf **) batch->packets,
> >                                      NETDEV_MAX_BURST);
> >      if (!nb_rx) {
> > @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >              continue;
> >          }
> >
> > -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> > +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> >          if (OVS_UNLIKELY(!pkts[txcnt])) {
> >              dropped += cnt - i;
> >              break;
> > @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
> >          ovs_mutex_lock(&dev->mutex);
> >          ovs_mutex_lock(&dpdk_mp_mutex);
> >
> > -        rte_mempool_dump(stream, dev->mp);
> > +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
> >
> >          ovs_mutex_unlock(&dpdk_mp_mutex);
> >          ovs_mutex_unlock(&dev->mutex);
> > @@ -3742,7 +3814,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
> >
> >      err = netdev_dpdk_mempool_configure(dev);
> >      if (!err) {
> > -        /* A new mempool was created. */
> > +        /* A new mempool was created or re-used. */
> >          netdev_change_seq_changed(&dev->up);
> >      } else if (err != EEXIST){
> >          return err;
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 7ab90d5..95520af 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -359,6 +359,22 @@
> >          </p>
> >        </column>
> >
> > +      <column name="other_config" key="per-port-mp-enabled"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          By default OVS DPDK uses a shared mempool memory model wherein
> > +          devices that have the same MTU and socket values can share the same
> > +          mempool. Setting this value to <code>true</code> changes this
> > +          behaviour. Per port mempools allow DPDK devices to use a private
> > +          mempool per device. This can provide greater transparency as
> > +          regards memory usage but potentially at the cost of greater memory
> > +          requirements.
> > +        </p>
> > +        <p>
> > +          Changing this value requires restarting the daemon.
> > +        </p>
> > +      </column>
> > +
> >        <column name="other_config" key="tx-flush-interval"
> >                type='{"type": "integer",
> >                       "minInteger": 0, "maxInteger": 1000000}'>
> >
Kevin Traynor May 30, 2018, 10:51 a.m. UTC | #8
On 05/30/2018 11:25 AM, Stokes, Ian wrote:
>>> -                     mp_name, mp);
>>> -            /* Ensure this reused mempool will not be freed. */
>>> -            dpdk_mp_do_not_free(mp);
>> The mempool you are about to reuse could have a refcount of 0 and about to
>> be freed by a sweep. So you would need something like the function above
>> before giving up dpdk_mp_mutex. Maybe you could increase the refcount for
>> it now and re-adjust later if you need to.
>>
> So my thinking here was that we run the sweep in netdev_dpdk_mempool_configure() just prior to calling dpdk_mp_create().
> 
> Running the sweep prior should have removed any mempool with refcount 0 in the dpdk_mp_list.

Not necessarily; freeing a 0-refcount dpdk_mp and its mempool in the sweep
is conditional on there being no in-use mbufs, so it might not have been freed yet.

It won't be freed while you have the lock, but you need to ensure while
you still have the lock that it won't be freed after you release the
lock but before you have updated refcount.

> The list itself is mutex guarded, if there is an existing mempool then the associated refcount would still be 1 I thought but maybe I missed something.
> 
> Do you think it's the case it could be modified between the sweep call and reaching this point?
> 

No, this part is ok, it's just the case above
Stokes, Ian May 30, 2018, 11:06 a.m. UTC | #9
> Hi Ian,
> 
> Thanks for bringing this forward.
> 
> I've tested this on current master by re-configuring MTUs of existing
> ports, adding new ports, deleting existing ones, etc. It seems to be
> behaving as expected:
> - On the shared model, it allocates a new mempool only if the MTU changes,
> otherwise reuses the already existing one. And it frees a mempool only if
> it isn't being used by any port;
> - On the per-port model, it allocates a new mempool for every port. And it
> frees the mempools when the ports are destroyed / MTU is changed.
> 
> There's a catch on how mempools are freed, though. I have a comment on
> that in-line below*.
> 
> (I see Kevin has sent a review as well, so I'll refrain to touch on the
> same points, unless to re-iterate.)

Thanks for the review, Tiago; comments inline.

> 
> On 18/05/2018 15:30, Ian Stokes wrote:
> > This commit re-introduces the concept of shared mempools as the
> > default memory model for DPDK devices. Per port mempools are still
> > available but must be enabled explicitly by a user.
> >
> > OVS previously used a shared mempool model for ports with the same MTU
> > and socket configuration. This was replaced by a per port mempool
> > model to address issues flagged by users such as:
> >
> > https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
> >
> > However the per port model potentially requires an increase in memory
> > resource requirements to support the same number of ports and
> > configuration as the shared port model.
> >
> > This is considered a blocking factor for current deployments of OVS
> > when upgrading to future OVS releases as a user may have to
> > redimension memory for the same deployment configuration. This may not
> > be possible for users.
> >
> > This commit resolves the issue by re-introducing shared mempools as
> > the default memory behaviour in OVS DPDK but also refactors the memory
> > configuration code to allow for per port mempools.
> >
> > This patch adds a new global config option, per-port-mp-enabled, that
> > controls the enablement of per port mempools for DPDK devices.
> >
> >     ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> >
> > This value defaults to false; to enable per port mempool support, this
> > field should be set to true when setting other global parameters on
> > init (such as "dpdk-socket-mem", for example). Changing the value at
> > runtime is not supported, and requires restarting the vswitch daemon.
> >
> > The mempool sweep functionality is also replaced with the sweep
> > functionality from OVS 2.9 found in commits
> >
> > c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
> > a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
> >
> > As this patch is RFC there are a number of TO-DOs including adding a
> > memory calculation section to the documentation for both models. This
> > is expected to be completed in the v1 after RFC.
> >
> > Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> > ---
> >  Documentation/automake.mk            |   1 +
> >  Documentation/topics/dpdk/index.rst  |   1 +
> >  Documentation/topics/dpdk/memory.rst |  67 +++++++
> >  NEWS                                 |   1 +
> >  lib/dpdk-stub.c                      |   6 +
> >  lib/dpdk.c                           |  13 ++
> >  lib/dpdk.h                           |   1 +
> >  lib/netdev-dpdk.c                    | 326 +++++++++++++++++++++-------
> -------
> >  vswitchd/vswitch.xml                 |  16 ++
> >  9 files changed, 305 insertions(+), 127 deletions(-)  create mode
> > 100644 Documentation/topics/dpdk/memory.rst
> >
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index 683ca14..14c2189 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -36,6 +36,7 @@ DOC_SOURCE = \
> >  	Documentation/topics/dpdk/index.rst \
> >  	Documentation/topics/dpdk/bridge.rst \
> >  	Documentation/topics/dpdk/jumbo-frames.rst \
> > +	Documentation/topics/dpdk/memory.rst \
> >  	Documentation/topics/dpdk/pdump.rst \
> >  	Documentation/topics/dpdk/phy.rst \
> >  	Documentation/topics/dpdk/pmd.rst \
> > diff --git a/Documentation/topics/dpdk/index.rst
> > b/Documentation/topics/dpdk/index.rst
> > index 181f61a..cf24a7b 100644
> > --- a/Documentation/topics/dpdk/index.rst
> > +++ b/Documentation/topics/dpdk/index.rst
> > @@ -40,3 +40,4 @@ The DPDK Datapath
> >     /topics/dpdk/qos
> >     /topics/dpdk/pdump
> >     /topics/dpdk/jumbo-frames
> > +   /topics/dpdk/memory
> > diff --git a/Documentation/topics/dpdk/memory.rst
> > b/Documentation/topics/dpdk/memory.rst
> > new file mode 100644
> > index 0000000..1198067
> > --- /dev/null
> > +++ b/Documentation/topics/dpdk/memory.rst
> > @@ -0,0 +1,67 @@
> > +..
> > +        Copyright 2018, Intel, Inc.
> > +
> > +      Licensed under the Apache License, Version 2.0 (the "License");
> you may
> > +      not use this file except in compliance with the License. You may
> obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing,
> software
> > +      distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> See the
> > +      License for the specific language governing permissions and
> limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +=========================
> > +DPDK Device Memory Models
> > +=========================
> > +
> > +DPDK device memory can be allocated in one of two ways in OVS DPDK,
> > +shared mempools or per port mempools. The specifics of both are
> > +detailed below.
> > +
> > +Shared Mempools
> > +---------------
> > +
> > +By Default OVS DPDK uses a shared mempool model. This means that
> > +multiple ports can share the same mempool. For example when a port is
> > +added it will have a given MTU and socket ID associated with it. If a
> > +mempool has been created previously for an existing port that has the
> > +same MTU and socket ID, that mempool is used for both ports. If there
> > +is no existing mempool supporting these parameters then a new mempool
> is created.
> > +
> > +
> > +Per Port Mempools
> > +-----------------
> > +
> > +In the per port mempool model, mempools are created per device and
> > +are not shared. The benefit of this is a more transparent memory
> > +model where mempools will not be exhausted by other DPDK devices.
> > +However this comes at a potential increase in cost for memory
> > +dimensioning for a given deployment. Users should be aware of the
> > +memory requirements for their deployment before using this model and
> allocate the required hugepage memory.
> > +
> > +Per port mempool support may be enabled via a global config value,
> > +```per-port-mp-enabled```. Setting this to true enables the per port
> > +mempool model for all DPDK devices in OVS::
> > +
> > +   $ ovs-vsctl set Open_vSwitch .
> > + other_config:per-port-mp-enabled=true
> > +
> > +.. important::
> > +
> > +    Changing this value requires restarting the daemon.
> > +
> > +.. todo::
> > +
> > +   Add memory calculation section for both mempool models.
> > diff --git a/NEWS b/NEWS
> > index ec548b0..c9991cf 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -30,6 +30,7 @@ Post-v2.9.0
> >       * New 'check-dpdk' Makefile target to run a new system testsuite.
> >         See Testing topic for the details.
> >       * Add LSC interrupt support for DPDK physical devices.
> > +     * Support both shared and per port mempools for DPDK devices.
> >     - Userspace datapath:
> >       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a
> single PMD
> >       * Detailed PMD performance metrics available with new command
> > diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c index 041cd0c..d2a9718
> > 100644
> > --- a/lib/dpdk-stub.c
> > +++ b/lib/dpdk-stub.c
> > @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void)
> >      return false;
> >  }
> >
> > +bool
> > +dpdk_per_port_mempool_enabled(void)
> > +{
> > +    return false;
> > +}
> > +
> >  void
> >  print_dpdk_version(void)
> >  {
> > diff --git a/lib/dpdk.c b/lib/dpdk.c
> > index 00dd974..e251970 100644
> > --- a/lib/dpdk.c
> > +++ b/lib/dpdk.c
> > @@ -43,6 +43,8 @@ static FILE *log_stream = NULL;       /* Stream for
> DPDK log redirection */
> >
> >  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets
> */
> >  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU
> > support */
> > +static bool per_port_mp_enabled = false; /* Status of per port mempool
> > +                                          * support */
> >
> >  static int
> >  process_vhost_flags(char *flag, const char *default_val, int size, @@
> > -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config)
> >      VLOG_INFO("IOMMU support for vhost-user-client %s.",
> >                 vhost_iommu_enabled ? "enabled" : "disabled");
> >
> > +    per_port_mp_enabled = smap_get_bool(ovs_other_config,
> > +                                        "per-port-mp-enabled", false);
> > +    VLOG_INFO("Per port mempool for DPDK devices %s.",
> > +               per_port_mp_enabled ? "enabled" : "disabled");
> > +
> >      argv = grow_argv(&argv, 0, 1);
> >      argc = 1;
> >      argv[0] = xstrdup(ovs_get_program_name()); @@ -498,6 +505,12 @@
> > dpdk_vhost_iommu_enabled(void)
> >      return vhost_iommu_enabled;
> >  }
> >
> > +bool
> > +dpdk_per_port_mempool_enabled(void)
> > +{
> > +    return per_port_mp_enabled;
> > +}
> > +
> >  void
> >  dpdk_set_lcore_id(unsigned cpu)
> >  {
> > diff --git a/lib/dpdk.h b/lib/dpdk.h
> > index b041535..243385c 100644
> > --- a/lib/dpdk.h
> > +++ b/lib/dpdk.h
> > @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config);
> > void dpdk_set_lcore_id(unsigned cpu);  const char
> > *dpdk_get_vhost_sock_dir(void);  bool dpdk_vhost_iommu_enabled(void);
> > +bool dpdk_per_port_mempool_enabled(void);
> >  void print_dpdk_version(void);
> >
> >  #endif /* dpdk.h */
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> > 87152a7..cda2ace 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl =
> VLOG_RATE_LIMIT_INIT(5, 20);
> >  #define NETDEV_DPDK_MBUF_ALIGN      1024
> >  #define NETDEV_DPDK_MAX_PKT_LEN     9728
> >
> > -/* Min number of packets in the mempool.  OVS tries to allocate a
> > mempool with
> > - * roughly estimated number of mbufs: if this fails (because the
> > system doesn't
> > - * have enough hugepages) we keep halving the number until the
> > allocation
> > - * succeeds or we reach MIN_NB_MBUF */
> > +/* Max and min number of packets in the mempool.  OVS tries to
> > +allocate a
> > + * mempool with MAX_NB_MBUF: if this fails (because the system
> > +doesn't have
> > + * enough hugepages) we keep halving the number until the allocation
> > +succeeds
> > + * or we reach MIN_NB_MBUF */
> > +
> > +#define MAX_NB_MBUF          (4096 * 64)
> >  #define MIN_NB_MBUF          (4096 * 4)
> >  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
> >
> > +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> > +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF /
> MIN_NB_MBUF)
> > +                  == 0);
> > +
> > +/* The smallest possible NB_MBUF that we're going to try should be a
> > +multiple
> > + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
> > +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF /
> MIN_NB_MBUF))
> > +                  % MP_CACHE_SZ == 0);
> > +
> >  /*
> >   * DPDK XSTATS Counter names definition
> >   */
> > @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list
> > OVS_GUARDED_BY(dpdk_mutex)  static struct ovs_mutex dpdk_mp_mutex
> OVS_ACQ_AFTER(dpdk_mutex)
> >      = OVS_MUTEX_INITIALIZER;
> >
> > -/* Contains all 'struct dpdk_mp's. */ -static struct ovs_list
> > dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
> > -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
> > +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
> > +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
> >
> > -/* Wrapper for a mempool released but not yet freed. */  struct
> > dpdk_mp {
> >       struct rte_mempool *mp;
> > +     int mtu;
> > +     int socket_id;
> > +     int refcount;
> >       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
> >   };
> >
> > @@ -381,7 +393,7 @@ struct netdev_dpdk {
> >
> >      PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
> >          struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
> > -        struct rte_mempool *mp;
> > +        struct dpdk_mp *dpdk_mp;
> >
> >          /* virtio identifier for vhost devices */
> >          ovsrcu_index vid;
> > @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp
> OVS_UNUSED,
> >      dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);  }
> >
> > -static int
> > -dpdk_mp_full(const struct rte_mempool *mp)
> > OVS_REQUIRES(dpdk_mp_mutex) -{
> > -    unsigned ring_count;
> > -    /* This logic is needed because rte_mempool_full() is not
> guaranteed to
> > -     * be atomic and mbufs could be moved from mempool cache -->
> mempool ring
> > -     * during the call. However, as no mbufs will be taken from the
> mempool
> > -     * at this time, we can work around it by also checking the ring
> entries
> > -     * separately and ensuring that they have not changed.
> > -     */
> > -    ring_count = rte_mempool_ops_get_count(mp);
> > -    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) ==
> ring_count) {
> > -        return 1;
> > -    }
> > -
> > -    return 0;
> > -}
> > -
> > -/* Free unused mempools. */
> > -static void
> > -dpdk_mp_sweep(void)
> > +/* Calculating the required number of mbufs differs depending on the
> > + * mempool model being used. Check if per port mempools are in use
> > +before
> > + * calculating.
> > + */
> > +static uint32_t
> > +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool
> > +per_port_mp)
> >  {
> > -    struct dpdk_mp *dmp, *next;
> > +    uint32_t n_mbufs;
> >
> > -    ovs_mutex_lock(&dpdk_mp_mutex);
> > -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> > -        if (dpdk_mp_full(dmp->mp)) {
> > -            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> > -            ovs_list_remove(&dmp->list_node);
> > -            rte_mempool_free(dmp->mp);
> > -            rte_free(dmp);
> > +    if (!per_port_mp) {
> > +        /* Shared mempools are being used.
> > +         * XXX: this is a really rough method of provisioning memory.
> > +         * It's impossible to determine what the exact memory
> requirements are
> > +         * when the number of ports and rxqs that utilize a particular
> mempool
> > +         * can change dynamically at runtime. For now, use this rough
> > +         * heuristic.
> > +         */
> > +        if (mtu >= ETHER_MTU) {
> > +            n_mbufs = MAX_NB_MBUF;
> 
> Is this going to be configurable in the future?
> 

So this has been raised before; the general feedback was that it becomes another tunable knob for OVS DPDK that users may not be aware of, or may not know how best to adjust for their deployments. For usability purposes it has been hard coded to date without much issue.
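To make the two sizing policies discussed above concrete, here is a small stand-alone sketch of the logic in the patch's dpdk_calculate_mbufs(): the shared model picks a fixed count keyed only on MTU, while the per-port model estimates from queue geometry. The constants and the `n_cores` parameter (standing in for RTE_MAX_LCORE) are illustrative values lifted from the diff, not authoritative.

```c
#include <stdint.h>

/* Illustrative constants mirroring the patch. */
#define EXAMPLE_MAX_NB_MBUF (4096 * 64)
#define EXAMPLE_MIN_NB_MBUF (4096 * 4)
#define EXAMPLE_ETHER_MTU   1500
#define EXAMPLE_MAX_BURST   32      /* stands in for NETDEV_MAX_BURST */

/* Shared model: fixed sizes keyed only on MTU. */
static uint32_t
example_shared_mbufs(int mtu)
{
    return mtu >= EXAMPLE_ETHER_MTU ? EXAMPLE_MAX_NB_MBUF
                                    : EXAMPLE_MIN_NB_MBUF;
}

/* Per-port model: rxq/txq fill + packets in flight on PMDs + headroom. */
static uint32_t
example_per_port_mbufs(uint32_t n_rxq, uint32_t rxq_size,
                       uint32_t n_txq, uint32_t txq_size,
                       uint32_t n_cores)
{
    uint32_t in_flight = (n_cores < n_rxq ? n_cores : n_rxq)
                         * EXAMPLE_MAX_BURST;

    return n_rxq * rxq_size + n_txq * txq_size
           + in_flight + EXAMPLE_MIN_NB_MBUF;
}
```

The per-port estimate grows with every queue a user configures, which is exactly the memory-dimensioning cost the commit message warns about.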

> > +        } else {
> > +            n_mbufs = MIN_NB_MBUF;
> >          }
> > +    } else {
> > +        /* Per port mempools are being used.
> > +         * XXX: rough estimation of number of mbufs required for this
> port:
> > +         * <packets required to fill the device rxqs>
> > +         * + <packets that could be stuck on other ports txqs>
> > +         * + <packets in the pmd threads>
> > +         * + <additional memory for corner cases>
> > +         */
> > +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> > +                  + dev->requested_n_txq * dev->requested_txq_size
> > +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
> NETDEV_MAX_BURST
> > +                  + MIN_NB_MBUF;
> >      }
> > -    ovs_mutex_unlock(&dpdk_mp_mutex);
> > -}
> > -
> > -/* Ensure a mempool will not be freed. */ -static void
> > -dpdk_mp_do_not_free(struct rte_mempool *mp)
> > OVS_REQUIRES(dpdk_mp_mutex) -{
> > -    struct dpdk_mp *dmp, *next;
> >
> > -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> > -        if (dmp->mp == mp) {
> > -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp-
> >name);
> > -            ovs_list_remove(&dmp->list_node);
> > -            rte_free(dmp);
> > -            break;
> > -        }
> > -    }
> > +    return n_mbufs;
> >  }
> >
> > -/* Returns a valid pointer when either of the following is true:
> > - *  - a new mempool was just created;
> > - *  - a matching mempool already exists. */ -static struct
> > rte_mempool * -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> > +static struct dpdk_mp *
> > +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> >  {
> >      char mp_name[RTE_MEMPOOL_NAMESIZE];
> >      const char *netdev_name = netdev_get_name(&dev->up);
> >      int socket_id = dev->requested_socket_id;
> >      uint32_t n_mbufs;
> >      uint32_t hash = hash_string(netdev_name, 0);
> > -    struct rte_mempool *mp = NULL;
> > -
> > -    /*
> > -     * XXX: rough estimation of number of mbufs required for this port:
> > -     * <packets required to fill the device rxqs>
> > -     * + <packets that could be stuck on other ports txqs>
> > -     * + <packets in the pmd threads>
> > -     * + <additional memory for corner cases>
> > -     */
> > -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> > -              + dev->requested_n_txq * dev->requested_txq_size
> > -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
> NETDEV_MAX_BURST
> > -              + MIN_NB_MBUF;
> > +    struct dpdk_mp *dmp = NULL;
> > +    int ret;
> > +
> > +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
> > +    if (!dmp) {
> > +        return NULL;
> > +    }
> > +    dmp->socket_id = socket_id;
> > +    dmp->mtu = mtu;
> > +    dmp->refcount = 1;
> > +
> > +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
> >
> > -    ovs_mutex_lock(&dpdk_mp_mutex);
> >      do {
> >          /* Full DPDK memory pool name must be unique and cannot be
> >           * longer than RTE_MEMPOOL_NAMESIZE. */
> > -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> > +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> >                             "ovs%08x%02d%05d%07u",
> >                             hash, socket_id, mtu, n_mbufs);
> >          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) { @@ -623,102
> > +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> >                    netdev_name, n_mbufs, socket_id,
> >                    dev->requested_n_rxq, dev->requested_n_txq);
> >
> > -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> > -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
> > -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
> > +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> > +                                          MP_CACHE_SZ,
> > +                                          sizeof (struct dp_packet)
> > +                                          - sizeof (struct rte_mbuf),
> > +                                          MBUF_SIZE(mtu)
> > +                                          - sizeof(struct dp_packet),
> > +                                          socket_id);
> >
> > -        if (mp) {
> > +        if (dmp->mp) {
> >              VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
> >                       mp_name, n_mbufs);
> >              /* rte_pktmbuf_pool_create has done some initialization of
> the
> > -             * rte_mbuf part of each dp_packet. Some OvS specific
> fields
> > -             * of the packet still need to be initialized by
> > -             * ovs_rte_pktmbuf_init. */
> > -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
> > +             * rte_mbuf part of each dp_packet, while
> ovs_rte_pktmbuf_init
> > +             * initializes some OVS specific fields of dp_packet.
> > +             */
> > +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
> > +            return dmp;
> >          } else if (rte_errno == EEXIST) {
> >              /* A mempool with the same name already exists.  We just
> >               * retrieve its pointer to be returned to the caller. */
> > -            mp = rte_mempool_lookup(mp_name);
> > +            dmp->mp = rte_mempool_lookup(mp_name);
> >              /* As the mempool create returned EEXIST we can expect the
> >               * lookup has returned a valid pointer.  If for some reason
> >               * that's not the case we keep track of it. */
> >              VLOG_DBG("A mempool with name \"%s\" already exists at
> %p.",
> > -                     mp_name, mp);
> > -            /* Ensure this reused mempool will not be freed. */
> > -            dpdk_mp_do_not_free(mp);
> > +                     mp_name, dmp->mp);
> >          } else {
> >              VLOG_ERR("Failed mempool \"%s\" create request of %u
> mbufs",
> >                       mp_name, n_mbufs);
> >          }
> 
> Just a thought, but now with shared memory where n_mbufs are initially set
> to 4096*64, one can see this error log printed a few times before the
> memory gets allocated. We could potentially demote this to a WARN and
> write a more friendly message and only print the error below, before
> returning to the caller (at that point we surely couldn't allocate the
> mempool).

Sure, good idea. As discussed in another thread by Kevin, I think we can change it to debug altogether.
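The repeated error messages come from the halving retry loop: starting from MAX_NB_MBUF, each failed allocation halves the request until it succeeds or drops below MIN_NB_MBUF. A minimal model of that loop (the real code calls rte_pktmbuf_pool_create() and checks rte_errno == ENOMEM; `budget` here is an assumed stand-in for the hugepage memory actually available):

```c
#include <stdint.h>

#define EXAMPLE_MAX_NB_MBUF (4096 * 64)
#define EXAMPLE_MIN_NB_MBUF (4096 * 4)

/* Count the allocation attempts the halving loop makes before one
 * succeeds, given a budget of mbufs the "system" can provide.  A
 * negative result means every attempt failed.  Each failed attempt
 * corresponds to one log message in the patch as written. */
static int
example_alloc_attempts(uint32_t budget)
{
    uint32_t n_mbufs = EXAMPLE_MAX_NB_MBUF;
    int attempts = 0;

    do {
        attempts++;
        if (n_mbufs <= budget) {    /* allocation succeeds */
            return attempts;
        }
        /* ENOMEM path: halve and retry, as the patch does. */
    } while ((n_mbufs /= 2) >= EXAMPLE_MIN_NB_MBUF);

    return -attempts;               /* all attempts failed */
}
```

With the constants above the loop can make up to five attempts (262144, 131072, 65536, 32768, 16384), so a constrained system would log the message several times even on an ultimately successful configuration, which is why demoting the per-attempt message to debug makes sense.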

> 
> > -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
> MIN_NB_MBUF);
> > +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
> > + MIN_NB_MBUF);
> >
> > -    ovs_mutex_unlock(&dpdk_mp_mutex);
> > -    return mp;
> > +    rte_free(dmp);
> > +    return NULL;
> >  }
> >
> > -/* Release an existing mempool. */
> > +static int
> > +dpdk_mp_full(const struct rte_mempool *mp)
> > +OVS_REQUIRES(dpdk_mp_mutex) {
> > +    unsigned ring_count;
> > +    /* This logic is needed because rte_mempool_full() is not
> guaranteed to
> > +     * be atomic and mbufs could be moved from mempool cache -->
> mempool ring
> > +     * during the call. However, as no mbufs will be taken from the
> mempool
> > +     * at this time, we can work around it by also checking the ring
> entries
> > +     * separately and ensuring that they have not changed.
> > +     */
> > +    ring_count = rte_mempool_ops_get_count(mp);
> > +    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) ==
> ring_count) {
> > +        return 1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +/* Free unused mempools. */
> >  static void
> > -dpdk_mp_release(struct rte_mempool *mp)
> > +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
> >  {
> > -    if (!mp) {
> > -        return;
> > +    struct dpdk_mp *dmp, *next;
> > +
> > +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
> > +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
> > +            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> > +            ovs_list_remove(&dmp->list_node);
> > +            rte_mempool_free(dmp->mp);
> > +            rte_free(dmp);
> > +        }
> >      }
> > +}
> > +
> > +static struct dpdk_mp *
> > +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp) {
> > +    struct dpdk_mp *dmp;
> > +    bool reuse = false;
> >
> >      ovs_mutex_lock(&dpdk_mp_mutex);
> > -    if (dpdk_mp_full(mp)) {
> > -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
> > -        rte_mempool_free(mp);
> > -    } else {
> > -        struct dpdk_mp *dmp;
> > +    /* Check if shared mempools are being used, if so check existing
> mempools
> > +     * to see if reuse is possible. */
> > +    if (!per_port_mp) {
> > +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
> > +            if (dmp->socket_id == dev->requested_socket_id
> > +                && dmp->mtu == mtu) {
> > +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
> > +                dmp->refcount++;
> > +                reuse = true;
> > +                break;
> > +            }
> > +        }
> > +    }
> > +    /* Sweep mempools after reuse or before create. */
> > +    dpdk_mp_sweep();
> 
> *Should dpdk_mp_sweep() be called when destroying an interface as well, in
> common_destruct()? While testing, I've noticed that if I add one port and
> delete the same port the mempool will still be allocated until you add
> another port, since dpdk_mp_sweep() will only be called on the next call
> to dpdp_mp_get(). This could potentially create problems in the per-port
> model where one deletes a certain number of ports and can't add any other
> ports because there's (hanging) mempools holding memory.
> 

I can look into handling this better in the next revision if required. As discussed on separate threads, it may be preferred to defer freeing as long as possible. However, the per port model is typically used in a more dynamic environment where changes (port adds/deletes) occur quite often, so we could look at sweeping in common_destruct() if it makes sense there.
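The lifecycle being debated can be sketched with a minimal refcount model: dpdk_mp_put() only drops a reference, and the sweep frees a pool only when no port references it *and* all mbufs have returned (dpdk_mp_full() in the patch, modelled here by a simple `full` flag). This is an illustrative reduction of the patch's behaviour, not OVS code.

```c
#include <stdbool.h>

/* Minimal model of the refcounted dpdk_mp lifecycle from the patch. */
struct example_mp {
    int refcount;   /* number of ports using this mempool */
    bool full;      /* stands in for dpdk_mp_full(): no mbufs in use */
    bool freed;
};

/* Drop one port's reference; never frees directly. */
static void
example_mp_put(struct example_mp *mp)
{
    if (mp->refcount > 0) {
        mp->refcount--;
    }
}

/* Free only unreferenced pools whose mbufs have all returned
 * (rte_mempool_free() in the real code). */
static void
example_mp_sweep(struct example_mp *mp)
{
    if (!mp->refcount && mp->full && !mp->freed) {
        mp->freed = true;
    }
}
```

Tiago's observation follows directly from this shape: if the sweep only runs inside dpdk_mp_get(), a pool whose refcount has already dropped to zero lingers until the next port add, which is when calling the sweep from common_destruct() as well would help.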

> >
> > -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
> > +    if (!reuse) {
> > +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
> >          if (dmp) {
> > -            dmp->mp = mp;
> > -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
> > +            ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
> >          }
> >      }
> > +
> >      ovs_mutex_unlock(&dpdk_mp_mutex);
> > +
> > +    return dmp;
> >  }
> >
> > -/* Tries to allocate a new mempool - or re-use an existing one where
> > - * appropriate - on requested_socket_id with a size determined by
> > - * requested_mtu and requested Rx/Tx queues.
> > - * On success - or when re-using an existing mempool - the new
> > configuration
> > - * will be applied.
> > +/* Decrement reference to a mempool. */ static void
> > +dpdk_mp_put(struct dpdk_mp *dmp) {
> > +    if (!dmp) {
> > +        return;
> > +    }
> > +
> > +    ovs_mutex_lock(&dpdk_mp_mutex);
> > +    ovs_assert(dmp->refcount);
> > +    dmp->refcount--;
> > +    ovs_mutex_unlock(&dpdk_mp_mutex); }
> > +
> > +/* Depending on the memory model being used this function tries
> > + * identify and reuse an existing mempool or tries to allocate new
> > + * mempool on requested_socket_id with mbuf size corresponding to
> > + * requested_mtu. On success new configuration will be applied.
> >   * On error, device will be left unchanged. */  static int
> > netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
> >      OVS_REQUIRES(dev->mutex)
> >  {
> >      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
> > -    struct rte_mempool *mp;
> > +    struct dpdk_mp *mp;
> >      int ret = 0;
> > +    bool per_port_mp = dpdk_per_port_mempool_enabled();
> >
> > -    dpdk_mp_sweep();
> > +    /* With shared mempools we do not need to configure a mempool if
> the MTU
> > +     * and socket ID have not changed, the previous configuration is
> still
> > +     * valid so return 0 */
> > +    if (dev->mtu == dev->requested_mtu
> > +        && dev->socket_id == dev->requested_socket_id
> > +        && (!per_port_mp)) {
> 
> I also find that moving the `(!per_port_mp)` condition to the beginning
> improves readability here. It even goes in hand with your comment just
> above - "With shared mempools we do not ...".
> 
> > +        return ret;
> > +    }
> >
> > -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
> > +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
> >      if (!mp) {
> >          VLOG_ERR("Failed to create memory pool for netdev "
> >                   "%s, with MTU %d on socket %d: %s\n",
> >                   dev->up.name, dev->requested_mtu, dev-
> >requested_socket_id,
> > -                 rte_strerror(rte_errno));
> > -        ret = rte_errno;
> > +        rte_strerror(rte_errno));
> 
> Missing indentation here.
> 

Will fix in the v2.
> > +        return rte_errno;
> >      } else {
> > -        /* If a new MTU was requested and its rounded value equals the
> one
> > -         * that is currently used, then the existing mempool is
> returned. */
> > -        if (dev->mp != mp) {
> > -            /* A new mempool was created, release the previous one. */
> > -            dpdk_mp_release(dev->mp);
> > -        } else {
> > -            ret = EEXIST;
> > +        /* Check for any pre-existing dpdk_mp for the device */
> > +        if (dev->dpdk_mp != NULL) {
> > +            /* If a new MTU was requested and its rounded value equals
> the
> > +             * one that is currently used, then the existing mempool is
> > +             * returned. */
> > +            if (dev->dpdk_mp->mp != mp->mp) {
> > +                /* A new mempool was created, release the previous one.
> */
> > +                dpdk_mp_put(dev->dpdk_mp);
> > +            } else {
> > +                /* If the mempool already exists in the current dpdk_mp
> then
> > +                 * we need to ensure dpdk_mp that was just created is
> freed in
> > +                 * the next sweep as it will not be used. */
> The code path around the `else` block here will only be called when
> `!per_port_mp`. This is because `dpdk_mp_get()` will only return an
> already existing mempool when using the shared model. Otherwise a new one
> is always returned, and thus the `if (dev->dpdk_mp->mp != mp->mp)` will be
> true. Am I reading this right? If so then refactoring this a bit to
> differentiate on `per_port_mp` might help on readability - this goes in-
> line with Kevin's comment about making this piece a bit more readable.
> 
> On the same note, the comment above misled me to think that the allocated
> `mp` is being freed, which would result in error since the same `mp` is
> then assigned below. Instead, what it is doing is decrementing the
> refcount in struct dpdk_mp, which might end up being freed on the next
> dpdk_mp_sweep() if `refcount=0`. But that won't happen on the shared model
> unless no port is using the mempool.
> 

After looking over the ongoing thread regarding this code block, it is buggy and can be simplified by doing more work around the EEXIST error in the dpdk_mp_create() function. That would also help avoid adding dpdk_mps to the dpdk_mp_list that will be freed anyway because the mempool is already available.

Thanks
Ian
> Thanks,
> Tiago.
> 
> 
> > +                dpdk_mp_put(mp);
> > +                ret = EEXIST;
> > +            }
> >          }
> > -        dev->mp = mp;
> > +        dev->dpdk_mp = mp;
> >          dev->mtu = dev->requested_mtu;
> >          dev->socket_id = dev->requested_socket_id;
> >          dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> >      }
> >
> > -    return ret;
> > +    return 0;
> >  }
> >
> >  static void
> > @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev,
> > int n_rxq, int n_txq)
> >
> >          for (i = 0; i < n_rxq; i++) {
> >              diag = rte_eth_rx_queue_setup(dev->port_id, i, dev-
> >rxq_size,
> > -                                          dev->socket_id, NULL, dev-
> >mp);
> > +                                          dev->socket_id, NULL,
> > +                                          dev->dpdk_mp->mp);
> >              if (diag) {
> >                  VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
> >                            dev->up.name, i, rte_strerror(-diag)); @@
> > -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> >      memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
> >      rte_eth_link_get_nowait(dev->port_id, &dev->link);
> >
> > -    mbp_priv = rte_mempool_get_priv(dev->mp);
> > +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
> >      dev->buf_size = mbp_priv->mbuf_data_room_size -
> > RTE_PKTMBUF_HEADROOM;
> >
> >      /* Get the Flow control configuration for DPDK-ETH */ @@ -1176,7
> > +1248,7 @@ common_destruct(struct netdev_dpdk *dev)
> >      OVS_EXCLUDED(dev->mutex)
> >  {
> >      rte_free(dev->tx_q);
> > -    dpdk_mp_release(dev->mp);
> > +    dpdk_mp_put(dev->dpdk_mp);
> >
> >      ovs_list_remove(&dev->list_node);
> >      free(ovsrcu_get_protected(struct ingress_policer *, @@ -1928,7
> > +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> >          return EAGAIN;
> >      }
> >
> > -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
> > +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
> >                                      (struct rte_mbuf **) batch-
> >packets,
> >                                      NETDEV_MAX_BURST);
> >      if (!nb_rx) {
> > @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
> struct dp_packet_batch *batch)
> >              continue;
> >          }
> >
> > -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> > +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> >          if (OVS_UNLIKELY(!pkts[txcnt])) {
> >              dropped += cnt - i;
> >              break;
> > @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn
> *conn,
> >          ovs_mutex_lock(&dev->mutex);
> >          ovs_mutex_lock(&dpdk_mp_mutex);
> >
> > -        rte_mempool_dump(stream, dev->mp);
> > +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
> >
> >          ovs_mutex_unlock(&dpdk_mp_mutex);
> >          ovs_mutex_unlock(&dev->mutex);
> > @@ -3742,7 +3814,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
> >
> >      err = netdev_dpdk_mempool_configure(dev);
> >      if (!err) {
> > -        /* A new mempool was created. */
> > +        /* A new mempool was created or re-used. */
> >          netdev_change_seq_changed(&dev->up);
> >      } else if (err != EEXIST){
> >          return err;
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 7ab90d5..95520af 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -359,6 +359,22 @@
> >          </p>
> >        </column>
> >
> > +      <column name="other_config" key="per-port-mp-enabled"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          By default OVS DPDK uses a shared mempool memory model wherein
> > +          devices that have the same MTU and socket values can share the same
> > +          mempool. Setting this value to <code>true</code> changes this
> > +          behaviour. Per port mempools allow DPDK devices to use a private
> > +          mempool per device. This can provide greater transparency
> > +          regarding memory usage but potentially at the cost of greater memory
> > +          requirements.
> > +        </p>
> > +        <p>
> > +          Changing this value requires restarting the daemon.
> > +        </p>
> > +      </column>
> > +
> >        <column name="other_config" key="tx-flush-interval"
> >                type='{"type": "integer",
> >                       "minInteger": 0, "maxInteger": 1000000}'>
> >
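For reference, the enablement workflow for the option quoted above can be sketched as follows. This is a sketch, not part of the patch: the restart command assumes a systemd-managed Open vSwitch service, and both the service name and the log path vary by distribution.

```shell
# Enable per port mempools; takes effect only after a daemon restart.
ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
systemctl restart openvswitch-switch    # service name is distro-specific

# The patch logs the chosen model at init time, so the setting can be
# confirmed in the vswitchd log:
grep "Per port mempool for DPDK devices" /var/log/openvswitch/ovs-vswitchd.log
```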
Stokes, Ian May 30, 2018, 11:07 a.m. UTC | #10
> On 05/30/2018 11:25 AM, Stokes, Ian wrote:
> >>> -                     mp_name, mp);
> >>> -            /* Ensure this reused mempool will not be freed. */
> >>> -            dpdk_mp_do_not_free(mp);
> >> The mempool you are about to reuse could have a refcount of 0 and
> >> about to be freed by a sweep. So you would need something like the
> >> function above before giving up dpdk_mp_mutex. Maybe you could
> >> increase the refcount for it now and re-adjust later if you need to.
> >>
> > So my thinking here was that we run the sweep in
> netdev_dpdk_mempool_configure() just prior to calling dpdk_mp_create().
> >
> > Running the sweep prior should have removed any mempool with refcount 0
> in the dpdk_mp_list.
> 
> Not necessarily, freeing a 0 refcount dpdk_mp and its mempool in sweep is
> conditional on no in-use mbufs, so it might not have been freed yet.
> 
> It won't be freed while you have the lock, but you need to ensure while
> you still have the lock that it won't be freed after you release the lock
> but before you have updated refcount.
> 

Ah, I get you now. Makes more sense.

Ian
> > The list itself is mutex guarded, if there is an existing mempool then
> the associated refcount would still be 1 I thought but maybe I missed
> something.
> >
> > Do you think it's the case it could be modified between the sweep call
> and reaching this point?
> >
> 
> No, this part is ok, it's just the case above
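The locking concern discussed above can be sketched in a minimal, stand-alone C model (plain pthreads, no DPDK). `struct mp`, `mp_get` and `mp_put` are hypothetical stand-ins for the patch's `struct dpdk_mp`, `dpdk_mp_get` and `dpdk_mp_put`; the point is only that the refcount increment on the reuse path happens while the mutex is still held, so a sweep cannot free the entry between lookup and increment.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Minimal stand-alone model of the reuse-vs-sweep interaction.  'struct mp'
 * is a hypothetical stand-in for the patch's 'struct dpdk_mp'; only the
 * fields needed for refcounting are kept. */
struct mp {
    int socket_id;
    int mtu;
    int refcount;
    struct mp *next;            /* models the guarded dpdk_mp_list */
};

static pthread_mutex_t mp_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct mp *mp_list;

/* Reuse path: the refcount is bumped while mp_mutex is still held, so a
 * concurrent sweep cannot free a refcount==0 entry between the lookup
 * and the increment. */
static struct mp *
mp_get(int socket_id, int mtu)
{
    struct mp *m;

    pthread_mutex_lock(&mp_mutex);
    for (m = mp_list; m; m = m->next) {
        if (m->socket_id == socket_id && m->mtu == mtu) {
            m->refcount++;      /* protected by mp_mutex */
            pthread_mutex_unlock(&mp_mutex);
            return m;
        }
    }
    m = calloc(1, sizeof *m);
    if (m) {
        m->socket_id = socket_id;
        m->mtu = mtu;
        m->refcount = 1;
        m->next = mp_list;
        mp_list = m;
    }
    pthread_mutex_unlock(&mp_mutex);
    return m;
}

/* Release path: only the refcount drops here; the actual free is left to a
 * later sweep, mirroring dpdk_mp_put() in the patch. */
static void
mp_put(struct mp *m)
{
    pthread_mutex_lock(&mp_mutex);
    assert(m->refcount > 0);
    m->refcount--;
    pthread_mutex_unlock(&mp_mutex);
}
```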

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 683ca14..14c2189 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -36,6 +36,7 @@  DOC_SOURCE = \
 	Documentation/topics/dpdk/index.rst \
 	Documentation/topics/dpdk/bridge.rst \
 	Documentation/topics/dpdk/jumbo-frames.rst \
+	Documentation/topics/dpdk/memory.rst \
 	Documentation/topics/dpdk/pdump.rst \
 	Documentation/topics/dpdk/phy.rst \
 	Documentation/topics/dpdk/pmd.rst \
diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
index 181f61a..cf24a7b 100644
--- a/Documentation/topics/dpdk/index.rst
+++ b/Documentation/topics/dpdk/index.rst
@@ -40,3 +40,4 @@  The DPDK Datapath
    /topics/dpdk/qos
    /topics/dpdk/pdump
    /topics/dpdk/jumbo-frames
+   /topics/dpdk/memory
diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
new file mode 100644
index 0000000..1198067
--- /dev/null
+++ b/Documentation/topics/dpdk/memory.rst
@@ -0,0 +1,67 @@ 
+..
+        Copyright 2018, Intel, Inc.
+
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+=========================
+DPDK Device Memory Models
+=========================
+
+DPDK device memory can be allocated in one of two ways in OVS DPDK,
+shared mempools or per port mempools. The specifics of both are detailed
+below.
+
+Shared Mempools
+---------------
+
+By default OVS DPDK uses a shared mempool model. This means that multiple
+ports can share the same mempool. For example, when a port is added it will
+have a given MTU and socket ID associated with it. If a mempool has been
+created previously for an existing port that has the same MTU and socket ID,
+that mempool is used for both ports. If there is no existing mempool
+supporting these parameters then a new mempool is created.
+
+
+Per Port Mempools
+-----------------
+
+In the per port mempool model, mempools are created per device and are not
+shared. The benefit of this is a more transparent memory model, where
+mempools will not be exhausted by other DPDK devices. However, this comes at
+a potential increase in memory requirements for a given deployment. Users
+should be aware of the memory requirements for their deployment before using
+this model and allocate the required hugepage memory.
+
+Per port mempool support may be enabled via a global config value,
+``per-port-mp-enabled``. Setting this to ``true`` enables the per port
+mempool model for all DPDK devices in OVS::
+
+   $ ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
+
+.. important::
+
+    Changing this value requires restarting the daemon.
+
+.. todo::
+
+   Add memory calculation section for both mempool models.
diff --git a/NEWS b/NEWS
index ec548b0..c9991cf 100644
--- a/NEWS
+++ b/NEWS
@@ -30,6 +30,7 @@  Post-v2.9.0
      * New 'check-dpdk' Makefile target to run a new system testsuite.
        See Testing topic for the details.
      * Add LSC interrupt support for DPDK physical devices.
+     * Support both shared and per port mempools for DPDK devices.
    - Userspace datapath:
      * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
      * Detailed PMD performance metrics available with new command
diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
index 041cd0c..d2a9718 100644
--- a/lib/dpdk-stub.c
+++ b/lib/dpdk-stub.c
@@ -55,6 +55,12 @@  dpdk_vhost_iommu_enabled(void)
     return false;
 }
 
+bool
+dpdk_per_port_mempool_enabled(void)
+{
+    return false;
+}
+
 void
 print_dpdk_version(void)
 {
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 00dd974..e251970 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -43,6 +43,8 @@  static FILE *log_stream = NULL;       /* Stream for DPDK log redirection */
 
 static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
 static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
+static bool per_port_mp_enabled = false; /* Status of per port mempool
+                                          * support. */
 
 static int
 process_vhost_flags(char *flag, const char *default_val, int size,
@@ -354,6 +356,11 @@  dpdk_init__(const struct smap *ovs_other_config)
     VLOG_INFO("IOMMU support for vhost-user-client %s.",
                vhost_iommu_enabled ? "enabled" : "disabled");
 
+    per_port_mp_enabled = smap_get_bool(ovs_other_config,
+                                        "per-port-mp-enabled", false);
+    VLOG_INFO("Per port mempool for DPDK devices %s.",
+               per_port_mp_enabled ? "enabled" : "disabled");
+
     argv = grow_argv(&argv, 0, 1);
     argc = 1;
     argv[0] = xstrdup(ovs_get_program_name());
@@ -498,6 +505,12 @@  dpdk_vhost_iommu_enabled(void)
     return vhost_iommu_enabled;
 }
 
+bool
+dpdk_per_port_mempool_enabled(void)
+{
+    return per_port_mp_enabled;
+}
+
 void
 dpdk_set_lcore_id(unsigned cpu)
 {
diff --git a/lib/dpdk.h b/lib/dpdk.h
index b041535..243385c 100644
--- a/lib/dpdk.h
+++ b/lib/dpdk.h
@@ -38,6 +38,7 @@  void dpdk_init(const struct smap *ovs_other_config);
 void dpdk_set_lcore_id(unsigned cpu);
 const char *dpdk_get_vhost_sock_dir(void);
 bool dpdk_vhost_iommu_enabled(void);
+bool dpdk_per_port_mempool_enabled(void);
 void print_dpdk_version(void);
 
 #endif /* dpdk.h */
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 87152a7..cda2ace 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -91,13 +91,24 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define NETDEV_DPDK_MBUF_ALIGN      1024
 #define NETDEV_DPDK_MAX_PKT_LEN     9728
 
-/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
- * roughly estimated number of mbufs: if this fails (because the system doesn't
- * have enough hugepages) we keep halving the number until the allocation
- * succeeds or we reach MIN_NB_MBUF */
+/* Max and min number of packets in the mempool.  OVS tries to allocate a
+ * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
+ * enough hugepages) we keep halving the number until the allocation succeeds
+ * or we reach MIN_NB_MBUF */
+
+#define MAX_NB_MBUF          (4096 * 64)
 #define MIN_NB_MBUF          (4096 * 4)
 #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
 
+/* MAX_NB_MBUF can be halved down to MIN_NB_MBUF without remainder. */
+BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
+                  == 0);
+
+/* The smallest possible NB_MBUF that we're going to try should be a multiple
+ * of MP_CACHE_SZ. This is advised by DPDK documentation. */
+BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
+                  % MP_CACHE_SZ == 0);
+
 /*
  * DPDK XSTATS Counter names definition
  */
@@ -296,13 +307,14 @@  static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
 static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
     = OVS_MUTEX_INITIALIZER;
 
-/* Contains all 'struct dpdk_mp's. */
-static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
-    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
+static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
+    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
 
-/* Wrapper for a mempool released but not yet freed. */
 struct dpdk_mp {
      struct rte_mempool *mp;
+     int mtu;
+     int socket_id;
+     int refcount;
      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
  };
 
@@ -381,7 +393,7 @@  struct netdev_dpdk {
 
     PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
         struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
-        struct rte_mempool *mp;
+        struct dpdk_mp *dpdk_mp;
 
         /* virtio identifier for vhost devices */
         ovsrcu_index vid;
@@ -526,88 +538,70 @@  ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
     dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
 }
 
-static int
-dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
-{
-    unsigned ring_count;
-    /* This logic is needed because rte_mempool_full() is not guaranteed to
-     * be atomic and mbufs could be moved from mempool cache --> mempool ring
-     * during the call. However, as no mbufs will be taken from the mempool
-     * at this time, we can work around it by also checking the ring entries
-     * separately and ensuring that they have not changed.
-     */
-    ring_count = rte_mempool_ops_get_count(mp);
-    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
-        return 1;
-    }
-
-    return 0;
-}
-
-/* Free unused mempools. */
-static void
-dpdk_mp_sweep(void)
+/* Calculating the required number of mbufs differs depending on the
+ * mempool model being used. Check if per port mempools are in use before
+ * calculating.
+ */
+static uint32_t
+dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 {
-    struct dpdk_mp *dmp, *next;
+    uint32_t n_mbufs;
 
-    ovs_mutex_lock(&dpdk_mp_mutex);
-    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
-        if (dpdk_mp_full(dmp->mp)) {
-            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
-            ovs_list_remove(&dmp->list_node);
-            rte_mempool_free(dmp->mp);
-            rte_free(dmp);
+    if (!per_port_mp) {
+        /* Shared mempools are being used.
+         * XXX: this is a really rough method of provisioning memory.
+         * It's impossible to determine what the exact memory requirements are
+         * when the number of ports and rxqs that utilize a particular mempool
+         * can change dynamically at runtime. For now, use this rough
+         * heuristic.
+         */
+        if (mtu >= ETHER_MTU) {
+            n_mbufs = MAX_NB_MBUF;
+        } else {
+            n_mbufs = MIN_NB_MBUF;
         }
+    } else {
+        /* Per port mempools are being used.
+         * XXX: rough estimation of number of mbufs required for this port:
+         * <packets required to fill the device rxqs>
+         * + <packets that could be stuck on other ports txqs>
+         * + <packets in the pmd threads>
+         * + <additional memory for corner cases>
+         */
+        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
+                  + dev->requested_n_txq * dev->requested_txq_size
+                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
+                  + MIN_NB_MBUF;
     }
-    ovs_mutex_unlock(&dpdk_mp_mutex);
-}
-
-/* Ensure a mempool will not be freed. */
-static void
-dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
-{
-    struct dpdk_mp *dmp, *next;
 
-    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
-        if (dmp->mp == mp) {
-            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
-            ovs_list_remove(&dmp->list_node);
-            rte_free(dmp);
-            break;
-        }
-    }
+    return n_mbufs;
 }
 
-/* Returns a valid pointer when either of the following is true:
- *  - a new mempool was just created;
- *  - a matching mempool already exists. */
-static struct rte_mempool *
-dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
+static struct dpdk_mp *
+dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 {
     char mp_name[RTE_MEMPOOL_NAMESIZE];
     const char *netdev_name = netdev_get_name(&dev->up);
     int socket_id = dev->requested_socket_id;
     uint32_t n_mbufs;
     uint32_t hash = hash_string(netdev_name, 0);
-    struct rte_mempool *mp = NULL;
-
-    /*
-     * XXX: rough estimation of number of mbufs required for this port:
-     * <packets required to fill the device rxqs>
-     * + <packets that could be stuck on other ports txqs>
-     * + <packets in the pmd threads>
-     * + <additional memory for corner cases>
-     */
-    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
-              + dev->requested_n_txq * dev->requested_txq_size
-              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
-              + MIN_NB_MBUF;
+    struct dpdk_mp *dmp = NULL;
+    int ret;
+
+    dmp = dpdk_rte_mzalloc(sizeof *dmp);
+    if (!dmp) {
+        return NULL;
+    }
+    dmp->socket_id = socket_id;
+    dmp->mtu = mtu;
+    dmp->refcount = 1;
+
+    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
 
-    ovs_mutex_lock(&dpdk_mp_mutex);
     do {
         /* Full DPDK memory pool name must be unique and cannot be
          * longer than RTE_MEMPOOL_NAMESIZE. */
-        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
+        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
                            "ovs%08x%02d%05d%07u",
                            hash, socket_id, mtu, n_mbufs);
         if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
@@ -623,102 +617,179 @@  dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
                   netdev_name, n_mbufs, socket_id,
                   dev->requested_n_rxq, dev->requested_n_txq);
 
-        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
-                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
-                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
+        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
+                                          MP_CACHE_SZ,
+                                          sizeof (struct dp_packet)
+                                          - sizeof (struct rte_mbuf),
+                                          MBUF_SIZE(mtu)
+                                          - sizeof(struct dp_packet),
+                                          socket_id);
 
-        if (mp) {
+        if (dmp->mp) {
             VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
                      mp_name, n_mbufs);
             /* rte_pktmbuf_pool_create has done some initialization of the
-             * rte_mbuf part of each dp_packet. Some OvS specific fields
-             * of the packet still need to be initialized by
-             * ovs_rte_pktmbuf_init. */
-            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
+             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
+             * initializes some OVS specific fields of dp_packet.
+             */
+            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
+            return dmp;
         } else if (rte_errno == EEXIST) {
             /* A mempool with the same name already exists.  We just
              * retrieve its pointer to be returned to the caller. */
-            mp = rte_mempool_lookup(mp_name);
+            dmp->mp = rte_mempool_lookup(mp_name);
             /* As the mempool create returned EEXIST we can expect the
              * lookup has returned a valid pointer.  If for some reason
              * that's not the case we keep track of it. */
             VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
-                     mp_name, mp);
-            /* Ensure this reused mempool will not be freed. */
-            dpdk_mp_do_not_free(mp);
+                     mp_name, dmp->mp);
         } else {
             VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
                      mp_name, n_mbufs);
         }
-    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
+    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
 
-    ovs_mutex_unlock(&dpdk_mp_mutex);
-    return mp;
+    rte_free(dmp);
+    return NULL;
 }
 
-/* Release an existing mempool. */
+static int
+dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
+{
+    unsigned ring_count;
+    /* This logic is needed because rte_mempool_full() is not guaranteed to
+     * be atomic and mbufs could be moved from mempool cache --> mempool ring
+     * during the call. However, as no mbufs will be taken from the mempool
+     * at this time, we can work around it by also checking the ring entries
+     * separately and ensuring that they have not changed.
+     */
+    ring_count = rte_mempool_ops_get_count(mp);
+    if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
+        return 1;
+    }
+
+    return 0;
+}
+
+/* Free unused mempools. */
 static void
-dpdk_mp_release(struct rte_mempool *mp)
+dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
 {
-    if (!mp) {
-        return;
+    struct dpdk_mp *dmp, *next;
+
+    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
+        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
+            VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
+            ovs_list_remove(&dmp->list_node);
+            rte_mempool_free(dmp->mp);
+            rte_free(dmp);
+        }
     }
+}
+
+static struct dpdk_mp *
+dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
+{
+    struct dpdk_mp *dmp;
+    bool reuse = false;
 
     ovs_mutex_lock(&dpdk_mp_mutex);
-    if (dpdk_mp_full(mp)) {
-        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
-        rte_mempool_free(mp);
-    } else {
-        struct dpdk_mp *dmp;
+    /* Check if shared mempools are being used, if so check existing mempools
+     * to see if reuse is possible. */
+    if (!per_port_mp) {
+        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
+            if (dmp->socket_id == dev->requested_socket_id
+                && dmp->mtu == mtu) {
+                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
+                dmp->refcount++;
+                reuse = true;
+                break;
+            }
+        }
+    }
+    /* Sweep mempools after reuse or before create. */
+    dpdk_mp_sweep();
 
-        dmp = dpdk_rte_mzalloc(sizeof *dmp);
+    if (!reuse) {
+        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
         if (dmp) {
-            dmp->mp = mp;
-            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
+            ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
         }
     }
+
     ovs_mutex_unlock(&dpdk_mp_mutex);
+
+    return dmp;
 }
 
-/* Tries to allocate a new mempool - or re-use an existing one where
- * appropriate - on requested_socket_id with a size determined by
- * requested_mtu and requested Rx/Tx queues.
- * On success - or when re-using an existing mempool - the new configuration
- * will be applied.
+/* Decrement reference to a mempool. */
+static void
+dpdk_mp_put(struct dpdk_mp *dmp)
+{
+    if (!dmp) {
+        return;
+    }
+
+    ovs_mutex_lock(&dpdk_mp_mutex);
+    ovs_assert(dmp->refcount);
+    dmp->refcount--;
+    ovs_mutex_unlock(&dpdk_mp_mutex);
+}
+
+/* Depending on the memory model being used, this function tries to
+ * identify and reuse an existing mempool or tries to allocate a new
+ * mempool on requested_socket_id with mbuf size corresponding to
+ * requested_mtu. On success, the new configuration will be applied.
  * On error, device will be left unchanged. */
 static int
 netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
     OVS_REQUIRES(dev->mutex)
 {
     uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
-    struct rte_mempool *mp;
+    struct dpdk_mp *mp;
     int ret = 0;
+    bool per_port_mp = dpdk_per_port_mempool_enabled();
 
-    dpdk_mp_sweep();
+    /* With shared mempools we do not need to configure a mempool if the MTU
+     * and socket ID have not changed; the previous configuration is still
+     * valid, so return 0. */
+    if (dev->mtu == dev->requested_mtu
+        && dev->socket_id == dev->requested_socket_id
+        && (!per_port_mp)) {
+        return ret;
+    }
 
-    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
+    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
     if (!mp) {
         VLOG_ERR("Failed to create memory pool for netdev "
                  "%s, with MTU %d on socket %d: %s\n",
                  dev->up.name, dev->requested_mtu, dev->requested_socket_id,
-                 rte_strerror(rte_errno));
-        ret = rte_errno;
+                 rte_strerror(rte_errno));
+        return rte_errno;
     } else {
-        /* If a new MTU was requested and its rounded value equals the one
-         * that is currently used, then the existing mempool is returned. */
-        if (dev->mp != mp) {
-            /* A new mempool was created, release the previous one. */
-            dpdk_mp_release(dev->mp);
-        } else {
-            ret = EEXIST;
+        /* Check for any pre-existing dpdk_mp for the device. */
+        if (dev->dpdk_mp != NULL) {
+            /* If a new MTU was requested and its rounded value equals the
+             * one that is currently used, then the existing mempool is
+             * returned. */
+            if (dev->dpdk_mp->mp != mp->mp) {
+                /* A new mempool was created, release the previous one. */
+                dpdk_mp_put(dev->dpdk_mp);
+            } else {
+                /* If the mempool already exists in the current dpdk_mp then
+                 * we need to ensure dpdk_mp that was just created is freed in
+                 * the next sweep as it will not be used. */
+                dpdk_mp_put(mp);
+                ret = EEXIST;
+            }
         }
-        dev->mp = mp;
+        dev->dpdk_mp = mp;
         dev->mtu = dev->requested_mtu;
         dev->socket_id = dev->requested_socket_id;
         dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
     }
 
-    return ret;
+    return ret;
 }
 
 static void
@@ -835,7 +906,8 @@  dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
 
         for (i = 0; i < n_rxq; i++) {
             diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
-                                          dev->socket_id, NULL, dev->mp);
+                                          dev->socket_id, NULL,
+                                          dev->dpdk_mp->mp);
             if (diag) {
                 VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
                           dev->up.name, i, rte_strerror(-diag));
@@ -922,7 +994,7 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
     memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
     rte_eth_link_get_nowait(dev->port_id, &dev->link);
 
-    mbp_priv = rte_mempool_get_priv(dev->mp);
+    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
     dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
 
     /* Get the Flow control configuration for DPDK-ETH */
@@ -1176,7 +1248,7 @@  common_destruct(struct netdev_dpdk *dev)
     OVS_EXCLUDED(dev->mutex)
 {
     rte_free(dev->tx_q);
-    dpdk_mp_release(dev->mp);
+    dpdk_mp_put(dev->dpdk_mp);
 
     ovs_list_remove(&dev->list_node);
     free(ovsrcu_get_protected(struct ingress_policer *,
@@ -1928,7 +2000,7 @@  netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
         return EAGAIN;
     }
 
-    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
+    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
                                     (struct rte_mbuf **) batch->packets,
                                     NETDEV_MAX_BURST);
     if (!nb_rx) {
@@ -2167,7 +2239,7 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
             continue;
         }
 
-        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
+        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
         if (OVS_UNLIKELY(!pkts[txcnt])) {
             dropped += cnt - i;
             break;
@@ -3047,7 +3119,7 @@  netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
         ovs_mutex_lock(&dev->mutex);
         ovs_mutex_lock(&dpdk_mp_mutex);
 
-        rte_mempool_dump(stream, dev->mp);
+        rte_mempool_dump(stream, dev->dpdk_mp->mp);
 
         ovs_mutex_unlock(&dpdk_mp_mutex);
         ovs_mutex_unlock(&dev->mutex);
@@ -3742,7 +3814,7 @@  dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
 
     err = netdev_dpdk_mempool_configure(dev);
     if (!err) {
-        /* A new mempool was created. */
+        /* A new mempool was created or re-used. */
         netdev_change_seq_changed(&dev->up);
     } else if (err != EEXIST){
         return err;
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 7ab90d5..95520af 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -359,6 +359,22 @@ 
         </p>
       </column>
 
+      <column name="other_config" key="per-port-mp-enabled"
+              type='{"type": "boolean"}'>
+        <p>
+          By default OVS DPDK uses a shared mempool memory model wherein
+          devices that have the same MTU and socket values can share the same
+          mempool. Setting this value to <code>true</code> changes this
+          behaviour. Per port mempools allow DPDK devices to use a private
+          mempool per device. This can provide greater transparency
+          regarding memory usage but potentially at the cost of greater memory
+          requirements.
+        </p>
+        <p>
+          Changing this value requires restarting the daemon.
+        </p>
+      </column>
+
       <column name="other_config" key="tx-flush-interval"
               type='{"type": "integer",
                      "minInteger": 0, "maxInteger": 1000000}'>