[ovs-dev,RFC,v2,1/1] dpdk: Support both shared and per port mempools.

Message ID 1528735045-2600-1-git-send-email-ian.stokes@intel.com
State Superseded
Delegated to: Ian Stokes
Headers show
Series
  • [ovs-dev,RFC,v2,1/1] dpdk: Support both shared and per port mempools.
Related show

Commit Message

Ian Stokes June 11, 2018, 4:37 p.m.
This commit re-introduces the concept of shared mempools as the default
memory model for DPDK devices. Per port mempools are still available but
must be enabled explicitly by a user.

OVS previously used a shared mempool model for ports with the same MTU
and socket configuration. This was replaced by a per port mempool model
to address issues flagged by users such as:

https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html

However the per port model potentially requires an increase in memory
resource requirements to support the same number of ports and configuration
as the shared port model.

This is considered a blocking factor for current deployments of OVS
when upgrading to future OVS releases as a user may have to redimension
memory for the same deployment configuration. This may not be possible for
users.

This commit resolves the issue by re-introducing shared mempools as
the default memory behaviour in OVS DPDK but also refactors the memory
configuration code to allow for per port mempools.

This patch adds a new global config option, per-port-mp, that
controls the enablement of per port mempools for DPDK devices.

    ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true

This value defaults to false; to enable per port memory support,
this field should be set to true when setting other global parameters
on init (such as "dpdk-socket-mem", for example). Changing the value at
runtime is not supported, and requires restarting the vswitch
daemon.

The mempool sweep functionality is also replaced with the
sweep functionality from OVS 2.9 found in commits

c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)

As this patch is RFC there are a number of TO-DOs including adding a
memory calculation section to the documentation for both models. This is
expected to be completed in the v1 after RFC.

Signed-off-by: Ian Stokes <ian.stokes@intel.com>

---
v1 -> v2
* Rebase to head of master.
* Change global config option 'per-port-mp-enabled' to 'per-port-memory'.
  in commit message & documentation and code.
* Specify in documentation that restart of daemon is only required if per
  port-port-memory is set after dpdk-init=true.
* Return comment 'Contains all 'struct dpdk_mp's.' which was lost in
  previous refactor.
* Improve comments regarding unique mempool names in the shared mempool
  usecase.
* Check per_port_mp condition first when verifying if new mempool is
  required in netdev_dpdk_mempool_configure() for the shared mempool
  usecase.
* Correctly return ret variable instead of 0 for function
  netdev_dpdk_mempool_configure().
* Move VLOG_ERR regarding failure to create mempool for a device prior to
  dpdk_mp_create() returns.
* Add VLOG_DBG message flagging that the number of mbufs could not be
  allocated and that it will be retried with half that amount.
* Fix indentation of VLOG_ERR in netdev_dpdk_mempool_configure().
* Handle EEXIST case for per port memory in function dpdk_mp_get() to
  avoid duplication of dpdk_mps, correctly set refcount and return
  correct dpdk_mp for the device to use.
---
 Documentation/automake.mk            |   1 +
 Documentation/topics/dpdk/index.rst  |   1 +
 Documentation/topics/dpdk/memory.rst |  67 ++++++++
 NEWS                                 |   1 +
 lib/dpdk-stub.c                      |   6 +
 lib/dpdk.c                           |  12 ++
 lib/dpdk.h                           |   1 +
 lib/netdev-dpdk.c                    | 298 +++++++++++++++++++++++------------
 vswitchd/vswitch.xml                 |  17 ++
 9 files changed, 304 insertions(+), 100 deletions(-)
 create mode 100644 Documentation/topics/dpdk/memory.rst

Comments

Kevin Traynor June 19, 2018, 11:11 a.m. | #1
On 06/11/2018 05:37 PM, Ian Stokes wrote:
> This commit re-introduces the concept of shared mempools as the default
> memory model for DPDK devices. Per port mempools are still available but
> must be enabled explicitly by a user.
> 

Hi Ian, thanks for v2, comments below

> OVS previously used a shared mempool model for ports with the same MTU
> and socket configuration. This was replaced by a per port mempool model
> to address issues flagged by users such as:
> 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
> 
> However the per port model potentially requires an increase in memory
> resource requirements to support the same number of ports and configuration
> as the shared port model.
> 
> This is considered a blocking factor for current deployments of OVS
> when upgrading to future OVS releases as a user may have to redimension
> memory for the same deployment configuration. This may not be possible for
> users.
> 
> This commit resolves the issue by re-introducing shared mempools as
> the default memory behaviour in OVS DPDK but also refactors the memory
> configuration code to allow for per port mempools.
> 
> This patch adds a new global config option, per-port-mp, that

s/per-port-mp/per-port-memory/

> controls the enablement of per port mempools for DPDK devices.
> 
>     ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
> 
> This value defaults to false; to enable per port memory support,
> this field should be set to true when setting other global parameters
> on init (such as "dpdk-socket-mem", for example). Changing the value at
> runtime is not supported, and requires restarting the vswitch
> daemon.
> 
> The mempool sweep functionality is also replaced with the
> sweep functionality from OVS 2.9 found in commits
> 
> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
> 
> As this patch is RFC there are a number of TO-DOs including adding a
> memory calculation section to the documentation for both models. This is
> expected to be completed in the v1 after RFC.
> 
> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
> 
> ---
> v1 -> v2
> * Rebase to head of master.
> * Change global config option 'per-port-mp-enabled' to 'per-port-memory'.
>   in commit message & documentation and code.
> * Specify in documentation that restart of daemon is only required if per
>   port-port-memory is set after dpdk-init=true.
> * Return comment 'Contains all 'struct dpdk_mp's.' which was lost in
>   previous refactor.
> * Improve comments regarding unique mempool names in the shared mempool
>   usecase.
> * Check per_port_mp condition first when verifying if new mempool is
>   required in netdev_dpdk_mempool_configure() for the shared mempool
>   usecase.
> * Correctly return ret variable instead of 0 for function
>   netdev_dpdk_mempool_configure().
> * Move VLOG_ERR regarding failure to create mempool for a device prior to
>   dpdk_mp_create() returns.
> * Add VLOG_DBG message flagging that the number of mbufs could not be
>   allocated and that it will be retried with half that amount.
> * Fix indentation of VLOG_ERR in netdev_dpdk_mempool_configure().
> * Handle EEXIST case for per port memory in function dpdk_mp_get() to
>   avoid duplication of dpdk_mps, correctly set refcount and return
>   correct dpdk_mp for the device to use.
> ---
>  Documentation/automake.mk            |   1 +
>  Documentation/topics/dpdk/index.rst  |   1 +
>  Documentation/topics/dpdk/memory.rst |  67 ++++++++
>  NEWS                                 |   1 +
>  lib/dpdk-stub.c                      |   6 +
>  lib/dpdk.c                           |  12 ++
>  lib/dpdk.h                           |   1 +
>  lib/netdev-dpdk.c                    | 298 +++++++++++++++++++++++------------
>  vswitchd/vswitch.xml                 |  17 ++
>  9 files changed, 304 insertions(+), 100 deletions(-)
>  create mode 100644 Documentation/topics/dpdk/memory.rst
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 2202df4..141b33d 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>  	Documentation/topics/dpdk/index.rst \
>  	Documentation/topics/dpdk/bridge.rst \
>  	Documentation/topics/dpdk/jumbo-frames.rst \
> +	Documentation/topics/dpdk/memory.rst \
>  	Documentation/topics/dpdk/pdump.rst \
>  	Documentation/topics/dpdk/phy.rst \
>  	Documentation/topics/dpdk/pmd.rst \
> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> index 181f61a..cf24a7b 100644
> --- a/Documentation/topics/dpdk/index.rst
> +++ b/Documentation/topics/dpdk/index.rst
> @@ -40,3 +40,4 @@ The DPDK Datapath
>     /topics/dpdk/qos
>     /topics/dpdk/pdump
>     /topics/dpdk/jumbo-frames
> +   /topics/dpdk/memory
> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> new file mode 100644
> index 0000000..7c00f0f
> --- /dev/null
> +++ b/Documentation/topics/dpdk/memory.rst
> @@ -0,0 +1,67 @@
> +..
> +        Copyright 2018, Intel, Inc.
> +
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +=========================
> +DPDK Device Memory Models
> +=========================
> +
> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
> +shared mempools or per port mempools. The specifics of both are detailed
> +below.
> +
> +Shared Mempools
> +---------------
> +
> +By Default OVS DPDK uses a shared mempool model. This means that multiple
> +ports can share the same mempool. For example when a port is added it will
> +have a given MTU and socket ID associated with it. If a mempool has been
> +created previously for an existing port that has the same MTU and socket ID,
> +that mempool is used for both ports. If there is no existing mempool
> +supporting these parameters then a new mempool is created.
> +
> +Per Port Mempools
> +-----------------
> +
> +In the per port mempool model, mempools are created per device and are not
> +shared. The benefit of this is a more transparent memory model where mempools
> +will not be exhausted by other DPDK devices. However this comes at a potential
> +increase in cost for memory dimensioning for a given deployment. Users should
> +be aware of the memory requirements for their deployment before using this
> +model and allocate the required hugepage memory.
> +
> +Per port mempool support may be enabled via a global config value,
> +```per-port-mp-enabled```. Setting this to true enables the per port mempool
> +model for all DPDK devices in OVS::
> +
> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
> +
> +.. important::
> +
> +    This value should be set before setting dpdk-init=true. If set after
> +    dpdk-init=true then the daemon must be restarted to use per-port-memory.
> +
> +.. todo::
> +
> +   Add memory calculation section for both mempool models.
> diff --git a/NEWS b/NEWS
> index 484c6dc..eff6cf1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -33,6 +33,7 @@ Post-v2.9.0
>         See Testing topic for the details.
>       * Add LSC interrupt support for DPDK physical devices.
>       * Allow init to fail and record DPDK status/version in OVS database.
> +     * Support both shared and per port mempools for DPDK devices.
>     - Userspace datapath:
>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>       * Detailed PMD performance metrics available with new command
> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> index 1df1c58..fd3fe21 100644
> --- a/lib/dpdk-stub.c
> +++ b/lib/dpdk-stub.c
> @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void)
>      return false;
>  }
>  
> +bool
> +dpdk_per_port_mempool_enabled(void)
> +{
> +    return false;
> +}
> +
>  void
>  print_dpdk_version(void)
>  {
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 09afd8c..6a02a17 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -47,6 +47,7 @@ static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
>  static bool dpdk_initialized = false; /* Indicates successful initialization
>                                         * of DPDK. */
> +static bool per_port_memory = false; /* Status of per port memory support */
>  
>  static int
>  process_vhost_flags(char *flag, const char *default_val, int size,
> @@ -358,6 +359,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>      VLOG_INFO("IOMMU support for vhost-user-client %s.",
>                 vhost_iommu_enabled ? "enabled" : "disabled");
>  
> +    per_port_memory = smap_get_bool(ovs_other_config,
> +                                    "per-port-memory", false);
> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
> +              per_port_memory ? "enabled" : "disabled");
> +
>      argv = grow_argv(&argv, 0, 1);
>      argc = 1;
>      argv[0] = xstrdup(ovs_get_program_name());
> @@ -515,6 +521,12 @@ dpdk_vhost_iommu_enabled(void)
>      return vhost_iommu_enabled;
>  }
>  
> +bool
> +dpdk_per_port_memory(void)
> +{
> +    return per_port_memory;
> +}
> +
>  void
>  dpdk_set_lcore_id(unsigned cpu)
>  {
> diff --git a/lib/dpdk.h b/lib/dpdk.h
> index efdaa63..bbb89d4 100644
> --- a/lib/dpdk.h
> +++ b/lib/dpdk.h
> @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>  void dpdk_set_lcore_id(unsigned cpu);
>  const char *dpdk_get_vhost_sock_dir(void);
>  bool dpdk_vhost_iommu_enabled(void);
> +bool dpdk_per_port_memory(void);
>  void print_dpdk_version(void);
>  void dpdk_status(const struct ovsrec_open_vswitch *);
>  #endif /* dpdk.h */
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 2e2f568..2c7ff81 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
>  
> -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
> - * roughly estimated number of mbufs: if this fails (because the system doesn't
> - * have enough hugepages) we keep halving the number until the allocation
> - * succeeds or we reach MIN_NB_MBUF */
> +/* Max and min number of packets in the mempool.  OVS tries to allocate a
> + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> + * enough hugepages) we keep halving the number until the allocation succeeds
> + * or we reach MIN_NB_MBUF */
> +
> +#define MAX_NB_MBUF          (4096 * 64)
>  #define MIN_NB_MBUF          (4096 * 4)
>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>  
> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
> +                  == 0);
> +
> +/* The smallest possible NB_MBUF that we're going to try should be a multiple
> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
> +                  % MP_CACHE_SZ == 0);
> +
>  /*
>   * DPDK XSTATS Counter names definition
>   */
> @@ -297,12 +308,14 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
>      = OVS_MUTEX_INITIALIZER;
>  
>  /* Contains all 'struct dpdk_mp's. */
> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>  
> -/* Wrapper for a mempool released but not yet freed. */
>  struct dpdk_mp {
>       struct rte_mempool *mp;
> +     int mtu;
> +     int socket_id;
> +     int refcount;
>       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>   };
>  
> @@ -383,7 +396,7 @@ struct netdev_dpdk {
>  
>      PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>          struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
> -        struct rte_mempool *mp;
> +        struct dpdk_mp *dpdk_mp;
>  
>          /* virtio identifier for vhost devices */
>          ovsrcu_index vid;
> @@ -544,73 +557,95 @@ dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>       * likely be ok, as there are additional checks during mempool
>       * freeing but it would make things racey.
>       */
> -    return rte_mempool_full(mp);
> +    rte_mempool_full(mp);
> +    return 0;

I don't think you intended this? unless you are in the business of
selling memory ;-)

>  }
>  
>  /* Free unused mempools. */
>  static void
> -dpdk_mp_sweep(void)
> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>  {
>      struct dpdk_mp *dmp, *next;
>  
> -    ovs_mutex_lock(&dpdk_mp_mutex);
> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> -        if (dpdk_mp_full(dmp->mp)) {
> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>              VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>              ovs_list_remove(&dmp->list_node);
>              rte_mempool_free(dmp->mp);
>              rte_free(dmp);
>          }
>      }
> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>  }
>  
> -/* Ensure a mempool will not be freed. */
> -static void
> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> +/* Calculating the required number of mbufs differs depending on the
> + * mempool model being used. Check if per port mempools are in use before
> + * calculating.
> + */
> +static uint32_t
> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
> -    struct dpdk_mp *dmp, *next;
> +    uint32_t n_mbufs;
>  
> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> -        if (dmp->mp == mp) {
> -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
> -            ovs_list_remove(&dmp->list_node);
> -            rte_free(dmp);
> -            break;
> +    if (!per_port_mp) {
> +        /* Shared mempools are being used.
> +         * XXX: this is a really rough method of provisioning memory.
> +         * It's impossible to determine what the exact memory requirements are
> +         * when the number of ports and rxqs that utilize a particular mempool
> +         * can change dynamically at runtime. For now, use this rough
> +         * heurisitic.
> +         */
> +        if (mtu >= ETHER_MTU) {
> +            n_mbufs = MAX_NB_MBUF;
> +        } else {
> +            n_mbufs = MIN_NB_MBUF;
>          }
> +    } else {
> +        /* Per port mempools are being used.
> +         * XXX: rough estimation of number of mbufs required for this port:
> +         * <packets required to fill the device rxqs>
> +         * + <packets that could be stuck on other ports txqs>
> +         * + <packets in the pmd threads>
> +         * + <additional memory for corner cases>
> +         */
> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> +                  + dev->requested_n_txq * dev->requested_txq_size
> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> +                  + MIN_NB_MBUF;
>      }
> +
> +    return n_mbufs;
>  }
>  
> -/* Returns a valid pointer when either of the following is true:
> - *  - a new mempool was just created;
> - *  - a matching mempool already exists. */
> -static struct rte_mempool *
> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> +static struct dpdk_mp *
> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      const char *netdev_name = netdev_get_name(&dev->up);
>      int socket_id = dev->requested_socket_id;
>      uint32_t n_mbufs;
>      uint32_t hash = hash_string(netdev_name, 0);
> -    struct rte_mempool *mp = NULL;
> -
> -    /*
> -     * XXX: rough estimation of number of mbufs required for this port:
> -     * <packets required to fill the device rxqs>
> -     * + <packets that could be stuck on other ports txqs>
> -     * + <packets in the pmd threads>
> -     * + <additional memory for corner cases>
> -     */
> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> -              + dev->requested_n_txq * dev->requested_txq_size
> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> -              + MIN_NB_MBUF;
> +    struct dpdk_mp *dmp = NULL;
> +    int ret;
> +
> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
> +    if (!dmp) {
> +        return NULL;
> +    }
> +    dmp->socket_id = socket_id;
> +    dmp->mtu = mtu;
> +    dmp->refcount = 1;
> +
> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>  
> -    ovs_mutex_lock(&dpdk_mp_mutex);
>      do {
>          /* Full DPDK memory pool name must be unique and cannot be
> -         * longer than RTE_MEMPOOL_NAMESIZE. */
> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> +         * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
> +         * mempool case this can result in one device using a mempool
> +         * which references a different device in it's name. However as
> +         * mempool names are hashed, the device name will not be readable
> +         * so this is not an issue for tasks such as debugging.
> +         */
> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>                             "ovs%08x%02d%05d%07u",
>                             hash, socket_id, mtu, n_mbufs);
>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
> @@ -626,96 +661,158 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>                    netdev_name, n_mbufs, socket_id,
>                    dev->requested_n_rxq, dev->requested_n_txq);
>  
> -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> +                                          MP_CACHE_SZ,
> +                                          sizeof (struct dp_packet)
> +                                          - sizeof (struct rte_mbuf),
> +                                          MBUF_SIZE(mtu)
> +                                          - sizeof(struct dp_packet),
> +                                          socket_id);
>  
> -        if (mp) {
> +        if (dmp->mp) {
>              VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>                       mp_name, n_mbufs);
>              /* rte_pktmbuf_pool_create has done some initialization of the
> -             * rte_mbuf part of each dp_packet. Some OvS specific fields
> -             * of the packet still need to be initialized by
> -             * ovs_rte_pktmbuf_init. */
> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
> +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
> +             * initializes some OVS specific fields of dp_packet.
> +             */
> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
> +            return dmp;
>          } else if (rte_errno == EEXIST) {
>              /* A mempool with the same name already exists.  We just
>               * retrieve its pointer to be returned to the caller. */
> -            mp = rte_mempool_lookup(mp_name);
> +            dmp->mp = rte_mempool_lookup(mp_name);
>              /* As the mempool create returned EEXIST we can expect the
>               * lookup has returned a valid pointer.  If for some reason
>               * that's not the case we keep track of it. */
>              VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
> -                     mp_name, mp);
> -            /* Ensure this reused mempool will not be freed. */
> -            dpdk_mp_do_not_free(mp);
> +                     mp_name, dmp->mp);
> +            return dmp;
>          } else {
> -            VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
> -                     mp_name, n_mbufs);
> +            VLOG_DBG("Failed to create mempool \"%s\" with a request of "
> +                     "%u mbufs, retrying with %u mbufs",
> +                     mp_name, n_mbufs, n_mbufs/2);
>          }
> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>  
> -    ovs_mutex_unlock(&dpdk_mp_mutex);
> -    return mp;
> +    VLOG_ERR("Failed to create mempool \"%s\" with a request of %u mbufs",
> +             mp_name, n_mbufs);
> +
> +    rte_free(dmp);
> +    return NULL;
>  }
>  
> -/* Release an existing mempool. */
> -static void
> -dpdk_mp_release(struct rte_mempool *mp)
> +static struct dpdk_mp *
> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
> -    if (!mp) {
> -        return;
> -    }
> +    struct dpdk_mp *dmp, *next;
> +    bool reuse = false;
>  
>      ovs_mutex_lock(&dpdk_mp_mutex);
> -    if (dpdk_mp_full(mp)) {
> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
> -        rte_mempool_free(mp);
> -    } else {
> -        struct dpdk_mp *dmp;p
> +    /* Check if shared mempools are being used, if so check existing mempools
> +     * to see if reuse is possible. */
> +    if (!per_port_mp) {
> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
> +            if (dmp->socket_id == dev->requested_socket_id
> +                && dmp->mtu == mtu) {
> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
> +                dmp->refcount++;
> +                reuse = true;
> +                break;
> +            }
> +        }
> +    }
> +    /* Sweep mempools after reuse or before create. */
> +    dpdk_mp_sweep();
>  
> -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
> +    if (!reuse) {
> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>          if (dmp) {
> -            dmp->mp = mp;
> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
> +            /* We need to check for the EEXIST case for per port memory.

I'm pretty sure this also applies to the shared mempool case, because
you have rounded the mtu with dpdk_buf_size() so the requested_mtu might
be different but the rounded value that is used for the mempool name
might be the same, meaning that you can get an EEXISTS.

> +             * Compare the mempool returned by dmp to each entry in
> +             * dpdk_mp_list. If a match is found, free dmp as a new entry
> +             * is not required, set dmp to point to the existing
> +             * entry and set refcount to 1 to avoid being freed at a later
> +             * stage.
> +             */
> +            if (per_port_mp && rte_errno == EEXIST) {
> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
> +                    if (dmp->mp == next->mp) {
> +                        rte_free(dmp);
> +                        dmp = next;
> +                        dmp->refcount = 1;
> +                    }
> +                }
> +            }
> +            else {
> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
> +            }

I think you need to increment refcount and use the safe list option. How
about

if (rte_errno == EEXIST) {
    LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
        if (dmp->mp == next->mp) {
            next->refcount++;
            rte_free(dmp);
            break;
        }
    }
} else {
    dmp->refcount++;
    ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
}

>          }
>      }
> +
>      ovs_mutex_unlock(&dpdk_mp_mutex);
> +
> +    return dmp;
>  }
>  
> -/* Tries to allocate a new mempool - or re-use an existing one where
> - * appropriate - on requested_socket_id with a size determined by
> - * requested_mtu and requested Rx/Tx queues.
> - * On success - or when re-using an existing mempool - the new configuration
> - * will be applied.
> +/* Decrement reference to a mempool. */
> +static void
> +dpdk_mp_put(struct dpdk_mp *dmp)
> +{
> +    if (!dmp) {
> +        return;
> +    }
> +
> +    ovs_mutex_lock(&dpdk_mp_mutex);
> +    ovs_assert(dmp->refcount);
> +    dmp->refcount--;
> +    ovs_mutex_unlock(&dpdk_mp_mutex);
> +}
> +
> +/* Depending on the memory model being used this function tries
> + * identify and reuse an existing mempool or tries to allocate new
> + * mempool on requested_socket_id with mbuf size corresponding to
> + * requested_mtu. On success new configuration will be applied.
>   * On error, device will be left unchanged. */
>  static int
>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>      OVS_REQUIRES(dev->mutex)
>  {
>      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
> -    struct rte_mempool *mp;
> +    struct dpdk_mp *mp;

I think it should have a different name now, as it leads to mp->mp
later. It would be good if you could make is consistent with the other
functions (dmp?)

>      int ret = 0;
> +    bool per_port_mp = dpdk_per_port_memory();
>  
> -    dpdk_mp_sweep();
> +    /* With shared mempools we do not need to configure a mempool if the MTU
> +     * and socket ID have not changed, the previous configuration is still
> +     * valid so return 0 */
> +    if (!per_port_mp && dev->mtu == dev->requested_mtu
> +        && dev->socket_id == dev->requested_socket_id) {
> +        return ret;
> +    }
>  
> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>      if (!mp) {
>          VLOG_ERR("Failed to create memory pool for netdev "
>                   "%s, with MTU %d on socket %d: %s\n",
>                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
>                   rte_strerror(rte_errno));
> -        ret = rte_errno;
> +        return rte_errno;
>      } else {
> -        /* If a new MTU was requested and its rounded value equals the one
> -         * that is currently used, then the existing mempool is returned. */
> -        if (dev->mp != mp) {
> -            /* A new mempool was created, release the previous one. */
> -            dpdk_mp_release(dev->mp);
> -        } else {
> -            ret = EEXIST;
> +        /* Check for any pre-existing dpdk_mp for the device before accessing
> +         * the associated mempool.
> +         */
> +        if (dev->dpdk_mp != NULL) {
> +            /* If a new MTU was requested and its rounded value does not
> +             * equal the rounded value of the current mempool, decrement
> +             * the reference count to devices current dpdk_mp as its
> +             * associated mempool will not be used for this device.
> +             */
> +            if (dev->dpdk_mp->mp != mp->mp) {
	
Not sure if check is needed. If it were to be false, wouldn't we have
already reused or got an EEXISTS and handled that case?

> +                dpdk_mp_put(dev->dpdk_mp);
> +            }
>          }
> -        dev->mp = mp;
> +        dev->dpdk_mp = mp;
>          dev->mtu = dev->requested_mtu;
>          dev->socket_id = dev->requested_socket_id;
>          dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> @@ -838,7 +935,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>  
>          for (i = 0; i < n_rxq; i++) {
>              diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
> -                                          dev->socket_id, NULL, dev->mp);
> +                                          dev->socket_id, NULL,
> +                                          dev->dpdk_mp->mp);
>              if (diag) {
>                  VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>                            dev->up.name, i, rte_strerror(-diag));
> @@ -926,7 +1024,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>      memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>      rte_eth_link_get_nowait(dev->port_id, &dev->link);
>  
> -    mbp_priv = rte_mempool_get_priv(dev->mp);
> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>      dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>  
>      /* Get the Flow control configuration for DPDK-ETH */
> @@ -1180,7 +1278,7 @@ common_destruct(struct netdev_dpdk *dev)
>      OVS_EXCLUDED(dev->mutex)
>  {
>      rte_free(dev->tx_q);
> -    dpdk_mp_release(dev->mp);
> +    dpdk_mp_put(dev->dpdk_mp);
>  
>      ovs_list_remove(&dev->list_node);
>      free(ovsrcu_get_protected(struct ingress_policer *,
> @@ -1933,7 +2031,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>          return EAGAIN;
>      }
>  
> -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>                                      (struct rte_mbuf **) batch->packets,
>                                      NETDEV_MAX_BURST);
>      if (!nb_rx) {
> @@ -2172,7 +2270,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>              continue;
>          }
>  
> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>          if (OVS_UNLIKELY(!pkts[txcnt])) {
>              dropped += cnt - i;
>              break;
> @@ -3052,7 +3150,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
>          ovs_mutex_lock(&dev->mutex);
>          ovs_mutex_lock(&dpdk_mp_mutex);
>  
> -        rte_mempool_dump(stream, dev->mp);
> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>  
>          ovs_mutex_unlock(&dpdk_mp_mutex);
>          ovs_mutex_unlock(&dev->mutex);
> @@ -3749,7 +3847,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>  
>      err = netdev_dpdk_mempool_configure(dev);
>      if (!err) {
> -        /* A new mempool was created. */
> +        /* A new mempool was created or re-used. */
>          netdev_change_seq_changed(&dev->up);
>      } else if (err != EEXIST){
>          return err;
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 1e27a02..dbf70ac 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -359,6 +359,23 @@
>          </p>
>        </column>
>  
> +      <column name="other_config" key="per-port-memory"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          By default OVS DPDK uses a shared mempool memory model wherein
> +          devices that have the same MTU and socket values can share the same
> +          mempool. Setting this value to <code>true</code> changes this
> +          behaviour. Per port memory allow DPDK devices to use private
> +          memory per device. This can provide greater transapraency as
> +          regards memory usage but potentially at the cost of greater memory
> +          requirements.
> +        </p>
> +        <p>
> +          Changing this value requires restarting the daemon if dpdk-init has
> +          already been set to true.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="tx-flush-interval"
>                type='{"type": "integer",
>                       "minInteger": 0, "maxInteger": 1000000}'>
>
Kevin Traynor June 19, 2018, 11:16 a.m. | #2
On 06/19/2018 12:11 PM, Kevin Traynor wrote:
>> +            if (per_port_mp && rte_errno == EEXIST) {
>> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>> +                    if (dmp->mp == next->mp) {
>> +                        rte_free(dmp);
>> +                        dmp = next;
>> +                        dmp->refcount = 1;
>> +                    }
>> +                }
>> +            }
>> +            else {
>> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>> +            }
> I think you need to increment refcount and use the safe list option. How
> about
> 

Actually no, you don't need the safe list option, as it's the dmp that
is being freed

> if (rte_errno == EEXIST) {
>     LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
>         if (dmp->mp == next->mp) {
>             next->refcount++;
>             rte_free(dmp);
>             break;
>         }
>     }
> } else {
>     dmp->refcount++;
>     ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
> }
>
Kevin Traynor June 19, 2018, 11:46 a.m. | #3
On 06/19/2018 12:16 PM, Kevin Traynor wrote:
> On 06/19/2018 12:11 PM, Kevin Traynor wrote:
>>> +            if (per_port_mp && rte_errno == EEXIST) {
>>> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>>> +                    if (dmp->mp == next->mp) {
>>> +                        rte_free(dmp);
>>> +                        dmp = next;
>>> +                        dmp->refcount = 1;
>>> +                    }
>>> +                }
>>> +            }
>>> +            else {
>>> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>> +            }
>> I think you need to increment refcount and use the safe list option. How
>> about
>>
> 
> Actually no, you don't need the safe list option, as it's the dmp that
> is being freed

I obviously misread this (or wasn't concentrating enough), you do still
need the dmp = next as well.

> 
>> if (rte_errno == EEXIST) {
>>     LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
>>         if (dmp->mp == next->mp) {
>>             next->refcount++;
>>             rte_free(dmp);
>>             break;
>>         }
>>     }
>> } else {
>>     dmp->refcount++;
>>     ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>> }
>>
>
Ian Stokes June 20, 2018, 12:34 p.m. | #4
On 6/19/2018 12:11 PM, Kevin Traynor wrote:
> On 06/11/2018 05:37 PM, Ian Stokes wrote:
>> This commit re-introduces the concept of shared mempools as the default
>> memory model for DPDK devices. Per port mempools are still available but
>> must be enabled explicitly by a user.
>>
> 
> Hi Ian, thanks for v2, comments below

Thanks for the review Kevin, comments inline.

> 
>> OVS previously used a shared mempool model for ports with the same MTU
>> and socket configuration. This was replaced by a per port mempool model
>> to address issues flagged by users such as:
>>
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>
>> However the per port model potentially requires an increase in memory
>> resource requirements to support the same number of ports and configuration
>> as the shared port model.
>>
>> This is considered a blocking factor for current deployments of OVS
>> when upgrading to future OVS releases as a user may have to redimension
>> memory for the same deployment configuration. This may not be possible for
>> users.
>>
>> This commit resolves the issue by re-introducing shared mempools as
>> the default memory behaviour in OVS DPDK but also refactors the memory
>> configuration code to allow for per port mempools.
>>
>> This patch adds a new global config option, per-port-mp, that
> 
> s/per-port-mp/per-port-memory/

Apologies, I spotted a few of these 'per-port-mp' myself, I believe it's 
in the documentation as well. I have them fixed for the next revision.

> 
>> controls the enablement of per port mempools for DPDK devices.
>>
>>      ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>
>> This value defaults to false; to enable per port memory support,
>> this field should be set to true when setting other global parameters
>> on init (such as "dpdk-socket-mem", for example). Changing the value at
>> runtime is not supported, and requires restarting the vswitch
>> daemon.
>>
>> The mempool sweep functionality is also replaced with the
>> sweep functionality from OVS 2.9 found in commits
>>
>> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
>> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>>
>> As this patch is RFC there are a number of TO-DOs including adding a
>> memory calculation section to the documentation for both models. This is
>> expected to be completed in the v1 after RFC.
>>
>> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
>>
>> ---
>> v1 -> v2
>> * Rebase to head of master.
>> * Change global config option 'per-port-mp-enabled' to 'per-port-memory'.
>>    in commit message & documentation and code.
>> * Specify in documentation that restart of daemon is only required if per
>>    port-port-memory is set after dpdk-init=true.
>> * Return comment 'Contains all 'struct dpdk_mp's.' which was lost in
>>    previous refactor.
>> * Improve comments regarding unique mempool names in the shared mempool
>>    usecase.
>> * Check per_port_mp condition first when verifying if new mempool is
>>    required in netdev_dpdk_mempool_configure() for the shared mempool
>>    usecase.
>> * Correctly return ret variable instead of 0 for function
>>    netdev_dpdk_mempool_configure().
>> * Move VLOG_ERR regarding failure to create mempool for a device prior to
>>    dpdk_mp_create() returns.
>> * Add VLOG_DBG message flagging that the number of mbufs could not be
>>    allocated and that it will be retried with half that amount.
>> * Fix indentation of VLOG_ERR in netdev_dpdk_mempool_configure().
>> * Handle EEXIST case for per port memory in function dpdk_mp_get() to
>>    avoid duplication of dpdk_mps, correctly set refcount and return
>>    correct dpdk_mp for the device to use.
>> ---
>>   Documentation/automake.mk            |   1 +
>>   Documentation/topics/dpdk/index.rst  |   1 +
>>   Documentation/topics/dpdk/memory.rst |  67 ++++++++
>>   NEWS                                 |   1 +
>>   lib/dpdk-stub.c                      |   6 +
>>   lib/dpdk.c                           |  12 ++
>>   lib/dpdk.h                           |   1 +
>>   lib/netdev-dpdk.c                    | 298 +++++++++++++++++++++++------------
>>   vswitchd/vswitch.xml                 |  17 ++
>>   9 files changed, 304 insertions(+), 100 deletions(-)
>>   create mode 100644 Documentation/topics/dpdk/memory.rst
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>> index 2202df4..141b33d 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>>   	Documentation/topics/dpdk/index.rst \
>>   	Documentation/topics/dpdk/bridge.rst \
>>   	Documentation/topics/dpdk/jumbo-frames.rst \
>> +	Documentation/topics/dpdk/memory.rst \
>>   	Documentation/topics/dpdk/pdump.rst \
>>   	Documentation/topics/dpdk/phy.rst \
>>   	Documentation/topics/dpdk/pmd.rst \
>> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
>> index 181f61a..cf24a7b 100644
>> --- a/Documentation/topics/dpdk/index.rst
>> +++ b/Documentation/topics/dpdk/index.rst
>> @@ -40,3 +40,4 @@ The DPDK Datapath
>>      /topics/dpdk/qos
>>      /topics/dpdk/pdump
>>      /topics/dpdk/jumbo-frames
>> +   /topics/dpdk/memory
>> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
>> new file mode 100644
>> index 0000000..7c00f0f
>> --- /dev/null
>> +++ b/Documentation/topics/dpdk/memory.rst
>> @@ -0,0 +1,67 @@
>> +..
>> +        Copyright 2018, Intel, Inc.
>> +
>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>> +      not use this file except in compliance with the License. You may obtain
>> +      a copy of the License at
>> +
>> +          http://www.apache.org/licenses/LICENSE-2.0
>> +
>> +      Unless required by applicable law or agreed to in writing, software
>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>> +      License for the specific language governing permissions and limitations
>> +      under the License.
>> +
>> +      Convention for heading levels in Open vSwitch documentation:
>> +
>> +      =======  Heading 0 (reserved for the title in a document)
>> +      -------  Heading 1
>> +      ~~~~~~~  Heading 2
>> +      +++++++  Heading 3
>> +      '''''''  Heading 4
>> +
>> +      Avoid deeper levels because they do not render well.
>> +
>> +=========================
>> +DPDK Device Memory Models
>> +=========================
>> +
>> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
>> +shared mempools or per port mempools. The specifics of both are detailed
>> +below.
>> +
>> +Shared Mempools
>> +---------------
>> +
>> +By Default OVS DPDK uses a shared mempool model. This means that multiple
>> +ports can share the same mempool. For example when a port is added it will
>> +have a given MTU and socket ID associated with it. If a mempool has been
>> +created previously for an existing port that has the same MTU and socket ID,
>> +that mempool is used for both ports. If there is no existing mempool
>> +supporting these parameters then a new mempool is created.
>> +
>> +Per Port Mempools
>> +-----------------
>> +
>> +In the per port mempool model, mempools are created per device and are not
>> +shared. The benefit of this is a more transparent memory model where mempools
>> +will not be exhausted by other DPDK devices. However this comes at a potential
>> +increase in cost for memory dimensioning for a given deployment. Users should
>> +be aware of the memory requirements for their deployment before using this
>> +model and allocate the required hugepage memory.
>> +
>> +Per port mempool support may be enabled via a global config value,
>> +```per-port-mp-enabled```. Setting this to true enables the per port mempool
>> +model for all DPDK devices in OVS::
>> +
>> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>> +
>> +.. important::
>> +
>> +    This value should be set before setting dpdk-init=true. If set after
>> +    dpdk-init=true then the daemon must be restarted to use per-port-memory.
>> +
>> +.. todo::
>> +
>> +   Add memory calculation section for both mempool models.
>> diff --git a/NEWS b/NEWS
>> index 484c6dc..eff6cf1 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -33,6 +33,7 @@ Post-v2.9.0
>>          See Testing topic for the details.
>>        * Add LSC interrupt support for DPDK physical devices.
>>        * Allow init to fail and record DPDK status/version in OVS database.
>> +     * Support both shared and per port mempools for DPDK devices.
>>      - Userspace datapath:
>>        * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>>        * Detailed PMD performance metrics available with new command
>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>> index 1df1c58..fd3fe21 100644
>> --- a/lib/dpdk-stub.c
>> +++ b/lib/dpdk-stub.c
>> @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void)
>>       return false;
>>   }
>>   
>> +bool
>> +dpdk_per_port_mempool_enabled(void)
>> +{
>> +    return false;
>> +}
>> +
>>   void
>>   print_dpdk_version(void)
>>   {
>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>> index 09afd8c..6a02a17 100644
>> --- a/lib/dpdk.c
>> +++ b/lib/dpdk.c
>> @@ -47,6 +47,7 @@ static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>>   static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
>>   static bool dpdk_initialized = false; /* Indicates successful initialization
>>                                          * of DPDK. */
>> +static bool per_port_memory = false; /* Status of per port memory support */
>>   
>>   static int
>>   process_vhost_flags(char *flag, const char *default_val, int size,
>> @@ -358,6 +359,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>>       VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>                  vhost_iommu_enabled ? "enabled" : "disabled");
>>   
>> +    per_port_memory = smap_get_bool(ovs_other_config,
>> +                                    "per-port-memory", false);
>> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
>> +              per_port_memory ? "enabled" : "disabled");
>> +
>>       argv = grow_argv(&argv, 0, 1);
>>       argc = 1;
>>       argv[0] = xstrdup(ovs_get_program_name());
>> @@ -515,6 +521,12 @@ dpdk_vhost_iommu_enabled(void)
>>       return vhost_iommu_enabled;
>>   }
>>   
>> +bool
>> +dpdk_per_port_memory(void)
>> +{
>> +    return per_port_memory;
>> +}
>> +
>>   void
>>   dpdk_set_lcore_id(unsigned cpu)
>>   {
>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>> index efdaa63..bbb89d4 100644
>> --- a/lib/dpdk.h
>> +++ b/lib/dpdk.h
>> @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>>   void dpdk_set_lcore_id(unsigned cpu);
>>   const char *dpdk_get_vhost_sock_dir(void);
>>   bool dpdk_vhost_iommu_enabled(void);
>> +bool dpdk_per_port_memory(void);
>>   void print_dpdk_version(void);
>>   void dpdk_status(const struct ovsrec_open_vswitch *);
>>   #endif /* dpdk.h */
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 2e2f568..2c7ff81 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>   #define NETDEV_DPDK_MBUF_ALIGN      1024
>>   #define NETDEV_DPDK_MAX_PKT_LEN     9728
>>   
>> -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
>> - * roughly estimated number of mbufs: if this fails (because the system doesn't
>> - * have enough hugepages) we keep halving the number until the allocation
>> - * succeeds or we reach MIN_NB_MBUF */
>> +/* Max and min number of packets in the mempool.  OVS tries to allocate a
>> + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
>> + * enough hugepages) we keep halving the number until the allocation succeeds
>> + * or we reach MIN_NB_MBUF */
>> +
>> +#define MAX_NB_MBUF          (4096 * 64)
>>   #define MIN_NB_MBUF          (4096 * 4)
>>   #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>   
>> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
>> +                  == 0);
>> +
>> +/* The smallest possible NB_MBUF that we're going to try should be a multiple
>> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
>> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
>> +                  % MP_CACHE_SZ == 0);
>> +
>>   /*
>>    * DPDK XSTATS Counter names definition
>>    */
>> @@ -297,12 +308,14 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
>>       = OVS_MUTEX_INITIALIZER;
>>   
>>   /* Contains all 'struct dpdk_mp's. */
>> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
>> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
>> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
>> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>>   
>> -/* Wrapper for a mempool released but not yet freed. */
>>   struct dpdk_mp {
>>        struct rte_mempool *mp;
>> +     int mtu;
>> +     int socket_id;
>> +     int refcount;
>>        struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>>    };
>>   
>> @@ -383,7 +396,7 @@ struct netdev_dpdk {
>>   
>>       PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>           struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>> -        struct rte_mempool *mp;
>> +        struct dpdk_mp *dpdk_mp;
>>   
>>           /* virtio identifier for vhost devices */
>>           ovsrcu_index vid;
>> @@ -544,73 +557,95 @@ dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>>        * likely be ok, as there are additional checks during mempool
>>        * freeing but it would make things racey.
>>        */
>> -    return rte_mempool_full(mp);
>> +    rte_mempool_full(mp);
>> +    return 0;
> 
> I don't think you intended this? unless you are in the business of
> selling memory ;-)

Ooops :-/ , This is left over from testing when mbufs are still in use, 
not sure how I missed it. Will correct for v3.

> 
>>   }
>>   
>>   /* Free unused mempools. */
>>   static void
>> -dpdk_mp_sweep(void)
>> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>   {
>>       struct dpdk_mp *dmp, *next;
>>   
>> -    ovs_mutex_lock(&dpdk_mp_mutex);
>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>> -        if (dpdk_mp_full(dmp->mp)) {
>> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
>> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>>               VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>>               ovs_list_remove(&dmp->list_node);
>>               rte_mempool_free(dmp->mp);
>>               rte_free(dmp);
>>           }
>>       }
>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>>   }
>>   
>> -/* Ensure a mempool will not be freed. */
>> -static void
>> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>> +/* Calculating the required number of mbufs differs depending on the
>> + * mempool model being used. Check if per port mempools are in use before
>> + * calculating.
>> + */
>> +static uint32_t
>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>   {
>> -    struct dpdk_mp *dmp, *next;
>> +    uint32_t n_mbufs;
>>   
>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>> -        if (dmp->mp == mp) {
>> -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
>> -            ovs_list_remove(&dmp->list_node);
>> -            rte_free(dmp);
>> -            break;
>> +    if (!per_port_mp) {
>> +        /* Shared mempools are being used.
>> +         * XXX: this is a really rough method of provisioning memory.
>> +         * It's impossible to determine what the exact memory requirements are
>> +         * when the number of ports and rxqs that utilize a particular mempool
>> +         * can change dynamically at runtime. For now, use this rough
>> +         * heurisitic.
>> +         */
>> +        if (mtu >= ETHER_MTU) {
>> +            n_mbufs = MAX_NB_MBUF;
>> +        } else {
>> +            n_mbufs = MIN_NB_MBUF;
>>           }
>> +    } else {
>> +        /* Per port mempools are being used.
>> +         * XXX: rough estimation of number of mbufs required for this port:
>> +         * <packets required to fill the device rxqs>
>> +         * + <packets that could be stuck on other ports txqs>
>> +         * + <packets in the pmd threads>
>> +         * + <additional memory for corner cases>
>> +         */
>> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>> +                  + dev->requested_n_txq * dev->requested_txq_size
>> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>> +                  + MIN_NB_MBUF;
>>       }
>> +
>> +    return n_mbufs;
>>   }
>>   
>> -/* Returns a valid pointer when either of the following is true:
>> - *  - a new mempool was just created;
>> - *  - a matching mempool already exists. */
>> -static struct rte_mempool *
>> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>> +static struct dpdk_mp *
>> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>   {
>>       char mp_name[RTE_MEMPOOL_NAMESIZE];
>>       const char *netdev_name = netdev_get_name(&dev->up);
>>       int socket_id = dev->requested_socket_id;
>>       uint32_t n_mbufs;
>>       uint32_t hash = hash_string(netdev_name, 0);
>> -    struct rte_mempool *mp = NULL;
>> -
>> -    /*
>> -     * XXX: rough estimation of number of mbufs required for this port:
>> -     * <packets required to fill the device rxqs>
>> -     * + <packets that could be stuck on other ports txqs>
>> -     * + <packets in the pmd threads>
>> -     * + <additional memory for corner cases>
>> -     */
>> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>> -              + dev->requested_n_txq * dev->requested_txq_size
>> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>> -              + MIN_NB_MBUF;
>> +    struct dpdk_mp *dmp = NULL;
>> +    int ret;
>> +
>> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
>> +    if (!dmp) {
>> +        return NULL;
>> +    }
>> +    dmp->socket_id = socket_id;
>> +    dmp->mtu = mtu;
>> +    dmp->refcount = 1;
>> +
>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>   
>> -    ovs_mutex_lock(&dpdk_mp_mutex);
>>       do {
>>           /* Full DPDK memory pool name must be unique and cannot be
>> -         * longer than RTE_MEMPOOL_NAMESIZE. */
>> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>> +         * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
>> +         * mempool case this can result in one device using a mempool
>> +         * which references a different device in it's name. However as
>> +         * mempool names are hashed, the device name will not be readable
>> +         * so this is not an issue for tasks such as debugging.
>> +         */
>> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>                              "ovs%08x%02d%05d%07u",
>>                              hash, socket_id, mtu, n_mbufs);
>>           if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>> @@ -626,96 +661,158 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>                     netdev_name, n_mbufs, socket_id,
>>                     dev->requested_n_rxq, dev->requested_n_txq);
>>   
>> -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
>> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
>> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>> +                                          MP_CACHE_SZ,
>> +                                          sizeof (struct dp_packet)
>> +                                          - sizeof (struct rte_mbuf),
>> +                                          MBUF_SIZE(mtu)
>> +                                          - sizeof(struct dp_packet),
>> +                                          socket_id);
>>   
>> -        if (mp) {
>> +        if (dmp->mp) {
>>               VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>>                        mp_name, n_mbufs);
>>               /* rte_pktmbuf_pool_create has done some initialization of the
>> -             * rte_mbuf part of each dp_packet. Some OvS specific fields
>> -             * of the packet still need to be initialized by
>> -             * ovs_rte_pktmbuf_init. */
>> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
>> +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
>> +             * initializes some OVS specific fields of dp_packet.
>> +             */
>> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
>> +            return dmp;
>>           } else if (rte_errno == EEXIST) {
>>               /* A mempool with the same name already exists.  We just
>>                * retrieve its pointer to be returned to the caller. */
>> -            mp = rte_mempool_lookup(mp_name);
>> +            dmp->mp = rte_mempool_lookup(mp_name);
>>               /* As the mempool create returned EEXIST we can expect the
>>                * lookup has returned a valid pointer.  If for some reason
>>                * that's not the case we keep track of it. */
>>               VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
>> -                     mp_name, mp);
>> -            /* Ensure this reused mempool will not be freed. */
>> -            dpdk_mp_do_not_free(mp);
>> +                     mp_name, dmp->mp);
>> +            return dmp;
>>           } else {
>> -            VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
>> -                     mp_name, n_mbufs);
>> +            VLOG_DBG("Failed to create mempool \"%s\" with a request of "
>> +                     "%u mbufs, retrying with %u mbufs",
>> +                     mp_name, n_mbufs, n_mbufs/2);
>>           }
>> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>>   
>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>> -    return mp;
>> +    VLOG_ERR("Failed to create mempool \"%s\" with a request of %u mbufs",
>> +             mp_name, n_mbufs);
>> +
>> +    rte_free(dmp);
>> +    return NULL;
>>   }
>>   
>> -/* Release an existing mempool. */
>> -static void
>> -dpdk_mp_release(struct rte_mempool *mp)
>> +static struct dpdk_mp *
>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>   {
>> -    if (!mp) {
>> -        return;
>> -    }
>> +    struct dpdk_mp *dmp, *next;
>> +    bool reuse = false;
>>   
>>       ovs_mutex_lock(&dpdk_mp_mutex);
>> -    if (dpdk_mp_full(mp)) {
>> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>> -        rte_mempool_free(mp);
>> -    } else {
>> -        struct dpdk_mp *dmp;p
>> +    /* Check if shared mempools are being used, if so check existing mempools
>> +     * to see if reuse is possible. */
>> +    if (!per_port_mp) {
>> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>> +            if (dmp->socket_id == dev->requested_socket_id
>> +                && dmp->mtu == mtu) {
>> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>> +                dmp->refcount++;
>> +                reuse = true;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    /* Sweep mempools after reuse or before create. */
>> +    dpdk_mp_sweep();
>>   
>> -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
>> +    if (!reuse) {
>> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>>           if (dmp) {
>> -            dmp->mp = mp;
>> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
>> +            /* We need to check for the EEXIST case for per port memory.
> 
> I'm pretty sure this also applies to the shared mempool case, because
> you have rounded the mtu with dpdk_buf_size() so the requested_mtu might
> be different but the rounded value that is used for the mempool name
> might be the same, meaning that you can get an EEXISTS.
> 

Maybe I misunderstood but I would think in that case it will be caught 
by the 'reuse' code above? If the rounded value of a new ports 
requested_mtu is equal to that of an existing mempool mtu and its on the 
same socket it will be reused, you would not get to dpdk_mp_create() as 
the reuse condition will be true.

I had tested this with 2 ports, one with a requested mtu of 1500 and a 
second with a requested mtu of 1800. Both are rounded to 2030 and the 
reuse case was hit. Do you think this wont always be the case though?

>> +             * Compare the mempool returned by dmp to each entry in
>> +             * dpdk_mp_list. If a match is found, free dmp as a new entry
>> +             * is not required, set dmp to point to the existing
>> +             * entry and set refcount to 1 to avoid being freed at a later
>> +             * stage.
>> +             */
>> +            if (per_port_mp && rte_errno == EEXIST) {

If it's the case that shared_mempool will always reuse we may not need 
the per_port_mp in the condition above.

>> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>> +                    if (dmp->mp == next->mp) {
>> +                        rte_free(dmp);
>> +                        dmp = next;
>> +                        dmp->refcount = 1;
>> +                    }
>> +                }
>> +            }
>> +            else {
>> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>> +            }
> 
> I think you need to increment refcount and use the safe list option. How
> about

I spotted the other mails and responded there also, my plan is to change 
to increment rather than = 1. but will keep the rte_free(dmp) and dmp = 
next if that sounds ok?

> 
> if (rte_errno == EEXIST) {
>      LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
>          if (dmp->mp == next->mp) {
>              next->refcount++;
>              rte_free(dmp);
>              break;
>          }
>      }
> } else {
>      dmp->refcount++;
>      ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
> }
> 
>>           }
>>       }
>> +
>>       ovs_mutex_unlock(&dpdk_mp_mutex);
>> +
>> +    return dmp;
>>   }
>>   
>> -/* Tries to allocate a new mempool - or re-use an existing one where
>> - * appropriate - on requested_socket_id with a size determined by
>> - * requested_mtu and requested Rx/Tx queues.
>> - * On success - or when re-using an existing mempool - the new configuration
>> - * will be applied.
>> +/* Decrement reference to a mempool. */
>> +static void
>> +dpdk_mp_put(struct dpdk_mp *dmp)
>> +{
>> +    if (!dmp) {
>> +        return;
>> +    }
>> +
>> +    ovs_mutex_lock(&dpdk_mp_mutex);
>> +    ovs_assert(dmp->refcount);
>> +    dmp->refcount--;
>> +    ovs_mutex_unlock(&dpdk_mp_mutex);
>> +}
>> +
>> +/* Depending on the memory model being used this function tries
>> + * identify and reuse an existing mempool or tries to allocate new
>> + * mempool on requested_socket_id with mbuf size corresponding to
>> + * requested_mtu. On success new configuration will be applied.
>>    * On error, device will be left unchanged. */
>>   static int
>>   netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>       OVS_REQUIRES(dev->mutex)
>>   {
>>       uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>> -    struct rte_mempool *mp;
>> +    struct dpdk_mp *mp;
> 
> I think it should have a different name now, as it leads to mp->mp
> later. It would be good if you could make is consistent with the other
> functions (dmp?)

Sure,  will do.
> 
>>       int ret = 0;
>> +    bool per_port_mp = dpdk_per_port_memory();
>>   
>> -    dpdk_mp_sweep();
>> +    /* With shared mempools we do not need to configure a mempool if the MTU
>> +     * and socket ID have not changed, the previous configuration is still
>> +     * valid so return 0 */
>> +    if (!per_port_mp && dev->mtu == dev->requested_mtu
>> +        && dev->socket_id == dev->requested_socket_id) {
>> +        return ret;
>> +    }
>>   
>> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>>       if (!mp) {
>>           VLOG_ERR("Failed to create memory pool for netdev "
>>                    "%s, with MTU %d on socket %d: %s\n",
>>                    dev->up.name, dev->requested_mtu, dev->requested_socket_id,
>>                    rte_strerror(rte_errno));
>> -        ret = rte_errno;
>> +        return rte_errno;
>>       } else {
>> -        /* If a new MTU was requested and its rounded value equals the one
>> -         * that is currently used, then the existing mempool is returned. */
>> -        if (dev->mp != mp) {
>> -            /* A new mempool was created, release the previous one. */
>> -            dpdk_mp_release(dev->mp);
>> -        } else {
>> -            ret = EEXIST;
>> +        /* Check for any pre-existing dpdk_mp for the device before accessing
>> +         * the associated mempool.
>> +         */
>> +        if (dev->dpdk_mp != NULL) {
>> +            /* If a new MTU was requested and its rounded value does not
>> +             * equal the rounded value of the current mempool, decrement
>> +             * the reference count to devices current dpdk_mp as its
>> +             * associated mempool will not be used for this device.
>> +             */
>> +            if (dev->dpdk_mp->mp != mp->mp) {
> 	
> Not sure if check is needed. If it were to be false, wouldn't we have
> already reused or got an EEXISTS and handled that case?
> 

Yes, your right, it can be removed.

>> +                dpdk_mp_put(dev->dpdk_mp);
>> +            }
>>           }
>> -        dev->mp = mp;
>> +        dev->dpdk_mp = mp;
>>           dev->mtu = dev->requested_mtu;
>>           dev->socket_id = dev->requested_socket_id;
>>           dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> @@ -838,7 +935,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>   
>>           for (i = 0; i < n_rxq; i++) {
>>               diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
>> -                                          dev->socket_id, NULL, dev->mp);
>> +                                          dev->socket_id, NULL,
>> +                                          dev->dpdk_mp->mp);
>>               if (diag) {
>>                   VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>>                             dev->up.name, i, rte_strerror(-diag));
>> @@ -926,7 +1024,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>       memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>>       rte_eth_link_get_nowait(dev->port_id, &dev->link);
>>   
>> -    mbp_priv = rte_mempool_get_priv(dev->mp);
>> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>>       dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>>   
>>       /* Get the Flow control configuration for DPDK-ETH */
>> @@ -1180,7 +1278,7 @@ common_destruct(struct netdev_dpdk *dev)
>>       OVS_EXCLUDED(dev->mutex)
>>   {
>>       rte_free(dev->tx_q);
>> -    dpdk_mp_release(dev->mp);
>> +    dpdk_mp_put(dev->dpdk_mp);
>>   
>>       ovs_list_remove(&dev->list_node);
>>       free(ovsrcu_get_protected(struct ingress_policer *,
>> @@ -1933,7 +2031,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>>           return EAGAIN;
>>       }
>>   
>> -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>>                                       (struct rte_mbuf **) batch->packets,
>>                                       NETDEV_MAX_BURST);
>>       if (!nb_rx) {
>> @@ -2172,7 +2270,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>>               continue;
>>           }
>>   
>> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
>> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>>           if (OVS_UNLIKELY(!pkts[txcnt])) {
>>               dropped += cnt - i;
>>               break;
>> @@ -3052,7 +3150,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
>>           ovs_mutex_lock(&dev->mutex);
>>           ovs_mutex_lock(&dpdk_mp_mutex);
>>   
>> -        rte_mempool_dump(stream, dev->mp);
>> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>>   
>>           ovs_mutex_unlock(&dpdk_mp_mutex);
>>           ovs_mutex_unlock(&dev->mutex);
>> @@ -3749,7 +3847,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>>   
>>       err = netdev_dpdk_mempool_configure(dev);
>>       if (!err) {
>> -        /* A new mempool was created. */
>> +        /* A new mempool was created or re-used. */
>>           netdev_change_seq_changed(&dev->up);
>>       } else if (err != EEXIST){
>>           return err;
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 1e27a02..dbf70ac 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -359,6 +359,23 @@
>>           </p>
>>         </column>
>>   
>> +      <column name="other_config" key="per-port-memory"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +          By default OVS DPDK uses a shared mempool memory model wherein
>> +          devices that have the same MTU and socket values can share the same
>> +          mempool. Setting this value to <code>true</code> changes this
>> +          behaviour. Per port memory allow DPDK devices to use private
>> +          memory per device. This can provide greater transapraency as
>> +          regards memory usage but potentially at the cost of greater memory
>> +          requirements.
>> +        </p>
>> +        <p>
>> +          Changing this value requires restarting the daemon if dpdk-init has
>> +          already been set to true.
>> +        </p>
>> +      </column>
>> +
>>         <column name="other_config" key="tx-flush-interval"
>>                 type='{"type": "integer",
>>                        "minInteger": 0, "maxInteger": 1000000}'>
>>
>
Ian Stokes June 20, 2018, 12:36 p.m. | #5
On 6/19/2018 12:46 PM, Kevin Traynor wrote:
> On 06/19/2018 12:16 PM, Kevin Traynor wrote:
>> On 06/19/2018 12:11 PM, Kevin Traynor wrote:
>>>> +            if (per_port_mp && rte_errno == EEXIST) {
>>>> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>>>> +                    if (dmp->mp == next->mp) {
>>>> +                        rte_free(dmp);
>>>> +                        dmp = next;
>>>> +                        dmp->refcount = 1;
>>>> +                    }
>>>> +                }
>>>> +            }
>>>> +            else {
>>>> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>>> +            }
>>> I think you need to increment refcount and use the safe list option. How
>>> about
>>>
>>
>> Actually no, you don't need the safe list option, as it's the dmp that
>> is being freed
> 
> I obviously misread this (or wasn't concentrating enough), you do still
> need the dmp = next as well.
> 

Sure, I've responded in the original thread that I'll take this 
approach. I'm not against using the second safe list but wanted to keep 
it to 1 list if possible.

Ian
>>
>>> if (rte_errno == EEXIST) {
>>>      LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
>>>          if (dmp->mp == next->mp) {
>>>              next->refcount++;
>>>              rte_free(dmp);
>>>              break;
>>>          }
>>>      }
>>> } else {
>>>      dmp->refcount++;
>>>      ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>> }
>>>
>>
>
Kevin Traynor June 20, 2018, 5:24 p.m. | #6
On 06/20/2018 01:34 PM, Ian Stokes wrote:
> On 6/19/2018 12:11 PM, Kevin Traynor wrote:
>> On 06/11/2018 05:37 PM, Ian Stokes wrote:
>>> This commit re-introduces the concept of shared mempools as the default
>>> memory model for DPDK devices. Per port mempools are still available but
>>> must be enabled explicitly by a user.
>>>
>>
>> Hi Ian, thanks for v2, comments below
> 
> Thanks for the review Kevin, comments inline.
> 
>>
>>> OVS previously used a shared mempool model for ports with the same MTU
>>> and socket configuration. This was replaced by a per port mempool model
>>> to address issues flagged by users such as:
>>>
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>>
>>>
>>> However the per port model potentially requires an increase in memory
>>> resource requirements to support the same number of ports and
>>> configuration
>>> as the shared port model.
>>>
>>> This is considered a blocking factor for current deployments of OVS
>>> when upgrading to future OVS releases as a user may have to redimension
>>> memory for the same deployment configuration. This may not be
>>> possible for
>>> users.
>>>
>>> This commit resolves the issue by re-introducing shared mempools as
>>> the default memory behaviour in OVS DPDK but also refactors the memory
>>> configuration code to allow for per port mempools.
>>>
>>> This patch adds a new global config option, per-port-mp, that
>>
>> s/per-port-mp/per-port-memory/
> 
> Apologies, I spotted a few of these 'per-port-mp' myself, I believe it's
> in the documentation as well. I have them fixed for the next revision.
> 
>>
>>> controls the enablement of per port mempools for DPDK devices.
>>>
>>>      ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>>
>>> This value defaults to false; to enable per port memory support,
>>> this field should be set to true when setting other global parameters
>>> on init (such as "dpdk-socket-mem", for example). Changing the value at
>>> runtime is not supported, and requires restarting the vswitch
>>> daemon.
>>>
>>> The mempool sweep functionality is also replaced with the
>>> sweep functionality from OVS 2.9 found in commits
>>>
>>> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
>>> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>>>
>>> As this patch is RFC there are a number of TO-DOs including adding a
>>> memory calculation section to the documentation for both models. This is
>>> expected to be completed in the v1 after RFC.
>>>
>>> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
>>>
>>> ---
>>> v1 -> v2
>>> * Rebase to head of master.
>>> * Change global config option 'per-port-mp-enabled' to
>>> 'per-port-memory'.
>>>    in commit message & documentation and code.
>>> * Specify in documentation that restart of daemon is only required if
>>> per
>>>    port-port-memory is set after dpdk-init=true.
>>> * Return comment 'Contains all 'struct dpdk_mp's.' which was lost in
>>>    previous refactor.
>>> * Improve comments regarding unique mempool names in the shared mempool
>>>    usecase.
>>> * Check per_port_mp condition first when verifying if new mempool is
>>>    required in netdev_dpdk_mempool_configure() for the shared mempool
>>>    usecase.
>>> * Correctly return ret variable instead of 0 for function
>>>    netdev_dpdk_mempool_configure().
>>> * Move VLOG_ERR regarding failure to create mempool for a device
>>> prior to
>>>    dpdk_mp_create() returns.
>>> * Add VLOG_DBG message flagging that the number of mbufs could not be
>>>    allocated and that it will be retried with half that amount.
>>> * Fix indentation of VLOG_ERR in netdev_dpdk_mempool_configure().
>>> * Handle EEXIST case for per port memory in function dpdk_mp_get() to
>>>    avoid duplication of dpdk_mps, correctly set refcount and return
>>>    correct dpdk_mp for the device to use.
>>> ---
>>>   Documentation/automake.mk            |   1 +
>>>   Documentation/topics/dpdk/index.rst  |   1 +
>>>   Documentation/topics/dpdk/memory.rst |  67 ++++++++
>>>   NEWS                                 |   1 +
>>>   lib/dpdk-stub.c                      |   6 +
>>>   lib/dpdk.c                           |  12 ++
>>>   lib/dpdk.h                           |   1 +
>>>   lib/netdev-dpdk.c                    | 298
>>> +++++++++++++++++++++++------------
>>>   vswitchd/vswitch.xml                 |  17 ++
>>>   9 files changed, 304 insertions(+), 100 deletions(-)
>>>   create mode 100644 Documentation/topics/dpdk/memory.rst
>>>
>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>> index 2202df4..141b33d 100644
>>> --- a/Documentation/automake.mk
>>> +++ b/Documentation/automake.mk
>>> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>>>       Documentation/topics/dpdk/index.rst \
>>>       Documentation/topics/dpdk/bridge.rst \
>>>       Documentation/topics/dpdk/jumbo-frames.rst \
>>> +    Documentation/topics/dpdk/memory.rst \
>>>       Documentation/topics/dpdk/pdump.rst \
>>>       Documentation/topics/dpdk/phy.rst \
>>>       Documentation/topics/dpdk/pmd.rst \
>>> diff --git a/Documentation/topics/dpdk/index.rst
>>> b/Documentation/topics/dpdk/index.rst
>>> index 181f61a..cf24a7b 100644
>>> --- a/Documentation/topics/dpdk/index.rst
>>> +++ b/Documentation/topics/dpdk/index.rst
>>> @@ -40,3 +40,4 @@ The DPDK Datapath
>>>      /topics/dpdk/qos
>>>      /topics/dpdk/pdump
>>>      /topics/dpdk/jumbo-frames
>>> +   /topics/dpdk/memory
>>> diff --git a/Documentation/topics/dpdk/memory.rst
>>> b/Documentation/topics/dpdk/memory.rst
>>> new file mode 100644
>>> index 0000000..7c00f0f
>>> --- /dev/null
>>> +++ b/Documentation/topics/dpdk/memory.rst
>>> @@ -0,0 +1,67 @@
>>> +..
>>> +        Copyright 2018, Intel, Inc.
>>> +
>>> +      Licensed under the Apache License, Version 2.0 (the
>>> "License"); you may
>>> +      not use this file except in compliance with the License. You
>>> may obtain
>>> +      a copy of the License at
>>> +
>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>> +
>>> +      Unless required by applicable law or agreed to in writing,
>>> software
>>> +      distributed under the License is distributed on an "AS IS"
>>> BASIS, WITHOUT
>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied. See the
>>> +      License for the specific language governing permissions and
>>> limitations
>>> +      under the License.
>>> +
>>> +      Convention for heading levels in Open vSwitch documentation:
>>> +
>>> +      =======  Heading 0 (reserved for the title in a document)
>>> +      -------  Heading 1
>>> +      ~~~~~~~  Heading 2
>>> +      +++++++  Heading 3
>>> +      '''''''  Heading 4
>>> +
>>> +      Avoid deeper levels because they do not render well.
>>> +
>>> +=========================
>>> +DPDK Device Memory Models
>>> +=========================
>>> +
>>> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
>>> +shared mempools or per port mempools. The specifics of both are
>>> detailed
>>> +below.
>>> +
>>> +Shared Mempools
>>> +---------------
>>> +
>>> +By Default OVS DPDK uses a shared mempool model. This means that
>>> multiple
>>> +ports can share the same mempool. For example when a port is added
>>> it will
>>> +have a given MTU and socket ID associated with it. If a mempool has
>>> been
>>> +created previously for an existing port that has the same MTU and
>>> socket ID,
>>> +that mempool is used for both ports. If there is no existing mempool
>>> +supporting these parameters then a new mempool is created.
>>> +
>>> +Per Port Mempools
>>> +-----------------
>>> +
>>> +In the per port mempool model, mempools are created per device and
>>> are not
>>> +shared. The benefit of this is a more transparent memory model where
>>> mempools
>>> +will not be exhausted by other DPDK devices. However this comes at a
>>> potential
>>> +increase in cost for memory dimensioning for a given deployment.
>>> Users should
>>> +be aware of the memory requirements for their deployment before
>>> using this
>>> +model and allocate the required hugepage memory.
>>> +
>>> +Per port mempool support may be enabled via a global config value,
>>> +```per-port-mp-enabled```. Setting this to true enables the per port
>>> mempool
>>> +model for all DPDK devices in OVS::
>>> +
>>> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>> +
>>> +.. important::
>>> +
>>> +    This value should be set before setting dpdk-init=true. If set
>>> after
>>> +    dpdk-init=true then the daemon must be restarted to use
>>> per-port-memory.
>>> +
>>> +.. todo::
>>> +
>>> +   Add memory calculation section for both mempool models.
>>> diff --git a/NEWS b/NEWS
>>> index 484c6dc..eff6cf1 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -33,6 +33,7 @@ Post-v2.9.0
>>>          See Testing topic for the details.
>>>        * Add LSC interrupt support for DPDK physical devices.
>>>        * Allow init to fail and record DPDK status/version in OVS
>>> database.
>>> +     * Support both shared and per port mempools for DPDK devices.
>>>      - Userspace datapath:
>>>        * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a
>>> single PMD
>>>        * Detailed PMD performance metrics available with new command
>>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>>> index 1df1c58..fd3fe21 100644
>>> --- a/lib/dpdk-stub.c
>>> +++ b/lib/dpdk-stub.c
>>> @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void)
>>>       return false;
>>>   }
>>>   +bool
>>> +dpdk_per_port_mempool_enabled(void)
>>> +{
>>> +    return false;
>>> +}
>>> +
>>>   void
>>>   print_dpdk_version(void)
>>>   {
>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>> index 09afd8c..6a02a17 100644
>>> --- a/lib/dpdk.c
>>> +++ b/lib/dpdk.c
>>> @@ -47,6 +47,7 @@ static char *vhost_sock_dir = NULL;   /* Location
>>> of vhost-user sockets */
>>>   static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU
>>> support */
>>>   static bool dpdk_initialized = false; /* Indicates successful
>>> initialization
>>>                                          * of DPDK. */
>>> +static bool per_port_memory = false; /* Status of per port memory
>>> support */
>>>     static int
>>>   process_vhost_flags(char *flag, const char *default_val, int size,
>>> @@ -358,6 +359,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>       VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>>                  vhost_iommu_enabled ? "enabled" : "disabled");
>>>   +    per_port_memory = smap_get_bool(ovs_other_config,
>>> +                                    "per-port-memory", false);
>>> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
>>> +              per_port_memory ? "enabled" : "disabled");
>>> +
>>>       argv = grow_argv(&argv, 0, 1);
>>>       argc = 1;
>>>       argv[0] = xstrdup(ovs_get_program_name());
>>> @@ -515,6 +521,12 @@ dpdk_vhost_iommu_enabled(void)
>>>       return vhost_iommu_enabled;
>>>   }
>>>   +bool
>>> +dpdk_per_port_memory(void)
>>> +{
>>> +    return per_port_memory;
>>> +}
>>> +
>>>   void
>>>   dpdk_set_lcore_id(unsigned cpu)
>>>   {
>>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>>> index efdaa63..bbb89d4 100644
>>> --- a/lib/dpdk.h
>>> +++ b/lib/dpdk.h
>>> @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>>>   void dpdk_set_lcore_id(unsigned cpu);
>>>   const char *dpdk_get_vhost_sock_dir(void);
>>>   bool dpdk_vhost_iommu_enabled(void);
>>> +bool dpdk_per_port_memory(void);
>>>   void print_dpdk_version(void);
>>>   void dpdk_status(const struct ovsrec_open_vswitch *);
>>>   #endif /* dpdk.h */
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 2e2f568..2c7ff81 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl =
>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>   #define NETDEV_DPDK_MBUF_ALIGN      1024
>>>   #define NETDEV_DPDK_MAX_PKT_LEN     9728
>>>   -/* Min number of packets in the mempool.  OVS tries to allocate a
>>> mempool with
>>> - * roughly estimated number of mbufs: if this fails (because the
>>> system doesn't
>>> - * have enough hugepages) we keep halving the number until the
>>> allocation
>>> - * succeeds or we reach MIN_NB_MBUF */
>>> +/* Max and min number of packets in the mempool.  OVS tries to
>>> allocate a
>>> + * mempool with MAX_NB_MBUF: if this fails (because the system
>>> doesn't have
>>> + * enough hugepages) we keep halving the number until the allocation
>>> succeeds
>>> + * or we reach MIN_NB_MBUF */
>>> +
>>> +#define MAX_NB_MBUF          (4096 * 64)
>>>   #define MIN_NB_MBUF          (4096 * 4)
>>>   #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>   +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF /
>>> MIN_NB_MBUF)
>>> +                  == 0);
>>> +
>>> +/* The smallest possible NB_MBUF that we're going to try should be a
>>> multiple
>>> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
>>> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF /
>>> MIN_NB_MBUF))
>>> +                  % MP_CACHE_SZ == 0);
>>> +
>>>   /*
>>>    * DPDK XSTATS Counter names definition
>>>    */
>>> @@ -297,12 +308,14 @@ static struct ovs_mutex dpdk_mp_mutex
>>> OVS_ACQ_AFTER(dpdk_mutex)
>>>       = OVS_MUTEX_INITIALIZER;
>>>     /* Contains all 'struct dpdk_mp's. */
>>> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
>>> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
>>> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
>>> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>>>   -/* Wrapper for a mempool released but not yet freed. */
>>>   struct dpdk_mp {
>>>        struct rte_mempool *mp;
>>> +     int mtu;
>>> +     int socket_id;
>>> +     int refcount;
>>>        struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>>>    };
>>>   @@ -383,7 +396,7 @@ struct netdev_dpdk {
>>>         PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>           struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>>> -        struct rte_mempool *mp;
>>> +        struct dpdk_mp *dpdk_mp;
>>>             /* virtio identifier for vhost devices */
>>>           ovsrcu_index vid;
>>> @@ -544,73 +557,95 @@ dpdk_mp_full(const struct rte_mempool *mp)
>>> OVS_REQUIRES(dpdk_mp_mutex)
>>>        * likely be ok, as there are additional checks during mempool
>>>        * freeing but it would make things racey.
>>>        */
>>> -    return rte_mempool_full(mp);
>>> +    rte_mempool_full(mp);
>>> +    return 0;
>>
>> I don't think you intended this? unless you are in the business of
>> selling memory ;-)
> 
> Ooops :-/ , This is left over from testing when mbufs are still in use,
> not sure how I missed it. Will correct for v3.
> 

Maybe PATCH v1 ? or you are still testing out?

>>
>>>   }
>>>     /* Free unused mempools. */
>>>   static void
>>> -dpdk_mp_sweep(void)
>>> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>>   {
>>>       struct dpdk_mp *dmp, *next;
>>>   -    ovs_mutex_lock(&dpdk_mp_mutex);
>>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>>> -        if (dpdk_mp_full(dmp->mp)) {
>>> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
>>> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>>>               VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>>>               ovs_list_remove(&dmp->list_node);
>>>               rte_mempool_free(dmp->mp);
>>>               rte_free(dmp);
>>>           }
>>>       }
>>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>>>   }
>>>   -/* Ensure a mempool will not be freed. */
>>> -static void
>>> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>>> +/* Calculating the required number of mbufs differs depending on the
>>> + * mempool model being used. Check if per port mempools are in use
>>> before
>>> + * calculating.
>>> + */
>>> +static uint32_t
>>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool
>>> per_port_mp)
>>>   {
>>> -    struct dpdk_mp *dmp, *next;
>>> +    uint32_t n_mbufs;
>>>   -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>>> -        if (dmp->mp == mp) {
>>> -            VLOG_DBG("Removing mempool \"%s\" from free list",
>>> dmp->mp->name);
>>> -            ovs_list_remove(&dmp->list_node);
>>> -            rte_free(dmp);
>>> -            break;
>>> +    if (!per_port_mp) {
>>> +        /* Shared mempools are being used.
>>> +         * XXX: this is a really rough method of provisioning memory.
>>> +         * It's impossible to determine what the exact memory
>>> requirements are
>>> +         * when the number of ports and rxqs that utilize a
>>> particular mempool
>>> +         * can change dynamically at runtime. For now, use this rough
>>> +         * heurisitic.
>>> +         */
>>> +        if (mtu >= ETHER_MTU) {
>>> +            n_mbufs = MAX_NB_MBUF;
>>> +        } else {
>>> +            n_mbufs = MIN_NB_MBUF;
>>>           }
>>> +    } else {
>>> +        /* Per port mempools are being used.
>>> +         * XXX: rough estimation of number of mbufs required for
>>> this port:
>>> +         * <packets required to fill the device rxqs>
>>> +         * + <packets that could be stuck on other ports txqs>
>>> +         * + <packets in the pmd threads>
>>> +         * + <additional memory for corner cases>
>>> +         */
>>> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>>> +                  + dev->requested_n_txq * dev->requested_txq_size
>>> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
>>> NETDEV_MAX_BURST
>>> +                  + MIN_NB_MBUF;
>>>       }
>>> +
>>> +    return n_mbufs;
>>>   }
>>>   -/* Returns a valid pointer when either of the following is true:
>>> - *  - a new mempool was just created;
>>> - *  - a matching mempool already exists. */
>>> -static struct rte_mempool *
>>> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>> +static struct dpdk_mp *
>>> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>   {
>>>       char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>       const char *netdev_name = netdev_get_name(&dev->up);
>>>       int socket_id = dev->requested_socket_id;
>>>       uint32_t n_mbufs;
>>>       uint32_t hash = hash_string(netdev_name, 0);
>>> -    struct rte_mempool *mp = NULL;
>>> -
>>> -    /*
>>> -     * XXX: rough estimation of number of mbufs required for this port:
>>> -     * <packets required to fill the device rxqs>
>>> -     * + <packets that could be stuck on other ports txqs>
>>> -     * + <packets in the pmd threads>
>>> -     * + <additional memory for corner cases>
>>> -     */
>>> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>>> -              + dev->requested_n_txq * dev->requested_txq_size
>>> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
>>> NETDEV_MAX_BURST
>>> -              + MIN_NB_MBUF;
>>> +    struct dpdk_mp *dmp = NULL;
>>> +    int ret;
>>> +
>>> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
>>> +    if (!dmp) {
>>> +        return NULL;
>>> +    }
>>> +    dmp->socket_id = socket_id;
>>> +    dmp->mtu = mtu;
>>> +    dmp->refcount = 1;
>>> +
>>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>>   -    ovs_mutex_lock(&dpdk_mp_mutex);
>>>       do {
>>>           /* Full DPDK memory pool name must be unique and cannot be
>>> -         * longer than RTE_MEMPOOL_NAMESIZE. */
>>> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>> +         * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
>>> +         * mempool case this can result in one device using a mempool
>>> +         * which references a different device in it's name. However as
>>> +         * mempool names are hashed, the device name will not be
>>> readable
>>> +         * so this is not an issue for tasks such as debugging.
>>> +         */
>>> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>>                              "ovs%08x%02d%05d%07u",
>>>                              hash, socket_id, mtu, n_mbufs);
>>>           if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>>> @@ -626,96 +661,158 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>                     netdev_name, n_mbufs, socket_id,
>>>                     dev->requested_n_rxq, dev->requested_n_txq);
>>>   -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>>> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
>>> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
>>> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>>> +                                          MP_CACHE_SZ,
>>> +                                          sizeof (struct dp_packet)
>>> +                                          - sizeof (struct rte_mbuf),
>>> +                                          MBUF_SIZE(mtu)
>>> +                                          - sizeof(struct dp_packet),
>>> +                                          socket_id);
>>>   -        if (mp) {
>>> +        if (dmp->mp) {
>>>               VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>>>                        mp_name, n_mbufs);
>>>               /* rte_pktmbuf_pool_create has done some initialization
>>> of the
>>> -             * rte_mbuf part of each dp_packet. Some OvS specific
>>> fields
>>> -             * of the packet still need to be initialized by
>>> -             * ovs_rte_pktmbuf_init. */
>>> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
>>> +             * rte_mbuf part of each dp_packet, while
>>> ovs_rte_pktmbuf_init
>>> +             * initializes some OVS specific fields of dp_packet.
>>> +             */
>>> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
>>> +            return dmp;
>>>           } else if (rte_errno == EEXIST) {
>>>               /* A mempool with the same name already exists.  We just
>>>                * retrieve its pointer to be returned to the caller. */
>>> -            mp = rte_mempool_lookup(mp_name);
>>> +            dmp->mp = rte_mempool_lookup(mp_name);
>>>               /* As the mempool create returned EEXIST we can expect the
>>>                * lookup has returned a valid pointer.  If for some
>>> reason
>>>                * that's not the case we keep track of it. */
>>>               VLOG_DBG("A mempool with name \"%s\" already exists at
>>> %p.",
>>> -                     mp_name, mp);
>>> -            /* Ensure this reused mempool will not be freed. */
>>> -            dpdk_mp_do_not_free(mp);
>>> +                     mp_name, dmp->mp);
>>> +            return dmp;
>>>           } else {
>>> -            VLOG_ERR("Failed mempool \"%s\" create request of %u
>>> mbufs",
>>> -                     mp_name, n_mbufs);
>>> +            VLOG_DBG("Failed to create mempool \"%s\" with a request
>>> of "
>>> +                     "%u mbufs, retrying with %u mbufs",
>>> +                     mp_name, n_mbufs, n_mbufs/2);
>>>           }
>>> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
>>> MIN_NB_MBUF);
>>> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
>>> MIN_NB_MBUF);
>>>   -    ovs_mutex_unlock(&dpdk_mp_mutex);
>>> -    return mp;
>>> +    VLOG_ERR("Failed to create mempool \"%s\" with a request of %u
>>> mbufs",
>>> +             mp_name, n_mbufs);
>>> +
>>> +    rte_free(dmp);
>>> +    return NULL;
>>>   }
>>>   -/* Release an existing mempool. */
>>> -static void
>>> -dpdk_mp_release(struct rte_mempool *mp)
>>> +static struct dpdk_mp *
>>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>   {
>>> -    if (!mp) {
>>> -        return;
>>> -    }
>>> +    struct dpdk_mp *dmp, *next;
>>> +    bool reuse = false;
>>>         ovs_mutex_lock(&dpdk_mp_mutex);
>>> -    if (dpdk_mp_full(mp)) {
>>> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>>> -        rte_mempool_free(mp);
>>> -    } else {
>>> -        struct dpdk_mp *dmp;p
>>> +    /* Check if shared mempools are being used, if so check existing
>>> mempools
>>> +     * to see if reuse is possible. */
>>> +    if (!per_port_mp) {
>>> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>> +            if (dmp->socket_id == dev->requested_socket_id
>>> +                && dmp->mtu == mtu) {
>>> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>>> +                dmp->refcount++;
>>> +                reuse = true;
>>> +                break;
>>> +            }
>>> +        }
>>> +    }
>>> +    /* Sweep mempools after reuse or before create. */
>>> +    dpdk_mp_sweep();
>>>   -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
>>> +    if (!reuse) {
>>> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>>>           if (dmp) {
>>> -            dmp->mp = mp;
>>> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
>>> +            /* We need to check for the EEXIST case for per port
>>> memory.
>>
>> I'm pretty sure this also applies to the shared mempool case, because
>> you have rounded the mtu with dpdk_buf_size() so the requested_mtu might
>> be different but the rounded value that is used for the mempool name
>> might be the same, meaning that you can get an EEXISTS.
>>
> 
> Maybe I misunderstood but I would think in that case it will be caught
> by the 'reuse' code above? If the rounded value of a new ports
> requested_mtu is equal to that of an existing mempool mtu and its on the
> same socket it will be reused, you would not get to dpdk_mp_create() as
> the reuse condition will be true.
> 

Yes - you are right, it won't occur. Maybe you could add a small comment
to say something like "Shared mempools will have reused and not
requested a mempool that already exists."

> I had tested this with 2 ports, one with a requested mtu of 1500 and a
> second with a requested mtu of 1800. Both are rounded to 2030 and the
> reuse case was hit. Do you think this wont always be the case though?
> 

No, it's right - also a good test to make sure that the mtu is updated
correctly in that case :-)

>>> +             * Compare the mempool returned by dmp to each entry in
>>> +             * dpdk_mp_list. If a match is found, free dmp as a new
>>> entry
>>> +             * is not required, set dmp to point to the existing
>>> +             * entry and set refcount to 1 to avoid being freed at a
>>> later
>>> +             * stage.
>>> +             */
>>> +            if (per_port_mp && rte_errno == EEXIST) {
> 
> If it's the case that shared_mempool will always reuse we may not need
> the per_port_mp in the condition above.
> 

Probably it's ok either way, but might be better to leave it there,
otherwise you'll have to think about where rte_errno is being set.

>>> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>>> +                    if (dmp->mp == next->mp) {
>>> +                        rte_free(dmp);
>>> +                        dmp = next;
>>> +                        dmp->refcount = 1;
>>> +                    }
>>> +                }
>>> +            }
>>> +            else {
>>> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>> +            }
>>
>> I think you need to increment refcount and use the safe list option. How
>> about
> 
> I spotted the other mails and responded there also, my plan is to change
> to increment rather than = 1. but will keep the rte_free(dmp) and dmp =
> next if that sounds ok?
> 

sounds good, thanks.

>>
>> if (rte_errno == EEXIST) {
>>      LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
>>          if (dmp->mp == next->mp) {
>>              next->refcount++;
>>              rte_free(dmp);
>>              break;
>>          }
>>      }
>> } else {
>>      dmp->refcount++;
>>      ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>> }
>>
>>>           }
>>>       }
>>> +
>>>       ovs_mutex_unlock(&dpdk_mp_mutex);
>>> +
>>> +    return dmp;
>>>   }
>>>   -/* Tries to allocate a new mempool - or re-use an existing one where
>>> - * appropriate - on requested_socket_id with a size determined by
>>> - * requested_mtu and requested Rx/Tx queues.
>>> - * On success - or when re-using an existing mempool - the new
>>> configuration
>>> - * will be applied.
>>> +/* Decrement reference to a mempool. */
>>> +static void
>>> +dpdk_mp_put(struct dpdk_mp *dmp)
>>> +{
>>> +    if (!dmp) {
>>> +        return;
>>> +    }
>>> +
>>> +    ovs_mutex_lock(&dpdk_mp_mutex);
>>> +    ovs_assert(dmp->refcount);
>>> +    dmp->refcount--;
>>> +    ovs_mutex_unlock(&dpdk_mp_mutex);
>>> +}
>>> +
>>> +/* Depending on the memory model being used this function tries
>>> + * identify and reuse an existing mempool or tries to allocate new
>>> + * mempool on requested_socket_id with mbuf size corresponding to
>>> + * requested_mtu. On success new configuration will be applied.
>>>    * On error, device will be left unchanged. */
>>>   static int
>>>   netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>       OVS_REQUIRES(dev->mutex)
>>>   {
>>>       uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>>> -    struct rte_mempool *mp;
>>> +    struct dpdk_mp *mp;
>>
>> I think it should have a different name now, as it leads to mp->mp
>> later. It would be good if you could make is consistent with the other
>> functions (dmp?)
> 
> Sure,  will do.
>>
>>>       int ret = 0;
>>> +    bool per_port_mp = dpdk_per_port_memory();
>>>   -    dpdk_mp_sweep();
>>> +    /* With shared mempools we do not need to configure a mempool if
>>> the MTU
>>> +     * and socket ID have not changed, the previous configuration is
>>> still
>>> +     * valid so return 0 */
>>> +    if (!per_port_mp && dev->mtu == dev->requested_mtu
>>> +        && dev->socket_id == dev->requested_socket_id) {
>>> +        return ret;
>>> +    }
>>>   -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>>> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>>>       if (!mp) {
>>>           VLOG_ERR("Failed to create memory pool for netdev "
>>>                    "%s, with MTU %d on socket %d: %s\n",
>>>                    dev->up.name, dev->requested_mtu,
>>> dev->requested_socket_id,
>>>                    rte_strerror(rte_errno));
>>> -        ret = rte_errno;
>>> +        return rte_errno;
>>>       } else {
>>> -        /* If a new MTU was requested and its rounded value equals
>>> the one
>>> -         * that is currently used, then the existing mempool is
>>> returned. */
>>> -        if (dev->mp != mp) {
>>> -            /* A new mempool was created, release the previous one. */
>>> -            dpdk_mp_release(dev->mp);
>>> -        } else {
>>> -            ret = EEXIST;
>>> +        /* Check for any pre-existing dpdk_mp for the device before
>>> accessing
>>> +         * the associated mempool.
>>> +         */
>>> +        if (dev->dpdk_mp != NULL) {
>>> +            /* If a new MTU was requested and its rounded value does
>>> not
>>> +             * equal the rounded value of the current mempool,
>>> decrement
>>> +             * the reference count to devices current dpdk_mp as its
>>> +             * associated mempool will not be used for this device.
>>> +             */
>>> +            if (dev->dpdk_mp->mp != mp->mp) {
>>     
>> Not sure if check is needed. If it were to be false, wouldn't we have
>> already reused or got an EEXISTS and handled that case?
>>
> 
> Yes, your right, it can be removed.
> 
>>> +                dpdk_mp_put(dev->dpdk_mp);
>>> +            }
>>>           }
>>> -        dev->mp = mp;
>>> +        dev->dpdk_mp = mp;
>>>           dev->mtu = dev->requested_mtu;
>>>           dev->socket_id = dev->requested_socket_id;
>>>           dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>> @@ -838,7 +935,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev,
>>> int n_rxq, int n_txq)
>>>             for (i = 0; i < n_rxq; i++) {
>>>               diag = rte_eth_rx_queue_setup(dev->port_id, i,
>>> dev->rxq_size,
>>> -                                          dev->socket_id, NULL,
>>> dev->mp);
>>> +                                          dev->socket_id, NULL,
>>> +                                          dev->dpdk_mp->mp);
>>>               if (diag) {
>>>                   VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>>>                             dev->up.name, i, rte_strerror(-diag));
>>> @@ -926,7 +1024,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>>       memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>>>       rte_eth_link_get_nowait(dev->port_id, &dev->link);
>>>   -    mbp_priv = rte_mempool_get_priv(dev->mp);
>>> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>>>       dev->buf_size = mbp_priv->mbuf_data_room_size -
>>> RTE_PKTMBUF_HEADROOM;
>>>         /* Get the Flow control configuration for DPDK-ETH */
>>> @@ -1180,7 +1278,7 @@ common_destruct(struct netdev_dpdk *dev)
>>>       OVS_EXCLUDED(dev->mutex)
>>>   {
>>>       rte_free(dev->tx_q);
>>> -    dpdk_mp_release(dev->mp);
>>> +    dpdk_mp_put(dev->dpdk_mp);
>>>         ovs_list_remove(&dev->list_node);
>>>       free(ovsrcu_get_protected(struct ingress_policer *,
>>> @@ -1933,7 +2031,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>>>           return EAGAIN;
>>>       }
>>>   -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>>> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>>>                                       (struct rte_mbuf **)
>>> batch->packets,
>>>                                       NETDEV_MAX_BURST);
>>>       if (!nb_rx) {
>>> @@ -2172,7 +2270,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
>>> struct dp_packet_batch *batch)
>>>               continue;
>>>           }
>>>   -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
>>> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>>>           if (OVS_UNLIKELY(!pkts[txcnt])) {
>>>               dropped += cnt - i;
>>>               break;
>>> @@ -3052,7 +3150,7 @@ netdev_dpdk_get_mempool_info(struct
>>> unixctl_conn *conn,
>>>           ovs_mutex_lock(&dev->mutex);
>>>           ovs_mutex_lock(&dpdk_mp_mutex);
>>>   -        rte_mempool_dump(stream, dev->mp);
>>> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>>>             ovs_mutex_unlock(&dpdk_mp_mutex);
>>>           ovs_mutex_unlock(&dev->mutex);
>>> @@ -3749,7 +3847,7 @@ dpdk_vhost_reconfigure_helper(struct
>>> netdev_dpdk *dev)
>>>         err = netdev_dpdk_mempool_configure(dev);
>>>       if (!err) {
>>> -        /* A new mempool was created. */
>>> +        /* A new mempool was created or re-used. */
>>>           netdev_change_seq_changed(&dev->up);
>>>       } else if (err != EEXIST){
>>>           return err;
>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>> index 1e27a02..dbf70ac 100644
>>> --- a/vswitchd/vswitch.xml
>>> +++ b/vswitchd/vswitch.xml
>>> @@ -359,6 +359,23 @@
>>>           </p>
>>>         </column>
>>>   +      <column name="other_config" key="per-port-memory"
>>> +              type='{"type": "boolean"}'>
>>> +        <p>
>>> +          By default OVS DPDK uses a shared mempool memory model
>>> wherein
>>> +          devices that have the same MTU and socket values can share
>>> the same
>>> +          mempool. Setting this value to <code>true</code> changes this
>>> +          behaviour. Per port memory allow DPDK devices to use private
>>> +          memory per device. This can provide greater transapraency as
>>> +          regards memory usage but potentially at the cost of
>>> greater memory
>>> +          requirements.
>>> +        </p>
>>> +        <p>
>>> +          Changing this value requires restarting the daemon if
>>> dpdk-init has
>>> +          already been set to true.
>>> +        </p>
>>> +      </column>
>>> +
>>>         <column name="other_config" key="tx-flush-interval"
>>>                 type='{"type": "integer",
>>>                        "minInteger": 0, "maxInteger": 1000000}'>
>>>
>>
>
Ian Stokes June 20, 2018, 8:23 p.m. | #7
On 6/20/2018 6:24 PM, Kevin Traynor wrote:
> On 06/20/2018 01:34 PM, Ian Stokes wrote:
>> On 6/19/2018 12:11 PM, Kevin Traynor wrote:
>>> On 06/11/2018 05:37 PM, Ian Stokes wrote:
>>>> This commit re-introduces the concept of shared mempools as the default
>>>> memory model for DPDK devices. Per port mempools are still available but
>>>> must be enabled explicitly by a user.
>>>>
>>>
>>> Hi Ian, thanks for v2, comments below
>>
>> Thanks for the review Kevin, comments inline.
>>
>>>
>>>> OVS previously used a shared mempool model for ports with the same MTU
>>>> and socket configuration. This was replaced by a per port mempool model
>>>> to address issues flagged by users such as:
>>>>
>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>>>
>>>>
>>>> However the per port model potentially requires an increase in memory
>>>> resource requirements to support the same number of ports and
>>>> configuration
>>>> as the shared port model.
>>>>
>>>> This is considered a blocking factor for current deployments of OVS
>>>> when upgrading to future OVS releases as a user may have to redimension
>>>> memory for the same deployment configuration. This may not be
>>>> possible for
>>>> users.
>>>>
>>>> This commit resolves the issue by re-introducing shared mempools as
>>>> the default memory behaviour in OVS DPDK but also refactors the memory
>>>> configuration code to allow for per port mempools.
>>>>
>>>> This patch adds a new global config option, per-port-mp, that
>>>
>>> s/per-port-mp/per-port-memory/
>>
>> Apologies, I spotted a few of these 'per-port-mp' myself, I believe it's
>> in the documentation as well. I have them fixed for the next revision.
>>
>>>
>>>> controls the enablement of per port mempools for DPDK devices.
>>>>
>>>>       ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>>>
>>>> This value defaults to false; to enable per port memory support,
>>>> this field should be set to true when setting other global parameters
>>>> on init (such as "dpdk-socket-mem", for example). Changing the value at
>>>> runtime is not supported, and requires restarting the vswitch
>>>> daemon.
>>>>
>>>> The mempool sweep functionality is also replaced with the
>>>> sweep functionality from OVS 2.9 found in commits
>>>>
>>>> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
>>>> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>>>>
>>>> As this patch is RFC there are a number of TO-DOs including adding a
>>>> memory calculation section to the documentation for both models. This is
>>>> expected to be completed in the v1 after RFC.
>>>>
>>>> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
>>>>
>>>> ---
>>>> v1 -> v2
>>>> * Rebase to head of master.
>>>> * Change global config option 'per-port-mp-enabled' to
>>>> 'per-port-memory'.
>>>>     in commit message & documentation and code.
>>>> * Specify in documentation that restart of daemon is only required if
>>>> per
>>>>     port-port-memory is set after dpdk-init=true.
>>>> * Return comment 'Contains all 'struct dpdk_mp's.' which was lost in
>>>>     previous refactor.
>>>> * Improve comments regarding unique mempool names in the shared mempool
>>>>     usecase.
>>>> * Check per_port_mp condition first when verifying if new mempool is
>>>>     required in netdev_dpdk_mempool_configure() for the shared mempool
>>>>     usecase.
>>>> * Correctly return ret variable instead of 0 for function
>>>>     netdev_dpdk_mempool_configure().
>>>> * Move VLOG_ERR regarding failure to create mempool for a device
>>>> prior to
>>>>     dpdk_mp_create() returns.
>>>> * Add VLOG_DBG message flagging that the number of mbufs could not be
>>>>     allocated and that it will be retried with half that amount.
>>>> * Fix indentation of VLOG_ERR in netdev_dpdk_mempool_configure().
>>>> * Handle EEXIST case for per port memory in function dpdk_mp_get() to
>>>>     avoid duplication of dpdk_mps, correctly set refcount and return
>>>>     correct dpdk_mp for the device to use.
>>>> ---
>>>>    Documentation/automake.mk            |   1 +
>>>>    Documentation/topics/dpdk/index.rst  |   1 +
>>>>    Documentation/topics/dpdk/memory.rst |  67 ++++++++
>>>>    NEWS                                 |   1 +
>>>>    lib/dpdk-stub.c                      |   6 +
>>>>    lib/dpdk.c                           |  12 ++
>>>>    lib/dpdk.h                           |   1 +
>>>>    lib/netdev-dpdk.c                    | 298
>>>> +++++++++++++++++++++++------------
>>>>    vswitchd/vswitch.xml                 |  17 ++
>>>>    9 files changed, 304 insertions(+), 100 deletions(-)
>>>>    create mode 100644 Documentation/topics/dpdk/memory.rst
>>>>
>>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>>> index 2202df4..141b33d 100644
>>>> --- a/Documentation/automake.mk
>>>> +++ b/Documentation/automake.mk
>>>> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>>>>        Documentation/topics/dpdk/index.rst \
>>>>        Documentation/topics/dpdk/bridge.rst \
>>>>        Documentation/topics/dpdk/jumbo-frames.rst \
>>>> +    Documentation/topics/dpdk/memory.rst \
>>>>        Documentation/topics/dpdk/pdump.rst \
>>>>        Documentation/topics/dpdk/phy.rst \
>>>>        Documentation/topics/dpdk/pmd.rst \
>>>> diff --git a/Documentation/topics/dpdk/index.rst
>>>> b/Documentation/topics/dpdk/index.rst
>>>> index 181f61a..cf24a7b 100644
>>>> --- a/Documentation/topics/dpdk/index.rst
>>>> +++ b/Documentation/topics/dpdk/index.rst
>>>> @@ -40,3 +40,4 @@ The DPDK Datapath
>>>>       /topics/dpdk/qos
>>>>       /topics/dpdk/pdump
>>>>       /topics/dpdk/jumbo-frames
>>>> +   /topics/dpdk/memory
>>>> diff --git a/Documentation/topics/dpdk/memory.rst
>>>> b/Documentation/topics/dpdk/memory.rst
>>>> new file mode 100644
>>>> index 0000000..7c00f0f
>>>> --- /dev/null
>>>> +++ b/Documentation/topics/dpdk/memory.rst
>>>> @@ -0,0 +1,67 @@
>>>> +..
>>>> +        Copyright 2018, Intel, Inc.
>>>> +
>>>> +      Licensed under the Apache License, Version 2.0 (the
>>>> "License"); you may
>>>> +      not use this file except in compliance with the License. You
>>>> may obtain
>>>> +      a copy of the License at
>>>> +
>>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>>> +
>>>> +      Unless required by applicable law or agreed to in writing,
>>>> software
>>>> +      distributed under the License is distributed on an "AS IS"
>>>> BASIS, WITHOUT
>>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>> implied. See the
>>>> +      License for the specific language governing permissions and
>>>> limitations
>>>> +      under the License.
>>>> +
>>>> +      Convention for heading levels in Open vSwitch documentation:
>>>> +
>>>> +      =======  Heading 0 (reserved for the title in a document)
>>>> +      -------  Heading 1
>>>> +      ~~~~~~~  Heading 2
>>>> +      +++++++  Heading 3
>>>> +      '''''''  Heading 4
>>>> +
>>>> +      Avoid deeper levels because they do not render well.
>>>> +
>>>> +=========================
>>>> +DPDK Device Memory Models
>>>> +=========================
>>>> +
>>>> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
>>>> +shared mempools or per port mempools. The specifics of both are
>>>> detailed
>>>> +below.
>>>> +
>>>> +Shared Mempools
>>>> +---------------
>>>> +
>>>> +By Default OVS DPDK uses a shared mempool model. This means that
>>>> multiple
>>>> +ports can share the same mempool. For example when a port is added
>>>> it will
>>>> +have a given MTU and socket ID associated with it. If a mempool has
>>>> been
>>>> +created previously for an existing port that has the same MTU and
>>>> socket ID,
>>>> +that mempool is used for both ports. If there is no existing mempool
>>>> +supporting these parameters then a new mempool is created.
>>>> +
>>>> +Per Port Mempools
>>>> +-----------------
>>>> +
>>>> +In the per port mempool model, mempools are created per device and
>>>> are not
>>>> +shared. The benefit of this is a more transparent memory model where
>>>> mempools
>>>> +will not be exhausted by other DPDK devices. However this comes at a
>>>> potential
>>>> +increase in cost for memory dimensioning for a given deployment.
>>>> Users should
>>>> +be aware of the memory requirements for their deployment before
>>>> using this
>>>> +model and allocate the required hugepage memory.
>>>> +
>>>> +Per port mempool support may be enabled via a global config value,
>>>> +```per-port-mp-enabled```. Setting this to true enables the per port
>>>> mempool
>>>> +model for all DPDK devices in OVS::
>>>> +
>>>> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>>> +
>>>> +.. important::
>>>> +
>>>> +    This value should be set before setting dpdk-init=true. If set
>>>> after
>>>> +    dpdk-init=true then the daemon must be restarted to use
>>>> per-port-memory.
>>>> +
>>>> +.. todo::
>>>> +
>>>> +   Add memory calculation section for both mempool models.
>>>> diff --git a/NEWS b/NEWS
>>>> index 484c6dc..eff6cf1 100644
>>>> --- a/NEWS
>>>> +++ b/NEWS
>>>> @@ -33,6 +33,7 @@ Post-v2.9.0
>>>>           See Testing topic for the details.
>>>>         * Add LSC interrupt support for DPDK physical devices.
>>>>         * Allow init to fail and record DPDK status/version in OVS
>>>> database.
>>>> +     * Support both shared and per port mempools for DPDK devices.
>>>>       - Userspace datapath:
>>>>         * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a
>>>> single PMD
>>>>         * Detailed PMD performance metrics available with new command
>>>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>>>> index 1df1c58..fd3fe21 100644
>>>> --- a/lib/dpdk-stub.c
>>>> +++ b/lib/dpdk-stub.c
>>>> @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void)
>>>>        return false;
>>>>    }
>>>>    +bool
>>>> +dpdk_per_port_mempool_enabled(void)
>>>> +{
>>>> +    return false;
>>>> +}
>>>> +
>>>>    void
>>>>    print_dpdk_version(void)
>>>>    {
>>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>>> index 09afd8c..6a02a17 100644
>>>> --- a/lib/dpdk.c
>>>> +++ b/lib/dpdk.c
>>>> @@ -47,6 +47,7 @@ static char *vhost_sock_dir = NULL;   /* Location
>>>> of vhost-user sockets */
>>>>    static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU
>>>> support */
>>>>    static bool dpdk_initialized = false; /* Indicates successful
>>>> initialization
>>>>                                           * of DPDK. */
>>>> +static bool per_port_memory = false; /* Status of per port memory
>>>> support */
>>>>      static int
>>>>    process_vhost_flags(char *flag, const char *default_val, int size,
>>>> @@ -358,6 +359,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>>        VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>>>                   vhost_iommu_enabled ? "enabled" : "disabled");
>>>>    +    per_port_memory = smap_get_bool(ovs_other_config,
>>>> +                                    "per-port-memory", false);
>>>> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
>>>> +              per_port_memory ? "enabled" : "disabled");
>>>> +
>>>>        argv = grow_argv(&argv, 0, 1);
>>>>        argc = 1;
>>>>        argv[0] = xstrdup(ovs_get_program_name());
>>>> @@ -515,6 +521,12 @@ dpdk_vhost_iommu_enabled(void)
>>>>        return vhost_iommu_enabled;
>>>>    }
>>>>    +bool
>>>> +dpdk_per_port_memory(void)
>>>> +{
>>>> +    return per_port_memory;
>>>> +}
>>>> +
>>>>    void
>>>>    dpdk_set_lcore_id(unsigned cpu)
>>>>    {
>>>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>>>> index efdaa63..bbb89d4 100644
>>>> --- a/lib/dpdk.h
>>>> +++ b/lib/dpdk.h
>>>> @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>>>>    void dpdk_set_lcore_id(unsigned cpu);
>>>>    const char *dpdk_get_vhost_sock_dir(void);
>>>>    bool dpdk_vhost_iommu_enabled(void);
>>>> +bool dpdk_per_port_memory(void);
>>>>    void print_dpdk_version(void);
>>>>    void dpdk_status(const struct ovsrec_open_vswitch *);
>>>>    #endif /* dpdk.h */
>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>> index 2e2f568..2c7ff81 100644
>>>> --- a/lib/netdev-dpdk.c
>>>> +++ b/lib/netdev-dpdk.c
>>>> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl =
>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>    #define NETDEV_DPDK_MBUF_ALIGN      1024
>>>>    #define NETDEV_DPDK_MAX_PKT_LEN     9728
>>>>    -/* Min number of packets in the mempool.  OVS tries to allocate a
>>>> mempool with
>>>> - * roughly estimated number of mbufs: if this fails (because the
>>>> system doesn't
>>>> - * have enough hugepages) we keep halving the number until the
>>>> allocation
>>>> - * succeeds or we reach MIN_NB_MBUF */
>>>> +/* Max and min number of packets in the mempool.  OVS tries to
>>>> allocate a
>>>> + * mempool with MAX_NB_MBUF: if this fails (because the system
>>>> doesn't have
>>>> + * enough hugepages) we keep halving the number until the allocation
>>>> succeeds
>>>> + * or we reach MIN_NB_MBUF */
>>>> +
>>>> +#define MAX_NB_MBUF          (4096 * 64)
>>>>    #define MIN_NB_MBUF          (4096 * 4)
>>>>    #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>    +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF /
>>>> MIN_NB_MBUF)
>>>> +                  == 0);
>>>> +
>>>> +/* The smallest possible NB_MBUF that we're going to try should be a
>>>> multiple
>>>> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
>>>> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF /
>>>> MIN_NB_MBUF))
>>>> +                  % MP_CACHE_SZ == 0);
>>>> +
>>>>    /*
>>>>     * DPDK XSTATS Counter names definition
>>>>     */
>>>> @@ -297,12 +308,14 @@ static struct ovs_mutex dpdk_mp_mutex
>>>> OVS_ACQ_AFTER(dpdk_mutex)
>>>>        = OVS_MUTEX_INITIALIZER;
>>>>      /* Contains all 'struct dpdk_mp's. */
>>>> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
>>>> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
>>>> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
>>>> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>>>>    -/* Wrapper for a mempool released but not yet freed. */
>>>>    struct dpdk_mp {
>>>>         struct rte_mempool *mp;
>>>> +     int mtu;
>>>> +     int socket_id;
>>>> +     int refcount;
>>>>         struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>>>>     };
>>>>    @@ -383,7 +396,7 @@ struct netdev_dpdk {
>>>>          PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>>            struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>>>> -        struct rte_mempool *mp;
>>>> +        struct dpdk_mp *dpdk_mp;
>>>>              /* virtio identifier for vhost devices */
>>>>            ovsrcu_index vid;
>>>> @@ -544,73 +557,95 @@ dpdk_mp_full(const struct rte_mempool *mp)
>>>> OVS_REQUIRES(dpdk_mp_mutex)
>>>>         * likely be ok, as there are additional checks during mempool
>>>>         * freeing but it would make things racey.
>>>>         */
>>>> -    return rte_mempool_full(mp);
>>>> +    rte_mempool_full(mp);
>>>> +    return 0;
>>>
>>> I don't think you intended this? unless you are in the business of
>>> selling memory ;-)
>>
>> Ooops :-/ , This is left over from testing when mbufs are still in use,
>> not sure how I missed it. Will correct for v3.
>>
> 
> Maybe PATCH v1 ? or you are still testing out?

I was holding back in case there was any major gotchas on the v2, I also 
have the documentation completed for memory calculation so make sense no 
to go to v1.
> 
>>>
>>>>    }
>>>>      /* Free unused mempools. */
>>>>    static void
>>>> -dpdk_mp_sweep(void)
>>>> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>>>    {
>>>>        struct dpdk_mp *dmp, *next;
>>>>    -    ovs_mutex_lock(&dpdk_mp_mutex);
>>>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>>>> -        if (dpdk_mp_full(dmp->mp)) {
>>>> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
>>>> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>>>>                VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>>>>                ovs_list_remove(&dmp->list_node);
>>>>                rte_mempool_free(dmp->mp);
>>>>                rte_free(dmp);
>>>>            }
>>>>        }
>>>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>>>>    }
>>>>    -/* Ensure a mempool will not be freed. */
>>>> -static void
>>>> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>>>> +/* Calculating the required number of mbufs differs depending on the
>>>> + * mempool model being used. Check if per port mempools are in use
>>>> before
>>>> + * calculating.
>>>> + */
>>>> +static uint32_t
>>>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool
>>>> per_port_mp)
>>>>    {
>>>> -    struct dpdk_mp *dmp, *next;
>>>> +    uint32_t n_mbufs;
>>>>    -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>>>> -        if (dmp->mp == mp) {
>>>> -            VLOG_DBG("Removing mempool \"%s\" from free list",
>>>> dmp->mp->name);
>>>> -            ovs_list_remove(&dmp->list_node);
>>>> -            rte_free(dmp);
>>>> -            break;
>>>> +    if (!per_port_mp) {
>>>> +        /* Shared mempools are being used.
>>>> +         * XXX: this is a really rough method of provisioning memory.
>>>> +         * It's impossible to determine what the exact memory
>>>> requirements are
>>>> +         * when the number of ports and rxqs that utilize a
>>>> particular mempool
>>>> +         * can change dynamically at runtime. For now, use this rough
>>>> +         * heurisitic.
>>>> +         */
>>>> +        if (mtu >= ETHER_MTU) {
>>>> +            n_mbufs = MAX_NB_MBUF;
>>>> +        } else {
>>>> +            n_mbufs = MIN_NB_MBUF;
>>>>            }
>>>> +    } else {
>>>> +        /* Per port mempools are being used.
>>>> +         * XXX: rough estimation of number of mbufs required for
>>>> this port:
>>>> +         * <packets required to fill the device rxqs>
>>>> +         * + <packets that could be stuck on other ports txqs>
>>>> +         * + <packets in the pmd threads>
>>>> +         * + <additional memory for corner cases>
>>>> +         */
>>>> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>>>> +                  + dev->requested_n_txq * dev->requested_txq_size
>>>> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
>>>> NETDEV_MAX_BURST
>>>> +                  + MIN_NB_MBUF;
>>>>        }
>>>> +
>>>> +    return n_mbufs;
>>>>    }
>>>>    -/* Returns a valid pointer when either of the following is true:
>>>> - *  - a new mempool was just created;
>>>> - *  - a matching mempool already exists. */
>>>> -static struct rte_mempool *
>>>> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>> +static struct dpdk_mp *
>>>> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>>    {
>>>>        char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>        const char *netdev_name = netdev_get_name(&dev->up);
>>>>        int socket_id = dev->requested_socket_id;
>>>>        uint32_t n_mbufs;
>>>>        uint32_t hash = hash_string(netdev_name, 0);
>>>> -    struct rte_mempool *mp = NULL;
>>>> -
>>>> -    /*
>>>> -     * XXX: rough estimation of number of mbufs required for this port:
>>>> -     * <packets required to fill the device rxqs>
>>>> -     * + <packets that could be stuck on other ports txqs>
>>>> -     * + <packets in the pmd threads>
>>>> -     * + <additional memory for corner cases>
>>>> -     */
>>>> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>>>> -              + dev->requested_n_txq * dev->requested_txq_size
>>>> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
>>>> NETDEV_MAX_BURST
>>>> -              + MIN_NB_MBUF;
>>>> +    struct dpdk_mp *dmp = NULL;
>>>> +    int ret;
>>>> +
>>>> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
>>>> +    if (!dmp) {
>>>> +        return NULL;
>>>> +    }
>>>> +    dmp->socket_id = socket_id;
>>>> +    dmp->mtu = mtu;
>>>> +    dmp->refcount = 1;
>>>> +
>>>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>>>    -    ovs_mutex_lock(&dpdk_mp_mutex);
>>>>        do {
>>>>            /* Full DPDK memory pool name must be unique and cannot be
>>>> -         * longer than RTE_MEMPOOL_NAMESIZE. */
>>>> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>>> +         * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
>>>> +         * mempool case this can result in one device using a mempool
>>>> +         * which references a different device in it's name. However as
>>>> +         * mempool names are hashed, the device name will not be
>>>> readable
>>>> +         * so this is not an issue for tasks such as debugging.
>>>> +         */
>>>> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>>>                               "ovs%08x%02d%05d%07u",
>>>>                               hash, socket_id, mtu, n_mbufs);
>>>>            if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>>>> @@ -626,96 +661,158 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>>                      netdev_name, n_mbufs, socket_id,
>>>>                      dev->requested_n_rxq, dev->requested_n_txq);
>>>>    -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>>>> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
>>>> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
>>>> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>>>> +                                          MP_CACHE_SZ,
>>>> +                                          sizeof (struct dp_packet)
>>>> +                                          - sizeof (struct rte_mbuf),
>>>> +                                          MBUF_SIZE(mtu)
>>>> +                                          - sizeof(struct dp_packet),
>>>> +                                          socket_id);
>>>>    -        if (mp) {
>>>> +        if (dmp->mp) {
>>>>                VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>>>>                         mp_name, n_mbufs);
>>>>                /* rte_pktmbuf_pool_create has done some initialization
>>>> of the
>>>> -             * rte_mbuf part of each dp_packet. Some OvS specific
>>>> fields
>>>> -             * of the packet still need to be initialized by
>>>> -             * ovs_rte_pktmbuf_init. */
>>>> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
>>>> +             * rte_mbuf part of each dp_packet, while
>>>> ovs_rte_pktmbuf_init
>>>> +             * initializes some OVS specific fields of dp_packet.
>>>> +             */
>>>> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
>>>> +            return dmp;
>>>>            } else if (rte_errno == EEXIST) {
>>>>                /* A mempool with the same name already exists.  We just
>>>>                 * retrieve its pointer to be returned to the caller. */
>>>> -            mp = rte_mempool_lookup(mp_name);
>>>> +            dmp->mp = rte_mempool_lookup(mp_name);
>>>>                /* As the mempool create returned EEXIST we can expect the
>>>>                 * lookup has returned a valid pointer.  If for some
>>>> reason
>>>>                 * that's not the case we keep track of it. */
>>>>                VLOG_DBG("A mempool with name \"%s\" already exists at
>>>> %p.",
>>>> -                     mp_name, mp);
>>>> -            /* Ensure this reused mempool will not be freed. */
>>>> -            dpdk_mp_do_not_free(mp);
>>>> +                     mp_name, dmp->mp);
>>>> +            return dmp;
>>>>            } else {
>>>> -            VLOG_ERR("Failed mempool \"%s\" create request of %u
>>>> mbufs",
>>>> -                     mp_name, n_mbufs);
>>>> +            VLOG_DBG("Failed to create mempool \"%s\" with a request
>>>> of "
>>>> +                     "%u mbufs, retrying with %u mbufs",
>>>> +                     mp_name, n_mbufs, n_mbufs/2);
>>>>            }
>>>> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
>>>> MIN_NB_MBUF);
>>>> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
>>>> MIN_NB_MBUF);
>>>>    -    ovs_mutex_unlock(&dpdk_mp_mutex);
>>>> -    return mp;
>>>> +    VLOG_ERR("Failed to create mempool \"%s\" with a request of %u
>>>> mbufs",
>>>> +             mp_name, n_mbufs);
>>>> +
>>>> +    rte_free(dmp);
>>>> +    return NULL;
>>>>    }
>>>>    -/* Release an existing mempool. */
>>>> -static void
>>>> -dpdk_mp_release(struct rte_mempool *mp)
>>>> +static struct dpdk_mp *
>>>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>>    {
>>>> -    if (!mp) {
>>>> -        return;
>>>> -    }
>>>> +    struct dpdk_mp *dmp, *next;
>>>> +    bool reuse = false;
>>>>          ovs_mutex_lock(&dpdk_mp_mutex);
>>>> -    if (dpdk_mp_full(mp)) {
>>>> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>>>> -        rte_mempool_free(mp);
>>>> -    } else {
>>>> -        struct dpdk_mp *dmp;p
>>>> +    /* Check if shared mempools are being used, if so check existing
>>>> mempools
>>>> +     * to see if reuse is possible. */
>>>> +    if (!per_port_mp) {
>>>> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>> +            if (dmp->socket_id == dev->requested_socket_id
>>>> +                && dmp->mtu == mtu) {
>>>> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>>>> +                dmp->refcount++;
>>>> +                reuse = true;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +    /* Sweep mempools after reuse or before create. */
>>>> +    dpdk_mp_sweep();
>>>>    -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
>>>> +    if (!reuse) {
>>>> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>>>>            if (dmp) {
>>>> -            dmp->mp = mp;
>>>> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
>>>> +            /* We need to check for the EEXIST case for per port
>>>> memory.
>>>
>>> I'm pretty sure this also applies to the shared mempool case, because
>>> you have rounded the mtu with dpdk_buf_size() so the requested_mtu might
>>> be different but the rounded value that is used for the mempool name
>>> might be the same, meaning that you can get an EEXISTS.
>>>
>>
>> Maybe I misunderstood but I would think in that case it will be caught
>> by the 'reuse' code above? If the rounded value of a new ports
>> requested_mtu is equal to that of an existing mempool mtu and its on the
>> same socket it will be reused, you would not get to dpdk_mp_create() as
>> the reuse condition will be true.
>>
> 
> Yes - you are right, it won't occur. Maybe you could add a small comment
> to say something like "Shared mempools will have reused and not
> requested a mempool that already exists."

Can do.

Ian
> 
>> I had tested this with 2 ports, one with a requested mtu of 1500 and a
>> second with a requested mtu of 1800. Both are rounded to 2030 and the
>> reuse case was hit. Do you think this wont always be the case though?
>>
> 
> No, it's right - also a good test to make sure that the mtu is updated
> correctly in that case :-)
> 
>>>> +             * Compare the mempool returned by dmp to each entry in
>>>> +             * dpdk_mp_list. If a match is found, free dmp as a new
>>>> entry
>>>> +             * is not required, set dmp to point to the existing
>>>> +             * entry and set refcount to 1 to avoid being freed at a
>>>> later
>>>> +             * stage.
>>>> +             */
>>>> +            if (per_port_mp && rte_errno == EEXIST) {
>>
>> If it's the case that shared_mempool will always reuse we may not need
>> the per_port_mp in the condition above.
>>
> 
> Probably it's ok either way, but might be better to leave it there,
> otherwise you'll have to think about where rte_errno is being set.
>  >>>> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>>>> +                    if (dmp->mp == next->mp) {
>>>> +                        rte_free(dmp);
>>>> +                        dmp = next;
>>>> +                        dmp->refcount = 1;
>>>> +                    }
>>>> +                }
>>>> +            }
>>>> +            else {
>>>> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>>> +            }
>>>
>>> I think you need to increment refcount and use the safe list option. How
>>> about
>>
>> I spotted the other mails and responded there also, my plan is to change
>> to increment rather than = 1. but will keep the rte_free(dmp) and dmp =
>> next if that sounds ok?
>>
> 
> sounds good, thanks.
> 
>>>
>>> if (rte_errno == EEXIST) {
>>>       LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
>>>           if (dmp->mp == next->mp) {
>>>               next->refcount++;
>>>               rte_free(dmp);
>>>               break;
>>>           }
>>>       }
>>> } else {
>>>       dmp->refcount++;
>>>       ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>> }
>>>
>>>>            }
>>>>        }
>>>> +
>>>>        ovs_mutex_unlock(&dpdk_mp_mutex);
>>>> +
>>>> +    return dmp;
>>>>    }
>>>>    -/* Tries to allocate a new mempool - or re-use an existing one where
>>>> - * appropriate - on requested_socket_id with a size determined by
>>>> - * requested_mtu and requested Rx/Tx queues.
>>>> - * On success - or when re-using an existing mempool - the new
>>>> configuration
>>>> - * will be applied.
>>>> +/* Decrement reference to a mempool. */
>>>> +static void
>>>> +dpdk_mp_put(struct dpdk_mp *dmp)
>>>> +{
>>>> +    if (!dmp) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    ovs_mutex_lock(&dpdk_mp_mutex);
>>>> +    ovs_assert(dmp->refcount);
>>>> +    dmp->refcount--;
>>>> +    ovs_mutex_unlock(&dpdk_mp_mutex);
>>>> +}
>>>> +
>>>> +/* Depending on the memory model being used this function tries
>>>> + * identify and reuse an existing mempool or tries to allocate new
>>>> + * mempool on requested_socket_id with mbuf size corresponding to
>>>> + * requested_mtu. On success new configuration will be applied.
>>>>     * On error, device will be left unchanged. */
>>>>    static int
>>>>    netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>>        OVS_REQUIRES(dev->mutex)
>>>>    {
>>>>        uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>>>> -    struct rte_mempool *mp;
>>>> +    struct dpdk_mp *mp;
>>>
>>> I think it should have a different name now, as it leads to mp->mp
>>> later. It would be good if you could make is consistent with the other
>>> functions (dmp?)
>>
>> Sure,  will do.
>>>
>>>>        int ret = 0;
>>>> +    bool per_port_mp = dpdk_per_port_memory();
>>>>    -    dpdk_mp_sweep();
>>>> +    /* With shared mempools we do not need to configure a mempool if
>>>> the MTU
>>>> +     * and socket ID have not changed, the previous configuration is
>>>> still
>>>> +     * valid so return 0 */
>>>> +    if (!per_port_mp && dev->mtu == dev->requested_mtu
>>>> +        && dev->socket_id == dev->requested_socket_id) {
>>>> +        return ret;
>>>> +    }
>>>>    -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>>>> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>>>>        if (!mp) {
>>>>            VLOG_ERR("Failed to create memory pool for netdev "
>>>>                     "%s, with MTU %d on socket %d: %s\n",
>>>>                     dev->up.name, dev->requested_mtu,
>>>> dev->requested_socket_id,
>>>>                     rte_strerror(rte_errno));
>>>> -        ret = rte_errno;
>>>> +        return rte_errno;
>>>>        } else {
>>>> -        /* If a new MTU was requested and its rounded value equals
>>>> the one
>>>> -         * that is currently used, then the existing mempool is
>>>> returned. */
>>>> -        if (dev->mp != mp) {
>>>> -            /* A new mempool was created, release the previous one. */
>>>> -            dpdk_mp_release(dev->mp);
>>>> -        } else {
>>>> -            ret = EEXIST;
>>>> +        /* Check for any pre-existing dpdk_mp for the device before
>>>> accessing
>>>> +         * the associated mempool.
>>>> +         */
>>>> +        if (dev->dpdk_mp != NULL) {
>>>> +            /* If a new MTU was requested and its rounded value does
>>>> not
>>>> +             * equal the rounded value of the current mempool,
>>>> decrement
>>>> +             * the reference count to devices current dpdk_mp as its
>>>> +             * associated mempool will not be used for this device.
>>>> +             */
>>>> +            if (dev->dpdk_mp->mp != mp->mp) {
>>>      
>>> Not sure if check is needed. If it were to be false, wouldn't we have
>>> already reused or got an EEXISTS and handled that case?
>>>
>>
>> Yes, your right, it can be removed.
>>
>>>> +                dpdk_mp_put(dev->dpdk_mp);
>>>> +            }
>>>>            }
>>>> -        dev->mp = mp;
>>>> +        dev->dpdk_mp = mp;
>>>>            dev->mtu = dev->requested_mtu;
>>>>            dev->socket_id = dev->requested_socket_id;
>>>>            dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>> @@ -838,7 +935,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev,
>>>> int n_rxq, int n_txq)
>>>>              for (i = 0; i < n_rxq; i++) {
>>>>                diag = rte_eth_rx_queue_setup(dev->port_id, i,
>>>> dev->rxq_size,
>>>> -                                          dev->socket_id, NULL,
>>>> dev->mp);
>>>> +                                          dev->socket_id, NULL,
>>>> +                                          dev->dpdk_mp->mp);
>>>>                if (diag) {
>>>>                    VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>>>>                              dev->up.name, i, rte_strerror(-diag));
>>>> @@ -926,7 +1024,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>>>        memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>>>>        rte_eth_link_get_nowait(dev->port_id, &dev->link);
>>>>    -    mbp_priv = rte_mempool_get_priv(dev->mp);
>>>> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>>>>        dev->buf_size = mbp_priv->mbuf_data_room_size -
>>>> RTE_PKTMBUF_HEADROOM;
>>>>          /* Get the Flow control configuration for DPDK-ETH */
>>>> @@ -1180,7 +1278,7 @@ common_destruct(struct netdev_dpdk *dev)
>>>>        OVS_EXCLUDED(dev->mutex)
>>>>    {
>>>>        rte_free(dev->tx_q);
>>>> -    dpdk_mp_release(dev->mp);
>>>> +    dpdk_mp_put(dev->dpdk_mp);
>>>>          ovs_list_remove(&dev->list_node);
>>>>        free(ovsrcu_get_protected(struct ingress_policer *,
>>>> @@ -1933,7 +2031,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>>>>            return EAGAIN;
>>>>        }
>>>>    -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>>>> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>>>>                                        (struct rte_mbuf **)
>>>> batch->packets,
>>>>                                        NETDEV_MAX_BURST);
>>>>        if (!nb_rx) {
>>>> @@ -2172,7 +2270,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
>>>> struct dp_packet_batch *batch)
>>>>                continue;
>>>>            }
>>>>    -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
>>>> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>>>>            if (OVS_UNLIKELY(!pkts[txcnt])) {
>>>>                dropped += cnt - i;
>>>>                break;
>>>> @@ -3052,7 +3150,7 @@ netdev_dpdk_get_mempool_info(struct
>>>> unixctl_conn *conn,
>>>>            ovs_mutex_lock(&dev->mutex);
>>>>            ovs_mutex_lock(&dpdk_mp_mutex);
>>>>    -        rte_mempool_dump(stream, dev->mp);
>>>> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>>>>              ovs_mutex_unlock(&dpdk_mp_mutex);
>>>>            ovs_mutex_unlock(&dev->mutex);
>>>> @@ -3749,7 +3847,7 @@ dpdk_vhost_reconfigure_helper(struct
>>>> netdev_dpdk *dev)
>>>>          err = netdev_dpdk_mempool_configure(dev);
>>>>        if (!err) {
>>>> -        /* A new mempool was created. */
>>>> +        /* A new mempool was created or re-used. */
>>>>            netdev_change_seq_changed(&dev->up);
>>>>        } else if (err != EEXIST){
>>>>            return err;
>>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>>> index 1e27a02..dbf70ac 100644
>>>> --- a/vswitchd/vswitch.xml
>>>> +++ b/vswitchd/vswitch.xml
>>>> @@ -359,6 +359,23 @@
>>>>            </p>
>>>>          </column>
>>>>    +      <column name="other_config" key="per-port-memory"
>>>> +              type='{"type": "boolean"}'>
>>>> +        <p>
>>>> +          By default OVS DPDK uses a shared mempool memory model
>>>> wherein
>>>> +          devices that have the same MTU and socket values can share
>>>> the same
>>>> +          mempool. Setting this value to <code>true</code> changes this
>>>> +          behaviour. Per port memory allow DPDK devices to use private
>>>> +          memory per device. This can provide greater transapraency as
>>>> +          regards memory usage but potentially at the cost of
>>>> greater memory
>>>> +          requirements.
>>>> +        </p>
>>>> +        <p>
>>>> +          Changing this value requires restarting the daemon if
>>>> dpdk-init has
>>>> +          already been set to true.
>>>> +        </p>
>>>> +      </column>
>>>> +
>>>>          <column name="other_config" key="tx-flush-interval"
>>>>                  type='{"type": "integer",
>>>>                         "minInteger": 0, "maxInteger": 1000000}'>
>>>>
>>>
>>
>

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 2202df4..141b33d 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -36,6 +36,7 @@  DOC_SOURCE = \
 	Documentation/topics/dpdk/index.rst \
 	Documentation/topics/dpdk/bridge.rst \
 	Documentation/topics/dpdk/jumbo-frames.rst \
+	Documentation/topics/dpdk/memory.rst \
 	Documentation/topics/dpdk/pdump.rst \
 	Documentation/topics/dpdk/phy.rst \
 	Documentation/topics/dpdk/pmd.rst \
diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
index 181f61a..cf24a7b 100644
--- a/Documentation/topics/dpdk/index.rst
+++ b/Documentation/topics/dpdk/index.rst
@@ -40,3 +40,4 @@  The DPDK Datapath
    /topics/dpdk/qos
    /topics/dpdk/pdump
    /topics/dpdk/jumbo-frames
+   /topics/dpdk/memory
diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
new file mode 100644
index 0000000..7c00f0f
--- /dev/null
+++ b/Documentation/topics/dpdk/memory.rst
@@ -0,0 +1,67 @@ 
+..
+        Copyright 2018, Intel, Inc.
+
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+=========================
+DPDK Device Memory Models
+=========================
+
+DPDK device memory can be allocated in one of two ways in OVS DPDK,
+shared mempools or per port mempools. The specifics of both are detailed
+below.
+
+Shared Mempools
+---------------
+
+By Default OVS DPDK uses a shared mempool model. This means that multiple
+ports can share the same mempool. For example when a port is added it will
+have a given MTU and socket ID associated with it. If a mempool has been
+created previously for an existing port that has the same MTU and socket ID,
+that mempool is used for both ports. If there is no existing mempool
+supporting these parameters then a new mempool is created.
+
+Per Port Mempools
+-----------------
+
+In the per port mempool model, mempools are created per device and are not
+shared. The benefit of this is a more transparent memory model where mempools
+will not be exhausted by other DPDK devices. However this comes at a potential
+increase in cost for memory dimensioning for a given deployment. Users should
+be aware of the memory requirements for their deployment before using this
+model and allocate the required hugepage memory.
+
+Per port mempool support may be enabled via a global config value,
+```per-port-mp-enabled```. Setting this to true enables the per port mempool
+model for all DPDK devices in OVS::
+
+   $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
+
+.. important::
+
+    This value should be set before setting dpdk-init=true. If set after
+    dpdk-init=true then the daemon must be restarted to use per-port-memory.
+
+.. todo::
+
+   Add memory calculation section for both mempool models.
diff --git a/NEWS b/NEWS
index 484c6dc..eff6cf1 100644
--- a/NEWS
+++ b/NEWS
@@ -33,6 +33,7 @@  Post-v2.9.0
        See Testing topic for the details.
      * Add LSC interrupt support for DPDK physical devices.
      * Allow init to fail and record DPDK status/version in OVS database.
+     * Support both shared and per port mempools for DPDK devices.
    - Userspace datapath:
      * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
      * Detailed PMD performance metrics available with new command
diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
index 1df1c58..fd3fe21 100644
--- a/lib/dpdk-stub.c
+++ b/lib/dpdk-stub.c
@@ -56,6 +56,12 @@  dpdk_vhost_iommu_enabled(void)
     return false;
 }
 
+bool
+dpdk_per_port_mempool_enabled(void)
+{
+    return false;
+}
+
 void
 print_dpdk_version(void)
 {
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 09afd8c..6a02a17 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -47,6 +47,7 @@  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
 static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
 static bool dpdk_initialized = false; /* Indicates successful initialization
                                        * of DPDK. */
+static bool per_port_memory = false; /* Status of per port memory support */
 
 static int
 process_vhost_flags(char *flag, const char *default_val, int size,
@@ -358,6 +359,11 @@  dpdk_init__(const struct smap *ovs_other_config)
     VLOG_INFO("IOMMU support for vhost-user-client %s.",
                vhost_iommu_enabled ? "enabled" : "disabled");
 
+    per_port_memory = smap_get_bool(ovs_other_config,
+                                    "per-port-memory", false);
+    VLOG_INFO("Per port mempool for DPDK devices %s.",
+              per_port_memory ? "enabled" : "disabled");
+
     argv = grow_argv(&argv, 0, 1);
     argc = 1;
     argv[0] = xstrdup(ovs_get_program_name());
@@ -515,6 +521,12 @@  dpdk_vhost_iommu_enabled(void)
     return vhost_iommu_enabled;
 }
 
+bool
+dpdk_per_port_memory(void)
+{
+    return per_port_memory;
+}
+
 void
 dpdk_set_lcore_id(unsigned cpu)
 {
diff --git a/lib/dpdk.h b/lib/dpdk.h
index efdaa63..bbb89d4 100644
--- a/lib/dpdk.h
+++ b/lib/dpdk.h
@@ -39,6 +39,7 @@  void dpdk_init(const struct smap *ovs_other_config);
 void dpdk_set_lcore_id(unsigned cpu);
 const char *dpdk_get_vhost_sock_dir(void);
 bool dpdk_vhost_iommu_enabled(void);
+bool dpdk_per_port_memory(void);
 void print_dpdk_version(void);
 void dpdk_status(const struct ovsrec_open_vswitch *);
 #endif /* dpdk.h */
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 2e2f568..2c7ff81 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -91,13 +91,24 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define NETDEV_DPDK_MBUF_ALIGN      1024
 #define NETDEV_DPDK_MAX_PKT_LEN     9728
 
-/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
- * roughly estimated number of mbufs: if this fails (because the system doesn't
- * have enough hugepages) we keep halving the number until the allocation
- * succeeds or we reach MIN_NB_MBUF */
+/* Max and min number of packets in the mempool.  OVS tries to allocate a
+ * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
+ * enough hugepages) we keep halving the number until the allocation succeeds
+ * or we reach MIN_NB_MBUF */
+
+#define MAX_NB_MBUF          (4096 * 64)
 #define MIN_NB_MBUF          (4096 * 4)
 #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
 
+/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
+BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
+                  == 0);
+
+/* The smallest possible NB_MBUF that we're going to try should be a multiple
+ * of MP_CACHE_SZ. This is advised by DPDK documentation. */
+BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
+                  % MP_CACHE_SZ == 0);
+
 /*
  * DPDK XSTATS Counter names definition
  */
@@ -297,12 +308,14 @@  static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
     = OVS_MUTEX_INITIALIZER;
 
 /* Contains all 'struct dpdk_mp's. */
-static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
-    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
+static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
+    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
 
-/* Wrapper for a mempool released but not yet freed. */
 struct dpdk_mp {
      struct rte_mempool *mp;
+     int mtu;
+     int socket_id;
+     int refcount;
      struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
  };
 
@@ -383,7 +396,7 @@  struct netdev_dpdk {
 
     PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
         struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
-        struct rte_mempool *mp;
+        struct dpdk_mp *dpdk_mp;
 
         /* virtio identifier for vhost devices */
         ovsrcu_index vid;
@@ -544,73 +557,95 @@  dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
      * likely be ok, as there are additional checks during mempool
      * freeing but it would make things racey.
      */
-    return rte_mempool_full(mp);
+    rte_mempool_full(mp);
+    return 0;
 }
 
 /* Free unused mempools. */
 static void
-dpdk_mp_sweep(void)
+dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
 {
     struct dpdk_mp *dmp, *next;
 
-    ovs_mutex_lock(&dpdk_mp_mutex);
-    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
-        if (dpdk_mp_full(dmp->mp)) {
+    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
+        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
             VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
             ovs_list_remove(&dmp->list_node);
             rte_mempool_free(dmp->mp);
             rte_free(dmp);
         }
     }
-    ovs_mutex_unlock(&dpdk_mp_mutex);
 }
 
-/* Ensure a mempool will not be freed. */
-static void
-dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
+/* Calculating the required number of mbufs differs depending on the
+ * mempool model being used. Check if per port mempools are in use before
+ * calculating.
+ */
+static uint32_t
+dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 {
-    struct dpdk_mp *dmp, *next;
+    uint32_t n_mbufs;
 
-    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
-        if (dmp->mp == mp) {
-            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
-            ovs_list_remove(&dmp->list_node);
-            rte_free(dmp);
-            break;
+    if (!per_port_mp) {
+        /* Shared mempools are being used.
+         * XXX: this is a really rough method of provisioning memory.
+         * It's impossible to determine what the exact memory requirements are
+         * when the number of ports and rxqs that utilize a particular mempool
+         * can change dynamically at runtime. For now, use this rough
+         * heurisitic.
+         */
+        if (mtu >= ETHER_MTU) {
+            n_mbufs = MAX_NB_MBUF;
+        } else {
+            n_mbufs = MIN_NB_MBUF;
         }
+    } else {
+        /* Per port mempools are being used.
+         * XXX: rough estimation of number of mbufs required for this port:
+         * <packets required to fill the device rxqs>
+         * + <packets that could be stuck on other ports txqs>
+         * + <packets in the pmd threads>
+         * + <additional memory for corner cases>
+         */
+        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
+                  + dev->requested_n_txq * dev->requested_txq_size
+                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
+                  + MIN_NB_MBUF;
     }
+
+    return n_mbufs;
 }
 
-/* Returns a valid pointer when either of the following is true:
- *  - a new mempool was just created;
- *  - a matching mempool already exists. */
-static struct rte_mempool *
-dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
+static struct dpdk_mp *
+dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 {
     char mp_name[RTE_MEMPOOL_NAMESIZE];
     const char *netdev_name = netdev_get_name(&dev->up);
     int socket_id = dev->requested_socket_id;
     uint32_t n_mbufs;
     uint32_t hash = hash_string(netdev_name, 0);
-    struct rte_mempool *mp = NULL;
-
-    /*
-     * XXX: rough estimation of number of mbufs required for this port:
-     * <packets required to fill the device rxqs>
-     * + <packets that could be stuck on other ports txqs>
-     * + <packets in the pmd threads>
-     * + <additional memory for corner cases>
-     */
-    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
-              + dev->requested_n_txq * dev->requested_txq_size
-              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
-              + MIN_NB_MBUF;
+    struct dpdk_mp *dmp = NULL;
+    int ret;
+
+    dmp = dpdk_rte_mzalloc(sizeof *dmp);
+    if (!dmp) {
+        return NULL;
+    }
+    dmp->socket_id = socket_id;
+    dmp->mtu = mtu;
+    dmp->refcount = 1;
+
+    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
 
-    ovs_mutex_lock(&dpdk_mp_mutex);
     do {
         /* Full DPDK memory pool name must be unique and cannot be
-         * longer than RTE_MEMPOOL_NAMESIZE. */
-        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
+         * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
+         * mempool case this can result in one device using a mempool
+         * which references a different device in it's name. However as
+         * mempool names are hashed, the device name will not be readable
+         * so this is not an issue for tasks such as debugging.
+         */
+        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
                            "ovs%08x%02d%05d%07u",
                            hash, socket_id, mtu, n_mbufs);
         if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
@@ -626,96 +661,158 @@  dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
                   netdev_name, n_mbufs, socket_id,
                   dev->requested_n_rxq, dev->requested_n_txq);
 
-        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
-                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
-                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
+        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
+                                          MP_CACHE_SZ,
+                                          sizeof (struct dp_packet)
+                                          - sizeof (struct rte_mbuf),
+                                          MBUF_SIZE(mtu)
+                                          - sizeof(struct dp_packet),
+                                          socket_id);
 
-        if (mp) {
+        if (dmp->mp) {
             VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
                      mp_name, n_mbufs);
             /* rte_pktmbuf_pool_create has done some initialization of the
-             * rte_mbuf part of each dp_packet. Some OvS specific fields
-             * of the packet still need to be initialized by
-             * ovs_rte_pktmbuf_init. */
-            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
+             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
+             * initializes some OVS specific fields of dp_packet.
+             */
+            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
+            return dmp;
         } else if (rte_errno == EEXIST) {
             /* A mempool with the same name already exists.  We just
              * retrieve its pointer to be returned to the caller. */
-            mp = rte_mempool_lookup(mp_name);
+            dmp->mp = rte_mempool_lookup(mp_name);
             /* As the mempool create returned EEXIST we can expect the
              * lookup has returned a valid pointer.  If for some reason
              * that's not the case we keep track of it. */
             VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
-                     mp_name, mp);
-            /* Ensure this reused mempool will not be freed. */
-            dpdk_mp_do_not_free(mp);
+                     mp_name, dmp->mp);
+            return dmp;
         } else {
-            VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
-                     mp_name, n_mbufs);
+            VLOG_DBG("Failed to create mempool \"%s\" with a request of "
+                     "%u mbufs, retrying with %u mbufs",
+                     mp_name, n_mbufs, n_mbufs/2);
         }
-    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
+    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
 
-    ovs_mutex_unlock(&dpdk_mp_mutex);
-    return mp;
+    VLOG_ERR("Failed to create mempool \"%s\" with a request of %u mbufs",
+             mp_name, n_mbufs);
+
+    rte_free(dmp);
+    return NULL;
 }
 
-/* Release an existing mempool. */
-static void
-dpdk_mp_release(struct rte_mempool *mp)
+static struct dpdk_mp *
+dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 {
-    if (!mp) {
-        return;
-    }
+    struct dpdk_mp *dmp, *next;
+    bool reuse = false;
 
     ovs_mutex_lock(&dpdk_mp_mutex);
-    if (dpdk_mp_full(mp)) {
-        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
-        rte_mempool_free(mp);
-    } else {
-        struct dpdk_mp *dmp;
+    /* Check if shared mempools are being used, if so check existing mempools
+     * to see if reuse is possible. */
+    if (!per_port_mp) {
+        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
+            if (dmp->socket_id == dev->requested_socket_id
+                && dmp->mtu == mtu) {
+                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
+                dmp->refcount++;
+                reuse = true;
+                break;
+            }
+        }
+    }
+    /* Sweep mempools after reuse or before create. */
+    dpdk_mp_sweep();
 
-        dmp = dpdk_rte_mzalloc(sizeof *dmp);
+    if (!reuse) {
+        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
         if (dmp) {
-            dmp->mp = mp;
-            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
+            /* We need to check for the EEXIST case for per port memory.
+             * Compare the mempool returned by dmp to each entry in
+             * dpdk_mp_list. If a match is found, free dmp as a new entry
+             * is not required, set dmp to point to the existing
+             * entry and set refcount to 1 to avoid being freed at a later
+             * stage.
+             */
+            if (per_port_mp && rte_errno == EEXIST) {
+                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
+                    if (dmp->mp == next->mp) {
+                        rte_free(dmp);
+                        dmp = next;
+                        dmp->refcount = 1;
+                    }
+                }
+            }
+            else {
+                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
+            }
         }
     }
+
     ovs_mutex_unlock(&dpdk_mp_mutex);
+
+    return dmp;
 }
 
-/* Tries to allocate a new mempool - or re-use an existing one where
- * appropriate - on requested_socket_id with a size determined by
- * requested_mtu and requested Rx/Tx queues.
- * On success - or when re-using an existing mempool - the new configuration
- * will be applied.
+/* Decrement reference to a mempool. */
+static void
+dpdk_mp_put(struct dpdk_mp *dmp)
+{
+    if (!dmp) {
+        return;
+    }
+
+    ovs_mutex_lock(&dpdk_mp_mutex);
+    ovs_assert(dmp->refcount);
+    dmp->refcount--;
+    ovs_mutex_unlock(&dpdk_mp_mutex);
+}
+
+/* Depending on the memory model being used this function tries
+ * identify and reuse an existing mempool or tries to allocate new
+ * mempool on requested_socket_id with mbuf size corresponding to
+ * requested_mtu. On success new configuration will be applied.
  * On error, device will be left unchanged. */
 static int
 netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
     OVS_REQUIRES(dev->mutex)
 {
     uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
-    struct rte_mempool *mp;
+    struct dpdk_mp *mp;
     int ret = 0;
+    bool per_port_mp = dpdk_per_port_memory();
 
-    dpdk_mp_sweep();
+    /* With shared mempools we do not need to configure a mempool if the MTU
+     * and socket ID have not changed, the previous configuration is still
+     * valid so return 0 */
+    if (!per_port_mp && dev->mtu == dev->requested_mtu
+        && dev->socket_id == dev->requested_socket_id) {
+        return ret;
+    }
 
-    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
+    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
     if (!mp) {
         VLOG_ERR("Failed to create memory pool for netdev "
                  "%s, with MTU %d on socket %d: %s\n",
                  dev->up.name, dev->requested_mtu, dev->requested_socket_id,
                  rte_strerror(rte_errno));
-        ret = rte_errno;
+        return rte_errno;
     } else {
-        /* If a new MTU was requested and its rounded value equals the one
-         * that is currently used, then the existing mempool is returned. */
-        if (dev->mp != mp) {
-            /* A new mempool was created, release the previous one. */
-            dpdk_mp_release(dev->mp);
-        } else {
-            ret = EEXIST;
+        /* Check for any pre-existing dpdk_mp for the device before accessing
+         * the associated mempool.
+         */
+        if (dev->dpdk_mp != NULL) {
+            /* If a new MTU was requested and its rounded value does not
+             * equal the rounded value of the current mempool, decrement
+             * the reference count to devices current dpdk_mp as its
+             * associated mempool will not be used for this device.
+             */
+            if (dev->dpdk_mp->mp != mp->mp) {
+                dpdk_mp_put(dev->dpdk_mp);
+            }
         }
-        dev->mp = mp;
+        dev->dpdk_mp = mp;
         dev->mtu = dev->requested_mtu;
         dev->socket_id = dev->requested_socket_id;
         dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
@@ -838,7 +935,8 @@  dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
 
         for (i = 0; i < n_rxq; i++) {
             diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
-                                          dev->socket_id, NULL, dev->mp);
+                                          dev->socket_id, NULL,
+                                          dev->dpdk_mp->mp);
             if (diag) {
                 VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
                           dev->up.name, i, rte_strerror(-diag));
@@ -926,7 +1024,7 @@  dpdk_eth_dev_init(struct netdev_dpdk *dev)
     memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
     rte_eth_link_get_nowait(dev->port_id, &dev->link);
 
-    mbp_priv = rte_mempool_get_priv(dev->mp);
+    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
     dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
 
     /* Get the Flow control configuration for DPDK-ETH */
@@ -1180,7 +1278,7 @@  common_destruct(struct netdev_dpdk *dev)
     OVS_EXCLUDED(dev->mutex)
 {
     rte_free(dev->tx_q);
-    dpdk_mp_release(dev->mp);
+    dpdk_mp_put(dev->dpdk_mp);
 
     ovs_list_remove(&dev->list_node);
     free(ovsrcu_get_protected(struct ingress_policer *,
@@ -1933,7 +2031,7 @@  netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
         return EAGAIN;
     }
 
-    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
+    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
                                     (struct rte_mbuf **) batch->packets,
                                     NETDEV_MAX_BURST);
     if (!nb_rx) {
@@ -2172,7 +2270,7 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
             continue;
         }
 
-        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
+        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
         if (OVS_UNLIKELY(!pkts[txcnt])) {
             dropped += cnt - i;
             break;
@@ -3052,7 +3150,7 @@  netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
         ovs_mutex_lock(&dev->mutex);
         ovs_mutex_lock(&dpdk_mp_mutex);
 
-        rte_mempool_dump(stream, dev->mp);
+        rte_mempool_dump(stream, dev->dpdk_mp->mp);
 
         ovs_mutex_unlock(&dpdk_mp_mutex);
         ovs_mutex_unlock(&dev->mutex);
@@ -3749,7 +3847,7 @@  dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
 
     err = netdev_dpdk_mempool_configure(dev);
     if (!err) {
-        /* A new mempool was created. */
+        /* A new mempool was created or re-used. */
         netdev_change_seq_changed(&dev->up);
     } else if (err != EEXIST){
         return err;
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 1e27a02..dbf70ac 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -359,6 +359,23 @@ 
         </p>
       </column>
 
+      <column name="other_config" key="per-port-memory"
+              type='{"type": "boolean"}'>
+        <p>
+          By default OVS DPDK uses a shared mempool memory model wherein
+          devices that have the same MTU and socket values can share the same
+          mempool. Setting this value to <code>true</code> changes this
+          behaviour. Per port memory allow DPDK devices to use private
+          memory per device. This can provide greater transapraency as
+          regards memory usage but potentially at the cost of greater memory
+          requirements.
+        </p>
+        <p>
+          Changing this value requires restarting the daemon if dpdk-init has
+          already been set to true.
+        </p>
+      </column>
+
       <column name="other_config" key="tx-flush-interval"
               type='{"type": "integer",
                      "minInteger": 0, "maxInteger": 1000000}'>