diff mbox

[ovs-dev,v2,2/3] netdev-dpdk: Convert initialization from cmdline to db

Message ID 1451943994-27936-3-git-send-email-aconole@redhat.com
State Not Applicable
Headers show

Commit Message

Aaron Conole Jan. 4, 2016, 9:46 p.m. UTC
Existing DPDK integration is provided by use of command line options which
must be split out and passed to librte in a special manner. However, this
forces any configuration to be passed by way of a special DPDK flag, and
interferes with ovs+dpdk packaging solutions.

This commit delays dpdk initialization until the first DPDK netdev is added
to the bridge, at which point ovs initializes librte. It pulls all of
the config data from the OVS database, and assembles a new argv/argc
pair to be passed along.

Signed-off-by: Aaron Conole <aconole@redhat.com>
---
v2:
* Removed trailing whitespace
* Followed for() loop brace coding style
* Automatically enable DPDK when adding a DPDK enabled port
* Fixed an issue on startup when DPDK enabled ports are present
* Updated the documentation (including vswitch.xml) and documented all
  new parameters
* Dropped the premature initialization test

INSTALL.DPDK.md         |  75 ++++++++++++----
 lib/netdev-dpdk.c       | 224 +++++++++++++++++++++++++++++++++++-------------
 lib/netdev-dpdk.h       |  19 ++--
 vswitchd/bridge.c       |   3 +
 vswitchd/ovs-vswitchd.c |  25 +-----
 vswitchd/vswitch.xml    | 102 +++++++++++++++++++++-
 6 files changed, 341 insertions(+), 107 deletions(-)

Comments

Panu Matilainen Jan. 7, 2016, 11:10 a.m. UTC | #1
On 01/04/2016 11:46 PM, Aaron Conole wrote:
> Existing DPDK integration is provided by use of command line options which
> must be split out and passed to librte in a special manner. However, this
> forces any configuration to be passed by way of a special DPDK flag, and
> interferes with ovs+dpdk packaging solutions.
>
> This commit delays dpdk initialization until the first DPDK netdev is added
> to the bridge, at which point ovs initializes librte.

On thing to keep in mind is that rte_eal_init() can and will tear down 
the entire process on failure since DPDK calls rte_panic() if something 
so much as sneezes. In current OVS this occurs on service startup where 
its relatively harmless, but with lazy initialization there could be 
already be other activity that is in risk of getting terminated when the 
first DPDK port is added.

Fixing rte_eal_init() to gracefully return on failure has been 
discussed, and agreed on in principle, on DPDK list but all current DPDK 
versions are nasty wrt that.

	- Panu -
Aaron Conole Jan. 8, 2016, 4:31 p.m. UTC | #2
Panu Matilainen <pmatilai@redhat.com> writes:
> On 01/04/2016 11:46 PM, Aaron Conole wrote:
>> Existing DPDK integration is provided by use of command line options which
>> must be split out and passed to librte in a special manner. However, this
>> forces any configuration to be passed by way of a special DPDK flag, and
>> interferes with ovs+dpdk packaging solutions.
>>
>> This commit delays dpdk initialization until the first DPDK netdev is added
>> to the bridge, at which point ovs initializes librte.
>
> On thing to keep in mind is that rte_eal_init() can and will tear down
> the entire process on failure since DPDK calls rte_panic() if
> something so much as sneezes. In current OVS this occurs on service
> startup where its relatively harmless, but with lazy initialization
> there could be already be other activity that is in risk of getting
> terminated when the first DPDK port is added.
>
> Fixing rte_eal_init() to gracefully return on failure has been
> discussed, and agreed on in principle, on DPDK list but all current
> DPDK versions are nasty wrt that.
>
> 	- Panu -

So, I've waffled back and forth on this. I understand the reason to be
nervous about an always init option (because that wastes lots of
resources when the system won't ever use dpdk). I also understand the
possible issues *today* with dpdk_init, but even then, it's a dpdk issue
which we want fixed anyway, so I don't know that this should hold up
this patch.

It's definitely a more elegant solution to do the lazy init, but I think
there could be times where we want a "stop everything" button for
occasions where testing without DPDK doing it's thing are desired.

However, I think I've come up with a solution that gives us flexibility
to support these cases without much additional work, so let me know what
you think:

First, go back to the dpdk true/false flag
Second, add a patch in the series which changes true/false to a tristate
on/off/lazy allowing the policy of when to initialize DPDK to be user
defined. We can default it to 'off' or 'lazy', but it can be changed to
'on' if we want.

What do folks think? Too much work and code for not enough gains?

Thanks,
Aaron

PS: Thanks for looking at this, Panu!
Zoltan Kiss Jan. 11, 2016, 1:05 p.m. UTC | #3
On 08/01/16 16:31, Aaron Conole wrote:
> Panu Matilainen <pmatilai@redhat.com> writes:
>> On 01/04/2016 11:46 PM, Aaron Conole wrote:
>>> Existing DPDK integration is provided by use of command line options which
>>> must be split out and passed to librte in a special manner. However, this
>>> forces any configuration to be passed by way of a special DPDK flag, and
>>> interferes with ovs+dpdk packaging solutions.
>>>
>>> This commit delays dpdk initialization until the first DPDK netdev is added
>>> to the bridge, at which point ovs initializes librte.
>>
>> On thing to keep in mind is that rte_eal_init() can and will tear down
>> the entire process on failure since DPDK calls rte_panic() if
>> something so much as sneezes. In current OVS this occurs on service
>> startup where its relatively harmless, but with lazy initialization
>> there could be already be other activity that is in risk of getting
>> terminated when the first DPDK port is added.
>>
>> Fixing rte_eal_init() to gracefully return on failure has been
>> discussed, and agreed on in principle, on DPDK list but all current
>> DPDK versions are nasty wrt that.
>>
>> 	- Panu -
>
> So, I've waffled back and forth on this. I understand the reason to be
> nervous about an always init option (because that wastes lots of
> resources when the system won't ever use dpdk). I also understand the
> possible issues *today* with dpdk_init, but even then, it's a dpdk issue
> which we want fixed anyway, so I don't know that this should hold up
> this patch.

I couldn't find the original email where this discussion happened: why 
is it required to be able to init DPDK on the fly? I mean it's a nice 
bit, but brings in a lot of trouble. On the other hand, asking the user 
to set a "other_config:odp=true" and then restart ovs-vswitchd is not a 
huge thing to ask.


>
> It's definitely a more elegant solution to do the lazy init, but I think
> there could be times where we want a "stop everything" button for
> occasions where testing without DPDK doing it's thing are desired.
>
> However, I think I've come up with a solution that gives us flexibility
> to support these cases without much additional work, so let me know what
> you think:
>
> First, go back to the dpdk true/false flag
> Second, add a patch in the series which changes true/false to a tristate
> on/off/lazy allowing the policy of when to initialize DPDK to be user
> defined. We can default it to 'off' or 'lazy', but it can be changed to
> 'on' if we want.
>
> What do folks think? Too much work and code for not enough gains?
>
> Thanks,
> Aaron
>
> PS: Thanks for looking at this, Panu!
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
>
Aaron Conole Jan. 11, 2016, 6:51 p.m. UTC | #4
Zoltan Kiss <zoltan.kiss@linaro.org> writes:
> On 08/01/16 16:31, Aaron Conole wrote:
>> Panu Matilainen <pmatilai@redhat.com> writes:
>>> On 01/04/2016 11:46 PM, Aaron Conole wrote:
>>>> Existing DPDK integration is provided by use of command line options which
>>>> must be split out and passed to librte in a special manner. However, this
>>>> forces any configuration to be passed by way of a special DPDK flag, and
>>>> interferes with ovs+dpdk packaging solutions.
>>>>
>>>> This commit delays dpdk initialization until the first DPDK netdev is added
>>>> to the bridge, at which point ovs initializes librte.
>>>
>>> On thing to keep in mind is that rte_eal_init() can and will tear down
>>> the entire process on failure since DPDK calls rte_panic() if
>>> something so much as sneezes. In current OVS this occurs on service
>>> startup where its relatively harmless, but with lazy initialization
>>> there could be already be other activity that is in risk of getting
>>> terminated when the first DPDK port is added.
>>>
>>> Fixing rte_eal_init() to gracefully return on failure has been
>>> discussed, and agreed on in principle, on DPDK list but all current
>>> DPDK versions are nasty wrt that.
>>>
>>> 	- Panu -
>>
>> So, I've waffled back and forth on this. I understand the reason to be
>> nervous about an always init option (because that wastes lots of
>> resources when the system won't ever use dpdk). I also understand the
>> possible issues *today* with dpdk_init, but even then, it's a dpdk issue
>> which we want fixed anyway, so I don't know that this should hold up
>> this patch.
>
> I couldn't find the original email where this discussion happened: why
> is it required to be able to init DPDK on the fly? I mean it's a nice
> bit, but brings in a lot of trouble. On the other hand, asking the
> user to set a "other_config:odp=true" and then restart ovs-vswitchd is
> not a huge thing to ask.

You won't find such a discussion - it's never been "required," as
such. It is very desirable, as having such automatic behavior reduces the
amount of steps required to get up and running with DPDK enabled
OVS. This is, imho, the argument of both Panu and Kevin when they think
a big on/off switch is less than elegant. I tend to agree with that, as well.

If the user is going through the trouble of compiling with DPDK support, and
then wants to add a DPDK port, having to also set
"other_config:dpdk=yes" seems like too many ok dialogs. I hope the
analogy makes sense.

That said, it's really an unneeded enhancement, and I've pitched the
idea to folks at the office within my elastic-shooting range. The answer
has been consistently "This is an enhancement, not a requirement." So
I'll drop the lazy-init feature for now.

>> It's definitely a more elegant solution to do the lazy init, but I think
>> there could be times where we want a "stop everything" button for
>> occasions where testing without DPDK doing it's thing are desired.
>>
>> However, I think I've come up with a solution that gives us flexibility
>> to support these cases without much additional work, so let me know what
>> you think:
>>
>> First, go back to the dpdk true/false flag
>> Second, add a patch in the series which changes true/false to a tristate
>> on/off/lazy allowing the policy of when to initialize DPDK to be user
>> defined. We can default it to 'off' or 'lazy', but it can be changed to
>> 'on' if we want.
>>
>> What do folks think? Too much work and code for not enough gains?

As I wrote above, the 'second' part of this is a great follow up
enhancement, but for now I'm going to go back to the giant flag, and we
can take it as 'future development'.
Kevin Traynor Jan. 12, 2016, 1:27 p.m. UTC | #5
> -----Original Message-----
> From: Aaron Conole [mailto:aconole@redhat.com]
> Sent: Monday, January 4, 2016 9:47 PM
> To: dev@openvswitch.org; Flavio Leitner; Traynor, Kevin
> Subject: [PATCH v2 2/3] netdev-dpdk: Convert initialization from cmdline to
> db
> 
> Existing DPDK integration is provided by use of command line options which
> must be split out and passed to librte in a special manner. However, this
> forces any configuration to be passed by way of a special DPDK flag, and
> interferes with ovs+dpdk packaging solutions.
> 
> This commit delays dpdk initialization until the first DPDK netdev is added
> to the bridge, at which point ovs initializes librte. It pulls all of
> the config data from the OVS database, and assembles a new argv/argc
> pair to be passed along.
> 
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> ---
> v2:
> * Removed trailing whitespace
> * Followed for() loop brace coding style
> * Automatically enable DPDK when adding a DPDK enabled port
> * Fixed an issue on startup when DPDK enabled ports are present
> * Updated the documentation (including vswitch.xml) and documented all
>   new parameters
> * Dropped the premature initialization test

Hi, mostly very minor comments below,

> 
> INSTALL.DPDK.md         |  75 ++++++++++++----
>  lib/netdev-dpdk.c       | 224 +++++++++++++++++++++++++++++++++++-----------
> --
>  lib/netdev-dpdk.h       |  19 ++--
>  vswitchd/bridge.c       |   3 +
>  vswitchd/ovs-vswitchd.c |  25 +-----
>  vswitchd/vswitch.xml    | 102 +++++++++++++++++++++-
>  6 files changed, 341 insertions(+), 107 deletions(-)
> 
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 96b686c..2dd2120 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -143,22 +143,59 @@ Using the DPDK with ovs-vswitchd:
> 
>  5. Start vswitchd:
> 
> -   DPDK configuration arguments can be passed to vswitchd via `--dpdk`
> -   argument. This needs to be first argument passed to vswitchd process.
> -   dpdk arg -c is ignored by ovs-dpdk, but it is a required parameter
> -   for dpdk initialization.
> +   DPDK configuration arguments can be passed to vswitchd via Open_vSwitch
> +   other_config database. The recognized configuration options are listed.

'Defaults will be provided for all values not set.'

> +
> +   * dpdk-lcore-mask
> +   Specifies the CPU cores on which dpdk lcore threads should be spawned.
> +   The DPDK lcore threads are used for DPDK library tasks, such as
> +   library internal message processing, logging, etc. Value should be in
> +   the form of a hex string (so '0x123') similar to the 'taskset' mask
> +   input.

'CPU cores' and '0x123' imply it will be multiple cores in this case - better
to change to CPU core and 0x1

Also, I don't think the 0x prefix is accepted based on using pmd-cpu-mask.

> +   If not specified, the value will be determined by choosing the lowest
> +   CPU core from initial cpu affinity list. Otherwise, the value will be
> +   passed directly to the DPDK library.
> +   For performance reasons, it is best to set this to a single core on
> +   the system, rather than allow lcore threads to float.

I suppose it depends on the system load and vswitch usage as to which will give
better performance but I agree setting a dedicated core is at least more
deterministic and less risky, so statement is probably fine.

> +
> +   * dpdk-mem-channels
> +   This sets the number of memory spread channels per CPU socket. It is
> purely
> +   an optimization flag.
> +
> +   * dpdk-alloc-mem
> +   This sets the total memory to preallocate from hugepages regardless of
> +   processor socket. It is recommended to use dpdk-socket-mem instead.
> +
> +   * dpdk-socket-mem
> +   Comma separated list of memory to pre-allocate from hugepages on specific
> +   sockets.
> +
> +   * dpdk-hugepage-dir
> +   Directory where hugetlbfs is mounted
> +
> +   * cuse-dev-name
> +   Option to set the vhost_cuse character device name.
> +
> +   * vhost-sock-dir
> +   Option to set the path to the vhost_user unix socket files.
> +
> +   NOTE: Changing any of these options requires restarting the ovs-vswitchd
> +   application.
> +
> +   Open vSwitch can be started as normal. DPDK will not be initialized until
> +   the first DPDK-enabled port is added to the bridge.
> 
>     ```
>     export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
> -   ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach
> +   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
>     ```
> 
>     If allocated more than one GB hugepage (as for IVSHMEM), set amount and
>     use NUMA node 0 memory:
> 
>     ```
> -   ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \
> -   -- unix:$DB_SOCK --pidfile --detach
> +   ovs-vsctl --no-wait set Open_vSwitch .
> other_config:dpdk_socket_mem="1024,0"
> +   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
>     ```
> 
>  6. Add bridge & ports
> @@ -521,11 +558,12 @@ have arbitrary names.
>       `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
>       to your VM on the QEMU command line. More instructions on this can be
>       found in the next section "DPDK vhost-user VM configuration"
> -     Note: If you wish for the vhost-user sockets to be created in a
> -     directory other than `/usr/local/var/run/openvswitch`, you may specify
> -     another location on the ovs-vswitchd command line like so:
> 
> -      `./vswitchd/ovs-vswitchd --dpdk -vhost_sock_dir /my-dir -c 0x1 ...`
> +  - If you wish for the vhost-user sockets to be created in a directory
> other
> +    than `/usr/local/var/run/openvswitch`, you may specify another location
> +    in the ovsdb like:
> +
> +    `./vswitchd/ovs-vsctl set Open_vSwitch . other_config:vhost-sock-
> dir=path`

--no-wait

> 
>  DPDK vhost-user VM configuration:
>  ---------------------------------
> @@ -638,14 +676,13 @@ DPDK vhost-cuse VM configuration:
> 
>  1. This step is only needed if using an alternative character device.
> 
> -   The new character device filename must be specified on the vswitchd
> -   commandline:
> +   The new character device filename must be specified in the ovsdb:
> 
> -        `./vswitchd/ovs-vswitchd --dpdk --cuse_dev_name my-vhost-net -c 0x1
> ...`
> +   `./utilities/ovs-vsctl set Open_vSwitch . \
> +                          other_config:cuse-dev-name=my-vhost-net`

--no-wait
> 
> -   Note that the `--cuse_dev_name` argument and associated string must be
> the first
> -   arguments after `--dpdk` and come before the EAL arguments. In the
> example
> -   above, the character device to be used will be `/dev/my-vhost-net`.
> +   In the example above, the character device to be used will be
> +   `/dev/my-vhost-net`.
> 
>  2. This step is only needed if reusing the standard character device. It
> will
>     conflict with the kernel vhost character device so the user must first
> @@ -758,8 +795,8 @@ steps.
> 
>          <my-vhost-device> refers to "vhost-net" if using the `/dev/vhost-
> net`
>          device. If you have specificed a different name on the ovs-vswitchd
> -        commandline using the "--cuse_dev_name" parameter, please specify
> that
> -        filename instead.
> +        database using the "other_config:cuse-dev-name" parameter, please
> +        specify that filename instead.
> 
>       2. Disable SELinux or set to permissive mode
> 
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index b42712f..2ce9f71 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -29,6 +29,7 @@
>  #include <stdio.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
> +#include <getopt.h>
> 
>  #include "dirs.h"
>  #include "dp-packet.h"
> @@ -49,6 +50,8 @@
>  #include "timeval.h"
>  #include "unixctl.h"
>  #include "openvswitch/vlog.h"
> +#include "smap.h"
> +#include "vswitch-idl.h"
> 
>  #include "rte_config.h"
>  #include "rte_mbuf.h"
> @@ -138,6 +141,10 @@ enum dpdk_dev_type {
>      DPDK_DEV_VHOST = 1,
>  };
> 
> +static const struct ovsrec_open_vswitch *ovs_last_cfg;
> +
> +void dpdk_init(void);
> +
>  static int rte_eal_init_ret = ENODEV;
> 
>  static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
> @@ -551,8 +558,20 @@ netdev_dpdk_cast(const struct netdev *netdev)
>  static struct netdev *
>  netdev_dpdk_alloc(void)
>  {
> -    struct netdev_dpdk *netdev = dpdk_rte_mzalloc(sizeof *netdev);
> -    return &netdev->up;
> +    struct netdev_dpdk *netdev;
> +
> +    if (rte_eal_init_ret) {
> +        dpdk_init();

This may be gone from the next patch, but there is protection inside
and outside of dpdk_init(); to prevent it executing twice. There might
be a way to get rid of one of them while still getting the return values
you want? 

> +        if (rte_eal_init_ret) {
> +            return NULL;
> +        }
> +    }
> +
> +    netdev = dpdk_rte_mzalloc(sizeof *netdev);
> +    if (netdev) {
> +        return &netdev->up;
> +    }
> +    return NULL;
>  }
> 
>  static void
> @@ -657,7 +676,10 @@ vhost_construct_helper(struct netdev *netdev_)
> OVS_REQUIRES(dpdk_mutex)
>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> 
>      if (rte_eal_init_ret) {
> -        return rte_eal_init_ret;
> +        dpdk_init();
> +        if (rte_eal_init_ret) {
> +            return rte_eal_init_ret;
> +        }
>      }
> 
>      rte_spinlock_init(&netdev->vhost_tx_lock);
> @@ -670,6 +692,13 @@ netdev_dpdk_vhost_cuse_construct(struct netdev *netdev_)
>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>      int err;
> 
> +    if (rte_eal_init_ret) {
> +        dpdk_init();
> +        if (rte_eal_init_ret) {
> +            return rte_eal_init_ret;
> +        }
> +    }
> +
>      ovs_mutex_lock(&dpdk_mutex);
>      strncpy(netdev->vhost_id, netdev->up.name, sizeof(netdev->vhost_id));
>      err = vhost_construct_helper(netdev_);
> @@ -683,6 +712,13 @@ netdev_dpdk_vhost_user_construct(struct netdev *netdev_)
>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>      int err;
> 
> +    if (rte_eal_init_ret) {
> +        dpdk_init();
> +        if (rte_eal_init_ret) {
> +            return rte_eal_init_ret;
> +        }
> +    }
> +
>      ovs_mutex_lock(&dpdk_mutex);
>      /* Take the name of the vhost-user port and append it to the location
> where
>       * the socket is to be created, then register the socket.
> @@ -707,7 +743,10 @@ netdev_dpdk_construct(struct netdev *netdev)
>      int err;
> 
>      if (rte_eal_init_ret) {
> -        return rte_eal_init_ret;
> +        dpdk_init();
> +        if (rte_eal_init_ret) {
> +            return rte_eal_init_ret;
> +        }
>      }
> 
>      /* Names always start with "dpdk" */
> @@ -1776,6 +1815,8 @@ new_device(struct virtio_net *dev)
>      struct netdev_dpdk *netdev;
>      bool exists = false;
> 
> +
> +
>      ovs_mutex_lock(&dpdk_mutex);
>      /* Add device to the vhost port with the same name as that passed down.
> */
>      LIST_FOR_EACH(netdev, list_node, &dpdk_list) {
> @@ -2023,7 +2064,10 @@ netdev_dpdk_ring_construct(struct netdev *netdev)
>      int err = 0;
> 
>      if (rte_eal_init_ret) {
> -        return rte_eal_init_ret;
> +        dpdk_init();
> +        if (rte_eal_init_ret) {
> +            return rte_eal_init_ret;
> +        }
>      }
> 
>      ovs_mutex_lock(&dpdk_mutex);
> @@ -2111,10 +2155,14 @@ unlock_dpdk:
> 
>  static int
>  process_vhost_flags(char *flag, char *default_val, int size,
> -                    char **argv, char **new_val)
> +                    const struct ovsrec_open_vswitch *ovs_cfg,
> +                    char **new_val)
>  {
> +    const char *val;
>      int changed = 0;
> 
> +    val = smap_get(&ovs_cfg->other_config, flag);
> +
>      /* Depending on which version of vhost is in use, process the vhost-
> specific
>       * flag if it is provided on the vswitchd command line, otherwise resort
> to
>       * a default value.
> @@ -2124,9 +2172,9 @@ process_vhost_flags(char *flag, char *default_val, int
> size,
>       * For vhost-cuse: Process "-cuse_dev_name" to set the custom name of
> the
>       * vhost-cuse character device.
>       */
> -    if (!strcmp(argv[1], flag) && (strlen(argv[2]) <= size)) {
> +    if (val && (strlen(val) <= size)) {
>          changed = 1;
> -        *new_val = strdup(argv[2]);
> +        *new_val = strdup(val);
>          VLOG_INFO("User-provided %s in use: %s", flag, *new_val);
>      } else {
>          VLOG_INFO("No %s provided - defaulting to %s", flag, default_val);
> @@ -2136,68 +2184,115 @@ process_vhost_flags(char *flag, char *default_val,
> int size,
>      return changed;
>  }
> 
> -int
> -dpdk_init(int argc, char **argv)
> +static char **
> +grow_argv(char ***argv, size_t cur_siz, size_t grow_by)
>  {
> -    int result;
> -    int base = 0;
> -    char *pragram_name = argv[0];
> -    int err;
> -    int isset;
> -    cpu_set_t cpuset;
> -
> -    if (argc < 2 || strcmp(argv[1], "--dpdk"))
> -        return 0;
> +    char **new_argv = realloc(*argv, sizeof(char *) * (cur_siz + grow_by));
> +    return new_argv;
> +}
> 
> -    /* Remove the --dpdk argument from arg list.*/
> -    argc--;
> -    argv++;
> +static int
> +get_dpdk_args(const struct ovsrec_open_vswitch *ovs_cfg, char ***argv)
> +{
> +    struct dpdk_options_map {
> +        const char *ovs_configuration;
> +        const char *dpdk_option;
> +        bool default_enabled;
> +        const char *default_value;
> +    } opts[] = {
> +        {"dpdk-lcore-mask", "-c", true, "0x1"},
> +        /* XXX: DPDK 2.2.0 support, the true should become false for -n */
> +        {"dpdk-mem-channels", "-n", true, "4"},
> +        {"dpdk-alloc-mem", "-m", false, NULL},
> +        {"dpdk-socket-mem", "--socket-mem", true, "1024,0"},
> +        {"dpdk-hugepage-dir", "--huge-dir", false, NULL},
> +    };
> +    int i, ret = 1;
> +
> +    for(i = 0; i < (sizeof(opts) / sizeof(opts[0])); ++i) {
> +        const char *lookup = smap_get(&ovs_cfg->other_config,
> +                                      opts[i].ovs_configuration);
> +        if(!lookup && opts[i].default_enabled)
> +            lookup = opts[i].default_value;
> +
> +        if(lookup) {
> +            char **newargv = grow_argv(argv, ret, 2);
> +
> +            if (!newargv) {
> +                ovs_abort(0, "grow_argv() failed to allocate memory.");
> +            }
> 
> -    /* Reject --user option */
> -    int i;
> -    for (i = 0; i < argc; i++) {
> -        if (!strcmp(argv[i], "--user")) {
> -            VLOG_ERR("Can not mix --dpdk and --user options, aborting.");
> +            *argv = newargv;
> +            (*argv)[ret++] = strdup(opts[i].dpdk_option);
> +            (*argv)[ret++] = strdup(lookup);
>          }
>      }
> 
> +    return ret;
> +}
> +
> +static char **dpdk_argv;
> +static int dpdk_argc;
> +
> +static void
> +deferred_argv_release(void)
> +{
> +    int result;
> +    for (result = 0; result < dpdk_argc; ++result) {
> +        free(dpdk_argv[result]);
> +    }
> +
> +    free(dpdk_argv);
> +}
> +
> +static void
> +__dpdk_init(const struct ovsrec_open_vswitch *ovs_cfg)
> +{
> +    char **argv = NULL;
> +    int result;
> +    int argc;
> +    int err;
> +    cpu_set_t cpuset;
> +
> +    VLOG_INFO("DPDK Enabled, initializing");
> +
>  #ifdef VHOST_CUSE
> -    if (process_vhost_flags("-cuse_dev_name", strdup("vhost-net"),
> -                            PATH_MAX, argv, &cuse_dev_name)) {
> +    if (process_vhost_flags("cuse-dev-name", strdup("vhost-net"),
> +                            PATH_MAX, ovs_cfg, &cuse_dev_name)) {
>  #else
> -    if (process_vhost_flags("-vhost_sock_dir", strdup(ovs_rundir()),
> -                            NAME_MAX, argv, &vhost_sock_dir)) {
> +    if (process_vhost_flags("vhost-sock-dir", strdup(ovs_rundir()),
> +                            NAME_MAX, ovs_cfg, &vhost_sock_dir)) {
>          struct stat s;
> -        int err;
> 
>          err = stat(vhost_sock_dir, &s);
>          if (err) {
> -            VLOG_ERR("vHostUser socket DIR '%s' does not exist.",
> -                     vhost_sock_dir);
> -            return err;
> +            ovs_abort(0, "vhost-user sock directory '%s' does not exist.",
> +                      vhost_sock_dir);
>          }
>  #endif
> -        /* Remove the vhost flag configuration parameters from the argument
> -         * list, so that the correct elements are passed to the DPDK
> -         * initialization function
> -         */
> -        argc -= 2;
> -        argv += 2;    /* Increment by two to bypass the vhost flag arguments
> */
> -        base = 2;
>      }
> 
>      /* Get the main thread affinity */
>      CPU_ZERO(&cpuset);
>      err = pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t),
> &cpuset);
>      if (err) {
> -        VLOG_ERR("Thread getaffinity error %d.", err);
> -        return err;
> +        ovs_abort(0, "Thread getaffinity error %d.", err);
>      }
> 
> -    /* Keep the program name argument as this is needed for call to
> -     * rte_eal_init()
> -     */
> -    argv[0] = pragram_name;
> +    argv = grow_argv(&argv, 0, 1);
> +    if (!argv) {
> +        ovs_abort(0, "Unable to allocate an initial argv.");
> +    }
> +    argv[0] = strdup("ovs"); /* TODO use prctl to get process name */
> +    argc = get_dpdk_args(ovs_cfg, &argv);
> +
> +    argv = grow_argv(&argv, argc, argc+1);
> +    if (!argv) {
> +        ovs_abort(0, "Unable to make final argv allocation.");
> +    }
> +    argv[argc] = 0;
> +
> +    optind = 1;
> 
>      /* Make sure things are initialized ... */
>      result = rte_eal_init(argc, argv);
> @@ -2208,21 +2303,36 @@ dpdk_init(int argc, char **argv)
>      /* Set the main thread affinity back to pre rte_eal_init() value */
>      err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
> &cpuset);
>      if (err) {
> -        VLOG_ERR("Thread setaffinity error %d", err);
> -        return err;
> +        ovs_abort(0, "Thread getaffinity error %d.", err);
>      }
> 
> +    dpdk_argv = argv;
> +    dpdk_argc = argc;
> +
> +    atexit(deferred_argv_release);
> +
>      rte_memzone_dump(stdout);
>      rte_eal_init_ret = 0;
> 
> -    if (argc > result) {
> -        argv[result] = argv[0];
> -    }
> -
>      /* We are called from the main thread here */
>      RTE_PER_LCORE(_lcore_id) = NON_PMD_CORE_ID;
> +}
> 
> -    return result + 1 + base;
> +void
> +dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg)
> +{
> +    ovs_last_cfg = ovs_cfg;

I think there would need to be locking around ovs_last_cfg as we
could also be in the middle of initialization here.

> +}
> +
> +void
> +dpdk_init(void)
> +{
> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +
> +    if(ovs_last_cfg && ovsthread_once_start(&once)) {
> +        __dpdk_init(ovs_last_cfg);
> +        ovsthread_once_done(&once);
> +    }
>  }
> 
>  static const struct netdev_class dpdk_class =
> @@ -2286,10 +2396,6 @@ netdev_dpdk_register(void)
>  {
>      static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> 
> -    if (rte_eal_init_ret) {
> -        return;
> -    }
> -
>      if (ovsthread_once_start(&once)) {
>          dpdk_common_init();
>          netdev_register_provider(&dpdk_class);
> diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
> index 646d3e2..e699f37 100644
> --- a/lib/netdev-dpdk.h
> +++ b/lib/netdev-dpdk.h
> @@ -22,7 +22,9 @@ struct dp_packet;
> 
>  #define NON_PMD_CORE_ID LCORE_ID_ANY
> 
> -int dpdk_init(int argc, char **argv);
> +struct ovsrec_open_vswitch;
> +
> +void dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg);
>  void netdev_dpdk_register(void);
>  void free_dpdk_buf(struct dp_packet *);
>  int pmd_thread_setaffinity_cpu(unsigned cpu);
> @@ -30,15 +32,20 @@ int pmd_thread_setaffinity_cpu(unsigned cpu);
>  #else
> 
>  #define NON_PMD_CORE_ID UINT32_MAX
> -
>  #include "util.h"
> 
> -static inline int
> -dpdk_init(int argc, char **argv)
> +static inline void
> +dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg)
>  {
> -    if (argc >= 2 && !strcmp(argv[1], "--dpdk")) {
> -        ovs_fatal(0, "DPDK support not built into this copy of Open
> vSwitch.");
> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +
> +    if (ovs_cfg && ovsthread_once_start(&once)) {
> +        if(smap_get_bool(&ovs_cfg->other_config, "dpdk", false)) {
> +            ovs_fatal(0, "DPDK not supported in this copy of Open
> vSwitch.");

"dpdk" db entry is not in this version of the patchset. 

Might be best for this function to just do nothing now. A print doesn't
even seem to make sense as this will be called for OVS without DPDK.

> +        }
> +        ovsthread_once_done(&once);
>      }
> +
>      return 0;
>  }
> 
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index f8afe55..f7aab07 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -68,6 +68,7 @@
>  #include "sflow_api.h"
>  #include "vlan-bitmap.h"
>  #include "packets.h"
> +#include "lib/netdev-dpdk.h"
> 
>  VLOG_DEFINE_THIS_MODULE(bridge);
> 
> @@ -2922,6 +2923,8 @@ bridge_run(void)
>      }
>      cfg = ovsrec_open_vswitch_first(idl);
> 
> +    dpdk_config(cfg);
> +
>      /* Initialize the ofproto library.  This only needs to run once, but
>       * it must be done after the configuration is set.  If the
>       * initialization has already occurred, bridge_init_ofproto()
> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
> index e78ecda..c448107 100644
> --- a/vswitchd/ovs-vswitchd.c
> +++ b/vswitchd/ovs-vswitchd.c
> @@ -48,7 +48,6 @@
>  #include "openvswitch/vconn.h"
>  #include "openvswitch/vlog.h"
>  #include "lib/vswitch-idl.h"
> -#include "lib/netdev-dpdk.h"
> 
>  VLOG_DEFINE_THIS_MODULE(vswitchd);
> 
> @@ -71,13 +70,6 @@ main(int argc, char *argv[])
>      int retval;
> 
>      set_program_name(argv[0]);
> -    retval = dpdk_init(argc,argv);
> -    if (retval < 0) {
> -        return retval;
> -    }
> -
> -    argc -= retval;
> -    argv += retval;
> 
>      ovs_cmdl_proctitle_init(argc, argv);
>      service_start(&argc, &argv);
> @@ -166,7 +158,7 @@ parse_options(int argc, char *argv[], char
> **unixctl_pathp)
>          {"bootstrap-ca-cert", required_argument, NULL,
> OPT_BOOTSTRAP_CA_CERT},
>          {"enable-dummy", optional_argument, NULL, OPT_ENABLE_DUMMY},
>          {"disable-system", no_argument, NULL, OPT_DISABLE_SYSTEM},
> -        {"dpdk", required_argument, NULL, OPT_DPDK},
> +        {"dpdk", optional_argument, NULL, OPT_DPDK},
>          {NULL, 0, NULL, 0},
>      };
>      char *short_options =
> ovs_cmdl_long_options_to_short_options(long_options);
> @@ -219,7 +211,7 @@ parse_options(int argc, char *argv[], char
> **unixctl_pathp)
>              exit(EXIT_FAILURE);
> 
>          case OPT_DPDK:
> -            ovs_fatal(0, "--dpdk must be given at beginning of command
> line.");
> +            ovs_fatal(0, "Using --dpdk to configure DPDK is not
> supported.");
>              break;
> 
>          default:
> @@ -256,17 +248,8 @@ usage(void)
>      daemon_usage();
>      vlog_usage();
>      printf("\nDPDK options:\n"
> -           "  --dpdk [VHOST] [DPDK]     Initialize DPDK datapath.\n"
> -           "  where DPDK are options for initializing DPDK lib and VHOST
> is\n"
> -#ifdef VHOST_CUSE
> -           "  option to override default character device name used for\n"
> -           "  for use with userspace vHost\n"
> -           "    -cuse_dev_name NAME\n"
> -#else
> -           "  option to override default directory where vhost-user\n"
> -           "  sockets are created.\n"
> -           "    -vhost_sock_dir DIR\n"
> -#endif
> +           "Configuration of DPDK via command-line is removed from this\n"
> +           "version of Open vSwitch. DPDK is configured through ovsdb.\n"
>             );
>      printf("\nOther options:\n"
>             "  --unixctl=SOCKET          override default control socket
> name\n"
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index ce0dbc1..355da95 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -176,11 +176,38 @@
>          </p>
>        </column>
> 
> -      <column name="other_config" key="pmd-cpu-mask">
> +      <column name="other_config" key="dpdk-lcore-mask"
> +              type='{"type": "integer", "minInteger": 1}'>
> +        <p>
> +          Specifies the CPU cores on which dpdk lcore threads should be
> spawned.
> +          The DPDK lcore threads are used for DPDK library tasks, such as
> +          library internal message processing, logging, etc. Value should be
> in
> +          the form of a hex string (so '0x123') similar to the 'taskset'
> mask
> +          input.
> +        </p>
> +        <p>
> +          The lowest order bit corresponds to the first CPU core.  A set bit
> +          means the corresponding core is available and a pmd thread will be
> +          created and pinned to it.  If the input does not cover all cores,
> +          those uncovered cores are considered not set.
> +        </p>
> +        <p>
> +          For performance reasons, it is best to set this to a single core
> on
> +          the system, rather than allow lcore threads to float.
> +        </p>
> +        <p>
> +          If not specified, the value will be determined by choosing the
> lowest
> +          CPU core from initial cpu affinity list. Otherwise, the value will
> be
> +          passed directly to the DPDK library.
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="pmd-cpu-mask"
> +              type='{"type": "integer", "minInteger": 1}'>
>          <p>
>            Specifies CPU mask for setting the cpu affinity of PMD (Poll
>            Mode Driver) threads.  Value should be in the form of hex string,
> -          similar to the dpdk EAL '-c COREMASK' option input or the
> 'taskset'
> +          similar to the dpdk-lcore-mask option input or the 'taskset'
>            mask input.
>          </p>
>          <p>
> @@ -195,6 +222,77 @@
>          </p>
>        </column>
> 
> +      <column name="other_config" key="dpdk-mem-channels"
> +              type='{"type": "integer", "minInteger": 1}'>
> +        <p>
> +          Specifies the number of NUMA memory spread channels in the CPU to
> +          be used by DPDK. It is purely optimization.
> +        </p>
> +        <p>
> +          The default value is 4.
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="dpdk-alloc-mem"
> +              type='{"type": "integer", "minInteger": 0}'>
> +        <p>
> +          Specifies the amount of memory to preallocate from the hugepage
> pool,
> +          regardless of socket. It is recommended that dpdk-socket-mem is
> used
> +          instead.
> +        </p>
> +        <p>
> +          If not specified, the value is 0.
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="dpdk-socket-mem"
> +              type='{"type": "string"}'>
> +        <p>
> +          Specifies the amount of memory to preallocate from the hugepage
> pool,
> +          on a per-socket basis.
> +        </p>
> +        <p>
> +          The specifier is a comma-separated string, in ascending order of
> CPU
> +          socket (ex: 1024,2048,4096,8192 would set socket 0 to preallocate
> +          1024MB, socket 1 to preallocate 2048MB, etc.)
> +        </p>
> +        <p>
> +          If not specified, the default value is 1024,0.
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="dpdk-hugepage-dir"
> +              type='{"type": "string"}'>
> +        <p>
> +          Specifies the path to the hugetlbfs mount point.
> +        </p>
> +        <p>
> +          If not specified, this will be guessed by the DPDK library
> (default
> +          is /dev/hugepages).
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="cuse-dev-name"
> +              type='{"type": "string"}'>
> +        <p>
> +          Specifies the name of the vhost-cuse character device to open for
> +          vhost-cuse support.
> +        </p>
> +        <p>
> +          The default is vhost-net
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="vhost-sock-dir"
> +              type='{"type": "string"}'>
> +        <p>
> +          Specifies the path to the vhost-user unix domain socket files.
> +        </p>
> +        <p>
> +          The default is the working directory of the application.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="n-handler-threads"
>                type='{"type": "integer", "minInteger": 1}'>
>          <p>
> --
> 2.6.1.133.gf5b6079
Aaron Conole Jan. 12, 2016, 2:46 p.m. UTC | #6
"Traynor, Kevin" <kevin.traynor@intel.com> writes:
>> -----Original Message-----
>> From: Aaron Conole [mailto:aconole@redhat.com]
>> Sent: Monday, January 4, 2016 9:47 PM
>> To: dev@openvswitch.org; Flavio Leitner; Traynor, Kevin
>> Subject: [PATCH v2 2/3] netdev-dpdk: Convert initialization from cmdline to
>> db
>> 
>> Existing DPDK integration is provided by use of command line options which
>> must be split out and passed to librte in a special manner. However, this
>> forces any configuration to be passed by way of a special DPDK flag, and
>> interferes with ovs+dpdk packaging solutions.
>> 
>> This commit delays dpdk initialization until the first DPDK netdev is added
>> to the bridge, at which point ovs initializes librte. It pulls all of
>> the config data from the OVS database, and assembles a new argv/argc
>> pair to be passed along.
>> 
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>> ---
>> v2:
>> * Removed trailing whitespace
>> * Followed for() loop brace coding style
>> * Automatically enable DPDK when adding a DPDK enabled port
>> * Fixed an issue on startup when DPDK enabled ports are present
>> * Updated the documentation (including vswitch.xml) and documented all
>>   new parameters
>> * Dropped the premature initialization test
>
> Hi, mostly very minor comments below,

Kevin, thanks for the review!

>
>> 
>> INSTALL.DPDK.md         |  75 ++++++++++++----
>>  lib/netdev-dpdk.c       | 224 +++++++++++++++++++++++++++++++++++-----------
>> --
>>  lib/netdev-dpdk.h       |  19 ++--
>>  vswitchd/bridge.c       |   3 +
>>  vswitchd/ovs-vswitchd.c |  25 +-----
>>  vswitchd/vswitch.xml    | 102 +++++++++++++++++++++-
>>  6 files changed, 341 insertions(+), 107 deletions(-)
>> 
>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>> index 96b686c..2dd2120 100644
>> --- a/INSTALL.DPDK.md
>> +++ b/INSTALL.DPDK.md
>> @@ -143,22 +143,59 @@ Using the DPDK with ovs-vswitchd:
>> 
>>  5. Start vswitchd:
>> 
>> -   DPDK configuration arguments can be passed to vswitchd via `--dpdk`
>> -   argument. This needs to be first argument passed to vswitchd process.
>> -   dpdk arg -c is ignored by ovs-dpdk, but it is a required parameter
>> -   for dpdk initialization.
>> +   DPDK configuration arguments can be passed to vswitchd via Open_vSwitch
>> +   other_config database. The recognized configuration options are listed.
>
> 'Defaults will be provided for all values not set.'

Thanks, will add this.

>
>> +
>> +   * dpdk-lcore-mask
>> +   Specifies the CPU cores on which dpdk lcore threads should be spawned.
>> +   The DPDK lcore threads are used for DPDK library tasks, such as
>> +   library internal message processing, logging, etc. Value should be in
>> +   the form of a hex string (so '0x123') similar to the 'taskset' mask
>> +   input.
>
> 'CPU cores' and '0x123' imply it will be multiple cores in this case - better
> to change to CPU core and 0x1
>
> Also, I don't think the 0x prefix is accepted based on using pmd-cpu-mask.

Hrrm.. okay. I used the documentation on pmd-cpu-mask as my guideline,
so if that's broken it's a bug against pmd-cpu-mask as well (it's
documented as a <hex-string>).

>> +   If not specified, the value will be determined by choosing the lowest
>> +   CPU core from initial cpu affinity list. Otherwise, the value will be
>> +   passed directly to the DPDK library.
>> +   For performance reasons, it is best to set this to a single core on
>> +   the system, rather than allow lcore threads to float.
>
> I suppose it depends on the system load and vswitch usage as to which will give
> better performance but I agree setting a dedicated core is at least more
> deterministic and less risky, so statement is probably fine.

Well, I used the advice in rings as my guideline here. I think it's
correct to say that the workload is important. Okay, I'll think if
there's a better comment to use here.

>> +
>> +   * dpdk-mem-channels
>> +   This sets the number of memory spread channels per CPU socket. It is
>> purely
>> +   an optimization flag.
>> +
>> +   * dpdk-alloc-mem
>> +   This sets the total memory to preallocate from hugepages regardless of
>> +   processor socket. It is recommended to use dpdk-socket-mem instead.
>> +
>> +   * dpdk-socket-mem
>> +   Comma separated list of memory to pre-allocate from hugepages on specific
>> +   sockets.
>> +
>> +   * dpdk-hugepage-dir
>> +   Directory where hugetlbfs is mounted
>> +
>> +   * cuse-dev-name
>> +   Option to set the vhost_cuse character device name.
>> +
>> +   * vhost-sock-dir
>> +   Option to set the path to the vhost_user unix socket files.
>> +
>> +   NOTE: Changing any of these options requires restarting the ovs-vswitchd
>> +   application.
>> +
>> +   Open vSwitch can be started as normal. DPDK will not be initialized until
>> +   the first DPDK-enabled port is added to the bridge.
>> 
>>     ```
>>     export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
>> -   ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach
>> +   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
>>     ```
>> 
>>     If allocated more than one GB hugepage (as for IVSHMEM), set amount and
>>     use NUMA node 0 memory:
>> 
>>     ```
>> -   ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \
>> -   -- unix:$DB_SOCK --pidfile --detach
>> +   ovs-vsctl --no-wait set Open_vSwitch .
>> other_config:dpdk_socket_mem="1024,0"
>> +   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
>>     ```
>> 
>>  6. Add bridge & ports
>> @@ -521,11 +558,12 @@ have arbitrary names.
>>       `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
>>       to your VM on the QEMU command line. More instructions on this can be
>>       found in the next section "DPDK vhost-user VM configuration"
>> -     Note: If you wish for the vhost-user sockets to be created in a
>> -     directory other than `/usr/local/var/run/openvswitch`, you may specify
>> -     another location on the ovs-vswitchd command line like so:
>> 
>> -      `./vswitchd/ovs-vswitchd --dpdk -vhost_sock_dir /my-dir -c 0x1 ...`
>> +  - If you wish for the vhost-user sockets to be created in a directory
>> other
>> +    than `/usr/local/var/run/openvswitch`, you may specify another location
>> +    in the ovsdb like:
>> +
>> +    `./vswitchd/ovs-vsctl set Open_vSwitch . other_config:vhost-sock-
>> dir=path`
>
> --no-wait

d'oh!

>> 
>>  DPDK vhost-user VM configuration:
>>  ---------------------------------
>> @@ -638,14 +676,13 @@ DPDK vhost-cuse VM configuration:
>> 
>>  1. This step is only needed if using an alternative character device.
>> 
>> -   The new character device filename must be specified on the vswitchd
>> -   commandline:
>> +   The new character device filename must be specified in the ovsdb:
>> 
>> -        `./vswitchd/ovs-vswitchd --dpdk --cuse_dev_name my-vhost-net -c 0x1
>> ...`
>> +   `./utilities/ovs-vsctl set Open_vSwitch . \
>> +                          other_config:cuse-dev-name=my-vhost-net`
>
> --no-wait

d'oh!

>> -   Note that the `--cuse_dev_name` argument and associated string must be
>> the first
>> -   arguments after `--dpdk` and come before the EAL arguments. In the
>> example
>> -   above, the character device to be used will be `/dev/my-vhost-net`.
>> +   In the example above, the character device to be used will be
>> +   `/dev/my-vhost-net`.
>> 
>>  2. This step is only needed if reusing the standard character device. It
>> will
>>     conflict with the kernel vhost character device so the user must first
>> @@ -758,8 +795,8 @@ steps.
>> 
>>          <my-vhost-device> refers to "vhost-net" if using the `/dev/vhost-
>> net`
>>          device. If you have specificed a different name on the ovs-vswitchd
>> -        commandline using the "--cuse_dev_name" parameter, please specify
>> that
>> -        filename instead.
>> +        database using the "other_config:cuse-dev-name" parameter, please
>> +        specify that filename instead.
>> 
>>       2. Disable SELinux or set to permissive mode
>> 
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index b42712f..2ce9f71 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -29,6 +29,7 @@
>>  #include <stdio.h>
>>  #include <sys/types.h>
>>  #include <sys/stat.h>
>> +#include <getopt.h>
>> 
>>  #include "dirs.h"
>>  #include "dp-packet.h"
>> @@ -49,6 +50,8 @@
>>  #include "timeval.h"
>>  #include "unixctl.h"
>>  #include "openvswitch/vlog.h"
>> +#include "smap.h"
>> +#include "vswitch-idl.h"
>> 
>>  #include "rte_config.h"
>>  #include "rte_mbuf.h"
>> @@ -138,6 +141,10 @@ enum dpdk_dev_type {
>>      DPDK_DEV_VHOST = 1,
>>  };
>> 
>> +static const struct ovsrec_open_vswitch *ovs_last_cfg;
>> +
>> +void dpdk_init(void);
>> +
>>  static int rte_eal_init_ret = ENODEV;
>> 
>>  static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
>> @@ -551,8 +558,20 @@ netdev_dpdk_cast(const struct netdev *netdev)
>>  static struct netdev *
>>  netdev_dpdk_alloc(void)
>>  {
>> -    struct netdev_dpdk *netdev = dpdk_rte_mzalloc(sizeof *netdev);
>> -    return &netdev->up;
>> +    struct netdev_dpdk *netdev;
>> +
>> +    if (rte_eal_init_ret) {
>> +        dpdk_init();
>
> This may be gone from the next patch, but there is protection inside
> and outside of dpdk_init(); to prevent it executing twice. There might
> be a way to get rid of one of them while still getting the return values
> you want?

I wanted to skip the dpdk_init path altogether, but you're probably
right. Anyway, you're correct that it will need to be re-done from the
patch series.

>> +        if (rte_eal_init_ret) {
>> +            return NULL;
>> +        }
>> +    }
>> +
>> +    netdev = dpdk_rte_mzalloc(sizeof *netdev);
>> +    if (netdev) {
>> +        return &netdev->up;
>> +    }
>> +    return NULL;
>>  }
>> 
>>  static void
>> @@ -657,7 +676,10 @@ vhost_construct_helper(struct netdev *netdev_)
>> OVS_REQUIRES(dpdk_mutex)
>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>> 
>>      if (rte_eal_init_ret) {
>> -        return rte_eal_init_ret;
>> +        dpdk_init();
>> +        if (rte_eal_init_ret) {
>> +            return rte_eal_init_ret;
>> +        }
>>      }
>> 
>>      rte_spinlock_init(&netdev->vhost_tx_lock);
>> @@ -670,6 +692,13 @@ netdev_dpdk_vhost_cuse_construct(struct netdev *netdev_)
>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>      int err;
>> 
>> +    if (rte_eal_init_ret) {
>> +        dpdk_init();
>> +        if (rte_eal_init_ret) {
>> +            return rte_eal_init_ret;
>> +        }
>> +    }
>> +
>>      ovs_mutex_lock(&dpdk_mutex);
>>      strncpy(netdev->vhost_id, netdev->up.name, sizeof(netdev->vhost_id));
>>      err = vhost_construct_helper(netdev_);
>> @@ -683,6 +712,13 @@ netdev_dpdk_vhost_user_construct(struct netdev *netdev_)
>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>      int err;
>> 
>> +    if (rte_eal_init_ret) {
>> +        dpdk_init();
>> +        if (rte_eal_init_ret) {
>> +            return rte_eal_init_ret;
>> +        }
>> +    }
>> +
>>      ovs_mutex_lock(&dpdk_mutex);
>>      /* Take the name of the vhost-user port and append it to the location
>> where
>>       * the socket is to be created, then register the socket.
>> @@ -707,7 +743,10 @@ netdev_dpdk_construct(struct netdev *netdev)
>>      int err;
>> 
>>      if (rte_eal_init_ret) {
>> -        return rte_eal_init_ret;
>> +        dpdk_init();
>> +        if (rte_eal_init_ret) {
>> +            return rte_eal_init_ret;
>> +        }
>>      }
>> 
>>      /* Names always start with "dpdk" */
>> @@ -1776,6 +1815,8 @@ new_device(struct virtio_net *dev)
>>      struct netdev_dpdk *netdev;
>>      bool exists = false;
>> 
>> +
>> +
>>      ovs_mutex_lock(&dpdk_mutex);
>>      /* Add device to the vhost port with the same name as that passed down.
>> */
>>      LIST_FOR_EACH(netdev, list_node, &dpdk_list) {
>> @@ -2023,7 +2064,10 @@ netdev_dpdk_ring_construct(struct netdev *netdev)
>>      int err = 0;
>> 
>>      if (rte_eal_init_ret) {
>> -        return rte_eal_init_ret;
>> +        dpdk_init();
>> +        if (rte_eal_init_ret) {
>> +            return rte_eal_init_ret;
>> +        }
>>      }
>> 
>>      ovs_mutex_lock(&dpdk_mutex);
>> @@ -2111,10 +2155,14 @@ unlock_dpdk:
>> 
>>  static int
>>  process_vhost_flags(char *flag, char *default_val, int size,
>> -                    char **argv, char **new_val)
>> +                    const struct ovsrec_open_vswitch *ovs_cfg,
>> +                    char **new_val)
>>  {
>> +    const char *val;
>>      int changed = 0;
>> 
>> +    val = smap_get(&ovs_cfg->other_config, flag);
>> +
>>      /* Depending on which version of vhost is in use, process the vhost-
>> specific
>>       * flag if it is provided on the vswitchd command line, otherwise resort
>> to
>>       * a default value.
>> @@ -2124,9 +2172,9 @@ process_vhost_flags(char *flag, char *default_val, int
>> size,
>>       * For vhost-cuse: Process "-cuse_dev_name" to set the custom name of
>> the
>>       * vhost-cuse character device.
>>       */
>> -    if (!strcmp(argv[1], flag) && (strlen(argv[2]) <= size)) {
>> +    if (val && (strlen(val) <= size)) {
>>          changed = 1;
>> -        *new_val = strdup(argv[2]);
>> +        *new_val = strdup(val);
>>          VLOG_INFO("User-provided %s in use: %s", flag, *new_val);
>>      } else {
>>          VLOG_INFO("No %s provided - defaulting to %s", flag, default_val);
>> @@ -2136,68 +2184,115 @@ process_vhost_flags(char *flag, char *default_val,
>> int size,
>>      return changed;
>>  }
>> 
>> -int
>> -dpdk_init(int argc, char **argv)
>> +static char **
>> +grow_argv(char ***argv, size_t cur_siz, size_t grow_by)
>>  {
>> -    int result;
>> -    int base = 0;
>> -    char *pragram_name = argv[0];
>> -    int err;
>> -    int isset;
>> -    cpu_set_t cpuset;
>> -
>> -    if (argc < 2 || strcmp(argv[1], "--dpdk"))
>> -        return 0;
>> +    char **new_argv = realloc(*argv, sizeof(char *) * (cur_siz + grow_by));
>> +    return new_argv;
>> +}
>> 
>> -    /* Remove the --dpdk argument from arg list.*/
>> -    argc--;
>> -    argv++;
>> +static int
>> +get_dpdk_args(const struct ovsrec_open_vswitch *ovs_cfg, char ***argv)
>> +{
>> +    struct dpdk_options_map {
>> +        const char *ovs_configuration;
>> +        const char *dpdk_option;
>> +        bool default_enabled;
>> +        const char *default_value;
>> +    } opts[] = {
>> +        {"dpdk-lcore-mask", "-c", true, "0x1"},
>> +        /* XXX: DPDK 2.2.0 support, the true should become false for -n */
>> +        {"dpdk-mem-channels", "-n", true, "4"},
>> +        {"dpdk-alloc-mem", "-m", false, NULL},
>> +        {"dpdk-socket-mem", "--socket-mem", true, "1024,0"},
>> +        {"dpdk-hugepage-dir", "--huge-dir", false, NULL},
>> +    };
>> +    int i, ret = 1;
>> +
>> +    for(i = 0; i < (sizeof(opts) / sizeof(opts[0])); ++i) {
>> +        const char *lookup = smap_get(&ovs_cfg->other_config,
>> +                                      opts[i].ovs_configuration);
>> +        if(!lookup && opts[i].default_enabled)
>> +            lookup = opts[i].default_value;
>> +
>> +        if(lookup) {
>> +            char **newargv = grow_argv(argv, ret, 2);
>> +
>> +            if (!newargv) {
>> +                ovs_abort(0, "grow_argv() failed to allocate memory.");
>> +            }
>> 
>> -    /* Reject --user option */
>> -    int i;
>> -    for (i = 0; i < argc; i++) {
>> -        if (!strcmp(argv[i], "--user")) {
>> -            VLOG_ERR("Can not mix --dpdk and --user options, aborting.");
>> +            *argv = newargv;
>> +            (*argv)[ret++] = strdup(opts[i].dpdk_option);
>> +            (*argv)[ret++] = strdup(lookup);
>>          }
>>      }
>> 
>> +    return ret;
>> +}
>> +
>> +static char **dpdk_argv;
>> +static int dpdk_argc;
>> +
>> +static void
>> +deferred_argv_release(void)
>> +{
>> +    int result;
>> +    for (result = 0; result < dpdk_argc; ++result) {
>> +        free(dpdk_argv[result]);
>> +    }
>> +
>> +    free(dpdk_argv);
>> +}
>> +
>> +static void
>> +__dpdk_init(const struct ovsrec_open_vswitch *ovs_cfg)
>> +{
>> +    char **argv = NULL;
>> +    int result;
>> +    int argc;
>> +    int err;
>> +    cpu_set_t cpuset;
>> +
>> +    VLOG_INFO("DPDK Enabled, initializing");
>> +
>>  #ifdef VHOST_CUSE
>> -    if (process_vhost_flags("-cuse_dev_name", strdup("vhost-net"),
>> -                            PATH_MAX, argv, &cuse_dev_name)) {
>> +    if (process_vhost_flags("cuse-dev-name", strdup("vhost-net"),
>> +                            PATH_MAX, ovs_cfg, &cuse_dev_name)) {
>>  #else
>> -    if (process_vhost_flags("-vhost_sock_dir", strdup(ovs_rundir()),
>> -                            NAME_MAX, argv, &vhost_sock_dir)) {
>> +    if (process_vhost_flags("vhost-sock-dir", strdup(ovs_rundir()),
>> +                            NAME_MAX, ovs_cfg, &vhost_sock_dir)) {
>>          struct stat s;
>> -        int err;
>> 
>>          err = stat(vhost_sock_dir, &s);
>>          if (err) {
>> -            VLOG_ERR("vHostUser socket DIR '%s' does not exist.",
>> -                     vhost_sock_dir);
>> -            return err;
>> +            ovs_abort(0, "vhost-user sock directory '%s' does not exist.",
>> +                      vhost_sock_dir);
>>          }
>>  #endif
>> -        /* Remove the vhost flag configuration parameters from the argument
>> -         * list, so that the correct elements are passed to the DPDK
>> -         * initialization function
>> -         */
>> -        argc -= 2;
>> -        argv += 2;    /* Increment by two to bypass the vhost flag arguments
>> */
>> -        base = 2;
>>      }
>> 
>>      /* Get the main thread affinity */
>>      CPU_ZERO(&cpuset);
>>      err = pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t),
>> &cpuset);
>>      if (err) {
>> -        VLOG_ERR("Thread getaffinity error %d.", err);
>> -        return err;
>> +        ovs_abort(0, "Thread getaffinity error %d.", err);
>>      }
>> 
>> -    /* Keep the program name argument as this is needed for call to
>> -     * rte_eal_init()
>> -     */
>> -    argv[0] = pragram_name;
>> +    argv = grow_argv(&argv, 0, 1);
>> +    if (!argv) {
>> +        ovs_abort(0, "Unable to allocate an initial argv.");
>> +    }
>> +    argv[0] = strdup("ovs"); /* TODO use prctl to get process name */
>> +    argc = get_dpdk_args(ovs_cfg, &argv);
>> +
>> +    argv = grow_argv(&argv, argc, argc+1);
>> +    if (!argv) {
>> +        ovs_abort(0, "Unable to make final argv allocation.");
>> +    }
>> +    argv[argc] = 0;
>> +
>> +    optind = 1;
>> 
>>      /* Make sure things are initialized ... */
>>      result = rte_eal_init(argc, argv);
>> @@ -2208,21 +2303,36 @@ dpdk_init(int argc, char **argv)
>>      /* Set the main thread affinity back to pre rte_eal_init() value */
>>      err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
>> &cpuset);
>>      if (err) {
>> -        VLOG_ERR("Thread setaffinity error %d", err);
>> -        return err;
>> +        ovs_abort(0, "Thread getaffinity error %d.", err);
>>      }
>> 
>> +    dpdk_argv = argv;
>> +    dpdk_argc = argc;
>> +
>> +    atexit(deferred_argv_release);
>> +
>>      rte_memzone_dump(stdout);
>>      rte_eal_init_ret = 0;
>> 
>> -    if (argc > result) {
>> -        argv[result] = argv[0];
>> -    }
>> -
>>      /* We are called from the main thread here */
>>      RTE_PER_LCORE(_lcore_id) = NON_PMD_CORE_ID;
>> +}
>> 
>> -    return result + 1 + base;
>> +void
>> +dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg)
>> +{
>> +    ovs_last_cfg = ovs_cfg;
>
> I think there would need to be locking around ovs_last_cfg as we
> could also be in the middle of initialization here.

This won't be an issue with the next series I post.

>> +}
>> +
>> +void
>> +dpdk_init(void)
>> +{
>> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +
>> +    if(ovs_last_cfg && ovsthread_once_start(&once)) {
>> +        __dpdk_init(ovs_last_cfg);
>> +        ovsthread_once_done(&once);
>> +    }
>>  }
>> 
>>  static const struct netdev_class dpdk_class =
>> @@ -2286,10 +2396,6 @@ netdev_dpdk_register(void)
>>  {
>>      static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> 
>> -    if (rte_eal_init_ret) {
>> -        return;
>> -    }
>> -
>>      if (ovsthread_once_start(&once)) {
>>          dpdk_common_init();
>>          netdev_register_provider(&dpdk_class);
>> diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
>> index 646d3e2..e699f37 100644
>> --- a/lib/netdev-dpdk.h
>> +++ b/lib/netdev-dpdk.h
>> @@ -22,7 +22,9 @@ struct dp_packet;
>> 
>>  #define NON_PMD_CORE_ID LCORE_ID_ANY
>> 
>> -int dpdk_init(int argc, char **argv);
>> +struct ovsrec_open_vswitch;
>> +
>> +void dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg);
>>  void netdev_dpdk_register(void);
>>  void free_dpdk_buf(struct dp_packet *);
>>  int pmd_thread_setaffinity_cpu(unsigned cpu);
>> @@ -30,15 +32,20 @@ int pmd_thread_setaffinity_cpu(unsigned cpu);
>>  #else
>> 
>>  #define NON_PMD_CORE_ID UINT32_MAX
>> -
>>  #include "util.h"
>> 
>> -static inline int
>> -dpdk_init(int argc, char **argv)
>> +static inline void
>> +dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg)
>>  {
>> -    if (argc >= 2 && !strcmp(argv[1], "--dpdk")) {
>> -        ovs_fatal(0, "DPDK support not built into this copy of Open
>> vSwitch.");
>> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +
>> +    if (ovs_cfg && ovsthread_once_start(&once)) {
>> +        if(smap_get_bool(&ovs_cfg->other_config, "dpdk", false)) {
>> +            ovs_fatal(0, "DPDK not supported in this copy of Open
>> vSwitch.");
>
> "dpdk" db entry is not in this version of the patchset. 

Good catch... I should have tested it more thoroughly.

> Might be best for this function to just do nothing now. A print doesn't
> even seem to make sense as this will be called for OVS without DPDK.

Okay.

>> +        }
>> +        ovsthread_once_done(&once);
>>      }
>> +
>>      return 0;
>>  }
>> 
>> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
>> index f8afe55..f7aab07 100644
>> --- a/vswitchd/bridge.c
>> +++ b/vswitchd/bridge.c
>> @@ -68,6 +68,7 @@
>>  #include "sflow_api.h"
>>  #include "vlan-bitmap.h"
>>  #include "packets.h"
>> +#include "lib/netdev-dpdk.h"
>> 
>>  VLOG_DEFINE_THIS_MODULE(bridge);
>> 
>> @@ -2922,6 +2923,8 @@ bridge_run(void)
>>      }
>>      cfg = ovsrec_open_vswitch_first(idl);
>> 
>> +    dpdk_config(cfg);
>> +
>>      /* Initialize the ofproto library.  This only needs to run once, but
>>       * it must be done after the configuration is set.  If the
>>       * initialization has already occurred, bridge_init_ofproto()
>> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
>> index e78ecda..c448107 100644
>> --- a/vswitchd/ovs-vswitchd.c
>> +++ b/vswitchd/ovs-vswitchd.c
>> @@ -48,7 +48,6 @@
>>  #include "openvswitch/vconn.h"
>>  #include "openvswitch/vlog.h"
>>  #include "lib/vswitch-idl.h"
>> -#include "lib/netdev-dpdk.h"
>> 
>>  VLOG_DEFINE_THIS_MODULE(vswitchd);
>> 
>> @@ -71,13 +70,6 @@ main(int argc, char *argv[])
>>      int retval;
>> 
>>      set_program_name(argv[0]);
>> -    retval = dpdk_init(argc,argv);
>> -    if (retval < 0) {
>> -        return retval;
>> -    }
>> -
>> -    argc -= retval;
>> -    argv += retval;
>> 
>>      ovs_cmdl_proctitle_init(argc, argv);
>>      service_start(&argc, &argv);
>> @@ -166,7 +158,7 @@ parse_options(int argc, char *argv[], char
>> **unixctl_pathp)
>>          {"bootstrap-ca-cert", required_argument, NULL,
>> OPT_BOOTSTRAP_CA_CERT},
>>          {"enable-dummy", optional_argument, NULL, OPT_ENABLE_DUMMY},
>>          {"disable-system", no_argument, NULL, OPT_DISABLE_SYSTEM},
>> -        {"dpdk", required_argument, NULL, OPT_DPDK},
>> +        {"dpdk", optional_argument, NULL, OPT_DPDK},
>>          {NULL, 0, NULL, 0},
>>      };
>>      char *short_options =
>> ovs_cmdl_long_options_to_short_options(long_options);
>> @@ -219,7 +211,7 @@ parse_options(int argc, char *argv[], char
>> **unixctl_pathp)
>>              exit(EXIT_FAILURE);
>> 
>>          case OPT_DPDK:
>> -            ovs_fatal(0, "--dpdk must be given at beginning of command
>> line.");
>> +            ovs_fatal(0, "Using --dpdk to configure DPDK is not
>> supported.");
>>              break;
>> 
>>          default:
>> @@ -256,17 +248,8 @@ usage(void)
>>      daemon_usage();
>>      vlog_usage();
>>      printf("\nDPDK options:\n"
>> -           "  --dpdk [VHOST] [DPDK]     Initialize DPDK datapath.\n"
>> -           "  where DPDK are options for initializing DPDK lib and VHOST
>> is\n"
>> -#ifdef VHOST_CUSE
>> -           "  option to override default character device name used for\n"
>> -           "  for use with userspace vHost\n"
>> -           "    -cuse_dev_name NAME\n"
>> -#else
>> -           "  option to override default directory where vhost-user\n"
>> -           "  sockets are created.\n"
>> -           "    -vhost_sock_dir DIR\n"
>> -#endif
>> +           "Configuration of DPDK via command-line is removed from this\n"
>> +           "version of Open vSwitch. DPDK is configured through ovsdb.\n"
>>             );
>>      printf("\nOther options:\n"
>>             "  --unixctl=SOCKET          override default control socket
>> name\n"
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index ce0dbc1..355da95 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -176,11 +176,38 @@
>>          </p>
>>        </column>
>> 
>> -      <column name="other_config" key="pmd-cpu-mask">
>> +      <column name="other_config" key="dpdk-lcore-mask"
>> +              type='{"type": "integer", "minInteger": 1}'>
>> +        <p>
>> +          Specifies the CPU cores on which dpdk lcore threads should be
>> spawned.
>> +          The DPDK lcore threads are used for DPDK library tasks, such as
>> +          library internal message processing, logging, etc. Value should be
>> in
>> +          the form of a hex string (so '0x123') similar to the 'taskset'
>> mask
>> +          input.
>> +        </p>
>> +        <p>
>> +          The lowest order bit corresponds to the first CPU core.  A set bit
>> +          means the corresponding core is available and a pmd thread will be
>> +          created and pinned to it.  If the input does not cover all cores,
>> +          those uncovered cores are considered not set.
>> +        </p>
>> +        <p>
>> +          For performance reasons, it is best to set this to a single core
>> on
>> +          the system, rather than allow lcore threads to float.
>> +        </p>
>> +        <p>
>> +          If not specified, the value will be determined by choosing the
>> lowest
>> +          CPU core from initial cpu affinity list. Otherwise, the value will
>> be
>> +          passed directly to the DPDK library.
>> +        </p>
>> +      </column>
>> +
>> +      <column name="other_config" key="pmd-cpu-mask"
>> +              type='{"type": "integer", "minInteger": 1}'>
>>          <p>
>>            Specifies CPU mask for setting the cpu affinity of PMD (Poll
>>            Mode Driver) threads.  Value should be in the form of hex string,
>> -          similar to the dpdk EAL '-c COREMASK' option input or the
>> 'taskset'
>> +          similar to the dpdk-lcore-mask option input or the 'taskset'
>>            mask input.
>>          </p>
>>          <p>
>> @@ -195,6 +222,77 @@
>>          </p>
>>        </column>
>> 
>> +      <column name="other_config" key="dpdk-mem-channels"
>> +              type='{"type": "integer", "minInteger": 1}'>
>> +        <p>
>> +          Specifies the number of NUMA memory spread channels in the CPU to
>> +          be used by DPDK. It is purely optimization.
>> +        </p>
>> +        <p>
>> +          The default value is 4.
>> +        </p>
>> +      </column>
>> +
>> +      <column name="other_config" key="dpdk-alloc-mem"
>> +              type='{"type": "integer", "minInteger": 0}'>
>> +        <p>
>> +          Specifies the amount of memory to preallocate from the hugepage
>> pool,
>> +          regardless of socket. It is recommended that dpdk-socket-mem is
>> used
>> +          instead.
>> +        </p>
>> +        <p>
>> +          If not specified, the value is 0.
>> +        </p>
>> +      </column>
>> +
>> +      <column name="other_config" key="dpdk-socket-mem"
>> +              type='{"type": "string"}'>
>> +        <p>
>> +          Specifies the amount of memory to preallocate from the hugepage
>> pool,
>> +          on a per-socket basis.
>> +        </p>
>> +        <p>
>> +          The specifier is a comma-separated string, in ascending order of
>> CPU
>> +          socket (ex: 1024,2048,4096,8192 would set socket 0 to preallocate
>> +          1024MB, socket 1 to preallocate 2048MB, etc.)
>> +        </p>
>> +        <p>
>> +          If not specified, the default value is 1024,0.
>> +        </p>
>> +      </column>
>> +
>> +      <column name="other_config" key="dpdk-hugepage-dir"
>> +              type='{"type": "string"}'>
>> +        <p>
>> +          Specifies the path to the hugetlbfs mount point.
>> +        </p>
>> +        <p>
>> +          If not specified, this will be guessed by the DPDK library
>> (default
>> +          is /dev/hugepages).
>> +        </p>
>> +      </column>
>> +
>> +      <column name="other_config" key="cuse-dev-name"
>> +              type='{"type": "string"}'>
>> +        <p>
>> +          Specifies the name of the vhost-cuse character device to open for
>> +          vhost-cuse support.
>> +        </p>
>> +        <p>
>> +          The default is vhost-net
>> +        </p>
>> +      </column>
>> +
>> +      <column name="other_config" key="vhost-sock-dir"
>> +              type='{"type": "string"}'>
>> +        <p>
>> +          Specifies the path to the vhost-user unix domain socket files.
>> +        </p>
>> +        <p>
>> +          The default is the working directory of the application.
>> +        </p>
>> +      </column>
>> +
>>        <column name="other_config" key="n-handler-threads"
>>                type='{"type": "integer", "minInteger": 1}'>
>>          <p>
>> --
>> 2.6.1.133.gf5b6079
Kevin Traynor Jan. 12, 2016, 2:54 p.m. UTC | #7
> -----Original Message-----

> From: dev [mailto:dev-bounces@openvswitch.org] On Behalf Of Aaron Conole

> Sent: Monday, January 11, 2016 6:51 PM

> To: Zoltan Kiss

> Cc: dev@openvswitch.org; Flavio Leitner

> Subject: Re: [ovs-dev] [PATCH v2 2/3] netdev-dpdk: Convert initialization

> from cmdline to db

> 

> Zoltan Kiss <zoltan.kiss@linaro.org> writes:

> > On 08/01/16 16:31, Aaron Conole wrote:

> >> Panu Matilainen <pmatilai@redhat.com> writes:

> >>> On 01/04/2016 11:46 PM, Aaron Conole wrote:

> >>>> Existing DPDK integration is provided by use of command line options

> which

> >>>> must be split out and passed to librte in a special manner. However,

> this

> >>>> forces any configuration to be passed by way of a special DPDK flag, and

> >>>> interferes with ovs+dpdk packaging solutions.

> >>>>

> >>>> This commit delays dpdk initialization until the first DPDK netdev is

> added

> >>>> to the bridge, at which point ovs initializes librte.

> >>>

> >>> On thing to keep in mind is that rte_eal_init() can and will tear down

> >>> the entire process on failure since DPDK calls rte_panic() if

> >>> something so much as sneezes. In current OVS this occurs on service

> >>> startup where its relatively harmless, but with lazy initialization

> >>> there could be already be other activity that is in risk of getting

> >>> terminated when the first DPDK port is added.

> >>>

> >>> Fixing rte_eal_init() to gracefully return on failure has been

> >>> discussed, and agreed on in principle, on DPDK list but all current

> >>> DPDK versions are nasty wrt that.

> >>>

> >>> 	- Panu -

> >>

> >> So, I've waffled back and forth on this. I understand the reason to be

> >> nervous about an always init option (because that wastes lots of

> >> resources when the system won't ever use dpdk). I also understand the

> >> possible issues *today* with dpdk_init, but even then, it's a dpdk issue

> >> which we want fixed anyway, so I don't know that this should hold up

> >> this patch.

> >

> > I couldn't find the original email where this discussion happened: why

> > is it required to be able to init DPDK on the fly? I mean it's a nice

> > bit, but brings in a lot of trouble. On the other hand, asking the

> > user to set a "other_config:odp=true" and then restart ovs-vswitchd is

> > not a huge thing to ask.

> 

> You won't find such a discussion - it's never been "required," as

> such. It is very desirable, as having such automatic behavior reduces the

> amount of steps required to get up and running with DPDK enabled

> OVS. This is, imho, the argument of both Panu and Kevin when they think

> a big on/off switch is less than elegant. I tend to agree with that, as well.

> 

> If the user is going through the trouble of compiling with DPDK support, and

> then wants to add a DPDK port, having to also set

> "other_config:dpdk=yes" seems like too many ok dialogs. I hope the

> analogy makes sense.

> 

> That said, it's really an unneeded enhancement, and I've pitched the

> idea to folks at the office within my elastic-shooting range. The answer

> has been consistently "This is an enhancement, not a requirement." So

> I'll drop the lazy-init feature for now.

> 

> >> It's definitely a more elegant solution to do the lazy init, but I think

> >> there could be times where we want a "stop everything" button for

> >> occasions where testing without DPDK doing it's thing are desired.

> >>

> >> However, I think I've come up with a solution that gives us flexibility

> >> to support these cases without much additional work, so let me know what

> >> you think:

> >>

> >> First, go back to the dpdk true/false flag

> >> Second, add a patch in the series which changes true/false to a tristate

> >> on/off/lazy allowing the policy of when to initialize DPDK to be user

> >> defined. We can default it to 'off' or 'lazy', but it can be changed to

> >> 'on' if we want.

> >>

> >> What do folks think? Too much work and code for not enough gains?

> 

> As I wrote above, the 'second' part of this is a great follow up

> enhancement, but for now I'm going to go back to the giant flag, and we

> can take it as 'future development'.


I think we all agree that removing the vswitchd dpdk mandatory cmdline args by
putting them in the db/code with defaults is good for usability and getting that
alone enabled through this patch would be a good outcome IMHO.

I see "other_config:dpdk=yes" and lazy init's as ways to enable a common build.
That's good for usability and support but I think it's a separate enough change
to warrant a separate patchset/discussion.

> _______________________________________________

> dev mailing list

> dev@openvswitch.org

> http://openvswitch.org/mailman/listinfo/dev
Aaron Conole Jan. 13, 2016, 7:26 p.m. UTC | #8
Aaron Conole <aconole@redhat.com> writes:
> "Traynor, Kevin" <kevin.traynor@intel.com> writes:
>>> -----Original Message-----
>>> From: Aaron Conole [mailto:aconole@redhat.com]
>>> Sent: Monday, January 4, 2016 9:47 PM
>>> To: dev@openvswitch.org; Flavio Leitner; Traynor, Kevin
>>> Subject: [PATCH v2 2/3] netdev-dpdk: Convert initialization from cmdline to
>>> db
>>> 
>>> Existing DPDK integration is provided by use of command line options which
>>> must be split out and passed to librte in a special manner. However, this
>>> forces any configuration to be passed by way of a special DPDK flag, and
>>> interferes with ovs+dpdk packaging solutions.
>>> 
>>> This commit delays dpdk initialization until the first DPDK netdev is added
>>> to the bridge, at which point ovs initializes librte. It pulls all of
>>> the config data from the OVS database, and assembles a new argv/argc
>>> pair to be passed along.
>>> 
>>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>>> ---
>>> v2:
>>> * Removed trailing whitespace
>>> * Followed for() loop brace coding style
>>> * Automatically enable DPDK when adding a DPDK enabled port
>>> * Fixed an issue on startup when DPDK enabled ports are present
>>> * Updated the documentation (including vswitch.xml) and documented all
>>>   new parameters
>>> * Dropped the premature initialization test
>>
>> Hi, mostly very minor comments below,
<<snip>>
>>> 
>>> -static inline int
>>> -dpdk_init(int argc, char **argv)
>>> +static inline void
>>> +dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg)
>>>  {
>>> -    if (argc >= 2 && !strcmp(argv[1], "--dpdk")) {
>>> -        ovs_fatal(0, "DPDK support not built into this copy of Open
>>> vSwitch.");
>>> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>>> +
>>> +    if (ovs_cfg && ovsthread_once_start(&once)) {
>>> +        if(smap_get_bool(&ovs_cfg->other_config, "dpdk", false)) {
>>> +            ovs_fatal(0, "DPDK not supported in this copy of Open
>>> vSwitch.");
>>
>> "dpdk" db entry is not in this version of the patchset. 
>
> Good catch... I should have tested it more thoroughly.
>
>> Might be best for this function to just do nothing now. A print doesn't
>> even seem to make sense as this will be called for OVS without DPDK.
>
> Okay.
>

I'm rethinking. We need the print in this case: user has dpdk-init=true
in the database, but this version of OVS doesn't support DPDK. The
netdevs in the database will be broken, and the user will be unaware of
why.

It's the same as the original behavior - if someone requests dpdk at
runtime, but the support wasn't there at compile time, we should abort
altogether and let the user clean things up. So that's what I'm
including in my patchset.

-Aaron
diff mbox

Patch

diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index 96b686c..2dd2120 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -143,22 +143,59 @@  Using the DPDK with ovs-vswitchd:
 
 5. Start vswitchd:
 
-   DPDK configuration arguments can be passed to vswitchd via `--dpdk`
-   argument. This needs to be first argument passed to vswitchd process.
-   dpdk arg -c is ignored by ovs-dpdk, but it is a required parameter
-   for dpdk initialization.
+   DPDK configuration arguments can be passed to vswitchd via Open_vSwitch
+   other_config database. The recognized configuration options are listed.
+
+   * dpdk-lcore-mask
+   Specifies the CPU cores on which dpdk lcore threads should be spawned.
+   The DPDK lcore threads are used for DPDK library tasks, such as
+   library internal message processing, logging, etc. Value should be in
+   the form of a hex string (so '0x123') similar to the 'taskset' mask
+   input.
+   If not specified, the value will be determined by choosing the lowest
+   CPU core from initial cpu affinity list. Otherwise, the value will be
+   passed directly to the DPDK library.
+   For performance reasons, it is best to set this to a single core on
+   the system, rather than allow lcore threads to float.
+
+   * dpdk-mem-channels
+   This sets the number of memory spread channels per CPU socket. It is purely
+   an optimization flag.
+
+   * dpdk-alloc-mem
+   This sets the total memory to preallocate from hugepages regardless of
+   processor socket. It is recommended to use dpdk-socket-mem instead.
+
+   * dpdk-socket-mem
+   Comma separated list of memory to pre-allocate from hugepages on specific
+   sockets.
+
+   * dpdk-hugepage-dir
+   Directory where hugetlbfs is mounted
+
+   * cuse-dev-name
+   Option to set the vhost_cuse character device name.
+
+   * vhost-sock-dir
+   Option to set the path to the vhost_user unix socket files.
+
+   NOTE: Changing any of these options requires restarting the ovs-vswitchd
+   application.
+
+   Open vSwitch can be started as normal. DPDK will not be initialized until
+   the first DPDK-enabled port is added to the bridge.
 
    ```
    export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
-   ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach
+   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
    ```
 
    If allocated more than one GB hugepage (as for IVSHMEM), set amount and
    use NUMA node 0 memory:
 
    ```
-   ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \
-   -- unix:$DB_SOCK --pidfile --detach
+   ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk_socket_mem="1024,0"
+   ovs-vswitchd unix:$DB_SOCK --pidfile --detach
    ```
 
 6. Add bridge & ports
@@ -521,11 +558,12 @@  have arbitrary names.
      `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
      to your VM on the QEMU command line. More instructions on this can be
      found in the next section "DPDK vhost-user VM configuration"
-     Note: If you wish for the vhost-user sockets to be created in a
-     directory other than `/usr/local/var/run/openvswitch`, you may specify
-     another location on the ovs-vswitchd command line like so:
 
-      `./vswitchd/ovs-vswitchd --dpdk -vhost_sock_dir /my-dir -c 0x1 ...`
+  - If you wish for the vhost-user sockets to be created in a directory other
+    than `/usr/local/var/run/openvswitch`, you may specify another location
+    in the ovsdb like:
+
+    `./vswitchd/ovs-vsctl set Open_vSwitch . other_config:vhost-sock-dir=path`
 
 DPDK vhost-user VM configuration:
 ---------------------------------
@@ -638,14 +676,13 @@  DPDK vhost-cuse VM configuration:
 
 1. This step is only needed if using an alternative character device.
 
-   The new character device filename must be specified on the vswitchd
-   commandline:
+   The new character device filename must be specified in the ovsdb:
 
-        `./vswitchd/ovs-vswitchd --dpdk --cuse_dev_name my-vhost-net -c 0x1 ...`
+   `./utilities/ovs-vsctl set Open_vSwitch . \
+                          other_config:cuse-dev-name=my-vhost-net`
 
-   Note that the `--cuse_dev_name` argument and associated string must be the first
-   arguments after `--dpdk` and come before the EAL arguments. In the example
-   above, the character device to be used will be `/dev/my-vhost-net`.
+   In the example above, the character device to be used will be
+   `/dev/my-vhost-net`.
 
 2. This step is only needed if reusing the standard character device. It will
    conflict with the kernel vhost character device so the user must first
@@ -758,8 +795,8 @@  steps.
 
         <my-vhost-device> refers to "vhost-net" if using the `/dev/vhost-net`
         device. If you have specificed a different name on the ovs-vswitchd
-        commandline using the "--cuse_dev_name" parameter, please specify that
-        filename instead.
+        database using the "other_config:cuse-dev-name" parameter, please
+        specify that filename instead.
 
      2. Disable SELinux or set to permissive mode
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index b42712f..2ce9f71 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -29,6 +29,7 @@ 
 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <getopt.h>
 
 #include "dirs.h"
 #include "dp-packet.h"
@@ -49,6 +50,8 @@ 
 #include "timeval.h"
 #include "unixctl.h"
 #include "openvswitch/vlog.h"
+#include "smap.h"
+#include "vswitch-idl.h"
 
 #include "rte_config.h"
 #include "rte_mbuf.h"
@@ -138,6 +141,10 @@  enum dpdk_dev_type {
     DPDK_DEV_VHOST = 1,
 };
 
+static const struct ovsrec_open_vswitch *ovs_last_cfg;
+
+void dpdk_init(void);
+
 static int rte_eal_init_ret = ENODEV;
 
 static struct ovs_mutex dpdk_mutex = OVS_MUTEX_INITIALIZER;
@@ -551,8 +558,20 @@  netdev_dpdk_cast(const struct netdev *netdev)
 static struct netdev *
 netdev_dpdk_alloc(void)
 {
-    struct netdev_dpdk *netdev = dpdk_rte_mzalloc(sizeof *netdev);
-    return &netdev->up;
+    struct netdev_dpdk *netdev;
+
+    if (rte_eal_init_ret) {
+        dpdk_init();
+        if (rte_eal_init_ret) {
+            return NULL;
+        }
+    }
+
+    netdev = dpdk_rte_mzalloc(sizeof *netdev);
+    if (netdev) {
+        return &netdev->up;
+    }
+    return NULL;
 }
 
 static void
@@ -657,7 +676,10 @@  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
 
     if (rte_eal_init_ret) {
-        return rte_eal_init_ret;
+        dpdk_init();
+        if (rte_eal_init_ret) {
+            return rte_eal_init_ret;
+        }
     }
 
     rte_spinlock_init(&netdev->vhost_tx_lock);
@@ -670,6 +692,13 @@  netdev_dpdk_vhost_cuse_construct(struct netdev *netdev_)
     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
     int err;
 
+    if (rte_eal_init_ret) {
+        dpdk_init();
+        if (rte_eal_init_ret) {
+            return rte_eal_init_ret;
+        }
+    }
+
     ovs_mutex_lock(&dpdk_mutex);
     strncpy(netdev->vhost_id, netdev->up.name, sizeof(netdev->vhost_id));
     err = vhost_construct_helper(netdev_);
@@ -683,6 +712,13 @@  netdev_dpdk_vhost_user_construct(struct netdev *netdev_)
     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
     int err;
 
+    if (rte_eal_init_ret) {
+        dpdk_init();
+        if (rte_eal_init_ret) {
+            return rte_eal_init_ret;
+        }
+    }
+
     ovs_mutex_lock(&dpdk_mutex);
     /* Take the name of the vhost-user port and append it to the location where
      * the socket is to be created, then register the socket.
@@ -707,7 +743,10 @@  netdev_dpdk_construct(struct netdev *netdev)
     int err;
 
     if (rte_eal_init_ret) {
-        return rte_eal_init_ret;
+        dpdk_init();
+        if (rte_eal_init_ret) {
+            return rte_eal_init_ret;
+        }
     }
 
     /* Names always start with "dpdk" */
@@ -1776,6 +1815,8 @@  new_device(struct virtio_net *dev)
     struct netdev_dpdk *netdev;
     bool exists = false;
 
+
+
     ovs_mutex_lock(&dpdk_mutex);
     /* Add device to the vhost port with the same name as that passed down. */
     LIST_FOR_EACH(netdev, list_node, &dpdk_list) {
@@ -2023,7 +2064,10 @@  netdev_dpdk_ring_construct(struct netdev *netdev)
     int err = 0;
 
     if (rte_eal_init_ret) {
-        return rte_eal_init_ret;
+        dpdk_init();
+        if (rte_eal_init_ret) {
+            return rte_eal_init_ret;
+        }
     }
 
     ovs_mutex_lock(&dpdk_mutex);
@@ -2111,10 +2155,14 @@  unlock_dpdk:
 
 static int
 process_vhost_flags(char *flag, char *default_val, int size,
-                    char **argv, char **new_val)
+                    const struct ovsrec_open_vswitch *ovs_cfg,
+                    char **new_val)
 {
+    const char *val;
     int changed = 0;
 
+    val = smap_get(&ovs_cfg->other_config, flag);
+
     /* Depending on which version of vhost is in use, process the vhost-specific
      * flag if it is provided on the vswitchd command line, otherwise resort to
      * a default value.
@@ -2124,9 +2172,9 @@  process_vhost_flags(char *flag, char *default_val, int size,
      * For vhost-cuse: Process "-cuse_dev_name" to set the custom name of the
      * vhost-cuse character device.
      */
-    if (!strcmp(argv[1], flag) && (strlen(argv[2]) <= size)) {
+    if (val && (strlen(val) <= size)) {
         changed = 1;
-        *new_val = strdup(argv[2]);
+        *new_val = strdup(val);
         VLOG_INFO("User-provided %s in use: %s", flag, *new_val);
     } else {
         VLOG_INFO("No %s provided - defaulting to %s", flag, default_val);
@@ -2136,68 +2184,115 @@  process_vhost_flags(char *flag, char *default_val, int size,
     return changed;
 }
 
-int
-dpdk_init(int argc, char **argv)
+static char **
+grow_argv(char ***argv, size_t cur_siz, size_t grow_by)
 {
-    int result;
-    int base = 0;
-    char *pragram_name = argv[0];
-    int err;
-    int isset;
-    cpu_set_t cpuset;
-
-    if (argc < 2 || strcmp(argv[1], "--dpdk"))
-        return 0;
+    char **new_argv = realloc(*argv, sizeof(char *) * (cur_siz + grow_by));
+    return new_argv;
+}
 
-    /* Remove the --dpdk argument from arg list.*/
-    argc--;
-    argv++;
+static int
+get_dpdk_args(const struct ovsrec_open_vswitch *ovs_cfg, char ***argv)
+{
+    struct dpdk_options_map {
+        const char *ovs_configuration;
+        const char *dpdk_option;
+        bool default_enabled;
+        const char *default_value;
+    } opts[] = {
+        {"dpdk-lcore-mask", "-c", true, "0x1"},
+        /* XXX: DPDK 2.2.0 support, the true should become false for -n */
+        {"dpdk-mem-channels", "-n", true, "4"},
+        {"dpdk-alloc-mem", "-m", false, NULL},
+        {"dpdk-socket-mem", "--socket-mem", true, "1024,0"},
+        {"dpdk-hugepage-dir", "--huge-dir", false, NULL},
+    };
+    int i, ret = 1;
+
+    for(i = 0; i < (sizeof(opts) / sizeof(opts[0])); ++i) {
+        const char *lookup = smap_get(&ovs_cfg->other_config,
+                                      opts[i].ovs_configuration);
+        if(!lookup && opts[i].default_enabled)
+            lookup = opts[i].default_value;
+
+        if(lookup) {
+            char **newargv = grow_argv(argv, ret, 2);
+
+            if (!newargv) {
+                ovs_abort(0, "grow_argv() failed to allocate memory.");
+            }
 
-    /* Reject --user option */
-    int i;
-    for (i = 0; i < argc; i++) {
-        if (!strcmp(argv[i], "--user")) {
-            VLOG_ERR("Can not mix --dpdk and --user options, aborting.");
+            *argv = newargv;
+            (*argv)[ret++] = strdup(opts[i].dpdk_option);
+            (*argv)[ret++] = strdup(lookup);
         }
     }
 
+    return ret;
+}
+
+static char **dpdk_argv;
+static int dpdk_argc;
+
+static void
+deferred_argv_release(void)
+{
+    int result;
+    for (result = 0; result < dpdk_argc; ++result) {
+        free(dpdk_argv[result]);
+    }
+
+    free(dpdk_argv);
+}
+
+static void
+__dpdk_init(const struct ovsrec_open_vswitch *ovs_cfg)
+{
+    char **argv = NULL;
+    int result;
+    int argc;
+    int err;
+    cpu_set_t cpuset;
+
+    VLOG_INFO("DPDK Enabled, initializing");
+
 #ifdef VHOST_CUSE
-    if (process_vhost_flags("-cuse_dev_name", strdup("vhost-net"),
-                            PATH_MAX, argv, &cuse_dev_name)) {
+    if (process_vhost_flags("cuse-dev-name", strdup("vhost-net"),
+                            PATH_MAX, ovs_cfg, &cuse_dev_name)) {
 #else
-    if (process_vhost_flags("-vhost_sock_dir", strdup(ovs_rundir()),
-                            NAME_MAX, argv, &vhost_sock_dir)) {
+    if (process_vhost_flags("vhost-sock-dir", strdup(ovs_rundir()),
+                            NAME_MAX, ovs_cfg, &vhost_sock_dir)) {
         struct stat s;
-        int err;
 
         err = stat(vhost_sock_dir, &s);
         if (err) {
-            VLOG_ERR("vHostUser socket DIR '%s' does not exist.",
-                     vhost_sock_dir);
-            return err;
+            ovs_abort(0, "vhost-user sock directory '%s' does not exist.",
+                      vhost_sock_dir);
         }
 #endif
-        /* Remove the vhost flag configuration parameters from the argument
-         * list, so that the correct elements are passed to the DPDK
-         * initialization function
-         */
-        argc -= 2;
-        argv += 2;    /* Increment by two to bypass the vhost flag arguments */
-        base = 2;
     }
 
     /* Get the main thread affinity */
     CPU_ZERO(&cpuset);
     err = pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
     if (err) {
-        VLOG_ERR("Thread getaffinity error %d.", err);
-        return err;
+        ovs_abort(0, "Thread getaffinity error %d.", err);
     }
 
-    /* Keep the program name argument as this is needed for call to
-     * rte_eal_init()
-     */
-    argv[0] = pragram_name;
+    argv = grow_argv(&argv, 0, 1);
+    if (!argv) {
+        ovs_abort(0, "Unable to allocate an initial argv.");
+    }
+    argv[0] = strdup("ovs"); /* TODO use prctl to get process name */
+    argc = get_dpdk_args(ovs_cfg, &argv);
+
+    argv = grow_argv(&argv, argc, argc+1);
+    if (!argv) {
+        ovs_abort(0, "Unable to make final argv allocation.");
+    }
+    argv[argc] = 0;
+
+    optind = 1;
 
     /* Make sure things are initialized ... */
     result = rte_eal_init(argc, argv);
@@ -2208,21 +2303,36 @@  dpdk_init(int argc, char **argv)
     /* Set the main thread affinity back to pre rte_eal_init() value */
     err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
     if (err) {
-        VLOG_ERR("Thread setaffinity error %d", err);
-        return err;
+        ovs_abort(0, "Thread getaffinity error %d.", err);
     }
 
+    dpdk_argv = argv;
+    dpdk_argc = argc;
+
+    atexit(deferred_argv_release);
+
     rte_memzone_dump(stdout);
     rte_eal_init_ret = 0;
 
-    if (argc > result) {
-        argv[result] = argv[0];
-    }
-
     /* We are called from the main thread here */
     RTE_PER_LCORE(_lcore_id) = NON_PMD_CORE_ID;
+}
 
-    return result + 1 + base;
+void
+dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg)
+{
+    ovs_last_cfg = ovs_cfg;
+}
+
+void
+dpdk_init(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+
+    if(ovs_last_cfg && ovsthread_once_start(&once)) {
+        __dpdk_init(ovs_last_cfg);
+        ovsthread_once_done(&once);
+    }
 }
 
 static const struct netdev_class dpdk_class =
@@ -2286,10 +2396,6 @@  netdev_dpdk_register(void)
 {
     static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
 
-    if (rte_eal_init_ret) {
-        return;
-    }
-
     if (ovsthread_once_start(&once)) {
         dpdk_common_init();
         netdev_register_provider(&dpdk_class);
diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
index 646d3e2..e699f37 100644
--- a/lib/netdev-dpdk.h
+++ b/lib/netdev-dpdk.h
@@ -22,7 +22,9 @@  struct dp_packet;
 
 #define NON_PMD_CORE_ID LCORE_ID_ANY
 
-int dpdk_init(int argc, char **argv);
+struct ovsrec_open_vswitch;
+
+void dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg);
 void netdev_dpdk_register(void);
 void free_dpdk_buf(struct dp_packet *);
 int pmd_thread_setaffinity_cpu(unsigned cpu);
@@ -30,15 +32,20 @@  int pmd_thread_setaffinity_cpu(unsigned cpu);
 #else
 
 #define NON_PMD_CORE_ID UINT32_MAX
-
 #include "util.h"
 
-static inline int
-dpdk_init(int argc, char **argv)
+static inline void
+dpdk_config(const struct ovsrec_open_vswitch *ovs_cfg)
 {
-    if (argc >= 2 && !strcmp(argv[1], "--dpdk")) {
-        ovs_fatal(0, "DPDK support not built into this copy of Open vSwitch.");
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+
+    if (ovs_cfg && ovsthread_once_start(&once)) {
+        if(smap_get_bool(&ovs_cfg->other_config, "dpdk", false)) {
+            ovs_fatal(0, "DPDK not supported in this copy of Open vSwitch.");
+        }
+        ovsthread_once_done(&once);
     }
+
     return 0;
 }
 
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index f8afe55..f7aab07 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -68,6 +68,7 @@ 
 #include "sflow_api.h"
 #include "vlan-bitmap.h"
 #include "packets.h"
+#include "lib/netdev-dpdk.h"
 
 VLOG_DEFINE_THIS_MODULE(bridge);
 
@@ -2922,6 +2923,8 @@  bridge_run(void)
     }
     cfg = ovsrec_open_vswitch_first(idl);
 
+    dpdk_config(cfg);
+
     /* Initialize the ofproto library.  This only needs to run once, but
      * it must be done after the configuration is set.  If the
      * initialization has already occurred, bridge_init_ofproto()
diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index e78ecda..c448107 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -48,7 +48,6 @@ 
 #include "openvswitch/vconn.h"
 #include "openvswitch/vlog.h"
 #include "lib/vswitch-idl.h"
-#include "lib/netdev-dpdk.h"
 
 VLOG_DEFINE_THIS_MODULE(vswitchd);
 
@@ -71,13 +70,6 @@  main(int argc, char *argv[])
     int retval;
 
     set_program_name(argv[0]);
-    retval = dpdk_init(argc,argv);
-    if (retval < 0) {
-        return retval;
-    }
-
-    argc -= retval;
-    argv += retval;
 
     ovs_cmdl_proctitle_init(argc, argv);
     service_start(&argc, &argv);
@@ -166,7 +158,7 @@  parse_options(int argc, char *argv[], char **unixctl_pathp)
         {"bootstrap-ca-cert", required_argument, NULL, OPT_BOOTSTRAP_CA_CERT},
         {"enable-dummy", optional_argument, NULL, OPT_ENABLE_DUMMY},
         {"disable-system", no_argument, NULL, OPT_DISABLE_SYSTEM},
-        {"dpdk", required_argument, NULL, OPT_DPDK},
+        {"dpdk", optional_argument, NULL, OPT_DPDK},
         {NULL, 0, NULL, 0},
     };
     char *short_options = ovs_cmdl_long_options_to_short_options(long_options);
@@ -219,7 +211,7 @@  parse_options(int argc, char *argv[], char **unixctl_pathp)
             exit(EXIT_FAILURE);
 
         case OPT_DPDK:
-            ovs_fatal(0, "--dpdk must be given at beginning of command line.");
+            ovs_fatal(0, "Using --dpdk to configure DPDK is not supported.");
             break;
 
         default:
@@ -256,17 +248,8 @@  usage(void)
     daemon_usage();
     vlog_usage();
     printf("\nDPDK options:\n"
-           "  --dpdk [VHOST] [DPDK]     Initialize DPDK datapath.\n"
-           "  where DPDK are options for initializing DPDK lib and VHOST is\n"
-#ifdef VHOST_CUSE
-           "  option to override default character device name used for\n"
-           "  for use with userspace vHost\n"
-           "    -cuse_dev_name NAME\n"
-#else
-           "  option to override default directory where vhost-user\n"
-           "  sockets are created.\n"
-           "    -vhost_sock_dir DIR\n"
-#endif
+           "Configuration of DPDK via command-line is removed from this\n"
+           "version of Open vSwitch. DPDK is configured through ovsdb.\n"
            );
     printf("\nOther options:\n"
            "  --unixctl=SOCKET          override default control socket name\n"
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index ce0dbc1..355da95 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -176,11 +176,38 @@ 
         </p>
       </column>
 
-      <column name="other_config" key="pmd-cpu-mask">
+      <column name="other_config" key="dpdk-lcore-mask"
+              type='{"type": "integer", "minInteger": 1}'>
+        <p>
+          Specifies the CPU cores on which dpdk lcore threads should be spawned.
+          The DPDK lcore threads are used for DPDK library tasks, such as
+          library internal message processing, logging, etc. Value should be in
+          the form of a hex string (so '0x123') similar to the 'taskset' mask
+          input.
+        </p>
+        <p>
+          The lowest order bit corresponds to the first CPU core.  A set bit
+          means the corresponding core is available and a pmd thread will be
+          created and pinned to it.  If the input does not cover all cores,
+          those uncovered cores are considered not set.
+        </p>
+        <p>
+          For performance reasons, it is best to set this to a single core on
+          the system, rather than allow lcore threads to float.
+        </p>
+        <p>
+          If not specified, the value will be determined by choosing the lowest
+          CPU core from initial cpu affinity list. Otherwise, the value will be
+          passed directly to the DPDK library.
+        </p>
+      </column>
+
+      <column name="other_config" key="pmd-cpu-mask"
+              type='{"type": "integer", "minInteger": 1}'>
         <p>
           Specifies CPU mask for setting the cpu affinity of PMD (Poll
           Mode Driver) threads.  Value should be in the form of hex string,
-          similar to the dpdk EAL '-c COREMASK' option input or the 'taskset'
+          similar to the dpdk-lcore-mask option input or the 'taskset'
           mask input.
         </p>
         <p>
@@ -195,6 +222,77 @@ 
         </p>
       </column>
 
+      <column name="other_config" key="dpdk-mem-channels"
+              type='{"type": "integer", "minInteger": 1}'>
+        <p>
+          Specifies the number of NUMA memory spread channels in the CPU to
+          be used by DPDK. It is purely optimization.
+        </p>
+        <p>
+          The default value is 4.
+        </p>
+      </column>
+
+      <column name="other_config" key="dpdk-alloc-mem"
+              type='{"type": "integer", "minInteger": 0}'>
+        <p>
+          Specifies the amount of memory to preallocate from the hugepage pool,
+          regardless of socket. It is recommended that dpdk-socket-mem is used
+          instead.
+        </p>
+        <p>
+          If not specified, the value is 0.
+        </p>
+      </column>
+
+      <column name="other_config" key="dpdk-socket-mem"
+              type='{"type": "string"}'>
+        <p>
+          Specifies the amount of memory to preallocate from the hugepage pool,
+          on a per-socket basis.
+        </p>
+        <p>
+          The specifier is a comma-separated string, in ascending order of CPU
+          socket (ex: 1024,2048,4096,8192 would set socket 0 to preallocate
+          1024MB, socket 1 to preallocate 2048MB, etc.)
+        </p>
+        <p>
+          If not specified, the default value is 1024,0.
+        </p>
+      </column>
+
+      <column name="other_config" key="dpdk-hugepage-dir"
+              type='{"type": "string"}'>
+        <p>
+          Specifies the path to the hugetlbfs mount point.
+        </p>
+        <p>
+          If not specified, this will be guessed by the DPDK library (default
+          is /dev/hugepages).
+        </p>
+      </column>
+
+      <column name="other_config" key="cuse-dev-name"
+              type='{"type": "string"}'>
+        <p>
+          Specifies the name of the vhost-cuse character device to open for
+          vhost-cuse support.
+        </p>
+        <p>
+          The default is vhost-net
+        </p>
+      </column>
+
+      <column name="other_config" key="vhost-sock-dir"
+              type='{"type": "string"}'>
+        <p>
+          Specifies the path to the vhost-user unix domain socket files.
+        </p>
+        <p>
+          The default is the working directory of the application.
+        </p>
+      </column>
+
       <column name="other_config" key="n-handler-threads"
               type='{"type": "integer", "minInteger": 1}'>
         <p>