
[ovs-dev,v5] dpdk: Support running PMD threads on any core.

Message ID 20201229175641.31200-1-david.marchand@redhat.com
State Superseded
Series [ovs-dev,v5] dpdk: Support running PMD threads on any core.

Commit Message

David Marchand Dec. 29, 2020, 5:56 p.m. UTC
DPDK 20.08 introduced a new API that associates a non-EAL thread to a free
lcore. This new API does not change the thread characteristics (like CPU
affinity).
Using this new API, there is no assumption on lcore X running on cpu X
anymore which leaves OVS free from running its PMD thread on any cpu.
The DPDK multiprocess feature is not compatible with this new API and is
disabled.

DPDK still limits the number of lcores to RTE_MAX_LCORE (128 on x86_64)
which should be enough for OVS pmd threads (hopefully).

DPDK lcore/OVS PMD thread mappings are logged at thread creation and
destruction.
A new command is added to help get the DPDK point of view of the DPDK lcores:

$ ovs-appctl dpdk/lcores-list
lcore 0, socket 0, role RTE, cpuset 0
lcore 1, socket 0, role NON_EAL, cpuset 1
lcore 2, socket 0, role NON_EAL, cpuset 15

Signed-off-by: David Marchand <david.marchand@redhat.com>
---
Changes since v4:
- rebased on the master branch,
- disabled DPDK mp feature,
- updated DPDK documentation and manual with the new command,
- added notes in NEWS,

Changes since v3:
- rebased on current HEAD,
- switched back to simple warning rather than abort when registering a
  thread fails,

Changes since v2:
- introduced a new api in DPDK 20.08 (still being discussed), inbox thread at
  http://inbox.dpdk.org/dev/20200610144506.30505-1-david.marchand@redhat.com/T/#t
- this current patch depends on a patch on master I sent:
  https://patchwork.ozlabs.org/project/openvswitch/patch/20200626122738.28163-1-david.marchand@redhat.com/
- dropped 'dpdk-lcore-mask' compat handling,

Changes since v1:
- rewired existing configuration 'dpdk-lcore-mask' to use --lcores,
- switched to a bitmap to track lcores,
- added a command to dump current mapping (Flavio): used an experimental
  API to get DPDK lcores cpuset since it is the most reliable/portable
  information,
- used the same code for the logs when starting DPDK/PMD threads,
- addressed Ilya comments,

---
 Documentation/howto/dpdk.rst |  5 +++++
 NEWS                         |  2 ++
 lib/dpdk-stub.c              |  8 ++++++-
 lib/dpdk-unixctl.man         |  2 ++
 lib/dpdk.c                   | 42 ++++++++++++++++++++++++++++++++++--
 lib/dpdk.h                   |  3 ++-
 lib/dpif-netdev.c            |  3 ++-
 7 files changed, 60 insertions(+), 5 deletions(-)

Comments

Stokes, Ian Jan. 12, 2021, 4:24 p.m. UTC | #1
Hi David,

Thanks for the patch, a few comments and queries inline.

> DPDK 20.08 introduced a new API that associates a non-EAL thread to a free
> lcore. This new API does not change the thread characteristics (like CPU
> affinity).
> Using this new API, there is no assumption on lcore X running on cpu X
> anymore which leaves OVS free from running its PMD thread on any cpu.

Possibly just the wording above, " which leaves OVS free from running it's PMD on any cpu".
I thought OVS was allowed to run its PMDs on any core anyway as long as they were <= RTE_MAX_LCORE.
Does the above mean free to run on any core greater than RTE_MAX_LCORE?

> The DPDK multiprocess feature is not compatible with this new API and is
> disabled.
> 

I think that's OK. I'm trying to think of where this would be a use case; since PDUMP was removed I'm not aware of any other case.
I would assume we are still allowed to run other DPDK applications on the same host as long as --file-prefix is specified?

For example I've come across people testing with TESTPMD and virtio vdevs on the same host as OVS DPDK in the past.
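Such side-by-side runs rely on each DPDK process getting its own EAL runtime namespace; a sketch of what that could look like (binary name, core list and prefix are arbitrary choices, not taken from this thread):

```shell
# Hypothetical: run testpmd next to OVS-DPDK without the two EAL
# instances colliding. A distinct --file-prefix keeps testpmd's
# hugepage/runtime files separate from OVS's default prefix.
dpdk-testpmd -l 10-11 --file-prefix=testpmd0 -- -i
```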

Out of interest, did you give this patch a run with the OVS DPDK unit tests?

> DPDK still limits the number of lcores to RTE_MAX_LCORE (128 on x86_64)
> which should be enough for OVS pmd threads (hopefully).

I'm hard pressed to think of a case where it would not be sufficient tbh.

> 
> DPDK lcore/OVS pmd threads mapping are logged at threads creation and
> destruction.
> A new command is added to help get DPDK point of view of the DPDK lcores:
> 
> $ ovs-appctl dpdk/lcores-list
> lcore 0, socket 0, role RTE, cpuset 0
> lcore 1, socket 0, role NON_EAL, cpuset 1
> lcore 2, socket 0, role NON_EAL, cpuset 15
> 
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> ---
> Changes since v4:
> - rebased on the master branch,
> - disabled DPDK mp feature,
> - updated DPDK documentation and manual with the new command,
> - added notes in NEWS,
> 
> Changes since v3:
> - rebased on current HEAD,
> - switched back to simple warning rather than abort when registering a
>   thread fails,
> 
> Changes since v2:
> - introduced a new api in DPDK 20.08 (still being discussed), inbox thread at
>   http://inbox.dpdk.org/dev/20200610144506.30505-1-david.marchand@redhat.com/T/#t
> - this current patch depends on a patch on master I sent:
>   https://patchwork.ozlabs.org/project/openvswitch/patch/20200626122738.28163-1-david.marchand@redhat.com/
> - dropped 'dpdk-lcore-mask' compat handling,
> 
> Changes since v1:
> - rewired existing configuration 'dpdk-lcore-mask' to use --lcores,
> - switched to a bitmap to track lcores,
> - added a command to dump current mapping (Flavio): used an experimental
>   API to get DPDK lcores cpuset since it is the most reliable/portable
>   information,
> - used the same code for the logs when starting DPDK/PMD threads,
> - addressed Ilya comments,
> 
> ---
>  Documentation/howto/dpdk.rst |  5 +++++
>  NEWS                         |  2 ++
>  lib/dpdk-stub.c              |  8 ++++++-
>  lib/dpdk-unixctl.man         |  2 ++
>  lib/dpdk.c                   | 42 ++++++++++++++++++++++++++++++++++--
>  lib/dpdk.h                   |  3 ++-
>  lib/dpif-netdev.c            |  3 ++-
>  7 files changed, 60 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
> index f0d45e47b6..8492f354ce 100644
> --- a/Documentation/howto/dpdk.rst
> +++ b/Documentation/howto/dpdk.rst
> @@ -399,6 +399,11 @@ Supported actions for hardware offload are:
>  - Modification of IPv6 (set_field:<ADDR>->ipv6_src/ipv6_dst/mod_nw_ttl).
>  - Clone/output (tnl_push and output) for encapsulating over a tunnel.
> 
> +Multiprocess
> +------------
> +
> +This DPDK feature is not supported and disabled during OVS initialization.
> +
>  Further Reading
>  ---------------
> 
> diff --git a/NEWS b/NEWS
> index d357da31d8..f59311ab63 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -12,6 +12,8 @@ Post-v2.14.0
>     - DPDK:
>       * Removed support for vhost-user dequeue zero-copy.
>       * Add support for DPDK 20.11.
> +     * Forbid use of DPDK multiprocess feature.
> +     * Add support for running threads on cores > RTE_MAX_LCORE.
>     - Userspace datapath:
>       * Add the 'pmd' option to "ovs-appctl dpctl/dump-flows", which
>         restricts a flow dump to a single PMD thread if set.
> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> index b7d577870d..5bc996b665 100644
> --- a/lib/dpdk-stub.c
> +++ b/lib/dpdk-stub.c
> @@ -39,7 +39,13 @@ dpdk_init(const struct smap *ovs_other_config)
>  }
> 
>  void
> -dpdk_set_lcore_id(unsigned cpu OVS_UNUSED)
> +dpdk_init_thread_context(unsigned cpu OVS_UNUSED)
> +{
> +    /* Nothing */
> +}
> +
> +void
> +dpdk_uninit_thread_context(void)
>  {
>      /* Nothing */
>  }
> diff --git a/lib/dpdk-unixctl.man b/lib/dpdk-unixctl.man
> index 2d6d576f24..c28bb4ef20 100644
> --- a/lib/dpdk-unixctl.man
> +++ b/lib/dpdk-unixctl.man
> @@ -10,5 +10,7 @@ list of words separated by spaces: a word can be either a logging \fBlevel\fR
>  \fBnotice\fR, \fBinfo\fR or \fBdebug\fR) or a \fBpattern\fR matching DPDK
>  components (see \fBdpdk/log-list\fR command on \fBovs\-appctl\fR(8)) separated
>  by a colon from the logging \fBlevel\fR to apply.
> +.IP "\fBdpdk/lcores-list\fR"
> +Lists the DPDK lcores and their cpu affinity.

Minor nit, is it worth mentioning "role type" as well in the description above?

>  .RE
>  .
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 319540394b..defac3f7de 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -14,6 +14,10 @@
>   * limitations under the License.
>   */
> 
> +#ifndef ALLOW_EXPERIMENTAL_API
> +#define ALLOW_EXPERIMENTAL_API
> +#endif
> +

So a bit of a wider question, how does the community feel about allowing experimental API into the code base?

What is the feeling about the API in DPDK? Will there be further modifications to calls such as rte_mp_disable?

Will the experimental tag be removed in a future release (e.g. 21.02)?

In the past we made an exception for the rte_meter library and experimental API because it was replacing original rte_meter functionality OVS already used and I think it was an honest oversight to remove the experimental tag for the 19.11 release for those functions.

But this could open up the door to accepting experimental APIs in other areas of the OVS code base from DPDK also.

I raised this at the conference in December; genuinely interested to hear people's thoughts.

>  #include <config.h>
>  #include "dpdk.h"
> 
> @@ -502,6 +506,12 @@ dpdk_init__(const struct smap *ovs_other_config)
>          return false;
>      }
> 
> +    if (!rte_mp_disable()) {
> +        VLOG_EMER("Could not disable multiprocess, DPDK won't be available.");
> +        rte_eal_cleanup();
> +        return false;
> +    }
> +
>      if (VLOG_IS_DBG_ENABLED()) {
>          size_t size;
>          char *response = NULL;
> @@ -525,6 +535,8 @@ dpdk_init__(const struct smap *ovs_other_config)
>                               dpdk_unixctl_mem_stream, rte_log_dump);
>      unixctl_command_register("dpdk/log-set", "{level | pattern:level}", 0,
>                               INT_MAX, dpdk_unixctl_log_set, NULL);
> +    unixctl_command_register("dpdk/lcores-list", "", 0, 0,
> +                             dpdk_unixctl_mem_stream, rte_lcore_dump);
> 
>      /* We are called from the main thread here */
>      RTE_PER_LCORE(_lcore_id) = NON_PMD_CORE_ID;
> @@ -601,11 +613,37 @@ dpdk_available(void)
>  }
> 
>  void
> -dpdk_set_lcore_id(unsigned cpu)
> +dpdk_init_thread_context(unsigned cpu)
>  {
>      /* NON_PMD_CORE_ID is reserved for use by non pmd threads. */
>      ovs_assert(cpu != NON_PMD_CORE_ID);
> -    RTE_PER_LCORE(_lcore_id) = cpu;
> +
> +    if (!dpdk_available()) {
> +        return;
> +    }
> +
> +    if (rte_thread_register() < 0) {
> +        VLOG_WARN("This OVS pmd thread will share resources with the non-pmd "
> +                  "thread: %s.", rte_strerror(rte_errno));
> +    } else {
> +        VLOG_INFO("PMD thread uses DPDK lcore %u.", rte_lcore_id());
> +    }
> +}

I'd like to do a little more testing around the case where the PMD thread is shared with a non-pmd thread above.

Do you typically see this use case?

> +
> +void
> +dpdk_uninit_thread_context(void)
> +{
> +    unsigned int lcore_id;
> +
> +    if (!dpdk_available()) {
> +        return;
> +    }
> +
> +    lcore_id = rte_lcore_id();
> +    rte_thread_unregister();
> +    if (lcore_id != LCORE_ID_ANY) {
> +        VLOG_INFO("PMD thread released DPDK lcore %u.", lcore_id);
> +    }
>  }
> 
>  void
> diff --git a/lib/dpdk.h b/lib/dpdk.h
> index 445a51d065..1bd16b31db 100644
> --- a/lib/dpdk.h
> +++ b/lib/dpdk.h
> @@ -36,7 +36,8 @@ struct smap;
>  struct ovsrec_open_vswitch;
> 
>  void dpdk_init(const struct smap *ovs_other_config);
> -void dpdk_set_lcore_id(unsigned cpu);
> +void dpdk_init_thread_context(unsigned cpu);
> +void dpdk_uninit_thread_context(void);
>  const char *dpdk_get_vhost_sock_dir(void);
>  bool dpdk_vhost_iommu_enabled(void);
>  bool dpdk_vhost_postcopy_enabled(void);
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 300861ca59..da9e1e8fcf 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -5926,7 +5926,7 @@ pmd_thread_main(void *f_)
>      /* Stores the pmd thread's 'pmd' to 'per_pmd_key'. */
>      ovsthread_setspecific(pmd->dp->per_pmd_key, pmd);
>      ovs_numa_thread_setaffinity_core(pmd->core_id);
> -    dpdk_set_lcore_id(pmd->core_id);
> +    dpdk_init_thread_context(pmd->core_id);
>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>      dfc_cache_init(&pmd->flow_cache);
>      pmd_alloc_static_tx_qid(pmd);
> @@ -6061,6 +6061,7 @@ reload:
>      dfc_cache_uninit(&pmd->flow_cache);
>      free(poll_list);
>      pmd_free_cached_ports(pmd);
> +    dpdk_uninit_thread_context();
>      return NULL;
>  }
> 

Again, I've a small bit more testing to do; so far it seems OK to me. I would just like to get the experimental conversation going, as I see it as one of the potential blockers.

Regards
Ian
> --
> 2.23.0
Kevin Traynor Jan. 15, 2021, 10:39 a.m. UTC | #2
On 12/01/2021 16:24, Stokes, Ian wrote:
> Hi David,
> 
> Thanks for the patch, a few comments and queries inline.
> 
>> DPDK 20.08 introduced a new API that associates a non-EAL thread to a free
>> lcore. This new API does not change the thread characteristics (like CPU
>> affinity).
>> Using this new API, there is no assumption on lcore X running on cpu X
>> anymore which leaves OVS free from running its PMD thread on any cpu.
> 
> Possibly just the wording above, " which leaves OVS free from running it's PMD on any cpu".
> I thought OVS was allowed to run its PMDs on any core anyway as long as they were <= RTE_MAX_LCORE.
> Does above mean free to run on any core greater than RTE_MAX_LCORE?.
> 
>> The DPDK multiprocess feature is not compatible with this new API and is
>> disabled.
>>
> 
> I think that's OK, I'm trying to think where this is a use case, since PDUMP was removed I'm not aware of any other case.
> I would assume we are still allowed to run another DPDK applications on the same host as long as the --file-prefix was specified?
> 
> For example I've come across people testing with TESTPMD and virtio vdevs on the same host as OVS DPDK in the past.
> 
> Out of interest, did you give this patch a run with the OVS DPDK unit tests?
> 
>> DPDK still limits the number of lcores to RTE_MAX_LCORE (128 on x86_64)
>> which should be enough for OVS pmd threads (hopefully).
> 
> I'm hard pressed to think of a case where it would not be sufficient tbh.
> 
>>
>> DPDK lcore/OVS pmd threads mapping are logged at threads creation and
>> destruction.
>> A new command is added to help get DPDK point of view of the DPDK lcores:
>>
>> $ ovs-appctl dpdk/lcores-list
>> lcore 0, socket 0, role RTE, cpuset 0
>> lcore 1, socket 0, role NON_EAL, cpuset 1
>> lcore 2, socket 0, role NON_EAL, cpuset 15
>>
>> Signed-off-by: David Marchand <david.marchand@redhat.com>

I didn't re-test this time due to a local issue, but the additional
explicit call to disable mp and documentation LGTM.

Acked-by: Kevin Traynor <ktraynor@redhat.com>

Comment about experimental API below.

>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>> index 319540394b..defac3f7de 100644
>> --- a/lib/dpdk.c
>> +++ b/lib/dpdk.c
>> @@ -14,6 +14,10 @@
>>   * limitations under the License.
>>   */
>>
>> +#ifndef ALLOW_EXPERIMENTAL_API
>> +#define ALLOW_EXPERIMENTAL_API
>> +#endif
>> +
> 
> So a bit of a wider question, how does the community feel about allowing experimental API into the code base?
> 
> What is the feeling with the API in DPDK, will there be further modification to calls such as rte_mp_disable?
> 
> Will the experimental tag be removed in a future releases (e.g. 21.02).
> 
> In the past we made an exception for the rte_meter library and experimental API because it was replacing original rte_meter functionality OVS already used and I think it was an honest oversight to remove the experimental tag for the 19.11 release for those functions.
> 
> But this could open up the door to accepting experimental APIs in other areas of the OVS code base from DPDK also.
> 
> I raised this at the conference in December, genuinely interested to hear peoples thoughts?
> 

I think it should be considered case by case:
- What is the benefit of the feature it enables?
- Is the API likely to change?
- What is the likely impact to OVS of it changing, in terms of code churn
and features?
- Is the author willing to rework OVS if it changes?

It is also not just a one-way thing. OVS using the experimental API can
feed back to either solidify the API or identify gaps. I think OVS
using the API successfully can be a strong influence on it becoming
stable in DPDK.

David Marchand Sept. 27, 2021, 2:57 p.m. UTC | #3
On Tue, Jan 12, 2021 at 5:24 PM Stokes, Ian <ian.stokes@intel.com> wrote:
> > DPDK 20.08 introduced a new API that associates a non-EAL thread to a free
> > lcore. This new API does not change the thread characteristics (like CPU
> > affinity).
> > Using this new API, there is no assumption on lcore X running on cpu X
> > anymore which leaves OVS free from running its PMD thread on any cpu.
>
> Possibly just the wording above, " which leaves OVS free from running it's PMD on any cpu".
> I thought OVS was allowed to run its PMDs on any core anyway as long as they were <= RTE_MAX_LCORE.

True, but with '<' RTE_MAX_LCORE, not '<=' (which makes me realise my NEWS
update must be fixed).


> Does above mean free to run on any core greater than RTE_MAX_LCORE?.

Yep. I'll reword.


>
> > The DPDK multiprocess feature is not compatible with this new API and is
> > disabled.
> >
>
> I think that's OK, I'm trying to think where this is a use case, since PDUMP was removed I'm not aware of any other case.
> I would assume we are still allowed to run another DPDK applications on the same host as long as the --file-prefix was specified?

Indeed.


>
> For example I've come across people testing with TESTPMD and virtio vdevs on the same host as OVS DPDK in the past.
>
> Out of interest, did you give this patch a run with the OVS DPDK unit tests?

At the time I posted this series, I had run it through "normal" unit
tests, but I am not sure I checked the system-dpdk testsuite.
I ran it now, hit issues unrelated to my patch and sent fixes.


> >  .RE
> >  .
> > diff --git a/lib/dpdk.c b/lib/dpdk.c
> > index 319540394b..defac3f7de 100644
> > --- a/lib/dpdk.c
> > +++ b/lib/dpdk.c
> > @@ -14,6 +14,10 @@
> >   * limitations under the License.
> >   */
> >
> > +#ifndef ALLOW_EXPERIMENTAL_API
> > +#define ALLOW_EXPERIMENTAL_API
> > +#endif
> > +
>
> So a bit of a wider question, how does the community feel about allowing experimental API into the code base?

I reworked this with a #pragma trick (see v6) that makes it possible
to select the experimental APIs we judge acceptable in OVS.


>
> What is the feeling with the API in DPDK, will there be further modification to calls such as rte_mp_disable?

There have been none since the introduction of the API.

Besides, we are now considering the --in-memory flag for OVS, which
implicitly disables MP.
https://patchwork.ozlabs.org/project/openvswitch/patch/20210806124432.43042-1-roriorde@redhat.com/#2733808


>
> Will the experimental tag be removed in a future releases (e.g. 21.02).

It could be marked stable based on OVS usage/feedback.
No other open source project uses this API, which is (my) DPDK concern
and the reason why it is still experimental atm.


>
> In the past we made an exception for the rte_meter library and experimental API because it was replacing original rte_meter functionality OVS already used and I think it was an honest oversight to remove the experimental tag for the 19.11 release for those functions.

The rte_meter episode served as an example.
We now have the tools and guidance in DPDK to preserve an experimental
alias if we need one for rte_thread_register() and associated functions.
https://doc.dpdk.org/guides/contributing/abi_policy.html?highlight=experimental#abi-changes


>
> But this could open up the door to accepting experimental APIs in other areas of the OVS code base from DPDK also.
>
> I raised this at the conference in December, genuinely interested to hear peoples thoughts?

I don't think there is an absolute answer to this question.
We can only judge case by case.

In this specific case, we are in a chicken-and-egg situation.
The rte_thread_register() API was added to DPDK for OVS.
But OVS won't consume it unless DPDK marks it stable.
And DPDK expects feedback before marking it stable.

Patch

diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
index f0d45e47b6..8492f354ce 100644
--- a/Documentation/howto/dpdk.rst
+++ b/Documentation/howto/dpdk.rst
@@ -399,6 +399,11 @@  Supported actions for hardware offload are:
 - Modification of IPv6 (set_field:<ADDR>->ipv6_src/ipv6_dst/mod_nw_ttl).
 - Clone/output (tnl_push and output) for encapsulating over a tunnel.
 
+Multiprocess
+------------
+
+This DPDK feature is not supported and disabled during OVS initialization.
+
 Further Reading
 ---------------
 
diff --git a/NEWS b/NEWS
index d357da31d8..f59311ab63 100644
--- a/NEWS
+++ b/NEWS
@@ -12,6 +12,8 @@  Post-v2.14.0
    - DPDK:
      * Removed support for vhost-user dequeue zero-copy.
      * Add support for DPDK 20.11.
+     * Forbid use of DPDK multiprocess feature.
+     * Add support for running threads on cores > RTE_MAX_LCORE.
    - Userspace datapath:
      * Add the 'pmd' option to "ovs-appctl dpctl/dump-flows", which
        restricts a flow dump to a single PMD thread if set.
diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
index b7d577870d..5bc996b665 100644
--- a/lib/dpdk-stub.c
+++ b/lib/dpdk-stub.c
@@ -39,7 +39,13 @@  dpdk_init(const struct smap *ovs_other_config)
 }
 
 void
-dpdk_set_lcore_id(unsigned cpu OVS_UNUSED)
+dpdk_init_thread_context(unsigned cpu OVS_UNUSED)
+{
+    /* Nothing */
+}
+
+void
+dpdk_uninit_thread_context(void)
 {
     /* Nothing */
 }
diff --git a/lib/dpdk-unixctl.man b/lib/dpdk-unixctl.man
index 2d6d576f24..c28bb4ef20 100644
--- a/lib/dpdk-unixctl.man
+++ b/lib/dpdk-unixctl.man
@@ -10,5 +10,7 @@  list of words separated by spaces: a word can be either a logging \fBlevel\fR
 \fBnotice\fR, \fBinfo\fR or \fBdebug\fR) or a \fBpattern\fR matching DPDK
 components (see \fBdpdk/log-list\fR command on \fBovs\-appctl\fR(8)) separated
 by a colon from the logging \fBlevel\fR to apply.
+.IP "\fBdpdk/lcores-list\fR"
+Lists the DPDK lcores and their cpu affinity.
 .RE
 .
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 319540394b..defac3f7de 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -14,6 +14,10 @@ 
  * limitations under the License.
  */
 
+#ifndef ALLOW_EXPERIMENTAL_API
+#define ALLOW_EXPERIMENTAL_API
+#endif
+
 #include <config.h>
 #include "dpdk.h"
 
@@ -502,6 +506,12 @@  dpdk_init__(const struct smap *ovs_other_config)
         return false;
     }
 
+    if (!rte_mp_disable()) {
+        VLOG_EMER("Could not disable multiprocess, DPDK won't be available.");
+        rte_eal_cleanup();
+        return false;
+    }
+
     if (VLOG_IS_DBG_ENABLED()) {
         size_t size;
         char *response = NULL;
@@ -525,6 +535,8 @@  dpdk_init__(const struct smap *ovs_other_config)
                              dpdk_unixctl_mem_stream, rte_log_dump);
     unixctl_command_register("dpdk/log-set", "{level | pattern:level}", 0,
                              INT_MAX, dpdk_unixctl_log_set, NULL);
+    unixctl_command_register("dpdk/lcores-list", "", 0, 0,
+                             dpdk_unixctl_mem_stream, rte_lcore_dump);
 
     /* We are called from the main thread here */
     RTE_PER_LCORE(_lcore_id) = NON_PMD_CORE_ID;
@@ -601,11 +613,37 @@  dpdk_available(void)
 }
 
 void
-dpdk_set_lcore_id(unsigned cpu)
+dpdk_init_thread_context(unsigned cpu)
 {
     /* NON_PMD_CORE_ID is reserved for use by non pmd threads. */
     ovs_assert(cpu != NON_PMD_CORE_ID);
-    RTE_PER_LCORE(_lcore_id) = cpu;
+
+    if (!dpdk_available()) {
+        return;
+    }
+
+    if (rte_thread_register() < 0) {
+        VLOG_WARN("This OVS pmd thread will share resources with the non-pmd "
+                  "thread: %s.", rte_strerror(rte_errno));
+    } else {
+        VLOG_INFO("PMD thread uses DPDK lcore %u.", rte_lcore_id());
+    }
+}
+
+void
+dpdk_uninit_thread_context(void)
+{
+    unsigned int lcore_id;
+
+    if (!dpdk_available()) {
+        return;
+    }
+
+    lcore_id = rte_lcore_id();
+    rte_thread_unregister();
+    if (lcore_id != LCORE_ID_ANY) {
+        VLOG_INFO("PMD thread released DPDK lcore %u.", lcore_id);
+    }
 }
 
 void
diff --git a/lib/dpdk.h b/lib/dpdk.h
index 445a51d065..1bd16b31db 100644
--- a/lib/dpdk.h
+++ b/lib/dpdk.h
@@ -36,7 +36,8 @@  struct smap;
 struct ovsrec_open_vswitch;
 
 void dpdk_init(const struct smap *ovs_other_config);
-void dpdk_set_lcore_id(unsigned cpu);
+void dpdk_init_thread_context(unsigned cpu);
+void dpdk_uninit_thread_context(void);
 const char *dpdk_get_vhost_sock_dir(void);
 bool dpdk_vhost_iommu_enabled(void);
 bool dpdk_vhost_postcopy_enabled(void);
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 300861ca59..da9e1e8fcf 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -5926,7 +5926,7 @@  pmd_thread_main(void *f_)
     /* Stores the pmd thread's 'pmd' to 'per_pmd_key'. */
     ovsthread_setspecific(pmd->dp->per_pmd_key, pmd);
     ovs_numa_thread_setaffinity_core(pmd->core_id);
-    dpdk_set_lcore_id(pmd->core_id);
+    dpdk_init_thread_context(pmd->core_id);
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
     dfc_cache_init(&pmd->flow_cache);
     pmd_alloc_static_tx_qid(pmd);
@@ -6061,6 +6061,7 @@  reload:
     dfc_cache_uninit(&pmd->flow_cache);
     free(poll_list);
     pmd_free_cached_ports(pmd);
+    dpdk_uninit_thread_context();
     return NULL;
 }