
[ovs-dev] netdev-dpdk: Use intermediate queue during packet transmission.

Message ID 1481899723-54484-1-git-send-email-bhanuprakash.bodireddy@intel.com
State Changes Requested
Delegated to: Daniele Di Proietto

Commit Message

Bodireddy, Bhanuprakash Dec. 16, 2016, 2:48 p.m. UTC
In exact match cache processing, on an EMC hit, packets are queued into
batches matching the flow. Thereafter, packets are processed in batches
for faster packet processing. This is particularly inefficient if there
are fewer packets in a batch, as rte_eth_tx_burst() incurs an expensive
MMIO write.

This commit adds back the intermediate queue implementation. Packets are
queued and burst when the packet count >= NETDEV_MAX_BURST. The drain
logic is also refactored to handle fewer packets in the tx queues. Testing
shows significant performance gains with queueing.

Fixes: b59cc14e032d ("netdev-dpdk: Use instant sending instead of
queueing of packets.")
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
---
This patch is based on commit '8b6987d799fb0bc530ebb7f767767b1c661548c9'
and [PATCH v2 00/19] DPDK/pmd reconfiguration refactor and bugfixes

Test results:
 * In the worst-case scenario with fewer packets in a batch, a significant
   bottleneck is observed at the netdev_dpdk_eth_send() function due to
   expensive MMIO writes.

 * It is also observed that the CPI (cycles per instruction) rate for the
   function stood between 3.15 and 4.1, which is significantly higher than the
   acceptable limit of 1.0 for HPC applications and the theoretical limit of
   0.25 (as the backend pipeline can retire 4 micro-operations per cycle).

 * With this patch, the CPI for netdev_dpdk_eth_send() is at 0.55 and the
   overall throughput improved significantly.
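
To make the queueing scheme concrete, below is a minimal, self-contained
sketch of the idea (names and sizes are illustrative; the actual patch
buffers per DPDK tx queue and flushes via rte_eth_tx_burst()):

    #include <stdio.h>

    #define NETDEV_MAX_BURST 32

    struct pkt { int id; };                     /* stand-in packet */

    struct txq_buf {
        struct pkt *pkts[NETDEV_MAX_BURST];     /* queued, not yet sent */
        int count;
    };

    /* Stand-in for the expensive rte_eth_tx_burst() MMIO doorbell write. */
    static void
    hw_tx_burst(struct pkt **pkts, int n)
    {
        (void) pkts;
        printf("burst of %d packets\n", n);
    }

    /* Queue one packet; burst only once a full batch has accumulated. */
    static void
    txq_enqueue(struct txq_buf *txq, struct pkt *p)
    {
        txq->pkts[txq->count++] = p;
        if (txq->count >= NETDEV_MAX_BURST) {
            hw_tx_burst(txq->pkts, txq->count);
            txq->count = 0;
        }
    }

    /* Periodic drain so short batches do not sit in the buffer forever. */
    static void
    txq_drain(struct txq_buf *txq)
    {
        if (txq->count) {
            hw_tx_burst(txq->pkts, txq->count);
            txq->count = 0;
        }
    }

    int
    main(void)
    {
        struct txq_buf txq = { .count = 0 };
        struct pkt p[40];

        for (int i = 0; i < 40; i++) {
            p[i].id = i;
            txq_enqueue(&txq, &p[i]);           /* bursts once at 32 packets */
        }
        txq_drain(&txq);                        /* flushes the remaining 8 */
        return 0;
    }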

 lib/dpif-netdev.c     | 42 +++++++++++++++++++++++++++++-
 lib/netdev-bsd.c      |  1 +
 lib/netdev-dpdk.c     | 71 +++++++++++++++++++++++++++++++++++++++++++++------
 lib/netdev-dummy.c    |  1 +
 lib/netdev-linux.c    |  1 +
 lib/netdev-provider.h |  3 +++
 lib/netdev-vport.c    |  3 ++-
 lib/netdev.c          |  9 +++++++
 lib/netdev.h          |  1 +
 9 files changed, 122 insertions(+), 10 deletions(-)

Comments

Aaron Conole Dec. 16, 2016, 7:24 p.m. UTC | #1
Hi Bhanu,

Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> writes:

> In exact match cache processing on an EMC hit, packets are queued in to
> batches matching the flow. Thereafter, packets are processed in batches
> for faster packet processing. This particularly is inefficient if there
> are fewer packets in a batch as rte_eth_tx_burst() incurs expensive MMIO
> write.
>
> This commit adds back intermediate queue implementation. Packets are
> queued and burst when the packet count >= NETDEV_MAX_BURST. Also drain
> logic is refactored to handle fewer packets in the tx queues. Testing
> shows significant performance gains with queueing.
>
> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
> queueing of packets.")
> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
> Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
> Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
> ---

I've Cc'd Ilya just in hopes that the patch gets a better review than I
could give.  As a general comment, I like the direction - batched
operations are usually a better way to go.

Just a minor note below.

... snip ...
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 3509493..65dff83 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
>  static inline bool emc_entry_alive(struct emc_entry *ce);
>  static void emc_clear_entry(struct emc_entry *ce);
>  
> +static struct tx_port *pmd_send_port_cache_lookup
> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
> +
>  static void
>  emc_cache_init(struct emc_cache *flow_cache)
>  {
> @@ -2877,6 +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
>  }
>  
>  static void
> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
> +                         struct dp_netdev_port *port,
> +                         uint64_t now)
> +{
> +    int tx_qid;
> +
> +    if (!strcmp(port->type, "dpdk")) {

Any reason to restrict this only to dpdk ports?  It looks like you've
added a new netdev operation, so why not just call the netdev_txq_drain
unconditionally?

Also, a bit of a nit, but tx_qid can be reduced in scope down to the if
block below (see the sketch after the quoted hunk).

> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
> +                 u32_to_odp(port->port_no));
> +
> +        if (OVS_LIKELY(tx)) {
> +            bool dynamic_txqs = tx->port->dynamic_txqs;
> +
> +            if (dynamic_txqs) {
> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
> +            } else {
> +                tx_qid = pmd->static_tx_qid;
> +            }
> +
> +            netdev_txq_drain(port->netdev, tx_qid);
> +        }
> +    }
> +}
> +
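
For illustration, here is roughly what the quoted hunk could look like with
both suggestions applied (an uncompiled sketch, not part of the patch): the
"dpdk" string check dropped in favour of an unconditional drain, and tx_qid
narrowed to the branch that uses it:

    static void
    dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
                             struct dp_netdev_port *port,
                             uint64_t now)
    {
        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
                                 u32_to_odp(port->port_no));

        if (OVS_LIKELY(tx)) {
            int tx_qid;

            if (tx->port->dynamic_txqs) {
                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
            } else {
                tx_qid = pmd->static_tx_qid;
            }

            /* Netdevs that do not implement a drain op would treat this
             * as a no-op. */
            netdev_txq_drain(port->netdev, tx_qid);
        }
    }
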
Ilya Maximets Dec. 19, 2016, 8:41 a.m. UTC | #2
Hi,

I haven't tested this patch yet. Bhanu, could you please describe
your test scenario and performance results in more detail?
It would be nice if you provided throughput and latency measurement
results for different scenarios and packet sizes. Latency is
important here.

About the patch itself:

  1. 'drain' is called only for PMD threads. This will lead to
     broken connections on non-PMD ports.

  2. 'xps_get_tx_qid()' is called twice: first on send and
     a second time on drain. This may return different
     'tx_qid's, and packets will be stuck forever in the
     tx buffer.

  3. 'txq_drain()' must take the 'tx_lock' for the queue in the
     case of dynamic tx queues (see the sketch after this list).

  4. Waiting for 1024 polling cycles of the PMD thread may cause
     huge latency if we have few packets per second on one
     port and intensive traffic on others.

  5. This patch breaks the counting of 'tx_dropped' packets
     in netdev-dpdk.

  6. Comments in netdev-provider.h should be fixed to reflect
     all the changes.

  7. Finally, I agree with Aaron that explicitly allowing only
     'dpdk' ports is not good style. Also, mentioning the name of
     an exact netdev inside the common code is bad style too.
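
For point 3, a purely illustrative shape of a locked drain (field and helper
names such as 'tx_q', 'tx_lock', 'count' and 'burst_pkts' are assumed from
netdev-dpdk; this is not the patch code):

    static int
    netdev_dpdk_txq_drain(struct netdev *netdev, int qid)
    {
        struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
        struct dpdk_tx_queue *txq = &dev->tx_q[qid];

        /* With dynamic tx queues several threads may share one queue,
         * so the per-queue spinlock must be held while flushing. */
        rte_spinlock_lock(&txq->tx_lock);
        if (txq->count) {
            netdev_dpdk_eth_tx_burst(dev, qid, txq->burst_pkts, txq->count);
            txq->count = 0;
        }
        rte_spinlock_unlock(&txq->tx_lock);

        return 0;
    }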

Best regards, Ilya Maximets.

On 16.12.2016 22:24, Aaron Conole wrote:
> Hi Bhanu,
> 
> Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> writes:
> 
>> In exact match cache processing on an EMC hit, packets are queued in to
>> batches matching the flow. Thereafter, packets are processed in batches
>> for faster packet processing. This particularly is inefficient if there
>> are fewer packets in a batch as rte_eth_tx_burst() incurs expensive MMIO
>> write.
>>
>> This commit adds back intermediate queue implementation. Packets are
>> queued and burst when the packet count >= NETDEV_MAX_BURST. Also drain
>> logic is refactored to handle fewer packets in the tx queues. Testing
>> shows significant performance gains with queueing.
>>
>> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
>> queueing of packets.")
>> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
>> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
>> Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
>> Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
>> ---
> 
> I've Cc'd Ilya just in hopes that the patch gets a better review than I
> could give.  As a general comment, I like the direction - batched
> operations are usually a better way of going.
> 
> Just minor below.
> 
> ... snip ...
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 3509493..65dff83 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
>>  static inline bool emc_entry_alive(struct emc_entry *ce);
>>  static void emc_clear_entry(struct emc_entry *ce);
>>  
>> +static struct tx_port *pmd_send_port_cache_lookup
>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>> +
>>  static void
>>  emc_cache_init(struct emc_cache *flow_cache)
>>  {
>> @@ -2877,6 +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
>>  }
>>  
>>  static void
>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
>> +                         struct dp_netdev_port *port,
>> +                         uint64_t now)
>> +{
>> +    int tx_qid;
>> +
>> +    if (!strcmp(port->type, "dpdk")) {
> 
> Any reason to restrict this only to dpdk ports?  It looks like you've
> added a new netdev operation, so why not just call the netdev_txq_drain
> unconditionally?
> 
> Also, bit of a nit, but tq_qid can be reduced in scope down to the if
> block below.
> 
>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
>> +                 u32_to_odp(port->port_no));
>> +
>> +        if (OVS_LIKELY(tx)) {
>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
>> +
>> +            if (dynamic_txqs) {
>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
>> +            } else {
>> +                tx_qid = pmd->static_tx_qid;
>> +            }
>> +
>> +            netdev_txq_drain(port->netdev, tx_qid);
>> +        }
>> +    }
>> +}
>> +
> 
> 
>
Bodireddy, Bhanuprakash Dec. 19, 2016, 6:05 p.m. UTC | #3
Thanks Ilya and Aaron for reviewing this patch and providing your comments; my replies are inline.

>-----Original Message-----
>From: Ilya Maximets [mailto:i.maximets@samsung.com]
>Sent: Monday, December 19, 2016 8:41 AM
>To: Aaron Conole <aconole@redhat.com>; Bodireddy, Bhanuprakash
><bhanuprakash.bodireddy@intel.com>
>Cc: dev@openvswitch.org; Daniele Di Proietto <diproiettod@vmware.com>;
>Thadeu Lima de Souza Cascardo <cascardo@redhat.com>; Fischetti, Antonio
><antonio.fischetti@intel.com>; Markus Magnusson
><markus.magnusson@ericsson.com>
>Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
>packet transmission.
>
>Hi,
>
>I didn't test this patch yet. Bhanu, could you please describe your test scenario
>and performance results in more details.

During the recent performance analysis improvements for the classifier, we found that a bottleneck was also observed at flow batching.
This was due to fewer packets per batch. To reproduce this, a simple P2P test case can be used with 30 IXIA streams and matching IP flow rules.
E.g.: For an IXIA stream with src IP: 2.2.2.1, dst IP: 5.5.5.1
        ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=2.2.2.1,actions=output:2

        For an IXIA stream with src IP: 4.4.4.1, dst IP: 15.15.15.1
        ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=4.4.4.1,actions=output:2

This leaves fewer packets in batches, and packet_batch_per_flow_execute() will be invoked for every batch.
With 30 flows, I see the throughput drop to ~7.0 Mpps in the PHY2PHY case for 64-byte UDP packets.

>It'll be nice if you provide throughput and latency measurement results for
>different scenarios and packet sizes. Latency is important here.
We are yet to do latency measurements in this case. With 30 IXIA streams comprising 64-byte UDP packets there was
a throughput improvement of 30% in the P2P case and 13-15% in the PVP case (single queue). We will try to get the latency stats
with and without this patch.

>
>About the patch itself:
>
>  1. 'drain' called only for PMD threads. This will lead to
>     broken connection with non-PMD ports.
I see that non-PMD ports are handled by the vswitchd thread.
I tested the PVP loopback case with tap ports and found it to be working as expected. Can you let me know the specific case
you are referring to here so that I can verify whether the patch breaks it.

>
>  2. 'xps_get_tx_qid()' called twice. First time on send and
>     the second time on drain. This may lead to different
>     returned 'tx_qid's and packets will stuck forever in
>     tx buffer.

You are right, and xps_get_tx_qid() can return different tx_qids.
I was always testing with XPS disabled and completely overlooked this case.
This could be a potential problem for us, and any suggestions would be very helpful.

>
>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
>     of dynamic tx queues.

Agreed, will handle this in the next version.

>
>  4. Waiting for 1024 polling cycles of PMD thread may cause
>     a huge latency if we have few packets per second on one
>     port and intensive traffic on others.

I agree with you. We discussed this here and thought that invoking the drain
logic once every 1024 cycles was an optimal solution. But if the community thinks otherwise,
we can move this into the main 'for' loop so that it can be invoked more often, provided that
'DRAIN_TSC' cycles have elapsed.
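
For example, a standalone sketch of invoking the drain from the main loop,
gated on elapsed TSC cycles (the threshold and helper names are illustrative,
not taken from the patch):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>                  /* __rdtsc() */

    #define DRAIN_TSC 200000ULL             /* illustrative threshold */

    static uint64_t last_drain_tsc;

    static void
    txq_drain(void)                         /* stand-in for the per-queue drain */
    {
        printf("drained partial batch\n");
    }

    /* Called on every iteration of the polling loop; the flush itself
     * happens only when at least DRAIN_TSC cycles have elapsed. */
    static void
    maybe_drain(void)
    {
        uint64_t now = __rdtsc();

        if (now - last_drain_tsc >= DRAIN_TSC) {
            txq_drain();
            last_drain_tsc = now;
        }
    }

    int
    main(void)
    {
        for (int i = 0; i < 1000000; i++) {
            /* ... poll rx queues and queue tx packets here ... */
            maybe_drain();
        }
        return 0;
    }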

>
>  5. This patch breaks the counting of 'tx_dropped' packets
>     in netdev-dpdk.

I presume you are referring to the line below in the code.

- dropped += netdev_dpdk_eth_tx_burst() 
+ netdev_dpdk_eth_tx_queue()

Will handle this in v2 of this patch.

>
>  6. Comments in netdev-provider.h should be fixed to reflect
>     all the changes.

Will make necessary changes in netdev-provider.h

>
>  7. At last, I agree with Aaron that explicit allowing only
>     'dpdk' ports is not a good style. Also, mentioning name of
>     exact netdev inside the common code is a bad style too.

Will handle this appropriately.

Regards,
Bhanu Prakash. 

>Best regards, Ilya Maximets.
>
>On 16.12.2016 22:24, Aaron Conole wrote:
>> Hi Bhanu,
>>
>> Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> writes:
>>
>>> In exact match cache processing on an EMC hit, packets are queued in
>>> to batches matching the flow. Thereafter, packets are processed in
>>> batches for faster packet processing. This particularly is
>>> inefficient if there are fewer packets in a batch as
>>> rte_eth_tx_burst() incurs expensive MMIO write.
>>>
>>> This commit adds back intermediate queue implementation. Packets are
>>> queued and burst when the packet count >= NETDEV_MAX_BURST. Also
>>> drain logic is refactored to handle fewer packets in the tx queues.
>>> Testing shows significant performance gains with queueing.
>>>
>>> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
>>> queueing of packets.")
>>> Signed-off-by: Bhanuprakash Bodireddy
>>> <bhanuprakash.bodireddy@intel.com>
>>> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
>>> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
>>> Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
>>> Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
>>> ---
>>
>> I've Cc'd Ilya just in hopes that the patch gets a better review than
>> I could give.  As a general comment, I like the direction - batched
>> operations are usually a better way of going.
>>
>> Just minor below.
>>
>> ... snip ...
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>> 3509493..65dff83 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const
>>> struct dp_netdev_pmd_thread *pmd,  static inline bool
>>> emc_entry_alive(struct emc_entry *ce);  static void
>>> emc_clear_entry(struct emc_entry *ce);
>>>
>>> +static struct tx_port *pmd_send_port_cache_lookup
>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>> +
>>>  static void
>>>  emc_cache_init(struct emc_cache *flow_cache)  { @@ -2877,6 +2880,31
>>> @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,  }
>>>
>>>  static void
>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
>>> +                         struct dp_netdev_port *port,
>>> +                         uint64_t now) {
>>> +    int tx_qid;
>>> +
>>> +    if (!strcmp(port->type, "dpdk")) {
>>
>> Any reason to restrict this only to dpdk ports?  It looks like you've
>> added a new netdev operation, so why not just call the
>> netdev_txq_drain unconditionally?
>>
>> Also, bit of a nit, but tq_qid can be reduced in scope down to the if
>> block below.
>>
>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
>>> +                 u32_to_odp(port->port_no));
>>> +
>>> +        if (OVS_LIKELY(tx)) {
>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
>>> +
>>> +            if (dynamic_txqs) {
>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
>>> +            } else {
>>> +                tx_qid = pmd->static_tx_qid;
>>> +            }
>>> +
>>> +            netdev_txq_drain(port->netdev, tx_qid);
>>> +        }
>>> +    }
>>> +}
>>> +
>>
>>
>>
Ilya Maximets Dec. 20, 2016, 8:09 a.m. UTC | #4
On 19.12.2016 21:05, Bodireddy, Bhanuprakash wrote:
> Thanks Ilya and Aaron for reviewing this patch and providing your comments, my reply inline. 
> 
>> -----Original Message-----
>> From: Ilya Maximets [mailto:i.maximets@samsung.com]
>> Sent: Monday, December 19, 2016 8:41 AM
>> To: Aaron Conole <aconole@redhat.com>; Bodireddy, Bhanuprakash
>> <bhanuprakash.bodireddy@intel.com>
>> Cc: dev@openvswitch.org; Daniele Di Proietto <diproiettod@vmware.com>;
>> Thadeu Lima de Souza Cascardo <cascardo@redhat.com>; Fischetti, Antonio
>> <antonio.fischetti@intel.com>; Markus Magnusson
>> <markus.magnusson@ericsson.com>
>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
>> packet transmission.
>>
>> Hi,
>>
>> I didn't test this patch yet. Bhanu, could you please describe your test scenario
>> and performance results in more details.
> 
> During the recent performance analysis improvements for classifier, we found that bottleneck was also observed at flow batching.
> This was due to fewer packets in batch. To reproduce this, a simple P2P test case can be used with 30 IXIA streams and matching IP flow rules.
> Eg:  For an IXIA stream with src Ip: 2.2.2.1, dst tip: 5.5.5.1 
>         ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
> 
>         For an IXIA stream with src Ip: 4.4.4.1, dst tip: 15.15.15.1
>         ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
> 
> This leaves fewer packets in batches and packet_batch_per_flow_execute() shall be invoked for every batch. 
> With 30 flows, I see the throughput drops to ~7.0 Mpps in PHY2PHY case for 64 byte udp packets. 
> 
>> It'll be nice if you provide throughput and latency measurement results for
>> different scenarios and packet sizes. Latency is important here.
> We are yet to do latency measurements in this case. With 30 IXIA streams comprising of 64 byte udp packets there was
> an throughput improvement of 30% in P2P case and 13-15% in PVP case(single queue). we will try to get the latency stats
> with and without this patch.
> 
>>
>> About the patch itself:
>>
>>  1. 'drain' called only for PMD threads. This will lead to
>>     broken connection with non-PMD ports.
> I see that non-PMD ports are handled with vswitchd thread. 
> Tested PVP loopback case with tap ports and found to be working as expected. Can you let me know the specific case
> you are referring here so that I can verify if the patch breaks it.

I meant something like this:


 *-----------HOST-1(br-int)-----*               *---------HOST-2(br-int)------*
 |                              |               |                             |
 |   internal_port <--> dpdk0 <-------------------> dpdk0 <--> internal_port  |
 |   192.168.0.1/24             |               |              192.168.0.2/24 |
 *------------------------------*               *-----------------------------*

 (HOST-1)# ping 192.168.0.2

In this case I'm expecting that the first (NETDEV_MAX_BURST - 1) ICMP packets will
be stuck in the TX queue. The next packet will cause the whole batch of
NETDEV_MAX_BURST ICMP packets to be sent. That is not good behaviour.

>>  2. 'xps_get_tx_qid()' called twice. First time on send and
>>     the second time on drain. This may lead to different
>>     returned 'tx_qid's and packets will stuck forever in
>>     tx buffer.
> 
> You are right and xps_get_tx_qid() can return different tx_qids.
>  I was always testing for XPS disabled cases and completely overlooked this case.
> This could be a potential problem for us and any suggestions should be very helpful.
> 
>>
>>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
>>     of dynamic tx queues.
> 
> Agree, will handle this in next version. 
> 
>>
>>  4. Waiting for 1024 polling cycles of PMD thread may cause
>>     a huge latency if we have few packets per second on one
>>     port and intensive traffic on others.
> 
> I agree with you. We discussed this here and thought invoking the drain
> logic once every 1024 cycles is an optimal solution. But if the community thinks otherwise
> we can move this in to the main 'for' loop so that it can be invoked more often provided that
> 'DRAIN_TSC' cycles elapsed. 
> 
>>
>>  5. This patch breaks the counting of 'tx_dropped' packets
>>     in netdev-dpdk.
> 
> I presume you are referring to below line in the code. 
> 
> - dropped += netdev_dpdk_eth_tx_burst() 
> + netdev_dpdk_eth_tx_queue()
> 
> Will handle this in v2 of this patch.
> 
>>
>>  6. Comments in netdev-provider.h should be fixed to reflect
>>     all the changes.
> 
> Will make necessary changes in netdev-provider.h
> 
>>
>>  7. At last, I agree with Aaron that explicit allowing only
>>     'dpdk' ports is not a good style. Also, mentioning name of
>>     exact netdev inside the common code is a bad style too.
> 
> Will handle this appropriately.
> 
> Regards,
> Bhanu Prakash. 
> 
>> Best regards, Ilya Maximets.
>>
>> On 16.12.2016 22:24, Aaron Conole wrote:
>>> Hi Bhanu,
>>>
>>> Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> writes:
>>>
>>>> In exact match cache processing on an EMC hit, packets are queued in
>>>> to batches matching the flow. Thereafter, packets are processed in
>>>> batches for faster packet processing. This particularly is
>>>> inefficient if there are fewer packets in a batch as
>>>> rte_eth_tx_burst() incurs expensive MMIO write.
>>>>
>>>> This commit adds back intermediate queue implementation. Packets are
>>>> queued and burst when the packet count >= NETDEV_MAX_BURST. Also
>>>> drain logic is refactored to handle fewer packets in the tx queues.
>>>> Testing shows significant performance gains with queueing.
>>>>
>>>> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
>>>> queueing of packets.")
>>>> Signed-off-by: Bhanuprakash Bodireddy
>>>> <bhanuprakash.bodireddy@intel.com>
>>>> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
>>>> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
>>>> Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
>>>> Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
>>>> ---
>>>
>>> I've Cc'd Ilya just in hopes that the patch gets a better review than
>>> I could give.  As a general comment, I like the direction - batched
>>> operations are usually a better way of going.
>>>
>>> Just minor below.
>>>
>>> ... snip ...
>>>>
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>>> 3509493..65dff83 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const
>>>> struct dp_netdev_pmd_thread *pmd,  static inline bool
>>>> emc_entry_alive(struct emc_entry *ce);  static void
>>>> emc_clear_entry(struct emc_entry *ce);
>>>>
>>>> +static struct tx_port *pmd_send_port_cache_lookup
>>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>>> +
>>>>  static void
>>>>  emc_cache_init(struct emc_cache *flow_cache)  { @@ -2877,6 +2880,31
>>>> @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,  }
>>>>
>>>>  static void
>>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
>>>> +                         struct dp_netdev_port *port,
>>>> +                         uint64_t now) {
>>>> +    int tx_qid;
>>>> +
>>>> +    if (!strcmp(port->type, "dpdk")) {
>>>
>>> Any reason to restrict this only to dpdk ports?  It looks like you've
>>> added a new netdev operation, so why not just call the
>>> netdev_txq_drain unconditionally?
>>>
>>> Also, bit of a nit, but tq_qid can be reduced in scope down to the if
>>> block below.
>>>
>>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
>>>> +                 u32_to_odp(port->port_no));
>>>> +
>>>> +        if (OVS_LIKELY(tx)) {
>>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
>>>> +
>>>> +            if (dynamic_txqs) {
>>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
>>>> +            } else {
>>>> +                tx_qid = pmd->static_tx_qid;
>>>> +            }
>>>> +
>>>> +            netdev_txq_drain(port->netdev, tx_qid);
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>
>>>
>>>
> 
> 
>
Bodireddy, Bhanuprakash Dec. 20, 2016, 12:19 p.m. UTC | #5
>-----Original Message-----
>From: Ilya Maximets [mailto:i.maximets@samsung.com]
>Sent: Tuesday, December 20, 2016 8:09 AM
>To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy@intel.com>; Aaron
>Conole <aconole@redhat.com>
>Cc: dev@openvswitch.org; Daniele Di Proietto <diproiettod@vmware.com>;
>Thadeu Lima de Souza Cascardo <cascardo@redhat.com>; Fischetti, Antonio
><antonio.fischetti@intel.com>; Markus Magnusson
><markus.magnusson@ericsson.com>
>Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
>packet transmission.
>
>On 19.12.2016 21:05, Bodireddy, Bhanuprakash wrote:
>> Thanks Ilya and Aaron for reviewing this patch and providing your
>comments, my reply inline.
>>
>>> -----Original Message-----
>>> From: Ilya Maximets [mailto:i.maximets@samsung.com]
>>> Sent: Monday, December 19, 2016 8:41 AM
>>> To: Aaron Conole <aconole@redhat.com>; Bodireddy, Bhanuprakash
>>> <bhanuprakash.bodireddy@intel.com>
>>> Cc: dev@openvswitch.org; Daniele Di Proietto
>>> <diproiettod@vmware.com>; Thadeu Lima de Souza Cascardo
>>> <cascardo@redhat.com>; Fischetti, Antonio
>>> <antonio.fischetti@intel.com>; Markus Magnusson
>>> <markus.magnusson@ericsson.com>
>>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue
>>> during packet transmission.
>>>
>>> Hi,
>>>
>>> I didn't test this patch yet. Bhanu, could you please describe your
>>> test scenario and performance results in more details.
>>
>> During the recent performance analysis improvements for classifier, we
>found that bottleneck was also observed at flow batching.
>> This was due to fewer packets in batch. To reproduce this, a simple P2P test
>case can be used with 30 IXIA streams and matching IP flow rules.
>> Eg:  For an IXIA stream with src Ip: 2.2.2.1, dst tip: 5.5.5.1
>>         ovs-ofctl add-flow br0
>> dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
>>
>>         For an IXIA stream with src Ip: 4.4.4.1, dst tip: 15.15.15.1
>>         ovs-ofctl add-flow br0
>> dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
>>
>> This leaves fewer packets in batches and packet_batch_per_flow_execute()
>shall be invoked for every batch.
>> With 30 flows, I see the throughput drops to ~7.0 Mpps in PHY2PHY case for
>64 byte udp packets.
>>
>>> It'll be nice if you provide throughput and latency measurement
>>> results for different scenarios and packet sizes. Latency is important here.
>> We are yet to do latency measurements in this case. With 30 IXIA
>> streams comprising of 64 byte udp packets there was an throughput
>> improvement of 30% in P2P case and 13-15% in PVP case(single queue). we
>will try to get the latency stats with and without this patch.
>>
>>>
>>> About the patch itself:
>>>
>>>  1. 'drain' called only for PMD threads. This will lead to
>>>     broken connection with non-PMD ports.
>> I see that non-PMD ports are handled with vswitchd thread.
>> Tested PVP loopback case with tap ports and found to be working as
>> expected. Can you let me know the specific case you are referring here so
>that I can verify if the patch breaks it.
>
>I meant something like this:
>
>
> *-----------HOST-1(br-int)-----*               *---------HOST-2(br-int)------*
> |                              |               |                             |
> |   internal_port <--> dpdk0 <-------------------> dpdk0 <--> internal_port  |
> |   192.168.0.1/24             |               |              192.168.0.2/24 |
> *------------------------------*               *-----------------------------*
>
> (HOST-1)# ping 192.168.0.2
>
>In this case I'm expecting that first (NETDEV_MAX_BURST - 1) icmp packets
>will stuck in TX queue. The next packet will cause sending the whole batch of
>NETDEV_MAX_BURST icmp packets. That is not good behaviour.

Thanks for the test case. I tested this and found it to be working with the patch.
The reason is that PMD threads use the intermediate queue implementation by invoking 'netdev_dpdk_eth_tx_queue()',
whereas the non-PMD thread (vswitchd) continues to call netdev_dpdk_eth_tx_burst() in the patch, so there is no change in the previous behavior here.

I had set up a case similar to the one in the above diagram using 2 hosts and assigning IP addresses to the respective bridges.
When I ping Host 2 (192.168.0.2) from Host 1 (192.168.0.1), below is the call path.

(HOST-1)# ping 192.168.0.2 -c 1  (send only one packet)

When sending ICMP echo request packet:
Vswitchd thread:
    dp_execute_cb()
        netdev_send()                   [pkt sent on dpdk0, qid: 0]
            netdev_dpdk_send__()
                dpdk_do_tx_copy()
                    netdev_dpdk_eth_tx_burst()

On receiving ICMP echo reply packet:
PMD thread:
    dp_execute_cb()
        netdev_send()                   [pkt sent on br0, qid: 1]
            netdev_linux_send()

Regards,
Bhanu Prakash. 

>
>>>  2. 'xps_get_tx_qid()' called twice. First time on send and
>>>     the second time on drain. This may lead to different
>>>     returned 'tx_qid's and packets will stuck forever in
>>>     tx buffer.
>>
>> You are right and xps_get_tx_qid() can return different tx_qids.
>>  I was always testing for XPS disabled cases and completely overlooked this
>case.
>> This could be a potential problem for us and any suggestions should be very
>helpful.
>>
>>>
>>>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
>>>     of dynamic tx queues.
>>
>> Agree, will handle this in next version.
>>
>>>
>>>  4. Waiting for 1024 polling cycles of PMD thread may cause
>>>     a huge latency if we have few packets per second on one
>>>     port and intensive traffic on others.
>>
>> I agree with you. We discussed this here and thought invoking the
>> drain logic once every 1024 cycles is an optimal solution. But if the
>> community thinks otherwise we can move this in to the main 'for' loop
>> so that it can be invoked more often provided that 'DRAIN_TSC' cycles
>elapsed.
>>
>>>
>>>  5. This patch breaks the counting of 'tx_dropped' packets
>>>     in netdev-dpdk.
>>
>> I presume you are referring to below line in the code.
>>
>> - dropped += netdev_dpdk_eth_tx_burst()
>> + netdev_dpdk_eth_tx_queue()
>>
>> Will handle this in v2 of this patch.
>>
>>>
>>>  6. Comments in netdev-provider.h should be fixed to reflect
>>>     all the changes.
>>
>> Will make necessary changes in netdev-provider.h
>>
>>>
>>>  7. At last, I agree with Aaron that explicit allowing only
>>>     'dpdk' ports is not a good style. Also, mentioning name of
>>>     exact netdev inside the common code is a bad style too.
>>
>> Will handle this appropriately.
>>
>> Regards,
>> Bhanu Prakash.
>>
>>> Best regards, Ilya Maximets.
>>>
>>> On 16.12.2016 22:24, Aaron Conole wrote:
>>>> Hi Bhanu,
>>>>
>>>> Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> writes:
>>>>
>>>>> In exact match cache processing on an EMC hit, packets are queued
>>>>> in to batches matching the flow. Thereafter, packets are processed
>>>>> in batches for faster packet processing. This particularly is
>>>>> inefficient if there are fewer packets in a batch as
>>>>> rte_eth_tx_burst() incurs expensive MMIO write.
>>>>>
>>>>> This commit adds back intermediate queue implementation. Packets
>>>>> are queued and burst when the packet count >= NETDEV_MAX_BURST.
>>>>> Also drain logic is refactored to handle fewer packets in the tx queues.
>>>>> Testing shows significant performance gains with queueing.
>>>>>
>>>>> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
>>>>> queueing of packets.")
>>>>> Signed-off-by: Bhanuprakash Bodireddy
>>>>> <bhanuprakash.bodireddy@intel.com>
>>>>> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
>>>>> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
>>>>> Signed-off-by: Markus Magnusson
><markus.magnusson@ericsson.com>
>>>>> Co-authored-by: Markus Magnusson
><markus.magnusson@ericsson.com>
>>>>> ---
>>>>
>>>> I've Cc'd Ilya just in hopes that the patch gets a better review
>>>> than I could give.  As a general comment, I like the direction -
>>>> batched operations are usually a better way of going.
>>>>
>>>> Just minor below.
>>>>
>>>> ... snip ...
>>>>>
>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>>>> 3509493..65dff83 100644
>>>>> --- a/lib/dpif-netdev.c
>>>>> +++ b/lib/dpif-netdev.c
>>>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const
>>>>> struct dp_netdev_pmd_thread *pmd,  static inline bool
>>>>> emc_entry_alive(struct emc_entry *ce);  static void
>>>>> emc_clear_entry(struct emc_entry *ce);
>>>>>
>>>>> +static struct tx_port *pmd_send_port_cache_lookup
>>>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>>>> +
>>>>>  static void
>>>>>  emc_cache_init(struct emc_cache *flow_cache)  { @@ -2877,6
>>>>> +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
>}
>>>>>
>>>>>  static void
>>>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
>>>>> +                         struct dp_netdev_port *port,
>>>>> +                         uint64_t now) {
>>>>> +    int tx_qid;
>>>>> +
>>>>> +    if (!strcmp(port->type, "dpdk")) {
>>>>
>>>> Any reason to restrict this only to dpdk ports?  It looks like
>>>> you've added a new netdev operation, so why not just call the
>>>> netdev_txq_drain unconditionally?
>>>>
>>>> Also, bit of a nit, but tq_qid can be reduced in scope down to the
>>>> if block below.
>>>>
>>>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
>>>>> +                 u32_to_odp(port->port_no));
>>>>> +
>>>>> +        if (OVS_LIKELY(tx)) {
>>>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
>>>>> +
>>>>> +            if (dynamic_txqs) {
>>>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
>>>>> +            } else {
>>>>> +                tx_qid = pmd->static_tx_qid;
>>>>> +            }
>>>>> +
>>>>> +            netdev_txq_drain(port->netdev, tx_qid);
>>>>> +        }
>>>>> +    }
>>>>> +}
>>>>> +
>>>>
>>>>
>>>>
>>
>>
>>
Ilya Maximets Dec. 20, 2016, 12:46 p.m. UTC | #6
On 20.12.2016 15:19, Bodireddy, Bhanuprakash wrote:
>> -----Original Message-----
>> From: Ilya Maximets [mailto:i.maximets@samsung.com]
>> Sent: Tuesday, December 20, 2016 8:09 AM
>> To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy@intel.com>; Aaron
>> Conole <aconole@redhat.com>
>> Cc: dev@openvswitch.org; Daniele Di Proietto <diproiettod@vmware.com>;
>> Thadeu Lima de Souza Cascardo <cascardo@redhat.com>; Fischetti, Antonio
>> <antonio.fischetti@intel.com>; Markus Magnusson
>> <markus.magnusson@ericsson.com>
>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
>> packet transmission.
>>
>> On 19.12.2016 21:05, Bodireddy, Bhanuprakash wrote:
>>> Thanks Ilya and Aaron for reviewing this patch and providing your
>> comments, my reply inline.
>>>
>>>> -----Original Message-----
>>>> From: Ilya Maximets [mailto:i.maximets@samsung.com]
>>>> Sent: Monday, December 19, 2016 8:41 AM
>>>> To: Aaron Conole <aconole@redhat.com>; Bodireddy, Bhanuprakash
>>>> <bhanuprakash.bodireddy@intel.com>
>>>> Cc: dev@openvswitch.org; Daniele Di Proietto
>>>> <diproiettod@vmware.com>; Thadeu Lima de Souza Cascardo
>>>> <cascardo@redhat.com>; Fischetti, Antonio
>>>> <antonio.fischetti@intel.com>; Markus Magnusson
>>>> <markus.magnusson@ericsson.com>
>>>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue
>>>> during packet transmission.
>>>>
>>>> Hi,
>>>>
>>>> I didn't test this patch yet. Bhanu, could you please describe your
>>>> test scenario and performance results in more details.
>>>
>>> During the recent performance analysis improvements for classifier, we
>> found that bottleneck was also observed at flow batching.
>>> This was due to fewer packets in batch. To reproduce this, a simple P2P test
>> case can be used with 30 IXIA streams and matching IP flow rules.
>>> Eg:  For an IXIA stream with src Ip: 2.2.2.1, dst tip: 5.5.5.1
>>>         ovs-ofctl add-flow br0
>>> dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
>>>
>>>         For an IXIA stream with src Ip: 4.4.4.1, dst tip: 15.15.15.1
>>>         ovs-ofctl add-flow br0
>>> dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
>>>
>>> This leaves fewer packets in batches and packet_batch_per_flow_execute()
>> shall be invoked for every batch.
>>> With 30 flows, I see the throughput drops to ~7.0 Mpps in PHY2PHY case for
>> 64 byte udp packets.
>>>
>>>> It'll be nice if you provide throughput and latency measurement
>>>> results for different scenarios and packet sizes. Latency is important here.
>>> We are yet to do latency measurements in this case. With 30 IXIA
>>> streams comprising of 64 byte udp packets there was an throughput
>>> improvement of 30% in P2P case and 13-15% in PVP case(single queue). we
>> will try to get the latency stats with and without this patch.
>>>
>>>>
>>>> About the patch itself:
>>>>
>>>>  1. 'drain' called only for PMD threads. This will lead to
>>>>     broken connection with non-PMD ports.
>>> I see that non-PMD ports are handled with vswitchd thread.
>>> Tested PVP loopback case with tap ports and found to be working as
>>> expected. Can you let me know the specific case you are referring here so
>> that I can verify if the patch breaks it.
>>
>> I meant something like this:
>>
>>
>> *-----------HOST-1(br-int)-----*               *---------HOST-2(br-int)------*
>> |                              |               |                             |
>> |   internal_port <--> dpdk0 <-------------------> dpdk0 <--> internal_port  |
>> |   192.168.0.1/24             |               |              192.168.0.2/24 |
>> *------------------------------*               *-----------------------------*
>>
>> (HOST-1)# ping 192.168.0.2
>>
>> In this case I'm expecting that first (NETDEV_MAX_BURST - 1) icmp packets
>> will stuck in TX queue. The next packet will cause sending the whole batch of
>> NETDEV_MAX_BURST icmp packets. That is not good behaviour.
> 
> Thanks for test case. I tested this and found to be working with the patch. 
> The reason being, PMD threads uses intermediate queue implementation by invoking 'netdev_dpdk_eth_tx_queue()', 
> Whereas the non-pmd thread(vswitchd) continues to call netdev_dpdk_eth_tx_burst() in the patch and no change in previous behavior here.
> 
> I had setup the case similar to one mentioned in above diagram using 2 hosts and assigning IP addresses to the respective bridges. 
> When I ping the  Host2(192.168.0.2) from Host 1(192.168.0.1) below is the call path.
> 
> (HOST-1)# ping 192.168.0.2 -c 1  (send only one packet)
> 
> When sending ICMP echo request packet: 
> Vswitchd thread:    dp_execute_cb()
>                                            netdev_send()           [ pkt sent on dpdk0, qid: 0]
>                                                  netdev_dpdk_send__() 
>                                                        dpdk_do_tx_copy() 
>                                                              netdev_dpdk_eth_tx_burst().
>                                                                           
> 
> On receiving ICMP echo reply packet:
> PMD thread:   dp_execute_cb()
>                                     netdev_send()    [pkt sent on br0, qid: 1]
>                                          netdev_linux_send()
> 

OK. Maybe in this exact case it works, but it isn't guaranteed that non-PMD
threads will always call send with 'may_steal=false'.

Also, I see another issue here:

    8. netdev_send(.., may_steal=true, ...) uses instant send while
       netdev_send(.., may_steal=false, ...) uses queueing of packets.
       This behaviour will lead to packet reordering issues in some
       cases.

    9. You're decreasing 'txq->count' inside 'netdev_dpdk_eth_tx_burst()'
       while it can be called with an argument different from 'txq->burst_pkts'.
       Some of the queued packets will be lost, plus a memory leak.
           
Best regards, Ilya Maximets.

> 
>>
>>>>  2. 'xps_get_tx_qid()' called twice. First time on send and
>>>>     the second time on drain. This may lead to different
>>>>     returned 'tx_qid's and packets will stuck forever in
>>>>     tx buffer.
>>>
>>> You are right and xps_get_tx_qid() can return different tx_qids.
>>>  I was always testing for XPS disabled cases and completely overlooked this
>> case.
>>> This could be a potential problem for us and any suggestions should be very
>> helpful.
>>>
>>>>
>>>>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
>>>>     of dynamic tx queues.
>>>
>>> Agree, will handle this in next version.
>>>
>>>>
>>>>  4. Waiting for 1024 polling cycles of PMD thread may cause
>>>>     a huge latency if we have few packets per second on one
>>>>     port and intensive traffic on others.
>>>
>>> I agree with you. We discussed this here and thought invoking the
>>> drain logic once every 1024 cycles is an optimal solution. But if the
>>> community thinks otherwise we can move this in to the main 'for' loop
>>> so that it can be invoked more often provided that 'DRAIN_TSC' cycles
>> elapsed.
>>>
>>>>
>>>>  5. This patch breaks the counting of 'tx_dropped' packets
>>>>     in netdev-dpdk.
>>>
>>> I presume you are referring to below line in the code.
>>>
>>> - dropped += netdev_dpdk_eth_tx_burst()
>>> + netdev_dpdk_eth_tx_queue()
>>>
>>> Will handle this in v2 of this patch.
>>>
>>>>
>>>>  6. Comments in netdev-provider.h should be fixed to reflect
>>>>     all the changes.
>>>
>>> Will make necessary changes in netdev-provider.h
>>>
>>>>
>>>>  7. At last, I agree with Aaron that explicit allowing only
>>>>     'dpdk' ports is not a good style. Also, mentioning name of
>>>>     exact netdev inside the common code is a bad style too.
>>>
>>> Will handle this appropriately.
>>>
>>> Regards,
>>> Bhanu Prakash.
>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>> On 16.12.2016 22:24, Aaron Conole wrote:
>>>>> Hi Bhanu,
>>>>>
>>>>> Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> writes:
>>>>>
>>>>>> In exact match cache processing on an EMC hit, packets are queued
>>>>>> in to batches matching the flow. Thereafter, packets are processed
>>>>>> in batches for faster packet processing. This particularly is
>>>>>> inefficient if there are fewer packets in a batch as
>>>>>> rte_eth_tx_burst() incurs expensive MMIO write.
>>>>>>
>>>>>> This commit adds back intermediate queue implementation. Packets
>>>>>> are queued and burst when the packet count >= NETDEV_MAX_BURST.
>>>>>> Also drain logic is refactored to handle fewer packets in the tx queues.
>>>>>> Testing shows significant performance gains with queueing.
>>>>>>
>>>>>> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
>>>>>> queueing of packets.")
>>>>>> Signed-off-by: Bhanuprakash Bodireddy
>>>>>> <bhanuprakash.bodireddy@intel.com>
>>>>>> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
>>>>>> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
>>>>>> Signed-off-by: Markus Magnusson
>> <markus.magnusson@ericsson.com>
>>>>>> Co-authored-by: Markus Magnusson
>> <markus.magnusson@ericsson.com>
>>>>>> ---
>>>>>
>>>>> I've Cc'd Ilya just in hopes that the patch gets a better review
>>>>> than I could give.  As a general comment, I like the direction -
>>>>> batched operations are usually a better way of going.
>>>>>
>>>>> Just minor below.
>>>>>
>>>>> ... snip ...
>>>>>>
>>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>>>>> 3509493..65dff83 100644
>>>>>> --- a/lib/dpif-netdev.c
>>>>>> +++ b/lib/dpif-netdev.c
>>>>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const
>>>>>> struct dp_netdev_pmd_thread *pmd,  static inline bool
>>>>>> emc_entry_alive(struct emc_entry *ce);  static void
>>>>>> emc_clear_entry(struct emc_entry *ce);
>>>>>>
>>>>>> +static struct tx_port *pmd_send_port_cache_lookup
>>>>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>>>>> +
>>>>>>  static void
>>>>>>  emc_cache_init(struct emc_cache *flow_cache)  { @@ -2877,6
>>>>>> +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
>> }
>>>>>>
>>>>>>  static void
>>>>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
>>>>>> +                         struct dp_netdev_port *port,
>>>>>> +                         uint64_t now) {
>>>>>> +    int tx_qid;
>>>>>> +
>>>>>> +    if (!strcmp(port->type, "dpdk")) {
>>>>>
>>>>> Any reason to restrict this only to dpdk ports?  It looks like
>>>>> you've added a new netdev operation, so why not just call the
>>>>> netdev_txq_drain unconditionally?
>>>>>
>>>>> Also, bit of a nit, but tq_qid can be reduced in scope down to the
>>>>> if block below.
>>>>>
>>>>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
>>>>>> +                 u32_to_odp(port->port_no));
>>>>>> +
>>>>>> +        if (OVS_LIKELY(tx)) {
>>>>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
>>>>>> +
>>>>>> +            if (dynamic_txqs) {
>>>>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
>>>>>> +            } else {
>>>>>> +                tx_qid = pmd->static_tx_qid;
>>>>>> +            }
>>>>>> +
>>>>>> +            netdev_txq_drain(port->netdev, tx_qid);
>>>>>> +        }
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
> 
> 
>
Fischetti, Antonio Jan. 12, 2017, 5:40 p.m. UTC | #7
Hi Ilya,
thanks for your detailed feedback. I've added a couple of replies 
inline about when the instant send happens and the may_steal param.


Regards,
Antonio

> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Tuesday, December 20, 2016 12:47 PM
> To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy@intel.com>; Aaron
> Conole <aconole@redhat.com>
> Cc: dev@openvswitch.org; Daniele Di Proietto <diproiettod@vmware.com>;
> Thadeu Lima de Souza Cascardo <cascardo@redhat.com>; Fischetti, Antonio
> <antonio.fischetti@intel.com>; Markus Magnusson
> <markus.magnusson@ericsson.com>
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
> packet transmission.
> 
> On 20.12.2016 15:19, Bodireddy, Bhanuprakash wrote:
> >> -----Original Message-----
> >> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> >> Sent: Tuesday, December 20, 2016 8:09 AM
> >> To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy@intel.com>; Aaron
> >> Conole <aconole@redhat.com>
> >> Cc: dev@openvswitch.org; Daniele Di Proietto <diproiettod@vmware.com>;
> >> Thadeu Lima de Souza Cascardo <cascardo@redhat.com>; Fischetti, Antonio
> >> <antonio.fischetti@intel.com>; Markus Magnusson
> >> <markus.magnusson@ericsson.com>
> >> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue
> during
> >> packet transmission.
> >>
> >> On 19.12.2016 21:05, Bodireddy, Bhanuprakash wrote:
> >>> Thanks Ilya and Aaron for reviewing this patch and providing your
> >> comments, my reply inline.
> >>>
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> >>>> Sent: Monday, December 19, 2016 8:41 AM
> >>>> To: Aaron Conole <aconole@redhat.com>; Bodireddy, Bhanuprakash
> >>>> <bhanuprakash.bodireddy@intel.com>
> >>>> Cc: dev@openvswitch.org; Daniele Di Proietto
> >>>> <diproiettod@vmware.com>; Thadeu Lima de Souza Cascardo
> >>>> <cascardo@redhat.com>; Fischetti, Antonio
> >>>> <antonio.fischetti@intel.com>; Markus Magnusson
> >>>> <markus.magnusson@ericsson.com>
> >>>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue
> >>>> during packet transmission.
> >>>>
> >>>> Hi,
> >>>>
> >>>> I didn't test this patch yet. Bhanu, could you please describe your
> >>>> test scenario and performance results in more details.
> >>>
> >>> During the recent performance analysis improvements for classifier, we
> >> found that bottleneck was also observed at flow batching.
> >>> This was due to fewer packets in batch. To reproduce this, a simple
> P2P test
> >> case can be used with 30 IXIA streams and matching IP flow rules.
> >>> Eg:  For an IXIA stream with src Ip: 2.2.2.1, dst tip: 5.5.5.1
> >>>         ovs-ofctl add-flow br0
> >>> dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
> >>>
> >>>         For an IXIA stream with src Ip: 4.4.4.1, dst tip: 15.15.15.1
> >>>         ovs-ofctl add-flow br0
> >>> dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
> >>>
> >>> This leaves fewer packets in batches and
> packet_batch_per_flow_execute()
> >> shall be invoked for every batch.
> >>> With 30 flows, I see the throughput drops to ~7.0 Mpps in PHY2PHY case
> for
> >> 64 byte udp packets.
> >>>
> >>>> It'll be nice if you provide throughput and latency measurement
> >>>> results for different scenarios and packet sizes. Latency is
> important here.
> >>> We are yet to do latency measurements in this case. With 30 IXIA
> >>> streams comprising of 64 byte udp packets there was an throughput
> >>> improvement of 30% in P2P case and 13-15% in PVP case(single queue).
> we
> >> will try to get the latency stats with and without this patch.
> >>>
> >>>>
> >>>> About the patch itself:
> >>>>
> >>>>  1. 'drain' called only for PMD threads. This will lead to
> >>>>     broken connection with non-PMD ports.
> >>> I see that non-PMD ports are handled with vswitchd thread.
> >>> Tested PVP loopback case with tap ports and found to be working as
> >>> expected. Can you let me know the specific case you are referring here
> so
> >> that I can verify if the patch breaks it.
> >>
> >> I meant something like this:
> >>
> >>
> >> *-----------HOST-1(br-int)-----*               *---------HOST-2(br-int)------*
> >> |                              |               |                             |
> >> |   internal_port <--> dpdk0 <-------------------> dpdk0 <--> internal_port  |
> >> |   192.168.0.1/24             |               |              192.168.0.2/24 |
> >> *------------------------------*               *-----------------------------*
> >>
> >> (HOST-1)# ping 192.168.0.2
> >>
> >> In this case I'm expecting that first (NETDEV_MAX_BURST - 1) icmp
> packets
> >> will stuck in TX queue. The next packet will cause sending the whole
> batch of
> >> NETDEV_MAX_BURST icmp packets. That is not good behaviour.
> >
> > Thanks for test case. I tested this and found to be working with the
> patch.
> > The reason being, PMD threads uses intermediate queue implementation by
> invoking 'netdev_dpdk_eth_tx_queue()',
> > Whereas the non-pmd thread(vswitchd) continues to call
> netdev_dpdk_eth_tx_burst() in the patch and no change in previous behavior
> here.
> >
> > I had setup the case similar to one mentioned in above diagram using 2
> hosts and assigning IP addresses to the respective bridges.
> > When I ping the  Host2(192.168.0.2) from Host 1(192.168.0.1) below is
> the call path.
> >
> > (HOST-1)# ping 192.168.0.2 -c 1  (send only one packet)
> >
> > When sending ICMP echo request packet:
> > Vswitchd thread:    dp_execute_cb()
> >                                            netdev_send()           [ pkt
> sent on dpdk0, qid: 0]
> >                                                  netdev_dpdk_send__()
> >                                                        dpdk_do_tx_copy()
> >
> netdev_dpdk_eth_tx_burst().
> >
> >
> > On receiving ICMP echo reply packet:
> > PMD thread:   dp_execute_cb()
> >                                     netdev_send()    [pkt sent on br0,
> qid: 1]
> >                                          netdev_linux_send()
> >
> 
> OK. Maybe in this exact case it works, but it isn't guaranteed that non-
> PMD
> threads will always call send with 'may_steal=false'.
> 

[Antonio F] In my test case (details below) the non-PMD thread
is calling send with 'may_steal=true' and the packet is sent instantly.

Regardless of the may_steal value, the netdev_dpdk_send__() function
calls dpdk_do_tx_copy() for non-PMD threads because the condition
(batch->packets[0]->source != DPBUF_DPDK)
is true, as the buffer data is not from DPDK-allocated memory.

In my test case I have one host with
- 2 dpdk ports: dpdk0, dpdk1,
- an IP address assigned to the internal port 'br0', and
- 1 single flow set as
  ovs-ofctl add-flow br0 actions=NORMAL

When an ARP request is sent from an IXIA generator to dpdk0:
1. the PMD thread forwards it to the internal br0 with may_steal=0,
   via netdev_linux_send()
2. the PMD thread also forwards the ARP request to dpdk1 with may_steal=1.
   As the buffer data is from DPDK-allocated memory,
   if (OVS_UNLIKELY(!may_steal ||
                     batch->packets[0]->source != DPBUF_DPDK))
   is false => the ARP request is put into the queue.


When an ARP reply is received from br0:
1. the non-PMD thread forwards it to dpdk0 with may_steal=1.
   As the buffer data is NOT from DPDK-allocated memory, the following
   if (OVS_UNLIKELY(!may_steal ||
                     batch->packets[0]->source != DPBUF_DPDK))
   is true => dpdk_do_tx_copy() is called and the packet is sent out immediately.
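
To summarize the dispatch described above, an abridged sketch (not the full
function from the patch; the queueing helper's argument list is assumed):

    static void
    netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
                       struct dp_packet_batch *batch, bool may_steal)
    {
        if (OVS_UNLIKELY(!may_steal ||
                         batch->packets[0]->source != DPBUF_DPDK)) {
            /* Copying path: packets are copied into DPDK mbufs and sent
             * instantly via rte_eth_tx_burst(). */
            dpdk_do_tx_copy(&dev->up, qid, batch);
            dp_packet_delete_batch(batch, may_steal);
        } else {
            /* DPDK-sourced, stealable batch: buffered in the intermediate
             * queue until NETDEV_MAX_BURST packets accumulate or a drain
             * occurs.  (Argument list assumed for illustration.) */
            netdev_dpdk_eth_tx_queue(dev, qid, batch->packets, batch->count);
        }
    }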


> Also, I see another issue here:
> 
>     8. netdev_send(.., may_steal=true, ...) uses instant send while
>        netdev_send(.., may_steal=false, ...) uses queueing of packets.
>        This behaviour will lead to packet reordering issues in some
>        cases.

[Antonio F] The instant send will happen when in netdev_dpdk_send__() the condition

   if (OVS_UNLIKELY(!may_steal ||
                     batch->packets[0]->source != DPBUF_DPDK))

is true => dpdk_do_tx_copy() => netdev_dpdk_eth_tx_burst() is called.
Can you please provide more details about the packet reordering issue you mentioned?

> 
>     9. You're decreasing 'txq->count' inside 'netdev_dpdk_eth_tx_burst()'
>        while it can be called with different from 'txq->burst_pkts'
>        argument. Some of queued packets will be lost + memory leak.
> 
> Best regards, Ilya Maximets.
> 
> >
> >>
> >>>>  2. 'xps_get_tx_qid()' called twice. First time on send and
> >>>>     the second time on drain. This may lead to different
> >>>>     returned 'tx_qid's and packets will stuck forever in
> >>>>     tx buffer.
> >>>
> >>> You are right and xps_get_tx_qid() can return different tx_qids.
> >>>  I was always testing for XPS disabled cases and completely overlooked
> this
> >> case.
> >>> This could be a potential problem for us and any suggestions should be
> very
> >> helpful.
> >>>
> >>>>
> >>>>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
> >>>>     of dynamic tx queues.
> >>>
> >>> Agree, will handle this in next version.
> >>>
> >>>>
> >>>>  4. Waiting for 1024 polling cycles of PMD thread may cause
> >>>>     a huge latency if we have few packets per second on one
> >>>>     port and intensive traffic on others.
> >>>
> >>> I agree with you. We discussed this here and thought invoking the
> >>> drain logic once every 1024 cycles is an optimal solution. But if the
> >>> community thinks otherwise we can move this in to the main 'for' loop
> >>> so that it can be invoked more often provided that 'DRAIN_TSC' cycles
> >> elapsed.
> >>>
> >>>>
> >>>>  5. This patch breaks the counting of 'tx_dropped' packets
> >>>>     in netdev-dpdk.
> >>>
> >>> I presume you are referring to below line in the code.
> >>>
> >>> - dropped += netdev_dpdk_eth_tx_burst()
> >>> + netdev_dpdk_eth_tx_queue()
> >>>
> >>> Will handle this in v2 of this patch.
> >>>
> >>>>
> >>>>  6. Comments in netdev-provider.h should be fixed to reflect
> >>>>     all the changes.
> >>>
> >>> Will make necessary changes in netdev-provider.h
> >>>
> >>>>
> >>>>  7. At last, I agree with Aaron that explicit allowing only
> >>>>     'dpdk' ports is not a good style. Also, mentioning name of
> >>>>     exact netdev inside the common code is a bad style too.
> >>>
> >>> Will handle this appropriately.
> >>>
> >>> Regards,
> >>> Bhanu Prakash.
> >>>
> >>>> Best regards, Ilya Maximets.
> >>>>
> >>>> On 16.12.2016 22:24, Aaron Conole wrote:
> >>>>> Hi Bhanu,
> >>>>>
> >>>>> Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com> writes:
> >>>>>
> >>>>>> In exact match cache processing on an EMC hit, packets are queued
> >>>>>> in to batches matching the flow. Thereafter, packets are processed
> >>>>>> in batches for faster packet processing. This particularly is
> >>>>>> inefficient if there are fewer packets in a batch as
> >>>>>> rte_eth_tx_burst() incurs expensive MMIO write.
> >>>>>>
> >>>>>> This commit adds back intermediate queue implementation. Packets
> >>>>>> are queued and burst when the packet count >= NETDEV_MAX_BURST.
> >>>>>> Also drain logic is refactored to handle fewer packets in the tx
> queues.
> >>>>>> Testing shows significant performance gains with queueing.
> >>>>>>
> >>>>>> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
> >>>>>> queueing of packets.")
> >>>>>> Signed-off-by: Bhanuprakash Bodireddy
> >>>>>> <bhanuprakash.bodireddy@intel.com>
> >>>>>> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
> >>>>>> Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
> >>>>>> Signed-off-by: Markus Magnusson
> >> <markus.magnusson@ericsson.com>
> >>>>>> Co-authored-by: Markus Magnusson
> >> <markus.magnusson@ericsson.com>
> >>>>>> ---
> >>>>>
> >>>>> I've Cc'd Ilya just in hopes that the patch gets a better review
> >>>>> than I could give.  As a general comment, I like the direction -
> >>>>> batched operations are usually a better way of going.
> >>>>>
> >>>>> Just minor below.
> >>>>>
> >>>>> ... snip ...
> >>>>>>
> >>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> >>>>>> 3509493..65dff83 100644
> >>>>>> --- a/lib/dpif-netdev.c
> >>>>>> +++ b/lib/dpif-netdev.c
> >>>>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const
> >>>>>> struct dp_netdev_pmd_thread *pmd,  static inline bool
> >>>>>> emc_entry_alive(struct emc_entry *ce);  static void
> >>>>>> emc_clear_entry(struct emc_entry *ce);
> >>>>>>
> >>>>>> +static struct tx_port *pmd_send_port_cache_lookup
> >>>>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
> >>>>>> +
> >>>>>>  static void
> >>>>>>  emc_cache_init(struct emc_cache *flow_cache)  { @@ -2877,6
> >>>>>> +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
> >> }
> >>>>>>
> >>>>>>  static void
> >>>>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
> >>>>>> +                         struct dp_netdev_port *port,
> >>>>>> +                         uint64_t now) {
> >>>>>> +    int tx_qid;
> >>>>>> +
> >>>>>> +    if (!strcmp(port->type, "dpdk")) {
> >>>>>
> >>>>> Any reason to restrict this only to dpdk ports?  It looks like
> >>>>> you've added a new netdev operation, so why not just call the
> >>>>> netdev_txq_drain unconditionally?
> >>>>>
> >>>>> Also, bit of a nit, but tq_qid can be reduced in scope down to the
> >>>>> if block below.
> >>>>>
> >>>>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
> >>>>>> +                 u32_to_odp(port->port_no));
> >>>>>> +
> >>>>>> +        if (OVS_LIKELY(tx)) {
> >>>>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
> >>>>>> +
> >>>>>> +            if (dynamic_txqs) {
> >>>>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
> >>>>>> +            } else {
> >>>>>> +                tx_qid = pmd->static_tx_qid;
> >>>>>> +            }
> >>>>>> +
> >>>>>> +            netdev_txq_drain(port->netdev, tx_qid);
> >>>>>> +        }
> >>>>>> +    }
> >>>>>> +}
> >>>>>> +
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>>
> >
> >
> >
Fischetti, Antonio Jan. 17, 2017, 3:30 p.m. UTC | #8
We will respin a v2 of this patch that addresses your comments.
Please let's continue the discussion on v2.
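
For reference, one of the items for v2 is review point 3 above, i.e. taking the
per-queue lock in the drain path. A rough sketch of what that could look like,
reusing the existing rte_spinlock in struct dpdk_tx_queue (this is only an
illustration, not the actual v2 code):

/* Illustrative sketch only: drain the intermediate queue while holding the
 * per-queue lock, as suggested for the dynamic/concurrent txq case. */
static int
netdev_dpdk_txq_drain(struct netdev *netdev, int qid)
{
    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
    struct dpdk_tx_queue *txq = &dev->tx_q[qid];

    if (OVS_LIKELY(txq->count)) {
        rte_spinlock_lock(&txq->tx_lock);
        netdev_dpdk_eth_tx_burst(dev, qid, txq->burst_pkts, txq->count);
        rte_spinlock_unlock(&txq->tx_lock);
    }

    return 0;
}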

Thanks,
Antonio

diff mbox

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 3509493..65dff83 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -622,6 +622,9 @@  static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
 static inline bool emc_entry_alive(struct emc_entry *ce);
 static void emc_clear_entry(struct emc_entry *ce);
 
+static struct tx_port *pmd_send_port_cache_lookup
+    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
+
 static void
 emc_cache_init(struct emc_cache *flow_cache)
 {
@@ -2877,6 +2880,31 @@  cycles_count_end(struct dp_netdev_pmd_thread *pmd,
 }
 
 static void
+dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
+                         struct dp_netdev_port *port,
+                         uint64_t now)
+{
+    int tx_qid;
+
+    if (!strcmp(port->type, "dpdk")) {
+        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
+                 u32_to_odp(port->port_no));
+
+        if (OVS_LIKELY(tx)) {
+            bool dynamic_txqs = tx->port->dynamic_txqs;
+
+            if (dynamic_txqs) {
+                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
+            } else {
+                tx_qid = pmd->static_tx_qid;
+            }
+
+            netdev_txq_drain(port->netdev, tx_qid);
+        }
+    }
+}
+
+static void
 dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
                            struct netdev_rxq *rx,
                            odp_port_t port_no)
@@ -3505,6 +3533,8 @@  pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
     return i;
 }
 
+enum { DRAIN_TSC = 20000ULL };
+
 static void *
 pmd_thread_main(void *f_)
 {
@@ -3512,6 +3542,7 @@  pmd_thread_main(void *f_)
     unsigned int lc = 0;
     struct polled_queue *poll_list;
     bool exiting;
+    uint64_t prev = 0, now = 0;
     int poll_cnt;
     int i;
 
@@ -3547,10 +3578,19 @@  reload:
         }
 
         if (lc++ > 1024) {
+            prev = now;
+            now = pmd->last_cycles;
+            struct tx_port *tx_port;
             bool reload;
 
             lc = 0;
 
+            if ((now - prev) > DRAIN_TSC) {
+                HMAP_FOR_EACH (tx_port, node, &pmd->send_port_cache) {
+                    dp_netdev_drain_txq_port(pmd, tx_port->port, now);
+                }
+            }
+
             coverage_try_clear();
             dp_netdev_pmd_try_optimize(pmd);
             if (!ovsrcu_try_quiesce()) {
@@ -4951,7 +4991,7 @@  dpif_dummy_register(enum dummy_level level)
                              "dp port new-number",
                              3, 3, dpif_dummy_change_port_number, NULL);
 }
-
+
 /* Datapath Classifier. */
 
 /* A set of rules that all have the same fields wildcarded. */
diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c
index 75a330b..e13d418 100644
--- a/lib/netdev-bsd.c
+++ b/lib/netdev-bsd.c
@@ -1543,6 +1543,7 @@  netdev_bsd_update_flags(struct netdev *netdev_, enum netdev_flags off,
     netdev_bsd_rxq_recv,                             \
     netdev_bsd_rxq_wait,                             \
     netdev_bsd_rxq_drain,                            \
+    NULL,                                            \
 }
 
 const struct netdev_class netdev_bsd_class =
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index eb4ed02..945f6de 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -166,7 +166,6 @@  static const struct rte_eth_conf port_conf = {
 
 enum { DPDK_RING_SIZE = 256 };
 BUILD_ASSERT_DECL(IS_POW2(DPDK_RING_SIZE));
-enum { DRAIN_TSC = 200000ULL };
 
 enum dpdk_dev_type {
     DPDK_DEV_ETH = 0,
@@ -289,12 +288,17 @@  struct dpdk_mp {
 /* There should be one 'struct dpdk_tx_queue' created for
  * each cpu core. */
 struct dpdk_tx_queue {
+    int count;                     /* Number of buffered packets waiting to
+                                      be sent. */
     rte_spinlock_t tx_lock;        /* Protects the members and the NIC queue
                                     * from concurrent access.  It is used only
                                     * if the queue is shared among different
                                     * pmd threads (see 'concurrent_txq'). */
     int map;                       /* Mapping of configured vhost-user queues
                                     * to enabled by guest. */
+    struct rte_mbuf *burst_pkts[NETDEV_MAX_BURST];
+                                   /* Copy the packets in to intermediate queue
+                                    * to amortize the cost of MMIO write. */
 };
 
 /* dpdk has no way to remove dpdk ring ethernet devices
@@ -1242,6 +1246,7 @@  static inline int
 netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
                          struct rte_mbuf **pkts, int cnt)
 {
+    struct dpdk_tx_queue *txq = &dev->tx_q[qid];
     uint32_t nb_tx = 0;
 
     while (nb_tx != cnt) {
@@ -1265,7 +1270,8 @@  netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
         }
     }
 
-    return cnt - nb_tx;
+    txq->count = cnt - nb_tx;
+    return txq->count;
 }
 
 static inline bool
@@ -1649,6 +1655,34 @@  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
     }
 }
 
+/* Enqueue packets in an intermediate queue and call the burst
+ * function when the queue is full. This way we can amortize the
+ * cost of MMIO writes. */
+static inline void
+netdev_dpdk_eth_tx_queue(struct netdev_dpdk *dev, int qid,
+                            struct rte_mbuf **pkts, int cnt)
+{
+    struct dpdk_tx_queue *txq = &dev->tx_q[qid];
+
+    int i = 0;
+
+    while (i < cnt) {
+        int freeslots = NETDEV_MAX_BURST - txq->count;
+        int tocopy = MIN(freeslots, cnt-i);
+
+        memcpy(&txq->burst_pkts[txq->count], &pkts[i],
+               tocopy * sizeof (struct rte_mbuf *));
+
+        txq->count += tocopy;
+        i += tocopy;
+
+        /* Queue full, burst the packets */
+        if (txq->count >= NETDEV_MAX_BURST) {
+           netdev_dpdk_eth_tx_burst(dev, qid, txq->burst_pkts, txq->count);
+        }
+    }
+}
+
 static int
 netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
                        struct dp_packet_batch *batch,
@@ -1697,7 +1731,7 @@  netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
         cnt = netdev_dpdk_qos_run(dev, pkts, cnt);
         dropped = batch->count - cnt;
 
-        dropped += netdev_dpdk_eth_tx_burst(dev, qid, pkts, cnt);
+        netdev_dpdk_eth_tx_queue(dev, qid, pkts, cnt);
 
         if (OVS_UNLIKELY(dropped)) {
             rte_spinlock_lock(&dev->stats_lock);
@@ -1711,6 +1745,22 @@  netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
     }
 }
 
+/* Drain tx queues, this is called periodically to empty the
+ * intermediate queue in case of fewer packets(< NETDEV_MAX_BURST)
+ * in the queue */
+static int
+netdev_dpdk_txq_drain(struct netdev *netdev, int qid)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    struct dpdk_tx_queue *txq = &dev->tx_q[qid];
+
+    if (OVS_LIKELY(txq->count)) {
+        netdev_dpdk_eth_tx_burst(dev, qid, txq->burst_pkts, txq->count);
+    }
+
+    return 0;
+}
+
 static int
 netdev_dpdk_eth_send(struct netdev *netdev, int qid,
                      struct dp_packet_batch *batch, bool may_steal,
@@ -3051,7 +3101,7 @@  unlock:
                           SET_CONFIG, SET_TX_MULTIQ, SEND,    \
                           GET_CARRIER, GET_STATS,             \
                           GET_FEATURES, GET_STATUS,           \
-                          RECONFIGURE, RXQ_RECV)              \
+                          RECONFIGURE, RXQ_RECV, TXQ_DRAIN)   \
 {                                                             \
     NAME,                                                     \
     true,                       /* is_pmd */                  \
@@ -3118,6 +3168,7 @@  unlock:
     RXQ_RECV,                                                 \
     NULL,                       /* rx_wait */                 \
     NULL,                       /* rxq_drain */               \
+    TXQ_DRAIN,                  /* txq_drain */               \
 }
 
 static const struct netdev_class dpdk_class =
@@ -3134,7 +3185,8 @@  static const struct netdev_class dpdk_class =
         netdev_dpdk_get_features,
         netdev_dpdk_get_status,
         netdev_dpdk_reconfigure,
-        netdev_dpdk_rxq_recv);
+        netdev_dpdk_rxq_recv,
+        netdev_dpdk_txq_drain);
 
 static const struct netdev_class dpdk_ring_class =
     NETDEV_DPDK_CLASS(
@@ -3150,7 +3202,8 @@  static const struct netdev_class dpdk_ring_class =
         netdev_dpdk_get_features,
         netdev_dpdk_get_status,
         netdev_dpdk_reconfigure,
-        netdev_dpdk_rxq_recv);
+        netdev_dpdk_rxq_recv,
+        NULL);
 
 static const struct netdev_class dpdk_vhost_class =
     NETDEV_DPDK_CLASS(
@@ -3166,7 +3219,8 @@  static const struct netdev_class dpdk_vhost_class =
         NULL,
         NULL,
         netdev_dpdk_vhost_reconfigure,
-        netdev_dpdk_vhost_rxq_recv);
+        netdev_dpdk_vhost_rxq_recv,
+        NULL);
 static const struct netdev_class dpdk_vhost_client_class =
     NETDEV_DPDK_CLASS(
         "dpdkvhostuserclient",
@@ -3181,7 +3235,8 @@  static const struct netdev_class dpdk_vhost_client_class =
         NULL,
         NULL,
         netdev_dpdk_vhost_client_reconfigure,
-        netdev_dpdk_vhost_rxq_recv);
+        netdev_dpdk_vhost_rxq_recv,
+        NULL);
 
 void
 netdev_dpdk_register(void)
diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
index dec1a8e..113d334 100644
--- a/lib/netdev-dummy.c
+++ b/lib/netdev-dummy.c
@@ -1384,6 +1384,7 @@  netdev_dummy_update_flags(struct netdev *netdev_,
     netdev_dummy_rxq_recv,                                      \
     netdev_dummy_rxq_wait,                                      \
     netdev_dummy_rxq_drain,                                     \
+    NULL,                                                       \
 }
 
 static const struct netdev_class dummy_class =
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index a5a9ec1..2499b3e 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -2831,6 +2831,7 @@  netdev_linux_update_flags(struct netdev *netdev_, enum netdev_flags off,
     netdev_linux_rxq_recv,                                      \
     netdev_linux_rxq_wait,                                      \
     netdev_linux_rxq_drain,                                     \
+    NULL,                                                       \
 }
 
 const struct netdev_class netdev_linux_class =
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index c8507a5..fa2d8ba 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -764,6 +764,9 @@  struct netdev_class {
 
     /* Discards all packets waiting to be received from 'rx'. */
     int (*rxq_drain)(struct netdev_rxq *rx);
+
+    /* Drain all packets on queue 'qid'. */
+    int (*txq_drain)(struct netdev *netdev, int qid);
 };
 
 int netdev_register_provider(const struct netdev_class *);
diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
index 02a246a..359932c 100644
--- a/lib/netdev-vport.c
+++ b/lib/netdev-vport.c
@@ -810,7 +810,8 @@  get_stats(const struct netdev *netdev, struct netdev_stats *stats)
     NULL,                   /* rx_dealloc */                \
     NULL,                   /* rx_recv */                   \
     NULL,                   /* rx_wait */                   \
-    NULL,                   /* rx_drain */
+    NULL,                   /* rx_drain */                  \
+    NULL,                   /* tx_drain */
 
 
 #define TUNNEL_CLASS(NAME, DPIF_PORT, BUILD_HEADER, PUSH_HEADER, POP_HEADER)   \
diff --git a/lib/netdev.c b/lib/netdev.c
index 0258b32..6b4fa9d 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -664,6 +664,15 @@  netdev_rxq_drain(struct netdev_rxq *rx)
             : 0);
 }
 
+/* Flush packets on the queue 'qid'. */
+int
+netdev_txq_drain(struct netdev *netdev, int qid)
+{
+    return (netdev->netdev_class->txq_drain
+            ? netdev->netdev_class->txq_drain(netdev, qid)
+            : EOPNOTSUPP);
+}
+
 /* Configures the number of tx queues of 'netdev'. Returns 0 if successful,
  * otherwise a positive errno value.
  *
diff --git a/lib/netdev.h b/lib/netdev.h
index 03059ca..d8b0ab3 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -153,6 +153,7 @@  int netdev_rxq_drain(struct netdev_rxq *);
 int netdev_send(struct netdev *, int qid, struct dp_packet_batch *,
                 bool may_steal, bool concurrent_txq);
 void netdev_send_wait(struct netdev *, int qid);
+int netdev_txq_drain(struct netdev *, int qid);
 
 /* native tunnel APIs */
 /* Structure to pass parameters required to build a tunnel header. */