[ovs-dev,v4,5/5] dpif-netdev: Flush the packets in intermediate queue.

Message ID 1502211976-76937-6-git-send-email-bhanuprakash.bodireddy@intel.com
State Rejected
Delegated to: Darrell Ball
Headers show

Commit Message

Bodireddy, Bhanuprakash Aug. 8, 2017, 5:06 p.m.
Under low rate traffic conditions, there can be 2 issues.
  (1) Packets potentially can get stuck in the intermediate queue.
  (2) Latency of the packets can increase significantly due to
       buffering in intermediate queue.

This commit handles the (1) issue by flushing the tx port queues using
dp_netdev_flush_txq_ports() as part of PMD packet processing loop.
Also this commit addresses issue (2) by flushing the tx queues after
every rxq port processing. This reduces the latency with out impacting
the forwarding throughput.

   MASTER
  --------
   Pkt size  min(ns)   avg(ns)   max(ns)
    512      4,631      5,022    309,914
   1024      5,545      5,749    104,294
   1280      5,978      6,159     45,306
   1518      6,419      6,774    946,850

  MASTER + COMMIT
  -----------------
   Pkt size  min(ns)   avg(ns)   max(ns)
    512      4,711      5,064    182,477
   1024      5,601      5,888    701,654
   1280      6,018      6,491    533,037
   1518      6,467      6,734    312,471

PMDs can be teared down and spawned at runtime and so the rxq and txq
mapping of the PMD threads can change. In few cases packets can get
stuck in the queue due to reconfiguration and this commit helps flush
the queues.

Suggested-by: Eelco Chaudron <echaudro@redhat.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331039.html
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
---
 lib/dpif-netdev.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

Comments

Darrell Ball Aug. 11, 2017, 1:02 a.m. | #1
Hi Bhanu

Given that you ultimately intend changes beyond those in this patch, would it make sense
just to fold the follow up series (at least, the key elements) into this series, essentially expanding
on this patch 5 ?

Thanks Darrell

-----Original Message-----
From: <ovs-dev-bounces@openvswitch.org> on behalf of Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Date: Tuesday, August 8, 2017 at 10:06 AM
To: "dev@openvswitch.org" <dev@openvswitch.org>
Subject: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in	intermediate queue.

    Under low rate traffic conditions, there can be 2 issues.
      (1) Packets potentially can get stuck in the intermediate queue.
      (2) Latency of the packets can increase significantly due to
           buffering in intermediate queue.
    
    This commit handles the (1) issue by flushing the tx port queues using
    dp_netdev_flush_txq_ports() as part of PMD packet processing loop.
    Also this commit addresses issue (2) by flushing the tx queues after
    every rxq port processing. This reduces the latency with out impacting
    the forwarding throughput.
    
       MASTER
      --------
       Pkt size  min(ns)   avg(ns)   max(ns)
        512      4,631      5,022    309,914
       1024      5,545      5,749    104,294
       1280      5,978      6,159     45,306
       1518      6,419      6,774    946,850
    
      MASTER + COMMIT
      -----------------
       Pkt size  min(ns)   avg(ns)   max(ns)
        512      4,711      5,064    182,477
       1024      5,601      5,888    701,654
       1280      6,018      6,491    533,037
       1518      6,467      6,734    312,471
    
    PMDs can be teared down and spawned at runtime and so the rxq and txq
    mapping of the PMD threads can change. In few cases packets can get
    stuck in the queue due to reconfiguration and this commit helps flush
    the queues.
    
    Suggested-by: Eelco Chaudron <echaudro@redhat.com>
    Reported-at: https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-2DApril_331039.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=H0yNRh-c9pdYHacCzkoruc48Dj_Whkcwcjzv-vta-EI&e= 
    Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
    Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
    Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
    Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
    Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
    Acked-by: Eelco Chaudron <echaudro@redhat.com>
    ---
     lib/dpif-netdev.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
     1 file changed, 51 insertions(+), 1 deletion(-)
    
    diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
    index e2cd931..bfb9650 100644
    --- a/lib/dpif-netdev.c
    +++ b/lib/dpif-netdev.c
    @@ -340,6 +340,7 @@ enum pmd_cycles_counter_type {
     };
     
     #define XPS_TIMEOUT_MS 500LL
    +#define LAST_USED_QID_NONE -1
     
     /* Contained by struct dp_netdev_port's 'rxqs' member.  */
     struct dp_netdev_rxq {
    @@ -500,7 +501,13 @@ struct rxq_poll {
     struct tx_port {
         struct dp_netdev_port *port;
         int qid;
    -    long long last_used;
    +    int last_used_qid;        /* Last queue id where packets got
    +                                 enqueued. */
    +    long long last_used;      /* In case XPS is enabled, it contains the
    +                               * timestamp of the last time the port was
    +                               * used by the thread to send data.  After
    +                               * XPS_TIMEOUT_MS elapses the qid will be
    +                               * marked as -1. */
         struct hmap_node node;
     };
     
    @@ -3101,6 +3108,25 @@ cycles_count_intermediate(struct dp_netdev_pmd_thread *pmd,
         non_atomic_ullong_add(&pmd->cycles.n[type], interval);
     }
     
    +static void
    +dp_netdev_flush_txq_ports(struct dp_netdev_pmd_thread *pmd)
    +{
    +    struct tx_port *cached_tx_port;
    +    int tx_qid;
    +
    +    HMAP_FOR_EACH (cached_tx_port, node, &pmd->send_port_cache) {
    +        tx_qid = cached_tx_port->last_used_qid;
    +
    +        if (tx_qid != LAST_USED_QID_NONE) {
    +            netdev_txq_flush(cached_tx_port->port->netdev, tx_qid,
    +                         cached_tx_port->port->dynamic_txqs);
    +
    +            /* Queue flushed and mark it empty. */
    +            cached_tx_port->last_used_qid = LAST_USED_QID_NONE;
    +        }
    +    }
    +}
    +
     static int
     dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
                                struct netdev_rxq *rx,
    @@ -3667,6 +3693,9 @@ dpif_netdev_run(struct dpif *dpif)
                             dp_netdev_process_rxq_port(non_pmd,
                                                        port->rxqs[i].rx,
                                                        port->port_no);
    +
    +                    dp_netdev_flush_txq_ports(non_pmd);
    +
                         cycles_count_intermediate(non_pmd, process_packets ?
                                                            PMD_CYCLES_PROCESSING
                                                          : PMD_CYCLES_IDLE);
    @@ -3854,6 +3883,8 @@ reload:
                 process_packets =
                     dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
                                                poll_list[i].port_no);
    +            dp_netdev_flush_txq_ports(pmd);
    +
                 cycles_count_intermediate(pmd,
                                           process_packets ? PMD_CYCLES_PROCESSING
                                                           : PMD_CYCLES_IDLE);
    @@ -3879,6 +3910,9 @@ reload:
     
         cycles_count_end(pmd, PMD_CYCLES_IDLE);
     
    +    /* Flush the queues as part of reconfiguration logic. */
    +    dp_netdev_flush_txq_ports(pmd);
    +
         poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
         exiting = latch_is_set(&pmd->exit_latch);
         /* Signal here to make sure the pmd finishes
    @@ -4481,6 +4515,7 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
     
         tx->port = port;
         tx->qid = -1;
    +    tx->last_used_qid = LAST_USED_QID_NONE;
     
         hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
         pmd->need_reload = true;
    @@ -5051,6 +5086,14 @@ dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
     
         dpif_netdev_xps_revalidate_pmd(pmd, now, false);
     
    +    /* The tx queue can change in XPS case, make sure packets in previous
    +     * queue is flushed properly. */
    +    if (tx->last_used_qid != LAST_USED_QID_NONE &&
    +               tx->qid != tx->last_used_qid) {
    +        netdev_txq_flush(port->netdev, tx->last_used_qid, port->dynamic_txqs);
    +        tx->last_used_qid = LAST_USED_QID_NONE;
    +    }
    +
         VLOG_DBG("Core %d: New TX queue ID %d for port \'%s\'.",
                  pmd->core_id, tx->qid, netdev_get_name(tx->port->netdev));
         return min_qid;
    @@ -5146,6 +5189,13 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
                     tx_qid = pmd->static_tx_qid;
                 }
     
    +            /* In case these packets gets buffered into an intermediate
    +             * queue and XPS is enabled the flush function could find a
    +             * different tx qid assigned to its thread.  We keep track
    +             * of the qid we're now using, that will trigger the flush
    +             * function and will select the right queue to flush. */
    +            p->last_used_qid = tx_qid;
    +
                 netdev_send(p->port->netdev, tx_qid, packets_, may_steal,
                             dynamic_txqs);
                 return;
    -- 
    2.4.11
    
    _______________________________________________
    dev mailing list
    dev@openvswitch.org
    https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.openvswitch.org_mailman_listinfo_ovs-2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=IXKDcYc2D7SfGkflLokKVfrNafJlFLfT94q366bHHFo&e=
Bodireddy, Bhanuprakash Aug. 11, 2017, 1:11 p.m. | #2
Hello All,

Adding all the people here who had either reviewed or provided their feedback
on the batching patches at some stage.

you are already aware that there are 2 different series on ML to implement tx
 batching (netdev layer vs dpif layer) that improves DPDK datapath performance.  
Our output batching is in netdev layer whereas ilya moved it to  dpif layer 
and simplified it. Each approach has its own pros and cons and had been
discussed in earlier threads.

While reviewing v4 of my patch series,  ilya detected a race condition that happens
when the queues in the guest are enabled/disabled at run time. Though we have
solutions to address this issue and implemented it, I realized that the code complexity
has increased and changes spanning multiple functions with additional spin locks  to 
address this one corner case. 

I think, though our patch series has flexibility it has gotten lot complex now and would
be difficult to maintain in the future. At this stage I would like to lean towards simpler
solution that's more maintainable  which is implemented by Ilya.

I would like to thank Eelco, Darrell, Jan and Ilya for reviewing our series and providing their
feedback.

Bhanuprakash. 

>-----Original Message-----
>From: Darrell Ball [mailto:dball@vmware.com]
>Sent: Friday, August 11, 2017 2:03 AM
>To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy@intel.com>;
>dev@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
>intermediate queue.
>
>Hi Bhanu
>
>Given that you ultimately intend changes beyond those in this patch, would it
>make sense just to fold the follow up series (at least, the key elements) into
>this series, essentially expanding on this patch 5 ?
>
>Thanks Darrell
>
>-----Original Message-----
>From: <ovs-dev-bounces@openvswitch.org> on behalf of Bhanuprakash
>Bodireddy <bhanuprakash.bodireddy@intel.com>
>Date: Tuesday, August 8, 2017 at 10:06 AM
>To: "dev@openvswitch.org" <dev@openvswitch.org>
>Subject: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
>	intermediate queue.
>
>    Under low rate traffic conditions, there can be 2 issues.
>      (1) Packets potentially can get stuck in the intermediate queue.
>      (2) Latency of the packets can increase significantly due to
>           buffering in intermediate queue.
>
>    This commit handles the (1) issue by flushing the tx port queues using
>    dp_netdev_flush_txq_ports() as part of PMD packet processing loop.
>    Also this commit addresses issue (2) by flushing the tx queues after
>    every rxq port processing. This reduces the latency with out impacting
>    the forwarding throughput.
>
>       MASTER
>      --------
>       Pkt size  min(ns)   avg(ns)   max(ns)
>        512      4,631      5,022    309,914
>       1024      5,545      5,749    104,294
>       1280      5,978      6,159     45,306
>       1518      6,419      6,774    946,850
>
>      MASTER + COMMIT
>      -----------------
>       Pkt size  min(ns)   avg(ns)   max(ns)
>        512      4,711      5,064    182,477
>       1024      5,601      5,888    701,654
>       1280      6,018      6,491    533,037
>       1518      6,467      6,734    312,471
>
>    PMDs can be teared down and spawned at runtime and so the rxq and txq
>    mapping of the PMD threads can change. In few cases packets can get
>    stuck in the queue due to reconfiguration and this commit helps flush
>    the queues.
>
>    Suggested-by: Eelco Chaudron <echaudro@redhat.com>
>    Reported-at: https://urldefense.proofpoint.com/v2/url?u=https-
>3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-
>2DApril_331039.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09C
>GX7JQ5Ih-
>uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=H0yNRh-
>c9pdYHacCzkoruc48Dj_Whkcwcjzv-vta-EI&e=
>    Signed-off-by: Bhanuprakash Bodireddy
><bhanuprakash.bodireddy@intel.com>
>    Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
>    Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
>    Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
>    Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
>    Acked-by: Eelco Chaudron <echaudro@redhat.com>
>    ---
>     lib/dpif-netdev.c | 52
>+++++++++++++++++++++++++++++++++++++++++++++++++++-
>     1 file changed, 51 insertions(+), 1 deletion(-)
>
>    diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>    index e2cd931..bfb9650 100644
>    --- a/lib/dpif-netdev.c
>    +++ b/lib/dpif-netdev.c
>    @@ -340,6 +340,7 @@ enum pmd_cycles_counter_type {
>     };
>
>     #define XPS_TIMEOUT_MS 500LL
>    +#define LAST_USED_QID_NONE -1
>
>     /* Contained by struct dp_netdev_port's 'rxqs' member.  */
>     struct dp_netdev_rxq {
>    @@ -500,7 +501,13 @@ struct rxq_poll {
>     struct tx_port {
>         struct dp_netdev_port *port;
>         int qid;
>    -    long long last_used;
>    +    int last_used_qid;        /* Last queue id where packets got
>    +                                 enqueued. */
>    +    long long last_used;      /* In case XPS is enabled, it contains the
>    +                               * timestamp of the last time the port was
>    +                               * used by the thread to send data.  After
>    +                               * XPS_TIMEOUT_MS elapses the qid will be
>    +                               * marked as -1. */
>         struct hmap_node node;
>     };
>
>    @@ -3101,6 +3108,25 @@ cycles_count_intermediate(struct
>dp_netdev_pmd_thread *pmd,
>         non_atomic_ullong_add(&pmd->cycles.n[type], interval);
>     }
>
>    +static void
>    +dp_netdev_flush_txq_ports(struct dp_netdev_pmd_thread *pmd)
>    +{
>    +    struct tx_port *cached_tx_port;
>    +    int tx_qid;
>    +
>    +    HMAP_FOR_EACH (cached_tx_port, node, &pmd->send_port_cache) {
>    +        tx_qid = cached_tx_port->last_used_qid;
>    +
>    +        if (tx_qid != LAST_USED_QID_NONE) {
>    +            netdev_txq_flush(cached_tx_port->port->netdev, tx_qid,
>    +                         cached_tx_port->port->dynamic_txqs);
>    +
>    +            /* Queue flushed and mark it empty. */
>    +            cached_tx_port->last_used_qid = LAST_USED_QID_NONE;
>    +        }
>    +    }
>    +}
>    +
>     static int
>     dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>                                struct netdev_rxq *rx,
>    @@ -3667,6 +3693,9 @@ dpif_netdev_run(struct dpif *dpif)
>                             dp_netdev_process_rxq_port(non_pmd,
>                                                        port->rxqs[i].rx,
>                                                        port->port_no);
>    +
>    +                    dp_netdev_flush_txq_ports(non_pmd);
>    +
>                         cycles_count_intermediate(non_pmd, process_packets ?
>                                                            PMD_CYCLES_PROCESSING
>                                                          : PMD_CYCLES_IDLE);
>    @@ -3854,6 +3883,8 @@ reload:
>                 process_packets =
>                     dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
>                                                poll_list[i].port_no);
>    +            dp_netdev_flush_txq_ports(pmd);
>    +
>                 cycles_count_intermediate(pmd,
>                                           process_packets ? PMD_CYCLES_PROCESSING
>                                                           : PMD_CYCLES_IDLE);
>    @@ -3879,6 +3910,9 @@ reload:
>
>         cycles_count_end(pmd, PMD_CYCLES_IDLE);
>
>    +    /* Flush the queues as part of reconfiguration logic. */
>    +    dp_netdev_flush_txq_ports(pmd);
>    +
>         poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>         exiting = latch_is_set(&pmd->exit_latch);
>         /* Signal here to make sure the pmd finishes
>    @@ -4481,6 +4515,7 @@ dp_netdev_add_port_tx_to_pmd(struct
>dp_netdev_pmd_thread *pmd,
>
>         tx->port = port;
>         tx->qid = -1;
>    +    tx->last_used_qid = LAST_USED_QID_NONE;
>
>         hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port-
>>port_no));
>         pmd->need_reload = true;
>    @@ -5051,6 +5086,14 @@ dpif_netdev_xps_get_tx_qid(const struct
>dp_netdev_pmd_thread *pmd,
>
>         dpif_netdev_xps_revalidate_pmd(pmd, now, false);
>
>    +    /* The tx queue can change in XPS case, make sure packets in previous
>    +     * queue is flushed properly. */
>    +    if (tx->last_used_qid != LAST_USED_QID_NONE &&
>    +               tx->qid != tx->last_used_qid) {
>    +        netdev_txq_flush(port->netdev, tx->last_used_qid, port-
>>dynamic_txqs);
>    +        tx->last_used_qid = LAST_USED_QID_NONE;
>    +    }
>    +
>         VLOG_DBG("Core %d: New TX queue ID %d for port \'%s\'.",
>                  pmd->core_id, tx->qid, netdev_get_name(tx->port->netdev));
>         return min_qid;
>    @@ -5146,6 +5189,13 @@ dp_execute_cb(void *aux_, struct
>dp_packet_batch *packets_,
>                     tx_qid = pmd->static_tx_qid;
>                 }
>
>    +            /* In case these packets gets buffered into an intermediate
>    +             * queue and XPS is enabled the flush function could find a
>    +             * different tx qid assigned to its thread.  We keep track
>    +             * of the qid we're now using, that will trigger the flush
>    +             * function and will select the right queue to flush. */
>    +            p->last_used_qid = tx_qid;
>    +
>                 netdev_send(p->port->netdev, tx_qid, packets_, may_steal,
>                             dynamic_txqs);
>                 return;
>    --
>    2.4.11
>
>    _______________________________________________
>    dev mailing list
>    dev@openvswitch.org
>    https://urldefense.proofpoint.com/v2/url?u=https-
>3A__mail.openvswitch.org_mailman_listinfo_ovs-
>2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-
>uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=IXKDcYc2
>D7SfGkflLokKVfrNafJlFLfT94q366bHHFo&e=
>
Jan Scheurich Aug. 11, 2017, 3:20 p.m. | #3
Hi Bhanu,

Thanks a lot for all the work you put into making the original hackfest patch good enough for upstreaming in an increasingly complex and ever-evolving DPDK datapath.

I tend to agree that the complexity to achieve thread safety is becoming overwhelming. Ilya's v3 patch set submitted yesterday looks much simpler. I fully support to focus on this one, especially as it already covers time-based tx batching.

We will start testing Ilya's patch now and come back with results/comments next week.

BR, Jan

> -----Original Message-----
> From: Bodireddy, Bhanuprakash [mailto:bhanuprakash.bodireddy@intel.com]
> Sent: Friday, 11 August, 2017 15:12
> To: Darrell Ball <dball@vmware.com>; dev@openvswitch.org; Ilya Maximets <i.maximets@samsung.com>; Jan Scheurich
> <jan.scheurich@ericsson.com>; Eelco Chaudron <echaudro@redhat.com>
> Cc: Ben Pfaff <blp@ovn.org>; Stokes, Ian <ian.stokes@intel.com>; Gray, Mark D <mark.d.gray@intel.com>; Fischetti, Antonio
> <antonio.fischetti@intel.com>; Kevin Traynor <ktraynor@redhat.com>
> Subject: RE: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in intermediate queue.
> 
> Hello All,
> 
> Adding all the people here who had either reviewed or provided their feedback
> on the batching patches at some stage.
> 
> you are already aware that there are 2 different series on ML to implement tx
>  batching (netdev layer vs dpif layer) that improves DPDK datapath performance.
> Our output batching is in netdev layer whereas ilya moved it to  dpif layer
> and simplified it. Each approach has its own pros and cons and had been
> discussed in earlier threads.
> 
> While reviewing v4 of my patch series,  ilya detected a race condition that happens
> when the queues in the guest are enabled/disabled at run time. Though we have
> solutions to address this issue and implemented it, I realized that the code complexity
> has increased and changes spanning multiple functions with additional spin locks  to
> address this one corner case.
> 
> I think, though our patch series has flexibility it has gotten lot complex now and would
> be difficult to maintain in the future. At this stage I would like to lean towards simpler
> solution that's more maintainable  which is implemented by Ilya.
> 
> I would like to thank Eelco, Darrell, Jan and Ilya for reviewing our series and providing their
> feedback.
> 
> Bhanuprakash.
> 
> >-----Original Message-----
> >From: Darrell Ball [mailto:dball@vmware.com]
> >Sent: Friday, August 11, 2017 2:03 AM
> >To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy@intel.com>;
> >dev@openvswitch.org
> >Subject: Re: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
> >intermediate queue.
> >
> >Hi Bhanu
> >
> >Given that you ultimately intend changes beyond those in this patch, would it
> >make sense just to fold the follow up series (at least, the key elements) into
> >this series, essentially expanding on this patch 5 ?
> >
> >Thanks Darrell
> >
> >-----Original Message-----
> >From: <ovs-dev-bounces@openvswitch.org> on behalf of Bhanuprakash
> >Bodireddy <bhanuprakash.bodireddy@intel.com>
> >Date: Tuesday, August 8, 2017 at 10:06 AM
> >To: "dev@openvswitch.org" <dev@openvswitch.org>
> >Subject: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
> >	intermediate queue.
> >
> >    Under low rate traffic conditions, there can be 2 issues.
> >      (1) Packets potentially can get stuck in the intermediate queue.
> >      (2) Latency of the packets can increase significantly due to
> >           buffering in intermediate queue.
> >
> >    This commit handles the (1) issue by flushing the tx port queues using
> >    dp_netdev_flush_txq_ports() as part of PMD packet processing loop.
> >    Also this commit addresses issue (2) by flushing the tx queues after
> >    every rxq port processing. This reduces the latency with out impacting
> >    the forwarding throughput.
> >
> >       MASTER
> >      --------
> >       Pkt size  min(ns)   avg(ns)   max(ns)
> >        512      4,631      5,022    309,914
> >       1024      5,545      5,749    104,294
> >       1280      5,978      6,159     45,306
> >       1518      6,419      6,774    946,850
> >
> >      MASTER + COMMIT
> >      -----------------
> >       Pkt size  min(ns)   avg(ns)   max(ns)
> >        512      4,711      5,064    182,477
> >       1024      5,601      5,888    701,654
> >       1280      6,018      6,491    533,037
> >       1518      6,467      6,734    312,471
> >
> >    PMDs can be teared down and spawned at runtime and so the rxq and txq
> >    mapping of the PMD threads can change. In few cases packets can get
> >    stuck in the queue due to reconfiguration and this commit helps flush
> >    the queues.
> >
> >    Suggested-by: Eelco Chaudron <echaudro@redhat.com>
> >    Reported-at: https://urldefense.proofpoint.com/v2/url?u=https-
> >3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-
> >2DApril_331039.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09C
> >GX7JQ5Ih-
> >uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=H0yNRh-
> >c9pdYHacCzkoruc48Dj_Whkcwcjzv-vta-EI&e=
> >    Signed-off-by: Bhanuprakash Bodireddy
> ><bhanuprakash.bodireddy@intel.com>
> >    Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
> >    Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
> >    Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
> >    Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
> >    Acked-by: Eelco Chaudron <echaudro@redhat.com>
> >    ---
> >     lib/dpif-netdev.c | 52
> >+++++++++++++++++++++++++++++++++++++++++++++++++++-
> >     1 file changed, 51 insertions(+), 1 deletion(-)
> >
> >    diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> >    index e2cd931..bfb9650 100644
> >    --- a/lib/dpif-netdev.c
> >    +++ b/lib/dpif-netdev.c
> >    @@ -340,6 +340,7 @@ enum pmd_cycles_counter_type {
> >     };
> >
> >     #define XPS_TIMEOUT_MS 500LL
> >    +#define LAST_USED_QID_NONE -1
> >
> >     /* Contained by struct dp_netdev_port's 'rxqs' member.  */
> >     struct dp_netdev_rxq {
> >    @@ -500,7 +501,13 @@ struct rxq_poll {
> >     struct tx_port {
> >         struct dp_netdev_port *port;
> >         int qid;
> >    -    long long last_used;
> >    +    int last_used_qid;        /* Last queue id where packets got
> >    +                                 enqueued. */
> >    +    long long last_used;      /* In case XPS is enabled, it contains the
> >    +                               * timestamp of the last time the port was
> >    +                               * used by the thread to send data.  After
> >    +                               * XPS_TIMEOUT_MS elapses the qid will be
> >    +                               * marked as -1. */
> >         struct hmap_node node;
> >     };
> >
> >    @@ -3101,6 +3108,25 @@ cycles_count_intermediate(struct
> >dp_netdev_pmd_thread *pmd,
> >         non_atomic_ullong_add(&pmd->cycles.n[type], interval);
> >     }
> >
> >    +static void
> >    +dp_netdev_flush_txq_ports(struct dp_netdev_pmd_thread *pmd)
> >    +{
> >    +    struct tx_port *cached_tx_port;
> >    +    int tx_qid;
> >    +
> >    +    HMAP_FOR_EACH (cached_tx_port, node, &pmd->send_port_cache) {
> >    +        tx_qid = cached_tx_port->last_used_qid;
> >    +
> >    +        if (tx_qid != LAST_USED_QID_NONE) {
> >    +            netdev_txq_flush(cached_tx_port->port->netdev, tx_qid,
> >    +                         cached_tx_port->port->dynamic_txqs);
> >    +
> >    +            /* Queue flushed and mark it empty. */
> >    +            cached_tx_port->last_used_qid = LAST_USED_QID_NONE;
> >    +        }
> >    +    }
> >    +}
> >    +
> >     static int
> >     dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> >                                struct netdev_rxq *rx,
> >    @@ -3667,6 +3693,9 @@ dpif_netdev_run(struct dpif *dpif)
> >                             dp_netdev_process_rxq_port(non_pmd,
> >                                                        port->rxqs[i].rx,
> >                                                        port->port_no);
> >    +
> >    +                    dp_netdev_flush_txq_ports(non_pmd);
> >    +
> >                         cycles_count_intermediate(non_pmd, process_packets ?
> >                                                            PMD_CYCLES_PROCESSING
> >                                                          : PMD_CYCLES_IDLE);
> >    @@ -3854,6 +3883,8 @@ reload:
> >                 process_packets =
> >                     dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
> >                                                poll_list[i].port_no);
> >    +            dp_netdev_flush_txq_ports(pmd);
> >    +
> >                 cycles_count_intermediate(pmd,
> >                                           process_packets ? PMD_CYCLES_PROCESSING
> >                                                           : PMD_CYCLES_IDLE);
> >    @@ -3879,6 +3910,9 @@ reload:
> >
> >         cycles_count_end(pmd, PMD_CYCLES_IDLE);
> >
> >    +    /* Flush the queues as part of reconfiguration logic. */
> >    +    dp_netdev_flush_txq_ports(pmd);
> >    +
> >         poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
> >         exiting = latch_is_set(&pmd->exit_latch);
> >         /* Signal here to make sure the pmd finishes
> >    @@ -4481,6 +4515,7 @@ dp_netdev_add_port_tx_to_pmd(struct
> >dp_netdev_pmd_thread *pmd,
> >
> >         tx->port = port;
> >         tx->qid = -1;
> >    +    tx->last_used_qid = LAST_USED_QID_NONE;
> >
> >         hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port-
> >>port_no));
> >         pmd->need_reload = true;
> >    @@ -5051,6 +5086,14 @@ dpif_netdev_xps_get_tx_qid(const struct
> >dp_netdev_pmd_thread *pmd,
> >
> >         dpif_netdev_xps_revalidate_pmd(pmd, now, false);
> >
> >    +    /* The tx queue can change in XPS case, make sure packets in previous
> >    +     * queue is flushed properly. */
> >    +    if (tx->last_used_qid != LAST_USED_QID_NONE &&
> >    +               tx->qid != tx->last_used_qid) {
> >    +        netdev_txq_flush(port->netdev, tx->last_used_qid, port-
> >>dynamic_txqs);
> >    +        tx->last_used_qid = LAST_USED_QID_NONE;
> >    +    }
> >    +
> >         VLOG_DBG("Core %d: New TX queue ID %d for port \'%s\'.",
> >                  pmd->core_id, tx->qid, netdev_get_name(tx->port->netdev));
> >         return min_qid;
> >    @@ -5146,6 +5189,13 @@ dp_execute_cb(void *aux_, struct
> >dp_packet_batch *packets_,
> >                     tx_qid = pmd->static_tx_qid;
> >                 }
> >
> >    +            /* In case these packets gets buffered into an intermediate
> >    +             * queue and XPS is enabled the flush function could find a
> >    +             * different tx qid assigned to its thread.  We keep track
> >    +             * of the qid we're now using, that will trigger the flush
> >    +             * function and will select the right queue to flush. */
> >    +            p->last_used_qid = tx_qid;
> >    +
> >                 netdev_send(p->port->netdev, tx_qid, packets_, may_steal,
> >                             dynamic_txqs);
> >                 return;
> >    --
> >    2.4.11
> >
> >    _______________________________________________
> >    dev mailing list
> >    dev@openvswitch.org
> >    https://urldefense.proofpoint.com/v2/url?u=https-
> >3A__mail.openvswitch.org_mailman_listinfo_ovs-
> >2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-
> >uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=IXKDcYc2
> >D7SfGkflLokKVfrNafJlFLfT94q366bHHFo&e=
> >
Darrell Ball Aug. 11, 2017, 4:10 p.m. | #4
Hi Bhanu 

Thanks for your efforts.

The reason I sent this e-mail regarding patch 5 is that as is, it could not be merged.
Since, I have not seen the incremental changes (used in some referred customer testing) beyond this, I wanted to
give a reasonable chance to evaluate the full changes.

Cheers Darrell


-----Original Message-----
From: "Bodireddy, Bhanuprakash" <bhanuprakash.bodireddy@intel.com>
Date: Friday, August 11, 2017 at 6:11 AM
To: Darrell Ball <dball@vmware.com>, "dev@openvswitch.org" <dev@openvswitch.org>, Ilya Maximets <i.maximets@samsung.com>, Jan Scheurich <jan.scheurich@ericsson.com>, Eelco Chaudron <echaudro@redhat.com>
Cc: Ben Pfaff <blp@ovn.org>, "Stokes, Ian" <ian.stokes@intel.com>, "Gray, Mark D" <mark.d.gray@intel.com>, "Fischetti, Antonio" <antonio.fischetti@intel.com>, Kevin Traynor <ktraynor@redhat.com>
Subject: RE: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in intermediate queue.

    Hello All,
    
    Adding all the people here who had either reviewed or provided their feedback
    on the batching patches at some stage.
    
    you are already aware that there are 2 different series on ML to implement tx
     batching (netdev layer vs dpif layer) that improves DPDK datapath performance.  
    Our output batching is in netdev layer whereas ilya moved it to  dpif layer 
    and simplified it. Each approach has its own pros and cons and had been
    discussed in earlier threads.
    
    While reviewing v4 of my patch series,  ilya detected a race condition that happens
    when the queues in the guest are enabled/disabled at run time. Though we have
    solutions to address this issue and implemented it, I realized that the code complexity
    has increased and changes spanning multiple functions with additional spin locks  to 
    address this one corner case. 
    
    I think, though our patch series has flexibility it has gotten lot complex now and would
    be difficult to maintain in the future. At this stage I would like to lean towards simpler
    solution that's more maintainable  which is implemented by Ilya.
    
    I would like to thank Eelco, Darrell, Jan and Ilya for reviewing our series and providing their
    feedback.
    
    Bhanuprakash. 
    
    >-----Original Message-----
    >From: Darrell Ball [mailto:dball@vmware.com]
    >Sent: Friday, August 11, 2017 2:03 AM
    >To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy@intel.com>;
    >dev@openvswitch.org
    >Subject: Re: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
    >intermediate queue.
    >
    >Hi Bhanu
    >
    >Given that you ultimately intend changes beyond those in this patch, would it
    >make sense just to fold the follow up series (at least, the key elements) into
    >this series, essentially expanding on this patch 5 ?
    >
    >Thanks Darrell
    >
    >-----Original Message-----
    >From: <ovs-dev-bounces@openvswitch.org> on behalf of Bhanuprakash
    >Bodireddy <bhanuprakash.bodireddy@intel.com>
    >Date: Tuesday, August 8, 2017 at 10:06 AM
    >To: "dev@openvswitch.org" <dev@openvswitch.org>
    >Subject: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
    >	intermediate queue.
    >
    >    Under low rate traffic conditions, there can be 2 issues.
    >      (1) Packets potentially can get stuck in the intermediate queue.
    >      (2) Latency of the packets can increase significantly due to
    >           buffering in intermediate queue.
    >
    >    This commit handles the (1) issue by flushing the tx port queues using
    >    dp_netdev_flush_txq_ports() as part of PMD packet processing loop.
    >    Also this commit addresses issue (2) by flushing the tx queues after
    >    every rxq port processing. This reduces the latency with out impacting
    >    the forwarding throughput.
    >
    >       MASTER
    >      --------
    >       Pkt size  min(ns)   avg(ns)   max(ns)
    >        512      4,631      5,022    309,914
    >       1024      5,545      5,749    104,294
    >       1280      5,978      6,159     45,306
    >       1518      6,419      6,774    946,850
    >
    >      MASTER + COMMIT
    >      -----------------
    >       Pkt size  min(ns)   avg(ns)   max(ns)
    >        512      4,711      5,064    182,477
    >       1024      5,601      5,888    701,654
    >       1280      6,018      6,491    533,037
    >       1518      6,467      6,734    312,471
    >
    >    PMDs can be teared down and spawned at runtime and so the rxq and txq
    >    mapping of the PMD threads can change. In few cases packets can get
    >    stuck in the queue due to reconfiguration and this commit helps flush
    >    the queues.
    >
    >    Suggested-by: Eelco Chaudron <echaudro@redhat.com>
    >    Reported-at: https://urldefense.proofpoint.com/v2/url?u=https-
    >3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-
    >2DApril_331039.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09C
    >GX7JQ5Ih-
    >uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=H0yNRh-
    >c9pdYHacCzkoruc48Dj_Whkcwcjzv-vta-EI&e=
    >    Signed-off-by: Bhanuprakash Bodireddy
    ><bhanuprakash.bodireddy@intel.com>
    >    Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
    >    Co-authored-by: Antonio Fischetti <antonio.fischetti@intel.com>
    >    Signed-off-by: Markus Magnusson <markus.magnusson@ericsson.com>
    >    Co-authored-by: Markus Magnusson <markus.magnusson@ericsson.com>
    >    Acked-by: Eelco Chaudron <echaudro@redhat.com>
    >    ---
    >     lib/dpif-netdev.c | 52
    >+++++++++++++++++++++++++++++++++++++++++++++++++++-
    >     1 file changed, 51 insertions(+), 1 deletion(-)
    >
    >    diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
    >    index e2cd931..bfb9650 100644
    >    --- a/lib/dpif-netdev.c
    >    +++ b/lib/dpif-netdev.c
    >    @@ -340,6 +340,7 @@ enum pmd_cycles_counter_type {
    >     };
    >
    >     #define XPS_TIMEOUT_MS 500LL
    >    +#define LAST_USED_QID_NONE -1
    >
    >     /* Contained by struct dp_netdev_port's 'rxqs' member.  */
    >     struct dp_netdev_rxq {
    >    @@ -500,7 +501,13 @@ struct rxq_poll {
    >     struct tx_port {
    >         struct dp_netdev_port *port;
    >         int qid;
    >    -    long long last_used;
    >    +    int last_used_qid;        /* Last queue id where packets got
    >    +                                 enqueued. */
    >    +    long long last_used;      /* In case XPS is enabled, it contains the
    >    +                               * timestamp of the last time the port was
    >    +                               * used by the thread to send data.  After
    >    +                               * XPS_TIMEOUT_MS elapses the qid will be
    >    +                               * marked as -1. */
    >         struct hmap_node node;
    >     };
    >
    >    @@ -3101,6 +3108,25 @@ cycles_count_intermediate(struct
    >dp_netdev_pmd_thread *pmd,
    >         non_atomic_ullong_add(&pmd->cycles.n[type], interval);
    >     }
    >
    >    +static void
    >    +dp_netdev_flush_txq_ports(struct dp_netdev_pmd_thread *pmd)
    >    +{
    >    +    struct tx_port *cached_tx_port;
    >    +    int tx_qid;
    >    +
    >    +    HMAP_FOR_EACH (cached_tx_port, node, &pmd->send_port_cache) {
    >    +        tx_qid = cached_tx_port->last_used_qid;
    >    +
    >    +        if (tx_qid != LAST_USED_QID_NONE) {
    >    +            netdev_txq_flush(cached_tx_port->port->netdev, tx_qid,
    >    +                         cached_tx_port->port->dynamic_txqs);
    >    +
    >    +            /* Queue flushed and mark it empty. */
    >    +            cached_tx_port->last_used_qid = LAST_USED_QID_NONE;
    >    +        }
    >    +    }
    >    +}
    >    +
    >     static int
    >     dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
    >                                struct netdev_rxq *rx,
    >    @@ -3667,6 +3693,9 @@ dpif_netdev_run(struct dpif *dpif)
    >                             dp_netdev_process_rxq_port(non_pmd,
    >                                                        port->rxqs[i].rx,
    >                                                        port->port_no);
    >    +
    >    +                    dp_netdev_flush_txq_ports(non_pmd);
    >    +
    >                         cycles_count_intermediate(non_pmd, process_packets ?
    >                                                            PMD_CYCLES_PROCESSING
    >                                                          : PMD_CYCLES_IDLE);
    >    @@ -3854,6 +3883,8 @@ reload:
    >                 process_packets =
    >                     dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
    >                                                poll_list[i].port_no);
    >    +            dp_netdev_flush_txq_ports(pmd);
    >    +
    >                 cycles_count_intermediate(pmd,
    >                                           process_packets ? PMD_CYCLES_PROCESSING
    >                                                           : PMD_CYCLES_IDLE);
    >    @@ -3879,6 +3910,9 @@ reload:
    >
    >         cycles_count_end(pmd, PMD_CYCLES_IDLE);
    >
    >    +    /* Flush the queues as part of reconfiguration logic. */
    >    +    dp_netdev_flush_txq_ports(pmd);
    >    +
    >         poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
    >         exiting = latch_is_set(&pmd->exit_latch);
    >         /* Signal here to make sure the pmd finishes
    >    @@ -4481,6 +4515,7 @@ dp_netdev_add_port_tx_to_pmd(struct
    >dp_netdev_pmd_thread *pmd,
    >
    >         tx->port = port;
    >         tx->qid = -1;
    >    +    tx->last_used_qid = LAST_USED_QID_NONE;
    >
    >         hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port-
    >>port_no));
    >         pmd->need_reload = true;
    >    @@ -5051,6 +5086,14 @@ dpif_netdev_xps_get_tx_qid(const struct
    >dp_netdev_pmd_thread *pmd,
    >
    >         dpif_netdev_xps_revalidate_pmd(pmd, now, false);
    >
    >    +    /* The tx queue can change in XPS case, make sure packets in previous
    >    +     * queue is flushed properly. */
    >    +    if (tx->last_used_qid != LAST_USED_QID_NONE &&
    >    +               tx->qid != tx->last_used_qid) {
    >    +        netdev_txq_flush(port->netdev, tx->last_used_qid, port-
    >>dynamic_txqs);
    >    +        tx->last_used_qid = LAST_USED_QID_NONE;
    >    +    }
    >    +
    >         VLOG_DBG("Core %d: New TX queue ID %d for port \'%s\'.",
    >                  pmd->core_id, tx->qid, netdev_get_name(tx->port->netdev));
    >         return min_qid;
    >    @@ -5146,6 +5189,13 @@ dp_execute_cb(void *aux_, struct
    >dp_packet_batch *packets_,
    >                     tx_qid = pmd->static_tx_qid;
    >                 }
    >
    >    +            /* In case these packets gets buffered into an intermediate
    >    +             * queue and XPS is enabled the flush function could find a
    >    +             * different tx qid assigned to its thread.  We keep track
    >    +             * of the qid we're now using, that will trigger the flush
    >    +             * function and will select the right queue to flush. */
    >    +            p->last_used_qid = tx_qid;
    >    +
    >                 netdev_send(p->port->netdev, tx_qid, packets_, may_steal,
    >                             dynamic_txqs);
    >                 return;
    >    --
    >    2.4.11
    >
    >    _______________________________________________
    >    dev mailing list
    >    dev@openvswitch.org
    >    https://urldefense.proofpoint.com/v2/url?u=https-
    >3A__mail.openvswitch.org_mailman_listinfo_ovs-
    >2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-
    >uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=IXKDcYc2
    >D7SfGkflLokKVfrNafJlFLfT94q366bHHFo&e=
    >

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index e2cd931..bfb9650 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -340,6 +340,7 @@  enum pmd_cycles_counter_type {
 };
 
 #define XPS_TIMEOUT_MS 500LL
+#define LAST_USED_QID_NONE -1
 
 /* Contained by struct dp_netdev_port's 'rxqs' member.  */
 struct dp_netdev_rxq {
@@ -500,7 +501,13 @@  struct rxq_poll {
 struct tx_port {
     struct dp_netdev_port *port;
     int qid;
-    long long last_used;
+    int last_used_qid;        /* Last queue id where packets got
+                                 enqueued. */
+    long long last_used;      /* In case XPS is enabled, it contains the
+                               * timestamp of the last time the port was
+                               * used by the thread to send data.  After
+                               * XPS_TIMEOUT_MS elapses the qid will be
+                               * marked as -1. */
     struct hmap_node node;
 };
 
@@ -3101,6 +3108,25 @@  cycles_count_intermediate(struct dp_netdev_pmd_thread *pmd,
     non_atomic_ullong_add(&pmd->cycles.n[type], interval);
 }
 
+static void
+dp_netdev_flush_txq_ports(struct dp_netdev_pmd_thread *pmd)
+{
+    struct tx_port *cached_tx_port;
+    int tx_qid;
+
+    HMAP_FOR_EACH (cached_tx_port, node, &pmd->send_port_cache) {
+        tx_qid = cached_tx_port->last_used_qid;
+
+        if (tx_qid != LAST_USED_QID_NONE) {
+            netdev_txq_flush(cached_tx_port->port->netdev, tx_qid,
+                         cached_tx_port->port->dynamic_txqs);
+
+            /* Queue flushed and mark it empty. */
+            cached_tx_port->last_used_qid = LAST_USED_QID_NONE;
+        }
+    }
+}
+
 static int
 dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
                            struct netdev_rxq *rx,
@@ -3667,6 +3693,9 @@  dpif_netdev_run(struct dpif *dpif)
                         dp_netdev_process_rxq_port(non_pmd,
                                                    port->rxqs[i].rx,
                                                    port->port_no);
+
+                    dp_netdev_flush_txq_ports(non_pmd);
+
                     cycles_count_intermediate(non_pmd, process_packets ?
                                                        PMD_CYCLES_PROCESSING
                                                      : PMD_CYCLES_IDLE);
@@ -3854,6 +3883,8 @@  reload:
             process_packets =
                 dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
                                            poll_list[i].port_no);
+            dp_netdev_flush_txq_ports(pmd);
+
             cycles_count_intermediate(pmd,
                                       process_packets ? PMD_CYCLES_PROCESSING
                                                       : PMD_CYCLES_IDLE);
@@ -3879,6 +3910,9 @@  reload:
 
     cycles_count_end(pmd, PMD_CYCLES_IDLE);
 
+    /* Flush the queues as part of reconfiguration logic. */
+    dp_netdev_flush_txq_ports(pmd);
+
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
     exiting = latch_is_set(&pmd->exit_latch);
     /* Signal here to make sure the pmd finishes
@@ -4481,6 +4515,7 @@  dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
 
     tx->port = port;
     tx->qid = -1;
+    tx->last_used_qid = LAST_USED_QID_NONE;
 
     hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
     pmd->need_reload = true;
@@ -5051,6 +5086,14 @@  dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
 
     dpif_netdev_xps_revalidate_pmd(pmd, now, false);
 
+    /* The tx queue can change in XPS case, make sure packets in previous
+     * queue is flushed properly. */
+    if (tx->last_used_qid != LAST_USED_QID_NONE &&
+               tx->qid != tx->last_used_qid) {
+        netdev_txq_flush(port->netdev, tx->last_used_qid, port->dynamic_txqs);
+        tx->last_used_qid = LAST_USED_QID_NONE;
+    }
+
     VLOG_DBG("Core %d: New TX queue ID %d for port \'%s\'.",
              pmd->core_id, tx->qid, netdev_get_name(tx->port->netdev));
     return min_qid;
@@ -5146,6 +5189,13 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
                 tx_qid = pmd->static_tx_qid;
             }
 
+            /* In case these packets gets buffered into an intermediate
+             * queue and XPS is enabled the flush function could find a
+             * different tx qid assigned to its thread.  We keep track
+             * of the qid we're now using, that will trigger the flush
+             * function and will select the right queue to flush. */
+            p->last_used_qid = tx_qid;
+
             netdev_send(p->port->netdev, tx_qid, packets_, may_steal,
                         dynamic_txqs);
             return;