[ovs-dev,v5] dpif-netdev: Avoid reordering of packets in a batch with same megaflow

Message ID 1531311738-30232-1-git-send-email-vishal.deep.ajmera@ericsson.com
State Superseded

Commit Message

Vishal Deep Ajmera July 11, 2018, 12:22 p.m.
OVS reads packets in batches from a given port, and each packet in
the batch is subjected to up to three levels of lookup to identify
the datapath megaflow entry (or flow) associated with it.
Each megaflow entry has a dedicated buffer in which packets that match
the flow classification criteria are collected. This buffer helps OVS
perform batch processing for all packets associated with a given flow.

Each packet in the received batch is first subjected to lookup in the
Exact Match Cache (EMC). Each EMC entry will point to a flow. If the
EMC lookup is successful, the packet is moved from the rx batch to the
per-flow buffer.

Packets that did not match any EMC entry are moved to the beginning
of the rx batch and are then subjected to a lookup in the megaflow
cache.
Packets that match a megaflow cache entry are *appended* to the per-flow
buffer.

Packets that do not match any megaflow entry are subjected to slow-path
processing through the upcall mechanism. This cannot change the order
of packets since, by definition, upcall processing is done only for
packets without a matching megaflow entry.

The EMC entry match fields encompass all potentially significant header
fields, typically more than specified in the associated flow's match
criteria. Hence, multiple EMC entries can point to the same flow. Given
that per-flow batching happens at each lookup stage, packets belonging
to the same megaflow can get re-ordered because some packets match EMC
entries while others do not.

The following example illustrates the issue. Consider the following
batch of packets (labelled P1 to P8) belonging to a single TCP
connection and hence to a single flow. Assume that packets with just
the ACK bit set in the TCP flags have also been received in a prior
batch, so a corresponding EMC entry exists.

1. P1 (TCP Flag: ACK)
2. P2 (TCP Flag: ACK)
3. P3 (TCP Flag: ACK)
4. P4 (TCP Flag: ACK, PSH)
5. P5 (TCP Flag: ACK)
6. P6 (TCP Flag: ACK)
7. P7 (TCP Flag: ACK)
8. P8 (TCP Flag: ACK)

The megaflow classification criteria do not include TCP flags, while
the EMC match criteria do. Thus, all packets other than P4 match
the existing EMC entry and are moved to the per-flow packet batch.
Subsequently, packet P4 is moved to the same per-flow packet batch as
a result of the megaflow lookup. Though the packets have all been
correctly classified as being associated with the same flow, the
packet order has not been preserved because of the per-flow batching
performed during the EMC lookup stage. This packet re-ordering has
performance implications for TCP applications.

This patch preserves the packet ordering by performing the per-flow
batching after both the EMC and megaflow lookups are complete. As an
optimization, packets are flow-batched during EMC processing until
the first packet in the batch misses the EMC.

A new flow map is maintained to keep the original order of packets
along with their flow information. After fast-path processing, packets
from the flow map are *appended* to the per-flow buffers.

Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
---
 lib/dpif-netdev.c | 104 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 88 insertions(+), 16 deletions(-)

Comments

Vishal Deep Ajmera July 13, 2018, 1:54 p.m. | #1
Hi Ian, Ilya,

If there are no more comments, could you please include the fix in
this week's pull request?

Warm Regards,
Vishal Ajmera
Stokes, Ian July 13, 2018, 5:06 p.m. | #2
On 7/13/2018 2:54 PM, Vishal Deep Ajmera wrote:
> Hi Ian, Ilya,
> 
> If there are no more comments, could you please include the fix in
> this week's pull request?
> 

Hi Vishal, I'll leave Ilya to respond to the changes he requested.

This week's pull request includes the new feature 'SMC cache after EMC 
cache' (a collaboration between Intel and Ericsson). Your patch will 
have to be rebased on top of it as there is a lot of code churn.

If you could rebase the v6 after the pull request is merged it would be 
helpful.

I'm aiming to have your patch as part of the next pull request, 
targeted for early next week. The v6 will be specific to master to 
account for the SMC changes. I should be able to use the v5 to apply 
to 2.9 and earlier releases if there are no other changes requested 
by Ilya.

Thanks
Ian


> Warm Regards,
> Vishal Ajmera
Ilya Maximets July 16, 2018, 1:12 p.m. | #3
Comments inline.

It would be nice if you could add me to CC on the next versions.
Thanks.
Best regards, Ilya Maximets.

On 11.07.2018 15:22, Vishal Deep Ajmera wrote:
> OVS reads packets in batches from a given port, and each packet in
> the batch is subjected to up to three levels of lookup to identify
> the datapath megaflow entry (or flow) associated with it.
> Each megaflow entry has a dedicated buffer in which packets that match
> the flow classification criteria are collected. This buffer helps OVS
> perform batch processing for all packets associated with a given flow.
> 
> Each packet in the received batch is first subjected to lookup in the
> Exact Match Cache (EMC). Each EMC entry will point to a flow. If the
> EMC lookup is successful, the packet is moved from the rx batch to the
> per-flow buffer.
> 
> Packets that did not match any EMC entry are moved to the beginning
> of the rx batch and are then subjected to a lookup in the megaflow
> cache.
> Packets that match a megaflow cache entry are *appended* to the per-flow
> buffer.
> 
> Packets that do not match any megaflow entry are subjected to slow-path
> processing through the upcall mechanism. This cannot change the order
> of packets since, by definition, upcall processing is done only for
> packets without a matching megaflow entry.
> 
> The EMC entry match fields encompass all potentially significant header
> fields, typically more than specified in the associated flow's match
> criteria. Hence, multiple EMC entries can point to the same flow. Given
> that per-flow batching happens at each lookup stage, packets belonging
> to the same megaflow can get re-ordered because some packets match EMC
> entries while others do not.
> 
> The following example illustrates the issue. Consider the following
> batch of packets (labelled P1 to P8) belonging to a single TCP
> connection and hence to a single flow. Assume that packets with just
> the ACK bit set in the TCP flags have also been received in a prior
> batch, so a corresponding EMC entry exists.
> 
> 1. P1 (TCP Flag: ACK)
> 2. P2 (TCP Flag: ACK)
> 3. P3 (TCP Flag: ACK)
> 4. P4 (TCP Flag: ACK, PSH)
> 5. P5 (TCP Flag: ACK)
> 6. P6 (TCP Flag: ACK)
> 7. P7 (TCP Flag: ACK)
> 8. P8 (TCP Flag: ACK)
> 
> The megaflow classification criteria do not include TCP flags, while
> the EMC match criteria do. Thus, all packets other than P4 match
> the existing EMC entry and are moved to the per-flow packet batch.
> Subsequently, packet P4 is moved to the same per-flow packet batch as
> a result of the megaflow lookup. Though the packets have all been
> correctly classified as being associated with the same flow, the
> packet order has not been preserved because of the per-flow batching
> performed during the EMC lookup stage. This packet re-ordering has
> performance implications for TCP applications.
> 
> This patch preserves the packet ordering by performing the per-flow
> batching after both the EMC and megaflow lookups are complete. As an
> optimization, packets are flow-batched during EMC processing until
> the first packet in the batch misses the EMC.

Have you tried a solution that pushes all the packets directly to
'flow_map' and performs per-flow batching only at the end of
'dp_netdev_input__()'?

Is it really slower?
If you have some performance data, I'd like to see it, because it
looks like the code should be simpler without this optimization.

Also, I'm thinking it might be better to keep the receive-order
values somewhere in the packet batch itself, modifying the
'dp_packet_batch' structure and API. Using this, we could restore the
original order inside the per-flow batches before executing actions
and completely avoid any modifications to the processing-related
code. Maybe this solution would look cleaner and be easier to
understand.


> 
> A new flow map is maintained to keep the original order of packets
> along with their flow information. After fast-path processing, packets
> from the flow map are *appended* to the per-flow buffers.
> 
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> ---
>  lib/dpif-netdev.c | 104 +++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 88 insertions(+), 16 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 8b3556d..52ababb 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -208,6 +208,13 @@ struct dpcls_rule {
>      /* 'flow' must be the last field, additional space is allocated here. */
>  };
>  
> +/* data structure to keep packet order till fastpath processing */
> +struct dp_packet_flow_map {
> +    struct dp_packet *packet;
> +    struct dp_netdev_flow *flow;
> +    uint16_t tcp_flags;
> +};
> +
>  static void dpcls_init(struct dpcls *);
>  static void dpcls_destroy(struct dpcls *);
>  static void dpcls_sort_subtable_vector(struct dpcls *);
> @@ -5602,6 +5609,19 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>      packet_batch_per_flow_update(batch, pkt, tcp_flags);
>  }
>  
> +static inline void
> +packet_enqueue_to_flow_map(struct dp_packet_flow_map *flow_map,
> +                           struct dp_netdev_flow *flow,
> +                           struct dp_packet *packet,
> +                           uint16_t tcp_flags,
> +                           size_t *map_cnt)
> +{
> +    struct dp_packet_flow_map *map = &flow_map[(*map_cnt)++];
> +    map->flow = flow;
> +    map->packet = packet;
> +    map->tcp_flags = tcp_flags;
> +}
> +
>  /* Try to process all ('cnt') the 'packets' using only the exact match cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the
> @@ -5621,6 +5641,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>                 struct dp_packet_batch *packets_,
>                 struct netdev_flow_key *keys,
>                 struct packet_batch_per_flow batches[], size_t *n_batches,
> +               struct dp_packet_flow_map *flow_map,
> +               size_t *n_flows,
> +               uint8_t *index_map,
>                 bool md_is_valid, odp_port_t port_no)
>  {
>      struct emc_cache *flow_cache = &pmd->flow_cache;
> @@ -5631,6 +5654,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>      uint32_t cur_min;
>      int i;
>      uint16_t tcp_flags;
> +    size_t map_cnt = 0;

You're setting the value of 'n_flows' before calling this function.
You should use the passed value here, or not initialize 'n_flows' in
the caller.

> +    bool batch_enable = true;
>  
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>      pmd_perf_update_counter(&pmd->perf_stats,
> @@ -5661,11 +5686,20 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          if ((*recirc_depth_get() == 0) &&
>              dp_packet_has_flow_mark(packet, &mark)) {
>              flow = mark_to_flow_find(pmd, mark);
> -            if (flow) {
> +            if (OVS_LIKELY(flow)) {
>                  tcp_flags = parse_tcp_flags(packet);
> -                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> -                                        n_batches);
> -                continue;
> +                if (OVS_LIKELY(batch_enable)) {
> +                    dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                            n_batches);
> +                    continue;
> +                } else {
> +                    /* Flow batching should be performed only after fast-path
> +                     * processing is also completed for packets with emc miss
> +                     * or else it will result in reordering of packets with
> +                     * same datapath flows. */
> +                    packet_enqueue_to_flow_map(flow_map, flow, packet,
> +                                               tcp_flags, &map_cnt);

"continue" required here to avoid double processing.
Better to move the above "continue" statement out of the "if/else".

> +                }
>              }
>          }
>  
> @@ -5685,19 +5719,37 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          }
>          if (OVS_LIKELY(flow)) {
>              tcp_flags = miniflow_get_tcp_flags(&key->mf);
> -            dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> -                                    n_batches);
> +            if (OVS_LIKELY(batch_enable)) {
> +                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                        n_batches);
> +            } else {
> +                /* Flow batching should be performed only after fast-path
> +                 * processing is also completed for packets with emc miss
> +                 * or else it will result in reordering of packets with
> +                 * same datapath flows. */
> +                packet_enqueue_to_flow_map(flow_map, flow, packet, tcp_flags,
> +                                           &map_cnt);

I think it's better to reorder arguments of this function to look similar with
'dp_netdev_queue_batches'. Like this:

    packet_enqueue_to_flow_map(packet, flow, tcp_flags, flow_map, &map_cnt);

> +            }
> +

Not sure if this empty line is required.

>          } else {
>              /* Exact match cache missed. Group missed packets together at
>               * the beginning of the 'packets' array. */
>              dp_packet_batch_refill(packets_, packet, i);
> +
> +            /* preserve the order of packet for flow batching */

Comments should start with a capital letter and end with a period.
(This is for all the comments in this patch)

> +            index_map[n_missed] = map_cnt;
> +            flow_map[map_cnt++].flow = NULL;
> +
>              /* 'key[n_missed]' contains the key of the current packet and it
>               * must be returned to the caller. The next key should be extracted
>               * to 'keys[n_missed + 1]'. */
>              key = &keys[++n_missed];
> +
> +            /* skip batching for subsequent packets to avoid reordering */
> +            batch_enable = false;
>          }
>      }
> -
> +    *n_flows = map_cnt;
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
>                              cnt - n_dropped - n_missed);
>  
> @@ -5784,8 +5836,8 @@ static inline void
>  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet_batch *packets_,
>                       struct netdev_flow_key *keys,
> -                     struct packet_batch_per_flow batches[],
> -                     size_t *n_batches,
> +                     struct dp_packet_flow_map *flow_map,
> +                     uint8_t *index_map,
>                       odp_port_t in_port)
>  {
>      const size_t cnt = dp_packet_batch_size(packets_);
> @@ -5864,6 +5916,8 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>  
>      DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
>          struct dp_netdev_flow *flow;
> +        /* get the original order of this packet in received batch */
> +        int recv_idx = index_map[i];
>  
>          if (OVS_UNLIKELY(!rules[i])) {
>              continue;
> @@ -5872,9 +5926,13 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>          flow = dp_netdev_flow_cast(rules[i]);
>  
>          emc_probabilistic_insert(pmd, &keys[i], flow);
> -        dp_netdev_queue_batches(packet, flow,
> -                                miniflow_get_tcp_flags(&keys[i].mf),
> -                                batches, n_batches);
> +        /* add these packets into the flow map in the same order
> +         * as received.
> +         */
> +        flow_map[recv_idx].packet = packet;
> +        flow_map[recv_idx].flow = flow;
> +        flow_map[recv_idx].tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);

I think it's better to use some API instead of manually filling the fields.

> +

Empty line not needed.

>      }
>  
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> @@ -5905,19 +5963,34 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>          struct netdev_flow_key keys[PKT_ARRAY_SIZE];
>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
> -    size_t n_batches;
> +    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
> +    uint8_t index_map[PKT_ARRAY_SIZE];
> +    size_t n_batches, n_flows = 0;
>      odp_port_t in_port;
> +    size_t i;
>  
>      n_batches = 0;
>      emc_processing(pmd, packets, keys, batches, &n_batches,
> -                            md_is_valid, port_no);
> +                   flow_map, &n_flows, index_map, md_is_valid, port_no);
>      if (!dp_packet_batch_is_empty(packets)) {
>          /* Get ingress port from first packet's metadata. */
>          in_port = packets->packets[0]->md.in_port.odp_port;
>          fast_path_processing(pmd, packets, keys,
> -                             batches, &n_batches, in_port);
> +                             flow_map, index_map, in_port);
>      }
>  
> +    /* batch rest of packets which are in flow map */
> +    for (i = 0; i < n_flows; i++) {
> +        struct dp_packet_flow_map *map = &flow_map[i];
> +
> +        if (OVS_UNLIKELY(!map->flow)) {
> +            continue;
> +        }
> +        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
> +                                batches, &n_batches);
> +     }
> +
> +

Two empty lines in a row.

>      /* All the flow batches need to be reset before any call to
>       * packet_batch_per_flow_execute() as it could potentially trigger
>       * recirculation. When a packet matching flow ‘j’ happens to be
> @@ -5927,7 +6000,6 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>       * already its own batches[k] still waiting to be served.  So if its
>       * ‘batch’ member is not reset, the recirculated packet would be wrongly
>       * appended to batches[k] of the 1st call to dp_netdev_input__(). */
> -    size_t i;
>      for (i = 0; i < n_batches; i++) {
>          batches[i].flow->batch = NULL;
>      }
>
Vishal Deep Ajmera July 27, 2018, 10:35 a.m. | #4
> Have you tried a solution that pushes all the packets directly to
> 'flow_map' and performs per-flow batching only at the end of
> 'dp_netdev_input__()'?
> 
> Is it really slower?
> If you have some performance data, I'd like to see it, because it
> looks like the code should be simpler without this optimization.

I do not have any comparison data. However, in the best case, where
all packets hit the EMC, this will add cycles, since an extra batching
loop needs to run in dp_netdev_input__() after EMC processing.

> 
> Also, I'm thinking it might be better to keep the receive-order
> values somewhere in the packet batch itself, modifying the
> 'dp_packet_batch' structure and API. Using this, we could restore the
> original order inside the per-flow batches before executing actions
> and completely avoid any modifications to the processing-related
> code. Maybe this solution would look cleaner and be easier to
> understand.

I could not try the solution you suggested. However, I have addressed
your code review comments in the v6 patch. The v6 patch is also
rebased on top of the dfc/smc feature.

Warm Regards,
Vishal Ajmera

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 8b3556d..52ababb 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -208,6 +208,13 @@  struct dpcls_rule {
     /* 'flow' must be the last field, additional space is allocated here. */
 };
 
+/* data structure to keep packet order till fastpath processing */
+struct dp_packet_flow_map {
+    struct dp_packet *packet;
+    struct dp_netdev_flow *flow;
+    uint16_t tcp_flags;
+};
+
 static void dpcls_init(struct dpcls *);
 static void dpcls_destroy(struct dpcls *);
 static void dpcls_sort_subtable_vector(struct dpcls *);
@@ -5602,6 +5609,19 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, tcp_flags);
 }
 
+static inline void
+packet_enqueue_to_flow_map(struct dp_packet_flow_map *flow_map,
+                           struct dp_netdev_flow *flow,
+                           struct dp_packet *packet,
+                           uint16_t tcp_flags,
+                           size_t *map_cnt)
+{
+    struct dp_packet_flow_map *map = &flow_map[(*map_cnt)++];
+    map->flow = flow;
+    map->packet = packet;
+    map->tcp_flags = tcp_flags;
+}
+
 /* Try to process all ('cnt') the 'packets' using only the exact match cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
@@ -5621,6 +5641,9 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
                struct dp_packet_batch *packets_,
                struct netdev_flow_key *keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
+               struct dp_packet_flow_map *flow_map,
+               size_t *n_flows,
+               uint8_t *index_map,
                bool md_is_valid, odp_port_t port_no)
 {
     struct emc_cache *flow_cache = &pmd->flow_cache;
@@ -5631,6 +5654,8 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
     uint32_t cur_min;
     int i;
     uint16_t tcp_flags;
+    size_t map_cnt = 0;
+    bool batch_enable = true;
 
     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
     pmd_perf_update_counter(&pmd->perf_stats,
@@ -5661,11 +5686,20 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         if ((*recirc_depth_get() == 0) &&
             dp_packet_has_flow_mark(packet, &mark)) {
             flow = mark_to_flow_find(pmd, mark);
-            if (flow) {
+            if (OVS_LIKELY(flow)) {
                 tcp_flags = parse_tcp_flags(packet);
-                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
-                                        n_batches);
-                continue;
+                if (OVS_LIKELY(batch_enable)) {
+                    dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
+                                            n_batches);
+                    continue;
+                } else {
+                    /* Flow batching should be performed only after fast-path
+                     * processing is also completed for packets with emc miss
+                     * or else it will result in reordering of packets with
+                     * same datapath flows. */
+                    packet_enqueue_to_flow_map(flow_map, flow, packet,
+                                               tcp_flags, &map_cnt);
+                }
             }
         }
 
@@ -5685,19 +5719,37 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
         if (OVS_LIKELY(flow)) {
             tcp_flags = miniflow_get_tcp_flags(&key->mf);
-            dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
-                                    n_batches);
+            if (OVS_LIKELY(batch_enable)) {
+                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
+                                        n_batches);
+            } else {
+                /* Flow batching should be performed only after fast-path
+                 * processing is also completed for packets with emc miss
+                 * or else it will result in reordering of packets with
+                 * same datapath flows. */
+                packet_enqueue_to_flow_map(flow_map, flow, packet, tcp_flags,
+                                           &map_cnt);
+            }
+
         } else {
             /* Exact match cache missed. Group missed packets together at
              * the beginning of the 'packets' array. */
             dp_packet_batch_refill(packets_, packet, i);
+
+            /* preserve the order of packet for flow batching */
+            index_map[n_missed] = map_cnt;
+            flow_map[map_cnt++].flow = NULL;
+
             /* 'key[n_missed]' contains the key of the current packet and it
              * must be returned to the caller. The next key should be extracted
              * to 'keys[n_missed + 1]'. */
             key = &keys[++n_missed];
+
+            /* skip batching for subsequent packets to avoid reordering */
+            batch_enable = false;
         }
     }
-
+    *n_flows = map_cnt;
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
                             cnt - n_dropped - n_missed);
 
@@ -5784,8 +5836,8 @@  static inline void
 fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                      struct dp_packet_batch *packets_,
                      struct netdev_flow_key *keys,
-                     struct packet_batch_per_flow batches[],
-                     size_t *n_batches,
+                     struct dp_packet_flow_map *flow_map,
+                     uint8_t *index_map,
                      odp_port_t in_port)
 {
     const size_t cnt = dp_packet_batch_size(packets_);
@@ -5864,6 +5916,8 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
     DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
         struct dp_netdev_flow *flow;
+        /* get the original order of this packet in received batch */
+        int recv_idx = index_map[i];
 
         if (OVS_UNLIKELY(!rules[i])) {
             continue;
@@ -5872,9 +5926,13 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
         flow = dp_netdev_flow_cast(rules[i]);
 
         emc_probabilistic_insert(pmd, &keys[i], flow);
-        dp_netdev_queue_batches(packet, flow,
-                                miniflow_get_tcp_flags(&keys[i].mf),
-                                batches, n_batches);
+        /* add these packets into the flow map in the same order
+         * as received.
+         */
+        flow_map[recv_idx].packet = packet;
+        flow_map[recv_idx].flow = flow;
+        flow_map[recv_idx].tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
+
     }
 
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
@@ -5905,19 +5963,34 @@  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
         struct netdev_flow_key keys[PKT_ARRAY_SIZE];
     struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
-    size_t n_batches;
+    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
+    uint8_t index_map[PKT_ARRAY_SIZE];
+    size_t n_batches, n_flows = 0;
     odp_port_t in_port;
+    size_t i;
 
     n_batches = 0;
     emc_processing(pmd, packets, keys, batches, &n_batches,
-                            md_is_valid, port_no);
+                   flow_map, &n_flows, index_map, md_is_valid, port_no);
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
         in_port = packets->packets[0]->md.in_port.odp_port;
         fast_path_processing(pmd, packets, keys,
-                             batches, &n_batches, in_port);
+                             flow_map, index_map, in_port);
     }
 
+    /* batch rest of packets which are in flow map */
+    for (i = 0; i < n_flows; i++) {
+        struct dp_packet_flow_map *map = &flow_map[i];
+
+        if (OVS_UNLIKELY(!map->flow)) {
+            continue;
+        }
+        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
+                                batches, &n_batches);
+     }
+
+
     /* All the flow batches need to be reset before any call to
      * packet_batch_per_flow_execute() as it could potentially trigger
      * recirculation. When a packet matching flow ‘j’ happens to be
@@ -5927,7 +6000,6 @@  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
      * already its own batches[k] still waiting to be served.  So if its
      * ‘batch’ member is not reset, the recirculated packet would be wrongly
      * appended to batches[k] of the 1st call to dp_netdev_input__(). */
-    size_t i;
     for (i = 0; i < n_batches; i++) {
         batches[i].flow->batch = NULL;
     }