[ovs-dev,v6] dpif-netdev: Avoid reordering of packets in a batch with same megaflow

Message ID: 1532715997-26089-1-git-send-email-vishal.deep.ajmera@ericsson.com
State: Accepted
Delegated to: Ian Stokes
Series: [ovs-dev,v6] dpif-netdev: Avoid reordering of packets in a batch with same megaflow

Commit Message

Vishal Deep Ajmera July 27, 2018, 6:26 p.m. UTC
OVS reads packets in batches from a given port, and the packets in a
batch are subjected to potentially 3 levels of lookup to identify the
datapath megaflow entry (or flow) associated with each packet.
Each megaflow entry has a dedicated buffer in which packets that match
the flow classification criteria are collected. This buffer helps OVS
perform batch processing for all packets associated with a given flow.

Each packet in the received batch is first subjected to a lookup in the
Exact Match Cache (EMC). Each EMC entry points to a flow. If the EMC
lookup is successful, the packet is moved from the rx batch to the
per-flow buffer.
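
For readers less familiar with the dpif-netdev internals, a highly
simplified sketch of what per-flow batching means (toy structures only,
not the actual OVS code; the real logic lives in
struct packet_batch_per_flow / dp_netdev_queue_batches() in
lib/dpif-netdev.c):

/* Toy model of per-flow batching; packets are identified by index. */
#define MAX_BURST 32

struct flow;                       /* stands in for struct dp_netdev_flow */

struct per_flow_batch {
    struct flow *flow;             /* flow owning the packets below */
    int packets[MAX_BURST];        /* packets collected for this flow */
    int count;
};

/* Append a packet to the batch owned by its flow, creating the batch on
 * first use, so that all packets of one flow are executed together. */
static void
queue_to_flow_batch(int pkt, struct flow *flow,
                    struct per_flow_batch *batches, int *n_batches)
{
    for (int i = 0; i < *n_batches; i++) {
        if (batches[i].flow == flow) {
            batches[i].packets[batches[i].count++] = pkt;
            return;
        }
    }
    batches[*n_batches].flow = flow;
    batches[*n_batches].packets[0] = pkt;
    batches[*n_batches].count = 1;
    (*n_batches)++;
}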

Packets that did not match any EMC entry are regrouped at the beginning
of the rx batch and are then subjected to a lookup in the megaflow
cache. Packets that match a megaflow cache entry are *appended* to the
per-flow buffer.

Packets that do not match any megaflow entry are subjected to slow-path
processing through the upcall mechanism. This cannot change the order of
packets since, by definition, upcall processing is only done for packets
without a matching megaflow entry.

The EMC entry match fields encompass all potentially significant header
fields, typically more than specified in the associated flow's match
criteria. Hence, multiple EMC entries can point to the same flow. Given
that per-flow batching happens at each lookup stage, packets belonging
to the same megaflow can get re-ordered because some packets match EMC
entries while others do not.

The following example illustrates the issue. Consider the following
batch of packets (labelled P1 to P8) belonging to a single TCP
connection and hence associated with a single flow. Let us assume that
packets with just the ACK bit set in the TCP flags have also been
received in a prior batch, so a corresponding EMC entry already exists.

1. P1 (TCP Flag: ACK)
2. P2 (TCP Flag: ACK)
3. P3 (TCP Flag: ACK)
4. P4 (TCP Flag: ACK, PSH)
5. P5 (TCP Flag: ACK)
6. P6 (TCP Flag: ACK)
7. P7 (TCP Flag: ACK)
8. P8 (TCP Flag: ACK)

The megaflow classification criteria do not include TCP flags while
the EMC match criteria do. Thus, all packets other than P4 match
the existing EMC entry and are moved to the per-flow packet batch.
Subsequently, packet P4 is moved to the same per-flow packet batch as
a result of the megaflow lookup. Though the packets have all been
correctly classified as belonging to the same flow, the packet order
has not been preserved because of the per-flow batching performed
during the EMC lookup stage. This packet re-ordering has performance
implications for TCP applications.
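
To make the re-ordering concrete, a tiny standalone program (plain C,
not OVS code) that mimics the two-stage batching of this example prints
the resulting per-flow batch order P1 P2 P3 P5 P6 P7 P8 P4:

#include <stdio.h>

int main(void)
{
    const char *pkts[] = { "P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8" };
    int emc_hit[]      = {  1,    1,    1,    0,    1,    1,    1,    1  };
    const char *batch[8];
    int n = 0;

    /* Stage 1: EMC lookup moves hits straight into the per-flow batch. */
    for (int i = 0; i < 8; i++) {
        if (emc_hit[i]) {
            batch[n++] = pkts[i];
        }
    }
    /* Stage 2: megaflow lookup appends the remaining (EMC-missed) packet. */
    for (int i = 0; i < 8; i++) {
        if (!emc_hit[i]) {
            batch[n++] = pkts[i];
        }
    }
    for (int i = 0; i < n; i++) {
        printf("%s ", batch[i]);   /* P4 is now last, after P5..P8 */
    }
    printf("\n");
    return 0;
}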

This patch preserves the packet ordering by performing the per-flow
batching only after both the EMC and megaflow lookups are complete. As
an optimization, packets are still flow-batched during EMC processing
until the first EMC miss in the batch.

A new flow map is maintained to keep the original order of the packets
along with their flow information. After fast-path processing, packets
from the flow map are *appended* to the per-flow buffers.
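
Sketched with the same toy helpers as in the earlier sketch (simplified;
the real code is struct dp_packet_flow_map, packet_enqueue_to_flow_map()
and the final loop in dp_netdev_input__() in the patch below), the idea
is:

/* One entry per received packet, filled in at the packet's original
 * receive index by the EMC/SMC/megaflow lookup stages. */
struct pkt_flow_map {
    int packet;                    /* stands in for struct dp_packet * */
    struct flow *flow;             /* NULL if the packet got no flow */
    unsigned short tcp_flags;
};

/* After fast-path processing, walk the map in receive order and only
 * then batch per flow, so ordering within a flow is preserved. */
static void
batch_from_flow_map(const struct pkt_flow_map *map, int n,
                    struct per_flow_batch *batches, int *n_batches)
{
    for (int i = 0; i < n; i++) {
        if (!map[i].flow) {
            continue;
        }
        queue_to_flow_batch(map[i].packet, map[i].flow, batches, n_batches);
    }
}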

Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
---
 lib/dpif-netdev.c | 125 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 106 insertions(+), 19 deletions(-)

Comments

Stokes, Ian Aug. 8, 2018, 9:06 a.m. UTC | #1
On 7/27/2018 7:26 PM, Vishal Deep Ajmera wrote:
> [...]
> 

Thanks for the V6 Vishal, looking at this today myself.

Ilya, has the v6 addressed your concerns from the v5?

Thanks
Ian

> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> ---
>   lib/dpif-netdev.c | 125 +++++++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 106 insertions(+), 19 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 13a20f0..3ea25e2 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -244,6 +244,13 @@ struct dpcls_rule {
>       /* 'flow' must be the last field, additional space is allocated here. */
>   };
>   
> +/* Data structure to keep packet order till fastpath processing. */
> +struct dp_packet_flow_map {
> +    struct dp_packet *packet;
> +    struct dp_netdev_flow *flow;
> +    uint16_t tcp_flags;
> +};
> +
>   static void dpcls_init(struct dpcls *);
>   static void dpcls_destroy(struct dpcls *);
>   static void dpcls_sort_subtable_vector(struct dpcls *);
> @@ -5774,6 +5781,19 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>       packet_batch_per_flow_update(batch, pkt, tcp_flags);
>   }
>   
> +static inline void
> +packet_enqueue_to_flow_map(struct dp_packet *packet,
> +                           struct dp_netdev_flow *flow,
> +                           uint16_t tcp_flags,
> +                           struct dp_packet_flow_map *flow_map,
> +                           size_t index)
> +{
> +    struct dp_packet_flow_map *map = &flow_map[index];
> +    map->flow = flow;
> +    map->packet = packet;
> +    map->tcp_flags = tcp_flags;
> +}
> +
>   /* SMC lookup function for a batch of packets.
>    * By doing batching SMC lookup, we can use prefetch
>    * to hide memory access latency.
> @@ -5783,8 +5803,9 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>               struct netdev_flow_key *keys,
>               struct netdev_flow_key **missed_keys,
>               struct dp_packet_batch *packets_,
> -            struct packet_batch_per_flow batches[],
> -            size_t *n_batches, const int cnt)
> +            const int cnt,
> +            struct dp_packet_flow_map *flow_map,
> +            uint8_t *index_map)
>   {
>       int i;
>       struct dp_packet *packet;
> @@ -5792,6 +5813,8 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>       struct dfc_cache *cache = &pmd->flow_cache;
>       struct smc_cache *smc_cache = &cache->smc_cache;
>       const struct cmap_node *flow_node;
> +    int recv_idx;
> +    uint16_t tcp_flags;
>   
>       /* Prefetch buckets for all packets */
>       for (i = 0; i < cnt; i++) {
> @@ -5802,6 +5825,8 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>           struct dp_netdev_flow *flow = NULL;
>           flow_node = smc_entry_get(pmd, keys[i].hash);
>           bool hit = false;
> +        /* Get the original order of this packet in received batch. */
> +        recv_idx = index_map[i];
>   
>           if (OVS_LIKELY(flow_node != NULL)) {
>               CMAP_NODE_FOR_EACH (flow, node, flow_node) {
> @@ -5809,12 +5834,17 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>                    * number, we need to  verify that the input ports match. */
>                   if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, &keys[i]) &&
>                   flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) {
> +                    tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
> +
>                       /* SMC hit and emc miss, we insert into EMC */
>                       emc_probabilistic_insert(pmd, &keys[i], flow);
>                       keys[i].len =
>                           netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
> -                    dp_netdev_queue_batches(packet, flow,
> -                    miniflow_get_tcp_flags(&keys[i].mf), batches, n_batches);
> +                    /* Add these packets into the flow map in the same order
> +                     * as received.
> +                     */
> +                    packet_enqueue_to_flow_map(packet, flow, tcp_flags,
> +                                               flow_map, recv_idx);
>                       n_smc_hit++;
>                       hit = true;
>                       break;
> @@ -5828,6 +5858,10 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>           /* SMC missed. Group missed packets together at
>            * the beginning of the 'packets' array. */
>           dp_packet_batch_refill(packets_, packet, i);
> +
> +        /* Preserve the order of packet for flow batching. */
> +        index_map[n_missed] = recv_idx;
> +
>           /* Put missed keys to the pointer arrays return to the caller */
>           missed_keys[n_missed++] = &keys[i];
>       }
> @@ -5856,6 +5890,8 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>                  struct netdev_flow_key *keys,
>                  struct netdev_flow_key **missed_keys,
>                  struct packet_batch_per_flow batches[], size_t *n_batches,
> +               struct dp_packet_flow_map *flow_map,
> +               size_t *n_flows, uint8_t *index_map,
>                  bool md_is_valid, odp_port_t port_no)
>   {
>       struct netdev_flow_key *key = &keys[0];
> @@ -5867,6 +5903,8 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>       int i;
>       uint16_t tcp_flags;
>       bool smc_enable_db;
> +    size_t map_cnt = 0;
> +    bool batch_enable = true;
>   
>       atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
>       atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
> @@ -5897,10 +5935,19 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>           if ((*recirc_depth_get() == 0) &&
>               dp_packet_has_flow_mark(packet, &mark)) {
>               flow = mark_to_flow_find(pmd, mark);
> -            if (flow) {
> +            if (OVS_LIKELY(flow)) {
>                   tcp_flags = parse_tcp_flags(packet);
> -                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> -                                        n_batches);
> +                if (OVS_LIKELY(batch_enable)) {
> +                    dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                            n_batches);
> +                } else {
> +                    /* Flow batching should be performed only after fast-path
> +                     * processing is also completed for packets with emc miss
> +                     * or else it will result in reordering of packets with
> +                     * same datapath flows. */
> +                    packet_enqueue_to_flow_map(packet, flow, tcp_flags,
> +                                               flow_map, map_cnt++);
> +                }
>                   continue;
>               }
>           }
> @@ -5923,13 +5970,27 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>           }
>           if (OVS_LIKELY(flow)) {
>               tcp_flags = miniflow_get_tcp_flags(&key->mf);
> -            dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> -                                    n_batches);
>               n_emc_hit++;
> +            if (OVS_LIKELY(batch_enable)) {
> +                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                        n_batches);
> +            } else {
> +                /* Flow batching should be performed only after fast-path
> +                 * processing is also completed for packets with emc miss
> +                 * or else it will result in reordering of packets with
> +                 * same datapath flows. */
> +                packet_enqueue_to_flow_map(packet, flow, tcp_flags,
> +                                           flow_map, map_cnt++);
> +            }
>           } else {
>               /* Exact match cache missed. Group missed packets together at
>                * the beginning of the 'packets' array. */
>               dp_packet_batch_refill(packets_, packet, i);
> +
> +            /* Preserve the order of packet for flow batching. */
> +            index_map[n_missed] = map_cnt;
> +            flow_map[map_cnt++].flow = NULL;
> +
>               /* 'key[n_missed]' contains the key of the current packet and it
>                * will be passed to SMC lookup. The next key should be extracted
>                * to 'keys[n_missed + 1]'.
> @@ -5937,8 +5998,13 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>                * which will be returned to the caller for future processing. */
>               missed_keys[n_missed] = key;
>               key = &keys[++n_missed];
> +
> +            /* Skip batching for subsequent packets to avoid reordering. */
> +            batch_enable = false;
>           }
>       }
> +    /* Count of packets which are not flow batched. */
> +    *n_flows = map_cnt;
>   
>       pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, n_emc_hit);
>   
> @@ -5947,8 +6013,8 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>       }
>   
>       /* Packets miss EMC will do a batch lookup in SMC if enabled */
> -    smc_lookup_batch(pmd, keys, missed_keys, packets_, batches,
> -                            n_batches, n_missed);
> +    smc_lookup_batch(pmd, keys, missed_keys, packets_,
> +                     n_missed, flow_map, index_map);
>   
>       return dp_packet_batch_size(packets_);
>   }
> @@ -6035,8 +6101,8 @@ static inline void
>   fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                        struct dp_packet_batch *packets_,
>                        struct netdev_flow_key **keys,
> -                     struct packet_batch_per_flow batches[],
> -                     size_t *n_batches,
> +                     struct dp_packet_flow_map *flow_map,
> +                     uint8_t *index_map,
>                        odp_port_t in_port)
>   {
>       const size_t cnt = dp_packet_batch_size(packets_);
> @@ -6116,6 +6182,9 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>   
>       DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
>           struct dp_netdev_flow *flow;
> +        /* Get the original order of this packet in received batch. */
> +        int recv_idx = index_map[i];
> +        uint16_t tcp_flags;
>   
>           if (OVS_UNLIKELY(!rules[i])) {
>               continue;
> @@ -6126,9 +6195,12 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>           smc_insert(pmd, keys[i], hash);
>   
>           emc_probabilistic_insert(pmd, keys[i], flow);
> -        dp_netdev_queue_batches(packet, flow,
> -                                miniflow_get_tcp_flags(&keys[i]->mf),
> -                                batches, n_batches);
> +        /* Add these packets into the flow map in the same order
> +         * as received.
> +         */
> +        tcp_flags = miniflow_get_tcp_flags(&keys[i]->mf);
> +        packet_enqueue_to_flow_map(packet, flow, tcp_flags,
> +                                   flow_map, recv_idx);
>       }
>   
>       pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> @@ -6161,18 +6233,34 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>       struct netdev_flow_key *missed_keys[PKT_ARRAY_SIZE];
>       struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
>       size_t n_batches;
> +    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
> +    uint8_t index_map[PKT_ARRAY_SIZE];
> +    size_t n_flows, i;
> +
>       odp_port_t in_port;
>   
>       n_batches = 0;
>       dfc_processing(pmd, packets, keys, missed_keys, batches, &n_batches,
> -                            md_is_valid, port_no);
> +                   flow_map, &n_flows, index_map, md_is_valid, port_no);
> +
>       if (!dp_packet_batch_is_empty(packets)) {
>           /* Get ingress port from first packet's metadata. */
>           in_port = packets->packets[0]->md.in_port.odp_port;
>           fast_path_processing(pmd, packets, missed_keys,
> -                             batches, &n_batches, in_port);
> +                             flow_map, index_map, in_port);
>       }
>   
> +    /* Batch rest of packets which are in flow map. */
> +    for (i = 0; i < n_flows; i++) {
> +        struct dp_packet_flow_map *map = &flow_map[i];
> +
> +        if (OVS_UNLIKELY(!map->flow)) {
> +            continue;
> +        }
> +        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
> +                                batches, &n_batches);
> +     }
> +
>       /* All the flow batches need to be reset before any call to
>        * packet_batch_per_flow_execute() as it could potentially trigger
>        * recirculation. When a packet matching flow ‘j’ happens to be
> @@ -6182,7 +6270,6 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>        * already its own batches[k] still waiting to be served.  So if its
>        * ‘batch’ member is not reset, the recirculated packet would be wrongly
>        * appended to batches[k] of the 1st call to dp_netdev_input__(). */
> -    size_t i;
>       for (i = 0; i < n_batches; i++) {
>           batches[i].flow->batch = NULL;
>       }
>
Ilya Maximets Aug. 8, 2018, 2:39 p.m. UTC | #2
On 08.08.2018 12:06, Ian Stokes wrote:
> On 7/27/2018 7:26 PM, Vishal Deep Ajmera wrote:
>> [...]
> 
> Thanks for the V6 Vishal, looking at this today myself.
> 
> Ilya, has the v6 addressed your concerns from the v5?

Hi.
Ian, I haven't had much time, so I'm still looking at the patch.
Vishal, it would be nice if you could add a version history for your
patches under the cut line. It's hard to track all the changes
directly from the code.

Best regards, Ilya Maximets.

Stokes, Ian Aug. 22, 2018, 12:30 p.m. UTC | #3
> On 7/27/2018 7:26 PM, Vishal Deep Ajmera wrote:
> > [...]
> 
> Thanks for the V6 Vishal, looking at this today myself.
> 
> Ilya, has the v6 addressed your concerns from the v5?

There haven't been any more comments on this patch. From what I can see, the issues raised by Ilya on v5 have been addressed in the v6. V6 seems to test OK; unless there are objections, I'm going to merge it as part of this week's dpdk_merge pull request for master.

Thanks
Ian

Ilya Maximets Aug. 22, 2018, 3:32 p.m. UTC | #4
On 22.08.2018 15:30, Stokes, Ian wrote:
>> On 7/27/2018 7:26 PM, Vishal Deep Ajmera wrote:
>>> [...]
>>
>> Thanks for the V6 Vishal, looking at this today myself.
>>
>> Ilya, has the v6 addressed your concerns from the v5?
> 
> There haven't been any more comments on this patch. From what I can see, the issues raised by Ilya on v5 have been addressed in v6. V6 seems to test OK; unless there are objections, I'm going to merge it as part of this week's dpdk_merge pull request for master.

Hi, Ian, everyone.
It's OK to merge. I still don't like the change, but I can live with it.
P.S. The current version cannot be applied cleanly; a minor rebase is required.

The text below is not directly related to this patch.
---
Meanwhile, in general, I think that the main processing part
of dpif-netdev (dp_netdev_input__) is overcomplicated. It looks
like we have 4 different levels of packet batching (some of which
exist at the same time):
1. Rx batches
2. Per-flow batches
3. New flow maps
4. Output batches.
We're constantly re-batching packets into different data
structures. That's frustrating.

I'd like to refactor it towards more iterative and sequential
processing, for example using the "keys_map" approach from
"dpcls_lookup" and calling the different processing functions until
"keys_map" is empty. Each processing function could use
"ULLONG_FOR_EACH_1" on this "keys_map", set up rules for matched
packets and clear them from "keys_map" using "ULLONG_SET0".
Right now we have 5 options to classify packets:
1. has_flow_mark()
2. emc_lookup()
3. smc_lookup_batch()
4. dpcls_lookup()
5. handle_packet_upcall()
Some of them work on packet batches, some on individual packets.
IMHO, if we make a plain pipeline out of them, calling the next
stage as long as there are unhandled packets in "keys_map", it could
be much simpler compared to the current call tree with its mess of
processing functions, each implemented in a different way.
In the end we would be able to push all the packets to per-flow
batches and execute actions.

In fact, this would effectively resolve the current reordering issue too.

I'm not sure if the above text is parsable. =) Sorry if not.

Any thoughts?
Anyway, I'm going to try.

Best regards, Ilya Maximets.

> Thanks
> Ian
> 
>>
>>> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
>>> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
>>> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
>>> ---
>>>   lib/dpif-netdev.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-
>> --------
>>>   1 file changed, 106 insertions(+), 19 deletions(-)
Wang, Yipeng1 Aug. 29, 2018, 1:18 a.m. UTC | #5
>-----Original Message-----
>From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Ilya Maximets
>Sent: Wednesday, August 22, 2018 8:33 AM
>To: Stokes, Ian <ian.stokes@intel.com>; Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>; dev@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH v6] dpif-netdev: Avoid reordering of packets in a batch with same megaflow
>
>Hi, Ian, everyone.
>It's OK to merge. I still don't like the change, but I can live with it.
>P.S. The current version cannot be applied cleanly; a minor rebase is required.
>
>The text below is not directly related to this patch.
>---
>Meanwhile, in general, I think that the main processing part
>of dpif-netdev (dp_netdev_input__) is overcomplicated. It looks
>like we have 4 different levels of packet batching (some of which
>exist at the same time):
>1. Rx batches
>2. Per-flow batches
>3. New flow maps
>4. Output batches.
>We're constantly re-batching packets into different data
>structures. That's frustrating.
>
>I'd like to refactor it towards more iterative and sequential
>processing, for example using the "keys_map" approach from
>"dpcls_lookup" and calling the different processing functions until
>"keys_map" is empty. Each processing function could use
>"ULLONG_FOR_EACH_1" on this "keys_map", set up rules for matched
>packets and clear them from "keys_map" using "ULLONG_SET0".
>Right now we have 5 options to classify packets:
>1. has_flow_mark()
>2. emc_lookup()
>3. smc_lookup_batch()
>4. dpcls_lookup()
>5. handle_packet_upcall()
>Some of them work on packet batches, some on individual packets.
>IMHO, if we make a plain pipeline out of them, calling the next
>stage as long as there are unhandled packets in "keys_map", it could
>be much simpler compared to the current call tree with its mess of
>processing functions, each implemented in a different way.
>In the end we would be able to push all the packets to per-flow
>batches and execute actions.
>
>In fact, this would effectively resolve the current reordering issue too.
>
>I'm not sure if the above text is parsable. =) Sorry if not.
>
>Any thoughts?
>Anyway, I'm going to try.

[Wang, Yipeng] 
Hi, Ilya,

Your suggestion of refactoring the current code would be nice. Here are my two cents:

1. Previously, when I implemented the SMC cache, I tried to create a key array and pass it to EMC batch processing and SMC batch processing as two stages (mostly for performance reasons; batching EMC lets me do software pipelining). The issue (at least what I observed) is that doing miniflow_extract into a full 32-element key array before EMC processing occupies so many cache lines that it becomes costly. The current implementation reuses the same key cache line for miniflow_extract if the previous key hits the EMC, which helps cache locality in the mostly-EMC-hit case. So if we assume EMC hits are common, you may want to keep EMC lookup and miniflow extraction separate from the other stages for performance reasons.

2. Using masks to pass batches around is a good improvement to code readability. However, I don't know of a convenient way to implement software pipelining with masks, so a compact pointer array may still be beneficial where software pipelining is needed.

Please include me once you have a draft.

Thanks
Yipeng

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 13a20f0..3ea25e2 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -244,6 +244,13 @@  struct dpcls_rule {
     /* 'flow' must be the last field, additional space is allocated here. */
 };
 
+/* Data structure to keep packet order till fastpath processing. */
+struct dp_packet_flow_map {
+    struct dp_packet *packet;
+    struct dp_netdev_flow *flow;
+    uint16_t tcp_flags;
+};
+
 static void dpcls_init(struct dpcls *);
 static void dpcls_destroy(struct dpcls *);
 static void dpcls_sort_subtable_vector(struct dpcls *);
@@ -5774,6 +5781,19 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, tcp_flags);
 }
 
+static inline void
+packet_enqueue_to_flow_map(struct dp_packet *packet,
+                           struct dp_netdev_flow *flow,
+                           uint16_t tcp_flags,
+                           struct dp_packet_flow_map *flow_map,
+                           size_t index)
+{
+    struct dp_packet_flow_map *map = &flow_map[index];
+    map->flow = flow;
+    map->packet = packet;
+    map->tcp_flags = tcp_flags;
+}
+
 /* SMC lookup function for a batch of packets.
  * By doing batching SMC lookup, we can use prefetch
  * to hide memory access latency.
@@ -5783,8 +5803,9 @@  smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
             struct netdev_flow_key *keys,
             struct netdev_flow_key **missed_keys,
             struct dp_packet_batch *packets_,
-            struct packet_batch_per_flow batches[],
-            size_t *n_batches, const int cnt)
+            const int cnt,
+            struct dp_packet_flow_map *flow_map,
+            uint8_t *index_map)
 {
     int i;
     struct dp_packet *packet;
@@ -5792,6 +5813,8 @@  smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
     struct dfc_cache *cache = &pmd->flow_cache;
     struct smc_cache *smc_cache = &cache->smc_cache;
     const struct cmap_node *flow_node;
+    int recv_idx;
+    uint16_t tcp_flags;
 
     /* Prefetch buckets for all packets */
     for (i = 0; i < cnt; i++) {
@@ -5802,6 +5825,8 @@  smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
         struct dp_netdev_flow *flow = NULL;
         flow_node = smc_entry_get(pmd, keys[i].hash);
         bool hit = false;
+        /* Get the original order of this packet in received batch. */
+        recv_idx = index_map[i];
 
         if (OVS_LIKELY(flow_node != NULL)) {
             CMAP_NODE_FOR_EACH (flow, node, flow_node) {
@@ -5809,12 +5834,17 @@  smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
                  * number, we need to  verify that the input ports match. */
                 if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, &keys[i]) &&
                 flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) {
+                    tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
+
                     /* SMC hit and emc miss, we insert into EMC */
                     emc_probabilistic_insert(pmd, &keys[i], flow);
                     keys[i].len =
                         netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
-                    dp_netdev_queue_batches(packet, flow,
-                    miniflow_get_tcp_flags(&keys[i].mf), batches, n_batches);
+                    /* Add these packets into the flow map in the same order
+                     * as received.
+                     */
+                    packet_enqueue_to_flow_map(packet, flow, tcp_flags,
+                                               flow_map, recv_idx);
                     n_smc_hit++;
                     hit = true;
                     break;
@@ -5828,6 +5858,10 @@  smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
         /* SMC missed. Group missed packets together at
          * the beginning of the 'packets' array. */
         dp_packet_batch_refill(packets_, packet, i);
+
+        /* Preserve the order of packet for flow batching. */
+        index_map[n_missed] = recv_idx;
+
         /* Put missed keys to the pointer arrays return to the caller */
         missed_keys[n_missed++] = &keys[i];
     }
@@ -5856,6 +5890,8 @@  dfc_processing(struct dp_netdev_pmd_thread *pmd,
                struct netdev_flow_key *keys,
                struct netdev_flow_key **missed_keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
+               struct dp_packet_flow_map *flow_map,
+               size_t *n_flows, uint8_t *index_map,
                bool md_is_valid, odp_port_t port_no)
 {
     struct netdev_flow_key *key = &keys[0];
@@ -5867,6 +5903,8 @@  dfc_processing(struct dp_netdev_pmd_thread *pmd,
     int i;
     uint16_t tcp_flags;
     bool smc_enable_db;
+    size_t map_cnt = 0;
+    bool batch_enable = true;
 
     atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
@@ -5897,10 +5935,19 @@  dfc_processing(struct dp_netdev_pmd_thread *pmd,
         if ((*recirc_depth_get() == 0) &&
             dp_packet_has_flow_mark(packet, &mark)) {
             flow = mark_to_flow_find(pmd, mark);
-            if (flow) {
+            if (OVS_LIKELY(flow)) {
                 tcp_flags = parse_tcp_flags(packet);
-                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
-                                        n_batches);
+                if (OVS_LIKELY(batch_enable)) {
+                    dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
+                                            n_batches);
+                } else {
+                    /* Flow batching should be performed only after fast-path
+                     * processing is also completed for packets with emc miss
+                     * or else it will result in reordering of packets with
+                     * same datapath flows. */
+                    packet_enqueue_to_flow_map(packet, flow, tcp_flags,
+                                               flow_map, map_cnt++);
+                }
                 continue;
             }
         }
@@ -5923,13 +5970,27 @@  dfc_processing(struct dp_netdev_pmd_thread *pmd,
         }
         if (OVS_LIKELY(flow)) {
             tcp_flags = miniflow_get_tcp_flags(&key->mf);
-            dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
-                                    n_batches);
             n_emc_hit++;
+            if (OVS_LIKELY(batch_enable)) {
+                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
+                                        n_batches);
+            } else {
+                /* Flow batching should be performed only after fast-path
+                 * processing is also completed for packets with emc miss
+                 * or else it will result in reordering of packets with
+                 * same datapath flows. */
+                packet_enqueue_to_flow_map(packet, flow, tcp_flags,
+                                           flow_map, map_cnt++);
+            }
         } else {
             /* Exact match cache missed. Group missed packets together at
              * the beginning of the 'packets' array. */
             dp_packet_batch_refill(packets_, packet, i);
+
+            /* Preserve the order of packet for flow batching. */
+            index_map[n_missed] = map_cnt;
+            flow_map[map_cnt++].flow = NULL;
+
             /* 'key[n_missed]' contains the key of the current packet and it
              * will be passed to SMC lookup. The next key should be extracted
              * to 'keys[n_missed + 1]'.
@@ -5937,8 +5998,13 @@  dfc_processing(struct dp_netdev_pmd_thread *pmd,
              * which will be returned to the caller for future processing. */
             missed_keys[n_missed] = key;
             key = &keys[++n_missed];
+
+            /* Skip batching for subsequent packets to avoid reordering. */
+            batch_enable = false;
         }
     }
+    /* Count of packets which are not flow batched. */
+    *n_flows = map_cnt;
 
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, n_emc_hit);
 
@@ -5947,8 +6013,8 @@  dfc_processing(struct dp_netdev_pmd_thread *pmd,
     }
 
     /* Packets miss EMC will do a batch lookup in SMC if enabled */
-    smc_lookup_batch(pmd, keys, missed_keys, packets_, batches,
-                            n_batches, n_missed);
+    smc_lookup_batch(pmd, keys, missed_keys, packets_,
+                     n_missed, flow_map, index_map);
 
     return dp_packet_batch_size(packets_);
 }
@@ -6035,8 +6101,8 @@  static inline void
 fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                      struct dp_packet_batch *packets_,
                      struct netdev_flow_key **keys,
-                     struct packet_batch_per_flow batches[],
-                     size_t *n_batches,
+                     struct dp_packet_flow_map *flow_map,
+                     uint8_t *index_map,
                      odp_port_t in_port)
 {
     const size_t cnt = dp_packet_batch_size(packets_);
@@ -6116,6 +6182,9 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
     DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
         struct dp_netdev_flow *flow;
+        /* Get the original order of this packet in received batch. */
+        int recv_idx = index_map[i];
+        uint16_t tcp_flags;
 
         if (OVS_UNLIKELY(!rules[i])) {
             continue;
@@ -6126,9 +6195,12 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
         smc_insert(pmd, keys[i], hash);
 
         emc_probabilistic_insert(pmd, keys[i], flow);
-        dp_netdev_queue_batches(packet, flow,
-                                miniflow_get_tcp_flags(&keys[i]->mf),
-                                batches, n_batches);
+        /* Add these packets into the flow map in the same order
+         * as received.
+         */
+        tcp_flags = miniflow_get_tcp_flags(&keys[i]->mf);
+        packet_enqueue_to_flow_map(packet, flow, tcp_flags,
+                                   flow_map, recv_idx);
     }
 
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
@@ -6161,18 +6233,34 @@  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
     struct netdev_flow_key *missed_keys[PKT_ARRAY_SIZE];
     struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
     size_t n_batches;
+    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
+    uint8_t index_map[PKT_ARRAY_SIZE];
+    size_t n_flows, i;
+
     odp_port_t in_port;
 
     n_batches = 0;
     dfc_processing(pmd, packets, keys, missed_keys, batches, &n_batches,
-                            md_is_valid, port_no);
+                   flow_map, &n_flows, index_map, md_is_valid, port_no);
+
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
         in_port = packets->packets[0]->md.in_port.odp_port;
         fast_path_processing(pmd, packets, missed_keys,
-                             batches, &n_batches, in_port);
+                             flow_map, index_map, in_port);
     }
 
+    /* Batch rest of packets which are in flow map. */
+    for (i = 0; i < n_flows; i++) {
+        struct dp_packet_flow_map *map = &flow_map[i];
+
+        if (OVS_UNLIKELY(!map->flow)) {
+            continue;
+        }
+        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
+                                batches, &n_batches);
+    }
+
     /* All the flow batches need to be reset before any call to
      * packet_batch_per_flow_execute() as it could potentially trigger
      * recirculation. When a packet matching flow ‘j’ happens to be
@@ -6182,7 +6270,6 @@  dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
      * already its own batches[k] still waiting to be served.  So if its
      * ‘batch’ member is not reset, the recirculated packet would be wrongly
      * appended to batches[k] of the 1st call to dp_netdev_input__(). */
-    size_t i;
     for (i = 0; i < n_batches; i++) {
         batches[i].flow->batch = NULL;
     }