Message ID | 1530924401-32446-1-git-send-email-vishal.deep.ajmera@ericsson.com
State | Superseded
Delegated to | Ian Stokes
Series | [ovs-dev,v3] dpif-netdev: Avoid reordering of packets in a batch with same megaflow
Hi Vishal,

I guess the series is not rebased on top of the HWOL series; many of the comments are related to this issue.

Saturday, July 7, 2018 3:47 AM, Vishal Deep Ajmera:
> Subject: [ovs-dev] [PATCH v3] dpif-netdev: Avoid reordering of packets in a
> batch with same megaflow
>
> OVS reads packets in batches from a given port, and packets in the batch
> are subjected to potentially three levels of lookup to identify the
> datapath megaflow entry (or flow) associated with the packet. Each
> megaflow entry has a dedicated buffer in which packets that match the flow
> classification criteria are collected. This buffer helps OVS perform batch
> processing for all packets associated with a given flow.
>
> Each packet in the received batch is first subjected to a lookup in the
> Exact Match Cache (EMC). Each EMC entry points to a flow. If the EMC
> lookup is successful, the packet is moved from the rx batch to the
> per-flow buffer.
>
> Packets that did not match any EMC entry are rearranged at the beginning
> of the rx batch and are then subjected to a lookup in the megaflow cache.
> Packets that match a megaflow cache entry are *appended* to the per-flow
> buffer.
>
> Packets that do not match any megaflow entry are subjected to slow-path
> processing through the upcall mechanism. This cannot change the order of
> packets, since by definition upcall processing is only done for packets
> without a matching megaflow entry.
>
> The EMC entry match fields encompass all potentially significant header
> fields, typically more than specified in the associated flow's match
> criteria. Hence, multiple EMC entries can point to the same flow. Given
> that per-flow batching happens at each lookup stage, packets belonging to
> the same megaflow can get re-ordered because some packets match EMC
> entries while others do not.
>
> The following example illustrates the issue. Consider the following batch
> of packets (labelled P1 to P8) associated with a single TCP connection and
> a single flow. Assume that packets with just the ACK bit set in the TCP
> flags have also been received in a prior batch, so a corresponding EMC
> entry exists.
>
> 1. P1 (TCP flags: ACK)
> 2. P2 (TCP flags: ACK)
> 3. P3 (TCP flags: ACK)
> 4. P4 (TCP flags: ACK, PSH)
> 5. P5 (TCP flags: ACK)
> 6. P6 (TCP flags: ACK)
> 7. P7 (TCP flags: ACK)
> 8. P8 (TCP flags: ACK)
>
> The megaflow classification criteria do not include TCP flags, while the
> EMC match criteria do. Thus, all packets other than P4 match the existing
> EMC entry and are moved to the per-flow packet batch. Subsequently, packet
> P4 is moved to the same per-flow packet batch as a result of the megaflow
> lookup. Though the packets have all been correctly classified as being
> associated with the same flow, the packet order has not been preserved
> because of the per-flow batching performed during the EMC lookup stage.
> This packet re-ordering has performance implications for TCP applications.
>
> This patch preserves the packet ordering by performing the per-flow
> batching after both the EMC and megaflow lookups are complete. As an
> optimization, packets are flow-batched in EMC processing until any packet
> in the batch has an EMC miss.
>
> A new flow map is maintained to keep the original order of the packets
> along with their flow information. After fast-path processing, packets
> from the flow map are *appended* to the per-flow buffer.
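In code terms, the scheme described above reduces to the small bookkeeping structure the patch introduces plus a per-batch index map. The sketch below is a summary, not additional code: the struct is taken from the patch, while the phase comments are an editorial paraphrase of its logic.

```c
/* One slot is reserved, in received order, for every packet that cannot
 * be batched immediately (struct as introduced by this patch). */
struct dp_packet_flow_map {
    struct dp_packet *packet;
    struct dp_netdev_flow *flow;   /* NULL while an EMC-missed packet
                                    * waits for the megaflow lookup. */
    uint16_t tcp_flags;
};

/* Phase 1 (EMC): hits seen before the first miss are queued to their
 * per-flow batch right away; a miss reserves a slot with flow == NULL,
 * records its rx position in index_map[], and disables immediate
 * batching; later hits fill fresh slots in rx order.
 * Phase 2 (megaflow): each resolved miss fills its reserved slot,
 * located through index_map[].
 * Phase 3: the map is walked front to back and the deferred packets
 * are queued, so packets of the same flow keep their rx order. */
```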
>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> ---
>  lib/dpif-netdev.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 67 insertions(+), 13 deletions(-)
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 9390fff..23cda57 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -207,6 +207,13 @@ struct dpcls_rule {
>      /* 'flow' must be the last field, additional space is allocated here. */
>  };
>
> +/* data structure to keep packet order till fastpath processing */
> +struct dp_packet_flow_map {
> +    struct dp_packet *packet;
> +    struct dp_netdev_flow *flow;
> +    uint16_t tcp_flags;
> +};
> +
>  static void dpcls_init(struct dpcls *);
>  static void dpcls_destroy(struct dpcls *);
>  static void dpcls_sort_subtable_vector(struct dpcls *);
> @@ -5081,10 +5088,10 @@ struct packet_batch_per_flow {
>  static inline void
>  packet_batch_per_flow_update(struct packet_batch_per_flow *batch,
>                               struct dp_packet *packet,
> -                             const struct miniflow *mf)
> +                             uint16_t tcp_flags)

This change is already done.

>  {
>      batch->byte_count += dp_packet_size(packet);
> -    batch->tcp_flags |= miniflow_get_tcp_flags(mf);
> +    batch->tcp_flags |= tcp_flags;

Ditto.

>      batch->array.packets[batch->array.count++] = packet;
>  }
>
> @@ -5118,7 +5125,7 @@ packet_batch_per_flow_execute(struct packet_batch_per_flow *batch,
>
>  static inline void
>  dp_netdev_queue_batches(struct dp_packet *pkt,
> -                        struct dp_netdev_flow *flow, const struct miniflow *mf,
> +                        struct dp_netdev_flow *flow, uint16_t tcp_flags,

Ditto.

>                          struct packet_batch_per_flow *batches,
>                          size_t *n_batches)
>  {
> @@ -5129,7 +5136,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>          packet_batch_per_flow_init(batch, flow);
>      }
>
> -    packet_batch_per_flow_update(batch, pkt, mf);
> +    packet_batch_per_flow_update(batch, pkt, tcp_flags);

Ditto.

>  }
>
>  /* Try to process all ('cnt') the 'packets' using only the exact match cache
> @@ -5151,6 +5158,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>                 struct dp_packet_batch *packets_,
>                 struct netdev_flow_key *keys,
>                 struct packet_batch_per_flow batches[], size_t *n_batches,
> +               struct dp_packet_flow_map *flow_map,
> +               size_t *n_flows,
> +               uint8_t *index_map,
>                 bool md_is_valid, odp_port_t port_no)
>  {
>      struct emc_cache *flow_cache = &pmd->flow_cache;
> @@ -5160,6 +5170,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>      const size_t cnt = dp_packet_batch_size(packets_);
>      uint32_t cur_min;
>      int i;
> +    size_t map_cnt = 0;
> +    uint16_t tcp_flags = 0;

This variable was already defined by the HWOL series.

> +    bool batch_enable = true;
>
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>      pmd_perf_update_counter(&pmd->perf_stats,
> @@ -5168,6 +5181,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>
>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>          struct dp_netdev_flow *flow;
> +        struct dp_packet_flow_map *map;
>
>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>              dp_packet_delete(packet);
> @@ -5200,8 +5214,20 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>              flow = NULL;
>          }
>          if (OVS_LIKELY(flow)) {
> -            dp_netdev_queue_batches(packet, flow, &key->mf, batches,
> -                                    n_batches);
> +            tcp_flags = miniflow_get_tcp_flags(&key->mf);

This change was already done.
> +            if (OVS_LIKELY(batch_enable)) {
> +                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                        n_batches);

I think we should have the same if/else in case we have a hit on the flow mark search (mark_to_flow_find). Currently, when we have a flow mark we immediately queue the packet into a batch, which can cause reordering as you stated. Perhaps this logic should be in a separate inline function.

> +            } else {
> +                /* Flow batching should be performed only after fast-path
> +                 * processing is also completed for packets with emc miss
> +                 * or else it will result in reordering of packets with
> +                 * same datapath flows. */
> +                map = &flow_map[map_cnt++];
> +                map->flow = flow;
> +                map->packet = packet;
> +                map->tcp_flags = tcp_flags;
> +            }
>          } else {
>              /* Exact match cache missed. Group missed packets together at
>               * the beginning of the 'packets' array. */
> @@ -5210,9 +5236,17 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>               * must be returned to the caller. The next key should be extracted
>               * to 'keys[n_missed + 1]'. */
>              key = &keys[++n_missed];
> +
> +            /* preserve the order of packet for flow batching */
> +            index_map[packets_->count - 1] = map_cnt;
> +            flow_map[map_cnt++].flow = NULL;
> +
> +            /* skip batching for subsequent packets to avoid reordering */
> +            batch_enable = false;
>          }
>      }
>
> +    *n_flows = map_cnt;
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
>                              cnt - n_dropped - n_missed);
>
> @@ -5299,8 +5333,8 @@ static inline void
>  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet_batch *packets_,
>                       struct netdev_flow_key *keys,
> -                     struct packet_batch_per_flow batches[],
> -                     size_t *n_batches,
> +                     struct dp_packet_flow_map *flow_map,
> +                     uint8_t *index_map,
>                       odp_port_t in_port)
>  {
>      const size_t cnt = dp_packet_batch_size(packets_);
> @@ -5379,6 +5413,8 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>
>      DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
>          struct dp_netdev_flow *flow;
> +        /* get the original order of this packet in received batch */
> +        int recv_idx = index_map[i];
>
>          if (OVS_UNLIKELY(!rules[i])) {
>              continue;
> @@ -5387,7 +5423,12 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>          flow = dp_netdev_flow_cast(rules[i]);
>
>          emc_probabilistic_insert(pmd, &keys[i], flow);
> -        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
> +        /* add these packets into the flow map in the same order
> +         * as received.
> +         */
> +        flow_map[recv_idx].packet = packet;
> +        flow_map[recv_idx].flow = flow;
> +        flow_map[recv_idx].tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
>      }
>
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> @@ -5418,17 +5459,31 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>          struct netdev_flow_key keys[PKT_ARRAY_SIZE];
>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
> -    size_t n_batches;
> +    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
> +    uint8_t index_map[PKT_ARRAY_SIZE];
> +    size_t n_batches, n_flows = 0;
>      odp_port_t in_port;
> +    size_t i;
>
>      n_batches = 0;
>      emc_processing(pmd, packets, keys, batches, &n_batches,
> -                   md_is_valid, port_no);
> +                   flow_map, &n_flows, index_map, md_is_valid, port_no);
>      if (!dp_packet_batch_is_empty(packets)) {
>          /* Get ingress port from first packet's metadata. */
>          in_port = packets->packets[0]->md.in_port.odp_port;
>          fast_path_processing(pmd, packets, keys,
> -                             batches, &n_batches, in_port);
> +                             flow_map, index_map, in_port);
> +    }
> +
> +    /* batch rest of packets which are in flow map */
> +    for (i = 0; i < n_flows; i++) {
> +        struct dp_packet_flow_map *map = &flow_map[i];
> +
> +        if (OVS_UNLIKELY(!map->flow)) {
> +            continue;
> +        }
> +        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
> +                                batches, &n_batches);
>      }
>
>      /* All the flow batches need to be reset before any call to
> @@ -5440,7 +5495,6 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>       * already its own batches[k] still waiting to be served. So if its
>       * 'batch' member is not reset, the recirculated packet would be wrongly
>       * appended to batches[k] of the 1st call to dp_netdev_input__(). */
> -    size_t i;
>      for (i = 0; i < n_batches; i++) {
>          batches[i].flow->batch = NULL;
>      }
> --
> 1.9.1
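One possible shape of the inline helper suggested in the review above is sketched below; the name and exact signature are hypothetical, not part of this patch. It simply factors out the patch's if/else so that an eventual flow-mark (mark_to_flow_find) hit path could share the same "batch now or defer" decision, assuming the surrounding dpif-netdev.c context.

```c
/* Hypothetical helper (editorial sketch, not part of this patch):
 * decide whether a classified packet can be batched immediately or
 * must be deferred to the flow map to preserve rx order.  Factored
 * from the EMC-hit path so a flow-mark hit path could reuse it. */
static inline void
packet_enqueue_to_flow_map(struct dp_packet *packet,
                           struct dp_netdev_flow *flow,
                           uint16_t tcp_flags,
                           bool batch_enable,
                           struct packet_batch_per_flow *batches,
                           size_t *n_batches,
                           struct dp_packet_flow_map *flow_map,
                           size_t *map_cnt)
{
    if (OVS_LIKELY(batch_enable)) {
        dp_netdev_queue_batches(packet, flow, tcp_flags, batches, n_batches);
    } else {
        /* Defer: batching happens only after fast-path processing
         * completes, so packets of the same flow keep their rx order. */
        struct dp_packet_flow_map *map = &flow_map[(*map_cnt)++];
        map->flow = flow;
        map->packet = packet;
        map->tcp_flags = tcp_flags;
    }
}
```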
> I guess the series is not rebased on top of the HWOL series; many of the
> comments are related to this issue.

Yes, it is not rebased on top of the HWOL series. Ian asked for this, but I could not figure out who would rebase with which patch. Now that the pull request has already been raised with your patch-set, I will rebase with your patch and post the v4 version. Thanks.
> > I guess the series is not rebased on top of the HWOL series; many of
> > the comments are related to this issue.
>
> Yes, it is not rebased on top of the HWOL series. Ian asked for this, but
> I could not figure out who would rebase with which patch. Now that the
> pull request has already been raised with your patch-set, I will rebase
> with your patch and post the v4 version. Thanks.

Hi Ian, Shahaf,

I have rebased the patch on the latest OVS and sent the v4 version. Please review and merge if no issues are seen.

Warm Regards,
Vishal Ajmera
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 9390fff..23cda57 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -207,6 +207,13 @@ struct dpcls_rule {
     /* 'flow' must be the last field, additional space is allocated here. */
 };
 
+/* data structure to keep packet order till fastpath processing */
+struct dp_packet_flow_map {
+    struct dp_packet *packet;
+    struct dp_netdev_flow *flow;
+    uint16_t tcp_flags;
+};
+
 static void dpcls_init(struct dpcls *);
 static void dpcls_destroy(struct dpcls *);
 static void dpcls_sort_subtable_vector(struct dpcls *);
@@ -5081,10 +5088,10 @@ struct packet_batch_per_flow {
 static inline void
 packet_batch_per_flow_update(struct packet_batch_per_flow *batch,
                              struct dp_packet *packet,
-                             const struct miniflow *mf)
+                             uint16_t tcp_flags)
 {
     batch->byte_count += dp_packet_size(packet);
-    batch->tcp_flags |= miniflow_get_tcp_flags(mf);
+    batch->tcp_flags |= tcp_flags;
 
     batch->array.packets[batch->array.count++] = packet;
 }
 
@@ -5118,7 +5125,7 @@ packet_batch_per_flow_execute(struct packet_batch_per_flow *batch,
 
 static inline void
 dp_netdev_queue_batches(struct dp_packet *pkt,
-                        struct dp_netdev_flow *flow, const struct miniflow *mf,
+                        struct dp_netdev_flow *flow, uint16_t tcp_flags,
                         struct packet_batch_per_flow *batches,
                         size_t *n_batches)
 {
@@ -5129,7 +5136,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
         packet_batch_per_flow_init(batch, flow);
     }
 
-    packet_batch_per_flow_update(batch, pkt, mf);
+    packet_batch_per_flow_update(batch, pkt, tcp_flags);
 }
 
 /* Try to process all ('cnt') the 'packets' using only the exact match cache
@@ -5151,6 +5158,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
                struct dp_packet_batch *packets_,
                struct netdev_flow_key *keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
+               struct dp_packet_flow_map *flow_map,
+               size_t *n_flows,
+               uint8_t *index_map,
                bool md_is_valid, odp_port_t port_no)
 {
     struct emc_cache *flow_cache = &pmd->flow_cache;
@@ -5160,6 +5170,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
     const size_t cnt = dp_packet_batch_size(packets_);
     uint32_t cur_min;
     int i;
+    size_t map_cnt = 0;
+    uint16_t tcp_flags = 0;
+    bool batch_enable = true;
 
     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
     pmd_perf_update_counter(&pmd->perf_stats,
@@ -5168,6 +5181,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
 
     DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
         struct dp_netdev_flow *flow;
+        struct dp_packet_flow_map *map;
 
         if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
             dp_packet_delete(packet);
@@ -5200,8 +5214,20 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
             flow = NULL;
         }
         if (OVS_LIKELY(flow)) {
-            dp_netdev_queue_batches(packet, flow, &key->mf, batches,
-                                    n_batches);
+            tcp_flags = miniflow_get_tcp_flags(&key->mf);
+            if (OVS_LIKELY(batch_enable)) {
+                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
+                                        n_batches);
+            } else {
+                /* Flow batching should be performed only after fast-path
+                 * processing is also completed for packets with emc miss
+                 * or else it will result in reordering of packets with
+                 * same datapath flows. */
+                map = &flow_map[map_cnt++];
+                map->flow = flow;
+                map->packet = packet;
+                map->tcp_flags = tcp_flags;
+            }
         } else {
             /* Exact match cache missed. Group missed packets together at
              * the beginning of the 'packets' array. */
@@ -5210,9 +5236,17 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
              * must be returned to the caller. The next key should be extracted
              * to 'keys[n_missed + 1]'. */
             key = &keys[++n_missed];
+
+            /* preserve the order of packet for flow batching */
+            index_map[packets_->count - 1] = map_cnt;
+            flow_map[map_cnt++].flow = NULL;
+
+            /* skip batching for subsequent packets to avoid reordering */
+            batch_enable = false;
         }
     }
 
+    *n_flows = map_cnt;
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
                             cnt - n_dropped - n_missed);
 
@@ -5299,8 +5333,8 @@ static inline void
 fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                      struct dp_packet_batch *packets_,
                      struct netdev_flow_key *keys,
-                     struct packet_batch_per_flow batches[],
-                     size_t *n_batches,
+                     struct dp_packet_flow_map *flow_map,
+                     uint8_t *index_map,
                      odp_port_t in_port)
 {
     const size_t cnt = dp_packet_batch_size(packets_);
@@ -5379,6 +5413,8 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
     DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
         struct dp_netdev_flow *flow;
+        /* get the original order of this packet in received batch */
+        int recv_idx = index_map[i];
 
         if (OVS_UNLIKELY(!rules[i])) {
             continue;
@@ -5387,7 +5423,12 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
         flow = dp_netdev_flow_cast(rules[i]);
 
         emc_probabilistic_insert(pmd, &keys[i], flow);
-        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
+        /* add these packets into the flow map in the same order
+         * as received.
+         */
+        flow_map[recv_idx].packet = packet;
+        flow_map[recv_idx].flow = flow;
+        flow_map[recv_idx].tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
     }
 
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
@@ -5418,17 +5459,31 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
         struct netdev_flow_key keys[PKT_ARRAY_SIZE];
     struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
-    size_t n_batches;
+    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
+    uint8_t index_map[PKT_ARRAY_SIZE];
+    size_t n_batches, n_flows = 0;
     odp_port_t in_port;
+    size_t i;
 
     n_batches = 0;
     emc_processing(pmd, packets, keys, batches, &n_batches,
-                   md_is_valid, port_no);
+                   flow_map, &n_flows, index_map, md_is_valid, port_no);
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
         in_port = packets->packets[0]->md.in_port.odp_port;
         fast_path_processing(pmd, packets, keys,
-                             batches, &n_batches, in_port);
+                             flow_map, index_map, in_port);
+    }
+
+    /* batch rest of packets which are in flow map */
+    for (i = 0; i < n_flows; i++) {
+        struct dp_packet_flow_map *map = &flow_map[i];
+
+        if (OVS_UNLIKELY(!map->flow)) {
+            continue;
+        }
+        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
+                                batches, &n_batches);
     }
 
     /* All the flow batches need to be reset before any call to
@@ -5440,7 +5495,6 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
      * already its own batches[k] still waiting to be served. So if its
      * 'batch' member is not reset, the recirculated packet would be wrongly
      * appended to batches[k] of the 1st call to dp_netdev_input__(). */
-    size_t i;
     for (i = 0; i < n_batches; i++) {
         batches[i].flow->batch = NULL;
     }
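To see the ordering guarantee end to end, here is a small standalone simulation of the P1-P8 example from the commit message, where P4 is the lone EMC miss. This is plain C with hypothetical names, an editorial illustration rather than OVS code; it compresses the real index_map semantics (rx position of a compacted miss to its reserved slot) into a miss-number mapping, which is equivalent for a single batch.

```c
/* Standalone simulation (not OVS code) of the patch's two-phase
 * batching on the commit message's example: P4 is the lone EMC miss. */
#include <stdbool.h>
#include <stdio.h>

#define BATCH_SIZE 8

struct flow_map_entry {
    int packet;      /* Packet label, 1-based. */
    bool resolved;   /* False until the fast path fills the slot. */
};

int main(void)
{
    /* P1..P8; only P4 (index 3) misses the EMC. */
    bool emc_hit[BATCH_SIZE] = { true, true, true, false,
                                 true, true, true, true };
    struct flow_map_entry flow_map[BATCH_SIZE];
    int index_map[BATCH_SIZE];   /* Miss number -> reserved map slot. */
    int missed_pkt[BATCH_SIZE];  /* Miss number -> packet label. */
    int map_cnt = 0, n_missed = 0;
    bool batch_enable = true;

    /* Phase 1 (EMC): queue hits until the first miss; afterwards,
     * defer every packet through the flow map in rx order. */
    for (int i = 0; i < BATCH_SIZE; i++) {
        if (emc_hit[i]) {
            if (batch_enable) {
                printf("queue P%d (EMC hit, immediate)\n", i + 1);
            } else {
                flow_map[map_cnt].packet = i + 1;
                flow_map[map_cnt++].resolved = true;
            }
        } else {
            missed_pkt[n_missed] = i + 1;
            index_map[n_missed++] = map_cnt;   /* Reserve a slot. */
            flow_map[map_cnt++].resolved = false;
            batch_enable = false;
        }
    }

    /* Phase 2 (megaflow/fast path): fill the reserved slots. */
    for (int m = 0; m < n_missed; m++) {
        flow_map[index_map[m]].packet = missed_pkt[m];
        flow_map[index_map[m]].resolved = true;
    }

    /* Phase 3: queue the deferred packets in original rx order. */
    for (int i = 0; i < map_cnt; i++) {
        if (flow_map[i].resolved) {
            printf("queue P%d (from flow map)\n", flow_map[i].packet);
        }
    }
    return 0;
}
```

Compiled and run, it prints P1 through P8 in order: P1-P3 leave immediately on the EMC hit path, while P4-P8 come out of the flow map, matching the received order.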