| Message ID | 1531170831-5641-1-git-send-email-vishal.deep.ajmera@ericsson.com |
|---|---|
| State | Superseded |
| Series | [ovs-dev,v4] dpif-netdev: Avoid reordering of packets in a batch with same megaflow |
Tuesday, July 10, 2018 12:14 AM, Vishal Deep Ajmera:
> Subject: [ovs-dev] [PATCH v4] dpif-netdev: Avoid reordering of packets in a
> batch with same megaflow
>
> OVS reads packets in batches from a given port, and packets in the batch
> are subjected to potentially 3 levels of lookups to identify the datapath
> megaflow entry (or flow) associated with the packet. Each megaflow entry
> has a dedicated buffer in which packets that match the flow classification
> criteria are collected. This buffer helps OVS perform batch processing for
> all packets associated with a given flow.
>
> Each packet in the received batch is first subjected to a lookup in the
> Exact Match Cache (EMC). Each EMC entry points to a flow. If the EMC
> lookup is successful, the packet is moved from the rx batch to the
> per-flow buffer.
>
> Packets that did not match any EMC entry are rearranged at the beginning
> of the rx batch and are then subjected to a lookup in the megaflow cache.
> Packets that match a megaflow cache entry are *appended* to the per-flow
> buffer.
>
> Packets that do not match any megaflow entry are subjected to slow-path
> processing through the upcall mechanism. This cannot change the order of
> packets, since by definition upcall processing is only done for packets
> without a matching megaflow entry.
>
> The EMC entry match fields encompass all potentially significant header
> fields, typically more than specified in the associated flow's match
> criteria. Hence, multiple EMC entries can point to the same flow. Given
> that per-flow batching happens at each lookup stage, packets belonging to
> the same megaflow can get re-ordered because some packets match EMC
> entries while others do not.
>
> The following example illustrates the issue. Consider the following batch
> of packets (labelled P1 to P8) belonging to a single TCP connection and
> hence to a single flow. Assume that packets with just the ACK bit set in
> the TCP flags were also received in a prior batch, so a corresponding EMC
> entry exists.
>
> 1. P1 (TCP Flag: ACK)
> 2. P2 (TCP Flag: ACK)
> 3. P3 (TCP Flag: ACK)
> 4. P4 (TCP Flag: ACK, PSH)
> 5. P5 (TCP Flag: ACK)
> 6. P6 (TCP Flag: ACK)
> 7. P7 (TCP Flag: ACK)
> 8. P8 (TCP Flag: ACK)
>
> The megaflow classification criteria do not include TCP flags, while the
> EMC match criteria do. Thus, all packets other than P4 match the existing
> EMC entry and are moved to the per-flow packet batch. Subsequently, packet
> P4 is moved to the same per-flow packet batch as a result of the megaflow
> lookup. Though the packets have all been correctly classified as belonging
> to the same flow, the packet order has not been preserved because of the
> per-flow batching performed during the EMC lookup stage. This packet
> re-ordering has performance implications for TCP applications.
>
> This patch preserves the packet ordering by performing the per-flow
> batching after both the EMC and megaflow lookups are complete. As an
> optimization, packets are still flow-batched during EMC processing until
> the first packet in the batch misses the EMC.
>
> A new flow map is maintained to keep the original order of the packets
> along with their flow information. After fast-path processing, packets
> from the flow map are *appended* to the per-flow buffer.
>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>

Reviewed-by: Shahaf Shuler <shahafs@mellanox.com>

> ---
>  lib/dpif-netdev.c | 103 +++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 87 insertions(+), 16 deletions(-)
>
[...]
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
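To make the P1-P8 example above concrete, the stand-alone C sketch below (not OVS code; 'struct pkt', the 'emc_hit' field, and the two loops are illustrative stand-ins for the EMC and megaflow lookup stages) simulates the pre-patch two-stage batching and prints the resulting per-flow order:

/* Stand-alone sketch (not OVS code): simulates the pre-patch two-stage
 * per-flow batching for the P1-P8 example. */
#include <stdio.h>

#define N_PKTS 8

struct pkt {
    int id;        /* packet label, P<id> */
    int emc_hit;   /* 1 if an EMC entry exists for this packet's headers */
};

int main(void)
{
    struct pkt batch[N_PKTS] = {
        {1, 1}, {2, 1}, {3, 1},
        {4, 0},                   /* P4 carries ACK|PSH: no EMC entry yet */
        {5, 1}, {6, 1}, {7, 1}, {8, 1},
    };
    int per_flow[N_PKTS];
    int n = 0;

    /* Stage 1: EMC hits are moved to the per-flow batch immediately. */
    for (int i = 0; i < N_PKTS; i++) {
        if (batch[i].emc_hit) {
            per_flow[n++] = batch[i].id;
        }
    }

    /* Stage 2: the megaflow lookup appends the EMC misses afterwards,
     * even though they belong to the same flow. */
    for (int i = 0; i < N_PKTS; i++) {
        if (!batch[i].emc_hit) {
            per_flow[n++] = batch[i].id;
        }
    }

    printf("per-flow batch order:");
    for (int i = 0; i < n; i++) {
        printf(" P%d", per_flow[i]);
    }
    printf("\n");   /* prints: P1 P2 P3 P5 P6 P7 P8 P4 */
    return 0;
}

With the patch applied, stage 1 stops immediate batching at the first EMC miss, and both stages instead record (packet, flow, tcp_flags) into the order-preserving flow map, which is drained into the per-flow batches only after fast-path processing.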
Not a full review. One comment inline.

Best regards, Ilya Maximets.

On 10.07.2018 00:13, Vishal Deep Ajmera wrote:
> OVS reads packets in batches from a given port and packets in the
> batch are subjected to potentially 3 levels of lookups to identify
> the datapath megaflow entry (or flow) associated with the packet.
[...]
> @@ -5695,9 +5739,16 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>              * must be returned to the caller. The next key should be extracted
>              * to 'keys[n_missed + 1]'. */
>             key = &keys[++n_missed];
> +
> +            /* preserve the order of packet for flow batching */
> +            index_map[packets_->count - 1] = map_cnt;

Using "packets_->count" without an accessor, and inside a refill loop, is
highly inaccurate/potentially dangerous. Isn't it equal to "n_missed"?

> +            flow_map[map_cnt++].flow = NULL;
> +
> +            /* skip batching for subsequent packets to avoid reordering */
> +            batch_enable = false;
>         }
>     }
[...]
>> +
>> +            /* preserve the order of packet for flow batching */
>> +            index_map[packets_->count - 1] = map_cnt;
>
> Using "packets_->count" without an accessor, and inside a refill loop, is
> highly inaccurate/potentially dangerous. Isn't it equal to "n_missed"?

Hi Ilya,

Thanks for the review. However, I could not understand your point about
inaccurate/potentially dangerous code. Can you elaborate on a scenario
where this is likely to fail or crash?
On 10.07.2018 14:14, Vishal Deep Ajmera wrote:
>>> +
>>> +            /* preserve the order of packet for flow batching */
>>> +            index_map[packets_->count - 1] = map_cnt;
>>
>> Using "packets_->count" without an accessor, and inside a refill loop, is
>> highly inaccurate/potentially dangerous. Isn't it equal to "n_missed"?
>>
> Hi Ilya,
>
> Thanks for the review. However, I could not understand your point about
> inaccurate/potentially dangerous code. Can you elaborate on a scenario
> where this is likely to fail or crash?

This is potentially dangerous with respect to future modifications, and it
is hard to read for a reviewer or anyone else trying to understand how the
code works.

The current implementation will break if someone changes the logic of
'DP_PACKET_BATCH_REFILL_FOR_EACH', for example.

The key point is that 'DP_PACKET_BATCH_REFILL_FOR_EACH' and
'dp_packet_batch_refill' manipulate the batch size in a manner hidden from
the caller. Using 'packets_->count' directly requires knowledge of how
refilling works internally, whereas these functions were intended to hide
the internals of batch manipulation.

Also, as I understand it, you're storing 'map_cnt' for each missed packet.
So why not just use 'n_missed' for that purpose?

Best regards, Ilya Maximets.
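A simplified model may help here. The sketch below is not the real OVS implementation (the actual DP_PACKET_BATCH_REFILL_FOR_EACH and dp_packet_batch_refill live in lib/dp-packet.h and carry more state; 'toy_batch', 'refill_init', and 'refill' are illustrative stand-ins); it only shows why, while a batch is being refilled, the hidden batch count and a caller-maintained miss counter stay in lockstep:

/* Toy model (not OVS code) of the batch refill pattern. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_BURST 32

struct toy_batch {
    size_t count;
    int packets[MAX_BURST];
};

/* refill_init() hides a reset of the count... */
static void refill_init(struct toy_batch *b) { b->count = 0; }

/* ...and refill() hides an increment.  A caller indexing on 'count'
 * silently depends on both internals. */
static void refill(struct toy_batch *b, int pkt)
{
    b->packets[b->count++] = pkt;
}

int main(void)
{
    struct toy_batch b = { 8, { 1, 2, 3, 4, 5, 6, 7, 8 } };
    size_t orig_cnt = b.count;
    size_t n_missed = 0;

    refill_init(&b);
    for (size_t i = 0; i < orig_cnt; i++) {
        int pkt = b.packets[i];
        int emc_hit = (pkt != 4 && pkt != 6);   /* pretend P4 and P6 miss */

        if (!emc_hit) {
            refill(&b, pkt);        /* keep only the missed packets */
            n_missed++;
            /* The invariant behind Ilya's suggestion: while refilling,
             * the hidden count equals the caller's own miss counter. */
            assert(b.count == n_missed);
        }
    }
    printf("missed %zu packets, refilled batch count %zu\n",
           n_missed, b.count);
    return 0;
}

Indexing with the caller's own counter keeps the same value but survives any future change to how the refill macros manage 'count' internally.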
> This is potentially dangerous with respect to future modifications, and it
> is hard to read for a reviewer or anyone else trying to understand how the
> code works.
[...]
> Also, as I understand it, you're storing 'map_cnt' for each missed packet.
> So why not just use 'n_missed' for that purpose?

map_cnt counts not just the EMC-miss packets but also all subsequent
packets (with an EMC hit) after the first EMC miss.
On 10.07.2018 17:48, Vishal Deep Ajmera wrote:
>> Also, as I understand it, you're storing 'map_cnt' for each missed packet.
>> So why not just use 'n_missed' for that purpose?
>
> map_cnt counts not just the EMC-miss packets but also all subsequent
> packets (with an EMC hit) after the first EMC miss.

Yes. But we're talking about replacing 'packets_->count' with 'n_missed',
not about 'map_cnt'.
[...]
> Yes. But we're talking about replacing 'packets_->count' with 'n_missed',
> not about 'map_cnt'.

Thanks, Ilya. I have fixed this in the v5 patch set.
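The v5 revision itself is not part of this thread, but the substitution Ilya asked for is mechanical: at the point where 'index_map' is written, '++n_missed' has already executed, so 'n_missed' equals 'packets_->count'. A minimal sketch of the equivalent hunk (an illustration only, not the actual v5 change):

-            /* preserve the order of packet for flow batching */
-            index_map[packets_->count - 1] = map_cnt;
+            /* Preserve the order of packets for flow batching.  The
+             * preceding '++n_missed' makes 'n_missed' equal to the
+             * number of packets refilled so far, i.e. 'packets_->count',
+             * so the batch internals need not be read directly. */
+            index_map[n_missed - 1] = map_cnt;

Either way, the stored index is the zero-based position of the missed packet at the front of the refilled batch, which fast_path_processing later uses to write the packet back into 'flow_map' in received order.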
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 8b3556d..d4b8f99 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -208,6 +208,13 @@ struct dpcls_rule {
     /* 'flow' must be the last field, additional space is allocated here. */
 };
 
+/* data structure to keep packet order till fastpath processing */
+struct dp_packet_flow_map {
+    struct dp_packet *packet;
+    struct dp_netdev_flow *flow;
+    uint16_t tcp_flags;
+};
+
 static void dpcls_init(struct dpcls *);
 static void dpcls_destroy(struct dpcls *);
 static void dpcls_sort_subtable_vector(struct dpcls *);
@@ -5602,6 +5609,19 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, tcp_flags);
 }
 
+static inline void
+packet_enqueue_to_flow_map(struct dp_packet_flow_map *flow_map,
+                           struct dp_netdev_flow *flow,
+                           struct dp_packet *packet,
+                           uint16_t tcp_flags,
+                           size_t *map_cnt)
+{
+    struct dp_packet_flow_map *map = &flow_map[(*map_cnt)++];
+    map->flow = flow;
+    map->packet = packet;
+    map->tcp_flags = tcp_flags;
+}
+
 /* Try to process all ('cnt') the 'packets' using only the exact match cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
@@ -5621,6 +5641,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
                struct dp_packet_batch *packets_,
                struct netdev_flow_key *keys,
                struct packet_batch_per_flow batches[], size_t *n_batches,
+               struct dp_packet_flow_map *flow_map,
+               size_t *n_flows,
+               uint8_t *index_map,
                bool md_is_valid, odp_port_t port_no)
 {
     struct emc_cache *flow_cache = &pmd->flow_cache;
@@ -5631,6 +5654,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
     uint32_t cur_min;
     int i;
     uint16_t tcp_flags;
+    size_t map_cnt = 0;
+    bool batch_enable = true;
 
     atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
     pmd_perf_update_counter(&pmd->perf_stats,
@@ -5661,11 +5686,20 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
         if ((*recirc_depth_get() == 0) &&
             dp_packet_has_flow_mark(packet, &mark)) {
             flow = mark_to_flow_find(pmd, mark);
-            if (flow) {
+            if (OVS_LIKELY(flow)) {
                 tcp_flags = parse_tcp_flags(packet);
-                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
-                                        n_batches);
-                continue;
+                if (OVS_LIKELY(batch_enable)) {
+                    dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
+                                            n_batches);
+                    continue;
+                } else {
+                    /* Flow batching should be performed only after fast-path
+                     * processing is also completed for packets with emc miss
+                     * or else it will result in reordering of packets with
+                     * same datapath flows. */
+                    packet_enqueue_to_flow_map(flow_map, flow, packet,
+                                               tcp_flags, &map_cnt);
+                }
             }
         }
 
@@ -5685,8 +5719,18 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
         }
         if (OVS_LIKELY(flow)) {
             tcp_flags = miniflow_get_tcp_flags(&key->mf);
-            dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
-                                    n_batches);
+            if (OVS_LIKELY(batch_enable)) {
+                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
+                                        n_batches);
+            } else {
+                /* Flow batching should be performed only after fast-path
+                 * processing is also completed for packets with emc miss
+                 * or else it will result in reordering of packets with
+                 * same datapath flows. */
+                packet_enqueue_to_flow_map(flow_map, flow, packet, tcp_flags,
+                                           &map_cnt);
+            }
+
         } else {
             /* Exact match cache missed. Group missed packets together at
              * the beginning of the 'packets' array. */
@@ -5695,9 +5739,16 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
              * must be returned to the caller. The next key should be extracted
              * to 'keys[n_missed + 1]'. */
             key = &keys[++n_missed];
+
+            /* preserve the order of packet for flow batching */
+            index_map[packets_->count - 1] = map_cnt;
+            flow_map[map_cnt++].flow = NULL;
+
+            /* skip batching for subsequent packets to avoid reordering */
+            batch_enable = false;
         }
     }
-
+    *n_flows = map_cnt;
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
                             cnt - n_dropped - n_missed);
 
@@ -5784,8 +5835,8 @@ static inline void
 fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                      struct dp_packet_batch *packets_,
                      struct netdev_flow_key *keys,
-                     struct packet_batch_per_flow batches[],
-                     size_t *n_batches,
+                     struct dp_packet_flow_map *flow_map,
+                     uint8_t *index_map,
                      odp_port_t in_port)
 {
     const size_t cnt = dp_packet_batch_size(packets_);
@@ -5864,6 +5915,8 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
     DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
         struct dp_netdev_flow *flow;
+        /* get the original order of this packet in received batch */
+        int recv_idx = index_map[i];
 
         if (OVS_UNLIKELY(!rules[i])) {
             continue;
@@ -5872,9 +5925,13 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
         flow = dp_netdev_flow_cast(rules[i]);
 
         emc_probabilistic_insert(pmd, &keys[i], flow);
-        dp_netdev_queue_batches(packet, flow,
-                                miniflow_get_tcp_flags(&keys[i].mf),
-                                batches, n_batches);
+        /* add these packets into the flow map in the same order
+         * as received.
+         */
+        flow_map[recv_idx].packet = packet;
+        flow_map[recv_idx].flow = flow;
+        flow_map[recv_idx].tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
+
     }
 
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
@@ -5905,19 +5962,34 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
         struct netdev_flow_key keys[PKT_ARRAY_SIZE];
     struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
-    size_t n_batches;
+    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
+    uint8_t index_map[PKT_ARRAY_SIZE];
+    size_t n_batches, n_flows = 0;
     odp_port_t in_port;
+    size_t i;
 
     n_batches = 0;
     emc_processing(pmd, packets, keys, batches, &n_batches,
-                   md_is_valid, port_no);
+                   flow_map, &n_flows, index_map, md_is_valid, port_no);
     if (!dp_packet_batch_is_empty(packets)) {
         /* Get ingress port from first packet's metadata. */
         in_port = packets->packets[0]->md.in_port.odp_port;
         fast_path_processing(pmd, packets, keys,
-                             batches, &n_batches, in_port);
+                             flow_map, index_map, in_port);
     }
 
+    /* batch rest of packets which are in flow map */
+    for (i = 0; i < n_flows; i++) {
+        struct dp_packet_flow_map *map = &flow_map[i];
+
+        if (OVS_UNLIKELY(!map->flow)) {
+            continue;
+        }
+        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
+                                batches, &n_batches);
+    }
+
+
     /* All the flow batches need to be reset before any call to
      * packet_batch_per_flow_execute() as it could potentially trigger
      * recirculation. When a packet matching flow ‘j’ happens to be
@@ -5927,7 +5999,6 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
      * already its own batches[k] still waiting to be served. So if its
      * ‘batch’ member is not reset, the recirculated packet would be wrongly
      * appended to batches[k] of the 1st call to dp_netdev_input__(). */
-    size_t i;
     for (i = 0; i < n_batches; i++) {
         batches[i].flow->batch = NULL;
     }