diff mbox

[ovs-dev,RFC,2/4] dpif-netdev: Skip EMC lookup/insert for recirculated packets.

Message ID 1497867118-4195-2-git-send-email-antonio.fischetti@intel.com
State Superseded
Headers show

Commit Message

Fischetti, Antonio June 19, 2017, 10:11 a.m. UTC
From: Antonio Fischetti <antonio.fischetti@intel.com>

When OVS is configured as a firewall, with thousands of active
concurrent connections, the EMC gets quicly saturated and may come under
heavy thrashing for the reason that original and recirculated packets
keep overwrite existing active EMC entries due to its limited size(8k).

This thrashing causes the EMC to be less efficient than the dcpls in
terms of lookups and insertions.

This patch allows to use the EMC efficiently by allowing only the 'original'
packets to hit EMC. All recirculated packets are sent to classifier directly.
An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC occupancy is
set to trigger this logic. By doing so when EMC utilization exceeds
EMC_FULL_THRESHOLD.
 - EMC Insertions are allowed just for original packets. EMC insertion
   and look up is skipped for recirculated packets.
 - Recirculated packets are sent to classifier.

This patch depends on the previous one in this series. It's based on patch
"dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" at:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html

Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Co-authored-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
---
In our Connection Tracker testbench set up with

 table=0, priority=1 actions=drop
 table=0, priority=10,arp actions=NORMAL
 table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
 table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2
 table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
 table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
 table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

we saw the following performance improvement.

Measured packet Rx rate (regardless of packet loss). Bidirectional
test with 64B UDP packets.
Each row is a test with a different number of traffic streams. The traffic
generator is set so that each stream establishes one UDP connection.
Mpps columns reports the Rx rates on the 2 sides.

 Traffic |    Orig    |     Orig      |  +changes  |   +changes
 Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
---------+------------+---------------+------------+---------------
     10  |  3.4, 3.4  |      20       |  3.4, 3.4  |      20
    100  |  2.6, 2.7  |     200       |  2.6, 2.7  |     201
  1,000  |  2.4, 2.4  |    2009       |  2.4, 2.4  |    1994
  2,000  |  2.2, 2.2  |    3903       |  2.2, 2.2  |    3900
  3,000  |  2.1, 2.1  |    5473       |  2.2, 2.2  |    4798
  4,000  |  2.0, 2.0  |    6478       |  2.2, 2.2  |    5663
 10,000  |  1.8, 1.9  |    8070       |  2.0, 2.0  |    7347
100,000  |  1.7, 1.7  |    8192       |  1.8, 1.8  |    8192

 lib/dpif-netdev.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

Comments

Billy O'Mahony June 23, 2017, 5:38 p.m. UTC | #1
Hi Antonio,

This is a really interesting patch. Comments inline below. 

Thanks,
/Billy.

> -----Original Message-----
> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-
> bounces@openvswitch.org] On Behalf Of antonio.fischetti@intel.com
> Sent: Monday, June 19, 2017 11:12 AM
> To: dev@openvswitch.org
> Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert for
> recirculated packets.
> 
> From: Antonio Fischetti <antonio.fischetti@intel.com>
> 
> When OVS is configured as a firewall, with thousands of active concurrent
> connections, the EMC gets quicly saturated and may come under heavy
> thrashing for the reason that original and recirculated packets keep overwrite
> existing active EMC entries due to its limited size(8k).
> 
> This thrashing causes the EMC to be less efficient than the dcpls in terms of
> lookups and insertions.
> 
> This patch allows to use the EMC efficiently by allowing only the 'original'
> packets to hit EMC. All recirculated packets are sent to classifier directly.
> An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC occupancy
> is set to trigger this logic. By doing so when EMC utilization exceeds
> EMC_FULL_THRESHOLD.
>  - EMC Insertions are allowed just for original packets. EMC insertion
>    and look up is skipped for recirculated packets.
>  - Recirculated packets are sent to classifier.
> 
> This patch depends on the previous one in this series. It's based on patch
> "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" at:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html
> 
> Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
> Signed-off-by: Bhanuprakash Bodireddy
> <bhanuprakash.bodireddy@intel.com>
> Co-authored-by: Bhanuprakash Bodireddy
> <bhanuprakash.bodireddy@intel.com>
> ---
> In our Connection Tracker testbench set up with
> 
>  table=0, priority=1 actions=drop
>  table=0, priority=10,arp actions=NORMAL  table=0, priority=100,ct_state=-
> trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> 
> we saw the following performance improvement.
> 
> Measured packet Rx rate (regardless of packet loss). Bidirectional test with
> 64B UDP packets.
> Each row is a test with a different number of traffic streams. The traffic
> generator is set so that each stream establishes one UDP connection.
> Mpps columns reports the Rx rates on the 2 sides.
> 
>  Traffic |    Orig    |     Orig      |  +changes  |   +changes
>  Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
> ---------+------------+---------------+------------+---------------
>      10  |  3.4, 3.4  |      20       |  3.4, 3.4  |      20
>     100  |  2.6, 2.7  |     200       |  2.6, 2.7  |     201
>   1,000  |  2.4, 2.4  |    2009       |  2.4, 2.4  |    1994
>   2,000  |  2.2, 2.2  |    3903       |  2.2, 2.2  |    3900
>   3,000  |  2.1, 2.1  |    5473       |  2.2, 2.2  |    4798
>   4,000  |  2.0, 2.0  |    6478       |  2.2, 2.2  |    5663
>  10,000  |  1.8, 1.9  |    8070       |  2.0, 2.0  |    7347
> 100,000  |  1.7, 1.7  |    8192       |  1.8, 1.8  |    8192
> 

[[BO'M]] 
A few questions on the test:
Are all the pkts rxd being recirculated?
Are there any flows present where the pkts do not require recirculation? 
Was the rxd rss hash calculation offloaded to the NIC?
For the cases with larger numbers of flows (10K , 100K) did you investigate the results when the EMC is simply switched off? 

Say we have 3000 flows (the lowest figure at which we see a positive effect) that means 6000 flows are contending for places in the emc.  
Is the effect we see here to do with disabling recirculated packets in particular or just reducing contention on the emc in general. I know that the recirculated pkt hashes require software hashing albeit a small amount so they do make a good category of packet to drop from the emc when contention is severe.

Once there are too many entries contending  for any cache it's going to become a liability as the lookup_cost + (miss_rate * cost_of_miss) grows to be greater that the cost_of_miss. And in that case any scheme where a very cheap decision can be made to not insert a certain category of packets and to also not check the emc for that same category will reduce the cache miss rate.

But I'm not sure that is_recirculated is the best categorization on which to make that decision. Mainly because it is not controllable. In this test case 50% pkts were recirculated so by ruling these packets out of eligibility for the EMC you get a really large reduction in EMC contention. However you can never know ahead of time if any packets will be recirc'd. You may have a situation where the EMC is totally swamped with 200K flows as above but none of them are re-circ'd packets so ruling out those packets will give no benefit. 


>  lib/dpif-netdev.c | 46 ++++++++++++++++++++++++++++++++++++++++--
> ----
>  1 file changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index fd2ed52..64a3cd4
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4538,6 +4538,8 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>      packet_batch_per_flow_update(batch, pkt, mf);  }
> 
> +#define EMC_FULL_THRESHOLD 0x0000F000
> +
[[BO'M]] 

Use of FULL in #def is a little misleading as it is really indicating the threshold after which recirc pkts do not get inserted to the EMC. 

EMC_RECIRCT_NO_INSERT_THRESHOLD maybe?

As this #def is used with an & operator not a > operator in the patch it needs to be clear that number has the restrictions of 
a) having exactly one bit set so is limited to power of two values. 
b) needs to not greater  than 1u << (EM_FLOW_HASH_SHIFT-1)

Would suggest that value is based directly off EM_FLOW_HASH_SHIFT

e.g 
#def EMC_RECIRCT_NO_INSERT_HALF (1u << (EM_FLOW_HASH_SHIFT-1)) 
#def EMC_RECIRCT_NO_INSERT_QUARTER (1u << (EM_FLOW_HASH_SHIFT-2))
#def EMC_RECIRCT_NO_INSERT_EIGHT (1u << (EM_FLOW_HASH_SHIFT-3))

#def EMC_RECIRCT_NO_INSERT_THRESHOLD  EMC_RECIRCT_NO_INSERT_HALF


>  /* Try to process all ('cnt') the 'packets' using only the exact match cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the @@ -
> 4582,6 +4584,19 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>              pkt_metadata_prefetch_init(&packets[i+1]->md);
>          }
> 
> +        /*
> +         * EMC lookup is skipped when one or both of the following
> +         * two cases occurs:
> +         *
> +         *   - EMC is disabled.  This is detected from cur_min.
> +         *
> +         *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
> +         *     packet to be classified is being recirculated.  When this
> +         *     happens also EMC insertions are skipped for recirculated
> +         *     packets.  So that EMC is used just to store entries which
> +         *     are hit from the 'original' packets.  This way the EMC
> +         *     thrashing is mitigated with a benefit on performance.
> +         */
>          if (!md_is_valid) {
>              pkt_metadata_init(&packet->md, port_no);
>              miniflow_extract(packet, &key->mf); @@ -4603,11 +4618,18 @@
> emc_processing(struct dp_netdev_pmd_thread *pmd,
>          } else {
>              /* Recirculated packets. */
>              miniflow_extract(packet, &key->mf);
> -            if (OVS_LIKELY(cur_min)) {
> -                key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
> -                flow = emc_lookup(flow_cache, key);
> -            } else {
> +            if (flow_cache->n_entries & EMC_FULL_THRESHOLD) {
> +                /* EMC occupancy is over the threshold.  We skip EMC
> +                 * lookup for recirculated packets. */
>                  flow = NULL;
> +            } else {
> +                if (OVS_LIKELY(cur_min)) {
> +                    key->hash = dpif_netdev_packet_get_rss_hash(packet,
> +                                    &key->mf);
> +                    flow = emc_lookup(flow_cache, key);
> +                } else {
> +                    flow = NULL;
> +                }
>              }
>          }
>          key->len = 0; /* Not computed yet. */ @@ -4695,7 +4717,13 @@
> handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>                                               add_actions->size);
>          }
>          ovs_mutex_unlock(&pmd->flow_mutex);
> -        emc_probabilistic_insert(pmd, key, netdev_flow);
> +        /* When EMC occupancy goes over a threshold we avoid inserting new
> +         * entries for recirculated packets. */
> +        if (!packet->md.recirc_id) {
> +            emc_probabilistic_insert(pmd, key, netdev_flow);
> +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> +            emc_probabilistic_insert(pmd, key, netdev_flow);
> +        }
>      }
>  }
> 
> @@ -4788,7 +4816,13 @@ fast_path_processing(struct
> dp_netdev_pmd_thread *pmd,
> 
>          flow = dp_netdev_flow_cast(rules[i]);
> 
> -        emc_probabilistic_insert(pmd, &keys[i], flow);
> +        /* When EMC occupancy goes over a threshold we avoid inserting new
> +         * entries for recirculated packets. */
> +        if (!packet->md.recirc_id) {
> +            emc_probabilistic_insert(pmd, &keys[i], flow);
> +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> +            emc_probabilistic_insert(pmd, &keys[i], flow);
> +        }
>          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> n_batches);
>      }
> 
> --
> 2.4.11
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Fischetti, Antonio June 23, 2017, 9:49 p.m. UTC | #2
Thanks a lot Billy, really appreciate your feedback.
My replies inline.

/Antonio

> -----Original Message-----
> From: O Mahony, Billy
> Sent: Friday, June 23, 2017 6:39 PM
> To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert
> for recirculated packets.
> 
> Hi Antonio,
> 
> This is a really interesting patch. Comments inline below.
> 
> Thanks,
> /Billy.
> 
> > -----Original Message-----
> > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-
> > bounces@openvswitch.org] On Behalf Of antonio.fischetti@intel.com
> > Sent: Monday, June 19, 2017 11:12 AM
> > To: dev@openvswitch.org
> > Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert
> for
> > recirculated packets.
> >
> > From: Antonio Fischetti <antonio.fischetti@intel.com>
> >
> > When OVS is configured as a firewall, with thousands of active
> concurrent
> > connections, the EMC gets quicly saturated and may come under heavy
> > thrashing for the reason that original and recirculated packets keep
> overwrite
> > existing active EMC entries due to its limited size(8k).
> >
> > This thrashing causes the EMC to be less efficient than the dcpls in
> terms of
> > lookups and insertions.
> >
> > This patch allows to use the EMC efficiently by allowing only the
> 'original'
> > packets to hit EMC. All recirculated packets are sent to classifier
> directly.
> > An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC occupancy
> > is set to trigger this logic. By doing so when EMC utilization exceeds
> > EMC_FULL_THRESHOLD.
> >  - EMC Insertions are allowed just for original packets. EMC insertion
> >    and look up is skipped for recirculated packets.
> >  - Recirculated packets are sent to classifier.
> >
> > This patch depends on the previous one in this series. It's based on
> patch
> > "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show"
> at:
> > https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html
> >
> > Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
> > Signed-off-by: Bhanuprakash Bodireddy
> > <bhanuprakash.bodireddy@intel.com>
> > Co-authored-by: Bhanuprakash Bodireddy
> > <bhanuprakash.bodireddy@intel.com>
> > ---
> > In our Connection Tracker testbench set up with
> >
> >  table=0, priority=1 actions=drop
> >  table=0, priority=10,arp actions=NORMAL  table=0,
> priority=100,ct_state=-
> > trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> > actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> > actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> > table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> >
> > we saw the following performance improvement.
> >
> > Measured packet Rx rate (regardless of packet loss). Bidirectional test
> with
> > 64B UDP packets.
> > Each row is a test with a different number of traffic streams. The
> traffic
> > generator is set so that each stream establishes one UDP connection.
> > Mpps columns reports the Rx rates on the 2 sides.
> >
> >  Traffic |    Orig    |     Orig      |  +changes  |   +changes
> >  Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
> > ---------+------------+---------------+------------+---------------
> >      10  |  3.4, 3.4  |      20       |  3.4, 3.4  |      20
> >     100  |  2.6, 2.7  |     200       |  2.6, 2.7  |     201
> >   1,000  |  2.4, 2.4  |    2009       |  2.4, 2.4  |    1994
> >   2,000  |  2.2, 2.2  |    3903       |  2.2, 2.2  |    3900
> >   3,000  |  2.1, 2.1  |    5473       |  2.2, 2.2  |    4798
> >   4,000  |  2.0, 2.0  |    6478       |  2.2, 2.2  |    5663
> >  10,000  |  1.8, 1.9  |    8070       |  2.0, 2.0  |    7347
> > 100,000  |  1.7, 1.7  |    8192       |  1.8, 1.8  |    8192
> >
> 
> [[BO'M]]
> A few questions on the test:
> Are all the pkts rxd being recirculated?

[AF] Yes, I sent UDP packets with the firewall rules above. Every packets 
goes through the CT module, so after that it is recirculated.

> Are there any flows present where the pkts do not require recirculation?

[AF] No. The flow
table=0,priority=100,ct_state=-trk,ip actions=ct(table=1)
implies that any received packet goes through CT and then it is recirculated.

> Was the rxd rss hash calculation offloaded to the NIC?

[AF] Yes.

> For the cases with larger numbers of flows (10K , 100K) did you
> investigate the results when the EMC is simply switched off?

[AF] No, I just focused on this criteria to check if I could really get some
performance improvement. It's likely that with - say 100k or more flows - it's
better to switch off EMC? 

> 
> Say we have 3000 flows (the lowest figure at which we see a positive
> effect) that means 6000 flows are contending for places in the emc.
> Is the effect we see here to do with disabling recirculated packets in
> particular or just reducing contention on the emc in general. 

[AF] I guess the benefit comes from reducing the continuous recursive overwrites in EMC.
With this approach
  => less overhead for insertions
  => it's more likely that original packets do find a matching EMC entry. Otherwise
whatever we insert into EMC it's likely to be overwritten before a new packet
of that same flow comes in. When we insert entries for recirculated packets for sure we're deleting useful entries referring to some active megaflows and this will in turn happen again for the next packets recursively.


I know that
> the recirculated pkt hashes require software hashing albeit a small amount
> so they do make a good category of packet to drop from the emc when
> contention is severe.

[AF] Agree, if we skip EMC for recirculated packets
 1. we save some cpu cycles for hash computation 
 2. we mitigate the EMC thrashing, and this is the main purpose of this patch. 

> 
> Once there are too many entries contending  for any cache it's going to
> become a liability as the lookup_cost + (miss_rate * cost_of_miss) grows
> to be greater that the cost_of_miss. And in that case any scheme where a
> very cheap decision can be made to not insert a certain category of
> packets and to also not check the emc for that same category will reduce
> the cache miss rate.

[AF] agree, thanks for explaining this with clear words, I tried to do the
same but I was not so successful :-)

> 
> But I'm not sure that is_recirculated is the best categorization on which
> to make that decision. Mainly because it is not controllable. In this test
> case 50% pkts were recirculated so by ruling these packets out of
> eligibility for the EMC you get a really large reduction in EMC
> contention. However you can never know ahead of time if any packets will
> be recirc'd. You may have a situation where the EMC is totally swamped
> with 200K flows as above but none of them are re-circ'd packets so ruling
> out those packets will give no benefit.

[AF] agree, this approach is limited to conntrack, or any tunneling usecases.

> 
> 
> >  lib/dpif-netdev.c | 46 ++++++++++++++++++++++++++++++++++++++++--
> > ----
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> fd2ed52..64a3cd4
> > 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -4538,6 +4538,8 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
> >      packet_batch_per_flow_update(batch, pkt, mf);  }
> >
> > +#define EMC_FULL_THRESHOLD 0x0000F000
> > +
> [[BO'M]]
> 
> Use of FULL in #def is a little misleading as it is really indicating the
> threshold after which recirc pkts do not get inserted to the EMC.
> 
> EMC_RECIRCT_NO_INSERT_THRESHOLD maybe?

[AF] you're right, I'll take note of that. Or something like EMC_OCCUPANCY_THRESHOLD? Hmm,.. 

> 
> As this #def is used with an & operator not a > operator in the patch it
> needs to be clear that number has the restrictions of
> a) having exactly one bit set so is limited to power of two values.
> b) needs to not greater  than 1u << (EM_FLOW_HASH_SHIFT-1)
> 
> Would suggest that value is based directly off EM_FLOW_HASH_SHIFT
> 
> e.g
> #def EMC_RECIRCT_NO_INSERT_HALF (1u << (EM_FLOW_HASH_SHIFT-1))
> #def EMC_RECIRCT_NO_INSERT_QUARTER (1u << (EM_FLOW_HASH_SHIFT-2))
> #def EMC_RECIRCT_NO_INSERT_EIGHT (1u << (EM_FLOW_HASH_SHIFT-3))
> 
> #def EMC_RECIRCT_NO_INSERT_THRESHOLD  EMC_RECIRCT_NO_INSERT_HALF
> 

[AF] Thanks, that would help also on readability. I used the & operator just for
performance reasons, I'm not quite sure if the > operator would require some more 
cpu cycles, I didn't check that.

> 
> >  /* Try to process all ('cnt') the 'packets' using only the exact match
> cache
> >   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]',
> the
> >   * miniflow is copied into 'keys' and the packet pointer is moved at
> the @@ -
> > 4582,6 +4584,19 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> >              pkt_metadata_prefetch_init(&packets[i+1]->md);
> >          }
> >
> > +        /*
> > +         * EMC lookup is skipped when one or both of the following
> > +         * two cases occurs:
> > +         *
> > +         *   - EMC is disabled.  This is detected from cur_min.
> > +         *
> > +         *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
> > +         *     packet to be classified is being recirculated.  When
> this
> > +         *     happens also EMC insertions are skipped for recirculated
> > +         *     packets.  So that EMC is used just to store entries
> which
> > +         *     are hit from the 'original' packets.  This way the EMC
> > +         *     thrashing is mitigated with a benefit on performance.
> > +         */
> >          if (!md_is_valid) {
> >              pkt_metadata_init(&packet->md, port_no);
> >              miniflow_extract(packet, &key->mf); @@ -4603,11 +4618,18 @@
> > emc_processing(struct dp_netdev_pmd_thread *pmd,
> >          } else {
> >              /* Recirculated packets. */
> >              miniflow_extract(packet, &key->mf);
> > -            if (OVS_LIKELY(cur_min)) {
> > -                key->hash = dpif_netdev_packet_get_rss_hash(packet,
> &key->mf);
> > -                flow = emc_lookup(flow_cache, key);
> > -            } else {
> > +            if (flow_cache->n_entries & EMC_FULL_THRESHOLD) {
> > +                /* EMC occupancy is over the threshold.  We skip EMC
> > +                 * lookup for recirculated packets. */
> >                  flow = NULL;
> > +            } else {
> > +                if (OVS_LIKELY(cur_min)) {
> > +                    key->hash = dpif_netdev_packet_get_rss_hash(packet,
> > +                                    &key->mf);
> > +                    flow = emc_lookup(flow_cache, key);
> > +                } else {
> > +                    flow = NULL;
> > +                }
> >              }
> >          }
> >          key->len = 0; /* Not computed yet. */ @@ -4695,7 +4717,13 @@
> > handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> >                                               add_actions->size);
> >          }
> >          ovs_mutex_unlock(&pmd->flow_mutex);
> > -        emc_probabilistic_insert(pmd, key, netdev_flow);
> > +        /* When EMC occupancy goes over a threshold we avoid inserting
> new
> > +         * entries for recirculated packets. */
> > +        if (!packet->md.recirc_id) {
> > +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> > +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > +        }
> >      }
> >  }
> >
> > @@ -4788,7 +4816,13 @@ fast_path_processing(struct
> > dp_netdev_pmd_thread *pmd,
> >
> >          flow = dp_netdev_flow_cast(rules[i]);
> >
> > -        emc_probabilistic_insert(pmd, &keys[i], flow);
> > +        /* When EMC occupancy goes over a threshold we avoid inserting
> new
> > +         * entries for recirculated packets. */
> > +        if (!packet->md.recirc_id) {
> > +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> > +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > +        }
> >          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> > n_batches);
> >      }
> >
> > --
> > 2.4.11
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Billy O'Mahony July 26, 2017, 8:54 a.m. UTC | #3
Hi Antonio,

> -----Original Message-----
> From: Fischetti, Antonio
> Sent: Friday, June 23, 2017 10:49 PM
> To: O Mahony, Billy <billy.o.mahony@intel.com>; dev@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert
> for recirculated packets.
> 
> Thanks a lot Billy, really appreciate your feedback.
> My replies inline.
> 
> /Antonio
> 
> > -----Original Message-----
> > From: O Mahony, Billy
> > Sent: Friday, June 23, 2017 6:39 PM
> > To: Fischetti, Antonio <antonio.fischetti@intel.com>;
> > dev@openvswitch.org
> > Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC
> > lookup/insert for recirculated packets.
> >
> > Hi Antonio,
> >
> > This is a really interesting patch. Comments inline below.
> >
> > Thanks,
> > /Billy.
> >
> > > -----Original Message-----
> > > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-
> > > bounces@openvswitch.org] On Behalf Of antonio.fischetti@intel.com
> > > Sent: Monday, June 19, 2017 11:12 AM
> > > To: dev@openvswitch.org
> > > Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC
> > > lookup/insert
> > for
> > > recirculated packets.
> > >
> > > From: Antonio Fischetti <antonio.fischetti@intel.com>
> > >
> > > When OVS is configured as a firewall, with thousands of active
> > concurrent
> > > connections, the EMC gets quicly saturated and may come under heavy
> > > thrashing for the reason that original and recirculated packets keep
> > overwrite
> > > existing active EMC entries due to its limited size(8k).
> > >
> > > This thrashing causes the EMC to be less efficient than the dcpls in
> > terms of
> > > lookups and insertions.
> > >
> > > This patch allows to use the EMC efficiently by allowing only the
> > 'original'
> > > packets to hit EMC. All recirculated packets are sent to classifier
> > directly.
> > > An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC
> > > occupancy is set to trigger this logic. By doing so when EMC
> > > utilization exceeds EMC_FULL_THRESHOLD.
> > >  - EMC Insertions are allowed just for original packets. EMC insertion
> > >    and look up is skipped for recirculated packets.
> > >  - Recirculated packets are sent to classifier.
> > >
> > > This patch depends on the previous one in this series. It's based on
> > patch
> > > "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show"
> > at:
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.h
> > > tml
> > >
> > > Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
> > > Signed-off-by: Bhanuprakash Bodireddy
> > > <bhanuprakash.bodireddy@intel.com>
> > > Co-authored-by: Bhanuprakash Bodireddy
> > > <bhanuprakash.bodireddy@intel.com>
> > > ---
> > > In our Connection Tracker testbench set up with
> > >
> > >  table=0, priority=1 actions=drop
> > >  table=0, priority=10,arp actions=NORMAL  table=0,
> > priority=100,ct_state=-
> > > trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> > > actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> > > actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2
> > > actions=drop table=1, ct_state=+est+trk,ip,in_port=2
> > > actions=output:1
> > >
> > > we saw the following performance improvement.
> > >
> > > Measured packet Rx rate (regardless of packet loss). Bidirectional
> > > test
> > with
> > > 64B UDP packets.
> > > Each row is a test with a different number of traffic streams. The
> > traffic
> > > generator is set so that each stream establishes one UDP connection.
> > > Mpps columns reports the Rx rates on the 2 sides.
> > >
> > >  Traffic |    Orig    |     Orig      |  +changes  |   +changes
> > >  Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
> > > ---------+------------+---------------+------------+---------------
> > >      10  |  3.4, 3.4  |      20       |  3.4, 3.4  |      20
> > >     100  |  2.6, 2.7  |     200       |  2.6, 2.7  |     201
> > >   1,000  |  2.4, 2.4  |    2009       |  2.4, 2.4  |    1994
> > >   2,000  |  2.2, 2.2  |    3903       |  2.2, 2.2  |    3900
> > >   3,000  |  2.1, 2.1  |    5473       |  2.2, 2.2  |    4798
> > >   4,000  |  2.0, 2.0  |    6478       |  2.2, 2.2  |    5663
> > >  10,000  |  1.8, 1.9  |    8070       |  2.0, 2.0  |    7347
> > > 100,000  |  1.7, 1.7  |    8192       |  1.8, 1.8  |    8192
> > >
> >
> > [[BO'M]]
> > A few questions on the test:
> > Are all the pkts rxd being recirculated?
> 
> [AF] Yes, I sent UDP packets with the firewall rules above. Every packets goes
> through the CT module, so after that it is recirculated.
> 
> > Are there any flows present where the pkts do not require recirculation?
> 
> [AF] No. The flow
> table=0,priority=100,ct_state=-trk,ip actions=ct(table=1) implies that any
> received packet goes through CT and then it is recirculated.
> 
> > Was the rxd rss hash calculation offloaded to the NIC?
> 
> [AF] Yes.
> 
> > For the cases with larger numbers of flows (10K , 100K) did you
> > investigate the results when the EMC is simply switched off?
> 
> [AF] No, I just focused on this criteria to check if I could really get some
> performance improvement. It's likely that with - say 100k or more flows - it's
> better to switch off EMC?
[[BO'M]] Quite possibly but I think it would depend on the distribution of the packets across flows. If each packet round-robins through each of the 100K flows (maybe a typical traffic generator behavior)  then that is the worst possible situation for the EMC - nearly all packets would miss the EMC and it'd be better off switched off. But in a more realistic situation the probability that the next packet belongs to any given flow is random or even better maybe a normal distribution, then the EMC would provide many more hits. 
> 
> >
> > Say we have 3000 flows (the lowest figure at which we see a positive
> > effect) that means 6000 flows are contending for places in the emc.
> > Is the effect we see here to do with disabling recirculated packets in
> > particular or just reducing contention on the emc in general.
> 
> [AF] I guess the benefit comes from reducing the continuous recursive
> overwrites in EMC.
> With this approach
>   => less overhead for insertions
>   => it's more likely that original packets do find a matching EMC entry.
> Otherwise whatever we insert into EMC it's likely to be overwritten before a
> new packet of that same flow comes in. When we insert entries for
> recirculated packets for sure we're deleting useful entries referring to some
> active megaflows and this will in turn happen again for the next packets
> recursively.
[[BO'M]] Okay. I think we are agreeing that when the EMC is overloaded then it is wasteful to shove even more entries in there! 
> 
> 
> I know that
> > the recirculated pkt hashes require software hashing albeit a small
> > amount so they do make a good category of packet to drop from the emc
> > when contention is severe.
> 
> [AF] Agree, if we skip EMC for recirculated packets  1. we save some cpu
> cycles for hash computation  2. we mitigate the EMC thrashing, and this is the
> main purpose of this patch.
> 
> >
> > Once there are too many entries contending  for any cache it's going
> > to become a liability as the lookup_cost + (miss_rate * cost_of_miss)
> > grows to be greater that the cost_of_miss. And in that case any scheme
> > where a very cheap decision can be made to not insert a certain
> > category of packets and to also not check the emc for that same
> > category will reduce the cache miss rate.
> 
> [AF] agree, thanks for explaining this with clear words, I tried to do the same
> but I was not so successful :-)
> 
> >
> > But I'm not sure that is_recirculated is the best categorization on
> > which to make that decision. Mainly because it is not controllable. In
> > this test case 50% pkts were recirculated so by ruling these packets
> > out of eligibility for the EMC you get a really large reduction in EMC
> > contention. However you can never know ahead of time if any packets
> > will be recirc'd. You may have a situation where the EMC is totally
> > swamped with 200K flows as above but none of them are re-circ'd
> > packets so ruling out those packets will give no benefit.
> 
> [AF] agree, this approach is limited to conntrack, or any tunneling usecases.
> 
[[BO'M]] I'll send an email to ovs-dev to elaborate further on what could be a more general scheme of EMC load-shedding.  If people think it's useful we can begin to refine it further on the ML.
> >
> >
> > >  lib/dpif-netdev.c | 46
> ++++++++++++++++++++++++++++++++++++++++--
> > > ----
> > >  1 file changed, 40 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > fd2ed52..64a3cd4
> > > 100644
> > > --- a/lib/dpif-netdev.c
> > > +++ b/lib/dpif-netdev.c
> > > @@ -4538,6 +4538,8 @@ dp_netdev_queue_batches(struct dp_packet
> *pkt,
> > >      packet_batch_per_flow_update(batch, pkt, mf);  }
> > >
> > > +#define EMC_FULL_THRESHOLD 0x0000F000
> > > +
> > [[BO'M]]
> >
> > Use of FULL in #def is a little misleading as it is really indicating
> > the threshold after which recirc pkts do not get inserted to the EMC.
> >
> > EMC_RECIRCT_NO_INSERT_THRESHOLD maybe?
> 
> [AF] you're right, I'll take note of that. Or something like
> EMC_OCCUPANCY_THRESHOLD? Hmm,..
> 
> >
> > As this #def is used with an & operator not a > operator in the patch
> > it needs to be clear that number has the restrictions of
> > a) having exactly one bit set so is limited to power of two values.
> > b) needs to not greater  than 1u << (EM_FLOW_HASH_SHIFT-1)
> >
> > Would suggest that value is based directly off EM_FLOW_HASH_SHIFT
> >
> > e.g
> > #def EMC_RECIRCT_NO_INSERT_HALF (1u << (EM_FLOW_HASH_SHIFT-1))
> #def
> > EMC_RECIRCT_NO_INSERT_QUARTER (1u << (EM_FLOW_HASH_SHIFT-2))
> #def
> > EMC_RECIRCT_NO_INSERT_EIGHT (1u << (EM_FLOW_HASH_SHIFT-3))
> >
> > #def EMC_RECIRCT_NO_INSERT_THRESHOLD
> EMC_RECIRCT_NO_INSERT_HALF
> >
> 
> [AF] Thanks, that would help also on readability. I used the & operator just
> for performance reasons, I'm not quite sure if the > operator would require
> some more cpu cycles, I didn't check that.
> 
> >
> > >  /* Try to process all ('cnt') the 'packets' using only the exact
> > > match
> > cache
> > >   * 'pmd->flow_cache'. If a flow is not found for a packet
> > > 'packets[i]',
> > the
> > >   * miniflow is copied into 'keys' and the packet pointer is moved
> > > at
> > the @@ -
> > > 4582,6 +4584,19 @@ emc_processing(struct dp_netdev_pmd_thread
> *pmd,
> > >              pkt_metadata_prefetch_init(&packets[i+1]->md);
> > >          }
> > >
> > > +        /*
> > > +         * EMC lookup is skipped when one or both of the following
> > > +         * two cases occurs:
> > > +         *
> > > +         *   - EMC is disabled.  This is detected from cur_min.
> > > +         *
> > > +         *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
> > > +         *     packet to be classified is being recirculated.  When
> > this
> > > +         *     happens also EMC insertions are skipped for recirculated
> > > +         *     packets.  So that EMC is used just to store entries
> > which
> > > +         *     are hit from the 'original' packets.  This way the EMC
> > > +         *     thrashing is mitigated with a benefit on performance.
> > > +         */
> > >          if (!md_is_valid) {
> > >              pkt_metadata_init(&packet->md, port_no);
> > >              miniflow_extract(packet, &key->mf); @@ -4603,11
> > > +4618,18 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> > >          } else {
> > >              /* Recirculated packets. */
> > >              miniflow_extract(packet, &key->mf);
> > > -            if (OVS_LIKELY(cur_min)) {
> > > -                key->hash = dpif_netdev_packet_get_rss_hash(packet,
> > &key->mf);
> > > -                flow = emc_lookup(flow_cache, key);
> > > -            } else {
> > > +            if (flow_cache->n_entries & EMC_FULL_THRESHOLD) {
> > > +                /* EMC occupancy is over the threshold.  We skip EMC
> > > +                 * lookup for recirculated packets. */
> > >                  flow = NULL;
> > > +            } else {
> > > +                if (OVS_LIKELY(cur_min)) {
> > > +                    key->hash = dpif_netdev_packet_get_rss_hash(packet,
> > > +                                    &key->mf);
> > > +                    flow = emc_lookup(flow_cache, key);
> > > +                } else {
> > > +                    flow = NULL;
> > > +                }
> > >              }
> > >          }
> > >          key->len = 0; /* Not computed yet. */ @@ -4695,7 +4717,13
> > > @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> > >                                               add_actions->size);
> > >          }
> > >          ovs_mutex_unlock(&pmd->flow_mutex);
> > > -        emc_probabilistic_insert(pmd, key, netdev_flow);
> > > +        /* When EMC occupancy goes over a threshold we avoid
> > > + inserting
> > new
> > > +         * entries for recirculated packets. */
> > > +        if (!packet->md.recirc_id) {
> > > +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > > +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> > > +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > > +        }
> > >      }
> > >  }
> > >
> > > @@ -4788,7 +4816,13 @@ fast_path_processing(struct
> > > dp_netdev_pmd_thread *pmd,
> > >
> > >          flow = dp_netdev_flow_cast(rules[i]);
> > >
> > > -        emc_probabilistic_insert(pmd, &keys[i], flow);
> > > +        /* When EMC occupancy goes over a threshold we avoid
> > > + inserting
> > new
> > > +         * entries for recirculated packets. */
> > > +        if (!packet->md.recirc_id) {
> > > +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > > +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> > > +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > > +        }
> > >          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> > > n_batches);
> > >      }
> > >
> > > --
> > > 2.4.11
> > >
> > > _______________________________________________
> > > dev mailing list
> > > dev@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Fischetti, Antonio July 26, 2017, 12:43 p.m. UTC | #4
Thanks for your feedback, good to see this could follow up with some further
solutions. In the meantime - based also on your suggestions - I posted a 
v2 of this patch at 
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335940.html

It's in a patchset that begins at 
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335938.html

-Antonio

> -----Original Message-----
> From: O Mahony, Billy
> Sent: Wednesday, July 26, 2017 9:55 AM
> To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert
> for recirculated packets.
> 
> Hi Antonio,
>
diff mbox

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index fd2ed52..64a3cd4 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4538,6 +4538,8 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, mf);
 }
 
+#define EMC_FULL_THRESHOLD 0x0000F000
+
 /* Try to process all ('cnt') the 'packets' using only the exact match cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
@@ -4582,6 +4584,19 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
             pkt_metadata_prefetch_init(&packets[i+1]->md);
         }
 
+        /*
+         * EMC lookup is skipped when one or both of the following
+         * two cases occurs:
+         *
+         *   - EMC is disabled.  This is detected from cur_min.
+         *
+         *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
+         *     packet to be classified is being recirculated.  When this
+         *     happens also EMC insertions are skipped for recirculated
+         *     packets.  So that EMC is used just to store entries which
+         *     are hit from the 'original' packets.  This way the EMC
+         *     thrashing is mitigated with a benefit on performance.
+         */
         if (!md_is_valid) {
             pkt_metadata_init(&packet->md, port_no);
             miniflow_extract(packet, &key->mf);
@@ -4603,11 +4618,18 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         } else {
             /* Recirculated packets. */
             miniflow_extract(packet, &key->mf);
-            if (OVS_LIKELY(cur_min)) {
-                key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
-                flow = emc_lookup(flow_cache, key);
-            } else {
+            if (flow_cache->n_entries & EMC_FULL_THRESHOLD) {
+                /* EMC occupancy is over the threshold.  We skip EMC
+                 * lookup for recirculated packets. */
                 flow = NULL;
+            } else {
+                if (OVS_LIKELY(cur_min)) {
+                    key->hash = dpif_netdev_packet_get_rss_hash(packet,
+                                    &key->mf);
+                    flow = emc_lookup(flow_cache, key);
+                } else {
+                    flow = NULL;
+                }
             }
         }
         key->len = 0; /* Not computed yet. */
@@ -4695,7 +4717,13 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                                              add_actions->size);
         }
         ovs_mutex_unlock(&pmd->flow_mutex);
-        emc_probabilistic_insert(pmd, key, netdev_flow);
+        /* When EMC occupancy goes over a threshold we avoid inserting new
+         * entries for recirculated packets. */
+        if (!packet->md.recirc_id) {
+            emc_probabilistic_insert(pmd, key, netdev_flow);
+        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
+            emc_probabilistic_insert(pmd, key, netdev_flow);
+        }
     }
 }
 
@@ -4788,7 +4816,13 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
         flow = dp_netdev_flow_cast(rules[i]);
 
-        emc_probabilistic_insert(pmd, &keys[i], flow);
+        /* When EMC occupancy goes over a threshold we avoid inserting new
+         * entries for recirculated packets. */
+        if (!packet->md.recirc_id) {
+            emc_probabilistic_insert(pmd, &keys[i], flow);
+        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
+            emc_probabilistic_insert(pmd, &keys[i], flow);
+        }
         dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
     }