[ovs-dev,v3,1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets

Message ID 1502466766-17370-2-git-send-email-antonio.fischetti@intel.com
State New

Commit Message

Fischetti, Antonio Aug. 11, 2017, 3:52 p.m.
When OVS is configured as a firewall with thousands of active
concurrent connections, the EMC quickly saturates and may come
under heavy thrashing, because original and recirculated packets
keep overwriting active EMC entries due to its limited size (8k).

This thrashing makes the EMC less efficient than the dpcls
in terms of lookups and insertions.

This patch makes EMC usage more efficient by allowing only the
'original' packets to hit the EMC once it fills up. Recirculated
packets are then sent directly to the classifier.
An empirical threshold, EMC_RECIRCT_NO_INSERT_THRESHOLD, of 50%
EMC occupancy triggers this logic. When EMC utilization exceeds
EMC_RECIRCT_NO_INSERT_THRESHOLD:
 - EMC insertions are allowed only for original packets.
   EMC insertion and lookup are skipped for recirculated packets.
 - Recirculated packets are sent to the classifier.
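
The mask test used in the patch and the 50% figure can be checked against
each other. A minimal sketch, assuming the EMC holds 8192 entries (the 8k
mentioned above); emc_over_threshold() and emc_threshold_self_check() are
hypothetical helpers written for illustration, not functions from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed EMC capacity: the 8k entries mentioned in the commit message. */
#define EMC_ENTRIES 8192u

/* Mask used by the patch: bits 12..31, i.e. every bit worth 4096 or more. */
#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000u

/* Hypothetical helper: true once the occupancy test in the patch fires. */
bool
emc_over_threshold(uint32_t n_entries)
{
    return (n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) != 0;
}

/* Checks that the mask test is the same as 'n_entries >= EMC_ENTRIES / 2'
 * for every possible occupancy of the assumed 8192-entry cache. */
void
emc_threshold_self_check(void)
{
    for (uint32_t n = 0; n <= EMC_ENTRIES; n++) {
        assert(emc_over_threshold(n) == (n >= EMC_ENTRIES / 2));
    }
}
```

Under these assumptions the skip starts at exactly 4096 entries, that is,
at 50% occupancy rather than strictly above it.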

This patch is based on patch
"dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" at:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html

CC: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Co-authored-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
---
Connection Tracker testbench set up with

 table=0, priority=1 actions=drop
 table=0, priority=10,arp actions=NORMAL
 table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
 table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2
 table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
 table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
 table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

2 PMDs, 3 Tx queues.

I measured packet Rx rate (regardless of packet loss). Bidirectional
test with 64B UDP packets.
Each row is a test with a different number of traffic streams. The traffic
generator is set so that each stream establishes one UDP connection.
The Mpps columns report the Rx rates on the 2 sides.

I set up the generator to loop on the dest IP addr on one side,
and loop instead on the source IP addr on the other side.

For example, to generate 10 different flows, I sent to phy port #1:
UDP, IPsrc: 10.10.10.10, IPdest: 20.20.20.[20-29], PortSrc: 63, PortDest: 63

And to phy port #2 (source and dest IPs are now swapped):
UDP, IPsrc: 20.20.20.[20-29], IPdest: 10.10.10.10, PortSrc: 63, PortDest: 63

I saw the following performance improvement.

"Original OvS-DPDK" refers to commit ID:
  6b1babacc3ca0488e07596bf822fe356c9bab646

          +----------------------+-----------------------+
          |  Original OvS-DPDK   |   Original OvS-DPDK   |
          |                      |    + this patch       |
 ---------+------------+---------+------------+----------+
  Traffic |     Rx     |   EMC   |     Rx     |   EMC    |
  Streams |   [Mpps]   | entries |   [Mpps]   | entries  |
 ---------+------------+---------+------------+----------+
     100  | 2.43, 2.49 |   200   | 2.55, 2.57 |   201    |
   1,000  | 2.01, 2.02 |  2007   | 2.12, 2.12 |  2006    |
   2,000  | 1.93, 1.95 |  3868   | 1.98, 1.96 |  3884    |
   3,000  | 1.87, 1.91 |  5086   | 1.97, 1.97 |  4757    |
   4,000  | 1.83, 1.82 |  6173   | 1.94, 1.93 |  5280    |
  10,000  | 1.67, 1.69 |  7826   | 1.82, 1.81 |  7090    |
  30,000  | 1.57, 1.59 |  8192   | 1.66, 1.67 |  8192    |
 ---------+------------+---------+------------+----------+

This test setup implies 1 recirculation on each received packet.
We didn't check this patch in a test scenario where more than 1
recirculation is occurring per packet.
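
To read the table as relative gains, each pair of rates can be reduced to a
percentage. A small sketch; rx_gain_percent() is a name made up here and is
not part of the patch or of OVS:

```c
#include <assert.h>

/* Hypothetical helper: relative Rx gain of the patched build over the
 * baseline, in percent.  Both rates are in Mpps. */
double
rx_gain_percent(double base_mpps, double patched_mpps)
{
    assert(base_mpps > 0.0);
    return (patched_mpps - base_mpps) / base_mpps * 100.0;
}
```

For the first side at 10,000 streams, rx_gain_percent(1.67, 1.82) comes to
roughly 9%; at 100 streams, rx_gain_percent(2.43, 2.55) is roughly 5%.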
---
 lib/dpif-netdev.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 61 insertions(+), 4 deletions(-)

Comments

Darrell Ball Aug. 14, 2017, 6:26 a.m. | #1
-----Original Message-----
From: <ovs-dev-bounces@openvswitch.org> on behalf of "antonio.fischetti@intel.com" <antonio.fischetti@intel.com>
Date: Friday, August 11, 2017 at 8:52 AM
To: "dev@openvswitch.org" <dev@openvswitch.org>
Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for	recirc packets

    When OVS is configured as a firewall, with thousands of active
    concurrent connections, the EMC gets quickly saturated and may
    come under heavy thrashing for the reason that original and
    recirculated packets keep overwriting the existing active EMC
    entries due to its limited size (8k).


The recirculated packet could have been modified, in which case, maybe we
still want to do the emc lookup/insert ?

    
    diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
    index bea1c3f..8f6b96b 100644
    --- a/lib/dpif-netdev.c
    +++ b/lib/dpif-netdev.c
    @@ -4663,6 +4663,9 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
         packet_batch_per_flow_update(batch, pkt, mf);
     }
     
    +/* Threshold to skip EMC for recirculated packets. */
    +#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000
    +
     /* Try to process all ('cnt') the 'packets' using only the exact match cache
      * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
      * miniflow is copied into 'keys' and the packet pointer is moved at the
    @@ -4714,8 +4717,36 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
             key->len = 0; /* Not computed yet. */
             key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
     
    -        /* If EMC is disabled skip emc_lookup */
    -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
    +        /*
    +         * EMC lookup is skipped when one or both of the following
    +         * two cases occurs:
    +         *
    +         *    - EMC is disabled.  This is detected from cur_min.
    +         *
    +         *    - The EMC occupancy exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD and
    +         *      the packet to be classified is being recirculated.  When this
    +         *      happens also EMC insertions are skipped for recirculated
    +         *      packets.  So that EMC is used just to store entries which
    +         *      are hit from the 'original' packets.  This way the EMC
    +         *      thrashing is mitigated with a benefit on performance.
    +         */
    +        if (OVS_LIKELY(cur_min)) {
    +            if (!md_is_valid) {
    +                flow = emc_lookup(flow_cache, key);
    +            } else {
    +                /* Recirculated packet. */
    +                if (flow_cache->n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) {
    +                    /* EMC occupancy is over the threshold.  We skip EMC
    +                     * lookup for recirculated packets. */
    +                    flow = NULL;
    +                } else {
    +                    flow = emc_lookup(flow_cache, key);
    +                }
    +            }
    +        } else {
    +            flow = NULL;
    +        }
    +
             if (OVS_LIKELY(flow)) {
                 dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                         n_batches);
    @@ -4800,7 +4831,20 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                                                  add_actions->size);
             }
             ovs_mutex_unlock(&pmd->flow_mutex);
    -        emc_probabilistic_insert(pmd, key, netdev_flow);
    +        /* EMC insertion can be skipped by a probabilistic criteria or
    +         * - in case of recirculated packets - depending on the number of
    +         * EMC entries. */
    +        if (!packet->md.recirc_id) {
    +            emc_probabilistic_insert(pmd, key, netdev_flow);
    +        } else {
    +            /* Recirculated packets.  When EMC occupancy goes over
    +             * a threshold we avoid inserting new entries. */
    +            if (!(pmd->flow_cache.n_entries &
    +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
    +                /* Still under the threshold. */
    +                emc_probabilistic_insert(pmd, key, netdev_flow);
    +            }
    +        }
         }
     }
     
    @@ -4893,7 +4937,20 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
     
             flow = dp_netdev_flow_cast(rules[i]);
     
    -        emc_probabilistic_insert(pmd, &keys[i], flow);
    +        /* EMC insertion can be skipped by a probabilistic criteria or
    +         * - in case of recirculated packets - depending on the number of
    +         * EMC entries. */
    +        if (!packet->md.recirc_id) {
    +            emc_probabilistic_insert(pmd, &keys[i], flow);
    +        } else {
    +            /* Recirculated packets.  When EMC occupancy goes over
    +             * a threshold we avoid inserting new entries. */
    +            if (!(pmd->flow_cache.n_entries &
    +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
    +                /* Still under the threshold. */
    +                emc_probabilistic_insert(pmd, &keys[i], flow);
    +            }
    +        }
             dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
         }
     
    -- 
    2.4.11
    
Fischetti, Antonio Aug. 15, 2017, 1:55 p.m. | #2
> -----Original Message-----
> From: Darrell Ball [mailto:dball@vmware.com]
> Sent: Monday, August 14, 2017 7:27 AM
> To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for
> recirc packets
> 
> 
> 
> -----Original Message-----
> From: <ovs-dev-bounces@openvswitch.org> on behalf of
> "antonio.fischetti@intel.com" <antonio.fischetti@intel.com>
> Date: Friday, August 11, 2017 at 8:52 AM
> To: "dev@openvswitch.org" <dev@openvswitch.org>
> Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for
> 	recirc packets
> 
>     When OVS is configured as a firewall, with thousands of active
>     concurrent connections, the EMC gets quickly saturated and may
>     come under heavy thrashing for the reason that original and
>     recirculated packets keep overwriting the existing active EMC
>     entries due to its limited size (8k).
> 
> 
> The recirculated packet could have been modified, in which case, maybe we
> still want to do the emc lookup/insert ?

[Antonio]
In my opinion we should still skip the emc anyway, because the purpose is
to mitigate thrashing when the emc is full. So any recirculated packet
should be classified at the dpcls/ofproto layers.
Am I missing something in your question?

We can expect that a modified recirc pkt, like any other recirculated
pkt, could result in a miss when the emc is full.
We would then do an emc insertion that is likely to overwrite some
active entry. In turn, this new entry could itself be overwritten,
due to the shortage of locations, even before it is hit again.
This proposal mitigates the thrashing by reserving emc usage for
original packets only.
That way a limited resource like the emc can hopefully be used more
efficiently, especially when there is more than 1 recirculation.
I guess that adding an exception for modified recirc pkts could also
reduce the throughput a bit, as we would have to add another if
statement inside emc_processing.
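
The policy argued for above can be condensed into a single predicate. This
is only a sketch of the decision, not code from the patch:
emc_usable_for_packet() and its parameters are hypothetical, and the real
change spreads the same checks across emc_processing(),
handle_packet_upcall() and fast_path_processing().

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask from the patch: non-zero result means 4096 or more EMC entries. */
#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000u

/* Hypothetical predicate: may this packet use the EMC (lookup and insert)?
 *   emc_enabled   false when the EMC is disabled (cur_min == 0 in the patch)
 *   recirculated  true for a recirculated packet
 *   n_entries     current number of EMC entries */
bool
emc_usable_for_packet(bool emc_enabled, bool recirculated, uint32_t n_entries)
{
    if (!emc_enabled) {
        return false;
    }
    if (!recirculated) {
        /* Original packets use the EMC whenever it is enabled. */
        return true;
    }
    /* Recirculated packets use the EMC only below the threshold. */
    return (n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) == 0;
}
```

Note that there is no branch for "modified" recirculated packets: adding one
would be the extra if statement mentioned above.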


Darrell Ball Aug. 16, 2017, 8:08 a.m. | #3
-----Original Message-----
From: "Fischetti, Antonio" <antonio.fischetti@intel.com>

Date: Tuesday, August 15, 2017 at 6:55 AM
To: Darrell Ball <dball@vmware.com>, "dev@openvswitch.org" <dev@openvswitch.org>
Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets

    
    
    > -----Original Message-----
    > From: Darrell Ball [mailto:dball@vmware.com]
    > Sent: Monday, August 14, 2017 7:27 AM
    > To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org
    > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for
    > recirc packets
    > 
    > -----Original Message-----
    > From: <ovs-dev-bounces@openvswitch.org> on behalf of
    > "antonio.fischetti@intel.com" <antonio.fischetti@intel.com>
    > Date: Friday, August 11, 2017 at 8:52 AM
    > To: "dev@openvswitch.org" <dev@openvswitch.org>
    > Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for
    > 	recirc packets
    > 
    >     When OVS is configured as a firewall, with thousands of active
    >     concurrent connections, the EMC gets quickly saturated and may
    >     come under heavy thrashing for the reason that original and
    >     recirculated packets keep overwriting the existing active EMC
    >     entries due to its limited size (8k).
    > 
    > The recirculated packet could have been modified, in which case, maybe we
    > still want to do the emc lookup/insert ?

    
    [Antonio] 
    IMPO I'd say we should still skip emc anyway, because the purpose is to 
    mitigate thrashing when emc is full. So any recirculated packet should
    be classified at the dpcls/ofproto layers.
    I don't know if I'm missing something from your question?
    
    We can expect that a recirc pkt that has been modified - similarly to all 
    other recirculated pkts - could result in a miss when emc is full. 
    Later we should do an emc insertion that is likely to overwrite some 
    active entry. And recursively, this new insertion itself could be 
    overwritten - due to the shortage of locations - even before it is hit 
    again. This proposal is to mitigate the thrashing with the criteria of 
    reserving emc usage to original packets only. 
    So a limited resource like emc hopefully could be used more efficiently, 
    especially when there is more than 1 recirculation.
    I guess that adding an exception for modified recirc pkts could also 
    drop a bit the throughput as we should add another if statement inside 
    emc_processing.

[Darrell]
I can drop the edited-packet case, as my concern was really more general.
The concern is that recirculated packets should still be forwarded quickly if possible,
and using the emc should help with that. The first time through, the emc is used for the packet; the second
time through, the emc is not used, so it is slower. But possibly the argument could be made that since the packet is recirculated,
it is already slower, in which case maybe a penalty for recirculated packets is reasonable.
Instead of a simple black-and-white 50% cutoff, maybe a penalty to the insertion probability could be used?
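
One way to picture such a penalty, as a rough sketch only. It assumes that
insertion is decided by comparing a random 32-bit value against a threshold,
and that the EMC holds 8192 entries. The linear scaling and both function
names are illustrative choices made here; nothing in the thread specifies
them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed EMC capacity (the 8k from the commit message). */
#define EMC_ENTRIES 8192u

/* Hypothetical: scale the insertion threshold for recirculated packets by
 * the fraction of the EMC that is still free.  Original packets keep the
 * unmodified threshold. */
uint32_t
emc_insert_min_penalized(uint32_t base_min, bool recirculated,
                         uint32_t n_entries)
{
    assert(n_entries <= EMC_ENTRIES);
    if (!recirculated) {
        return base_min;
    }
    /* 64-bit intermediate so the multiplication cannot overflow. */
    return (uint32_t) ((uint64_t) base_min * (EMC_ENTRIES - n_entries)
                       / EMC_ENTRIES);
}

/* Hypothetical: insert only when the threshold is non-zero and the random
 * draw falls at or below it. */
bool
emc_should_insert(uint32_t random_value, uint32_t insert_min)
{
    return insert_min != 0 && random_value <= insert_min;
}
```

With this shape a recirculated packet is penalized gradually: the threshold
is halved at 50% occupancy and reaches zero only when the cache is full,
instead of dropping to zero at a fixed cutoff.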

    
    > 

    > 

    >     This thrashing causes the EMC to be less efficient than the dcpls

    >     in terms of lookups and insertions.

    > 

    >     This patch allows to use the EMC efficiently by allowing only

    >     the 'original' packets to hit EMC. All recirculated packets are

    >     sent to the classifier directly.

    >     An empirical threshold EMC_RECIRCT_NO_INSERT_THRESHOLD - of 50% -

    >     for EMC occupancy is set to trigger this logic. By doing so when

    >     EMC utilization exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD:

    >      - EMC Insertions are allowed just for original packets.

    >        EMC insertion and look up are skipped for recirculated packets.

    >      - Recirculated packets are sent to the classifier.

    > 

    >     This patch is based on patch

    >     "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" at:

    >     https://urldefense.proofpoint.com/v2/url?u=https-

    > 3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-

    > 2DJanuary_327570.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-

    > uZnsw&m=NHY06RD-Bcweizxd86m6hcsLPKpe7a4WVSyh9aNZQlo&s=-

    > PhWyltJ71UipVzd1D0H0I9k4uSTLdCJ_zanXxHd7fo&e=

    > 

    >     CC: Jan Scheurich <jan.scheurich@ericsson.com>

    >     Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>

    >     Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>

    >     Co-authored-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>

    >     ---

    >     Connection Tracker testbench set up with

    > 

    >      table=0, priority=1 actions=drop

    >      table=0, priority=10,arp actions=NORMAL

    >      table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)

    >      table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2

    >      table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

    >      table=1, ct_state=+new+trk,ip,in_port=2 actions=drop

    >      table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

    > 

    >     2 PMDs, 3 Tx queues.

    > 

    >     I measured packet Rx rate (regardless of packet loss). Bidirectional

    >     test with 64B UDP packets.

    >     Each row is a test with a different number of traffic streams. The traffic

    >     generator is set so that each stream establishes one UDP connection.

    >     Mpps columns reports the Rx rates on the 2 sides.

    > 

    >     I set up the generator to loop on the dest IP addr on one side,

    >     and loop instead on the source IP addr on the other side.

    > 

    >     For example to generate 10 different flows, I was sending to phy port #1

    >     UDP, IPsrc:10.10.10.10, IPdest: 20.20.20.[20-29], PortSrc: 63, PortDest: 63

    > 

    >     Instead to phy port #2 (source and dest IPs are now swapped):

    >     UDP, IPsrc: 20.20.20.[20-29], IPdest: 10.10.10.10, PortSrc: 63, PortDest:

    > 63

    > 

    >     I saw the following performance improvement.

    > 

    >     Original OvS-DPDK means at Commit ID:

    >       6b1babacc3ca0488e07596bf822fe356c9bab646

    > 

    >               +----------------------+-----------------------+

    >               |  Original OvS-DPDK   |   Original OvS-DPDK   |

    >               |                      |    + this patch       |

    >      ---------+------------+---------+------------+----------+

    >       Traffic |     Rx     |   EMC   |     Rx     |   EMC    |

    >       Streams |   [Mpps]   | entries |   [Mpps]   | entries  |

    >      ---------+------------+---------+------------+----------+

    >          100  | 2.43, 2.49 |   200   | 2.55, 2.57 |   201    |

    >        1,000  | 2.01, 2.02 |  2007   | 2.12, 2.12 |  2006    |

    >        2,000  | 1.93, 1.95 |  3868   | 1.98, 1.96 |  3884    |

    >        3,000  | 1.87, 1.91 |  5086   | 1.97, 1.97 |  4757    |

    >        4,000  | 1.83, 1.82 |  6173   | 1.94, 1.93 |  5280    |

    >       10,000  | 1.67, 1.69 |  7826   | 1.82, 1.81 |  7090    |

    >       30,000  | 1.57, 1.59 |  8192   | 1.66, 1.67 |  8192    |

    >      ---------+------------+---------+------------+----------+

    > 

    >     This test setup implies 1 recirculation on each received packet.

    >     We didn't check this patch in a test scenario where more than 1

    >     recirculation is occurring per packet.

    >     ---

    >      lib/dpif-netdev.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++----

    >      1 file changed, 61 insertions(+), 4 deletions(-)

    > 

    >     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

    >     index bea1c3f..8f6b96b 100644

    >     --- a/lib/dpif-netdev.c

    >     +++ b/lib/dpif-netdev.c

    >     @@ -4663,6 +4663,9 @@ dp_netdev_queue_batches(struct dp_packet *pkt,

    >          packet_batch_per_flow_update(batch, pkt, mf);

    >      }

    > 

    >     +/* Threshold to skip EMC for recirculated packets. */

    >     +#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000

    >     +

    >      /* Try to process all ('cnt') the 'packets' using only the exact match cache

    >       * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the

    >       * miniflow is copied into 'keys' and the packet pointer is moved at the

    >     @@ -4714,8 +4717,36 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,

    >              key->len = 0; /* Not computed yet. */

    >              key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);

    > 

    >     -        /* If EMC is disabled skip emc_lookup */

    >     -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);

    >     +        /*

    >     +         * EMC lookup is skipped when one or both of the following

    >     +         * two cases occurs:

    >     +         *

    >     +         *    - EMC is disabled.  This is detected from cur_min.

    >     +         *

    >     +         *    - The EMC occupancy exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD and

    >     +         *      the packet to be classified is being recirculated.  When this

    >     +         *      happens also EMC insertions are skipped for recirculated

    >     +         *      packets.  So that EMC is used just to store entries which

    >     +         *      are hit from the 'original' packets.  This way the EMC

    >     +         *      thrashing is mitigated with a benefit on performance.

    >     +         */

    >     +        if (OVS_LIKELY(cur_min)) {

    >     +            if (!md_is_valid) {

    >     +                flow = emc_lookup(flow_cache, key);

    >     +            } else {

    >     +                /* Recirculated packet. */

    >     +                if (flow_cache->n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) {

    >     +                    /* EMC occupancy is over the threshold.  We skip EMC

    >     +                     * lookup for recirculated packets. */

    >     +                    flow = NULL;

    >     +                } else {

    >     +                    flow = emc_lookup(flow_cache, key);

    >     +                }

    >     +            }

    >     +        } else {

    >     +            flow = NULL;

    >     +        }

    >     +

    >              if (OVS_LIKELY(flow)) {

    >                  dp_netdev_queue_batches(packet, flow, &key->mf, batches,

    >                                          n_batches);

    >     @@ -4800,7 +4831,20 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,

    >                                                   add_actions->size);

    >              }

    >              ovs_mutex_unlock(&pmd->flow_mutex);

    >     -        emc_probabilistic_insert(pmd, key, netdev_flow);

    >     +        /* EMC insertion can be skipped by a probabilistic criteria or

    >     +         * - in case of recirculated packets - depending on the number of

    >     +         * EMC entries. */

    >     +        if (!packet->md.recirc_id) {

    >     +            emc_probabilistic_insert(pmd, key, netdev_flow);

    >     +        } else {

    >     +            /* Recirculated packets.  When EMC occupancy goes over

    >     +             * a threshold we avoid inserting new entries. */

    >     +            if (!(pmd->flow_cache.n_entries &

    >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {

    >     +                /* Still under the threshold. */

    >     +                emc_probabilistic_insert(pmd, key, netdev_flow);

    >     +            }

    >     +        }

    >          }

    >      }

    > 

    >     @@ -4893,7 +4937,20 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,

    > 

    >              flow = dp_netdev_flow_cast(rules[i]);

    > 

    >     -        emc_probabilistic_insert(pmd, &keys[i], flow);

    >     +        /* EMC insertion can be skipped by a probabilistic criteria or

    >     +         * - in case of recirculated packets - depending on the number of

    >     +         * EMC entries. */

    >     +        if (!packet->md.recirc_id) {

    >     +            emc_probabilistic_insert(pmd, &keys[i], flow);

    >     +        } else {

    >     +            /* Recirculated packets.  When EMC occupancy goes over

    >     +             * a threshold we avoid inserting new entries. */

    >     +            if (!(pmd->flow_cache.n_entries &

    >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {

    >     +                /* Still under the threshold. */

    >     +                emc_probabilistic_insert(pmd, &keys[i], flow);

    >     +            }

    >     +        }

    >              dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);

    >          }

    > 

    >     --

    >     2.4.11

    > 

Fischetti, Antonio Aug. 16, 2017, 12:42 p.m. | #4
> -----Original Message-----

> From: Darrell Ball [mailto:dball@vmware.com]

> Sent: Wednesday, August 16, 2017 9:09 AM

> To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org

> Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for

> recirc packets

> 

> 

> 

> -----Original Message-----

> From: "Fischetti, Antonio" <antonio.fischetti@intel.com>

> Date: Tuesday, August 15, 2017 at 6:55 AM

> To: Darrell Ball <dball@vmware.com>, "dev@openvswitch.org"

> <dev@openvswitch.org>

> Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for

> recirc packets

> 

> 

> 

>     > -----Original Message-----

>     > From: Darrell Ball [mailto:dball@vmware.com]

>     > Sent: Monday, August 14, 2017 7:27 AM

>     > To: Fischetti, Antonio <antonio.fischetti@intel.com>; dev@openvswitch.org

>     > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert

> for

>     > recirc packets

>     >

>     >

>     >

>     > -----Original Message-----

>     > From: <ovs-dev-bounces@openvswitch.org> on behalf of

>     > "antonio.fischetti@intel.com" <antonio.fischetti@intel.com>

>     > Date: Friday, August 11, 2017 at 8:52 AM

>     > To: "dev@openvswitch.org" <dev@openvswitch.org>

>     > Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for

>     > 	recirc packets

>     >

>     >     When OVS is configured as a firewall, with thousands of active

>     >     concurrent connections, the EMC gets quickly saturated and may

>     >     come under heavy thrashing for the reason that original and

>     >     recirculated packets keep overwriting the existing active EMC

>     >     entries due to its limited size (8k).

>     >

>     >

>     > The recirculated packet could have been modified, in which case, maybe we

>     > still want to do the emc lookup/insert ?

> 

>     [Antonio]

>     IMPO I'd say we should still skip emc anyway, because the purpose is to

>     mitigate thrashing when emc is full. So any recirculated packet should

>     be classified at the dpcls/ofproto layers.

>     I don't know if I'm missing something from your question?

> 

>     We can expect that a recirc pkt that has been modified - similarly to all

>     other recirculated pkts - could result in a miss when emc is full.

>     Later we should do an emc insertion that is likely to overwrite some

>     active entry. And recursively, this new insertion itself could be

>     overwritten - due to the shortage of locations - even before it is hit

>     again. This proposal is to mitigate the thrashing with the criteria of

>     reserving emc usage to original packets only.

>     So a limited resource like emc hopefully could be used more efficiently,

>     especially when there is more than 1 recirculation.

>     I guess that adding an exception for modified recirc pkts could also

>     reduce the throughput a bit, as we would have to add another if statement

>     inside emc_processing.

> 

> [Darrell]

> I can drop the edited packet case as my concern was really more general.

> The concern is that recirculated packets should still be forwarded quickly if

> possible, and using emc should help that. The first time through, emc is used

> for the packet and then the second time through, emc is not used, so it is

> slower. But, possibly the argument could be made that since it is recirculated,

> it is already slower, in which case, maybe a penalty for recirculated packets

> is reasonable.


[Antonio]
Agree. Beyond that, when the emc is congested - eg a firewall with
say 6,000 connections - there are a lot of overwrites: many lookups
fail and the new insertions just overwrite active flows. This keeps the
lookup miss rate high, and the continuous overwrites on insertion become
pure overhead. So in this case there's a penalty both for the original
(ie the 1st time through) and for the recirculated packets.
With 6,000 flows and 1 recirculation we would need at least 12,000
entries, which exceeds the 8k emc size. So one strategy to reduce
thrashing could be to restrict emc usage to original packets only. The
trade-off is that recirculated packets are slower, but the overall
effect should be a benefit.
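
The sizing argument above can be checked with a few lines of C. This is only a sketch: the helper names and the EMC_ENTRIES_SKETCH constant are made up for illustration (8192 follows the "8k" figure from the commit message), and it assumes each flow occupies one emc entry per datapath pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: 8192 matches the "8k" EMC size from the thread. */
#define EMC_ENTRIES_SKETCH 8192u

/* Entries needed if each flow occupies one EMC entry per datapath pass:
 * one for the original packet plus one per recirculation. */
static uint32_t
emc_entries_needed(uint32_t n_flows, uint32_t n_recirc)
{
    assert(n_recirc < 16);   /* Keep the illustrative product small. */
    return n_flows * (1u + n_recirc);
}

/* True if all passes of all flows could be cached at the same time. */
static bool
emc_fits(uint32_t n_flows, uint32_t n_recirc)
{
    return emc_entries_needed(n_flows, n_recirc) <= EMC_ENTRIES_SKETCH;
}
```

With these assumptions, emc_fits(6000, 1) is false (12,000 entries needed), while emc_fits(6000, 0) is true, which is the case where only original packets use the emc.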


> Instead of having a simple 50% black and white cutoff, maybe a penalty to the

> insertion probability could be used ?


[Antonio]
Yes, at the beginning I was considering this solution. I then preferred
the current one because it allows skipping not only insertions but also
lookups, which helps especially when the RSS hash must be computed in
software.

The threshold check - as it happens inside emc_processing -
is done with an '&' operation so as to use as few cpu cycles as possible.
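
As a rough illustration of the '&' check, the sketch below isolates it in a small helper and mirrors the branch structure the patch adds to emc_processing. The helper names and EMC_ENTRIES_SKETCH are hypothetical; only the mask value is taken from the patch (a 'u' suffix is added here, the value is unchanged).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EMC_ENTRIES_SKETCH 8192u   /* Hypothetical name; "8k" per the thread. */
#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000u   /* Mask from the patch. */

/* Nonzero exactly when any bit from bit 12 upward is set, i.e. when
 * n_entries >= 4096, which is half of 8192. */
static bool
emc_over_threshold(uint32_t n_entries)
{
    return (n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) != 0;
}

/* Hypothetical helper mirroring the lookup decision in the patch:
 * no lookup if the EMC is disabled, and no lookup for recirculated
 * packets once the occupancy mask test fires. */
static bool
emc_lookup_allowed(bool emc_enabled, bool recirculated, uint32_t n_entries)
{
    assert(n_entries <= EMC_ENTRIES_SKETCH);
    if (!emc_enabled) {
        return false;
    }
    return !recirculated || !emc_over_threshold(n_entries);
}
```

Note that the mask test fires as soon as n_entries reaches 4096, so the cutoff is "at or above 50%" rather than strictly above it.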


Jan Scheurich Aug. 16, 2017, 4:23 p.m. | #5
Hi, 

I agree that in the event of EMC overload it is beneficial to reduce the number of EMC insertions and lookups, as they just generate overhead and degrade overall throughput. At the same time we want to keep as much of the EMC acceleration as possible for the fraction of traffic that benefits most from the EMC.

For EMC insertion we have already done this earlier by introducing probabilistic EMC insertion, which greatly reduces the costly effect of EMC thrashing. But we didn't touch the lookup part. How should we select the packets (or rather packet datapath traversals) for which to perform lookup?

There are several proposals in the air: do it only for the first pass and not for recirculated packets; do it only for RSS hash values below a (dynamic) threshold; possibly others.

For EMC insertion we consciously settled on a random selection as the datapath has no a priori insight into which flows are better candidates than others and big flows that benefit most have a higher chance of getting cached.

Is there a reason to assume that a deterministic selection based on a non-random criterion like the recirculation count will on average (over deployments and applications) give better performance than a random selection?

I don't believe so. For example, the number of "EMC flows" in each pass through the datapath can differ hugely: 1 GRE tunnel flow in first pass (from phy port), 100K tenant flows after tunnel decapsulation. Or 100K tenant flows in first pass (from VM) but 1 flow after NSH encapsulation in second pass.

I believe a random selection with dynamically adapted probability is the best we can do without a priori knowledge about the traffic patterns and pipeline organization.

The RSS hash threshold method looks like the only pseudo-random criterion we can use that produces a consistent result for every packet of a flow and does not require more information. Of course, elephant flows with an unlucky hash value might never get to use the EMC, but we have that risk with any stateless selection scheme.
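
A minimal sketch of such a hash-threshold selection is shown below. Both function names are hypothetical and nothing here is taken from the OVS code; it only assumes a 32-bit RSS hash and a lookup probability expressed in percent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Map a lookup probability in percent to a threshold in the 32-bit hash
 * space.  A 64-bit result lets 100% cover every possible hash value. */
static uint64_t
emc_hash_threshold(unsigned pct)
{
    assert(pct <= 100);
    return (UINT64_C(1) << 32) * pct / 100;
}

/* Stateless selection: every packet of a flow carries the same RSS hash,
 * so the decision is identical for all packets of that flow. */
static bool
emc_hash_selected(uint32_t rss_hash, uint64_t threshold)
{
    return rss_hash < threshold;
}
```

With a 50% threshold, a flow whose hash is 0x7fffffff always uses the EMC and a flow whose hash is 0x80000000 never does, which is the "unlucky elephant flow" risk mentioned above.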

The new thing required will be the dynamic adjustment of lookup probability to the EMC fill level and/or hit ratio. Any ideas for that? I guess we'd need a scheme that periodically increases the probability again to probe for changed traffic patterns. 
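
One possible shape for such an adjustment, purely as a discussion aid, is a back-off with periodic probing. Everything in this sketch is an assumption: the struct, the function, and all constants are hypothetical and have not been tuned or measured.

```c
#include <assert.h>

/* All names and constants below are hypothetical, chosen for illustration. */
#define EMC_PROB_MIN_PCT      1u    /* Never disable lookups completely. */
#define EMC_PROB_MAX_PCT      100u
#define EMC_PROB_PROBE_STEP   10u   /* Additive increase on each probe. */
#define EMC_PROB_PROBE_PERIOD 8u    /* Probe every 8 update intervals. */
#define EMC_LOW_HIT_PCT       50u   /* Below this hit ratio, back off. */

struct emc_prob_ctl {
    unsigned pct;         /* Current lookup probability, in percent. */
    unsigned intervals;   /* Update intervals since the last probe. */
};

/* Called once per measurement interval with the observed EMC hit ratio.
 * Halves the probability while the hit ratio is low, and periodically
 * raises it again to probe for changed traffic patterns. */
static unsigned
emc_prob_update(struct emc_prob_ctl *ctl, unsigned hit_pct)
{
    assert(hit_pct <= 100);

    if (hit_pct < EMC_LOW_HIT_PCT) {
        ctl->pct /= 2;
        if (ctl->pct < EMC_PROB_MIN_PCT) {
            ctl->pct = EMC_PROB_MIN_PCT;
        }
    }
    if (++ctl->intervals >= EMC_PROB_PROBE_PERIOD) {
        ctl->intervals = 0;
        ctl->pct += EMC_PROB_PROBE_STEP;
        if (ctl->pct > EMC_PROB_MAX_PCT) {
            ctl->pct = EMC_PROB_MAX_PCT;
        }
    }
    return ctl->pct;
}
```

The returned percentage could then feed a stateless selector such as an RSS hash threshold; whether the hit ratio or the fill level is the better input is left open here.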

Once we have that, I think the same dynamic probability could also be used for probabilistic EMC insertion.

BR, Jan

> -----Original Message-----

> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

> bounces@openvswitch.org] On Behalf Of Fischetti, Antonio

> Sent: Wednesday, 16 August, 2017 14:42

> To: Darrell Ball <dball@vmware.com>; dev@openvswitch.org

> Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert

> for recirc packets

> 

> 

> > -----Original Message-----

> > From: Darrell Ball [mailto:dball@vmware.com]

> > Sent: Wednesday, August 16, 2017 9:09 AM

> > To: Fischetti, Antonio <antonio.fischetti@intel.com>;

> dev@openvswitch.org

> > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert

> for

> > recirc packets

> >

> >

> >

> > -----Original Message-----

> > From: "Fischetti, Antonio" <antonio.fischetti@intel.com>

> > Date: Tuesday, August 15, 2017 at 6:55 AM

> > To: Darrell Ball <dball@vmware.com>, "dev@openvswitch.org"

> > <dev@openvswitch.org>

> > Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert

> for

> > recirc packets

> >

> >

> >

> >     > -----Original Message-----

> >     > From: Darrell Ball [mailto:dball@vmware.com]

> >     > Sent: Monday, August 14, 2017 7:27 AM

> >     > To: Fischetti, Antonio <antonio.fischetti@intel.com>;

> dev@openvswitch.org

> >     > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC

> lookup/insert

> > for

> >     > recirc packets

> >     >

> >     >

> >     >

> >     > -----Original Message-----

> >     > From: <ovs-dev-bounces@openvswitch.org> on behalf of

> >     > "antonio.fischetti@intel.com" <antonio.fischetti@intel.com>

> >     > Date: Friday, August 11, 2017 at 8:52 AM

> >     > To: "dev@openvswitch.org" <dev@openvswitch.org>

> >     > Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC

> lookup/insert for

> >     > 	recirc packets

> >     >

> >     >     When OVS is configured as a firewall, with thousands of active

> >     >     concurrent connections, the EMC gets quicly saturated and may

> >     >     come under heavy thrashing for the reason that original and

> >     >     recirculated packets keep overwriting the existing active EMC

> >     >     entries due to its limited size (8k).

> >     >

> >     >

> >     > The recirculated packet could have been modified, in which case,

> maybe we

> >     > still want to do the emc lookup/insert ?

> >

> >     [Antonio]

> >     IMPO I'd say we should still skip emc anyway, because the purpose is

> to

> >     mitigate thrashing when emc is full. So any recirculated packet should

> >     be classified at the dpcls/ofproto layers.

> >     I don't know if I'm missing something from your question?

> >

> >     We can expect that a recirc pkt that has been modified - similarly to

> all

> >     other recirculated pkts - could result in a miss when emc is full.

> >     Later we should do an emc insertion that is likely to overwrite some

> >     active entry. And recursively, this new insertion itself could be

> >     overwritten - due to the shortage of locations - even before it is hit

> >     again. This proposal is to mitigate the thrashing with the criteria of

> >     reserving emc usage to original packets only.

> >     So a limited resource like emc hopefully could be used more

> efficiently,

> >     especially when there is more than 1 recirculation.

> >     I guess that adding an exception for modified recirc pkts could also

> >     drop a bit the throughtput as we should add another if statement

> inside

> >     emc_processing.

> >

> > [Darrell]

> > I’ll can drop the edited packet case as my concern was really more

> general.

> > The concern is that recirculated packets should still be forwarded quickly

> if

> > possible

> > and using emc should help that. The first time through, emc is used for

> the

> > packet and then the second

> > time through, emc is not used, so it is slower. But, possibly the argument

> > could be made that since it is recirculated,

> > it is already slower, in which case, maybe a penalty for recirculated

> packets

> > is reasonable.

> 

> [Antonio]

> Agree. Other than that, in case of an emc congestion - eg a firewall with

> say 6,000 connections - with a lot of overwrites, the effect could be that

> a lot of lookups will fail and the new insertions are just overwriting active

> flows. This keeps a high failure for lookups and the continuous overwrites

> for insertions become an overhead. So in this case there's a penalty

> as for the original (ie the 1st time through) as for the recirculated packets.

> With this approach we are considering that with 6,000 flows we would

> need at

> least 12,000 entries with 1 recirculation. So one strategy to reduce

> thrashing

> could be to restrict emc usage to original packets only. The counterpart is

> that recirculated packets are slower, but the overall effect should be a

> benefit.

> 

> 

> > Instead of having a simple 50% black and white cutoff, maybe a penalty

> to the

> > insertion probability could be used ?

> 

> [Antonio]

> Yes, at the beginning I was considering this solution. I then preferred

> the current one because it allows not only to skip insertions but also

> to skip lookups, especially when RSS hash must be computed in software.

> 

> The check of the threshold - as this is happening inside emc_processing -

> is done with an '&' operation so to use as less cpu cycles as possible.

> 

> 

Darrell Ball Aug. 16, 2017, 5:31 p.m. | #6
Something happened to your email – it is mostly blank lines; also inserted b/w lines belonging to same paragraph ?
I have a few clarifications about the other lines.


-----Original Message-----
From: Jan Scheurich <jan.scheurich@ericsson.com>

Date: Wednesday, August 16, 2017 at 9:23 AM
To: "Fischetti, Antonio" <antonio.fischetti@intel.com>, Darrell Ball <dball@vmware.com>, "dev@openvswitch.org" <dev@openvswitch.org>
Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets

    Hi, 
    
    
    
    I agree that in the event of EMC overload it is beneficial to reduce the number of EMC insertions and lookups as they just generate overhead and degrade overall throughput. At the same time we want to keep as much of the EMC acceleration as possible for a fraction of traffic that can benefit from EMC most.
    
    
    
    For EMC insertion we have already done this earlier by introducing probabilistic EMC insertion, which greatly reduces the costly effect of EMC thrashing. But we didn't touch the lookup part. How should we select the packets (or rather packet datapath traversals) for which to perform lookup?
    
    
    
    There are several proposals in the air: do it only for the first pass and not for recirculated packets; do it only for RSS hash values below a (dynamic) threshold; possibly others.
    
    
    
    For EMC insertion we consciously settled on a random selection as the datapath has no a priori insight into which flows are better candidates than others and big flows that benefit most have a higher chance of getting cached.
    
    
    
    Is there a reason to assume that a deterministic selection on some non-random criterion like the recirculation count will on average (over deployments and applications) give better performance than a random selection?
    
    
    
    I don't believe so. For example, the number of "EMC flows" in each pass through the datapath can differ hugely: 1 GRE tunnel flow in first pass (from phy port), 100K tenant flows after tunnel decapsulation. Or 100K tenant flows in first pass (from VM) but 1 flow after NSH encapsulation in second pass.
    
    
    
    I believe a random selection with dynamically adapted probability is the best we can do without a priori knowledge about the traffic patterns and pipeline organization.
    
    
    
    The RSS hash threshold method looks like the only pseudo-random criterion we can use that produces a consistent result for every packet of a flow and does not require more information. Of course elephant flows with an unlucky hash value might never get to use the EMC, but we have that risk with any stateless selection scheme.
    
[Darrell] It is probably something I know by another name, but JTBC, can you define the “RSS hash threshold method” ?     
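For reference, the method as outlined earlier in the quoted message (perform the EMC lookup only for RSS hash values below a dynamic threshold) could look roughly like the sketch below. The helper names and the percent-based interface are assumptions made for illustration; none of this is part of the patch or of OVS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: map the share of traversals that should use the
 * EMC (0..100 percent) to a threshold in the 32-bit RSS hash space.
 * The result is 64 bits wide so that 100% covers every possible hash. */
static uint64_t
emc_hash_threshold(unsigned int lookup_pct)
{
    if (lookup_pct > 100) {
        lookup_pct = 100;
    }
    return ((UINT64_C(1) << 32) * lookup_pct) / 100;
}

/* Hypothetical helper: true when a packet with this RSS hash should do
 * an EMC lookup.  All packets of a flow share the same hash, so the
 * decision is consistent per flow and needs no per-flow state. */
static bool
emc_lookup_selected(uint32_t rss_hash, uint64_t threshold)
{
    return (uint64_t) rss_hash < threshold;
}
```

If the hash is roughly uniform, a threshold of 50% selects about half of the flows. Which flows are selected depends only on their hash value, which is the "unlucky elephant flow" risk noted in the quoted message.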

    
    The new thing required will be the dynamic adjustment of lookup probability to the EMC fill level and/or hit ratio. 


[Darrell] Did you mean insertion probability rather than lookup probability ? 



    Any ideas for that? I guess we'd need a scheme that periodically increases the probability again to probe for changed traffic patterns.
    
    
    
    Once we have that, I think the same dynamic probability could also be used for probabilistic EMC insertion.
    
    
    
    BR, Jan
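One way the dynamic adjustment asked about above might be approached is sketched below. This is purely illustrative: the struct, the function, the halving policy, and the constants are all assumptions, and nothing like it exists in the patch. The idea is to lower the lookup share when the measured EMC hit ratio is poor and otherwise raise it by a small step, so that the datapath periodically probes for changed traffic patterns.

```c
#include <stdint.h>

#define LOOKUP_PCT_MIN 1u          /* Never stop probing entirely. */
#define LOOKUP_PCT_MAX 100u
#define LOOKUP_PCT_PROBE_STEP 5u   /* Additive increase per interval. */

/* Hypothetical per-PMD control state. */
struct emc_lookup_ctl {
    unsigned int lookup_pct;   /* Share of traversals doing EMC lookup. */
    uint64_t lookups;          /* EMC lookups in the current interval. */
    uint64_t hits;             /* EMC hits in the current interval. */
};

/* Hypothetical: run once per measurement interval.  If the hit ratio
 * fell below 'min_hit_pct', halve the lookup share; otherwise step it
 * up to probe whether more traffic can benefit from the EMC. */
static void
emc_lookup_ctl_adjust(struct emc_lookup_ctl *ctl, unsigned int min_hit_pct)
{
    if (ctl->lookups > 0
        && ctl->hits * 100 < ctl->lookups * min_hit_pct) {
        ctl->lookup_pct /= 2;
        if (ctl->lookup_pct < LOOKUP_PCT_MIN) {
            ctl->lookup_pct = LOOKUP_PCT_MIN;
        }
    } else {
        ctl->lookup_pct += LOOKUP_PCT_PROBE_STEP;
        if (ctl->lookup_pct > LOOKUP_PCT_MAX) {
            ctl->lookup_pct = LOOKUP_PCT_MAX;
        }
    }
    /* Start a fresh measurement interval. */
    ctl->lookups = 0;
    ctl->hits = 0;
}
```

The resulting percentage could feed either a hash threshold for lookups or the insertion probability. Whether a multiplicative decrease with additive increase is the right policy here is an open question that would need measurements.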
    
    
    
    > > Instead of having a simple 50% black and white cutoff, maybe a penalty

    
    > to the

    
    > > insertion probability could be used ?

    
    > 

    
    > [Antonio]

    
    > Yes, at the beginning I was considering this solution. I then preferred

    
    > the current one because it allows not only to skip insertions but also

    
    > to skip lookups, especially when RSS hash must be computed in software.

    
    > 

    
    > The check of the threshold - as this is happening inside emc_processing -

    
    > is done with an '&' operation so to use as less cpu cycles as possible.

    
    > 

    
    > 

    
    > >

    
    > >

    
    > >     >

    
    > >     >

    
    > >     >     This thrashing causes the EMC to be less efficient than the dcpls

    
    > >     >     in terms of lookups and insertions.

    
    > >     >

    
    > >     >     This patch allows to use the EMC efficiently by allowing only

    
    > >     >     the 'original' packets to hit EMC. All recirculated packets are

    
    > >     >     sent to the classifier directly.

    
    > >     >     An empirical threshold EMC_RECIRCT_NO_INSERT_THRESHOLD -

    
    > of 50% -

    
    > >     >     for EMC occupancy is set to trigger this logic. By doing so when

    
    > >     >     EMC utilization exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD:

    
    > >     >      - EMC Insertions are allowed just for original packets.

    
    > >     >        EMC insertion and look up are skipped for recirculated packets.

    
    > >     >      - Recirculated packets are sent to the classifier.

    
    > >     >

    
    > >     >     This patch is based on patch

    
    > >     >     "dpif-netdev: add EMC entry count and %full figure to pmd-stats-

    
    > show"

    
    > > at:

    
    > >     >     https://urldefense.proofpoint.com/v2/url?u=https-

    
    > >     > 3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-

    
    > >     >

    
    > >

    
    > 2DJanuary_327570.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BV

    
    > hFA09CGX7JQ5Ih-

    
    > >     > uZnsw&m=NHY06RD-

    
    > Bcweizxd86m6hcsLPKpe7a4WVSyh9aNZQlo&s=-

    
    > >     > PhWyltJ71UipVzd1D0H0I9k4uSTLdCJ_zanXxHd7fo&e=

    
    > >     >

    
    > >     >     CC: Jan Scheurich <jan.scheurich@ericsson.com>

    
    > >     >     Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>

    
    > >     >     Signed-off-by: Bhanuprakash Bodireddy

    
    > > <bhanuprakash.bodireddy@intel.com>

    
    > >     >     Co-authored-by: Bhanuprakash Bodireddy

    
    > > <bhanuprakash.bodireddy@intel.com>

    
    > >     >     ---

    
    > >     >     Connection Tracker testbench set up with

    
    > >     >

    
    > >     >      table=0, priority=1 actions=drop

    
    > >     >      table=0, priority=10,arp actions=NORMAL

    
    > >     >      table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)

    
    > >     >      table=1, ct_state=+new+trk,ip,in_port=1

    
    > actions=ct(commit),output:2

    
    > >     >      table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

    
    > >     >      table=1, ct_state=+new+trk,ip,in_port=2 actions=drop

    
    > >     >      table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

    
    > >     >

    
    > >     >     2 PMDs, 3 Tx queues.

    
    > >     >

    
    > >     >     I measured packet Rx rate (regardless of packet loss).

    
    > Bidirectional

    
    > >     >     test with 64B UDP packets.

    
    > >     >     Each row is a test with a different number of traffic streams. The

    
    > > traffic

    
    > >     >     generator is set so that each stream establishes one UDP

    
    > connection.

    
    > >     >     Mpps columns reports the Rx rates on the 2 sides.

    
    > >     >

    
    > >     >     I set up the generator to loop on the dest IP addr on one side,

    
    > >     >     and loop instead on the source IP addr on the other side.

    
    > >     >

    
    > >     >     For example to generate 10 different flows, I was sending to phy

    
    > port

    
    > > #1

    
    > >     >     UDP, IPsrc:10.10.10.10, IPdest: 20.20.20.[20-29], PortSrc: 63,

    
    > > PortDest: 63

    
    > >     >

    
    > >     >     Instead to phy port #2 (source and dest IPs are now swapped):

    
    > >     >     UDP, IPsrc: 20.20.20.[20-29], IPdest: 10.10.10.10, PortSrc: 63,

    
    > > PortDest:

    
    > >     > 63

    
    > >     >

    
    > >     >     I saw the following performance improvement.

    
    > >     >

    
    > >     >     Original OvS-DPDK means at Commit ID:

    
    > >     >       6b1babacc3ca0488e07596bf822fe356c9bab646

    
    > >     >

    
    > >     >               +----------------------+-----------------------+

    
    > >     >               |  Original OvS-DPDK   |   Original OvS-DPDK   |

    
    > >     >               |                      |    + this patch       |

    
    > >     >      ---------+------------+---------+------------+----------+

    
    > >     >       Traffic |     Rx     |   EMC   |     Rx     |   EMC    |

    
    > >     >       Streams |   [Mpps]   | entries |   [Mpps]   | entries  |

    
    > >     >      ---------+------------+---------+------------+----------+

    
    > >     >          100  | 2.43, 2.49 |   200   | 2.55, 2.57 |   201    |

    
    > >     >        1,000  | 2.01, 2.02 |  2007   | 2.12, 2.12 |  2006    |

    
    > >     >        2,000  | 1.93, 1.95 |  3868   | 1.98, 1.96 |  3884    |

    
    > >     >        3,000  | 1.87, 1.91 |  5086   | 1.97, 1.97 |  4757    |

    
    > >     >        4,000  | 1.83, 1.82 |  6173   | 1.94, 1.93 |  5280    |

    
    > >     >       10,000  | 1.67, 1.69 |  7826   | 1.82, 1.81 |  7090    |

    
    > >     >       30,000  | 1.57, 1.59 |  8192   | 1.66, 1.67 |  8192    |

    
    > >     >      ---------+------------+---------+------------+----------+

    
    > >     >

    
    > >     >     This test setup implies 1 recirculation on each received packet.

    
    > >     >     We didn't check this patch in a test scenario where more than 1

    
    > >     >     recirculation is occurring per packet.

    
    > >     >     ---

    
    > >     >      lib/dpif-netdev.c | 65

    
    > >     > +++++++++++++++++++++++++++++++++++++++++++++++++++----

    
    > >     >      1 file changed, 61 insertions(+), 4 deletions(-)

    
    > >     >

    
    > >     >     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

    
    > >     >     index bea1c3f..8f6b96b 100644

    
    > >     >     --- a/lib/dpif-netdev.c

    
    > >     >     +++ b/lib/dpif-netdev.c

    
    > >     >     @@ -4663,6 +4663,9 @@ dp_netdev_queue_batches(struct

    
    > dp_packet *pkt,

    
    > >     >          packet_batch_per_flow_update(batch, pkt, mf);

    
    > >     >      }

    
    > >     >

    
    > >     >     +/* Threshold to skip EMC for recirculated packets. */

    
    > >     >     +#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000

    
    > >     >     +

    
    > >     >      /* Try to process all ('cnt') the 'packets' using only the exact

    
    > > match

    
    > >     > cache

    
    > >     >       * 'pmd->flow_cache'. If a flow is not found for a packet

    
    > > 'packets[i]',

    
    > >     > the

    
    > >     >       * miniflow is copied into 'keys' and the packet pointer is moved

    
    > at

    
    > > the

    
    > >     >     @@ -4714,8 +4717,36 @@ emc_processing(struct

    
    > dp_netdev_pmd_thread

    
    > > *pmd,

    
    > >     >              key->len = 0; /* Not computed yet. */

    
    > >     >              key->hash = dpif_netdev_packet_get_rss_hash(packet, &key-

    
    > > >mf);

    
    > >     >

    
    > >     >     -        /* If EMC is disabled skip emc_lookup */

    
    > >     >     -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);

    
    > >     >     +        /*

    
    > >     >     +         * EMC lookup is skipped when one or both of the following

    
    > >     >     +         * two cases occurs:

    
    > >     >     +         *

    
    > >     >     +         *    - EMC is disabled.  This is detected from cur_min.

    
    > >     >     +         *

    
    > >     >     +         *    - The EMC occupancy exceeds

    
    > > EMC_RECIRCT_NO_INSERT_THRESHOLD

    
    > >     > and

    
    > >     >     +         *      the packet to be classified is being recirculated.

    
    > > When

    
    > >     > this

    
    > >     >     +         *      happens also EMC insertions are skipped for

    
    > > recirculated

    
    > >     >     +         *      packets.  So that EMC is used just to store entries

    
    > > which

    
    > >     >     +         *      are hit from the 'original' packets.  This way the

    
    > > EMC

    
    > >     >     +         *      thrashing is mitigated with a benefit on

    
    > > performance.

    
    > >     >     +         */

    
    > >     >     +        if (OVS_LIKELY(cur_min)) {

    
    > >     >     +            if (!md_is_valid) {

    
    > >     >     +                flow = emc_lookup(flow_cache, key);

    
    > >     >     +            } else {

    
    > >     >     +                /* Recirculated packet. */

    
    > >     >     +                if (flow_cache->n_entries &

    
    > >     > EMC_RECIRCT_NO_INSERT_THRESHOLD) {

    
    > >     >     +                    /* EMC occupancy is over the threshold.  We skip

    
    > > EMC

    
    > >     >     +                     * lookup for recirculated packets. */

    
    > >     >     +                    flow = NULL;

    
    > >     >     +                } else {

    
    > >     >     +                    flow = emc_lookup(flow_cache, key);

    
    > >     >     +                }

    
    > >     >     +            }

    
    > >     >     +        } else {

    
    > >     >     +            flow = NULL;

    
    > >     >     +        }

    
    > >     >     +

    
    > >     >              if (OVS_LIKELY(flow)) {

    
    > >     >                  dp_netdev_queue_batches(packet, flow, &key->mf,

    
    > batches,

    
    > >     >                                          n_batches);

    
    > >     >     @@ -4800,7 +4831,20 @@ handle_packet_upcall(struct

    
    > > dp_netdev_pmd_thread

    
    > >     > *pmd,

    
    > >     >                                                   add_actions->size);

    
    > >     >              }

    
    > >     >              ovs_mutex_unlock(&pmd->flow_mutex);

    
    > >     >     -        emc_probabilistic_insert(pmd, key, netdev_flow);

    
    > >     >     +        /* EMC insertion can be skipped by a probabilistic criteria

    
    > > or

    
    > >     >     +         * - in case of recirculated packets - depending on the

    
    > > number of

    
    > >     >     +         * EMC entries. */

    
    > >     >     +        if (!packet->md.recirc_id) {

    
    > >     >     +            emc_probabilistic_insert(pmd, key, netdev_flow);

    
    > >     >     +        } else {

    
    > >     >     +            /* Recirculated packets.  When EMC occupancy goes over

    
    > >     >     +             * a threshold we avoid inserting new entries. */

    
    > >     >     +            if (!(pmd->flow_cache.n_entries &

    
    > >     >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {

    
    > >     >     +                /* Still under the threshold. */

    
    > >     >     +                emc_probabilistic_insert(pmd, key, netdev_flow);

    
    > >     >     +            }

    
    > >     >     +        }

    
    > >     >          }

    
    > >     >      }

    
    > >     >

    
    > >     >     @@ -4893,7 +4937,20 @@ fast_path_processing(struct

    
    > > dp_netdev_pmd_thread

    
    > >     > *pmd,

    
    > >     >

    
    > >     >              flow = dp_netdev_flow_cast(rules[i]);

    
    > >     >

    
    > >     >     -        emc_probabilistic_insert(pmd, &keys[i], flow);

    
    > >     >     +        /* EMC insertion can be skipped by a probabilistic criteria

    
    > > or

    
    > >     >     +         * - in case of recirculated packets - depending on the

    
    > > number of

    
    > >     >     +         * EMC entries. */

    
    > >     >     +        if (!packet->md.recirc_id) {

    
    > >     >     +            emc_probabilistic_insert(pmd, &keys[i], flow);

    
    > >     >     +        } else {

    
    > >     >     +            /* Recirculated packets.  When EMC occupancy goes over

    
    > >     >     +             * a threshold we avoid inserting new entries. */

    
    > >     >     +            if (!(pmd->flow_cache.n_entries &

    
    > >     >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {

    
    > >     >     +                /* Still under the threshold. */

    
    > >     >     +                emc_probabilistic_insert(pmd, &keys[i], flow);

    
    > >     >     +            }

    
    > >     >     +        }

    
    > >     >              dp_netdev_queue_batches(packet, flow, &keys[i].mf,

    
    > batches,

    
    > >     > n_batches);

    
    > >     >          }

    
    > >     >

    
    > >     >     --

    
    > >     >     2.4.11

    
    > >     >

    
Darrell Ball Aug. 16, 2017, 7:05 p.m. | #7
On Wed, Aug 16, 2017 at 10:31 AM, Darrell Ball <dball@vmware.com> wrote:

> Something happened to your email – it is mostly blank lines; also inserted
> b/w lines belonging to same paragraph ?
>

It looks like the problem is related to the receiving email client and
possibly some special formatting.
This looks fine in gmail.



> I have a few clarifications about the other lines.
>







>
>
> -----Original Message-----
> From: Jan Scheurich <jan.scheurich@ericsson.com>
> Date: Wednesday, August 16, 2017 at 9:23 AM
> To: "Fischetti, Antonio" <antonio.fischetti@intel.com>, Darrell Ball <
> dball@vmware.com>, "dev@openvswitch.org" <dev@openvswitch.org>
> Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert
> for recirc packets
>
>     Hi,
>
>
>
>     I agree that in the event of EMC overload it is beneficial to reduce
> the number of EMC insertions and lookups as they just generate overhead and
> degrade overall throughput. At the same time we want to keep as much of the
> EMC acceleration as possible for a fraction of traffic that can benefit
> from EMC most.
>
>
>
>     For EMC insertion we have already done earlier this by introducing
> probabilistic EMC insertion, which greatly reduces the costly effect of EMC
> thrashing. But we didn't touch the lookup part. How should we select the
> packets (or rather packet datapath traversals) for which to perform lookup?
>
>
>
>     There are several proposals in the air: Only do it for the first pass,
> not for recirculated packets, only do it for RSS hash values below a
> (dynamic) threshold, possibly others.
>
>
>
>     For EMC insertion we consciously settled on a random selection as the
> datapath has no a priori insight into which flows are better candidates
> than others and big flows that benefit most have a higher chance of getting
> cached.
>
>
>
>     Is there a reason to assume that a deterministic selection on some
> non-random criteria like the recirculation count will on average (over
> deployments and applications) give a better performance than a random
> selection?
>
>
>
>     I don't believe so. For example, the number of "EMC flows" in each
> pass through the datapath can differ hugely: 1 GRE tunnel flow in first
> pass (from phy port), 100K tenant flows after tunnel decapsulation. Or 100K
> tenant flows in first pass (from VM) but 1 flow after NSH encapsulation in
> second pass.
>
>
>
>     I believe a random selection with dynamically adapted probability is
> the best we can do without a priori knowledge about the traffic patterns
> and pipeline organization.
>
>
>
>     The RSS hash threshold method looks like the only pseudo-random
> criterion that we can use that produces consistent result for every packet
> of a flow and does require more information. Of course elephant flows with
> an unlucky hash value might never get to use the EMC, but that risk we have
> with any stateless selection scheme.
>
> [Darrell] It is probably something I know by another name, but JTBC, can
> you define the “RSS hash threshold method” ?
>
>
>     The new thing required will be the dynamic adjustment of lookup
> probability to the EMC fill level and/or hit ratio.
>
>
> [Darrell] Did you mean insertion probability rather than lookup
> probability ?
>
>
>
> Any ideas for that? I guess we'd need a scheme that periodically increases
> the probability again to probe for changed traffic patterns.
>
>
>
>     Once we have that I think the same dynamic probability could be
> possible to use also for probabilistic EMC insertion.
>
>
>
>     BR, Jan
>
>
>
>     > -----Original Message-----
>
>     > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-
>
>     > bounces@openvswitch.org] On Behalf Of Fischetti, Antonio
>
>     > Sent: Wednesday, 16 August, 2017 14:42
>
>     > To: Darrell Ball <dball@vmware.com>; dev@openvswitch.org
>
>     > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC
> lookup/insert
>
>     > for recirc packets
>
>     >
>
>     >
>
>     > > -----Original Message-----
>
>     > > From: Darrell Ball [mailto:dball@vmware.com]
>
>     > > Sent: Wednesday, August 16, 2017 9:09 AM
>
>     > > To: Fischetti, Antonio <antonio.fischetti@intel.com>;
>
>     > dev@openvswitch.org
>
>     > > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC
> lookup/insert
>
>     > for
>
>     > > recirc packets
>
>     > >
>
>     > >
>
>     > >
>
>     > > -----Original Message-----
>
>     > > From: "Fischetti, Antonio" <antonio.fischetti@intel.com>
>
>     > > Date: Tuesday, August 15, 2017 at 6:55 AM
>
>     > > To: Darrell Ball <dball@vmware.com>, "dev@openvswitch.org"
>
>     > > <dev@openvswitch.org>
>
>     > > Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC
> lookup/insert
>
>     > for
>
>     > > recirc packets
>
>     > >
>
>     > >
>
>     > >
>
>     > >     > -----Original Message-----
>
>     > >     > From: Darrell Ball [mailto:dball@vmware.com]
>
>     > >     > Sent: Monday, August 14, 2017 7:27 AM
>
>     > >     > To: Fischetti, Antonio <antonio.fischetti@intel.com>;
>
>     > dev@openvswitch.org
>
>     > >     > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC
>
>     > lookup/insert
>
>     > > for
>
>     > >     > recirc packets
>
>     > >     >
>
>     > >     >
>
>     > >     >
>
>     > >     > -----Original Message-----
>
>     > >     > From: <ovs-dev-bounces@openvswitch.org> on behalf of
>
>     > >     > "antonio.fischetti@intel.com" <antonio.fischetti@intel.com>
>
>     > >     > Date: Friday, August 11, 2017 at 8:52 AM
>
>     > >     > To: "dev@openvswitch.org" <dev@openvswitch.org>
>
>     > >     > Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC
>
>     > lookup/insert for
>
>     > >     >   recirc packets
>
>     > >     >
>
>     > >     >     When OVS is configured as a firewall, with thousands of
> active
>
>     > >     >     concurrent connections, the EMC gets quicly saturated
> and may
>
>     > >     >     come under heavy thrashing for the reason that original
> and
>
>     > >     >     recirculated packets keep overwriting the existing
> active EMC
>
>     > >     >     entries due to its limited size (8k).
>
>     > >     >
>
>     > >     >
>
>     > >     > The recirculated packet could have been modified, in which
> case,
>
>     > maybe we
>
>     > >     > still want to do the emc lookup/insert ?
>
>     > >
>
>     > >     [Antonio]
>
>     > >     IMPO I'd say we should still skip emc anyway, because the
> purpose is
>
>     > to
>
>     > >     mitigate thrashing when emc is full. So any recirculated
> packet should
>
>     > >     be classified at the dpcls/ofproto layers.
>
>     > >     I don't know if I'm missing something from your question?
>
>     > >
>
>     > >     We can expect that a recirc pkt that has been modified -
> similarly to
>
>     > all
>
>     > >     other recirculated pkts - could result in a miss when emc is
> full.
>
>     > >     Later we should do an emc insertion that is likely to
> overwrite some
>
>     > >     active entry. And recursively, this new insertion itself could
> be
>
>     > >     overwritten - due to the shortage of locations - even before
> it is hit
>
>     > >     again. This proposal is to mitigate the thrashing with the
> criteria of
>
>     > >     reserving emc usage to original packets only.
>
>     > >     So a limited resource like emc hopefully could be used more
>
>     > efficiently,
>
>     > >     especially when there is more than 1 recirculation.
>
>     > >     I guess that adding an exception for modified recirc pkts
> could also
>
>     > >     drop a bit the throughtput as we should add another if
> statement
>
>     > inside
>
>     > >     emc_processing.
>
>     > >
>
>     > > [Darrell]
>
>     > > I’ll can drop the edited packet case as my concern was really more
>
>     > general.
>
>     > > The concern is that recirculated packets should still be forwarded
> quickly
>
>     > if
>
>     > > possible
>
>     > > and using emc should help that. The first time through, emc is
> used for
>
>     > the
>
>     > > packet and then the second
>
>     > > time through, emc is not used, so it is slower. But, possibly the
> argument
>
>     > > could be made that since it is recirculated,
>
>     > > it is already slower, in which case, maybe a penalty for
> recirculated
>
>     > packets
>
>     > > is reasonable.
>
>     >
>
>     > [Antonio]
>
>     > Agree. Other than that, in case of an emc congestion - eg a firewall
> with
>
>     > say 6,000 connections - with a lot of overwrites, the effect could
> be that
>
>     > a lot of lookups will fail and the new insertions are just
> overwriting active
>
>     > flows. This keeps a high failure for lookups and the continuous
> overwrites
>
>     > for insertions become an overhead. So in this case there's a penalty
>
>     > as for the original (ie the 1st time through) as for the
> recirculated packets.
>
>     > With this approach we are considering that with 6,000 flows we would
>
>     > need at
>
>     > least 12,000 entries with 1 recirculation. So one strategy to reduce
>
>     > thrashing
>
>     > could be to restrict emc usage to original packets only. The
> counterpart is
>
>     > that recirculated packets are slower, but the overall effect should
> be a
>
>     > benefit.
>
>     >
>
>     >
>
>     > > Instead of having a simple 50% black and white cutoff, maybe a
> penalty
>
>     > to the
>
>     > > insertion probability could be used ?
>
>     >
>
>     > [Antonio]
>
>     > Yes, at the beginning I was considering this solution. I then
> preferred
>
>     > the current one because it allows not only to skip insertions but
> also
>
>     > to skip lookups, especially when RSS hash must be computed in
> software.
>
>     >
>
>     > The check of the threshold - as this is happening inside
> emc_processing -
>
>     > is done with an '&' operation so to use as less cpu cycles as
> possible.
>
>     >
>
>     >
>
>     > >
>
>     > >
>
>     > >     >
>
>     > >     >
>
>     > >     >     This thrashing causes the EMC to be less efficient than
> the dcpls
>
>     > >     >     in terms of lookups and insertions.
>
>     > >     >
>
>     > >     >     This patch allows to use the EMC efficiently by allowing
> only
>
>     > >     >     the 'original' packets to hit EMC. All recirculated
> packets are
>
>     > >     >     sent to the classifier directly.
>
>     > >     >     An empirical threshold EMC_RECIRCT_NO_INSERT_THRESHOLD -
>
>     > of 50% -
>
>     > >     >     for EMC occupancy is set to trigger this logic. By doing
> so when
>
>     > >     >     EMC utilization exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD:
>
>     > >     >      - EMC Insertions are allowed just for original packets.
>
>     > >     >        EMC insertion and look up are skipped for
> recirculated packets.
>
>     > >     >      - Recirculated packets are sent to the classifier.
>
>     > >     >
>
>     > >     >     This patch is based on patch
>
>     > >     >     "dpif-netdev: add EMC entry count and %full figure to
> pmd-stats-
>
>     > show"
>
>     > > at:
>
>     > >     >     https://urldefense.proofpoint.com/v2/url?u=https-
>
>     > >     > 3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-
>
>     > >     >
>
>     > >
>
>     > 2DJanuary_327570.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BV
>
>     > hFA09CGX7JQ5Ih-
>
>     > >     > uZnsw&m=NHY06RD-
>
>     > Bcweizxd86m6hcsLPKpe7a4WVSyh9aNZQlo&s=-
>
>     > >     > PhWyltJ71UipVzd1D0H0I9k4uSTLdCJ_zanXxHd7fo&e=
>
>     > >     >
>
>     > >     >     CC: Jan Scheurich <jan.scheurich@ericsson.com>
>
>     > >     >     Signed-off-by: Antonio Fischetti <
> antonio.fischetti@intel.com>
>
>     > >     >     Signed-off-by: Bhanuprakash Bodireddy
>
>     > > <bhanuprakash.bodireddy@intel.com>
>
>     > >     >     Co-authored-by: Bhanuprakash Bodireddy
>
>     > > <bhanuprakash.bodireddy@intel.com>
>
>     > >     >     ---
>
>     > >     >     Connection Tracker testbench set up with
>
>     > >     >
>
>     > >     >      table=0, priority=1 actions=drop
>
>     > >     >      table=0, priority=10,arp actions=NORMAL
>
>     > >     >      table=0, priority=100,ct_state=-trk,ip
> actions=ct(table=1)
>
>     > >     >      table=1, ct_state=+new+trk,ip,in_port=1
>
>     > actions=ct(commit),output:2
>
>     > >     >      table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
>
>     > >     >      table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
>
>     > >     >      table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
>
>     > >     >
>
>     > >     >     2 PMDs, 3 Tx queues.
>
>     > >     >
>
>     > >     >     I measured packet Rx rate (regardless of packet loss).
>
>     > Bidirectional
>
>     > >     >     test with 64B UDP packets.
>
>     > >     >     Each row is a test with a different number of traffic
> streams. The
>
>     > > traffic
>
>     > >     >     generator is set so that each stream establishes one UDP
>
>     > connection.
>
>     > >     >     Mpps columns reports the Rx rates on the 2 sides.
>
>     > >     >
>
>     > >     >     I set up the generator to loop on the dest IP addr on
> one side,
>
>     > >     >     and loop instead on the source IP addr on the other side.
>
>     > >     >
>
>     > >     >     For example to generate 10 different flows, I was
> sending to phy
>
>     > port
>
>     > > #1
>
>     > >     >     UDP, IPsrc:10.10.10.10, IPdest: 20.20.20.[20-29],
> PortSrc: 63,
>
>     > > PortDest: 63
>
>     > >     >
>
>     > >     >     Instead to phy port #2 (source and dest IPs are now
> swapped):
>
>     > >     >     UDP, IPsrc: 20.20.20.[20-29], IPdest: 10.10.10.10,
> PortSrc: 63,
>
>     > > PortDest:
>
>     > >     > 63
>
>     > >     >
>
>     > >     >     I saw the following performance improvement.
>
>     > >     >
>
>     > >     >     Original OvS-DPDK means at Commit ID:
>
>     > >     >       6b1babacc3ca0488e07596bf822fe356c9bab646
>
>     > >     >
>
>     > >     >               +----------------------+------
> -----------------+
>
>     > >     >               |  Original OvS-DPDK   |   Original OvS-DPDK
>  |
>
>     > >     >               |                      |    + this patch
>  |
>
>     > >     >      ---------+------------+-------
> --+------------+----------+
>
>     > >     >       Traffic |     Rx     |   EMC   |     Rx     |   EMC
> |
>
>     > >     >       Streams |   [Mpps]   | entries |   [Mpps]   | entries
> |
>
>     > >     >      ---------+------------+-------
> --+------------+----------+
>
>     > >     >          100  | 2.43, 2.49 |   200   | 2.55, 2.57 |   201
> |
>
>     > >     >        1,000  | 2.01, 2.02 |  2007   | 2.12, 2.12 |  2006
> |
>
>     > >     >        2,000  | 1.93, 1.95 |  3868   | 1.98, 1.96 |  3884
> |
>
>     > >     >        3,000  | 1.87, 1.91 |  5086   | 1.97, 1.97 |  4757
> |
>
>     > >     >        4,000  | 1.83, 1.82 |  6173   | 1.94, 1.93 |  5280
> |
>
>     > >     >       10,000  | 1.67, 1.69 |  7826   | 1.82, 1.81 |  7090
> |
>
>     > >     >       30,000  | 1.57, 1.59 |  8192   | 1.66, 1.67 |  8192
> |
>
>     > >     >      ---------+------------+-------
> --+------------+----------+
>
>     > >     >
>
>     > >     >     This test setup implies 1 recirculation on each received
> packet.
>
>     > >     >     We didn't check this patch in a test scenario where more
> than 1
>
>     > >     >     recirculation is occurring per packet.
>
>     > >     >     ---
>
>     > >     >      lib/dpif-netdev.c | 65
>
>     > >     > +++++++++++++++++++++++++++++++++++++++++++++++++++----
>
>     > >     >      1 file changed, 61 insertions(+), 4 deletions(-)
>
>     > >     >
>
>     > >     >     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>
>     > >     >     index bea1c3f..8f6b96b 100644
>
>     > >     >     --- a/lib/dpif-netdev.c
>
>     > >     >     +++ b/lib/dpif-netdev.c
>
>     > >     >     @@ -4663,6 +4663,9 @@ dp_netdev_queue_batches(struct
>
>     > dp_packet *pkt,
>
>     > >     >          packet_batch_per_flow_update(batch, pkt, mf);
>
>     > >     >      }
>
>     > >     >
>
>     > >     >     +/* Threshold to skip EMC for recirculated packets. */
>
>     > >     >     +#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000
>
>     > >     >     +
>
>     > >     >      /* Try to process all ('cnt') the 'packets' using only
> the exact
>
>     > > match
>
>     > >     > cache
>
>     > >     >       * 'pmd->flow_cache'. If a flow is not found for a
> packet
>
>     > > 'packets[i]',
>
>     > >     > the
>
>     > >     >       * miniflow is copied into 'keys' and the packet
> pointer is moved
>
>     > at
>
>     > > the
>
>     > >     >     @@ -4714,8 +4717,36 @@ emc_processing(struct
>
>     > dp_netdev_pmd_thread
>
>     > > *pmd,
>
>     > >     >              key->len = 0; /* Not computed yet. */
>
>     > >     >              key->hash = dpif_netdev_packet_get_rss_hash(packet,
> &key-
>
>     > > >mf);
>
>     > >     >
>
>     > >     >     -        /* If EMC is disabled skip emc_lookup */
>
>     > >     >     -        flow = (cur_min == 0) ? NULL:
> emc_lookup(flow_cache, key);
>
>     > >     >     +        /*
>
>     > >     >     +         * EMC lookup is skipped when one or both of
> the following
>
>     > >     >     +         * two cases occurs:
>
>     > >     >     +         *
>
>     > >     >     +         *    - EMC is disabled.  This is detected from
> cur_min.
>
>     > >     >     +         *
>
>     > >     >     +         *    - The EMC occupancy exceeds
>
>     > > EMC_RECIRCT_NO_INSERT_THRESHOLD
>
>     > >     > and
>
>     > >     >     +         *      the packet to be classified is being
> recirculated.
>
>     > > When
>
>     > >     > this
>
>     > >     >     +         *      happens also EMC insertions are skipped
> for
>
>     > > recirculated
>
>     > >     >     +         *      packets.  So that EMC is used just to
> store entries
>
>     > > which
>
>     > >     >     +         *      are hit from the 'original' packets.
> This way the
>
>     > > EMC
>
>     > >     >     +         *      thrashing is mitigated with a benefit on
>
>     > > performance.
>
>     > >     >     +         */
>
>     > >     >     +        if (OVS_LIKELY(cur_min)) {
>
>     > >     >     +            if (!md_is_valid) {
>
>     > >     >     +                flow = emc_lookup(flow_cache, key);
>
>     > >     >     +            } else {
>
>     > >     >     +                /* Recirculated packet. */
>
>     > >     >     +                if (flow_cache->n_entries &
>
>     > >     > EMC_RECIRCT_NO_INSERT_THRESHOLD) {
>
>     > >     >     +                    /* EMC occupancy is over the
> threshold.  We skip
>
>     > > EMC
>
>     > >     >     +                     * lookup for recirculated packets.
> */
>
>     > >     >     +                    flow = NULL;
>
>     > >     >     +                } else {
>
>     > >     >     +                    flow = emc_lookup(flow_cache, key);
>
>     > >     >     +                }
>
>     > >     >     +            }
>
>     > >     >     +        } else {
>
>     > >     >     +            flow = NULL;
>
>     > >     >     +        }
>
>     > >     >     +
>
>     > >     >              if (OVS_LIKELY(flow)) {
>
>     > >     >                  dp_netdev_queue_batches(packet, flow,
> &key->mf,
>
>     > batches,
>
>     > >     >                                          n_batches);
>
>     > >     >     @@ -4800,7 +4831,20 @@ handle_packet_upcall(struct
>
>     > > dp_netdev_pmd_thread
>
>     > >     > *pmd,
>
>     > >     >
>  add_actions->size);
>
>     > >     >              }
>
>     > >     >              ovs_mutex_unlock(&pmd->flow_mutex);
>
>     > >     >     -        emc_probabilistic_insert(pmd, key, netdev_flow);
>
>     > >     >     +        /* EMC insertion can be skipped by a
> probabilistic criteria
>
>     > > or
>
>     > >     >     +         * - in case of recirculated packets -
> depending on the
>
>     > > number of
>
>     > >     >     +         * EMC entries. */
>
>     > >     >     +        if (!packet->md.recirc_id) {
>
>     > >     >     +            emc_probabilistic_insert(pmd, key,
> netdev_flow);
>
>     > >     >     +        } else {
>
>     > >     >     +            /* Recirculated packets.  When EMC
> occupancy goes over
>
>     > >     >     +             * a threshold we avoid inserting new
> entries. */
>
>     > >     >     +            if (!(pmd->flow_cache.n_entries &
>
>     > >     >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
>
>     > >     >     +                /* Still under the threshold. */
>
>     > >     >     +                emc_probabilistic_insert(pmd, key,
> netdev_flow);
>
>     > >     >     +            }
>
>     > >     >     +        }
>
>     > >     >          }
>
>     > >     >      }
>
>     > >     >
>
>     > >     >     @@ -4893,7 +4937,20 @@ fast_path_processing(struct
>
>     > > dp_netdev_pmd_thread
>
>     > >     > *pmd,
>
>     > >     >
>
>     > >     >              flow = dp_netdev_flow_cast(rules[i]);
>
>     > >     >
>
>     > >     >     -        emc_probabilistic_insert(pmd, &keys[i], flow);
>
>     > >     >     +        /* EMC insertion can be skipped by a
> probabilistic criteria
>
>     > > or
>
>     > >     >     +         * - in case of recirculated packets -
> depending on the
>
>     > > number of
>
>     > >     >     +         * EMC entries. */
>
>     > >     >     +        if (!packet->md.recirc_id) {
>
>     > >     >     +            emc_probabilistic_insert(pmd, &keys[i],
> flow);
>
>     > >     >     +        } else {
>
>     > >     >     +            /* Recirculated packets.  When EMC
> occupancy goes over
>
>     > >     >     +             * a threshold we avoid inserting new
> entries. */
>
>     > >     >     +            if (!(pmd->flow_cache.n_entries &
>
>     > >     >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
>
>     > >     >     +                /* Still under the threshold. */
>
>     > >     >     +                emc_probabilistic_insert(pmd, &keys[i],
> flow);
>
>     > >     >     +            }
>
>     > >     >     +        }
>
>     > >     >              dp_netdev_queue_batches(packet, flow,
> &keys[i].mf,
>
>     > batches,
>
>     > >     > n_batches);
>
>     > >     >          }
>
>     > >     >
>
>     > >     >     --
>
>     > >     >     2.4.11
>
>     > >     >
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Fischetti, Antonio Aug. 17, 2017, 10:33 a.m. | #8
Thanks Jan for your feedback and the interesting use cases described.
Please find some questions/comments inline below.

Regards,
-Antonio


> -----Original Message-----

> From: Jan Scheurich [mailto:jan.scheurich@ericsson.com]

> Sent: Wednesday, August 16, 2017 5:24 PM

> To: Fischetti, Antonio <antonio.fischetti@intel.com>; Darrell Ball

> <dball@vmware.com>; dev@openvswitch.org

> Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for

> recirc packets

> 

> Hi,

> 

> I agree that in the event of EMC overload it is beneficial to reduce the number

> of EMC insertions and lookups as they just generate overhead and degrade

> overall throughput. At the same time we want to keep as much of the EMC

> acceleration as possible for a fraction of traffic that can benefit from EMC

> most.


[Antonio] 
I fully agree: the goal should be to reserve EMC acceleration for a 'fraction'
of the traffic.


> 

> For EMC insertion we have already done earlier this by introducing

> probabilistic EMC insertion, which greatly reduces the costly effect of EMC

> thrashing. But we didn't touch the lookup part. How should we select the

> packets (or rather packet datapath traversals) for which to perform lookup?

> 

> There are several proposals in the air: Only do it for the first pass, not for

> recirculated packets, only do it for RSS hash values below a (dynamic)

> threshold, possibly others.

> 

> For EMC insertion we consciously settled on a random selection as the datapath

> has no a priori insight into which flows are better candidates than others and

> big flows that benefit most have a higher chance of getting cached.

> 

> Is there a reason to assume that a deterministic selection on some non-random

> criteria like the recirculation count will on average (over deployments and

> applications) give a better performance than a random selection?


[Antonio]
If we consider latency and jitter, a deterministic solution should be
preferable to one that behaves differently depending on the
particular values of the packet fields, e.g. the IP addresses.
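
For comparison, the hash-dependent selection mentioned in this thread (do the
EMC lookup only for RSS hash values below a threshold) could be sketched as
below. The helper names and the percentage parameter are assumptions made for
illustration only; they are not part of this patch or of dpif-netdev.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: map a lookup percentage (0..100) onto the 32-bit
 * hash space.  A 64-bit result lets 100% cover every possible hash. */
static inline uint64_t
emc_hash_threshold(unsigned int percent)
{
    if (percent > 100) {
        percent = 100;
    }
    return (((uint64_t) 1 << 32) * percent) / 100;
}

/* Hypothetical helper: a packet is eligible for EMC lookup only when its
 * RSS hash is below the threshold.  All packets of one flow share the hash,
 * so the decision is the same for every packet of that flow. */
static inline bool
emc_lookup_allowed(uint32_t rss_hash, uint64_t threshold)
{
    return rss_hash < threshold;
}
```

With such a scheme the outcome depends on the hash of the packet fields,
which is the behaviour I am contrasting with the recirculation criterion.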


> 

> I don't believe so. For example, the number of "EMC flows" in each pass through

> the datapath can differ hugely: 1 GRE tunnel flow in first pass (from phy

> port), 100K tenant flows after tunnel decapsulation. Or 100K tenant flows in

> first pass (from VM) but 1 flow after NSH encapsulation in second pass.


[Antonio]
Maybe I'm wrong, but shouldn't the different flows encapsulated in a GRE
tunnel hit the EMC in different locations? Even if they all have the
same outer IP addresses, they differ in the L4 ports, so the 5-tuple hash
- and the EMC locations - should vary. Does the same hold for NSH encapsulation?
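
As a rough illustration of how a flow hash maps to EMC locations, here is a
sketch that assumes an 8k-entry cache probed at two candidate positions
derived from a 32-bit hash. The two-segment split and all names below are
assumptions made for illustration, not code quoted from the patch.

```c
#include <stdint.h>

/* Assumed layout: 8192 entries (13 index bits), two candidate positions. */
#define EMC_HASH_SHIFT 13
#define EMC_HASH_ENTRIES (1u << EMC_HASH_SHIFT)
#define EMC_HASH_MASK (EMC_HASH_ENTRIES - 1)
#define EMC_HASH_SEGS 2

/* Hypothetical helper: each candidate position is the next 13-bit slice
 * of the hash, so two different hashes usually yield different slots. */
static inline void
emc_candidate_positions(uint32_t hash, uint32_t pos[EMC_HASH_SEGS])
{
    for (int i = 0; i < EMC_HASH_SEGS; i++) {
        pos[i] = hash & EMC_HASH_MASK;
        hash >>= EMC_HASH_SHIFT;
    }
}
```

Under these assumptions, flows whose hashes differ in the low 26 bits are
spread over different EMC entries.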


> 

> I believe a random selection with dynamically adapted probability is the best

> we can do without a priori knowledge about the traffic patterns and pipeline

> organization.


[Antonio]
This proposal is orthogonal to other approaches that look at the usage
of individual locations, e.g. policies that avoid overwriting active
locations or that reduce EMC usage in general.
I think we should consider both strategies, as they tackle two different
aspects of the thrashing and help use the EMC more efficiently:
 1. skip EMC lookup/insert for recirculated packets (only activated when
    the number of EMC entries exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD);
 2. any other strategy that limits EMC usage or offers better entry eviction.

So - being agnostic of the traffic type - if we have 100k flows
that could potentially be recirculated:
 - strategy 1 tackles the thrashing due to recirculation, and is activated
   only when the number of EMC entries exceeds a threshold;
 - strategy 2 limits EMC usage to fewer flows, because we don't want
   all 100k flows to hit the EMC.
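
To make the threshold in strategy 1 concrete, here is a minimal sketch of the
occupancy check. The mask value and the 8k EMC size come from the patch; the
helper names are hypothetical and only serve to show that the '&' test is
equivalent to comparing against 50% occupancy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the patch: 8k EMC entries, mask selecting bits 12 and above. */
#define EMC_ENTRIES 8192u
#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000u

/* Hypothetical helper: true when a recirculated packet should bypass the
 * EMC.  For a 32-bit counter, any bit set at or above bit 12 means
 * n_entries >= 4096, i.e. at least 50% of 8192 entries are in use. */
static inline bool
emc_skip_for_recirc(uint32_t n_entries)
{
    return (n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) != 0;
}

/* Equivalent comparison form, shown only to make the threshold explicit. */
static inline bool
emc_skip_for_recirc_cmp(uint32_t n_entries)
{
    return n_entries >= EMC_ENTRIES / 2;
}
```

The mask form only works because the threshold is a power of two; any other
occupancy level would need the explicit comparison.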

> 

> The RSS hash threshold method looks like the only pseudo-random criterion that

> we can use that produces consistent result for every packet of a flow and does

> require more information. Of course elephant flows with an unlucky hash value

> might never get to use the EMC, but that risk we have with any stateless

> selection scheme.

> 

> The new thing required will be the dynamic adjustment of lookup probability to

> the EMC fill level and/or hit ratio. Any ideas for that? I guess we'd need a

> scheme that periodically increases the probability again to probe for changed

> traffic patterns.

> 

> Once we have that I think the same dynamic probability could be possible to use

> also for probabilistic EMC insertion.

> 

> BR, Jan

> 

> > -----Original Message-----

> > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

> > bounces@openvswitch.org] On Behalf Of Fischetti, Antonio

> > Sent: Wednesday, 16 August, 2017 14:42

> > To: Darrell Ball <dball@vmware.com>; dev@openvswitch.org

> > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert

> > for recirc packets

> >

> >

> > > -----Original Message-----

> > > From: Darrell Ball [mailto:dball@vmware.com]

> > > Sent: Wednesday, August 16, 2017 9:09 AM

> > > To: Fischetti, Antonio <antonio.fischetti@intel.com>;

> > dev@openvswitch.org

> > > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert

> > for

> > > recirc packets

> > >

> > >

> > >

> > > -----Original Message-----

> > > From: "Fischetti, Antonio" <antonio.fischetti@intel.com>

> > > Date: Tuesday, August 15, 2017 at 6:55 AM

> > > To: Darrell Ball <dball@vmware.com>, "dev@openvswitch.org"

> > > <dev@openvswitch.org>

> > > Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert

> > for

> > > recirc packets

> > >

> > >

> > >

> > >     > -----Original Message-----

> > >     > From: Darrell Ball [mailto:dball@vmware.com]

> > >     > Sent: Monday, August 14, 2017 7:27 AM

> > >     > To: Fischetti, Antonio <antonio.fischetti@intel.com>;

> > dev@openvswitch.org

> > >     > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC

> > lookup/insert

> > > for

> > >     > recirc packets

> > >     >

> > >     >

> > >     >

> > >     > -----Original Message-----

> > >     > From: <ovs-dev-bounces@openvswitch.org> on behalf of

> > >     > "antonio.fischetti@intel.com" <antonio.fischetti@intel.com>

> > >     > Date: Friday, August 11, 2017 at 8:52 AM

> > >     > To: "dev@openvswitch.org" <dev@openvswitch.org>

> > >     > Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC

> > lookup/insert for

> > >     > 	recirc packets

> > >     >

> > >     >     When OVS is configured as a firewall, with thousands of active

> > >     >     concurrent connections, the EMC gets quickly saturated and may

> > >     >     come under heavy thrashing for the reason that original and

> > >     >     recirculated packets keep overwriting the existing active EMC

> > >     >     entries due to its limited size (8k).

> > >     >

> > >     >

> > >     > The recirculated packet could have been modified, in which case,

> > maybe we

> > >     > still want to do the emc lookup/insert ?

> > >

> > >     [Antonio]

> > >     IMPO I'd say we should still skip emc anyway, because the purpose is

> > to

> > >     mitigate thrashing when emc is full. So any recirculated packet should

> > >     be classified at the dpcls/ofproto layers.

> > >     I don't know if I'm missing something from your question?

> > >

> > >     We can expect that a recirc pkt that has been modified - similarly to

> > all

> > >     other recirculated pkts - could result in a miss when emc is full.

> > >     Later we should do an emc insertion that is likely to overwrite some

> > >     active entry. And recursively, this new insertion itself could be

> > >     overwritten - due to the shortage of locations - even before it is hit

> > >     again. This proposal is to mitigate the thrashing with the criteria of

> > >     reserving emc usage to original packets only.

> > >     So a limited resource like emc hopefully could be used more

> > efficiently,

> > >     especially when there is more than 1 recirculation.

> > >     I guess that adding an exception for modified recirc pkts could also

> > >     reduce the throughput a bit, as we would have to add another if statement

> > inside

> > >     emc_processing.

> > >

> > > [Darrell]

> > > I can drop the edited packet case as my concern was really more

> > general.

> > > The concern is that recirculated packets should still be forwarded quickly

> > if

> > > possible

> > > and using emc should help that. The first time through, emc is used for

> > the

> > > packet and then the second

> > > time through, emc is not used, so it is slower. But, possibly the argument

> > > could be made that since it is recirculated,

> > > it is already slower, in which case, maybe a penalty for recirculated

> > packets

> > > is reasonable.

> >

> > [Antonio]

> > Agree. Other than that, in case of an emc congestion - eg a firewall with

> > say 6,000 connections - with a lot of overwrites, the effect could be that

> > a lot of lookups will fail and the new insertions are just overwriting active

> > flows. This keeps a high failure for lookups and the continuous overwrites

> > for insertions become an overhead. So in this case there's a penalty

> > both for the original (i.e. the 1st time through) and for the recirculated

> > packets.

> > With this approach we are considering that with 6,000 flows we would

> > need at

> > least 12,000 entries with 1 recirculation. So one strategy to reduce

> > thrashing

> > could be to restrict emc usage to original packets only. The counterpart is

> > that recirculated packets are slower, but the overall effect should be a

> > benefit.

> >

> >

> > > Instead of having a simple 50% black and white cutoff, maybe a penalty

> > to the

> > > insertion probability could be used ?

> >

> > [Antonio]

> > Yes, at the beginning I was considering this solution. I then preferred

> > the current one because it allows not only to skip insertions but also

> > to skip lookups, especially when RSS hash must be computed in software.

> >

> > The check of the threshold - as this is happening inside emc_processing -

> > is done with an '&' operation so as to use as few cpu cycles as possible.

> >

> >

> > >

> > >

> > >     >

> > >     >

> > >     >     This thrashing causes the EMC to be less efficient than the dpcls

> > >     >     in terms of lookups and insertions.

> > >     >

> > >     >     This patch allows to use the EMC efficiently by allowing only

> > >     >     the 'original' packets to hit EMC. All recirculated packets are

> > >     >     sent to the classifier directly.

> > >     >     An empirical threshold EMC_RECIRCT_NO_INSERT_THRESHOLD -

> > of 50% -

> > >     >     for EMC occupancy is set to trigger this logic. By doing so when

> > >     >     EMC utilization exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD:

> > >     >      - EMC Insertions are allowed just for original packets.

> > >     >        EMC insertion and look up are skipped for recirculated

> packets.

> > >     >      - Recirculated packets are sent to the classifier.

> > >     >

> > >     >     This patch is based on patch

> > >     >     "dpif-netdev: add EMC entry count and %full figure to pmd-stats-

> > show"

> > > at:

> > >     >     https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html

> > >     >

> > >     >     CC: Jan Scheurich <jan.scheurich@ericsson.com>

> > >     >     Signed-off-by: Antonio Fischetti <antonio.fischetti@intel.com>

> > >     >     Signed-off-by: Bhanuprakash Bodireddy

> > > <bhanuprakash.bodireddy@intel.com>

> > >     >     Co-authored-by: Bhanuprakash Bodireddy

> > > <bhanuprakash.bodireddy@intel.com>

> > >     >     ---

> > >     >     Connection Tracker testbench set up with

> > >     >

> > >     >      table=0, priority=1 actions=drop

> > >     >      table=0, priority=10,arp actions=NORMAL

> > >     >      table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)

> > >     >      table=1, ct_state=+new+trk,ip,in_port=1

> > actions=ct(commit),output:2

> > >     >      table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2

> > >     >      table=1, ct_state=+new+trk,ip,in_port=2 actions=drop

> > >     >      table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1

> > >     >

> > >     >     2 PMDs, 3 Tx queues.

> > >     >

> > >     >     I measured packet Rx rate (regardless of packet loss).

> > Bidirectional

> > >     >     test with 64B UDP packets.

> > >     >     Each row is a test with a different number of traffic streams.

> The

> > > traffic

> > >     >     generator is set so that each stream establishes one UDP

> > connection.

> > >     >     Mpps columns reports the Rx rates on the 2 sides.

> > >     >

> > >     >     I set up the generator to loop on the dest IP addr on one side,

> > >     >     and loop instead on the source IP addr on the other side.

> > >     >

> > >     >     For example to generate 10 different flows, I was sending to phy

> > port

> > > #1

> > >     >     UDP, IPsrc:10.10.10.10, IPdest: 20.20.20.[20-29], PortSrc: 63,

> > > PortDest: 63

> > >     >

> > >     >     Instead to phy port #2 (source and dest IPs are now swapped):

> > >     >     UDP, IPsrc: 20.20.20.[20-29], IPdest: 10.10.10.10, PortSrc: 63,

> > > PortDest:

> > >     > 63

> > >     >

> > >     >     I saw the following performance improvement.

> > >     >

> > >     >     Original OvS-DPDK means at Commit ID:

> > >     >       6b1babacc3ca0488e07596bf822fe356c9bab646

> > >     >

> > >     >               +----------------------+-----------------------+

> > >     >               |  Original OvS-DPDK   |   Original OvS-DPDK   |

> > >     >               |                      |    + this patch       |

> > >     >      ---------+------------+---------+------------+----------+

> > >     >       Traffic |     Rx     |   EMC   |     Rx     |   EMC    |

> > >     >       Streams |   [Mpps]   | entries |   [Mpps]   | entries  |

> > >     >      ---------+------------+---------+------------+----------+

> > >     >          100  | 2.43, 2.49 |   200   | 2.55, 2.57 |   201    |

> > >     >        1,000  | 2.01, 2.02 |  2007   | 2.12, 2.12 |  2006    |

> > >     >        2,000  | 1.93, 1.95 |  3868   | 1.98, 1.96 |  3884    |

> > >     >        3,000  | 1.87, 1.91 |  5086   | 1.97, 1.97 |  4757    |

> > >     >        4,000  | 1.83, 1.82 |  6173   | 1.94, 1.93 |  5280    |

> > >     >       10,000  | 1.67, 1.69 |  7826   | 1.82, 1.81 |  7090    |

> > >     >       30,000  | 1.57, 1.59 |  8192   | 1.66, 1.67 |  8192    |

> > >     >      ---------+------------+---------+------------+----------+

> > >     >

> > >     >     This test setup implies 1 recirculation on each received packet.

> > >     >     We didn't check this patch in a test scenario where more than 1

> > >     >     recirculation is occurring per packet.

> > >     >     ---

> > >     >      lib/dpif-netdev.c | 65

> > >     > +++++++++++++++++++++++++++++++++++++++++++++++++++----

> > >     >      1 file changed, 61 insertions(+), 4 deletions(-)

> > >     >

> > >     >     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c

> > >     >     index bea1c3f..8f6b96b 100644

> > >     >     --- a/lib/dpif-netdev.c

> > >     >     +++ b/lib/dpif-netdev.c

> > >     >     @@ -4663,6 +4663,9 @@ dp_netdev_queue_batches(struct

> > dp_packet *pkt,

> > >     >          packet_batch_per_flow_update(batch, pkt, mf);

> > >     >      }

> > >     >

> > >     >     +/* Threshold to skip EMC for recirculated packets. */

> > >     >     +#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000

> > >     >     +

> > >     >      /* Try to process all ('cnt') the 'packets' using only the exact

> > > match

> > >     > cache

> > >     >       * 'pmd->flow_cache'. If a flow is not found for a packet

> > > 'packets[i]',

> > >     > the

> > >     >       * miniflow is copied into 'keys' and the packet pointer is

> moved

> > at

> > > the

> > >     >     @@ -4714,8 +4717,36 @@ emc_processing(struct

> > dp_netdev_pmd_thread

> > > *pmd,

> > >     >              key->len = 0; /* Not computed yet. */

> > >     >              key->hash = dpif_netdev_packet_get_rss_hash(packet,

> &key-

> > > >mf);

> > >     >

> > >     >     -        /* If EMC is disabled skip emc_lookup */

> > >     >     -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache,

> key);

> > >     >     +        /*

> > >     >     +         * EMC lookup is skipped when one or both of the

> following

> > >     >     +         * two cases occurs:

> > >     >     +         *

> > >     >     +         *    - EMC is disabled.  This is detected from cur_min.

> > >     >     +         *

> > >     >     +         *    - The EMC occupancy exceeds

> > > EMC_RECIRCT_NO_INSERT_THRESHOLD

> > >     > and

> > >     >     +         *      the packet to be classified is being

> recirculated.

> > > When

> > >     > this

> > >     >     +         *      happens also EMC insertions are skipped for

> > > recirculated

> > >     >     +         *      packets.  So that EMC is used just to store

> entries

> > > which

> > >     >     +         *      are hit from the 'original' packets.  This way

> the

> > > EMC

> > >     >     +         *      thrashing is mitigated with a benefit on

> > > performance.

> > >     >     +         */

> > >     >     +        if (OVS_LIKELY(cur_min)) {

> > >     >     +            if (!md_is_valid) {

> > >     >     +                flow = emc_lookup(flow_cache, key);

> > >     >     +            } else {

> > >     >     +                /* Recirculated packet. */

> > >     >     +                if (flow_cache->n_entries &

> > >     > EMC_RECIRCT_NO_INSERT_THRESHOLD) {

> > >     >     +                    /* EMC occupancy is over the threshold.  We

> skip

> > > EMC

> > >     >     +                     * lookup for recirculated packets. */

> > >     >     +                    flow = NULL;

> > >     >     +                } else {

> > >     >     +                    flow = emc_lookup(flow_cache, key);

> > >     >     +                }

> > >     >     +            }

> > >     >     +        } else {

> > >     >     +            flow = NULL;

> > >     >     +        }

> > >     >     +

> > >     >              if (OVS_LIKELY(flow)) {

> > >     >                  dp_netdev_queue_batches(packet, flow, &key->mf,

> > batches,

> > >     >                                          n_batches);

> > >     >     @@ -4800,7 +4831,20 @@ handle_packet_upcall(struct

> > > dp_netdev_pmd_thread

> > >     > *pmd,

> > >     >                                                   add_actions->size);

> > >     >              }

> > >     >              ovs_mutex_unlock(&pmd->flow_mutex);

> > >     >     -        emc_probabilistic_insert(pmd, key, netdev_flow);

> > >     >     +        /* EMC insertion can be skipped by a probabilistic

> criteria

> > > or

> > >     >     +         * - in case of recirculated packets - depending on the

> > > number of

> > >     >     +         * EMC entries. */

> > >     >     +        if (!packet->md.recirc_id) {

> > >     >     +            emc_probabilistic_insert(pmd, key, netdev_flow);

> > >     >     +        } else {

> > >     >     +            /* Recirculated packets.  When EMC occupancy goes

> over

> > >     >     +             * a threshold we avoid inserting new entries. */

> > >     >     +            if (!(pmd->flow_cache.n_entries &

> > >     >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {

> > >     >     +                /* Still under the threshold. */

> > >     >     +                emc_probabilistic_insert(pmd, key, netdev_flow);

> > >     >     +            }

> > >     >     +        }

> > >     >          }

> > >     >      }

> > >     >

> > >     >     @@ -4893,7 +4937,20 @@ fast_path_processing(struct

> > > dp_netdev_pmd_thread

> > >     > *pmd,

> > >     >

> > >     >              flow = dp_netdev_flow_cast(rules[i]);

> > >     >

> > >     >     -        emc_probabilistic_insert(pmd, &keys[i], flow);

> > >     >     +        /* EMC insertion can be skipped by a probabilistic

> criteria

> > > or

> > >     >     +         * - in case of recirculated packets - depending on the

> > > number of

> > >     >     +         * EMC entries. */

> > >     >     +        if (!packet->md.recirc_id) {

> > >     >     +            emc_probabilistic_insert(pmd, &keys[i], flow);

> > >     >     +        } else {

> > >     >     +            /* Recirculated packets.  When EMC occupancy goes

> over

> > >     >     +             * a threshold we avoid inserting new entries. */

> > >     >     +            if (!(pmd->flow_cache.n_entries &

> > >     >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {

> > >     >     +                /* Still under the threshold. */

> > >     >     +                emc_probabilistic_insert(pmd, &keys[i], flow);

> > >     >     +            }

> > >     >     +        }

> > >     >              dp_netdev_queue_batches(packet, flow, &keys[i].mf,

> > batches,

> > >     > n_batches);

> > >     >          }

> > >     >

> > >     >     --

> > >     >     2.4.11

> > >     >

> > >     >     _______________________________________________

> > >     >     dev mailing list

> > >     >     dev@openvswitch.org

> > >     >     https://mail.openvswitch.org/mailman/listinfo/ovs-dev

> > >     >

> > >

> > >

> >

Jan Scheurich Aug. 17, 2017, 12:22 p.m. | #9
The RSS hash threshold method looks like the only pseudo-random criterion that we can use that produces a consistent result for every packet of a flow and does not require more information. Of course elephant flows with an unlucky hash value might never get to use the EMC, but that risk we have with any stateless selection scheme.

[Darrell] It is probably something I know by another name, but JTBC, can you define the “RSS hash threshold method” ?

I am referring to Billy's proposal (https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/336509.html)

In essence, it suggests selecting for EMC lookup only packets whose RSS hash is above a certain threshold. The lookup probability is determined by the threshold (e.g. a threshold of 0.75 * UINT32_MAX corresponds to 25%). It is pseudo-random in that, assuming the hash result is uniformly distributed, all flows will profit from EMC lookup with the same probability.
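
As a rough illustration of the threshold idea described above (a minimal sketch, not the code of the referenced proposal, for which no patch exists yet in this thread): the helper names emc_lookup_selected() and emc_threshold_for_percent() are hypothetical and do not exist in OVS. The sketch only shows how a 32-bit hash threshold maps to a lookup share and why every packet of a flow gets the same decision.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of OVS: a packet is considered for EMC
 * lookup only when its 32-bit RSS hash is above 'threshold'.  All packets
 * of a flow carry the same hash, so the decision is stable per flow. */
static bool
emc_lookup_selected(uint32_t rss_hash, uint32_t threshold)
{
    return rss_hash > threshold;
}

/* Hypothetical helper: converts a desired lookup share in percent (0..100)
 * into a hash threshold, assuming uniformly distributed hash values.
 * 25% gives (100 - 25)% of UINT32_MAX, i.e. about 0.75 * UINT32_MAX. */
static uint32_t
emc_threshold_for_percent(unsigned int percent)
{
    if (percent >= 100) {
        return 0;   /* Every hash except 0 is selected. */
    }
    return (uint32_t) (((uint64_t) UINT32_MAX * (100 - percent)) / 100);
}
```

With a 25% share the threshold is 0xBFFFFFFF, so a flow hashing to 0xC0000000 always uses the EMC and a flow hashing to 0x12345678 never does.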


    The new thing required will be the dynamic adjustment of lookup probability to the EMC fill level and/or hit ratio.

[Darrell] Did you mean insertion probability rather than lookup probability ?

No, I actually meant dynamic adaptation of lookup probability. We don't want to reduce the EMC lookup probability when the EMC is not yet overloaded, but only when the EMC hit rate degrades due to collisions. When we devise an algorithm to adapt lookup probability, we can study if it could make sense to also dynamically adjust the currently fixed (configurable) EMC insertion probability based on EMC fill level and/or hit rate.
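
No adaptation algorithm has been devised in this thread, so the following is only a sketch of what one adaptation step could look like under stated assumptions. The function emc_adapt_lookup_pct() and all constants (step size, floor, hit-rate watermark) are hypothetical placeholders chosen for illustration. The sketch lowers the lookup share only when the measured hit rate degrades and otherwise raises it again, which also serves to probe for changed traffic patterns.

```c
#include <stdint.h>

/* All values below are assumptions for illustration, not tuned figures. */
#define LOOKUP_PCT_MIN    10u   /* Never shed below this lookup share. */
#define LOOKUP_PCT_MAX   100u   /* Full EMC usage. */
#define LOOKUP_PCT_STEP   10u   /* Adjustment per measurement interval. */
#define HIT_RATE_LOW_PCT  50u   /* Below this hit rate, shed EMC load. */

/* Hypothetical adaptation step, run once per measurement interval with the
 * EMC hit and lookup counters collected during that interval.  Returns the
 * lookup share (in percent) to use for the next interval. */
static unsigned int
emc_adapt_lookup_pct(unsigned int cur_pct, uint64_t hits, uint64_t lookups)
{
    if (lookups == 0) {
        return cur_pct;             /* No data, keep the current share. */
    }
    if (hits * 100 < lookups * HIT_RATE_LOW_PCT) {
        /* Hit rate degraded: reduce the lookup share, down to the floor. */
        return cur_pct >= LOOKUP_PCT_MIN + LOOKUP_PCT_STEP
               ? cur_pct - LOOKUP_PCT_STEP : LOOKUP_PCT_MIN;
    }
    /* Hit rate is fine: raise the share again to probe for new patterns. */
    return cur_pct + LOOKUP_PCT_STEP < LOOKUP_PCT_MAX
           ? cur_pct + LOOKUP_PCT_STEP : LOOKUP_PCT_MAX;
}
```

The resulting percentage could then be turned into a hash threshold; whether the same signal should also drive the insertion probability is left open here, as in the discussion.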

BR, Jan
Jan Scheurich Aug. 17, 2017, 12:42 p.m. | #10
Hi Antonio,

> > Is there a reason to assume that a deterministic selection on some non-
> random
> > criteria like the recirculation count will on average (over deployments
> and
> > applications) give a better performance than a random selection?
> 
> [Antonio]
> If we consider latency and jitter a deterministic solution should be
> more preferable than a solution which behaves differently depending
> on the particular values of the packet fields, eg the IP addresses.

Do you have measurements showing that latency is significantly affected 
by EMC hit vs DPCLS hit?  I wouldn't think so.  Only throughput should vary.

Probabilistic EMC lookup should only apply in situations where EMC is 
overloaded, meaning we have thousands of packet flows. In this case we
maximize the aggregate throughput of the statistical flow mix. But it is not
that a flow using EMC would see higher throughput than analogous flows that
don't.

> > I don't believe so. For example, the number of "EMC flows" in each pass
> through
> > the datapath can differ hugely: 1 GRE tunnel flow in first pass (from phy
> > port), 100K tenant flows after tunnel decapsulation. Or 100K tenant flows
> in
> > first pass (from VM) but 1 flow after NSH encapsulation in second pass.
> 
> [Antonio]
> Maybe I'm wrong but shouldn't the different flows encapped in a GRE
> tunnel hit the EMC in different locations? Because even if they all have the
> same outer IP addresses, they differ in the L4 ports so the 5-tuple hash
> - and the emc locations - should vary. Same thing for NSH encapsulation?

Neither GRE nor NSH packets have L4 ports for RSS hashing. GRE is a separate
IP protocol (not UDP). All packets of a GRE tunnel share the same pair of IP
addresses. NSH is even a non-IP protocol.
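
The point about GRE can be illustrated with a toy model. This is a sketch only: toy_rss_hash() is a hypothetical stand-in built on FNV-1a, not the hash that a NIC or OVS actually computes. It mixes in L4 ports only for TCP and UDP, so all packets of one GRE tunnel (IP protocol 47) collapse onto a single hash value, while UDP flows between the same hosts spread out.

```c
#include <stdint.h>

#define PROTO_TCP  6u    /* IANA IP protocol numbers. */
#define PROTO_UDP 17u
#define PROTO_GRE 47u

/* Simplified view of the outer header fields that feed the hash. */
struct outer_hdr {
    uint32_t ip_src;
    uint32_t ip_dst;
    uint8_t proto;
    uint16_t l4_src;     /* Meaningful only for TCP/UDP. */
    uint16_t l4_dst;     /* Meaningful only for TCP/UDP. */
};

/* Mixes the four bytes of 'v' into 'h' using FNV-1a. */
static uint32_t
fnv1a_mix32(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xff;
        h *= 16777619u;
    }
    return h;
}

/* Toy hash: IP pair always, L4 ports only when the protocol has them. */
static uint32_t
toy_rss_hash(const struct outer_hdr *hdr)
{
    uint32_t hash = 2166136261u;

    hash = fnv1a_mix32(hash, hdr->ip_src);
    hash = fnv1a_mix32(hash, hdr->ip_dst);
    if (hdr->proto == PROTO_TCP || hdr->proto == PROTO_UDP) {
        hash = fnv1a_mix32(hash, ((uint32_t) hdr->l4_src << 16) | hdr->l4_dst);
    }
    return hash;
}
```

In this model any number of tenant flows carried in one GRE tunnel occupy a single EMC location before decapsulation, which is why the number of "EMC flows" can differ so much between passes.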

> > I believe a random selection with dynamically adapted probability is the
> best
> > we can do without a priori knowledge about the traffic patterns and
> pipeline
> > organization.
> 
> [Antonio]
> This proposal is orthogonal to other approaches that look at the usage
> of the single locations, eg policies not to overwrite active locations or to
> reduce in general the emc usage.
> I think we should consider both the two strategies to tackle two different
> aspects of the thrashing and use emc more efficiently:
>  1. skip emc lookup/insert for recirc packets (which is only activated when
>    emc entries exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD);
>  2. any other strategy that limits emc usage or offers a better entries
> eviction.
> 
> So - being agnostic of what's the traffic type - if we have 100k flows
> that could potentially be recirculated:
>  1. allows to tackle the thrashing due to recirculation, which is activated
>     when the emc entries exceeds a threshold.
>  2. allows to limit the emc usage to fewer flows because we don't want
>     100k flows to hit emc.

First of all: we only discuss limiting EMC lookups in the case of EMC overload.
I still don't think that it is a good idea to generally skip EMC lookup for
recirculated flows in that situation. It may be the right thing to do in some
scenarios (e.g. GRE -> VM), but exactly the wrong thing in others (e.g. VM -> GRE).

If we go for a probabilistic reduction of EMC lookups we'd statistically have a
balanced improvement in all (known and unknown) scenarios.

BR, Jan
Darrell Ball Aug. 17, 2017, 6:45 p.m. | #11
On 8/17/17, 5:22 AM, "Jan Scheurich" <jan.scheurich@ericsson.com> wrote:

        The RSS hash threshold method looks like the only pseudo-random criterion that we can use that produces a consistent result for every packet of a flow and does not require more information. Of course elephant flows with an unlucky hash value might never get to use the EMC, but that risk we have with any stateless selection scheme.
    
    
    
    [Darrell] It is probably something I know by another name, but JTBC, can you define the “RSS hash threshold method” ?
    
    
    
    I am referring to Billy's proposal (https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/336509.html)

    
    In essence, it suggests selecting for EMC lookup only packets whose RSS hash is above a certain threshold. The lookup probability is determined by the threshold (e.g. a threshold of 0.75 * UINT32_MAX corresponds to 25%). It is pseudo-random in that, assuming the hash result is uniformly distributed, all flows will profit from EMC lookup with the same probability.
    

[Darrell] ahh, there is no actual patch yet, just an e-mail
                I see, you have coined the term “RSS hash threshold method” for the approach; the nomenclature makes sense now.
                I’ll have separate comments, of course, on the proposal itself.

    
  
        The new thing required will be the dynamic adjustment of lookup probability to the EMC fill level and/or hit ratio.
    
    
    
    [Darrell] Did you mean insertion probability rather than lookup probability ?
    
    
    
    No, I actually meant dynamic adaptation of lookup probability. We don't want to reduce the EMC lookup probability when the EMC is not yet overloaded, but only when the EMC hit rate degrades due to collisions. When we devise an algorithm to adapt lookup probability, we can study if it could make sense to also dynamically adjust the currently fixed (configurable) EMC insertion probability based on EMC fill level and/or hit rate.


[Darrell] Now that I know what you are referring to above, it is a lot easier to make the linkage. 
    
    
    
    BR, Jan
Fischetti, Antonio Aug. 23, 2017, 2:35 p.m. | #12
> -----Original Message-----
> From: Jan Scheurich [mailto:jan.scheurich@ericsson.com]
> Sent: Thursday, August 17, 2017 1:42 PM
> To: Fischetti, Antonio <antonio.fischetti@intel.com>; Darrell Ball
> <dball@vmware.com>; dev@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for
> recirc packets
> 
> Hi Antonio,
> 
> > > Is there a reason to assume that a deterministic selection on some non-
> > random
> > > criteria like the recirculation count will on average (over deployments
> > and
> > > applications) give a better performance than a random selection?
> >
> > [Antonio]
> > If we consider latency and jitter a deterministic solution should be
> > more preferable than a solution which behaves differently depending
> > on the particular values of the packet fields, eg the IP addresses.
> 
> Do you have measurements showing that latency is significantly affected
> by EMC hit vs DPCLS hit?  I wouldn't think so.  Only throughput should vary.
> 

[Antonio]
Agree. 
What I meant to say is that - broadly speaking - it should be
preferable to adopt solutions that seem to be more deterministic, 
especially in a Telco deployment.
This approach - at least at first glance - seems to be more deterministic
than other approaches like the "RSS hash threshold method" because
the latter can treat packets differently depending on their headers.

IMPO it could be good to have this approach in parallel with some other 
strategies - like the "RSS hash threshold method" - because they operate 
on two different causes/levels of the same problem.
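
To make the comparison concrete, here is a minimal sketch of the decision this patch takes, reduced to a standalone function. The mask value is the one from the patch; the function emc_skip_for_recirc(), its boolean parameter and the EMC_ENTRIES constant (8192, the EMC size quoted in this thread) are assumptions for illustration only. The decision depends solely on whether the packet is recirculated and on the EMC occupancy, not on any header field.

```c
#include <stdbool.h>
#include <stdint.h>

#define EMC_ENTRIES 8192u   /* EMC size as quoted in this thread (8k). */

/* Mask from the patch: any bit at or above bit 12 set means that
 * n_entries >= 4096, i.e. at least 50% of EMC_ENTRIES. */
#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000

/* Hypothetical standalone form of the check: returns true when EMC lookup
 * and insertion should be skipped for this packet. */
static bool
emc_skip_for_recirc(bool recirculated, uint32_t n_entries)
{
    return recirculated
           && (n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) != 0;
}
```

Note that the single '&' fires as soon as the entry count reaches 4096, so the logic is active at 50% occupancy and above rather than strictly above it.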
 

> Probabilistic EMC lookup should only apply in situations where EMC is
> overloaded, meaning we have thousands of packet flows. In this case we
> maximize the aggregate throughput of the statistical flow mix. But it is not
> that a flow using EMC would see higher throughput than analogous flows that
> don't.
> 
> > > I don't believe so. For example, the number of "EMC flows" in each pass
> > through
> > > the datapath can differ hugely: 1 GRE tunnel flow in first pass (from phy
> > > port), 100K tenant flows after tunnel decapsulation. Or 100K tenant flows
> > in
> > > first pass (from VM) but 1 flow after NSH encapsulation in second pass.
> >
> > [Antonio]
> > Maybe I'm wrong but shouldn't the different flows encapped in a GRE
> > tunnel hit the EMC in different locations? Because even if they all have the
> > same outer IP addresses, they differ in the L4 ports so the 5-tuple hash
> > - and the emc locations - should vary. Same thing for NSH encapsulation?
> 
> Neither GRE nor NSH packets have L4 ports for RSS hashing. GRE is a separate
> IP protocol (not UDP). All packets of a GRE tunnel share the same pair of IP
> addresses. NSH is even a non-IP protocol.
> 
> > > I believe a random selection with dynamically adapted probability is the
> > best
> > > we can do without a priori knowledge about the traffic patterns and
> > pipeline
> > > organization.
> >
> > [Antonio]
> > This proposal is orthogonal to other approaches that look at the usage
> > of the single locations, eg policies not to overwrite active locations or to
> > reduce in general the emc usage.
> > I think we should consider both the two strategies to tackle two different
> > aspects of the thrashing and use emc more efficiently:
> >  1. skip emc lookup/insert for recirc packets (which is only activated when
> >    emc entries exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD);
> >  2. any other strategy that limits emc usage or offers a better entries
> > eviction.
> >
> > So - being agnostic of what's the traffic type - if we have 100k flows
> > that could potentially be recirculated:
> >  1. allows to tackle the thrashing due to recirculation, which is activated
> >     when the emc entries exceeds a threshold.
> >  2. allows to limit the emc usage to fewer flows because we don't want
> >     100k flows to hit emc.
> 
> First of all: we only discuss limiting EMC lookups in the case of EMC overload.
> I still don't think that it is a good idea to generally skip EMC lookup for
> recirculated flows in that situation. It may be the right thing to do in some
> scenarios (e.g. GRE -> VM), but exactly the wrong thing in others (e.g. VM -> GRE).
> 
> If we go for a probabilistic reduction of EMC lookups we'd statistically have a
> balanced improvement in all (known and unknown) scenarios.
> 
> BR, Jan
O Mahony, Billy Sept. 6, 2017, 3:23 p.m. | #13
Hi All,

On the “RSS hash threshold method” for EMC load shedding: I hope to have time in the next week or so to post an RFC that illustrates it and gives a better idea of what I mean.

Thanks,
Billy.

> -----Original Message-----

> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-

> bounces@openvswitch.org] On Behalf Of Darrell Ball

> Sent: Thursday, August 17, 2017 7:46 PM

> To: Jan Scheurich <jan.scheurich@ericsson.com>; Darrell Ball

> <dlu998@gmail.com>

> Cc: dev@openvswitch.org

> Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert

> for recirc packets

> 

> 

> 

> On 8/17/17, 5:22 AM, "Jan Scheurich" <jan.scheurich@ericsson.com> wrote:

> 

>         The RSS hash threshold method looks like the only pseudo-random

> criterion that we can use that produces a consistent result for every packet of

> a flow and does not require more information. Of course elephant flows with an

> unlucky hash value might never get to use the EMC, but that risk we have

> with any stateless selection scheme.

> 

> 

> 

>     [Darrell] It is probably something I know by another name, but JTBC, can

> you define the “RSS hash threshold method” ?

> 

> 

> 

>     I am referring to Billy's proposal

> (https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/336509.html)

> 

> 

>     In essence, it suggests selecting for EMC lookup only packets whose RSS

> hash is above a certain threshold. The lookup probability is determined by

> the threshold (e.g. threshold of 0.75 * UINT32_MAX corresponds to 25%). It

> is pseudo-random as, assuming that the hash result is uniformly distributed,

> flows will profit from EMC lookup with the same probability.

> 

> 

> [Darrell] ahh, there is no actual patch yet, just an e-mail

>                 I see, you have coined the term “RSS hash threshold method” for

> the approach; the nomenclature makes sense now.

>                 I’ll have separate comments, of course, on the proposal itself.

> 

> 

> 

>         The new thing required will be the dynamic adjustment of lookup

> probability to the EMC fill level and/or hit ratio.

> 

> 

> 

>     [Darrell] Did you mean insertion probability rather than lookup probability ?

> 

> 

> 

>     No, I actually meant dynamic adaptation of lookup probability. We don't

> want to reduce the EMC lookup probability when the EMC is not yet

> overloaded, but only when the EMC hit rate degrades due to collisions.

> When we devise an algorithm to adapt lookup probability, we can study if it

> could make sense to also dynamically adjust the currently fixed

> (configurable) EMC insertion probability based on EMC fill level and/or hit

> rate.

> 

> 

> [Darrell] Now that I know what you are referring to above, it is a lot easier to

> make the linkage.

> 

> 

> 

>     BR, Jan

> 

> 

> 


Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index bea1c3f..8f6b96b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4663,6 +4663,9 @@  dp_netdev_queue_batches(struct dp_packet *pkt,
     packet_batch_per_flow_update(batch, pkt, mf);
 }
 
+/* Threshold to skip EMC for recirculated packets. */
+#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000
+
 /* Try to process all ('cnt') the 'packets' using only the exact match cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
@@ -4714,8 +4717,36 @@  emc_processing(struct dp_netdev_pmd_thread *pmd,
         key->len = 0; /* Not computed yet. */
         key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
 
-        /* If EMC is disabled skip emc_lookup */
-        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
+        /*
+         * EMC lookup is skipped when either or both of the following
+         * conditions hold:
+         *
+         *    - EMC is disabled.  This is detected from cur_min.
+         *
+         *    - The EMC occupancy exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD and
+         *      the packet to be classified is being recirculated.  In this
+         *      case EMC insertions are also skipped for recirculated
+         *      packets, so that the EMC only stores entries that are hit
+         *      by 'original' packets.  This mitigates EMC thrashing and
+         *      improves performance.
+         */
+        if (OVS_LIKELY(cur_min)) {
+            if (!md_is_valid) {
+                flow = emc_lookup(flow_cache, key);
+            } else {
+                /* Recirculated packet. */
+                if (flow_cache->n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) {
+                    /* EMC occupancy is over the threshold.  We skip EMC
+                     * lookup for recirculated packets. */
+                    flow = NULL;
+                } else {
+                    flow = emc_lookup(flow_cache, key);
+                }
+            }
+        } else {
+            flow = NULL;
+        }
+
         if (OVS_LIKELY(flow)) {
             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                     n_batches);
@@ -4800,7 +4831,20 @@  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
                                              add_actions->size);
         }
         ovs_mutex_unlock(&pmd->flow_mutex);
-        emc_probabilistic_insert(pmd, key, netdev_flow);
+        /* EMC insertion can be skipped by a probabilistic criterion or
+         * - in case of recirculated packets - depending on the number of
+         * EMC entries. */
+        if (!packet->md.recirc_id) {
+            emc_probabilistic_insert(pmd, key, netdev_flow);
+        } else {
+            /* Recirculated packets.  When EMC occupancy goes over
+             * a threshold we avoid inserting new entries. */
+            if (!(pmd->flow_cache.n_entries &
+                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
+                /* Still under the threshold. */
+                emc_probabilistic_insert(pmd, key, netdev_flow);
+            }
+        }
     }
 }
 
@@ -4893,7 +4937,20 @@  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
 
         flow = dp_netdev_flow_cast(rules[i]);
 
-        emc_probabilistic_insert(pmd, &keys[i], flow);
+        /* EMC insertion can be skipped by a probabilistic criterion or
+         * - in case of recirculated packets - depending on the number of
+         * EMC entries. */
+        if (!packet->md.recirc_id) {
+            emc_probabilistic_insert(pmd, &keys[i], flow);
+        } else {
+            /* Recirculated packets.  When EMC occupancy goes over
+             * a threshold we avoid inserting new entries. */
+            if (!(pmd->flow_cache.n_entries &
+                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
+                /* Still under the threshold. */
+                emc_probabilistic_insert(pmd, &keys[i], flow);
+            }
+        }
         dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
     }