
[ovs-dev,0/3] Output packet batching.

Message ID 1ea3e4e1-315c-d7ba-0ede-24bc636c67e4@samsung.com
State Superseded

Commit Message

Ilya Maximets July 4, 2017, 9:50 a.m. UTC
On 03.07.2017 20:33, Darrell Ball wrote:
> 
> 
> On 7/3/17, 7:31 AM, "ovs-dev-bounces@openvswitch.org on behalf of Jan Scheurich" <ovs-dev-bounces@openvswitch.org on behalf of jan.scheurich@ericsson.com> wrote:
> 
>     I like this generic approach to collect the packets to be output per port for each Rx batch in dpif-netdev. It is indeed simpler than the approach in [1].
>     
>     However, [1] originally had a larger scope, namely to buffer packets in an intermediate queue per netdev tx queue *across* multiple rx batches. Unfortunately this aspect was "optimized" away in v3 to minimize the impact on minimum and average latency for ports with little traffic.
>     
>     Limiting the output batching to a single rx batch has little benefit unless the average rx batch size is significant. According to our measurements with instrumented netdev-dpdk datapath this happens only very close to or above PMD saturation. At realistic production load levels (say 95% PMD processing cycles) the average rx batch size is still around 1-2 only, so the gain we can expect is negligible. The primary use case for this patch appears to be to push up the saturation throughput in benchmarks.
> 
> [Darrell]
> This patch series suggested by Ilya is mainly a subset of the series suggested by Bhanu here
> [1] https://patchwork.ozlabs.org/patch/706461/
> and hence the two are not really comparable.
> Ilya’s series/patch collects packets per rx batch and sends each subset as a tx batch per output. As pointed out here by Jan,
> and as has also come up many times before, real rx batch sizes are themselves small. Also, in general, an implementation should not
> rely on a very lucky distribution of input packets from one rx batch to provide some benefit. I also think the benefit
> will be negligible in the general case. It would be good to see what the extra cost is for the additional checking when 
> the general case of negligible benefit is in effect.
> 
> I agree that one valid concept suggested by Ilya’s patch set is putting the code in a common code path across
> netdevs. This may be an oversimplification in some cases. If tx batching belongs at the netdev layer in some cases,
> I don’t think we should put some of it in dpif-netdev and some in netdev.  Mixing layers for one function (i.e. tx batching)
> will be more confusing and less maintainable.
> 
> 
>     
>     From our perspective, a perhaps more important use case for tx batching would be to reduce the number of virtio interrupts a PMD needs to trigger in Qemu/KVM guests using the kernel virtio-net driver (currently one per tx batch).
>     
>     In internal benchmarks with kernel applications, we have observed that the OVS PMD spends up to 30% of its cycles kicking the eventfd, and the guest OS in the VM spends 50% or more of a vCPU processing virtio-net interrupts. This seriously degrades the performance of both OVS and the application.
>     
>     With a vhostuser tx batching patch that is not limited to a single Rx batch but relies on periodic flushing (e.g. every 50-100 us), we have been able to reduce this interrupt overhead and significantly improve e.g. the iperf performance.
>     
>     My suggestion would be to complement the generic output packet batching per rx batch in dpif-netdev in this patch with specific support for tx batching for vhostuser ports in netdev-dpdk, using an intermediate queue with a configurable flushing interval (including the possibility to turn buffering off when latency is an issue).
> 
> [Darrell]
> If some tx batching logically belongs at the netdev layer, then we should put all of it there, since in this 
> situation the behavior really is “class or subclass” specific and we should recognize that.
> I don’t think we should mix and match tx batching across layers.
> 
> The general idea of decoupling rx and tx, which is the ultimate direction of Bhanu’s patch, has merit in some situations, especially
> where added latency is less important than throughput. This would ultimately
> allow collecting packets of a TX batch across different RXs and over a longer time than a single RX batch.
> However, as Jan points out, there is going to be a tradeoff between latency and throughput; I don’t think this is a
> surprise to anyone. That logically leads us to more configuration knobs, which I think is OK here.
> 
> 
>     
>     BR, Jan
>     
>     > -----Original Message-----
>     > From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Ilya Maximets
>     > Sent: Friday, 30 June, 2017 14:03
>     > To: ovs-dev@openvswitch.org; Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
>     > Cc: Heetae Ahn <heetae82.ahn@samsung.com>; Ilya Maximets <i.maximets@samsung.com>
>     > Subject: [ovs-dev] [PATCH 0/3] Output packet batching.
>     > 
>     > This patch set was inspired by [1] from Bhanuprakash Bodireddy.
>     > The implementation in [1] looks very complex and introduces many pitfalls for
>     > later code modifications, such as possibly stuck packets.
>     > 
>     > This version aims to make output packet batching simple and flexible at a
>     > higher level, without complicating the netdev layer (and even simplifying it).
>     > 
>     > The patch set consists of 3 patches. All of the functionality is introduced in the
>     > first patch. The other two are just cleanups so that netdevs do not do unnecessary
>     > things.
>     > 
>     > Basic testing of the 'PVP with OVS bonding on phy ports' scenario shows a
>     > significant performance improvement.
>     > More accurate and intensive testing is required.
>     > 
>     > [1] [PATCH 0/6] netdev-dpdk: Use intermediate queue during packet transmission.
>     >     https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/334762.html
>     > 
>     > Ilya Maximets (3):
>     >   dpif-netdev: Output packet batching.
>     >   netdev: Remove unused may_steal.
>     >   netdev: Remove useless cutlen.
>     > 
>     >  lib/dpif-netdev.c     | 81 ++++++++++++++++++++++++++++++++++++---------------
>     >  lib/netdev-bsd.c      |  7 ++---
>     >  lib/netdev-dpdk.c     | 30 +++++++------------
>     >  lib/netdev-dummy.c    |  6 ++--
>     >  lib/netdev-linux.c    |  7 ++---
>     >  lib/netdev-provider.h |  7 ++---
>     >  lib/netdev.c          | 12 +++-----
>     >  lib/netdev.h          |  2 +-
>     >  8 files changed, 83 insertions(+), 69 deletions(-)
>     > 
>     > --
>     > 2.7.4
>     > 


Hi Darrell and Jan.
Thanks for looking at this. I agree with Darrell that mixing implementations
on two different levels is a bad idea, but as I already wrote in reply to
Bhanuprakash [2], there are no issues with implementing output batching
of more than one rx batch.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/334808.html

Look at the incremental below. This is how it may look:

---8<---------------------------------------------------------------->8---

---8<---------------------------------------------------------------->8---

I can easily change the milliseconds granularity to microseconds if you wish.
For the final version, part of this incremental will be squashed into the first patch.
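
For reference, with this incremental applied the knob is a single global one,
set in milliseconds via other_config (as documented in the vswitch.xml hunk at
the bottom), e.g.:

    ovs-vsctl set Open_vSwitch . other_config:output-max-latency=50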

One difference from the netdev-layer solution is that we batch only packets
received by the current thread, without mixing packets from different threads.
I think this is an advantage of this solution, because we will never
have performance issues connected to flushing from a non-local thread.
(see the issue description in my reply to Bhanuprakash [2])

Best regards, Ilya Maximets.

Comments

Bodireddy, Bhanuprakash July 4, 2017, 11:37 a.m. UTC | #1
Apologies for snipping the text. I did it to keep this thread readable. 

>
>Hi Darrell and Jan.
>Thanks for looking at this. I agree with Darrell that mixing implementations on
>two different levels is a bad idea, but as I already wrote in reply to
>Bhanuprakash [2], there are no issues with implementing output batching of
>more than one rx batch.
>
>[2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/334808.html
>
>Look at the incremental below. This is how it may look:
Hi Ilya,

I briefly went through the incremental patch and see that you introduced a config parameter
to tune the latency and built the logic around it. It may work, but we are back to the same question.

Is the dpif layer the right place to do all of this? Shouldn't this logic be part of the netdev layer, as tx batching
rightly belongs to the netdev layer? If a specific use case warrants tuning the queuing, flushing, and latency
parameters, let this be done at the netdev layer by providing more configs with acceptable defaults, and leave
the dpif layer simple as it is now.

You referred to performance issues with flushing triggered from a non-local thread (on a different NUMA node).
This may be because, in the lab, we simulate these cases and saturate the 10G link. But this may not be a very pressing
issue in real-world scenarios.

- Bhanuprakash.
Ilya Maximets July 4, 2017, 12:58 p.m. UTC | #2
On 04.07.2017 14:37, Bodireddy, Bhanuprakash wrote:
> Apologies for snipping the text. I did it to keep this thread readable. 
> 
>>
>> Hi Darrell and Jan.
>> Thanks for looking at this. I agree with Darrell that mixing implementations on
>> two different levels is a bad idea, but as I already wrote in reply to
>> Bhanuprakash [2], there are no issues with implementing output batching of
>> more than one rx batch.
>>
>> [2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/334808.html
>>
>> Look at the incremental below. This is how it may look:
> Hi Ilya,
> 
> I briefly went through the incremental patch and see that you introduced a config parameter
> to tune the latency and built the logic around it. It may work, but we are back to the same question.
> 
> Is the dpif layer the right place to do all of this? Shouldn't this logic be part of the netdev layer, as tx batching
> rightly belongs to the netdev layer?

Why do you think so? 'Batched RX' and 'per-flow batching' are parts of
dpif-netdev, and 'TX batching' logically belongs to that layer as well.

> If a specific use case warrants tuning the queuing, flushing, and latency
> parameters, let this be done at the netdev layer by providing more configs with acceptable defaults, and leave
> the dpif layer simple as it is now.

You will need to implement all the flushing logic at the dpif layer anyway
to avoid issues with lost and stuck packets. This means that you'll
need to implement almost the same code as in my incremental patch, with
the one difference that you'll call 'netdev_txq_flush' instead of
'dp_netdev_pmd_flush_output_on_port'.
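
To make that concrete, below is a minimal standalone sketch of the flush
policy the incremental implements (illustration only, not the actual OVS
code; names and constants are simplified stand-ins, e.g. flush_port() plays
the role of dp_netdev_pmd_flush_output_on_port()):

    /* Packets queue per output port; the queue is flushed when it is full,
     * when the configured max latency expires, or forcibly on teardown. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_BURST 32

    struct out_port {
        int n_queued;           /* Packets currently buffered for this port. */
        long long output_time;  /* Deadline (ms) for sending the current batch. */
    };

    static long long output_max_latency = 50;  /* other_config:output-max-latency */

    static void
    flush_port(struct out_port *p, long long now)
    {
        if (p->n_queued) {
            printf("t=%lld ms: send batch of %d packets\n", now, p->n_queued);
            p->n_queued = 0;
        }
    }

    static void
    queue_packet(struct out_port *p, long long now)
    {
        if (!p->n_queued) {
            /* First packet of a new batch starts the latency clock. */
            p->output_time = now + output_max_latency;
        }
        if (++p->n_queued >= MAX_BURST) {
            flush_port(p, now);            /* Batch full: send immediately. */
        }
    }

    static void
    maybe_flush(struct out_port *p, long long now, bool force)
    {
        if (p->n_queued && (force || p->output_time <= now)) {
            flush_port(p, now);
        }
    }

    int
    main(void)
    {
        struct out_port p = { 0, 0 };

        /* Pretend one packet arrives every 10 ms for 200 ms. */
        for (long long now = 0; now < 200; now += 10) {
            queue_packet(&p, now);
            maybe_flush(&p, now, false);
        }
        maybe_flush(&p, 200, true);        /* Forced flush, as on PMD exit. */
        return 0;
    }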

About the simplicity of dpif-netdev: the incremental looks large and complex, but
it mostly refactors the code implemented in the main patch. So it will not be
so complex after the squashing.

About 'acceptable defaults': I guess you mean different default values for
physical DPDK ports and vhost-user. But, as Jan mentioned, the main use case
for output batching of many rx batches is the kernel virtio-net driver in
the guest with notifications enabled. From my point of view, the DPDK virtio_pmd
driver is the main use case for an OvS+DPDK solution. This means that we will not
be able to choose default values for the knobs that are acceptable in both cases.

To provide more flexible configuration, we can make the config parameter
per-port configurable. This would allow us to use both kinds of guests
simultaneously, with the possibility of achieving good results on both.
But this can be done with either choice of implementation layer.
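
A hypothetical per-Interface version of the knob (the key name and placement
are only an assumption here, not part of any posted patch) could then be set
like:

    ovs-vsctl set Interface vhost-user-1 other_config:output-max-latency=50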

> You referred to performance issues with flushing triggered from a non-local thread (on a different NUMA node).
> This may be because, in the lab, we simulate these cases and saturate the 10G link. But this may not be a very pressing
> issue in real-world scenarios.

I have a few real scenarios where the same amount of traffic goes to a single VM
from 2 different NUMA nodes simultaneously. I expect performance degradation
in this scenario if the packets are flushed from a different thread, because
the cross-NUMA datapath is much slower. (I measured that on my system the
cross-NUMA path is around 20-25% slower.)

Unfortunately, I don't have a lab with these scenarios set up right now to
run tests.

To be clear, I'm not really a fan of time-based output batching of multiple
rx batches. I like the patch as it is, without the incremental.
But if Darrell and Jan think that it's really needed, I can add it in the way
implemented in the incremental patch. I prefer to keep the default max output
latency set to 0 because it covers the main use case for OvS+DPDK: a DPDK-enabled guest.

Best regards, Ilya Maximets.

Patch

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 38282bd..360e737 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -710,6 +710,12 @@  dp_packet_batch_is_empty(const struct dp_packet_batch *batch)
     return !dp_packet_batch_size(batch);
 }
 
+static inline bool
+dp_packet_batch_is_full(const struct dp_packet_batch *batch)
+{
+    return dp_packet_batch_size(batch) == NETDEV_MAX_BURST;
+}
+
 #define DP_PACKET_BATCH_FOR_EACH(PACKET, BATCH)    \
     for (size_t i = 0; i < dp_packet_batch_size(BATCH); i++)  \
         if (PACKET = BATCH->packets[i], true)
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index e49f665..181dcb8 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -84,6 +84,9 @@  VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 #define MAX_RECIRC_DEPTH 5
 DEFINE_STATIC_PER_THREAD_DATA(uint32_t, recirc_depth, 0)
 
+/* Use instant packet send by default. */
+#define DEFAULT_OUTPUT_MAX_LATENCY 0
+
 /* Configuration parameters. */
 enum { MAX_FLOWS = 65536 };     /* Maximum number of flows in flow table. */
 enum { MAX_METERS = 65536 };    /* Maximum number of meters. */
@@ -262,6 +265,9 @@  struct dp_netdev {
     struct ovs_mutex meter_locks[N_METER_LOCKS];
     struct dp_meter *meters[MAX_METERS]; /* Meter bands. */
 
+    /* The time that a packet can wait in output batch for sending. */
+    atomic_uint32_t output_max_latency;
+
     /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
 
@@ -494,6 +500,7 @@  struct tx_port {
     int qid;
     long long last_used;
     struct hmap_node node;
+    long long output_time;
     struct dp_packet_batch output_pkts;
 };
 
@@ -660,7 +667,7 @@  static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
                                        struct rxq_poll *poll)
     OVS_REQUIRES(pmd->port_mutex);
 static void dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *,
-                                               long long now);
+                                               long long now, bool force);
 static void reconfigure_datapath(struct dp_netdev *dp)
     OVS_REQUIRES(dp->port_mutex);
 static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
@@ -1182,6 +1189,7 @@  create_dp_netdev(const char *name, const struct dpif_class *class,
 
     conntrack_init(&dp->conntrack);
 
+    atomic_init(&dp->output_max_latency, DEFAULT_OUTPUT_MAX_LATENCY);
     atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
 
     cmap_init(&dp->poll_threads);
@@ -2848,7 +2856,7 @@  dpif_netdev_execute(struct dpif *dpif, struct dpif_execute *execute)
     dp_packet_batch_init_packet(&pp, execute->packet);
     dp_netdev_execute_actions(pmd, &pp, false, execute->flow,
                               execute->actions, execute->actions_len, now);
-    dp_netdev_pmd_flush_output_packets(pmd, now);
+    dp_netdev_pmd_flush_output_packets(pmd, now, true);
 
     if (pmd->core_id == NON_PMD_CORE_ID) {
         ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -2897,6 +2905,16 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
         smap_get_ullong(other_config, "emc-insert-inv-prob",
                         DEFAULT_EM_FLOW_INSERT_INV_PROB);
     uint32_t insert_min, cur_min;
+    uint32_t output_max_latency, cur_max_latency;
+
+    output_max_latency = smap_get_int(other_config, "output-max-latency",
+                                      DEFAULT_OUTPUT_MAX_LATENCY);
+    atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
+    if (output_max_latency != cur_max_latency) {
+        atomic_store_relaxed(&dp->output_max_latency, output_max_latency);
+        VLOG_INFO("Output maximum latency set to %"PRIu32" ms",
+                  output_max_latency);
+    }
 
     if (!nullable_string_is_equal(dp->pmd_cmask, cmask)) {
         free(dp->pmd_cmask);
@@ -3085,26 +3103,34 @@  cycles_count_end(struct dp_netdev_pmd_thread *pmd,
 }
 
 static void
+dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
+                                   struct tx_port *p, long long now)
+{
+    int tx_qid;
+    bool dynamic_txqs;
+
+    dynamic_txqs = p->port->dynamic_txqs;
+    if (dynamic_txqs) {
+        tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p, now);
+    } else {
+        tx_qid = pmd->static_tx_qid;
+    }
+
+    netdev_send(p->port->netdev, tx_qid, &p->output_pkts,
+                dynamic_txqs);
+    dp_packet_batch_init(&p->output_pkts);
+}
+
+static void
 dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
-                                   long long now)
+                                   long long now, bool force)
 {
     struct tx_port *p;
 
     HMAP_FOR_EACH (p, node, &pmd->send_port_cache) {
-        if (!dp_packet_batch_is_empty(&p->output_pkts)) {
-            int tx_qid;
-            bool dynamic_txqs;
-
-            dynamic_txqs = p->port->dynamic_txqs;
-            if (dynamic_txqs) {
-                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p, now);
-            } else {
-                tx_qid = pmd->static_tx_qid;
-            }
-
-            netdev_send(p->port->netdev, tx_qid, &p->output_pkts,
-                        dynamic_txqs);
-            dp_packet_batch_init(&p->output_pkts);
+        if (!dp_packet_batch_is_empty(&p->output_pkts)
+            && (force || p->output_time <= now)) {
+            dp_netdev_pmd_flush_output_on_port(pmd, p, now);
         }
     }
 }
@@ -3128,7 +3154,7 @@  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
 
         cycles_count_start(pmd);
         dp_netdev_input(pmd, &batch, port_no, now);
-        dp_netdev_pmd_flush_output_packets(pmd, now);
+        dp_netdev_pmd_flush_output_packets(pmd, now, false);
         cycles_count_end(pmd, PMD_CYCLES_PROCESSING);
     } else if (error != EAGAIN && error != EOPNOTSUPP) {
         static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
@@ -3663,6 +3689,8 @@  pmd_free_cached_ports(struct dp_netdev_pmd_thread *pmd)
 {
     struct tx_port *tx_port_cached;
 
+    /* Flush all the queued packets. */
+    dp_netdev_pmd_flush_output_packets(pmd, 0, true);
     /* Free all used tx queue ids. */
     dpif_netdev_xps_revalidate_pmd(pmd, 0, true);
 
@@ -4388,6 +4416,7 @@  dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
 
     tx->port = port;
     tx->qid = -1;
+    tx->output_time = 0LL;
     dp_packet_batch_init(&tx->output_pkts);
 
     hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
@@ -5054,8 +5083,18 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
             }
             dp_packet_batch_apply_cutlen(packets_);
 
+            if (dp_packet_batch_is_empty(&p->output_pkts)) {
+                uint32_t cur_max_latency;
+
+                atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
+                p->output_time = now + cur_max_latency;
+            }
+
             DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
                 dp_packet_batch_add(&p->output_pkts, packet);
+                if (OVS_UNLIKELY(dp_packet_batch_is_full(&p->output_pkts))) {
+                    dp_netdev_pmd_flush_output_on_port(pmd, p, now);
+                }
             }
             return;
         }
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index abfe397..34ee908 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -344,6 +344,21 @@ 
         </p>
       </column>
 
+      <column name="other_config" key="output-max-latency"
+              type='{"type": "integer", "minInteger": 0, "maxInteger": 1000}'>
+        <p>
+          Specifies the time in milliseconds that a packet can wait in output
+          batch for sending i.e. amount of time that packet can spend in an
+          intermediate output queue before sending to netdev.
+          This option can be used to configure balance between throughput
+          and latency. Lower values decreases latency while higher values
+          may be useful to achieve higher performance.
+        </p>
+        <p>
+          Defaults to 0 i.e. instant packet sending (latency optimized).
+        </p>
+      </column>
+
       <column name="other_config" key="n-handler-threads"
               type='{"type": "integer", "minInteger": 1}'>
         <p>