
[ovs-dev,RFC,2/2] dpif-netdev: Use microseconds granularity for output-max-latency.

Message ID 1502712602-17223-2-git-send-email-i.maximets@samsung.com
State RFC

Commit Message

Ilya Maximets Aug. 14, 2017, 12:10 p.m. UTC
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
---
 lib/dpif-netdev.c    | 16 +++++++++-------
 vswitchd/vswitch.xml |  5 +++--
 2 files changed, 12 insertions(+), 9 deletions(-)

Comments

Jan Scheurich Sept. 2, 2017, 3:14 p.m. UTC | #1
Hi,

Vishal and I have been benchmarking the impact of the various Tx-batching patches on the performance of OVS in the phy-VM-phy scenario with different applications running in the VM:

The OVS versions we tested are:

(master):	OVS master (
(Ilya-3): 	Output batching within one Rx batch :
		(master) + [PATCH v3 1-3/4] Output packet batching
(Ilya-6): 	Time-based output batching with us resolution using CLOCK_MONOTONIC
		(Ilya-3) +  [PATCH RFC v3 4/4] dpif-netdev: Time based output batching + 
		[PATCH RFC 1/2] timeval: Introduce time_usec() + 
		[PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.
(Ilya-4-Jan):	Time-based output batching with us resolution using TSC cycles
		(Ilya-3) +  [PATCH RFC v3 4/4] dpif-netdev: Time based output batching + 
		Incremental patch using TSC cycles in 
		https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337402.html

Application 1: iperf server as a representative of kernel-based applications:

The iperf server executes in a VM with 2 vCPUs, where both the virtio interrupts and the iperf process are pinned to the same vCPU for best performance. The iperf client also runs in a VM on a different server. The OVS nodes on the client and server sides are configured identically.

                Iperf                                 iperf CPU  Ping
OVS version      Gbps   Avg.PMD cycles/pkt  PMD util  host util  rtt
------------------------------------------------------------------------
Master           6.83        1708.63        43.50%      100%     39 us
Ilya-3           6.88        1951.35        47.17%      100%     40 us
Ilya-6 50 us     7.83        1049.21        31.74%      99.7%   228 us
Ilya-4-Jan 50 us 7.75        1086.2         30.65%      99.7%   230 us

Discussion:
- Without time-based Tx batching the iperf server CPU is the bottleneck due to virtio interrupt load.
- Ilya-3 does not provide any benefit.
- With 50 us time-based batching the PMD load drops by about one third (fewer kicks of the virtio eventfd).
- The iperf throughput increases by 15%, still limited by the vCPU capacity, but the bottleneck moves from the virtio interrupt handlers in the guest kernel to the TCP stack and the iperf process. With multiple threads, iperf can fully load the 10G physical link.
- As expected, the RTT increases by about 190 us ~= 4 * 50 us (two OVS hops each on the server and the client side).
- There is no significant difference between the CLOCK_MONOTONIC and the TSC-based implementations.


Application 2: dpdk pktgen as a representative of DPDK applications:

OVS version  max-latency  Mpps   Avg.PMD cycles/pkt  PMD utilization
----------------------------------------------------------------------
Master       n/a          3.92        305.43         99.65%
Ilya-3       n/a          3.84        310.58         99.31%
Ilya-6       0 us         3.82        312.47         99.67%
Ilya-6       50 us        3.80        314.60         99.65%
Ilya-4-Jan   50 us        3.78        313.65         98.86%

Discussion:
- For DPDK applications in the VM, Tx batching does not provide any throughput benefit.
- At full PMD load the output batching overhead causes a capacity drop of 2-3%.
- There is no significant difference between CLOCK_MONOTONIC and TSC implementations.
- perf top measurements indicate that the clock_gettime system call consumes about 0.6% of the PMD cycles. This does not seem enough to justify replacing it with a TSC-based time implementation (see the standalone microbenchmark sketch below).
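
For reference, a minimal standalone microbenchmark (not part of any of the patches; just an illustration) that estimates the per-call cost of reading CLOCK_MONOTONIC on a given host:

    /* Build: cc -O2 -o clk_bench clk_bench.c */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        enum { CALLS = 10 * 1000 * 1000 };
        struct timespec start, now, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < CALLS; i++) {
            /* The call a PMD would issue once per iteration. */
            clock_gettime(CLOCK_MONOTONIC, &now);
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9
                    + (end.tv_nsec - start.tv_nsec);
        printf("avg: %.1f ns per clock_gettime() call\n", ns / CALLS);
        return 0;
    }

On typical Linux systems this clock is serviced via the vDSO, so the per-call cost is only a few tens of nanoseconds, in line with the small share of PMD cycles observed above.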

A zip file with the detailed measurement results can be downloaded from 
https://drive.google.com/open?id=0ByBuumQUR_NYNlRzbUhJX2R6NW8


Conclusions: 
-----------------
1. Time-based Tx batching provides significant performance improvements for kernel-based applications.
2. DPDK applications do not benefit in throughput but suffer from the latency increase.
3. The worst-case overhead introduced by Tx batching is about 3%, which should be acceptable.
4. As there is an obvious trade-off between throughput improvement and latency increase, the maximum output latency should be a configuration option. Ideally, OVS should have a per-switch default parameter and an additional per-interface parameter to override the default (see the example command below).
5. Ilya's CLOCK_MONOTONIC implementation seems efficient enough; there is no urgent need to replace it with a TSC-based clock.
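
For illustration, with the patches as posted the knob is a key in the Open_vSwitch table's other_config (expressed in microseconds after this patch), so a 50 us latency budget would be configured roughly as follows; the exact table, key name and a possible per-interface override are of course still open for discussion:

    ovs-vsctl set Open_vSwitch . other_config:output-max-latency=50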

Regards, Jan and Vishal

> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets@samsung.com]
> Sent: Monday, 14 August, 2017 14:10
> To: ovs-dev@openvswitch.org; Jan Scheurich
> <jan.scheurich@ericsson.com>
> Cc: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>;
> Heetae Ahn <heetae82.ahn@samsung.com>; Vishal Deep Ajmera
> <vishal.deep.ajmera@ericsson.com>; Ilya Maximets
> <i.maximets@samsung.com>
> Subject: [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for
> output-max-latency.
Jan Scheurich Sept. 20, 2017, 2:57 p.m. UTC | #2
Hi Ilya,

I have spent some more time analyzing and thinking through your latest proposed patch set for time-based Tx batching:

> (Ilya-6): 	Time-based output batching with us resolution using CLOCK_MONOTONIC
> 		(master) + [PATCH v3 1-3/4] Output packet batching +
> 		[PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
> 		[PATCH RFC 1/2] timeval: Introduce time_usec() +
> 		[PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.

I would like to suggest that you re-spin a new version where you integrate the last three RFC patches as non-RFC with the following changes/additions:

1. Fold in patch http://patchwork.ozlabs.org/patch/800276/ (dpif-netdev: Keep latest measured time for PMD thread) to store the time with us resolution in the PMD struct. That may seem like a small optimization, but it makes the code much cleaner and helps avoid unnecessary extra system calls to read CLOCK_MONOTONIC (a simplified sketch of the idea follows).
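
A minimal standalone sketch of that idea (the struct and function names here are illustrative, not the ones from the referenced patch): read CLOCK_MONOTONIC once per PMD iteration and let every consumer use the cached microsecond value:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Illustrative stand-in for struct dp_netdev_pmd_thread; the real patch
     * caches the time inside the PMD thread structure. */
    struct pmd_ctx {
        int64_t now_us;     /* Monotonic time cached once per iteration. */
    };

    static int64_t
    monotonic_usec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
    }

    /* Called once at the top of each PMD loop iteration; flush decisions,
     * XPS revalidation, flow stats etc. then read the cached value instead
     * of issuing their own clock_gettime() calls. */
    static void
    pmd_refresh_time(struct pmd_ctx *pmd)
    {
        pmd->now_us = monotonic_usec();
    }

    int
    main(void)
    {
        struct pmd_ctx pmd;

        pmd_refresh_time(&pmd);
        printf("cached time: %lld us\n", (long long) pmd.now_us);
        return 0;
    }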

2. Don't set port->output_time when you enqueue a new batch to an output port in dp_execute_cb(), but when you actually send a batch to the netdev in dp_netdev_pmd_flush_output_on_port(). This still ensures that we don't flush more frequently than specified by cur_max_latency (unless the batch size limit is reached), but it avoids any unnecessary delay when packets arrive at intervals larger than cur_max_latency (at 50 us this would be the case for packet rates below 20 Kpps!). In that case each packet (batch) is flushed immediately at the end of its iteration, just as with non-time-based Tx batching (see the sketch below).

In this context it might be good to rename the configuration parameter to something like "tx-batch-gap".
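
To illustrate the suggested policy, here is a minimal standalone sketch (out_port, next_flush_us, TX_BATCH_GAP_US and the batch-size constant are illustrative names, not the actual identifiers from the patches). The key point is that the gap timestamp is taken when a batch is sent, so it only throttles how often we flush:

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BATCH_SIZE      32
    #define TX_BATCH_GAP_US 50      /* The (possibly renamed) knob. */

    struct out_port {
        int pkt_count;              /* Packets queued for output. */
        int64_t next_flush_us;      /* Earliest allowed next flush. */
    };

    static bool
    flush_due(const struct out_port *p, int64_t now_us)
    {
        return p->pkt_count >= BATCH_SIZE || now_us >= p->next_flush_us;
    }

    static void
    flush_port(struct out_port *p, int64_t now_us)
    {
        printf("flushing %d packet(s) at t=%" PRId64 " us\n",
               p->pkt_count, now_us);
        p->pkt_count = 0;
        /* Stamp at flush time: this throttles the flush frequency but never
         * delays a packet that arrives after an idle gap. */
        p->next_flush_us = now_us + TX_BATCH_GAP_US;
    }

    int
    main(void)
    {
        struct out_port p = { .pkt_count = 0, .next_flush_us = 0 };

        /* A single packet arrives at t = 1000 us after a long idle period:
         * it is flushed in the same iteration, with no 50 us wait. */
        p.pkt_count = 1;
        if (flush_due(&p, 1000)) {
            flush_port(&p, 1000);
        }
        return 0;
    }

With the timestamp taken at enqueue time instead, the same lone packet would sit in the output queue for up to the full 50 us gap before being sent.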

3. Considering that time-based Tx batching is beneficial if and only if the guest virtio driver is interrupt-based, I believe it would be best if OVS automatically applied time-based Tx batching to vhostuser tx queues for which the driver has requested interrupts. Unfortunately, this information is today hidden deep inside DPDK's rte_vhost library (file virtio_net.c):

	/* Kick the guest if necessary. */
	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
			&& (vq->callfd >= 0))
		eventfd_write(vq->callfd, (eventfd_t)1);
	return count;

So to automate this, we'd need a new library function in rte_vhost so that OVS can query this queue property. Perhaps it is not too late to get this into DPDK 17.11. How this would interact with the vhostuser PMD also needs to be considered. A purely hypothetical sketch of such an API is shown below.
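
To make the proposal concrete, the new helper could look something like the following. This is purely a hypothetical sketch, not an existing DPDK API, and the OVS-side helper name is invented for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical rte_vhost addition (this function does NOT exist in DPDK
     * today): report whether the guest driver currently requests call
     * interrupts on the given virtqueue, i.e. whether
     * VRING_AVAIL_F_NO_INTERRUPT is clear and callfd is valid.
     * Returns 1 if interrupts are requested, 0 if not, negative on error. */
    int rte_vhost_tx_interrupt_requested(int vid, uint16_t queue_id);

    /* Equally hypothetical use on the OVS side: apply time-based Tx batching
     * only to vhost-user tx queues of interrupt-driven guests. */
    static inline bool
    vhost_txq_wants_tx_batching(int vid, uint16_t qid)
    {
        return rte_vhost_tx_interrupt_requested(vid, qid) > 0;
    }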

Having to configure time-based Tx batching per port is only a second-best option. Nova in OpenStack, for example, does not know whether time-based Tx batching is appropriate for a vhostuser port, and there is no Neutron port attribute today that would help determine that.

Thanks, Jan


> -----Original Message-----
> From: ovs-dev-bounces@openvswitch.org [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Saturday, 02 September, 2017 17:14
> To: dev@openvswitch.org; Ilya Maximets <i.maximets@samsung.com>
> Subject: Re: [ovs-dev] [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 0d78ae4..cf1591c 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2825,7 +2825,7 @@  dpif_netdev_execute(struct dpif *dpif, struct dpif_execute *execute)
     struct dp_netdev *dp = get_dp_netdev(dpif);
     struct dp_netdev_pmd_thread *pmd;
     struct dp_packet_batch pp;
-    long long now = time_msec();
+    long long now = time_usec();
 
     if (dp_packet_size(execute->packet) < ETH_HEADER_LEN ||
         dp_packet_size(execute->packet) > UINT16_MAX) {
@@ -2925,7 +2925,7 @@  dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
     if (output_max_latency != cur_max_latency) {
         atomic_store_relaxed(&dp->output_max_latency, output_max_latency);
-        VLOG_INFO("Output maximum latency set to %"PRIu32" ms",
+        VLOG_INFO("Output maximum latency set to %"PRIu32" us",
                   output_max_latency);
     }
 
@@ -3166,7 +3166,7 @@  dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
     }
 
     if (!now) {
-        now = time_msec();
+        now = time_usec();
     }
 
     HMAP_FOR_EACH (p, node, &pmd->send_port_cache) {
@@ -3190,7 +3190,7 @@  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
     dp_packet_batch_init(&batch);
     error = netdev_rxq_recv(rx, &batch);
     if (!error) {
-        long long now = time_msec();
+        long long now = time_usec();
 
         *recirc_depth_get() = 0;
 
@@ -3768,7 +3768,7 @@  dpif_netdev_run(struct dpif *dpif)
         }
 
         cycles_count_end(non_pmd, PMD_CYCLES_IDLE);
-        dpif_netdev_xps_revalidate_pmd(non_pmd, time_msec(), false);
+        dpif_netdev_xps_revalidate_pmd(non_pmd, time_usec(), false);
         ovs_mutex_unlock(&dp->non_pmd_mutex);
 
         dp_netdev_pmd_unref(non_pmd);
@@ -4742,7 +4742,7 @@  packet_batch_per_flow_execute(struct packet_batch_per_flow *batch,
     struct dp_netdev_flow *flow = batch->flow;
 
     dp_netdev_flow_used(flow, batch->array.count, batch->byte_count,
-                        batch->tcp_flags, now);
+                        batch->tcp_flags, now / 1000);
 
     actions = dp_netdev_flow_get_actions(flow);
 
@@ -5111,7 +5111,7 @@  dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
         if (!tx->port->dynamic_txqs) {
             continue;
         }
-        interval = now - tx->last_used;
+        interval = now / 1000 - tx->last_used;
         if (tx->qid >= 0 && (purge || interval >= XPS_TIMEOUT_MS)) {
             port = tx->port;
             ovs_mutex_lock(&port->txq_used_mutex);
@@ -5132,6 +5132,8 @@  dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
 
     if (OVS_UNLIKELY(!now)) {
         now = time_msec();
+    } else {
+        now /= 1000;
     }
 
     interval = now - tx->last_used;
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 23930f0..1c6ae7c 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -345,9 +345,10 @@ 
       </column>
 
       <column name="other_config" key="output-max-latency"
-              type='{"type": "integer", "minInteger": 0, "maxInteger": 1000}'>
+              type='{"type": "integer",
+                     "minInteger": 0, "maxInteger": 1000000}'>
         <p>
-          Specifies the time in milliseconds that a packet can wait in output
+          Specifies the time in microseconds that a packet can wait in output
           batch for sending i.e. amount of time that packet can spend in an
           intermediate output queue before sending to netdev.
           This option can be used to configure balance between throughput