[ovs-dev,RFC,v3,4/4] dpif-netdev: Time based output batching.

Message ID CFF8EF42F1132E4CBE2BF0AB6C21C58D72737996@ESESSMB107.ericsson.se
State RFC
Headers show

Commit Message

Jan Scheurich Aug. 14, 2017, 1:12 p.m. UTC
> From earlier in-house trials we know we need to target flush times of 50
> us or less, so we clearly need better time resolution. Sub-ms timing in PMD
> should be based on TSC cycles, which are already kept in the pmd struct.
> Could you provide a corresponding patch for performance testing?
> 
> I don't think that TSC is suitable in this case. Some reasons:
> 
> * non-PMD threads are able to float across cpu cores.
> * Turbo-boost can be enabled or frequency can be adjusted manually after
> DPDK init.
> * TSC cycles only calculated if DPDK enabled.
> 
> TSC is currently used only for statistics, which are not really precise.
> For real features we need more accurate time accounting.
> 
> I believe that CLOCK_MONOTONIC is able to provide at least microsecond
> granularity on most systems. We just need to add one more wrapper
> function like 'time_usec()' to the lib/timeval.
> 

We have tested the effect of turbo mode on TSC and there is none. The TSC frequency remains at the nominal clock speed, no matter if the core is clocked down or up. So, I believe for PMD threads (where performance matters) TSC would be an adequate and efficient clock.

On PMDs I am a bit concerned about the overhead/latency introduced with the clock_gettime() system call, but I haven't done any measurements to check the actual impact. Have you?

If we go for CLOCK_MONOTONIC in microsecond resolution, we should make sure that the clock is read not more than once every iteration (and cache the us value as now in the pmd ctx struct as suggested in your other patch). But then for consistency also the XPS feature should use the PMD time in us resolution.

For non-PMD thread we could actually skip time-based output batching completely. The packet rates and the frequency of calls to dpif_netdev_run() in the main ovs-vswitchd thread are so low that time-based flushing doesn't seem to make much sense.

Below you can find an alternative incremental patch on top of your RFC 4/4 that uses TSC on PMD. We will be comparing the two alternatives for performance both with non-PMD guests (iperf3) as well as PMD guests (DPDK testpmd).

BR, Jan

Comments

Ilya Maximets Aug. 14, 2017, 1:35 p.m. UTC | #1
On 14.08.2017 16:12, Jan Scheurich wrote:
>> From earlier in-house trials we know we need to target flush times of 50
>> us or less, so we clearly need better time resolution. Sub-ms timing in PMD
>> should be based on TSC cycles, which are already kept in the pmd struct.
>> Could you provide a corresponding patch for performance testing?
>>
>> I don't think that TSC is suitable in this case. Some reasons:
>>
>> * non-PMD threads are able to float across cpu cores.
>> * Turbo-boost can be enabled or frequency can be adjusted manually after
>> DPDK init.
>> * TSC cycles only calculated if DPDK enabled.
>>
>> TSC is currently used only for statistics, which are not really precise.
>> For real features we need more accurate time accounting.
>>
>> I believe that CLOCK_MONOTONIC is able to provide at least microsecond
>> granularity on most systems. We just need to add one more wrapper
>> function like 'time_usec()' to the lib/timeval.
>>
> 
> We have tested the effect of turbo mode on TSC and there is none. The TSC frequency remains at the nominal clock speed, no matter if the core is clocked down or up. So, I believe for PMD threads (where performance matters) TSC would be an adequate and efficient clock.

It's highly platform dependent and testing on a few systems doesn't guarantee anything.
On the other hand, POSIX guarantees the monotonic characteristics of CLOCK_MONOTONIC.

> 
> On PMDs I am a bit concerned about the overhead/latency introduced with the clock_gettime() system call, but I haven't done any measurements to check the actual impact. Have you?

Have you seen my incremental patches?
There is no overhead, because we're just replacing 'time_msec' with 'time_usec'.
No difference except converting timespec to usec instead of msec.

> 
> If we go for CLOCK_MONOTONIC in microsecond resolution, we should make sure that the clock is read not more than once every iteration (and cache the us value as now in the pmd ctx struct as suggested in your other patch). But then for consistency also the XPS feature should use the PMD time in us resolution.

Again, please, look at my incremental patches.

> 
> For non-PMD thread we could actually skip time-based output batching completely. The packet rates and the frequency of calls to dpif_netdev_run() in the main ovs-vswitchd thread are so low that time-based flushing doesn't seem to make much sense.
> 
> Below you can find an alternative incremental patch on top of your RFC 4/4 that uses TSC on PMD. We will be comparing the two alternatives for performance both with non-PMD guests (iperf3) as well as PMD guests (DPDK testpmd).

In your version you need to move all the output_batching related code
under #ifdef DPDK_NETDEV because it will break userspace networking if
compiled without dpdk and output-max-latency != 0.

> 
> BR, Jan
> 
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 0d78ae4..8285786 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -265,7 +265,7 @@ struct dp_netdev {
>      struct hmap ports;
>      struct seq *port_seq;       /* Incremented whenever a port changes. */
>  
> -    /* The time that a packet can wait in output batch for sending. */
> +    /* Time in cycles that a packet can wait in output batch for sending. */
>      atomic_uint32_t output_max_latency;
>  
>      /* Meters. */
> @@ -508,7 +508,7 @@ struct tx_port {
>      int qid;
>      long long last_used;
>      struct hmap_node node;
> -    long long output_time;
> +    long long flush_time;   /* Time in TSC cycles to flush output buffer. */
>      struct dp_packet_batch output_pkts;
>  };
>  
> @@ -622,6 +622,7 @@ struct dpif_netdev {
>      uint64_t last_port_seq;
>  };
>  
> +static inline unsigned long cycles_per_microsecond(void);
>  static int get_port_by_number(struct dp_netdev *dp, odp_port_t port_no,
>                                struct dp_netdev_port **portp)
>      OVS_REQUIRES(dp->port_mutex);
> @@ -2921,11 +2922,12 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      uint32_t output_max_latency, cur_max_latency;
>  
>      output_max_latency = smap_get_int(other_config, "output-max-latency",
> -                                      DEFAULT_OUTPUT_MAX_LATENCY);
> +                                      DEFAULT_OUTPUT_MAX_LATENCY)
> +                         * cycles_per_microsecond();
>      atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
>      if (output_max_latency != cur_max_latency) {
>          atomic_store_relaxed(&dp->output_max_latency, output_max_latency);
> -        VLOG_INFO("Output maximum latency set to %"PRIu32" ms",
> +        VLOG_INFO("Output maximum latency set to %"PRIu32" cycles",
>                    output_max_latency);
>      }
>  
> @@ -3091,6 +3093,16 @@ cycles_counter(void)
>  #endif
>  }
>  
> +static inline unsigned long
> +cycles_per_microsecond(void)
> +{
> +#ifdef DPDK_NETDEV
> +    return rte_get_tsc_hz() / (1000 * 1000);
> +#else
> +    return 0;
> +#endif
> +}
> +
>  /* Fake mutex to make sure that the calls to cycles_count_* are balanced */
>  extern struct ovs_mutex cycles_counter_fake_mutex;
>  
> @@ -3171,7 +3183,7 @@ dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>  
>      HMAP_FOR_EACH (p, node, &pmd->send_port_cache) {
>          if (!dp_packet_batch_is_empty(&p->output_pkts)
> -            && (force || p->output_time <= now)) {
> +            && (force || p->flush_time <= pmd->last_cycles)) {
>              output_cnt += dp_netdev_pmd_flush_output_on_port(pmd, p, now);
>          }
>      }
> @@ -3196,7 +3208,6 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>  
>          batch_cnt = batch.count;
>          dp_netdev_input(pmd, &batch, port_no, now);
> -        output_cnt = dp_netdev_pmd_flush_output_packets(pmd, now, false);
>      } else if (error != EAGAIN && error != EOPNOTSUPP) {
>          static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>  
> @@ -3732,7 +3743,6 @@ dpif_netdev_run(struct dpif *dpif)
>      struct dp_netdev_pmd_thread *non_pmd;
>      uint64_t new_tnl_seq;
>      int process_packets;
> -    bool need_to_flush = true;
>  
>      ovs_mutex_lock(&dp->port_mutex);
>      non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID);
> @@ -3751,21 +3761,16 @@ dpif_netdev_run(struct dpif *dpif)
>                      cycles_count_intermediate(non_pmd, process_packets
>                                                         ? PMD_CYCLES_PROCESSING
>                                                         : PMD_CYCLES_IDLE);
> -                    if (process_packets) {
> -                        need_to_flush = false;
> -                    }
>                  }
>              }
>          }
> -        if (need_to_flush) {
> -            /* We didn't receive anything in the process loop.
> -             * Check if we need to send something. */
> -            process_packets = dp_netdev_pmd_flush_output_packets(non_pmd,
> -                                                                 0, false);
> -            cycles_count_intermediate(non_pmd, process_packets
> -                                               ? PMD_CYCLES_PROCESSING
> -                                               : PMD_CYCLES_IDLE);
> -        }
> +
> +        /* Check if queued packets need to be flushed. */
> +        process_packets =
> +                dp_netdev_pmd_flush_output_packets(non_pmd, 0, false);
> +        cycles_count_intermediate(non_pmd, process_packets
> +                                           ? PMD_CYCLES_PROCESSING
> +                                           : PMD_CYCLES_IDLE);
>  
>          cycles_count_end(non_pmd, PMD_CYCLES_IDLE);
>          dpif_netdev_xps_revalidate_pmd(non_pmd, time_msec(), false);
> @@ -3946,7 +3951,6 @@ reload:
>      cycles_count_start(pmd);
>      for (;;) {
>          int process_packets;
> -        bool need_to_flush = true;
>  
>          for (i = 0; i < poll_cnt; i++) {
>              process_packets =
> @@ -3955,20 +3959,14 @@ reload:
>              cycles_count_intermediate(pmd,
>                                        process_packets ? PMD_CYCLES_PROCESSING
>                                                        : PMD_CYCLES_IDLE);
> -            if (process_packets) {
> -                need_to_flush = false;
> -            }
>          }
>  
> -        if (need_to_flush) {
> -            /* We didn't receive anything in the process loop.
> -             * Check if we need to send something. */
> -            process_packets = dp_netdev_pmd_flush_output_packets(pmd,
> -                                                                 0, false);
> -            cycles_count_intermediate(pmd,
> -                                      process_packets ? PMD_CYCLES_PROCESSING
> -                                                      : PMD_CYCLES_IDLE);
> -        }
> +        /* Check if queued packets need to be flushed. */
> +        process_packets =
> +                dp_netdev_pmd_flush_output_packets(pmd, 0, false);
> +        cycles_count_intermediate(pmd,
> +                                  process_packets ? PMD_CYCLES_PROCESSING
> +                                                  : PMD_CYCLES_IDLE);
>  
>          if (lc++ > 1024) {
>              bool reload;
> @@ -4593,7 +4591,7 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
>  
>      tx->port = port;
>      tx->qid = -1;
> -    tx->output_time = 0LL;
> +    tx->flush_time = 0LL;
>      dp_packet_batch_init(&tx->output_pkts);
>  
>      hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
> @@ -5276,11 +5274,12 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>                  dp_netdev_pmd_flush_output_on_port(pmd, p, now);
>              }
>  
> -            if (dp_packet_batch_is_empty(&p->output_pkts)) {
> +            if (dp_packet_batch_is_empty(&p->output_pkts) &&
> +                !dp_packet_batch_is_empty(packets_)) {
>                  uint32_t cur_max_latency;
>  
>                  atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
> -                p->output_time = now + cur_max_latency;
> +                p->flush_time = pmd->last_cycles + cur_max_latency;
>                  pmd->n_output_batches++;
>              }
>  
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 23930f0..8ad1d00 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -345,9 +345,9 @@
>        </column>
>  
>        <column name="other_config" key="output-max-latency"
> -              type='{"type": "integer", "minInteger": 0, "maxInteger": 1000}'>
> +              type='{"type": "integer", "minInteger": 0, "maxInteger": 1000000}'>
>          <p>
> -          Specifies the time in milliseconds that a packet can wait in output
> +          Specifies the time in microseconds that a packet can wait in output
>            batch for sending i.e. amount of time that packet can spend in an
>            intermediate output queue before sending to netdev.
>            This option can be used to configure balance between throughput
> 
>
Jan Scheurich Aug. 14, 2017, 3:33 p.m. UTC | #2
> > We have tested the effect of turbo mode on TSC and there is none. The
> TSC frequency remains at the nominal clock speed, no matter if the core is
> clocked down or up. So, I believe for PMD threads (where performance
> matters) TSC would be an adequate and efficient clock.
> 
> It's highly platform dependent and testing on a few systems doesn't
> guarantee anything.
> On the other hand, POSIX guarantees the monotonic characteristics of
> CLOCK_MONOTONIC.

TSC is also monotonic on a given core. Does CLOCK_MONOTONIC guarantee any better accuracy than TSC for PMD threads?

> > On PMDs I am a bit concerned about the overhead/latency introduced
> with the clock_gettime() system call, but I haven't done any measurements
> to check the actual impact. Have you?
> 
> Have you seen my incremental patches?
> There is no overhead, because we're just replacing 'time_msec' with
> 'time_usec'.
> No difference except converting timespec to usec instead of msec.

I did look at your incremental patches, and we will test their performance. I was already concerned about the system call cost on master. Perhaps I'm paranoid, but I would like to double-check by testing.

> > If we go for CLOCK_MONOTONIC in microsecond resolution, we should
> make sure that the clock is read not more than once every iteration
> (and cache the us value as now in the pmd ctx struct as suggested in your
> other patch). But then for consistency also the XPS feature should use the
> PMD time in us resolution.
> 
> Again, please, look at my incremental patches.

As far as I could see, you did not, for example, consistently adapt tx_port->last_used to microsecond resolution.

> > For non-PMD thread we could actually skip time-based output batching
> completely. The packet rates and the frequency of calls to
> dpif_netdev_run() in the main ovs-vswitchd thread are so low that time-
> based flushing doesn't seem to make much sense.

Have you considered this option? 

> >
> > Below you can find an alternative incremental patch on top of your RFC
> 4/4 that uses TSC on PMD. We will be comparing the two alternatives for
> performance both with non-PMD guests (iperf3) as well as PMD guests
> (DPDK testpmd).
> 
> In your version you need to move all the output_batching related code
> under #ifdef DPDK_NETDEV because it will break userspace networking if
> compiled without dpdk and output-max-latency != 0.

Not sure. Batching should implicitly be disabled because cycles_counter() and cycles_per_microsecond() would both return zero. But I agree that would be a fairly cryptic design. If we used TSC in PMDs, we should explicitly not do time-based tx batching on the non-PMD thread.

Anyway, if the cost of the clock_gettime() system call proves insignificant and our performance tests comparing our TSC-based with your CLOCK_MONOTONIC-based implementation show equivalent results, we can go for your approach.

BR, Jan
Darrell Ball Aug. 29, 2017, 10:45 p.m. UTC | #3
Hi Jan/Ilya/Bhanu

Just wondering if we are still pursuing this patch series ?

Thanks Darrell

Ilya Maximets Aug. 30, 2017, 7:40 a.m. UTC | #4
Hi Darrell,

Yes, I'm still interested in this series. I had a lot of other work last
few weeks. I hope that I'll have enough time to reply to the questions
and concerns in a near future.

Best regards, Ilya Maximets.

On 30.08.2017 01:45, Darrell Ball wrote:
> Hi Jan/Ilya/Bhanu
> 
> Just wondering if we are still pursuing this patch series ?
> 
> Thanks Darrell
Jan Scheurich Aug. 30, 2017, 4:12 p.m. UTC | #5
Yes we do, sorry for the delay!

We are actively testing the performance of the patch series with iperf as a simulated kernel application and pktgen as a DPDK application in the guest:

a) OVS master
b) Ilya's patches w/o time-based batching
c) Ilya's time-based batching using CLOCK_MONOTONIC
d) Our TSC-based time-based batching alternative.

We will publish our detailed results to the mailing list tomorrow.

Regards, Jan

> -----Original Message-----
> From: Darrell Ball [mailto:dball@vmware.com]
> Sent: Wednesday, 30 August, 2017 00:46
> To: Jan Scheurich <jan.scheurich@ericsson.com>; Ilya Maximets
> <i.maximets@samsung.com>; ovs-dev@openvswitch.org; Bhanuprakash
> Bodireddy <bhanuprakash.bodireddy@intel.com>
> Cc: Heetae Ahn <heetae82.ahn@samsung.com>
> Subject: Re: [ovs-dev] [PATCH RFC v3 4/4] dpif-netdev: Time based output
> batching.
> 
> Hi Jan/Ilya/Bhanu
> 
> Just wondering if we are still pursuing this patch series ?
> 
> Thanks Darrell
@@ -3946,7 +3951,6 @@  reload:
     cycles_count_start(pmd);
     for (;;) {
         int process_packets;
-        bool need_to_flush = true;
 
         for (i = 0; i < poll_cnt; i++) {
             process_packets =
@@ -3955,20 +3959,14 @@  reload:
             cycles_count_intermediate(pmd,
                                       process_packets ? PMD_CYCLES_PROCESSING
                                                       : PMD_CYCLES_IDLE);
-            if (process_packets) {
-                need_to_flush = false;
-            }
         }
 
-        if (need_to_flush) {
-            /* We didn't receive anything in the process loop.
-             * Check if we need to send something. */
-            process_packets = dp_netdev_pmd_flush_output_packets(pmd,
-                                                                 0, false);
-            cycles_count_intermediate(pmd,
-                                      process_packets ? PMD_CYCLES_PROCESSING
-                                                      : PMD_CYCLES_IDLE);
-        }
+        /* Check if queued packets need to be flushed. */
+        process_packets =
+                dp_netdev_pmd_flush_output_packets(pmd, 0, false);
+        cycles_count_intermediate(pmd,
+                                  process_packets ? PMD_CYCLES_PROCESSING
+                                                  : PMD_CYCLES_IDLE);
 
         if (lc++ > 1024) {
             bool reload;
@@ -4593,7 +4591,7 @@  dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
 
     tx->port = port;
     tx->qid = -1;
-    tx->output_time = 0LL;
+    tx->flush_time = 0LL;
     dp_packet_batch_init(&tx->output_pkts);
 
     hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
@@ -5276,11 +5274,12 @@  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
                 dp_netdev_pmd_flush_output_on_port(pmd, p, now);
             }
 
-            if (dp_packet_batch_is_empty(&p->output_pkts)) {
+            if (dp_packet_batch_is_empty(&p->output_pkts) &&
+                !dp_packet_batch_is_empty(packets_)) {
                 uint32_t cur_max_latency;
 
                 atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
-                p->output_time = now + cur_max_latency;
+                p->flush_time = pmd->last_cycles + cur_max_latency;
                 pmd->n_output_batches++;
             }
 
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 23930f0..8ad1d00 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -345,9 +345,9 @@ 
       </column>
 
       <column name="other_config" key="output-max-latency"
-              type='{"type": "integer", "minInteger": 0, "maxInteger": 1000}'>
+              type='{"type": "integer", "minInteger": 0, "maxInteger": 1000000}'>
         <p>
-          Specifies the time in milliseconds that a packet can wait in output
+          Specifies the time in microseconds that a packet can wait in output
           batch for sending i.e. amount of time that packet can spend in an
           intermediate output queue before sending to netdev.
           This option can be used to configure balance between throughput