Message ID: 20210607114811.21613-1-david.marchand@redhat.com
State:      Superseded
Series:     [ovs-dev] dpif-netdev: Report overhead busy cycles per pmd.
Hi David,

David Marchand <david.marchand@redhat.com> writes:

> When setting PMD auto load balance parameters and observing the
> feature (and the per rxq statistics) in action, you can easily get
> rebalances while the sum of per rxq pmd usage is below the pmd load
> threshold you had set.
>
> This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
> rxq cycles while some cycles are used in the pmd mainloop and add up to
> the total pmd load.
>
> dpif-netdev/pmd-stats-show does report per pmd load usage.
> This load is measured since the last dpif-netdev/pmd-stats-clear call.
> On the other hand, the auto load balance feature uses the pmd load on a
> 10s sliding window, which makes it non-trivial to correlate.
>
> Gather per pmd busy cycles with the same periodicity and report the
> difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
> info in a single command.
>
> Example:
> $ ovs-appctl dpif-netdev/pmd-rxq-show
> pmd thread numa_id 0 core_id 4:
>   isolated : false
>   port: dpdk0 queue-id: 0 (enabled) pmd usage: 37 %
>   port: dpdk1 queue-id: 0 (enabled) pmd usage: 36 %
>   port: vhost3 queue-id: 0 (enabled) pmd usage: 0 %
>   port: vhost6 queue-id: 0 (enabled) pmd usage: 0 %
>   port: vhost7 queue-id: 0 (enabled) pmd usage: 0 %
>   overhead: 4 %
> pmd thread numa_id 0 core_id 18:
>   isolated : false
>   port: vhost0 queue-id: 0 (enabled) pmd usage: 37 %
>   port: vhost1 queue-id: 0 (enabled) pmd usage: 39 %
>   port: vhost2 queue-id: 0 (enabled) pmd usage: 0 %
>   port: vhost4 queue-id: 0 (enabled) pmd usage: 0 %
>   port: vhost5 queue-id: 0 (enabled) pmd usage: 0 %
>   overhead: 5 %
>
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> ---

... snip ...

> diff --git a/tests/pmd.at b/tests/pmd.at
> index cc5371d5a5..256adb83f0 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -72,6 +72,7 @@ CHECK_PMD_THREADS_CREATED()
>  AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl
>  pmd thread numa_id <cleared> core_id <cleared>:
>    isolated : false
> +  usage: NOT AVAIL

I don't see this added anywhere, but I do see "overhead:" added.  I guess
this is leftover from an earlier version.  It causes test case failures.

>    port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
>  ])

... snip ...
On Thu, Jun 10, 2021 at 3:37 PM Aaron Conole <aconole@redhat.com> wrote:
>
> Hi David,
>
> David Marchand <david.marchand@redhat.com> writes:

... snip ...

> > @@ -72,6 +72,7 @@ CHECK_PMD_THREADS_CREATED()
> >  AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl
> >  pmd thread numa_id <cleared> core_id <cleared>:
> >    isolated : false
> > +  usage: NOT AVAIL
>
> I don't see this added anywhere, but I do see "overhead:" added.  I
> guess this is leftover from an earlier version.  It causes test case
> failures.

Errr, yes.  Kevin also reported it offlist.
I'll send v2 after retesting the unit tests.

Thanks.
Hi David,

A few comments below, thanks,
Kevin.

On 07/06/2021 12:48, David Marchand wrote:
> When setting PMD auto load balance parameters and observing the
> feature (and the per rxq statistics) in action, you can easily get
> rebalances while the sum of per rxq pmd usage is below the pmd load
> threshold you had set.
>

It is useful for ALB, yes, but I think it is more a general improvement
than just that.  The general feedback we got was along the lines of: "if
I add all the rxq %'s on the PMD, it is not at 100%, but if I turn up the
traffic gen the throughput does not increase; is the calculation wrong?"
etc.

> This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
> rxq cycles while some cycles are used in the pmd mainloop and add up to
> the total pmd load.
>
> dpif-netdev/pmd-stats-show does report per pmd load usage.
> This load is measured since the last dpif-netdev/pmd-stats-clear call.
> On the other hand, the auto load balance feature uses the pmd load on a
> 10s sliding window, which makes it non-trivial to correlate.
>
> Gather per pmd busy cycles with the same periodicity and report the
> difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
> info in a single command.
>

I didn't check what is already in the docs, but they need a few lines
like the above to say what 'overhead' is.
> Example: > $ ovs-appctl dpif-netdev/pmd-rxq-show > pmd thread numa_id 0 core_id 4: > isolated : false > port: dpdk0 queue-id: 0 (enabled) pmd usage: 37 % > port: dpdk1 queue-id: 0 (enabled) pmd usage: 36 % > port: vhost3 queue-id: 0 (enabled) pmd usage: 0 % > port: vhost6 queue-id: 0 (enabled) pmd usage: 0 % > port: vhost7 queue-id: 0 (enabled) pmd usage: 0 % > overhead: 4 % > pmd thread numa_id 0 core_id 18: > isolated : false > port: vhost0 queue-id: 0 (enabled) pmd usage: 37 % > port: vhost1 queue-id: 0 (enabled) pmd usage: 39 % > port: vhost2 queue-id: 0 (enabled) pmd usage: 0 % > port: vhost4 queue-id: 0 (enabled) pmd usage: 0 % > port: vhost5 queue-id: 0 (enabled) pmd usage: 0 % > overhead: 5 % > > Signed-off-by: David Marchand <david.marchand@redhat.com> > --- > Documentation/topics/dpdk/pmd.rst | 6 ++ > lib/dpif-netdev.c | 100 ++++++++++++++++++++---------- > tests/pmd.at | 8 ++- > 3 files changed, 80 insertions(+), 34 deletions(-) > > diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst > index e481e79414..2df10dcc5d 100644 > --- a/Documentation/topics/dpdk/pmd.rst > +++ b/Documentation/topics/dpdk/pmd.rst > @@ -213,6 +213,12 @@ load for each of its associated queues every 10 seconds. If the aggregated PMD > load reaches the load threshold for 6 consecutive intervals then PMD considers > itself to be overloaded. > > +.. versionchanged:: 2.16.0 > + > + The per PMD load is shown in the same fashion than Rx queue's in the output > + of ``pmd-rxq-show``. It accounts for all Rx queue's processing cycles and > + internal PMD core main loop cost. 
> + > For example, to set the load threshold to 70%:: > > $ ovs-vsctl set open_vswitch .\ > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > index 650e67ab30..bcebcebe0e 100644 > --- a/lib/dpif-netdev.c > +++ b/lib/dpif-netdev.c > @@ -235,11 +235,11 @@ struct dfc_cache { > > /* Time in microseconds of the interval in which rxq processing cycles used > * in rxq to pmd assignments is measured and stored. */ > -#define PMD_RXQ_INTERVAL_LEN 10000000LL > +#define PMD_INTERVAL_LEN 10000000LL > > /* Number of intervals for which cycles are stored > * and used during rxq to pmd assignment. */ > -#define PMD_RXQ_INTERVAL_MAX 6 > +#define PMD_INTERVAL_MAX 6 > > /* Time in microseconds to try RCU quiescing. */ > #define PMD_RCU_QUIESCE_INTERVAL 10000LL > @@ -465,9 +465,9 @@ struct dp_netdev_rxq { > > /* Counters of cycles spent successfully polling and processing pkts. */ > atomic_ullong cycles[RXQ_N_CYCLES]; > - /* We store PMD_RXQ_INTERVAL_MAX intervals of data for an rxq and then > + /* We store PMD_INTERVAL_MAX intervals of data for an rxq and then > sum them to yield the cycles used for an rxq. */ > - atomic_ullong cycles_intrvl[PMD_RXQ_INTERVAL_MAX]; > + atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX]; > }; > > /* A port in a netdev-based datapath. */ > @@ -703,13 +703,18 @@ struct dp_netdev_pmd_thread { > long long int next_optimization; > /* End of the next time interval for which processing cycles > are stored for each polled rxq. */ > - long long int rxq_next_cycle_store; > + long long int next_cycle_store; > > /* Last interval timestamp. */ > uint64_t intrvl_tsc_prev; > /* Last interval cycles. */ > atomic_ullong intrvl_cycles; > > + /* Write index for 'busy_cycles_intrvl'. */ > + unsigned int intrvl_idx; > + /* Busy cycles in last PMD_INTERVAL_MAX intervals. */ > + atomic_ullong busy_cycles_intrvl[PMD_INTERVAL_MAX]; > + > /* Current context of the PMD thread. 
>  */
>      struct dp_netdev_pmd_thread_ctx ctx;
>
> @@ -1229,6 +1234,8 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd)
>      struct rxq_poll *list;
>      size_t n_rxq;
>      uint64_t total_cycles = 0;
> +    uint64_t busy_cycles = 0;
> +    uint64_t polling_cycles = 0;
>
>      ds_put_format(reply,
>                    "pmd thread numa_id %d core_id %u:\n isolated : %s\n",
> @@ -1241,16 +1248,27 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd)
>          /* Get the total pmd cycles for an interval. */
>          atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles);
>          /* Estimate the cycles to cover all intervals. */
> -        total_cycles *= PMD_RXQ_INTERVAL_MAX;
> +        total_cycles *= PMD_INTERVAL_MAX;
> +
> +        for (int j = 0; j < PMD_INTERVAL_MAX; j++) {
> +            uint64_t cycles;
> +
> +            atomic_read_relaxed(&pmd->busy_cycles_intrvl[j], &cycles);
> +            busy_cycles += cycles;
> +        }
> +        if (busy_cycles > total_cycles) {
> +            busy_cycles = total_cycles;
> +        }
>
>          for (int i = 0; i < n_rxq; i++) {
>              struct dp_netdev_rxq *rxq = list[i].rxq;
>              const char *name = netdev_rxq_get_name(rxq->rx);
>              uint64_t proc_cycles = 0;
>
> -            for (int j = 0; j < PMD_RXQ_INTERVAL_MAX; j++) {
> +            for (int j = 0; j < PMD_INTERVAL_MAX; j++) {
>                  proc_cycles += dp_netdev_rxq_get_intrvl_cycles(rxq, j);
>              }
> +            polling_cycles += proc_cycles;

polling_cycles is a bit vague, as it sounds like it could include polling
but not receiving packets.  Also, there is now more of a mix of pmd and
rxq *_cycles in this fn, so it can be a little confusing; it might be
worthwhile to distinguish a bit more, e.g.
for (int j = 0; j < PMD_INTERVAL_MAX; j++) { rxq_proc_cycles += dp_netdev_rxq_get_intrvl_cycles(rxq, j); } total_rxqs_proc_cycles += rxq_proc_cycles; > ds_put_format(reply, " port: %-16s queue-id: %2d", name, > netdev_rxq_get_queue_id(list[i].rxq->rx)); > ds_put_format(reply, " %s", netdev_rxq_enabled(list[i].rxq->rx) > @@ -1265,6 +1283,19 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd) > } > ds_put_cstr(reply, "\n"); > } > + > + ds_put_cstr(reply, " overhead: "); You could check n_rxq and only display overhead if it is > 0. Otherwise overhead will stay as 'NOT AVAIL' for pmds with no rxqs. e.g.: # ovs-appctl dpif-netdev/pmd-rxq-show pmd thread numa_id 0 core_id 8: isolated : false port: urport queue-id: 0 (enabled) pmd usage: 95 % overhead: 4 % pmd thread numa_id 1 core_id 9: isolated : false overhead: NOT AVAIL > + if (total_cycles) { > + if (polling_cycles > busy_cycles) { > + polling_cycles = busy_cycles; > + } > + ds_put_format(reply, "%2"PRIu64" %%", > + (busy_cycles - polling_cycles) * 100 / total_cycles); > + } else { > + ds_put_cstr(reply, "NOT AVAIL"); > + } > + ds_put_cstr(reply, "\n"); > + > ovs_mutex_unlock(&pmd->port_mutex); > free(list); > } > @@ -4621,7 +4652,7 @@ static void > dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx, > unsigned long long cycles) > { > - unsigned int idx = rx->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX; > + unsigned int idx = rx->intrvl_idx++ % PMD_INTERVAL_MAX; > atomic_store_relaxed(&rx->cycles_intrvl[idx], cycles); > } > > @@ -5090,7 +5121,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex) > > if (assign_cyc) { > /* Sum the queue intervals and store the cycle history. 
*/ > - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { > + for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) { > cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i); > } > dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, > @@ -5587,7 +5618,7 @@ get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list, > } > > /* Sum the queue intervals and store the cycle history. */ > - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { > + for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) { > cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i); > } > dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, > @@ -5648,7 +5679,7 @@ get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list, > /* Get the total pmd cycles for an interval. */ > atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); > /* Estimate the cycles to cover all intervals. */ > - total_cycles *= PMD_RXQ_INTERVAL_MAX; > + total_cycles *= PMD_INTERVAL_MAX; > for (int id = 0; id < num_pmds; id++) { > if (pmd->core_id == core_list[id]) { > if (pmd_usage[id]) { > @@ -5707,11 +5738,11 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > /* Get the total pmd cycles for an interval. */ > atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); > /* Estimate the cycles to cover all intervals. 
>  */
> -    total_cycles *= PMD_RXQ_INTERVAL_MAX;
> +    total_cycles *= PMD_INTERVAL_MAX;
>
>      ovs_mutex_lock(&pmd->port_mutex);
>      HMAP_FOR_EACH (poll, node, &pmd->poll_list) {
> -        for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> +        for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) {
>              total_proc += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
>          }
>      }
> @@ -5820,7 +5851,7 @@ dpif_netdev_run(struct dpif *dpif)
>              pmd_alb->rebalance_poll_timer = now;
>              CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
>                  if (atomic_count_get(&pmd->pmd_overloaded) >=
> -                    PMD_RXQ_INTERVAL_MAX) {
> +                    PMD_INTERVAL_MAX) {
>                      pmd_rebalance = true;
>                      break;
>                  }
> @@ -6030,6 +6061,7 @@ reload:
>
>      pmd->intrvl_tsc_prev = 0;
>      atomic_store_relaxed(&pmd->intrvl_cycles, 0);
> +    pmd->intrvl_idx = 0;

Not sure it's a good idea resetting this every time the PMD is reloaded.
Maybe it can mean that newer busy_cycles_intrvl[] entries are overwritten
and older ones are kept longer, but anyway, the next comment probably
supersedes this one.

I did some testing and saw that as busy_cycles_intrvl[] is not cleared on
reload, it can temporarily spike overhead if a busy rxq was moved off the
pmd.  Even though everything will flush out (and stats won't all line up
initially aside from this), it looks a bit alarming to see a high
overhead.  I tested resetting it here and it removed the spike.
> cycles_counter_update(s); > > pmd->next_rcu_quiesce = pmd->ctx.now + PMD_RCU_QUIESCE_INTERVAL; > @@ -6556,7 +6588,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp, > pmd_thread_ctx_time_update(pmd); > pmd->next_optimization = pmd->ctx.now + DPCLS_OPTIMIZATION_INTERVAL; > pmd->next_rcu_quiesce = pmd->ctx.now + PMD_RCU_QUIESCE_INTERVAL; > - pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN; > + pmd->next_cycle_store = pmd->ctx.now + PMD_INTERVAL_LEN; > hmap_init(&pmd->poll_list); > hmap_init(&pmd->tx_ports); > hmap_init(&pmd->tnl_port_cache); > @@ -8777,31 +8809,33 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, > uint64_t tot_idle = 0, tot_proc = 0; > unsigned int pmd_load = 0; > > - if (pmd->ctx.now > pmd->rxq_next_cycle_store) { > + if (pmd->ctx.now > pmd->next_cycle_store) { > uint64_t curr_tsc; > uint8_t rebalance_load_trigger; > struct pmd_auto_lb *pmd_alb = &pmd->dp->pmd_alb; > - if (pmd_alb->is_enabled && !pmd->isolated > - && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >= > - pmd->prev_stats[PMD_CYCLES_ITER_IDLE]) > - && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >= > - pmd->prev_stats[PMD_CYCLES_ITER_BUSY])) > - { > + unsigned int idx; > + > + if (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >= > + pmd->prev_stats[PMD_CYCLES_ITER_IDLE] && > + pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >= > + pmd->prev_stats[PMD_CYCLES_ITER_BUSY]) { > tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] - > pmd->prev_stats[PMD_CYCLES_ITER_IDLE]; > tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] - > pmd->prev_stats[PMD_CYCLES_ITER_BUSY]; > > - if (tot_proc) { > - pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc)); > - } > + if (pmd_alb->is_enabled && !pmd->isolated) { > + if (tot_proc) { > + pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc)); > + } > > - atomic_read_relaxed(&pmd_alb->rebalance_load_thresh, > - &rebalance_load_trigger); > - if (pmd_load >= 
rebalance_load_trigger) { > - atomic_count_inc(&pmd->pmd_overloaded); > - } else { > - atomic_count_set(&pmd->pmd_overloaded, 0); > + atomic_read_relaxed(&pmd_alb->rebalance_load_thresh, > + &rebalance_load_trigger); > + if (pmd_load >= rebalance_load_trigger) { > + atomic_count_inc(&pmd->pmd_overloaded); > + } else { > + atomic_count_set(&pmd->pmd_overloaded, 0); > + } > } > } > > @@ -8824,9 +8858,11 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, > atomic_store_relaxed(&pmd->intrvl_cycles, > curr_tsc - pmd->intrvl_tsc_prev); > } > + idx = pmd->intrvl_idx++ % PMD_INTERVAL_MAX; > + atomic_store_relaxed(&pmd->busy_cycles_intrvl[idx], tot_proc); > pmd->intrvl_tsc_prev = curr_tsc; > /* Start new measuring interval */ > - pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN; > + pmd->next_cycle_store = pmd->ctx.now + PMD_INTERVAL_LEN; > } > > if (pmd->ctx.now > pmd->next_optimization) { > diff --git a/tests/pmd.at b/tests/pmd.at > index cc5371d5a5..256adb83f0 100644 > --- a/tests/pmd.at > +++ b/tests/pmd.at > @@ -72,6 +72,7 @@ CHECK_PMD_THREADS_CREATED() > AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl > pmd thread numa_id <cleared> core_id <cleared>: > isolated : false > + usage: NOT AVAIL > port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL > ]) > > @@ -103,6 +104,7 @@ dummy@ovs-dummy: hit:0 missed:0 > AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl > pmd thread numa_id <cleared> core_id <cleared>: > isolated : false > + usage: NOT AVAIL > port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL > port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL > port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL > @@ -134,6 +136,7 @@ dummy@ovs-dummy: hit:0 missed:0 > AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl > pmd thread numa_id <cleared> core_id <cleared>: > isolated : false > + usage: NOT AVAIL > port: p0 queue-id: 0 (enabled) pmd 
usage: NOT AVAIL > port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL > port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL > @@ -149,13 +152,13 @@ TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]]) > AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3]) > CHECK_PMD_THREADS_CREATED([2], [], [+$TMP]) > > -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/\<port:.*AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl > port: p0 queue-id: 0 3 4 7 > port: p0 queue-id: 1 2 5 6 > ]) > > AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=roundrobin]) > -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl > +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/\<port:.*AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl > port: p0 queue-id: 0 2 4 6 > port: p0 queue-id: 1 3 5 7 > ]) > @@ -167,6 +170,7 @@ CHECK_PMD_THREADS_CREATED([1], [], [+$TMP]) > AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl > pmd thread numa_id <cleared> core_id <cleared>: > isolated : false > + usage: NOT AVAIL > port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL > port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL > port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL >
On Fri, Jun 11, 2021 at 2:02 PM Kevin Traynor <ktraynor@redhat.com> wrote:
>
> Hi David,
>
> A few comments below,

Thanks for the review and comments, all seemed fine to me.
Retested with my setup and I just posted v2 on the ml.
diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index e481e79414..2df10dcc5d 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -213,6 +213,12 @@ load for each of its associated queues every 10 seconds. If the aggregated PMD load reaches the load threshold for 6 consecutive intervals then PMD considers itself to be overloaded. +.. versionchanged:: 2.16.0 + + The per PMD load is shown in the same fashion than Rx queue's in the output + of ``pmd-rxq-show``. It accounts for all Rx queue's processing cycles and + internal PMD core main loop cost. + For example, to set the load threshold to 70%:: $ ovs-vsctl set open_vswitch .\ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 650e67ab30..bcebcebe0e 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -235,11 +235,11 @@ struct dfc_cache { /* Time in microseconds of the interval in which rxq processing cycles used * in rxq to pmd assignments is measured and stored. */ -#define PMD_RXQ_INTERVAL_LEN 10000000LL +#define PMD_INTERVAL_LEN 10000000LL /* Number of intervals for which cycles are stored * and used during rxq to pmd assignment. */ -#define PMD_RXQ_INTERVAL_MAX 6 +#define PMD_INTERVAL_MAX 6 /* Time in microseconds to try RCU quiescing. */ #define PMD_RCU_QUIESCE_INTERVAL 10000LL @@ -465,9 +465,9 @@ struct dp_netdev_rxq { /* Counters of cycles spent successfully polling and processing pkts. */ atomic_ullong cycles[RXQ_N_CYCLES]; - /* We store PMD_RXQ_INTERVAL_MAX intervals of data for an rxq and then + /* We store PMD_INTERVAL_MAX intervals of data for an rxq and then sum them to yield the cycles used for an rxq. */ - atomic_ullong cycles_intrvl[PMD_RXQ_INTERVAL_MAX]; + atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX]; }; /* A port in a netdev-based datapath. 
*/ @@ -703,13 +703,18 @@ struct dp_netdev_pmd_thread { long long int next_optimization; /* End of the next time interval for which processing cycles are stored for each polled rxq. */ - long long int rxq_next_cycle_store; + long long int next_cycle_store; /* Last interval timestamp. */ uint64_t intrvl_tsc_prev; /* Last interval cycles. */ atomic_ullong intrvl_cycles; + /* Write index for 'busy_cycles_intrvl'. */ + unsigned int intrvl_idx; + /* Busy cycles in last PMD_INTERVAL_MAX intervals. */ + atomic_ullong busy_cycles_intrvl[PMD_INTERVAL_MAX]; + /* Current context of the PMD thread. */ struct dp_netdev_pmd_thread_ctx ctx; @@ -1229,6 +1234,8 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd) struct rxq_poll *list; size_t n_rxq; uint64_t total_cycles = 0; + uint64_t busy_cycles = 0; + uint64_t polling_cycles = 0; ds_put_format(reply, "pmd thread numa_id %d core_id %u:\n isolated : %s\n", @@ -1241,16 +1248,27 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd) /* Get the total pmd cycles for an interval. */ atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); /* Estimate the cycles to cover all intervals. 
*/ - total_cycles *= PMD_RXQ_INTERVAL_MAX; + total_cycles *= PMD_INTERVAL_MAX; + + for (int j = 0; j < PMD_INTERVAL_MAX; j++) { + uint64_t cycles; + + atomic_read_relaxed(&pmd->busy_cycles_intrvl[j], &cycles); + busy_cycles += cycles; + } + if (busy_cycles > total_cycles) { + busy_cycles = total_cycles; + } for (int i = 0; i < n_rxq; i++) { struct dp_netdev_rxq *rxq = list[i].rxq; const char *name = netdev_rxq_get_name(rxq->rx); uint64_t proc_cycles = 0; - for (int j = 0; j < PMD_RXQ_INTERVAL_MAX; j++) { + for (int j = 0; j < PMD_INTERVAL_MAX; j++) { proc_cycles += dp_netdev_rxq_get_intrvl_cycles(rxq, j); } + polling_cycles += proc_cycles; ds_put_format(reply, " port: %-16s queue-id: %2d", name, netdev_rxq_get_queue_id(list[i].rxq->rx)); ds_put_format(reply, " %s", netdev_rxq_enabled(list[i].rxq->rx) @@ -1265,6 +1283,19 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd) } ds_put_cstr(reply, "\n"); } + + ds_put_cstr(reply, " overhead: "); + if (total_cycles) { + if (polling_cycles > busy_cycles) { + polling_cycles = busy_cycles; + } + ds_put_format(reply, "%2"PRIu64" %%", + (busy_cycles - polling_cycles) * 100 / total_cycles); + } else { + ds_put_cstr(reply, "NOT AVAIL"); + } + ds_put_cstr(reply, "\n"); + ovs_mutex_unlock(&pmd->port_mutex); free(list); } @@ -4621,7 +4652,7 @@ static void dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned long long cycles) { - unsigned int idx = rx->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX; + unsigned int idx = rx->intrvl_idx++ % PMD_INTERVAL_MAX; atomic_store_relaxed(&rx->cycles_intrvl[idx], cycles); } @@ -5090,7 +5121,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex) if (assign_cyc) { /* Sum the queue intervals and store the cycle history. 
*/ - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { + for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) { cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i); } dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, @@ -5587,7 +5618,7 @@ get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list, } /* Sum the queue intervals and store the cycle history. */ - for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) { + for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) { cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i); } dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, @@ -5648,7 +5679,7 @@ get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list, /* Get the total pmd cycles for an interval. */ atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); /* Estimate the cycles to cover all intervals. */ - total_cycles *= PMD_RXQ_INTERVAL_MAX; + total_cycles *= PMD_INTERVAL_MAX; for (int id = 0; id < num_pmds; id++) { if (pmd->core_id == core_list[id]) { if (pmd_usage[id]) { @@ -5707,11 +5738,11 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) /* Get the total pmd cycles for an interval. */ atomic_read_relaxed(&pmd->intrvl_cycles, &total_cycles); /* Estimate the cycles to cover all intervals. 
 */
-    total_cycles *= PMD_RXQ_INTERVAL_MAX;
+    total_cycles *= PMD_INTERVAL_MAX;
     ovs_mutex_lock(&pmd->port_mutex);
     HMAP_FOR_EACH (poll, node, &pmd->poll_list) {
-        for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
+        for (unsigned i = 0; i < PMD_INTERVAL_MAX; i++) {
             total_proc += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
         }
     }
@@ -5820,7 +5851,7 @@ dpif_netdev_run(struct dpif *dpif)
             pmd_alb->rebalance_poll_timer = now;
             CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
                 if (atomic_count_get(&pmd->pmd_overloaded) >=
-                    PMD_RXQ_INTERVAL_MAX) {
+                    PMD_INTERVAL_MAX) {
                     pmd_rebalance = true;
                     break;
                 }
@@ -6030,6 +6061,7 @@ reload:
     pmd->intrvl_tsc_prev = 0;
     atomic_store_relaxed(&pmd->intrvl_cycles, 0);
+    pmd->intrvl_idx = 0;
     cycles_counter_update(s);
     pmd->next_rcu_quiesce = pmd->ctx.now + PMD_RCU_QUIESCE_INTERVAL;
@@ -6556,7 +6588,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     pmd_thread_ctx_time_update(pmd);
     pmd->next_optimization = pmd->ctx.now + DPCLS_OPTIMIZATION_INTERVAL;
     pmd->next_rcu_quiesce = pmd->ctx.now + PMD_RCU_QUIESCE_INTERVAL;
-    pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
+    pmd->next_cycle_store = pmd->ctx.now + PMD_INTERVAL_LEN;
     hmap_init(&pmd->poll_list);
     hmap_init(&pmd->tx_ports);
     hmap_init(&pmd->tnl_port_cache);
@@ -8777,31 +8809,33 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
     uint64_t tot_idle = 0, tot_proc = 0;
     unsigned int pmd_load = 0;

-    if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
+    if (pmd->ctx.now > pmd->next_cycle_store) {
         uint64_t curr_tsc;
         uint8_t rebalance_load_trigger;
         struct pmd_auto_lb *pmd_alb = &pmd->dp->pmd_alb;
-        if (pmd_alb->is_enabled && !pmd->isolated
-            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
-                pmd->prev_stats[PMD_CYCLES_ITER_IDLE])
-            && (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
-                pmd->prev_stats[PMD_CYCLES_ITER_BUSY]))
-            {
+        unsigned int idx;
+
+        if (pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] >=
+            pmd->prev_stats[PMD_CYCLES_ITER_IDLE] &&
+            pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] >=
+            pmd->prev_stats[PMD_CYCLES_ITER_BUSY]) {
             tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] -
                        pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
             tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
                        pmd->prev_stats[PMD_CYCLES_ITER_BUSY];

-            if (tot_proc) {
-                pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
-            }
+            if (pmd_alb->is_enabled && !pmd->isolated) {
+                if (tot_proc) {
+                    pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
+                }

-            atomic_read_relaxed(&pmd_alb->rebalance_load_thresh,
-                                &rebalance_load_trigger);
-            if (pmd_load >= rebalance_load_trigger) {
-                atomic_count_inc(&pmd->pmd_overloaded);
-            } else {
-                atomic_count_set(&pmd->pmd_overloaded, 0);
+                atomic_read_relaxed(&pmd_alb->rebalance_load_thresh,
+                                    &rebalance_load_trigger);
+                if (pmd_load >= rebalance_load_trigger) {
+                    atomic_count_inc(&pmd->pmd_overloaded);
+                } else {
+                    atomic_count_set(&pmd->pmd_overloaded, 0);
+                }
             }
         }

@@ -8824,9 +8858,11 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
             atomic_store_relaxed(&pmd->intrvl_cycles,
                                  curr_tsc - pmd->intrvl_tsc_prev);
         }
+        idx = pmd->intrvl_idx++ % PMD_INTERVAL_MAX;
+        atomic_store_relaxed(&pmd->busy_cycles_intrvl[idx], tot_proc);
         pmd->intrvl_tsc_prev = curr_tsc;
         /* Start new measuring interval */
-        pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
+        pmd->next_cycle_store = pmd->ctx.now + PMD_INTERVAL_LEN;
     }

     if (pmd->ctx.now > pmd->next_optimization) {
diff --git a/tests/pmd.at b/tests/pmd.at
index cc5371d5a5..256adb83f0 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -72,6 +72,7 @@ CHECK_PMD_THREADS_CREATED()
 AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl
 pmd thread numa_id <cleared> core_id <cleared>:
   isolated : false
+  usage: NOT AVAIL
   port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
 ])

@@ -103,6 +104,7 @@ dummy@ovs-dummy: hit:0 missed:0
 AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl
 pmd thread numa_id <cleared> core_id <cleared>:
   isolated : false
+  usage: NOT AVAIL
   port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
   port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
   port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
@@ -134,6 +136,7 @@ dummy@ovs-dummy: hit:0 missed:0
 AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl
 pmd thread numa_id <cleared> core_id <cleared>:
   isolated : false
+  usage: NOT AVAIL
   port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
   port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
   port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
@@ -149,13 +152,13 @@ TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
 AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
 CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])

-AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/\<port:.*AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
 port: p0 queue-id: 0 3 4 7
 port: p0 queue-id: 1 2 5 6
 ])

 AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=roundrobin])

-AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/\<port:.*AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
 port: p0 queue-id: 0 2 4 6
 port: p0 queue-id: 1 3 5 7
 ])
@@ -167,6 +170,7 @@ CHECK_PMD_THREADS_CREATED([1], [], [+$TMP])
 AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed SED_NUMA_CORE_PATTERN], [0], [dnl
 pmd thread numa_id <cleared> core_id <cleared>:
   isolated : false
+  usage: NOT AVAIL
   port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL
   port: p0 queue-id: 1 (enabled) pmd usage: NOT AVAIL
   port: p0 queue-id: 2 (enabled) pmd usage: NOT AVAIL
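As an aside on the pmd.at changes above: the awk filter needed the `port:` anchor because the new summary line also ends in "AVAIL", so the old `/AVAIL$/` joiner would fold it onto the following rxq line. A standalone sketch of the two filters, using made-up sample output and a plain `port:` prefix in place of the gawk-only `\<` word-boundary operator:

```shell
# Abridged, hypothetical pmd-rxq-show output including the new summary line.
sample='  isolated : false
  usage: NOT AVAIL
  port: p0 queue-id: 0 (enabled) pmd usage: NOT AVAIL'

# Old filter: every line ending in "AVAIL" gets a trailing tab instead of a
# newline, so "usage: NOT AVAIL" is glued to the "port:" line after it.
old=$(printf '%s\n' "$sample" | awk '/AVAIL$/ { printf("%s\t", $0); next } 1')

# New filter: only rxq ("port:") lines are joined; "usage:" passes through.
new=$(printf '%s\n' "$sample" | awk '/port:.*AVAIL$/ { printf("%s\t", $0); next } 1')

printf 'old:\n%s\n---\nnew:\n%s\n' "$old" "$new"
```

With the old pattern the three input lines collapse to two, which would confuse `parse_pmd_rxq_show_group`; the new pattern leaves the summary line alone.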
When setting PMD auto load balance parameters and observing the feature (and the per-rxq statistics) in action, you can easily see rebalances happen while the sum of per-rxq pmd usage is below the pmd load threshold you had set.

This is because the dpif-netdev/pmd-rxq-show command only reports "pure" rxq cycles, while some cycles are spent in the pmd mainloop and add up to the total pmd load.

dpif-netdev/pmd-stats-show does report per-pmd load usage, but that load is measured since the last dpif-netdev/pmd-stats-clear call. The auto load balance feature, on the other hand, uses the pmd load over a 10s sliding window, which makes the two non-trivial to correlate.

Gather per-pmd busy cycles with the same periodicity and report the difference as overhead in dpif-netdev/pmd-rxq-show, so that all the info is in a single command.

Example:
$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 4:
  isolated : false
  port: dpdk0 queue-id: 0 (enabled) pmd usage: 37 %
  port: dpdk1 queue-id: 0 (enabled) pmd usage: 36 %
  port: vhost3 queue-id: 0 (enabled) pmd usage: 0 %
  port: vhost6 queue-id: 0 (enabled) pmd usage: 0 %
  port: vhost7 queue-id: 0 (enabled) pmd usage: 0 %
  overhead: 4 %
pmd thread numa_id 0 core_id 18:
  isolated : false
  port: vhost0 queue-id: 0 (enabled) pmd usage: 37 %
  port: vhost1 queue-id: 0 (enabled) pmd usage: 39 %
  port: vhost2 queue-id: 0 (enabled) pmd usage: 0 %
  port: vhost4 queue-id: 0 (enabled) pmd usage: 0 %
  port: vhost5 queue-id: 0 (enabled) pmd usage: 0 %
  overhead: 5 %

Signed-off-by: David Marchand <david.marchand@redhat.com>
---
 Documentation/topics/dpdk/pmd.rst |   6 ++
 lib/dpif-netdev.c                 | 100 ++++++++++++++++++++----------
 tests/pmd.at                      |   8 ++-
 3 files changed, 80 insertions(+), 34 deletions(-)
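To see how the new "overhead" figure relates to the existing numbers, here is a back-of-the-envelope sketch. All cycle counts below are made up, and it assumes, per the commit message, that overhead is the pmd's busy cycles not attributed to any rxq, expressed against total (busy + idle) cycles over the measurement window:

```shell
# Hypothetical cycle counts for one pmd thread over the 10s window.
total_cycles=1000000                         # busy + idle cycles
busy_cycles=800000                           # PMD_CYCLES_ITER_BUSY delta
rxq_cycles="370000 360000 5000 5000 5000"    # per-rxq processing cycles

# Sum the cycles attributed to individual rx queues.
rxq_sum=0
for c in $rxq_cycles; do
    rxq_sum=$((rxq_sum + c))
done

pmd_usage=$((busy_cycles * 100 / total_cycles))   # overall pmd load
rxq_usage=$((rxq_sum * 100 / total_cycles))       # sum of per-rxq "pmd usage"
overhead=$((pmd_usage - rxq_usage))               # busy cycles outside any rxq

echo "pmd usage: ${pmd_usage} %, rxq usage sum: ${rxq_usage} %, overhead: ${overhead} %"
```

With these numbers the rxq usages sum to 74 % while the pmd is 80 % busy, so pmd-rxq-show would report an overhead of 6 %. This is the gap that previously made auto-load-balance rebalances look premature.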