Message ID | 20230610181037.2435494-1-hzhou@ovn.org |
---|---|
State | Accepted |
Headers | show |
Series | [ovs-dev,v3] ovn-northd.at: Fix unstable LSP incremental processing test. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | fail | github build: failed |
ovsrobot/github-robot-_ovn-kubernetes | fail | github build: failed |
On 6/10/23 20:10, Han Zhou wrote: > The test case is unstable because there are many factors that can impact > the number of recomputes. For example, when northd updates both NB and > SB, both sides have in-flight transactions, and if the SB notification > comes back before the NB transaction (for updating nb_cfg) completes, > ovn-northd falls back to recompute because idl_txn for NB is not > available, while if the NB notification comes back first it won't > trigger recompute because the nb_cfg columns are omitted for alert. > > To avoid the instability of the test case, this patch modifies it to > verify the best case (also the most common scenario), by running the > test multiple (10) times and making sure at least 70% of the time all > the recompute counters match expectation. > > (The recompute triggered by in-flight transactions will be improved > later and the success rate in the test can be adjusted accordingly.) > > Reported-by: Dumitru Ceara <dceara@redhat.com> > Signed-off-by: Han Zhou <hzhou@ovn.org> > --- > v1 -> v2: wait ports up and sync after each port binding, to avoid race. > v2 -> v3: refactor and test case and use a probability based test strategy > to ensure the test stability (see commit message). Neat, thanks for fixing this! The only thing with this approach is that the test case takes quite long to execute 30s on my system. Luckily we don't have too many tests like this so: Acked-by: Dumitru Ceara <dceara@redhat.com> Regards, Dumitru
On Wed, Jun 14, 2023 at 4:16 AM Dumitru Ceara <dceara@redhat.com> wrote: > > On 6/10/23 20:10, Han Zhou wrote: > > The test case is unstable because there are many factors that can impact > > the number of recomputes. For example, when northd updates both NB and > > SB, both sides have in-flight transactions, and if the SB notification > > comes back before the NB transaction (for updating nb_cfg) completes, > > ovn-northd falls back to recompute because idl_txn for NB is not > > available, while if the NB notification comes back first it won't > > trigger recompute because the nb_cfg columns are omitted for alert. > > > > To avoid the instability of the test case, this patch modifies it to > > verify the best case (also the most common scenario), by running the > > test multiple (10) times and making sure at least 70% of the time all > > the recompute counters match expectation. > > > > (The recompute triggered by in-flight transactions will be improved > > later and the success rate in the test can be adjusted accordingly.) > > > > Reported-by: Dumitru Ceara <dceara@redhat.com> > > Signed-off-by: Han Zhou <hzhou@ovn.org> > > --- > > v1 -> v2: wait ports up and sync after each port binding, to avoid race. > > v2 -> v3: refactor and test case and use a probability based test strategy > > to ensure the test stability (see commit message). > > Neat, thanks for fixing this! The only thing with this approach is that > the test case takes quite long to execute 30s on my system. Luckily we > don't have too many tests like this so: > > Acked-by: Dumitru Ceara <dceara@redhat.com> > Thanks Dumitru. That sounds really slow. It is faster on my computer (<10s), not too bad. If we really want it to be faster we can reduce the number of runs. I applied it to main for now (I did see several CI failures because of this test). Regards, Han > Regards, > Dumitru >
On 6/14/23 22:37, Han Zhou wrote: > On Wed, Jun 14, 2023 at 4:16 AM Dumitru Ceara <dceara@redhat.com> wrote: >> >> On 6/10/23 20:10, Han Zhou wrote: >>> The test case is unstable because there are many factors that can impact >>> the number of recomputes. For example, when northd updates both NB and >>> SB, both sides have in-flight transactions, and if the SB notification >>> comes back before the NB transaction (for updating nb_cfg) completes, >>> ovn-northd falls back to recompute because idl_txn for NB is not >>> available, while if the NB notification comes back first it won't >>> trigger recompute because the nb_cfg columns are omitted for alert. >>> >>> To avoid the instability of the test case, this patch modifies it to >>> verify the best case (also the most common scenario), by running the >>> test multiple (10) times and making sure at least 70% of the time all >>> the recompute counters match expectation. >>> >>> (The recompute triggered by in-flight transactions will be improved >>> later and the success rate in the test can be adjusted accordingly.) >>> >>> Reported-by: Dumitru Ceara <dceara@redhat.com> >>> Signed-off-by: Han Zhou <hzhou@ovn.org> >>> --- >>> v1 -> v2: wait ports up and sync after each port binding, to avoid race. >>> v2 -> v3: refactor and test case and use a probability based test > strategy >>> to ensure the test stability (see commit message). >> >> Neat, thanks for fixing this! The only thing with this approach is that >> the test case takes quite long to execute 30s on my system. Luckily we >> don't have too many tests like this so: >> >> Acked-by: Dumitru Ceara <dceara@redhat.com> >> > Thanks Dumitru. That sounds really slow. It is faster on my computer It was with address and undefined behavior sanitizers enabled and with optimizations completely disabled (for both OVS and OVN). With -O2 and no sanitizers it takes 16s. > (<10s), not too bad. If we really want it to be faster we can reduce the > number of runs. I applied it to main for now (I did see several CI failures > because of this test). > Great, thanks! > Regards, > Han > >> Regards, >> Dumitru >> >
diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at index 9f2f15abca8f..b7c916874a2f 100644 --- a/tests/ovn-northd.at +++ b/tests/ovn-northd.at @@ -9493,50 +9493,67 @@ sim_add hv1 as hv1 ovs-vsctl add-br br-phys ovn_attach n1 br-phys 192.168.0.11 -ovs-vsctl add-port br-int lsp0-0 -- set interface lsp0-0 external_ids:iface-id=lsp0-0 -ovs-vsctl add-port br-int lsp0-1 -- set interface lsp0-1 external_ids:iface-id=lsp0-1 -ovs-vsctl add-port br-int lsp0-2 -- set interface lsp0-2 external_ids:iface-id=lsp0-2 -check ovn-nbctl --wait=hv ls-add ls0 - -check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats -check ovn-nbctl --wait=hv lsp-add ls0 lsp0-0 -- lsp-set-addresses lsp0-0 "unknown" -OVS_WAIT_UNTIL([test `as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute` = 5]) -OVS_WAIT_UNTIL([test `as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute` = 5]) - -check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats -check ovn-nbctl --wait=hv lsp-add ls0 lsp0-1 -- lsp-set-addresses lsp0-1 "aa:aa:aa:00:00:01 192.168.0.11" -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0 -]) -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute], [0], [0 -]) - -check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats -check ovn-nbctl --wait=hv lsp-add ls0 lsp0-2 -- lsp-set-addresses lsp0-2 "aa:aa:aa:00:00:02 192.168.0.12" -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0 -]) -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute], [0], [0 -]) - -check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats -check ovn-nbctl --wait=hv lsp-del lsp0-1 -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0 -]) -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute], [0], [1 -]) +fail_count=0 +check_recompute_counter() { + northd_recomp=$(as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute) + if test x$northd_recomp != x$1; then + fail_count=$(($fail_count + 1)) + echo check northd recompute failed: expected $1, got $northd_recomp + return 1 + fi + lflow_recomp=$(as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute) + if test x$lflow_recomp != x$2; then + fail_count=$(($fail_count + 1)) + echo check lflow recompute failed: expected $2, got $lflow_recomp + return 1 + fi + return 0 +} -check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats -check ovn-nbctl --wait=hv lsp-set-addresses lsp0-2 "aa:aa:aa:00:00:88 192.168.0.88" -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0 -]) -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute], [0], [1 -]) +# Depending on order of responses from NB and SB, the number of recompute may +# be different. This test case only verifies the best case scenario, which +# should have the expected recompute count at least 70% of the time. +for i in $(seq 10); do + check ovn-nbctl --wait=hv ls-add ls$i + + check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats + check ovn-nbctl --wait=hv lsp-add ls$i lsp${i}-0 -- lsp-set-addresses lsp${i}-0 "unknown" + ovs-vsctl add-port br-int lsp${i}-0 -- set interface lsp${i}-0 external_ids:iface-id=lsp${i}-0 + wait_for_ports_up + check ovn-nbctl --wait=hv sync + check_recompute_counter 5 5 || break + + check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats + check ovn-nbctl --wait=hv lsp-add ls$i lsp${i}-1 -- lsp-set-addresses lsp${i}-1 "aa:aa:aa:00:00:01 192.168.0.11" + ovs-vsctl add-port br-int lsp${i}-1 -- set interface lsp${i}-1 external_ids:iface-id=lsp${i}-1 + wait_for_ports_up + check ovn-nbctl --wait=hv sync + check_recompute_counter 0 0 || break + + check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats + check ovn-nbctl --wait=hv lsp-add ls$i lsp${i}-2 -- lsp-set-addresses lsp${i}-2 "aa:aa:aa:00:00:02 192.168.0.12" + ovs-vsctl add-port br-int lsp${i}-2 -- set interface lsp${i}-2 external_ids:iface-id=lsp${i}-2 + wait_for_ports_up + check ovn-nbctl --wait=hv sync + check_recompute_counter 0 0 || break + + check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats + check ovn-nbctl --wait=hv lsp-del lsp${i}-1 + check_recompute_counter 0 1 || break + + check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats + check ovn-nbctl --wait=hv lsp-set-addresses lsp${i}-2 "aa:aa:aa:00:00:88 192.168.0.88" + check_recompute_counter 0 1 || break + + # No change, no recompute + check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats + check ovn-nbctl --wait=sb sync + check_recompute_counter 0 0 || break -# No change, no recompute -check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats -check ovn-nbctl --wait=sb sync -AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0 -]) +done +echo Test failed $fail_count in 10. +AT_CHECK([test $fail_count -le 3]) OVN_CLEANUP([hv1]) AT_CLEANUP
The test case is unstable because there are many factors that can impact the number of recomputes. For example, when northd updates both NB and SB, both sides have in-flight transactions, and if the SB notification comes back before the NB transaction (for updating nb_cfg) completes, ovn-northd falls back to recompute because idl_txn for NB is not available, while if the NB notification comes back first it won't trigger recompute because the nb_cfg columns are omitted for alert. To avoid the instability of the test case, this patch modifies it to verify the best case (also the most common scenario), by running the test multiple (10) times and making sure at least 70% of the time all the recompute counters match expectation. (The recompute triggered by in-flight transactions will be improved later and the success rate in the test can be adjusted accordingly.) Reported-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Han Zhou <hzhou@ovn.org> --- v1 -> v2: wait ports up and sync after each port binding, to avoid race. v2 -> v3: refactor and test case and use a probability based test strategy to ensure the test stability (see commit message). tests/ovn-northd.at | 99 ++++++++++++++++++++++++++------------------- 1 file changed, 58 insertions(+), 41 deletions(-)