diff mbox series

[ovs-dev,v3] ovn-northd.at: Fix unstable LSP incremental processing test.

Message ID 20230610181037.2435494-1-hzhou@ovn.org
State Accepted
Headers show
Series [ovs-dev,v3] ovn-northd.at: Fix unstable LSP incremental processing test. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test fail github build: failed
ovsrobot/github-robot-_ovn-kubernetes fail github build: failed

Commit Message

Han Zhou June 10, 2023, 6:10 p.m. UTC
The test case is unstable because there are many factors that can impact
the number of recomputes. For example, when northd updates both NB and
SB, both sides have in-flight transactions, and if the SB notification
comes back before the NB transaction (for updating nb_cfg) completes,
ovn-northd falls back to recompute because idl_txn for NB is not
available, while if the NB notification comes back first it won't
trigger recompute because the nb_cfg columns are omitted for alert.

To avoid the instability of the test case, this patch modifies it to
verify the best case (also the most common scenario), by running the
test multiple (10) times and making sure at least 70% of the time all
the recompute counters match expectation.

(The recompute triggered by in-flight transactions will be improved
later and the success rate in the test can be adjusted accordingly.)

Reported-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Han Zhou <hzhou@ovn.org>
---
v1 -> v2: wait ports up and sync after each port binding, to avoid race.
v2 -> v3: refactor and test case and use a probability based test strategy
to ensure the test stability (see commit message).

 tests/ovn-northd.at | 99 ++++++++++++++++++++++++++-------------------
 1 file changed, 58 insertions(+), 41 deletions(-)

Comments

Dumitru Ceara June 14, 2023, 11:15 a.m. UTC | #1
On 6/10/23 20:10, Han Zhou wrote:
> The test case is unstable because there are many factors that can impact
> the number of recomputes. For example, when northd updates both NB and
> SB, both sides have in-flight transactions, and if the SB notification
> comes back before the NB transaction (for updating nb_cfg) completes,
> ovn-northd falls back to recompute because idl_txn for NB is not
> available, while if the NB notification comes back first it won't
> trigger recompute because the nb_cfg columns are omitted for alert.
> 
> To avoid the instability of the test case, this patch modifies it to
> verify the best case (also the most common scenario), by running the
> test multiple (10) times and making sure at least 70% of the time all
> the recompute counters match expectation.
> 
> (The recompute triggered by in-flight transactions will be improved
> later and the success rate in the test can be adjusted accordingly.)
> 
> Reported-by: Dumitru Ceara <dceara@redhat.com>
> Signed-off-by: Han Zhou <hzhou@ovn.org>
> ---
> v1 -> v2: wait ports up and sync after each port binding, to avoid race.
> v2 -> v3: refactor and test case and use a probability based test strategy
> to ensure the test stability (see commit message).

Neat, thanks for fixing this!  The only thing with this approach is that
the test case takes quite long to execute 30s on my system.  Luckily we
don't have too many tests like this so:

Acked-by: Dumitru Ceara <dceara@redhat.com>

Regards,
Dumitru
Han Zhou June 14, 2023, 8:37 p.m. UTC | #2
On Wed, Jun 14, 2023 at 4:16 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 6/10/23 20:10, Han Zhou wrote:
> > The test case is unstable because there are many factors that can impact
> > the number of recomputes. For example, when northd updates both NB and
> > SB, both sides have in-flight transactions, and if the SB notification
> > comes back before the NB transaction (for updating nb_cfg) completes,
> > ovn-northd falls back to recompute because idl_txn for NB is not
> > available, while if the NB notification comes back first it won't
> > trigger recompute because the nb_cfg columns are omitted for alert.
> >
> > To avoid the instability of the test case, this patch modifies it to
> > verify the best case (also the most common scenario), by running the
> > test multiple (10) times and making sure at least 70% of the time all
> > the recompute counters match expectation.
> >
> > (The recompute triggered by in-flight transactions will be improved
> > later and the success rate in the test can be adjusted accordingly.)
> >
> > Reported-by: Dumitru Ceara <dceara@redhat.com>
> > Signed-off-by: Han Zhou <hzhou@ovn.org>
> > ---
> > v1 -> v2: wait ports up and sync after each port binding, to avoid race.
> > v2 -> v3: refactor and test case and use a probability based test
strategy
> > to ensure the test stability (see commit message).
>
> Neat, thanks for fixing this!  The only thing with this approach is that
> the test case takes quite long to execute 30s on my system.  Luckily we
> don't have too many tests like this so:
>
> Acked-by: Dumitru Ceara <dceara@redhat.com>
>
Thanks Dumitru. That sounds really slow. It is faster on my computer
(<10s), not too bad. If we really want it to be faster we can reduce the
number of runs. I applied it to main for now (I did see several CI failures
because of this test).

Regards,
Han

> Regards,
> Dumitru
>
Dumitru Ceara June 14, 2023, 9:41 p.m. UTC | #3
On 6/14/23 22:37, Han Zhou wrote:
> On Wed, Jun 14, 2023 at 4:16 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 6/10/23 20:10, Han Zhou wrote:
>>> The test case is unstable because there are many factors that can impact
>>> the number of recomputes. For example, when northd updates both NB and
>>> SB, both sides have in-flight transactions, and if the SB notification
>>> comes back before the NB transaction (for updating nb_cfg) completes,
>>> ovn-northd falls back to recompute because idl_txn for NB is not
>>> available, while if the NB notification comes back first it won't
>>> trigger recompute because the nb_cfg columns are omitted for alert.
>>>
>>> To avoid the instability of the test case, this patch modifies it to
>>> verify the best case (also the most common scenario), by running the
>>> test multiple (10) times and making sure at least 70% of the time all
>>> the recompute counters match expectation.
>>>
>>> (The recompute triggered by in-flight transactions will be improved
>>> later and the success rate in the test can be adjusted accordingly.)
>>>
>>> Reported-by: Dumitru Ceara <dceara@redhat.com>
>>> Signed-off-by: Han Zhou <hzhou@ovn.org>
>>> ---
>>> v1 -> v2: wait ports up and sync after each port binding, to avoid race.
>>> v2 -> v3: refactor and test case and use a probability based test
> strategy
>>> to ensure the test stability (see commit message).
>>
>> Neat, thanks for fixing this!  The only thing with this approach is that
>> the test case takes quite long to execute 30s on my system.  Luckily we
>> don't have too many tests like this so:
>>
>> Acked-by: Dumitru Ceara <dceara@redhat.com>
>>
> Thanks Dumitru. That sounds really slow. It is faster on my computer

It was with address and undefined behavior sanitizers enabled and with
optimizations completely disabled (for both OVS and OVN).  With -O2 and
no sanitizers it takes 16s.

> (<10s), not too bad. If we really want it to be faster we can reduce the
> number of runs. I applied it to main for now (I did see several CI failures
> because of this test).
> 

Great, thanks!

> Regards,
> Han
> 
>> Regards,
>> Dumitru
>>
>
diff mbox series

Patch

diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
index 9f2f15abca8f..b7c916874a2f 100644
--- a/tests/ovn-northd.at
+++ b/tests/ovn-northd.at
@@ -9493,50 +9493,67 @@  sim_add hv1
 as hv1
 ovs-vsctl add-br br-phys
 ovn_attach n1 br-phys 192.168.0.11
-ovs-vsctl add-port br-int lsp0-0 -- set interface lsp0-0 external_ids:iface-id=lsp0-0
-ovs-vsctl add-port br-int lsp0-1 -- set interface lsp0-1 external_ids:iface-id=lsp0-1
-ovs-vsctl add-port br-int lsp0-2 -- set interface lsp0-2 external_ids:iface-id=lsp0-2
 
-check ovn-nbctl --wait=hv ls-add ls0
-
-check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
-check ovn-nbctl --wait=hv lsp-add ls0 lsp0-0 -- lsp-set-addresses lsp0-0 "unknown"
-OVS_WAIT_UNTIL([test `as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute` = 5])
-OVS_WAIT_UNTIL([test `as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute` = 5])
-
-check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
-check ovn-nbctl --wait=hv lsp-add ls0 lsp0-1 -- lsp-set-addresses lsp0-1 "aa:aa:aa:00:00:01 192.168.0.11"
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0
-])
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute], [0], [0
-])
-
-check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
-check ovn-nbctl --wait=hv lsp-add ls0 lsp0-2 -- lsp-set-addresses lsp0-2 "aa:aa:aa:00:00:02 192.168.0.12"
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0
-])
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute], [0], [0
-])
-
-check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
-check ovn-nbctl --wait=hv lsp-del lsp0-1
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0
-])
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute], [0], [1
-])
+fail_count=0
+check_recompute_counter() {
+    northd_recomp=$(as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute)
+    if test x$northd_recomp != x$1; then
+        fail_count=$(($fail_count + 1))
+        echo check northd recompute failed: expected $1, got $northd_recomp
+        return 1
+    fi
+    lflow_recomp=$(as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute)
+    if test x$lflow_recomp != x$2; then
+        fail_count=$(($fail_count + 1))
+        echo check lflow recompute failed: expected $2, got $lflow_recomp
+        return 1
+    fi
+    return 0
+}
 
-check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
-check ovn-nbctl --wait=hv lsp-set-addresses lsp0-2 "aa:aa:aa:00:00:88 192.168.0.88"
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0
-])
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats lflow recompute], [0], [1
-])
+# Depending on order of responses from NB and SB, the number of recompute may
+# be different. This test case only verifies the best case scenario, which
+# should have the expected recompute count at least 70% of the time.
+for i in $(seq 10); do
+    check ovn-nbctl --wait=hv ls-add ls$i
+
+    check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
+    check ovn-nbctl --wait=hv lsp-add ls$i lsp${i}-0 -- lsp-set-addresses lsp${i}-0 "unknown"
+    ovs-vsctl add-port br-int lsp${i}-0 -- set interface lsp${i}-0 external_ids:iface-id=lsp${i}-0
+    wait_for_ports_up
+    check ovn-nbctl --wait=hv sync
+    check_recompute_counter 5 5 || break
+
+    check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
+    check ovn-nbctl --wait=hv lsp-add ls$i lsp${i}-1 -- lsp-set-addresses lsp${i}-1 "aa:aa:aa:00:00:01 192.168.0.11"
+    ovs-vsctl add-port br-int lsp${i}-1 -- set interface lsp${i}-1 external_ids:iface-id=lsp${i}-1
+    wait_for_ports_up
+    check ovn-nbctl --wait=hv sync
+    check_recompute_counter 0 0 || break
+
+    check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
+    check ovn-nbctl --wait=hv lsp-add ls$i lsp${i}-2 -- lsp-set-addresses lsp${i}-2 "aa:aa:aa:00:00:02 192.168.0.12"
+    ovs-vsctl add-port br-int lsp${i}-2 -- set interface lsp${i}-2 external_ids:iface-id=lsp${i}-2
+    wait_for_ports_up
+    check ovn-nbctl --wait=hv sync
+    check_recompute_counter 0 0 || break
+
+    check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
+    check ovn-nbctl --wait=hv lsp-del lsp${i}-1
+    check_recompute_counter 0 1 || break
+
+    check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
+    check ovn-nbctl --wait=hv lsp-set-addresses lsp${i}-2 "aa:aa:aa:00:00:88 192.168.0.88"
+    check_recompute_counter 0 1 || break
+
+    # No change, no recompute
+    check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
+    check ovn-nbctl --wait=sb sync
+    check_recompute_counter 0 0 || break
 
-# No change, no recompute
-check as northd ovn-appctl -t NORTHD_TYPE inc-engine/clear-stats
-check ovn-nbctl --wait=sb sync
-AT_CHECK([as northd ovn-appctl -t NORTHD_TYPE inc-engine/show-stats northd recompute], [0], [0
-])
+done
+echo Test failed $fail_count in 10.
+AT_CHECK([test $fail_count -le 3])
 
 OVN_CLEANUP([hv1])
 AT_CLEANUP