diff mbox series

[ovs-dev] Fixed multiple flaky tests

Message ID 20220321095648.710943-1-xsimonar@redhat.com
State Accepted
Headers show
Series [ovs-dev] Fixed multiple flaky tests | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_ovn-kubernetes success github build: passed
ovsrobot/github-robot-_Build_and_Test fail github build: failed

Commit Message

Xavier Simonart March 21, 2022, 9:56 a.m. UTC
- send gratuitous arp on localnet
- send gratuitous arp for nat ips in localnet
- dns lookup : 1 HV, 2 LS, 2 LSPs/LS -- ovn-northd -- dp-groups=yes
- send gratuitous arp for NAT rules on distributed router
- 2 HVs, 1 lport/HV, localport ports

Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
---
 tests/ovn.at | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

Comments

Dumitru Ceara March 21, 2022, 11:02 a.m. UTC | #1
On 3/21/22 10:56, Xavier Simonart wrote:
> - send gratuitous arp on localnet
> - send gratuitous arp for nat ips in localnet
> - dns lookup : 1 HV, 2 LS, 2 LSPs/LS -- ovn-northd -- dp-groups=yes
> - send gratuitous arp for NAT rules on distributed router
> - 2 HVs, 1 lport/HV, localport ports
> 
> Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
> ---

Hi Xavier,

>  tests/ovn.at | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 166b5f72e..cf027703f 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at

[...]

> @@ -10391,7 +10395,7 @@ test_dns6() {
>  }
>  
>  AT_CAPTURE_FILE([ofctl_monitor0.log])
> -as hv1 ovs-ofctl monitor br-int resume --detach --no-chdir \
> +as hv1 ovs-ofctl -t 300 monitor br-int resume --detach --no-chdir \
>  --pidfile=ovs-ofctl0.pid 2> ofctl_monitor0.log
>  

Is this really needed?  We set OVS_CTL_TIMEOUT=30 in atlocal.in.

Thanks,
Dumitru
Xavier Simonart March 30, 2022, 10:40 a.m. UTC | #2
Hi Dumitru

On Mon, Mar 21, 2022 at 12:02 PM Dumitru Ceara <dceara@redhat.com> wrote:

> On 3/21/22 10:56, Xavier Simonart wrote:
> > - send gratuitous arp on localnet
> > - send gratuitous arp for nat ips in localnet
> > - dns lookup : 1 HV, 2 LS, 2 LSPs/LS -- ovn-northd -- dp-groups=yes
> > - send gratuitous arp for NAT rules on distributed router
> > - 2 HVs, 1 lport/HV, localport ports
> >
> > Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
> > ---
>
> Hi Xavier,
>
> >  tests/ovn.at | 20 +++++++++++++-------
> >  1 file changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index 166b5f72e..cf027703f 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
>
> [...]
>
> > @@ -10391,7 +10395,7 @@ test_dns6() {
> >  }
> >
> >  AT_CAPTURE_FILE([ofctl_monitor0.log])
> > -as hv1 ovs-ofctl monitor br-int resume --detach --no-chdir \
> > +as hv1 ovs-ofctl -t 300 monitor br-int resume --detach --no-chdir \
> >  --pidfile=ovs-ofctl0.pid 2> ofctl_monitor0.log
> >
>
> Is this really needed?  We set OVS_CTL_TIMEOUT=30 in atlocal.in.
>
> Thanks for looking into this.

Yes :-)
The issue here is that we start an ovs-ofctl monitor, then run a few times
(send packet / check monitor.log).
The test is quite long, and on slow systems (e.g. VM running s390
architecture on a x86_64 host), the 30 seconds are not enough for all
tests, and hence monitoring stops. Then we check (for 30 seconds) whether
monitor.log contains what we expect ...
300 seconds might be a little much... just depends on how slow the system
is... Requires at least one minute on my system...

Thanks
Xavier

Thanks,
> Dumitru
>
>
Mark Michelson April 5, 2022, 3:22 p.m. UTC | #3
On 3/30/22 06:40, Xavier Simonart wrote:
> Hi Dumitru
> 
> On Mon, Mar 21, 2022 at 12:02 PM Dumitru Ceara <dceara@redhat.com> wrote:
> 
>> On 3/21/22 10:56, Xavier Simonart wrote:
>>> - send gratuitous arp on localnet
>>> - send gratuitous arp for nat ips in localnet
>>> - dns lookup : 1 HV, 2 LS, 2 LSPs/LS -- ovn-northd -- dp-groups=yes
>>> - send gratuitous arp for NAT rules on distributed router
>>> - 2 HVs, 1 lport/HV, localport ports
>>>
>>> Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
>>> ---
>>
>> Hi Xavier,
>>
>>>   tests/ovn.at | 20 +++++++++++++-------
>>>   1 file changed, 13 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/tests/ovn.at b/tests/ovn.at
>>> index 166b5f72e..cf027703f 100644
>>> --- a/tests/ovn.at
>>> +++ b/tests/ovn.at
>>
>> [...]
>>
>>> @@ -10391,7 +10395,7 @@ test_dns6() {
>>>   }
>>>
>>>   AT_CAPTURE_FILE([ofctl_monitor0.log])
>>> -as hv1 ovs-ofctl monitor br-int resume --detach --no-chdir \
>>> +as hv1 ovs-ofctl -t 300 monitor br-int resume --detach --no-chdir \
>>>   --pidfile=ovs-ofctl0.pid 2> ofctl_monitor0.log
>>>
>>
>> Is this really needed?  We set OVS_CTL_TIMEOUT=30 in atlocal.in.
>>
>> Thanks for looking into this.
> 
> Yes :-)
> The issue here is that we start an ovs-ofctl monitor, then run a few times
> (send packet / check monitor.log).
> The test is quite long, and on slow systems (e.g. VM running s390
> architecture on a x86_64 host), the 30 seconds are not enough for all
> tests, and hence monitoring stops. Then we check (for 30 seconds) whether
> monitor.log contains what we expect ...
> 300 seconds might be a little much... just depends on how slow the system
> is... Requires at least one minute on my system...

Thanks for the clarification on this.

Acked-by: Mark Michelson

I think this should be committed as-is since it's fixing the flakiness 
of the tests.

With regards to the DNS test in particular, my assumption is that the 
long timeout is needed because the DNS queries are falling back to 
system default behavior when OVN is not able to answer the queries 
successfully. This can take a very long time since the system will wait 
a certain amount of time for a response and may retry the query a 
multiple times as well. If there were some way to temporarily override 
default system behavior it may end up speeding up the test and 
eliminating the need to set the timeout so large.

It's also possible I'm completely wrong in the above paragraph :)

> 
> Thanks
> Xavier
> 
> Thanks,
>> Dumitru
>>
>>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Dumitru Ceara April 5, 2022, 4:06 p.m. UTC | #4
On 4/5/22 17:22, Mark Michelson wrote:
> On 3/30/22 06:40, Xavier Simonart wrote:
>> Hi Dumitru
>>
>> On Mon, Mar 21, 2022 at 12:02 PM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>>> On 3/21/22 10:56, Xavier Simonart wrote:
>>>> - send gratuitous arp on localnet
>>>> - send gratuitous arp for nat ips in localnet
>>>> - dns lookup : 1 HV, 2 LS, 2 LSPs/LS -- ovn-northd -- dp-groups=yes
>>>> - send gratuitous arp for NAT rules on distributed router
>>>> - 2 HVs, 1 lport/HV, localport ports
>>>>
>>>> Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
>>>> ---
>>>
>>> Hi Xavier,
>>>
>>>>   tests/ovn.at | 20 +++++++++++++-------
>>>>   1 file changed, 13 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/tests/ovn.at b/tests/ovn.at
>>>> index 166b5f72e..cf027703f 100644
>>>> --- a/tests/ovn.at
>>>> +++ b/tests/ovn.at
>>>
>>> [...]
>>>
>>>> @@ -10391,7 +10395,7 @@ test_dns6() {
>>>>   }
>>>>
>>>>   AT_CAPTURE_FILE([ofctl_monitor0.log])
>>>> -as hv1 ovs-ofctl monitor br-int resume --detach --no-chdir \
>>>> +as hv1 ovs-ofctl -t 300 monitor br-int resume --detach --no-chdir \
>>>>   --pidfile=ovs-ofctl0.pid 2> ofctl_monitor0.log
>>>>
>>>
>>> Is this really needed?  We set OVS_CTL_TIMEOUT=30 in atlocal.in.
>>>
>>> Thanks for looking into this.
>>
>> Yes :-)
>> The issue here is that we start an ovs-ofctl monitor, then run a few
>> times
>> (send packet / check monitor.log).
>> The test is quite long, and on slow systems (e.g. VM running s390
>> architecture on a x86_64 host), the 30 seconds are not enough for all
>> tests, and hence monitoring stops. Then we check (for 30 seconds) whether
>> monitor.log contains what we expect ...
>> 300 seconds might be a little much... just depends on how slow the system
>> is... Requires at least one minute on my system...
> 
> Thanks for the clarification on this.
> 
> Acked-by: Mark Michelson
> 
> I think this should be committed as-is since it's fixing the flakiness
> of the tests.
> 

Sounds good to me too, the CI has been quite unhappy lately.

> With regards to the DNS test in particular, my assumption is that the
> long timeout is needed because the DNS queries are falling back to
> system default behavior when OVN is not able to answer the queries
> successfully. This can take a very long time since the system will wait
> a certain amount of time for a response and may retry the query a
> multiple times as well. If there were some way to temporarily override
> default system behavior it may end up speeding up the test and
> eliminating the need to set the timeout so large.
> 
> It's also possible I'm completely wrong in the above paragraph :)
> 
>>
>> Thanks
>> Xavier
>>
>> Thanks,
>>> Dumitru
>>>
>>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
>
Numan Siddique April 6, 2022, 2:20 p.m. UTC | #5
On Tue, Apr 5, 2022 at 12:07 PM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 4/5/22 17:22, Mark Michelson wrote:
> > On 3/30/22 06:40, Xavier Simonart wrote:
> >> Hi Dumitru
> >>
> >> On Mon, Mar 21, 2022 at 12:02 PM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >>> On 3/21/22 10:56, Xavier Simonart wrote:
> >>>> - send gratuitous arp on localnet
> >>>> - send gratuitous arp for nat ips in localnet
> >>>> - dns lookup : 1 HV, 2 LS, 2 LSPs/LS -- ovn-northd -- dp-groups=yes
> >>>> - send gratuitous arp for NAT rules on distributed router
> >>>> - 2 HVs, 1 lport/HV, localport ports
> >>>>
> >>>> Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
> >>>> ---
> >>>
> >>> Hi Xavier,
> >>>
> >>>>   tests/ovn.at | 20 +++++++++++++-------
> >>>>   1 file changed, 13 insertions(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/tests/ovn.at b/tests/ovn.at
> >>>> index 166b5f72e..cf027703f 100644
> >>>> --- a/tests/ovn.at
> >>>> +++ b/tests/ovn.at
> >>>
> >>> [...]
> >>>
> >>>> @@ -10391,7 +10395,7 @@ test_dns6() {
> >>>>   }
> >>>>
> >>>>   AT_CAPTURE_FILE([ofctl_monitor0.log])
> >>>> -as hv1 ovs-ofctl monitor br-int resume --detach --no-chdir \
> >>>> +as hv1 ovs-ofctl -t 300 monitor br-int resume --detach --no-chdir \
> >>>>   --pidfile=ovs-ofctl0.pid 2> ofctl_monitor0.log
> >>>>
> >>>
> >>> Is this really needed?  We set OVS_CTL_TIMEOUT=30 in atlocal.in.
> >>>
> >>> Thanks for looking into this.
> >>
> >> Yes :-)
> >> The issue here is that we start an ovs-ofctl monitor, then run a few
> >> times
> >> (send packet / check monitor.log).
> >> The test is quite long, and on slow systems (e.g. VM running s390
> >> architecture on a x86_64 host), the 30 seconds are not enough for all
> >> tests, and hence monitoring stops. Then we check (for 30 seconds) whether
> >> monitor.log contains what we expect ...
> >> 300 seconds might be a little much... just depends on how slow the system
> >> is... Requires at least one minute on my system...
> >
> > Thanks for the clarification on this.
> >
> > Acked-by: Mark Michelson
> >
> > I think this should be committed as-is since it's fixing the flakiness
> > of the tests.
> >
>
> Sounds good to me too, the CI has been quite unhappy lately.
>
> > With regards to the DNS test in particular, my assumption is that the
> > long timeout is needed because the DNS queries are falling back to
> > system default behavior when OVN is not able to answer the queries
> > successfully. This can take a very long time since the system will wait
> > a certain amount of time for a response and may retry the query a
> > multiple times as well. If there were some way to temporarily override
> > default system behavior it may end up speeding up the test and
> > eliminating the need to set the timeout so large.
> >
> > It's also possible I'm completely wrong in the above paragraph :)
> >
> >>

Thanks,  I applied this patch to the main branch.

Numan

> >> Thanks
> >> Xavier
> >>
> >> Thanks,
> >>> Dumitru
> >>>
> >>>
> >> _______________________________________________
> >> dev mailing list
> >> dev@openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>
> >
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff mbox series

Patch

diff --git a/tests/ovn.at b/tests/ovn.at
index 166b5f72e..cf027703f 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -69,6 +69,10 @@  m4_define([OVN_CHECK_PACKETS],
   [ovn_check_packets__ "$1" "$2" "__file__:__line__"
    AT_CHECK([sort $rcv_text], [0], [expout], [ignore], [dump_diff__ "$1" "$2"])])
 
+m4_define([OVN_CHECK_PACKETS_UNIQ],
+  [ovn_check_packets__ "$1" "$2" "__file__:__line__"
+   AT_CHECK([sort $rcv_text | uniq], [0], [expout])])
+
 m4_define([OVN_CHECK_PACKETS_REMOVE_BROADCAST],
   [ovn_check_packets_remove_broadcast__ "$1" "$2" "__file__:__line__"
    AT_CHECK([sort $rcv_text], [0], [expout], [ignore], [dump_diff__ "$1" "$2"])])
@@ -5889,7 +5893,7 @@  AT_CHECK([ovs-vsctl add-port br-int localvif1 -- set Interface localvif1 externa
 
 # Wait for packet to be received.
 echo "fffffffffffff0000000000108060001080006040001f00000000001c0a80102000000000000c0a80102" > expected
-OVN_CHECK_PACKETS([hv/snoopvif-tx.pcap], [expected])
+OVN_CHECK_PACKETS_UNIQ([hv/snoopvif-tx.pcap], [expected])
 
 # Check GARP packet when restart openflow connection.
 as hv
@@ -5902,7 +5906,7 @@  start_daemon ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -vunixctl
 
 # Wait for packet to be received.
 echo "fffffffffffff0000000000108060001080006040001f00000000001c0a80102000000000000c0a80102" > expected
-OVN_CHECK_PACKETS([hv/snoopvif-tx.pcap], [expected])
+OVN_CHECK_PACKETS_UNIQ([hv/snoopvif-tx.pcap], [expected])
 
 # Delete the localnet ports.
 AT_CHECK([ovs-vsctl del-port localvif1])
@@ -8636,7 +8640,7 @@  OVS_WAIT_UNTIL([test `wc -c < "hv1/snoopvif-tx.pcap"` -ge 140])
 trim_zeros() {
     sed 's/\(00\)\{1,\}$//'
 }
-$PYTHON "$ovs_srcdir/utilities/ovs-pcap.in" hv1/snoopvif-tx.pcap | trim_zeros > packets
+$PYTHON "$ovs_srcdir/utilities/ovs-pcap.in" hv1/snoopvif-tx.pcap | trim_zeros |sort | uniq > packets
 AT_CHECK([sort packets], [0], [dnl
 fffffffffffff0000000000108060001080006040001f00000000001c0a80001000000000000c0a80001
 fffffffffffff0000000000108060001080006040001f00000000001c0a80002000000000000c0a80002
@@ -8733,7 +8737,7 @@  OVS_WAIT_UNTIL([test `wc -c < "hv1/snoopvif-tx.pcap"` -ge 50])
 trim_zeros() {
     sed 's/\(00\)\{1,\}$//'
 }
-$PYTHON "$ovs_srcdir/utilities/ovs-pcap.in" hv1/snoopvif-tx.pcap | trim_zeros > packets
+$PYTHON "$ovs_srcdir/utilities/ovs-pcap.in" hv1/snoopvif-tx.pcap | trim_zeros | sort | uniq > packets
 expected="fffffffffffff0000000000108060001080006040001f00000000001c0a80001000000000000c0a80001"
 echo $expected > expout
 expected="fffffffffffff0000000000108060001080006040001f00000000001c0a80002000000000000c0a80002"
@@ -10391,7 +10395,7 @@  test_dns6() {
 }
 
 AT_CAPTURE_FILE([ofctl_monitor0.log])
-as hv1 ovs-ofctl monitor br-int resume --detach --no-chdir \
+as hv1 ovs-ofctl -t 300 monitor br-int resume --detach --no-chdir \
 --pidfile=ovs-ofctl0.pid 2> ofctl_monitor0.log
 
 set_dns_params vm2
@@ -11607,7 +11611,7 @@  expected="fffffffffffff0000000000108060001080006040001f00000000001c0a80001000000
 echo $expected > expout
 expected="fffffffffffff0000000000108060001080006040001f00000000001c0a80002000000000000c0a80002"
 echo $expected >> expout
-AT_CHECK([sort packets], [0], [expout])
+AT_CHECK([sort packets | uniq], [0], [expout])
 sort packets | cat
 
 # Temporarily remove nat-addresses option to avoid race conditions
@@ -11650,7 +11654,7 @@  trim_zeros() {
     sed 's/\(00\)\{1,\}$//'
 }
 
-$PYTHON "$ovs_srcdir/utilities/ovs-pcap.in" hv1/snoopvif-tx.pcap | trim_zeros > packets
+$PYTHON "$ovs_srcdir/utilities/ovs-pcap.in" hv1/snoopvif-tx.pcap | trim_zeros | sort | uniq > packets
 garp_1="fffffffffffff0000000000308060001080006040001f00000000003c0a80003000000000000c0a80003"
 echo $garp_1 > expout
 garp_2="fffffffffffff0000000000408060001080006040001f00000000004c0a80004000000000000c0a80004"
@@ -12494,6 +12498,7 @@  AT_CHECK([as hv1 ovs-ofctl dump-flows br-int |awk '/table=65/{print substr($8, 1
 # remove the localport from br-int and re-create it
 as hv1
 check ovs-vsctl del-port vif01
+check ovn-nbctl --wait=hv sync
 AT_CHECK([as hv1 ovs-ofctl dump-flows br-int |awk '/table=65/{print substr($8, 16, length($8))}' |sort -n], [0], [dnl
 11
 ])
@@ -12501,6 +12506,7 @@  AT_CHECK([as hv1 ovs-ofctl dump-flows br-int |awk '/table=65/{print substr($8, 1
 as hv1
 check ovs-vsctl add-port br-int vif01 \
     -- set Interface vif01 external-ids:iface-id=lp01
+check ovn-nbctl --wait=hv sync
 AT_CHECK([as hv1 ovs-ofctl dump-flows br-int |awk '/table=65/{print substr($8, 16, length($8))}' |sort -n], [0], [dnl
 2
 11