
[ovs-dev,RFC] ci: Remove '--recheck' in CI.

Message ID 20230608140411.148041-1-dceara@redhat.com
State RFC
Series [ovs-dev,RFC] ci: Remove '--recheck' in CI.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_ovn-kubernetes success github build: passed
ovsrobot/github-robot-_Build_and_Test fail github build: failed

Commit Message

Dumitru Ceara June 8, 2023, 2:04 p.m. UTC
If we want to catch new failures faster, we have a better chance if CI
doesn't auto-retry (once).

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
NOTE: Sending this as RFC to start the discussion with the community.
---
 .ci/linux-build.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Mark Michelson June 8, 2023, 7:58 p.m. UTC | #1
In theory, this idea is great. By removing the recheck, our CI will 
potentially turn red more often, resulting in us investigating our tests 
to resolve what makes them flaky. Eventually, we will resolve all of the 
flakes and CI will be green. From that point on, any time the CI turns 
red, it will immediately get our attention and we will fix whatever 
needs fixing.

I have two main fears here:

1) If we don't fix the existing flaky tests quickly enough, it will be 
considered "normal" for CI to be red, meaning we are less likely to 
notice genuine failures when they happen. Worse, developers familiar 
with the flaky reputation of tests will just issue their own rechecks 
until the CI turns green.

2) Let's say we track down and fix all current flaky tests. Developers 
will soon write new tests, and those tests may end up being flaky. Then 
our CI will start turning red again due to test failures seemingly 
unrelated to code being committed. The result is the same as (1).

By leaving the recheck intact, we are less likely to notice and fix 
flaky tests. That's also bad.

I think that if we are going to remove the recheck now, then we have to 
be fully dedicated to eradicating all existing flaky tests. If we as a 
project can be dedicated to fixing flaky tests *immediately* when they 
are found, then we could move forward with this change now. I just don't 
trust us to actually do it.

I think a more plausible idea would be to identify all known flaky 
tests, fix those, and then remove the recheck. This way, we only have to 
deal with new flaky tests that get written in the future. I think we can 
deal with this situation more easily than suddenly going from green CI 
to red CI because of existing flaky tests.


Everything below this sentence is wishful thinking and shouldn't get in 
the way of approving/rejecting this change.


I also think that we need to identify the common causes of flaky tests, 
and take measures to prevent them from happening in the future. There 
are a few ways I can think of to accomplish this:

1) Ensure people performing code reviews are aware of these patterns and 
point them out during code review.

2) Automatically prevent known flaky patterns from existing in our code. 
For instance, we could introduce a build-time check that ensures that 
nobody attempts to sleep during a test to "let things settle."
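
As a rough illustration of what such a check could look like (only a
sketch; the path and the policy are assumptions, nothing like this exists
today):

# Hypothetical lint step: reject bare "sleep" calls in the testsuite.
if grep -nE '(^|[^A-Za-z_])sleep +[0-9]' tests/*.at; then
    echo "error: raw 'sleep' in tests; wait on an explicit condition" >&2
    exit 1
fi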

3) Provide functions/macros to use in the testsuite that avoid 
potentially flaky behavior. As a hypothetical example, let's say that a 
common cause of flakes is not using "--wait=hv" or "--wait=sb" after a 
series of ovn-nbctl commands. We could provide a macro like:

OVN_NBCTL([hv], [
lr-add ro0
ls-add sw0
])

(That's probably bad syntax, but I think you get the idea)

This would expand to something like:
check ovn-nbctl lr-add ro0
check ovn-nbctl ls-add sw0
check ovn-nbctl --wait=hv sync

If we combine this with the idea from (2), we could prevent all bare 
uses of `ovn-nbctl` in tests, thus preventing this flaky behavior from 
existing.
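
A minimal shell-level sketch of the same idea ("check" is the usual
testsuite wrapper; the helper name and shape are made up, nothing like
this exists yet):

# Hypothetical helper: run a batch of ovn-nbctl commands, then force a
# sync so the test cannot race against northd/ovn-controller.
ovn_nbctl_synced() {
    local cmd
    for cmd in "$@"; do
        check ovn-nbctl $cmd
    done
    check ovn-nbctl --wait=hv sync
}

# Usage: ovn_nbctl_synced "lr-add ro0" "ls-add sw0"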

4) Identify commonly used patterns in tests and provide macros that 
essentially do the work for us. Some examples might be:

a) Common logical network topologies could be set up for you by calling 
a single macro.
b) Checking logical flows could have a macro that dumps the information 
you care about, sorts the output, anonymizes the table numbers, and 
eliminates odd spacing in the output.
c) Ensuring TCP/UDP/IP connectivity between two endpoints could 
potentially be done with a macro or two, assuming that the traffic you 
are sending is not overly specific.
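
For (b), as a rough sketch of what such a macro could boil down to (the
exact commands and sed patterns are only an illustration):

# Dump the logical flows of one datapath, hide table numbers and collapse
# whitespace so the output is stable enough to compare against.
ovn-sbctl lflow-list sw0 \
    | sed -E -e 's/table=[0-9]+/table=??/' -e 's/ +/ /g' \
    | sort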

The idea here is that by having commonly used macros that are 
bulletproof, we are less likely to introduce flakes by hand-coding the 
same scenarios. It also would likely reduce the code size of the 
testsuite, which is a nice bonus.

5) Enhanced debuggability of tests would be great. There are too many 
times that I've looked at testsuite.log on a failing test and had no 
idea where to start looking for the failure. There are other times where 
it's clear what failed, but trying to figure out why is more difficult. 
Unfortunately, there isn't likely to be a magic cure-all for this problem.

On 6/8/23 10:04, Dumitru Ceara wrote:
> If we want to catch new failures faster we have a better chance if CI
> doesn't auto-retry (once).
> 
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> ---
> NOTE: Sending this as RFC to start the discussion with the community.
> ---
>   .ci/linux-build.sh | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
> index 907a0dc6c9..64f7a96d91 100755
> --- a/.ci/linux-build.sh
> +++ b/.ci/linux-build.sh
> @@ -66,7 +66,7 @@ function execute_tests()
>   
>       export DISTCHECK_CONFIGURE_FLAGS="$OPTS"
>       if ! make distcheck CFLAGS="${COMMON_CFLAGS} ${OVN_CFLAGS}" $JOBS \
> -        TESTSUITEFLAGS="$JOBS $TEST_RANGE" RECHECK=yes
> +        TESTSUITEFLAGS="$JOBS $TEST_RANGE"
>       then
>           # testsuite.log is necessary for debugging.
>           cat */_build/sub/tests/testsuite.log
> @@ -81,7 +81,7 @@ function execute_system_tests()
>   
>         configure_ovn $OPTS
>         make $JOBS || { cat config.log; exit 1; }
> -      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE" RECHECK=yes; then
> +      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"; then
>             # $log_file is necessary for debugging.
>             cat tests/$log_file
>             exit 1
Dumitru Ceara June 9, 2023, 10:25 a.m. UTC | #2
Hi Mark,

Thanks for your feedback on this!

On 6/8/23 21:58, Mark Michelson wrote:
> In theory, this idea is great. By removing the recheck, our CI will
> potentially turn red more often, resulting in us investigating our tests
> to resolve what makes them flaky. Eventually, we will resolve all of the
> flakes and CI will be green. From that point on, any time the CI turns
> red, it will immediately get our attention and we will fix whatever
> needs fixing.
> 
> I have two main fears here:
> 
> 1) If we don't fix the existing flaky tests quickly enough, it will be
> considered "normal" for CI to be red, meaning we are less likely to
> notice genuine failures when they happen. Worse, developers familiar
> with the flaky reputation of tests will just issue their own rechecks
> until the CI turns green.
> 
> 2) Let's say we track down and fix all current flaky tests. Developers
> will soon write new tests, and those tests may end up being flaky. Then
> our CI will start turning red again due to test failures seemingly
> unrelated to code being committed. The result is the same as (1).
> 
> By leaving the recheck intact, we are less likely to notice and fix
> flaky tests. That's also bad.
> 
> I think that if we are going to remove the recheck now, then we have to
> be fully dedicated to eradicating all existing flaky tests. If we as a
> project can be dedicated to fixing flaky tests *immediately* when they
> are found, then we could move forward with this change now. I just don't
> trust us to actually do it.
> 
> I think a more plausible idea would be to identify all known flaky
> tests, fix those, and then remove the recheck. This way, we only have to
> deal with new flaky tests that get written in the future. I think we can
> deal with this situation more easily than suddenly going from green CI
> to red CI because of existing flaky tests.
> 

What about the following compromise instead?

Identify all known flaky tests, fix the ones we can fix without too much
effort, tag the ones that we couldn't easily fix with a new keyword
(e.g., "unstable" or "flaky" or whatever), and only run recheck for
those.
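
Roughly, and only as a sketch (the "unstable" keyword and the negated
keyword selection are assumptions, none of this is implemented), the CI
script could then do something along these lines:

# Strict pass: everything not tagged "unstable" runs with no retry.
if ! make check TESTSUITEFLAGS="$JOBS $TEST_RANGE -k '!unstable'"; then
    cat tests/testsuite.log
    exit 1
fi
# Lenient pass: only the tests tagged "unstable" keep the auto-recheck.
if ! make check TESTSUITEFLAGS="$JOBS -k unstable" RECHECK=yes; then
    cat tests/testsuite.log
    exit 1
fi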

In support of this I did a test last night.  I force-pushed the commit
that removes the --recheck multiple times in my fork to trigger CI.
The GitHub action ran 23 times, which means the unit and system tests
were run 3 x 23 = 69 times.  I collected the results, grouped them
by test, and counted how many times each failed (disclaimer: it's not
completely fair because some tests run with different variations more
times than others, but it should be a good enough first approximation).
I got:

     40  ovn.at:30812       propagate Port_Binding.up to NB and OVS -- ovn-northd
     34  ovn-northd.at:9487 LSP incremental processing -- ovn-northd
      8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
      7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
      6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
      5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
      4  ovn-performance.at:227 ovn-controller incremental processing
      3  ovn.at:34900       Check default openflow flows -- ovn-northd
      3  ovn.at:34466       recomputes -- ovn-northd
      3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
      2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
      1  system-ovn.at:6525 egress qos -- ovn-northd
      1  system-ovn.at:11151 ovn mirroring -- ovn-northd
      1  ovn-northd.at:8809 Address set incremental processing -- ovn-northd
      1  ovn-controller.at:702 ovn-controller - ovn action metering -- ovn-northd
      1  ovn.at:29467       ARP replies for SNAT external ips -- ovn-northd
      1  ovn.at:13171       1 LR with HA distributed router gateway port -- ovn-northd
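
For reference, a tally like the one above could be produced roughly like
this (the log paths and the failure-summary format are assumptions about
the collected CI artifacts):

grep -hE '^ *[0-9]+: +[a-z-]+\.at:[0-9]+' ci-logs/*/testsuite.log \
    | sed -E 's/^ *[0-9]+: +//' \
    | sort | uniq -c | sort -rn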

The first two are probably real bugs and I reported them separately
yesterday [0], but the other 15 didn't fail that often.  I would even
go one step further and ignore the ones that only failed once (it's
acceptable, IMO, if every now and then a maintainer has to re-trigger CI).

That would leave us with 9 tests that failed during this exercise and
need either to be fixed or to be tagged as "unstable":
      8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
      7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
      6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
      5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
      4  ovn-performance.at:227 ovn-controller incremental processing
      3  ovn.at:34900       Check default openflow flows -- ovn-northd
      3  ovn.at:34466       recomputes -- ovn-northd
      3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
      2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd

Would you consider this compromise as a good alternative?

[0] https://mail.openvswitch.org/pipermail/ovs-dev/2023-June/405418.html

> 
> Everything below this sentence is wishful thinking and shouldn't get in
> the way of approving/rejecting this change.
> 
> 
> I also think that we need to identify the common causes of flaky tests,
> and take measures to prevent them from happening in the future. There
> are a few ways I can think of to accomplish this:
> 
> 1) Ensure people performing code reviews are aware of these patterns and
> point them out during code review.
> 
> 2) Automatically prevent known flaky patterns from existing in our code.
> For instance, we could introduce a build-time check that ensures that
> nobody attempts to sleep during a test to "let things settle."
> 
> 3) Provide functions/macros to use in the testsuite that avoid
> potentially flaky behavior. As a hypothetical example, let's say that a
> common cause of flakes is not using "--wait=hv" or "--wait=sb" after a
> series of ovn-nbctl commands. We could provide a macro like:
> 
> OVN_NBCTL([hv], [
> lr-add ro0
> ls-add sw0
> ])
> 
> (That's probably bad syntax, but I think you get the idea)
> 
> This would expand to something like:
> check ovn-nbctl lr-add ro0
> check ovn-nbctl ls-add sw0
> check ovn-nbctl --wait=hv sync
> 
> If we combine this with the idea from (2), we could prevent all bare
> uses of `ovn-nbctl` in tests, thus preventing this flaky behavior from
> existing.
> 
> 4) Identify commonly used patterns in tests and provide macros that
> essentially do the work for us. Some examples might be:
> 
> a) Common logical network topologies could be set up for you by calling
> a single macro.
> b) Checking logical flows could have a macro that dumps the information
> you care about, sorts the output, anonymizes the table numbers, and
> eliminates odd spacing in the output.
> c) Ensuring TCP/UDP/IP connectivity between two endpoints could
> potentially be done with a macro or two, assuming that the traffic you
> are sending is not overly specific.
> 
> The idea here is that by having commonly used macros that are
> bulletproof, we are less likely to introduce flakes by hand-coding the
> same scenarios. It also would likely reduce the code size of the
> testsuite, which is a nice bonus.
> 
> 5) Enhanced debugability of tests would be great. There are too many
> times that I've looked at testsuite.log on a failing test and had no
> idea where to start looking for the failure. There are other times where
> it's clear what failed, but trying to figure out why is more difficult.
> Unfortunately, there isn't likely to be a magic cure-all for this problem.
> 

I think all 5 points above are things that are nice to have regardless
of --recheck happening or not.

Thanks,
Dumitru

> On 6/8/23 10:04, Dumitru Ceara wrote:
>> If we want to catch new failures faster we have a better chance if CI
>> doesn't auto-retry (once).
>>
>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> ---
>> NOTE: Sending this as RFC to start the discussion with the community.
>> ---
>>   .ci/linux-build.sh | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
>> index 907a0dc6c9..64f7a96d91 100755
>> --- a/.ci/linux-build.sh
>> +++ b/.ci/linux-build.sh
>> @@ -66,7 +66,7 @@ function execute_tests()
>>         export DISTCHECK_CONFIGURE_FLAGS="$OPTS"
>>       if ! make distcheck CFLAGS="${COMMON_CFLAGS} ${OVN_CFLAGS}" $JOBS \
>> -        TESTSUITEFLAGS="$JOBS $TEST_RANGE" RECHECK=yes
>> +        TESTSUITEFLAGS="$JOBS $TEST_RANGE"
>>       then
>>           # testsuite.log is necessary for debugging.
>>           cat */_build/sub/tests/testsuite.log
>> @@ -81,7 +81,7 @@ function execute_system_tests()
>>           configure_ovn $OPTS
>>         make $JOBS || { cat config.log; exit 1; }
>> -      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"
>> RECHECK=yes; then
>> +      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"; then
>>             # $log_file is necessary for debugging.
>>             cat tests/$log_file
>>             exit 1
>
Frode Nordahl June 9, 2023, 11:01 a.m. UTC | #3
fre. 9. jun. 2023, 12:25 skrev Dumitru Ceara <dceara@redhat.com>:

> Hi Mark,
>
> Thanks for your feedback on this!
>
> On 6/8/23 21:58, Mark Michelson wrote:
> > In theory, this idea is great. By removing the recheck, our CI will
> > potentially turn red more often, resulting in us investigating our tests
> > to resolve what makes them flaky. Eventually, we will resolve all of the
> > flakes and CI will be green. From that point on, any time the CI turns
> > red, it will immediately get our attention and we will fix whatever
> > needs fixing.
> >
> > I have two main fears here:
> >
> > 1) If we don't fix the existing flaky tests quickly enough, it will be
> > considered "normal" for CI to be red, meaning we are less likely to
> > notice genuine failures when they happen. Worse, developers familiar
> > with the flaky reputation of tests will just issue their own rechecks
> > until the CI turns green.
> >
> > 2) Let's say we track down and fix all current flaky tests. Developers
> > will soon write new tests, and those tests may end up being flaky. Then
> > our CI will start turning red again due to test failures seemingly
> > unrelated to code being committed. The result is the same as (1).
> >
> > By leaving the recheck intact, we are less likely to notice and fix
> > flaky tests. That's also bad.
> >
> > I think that if we are going to remove the recheck now, then we have to
> > be fully dedicated to eradicating all existing flaky tests. If we as a
> > project can be dedicated to fixing flaky tests *immediately* when they
> > are found, then we could move forward with this change now. I just don't
> > trust us to actually do it.
> >
> > I think a more plausible idea would be to identify all known flaky
> > tests, fix those, and then remove the recheck. This way, we only have to
> > deal with new flaky tests that get written in the future. I think we can
> > deal with this situation more easily than suddenly going from green CI
> > to red CI because of existing flaky tests.
> >
>
> What about the following compromise instead?
>
> Identify all known flaky tests, fix the ones we can fix without too much
> effort, tag the ones that we couldn't easily fix with a new keyword
> (e.g., "unstable" or "flaky" or whatever), and only run recheck for
> those.
>
> In support of this I did a test last night.  I force pushed the commit
> that removes the --recheck multiple times in my fork to trigger CI.
> The GitHub action ran 23 times which means the unit and system tests
> were ran 3 x 23 = 69 times.  I collected the results and grouped them
> by test and counted how many times they failed (disclaimer: it's not
> completely fair because some tests run with different variations more
> times than others but it should be a good enough first approximation).
> I got:
>
>      40  ovn.at:30812       propagate Port_Binding.up to NB and OVS -- ovn-northd
>      34  ovn-northd.at:9487 LSP incremental processing -- ovn-northd
>       8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
>       7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
>       6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
>       5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
>       4  ovn-performance.at:227 ovn-controller incremental processing
>       3  ovn.at:34900       Check default openflow flows -- ovn-northd
>       3  ovn.at:34466       recomputes -- ovn-northd
>       3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
>       2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
>       1  system-ovn.at:6525 egress qos -- ovn-northd
>       1  system-ovn.at:11151 ovn mirroring -- ovn-northd
>       1  ovn-northd.at:8809 Address set incremental processing -- ovn-northd
>       1  ovn-controller.at:702 ovn-controller - ovn action metering -- ovn-northd
>       1  ovn.at:29467       ARP replies for SNAT external ips -- ovn-northd
>       1  ovn.at:13171       1 LR with HA distributed router gateway port -- ovn-northd
>
> The first two are probably real bugs and I reported them separately
> yesterday [0] but the other 15 didn't fail that often.  I would even
> go one step further and ignore the ones that only failed once (it's
> acceptable IMO every now and then a maintainer has to re-trigger CI).
>
> That would leave us with 9 tests that failed during this exercise and
> need either to be fixed or to be tagged as "unstable":
>       8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
>       7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
>       6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
>       5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
>       4  ovn-performance.at:227 ovn-controller incremental processing
>       3  ovn.at:34900       Check default openflow flows -- ovn-northd
>       3  ovn.at:34466       recomputes -- ovn-northd
>       3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
>       2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
>

Just a data point, the above list is not too far from what Debian/Ubuntu
currently list as flaky tests in the packaging [1].

Would you consider this compromise as a good alternative?
>

The existence of flaky tests is an annoyance outside of the context of the
upstream CI too, so +1 on any effort to stop new ones from creeping in.

1:
https://salsa.debian.org/openstack-team/third-party/ovn/-/tree/debian/sid/debian

--
Frode Nordahl

[0] https://mail.openvswitch.org/pipermail/ovs-dev/2023-June/405418.html
>
> >
> > Everything below this sentence is wishful thinking and shouldn't get in
> > the way of approving/rejecting this change.
> >
> >
> > I also think that we need to identify the common causes of flaky tests,
> > and take measures to prevent them from happening in the future. There
> > are a few ways I can think of to accomplish this:
> >
> > 1) Ensure people performing code reviews are aware of these patterns and
> > point them out during code review.
> >
> > 2) Automatically prevent known flaky patterns from existing in our code.
> > For instance, we could introduce a build-time check that ensures that
> > nobody attempts to sleep during a test to "let things settle."
> >
> > 3) Provide functions/macros to use in the testsuite that avoid
> > potentially flaky behavior. As a hypothetical example, let's say that a
> > common cause of flakes is not using "--wait=hv" or "--wait=sb" after a
> > series of ovn-nbctl commands. We could provide a macro like:
> >
> > OVN_NBCTL([hv], [
> > lr-add ro0
> > ls-add sw0
> > ])
> >
> > (That's probably bad syntax, but I think you get the idea)
> >
> > This would expand to something like:
> > check ovn-nbctl lr-add ro0
> > check ovn-nbctl ls-add sw0
> > check ovn-nbctl --wait=hv sync
> >
> > If we combine this with the idea from (2), we could prevent all bare
> > uses of `ovn-nbctl` in tests, thus preventing this flaky behavior from
> > existing.
> >
> > 4) Identify commonly used patterns in tests and provide macros that
> > essentially do the work for us. Some examples might be:
> >
> > a) Common logical network topologies could be set up for you by calling
> > a single macro.
> > b) Checking logical flows could have a macro that dumps the information
> > you care about, sorts the output, anonymizes the table numbers, and
> > eliminates odd spacing in the output.
> > c) Ensuring TCP/UDP/IP connectivity between two endpoints could
> > potentially be done with a macro or two, assuming that the traffic you
> > are sending is not overly specific.
> >
> > The idea here is that by having commonly used macros that are
> > bulletproof, we are less likely to introduce flakes by hand-coding the
> > same scenarios. It also would likely reduce the code size of the
> > testsuite, which is a nice bonus.
> >
> > 5) Enhanced debugability of tests would be great. There are too many
> > times that I've looked at testsuite.log on a failing test and had no
> > idea where to start looking for the failure. There are other times where
> > it's clear what failed, but trying to figure out why is more difficult.
> > Unfortunately, there isn't likely to be a magic cure-all for this
> problem.
> >
>
> I think all 5 points above are things that are nice to have regardless
> of --recheck happening or not.
>
> Thanks,
> Dumitru
>
> > On 6/8/23 10:04, Dumitru Ceara wrote:
> >> If we want to catch new failures faster we have a better chance if CI
> >> doesn't auto-retry (once).
> >>
> >> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >> ---
> >> NOTE: Sending this as RFC to start the discussion with the community.
> >> ---
> >>   .ci/linux-build.sh | 4 ++--
> >>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
> >> index 907a0dc6c9..64f7a96d91 100755
> >> --- a/.ci/linux-build.sh
> >> +++ b/.ci/linux-build.sh
> >> @@ -66,7 +66,7 @@ function execute_tests()
> >>         export DISTCHECK_CONFIGURE_FLAGS="$OPTS"
> >>       if ! make distcheck CFLAGS="${COMMON_CFLAGS} ${OVN_CFLAGS}" $JOBS
> \
> >> -        TESTSUITEFLAGS="$JOBS $TEST_RANGE" RECHECK=yes
> >> +        TESTSUITEFLAGS="$JOBS $TEST_RANGE"
> >>       then
> >>           # testsuite.log is necessary for debugging.
> >>           cat */_build/sub/tests/testsuite.log
> >> @@ -81,7 +81,7 @@ function execute_system_tests()
> >>           configure_ovn $OPTS
> >>         make $JOBS || { cat config.log; exit 1; }
> >> -      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"
> >> RECHECK=yes; then
> >> +      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"; then
> >>             # $log_file is necessary for debugging.
> >>             cat tests/$log_file
> >>             exit 1
> >
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Mark Michelson June 9, 2023, 6:52 p.m. UTC | #4
On 6/9/23 06:25, Dumitru Ceara wrote:
> Hi Mark,
> 
> Thanks for your feedback on this!
> 
> On 6/8/23 21:58, Mark Michelson wrote:
>> In theory, this idea is great. By removing the recheck, our CI will
>> potentially turn red more often, resulting in us investigating our tests
>> to resolve what makes them flaky. Eventually, we will resolve all of the
>> flakes and CI will be green. From that point on, any time the CI turns
>> red, it will immediately get our attention and we will fix whatever
>> needs fixing.
>>
>> I have two main fears here:
>>
>> 1) If we don't fix the existing flaky tests quickly enough, it will be
>> considered "normal" for CI to be red, meaning we are less likely to
>> notice genuine failures when they happen. Worse, developers familiar
>> with the flaky reputation of tests will just issue their own rechecks
>> until the CI turns green.
>>
>> 2) Let's say we track down and fix all current flaky tests. Developers
>> will soon write new tests, and those tests may end up being flaky. Then
>> our CI will start turning red again due to test failures seemingly
>> unrelated to code being committed. The result is the same as (1).
>>
>> By leaving the recheck intact, we are less likely to notice and fix
>> flaky tests. That's also bad.
>>
>> I think that if we are going to remove the recheck now, then we have to
>> be fully dedicated to eradicating all existing flaky tests. If we as a
>> project can be dedicated to fixing flaky tests *immediately* when they
>> are found, then we could move forward with this change now. I just don't
>> trust us to actually do it.
>>
>> I think a more plausible idea would be to identify all known flaky
>> tests, fix those, and then remove the recheck. This way, we only have to
>> deal with new flaky tests that get written in the future. I think we can
>> deal with this situation more easily than suddenly going from green CI
>> to red CI because of existing flaky tests.
>>
> 
> What about the following compromise instead?
> 
> Identify all known flaky tests, fix the ones we can fix without too much
> effort, tag the ones that we couldn't easily fix with a new keyword
> (e.g., "unstable" or "flaky" or whatever), and only run recheck for
> those.
> 
> In support of this I did a test last night.  I force pushed the commit
> that removes the --recheck multiple times in my fork to trigger CI.
> The GitHub action ran 23 times which means the unit and system tests
> were ran 3 x 23 = 69 times.  I collected the results and grouped them
> by test and counted how many times they failed (disclaimer: it's not
> completely fair because some tests run with different variations more
> times than others but it should be a good enough first approximation).
> I got:
> 
>       40  ovn.at:30812       propagate Port_Binding.up to NB and OVS -- ovn-northd
>       34  ovn-northd.at:9487 LSP incremental processing -- ovn-northd
>        8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
>        7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
>        6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
>        5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
>        4  ovn-performance.at:227 ovn-controller incremental processing
>        3  ovn.at:34900       Check default openflow flows -- ovn-northd
>        3  ovn.at:34466       recomputes -- ovn-northd
>        3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
>        2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
>        1  system-ovn.at:6525 egress qos -- ovn-northd
>        1  system-ovn.at:11151 ovn mirroring -- ovn-northd
>        1  ovn-northd.at:8809 Address set incremental processing -- ovn-northd
>        1  ovn-controller.at:702 ovn-controller - ovn action metering -- ovn-northd
>        1  ovn.at:29467       ARP replies for SNAT external ips -- ovn-northd
>        1  ovn.at:13171       1 LR with HA distributed router gateway port -- ovn-northd
> 
> The first two are probably real bugs and I reported them separately
> yesterday [0] but the other 15 didn't fail that often.  I would even
> go one step further and ignore the ones that only failed once (it's
> acceptable IMO every now and then a maintainer has to re-trigger CI).
> 
> That would leave us with 9 tests that failed during this exercise and
> need either to be fixed or to be tagged as "unstable":
>        8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
>        7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
>        6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
>        5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
>        4  ovn-performance.at:227 ovn-controller incremental processing
>        3  ovn.at:34900       Check default openflow flows -- ovn-northd
>        3  ovn.at:34466       recomputes -- ovn-northd
>        3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
>        2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
> 
> Would you consider this compromise as a good alternative?

I think this can be a good alternative. I think if we go with this, then 
we need to be vigilant and be willing to adjust if it turns out that we 
end up having more CI failures than expected.

The other thing we have to avoid is haphazardly labeling tests as 
"unstable" instead of actually trying to fix them.

And finally, if tests are marked as "unstable", we have to answer the 
question of whether they should even remain in the testsuite. If they're 
always being skipped and we never intend to fix the tests, then they 
probably should be removed and replaced with something else.

> 
> [0] https://mail.openvswitch.org/pipermail/ovs-dev/2023-June/405418.html
> 
>>
>> Everything below this sentence is wishful thinking and shouldn't get in
>> the way of approving/rejecting this change.
>>
>>
>> I also think that we need to identify the common causes of flaky tests,
>> and take measures to prevent them from happening in the future. There
>> are a few ways I can think of to accomplish this:
>>
>> 1) Ensure people performing code reviews are aware of these patterns and
>> point them out during code review.
>>
>> 2) Automatically prevent known flaky patterns from existing in our code.
>> For instance, we could introduce a build-time check that ensures that
>> nobody attempts to sleep during a test to "let things settle."
>>
>> 3) Provide functions/macros to use in the testsuite that avoid
>> potentially flaky behavior. As a hypothetical example, let's say that a
>> common cause of flakes is not using "--wait=hv" or "--wait=sb" after a
>> series of ovn-nbctl commands. We could provide a macro like:
>>
>> OVN_NBCTL([hv], [
>> lr-add ro0
>> ls-add sw0
>> ])
>>
>> (That's probably bad syntax, but I think you get the idea)
>>
>> This would expand to something like:
>> check ovn-nbctl lr-add ro0
>> check ovn-nbctl ls-add sw0
>> check ovn-nbctl --wait=hv sync
>>
>> If we combine this with the idea from (2), we could prevent all bare
>> uses of `ovn-nbctl` in tests, thus preventing this flaky behavior from
>> existing.
>>
>> 4) Identify commonly used patterns in tests and provide macros that
>> essentially do the work for us. Some examples might be:
>>
>> a) Common logical network topologies could be set up for you by calling
>> a single macro.
>> b) Checking logical flows could have a macro that dumps the information
>> you care about, sorts the output, anonymizes the table numbers, and
>> eliminates odd spacing in the output.
>> c) Ensuring TCP/UDP/IP connectivity between two endpoints could
>> potentially be done with a macro or two, assuming that the traffic you
>> are sending is not overly specific.
>>
>> The idea here is that by having commonly used macros that are
>> bulletproof, we are less likely to introduce flakes by hand-coding the
>> same scenarios. It also would likely reduce the code size of the
>> testsuite, which is a nice bonus.
>>
>> 5) Enhanced debugability of tests would be great. There are too many
>> times that I've looked at testsuite.log on a failing test and had no
>> idea where to start looking for the failure. There are other times where
>> it's clear what failed, but trying to figure out why is more difficult.
>> Unfortunately, there isn't likely to be a magic cure-all for this problem.
>>
> 
> I think all 5 points above are things that are nice to have regardless
> of --recheck happening or not.
> 
> Thanks,
> Dumitru
> 
>> On 6/8/23 10:04, Dumitru Ceara wrote:
>>> If we want to catch new failures faster we have a better chance if CI
>>> doesn't auto-retry (once).
>>>
>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>>> ---
>>> NOTE: Sending this as RFC to start the discussion with the community.
>>> ---
>>>    .ci/linux-build.sh | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
>>> index 907a0dc6c9..64f7a96d91 100755
>>> --- a/.ci/linux-build.sh
>>> +++ b/.ci/linux-build.sh
>>> @@ -66,7 +66,7 @@ function execute_tests()
>>>          export DISTCHECK_CONFIGURE_FLAGS="$OPTS"
>>>        if ! make distcheck CFLAGS="${COMMON_CFLAGS} ${OVN_CFLAGS}" $JOBS \
>>> -        TESTSUITEFLAGS="$JOBS $TEST_RANGE" RECHECK=yes
>>> +        TESTSUITEFLAGS="$JOBS $TEST_RANGE"
>>>        then
>>>            # testsuite.log is necessary for debugging.
>>>            cat */_build/sub/tests/testsuite.log
>>> @@ -81,7 +81,7 @@ function execute_system_tests()
>>>            configure_ovn $OPTS
>>>          make $JOBS || { cat config.log; exit 1; }
>>> -      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"
>>> RECHECK=yes; then
>>> +      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"; then
>>>              # $log_file is necessary for debugging.
>>>              cat tests/$log_file
>>>              exit 1
>>
>
Han Zhou June 9, 2023, 8:21 p.m. UTC | #5
On Fri, Jun 9, 2023 at 11:53 AM Mark Michelson <mmichels@redhat.com> wrote:
>
> On 6/9/23 06:25, Dumitru Ceara wrote:
> > Hi Mark,
> >
> > Thanks for your feedback on this!
> >
> > On 6/8/23 21:58, Mark Michelson wrote:
> >> In theory, this idea is great. By removing the recheck, our CI will
> >> potentially turn red more often, resulting in us investigating our
tests
> >> to resolve what makes them flaky. Eventually, we will resolve all of
the
> >> flakes and CI will be green. From that point on, any time the CI turns
> >> red, it will immediately get our attention and we will fix whatever
> >> needs fixing.
> >>
> >> I have two main fears here:
> >>
> >> 1) If we don't fix the existing flaky tests quickly enough, it will be
> >> considered "normal" for CI to be red, meaning we are less likely to
> >> notice genuine failures when they happen. Worse, developers familiar
> >> with the flaky reputation of tests will just issue their own rechecks
> >> until the CI turns green.
> >>
> >> 2) Let's say we track down and fix all current flaky tests. Developers
> >> will soon write new tests, and those tests may end up being flaky. Then
> >> our CI will start turning red again due to test failures seemingly
> >> unrelated to code being committed. The result is the same as (1).
> >>
> >> By leaving the recheck intact, we are less likely to notice and fix
> >> flaky tests. That's also bad.
> >>
> >> I think that if we are going to remove the recheck now, then we have to
> >> be fully dedicated to eradicating all existing flaky tests. If we as a
> >> project can be dedicated to fixing flaky tests *immediately* when they
> >> are found, then we could move forward with this change now. I just
don't
> >> trust us to actually do it.
> >>
> >> I think a more plausible idea would be to identify all known flaky
> >> tests, fix those, and then remove the recheck. This way, we only have
to
> >> deal with new flaky tests that get written in the future. I think we
can
> >> deal with this situation more easily than suddenly going from green CI
> >> to red CI because of existing flaky tests.
> >>
> >
> > What about the following compromise instead?
> >
> > Identify all known flaky tests, fix the ones we can fix without too much
> > effort, tag the ones that we couldn't easily fix with a new keyword
> > (e.g., "unstable" or "flaky" or whatever), and only run recheck for
> > those.
> >
> > In support of this I did a test last night.  I force pushed the commit
> > that removes the --recheck multiple times in my fork to trigger CI.
> > The GitHub action ran 23 times which means the unit and system tests
> > were ran 3 x 23 = 69 times.  I collected the results and grouped them
> > by test and counted how many times they failed (disclaimer: it's not
> > completely fair because some tests run with different variations more
> > times than others but it should be a good enough first approximation).
> > I got:
> >
> >       40  ovn.at:30812       propagate Port_Binding.up to NB and OVS -- ovn-northd
> >       34  ovn-northd.at:9487 LSP incremental processing -- ovn-northd
> >        8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
> >        7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
> >        6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
> >        5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
> >        4  ovn-performance.at:227 ovn-controller incremental processing
> >        3  ovn.at:34900       Check default openflow flows -- ovn-northd
> >        3  ovn.at:34466       recomputes -- ovn-northd
> >        3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
> >        2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
> >        1  system-ovn.at:6525 egress qos -- ovn-northd
> >        1  system-ovn.at:11151 ovn mirroring -- ovn-northd
> >        1  ovn-northd.at:8809 Address set incremental processing -- ovn-northd
> >        1  ovn-controller.at:702 ovn-controller - ovn action metering -- ovn-northd
> >        1  ovn.at:29467       ARP replies for SNAT external ips -- ovn-northd
> >        1  ovn.at:13171       1 LR with HA distributed router gateway port -- ovn-northd
> >
> > The first two are probably real bugs and I reported them separately

The first is a bug, and here is the fix:
https://patchwork.ozlabs.org/project/ovn/list/?series=359066

The second may be categorized as unstable test, and here is an attempt to
make it stable (not sure if it can be 100% stable because performance
counter related checks are always tricky):
https://patchwork.ozlabs.org/project/ovn/patch/20230609194337.3538411-1-hzhou@ovn.org/

> > yesterday [0] but the other 15 didn't fail that often.  I would even
> > go one step further and ignore the ones that only failed once (it's
> > acceptable IMO every now and then a maintainer has to re-trigger CI).
> >
> > That would leave us with 9 tests that failed during this exercise and
> > need either to be fixed or to be tagged as "unstable":
> >        8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
> >        7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
> >        6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
> >        5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
> >        4  ovn-performance.at:227 ovn-controller incremental processing
> >        3  ovn.at:34900       Check default openflow flows -- ovn-northd
> >        3  ovn.at:34466       recomputes -- ovn-northd
> >        3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
> >        2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
> >
> > Would you consider this compromise as a good alternative?
>

+1 for this good alternative.

> I think this can be a good alternative. I think if we go with this, then
> we need to be vigilant and be willing to adjust if it turns out that we
> end up having more CI failures than expected.
>
> The other thing we have to avoid is haphazardly labeling tests as
> "unstable" instead of actually trying to fix them.
>
> And finally, if tests are marked as "unstable", we have to answer the
> question of whether they should even remain in the testsuite. If they're
> always being skipped and we never intend to fix the tests, then they
> probably should be removed and replaced with something else.
>
I'd like to interpret "unstable" as "should pass in most cases but may
occasionally fail, with a >90% (or even higher) chance that a retry will pass
if it fails on the first attempt", rather than "very likely to fail".
I think at least today we should not have test cases that fail more often
than this in CI.
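(As a rough worked example: a test that independently fails about one run
in ten would see a retry pass about 90% of the time, and would turn the
whole run red only about 0.1 x 0.1 = 1% of the time when a single recheck
is allowed.)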
Of course, if there are any new "very likely to fail" test cases we should
treat that as CI breakage and fix them immediately (if it unfortunately wasn't
caught before merging for whatever reason, which shouldn't happen in most
cases).

Thanks,
Han

> >
> > [0] https://mail.openvswitch.org/pipermail/ovs-dev/2023-June/405418.html
> >
> >>
> >> Everything below this sentence is wishful thinking and shouldn't get in
> >> the way of approving/rejecting this change.
> >>
> >>
> >> I also think that we need to identify the common causes of flaky tests,
> >> and take measures to prevent them from happening in the future. There
> >> are a few ways I can think of to accomplish this:
> >>
> >> 1) Ensure people performing code reviews are aware of these patterns
and
> >> point them out during code review.
> >>
> >> 2) Automatically prevent known flaky patterns from existing in our
code.
> >> For instance, we could introduce a build-time check that ensures that
> >> nobody attempts to sleep during a test to "let things settle."
> >>
> >> 3) Provide functions/macros to use in the testsuite that avoid
> >> potentially flaky behavior. As a hypothetical example, let's say that a
> >> common cause of flakes is not using "--wait=hv" or "--wait=sb" after a
> >> series of ovn-nbctl commands. We could provide a macro like:
> >>
> >> OVN_NBCTL([hv], [
> >> lr-add ro0
> >> ls-add sw0
> >> ])
> >>
> >> (That's probably bad syntax, but I think you get the idea)
> >>
> >> This would expand to something like:
> >> check ovn-nbctl lr-add ro0
> >> check ovn-nbctl ls-add sw0
> >> check ovn-nbctl --wait=hv sync
> >>
> >> If we combine this with the idea from (2), we could prevent all bare
> >> uses of `ovn-nbctl` in tests, thus preventing this flaky behavior from
> >> existing.
> >>
> >> 4) Identify commonly used patterns in tests and provide macros that
> >> essentially do the work for us. Some examples might be:
> >>
> >> a) Common logical network topologies could be set up for you by calling
> >> a single macro.
> >> b) Checking logical flows could have a macro that dumps the information
> >> you care about, sorts the output, anonymizes the table numbers, and
> >> eliminates odd spacing in the output.
> >> c) Ensuring TCP/UDP/IP connectivity between two endpoints could
> >> potentially be done with a macro or two, assuming that the traffic you
> >> are sending is not overly specific.
> >>
> >> The idea here is that by having commonly used macros that are
> >> bulletproof, we are less likely to introduce flakes by hand-coding the
> >> same scenarios. It also would likely reduce the code size of the
> >> testsuite, which is a nice bonus.
> >>
> >> 5) Enhanced debugability of tests would be great. There are too many
> >> times that I've looked at testsuite.log on a failing test and had no
> >> idea where to start looking for the failure. There are other times
where
> >> it's clear what failed, but trying to figure out why is more difficult.
> >> Unfortunately, there isn't likely to be a magic cure-all for this
problem.
> >>
> >
> > I think all 5 points above are things that are nice to have regardless
> > of --recheck happening or not.
> >
> > Thanks,
> > Dumitru
> >
> >> On 6/8/23 10:04, Dumitru Ceara wrote:
> >>> If we want to catch new failures faster we have a better chance if CI
> >>> doesn't auto-retry (once).
> >>>
> >>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >>> ---
> >>> NOTE: Sending this as RFC to start the discussion with the community.
> >>> ---
> >>>    .ci/linux-build.sh | 4 ++--
> >>>    1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
> >>> index 907a0dc6c9..64f7a96d91 100755
> >>> --- a/.ci/linux-build.sh
> >>> +++ b/.ci/linux-build.sh
> >>> @@ -66,7 +66,7 @@ function execute_tests()
> >>>          export DISTCHECK_CONFIGURE_FLAGS="$OPTS"
> >>>        if ! make distcheck CFLAGS="${COMMON_CFLAGS} ${OVN_CFLAGS}"
$JOBS \
> >>> -        TESTSUITEFLAGS="$JOBS $TEST_RANGE" RECHECK=yes
> >>> +        TESTSUITEFLAGS="$JOBS $TEST_RANGE"
> >>>        then
> >>>            # testsuite.log is necessary for debugging.
> >>>            cat */_build/sub/tests/testsuite.log
> >>> @@ -81,7 +81,7 @@ function execute_system_tests()
> >>>            configure_ovn $OPTS
> >>>          make $JOBS || { cat config.log; exit 1; }
> >>> -      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"
> >>> RECHECK=yes; then
> >>> +      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"; then
> >>>              # $log_file is necessary for debugging.
> >>>              cat tests/$log_file
> >>>              exit 1
> >>
> >
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Dumitru Ceara July 5, 2023, 5:11 p.m. UTC | #6
On 6/9/23 22:21, Han Zhou wrote:
> On Fri, Jun 9, 2023 at 11:53 AM Mark Michelson <mmichels@redhat.com> wrote:
>>
>> On 6/9/23 06:25, Dumitru Ceara wrote:
>>> Hi Mark,
>>>
>>> Thanks for your feedback on this!
>>>
>>> On 6/8/23 21:58, Mark Michelson wrote:
>>>> In theory, this idea is great. By removing the recheck, our CI will
>>>> potentially turn red more often, resulting in us investigating our
> tests
>>>> to resolve what makes them flaky. Eventually, we will resolve all of
> the
>>>> flakes and CI will be green. From that point on, any time the CI turns
>>>> red, it will immediately get our attention and we will fix whatever
>>>> needs fixing.
>>>>
>>>> I have two main fears here:
>>>>
>>>> 1) If we don't fix the existing flaky tests quickly enough, it will be
>>>> considered "normal" for CI to be red, meaning we are less likely to
>>>> notice genuine failures when they happen. Worse, developers familiar
>>>> with the flaky reputation of tests will just issue their own rechecks
>>>> until the CI turns green.
>>>>
>>>> 2) Let's say we track down and fix all current flaky tests. Developers
>>>> will soon write new tests, and those tests may end up being flaky. Then
>>>> our CI will start turning red again due to test failures seemingly
>>>> unrelated to code being committed. The result is the same as (1).
>>>>
>>>> By leaving the recheck intact, we are less likely to notice and fix
>>>> flaky tests. That's also bad.
>>>>
>>>> I think that if we are going to remove the recheck now, then we have to
>>>> be fully dedicated to eradicating all existing flaky tests. If we as a
>>>> project can be dedicated to fixing flaky tests *immediately* when they
>>>> are found, then we could move forward with this change now. I just
> don't
>>>> trust us to actually do it.
>>>>
>>>> I think a more plausible idea would be to identify all known flaky
>>>> tests, fix those, and then remove the recheck. This way, we only have
> to
>>>> deal with new flaky tests that get written in the future. I think we
> can
>>>> deal with this situation more easily than suddenly going from green CI
>>>> to red CI because of existing flaky tests.
>>>>
>>>
>>> What about the following compromise instead?
>>>
>>> Identify all known flaky tests, fix the ones we can fix without too much
>>> effort, tag the ones that we couldn't easily fix with a new keyword
>>> (e.g., "unstable" or "flaky" or whatever), and only run recheck for
>>> those.
>>>
>>> In support of this I did a test last night.  I force pushed the commit
>>> that removes the --recheck multiple times in my fork to trigger CI.
>>> The GitHub action ran 23 times which means the unit and system tests
>>> were ran 3 x 23 = 69 times.  I collected the results and grouped them
>>> by test and counted how many times they failed (disclaimer: it's not
>>> completely fair because some tests run with different variations more
>>> times than others but it should be a good enough first approximation).
>>> I got:
>>>
>>>       40  ovn.at:30812       propagate Port_Binding.up to NB and OVS -- ovn-northd
>>>       34  ovn-northd.at:9487 LSP incremental processing -- ovn-northd
>>>        8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
>>>        7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
>>>        6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
>>>        5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
>>>        4  ovn-performance.at:227 ovn-controller incremental processing
>>>        3  ovn.at:34900       Check default openflow flows -- ovn-northd
>>>        3  ovn.at:34466       recomputes -- ovn-northd
>>>        3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
>>>        2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
>>>        1  system-ovn.at:6525 egress qos -- ovn-northd
>>>        1  system-ovn.at:11151 ovn mirroring -- ovn-northd
>>>        1  ovn-northd.at:8809 Address set incremental processing -- ovn-northd
>>>        1  ovn-controller.at:702 ovn-controller - ovn action metering -- ovn-northd
>>>        1  ovn.at:29467       ARP replies for SNAT external ips -- ovn-northd
>>>        1  ovn.at:13171       1 LR with HA distributed router gateway port -- ovn-northd
>>>
>>> The first two are probably real bugs and I reported them separately
> 
> The first is a bug, and here is the fix:
> https://patchwork.ozlabs.org/project/ovn/list/?series=359066
> 
> The second may be categorized as unstable test, and here is an attempt to
> make it stable (not sure if it can be 100% stable because performance
> counter related checks are always tricky):
> https://patchwork.ozlabs.org/project/ovn/patch/20230609194337.3538411-1-hzhou@ovn.org/
> 
>>> yesterday [0] but the other 15 didn't fail that often.  I would even
>>> go one step further and ignore the ones that only failed once (it's
>>> acceptable IMO every now and then a maintainer has to re-trigger CI).
>>>
>>> That would leave us with 9 tests that failed during this exercise and
>>> need either to be fixed or to be tagged as "unstable":
>>>        8  ovn.at:14436       options:multiple requested-chassis for logical port: unclaimed behavior -- ovn-northd
>>>        7  ovn.at:16010       tug-of-war between two chassis for the same port -- ovn-northd
>>>        6  ovn.at:29369       nb_cfg timestamp -- ovn-northd
>>>        5  ovn-northd.at:4507 northd ssl file change -- ovn-northd
>>>        4  ovn-performance.at:227 ovn-controller incremental processing
>>>        3  ovn.at:34900       Check default openflow flows -- ovn-northd
>>>        3  ovn.at:34466       recomputes -- ovn-northd
>>>        3  ovn.at:14284       options:multiple requested-chassis for logical port -- ovn-northd
>>>        2  ovn.at:7909        policy-based routing IPv6: 1 HVs, 3 LSs, 1 lport/LS, 1 LR -- ovn-northd
>>>
>>> Would you consider this compromise as a good alternative?
>>
> 
> +1 for this good alternative.
> 
>> I think this can be a good alternative. I think if we go with this, then
>> we need to be vigilant and be willing to adjust if it turns out that we
>> end up having more CI failures than expected.
>>
>> The other thing we have to avoid is haphazardly labeling tests as
>> "unstable" instead of actually trying to fix them.
>>
>> And finally, if tests are marked as "unstable", we have to answer the
>> question of whether they should even remain in the testsuite. If they're
>> always being skipped and we never intend to fix the tests, then they
>> probably should be removed and replaced with something else.
>>
> I'd like to interpret "unstable" as "should pass in most cases but may
> occasionally fail, and >90% (or even higher) chance a retry will pass if it
> fails in the first attempt", rather than "very likely to fail".
> I think at least today we should not have test cases that fail more than
> this in CI.
> Of course if there are any new "very likely to fail" test cases we should
> treat it as CI broken and fix them immediately (if unfortunately wasn't
> caught before merging for whatever reason, which shouldn't happen in most
> cases).
> 

I posted a patch to implement the alternative discussed above:

https://patchwork.ozlabs.org/project/ovn/patch/20230705170813.1199533-1-dceara@redhat.com/

Regards,
Dumitru

Patch

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 907a0dc6c9..64f7a96d91 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -66,7 +66,7 @@  function execute_tests()
 
     export DISTCHECK_CONFIGURE_FLAGS="$OPTS"
     if ! make distcheck CFLAGS="${COMMON_CFLAGS} ${OVN_CFLAGS}" $JOBS \
-        TESTSUITEFLAGS="$JOBS $TEST_RANGE" RECHECK=yes
+        TESTSUITEFLAGS="$JOBS $TEST_RANGE"
     then
         # testsuite.log is necessary for debugging.
         cat */_build/sub/tests/testsuite.log
@@ -81,7 +81,7 @@  function execute_system_tests()
 
       configure_ovn $OPTS
       make $JOBS || { cat config.log; exit 1; }
-      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE" RECHECK=yes; then
+      if ! sudo make $JOBS $type TESTSUITEFLAGS="$TEST_RANGE"; then
           # $log_file is necessary for debugging.
           cat tests/$log_file
           exit 1