
[ovs-dev,v3,2/2] ci: Run tc offload tests in GitHub Actions.

Message ID 167828869871.885480.13278006837271919367.stgit@ebuild.local
State Rejected
Series [ovs-dev,v3,1/2] github: Combine ASAN and UBSAN runs.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/intel-ovs-compilation success test: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Eelco Chaudron March 8, 2023, 3:18 p.m. UTC
Run "make check-offloads" as part of the GitHub actions tests.

This test was run 25 times using GitHub Actions, and the test cases
that failed on rerun were excluded. There are quite a few first-run
failures, but unfortunately no more stable kernel is available on the
GitHub-hosted runners.

Sanitizers are not yet included in this run, as they cause the tests
to run too long (>30 min) and there seem to be (timing) issues with
some of the tests.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
---

v2: Added a new test keyword to exclude the failing tests.
    Added some documentation around the keyword usage.
v3: Fixed documentation and log copy for make distcheck.

 .ci/linux-build.sh                   |    6 +++++-
 .github/workflows/build-and-test.yml |    9 ++++++++-
 Documentation/topics/testing.rst     |   17 +++++++++++++++++
 tests/system-offloads-traffic.at     |    3 +++
 tests/system-traffic.at              |    2 ++
 5 files changed, 35 insertions(+), 2 deletions(-)

Comments

Simon Horman March 9, 2023, 2:42 p.m. UTC | #1
On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
> Run "make check-offloads" as part of the GitHub actions tests.
> 
> This test was run 25 times using GitHub actions, and the
> failing rerun test cases where excluded. There are quite some
> first-run failures, but unfortunately, there is no other
> more stable kernel available as a GitHub-hosted runner.
> 
> Did not yet include sanitizers in the run, as it's causing
> the test to run too long >30min and there seems to be (timing)
> issues with some of the tests.
> 
> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>

Hi Eelco,

I like this patch a lot.
But I am observing reliability problems when executing the new job.

For 5 runs, on each occasion some tests failed the first time.
And on 3 of those runs at least one test failed on the recheck,
so the job failed.

The job is here: https://github.com/horms/ovs/actions/runs/4365805718/jobs/7654575559

And a summary of the checks run for recheck for each job run is as follows.

Attempt #1
85:  failed
166: ok

Attempt #2
50:  ok
64:  failed
86:  ok
135: failed

Attempt #3
50:  ok

Attempt #4
50:  failed

Attempt #5
50:  ok
163: ok
169: ok
Eelco Chaudron March 9, 2023, 4:22 p.m. UTC | #2
On 9 Mar 2023, at 15:42, Simon Horman wrote:

> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
>> Run "make check-offloads" as part of the GitHub actions tests.
>>
>> This test was run 25 times using GitHub actions, and the
>> failing rerun test cases where excluded. There are quite some
>> first-run failures, but unfortunately, there is no other
>> more stable kernel available as a GitHub-hosted runner.
>>
>> Did not yet include sanitizers in the run, as it's causing
>> the test to run too long >30min and there seems to be (timing)
>> issues with some of the tests.
>>
>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>
> Hi Eelco,
>
> I like this patch a lot.
> But I am observing reliability problems when executing the new job.
>
> For 5 runs, on each occasion some tests failed the first time.
> And on 3 of those runs at least one test failed on the recheck,
> so the job failed.

Damn :) I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.

I can do this again and figure out which tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.

Ilya, any other approach you can think of?


> The job is here: https://github.com/horms/ovs/actions/runs/4365805718/jobs/7654575559
>
> And a summary of the checks run for recheck for each job run is as follows.
>
> Attempt #1
> 85:  failed
> 166: ok
>
> Attempt #2
> 50:  ok
> 64:  failed
> 86:  ok
> 135: failed
>
> Attempt #3
> 50:  ok
>
> Attempt #4
> 50:  failed
>
> Attempt #5
> 50:  ok
> 163: ok
> 169: ok
Ilya Maximets March 9, 2023, 4:26 p.m. UTC | #3
On 3/9/23 17:22, Eelco Chaudron wrote:
> 
> 
> On 9 Mar 2023, at 15:42, Simon Horman wrote:
> 
>> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
>>> Run "make check-offloads" as part of the GitHub actions tests.
>>>
>>> This test was run 25 times using GitHub actions, and the
>>> failing rerun test cases where excluded. There are quite some
>>> first-run failures, but unfortunately, there is no other
>>> more stable kernel available as a GitHub-hosted runner.
>>>
>>> Did not yet include sanitizers in the run, as it's causing
>>> the test to run too long >30min and there seems to be (timing)
>>> issues with some of the tests.
>>>
>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>>
>> Hi Eelco,
>>
>> I like this patch a lot.
>> But I am observing reliability problems when executing the new job.
>>
>> For 5 runs, on each occasion some tests failed the first time.
>> And on 3 of those runs at least one test failed on the recheck,
>> so the job failed.
> 
> Damn :) I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.
> 
> I can do this again and figure out wich tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.
> 
> Ilya any other approach you can think off?

Not really. :/
GHA is kind of hard to debug.  Cirrus has a way to give you a console
access to the running job, but GHA doesn't have that.

BTW, didn't we fic the force commit test recently?  I see it disabled
in the patch.

> 
> 
>> The job is here: https://github.com/horms/ovs/actions/runs/4365805718/jobs/7654575559
>>
>> And a summary of the checks run for recheck for each job run is as follows.
>>
>> Attempt #1
>> 85:  failed
>> 166: ok
>>
>> Attempt #2
>> 50:  ok
>> 64:  failed
>> 86:  ok
>> 135: failed
>>
>> Attempt #3
>> 50:  ok
>>
>> Attempt #4
>> 50:  failed
>>
>> Attempt #5
>> 50:  ok
>> 163: ok
>> 169: ok
>
Ilya Maximets March 9, 2023, 4:26 p.m. UTC | #4
On 3/9/23 17:26, Ilya Maximets wrote:
> On 3/9/23 17:22, Eelco Chaudron wrote:
>>
>>
>> On 9 Mar 2023, at 15:42, Simon Horman wrote:
>>
>>> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
>>>> Run "make check-offloads" as part of the GitHub actions tests.
>>>>
>>>> This test was run 25 times using GitHub actions, and the
>>>> failing rerun test cases where excluded. There are quite some
>>>> first-run failures, but unfortunately, there is no other
>>>> more stable kernel available as a GitHub-hosted runner.
>>>>
>>>> Did not yet include sanitizers in the run, as it's causing
>>>> the test to run too long >30min and there seems to be (timing)
>>>> issues with some of the tests.
>>>>
>>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>>>
>>> Hi Eelco,
>>>
>>> I like this patch a lot.
>>> But I am observing reliability problems when executing the new job.
>>>
>>> For 5 runs, on each occasion some tests failed the first time.
>>> And on 3 of those runs at least one test failed on the recheck,
>>> so the job failed.
>>
>> Damn :) I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.
>>
>> I can do this again and figure out wich tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.
>>
>> Ilya any other approach you can think off?
> 
> Not really. :/
> GHA is kind of hard to debug.  Cirrus has a way to give you a console
> access to the running job, but GHA doesn't have that.
> 
> BTW, didn't we fic the force commit test recently?  I see it disabled

* fix

> in the patch.
> 
>>
>>
>>> The job is here: https://github.com/horms/ovs/actions/runs/4365805718/jobs/7654575559
>>>
>>> And a summary of the checks run for recheck for each job run is as follows.
>>>
>>> Attempt #1
>>> 85:  failed
>>> 166: ok
>>>
>>> Attempt #2
>>> 50:  ok
>>> 64:  failed
>>> 86:  ok
>>> 135: failed
>>>
>>> Attempt #3
>>> 50:  ok
>>>
>>> Attempt #4
>>> 50:  failed
>>>
>>> Attempt #5
>>> 50:  ok
>>> 163: ok
>>> 169: ok
>>
>
Frode Nordahl March 10, 2023, 6:12 a.m. UTC | #5
On Thu, Mar 9, 2023 at 5:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 3/9/23 17:26, Ilya Maximets wrote:
> > On 3/9/23 17:22, Eelco Chaudron wrote:
> >>
> >>
> >> On 9 Mar 2023, at 15:42, Simon Horman wrote:
> >>
> >>> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
> >>>> Run "make check-offloads" as part of the GitHub actions tests.
> >>>>
> >>>> This test was run 25 times using GitHub actions, and the
> >>>> failing rerun test cases where excluded. There are quite some
> >>>> first-run failures, but unfortunately, there is no other
> >>>> more stable kernel available as a GitHub-hosted runner.
> >>>>
> >>>> Did not yet include sanitizers in the run, as it's causing
> >>>> the test to run too long >30min and there seems to be (timing)
> >>>> issues with some of the tests.
> >>>>
> >>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> >>>
> >>> Hi Eelco,
> >>>
> >>> I like this patch a lot.
> >>> But I am observing reliability problems when executing the new job.
> >>>
> >>> For 5 runs, on each occasion some tests failed the first time.
> >>> And on 3 of those runs at least one test failed on the recheck,
> >>> so the job failed.
> >>
> >> Damn :) I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.
> >>
> >> I can do this again and figure out wich tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.
> >>
> >> Ilya any other approach you can think off?
> >
> > Not really. :/
> > GHA is kind of hard to debug.  Cirrus has a way to give you a console
> > access to the running job, but GHA doesn't have that.

FWIW, there are third-party actions available that would allow you to
SSH into the runner [0], which could be useful to enable temporarily
while debugging.

0: https://github.com/marketplace/actions/debugging-with-ssh
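
As a rough sketch, such a debug step could be added to the workflow along
these lines (this assumes the mxschmitt/action-tmate action, one such
third-party SSH-debugging action; the action linked above may differ in
its exact inputs):

      # Hypothetical temporary debug step: on failure, open an SSH/tmate
      # session so the failing runner can be inspected interactively.
      # Remove again before merging.
      - name: debug over ssh on failure
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3
        with:
          limit-access-to-actor: true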
Simon Horman March 10, 2023, 9:15 a.m. UTC | #6
On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
> 
> 
> On 9 Mar 2023, at 15:42, Simon Horman wrote:
> 
> > On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
> >> Run "make check-offloads" as part of the GitHub actions tests.
> >>
> >> This test was run 25 times using GitHub actions, and the
> >> failing rerun test cases where excluded. There are quite some
> >> first-run failures, but unfortunately, there is no other
> >> more stable kernel available as a GitHub-hosted runner.
> >>
> >> Did not yet include sanitizers in the run, as it's causing
> >> the test to run too long >30min and there seems to be (timing)
> >> issues with some of the tests.
> >>
> >> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> >
> > Hi Eelco,
> >
> > I like this patch a lot.
> > But I am observing reliability problems when executing the new job.
> >
> > For 5 runs, on each occasion some tests failed the first time.
> > And on 3 of those runs at least one test failed on the recheck,
> > so the job failed.
> 
> Damn :)

Yes, it pained me to report this.

> I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.
> 
> I can do this again and figure out wich tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.

I will see if I can spend some cycles on reproducing this (outside of GHA).
I'll likely start with the tests that show up in the summary below.

> Ilya any other approach you can think off?
> 
> 
> > The job is here: https://github.com/horms/ovs/actions/runs/4365805718/jobs/7654575559
> >
> > And a summary of the checks run for recheck for each job run is as follows.
> >
> > Attempt #1
> > 85:  failed
> > 166: ok
> >
> > Attempt #2
> > 50:  ok
> > 64:  failed
> > 86:  ok
> > 135: failed
> >
> > Attempt #3
> > 50:  ok
> >
> > Attempt #4
> > 50:  failed
> >
> > Attempt #5
> > 50:  ok
> > 163: ok
> > 169: ok
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Simon Horman March 10, 2023, 9:16 a.m. UTC | #7
On Fri, Mar 10, 2023 at 07:12:33AM +0100, Frode Nordahl wrote:
> On Thu, Mar 9, 2023 at 5:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:

...

> > > GHA is kind of hard to debug.  Cirrus has a way to give you a console
> > > access to the running job, but GHA doesn't have that.
> 
> fwiw; There are third party actions available that would allow you to
> SSH into the runner [0], which could possibly be useful to temporarily
> enable while debugging.
> 
> 0: https://github.com/marketplace/actions/debugging-with-ssh

Nice, I did not know that.
Simon Horman March 10, 2023, 4:20 p.m. UTC | #8
On Fri, Mar 10, 2023 at 10:15:44AM +0100, Simon Horman wrote:
> On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
> > 
> > 
> > On 9 Mar 2023, at 15:42, Simon Horman wrote:
> > 
> > > On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
> > >> Run "make check-offloads" as part of the GitHub actions tests.
> > >>
> > >> This test was run 25 times using GitHub actions, and the
> > >> failing rerun test cases where excluded. There are quite some
> > >> first-run failures, but unfortunately, there is no other
> > >> more stable kernel available as a GitHub-hosted runner.
> > >>
> > >> Did not yet include sanitizers in the run, as it's causing
> > >> the test to run too long >30min and there seems to be (timing)
> > >> issues with some of the tests.
> > >>
> > >> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> > >
> > > Hi Eelco,
> > >
> > > I like this patch a lot.
> > > But I am observing reliability problems when executing the new job.
> > >
> > > For 5 runs, on each occasion some tests failed the first time.
> > > And on 3 of those runs at least one test failed on the recheck,
> > > so the job failed.
> > 
> > Damn :)
> 
> Yes, it pained me to report this.
> 
> > I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.
> > 
> > I can do this again and figure out wich tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.
> 
> I will see if I can spend some cycles on reproducing this (outside of GHA).
> I'll likely start with the tests that show up in the summary below.

I started off by looking at this check-offloads test:

50. system-traffic.at:1524: testing datapath - basic truncate action ...

I haven't dug into the code to debug the problem yet.
But I have collected some results that might be interesting.


My testing was on a low-end VM with Ubuntu 18.04, with no HW offload:
$ uname -psv
Linux #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64

Using the latest main branch:
d2f6fbe9fe6c ("ofproto-dpif-upcall: Remove redundant time_msec() in revalidate().")

I ran this in a for loop that exited on failure.

for i in $(seq 50); do echo $i; TESTSUITEFLAGS="50 -v" make check-offloads >& 50.log || break; done

I did this 5 times.
On one run it failed on the first go, on three other runs it failed in
fewer than 10 iterations, and on the remaining one it made it to the 16th iteration.

On one of the runs the error looked like this:

...
./system-traffic.at:1573: ovs-appctl revalidator/purge
./system-traffic.at:1574: ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
--- /dev/null   2023-03-10 15:43:42.948000000 +0000
+++ /home/horms/ovs/tests/system-offloads-testsuite.dir/at-groups/50/stderr     2023-03-10 16:00:02.904493675 +0000
@@ -0,0 +1 @@
+ovs-ofctl: br0: failed to connect to socket (Connection reset by peer)
...

The others looked like this:


./system-traffic.at:1620: ovs-appctl revalidator/purge
./system-traffic.at:1621: ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
./system-traffic.at:1626: ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
./system-traffic.at:1630: check_logs 
--- /dev/null   2023-03-10 15:43:42.948000000 +0000
+++ /home/horms/ovs/tests/system-offloads-testsuite.dir/at-groups/50/stdout     2023-03-10 16:03:06.996925562 +0000
@@ -0,0 +1,2 @@
+2023-03-10T16:03:04.666Z|00137|dpif|WARN|system@ovs-system: failed to flow_del (No such file or directory) ufid:9505babe-8c30-4ee3-bd58-aa4f6f310219 recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0xd4),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=a6:02:03:38:97:15,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=fe80::a402:3ff:fe38:9715,dst=ff02::16,label=0,proto=58,tclass=0,hlimit=1,frag=no),icmpv6(type=143,code=0)
+2023-03-10T16:03:04.666Z|00138|dpif|WARN|system@ovs-system: failed to flow_del (No such file or directory) ufid:c86015ff-4a41-4879-92b5-518c52be4b46 recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=a6:02:03:38:97:15,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=fe80::a402:3ff:fe38:9715,dst=ff02::2,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=133,code=0)
Eelco Chaudron March 28, 2023, 11:45 a.m. UTC | #9
On 10 Mar 2023, at 17:20, Simon Horman wrote:

> On Fri, Mar 10, 2023 at 10:15:44AM +0100, Simon Horman wrote:
>> On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
>>>
>>>
>>> On 9 Mar 2023, at 15:42, Simon Horman wrote:
>>>
>>>> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
>>>>> Run "make check-offloads" as part of the GitHub actions tests.
>>>>>
>>>>> This test was run 25 times using GitHub actions, and the
>>>>> failing rerun test cases where excluded. There are quite some
>>>>> first-run failures, but unfortunately, there is no other
>>>>> more stable kernel available as a GitHub-hosted runner.
>>>>>
>>>>> Did not yet include sanitizers in the run, as it's causing
>>>>> the test to run too long >30min and there seems to be (timing)
>>>>> issues with some of the tests.
>>>>>
>>>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>>>>
>>>> Hi Eelco,
>>>>
>>>> I like this patch a lot.
>>>> But I am observing reliability problems when executing the new job.
>>>>
>>>> For 5 runs, on each occasion some tests failed the first time.
>>>> And on 3 of those runs at least one test failed on the recheck,
>>>> so the job failed.
>>>
>>> Damn :)
>>
>> Yes, it pained me to report this.
>>
>>> I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.
>>>
>>> I can do this again and figure out wich tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.
>>
>> I will see if I can spend some cycles on reproducing this (outside of GHA).
>> I'll likely start with the tests that show up in the summary below.
>
> I started off by looking at check-offloads test:
>
> 50. system-traffic.at:1524: testing datapath - basic truncate action ...
>
> I haven't dug into the code to debug the problem yet.
> But I have collected some results that might be interesting.
>
>
> My testing was on a low-end VM with Ubuntu 18.04, with no HW offload:
> $ uname -psv
> Linux #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64


So I took the same approach: I have a local Vagrant VM with Ubuntu 22.04 (like on GitHub) and ran the tests there.

I thought I had fixed it with this old commit:

  https://github.com/openvswitch/ovs/commit/22968988b820aa17a9b050c901208b7d4bed9dac

However, as you can see, even after excluding the remaining failures that I could not figure out, it still fails randomly:

  https://github.com/chaudron/ovs/actions

Note that in the 25 runs I did before, none of the above tests failed…

I was not able to make any of these tests fail on my local Ubuntu VM, and analysing the results also did not point to a specific thing to fix.

As this is working fine on my Fedora (VM) setup across multiple runs without any problem, I’ll abandon this patch for now :( I’ll try to get buy-in from the Robot to run the datapath tests as part of its sanity check for now…

//Eelco

> Using the latest main branch:
> d2f6fbe9fe6c ("ofproto-dpif-upcall: Remove redundant time_msec() in revalidate().")
>
> I ran this in a for loop that exited on failure.
>
> for i in $(seq 50); do echo $i; TESTSUITEFLAGS="50 -v" make check-offloads >& 50.log || break; done
>
> I did this 5 times.
> On one run it failed 1st go, on 3 other runs it failed in less than 10
> iterations, and on the another it made it to the 16th iteration.
>
> On one of the runs the error looked like this:
>
> ...
> ./system-traffic.at:1573: ovs-appctl revalidator/purge
> ./system-traffic.at:1574: ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
> --- /dev/null   2023-03-10 15:43:42.948000000 +0000
> +++ /home/horms/ovs/tests/system-offloads-testsuite.dir/at-groups/50/stderr     2023-03-10 16:00:02.904493675 +0000
> @@ -0,0 +1 @@
> +ovs-ofctl: br0: failed to connect to socket (Connection reset by peer)
> ...
>
> The others looked like this:
>
>
> ./system-traffic.at:1620: ovs-appctl revalidator/purge
> ./system-traffic.at:1621: ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
> ./system-traffic.at:1626: ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
> ./system-traffic.at:1630: check_logs
> --- /dev/null   2023-03-10 15:43:42.948000000 +0000
> +++ /home/horms/ovs/tests/system-offloads-testsuite.dir/at-groups/50/stdout     2023-03-10 16:03:06.996925562 +0000
> @@ -0,0 +1,2 @@
> +2023-03-10T16:03:04.666Z|00137|dpif|WARN|system@ovs-system: failed to flow_del (No such file or directory) ufid:9505babe-8c30-4ee3-bd58-aa4f6f310219 recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0xd4),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=a6:02:03:38:97:15,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=fe80::a402:3ff:fe38:9715,dst=ff02::16,label=0,proto=58,tclass=0,hlimit=1,frag=no),icmpv6(type=143,code=0)
> +2023-03-10T16:03:04.666Z|00138|dpif|WARN|system@ovs-system: failed to flow_del (No such file or directory) ufid:c86015ff-4a41-4879-92b5-518c52be4b46 recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=a6:02:03:38:97:15,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=fe80::a402:3ff:fe38:9715,dst=ff02::2,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=133,code=0)
Simon Horman March 29, 2023, 3:34 p.m. UTC | #10
On Tue, Mar 28, 2023 at 01:45:22PM +0200, Eelco Chaudron wrote:
> 
> 
> On 10 Mar 2023, at 17:20, Simon Horman wrote:
> 
> > On Fri, Mar 10, 2023 at 10:15:44AM +0100, Simon Horman wrote:
> >> On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
> >>>
> >>>
> >>> On 9 Mar 2023, at 15:42, Simon Horman wrote:
> >>>
> >>>> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
> >>>>> Run "make check-offloads" as part of the GitHub actions tests.
> >>>>>
> >>>>> This test was run 25 times using GitHub actions, and the
> >>>>> failing rerun test cases where excluded. There are quite some
> >>>>> first-run failures, but unfortunately, there is no other
> >>>>> more stable kernel available as a GitHub-hosted runner.
> >>>>>
> >>>>> Did not yet include sanitizers in the run, as it's causing
> >>>>> the test to run too long >30min and there seems to be (timing)
> >>>>> issues with some of the tests.
> >>>>>
> >>>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> >>>>
> >>>> Hi Eelco,
> >>>>
> >>>> I like this patch a lot.
> >>>> But I am observing reliability problems when executing the new job.
> >>>>
> >>>> For 5 runs, on each occasion some tests failed the first time.
> >>>> And on 3 of those runs at least one test failed on the recheck,
> >>>> so the job failed.
> >>>
> >>> Damn :)
> >>
> >> Yes, it pained me to report this.
> >>
> >>> I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.
> >>>
> >>> I can do this again and figure out wich tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.
> >>
> >> I will see if I can spend some cycles on reproducing this (outside of GHA).
> >> I'll likely start with the tests that show up in the summary below.
> >
> > I started off by looking at check-offloads test:
> >
> > 50. system-traffic.at:1524: testing datapath - basic truncate action ...
> >
> > I haven't dug into the code to debug the problem yet.
> > But I have collected some results that might be interesting.
> >
> >
> > My testing was on a low-end VM with Ubuntu 18.04, with no HW offload:
> > $ uname -psv
> > Linux #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64
> 
> 
> So I took the same approach, I have local vagrant VM with ubuntu 22.11 (like on GitHub) and ran the tests.
> 
> I thought I fixed it by this old commit:
> 
>   https://github.com/openvswitch/ovs/commit/22968988b820aa17a9b050c901208b7d4bed9dac
> 
> However, as you can see even after excluding the remaining failures I could not figure out, it still fails randomly:
> 
>   https://github.com/chaudron/ovs/actions
> 
> Note that the above 25x runs I did before and none of the above tests failed…
> 
> I was not able to make any of these tests fail on my local Ubuntu, and also analysing the results did not lead to a specific thing to fix.
> 
> As this is working fine on my Fedora (VM) setup for multiple runs without any problem I’ll abandon this patch now :( I’ll try to get a buy-in form the Robot to run the datapath tests as part of its sanity check for now…

Thanks Eelco,

it's a shame this proved to be elusive.

Perhaps with a newer, as yet unreleased, Ubuntu version things will
improve - perhaps it is a kernel issue.

Or perhaps we have some deep problem related to running in
resource-constrained environments that we may uncover some day.

In any case, thanks for looking at this.
I agree that it makes sense to abandon this patchset for now.
And that using the Robot may be a promising alternative, for now.
Ilya Maximets March 31, 2023, 4:29 p.m. UTC | #11
On 3/29/23 17:34, Simon Horman wrote:
> On Tue, Mar 28, 2023 at 01:45:22PM +0200, Eelco Chaudron wrote:
>>
>>
>> On 10 Mar 2023, at 17:20, Simon Horman wrote:
>>
>>> On Fri, Mar 10, 2023 at 10:15:44AM +0100, Simon Horman wrote:
>>>> On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
>>>>>
>>>>>
>>>>> On 9 Mar 2023, at 15:42, Simon Horman wrote:
>>>>>
>>>>>> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
>>>>>>> Run "make check-offloads" as part of the GitHub actions tests.
>>>>>>>
>>>>>>> This test was run 25 times using GitHub actions, and the
>>>>>>> failing rerun test cases where excluded. There are quite some
>>>>>>> first-run failures, but unfortunately, there is no other
>>>>>>> more stable kernel available as a GitHub-hosted runner.
>>>>>>>
>>>>>>> Did not yet include sanitizers in the run, as it's causing
>>>>>>> the test to run too long >30min and there seems to be (timing)
>>>>>>> issues with some of the tests.
>>>>>>>
>>>>>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>>>>>>
>>>>>> Hi Eelco,
>>>>>>
>>>>>> I like this patch a lot.
>>>>>> But I am observing reliability problems when executing the new job.
>>>>>>
>>>>>> For 5 runs, on each occasion some tests failed the first time.
>>>>>> And on 3 of those runs at least one test failed on the recheck,
>>>>>> so the job failed.
>>>>>
>>>>> Damn :)
>>>>
>>>> Yes, it pained me to report this.
>>>>
>>>>> I did 25 runs (I did not check for re-runs), and they were fine. I also cleaned up my jobs recently, so I no longer have them.
>>>>>
>>>>> I can do this again and figure out wich tests are failing. Then analyze the failures to see if we need to exclude them or can fine-tune them.
>>>>
>>>> I will see if I can spend some cycles on reproducing this (outside of GHA).
>>>> I'll likely start with the tests that show up in the summary below.
>>>
>>> I started off by looking at check-offloads test:
>>>
>>> 50. system-traffic.at:1524: testing datapath - basic truncate action ...
>>>
>>> I haven't dug into the code to debug the problem yet.
>>> But I have collected some results that might be interesting.
>>>
>>>
>>> My testing was on a low-end VM with Ubuntu 18.04, with no HW offload:
>>> $ uname -psv
>>> Linux #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64
>>
>>
>> So I took the same approach, I have local vagrant VM with ubuntu 22.11 (like on GitHub) and ran the tests.
>>
>> I thought I fixed it by this old commit:
>>
>>   https://github.com/openvswitch/ovs/commit/22968988b820aa17a9b050c901208b7d4bed9dac
>>
>> However, as you can see even after excluding the remaining failures I could not figure out, it still fails randomly:
>>
>>   https://github.com/chaudron/ovs/actions
>>
>> Note that the above 25x runs I did before and none of the above tests failed…
>>
>> I was not able to make any of these tests fail on my local Ubuntu, and also analysing the results did not lead to a specific thing to fix.
>>
>> As this is working fine on my Fedora (VM) setup for multiple runs without any problem I’ll abandon this patch now :( I’ll try to get a buy-in form the Robot to run the datapath tests as part of its sanity check for now…
> 
> Thanks Eelco,
> 
> it's a shame this proved to be elusive.
> 
> Perhaps with a newer, as yet unreleased, Ubuntu version things will
> improve - perhaps it is a kernel issue.
> 
> Or perhaps we have some deep problem related to running in
> resource constrained environments, that we may uncover some day.
> 
> In any case, thanks for looking at this.
> I agree that it makes sense to abandon this patchset for now.
> And that using the Robot may be a promising alternative, for now.

FWIW, we can spin up a Fedora VM in Cirrus CI, if that helps.
We may even ask for more CPUs if needed.

See how we do CI in ovn-heater for example:
  https://github.com/ovn-org/ovn-heater/blob/main/.cirrus.yml

If a VM is not necessary, we may also use containers, similar to
GitHub Actions.

Best regards, Ilya Maximets.
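
For illustration, a minimal Cirrus CI task spinning up a Fedora VM for the
offload tests might look roughly like this (the image family, resource
sizes, and package list are assumptions, loosely modelled on the
ovn-heater configuration linked above):

    # Sketch of a .cirrus.yml task: a GCE-backed Fedora VM running the
    # offload system tests. Details below are illustrative only.
    offload_tests_task:
      compute_engine_instance:
        image_project: fedora-cloud
        image: family/fedora-cloud-37
        platform: linux
        cpu: 4
        memory: 8G

      build_script:
        - dnf install -y autoconf automake libtool gcc make
        - ./boot.sh && ./configure && make -j4

      test_script:
        # System tests need root; Cirrus VM scripts run as root by default.
        - make check-offloads TESTSUITEFLAGS=-j1 RECHECK=yes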

Patch

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 6394a8137..19ed9796d 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -159,7 +159,7 @@  fi
 
 OPTS="${EXTRA_OPTS} ${OPTS} $*"
 
-if [ "$TESTSUITE" ]; then
+if [ "$TESTSUITE" = 'test' ]; then
     # 'distcheck' will reconfigure with required options.
     # Now we only need to prepare the Makefile without sparse-wrapped CC.
     configure_ovs
@@ -169,6 +169,10 @@  if [ "$TESTSUITE" ]; then
         TESTSUITEFLAGS=-j4 RECHECK=yes
 else
     build_ovs
+    if [ -n "$TESTSUITE" ]; then
+        sudo -E PATH="$PATH" make "$TESTSUITE" TESTSUITEFLAGS="$TEST_OPTS" \
+            RECHECK=yes
+    fi
 fi
 
 exit 0
diff --git a/.github/workflows/build-and-test.yml b/.github/workflows/build-and-test.yml
index 9f518ff01..ffcb1cd4f 100644
--- a/.github/workflows/build-and-test.yml
+++ b/.github/workflows/build-and-test.yml
@@ -17,9 +17,10 @@  jobs:
       OPTS:        ${{ matrix.opts }}
       SANITIZERS:  ${{ matrix.sanitizers }}
       TESTSUITE:   ${{ matrix.testsuite }}
+      TEST_OPTS:   ${{ matrix.test_opts }}
 
     name: linux ${{ join(matrix.*, ' ') }}
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     timeout-minutes: 30
 
     strategy:
@@ -86,6 +87,10 @@  jobs:
             m32:          m32
             opts:         --disable-ssl
 
+          - compiler:     gcc
+            testsuite:    check-offloads
+            test_opts:    "-k !github_offloads_skip"
+
     steps:
     - name: checkout
       uses: actions/checkout@v3
@@ -146,8 +151,10 @@  jobs:
         # Also, upload-artifact@v2 doesn't work well enough with wildcards.
         # So, we're just archiving everything here to avoid any issues.
         mkdir logs
+        sudo chown -R $USER ./tests/system-offloads-testsuite.* || true
         cp config.log ./logs/
         cp -r ./*/_build/sub/tests/testsuite.* ./logs/ || true
+        cp -r ./tests/system-offloads-testsuite.* ./logs/ || true
         tar -czvf logs.tgz logs/
 
     - name: upload logs on failure
diff --git a/Documentation/topics/testing.rst b/Documentation/topics/testing.rst
index 5f6940b84..c2d680e71 100644
--- a/Documentation/topics/testing.rst
+++ b/Documentation/topics/testing.rst
@@ -114,6 +114,23 @@  And from another window, one can execute ovs-xxx commands like::
 
 Once done with investigation, press ENTER to perform cleanup operation.
 
+GitHub actions
+++++++++++++++
+
+The OVS GitHub repository also runs some of these unit tests through GitHub
+actions. These tests are defined in the
+``ovs/.github/workflows/build-and-test.yml`` file.
+
+Based on the GitHub runners available, not all tests will work. In these cases,
+the AT_KEYWORDS() macro can be used. For example, to skip a
+``make check-offloads`` test, use the ``github_offloads_skip`` keyword.
+
+Only use these keywords if no other way to fix or skip the test is available.
+
+To see a list of currently skipped tests, you can do something like::
+
+    $ make check-offloads TESTSUITEFLAGS="-l" | grep -B 1 github_offloads_skip
+
 .. _testing-coverage:
 
 Coverage
diff --git a/tests/system-offloads-traffic.at b/tests/system-offloads-traffic.at
index eb331d6ce..5cde26663 100644
--- a/tests/system-offloads-traffic.at
+++ b/tests/system-offloads-traffic.at
@@ -191,6 +191,7 @@  AT_CLEANUP
 
 AT_SETUP([offloads - check interface meter offloading -  offloads disabled])
 AT_KEYWORDS([dp-meter])
+AT_KEYWORDS([github_offloads_skip])
 AT_SKIP_IF([test $HAVE_NC = "no"])
 OVS_TRAFFIC_VSWITCHD_START()
 
@@ -240,6 +241,7 @@  AT_CLEANUP
 
 AT_SETUP([offloads - check interface meter offloading -  offloads enabled])
 AT_KEYWORDS([offload-meter])
+AT_KEYWORDS([github_offloads_skip])
 AT_SKIP_IF([test $SUPPORT_TC_INGRESS_PPS = "no"])
 AT_SKIP_IF([test $HAVE_NC = "no"])
 OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . other_config:hw-offload=true])
@@ -348,6 +350,7 @@  AT_CLEANUP
 
 
 AT_SETUP([offloads - check_pkt_len action - offloads enabled])
+AT_KEYWORDS([github_offloads_skip])
 OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . other_config:hw-offload=true])
 
 ADD_NAMESPACES(at_ns1, at_ns2, at_ns3, at_ns4)
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 380372430..e69eafdc3 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -2279,6 +2279,7 @@  OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
 AT_SETUP([conntrack - force commit])
+AT_KEYWORDS([github_offloads_skip])
 CHECK_CONNTRACK()
 OVS_TRAFFIC_VSWITCHD_START()
 AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
@@ -7186,6 +7187,7 @@  OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
 AT_SETUP([conntrack - Multiple ICMP traverse])
+AT_KEYWORDS([github_offloads_skip])
 dnl This tracks sending ICMP packets via conntrack multiple times for the
 dnl same packet
 CHECK_CONNTRACK()
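
Taken together, the workflow and .ci/linux-build.sh changes above amount to
running roughly the following on the runner (this is just the local
equivalent for reference; the exact invocation lives in .ci/linux-build.sh):

    $ sudo -E make check-offloads TESTSUITEFLAGS='-k !github_offloads_skip' RECHECK=yes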