| Message ID | 167828869871.885480.13278006837271919367.stgit@ebuild.local |
|---|---|
| State | Rejected |
| Series | [ovs-dev,v3,1/2] github: Combine ASAN and UBSAN runs. |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/intel-ovs-compilation | success | test: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
> Run "make check-offloads" as part of the GitHub actions tests.
>
> This test was run 25 times using GitHub actions, and the
> failing rerun test cases were excluded. There are quite some
> first-run failures, but unfortunately, there is no other
> more stable kernel available as a GitHub-hosted runner.
>
> Did not yet include sanitizers in the run, as it's causing
> the test to run too long (>30 min) and there seem to be (timing)
> issues with some of the tests.
>
> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>

Hi Eelco,

I like this patch a lot.
But I am observing reliability problems when executing the new job.

For 5 runs, on each occasion some tests failed the first time.
And on 3 of those runs at least one test failed on the recheck,
so the job failed.

The job is here:
https://github.com/horms/ovs/actions/runs/4365805718/jobs/7654575559

And a summary of the checks run for recheck for each job run is as follows.

Attempt #1
85: failed
166: ok

Attempt #2
50: ok
64: failed
86: ok
135: failed

Attempt #3
50: ok

Attempt #4
50: failed

Attempt #5
50: ok
163: ok
169: ok
On 9 Mar 2023, at 15:42, Simon Horman wrote:

> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
>> Run "make check-offloads" as part of the GitHub actions tests.
>> ...
>
> Hi Eelco,
>
> I like this patch a lot.
> But I am observing reliability problems when executing the new job.
>
> For 5 runs, on each occasion some tests failed the first time.
> And on 3 of those runs at least one test failed on the recheck,
> so the job failed.

Damn :) I did 25 runs (I did not check for re-runs), and they were fine.
I also cleaned up my jobs recently, so I no longer have them.

I can do this again and figure out which tests are failing. Then analyze
the failures to see if we need to exclude them or can fine-tune them.

Ilya, any other approach you can think of?

> The job is here:
> https://github.com/horms/ovs/actions/runs/4365805718/jobs/7654575559
>
> And a summary of the checks run for recheck for each job run is as follows.
> ...
On 3/9/23 17:22, Eelco Chaudron wrote:
> On 9 Mar 2023, at 15:42, Simon Horman wrote:
>> ...
>> For 5 runs, on each occasion some tests failed the first time.
>> And on 3 of those runs at least one test failed on the recheck,
>> so the job failed.
>
> Damn :) I did 25 runs (I did not check for re-runs), and they were fine.
> I also cleaned up my jobs recently, so I no longer have them.
>
> I can do this again and figure out which tests are failing. Then analyze
> the failures to see if we need to exclude them or can fine-tune them.
>
> Ilya, any other approach you can think of?

Not really. :/
GHA is kind of hard to debug. Cirrus has a way to give you console
access to the running job, but GHA doesn't have that.

BTW, didn't we fix the force commit test recently? I see it disabled
in the patch.
On Thu, Mar 9, 2023 at 5:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> On 3/9/23 17:26, Ilya Maximets wrote:
>> ...
>> GHA is kind of hard to debug. Cirrus has a way to give you console
>> access to the running job, but GHA doesn't have that.

FWIW, there are third-party actions available that would allow you to
SSH into the runner [0], which could possibly be useful to temporarily
enable while debugging.

[0]: https://github.com/marketplace/actions/debugging-with-ssh
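For illustration, a temporary debugging step along these lines could be added
to .github/workflows/build-and-test.yml while chasing these failures. This is
only a sketch, not part of the patch: the action slug (lhotari/action-upterm)
and its input are assumptions about what the marketplace listing above
resolves to, so verify both before use.

    # Hypothetical, temporary debugging step -- remove before merging.
    # On failure, opens an SSH session (via upterm) into the runner so the
    # test directories can be inspected in place.
    - name: debug over SSH on failure
      if: ${{ failure() }}
      uses: lhotari/action-upterm@v1   # assumed slug of "Debugging with SSH"
      with:
        # Assumed input: only allow the user who triggered the run to connect.
        limit-access-to-actor: true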
On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
> On 9 Mar 2023, at 15:42, Simon Horman wrote:
>> ...
>> For 5 runs, on each occasion some tests failed the first time.
>> And on 3 of those runs at least one test failed on the recheck,
>> so the job failed.
>
> Damn :)

Yes, it pained me to report this.

> I did 25 runs (I did not check for re-runs), and they were fine.
> I also cleaned up my jobs recently, so I no longer have them.
>
> I can do this again and figure out which tests are failing. Then analyze
> the failures to see if we need to exclude them or can fine-tune them.

I will see if I can spend some cycles on reproducing this (outside of GHA).
I'll likely start with the tests that show up in the summary below.

> Ilya, any other approach you can think of?
>
>> The job is here:
>> https://github.com/horms/ovs/actions/runs/4365805718/jobs/7654575559
>>
>> And a summary of the checks run for recheck for each job run is as follows.
>> ...
On Fri, Mar 10, 2023 at 07:12:33AM +0100, Frode Nordahl wrote:
> On Thu, Mar 9, 2023 at 5:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:

...

>>> GHA is kind of hard to debug. Cirrus has a way to give you console
>>> access to the running job, but GHA doesn't have that.
>
> FWIW, there are third-party actions available that would allow you to
> SSH into the runner [0], which could possibly be useful to temporarily
> enable while debugging.
>
> [0]: https://github.com/marketplace/actions/debugging-with-ssh

Nice, I did not know that.
On Fri, Mar 10, 2023 at 10:15:44AM +0100, Simon Horman wrote:
> On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
>> ...
>> I can do this again and figure out which tests are failing. Then analyze
>> the failures to see if we need to exclude them or can fine-tune them.
>
> I will see if I can spend some cycles on reproducing this (outside of GHA).
> I'll likely start with the tests that show up in the summary below.

I started off by looking at this check-offloads test:

50. system-traffic.at:1524: testing datapath - basic truncate action ...

I haven't dug into the code to debug the problem yet.
But I have collected some results that might be interesting.

My testing was on a low-end VM with Ubuntu 18.04, with no HW offload:

$ uname -psv
Linux #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64

Using the latest main branch:
d2f6fbe9fe6c ("ofproto-dpif-upcall: Remove redundant time_msec() in revalidate().")

I ran this in a for loop that exited on failure:

for i in $(seq 50); do echo $i; TESTSUITEFLAGS="50 -v" make check-offloads >& 50.log || break; done

I did this 5 times.
On one run it failed on the 1st go, on 3 other runs it failed in less than
10 iterations, and on the other it made it to the 16th iteration.

On one of the runs the error looked like this:

...
./system-traffic.at:1573: ovs-appctl revalidator/purge
./system-traffic.at:1574: ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
--- /dev/null	2023-03-10 15:43:42.948000000 +0000
+++ /home/horms/ovs/tests/system-offloads-testsuite.dir/at-groups/50/stderr	2023-03-10 16:00:02.904493675 +0000
@@ -0,0 +1 @@
+ovs-ofctl: br0: failed to connect to socket (Connection reset by peer)
...

The others looked like this:

./system-traffic.at:1620: ovs-appctl revalidator/purge
./system-traffic.at:1621: ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
./system-traffic.at:1626: ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[0-9]*\).*/\1/p'
./system-traffic.at:1630: check_logs
--- /dev/null	2023-03-10 15:43:42.948000000 +0000
+++ /home/horms/ovs/tests/system-offloads-testsuite.dir/at-groups/50/stdout	2023-03-10 16:03:06.996925562 +0000
@@ -0,0 +1,2 @@
+2023-03-10T16:03:04.666Z|00137|dpif|WARN|system@ovs-system: failed to flow_del (No such file or directory) ufid:9505babe-8c30-4ee3-bd58-aa4f6f310219 recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0xd4),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=a6:02:03:38:97:15,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=fe80::a402:3ff:fe38:9715,dst=ff02::16,label=0,proto=58,tclass=0,hlimit=1,frag=no),icmpv6(type=143,code=0)
+2023-03-10T16:03:04.666Z|00138|dpif|WARN|system@ovs-system: failed to flow_del (No such file or directory) ufid:c86015ff-4a41-4879-92b5-518c52be4b46 recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=a6:02:03:38:97:15,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=fe80::a402:3ff:fe38:9715,dst=ff02::2,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=133,code=0)
On 10 Mar 2023, at 17:20, Simon Horman wrote:

> On Fri, Mar 10, 2023 at 10:15:44AM +0100, Simon Horman wrote:
>> ...
>> I will see if I can spend some cycles on reproducing this (outside of GHA).
>> I'll likely start with the tests that show up in the summary below.
>
> I started off by looking at this check-offloads test:
>
> 50. system-traffic.at:1524: testing datapath - basic truncate action ...
>
> I haven't dug into the code to debug the problem yet.
> But I have collected some results that might be interesting.
>
> My testing was on a low-end VM with Ubuntu 18.04, with no HW offload:
>
> $ uname -psv
> Linux #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64

So I took the same approach: I have a local vagrant VM with Ubuntu 22.04
(like on GitHub) and ran the tests.

I thought I had fixed it with this old commit:

https://github.com/openvswitch/ovs/commit/22968988b820aa17a9b050c901208b7d4bed9dac

However, as you can see, even after excluding the remaining failures I could
not figure out, it still fails randomly:

https://github.com/chaudron/ovs/actions

Note that in the 25x runs I did before, none of the above tests failed…

I was not able to make any of these tests fail on my local Ubuntu, and
analysing the results did not lead to a specific thing to fix.

As this is working fine on my Fedora (VM) setup for multiple runs without
any problem, I'll abandon this patch now :( I'll try to get buy-in from the
Robot to run the datapath tests as part of its sanity check for now…

//Eelco

> Using the latest main branch:
> ...
On Tue, Mar 28, 2023 at 01:45:22PM +0200, Eelco Chaudron wrote:
> On 10 Mar 2023, at 17:20, Simon Horman wrote:
>> ...
>
> So I took the same approach: I have a local vagrant VM with Ubuntu 22.04
> (like on GitHub) and ran the tests.
>
> ...
>
> As this is working fine on my Fedora (VM) setup for multiple runs without
> any problem, I'll abandon this patch now :( I'll try to get buy-in from the
> Robot to run the datapath tests as part of its sanity check for now…

Thanks Eelco,

it's a shame this proved to be elusive.

Perhaps with a newer, as yet unreleased, Ubuntu version things will
improve - perhaps it is a kernel issue.

Or perhaps we have some deep problem related to running in
resource-constrained environments that we may uncover some day.

In any case, thanks for looking at this.

I agree that it makes sense to abandon this patchset for now.
And that using the Robot may be a promising alternative, for now.
On 3/29/23 17:34, Simon Horman wrote:
> On Tue, Mar 28, 2023 at 01:45:22PM +0200, Eelco Chaudron wrote:
>> ...
>> As this is working fine on my Fedora (VM) setup for multiple runs without
>> any problem, I'll abandon this patch now :( I'll try to get buy-in from
>> the Robot to run the datapath tests as part of its sanity check for now…
>
> ...
>
> I agree that it makes sense to abandon this patchset for now.
> And that using the Robot may be a promising alternative, for now.

FWIW, we can spin up a Fedora VM in Cirrus CI, if that helps.
We may even ask for more CPUs if needed.

See how we do CI in ovn-heater, for example:
https://github.com/ovn-org/ovn-heater/blob/main/.cirrus.yml

If a VM is not necessary, we may also use containers similarly to
GitHub Actions.

Best regards, Ilya Maximets.
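As a rough sketch of that idea, a Cirrus task for this suite might look like
the following. Everything here is an assumption for illustration (the image
project/family, resources, and dependency list would all need checking
against the Cirrus CI docs and the ovn-heater file linked above); it is not a
tested configuration.

    # Hypothetical .cirrus.yml task: run the offloads suite in a Fedora VM.
    check_offloads_task:
      compute_engine_instance:
        image_project: fedora-cloud          # assumed GCP image project
        image: family/fedora-cloud-base-37   # assumed image family
        platform: linux
        cpu: 4
        memory: 8G
      install_script:
        - dnf install -y automake libtool gcc make  # assumed, incomplete deps
      build_script:
        - ./boot.sh && ./configure && make -j4
      test_script:
        - make check-offloads RECHECK=yes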
diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 6394a8137..19ed9796d 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -159,7 +159,7 @@ fi
 
 OPTS="${EXTRA_OPTS} ${OPTS} $*"
 
-if [ "$TESTSUITE" ]; then
+if [ "$TESTSUITE" = 'test' ]; then
     # 'distcheck' will reconfigure with required options.
     # Now we only need to prepare the Makefile without sparse-wrapped CC.
     configure_ovs
@@ -169,6 +169,10 @@ if [ "$TESTSUITE" ]; then
                                  TESTSUITEFLAGS=-j4 RECHECK=yes
 else
     build_ovs
+    if [ -n "$TESTSUITE" ]; then
+        sudo -E PATH="$PATH" make "$TESTSUITE" TESTSUITEFLAGS="$TEST_OPTS" \
+            RECHECK=yes
+    fi
 fi
 
 exit 0
diff --git a/.github/workflows/build-and-test.yml b/.github/workflows/build-and-test.yml
index 9f518ff01..ffcb1cd4f 100644
--- a/.github/workflows/build-and-test.yml
+++ b/.github/workflows/build-and-test.yml
@@ -17,9 +17,10 @@ jobs:
       OPTS:        ${{ matrix.opts }}
       SANITIZERS:  ${{ matrix.sanitizers }}
       TESTSUITE:   ${{ matrix.testsuite }}
+      TEST_OPTS:   ${{ matrix.test_opts }}
 
     name: linux ${{ join(matrix.*, ' ') }}
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     timeout-minutes: 30
 
     strategy:
@@ -86,6 +87,10 @@ jobs:
           m32:         m32
           opts:        --disable-ssl
 
+        - compiler:     gcc
+          testsuite:    check-offloads
+          test_opts:    "-k !github_offloads_skip"
+
     steps:
     - name: checkout
       uses: actions/checkout@v3
@@ -146,8 +151,10 @@
         # Also, upload-artifact@v2 doesn't work well enough with wildcards.
         # So, we're just archiving everything here to avoid any issues.
         mkdir logs
+        sudo chown -R $USER ./tests/system-offloads-testsuite.* || true
         cp config.log ./logs/
         cp -r ./*/_build/sub/tests/testsuite.* ./logs/ || true
+        cp -r ./tests/system-offloads-testsuite.* ./logs/ || true
         tar -czvf logs.tgz logs/
 
     - name: upload logs on failure
diff --git a/Documentation/topics/testing.rst b/Documentation/topics/testing.rst
index 5f6940b84..c2d680e71 100644
--- a/Documentation/topics/testing.rst
+++ b/Documentation/topics/testing.rst
@@ -114,6 +114,23 @@ And from another window, one can execute ovs-xxx commands like::
 
 Once done with investigation, press ENTER to perform cleanup operation.
 
+GitHub actions
+++++++++++++++
+
+The OVS GitHub repository also runs some of these unit tests through GitHub
+actions. These tests are defined in the
+``ovs/.github/workflows/build-and-test.yml`` file.
+
+Based on the GitHub runners available, not all tests will work. In these
+cases, the AT_KEYWORDS() macro can be used. For example, to skip a
+``make check-offloads`` test, use the ``github_offloads_skip`` keyword.
+
+Only use these keywords if no other way to fix or skip the test is available.
+
+To see a list of currently skipped tests, you can do something like::
+
+  $ make check-offloads TESTSUITEFLAGS="-l" | grep -B 1 github_offloads_skip
+
 .. _testing-coverage:
 
 Coverage
diff --git a/tests/system-offloads-traffic.at b/tests/system-offloads-traffic.at
index eb331d6ce..5cde26663 100644
--- a/tests/system-offloads-traffic.at
+++ b/tests/system-offloads-traffic.at
@@ -191,6 +191,7 @@ AT_CLEANUP
 
 AT_SETUP([offloads - check interface meter offloading - offloads disabled])
 AT_KEYWORDS([dp-meter])
+AT_KEYWORDS([github_offloads_skip])
 AT_SKIP_IF([test $HAVE_NC = "no"])
 OVS_TRAFFIC_VSWITCHD_START()
@@ -240,6 +241,7 @@ AT_CLEANUP
 
 AT_SETUP([offloads - check interface meter offloading - offloads enabled])
 AT_KEYWORDS([offload-meter])
+AT_KEYWORDS([github_offloads_skip])
 AT_SKIP_IF([test $SUPPORT_TC_INGRESS_PPS = "no"])
 AT_SKIP_IF([test $HAVE_NC = "no"])
 OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . other_config:hw-offload=true])
@@ -348,6 +350,7 @@ AT_CLEANUP
 
 AT_SETUP([offloads - check_pkt_len action - offloads enabled])
+AT_KEYWORDS([github_offloads_skip])
 OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . other_config:hw-offload=true])
 ADD_NAMESPACES(at_ns1, at_ns2, at_ns3, at_ns4)
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 380372430..e69eafdc3 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -2279,6 +2279,7 @@ OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
 AT_SETUP([conntrack - force commit])
+AT_KEYWORDS([github_offloads_skip])
 CHECK_CONNTRACK()
 OVS_TRAFFIC_VSWITCHD_START()
 AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg ofproto_dpif_upcall:dbg])
@@ -7186,6 +7187,7 @@ OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
 AT_SETUP([conntrack - Multiple ICMP traverse])
+AT_KEYWORDS([github_offloads_skip])
 dnl This tracks sending ICMP packets via conntrack multiple times for the
 dnl same packet
 CHECK_CONNTRACK()
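Condensed, the mechanism of this patch is: the new matrix entry exports
TESTSUITE and TEST_OPTS into the job environment, and .ci/linux-build.sh
turns those into the actual make invocation. Roughly (a paraphrase of the
diff above, not a literal file):

    # New matrix entry in .github/workflows/build-and-test.yml:
    - compiler: gcc
      testsuite: check-offloads
      test_opts: "-k !github_offloads_skip"  # skip tests tagged via AT_KEYWORDS
    # ...which .ci/linux-build.sh expands to, roughly:
    #   sudo -E PATH="$PATH" make check-offloads \
    #       TESTSUITEFLAGS="-k !github_offloads_skip" RECHECK=yes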
Run "make check-offloads" as part of the GitHub actions tests. This test was run 25 times using GitHub actions, and the failing rerun test cases where excluded. There are quite some first-run failures, but unfortunately, there is no other more stable kernel available as a GitHub-hosted runner. Did not yet include sanitizers in the run, as it's causing the test to run too long >30min and there seems to be (timing) issues with some of the tests. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> --- v2: Added a new test keyword to exclude the failing tests. Added some documentation around the keyword usage. v3: Fixed documentation and log copy for make distcheck. .ci/linux-build.sh | 6 +++++- .github/workflows/build-and-test.yml | 9 ++++++++- Documentation/topics/testing.rst | 17 +++++++++++++++++ tests/system-offloads-traffic.at | 3 +++ tests/system-traffic.at | 2 ++ 5 files changed, 35 insertions(+), 2 deletions(-)