Message ID | 20190711160033.2604-1-nusiddiq@redhat.com |
---|---|
State | Accepted |
Headers | show |
Series | [ovs-dev] ovn: Fix the test failures in travis CI. | expand |
On Thu, Jul 11, 2019 at 09:30:33PM +0530, nusiddiq@redhat.com wrote: > From: Numan Siddique <nusiddiq@redhat.com> > > After the commit [1], below test cases are failing repeatedly in travis CI. > > 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597) > 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844) > 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580) > 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624) > > With the commit [1], ovn-controller sends GARPs for the IPs of the distributed > router ports. The failing tests did not handle the situation if multiple GARPs > are sent. The failures are mostly timing related. This patch fixes these issues. Thanks! Applied to master.
On 11.07.2019 19:00, nusiddiq@redhat.com wrote: > From: Numan Siddique <nusiddiq@redhat.com> > > After the commit [1], below test cases are failing repeatedly in travis CI. > > 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597) > 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844) > 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580) > 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624) > > With the commit [1], ovn-controller sends GARPs for the IPs of the distributed > router ports. The failing tests did not handle the situation if multiple GARPs > are sent. The failures are mostly timing related. This patch fixes these issues. > > [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch") > > Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch") > CC: Ilya Maximets <i.maximets@samsung.com> > Signed-off-by: Numan Siddique <nusiddiq@redhat.com> > --- Hi. Thanks for working on this! I can confirm that this patch fixes frequent TravisCI failures. There are still some occasional failures of ovn tests, but it they was always there. (OVN tests has some timing issues). Tested-by: Ilya Maximets <i.maximets@samsung.com> However, I see that some failures was resolved by just removing the checks from tests. This somehow decreases the test coverage. So, It'll be good to have review from someone more familiar with these tests than me. Ben, what do you think about this patch? Best regards, Ilya Maximets. > tests/ovn.at | 53 ++++++++++++++++++++++++++++++++++------------------ > 1 file changed, 35 insertions(+), 18 deletions(-) > > diff --git a/tests/ovn.at b/tests/ovn.at > index 4da7059b3..95980f2f1 100644 > --- a/tests/ovn.at > +++ b/tests/ovn.at > @@ -8593,7 +8593,9 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1 > OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected]) > $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap > packets > cat packets | grep $expected > exp > - cat packets | grep $exp_gw_ip_garp >> exp > + # Its possible that $active_gw/br-phys_n1-tx.pcap may have received multiple > + # garp packets. So consider only the first packet. > + cat packets | grep $exp_gw_ip_garp | head -1 >> exp > AT_CHECK([cat exp], [0], [expout]) > rm -f expout > if test $backup_vswitchd_dead != 1; then > @@ -8840,7 +8842,7 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1 > OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected]) > $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap > packets > cat packets | grep $expected > exp > - cat packets | grep $exp_gw_ip_garp >> exp > + cat packets | grep $exp_gw_ip_garp | head -1 >> exp > AT_CHECK([cat exp], [0], [expout]) > > $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $backup_gw/br-phys_n1-tx.pcap > packets > @@ -9567,20 +9569,9 @@ options:rxq_pcap=${pcap_file}-rx.pcap > > as hv1 reset_pcap_file br-ex_n2 hv1/br-ex_n2 > as hv3 reset_pcap_file hv3-vif1 hv3/vif1 > -sleep 2 > -# Take note of how many packets arrived on the VLAN switch before generating > -# further traffic > -n_packets=`as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" | grep actions=clone | sed 's/.*n_packets=\([[0-9]]*\),.*/\1/'` > as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet > sleep 2 > > -# On hv1, the packet should not go from vlan switch pipleline to router > -# pipeline > -as hv1 ovs-ofctl dump-flows br-int > -AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" \ > -| grep actions=clone | grep -v n_packets=$n_packets | wc -l], [0], [[0 > -]]) > - > # On hv1, table 32 check that no packet goes via the tunnel port > AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=32 \ > | grep "NXM_NX_TUN_ID" | grep -v n_packets=0 | wc -l], [0], [[0 > @@ -9624,21 +9615,38 @@ echo $exp_garp_on_foo1 > foo1.expout > > # ovn-controller on hv2 should send garp with VLAN tag > sent_garp="ffffffffffff0000010102038100000208060001080006040001000001010203c0a80101000000000000c0a80101" > -echo $sent_garp > br-ex_n2.expout > > OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout]) > -OVN_CHECK_PACKETS([hv2/br-ex_n2-tx.pcap], [br-ex_n2.expout]) > +# Wait until we receive atleast 1 packet > +OVS_WAIT_UNTIL([test 1=`$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | wc -l`]) > +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | head -1 > packets > +echo $sent_garp > expout > +AT_CHECK([cat packets], [0], [expout]) > $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv4/br-ex_n2-tx.pcap > empty > AT_CHECK([cat empty], [0], []) > > # Make hv4 master > as hv1 reset_pcap_file hv1-vif1 hv1/vif1 > -as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2 > as hv4 reset_pcap_file br-ex_n2 hv4/br-ex_n2 > ovn-nbctl --wait=sb ha-chassis-group-add-chassis hagrp1 hv4 40 > > +# Wait till cr-alice is claimed by hv4 > +hv4_chassis=$(ovn-sbctl --bare --columns=_uuid find Chassis name=hv4) > +# check that the chassis redirect port has been claimed by the gw1 chassis > +OVS_WAIT_UNTIL([ovn-sbctl --columns chassis --bare find Port_Binding \ > +logical_port=cr-alice | grep $hv4_chassis | wc -l], [0],[[1 > +]]) > + > +# Reset the pcap file for hv2/br-ex_n2. From now on ovn-controller in hv2 > +# should not send GARPs for the router ports. > +as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2 > + > +echo $sent_garp > br-ex_n2.expout > OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout]) > OVN_CHECK_PACKETS([hv4/br-ex_n2-tx.pcap], [br-ex_n2.expout]) > + > +sleep 2 > + > $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap > empty > AT_CHECK([cat empty], [0], []) > > @@ -13580,6 +13588,8 @@ test_ip_packet_larger() { > orig_packet_l3=${orig_packet_l3}000000000000000000000000000000000000 > packet=${packet}${orig_packet_l3} > > + gw_ip_garp=ffffffffffff00002020121308060001080006040001000020201213aca80064000000000000aca80064 > + > # If icmp_pmtu_reply_expected is 0, it means the packet is lesser than > # the gateway mtu and should be delivered to the provider bridge via the > # localnet port. > @@ -13599,6 +13609,7 @@ test_ip_packet_larger() { > expected=${expected}000000000000000000000000000000000000 > expected=${expected}000000000000000000000000000000000000 > echo $expected > br_phys_n1.expected > + echo $gw_ip_garp >> br_phys_n1.expected > else > # MTU would be 100 - 18 = 82 (hex 0052) > mtu=0052 > @@ -13622,12 +13633,18 @@ test_ip_packet_larger() { > > if test $icmp_pmtu_reply_expected = 0; then > OVN_CHECK_PACKETS([hv1/br-phys_n1-tx.pcap], [br_phys_n1.expected]) > - $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap > packets > + $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap > pkts > + # hv1/vif1-tx.pcap can receive the GARP packet generated by ovn-controller > + # for the gateway router port. So ignore this packet. > + cat pkts | grep -v $gw_ip_garp > packets > AT_CHECK([cat packets], [0], []) > else > OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [hv1-vif1.expected]) > $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/br-phys_n1-tx.pcap > \ > - packets > + pkts > + # hv1/br-phys_n1-tx.pcap can receive the GARP packet generated by ovn-controller > + # for the gateway router port. So ignore this packet. > + cat pkts | grep -v $gw_ip_garp > packets > AT_CHECK([cat packets], [0], []) > fi > } >
On 12.07.2019 19:06, Ilya Maximets wrote: > On 11.07.2019 19:00, nusiddiq@redhat.com wrote: >> From: Numan Siddique <nusiddiq@redhat.com> >> >> After the commit [1], below test cases are failing repeatedly in travis CI. >> >> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597) >> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844) >> 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580) >> 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624) >> >> With the commit [1], ovn-controller sends GARPs for the IPs of the distributed >> router ports. The failing tests did not handle the situation if multiple GARPs >> are sent. The failures are mostly timing related. This patch fixes these issues. >> >> [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch") >> >> Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch") >> CC: Ilya Maximets <i.maximets@samsung.com> >> Signed-off-by: Numan Siddique <nusiddiq@redhat.com> >> --- > > Hi. > Thanks for working on this! > > I can confirm that this patch fixes frequent TravisCI failures. > There are still some occasional failures of ovn tests, but it they was > always there. (OVN tests has some timing issues). > > Tested-by: Ilya Maximets <i.maximets@samsung.com> > > However, I see that some failures was resolved by just removing the > checks from tests. This somehow decreases the test coverage. > So, It'll be good to have review from someone more familiar with > these tests than me. > > Ben, what do you think about this patch? Oh. You just applied it. So, I assume, it's OK for you. =) > > Best regards, Ilya Maximets. > >> tests/ovn.at | 53 ++++++++++++++++++++++++++++++++++------------------ >> 1 file changed, 35 insertions(+), 18 deletions(-) >> >> diff --git a/tests/ovn.at b/tests/ovn.at >> index 4da7059b3..95980f2f1 100644 >> --- a/tests/ovn.at >> +++ b/tests/ovn.at >> @@ -8593,7 +8593,9 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1 >> OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected]) >> $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap > packets >> cat packets | grep $expected > exp >> - cat packets | grep $exp_gw_ip_garp >> exp >> + # Its possible that $active_gw/br-phys_n1-tx.pcap may have received multiple >> + # garp packets. So consider only the first packet. >> + cat packets | grep $exp_gw_ip_garp | head -1 >> exp >> AT_CHECK([cat exp], [0], [expout]) >> rm -f expout >> if test $backup_vswitchd_dead != 1; then >> @@ -8840,7 +8842,7 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1 >> OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected]) >> $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap > packets >> cat packets | grep $expected > exp >> - cat packets | grep $exp_gw_ip_garp >> exp >> + cat packets | grep $exp_gw_ip_garp | head -1 >> exp >> AT_CHECK([cat exp], [0], [expout]) >> >> $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $backup_gw/br-phys_n1-tx.pcap > packets >> @@ -9567,20 +9569,9 @@ options:rxq_pcap=${pcap_file}-rx.pcap >> >> as hv1 reset_pcap_file br-ex_n2 hv1/br-ex_n2 >> as hv3 reset_pcap_file hv3-vif1 hv3/vif1 >> -sleep 2 >> -# Take note of how many packets arrived on the VLAN switch before generating >> -# further traffic >> -n_packets=`as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" | grep actions=clone | sed 's/.*n_packets=\([[0-9]]*\),.*/\1/'` >> as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet >> sleep 2 >> >> -# On hv1, the packet should not go from vlan switch pipleline to router >> -# pipeline >> -as hv1 ovs-ofctl dump-flows br-int >> -AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" \ >> -| grep actions=clone | grep -v n_packets=$n_packets | wc -l], [0], [[0 >> -]]) >> - >> # On hv1, table 32 check that no packet goes via the tunnel port >> AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=32 \ >> | grep "NXM_NX_TUN_ID" | grep -v n_packets=0 | wc -l], [0], [[0 >> @@ -9624,21 +9615,38 @@ echo $exp_garp_on_foo1 > foo1.expout >> >> # ovn-controller on hv2 should send garp with VLAN tag >> sent_garp="ffffffffffff0000010102038100000208060001080006040001000001010203c0a80101000000000000c0a80101" >> -echo $sent_garp > br-ex_n2.expout >> >> OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout]) >> -OVN_CHECK_PACKETS([hv2/br-ex_n2-tx.pcap], [br-ex_n2.expout]) >> +# Wait until we receive atleast 1 packet >> +OVS_WAIT_UNTIL([test 1=`$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | wc -l`]) >> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | head -1 > packets >> +echo $sent_garp > expout >> +AT_CHECK([cat packets], [0], [expout]) >> $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv4/br-ex_n2-tx.pcap > empty >> AT_CHECK([cat empty], [0], []) >> >> # Make hv4 master >> as hv1 reset_pcap_file hv1-vif1 hv1/vif1 >> -as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2 >> as hv4 reset_pcap_file br-ex_n2 hv4/br-ex_n2 >> ovn-nbctl --wait=sb ha-chassis-group-add-chassis hagrp1 hv4 40 >> >> +# Wait till cr-alice is claimed by hv4 >> +hv4_chassis=$(ovn-sbctl --bare --columns=_uuid find Chassis name=hv4) >> +# check that the chassis redirect port has been claimed by the gw1 chassis >> +OVS_WAIT_UNTIL([ovn-sbctl --columns chassis --bare find Port_Binding \ >> +logical_port=cr-alice | grep $hv4_chassis | wc -l], [0],[[1 >> +]]) >> + >> +# Reset the pcap file for hv2/br-ex_n2. From now on ovn-controller in hv2 >> +# should not send GARPs for the router ports. >> +as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2 >> + >> +echo $sent_garp > br-ex_n2.expout >> OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout]) >> OVN_CHECK_PACKETS([hv4/br-ex_n2-tx.pcap], [br-ex_n2.expout]) >> + >> +sleep 2 >> + >> $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap > empty >> AT_CHECK([cat empty], [0], []) >> >> @@ -13580,6 +13588,8 @@ test_ip_packet_larger() { >> orig_packet_l3=${orig_packet_l3}000000000000000000000000000000000000 >> packet=${packet}${orig_packet_l3} >> >> + gw_ip_garp=ffffffffffff00002020121308060001080006040001000020201213aca80064000000000000aca80064 >> + >> # If icmp_pmtu_reply_expected is 0, it means the packet is lesser than >> # the gateway mtu and should be delivered to the provider bridge via the >> # localnet port. >> @@ -13599,6 +13609,7 @@ test_ip_packet_larger() { >> expected=${expected}000000000000000000000000000000000000 >> expected=${expected}000000000000000000000000000000000000 >> echo $expected > br_phys_n1.expected >> + echo $gw_ip_garp >> br_phys_n1.expected >> else >> # MTU would be 100 - 18 = 82 (hex 0052) >> mtu=0052 >> @@ -13622,12 +13633,18 @@ test_ip_packet_larger() { >> >> if test $icmp_pmtu_reply_expected = 0; then >> OVN_CHECK_PACKETS([hv1/br-phys_n1-tx.pcap], [br_phys_n1.expected]) >> - $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap > packets >> + $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap > pkts >> + # hv1/vif1-tx.pcap can receive the GARP packet generated by ovn-controller >> + # for the gateway router port. So ignore this packet. >> + cat pkts | grep -v $gw_ip_garp > packets >> AT_CHECK([cat packets], [0], []) >> else >> OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [hv1-vif1.expected]) >> $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/br-phys_n1-tx.pcap > \ >> - packets >> + pkts >> + # hv1/br-phys_n1-tx.pcap can receive the GARP packet generated by ovn-controller >> + # for the gateway router port. So ignore this packet. >> + cat pkts | grep -v $gw_ip_garp > packets >> AT_CHECK([cat packets], [0], []) >> fi >> } >> > >
On Fri, Jul 12, 2019 at 07:09:42PM +0300, Ilya Maximets wrote: > On 12.07.2019 19:06, Ilya Maximets wrote: > > On 11.07.2019 19:00, nusiddiq@redhat.com wrote: > >> From: Numan Siddique <nusiddiq@redhat.com> > >> > >> After the commit [1], below test cases are failing repeatedly in travis CI. > >> > >> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597) > >> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844) > >> 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580) > >> 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624) > >> > >> With the commit [1], ovn-controller sends GARPs for the IPs of the distributed > >> router ports. The failing tests did not handle the situation if multiple GARPs > >> are sent. The failures are mostly timing related. This patch fixes these issues. > >> > >> [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch") > >> > >> Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch") > >> CC: Ilya Maximets <i.maximets@samsung.com> > >> Signed-off-by: Numan Siddique <nusiddiq@redhat.com> > >> --- > > > > Hi. > > Thanks for working on this! > > > > I can confirm that this patch fixes frequent TravisCI failures. > > There are still some occasional failures of ovn tests, but it they was > > always there. (OVN tests has some timing issues). > > > > Tested-by: Ilya Maximets <i.maximets@samsung.com> > > > > However, I see that some failures was resolved by just removing the > > checks from tests. This somehow decreases the test coverage. > > So, It'll be good to have review from someone more familiar with > > these tests than me. > > > > Ben, what do you think about this patch? > > Oh. You just applied it. So, I assume, it's OK for you. =) If there's a better way to do it, I'm all for it.
On Fri, Jul 12, 2019 at 10:29 PM Ben Pfaff <blp@ovn.org> wrote: > On Fri, Jul 12, 2019 at 07:09:42PM +0300, Ilya Maximets wrote: > > On 12.07.2019 19:06, Ilya Maximets wrote: > > > On 11.07.2019 19:00, nusiddiq@redhat.com wrote: > > >> From: Numan Siddique <nusiddiq@redhat.com> > > >> > > >> After the commit [1], below test cases are failing repeatedly in > travis CI. > > >> > > >> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router > gateway port FAILED (ovn.at:8597) > > >> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router > gateway port FAILED (ovn.at:8844) > > >> 2667: ovn -- vlan traffic for external network with distributed > router gateway port FAILED (ovn.at:9580) > > >> 2691: ovn -- router - check packet length - icmp defrag FAILED ( > ovn.at:13624) > > >> > > >> With the commit [1], ovn-controller sends GARPs for the IPs of the > distributed > > >> router ports. The failing tests did not handle the situation if > multiple GARPs > > >> are sent. The failures are mostly timing related. This patch fixes > these issues. > > >> > > >> [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router > port connected to bridged logical switch") > > >> > > >> Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router > port connected to bridged logical switch") > > >> CC: Ilya Maximets <i.maximets@samsung.com> > > >> Signed-off-by: Numan Siddique <nusiddiq@redhat.com> > > >> --- > > > > > > Hi. > > > Thanks for working on this! > > > > > > I can confirm that this patch fixes frequent TravisCI failures. > > > There are still some occasional failures of ovn tests, but it they was > > > always there. (OVN tests has some timing issues). > > > > > > Tested-by: Ilya Maximets <i.maximets@samsung.com> > > > > > > However, I see that some failures was resolved by just removing the > > > checks from tests. This somehow decreases the test coverage. > > > So, It'll be good to have review from someone more familiar with > > > these tests than me. > > > > > > Ben, what do you think about this patch? > > > > Oh. You just applied it. So, I assume, it's OK for you. =) > > If there's a better way to do it, I'm all for it. > There is an issue in ovn-northd after this commit [1]. This commit makes use of the function - sbrec_port_binding_update_nat_addresses_addvalue() to update the Port_Binding.nat_addresses column. Once this code is hit, ovn-northd wakes up from the poll_block() continuously hogging the CPU. I think we are seeing these CI test issues because of this. From the ovn-northd logs I can see the below transaction messages sent all the time. ********* .......... 2019-07-12T17:26:13.837Z|74511|poll_loop|DBG|wakeup due to [POLLIN] on fd 11 (<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb) at ../lib/stream-fd.c:157 (75% CPU usage) 2019-07-12T17:26:13.837Z|74512|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18628 2019-07-12T17:26:13.837Z|74513|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage) 2019-07-12T17:26:13.837Z|74514|jsonrpc|DBG|unix:sb1.ovsdb: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13 172.168.0.100 is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18629 2019-07-12T17:26:13.837Z|74515|poll_loop|DBG|wakeup due to [POLLIN] on fd 11 (<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb) at ../lib/stream-fd.c:157 (75% CPU usage) 2019-07-12T17:26:13.837Z|74516|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18629 2019-07-12T17:26:13.837Z|74517|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage) 2019-07-12T17:26:13.837Z|74518|jsonrpc|DBG|unix:sb1.ovsdb: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13 172.168.0.100 is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18630 2019-07-12T17:26:13.837Z|74519|poll_loop|DBG|wakeup due to [POLLIN] on fd 11 (<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb) at ../lib/stream-fd.c:157 (75% CPU usage) 2019-07-12T17:26:13.837Z|74520|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18630 **************** We are seeing timing related frequent failures in the OpenStack CI tests for networking-ovn. Looks to me there is an issue with these variants (*update*_addvalue) of IDL functions. I am submitting a patch in ovn-northd to not use this function as we need a fix to unblock the Openstack CI gate. But l think the actual issue seems to be in the IDL client code. The issue can be reproduced with the below commands in the sandbox with ovn enabled. ************* ovs-appctl -t ovn-northd vlog/set dbg ovn-nbctl lr-add lr0 ovn-nbctl ls-add public ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24 ovn-nbctl lsp-add public public-lr0 ovn-nbctl lsp-set-type public-lr0 router ovn-nbctl lsp-set-addresses public-lr0 router ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20 ovn-nbctl lsp-add public ln-public ovn-nbctl lsp-set-type ln-public localnet ovn-nbctl lsp-set-addresses ln-public unknown ovn-nbctl lsp-set-options ln-public network_name=public ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20 ovn-nbctl lr-add lr0 ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24 ************. Thanks Numan [1] - https://github.com/openvswitch/ovs/commit/ed198fb3b92e2a0b1f594c22280803bfc2f66029#diff-2c35162acf6ad144624954fdc4c3d9f4
diff --git a/tests/ovn.at b/tests/ovn.at index 4da7059b3..95980f2f1 100644 --- a/tests/ovn.at +++ b/tests/ovn.at @@ -8593,7 +8593,9 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1 OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected]) $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap > packets cat packets | grep $expected > exp - cat packets | grep $exp_gw_ip_garp >> exp + # Its possible that $active_gw/br-phys_n1-tx.pcap may have received multiple + # garp packets. So consider only the first packet. + cat packets | grep $exp_gw_ip_garp | head -1 >> exp AT_CHECK([cat exp], [0], [expout]) rm -f expout if test $backup_vswitchd_dead != 1; then @@ -8840,7 +8842,7 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1 OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected]) $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap > packets cat packets | grep $expected > exp - cat packets | grep $exp_gw_ip_garp >> exp + cat packets | grep $exp_gw_ip_garp | head -1 >> exp AT_CHECK([cat exp], [0], [expout]) $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $backup_gw/br-phys_n1-tx.pcap > packets @@ -9567,20 +9569,9 @@ options:rxq_pcap=${pcap_file}-rx.pcap as hv1 reset_pcap_file br-ex_n2 hv1/br-ex_n2 as hv3 reset_pcap_file hv3-vif1 hv3/vif1 -sleep 2 -# Take note of how many packets arrived on the VLAN switch before generating -# further traffic -n_packets=`as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" | grep actions=clone | sed 's/.*n_packets=\([[0-9]]*\),.*/\1/'` as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet sleep 2 -# On hv1, the packet should not go from vlan switch pipleline to router -# pipeline -as hv1 ovs-ofctl dump-flows br-int -AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" \ -| grep actions=clone | grep -v n_packets=$n_packets | wc -l], [0], [[0 -]]) - # On hv1, table 32 check that no packet goes via the tunnel port AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=32 \ | grep "NXM_NX_TUN_ID" | grep -v n_packets=0 | wc -l], [0], [[0 @@ -9624,21 +9615,38 @@ echo $exp_garp_on_foo1 > foo1.expout # ovn-controller on hv2 should send garp with VLAN tag sent_garp="ffffffffffff0000010102038100000208060001080006040001000001010203c0a80101000000000000c0a80101" -echo $sent_garp > br-ex_n2.expout OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout]) -OVN_CHECK_PACKETS([hv2/br-ex_n2-tx.pcap], [br-ex_n2.expout]) +# Wait until we receive atleast 1 packet +OVS_WAIT_UNTIL([test 1=`$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | wc -l`]) +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | head -1 > packets +echo $sent_garp > expout +AT_CHECK([cat packets], [0], [expout]) $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv4/br-ex_n2-tx.pcap > empty AT_CHECK([cat empty], [0], []) # Make hv4 master as hv1 reset_pcap_file hv1-vif1 hv1/vif1 -as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2 as hv4 reset_pcap_file br-ex_n2 hv4/br-ex_n2 ovn-nbctl --wait=sb ha-chassis-group-add-chassis hagrp1 hv4 40 +# Wait till cr-alice is claimed by hv4 +hv4_chassis=$(ovn-sbctl --bare --columns=_uuid find Chassis name=hv4) +# check that the chassis redirect port has been claimed by the gw1 chassis +OVS_WAIT_UNTIL([ovn-sbctl --columns chassis --bare find Port_Binding \ +logical_port=cr-alice | grep $hv4_chassis | wc -l], [0],[[1 +]]) + +# Reset the pcap file for hv2/br-ex_n2. From now on ovn-controller in hv2 +# should not send GARPs for the router ports. +as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2 + +echo $sent_garp > br-ex_n2.expout OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout]) OVN_CHECK_PACKETS([hv4/br-ex_n2-tx.pcap], [br-ex_n2.expout]) + +sleep 2 + $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap > empty AT_CHECK([cat empty], [0], []) @@ -13580,6 +13588,8 @@ test_ip_packet_larger() { orig_packet_l3=${orig_packet_l3}000000000000000000000000000000000000 packet=${packet}${orig_packet_l3} + gw_ip_garp=ffffffffffff00002020121308060001080006040001000020201213aca80064000000000000aca80064 + # If icmp_pmtu_reply_expected is 0, it means the packet is lesser than # the gateway mtu and should be delivered to the provider bridge via the # localnet port. @@ -13599,6 +13609,7 @@ test_ip_packet_larger() { expected=${expected}000000000000000000000000000000000000 expected=${expected}000000000000000000000000000000000000 echo $expected > br_phys_n1.expected + echo $gw_ip_garp >> br_phys_n1.expected else # MTU would be 100 - 18 = 82 (hex 0052) mtu=0052 @@ -13622,12 +13633,18 @@ test_ip_packet_larger() { if test $icmp_pmtu_reply_expected = 0; then OVN_CHECK_PACKETS([hv1/br-phys_n1-tx.pcap], [br_phys_n1.expected]) - $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap > packets + $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap > pkts + # hv1/vif1-tx.pcap can receive the GARP packet generated by ovn-controller + # for the gateway router port. So ignore this packet. + cat pkts | grep -v $gw_ip_garp > packets AT_CHECK([cat packets], [0], []) else OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [hv1-vif1.expected]) $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/br-phys_n1-tx.pcap > \ - packets + pkts + # hv1/br-phys_n1-tx.pcap can receive the GARP packet generated by ovn-controller + # for the gateway router port. So ignore this packet. + cat pkts | grep -v $gw_ip_garp > packets AT_CHECK([cat packets], [0], []) fi }