[ovs-dev] ovn: Fix the test failures in travis CI.
diff mbox series

Message ID 20190711160033.2604-1-nusiddiq@redhat.com
State Accepted
Headers show
Series
  • [ovs-dev] ovn: Fix the test failures in travis CI.
Related show

Commit Message

Numan Siddique July 11, 2019, 4 p.m. UTC
From: Numan Siddique <nusiddiq@redhat.com>

After the commit [1], below test cases are failing repeatedly in travis CI.

2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597)
2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844)
2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580)
2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624)

With the commit [1], ovn-controller sends GARPs for the IPs of the distributed
router ports. The failing tests did not handle the situation if multiple GARPs
are sent. The failures are mostly timing related. This patch fixes these issues.

[1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")

Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
CC: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
---
 tests/ovn.at | 53 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 35 insertions(+), 18 deletions(-)

Comments

Ben Pfaff July 12, 2019, 3:59 p.m. UTC | #1
On Thu, Jul 11, 2019 at 09:30:33PM +0530, nusiddiq@redhat.com wrote:
> From: Numan Siddique <nusiddiq@redhat.com>
> 
> After the commit [1], below test cases are failing repeatedly in travis CI.
> 
> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597)
> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844)
> 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580)
> 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624)
> 
> With the commit [1], ovn-controller sends GARPs for the IPs of the distributed
> router ports. The failing tests did not handle the situation if multiple GARPs
> are sent. The failures are mostly timing related. This patch fixes these issues.

Thanks!  Applied to master.
Ilya Maximets July 12, 2019, 4:06 p.m. UTC | #2
On 11.07.2019 19:00, nusiddiq@redhat.com wrote:
> From: Numan Siddique <nusiddiq@redhat.com>
> 
> After the commit [1], below test cases are failing repeatedly in travis CI.
> 
> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597)
> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844)
> 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580)
> 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624)
> 
> With the commit [1], ovn-controller sends GARPs for the IPs of the distributed
> router ports. The failing tests did not handle the situation if multiple GARPs
> are sent. The failures are mostly timing related. This patch fixes these issues.
> 
> [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
> 
> Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
> CC: Ilya Maximets <i.maximets@samsung.com>
> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> ---

Hi.
Thanks for working on this!

I can confirm that this patch fixes frequent TravisCI failures.
There are still some occasional failures of ovn tests, but it they was
always there. (OVN tests has some timing issues).

Tested-by: Ilya Maximets <i.maximets@samsung.com>

However, I see that some failures was resolved by just removing the
checks from tests. This somehow decreases the test coverage.
So, It'll be good to have review from someone more familiar with
these tests than me.

Ben, what do you think about this patch?

Best regards, Ilya Maximets.

>  tests/ovn.at | 53 ++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 35 insertions(+), 18 deletions(-)
> 
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 4da7059b3..95980f2f1 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -8593,7 +8593,9 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1
>      OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected])
>      $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap  > packets
>      cat packets | grep $expected > exp
> -    cat packets | grep $exp_gw_ip_garp >> exp
> +    # Its possible that $active_gw/br-phys_n1-tx.pcap may have received multiple
> +    # garp packets. So consider only the first packet.
> +    cat packets | grep $exp_gw_ip_garp | head -1 >> exp
>      AT_CHECK([cat exp], [0], [expout])
>      rm -f expout
>      if test $backup_vswitchd_dead != 1; then
> @@ -8840,7 +8842,7 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1
>      OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected])
>      $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap  > packets
>      cat packets | grep $expected > exp
> -    cat packets | grep $exp_gw_ip_garp >> exp
> +    cat packets | grep $exp_gw_ip_garp | head -1 >> exp
>      AT_CHECK([cat exp], [0], [expout])
>  
>      $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $backup_gw/br-phys_n1-tx.pcap  > packets
> @@ -9567,20 +9569,9 @@ options:rxq_pcap=${pcap_file}-rx.pcap
>  
>  as hv1 reset_pcap_file br-ex_n2 hv1/br-ex_n2
>  as hv3 reset_pcap_file hv3-vif1 hv3/vif1
> -sleep 2
> -# Take note of how many packets arrived on the VLAN switch before generating
> -# further traffic
> -n_packets=`as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" | grep actions=clone | sed 's/.*n_packets=\([[0-9]]*\),.*/\1/'`
>  as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>  sleep 2
>  
> -# On hv1, the packet should not go from vlan switch pipleline to router
> -# pipeline
> -as hv1 ovs-ofctl dump-flows br-int
> -AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" \
> -| grep actions=clone | grep -v n_packets=$n_packets | wc -l], [0], [[0
> -]])
> -
>  # On hv1, table 32 check that no packet goes via the tunnel port
>  AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=32 \
>  | grep "NXM_NX_TUN_ID" | grep -v n_packets=0 | wc -l], [0], [[0
> @@ -9624,21 +9615,38 @@ echo $exp_garp_on_foo1 > foo1.expout
>  
>  # ovn-controller on hv2 should send garp with VLAN tag
>  sent_garp="ffffffffffff0000010102038100000208060001080006040001000001010203c0a80101000000000000c0a80101"
> -echo $sent_garp > br-ex_n2.expout
>  
>  OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout])
> -OVN_CHECK_PACKETS([hv2/br-ex_n2-tx.pcap], [br-ex_n2.expout])
> +# Wait until we receive atleast 1 packet
> +OVS_WAIT_UNTIL([test 1=`$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | wc -l`])
> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | head -1 > packets
> +echo $sent_garp > expout
> +AT_CHECK([cat packets], [0], [expout])
>  $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv4/br-ex_n2-tx.pcap > empty
>  AT_CHECK([cat empty], [0], [])
>  
>  # Make hv4 master
>  as hv1 reset_pcap_file hv1-vif1 hv1/vif1
> -as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2
>  as hv4 reset_pcap_file br-ex_n2 hv4/br-ex_n2
>  ovn-nbctl --wait=sb ha-chassis-group-add-chassis hagrp1 hv4 40
>  
> +# Wait till cr-alice is claimed by hv4
> +hv4_chassis=$(ovn-sbctl --bare --columns=_uuid find Chassis name=hv4)
> +# check that the chassis redirect port has been claimed by the gw1 chassis
> +OVS_WAIT_UNTIL([ovn-sbctl --columns chassis --bare find Port_Binding \
> +logical_port=cr-alice | grep $hv4_chassis | wc -l], [0],[[1
> +]])
> +
> +# Reset the pcap file for hv2/br-ex_n2. From now on ovn-controller in hv2
> +# should not send GARPs for the router ports.
> +as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2
> +
> +echo $sent_garp > br-ex_n2.expout
>  OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout])
>  OVN_CHECK_PACKETS([hv4/br-ex_n2-tx.pcap], [br-ex_n2.expout])
> +
> +sleep 2
> +
>  $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap > empty
>  AT_CHECK([cat empty], [0], [])
>  
> @@ -13580,6 +13588,8 @@ test_ip_packet_larger() {
>      orig_packet_l3=${orig_packet_l3}000000000000000000000000000000000000
>      packet=${packet}${orig_packet_l3}
>  
> +    gw_ip_garp=ffffffffffff00002020121308060001080006040001000020201213aca80064000000000000aca80064
> +
>      # If icmp_pmtu_reply_expected is 0, it means the packet is lesser than
>      # the gateway mtu and should be delivered to the provider bridge via the
>      # localnet port.
> @@ -13599,6 +13609,7 @@ test_ip_packet_larger() {
>          expected=${expected}000000000000000000000000000000000000
>          expected=${expected}000000000000000000000000000000000000
>          echo $expected > br_phys_n1.expected
> +        echo $gw_ip_garp >> br_phys_n1.expected
>      else
>          # MTU would be 100 - 18 = 82 (hex 0052)
>          mtu=0052
> @@ -13622,12 +13633,18 @@ test_ip_packet_larger() {
>  
>      if test $icmp_pmtu_reply_expected = 0; then
>          OVN_CHECK_PACKETS([hv1/br-phys_n1-tx.pcap], [br_phys_n1.expected])
> -        $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap  > packets
> +        $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap  > pkts
> +        # hv1/vif1-tx.pcap can receive the GARP packet generated by ovn-controller
> +        # for the gateway router port. So ignore this packet.
> +        cat pkts | grep -v $gw_ip_garp > packets
>          AT_CHECK([cat packets], [0], [])
>      else
>          OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [hv1-vif1.expected])
>          $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/br-phys_n1-tx.pcap  > \
> -        packets
> +        pkts
> +        # hv1/br-phys_n1-tx.pcap can receive the GARP packet generated by ovn-controller
> +        # for the gateway router port. So ignore this packet.
> +        cat pkts | grep -v $gw_ip_garp > packets
>          AT_CHECK([cat packets], [0], [])
>      fi
>  }
>
Ilya Maximets July 12, 2019, 4:09 p.m. UTC | #3
On 12.07.2019 19:06, Ilya Maximets wrote:
> On 11.07.2019 19:00, nusiddiq@redhat.com wrote:
>> From: Numan Siddique <nusiddiq@redhat.com>
>>
>> After the commit [1], below test cases are failing repeatedly in travis CI.
>>
>> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597)
>> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844)
>> 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580)
>> 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624)
>>
>> With the commit [1], ovn-controller sends GARPs for the IPs of the distributed
>> router ports. The failing tests did not handle the situation if multiple GARPs
>> are sent. The failures are mostly timing related. This patch fixes these issues.
>>
>> [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
>>
>> Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
>> CC: Ilya Maximets <i.maximets@samsung.com>
>> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
>> ---
> 
> Hi.
> Thanks for working on this!
> 
> I can confirm that this patch fixes frequent TravisCI failures.
> There are still some occasional failures of ovn tests, but it they was
> always there. (OVN tests has some timing issues).
> 
> Tested-by: Ilya Maximets <i.maximets@samsung.com>
> 
> However, I see that some failures was resolved by just removing the
> checks from tests. This somehow decreases the test coverage.
> So, It'll be good to have review from someone more familiar with
> these tests than me.
> 
> Ben, what do you think about this patch?

Oh. You just applied it. So, I assume, it's OK for you. =)

> 
> Best regards, Ilya Maximets.
> 
>>  tests/ovn.at | 53 ++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 35 insertions(+), 18 deletions(-)
>>
>> diff --git a/tests/ovn.at b/tests/ovn.at
>> index 4da7059b3..95980f2f1 100644
>> --- a/tests/ovn.at
>> +++ b/tests/ovn.at
>> @@ -8593,7 +8593,9 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1
>>      OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected])
>>      $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap  > packets
>>      cat packets | grep $expected > exp
>> -    cat packets | grep $exp_gw_ip_garp >> exp
>> +    # Its possible that $active_gw/br-phys_n1-tx.pcap may have received multiple
>> +    # garp packets. So consider only the first packet.
>> +    cat packets | grep $exp_gw_ip_garp | head -1 >> exp
>>      AT_CHECK([cat exp], [0], [expout])
>>      rm -f expout
>>      if test $backup_vswitchd_dead != 1; then
>> @@ -8840,7 +8842,7 @@ grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1
>>      OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected])
>>      $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap  > packets
>>      cat packets | grep $expected > exp
>> -    cat packets | grep $exp_gw_ip_garp >> exp
>> +    cat packets | grep $exp_gw_ip_garp | head -1 >> exp
>>      AT_CHECK([cat exp], [0], [expout])
>>  
>>      $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $backup_gw/br-phys_n1-tx.pcap  > packets
>> @@ -9567,20 +9569,9 @@ options:rxq_pcap=${pcap_file}-rx.pcap
>>  
>>  as hv1 reset_pcap_file br-ex_n2 hv1/br-ex_n2
>>  as hv3 reset_pcap_file hv3-vif1 hv3/vif1
>> -sleep 2
>> -# Take note of how many packets arrived on the VLAN switch before generating
>> -# further traffic
>> -n_packets=`as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" | grep actions=clone | sed 's/.*n_packets=\([[0-9]]*\),.*/\1/'`
>>  as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
>>  sleep 2
>>  
>> -# On hv1, the packet should not go from vlan switch pipleline to router
>> -# pipeline
>> -as hv1 ovs-ofctl dump-flows br-int
>> -AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" \
>> -| grep actions=clone | grep -v n_packets=$n_packets | wc -l], [0], [[0
>> -]])
>> -
>>  # On hv1, table 32 check that no packet goes via the tunnel port
>>  AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=32 \
>>  | grep "NXM_NX_TUN_ID" | grep -v n_packets=0 | wc -l], [0], [[0
>> @@ -9624,21 +9615,38 @@ echo $exp_garp_on_foo1 > foo1.expout
>>  
>>  # ovn-controller on hv2 should send garp with VLAN tag
>>  sent_garp="ffffffffffff0000010102038100000208060001080006040001000001010203c0a80101000000000000c0a80101"
>> -echo $sent_garp > br-ex_n2.expout
>>  
>>  OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout])
>> -OVN_CHECK_PACKETS([hv2/br-ex_n2-tx.pcap], [br-ex_n2.expout])
>> +# Wait until we receive atleast 1 packet
>> +OVS_WAIT_UNTIL([test 1=`$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | wc -l`])
>> +$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | head -1 > packets
>> +echo $sent_garp > expout
>> +AT_CHECK([cat packets], [0], [expout])
>>  $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv4/br-ex_n2-tx.pcap > empty
>>  AT_CHECK([cat empty], [0], [])
>>  
>>  # Make hv4 master
>>  as hv1 reset_pcap_file hv1-vif1 hv1/vif1
>> -as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2
>>  as hv4 reset_pcap_file br-ex_n2 hv4/br-ex_n2
>>  ovn-nbctl --wait=sb ha-chassis-group-add-chassis hagrp1 hv4 40
>>  
>> +# Wait till cr-alice is claimed by hv4
>> +hv4_chassis=$(ovn-sbctl --bare --columns=_uuid find Chassis name=hv4)
>> +# check that the chassis redirect port has been claimed by the gw1 chassis
>> +OVS_WAIT_UNTIL([ovn-sbctl --columns chassis --bare find Port_Binding \
>> +logical_port=cr-alice | grep $hv4_chassis | wc -l], [0],[[1
>> +]])
>> +
>> +# Reset the pcap file for hv2/br-ex_n2. From now on ovn-controller in hv2
>> +# should not send GARPs for the router ports.
>> +as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2
>> +
>> +echo $sent_garp > br-ex_n2.expout
>>  OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout])
>>  OVN_CHECK_PACKETS([hv4/br-ex_n2-tx.pcap], [br-ex_n2.expout])
>> +
>> +sleep 2
>> +
>>  $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap > empty
>>  AT_CHECK([cat empty], [0], [])
>>  
>> @@ -13580,6 +13588,8 @@ test_ip_packet_larger() {
>>      orig_packet_l3=${orig_packet_l3}000000000000000000000000000000000000
>>      packet=${packet}${orig_packet_l3}
>>  
>> +    gw_ip_garp=ffffffffffff00002020121308060001080006040001000020201213aca80064000000000000aca80064
>> +
>>      # If icmp_pmtu_reply_expected is 0, it means the packet is lesser than
>>      # the gateway mtu and should be delivered to the provider bridge via the
>>      # localnet port.
>> @@ -13599,6 +13609,7 @@ test_ip_packet_larger() {
>>          expected=${expected}000000000000000000000000000000000000
>>          expected=${expected}000000000000000000000000000000000000
>>          echo $expected > br_phys_n1.expected
>> +        echo $gw_ip_garp >> br_phys_n1.expected
>>      else
>>          # MTU would be 100 - 18 = 82 (hex 0052)
>>          mtu=0052
>> @@ -13622,12 +13633,18 @@ test_ip_packet_larger() {
>>  
>>      if test $icmp_pmtu_reply_expected = 0; then
>>          OVN_CHECK_PACKETS([hv1/br-phys_n1-tx.pcap], [br_phys_n1.expected])
>> -        $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap  > packets
>> +        $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap  > pkts
>> +        # hv1/vif1-tx.pcap can receive the GARP packet generated by ovn-controller
>> +        # for the gateway router port. So ignore this packet.
>> +        cat pkts | grep -v $gw_ip_garp > packets
>>          AT_CHECK([cat packets], [0], [])
>>      else
>>          OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [hv1-vif1.expected])
>>          $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/br-phys_n1-tx.pcap  > \
>> -        packets
>> +        pkts
>> +        # hv1/br-phys_n1-tx.pcap can receive the GARP packet generated by ovn-controller
>> +        # for the gateway router port. So ignore this packet.
>> +        cat pkts | grep -v $gw_ip_garp > packets
>>          AT_CHECK([cat packets], [0], [])
>>      fi
>>  }
>>
> 
>
Ben Pfaff July 12, 2019, 4:59 p.m. UTC | #4
On Fri, Jul 12, 2019 at 07:09:42PM +0300, Ilya Maximets wrote:
> On 12.07.2019 19:06, Ilya Maximets wrote:
> > On 11.07.2019 19:00, nusiddiq@redhat.com wrote:
> >> From: Numan Siddique <nusiddiq@redhat.com>
> >>
> >> After the commit [1], below test cases are failing repeatedly in travis CI.
> >>
> >> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597)
> >> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844)
> >> 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580)
> >> 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624)
> >>
> >> With the commit [1], ovn-controller sends GARPs for the IPs of the distributed
> >> router ports. The failing tests did not handle the situation if multiple GARPs
> >> are sent. The failures are mostly timing related. This patch fixes these issues.
> >>
> >> [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
> >>
> >> Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
> >> CC: Ilya Maximets <i.maximets@samsung.com>
> >> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> >> ---
> > 
> > Hi.
> > Thanks for working on this!
> > 
> > I can confirm that this patch fixes frequent TravisCI failures.
> > There are still some occasional failures of ovn tests, but it they was
> > always there. (OVN tests has some timing issues).
> > 
> > Tested-by: Ilya Maximets <i.maximets@samsung.com>
> > 
> > However, I see that some failures was resolved by just removing the
> > checks from tests. This somehow decreases the test coverage.
> > So, It'll be good to have review from someone more familiar with
> > these tests than me.
> > 
> > Ben, what do you think about this patch?
> 
> Oh. You just applied it. So, I assume, it's OK for you. =)

If there's a better way to do it, I'm all for it.
Numan Siddique July 12, 2019, 5:45 p.m. UTC | #5
On Fri, Jul 12, 2019 at 10:29 PM Ben Pfaff <blp@ovn.org> wrote:

> On Fri, Jul 12, 2019 at 07:09:42PM +0300, Ilya Maximets wrote:
> > On 12.07.2019 19:06, Ilya Maximets wrote:
> > > On 11.07.2019 19:00, nusiddiq@redhat.com wrote:
> > >> From: Numan Siddique <nusiddiq@redhat.com>
> > >>
> > >> After the commit [1], below test cases are failing repeatedly in
> travis CI.
> > >>
> > >> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router
> gateway port FAILED (ovn.at:8597)
> > >> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router
> gateway port FAILED (ovn.at:8844)
> > >> 2667: ovn -- vlan traffic for external network with distributed
> router gateway port FAILED (ovn.at:9580)
> > >> 2691: ovn -- router - check packet length - icmp defrag FAILED (
> ovn.at:13624)
> > >>
> > >> With the commit [1], ovn-controller sends GARPs for the IPs of the
> distributed
> > >> router ports. The failing tests did not handle the situation if
> multiple GARPs
> > >> are sent. The failures are mostly timing related. This patch fixes
> these issues.
> > >>
> > >> [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router
> port connected to bridged logical switch")
> > >>
> > >> Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router
> port connected to bridged logical switch")
> > >> CC: Ilya Maximets <i.maximets@samsung.com>
> > >> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> > >> ---
> > >
> > > Hi.
> > > Thanks for working on this!
> > >
> > > I can confirm that this patch fixes frequent TravisCI failures.
> > > There are still some occasional failures of ovn tests, but it they was
> > > always there. (OVN tests has some timing issues).
> > >
> > > Tested-by: Ilya Maximets <i.maximets@samsung.com>
> > >
> > > However, I see that some failures was resolved by just removing the
> > > checks from tests. This somehow decreases the test coverage.
> > > So, It'll be good to have review from someone more familiar with
> > > these tests than me.
> > >
> > > Ben, what do you think about this patch?
> >
> > Oh. You just applied it. So, I assume, it's OK for you. =)
>
> If there's a better way to do it, I'm all for it.
>

There is an issue in ovn-northd after this commit [1]. This commit makes
use of the function
- sbrec_port_binding_update_nat_addresses_addvalue() to update the
Port_Binding.nat_addresses column.

Once this code is hit, ovn-northd wakes up from the poll_block()
continuously hogging the CPU.
I think we are seeing these CI test issues because of this.

From the ovn-northd logs I can see the below transaction messages sent all
the time.

*********
..........
2019-07-12T17:26:13.837Z|74511|poll_loop|DBG|wakeup due to [POLLIN] on fd
11
(<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb)
at ../lib/stream-fd.c:157 (75% CPU usage)
2019-07-12T17:26:13.837Z|74512|jsonrpc|DBG|unix:sb1.ovsdb: received reply,
result=[{},{"count":1},{"count":1}], id=18628
2019-07-12T17:26:13.837Z|74513|poll_loop|DBG|wakeup due to 0-ms timeout at
../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74514|jsonrpc|DBG|unix:sb1.ovsdb: send request,
method="transact",
params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13
172.168.0.100
is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}],
id=18629
2019-07-12T17:26:13.837Z|74515|poll_loop|DBG|wakeup due to [POLLIN] on fd
11
(<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb)
at ../lib/stream-fd.c:157 (75% CPU usage)
2019-07-12T17:26:13.837Z|74516|jsonrpc|DBG|unix:sb1.ovsdb: received reply,
result=[{},{"count":1},{"count":1}], id=18629
2019-07-12T17:26:13.837Z|74517|poll_loop|DBG|wakeup due to 0-ms timeout at
../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74518|jsonrpc|DBG|unix:sb1.ovsdb: send request,
method="transact",
params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13
172.168.0.100
is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}],
id=18630
2019-07-12T17:26:13.837Z|74519|poll_loop|DBG|wakeup due to [POLLIN] on fd
11
(<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb)
at ../lib/stream-fd.c:157 (75% CPU usage)
2019-07-12T17:26:13.837Z|74520|jsonrpc|DBG|unix:sb1.ovsdb: received reply,
result=[{},{"count":1},{"count":1}], id=18630
****************

We are seeing timing related frequent failures in the OpenStack CI tests
for networking-ovn.

Looks to me there is an issue with these variants (*update*_addvalue)  of
IDL functions.
I am submitting a patch in ovn-northd to not use this function as we need a
fix to unblock the Openstack CI gate.
But l think the actual issue seems to be in the IDL client code.

The issue can be reproduced with the below commands in the sandbox with ovn
enabled.

*************
ovs-appctl -t ovn-northd vlog/set dbg

ovn-nbctl lr-add lr0
ovn-nbctl ls-add public
ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
ovn-nbctl lsp-add public public-lr0
ovn-nbctl lsp-set-type public-lr0 router
ovn-nbctl lsp-set-addresses public-lr0 router
ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20

ovn-nbctl lsp-add public ln-public
ovn-nbctl lsp-set-type ln-public localnet
ovn-nbctl lsp-set-addresses ln-public unknown
ovn-nbctl lsp-set-options ln-public network_name=public
ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20

ovn-nbctl lr-add lr0
ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
************.

Thanks
Numan





[1] -
https://github.com/openvswitch/ovs/commit/ed198fb3b92e2a0b1f594c22280803bfc2f66029#diff-2c35162acf6ad144624954fdc4c3d9f4

Patch
diff mbox series

diff --git a/tests/ovn.at b/tests/ovn.at
index 4da7059b3..95980f2f1 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -8593,7 +8593,9 @@  grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1
     OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected])
     $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap  > packets
     cat packets | grep $expected > exp
-    cat packets | grep $exp_gw_ip_garp >> exp
+    # Its possible that $active_gw/br-phys_n1-tx.pcap may have received multiple
+    # garp packets. So consider only the first packet.
+    cat packets | grep $exp_gw_ip_garp | head -1 >> exp
     AT_CHECK([cat exp], [0], [expout])
     rm -f expout
     if test $backup_vswitchd_dead != 1; then
@@ -8840,7 +8842,7 @@  grep actions=mod_dl_dst:f0:00:00:01:02:04 | wc -l` -eq 1
     OVN_CHECK_PACKETS([ext1/vif1-tx.pcap], [ext1-vif1.expected])
     $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $active_gw/br-phys_n1-tx.pcap  > packets
     cat packets | grep $expected > exp
-    cat packets | grep $exp_gw_ip_garp >> exp
+    cat packets | grep $exp_gw_ip_garp | head -1 >> exp
     AT_CHECK([cat exp], [0], [expout])
 
     $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $backup_gw/br-phys_n1-tx.pcap  > packets
@@ -9567,20 +9569,9 @@  options:rxq_pcap=${pcap_file}-rx.pcap
 
 as hv1 reset_pcap_file br-ex_n2 hv1/br-ex_n2
 as hv3 reset_pcap_file hv3-vif1 hv3/vif1
-sleep 2
-# Take note of how many packets arrived on the VLAN switch before generating
-# further traffic
-n_packets=`as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" | grep actions=clone | sed 's/.*n_packets=\([[0-9]]*\),.*/\1/'`
 as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
 sleep 2
 
-# On hv1, the packet should not go from vlan switch pipleline to router
-# pipeline
-as hv1 ovs-ofctl dump-flows br-int
-AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=65 | grep "priority=100,reg15=0x1,metadata=0x2" \
-| grep actions=clone | grep -v n_packets=$n_packets | wc -l], [0], [[0
-]])
-
 # On hv1, table 32 check that no packet goes via the tunnel port
 AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=32 \
 | grep "NXM_NX_TUN_ID" | grep -v n_packets=0 | wc -l], [0], [[0
@@ -9624,21 +9615,38 @@  echo $exp_garp_on_foo1 > foo1.expout
 
 # ovn-controller on hv2 should send garp with VLAN tag
 sent_garp="ffffffffffff0000010102038100000208060001080006040001000001010203c0a80101000000000000c0a80101"
-echo $sent_garp > br-ex_n2.expout
 
 OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout])
-OVN_CHECK_PACKETS([hv2/br-ex_n2-tx.pcap], [br-ex_n2.expout])
+# Wait until we receive atleast 1 packet
+OVS_WAIT_UNTIL([test 1=`$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | wc -l`])
+$PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap | head -1 > packets
+echo $sent_garp > expout
+AT_CHECK([cat packets], [0], [expout])
 $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv4/br-ex_n2-tx.pcap > empty
 AT_CHECK([cat empty], [0], [])
 
 # Make hv4 master
 as hv1 reset_pcap_file hv1-vif1 hv1/vif1
-as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2
 as hv4 reset_pcap_file br-ex_n2 hv4/br-ex_n2
 ovn-nbctl --wait=sb ha-chassis-group-add-chassis hagrp1 hv4 40
 
+# Wait till cr-alice is claimed by hv4
+hv4_chassis=$(ovn-sbctl --bare --columns=_uuid find Chassis name=hv4)
+# check that the chassis redirect port has been claimed by the gw1 chassis
+OVS_WAIT_UNTIL([ovn-sbctl --columns chassis --bare find Port_Binding \
+logical_port=cr-alice | grep $hv4_chassis | wc -l], [0],[[1
+]])
+
+# Reset the pcap file for hv2/br-ex_n2. From now on ovn-controller in hv2
+# should not send GARPs for the router ports.
+as hv2 reset_pcap_file br-ex_n2 hv2/br-ex_n2
+
+echo $sent_garp > br-ex_n2.expout
 OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [foo1.expout])
 OVN_CHECK_PACKETS([hv4/br-ex_n2-tx.pcap], [br-ex_n2.expout])
+
+sleep 2
+
 $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv2/br-ex_n2-tx.pcap > empty
 AT_CHECK([cat empty], [0], [])
 
@@ -13580,6 +13588,8 @@  test_ip_packet_larger() {
     orig_packet_l3=${orig_packet_l3}000000000000000000000000000000000000
     packet=${packet}${orig_packet_l3}
 
+    gw_ip_garp=ffffffffffff00002020121308060001080006040001000020201213aca80064000000000000aca80064
+
     # If icmp_pmtu_reply_expected is 0, it means the packet is lesser than
     # the gateway mtu and should be delivered to the provider bridge via the
     # localnet port.
@@ -13599,6 +13609,7 @@  test_ip_packet_larger() {
         expected=${expected}000000000000000000000000000000000000
         expected=${expected}000000000000000000000000000000000000
         echo $expected > br_phys_n1.expected
+        echo $gw_ip_garp >> br_phys_n1.expected
     else
         # MTU would be 100 - 18 = 82 (hex 0052)
         mtu=0052
@@ -13622,12 +13633,18 @@  test_ip_packet_larger() {
 
     if test $icmp_pmtu_reply_expected = 0; then
         OVN_CHECK_PACKETS([hv1/br-phys_n1-tx.pcap], [br_phys_n1.expected])
-        $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap  > packets
+        $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/vif1-tx.pcap  > pkts
+        # hv1/vif1-tx.pcap can receive the GARP packet generated by ovn-controller
+        # for the gateway router port. So ignore this packet.
+        cat pkts | grep -v $gw_ip_garp > packets
         AT_CHECK([cat packets], [0], [])
     else
         OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [hv1-vif1.expected])
         $PYTHON "$top_srcdir/utilities/ovs-pcap.in" hv1/br-phys_n1-tx.pcap  > \
-        packets
+        pkts
+        # hv1/br-phys_n1-tx.pcap can receive the GARP packet generated by ovn-controller
+        # for the gateway router port. So ignore this packet.
+        cat pkts | grep -v $gw_ip_garp > packets
         AT_CHECK([cat packets], [0], [])
     fi
 }