[ovs-dev] ovn-northd: Fix the ovn-northd continous looping
diff mbox series

Message ID 20190712175000.11085-1-nusiddiq@redhat.com
State Superseded
Headers show
Series
  • [ovs-dev] ovn-northd: Fix the ovn-northd continous looping
Related show

Commit Message

Numan Siddique July 12, 2019, 5:50 p.m. UTC
From: Numan Siddique <nusiddiq@redhat.com>

ovn-northd wakes up continuously from poll_block(). This issue can be reproduced
in the sandbox with the below commands

ovn-nbctl lr-add lr0
ovn-nbctl ls-add public
ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
ovn-nbctl lsp-add public public-lr0
ovn-nbctl lsp-set-type public-lr0 router
ovn-nbctl lsp-set-addresses public-lr0 router
ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20

This issue is seen after the commit [1], which makes use of the function -
sbrec_port_binding_update_nat_addresses_addvalue() to add a value to
Port_Binding.nat_addresses column.

Looks like the IDL client code is sending the transactions to the ovsdb-server repeatedly
to update the Port_Binding.nat_addresses even though the Southbound DB has updated
the column when this function is used. The actual bug seems to be in the IDL client code
and that needs to be fixed. This patch as a quick fix, fixes ovn-northd's continuous loop
by not using this function, instead making use of sbrec_port_binding_set_nat_addresses().

The below messages are seen continuously when the ovn-nortdh debug logs are enabled.

****

2019-07-12T17:26:13.837Z|74512|jsonrpc|DBG|unix:sb1.ovsdb: received reply,
result=[{},{"count":1},{"count":1}], id=18628
2019-07-12T17:26:13.837Z|74513|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74514|jsonrpc|DBG|unix:sb1.ovsdb: send request,
method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":
{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses",
"insert",["set",["00:00:20:20:12:13 172.168.0.100 is_chassis_resident(\"cr-lr0-public\")"]]]],
"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18629

2019-07-12T17:26:13.837Z|74516|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18629
2019-07-12T17:26:13.837Z|74517|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74518|jsonrpc|DBG|unix:sb1.ovsdb: send request,
method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],
"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},
{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13 172.168.0.100
is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid",
"56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18630
2019-07-12T17:26:13.837Z|74520|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18630
******

The OpenStack CI tests for networking-ovn is frequently failing few tests after this
commit. The failure seems to be related to timing issues as ovn-northd is hogging
the CPU continuously. We are also seeing travis CI test failures after this commit.

[1] - ed198fb3b92e

Fixes: ed198fb3b92e("ovn: Send GARP for the router ports with reside-on-redirect-chassis options set")
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
---
 ovn/northd/ovn-northd.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Comments

Gregory Rose July 12, 2019, 8:46 p.m. UTC | #1
On 7/12/2019 10:50 AM, nusiddiq@redhat.com wrote:
> From: Numan Siddique <nusiddiq@redhat.com>

In the subject line s/continous/continuous/

>
> ovn-northd wakes up continuously from poll_block(). This issue can be reproduced
> in the sandbox with the below commands
>
> ovn-nbctl lr-add lr0
> ovn-nbctl ls-add public
> ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
> ovn-nbctl lsp-add public public-lr0
> ovn-nbctl lsp-set-type public-lr0 router
> ovn-nbctl lsp-set-addresses public-lr0 router
> ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
> ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20
>
> This issue is seen after the commit [1], which makes use of the function -
> sbrec_port_binding_update_nat_addresses_addvalue() to add a value to
> Port_Binding.nat_addresses column.
>
> Looks like the IDL client code is sending the transactions to the ovsdb-server repeatedly
> to update the Port_Binding.nat_addresses even though the Southbound DB has updated
> the column when this function is used. The actual bug seems to be in the IDL client code
> and that needs to be fixed. This patch as a quick fix, fixes ovn-northd's continuous loop
> by not using this function, instead making use of sbrec_port_binding_set_nat_addresses().
>
> The below messages are seen continuously when the ovn-nortdh debug logs are enabled.
>
> ****
>
> 2019-07-12T17:26:13.837Z|74512|jsonrpc|DBG|unix:sb1.ovsdb: received reply,
> result=[{},{"count":1},{"count":1}], id=18628
> 2019-07-12T17:26:13.837Z|74513|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
> 2019-07-12T17:26:13.837Z|74514|jsonrpc|DBG|unix:sb1.ovsdb: send request,
> method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
> {"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":
> {"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses",
> "insert",["set",["00:00:20:20:12:13 172.168.0.100 is_chassis_resident(\"cr-lr0-public\")"]]]],
> "where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18629
>
> 2019-07-12T17:26:13.837Z|74516|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18629
> 2019-07-12T17:26:13.837Z|74517|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
> 2019-07-12T17:26:13.837Z|74518|jsonrpc|DBG|unix:sb1.ovsdb: send request,
> method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
> {"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],
> "row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},
> {"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13 172.168.0.100
> is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid",
> "56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18630
> 2019-07-12T17:26:13.837Z|74520|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18630
> ******
>
> The OpenStack CI tests for networking-ovn is frequently failing few tests after this
> commit. The failure seems to be related to timing issues as ovn-northd is hogging
> the CPU continuously. We are also seeing travis CI test failures after this commit.
>
> [1] - ed198fb3b92e
>
> Fixes: ed198fb3b92e("ovn: Send GARP for the router ports with reside-on-redirect-chassis options set")
> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> ---
>   ovn/northd/ovn-northd.c | 19 ++++++++++---------
>   1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> index ce382ac89..127227712 100644
> --- a/ovn/northd/ovn-northd.c
> +++ b/ovn/northd/ovn-northd.c
> @@ -2530,13 +2530,6 @@ ovn_port_update_sbrec(struct northd_context *ctx,
>                   }
>               }
>   
> -            sbrec_port_binding_set_nat_addresses(op->sb,
> -                                                 (const char **) nats, n_nats);
> -            for (size_t i = 0; i < n_nats; i++) {
> -                free(nats[i]);
> -            }
> -            free(nats);
> -
>               /* Add the router mac and IPv4 addresses to
>                * Port_Binding.nat_addresses so that GARP is sent for these
>                * IPs by the ovn-controller on which the distributed gateway
> @@ -2578,10 +2571,18 @@ ovn_port_update_sbrec(struct northd_context *ctx,
>                                     op->peer->od->l3redirect_port->json_key);
>                   }
>   
> -                sbrec_port_binding_update_nat_addresses_addvalue(
> -                    op->sb, ds_cstr(&garp_info));
> +                n_nats++;
> +                nats = xrealloc(nats, (n_nats * sizeof *nats));
> +                nats[n_nats - 1] = ds_steal_cstr(&garp_info);
>                   ds_destroy(&garp_info);
>               }
> +
> +            sbrec_port_binding_set_nat_addresses(op->sb,
> +                                                 (const char **) nats, n_nats);
> +            for (size_t i = 0; i < n_nats; i++) {
> +                free(nats[i]);
> +            }
> +            free(nats);
>           }
>   
>           sbrec_port_binding_set_parent_port(op->sb, op->nbsp->parent_name);

Works for me.  With the subject line fixed up I think it's fine.
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Numan Siddique July 16, 2019, 6:01 p.m. UTC | #2
On Sat, Jul 13, 2019 at 2:16 AM Gregory Rose <gvrose8192@gmail.com> wrote:

> On 7/12/2019 10:50 AM, nusiddiq@redhat.com wrote:
> > From: Numan Siddique <nusiddiq@redhat.com>
>
> In the subject line s/continous/continuous/
>
> >
> > ovn-northd wakes up continuously from poll_block(). This issue can be
> reproduced
> > in the sandbox with the below commands
> >
> > ovn-nbctl lr-add lr0
> > ovn-nbctl ls-add public
> > ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
> > ovn-nbctl lsp-add public public-lr0
> > ovn-nbctl lsp-set-type public-lr0 router
> > ovn-nbctl lsp-set-addresses public-lr0 router
> > ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
> > ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20
> >
> > This issue is seen after the commit [1], which makes use of the function
> -
> > sbrec_port_binding_update_nat_addresses_addvalue() to add a value to
> > Port_Binding.nat_addresses column.
> >
> > Looks like the IDL client code is sending the transactions to the
> ovsdb-server repeatedly
> > to update the Port_Binding.nat_addresses even though the Southbound DB
> has updated
> > the column when this function is used. The actual bug seems to be in the
> IDL client code
> > and that needs to be fixed. This patch as a quick fix, fixes
> ovn-northd's continuous loop
> > by not using this function, instead making use of
> sbrec_port_binding_set_nat_addresses().
> >
> > The below messages are seen continuously when the ovn-nortdh debug logs
> are enabled.
> >
> > ****
> >
> > 2019-07-12T17:26:13.837Z|74512|jsonrpc|DBG|unix:sb1.ovsdb: received
> reply,
> > result=[{},{"count":1},{"count":1}], id=18628
> > 2019-07-12T17:26:13.837Z|74513|poll_loop|DBG|wakeup due to 0-ms timeout
> at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
> > 2019-07-12T17:26:13.837Z|74514|jsonrpc|DBG|unix:sb1.ovsdb: send request,
> > method="transact",
> params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
> >
> {"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":
> >
> {"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses",
> > "insert",["set",["00:00:20:20:12:13 172.168.0.100
> is_chassis_resident(\"cr-lr0-public\")"]]]],
> >
> "where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}],
> id=18629
> >
> > 2019-07-12T17:26:13.837Z|74516|jsonrpc|DBG|unix:sb1.ovsdb: received
> reply, result=[{},{"count":1},{"count":1}], id=18629
> > 2019-07-12T17:26:13.837Z|74517|poll_loop|DBG|wakeup due to 0-ms timeout
> at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
> > 2019-07-12T17:26:13.837Z|74518|jsonrpc|DBG|unix:sb1.ovsdb: send request,
> > method="transact",
> params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
> >
> {"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],
> > "row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},
> > {"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13
> 172.168.0.100
> >
> is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid",
> >
> "56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}],
> id=18630
> > 2019-07-12T17:26:13.837Z|74520|jsonrpc|DBG|unix:sb1.ovsdb: received
> reply, result=[{},{"count":1},{"count":1}], id=18630
> > ******
> >
> > The OpenStack CI tests for networking-ovn is frequently failing few
> tests after this
> > commit. The failure seems to be related to timing issues as ovn-northd
> is hogging
> > the CPU continuously. We are also seeing travis CI test failures after
> this commit.
> >
> > [1] - ed198fb3b92e
> >
> > Fixes: ed198fb3b92e("ovn: Send GARP for the router ports with
> reside-on-redirect-chassis options set")
> > Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> > ---
> >   ovn/northd/ovn-northd.c | 19 ++++++++++---------
> >   1 file changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
> > index ce382ac89..127227712 100644
> > --- a/ovn/northd/ovn-northd.c
> > +++ b/ovn/northd/ovn-northd.c
> > @@ -2530,13 +2530,6 @@ ovn_port_update_sbrec(struct northd_context *ctx,
> >                   }
> >               }
> >
> > -            sbrec_port_binding_set_nat_addresses(op->sb,
> > -                                                 (const char **) nats,
> n_nats);
> > -            for (size_t i = 0; i < n_nats; i++) {
> > -                free(nats[i]);
> > -            }
> > -            free(nats);
> > -
> >               /* Add the router mac and IPv4 addresses to
> >                * Port_Binding.nat_addresses so that GARP is sent for
> these
> >                * IPs by the ovn-controller on which the distributed
> gateway
> > @@ -2578,10 +2571,18 @@ ovn_port_update_sbrec(struct northd_context *ctx,
> >
>  op->peer->od->l3redirect_port->json_key);
> >                   }
> >
> > -                sbrec_port_binding_update_nat_addresses_addvalue(
> > -                    op->sb, ds_cstr(&garp_info));
> > +                n_nats++;
> > +                nats = xrealloc(nats, (n_nats * sizeof *nats));
> > +                nats[n_nats - 1] = ds_steal_cstr(&garp_info);
> >                   ds_destroy(&garp_info);
> >               }
> > +
> > +            sbrec_port_binding_set_nat_addresses(op->sb,
> > +                                                 (const char **) nats,
> n_nats);
> > +            for (size_t i = 0; i < n_nats; i++) {
> > +                free(nats[i]);
> > +            }
> > +            free(nats);
> >           }
> >
> >           sbrec_port_binding_set_parent_port(op->sb,
> op->nbsp->parent_name);
>
> Works for me.  With the subject line fixed up I think it's fine.
> Tested-by: Greg Rose <gvrose8192@gmail.com>
> Reviewed-by: Greg Rose <gvrose8192@gmail.com>
>

Thanks for the review. I submitted v2 fixing the typo.

Numan


Thanks.

Patch
diff mbox series

diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
index ce382ac89..127227712 100644
--- a/ovn/northd/ovn-northd.c
+++ b/ovn/northd/ovn-northd.c
@@ -2530,13 +2530,6 @@  ovn_port_update_sbrec(struct northd_context *ctx,
                 }
             }
 
-            sbrec_port_binding_set_nat_addresses(op->sb,
-                                                 (const char **) nats, n_nats);
-            for (size_t i = 0; i < n_nats; i++) {
-                free(nats[i]);
-            }
-            free(nats);
-
             /* Add the router mac and IPv4 addresses to
              * Port_Binding.nat_addresses so that GARP is sent for these
              * IPs by the ovn-controller on which the distributed gateway
@@ -2578,10 +2571,18 @@  ovn_port_update_sbrec(struct northd_context *ctx,
                                   op->peer->od->l3redirect_port->json_key);
                 }
 
-                sbrec_port_binding_update_nat_addresses_addvalue(
-                    op->sb, ds_cstr(&garp_info));
+                n_nats++;
+                nats = xrealloc(nats, (n_nats * sizeof *nats));
+                nats[n_nats - 1] = ds_steal_cstr(&garp_info);
                 ds_destroy(&garp_info);
             }
+
+            sbrec_port_binding_set_nat_addresses(op->sb,
+                                                 (const char **) nats, n_nats);
+            for (size_t i = 0; i < n_nats; i++) {
+                free(nats[i]);
+            }
+            free(nats);
         }
 
         sbrec_port_binding_set_parent_port(op->sb, op->nbsp->parent_name);