diff mbox series

[ovs-dev] Fix the ovn-northd recompute loops for nb_cfg updates.

Message ID 20260514215601.187213-1-numans@ovn.org
State Accepted
Headers show
Series [ovs-dev] Fix the ovn-northd recompute loops for nb_cfg updates. | expand

Checks

Context Check Description
ovsrobot/apply-robot warning apply and check: warning
ovsrobot/github-robot-_Build_and_Test fail github build: failed
ovsrobot/github-robot-_ovn-kubernetes success github build: passed

Commit Message

Numan Siddique May 14, 2026, 9:56 p.m. UTC
From: Numan Siddique <numans@ovn.org>

The commit in the 'Fixes' tag tried to fix the ovn-nbctl sync issue by
aborting any SB transaction by calling ovsdb_idl_txn_abort(ovnsb_txn)
to avoid syncing the nb_cfg to SB DB while a SB transaction is in
progress.

However the ovsdb_idl_txn_abort() would feed back into the
force-recompute path, creating a busy loop whenever NB.nb_cfg
keeps advancing but the engine produces no SB activity.

The issue can be easily reproduced by running the below commands
if ovn-northd takes 2 or more seconds to finish its recompute.

ovn-nbctl set nb_global . nb_cfg=2 -- ls-add sw2 &&
ovn-nbctl set nb_global . nb_cfg=3.

In the above command ideally ovn-northd lflow engine
should recompute only once, but it would recompute
twice with the below log messages seen in northd.

---
2026-05-14T20:40:50.374Z|00176|coverage|INFO|124 events never hit
2026-05-14T20:40:50.418Z|00177|ovn_northd|INFO|OVNSB commit failed, force recompute next time.
2026-05-14T20:40:50.419Z|00178|ovn_northd|INFO|OVNSB commit failed, force recompute next time.
...
2026-05-14T20:40:50.419Z|00193|ovn_northd|INFO|OVNSB commit failed, force recompute next time.
2026-05-14T20:40:50.419Z|00194|ovn_northd|INFO|OVNSB commit failed, force recompute next time.
2026-05-14T20:40:51.168Z|00195|inc_proc_eng|INFO|node: northd, recompute (forced) took 636ms
2026-05-14T20:40:53.789Z|00196|inc_proc_eng|INFO|node: routes, recompute (forced) took 2493ms
2026-05-14T20:40:56.286Z|00197|inc_proc_eng|INFO|node: lflow, recompute (forced) took 2257ms
2026-05-14T20:40:56.299Z|00198|timeval|WARN|Unreasonably long 5880ms poll interval (5852ms user, 0ms system)
----

This patch fixes this issue by avoiding the force recompute if
the sb txn was aborted intentionally.

Fixes: 56819f04166b ("northd: Prevent ovn-nbctl --wait=sb from returning
early.")

CC: Ales Musil <amusil@redhat.com>
Assisted-by: Claude Opus 4.7, Claude Code
Signed-off-by: Numan Siddique <numans@ovn.org>
---
 northd/ovn-northd.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Comments

Ales Musil May 15, 2026, 8:45 a.m. UTC | #1
On Thu, May 14, 2026 at 11:56 PM <numans@ovn.org> wrote:

> From: Numan Siddique <numans@ovn.org>
>

Hi Numan

thank you for the fix. There is one issue with commit message see down
below.


>
> The commit in the 'Fixes' tag tried to fix the ovn-nbctl sync issue by
> aborting any SB transaction by calling ovsdb_idl_txn_abort(ovnsb_txn)
> to avoid syncing the nb_cfg to SB DB while a SB transaction is in
> progress.
>
> However the ovsdb_idl_txn_abort() would feed back into the
> force-recompute path, creating a busy loop whenever NB.nb_cfg
> keeps advancing but the engine produces no SB activity.
>
> The issue can be easily reproduced by running the below commands
> if ovn-northd takes 2 or more seconds to finish its recompute.
>
> ovn-nbctl set nb_global . nb_cfg=2 -- ls-add sw2 &&
> ovn-nbctl set nb_global . nb_cfg=3.
>
> In the above command ideally ovn-northd lflow engine
> should recompute only once, but it would recompute
> twice with the below log messages seen in northd.
>
> ---


Those three dashes are why the 0-day bot complains,
it should be replaced with something else to avoid cutting
out part of the commit message.

2026-05-14T20:40:50.374Z|00176|coverage|INFO|124 events never hit
> 2026-05-14T20:40:50.418Z|00177|ovn_northd|INFO|OVNSB commit failed, force
> recompute next time.
> 2026-05-14T20:40:50.419Z|00178|ovn_northd|INFO|OVNSB commit failed, force
> recompute next time.
> ...
> 2026-05-14T20:40:50.419Z|00193|ovn_northd|INFO|OVNSB commit failed, force
> recompute next time.
> 2026-05-14T20:40:50.419Z|00194|ovn_northd|INFO|OVNSB commit failed, force
> recompute next time.
> 2026-05-14T20:40:51.168Z|00195|inc_proc_eng|INFO|node: northd, recompute
> (forced) took 636ms
> 2026-05-14T20:40:53.789Z|00196|inc_proc_eng|INFO|node: routes, recompute
> (forced) took 2493ms
> 2026-05-14T20:40:56.286Z|00197|inc_proc_eng|INFO|node: lflow, recompute
> (forced) took 2257ms
> 2026-05-14T20:40:56.299Z|00198|timeval|WARN|Unreasonably long 5880ms poll
> interval (5852ms user, 0ms system)
> ----
>

Same here.


>
> This patch fixes this issue by avoiding the force recompute if
> the sb txn was aborted intentionally.
>
> Fixes: 56819f04166b ("northd: Prevent ovn-nbctl --wait=sb from returning
> early.")
>
> CC: Ales Musil <amusil@redhat.com>
> Assisted-by: Claude Opus 4.7, Claude Code
> Signed-off-by: Numan Siddique <numans@ovn.org>
> ---
>  northd/ovn-northd.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 7e0f888c1a..e2d1066ada 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -1113,10 +1113,18 @@ main(int argc, char *argv[])
>
>                  /* Make sure we don't bump the next_cfg when we shouldn't.
>                   * This should prevent ovn-nbctl sync calls to return
> before
> -                 * the SB updates are actually done. */
> +                 * the SB updates are actually done.
> +                 *
> +                 * Track that the abort was intentional so we can
> distinguish
> +                 * it from a real commit failure below; otherwise the
> abort
> +                 * would feed back into the force-recompute path,
> creating a
> +                 * busy loop whenever NB.nb_cfg keeps advancing but the
> +                 * engine produces no SB activity. */
> +                bool ovnsb_txn_aborted_intentionally = false;
>                  if (!activity && ovnsb_txn &&
>                      ovnsb_idl_loop.cur_cfg != ovnsb_idl_loop.next_cfg) {
>                      ovsdb_idl_txn_abort(ovnsb_txn);
> +                    ovnsb_txn_aborted_intentionally = true;
>                  }
>
>                  /* If there are any errors, we force a full recompute in
> order
> @@ -1127,7 +1135,8 @@ main(int argc, char *argv[])
>                      inc_proc_northd_force_recompute_immediate();
>                  }
>
> -                if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) {
> +                if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop) &&
> +                    !ovnsb_txn_aborted_intentionally) {
>                      VLOG_INFO("OVNSB commit failed, "
>                                "force recompute next time.");
>                      inc_proc_northd_force_recompute_immediate();
> --
> 2.53.0
>
>
With that addressed:
Acked-by: Ales Musil <amusil@redhat.com>
Numan Siddique May 15, 2026, 7:27 p.m. UTC | #2
On Fri, May 15, 2026 at 4:46 AM Ales Musil <amusil@redhat.com> wrote:
>
>
>
> On Thu, May 14, 2026 at 11:56 PM <numans@ovn.org> wrote:
>>
>> From: Numan Siddique <numans@ovn.org>
>
>
> Hi Numan
>
> thank you for the fix. There is one issue with commit message see down below.
>
>>
>>
>> The commit in the 'Fixes' tag tried to fix the ovn-nbctl sync issue by
>> aborting any SB transaction by calling ovsdb_idl_txn_abort(ovnsb_txn)
>> to avoid syncing the nb_cfg to SB DB while a SB transaction is in
>> progress.
>>
>> However the ovsdb_idl_txn_abort() would feed back into the
>> force-recompute path, creating a busy loop whenever NB.nb_cfg
>> keeps advancing but the engine produces no SB activity.
>>
>> The issue can be easily reproduced by running the below commands
>> if ovn-northd takes 2 or more seconds to finish its recompute.
>>
>> ovn-nbctl set nb_global . nb_cfg=2 -- ls-add sw2 &&
>> ovn-nbctl set nb_global . nb_cfg=3.
>>
>> In the above command ideally ovn-northd lflow engine
>> should recompute only once, but it would recompute
>> twice with the below log messages seen in northd.
>>
>> ---
>
>
> Those three dashes are why the 0-day bot complains,
> it should be replaced with something else to avoid cutting
> out part of the commit message.
>
>> 2026-05-14T20:40:50.374Z|00176|coverage|INFO|124 events never hit
>> 2026-05-14T20:40:50.418Z|00177|ovn_northd|INFO|OVNSB commit failed, force recompute next time.
>> 2026-05-14T20:40:50.419Z|00178|ovn_northd|INFO|OVNSB commit failed, force recompute next time.
>> ...
>> 2026-05-14T20:40:50.419Z|00193|ovn_northd|INFO|OVNSB commit failed, force recompute next time.
>> 2026-05-14T20:40:50.419Z|00194|ovn_northd|INFO|OVNSB commit failed, force recompute next time.
>> 2026-05-14T20:40:51.168Z|00195|inc_proc_eng|INFO|node: northd, recompute (forced) took 636ms
>> 2026-05-14T20:40:53.789Z|00196|inc_proc_eng|INFO|node: routes, recompute (forced) took 2493ms
>> 2026-05-14T20:40:56.286Z|00197|inc_proc_eng|INFO|node: lflow, recompute (forced) took 2257ms
>> 2026-05-14T20:40:56.299Z|00198|timeval|WARN|Unreasonably long 5880ms poll interval (5852ms user, 0ms system)
>> ----
>
>
> Same here.
>
>>
>>
>> This patch fixes this issue by avoiding the force recompute if
>> the sb txn was aborted intentionally.
>>
>> Fixes: 56819f04166b ("northd: Prevent ovn-nbctl --wait=sb from returning
>> early.")
>>
>> CC: Ales Musil <amusil@redhat.com>
>> Assisted-by: Claude Opus 4.7, Claude Code
>> Signed-off-by: Numan Siddique <numans@ovn.org>
>> ---
>>  northd/ovn-northd.c | 13 +++++++++++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
>> index 7e0f888c1a..e2d1066ada 100644
>> --- a/northd/ovn-northd.c
>> +++ b/northd/ovn-northd.c
>> @@ -1113,10 +1113,18 @@ main(int argc, char *argv[])
>>
>>                  /* Make sure we don't bump the next_cfg when we shouldn't.
>>                   * This should prevent ovn-nbctl sync calls to return before
>> -                 * the SB updates are actually done. */
>> +                 * the SB updates are actually done.
>> +                 *
>> +                 * Track that the abort was intentional so we can distinguish
>> +                 * it from a real commit failure below; otherwise the abort
>> +                 * would feed back into the force-recompute path, creating a
>> +                 * busy loop whenever NB.nb_cfg keeps advancing but the
>> +                 * engine produces no SB activity. */
>> +                bool ovnsb_txn_aborted_intentionally = false;
>>                  if (!activity && ovnsb_txn &&
>>                      ovnsb_idl_loop.cur_cfg != ovnsb_idl_loop.next_cfg) {
>>                      ovsdb_idl_txn_abort(ovnsb_txn);
>> +                    ovnsb_txn_aborted_intentionally = true;
>>                  }
>>
>>                  /* If there are any errors, we force a full recompute in order
>> @@ -1127,7 +1135,8 @@ main(int argc, char *argv[])
>>                      inc_proc_northd_force_recompute_immediate();
>>                  }
>>
>> -                if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) {
>> +                if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop) &&
>> +                    !ovnsb_txn_aborted_intentionally) {
>>                      VLOG_INFO("OVNSB commit failed, "
>>                                "force recompute next time.");
>>                      inc_proc_northd_force_recompute_immediate();
>> --
>> 2.53.0
>>
>
> With that addressed:
> Acked-by: Ales Musil <amusil@redhat.com>

Thanks Ales.  I fixed the commit message error and applied to main and
I've backported till 25.03.
I'll backport until 24.03 in some time.

Thanks
Numan
diff mbox series

Patch

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 7e0f888c1a..e2d1066ada 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -1113,10 +1113,18 @@  main(int argc, char *argv[])
 
                 /* Make sure we don't bump the next_cfg when we shouldn't.
                  * This should prevent ovn-nbctl sync calls to return before
-                 * the SB updates are actually done. */
+                 * the SB updates are actually done.
+                 *
+                 * Track that the abort was intentional so we can distinguish
+                 * it from a real commit failure below; otherwise the abort
+                 * would feed back into the force-recompute path, creating a
+                 * busy loop whenever NB.nb_cfg keeps advancing but the
+                 * engine produces no SB activity. */
+                bool ovnsb_txn_aborted_intentionally = false;
                 if (!activity && ovnsb_txn &&
                     ovnsb_idl_loop.cur_cfg != ovnsb_idl_loop.next_cfg) {
                     ovsdb_idl_txn_abort(ovnsb_txn);
+                    ovnsb_txn_aborted_intentionally = true;
                 }
 
                 /* If there are any errors, we force a full recompute in order
@@ -1127,7 +1135,8 @@  main(int argc, char *argv[])
                     inc_proc_northd_force_recompute_immediate();
                 }
 
-                if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) {
+                if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop) &&
+                    !ovnsb_txn_aborted_intentionally) {
                     VLOG_INFO("OVNSB commit failed, "
                               "force recompute next time.");
                     inc_proc_northd_force_recompute_immediate();