| Message ID | 20260514215601.187213-1-numans@ovn.org |
|---|---|
| State | Accepted |
| Headers | show |
| Series | [ovs-dev] Fix the ovn-northd recompute loops for nb_cfg updates. | expand |
| Context | Check | Description |
|---|---|---|
| ovsrobot/apply-robot | warning | apply and check: warning |
| ovsrobot/github-robot-_Build_and_Test | fail | github build: failed |
| ovsrobot/github-robot-_ovn-kubernetes | success | github build: passed |
On Thu, May 14, 2026 at 11:56 PM <numans@ovn.org> wrote: > From: Numan Siddique <numans@ovn.org> > Hi Numan thank you for the fix. There is one issue with commit message see down below. > > The commit in the 'Fixes' tag tried to fix the ovn-nbctl sync issue by > aborting any SB transaction by calling ovsdb_idl_txn_abort(ovnsb_txn) > to avoid syncing the nb_cfg to SB DB while a SB transaction is in > progress. > > However the ovsdb_idl_txn_abort() would feed back into the > force-recompute path, creating a busy loop whenever NB.nb_cfg > keeps advancing but the engine produces no SB activity. > > The issue can be easily reproduced by running the below commands > if ovn-northd takes 2 or more seconds to finish its recompute. > > ovn-nbctl set nb_global . nb_cfg=2 -- ls-add sw2 && > ovn-nbctl set nb_global . nb_cfg=3. > > In the above command ideally ovn-northd lflow engine > should recompute only once, but it would recompute > twice with the below log messages seen in northd. > > --- Those three dashes are why the 0-day bot complains, it should be replaced with something else to avoid cutting out part of the commit message. 2026-05-14T20:40:50.374Z|00176|coverage|INFO|124 events never hit > 2026-05-14T20:40:50.418Z|00177|ovn_northd|INFO|OVNSB commit failed, force > recompute next time. > 2026-05-14T20:40:50.419Z|00178|ovn_northd|INFO|OVNSB commit failed, force > recompute next time. > ... > 2026-05-14T20:40:50.419Z|00193|ovn_northd|INFO|OVNSB commit failed, force > recompute next time. > 2026-05-14T20:40:50.419Z|00194|ovn_northd|INFO|OVNSB commit failed, force > recompute next time. > 2026-05-14T20:40:51.168Z|00195|inc_proc_eng|INFO|node: northd, recompute > (forced) took 636ms > 2026-05-14T20:40:53.789Z|00196|inc_proc_eng|INFO|node: routes, recompute > (forced) took 2493ms > 2026-05-14T20:40:56.286Z|00197|inc_proc_eng|INFO|node: lflow, recompute > (forced) took 2257ms > 2026-05-14T20:40:56.299Z|00198|timeval|WARN|Unreasonably long 5880ms poll > interval (5852ms user, 0ms system) > ---- > Same here. > > This patch fixes this issue by avoiding the force recompute if > the sb txn was aborted intentionally. > > Fixes: 56819f04166b ("northd: Prevent ovn-nbctl --wait=sb from returning > early.") > > CC: Ales Musil <amusil@redhat.com> > Assisted-by: Claude Opus 4.7, Claude Code > Signed-off-by: Numan Siddique <numans@ovn.org> > --- > northd/ovn-northd.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c > index 7e0f888c1a..e2d1066ada 100644 > --- a/northd/ovn-northd.c > +++ b/northd/ovn-northd.c > @@ -1113,10 +1113,18 @@ main(int argc, char *argv[]) > > /* Make sure we don't bump the next_cfg when we shouldn't. > * This should prevent ovn-nbctl sync calls to return > before > - * the SB updates are actually done. */ > + * the SB updates are actually done. > + * > + * Track that the abort was intentional so we can > distinguish > + * it from a real commit failure below; otherwise the > abort > + * would feed back into the force-recompute path, > creating a > + * busy loop whenever NB.nb_cfg keeps advancing but the > + * engine produces no SB activity. */ > + bool ovnsb_txn_aborted_intentionally = false; > if (!activity && ovnsb_txn && > ovnsb_idl_loop.cur_cfg != ovnsb_idl_loop.next_cfg) { > ovsdb_idl_txn_abort(ovnsb_txn); > + ovnsb_txn_aborted_intentionally = true; > } > > /* If there are any errors, we force a full recompute in > order > @@ -1127,7 +1135,8 @@ main(int argc, char *argv[]) > inc_proc_northd_force_recompute_immediate(); > } > > - if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) { > + if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop) && > + !ovnsb_txn_aborted_intentionally) { > VLOG_INFO("OVNSB commit failed, " > "force recompute next time."); > inc_proc_northd_force_recompute_immediate(); > -- > 2.53.0 > > With that addressed: Acked-by: Ales Musil <amusil@redhat.com>
On Fri, May 15, 2026 at 4:46 AM Ales Musil <amusil@redhat.com> wrote: > > > > On Thu, May 14, 2026 at 11:56 PM <numans@ovn.org> wrote: >> >> From: Numan Siddique <numans@ovn.org> > > > Hi Numan > > thank you for the fix. There is one issue with commit message see down below. > >> >> >> The commit in the 'Fixes' tag tried to fix the ovn-nbctl sync issue by >> aborting any SB transaction by calling ovsdb_idl_txn_abort(ovnsb_txn) >> to avoid syncing the nb_cfg to SB DB while a SB transaction is in >> progress. >> >> However the ovsdb_idl_txn_abort() would feed back into the >> force-recompute path, creating a busy loop whenever NB.nb_cfg >> keeps advancing but the engine produces no SB activity. >> >> The issue can be easily reproduced by running the below commands >> if ovn-northd takes 2 or more seconds to finish its recompute. >> >> ovn-nbctl set nb_global . nb_cfg=2 -- ls-add sw2 && >> ovn-nbctl set nb_global . nb_cfg=3. >> >> In the above command ideally ovn-northd lflow engine >> should recompute only once, but it would recompute >> twice with the below log messages seen in northd. >> >> --- > > > Those three dashes are why the 0-day bot complains, > it should be replaced with something else to avoid cutting > out part of the commit message. > >> 2026-05-14T20:40:50.374Z|00176|coverage|INFO|124 events never hit >> 2026-05-14T20:40:50.418Z|00177|ovn_northd|INFO|OVNSB commit failed, force recompute next time. >> 2026-05-14T20:40:50.419Z|00178|ovn_northd|INFO|OVNSB commit failed, force recompute next time. >> ... >> 2026-05-14T20:40:50.419Z|00193|ovn_northd|INFO|OVNSB commit failed, force recompute next time. >> 2026-05-14T20:40:50.419Z|00194|ovn_northd|INFO|OVNSB commit failed, force recompute next time. >> 2026-05-14T20:40:51.168Z|00195|inc_proc_eng|INFO|node: northd, recompute (forced) took 636ms >> 2026-05-14T20:40:53.789Z|00196|inc_proc_eng|INFO|node: routes, recompute (forced) took 2493ms >> 2026-05-14T20:40:56.286Z|00197|inc_proc_eng|INFO|node: lflow, recompute (forced) took 2257ms >> 2026-05-14T20:40:56.299Z|00198|timeval|WARN|Unreasonably long 5880ms poll interval (5852ms user, 0ms system) >> ---- > > > Same here. > >> >> >> This patch fixes this issue by avoiding the force recompute if >> the sb txn was aborted intentionally. >> >> Fixes: 56819f04166b ("northd: Prevent ovn-nbctl --wait=sb from returning >> early.") >> >> CC: Ales Musil <amusil@redhat.com> >> Assisted-by: Claude Opus 4.7, Claude Code >> Signed-off-by: Numan Siddique <numans@ovn.org> >> --- >> northd/ovn-northd.c | 13 +++++++++++-- >> 1 file changed, 11 insertions(+), 2 deletions(-) >> >> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >> index 7e0f888c1a..e2d1066ada 100644 >> --- a/northd/ovn-northd.c >> +++ b/northd/ovn-northd.c >> @@ -1113,10 +1113,18 @@ main(int argc, char *argv[]) >> >> /* Make sure we don't bump the next_cfg when we shouldn't. >> * This should prevent ovn-nbctl sync calls to return before >> - * the SB updates are actually done. */ >> + * the SB updates are actually done. >> + * >> + * Track that the abort was intentional so we can distinguish >> + * it from a real commit failure below; otherwise the abort >> + * would feed back into the force-recompute path, creating a >> + * busy loop whenever NB.nb_cfg keeps advancing but the >> + * engine produces no SB activity. */ >> + bool ovnsb_txn_aborted_intentionally = false; >> if (!activity && ovnsb_txn && >> ovnsb_idl_loop.cur_cfg != ovnsb_idl_loop.next_cfg) { >> ovsdb_idl_txn_abort(ovnsb_txn); >> + ovnsb_txn_aborted_intentionally = true; >> } >> >> /* If there are any errors, we force a full recompute in order >> @@ -1127,7 +1135,8 @@ main(int argc, char *argv[]) >> inc_proc_northd_force_recompute_immediate(); >> } >> >> - if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) { >> + if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop) && >> + !ovnsb_txn_aborted_intentionally) { >> VLOG_INFO("OVNSB commit failed, " >> "force recompute next time."); >> inc_proc_northd_force_recompute_immediate(); >> -- >> 2.53.0 >> > > With that addressed: > Acked-by: Ales Musil <amusil@redhat.com> Thanks Ales. I fixed the commit message error and applied to main and I've backported till 25.03. I'll backport until 24.03 in some time. Thanks Numan
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c index 7e0f888c1a..e2d1066ada 100644 --- a/northd/ovn-northd.c +++ b/northd/ovn-northd.c @@ -1113,10 +1113,18 @@ main(int argc, char *argv[]) /* Make sure we don't bump the next_cfg when we shouldn't. * This should prevent ovn-nbctl sync calls to return before - * the SB updates are actually done. */ + * the SB updates are actually done. + * + * Track that the abort was intentional so we can distinguish + * it from a real commit failure below; otherwise the abort + * would feed back into the force-recompute path, creating a + * busy loop whenever NB.nb_cfg keeps advancing but the + * engine produces no SB activity. */ + bool ovnsb_txn_aborted_intentionally = false; if (!activity && ovnsb_txn && ovnsb_idl_loop.cur_cfg != ovnsb_idl_loop.next_cfg) { ovsdb_idl_txn_abort(ovnsb_txn); + ovnsb_txn_aborted_intentionally = true; } /* If there are any errors, we force a full recompute in order @@ -1127,7 +1135,8 @@ main(int argc, char *argv[]) inc_proc_northd_force_recompute_immediate(); } - if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) { + if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop) && + !ovnsb_txn_aborted_intentionally) { VLOG_INFO("OVNSB commit failed, " "force recompute next time."); inc_proc_northd_force_recompute_immediate();