[ovs-dev] ovn-northd: Don't log transaction failures on standby instances.

Message ID 20220127150021.26044-1-dceara@redhat.com
State Changes Requested
Headers show
Series [ovs-dev] ovn-northd: Don't log transaction failures on standby instances.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/github-robot-_ovn-kubernetes success github build: passed

Commit Message

Dumitru Ceara Jan. 27, 2022, 3 p.m. UTC
We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
that are in standby mode.  That's because they need to try to take the
lock.  But if they're in standby they won't actually build a transaction
json and will return early because they don't own the SB lock.

That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
shouldn't act on it.

Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
should be called only on active instances.  The standby or paused ones
will get the incremental updates when they become active.  Otherwise we
would be forced to perform a full recompute when the instance becomes
active.

Fixes: 953832eb17f3 ("ovn-northd: Don't trigger recompute for transactions that are in progress.")
Fixes: 4597317f16d1 ("northd: Introduce incremental processing for northd")
Reported-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
 northd/ovn-northd.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

Comments

Numan Siddique Feb. 10, 2022, 11:54 p.m. UTC | #1
On Thu, Jan 27, 2022 at 10:00 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
> that are in standby mode.  That's because they need to try to take the
> lock.  But if they're in standby they won't actually build a transaction
> json and will return early because they don't own the SB lock.
>
> That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
> shouldn't act on it.
>
> Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
> should be called only on active instances.  The standby or paused ones
> will get the incremental updates when they become active.  Otherwise we
> would be forced to perform a full recompute when the instance becomes
> active.

Hi Dumitru,

I've a question on the track clear being moved out of the standby instances.
To ensure correctness, I suppose it's better to trigger a full recompute when a
standby instance becomes active. What do you think?

Also let's say the CMS does the below operations:
     - Add a logical switch S1
     - Add a logical port p1 in S1
     - Add a logical port p2 in S1
     - Delete logical port p2
     - Delete logical switch S1.

With this patch, since we are not clearing the tracking information, how does
ovn-northd process the tracked changes when it becomes active?

Thanks
Numan



>
> Fixes: 953832eb17f3 ("ovn-northd: Don't trigger recompute for transactions that are in progress.")
> Fixes: 4597317f16d1 ("northd: Introduce incremental processing for northd")
> Reported-by: Ilya Maximets <i.maximets@ovn.org>
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> ---
>  northd/ovn-northd.c | 35 ++++++++++++++++++++---------------
>  1 file changed, 20 insertions(+), 15 deletions(-)
>
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 793135ede..4e1e51931 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -1089,6 +1089,26 @@ main(int argc, char *argv[])
>                                              ovnnb_txn, ovnsb_txn,
>                                              &ovnsb_idl_loop);
>                  }
> +
> +                /* If there are any errors, we force a full recompute in order
> +                 * to ensure we handle all changes. */
> +                if (!ovsdb_idl_loop_commit_and_wait(&ovnnb_idl_loop)) {
> +                    VLOG_INFO("OVNNB commit failed, "
> +                              "force recompute next time.");
> +                    recompute = true;
> +                }
> +
> +                if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) {
> +                    VLOG_INFO("OVNSB commit failed, "
> +                              "force recompute next time.");
> +                    recompute = true;
> +                }
> +                ovsdb_idl_track_clear(ovnnb_idl_loop.idl);
> +                ovsdb_idl_track_clear(ovnsb_idl_loop.idl);
> +            } else {
> +                /* Make sure we send any pending requests, e.g., lock. */
> +                ovsdb_idl_loop_commit_and_wait(&ovnnb_idl_loop);
> +                ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop);
>              }
>          } else {
>              /* ovn-northd is paused
> @@ -1112,21 +1132,6 @@ main(int argc, char *argv[])
>              ovsdb_idl_wait(ovnsb_idl_loop.idl);
>          }
>
> -        /* If there are any errors, we force a full recompute in order to
> -           ensure we handle all changes. */
> -        if (!ovsdb_idl_loop_commit_and_wait(&ovnnb_idl_loop)) {
> -            VLOG_INFO("OVNNB commit failed, force recompute next time.");
> -            recompute = true;
> -        }
> -
> -        if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) {
> -            VLOG_INFO("OVNSB commit failed, force recompute next time.");
> -            recompute = true;
> -        }
> -
> -        ovsdb_idl_track_clear(ovnnb_idl_loop.idl);
> -        ovsdb_idl_track_clear(ovnsb_idl_loop.idl);
> -
>          unixctl_server_run(unixctl);
>          unixctl_server_wait(unixctl);
>          memory_wait();
> --
> 2.27.0
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Dumitru Ceara Feb. 11, 2022, 8:52 a.m. UTC | #2
On 2/11/22 00:54, Numan Siddique wrote:
> On Thu, Jan 27, 2022 at 10:00 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
>> that are in standby mode.  That's because they need to try to take the
>> lock.  But if they're in standby they won't actually build a transaction
>> json and will return early because they don't own the SB lock.
>>
>> That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
>> shouldn't act on it.
>>
>> Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
>> should be called only on active instances.  The standby or paused ones
>> will get the incremental updates when they become active.  Otherwise we
>> would be forced to perform a full recompute when the instance becomes
>> active.
> 
> Hi Dumitru,
> 

Hi Numan,

Thanks for the review!

> I've a question on the track clear being moved out of the standby instances.
> To ensure correctness,  I suppose it's better to trigger a full recompute when a
> standby instance becomes active. What do you think?
> 

I might be wrong but I don't think that's necessary.  It may also be
quite costly as full recomputes can take quite long.

> Also lets say CMS does the below operations
>      - Add a logical switch S1
>      - Add a  logical port p1 in S1
>      - Add a logical port p2 in S1
>      - Delete logical port p2
>      - Delete a logical switch.
> 
> With this patch since we are not clearing the tracking information,
> how does ovn-northd
> process the tracked changes when it becomes active ?

When ovn-northd becomes active, from a Northbound database perspective,
there were no changes; that is, S1 didn't exist when it was last active
and it doesn't exist now either.

So, there should be no NB change to process.  Accumulating tracked
changes without calling clear() on the standby has exactly this effect.

From a Southbound database perspective there are three cases:

a. The former active northd processed some (but not all) of the NB
changes and executed their corresponding SB transactions.  In this case,
the standby northd also receives update messages for the SB records that
were changed.  The standby northd tracks these changes.

When the standby northd becomes active it will:
- determine that NB state didn't change
- SB state changed and needs to be reconciled (today we do this with the
help of a NULL change handler for SB_* tables which will trigger a full
recompute).

b. The former active northd processed all of the NB changes and executed
their corresponding SB transactions.  In this case, the final state of
the NB and SB databases should be equivalent to their initial states.
NB/SB changes will be accumulated by the change tracking mechanism on
the standby, resulting in empty tracked changes lists.  This is fine
because the new active northd doesn't need to do anything, the DB
contents are already consistent.

c. The former active northd processed none of the NB changes yet.  This
is very similar to case "b" above, the new active northd doesn't need to
change anything in the NB/SB and it won't do that either.
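
Something like the following (just a sketch, untested, relying on the
ovsdb-idlc generated helpers for the NB schema) is what I have in mind
for walking the accumulated changes once an instance becomes active:

    /* Sketch: iterate the Logical_Switch changes tracked while standby. */
    const struct nbrec_logical_switch *ls;
    NBREC_LOGICAL_SWITCH_FOR_EACH_TRACKED (ls, nb_idl) {
        if (nbrec_logical_switch_is_deleted(ls)) {
            /* Clean up any SB state we created for this switch; this is
             * a no-op for a switch added and removed while we were
             * standby. */
        } else if (nbrec_logical_switch_is_new(ls)) {
            /* Create the corresponding SB datapath binding. */
        } else {
            /* Column-level update: handle it incrementally. */
        }
    }

(Here "nb_idl" just stands for the NB ovsdb_idl instance.)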

> 
> Thanks
> Numan
> 

Thanks,
Dumitru
Numan Siddique Feb. 11, 2022, 3:35 p.m. UTC | #3
On Fri, Feb 11, 2022 at 3:53 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 2/11/22 00:54, Numan Siddique wrote:
> > On Thu, Jan 27, 2022 at 10:00 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
> >> that are in standby mode.  That's because they need to try to take the
> >> lock.  But if they're in standby they won't actually build a transaction
> >> json and will return early because they don't own the SB lock.
> >>
> >> That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
> >> shouldn't act on it.
> >>
> >> Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
> >> should be called only on active instances.  The standby or paused ones
> >> will get the incremental updates when they become active.  Otherwise we
> >> would be forced to perform a full recompute when the instance becomes
> >> active.
> >
> > Hi Dumitru,
> >
>
> Hi Numan,
>
> Thanks for the review!
>
> > I've a question on the track clear being moved out of the standby instances.
> > To ensure correctness,  I suppose it's better to trigger a full recompute when a
> > standby instance becomes active. What do you think?
> >
>
> I might be wrong but I don't think that's necessary.  It may also be
> quite costly as full recomputes can take quite long.
>
> > Also lets say CMS does the below operations
> >      - Add a logical switch S1
> >      - Add a  logical port p1 in S1
> >      - Add a logical port p2 in S1
> >      - Delete logical port p2
> >      - Delete a logical switch.
> >
> > With this patch since we are not clearing the tracking information,
> > how does ovn-northd
> > process the tracked changes when it becomes active ?
>
> When ovn-northd becomes active, from a Northbound database perspective,
> there were no changes; that is, S1 didn't exist when it was last active
> and it doesn't exist now either.
>
> So, there should be no NB change to process.  Accumulating tracked
> changes without calling clear() on the standby has exactly this effect.
>
> From a Southbound database perspective there are two cases:
>
> a. The former active northd processed some (but not all) of the NB
> changes and executed their corresponding SB transactions.  In this case,
> the standby northd also receives update messages for the SB records that
> were changed.  The standby northd tracks these changes.
>
> When the standby northd becomes active it will:
> - determine that NB state didn't change
> - SB state changed and needs to be reconciled (today we do this with the
> help of a NULL change handler for SB_* tables which will trigger a full
> recompute).
>
> b. The former active northd processed all of the NB changes and executed
> their corresponding SB transactions.  In this case, the final state of
> the NB and SB databases should be equivalent to their initial states.
> NB/SB changes will be accumulated by the change tracking mechanism on
> the standby resulting in empty tracked changes lists.  This is fine
> because the new active northd doesn't need to do anything, the DB
> contents are already consistent.
>
> c. The former active northd processed none of the NB changes yet.  This
> is very similar to case "b" above, the new active northd doesn't need to
> change anything in the NB/SB and it won't do that either.


Thanks for the detailed explanation.

Acked-by: Numan Siddique <numans@ovn.org>

Numan

>
> >
> > Thanks
> > Numan
> >
>
> Thanks,
> Dumitru
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Han Zhou Feb. 23, 2022, 6:27 a.m. UTC | #4
On Fri, Feb 11, 2022 at 12:52 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 2/11/22 00:54, Numan Siddique wrote:
> > On Thu, Jan 27, 2022 at 10:00 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
> >> that are in standby mode.  That's because they need to try to take the
> >> lock.  But if they're in standby they won't actually build a transaction
> >> json and will return early because they don't own the SB lock.
> >>
> >> That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
> >> shouldn't act on it.
> >>
> >> Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
> >> should be called only on active instances.  The standby or paused ones
> >> will get the incremental updates when they become active.  Otherwise we
> >> would be forced to perform a full recompute when the instance becomes
> >> active.
> >
> > Hi Dumitru,
> >
>
> Hi Numan,
>
> Thanks for the review!
>
> > I've a question on the track clear being moved out of the standby instances.
> > To ensure correctness,  I suppose it's better to trigger a full recompute
> > when a standby instance becomes active. What do you think?
> >
>
> I might be wrong but I don't think that's necessary.  It may also be
> quite costly as full recomputes can take quite long.
>
> > Also lets say CMS does the below operations
> >      - Add a logical switch S1
> >      - Add a  logical port p1 in S1
> >      - Add a logical port p2 in S1
> >      - Delete logical port p2
> >      - Delete a logical switch.
> >
> > With this patch since we are not clearing the tracking information,
> > how does ovn-northd
> > process the tracked changes when it becomes active ?
>
> When ovn-northd becomes active, from a Northbound database perspective,
> there were no changes; that is, S1 didn't exist when it was last active
> and it doesn't exist now either.
>
> So, there should be no NB change to process.  Accumulating tracked
> changes without calling clear() on the standby has exactly this effect.

Hi Dumitru,

I wonder how accumulating tracked changes without calling clear() would
work.

Firstly, I was under the impression that ovsdb_idl_track_clear() must be
called before the next ovsdb_idl_run(), and the current change tracking
implementation cannot safely carry tracking information across the
iterations. This was why in ovn-controller whenever (!engine_has_run() &&
engine_need_run()) we force recompute in the next iteration. If changes
could be carried over we would have incrementally processed the accumulated
changes without forcing recompute. I can't recall the details, and I
checked the IDL again briefly but I didn't find the exact reason why it is
not safe. But I believe it was never used this way before. I should have
added a TODO for this (but somehow forgot to, sorry).

Secondly, even if it is safe to accumulate changes, it is going to be a
memory problem. Active-standby failover happens very rarely in a healthy
production environment. So even if the DB size is small, the change
tracking can grow without any limit. I tried this patch with a simple
loop that adds and deletes a single logical switch 10k times on top of
an empty NB DB, and the standby northd's memory grew to 1.1GB.

Thirdly, processing all the tracked changes when a standby becomes active
is likely to cost more than a recompute if the tracked change size is
bigger than the DB size.

So for now I would suggest keeping the logic of clearing tracking on every
iteration and forcing a recompute on failover.

Thanks,
Han

>
> From a Southbound database perspective there are two cases:
>
> a. The former active northd processed some (but not all) of the NB
> changes and executed their corresponding SB transactions.  In this case,
> the standby northd also receives update messages for the SB records that
> were changed.  The standby northd tracks these changes.
>
> When the standby northd becomes active it will:
> - determine that NB state didn't change
> - SB state changed and needs to be reconciled (today we do this with the
> help of a NULL change handler for SB_* tables which will trigger a full
> recompute).
>
> b. The former active northd processed all of the NB changes and executed
> their corresponding SB transactions.  In this case, the final state of
> the NB and SB databases should be equivalent to their initial states.
> NB/SB changes will be accumulated by the change tracking mechanism on
> the standby resulting in empty tracked changes lists.  This is fine
> because the new active northd doesn't need to do anything, the DB
> contents are already consistent.
>
> c. The former active northd processed none of the NB changes yet.  This
> is very similar to case "b" above, the new active northd doesn't need to
> change anything in the NB/SB and it won't do that either.
>
> >
> > Thanks
> > Numan
> >
>
> Thanks,
> Dumitru
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Dumitru Ceara Feb. 23, 2022, 8:24 a.m. UTC | #5
On 2/23/22 07:27, Han Zhou wrote:
> On Fri, Feb 11, 2022 at 12:52 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 2/11/22 00:54, Numan Siddique wrote:
>>> On Thu, Jan 27, 2022 at 10:00 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>
>>>> We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
>>>> that are in standby mode.  That's because they need to try to take the
>>>> lock.  But if they're in standby they won't actually build a transaction
>>>> json and will return early because they don't own the SB lock.
>>>>
>>>> That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
>>>> shouldn't act on it.
>>>>
>>>> Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
>>>> should be called only on active instances.  The standby or paused ones
>>>> will get the incremental updates when they become active.  Otherwise we
>>>> would be forced to perform a full recompute when the instance becomes
>>>> active.
>>>
>>> Hi Dumitru,
>>>
>>
>> Hi Numan,
>>
>> Thanks for the review!
>>
>>> I've a question on the track clear being moved out of the standby instances.
>>> To ensure correctness,  I suppose it's better to trigger a full recompute
>>> when a standby instance becomes active. What do you think?
>>>
>>
>> I might be wrong but I don't think that's necessary.  It may also be
>> quite costly as full recomputes can take quite long.
>>
>>> Also lets say CMS does the below operations
>>>      - Add a logical switch S1
>>>      - Add a  logical port p1 in S1
>>>      - Add a logical port p2 in S1
>>>      - Delete logical port p2
>>>      - Delete a logical switch.
>>>
>>> With this patch since we are not clearing the tracking information,
>>> how does ovn-northd
>>> process the tracked changes when it becomes active ?
>>
>> When ovn-northd becomes active, from a Northbound database perspective,
>> there were no changes; that is, S1 didn't exist when it was last active
>> and it doesn't exist now either.
>>
>> So, there should be no NB change to process.  Accumulating tracked
>> changes without calling clear() on the standby has exactly this effect.
> 
> Hi Dumitru,
> 

Hi Han,

> I wonder how accumulating tracked changes without calling clear() would
> work.
> 
> Firstly, I was under the impression that ovsdb_idl_track_clear() must be
> called before the next ovsdb_idl_run(), and the current change tracking
> implementation cannot safely carry tracking information across the
> iterations. This was why in ovn-controller whenever (!engine_has_run() &&
> engine_need_run()) we force recompute in the next iteration. If changes
> could be carried over we would have incrementally processed the accumulated
> changes without forcing recompute. I can't recall the details, and I
> checked the IDL again briefly but I didn't find the exact reason why it is
> not safe. But I believe it was never used this way before. I should have
> added a TODO for this (but somehow forgot to, sorry).
> 

I've been looking at that code too and I don't see any reason why
accumulating changes wouldn't work.  The IDL code is written such that
it processes multiple jsonrpc updates in a single run anyway:

ovsdb_idl_run() -> ovsdb_cs_run(), which can receive up to 50 (hardcoded)
jsonrpc messages in a single call.

It's possible that I missed something though, the IDL functionality is
complex, so maybe it's worth documenting that we recommend calling
ovsdb_idl_track_clear() every time ovsdb_idl_run() is called, and why.
What do you think?
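
For reference, the pattern we'd be documenting is roughly the following
(a sketch; "idl" stands for either the NB or the SB IDL instance):

    while (!exiting) {
        ovsdb_idl_run(idl);
        /* ... process tracked changes ... */
        ovsdb_idl_track_clear(idl);   /* Clear before the next run. */
        ovsdb_idl_wait(idl);
        poll_block();
    }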

> Secondly, even if it is safe to accumulate changes, it is going to be a
> memory problem. Active-standby failover happens very rarely in a healthy
> production environment. So even if the DB size is small, the change
> tracking can grow without any limit. I tried with this patch by doing
> simply a loop of adding and deleting a single logical switch 10k times on
> top of an empty NB DB, and the standby northd's memory grew to 1.1GB.
> 

This is surprising; it sounds like an issue in the IDL. I would've expected
the delete to just remove the record from the track list.  I'll look
into it.

> Thirdly, processing all the tracked changes when standby becomes active is
> likely taking higher cost than recompute, if the tracked change size is
> bigger than the DB size.
> 

If changes are accumulated correctly, the total number of the tracked
record changes should never be larger than the size of the DB the last
time changes were processed + the size of the DB now, resulting in a
cost to process of O(size-of-DB), the same as for full recompute.  Or am
I missing something?

> So for now I would suggest keeping the logic of clearing tracking on every
> iteration and force recompute when failover.
> 

At least until we figure out the cause of the memory increase in the IDL,
I agree to keep forcing a recompute on failover.  That's also because we
currently don't really incrementally process much in ovn-northd.

I'll send a v2.

> Thanks,
> Han
> 

Thanks,
Dumitru

>>
>> From a Southbound database perspective there are two cases:
>>
>> a. The former active northd processed some (but not all) of the NB
>> changes and executed their corresponding SB transactions.  In this case,
>> the standby northd also receives update messages for the SB records that
>> were changed.  The standby northd tracks these changes.
>>
>> When the standby northd becomes active it will:
>> - determine that NB state didn't change
>> - SB state changed and needs to be reconciled (today we do this with the
>> help of a NULL change handler for SB_* tables which will trigger a full
>> recompute).
>>
>> b. The former active northd processed all of the NB changes and executed
>> their corresponding SB transactions.  In this case, the final state of
>> the NB and SB databases should be equivalent to their initial states.
>> NB/SB changes will be accumulated by the change tracking mechanism on
>> the standby resulting in empty tracked changes lists.  This is fine
>> because the new active northd doesn't need to do anything, the DB
>> contents are already consistent.
>>
>> c. The former active northd processed none of the NB changes yet.  This
>> is very similar to case "b" above, the new active northd doesn't need to
>> change anything in the NB/SB and it won't do that either.
>>
>>>
>>> Thanks
>>> Numan
>>>
>>
>> Thanks,
>> Dumitru
>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Dumitru Ceara Feb. 23, 2022, 8:35 a.m. UTC | #6
On 2/23/22 09:24, Dumitru Ceara wrote:
> On 2/23/22 07:27, Han Zhou wrote:
>> On Fri, Feb 11, 2022 at 12:52 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>
>>> On 2/11/22 00:54, Numan Siddique wrote:
>>>> On Thu, Jan 27, 2022 at 10:00 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>>
>>>>> We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
>>>>> that are in standby mode.  That's because they need to try to take the
>>>>> lock.  But if they're in standby they won't actually build a transaction
>>>>> json and will return early because they don't own the SB lock.
>>>>>
>>>>> That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
>>>>> shouldn't act on it.
>>>>>
>>>>> Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
>>>>> should be called only on active instances.  The standby or paused ones
>>>>> will get the incremental updates when they become active.  Otherwise we
>>>>> would be forced to perform a full recompute when the instance becomes
>>>>> active.
>>>>
>>>> Hi Dumitru,
>>>>
>>>
>>> Hi Numan,
>>>
>>> Thanks for the review!
>>>
>>>> I've a question on the track clear being moved out of the standby instances.
>>>> To ensure correctness,  I suppose it's better to trigger a full recompute
>>>> when a standby instance becomes active. What do you think?
>>>>
>>>
>>> I might be wrong but I don't think that's necessary.  It may also be
>>> quite costly as full recomputes can take quite long.
>>>
>>>> Also lets say CMS does the below operations
>>>>      - Add a logical switch S1
>>>>      - Add a  logical port p1 in S1
>>>>      - Add a logical port p2 in S1
>>>>      - Delete logical port p2
>>>>      - Delete a logical switch.
>>>>
>>>> With this patch since we are not clearing the tracking information,
>>>> how does ovn-northd
>>>> process the tracked changes when it becomes active ?
>>>
>>> When ovn-northd becomes active, from a Northbound database perspective,
>>> there were no changes; that is, S1 didn't exist when it was last active
>>> and it doesn't exist now either.
>>>
>>> So, there should be no NB change to process.  Accumulating tracked
>>> changes without calling clear() on the standby has exactly this effect.
>>
>> Hi Dumitru,
>>
> 
> Hi Han,
> 
>> I wonder how accumulating tracked changes without calling clear() would
>> work.
>>
>> Firstly, I was under the impression that ovsdb_idl_track_clear() must be
>> called before the next ovsdb_idl_run(), and the current change tracking
>> implementation cannot safely carry tracking information across the
>> iterations. This was why in ovn-controller whenever (!engine_has_run() &&
>> engine_need_run()) we force recompute in the next iteration. If changes
>> could be carried over we would have incrementally processed the accumulated
>> changes without forcing recompute. I can't recall the details, and I
>> checked the IDL again briefly but I didn't find the exact reason why it is
>> not safe. But I believe it was never used this way before. I should have
>> added a TODO for this (but somehow forgot to, sorry).
>>
> 
> I've been looking at that code too and I don't see any reason why
> accumulating changes wouldn't work.  The IDL code is written such that
> it processes multiple jsonrpc updates in a single run anyway:
> 
> ovsdb_idl_run() -> ovsdb_cs_run() -> which can receive up to 50
> (hardcoded) jsonrpc messages.
> 
> It's possible that I missed something though, the IDL functionality is
> complex, so maybe it's worth documenting that we recommend calling
> ovsdb_idl_track_clear() every time ovsdb_idl_run() is called, and why.
> What do you think?
> 
>> Secondly, even if it is safe to accumulate changes, it is going to be a
>> memory problem. Active-standby failover happens very rarely in a healthy
>> production environment. So even if the DB size is small, the change
>> tracking can grow without any limit. I tried with this patch by doing
>> simply a loop of adding and deleting a single logical switch 10k times on
>> top of an empty NB DB, and the standby northd's memory grew to 1.1GB.
>>
> 
> This is surprising, sounds like an issue in the IDL, I would've expected
> the delete to just remove the record from the track list.  I'll look
> into it.
> 

I had a quick look; it's because deleting a row clears its datum, sets
tracked_old_datum, and adds the row to the deleted track list.  The
subsequent "add" will not find this row, and will insert a new one.

We probably need to fix this anyway because even in the current
implementation the IDL can process up to 50 consecutive
"add/delete/add/delete/.." updates.  This can potentially cause issues
due to "spurious" delete events when the client finally processes the
tracked changes.
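
In IDL terms the delete path does roughly this (a sketch of the current
behavior as I read it, not actual OVS code):

    /* On row deletion: keep the data around for change tracking. */
    row->tracked_old_datum = row->old_datum;  /* Saved for the client. */
    row->old_datum = NULL;                    /* The row no longer exists. */
    ovs_list_push_back(&row->table->track_list, &row->track_node);
    /* A later insert arrives with a fresh UUID and becomes a brand new
     * tracked row, so add/delete cycles keep growing track_list. */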

>> Thirdly, processing all the tracked changes when standby becomes active is
>> likely taking higher cost than recompute, if the tracked change size is
>> bigger than the DB size.
>>
> 
> If changes are accumulated correctly, the total number of the tracked
> record changes should never be larger than the size of the DB the last
> time changes were processed + the size of the DB now, resulting in a
> cost to process of O(size-of-DB), the same as for full recompute.  Or am
> I missing something?
> 
>> So for now I would suggest keeping the logic of clearing tracking on every
>> iteration and force recompute when failover.
>>
> 
> At least until we figure out why the memory increase in the IDL, I agree
> to keep forcing a recompute on failover.  That's also because we
> currently don't really incrementally process much in ovn-northd.
> 
> I'll send a v2.
> 
>> Thanks,
>> Han
>>
> 
> Thanks,
> Dumitru
> 
>>>
>>> From a Southbound database perspective there are two cases:
>>>
>>> a. The former active northd processed some (but not all) of the NB
>>> changes and executed their corresponding SB transactions.  In this case,
>>> the standby northd also receives update messages for the SB records that
>>> were changed.  The standby northd tracks these changes.
>>>
>>> When the standby northd becomes active it will:
>>> - determine that NB state didn't change
>>> - SB state changed and needs to be reconciled (today we do this with the
>>> help of a NULL change handler for SB_* tables which will trigger a full
>>> recompute).
>>>
>>> b. The former active northd processed all of the NB changes and executed
>>> their corresponding SB transactions.  In this case, the final state of
>>> the NB and SB databases should be equivalent to their initial states.
>>> NB/SB changes will be accumulated by the change tracking mechanism on
>>> the standby resulting in empty tracked changes lists.  This is fine
>>> because the new active northd doesn't need to do anything, the DB
>>> contents are already consistent.
>>>
>>> c. The former active northd processed none of the NB changes yet.  This
>>> is very similar to case "b" above, the new active northd doesn't need to
>>> change anything in the NB/SB and it won't do that either.
>>>
>>>>
>>>> Thanks
>>>> Numan
>>>>
>>>
>>> Thanks,
>>> Dumitru
>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev@openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
>
Han Zhou Feb. 23, 2022, 6:50 p.m. UTC | #7
On Wed, Feb 23, 2022 at 12:35 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 2/23/22 09:24, Dumitru Ceara wrote:
> > On 2/23/22 07:27, Han Zhou wrote:
> >> On Fri, Feb 11, 2022 at 12:52 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>>
> >>> On 2/11/22 00:54, Numan Siddique wrote:
> >>>> On Thu, Jan 27, 2022 at 10:00 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>>>>
> >>>>> We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
> >>>>> that are in standby mode.  That's because they need to try to take the
> >>>>> lock.  But if they're in standby they won't actually build a transaction
> >>>>> json and will return early because they don't own the SB lock.
> >>>>>
> >>>>> That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
> >>>>> shouldn't act on it.
> >>>>>
> >>>>> Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
> >>>>> should be called only on active instances.  The standby or paused ones
> >>>>> will get the incremental updates when they become active.  Otherwise we
> >>>>> would be forced to perform a full recompute when the instance becomes
> >>>>> active.
> >>>>
> >>>> Hi Dumitru,
> >>>>
> >>>
> >>> Hi Numan,
> >>>
> >>> Thanks for the review!
> >>>
> >>>> I've a question on the track clear being moved out of the standby instances.
> >>>> To ensure correctness,  I suppose it's better to trigger a full recompute
> >>>> when a standby instance becomes active. What do you think?
> >>>>
> >>>
> >>> I might be wrong but I don't think that's necessary.  It may also be
> >>> quite costly as full recomputes can take quite long.
> >>>
> >>>> Also lets say CMS does the below operations
> >>>>      - Add a logical switch S1
> >>>>      - Add a  logical port p1 in S1
> >>>>      - Add a logical port p2 in S1
> >>>>      - Delete logical port p2
> >>>>      - Delete a logical switch.
> >>>>
> >>>> With this patch since we are not clearing the tracking information,
> >>>> how does ovn-northd
> >>>> process the tracked changes when it becomes active ?
> >>>
> >>> When ovn-northd becomes active, from a Northbound database perspective,
> >>> there were no changes; that is, S1 didn't exist when it was last active
> >>> and it doesn't exist now either.
> >>>
> >>> So, there should be no NB change to process.  Accumulating tracked
> >>> changes without calling clear() on the standby has exactly this effect.
> >>
> >> Hi Dumitru,
> >>
> >
> > Hi Han,
> >
> >> I wonder how accumulating tracked changes without calling clear() would
> >> work.
> >>
> >> Firstly, I was under the impression that ovsdb_idl_track_clear() must be
> >> called before the next ovsdb_idl_run(), and the current change tracking
> >> implementation cannot safely carry tracking information across the
> >> iterations. This was why in ovn-controller whenever (!engine_has_run() &&
> >> engine_need_run()) we force recompute in the next iteration. If changes
> >> could be carried over we would have incrementally processed the accumulated
> >> changes without forcing recompute. I can't recall the details, and I
> >> checked the IDL again briefly but I didn't find the exact reason why it is
> >> not safe. But I believe it was never used this way before. I should have
> >> added a TODO for this (but somehow forgot to, sorry).
> >>
> >
> > I've been looking at that code too and I don't see any reason why
> > accumulating changes wouldn't work.  The IDL code is written such that
> > it processes multiple jsonrpc updates in a single run anyway:
> >
> > ovsdb_idl_run() -> ovsdb_cs_run() -> which can receive up to 50
> > (hardcoded) jsonrpc messages.
> >
Maybe you are right, but I am not sure. It's true that IDL run can process
multiple messages, but there are functions called once for every IDL run,
after processing all the messages:
ovsdb_idl_reparse_refs_to_inserted, ovsdb_idl_reparse_deleted,
ovsdb_idl_row_destroy_postprocess, etc.

Is it possible any of these could cause problems with change tracking left
over from the previous run? I haven't looked deeper yet.

> > It's possible that I missed something though, the IDL functionality is
> > complex, so maybe it's worth documenting that we recommend calling
> > ovsdb_idl_track_clear() every time ovsdb_idl_run() is called, and why.
> > What do you think?
> >

There is a comment above ovsdb_idl_track_clear() which suggests calling it
before ovsdb_idl_run(), but it doesn't mention whether that is mandatory. I
might have just followed the comment.

> >> Secondly, even if it is safe to accumulate changes, it is going to be a
> >> memory problem. Active-standby failover happens very rarely in a healthy
> >> production environment. So even if the DB size is small, the change
> >> tracking can grow without any limit. I tried with this patch by doing
> >> simply a loop of adding and deleting a single logical switch 10k times on
> >> top of an empty NB DB, and the standby northd's memory grew to 1.1GB.
> >>
> >
> > This is surprising, sounds like an issue in the IDL, I would've expected
> > the delete to just remove the record from the track list.  I'll look
> > into it.
> >
I wasn't surprised, because I thought tracked changes were append-only and
only got cleared by ovsdb_idl_track_clear() or ovsdb_idl_clear(). I just
recalled that ovsdb_idl_row_untrack_change() was added some time ago, which
may have changed that behavior.

>
> I had a quick look, it's because deleting a row clears its datum, sets
> tracked_old_datum and adds it to the deleted track list.  The subsequent
> "add" will not find this row, and will insert a new one.
>

I wouldn't expect the subsequent add to find the deleted item and remove it
from the tracking. In general, every add creates a new object with a new
UUID. The IDL doesn't and shouldn't understand whether it is the "same" as
an earlier deleted object. Think of the more general case where we keep
adding and deleting items with different names every time.

However, for the problem here, it is more about delete-after-add. I think
it is possible to avoid maintaining the items that are deleted before being
processed by the client. Be it a bug-fix or a new feature, we can improve
this, and then the tracked change size shouldn't grow much when the total
DB size is stable.

> We probably need to fix this anyway because even in the current
> implementation the IDL can process up to 50 consecutive
> "add/delete/add/delete/.." updates.  This can potentially cause issues
> due to "spurious" delete events when the client finally processes the
> tracked changes.
>

I *think* it shouldn't cause problems for the current implementation,
except that there can be wasted cycles to process an object addition and
its deletion in the same iteration, which should be fine because the case
should be rare anyway. Each add creates a new object with a new UUID, so
the client (ovn-controller) just processes different objects. Please let me
know if you find anything possibly going wrong here; then it's a bug.

> >> Thirdly, processing all the tracked changes when standby becomes active is
> >> likely taking higher cost than recompute, if the tracked change size is
> >> bigger than the DB size.
> >>
> >
> > If changes are accumulated correctly, the total number of the tracked
> > record changes should never be larger than the size of the DB the last
> > time changes were processed + the size of the DB now, resulting in a
> > cost to process of O(size-of-DB), the same as for full recompute.  Or am
> > I missing something?
> >

My conclusion was based on the fact that the delete-after-add operations
are accumulated. If we can avoid that, and if the incremental processing is
implemented efficiently, i.e. the extra overhead of handling each change is
small enough, then yes, I agree with you that the worst case is similar to
a full recompute.
However, it really depends on the implementation. A bad example in
ovn-controller is the change handling for anything referenced by a logical
flow: it used to reprocess the same lflow again and again when there were
multiple reference changes (this is now avoided).
Recompute has the advantage that it guarantees an upper bound on time for
the worst case, but of course it is not efficient if the tracked changes
are small. In the failover scenario, since failover happens rarely, the
accumulated change size is more likely to be big, so in that case recompute
is probably not a bad idea.

Recompute may also avoid the complexity of corner cases during failover.

But again it depends on how we implement I-P in northd. We can change it to
I-P whenever we find it more feasible.

> >> So for now I would suggest keeping the logic of clearing tracking on every
> >> iteration and force recompute when failover.
> >>
> >
> > At least until we figure out why the memory increase in the IDL, I agree
> > to keep forcing a recompute on failover.  That's also because we
> > currently don't really incrementally process much in ovn-northd.
> >
> > I'll send a v2.

Thanks! Looking at it.

Han

> >
> >> Thanks,
> >> Han
> >>
> >
> > Thanks,
> > Dumitru
> >
> >>>
> >>> From a Southbound database perspective there are two cases:
> >>>
> >>> a. The former active northd processed some (but not all) of the NB
> >>> changes and executed their corresponding SB transactions.  In this case,
> >>> the standby northd also receives update messages for the SB records that
> >>> were changed.  The standby northd tracks these changes.
> >>>
> >>> When the standby northd becomes active it will:
> >>> - determine that NB state didn't change
> >>> - SB state changed and needs to be reconciled (today we do this with the
> >>> help of a NULL change handler for SB_* tables which will trigger a full
> >>> recompute).
> >>>
> >>> b. The former active northd processed all of the NB changes and executed
> >>> their corresponding SB transactions.  In this case, the final state of
> >>> the NB and SB databases should be equivalent to their initial states.
> >>> NB/SB changes will be accumulated by the change tracking mechanism on
> >>> the standby resulting in empty tracked changes lists.  This is fine
> >>> because the new active northd doesn't need to do anything, the DB
> >>> contents are already consistent.
> >>>
> >>> c. The former active northd processed none of the NB changes yet.  This
> >>> is very similar to case "b" above, the new active northd doesn't need to
> >>> change anything in the NB/SB and it won't do that either.
> >>>
> >>>>
> >>>> Thanks
> >>>> Numan
> >>>>
> >>>
> >>> Thanks,
> >>> Dumitru
> >>>
> >>> _______________________________________________
> >>> dev mailing list
> >>> dev@openvswitch.org
> >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>
> >
>
Dumitru Ceara Feb. 23, 2022, 6:57 p.m. UTC | #8
On 2/23/22 19:50, Han Zhou wrote:
> On Wed, Feb 23, 2022 at 12:35 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 2/23/22 09:24, Dumitru Ceara wrote:
>>> On 2/23/22 07:27, Han Zhou wrote:
>>>> On Fri, Feb 11, 2022 at 12:52 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>>
>>>>> On 2/11/22 00:54, Numan Siddique wrote:
>>>>>> On Thu, Jan 27, 2022 at 10:00 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>>>>
>>>>>>> We still need to try to ovsdb_idl_loop_commit_and_wait() on instances
>>>>>>> that are in standby mode.  That's because they need to try to take the
>>>>>>> lock.  But if they're in standby they won't actually build a transaction
>>>>>>> json and will return early because they don't own the SB lock.
>>>>>>>
>>>>>>> That's reported as an error by ovsdb_idl_loop_commit_and_wait() but we
>>>>>>> shouldn't act on it.
>>>>>>>
>>>>>>> Also, to ensure that no DB changes are missed, ovsdb_idl_track_clear()
>>>>>>> should be called only on active instances.  The standby or paused ones
>>>>>>> will get the incremental updates when they become active.  Otherwise we
>>>>>>> would be forced to perform a full recompute when the instance becomes
>>>>>>> active.
>>>>>>
>>>>>> Hi Dumitru,
>>>>>>
>>>>>
>>>>> Hi Numan,
>>>>>
>>>>> Thanks for the review!
>>>>>
>>>>>> I've a question on the track clear being moved out of the standby instances.
>>>>>> To ensure correctness,  I suppose it's better to trigger a full recompute
>>>>>> when a standby instance becomes active. What do you think?
>>>>>>
>>>>>
>>>>> I might be wrong but I don't think that's necessary.  It may also be
>>>>> quite costly as full recomputes can take quite long.
>>>>>
>>>>>> Also lets say CMS does the below operations
>>>>>>      - Add a logical switch S1
>>>>>>      - Add a  logical port p1 in S1
>>>>>>      - Add a logical port p2 in S1
>>>>>>      - Delete logical port p2
>>>>>>      - Delete a logical switch.
>>>>>>
>>>>>> With this patch since we are not clearing the tracking information,
>>>>>> how does ovn-northd
>>>>>> process the tracked changes when it becomes active ?
>>>>>
>>>>> When ovn-northd becomes active, from a Northbound database perspective,
>>>>> there were no changes; that is, S1 didn't exist when it was last active
>>>>> and it doesn't exist now either.
>>>>>
>>>>> So, there should be no NB change to process.  Accumulating tracked
>>>>> changes without calling clear() on the standby has exactly this effect.
>>>>
>>>> Hi Dumitru,
>>>>
>>>
>>> Hi Han,
>>>
>>>> I wonder how accumulating tracked changes without calling clear() would
>>>> work.
>>>>
>>>> Firstly, I was under the impression that ovsdb_idl_track_clear() must be
>>>> called before the next ovsdb_idl_run(), and the current change tracking
>>>> implementation cannot safely carry tracking information across the
>>>> iterations. This was why in ovn-controller whenever (!engine_has_run() &&
>>>> engine_need_run()) we force recompute in the next iteration. If changes
>>>> could be carried over we would have incrementally processed the accumulated
>>>> changes without forcing recompute. I can't recall the details, and I
>>>> checked the IDL again briefly but I didn't find the exact reason why it is
>>>> not safe. But I believe it was never used this way before. I should have
>>>> added a TODO for this (but somehow forgot to, sorry).
>>>>
>>>
>>> I've been looking at that code too and I don't see any reason why
>>> accumulating changes wouldn't work.  The IDL code is written such that
>>> it processes multiple jsonrpc updates in a single run anyway:
>>>
>>> ovsdb_idl_run() -> ovsdb_cs_run() -> which can receive up to 50
>>> (hardcoded) jsonrpc messages.
>>>
> Maybe you are right, but I am not sure. It's true that IDL run can process
> multiple messages, but there are functions called once for every IDL run,
> after processing all the messages:
> ovsdb_idl_reparse_refs_to_inserted, ovsdb_idl_reparse_deleted,
> ovsdb_idl_row_destroy_postprocess, etc.
> 
> Is it possible any of these could cause problems with change tracking left
> with the previous run? I haven't looked deeper yet.
> 
>>> It's possible that I missed something though, the IDL functionality is
>>> complex, so maybe it's worth documenting that we recommend calling
>>> ovsdb_idl_track_clear() every time ovsdb_idl_run() is called, and why.
>>> What do you think?
>>>
> 
> There is a comment above the ovsdb_idl_track_clear() which suggests calling
> it before ovsdb_idl_run(), but it doesn't mention if it is mandatory. I
> might have just followed the comment.
> 
>>>> Secondly, even if it is safe to accumulate changes, it is going to be a
>>>> memory problem. Active-standby failover happens very rarely in a healthy
>>>> production environment. So even if the DB size is small, the change
>>>> tracking can grow without any limit. I tried with this patch by doing
>>>> simply a loop of adding and deleting a single logical switch 10k times on
>>>> top of an empty NB DB, and the standby northd's memory grew to 1.1GB.
>>>>
>>>
>>> This is surprising, sounds like an issue in the IDL, I would've expected
>>> the delete to just remove the record from the track list.  I'll look
>>> into it.
>>>
> I wasn't surprised because I thought tracked changes were append-only, and
> only got cleared by ovsdb_idl_track_clear() or ovsdb_idl_clear(). I just
> recalled that the ovsdb_idl_row_untrack_change() was added some time ago
> and it may change the behavior.
> 
>>
>> I had a quick look, it's because deleting a row clears its datum, sets
>> tracked_old_datum and adds it to the deleted track list.  The subsequent
>> "add" will not find this row, and will insert a new one.
>>
> 
> I don't expect the subsequent add would find the deleted item and remove it
> from the tracking. In general, every add would create a new object with a
> new UUID. IDL doesn't and shouldn't understand if it is the "same" to an
> earlier deleted object. Think about a more general case when we keep adding
> and deleting items with different names every time.
> 
> However, for the problem here, it is more about delete-after-add. I think
> it is possible to avoid maintaining the items that are deleted before being
> processed by the client. Be it a bug-fix or a new feature, we can improve
> this, and then the tracked change size shouldn't grow much when the total
> DB size is stable.
> 

You're right, this is probably the way to fix it: if the client didn't
process the "add" event yet then just silently remove it when the
"delete" event is received.

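Roughly like this (a hypothetical sketch; the helper name is made up,
ovsdb_idl_row_untrack_change() is the existing function you mentioned):

    /* On delete: if this row was inserted but the client never saw it
     * (it is still tracked as "new"), drop it from change tracking
     * instead of keeping a tracked_old_datum around for it. */
    if (row_is_tracked_as_new_and_unseen(row)) {    /* Hypothetical. */
        ovsdb_idl_row_untrack_change(row);
        ovsdb_idl_row_destroy(row);
        return;
    }
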
>> We probably need to fix this anyway because even in the current
>> implementation the IDL can process up to 50 consecutive
>> "add/delete/add/delete/.." updates.  This can potentially cause issues
>> due to "spurious" delete events when the client finally processes the
>> tracked changes.
>>
> 
> I *think* it shouldn't cause problems for the current implementation,
> except that there can be wasted cycles to process an object addition and
> its deletion in the same iteration, which should be fine because the case
> should be rare anyway. Each add creates a new object with a new UUID, so to
> the client (ovn-controller) it just processes different objects. Please let
> me know if you find anything possibly going wrong here, then it's a bug.
> 

No, it's exactly what you pointed out in the previous paragraph
(delete-after-add); I'm not sure how easy it is to fix, though.

>>>> Thirdly, processing all the tracked changes when standby becomes active is
>>>> likely taking higher cost than recompute, if the tracked change size is
>>>> bigger than the DB size.
>>>>
>>>
>>> If changes are accumulated correctly, the total number of the tracked
>>> record changes should never be larger than the size of the DB the last
>>> time changes were processed + the size of the DB now, resulting in a
>>> cost to process of O(size-of-DB), the same as for full recompute.  Or am
>>> I missing something?
>>>
> 
> My conclusion was based on the fact that the delete-after-add operations
> are accumulated. If we can avoid that, and if the incremental processing is
> implemented efficiently, i.e. the extra overhead of handling each change is
> small enough, then yes I agree with you it is in the worst case similar to
> full recompute.
> However it would really depend on the implementation. A bad example in
> ovn-controller is the change handling for anything referenced by a logical
> flow. It used to reprocess the same lflow again and again when there are
> multiple reference changes (now this is avoided).
> Recompute has an advantage that it guarantees the upper bound time for the
> worst case, but of course not efficient if the tracked changes are not big
> enough - in the failover scenario since it happens rarely, it is more
> likely that the accumulated change size is big, so in that case probably
> recompute is not a bad idea.
> 
> Recompute may also avoid the complexity of corner cases during failover.
> 
> But again it depends on how we implement I-P in northd. We can change it to
> I-P whenever we find it more feasible.
> 

OK.

>>>> So for now I would suggest keeping the logic of clearing tracking on every
>>>> iteration and force recompute when failover.
>>>>
>>>
>>> At least until we figure out why the memory increase in the IDL, I agree
>>> to keep forcing a recompute on failover.  That's also because we
>>> currently don't really incrementally process much in ovn-northd.
>>>
>>> I'll send a v2.
> 
> Thanks! Looking at it.
> 
> Han
> 

Thanks,
Dumitru

Patch

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 793135ede..4e1e51931 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -1089,6 +1089,26 @@  main(int argc, char *argv[])
                                             ovnnb_txn, ovnsb_txn,
                                             &ovnsb_idl_loop);
                 }
+
+                /* If there are any errors, we force a full recompute in order
+                 * to ensure we handle all changes. */
+                if (!ovsdb_idl_loop_commit_and_wait(&ovnnb_idl_loop)) {
+                    VLOG_INFO("OVNNB commit failed, "
+                              "force recompute next time.");
+                    recompute = true;
+                }
+
+                if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) {
+                    VLOG_INFO("OVNSB commit failed, "
+                              "force recompute next time.");
+                    recompute = true;
+                }
+                ovsdb_idl_track_clear(ovnnb_idl_loop.idl);
+                ovsdb_idl_track_clear(ovnsb_idl_loop.idl);
+            } else {
+                /* Make sure we send any pending requests, e.g., lock. */
+                ovsdb_idl_loop_commit_and_wait(&ovnnb_idl_loop);
+                ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop);
             }
         } else {
             /* ovn-northd is paused
@@ -1112,21 +1132,6 @@  main(int argc, char *argv[])
             ovsdb_idl_wait(ovnsb_idl_loop.idl);
         }
 
-        /* If there are any errors, we force a full recompute in order to
-           ensure we handle all changes. */
-        if (!ovsdb_idl_loop_commit_and_wait(&ovnnb_idl_loop)) {
-            VLOG_INFO("OVNNB commit failed, force recompute next time.");
-            recompute = true;
-        }
-
-        if (!ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop)) {
-            VLOG_INFO("OVNSB commit failed, force recompute next time.");
-            recompute = true;
-        }
-
-        ovsdb_idl_track_clear(ovnnb_idl_loop.idl);
-        ovsdb_idl_track_clear(ovnsb_idl_loop.idl);
-
         unixctl_server_run(unixctl);
         unixctl_server_wait(unixctl);
         memory_wait();