diff mbox series

[ovs-dev] ovsdb-idl.c: Clear conditions when clearing IDL.

Message ID 1582930845-15475-1-git-send-email-dceara@redhat.com
State Superseded
Headers show
Series [ovs-dev] ovsdb-idl.c: Clear conditions when clearing IDL. | expand

Commit Message

Dumitru Ceara Feb. 28, 2020, 11 p.m. UTC
If the ovsdb-server reply to "monitor_cond_since" requests has
"found" == false then ovsdb_idl_db_parse_monitor_reply() calls
ovsdb_idl_db_clear() which iterates through all tables and
unconditionally sets table->cond_changed to false.

However, if the client had already set a new condition for some of the
tables, this new condition request will never be sent to ovsdb-server
until the condition is reset to a different value. This is due to the
check in ovsdb_idl_db_set_condition().

In order to fix this we now also call ovsdb_idl_condition_clear() for
each table condition in ovsdb_idl_db_clear(). This ensures that
resetting the condition to the same value as before the
"monitor_cond_since" reply will trigger a new request.

One way to replicate the issue is described in the bugzilla reporting
the bug, when ovn-controller is configured to use "ovn-monitor-all":
https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6

Reported-by: Dan Williams <dcbw@redhat.com>
Reported-at: https://bugzilla.redhat.com/1808125
CC: Han Zhou <hzhou@ovn.org>
Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when connection reset.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
 lib/ovsdb-idl.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Dumitru Ceara March 2, 2020, 3:54 p.m. UTC | #1
On 2/29/20 12:00 AM, Dumitru Ceara wrote:
> If the ovsdb-server reply to "monitor_cond_since" requests has
> "found" == false then ovsdb_idl_db_parse_monitor_reply() calls
> ovsdb_idl_db_clear() which iterates through all tables and
> unconditionally sets table->cond_changed to false.
> 
> However, if the client had already set a new condition for some of the
> tables, this new condition request will never be sent to ovsdb-server
> until the condition is reset to a different value. This is due to the
> check in ovsdb_idl_db_set_condition().
> 
> In order to fix this we now also call ovsdb_idl_condition_clear() for
> each table condition in ovsdb_idl_db_clear(). This ensures that
> resetting the condition to the same value as before the
> "monitor_cond_since" reply will trigger a new request.
> 
> One way to replicate the issue is described in the bugzilla reporting
> the bug, when ovn-controller is configured to use "ovn-monitor-all":
> https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6
> 
> Reported-by: Dan Williams <dcbw@redhat.com>
> Reported-at: https://bugzilla.redhat.com/1808125
> CC: Han Zhou <hzhou@ovn.org>
> Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when connection reset.")
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>

Hi Han,

I was wondering if you could review this change because ovn-kubernetes
is trying to enable ovn-monitor-all for scalability reasons but their CI
fails due to the bug fixed by the patch below.

Thanks,
Dumitru

> ---
>  lib/ovsdb-idl.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
> index 190143f..64721ca 100644
> --- a/lib/ovsdb-idl.c
> +++ b/lib/ovsdb-idl.c
> @@ -610,7 +610,11 @@ ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
>          struct ovsdb_idl_table *table = &db->tables[i];
>          struct ovsdb_idl_row *row, *next_row;
>  
> -        table->cond_changed = false;
> +        if (table->cond_changed) {
> +            table->cond_changed = false;
> +            ovsdb_idl_condition_clear(&table->condition);
> +        }
> +
>          if (hmap_is_empty(&table->rows)) {
>              continue;
>          }
>
Han Zhou March 3, 2020, 6:43 a.m. UTC | #2
On Mon, Mar 2, 2020 at 7:55 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 2/29/20 12:00 AM, Dumitru Ceara wrote:
> > If the ovsdb-server reply to "monitor_cond_since" requests has
> > "found" == false then ovsdb_idl_db_parse_monitor_reply() calls
> > ovsdb_idl_db_clear() which iterates through all tables and
> > unconditionally sets table->cond_changed to false.
> >
> > However, if the client had already set a new condition for some of the
> > tables, this new condition request will never be sent to ovsdb-server
> > until the condition is reset to a different value. This is due to the
> > check in ovsdb_idl_db_set_condition().
> >
> > In order to fix this we now also call ovsdb_idl_condition_clear() for
> > each table condition in ovsdb_idl_db_clear(). This ensures that
> > resetting the condition to the same value as before the
> > "monitor_cond_since" reply will trigger a new request.
> >
> > One way to replicate the issue is described in the bugzilla reporting
> > the bug, when ovn-controller is configured to use "ovn-monitor-all":
> > https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6
> >
> > Reported-by: Dan Williams <dcbw@redhat.com>
> > Reported-at: https://bugzilla.redhat.com/1808125
> > CC: Han Zhou <hzhou@ovn.org>
> > Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
connection reset.")
> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>
> Hi Han,
>
> I was wondering if you could review this change because ovn-kubernetes
> is trying to enable ovn-monitor-all for scalability reasons but their CI
> fails due to the bug fixed by the patch below.
>
> Thanks,
> Dumitru
>
> > ---
> >  lib/ovsdb-idl.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
> > index 190143f..64721ca 100644
> > --- a/lib/ovsdb-idl.c
> > +++ b/lib/ovsdb-idl.c
> > @@ -610,7 +610,11 @@ ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
> >          struct ovsdb_idl_table *table = &db->tables[i];
> >          struct ovsdb_idl_row *row, *next_row;
> >
> > -        table->cond_changed = false;
> > +        if (table->cond_changed) {
> > +            table->cond_changed = false;
> > +            ovsdb_idl_condition_clear(&table->condition);
> > +        }
> > +
> >          if (hmap_is_empty(&table->rows)) {
> >              continue;
> >          }
> >
>

Hi Dumitru,

If I understand the problem, it is when the client updated monitor
condition and at the same time it received monitor_cond_since reply from
server for it's previous conditional monitor request, and the handling of
the reply overwrite the flag table->cond_changed from true to false, and
because of this, the new condition is not sent to server.

However, I don't understand the fix. It seems that the fix should be "don't
overwrite the flag", instead of clearing the conditions as well. Also, by
simply looking at this code I don't understand why the flag needs to be set
to false when clearing data in IDL. What would be the problem if we keep it
as what it is? If you understand the details, could you explain more about
the fix?

In addition, I didn't change this logic when implementing
monitor_cond_since in 403a6a0cb003 ("ovsdb-idl: Fast resync from server
when connection reset."). In that commit it only reduced the chance of
calling ovsdb_idl_db_clear() as an optimization. Before that commit,
ovsdb_idl_db_parse_monitor_reply() always calls ovsdb_idl_db_clear() and it
would result in same problem.

Thanks,
Han
Dumitru Ceara March 3, 2020, 12:47 p.m. UTC | #3
CC-ing Andy and Ben.

On 3/3/20 7:43 AM, Han Zhou wrote:
> 
> 
> On Mon, Mar 2, 2020 at 7:55 AM Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>> wrote:
>>
>> On 2/29/20 12:00 AM, Dumitru Ceara wrote:
>> > If the ovsdb-server reply to "monitor_cond_since" requests has
>> > "found" == false then ovsdb_idl_db_parse_monitor_reply() calls
>> > ovsdb_idl_db_clear() which iterates through all tables and
>> > unconditionally sets table->cond_changed to false.
>> >
>> > However, if the client had already set a new condition for some of the
>> > tables, this new condition request will never be sent to ovsdb-server
>> > until the condition is reset to a different value. This is due to the
>> > check in ovsdb_idl_db_set_condition().
>> >
>> > In order to fix this we now also call ovsdb_idl_condition_clear() for
>> > each table condition in ovsdb_idl_db_clear(). This ensures that
>> > resetting the condition to the same value as before the
>> > "monitor_cond_since" reply will trigger a new request.
>> >
>> > One way to replicate the issue is described in the bugzilla reporting
>> > the bug, when ovn-controller is configured to use "ovn-monitor-all":
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6
>> >
>> > Reported-by: Dan Williams <dcbw@redhat.com <mailto:dcbw@redhat.com>>
>> > Reported-at: https://bugzilla.redhat.com/1808125
>> > CC: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
>> > Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
> connection reset.")
>> > Signed-off-by: Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>>
>>
>> Hi Han,
>>
>> I was wondering if you could review this change because ovn-kubernetes
>> is trying to enable ovn-monitor-all for scalability reasons but their CI
>> fails due to the bug fixed by the patch below.
>>
>> Thanks,
>> Dumitru
>>
>> > ---
>> >  lib/ovsdb-idl.c | 6 +++++-
>> >  1 file changed, 5 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>> > index 190143f..64721ca 100644
>> > --- a/lib/ovsdb-idl.c
>> > +++ b/lib/ovsdb-idl.c
>> > @@ -610,7 +610,11 @@ ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
>> >          struct ovsdb_idl_table *table = &db->tables[i];
>> >          struct ovsdb_idl_row *row, *next_row;
>> >
>> > -        table->cond_changed = false;
>> > +        if (table->cond_changed) {
>> > +            table->cond_changed = false;
>> > +            ovsdb_idl_condition_clear(&table->condition);
>> > +        }
>> > +
>> >          if (hmap_is_empty(&table->rows)) {
>> >              continue;
>> >          }
>> >
>>
> 
> Hi Dumitru>
> If I understand the problem, it is when the client updated monitor
> condition and at the same time it received monitor_cond_since reply from
> server for it's previous conditional monitor request, and the handling
> of the reply overwrite the flag table->cond_changed from true to false,
> and because of this, the new condition is not sent to server.
>
Hi Han,

Yes, that's what's happening.

> However, I don't understand the fix. It seems that the fix should be
> "don't overwrite the flag", instead of clearing the conditions as well.
> Also, by simply looking at this code I don't understand why the flag
> needs to be set to false when clearing data in IDL. What would be the
> problem if we keep it as what it is? If you understand the details,
> could you explain more about the fix?
> 
> In addition, I didn't change this logic when implementing
> monitor_cond_since in 403a6a0cb003 ("ovsdb-idl: Fast resync from server
> when connection reset."). In that commit it only reduced the chance of
> calling ovsdb_idl_db_clear() as an optimization. Before that commit,
> ovsdb_idl_db_parse_monitor_reply() always calls ovsdb_idl_db_clear() and
> it would result in same problem.
You're right, my bad, I messed up the "git blame", sorry about that. It
was actually introduced by:

https://github.com/openvswitch/ovs/commit/5351980b047f4dd40be7a59a1e4b910df21eca0a

I'll fix up the commit message in V2 once we decide on the best approach.

Hi Andy, Ben,

As far as I understand, whenever ovsdb_idl_db_clear() is called during
reconnect, any buffered but unsent conditions should be cleared because
they will be anyway resent with the following monitoring condition
update message after reconnect has completed.

However, clearing the "table->cond_changed" flag is not enough because
when the IDL client (ovn-controller in my case) will try to set the same
condition as before (e.g., "true" if ovn-monitor-all=true)
ovsdb_idl_db_set_condition() will return early:

https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L1518

Therefore in the cases when ovsdb_idl_db_clear() is called, we should
also clear the existing conditions on the DB tables.

As Han suggested, we could also revert the changes from 5351980b047f
("ovsdb-idl: Avoid sending redundant conditional monitoring updates")
and we wouldn't hit the issue anymore. I'm not sure however if this has
any other major implications.

What do you think?

Thanks,
Dumitru

> 
> Thanks,
> Han
Ben Pfaff March 5, 2020, 8:50 p.m. UTC | #4
On Tue, Mar 03, 2020 at 01:47:48PM +0100, Dumitru Ceara wrote:
> CC-ing Andy and Ben.
> 
> On 3/3/20 7:43 AM, Han Zhou wrote:
> > 
> > 
> > On Mon, Mar 2, 2020 at 7:55 AM Dumitru Ceara <dceara@redhat.com
> > <mailto:dceara@redhat.com>> wrote:
> >>
> >> On 2/29/20 12:00 AM, Dumitru Ceara wrote:
> >> > If the ovsdb-server reply to "monitor_cond_since" requests has
> >> > "found" == false then ovsdb_idl_db_parse_monitor_reply() calls
> >> > ovsdb_idl_db_clear() which iterates through all tables and
> >> > unconditionally sets table->cond_changed to false.
> >> >
> >> > However, if the client had already set a new condition for some of the
> >> > tables, this new condition request will never be sent to ovsdb-server
> >> > until the condition is reset to a different value. This is due to the
> >> > check in ovsdb_idl_db_set_condition().
> >> >
> >> > In order to fix this we now also call ovsdb_idl_condition_clear() for
> >> > each table condition in ovsdb_idl_db_clear(). This ensures that
> >> > resetting the condition to the same value as before the
> >> > "monitor_cond_since" reply will trigger a new request.
> >> >
> >> > One way to replicate the issue is described in the bugzilla reporting
> >> > the bug, when ovn-controller is configured to use "ovn-monitor-all":
> >> > https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6
> >> >
> >> > Reported-by: Dan Williams <dcbw@redhat.com <mailto:dcbw@redhat.com>>
> >> > Reported-at: https://bugzilla.redhat.com/1808125
> >> > CC: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
> >> > Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
> > connection reset.")
> >> > Signed-off-by: Dumitru Ceara <dceara@redhat.com
> > <mailto:dceara@redhat.com>>
> >>
> >> Hi Han,
> >>
> >> I was wondering if you could review this change because ovn-kubernetes
> >> is trying to enable ovn-monitor-all for scalability reasons but their CI
> >> fails due to the bug fixed by the patch below.
> >>
> >> Thanks,
> >> Dumitru
> >>
> >> > ---
> >> >  lib/ovsdb-idl.c | 6 +++++-
> >> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
> >> > index 190143f..64721ca 100644
> >> > --- a/lib/ovsdb-idl.c
> >> > +++ b/lib/ovsdb-idl.c
> >> > @@ -610,7 +610,11 @@ ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
> >> >          struct ovsdb_idl_table *table = &db->tables[i];
> >> >          struct ovsdb_idl_row *row, *next_row;
> >> >
> >> > -        table->cond_changed = false;
> >> > +        if (table->cond_changed) {
> >> > +            table->cond_changed = false;
> >> > +            ovsdb_idl_condition_clear(&table->condition);
> >> > +        }
> >> > +
> >> >          if (hmap_is_empty(&table->rows)) {
> >> >              continue;
> >> >          }
> >> >
> >>
> > 
> > Hi Dumitru>
> > If I understand the problem, it is when the client updated monitor
> > condition and at the same time it received monitor_cond_since reply from
> > server for it's previous conditional monitor request, and the handling
> > of the reply overwrite the flag table->cond_changed from true to false,
> > and because of this, the new condition is not sent to server.
> >
> Hi Han,
> 
> Yes, that's what's happening.
> 
> > However, I don't understand the fix. It seems that the fix should be
> > "don't overwrite the flag", instead of clearing the conditions as well.
> > Also, by simply looking at this code I don't understand why the flag
> > needs to be set to false when clearing data in IDL. What would be the
> > problem if we keep it as what it is? If you understand the details,
> > could you explain more about the fix?
> > 
> > In addition, I didn't change this logic when implementing
> > monitor_cond_since in 403a6a0cb003 ("ovsdb-idl: Fast resync from server
> > when connection reset."). In that commit it only reduced the chance of
> > calling ovsdb_idl_db_clear() as an optimization. Before that commit,
> > ovsdb_idl_db_parse_monitor_reply() always calls ovsdb_idl_db_clear() and
> > it would result in same problem.
> You're right, my bad, I messed up the "git blame", sorry about that. It
> was actually introduced by:
> 
> https://github.com/openvswitch/ovs/commit/5351980b047f4dd40be7a59a1e4b910df21eca0a
> 
> I'll fix up the commit message in V2 once we decide on the best approach.
> 
> Hi Andy, Ben,
> 
> As far as I understand, whenever ovsdb_idl_db_clear() is called during
> reconnect, any buffered but unsent conditions should be cleared because
> they will be anyway resent with the following monitoring condition
> update message after reconnect has completed.
> 
> However, clearing the "table->cond_changed" flag is not enough because
> when the IDL client (ovn-controller in my case) will try to set the same
> condition as before (e.g., "true" if ovn-monitor-all=true)
> ovsdb_idl_db_set_condition() will return early:
> 
> https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L1518
> 
> Therefore in the cases when ovsdb_idl_db_clear() is called, we should
> also clear the existing conditions on the DB tables.
> 
> As Han suggested, we could also revert the changes from 5351980b047f
> ("ovsdb-idl: Avoid sending redundant conditional monitoring updates")
> and we wouldn't hit the issue anymore. I'm not sure however if this has
> any other major implications.
> 
> What do you think?

I honestly don't understand what's going on here well enough at the
moment.  There's a lot of context.  If necessary, I can try to untangle
it, but I'd rather leave it to the two of you if you can figure it out.
Dumitru Ceara March 10, 2020, 4:46 p.m. UTC | #5
On 3/5/20 9:50 PM, Ben Pfaff wrote:
> On Tue, Mar 03, 2020 at 01:47:48PM +0100, Dumitru Ceara wrote:
>> CC-ing Andy and Ben.
>>
>> On 3/3/20 7:43 AM, Han Zhou wrote:
>>>
>>>
>>> On Mon, Mar 2, 2020 at 7:55 AM Dumitru Ceara <dceara@redhat.com
>>> <mailto:dceara@redhat.com>> wrote:
>>>>
>>>> On 2/29/20 12:00 AM, Dumitru Ceara wrote:
>>>>> If the ovsdb-server reply to "monitor_cond_since" requests has
>>>>> "found" == false then ovsdb_idl_db_parse_monitor_reply() calls
>>>>> ovsdb_idl_db_clear() which iterates through all tables and
>>>>> unconditionally sets table->cond_changed to false.
>>>>>
>>>>> However, if the client had already set a new condition for some of the
>>>>> tables, this new condition request will never be sent to ovsdb-server
>>>>> until the condition is reset to a different value. This is due to the
>>>>> check in ovsdb_idl_db_set_condition().
>>>>>
>>>>> In order to fix this we now also call ovsdb_idl_condition_clear() for
>>>>> each table condition in ovsdb_idl_db_clear(). This ensures that
>>>>> resetting the condition to the same value as before the
>>>>> "monitor_cond_since" reply will trigger a new request.
>>>>>
>>>>> One way to replicate the issue is described in the bugzilla reporting
>>>>> the bug, when ovn-controller is configured to use "ovn-monitor-all":
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6
>>>>>
>>>>> Reported-by: Dan Williams <dcbw@redhat.com <mailto:dcbw@redhat.com>>
>>>>> Reported-at: https://bugzilla.redhat.com/1808125
>>>>> CC: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
>>>>> Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
>>> connection reset.")
>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com
>>> <mailto:dceara@redhat.com>>
>>>>
>>>> Hi Han,
>>>>
>>>> I was wondering if you could review this change because ovn-kubernetes
>>>> is trying to enable ovn-monitor-all for scalability reasons but their CI
>>>> fails due to the bug fixed by the patch below.
>>>>
>>>> Thanks,
>>>> Dumitru
>>>>
>>>>> ---
>>>>>  lib/ovsdb-idl.c | 6 +++++-
>>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>>>>> index 190143f..64721ca 100644
>>>>> --- a/lib/ovsdb-idl.c
>>>>> +++ b/lib/ovsdb-idl.c
>>>>> @@ -610,7 +610,11 @@ ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
>>>>>          struct ovsdb_idl_table *table = &db->tables[i];
>>>>>          struct ovsdb_idl_row *row, *next_row;
>>>>>
>>>>> -        table->cond_changed = false;
>>>>> +        if (table->cond_changed) {
>>>>> +            table->cond_changed = false;
>>>>> +            ovsdb_idl_condition_clear(&table->condition);
>>>>> +        }
>>>>> +
>>>>>          if (hmap_is_empty(&table->rows)) {
>>>>>              continue;
>>>>>          }
>>>>>
>>>>
>>>
>>> Hi Dumitru>
>>> If I understand the problem, it is when the client updated monitor
>>> condition and at the same time it received monitor_cond_since reply from
>>> server for it's previous conditional monitor request, and the handling
>>> of the reply overwrite the flag table->cond_changed from true to false,
>>> and because of this, the new condition is not sent to server.
>>>
>> Hi Han,
>>
>> Yes, that's what's happening.
>>
>>> However, I don't understand the fix. It seems that the fix should be
>>> "don't overwrite the flag", instead of clearing the conditions as well.
>>> Also, by simply looking at this code I don't understand why the flag
>>> needs to be set to false when clearing data in IDL. What would be the
>>> problem if we keep it as what it is? If you understand the details,
>>> could you explain more about the fix?
>>>
>>> In addition, I didn't change this logic when implementing
>>> monitor_cond_since in 403a6a0cb003 ("ovsdb-idl: Fast resync from server
>>> when connection reset."). In that commit it only reduced the chance of
>>> calling ovsdb_idl_db_clear() as an optimization. Before that commit,
>>> ovsdb_idl_db_parse_monitor_reply() always calls ovsdb_idl_db_clear() and
>>> it would result in same problem.
>> You're right, my bad, I messed up the "git blame", sorry about that. It
>> was actually introduced by:
>>
>> https://github.com/openvswitch/ovs/commit/5351980b047f4dd40be7a59a1e4b910df21eca0a
>>
>> I'll fix up the commit message in V2 once we decide on the best approach.
>>
>> Hi Andy, Ben,
>>
>> As far as I understand, whenever ovsdb_idl_db_clear() is called during
>> reconnect, any buffered but unsent conditions should be cleared because
>> they will be anyway resent with the following monitoring condition
>> update message after reconnect has completed.
>>
>> However, clearing the "table->cond_changed" flag is not enough because
>> when the IDL client (ovn-controller in my case) will try to set the same
>> condition as before (e.g., "true" if ovn-monitor-all=true)
>> ovsdb_idl_db_set_condition() will return early:
>>
>> https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L1518
>>
>> Therefore in the cases when ovsdb_idl_db_clear() is called, we should
>> also clear the existing conditions on the DB tables.
>>
>> As Han suggested, we could also revert the changes from 5351980b047f
>> ("ovsdb-idl: Avoid sending redundant conditional monitoring updates")
>> and we wouldn't hit the issue anymore. I'm not sure however if this has
>> any other major implications.
>>
>> What do you think?
> 
> I honestly don't understand what's going on here well enough at the
> moment.  There's a lot of context.  If necessary, I can try to untangle
> it, but I'd rather leave it to the two of you if you can figure it out.
> 

Hi Han,

I just sent a v2 of the patch. The code changes are the same as in v1,
just that I updated the "Fixes" tag to reflect the correct commit that
introduced the issue:

https://patchwork.ozlabs.org/patch/1252278/

It would be great if you could help review this new version of the patch.

Thanks,
Dumitru
Ilya Maximets March 25, 2020, 6:21 p.m. UTC | #6
On 3/3/20 1:47 PM, Dumitru Ceara wrote:
> CC-ing Andy and Ben.
> 
> On 3/3/20 7:43 AM, Han Zhou wrote:
>>
>>
>> On Mon, Mar 2, 2020 at 7:55 AM Dumitru Ceara <dceara@redhat.com
>> <mailto:dceara@redhat.com>> wrote:
>>>
>>> On 2/29/20 12:00 AM, Dumitru Ceara wrote:
>>>> If the ovsdb-server reply to "monitor_cond_since" requests has
>>>> "found" == false then ovsdb_idl_db_parse_monitor_reply() calls
>>>> ovsdb_idl_db_clear() which iterates through all tables and
>>>> unconditionally sets table->cond_changed to false.
>>>>
>>>> However, if the client had already set a new condition for some of the
>>>> tables, this new condition request will never be sent to ovsdb-server
>>>> until the condition is reset to a different value. This is due to the
>>>> check in ovsdb_idl_db_set_condition().
>>>>
>>>> In order to fix this we now also call ovsdb_idl_condition_clear() for
>>>> each table condition in ovsdb_idl_db_clear(). This ensures that
>>>> resetting the condition to the same value as before the
>>>> "monitor_cond_since" reply will trigger a new request.
>>>>
>>>> One way to replicate the issue is described in the bugzilla reporting
>>>> the bug, when ovn-controller is configured to use "ovn-monitor-all":
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6
>>>>
>>>> Reported-by: Dan Williams <dcbw@redhat.com <mailto:dcbw@redhat.com>>
>>>> Reported-at: https://bugzilla.redhat.com/1808125
>>>> CC: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
>>>> Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
>> connection reset.")
>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com
>> <mailto:dceara@redhat.com>>
>>>
>>> Hi Han,
>>>
>>> I was wondering if you could review this change because ovn-kubernetes
>>> is trying to enable ovn-monitor-all for scalability reasons but their CI
>>> fails due to the bug fixed by the patch below.
>>>
>>> Thanks,
>>> Dumitru
>>>
>>>> ---
>>>>  lib/ovsdb-idl.c | 6 +++++-
>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>>>> index 190143f..64721ca 100644
>>>> --- a/lib/ovsdb-idl.c
>>>> +++ b/lib/ovsdb-idl.c
>>>> @@ -610,7 +610,11 @@ ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
>>>>          struct ovsdb_idl_table *table = &db->tables[i];
>>>>          struct ovsdb_idl_row *row, *next_row;
>>>>
>>>> -        table->cond_changed = false;
>>>> +        if (table->cond_changed) {
>>>> +            table->cond_changed = false;
>>>> +            ovsdb_idl_condition_clear(&table->condition);
>>>> +        }
>>>> +
>>>>          if (hmap_is_empty(&table->rows)) {
>>>>              continue;
>>>>          }
>>>>
>>>
>>
>> Hi Dumitru>
>> If I understand the problem, it is when the client updated monitor
>> condition and at the same time it received monitor_cond_since reply from
>> server for it's previous conditional monitor request, and the handling
>> of the reply overwrite the flag table->cond_changed from true to false,
>> and because of this, the new condition is not sent to server.
>>
> Hi Han,
> 
> Yes, that's what's happening.
> 
>> However, I don't understand the fix. It seems that the fix should be
>> "don't overwrite the flag", instead of clearing the conditions as well.
>> Also, by simply looking at this code I don't understand why the flag
>> needs to be set to false when clearing data in IDL. What would be the
>> problem if we keep it as what it is? If you understand the details,
>> could you explain more about the fix?
>>
>> In addition, I didn't change this logic when implementing
>> monitor_cond_since in 403a6a0cb003 ("ovsdb-idl: Fast resync from server
>> when connection reset."). In that commit it only reduced the chance of
>> calling ovsdb_idl_db_clear() as an optimization. Before that commit,
>> ovsdb_idl_db_parse_monitor_reply() always calls ovsdb_idl_db_clear() and
>> it would result in same problem.
> You're right, my bad, I messed up the "git blame", sorry about that. It
> was actually introduced by:
> 
> https://github.com/openvswitch/ovs/commit/5351980b047f4dd40be7a59a1e4b910df21eca0a
> 
> I'll fix up the commit message in V2 once we decide on the best approach.
> 
> Hi Andy, Ben,
> 
> As far as I understand, whenever ovsdb_idl_db_clear() is called during
> reconnect, any buffered but unsent conditions should be cleared because
> they will be anyway resent with the following monitoring condition
> update message after reconnect has completed.
> 
> However, clearing the "table->cond_changed" flag is not enough because
> when the IDL client (ovn-controller in my case) will try to set the same
> condition as before (e.g., "true" if ovn-monitor-all=true)
> ovsdb_idl_db_set_condition() will return early:
> 
> https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L1518
> 
> Therefore in the cases when ovsdb_idl_db_clear() is called, we should
> also clear the existing conditions on the DB tables.
> 
> As Han suggested, we could also revert the changes from 5351980b047f
> ("ovsdb-idl: Avoid sending redundant conditional monitoring updates")
> and we wouldn't hit the issue anymore. I'm not sure however if this has
> any other major implications.

If we're clearing existing conditions, we will send empty conditions in
initial monitor_cond after reconnect, AFAIU, i.e. application will stop
receiving any updates.  I think that breaks the logic that was behind commit
5351980b047f.  The application that uses idl library will have to set up
conditions from scratch in order to avoid logic changes.  I think that
reconnection process should not affect application anyhow and we can't force
application to track all the reconnections and reconfigure conditions.

So, I think that it's better to revert commit 5351980b047f instead, because
otherwise we will break its logic anyway.  And, in the end, this was just a
performance optimization.
Also, after commit 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
connection reset."),  ovsdb_idl_db_parse_monitor_reply() doesn't call
ovsdb_idl_db_clear() in most cases, i.e. the performance optimization of
sending same conditions twice might be less important.

Best regards, Ilya Maximets.
Dumitru Ceara March 25, 2020, 7:12 p.m. UTC | #7
On 3/25/20 7:21 PM, Ilya Maximets wrote:
> On 3/3/20 1:47 PM, Dumitru Ceara wrote:
>> CC-ing Andy and Ben.
>>
>> On 3/3/20 7:43 AM, Han Zhou wrote:
>>>
>>>
>>> On Mon, Mar 2, 2020 at 7:55 AM Dumitru Ceara <dceara@redhat.com
>>> <mailto:dceara@redhat.com>> wrote:
>>>>
>>>> On 2/29/20 12:00 AM, Dumitru Ceara wrote:
>>>>> If the ovsdb-server reply to "monitor_cond_since" requests has
>>>>> "found" == false then ovsdb_idl_db_parse_monitor_reply() calls
>>>>> ovsdb_idl_db_clear() which iterates through all tables and
>>>>> unconditionally sets table->cond_changed to false.
>>>>>
>>>>> However, if the client had already set a new condition for some of the
>>>>> tables, this new condition request will never be sent to ovsdb-server
>>>>> until the condition is reset to a different value. This is due to the
>>>>> check in ovsdb_idl_db_set_condition().
>>>>>
>>>>> In order to fix this we now also call ovsdb_idl_condition_clear() for
>>>>> each table condition in ovsdb_idl_db_clear(). This ensures that
>>>>> resetting the condition to the same value as before the
>>>>> "monitor_cond_since" reply will trigger a new request.
>>>>>
>>>>> One way to replicate the issue is described in the bugzilla reporting
>>>>> the bug, when ovn-controller is configured to use "ovn-monitor-all":
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6
>>>>>
>>>>> Reported-by: Dan Williams <dcbw@redhat.com <mailto:dcbw@redhat.com>>
>>>>> Reported-at: https://bugzilla.redhat.com/1808125
>>>>> CC: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
>>>>> Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
>>> connection reset.")
>>>>> Signed-off-by: Dumitru Ceara <dceara@redhat.com
>>> <mailto:dceara@redhat.com>>
>>>>
>>>> Hi Han,
>>>>
>>>> I was wondering if you could review this change because ovn-kubernetes
>>>> is trying to enable ovn-monitor-all for scalability reasons but their CI
>>>> fails due to the bug fixed by the patch below.
>>>>
>>>> Thanks,
>>>> Dumitru
>>>>
>>>>> ---
>>>>>  lib/ovsdb-idl.c | 6 +++++-
>>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>>>>> index 190143f..64721ca 100644
>>>>> --- a/lib/ovsdb-idl.c
>>>>> +++ b/lib/ovsdb-idl.c
>>>>> @@ -610,7 +610,11 @@ ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
>>>>>          struct ovsdb_idl_table *table = &db->tables[i];
>>>>>          struct ovsdb_idl_row *row, *next_row;
>>>>>
>>>>> -        table->cond_changed = false;
>>>>> +        if (table->cond_changed) {
>>>>> +            table->cond_changed = false;
>>>>> +            ovsdb_idl_condition_clear(&table->condition);
>>>>> +        }
>>>>> +
>>>>>          if (hmap_is_empty(&table->rows)) {
>>>>>              continue;
>>>>>          }
>>>>>
>>>>
>>>
>>> Hi Dumitru>
>>> If I understand the problem, it is when the client updated monitor
>>> condition and at the same time it received monitor_cond_since reply from
>>> server for it's previous conditional monitor request, and the handling
>>> of the reply overwrite the flag table->cond_changed from true to false,
>>> and because of this, the new condition is not sent to server.
>>>
>> Hi Han,
>>
>> Yes, that's what's happening.
>>
>>> However, I don't understand the fix. It seems that the fix should be
>>> "don't overwrite the flag", instead of clearing the conditions as well.
>>> Also, by simply looking at this code I don't understand why the flag
>>> needs to be set to false when clearing data in IDL. What would be the
>>> problem if we keep it as what it is? If you understand the details,
>>> could you explain more about the fix?
>>>
>>> In addition, I didn't change this logic when implementing
>>> monitor_cond_since in 403a6a0cb003 ("ovsdb-idl: Fast resync from server
>>> when connection reset."). In that commit it only reduced the chance of
>>> calling ovsdb_idl_db_clear() as an optimization. Before that commit,
>>> ovsdb_idl_db_parse_monitor_reply() always calls ovsdb_idl_db_clear() and
>>> it would result in same problem.
>> You're right, my bad, I messed up the "git blame", sorry about that. It
>> was actually introduced by:
>>
>> https://github.com/openvswitch/ovs/commit/5351980b047f4dd40be7a59a1e4b910df21eca0a
>>
>> I'll fix up the commit message in V2 once we decide on the best approach.
>>
>> Hi Andy, Ben,
>>
>> As far as I understand, whenever ovsdb_idl_db_clear() is called during
>> reconnect, any buffered but unsent conditions should be cleared because
>> they will be anyway resent with the following monitoring condition
>> update message after reconnect has completed.
>>
>> However, clearing the "table->cond_changed" flag is not enough because
>> when the IDL client (ovn-controller in my case) will try to set the same
>> condition as before (e.g., "true" if ovn-monitor-all=true)
>> ovsdb_idl_db_set_condition() will return early:
>>
>> https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L1518
>>
>> Therefore in the cases when ovsdb_idl_db_clear() is called, we should
>> also clear the existing conditions on the DB tables.
>>
>> As Han suggested, we could also revert the changes from 5351980b047f
>> ("ovsdb-idl: Avoid sending redundant conditional monitoring updates")
>> and we wouldn't hit the issue anymore. I'm not sure however if this has
>> any other major implications.
>

Hi Ilya,

Thanks for the review.

> If we're clearing existing conditions, we will send empty conditions in
> initial monitor_cond after reconnect, AFAIU, i.e. application will stop
> receiving any updates.  I think that breaks the logic that was behind commit
> 5351980b047f.  The application that uses idl library will have to set up
> conditions from scratch in order to avoid logic changes.  I think that
> reconnection process should not affect application anyhow and we can't force
> application to track all the reconnections and reconfigure conditions.
> 
Ok, makes sense. I was looking at ovn-controller as the standard
application that uses the idl library and ovn-controller tracks
reconnects and also reconfigures conditions at every iteration (also
when reconnects happen). But you're right, this change would force all
idl user apps to do the same thing.

> So, I think that it's better to revert commit 5351980b047f instead, because
> otherwise we will break its logic anyway.  And, in the end, this was just a
> performance optimization.
> Also, after commit 403a6a0cb003 ("ovsdb-idl: Fast resync from server when
> connection reset."),  ovsdb_idl_db_parse_monitor_reply() doesn't call
> ovsdb_idl_db_clear() in most cases, i.e. the performance optimization of
> sending same conditions twice might be less important.
>

Ack, I'll send a v3 that reverts 5351980b047f.

Thanks,
Dumitru


> Best regards, Ilya Maximets.
>
diff mbox series

Patch

diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
index 190143f..64721ca 100644
--- a/lib/ovsdb-idl.c
+++ b/lib/ovsdb-idl.c
@@ -610,7 +610,11 @@  ovsdb_idl_db_clear(struct ovsdb_idl_db *db)
         struct ovsdb_idl_table *table = &db->tables[i];
         struct ovsdb_idl_row *row, *next_row;
 
-        table->cond_changed = false;
+        if (table->cond_changed) {
+            table->cond_changed = false;
+            ovsdb_idl_condition_clear(&table->condition);
+        }
+
         if (hmap_is_empty(&table->rows)) {
             continue;
         }