diff mbox series

[ovs-dev,branch-20.03] chassis: Fix the way encaps are updated for a chassis record.

Message ID 1599473445-29748-1-git-send-email-dceara@redhat.com
State Rejected
Headers show
Series [ovs-dev,branch-20.03] chassis: Fix the way encaps are updated for a chassis record. | expand

Commit Message

Dumitru Ceara Sept. 7, 2020, 10:10 a.m. UTC
ovn-controller always stores the last configured system-id/chassis-id in
memory regardless if the connection to the SB is up or down. This is OK
as long as the change can be committed successfully when the SB DB
connection comes back up.

Without this change, if the chassis-id changes while the SB connection is
down, ovn-controller will fail to create the new record but nevertheless
update its in-memory chassis-id. When the SB connection is restored
ovn-controller tries to find the record corresponding to the chassis-id
it stored in memory. This fails causing ovn-controller to try to insert
a new record. But at this point a constraint violation is hit in the SB
because the Encap records of the "stale" chassis still exist in the DB,
along with the old chassis record.

This commit changes the way we search for a "stale" chassis record in the
SB to cover the above mentioned case. Also, in such cases there's no need
to recreate the Encaps, it's sufficient to update the chassis_name field.

Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to abstract the string parsing")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>

(cherry-picked from master commit 94a32fca2d2b825fece0ef5b1873459bd9857dd3)
---
 controller/chassis.c    | 60 ++++++++++++++++++++++++++++++-------------------
 tests/ovn-controller.at | 17 ++++++++++++++
 2 files changed, 54 insertions(+), 23 deletions(-)

Comments

Han Zhou Sept. 8, 2020, 7:27 a.m. UTC | #1
On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> ovn-controller always stores the last configured system-id/chassis-id in
> memory regardless if the connection to the SB is up or down. This is OK
> as long as the change can be committed successfully when the SB DB
> connection comes back up.
>
> Without this change, if the chassis-id changes while the SB connection is
> down, ovn-controller will fail to create the new record but nevertheless
> update its in-memory chassis-id. When the SB connection is restored
> ovn-controller tries to find the record corresponding to the chassis-id
> it stored in memory. This fails causing ovn-controller to try to insert
> a new record. But at this point a constraint violation is hit in the SB
> because the Encap records of the "stale" chassis still exist in the DB,
> along with the old chassis record.
>
> This commit changes the way we search for a "stale" chassis record in the
> SB to cover the above mentioned case. Also, in such cases there's no need
> to recreate the Encaps, it's sufficient to update the chassis_name field.
>
> Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to abstract the
string parsing")
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>
> (cherry-picked from master commit
94a32fca2d2b825fece0ef5b1873459bd9857dd3)

Hi Dumitru,

Sorry that I missed the original review, but this patch seems causing
problem in master already. When RBAC is enabled, it will result in below
error:
2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 for this
chassis.
2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error:
{"details":"RBAC rules for client \"chassis-1\" role \"ovn-controller\"
prohibit modification of table \"Encap\".","error":"permission error"}

You can simply reproduce by: make sandbox, and then run: ovn-nbctl
--wait=hv sync, which will hang forever because of the problem.
This patch seems to be updating the "name" field of Encap table, which
violates the design of RBAC, which uses "name" as the identity of the
client. We shouldn't allow an user to change system-id/chassis-id directly.
I think the proper way is firstly destroying the old chassis and then
creating a new chassis, which would also avoid the complex logic in
ovn-controller regarding the "stale record" handling. What do you think?

Thanks,
Han

> ---
>  controller/chassis.c    | 60
++++++++++++++++++++++++++++++-------------------
>  tests/ovn-controller.at | 17 ++++++++++++++
>  2 files changed, 54 insertions(+), 23 deletions(-)
>
> diff --git a/controller/chassis.c b/controller/chassis.c
> index 522893e..d525463 100644
> --- a/controller/chassis.c
> +++ b/controller/chassis.c
> @@ -380,10 +380,7 @@ chassis_tunnels_changed(const struct sset
*encap_type_set,
>  {
>      size_t encap_type_count = 0;
>
> -    for (int i = 0; i < chassis_rec->n_encaps; i++) {
> -        if (strcmp(chassis_rec->name,
chassis_rec->encaps[i]->chassis_name)) {
> -            return true;
> -        }
> +    for (size_t i = 0; i < chassis_rec->n_encaps; i++) {
>
>          if (!sset_contains(encap_type_set,
chassis_rec->encaps[i]->type)) {
>              return true;
> @@ -456,6 +453,19 @@ chassis_build_encaps(struct ovsdb_idl_txn
*ovnsb_idl_txn,
>  }
>
>  /*
> + * Updates encaps for a given chassis. This can happen when the chassis
> + * name has changed. Also, the only thing we support updating is the
> + * chassis_name. For other changes the encaps will be recreated.
> + */
> +static void
> +chassis_update_encaps(const struct sbrec_chassis *chassis)
> +{
> +    for (size_t i = 0; i < chassis->n_encaps; i++) {
> +        sbrec_encap_set_chassis_name(chassis->encaps[i], chassis->name);
> +    }
> +}
> +
> +/*
>   * Returns a pointer to a chassis record from 'chassis_table' that
>   * matches at least one tunnel config.
>   */
> @@ -486,9 +496,10 @@ chassis_get_stale_record(const struct
sbrec_chassis_table *chassis_table,
>  /* If this is a chassis config update after we initialized the record
once
>   * then we should always be able to find it with the ID we saved in
>   * chassis_state.
> - * Otherwise (i.e., first time we create the record) then we check if
there's
> - * a stale record from a previous controller run that didn't end
gracefully
> - * and reuse it. If not then we create a new record.
> + * Otherwise (i.e., first time we create the record or if the system-id
> + * changed) then we check if there's a stale record from a previous
> + * controller run that didn't end gracefully and reuse it. If not then we
> + * create a new record.
>   */
>  static const struct sbrec_chassis *
>  chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn,
> @@ -497,29 +508,31 @@ chassis_get_record(struct ovsdb_idl_txn
*ovnsb_idl_txn,
>                     const struct ovs_chassis_cfg *ovs_cfg,
>                     const char *chassis_id)
>  {
> -    const struct sbrec_chassis *chassis_rec;
> +    const struct sbrec_chassis *chassis = NULL;
>
>      if (chassis_info_id_inited(&chassis_state)) {
> -        chassis_rec = chassis_lookup_by_name(sbrec_chassis_by_name,
> -
chassis_info_id(&chassis_state));
> -        if (!chassis_rec) {
> -            VLOG_DBG("Could not find Chassis, will create it"
> -                     ": stored (%s) ovs (%s)",
> +        chassis = chassis_lookup_by_name(sbrec_chassis_by_name,
> +
chassis_info_id(&chassis_state));
> +        if (!chassis) {
> +            VLOG_DBG("Could not find Chassis, will check if the id
changed: "
> +                     "stored (%s) ovs (%s)",
>                       chassis_info_id(&chassis_state), chassis_id);
> -            if (ovnsb_idl_txn) {
> -                /* Recreate the chassis record.  */
> -                chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn);
> -            }
>          }
> -    } else {
> -        chassis_rec =
> -            chassis_get_stale_record(chassis_table, ovs_cfg, chassis_id);
> +    }
>
> -        if (!chassis_rec && ovnsb_idl_txn) {
> -            chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn);
> +    if (!chassis) {
> +        chassis = chassis_get_stale_record(chassis_table, ovs_cfg,
chassis_id);
> +    }
> +
> +    if (!chassis) {
> +        /* Recreate the chassis record. */
> +        VLOG_DBG("Could not find Chassis, will create it: %s",
chassis_id);
> +        if (ovnsb_idl_txn) {
> +            return sbrec_chassis_insert(ovnsb_idl_txn);
>          }
>      }
> -    return chassis_rec;
> +
> +    return chassis;
>  }
>
>  /* Update a Chassis record based on the config in the ovs config. */
> @@ -567,6 +580,7 @@ chassis_update(const struct sbrec_chassis
*chassis_rec,
>                                  &ovs_cfg->encap_ip_set,
ovs_cfg->encap_csum,
>                                  chassis_rec);
>      if (!tunnels_changed) {
> +        chassis_update_encaps(chassis_rec);
>          return;
>      }
>
> diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> index 63b2581..1c7aa58 100644
> --- a/tests/ovn-controller.at
> +++ b/tests/ovn-controller.at
> @@ -196,6 +196,23 @@ OVS_WAIT_UNTIL([
>      test "${sysid}" = "${chassis_id}"
>  ])
>
> +# Simulate system-id changing while ovn-controller is disconnected from
the
> +# SB.
> +valid_remote=$(ovs-vsctl get Open_vSwitch . external_ids:ovn-remote)
> +invalid_remote=tcp:0.0.0.0:4242
> +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${invalid_remote}
> +expected_state="not connected"
> +OVS_WAIT_UNTIL([
> +    test "${expected_state}" = "$(ovn-appctl -t ovn-controller
connection-status)"
> +])
> +sysid=${sysid}-bar
> +ovs-vsctl set Open_vSwitch . external-ids:system-id="${sysid}"
> +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${valid_remote}
> +OVS_WAIT_UNTIL([
> +    chassis_id=$(ovn-sbctl get Chassis "${sysid}" name)
> +    test "${sysid}" = "${chassis_id}"
> +])
> +
>  # Gracefully terminate daemons
>  OVN_CLEANUP_SBOX([hv])
>  OVN_CLEANUP_VSWITCH([main])
> --
> 1.8.3.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Numan Siddique Sept. 8, 2020, 7:46 a.m. UTC | #2
On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com> wrote:

> On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > ovn-controller always stores the last configured system-id/chassis-id in
> > memory regardless if the connection to the SB is up or down. This is OK
> > as long as the change can be committed successfully when the SB DB
> > connection comes back up.
> >
> > Without this change, if the chassis-id changes while the SB connection is
> > down, ovn-controller will fail to create the new record but nevertheless
> > update its in-memory chassis-id. When the SB connection is restored
> > ovn-controller tries to find the record corresponding to the chassis-id
> > it stored in memory. This fails causing ovn-controller to try to insert
> > a new record. But at this point a constraint violation is hit in the SB
> > because the Encap records of the "stale" chassis still exist in the DB,
> > along with the old chassis record.
> >
> > This commit changes the way we search for a "stale" chassis record in the
> > SB to cover the above mentioned case. Also, in such cases there's no need
> > to recreate the Encaps, it's sufficient to update the chassis_name field.
> >
> > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to abstract the
> string parsing")
> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >
> > (cherry-picked from master commit
> 94a32fca2d2b825fece0ef5b1873459bd9857dd3)
>
> Hi Dumitru,
>
> Sorry that I missed the original review, but this patch seems causing
> problem in master already. When RBAC is enabled, it will result in below
> error:
> 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 for this
> chassis.
> 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error:
> {"details":"RBAC rules for client \"chassis-1\" role \"ovn-controller\"
> prohibit modification of table \"Encap\".","error":"permission error"}
>
> You can simply reproduce by: make sandbox, and then run: ovn-nbctl
> --wait=hv sync, which will hang forever because of the problem.
> This patch seems to be updating the "name" field of Encap table, which
> violates the design of RBAC, which uses "name" as the identity of the
> client. We shouldn't allow an user to change system-id/chassis-id directly.
> I think the proper way is firstly destroying the old chassis and then
> creating a new chassis, which would also avoid the complex logic in
> ovn-controller regarding the "stale record" handling. What do you think?
>
>
Hi Han,

Yes. Dumitru submitted the patch to fix this issue in master and I applied
it just now.
Can you please test again with the latest master.

https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34

We see the same issue with branch-20.06 too.

Thanks
Numan

Thanks,
> Han
>
> > ---
> >  controller/chassis.c    | 60
> ++++++++++++++++++++++++++++++-------------------
> >  tests/ovn-controller.at | 17 ++++++++++++++
> >  2 files changed, 54 insertions(+), 23 deletions(-)
> >
> > diff --git a/controller/chassis.c b/controller/chassis.c
> > index 522893e..d525463 100644
> > --- a/controller/chassis.c
> > +++ b/controller/chassis.c
> > @@ -380,10 +380,7 @@ chassis_tunnels_changed(const struct sset
> *encap_type_set,
> >  {
> >      size_t encap_type_count = 0;
> >
> > -    for (int i = 0; i < chassis_rec->n_encaps; i++) {
> > -        if (strcmp(chassis_rec->name,
> chassis_rec->encaps[i]->chassis_name)) {
> > -            return true;
> > -        }
> > +    for (size_t i = 0; i < chassis_rec->n_encaps; i++) {
> >
> >          if (!sset_contains(encap_type_set,
> chassis_rec->encaps[i]->type)) {
> >              return true;
> > @@ -456,6 +453,19 @@ chassis_build_encaps(struct ovsdb_idl_txn
> *ovnsb_idl_txn,
> >  }
> >
> >  /*
> > + * Updates encaps for a given chassis. This can happen when the chassis
> > + * name has changed. Also, the only thing we support updating is the
> > + * chassis_name. For other changes the encaps will be recreated.
> > + */
> > +static void
> > +chassis_update_encaps(const struct sbrec_chassis *chassis)
> > +{
> > +    for (size_t i = 0; i < chassis->n_encaps; i++) {
> > +        sbrec_encap_set_chassis_name(chassis->encaps[i], chassis->name);
> > +    }
> > +}
> > +
> > +/*
> >   * Returns a pointer to a chassis record from 'chassis_table' that
> >   * matches at least one tunnel config.
> >   */
> > @@ -486,9 +496,10 @@ chassis_get_stale_record(const struct
> sbrec_chassis_table *chassis_table,
> >  /* If this is a chassis config update after we initialized the record
> once
> >   * then we should always be able to find it with the ID we saved in
> >   * chassis_state.
> > - * Otherwise (i.e., first time we create the record) then we check if
> there's
> > - * a stale record from a previous controller run that didn't end
> gracefully
> > - * and reuse it. If not then we create a new record.
> > + * Otherwise (i.e., first time we create the record or if the system-id
> > + * changed) then we check if there's a stale record from a previous
> > + * controller run that didn't end gracefully and reuse it. If not then
> we
> > + * create a new record.
> >   */
> >  static const struct sbrec_chassis *
> >  chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn,
> > @@ -497,29 +508,31 @@ chassis_get_record(struct ovsdb_idl_txn
> *ovnsb_idl_txn,
> >                     const struct ovs_chassis_cfg *ovs_cfg,
> >                     const char *chassis_id)
> >  {
> > -    const struct sbrec_chassis *chassis_rec;
> > +    const struct sbrec_chassis *chassis = NULL;
> >
> >      if (chassis_info_id_inited(&chassis_state)) {
> > -        chassis_rec = chassis_lookup_by_name(sbrec_chassis_by_name,
> > -
> chassis_info_id(&chassis_state));
> > -        if (!chassis_rec) {
> > -            VLOG_DBG("Could not find Chassis, will create it"
> > -                     ": stored (%s) ovs (%s)",
> > +        chassis = chassis_lookup_by_name(sbrec_chassis_by_name,
> > +
> chassis_info_id(&chassis_state));
> > +        if (!chassis) {
> > +            VLOG_DBG("Could not find Chassis, will check if the id
> changed: "
> > +                     "stored (%s) ovs (%s)",
> >                       chassis_info_id(&chassis_state), chassis_id);
> > -            if (ovnsb_idl_txn) {
> > -                /* Recreate the chassis record.  */
> > -                chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn);
> > -            }
> >          }
> > -    } else {
> > -        chassis_rec =
> > -            chassis_get_stale_record(chassis_table, ovs_cfg,
> chassis_id);
> > +    }
> >
> > -        if (!chassis_rec && ovnsb_idl_txn) {
> > -            chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn);
> > +    if (!chassis) {
> > +        chassis = chassis_get_stale_record(chassis_table, ovs_cfg,
> chassis_id);
> > +    }
> > +
> > +    if (!chassis) {
> > +        /* Recreate the chassis record. */
> > +        VLOG_DBG("Could not find Chassis, will create it: %s",
> chassis_id);
> > +        if (ovnsb_idl_txn) {
> > +            return sbrec_chassis_insert(ovnsb_idl_txn);
> >          }
> >      }
> > -    return chassis_rec;
> > +
> > +    return chassis;
> >  }
> >
> >  /* Update a Chassis record based on the config in the ovs config. */
> > @@ -567,6 +580,7 @@ chassis_update(const struct sbrec_chassis
> *chassis_rec,
> >                                  &ovs_cfg->encap_ip_set,
> ovs_cfg->encap_csum,
> >                                  chassis_rec);
> >      if (!tunnels_changed) {
> > +        chassis_update_encaps(chassis_rec);
> >          return;
> >      }
> >
> > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> > index 63b2581..1c7aa58 100644
> > --- a/tests/ovn-controller.at
> > +++ b/tests/ovn-controller.at
> > @@ -196,6 +196,23 @@ OVS_WAIT_UNTIL([
> >      test "${sysid}" = "${chassis_id}"
> >  ])
> >
> > +# Simulate system-id changing while ovn-controller is disconnected from
> the
> > +# SB.
> > +valid_remote=$(ovs-vsctl get Open_vSwitch . external_ids:ovn-remote)
> > +invalid_remote=tcp:0.0.0.0:4242
> > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${invalid_remote}
> > +expected_state="not connected"
> > +OVS_WAIT_UNTIL([
> > +    test "${expected_state}" = "$(ovn-appctl -t ovn-controller
> connection-status)"
> > +])
> > +sysid=${sysid}-bar
> > +ovs-vsctl set Open_vSwitch . external-ids:system-id="${sysid}"
> > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${valid_remote}
> > +OVS_WAIT_UNTIL([
> > +    chassis_id=$(ovn-sbctl get Chassis "${sysid}" name)
> > +    test "${sysid}" = "${chassis_id}"
> > +])
> > +
> >  # Gracefully terminate daemons
> >  OVN_CLEANUP_SBOX([hv])
> >  OVN_CLEANUP_VSWITCH([main])
> > --
> > 1.8.3.1
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>
Numan Siddique Sept. 8, 2020, 7:47 a.m. UTC | #3
On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org> wrote:

>
>
> On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com> wrote:
>
>> On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com> wrote:
>> >
>> > ovn-controller always stores the last configured system-id/chassis-id in
>> > memory regardless if the connection to the SB is up or down. This is OK
>> > as long as the change can be committed successfully when the SB DB
>> > connection comes back up.
>> >
>> > Without this change, if the chassis-id changes while the SB connection
>> is
>> > down, ovn-controller will fail to create the new record but nevertheless
>> > update its in-memory chassis-id. When the SB connection is restored
>> > ovn-controller tries to find the record corresponding to the chassis-id
>> > it stored in memory. This fails causing ovn-controller to try to insert
>> > a new record. But at this point a constraint violation is hit in the SB
>> > because the Encap records of the "stale" chassis still exist in the DB,
>> > along with the old chassis record.
>> >
>> > This commit changes the way we search for a "stale" chassis record in
>> the
>> > SB to cover the above mentioned case. Also, in such cases there's no
>> need
>> > to recreate the Encaps, it's sufficient to update the chassis_name
>> field.
>> >
>> > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to abstract the
>> string parsing")
>> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> >
>> > (cherry-picked from master commit
>> 94a32fca2d2b825fece0ef5b1873459bd9857dd3)
>>
>> Hi Dumitru,
>>
>> Sorry that I missed the original review, but this patch seems causing
>> problem in master already. When RBAC is enabled, it will result in below
>> error:
>> 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 for this
>> chassis.
>> 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error:
>> {"details":"RBAC rules for client \"chassis-1\" role \"ovn-controller\"
>> prohibit modification of table \"Encap\".","error":"permission error"}
>>
>> You can simply reproduce by: make sandbox, and then run: ovn-nbctl
>> --wait=hv sync, which will hang forever because of the problem.
>> This patch seems to be updating the "name" field of Encap table, which
>> violates the design of RBAC, which uses "name" as the identity of the
>> client. We shouldn't allow an user to change system-id/chassis-id
>> directly.
>> I think the proper way is firstly destroying the old chassis and then
>> creating a new chassis, which would also avoid the complex logic in
>> ovn-controller regarding the "stale record" handling. What do you think?
>>
>>
> Hi Han,
>
> Yes. Dumitru submitted the patch to fix this issue in master and I applied
> it just now.
> Can you please test again with the latest master.
>
>
> https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
>
> We see the same issue with branch-20.06 too.
>


Hi Dumitru,

Can you please make a 2 patch series for branch-20.03 which includes this
patch and the other one
which fixes the RBAC issue.

Thanks
Numan


>
> Thanks
> Numan
>
> Thanks,
>> Han
>>
>> > ---
>> >  controller/chassis.c    | 60
>> ++++++++++++++++++++++++++++++-------------------
>> >  tests/ovn-controller.at | 17 ++++++++++++++
>> >  2 files changed, 54 insertions(+), 23 deletions(-)
>> >
>> > diff --git a/controller/chassis.c b/controller/chassis.c
>> > index 522893e..d525463 100644
>> > --- a/controller/chassis.c
>> > +++ b/controller/chassis.c
>> > @@ -380,10 +380,7 @@ chassis_tunnels_changed(const struct sset
>> *encap_type_set,
>> >  {
>> >      size_t encap_type_count = 0;
>> >
>> > -    for (int i = 0; i < chassis_rec->n_encaps; i++) {
>> > -        if (strcmp(chassis_rec->name,
>> chassis_rec->encaps[i]->chassis_name)) {
>> > -            return true;
>> > -        }
>> > +    for (size_t i = 0; i < chassis_rec->n_encaps; i++) {
>> >
>> >          if (!sset_contains(encap_type_set,
>> chassis_rec->encaps[i]->type)) {
>> >              return true;
>> > @@ -456,6 +453,19 @@ chassis_build_encaps(struct ovsdb_idl_txn
>> *ovnsb_idl_txn,
>> >  }
>> >
>> >  /*
>> > + * Updates encaps for a given chassis. This can happen when the chassis
>> > + * name has changed. Also, the only thing we support updating is the
>> > + * chassis_name. For other changes the encaps will be recreated.
>> > + */
>> > +static void
>> > +chassis_update_encaps(const struct sbrec_chassis *chassis)
>> > +{
>> > +    for (size_t i = 0; i < chassis->n_encaps; i++) {
>> > +        sbrec_encap_set_chassis_name(chassis->encaps[i],
>> chassis->name);
>> > +    }
>> > +}
>> > +
>> > +/*
>> >   * Returns a pointer to a chassis record from 'chassis_table' that
>> >   * matches at least one tunnel config.
>> >   */
>> > @@ -486,9 +496,10 @@ chassis_get_stale_record(const struct
>> sbrec_chassis_table *chassis_table,
>> >  /* If this is a chassis config update after we initialized the record
>> once
>> >   * then we should always be able to find it with the ID we saved in
>> >   * chassis_state.
>> > - * Otherwise (i.e., first time we create the record) then we check if
>> there's
>> > - * a stale record from a previous controller run that didn't end
>> gracefully
>> > - * and reuse it. If not then we create a new record.
>> > + * Otherwise (i.e., first time we create the record or if the system-id
>> > + * changed) then we check if there's a stale record from a previous
>> > + * controller run that didn't end gracefully and reuse it. If not then
>> we
>> > + * create a new record.
>> >   */
>> >  static const struct sbrec_chassis *
>> >  chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn,
>> > @@ -497,29 +508,31 @@ chassis_get_record(struct ovsdb_idl_txn
>> *ovnsb_idl_txn,
>> >                     const struct ovs_chassis_cfg *ovs_cfg,
>> >                     const char *chassis_id)
>> >  {
>> > -    const struct sbrec_chassis *chassis_rec;
>> > +    const struct sbrec_chassis *chassis = NULL;
>> >
>> >      if (chassis_info_id_inited(&chassis_state)) {
>> > -        chassis_rec = chassis_lookup_by_name(sbrec_chassis_by_name,
>> > -
>> chassis_info_id(&chassis_state));
>> > -        if (!chassis_rec) {
>> > -            VLOG_DBG("Could not find Chassis, will create it"
>> > -                     ": stored (%s) ovs (%s)",
>> > +        chassis = chassis_lookup_by_name(sbrec_chassis_by_name,
>> > +
>> chassis_info_id(&chassis_state));
>> > +        if (!chassis) {
>> > +            VLOG_DBG("Could not find Chassis, will check if the id
>> changed: "
>> > +                     "stored (%s) ovs (%s)",
>> >                       chassis_info_id(&chassis_state), chassis_id);
>> > -            if (ovnsb_idl_txn) {
>> > -                /* Recreate the chassis record.  */
>> > -                chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn);
>> > -            }
>> >          }
>> > -    } else {
>> > -        chassis_rec =
>> > -            chassis_get_stale_record(chassis_table, ovs_cfg,
>> chassis_id);
>> > +    }
>> >
>> > -        if (!chassis_rec && ovnsb_idl_txn) {
>> > -            chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn);
>> > +    if (!chassis) {
>> > +        chassis = chassis_get_stale_record(chassis_table, ovs_cfg,
>> chassis_id);
>> > +    }
>> > +
>> > +    if (!chassis) {
>> > +        /* Recreate the chassis record. */
>> > +        VLOG_DBG("Could not find Chassis, will create it: %s",
>> chassis_id);
>> > +        if (ovnsb_idl_txn) {
>> > +            return sbrec_chassis_insert(ovnsb_idl_txn);
>> >          }
>> >      }
>> > -    return chassis_rec;
>> > +
>> > +    return chassis;
>> >  }
>> >
>> >  /* Update a Chassis record based on the config in the ovs config. */
>> > @@ -567,6 +580,7 @@ chassis_update(const struct sbrec_chassis
>> *chassis_rec,
>> >                                  &ovs_cfg->encap_ip_set,
>> ovs_cfg->encap_csum,
>> >                                  chassis_rec);
>> >      if (!tunnels_changed) {
>> > +        chassis_update_encaps(chassis_rec);
>> >          return;
>> >      }
>> >
>> > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
>> > index 63b2581..1c7aa58 100644
>> > --- a/tests/ovn-controller.at
>> > +++ b/tests/ovn-controller.at
>> > @@ -196,6 +196,23 @@ OVS_WAIT_UNTIL([
>> >      test "${sysid}" = "${chassis_id}"
>> >  ])
>> >
>> > +# Simulate system-id changing while ovn-controller is disconnected from
>> the
>> > +# SB.
>> > +valid_remote=$(ovs-vsctl get Open_vSwitch . external_ids:ovn-remote)
>> > +invalid_remote=tcp:0.0.0.0:4242
>> > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${invalid_remote}
>> > +expected_state="not connected"
>> > +OVS_WAIT_UNTIL([
>> > +    test "${expected_state}" = "$(ovn-appctl -t ovn-controller
>> connection-status)"
>> > +])
>> > +sysid=${sysid}-bar
>> > +ovs-vsctl set Open_vSwitch . external-ids:system-id="${sysid}"
>> > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${valid_remote}
>> > +OVS_WAIT_UNTIL([
>> > +    chassis_id=$(ovn-sbctl get Chassis "${sysid}" name)
>> > +    test "${sysid}" = "${chassis_id}"
>> > +])
>> > +
>> >  # Gracefully terminate daemons
>> >  OVN_CLEANUP_SBOX([hv])
>> >  OVN_CLEANUP_VSWITCH([main])
>> > --
>> > 1.8.3.1
>> >
>> > _______________________________________________
>> > dev mailing list
>> > dev@openvswitch.org
>> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
>>
Dumitru Ceara Sept. 8, 2020, 9:24 a.m. UTC | #4
On 9/8/20 9:47 AM, Numan Siddique wrote:
> 
> 
> On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org
> <mailto:numans@ovn.org>> wrote:
> 
> 
> 
>     On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com
>     <mailto:zhouhan@gmail.com>> wrote:
> 
>         On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com
>         <mailto:dceara@redhat.com>> wrote:
>         >
>         > ovn-controller always stores the last configured
>         system-id/chassis-id in
>         > memory regardless if the connection to the SB is up or down.
>         This is OK
>         > as long as the change can be committed successfully when the SB DB
>         > connection comes back up.
>         >
>         > Without this change, if the chassis-id changes while the SB
>         connection is
>         > down, ovn-controller will fail to create the new record but
>         nevertheless
>         > update its in-memory chassis-id. When the SB connection is
>         restored
>         > ovn-controller tries to find the record corresponding to the
>         chassis-id
>         > it stored in memory. This fails causing ovn-controller to try
>         to insert
>         > a new record. But at this point a constraint violation is hit
>         in the SB
>         > because the Encap records of the "stale" chassis still exist
>         in the DB,
>         > along with the old chassis record.
>         >
>         > This commit changes the way we search for a "stale" chassis
>         record in the
>         > SB to cover the above mentioned case. Also, in such cases
>         there's no need
>         > to recreate the Encaps, it's sufficient to update the
>         chassis_name field.
>         >
>         > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to
>         abstract the
>         string parsing")
>         > Signed-off-by: Dumitru Ceara <dceara@redhat.com
>         <mailto:dceara@redhat.com>>
>         >
>         > (cherry-picked from master commit
>         94a32fca2d2b825fece0ef5b1873459bd9857dd3)
> 
>         Hi Dumitru,
> 

Hi Han, Numan,

>         Sorry that I missed the original review, but this patch seems
>         causing
>         problem in master already. When RBAC is enabled, it will result
>         in below
>         error:
>         2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1
>         for this
>         chassis.
>         2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error:
>         {"details":"RBAC rules for client \"chassis-1\" role
>         \"ovn-controller\"
>         prohibit modification of table \"Encap\".","error":"permission
>         error"}
> 
>         You can simply reproduce by: make sandbox, and then run: ovn-nbctl
>         --wait=hv sync, which will hang forever because of the problem.
>         This patch seems to be updating the "name" field of Encap table,
>         which
>         violates the design of RBAC, which uses "name" as the identity
>         of the
>         client. We shouldn't allow an user to change
>         system-id/chassis-id directly.

How can we enforce this from within OVN?

>         I think the proper way is firstly destroying the old chassis and
>         then
>         creating a new chassis, which would also avoid the complex logic in
>         ovn-controller regarding the "stale record" handling. What do
>         you think?
> 
> 

I'm afraid this would be a behavior change that might introduce bugs in
CMSs that allow changing system-ids (without RBAC enabled) without
explicitly deleting the "stale" Chassis *and* Chassis_private record.

Also, forcing CMS to do this housekeeping will probably add complex code
in the CMS. For example, if the "old" Chassis record is deleted before
ovn-controller receives the update about the system-id change in OVS DB
then ovn-controller will just recreate it.

>     Hi Han,
> 
>     Yes. Dumitru submitted the patch to fix this issue in master and I
>     applied it just now.
>     Can you please test again with the latest master.
> 
>     https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
> 
>     We see the same issue with branch-20.06 too.
> 
> 
> 
> Hi Dumitru,
> 
> Can you please make a 2 patch series for branch-20.03 which includes
> this patch and the other one
> which fixes the RBAC issue.
> 
> Thanks
> Numan
>  

Like you said, we have two options:

1. Leave it up to the CMS to cleanup Chassis/Chassis_private when
system-id changes and completely ignore in ovn-controller any stale
chassis records that might exist in the SB DB. In this case we'd just
let transactions fail due to constraint violations if the CMS cleanup of
the Chassis/Chassis_private records doesn't happen or is incorrect.

2. Make it more user friendly for the case when RBAC is _not used_ and
try to cleanup after ourselves in ovn-controller and reuse the stale
records. This is already happening with the code on the master branch
and if we cherry pick [0] as Numan suggested and add it to this series
it would be the case for branch-20.03 too. [0] would have to go to
branch-20.06 as well.

I would say, at least for now, the least intrusive change, with the
least chance of breaking CMSs, is #2 above but I'll wait for
confirmation before sending a new version of this patch.

Thanks,
Dumitru

[0]
https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
Han Zhou Sept. 8, 2020, 5:43 p.m. UTC | #5
On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 9/8/20 9:47 AM, Numan Siddique wrote:
> >
> >
> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org
> > <mailto:numans@ovn.org>> wrote:
> >
> >
> >
> >     On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com
> >     <mailto:zhouhan@gmail.com>> wrote:
> >
> >         On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com
> >         <mailto:dceara@redhat.com>> wrote:
> >         >
> >         > ovn-controller always stores the last configured
> >         system-id/chassis-id in
> >         > memory regardless if the connection to the SB is up or down.
> >         This is OK
> >         > as long as the change can be committed successfully when the
SB DB
> >         > connection comes back up.
> >         >
> >         > Without this change, if the chassis-id changes while the SB
> >         connection is
> >         > down, ovn-controller will fail to create the new record but
> >         nevertheless
> >         > update its in-memory chassis-id. When the SB connection is
> >         restored
> >         > ovn-controller tries to find the record corresponding to the
> >         chassis-id
> >         > it stored in memory. This fails causing ovn-controller to try
> >         to insert
> >         > a new record. But at this point a constraint violation is hit
> >         in the SB
> >         > because the Encap records of the "stale" chassis still exist
> >         in the DB,
> >         > along with the old chassis record.
> >         >
> >         > This commit changes the way we search for a "stale" chassis
> >         record in the
> >         > SB to cover the above mentioned case. Also, in such cases
> >         there's no need
> >         > to recreate the Encaps, it's sufficient to update the
> >         chassis_name field.
> >         >
> >         > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to
> >         abstract the
> >         string parsing")
> >         > Signed-off-by: Dumitru Ceara <dceara@redhat.com
> >         <mailto:dceara@redhat.com>>
> >         >
> >         > (cherry-picked from master commit
> >         94a32fca2d2b825fece0ef5b1873459bd9857dd3)
> >
> >         Hi Dumitru,
> >
>
> Hi Han, Numan,
>
> >         Sorry that I missed the original review, but this patch seems
> >         causing
> >         problem in master already. When RBAC is enabled, it will result
> >         in below
> >         error:
> >         2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1
> >         for this
> >         chassis.
> >         2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error:
> >         {"details":"RBAC rules for client \"chassis-1\" role
> >         \"ovn-controller\"
> >         prohibit modification of table \"Encap\".","error":"permission
> >         error"}
> >
> >         You can simply reproduce by: make sandbox, and then run:
ovn-nbctl
> >         --wait=hv sync, which will hang forever because of the problem.
> >         This patch seems to be updating the "name" field of Encap table,
> >         which
> >         violates the design of RBAC, which uses "name" as the identity
> >         of the
> >         client. We shouldn't allow an user to change
> >         system-id/chassis-id directly.
>
> How can we enforce this from within OVN?
>
My suggestion is don't enforce this from OVN. We can simply document this
clearly. It would be good to prevent misuse as much as we can, but there
are so many ways to break OVN if users don't follow the correct procedure,
and we can't prevent all of them. Changing system-id is something I think
very abnormal and if someone does it purposely they must know what they are
doing. The logic is becoming more and more complex in OVN just for this
misuse case, so I am thinking about simplifying this from the requirement
point of view, unless there is a decent way of solving the problem.

> >         I think the proper way is firstly destroying the old chassis and
> >         then
> >         creating a new chassis, which would also avoid the complex
logic in
> >         ovn-controller regarding the "stale record" handling. What do
> >         you think?
> >
> >
>
> I'm afraid this would be a behavior change that might introduce bugs in
> CMSs that allow changing system-ids (without RBAC enabled) without
> explicitly deleting the "stale" Chassis *and* Chassis_private record.
>
> Also, forcing CMS to do this housekeeping will probably add complex code
> in the CMS. For example, if the "old" Chassis record is deleted before
> ovn-controller receives the update about the system-id change in OVS DB
> then ovn-controller will just recreate it.
>
Is changing system-id a common operation? I couldn't think about a real use
case when we need to change system-id. Also, it doesn't make much sense to
me if the solution supports changing it only when RBAC is disabled. In such
case we will at least document more clearly what's supported in RBAC case
and non-RBAC case, which could be more confusing than just don't support
changing system-id. For corner cases, would it be sufficient to just
document the proper steps of destroying and recreating a chassis whenever
necessary?

> >     Hi Han,
> >
> >     Yes. Dumitru submitted the patch to fix this issue in master and I
> >     applied it just now.
> >     Can you please test again with the latest master.
> >
> >
https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
> >
> >     We see the same issue with branch-20.06 too.
> >
> >
> >
> > Hi Dumitru,
> >
> > Can you please make a 2 patch series for branch-20.03 which includes
> > this patch and the other one
> > which fixes the RBAC issue.
> >
> > Thanks
> > Numan
> >
>
> Like you said, we have two options:
>
> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when
> system-id changes and completely ignore in ovn-controller any stale
> chassis records that might exist in the SB DB. In this case we'd just
> let transactions fail due to constraint violations if the CMS cleanup of
> the Chassis/Chassis_private records doesn't happen or is incorrect.
>
> 2. Make it more user friendly for the case when RBAC is _not used_ and
> try to cleanup after ourselves in ovn-controller and reuse the stale
> records. This is already happening with the code on the master branch
> and if we cherry pick [0] as Numan suggested and add it to this series
> it would be the case for branch-20.03 too. [0] would have to go to
> branch-20.06 as well.
>
> I would say, at least for now, the least intrusive change, with the
> least chance of breaking CMSs, is #2 above but I'll wait for
> confirmation before sending a new version of this patch.

Unfortunately [0] is still broken. The steps to reproduce:
    $ make sandbox
    $ ovs-vsctl set open . external_ids:system-id=new-chassis
    $ ovs-vsctl set open . external_ids:system-id=new-chassis-2

The problem is, the "name" column is the RBAC identity of a ovn-controller
client. RBAC defines what operations are allowed for an authenticated
client, but the identity itself should never be changed. Otherwise, there
is no point for RBAC.

Thanks,
Han

>
> Thanks,
> Dumitru
>
> [0]
>
https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
>
Dumitru Ceara Sept. 8, 2020, 6:33 p.m. UTC | #6
On 9/8/20 7:43 PM, Han Zhou wrote:
> 
> 
> On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>> wrote:
>>
>> On 9/8/20 9:47 AM, Numan Siddique wrote:
>> >
>> >
>> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org
> <mailto:numans@ovn.org>
>> > <mailto:numans@ovn.org <mailto:numans@ovn.org>>> wrote:
>> >
>> >
>> >
>> >     On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com
> <mailto:zhouhan@gmail.com>
>> >     <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>>> wrote:
>> >
>> >         On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara
> <dceara@redhat.com <mailto:dceara@redhat.com>
>> >         <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote:
>> >         >
>> >         > ovn-controller always stores the last configured
>> >         system-id/chassis-id in
>> >         > memory regardless if the connection to the SB is up or down.
>> >         This is OK
>> >         > as long as the change can be committed successfully when
> the SB DB
>> >         > connection comes back up.
>> >         >
>> >         > Without this change, if the chassis-id changes while the SB
>> >         connection is
>> >         > down, ovn-controller will fail to create the new record but
>> >         nevertheless
>> >         > update its in-memory chassis-id. When the SB connection is
>> >         restored
>> >         > ovn-controller tries to find the record corresponding to the
>> >         chassis-id
>> >         > it stored in memory. This fails causing ovn-controller to try
>> >         to insert
>> >         > a new record. But at this point a constraint violation is hit
>> >         in the SB
>> >         > because the Encap records of the "stale" chassis still exist
>> >         in the DB,
>> >         > along with the old chassis record.
>> >         >
>> >         > This commit changes the way we search for a "stale" chassis
>> >         record in the
>> >         > SB to cover the above mentioned case. Also, in such cases
>> >         there's no need
>> >         > to recreate the Encaps, it's sufficient to update the
>> >         chassis_name field.
>> >         >
>> >         > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to
>> >         abstract the
>> >         string parsing")
>> >         > Signed-off-by: Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>
>> >         <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>>
>> >         >
>> >         > (cherry-picked from master commit
>> >         94a32fca2d2b825fece0ef5b1873459bd9857dd3)
>> >
>> >         Hi Dumitru,
>> >
>>
>> Hi Han, Numan,
>>
>> >         Sorry that I missed the original review, but this patch seems
>> >         causing
>> >         problem in master already. When RBAC is enabled, it will result
>> >         in below
>> >         error:
>> >         2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1
>> >         for this
>> >         chassis.
>> >         2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error:
>> >         {"details":"RBAC rules for client \"chassis-1\" role
>> >         \"ovn-controller\"
>> >         prohibit modification of table \"Encap\".","error":"permission
>> >         error"}
>> >
>> >         You can simply reproduce by: make sandbox, and then run:
> ovn-nbctl
>> >         --wait=hv sync, which will hang forever because of the problem.
>> >         This patch seems to be updating the "name" field of Encap table,
>> >         which
>> >         violates the design of RBAC, which uses "name" as the identity
>> >         of the
>> >         client. We shouldn't allow an user to change
>> >         system-id/chassis-id directly.
>>
>> How can we enforce this from within OVN?
>>
> My suggestion is don't enforce this from OVN. We can simply document
> this clearly. It would be good to prevent misuse as much as we can, but
> there are so many ways to break OVN if users don't follow the correct
> procedure, and we can't prevent all of them. Changing system-id is
> something I think very abnormal and if someone does it purposely they
> must know what they are doing. The logic is becoming more and more
> complex in OVN just for this misuse case, so I am thinking about
> simplifying this from the requirement point of view, unless there is a
> decent way of solving the problem.
> 
>> >         I think the proper way is firstly destroying the old chassis and
>> >         then
>> >         creating a new chassis, which would also avoid the complex
> logic in
>> >         ovn-controller regarding the "stale record" handling. What do
>> >         you think?
>> >
>> >
>>
>> I'm afraid this would be a behavior change that might introduce bugs in
>> CMSs that allow changing system-ids (without RBAC enabled) without
>> explicitly deleting the "stale" Chassis *and* Chassis_private record.
>>
>> Also, forcing CMS to do this housekeeping will probably add complex code
>> in the CMS. For example, if the "old" Chassis record is deleted before
>> ovn-controller receives the update about the system-id change in OVS DB
>> then ovn-controller will just recreate it.
>>
> Is changing system-id a common operation? I couldn't think about a real
> use case when we need to change system-id. Also, it doesn't make much
> sense to me if the solution supports changing it only when RBAC is
> disabled. In such case we will at least document more clearly what's
> supported in RBAC case and non-RBAC case, which could be more confusing
> than just don't support changing system-id. For corner cases, would it
> be sufficient to just document the proper steps of destroying and
> recreating a chassis whenever necessary?
> 

Right, it shouldn't happen very often but it may happen.

>> >     Hi Han,
>> >
>> >     Yes. Dumitru submitted the patch to fix this issue in master and I
>> >     applied it just now.
>> >     Can you please test again with the latest master.
>> >
>> >    
> https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
>> >
>> >     We see the same issue with branch-20.06 too.
>> >
>> >
>> >
>> > Hi Dumitru,
>> >
>> > Can you please make a 2 patch series for branch-20.03 which includes
>> > this patch and the other one
>> > which fixes the RBAC issue.
>> >
>> > Thanks
>> > Numan
>> >  
>>
>> Like you said, we have two options:
>>
>> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when
>> system-id changes and completely ignore in ovn-controller any stale
>> chassis records that might exist in the SB DB. In this case we'd just
>> let transactions fail due to constraint violations if the CMS cleanup of
>> the Chassis/Chassis_private records doesn't happen or is incorrect.
>>
>> 2. Make it more user friendly for the case when RBAC is _not used_ and
>> try to cleanup after ourselves in ovn-controller and reuse the stale
>> records. This is already happening with the code on the master branch
>> and if we cherry pick [0] as Numan suggested and add it to this series
>> it would be the case for branch-20.03 too. [0] would have to go to
>> branch-20.06 as well.
>>
>> I would say, at least for now, the least intrusive change, with the
>> least chance of breaking CMSs, is #2 above but I'll wait for
>> confirmation before sending a new version of this patch.
> 
> Unfortunately [0] is still broken. The steps to reproduce:
>     $ make sandbox
>     $ ovs-vsctl set open . external_ids:system-id=new-chassis
>     $ ovs-vsctl set open . external_ids:system-id=new-chassis-2
> 
> The problem is, the "name" column is the RBAC identity of a
> ovn-controller client. RBAC defines what operations are allowed for an
> authenticated client, but the identity itself should never be changed.
> Otherwise, there is no point for RBAC.
> 

OK but you're changing the system-id with RBAC enabled which we decided
already that is not working and can't be fixed.

In any case, you're right, this is getting too complicated and we
clearly can't handle all cases without operator help so I'll send a new
patch (on master first) to:
- document that changing system-id is something that can't be handled
completely in ovn-controller on its own and will need operator help to
cleanup stale records in SB.
- document how changing the system-id can be performed with RBAC enabled
(I think we need to change the SSL certificates too to use the new
system-id).
- remove the parts of the code in chassis.c that reuse stale
Chassis/Chassis_private records.

What do you think?

Thanks,
Dumitru
Han Zhou Sept. 8, 2020, 6:45 p.m. UTC | #7
On Tue, Sep 8, 2020 at 11:33 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 9/8/20 7:43 PM, Han Zhou wrote:
> >
> >
> > On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara@redhat.com
> > <mailto:dceara@redhat.com>> wrote:
> >>
> >> On 9/8/20 9:47 AM, Numan Siddique wrote:
> >> >
> >> >
> >> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org
> > <mailto:numans@ovn.org>
> >> > <mailto:numans@ovn.org <mailto:numans@ovn.org>>> wrote:
> >> >
> >> >
> >> >
> >> >     On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com
> > <mailto:zhouhan@gmail.com>
> >> >     <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>>> wrote:
> >> >
> >> >         On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara
> > <dceara@redhat.com <mailto:dceara@redhat.com>
> >> >         <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote:
> >> >         >
> >> >         > ovn-controller always stores the last configured
> >> >         system-id/chassis-id in
> >> >         > memory regardless if the connection to the SB is up or
down.
> >> >         This is OK
> >> >         > as long as the change can be committed successfully when
> > the SB DB
> >> >         > connection comes back up.
> >> >         >
> >> >         > Without this change, if the chassis-id changes while the SB
> >> >         connection is
> >> >         > down, ovn-controller will fail to create the new record but
> >> >         nevertheless
> >> >         > update its in-memory chassis-id. When the SB connection is
> >> >         restored
> >> >         > ovn-controller tries to find the record corresponding to
the
> >> >         chassis-id
> >> >         > it stored in memory. This fails causing ovn-controller to
try
> >> >         to insert
> >> >         > a new record. But at this point a constraint violation is
hit
> >> >         in the SB
> >> >         > because the Encap records of the "stale" chassis still
exist
> >> >         in the DB,
> >> >         > along with the old chassis record.
> >> >         >
> >> >         > This commit changes the way we search for a "stale" chassis
> >> >         record in the
> >> >         > SB to cover the above mentioned case. Also, in such cases
> >> >         there's no need
> >> >         > to recreate the Encaps, it's sufficient to update the
> >> >         chassis_name field.
> >> >         >
> >> >         > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to
> >> >         abstract the
> >> >         string parsing")
> >> >         > Signed-off-by: Dumitru Ceara <dceara@redhat.com
> > <mailto:dceara@redhat.com>
> >> >         <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>>
> >> >         >
> >> >         > (cherry-picked from master commit
> >> >         94a32fca2d2b825fece0ef5b1873459bd9857dd3)
> >> >
> >> >         Hi Dumitru,
> >> >
> >>
> >> Hi Han, Numan,
> >>
> >> >         Sorry that I missed the original review, but this patch seems
> >> >         causing
> >> >         problem in master already. When RBAC is enabled, it will
result
> >> >         in below
> >> >         error:
> >> >         2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport
lsp1
> >> >         for this
> >> >         chassis.
> >> >         2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction
error:
> >> >         {"details":"RBAC rules for client \"chassis-1\" role
> >> >         \"ovn-controller\"
> >> >         prohibit modification of table
\"Encap\".","error":"permission
> >> >         error"}
> >> >
> >> >         You can simply reproduce by: make sandbox, and then run:
> > ovn-nbctl
> >> >         --wait=hv sync, which will hang forever because of the
problem.
> >> >         This patch seems to be updating the "name" field of Encap
table,
> >> >         which
> >> >         violates the design of RBAC, which uses "name" as the
identity
> >> >         of the
> >> >         client. We shouldn't allow an user to change
> >> >         system-id/chassis-id directly.
> >>
> >> How can we enforce this from within OVN?
> >>
> > My suggestion is don't enforce this from OVN. We can simply document
> > this clearly. It would be good to prevent misuse as much as we can, but
> > there are so many ways to break OVN if users don't follow the correct
> > procedure, and we can't prevent all of them. Changing system-id is
> > something I think very abnormal and if someone does it purposely they
> > must know what they are doing. The logic is becoming more and more
> > complex in OVN just for this misuse case, so I am thinking about
> > simplifying this from the requirement point of view, unless there is a
> > decent way of solving the problem.
> >
> >> >         I think the proper way is firstly destroying the old chassis
and
> >> >         then
> >> >         creating a new chassis, which would also avoid the complex
> > logic in
> >> >         ovn-controller regarding the "stale record" handling. What do
> >> >         you think?
> >> >
> >> >
> >>
> >> I'm afraid this would be a behavior change that might introduce bugs in
> >> CMSs that allow changing system-ids (without RBAC enabled) without
> >> explicitly deleting the "stale" Chassis *and* Chassis_private record.
> >>
> >> Also, forcing CMS to do this housekeeping will probably add complex
code
> >> in the CMS. For example, if the "old" Chassis record is deleted before
> >> ovn-controller receives the update about the system-id change in OVS DB
> >> then ovn-controller will just recreate it.
> >>
> > Is changing system-id a common operation? I couldn't think about a real
> > use case when we need to change system-id. Also, it doesn't make much
> > sense to me if the solution supports changing it only when RBAC is
> > disabled. In such case we will at least document more clearly what's
> > supported in RBAC case and non-RBAC case, which could be more confusing
> > than just don't support changing system-id. For corner cases, would it
> > be sufficient to just document the proper steps of destroying and
> > recreating a chassis whenever necessary?
> >
>
> Right, it shouldn't happen very often but it may happen.
>
> >> >     Hi Han,
> >> >
> >> >     Yes. Dumitru submitted the patch to fix this issue in master and
I
> >> >     applied it just now.
> >> >     Can you please test again with the latest master.
> >> >
> >> >
> >
https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
> >> >
> >> >     We see the same issue with branch-20.06 too.
> >> >
> >> >
> >> >
> >> > Hi Dumitru,
> >> >
> >> > Can you please make a 2 patch series for branch-20.03 which includes
> >> > this patch and the other one
> >> > which fixes the RBAC issue.
> >> >
> >> > Thanks
> >> > Numan
> >> >
> >>
> >> Like you said, we have two options:
> >>
> >> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when
> >> system-id changes and completely ignore in ovn-controller any stale
> >> chassis records that might exist in the SB DB. In this case we'd just
> >> let transactions fail due to constraint violations if the CMS cleanup
of
> >> the Chassis/Chassis_private records doesn't happen or is incorrect.
> >>
> >> 2. Make it more user friendly for the case when RBAC is _not used_ and
> >> try to cleanup after ourselves in ovn-controller and reuse the stale
> >> records. This is already happening with the code on the master branch
> >> and if we cherry pick [0] as Numan suggested and add it to this series
> >> it would be the case for branch-20.03 too. [0] would have to go to
> >> branch-20.06 as well.
> >>
> >> I would say, at least for now, the least intrusive change, with the
> >> least chance of breaking CMSs, is #2 above but I'll wait for
> >> confirmation before sending a new version of this patch.
> >
> > Unfortunately [0] is still broken. The steps to reproduce:
> >     $ make sandbox
> >     $ ovs-vsctl set open . external_ids:system-id=new-chassis
> >     $ ovs-vsctl set open . external_ids:system-id=new-chassis-2
> >
> > The problem is, the "name" column is the RBAC identity of a
> > ovn-controller client. RBAC defines what operations are allowed for an
> > authenticated client, but the identity itself should never be changed.
> > Otherwise, there is no point for RBAC.
> >
>
> OK but you're changing the system-id with RBAC enabled which we decided
> already that is not working and can't be fixed.
>
> In any case, you're right, this is getting too complicated and we
> clearly can't handle all cases without operator help so I'll send a new
> patch (on master first) to:
> - document that changing system-id is something that can't be handled
> completely in ovn-controller on its own and will need operator help to
> cleanup stale records in SB.

Thanks Dumitru. The plan sounds good to me. Let's be clear what will be
supported in OVN and what will not. IMHO, I'd say we don't support directly
changing system-id. If it needs to be changed, follow the procedure of
deleting and recreating a new chassis, which can be described in detail for
RBAC and non-RBAC respectively. Is this the same as what you are about to
address?

> - document how changing the system-id can be performed with RBAC enabled
> (I think we need to change the SSL certificates too to use the new
> system-id).
> - remove the parts of the code in chassis.c that reuse stale
> Chassis/Chassis_private records.
>
> What do you think?
>
> Thanks,
> Dumitru
>
>
Dumitru Ceara Sept. 10, 2020, 7 a.m. UTC | #8
On 9/8/20 8:45 PM, Han Zhou wrote:
> 
> 
> 
> On Tue, Sep 8, 2020 at 11:33 AM Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>> wrote:
>>
>> On 9/8/20 7:43 PM, Han Zhou wrote:
>> >
>> >
>> > On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>
>> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote:
>> >>
>> >> On 9/8/20 9:47 AM, Numan Siddique wrote:
>> >> >
>> >> >
>> >> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org
> <mailto:numans@ovn.org>
>> > <mailto:numans@ovn.org <mailto:numans@ovn.org>>
>> >> > <mailto:numans@ovn.org <mailto:numans@ovn.org>
> <mailto:numans@ovn.org <mailto:numans@ovn.org>>>> wrote:
>> >> >
>> >> >
>> >> >
>> >> >     On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com
> <mailto:zhouhan@gmail.com>
>> > <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>>
>> >> >     <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>
> <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>>>> wrote:
>> >> >
>> >> >         On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara
>> > <dceara@redhat.com <mailto:dceara@redhat.com>
> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>
>> >> >         <mailto:dceara@redhat.com <mailto:dceara@redhat.com>
> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>>> wrote:
>> >> >         >
>> >> >         > ovn-controller always stores the last configured
>> >> >         system-id/chassis-id in
>> >> >         > memory regardless if the connection to the SB is up or
> down.
>> >> >         This is OK
>> >> >         > as long as the change can be committed successfully when
>> > the SB DB
>> >> >         > connection comes back up.
>> >> >         >
>> >> >         > Without this change, if the chassis-id changes while the SB
>> >> >         connection is
>> >> >         > down, ovn-controller will fail to create the new record but
>> >> >         nevertheless
>> >> >         > update its in-memory chassis-id. When the SB connection is
>> >> >         restored
>> >> >         > ovn-controller tries to find the record corresponding
> to the
>> >> >         chassis-id
>> >> >         > it stored in memory. This fails causing ovn-controller
> to try
>> >> >         to insert
>> >> >         > a new record. But at this point a constraint violation
> is hit
>> >> >         in the SB
>> >> >         > because the Encap records of the "stale" chassis still
> exist
>> >> >         in the DB,
>> >> >         > along with the old chassis record.
>> >> >         >
>> >> >         > This commit changes the way we search for a "stale" chassis
>> >> >         record in the
>> >> >         > SB to cover the above mentioned case. Also, in such cases
>> >> >         there's no need
>> >> >         > to recreate the Encaps, it's sufficient to update the
>> >> >         chassis_name field.
>> >> >         >
>> >> >         > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to
>> >> >         abstract the
>> >> >         string parsing")
>> >> >         > Signed-off-by: Dumitru Ceara <dceara@redhat.com
> <mailto:dceara@redhat.com>
>> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>
>> >> >         <mailto:dceara@redhat.com <mailto:dceara@redhat.com>
> <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>>>
>> >> >         >
>> >> >         > (cherry-picked from master commit
>> >> >         94a32fca2d2b825fece0ef5b1873459bd9857dd3)
>> >> >
>> >> >         Hi Dumitru,
>> >> >
>> >>
>> >> Hi Han, Numan,
>> >>
>> >> >         Sorry that I missed the original review, but this patch seems
>> >> >         causing
>> >> >         problem in master already. When RBAC is enabled, it will
> result
>> >> >         in below
>> >> >         error:
>> >> >         2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming
> lport lsp1
>> >> >         for this
>> >> >         chassis.
>> >> >         2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction
> error:
>> >> >         {"details":"RBAC rules for client \"chassis-1\" role
>> >> >         \"ovn-controller\"
>> >> >         prohibit modification of table
> \"Encap\".","error":"permission
>> >> >         error"}
>> >> >
>> >> >         You can simply reproduce by: make sandbox, and then run:
>> > ovn-nbctl
>> >> >         --wait=hv sync, which will hang forever because of the
> problem.
>> >> >         This patch seems to be updating the "name" field of Encap
> table,
>> >> >         which
>> >> >         violates the design of RBAC, which uses "name" as the
> identity
>> >> >         of the
>> >> >         client. We shouldn't allow an user to change
>> >> >         system-id/chassis-id directly.
>> >>
>> >> How can we enforce this from within OVN?
>> >>
>> > My suggestion is don't enforce this from OVN. We can simply document
>> > this clearly. It would be good to prevent misuse as much as we can, but
>> > there are so many ways to break OVN if users don't follow the correct
>> > procedure, and we can't prevent all of them. Changing system-id is
>> > something I think very abnormal and if someone does it purposely they
>> > must know what they are doing. The logic is becoming more and more
>> > complex in OVN just for this misuse case, so I am thinking about
>> > simplifying this from the requirement point of view, unless there is a
>> > decent way of solving the problem.
>> >
>> >> >         I think the proper way is firstly destroying the old
> chassis and
>> >> >         then
>> >> >         creating a new chassis, which would also avoid the complex
>> > logic in
>> >> >         ovn-controller regarding the "stale record" handling. What do
>> >> >         you think?
>> >> >
>> >> >
>> >>
>> >> I'm afraid this would be a behavior change that might introduce bugs in
>> >> CMSs that allow changing system-ids (without RBAC enabled) without
>> >> explicitly deleting the "stale" Chassis *and* Chassis_private record.
>> >>
>> >> Also, forcing CMS to do this housekeeping will probably add complex
> code
>> >> in the CMS. For example, if the "old" Chassis record is deleted before
>> >> ovn-controller receives the update about the system-id change in OVS DB
>> >> then ovn-controller will just recreate it.
>> >>
>> > Is changing system-id a common operation? I couldn't think about a real
>> > use case when we need to change system-id. Also, it doesn't make much
>> > sense to me if the solution supports changing it only when RBAC is
>> > disabled. In such case we will at least document more clearly what's
>> > supported in RBAC case and non-RBAC case, which could be more confusing
>> > than just don't support changing system-id. For corner cases, would it
>> > be sufficient to just document the proper steps of destroying and
>> > recreating a chassis whenever necessary?
>> >
>>
>> Right, it shouldn't happen very often but it may happen.
>>
>> >> >     Hi Han,
>> >> >
>> >> >     Yes. Dumitru submitted the patch to fix this issue in master
> and I
>> >> >     applied it just now.
>> >> >     Can you please test again with the latest master.
>> >> >
>> >> >    
>> >
> https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
>> >> >
>> >> >     We see the same issue with branch-20.06 too.
>> >> >
>> >> >
>> >> >
>> >> > Hi Dumitru,
>> >> >
>> >> > Can you please make a 2 patch series for branch-20.03 which includes
>> >> > this patch and the other one
>> >> > which fixes the RBAC issue.
>> >> >
>> >> > Thanks
>> >> > Numan
>> >> >  
>> >>
>> >> Like you said, we have two options:
>> >>
>> >> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when
>> >> system-id changes and completely ignore in ovn-controller any stale
>> >> chassis records that might exist in the SB DB. In this case we'd just
>> >> let transactions fail due to constraint violations if the CMS
> cleanup of
>> >> the Chassis/Chassis_private records doesn't happen or is incorrect.
>> >>
>> >> 2. Make it more user friendly for the case when RBAC is _not used_ and
>> >> try to cleanup after ourselves in ovn-controller and reuse the stale
>> >> records. This is already happening with the code on the master branch
>> >> and if we cherry pick [0] as Numan suggested and add it to this series
>> >> it would be the case for branch-20.03 too. [0] would have to go to
>> >> branch-20.06 as well.
>> >>
>> >> I would say, at least for now, the least intrusive change, with the
>> >> least chance of breaking CMSs, is #2 above but I'll wait for
>> >> confirmation before sending a new version of this patch.
>> >
>> > Unfortunately [0] is still broken. The steps to reproduce:
>> >     $ make sandbox
>> >     $ ovs-vsctl set open . external_ids:system-id=new-chassis
>> >     $ ovs-vsctl set open . external_ids:system-id=new-chassis-2
>> >
>> > The problem is, the "name" column is the RBAC identity of a
>> > ovn-controller client. RBAC defines what operations are allowed for an
>> > authenticated client, but the identity itself should never be changed.
>> > Otherwise, there is no point for RBAC.
>> >
>>
>> OK but you're changing the system-id with RBAC enabled which we decided
>> already that is not working and can't be fixed.
>>
>> In any case, you're right, this is getting too complicated and we
>> clearly can't handle all cases without operator help so I'll send a new
>> patch (on master first) to:
>> - document that changing system-id is something that can't be handled
>> completely in ovn-controller on its own and will need operator help to
>> cleanup stale records in SB.
> 
> Thanks Dumitru. The plan sounds good to me. Let's be clear what will be
> supported in OVN and what will not. IMHO, I'd say we don't support
> directly changing system-id. If it needs to be changed, follow the
> procedure of deleting and recreating a new chassis, which can be
> described in detail for RBAC and non-RBAC respectively. Is this the same
> as what you are about to address?
> 

Exactly.

Thanks,
Dumitru

>> - document how changing the system-id can be performed with RBAC enabled
>> (I think we need to change the SSL certificates too to use the new
>> system-id).
>> - remove the parts of the code in chassis.c that reuse stale
>> Chassis/Chassis_private records.
>>
>> What do you think?
>>
>> Thanks,
>> Dumitru
>>
>>
diff mbox series

Patch

diff --git a/controller/chassis.c b/controller/chassis.c
index 522893e..d525463 100644
--- a/controller/chassis.c
+++ b/controller/chassis.c
@@ -380,10 +380,7 @@  chassis_tunnels_changed(const struct sset *encap_type_set,
 {
     size_t encap_type_count = 0;
 
-    for (int i = 0; i < chassis_rec->n_encaps; i++) {
-        if (strcmp(chassis_rec->name, chassis_rec->encaps[i]->chassis_name)) {
-            return true;
-        }
+    for (size_t i = 0; i < chassis_rec->n_encaps; i++) {
 
         if (!sset_contains(encap_type_set, chassis_rec->encaps[i]->type)) {
             return true;
@@ -456,6 +453,19 @@  chassis_build_encaps(struct ovsdb_idl_txn *ovnsb_idl_txn,
 }
 
 /*
+ * Updates encaps for a given chassis. This can happen when the chassis
+ * name has changed. Also, the only thing we support updating is the
+ * chassis_name. For other changes the encaps will be recreated.
+ */
+static void
+chassis_update_encaps(const struct sbrec_chassis *chassis)
+{
+    for (size_t i = 0; i < chassis->n_encaps; i++) {
+        sbrec_encap_set_chassis_name(chassis->encaps[i], chassis->name);
+    }
+}
+
+/*
  * Returns a pointer to a chassis record from 'chassis_table' that
  * matches at least one tunnel config.
  */
@@ -486,9 +496,10 @@  chassis_get_stale_record(const struct sbrec_chassis_table *chassis_table,
 /* If this is a chassis config update after we initialized the record once
  * then we should always be able to find it with the ID we saved in
  * chassis_state.
- * Otherwise (i.e., first time we create the record) then we check if there's
- * a stale record from a previous controller run that didn't end gracefully
- * and reuse it. If not then we create a new record.
+ * Otherwise (i.e., first time we create the record or if the system-id
+ * changed) then we check if there's a stale record from a previous
+ * controller run that didn't end gracefully and reuse it. If not then we
+ * create a new record.
  */
 static const struct sbrec_chassis *
 chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn,
@@ -497,29 +508,31 @@  chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn,
                    const struct ovs_chassis_cfg *ovs_cfg,
                    const char *chassis_id)
 {
-    const struct sbrec_chassis *chassis_rec;
+    const struct sbrec_chassis *chassis = NULL;
 
     if (chassis_info_id_inited(&chassis_state)) {
-        chassis_rec = chassis_lookup_by_name(sbrec_chassis_by_name,
-                                             chassis_info_id(&chassis_state));
-        if (!chassis_rec) {
-            VLOG_DBG("Could not find Chassis, will create it"
-                     ": stored (%s) ovs (%s)",
+        chassis = chassis_lookup_by_name(sbrec_chassis_by_name,
+                                         chassis_info_id(&chassis_state));
+        if (!chassis) {
+            VLOG_DBG("Could not find Chassis, will check if the id changed: "
+                     "stored (%s) ovs (%s)",
                      chassis_info_id(&chassis_state), chassis_id);
-            if (ovnsb_idl_txn) {
-                /* Recreate the chassis record.  */
-                chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn);
-            }
         }
-    } else {
-        chassis_rec =
-            chassis_get_stale_record(chassis_table, ovs_cfg, chassis_id);
+    }
 
-        if (!chassis_rec && ovnsb_idl_txn) {
-            chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn);
+    if (!chassis) {
+        chassis = chassis_get_stale_record(chassis_table, ovs_cfg, chassis_id);
+    }
+
+    if (!chassis) {
+        /* Recreate the chassis record. */
+        VLOG_DBG("Could not find Chassis, will create it: %s", chassis_id);
+        if (ovnsb_idl_txn) {
+            return sbrec_chassis_insert(ovnsb_idl_txn);
         }
     }
-    return chassis_rec;
+
+    return chassis;
 }
 
 /* Update a Chassis record based on the config in the ovs config. */
@@ -567,6 +580,7 @@  chassis_update(const struct sbrec_chassis *chassis_rec,
                                 &ovs_cfg->encap_ip_set, ovs_cfg->encap_csum,
                                 chassis_rec);
     if (!tunnels_changed) {
+        chassis_update_encaps(chassis_rec);
         return;
     }
 
diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
index 63b2581..1c7aa58 100644
--- a/tests/ovn-controller.at
+++ b/tests/ovn-controller.at
@@ -196,6 +196,23 @@  OVS_WAIT_UNTIL([
     test "${sysid}" = "${chassis_id}"
 ])
 
+# Simulate system-id changing while ovn-controller is disconnected from the
+# SB.
+valid_remote=$(ovs-vsctl get Open_vSwitch . external_ids:ovn-remote)
+invalid_remote=tcp:0.0.0.0:4242
+ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${invalid_remote}
+expected_state="not connected"
+OVS_WAIT_UNTIL([
+    test "${expected_state}" = "$(ovn-appctl -t ovn-controller connection-status)"
+])
+sysid=${sysid}-bar
+ovs-vsctl set Open_vSwitch . external-ids:system-id="${sysid}"
+ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${valid_remote}
+OVS_WAIT_UNTIL([
+    chassis_id=$(ovn-sbctl get Chassis "${sysid}" name)
+    test "${sysid}" = "${chassis_id}"
+])
+
 # Gracefully terminate daemons
 OVN_CLEANUP_SBOX([hv])
 OVN_CLEANUP_VSWITCH([main])