
[ovs-dev] ovn-controller: Don't enable conditional monitoring until connected.

Message ID 20220822091918.2804852-1-i.maximets@ovn.org
State Changes Requested
Series [ovs-dev] ovn-controller: Don't enable conditional monitoring until connected.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_ovn-kubernetes success github build: passed
ovsrobot/github-robot-_Build_and_Test fail github build: failed

Commit Message

Ilya Maximets Aug. 22, 2022, 9:19 a.m. UTC
The daemon_started_recently() concept is flawed in that it uses a
fixed number of iterations for a countdown and a fixed timeout, so
the undesired removal of configured resources can still happen under
certain conditions even with this mechanism in place.

The root cause of the original problem is that conditional monitoring
in ovn-controller works in stages.  ovn-controller doesn't know all
the information it needs to monitor until the first monitor reply
arrives; only then does it adjust the conditions and receive all the
data.  That doesn't work well with the ovsdb_idl_has_ever_connected()
check, which returns 'true' right after the first monitor reply.
This leads to a situation where resources can be removed after the
first monitor reply and re-added after the condition change.

Instead of waiting a fixed number of iterations/seconds, the correct
solution is to always start with unconditional monitoring and switch
to conditional monitoring once the full initial snapshot has been
received and the data is complete.  This way the
ovsdb_idl_has_ever_connected() check provides an accurate answer to
the question of whether we have a sufficient amount of data to make
decisions about resource removal.

Once we have switched to conditional monitoring, the southbound
database will remove all the unnecessary data from ovn-controller, so
it will not need to keep that data for long.  And since the setup is
using conditional monitoring, the data set is likely not large enough
for a brief spike in memory consumption on startup to be a concern.

This change will also remove unnecessary waiting on startup when all
the data is already available.

The main change is the ovsdb_idl_has_ever_connected() check in
update_sb_monitors(); the rest is mostly a revert of the commits that
introduced the startup delay.

The test is enhanced to run both with and without conditional
monitoring, since the issue is closely related to it.

An alternative solution might be to store the sequence number of the
first condition change and replace all ovsdb_idl_has_ever_connected()
calls with a function that also checks that the current sequence
number is higher than that of the condition change, but that would
require slightly more code.  It can be implemented in the future if
the first unconditional monitor reply ever becomes a problem.

Fixes: 2b27eb3482a3 ("patch.c: Avoid patch interface deletion & recreation during restart.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 TODO.rst                    |  7 -----
 controller/ovn-controller.c | 31 ++++++++--------------
 controller/patch.c          |  9 +------
 controller/vif-plug.c       | 33 +++++------------------
 controller/vif-plug.h       |  1 +
 lib/inc-proc-eng.c          | 11 --------
 lib/inc-proc-eng.h          |  4 ---
 lib/ovn-util.c              | 52 -------------------------------------
 lib/ovn-util.h              |  4 ---
 tests/ovn-controller.at     |  3 ++-
 tests/ovn.at                |  2 --
 11 files changed, 22 insertions(+), 135 deletions(-)

Comments

Dumitru Ceara Aug. 22, 2022, 10:21 a.m. UTC | #1
Hi Ilya,

On 8/22/22 11:19, Ilya Maximets wrote:
> daemon_started_recently() concept is flawed in terms that it uses
> fixed number of iterations for a countdown and a fixed timeout, so
> the undesired removal of configured resources can still happen under
> certain conditions even with this mechanism in place.
> 
> The root cause of the original problem is that conditional monitoring
> in ovn-controller works in stages.  ovn-controller doesn't know all
> the information it needs to monitor until the first monitor reply
> arrives, then it adjusts conditions and receives all the data.  That
> doesn't work well with the ovsdb_idl_has_ever_connected() check that
> will return 'true' right after the first monitor reply.
> This leads to situation where resources can be removed after the
> first monitor reply and re-added back after condition change.
> 
> Instead of waiting a fixed number of iterations/seconds, the correct
> solution should be to always start with unconditional monitoring.
> When the full initial snapshot is received, switch to conditional
> monitoring as data is complete.  This way ovsdb_idl_has_ever_connected()
> check will provide an accurate answer to the question: if we have
> a sufficient amount of data to make decisions about resource removal?

This does sound like the cleanest solution, I agree.  But does it create
an issue in large scale deployments where conditional monitoring is
used?  The full initial SB snapshot is likely quite large.

> 
> Once switched to conditional monitoring, southbound database will
> remove all the unnecessary data from the ovn-controller, so it will
> not need to keep it for long.  And since setup is using conditional
> monitoring, it's likely not that big anyway to be concerned about
> a brief spike in memory consumption on startup.
> 

For OpenShift and RedHat OpenStack this stands, we use
ovn-monitor-all=true for now.  But I'm not sure about other large scale
deployments.  I'll let Han comment more on this too.

> This change will also remove unnecessary waiting on startup when all
> the data is already available.
> 
> The main change is in the ovsdb_idl_has_ever_connected() check in the
> update_sb_monitors(), the rest is mostly a revert of commits that
> introduced the startup delay.
> 
> Test enhanced to run with and w/o conditional monitoring, since the
> issue is closely related it it.
> 

Good idea.

> Alternative solution might be to store the first condition change
> sequence number and replace all ovsdb_idl_has_ever_connected() calls
> with a function that also checks for the current sequence number being
> higher than condition change one, but that will require slightly more
> code.  Can be implemented in the future, if the first unconditional
> monitor reply will become a problem some day.
> 
> Fixes: 2b27eb3482a3 ("patch.c: Avoid patch interface deletion & recreation during restart.")

It seems to me we need to do something similar for OpenFlow clearing
too.  Right now we wait a configured amount of time:

https://github.com/ovn-org/ovn/commit/896adfd2d

Regards,
Dumitru

> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  TODO.rst                    |  7 -----
>  controller/ovn-controller.c | 31 ++++++++--------------
>  controller/patch.c          |  9 +------
>  controller/vif-plug.c       | 33 +++++------------------
>  controller/vif-plug.h       |  1 +
>  lib/inc-proc-eng.c          | 11 --------
>  lib/inc-proc-eng.h          |  4 ---
>  lib/ovn-util.c              | 52 -------------------------------------
>  lib/ovn-util.h              |  4 ---
>  tests/ovn-controller.at     |  3 ++-
>  tests/ovn.at                |  2 --
>  11 files changed, 22 insertions(+), 135 deletions(-)
> 
> diff --git a/TODO.rst b/TODO.rst
> index 12738280b..d54817011 100644
> --- a/TODO.rst
> +++ b/TODO.rst
> @@ -172,10 +172,3 @@ OVN To-do List
>    * Multi-threaded logical flow computation was optimized for the case
>      when datapath groups are disabled.  Datpath groups are always enabled
>      now so northd parallel processing should be revisited.
> -
> -* ovn-controller daemon module
> -
> -  * Dumitru Ceara: Add a new module e.g. ovn/lib/daemon-ovn.c that wraps
> -    OVS' daemonize_start() call and initializes the additional things, like
> -    the unixctl commands. Or, we should move the APIs such as
> -    daemon_started_recently() to OVS's lib/daemon.
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index 0b0ccc48a..2ac6409bc 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -94,7 +94,6 @@ static unixctl_cb_func debug_dump_lflow_conj_ids;
>  static unixctl_cb_func lflow_cache_flush_cmd;
>  static unixctl_cb_func lflow_cache_show_stats_cmd;
>  static unixctl_cb_func debug_delay_nb_cfg_report;
> -static unixctl_cb_func debug_ignore_startup_delay;
>  
>  #define DEFAULT_BRIDGE_NAME "br-int"
>  #define DEFAULT_DATAPATH "system"
> @@ -190,7 +189,14 @@ update_sb_monitors(struct ovsdb_idl *ovnsb_idl,
>       * avoid the unnecessarily extra wake-ups of ovn-controller. */
>      ovsdb_idl_condition_add_clause_true(&ldpg);
>  
> -    if (monitor_all) {
> +    /* Don't enable conditional monitoring until we have a full view on a
> +     * data that we need to monitor.  Without that we will send a first
> +     * monitor request with incomplete conditions, receive a reply, update
> +     * conditions and receive the rest of data.  Such a split may result in
> +     * removal and re-addition of certain resources (e.g. patch ports)
> +     * on restart of ovn-controller.  Database will send deletion requiests
> +     * for all the data that we don't actually need. */
> +    if (monitor_all || !ovsdb_idl_has_ever_connected(ovnsb_idl)) {
>          ovsdb_idl_condition_add_clause_true(&pb);
>          ovsdb_idl_condition_add_clause_true(&lf);
>          ovsdb_idl_condition_add_clause_true(&mb);
> @@ -867,12 +873,11 @@ static void
>  store_nb_cfg(struct ovsdb_idl_txn *sb_txn, struct ovsdb_idl_txn *ovs_txn,
>               const struct sbrec_chassis_private *chassis,
>               const struct ovsrec_bridge *br_int,
> -             unsigned int delay_nb_cfg_report)
> +             unsigned int delay_nb_cfg_report, int64_t startup_ts)
>  {
>      struct ofctrl_acked_seqnos *acked_nb_cfg_seqnos =
>          ofctrl_acked_seqnos_get(ofctrl_seq_type_nb_cfg);
>      uint64_t cur_cfg = acked_nb_cfg_seqnos->last_acked;
> -    int64_t startup_ts = daemon_startup_ts();
>  
>      if (ovs_txn && br_int
>              && startup_ts != smap_get_ullong(&br_int->external_ids,
> @@ -3860,9 +3865,6 @@ main(int argc, char *argv[])
>                               debug_dump_lflow_conj_ids,
>                               &lflow_output_data->conj_ids);
>  
> -    unixctl_command_register("debug/ignore-startup-delay", "", 0, 0,
> -                             debug_ignore_startup_delay, NULL);
> -
>      unsigned int ovs_cond_seqno = UINT_MAX;
>      unsigned int ovnsb_cond_seqno = UINT_MAX;
>      unsigned int ovnsb_expected_cond_seqno = UINT_MAX;
> @@ -3884,6 +3886,7 @@ main(int argc, char *argv[])
>      /* Main loop. */
>      exiting = false;
>      restart = false;
> +    int64_t startup_ts = time_wall_msec();
>      bool sb_monitor_all = false;
>      while (!exiting) {
>          memory_run();
> @@ -4062,10 +4065,6 @@ main(int argc, char *argv[])
>                      }
>                      stopwatch_stop(CONTROLLER_LOOP_STOPWATCH_NAME,
>                                     time_msec());
> -                    if (engine_has_updated()) {
> -                        daemon_started_recently_countdown();
> -                    }
> -
>                      ct_zones_data = engine_get_data(&en_ct_zones);
>                      if (ovs_idl_txn) {
>                          if (ct_zones_data) {
> @@ -4228,7 +4227,7 @@ main(int argc, char *argv[])
>              }
>  
>              store_nb_cfg(ovnsb_idl_txn, ovs_idl_txn, chassis_private,
> -                         br_int, delay_nb_cfg_report);
> +                         br_int, delay_nb_cfg_report, startup_ts);
>  
>              if (pending_pkt.conn) {
>                  struct ed_type_addr_sets *as_data =
> @@ -4705,11 +4704,3 @@ debug_dump_lflow_conj_ids(struct unixctl_conn *conn, int argc OVS_UNUSED,
>      unixctl_command_reply(conn, ds_cstr(&conj_ids_dump));
>      ds_destroy(&conj_ids_dump);
>  }
> -
> -static void
> -debug_ignore_startup_delay(struct unixctl_conn *conn, int argc OVS_UNUSED,
> -                           const char *argv[] OVS_UNUSED, void *arg OVS_UNUSED)
> -{
> -    daemon_started_recently_ignore();
> -    unixctl_command_reply(conn, NULL);
> -}
> diff --git a/controller/patch.c b/controller/patch.c
> index 12e0b6f7c..ed831302c 100644
> --- a/controller/patch.c
> +++ b/controller/patch.c
> @@ -311,14 +311,7 @@ patch_run(struct ovsdb_idl_txn *ovs_idl_txn,
>      SHASH_FOR_EACH_SAFE (port_node, &existing_ports) {
>          port = port_node->data;
>          shash_delete(&existing_ports, port_node);
> -        /* Wait for some iterations before really deleting any patch ports,
> -         * because with conditional monitoring it is possible that related SB
> -         * data is not completely downloaded yet after last restart of
> -         * ovn-controller.  Otherwise it may cause unncessary dataplane
> -         * interruption during restart/upgrade. */
> -        if (!daemon_started_recently()) {
> -            remove_port(bridge_table, port);
> -        }
> +        remove_port(bridge_table, port);
>      }
>      shash_destroy(&existing_ports);
>  }
> diff --git a/controller/vif-plug.c b/controller/vif-plug.c
> index 38348bf54..eafc513f5 100644
> --- a/controller/vif-plug.c
> +++ b/controller/vif-plug.c
> @@ -469,8 +469,7 @@ vif_plug_iface_touched_this_txn(
>  static bool
>  vif_plug_handle_lport_vif(const struct sbrec_port_binding *pb,
>                            struct vif_plug_ctx_in *vif_plug_ctx_in,
> -                          struct vif_plug_ctx_out *vif_plug_ctx_out,
> -                          bool can_unplug)
> +                          struct vif_plug_ctx_out *vif_plug_ctx_out)
>  {
>      if (vif_plug_iface_touched_this_txn(vif_plug_ctx_out, pb->logical_port)) {
>          return true;
> @@ -482,7 +481,7 @@ vif_plug_handle_lport_vif(const struct sbrec_port_binding *pb,
>      if (lport_can_bind_on_this_chassis(vif_plug_ctx_in->chassis_rec, pb)) {
>          handled &= consider_plug_lport(pb, lbinding,
>                                         vif_plug_ctx_in, vif_plug_ctx_out);
> -    } else if (can_unplug && lbinding && lbinding->iface) {
> +    } else if (lbinding && lbinding->iface) {
>          handled &= consider_unplug_iface(lbinding->iface, pb,
>                                           vif_plug_ctx_in, vif_plug_ctx_out);
>      }
> @@ -492,8 +491,7 @@ vif_plug_handle_lport_vif(const struct sbrec_port_binding *pb,
>  static bool
>  vif_plug_handle_iface(const struct ovsrec_interface *iface_rec,
>                        struct vif_plug_ctx_in *vif_plug_ctx_in,
> -                      struct vif_plug_ctx_out *vif_plug_ctx_out,
> -                      bool can_unplug)
> +                      struct vif_plug_ctx_out *vif_plug_ctx_out)
>  {
>      bool handled = true;
>      const char *vif_plug_type = smap_get(&iface_rec->external_ids,
> @@ -513,10 +511,8 @@ vif_plug_handle_iface(const struct ovsrec_interface *iface_rec,
>           * consider updating it */
>          handled &= consider_plug_lport(pb, lbinding,
>                                         vif_plug_ctx_in, vif_plug_ctx_out);
> -    } else if (can_unplug
> -               && (!pb
> -                   || !lport_can_bind_on_this_chassis(
> -                       vif_plug_ctx_in->chassis_rec, pb))) {
> +    } else if (!pb || !lport_can_bind_on_this_chassis(
> +                           vif_plug_ctx_in->chassis_rec, pb)) {
>          /* No lport for this interface or it is destined for different chassis,
>           * consuder unplugging it */
>          handled &= consider_unplug_iface(iface_rec, pb,
> @@ -525,31 +521,17 @@ vif_plug_handle_iface(const struct ovsrec_interface *iface_rec,
>      return handled;
>  }
>  
> -/* On initial startup or on IDL reconnect, several rounds of the main loop may
> - * run before data is actually loaded in the IDL, primarily depending on
> - * conditional monitoring status and other events that could trigger main loop
> - * runs during this period.  Until we find a reliable way to determine the
> - * completeness of the initial data downloading we need this counter so that we
> - * do not erronously unplug ports because the data is just not loaded yet.
> - */
>  void
>  vif_plug_run(struct vif_plug_ctx_in *vif_plug_ctx_in,
>               struct vif_plug_ctx_out *vif_plug_ctx_out)
>  {
> -    bool delay_plug = daemon_started_recently();
> -    if (delay_plug) {
> -        VLOG_DBG("vif_plug_run: daemon started recently, will not unplug "
> -                 "ports in this iteration.");
> -    }
> -
>      if (!vif_plug_ctx_in->chassis_rec) {
>          return;
>      }
>      const struct ovsrec_interface *iface_rec;
>      OVSREC_INTERFACE_TABLE_FOR_EACH (iface_rec,
>                                       vif_plug_ctx_in->iface_table) {
> -        vif_plug_handle_iface(iface_rec, vif_plug_ctx_in, vif_plug_ctx_out,
> -                              !delay_plug);
> +        vif_plug_handle_iface(iface_rec, vif_plug_ctx_in, vif_plug_ctx_out);
>      }
>  
>      struct sbrec_port_binding *target =
> @@ -564,8 +546,7 @@ vif_plug_run(struct vif_plug_ctx_in *vif_plug_ctx_in,
>              vif_plug_ctx_in->sbrec_port_binding_by_requested_chassis) {
>          enum en_lport_type lport_type = get_lport_type(pb);
>          if (lport_type == LP_VIF) {
> -            vif_plug_handle_lport_vif(pb, vif_plug_ctx_in, vif_plug_ctx_out,
> -                                      !delay_plug);
> +            vif_plug_handle_lport_vif(pb, vif_plug_ctx_in, vif_plug_ctx_out);
>          }
>      }
>      sbrec_port_binding_index_destroy_row(target);
> diff --git a/controller/vif-plug.h b/controller/vif-plug.h
> index 7a1978e38..76063591b 100644
> --- a/controller/vif-plug.h
> +++ b/controller/vif-plug.h
> @@ -71,6 +71,7 @@ void vif_plug_clear_changed(struct shash *deleted_iface_ids);
>  void vif_plug_finish_changed(struct shash *changed_iface_ids);
>  void vif_plug_clear_deleted(struct shash *deleted_iface_ids);
>  void vif_plug_finish_deleted(struct shash *changed_iface_ids);
> +void vif_plug_reset_idl_prime_counter(void);
>  
>  #ifdef  __cplusplus
>  }
> diff --git a/lib/inc-proc-eng.c b/lib/inc-proc-eng.c
> index 2e2b31033..575b774ae 100644
> --- a/lib/inc-proc-eng.c
> +++ b/lib/inc-proc-eng.c
> @@ -312,17 +312,6 @@ engine_has_run(void)
>      return false;
>  }
>  
> -bool
> -engine_has_updated(void)
> -{
> -    for (size_t i = 0; i < engine_n_nodes; i++) {
> -        if (engine_nodes[i]->state == EN_UPDATED) {
> -            return true;
> -        }
> -    }
> -    return false;
> -}
> -
>  bool
>  engine_aborted(void)
>  {
> diff --git a/lib/inc-proc-eng.h b/lib/inc-proc-eng.h
> index dc365dc18..9bfab1f7c 100644
> --- a/lib/inc-proc-eng.h
> +++ b/lib/inc-proc-eng.h
> @@ -238,10 +238,6 @@ bool engine_node_changed(struct engine_node *node);
>  /* Return true if the engine has run in the last iteration. */
>  bool engine_has_run(void);
>  
> -/* Return true if the engine has any update in any node, i.e. any input
> - * has changed; false if nothing has changed. */
> -bool engine_has_updated(void);
> -
>  /* Returns true if during the last engine run we had to abort processing. */
>  bool engine_aborted(void);
>  
> diff --git a/lib/ovn-util.c b/lib/ovn-util.c
> index d80db179a..616999eab 100644
> --- a/lib/ovn-util.c
> +++ b/lib/ovn-util.c
> @@ -883,55 +883,3 @@ get_bridge(const struct ovsrec_bridge_table *bridge_table, const char *br_name)
>      }
>      return NULL;
>  }
> -
> -#define DAEMON_STARTUP_DELAY_SEED 20
> -#define DAEMON_STARTUP_DELAY_MS   10000
> -
> -static int64_t startup_ts;
> -static int startup_delay = DAEMON_STARTUP_DELAY_SEED;
> -
> -/* Used by debug command only, for tests. */
> -static bool ignore_startup_delay = false;
> -
> -OVS_CONSTRUCTOR(startup_ts_initializer) {
> -    startup_ts = time_wall_msec();
> -}
> -
> -int64_t
> -daemon_startup_ts(void)
> -{
> -    return startup_ts;
> -}
> -
> -void
> -daemon_started_recently_countdown(void)
> -{
> -    if (startup_delay > 0) {
> -        startup_delay--;
> -    }
> -}
> -
> -void
> -daemon_started_recently_ignore(void)
> -{
> -    ignore_startup_delay = true;
> -}
> -
> -bool
> -daemon_started_recently(void)
> -{
> -    if (ignore_startup_delay) {
> -        return false;
> -    }
> -
> -    VLOG_DBG("startup_delay: %d, startup_ts: %"PRId64, startup_delay,
> -             startup_ts);
> -
> -    /* Ensure that at least an amount of updates has been handled. */
> -    if (startup_delay) {
> -        return true;
> -    }
> -
> -    /* Ensure that at least an amount of time has passed. */
> -    return time_wall_msec() - startup_ts <= DAEMON_STARTUP_DELAY_MS;
> -}
> diff --git a/lib/ovn-util.h b/lib/ovn-util.h
> index 145f974ed..b3905ef7b 100644
> --- a/lib/ovn-util.h
> +++ b/lib/ovn-util.h
> @@ -309,9 +309,5 @@ struct ovsrec_bridge_table;
>  const struct ovsrec_bridge *get_bridge(const struct ovsrec_bridge_table *,
>                                         const char *br_name);
>  
> -void daemon_started_recently_countdown(void);
> -void daemon_started_recently_ignore(void);
> -bool daemon_started_recently(void);
> -int64_t daemon_startup_ts(void);
>  
>  #endif /* OVN_UTIL_H */
> diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
> index 3c3fb31c7..2d765f31a 100644
> --- a/tests/ovn-controller.at
> +++ b/tests/ovn-controller.at
> @@ -114,7 +114,6 @@ check_patches \
>      'br-eth0 patch-localnet2-to-br-int patch-br-int-to-localnet2'
>  
>  # Delete the mapping and the ovn-bridge-mapping patch ports should go away.
> -check ovn-appctl -t ovn-controller debug/ignore-startup-delay
>  AT_CHECK([ovs-vsctl remove Open_vSwitch . external-ids ovn-bridge-mappings])
>  check_bridge_mappings
>  check_patches
> @@ -2268,6 +2267,7 @@ OVS_WAIT_UNTIL([test 6 = $(as hv1 ovs-ofctl dump-flows br-int | grep -c $(port_b
>  OVN_CLEANUP([hv1])
>  AT_CLEANUP
>  
> +OVN_FOR_EACH_NORTHD([
>  AT_SETUP([ovn-controller - restart should not delete patch ports])
>  
>  ovn_start
> @@ -2337,3 +2337,4 @@ done
>  AT_CHECK([grep "deleted interface patch" hv1/ovs-vswitchd.log], [1], [ignore])
>  OVN_CLEANUP([hv1])
>  AT_CLEANUP
> +])
> diff --git a/tests/ovn.at b/tests/ovn.at
> index bba2c9c1d..21a99089f 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -31681,7 +31681,6 @@ OVS_WAIT_UNTIL([
>      test x43 = x$(as hv1 ovs-vsctl get Interface ${iface1_uuid} mtu_request)
>  ])
>  
> -as hv1 check ovn-appctl -t ovn-controller debug/ignore-startup-delay
>  # Check that pointing requested-chassis somewhere else will unplug the port
>  check ovn-nbctl --wait=hv set Logical_Switch_Port lsp1 \
>      options:requested-chassis=non-existent-chassis
> @@ -31689,7 +31688,6 @@ OVS_WAIT_UNTIL([
>      ! as hv1 ovs-vsctl get Interface ${iface1_uuid} _uuid
>  ])
>  
> -as hv2 check ovn-appctl -t ovn-controller debug/ignore-startup-delay
>  # Check that removing an lport will unplug it
>  AT_CHECK([test x${iface2_uuid} = x$(as hv2 ovs-vsctl get Interface ${iface2_uuid} _uuid)], [0], [])
>  check ovn-nbctl --wait=hv lsp-del ${lsp2_uuid}
Ilya Maximets Aug. 22, 2022, 10:33 a.m. UTC | #2
On 8/22/22 12:21, Dumitru Ceara wrote:
> Hi Ilya,
> 
> On 8/22/22 11:19, Ilya Maximets wrote:
>> daemon_started_recently() concept is flawed in terms that it uses
>> fixed number of iterations for a countdown and a fixed timeout, so
>> the undesired removal of configured resources can still happen under
>> certain conditions even with this mechanism in place.
>>
>> The root cause of the original problem is that conditional monitoring
>> in ovn-controller works in stages.  ovn-controller doesn't know all
>> the information it needs to monitor until the first monitor reply
>> arrives, then it adjusts conditions and receives all the data.  That
>> doesn't work well with the ovsdb_idl_has_ever_connected() check that
>> will return 'true' right after the first monitor reply.
>> This leads to situation where resources can be removed after the
>> first monitor reply and re-added back after condition change.
>>
>> Instead of waiting a fixed number of iterations/seconds, the correct
>> solution should be to always start with unconditional monitoring.
>> When the full initial snapshot is received, switch to conditional
>> monitoring as data is complete.  This way ovsdb_idl_has_ever_connected()
>> check will provide an accurate answer to the question: if we have
>> a sufficient amount of data to make decisions about resource removal?
> 
> This does sound like the cleanest solution, I agree.  But does it create
> an issue in large scale deployments where conditional monitoring is
> used?  The full initial SB snapshot is likely quite large.

It can be large, yes, though it shouldn't be critical.  We will not hold
all this memory for long.  If it is critical, we can fall back on the
"Alternative solution" below.  It should not be hard to implement.
I'll wait for the discussion here to conclude before working on that.

> 
>>
>> Once switched to conditional monitoring, southbound database will
>> remove all the unnecessary data from the ovn-controller, so it will
>> not need to keep it for long.  And since setup is using conditional
>> monitoring, it's likely not that big anyway to be concerned about
>> a brief spike in memory consumption on startup.
>>
> 
> For OpenShift and RedHat OpenStack this stands, we use
> ovn-monitor-all=true for now.  But I'm not sure about other large scale
> deployments.  I'll let Han comment more on this too.
> 
>> This change will also remove unnecessary waiting on startup when all
>> the data is already available.
>>
>> The main change is in the ovsdb_idl_has_ever_connected() check in the
>> update_sb_monitors(), the rest is mostly a revert of commits that
>> introduced the startup delay.
>>
>> Test enhanced to run with and w/o conditional monitoring, since the
>> issue is closely related it it.
>>
> 
> Good idea.
> 
>> Alternative solution might be to store the first condition change
>> sequence number and replace all ovsdb_idl_has_ever_connected() calls
>> with a function that also checks for the current sequence number being
>> higher than condition change one, but that will require slightly more
>> code.  Can be implemented in the future, if the first unconditional
>> monitor reply will become a problem some day.
>>
>> Fixes: 2b27eb3482a3 ("patch.c: Avoid patch interface deletion & recreation during restart.")
> 
> It seems to me we need to do something similar for OpenFlow clearing
> too.  Right now we wait a configured amount of time:
> 
> https://github.com/ovn-org/ovn/commit/896adfd2d

Hmm, yes.  It seems like with the correct handling of Sb updates in place
we should be able to just implement an "ideal" solution:

"""
  Ideally, ovn-controller should clear the flows after it
  completes the new flows computing.
"""

Best regards, Ilya Maximets.
Han Zhou Aug. 22, 2022, 5:39 p.m. UTC | #3
On Mon, Aug 22, 2022 at 3:33 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 8/22/22 12:21, Dumitru Ceara wrote:
> > Hi Ilya,
> >
> > On 8/22/22 11:19, Ilya Maximets wrote:
> >> daemon_started_recently() concept is flawed in terms that it uses
> >> fixed number of iterations for a countdown and a fixed timeout, so
> >> the undesired removal of configured resources can still happen under
> >> certain conditions even with this mechanism in place.
> >>
> >> The root cause of the original problem is that conditional monitoring
> >> in ovn-controller works in stages.  ovn-controller doesn't know all
> >> the information it needs to monitor until the first monitor reply
> >> arrives, then it adjusts conditions and receives all the data.  That
> >> doesn't work well with the ovsdb_idl_has_ever_connected() check that
> >> will return 'true' right after the first monitor reply.
> >> This leads to situation where resources can be removed after the
> >> first monitor reply and re-added back after condition change.
> >>
> >> Instead of waiting a fixed number of iterations/seconds, the correct
> >> solution should be to always start with unconditional monitoring.
> >> When the full initial snapshot is received, switch to conditional
> >> monitoring as data is complete.  This way ovsdb_idl_has_ever_connected()
> >> check will provide an accurate answer to the question: if we have
> >> a sufficient amount of data to make decisions about resource removal?
> >
> > This does sound like the cleanest solution, I agree.  But does it create
> > an issue in large scale deployments where conditional monitoring is
> > used?  The full initial SB snapshot is likely quite large.
>
> It can be large, yes, though it shouldn't be critical.  We will not hold
> all this memory for long.  If it is critical, we can fall back on the
> "Alternative solution" below.  It should not be hard to implement.
> I'll wait for the discussion here to conclude before working on that.

Thanks Ilya. I admit this is an approach I didn't think about before. It is
indeed cleaner.
However, the spike of memory may be undesirable, given that one of the
major advantages of conditional monitoring is saving memory on the client
side. I guess many environments don't care much about memory, but still
some environments do care, and in such cases it is critical. ovn-controller
is the distributed part, so at scale the total difference is actually
significant. Even if the memory spike won't last for long, the memory still
needs to be reserved for that. Otherwise it is possible to trigger OOM just
when ovn-controller restarts. I just did a scale test in a simulated
800-node ovn-k8s environment; with ovn-monitor-all enabled, the memory
usage is at least 10x that of conditional monitoring (and my machine hit
OOM before the memory spike completed).

>
> >
> >>
> >> Once switched to conditional monitoring, southbound database will
> >> remove all the unnecessary data from the ovn-controller, so it will
> >> not need to keep it for long.  And since setup is using conditional
> >> monitoring, it's likely not that big anyway to be concerned about
> >> a brief spike in memory consumption on startup.
> >>
> >
> > For OpenShift and RedHat OpenStack this stands, we use
> > ovn-monitor-all=true for now.  But I'm not sure about other large scale
> > deployments.  I'll let Han comment more on this too.
> >
> >> This change will also remove unnecessary waiting on startup when all
> >> the data is already available.
> >>
> >> The main change is in the ovsdb_idl_has_ever_connected() check in the
> >> update_sb_monitors(), the rest is mostly a revert of commits that
> >> introduced the startup delay.
> >>
> >> Test enhanced to run with and w/o conditional monitoring, since the
> >> issue is closely related it it.
> >>
> >
> > Good idea.
> >
> >> Alternative solution might be to store the first condition change
> >> sequence number and replace all ovsdb_idl_has_ever_connected() calls
> >> with a function that also checks for the current sequence number being
> >> higher than condition change one, but that will require slightly more
> >> code.  Can be implemented in the future, if the first unconditional
> >> monitor reply will become a problem some day.
> >>

I believe this was the first solution we thought about initially. The
problem is that, when new data comes (as a result of a monitor condition
change), it may trigger another condition change, and so on. In theory,
we don't know for sure at what point all the "initial" data has been
downloaded.
It is possible to assume there are at most "N" monitor condition changes
required to download the required initial snapshot of data, but then this
is almost equivalent to the current "N" iterations approach.
Any idea how to solve this problem?

Thanks,
Han

> >> Fixes: 2b27eb3482a3 ("patch.c: Avoid patch interface deletion & recreation during restart.")
> >
> > It seems to me we need to do something similar for OpenFlow clearing
> > too.  Right now we wait a configured amount of time:
> >
> > https://github.com/ovn-org/ovn/commit/896adfd2d
>
> Hmm, yes.  It seems like with the correct handling of Sb updates in place
> we should be able to just implement an "ideal" solution:
>
> """
>   Ideally, ovn-controller should clear the flows after it
>   completes the new flows computing.
> """
>
> Best regards, Ilya Maximets.
Ilya Maximets Aug. 23, 2022, 1:45 p.m. UTC | #4
On 8/22/22 19:39, Han Zhou wrote:
> 
> 
> On Mon, Aug 22, 2022 at 3:33 AM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
>>
>> On 8/22/22 12:21, Dumitru Ceara wrote:
>> > Hi Ilya,
>> >
>> > On 8/22/22 11:19, Ilya Maximets wrote:
>> >> daemon_started_recently() concept is flawed in terms that it uses
>> >> fixed number of iterations for a countdown and a fixed timeout, so
>> >> the undesired removal of configured resources can still happen under
>> >> certain conditions even with this mechanism in place.
>> >>
>> >> The root cause of the original problem is that conditional monitoring
>> >> in ovn-controller works in stages.  ovn-controller doesn't know all
>> >> the information it needs to monitor until the first monitor reply
>> >> arrives, then it adjusts conditions and receives all the data.  That
>> >> doesn't work well with the ovsdb_idl_has_ever_connected() check that
>> >> will return 'true' right after the first monitor reply.
>> >> This leads to situation where resources can be removed after the
>> >> first monitor reply and re-added back after condition change.
>> >>
>> >> Instead of waiting a fixed number of iterations/seconds, the correct
>> >> solution should be to always start with unconditional monitoring.
>> >> When the full initial snapshot is received, switch to conditional
>> >> monitoring as data is complete.  This way ovsdb_idl_has_ever_connected()
>> >> check will provide an accurate answer to the question: if we have
>> >> a sufficient amount of data to make decisions about resource removal?
>> >
>> > This does sound like the cleanest solution, I agree.  But does it create
>> > an issue in large scale deployments where conditional monitoring is
>> > used?  The full initial SB snapshot is likely quite large.
>>
>> It can be large, yes, though it shouldn't be critical.  We will not hold
>> all this memory for long.  If it is critical, we can fall back on the
>> "Alternative solution" below.  It should not be hard to implement.
>> I'll wait for the discussion here to conclude before working on that.
> 
> Thanks Ilya. I admit this is an approach I didn't think about before. It is indeed cleaner.
> However, the spike of memory may be undesirable, given that one of the major advantages of conditional monitoring is saving memory on the client side. I guess many environments don't care much about memory, but still some environments do care, and in such cases it is critical. ovn-controller is the distributed part, so at scale the total difference is actually significant. Even if the memory spike won't last for long, the memory still needs to be reserved for that. Otherwise it is possible to trigger OOM just when ovn-controller restarts. I just did a scale test in a simulated 800 node ovn-k8s environment, with ovn-monitor-all enabled the memory is at least 10x more than conditional monitoring (and my machine just got OOM before the memory spike completes).


OK, makes sense.  I'll work towards checking the condition change
sequence number then.
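
Roughly along these lines (an untested sketch, in ovn-controller.c context;
the helper and the stored seqno variable are made-up names, only
ovsdb_idl_has_ever_connected() and ovsdb_idl_get_condition_seqno() are
existing IDL calls):

/* Seqno returned by the first update_sb_monitors() call that requested
 * "real" (non-true) conditions.  Made-up name. */
static unsigned int first_sb_cond_seqno = UINT_MAX;

/* Made-up helper: the SB view is complete once the IDL has received the
 * initial snapshot *and* the server has acked that first condition
 * change, i.e. the data matching the real conditions has arrived. */
static bool
sb_initial_data_received(const struct ovsdb_idl *ovnsb_idl)
{
    if (!ovsdb_idl_has_ever_connected(ovnsb_idl)) {
        return false;
    }
    if (first_sb_cond_seqno == UINT_MAX) {
        /* No condition change was ever requested (e.g. monitor-all),
         * so the first monitor reply already carried everything. */
        return true;
    }
    return ovsdb_idl_get_condition_seqno(ovnsb_idl) >= first_sb_cond_seqno;
}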

> 
>>
>> >
>> >>
>> >> Once switched to conditional monitoring, southbound database will
>> >> remove all the unnecessary data from the ovn-controller, so it will
>> >> not need to keep it for long.  And since setup is using conditional
>> >> monitoring, it's likely not that big anyway to be concerned about
>> >> a brief spike in memory consumption on startup.
>> >>
>> >
>> > For OpenShift and RedHat OpenStack this stands, we use
>> > ovn-monitor-all=true for now.  But I'm not sure about other large scale
>> > deployments.  I'll let Han comment more on this too.
>> >
>> >> This change will also remove unnecessary waiting on startup when all
>> >> the data is already available.
>> >>
>> >> The main change is in the ovsdb_idl_has_ever_connected() check in the
>> >> update_sb_monitors(), the rest is mostly a revert of commits that
>> >> introduced the startup delay.
>> >>
>> >> Test enhanced to run with and w/o conditional monitoring, since the
>> >> issue is closely related it it.
>> >>
>> >
>> > Good idea.
>> >
>> >> Alternative solution might be to store the first condition change
>> >> sequence number and replace all ovsdb_idl_has_ever_connected() calls
>> >> with a function that also checks for the current sequence number being
>> >> higher than condition change one, but that will require slightly more
>> >> code.  Can be implemented in the future, if the first unconditional
>> >> monitor reply will become a problem some day.
>> >>
> 
> I believe this was the first solution we have thought about initially. The problem is that, when new data comes (as a result of monitor condition change), it may trigger a change of condition again, and so on. In theory, we don't know for sure what is the point that all "initial" data has been downloaded.
> It is possible that we assume there are at most "N" monitor condition changes required to download the required initial snapshot of data, but then this is almost equivalent to the current "N" iterations approach.
> Any idea to solve this problem?

I don't think there is a problem, unless we have a circular dependency
in condition updates.  But wouldn't that be a much more serious problem
on its own?  I'll recheck all the conditions that we have to make sure
we're not running in circles, but my understanding is that it should be
enough to just wait for the first or second condition update to complete.
Any new data that we receive afterwards should be actual new data
created after ovn-controller is already connected, so it will not
affect pre-existing ports and OF rules.

> 
> Thanks,
> Han
> 
>> >> Fixes: 2b27eb3482a3 ("patch.c: Avoid patch interface deletion & recreation during restart.")
>> >
>> > It seems to me we need to do something similar for OpenFlow clearing
>> > too.  Right now we wait a configured amount of time:
>> >
>> > https://github.com/ovn-org/ovn/commit/896adfd2d <https://github.com/ovn-org/ovn/commit/896adfd2d>
>>
>> Hmm, yes.  It seems like with the correct handling of Sb updates in place
>> we should be able to just implement an "ideal" solution:
>>
>> """
>>   Ideally, ovn-controller should clear the flows after it
>>   completes the new flows computing.
>> """
>>
>> Best regards, Ilya Maximets.
Han Zhou Aug. 23, 2022, 2:15 p.m. UTC | #5
On Tue, Aug 23, 2022 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 8/22/22 19:39, Han Zhou wrote:
> >
> >
> > > On Mon, Aug 22, 2022 at 3:33 AM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> >>
> >> On 8/22/22 12:21, Dumitru Ceara wrote:
> >> > Hi Ilya,
> >> >
> >> > On 8/22/22 11:19, Ilya Maximets wrote:
> >> >> daemon_started_recently() concept is flawed in terms that it uses
> >> >> fixed number of iterations for a countdown and a fixed timeout, so
> >> >> the undesired removal of configured resources can still happen under
> >> >> certain conditions even with this mechanism in place.
> >> >>
> >> >> The root cause of the original problem is that conditional monitoring
> >> >> in ovn-controller works in stages.  ovn-controller doesn't know all
> >> >> the information it needs to monitor until the first monitor reply
> >> >> arrives, then it adjusts conditions and receives all the data.  That
> >> >> doesn't work well with the ovsdb_idl_has_ever_connected() check that
> >> >> will return 'true' right after the first monitor reply.
> >> >> This leads to situation where resources can be removed after the
> >> >> first monitor reply and re-added back after condition change.
> >> >>
> >> >> Instead of waiting a fixed number of iterations/seconds, the correct
> >> >> solution should be to always start with unconditional monitoring.
> >> >> When the full initial snapshot is received, switch to conditional
> >> >> monitoring as data is complete.  This way ovsdb_idl_has_ever_connected()
> >> >> check will provide an accurate answer to the question: if we have
> >> >> a sufficient amount of data to make decisions about resource removal?
> >> >
> >> > This does sound like the cleanest solution, I agree.  But does it create
> >> > an issue in large scale deployments where conditional monitoring is
> >> > used?  The full initial SB snapshot is likely quite large.
> >>
> >> It can be large, yes, though it shouldn't be critical.  We will not hold
> >> all this memory for long.  If it is critical, we can fall back on the
> >> "Alternative solution" below.  It should not be hard to implement.
> >> I'll wait for the discussion here to conclude before working on that.
> >
> > Thanks Ilya. I admit this is an approach I didn't think about before. It is indeed cleaner.
> > However, the spike of memory may be undesirable, given that one of the major advantages of conditional monitoring is saving memory on the client side. I guess many environments don't care much about memory, but still some environments do care, and in such cases it is critical. ovn-controller is the distributed part, so at scale the total difference is actually significant. Even if the memory spike won't last for long, the memory still needs to be reserved for that. Otherwise it is possible to trigger OOM just when ovn-controller restarts. I just did a scale test in a simulated 800 node ovn-k8s environment, with ovn-monitor-all enabled the memory is at least 10x more than conditional monitoring (and my machine just got OOM before the memory spike completes).
>
>
> OK, makes sense.  I'll work towards checking the condition change
> sequence number then.
>
> >
> >>
> >> >
> >> >>
> >> >> Once switched to conditional monitoring, southbound database will
> >> >> remove all the unnecessary data from the ovn-controller, so it will
> >> >> not need to keep it for long.  And since setup is using conditional
> >> >> monitoring, it's likely not that big anyway to be concerned about
> >> >> a brief spike in memory consumption on startup.
> >> >>
> >> >
> >> > For OpenShift and RedHat OpenStack this stands, we use
> >> > ovn-monitor-all=true for now.  But I'm not sure about other large scale
> >> > deployments.  I'll let Han comment more on this too.
> >> >
> >> >> This change will also remove unnecessary waiting on startup when all
> >> >> the data is already available.
> >> >>
> >> >> The main change is in the ovsdb_idl_has_ever_connected() check in the
> >> >> update_sb_monitors(), the rest is mostly a revert of commits that
> >> >> introduced the startup delay.
> >> >>
> >> >> Test enhanced to run with and w/o conditional monitoring, since the
> >> >> issue is closely related it it.
> >> >>
> >> >
> >> > Good idea.
> >> >
> >> >> Alternative solution might be to store the first condition change
> >> >> sequence number and replace all ovsdb_idl_has_ever_connected() calls
> >> >> with a function that also checks for the current sequence number being
> >> >> higher than condition change one, but that will require slightly more
> >> >> code.  Can be implemented in the future, if the first unconditional
> >> >> monitor reply will become a problem some day.
> >> >>
> >
> > I believe this was the first solution we have thought about initially. The problem is that, when new data comes (as a result of monitor condition change), it may trigger a change of condition again, and so on. In theory, we don't know for sure what is the point that all "initial" data has been downloaded.
> > It is possible that we assume there are at most "N" monitor condition changes required to download the required initial snapshot of data, but then this is almost equivalent to the current "N" iterations approach.
> > Any idea to solve this problem?
>
> I don't think there is a problem, unless we have a circular dependency
> in condition updates.  But wouldn't that be a much more serious problem
> on its own?  I'll recheck all the conditions that we have to make sure
> we're not running in circles, but my understanding is that it should be
> enough to just wait for the first or second condition update to complete.
> Any new data that we're receiving afterwards should be an actual new
> data created after the ovn-controller is already connected, so it will
> not affect pre-existing ports and OF rules.
>
It's not about circular dependencies. It is about how we determine whether
a datapath is related to the chassis. We start from the local VIFs and the
DPs they belong to, and each subsequent iteration changes the monitor
conditions to monitor the port bindings of the next level of connected DPs,
according to the patch ports already monitored. So, the number of iterations
required to download the data depends on the number of *hops* in the logical
topology.

A possible solution is to always monitor all port-bindings, but the
side-effect would be memory and CPU cost at scale. (CPU cost may be ok if
we optimize and eliminate loops of SBREC_PORT_BINDING_TABLE_FOR_EACH).
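
Just to make the hop-count point above concrete, here is a tiny standalone
illustration (purely illustrative, not OVN code): datapaths form a chain
connected by patch ports, the local VIF sits on DP 0, and each "condition
change" round only pulls in the DPs reachable through patch ports of what
is already monitored, so the number of rounds equals the number of hops.

#include <stdbool.h>
#include <stdio.h>

#define N_DATAPATHS 5   /* DP 0 has the local VIF; DPs are chained by patch ports. */

int
main(void)
{
    bool monitored[N_DATAPATHS] = { [0] = true };
    int rounds = 0;

    for (;;) {
        /* One monitor condition change: add DPs connected by a patch port
         * to a DP that is already monitored.  Mark additions separately so
         * a single round only advances one hop. */
        bool add[N_DATAPATHS] = { false };
        for (int dp = 0; dp + 1 < N_DATAPATHS; dp++) {
            if (monitored[dp] && !monitored[dp + 1]) {
                add[dp + 1] = true;
            }
        }

        bool changed = false;
        for (int dp = 0; dp < N_DATAPATHS; dp++) {
            if (add[dp] && !monitored[dp]) {
                monitored[dp] = true;
                changed = true;
            }
        }
        if (!changed) {
            break;
        }
        rounds++;
    }

    /* Prints: "4 condition changes needed to cover 4 hops". */
    printf("%d condition changes needed to cover %d hops\n",
           rounds, N_DATAPATHS - 1);
    return 0;
}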

> >
> > Thanks,
> > Han
> >
> >> >> Fixes: 2b27eb3482a3 ("patch.c: Avoid patch interface deletion & recreation during restart.")
> >> >
> >> > It seems to me we need to do something similar for OpenFlow clearing
> >> > too.  Right now we wait a configured amount of time:
> >> >
> >> > https://github.com/ovn-org/ovn/commit/896adfd2d <https://github.com/ovn-org/ovn/commit/896adfd2d>
> >>
> >> Hmm, yes.  It seems like with the correct handling of Sb updates in place
> >> we should be able to just implement an "ideal" solution:
> >>
> >> """
> >>   Ideally, ovn-controller should clear the flows after it
> >>   completes the new flows computing.
> >> """
> >>
> >> Best regards, Ilya Maximets.
>
Han Zhou Aug. 23, 2022, 4:05 p.m. UTC | #6
On Tue, Aug 23, 2022 at 7:15 AM Han Zhou <hzhou@ovn.org> wrote:
>
>
>
> On Tue, Aug 23, 2022 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 8/22/22 19:39, Han Zhou wrote:
> > >
> > >
> > > On Mon, Aug 22, 2022 at 3:33 AM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > >>
> > >> On 8/22/22 12:21, Dumitru Ceara wrote:
> > >> > Hi Ilya,
> > >> >
> > >> > On 8/22/22 11:19, Ilya Maximets wrote:
> > >> >> daemon_started_recently() concept is flawed in terms that it uses
> > >> >> fixed number of iterations for a countdown and a fixed timeout, so
> > >> >> the undesired removal of configured resources can still happen under
> > >> >> certain conditions even with this mechanism in place.
> > >> >>
> > >> >> The root cause of the original problem is that conditional monitoring
> > >> >> in ovn-controller works in stages.  ovn-controller doesn't know all
> > >> >> the information it needs to monitor until the first monitor reply
> > >> >> arrives, then it adjusts conditions and receives all the data.  That
> > >> >> doesn't work well with the ovsdb_idl_has_ever_connected() check that
> > >> >> will return 'true' right after the first monitor reply.
> > >> >> This leads to situation where resources can be removed after the
> > >> >> first monitor reply and re-added back after condition change.
> > >> >>
> > >> >> Instead of waiting a fixed number of iterations/seconds, the correct
> > >> >> solution should be to always start with unconditional monitoring.
> > >> >> When the full initial snapshot is received, switch to conditional
> > >> >> monitoring as data is complete.  This way ovsdb_idl_has_ever_connected()
> > >> >> check will provide an accurate answer to the question: if we have
> > >> >> a sufficient amount of data to make decisions about resource removal?
> > >> >
> > >> > This does sound like the cleanest solution, I agree.  But does it create
> > >> > an issue in large scale deployments where conditional monitoring is
> > >> > used?  The full initial SB snapshot is likely quite large.
> > >>
> > >> It can be large, yes, though it shouldn't be critical.  We will not hold
> > >> all this memory for long.  If it is critical, we can fall back on the
> > >> "Alternative solution" below.  It should not be hard to implement.
> > >> I'll wait for the discussion here to conclude before working on that.
> > >
> > > Thanks Ilya. I admit this is an approach I didn't think about before. It is indeed cleaner.
> > > However, the spike of memory may be undesirable, given that one of the major advantages of conditional monitoring is saving memory on the client side. I guess many environments don't care much about memory, but still some environments do care, and in such cases it is critical. ovn-controller is the distributed part, so at scale the total difference is actually significant. Even if the memory spike won't last for long, the memory still needs to be reserved for that. Otherwise it is possible to trigger OOM just when ovn-controller restarts. I just did a scale test in a simulated 800 node ovn-k8s environment, with ovn-monitor-all enabled the memory is at least 10x more than conditional monitoring (and my machine just got OOM before the memory spike completes).
> >
> >
> > OK, makes sense.  I'll work towards checking the condition change
> > sequence number then.
> >
> > >
> > >>
> > >> >
> > >> >>
> > >> >> Once switched to conditional monitoring, southbound database will
> > >> >> remove all the unnecessary data from the ovn-controller, so it will
> > >> >> not need to keep it for long.  And since setup is using conditional
> > >> >> monitoring, it's likely not that big anyway to be concerned about
> > >> >> a brief spike in memory consumption on startup.
> > >> >>
> > >> >
> > >> > For OpenShift and RedHat OpenStack this stands, we use
> > >> > ovn-monitor-all=true for now.  But I'm not sure about other large scale
> > >> > deployments.  I'll let Han comment more on this too.
> > >> >
> > >> >> This change will also remove unnecessary waiting on startup when all
> > >> >> the data is already available.
> > >> >>
> > >> >> The main change is in the ovsdb_idl_has_ever_connected() check in the
> > >> >> update_sb_monitors(), the rest is mostly a revert of commits that
> > >> >> introduced the startup delay.
> > >> >>
> > >> >> Test enhanced to run with and w/o conditional monitoring, since the
> > >> >> issue is closely related it it.
> > >> >>
> > >> >
> > >> > Good idea.
> > >> >
> > >> >> Alternative solution might be to store the first condition change
> > >> >> sequence number and replace all ovsdb_idl_has_ever_connected() calls
> > >> >> with a function that also checks for the current sequence number being
> > >> >> higher than condition change one, but that will require slightly more
> > >> >> code.  Can be implemented in the future, if the first unconditional
> > >> >> monitor reply will become a problem some day.
> > >> >>
> > >
> > > I believe this was the first solution we have thought about initially. The problem is that, when new data comes (as a result of monitor condition change), it may trigger a change of condition again, and so on. In theory, we don't know for sure what is the point that all "initial" data has been downloaded.
> > > It is possible that we assume there are at most "N" monitor condition changes required to download the required initial snapshot of data, but then this is almost equivalent to the current "N" iterations approach.
> > > Any idea to solve this problem?
> >
> > I don't think there is a problem, unless we have a circular dependency
> > in condition updates.  But wouldn't that be a much more serious problem
> > on its own?  I'll recheck all the conditions that we have to make sure
> > we're not running in circles, but my understanding is that it should be
> > enough to just wait for the first or second condition update to complete.
> > Any new data that we're receiving afterwards should be an actual new
> > data created after the ovn-controller is already connected, so it will
> > not affect pre-existing ports and OF rules.
> >
> It's not about circular dependency. It is about how we determine if a datapath is related to the chassis. We start from the local VIFs and the DPs they belong to, and each next iteration would change monitor conditions to monitor the port-bindings of the next level of connected DPs according to the patch ports already monitored. So, how many iterations are required to download the data depends on the *hops* in the logical topology.
>
> A possible solution is to always monitor all port-bindings, but the side-effect would be memory and CPU cost at scale. (CPU cost may be ok if we optimize and eliminate loops of SBREC_PORT_BINDING_TABLE_FOR_EACH).
>
Sorry, I just realized that it may be sufficient to simply monitor all
"patch" ports, and the code is already doing that. So it seems this
alternative should work in theory: in the first round we can monitor the
DPs directly connected to the local VIFs, and in the second round we should
know all the DPs indirectly connected through patch ports and initiate
monitoring for them.
If we go down this path, I think there are a few more details to be
figured out. I did observe that today ovn-controller adds monitor
conditions hop by hop, so we need to figure out why and fix that (assuming
the theory is correct). Also, what if there are no VIFs attached to the
chassis yet? Then there won't be any monitor condition change, and we
should not wait in this case.
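
Purely illustrative on that last point: ovn-controller's main loop already
tracks the requested and the acked condition seqnos (ovnsb_expected_cond_seqno
and ovnsb_cond_seqno in the patch above), so the "no condition change was ever
requested" case is visible there without any counter or timer.  A made-up
helper just to show the shape of such a check:

/* Made-up helper; names borrowed from ovn-controller.c's main loop, where
 * UINT_MAX is the "nothing requested/pending" value. */
static bool
sb_conditions_settled(unsigned int cond_seqno, unsigned int expected_cond_seqno)
{
    return expected_cond_seqno == UINT_MAX
           || cond_seqno == expected_cond_seqno;
}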

Thanks,
Han

> > >
> > > Thanks,
> > > Han
> > >
> > >> >> Fixes: 2b27eb3482a3 ("patch.c: Avoid patch interface deletion & recreation during restart.")
> > >> >
> > >> > It seems to me we need to do something similar for OpenFlow clearing
> > >> > too.  Right now we wait a configured amount of time:
> > >> >
> > >> > https://github.com/ovn-org/ovn/commit/896adfd2d <https://github.com/ovn-org/ovn/commit/896adfd2d>
> > >>
> > >> Hmm, yes.  It seems like with the correct handling of Sb updates in place
> > >> we should be able to just implement an "ideal" solution:
> > >>
> > >> """
> > >>   Ideally, ovn-controller should clear the flows after it
> > >>   completes the new flows computing.
> > >> """
> > >>
> > >> Best regards, Ilya Maximets.
> >

Patch

diff --git a/TODO.rst b/TODO.rst
index 12738280b..d54817011 100644
--- a/TODO.rst
+++ b/TODO.rst
@@ -172,10 +172,3 @@  OVN To-do List
   * Multi-threaded logical flow computation was optimized for the case
     when datapath groups are disabled.  Datpath groups are always enabled
     now so northd parallel processing should be revisited.
-
-* ovn-controller daemon module
-
-  * Dumitru Ceara: Add a new module e.g. ovn/lib/daemon-ovn.c that wraps
-    OVS' daemonize_start() call and initializes the additional things, like
-    the unixctl commands. Or, we should move the APIs such as
-    daemon_started_recently() to OVS's lib/daemon.
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 0b0ccc48a..2ac6409bc 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -94,7 +94,6 @@  static unixctl_cb_func debug_dump_lflow_conj_ids;
 static unixctl_cb_func lflow_cache_flush_cmd;
 static unixctl_cb_func lflow_cache_show_stats_cmd;
 static unixctl_cb_func debug_delay_nb_cfg_report;
-static unixctl_cb_func debug_ignore_startup_delay;
 
 #define DEFAULT_BRIDGE_NAME "br-int"
 #define DEFAULT_DATAPATH "system"
@@ -190,7 +189,14 @@  update_sb_monitors(struct ovsdb_idl *ovnsb_idl,
      * avoid the unnecessarily extra wake-ups of ovn-controller. */
     ovsdb_idl_condition_add_clause_true(&ldpg);
 
-    if (monitor_all) {
+    /* Don't enable conditional monitoring until we have a full view on a
+     * data that we need to monitor.  Without that we will send a first
+     * monitor request with incomplete conditions, receive a reply, update
+     * conditions and receive the rest of data.  Such a split may result in
+     * removal and re-addition of certain resources (e.g. patch ports)
+     * on restart of ovn-controller.  Database will send deletion requiests
+     * for all the data that we don't actually need. */
+    if (monitor_all || !ovsdb_idl_has_ever_connected(ovnsb_idl)) {
         ovsdb_idl_condition_add_clause_true(&pb);
         ovsdb_idl_condition_add_clause_true(&lf);
         ovsdb_idl_condition_add_clause_true(&mb);
@@ -867,12 +873,11 @@  static void
 store_nb_cfg(struct ovsdb_idl_txn *sb_txn, struct ovsdb_idl_txn *ovs_txn,
              const struct sbrec_chassis_private *chassis,
              const struct ovsrec_bridge *br_int,
-             unsigned int delay_nb_cfg_report)
+             unsigned int delay_nb_cfg_report, int64_t startup_ts)
 {
     struct ofctrl_acked_seqnos *acked_nb_cfg_seqnos =
         ofctrl_acked_seqnos_get(ofctrl_seq_type_nb_cfg);
     uint64_t cur_cfg = acked_nb_cfg_seqnos->last_acked;
-    int64_t startup_ts = daemon_startup_ts();
 
     if (ovs_txn && br_int
             && startup_ts != smap_get_ullong(&br_int->external_ids,
@@ -3860,9 +3865,6 @@  main(int argc, char *argv[])
                              debug_dump_lflow_conj_ids,
                              &lflow_output_data->conj_ids);
 
-    unixctl_command_register("debug/ignore-startup-delay", "", 0, 0,
-                             debug_ignore_startup_delay, NULL);
-
     unsigned int ovs_cond_seqno = UINT_MAX;
     unsigned int ovnsb_cond_seqno = UINT_MAX;
     unsigned int ovnsb_expected_cond_seqno = UINT_MAX;
@@ -3884,6 +3886,7 @@  main(int argc, char *argv[])
     /* Main loop. */
     exiting = false;
     restart = false;
+    int64_t startup_ts = time_wall_msec();
     bool sb_monitor_all = false;
     while (!exiting) {
         memory_run();
@@ -4062,10 +4065,6 @@  main(int argc, char *argv[])
                     }
                     stopwatch_stop(CONTROLLER_LOOP_STOPWATCH_NAME,
                                    time_msec());
-                    if (engine_has_updated()) {
-                        daemon_started_recently_countdown();
-                    }
-
                     ct_zones_data = engine_get_data(&en_ct_zones);
                     if (ovs_idl_txn) {
                         if (ct_zones_data) {
@@ -4228,7 +4227,7 @@  main(int argc, char *argv[])
             }
 
             store_nb_cfg(ovnsb_idl_txn, ovs_idl_txn, chassis_private,
-                         br_int, delay_nb_cfg_report);
+                         br_int, delay_nb_cfg_report, startup_ts);
 
             if (pending_pkt.conn) {
                 struct ed_type_addr_sets *as_data =
@@ -4705,11 +4704,3 @@  debug_dump_lflow_conj_ids(struct unixctl_conn *conn, int argc OVS_UNUSED,
     unixctl_command_reply(conn, ds_cstr(&conj_ids_dump));
     ds_destroy(&conj_ids_dump);
 }
-
-static void
-debug_ignore_startup_delay(struct unixctl_conn *conn, int argc OVS_UNUSED,
-                           const char *argv[] OVS_UNUSED, void *arg OVS_UNUSED)
-{
-    daemon_started_recently_ignore();
-    unixctl_command_reply(conn, NULL);
-}
diff --git a/controller/patch.c b/controller/patch.c
index 12e0b6f7c..ed831302c 100644
--- a/controller/patch.c
+++ b/controller/patch.c
@@ -311,14 +311,7 @@  patch_run(struct ovsdb_idl_txn *ovs_idl_txn,
     SHASH_FOR_EACH_SAFE (port_node, &existing_ports) {
         port = port_node->data;
         shash_delete(&existing_ports, port_node);
-        /* Wait for some iterations before really deleting any patch ports,
-         * because with conditional monitoring it is possible that related SB
-         * data is not completely downloaded yet after last restart of
-         * ovn-controller.  Otherwise it may cause unncessary dataplane
-         * interruption during restart/upgrade. */
-        if (!daemon_started_recently()) {
-            remove_port(bridge_table, port);
-        }
+        remove_port(bridge_table, port);
     }
     shash_destroy(&existing_ports);
 }
diff --git a/controller/vif-plug.c b/controller/vif-plug.c
index 38348bf54..eafc513f5 100644
--- a/controller/vif-plug.c
+++ b/controller/vif-plug.c
@@ -469,8 +469,7 @@  vif_plug_iface_touched_this_txn(
 static bool
 vif_plug_handle_lport_vif(const struct sbrec_port_binding *pb,
                           struct vif_plug_ctx_in *vif_plug_ctx_in,
-                          struct vif_plug_ctx_out *vif_plug_ctx_out,
-                          bool can_unplug)
+                          struct vif_plug_ctx_out *vif_plug_ctx_out)
 {
     if (vif_plug_iface_touched_this_txn(vif_plug_ctx_out, pb->logical_port)) {
         return true;
@@ -482,7 +481,7 @@  vif_plug_handle_lport_vif(const struct sbrec_port_binding *pb,
     if (lport_can_bind_on_this_chassis(vif_plug_ctx_in->chassis_rec, pb)) {
         handled &= consider_plug_lport(pb, lbinding,
                                        vif_plug_ctx_in, vif_plug_ctx_out);
-    } else if (can_unplug && lbinding && lbinding->iface) {
+    } else if (lbinding && lbinding->iface) {
         handled &= consider_unplug_iface(lbinding->iface, pb,
                                          vif_plug_ctx_in, vif_plug_ctx_out);
     }
@@ -492,8 +491,7 @@  vif_plug_handle_lport_vif(const struct sbrec_port_binding *pb,
 static bool
 vif_plug_handle_iface(const struct ovsrec_interface *iface_rec,
                       struct vif_plug_ctx_in *vif_plug_ctx_in,
-                      struct vif_plug_ctx_out *vif_plug_ctx_out,
-                      bool can_unplug)
+                      struct vif_plug_ctx_out *vif_plug_ctx_out)
 {
     bool handled = true;
     const char *vif_plug_type = smap_get(&iface_rec->external_ids,
@@ -513,10 +511,8 @@  vif_plug_handle_iface(const struct ovsrec_interface *iface_rec,
          * consider updating it */
         handled &= consider_plug_lport(pb, lbinding,
                                        vif_plug_ctx_in, vif_plug_ctx_out);
-    } else if (can_unplug
-               && (!pb
-                   || !lport_can_bind_on_this_chassis(
-                       vif_plug_ctx_in->chassis_rec, pb))) {
+    } else if (!pb || !lport_can_bind_on_this_chassis(
+                           vif_plug_ctx_in->chassis_rec, pb)) {
         /* No lport for this interface or it is destined for different chassis,
          * consuder unplugging it */
         handled &= consider_unplug_iface(iface_rec, pb,
@@ -525,31 +521,17 @@  vif_plug_handle_iface(const struct ovsrec_interface *iface_rec,
     return handled;
 }
 
-/* On initial startup or on IDL reconnect, several rounds of the main loop may
- * run before data is actually loaded in the IDL, primarily depending on
- * conditional monitoring status and other events that could trigger main loop
- * runs during this period.  Until we find a reliable way to determine the
- * completeness of the initial data downloading we need this counter so that we
- * do not erronously unplug ports because the data is just not loaded yet.
- */
 void
 vif_plug_run(struct vif_plug_ctx_in *vif_plug_ctx_in,
              struct vif_plug_ctx_out *vif_plug_ctx_out)
 {
-    bool delay_plug = daemon_started_recently();
-    if (delay_plug) {
-        VLOG_DBG("vif_plug_run: daemon started recently, will not unplug "
-                 "ports in this iteration.");
-    }
-
     if (!vif_plug_ctx_in->chassis_rec) {
         return;
     }
     const struct ovsrec_interface *iface_rec;
     OVSREC_INTERFACE_TABLE_FOR_EACH (iface_rec,
                                      vif_plug_ctx_in->iface_table) {
-        vif_plug_handle_iface(iface_rec, vif_plug_ctx_in, vif_plug_ctx_out,
-                              !delay_plug);
+        vif_plug_handle_iface(iface_rec, vif_plug_ctx_in, vif_plug_ctx_out);
     }
 
     struct sbrec_port_binding *target =
@@ -564,8 +546,7 @@  vif_plug_run(struct vif_plug_ctx_in *vif_plug_ctx_in,
             vif_plug_ctx_in->sbrec_port_binding_by_requested_chassis) {
         enum en_lport_type lport_type = get_lport_type(pb);
         if (lport_type == LP_VIF) {
-            vif_plug_handle_lport_vif(pb, vif_plug_ctx_in, vif_plug_ctx_out,
-                                      !delay_plug);
+            vif_plug_handle_lport_vif(pb, vif_plug_ctx_in, vif_plug_ctx_out);
         }
     }
     sbrec_port_binding_index_destroy_row(target);
diff --git a/controller/vif-plug.h b/controller/vif-plug.h
index 7a1978e38..76063591b 100644
--- a/controller/vif-plug.h
+++ b/controller/vif-plug.h
@@ -71,6 +71,7 @@  void vif_plug_clear_changed(struct shash *deleted_iface_ids);
 void vif_plug_finish_changed(struct shash *changed_iface_ids);
 void vif_plug_clear_deleted(struct shash *deleted_iface_ids);
 void vif_plug_finish_deleted(struct shash *changed_iface_ids);
+void vif_plug_reset_idl_prime_counter(void);
 
 #ifdef  __cplusplus
 }
diff --git a/lib/inc-proc-eng.c b/lib/inc-proc-eng.c
index 2e2b31033..575b774ae 100644
--- a/lib/inc-proc-eng.c
+++ b/lib/inc-proc-eng.c
@@ -312,17 +312,6 @@  engine_has_run(void)
     return false;
 }
 
-bool
-engine_has_updated(void)
-{
-    for (size_t i = 0; i < engine_n_nodes; i++) {
-        if (engine_nodes[i]->state == EN_UPDATED) {
-            return true;
-        }
-    }
-    return false;
-}
-
 bool
 engine_aborted(void)
 {
diff --git a/lib/inc-proc-eng.h b/lib/inc-proc-eng.h
index dc365dc18..9bfab1f7c 100644
--- a/lib/inc-proc-eng.h
+++ b/lib/inc-proc-eng.h
@@ -238,10 +238,6 @@  bool engine_node_changed(struct engine_node *node);
 /* Return true if the engine has run in the last iteration. */
 bool engine_has_run(void);
 
-/* Return true if the engine has any update in any node, i.e. any input
- * has changed; false if nothing has changed. */
-bool engine_has_updated(void);
-
 /* Returns true if during the last engine run we had to abort processing. */
 bool engine_aborted(void);
 
diff --git a/lib/ovn-util.c b/lib/ovn-util.c
index d80db179a..616999eab 100644
--- a/lib/ovn-util.c
+++ b/lib/ovn-util.c
@@ -883,55 +883,3 @@  get_bridge(const struct ovsrec_bridge_table *bridge_table, const char *br_name)
     }
     return NULL;
 }
-
-#define DAEMON_STARTUP_DELAY_SEED 20
-#define DAEMON_STARTUP_DELAY_MS   10000
-
-static int64_t startup_ts;
-static int startup_delay = DAEMON_STARTUP_DELAY_SEED;
-
-/* Used by debug command only, for tests. */
-static bool ignore_startup_delay = false;
-
-OVS_CONSTRUCTOR(startup_ts_initializer) {
-    startup_ts = time_wall_msec();
-}
-
-int64_t
-daemon_startup_ts(void)
-{
-    return startup_ts;
-}
-
-void
-daemon_started_recently_countdown(void)
-{
-    if (startup_delay > 0) {
-        startup_delay--;
-    }
-}
-
-void
-daemon_started_recently_ignore(void)
-{
-    ignore_startup_delay = true;
-}
-
-bool
-daemon_started_recently(void)
-{
-    if (ignore_startup_delay) {
-        return false;
-    }
-
-    VLOG_DBG("startup_delay: %d, startup_ts: %"PRId64, startup_delay,
-             startup_ts);
-
-    /* Ensure that at least an amount of updates has been handled. */
-    if (startup_delay) {
-        return true;
-    }
-
-    /* Ensure that at least an amount of time has passed. */
-    return time_wall_msec() - startup_ts <= DAEMON_STARTUP_DELAY_MS;
-}
diff --git a/lib/ovn-util.h b/lib/ovn-util.h
index 145f974ed..b3905ef7b 100644
--- a/lib/ovn-util.h
+++ b/lib/ovn-util.h
@@ -309,9 +309,5 @@  struct ovsrec_bridge_table;
 const struct ovsrec_bridge *get_bridge(const struct ovsrec_bridge_table *,
                                        const char *br_name);
 
-void daemon_started_recently_countdown(void);
-void daemon_started_recently_ignore(void);
-bool daemon_started_recently(void);
-int64_t daemon_startup_ts(void);
 
 #endif /* OVN_UTIL_H */
diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at
index 3c3fb31c7..2d765f31a 100644
--- a/tests/ovn-controller.at
+++ b/tests/ovn-controller.at
@@ -114,7 +114,6 @@  check_patches \
     'br-eth0 patch-localnet2-to-br-int patch-br-int-to-localnet2'
 
 # Delete the mapping and the ovn-bridge-mapping patch ports should go away.
-check ovn-appctl -t ovn-controller debug/ignore-startup-delay
 AT_CHECK([ovs-vsctl remove Open_vSwitch . external-ids ovn-bridge-mappings])
 check_bridge_mappings
 check_patches
@@ -2268,6 +2267,7 @@  OVS_WAIT_UNTIL([test 6 = $(as hv1 ovs-ofctl dump-flows br-int | grep -c $(port_b
 OVN_CLEANUP([hv1])
 AT_CLEANUP
 
+OVN_FOR_EACH_NORTHD([
 AT_SETUP([ovn-controller - restart should not delete patch ports])
 
 ovn_start
@@ -2337,3 +2337,4 @@  done
 AT_CHECK([grep "deleted interface patch" hv1/ovs-vswitchd.log], [1], [ignore])
 OVN_CLEANUP([hv1])
 AT_CLEANUP
+])
diff --git a/tests/ovn.at b/tests/ovn.at
index bba2c9c1d..21a99089f 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -31681,7 +31681,6 @@  OVS_WAIT_UNTIL([
     test x43 = x$(as hv1 ovs-vsctl get Interface ${iface1_uuid} mtu_request)
 ])
 
-as hv1 check ovn-appctl -t ovn-controller debug/ignore-startup-delay
 # Check that pointing requested-chassis somewhere else will unplug the port
 check ovn-nbctl --wait=hv set Logical_Switch_Port lsp1 \
     options:requested-chassis=non-existent-chassis
@@ -31689,7 +31688,6 @@  OVS_WAIT_UNTIL([
     ! as hv1 ovs-vsctl get Interface ${iface1_uuid} _uuid
 ])
 
-as hv2 check ovn-appctl -t ovn-controller debug/ignore-startup-delay
 # Check that removing an lport will unplug it
 AT_CHECK([test x${iface2_uuid} = x$(as hv2 ovs-vsctl get Interface ${iface2_uuid} _uuid)], [0], [])
 check ovn-nbctl --wait=hv lsp-del ${lsp2_uuid}