diff mbox series

[ovs-dev,v2] ovsdb-idl: Support change tracking of write-only columns.

Message ID 20220415142146.13169-1-dceara@redhat.com
State Changes Requested
Headers show
Series [ovs-dev,v2] ovsdb-idl: Support change tracking of write-only columns. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Dumitru Ceara April 15, 2022, 2:21 p.m. UTC
At a first glance, change tracking should never be allowed for
write-only columns.  However, some clients (e.g., ovn-northd) that are
mostly exclusive writers of a database, use change tracking to avoid
duplicating the IDL row records into a local cache when implementing
incremental processing.

The default behavior of the IDL is to automatically turn a write-only
column into a read-write column whenever the client enables change
tracking for that column.

For the afore mentioned clients, this becomes a performance issue.
Commit 1cc618c32524 ("ovsdb-idl: Fix atomicity of writes that don't
change a column's value.") explains why writes that don't change a
column's value cannot be optimized out early if the column is
read/write.

Furthermore, if there is at least one record in any table that
changed during a transaction, then *all* records that have been
written are added to the transaction, even if their values didn't
change.  If there are many such rows (e.g., like in ovn-northd's
case) this incurs a significant overhead because:
a. the client has to build this large transaction
b. the transaction has to be sent over the network
c. the server needs to parse this (mostly) no-op update

We now introduce new IDL APIs allowing users to explicitly request
monitoring mode combinations for different columns in the IDL.

These accept the combination 'OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK'
as a valid input allowing users to change track write-only
tables.

We benchmarked ovn-northd performance when using this new mode
against NB and SB databases taken from ovn-kubernetes scale tests.
We noticed that when a minor change is performed to the Northbound
database (e.g., NB_Global.nb_cfg is incremented) the time it takes to
build the Southbound transaction becomes negligible (vs ~1.5 seconds
before this change).

End-to-end ovn-kubernetes scale tests on 120-node clusters also show
significant reduction of latency to bring up pods; both average and P99
latency decreased by ~30%.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
V2:
- Addressed Numan's comments:
  - Added APIs to allow per column configuration of modes.
- Fixed unit tests to actually enable change tracking for write-only
  columns.
- Fixed ovsdb_idl_row_change() to correctly update row/table/idl change
  seqnos if change tracking is enabled (even for write-only rows).

Note: The OVN counter part change is:
https://github.com/dceara/ovn/commit/0d0c689e6469ae2f8e195c423e9a91fbc514bd60
---
 NEWS               |  4 +++
 lib/ovsdb-idl.c    | 85 +++++++++++++++++++++++++++++++++-------------
 lib/ovsdb-idl.h    |  8 ++++-
 tests/ovsdb-idl.at | 20 +++++++++--
 tests/test-ovsdb.c | 25 ++++++++++----
 5 files changed, 108 insertions(+), 34 deletions(-)

Comments

Han Zhou April 15, 2022, 6:25 p.m. UTC | #1
On Fri, Apr 15, 2022 at 7:22 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> At a first glance, change tracking should never be allowed for
> write-only columns.  However, some clients (e.g., ovn-northd) that are
> mostly exclusive writers of a database, use change tracking to avoid
> duplicating the IDL row records into a local cache when implementing
> incremental processing.

Hi Dumitru,

It is still not clear to me why ovn-northd would need change-tracking for
the write-only columns. It didn't require change tracking before using the
incremental processing engine. In theory, to my understanding,
change-tracking is required only if the column is an input to some of the
engine nodes, so that it needs to be alerted when there is a change in the
database. If the column is write-only to ovn-northd, it means it doesn't
care about any change of the column from the DB, but blindly writes to the
column regardless of the current value. I checked the patch [0], which
mentioned that

    Such nodes are primary inputs to the northd incremental processing
    engine and without proper update processing for Southbound tables,
    northd might fail to timely reconcile the contents of the Southbound
    database.

I am confused that, if the columns were write-only, how come they become
inputs to northd and requires reconcile upon them? Does it mean that they
were misused before I-P, that they were in fact read/write columns to
ovn-northd?
Similarly for the patch [1], I didn't figure out what's exactly broken
without change-tracking to the options column from the commit message and
the patch itself.
I think I need to understand the context before commenting on the current
patch. Could you help explain a little more?

[0]
https://github.com/ovn-org/ovn/commit/e4d6d3455baf09c63ed610037c384855e5f64141
[1]
https://github.com/ovn-org/ovn/commit/d32a9bc5290e40dd63ef495eb4f0fcde9e446089

Thanks,
Han

>
> The default behavior of the IDL is to automatically turn a write-only
> column into a read-write column whenever the client enables change
> tracking for that column.
>
> For the afore mentioned clients, this becomes a performance issue.
> Commit 1cc618c32524 ("ovsdb-idl: Fix atomicity of writes that don't
> change a column's value.") explains why writes that don't change a
> column's value cannot be optimized out early if the column is
> read/write.
>
> Furthermore, if there is at least one record in any table that
> changed during a transaction, then *all* records that have been
> written are added to the transaction, even if their values didn't
> change.  If there are many such rows (e.g., like in ovn-northd's
> case) this incurs a significant overhead because:
> a. the client has to build this large transaction
> b. the transaction has to be sent over the network
> c. the server needs to parse this (mostly) no-op update
>
> We now introduce new IDL APIs allowing users to explicitly request
> monitoring mode combinations for different columns in the IDL.
>
> These accept the combination 'OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK'
> as a valid input allowing users to change track write-only
> tables.
>
> We benchmarked ovn-northd performance when using this new mode
> against NB and SB databases taken from ovn-kubernetes scale tests.
> We noticed that when a minor change is performed to the Northbound
> database (e.g., NB_Global.nb_cfg is incremented) the time it takes to
> build the Southbound transaction becomes negligible (vs ~1.5 seconds
> before this change).
>
> End-to-end ovn-kubernetes scale tests on 120-node clusters also show
> significant reduction of latency to bring up pods; both average and P99
> latency decreased by ~30%.
>
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> ---
> V2:
> - Addressed Numan's comments:
>   - Added APIs to allow per column configuration of modes.
> - Fixed unit tests to actually enable change tracking for write-only
>   columns.
> - Fixed ovsdb_idl_row_change() to correctly update row/table/idl change
>   seqnos if change tracking is enabled (even for write-only rows).
>
> Note: The OVN counter part change is:
>
https://github.com/dceara/ovn/commit/0d0c689e6469ae2f8e195c423e9a91fbc514bd60
> ---
>  NEWS               |  4 +++
>  lib/ovsdb-idl.c    | 85 +++++++++++++++++++++++++++++++++-------------
>  lib/ovsdb-idl.h    |  8 ++++-
>  tests/ovsdb-idl.at | 20 +++++++++--
>  tests/test-ovsdb.c | 25 ++++++++++----
>  5 files changed, 108 insertions(+), 34 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 5074b97aa51c..c1c41da94e9b 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -14,6 +14,10 @@ Post-v2.17.0
>     - OVSDB:
>       * 'relay' service model now supports transaction history, i.e.
honors the
>         'last-txn-id' field in 'monitor_cond_since' requests from clients.
> +   - OVSDB-IDL:
> +     * New interfaces for explcitly setting column modes.  The
combination
> +       'OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK' is now considered valid and
> +       enables change tracking of write-only columns.
>
>
>  v2.17.0 - 17 Feb 2022
> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
> index 882ede75598d..5f62c267d496 100644
> --- a/lib/ovsdb-idl.c
> +++ b/lib/ovsdb-idl.c
> @@ -859,20 +859,6 @@ ovsdb_idl_get_mode(struct ovsdb_idl *idl,
>      return &table->modes[column - table->class_->columns];
>  }
>
> -static void
> -ovsdb_idl_set_mode(struct ovsdb_idl *idl,
> -                   const struct ovsdb_idl_column *column,
> -                   unsigned char mode)
> -{
> -    const struct ovsdb_idl_table *table =
ovsdb_idl_table_from_column(idl,
> -
 column);
> -    size_t column_idx = column - table->class_->columns;
> -
> -    if (table->modes[column_idx] != mode) {
> -        *ovsdb_idl_get_mode(idl, column) = mode;
> -    }
> -}
> -
>  static void
>  add_ref_table(struct ovsdb_idl *idl, const struct ovsdb_base_type *base)
>  {
> @@ -902,7 +888,8 @@ void
>  ovsdb_idl_add_column(struct ovsdb_idl *idl,
>                       const struct ovsdb_idl_column *column)
>  {
> -    ovsdb_idl_set_mode(idl, column, OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT);
> +    ovsdb_idl_set_column_mode(idl, column,
> +                              OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT);
>      add_ref_table(idl, &column->type.key);
>      add_ref_table(idl, &column->type.value);
>  }
> @@ -1186,6 +1173,51 @@ ovsdb_idl_omit_alert(struct ovsdb_idl *idl,
>      *ovsdb_idl_get_mode(idl, column) &= ~(OVSDB_IDL_ALERT |
OVSDB_IDL_TRACK);
>  }
>
> +/* Sets the 'column' mode to 'mode' which must be a valid combination of
> + * OVSDB_IDL_MONITOR, OVSDB_IDL_ALERT and OVSDB_IDL_TRACK.
> + */
> +void
> +ovsdb_idl_set_column_mode(struct ovsdb_idl *idl,
> +                          const struct ovsdb_idl_column *column,
> +                          unsigned char mode)
> +{
> +    if (mode & ~(OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT | OVSDB_IDL_TRACK))
{
> +        VLOG_ABORT("%s(): Column '%s' unknown mode %02x.",
> +                   __func__, column->name, mode);
> +    }
> +
> +    if ((mode & (OVSDB_IDL_ALERT | OVSDB_IDL_TRACK))
> +         && !(mode & OVSDB_IDL_MONITOR)) {
> +        VLOG_ABORT("%s(): Column '%s' mode must include
OVSDB_IDL_MONITOR.",
> +                   __func__, column->name);
> +    }
> +
> +    const struct ovsdb_idl_table *table =
ovsdb_idl_table_from_column(idl,
> +
 column);
> +    size_t column_idx = column - table->class_->columns;
> +
> +    if (table->modes[column_idx] != mode) {
> +        *ovsdb_idl_get_mode(idl, column) = mode;
> +    }
> +}
> +
> +/* Walks all columns of tables in 'idl' and sets their mode to 'mode'
> + * which must be a valid combination of OVSDB_IDL_MONITOR,
OVSDB_IDL_ALERT
> + * and OVSDB_IDL_TRACK.
> + */
> +void
> +ovsdb_idl_set_column_mode_all(struct ovsdb_idl *idl, unsigned char mode)
> +{
> +    for (size_t i = 0; i < idl->class_->n_tables; i++) {
> +        const struct ovsdb_idl_table_class *tc = &idl->class_->tables[i];
> +
> +        for (size_t j = 0; j < tc->n_columns; j++) {
> +            const struct ovsdb_idl_column *column = &tc->columns[j];
> +            ovsdb_idl_set_column_mode(idl, column, mode);
> +        }
> +    }
> +}
> +
>  /* Sets the mode for 'column' in 'idl' to 0.  See the big comment above
>   * OVSDB_IDL_MONITOR for details.
>   *
> @@ -1236,8 +1268,8 @@ ovsdb_idl_row_get_seqno(const struct ovsdb_idl_row
*row,
>   * functions.
>   *
>   * This function should be called between ovsdb_idl_create() and
> - * the first call to ovsdb_idl_run(). The column to be tracked
> - * should have OVSDB_IDL_ALERT turned on.
> + * the first call to ovsdb_idl_run(). The function will also ensure
> + * that the column to be tracked has OVSDB_IDL_ALERT turned on.
>   */
>  void
>  ovsdb_idl_track_add_column(struct ovsdb_idl *idl,
> @@ -1246,7 +1278,9 @@ ovsdb_idl_track_add_column(struct ovsdb_idl *idl,
>      if (!(*ovsdb_idl_get_mode(idl, column) & OVSDB_IDL_ALERT)) {
>          ovsdb_idl_add_column(idl, column);
>      }
> -    *ovsdb_idl_get_mode(idl, column) |= OVSDB_IDL_TRACK;
> +
> +    unsigned char mode = OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT |
OVSDB_IDL_TRACK;
> +    ovsdb_idl_set_column_mode(idl, column, mode);
>  }
>
>  void
> @@ -1694,11 +1728,14 @@ ovsdb_idl_row_change(struct ovsdb_idl_row *row,
const struct shash *values,
>              continue;
>          }
>
> -        if (datum_changed && table->modes[column_idx] & OVSDB_IDL_ALERT)
{
> -            changed = true;
> -            row->change_seqno[change]
> -                = row->table->change_seqno[change]
> -                = row->table->idl->change_seqno + 1;
> +        if (datum_changed) {
> +            unsigned char column_mode = table->modes[column_idx];
> +            if (column_mode & (OVSDB_IDL_ALERT | OVSDB_IDL_TRACK)) {
> +                changed = true;
> +                row->change_seqno[change]
> +                    = row->table->change_seqno[change]
> +                    = row->table->idl->change_seqno + 1;
> +            }
>
>              if (table->modes[column_idx] & OVSDB_IDL_TRACK) {
>                  if (ovs_list_is_empty(&row->track_node) &&
> @@ -3501,7 +3538,7 @@ ovsdb_idl_txn_write__(const struct ovsdb_idl_row
*row_,
>
>      class = row->table->class_;
>      column_idx = column - class->columns;
> -    write_only = row->table->modes[column_idx] == OVSDB_IDL_MONITOR;
> +    write_only = !(row->table->modes[column_idx] & OVSDB_IDL_ALERT);
>
>      ovs_assert(row->new_datum != NULL);
>      ovs_assert(column_idx < class->n_columns);
> diff --git a/lib/ovsdb-idl.h b/lib/ovsdb-idl.h
> index d00599616ef9..2d74ed0727a7 100644
> --- a/lib/ovsdb-idl.h
> +++ b/lib/ovsdb-idl.h
> @@ -180,7 +180,8 @@ bool ovsdb_idl_server_has_column(const struct
ovsdb_idl *,
>   *
>   *   - (OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT | OVSDB_IDL_TRACK), for a
column
>   *     that a client wants to track using the change tracking
> - *     ovsdb_idl_track_get_*() functions.
> + *     ovsdb_idl_track_get_*() functions.  OVSDB_IDL_ALERT can be
omitted for
> + *     write-only columns.
>   */
>  #define OVSDB_IDL_MONITOR (1 << 0) /* Replicate this column? */
>  #define OVSDB_IDL_ALERT   (1 << 1) /* Alert client when column changes?
*/
> @@ -193,6 +194,11 @@ void ovsdb_idl_add_table(struct ovsdb_idl *,
>  void ovsdb_idl_omit(struct ovsdb_idl *, const struct ovsdb_idl_column *);
>  void ovsdb_idl_omit_alert(struct ovsdb_idl *, const struct
ovsdb_idl_column *);
>
> +void ovsdb_idl_set_column_mode(struct ovsdb_idl *,
> +                               const struct ovsdb_idl_column *,
> +                               unsigned char mode);
> +void ovsdb_idl_set_column_mode_all(struct ovsdb_idl *, unsigned char
mode);
> +
>  /* Change tracking.
>   *
>   * In OVSDB, change tracking is applied at each client in the IDL
layer.  This
> diff --git a/tests/ovsdb-idl.at b/tests/ovsdb-idl.at
> index 62e2b638320c..0436e0f77b0f 100644
> --- a/tests/ovsdb-idl.at
> +++ b/tests/ovsdb-idl.at
> @@ -1154,12 +1154,25 @@ OVSDB_CHECK_IDL_WO_MONITOR_COND([simple idl
disable monitor-cond],
>  ]])
>
>  m4_define([OVSDB_CHECK_IDL_TRACK_C],
> -  [AT_SETUP([$1 - C])
> +  [AT_SETUP([$1 - alert - C])
> +   AT_KEYWORDS([ovsdb server idl tracking positive $5])
> +   AT_CHECK([ovsdb_start_idltest])
> +   m4_if([$2], [], [],
> +     [AT_CHECK([ovsdb-client transact unix:socket $2], [0], [ignore],
[ignore])])
> +   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
-t10 --change-track=alert idl unix:socket $3],
> +            [0], [stdout], [ignore])
> +   AT_CHECK([sort stdout | uuidfilt]m4_if([$6],,, [[| $6]]),
> +            [0], [$4])
> +   OVSDB_SERVER_SHUTDOWN
> +   AT_CLEANUP])
> +
> +m4_define([OVSDB_CHECK_IDL_TRACK_NOALERT_C],
> +  [AT_SETUP([$1 - noalert - C])
>     AT_KEYWORDS([ovsdb server idl tracking positive $5])
>     AT_CHECK([ovsdb_start_idltest])
>     m4_if([$2], [], [],
>       [AT_CHECK([ovsdb-client transact unix:socket $2], [0], [ignore],
[ignore])])
> -   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
-t10 -c idl unix:socket $3],
> +   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
-t10 --change-track=noalert idl unix:socket $3],
>              [0], [stdout], [ignore])
>     AT_CHECK([sort stdout | uuidfilt]m4_if([$6],,, [[| $6]]),
>              [0], [$4])
> @@ -1167,7 +1180,8 @@ m4_define([OVSDB_CHECK_IDL_TRACK_C],
>     AT_CLEANUP])
>
>  m4_define([OVSDB_CHECK_IDL_TRACK],
> -  [OVSDB_CHECK_IDL_TRACK_C($@)])
> +  [OVSDB_CHECK_IDL_TRACK_C($@)
> +   OVSDB_CHECK_IDL_TRACK_NOALERT_C($@)])
>
>  OVSDB_CHECK_IDL_TRACK([track, simple idl, initially populated],
>    [['["idltest",
> diff --git a/tests/test-ovsdb.c b/tests/test-ovsdb.c
> index ca4e87b8115c..abd30cf27d18 100644
> --- a/tests/test-ovsdb.c
> +++ b/tests/test-ovsdb.c
> @@ -56,6 +56,7 @@
>  VLOG_DEFINE_THIS_MODULE(test_ovsdb);
>
>  struct test_ovsdb_pvt_context {
> +    bool track_alert;
>      bool track;
>  };
>
> @@ -122,6 +123,15 @@ parse_options(int argc, char *argv[], struct
test_ovsdb_pvt_context *pvt)
>              break;
>
>          case 'c':
> +            if (optarg) {
> +                if (!strcmp(optarg, "noalert")) {
> +                    pvt->track_alert = false;
> +                } else if (!strcmp(optarg, "alert")) {
> +                    pvt->track_alert = true;
> +                } else {
> +                    ovs_fatal(0, "value %s is invalid", optarg);
> +                }
> +            }
>              pvt->track = true;
>              break;
>
> @@ -2610,6 +2620,7 @@ update_conditions(struct ovsdb_idl *idl, char
*commands)
>  static void
>  do_idl(struct ovs_cmdl_context *ctx)
>  {
> +    struct test_ovsdb_pvt_context *pvt = ctx->pvt;
>      struct jsonrpc *rpc;
>      struct ovsdb_idl *idl;
>      unsigned int seqno = 0;
> @@ -2618,9 +2629,6 @@ do_idl(struct ovs_cmdl_context *ctx)
>      int step = 0;
>      int error;
>      int i;
> -    bool track;
> -
> -    track = ((struct test_ovsdb_pvt_context *)(ctx->pvt))->track;
>
>      idl = ovsdb_idl_create(ctx->argv[1], &idltest_idl_class, true, true);
>      ovsdb_idl_set_leader_only(idl, false);
> @@ -2637,8 +2645,13 @@ do_idl(struct ovs_cmdl_context *ctx)
>          rpc = NULL;
>      }
>
> -    if (track) {
> -        ovsdb_idl_track_add_all(idl);
> +    if (pvt->track) {
> +        unsigned char mode = OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK;
> +
> +        if (pvt->track_alert) {
> +            mode |= OVSDB_IDL_ALERT;
> +        }
> +        ovsdb_idl_set_column_mode_all(idl, mode);
>      }
>
>      setvbuf(stdout, NULL, _IONBF, 0);
> @@ -2683,7 +2696,7 @@ do_idl(struct ovs_cmdl_context *ctx)
>              }
>
>              /* Print update. */
> -            if (track) {
> +            if (pvt->track) {
>                  print_idl_track(idl, step++, terse);
>                  ovsdb_idl_track_clear(idl);
>              } else {
> --
> 2.27.0
>
Dumitru Ceara April 15, 2022, 8:45 p.m. UTC | #2
On 4/15/22 20:25, Han Zhou wrote:
> On Fri, Apr 15, 2022 at 7:22 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> At a first glance, change tracking should never be allowed for
>> write-only columns.  However, some clients (e.g., ovn-northd) that are
>> mostly exclusive writers of a database, use change tracking to avoid
>> duplicating the IDL row records into a local cache when implementing
>> incremental processing.
> 
> Hi Dumitru,
> 

Hi Han,

> It is still not clear to me why ovn-northd would need change-tracking for
> the write-only columns. It didn't require change tracking before using the
> incremental processing engine. In theory, to my understanding,
> change-tracking is required only if the column is an input to some of the
> engine nodes, so that it needs to be alerted when there is a change in the
> database. If the column is write-only to ovn-northd, it means it doesn't
> care about any change of the column from the DB, but blindly writes to the
> column regardless of the current value. I checked the patch [0], which
> mentioned that
> 
>     Such nodes are primary inputs to the northd incremental processing
>     engine and without proper update processing for Southbound tables,
>     northd might fail to timely reconcile the contents of the Southbound
>     database.
> 
> I am confused that, if the columns were write-only, how come they become
> inputs to northd and requires reconcile upon them? Does it mean that they
> were misused before I-P, that they were in fact read/write columns to
> ovn-northd?

I think it's actually a bit more complicated than this.

Even for pure write-only tables/columns we have a problem due to the
fact that clustered ovsdb offers at-least-once consistency [2].

We actually hit the case in which a single northd transaction to insert
a SB.Load_Balancer record resulted in two identical rows being added to
the database, both of them pointing to the same logical switch datapath
binding [3].  When that logical switch was deleted northd would fail to
remove the duplicate from the SB causing a referential integrity
violation due to the dangling reference.

Arguably the schema for load balancers is not ideal because it doesn't
defines an index on "name".  Nevertheless, that's not something we can
change easily because it will (quite likely with raft) break existing
deployments.  And it applies to other tables in the NB/SB schema.

The fix was relatively straighforward though [4], and it meant fixing
the way northd reconciles the SB.Load_balancers.

This is were I-P became an issue.  Consider the current main branch
code with commit e4d6d3455baf ("ovn-northd: Enable change tracking for
all SB tables.") reverted.  In a sandbox we do:

$ ovn-nbctl ls-add ls
$ ovn-nbctl lb-add lb1 1.1.1.1:1000 2.2.2.2:2000 -- ls-lb-add ls lb1

# At this point the SB.LB table looks ok:
$ ovn-sbctl list load_balancer                                                                                                                                                                          
_uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
name                : lb1
options             : {hairpin_orig_tuple="true"}
protocol            : tcp
vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}

# Simulate the bug in [3]
$ ovn-sbctl create load_balancer name=lb1 external_ids:lb_id=c2206d6a-6939-47cc-92de-554d2bc163b0

# Even with the northd fix [4], due to the fact that the LB table is
# write-only, northd doesn't wake up so we end up with the duplicate
# staying in the SB for an indeterminate time.
$ ovn-sbctl list load_balancer
_uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
name                : lb1
options             : {hairpin_orig_tuple="true"}
protocol            : tcp
vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}

_uuid               : 0bff0f83-234a-4631-ab88-2348b9d07d1f
datapaths           : []
external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
name                : lb1
options             : {}
protocol            : []
vips                : {}

# Only when we generate an input for a "non-write-only" column the SB.LB
# is reconciled, because the "northd" I-P node runs.
$ ovn-nbctl --wait=sb sync
$ ovn-sbctl list load_balancer
_uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
name                : lb1
options             : {hairpin_orig_tuple="true"}
protocol            : tcp
vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}

> Similarly for the patch [1], I didn't figure out what's exactly broken
> without change-tracking to the options column from the commit message and
> the patch itself.

I think the reason behind [1] is that ovn-controller (in the pinctrl
module) writes some options, SB.Port_Binding.options:ipv6_ra_pd_list,
which were then propagated by ovn-northd to
Logical_Router_Port.ipv6_prefix.

This is done in ovn_update_ipv6_prefix(), part of ovnnb_db_run() in the
"northd" I-P node.  But, if the only change that woke up northd was the
SB.Port_Binding.options:ipv6_ra_pd_list change, because the column is
not tracked, the en_sb_port_binding I-P node (input of northd) doesn't
run.

> I think I need to understand the context before commenting on the current
> patch. Could you help explain a little more?
> 

I hope this makes it more clear.  I'm sure there are more aspects I
missed here but I think we need to find a way to make northd work
correctly and without delays in reconciling state while at the same
time avoiding to affect latency (like in the performance regression
I introduced with [0]).

> [0]
> https://github.com/ovn-org/ovn/commit/e4d6d3455baf09c63ed610037c384855e5f64141
> [1]
> https://github.com/ovn-org/ovn/commit/d32a9bc5290e40dd63ef495eb4f0fcde9e446089
> 

[2] https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst#understanding-cluster-consistency
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2045577
[4] https://github.com/ovn-org/ovn/commit/9deb000536e0dcf81d0e61d5a8af1d4e655960b4

> Thanks,
> Han
> 

Thanks,
Dumitru

>>
>> The default behavior of the IDL is to automatically turn a write-only
>> column into a read-write column whenever the client enables change
>> tracking for that column.
>>
>> For the afore mentioned clients, this becomes a performance issue.
>> Commit 1cc618c32524 ("ovsdb-idl: Fix atomicity of writes that don't
>> change a column's value.") explains why writes that don't change a
>> column's value cannot be optimized out early if the column is
>> read/write.
>>
>> Furthermore, if there is at least one record in any table that
>> changed during a transaction, then *all* records that have been
>> written are added to the transaction, even if their values didn't
>> change.  If there are many such rows (e.g., like in ovn-northd's
>> case) this incurs a significant overhead because:
>> a. the client has to build this large transaction
>> b. the transaction has to be sent over the network
>> c. the server needs to parse this (mostly) no-op update
>>
>> We now introduce new IDL APIs allowing users to explicitly request
>> monitoring mode combinations for different columns in the IDL.
>>
>> These accept the combination 'OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK'
>> as a valid input allowing users to change track write-only
>> tables.
>>
>> We benchmarked ovn-northd performance when using this new mode
>> against NB and SB databases taken from ovn-kubernetes scale tests.
>> We noticed that when a minor change is performed to the Northbound
>> database (e.g., NB_Global.nb_cfg is incremented) the time it takes to
>> build the Southbound transaction becomes negligible (vs ~1.5 seconds
>> before this change).
>>
>> End-to-end ovn-kubernetes scale tests on 120-node clusters also show
>> significant reduction of latency to bring up pods; both average and P99
>> latency decreased by ~30%.
>>
>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> ---
>> V2:
>> - Addressed Numan's comments:
>>   - Added APIs to allow per column configuration of modes.
>> - Fixed unit tests to actually enable change tracking for write-only
>>   columns.
>> - Fixed ovsdb_idl_row_change() to correctly update row/table/idl change
>>   seqnos if change tracking is enabled (even for write-only rows).
>>
>> Note: The OVN counter part change is:
>>
> https://github.com/dceara/ovn/commit/0d0c689e6469ae2f8e195c423e9a91fbc514bd60
>> ---
>>  NEWS               |  4 +++
>>  lib/ovsdb-idl.c    | 85 +++++++++++++++++++++++++++++++++-------------
>>  lib/ovsdb-idl.h    |  8 ++++-
>>  tests/ovsdb-idl.at | 20 +++++++++--
>>  tests/test-ovsdb.c | 25 ++++++++++----
>>  5 files changed, 108 insertions(+), 34 deletions(-)
>>
>> diff --git a/NEWS b/NEWS
>> index 5074b97aa51c..c1c41da94e9b 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -14,6 +14,10 @@ Post-v2.17.0
>>     - OVSDB:
>>       * 'relay' service model now supports transaction history, i.e.
> honors the
>>         'last-txn-id' field in 'monitor_cond_since' requests from clients.
>> +   - OVSDB-IDL:
>> +     * New interfaces for explcitly setting column modes.  The
> combination
>> +       'OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK' is now considered valid and
>> +       enables change tracking of write-only columns.
>>
>>
>>  v2.17.0 - 17 Feb 2022
>> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
>> index 882ede75598d..5f62c267d496 100644
>> --- a/lib/ovsdb-idl.c
>> +++ b/lib/ovsdb-idl.c
>> @@ -859,20 +859,6 @@ ovsdb_idl_get_mode(struct ovsdb_idl *idl,
>>      return &table->modes[column - table->class_->columns];
>>  }
>>
>> -static void
>> -ovsdb_idl_set_mode(struct ovsdb_idl *idl,
>> -                   const struct ovsdb_idl_column *column,
>> -                   unsigned char mode)
>> -{
>> -    const struct ovsdb_idl_table *table =
> ovsdb_idl_table_from_column(idl,
>> -
>  column);
>> -    size_t column_idx = column - table->class_->columns;
>> -
>> -    if (table->modes[column_idx] != mode) {
>> -        *ovsdb_idl_get_mode(idl, column) = mode;
>> -    }
>> -}
>> -
>>  static void
>>  add_ref_table(struct ovsdb_idl *idl, const struct ovsdb_base_type *base)
>>  {
>> @@ -902,7 +888,8 @@ void
>>  ovsdb_idl_add_column(struct ovsdb_idl *idl,
>>                       const struct ovsdb_idl_column *column)
>>  {
>> -    ovsdb_idl_set_mode(idl, column, OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT);
>> +    ovsdb_idl_set_column_mode(idl, column,
>> +                              OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT);
>>      add_ref_table(idl, &column->type.key);
>>      add_ref_table(idl, &column->type.value);
>>  }
>> @@ -1186,6 +1173,51 @@ ovsdb_idl_omit_alert(struct ovsdb_idl *idl,
>>      *ovsdb_idl_get_mode(idl, column) &= ~(OVSDB_IDL_ALERT |
> OVSDB_IDL_TRACK);
>>  }
>>
>> +/* Sets the 'column' mode to 'mode' which must be a valid combination of
>> + * OVSDB_IDL_MONITOR, OVSDB_IDL_ALERT and OVSDB_IDL_TRACK.
>> + */
>> +void
>> +ovsdb_idl_set_column_mode(struct ovsdb_idl *idl,
>> +                          const struct ovsdb_idl_column *column,
>> +                          unsigned char mode)
>> +{
>> +    if (mode & ~(OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT | OVSDB_IDL_TRACK))
> {
>> +        VLOG_ABORT("%s(): Column '%s' unknown mode %02x.",
>> +                   __func__, column->name, mode);
>> +    }
>> +
>> +    if ((mode & (OVSDB_IDL_ALERT | OVSDB_IDL_TRACK))
>> +         && !(mode & OVSDB_IDL_MONITOR)) {
>> +        VLOG_ABORT("%s(): Column '%s' mode must include
> OVSDB_IDL_MONITOR.",
>> +                   __func__, column->name);
>> +    }
>> +
>> +    const struct ovsdb_idl_table *table =
> ovsdb_idl_table_from_column(idl,
>> +
>  column);
>> +    size_t column_idx = column - table->class_->columns;
>> +
>> +    if (table->modes[column_idx] != mode) {
>> +        *ovsdb_idl_get_mode(idl, column) = mode;
>> +    }
>> +}
>> +
>> +/* Walks all columns of tables in 'idl' and sets their mode to 'mode'
>> + * which must be a valid combination of OVSDB_IDL_MONITOR,
> OVSDB_IDL_ALERT
>> + * and OVSDB_IDL_TRACK.
>> + */
>> +void
>> +ovsdb_idl_set_column_mode_all(struct ovsdb_idl *idl, unsigned char mode)
>> +{
>> +    for (size_t i = 0; i < idl->class_->n_tables; i++) {
>> +        const struct ovsdb_idl_table_class *tc = &idl->class_->tables[i];
>> +
>> +        for (size_t j = 0; j < tc->n_columns; j++) {
>> +            const struct ovsdb_idl_column *column = &tc->columns[j];
>> +            ovsdb_idl_set_column_mode(idl, column, mode);
>> +        }
>> +    }
>> +}
>> +
>>  /* Sets the mode for 'column' in 'idl' to 0.  See the big comment above
>>   * OVSDB_IDL_MONITOR for details.
>>   *
>> @@ -1236,8 +1268,8 @@ ovsdb_idl_row_get_seqno(const struct ovsdb_idl_row
> *row,
>>   * functions.
>>   *
>>   * This function should be called between ovsdb_idl_create() and
>> - * the first call to ovsdb_idl_run(). The column to be tracked
>> - * should have OVSDB_IDL_ALERT turned on.
>> + * the first call to ovsdb_idl_run(). The function will also ensure
>> + * that the column to be tracked has OVSDB_IDL_ALERT turned on.
>>   */
>>  void
>>  ovsdb_idl_track_add_column(struct ovsdb_idl *idl,
>> @@ -1246,7 +1278,9 @@ ovsdb_idl_track_add_column(struct ovsdb_idl *idl,
>>      if (!(*ovsdb_idl_get_mode(idl, column) & OVSDB_IDL_ALERT)) {
>>          ovsdb_idl_add_column(idl, column);
>>      }
>> -    *ovsdb_idl_get_mode(idl, column) |= OVSDB_IDL_TRACK;
>> +
>> +    unsigned char mode = OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT |
> OVSDB_IDL_TRACK;
>> +    ovsdb_idl_set_column_mode(idl, column, mode);
>>  }
>>
>>  void
>> @@ -1694,11 +1728,14 @@ ovsdb_idl_row_change(struct ovsdb_idl_row *row,
> const struct shash *values,
>>              continue;
>>          }
>>
>> -        if (datum_changed && table->modes[column_idx] & OVSDB_IDL_ALERT)
> {
>> -            changed = true;
>> -            row->change_seqno[change]
>> -                = row->table->change_seqno[change]
>> -                = row->table->idl->change_seqno + 1;
>> +        if (datum_changed) {
>> +            unsigned char column_mode = table->modes[column_idx];
>> +            if (column_mode & (OVSDB_IDL_ALERT | OVSDB_IDL_TRACK)) {
>> +                changed = true;
>> +                row->change_seqno[change]
>> +                    = row->table->change_seqno[change]
>> +                    = row->table->idl->change_seqno + 1;
>> +            }
>>
>>              if (table->modes[column_idx] & OVSDB_IDL_TRACK) {
>>                  if (ovs_list_is_empty(&row->track_node) &&
>> @@ -3501,7 +3538,7 @@ ovsdb_idl_txn_write__(const struct ovsdb_idl_row
> *row_,
>>
>>      class = row->table->class_;
>>      column_idx = column - class->columns;
>> -    write_only = row->table->modes[column_idx] == OVSDB_IDL_MONITOR;
>> +    write_only = !(row->table->modes[column_idx] & OVSDB_IDL_ALERT);
>>
>>      ovs_assert(row->new_datum != NULL);
>>      ovs_assert(column_idx < class->n_columns);
>> diff --git a/lib/ovsdb-idl.h b/lib/ovsdb-idl.h
>> index d00599616ef9..2d74ed0727a7 100644
>> --- a/lib/ovsdb-idl.h
>> +++ b/lib/ovsdb-idl.h
>> @@ -180,7 +180,8 @@ bool ovsdb_idl_server_has_column(const struct
> ovsdb_idl *,
>>   *
>>   *   - (OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT | OVSDB_IDL_TRACK), for a
> column
>>   *     that a client wants to track using the change tracking
>> - *     ovsdb_idl_track_get_*() functions.
>> + *     ovsdb_idl_track_get_*() functions.  OVSDB_IDL_ALERT can be
> omitted for
>> + *     write-only columns.
>>   */
>>  #define OVSDB_IDL_MONITOR (1 << 0) /* Replicate this column? */
>>  #define OVSDB_IDL_ALERT   (1 << 1) /* Alert client when column changes?
> */
>> @@ -193,6 +194,11 @@ void ovsdb_idl_add_table(struct ovsdb_idl *,
>>  void ovsdb_idl_omit(struct ovsdb_idl *, const struct ovsdb_idl_column *);
>>  void ovsdb_idl_omit_alert(struct ovsdb_idl *, const struct
> ovsdb_idl_column *);
>>
>> +void ovsdb_idl_set_column_mode(struct ovsdb_idl *,
>> +                               const struct ovsdb_idl_column *,
>> +                               unsigned char mode);
>> +void ovsdb_idl_set_column_mode_all(struct ovsdb_idl *, unsigned char
> mode);
>> +
>>  /* Change tracking.
>>   *
>>   * In OVSDB, change tracking is applied at each client in the IDL
> layer.  This
>> diff --git a/tests/ovsdb-idl.at b/tests/ovsdb-idl.at
>> index 62e2b638320c..0436e0f77b0f 100644
>> --- a/tests/ovsdb-idl.at
>> +++ b/tests/ovsdb-idl.at
>> @@ -1154,12 +1154,25 @@ OVSDB_CHECK_IDL_WO_MONITOR_COND([simple idl
> disable monitor-cond],
>>  ]])
>>
>>  m4_define([OVSDB_CHECK_IDL_TRACK_C],
>> -  [AT_SETUP([$1 - C])
>> +  [AT_SETUP([$1 - alert - C])
>> +   AT_KEYWORDS([ovsdb server idl tracking positive $5])
>> +   AT_CHECK([ovsdb_start_idltest])
>> +   m4_if([$2], [], [],
>> +     [AT_CHECK([ovsdb-client transact unix:socket $2], [0], [ignore],
> [ignore])])
>> +   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
> -t10 --change-track=alert idl unix:socket $3],
>> +            [0], [stdout], [ignore])
>> +   AT_CHECK([sort stdout | uuidfilt]m4_if([$6],,, [[| $6]]),
>> +            [0], [$4])
>> +   OVSDB_SERVER_SHUTDOWN
>> +   AT_CLEANUP])
>> +
>> +m4_define([OVSDB_CHECK_IDL_TRACK_NOALERT_C],
>> +  [AT_SETUP([$1 - noalert - C])
>>     AT_KEYWORDS([ovsdb server idl tracking positive $5])
>>     AT_CHECK([ovsdb_start_idltest])
>>     m4_if([$2], [], [],
>>       [AT_CHECK([ovsdb-client transact unix:socket $2], [0], [ignore],
> [ignore])])
>> -   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
> -t10 -c idl unix:socket $3],
>> +   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
> -t10 --change-track=noalert idl unix:socket $3],
>>              [0], [stdout], [ignore])
>>     AT_CHECK([sort stdout | uuidfilt]m4_if([$6],,, [[| $6]]),
>>              [0], [$4])
>> @@ -1167,7 +1180,8 @@ m4_define([OVSDB_CHECK_IDL_TRACK_C],
>>     AT_CLEANUP])
>>
>>  m4_define([OVSDB_CHECK_IDL_TRACK],
>> -  [OVSDB_CHECK_IDL_TRACK_C($@)])
>> +  [OVSDB_CHECK_IDL_TRACK_C($@)
>> +   OVSDB_CHECK_IDL_TRACK_NOALERT_C($@)])
>>
>>  OVSDB_CHECK_IDL_TRACK([track, simple idl, initially populated],
>>    [['["idltest",
>> diff --git a/tests/test-ovsdb.c b/tests/test-ovsdb.c
>> index ca4e87b8115c..abd30cf27d18 100644
>> --- a/tests/test-ovsdb.c
>> +++ b/tests/test-ovsdb.c
>> @@ -56,6 +56,7 @@
>>  VLOG_DEFINE_THIS_MODULE(test_ovsdb);
>>
>>  struct test_ovsdb_pvt_context {
>> +    bool track_alert;
>>      bool track;
>>  };
>>
>> @@ -122,6 +123,15 @@ parse_options(int argc, char *argv[], struct
> test_ovsdb_pvt_context *pvt)
>>              break;
>>
>>          case 'c':
>> +            if (optarg) {
>> +                if (!strcmp(optarg, "noalert")) {
>> +                    pvt->track_alert = false;
>> +                } else if (!strcmp(optarg, "alert")) {
>> +                    pvt->track_alert = true;
>> +                } else {
>> +                    ovs_fatal(0, "value %s is invalid", optarg);
>> +                }
>> +            }
>>              pvt->track = true;
>>              break;
>>
>> @@ -2610,6 +2620,7 @@ update_conditions(struct ovsdb_idl *idl, char
> *commands)
>>  static void
>>  do_idl(struct ovs_cmdl_context *ctx)
>>  {
>> +    struct test_ovsdb_pvt_context *pvt = ctx->pvt;
>>      struct jsonrpc *rpc;
>>      struct ovsdb_idl *idl;
>>      unsigned int seqno = 0;
>> @@ -2618,9 +2629,6 @@ do_idl(struct ovs_cmdl_context *ctx)
>>      int step = 0;
>>      int error;
>>      int i;
>> -    bool track;
>> -
>> -    track = ((struct test_ovsdb_pvt_context *)(ctx->pvt))->track;
>>
>>      idl = ovsdb_idl_create(ctx->argv[1], &idltest_idl_class, true, true);
>>      ovsdb_idl_set_leader_only(idl, false);
>> @@ -2637,8 +2645,13 @@ do_idl(struct ovs_cmdl_context *ctx)
>>          rpc = NULL;
>>      }
>>
>> -    if (track) {
>> -        ovsdb_idl_track_add_all(idl);
>> +    if (pvt->track) {
>> +        unsigned char mode = OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK;
>> +
>> +        if (pvt->track_alert) {
>> +            mode |= OVSDB_IDL_ALERT;
>> +        }
>> +        ovsdb_idl_set_column_mode_all(idl, mode);
>>      }
>>
>>      setvbuf(stdout, NULL, _IONBF, 0);
>> @@ -2683,7 +2696,7 @@ do_idl(struct ovs_cmdl_context *ctx)
>>              }
>>
>>              /* Print update. */
>> -            if (track) {
>> +            if (pvt->track) {
>>                  print_idl_track(idl, step++, terse);
>>                  ovsdb_idl_track_clear(idl);
>>              } else {
>> --
>> 2.27.0
>>
>
Han Zhou April 15, 2022, 10:41 p.m. UTC | #3
On Fri, Apr 15, 2022 at 1:45 PM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 4/15/22 20:25, Han Zhou wrote:
> > On Fri, Apr 15, 2022 at 7:22 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> At a first glance, change tracking should never be allowed for
> >> write-only columns.  However, some clients (e.g., ovn-northd) that are
> >> mostly exclusive writers of a database, use change tracking to avoid
> >> duplicating the IDL row records into a local cache when implementing
> >> incremental processing.
> >
> > Hi Dumitru,
> >
>
> Hi Han,
>
> > It is still not clear to me why ovn-northd would need change-tracking
for
> > the write-only columns. It didn't require change tracking before using
the
> > incremental processing engine. In theory, to my understanding,
> > change-tracking is required only if the column is an input to some of
the
> > engine nodes, so that it needs to be alerted when there is a change in
the
> > database. If the column is write-only to ovn-northd, it means it doesn't
> > care about any change of the column from the DB, but blindly writes to
the
> > column regardless of the current value. I checked the patch [0], which
> > mentioned that
> >
> >     Such nodes are primary inputs to the northd incremental processing
> >     engine and without proper update processing for Southbound tables,
> >     northd might fail to timely reconcile the contents of the Southbound
> >     database.
> >
> > I am confused that, if the columns were write-only, how come they become
> > inputs to northd and requires reconcile upon them? Does it mean that
they
> > were misused before I-P, that they were in fact read/write columns to
> > ovn-northd?
>
> I think it's actually a bit more complicated than this.
>
> Even for pure write-only tables/columns we have a problem due to the
> fact that clustered ovsdb offers at-least-once consistency [2].
>
> We actually hit the case in which a single northd transaction to insert
> a SB.Load_Balancer record resulted in two identical rows being added to
> the database, both of them pointing to the same logical switch datapath
> binding [3].  When that logical switch was deleted northd would fail to
> remove the duplicate from the SB causing a referential integrity
> violation due to the dangling reference.
>
> Arguably the schema for load balancers is not ideal because it doesn't
> defines an index on "name".  Nevertheless, that's not something we can
> change easily because it will (quite likely with raft) break existing
> deployments.  And it applies to other tables in the NB/SB schema.
>
> The fix was relatively straighforward though [4], and it meant fixing
> the way northd reconciles the SB.Load_balancers.
>
> This is were I-P became an issue.  Consider the current main branch
> code with commit e4d6d3455baf ("ovn-northd: Enable change tracking for
> all SB tables.") reverted.  In a sandbox we do:
>
> $ ovn-nbctl ls-add ls
> $ ovn-nbctl lb-add lb1 1.1.1.1:1000 2.2.2.2:2000 -- ls-lb-add ls lb1
>
> # At this point the SB.LB table looks ok:
> $ ovn-sbctl list load_balancer


> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
> name                : lb1
> options             : {hairpin_orig_tuple="true"}
> protocol            : tcp
> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>
> # Simulate the bug in [3]
> $ ovn-sbctl create load_balancer name=lb1
external_ids:lb_id=c2206d6a-6939-47cc-92de-554d2bc163b0
>
> # Even with the northd fix [4], due to the fact that the LB table is
> # write-only, northd doesn't wake up so we end up with the duplicate
> # staying in the SB for an indeterminate time.
> $ ovn-sbctl list load_balancer
> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
> name                : lb1
> options             : {hairpin_orig_tuple="true"}
> protocol            : tcp
> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>
> _uuid               : 0bff0f83-234a-4631-ab88-2348b9d07d1f
> datapaths           : []
> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
> name                : lb1
> options             : {}
> protocol            : []
> vips                : {}
>
> # Only when we generate an input for a "non-write-only" column the SB.LB
> # is reconciled, because the "northd" I-P node runs.
> $ ovn-nbctl --wait=sb sync
> $ ovn-sbctl list load_balancer
> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
> name                : lb1
> options             : {hairpin_orig_tuple="true"}
> protocol            : tcp
> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>
> > Similarly for the patch [1], I didn't figure out what's exactly broken
> > without change-tracking to the options column from the commit message
and
> > the patch itself.
>
> I think the reason behind [1] is that ovn-controller (in the pinctrl
> module) writes some options, SB.Port_Binding.options:ipv6_ra_pd_list,
> which were then propagated by ovn-northd to
> Logical_Router_Port.ipv6_prefix.
>
> This is done in ovn_update_ipv6_prefix(), part of ovnnb_db_run() in the
> "northd" I-P node.  But, if the only change that woke up northd was the
> SB.Port_Binding.options:ipv6_ra_pd_list change, because the column is
> not tracked, the en_sb_port_binding I-P node (input of northd) doesn't
> run.
>
> > I think I need to understand the context before commenting on the
current
> > patch. Could you help explain a little more?
> >
>
> I hope this makes it more clear.  I'm sure there are more aspects I
> missed here but I think we need to find a way to make northd work
> correctly and without delays in reconciling state while at the same
> time avoiding to affect latency (like in the performance regression
> I introduced with [0]).
>
Thanks Dumitru for the detailed explanation. I think I understand the
problem now. In both of the cases above, we were in fact monitoring the
columns as "write-only" while they were in fact "read-write".

1) The first case (the load balancer one) looks more complex, but
essentially it is an input to ovn-northd. If we consider "removing
duplicates" as a feature of northd, although the duplicates would in fact
only be triggered by northd itself (but not actually generated by itself,
it was RAFT's problem that actually generated the duplicated row), northd
is reading the records and then remove the duplicates and write it back.
That's why the northd needs to track the load-balancer columns and react to
the changes. (We may still call it "write-only" in this case, which doesn't
mean northd doesn't read the value, but just that northd is the only one
that decides what should be there, not worrying about overwriting others'
changes. I'd rather call it "read-write" to be more precise.)

2) The second case (the IPv6 one) is more straightforward. If
ovn-controller can modify the column, the column should never be regarded
as write-only to northd, which is a bug even existed before northd I-P.
There may be other such bugs that we haven't noticed before, but can be
exposed by the I-P change. These bugs were not noticed before because
anything that wakes up the main loop would trigger a recompute before I-P
implementation, so we were actually handling any changes in time. And the
problem mentioned in [5] Commit 1cc618c32524 ("ovsdb-idl: Fix atomicity of
writes that don't change a column's value.") doesn't apply to this use case
(correct me if I am wrong).

Now to the solution:

a) To fix the bugs, we shouldn't blindly change-tracking for all tables and
columns, but instead, track changes for the columns that are read-write but
treated as write-only by mistake, while keep the real write-only columns
write-only (e.g. the nb_cfg related columns in SB_Global). I think we
should fix this regardless.

b) However, this is not enough, because we are facing the performance
problem introduced by such fixes, even if we only add change-tracking to
very few new columns. E.g. the one added by [1] already impacts performance
significantly even adding tracking for just one column. We need further
improvement to solve the performance problem, which is primarily caused by
the behavior introduced by [5]. However, in most use cases we don't have
the problem mentioned by [5]. I think this is the assumption that leads to
this patch as the solution for the performance problem and I think it is
the right approach. It allows northd to get alerted by changes, but only
write back columns that were changed. I just want to emphasize that we need
this approach on top of a), and only apply for the columns that are
read-write but don't worry about atomicity of transactions if only the
changed values are written back.

With the above understanding, I have some comments for the patch:

i) Functionally I believe it serves the purpose of solving the performance
problem, but it seems confusing what the modes really mean now. There were
definitions to each of the flags:
- OVSDB_IDL_MONITOR /* Replicate this column? */
- OVSDB_IDL_ALERT /* Alert client when column changes? */
- OVSDB_IDL_TRACK // there is no comment, but it is obvious as the name
suggests.

Each of the meanings was clear. Now with the patch it seems to change the
meanings of:
- OVSDB_IDL_ALERT to "alert client when column changes, and the column is
NOT write-only"
- OVSDB_IDL_TRACK to "alert client when column changes, and track the
changes"

This is really confusing. I think the purpose here is just to provide a
capability of "writing back changed columns only" while still "alert client
when column changes". Would it be better just to add a new flag for this,
like OVSDB_IDL_WRITE_CHANGED_ONLY? It can be set properly in existed APIs
to keep the existed behavior, and provide an API for clients to set it
explicitly for columns that are read-write but doesn't care about the
atomicity.

ii) The comments in ovsdb-idl.h under "The possible mode combinations are:"
needs updating.

Thanks,
Han

> > [0]
> >
https://github.com/ovn-org/ovn/commit/e4d6d3455baf09c63ed610037c384855e5f64141
> > [1]
> >
https://github.com/ovn-org/ovn/commit/d32a9bc5290e40dd63ef495eb4f0fcde9e446089
> >
>
> [2]
https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst#understanding-cluster-consistency
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=2045577
> [4]
https://github.com/ovn-org/ovn/commit/9deb000536e0dcf81d0e61d5a8af1d4e655960b4
>
[5]
https://github.com/openvswitch/ovs/commit/1cc618c32524076d14ba3ee30e672a554b8ee605


> > Thanks,
> > Han
> >
>
> Thanks,
> Dumitru
>
> >>
> >> The default behavior of the IDL is to automatically turn a write-only
> >> column into a read-write column whenever the client enables change
> >> tracking for that column.
> >>
> >> For the afore mentioned clients, this becomes a performance issue.
> >> Commit 1cc618c32524 ("ovsdb-idl: Fix atomicity of writes that don't
> >> change a column's value.") explains why writes that don't change a
> >> column's value cannot be optimized out early if the column is
> >> read/write.
> >>
> >> Furthermore, if there is at least one record in any table that
> >> changed during a transaction, then *all* records that have been
> >> written are added to the transaction, even if their values didn't
> >> change.  If there are many such rows (e.g., like in ovn-northd's
> >> case) this incurs a significant overhead because:
> >> a. the client has to build this large transaction
> >> b. the transaction has to be sent over the network
> >> c. the server needs to parse this (mostly) no-op update
> >>
> >> We now introduce new IDL APIs allowing users to explicitly request
> >> monitoring mode combinations for different columns in the IDL.
> >>
> >> These accept the combination 'OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK'
> >> as a valid input allowing users to change track write-only
> >> tables.
> >>
> >> We benchmarked ovn-northd performance when using this new mode
> >> against NB and SB databases taken from ovn-kubernetes scale tests.
> >> We noticed that when a minor change is performed to the Northbound
> >> database (e.g., NB_Global.nb_cfg is incremented) the time it takes to
> >> build the Southbound transaction becomes negligible (vs ~1.5 seconds
> >> before this change).
> >>
> >> End-to-end ovn-kubernetes scale tests on 120-node clusters also show
> >> significant reduction of latency to bring up pods; both average and P99
> >> latency decreased by ~30%.
> >>
> >> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> >> ---
> >> V2:
> >> - Addressed Numan's comments:
> >>   - Added APIs to allow per column configuration of modes.
> >> - Fixed unit tests to actually enable change tracking for write-only
> >>   columns.
> >> - Fixed ovsdb_idl_row_change() to correctly update row/table/idl change
> >>   seqnos if change tracking is enabled (even for write-only rows).
> >>
> >> Note: The OVN counter part change is:
> >>
> >
https://github.com/dceara/ovn/commit/0d0c689e6469ae2f8e195c423e9a91fbc514bd60
> >> ---
> >>  NEWS               |  4 +++
> >>  lib/ovsdb-idl.c    | 85 +++++++++++++++++++++++++++++++++-------------
> >>  lib/ovsdb-idl.h    |  8 ++++-
> >>  tests/ovsdb-idl.at | 20 +++++++++--
> >>  tests/test-ovsdb.c | 25 ++++++++++----
> >>  5 files changed, 108 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/NEWS b/NEWS
> >> index 5074b97aa51c..c1c41da94e9b 100644
> >> --- a/NEWS
> >> +++ b/NEWS
> >> @@ -14,6 +14,10 @@ Post-v2.17.0
> >>     - OVSDB:
> >>       * 'relay' service model now supports transaction history, i.e.
> > honors the
> >>         'last-txn-id' field in 'monitor_cond_since' requests from
clients.
> >> +   - OVSDB-IDL:
> >> +     * New interfaces for explcitly setting column modes.  The
> > combination
> >> +       'OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK' is now considered valid
and
> >> +       enables change tracking of write-only columns.
> >>
> >>
> >>  v2.17.0 - 17 Feb 2022
> >> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
> >> index 882ede75598d..5f62c267d496 100644
> >> --- a/lib/ovsdb-idl.c
> >> +++ b/lib/ovsdb-idl.c
> >> @@ -859,20 +859,6 @@ ovsdb_idl_get_mode(struct ovsdb_idl *idl,
> >>      return &table->modes[column - table->class_->columns];
> >>  }
> >>
> >> -static void
> >> -ovsdb_idl_set_mode(struct ovsdb_idl *idl,
> >> -                   const struct ovsdb_idl_column *column,
> >> -                   unsigned char mode)
> >> -{
> >> -    const struct ovsdb_idl_table *table =
> > ovsdb_idl_table_from_column(idl,
> >> -
> >  column);
> >> -    size_t column_idx = column - table->class_->columns;
> >> -
> >> -    if (table->modes[column_idx] != mode) {
> >> -        *ovsdb_idl_get_mode(idl, column) = mode;
> >> -    }
> >> -}
> >> -
> >>  static void
> >>  add_ref_table(struct ovsdb_idl *idl, const struct ovsdb_base_type
*base)
> >>  {
> >> @@ -902,7 +888,8 @@ void
> >>  ovsdb_idl_add_column(struct ovsdb_idl *idl,
> >>                       const struct ovsdb_idl_column *column)
> >>  {
> >> -    ovsdb_idl_set_mode(idl, column, OVSDB_IDL_MONITOR |
OVSDB_IDL_ALERT);
> >> +    ovsdb_idl_set_column_mode(idl, column,
> >> +                              OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT);
> >>      add_ref_table(idl, &column->type.key);
> >>      add_ref_table(idl, &column->type.value);
> >>  }
> >> @@ -1186,6 +1173,51 @@ ovsdb_idl_omit_alert(struct ovsdb_idl *idl,
> >>      *ovsdb_idl_get_mode(idl, column) &= ~(OVSDB_IDL_ALERT |
> > OVSDB_IDL_TRACK);
> >>  }
> >>
> >> +/* Sets the 'column' mode to 'mode' which must be a valid combination
of
> >> + * OVSDB_IDL_MONITOR, OVSDB_IDL_ALERT and OVSDB_IDL_TRACK.
> >> + */
> >> +void
> >> +ovsdb_idl_set_column_mode(struct ovsdb_idl *idl,
> >> +                          const struct ovsdb_idl_column *column,
> >> +                          unsigned char mode)
> >> +{
> >> +    if (mode & ~(OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT |
OVSDB_IDL_TRACK))
> > {
> >> +        VLOG_ABORT("%s(): Column '%s' unknown mode %02x.",
> >> +                   __func__, column->name, mode);
> >> +    }
> >> +
> >> +    if ((mode & (OVSDB_IDL_ALERT | OVSDB_IDL_TRACK))
> >> +         && !(mode & OVSDB_IDL_MONITOR)) {
> >> +        VLOG_ABORT("%s(): Column '%s' mode must include
> > OVSDB_IDL_MONITOR.",
> >> +                   __func__, column->name);
> >> +    }
> >> +
> >> +    const struct ovsdb_idl_table *table =
> > ovsdb_idl_table_from_column(idl,
> >> +
> >  column);
> >> +    size_t column_idx = column - table->class_->columns;
> >> +
> >> +    if (table->modes[column_idx] != mode) {
> >> +        *ovsdb_idl_get_mode(idl, column) = mode;
> >> +    }
> >> +}
> >> +
> >> +/* Walks all columns of tables in 'idl' and sets their mode to 'mode'
> >> + * which must be a valid combination of OVSDB_IDL_MONITOR,
> > OVSDB_IDL_ALERT
> >> + * and OVSDB_IDL_TRACK.
> >> + */
> >> +void
> >> +ovsdb_idl_set_column_mode_all(struct ovsdb_idl *idl, unsigned char
mode)
> >> +{
> >> +    for (size_t i = 0; i < idl->class_->n_tables; i++) {
> >> +        const struct ovsdb_idl_table_class *tc =
&idl->class_->tables[i];
> >> +
> >> +        for (size_t j = 0; j < tc->n_columns; j++) {
> >> +            const struct ovsdb_idl_column *column = &tc->columns[j];
> >> +            ovsdb_idl_set_column_mode(idl, column, mode);
> >> +        }
> >> +    }
> >> +}
> >> +
> >>  /* Sets the mode for 'column' in 'idl' to 0.  See the big comment
above
> >>   * OVSDB_IDL_MONITOR for details.
> >>   *
> >> @@ -1236,8 +1268,8 @@ ovsdb_idl_row_get_seqno(const struct
ovsdb_idl_row
> > *row,
> >>   * functions.
> >>   *
> >>   * This function should be called between ovsdb_idl_create() and
> >> - * the first call to ovsdb_idl_run(). The column to be tracked
> >> - * should have OVSDB_IDL_ALERT turned on.
> >> + * the first call to ovsdb_idl_run(). The function will also ensure
> >> + * that the column to be tracked has OVSDB_IDL_ALERT turned on.
> >>   */
> >>  void
> >>  ovsdb_idl_track_add_column(struct ovsdb_idl *idl,
> >> @@ -1246,7 +1278,9 @@ ovsdb_idl_track_add_column(struct ovsdb_idl *idl,
> >>      if (!(*ovsdb_idl_get_mode(idl, column) & OVSDB_IDL_ALERT)) {
> >>          ovsdb_idl_add_column(idl, column);
> >>      }
> >> -    *ovsdb_idl_get_mode(idl, column) |= OVSDB_IDL_TRACK;
> >> +
> >> +    unsigned char mode = OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT |
> > OVSDB_IDL_TRACK;
> >> +    ovsdb_idl_set_column_mode(idl, column, mode);
> >>  }
> >>
> >>  void
> >> @@ -1694,11 +1728,14 @@ ovsdb_idl_row_change(struct ovsdb_idl_row *row,
> > const struct shash *values,
> >>              continue;
> >>          }
> >>
> >> -        if (datum_changed && table->modes[column_idx] &
OVSDB_IDL_ALERT)
> > {
> >> -            changed = true;
> >> -            row->change_seqno[change]
> >> -                = row->table->change_seqno[change]
> >> -                = row->table->idl->change_seqno + 1;
> >> +        if (datum_changed) {
> >> +            unsigned char column_mode = table->modes[column_idx];
> >> +            if (column_mode & (OVSDB_IDL_ALERT | OVSDB_IDL_TRACK)) {
> >> +                changed = true;
> >> +                row->change_seqno[change]
> >> +                    = row->table->change_seqno[change]
> >> +                    = row->table->idl->change_seqno + 1;
> >> +            }
> >>
> >>              if (table->modes[column_idx] & OVSDB_IDL_TRACK) {
> >>                  if (ovs_list_is_empty(&row->track_node) &&
> >> @@ -3501,7 +3538,7 @@ ovsdb_idl_txn_write__(const struct ovsdb_idl_row
> > *row_,
> >>
> >>      class = row->table->class_;
> >>      column_idx = column - class->columns;
> >> -    write_only = row->table->modes[column_idx] == OVSDB_IDL_MONITOR;
> >> +    write_only = !(row->table->modes[column_idx] & OVSDB_IDL_ALERT);
> >>
> >>      ovs_assert(row->new_datum != NULL);
> >>      ovs_assert(column_idx < class->n_columns);
> >> diff --git a/lib/ovsdb-idl.h b/lib/ovsdb-idl.h
> >> index d00599616ef9..2d74ed0727a7 100644
> >> --- a/lib/ovsdb-idl.h
> >> +++ b/lib/ovsdb-idl.h
> >> @@ -180,7 +180,8 @@ bool ovsdb_idl_server_has_column(const struct
> > ovsdb_idl *,
> >>   *
> >>   *   - (OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT | OVSDB_IDL_TRACK), for a
> > column
> >>   *     that a client wants to track using the change tracking
> >> - *     ovsdb_idl_track_get_*() functions.
> >> + *     ovsdb_idl_track_get_*() functions.  OVSDB_IDL_ALERT can be
> > omitted for
> >> + *     write-only columns.
> >>   */
> >>  #define OVSDB_IDL_MONITOR (1 << 0) /* Replicate this column? */
> >>  #define OVSDB_IDL_ALERT   (1 << 1) /* Alert client when column
changes?
> > */
> >> @@ -193,6 +194,11 @@ void ovsdb_idl_add_table(struct ovsdb_idl *,
> >>  void ovsdb_idl_omit(struct ovsdb_idl *, const struct ovsdb_idl_column
*);
> >>  void ovsdb_idl_omit_alert(struct ovsdb_idl *, const struct
> > ovsdb_idl_column *);
> >>
> >> +void ovsdb_idl_set_column_mode(struct ovsdb_idl *,
> >> +                               const struct ovsdb_idl_column *,
> >> +                               unsigned char mode);
> >> +void ovsdb_idl_set_column_mode_all(struct ovsdb_idl *, unsigned char
> > mode);
> >> +
> >>  /* Change tracking.
> >>   *
> >>   * In OVSDB, change tracking is applied at each client in the IDL
> > layer.  This
> >> diff --git a/tests/ovsdb-idl.at b/tests/ovsdb-idl.at
> >> index 62e2b638320c..0436e0f77b0f 100644
> >> --- a/tests/ovsdb-idl.at
> >> +++ b/tests/ovsdb-idl.at
> >> @@ -1154,12 +1154,25 @@ OVSDB_CHECK_IDL_WO_MONITOR_COND([simple idl
> > disable monitor-cond],
> >>  ]])
> >>
> >>  m4_define([OVSDB_CHECK_IDL_TRACK_C],
> >> -  [AT_SETUP([$1 - C])
> >> +  [AT_SETUP([$1 - alert - C])
> >> +   AT_KEYWORDS([ovsdb server idl tracking positive $5])
> >> +   AT_CHECK([ovsdb_start_idltest])
> >> +   m4_if([$2], [], [],
> >> +     [AT_CHECK([ovsdb-client transact unix:socket $2], [0], [ignore],
> > [ignore])])
> >> +   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
> > -t10 --change-track=alert idl unix:socket $3],
> >> +            [0], [stdout], [ignore])
> >> +   AT_CHECK([sort stdout | uuidfilt]m4_if([$6],,, [[| $6]]),
> >> +            [0], [$4])
> >> +   OVSDB_SERVER_SHUTDOWN
> >> +   AT_CLEANUP])
> >> +
> >> +m4_define([OVSDB_CHECK_IDL_TRACK_NOALERT_C],
> >> +  [AT_SETUP([$1 - noalert - C])
> >>     AT_KEYWORDS([ovsdb server idl tracking positive $5])
> >>     AT_CHECK([ovsdb_start_idltest])
> >>     m4_if([$2], [], [],
> >>       [AT_CHECK([ovsdb-client transact unix:socket $2], [0], [ignore],
> > [ignore])])
> >> -   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
> > -t10 -c idl unix:socket $3],
> >> +   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc
> > -t10 --change-track=noalert idl unix:socket $3],
> >>              [0], [stdout], [ignore])
> >>     AT_CHECK([sort stdout | uuidfilt]m4_if([$6],,, [[| $6]]),
> >>              [0], [$4])
> >> @@ -1167,7 +1180,8 @@ m4_define([OVSDB_CHECK_IDL_TRACK_C],
> >>     AT_CLEANUP])
> >>
> >>  m4_define([OVSDB_CHECK_IDL_TRACK],
> >> -  [OVSDB_CHECK_IDL_TRACK_C($@)])
> >> +  [OVSDB_CHECK_IDL_TRACK_C($@)
> >> +   OVSDB_CHECK_IDL_TRACK_NOALERT_C($@)])
> >>
> >>  OVSDB_CHECK_IDL_TRACK([track, simple idl, initially populated],
> >>    [['["idltest",
> >> diff --git a/tests/test-ovsdb.c b/tests/test-ovsdb.c
> >> index ca4e87b8115c..abd30cf27d18 100644
> >> --- a/tests/test-ovsdb.c
> >> +++ b/tests/test-ovsdb.c
> >> @@ -56,6 +56,7 @@
> >>  VLOG_DEFINE_THIS_MODULE(test_ovsdb);
> >>
> >>  struct test_ovsdb_pvt_context {
> >> +    bool track_alert;
> >>      bool track;
> >>  };
> >>
> >> @@ -122,6 +123,15 @@ parse_options(int argc, char *argv[], struct
> > test_ovsdb_pvt_context *pvt)
> >>              break;
> >>
> >>          case 'c':
> >> +            if (optarg) {
> >> +                if (!strcmp(optarg, "noalert")) {
> >> +                    pvt->track_alert = false;
> >> +                } else if (!strcmp(optarg, "alert")) {
> >> +                    pvt->track_alert = true;
> >> +                } else {
> >> +                    ovs_fatal(0, "value %s is invalid", optarg);
> >> +                }
> >> +            }
> >>              pvt->track = true;
> >>              break;
> >>
> >> @@ -2610,6 +2620,7 @@ update_conditions(struct ovsdb_idl *idl, char
> > *commands)
> >>  static void
> >>  do_idl(struct ovs_cmdl_context *ctx)
> >>  {
> >> +    struct test_ovsdb_pvt_context *pvt = ctx->pvt;
> >>      struct jsonrpc *rpc;
> >>      struct ovsdb_idl *idl;
> >>      unsigned int seqno = 0;
> >> @@ -2618,9 +2629,6 @@ do_idl(struct ovs_cmdl_context *ctx)
> >>      int step = 0;
> >>      int error;
> >>      int i;
> >> -    bool track;
> >> -
> >> -    track = ((struct test_ovsdb_pvt_context *)(ctx->pvt))->track;
> >>
> >>      idl = ovsdb_idl_create(ctx->argv[1], &idltest_idl_class, true,
true);
> >>      ovsdb_idl_set_leader_only(idl, false);
> >> @@ -2637,8 +2645,13 @@ do_idl(struct ovs_cmdl_context *ctx)
> >>          rpc = NULL;
> >>      }
> >>
> >> -    if (track) {
> >> -        ovsdb_idl_track_add_all(idl);
> >> +    if (pvt->track) {
> >> +        unsigned char mode = OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK;
> >> +
> >> +        if (pvt->track_alert) {
> >> +            mode |= OVSDB_IDL_ALERT;
> >> +        }
> >> +        ovsdb_idl_set_column_mode_all(idl, mode);
> >>      }
> >>
> >>      setvbuf(stdout, NULL, _IONBF, 0);
> >> @@ -2683,7 +2696,7 @@ do_idl(struct ovs_cmdl_context *ctx)
> >>              }
> >>
> >>              /* Print update. */
> >> -            if (track) {
> >> +            if (pvt->track) {
> >>                  print_idl_track(idl, step++, terse);
> >>                  ovsdb_idl_track_clear(idl);
> >>              } else {
> >> --
> >> 2.27.0
> >>
> >
>
Dumitru Ceara April 19, 2022, 7:48 a.m. UTC | #4
On 4/16/22 00:41, Han Zhou wrote:
> On Fri, Apr 15, 2022 at 1:45 PM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 4/15/22 20:25, Han Zhou wrote:
>>> On Fri, Apr 15, 2022 at 7:22 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>
>>>> At a first glance, change tracking should never be allowed for
>>>> write-only columns.  However, some clients (e.g., ovn-northd) that are
>>>> mostly exclusive writers of a database, use change tracking to avoid
>>>> duplicating the IDL row records into a local cache when implementing
>>>> incremental processing.
>>>
>>> Hi Dumitru,
>>>
>>
>> Hi Han,
>>
>>> It is still not clear to me why ovn-northd would need change-tracking
> for
>>> the write-only columns. It didn't require change tracking before using
> the
>>> incremental processing engine. In theory, to my understanding,
>>> change-tracking is required only if the column is an input to some of
> the
>>> engine nodes, so that it needs to be alerted when there is a change in
> the
>>> database. If the column is write-only to ovn-northd, it means it doesn't
>>> care about any change of the column from the DB, but blindly writes to
> the
>>> column regardless of the current value. I checked the patch [0], which
>>> mentioned that
>>>
>>>     Such nodes are primary inputs to the northd incremental processing
>>>     engine and without proper update processing for Southbound tables,
>>>     northd might fail to timely reconcile the contents of the Southbound
>>>     database.
>>>
>>> I am confused that, if the columns were write-only, how come they become
>>> inputs to northd and requires reconcile upon them? Does it mean that
> they
>>> were misused before I-P, that they were in fact read/write columns to
>>> ovn-northd?
>>
>> I think it's actually a bit more complicated than this.
>>
>> Even for pure write-only tables/columns we have a problem due to the
>> fact that clustered ovsdb offers at-least-once consistency [2].
>>
>> We actually hit the case in which a single northd transaction to insert
>> a SB.Load_Balancer record resulted in two identical rows being added to
>> the database, both of them pointing to the same logical switch datapath
>> binding [3].  When that logical switch was deleted northd would fail to
>> remove the duplicate from the SB causing a referential integrity
>> violation due to the dangling reference.
>>
>> Arguably the schema for load balancers is not ideal because it doesn't
>> defines an index on "name".  Nevertheless, that's not something we can
>> change easily because it will (quite likely with raft) break existing
>> deployments.  And it applies to other tables in the NB/SB schema.
>>
>> The fix was relatively straighforward though [4], and it meant fixing
>> the way northd reconciles the SB.Load_balancers.
>>
>> This is were I-P became an issue.  Consider the current main branch
>> code with commit e4d6d3455baf ("ovn-northd: Enable change tracking for
>> all SB tables.") reverted.  In a sandbox we do:
>>
>> $ ovn-nbctl ls-add ls
>> $ ovn-nbctl lb-add lb1 1.1.1.1:1000 2.2.2.2:2000 -- ls-lb-add ls lb1
>>
>> # At this point the SB.LB table looks ok:
>> $ ovn-sbctl list load_balancer
> 
> 
>> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
>> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
>> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
>> name                : lb1
>> options             : {hairpin_orig_tuple="true"}
>> protocol            : tcp
>> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>>
>> # Simulate the bug in [3]
>> $ ovn-sbctl create load_balancer name=lb1
> external_ids:lb_id=c2206d6a-6939-47cc-92de-554d2bc163b0
>>
>> # Even with the northd fix [4], due to the fact that the LB table is
>> # write-only, northd doesn't wake up so we end up with the duplicate
>> # staying in the SB for an indeterminate time.
>> $ ovn-sbctl list load_balancer
>> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
>> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
>> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
>> name                : lb1
>> options             : {hairpin_orig_tuple="true"}
>> protocol            : tcp
>> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>>
>> _uuid               : 0bff0f83-234a-4631-ab88-2348b9d07d1f
>> datapaths           : []
>> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
>> name                : lb1
>> options             : {}
>> protocol            : []
>> vips                : {}
>>
>> # Only when we generate an input for a "non-write-only" column the SB.LB
>> # is reconciled, because the "northd" I-P node runs.
>> $ ovn-nbctl --wait=sb sync
>> $ ovn-sbctl list load_balancer
>> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
>> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
>> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
>> name                : lb1
>> options             : {hairpin_orig_tuple="true"}
>> protocol            : tcp
>> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>>
>>> Similarly for the patch [1], I didn't figure out what's exactly broken
>>> without change-tracking to the options column from the commit message
> and
>>> the patch itself.
>>
>> I think the reason behind [1] is that ovn-controller (in the pinctrl
>> module) writes some options, SB.Port_Binding.options:ipv6_ra_pd_list,
>> which were then propagated by ovn-northd to
>> Logical_Router_Port.ipv6_prefix.
>>
>> This is done in ovn_update_ipv6_prefix(), part of ovnnb_db_run() in the
>> "northd" I-P node.  But, if the only change that woke up northd was the
>> SB.Port_Binding.options:ipv6_ra_pd_list change, because the column is
>> not tracked, the en_sb_port_binding I-P node (input of northd) doesn't
>> run.
>>
>>> I think I need to understand the context before commenting on the
> current
>>> patch. Could you help explain a little more?
>>>
>>
>> I hope this makes it more clear.  I'm sure there are more aspects I
>> missed here but I think we need to find a way to make northd work
>> correctly and without delays in reconciling state while at the same
>> time avoiding to affect latency (like in the performance regression
>> I introduced with [0]).
>>
> Thanks Dumitru for the detailed explanation. I think I understand the
> problem now. In both of the cases above, we were in fact monitoring the
> columns as "write-only" while they were in fact "read-write".
> 
> 1) The first case (the load balancer one) looks more complex, but
> essentially it is an input to ovn-northd. If we consider "removing
> duplicates" as a feature of northd, although the duplicates would in fact
> only be triggered by northd itself (but not actually generated by itself,
> it was RAFT's problem that actually generated the duplicated row), northd
> is reading the records and then remove the duplicates and write it back.
> That's why the northd needs to track the load-balancer columns and react to
> the changes. (We may still call it "write-only" in this case, which doesn't
> mean northd doesn't read the value, but just that northd is the only one
> that decides what should be there, not worrying about overwriting others'
> changes. I'd rather call it "read-write" to be more precise.)
> 

Sure, but just for completeness, this applies to the following tables
too (they don't have an index defined in the SB schema): DHCP_Options,
DHCPv6_Options, DNS, HA_Chassis.

Most of these didn't have change tracking enabled before [0] either as
far as I can tell; or at least not directly.  We'd have to see if there
are more corner cases to cover.

> 2) The second case (the IPv6 one) is more straightforward. If
> ovn-controller can modify the column, the column should never be regarded
> as write-only to northd, which is a bug even existed before northd I-P.
> There may be other such bugs that we haven't noticed before, but can be
> exposed by the I-P change. These bugs were not noticed before because
> anything that wakes up the main loop would trigger a recompute before I-P
> implementation, so we were actually handling any changes in time. And the
> problem mentioned in [5] Commit 1cc618c32524 ("ovsdb-idl: Fix atomicity of
> writes that don't change a column's value.") doesn't apply to this use case
> (correct me if I am wrong).
> 

Indeed, I think [5] is not really relevant in the ovn-northd <-> SB <->
ovn-controller context because both ovn-northd and ovn-controller don't
really care about transactions and (should) ensure eventual consistency
of the SB data.

> Now to the solution:
> 
> a) To fix the bugs, we shouldn't blindly change-tracking for all tables and
> columns, but instead, track changes for the columns that are read-write but
> treated as write-only by mistake, while keep the real write-only columns
> write-only (e.g. the nb_cfg related columns in SB_Global). I think we
> should fix this regardless.
> 

In theory, yes, but please see "c)" below.

> b) However, this is not enough, because we are facing the performance
> problem introduced by such fixes, even if we only add change-tracking to
> very few new columns. E.g. the one added by [1] already impacts performance
> significantly even adding tracking for just one column. We need further
> improvement to solve the performance problem, which is primarily caused by
> the behavior introduced by [5]. However, in most use cases we don't have
> the problem mentioned by [5]. I think this is the assumption that leads to
> this patch as the solution for the performance problem and I think it is
> the right approach. It allows northd to get alerted by changes, but only
> write back columns that were changed. I just want to emphasize that we need
> this approach on top of a), and only apply for the columns that are
> read-write but don't worry about atomicity of transactions if only the
> changed values are written back.

I would add:

c) However, before [6], the non-incremental implementation of ovn-northd
had the side effect that *any* external change to a SB record/column,
regardless of how the column was monitored, would trigger northd to
reconcile the SB contents.  That won't be the case anymore.  I'm not
sure if this is a big problem but it might uncover issues that used to
be handled indirectly by the northd full reprocessing.

E.g., in a sandbox with [0] reverted:

$ ovn-nbctl ls-add ls
$ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
  uuid=0xf771dbcb, table=13(ls_in_acl_after_lb ), priority=0    ,
match=(1), action=(next;)

# Simulate a lflow "disappearing" from the SB.
$ ovn-sbctl destroy logical_flow f771dbcb

# Northd doesn't recreate it until a read/write column changes value:
$ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
$ ovn-nbctl --wait=sb sync
$ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
  uuid=0x20a3b933, table=13(ls_in_acl_after_lb ), priority=0    ,
match=(1), action=(next;)

So, with all this in mind, if we find a good way of defining this new
monitoring mode ("writing back changed columns only") I don't really see
a problem with blindly enabling change tracking and this new mode for
the Southbound.  All in all it just ensures the same behavior as before
[6] without any real drawbacks.

> 
> With the above understanding, I have some comments for the patch:
> 
> i) Functionally I believe it serves the purpose of solving the performance
> problem, but it seems confusing what the modes really mean now. There were
> definitions to each of the flags:
> - OVSDB_IDL_MONITOR /* Replicate this column? */
> - OVSDB_IDL_ALERT /* Alert client when column changes? */
> - OVSDB_IDL_TRACK // there is no comment, but it is obvious as the name
> suggests.
> 
> Each of the meanings was clear. Now with the patch it seems to change the
> meanings of:
> - OVSDB_IDL_ALERT to "alert client when column changes, and the column is
> NOT write-only"
> - OVSDB_IDL_TRACK to "alert client when column changes, and track the
> changes"
> 
> This is really confusing. I think the purpose here is just to provide a
> capability of "writing back changed columns only" while still "alert client
> when column changes". Would it be better just to add a new flag for this,
> like OVSDB_IDL_WRITE_CHANGED_ONLY? It can be set properly in existed APIs
> to keep the existed behavior, and provide an API for clients to set it
> explicitly for columns that are read-write but doesn't care about the
> atomicity.

Sure, sounds better to make it explicit indeed.

> 
> ii) The comments in ovsdb-idl.h under "The possible mode combinations are:"
> needs updating.

Ack.

I'll prepare a v3 for the idl change.  In the meantime it would be great
if we could already agree on the way we want to enable this in
ovn-northd (i.e., blindly for all SB vs. selectively for some tables).

> 
> Thanks,
> Han
> 
>>> [0]
>>>
> https://github.com/ovn-org/ovn/commit/e4d6d3455baf09c63ed610037c384855e5f64141
>>> [1]
>>>
> https://github.com/ovn-org/ovn/commit/d32a9bc5290e40dd63ef495eb4f0fcde9e446089
>>>
>>
>> [2]
> https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst#understanding-cluster-consistency
>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=2045577
>> [4]
> https://github.com/ovn-org/ovn/commit/9deb000536e0dcf81d0e61d5a8af1d4e655960b4
>>
> [5]
> https://github.com/openvswitch/ovs/commit/1cc618c32524076d14ba3ee30e672a554b8ee605
> 

[6]
https://github.com/ovn-org/ovn/commit/4597317f16d1a522fd468dc5339cbe3bb851b60e

Thanks,
Dumitru
Han Zhou April 19, 2022, 5:47 p.m. UTC | #5
On Tue, Apr 19, 2022 at 12:49 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> On 4/16/22 00:41, Han Zhou wrote:
> > On Fri, Apr 15, 2022 at 1:45 PM Dumitru Ceara <dceara@redhat.com> wrote:
> >>
> >> On 4/15/22 20:25, Han Zhou wrote:
> >>> On Fri, Apr 15, 2022 at 7:22 AM Dumitru Ceara <dceara@redhat.com>
wrote:
> >>>>
> >>>> At a first glance, change tracking should never be allowed for
> >>>> write-only columns.  However, some clients (e.g., ovn-northd) that
are
> >>>> mostly exclusive writers of a database, use change tracking to avoid
> >>>> duplicating the IDL row records into a local cache when implementing
> >>>> incremental processing.
> >>>
> >>> Hi Dumitru,
> >>>
> >>
> >> Hi Han,
> >>
> >>> It is still not clear to me why ovn-northd would need change-tracking
> > for
> >>> the write-only columns. It didn't require change tracking before using
> > the
> >>> incremental processing engine. In theory, to my understanding,
> >>> change-tracking is required only if the column is an input to some of
> > the
> >>> engine nodes, so that it needs to be alerted when there is a change in
> > the
> >>> database. If the column is write-only to ovn-northd, it means it
doesn't
> >>> care about any change of the column from the DB, but blindly writes to
> > the
> >>> column regardless of the current value. I checked the patch [0], which
> >>> mentioned that
> >>>
> >>>     Such nodes are primary inputs to the northd incremental processing
> >>>     engine and without proper update processing for Southbound tables,
> >>>     northd might fail to timely reconcile the contents of the
Southbound
> >>>     database.
> >>>
> >>> I am confused that, if the columns were write-only, how come they
become
> >>> inputs to northd and requires reconcile upon them? Does it mean that
> > they
> >>> were misused before I-P, that they were in fact read/write columns to
> >>> ovn-northd?
> >>
> >> I think it's actually a bit more complicated than this.
> >>
> >> Even for pure write-only tables/columns we have a problem due to the
> >> fact that clustered ovsdb offers at-least-once consistency [2].
> >>
> >> We actually hit the case in which a single northd transaction to insert
> >> a SB.Load_Balancer record resulted in two identical rows being added to
> >> the database, both of them pointing to the same logical switch datapath
> >> binding [3].  When that logical switch was deleted northd would fail to
> >> remove the duplicate from the SB causing a referential integrity
> >> violation due to the dangling reference.
> >>
> >> Arguably the schema for load balancers is not ideal because it doesn't
> >> defines an index on "name".  Nevertheless, that's not something we can
> >> change easily because it will (quite likely with raft) break existing
> >> deployments.  And it applies to other tables in the NB/SB schema.
> >>
> >> The fix was relatively straighforward though [4], and it meant fixing
> >> the way northd reconciles the SB.Load_balancers.
> >>
> >> This is were I-P became an issue.  Consider the current main branch
> >> code with commit e4d6d3455baf ("ovn-northd: Enable change tracking for
> >> all SB tables.") reverted.  In a sandbox we do:
> >>
> >> $ ovn-nbctl ls-add ls
> >> $ ovn-nbctl lb-add lb1 1.1.1.1:1000 2.2.2.2:2000 -- ls-lb-add ls lb1
> >>
> >> # At this point the SB.LB table looks ok:
> >> $ ovn-sbctl list load_balancer
> >
> >
> >> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
> >> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
> >> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
> >> name                : lb1
> >> options             : {hairpin_orig_tuple="true"}
> >> protocol            : tcp
> >> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
> >>
> >> # Simulate the bug in [3]
> >> $ ovn-sbctl create load_balancer name=lb1
> > external_ids:lb_id=c2206d6a-6939-47cc-92de-554d2bc163b0
> >>
> >> # Even with the northd fix [4], due to the fact that the LB table is
> >> # write-only, northd doesn't wake up so we end up with the duplicate
> >> # staying in the SB for an indeterminate time.
> >> $ ovn-sbctl list load_balancer
> >> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
> >> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
> >> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
> >> name                : lb1
> >> options             : {hairpin_orig_tuple="true"}
> >> protocol            : tcp
> >> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
> >>
> >> _uuid               : 0bff0f83-234a-4631-ab88-2348b9d07d1f
> >> datapaths           : []
> >> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
> >> name                : lb1
> >> options             : {}
> >> protocol            : []
> >> vips                : {}
> >>
> >> # Only when we generate an input for a "non-write-only" column the
SB.LB
> >> # is reconciled, because the "northd" I-P node runs.
> >> $ ovn-nbctl --wait=sb sync
> >> $ ovn-sbctl list load_balancer
> >> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
> >> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
> >> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
> >> name                : lb1
> >> options             : {hairpin_orig_tuple="true"}
> >> protocol            : tcp
> >> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
> >>
> >>> Similarly for the patch [1], I didn't figure out what's exactly broken
> >>> without change-tracking to the options column from the commit message
> > and
> >>> the patch itself.
> >>
> >> I think the reason behind [1] is that ovn-controller (in the pinctrl
> >> module) writes some options, SB.Port_Binding.options:ipv6_ra_pd_list,
> >> which were then propagated by ovn-northd to
> >> Logical_Router_Port.ipv6_prefix.
> >>
> >> This is done in ovn_update_ipv6_prefix(), part of ovnnb_db_run() in the
> >> "northd" I-P node.  But, if the only change that woke up northd was the
> >> SB.Port_Binding.options:ipv6_ra_pd_list change, because the column is
> >> not tracked, the en_sb_port_binding I-P node (input of northd) doesn't
> >> run.
> >>
> >>> I think I need to understand the context before commenting on the
> > current
> >>> patch. Could you help explain a little more?
> >>>
> >>
> >> I hope this makes it more clear.  I'm sure there are more aspects I
> >> missed here but I think we need to find a way to make northd work
> >> correctly and without delays in reconciling state while at the same
> >> time avoiding to affect latency (like in the performance regression
> >> I introduced with [0]).
> >>
> > Thanks Dumitru for the detailed explanation. I think I understand the
> > problem now. In both of the cases above, we were in fact monitoring the
> > columns as "write-only" while they were in fact "read-write".
> >
> > 1) The first case (the load balancer one) looks more complex, but
> > essentially it is an input to ovn-northd. If we consider "removing
> > duplicates" as a feature of northd, although the duplicates would in
fact
> > only be triggered by northd itself (but not actually generated by
itself,
> > it was RAFT's problem that actually generated the duplicated row),
northd
> > is reading the records and then remove the duplicates and write it back.
> > That's why the northd needs to track the load-balancer columns and
react to
> > the changes. (We may still call it "write-only" in this case, which
doesn't
> > mean northd doesn't read the value, but just that northd is the only one
> > that decides what should be there, not worrying about overwriting
others'
> > changes. I'd rather call it "read-write" to be more precise.)
> >
>
> Sure, but just for completeness, this applies to the following tables
> too (they don't have an index defined in the SB schema): DHCP_Options,
> DHCPv6_Options, DNS, HA_Chassis.
>
> Most of these didn't have change tracking enabled before [0] either as
> far as I can tell; or at least not directly.  We'd have to see if there
> are more corner cases to cover.
>
> > 2) The second case (the IPv6 one) is more straightforward. If
> > ovn-controller can modify the column, the column should never be
regarded
> > as write-only to northd, which is a bug even existed before northd I-P.
> > There may be other such bugs that we haven't noticed before, but can be
> > exposed by the I-P change. These bugs were not noticed before because
> > anything that wakes up the main loop would trigger a recompute before
I-P
> > implementation, so we were actually handling any changes in time. And
the
> > problem mentioned in [5] Commit 1cc618c32524 ("ovsdb-idl: Fix atomicity
of
> > writes that don't change a column's value.") doesn't apply to this use
case
> > (correct me if I am wrong).
> >
>
> Indeed, I think [5] is not really relevant in the ovn-northd <-> SB <->
> ovn-controller context because both ovn-northd and ovn-controller don't
> really care about transactions and (should) ensure eventual consistency
> of the SB data.

I'd be careful to claim "don't really care about transactions" for all use
cases, not that I know any use cases we do care about, but I just haven't
examined every use case of ovn-northd and ovn-controller. Anyway, as long
as we have the capability in the IDL to handle each column the way we want
(with or without a flag: OVSDB_IDL_WRITE_CHANGED_ONLY), we should be able
to handle both cases. Maybe we can assume it is ok to not care about
atomicity of writes in northd, and write the changed columns only for now,
but later if we figure out there is any problem for some columns, we can
always remove the flag for those columns.

>
> > Now to the solution:
> >
> > a) To fix the bugs, we shouldn't blindly change-tracking for all tables
and
> > columns, but instead, track changes for the columns that are read-write
but
> > treated as write-only by mistake, while keep the real write-only columns
> > write-only (e.g. the nb_cfg related columns in SB_Global). I think we
> > should fix this regardless.
> >
>
> In theory, yes, but please see "c)" below.
>
> > b) However, this is not enough, because we are facing the performance
> > problem introduced by such fixes, even if we only add change-tracking to
> > very few new columns. E.g. the one added by [1] already impacts
performance
> > significantly even adding tracking for just one column. We need further
> > improvement to solve the performance problem, which is primarily caused
by
> > the behavior introduced by [5]. However, in most use cases we don't have
> > the problem mentioned by [5]. I think this is the assumption that leads
to
> > this patch as the solution for the performance problem and I think it is
> > the right approach. It allows northd to get alerted by changes, but only
> > write back columns that were changed. I just want to emphasize that we
need
> > this approach on top of a), and only apply for the columns that are
> > read-write but don't worry about atomicity of transactions if only the
> > changed values are written back.
>
> I would add:
>
> c) However, before [6], the non-incremental implementation of ovn-northd
> had the side effect that *any* external change to a SB record/column,
> regardless of how the column was monitored, would trigger northd to
> reconcile the SB contents.  That won't be the case anymore.  I'm not
> sure if this is a big problem but it might uncover issues that used to
> be handled indirectly by the northd full reprocessing.
>
> E.g., in a sandbox with [0] reverted:
>
> $ ovn-nbctl ls-add ls
> $ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
>   uuid=0xf771dbcb, table=13(ls_in_acl_after_lb ), priority=0    ,
> match=(1), action=(next;)
>
> # Simulate a lflow "disappearing" from the SB.
> $ ovn-sbctl destroy logical_flow f771dbcb
>
> # Northd doesn't recreate it until a read/write column changes value:
> $ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
> $ ovn-nbctl --wait=sb sync
> $ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
>   uuid=0x20a3b933, table=13(ls_in_acl_after_lb ), priority=0    ,
> match=(1), action=(next;)
>
> So, with all this in mind, if we find a good way of defining this new
> monitoring mode ("writing back changed columns only") I don't really see
> a problem with blindly enabling change tracking and this new mode for
> the Southbound.  All in all it just ensures the same behavior as before
> [6] without any real drawbacks.
>

I understand c), but I still think a) is important. For the example above,
since we expect northd to detect changes in SB logical_flow table and fix
it, it should be regarded as read-write. Probably most of the tables and
columns fall into this category, but there are exceptions. In some cases we
don't even want/need to fix changes. One example is nb_cfg, which is only
for sync or performance test purposes. It is purely write-only, and no need
to detect any changes from other components. Currently with alert enabled,
after every nb_cfg update to the SB by northd, the northd would receive
notification for its own change to this column which triggers a recompute
unnecessarily. I am not sure if there are other examples. But I think
overall we are on the same page. We can detect such cases one-by-one and
treat them as write-only.

Thanks,
Han

> >
> > With the above understanding, I have some comments for the patch:
> >
> > i) Functionally I believe it serves the purpose of solving the
performance
> > problem, but it seems confusing what the modes really mean now. There
were
> > definitions to each of the flags:
> > - OVSDB_IDL_MONITOR /* Replicate this column? */
> > - OVSDB_IDL_ALERT /* Alert client when column changes? */
> > - OVSDB_IDL_TRACK // there is no comment, but it is obvious as the name
> > suggests.
> >
> > Each of the meanings was clear. Now with the patch it seems to change
the
> > meanings of:
> > - OVSDB_IDL_ALERT to "alert client when column changes, and the column
is
> > NOT write-only"
> > - OVSDB_IDL_TRACK to "alert client when column changes, and track the
> > changes"
> >
> > This is really confusing. I think the purpose here is just to provide a
> > capability of "writing back changed columns only" while still "alert
client
> > when column changes". Would it be better just to add a new flag for
this,
> > like OVSDB_IDL_WRITE_CHANGED_ONLY? It can be set properly in existed
APIs
> > to keep the existed behavior, and provide an API for clients to set it
> > explicitly for columns that are read-write but doesn't care about the
> > atomicity.
>
> Sure, sounds better to make it explicit indeed.
>
> >
> > ii) The comments in ovsdb-idl.h under "The possible mode combinations
are:"
> > needs updating.
>
> Ack.
>
> I'll prepare a v3 for the idl change.  In the meantime it would be great
> if we could already agree on the way we want to enable this in
> ovn-northd (i.e., blindly for all SB vs. selectively for some tables).
>
> >
> > Thanks,
> > Han
> >
> >>> [0]
> >>>
> >
https://github.com/ovn-org/ovn/commit/e4d6d3455baf09c63ed610037c384855e5f64141
> >>> [1]
> >>>
> >
https://github.com/ovn-org/ovn/commit/d32a9bc5290e40dd63ef495eb4f0fcde9e446089
> >>>
> >>
> >> [2]
> >
https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst#understanding-cluster-consistency
> >> [3] https://bugzilla.redhat.com/show_bug.cgi?id=2045577
> >> [4]
> >
https://github.com/ovn-org/ovn/commit/9deb000536e0dcf81d0e61d5a8af1d4e655960b4
> >>
> > [5]
> >
https://github.com/openvswitch/ovs/commit/1cc618c32524076d14ba3ee30e672a554b8ee605
> >
>
> [6]
>
https://github.com/ovn-org/ovn/commit/4597317f16d1a522fd468dc5339cbe3bb851b60e
>
> Thanks,
> Dumitru
>
Dumitru Ceara April 20, 2022, 7:32 p.m. UTC | #6
On 4/19/22 19:47, Han Zhou wrote:
> On Tue, Apr 19, 2022 at 12:49 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> On 4/16/22 00:41, Han Zhou wrote:
>>> On Fri, Apr 15, 2022 at 1:45 PM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>
>>>> On 4/15/22 20:25, Han Zhou wrote:
>>>>> On Fri, Apr 15, 2022 at 7:22 AM Dumitru Ceara <dceara@redhat.com>
> wrote:
>>>>>>
>>>>>> At a first glance, change tracking should never be allowed for
>>>>>> write-only columns.  However, some clients (e.g., ovn-northd) that
> are
>>>>>> mostly exclusive writers of a database, use change tracking to avoid
>>>>>> duplicating the IDL row records into a local cache when implementing
>>>>>> incremental processing.
>>>>>
>>>>> Hi Dumitru,
>>>>>
>>>>
>>>> Hi Han,
>>>>
>>>>> It is still not clear to me why ovn-northd would need change-tracking
>>> for
>>>>> the write-only columns. It didn't require change tracking before using
>>> the
>>>>> incremental processing engine. In theory, to my understanding,
>>>>> change-tracking is required only if the column is an input to some of
>>> the
>>>>> engine nodes, so that it needs to be alerted when there is a change in
>>> the
>>>>> database. If the column is write-only to ovn-northd, it means it
> doesn't
>>>>> care about any change of the column from the DB, but blindly writes to
>>> the
>>>>> column regardless of the current value. I checked the patch [0], which
>>>>> mentioned that
>>>>>
>>>>>     Such nodes are primary inputs to the northd incremental processing
>>>>>     engine and without proper update processing for Southbound tables,
>>>>>     northd might fail to timely reconcile the contents of the
> Southbound
>>>>>     database.
>>>>>
>>>>> I am confused that, if the columns were write-only, how come they
> become
>>>>> inputs to northd and requires reconcile upon them? Does it mean that
>>> they
>>>>> were misused before I-P, that they were in fact read/write columns to
>>>>> ovn-northd?
>>>>
>>>> I think it's actually a bit more complicated than this.
>>>>
>>>> Even for pure write-only tables/columns we have a problem due to the
>>>> fact that clustered ovsdb offers at-least-once consistency [2].
>>>>
>>>> We actually hit the case in which a single northd transaction to insert
>>>> a SB.Load_Balancer record resulted in two identical rows being added to
>>>> the database, both of them pointing to the same logical switch datapath
>>>> binding [3].  When that logical switch was deleted northd would fail to
>>>> remove the duplicate from the SB causing a referential integrity
>>>> violation due to the dangling reference.
>>>>
>>>> Arguably the schema for load balancers is not ideal because it doesn't
>>>> defines an index on "name".  Nevertheless, that's not something we can
>>>> change easily because it will (quite likely with raft) break existing
>>>> deployments.  And it applies to other tables in the NB/SB schema.
>>>>
>>>> The fix was relatively straighforward though [4], and it meant fixing
>>>> the way northd reconciles the SB.Load_balancers.
>>>>
>>>> This is were I-P became an issue.  Consider the current main branch
>>>> code with commit e4d6d3455baf ("ovn-northd: Enable change tracking for
>>>> all SB tables.") reverted.  In a sandbox we do:
>>>>
>>>> $ ovn-nbctl ls-add ls
>>>> $ ovn-nbctl lb-add lb1 1.1.1.1:1000 2.2.2.2:2000 -- ls-lb-add ls lb1
>>>>
>>>> # At this point the SB.LB table looks ok:
>>>> $ ovn-sbctl list load_balancer
>>>
>>>
>>>> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
>>>> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
>>>> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
>>>> name                : lb1
>>>> options             : {hairpin_orig_tuple="true"}
>>>> protocol            : tcp
>>>> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>>>>
>>>> # Simulate the bug in [3]
>>>> $ ovn-sbctl create load_balancer name=lb1
>>> external_ids:lb_id=c2206d6a-6939-47cc-92de-554d2bc163b0
>>>>
>>>> # Even with the northd fix [4], due to the fact that the LB table is
>>>> # write-only, northd doesn't wake up so we end up with the duplicate
>>>> # staying in the SB for an indeterminate time.
>>>> $ ovn-sbctl list load_balancer
>>>> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
>>>> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
>>>> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
>>>> name                : lb1
>>>> options             : {hairpin_orig_tuple="true"}
>>>> protocol            : tcp
>>>> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>>>>
>>>> _uuid               : 0bff0f83-234a-4631-ab88-2348b9d07d1f
>>>> datapaths           : []
>>>> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
>>>> name                : lb1
>>>> options             : {}
>>>> protocol            : []
>>>> vips                : {}
>>>>
>>>> # Only when we generate an input for a "non-write-only" column the
> SB.LB
>>>> # is reconciled, because the "northd" I-P node runs.
>>>> $ ovn-nbctl --wait=sb sync
>>>> $ ovn-sbctl list load_balancer
>>>> _uuid               : 71d07ce2-d8ad-44f5-8fb6-9e4f5ef7ea92
>>>> datapaths           : [29d576ef-31f5-485b-acb7-190373ef74c3]
>>>> external_ids        : {lb_id="c2206d6a-6939-47cc-92de-554d2bc163b0"}
>>>> name                : lb1
>>>> options             : {hairpin_orig_tuple="true"}
>>>> protocol            : tcp
>>>> vips                : {"1.1.1.1:1000"="2.2.2.2:2000"}
>>>>
>>>>> Similarly for the patch [1], I didn't figure out what's exactly broken
>>>>> without change-tracking to the options column from the commit message
>>> and
>>>>> the patch itself.
>>>>
>>>> I think the reason behind [1] is that ovn-controller (in the pinctrl
>>>> module) writes some options, SB.Port_Binding.options:ipv6_ra_pd_list,
>>>> which were then propagated by ovn-northd to
>>>> Logical_Router_Port.ipv6_prefix.
>>>>
>>>> This is done in ovn_update_ipv6_prefix(), part of ovnnb_db_run() in the
>>>> "northd" I-P node.  But, if the only change that woke up northd was the
>>>> SB.Port_Binding.options:ipv6_ra_pd_list change, because the column is
>>>> not tracked, the en_sb_port_binding I-P node (input of northd) doesn't
>>>> run.
>>>>
>>>>> I think I need to understand the context before commenting on the
>>> current
>>>>> patch. Could you help explain a little more?
>>>>>
>>>>
>>>> I hope this makes it more clear.  I'm sure there are more aspects I
>>>> missed here but I think we need to find a way to make northd work
>>>> correctly and without delays in reconciling state while at the same
>>>> time avoiding to affect latency (like in the performance regression
>>>> I introduced with [0]).
>>>>
>>> Thanks Dumitru for the detailed explanation. I think I understand the
>>> problem now. In both of the cases above, we were in fact monitoring the
>>> columns as "write-only" while they were in fact "read-write".
>>>
>>> 1) The first case (the load balancer one) looks more complex, but
>>> essentially it is an input to ovn-northd. If we consider "removing
>>> duplicates" as a feature of northd, although the duplicates would in
> fact
>>> only be triggered by northd itself (but not actually generated by
> itself,
>>> it was RAFT's problem that actually generated the duplicated row),
> northd
>>> is reading the records and then remove the duplicates and write it back.
>>> That's why the northd needs to track the load-balancer columns and
> react to
>>> the changes. (We may still call it "write-only" in this case, which
> doesn't
>>> mean northd doesn't read the value, but just that northd is the only one
>>> that decides what should be there, not worrying about overwriting
> others'
>>> changes. I'd rather call it "read-write" to be more precise.)
>>>
>>
>> Sure, but just for completeness, this applies to the following tables
>> too (they don't have an index defined in the SB schema): DHCP_Options,
>> DHCPv6_Options, DNS, HA_Chassis.
>>
>> Most of these didn't have change tracking enabled before [0] either as
>> far as I can tell; or at least not directly.  We'd have to see if there
>> are more corner cases to cover.
>>
>>> 2) The second case (the IPv6 one) is more straightforward. If
>>> ovn-controller can modify the column, the column should never be
> regarded
>>> as write-only to northd, which is a bug even existed before northd I-P.
>>> There may be other such bugs that we haven't noticed before, but can be
>>> exposed by the I-P change. These bugs were not noticed before because
>>> anything that wakes up the main loop would trigger a recompute before
> I-P
>>> implementation, so we were actually handling any changes in time. And
> the
>>> problem mentioned in [5] Commit 1cc618c32524 ("ovsdb-idl: Fix atomicity
> of
>>> writes that don't change a column's value.") doesn't apply to this use
> case
>>> (correct me if I am wrong).
>>>
>>
>> Indeed, I think [5] is not really relevant in the ovn-northd <-> SB <->
>> ovn-controller context because both ovn-northd and ovn-controller don't
>> really care about transactions and (should) ensure eventual consistency
>> of the SB data.
> 
> I'd be careful to claim "don't really care about transactions" for all use
> cases, not that I know any use cases we do care about, but I just haven't
> examined every use case of ovn-northd and ovn-controller. Anyway, as long
> as we have the capability in the IDL to handle each column the way we want
> (with or without a flag: OVSDB_IDL_WRITE_CHANGED_ONLY), we should be able
> to handle both cases. Maybe we can assume it is ok to not care about
> atomicity of writes in northd, and write the changed columns only for now,
> but later if we figure out there is any problem for some columns, we can
> always remove the flag for those columns.
> 
>>
>>> Now to the solution:
>>>
>>> a) To fix the bugs, we shouldn't blindly change-tracking for all tables
> and
>>> columns, but instead, track changes for the columns that are read-write
> but
>>> treated as write-only by mistake, while keep the real write-only columns
>>> write-only (e.g. the nb_cfg related columns in SB_Global). I think we
>>> should fix this regardless.
>>>
>>
>> In theory, yes, but please see "c)" below.
>>
>>> b) However, this is not enough, because we are facing the performance
>>> problem introduced by such fixes, even if we only add change-tracking to
>>> very few new columns. E.g. the one added by [1] already impacts
> performance
>>> significantly even adding tracking for just one column. We need further
>>> improvement to solve the performance problem, which is primarily caused
> by
>>> the behavior introduced by [5]. However, in most use cases we don't have
>>> the problem mentioned by [5]. I think this is the assumption that leads
> to
>>> this patch as the solution for the performance problem and I think it is
>>> the right approach. It allows northd to get alerted by changes, but only
>>> write back columns that were changed. I just want to emphasize that we
> need
>>> this approach on top of a), and only apply for the columns that are
>>> read-write but don't worry about atomicity of transactions if only the
>>> changed values are written back.
>>
>> I would add:
>>
>> c) However, before [6], the non-incremental implementation of ovn-northd
>> had the side effect that *any* external change to a SB record/column,
>> regardless of how the column was monitored, would trigger northd to
>> reconcile the SB contents.  That won't be the case anymore.  I'm not
>> sure if this is a big problem but it might uncover issues that used to
>> be handled indirectly by the northd full reprocessing.
>>
>> E.g., in a sandbox with [0] reverted:
>>
>> $ ovn-nbctl ls-add ls
>> $ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
>>   uuid=0xf771dbcb, table=13(ls_in_acl_after_lb ), priority=0    ,
>> match=(1), action=(next;)
>>
>> # Simulate a lflow "disappearing" from the SB.
>> $ ovn-sbctl destroy logical_flow f771dbcb
>>
>> # Northd doesn't recreate it until a read/write column changes value:
>> $ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
>> $ ovn-nbctl --wait=sb sync
>> $ ovn-sbctl --uuid lflow-list | grep ls_in_acl_after_lb
>>   uuid=0x20a3b933, table=13(ls_in_acl_after_lb ), priority=0    ,
>> match=(1), action=(next;)
>>
>> So, with all this in mind, if we find a good way of defining this new
>> monitoring mode ("writing back changed columns only") I don't really see
>> a problem with blindly enabling change tracking and this new mode for
>> the Southbound.  All in all it just ensures the same behavior as before
>> [6] without any real drawbacks.
>>
> 
> I understand c), but I still think a) is important. For the example above,
> since we expect northd to detect changes in SB logical_flow table and fix
> it, it should be regarded as read-write. Probably most of the tables and
> columns fall into this category, but there are exceptions. In some cases we
> don't even want/need to fix changes. One example is nb_cfg, which is only
> for sync or performance test purposes. It is purely write-only, and no need
> to detect any changes from other components. Currently with alert enabled,
> after every nb_cfg update to the SB by northd, the northd would receive
> notification for its own change to this column which triggers a recompute
> unnecessarily. I am not sure if there are other examples. But I think
> overall we are on the same page. We can detect such cases one-by-one and
> treat them as write-only.
> 

Thanks, Han, for the review!

I just posted v3:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220420193122.22651-1-dceara@redhat.com/

Regards,
Dumitru


> Thanks,
> Han
> 
>>>
>>> With the above understanding, I have some comments for the patch:
>>>
>>> i) Functionally I believe it serves the purpose of solving the
> performance
>>> problem, but it seems confusing what the modes really mean now. There
> were
>>> definitions to each of the flags:
>>> - OVSDB_IDL_MONITOR /* Replicate this column? */
>>> - OVSDB_IDL_ALERT /* Alert client when column changes? */
>>> - OVSDB_IDL_TRACK // there is no comment, but it is obvious as the name
>>> suggests.
>>>
>>> Each of the meanings was clear. Now with the patch it seems to change
> the
>>> meanings of:
>>> - OVSDB_IDL_ALERT to "alert client when column changes, and the column
> is
>>> NOT write-only"
>>> - OVSDB_IDL_TRACK to "alert client when column changes, and track the
>>> changes"
>>>
>>> This is really confusing. I think the purpose here is just to provide a
>>> capability of "writing back changed columns only" while still "alert
> client
>>> when column changes". Would it be better just to add a new flag for
> this,
>>> like OVSDB_IDL_WRITE_CHANGED_ONLY? It can be set properly in existed
> APIs
>>> to keep the existed behavior, and provide an API for clients to set it
>>> explicitly for columns that are read-write but doesn't care about the
>>> atomicity.
>>
>> Sure, sounds better to make it explicit indeed.
>>
>>>
>>> ii) The comments in ovsdb-idl.h under "The possible mode combinations
> are:"
>>> needs updating.
>>
>> Ack.
>>
>> I'll prepare a v3 for the idl change.  In the meantime it would be great
>> if we could already agree on the way we want to enable this in
>> ovn-northd (i.e., blindly for all SB vs. selectively for some tables).
>>
>>>
>>> Thanks,
>>> Han
>>>
>>>>> [0]
>>>>>
>>>
> https://github.com/ovn-org/ovn/commit/e4d6d3455baf09c63ed610037c384855e5f64141
>>>>> [1]
>>>>>
>>>
> https://github.com/ovn-org/ovn/commit/d32a9bc5290e40dd63ef495eb4f0fcde9e446089
>>>>>
>>>>
>>>> [2]
>>>
> https://github.com/openvswitch/ovs/blob/master/Documentation/ref/ovsdb.7.rst#understanding-cluster-consistency
>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=2045577
>>>> [4]
>>>
> https://github.com/ovn-org/ovn/commit/9deb000536e0dcf81d0e61d5a8af1d4e655960b4
>>>>
>>> [5]
>>>
> https://github.com/openvswitch/ovs/commit/1cc618c32524076d14ba3ee30e672a554b8ee605
>>>
>>
>> [6]
>>
> https://github.com/ovn-org/ovn/commit/4597317f16d1a522fd468dc5339cbe3bb851b60e
>>
>> Thanks,
>> Dumitru
>>
>
diff mbox series

Patch

diff --git a/NEWS b/NEWS
index 5074b97aa51c..c1c41da94e9b 100644
--- a/NEWS
+++ b/NEWS
@@ -14,6 +14,10 @@  Post-v2.17.0
    - OVSDB:
      * 'relay' service model now supports transaction history, i.e. honors the
        'last-txn-id' field in 'monitor_cond_since' requests from clients.
+   - OVSDB-IDL:
+     * New interfaces for explcitly setting column modes.  The combination
+       'OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK' is now considered valid and
+       enables change tracking of write-only columns.
 
 
 v2.17.0 - 17 Feb 2022
diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
index 882ede75598d..5f62c267d496 100644
--- a/lib/ovsdb-idl.c
+++ b/lib/ovsdb-idl.c
@@ -859,20 +859,6 @@  ovsdb_idl_get_mode(struct ovsdb_idl *idl,
     return &table->modes[column - table->class_->columns];
 }
 
-static void
-ovsdb_idl_set_mode(struct ovsdb_idl *idl,
-                   const struct ovsdb_idl_column *column,
-                   unsigned char mode)
-{
-    const struct ovsdb_idl_table *table = ovsdb_idl_table_from_column(idl,
-                                                                      column);
-    size_t column_idx = column - table->class_->columns;
-
-    if (table->modes[column_idx] != mode) {
-        *ovsdb_idl_get_mode(idl, column) = mode;
-    }
-}
-
 static void
 add_ref_table(struct ovsdb_idl *idl, const struct ovsdb_base_type *base)
 {
@@ -902,7 +888,8 @@  void
 ovsdb_idl_add_column(struct ovsdb_idl *idl,
                      const struct ovsdb_idl_column *column)
 {
-    ovsdb_idl_set_mode(idl, column, OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT);
+    ovsdb_idl_set_column_mode(idl, column,
+                              OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT);
     add_ref_table(idl, &column->type.key);
     add_ref_table(idl, &column->type.value);
 }
@@ -1186,6 +1173,51 @@  ovsdb_idl_omit_alert(struct ovsdb_idl *idl,
     *ovsdb_idl_get_mode(idl, column) &= ~(OVSDB_IDL_ALERT | OVSDB_IDL_TRACK);
 }
 
+/* Sets the 'column' mode to 'mode' which must be a valid combination of
+ * OVSDB_IDL_MONITOR, OVSDB_IDL_ALERT and OVSDB_IDL_TRACK.
+ */
+void
+ovsdb_idl_set_column_mode(struct ovsdb_idl *idl,
+                          const struct ovsdb_idl_column *column,
+                          unsigned char mode)
+{
+    if (mode & ~(OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT | OVSDB_IDL_TRACK)) {
+        VLOG_ABORT("%s(): Column '%s' unknown mode %02x.",
+                   __func__, column->name, mode);
+    }
+
+    if ((mode & (OVSDB_IDL_ALERT | OVSDB_IDL_TRACK))
+         && !(mode & OVSDB_IDL_MONITOR)) {
+        VLOG_ABORT("%s(): Column '%s' mode must include OVSDB_IDL_MONITOR.",
+                   __func__, column->name);
+    }
+
+    const struct ovsdb_idl_table *table = ovsdb_idl_table_from_column(idl,
+                                                                      column);
+    size_t column_idx = column - table->class_->columns;
+
+    if (table->modes[column_idx] != mode) {
+        *ovsdb_idl_get_mode(idl, column) = mode;
+    }
+}
+
+/* Walks all columns of tables in 'idl' and sets their mode to 'mode'
+ * which must be a valid combination of OVSDB_IDL_MONITOR, OVSDB_IDL_ALERT
+ * and OVSDB_IDL_TRACK.
+ */
+void
+ovsdb_idl_set_column_mode_all(struct ovsdb_idl *idl, unsigned char mode)
+{
+    for (size_t i = 0; i < idl->class_->n_tables; i++) {
+        const struct ovsdb_idl_table_class *tc = &idl->class_->tables[i];
+
+        for (size_t j = 0; j < tc->n_columns; j++) {
+            const struct ovsdb_idl_column *column = &tc->columns[j];
+            ovsdb_idl_set_column_mode(idl, column, mode);
+        }
+    }
+}
+
 /* Sets the mode for 'column' in 'idl' to 0.  See the big comment above
  * OVSDB_IDL_MONITOR for details.
  *
@@ -1236,8 +1268,8 @@  ovsdb_idl_row_get_seqno(const struct ovsdb_idl_row *row,
  * functions.
  *
  * This function should be called between ovsdb_idl_create() and
- * the first call to ovsdb_idl_run(). The column to be tracked
- * should have OVSDB_IDL_ALERT turned on.
+ * the first call to ovsdb_idl_run(). The function will also ensure
+ * that the column to be tracked has OVSDB_IDL_ALERT turned on.
  */
 void
 ovsdb_idl_track_add_column(struct ovsdb_idl *idl,
@@ -1246,7 +1278,9 @@  ovsdb_idl_track_add_column(struct ovsdb_idl *idl,
     if (!(*ovsdb_idl_get_mode(idl, column) & OVSDB_IDL_ALERT)) {
         ovsdb_idl_add_column(idl, column);
     }
-    *ovsdb_idl_get_mode(idl, column) |= OVSDB_IDL_TRACK;
+
+    unsigned char mode = OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT | OVSDB_IDL_TRACK;
+    ovsdb_idl_set_column_mode(idl, column, mode);
 }
 
 void
@@ -1694,11 +1728,14 @@  ovsdb_idl_row_change(struct ovsdb_idl_row *row, const struct shash *values,
             continue;
         }
 
-        if (datum_changed && table->modes[column_idx] & OVSDB_IDL_ALERT) {
-            changed = true;
-            row->change_seqno[change]
-                = row->table->change_seqno[change]
-                = row->table->idl->change_seqno + 1;
+        if (datum_changed) {
+            unsigned char column_mode = table->modes[column_idx];
+            if (column_mode & (OVSDB_IDL_ALERT | OVSDB_IDL_TRACK)) {
+                changed = true;
+                row->change_seqno[change]
+                    = row->table->change_seqno[change]
+                    = row->table->idl->change_seqno + 1;
+            }
 
             if (table->modes[column_idx] & OVSDB_IDL_TRACK) {
                 if (ovs_list_is_empty(&row->track_node) &&
@@ -3501,7 +3538,7 @@  ovsdb_idl_txn_write__(const struct ovsdb_idl_row *row_,
 
     class = row->table->class_;
     column_idx = column - class->columns;
-    write_only = row->table->modes[column_idx] == OVSDB_IDL_MONITOR;
+    write_only = !(row->table->modes[column_idx] & OVSDB_IDL_ALERT);
 
     ovs_assert(row->new_datum != NULL);
     ovs_assert(column_idx < class->n_columns);
diff --git a/lib/ovsdb-idl.h b/lib/ovsdb-idl.h
index d00599616ef9..2d74ed0727a7 100644
--- a/lib/ovsdb-idl.h
+++ b/lib/ovsdb-idl.h
@@ -180,7 +180,8 @@  bool ovsdb_idl_server_has_column(const struct ovsdb_idl *,
  *
  *   - (OVSDB_IDL_MONITOR | OVSDB_IDL_ALERT | OVSDB_IDL_TRACK), for a column
  *     that a client wants to track using the change tracking
- *     ovsdb_idl_track_get_*() functions.
+ *     ovsdb_idl_track_get_*() functions.  OVSDB_IDL_ALERT can be omitted for
+ *     write-only columns.
  */
 #define OVSDB_IDL_MONITOR (1 << 0) /* Replicate this column? */
 #define OVSDB_IDL_ALERT   (1 << 1) /* Alert client when column changes? */
@@ -193,6 +194,11 @@  void ovsdb_idl_add_table(struct ovsdb_idl *,
 void ovsdb_idl_omit(struct ovsdb_idl *, const struct ovsdb_idl_column *);
 void ovsdb_idl_omit_alert(struct ovsdb_idl *, const struct ovsdb_idl_column *);
 
+void ovsdb_idl_set_column_mode(struct ovsdb_idl *,
+                               const struct ovsdb_idl_column *,
+                               unsigned char mode);
+void ovsdb_idl_set_column_mode_all(struct ovsdb_idl *, unsigned char mode);
+
 /* Change tracking.
  *
  * In OVSDB, change tracking is applied at each client in the IDL layer.  This
diff --git a/tests/ovsdb-idl.at b/tests/ovsdb-idl.at
index 62e2b638320c..0436e0f77b0f 100644
--- a/tests/ovsdb-idl.at
+++ b/tests/ovsdb-idl.at
@@ -1154,12 +1154,25 @@  OVSDB_CHECK_IDL_WO_MONITOR_COND([simple idl disable monitor-cond],
 ]])
 
 m4_define([OVSDB_CHECK_IDL_TRACK_C],
-  [AT_SETUP([$1 - C])
+  [AT_SETUP([$1 - alert - C])
+   AT_KEYWORDS([ovsdb server idl tracking positive $5])
+   AT_CHECK([ovsdb_start_idltest])
+   m4_if([$2], [], [],
+     [AT_CHECK([ovsdb-client transact unix:socket $2], [0], [ignore], [ignore])])
+   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc -t10 --change-track=alert idl unix:socket $3],
+            [0], [stdout], [ignore])
+   AT_CHECK([sort stdout | uuidfilt]m4_if([$6],,, [[| $6]]),
+            [0], [$4])
+   OVSDB_SERVER_SHUTDOWN
+   AT_CLEANUP])
+
+m4_define([OVSDB_CHECK_IDL_TRACK_NOALERT_C],
+  [AT_SETUP([$1 - noalert - C])
    AT_KEYWORDS([ovsdb server idl tracking positive $5])
    AT_CHECK([ovsdb_start_idltest])
    m4_if([$2], [], [],
      [AT_CHECK([ovsdb-client transact unix:socket $2], [0], [ignore], [ignore])])
-   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc -t10 -c idl unix:socket $3],
+   AT_CHECK([test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -vjsonrpc -t10 --change-track=noalert idl unix:socket $3],
             [0], [stdout], [ignore])
    AT_CHECK([sort stdout | uuidfilt]m4_if([$6],,, [[| $6]]),
             [0], [$4])
@@ -1167,7 +1180,8 @@  m4_define([OVSDB_CHECK_IDL_TRACK_C],
    AT_CLEANUP])
 
 m4_define([OVSDB_CHECK_IDL_TRACK],
-  [OVSDB_CHECK_IDL_TRACK_C($@)])
+  [OVSDB_CHECK_IDL_TRACK_C($@)
+   OVSDB_CHECK_IDL_TRACK_NOALERT_C($@)])
 
 OVSDB_CHECK_IDL_TRACK([track, simple idl, initially populated],
   [['["idltest",
diff --git a/tests/test-ovsdb.c b/tests/test-ovsdb.c
index ca4e87b8115c..abd30cf27d18 100644
--- a/tests/test-ovsdb.c
+++ b/tests/test-ovsdb.c
@@ -56,6 +56,7 @@ 
 VLOG_DEFINE_THIS_MODULE(test_ovsdb);
 
 struct test_ovsdb_pvt_context {
+    bool track_alert;
     bool track;
 };
 
@@ -122,6 +123,15 @@  parse_options(int argc, char *argv[], struct test_ovsdb_pvt_context *pvt)
             break;
 
         case 'c':
+            if (optarg) {
+                if (!strcmp(optarg, "noalert")) {
+                    pvt->track_alert = false;
+                } else if (!strcmp(optarg, "alert")) {
+                    pvt->track_alert = true;
+                } else {
+                    ovs_fatal(0, "value %s is invalid", optarg);
+                }
+            }
             pvt->track = true;
             break;
 
@@ -2610,6 +2620,7 @@  update_conditions(struct ovsdb_idl *idl, char *commands)
 static void
 do_idl(struct ovs_cmdl_context *ctx)
 {
+    struct test_ovsdb_pvt_context *pvt = ctx->pvt;
     struct jsonrpc *rpc;
     struct ovsdb_idl *idl;
     unsigned int seqno = 0;
@@ -2618,9 +2629,6 @@  do_idl(struct ovs_cmdl_context *ctx)
     int step = 0;
     int error;
     int i;
-    bool track;
-
-    track = ((struct test_ovsdb_pvt_context *)(ctx->pvt))->track;
 
     idl = ovsdb_idl_create(ctx->argv[1], &idltest_idl_class, true, true);
     ovsdb_idl_set_leader_only(idl, false);
@@ -2637,8 +2645,13 @@  do_idl(struct ovs_cmdl_context *ctx)
         rpc = NULL;
     }
 
-    if (track) {
-        ovsdb_idl_track_add_all(idl);
+    if (pvt->track) {
+        unsigned char mode = OVSDB_IDL_MONITOR | OVSDB_IDL_TRACK;
+
+        if (pvt->track_alert) {
+            mode |= OVSDB_IDL_ALERT;
+        }
+        ovsdb_idl_set_column_mode_all(idl, mode);
     }
 
     setvbuf(stdout, NULL, _IONBF, 0);
@@ -2683,7 +2696,7 @@  do_idl(struct ovs_cmdl_context *ctx)
             }
 
             /* Print update. */
-            if (track) {
+            if (pvt->track) {
                 print_idl_track(idl, step++, terse);
                 ovsdb_idl_track_clear(idl);
             } else {