Message ID | 20180911172958.25307-1-nusiddiq@redhat.com
State      | Accepted
Series     | [ovs-dev,v2] ovsdb-server: Alleviate the possible data loss in an active/standby setup
On Tue, Sep 11, 2018 at 10:30 AM <nusiddiq@redhat.com> wrote:
>
> From: Numan Siddique <nusiddiq@redhat.com>
>
> The present code resets the database when it is in the state
> 'RPL_S_SCHEMA_REQUESTED' and repopulates the database from the monitor
> reply when it is in the state 'RPL_S_MONITOR_REQUESTED'.  If, however,
> the server goes to active mode before it processes the monitor reply,
> all of the data is lost.
>
> This patch alleviates the problem by resetting the database only when
> it receives the monitor reply (immediately before processing it), so
> that the reset and the repopulation of the db happen in the same state.
>
> This approach still has a window for data loss if
> process_notification() fails while processing the monitor reply, or if
> ovsdb-server crashes during process_notification().
>
> Reported-by: Han Zhou <zhouhan@gmail.com>
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-August/047161.html
> Tested-by: aginwala <aginwala@ebay.com>
> Acked-by: Han Zhou <zhouhan@gmail.com>
> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> ---
>  ovsdb/replication.c | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
>
> v1 -> v2
> --------
> * Updated the commit message as per Han's suggestion
> * Added a few comments in the code where it resets the db.
>
> diff --git a/ovsdb/replication.c b/ovsdb/replication.c
> index 2b9ae2f83..752b3c89c 100644
> --- a/ovsdb/replication.c
> +++ b/ovsdb/replication.c
> @@ -299,19 +299,7 @@ replication_run(void)
>              /* After receiving schemas, reset the local databases that
>               * will be monitored and send out monitor requests for them. */
>              if (hmap_is_empty(&request_ids)) {
> -                struct shash_node *node, *next;
> -
> -                SHASH_FOR_EACH_SAFE (node, next, replication_dbs) {
> -                    db = node->data;
> -                    error = reset_database(db);
> -                    if (error) {
> -                        const char *db_name = db->schema->name;
> -                        shash_find_and_delete(replication_dbs, db_name);
> -                        ovsdb_error_assert(error);
> -                        VLOG_WARN("Failed to reset database, "
> -                                  "%s not replicated.", db_name);
> -                    }
> -                }
> +                struct shash_node *node;
>
>                  if (shash_is_empty(replication_dbs)) {
>                      VLOG_WARN("Nothing to replicate.");
> @@ -335,7 +323,16 @@ replication_run(void)
>          case RPL_S_MONITOR_REQUESTED: {
>              /* Reply to monitor requests. */
>              struct ovsdb_error *error;
> -            error = process_notification(msg->result, db);
> +            VLOG_INFO("Monitor request received. Resetting the database");
> +            /* Resetting the database here has a few risks.  If
> +             * process_notification() fails, the database is completely
> +             * lost locally.  In case that node becomes active, then
> +             * there is a chance of complete data loss in the active/standby
> +             * cluster. */
> +            error = reset_database(db);
> +            if (!error) {
> +                error = process_notification(msg->result, db);
> +            }
>              if (error) {
>                  ovsdb_error_assert(error);
>                  state = RPL_S_ERR;
> --
> 2.17.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Thanks Numan!
On Tue, Sep 11, 2018 at 10:55:56AM -0700, Han Zhou wrote:
> On Tue, Sep 11, 2018 at 10:30 AM <nusiddiq@redhat.com> wrote:
> > [...]
>
> Thanks Numan!

Thanks Numan and Han.  I applied this to master.  If you'd like it
backported, please let me know (and how far).
On Tue, Sep 18, 2018 at 1:41 PM Ben Pfaff <blp@ovn.org> wrote:

> On Tue, Sep 11, 2018 at 10:55:56AM -0700, Han Zhou wrote:
> > [...]
> >
> > Thanks Numan!
>
> Thanks Numan and Han.  I applied this to master.  If you'd like it
> backported, please let me know (and how far).

Thanks Ben for the review and applying the patch.  Can you please backport
it to the 2.10 and 2.9 branches?  The patch applied cleanly to both for me
without any issues.

Thanks
Numan
On Tue, Sep 18, 2018 at 07:04:01PM +0530, Numan Siddique wrote:
> On Tue, Sep 18, 2018 at 1:41 PM Ben Pfaff <blp@ovn.org> wrote:
> > [...]
> >
> > Thanks Numan and Han.  I applied this to master.  If you'd like it
> > backported, please let me know (and how far).
>
> Thanks Ben for the review and applying the patch.  Can you please backport
> to 2.10 and 2.9 branches.
> They applied cleanly to me without any issues.

Done, thanks!
diff --git a/ovsdb/replication.c b/ovsdb/replication.c
index 2b9ae2f83..752b3c89c 100644
--- a/ovsdb/replication.c
+++ b/ovsdb/replication.c
@@ -299,19 +299,7 @@ replication_run(void)
             /* After receiving schemas, reset the local databases that
              * will be monitored and send out monitor requests for them. */
             if (hmap_is_empty(&request_ids)) {
-                struct shash_node *node, *next;
-
-                SHASH_FOR_EACH_SAFE (node, next, replication_dbs) {
-                    db = node->data;
-                    error = reset_database(db);
-                    if (error) {
-                        const char *db_name = db->schema->name;
-                        shash_find_and_delete(replication_dbs, db_name);
-                        ovsdb_error_assert(error);
-                        VLOG_WARN("Failed to reset database, "
-                                  "%s not replicated.", db_name);
-                    }
-                }
+                struct shash_node *node;

                 if (shash_is_empty(replication_dbs)) {
                     VLOG_WARN("Nothing to replicate.");
@@ -335,7 +323,16 @@ replication_run(void)
         case RPL_S_MONITOR_REQUESTED: {
             /* Reply to monitor requests. */
             struct ovsdb_error *error;
-            error = process_notification(msg->result, db);
+            VLOG_INFO("Monitor request received. Resetting the database");
+            /* Resetting the database here has a few risks.  If
+             * process_notification() fails, the database is completely
+             * lost locally.  In case that node becomes active, then
+             * there is a chance of complete data loss in the active/standby
+             * cluster. */
+            error = reset_database(db);
+            if (!error) {
+                error = process_notification(msg->result, db);
+            }
             if (error) {
                 ovsdb_error_assert(error);
                 state = RPL_S_ERR;