diff mbox series

[ovs-dev,ovn] ovn-controller: Fix IP engine run with in-flight messages

Message ID 1564737778-27292-1-git-send-email-dceara@redhat.com
State Accepted
Headers show
Series [ovs-dev,ovn] ovn-controller: Fix IP engine run with in-flight messages | expand

Commit Message

Dumitru Ceara Aug. 2, 2019, 9:22 a.m. UTC
When trying to incrementally process changes even if there are in-flight
messages we were incorrectly setting the engine_aborted variable to the
value returned by engine_run. However, engine_run returns true if the
run was completed.

This was causing discrepancies between logical flows and openflow flows
due to the fact that the rerun wasn't happening after an aborted run.

In order to avoid confusion the engine_aborted variable is now renamed to
engine_run_done thus avoiding the negated logic.

CC: Han Zhou <hzhou8@ebay.com>
Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency caused by recompute.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
 controller/ovn-controller.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Comments

Numan Siddique Aug. 2, 2019, 5:22 p.m. UTC | #1
On Fri, Aug 2, 2019 at 2:53 PM Dumitru Ceara <dceara@redhat.com> wrote:

> When trying to incrementally process changes even if there are in-flight
> messages we were incorrectly setting the engine_aborted variable to the
> value returned by engine_run. However, engine_run returns true if the
> run was completed.
>
> This was causing discrepancies between logical flows and openflow flows
> due to the fact that the rerun wasn't happening after an aborted run.
>
> In order to avoid confusion the engine_aborted variable is now renamed to
> engine_run_done thus avoiding the negated logic.
>
> CC: Han Zhou <hzhou8@ebay.com>
> Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency caused
> by recompute.")
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>

I tested this patch multiple times with -j5 and -j10 options. I don't see
test failures now.
Without this patch, atleast one test fails for me 100% of the time when I
run tests with -j5.

Tested-by: Numan Siddique <nusiddiq@redhat.com>

I didn't get the chance to look into the code changes carefully.

Thanks
Numan

---
>  controller/ovn-controller.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index ad33d17..15b9f4e 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -1896,7 +1896,7 @@ main(int argc, char *argv[])
>
>      uint64_t engine_run_id = 0;
>      uint64_t old_engine_run_id = 0;
> -    bool engine_aborted = false;
> +    bool engine_run_done = true;
>
>      unsigned int ovs_cond_seqno = UINT_MAX;
>      unsigned int ovnsb_cond_seqno = UINT_MAX;
> @@ -1997,14 +1997,14 @@ main(int argc, char *argv[])
>                               * this round of engine_run and continue
> processing
>                               * acculated changes incrementally later when
>                               * ofctrl_can_put() returns true. */
> -                            if (!engine_aborted) {
> +                            if (engine_run_done) {
>                                  engine_set_abort_recompute(true);
> -                                engine_aborted =
> engine_run(&en_flow_output,
> -
> ++engine_run_id);
> +                                engine_run_done =
> engine_run(&en_flow_output,
> +
>  ++engine_run_id);
>                              }
>                          } else {
>                              engine_set_abort_recompute(false);
> -                            engine_aborted = false;
> +                            engine_run_done = true;
>                              engine_run(&en_flow_output, ++engine_run_id);
>                          }
>                      }
> @@ -2048,8 +2048,8 @@ main(int argc, char *argv[])
>                  }
>
>              }
> -            if (old_engine_run_id == engine_run_id || engine_aborted) {
> -                if (engine_aborted || engine_need_run(&en_flow_output)) {
> +            if (old_engine_run_id == engine_run_id || !engine_run_done) {
> +                if (!engine_run_done || engine_need_run(&en_flow_output))
> {
>                      VLOG_DBG("engine did not run, force recompute next
> time: "
>                               "br_int %p, chassis %p", br_int, chassis);
>                      engine_set_force_recompute(true);
> --
> 1.8.3.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Mark Michelson Aug. 2, 2019, 9:10 p.m. UTC | #2
Good catch!

Acked-by: Mark Michelson <mmichels@redhat.com>

On 8/2/19 5:22 AM, Dumitru Ceara wrote:
> When trying to incrementally process changes even if there are in-flight
> messages we were incorrectly setting the engine_aborted variable to the
> value returned by engine_run. However, engine_run returns true if the
> run was completed.
> 
> This was causing discrepancies between logical flows and openflow flows
> due to the fact that the rerun wasn't happening after an aborted run.
> 
> In order to avoid confusion the engine_aborted variable is now renamed to
> engine_run_done thus avoiding the negated logic.
> 
> CC: Han Zhou <hzhou8@ebay.com>
> Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency caused by recompute.")
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> ---
>   controller/ovn-controller.c | 14 +++++++-------
>   1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index ad33d17..15b9f4e 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -1896,7 +1896,7 @@ main(int argc, char *argv[])
>   
>       uint64_t engine_run_id = 0;
>       uint64_t old_engine_run_id = 0;
> -    bool engine_aborted = false;
> +    bool engine_run_done = true;
>   
>       unsigned int ovs_cond_seqno = UINT_MAX;
>       unsigned int ovnsb_cond_seqno = UINT_MAX;
> @@ -1997,14 +1997,14 @@ main(int argc, char *argv[])
>                                * this round of engine_run and continue processing
>                                * acculated changes incrementally later when
>                                * ofctrl_can_put() returns true. */
> -                            if (!engine_aborted) {
> +                            if (engine_run_done) {
>                                   engine_set_abort_recompute(true);
> -                                engine_aborted = engine_run(&en_flow_output,
> -                                                            ++engine_run_id);
> +                                engine_run_done = engine_run(&en_flow_output,
> +                                                             ++engine_run_id);
>                               }
>                           } else {
>                               engine_set_abort_recompute(false);
> -                            engine_aborted = false;
> +                            engine_run_done = true;
>                               engine_run(&en_flow_output, ++engine_run_id);
>                           }
>                       }
> @@ -2048,8 +2048,8 @@ main(int argc, char *argv[])
>                   }
>   
>               }
> -            if (old_engine_run_id == engine_run_id || engine_aborted) {
> -                if (engine_aborted || engine_need_run(&en_flow_output)) {
> +            if (old_engine_run_id == engine_run_id || !engine_run_done) {
> +                if (!engine_run_done || engine_need_run(&en_flow_output)) {
>                       VLOG_DBG("engine did not run, force recompute next time: "
>                                "br_int %p, chassis %p", br_int, chassis);
>                       engine_set_force_recompute(true);
>
Han Zhou Aug. 4, 2019, 8:40 p.m. UTC | #3
On Fri, Aug 2, 2019 at 2:23 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> When trying to incrementally process changes even if there are in-flight
> messages we were incorrectly setting the engine_aborted variable to the
> value returned by engine_run. However, engine_run returns true if the
> run was completed.
>
> This was causing discrepancies between logical flows and openflow flows
> due to the fact that the rerun wasn't happening after an aborted run.
>
> In order to avoid confusion the engine_aborted variable is now renamed to
> engine_run_done thus avoiding the negated logic.
>
> CC: Han Zhou <hzhou8@ebay.com>
> Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency
caused by recompute.")
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> ---
>  controller/ovn-controller.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index ad33d17..15b9f4e 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -1896,7 +1896,7 @@ main(int argc, char *argv[])
>
>      uint64_t engine_run_id = 0;
>      uint64_t old_engine_run_id = 0;
> -    bool engine_aborted = false;
> +    bool engine_run_done = true;
>
>      unsigned int ovs_cond_seqno = UINT_MAX;
>      unsigned int ovnsb_cond_seqno = UINT_MAX;
> @@ -1997,14 +1997,14 @@ main(int argc, char *argv[])
>                               * this round of engine_run and continue
processing
>                               * acculated changes incrementally later when
>                               * ofctrl_can_put() returns true. */
> -                            if (!engine_aborted) {
> +                            if (engine_run_done) {
>                                  engine_set_abort_recompute(true);
> -                                engine_aborted =
engine_run(&en_flow_output,
> -
 ++engine_run_id);
> +                                engine_run_done =
engine_run(&en_flow_output,
> +
++engine_run_id);
>                              }
>                          } else {
>                              engine_set_abort_recompute(false);
> -                            engine_aborted = false;
> +                            engine_run_done = true;
>                              engine_run(&en_flow_output, ++engine_run_id);
>                          }
>                      }
> @@ -2048,8 +2048,8 @@ main(int argc, char *argv[])
>                  }
>
>              }
> -            if (old_engine_run_id == engine_run_id || engine_aborted) {
> -                if (engine_aborted || engine_need_run(&en_flow_output)) {
> +            if (old_engine_run_id == engine_run_id || !engine_run_done) {
> +                if (!engine_run_done ||
engine_need_run(&en_flow_output)) {
>                      VLOG_DBG("engine did not run, force recompute next
time: "
>                               "br_int %p, chassis %p", br_int, chassis);
>                      engine_set_force_recompute(true);
> --
> 1.8.3.1
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Acked-by: Han Zhou <hzhou8@ebay.com>

Thanks Dumitru for this critical fix! It was my silly mistake. I tried to
reproduce the test case failure you reported, but it still didn't happen on
my laptop, even with -j10. Later we may need to add a test case that
triggers a lot of flows being installed so that the scenario with in-flight
messages are easily covered.

The original patch is still waiting to be backported to 2.12. I will send a
new version followed by your fix on 2.12.
Numan Siddique Aug. 5, 2019, 6:51 a.m. UTC | #4
On Mon, Aug 5, 2019 at 2:11 AM Han Zhou <zhouhan@gmail.com> wrote:

> On Fri, Aug 2, 2019 at 2:23 AM Dumitru Ceara <dceara@redhat.com> wrote:
> >
> > When trying to incrementally process changes even if there are in-flight
> > messages we were incorrectly setting the engine_aborted variable to the
> > value returned by engine_run. However, engine_run returns true if the
> > run was completed.
> >
> > This was causing discrepancies between logical flows and openflow flows
> > due to the fact that the rerun wasn't happening after an aborted run.
> >
> > In order to avoid confusion the engine_aborted variable is now renamed to
> > engine_run_done thus avoiding the negated logic.
> >
> > CC: Han Zhou <hzhou8@ebay.com>
> > Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency
> caused by recompute.")
> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> > ---
> >  controller/ovn-controller.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> > index ad33d17..15b9f4e 100644
> > --- a/controller/ovn-controller.c
> > +++ b/controller/ovn-controller.c
> > @@ -1896,7 +1896,7 @@ main(int argc, char *argv[])
> >
> >      uint64_t engine_run_id = 0;
> >      uint64_t old_engine_run_id = 0;
> > -    bool engine_aborted = false;
> > +    bool engine_run_done = true;
> >
> >      unsigned int ovs_cond_seqno = UINT_MAX;
> >      unsigned int ovnsb_cond_seqno = UINT_MAX;
> > @@ -1997,14 +1997,14 @@ main(int argc, char *argv[])
> >                               * this round of engine_run and continue
> processing
> >                               * acculated changes incrementally later
> when
> >                               * ofctrl_can_put() returns true. */
> > -                            if (!engine_aborted) {
> > +                            if (engine_run_done) {
> >                                  engine_set_abort_recompute(true);
> > -                                engine_aborted =
> engine_run(&en_flow_output,
> > -
>  ++engine_run_id);
> > +                                engine_run_done =
> engine_run(&en_flow_output,
> > +
> ++engine_run_id);
> >                              }
> >                          } else {
> >                              engine_set_abort_recompute(false);
> > -                            engine_aborted = false;
> > +                            engine_run_done = true;
> >                              engine_run(&en_flow_output,
> ++engine_run_id);
> >                          }
> >                      }
> > @@ -2048,8 +2048,8 @@ main(int argc, char *argv[])
> >                  }
> >
> >              }
> > -            if (old_engine_run_id == engine_run_id || engine_aborted) {
> > -                if (engine_aborted || engine_need_run(&en_flow_output))
> {
> > +            if (old_engine_run_id == engine_run_id || !engine_run_done)
> {
> > +                if (!engine_run_done ||
> engine_need_run(&en_flow_output)) {
> >                      VLOG_DBG("engine did not run, force recompute next
> time: "
> >                               "br_int %p, chassis %p", br_int, chassis);
> >                      engine_set_force_recompute(true);
> > --
> > 1.8.3.1
> >
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
> Acked-by: Han Zhou <hzhou8@ebay.com>
>


Thanks.
I applied this patch to master.

Numan


>
> Thanks Dumitru for this critical fix! It was my silly mistake. I tried to
> reproduce the test case failure you reported, but it still didn't happen on
> my laptop, even with -j10. Later we may need to add a test case that
> triggers a lot of flows being installed so that the scenario with in-flight
> messages are easily covered.
>
> The original patch is still waiting to be backported to 2.12. I will send a
> new version followed by your fix on 2.12.
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Dumitru Ceara Aug. 5, 2019, 7:12 a.m. UTC | #5
On Mon, Aug 5, 2019 at 8:52 AM Numan Siddique <nusiddiq@redhat.com> wrote:
>
>
>
> On Mon, Aug 5, 2019 at 2:11 AM Han Zhou <zhouhan@gmail.com> wrote:
>>
>> On Fri, Aug 2, 2019 at 2:23 AM Dumitru Ceara <dceara@redhat.com> wrote:
>> >
>> > When trying to incrementally process changes even if there are in-flight
>> > messages we were incorrectly setting the engine_aborted variable to the
>> > value returned by engine_run. However, engine_run returns true if the
>> > run was completed.
>> >
>> > This was causing discrepancies between logical flows and openflow flows
>> > due to the fact that the rerun wasn't happening after an aborted run.
>> >
>> > In order to avoid confusion the engine_aborted variable is now renamed to
>> > engine_run_done thus avoiding the negated logic.
>> >
>> > CC: Han Zhou <hzhou8@ebay.com>
>> > Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency
>> caused by recompute.")
>> > Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> > ---
>> >  controller/ovn-controller.c | 14 +++++++-------
>> >  1 file changed, 7 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
>> > index ad33d17..15b9f4e 100644
>> > --- a/controller/ovn-controller.c
>> > +++ b/controller/ovn-controller.c
>> > @@ -1896,7 +1896,7 @@ main(int argc, char *argv[])
>> >
>> >      uint64_t engine_run_id = 0;
>> >      uint64_t old_engine_run_id = 0;
>> > -    bool engine_aborted = false;
>> > +    bool engine_run_done = true;
>> >
>> >      unsigned int ovs_cond_seqno = UINT_MAX;
>> >      unsigned int ovnsb_cond_seqno = UINT_MAX;
>> > @@ -1997,14 +1997,14 @@ main(int argc, char *argv[])
>> >                               * this round of engine_run and continue
>> processing
>> >                               * acculated changes incrementally later when
>> >                               * ofctrl_can_put() returns true. */
>> > -                            if (!engine_aborted) {
>> > +                            if (engine_run_done) {
>> >                                  engine_set_abort_recompute(true);
>> > -                                engine_aborted =
>> engine_run(&en_flow_output,
>> > -
>>  ++engine_run_id);
>> > +                                engine_run_done =
>> engine_run(&en_flow_output,
>> > +
>> ++engine_run_id);
>> >                              }
>> >                          } else {
>> >                              engine_set_abort_recompute(false);
>> > -                            engine_aborted = false;
>> > +                            engine_run_done = true;
>> >                              engine_run(&en_flow_output, ++engine_run_id);
>> >                          }
>> >                      }
>> > @@ -2048,8 +2048,8 @@ main(int argc, char *argv[])
>> >                  }
>> >
>> >              }
>> > -            if (old_engine_run_id == engine_run_id || engine_aborted) {
>> > -                if (engine_aborted || engine_need_run(&en_flow_output)) {
>> > +            if (old_engine_run_id == engine_run_id || !engine_run_done) {
>> > +                if (!engine_run_done ||
>> engine_need_run(&en_flow_output)) {
>> >                      VLOG_DBG("engine did not run, force recompute next
>> time: "
>> >                               "br_int %p, chassis %p", br_int, chassis);
>> >                      engine_set_force_recompute(true);
>> > --
>> > 1.8.3.1
>> >
>> > _______________________________________________
>> > dev mailing list
>> > dev@openvswitch.org
>> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
>> Acked-by: Han Zhou <hzhou8@ebay.com>
>
>
>
> Thanks.
> I applied this patch to master.
>
> Numan
>
>>
>>
>> Thanks Dumitru for this critical fix! It was my silly mistake. I tried to
>> reproduce the test case failure you reported, but it still didn't happen on
>> my laptop, even with -j10. Later we may need to add a test case that
>> triggers a lot of flows being installed so that the scenario with in-flight
>> messages are easily covered.
>>
>> The original patch is still waiting to be backported to 2.12. I will send a
>> new version followed by your fix on 2.12.
>> _______________________________________________


Thanks Han, Mark, Numan for reviewing and testing the patch!
diff mbox series

Patch

diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index ad33d17..15b9f4e 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -1896,7 +1896,7 @@  main(int argc, char *argv[])
 
     uint64_t engine_run_id = 0;
     uint64_t old_engine_run_id = 0;
-    bool engine_aborted = false;
+    bool engine_run_done = true;
 
     unsigned int ovs_cond_seqno = UINT_MAX;
     unsigned int ovnsb_cond_seqno = UINT_MAX;
@@ -1997,14 +1997,14 @@  main(int argc, char *argv[])
                              * this round of engine_run and continue processing
                              * acculated changes incrementally later when
                              * ofctrl_can_put() returns true. */
-                            if (!engine_aborted) {
+                            if (engine_run_done) {
                                 engine_set_abort_recompute(true);
-                                engine_aborted = engine_run(&en_flow_output,
-                                                            ++engine_run_id);
+                                engine_run_done = engine_run(&en_flow_output,
+                                                             ++engine_run_id);
                             }
                         } else {
                             engine_set_abort_recompute(false);
-                            engine_aborted = false;
+                            engine_run_done = true;
                             engine_run(&en_flow_output, ++engine_run_id);
                         }
                     }
@@ -2048,8 +2048,8 @@  main(int argc, char *argv[])
                 }
 
             }
-            if (old_engine_run_id == engine_run_id || engine_aborted) {
-                if (engine_aborted || engine_need_run(&en_flow_output)) {
+            if (old_engine_run_id == engine_run_id || !engine_run_done) {
+                if (!engine_run_done || engine_need_run(&en_flow_output)) {
                     VLOG_DBG("engine did not run, force recompute next time: "
                              "br_int %p, chassis %p", br_int, chassis);
                     engine_set_force_recompute(true);