diff mbox series

[ovs-dev] ofproto-dpif-upcall: Log the emergency flow flush.

Message ID 20200915192120.3543861-1-fbl@sysclose.org
State Changes Requested
Headers show
Series [ovs-dev] ofproto-dpif-upcall: Log the emergency flow flush. | expand

Commit Message

Flavio Leitner Sept. 15, 2020, 7:21 p.m. UTC
When the number of flows in the datapath reaches twice the
maximum, revalidators will delete all flows as an emergency
action to recover. In that case, log a message with values
and increase a coverage counter.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---
 ofproto/ofproto-dpif-upcall.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Eelco Chaudron Sept. 16, 2020, 12:23 p.m. UTC | #1
On 15 Sep 2020, at 21:21, Flavio Leitner wrote:

> When the number of flows in the datapath reaches twice the
> maximum, revalidators will delete all flows as an emergency
> action to recover. In that case, log a message with values
> and increase a coverage counter.
>
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Aaron Conole Sept. 22, 2020, 12:28 p.m. UTC | #2
Flavio Leitner <fbl@sysclose.org> writes:

> When the number of flows in the datapath reaches twice the
> maximum, revalidators will delete all flows as an emergency
> action to recover. In that case, log a message with values
> and increase a coverage counter.
>
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---

Acked-by: Aaron Conole <aconole@redhat.com>
Ilya Maximets Sept. 29, 2020, 3:27 p.m. UTC | #3
On 9/15/20 9:21 PM, Flavio Leitner wrote:
> When the number of flows in the datapath reaches twice the
> maximum, revalidators will delete all flows as an emergency
> action to recover. In that case, log a message with values
> and increase a coverage counter.
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---
>  ofproto/ofproto-dpif-upcall.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index 05a912f57..12e94287a 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -57,6 +57,7 @@ COVERAGE_DEFINE(upcall_ukey_contention);
>  COVERAGE_DEFINE(upcall_ukey_replace);
>  COVERAGE_DEFINE(revalidate_missed_dp_flow);
>  COVERAGE_DEFINE(upcall_flow_limit_hit);
> +COVERAGE_DEFINE(upcall_flow_limit_kill);
>  
>  /* A thread that reads upcalls from dpif, forwards each upcall's packet,
>   * and possibly sets up a kernel flow as a cache. */
> @@ -2622,6 +2623,7 @@ revalidate(struct revalidator *revalidator)
>  
>          long long int max_idle;
>          long long int now;
> +        size_t kill_all_limit;
>          size_t n_dp_flows;
>          bool kill_them_all;
>  
> @@ -2649,7 +2651,18 @@ revalidate(struct revalidator *revalidator)
>              COVERAGE_INC(upcall_flow_limit_hit);
>          }
>  
> -        kill_them_all = n_dp_flows > flow_limit * 2;
> +        kill_them_all = false;
> +        kill_all_limit = flow_limit * 2;
> +        if (OVS_UNLIKELY(n_dp_flows > kill_all_limit)) {
> +            static struct vlog_rate_limit rlem = VLOG_RATE_LIMIT_INIT(1, 5);
> +
> +            VLOG_WARN_RL(&rlem, "Emergency: deleting all flows "
> +                         "(now: %"PRIuSIZE", max: %"PRIuSIZE")",

Almost he same concern here as in the other patch.  But this message is even
more misleading since it prints doubled dynamically adjusted limit which is
OK to print, but how to use this value in practice if you do not know internals
of this code?

Maybe something like "Number of datapath flows (%PRIuSIZE) twice as high as
current dynamic flow limit (%PRIuSIZE).  Deleting all flows as an emergency
measure."?

Another thing is that this message is actually misleading anyway because
revalidators recalculates the number of flows for each batch (50) of dumped
flows (every 100ms, to be honest), so we will not actually flush all the
flows in many cases.  We will remove them until situation is normalized.

So, probably:
"Number of datapath flows (%PRIuSIZE) twice as high as current dynamic flow
limit (%PRIuSIZE).  Starting to delete flows unconditionally as an emergency
measure."?

Also, this message should be printed only once per revalidation cycle, I guess,
not for every flow batch.

> +                         n_dp_flows, kill_all_limit);
> +            COVERAGE_INC(upcall_flow_limit_kill);
> +            kill_them_all = true;
> +        }
> +
>          max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle;
>  
>          udpif->dpif->current_ms = time_msec();
>
Flavio Leitner Sept. 29, 2020, 5:11 p.m. UTC | #4
On Tue, Sep 29, 2020 at 05:27:31PM +0200, Ilya Maximets wrote:
> On 9/15/20 9:21 PM, Flavio Leitner wrote:
> > When the number of flows in the datapath reaches twice the
> > maximum, revalidators will delete all flows as an emergency
> > action to recover. In that case, log a message with values
> > and increase a coverage counter.
> > 
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > ---
> >  ofproto/ofproto-dpif-upcall.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> > index 05a912f57..12e94287a 100644
> > --- a/ofproto/ofproto-dpif-upcall.c
> > +++ b/ofproto/ofproto-dpif-upcall.c
> > @@ -57,6 +57,7 @@ COVERAGE_DEFINE(upcall_ukey_contention);
> >  COVERAGE_DEFINE(upcall_ukey_replace);
> >  COVERAGE_DEFINE(revalidate_missed_dp_flow);
> >  COVERAGE_DEFINE(upcall_flow_limit_hit);
> > +COVERAGE_DEFINE(upcall_flow_limit_kill);
> >  
> >  /* A thread that reads upcalls from dpif, forwards each upcall's packet,
> >   * and possibly sets up a kernel flow as a cache. */
> > @@ -2622,6 +2623,7 @@ revalidate(struct revalidator *revalidator)
> >  
> >          long long int max_idle;
> >          long long int now;
> > +        size_t kill_all_limit;
> >          size_t n_dp_flows;
> >          bool kill_them_all;
> >  
> > @@ -2649,7 +2651,18 @@ revalidate(struct revalidator *revalidator)
> >              COVERAGE_INC(upcall_flow_limit_hit);
> >          }
> >  
> > -        kill_them_all = n_dp_flows > flow_limit * 2;
> > +        kill_them_all = false;
> > +        kill_all_limit = flow_limit * 2;
> > +        if (OVS_UNLIKELY(n_dp_flows > kill_all_limit)) {
> > +            static struct vlog_rate_limit rlem = VLOG_RATE_LIMIT_INIT(1, 5);
> > +
> > +            VLOG_WARN_RL(&rlem, "Emergency: deleting all flows "
> > +                         "(now: %"PRIuSIZE", max: %"PRIuSIZE")",
> 
> Almost he same concern here as in the other patch.  But this message is even
> more misleading since it prints doubled dynamically adjusted limit which is
> OK to print, but how to use this value in practice if you do not know internals
> of this code?

The intention is to help troubleshoot an issue. It becomes more difficult
to understand the situation if you don't have any idea about how those
dynamic numbers look like.


> Maybe something like "Number of datapath flows (%PRIuSIZE) twice as high as
> current dynamic flow limit (%PRIuSIZE).  Deleting all flows as an emergency
> measure."?
> 
> Another thing is that this message is actually misleading anyway because
> revalidators recalculates the number of flows for each batch (50) of dumped
> flows (every 100ms, to be honest), so we will not actually flush all the
> flows in many cases.  We will remove them until situation is normalized.
> 
> So, probably:
> "Number of datapath flows (%PRIuSIZE) twice as high as current dynamic flow
> limit (%PRIuSIZE).  Starting to delete flows unconditionally as an emergency
> measure."?

Definitely more clear and less scary.

> Also, this message should be printed only once per revalidation cycle, I guess,
> not for every flow batch.

That is rate limited, but I will see if there is another way.

fbl


> 
> > +                         n_dp_flows, kill_all_limit);
> > +            COVERAGE_INC(upcall_flow_limit_kill);
> > +            kill_them_all = true;
> > +        }
> > +
> >          max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle;
> >  
> >          udpif->dpif->current_ms = time_msec();
> > 
>
Flavio Leitner Sept. 29, 2020, 8:09 p.m. UTC | #5
On Tue, Sep 29, 2020 at 05:27:31PM +0200, Ilya Maximets wrote:
> On 9/15/20 9:21 PM, Flavio Leitner wrote:
> > When the number of flows in the datapath reaches twice the
> > maximum, revalidators will delete all flows as an emergency
> > action to recover. In that case, log a message with values
> > and increase a coverage counter.
> > 
> > Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> > ---
> >  ofproto/ofproto-dpif-upcall.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> > index 05a912f57..12e94287a 100644
> > --- a/ofproto/ofproto-dpif-upcall.c
> > +++ b/ofproto/ofproto-dpif-upcall.c
> > @@ -57,6 +57,7 @@ COVERAGE_DEFINE(upcall_ukey_contention);
> >  COVERAGE_DEFINE(upcall_ukey_replace);
> >  COVERAGE_DEFINE(revalidate_missed_dp_flow);
> >  COVERAGE_DEFINE(upcall_flow_limit_hit);
> > +COVERAGE_DEFINE(upcall_flow_limit_kill);
> >  
> >  /* A thread that reads upcalls from dpif, forwards each upcall's packet,
> >   * and possibly sets up a kernel flow as a cache. */
> > @@ -2622,6 +2623,7 @@ revalidate(struct revalidator *revalidator)
> >  
> >          long long int max_idle;
> >          long long int now;
> > +        size_t kill_all_limit;
> >          size_t n_dp_flows;
> >          bool kill_them_all;
> >  
> > @@ -2649,7 +2651,18 @@ revalidate(struct revalidator *revalidator)
> >              COVERAGE_INC(upcall_flow_limit_hit);
> >          }
> >  
> > -        kill_them_all = n_dp_flows > flow_limit * 2;
> > +        kill_them_all = false;
> > +        kill_all_limit = flow_limit * 2;
> > +        if (OVS_UNLIKELY(n_dp_flows > kill_all_limit)) {
> > +            static struct vlog_rate_limit rlem = VLOG_RATE_LIMIT_INIT(1, 5);
> > +
> > +            VLOG_WARN_RL(&rlem, "Emergency: deleting all flows "
> > +                         "(now: %"PRIuSIZE", max: %"PRIuSIZE")",
> 
> Almost he same concern here as in the other patch.  But this message is even
> more misleading since it prints doubled dynamically adjusted limit which is
> OK to print, but how to use this value in practice if you do not know internals
> of this code?
> 
> Maybe something like "Number of datapath flows (%PRIuSIZE) twice as high as
> current dynamic flow limit (%PRIuSIZE).  Deleting all flows as an emergency
> measure."?
> 
> Another thing is that this message is actually misleading anyway because
> revalidators recalculates the number of flows for each batch (50) of dumped
> flows (every 100ms, to be honest), so we will not actually flush all the
> flows in many cases.  We will remove them until situation is normalized.
> 
> So, probably:
> "Number of datapath flows (%PRIuSIZE) twice as high as current dynamic flow
> limit (%PRIuSIZE).  Starting to delete flows unconditionally as an emergency
> measure."?
> 
> Also, this message should be printed only once per revalidation cycle, I guess,
> not for every flow batch.
> 
> > +                         n_dp_flows, kill_all_limit);
> > +            COVERAGE_INC(upcall_flow_limit_kill);
> > +            kill_them_all = true;
> > +        }
> > +
> >          max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle;
> >  
> >          udpif->dpif->current_ms = time_msec();
> > 
> 

Sent v2:
https://mail.openvswitch.org/pipermail/ovs-dev/2020-September/375647.html
diff mbox series

Patch

diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index 05a912f57..12e94287a 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -57,6 +57,7 @@  COVERAGE_DEFINE(upcall_ukey_contention);
 COVERAGE_DEFINE(upcall_ukey_replace);
 COVERAGE_DEFINE(revalidate_missed_dp_flow);
 COVERAGE_DEFINE(upcall_flow_limit_hit);
+COVERAGE_DEFINE(upcall_flow_limit_kill);
 
 /* A thread that reads upcalls from dpif, forwards each upcall's packet,
  * and possibly sets up a kernel flow as a cache. */
@@ -2622,6 +2623,7 @@  revalidate(struct revalidator *revalidator)
 
         long long int max_idle;
         long long int now;
+        size_t kill_all_limit;
         size_t n_dp_flows;
         bool kill_them_all;
 
@@ -2649,7 +2651,18 @@  revalidate(struct revalidator *revalidator)
             COVERAGE_INC(upcall_flow_limit_hit);
         }
 
-        kill_them_all = n_dp_flows > flow_limit * 2;
+        kill_them_all = false;
+        kill_all_limit = flow_limit * 2;
+        if (OVS_UNLIKELY(n_dp_flows > kill_all_limit)) {
+            static struct vlog_rate_limit rlem = VLOG_RATE_LIMIT_INIT(1, 5);
+
+            VLOG_WARN_RL(&rlem, "Emergency: deleting all flows "
+                         "(now: %"PRIuSIZE", max: %"PRIuSIZE")",
+                         n_dp_flows, kill_all_limit);
+            COVERAGE_INC(upcall_flow_limit_kill);
+            kill_them_all = true;
+        }
+
         max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle;
 
         udpif->dpif->current_ms = time_msec();