Message ID | 20200915192120.3543861-1-fbl@sysclose.org |
---|---|
State | Changes Requested |
Headers | show |
Series | [ovs-dev] ofproto-dpif-upcall: Log the emergency flow flush. | expand |
On 15 Sep 2020, at 21:21, Flavio Leitner wrote: > When the number of flows in the datapath reaches twice the > maximum, revalidators will delete all flows as an emergency > action to recover. In that case, log a message with values > and increase a coverage counter. > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Eelco Chaudron <echaudro@redhat.com>
Flavio Leitner <fbl@sysclose.org> writes: > When the number of flows in the datapath reaches twice the > maximum, revalidators will delete all flows as an emergency > action to recover. In that case, log a message with values > and increase a coverage counter. > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> > --- Acked-by: Aaron Conole <aconole@redhat.com>
On 9/15/20 9:21 PM, Flavio Leitner wrote: > When the number of flows in the datapath reaches twice the > maximum, revalidators will delete all flows as an emergency > action to recover. In that case, log a message with values > and increase a coverage counter. > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> > --- > ofproto/ofproto-dpif-upcall.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c > index 05a912f57..12e94287a 100644 > --- a/ofproto/ofproto-dpif-upcall.c > +++ b/ofproto/ofproto-dpif-upcall.c > @@ -57,6 +57,7 @@ COVERAGE_DEFINE(upcall_ukey_contention); > COVERAGE_DEFINE(upcall_ukey_replace); > COVERAGE_DEFINE(revalidate_missed_dp_flow); > COVERAGE_DEFINE(upcall_flow_limit_hit); > +COVERAGE_DEFINE(upcall_flow_limit_kill); > > /* A thread that reads upcalls from dpif, forwards each upcall's packet, > * and possibly sets up a kernel flow as a cache. */ > @@ -2622,6 +2623,7 @@ revalidate(struct revalidator *revalidator) > > long long int max_idle; > long long int now; > + size_t kill_all_limit; > size_t n_dp_flows; > bool kill_them_all; > > @@ -2649,7 +2651,18 @@ revalidate(struct revalidator *revalidator) > COVERAGE_INC(upcall_flow_limit_hit); > } > > - kill_them_all = n_dp_flows > flow_limit * 2; > + kill_them_all = false; > + kill_all_limit = flow_limit * 2; > + if (OVS_UNLIKELY(n_dp_flows > kill_all_limit)) { > + static struct vlog_rate_limit rlem = VLOG_RATE_LIMIT_INIT(1, 5); > + > + VLOG_WARN_RL(&rlem, "Emergency: deleting all flows " > + "(now: %"PRIuSIZE", max: %"PRIuSIZE")", Almost he same concern here as in the other patch. But this message is even more misleading since it prints doubled dynamically adjusted limit which is OK to print, but how to use this value in practice if you do not know internals of this code? Maybe something like "Number of datapath flows (%PRIuSIZE) twice as high as current dynamic flow limit (%PRIuSIZE). Deleting all flows as an emergency measure."? Another thing is that this message is actually misleading anyway because revalidators recalculates the number of flows for each batch (50) of dumped flows (every 100ms, to be honest), so we will not actually flush all the flows in many cases. We will remove them until situation is normalized. So, probably: "Number of datapath flows (%PRIuSIZE) twice as high as current dynamic flow limit (%PRIuSIZE). Starting to delete flows unconditionally as an emergency measure."? Also, this message should be printed only once per revalidation cycle, I guess, not for every flow batch. > + n_dp_flows, kill_all_limit); > + COVERAGE_INC(upcall_flow_limit_kill); > + kill_them_all = true; > + } > + > max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle; > > udpif->dpif->current_ms = time_msec(); >
On Tue, Sep 29, 2020 at 05:27:31PM +0200, Ilya Maximets wrote: > On 9/15/20 9:21 PM, Flavio Leitner wrote: > > When the number of flows in the datapath reaches twice the > > maximum, revalidators will delete all flows as an emergency > > action to recover. In that case, log a message with values > > and increase a coverage counter. > > > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> > > --- > > ofproto/ofproto-dpif-upcall.c | 15 ++++++++++++++- > > 1 file changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c > > index 05a912f57..12e94287a 100644 > > --- a/ofproto/ofproto-dpif-upcall.c > > +++ b/ofproto/ofproto-dpif-upcall.c > > @@ -57,6 +57,7 @@ COVERAGE_DEFINE(upcall_ukey_contention); > > COVERAGE_DEFINE(upcall_ukey_replace); > > COVERAGE_DEFINE(revalidate_missed_dp_flow); > > COVERAGE_DEFINE(upcall_flow_limit_hit); > > +COVERAGE_DEFINE(upcall_flow_limit_kill); > > > > /* A thread that reads upcalls from dpif, forwards each upcall's packet, > > * and possibly sets up a kernel flow as a cache. */ > > @@ -2622,6 +2623,7 @@ revalidate(struct revalidator *revalidator) > > > > long long int max_idle; > > long long int now; > > + size_t kill_all_limit; > > size_t n_dp_flows; > > bool kill_them_all; > > > > @@ -2649,7 +2651,18 @@ revalidate(struct revalidator *revalidator) > > COVERAGE_INC(upcall_flow_limit_hit); > > } > > > > - kill_them_all = n_dp_flows > flow_limit * 2; > > + kill_them_all = false; > > + kill_all_limit = flow_limit * 2; > > + if (OVS_UNLIKELY(n_dp_flows > kill_all_limit)) { > > + static struct vlog_rate_limit rlem = VLOG_RATE_LIMIT_INIT(1, 5); > > + > > + VLOG_WARN_RL(&rlem, "Emergency: deleting all flows " > > + "(now: %"PRIuSIZE", max: %"PRIuSIZE")", > > Almost he same concern here as in the other patch. But this message is even > more misleading since it prints doubled dynamically adjusted limit which is > OK to print, but how to use this value in practice if you do not know internals > of this code? The intention is to help troubleshoot an issue. It becomes more difficult to understand the situation if you don't have any idea about how those dynamic numbers look like. > Maybe something like "Number of datapath flows (%PRIuSIZE) twice as high as > current dynamic flow limit (%PRIuSIZE). Deleting all flows as an emergency > measure."? > > Another thing is that this message is actually misleading anyway because > revalidators recalculates the number of flows for each batch (50) of dumped > flows (every 100ms, to be honest), so we will not actually flush all the > flows in many cases. We will remove them until situation is normalized. > > So, probably: > "Number of datapath flows (%PRIuSIZE) twice as high as current dynamic flow > limit (%PRIuSIZE). Starting to delete flows unconditionally as an emergency > measure."? Definitely more clear and less scary. > Also, this message should be printed only once per revalidation cycle, I guess, > not for every flow batch. That is rate limited, but I will see if there is another way. fbl > > > + n_dp_flows, kill_all_limit); > > + COVERAGE_INC(upcall_flow_limit_kill); > > + kill_them_all = true; > > + } > > + > > max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle; > > > > udpif->dpif->current_ms = time_msec(); > > >
On Tue, Sep 29, 2020 at 05:27:31PM +0200, Ilya Maximets wrote: > On 9/15/20 9:21 PM, Flavio Leitner wrote: > > When the number of flows in the datapath reaches twice the > > maximum, revalidators will delete all flows as an emergency > > action to recover. In that case, log a message with values > > and increase a coverage counter. > > > > Signed-off-by: Flavio Leitner <fbl@sysclose.org> > > --- > > ofproto/ofproto-dpif-upcall.c | 15 ++++++++++++++- > > 1 file changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c > > index 05a912f57..12e94287a 100644 > > --- a/ofproto/ofproto-dpif-upcall.c > > +++ b/ofproto/ofproto-dpif-upcall.c > > @@ -57,6 +57,7 @@ COVERAGE_DEFINE(upcall_ukey_contention); > > COVERAGE_DEFINE(upcall_ukey_replace); > > COVERAGE_DEFINE(revalidate_missed_dp_flow); > > COVERAGE_DEFINE(upcall_flow_limit_hit); > > +COVERAGE_DEFINE(upcall_flow_limit_kill); > > > > /* A thread that reads upcalls from dpif, forwards each upcall's packet, > > * and possibly sets up a kernel flow as a cache. */ > > @@ -2622,6 +2623,7 @@ revalidate(struct revalidator *revalidator) > > > > long long int max_idle; > > long long int now; > > + size_t kill_all_limit; > > size_t n_dp_flows; > > bool kill_them_all; > > > > @@ -2649,7 +2651,18 @@ revalidate(struct revalidator *revalidator) > > COVERAGE_INC(upcall_flow_limit_hit); > > } > > > > - kill_them_all = n_dp_flows > flow_limit * 2; > > + kill_them_all = false; > > + kill_all_limit = flow_limit * 2; > > + if (OVS_UNLIKELY(n_dp_flows > kill_all_limit)) { > > + static struct vlog_rate_limit rlem = VLOG_RATE_LIMIT_INIT(1, 5); > > + > > + VLOG_WARN_RL(&rlem, "Emergency: deleting all flows " > > + "(now: %"PRIuSIZE", max: %"PRIuSIZE")", > > Almost he same concern here as in the other patch. But this message is even > more misleading since it prints doubled dynamically adjusted limit which is > OK to print, but how to use this value in practice if you do not know internals > of this code? > > Maybe something like "Number of datapath flows (%PRIuSIZE) twice as high as > current dynamic flow limit (%PRIuSIZE). Deleting all flows as an emergency > measure."? > > Another thing is that this message is actually misleading anyway because > revalidators recalculates the number of flows for each batch (50) of dumped > flows (every 100ms, to be honest), so we will not actually flush all the > flows in many cases. We will remove them until situation is normalized. > > So, probably: > "Number of datapath flows (%PRIuSIZE) twice as high as current dynamic flow > limit (%PRIuSIZE). Starting to delete flows unconditionally as an emergency > measure."? > > Also, this message should be printed only once per revalidation cycle, I guess, > not for every flow batch. > > > + n_dp_flows, kill_all_limit); > > + COVERAGE_INC(upcall_flow_limit_kill); > > + kill_them_all = true; > > + } > > + > > max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle; > > > > udpif->dpif->current_ms = time_msec(); > > > Sent v2: https://mail.openvswitch.org/pipermail/ovs-dev/2020-September/375647.html
diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c index 05a912f57..12e94287a 100644 --- a/ofproto/ofproto-dpif-upcall.c +++ b/ofproto/ofproto-dpif-upcall.c @@ -57,6 +57,7 @@ COVERAGE_DEFINE(upcall_ukey_contention); COVERAGE_DEFINE(upcall_ukey_replace); COVERAGE_DEFINE(revalidate_missed_dp_flow); COVERAGE_DEFINE(upcall_flow_limit_hit); +COVERAGE_DEFINE(upcall_flow_limit_kill); /* A thread that reads upcalls from dpif, forwards each upcall's packet, * and possibly sets up a kernel flow as a cache. */ @@ -2622,6 +2623,7 @@ revalidate(struct revalidator *revalidator) long long int max_idle; long long int now; + size_t kill_all_limit; size_t n_dp_flows; bool kill_them_all; @@ -2649,7 +2651,18 @@ revalidate(struct revalidator *revalidator) COVERAGE_INC(upcall_flow_limit_hit); } - kill_them_all = n_dp_flows > flow_limit * 2; + kill_them_all = false; + kill_all_limit = flow_limit * 2; + if (OVS_UNLIKELY(n_dp_flows > kill_all_limit)) { + static struct vlog_rate_limit rlem = VLOG_RATE_LIMIT_INIT(1, 5); + + VLOG_WARN_RL(&rlem, "Emergency: deleting all flows " + "(now: %"PRIuSIZE", max: %"PRIuSIZE")", + n_dp_flows, kill_all_limit); + COVERAGE_INC(upcall_flow_limit_kill); + kill_them_all = true; + } + max_idle = n_dp_flows > flow_limit ? 100 : ofproto_max_idle; udpif->dpif->current_ms = time_msec();
When the number of flows in the datapath reaches twice the maximum, revalidators will delete all flows as an emergency action to recover. In that case, log a message with values and increase a coverage counter. Signed-off-by: Flavio Leitner <fbl@sysclose.org> --- ofproto/ofproto-dpif-upcall.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)