diff mbox series

[ovs-dev,v4] ofproto-dpif-upcall: fix unintended side effect of statistics workaround

Message ID 0ef0ffb69b0b2a97528502d7225a05788c8bc638.1670455538.git.bnemeth@redhat.com
State Superseded
Headers show
Series [ovs-dev,v4] ofproto-dpif-upcall: fix unintended side effect of statistics workaround | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation fail test: fail

Commit Message

Balazs Nemeth Dec. 7, 2022, 11:25 p.m. UTC
Besides the unrealistic case of an overflow, stats->n_packets can become
less than ukey->stats.n_packets only when incorrect statistics were
read. For that reason, the condition that the code employed is a
workaround to avoid large jumps in statistics simply by assuming no
packets were read. However, if this happens while the revalidator is
under heavy load, the workaround has an unintended side effect where
should_revalidate returns false because the metric it calculates is
based on a bogus value.

Therefore, this patch makes it explicit that the workaround has been
triggered and consequently skipping should_revalidate assuming it should
have returned true. Note also that the workaround is only triggered
if stats->n_packets is _strictly_ less than ukey->stats.n_packets. Not
receiving any packets is a valid scenario where should_revalidate should
not be skipped.

Signed-off-by: Balazs Nemeth <bnemeth@redhat.com>
---
 ofproto/ofproto-dpif-upcall.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Roi Dayan Dec. 8, 2022, 7:41 a.m. UTC | #1
On 08/12/2022 1:25, Balazs Nemeth wrote:
> Besides the unrealistic case of an overflow, stats->n_packets can become
> less than ukey->stats.n_packets only when incorrect statistics were
> read. For that reason, the condition that the code employed is a
> workaround to avoid large jumps in statistics simply by assuming no
> packets were read. However, if this happens while the revalidator is
> under heavy load, the workaround has an unintended side effect where
> should_revalidate returns false because the metric it calculates is
> based on a bogus value.
> 
> Therefore, this patch makes it explicit that the workaround has been
> triggered and consequently skipping should_revalidate assuming it should
> have returned true. Note also that the workaround is only triggered
> if stats->n_packets is _strictly_ less than ukey->stats.n_packets. Not
> receiving any packets is a valid scenario where should_revalidate should
> not be skipped.
> 
> Signed-off-by: Balazs Nemeth <bnemeth@redhat.com>
> ---
>  ofproto/ofproto-dpif-upcall.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index ad9635496..f0078e3c1 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -2327,6 +2327,7 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>      OVS_REQUIRES(ukey->mutex)
>  {
>      bool need_revalidate = ukey->reval_seq != reval_seq;
> +    bool stats_workaround;
>      enum reval_result result = UKEY_DELETE;
>      struct dpif_flow_stats push;
>  
> @@ -2334,7 +2335,8 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>  
>      push.used = stats->used;
>      push.tcp_flags = stats->tcp_flags;
> -    push.n_packets = (stats->n_packets > ukey->stats.n_packets
> +    stats_workaround = stats->n_packets < ukey->stats.n_packets;
> +    push.n_packets = (!stats_workaround
>                        ? stats->n_packets - ukey->stats.n_packets
>                        : 0);
>      push.n_bytes = (stats->n_bytes > ukey->stats.n_bytes
> @@ -2342,7 +2344,8 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>                      : 0);
>  
>      if (need_revalidate) {
> -        if (should_revalidate(udpif, push.n_packets, ukey->stats.used)) {
> +        if (stats_workaround ||
> +            should_revalidate(udpif, push.n_packets, ukey->stats.used)) {
>              if (!ukey->xcache) {
>                  ukey->xcache = xlate_cache_new();
>              } else {


this actually looks good to me as an improvement to the last patch
to protect against such scenario as described, as we know it happens.
I'm in favor of this.

Acked-by: Roi Dayan <roid@nvidia.com>
Eelco Chaudron Dec. 8, 2022, 7:50 a.m. UTC | #2
On 8 Dec 2022, at 0:25, Balazs Nemeth wrote:

> Besides the unrealistic case of an overflow, stats->n_packets can become
> less than ukey->stats.n_packets only when incorrect statistics were
> read. For that reason, the condition that the code employed is a
> workaround to avoid large jumps in statistics simply by assuming no
> packets were read. However, if this happens while the revalidator is
> under heavy load, the workaround has an unintended side effect where
> should_revalidate returns false because the metric it calculates is
> based on a bogus value.
>
> Therefore, this patch makes it explicit that the workaround has been
> triggered and consequently skipping should_revalidate assuming it should
> have returned true. Note also that the workaround is only triggered
> if stats->n_packets is _strictly_ less than ukey->stats.n_packets. Not
> receiving any packets is a valid scenario where should_revalidate should
> not be skipped.

I think we should not commit any workarounds in the code but concentrate on fixing the real issue that could happen, which is the wrap-around of statistics.

> Signed-off-by: Balazs Nemeth <bnemeth@redhat.com>

Next time please keep a list of revisions with changes here.

> ---
>  ofproto/ofproto-dpif-upcall.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index ad9635496..f0078e3c1 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -2327,6 +2327,7 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>      OVS_REQUIRES(ukey->mutex)
>  {
>      bool need_revalidate = ukey->reval_seq != reval_seq;
> +    bool stats_workaround;

I guess the name of this stat does not really explain what we work around.
But I guess it might not be needed, see comments below.

>      enum reval_result result = UKEY_DELETE;
>      struct dpif_flow_stats push;
>
> @@ -2334,7 +2335,8 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>
>      push.used = stats->used;
>      push.tcp_flags = stats->tcp_flags;
> -    push.n_packets = (stats->n_packets > ukey->stats.n_packets
> +    stats_workaround = stats->n_packets < ukey->stats.n_packets;
> +    push.n_packets = (!stats_workaround
>                        ? stats->n_packets - ukey->stats.n_packets
>                        : 0);
>      push.n_bytes = (stats->n_bytes > ukey->stats.n_bytes

I guess the design goal was taken to reset the stats when a wrap happens, which if true we should not reset the stats to zero, but set them to the actual value.

So please update both packets and bytes to something like (not tested): UINT64_MAX - ukey->stats.n_packets + stats->n_packets + 1

> @@ -2342,7 +2344,8 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>                      : 0);
>
>      if (need_revalidate) {
> -        if (should_revalidate(udpif, push.n_packets, ukey->stats.used)) {
> +        if (stats_workaround ||
> +            should_revalidate(udpif, push.n_packets, ukey->stats.used)) {

If we calculate the actual packets as suggested above, we should no longer need the stats_workaround.

>              if (!ukey->xcache) {
>                  ukey->xcache = xlate_cache_new();
>              } else {

Note that we also need to do this wrap-around in the push_dp_ops().

Doing it this way should also not mask the potential issue with the large packet counts, as the statistics will still show them (being a bit less large).
See the discussion here: https://patchwork.ozlabs.org/project/openvswitch/patch/164362644549.2824752.16138250405530944551.stgit@ebuild/

In the meantime, I’ll try to re-visit the issue mentioned above to see if I can replicate it again…
Eelco Chaudron Dec. 8, 2022, 9:07 a.m. UTC | #3
On 8 Dec 2022, at 8:50, Eelco Chaudron wrote:

> On 8 Dec 2022, at 0:25, Balazs Nemeth wrote:
>
>> Besides the unrealistic case of an overflow, stats->n_packets can become
>> less than ukey->stats.n_packets only when incorrect statistics were
>> read. For that reason, the condition that the code employed is a
>> workaround to avoid large jumps in statistics simply by assuming no
>> packets were read. However, if this happens while the revalidator is
>> under heavy load, the workaround has an unintended side effect where
>> should_revalidate returns false because the metric it calculates is
>> based on a bogus value.
>>
>> Therefore, this patch makes it explicit that the workaround has been
>> triggered and consequently skipping should_revalidate assuming it should
>> have returned true. Note also that the workaround is only triggered
>> if stats->n_packets is _strictly_ less than ukey->stats.n_packets. Not
>> receiving any packets is a valid scenario where should_revalidate should
>> not be skipped.
>
> I think we should not commit any workarounds in the code but concentrate on fixing the real issue that could happen, which is the wrap-around of statistics.
>
>> Signed-off-by: Balazs Nemeth <bnemeth@redhat.com>
>
> Next time please keep a list of revisions with changes here.
>
>> ---
>>  ofproto/ofproto-dpif-upcall.c | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
>> index ad9635496..f0078e3c1 100644
>> --- a/ofproto/ofproto-dpif-upcall.c
>> +++ b/ofproto/ofproto-dpif-upcall.c
>> @@ -2327,6 +2327,7 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>>      OVS_REQUIRES(ukey->mutex)
>>  {
>>      bool need_revalidate = ukey->reval_seq != reval_seq;
>> +    bool stats_workaround;
>
> I guess the name of this stat does not really explain what we work around.
> But I guess it might not be needed, see comments below.
>
>>      enum reval_result result = UKEY_DELETE;
>>      struct dpif_flow_stats push;
>>
>> @@ -2334,7 +2335,8 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>>
>>      push.used = stats->used;
>>      push.tcp_flags = stats->tcp_flags;
>> -    push.n_packets = (stats->n_packets > ukey->stats.n_packets
>> +    stats_workaround = stats->n_packets < ukey->stats.n_packets;
>> +    push.n_packets = (!stats_workaround
>>                        ? stats->n_packets - ukey->stats.n_packets
>>                        : 0);
>>      push.n_bytes = (stats->n_bytes > ukey->stats.n_bytes
>
> I guess the design goal was taken to reset the stats when a wrap happens, which if true we should not reset the stats to zero, but set them to the actual value.
>
> So please update both packets and bytes to something like (not tested): UINT64_MAX - ukey->stats.n_packets + stats->n_packets + 1

Guess I did not have enough Coffee this morning and read the modulo part wrong.
So in theory the best solution was version one of this patch :)

+    /* In case lhs of the subtraction overflowed since the last check,
+       the subtraction is still correct */
+    push.n_packets = stats->n_packets - ukey->stats.n_packets;
+    push.n_bytes = stats->n_bytes - ukey->stats.n_bytes;


But I do not think should apply this patch until we fixed the tc stat inconsistency.

In addition, a workaround with a “stats_workaround” variable is maybe something that will work for testing, but should not be upstreamed.

Also, take into consideration what would happen if the stats are always wrong. We will never expire the entry?

>> @@ -2342,7 +2344,8 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>>                      : 0);
>>
>>      if (need_revalidate) {
>> -        if (should_revalidate(udpif, push.n_packets, ukey->stats.used)) {
>> +        if (stats_workaround ||
>> +            should_revalidate(udpif, push.n_packets, ukey->stats.used)) {
>
> If we calculate the actual packets as suggested above, we should no longer need the stats_workaround.
>
>>              if (!ukey->xcache) {
>>                  ukey->xcache = xlate_cache_new();
>>              } else {
>
> Note that we also need to do this wrap-around in the push_dp_ops().
>
> Doing it this way should also not mask the potential issue with the large packet counts, as the statistics will still show them (being a bit less large).
> See the discussion here: https://patchwork.ozlabs.org/project/openvswitch/patch/164362644549.2824752.16138250405530944551.stgit@ebuild/
>
> In the meantime, I’ll try to re-visit the issue mentioned above to see if I can replicate it again…
diff mbox series

Patch

diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index ad9635496..f0078e3c1 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -2327,6 +2327,7 @@  revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
     OVS_REQUIRES(ukey->mutex)
 {
     bool need_revalidate = ukey->reval_seq != reval_seq;
+    bool stats_workaround;
     enum reval_result result = UKEY_DELETE;
     struct dpif_flow_stats push;
 
@@ -2334,7 +2335,8 @@  revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
 
     push.used = stats->used;
     push.tcp_flags = stats->tcp_flags;
-    push.n_packets = (stats->n_packets > ukey->stats.n_packets
+    stats_workaround = stats->n_packets < ukey->stats.n_packets;
+    push.n_packets = (!stats_workaround
                       ? stats->n_packets - ukey->stats.n_packets
                       : 0);
     push.n_bytes = (stats->n_bytes > ukey->stats.n_bytes
@@ -2342,7 +2344,8 @@  revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
                     : 0);
 
     if (need_revalidate) {
-        if (should_revalidate(udpif, push.n_packets, ukey->stats.used)) {
+        if (stats_workaround ||
+            should_revalidate(udpif, push.n_packets, ukey->stats.used)) {
             if (!ukey->xcache) {
                 ukey->xcache = xlate_cache_new();
             } else {