diff mbox series

[ovs-dev,v3] ofproto-dpif-upcall: Try lock for udpif_key map during sweep.

Message ID 20240315100402.1979482-1-i@liuyulong.me
State Under Review
Delegated to: Ilya Maximets
Headers show
Series [ovs-dev,v3] ofproto-dpif-upcall: Try lock for udpif_key map during sweep. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

LIU Yulong March 15, 2024, 10:04 a.m. UTC
A potential race condition can happen among the following 3 threads:
* The PMD thread replaces the old_ukey and transitions its state to
  UKEY_DELETED.
* The RCU thread frees the old_ukey mutex.
* Meanwhile, the revalidator thread tries to lock the old_ukey mutex.

We added timestamps to the udpif_key state transitions, to the
udpif_key mutex free action, and to the sweep try-lock of the same
udpif_key. Whenever ovs-vswitchd dumps core, the udpif_key mutex
try-lock is always a bit later than the udpif_key mutex free action.
For instance [3]:
ukey_destroy_time = 13217289156490
sweep_now = 13217289156568

The second timestamp is 78 us later than the first.

The vswitchd process then aborts in the revalidator thread's try-lock
of ukey->mutex because of the NULL pointer.
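
The suspected window can be sketched with plain pthreads (illustrative
only, not OVS code; it deliberately leaves out the RCU protection that
should normally prevent this, and, being undefined behavior, it may or
may not actually abort on a given run):

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct fake_ukey {
        pthread_mutex_t mutex;
    };

    /* Plays the role of the deferred ukey_delete__(): destroy the mutex
     * and free the memory once the grace period "ends". */
    static void *
    rcu_like_deleter(void *arg)
    {
        struct fake_ukey *ukey = arg;

        pthread_mutex_destroy(&ukey->mutex);
        free(ukey);
        return NULL;
    }

    /* Plays the role of revalidator_sweep__(): its try-lock arrives a
     * bit later (~78 us in the example above) than the destroy/free.
     * Trylock on a destroyed or freed mutex is undefined behavior, which
     * is the crash window described above. */
    static void *
    sweeper(void *arg)
    {
        struct fake_ukey *ukey = arg;

        usleep(100);
        if (!pthread_mutex_trylock(&ukey->mutex)) {
            pthread_mutex_unlock(&ukey->mutex);
        }
        return NULL;
    }

    int
    main(void)
    {
        struct fake_ukey *ukey = malloc(sizeof *ukey);
        pthread_t deleter_thread, sweeper_thread;

        pthread_mutex_init(&ukey->mutex, NULL);
        pthread_create(&sweeper_thread, NULL, sweeper, ukey);
        pthread_create(&deleter_thread, NULL, rcu_like_deleter, ukey);
        pthread_join(deleter_thread, NULL);
        pthread_join(sweeper_thread, NULL);
        return 0;
    }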

This patch adds a try-lock on the umap bucket that holds the udpif_keys,
so that the PMD and revalidator threads do not access the same map
concurrently while replacing a udpif_key and transitioning the
udpif_key state.

More details can be found at:
[1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
[2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
[3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html

Signed-off-by: LIU Yulong <i@liuyulong.me>
---
v2: - Updated commit message to make 0-day Robot happy.
v3: - Updated commit message to make checkpatch.py happy.
    - Add some investigation details.
---
 ofproto/ofproto-dpif-upcall.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Ilya Maximets March 19, 2024, 9:42 p.m. UTC | #1
On 3/15/24 11:04, LIU Yulong wrote:
> A potential race condition happened with the following 3 threads:
> * PMD thread replaced the old_ukey and transitioned the state to
>   UKEY_DELETED.
> * RCU thread is freeing the old_ukey mutex.
> * While the revalidator thread is trying to lock the old_ukey mutex.
> 
> We added some timestamp to udpif_key state transition and
> the udpif_key mutex free action, as well as the sweep try lock for
> the same udpif_key. When the ovs-vswitchd goes core, the udpif_key
> try_lock mutex had always a bit later than the udpif_key mutex free
> action. For instance [3]:
> ukey_destroy_time = 13217289156490
> sweep_now = 13217289156568
> 
> The second time is 78us behind the first time.
> 
> Then vswitchd process aborts at the revalidator thread try_lock of
> ukey->mutex because of the NULL pointer.
> 
> This patch adds the try_lock for the ukeys' basket udpif_key map to
> avoid the PMD and revalidator access to the same map for replacing the
> udpif_key and transitioning the udpif_key state.

Thanks for the patch!

I still need to catch up on the ovs-discuss thread to better understand
what we're dealing with.

However, the solution proposed here may be a little too radical.  The
main problem is that we're holding the mutex for a potentially very long
time - the whole iteration over every ukey in the map, potentially
waiting for datapath operations to remove old flows.  This can take
hundreds of milliseconds or even a few seconds in some cases.  For that
time we will be blocking PMD threads and handlers.  This is probably
not a good solution.  Also, it might be masking the real issue by having
this large critical section, i.e. the issue may still be there, but just
much harder to reproduce.
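
To make the concern concrete, here is a small standalone pthreads
sketch (illustrative only, not OVS code) of the locking shape the patch
creates: a sweeper-like thread holds the per-umap mutex across the
whole slice, including simulated slow datapath operations, while a
PMD-like thread that needs the mutex only briefly (as try_ukey_replace()
does, per its OVS_REQUIRES(umap->mutex) annotation quoted later in this
thread) has to wait out the entire sweep:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static pthread_mutex_t umap_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Sweeper-like thread: with the patch, the umap mutex is taken once
     * and held for the whole CMAP_FOR_EACH loop, including the batched
     * datapath flow deletions issued from inside the loop. */
    static void *
    sweeper(void *arg)
    {
        (void) arg;
        pthread_mutex_lock(&umap_mutex);
        for (int batch = 0; batch < 5; batch++) {
            usleep(100 * 1000);   /* Stand-in for a slow flow-deletion batch. */
        }
        pthread_mutex_unlock(&umap_mutex);
        return NULL;
    }

    /* PMD-like thread: needs the mutex only for a short ukey replacement,
     * but now waits for the sweep of the entire slice. */
    static void *
    pmd_like(void *arg)
    {
        struct timespec start, end;

        (void) arg;
        usleep(10 * 1000);        /* Let the sweeper grab the mutex first. */
        clock_gettime(CLOCK_MONOTONIC, &start);
        pthread_mutex_lock(&umap_mutex);
        pthread_mutex_unlock(&umap_mutex);
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("PMD-like thread blocked for ~%ld ms\n",
               (long) ((end.tv_sec - start.tv_sec) * 1000
                       + (end.tv_nsec - start.tv_nsec) / 1000000));
        return NULL;
    }

    int
    main(void)
    {
        pthread_t t_sweep, t_pmd;

        pthread_create(&t_sweep, NULL, sweeper, NULL);
        pthread_create(&t_pmd, NULL, pmd_like, NULL);
        pthread_join(t_sweep, NULL);
        pthread_join(t_pmd, NULL);
        return 0;
    }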

Best regards, Ilya Maximets.
LIU Yulong March 21, 2024, 9:01 a.m. UTC | #2
ACK

Thanks, Ilya.

For now, with this lock-movement change deployed on the test hosts,
there have been no cores at sweep__ for 10 days.

LIU Yulong

On Wed, Mar 20, 2024 at 5:42 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 3/15/24 11:04, LIU Yulong wrote:
> > A potential race condition happened with the following 3 threads:
> > * PMD thread replaced the old_ukey and transitioned the state to
> >   UKEY_DELETED.
> > * RCU thread is freeing the old_ukey mutex.
> > * While the revalidator thread is trying to lock the old_ukey mutex.
> >
> > We added some timestamp to udpif_key state transition and
> > the udpif_key mutex free action, as well as the sweep try lock for
> > the same udpif_key. When the ovs-vswitchd goes core, the udpif_key
> > try_lock mutex had always a bit later than the udpif_key mutex free
> > action. For instance [3]:
> > ukey_destroy_time = 13217289156490
> > sweep_now = 13217289156568
> >
> > The second time is 78us behind the first time.
> >
> > Then vswitchd process aborts at the revalidator thread try_lock of
> > ukey->mutex because of the NULL pointer.
> >
> > This patch adds the try_lock for the ukeys' basket udpif_key map to
> > avoid the PMD and revalidator access to the same map for replacing the
> > udpif_key and transitioning the udpif_key state.
>
> Thanks for the patch!
>
> I still need to catch up on the ovs-discuss thread to better understand
> what we're dealing with.
>
> However, the solution proposed here may be a little too radical.  The
> main problem is that we're holding the mutex for potentially very long
> time - the whole iteration over every ukey in the map, potentially
> waiting for datapath operations to remove old flows.  This can take
> hundreds of milliseconds or even a fwe seconds in some cases.  For that
> time we will be blocking PMD threads and handlers.  This is probably
> not a good solution.  Also, it might be masking the real issue by having
> this large critical section, i.e. the issue may still be there, but just
> much harder to reproduce.
>
> Best regards, Ilya Maximets.
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Eelco Chaudron March 22, 2024, 12:56 p.m. UTC | #3
On 15 Mar 2024, at 11:04, LIU Yulong wrote:

> A potential race condition happened with the following 3 threads:
> * PMD thread replaced the old_ukey and transitioned the state to
>   UKEY_DELETED.
> * RCU thread is freeing the old_ukey mutex.
> * While the revalidator thread is trying to lock the old_ukey mutex.
>
> We added some timestamp to udpif_key state transition and
> the udpif_key mutex free action, as well as the sweep try lock for
> the same udpif_key. When the ovs-vswitchd goes core, the udpif_key
> try_lock mutex had always a bit later than the udpif_key mutex free
> action. For instance [3]:
> ukey_destroy_time = 13217289156490
> sweep_now = 13217289156568
>
> The second time is 78us behind the first time.
>
> Then vswitchd process aborts at the revalidator thread try_lock of
> ukey->mutex because of the NULL pointer.
>
> This patch adds the try_lock for the ukeys' basket udpif_key map to
> avoid the PMD and revalidator access to the same map for replacing the
> udpif_key and transitioning the udpif_key state.
>
> More details can be found at:
> [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
> [2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
> [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html
>
> Signed-off-by: LIU Yulong <i@liuyulong.me>

Hi LIU,

I've examined a lot of code, but I can't seem to figure out what event could be causing your problem. I also don't understand why your fix would prevent the problem from occurring (aside from possibly avoiding some quiescent state). Here's some of my analysis.

Assume PMD is doing try_ukey_replace():

  1930  try_ukey_replace(struct umap *umap, struct udpif_key *old_ukey,
  1931                   struct udpif_key *new_ukey)
  1932      OVS_REQUIRES(umap->mutex)
  1933      OVS_TRY_LOCK(true, new_ukey->mutex)
  1934  {
  1935      bool replaced = false;
  1936
  1937      if (!ovs_mutex_trylock(&old_ukey->mutex)) {
  1938          if (old_ukey->state == UKEY_EVICTED) {
  1939              /* The flow was deleted during the current revalidator dump,
  1940               * but its ukey won't be fully cleaned up until the sweep phase.
  1941               * In the mean time, we are receiving upcalls for this traffic.
  1942               * Expedite the (new) flow install by replacing the ukey. */
  1943              ovs_mutex_lock(&new_ukey->mutex);
  1944              cmap_replace(&umap->cmap, &old_ukey->cmap_node,
  1945                           &new_ukey->cmap_node, new_ukey->hash);
  1946              new_ukey->dump_seq = old_ukey->dump_seq;
  1947              ovsrcu_postpone(ukey_delete__, old_ukey);
  1948              transition_ukey(old_ukey, UKEY_DELETED);
  1949              transition_ukey(new_ukey, UKEY_VISIBLE);
  1950              replaced = true;
  1951          }
  1952          ovs_mutex_unlock(&old_ukey->mutex);
  1953      }
  ...

From this code analysis, it's evident that the ukey deletion doesn't occur immediately, but only after an RCU grace period has completed.
The function revalidator_sweep__() doesn't include any ovsrcu_quiesce() calls within the CMAP loop; it only calls it when moving on to the next slice. See the code snippet below.

2983  static void
2984  revalidator_sweep__(struct revalidator *revalidator, bool purge)
2985  {
....
2995
2996      for (int i = slice; i < N_UMAPS; i += udpif->n_revalidators) {
....
3004
3005          CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
3006              enum flow_del_reason del_reason = FDR_NONE;
3007              enum ukey_state ukey_state;
3008
3009              /* Handler threads could be holding a ukey lock while it installs a
3010               * new flow, so don't hang around waiting for access to it. */
3011              if (ovs_mutex_trylock(&ukey->mutex)) {
3012                  continue;
3013              }
3037                      reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs,
3038                                    &odp_actions);
3039                  }

3040                  OVS_USDT_PROBE(revalidator_sweep__, flow_sweep_result, udpif,
3041                                 ukey, result, del_reason);
3042              }
3043              ovs_mutex_unlock(&ukey->mutex);
....
3045              if (ukey_state == UKEY_EVICTED) {
3046                  /* The common flow deletion case involves deletion of the flow
3047                   * during the dump phase and ukey deletion here. */
3048                  ovs_mutex_lock(&umap->mutex);
3049                  ukey_delete(umap, ukey);
3050                  ovs_mutex_unlock(&umap->mutex);
3051              }
3052
3053              if (n_ops == REVALIDATE_MAX_BATCH) {
3054                  /* Update/delete missed flows and clean up corresponding ukeys
3055                   * if necessary. */
3056                  push_ukey_ops(udpif, umap, ops, n_ops);
3057                  n_ops = 0;
3058              }
3059          }
....
3066          ovsrcu_quiesce();
              ^^ Here we call it.
3067      }
3068  }
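
For readers less familiar with the OVS RCU primitives involved here,
below is a minimal sketch of the ordering they provide (an illustration
only, assuming it is compiled inside an OVS build tree so that
"ovs-rcu.h" and "ovs-thread.h" resolve; it is not code from this patch):

    #include <stdio.h>
    #include <stdlib.h>
    #include "ovs-rcu.h"
    #include "ovs-thread.h"

    struct item {
        struct ovs_mutex mutex;
        int value;
    };

    /* Deferred destructor, analogous to ukey_delete__().  OVS runs it from
     * the dedicated RCU ("urcu") thread, but only after every registered
     * thread has passed through a quiescent state since the
     * ovsrcu_postpone() call. */
    static void
    item_destroy(struct item *it)
    {
        ovs_mutex_destroy(&it->mutex);
        free(it);
    }

    int
    main(void)
    {
        struct item *it = malloc(sizeof *it);

        ovs_mutex_init(&it->mutex);
        it->value = 42;

        /* Writer side: after unpublishing 'it' from any RCU-protected
         * structure (as cmap_replace() does above), defer its destruction. */
        ovsrcu_postpone(item_destroy, it);

        /* Still safe: this thread has not quiesced since the postpone, so
         * the grace period cannot have ended and item_destroy() cannot
         * have run yet. */
        printf("still readable: %d\n", it->value);

        /* After this call the grace period may complete at any time, so
         * 'it' must not be touched again by this thread. */
        ovsrcu_quiesce();

        return 0;
    }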

So maybe you could try to remove the ovsrcu_quiesce() above, as this seems to be the only hint as to how we could go through a quiescent state here (maybe some compiler optimization)?

If this is not solving the problem, we might need to come up with some more debugging. Let me know and I can try to put some debug code together.

//Eelco
LIU Yulong April 7, 2024, 1:57 a.m. UTC | #4
Hi Eelco,

Sorry for the late reply. And thank you very much for your code examination.

Today I removed that "ovsrcu_quiesce();" at line 3066 [1] and started
running the test in my local environment. Let's see if this change
overcomes the issue.

Regarding the long lock hold or quiesce times, Ilya's feedback did seem
to hold true during some service restarts or large flow installations
(with or without the lock-move fix).
I noticed logs like this:

2024-04-07T01:33:53.025Z|00001|ovs_rcu(urcu2)|WARN|blocked 1001 ms
waiting for main to quiesce
2024-04-07T01:33:54.025Z|00002|ovs_rcu(urcu2)|WARN|blocked 2000 ms
waiting for main to quiesce
2024-04-07T01:33:56.025Z|00003|ovs_rcu(urcu2)|WARN|blocked 4000 ms
waiting for main to quiesce
2024-04-07T01:34:00.025Z|00004|ovs_rcu(urcu2)|WARN|blocked 8000 ms
waiting for main to quiesce
2024-04-07T01:34:08.025Z|00005|ovs_rcu(urcu2)|WARN|blocked 16001 ms
waiting for main to quiesce

And

2024-04-03T03:38:20.817Z|00136|ovs_rcu(urcu2)|WARN|blocked 2640 ms
waiting for dpdk_watchdog1 to quiesce
2024-04-03T03:38:20.817Z|00137|ovs_rcu(urcu2)|WARN|blocked 2640 ms
waiting for dpdk_watchdog1 to quiesce
2024-04-03T03:38:53.624Z|00138|ovs_rcu(urcu2)|WARN|blocked 1066 ms
waiting for revalidator6 to quiesce
2024-04-03T03:38:57.606Z|00139|ovs_rcu(urcu2)|WARN|blocked 5048 ms
waiting for revalidator6 to quiesce
2024-04-03T03:38:57.606Z|00140|ovs_rcu(urcu2)|WARN|blocked 5048 ms
waiting for revalidator6 to quiesce
2024-04-03T03:39:47.105Z|00141|ovs_rcu(urcu2)|WARN|blocked 54547 ms
waiting for revalidator6 to quiesce
2024-04-03T03:39:47.105Z|00142|ovs_rcu(urcu2)|WARN|blocked 54548 ms
waiting for revalidator6 to quiesce
2024-04-03T03:39:47.105Z|00143|ovs_rcu(urcu2)|WARN|blocked 54548 ms
waiting for revalidator6 to quiesce
2024-04-03T03:42:30.702Z|00144|ovs_rcu(urcu2)|WARN|blocked 10937 ms
waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:30.703Z|00145|ovs_rcu(urcu2)|WARN|blocked 10939 ms
waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:30.704Z|00146|ovs_rcu(urcu2)|WARN|blocked 10939 ms
waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:30.704Z|00147|ovs_rcu(urcu2)|WARN|blocked 10939 ms
waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:39.083Z|00151|ovs_rcu(urcu2)|WARN|blocked 19318 ms
waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:52.928Z|00155|ovs_rcu(urcu2)|WARN|blocked 33164 ms
waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:45:13.305Z|00156|ovs_rcu(urcu2)|WARN|blocked 173540 ms
waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:45:13.305Z|00157|ovs_rcu(urcu2)|WARN|blocked 173540 ms
waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:47:58.186Z|00158|ovs_rcu(urcu2)|WARN|blocked 338421 ms
waiting for pmd-c03/id:8 to quiesce


Not sure if this is related to the core dump issue.


[1] https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-upcall.c#L3066


LIU Yulong


On Fri, Mar 22, 2024 at 8:56 PM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 15 Mar 2024, at 11:04, LIU Yulong wrote:
>
> > A potential race condition happened with the following 3 threads:
> > * PMD thread replaced the old_ukey and transitioned the state to
> >   UKEY_DELETED.
> > * RCU thread is freeing the old_ukey mutex.
> > * While the revalidator thread is trying to lock the old_ukey mutex.
> >
> > We added some timestamp to udpif_key state transition and
> > the udpif_key mutex free action, as well as the sweep try lock for
> > the same udpif_key. When the ovs-vswitchd goes core, the udpif_key
> > try_lock mutex had always a bit later than the udpif_key mutex free
> > action. For instance [3]:
> > ukey_destroy_time = 13217289156490
> > sweep_now = 13217289156568
> >
> > The second time is 78us behind the first time.
> >
> > Then vswitchd process aborts at the revalidator thread try_lock of
> > ukey->mutex because of the NULL pointer.
> >
> > This patch adds the try_lock for the ukeys' basket udpif_key map to
> > avoid the PMD and revalidator access to the same map for replacing the
> > udpif_key and transitioning the udpif_key state.
> >
> > More details can be found at:
> > [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
> > [2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
> > [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html
> >
> > Signed-off-by: LIU Yulong <i@liuyulong.me>
>
> Hi LIU,
>
> I've examined a lot of code, but I can't seem to figure out what event could be causing your problem. I also don't understand why your fix would prevent the problem from occurring (aside from possibly avoiding some quiescent state). Here's some of my analysis.
>
> Assume PMD is doing try_ukey_replace():
>
>   1930  try_ukey_replace(struct umap *umap, struct udpif_key *old_ukey,
>   1931                   struct udpif_key *new_ukey)
>   1932      OVS_REQUIRES(umap->mutex)
>   1933      OVS_TRY_LOCK(true, new_ukey->mutex)
>   1934  {
>   1935      bool replaced = false;
>   1936
>   1937      if (!ovs_mutex_trylock(&old_ukey->mutex)) {
>   1938          if (old_ukey->state == UKEY_EVICTED) {
>   1939              /* The flow was deleted during the current revalidator dump,
>   1940               * but its ukey won't be fully cleaned up until the sweep phase.
>   1941               * In the mean time, we are receiving upcalls for this traffic.
>   1942               * Expedite the (new) flow install by replacing the ukey. */
>   1943              ovs_mutex_lock(&new_ukey->mutex);
>   1944              cmap_replace(&umap->cmap, &old_ukey->cmap_node,
>   1945                           &new_ukey->cmap_node, new_ukey->hash);
>   1946              new_ukey->dump_seq = old_ukey->dump_seq;
>   1947              ovsrcu_postpone(ukey_delete__, old_ukey);
>   1948              transition_ukey(old_ukey, UKEY_DELETED);
>   1949              transition_ukey(new_ukey, UKEY_VISIBLE);
>   1950              replaced = true;
>   1951          }
>   1952          ovs_mutex_unlock(&old_ukey->mutex);
>   1953      }
>   ...
>
> In this code analysis, it's evident that the ukey deletion doesn't occur immediately but rather after completing an RCU quiescent period.
> The function revalidator_sweep__() doesn't include any ovsrcu_quiesce() calls within the CMAP loop; instead, it executes it when processing the next slice. See the code snippet below.
>
> 2983  static void
> 2984  revalidator_sweep__(struct revalidator *revalidator, bool purge)
> 2985  {
> ....
> 2995
> 2996      for (int i = slice; i < N_UMAPS; i += udpif->n_revalidators) {
> ....
> 3004
> 3005          CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
> 3006              enum flow_del_reason del_reason = FDR_NONE;
> 3007              enum ukey_state ukey_state;
> 3008
> 3009              /* Handler threads could be holding a ukey lock while it installs a
> 3010               * new flow, so don't hang around waiting for access to it. */
> 3011              if (ovs_mutex_trylock(&ukey->mutex)) {
> 3012                  continue;
> 3013              }
> 3037                      reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs,
> 3038                                    &odp_actions);
> 3039                  }
>
> 3040                  OVS_USDT_PROBE(revalidator_sweep__, flow_sweep_result, udpif,
> 3041                                 ukey, result, del_reason);
> 3042              }
> 3043              ovs_mutex_unlock(&ukey->mutex);
> ....
> 3045              if (ukey_state == UKEY_EVICTED) {
> 3046                  /* The common flow deletion case involves deletion of the flow
> 3047                   * during the dump phase and ukey deletion here. */
> 3048                  ovs_mutex_lock(&umap->mutex);
> 3049                  ukey_delete(umap, ukey);
> 3050                  ovs_mutex_unlock(&umap->mutex);
> 3051              }
> 3052
> 3053              if (n_ops == REVALIDATE_MAX_BATCH) {
> 3054                  /* Update/delete missed flows and clean up corresponding ukeys
> 3055                   * if necessary. */
> 3056                  push_ukey_ops(udpif, umap, ops, n_ops);
> 3057                  n_ops = 0;
> 3058              }
> 3059          }
> ....
> 3066          ovsrcu_quiesce();
>               ^^ Here we call it.
> 3067      }
> 3068  }
>
> So maybe you could try to remove the ovsrcu_quiesce() above, as this seems to be the only hint why we could go trough quiescent state (maybe some compiler optimization)?
>
> If this is not solving the problem, we might need to come up with some more debugging. Let me know and I can try to put some debug code together.
>
> //Eelco
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Eelco Chaudron April 8, 2024, 8:43 a.m. UTC | #5
On 7 Apr 2024, at 3:57, LIU Yulong wrote:

> Hi Eelco,
>
> Sorry for the late reply. And thank you very much for your code examination.
>
> Today, I removed that  "ovsrcu_quiesce();" at  line 3066 [1], and
> started to run the
> test in my local environment. Let's see if such change can overcome the issue.

Let's see…

> Regarding the long time lock or quiesce state, Ilya's feedback seemed
> to be true during some
> service restart or large amount flow installations (no matter with or
> without the lock move fix).

This looks worrisome anyway and might be worth investigating (without your patch).

> I noticed logs like this:
>
> 2024-04-07T01:33:53.025Z|00001|ovs_rcu(urcu2)|WARN|blocked 1001 ms
> waiting for main to quiesce
> 2024-04-07T01:33:54.025Z|00002|ovs_rcu(urcu2)|WARN|blocked 2000 ms
> waiting for main to quiesce
> 2024-04-07T01:33:56.025Z|00003|ovs_rcu(urcu2)|WARN|blocked 4000 ms
> waiting for main to quiesce
> 2024-04-07T01:34:00.025Z|00004|ovs_rcu(urcu2)|WARN|blocked 8000 ms
> waiting for main to quiesce
> 2024-04-07T01:34:08.025Z|00005|ovs_rcu(urcu2)|WARN|blocked 16001 ms
> waiting for main to quiesce
>
> And
>
> 2024-04-03T03:38:20.817Z|00136|ovs_rcu(urcu2)|WARN|blocked 2640 ms
> waiting for dpdk_watchdog1 to quiesce
> 2024-04-03T03:38:20.817Z|00137|ovs_rcu(urcu2)|WARN|blocked 2640 ms
> waiting for dpdk_watchdog1 to quiesce
> 2024-04-03T03:38:53.624Z|00138|ovs_rcu(urcu2)|WARN|blocked 1066 ms
> waiting for revalidator6 to quiesce
> 2024-04-03T03:38:57.606Z|00139|ovs_rcu(urcu2)|WARN|blocked 5048 ms
> waiting for revalidator6 to quiesce
> 2024-04-03T03:38:57.606Z|00140|ovs_rcu(urcu2)|WARN|blocked 5048 ms
> waiting for revalidator6 to quiesce
> 2024-04-03T03:39:47.105Z|00141|ovs_rcu(urcu2)|WARN|blocked 54547 ms
> waiting for revalidator6 to quiesce
> 2024-04-03T03:39:47.105Z|00142|ovs_rcu(urcu2)|WARN|blocked 54548 ms
> waiting for revalidator6 to quiesce
> 2024-04-03T03:39:47.105Z|00143|ovs_rcu(urcu2)|WARN|blocked 54548 ms
> waiting for revalidator6 to quiesce
> 2024-04-03T03:42:30.702Z|00144|ovs_rcu(urcu2)|WARN|blocked 10937 ms
> waiting for pmd-c03/id:8 to quiesce
> 2024-04-03T03:42:30.703Z|00145|ovs_rcu(urcu2)|WARN|blocked 10939 ms
> waiting for pmd-c03/id:8 to quiesce
> 2024-04-03T03:42:30.704Z|00146|ovs_rcu(urcu2)|WARN|blocked 10939 ms
> waiting for pmd-c03/id:8 to quiesce
> 2024-04-03T03:42:30.704Z|00147|ovs_rcu(urcu2)|WARN|blocked 10939 ms
> waiting for pmd-c03/id:8 to quiesce
> 2024-04-03T03:42:39.083Z|00151|ovs_rcu(urcu2)|WARN|blocked 19318 ms
> waiting for pmd-c03/id:8 to quiesce
> 2024-04-03T03:42:52.928Z|00155|ovs_rcu(urcu2)|WARN|blocked 33164 ms
> waiting for pmd-c03/id:8 to quiesce
> 2024-04-03T03:45:13.305Z|00156|ovs_rcu(urcu2)|WARN|blocked 173540 ms
> waiting for pmd-c03/id:8 to quiesce
> 2024-04-03T03:45:13.305Z|00157|ovs_rcu(urcu2)|WARN|blocked 173540 ms
> waiting for pmd-c03/id:8 to quiesce
> 2024-04-03T03:47:58.186Z|00158|ovs_rcu(urcu2)|WARN|blocked 338421 ms
> waiting for pmd-c03/id:8 to quiesce
>
>
> Not sure if this is related to the core issue.

I’d guess it’s the opposite: if we do not go through the quiescent state, the entries will not be deleted :)

>
> [1] https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-upcall.c#L3066
>
>
> LIU Yulong
>
>
> On Fri, Mar 22, 2024 at 8:56 PM Eelco Chaudron <echaudro@redhat.com> wrote:
>>
>>
>>
>> On 15 Mar 2024, at 11:04, LIU Yulong wrote:
>>
>>> A potential race condition happened with the following 3 threads:
>>> * PMD thread replaced the old_ukey and transitioned the state to
>>>   UKEY_DELETED.
>>> * RCU thread is freeing the old_ukey mutex.
>>> * While the revalidator thread is trying to lock the old_ukey mutex.
>>>
>>> We added some timestamp to udpif_key state transition and
>>> the udpif_key mutex free action, as well as the sweep try lock for
>>> the same udpif_key. When the ovs-vswitchd goes core, the udpif_key
>>> try_lock mutex had always a bit later than the udpif_key mutex free
>>> action. For instance [3]:
>>> ukey_destroy_time = 13217289156490
>>> sweep_now = 13217289156568
>>>
>>> The second time is 78us behind the first time.
>>>
>>> Then vswitchd process aborts at the revalidator thread try_lock of
>>> ukey->mutex because of the NULL pointer.
>>>
>>> This patch adds the try_lock for the ukeys' basket udpif_key map to
>>> avoid the PMD and revalidator access to the same map for replacing the
>>> udpif_key and transitioning the udpif_key state.
>>>
>>> More details can be found at:
>>> [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
>>> [2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
>>> [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html
>>>
>>> Signed-off-by: LIU Yulong <i@liuyulong.me>
>>
>> Hi LIU,
>>
>> I've examined a lot of code, but I can't seem to figure out what event could be causing your problem. I also don't understand why your fix would prevent the problem from occurring (aside from possibly avoiding some quiescent state). Here's some of my analysis.
>>
>> Assume PMD is doing try_ukey_replace():
>>
>>   1930  try_ukey_replace(struct umap *umap, struct udpif_key *old_ukey,
>>   1931                   struct udpif_key *new_ukey)
>>   1932      OVS_REQUIRES(umap->mutex)
>>   1933      OVS_TRY_LOCK(true, new_ukey->mutex)
>>   1934  {
>>   1935      bool replaced = false;
>>   1936
>>   1937      if (!ovs_mutex_trylock(&old_ukey->mutex)) {
>>   1938          if (old_ukey->state == UKEY_EVICTED) {
>>   1939              /* The flow was deleted during the current revalidator dump,
>>   1940               * but its ukey won't be fully cleaned up until the sweep phase.
>>   1941               * In the mean time, we are receiving upcalls for this traffic.
>>   1942               * Expedite the (new) flow install by replacing the ukey. */
>>   1943              ovs_mutex_lock(&new_ukey->mutex);
>>   1944              cmap_replace(&umap->cmap, &old_ukey->cmap_node,
>>   1945                           &new_ukey->cmap_node, new_ukey->hash);
>>   1946              new_ukey->dump_seq = old_ukey->dump_seq;
>>   1947              ovsrcu_postpone(ukey_delete__, old_ukey);
>>   1948              transition_ukey(old_ukey, UKEY_DELETED);
>>   1949              transition_ukey(new_ukey, UKEY_VISIBLE);
>>   1950              replaced = true;
>>   1951          }
>>   1952          ovs_mutex_unlock(&old_ukey->mutex);
>>   1953      }
>>   ...
>>
>> In this code analysis, it's evident that the ukey deletion doesn't occur immediately but rather after completing an RCU quiescent period.
>> The function revalidator_sweep__() doesn't include any ovsrcu_quiesce() calls within the CMAP loop; instead, it executes it when processing the next slice. See the code snippet below.
>>
>> 2983  static void
>> 2984  revalidator_sweep__(struct revalidator *revalidator, bool purge)
>> 2985  {
>> ....
>> 2995
>> 2996      for (int i = slice; i < N_UMAPS; i += udpif->n_revalidators) {
>> ....
>> 3004
>> 3005          CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
>> 3006              enum flow_del_reason del_reason = FDR_NONE;
>> 3007              enum ukey_state ukey_state;
>> 3008
>> 3009              /* Handler threads could be holding a ukey lock while it installs a
>> 3010               * new flow, so don't hang around waiting for access to it. */
>> 3011              if (ovs_mutex_trylock(&ukey->mutex)) {
>> 3012                  continue;
>> 3013              }
>> 3037                      reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs,
>> 3038                                    &odp_actions);
>> 3039                  }
>>
>> 3040                  OVS_USDT_PROBE(revalidator_sweep__, flow_sweep_result, udpif,
>> 3041                                 ukey, result, del_reason);
>> 3042              }
>> 3043              ovs_mutex_unlock(&ukey->mutex);
>> ....
>> 3045              if (ukey_state == UKEY_EVICTED) {
>> 3046                  /* The common flow deletion case involves deletion of the flow
>> 3047                   * during the dump phase and ukey deletion here. */
>> 3048                  ovs_mutex_lock(&umap->mutex);
>> 3049                  ukey_delete(umap, ukey);
>> 3050                  ovs_mutex_unlock(&umap->mutex);
>> 3051              }
>> 3052
>> 3053              if (n_ops == REVALIDATE_MAX_BATCH) {
>> 3054                  /* Update/delete missed flows and clean up corresponding ukeys
>> 3055                   * if necessary. */
>> 3056                  push_ukey_ops(udpif, umap, ops, n_ops);
>> 3057                  n_ops = 0;
>> 3058              }
>> 3059          }
>> ....
>> 3066          ovsrcu_quiesce();
>>               ^^ Here we call it.
>> 3067      }
>> 3068  }
>>
>> So maybe you could try to remove the ovsrcu_quiesce() above, as this seems to be the only hint why we could go trough quiescent state (maybe some compiler optimization)?
>>
>> If this is not solving the problem, we might need to come up with some more debugging. Let me know and I can try to put some debug code together.
>>
>> //Eelco
>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
LIU Yulong April 12, 2024, 7:31 a.m. UTC | #6
Unfortunately, with that "ovsrcu_quiesce();" removed, ovs-vswitchd
still dumped core:
root root 957M Apr 12 01:35 core.52305

The good news is that with this change the core dump occurs much less
frequently, about once every 7 days.

Core backtrace:
#0  0x00007ff0cf7ab387 in raise () from /lib64/libc.so.6
#1  0x00007ff0cf7aca78 in abort () from /lib64/libc.so.6
#2  0x00005567d610eeee in ovs_abort_valist (err_no=<optimized out>,
format=<optimized out>, args=args@entry=0x7ff0ba72f350) at
lib/util.c:499
#3  0x00005567d610ef84 in ovs_abort (err_no=err_no@entry=0,
format=format@entry=0x5567d656d290 "%s: %s() passed uninitialized
ovs_mutex") at lib/util.c:491
#4  0x00005567d60d9c91 in ovs_mutex_trylock_at
(l_=l_@entry=0x7ff0808e6e18, where=where@entry=0x5567d654a538
"ofproto/ofproto-dpif-upcall.c:3077")
    at lib/ovs-thread.c:106
#5  0x00005567d6002d26 in revalidator_sweep__
(revalidator=revalidator@entry=0x5567da631090,
purge=purge@entry=false) at ofproto/ofproto-dpif-upcall.c:3077
#6  0x00005567d6005726 in revalidator_sweep
(revalidator=0x5567da631090) at ofproto/ofproto-dpif-upcall.c:3135
#7  udpif_revalidator (arg=0x5567da631090) at ofproto/ofproto-dpif-upcall.c:1103
#8  0x00005567d60da84f in ovsthread_wrapper (aux_=<optimized out>) at
lib/ovs-thread.c:422
#9  0x00007ff0d1902ea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ff0cf873b0d in clone () from /lib64/libc.so.6

Some timestamp:
(gdb) p sweep_now
$1 = 15851410054368

ukey life cycle:
  ukey_create_time = 15851409359172,
  ukey_visible_time = 15851409359173,
  ukey_operational_time = 15851409524080,
  ukey_evicting_time = 15851409910342,
  ukey_evicted_time = 15851409910443,
  ukey_deleted_time = 15851410052878,
  ukey_destroy_time = 15851410054243,
  ukey_replace_time = 15851410052878

based on this test commit:
https://github.com/gotostack/ovs/commit/38a2b73af4442aa741930b3e4cff32ab7b559050

The ukey_destroy_time is 125 us earlier than sweep_now:
15851410054243 - 15851410054368 = -125

So my guesses are:
1. The race condition is still between that ovs_mutex_trylock and
ovs_mutex_destroy, maybe due to some CPU clock/scheduling effects
between threads.
2. Some other thread overwrites the memory of the ukey mutex.

I will switch the code back to this lock-movement change again.

Regards,
LIU Yulong


On Mon, Apr 8, 2024 at 4:43 PM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 7 Apr 2024, at 3:57, LIU Yulong wrote:
>
> > Hi Eelco,
> >
> > Sorry for the late reply. And thank you very much for your code examination.
> >
> > Today, I removed that  "ovsrcu_quiesce();" at  line 3066 [1], and
> > started to run the
> > test in my local environment. Let's see if such change can overcome the issue.
>
> Lets see…
>
> > Regarding the long time lock or quiesce state, Ilya's feedback seemed
> > to be true during some
> > service restart or large amount flow installations (no matter with or
> > without the lock move fix).
>
> This looks worrisome anyway and might be worth investigating (without your patch).
>
> > I noticed logs like this:
> >
> > 2024-04-07T01:33:53.025Z|00001|ovs_rcu(urcu2)|WARN|blocked 1001 ms
> > waiting for main to quiesce
> > 2024-04-07T01:33:54.025Z|00002|ovs_rcu(urcu2)|WARN|blocked 2000 ms
> > waiting for main to quiesce
> > 2024-04-07T01:33:56.025Z|00003|ovs_rcu(urcu2)|WARN|blocked 4000 ms
> > waiting for main to quiesce
> > 2024-04-07T01:34:00.025Z|00004|ovs_rcu(urcu2)|WARN|blocked 8000 ms
> > waiting for main to quiesce
> > 2024-04-07T01:34:08.025Z|00005|ovs_rcu(urcu2)|WARN|blocked 16001 ms
> > waiting for main to quiesce
> >
> > And
> >
> > 2024-04-03T03:38:20.817Z|00136|ovs_rcu(urcu2)|WARN|blocked 2640 ms
> > waiting for dpdk_watchdog1 to quiesce
> > 2024-04-03T03:38:20.817Z|00137|ovs_rcu(urcu2)|WARN|blocked 2640 ms
> > waiting for dpdk_watchdog1 to quiesce
> > 2024-04-03T03:38:53.624Z|00138|ovs_rcu(urcu2)|WARN|blocked 1066 ms
> > waiting for revalidator6 to quiesce
> > 2024-04-03T03:38:57.606Z|00139|ovs_rcu(urcu2)|WARN|blocked 5048 ms
> > waiting for revalidator6 to quiesce
> > 2024-04-03T03:38:57.606Z|00140|ovs_rcu(urcu2)|WARN|blocked 5048 ms
> > waiting for revalidator6 to quiesce
> > 2024-04-03T03:39:47.105Z|00141|ovs_rcu(urcu2)|WARN|blocked 54547 ms
> > waiting for revalidator6 to quiesce
> > 2024-04-03T03:39:47.105Z|00142|ovs_rcu(urcu2)|WARN|blocked 54548 ms
> > waiting for revalidator6 to quiesce
> > 2024-04-03T03:39:47.105Z|00143|ovs_rcu(urcu2)|WARN|blocked 54548 ms
> > waiting for revalidator6 to quiesce
> > 2024-04-03T03:42:30.702Z|00144|ovs_rcu(urcu2)|WARN|blocked 10937 ms
> > waiting for pmd-c03/id:8 to quiesce
> > 2024-04-03T03:42:30.703Z|00145|ovs_rcu(urcu2)|WARN|blocked 10939 ms
> > waiting for pmd-c03/id:8 to quiesce
> > 2024-04-03T03:42:30.704Z|00146|ovs_rcu(urcu2)|WARN|blocked 10939 ms
> > waiting for pmd-c03/id:8 to quiesce
> > 2024-04-03T03:42:30.704Z|00147|ovs_rcu(urcu2)|WARN|blocked 10939 ms
> > waiting for pmd-c03/id:8 to quiesce
> > 2024-04-03T03:42:39.083Z|00151|ovs_rcu(urcu2)|WARN|blocked 19318 ms
> > waiting for pmd-c03/id:8 to quiesce
> > 2024-04-03T03:42:52.928Z|00155|ovs_rcu(urcu2)|WARN|blocked 33164 ms
> > waiting for pmd-c03/id:8 to quiesce
> > 2024-04-03T03:45:13.305Z|00156|ovs_rcu(urcu2)|WARN|blocked 173540 ms
> > waiting for pmd-c03/id:8 to quiesce
> > 2024-04-03T03:45:13.305Z|00157|ovs_rcu(urcu2)|WARN|blocked 173540 ms
> > waiting for pmd-c03/id:8 to quiesce
> > 2024-04-03T03:47:58.186Z|00158|ovs_rcu(urcu2)|WARN|blocked 338421 ms
> > waiting for pmd-c03/id:8 to quiesce
> >
> >
> > Not sure if this is related to the core issue.
>
> I’ll guess it’s the opposite. If we do not go through the quiesce state the entries will not be deleted :)
>
> >
> > [1] https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-upcall.c#L3066
> >
> >
> > LIU Yulong
> >
> >
> > On Fri, Mar 22, 2024 at 8:56 PM Eelco Chaudron <echaudro@redhat.com> wrote:
> >>
> >>
> >>
> >> On 15 Mar 2024, at 11:04, LIU Yulong wrote:
> >>
> >>> A potential race condition happened with the following 3 threads:
> >>> * PMD thread replaced the old_ukey and transitioned the state to
> >>>   UKEY_DELETED.
> >>> * RCU thread is freeing the old_ukey mutex.
> >>> * While the revalidator thread is trying to lock the old_ukey mutex.
> >>>
> >>> We added some timestamp to udpif_key state transition and
> >>> the udpif_key mutex free action, as well as the sweep try lock for
> >>> the same udpif_key. When the ovs-vswitchd goes core, the udpif_key
> >>> try_lock mutex had always a bit later than the udpif_key mutex free
> >>> action. For instance [3]:
> >>> ukey_destroy_time = 13217289156490
> >>> sweep_now = 13217289156568
> >>>
> >>> The second time is 78us behind the first time.
> >>>
> >>> Then vswitchd process aborts at the revalidator thread try_lock of
> >>> ukey->mutex because of the NULL pointer.
> >>>
> >>> This patch adds the try_lock for the ukeys' basket udpif_key map to
> >>> avoid the PMD and revalidator access to the same map for replacing the
> >>> udpif_key and transitioning the udpif_key state.
> >>>
> >>> More details can be found at:
> >>> [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
> >>> [2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
> >>> [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html
> >>>
> >>> Signed-off-by: LIU Yulong <i@liuyulong.me>
> >>
> >> Hi LIU,
> >>
> >> I've examined a lot of code, but I can't seem to figure out what event could be causing your problem. I also don't understand why your fix would prevent the problem from occurring (aside from possibly avoiding some quiescent state). Here's some of my analysis.
> >>
> >> Assume PMD is doing try_ukey_replace():
> >>
> >>   1930  try_ukey_replace(struct umap *umap, struct udpif_key *old_ukey,
> >>   1931                   struct udpif_key *new_ukey)
> >>   1932      OVS_REQUIRES(umap->mutex)
> >>   1933      OVS_TRY_LOCK(true, new_ukey->mutex)
> >>   1934  {
> >>   1935      bool replaced = false;
> >>   1936
> >>   1937      if (!ovs_mutex_trylock(&old_ukey->mutex)) {
> >>   1938          if (old_ukey->state == UKEY_EVICTED) {
> >>   1939              /* The flow was deleted during the current revalidator dump,
> >>   1940               * but its ukey won't be fully cleaned up until the sweep phase.
> >>   1941               * In the mean time, we are receiving upcalls for this traffic.
> >>   1942               * Expedite the (new) flow install by replacing the ukey. */
> >>   1943              ovs_mutex_lock(&new_ukey->mutex);
> >>   1944              cmap_replace(&umap->cmap, &old_ukey->cmap_node,
> >>   1945                           &new_ukey->cmap_node, new_ukey->hash);
> >>   1946              new_ukey->dump_seq = old_ukey->dump_seq;
> >>   1947              ovsrcu_postpone(ukey_delete__, old_ukey);
> >>   1948              transition_ukey(old_ukey, UKEY_DELETED);
> >>   1949              transition_ukey(new_ukey, UKEY_VISIBLE);
> >>   1950              replaced = true;
> >>   1951          }
> >>   1952          ovs_mutex_unlock(&old_ukey->mutex);
> >>   1953      }
> >>   ...
> >>
> >> In this code analysis, it's evident that the ukey deletion doesn't occur immediately but rather after completing an RCU quiescent period.
> >> The function revalidator_sweep__() doesn't include any ovsrcu_quiesce() calls within the CMAP loop; instead, it executes it when processing the next slice. See the code snippet below.
> >>
> >> 2983  static void
> >> 2984  revalidator_sweep__(struct revalidator *revalidator, bool purge)
> >> 2985  {
> >> ....
> >> 2995
> >> 2996      for (int i = slice; i < N_UMAPS; i += udpif->n_revalidators) {
> >> ....
> >> 3004
> >> 3005          CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
> >> 3006              enum flow_del_reason del_reason = FDR_NONE;
> >> 3007              enum ukey_state ukey_state;
> >> 3008
> >> 3009              /* Handler threads could be holding a ukey lock while it installs a
> >> 3010               * new flow, so don't hang around waiting for access to it. */
> >> 3011              if (ovs_mutex_trylock(&ukey->mutex)) {
> >> 3012                  continue;
> >> 3013              }
> >> 3037                      reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs,
> >> 3038                                    &odp_actions);
> >> 3039                  }
> >>
> >> 3040                  OVS_USDT_PROBE(revalidator_sweep__, flow_sweep_result, udpif,
> >> 3041                                 ukey, result, del_reason);
> >> 3042              }
> >> 3043              ovs_mutex_unlock(&ukey->mutex);
> >> ....
> >> 3045              if (ukey_state == UKEY_EVICTED) {
> >> 3046                  /* The common flow deletion case involves deletion of the flow
> >> 3047                   * during the dump phase and ukey deletion here. */
> >> 3048                  ovs_mutex_lock(&umap->mutex);
> >> 3049                  ukey_delete(umap, ukey);
> >> 3050                  ovs_mutex_unlock(&umap->mutex);
> >> 3051              }
> >> 3052
> >> 3053              if (n_ops == REVALIDATE_MAX_BATCH) {
> >> 3054                  /* Update/delete missed flows and clean up corresponding ukeys
> >> 3055                   * if necessary. */
> >> 3056                  push_ukey_ops(udpif, umap, ops, n_ops);
> >> 3057                  n_ops = 0;
> >> 3058              }
> >> 3059          }
> >> ....
> >> 3066          ovsrcu_quiesce();
> >>               ^^ Here we call it.
> >> 3067      }
> >> 3068  }
> >>
> >> So maybe you could try to remove the ovsrcu_quiesce() above, as this seems to be the only hint why we could go trough quiescent state (maybe some compiler optimization)?
> >>
> >> If this is not solving the problem, we might need to come up with some more debugging. Let me know and I can try to put some debug code together.
> >>
> >> //Eelco
> >>
> >> _______________________________________________
> >> dev mailing list
> >> dev@openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Eelco Chaudron April 12, 2024, 6:23 p.m. UTC | #7
On 12 Apr 2024, at 9:31, LIU Yulong wrote:

> Unfortunately, removed that  "ovsrcu_quiesce();" the ovs-vswitchd
> still went core:
> root root 957M Apr 12 01:35 core.52305
>
> But good news is with such change the core dump does not occur so frequently,
> it's about 7 days.
>
> Core trace back:
> #0  0x00007ff0cf7ab387 in raise () from /lib64/libc.so.6
> #1  0x00007ff0cf7aca78 in abort () from /lib64/libc.so.6
> #2  0x00005567d610eeee in ovs_abort_valist (err_no=<optimized out>,
> format=<optimized out>, args=args@entry=0x7ff0ba72f350) at
> lib/util.c:499
> #3  0x00005567d610ef84 in ovs_abort (err_no=err_no@entry=0,
> format=format@entry=0x5567d656d290 "%s: %s() passed uninitialized
> ovs_mutex") at lib/util.c:491
> #4  0x00005567d60d9c91 in ovs_mutex_trylock_at
> (l_=l_@entry=0x7ff0808e6e18, where=where@entry=0x5567d654a538
> "ofproto/ofproto-dpif-upcall.c:3077")
>     at lib/ovs-thread.c:106
> #5  0x00005567d6002d26 in revalidator_sweep__
> (revalidator=revalidator@entry=0x5567da631090,
> purge=purge@entry=false) at ofproto/ofproto-dpif-upcall.c:3077
> #6  0x00005567d6005726 in revalidator_sweep
> (revalidator=0x5567da631090) at ofproto/ofproto-dpif-upcall.c:3135
> #7  udpif_revalidator (arg=0x5567da631090) at ofproto/ofproto-dpif-upcall.c:1103
> #8  0x00005567d60da84f in ovsthread_wrapper (aux_=<optimized out>) at
> lib/ovs-thread.c:422
> #9  0x00007ff0d1902ea5 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007ff0cf873b0d in clone () from /lib64/libc.so.6
>
> Some timestamp:
> (gdb) p sweep_now
> $1 = 15851410054368
>
> ukey life cycle:
>   ukey_create_time = 15851409359172,
>   ukey_visible_time = 15851409359173,
>   ukey_operational_time = 15851409524080,
>   ukey_evicting_time = 15851409910342,
>   ukey_evicted_time = 15851409910443,
>   ukey_deleted_time = 15851410052878,
>   ukey_destroy_time = 15851410054243,
>   ukey_replace_time = 15851410052878
>
> based on this test commit:
> https://github.com/gotostack/ovs/commit/38a2b73af4442aa741930b3e4cff32ab7b559050
>
> The ukey_destroy_time is 125us later than sweep_now:
> 15851410054243 - 15851410054368 = -125
>
> So I may guess:
> 1. The race condition is still between such ovs_mutex_trylock and
> ovs_mutex_destroy, maybe due to some
> CPU clock switching  between threads.
> 2. There are other threads which override the memory of ukey mutex lock.

There is something odd going on in your system, but from a code analysis, the locking/rcu handling seems fine.

If I get some time in the next week(s), I’ll try to come up with some debug code to better understand the problem.
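
One possible shape for such debug code (purely illustrative; the
struct, field, and helper names below are made up for this sketch and
do not exist in ofproto-dpif-upcall.c) is a magic/canary field that is
set on creation, poisoned in the deferred destructor, and asserted on
right before the try-lock, so a use-after-free or a stray overwrite of
the mutex memory trips immediately:

    #include <assert.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define UKEY_ALIVE_MAGIC 0x75646b79u   /* "udky" */
    #define UKEY_DEAD_MAGIC  0xdeadbeefu

    /* Hypothetical debug wrapper around a ukey-like object. */
    struct dbg_ukey {
        uint32_t magic;
        pthread_mutex_t mutex;
    };

    static struct dbg_ukey *
    dbg_ukey_create(void)
    {
        struct dbg_ukey *ukey = calloc(1, sizeof *ukey);

        ukey->magic = UKEY_ALIVE_MAGIC;
        pthread_mutex_init(&ukey->mutex, NULL);
        return ukey;
    }

    /* Would run as the deferred (RCU) destructor: poison before
     * destroying, so a later try-lock on freed or overwritten memory
     * fails the assertion instead of aborting deep inside the mutex
     * code. */
    static void
    dbg_ukey_destroy(struct dbg_ukey *ukey)
    {
        ukey->magic = UKEY_DEAD_MAGIC;
        pthread_mutex_destroy(&ukey->mutex);
        free(ukey);
    }

    /* The check the sweep loop would perform right before its try-lock. */
    static int
    dbg_ukey_trylock(struct dbg_ukey *ukey)
    {
        assert(ukey->magic == UKEY_ALIVE_MAGIC);
        return pthread_mutex_trylock(&ukey->mutex);
    }

    int
    main(void)
    {
        struct dbg_ukey *ukey = dbg_ukey_create();

        if (!dbg_ukey_trylock(ukey)) {
            pthread_mutex_unlock(&ukey->mutex);
        }
        dbg_ukey_destroy(ukey);
        return 0;
    }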

Cheers,

Eelco

> I will set the code back to this lock movement change again.
>
> Regards,
> LIU Yulong
>
>
> On Mon, Apr 8, 2024 at 4:43 PM Eelco Chaudron <echaudro@redhat.com> wrote:
>>
>>
>>
>> On 7 Apr 2024, at 3:57, LIU Yulong wrote:
>>
>>> Hi Eelco,
>>>
>>> Sorry for the late reply. And thank you very much for your code examination.
>>>
>>> Today, I removed that  "ovsrcu_quiesce();" at  line 3066 [1], and
>>> started to run the
>>> test in my local environment. Let's see if such change can overcome the issue.
>>
>> Lets see…
>>
>>> Regarding the long time lock or quiesce state, Ilya's feedback seemed
>>> to be true during some
>>> service restart or large amount flow installations (no matter with or
>>> without the lock move fix).
>>
>> This looks worrisome anyway and might be worth investigating (without your patch).
>>
>>> I noticed logs like this:
>>>
>>> 2024-04-07T01:33:53.025Z|00001|ovs_rcu(urcu2)|WARN|blocked 1001 ms
>>> waiting for main to quiesce
>>> 2024-04-07T01:33:54.025Z|00002|ovs_rcu(urcu2)|WARN|blocked 2000 ms
>>> waiting for main to quiesce
>>> 2024-04-07T01:33:56.025Z|00003|ovs_rcu(urcu2)|WARN|blocked 4000 ms
>>> waiting for main to quiesce
>>> 2024-04-07T01:34:00.025Z|00004|ovs_rcu(urcu2)|WARN|blocked 8000 ms
>>> waiting for main to quiesce
>>> 2024-04-07T01:34:08.025Z|00005|ovs_rcu(urcu2)|WARN|blocked 16001 ms
>>> waiting for main to quiesce
>>>
>>> And
>>>
>>> 2024-04-03T03:38:20.817Z|00136|ovs_rcu(urcu2)|WARN|blocked 2640 ms
>>> waiting for dpdk_watchdog1 to quiesce
>>> 2024-04-03T03:38:20.817Z|00137|ovs_rcu(urcu2)|WARN|blocked 2640 ms
>>> waiting for dpdk_watchdog1 to quiesce
>>> 2024-04-03T03:38:53.624Z|00138|ovs_rcu(urcu2)|WARN|blocked 1066 ms
>>> waiting for revalidator6 to quiesce
>>> 2024-04-03T03:38:57.606Z|00139|ovs_rcu(urcu2)|WARN|blocked 5048 ms
>>> waiting for revalidator6 to quiesce
>>> 2024-04-03T03:38:57.606Z|00140|ovs_rcu(urcu2)|WARN|blocked 5048 ms
>>> waiting for revalidator6 to quiesce
>>> 2024-04-03T03:39:47.105Z|00141|ovs_rcu(urcu2)|WARN|blocked 54547 ms
>>> waiting for revalidator6 to quiesce
>>> 2024-04-03T03:39:47.105Z|00142|ovs_rcu(urcu2)|WARN|blocked 54548 ms
>>> waiting for revalidator6 to quiesce
>>> 2024-04-03T03:39:47.105Z|00143|ovs_rcu(urcu2)|WARN|blocked 54548 ms
>>> waiting for revalidator6 to quiesce
>>> 2024-04-03T03:42:30.702Z|00144|ovs_rcu(urcu2)|WARN|blocked 10937 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>> 2024-04-03T03:42:30.703Z|00145|ovs_rcu(urcu2)|WARN|blocked 10939 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>> 2024-04-03T03:42:30.704Z|00146|ovs_rcu(urcu2)|WARN|blocked 10939 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>> 2024-04-03T03:42:30.704Z|00147|ovs_rcu(urcu2)|WARN|blocked 10939 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>> 2024-04-03T03:42:39.083Z|00151|ovs_rcu(urcu2)|WARN|blocked 19318 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>> 2024-04-03T03:42:52.928Z|00155|ovs_rcu(urcu2)|WARN|blocked 33164 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>> 2024-04-03T03:45:13.305Z|00156|ovs_rcu(urcu2)|WARN|blocked 173540 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>> 2024-04-03T03:45:13.305Z|00157|ovs_rcu(urcu2)|WARN|blocked 173540 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>> 2024-04-03T03:47:58.186Z|00158|ovs_rcu(urcu2)|WARN|blocked 338421 ms
>>> waiting for pmd-c03/id:8 to quiesce
>>>
>>>
>>> Not sure if this is related to the core issue.
>>
>> I’ll guess it’s the opposite. If we do not go through the quiesce state the entries will not be deleted :)
>>
>>>
>>> [1] https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-upcall.c#L3066
>>>
>>>
>>> LIU Yulong
>>>
>>>
>>> On Fri, Mar 22, 2024 at 8:56 PM Eelco Chaudron <echaudro@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 15 Mar 2024, at 11:04, LIU Yulong wrote:
>>>>
>>>>> A potential race condition happened with the following 3 threads:
>>>>> * PMD thread replaced the old_ukey and transitioned the state to
>>>>>   UKEY_DELETED.
>>>>> * RCU thread is freeing the old_ukey mutex.
>>>>> * While the revalidator thread is trying to lock the old_ukey mutex.
>>>>>
>>>>> We added some timestamp to udpif_key state transition and
>>>>> the udpif_key mutex free action, as well as the sweep try lock for
>>>>> the same udpif_key. When the ovs-vswitchd goes core, the udpif_key
>>>>> try_lock mutex had always a bit later than the udpif_key mutex free
>>>>> action. For instance [3]:
>>>>> ukey_destroy_time = 13217289156490
>>>>> sweep_now = 13217289156568
>>>>>
>>>>> The second time is 78us behind the first time.
>>>>>
>>>>> Then vswitchd process aborts at the revalidator thread try_lock of
>>>>> ukey->mutex because of the NULL pointer.
>>>>>
>>>>> This patch adds the try_lock for the ukeys' basket udpif_key map to
>>>>> avoid the PMD and revalidator access to the same map for replacing the
>>>>> udpif_key and transitioning the udpif_key state.
>>>>>
>>>>> More details can be found at:
>>>>> [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
>>>>> [2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
>>>>> [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html
>>>>>
>>>>> Signed-off-by: LIU Yulong <i@liuyulong.me>
>>>>
>>>> Hi LIU,
>>>>
>>>> I've examined a lot of code, but I can't seem to figure out what event could be causing your problem. I also don't understand why your fix would prevent the problem from occurring (aside from possibly avoiding some quiescent state). Here's some of my analysis.
>>>>
>>>> Assume PMD is doing try_ukey_replace():
>>>>
>>>>   1930  try_ukey_replace(struct umap *umap, struct udpif_key *old_ukey,
>>>>   1931                   struct udpif_key *new_ukey)
>>>>   1932      OVS_REQUIRES(umap->mutex)
>>>>   1933      OVS_TRY_LOCK(true, new_ukey->mutex)
>>>>   1934  {
>>>>   1935      bool replaced = false;
>>>>   1936
>>>>   1937      if (!ovs_mutex_trylock(&old_ukey->mutex)) {
>>>>   1938          if (old_ukey->state == UKEY_EVICTED) {
>>>>   1939              /* The flow was deleted during the current revalidator dump,
>>>>   1940               * but its ukey won't be fully cleaned up until the sweep phase.
>>>>   1941               * In the mean time, we are receiving upcalls for this traffic.
>>>>   1942               * Expedite the (new) flow install by replacing the ukey. */
>>>>   1943              ovs_mutex_lock(&new_ukey->mutex);
>>>>   1944              cmap_replace(&umap->cmap, &old_ukey->cmap_node,
>>>>   1945                           &new_ukey->cmap_node, new_ukey->hash);
>>>>   1946              new_ukey->dump_seq = old_ukey->dump_seq;
>>>>   1947              ovsrcu_postpone(ukey_delete__, old_ukey);
>>>>   1948              transition_ukey(old_ukey, UKEY_DELETED);
>>>>   1949              transition_ukey(new_ukey, UKEY_VISIBLE);
>>>>   1950              replaced = true;
>>>>   1951          }
>>>>   1952          ovs_mutex_unlock(&old_ukey->mutex);
>>>>   1953      }
>>>>   ...
>>>>
>>>> In this code analysis, it's evident that the ukey deletion doesn't occur immediately but rather after completing an RCU quiescent period.
>>>> The function revalidator_sweep__() doesn't include any ovsrcu_quiesce() calls within the CMAP loop; instead, it executes it when processing the next slice. See the code snippet below.
>>>>
>>>> 2983  static void
>>>> 2984  revalidator_sweep__(struct revalidator *revalidator, bool purge)
>>>> 2985  {
>>>> ....
>>>> 2995
>>>> 2996      for (int i = slice; i < N_UMAPS; i += udpif->n_revalidators) {
>>>> ....
>>>> 3004
>>>> 3005          CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
>>>> 3006              enum flow_del_reason del_reason = FDR_NONE;
>>>> 3007              enum ukey_state ukey_state;
>>>> 3008
>>>> 3009              /* Handler threads could be holding a ukey lock while it installs a
>>>> 3010               * new flow, so don't hang around waiting for access to it. */
>>>> 3011              if (ovs_mutex_trylock(&ukey->mutex)) {
>>>> 3012                  continue;
>>>> 3013              }
>>>> 3037                      reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs,
>>>> 3038                                    &odp_actions);
>>>> 3039                  }
>>>>
>>>> 3040                  OVS_USDT_PROBE(revalidator_sweep__, flow_sweep_result, udpif,
>>>> 3041                                 ukey, result, del_reason);
>>>> 3042              }
>>>> 3043              ovs_mutex_unlock(&ukey->mutex);
>>>> ....
>>>> 3045              if (ukey_state == UKEY_EVICTED) {
>>>> 3046                  /* The common flow deletion case involves deletion of the flow
>>>> 3047                   * during the dump phase and ukey deletion here. */
>>>> 3048                  ovs_mutex_lock(&umap->mutex);
>>>> 3049                  ukey_delete(umap, ukey);
>>>> 3050                  ovs_mutex_unlock(&umap->mutex);
>>>> 3051              }
>>>> 3052
>>>> 3053              if (n_ops == REVALIDATE_MAX_BATCH) {
>>>> 3054                  /* Update/delete missed flows and clean up corresponding ukeys
>>>> 3055                   * if necessary. */
>>>> 3056                  push_ukey_ops(udpif, umap, ops, n_ops);
>>>> 3057                  n_ops = 0;
>>>> 3058              }
>>>> 3059          }
>>>> ....
>>>> 3066          ovsrcu_quiesce();
>>>>               ^^ Here we call it.
>>>> 3067      }
>>>> 3068  }
>>>>
>>>> So maybe you could try to remove the ovsrcu_quiesce() above, as this seems to be the only hint why we could go trough quiescent state (maybe some compiler optimization)?
>>>>
>>>> If this is not solving the problem, we might need to come up with some more debugging. Let me know and I can try to put some debug code together.
>>>>
>>>> //Eelco
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev@openvswitch.org
>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
diff mbox series

Patch

diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index 9a5c5c29c..ef13f820a 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -2974,6 +2974,10 @@  revalidator_sweep__(struct revalidator *revalidator, bool purge)
         struct umap *umap = &udpif->ukeys[i];
         size_t n_ops = 0;
 
+        if (ovs_mutex_trylock(&umap->mutex)) {
+            continue;
+        }
+
         CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
             enum ukey_state ukey_state;
 
@@ -3013,9 +3017,7 @@  revalidator_sweep__(struct revalidator *revalidator, bool purge)
             if (ukey_state == UKEY_EVICTED) {
                 /* The common flow deletion case involves deletion of the flow
                  * during the dump phase and ukey deletion here. */
-                ovs_mutex_lock(&umap->mutex);
                 ukey_delete(umap, ukey);
-                ovs_mutex_unlock(&umap->mutex);
             }
 
             if (n_ops == REVALIDATE_MAX_BATCH) {
@@ -3025,6 +3027,7 @@  revalidator_sweep__(struct revalidator *revalidator, bool purge)
                 n_ops = 0;
             }
         }
+        ovs_mutex_unlock(&umap->mutex);
 
         if (n_ops) {
             push_ukey_ops(udpif, umap, ops, n_ops);