[ovs-dev,v2,0/9] ipsec: Resiliency to Libreswan failures.

Message ID 20241030135043.3139987-1-i.maximets@ovn.org

Ilya Maximets Oct. 30, 2024, 1:50 p.m. UTC
This patch set is a result of debugging different Libreswan issues
for the past few weeks in an attempt to solve the problem where
ovs-monitor-ipsec gets stuck forever while calling ipsec commands
and cannot progress any further.

Main parts here are the introduction of the reconciliation mechanism
for the ipsec connections and termination of the stuck commands on
timeout.
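
To illustrate the "termination of the stuck commands on timeout" part,
below is a minimal sketch and not the code from the patches.  The helper
name run_command, its return shape, and the default timeout value are
assumptions made for this example only.  It relies on the standard
subprocess.run(), which kills the child and raises TimeoutExpired when
the timeout elapses, so the caller never blocks forever.

```python
import subprocess
import sys


def run_command(args, timeout=5.0):
    """Run a command and return (returncode, stdout, stderr, timed_out).

    Hypothetical helper for illustration; the timeout default is an
    assumption, not a value taken from the patch set.
    """
    try:
        proc = subprocess.run(args, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, timeout=timeout)
    except subprocess.TimeoutExpired as e:
        # subprocess.run() has already killed and reaped the child here.
        out = (e.stdout or b"").decode(errors="replace")
        err = (e.stderr or b"").decode(errors="replace")
        return None, out, err, True
    # Decode in one common place so that callers always get strings.
    return (proc.returncode,
            proc.stdout.decode(errors="replace"),
            proc.stderr.decode(errors="replace"),
            False)


if __name__ == "__main__":
    # A child that sleeps longer than the timeout stands in for a stuck
    # ipsec command.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_command(stuck, timeout=0.5))
```

A caller can treat timed_out as a regular failure and leave the retry
to the next reconciliation pass instead of waiting on the command.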

This set also contains a lot of small changes that ultimately fix
compatibility with multiple versions of Libreswan as well as improve
visibility into what the ovs-monitor-ipsec process is doing by adding
more verbose logging.
For example, without the first patch in the set, ovs-monitor-ipsec
deadlocks both Libreswan and itself with Libreswan 5 pretty easily:
  https://github.com/libreswan/libreswan/issues/1859
More details on addressed issues are in the commit messages.

The last few patches in the set add a system test that stresses the
reconciliation and various failure handling paths inside the monitor,
mainly because we do get a lot of failures from Libreswan while
running the test.  This test is currently actively used by the
Libreswan team to find and fix the root causes of the multiple issues
that triggered the creation of this patch set.

The intention is for this patch set to be backported to at least
branch 3.3, but going further down to 3.1 (or even 2.17?) may also be
good.
Luckily, the code is not that different on older branches.

The set is tested with various versions of Libreswan including
3.32 (from Ubuntu 22.04), 4.5, 4.6, 4.9, 4.12, 4.14, 4.15 and 5.1.

Without the set, only 4.5 and below work well enough, 4.9 - 4.15 get
completely stuck with a few dozen connections, and 5.1 deadlocks
easily.

With the set: 4.5 and below still work well, 5.1 works well, 4.9 - 4.15
can get into a state with connectivity issues (a Libreswan issue that
cannot be worked around externally), but it is much less likely to end
up in this state and it affects only a couple of individual connections
instead of blocking the daemon as a whole.  Also, 4.14 and 4.15 seem
noticeably harder to get into that state (but still very possible).


Version 2:
  - Moved the regexp patch earlier in the set to avoid CI failures.
  - Added logic to avoid triggering reconciliation on every wake up
    if there are no configuration changes.  Now it runs only once in
    15 seconds, if there are no config changes.
  - Improved regexp for loaded connections.  Now we match on the
    string starting with a digit (IP address) after the name.
    This solves matching on connections that do not have === in their
    formatting.  No idea why libreswan prints differently sometimes.
  - Addressed comments from Roi: removed unnecessary len() and moved
    stdout/err decoding to the common function.
  - Added grep on pluto's ERRORs to the test, so they are more visible.
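
The rate limiting of reconciliation described in the list above can be
sketched as follows.  This is an illustration under stated assumptions,
not the logic from ovs-monitor-ipsec: the class name Reconciler, the
helper missing_connections, and the connection names are all
hypothetical.  Only the 15 second interval comes from the change log.

```python
import time

# Interval from the change log above; the constant name is hypothetical.
RECONCILE_INTERVAL = 15.0


class Reconciler:
    """Decide whether a reconciliation pass should run on this wake up."""

    def __init__(self, interval=RECONCILE_INTERVAL, clock=time.monotonic):
        self.interval = interval
        self.clock = clock      # Injectable so the logic can be tested.
        self.last_run = None

    def should_run(self, config_changed):
        now = self.clock()
        due = (self.last_run is None
               or now - self.last_run >= self.interval)
        # Config changes always reconcile; idle wake ups are throttled.
        if config_changed or due:
            self.last_run = now
            return True
        return False


def missing_connections(desired, loaded):
    """Return names that are configured but not currently loaded."""
    return sorted(set(desired) - set(loaded))


if __name__ == "__main__":
    r = Reconciler()
    print(r.should_run(config_changed=False))  # First wake up runs.
    print(r.should_run(config_changed=False))  # Throttled right after.
    print(missing_connections(["tun-a", "tun-b"], ["tun-a"]))
```

A monotonic clock is used so that wall-clock adjustments do not trigger
or suppress a reconciliation pass.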


Ilya Maximets (9):
  ipsec: Add a helper function to run commands from the monitor.
  ipsec: libreswan: Fix regexp for connections waiting on child SA.
  ipsec: libreswan: Reconcile missing connections periodically.
  ipsec: libreswan: Try to bring non-active connections up.
  ipsec: libreswan: Avoid monitor hanging on stuck ipsec commands.
  ipsec: Make command timeout configurable.
  system-tests: Verbose cleanup of ports and namespaces.
  tests: ipsec: Add NxN + reconciliation test.
  tests: ipsec: Check that nodes can ping each other in the NxN test.

 ipsec/ovs-monitor-ipsec.in    | 486 +++++++++++++++++++---------------
 tests/system-common-macros.at |   7 +-
 tests/system-ipsec.at         | 206 +++++++++++++-
 3 files changed, 468 insertions(+), 231 deletions(-)