
[ovs-dev,RFC,0/5] conntrack: Introduce buckets and reduce contention.

Message ID 163820911055.3001886.13118680364054625917.stgit@fed.void

Paolo Valerio Nov. 29, 2021, 6:05 p.m. UTC
This series aims to share the work done so far and to start a
discussion about a slightly different approach to the way we handle
the connections.
It is not considered ready, as further work and extensive testing are
needed.

There are two main logical changes that the series tries to introduce:
the first is the reintroduction of buckets, the second is the removal
of the expiration lists.

The first two patches are prerequisites and were picked (untouched)
from this series (by Gaetan):

https://patchwork.ozlabs.org/project/openvswitch/list/?series=249027&state=*

The third patch is a follow-up to the following:

https://patchwork.ozlabs.org/project/openvswitch/patch/20201129033255.64647-2-hepeng.0320@bytedance.com/

Some logic has changed, but the overall idea remained the same.
Additional details in the patch description.

The fourth patch introduces the buckets and removes the expiration
lists.
The buckets are reintroduced as cmaps instead of hmaps, which also
helps to narrow down the critical sections.
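As a rough sketch of the direction (all names and the bucket count here
are illustrative, not the series' actual code, and a plain counter plus
mutex stands in for the real per-bucket cmap): each bucket pairs its own
map with its own lock, so writes that touch different buckets no longer
contend on one global structure.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define CONNTRACK_BUCKETS 16    /* illustrative; not the series' value */

/* Simplified stand-in for one bucket: in the series each bucket holds
 * a cmap of connections; here a counter represents its contents. */
struct ct_bucket {
    pthread_mutex_t lock;       /* guards writes to this bucket only */
    unsigned int n_conns;       /* stand-in for the per-bucket cmap */
};

struct conntrack {
    struct ct_bucket buckets[CONNTRACK_BUCKETS];
};

static void
conntrack_init(struct conntrack *ct)
{
    for (int i = 0; i < CONNTRACK_BUCKETS; i++) {
        pthread_mutex_init(&ct->buckets[i].lock, NULL);
        ct->buckets[i].n_conns = 0;
    }
}

/* Map a connection hash to its bucket. */
static uint32_t
ct_bucket_of(uint32_t hash)
{
    return hash % CONNTRACK_BUCKETS;
}

/* The write critical section covers one bucket, not the whole table. */
static void
ct_insert(struct conntrack *ct, uint32_t hash)
{
    struct ct_bucket *b = &ct->buckets[ct_bucket_of(hash)];

    pthread_mutex_lock(&b->lock);
    b->n_conns++;
    pthread_mutex_unlock(&b->lock);
}
```

With cmaps instead of the counter above, lookups can additionally stay
lock-free, which is presumably why cmaps were preferred over hmaps.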

An alternative approach could replace the cmaps with linked lists
(without increasing the number of locks) and increase the number of
buckets (possibly allowing the number of buckets to change when the
user changes the connection limit). This could improve the eviction of
stale connections during packet processing, but it is currently out of
scope for this series (it can be discussed, though).

The insertion path, namely the creation of the connection, can
probably be improved (or at least an improvement can be evaluated):
in the NAT case it acquires the bucket locks by calling
nat_get_unique_tuple_lock(). This is needed because we may need to
know the reverse tuple before locking, in order to acquire the locks
in the right order and avoid ABBA deadlocks.
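The ABBA-avoidance idea can be sketched as follows (a minimal,
hypothetical illustration of the standard lock-ordering technique, not
the body of nat_get_unique_tuple_lock() itself): once both the forward
and reverse tuples are hashed to bucket indices, the two bucket locks
are always taken in ascending index order, so two threads needing the
same pair of buckets can never hold them in opposite orders.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define CONNTRACK_BUCKETS 16    /* illustrative bucket count */

static pthread_mutex_t bucket_locks[CONNTRACK_BUCKETS];

static void
bucket_locks_init(void)
{
    for (int i = 0; i < CONNTRACK_BUCKETS; i++) {
        pthread_mutex_init(&bucket_locks[i], NULL);
    }
}

/* Take the forward- and reverse-tuple bucket locks in ascending index
 * order.  Any two threads needing the same pair of buckets then lock
 * them in the same order, so ABBA deadlock cannot occur.  This is why
 * the reverse tuple must be known before locking. */
static void
ct_lock_bucket_pair(uint32_t fwd_bucket, uint32_t rev_bucket)
{
    if (fwd_bucket == rev_bucket) {
        pthread_mutex_lock(&bucket_locks[fwd_bucket]);
    } else {
        uint32_t lo = fwd_bucket < rev_bucket ? fwd_bucket : rev_bucket;
        uint32_t hi = fwd_bucket < rev_bucket ? rev_bucket : fwd_bucket;

        pthread_mutex_lock(&bucket_locks[lo]);
        pthread_mutex_lock(&bucket_locks[hi]);
    }
}

static void
ct_unlock_bucket_pair(uint32_t fwd_bucket, uint32_t rev_bucket)
{
    pthread_mutex_unlock(&bucket_locks[fwd_bucket]);
    if (fwd_bucket != rev_bucket) {
        pthread_mutex_unlock(&bucket_locks[rev_bucket]);
    }
}
```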

Further details about locking may be found in patch #4 description.

There is a known issue, mostly related to the zone limit. Keeping the
stale entries leads to a mismatch between the accounted number of
per-zone connections and the actual number of valid connections.
This means that with a small zone limit we may wait a whole
"next_wakeup" interval before the entries get cleaned up (assuming we
have candidates for removal). This could be avoided, e.g. by
triggering a per-zone cleanup when a threshold of connections has been
reached (which may require a different way to account the per-zone
connections). Other approaches are possible (e.g. cleaning the zone
from the packet path during insertion), but both require additional
handling.
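One way the threshold idea could look (entirely hypothetical, not code
from the series; the struct, function name, and the 90% threshold are
all illustrative choices): keep the accounted count next to the
configured limit and kick an early cleanup once the count crosses a
fraction of the limit, instead of waiting for the periodic sweep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone accounting.  'count' may include stale
 * entries, which is exactly the mismatch described above. */
struct zone_limit {
    unsigned int limit;  /* configured per-zone limit; 0 = unlimited */
    unsigned int count;  /* currently accounted connections */
};

/* Request an early sweep once the accounted count reaches 90% of the
 * limit (an arbitrary illustrative threshold), rather than waiting a
 * whole "next_wakeup" interval. */
static bool
zone_cleanup_needed(const struct zone_limit *zl)
{
    return zl->limit && zl->count >= zl->limit - zl->limit / 10;
}
```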

Expiration lists have been removed for the sake of evaluation, but
they could be reintroduced and duplicated among the buckets. This
would distribute the contention (effectively reducing the number of
connections per list) that currently affects write access to a single
data structure on every connection update.

Patch #5 reintegrates the command below with multiple buckets:

ovs-appctl dpctl/ct-bkts

A couple of minor patches have been kept out of the series for the
time being, as they strongly depend on patch #4.

Gaetan Rivet (2):
      conntrack: Use a cmap to store zone limits
      conntrack-tp: Use a cmap to store timeout policies

Paolo Valerio (2):
      conntrack: Split single cmap to multiple buckets.
      conntrack: Make ovs-appctl dpctl/ct-bkts work with multiple buckets

Peng He (1):
      conntrack: Replaces nat_conn introducing key directionality.


 lib/conntrack-private.h |  58 ++-
 lib/conntrack-tp.c      | 102 ++---
 lib/conntrack.c         | 952 ++++++++++++++++++++++------------------
 lib/conntrack.h         |   4 +-
 lib/dpif-netdev.c       |   5 +-
 tests/system-traffic.at |   5 +-
 6 files changed, 608 insertions(+), 518 deletions(-)