diff mbox series

[ovs-dev] tc: fix crash on EAGAIN return from recvmsg on netlink socket.

Message ID 20230510151238.1222882-1-frode.nordahl@canonical.com
State Changes Requested
Delegated to: Simon Horman
Headers show
Series [ovs-dev] tc: fix crash on EAGAIN return from recvmsg on netlink socket. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test fail github build: failed
ovsrobot/intel-ovs-compilation fail test: fail

Commit Message

Frode Nordahl May 10, 2023, 3:12 p.m. UTC
The tc module combines the use of the `tc_transact` helper
function for communication with the in-kernel tc infrastructure
with assertions on the reply data by `ofpbuf_at_assert` on the
received data prior to further processing.

`tc_transact` in turn calls `nl_transact`, which via
`nl_transact_multiple__` ultimately calls and handles return
value from `recvmsg`.  On error a check for EAGAIN is performed
and a consequence of this condition is effectively to provide a
non-error (0) result and an empty reply buffer.

Before this change, the `tc_transact` and, consumers of it, were
unaware of this condition.  The use of assertions on the reply
buffer can as such lead to a fatal crash of OVS.

To be fair, the behavior of `nl_transact` when handling an EAGAIN
return is currently not documented, so this change also adds that
documentation.

For the record, the symptom of the crash is this in the log:
EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()

And an excerpt of the backtrace looks like this:
0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384

Reported-At: https://launchpad.net/bugs/2018500
Fixes f98e418fbdb6 (tc: Add tc flower functions)
Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
---
 lib/dpif-netlink.c   | 4 +++-
 lib/netlink-socket.c | 5 +++++
 lib/tc.c             | 8 ++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

Comments

Frode Nordahl May 10, 2023, 4:06 p.m. UTC | #1
Hello again,

While we have confirmed this patch fixes a real issue in production
and it passes the regular functional tests, I was apparently a bit
quick out the door with this patch. It appears to break the system
tests due to some unexpected WRN level logging:
../../tests/system-traffic.at:23: check_logs
--- /dev/null    2023-05-10 17:48:04.088000000 +0200
+++ /tmp/autopkgtest.7DpozZ/build.MGV/src/_debian/tests/system-kmod-testsuite.dir/at-groups/1/stdout
   2023-05-10 18:02:04.040410777 +0200
@@ -0,0 +1 @@
+2023-05-10T16:02:01.887Z|00034|netdev_linux|WARN|query br0 qdisc
failed (Resource temporarily unavailable)

I will check and send a v2 asap, apologies for the inconvenience.
Brian Haley May 10, 2023, 4:32 p.m. UTC | #2
Hi Frode,

Thanks for fixing this, just had one nit below since you're pushing out 
a new PS.

On 5/10/23 11:12 AM, Frode Nordahl wrote:
> The tc module combines the use of the `tc_transact` helper
> function for communication with the in-kernel tc infrastructure
> with assertions on the reply data by `ofpbuf_at_assert` on the
> received data prior to further processing.

<snip>

> --- a/lib/netlink-socket.c
> +++ b/lib/netlink-socket.c
> @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
>    *
>    *      2. Resending the request causes it to be re-executed, so the request
>    *         needs to be idempotent.
> + *
> + *      3. In the event that the kernel is to busy to handle the request to
> + *         receive the response (i.e. EAGAIN), this function will still return
> + *         0.  The caller can check for this condition by checking for a zero
> + *         size of the 'replyp' ofpbuf buffer.

s/to busy/too busy

-Brian
diff mbox series

Patch

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 60bd39643..75537b2f8 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2340,7 +2340,9 @@  parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
             }
             netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
         }
-        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
+        level = (err == EAGAIN ||
+                 err == ENOSPC ||
+                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
         VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
                 ovs_strerror(err),
                 (oor_netdev ? oor_netdev->name : dev->name));
diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
index 80da20d9f..e40caafe3 100644
--- a/lib/netlink-socket.c
+++ b/lib/netlink-socket.c
@@ -1798,6 +1798,11 @@  nl_pool_release(struct nl_sock *sock)
  *
  *      2. Resending the request causes it to be re-executed, so the request
  *         needs to be idempotent.
+ *
+ *      3. In the event that the kernel is to busy to handle the request to
+ *         receive the response (i.e. EAGAIN), this function will still return
+ *         0.  The caller can check for this condition by checking for a zero
+ *         size of the 'replyp' ofpbuf buffer.
  */
 int
 nl_transact(int protocol, const struct ofpbuf *request,
diff --git a/lib/tc.c b/lib/tc.c
index 5c32c6f97..8e229f93c 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -237,11 +237,19 @@  static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
     }
 }
 
+/* Please acquaint yourself with the documentation of the `nl_transact`
+ * function in the netlink-sock module before making use of this function.
+ */
 int
 tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
 {
     int error = nl_transact(NETLINK_ROUTE, request, replyp);
     ofpbuf_uninit(request);
+
+    if (!error && replyp && !(*replyp)->size) {
+        return EAGAIN;
+    }
+
     return error;
 }