Message ID | 20230510151238.1222882-1-frode.nordahl@canonical.com |
---|---|
State | Changes Requested |
Delegated to: | Simon Horman |
Headers | show |
Series | [ovs-dev] tc: fix crash on EAGAIN return from recvmsg on netlink socket. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | fail | github build: failed |
ovsrobot/intel-ovs-compilation | fail | test: fail |
Hello again, While we have confirmed this patch fixes a real issue in production and it passes the regular functional tests, I was apparently a bit quick out the door with this patch. It appears to break the system tests due to some unexpected WRN level logging: ../../tests/system-traffic.at:23: check_logs --- /dev/null 2023-05-10 17:48:04.088000000 +0200 +++ /tmp/autopkgtest.7DpozZ/build.MGV/src/_debian/tests/system-kmod-testsuite.dir/at-groups/1/stdout 2023-05-10 18:02:04.040410777 +0200 @@ -0,0 +1 @@ +2023-05-10T16:02:01.887Z|00034|netdev_linux|WARN|query br0 qdisc failed (Resource temporarily unavailable) I will check and send a v2 asap, apologies for the inconvenience.
Hi Frode, Thanks for fixing this, just had one nit below since you're pushing out a new PS. On 5/10/23 11:12 AM, Frode Nordahl wrote: > The tc module combines the use of the `tc_transact` helper > function for communication with the in-kernel tc infrastructure > with assertions on the reply data by `ofpbuf_at_assert` on the > received data prior to further processing. <snip> > --- a/lib/netlink-socket.c > +++ b/lib/netlink-socket.c > @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock) > * > * 2. Resending the request causes it to be re-executed, so the request > * needs to be idempotent. > + * > + * 3. In the event that the kernel is to busy to handle the request to > + * receive the response (i.e. EAGAIN), this function will still return > + * 0. The caller can check for this condition by checking for a zero > + * size of the 'replyp' ofpbuf buffer. s/to busy/too busy -Brian
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c index 60bd39643..75537b2f8 100644 --- a/lib/dpif-netlink.c +++ b/lib/dpif-netlink.c @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put) } netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true); } - level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR; + level = (err == EAGAIN || + err == ENOSPC || + err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR; VLOG_RL(&rl, level, "failed to offload flow: %s: %s", ovs_strerror(err), (oor_netdev ? oor_netdev->name : dev->name)); diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c index 80da20d9f..e40caafe3 100644 --- a/lib/netlink-socket.c +++ b/lib/netlink-socket.c @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock) * * 2. Resending the request causes it to be re-executed, so the request * needs to be idempotent. + * + * 3. In the event that the kernel is to busy to handle the request to + * receive the response (i.e. EAGAIN), this function will still return + * 0. The caller can check for this condition by checking for a zero + * size of the 'replyp' ofpbuf buffer. */ int nl_transact(int protocol, const struct ofpbuf *request, diff --git a/lib/tc.c b/lib/tc.c index 5c32c6f97..8e229f93c 100644 --- a/lib/tc.c +++ b/lib/tc.c @@ -237,11 +237,19 @@ static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type, } } +/* Please acquaint yourself with the documentation of the `nl_transact` + * function in the netlink-sock module before making use of this function. + */ int tc_transact(struct ofpbuf *request, struct ofpbuf **replyp) { int error = nl_transact(NETLINK_ROUTE, request, replyp); ofpbuf_uninit(request); + + if (!error && replyp && !(*replyp)->size) { + return EAGAIN; + } + return error; }
The tc module combines the use of the `tc_transact` helper function for communication with the in-kernel tc infrastructure with assertions on the reply data by `ofpbuf_at_assert` on the received data prior to further processing. `tc_transact` in turn calls `nl_transact`, which via `nl_transact_multiple__` ultimately calls and handles return value from `recvmsg`. On error a check for EAGAIN is performed and a consequence of this condition is effectively to provide a non-error (0) result and an empty reply buffer. Before this change, the `tc_transact` and, consumers of it, were unaware of this condition. The use of assertions on the reply buffer can as such lead to a fatal crash of OVS. To be fair, the behavior of `nl_transact` when handling an EAGAIN return is currently not documented, so this change also adds that documentation. For the record, the symptom of the crash is this in the log: EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert() And an excerpt of the backtrace looks like this: 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194 tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>, ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>, match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257 parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297 try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384 Reported-At: https://launchpad.net/bugs/2018500 Fixes f98e418fbdb6 (tc: Add tc flower functions) Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.) Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com> --- lib/dpif-netlink.c | 4 +++- lib/netlink-socket.c | 5 +++++ lib/tc.c | 8 ++++++++ 3 files changed, 16 insertions(+), 1 deletion(-)