
[ovs-dev,v4] tc: fix crash on EAGAIN return from recvmsg on netlink socket.

Message ID 20230523103921.3505877-1-frode.nordahl@canonical.com
State Changes Requested
Series [ovs-dev,v4] tc: fix crash on EAGAIN return from recvmsg on netlink socket.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Frode Nordahl May 23, 2023, 10:39 a.m. UTC
The tc module combines use of the `tc_transact` helper
function for communication with the in-kernel tc infrastructure
with assertions, via `ofpbuf_at_assert`, on the reply data
prior to further processing.

`tc_transact` in turn calls `nl_transact`, which via
`nl_transact_multiple__` ultimately calls `recvmsg` and handles
its return value.  On error, a check for EAGAIN is performed,
and the effect of this condition is to provide a non-error (0)
result and an empty reply buffer.

Before this change, `tc_transact` and its consumers were
unaware of this condition.  The use of assertions on the reply
buffer can as such lead to a fatal crash of OVS.

To be fair, the behavior of `nl_transact` when handling an EAGAIN
return is currently not documented, so this change also adds that
documentation.

While fixing the problem, I found potential problems
with the one-time initialization functions in the netdev-offload-tc
module.  Make use of the information now available about an EAGAIN
condition to retry one-time initialization, and resort to logging
a warning if probing of tc features fails due to temporary
conditions such as resource depletion.

For the record, the symptom of the crash is this in the log:
EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()

And an excerpt of the backtrace looks like this:
0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384

Reported-At: https://launchpad.net/bugs/2018500
Fixes: f98e418fbdb6 ("tc: Add tc flower functions")
Fixes: 407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")
Acked-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
---
 lib/dpif.c              | 12 +++++--
 lib/netdev-linux.c      |  7 ++--
 lib/netdev-offload-tc.c | 73 +++++++++++++++++++++++++++++++++++++----
 lib/netlink-socket.c    |  5 +++
 lib/tc.c                | 26 +++++++++++++++
 5 files changed, 109 insertions(+), 14 deletions(-)

Comments

Ilya Maximets May 24, 2023, 7:31 p.m. UTC | #1
On 5/23/23 12:39, Frode Nordahl wrote:

<snip>

>  int
>  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
>  {
>      int error = nl_transact(NETLINK_ROUTE, request, replyp);
>      ofpbuf_uninit(request);
> +
> +    if (!error && replyp && !(*replyp)->size) {

Not a full review, but shouldn't we also check that *replyp != NULL
here?  This can happen, for example, if nl_pool_alloc() fails.

> +        COVERAGE_INC(tc_transact_eagain);
> +        /* We replicate the behavior of `nl_transact` for error conditions and
> +         * free any allocations before setting the 'replyp' buffer to NULL. */
> +        ofpbuf_delete(*replyp);
> +        *replyp = NULL;
> +        return EAGAIN;
> +    }
> +
>      return error;
>  }
>  

Best regards, Ilya Maximets.
Frode Nordahl May 24, 2023, 9:41 p.m. UTC | #2
On Wed, 24 May 2023, 21:31 Ilya Maximets <i.maximets@ovn.org> wrote:

> On 5/23/23 12:39, Frode Nordahl wrote:
>
> <snip>
>
> >  int
> >  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
> >  {
> >      int error = nl_transact(NETLINK_ROUTE, request, replyp);
> >      ofpbuf_uninit(request);
> > +
> > +    if (!error && replyp && !(*replyp)->size) {
>
> Not a full review, but shouldn't we also check that *replyp != NULL
> here?  This can happen, for example, if nl_pool_alloc() fails.
>

In the event of `nl_pool_alloc()` failing, would not `error` also have a
value?

In any case, an extra check before dereferencing the pointer probably
would not hurt; thank you for pointing it out.

--
Frode Nordahl

> +        COVERAGE_INC(tc_transact_eagain);
> > +        /* We replicate the behavior of `nl_transact` for error
> conditions and
> > +         * free any allocations before setting the 'replyp' buffer to
> NULL. */
> > +        ofpbuf_delete(*replyp);
> > +        *replyp = NULL;
> > +        return EAGAIN;
> > +    }
> > +
> >      return error;
> >  }
> >
>
> Best regards, Ilya Maximets.
>
Eelco Chaudron May 26, 2023, 1:37 p.m. UTC | #3
On 23 May 2023, at 12:39, Frode Nordahl wrote:

> The tc module combines the use of the `tc_transact` helper
> function for communication with the in-kernel tc infrastructure
> with assertions on the reply data by `ofpbuf_at_assert` on the
> received data prior to further processing.
>
> `tc_transact` in turn calls `nl_transact`, which via
> `nl_transact_multiple__` ultimately calls and handles return
> value from `recvmsg`.  On error a check for EAGAIN is performed
> and a consequence of this condition is effectively to provide a
> non-error (0) result and an empty reply buffer.
>
> Before this change, the `tc_transact` and, consumers of it, were
> unaware of this condition.  The use of assertions on the reply
> buffer can as such lead to a fatal crash of OVS.
>
> To be fair, the behavior of `nl_transact` when handling an EAGAIN
> return is currently not documented, so this change also adds that
> documentation.
>
> While fixing the problem, it led me to find potential problems
> with the one-time initialization functions in the netdev-offload-tc
> module.  Make use of the information now available about an EAGAIN
> condition to retry one-time initialization, and resort to logging
> a warning if probing of tc features fails due to temporary
> situations such as resource depletion.
>
> For the record, the symptom of the crash is this in the log:
> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
>
> And an excerpt of the backtrace looks like this:
> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
>
> Reported-At: https://launchpad.net/bugs/2018500
> Fixes: f98e418fbdb6 ("tc: Add tc flower functions")
> Fixes: 407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")
> Acked-by: Roi Dayan <roid@nvidia.com>
> Reviewed-by: Simon Horman <simon.horman@corigine.com>
> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>

Thanks, Frode, for the v4; it looks good to me. If you send out a v5 with only Ilya's suggested change, you can keep my ack.

Acked-by: Eelco Chaudron <echaudro@redhat.com>
Ilya Maximets May 26, 2023, 6:43 p.m. UTC | #4
On 5/23/23 12:39, Frode Nordahl wrote:
> The tc module combines the use of the `tc_transact` helper
> function for communication with the in-kernel tc infrastructure
> with assertions on the reply data by `ofpbuf_at_assert` on the
> received data prior to further processing.
> 
> `tc_transact` in turn calls `nl_transact`, which via
> `nl_transact_multiple__` ultimately calls and handles return
> value from `recvmsg`.  On error a check for EAGAIN is performed
> and a consequence of this condition is effectively to provide a
> non-error (0) result and an empty reply buffer.

Hi, Frode, others.

I took a closer look at the patch and the code in the netlink-socket module.
IIUC, the EAGAIN here is not a result of the operation that we're requesting;
it's just an EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
trying to read.  The reply to that transaction will arrive eventually, and
we will interpret it later as a reply to a different netlink transaction.

The issue appears to be introduced in the following commit:
  407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")

And it was a performance optimization introduced as part of the set:
  https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html

The problem is that we can't tell apart a socket EAGAIN when there is
nothing to wait for (the requests didn't need a reply) from an EAGAIN when
the kernel just hasn't managed to reply yet.

The real solution would be to revert commit 407556ac6c90 or be a bit
smarter and wait for reply only on requests that specify a reply buffer.

Or am I missing something?

Best regards, Ilya Maximets.
Ilya Maximets May 26, 2023, 6:52 p.m. UTC | #5
On 5/26/23 20:43, Ilya Maximets wrote:
> On 5/23/23 12:39, Frode Nordahl wrote:
>> The tc module combines the use of the `tc_transact` helper
>> function for communication with the in-kernel tc infrastructure
>> with assertions on the reply data by `ofpbuf_at_assert` on the
>> received data prior to further processing.
>>
>> `tc_transact` in turn calls `nl_transact`, which via
>> `nl_transact_multiple__` ultimately calls and handles return
>> value from `recvmsg`.  On error a check for EAGAIN is performed
>> and a consequence of this condition is effectively to provide a
>> non-error (0) result and an empty reply buffer.
> 
> Hi, Frode, others.
> 
> I took a closer look at the patch and the code in the netlink-socket.
> IIUC, the EAGAIN here is not a result of operation that we're requesting,
> it's just a EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
> trying to read.  The reply to that transaction will arrive eventually and
> we will interpret it later as a reply to a different netlink transaction.
> 
> The issue appears to be introduced in the following commit:
>   407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")
> 
> And it was a performance optimization introduced as part of the set:
>   https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html
> 
> The problem is that we can't tell apart socket EAGAIN if there is nothing
> to wait (requests didn't need a reply) and EAGAIN if kernel just didn't
> manage to reply yet.

It's still strange though that the reply is delayed.  Typically netlink
replies are formed right in the request handler.  Did you manage to
figure out what is causing this in tc specifically?  It wasn't an issue
for OVS and other common families for many years.

> 
> The real solution would be to revert commit 407556ac6c90 or be a bit
> smarter and wait for reply only on requests that specify a reply buffer.
> 
> Or am I missing something?
> 
> Best regards, Ilya Maximets.
Eelco Chaudron May 26, 2023, 7:19 p.m. UTC | #6
Sent from my phone

> On 26 May 2023 at 20:52, Ilya Maximets <i.maximets@ovn.org> wrote:
> 
> On 5/26/23 20:43, Ilya Maximets wrote:
>>> On 5/23/23 12:39, Frode Nordahl wrote:
>>> The tc module combines the use of the `tc_transact` helper
>>> function for communication with the in-kernel tc infrastructure
>>> with assertions on the reply data by `ofpbuf_at_assert` on the
>>> received data prior to further processing.
>>> 
>>> `tc_transact` in turn calls `nl_transact`, which via
>>> `nl_transact_multiple__` ultimately calls and handles return
>>> value from `recvmsg`.  On error a check for EAGAIN is performed
>>> and a consequence of this condition is effectively to provide a
>>> non-error (0) result and an empty reply buffer.
>> 
>> Hi, Frode, others.
>> 
>> I took a closer look at the patch and the code in the netlink-socket.
>> IIUC, the EAGAIN here is not a result of operation that we're requesting,
>> it's just a EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
>> trying to read.  The reply to that transaction will arrive eventually and
>> we will interpret it later as a reply to a different netlink transaction.
>> 
>> The issue appears to be introduced in the following commit:
>>  407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")
>> 
>> And it was a performance optimization introduced as part of the set:
>>  https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html
>> 
>> The problem is that we can't tell apart socket EAGAIN if there is nothing
>> to wait (requests didn't need a reply) and EAGAIN if kernel just didn't
>> manage to reply yet.
> 
> It's still strange though that reply is delayed.  Typically netlink
> replies are formed right in the request handler.  Did you manage to
> figure out what is causing this in tc specifically?  It wasn't an issue
> for an OVS and other common families for many years.
> 

Replying from my phone, so I did not look at any code.  However, when I looked at the code during the review, I noticed that the receive loop ignores earlier (not yet processed) replies based on the transaction ID.  So I do not think it should be an issue of messages getting out of sync.

I assumed the delay happens when we request something from a hardware offload driver which replies asynchronously due to it being busy.

Maybe Frode can get some perf traces to confirm this. 

//Eelco
>> 
>> The real solution would be to revert commit 407556ac6c90 or be a bit
>> smarter and wait for reply only on requests that specify a reply buffer.
>> 
>> Or am I missing something?
>> 
>> Best regards, Ilya Maximets.
>
Ilya Maximets May 26, 2023, 8:02 p.m. UTC | #7
On 5/26/23 21:19, Eelco Chaudron wrote:
> 
> 
> Send from my phone
> 
>> On 26 May 2023 at 20:52, Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 5/26/23 20:43, Ilya Maximets wrote:
>>> On 5/23/23 12:39, Frode Nordahl wrote:
>>>> The tc module combines the use of the `tc_transact` helper
>>>> function for communication with the in-kernel tc infrastructure
>>>> with assertions on the reply data by `ofpbuf_at_assert` on the
>>>> received data prior to further processing.
>>>>
>>>> `tc_transact` in turn calls `nl_transact`, which via
>>>> `nl_transact_multiple__` ultimately calls and handles return
>>>> value from `recvmsg`.  On error a check for EAGAIN is performed
>>>> and a consequence of this condition is effectively to provide a
>>>> non-error (0) result and an empty reply buffer.
>>>
>>> Hi, Frode, others.
>>>
>>> I took a closer look at the patch and the code in the netlink-socket.
>>> IIUC, the EAGAIN here is not a result of operation that we're requesting,
>>> it's just a EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
>>> trying to read.  The reply to that transaction will arrive eventually and
>>> we will interpret it later as a reply to a different netlink transaction.
>>>
>>> The issue appears to be introduced in the following commit:
>>>  407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")
>>>
>>> And it was a performance optimization introduced as part of the set:
>>>  https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html
>>>
>>> The problem is that we can't tell apart socket EAGAIN if there is nothing
>>> to wait (requests didn't need a reply) and EAGAIN if kernel just didn't
>>> manage to reply yet.
>>
>> It's still strange though that reply is delayed.  Typically netlink
>> replies are formed right in the request handler.  Did you manage to
>> figure out what is causing this in tc specifically?  It wasn't an issue
>> for an OVS and other common families for many years.
>>
> 
> Replying from my phone so I did not look at any code. However when I looked at the code during the review I noticed it will ignore earlier (none processed messages) replies based in the receive loop on the transaction ID. So I do not think it should be an issue of messages getting out of sync.

That makes sense, OK.

> 
> I assumed the delay happens when we request something to a hardware offload driver which is replying async due to it being busy.

I'm not sure it is actually possible for netlink replies to be delivered
asynchronously.  Looking at the kernel code, by the time the original
request is processed, the reply should already be sent.

A simpler explanation would be that it just doesn't reply for some reason
to a request that needs a reply, e.g. some incorrect error handling in the
kernel that leaves the request unanswered.

One potential issue I see in the tc_new_tfilter() implementation is that
if tfilter_notify() fails for any reason, the function will still
return 0, because tc_new_tfilter() doesn't check the result.  And no reply
will be generated, because the return value is zero (success) and
rtnetlink_send() was never called due to the error in tfilter_notify().

I'm not sure if that's what happening here, but it is one option how we
end up with a successful netlink transaction with NLM_F_ECHO set, but no
reply on the socket.

A potential fix would be (not tested):

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 2621550bfddc..9c334dbf9c5f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2311,7 +2311,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
 			      flags, extack);
 	if (err == 0) {
-		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
+		err = tfilter_notify(net, skb, n, tp, block, q, parent, fh,
 			       RTM_NEWTFILTER, false, rtnl_held, extack);
 		tfilter_put(tp, fh);
 		/* q pointer is NULL for shared blocks */
---

A solution on the OVS side might be to fail transactions that requested a
reply but didn't get one.  Async replies sound like something that should
not be, and likely is not, allowed.

Best regards, Ilya Maximets.

> 
> Maybe Frode can get some perf traces to confirm this. 
> 
> //Eelco
>>>
>>> The real solution would be to revert commit 407556ac6c90 or be a bit
>>> smarter and wait for reply only on requests that specify a reply buffer.
>>>
>>> Or am I missing something?
>>>
>>> Best regards, Ilya Maximets.
>>
Ilya Maximets May 26, 2023, 8:07 p.m. UTC | #8
CC: Marcelo.

On 5/26/23 22:02, Ilya Maximets wrote:
> On 5/26/23 21:19, Eelco Chaudron wrote:
>>
>>
>> Send from my phone
>>
>>> On 26 May 2023 at 20:52, Ilya Maximets <i.maximets@ovn.org> wrote:
>>>
>>> On 5/26/23 20:43, Ilya Maximets wrote:
>>>> On 5/23/23 12:39, Frode Nordahl wrote:
>>>>> The tc module combines the use of the `tc_transact` helper
>>>>> function for communication with the in-kernel tc infrastructure
>>>>> with assertions on the reply data by `ofpbuf_at_assert` on the
>>>>> received data prior to further processing.
>>>>>
>>>>> `tc_transact` in turn calls `nl_transact`, which via
>>>>> `nl_transact_multiple__` ultimately calls and handles return
>>>>> value from `recvmsg`.  On error a check for EAGAIN is performed
>>>>> and a consequence of this condition is effectively to provide a
>>>>> non-error (0) result and an empty reply buffer.
>>>>
>>>> Hi, Frode, others.
>>>>
>>>> I took a closer look at the patch and the code in the netlink-socket.
>>>> IIUC, the EAGAIN here is not a result of operation that we're requesting,
>>>> it's just a EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
>>>> trying to read.  The reply to that transaction will arrive eventually and
>>>> we will interpret it later as a reply to a different netlink transaction.
>>>>
>>>> The issue appears to be introduced in the following commit:
>>>>  407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")
>>>>
>>>> And it was a performance optimization introduced as part of the set:
>>>>  https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html
>>>>
>>>> The problem is that we can't tell apart socket EAGAIN if there is nothing
>>>> to wait (requests didn't need a reply) and EAGAIN if kernel just didn't
>>>> manage to reply yet.
>>>
>>> It's still strange though that reply is delayed.  Typically netlink
>>> replies are formed right in the request handler.  Did you manage to
>>> figure out what is causing this in tc specifically?  It wasn't an issue
>>> for an OVS and other common families for many years.
>>>
>>
>> Replying from my phone so I did not look at any code. However when I looked at the code during the review I noticed it will ignore earlier (none processed messages) replies based in the receive loop on the transaction ID. So I do not think it should be an issue of messages getting out of sync.
> 
> That makes sense, OK.
> 
>>
>> I assumed the delay happens when we request something to a hardware offload driver which is replying async due to it being busy.
> 
> I'm not sure it is actually possible for netlink replies to be delivered
> asynchronously.  Looking at the kernel code, by the time the original
> request is processed, the reply should already be sent.
> 
> A simpler explanation, would be that it just doesn't reply for some reason
> to a request that needs a reply. e.g. some incorrect error handling in the
> kernel that leaves request unanswered.
> 
> One potential issue I see in the tc_new_tfilter() implementation is that
> if tfilter_notify() will fail for any reason, the function will still
> return 0, because tc_new_tfilter() doesn't check the result.  And no reply
> will be generated, because the return value is zero (success) and the
> rtnetlink_send() was never called due to an error in the tfilter_notify().
> 
> I'm not sure if that's what happening here, but it is one option how we
> end up with a successful netlink transaction with NLM_F_ECHO set, but no
> reply on the socket.
> 
> A potential fix would be (not tested):
> 
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2621550bfddc..9c334dbf9c5f 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -2311,7 +2311,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>  	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
>  			      flags, extack);
>  	if (err == 0) {
> -		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> +		err = tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>  			       RTM_NEWTFILTER, false, rtnl_held, extack);
>  		tfilter_put(tp, fh);
>  		/* q pointer is NULL for shared blocks */
> ---
> 
> Solution on the OVS side might be to fail transactions that requested a
> reply, but didn't get one.  Async replies sound like something that should
> not be allowed and likely is not allowed.
> 
> Best regards, Ilya Maximets.
> 
>>
>> Maybe Frode can get some perf traces to confirm this. 
>>
>> //Eelco
>>>>
>>>> The real solution would be to revert commit 407556ac6c90 or be a bit
>>>> smarter and wait for reply only on requests that specify a reply buffer.
>>>>
>>>> Or am I missing something?
>>>>
>>>> Best regards, Ilya Maximets.
>>>
>
Ilya Maximets May 26, 2023, 8:31 p.m. UTC | #9
On 5/26/23 22:07, Ilya Maximets wrote:
> CC: Marcelo.
> 
> On 5/26/23 22:02, Ilya Maximets wrote:
>> On 5/26/23 21:19, Eelco Chaudron wrote:
>>>
>>>
>>> Send from my phone
>>>
>>>> On 26 May 2023 at 20:52, Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> On 5/26/23 20:43, Ilya Maximets wrote:
>>>>> On 5/23/23 12:39, Frode Nordahl wrote:
>>>>>> The tc module combines the use of the `tc_transact` helper
>>>>>> function for communication with the in-kernel tc infrastructure
>>>>>> with assertions on the reply data by `ofpbuf_at_assert` on the
>>>>>> received data prior to further processing.
>>>>>>
>>>>>> `tc_transact` in turn calls `nl_transact`, which via
>>>>>> `nl_transact_multiple__` ultimately calls and handles return
>>>>>> value from `recvmsg`.  On error a check for EAGAIN is performed
>>>>>> and a consequence of this condition is effectively to provide a
>>>>>> non-error (0) result and an empty reply buffer.
>>>>>
>>>>> Hi, Frode, others.
>>>>>
>>>>> I took a closer look at the patch and the code in the netlink-socket.
>>>>> IIUC, the EAGAIN here is not a result of operation that we're requesting,
>>>>> it's just a EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
>>>>> trying to read.  The reply to that transaction will arrive eventually and
>>>>> we will interpret it later as a reply to a different netlink transaction.
>>>>>
>>>>> The issue appears to be introduced in the following commit:
>>>>>  407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")
>>>>>
>>>>> And it was a performance optimization introduced as part of the set:
>>>>>  https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html
>>>>>
>>>>> The problem is that we can't tell apart socket EAGAIN if there is nothing
>>>>> to wait (requests didn't need a reply) and EAGAIN if kernel just didn't
>>>>> manage to reply yet.
>>>>
>>>> It's still strange though that reply is delayed.  Typically netlink
>>>> replies are formed right in the request handler.  Did you manage to
>>>> figure out what is causing this in tc specifically?  It wasn't an issue
>>>> for an OVS and other common families for many years.
>>>>
>>>
>>> Replying from my phone so I did not look at any code. However when I looked at the code during the review I noticed it will ignore earlier (none processed messages) replies based in the receive loop on the transaction ID. So I do not think it should be an issue of messages getting out of sync.
>>
>> That makes sense, OK.
>>
>>>
>>> I assumed the delay happens when we request something to a hardware offload driver which is replying async due to it being busy.
>>
>> I'm not sure it is actually possible for netlink replies to be delivered
>> asynchronously.  Looking at the kernel code, by the time the original
>> request is processed, the reply should already be sent.
>>
>> A simpler explanation, would be that it just doesn't reply for some reason
>> to a request that needs a reply. e.g. some incorrect error handling in the
>> kernel that leaves request unanswered.
>>
>> One potential issue I see in the tc_new_tfilter() implementation is that
>> if tfilter_notify() will fail for any reason, the function will still
>> return 0, because tc_new_tfilter() doesn't check the result.  And no reply
>> will be generated, because the return value is zero (success) and the
>> rtnetlink_send() was never called due to an error in the tfilter_notify().
>>
>> I'm not sure if that's what happening here, but it is one option how we
>> end up with a successful netlink transaction with NLM_F_ECHO set, but no
>> reply on the socket.
>>
>> A potential fix would be (not tested):
>>
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 2621550bfddc..9c334dbf9c5f 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -2311,7 +2311,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>>  	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
>>  			      flags, extack);
>>  	if (err == 0) {
>> -		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>> +		err = tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>>  			       RTM_NEWTFILTER, false, rtnl_held, extack);
>>  		tfilter_put(tp, fh);
>>  		/* q pointer is NULL for shared blocks */
>> ---

^^ This is definitely not a correct solution though, because we have already
changed the filter; we just failed to dump it back.  Some other
solution is needed.  Most likely, it needs to be retried until it succeeds,
or the change() will need to be reverted and a failure reported to
userspace.

>>
>> Solution on the OVS side might be to fail transactions that requested a
>> reply, but didn't get one.  Async replies sound like something that should
>> not be allowed and likely is not allowed.
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>> Maybe Frode can get some perf traces to confirm this. 
>>>
>>> //Eelco
>>>>>
>>>>> The real solution would be to revert commit 407556ac6c90 or be a bit
>>>>> smarter and wait for reply only on requests that specify a reply buffer.
>>>>>
>>>>> Or am I missing something?
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>>
>
Frode Nordahl May 27, 2023, 7:21 a.m. UTC | #10
On Fri, May 26, 2023 at 10:31 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 5/26/23 22:07, Ilya Maximets wrote:
> > CC: Marcelo.
> >
> > On 5/26/23 22:02, Ilya Maximets wrote:
> >> On 5/26/23 21:19, Eelco Chaudron wrote:
> >>>
> >>>
> >>> Send from my phone
> >>>
> >>>> On 26 May 2023 at 20:52, Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>
> >>>> On 5/26/23 20:43, Ilya Maximets wrote:
> >>>>> On 5/23/23 12:39, Frode Nordahl wrote:
> >>>>>> The tc module combines the use of the `tc_transact` helper
> >>>>>> function for communication with the in-kernel tc infrastructure
> >>>>>> with assertions on the reply data by `ofpbuf_at_assert` on the
> >>>>>> received data prior to further processing.
> >>>>>>
> >>>>>> `tc_transact` in turn calls `nl_transact`, which via
> >>>>>> `nl_transact_multiple__` ultimately calls and handles return
> >>>>>> value from `recvmsg`.  On error a check for EAGAIN is performed
> >>>>>> and a consequence of this condition is effectively to provide a
> >>>>>> non-error (0) result and an empty reply buffer.
> >>>>>
> >>>>> Hi, Frode, others.

Hello, Ilya, thank you for weighing in on the issue and thread, much
appreciated.

> >>>>> I took a closer look at the patch and the code in the netlink-socket.
> >>>>> IIUC, the EAGAIN here is not a result of operation that we're requesting,
> >>>>> it's just a EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
> >>>>> trying to read.  The reply to that transaction will arrive eventually and
> >>>>> we will interpret it later as a reply to a different netlink transaction.
> >>>>>
> >>>>> The issue appears to be introduced in the following commit:
> >>>>>  407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message in a transaction.")
> >>>>>
> >>>>> And it was a performance optimization introduced as part of the set:
> >>>>>  https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html
> >>>>>
> >>>>> The problem is that we can't tell apart socket EAGAIN if there is nothing
> >>>>> to wait (requests didn't need a reply) and EAGAIN if kernel just didn't
> >>>>> manage to reply yet.

I did find that change suspicious, and it is listed as one of the commits
fixed by my patch, although I'm only addressing it by documenting its
behavior, as I do not have full visibility of the blast radius for
changing it.

> >>>> It's still strange though that the reply is delayed.  Typically netlink
> >>>> replies are formed right in the request handler.  Did you manage to
> >>>> figure out what is causing this in tc specifically?  It wasn't an issue
> >>>> for OVS and other common families for many years.
> >>>>
> >>>
> >>> Replying from my phone, so I did not look at any code. However, when I looked at the code during the review, I noticed it will ignore earlier (unprocessed) replies in the receive loop based on the transaction ID. So I do not think it should be an issue of messages getting out of sync.
> >>
> >> That makes sense, OK.

Our challenge here is that this crash is happening in a production environment,
and we are currently unable to reproduce it in a lab environment. The
consequence is that production has disabled the feature. So the proposed patch
is, in my mind, a step on the way to figuring out what is actually going on, by
allowing the system to stay up and capture more logs without interruption of
service.

> >>>
> >>> I assumed the delay happens when we request something from a hardware offload driver which is replying asynchronously due to it being busy.
> >>
> >> I'm not sure it is actually possible for netlink replies to be delivered
> >> asynchronously.  Looking at the kernel code, by the time the original
> >> request is processed, the reply should already be sent.
> >>
> >> A simpler explanation would be that it just doesn't reply for some reason
> >> to a request that needs a reply, e.g. some incorrect error handling in the
> >> kernel that leaves the request unanswered.
> >>
> >> One potential issue I see in the tc_new_tfilter() implementation is that
> >> if tfilter_notify() fails for any reason, the function will still
> >> return 0, because tc_new_tfilter() doesn't check the result.  And no reply
> >> will be generated, because the return value is zero (success) and
> >> rtnetlink_send() was never called due to an error in tfilter_notify().
> >>
> >> I'm not sure if that's what's happening here, but it is one way we could
> >> end up with a successful netlink transaction with NLM_F_ECHO set, but no
> >> reply on the socket.
> >>
> >> A potential fix would be (not tested):
> >>
> >> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> >> index 2621550bfddc..9c334dbf9c5f 100644
> >> --- a/net/sched/cls_api.c
> >> +++ b/net/sched/cls_api.c
> >> @@ -2311,7 +2311,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
> >>      err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> >>                            flags, extack);
> >>      if (err == 0) {
> >> -            tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> >> +            err = tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> >>                             RTM_NEWTFILTER, false, rtnl_held, extack);
> >>              tfilter_put(tp, fh);
> >>              /* q pointer is NULL for shared blocks */
> >> ---
>
> ^^ This is definitely not a correct solution though, because we already
> changed the filter, but we just failed to dump it back.  Some other
> solution is needed.  Most likely, it needs to be re-tried until it succeeds.
> Or the change() will need to be reverted and a failure reported to the
> userspace.

Good catch, the cited code does indeed look problematic.

> >>
> >> Solution on the OVS side might be to fail transactions that requested a
> >> reply, but didn't get one.  Async replies sound like something that should
> >> not be allowed and likely is not allowed.

Even if this turns out to be a bug on the kernel side, we ought to protect
ourselves from the condition. Just to understand your proposal: are you
thinking of a partial revert of 407556ac6c90 plus a check for a no/empty
reply in nl_transact?
Ilya Maximets May 27, 2023, 12:51 p.m. UTC | #11
On 5/27/23 09:21, Frode Nordahl wrote:
> [...]
>>>>
>>>> Solution on the OVS side might be to fail transactions that requested a
>>>> reply, but didn't get one.  Async replies sound like something that should
>>>> not be allowed and likely is not allowed.
> 
> Even if this turns out to be a bug on the kernel side, we ought to
> protect ourself
> from the condition. Just to understand your proposal, you are thinking partial
> revert of 407556ac6c90 and add a check for no/empty reply in the nl_transact?
> 

I was thinking about a revert, but in the presence of the kernel bug, if we
block on receive we will just wait forever if the kernel simply doesn't
reply to the ECHO.

The solution from the OVS side is similar to your patch, but with EPROTO instead.
Since EAGAIN is not the kernel's reply, we're not really expected to re-try
the receive.  EAGAIN on a socket just means that the kernel didn't and will
not reply.

Only the caller of tc_transact() knows whether it expected a reply or not.
And if one was expected but didn't arrive, that's a protocol error.

What we can do is to check that reply->size > NLMSG_HDRLEN before pulling
the header, and return EPROTO if it's not.  That's what ofpbuf_try_pull()
can do.  tc_transact() users that also use ofpbuf_at_assert() should be
converted to use ofpbuf_try_pull() instead, so we do not crash if something
like this happens.  See dpif_netlink_vport_from_ofpbuf() for example.

And we should have a loud error/warning message for this condition.

No retries are needed.  And the EAGAIN documentation part is also not needed,
because we're not supposed to retry, AFAICT.

Best regards, Ilya Maximets.
Frode Nordahl May 28, 2023, 9:05 a.m. UTC | #12
On Sat, May 27, 2023 at 14:51, Ilya Maximets <i.maximets@ovn.org> wrote:

> [...]
>
> I was thinking about revert, but in the presence of the kernel bug if we
> block on receive, we will just wait forever, if the kernel simply doesn't
> reply to the ECHO.
>
> Solution from the OVS side is similar to your patch, but with EPROTO
> instead.
> Since EAGAIN is not a kernel's reply, we're not really expected to re-try
> the receive.  EAGAIN on a socket just means that the kernel didn't and will
> not reply.
>

Thank you for the clarification; with the lack of a reproducer, the work so
far has been based on analysis constrained by time to provide some path
forward, so this helps a lot.

> Only caller of the tc_transact() knows if they expected a reply or not.
> And if it was expected, but didn't happen, that's a protocol error.
>

The caller of tc_transact() has indicated that they expect a reply by
providing a replyp; my thinking around doing an early check of the response
in tc_transact() was that we could avoid this whole class of bugs right
there instead of leaving it to each consumer. So I'd like to keep the
zero-length reply check there.

> What we can do is to check that reply->size > NLMSG_HDRLEN before pulling
> the header, and return EPROTO if it's not.  That's what ofpbuf_try_pull()
> can do.  tc_transact() users that also use ofpbuf_at_assert() should be
> converted to use ofpbuf_try_pull() instead, so we do not crash if something
> like this happens.  See dpif_netlink_vport_from_ofpbuf() for example.
>
> And we should have a loud error/warning message for this condition.
>

Ack

> No retries are needed.  And EAGAIN documentation part is also not needed,
> because we're not supposed to retry, AFAICT.
>

Thanks for the direction, I'll work on this and post a v5.

--
Frode Nordahl

Best regards, Ilya Maximets.
>
Ilya Maximets May 29, 2023, 10:59 a.m. UTC | #13
On 5/28/23 11:05, Frode Nordahl wrote:
> [...]
>
> Thank you for the clarification, with the lack of a reproducer the work so far has been based on analysis constrained by time to provide some path forward, so this helps a lot.
> 
>     Only caller of the tc_transact() knows if they expected a reply or not.
>     And if it was expected, but didn't happen, that's a protocol error.
> 
> 
> The caller of tc_transact() has indicated that they expect a reply by providing a replyp, my thinking around doing a early check of the response in tc_transact() was that we could avoid this whole class of bugs right there instead of leaving it to each consumer. So I'd like to keep the zero length reply check there.

I think we still need to replace the ofpbuf_at_assert() with ofpbuf_try_pull()
in all the callers though, since we actually need to treat the kernel as an
unreliable service provider.  This will also cover cases of truncated messages
or any other possible issues.  So, I'm not sure how much benefit an explicit
check for this special case will have.  At the same time, I'm not against
having it.

> 
>     What we can do is to check that reply->size > NLMSG_HDRLEN before pulling
>     the header, and return EPROTO if it's not.  That's what ofpbuf_try_pull()
>     can do.  tc_transact() users that also use ofpbuf_at_assert() should be
>     converted to use ofpbuf_try_pull() instead, so we do not crash if something
>     like this happens.  See dpif_netlink_vport_from_ofpbuf() for example.
> 
>     And we should have a loud error/warning message for this condition.
> 
> 
> Ack
> 
>     No retries are needed.  And EAGAIN documentation part is also not needed,
>     because we're not supposed to retry, AFAICT.
> 
> 
> Thanks for the direction, I'll work on this and post a v5.
> 
> --
> Frode Nordahl 
> 
>     Best regards, Ilya Maximets.
>
Patch

diff --git a/lib/dpif.c b/lib/dpif.c
index 3305401fe..f97a57e7b 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1749,9 +1749,15 @@  flow_message_log_level(int error)
     /* If flows arrive in a batch, userspace may push down multiple
      * unique flow definitions that overlap when wildcards are applied.
      * Kernels that support flow wildcarding will reject these flows as
-     * duplicates (EEXIST), so lower the log level to debug for these
-     * types of messages. */
-    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
+     * duplicates (EEXIST).
+     *
+     * Some subsystems expose temporary error conditions such as EAGAIN return
+     * for operations on non-blocking sockets.  This is done to make the right
+     * decisions during processing.
+     *
+     * If they bubble up here we ought to not log those as a warning, so lower
+     * the log level to debug for these types of messages. */
+    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
 }
 
 static bool
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 36620199e..f5c241474 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -6235,8 +6235,7 @@  tc_query_qdisc(const struct netdev *netdev_)
      * On Linux 2.6.35+ we use the straightforward method because it allows us
      * to handle non-builtin qdiscs without handle 1:0 (e.g. codel).  However,
      * in such a case we get no response at all from the kernel (!) if a
-     * builtin qdisc is in use (which is later caught by "!error &&
-     * !qdisc->size"). */
+     * builtin qdisc is in use (which is later caught by "error == EAGAIN"). */
     tcmsg = netdev_linux_tc_make_request(netdev_, RTM_GETQDISC, NLM_F_ECHO,
                                          &request);
     if (!tcmsg) {
@@ -6247,7 +6246,7 @@  tc_query_qdisc(const struct netdev *netdev_)
 
     /* Figure out what tc class to instantiate. */
     error = tc_transact(&request, &qdisc);
-    if (!error && qdisc->size) {
+    if (!error) {
         const char *kind;
 
         error = tc_parse_qdisc(qdisc, &kind, NULL);
@@ -6262,7 +6261,7 @@  tc_query_qdisc(const struct netdev *netdev_)
                 ops = &tc_ops_other;
             }
         }
-    } else if ((!error && !qdisc->size) || error == ENOENT) {
+    } else if (error == ENOENT || error == EAGAIN) {
         /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
          * set up by some other entity that doesn't have a handle 1:0.  We will
          * assume that it's the system default qdisc. */
diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 4f26dd8cc..c8ca280a7 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -2571,6 +2571,28 @@  netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
     return 0;
 }
 
+/* This macro is for use by one-time initialization functions, where we have
+ * one shot per thread/process to perform a pertinent initialization task that
+ * may return a temporary error (EAGAIN).
+ *
+ * With min/max values of 1/64 we would retry up to 7 times, spending at
+ * most 127 * 1E6 nsec (0.127s) sleeping.
+ */
+#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
+#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
+#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(ERROR, CONDITION, FUNCTION, ...) \
+    for (uint64_t backoff = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                    \
+         backoff <= NETDEV_OFFLOAD_TC_BACKOFF_MAX;                            \
+         backoff <<= 1)                                                       \
+    {                                                                         \
+        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
+        if (CONDITION) {                                                      \
+            xnanosleep(backoff * 1E6);                                        \
+            continue;                                                         \
+        }                                                                     \
+        break;                                                                \
+    }
+
 static void
 probe_multi_mask_per_prio(int ifindex)
 {
@@ -2594,8 +2616,13 @@  probe_multi_mask_per_prio(int ifindex)
     memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
 
     id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id1, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         tc_replace_flower, &id1, &flower);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe for multiple mask "
+                      "support: %s", ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2603,10 +2630,15 @@  probe_multi_mask_per_prio(int ifindex)
     memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
 
     id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id2, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         tc_replace_flower, &id2, &flower);
     tc_del_flower_filter(&id1);
 
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe for multiple mask "
+                      "support: %s", ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2657,8 +2689,13 @@  probe_ct_state_support(int ifindex)
         goto out;
     }
 
-    error = tc_get_flower(&id, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         tc_get_flower, &id, &flower);
     if (error || flower.mask.ct_state != ct_state) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out_del;
     }
 
@@ -2670,10 +2707,16 @@  probe_ct_state_support(int ifindex)
 
     /* Test for reject, ct_state >= MAX */
     ct_state = ~0;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (!error) {
         /* No reject, can't continue probing other flags */
         goto out_del;
+    } else if (error == EAGAIN) {
+        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                  ovs_strerror(error));
+        goto out_del;
     }
 
     tc_del_flower_filter(&id);
@@ -2682,8 +2725,14 @@  probe_ct_state_support(int ifindex)
     memset(&flower, 0, sizeof flower);
     ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
                TCA_FLOWER_KEY_CT_FLAGS_INVALID;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2695,8 +2744,14 @@  probe_ct_state_support(int ifindex)
     ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
                TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
                TCA_FLOWER_KEY_CT_FLAGS_REPLY;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2732,13 +2787,17 @@  probe_tc_block_support(int ifindex)
     memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
 
     id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(error, error == EAGAIN,
+                                         tc_replace_flower, &id, &flower);
 
     tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
 
     if (!error) {
         block_support = true;
         VLOG_INFO("probe tc: block offload is supported.");
+    } else if (error == EAGAIN) {
+        VLOG_WARN("probe tc: unable to probe block offload support: %s",
+                  ovs_strerror(error));
     }
 }
 
diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
index 80da20d9f..dea060fc3 100644
--- a/lib/netlink-socket.c
+++ b/lib/netlink-socket.c
@@ -1798,6 +1798,11 @@  nl_pool_release(struct nl_sock *sock)
  *
  *      2. Resending the request causes it to be re-executed, so the request
  *         needs to be idempotent.
+ *
+ *      3. In the event that the kernel is too busy to handle the request
+ *         or deliver the response (i.e. recvmsg() returns EAGAIN), this
+ *         function will still return 0.  The caller can detect this
+ *         condition by checking for a zero size of the 'replyp' ofpbuf
+ *         buffer.
  */
 int
 nl_transact(int protocol, const struct ofpbuf *request,
diff --git a/lib/tc.c b/lib/tc.c
index 5c32c6f97..0396570ac 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -36,6 +36,7 @@ 
 #include <unistd.h>
 
 #include "byte-order.h"
+#include "coverage.h"
 #include "netlink-socket.h"
 #include "netlink.h"
 #include "openvswitch/ofpbuf.h"
@@ -67,6 +68,8 @@ 
 
 VLOG_DEFINE_THIS_MODULE(tc);
 
+COVERAGE_DEFINE(tc_transact_eagain);
+
 static struct vlog_rate_limit error_rl = VLOG_RATE_LIMIT_INIT(60, 5);
 
 static enum tc_offload_policy tc_policy = TC_POLICY_NONE;
@@ -237,11 +240,34 @@  static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
     }
 }
 
+/* The `tc_transact` function is a wrapper around `nl_transact` with the
+ * addition of:
+ *
+ * 1. The 'request' ofpbuf buffer is freed after `nl_transact` returns,
+ *    regardless of success or failure.
+ *
+ * 2. When a 'replyp' pointer is provided and the kernel was too busy to
+ *    process the request or deliver the response, a positive error with
+ *    the value of EAGAIN is returned instead of an empty reply buffer.
+ *
+ * Please acquaint yourself with the documentation of the `nl_transact`
+ * function in the netlink-socket module before making use of this function.
+ */
 int
 tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
 {
     int error = nl_transact(NETLINK_ROUTE, request, replyp);
     ofpbuf_uninit(request);
+
+    if (!error && replyp && !(*replyp)->size) {
+        COVERAGE_INC(tc_transact_eagain);
+        /* We replicate the behavior of `nl_transact` for error conditions and
+         * free any allocations before setting the 'replyp' buffer to NULL. */
+        ofpbuf_delete(*replyp);
+        *replyp = NULL;
+        return EAGAIN;
+    }
+
     return error;
 }