diff mbox series

[bpf-next,v3,2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress

Message ID 160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower
State Not Applicable
Delegated to: BPF Maintainers
Headers show
Series sockmap/sk_skb program memory acct fixes | expand

Commit Message

John Fastabend Oct. 9, 2020, 6:36 p.m. UTC
When we receive an skb and the ingress skb verdict program returns
SK_PASS we currently set the ingress flag and put it on the workqueue
so it can be turned into a sk_msg and put on the sk_msg ingress queue.
Then finally telling userspace with data_ready hook.

Here we observe that if the workqueue is empty then we can try to
convert into a sk_msg type and call data_ready directly without
bouncing through a workqueue. Its a common pattern to have a recv
verdict program for visibility that always returns SK_PASS. In this
case unless there is an ENOMEM error or we overrun the socket we
can avoid the workqueue completely only using it when we fall back
to error cases caused by memory pressure.

By doing this we eliminate another case where data may be dropped
if errors occur on memory limits in workqueue.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

Comments

Jakub Sitnicki Oct. 12, 2020, 9:03 a.m. UTC | #1
Hey John,

Exiting to see this work :-)

On Fri, Oct 09, 2020 at 08:36 PM CEST, John Fastabend wrote:
> When we receive an skb and the ingress skb verdict program returns
> SK_PASS we currently set the ingress flag and put it on the workqueue
> so it can be turned into a sk_msg and put on the sk_msg ingress queue.
> Then finally telling userspace with data_ready hook.
>
> Here we observe that if the workqueue is empty then we can try to
> convert into a sk_msg type and call data_ready directly without
> bouncing through a workqueue. Its a common pattern to have a recv
> verdict program for visibility that always returns SK_PASS. In this
> case unless there is an ENOMEM error or we overrun the socket we
> can avoid the workqueue completely only using it when we fall back
> to error cases caused by memory pressure.
>
> By doing this we eliminate another case where data may be dropped
> if errors occur on memory limits in workqueue.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/core/skmsg.c |   17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 040ae1d75b65..4b160d97b7f9 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -773,6 +773,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
>  {
>  	struct tcp_skb_cb *tcp;
>  	struct sock *sk_other;
> +	int err = -EIO;
>
>  	switch (verdict) {
>  	case __SK_PASS:
> @@ -784,8 +785,20 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
>
>  		tcp = TCP_SKB_CB(skb);
>  		tcp->bpf.flags |= BPF_F_INGRESS;
> -		skb_queue_tail(&psock->ingress_skb, skb);
> -		schedule_work(&psock->work);
> +
> +		/* If the queue is empty then we can submit directly
> +		 * into the msg queue. If its not empty we have to
> +		 * queue work otherwise we may get OOO data. Otherwise,
> +		 * if sk_psock_skb_ingress errors will be handled by
> +		 * retrying later from workqueue.
> +		 */
> +		if (skb_queue_empty(&psock->ingress_skb)) {
> +			err = sk_psock_skb_ingress(psock, skb);

When going through the workqueue (sk_psock_backlog), we will also check
if socket didn't get detached from the process, that is if
psock->sk->sk_socket != NULL, before queueing into msg queue.

Do we need a similar check here?

> +		}
> +		if (err < 0) {
> +			skb_queue_tail(&psock->ingress_skb, skb);
> +			schedule_work(&psock->work);
> +		}
>  		break;
>  	case __SK_REDIRECT:
>  		sk_psock_skb_redirect(skb);
John Fastabend Oct. 12, 2020, 3:33 p.m. UTC | #2
Jakub Sitnicki wrote:
> Hey John,
> 
> Exiting to see this work :-)
> 
> On Fri, Oct 09, 2020 at 08:36 PM CEST, John Fastabend wrote:
> > When we receive an skb and the ingress skb verdict program returns
> > SK_PASS we currently set the ingress flag and put it on the workqueue
> > so it can be turned into a sk_msg and put on the sk_msg ingress queue.
> > Then finally telling userspace with data_ready hook.
> >
> > Here we observe that if the workqueue is empty then we can try to
> > convert into a sk_msg type and call data_ready directly without
> > bouncing through a workqueue. Its a common pattern to have a recv
> > verdict program for visibility that always returns SK_PASS. In this
> > case unless there is an ENOMEM error or we overrun the socket we
> > can avoid the workqueue completely only using it when we fall back
> > to error cases caused by memory pressure.
> >
> > By doing this we eliminate another case where data may be dropped
> > if errors occur on memory limits in workqueue.
> >
> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> >  net/core/skmsg.c |   17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 040ae1d75b65..4b160d97b7f9 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -773,6 +773,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
> >  {
> >  	struct tcp_skb_cb *tcp;
> >  	struct sock *sk_other;
> > +	int err = -EIO;
> >
> >  	switch (verdict) {
> >  	case __SK_PASS:
> > @@ -784,8 +785,20 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
> >
> >  		tcp = TCP_SKB_CB(skb);
> >  		tcp->bpf.flags |= BPF_F_INGRESS;
> > -		skb_queue_tail(&psock->ingress_skb, skb);
> > -		schedule_work(&psock->work);
> > +
> > +		/* If the queue is empty then we can submit directly
> > +		 * into the msg queue. If its not empty we have to
> > +		 * queue work otherwise we may get OOO data. Otherwise,
> > +		 * if sk_psock_skb_ingress errors will be handled by
> > +		 * retrying later from workqueue.
> > +		 */
> > +		if (skb_queue_empty(&psock->ingress_skb)) {
> > +			err = sk_psock_skb_ingress(psock, skb);
> 
> When going through the workqueue (sk_psock_backlog), we will also check
> if socket didn't get detached from the process, that is if
> psock->sk->sk_socket != NULL, before queueing into msg queue.

The sk_socket check is only for the egress path,

  sk_psock_handle_skb -> skb_send_sock_locked -> kernel_sendmsg_locked

Then the do_tcp_sendpages() uses sk_socket and I don't see any checks for
sk_socket being set. Although I think its worth looking through to see
if the psock/sk state is always such that we have sk_socket there I
don't recall off-hand where that is null'd.

But, to answer your question this is ingress only and here we don't
use sk_socket for anything so I don't see any reason the check is
needed. All that is done here is converting to skmsg and posting
onto ingress queue.

> 
> Do we need a similar check here?
> 

Don't think so for above reason. Thanks for asking though and let me
know if you see something.

I think to make the workqueue path symmetric I'll move the check there
into the egress branch.

> > +		}
> > +		if (err < 0) {
> > +			skb_queue_tail(&psock->ingress_skb, skb);
> > +			schedule_work(&psock->work);
> > +		}
> >  		break;
> >  	case __SK_REDIRECT:
> >  		sk_psock_skb_redirect(skb);
Jakub Sitnicki Oct. 13, 2020, 7:43 p.m. UTC | #3
On Mon, Oct 12, 2020 at 05:33 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:

[...]

>> On Fri, Oct 09, 2020 at 08:36 PM CEST, John Fastabend wrote:
>> > When we receive an skb and the ingress skb verdict program returns
>> > SK_PASS we currently set the ingress flag and put it on the workqueue
>> > so it can be turned into a sk_msg and put on the sk_msg ingress queue.
>> > Then finally telling userspace with data_ready hook.
>> >
>> > Here we observe that if the workqueue is empty then we can try to
>> > convert into a sk_msg type and call data_ready directly without
>> > bouncing through a workqueue. Its a common pattern to have a recv
>> > verdict program for visibility that always returns SK_PASS. In this
>> > case unless there is an ENOMEM error or we overrun the socket we
>> > can avoid the workqueue completely only using it when we fall back
>> > to error cases caused by memory pressure.
>> >
>> > By doing this we eliminate another case where data may be dropped
>> > if errors occur on memory limits in workqueue.
>> >
>> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>> >  net/core/skmsg.c |   17 +++++++++++++++--
>> >  1 file changed, 15 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> > index 040ae1d75b65..4b160d97b7f9 100644
>> > --- a/net/core/skmsg.c
>> > +++ b/net/core/skmsg.c
>> > @@ -773,6 +773,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
>> >  {
>> >  	struct tcp_skb_cb *tcp;
>> >  	struct sock *sk_other;
>> > +	int err = -EIO;
>> >
>> >  	switch (verdict) {
>> >  	case __SK_PASS:
>> > @@ -784,8 +785,20 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
>> >
>> >  		tcp = TCP_SKB_CB(skb);
>> >  		tcp->bpf.flags |= BPF_F_INGRESS;
>> > -		skb_queue_tail(&psock->ingress_skb, skb);
>> > -		schedule_work(&psock->work);
>> > +
>> > +		/* If the queue is empty then we can submit directly
>> > +		 * into the msg queue. If its not empty we have to
>> > +		 * queue work otherwise we may get OOO data. Otherwise,
>> > +		 * if sk_psock_skb_ingress errors will be handled by
>> > +		 * retrying later from workqueue.
>> > +		 */
>> > +		if (skb_queue_empty(&psock->ingress_skb)) {
>> > +			err = sk_psock_skb_ingress(psock, skb);
>>
>> When going through the workqueue (sk_psock_backlog), we will also check
>> if socket didn't get detached from the process, that is if
>> psock->sk->sk_socket != NULL, before queueing into msg queue.
>
> The sk_socket check is only for the egress path,
>
>   sk_psock_handle_skb -> skb_send_sock_locked -> kernel_sendmsg_locked

Oh, okay. I thought it was because we want to forwarding into the socket
as soon as there is no process to read from the queue.

> Then the do_tcp_sendpages() uses sk_socket and I don't see any checks for
> sk_socket being set. Although I think its worth looking through to see
> if the psock/sk state is always such that we have sk_socket there I
> don't recall off-hand where that is null'd.

It's in sock_orphan().

> But, to answer your question this is ingress only and here we don't
> use sk_socket for anything so I don't see any reason the check is
> needed. All that is done here is converting to skmsg and posting
> onto ingress queue.

Queued skb won't be read out, but I don't see a problem with it.

[...]
diff mbox series

Patch

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 040ae1d75b65..4b160d97b7f9 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -773,6 +773,7 @@  static void sk_psock_verdict_apply(struct sk_psock *psock,
 {
 	struct tcp_skb_cb *tcp;
 	struct sock *sk_other;
+	int err = -EIO;
 
 	switch (verdict) {
 	case __SK_PASS:
@@ -784,8 +785,20 @@  static void sk_psock_verdict_apply(struct sk_psock *psock,
 
 		tcp = TCP_SKB_CB(skb);
 		tcp->bpf.flags |= BPF_F_INGRESS;
-		skb_queue_tail(&psock->ingress_skb, skb);
-		schedule_work(&psock->work);
+
+		/* If the queue is empty then we can submit directly
+		 * into the msg queue. If its not empty we have to
+		 * queue work otherwise we may get OOO data. Otherwise,
+		 * if sk_psock_skb_ingress errors will be handled by
+		 * retrying later from workqueue.
+		 */
+		if (skb_queue_empty(&psock->ingress_skb)) {
+			err = sk_psock_skb_ingress(psock, skb);
+		}
+		if (err < 0) {
+			skb_queue_tail(&psock->ingress_skb, skb);
+			schedule_work(&psock->work);
+		}
 		break;
 	case __SK_REDIRECT:
 		sk_psock_skb_redirect(skb);