diff mbox

[net-next] bpf, skb_do_redirect: clear sender_cpu before xmit

Message ID ceae3b812d3e01f73efcfdff0a7817b3c61ae24e.1444205375.git.daniel@iogearbox.net
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Daniel Borkmann Oct. 7, 2015, 8:16 a.m. UTC
Similar to commit c29390c6dfee ("xps: must clear sender_cpu before
forwarding"), we also need to clear the skb->sender_cpu when moving
from RX to TX via skb_do_redirect() due to the shared location of
napi_id (used on RX) and sender_cpu (used on TX).

Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 ( It's also needed here in the net-next commit 27b29f63058d. )

 net/core/filter.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Alexei Starovoitov Oct. 7, 2015, 3:46 p.m. UTC | #1
On 10/7/15 1:16 AM, Daniel Borkmann wrote:
> Similar to commit c29390c6dfee ("xps: must clear sender_cpu before
> forwarding"), we also need to clear the skb->sender_cpu when moving
> from RX to TX via skb_do_redirect() due to the shared location of
> napi_id (used on RX) and sender_cpu (used on TX).
>
> Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
> Signed-off-by: Daniel Borkmann<daniel@iogearbox.net>

Acked-by: Alexei Starovoitov <ast@plumgrid.com>

with the amount of skb_sender_cpu_clear() all over the code base
I wonder whether there is a better solution to all of these.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 8, 2015, 12:07 p.m. UTC | #2
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed,  7 Oct 2015 10:16:09 +0200

> Similar to commit c29390c6dfee ("xps: must clear sender_cpu before
> forwarding"), we also need to clear the skb->sender_cpu when moving
> from RX to TX via skb_do_redirect() due to the shared location of
> napi_id (used on RX) and sender_cpu (used on TX).
> 
> Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
dho@fastly.com Oct. 9, 2015, 12:50 a.m. UTC | #3
On Wed, Oct 7, 2015 at 8:46 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On 10/7/15 1:16 AM, Daniel Borkmann wrote:
>>
>> Similar to commit c29390c6dfee ("xps: must clear sender_cpu before
>> forwarding"), we also need to clear the skb->sender_cpu when moving
>> from RX to TX via skb_do_redirect() due to the shared location of
>> napi_id (used on RX) and sender_cpu (used on TX).
>>
>> Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
>> Signed-off-by: Daniel Borkmann<daniel@iogearbox.net>
>
>
> Acked-by: Alexei Starovoitov <ast@plumgrid.com>
>
> with the amount of skb_sender_cpu_clear() all over the code base
> I wonder whether there is a better solution to all of these.

I think there is. We found that splitting the union of sender_cpu and
napi_id solved the issue for us. In general, I think this is an OK
solution as long as the following hold:

 * skbs are always allocated via kzalloc
 * out -> out cloned skbs are always cloned on the same CPU
 * an extra four bytes in skbuff isn't a bad thing

I think the first and last points are true, but I'm not 100% sure. I'm
also particularly unsure about the second point. If that assumption
does not hold, it could result in extra cache / bus traffic between
cores / sockets. However, that would also imply that we were already
getting some extra traffic at the point of doing the copy. So maybe
not a big deal? The other problem I could imagine is if the second
point *is* true and skbs end up being cloned multiple times, XPS might
get overworked on individual cores.

Anyway, I'm not 100% sure about any of these things: I'm really not at
all familiar with the Linux kernel, let alone the netstack -- this
just turned out to be not particularly difficult to find given
register context and call stack from the panic. I'd be happy to send a
patch to struct skbuff and toss skb_sender_cpu_clear, but I suspect
someone else on this list could validate that quicker than I. The
patch at that point is trivial.

I think it's probably a good thing to do. The need to call
skb_sender_cpu_clear() around every rx->tx copy interaction seems
brittle and likely to be problematic again in the future unless code
is always cargo culted, and assuming we've found every potential clone
site.

--dho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov Oct. 9, 2015, 2:35 a.m. UTC | #4
On 10/8/15 5:50 PM, Devon H. O'Dell wrote:
>> with the amount of skb_sender_cpu_clear() all over the code base
>> >I wonder whether there is a better solution to all of these.
> I think there is. We found that splitting the union of sender_cpu and
> napi_id solved the issue for us. In general, I think this is an OK
> solution as long as the following hold:
>
>   * skbs are always allocated via kzalloc
>   * out -> out cloned skbs are always cloned on the same CPU
>   * an extra four bytes in skbuff isn't a bad thing

I'm pretty sure extending sk_buff for this is not acceptable.
I was thinking may be we can use sign bit to distinguish between
napi_id and sender_cpu.
Like:
         if ((int)skb->sender_cpu >= 0)
                 skb->sender_cpu = - (raw_smp_processor_id() + 1);
and inside get_xps_queue() use it only if it's negative.
Then we can remove skb_sender_cpu_clear() from everywhere.
Adding a check to napi_hash_add() to make sure that napi_id is not
negative is probably ok too.
Thoughts?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
dho@fastly.com Oct. 9, 2015, 4:40 p.m. UTC | #5
On Thu, Oct 8, 2015 at 7:35 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On 10/8/15 5:50 PM, Devon H. O'Dell wrote:
>>>
>>> with the amount of skb_sender_cpu_clear() all over the code base
>>> >I wonder whether there is a better solution to all of these.
>>
>> I think there is. We found that splitting the union of sender_cpu and
>> napi_id solved the issue for us. In general, I think this is an OK
>> solution as long as the following hold:
>>
>>   * skbs are always allocated via kzalloc
>>   * out -> out cloned skbs are always cloned on the same CPU
>>   * an extra four bytes in skbuff isn't a bad thing
>
>
> I'm pretty sure extending sk_buff for this is not acceptable.

That's unfortunate.

> I was thinking may be we can use sign bit to distinguish between
> napi_id and sender_cpu.
> Like:
>         if ((int)skb->sender_cpu >= 0)
>                 skb->sender_cpu = - (raw_smp_processor_id() + 1);
> and inside get_xps_queue() use it only if it's negative.

I like the idea, but it seems unnecessarily magical. What about using
a bitfield? Then there's just an option bit that is either
OPTION_NAPI_ID or OPTION_SENDER_CPU. Then the check to set sender_cpu
in netdev_pick_tx becomes

    if (skb->sender_napi_option == OPTION_NAPI_ID || skb->sender_cpu == 0) ...

> Then we can remove skb_sender_cpu_clear() from everywhere.
> Adding a check to napi_hash_add() to make sure that napi_id is not
> negative is probably ok too.

We could change this to check that sender_napi_option would be
OPTION_NAPI_ID with the bitfield idea.

My names are probably bad, but I think the idea is less magical (and
is effectively the same thing you are proposing).

> Thoughts?

--dho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann Oct. 9, 2015, 5:33 p.m. UTC | #6
On 10/09/2015 04:35 AM, Alexei Starovoitov wrote:
> On 10/8/15 5:50 PM, Devon H. O'Dell wrote:
>>> with the amount of skb_sender_cpu_clear() all over the code base
>>> >I wonder whether there is a better solution to all of these.
>> I think there is. We found that splitting the union of sender_cpu and
>> napi_id solved the issue for us. In general, I think this is an OK
>> solution as long as the following hold:
>>
>>   * skbs are always allocated via kzalloc
>>   * out -> out cloned skbs are always cloned on the same CPU
>>   * an extra four bytes in skbuff isn't a bad thing
>
> I'm pretty sure extending sk_buff for this is not acceptable.

+1, I agree.

> I was thinking may be we can use sign bit to distinguish between
> napi_id and sender_cpu.
> Like:
>          if ((int)skb->sender_cpu >= 0)
>                  skb->sender_cpu = - (raw_smp_processor_id() + 1);
> and inside get_xps_queue() use it only if it's negative.
> Then we can remove skb_sender_cpu_clear() from everywhere.
> Adding a check to napi_hash_add() to make sure that napi_id is not
> negative is probably ok too.
> Thoughts?

I think this doesn't make it any more maintainable.

skb_sender_cpu_clear(), one can at least git-grep to easily find
out and review call-sites in the code. There are various members
already used differently depending on the context.

Thanks,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov Oct. 10, 2015, 3:11 a.m. UTC | #7
On 10/9/15 9:40 AM, Devon H. O'Dell wrote:
> I like the idea, but it seems unnecessarily magical. What about using
> a bitfield? Then there's just an option bit that is either
> OPTION_NAPI_ID or OPTION_SENDER_CPU. Then the check to set sender_cpu
> in netdev_pick_tx becomes
>
>      if (skb->sender_napi_option == OPTION_NAPI_ID || skb->sender_cpu == 0) ..

It's less magical, but slower since two loads from skb and two cmp/jmp
are needed instead of one.
and this is critical path of xmit executed for every skb.
that's why I proposed a sign.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov Oct. 10, 2015, 3:19 a.m. UTC | #8
On 10/9/15 10:33 AM, Daniel Borkmann wrote:
>> I was thinking may be we can use sign bit to distinguish between
>> napi_id and sender_cpu.
>> Like:
>>          if ((int)skb->sender_cpu >= 0)
>>                  skb->sender_cpu = - (raw_smp_processor_id() + 1);
>> and inside get_xps_queue() use it only if it's negative.
>> Then we can remove skb_sender_cpu_clear() from everywhere.
>> Adding a check to napi_hash_add() to make sure that napi_id is not
>> negative is probably ok too.
>> Thoughts?
>
> I think this doesn't make it any more maintainable.
>
> skb_sender_cpu_clear(), one can at least git-grep to easily find
> out and review call-sites in the code. There are various members
> already used differently depending on the context.

since this bug wasn't fixed at once in all places, it means
that it is hard to review _all_ needed call-sites.
There are 7 places that call skb_sender_cpu_clear() in net-next.
Plus 2 more in net.
How many such paths from rx to tx left?
On the first glance ovs is missing one and who knows what else.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 10, 2015, 4:38 a.m. UTC | #9
On Fri, 2015-10-09 at 20:19 -0700, Alexei Starovoitov wrote:

> since this bug wasn't fixed at once in all places, it means
> that it is hard to review _all_ needed call-sites.
> There are 7 places that call skb_sender_cpu_clear() in net-next.
> Plus 2 more in net.
> How many such paths from rx to tx left?
> On the first glance ovs is missing one and who knows what else.

Alexei, what's happening ?

The original patch is 6 months old. If this issue was so urgent, how
comes it took so long to catch the remaining bugs ?

Just add skb_sender_cpu_clear() where needed, thanks.

Using union is hard, but there is a price to performance.

skb size is absolutely critical and deserves some headaches.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov Oct. 10, 2015, 4:55 a.m. UTC | #10
On 10/9/15 9:38 PM, Eric Dumazet wrote:
> On Fri, 2015-10-09 at 20:19 -0700, Alexei Starovoitov wrote:
>
>> since this bug wasn't fixed at once in all places, it means
>> that it is hard to review _all_ needed call-sites.
>> There are 7 places that call skb_sender_cpu_clear() in net-next.
>> Plus 2 more in net.
>> How many such paths from rx to tx left?
>> On the first glance ovs is missing one and who knows what else.
>
> Alexei, what's happening ?
>
> The original patch is 6 months old. If this issue was so urgent, how
> comes it took so long to catch the remaining bugs ?

no urgency at all. bpf side is clean, so I'm not worried :)

> Just add skb_sender_cpu_clear() where needed, thanks.
>
> Using union is hard, but there is a price to performance.
>
> skb size is absolutely critical and deserves some headaches.

yep. as I said it shouldn't be increased and proposed in-band sign bit.

Anyway, since you and Daniel are ok with adding skb_sender_cpu_clear()
in other places, I rest my case.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov Oct. 10, 2015, 4:56 a.m. UTC | #11
On 10/9/15 9:38 PM, Eric Dumazet wrote:
> On Fri, 2015-10-09 at 20:19 -0700, Alexei Starovoitov wrote:
>
>> since this bug wasn't fixed at once in all places, it means
>> that it is hard to review _all_ needed call-sites.
>> There are 7 places that call skb_sender_cpu_clear() in net-next.
>> Plus 2 more in net.
>> How many such paths from rx to tx left?
>> On the first glance ovs is missing one and who knows what else.
>
> Alexei, what's happening ?
>
> The original patch is 6 months old. If this issue was so urgent, how
> comes it took so long to catch the remaining bugs ?

no urgency at all. bpf side is clean, so I'm not worried :)

> Just add skb_sender_cpu_clear() where needed, thanks.
>
> Using union is hard, but there is a price to performance.
>
> skb size is absolutely critical and deserves some headaches.

yep. as I said it shouldn't be increased and proposed in-band sign bit.

Anyway, since you and Daniel are ok with adding skb_sender_cpu_clear()
in other places, I rest my case.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 10, 2015, 5:12 p.m. UTC | #12
On Fri, 2015-10-09 at 21:56 -0700, Alexei Starovoitov wrote:

> yep. as I said it shouldn't be increased and proposed in-band sign bit.

I believe we still would like to keep some helpers to clearly identify
when a packet crosses domains. Even if the helper is empty.

It helps a lot when a new feature must be added.

skb_scrub_packet() and similar helper are not only doing useful work,
they are easily grep-able.

So your idea sounds nice, but in reality it maintains the status quo,
where we do not really know which points need care when skbs cross a
domain.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 16, 2015, 6:07 p.m. UTC | #13
On Fri, 2015-10-09 at 19:33 +0200, Daniel Borkmann wrote:
> On 10/09/2015 04:35 AM, Alexei Starovoitov wrote:
> > On 10/8/15 5:50 PM, Devon H. O'Dell wrote:
> >>> with the amount of skb_sender_cpu_clear() all over the code base
> >>> >I wonder whether there is a better solution to all of these.
> >> I think there is. We found that splitting the union of sender_cpu and
> >> napi_id solved the issue for us. In general, I think this is an OK
> >> solution as long as the following hold:
> >>
> >>   * skbs are always allocated via kzalloc
> >>   * out -> out cloned skbs are always cloned on the same CPU
> >>   * an extra four bytes in skbuff isn't a bad thing
> >
> > I'm pretty sure extending sk_buff for this is not acceptable.
> 
> +1, I agree.
> 
> > I was thinking may be we can use sign bit to distinguish between
> > napi_id and sender_cpu.
> > Like:
> >          if ((int)skb->sender_cpu >= 0)
> >                  skb->sender_cpu = - (raw_smp_processor_id() + 1);
> > and inside get_xps_queue() use it only if it's negative.
> > Then we can remove skb_sender_cpu_clear() from everywhere.
> > Adding a check to napi_hash_add() to make sure that napi_id is not
> > negative is probably ok too.
> > Thoughts?
> 
> I think this doesn't make it any more maintainable.
> 
> skb_sender_cpu_clear(), one can at least git-grep to easily find
> out and review call-sites in the code. There are various members
> already used differently depending on the context.

Extending busy polling support for tunnels devices actually requires to
make some changes in this area.

We need to keep skb->napi_id set until skb reaches a socket, but
skb_scrub_packet() currently defeats the thing.

I will leave skb_sender_cpu_clear() in place but it will be empty.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/filter.c b/net/core/filter.c
index da3e535..8f4603c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1462,6 +1462,7 @@  int skb_do_redirect(struct sk_buff *skb)
 		return dev_forward_skb(dev, skb);
 
 	skb->dev = dev;
+	skb_sender_cpu_clear(skb);
 	return dev_queue_xmit(skb);
 }