Message ID | ceae3b812d3e01f73efcfdff0a7817b3c61ae24e.1444205375.git.daniel@iogearbox.net |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
On 10/7/15 1:16 AM, Daniel Borkmann wrote: > Similar to commit c29390c6dfee ("xps: must clear sender_cpu before > forwarding"), we also need to clear the skb->sender_cpu when moving > from RX to TX via skb_do_redirect() due to the shared location of > napi_id (used on RX) and sender_cpu (used on TX). > > Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper") > Signed-off-by: Daniel Borkmann<daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> with the amount of skb_sender_cpu_clear() all over the code base I wonder whether there is a better solution to all of these. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Daniel Borkmann <daniel@iogearbox.net> Date: Wed, 7 Oct 2015 10:16:09 +0200 > Similar to commit c29390c6dfee ("xps: must clear sender_cpu before > forwarding"), we also need to clear the skb->sender_cpu when moving > from RX to TX via skb_do_redirect() due to the shared location of > napi_id (used on RX) and sender_cpu (used on TX). > > Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper") > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 7, 2015 at 8:46 AM, Alexei Starovoitov <ast@plumgrid.com> wrote: > On 10/7/15 1:16 AM, Daniel Borkmann wrote: >> >> Similar to commit c29390c6dfee ("xps: must clear sender_cpu before >> forwarding"), we also need to clear the skb->sender_cpu when moving >> from RX to TX via skb_do_redirect() due to the shared location of >> napi_id (used on RX) and sender_cpu (used on TX). >> >> Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper") >> Signed-off-by: Daniel Borkmann<daniel@iogearbox.net> > > > Acked-by: Alexei Starovoitov <ast@plumgrid.com> > > with the amount of skb_sender_cpu_clear() all over the code base > I wonder whether there is a better solution to all of these. I think there is. We found that splitting the union of sender_cpu and napi_id solved the issue for us. In general, I think this is an OK solution as long as the following hold: * skbs are always allocated via kzalloc * out -> out cloned skbs are always cloned on the same CPU * an extra four bytes in skbuff isn't a bad thing I think the first and last points are true, but I'm not 100% sure. I'm also particularly unsure about the second point. If that assumption does not hold, it could result in extra cache / bus traffic between cores / sockets. However, that would also imply that we were already getting some extra traffic at the point of doing the copy. So maybe not a big deal? The other problem I could imagine is if the second point *is* true and skbs end up being cloned multiple times, XPS might get overworked on individual cores. Anyway, I'm not 100% sure about any of these things: I'm really not at all familiar with the Linux kernel, let alone the netstack -- this just turned out to be not particularly difficult to find given register context and call stack from the panic. I'd be happy to send a patch to struct skbuff and toss skb_sender_cpu_clear, but I suspect someone else on this list could validate that quicker than I. The patch at that point is trivial. I think it's probably a good thing to do. The need to call skb_sender_cpu_clear() around every rx->tx copy interaction seems brittle and likely to be problematic again in the future unless code is always cargo culted, and assuming we've found every potential clone site. --dho -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/8/15 5:50 PM, Devon H. O'Dell wrote: >> with the amount of skb_sender_cpu_clear() all over the code base >> >I wonder whether there is a better solution to all of these. > I think there is. We found that splitting the union of sender_cpu and > napi_id solved the issue for us. In general, I think this is an OK > solution as long as the following hold: > > * skbs are always allocated via kzalloc > * out -> out cloned skbs are always cloned on the same CPU > * an extra four bytes in skbuff isn't a bad thing I'm pretty sure extending sk_buff for this is not acceptable. I was thinking may be we can use sign bit to distinguish between napi_id and sender_cpu. Like: if ((int)skb->sender_cpu >= 0) skb->sender_cpu = - (raw_smp_processor_id() + 1); and inside get_xps_queue() use it only if it's negative. Then we can remove skb_sender_cpu_clear() from everywhere. Adding a check to napi_hash_add() to make sure that napi_id is not negative is probably ok too. Thoughts? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Oct 8, 2015 at 7:35 PM, Alexei Starovoitov <ast@plumgrid.com> wrote: > On 10/8/15 5:50 PM, Devon H. O'Dell wrote: >>> >>> with the amount of skb_sender_cpu_clear() all over the code base >>> >I wonder whether there is a better solution to all of these. >> >> I think there is. We found that splitting the union of sender_cpu and >> napi_id solved the issue for us. In general, I think this is an OK >> solution as long as the following hold: >> >> * skbs are always allocated via kzalloc >> * out -> out cloned skbs are always cloned on the same CPU >> * an extra four bytes in skbuff isn't a bad thing > > > I'm pretty sure extending sk_buff for this is not acceptable. That's unfortunate. > I was thinking may be we can use sign bit to distinguish between > napi_id and sender_cpu. > Like: > if ((int)skb->sender_cpu >= 0) > skb->sender_cpu = - (raw_smp_processor_id() + 1); > and inside get_xps_queue() use it only if it's negative. I like the idea, but it seems unnecessarily magical. What about using a bitfield? Then there's just an option bit that is either OPTION_NAPI_ID or OPTION_SENDER_CPU. Then the check to set sender_cpu in netdev_pick_tx becomes if (skb->sender_napi_option == OPTION_NAPI_ID || skb->sender_cpu == 0) ... > Then we can remove skb_sender_cpu_clear() from everywhere. > Adding a check to napi_hash_add() to make sure that napi_id is not > negative is probably ok too. We could change this to check that sender_napi_option would be OPTION_NAPI_ID with the bitfield idea. My names are probably bad, but I think the idea is less magical (and is effectively the same thing you are proposing). > Thoughts? --dho -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/09/2015 04:35 AM, Alexei Starovoitov wrote: > On 10/8/15 5:50 PM, Devon H. O'Dell wrote: >>> with the amount of skb_sender_cpu_clear() all over the code base >>> >I wonder whether there is a better solution to all of these. >> I think there is. We found that splitting the union of sender_cpu and >> napi_id solved the issue for us. In general, I think this is an OK >> solution as long as the following hold: >> >> * skbs are always allocated via kzalloc >> * out -> out cloned skbs are always cloned on the same CPU >> * an extra four bytes in skbuff isn't a bad thing > > I'm pretty sure extending sk_buff for this is not acceptable. +1, I agree. > I was thinking may be we can use sign bit to distinguish between > napi_id and sender_cpu. > Like: > if ((int)skb->sender_cpu >= 0) > skb->sender_cpu = - (raw_smp_processor_id() + 1); > and inside get_xps_queue() use it only if it's negative. > Then we can remove skb_sender_cpu_clear() from everywhere. > Adding a check to napi_hash_add() to make sure that napi_id is not > negative is probably ok too. > Thoughts? I think this doesn't make it any more maintainable. skb_sender_cpu_clear(), one can at least git-grep to easily find out and review call-sites in the code. There are various members already used differently depending on the context. Thanks, Daniel -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/9/15 9:40 AM, Devon H. O'Dell wrote: > I like the idea, but it seems unnecessarily magical. What about using > a bitfield? Then there's just an option bit that is either > OPTION_NAPI_ID or OPTION_SENDER_CPU. Then the check to set sender_cpu > in netdev_pick_tx becomes > > if (skb->sender_napi_option == OPTION_NAPI_ID || skb->sender_cpu == 0) .. It's less magical, but slower since two loads from skb and two cmp/jmp are needed instead of one. and this is critical path of xmit executed for every skb. that's why I proposed a sign. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/9/15 10:33 AM, Daniel Borkmann wrote: >> I was thinking may be we can use sign bit to distinguish between >> napi_id and sender_cpu. >> Like: >> if ((int)skb->sender_cpu >= 0) >> skb->sender_cpu = - (raw_smp_processor_id() + 1); >> and inside get_xps_queue() use it only if it's negative. >> Then we can remove skb_sender_cpu_clear() from everywhere. >> Adding a check to napi_hash_add() to make sure that napi_id is not >> negative is probably ok too. >> Thoughts? > > I think this doesn't make it any more maintainable. > > skb_sender_cpu_clear(), one can at least git-grep to easily find > out and review call-sites in the code. There are various members > already used differently depending on the context. since this bug wasn't fixed at once in all places, it means that it is hard to review _all_ needed call-sites. There are 7 places that call skb_sender_cpu_clear() in net-next. Plus 2 more in net. How many such paths from rx to tx left? On the first glance ovs is missing one and who knows what else. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-10-09 at 20:19 -0700, Alexei Starovoitov wrote: > since this bug wasn't fixed at once in all places, it means > that it is hard to review _all_ needed call-sites. > There are 7 places that call skb_sender_cpu_clear() in net-next. > Plus 2 more in net. > How many such paths from rx to tx left? > On the first glance ovs is missing one and who knows what else. Alexei, what's happening ? The original patch is 6 months old. If this issue was so urgent, how comes it took so long to catch the remaining bugs ? Just add skb_sender_cpu_clear() where needed, thanks. Using union is hard, but there is a price to performance. skb size is absolutely critical and deserves some headaches. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/9/15 9:38 PM, Eric Dumazet wrote: > On Fri, 2015-10-09 at 20:19 -0700, Alexei Starovoitov wrote: > >> since this bug wasn't fixed at once in all places, it means >> that it is hard to review _all_ needed call-sites. >> There are 7 places that call skb_sender_cpu_clear() in net-next. >> Plus 2 more in net. >> How many such paths from rx to tx left? >> On the first glance ovs is missing one and who knows what else. > > Alexei, what's happening ? > > The original patch is 6 months old. If this issue was so urgent, how > comes it took so long to catch the remaining bugs ? no urgency at all. bpf side is clean, so I'm not worried :) > Just add skb_sender_cpu_clear() where needed, thanks. > > Using union is hard, but there is a price to performance. > > skb size is absolutely critical and deserves some headaches. yep. as I said it shouldn't be increased and proposed in-band sign bit. Anyway, since you and Daniel are ok with adding skb_sender_cpu_clear() in other places, I rest my case. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/9/15 9:38 PM, Eric Dumazet wrote: > On Fri, 2015-10-09 at 20:19 -0700, Alexei Starovoitov wrote: > >> since this bug wasn't fixed at once in all places, it means >> that it is hard to review _all_ needed call-sites. >> There are 7 places that call skb_sender_cpu_clear() in net-next. >> Plus 2 more in net. >> How many such paths from rx to tx left? >> On the first glance ovs is missing one and who knows what else. > > Alexei, what's happening ? > > The original patch is 6 months old. If this issue was so urgent, how > comes it took so long to catch the remaining bugs ? no urgency at all. bpf side is clean, so I'm not worried :) > Just add skb_sender_cpu_clear() where needed, thanks. > > Using union is hard, but there is a price to performance. > > skb size is absolutely critical and deserves some headaches. yep. as I said it shouldn't be increased and proposed in-band sign bit. Anyway, since you and Daniel are ok with adding skb_sender_cpu_clear() in other places, I rest my case. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-10-09 at 21:56 -0700, Alexei Starovoitov wrote:
> yep. as I said it shouldn't be increased and proposed in-band sign bit.
I believe we still would like to keep some helpers to clearly identify
when a packet crosses domains. Even if the helper is empty.
It helps a lot when a new feature must be added.
skb_scrub_packet() and similar helper are not only doing useful work,
they are easily grep-able.
So your idea sounds nice, but in reality it maintains the status quo,
where we do not really know which points need care when skbs cross a
domain.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-10-09 at 19:33 +0200, Daniel Borkmann wrote: > On 10/09/2015 04:35 AM, Alexei Starovoitov wrote: > > On 10/8/15 5:50 PM, Devon H. O'Dell wrote: > >>> with the amount of skb_sender_cpu_clear() all over the code base > >>> >I wonder whether there is a better solution to all of these. > >> I think there is. We found that splitting the union of sender_cpu and > >> napi_id solved the issue for us. In general, I think this is an OK > >> solution as long as the following hold: > >> > >> * skbs are always allocated via kzalloc > >> * out -> out cloned skbs are always cloned on the same CPU > >> * an extra four bytes in skbuff isn't a bad thing > > > > I'm pretty sure extending sk_buff for this is not acceptable. > > +1, I agree. > > > I was thinking may be we can use sign bit to distinguish between > > napi_id and sender_cpu. > > Like: > > if ((int)skb->sender_cpu >= 0) > > skb->sender_cpu = - (raw_smp_processor_id() + 1); > > and inside get_xps_queue() use it only if it's negative. > > Then we can remove skb_sender_cpu_clear() from everywhere. > > Adding a check to napi_hash_add() to make sure that napi_id is not > > negative is probably ok too. > > Thoughts? > > I think this doesn't make it any more maintainable. > > skb_sender_cpu_clear(), one can at least git-grep to easily find > out and review call-sites in the code. There are various members > already used differently depending on the context. Extending busy polling support for tunnels devices actually requires to make some changes in this area. We need to keep skb->napi_id set until skb reaches a socket, but skb_scrub_packet() currently defeats the thing. I will leave skb_sender_cpu_clear() in place but it will be empty. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/core/filter.c b/net/core/filter.c index da3e535..8f4603c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1462,6 +1462,7 @@ int skb_do_redirect(struct sk_buff *skb) return dev_forward_skb(dev, skb); skb->dev = dev; + skb_sender_cpu_clear(skb); return dev_queue_xmit(skb); }
Similar to commit c29390c6dfee ("xps: must clear sender_cpu before forwarding"), we also need to clear the skb->sender_cpu when moving from RX to TX via skb_do_redirect() due to the shared location of napi_id (used on RX) and sender_cpu (used on TX). Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> --- ( It's also needed here in the net-next commit 27b29f63058d. ) net/core/filter.c | 1 + 1 file changed, 1 insertion(+)