Message ID:   1430333589-4940-7-git-send-email-pablo@netfilter.org
State:        RFC, archived
Delegated to: David Miller
On 04/29/2015 08:53 PM, Pablo Neira Ayuso wrote:
> Port qdisc ingress on top of the Netfilter ingress allows us to detach the
> qdisc ingress filtering code from the core, so now it resides where it really
> belongs.

Hm, but that means, in case you have a tc ingress qdisc attached with one
single (ideal) or more (less ideal) classifier/actions, the path we _now_
have to traverse just for a single tc classifier invocation is, if I spot
this correctly, f.e.:

__netif_receive_skb_core()
 `-> nf_hook_ingress()
  `-> nf_hook_do_ingress()
   `-> nf_hook_slow()
    `-> [for each entry in hook list]
     `-> nf_iterate()
      `-> (*elemp)->hook()
       `-> handle_ing()
        `-> ing_filter()
         `-> qdisc_enqueue_root()
          `-> sch->enqueue()
           `-> ingress_enqueue()
            `-> tc_classify()
             `-> tc_classify_compat()
              `-> [for each attached classifier]
               `-> tp->classify()
                `-> f.e. cls_bpf_classify()
                 `-> [for each classifier from plist]
                  `-> BPF_PROG_RUN()

What was actually mentioned in the other thread, where we'd like to see a
more lightweight ingress qdisc, is to cut that down tremendously to increase
the pps rate, so that we would be able to process a path roughly like:

__netif_receive_skb_core()
 `-> tc_classify()
  `-> tc_classify_compat()
   `-> [for each attached classifier]
    `-> tp->classify()
     `-> f.e. cls_bpf_classify()
      `-> [for each classifier from plist]
       `-> BPF_PROG_RUN()

Therefore, I think it would be better not to wrap that ingress qdisc part
of the patch set into even more layers. What do you think?

Thanks,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 29, 2015 at 11:53 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 2274e72..23b57da 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -312,6 +312,7 @@ config NET_SCH_PIE
>  config NET_SCH_INGRESS
>         tristate "Ingress Qdisc"
>         depends on NET_CLS_ACT
> +       select NETFILTER_INGRESS
>         ---help---
>           Say Y here if you want to use classifiers for incoming packets.
>           If unsure, say Y.

So now it is impossible to compile the ingress Qdisc without netfilter...

(I know you moved the hooks into net/core/, but they are still netfilter APIs.)

Why do we have to mix different layers? IOW, why not just keep TC at L2 and
netfilter at L3, even just w.r.t. the API?
On Wed, Apr 29, 2015 at 10:27:05PM +0200, Daniel Borkmann wrote:
> On 04/29/2015 08:53 PM, Pablo Neira Ayuso wrote:
> > Port qdisc ingress on top of the Netfilter ingress allows us to detach the
> > qdisc ingress filtering code from the core, so now it resides where it
> > really belongs.
>
> Hm, but that means, in case you have a tc ingress qdisc attached
> with one single (ideal) or more (less ideal) classifier/actions,
> the path we _now_ have to traverse just to a single tc classifier
> invocation is, if I spot this correctly, f.e.:
>
> __netif_receive_skb_core()
>  `-> nf_hook_ingress()
>   `-> nf_hook_do_ingress()
>    `-> nf_hook_slow()
>     `-> [for each entry in hook list]
>      `-> nf_iterate()
>       `-> (*elemp)->hook()
>        `-> handle_ing()
>         `-> ing_filter()
>          `-> qdisc_enqueue_root()
>           `-> sch->enqueue()
>            `-> ingress_enqueue()
>             `-> tc_classify()
>              `-> tc_classify_compat()
>               `-> [for each attached classifier]
>                `-> tp->classify()
>                 `-> f.e. cls_bpf_classify()
>                  `-> [for each classifier from plist]
>                   `-> BPF_PROG_RUN()

Actually, the extra cost is roughly (leaving inlined and other non-relevant
stuff aside):

 `-> nf_hook_slow()
  `-> [for each entry in hook list]
   `-> nf_iterate()
    `-> (*elemp)->hook()

as part of the generic hook infrastructure, which comes with extra
flexibility in return. I think the main concern so far was not to harm the
critical netif_receive_core() path, and this patchset proves not to affect
it. BTW, the sch->enqueue() can easily go away after this patchset; see the
attached patch.

> What was actually mentioned in the other thread where we'd like to
> see a more lightweight ingress qdisc is to cut that down tremendously
> to increase pps rate, as provided, that we would be able to process
> a path roughly like:
>
> __netif_receive_skb_core()
>  `-> tc_classify()
>   `-> tc_classify_compat()
>    `-> [for each attached classifier]
>     `-> tp->classify()
>      `-> f.e. cls_bpf_classify()
>       `-> [for each classifier from plist]
>        `-> BPF_PROG_RUN()
>
> Therefore, I think it would be better to not wrap that ingress qdisc
> part of the patch set into even more layers. What do you think?

I think the main front on which to improve qdisc ingress performance is to
remove the central spinlock that is harming scalability. There are also the
built-in rule counters there that look problematic. So I would focus on
improving performance in the qdisc ingress core infrastructure itself.

On the bugfix front, the illegal mangling of shared skbs by actions like
stateless nat and bpf also looks important to address. David already
suggested propagating some state object that keeps a pointer to the skb
that is passed to the action; thus, the action can clone it and hand the
skb back to the ingress path. I started a patchset to do so here; it's a
bit large since it requires quite a lot of function signature adjustments.

I can also see there were intentions to support userspace queueing at some
point, since TC_ACT_QUEUED has been there since the beginning. That should
become possible using this infrastructure (once there are no further
concerns on the netif_receive_core_finish() patch, as long as gcc 4.9 and
follow-up versions keep inlining this new function).
On 29.04, Daniel Borkmann wrote:
> On 04/29/2015 08:53 PM, Pablo Neira Ayuso wrote:
> > Port qdisc ingress on top of the Netfilter ingress allows us to detach the
> > qdisc ingress filtering code from the core, so now it resides where it
> > really belongs.
>
> Hm, but that means, in case you have a tc ingress qdisc attached
> with one single (ideal) or more (less ideal) classifier/actions,
> the path we _now_ have to traverse just to a single tc classifier
> invocation is, if I spot this correctly, f.e.:
>
> __netif_receive_skb_core()
>  `-> nf_hook_ingress()
>   `-> nf_hook_do_ingress()
>    `-> nf_hook_slow()
>     `-> [for each entry in hook list]
>      `-> nf_iterate()
>       `-> (*elemp)->hook()
>        `-> handle_ing()
>         `-> ing_filter()
>          `-> qdisc_enqueue_root()
>           `-> sch->enqueue()
>            `-> ingress_enqueue()
>             `-> tc_classify()
>              `-> tc_classify_compat()
>               `-> [for each attached classifier]
>                `-> tp->classify()
>                 `-> f.e. cls_bpf_classify()
>                  `-> [for each classifier from plist]
>                   `-> BPF_PROG_RUN()

You obviously realize this call chain is fully made up by yourself, because
1/3 of the functions will be optimized away by the compiler, some are
completely gone with this patchset, and some are optional and appear to be
present simply to make it appear longer.

The difference is very simple: where we had an indirect call to q->enqueue
before (and a lot of crap that didn't belong there), we now have a call to
nf_hook_slow, followed by the hook invocation.
On 29.04, Cong Wang wrote:
> On Wed, Apr 29, 2015 at 11:53 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index 2274e72..23b57da 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -312,6 +312,7 @@ config NET_SCH_PIE
> >  config NET_SCH_INGRESS
> >         tristate "Ingress Qdisc"
> >         depends on NET_CLS_ACT
> > +       select NETFILTER_INGRESS
> >         ---help---
> >           Say Y here if you want to use classifiers for incoming packets.
> >           If unsure, say Y.
>
> So now it is impossible to compile the ingress Qdisc without netfilter...
>
> (I know you moved the hooks into net/core/, but they are still netfilter APIs.)
>
> Why do we have to mix different layers? IOW, why not just keep TC at L2
> and netfilter at L3, even just w.r.t. the API?

Call it what you want; netfilter, for most people, is the entire
infrastructure built on top of the hooks. We can call it packet hooks if
that makes you happy. And they are infrastructure, not affiliated with any
particular layer.
On Wed, Apr 29, 2015 at 02:53:58PM -0700, Cong Wang wrote:
> On Wed, Apr 29, 2015 at 11:53 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index 2274e72..23b57da 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -312,6 +312,7 @@ config NET_SCH_PIE
> >  config NET_SCH_INGRESS
> >         tristate "Ingress Qdisc"
> >         depends on NET_CLS_ACT
> > +       select NETFILTER_INGRESS
> >         ---help---
> >           Say Y here if you want to use classifiers for incoming packets.
> >           If unsure, say Y.
>
> So now it is impossible to compile the ingress Qdisc without netfilter...
>
> (I know you moved the hooks into net/core/, but they are still netfilter APIs.)

This is only a single file that contains just the very basic hook
infrastructure; it does not depend on layer 3.

> Why do we have to mix different layers? IOW, why not just keep TC at L2
> and netfilter at L3, even just w.r.t. the API?

I think that used to be true in the iptables days... The nftables
infrastructure is flexible and extensible enough to satisfy the needs of
other network layers.
On 04/30/2015 01:36 AM, Patrick McHardy wrote:
...
> You obviously realize this callchain is fully made up by yourself

Hence I wrote path, not call chain; I guess that should have been clear.

[...]
> The difference is very simple: where we had an indirect call to
> q->enqueue before (and a lot of crap that didn't belong there),
> we now have a call to nf_hook_slow, followed by the hook invocation.

Sure, but what I wanted to express is that, from an architectural point of
view, down to invoking a classifier we now have a list of hooks, one
element of which can be the ingress qdisc, and that one can have lists of
classifiers/actions by itself, etc. That doesn't seem sound w.r.t. 'where
it really belongs'.

Personally, I have no objections if you want to use netfilter on ingress,
don't get me wrong, but that ingress qdisc part doesn't really fit in
there, and moving it behind netfilter facilities does have an additional
cost per packet.
On 04/30/2015 01:32 AM, Pablo Neira Ayuso wrote:
...
> Actually, the extra cost is roughly (getting inlined stuff away and
> other non-relevant stuff):
>
>  `-> nf_hook_slow()
>   `-> [for each entry in hook list]
>    `-> nf_iterate()
>     `-> (*elemp)->hook()

Yep, agreed.

> as part of the generic hook infrastructure, which comes with extra
> flexibility in return. I think the main concern so far was not to harm
> the critical netif_receive_core() path, and this patchset proves not
> to affect this.

Correct, as you use the static key and hide everything behind it.
On 30.04, Daniel Borkmann wrote:
> On 04/30/2015 01:36 AM, Patrick McHardy wrote:
> ...
> > You obviously realize this callchain is fully made up by yourself
>
> Hence I wrote path, not call chain; I guess that should have been clear.
>
> [...]
> > The difference is very simple: where we had an indirect call to
> > q->enqueue before (and a lot of crap that didn't belong there),
> > we now have a call to nf_hook_slow, followed by the hook invocation.
>
> Sure, but what I wanted to express is that, from an architectural
> point of view, down to invoking a classifier we now have a list of
> hooks, one element of which can be the ingress qdisc, and that one can
> have lists of classifiers/actions by itself, etc. That doesn't seem
> sound w.r.t. 'where it really belongs'.
>
> Personally, I have no objections if you want to use netfilter on
> ingress, don't get me wrong, but that ingress qdisc part doesn't
> really fit in there, and moving it behind netfilter facilities does
> have an additional cost per packet.

It does, and Pablo made up for this cost by moving the TTL tracking to
ingress, where it actually belongs. This would obviously also be possible
without his patchset, but the fact is, nobody seems to have cared for the
past ten years. As of today we have a basically zero-cost feature and
people can decide what they want for themselves: there is no added cost for
people not using it, and there is no added cost for people using ingress.

As for "really fit": ingress started off living on a netfilter hook and
only moved to a direct invocation somewhere around 2.6.12 IIRC. As I was
mentioning earlier, the ingress *qdisc* is a hack in itself since there is
no freaking queueing; it's a hack to invoke classifier chains, so actually
netfilter is *the better* abstraction to do this. And it's not netfilter,
it's packet hooks for classification purposes.

To me all these arguments have a very negative sound. "It has always been
like this" - and it hasn't. "We want ingress queueing to be the exclusive
way to do this" - and that quite frankly sucks, for all the reasons stated
before. It's a hack, it's not queueing, it doesn't cost anything; let
people decide what they actually want to use.
On 04/30/2015 01:32 AM, Pablo Neira Ayuso wrote:
...
>> Therefore, I think it would be better to not wrap that ingress qdisc
>> part of the patch set into even more layers. What do you think?
>
> I think the main front on which to improve qdisc ingress performance is
> to remove the central spinlock that is harming scalability. There are
> also the built-in rule counters there that look problematic. So I would
> focus on improving performance in the qdisc ingress core infrastructure
> itself.
>
> On the bugfix front, the illegal mangling of shared skbs by actions
> like stateless nat and bpf also looks important to address. David
> already suggested propagating some state object that keeps a pointer to
> the skb that is passed to the action; thus, the action can clone it and
> hand the skb back to the ingress path. I started a patchset to do so
> here; it's a bit large since it requires quite a lot of function
> signature adjustments.
>
> I can also see there were intentions to support userspace queueing at
> some point, since TC_ACT_QUEUED has been there since the beginning.
> That should become possible using this infrastructure (once there are
> no further concerns on the netif_receive_core_finish() patch, as long
> as gcc 4.9 and follow-up versions keep inlining this new function).

Wrt the other mail, just thinking out loud ... do you see a longer-term
possibility of further generalizing the generic hook infrastructure, so
that classifiers from tc could actually attach (disregarding the nf_*
naming scheme for now) ...

 `-> nf_hook_slow()
  `-> [for each entry in hook list]    <-- here as an entry
   `-> nf_iterate()
    `-> (*elemp)->hook()

... as well?

Thanks,
Daniel
On 30.04, Daniel Borkmann wrote:
> >
> > I can also see there were intentions to support userspace queueing at
> > some point, since TC_ACT_QUEUED has been there since the beginning.
> > That should become possible using this infrastructure (once there are
> > no further concerns on the netif_receive_core_finish() patch, as long
> > as gcc 4.9 and follow-up versions keep inlining this new function).
>
> Wrt the other mail, just thinking out loud ... do you see a longer-term
> possibility of further generalizing the generic hook infrastructure, so
> that classifiers from tc could actually attach (disregarding the nf_*
> naming scheme for now) ...
>
>  `-> nf_hook_slow()
>   `-> [for each entry in hook list]    <-- here as an entry
>    `-> nf_iterate()
>     `-> (*elemp)->hook()
>
> ... as well?

Jumping in there since I'm probably the one who most strongly thinks the TC
ingress abstraction is wrong - yes, it's an interesting idea. Unlike egress
qdiscs, ingress only has a single classifier chain anyway, so there is no
qdisc-internal classification structure to be observed. It should be
possible to skip the ingress invocation for classification purposes
completely and only use it to expose things to userspace for management
purposes, while invoking the chain directly.
On 30.04, Pablo Neira Ayuso wrote:
> On the bugfix front, the illegal mangling of shared skbs by actions
> like stateless nat and bpf also looks important to address. David
> already suggested propagating some state object that keeps a pointer to
> the skb that is passed to the action; thus, the action can clone it and
> hand the skb back to the ingress path. I started a patchset to do so
> here; it's a bit large since it requires quite a lot of function
> signature adjustments.

Jumping in on this point - the fact that roughly 2/3 of TC actions will
simply BUG under not unlikely circumstances when used on ingress (I went
through them one by one with Pablo a week ago) is also telling. Nobody
seems to be using that. All packet-mangling actions will BUG while any tap
is active. It's not easily fixed, but apparently nobody has cared in ten
years. ipt is trivial to crash differently; connmark is as well.

So I'm wondering what we are actually arguing about here. Whether we are
affecting how fast TC will crash? We *do* actually care about these things;
in TC, apparently nobody has for the past ten years.
On 04/30/2015 02:30 AM, Patrick McHardy wrote:
> On 30.04, Daniel Borkmann wrote:
> > >
> > > I can also see there were intentions to support userspace queueing
> > > at some point, since TC_ACT_QUEUED has been there since the
> > > beginning. That should become possible using this infrastructure
> > > (once there are no further concerns on the
> > > netif_receive_core_finish() patch, as long as gcc 4.9 and follow-up
> > > versions keep inlining this new function).
> >
> > Wrt the other mail, just thinking out loud ... do you see a
> > longer-term possibility of further generalizing the generic hook
> > infrastructure, so that classifiers from tc could actually attach
> > (disregarding the nf_* naming scheme for now) ...
> >
> >  `-> nf_hook_slow()
> >   `-> [for each entry in hook list]    <-- here as an entry
> >    `-> nf_iterate()
> >     `-> (*elemp)->hook()
> >
> > ... as well?
>
> Jumping in there since I'm probably the one who most strongly thinks
> the TC ingress abstraction is wrong - yes, it's an interesting idea.
> Unlike egress qdiscs, ingress only has a single classifier chain
> anyway, so there is no qdisc-internal classification structure to

Yes, exactly.

> be observed. It should be possible to skip the ingress invocation
> for classification purposes completely and only use it to expose
> things to userspace for management purposes, while invoking the chain
> directly.

That would also meet our goals to only have a single classifier chain,
eventually. ;)
On 30.04, Daniel Borkmann wrote:
> On 04/30/2015 02:30 AM, Patrick McHardy wrote:
> > On 30.04, Daniel Borkmann wrote:
> > > Wrt the other mail, just thinking out loud ... do you see a
> > > longer-term possibility of further generalizing the generic hook
> > > infrastructure, so that classifiers from tc could actually attach
> > > (disregarding the nf_* naming scheme for now) ...
> > >
> > >  `-> nf_hook_slow()
> > >   `-> [for each entry in hook list]    <-- here as an entry
> > >    `-> nf_iterate()
> > >     `-> (*elemp)->hook()
> > >
> > > ... as well?
> >
> > Jumping in there since I'm probably the one who most strongly thinks
> > the TC ingress abstraction is wrong - yes, it's an interesting idea.
> > Unlike egress qdiscs, ingress only has a single classifier chain
> > anyway, so there is no qdisc-internal classification structure to
>
> Yes, exactly.
>
> > be observed. It should be possible to skip the ingress invocation
> > for classification purposes completely and only use it to expose
> > things to userspace for management purposes, while invoking the
> > chain directly.
>
> That would also meet our goals to only have a single classifier
> chain, eventually. ;)

For TC, yes. What Pablo is trying to achieve (and I agree that it's
worthwhile) is multiple things:

 * have people use TC ingress as before, obviously
 * have people use netfilter, if they prefer
 * have them combine both, if for some unfortunate reason they require it

Netfilter is based on hook chains. The cost when only using a single hook
is minimal (as Pablo showed in his numbers), but even if only using TC and
a single netfilter classifier chain, there has to be some relative
ordering, and the hooks provide this in a generic way.

So I'm all in favour of reducing costs for the single- and multi-hook
cases, but it wouldn't be usable for us anymore if we did away with
multiple hooks completely.

One more point to consider - this is generic infrastructure - the next
person requiring a hook will not have to touch the ingress path at all. He
can register with the hooks, at zero additional cost for people not using
it.
On 04/30/2015 02:37 AM, Patrick McHardy wrote:
> On 30.04, Pablo Neira Ayuso wrote:
> > On the bugfix front, the illegal mangling of shared skbs by actions
> > like stateless nat and bpf also looks important to address. David
> > already suggested propagating some state object that keeps a pointer
> > to the skb that is passed to the action; thus, the action can clone
> > it and hand the skb back to the ingress path. I started a patchset to
> > do so here; it's a bit large since it requires quite a lot of
> > function signature adjustments.
>
> Jumping in on this point - the fact that roughly 2/3 of TC actions will
> simply BUG under not unlikely circumstances when used on ingress (I
> went through them one by one with Pablo a week ago) is also telling.
> Nobody seems to be using that. All packet-mangling actions will BUG
> while any tap is active. It's not easily fixed, but apparently nobody
> has cared in ten years. ipt is trivial to crash differently; connmark
> is as well.
>
> So I'm wondering what we are actually arguing about here. Whether we
> are affecting how fast TC will crash? We *do* actually care about
> these things; in TC, apparently nobody has for the past ten years.

Totally agree with you that the situation is quite a mess. From the tc
ingress/egress side, at least my use case is to have an as minimal as
possible entry point for cls_bpf/act_bpf, which is what we were working on
recently. That is rather "fresh" compared to the remaining history of
cls/act in tc.

Cheers,
Daniel
On Thu, Apr 30, 2015 at 02:48:39AM +0200, Patrick McHardy wrote:
> Netfilter is based on hook chains. The cost when only using a single
> hook is minimal (as Pablo showed in his numbers), but even if only
> using TC and a single netfilter classifier chain, there has to be some
> relative ordering, and the hooks provide this in a generic way.

Unfortunately, the numbers that Pablo showed are not measuring the right
thing.

> 840203pps 403Mb/sec

This is 20 times less than what it should be. Something else was measured
together with netif_receive_skb.

I've applied these patches and see the following for eth0 + ingress + u32:

18.0 Mpps
  21.43%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb_core
   9.88%  kpktgend_0  [kernel.vmlinux]  [k] kfree_skb
   9.79%  kpktgend_0  [cls_u32]         [k] u32_classify
   9.16%  kpktgend_0  [kernel.vmlinux]  [k] _raw_spin_lock
   8.16%  kpktgend_0  [kernel.vmlinux]  [k] nf_iterate
   5.28%  kpktgend_0  [sch_ingress]     [k] handle_ing
   4.51%  kpktgend_0  [sch_ingress]     [k] ingress_enqueue
   4.42%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify_compat
   3.16%  kpktgend_0  [kernel.vmlinux]  [k] nf_hook_slow
   3.01%  kpktgend_0  [kernel.vmlinux]  [k] ip_rcv
   2.70%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify

without these patches:

22.4 Mpps
  25.89%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb_core
  14.41%  kpktgend_0  [kernel.vmlinux]  [k] kfree_skb
  14.05%  kpktgend_0  [kernel.vmlinux]  [k] _raw_spin_lock
  11.75%  kpktgend_0  [cls_u32]         [k] u32_classify
   6.48%  kpktgend_0  [sch_ingress]     [k] ingress_enqueue
   6.06%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify_compat
   4.16%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify
   3.77%  kpktgend_0  [kernel.vmlinux]  [k] ip_rcv

Clearly, nf_iterate/nf_hook_slow are slowing things down.

I've spent more than a week trying to speed up the ingress qdisc and, so
far, got from 22.4 Mpps to 27.2 Mpps, so this 'generalization' is totally
not acceptable to me.

You're right that for 10 years no one cared about the performance of the
ingress qdisc, but that doesn't mean it's the wrong abstraction and the
wrong architecture. Now I care about its performance, and I hope other
people will too.

So please leave the ingress qdisc alone; this 'generalization' is too
costly. That doesn't mean that netfilter shouldn't have its own hook on
ingress. Without patch 6, the set looks good.
On 29.04, Alexei Starovoitov wrote:
> On Thu, Apr 30, 2015 at 02:48:39AM +0200, Patrick McHardy wrote:
> > Netfilter is based on hook chains. The cost when only using a single
> > hook is minimal (as Pablo showed in his numbers), but even if only
> > using TC and a single netfilter classifier chain, there has to be
> > some relative ordering, and the hooks provide this in a generic way.
>
> Unfortunately, the numbers that Pablo showed are not measuring the
> right thing.
>
> > 840203pps 403Mb/sec
>
> This is 20 times less than what it should be. Something else was
> measured together with netif_receive_skb.
>
> I've applied these patches and see the following for eth0 + ingress + u32:
>
> 18.0 Mpps
>   21.43%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb_core
>    9.88%  kpktgend_0  [kernel.vmlinux]  [k] kfree_skb
>    9.79%  kpktgend_0  [cls_u32]         [k] u32_classify
>    9.16%  kpktgend_0  [kernel.vmlinux]  [k] _raw_spin_lock
>    8.16%  kpktgend_0  [kernel.vmlinux]  [k] nf_iterate
>    5.28%  kpktgend_0  [sch_ingress]     [k] handle_ing
>    4.51%  kpktgend_0  [sch_ingress]     [k] ingress_enqueue
>    4.42%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify_compat
>    3.16%  kpktgend_0  [kernel.vmlinux]  [k] nf_hook_slow
>    3.01%  kpktgend_0  [kernel.vmlinux]  [k] ip_rcv
>    2.70%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify
>
> without these patches:
>
> 22.4 Mpps
>   25.89%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb_core
>   14.41%  kpktgend_0  [kernel.vmlinux]  [k] kfree_skb
>   14.05%  kpktgend_0  [kernel.vmlinux]  [k] _raw_spin_lock
>   11.75%  kpktgend_0  [cls_u32]         [k] u32_classify
>    6.48%  kpktgend_0  [sch_ingress]     [k] ingress_enqueue
>    6.06%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify_compat
>    4.16%  kpktgend_0  [kernel.vmlinux]  [k] tc_classify
>    3.77%  kpktgend_0  [kernel.vmlinux]  [k] ip_rcv
>
> Clearly, nf_iterate/nf_hook_slow are slowing things down.
>
> I've spent more than a week trying to speed up the ingress qdisc and,
> so far, got from 22.4 Mpps to 27.2 Mpps, so this 'generalization' is
> totally not acceptable to me.
>
> You're right that for 10 years no one cared about the performance of
> the ingress qdisc, but that doesn't mean it's the wrong abstraction
> and the wrong architecture. Now I care about its performance, and I
> hope other people will too.

The wrong abstraction is using a qdisc for ingress. An abstraction is not
about performance. Why do you think ingress exists? For queueing? Or as a
hooking point for a bunch of (at ingress) broken actions? You're (one of)
the ones who painfully realized how broken any kind of packet mangling at
that point is. The infrastructure is simply crap and always has been.

> So please leave the ingress qdisc alone; this 'generalization' is too
> costly.

Sorry, we are of the opinion that TC classifiers suck, so we will not leave
that path alone :) Your numbers are well appreciated; we will fix this and
return with better numbers.

> That doesn't mean that netfilter shouldn't have its own hook on
> ingress. Without patch 6, the set looks good.

I don't agree. It would be preferable to optimize the single-hook case, not
only for ingress's sake, but for all the already existing hooks.

Cheers,
Patrick
On 30.04, Daniel Borkmann wrote: > On 04/30/2015 02:37 AM, Patrick McHardy wrote: > >On 30.04, Pablo Neira Ayuso wrote: > >>On the bugfix front, the illegal mangling of shared skb from actions > >>like stateless nat and bpf look also important to be addressed to me. > >>David already suggested to propagate some state object that keeps a > >>pointer to the skb that is passed to the action. Thus, the action can > >>clone it and get the skb back to the ingress path. I started a > >>patchset to do so here, it's a bit large since it requires quite a lot > >>of function signature adjustment. > > > >Jumping in on this point - the fact that roughly 2/3 of TC actions will > >simply BUG under not unlikely circumstances when used in ingress (I went > >through them one by one with Pablo a week ago) is also telling. Nobody > >seems to be using that. All packet mangling actions will BUG while any > >tap is active. Its nothing easily fixed, but apparently nobody has cared > >in ten years. ipt is trivial to crash differently, connmark is as well. > > > >So I'm wondering what are we actually arguing about here. Whether we are > >affecting the performance how fast TC will crash? We *do* actually care > >about these thing, in TC apparently nobody has for the past ten years. > > Totally agree with you that the situation is quite a mess. From tc ingress/ > egress side, at least my use case is to have an as minimal as possible entry > point for cls_bpf/act_bpf, which is what we were working on recently. That > is rather ``fresh'' compared to the remaining history of cls/act in tc. It's more than a mess. Leaving aside the fully broken code at ingress, just look at the TC action APIs. Its "a failed state". TC is as well, but on egress only on the userspace side - Thomas's presentation 'TC - -EWTF' IIRC put it quite well. 
Our long term goal is to fix both, but on ingress, for any realistic case (though my experience is limited) - meaning you have more than a single classifier, some structure in your rules, and more than a single CPU - we'll actually do better right now.

But Alexei's point is well taken. If we can't manage to add our hooks without impacting existing use cases, we'll try harder until we can. The single hook case seems possible to optimize quite easily, to the benefit of all the layers, and for people using both, we'll try to compete on a different level.
On 04/29/15 21:34, Patrick McHardy wrote:
> On 29.04, Alexei Starovoitov wrote:
>> On Thu, Apr 30, 2015 at 02:48:39AM +0200, Patrick McHardy wrote:

[..]

>>> 840203pps 403Mb/sec
>>
>> this is 20 times less than what they should be.
>> Something else was measured together with netif_receive_skb.
>>
>> I've applied these patches and see the following
>> for eth0 + ingress + u32:
>>
>> 18.0 Mpps

[..]

>> without these patches:
>>
>> 22.4 Mpps
>>   25.89% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core
>>   14.41% kpktgend_0 [kernel.vmlinux] [k] kfree_skb
>>   14.05% kpktgend_0 [kernel.vmlinux] [k] _raw_spin_lock
>>   11.75% kpktgend_0 [cls_u32]        [k] u32_classify
>>    6.48% kpktgend_0 [sch_ingress]    [k] ingress_enqueue
>>    6.06% kpktgend_0 [kernel.vmlinux] [k] tc_classify_compat
>>    4.16% kpktgend_0 [kernel.vmlinux] [k] tc_classify
>>    3.77% kpktgend_0 [kernel.vmlinux] [k] ip_rcv
>>
>> clearly nf_iterate/nf_hook_slow are slowing things down.
>>
>> I've spent more than a week trying to speed up the ingress qdisc
>> and, so far, got from 22.4 Mpps to 27.2 Mpps,
>> so this 'generalization' is totally not acceptable to me.
>>
>> You're right that for 10 years no one cared about performance
>> of ingress qdisc, but that doesn't mean it's a wrong abstraction
>> and wrong architecture. Now I care about its performance and
>> I hope other people will do too.

Well, even if you didn't touch tc and got rid of the lock, it would still be about a magnitude faster than netfilter ;->

How about I call netfilter crap, Patrick? It is slower than molasses. For more than 10+ years no one has found good ways to bring it up to speed.

For sure the attempt to re-use netfilter code from tc has failed. I think that's what you keep repeating as something that will crash. Apart from some help from Pablo in recent times, this re-use has been a disaster even when linking in user space (mostly from lack of backward compatibility).
It seems to me we may just have to drop support for ipt and anything netfilter from tc. And at some point, when we need those features, we'll just have to write faster versions of them. eBPF may be helpful.

> The wrong abstraction is using a qdisc for ingress. An abstraction
> is not about performance. Why do you think ingress exists? For
> queueing? Or as providing a hooking point for a bunch of broken
> (at ingress) actions? You're (one of) the ones who painfully realized
> how broken any kind of packet mangling at that point is. The
> infrastructure is simply crap and always has been.

What abstraction is broken? The ingress expects L3 headers. The egress expects L2 headers. I don't think there is disagreement that the ingress should see L2 headers. The view is that it will be a bit of work. Discussion is ongoing to resolve this without penalizing performance. You don't care about performance - netfilter is all about how to have nice usability. I care about usability, but not as much as I do about performance.

>> So please leave ingress qdisc alone, this 'generalization'
>> is too costly.
>
> Sorry, we are of the opinion that TC classifiers suck, so we will
> not leave that path alone :) Your numbers are well appreciated,
> we will fix this and return with better numbers.
>
>> That doesn't mean that netfilter shouldn't have its own hook
>> on ingress. Without patch 6, the set looks good.
>
> I don't agree. It would be preferable to optimize the single hook
> case not only for ingress's sake, but for all the already existing
> hooks.

And I strongly disagree. Netfilter is slow as a boring hot day. Please don't enforce it on us. I thought you were kidding when you said you want to move ingress back to netfilter. We ran away from there years ago.

cheers,
jamal
On 04/29/15 21:43, Patrick McHardy wrote:
> On 30.04, Daniel Borkmann wrote:
>> On 04/30/2015 02:37 AM, Patrick McHardy wrote:
>>> On 30.04, Pablo Neira Ayuso wrote:
>
>> Totally agree with you that the situation is quite a mess. From tc ingress/
>> egress side, at least my use case is to have an as minimal as possible entry
>> point for cls_bpf/act_bpf, which is what we were working on recently. That
>> is rather ``fresh'' compared to the remaining history of cls/act in tc.
>
> It's more than a mess. Leaving aside the fully broken code at ingress,
> just look at the TC action APIs. It's "a failed state".

Since you've repeated about 100 times that the tc API is broken, maybe you can explain more rationally? By that I mean don't use words like "crap" or "failed state", and no chest-thumping. Let's say we totally stopped trying to reuse netfilter code - what are you talking about?

I think there is confusion about usability vs. the merits of performance.

cheers,
jamal
On 29.04, Jamal Hadi Salim wrote:
> >> You're right that for 10 years no one cared about performance
> >> of ingress qdisc, but that doesn't mean it's a wrong abstraction
> >> and wrong architecture. Now I care about its performance and
> >> I hope other people will do too.
>
> Well, even if you didn't touch tc and got rid of the lock,
> it would still be about a magnitude faster than netfilter ;->
> How about I call netfilter crap, Patrick? It is slower than
> molasses. For more than 10+ years no one has found good ways
> to bring it up to speed.

Quite frankly, that is ridiculous. I realize we are turning in circles in this discussion, so we will provide the numbers. Give us a test case if you want, something you think is hard.

> For sure the attempt to re-use netfilter code from tc has failed.

Agreed.

> I think that's what you keep repeating as something that will
> crash.

*Everything* except the purely passive ones, such as the policer, will crash. Everything. It's well known. No point in repeating it.

> Apart from some help from Pablo in recent times - this re-use
> has been a disaster even when linking in user space (mostly from
> lack of backward compatibility).
> It seems to me we may just have to drop support for ipt and anything
> netfilter from tc.
> And at some point we are going to need those features, we'll just need
> to write faster versions of them. eBPF may be helpful.

I stopped caring about TC a long time ago, basically when actions were merged over my head, despite me being the only person actually taking care of TC. You might remember, I spent multiple months cleaning up the action mess when it was merged, and that was only the *really* bad bugs. Magic numbers in the APIs, lists limited to more magic numbers, etc. I didn't care. Today I do not think it's about individual problems anymore; the entire thing is broken.
Egress shares some problems with ingress, such as horrible userspace code, horrible error reporting, and no way to use it even for me without looking at both the TC and kernel code at the same time. But ingress is worse in the sense that the supermajority of it will actually crash in very likely cases, and it's not about ipt - it's everything touching a packet in the presence of a tap, like 2/3 of all actions. The qdiscs are mostly fine (except the integration of classifiers); everything else is crap.

> > The wrong abstraction is using a qdisc for ingress. An abstraction
> > is not about performance. Why do you think ingress exists? For
> > queueing? Or as providing a hooking point for a bunch of broken
> > (at ingress) actions? You're (one of) the ones who painfully realized
> > how broken any kind of packet mangling at that point is. The
> > infrastructure is simply crap and always has been.
>
> What abstraction is broken? The ingress expects L3 headers. The egress
> expects L2 headers. I don't think there is disagreement that the ingress
> should see L2 headers. The view is that it will be a bit of work.
> Discussion is ongoing to resolve this without penalizing performance.
> You don't care about performance - netfilter is all about how to have
> nice usability. I care about usability but not as much as I do
> about performance.

The abstraction is *wrong*. There is no queueing, hence using a qdisc is the wrong abstraction. Why are we arguing about that? It's a mechanism to invoke classifiers and actions. What is the qdisc actually used for? It certainly doesn't queue anything; the only thing it's doing is imposing the worst locking imaginable on the entire path.

If that's not enough, look at the actions. I mean, you're not classifying for the sake of it - there are not even classes on ingress - it's purely done for the actions. And they are, as stated, almost completely broken. We can do policing. That actually works. Maybe one or two more.
Except policing, none of those is even remotely related to QoS. So let me ask again, in what sense is the abstraction actually right?

> >> So please leave ingress qdisc alone, this 'generalization'
> >> is too costly.
> >
> > Sorry, we are of the opinion that TC classifiers suck, so we will
> > not leave that path alone :) Your numbers are well appreciated,
> > we will fix this and return with better numbers.
> >
> >> That doesn't mean that netfilter shouldn't have its own hook
> >> on ingress. Without patch 6, the set looks good.
> >
> > I don't agree. It would be preferable to optimize the single hook
> > case not only for ingress's sake, but for all the already existing
> > hooks.
>
> And I strongly disagree. Netfilter is slow as a boring hot day.
> Please don't enforce it on us. I thought you were kidding when you said
> you want to move ingress back to netfilter. We ran away from there
> years ago.

Quite frankly, talking about netfilter at all is so wrong that there isn't even a possibility to respond. Netfilter is a mechanism - a hook to receive a packet. And even that kicks your single-threaded ingress qdisc's ass every packet of the day. What we're talking about are the things built on top. There's no question we also win on hardware that's not from 1975 there, because people are actually using it.
On 29.04, Jamal Hadi Salim wrote:
> On 04/29/15 21:43, Patrick McHardy wrote:
> > On 30.04, Daniel Borkmann wrote:
> >> On 04/30/2015 02:37 AM, Patrick McHardy wrote:
> >>> On 30.04, Pablo Neira Ayuso wrote:
>
> >> Totally agree with you that the situation is quite a mess. From tc ingress/
> >> egress side, at least my use case is to have an as minimal as possible entry
> >> point for cls_bpf/act_bpf, which is what we were working on recently. That
> >> is rather ``fresh'' compared to the remaining history of cls/act in tc.
> >
> > It's more than a mess. Leaving aside the fully broken code at ingress,
> > just look at the TC action APIs. It's "a failed state".
>
> Since you've repeated about 100 times that the tc API is broken, maybe
> you can explain more rationally? By that I mean don't use words
> like "crap" or "failed state", and no chest-thumping.
> Let's say we totally stopped trying to reuse netfilter code - what are
> you talking about?
>
> I think there is confusion about usability vs. the merits of performance.

I noticed tons of semantical holes when actually working on this, but let me just give a single example which is still at the top of my head, tcf_action_init():

	for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
		act = tcf_action_init_1(net, tb[i], est, name, ovr, bind);

So the API allows up to *magic number* (32) actions in one list, but not more. That many will be atomic at least from an rtnetlink perspective, but certainly not from a packet processing perspective; from that one, they will simply be processed as they are instantiated. So we simulate atomicity (what netlink is about) for a magic number of elements, but do not even provide it at runtime. Even ignoring the harder problem of having transactional semantics for multiple actions, why even support *magic number* elements in one message and pretend atomicity? Smells a lot like premature optimization, ignoring semantical expectations.
And yes, I do think it breaks with every concept of rtnetlink (messages are atomic) for no reason at all. I remember TC actions were full of these "special" interpretations. If you insist, I can do a review again, but I did all that and stated it years ago.

Regarding TC as a whole, I think the problem is shared between userspace and the kernel. iproute TC certainly has completely failed: it's unusable without looking at the kernel and iproute code, and it hasn't made the slightest infrastructural progress in the past 15 years (*(u16)RTA_DATA(bla))?) - and that is telling in itself. I don't extend that to ip, even though it suffers from the same coding problems, but let's be honest, do you really expect some magic is going to happen and TC userspace will suddenly become usable?

I don't. And we intend to provide an alternative.
On 30.04, Patrick McHardy wrote:
> Regarding TC as a whole, I think the problem is shared between userspace
> and the kernel. iproute TC certainly has completely failed: it's unusable
> without looking at the kernel and iproute code, and it hasn't made the
> slightest infrastructural progress in the past 15 years (*(u16)RTA_DATA(bla))?)
> - and that is telling in itself. I don't extend that to ip, even though it
> suffers from the same coding problems, but let's be honest, do you really
> expect some magic is going to happen and TC userspace will suddenly become
> usable?
>
> I don't. And we intend to provide an alternative.

For the sake of completeness - on ingress, this alternative means simply ditching TC. On egress, we will provide a nicely integrated way of qdiscs and classifiers where nftables will obviously respect the qdiscs. On ingress, there is simply no benefit in doing that at all. Hence Pablo's patches.
On Thu, Apr 30, 2015 at 06:05:37AM +0200, Patrick McHardy wrote:

> 06:05:37AM

do you ever sleep? ;)

> For the sake of completeness - on ingress, this alternative means simply
> ditching TC.

I'm sure 'ditching TC' doesn't mean killing the ingress qdisc completely. Compatibility needs to be preserved. What I'm doing in my 'experimental ingress qdisc' acceleration boils down to:

@@ -1649,6 +1649,7 @@ struct net_device {
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
-	struct netdev_queue __rcu *ingress_queue;
+	struct tcf_proto __rcu	*ingress_filter_list;

so to call tc_classify() and reach cls_bpf I don't need to walk down skb->dev->ingress_queue->qdisc->enqueue/qdisc_priv->filter_list and can just do skb->dev->ingress_filter_list, and can skip several unnecessary derefs like sch->stab, skb->len, etc. Both the ingress_queue and the ingress qdisc are no longer allocated and stay only as a shim to preserve uapi.

My point is that I agree that a cleanup of the ingress qdisc is needed. I disagree with drastic measures. Just add your nf_hook to ingress and let's see how things evolve. We have rx_handler and all of the ptype hooks in there. One can argue that rx_handler overlaps with nf_hook too ;) We cannot generalize them all under one 'hook' infra. nf needs to do nf_hook_state_init() and pass it around, which no one else needs. That's a cost others should not pay.
On 04/30/2015 08:02 AM, Alexei Starovoitov wrote:
...
> My point is that I agree that a cleanup of the ingress qdisc is needed.
> I disagree with drastic measures.
> Just add your nf_hook to ingress and let's see how things evolve.
> We have rx_handler and all of the ptype hooks in there. One can argue
> that rx_handler overlaps with nf_hook too ;)
> We cannot generalize them all under one 'hook' infra.
> nf needs to do nf_hook_state_init() and pass it around, which
> no one else needs. That's a cost others should not pay.

+1
On Wed, Apr 29, 2015 at 06:16:35PM -0700, Alexei Starovoitov wrote:
> On Thu, Apr 30, 2015 at 02:48:39AM +0200, Patrick McHardy wrote:
> > Netfilter is based on hook chains. The cost when only using a single hook
> > is minimal (as Pablo showed in his numbers), but even if only using
> > TC and a single netfilter classifier chain, there has to be some relative
> > ordering and the hooks provide this in a generic way.
>
> Unfortunately the numbers that Pablo has shown are not measuring
> the right thing.

The numbers I've shown were measuring the critical ingress path. I never posted numbers on the generic hook infrastructure.

> > 840203pps 403Mb/sec
>
> this is 20 times less than what they should be.

It's ~10% less, because obviously we run more code in the generic hook infrastructure, as it provides more features. So far it was only possible to register one single chain at ingress. Eric Dumazet already indicated that this benchmark only measures *one single CPU* with the icache fully populated. The numbers with SMP will get worse because of the central lock contention.

These are the numbers I got banging *one single CPU*:

* Without patches + qdisc ingress:

Result: OK: 16298126(c16298125+d0) usec, 10000000 (60byte,0frags)
  613567pps 294Mb/sec (294512160bps) errors: 10000000

* With patches + qdisc ingress on top of hooks:

Result: OK: 18339281(c18339280+d0) usec, 10000000 (60byte,0frags)
  545277pps 261Mb/sec (261732960bps) errors: 10000000

* With patches + nftables ingress chain:

Result: OK: 17118167(c17118167+d0) usec, 10000000 (60byte,0frags)
  584174pps 280Mb/sec (280403520bps) errors: 10000000
On Thu, Apr 30, 2015 at 11:24:57AM +0200, Daniel Borkmann wrote:
> On 04/30/2015 08:02 AM, Alexei Starovoitov wrote:
> ...
> > My point is that I agree that a cleanup of the ingress qdisc is needed.
> > I disagree with drastic measures.
> > Just add your nf_hook to ingress and let's see how things evolve.
> > We have rx_handler and all of the ptype hooks in there. One can argue
> > that rx_handler overlaps with nf_hook too ;)
> > We cannot generalize them all under one 'hook' infra.
> > nf needs to do nf_hook_state_init() and pass it around, which
> > no one else needs. That's a cost others should not pay.
>
> +1

Actually, the state object can be useful to resolve the major bug in actions that mangle skbs in an illegal way, as we can use it to pass the new skb_share_check()'ed skb back to the ingress path. The genericity that the state object introduces comes with a cost, no doubt, but it helps to extend things later on and to resolve tricky situations like the one above without large patches to propagate new state information all over the code.

Regarding the performance argument that keeps repeating over and over again, we are all quite aware here that there is *good room for improvement* in qdisc ingress itself...
On 04/29/15 23:11, Patrick McHardy wrote:
> Quite frankly, that is ridiculous. I realize we are turning in circles
> in this discussion, so we will provide the numbers. Give us a test
> case if you want, something you think is hard.

The usual tests I run are throughput and latency of the datapath. Then, more importantly, update rates of policies - with and without traffic. Start with zero rules. Add them logarithmically (with and without traffic running), i.e. in order of {0, 1, 10, 100, 1000, ...}. With a single rule you don't notice much difference. Start adding rules and it becomes very obvious. The latency for adding rule 1001 when there are, say, 1000 rules already added is important from a control perspective - whether it stays linear. I am sure this would make a good paper. Where are the firewo/men when you need them?

> > For sure the attempt to re-use netfilter code from tc has failed.
>
> Agreed.

And I do note that historically, netfilter folks have not been very willing to help make this better over the years. Actually, I would trace that back to Harald's time. Pablo has been a lot more helpful in recent times, but now your answer is essentially to subsume tc into netfilter ;-> Great. I could send a patch to delete it, but I worry since people do send bug reports.

> > I think that's what you keep repeating as something that will
> > crash.
>
> *Everything* except the purely passive ones, such as the policer, will crash.
> Everything. It's well known. No point in repeating it.

Please send fixes, or pointers, or even private mail, so either myself or someone else can fix it. I am sure if you stared long enough at any code you'd find bugs - as would be the case for netfilter. It depends how much time you are willing to invest. But to just handwave with "fire, fire" is not helpful. In the engineering world, people come up with and use solutions, and when they find new use cases they either fix the bugs that existed or add new features. This is how it works in Linux. Bugs _do get fixed_.
I know it is hard to accept for people who take a lot of pride in their work. To illustrate with an example: I saw you mentioning the stateless nat action in the other email; I happen to know that it is _heavily_ used by some large players in very serious environments. You are probably using tc nat without knowing it. Are there bugs? Possibly, yes. Do they apply to those two big systems? Very unlikely, otherwise they'd have been fixed by now. We don't intentionally keep bugs - but we don't aim for perfection either. Let's refactor and fix bugs as needed.

I've known you for years and I know you genuinely care about bugs - so I am in no way trying to diminish what you are saying. I have just never agreed when you get outraged and try to kill something because you found bugs. IOW, the logic of "there are bugs in there, let's burn down the house" is not how we operate.

> I stopped caring about TC a long time ago, basically when actions were
> merged over my head, despite me being the only person actually taking
> care of TC. You might remember, I spent multiple months cleaning up the
> action mess when it was merged, and that was only the *really* bad bugs.
> Magic numbers in the APIs, lists limited to more magic numbers etc,

Yes, you fixed bugs, as did other people over the years. And I have no doubt you introduced bugs while you were doing that, as did other people. Magic numbers? Seriously? I am sure people are still fixing magic numbers in netfilter this century, and possibly this year. If you care, fix it.

Note: There is a difference between fixing bugs and trying to reshape code into your view of the world. The struggle over the years was you trying to reshape things to your world - as it is in this case.

> I didn't care. Today I do not think it's about individual problems anymore,
> the entire thing is broken.
> Egress has some shared problems with ingress,
> such as horrible userspace code, horrible error reporting, no way to use it
> even for me without looking at both TC and kernel code at the same time, but
> ingress is worse in the sense that the supermajority of it will actually
> crash in very likely cases, and it's not about ipt, it's everything touching
> a packet in the presence of a tap, like 2/3 of all actions. The qdiscs are
> mostly (except integration of classifiers) fine, everything else is crap.

Again, provide fixes or pointers so things can be fixed.

> The abstraction is *wrong*. There is no queueing, hence using a qdisc
> is the wrong abstraction. Why are we arguing about that? It's a
> mechanism to invoke classifiers and actions. What is the qdisc actually
> used for?

With all due respect, that is a very flawed argument in the Linux world. Let me give you an example. People use netdevs because they provide a nice abstraction and tooling to go with it. Shall we kill all those netdevs which are conveniences that take a packet in and out, because, hell, they have nothing to do with ports? What about xfrm? Etc. The qdisc provided all those niceties.

> It certainly doesn't queue anything, the only thing it's
> doing is imposing the worst locking imaginable on the entire path.

And yet despite this "worst locking imaginable", it is still faster than your lockless approach.

> If that's not enough, look at the actions. I mean, you're not classifying
> for the sake of it, there are not even classes on ingress, it's purely
> done for the actions. And they are, as stated, almost completely
> broken. We can do policing. That actually works. Maybe one or two more.
> Except policing, none of those is even remotely related to QoS.
> So let me ask again, in what sense is the abstraction actually right?

In the sense I described above.

> Quite frankly, talking about netfilter at all is so wrong that there
> isn't even a possibility to respond.
> Netfilter is a mechanism - a hook
> to receive a packet. And even that kicks your single-threaded ingress
> qdisc's ass every packet of the day. What we're talking about are the
> things built on top. There's no question we also win on hardware that's
> not from 1975 there, because people are actually using it.

Netfilter is a lot more usable - no doubt about it. It has always been very good. But there are caveats. At one point I think we were thankful all the crackheads were using netfilter instead of tc, because tc was harder to understand ;-> In retrospect I think that was wrong ;->

So let me see if I can summarize your argument: hardware is faster than in 1975, therefore let's just use netfilter for usability reasons. That's what all those kids using Ruby on Rails are arguing for. And there is room for that. I claim there is still room for C. Heck, we are even trying to move things into hardware so we can go faster. But I admit to being envious sometimes when I see code written very rapidly, even though it makes me puke when I see how crappy the performance is. I still haven't learnt to accept that - even though I have come to terms with accepting bugs.

cheers,
jamal
Hi jamal,

On Thu, Apr 30, 2015 at 07:55:22AM -0400, Jamal Hadi Salim wrote:
[...]
> Start with zero rules. Add them logarithmically (with and without
> traffic running), i.e. in order of {0, 1, 10, 100, 1000, ...}.
> With a single rule you don't notice much difference. Start adding rules
> and it becomes very obvious.

I think the days of linear ruleset performance competitions are over. We have better data structures that allow users to arrange the ruleset through multidimensional dictionaries and arbitrary state flows, minimizing the number of inspections - which is what harms performance when it comes to packet classification.

> ... but now your answer is essentially to subsume tc into netfilter
> ;-> Great.

We're not subsuming anything under netfilter; that hook infrastructure is generic, and qdisc ingress will not depend on layer 3 hooks. It's just a bit more code to allow more than one single chain from ingress, and potentially to support more features from that entry point in an optional fashion.

> ... IOW, the logic of "there are bugs
> in there, let's burn down the house" is not how we operate.

We're not burning down any house; after this patchset, qdisc ingress will still be there. Qdisc ingress can keep going: fix the bugs, improve its performance - there is a lot of room for it in its own niche, through the flexibility that the generic hook infrastructure provides.

> Netfilter is a lot more usable - no doubt about it. It has always
> been very good. But there are caveats.
> At one point I think we were thankful all the crackheads were using
> netfilter instead of tc because tc was harder to understand ;->
> In retrospect I think that was wrong ;->
>
> So let me see if I can summarize your argument:
> Hardware is faster than in 1975, therefore let's just use netfilter
> for usability reasons.
[...]
You keep saying that qdisc ingress outperforms; that's right only by a very slight difference when comparing with no rules on a single CPU (when ported to the common playground of the generic hook infrastructure). On SMP, nftables will outperform it, even more so if the ruleset is arranged in a non-linear-list fashion, with all the new tricks that we have got.

Anyway, let's take this "nftables vs. qdisc ingress" discussion to an end. I think the main point of this discussion is to provide a generic entry point to ingress filtering (for both qdisc ingress and nftables) that, if unused, doesn't harm performance of the critical __netif_receive_skb_core() path at all. Thus, users can choose what they want. I have heard you say several times: "To each their poison", and I like that.

Thanks.
On 04/30/2015 05:33 PM, Pablo Neira Ayuso wrote:
...
> On Thu, Apr 30, 2015 at 07:55:22AM -0400, Jamal Hadi Salim wrote:
> [...]
>> Start with zero rules. Add them logarithmically (with and without
>> traffic running), i.e. in order of {0, 1, 10, 100, 1000, ...}.
>> With a single rule you don't notice much difference. Start adding rules
>> and it becomes very obvious.
>
> I think the days of linear ruleset performance competitions are over,

Totally agree with you. You want to have a single classification pass that parses the packet once and comes to a verdict immediately.

> we have better data structures that allow users to arrange the ruleset
> through multidimensional dictionaries and arbitrary state
> flows, minimizing the number of inspections - which is what harms
> performance when it comes to packet classification.

I think both have different use cases, though; but on the cls_bpf side you have a maps infrastructure that is evolving as well. I'm not really speaking about the other remaining classifiers, however. I also don't want to go any further into this vim vs. emacs debate. ;) And, personally, I also don't have any issue with offering alternatives to users.

However, I still disagree with moving ingress behind this artificial barrier if it's just not necessary. I believe, in your RFC v1 patch, you had a second ingress hook as a static key for nft; I tend to like that much better consensus-wise. Both subsystems should not put unnecessary barriers into each other's way, really.

Best,
Daniel
On Thu, Apr 30, 2015 at 06:09:25PM +0200, Daniel Borkmann wrote:
> I think both have different use cases, though, but on cls_bpf side you
> have maps infrastructure that is evolving as well. Not really speaking
> about the other remaining classifiers, however. I also don't want to go
> any further into this vim vs emacs debate. ;) And, personally, I also
> don't have any issue offering alternatives to users.
>
> However, I still disagree with moving ingress behind this artificial
> barrier if it's just not necessary. I believe, in your RFC v1 patch,
> you had a second ingress hook as a static key for nft, I tend to like
> that much better consensus-wise. Both subsystems should not put
> unnecessary barriers into their way, really.

I'm evolving to think that it would be good to have a single entry
point for ingress filtering. But where are the barriers? These
unfounded performance claims are simply absurd, qdisc ingress barely
performs a bit better just because it executes a bit less code and
only in the single CPU scenario with no rules at all.
On Thu, Apr 30, 2015 at 12:12:04PM +0200, Pablo Neira Ayuso wrote:
>
> These are the numbers I got banging *one single CPU*:
>
> * Without patches + qdisc ingress:
>
> Result: OK: 16298126(c16298125+d0) usec, 10000000 (60byte,0frags)
>   613567pps 294Mb/sec (294512160bps) errors: 10000000
>
> * With patches + qdisc ingress on top of hooks:
>
> Result: OK: 18339281(c18339280+d0) usec, 10000000 (60byte,0frags)
>   545277pps 261Mb/sec (261732960bps) errors: 10000000
>
> * With patches + nftables ingress chain:
>
> Result: OK: 17118167(c17118167+d0) usec, 10000000 (60byte,0frags)
>   584174pps 280Mb/sec (280403520bps) errors: 10000000

So in other words you're saying: tc has to live with 12% slowdown
(613k / 545k) only because _you_ want one hook for both nft and tc?!

The numbers from my box are 22.4 Mpps vs 18 Mpps, which is 24% slowdown
for TC due to nf_hook.

Notice I'm seeing _millions_ of packets per second processed by
netif_receive_skb->ingress_qdisc->u32 whereas you're talking about
_thousands_. Even if your box is very old, it still doesn't explain
this huge difference. Please post 'perf report' numbers, so we can
help analyze what is actually being measured. I bet netif_receive_skb
is not even in top 10.
On 04/30/2015 06:36 PM, Pablo Neira Ayuso wrote:
...
> But where are the barriers? These unfounded performance claims are
> simply absurd, qdisc ingress barely performs a bit better just because
> it executes a bit less code and only in the single CPU scenario with
> no rules at all.

I think we're going in circles a bit. :( You are right in saying that
currently, there's a central spinlock, which is worked on to get rid
of, you've seen the patch on the list floating around already. Single
CPU, artificial micro-benchmark, which were done show that you see on
your machine ~613Kpps to ~545Kpps, others have seen it more amplified
as 22.4Mpps to 18.0Mpps drop from __netif_receive_skb_core() up to an
empty dummy u32_classify() rule, which has already been acknowledged
that this gap needs to be improved. Lets call it unfounded then. I
think we wouldn't even have this discussion if we wouldn't try brute
forcing both worlds behind this single static key, or, have both
invoked from within the same layer/list.

Cheers,
Daniel
On 04/30/2015 09:16 PM, Daniel Borkmann wrote:
> On 04/30/2015 06:36 PM, Pablo Neira Ayuso wrote:
> ...
>> But where are the barriers? These unfounded performance claims are
>> simply absurd, qdisc ingress barely performs a bit better just because
>> it executes a bit less code and only in the single CPU scenario with
>> no rules at all.
>
> I think we're going in circles a bit. :( You are right in saying that
> currently, there's a central spinlock, which is worked on to get rid
> of, you've seen the patch on the list floating around already. Single
> CPU, artificial micro-benchmark, which were done show that you see on
> your machine ~613Kpps to ~545Kpps, others have seen it more amplified
> as 22.4Mpps to 18.0Mpps drop from __netif_receive_skb_core() up to an
> empty dummy u32_classify() rule, which has already been acknowledged
> that this gap needs to be improved. Lets call it unfounded then. I
> think we wouldn't even have this discussion if we wouldn't try brute
> forcing both worlds behind this single static key, or, have both
> invoked from within the same layer/list.

Ok, out of curiosity, I did the same as both of you: I'm using a pretty
standard Supermicro X10SLM-F/X10SLM-F, Xeon E3-1240 v3.

*** ingress + dummy u32, net-next:

w/o perf: ...
Result: OK: 5157948(c5157388+d559) usec, 100000000 (60byte,0frags)
  19387551pps 9306Mb/sec (9306024480bps) errors: 100000000

perf record -C0 -ecycles:k ./pktgen.sh p1p1
...
Result: OK: 5182638(c5182057+d580) usec, 100000000 (60byte,0frags)
  19295191pps 9261Mb/sec (9261691680bps) errors: 100000000

  26.07%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
  14.39%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
  13.69%  kpktgend_0  [cls_u32]          [k] u32_classify
  11.75%  kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
   5.34%  kpktgend_0  [sch_ingress]      [k] ingress_enqueue
   5.21%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify_compat
   4.93%  kpktgend_0  [kernel.kallsyms]  [k] skb_defer_rx_timestamp
   3.41%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
   3.21%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   3.16%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify
   3.08%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
   2.05%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
   1.60%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
   1.15%  kpktgend_0  [kernel.kallsyms]  [k] classify
   0.45%  kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip

*** nf hook infra + ingress + dummy u32, net-next:

w/o perf: ...
Result: OK: 6555903(c6555744+d159) usec, 100000000 (60byte,0frags)
  15253426pps 7321Mb/sec (7321644480bps) errors: 100000000

perf record -C0 -ecycles:k ./pktgen.sh p1p1
...
Result: OK: 6591291(c6591153+d138) usec, 100000000 (60byte,0frags)
  15171532pps 7282Mb/sec (7282335360bps) errors: 100000000

  25.94%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
  12.19%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
  11.00%  kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
  10.58%  kpktgend_0  [cls_u32]          [k] u32_classify
   5.34%  kpktgend_0  [sch_ingress]      [k] handle_ing
   4.68%  kpktgend_0  [kernel.kallsyms]  [k] nf_iterate
   4.33%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify_compat
   4.32%  kpktgend_0  [sch_ingress]      [k] ingress_enqueue
   3.62%  kpktgend_0  [kernel.kallsyms]  [k] skb_defer_rx_timestamp
   2.95%  kpktgend_0  [kernel.kallsyms]  [k] nf_hook_slow
   2.75%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
   2.60%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify
   2.52%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
   2.50%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   1.77%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
   1.28%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
   0.94%  kpktgend_0  [kernel.kallsyms]  [k] classify
   0.38%  kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip

*** drop ingress spinlock (patch w/ bstats addition) + ingress + dummy u32, net-next:

w/o perf: ...
Result: OK: 4789828(c4789353+d474) usec, 100000000 (60byte,0frags)
  20877576pps 10021Mb/sec (10021236480bps) errors: 100000000

perf record -C0 -ecycles:k ./pktgen.sh p1p1
...
Result: OK: 4829276(c4828437+d839) usec, 100000000 (60byte,0frags)
  20707036pps 9939Mb/sec (9939377280bps) errors: 100000000

  33.11%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
  15.27%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
  14.60%  kpktgend_0  [cls_u32]          [k] u32_classify
   6.06%  kpktgend_0  [sch_ingress]      [k] ingress_enqueue
   5.55%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify_compat
   5.31%  kpktgend_0  [kernel.kallsyms]  [k] skb_defer_rx_timestamp
   3.77%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   3.45%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
   3.33%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify
   3.33%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
   2.34%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
   1.78%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
   1.15%  kpktgend_0  [kernel.kallsyms]  [k] classify
   0.48%  kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip

That means, here, moving ingress behind nf hooks, I see a similar
slowdown in this micro-benchmark as Alexei, with a really worst case
of ~27%. Now in the real world that might probably just end up as a
few percent depending on the use case, but really, why should we go
down that path if we can just avoid it?

If you find a way where both tc/nf hooks are triggered from within the
same list, then that would probably look better already. Or, as a
start, as mentioned, with a second static key for netfilter, which can
later on then still be reworked for a better integration, although I
agree with you that it's less clean and I see the point of
consolidating code. If you want, I'm happy to provide numbers if you
have a next set as well, feel free to ping me.

Thanks,
Daniel
Pablo,

On 04/30/15 11:33, Pablo Neira Ayuso wrote:
>
> You keep saying that qdisc ingress outperforms, that's only right for
> just a very slight difference when comparing it with no rules on
> single CPU (when ported to the common playground of the generic hook
> infrastructure). On SMP nftables will outperform, even more if the
> ruleset is arranged in a non-linear list fashion, with all the new
> tricks that we got.
>

I am interested to see the numbers. I think this would be a great
paper; it is extremely tempting to spend time on it.

> Anyway, let's take this "nftables vs. qdisc ingress" discussion to an
> end. I think the main point of this discussion is to provide a generic
> entry point to ingress filtering (for both qdisc ingress and nftables)
> that, if unused, doesn't harm performance of the critical
> netif_receive_core() path at all. Thus, users can choose what they
> want, I have heard you saying several times: "To each their poison"
> and I like that.
>

Yes - but my good friend Patrick is not saying that. I don't want to
turn on netfilter in order to get tc actions on ingress. And I don't
want to be slowed down because now the code path has become longer. We
are trying to prune the code path. If somehow you can work to not
affect performance then we can live well together.

cheers,
jamal
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 2da5d10..5b2cc04 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -79,19 +79,6 @@ static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
 
 struct netdev_queue *dev_ingress_queue_create(struct net_device *dev);
 
-#ifdef CONFIG_NET_CLS_ACT
-void net_inc_ingress_queue(void);
-void net_dec_ingress_queue(void);
-#else
-static inline void net_inc_ingress_queue(void)
-{
-}
-
-static inline void net_dec_ingress_queue(void)
-{
-}
-#endif
-
 extern void rtnetlink_init(void);
 extern void __rtnl_unlock(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fa8a262..af03263 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1631,22 +1631,6 @@ int call_netdevice_notifiers(unsigned long val, struct net_device *dev)
 }
 EXPORT_SYMBOL(call_netdevice_notifiers);
 
-#ifdef CONFIG_NET_CLS_ACT
-static struct static_key ingress_needed __read_mostly;
-
-void net_inc_ingress_queue(void)
-{
-	static_key_slow_inc(&ingress_needed);
-}
-EXPORT_SYMBOL_GPL(net_inc_ingress_queue);
-
-void net_dec_ingress_queue(void)
-{
-	static_key_slow_dec(&ingress_needed);
-}
-EXPORT_SYMBOL_GPL(net_dec_ingress_queue);
-#endif
-
 static struct static_key netstamp_needed __read_mostly;
 #ifdef HAVE_JUMP_LABEL
 /* We are not allowed to call static_key_slow_dec() from irq context
@@ -3521,67 +3505,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-#ifdef CONFIG_NET_CLS_ACT
-/* TODO: Maybe we should just force sch_ingress to be compiled in
- * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
- * a compare and 2 stores extra right now if we dont have it on
- * but have CONFIG_NET_CLS_ACT
- * NOTE: This doesn't stop any functionality; if you dont have
- * the ingress scheduler, you just can't add policies on ingress.
- *
- */
-static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
-{
-	struct net_device *dev = skb->dev;
-	u32 ttl = G_TC_RTTL(skb->tc_verd);
-	int result = TC_ACT_OK;
-	struct Qdisc *q;
-
-	if (unlikely(MAX_RED_LOOP < ttl++)) {
-		net_warn_ratelimited("Redir loop detected Dropping packet (%d->%d)\n",
-				     skb->skb_iif, dev->ifindex);
-		return TC_ACT_SHOT;
-	}
-
-	skb->tc_verd = SET_TC_RTTL(skb->tc_verd, ttl);
-	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
-
-	q = rcu_dereference(rxq->qdisc);
-	if (q != &noop_qdisc) {
-		spin_lock(qdisc_lock(q));
-		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			result = qdisc_enqueue_root(skb, q);
-		spin_unlock(qdisc_lock(q));
-	}
-
-	return result;
-}
-
-static inline struct sk_buff *handle_ing(struct sk_buff *skb,
-					 struct packet_type **pt_prev,
-					 int *ret, struct net_device *orig_dev)
-{
-	struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
-
-	if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	switch (ing_filter(skb, rxq)) {
-	case TC_ACT_SHOT:
-	case TC_ACT_STOLEN:
-		kfree_skb(skb);
-		return NULL;
-	}
-
-	return skb;
-}
-#endif
-
 /**
  *	netdev_rx_handler_register - register receive handler
  *	@dev: device to register a handler for
@@ -3726,16 +3649,6 @@ another_round:
 	}
 skip_taps:
-#ifdef CONFIG_NET_CLS_ACT
-	if (static_key_false(&ingress_needed)) {
-		skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
-		if (!skb)
-			goto unlock;
-	}
-
-	skb->tc_verd = 0;
-ncls:
-#endif
 	if (nf_hook_ingress_active(skb)) {
 		ret = nf_hook_ingress(skb, &pt_prev, &ret, orig_dev);
 		if (ret < 0) {
@@ -3743,7 +3656,10 @@ ncls:
 			goto unlock;
 		}
 	}
-
+#ifdef CONFIG_NET_CLS_ACT
+	skb->tc_verd = 0;
+ncls:
+#endif
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
 		goto drop;
 
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 2274e72..23b57da 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -312,6 +312,7 @@ config NET_SCH_PIE
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
+	select NETFILTER_INGRESS
 	---help---
 	  Say Y here if you want to use classifiers for incoming packets.
 	  If unsure, say Y.
 
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 4cdbfb8..996c823 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -14,10 +14,11 @@
 #include <linux/rtnetlink.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
-
+#include <linux/netfilter_hooks.h>
 
 struct ingress_qdisc_data {
 	struct tcf_proto __rcu	*filter_list;
+	struct nf_hook_ops	ops;
 };
 
 /* ------------------------- Class/flow operations ------------------------- */
@@ -88,11 +89,62 @@ static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 
 /* ------------------------------------------------------------- */
 
+static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
+{
+	struct net_device *dev = skb->dev;
+	u32 ttl = G_TC_RTTL(skb->tc_verd);
+	int result = TC_ACT_OK;
+	struct Qdisc *q;
+
+	if (unlikely(MAX_RED_LOOP < ttl++)) {
+		net_warn_ratelimited("Redir loop detected Dropping packet (%d->%d)\n",
+				     skb->skb_iif, dev->ifindex);
+		return TC_ACT_SHOT;
+	}
+
+	skb->tc_verd = SET_TC_RTTL(skb->tc_verd, ttl);
+	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
+
+	q = rcu_dereference(rxq->qdisc);
+	if (q != &noop_qdisc) {
+		spin_lock(qdisc_lock(q));
+		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
+			result = qdisc_enqueue_root(skb, q);
+		spin_unlock(qdisc_lock(q));
+	}
+
+	return result;
+}
+
+static unsigned int handle_ing(const struct nf_hook_ops *ops,
+			       struct sk_buff *skb,
+			       const struct nf_hook_state *state)
+{
+	struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
+
+	if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
+		return NF_ACCEPT;
+
+	switch (ing_filter(skb, rxq)) {
+	case TC_ACT_SHOT:
+	case TC_ACT_STOLEN:
+		return NF_DROP;
+	}
+
+	return NF_ACCEPT;
+}
+
 static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
 {
-	net_inc_ingress_queue();
+	struct ingress_qdisc_data *p = qdisc_priv(sch);
+
+	p->ops.hook	= handle_ing;
+	p->ops.pf	= NFPROTO_NETDEV;
+	p->ops.hooknum	= NF_NETDEV_INGRESS;
+	p->ops.priority	= 0;
+	p->ops.dev	= qdisc_dev(sch);
 
-	return 0;
+	return nf_register_hook(&p->ops);
 }
 
 static void ingress_destroy(struct Qdisc *sch)
@@ -100,7 +152,7 @@ static void ingress_destroy(struct Qdisc *sch)
 	struct ingress_qdisc_data *p = qdisc_priv(sch);
 
 	tcf_destroy_chain(&p->filter_list);
-	net_dec_ingress_queue();
+	nf_unregister_hook(&p->ops);
 }
 
 static int ingress_dump(struct Qdisc *sch, struct sk_buff *skb)
Porting qdisc ingress on top of the Netfilter ingress hook allows us
to detach the qdisc ingress filtering code from the core, so now it
resides where it really belongs.

The specific qdisc ingress static key is also gone since we now rely
on the generic netfilter hook static key infrastructure. This only
depends on the basic hook infrastructure that resides in
net/core/hooks.c, so you can enable qdisc ingress filtering without
the layer 3 netfilter hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/rtnetlink.h |   13 -------
 net/core/dev.c            |   92 ++-------------------------------------
 net/sched/Kconfig         |    1 +
 net/sched/sch_ingress.c   |   60 +++++++++++++++++++++++++--
 4 files changed, 61 insertions(+), 105 deletions(-)