
[6/6] net: move qdisc ingress filtering on top of netfilter ingress hooks

Message ID 1430333589-4940-7-git-send-email-pablo@netfilter.org
State RFC, archived
Delegated to: David Miller

Commit Message

Pablo Neira Ayuso April 29, 2015, 6:53 p.m. UTC
Porting qdisc ingress on top of the Netfilter ingress hook allows us to detach
the qdisc ingress filtering code from the core, so now it resides where it
really belongs.

The specific qdisc ingress static key is also gone since we now rely on the
generic netfilter hook static key infrastructure.

This only depends on the basic hook infrastructure that resides in
net/core/hooks.c, so you can enable qdisc ingress filtering without the
layer 3 netfilter hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/rtnetlink.h |   13 -------
 net/core/dev.c            |   92 ++-------------------------------------------
 net/sched/Kconfig         |    1 +
 net/sched/sch_ingress.c   |   60 +++++++++++++++++++++++++++--
 4 files changed, 61 insertions(+), 105 deletions(-)

Comments

Daniel Borkmann April 29, 2015, 8:27 p.m. UTC | #1
On 04/29/2015 08:53 PM, Pablo Neira Ayuso wrote:
> Port qdisc ingress on top of the Netfilter ingress allows us to detach the
> qdisc ingress filtering code from the core, so now it resides where it really
> belongs.

Hm, but that means, in case you have a tc ingress qdisc attached
with one single (ideal) or more (less ideal) classifier/actions,
the path we _now_ have to traverse just to reach a single tc classifier
invocation is, if I spot this correctly, f.e.:

  __netif_receive_skb_core()
  `-> nf_hook_ingress()
   `-> nf_hook_do_ingress()
    `-> nf_hook_slow()
     `-> [for each entry in hook list]
      `-> nf_iterate()
       `-> (*elemp)->hook()
        `-> handle_ing()
         `-> ing_filter()
          `-> qdisc_enqueue_root()
           `-> sch->enqueue()
            `-> ingress_enqueue()
             `-> tc_classify()
              `-> tc_classify_compat()
               `-> [for each attached classifier]
                `-> tp->classify()
                 `-> f.e. cls_bpf_classify()
                  `-> [for each classifier from plist]
                   `-> BPF_PROG_RUN()
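
For illustration, a minimal user-space C analogue of that hook-list
dispatch (hypothetical names and types, not the actual kernel code)
looks roughly like:

#include <stdio.h>

enum verdict { HOOK_ACCEPT, HOOK_DROP };

struct pkt { const char *data; };

struct hook_entry {
	enum verdict (*hook)(struct pkt *pkt);
	struct hook_entry *next;
};

/* stands in for the tc classifier chain invocation */
static enum verdict classify_hook(struct pkt *pkt)
{
	printf("classifying %s\n", pkt->data);
	return HOOK_ACCEPT;
}

/* the nf_hook_slow()/nf_iterate() style walk: one list traversal plus
 * one indirect call per registered hook, stopping on a final verdict */
static enum verdict iterate_hooks(struct hook_entry *head, struct pkt *pkt)
{
	struct hook_entry *e;

	for (e = head; e; e = e->next) {
		enum verdict v = e->hook(pkt);

		if (v != HOOK_ACCEPT)
			return v;
	}
	return HOOK_ACCEPT;
}

int main(void)
{
	struct hook_entry ingress = { .hook = classify_hook, .next = NULL };
	struct pkt p = { .data = "dummy" };

	/* versus a single direct tc_classify() call in the old path */
	return iterate_hooks(&ingress, &p) == HOOK_ACCEPT ? 0 : 1;
}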

What was actually mentioned in the other thread, where we'd like to
see a more lightweight ingress qdisc, is to cut that down tremendously
to increase the pps rate, so that ideally we would be able to process
a path roughly like:

  __netif_receive_skb_core()
  `-> tc_classify()
   `-> tc_classify_compat()
     `-> [for each attached classifier]
       `-> tp->classify()
         `-> f.e. cls_bpf_classify()
           `-> [for each classifier from plist]
             `-> BPF_PROG_RUN()

Therefore, I think it would be better to not wrap that ingress qdisc
part of the patch set into even more layers. What do you think?

Thanks,
Daniel
Cong Wang April 29, 2015, 9:53 p.m. UTC | #2
On Wed, Apr 29, 2015 at 11:53 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 2274e72..23b57da 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -312,6 +312,7 @@ config NET_SCH_PIE
>  config NET_SCH_INGRESS
>         tristate "Ingress Qdisc"
>         depends on NET_CLS_ACT
> +       select NETFILTER_INGRESS
>         ---help---
>           Say Y here if you want to use classifiers for incoming packets.
>           If unsure, say Y.


So now it is impossible to compile the ingress Qdisc without netfilter...

(I know you moved them into net/core/, but still they are netfilter APIs.)

Why do we have to mix different layers? IOW, why not just keep TC at L2
and netfilter at L3, even just w.r.t. the API?
Pablo Neira Ayuso April 29, 2015, 11:32 p.m. UTC | #3
On Wed, Apr 29, 2015 at 10:27:05PM +0200, Daniel Borkmann wrote:
> On 04/29/2015 08:53 PM, Pablo Neira Ayuso wrote:
> >Port qdisc ingress on top of the Netfilter ingress allows us to detach the
> >qdisc ingress filtering code from the core, so now it resides where it really
> >belongs.
> 
> Hm, but that means, in case you have a tc ingress qdisc attached
> with one single (ideal) or more (less ideal) classifier/actions,
> the path we _now_ have to traverse just to a single tc classifier
> invocation is, if I spot this correctly, f.e.:
> 
>  __netif_receive_skb_core()
>  `-> nf_hook_ingress()
>   `-> nf_hook_do_ingress()
>    `-> nf_hook_slow()
>     `-> [for each entry in hook list]
>      `-> nf_iterate()
>       `-> (*elemp)->hook()
>        `-> handle_ing()
>         `-> ing_filter()
>          `-> qdisc_enqueue_root()
>           `-> sch->enqueue()
>            `-> ingress_enqueue()
>             `-> tc_classify()
>              `-> tc_classify_compat()
>               `-> [for each attached classifier]
>                `-> tp->classify()
>                 `-> f.e. cls_bpf_classify()
>                  `-> [for each classifier from plist]
>                   `-> BPF_PROG_RUN()

Actually, the extra cost is roughly (leaving inlined and other
non-relevant stuff aside):

    `-> nf_hook_slow()
     `-> [for each entry in hook list]
      `-> nf_iterate()
       `-> (*elemp)->hook()

as part of the generic hook infrastructure, which comes with extra
flexibility in return. I think the main concern so far was not to harm
the critical netif_receive_core() path, and this patchset proves not
to affect this.

BTW, the sch->enqueue() can easily go away after this patchset, see
attached patch.

> What was actually mentioned in the other thread where we'd like to
> see a more lightweight ingress qdisc is to cut that down tremendously
> to increase pps rate, as provided, that we would be able to process
> a path roughly like:
> 
>  __netif_receive_skb_core()
>  `-> tc_classify()
>   `-> tc_classify_compat()
>     `-> [for each attached classifier]
>       `-> tp->classify()
>         `-> f.e. cls_bpf_classify()
>           `-> [for each classifier from plist]
>             `-> BPF_PROG_RUN()
> 
> Therefore, I think it would be better to not wrap that ingress qdisc
> part of the patch set into even more layers. What do you think?

I think the main front to improve performance in qdisc ingress is to
remove the central spinlock that is harming scalability. There are also
the built-in rule counters, which look problematic. So I would
focus on improving performance from the qdisc ingress core
infrastructure itself.

On the bugfix front, the illegal mangling of shared skbs by actions
like stateless nat and bpf also looks important to address to me.
David already suggested to propagate some state object that keeps a
pointer to the skb that is passed to the action. Thus, the action can
clone it and get the skb back to the ingress path. I started a
patchset to do so here, it's a bit large since it requires quite a lot
of function signature adjustment.
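
Just to sketch the idea in user space (hypothetical names, purely
illustrative, not the actual proposal): the state carries the skb
pointer, the action clones a shared skb before mangling it and hands
the clone back through the state.

#include <stdlib.h>
#include <string.h>

struct skb {
	int users;		/* >1 means shared, e.g. a tap holds a reference */
	char *data;
};

struct action_state {
	struct skb *skb;	/* the ingress path picks up the clone from here */
};

static struct skb *clone_skb(const struct skb *old)
{
	struct skb *n = malloc(sizeof(*n));

	if (!n)
		return NULL;
	n->users = 1;
	n->data = strdup(old->data);
	if (!n->data) {
		free(n);
		return NULL;
	}
	return n;
}

/* a mangling action must not touch a shared skb in place */
static int act_mangle(struct action_state *st)
{
	if (st->skb->users > 1) {
		struct skb *clone = clone_skb(st->skb);

		if (!clone)
			return -1;
		st->skb = clone;	/* hand the private copy back */
	}
	st->skb->data[0] = 'X';		/* safe: we own this buffer now */
	return 0;
}

int main(void)
{
	struct skb shared = { .users = 2, .data = strdup("payload") };
	struct action_state st = { .skb = &shared };

	if (!shared.data)
		return 1;
	act_mangle(&st);
	return st.skb == &shared;	/* 0 means the caller got a private clone */
}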

I can also see there were intentions to support userspace queueing at
some point, since TC_ACT_QUEUED has been there since the beginning.
That should be possible at some point using this infrastructure (once
there are no further concerns on the netif_receive_core_finish() patch,
as long as gcc 4.9 and follow-up versions keep inlining this new
function).
Patrick McHardy April 29, 2015, 11:36 p.m. UTC | #4
On 29.04, Daniel Borkmann wrote:
> On 04/29/2015 08:53 PM, Pablo Neira Ayuso wrote:
> >Port qdisc ingress on top of the Netfilter ingress allows us to detach the
> >qdisc ingress filtering code from the core, so now it resides where it really
> >belongs.
> 
> Hm, but that means, in case you have a tc ingress qdisc attached
> with one single (ideal) or more (less ideal) classifier/actions,
> the path we _now_ have to traverse just to a single tc classifier
> invocation is, if I spot this correctly, f.e.:
> 
>  __netif_receive_skb_core()
>  `-> nf_hook_ingress()
>   `-> nf_hook_do_ingress()
>    `-> nf_hook_slow()
>     `-> [for each entry in hook list]
>      `-> nf_iterate()
>       `-> (*elemp)->hook()
>        `-> handle_ing()
>         `-> ing_filter()
>          `-> qdisc_enqueue_root()
>           `-> sch->enqueue()
>            `-> ingress_enqueue()
>             `-> tc_classify()
>              `-> tc_classify_compat()
>               `-> [for each attached classifier]
>                `-> tp->classify()
>                 `-> f.e. cls_bpf_classify()
>                  `-> [for each classifier from plist]
>                   `-> BPF_PROG_RUN()
> 

You obviously realize this callchain is fully made up by yourself
because 1/3 of the functions will be optimized away by the compiler,
some are completely gone with this patchset and some are optional
and appear to be present simply to make it appear longer.

The difference is very simple: where we had an indirect call to
q->enqueue before (and a lot of crap that didn't belong there),
we now have a call to nf_hook_slow, followed by the hook invocation.
Patrick McHardy April 29, 2015, 11:37 p.m. UTC | #5
On 29.04, Cong Wang wrote:
> On Wed, Apr 29, 2015 at 11:53 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index 2274e72..23b57da 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -312,6 +312,7 @@ config NET_SCH_PIE
> >  config NET_SCH_INGRESS
> >         tristate "Ingress Qdisc"
> >         depends on NET_CLS_ACT
> > +       select NETFILTER_INGRESS
> >         ---help---
> >           Say Y here if you want to use classifiers for incoming packets.
> >           If unsure, say Y.
> 
> 
> So now it impossible to compile ingress Qdics without netfilters...
> 
> (I know you moved them into net/core/, but still they are netfilter API's.)
> 
> Why do we have to mix different layers? IOW, why not just keep TC at L2
> and netfilters at L3 even just w.r.t. API?

Call it what you want, netfilter for most people is the entire infrastructure
built on top of the hooks. We can call it packet hooks if that makes you
happy. And they are infrastructure, not affiliated to any particular
layer.
Pablo Neira Ayuso April 29, 2015, 11:42 p.m. UTC | #6
On Wed, Apr 29, 2015 at 02:53:58PM -0700, Cong Wang wrote:
> On Wed, Apr 29, 2015 at 11:53 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index 2274e72..23b57da 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -312,6 +312,7 @@ config NET_SCH_PIE
> >  config NET_SCH_INGRESS
> >         tristate "Ingress Qdisc"
> >         depends on NET_CLS_ACT
> > +       select NETFILTER_INGRESS
> >         ---help---
> >           Say Y here if you want to use classifiers for incoming packets.
> >           If unsure, say Y.
> 
> 
> So now it impossible to compile ingress Qdics without netfilters...
> 
> (I know you moved them into net/core/, but still they are netfilter API's.)

This is one single file that contains only the very basic hook
infrastructure; it does not depend on layer 3.

> Why do we have to mix different layers? IOW, why not just keep TC at L2
> and netfilters at L3 even just w.r.t. API?

I think that used to be true in the iptables days... The nftables
infrastructure is flexible and extensible enough to satisfy the needs
of other network layers.
Daniel Borkmann April 30, 2015, midnight UTC | #7
On 04/30/2015 01:36 AM, Patrick McHardy wrote:
...
> You obviously realize this callchain is fully made up by yourself

Hence, I wrote path, not call chain, I guess that should have been
clear.

[...]
> The difference is very simple: where we had an indirect call to
> q->enqueue before (and a lot of crap that didn't belong there),
> we now have a call to nf_hook_slow, followed by the hook invocation.

Sure, but what I wanted to express is that from an architectural
point of view down to invoking a classifier, we now have a list of
hooks, one element of which can be the ingress qdisc, and that one can
have lists of classifiers/actions by itself, etc. That doesn't seem
sound wrt 'where it really belongs'.

Personally, I have no objections if you want to use netfilter on
ingress, don't get me wrong, but that ingress qdisc part doesn't
really fit in there, and moving it behind netfilter facilities does
have additional cost for a packet.
Daniel Borkmann April 30, 2015, 12:10 a.m. UTC | #8
On 04/30/2015 01:32 AM, Pablo Neira Ayuso wrote:
...
> Actually, the extra cost is roughly (getting inlined stuff away and
> other non-relevant stuff):
>
>      `-> nf_hook_slow()
>       `-> [for each entry in hook list]
>        `-> nf_iterate()
>         `-> (*elemp)->hook()

Yep, agreed.

> as part of the generic hook infrastructure, which comes with extra
> flexibility in return. I think the main concern so far was not to harm
> the critical netif_receive_core() path, and this patchset proves not
> to affect this.

Correct, as you use the static key and hide everything behind it.
Patrick McHardy April 30, 2015, 12:15 a.m. UTC | #9
On 30.04, Daniel Borkmann wrote:
> On 04/30/2015 01:36 AM, Patrick McHardy wrote:
> ...
> >You obviously realize this callchain is fully made up by yourself
> 
> Hence, I wrote path, not call chain, I guess that should have been
> clear.
> 
> [...]
> >The difference is very simple: where we had an indirect call to
> >q->enqueue before (and a lot of crap that didn't belong there),
> >we now have a call to nf_hook_slow, followed by the hook invocation.
> 
> Sure, but what I wanted to express is that from an architectural
> point of view down to invoking a classifier, we now have a list of
> hooks, one element of those can be ingress qdisc and that one can
> have lists of classifier/actions by itself, etc. That doesn't seem
> sound wrt 'where it really belongs'.
> 
> Personally, I have no objections if you want to use netfilter on
> ingress, don't get me wrong, but that ingress qdisc part doesn't
> really fit in there, and moving it behind netfilter facilities does
> have additional cost for a packet.

It does, and Pablo made up for this cost by moving the TTL tracking
to ingress where it actually belongs. This would obviously also be
possible without his patchset, but the fact is, nobody seemed to
have cared for the past ten years. From today we have a basically
zero-cost feature and people can decide what they want for themselves:
there is no added cost for people not using it, and there is no added
cost for people using ingress.

As for "really fit", ingress started off as living on a netfilter
hook and only moved to a direct invocation somewhere around 2.6.12
IIRC. As I was mentioning earlier, ingress *qdisc* is a hack in
itself since there is no freaking queueing, it's a hack to invoke
classifier chains, so actually netfilter is *the better* abstraction
to do this. And it's not netfilter, it's packet hooks for classification
purposes.

To me all these arguments have a very negative sound. It has always
been like this - and it hasn't. And we want ingress queueing to be
the exclusive way to do this. And this quite frankly sucks for all
the reasons stated before. It's a hack, it's not queueing, it doesn't
cost anything, so let people decide what they actually want to use.
Daniel Borkmann April 30, 2015, 12:20 a.m. UTC | #10
On 04/30/2015 01:32 AM, Pablo Neira Ayuso wrote:
...
>> Therefore, I think it would be better to not wrap that ingress qdisc
>> part of the patch set into even more layers. What do you think?
>
> I think the main front to improve performance in qdisc ingress is to
> remove the central spinlock that is harming scalability. There's also
> the built-in rule counters there that look problematic. So I would
> focus on improving performance from the qdisc ingress core
> infrastructure itself.
>
> On the bugfix front, the illegal mangling of shared skb from actions
> like stateless nat and bpf look also important to be addressed to me.
> David already suggested to propagate some state object that keeps a
> pointer to the skb that is passed to the action. Thus, the action can
> clone it and get the skb back to the ingress path. I started a
> patchset to do so here, it's a bit large since it requires quite a lot
> of function signature adjustment.
>
> I can also see there were also intentions to support userspace
> queueing at some point since TC_ACT_QUEUED has been there since the
> beginning.  That should be possible at some point using this
> infrastructure (once there are no further concerns on the
> netif_receive_core_finish() patch as soon as gcc 4.9 and follow up
> versions keep inlining this new function).

Wrt the other mail, just thinking out loud ... do you see a longer-term
possibility of further generalizing the gen hooks infrastructure, so that
a classifier from tc could actually attach (disregarding the nf_* naming
scheme for now) ...

     `-> nf_hook_slow()
      `-> [for each entry in hook list]   <-- here as an entry
       `-> nf_iterate()
        `-> (*elemp)->hook()

... as well?

Thanks,
Daniel
Patrick McHardy April 30, 2015, 12:30 a.m. UTC | #11
On 30.04, Daniel Borkmann wrote:
> >
> >I can also see there were also intentions to support userspace
> >queueing at some point since TC_ACT_QUEUED has been there since the
> >beginning.  That should be possible at some point using this
> >infrastructure (once there are no further concerns on the
> >netif_receive_core_finish() patch as soon as gcc 4.9 and follow up
> >versions keep inlining this new function).
> 
> Wrt the other mail, just thinking out loud ... do you see a longer-term
> possibility of further generalizing the gen hooks infrastructure, so that
> actually classifier from tc could attach (disregarding the nf_* naming
> scheme for now) ...
> 
>     `-> nf_hook_slow()
>      `-> [for each entry in hook list]   <-- here as an entry
>       `-> nf_iterate()
>        `-> (*elemp)->hook()
> 
> ... as well?

Jumping in there since I'm probably the one who most strongly thinks the
TC ingress abstraction is wrong - yes, it's an interesting idea.
Unlike egress qdiscs, ingress only has a single classifier chain
anyways, so there is no qdisc internal classification structure to
be observed. It should be possible to skip the ingress invocation
for classification purposes completely and only use it to expose
it to userspace for management purposes, while invoking the chain
directly.
Patrick McHardy April 30, 2015, 12:37 a.m. UTC | #12
On 30.04, Pablo Neira Ayuso wrote:
> On the bugfix front, the illegal mangling of shared skb from actions
> like stateless nat and bpf look also important to be addressed to me.
> David already suggested to propagate some state object that keeps a
> pointer to the skb that is passed to the action. Thus, the action can
> clone it and get the skb back to the ingress path. I started a
> patchset to do so here, it's a bit large since it requires quite a lot
> of function signature adjustment.

Jumping in on this point - the fact that roughly 2/3 of TC actions will
simply BUG under not unlikely circumstances when used in ingress (I went
through them one by one with Pablo a week ago) is also telling. Nobody
seems to be using that. All packet mangling actions will BUG while any
tap is active. It's nothing easily fixed, but apparently nobody has cared
in ten years. ipt is trivial to crash differently, connmark is as well.

So I'm wondering what we are actually arguing about here. Whether we are
affecting how fast TC will crash? We *do* actually care
about these things; in TC apparently nobody has for the past ten years.
Daniel Borkmann April 30, 2015, 12:41 a.m. UTC | #13
On 04/30/2015 02:30 AM, Patrick McHardy wrote:
> On 30.04, Daniel Borkmann wrote:
>>>
>>> I can also see there were also intentions to support userspace
>>> queueing at some point since TC_ACT_QUEUED has been there since the
>>> beginning.  That should be possible at some point using this
>>> infrastructure (once there are no further concerns on the
>>> netif_receive_core_finish() patch as soon as gcc 4.9 and follow up
>>> versions keep inlining this new function).
>>
>> Wrt the other mail, just thinking out loud ... do you see a longer-term
>> possibility of further generalizing the gen hooks infrastructure, so that
>> actually classifier from tc could attach (disregarding the nf_* naming
>> scheme for now) ...
>>
>>      `-> nf_hook_slow()
>>       `-> [for each entry in hook list]   <-- here as an entry
>>        `-> nf_iterate()
>>         `-> (*elemp)->hook()
>>
>> ... as well?
>
> Jumping in there since I'm probably the one thinking the TC ingress
> abstraction is wrong the strongest - yes, it's an interesting idea.
> Unlike egress qdiscs, ingress only has a single classifier chain
> anyways, so there is no qdisc internal classification structure to

Yes, exactly.

> be observed. It should be possible to skip the ingress invocation
> for classification purposes completely and only use it to expose
> it to userspace for management purposes, while invoking the chain
> directly.

That would also meet our goals to only have a single classifier
chain, eventually. ;)
Patrick McHardy April 30, 2015, 12:48 a.m. UTC | #14
On 30.04, Daniel Borkmann wrote:
> On 04/30/2015 02:30 AM, Patrick McHardy wrote:
> >On 30.04, Daniel Borkmann wrote:
> >>>
> >>>I can also see there were also intentions to support userspace
> >>>queueing at some point since TC_ACT_QUEUED has been there since the
> >>>beginning.  That should be possible at some point using this
> >>>infrastructure (once there are no further concerns on the
> >>>netif_receive_core_finish() patch as soon as gcc 4.9 and follow up
> >>>versions keep inlining this new function).
> >>
> >>Wrt the other mail, just thinking out loud ... do you see a longer-term
> >>possibility of further generalizing the gen hooks infrastructure, so that
> >>actually classifier from tc could attach (disregarding the nf_* naming
> >>scheme for now) ...
> >>
> >>     `-> nf_hook_slow()
> >>      `-> [for each entry in hook list]   <-- here as an entry
> >>       `-> nf_iterate()
> >>        `-> (*elemp)->hook()
> >>
> >>... as well?
> >
> >Jumping in there since I'm probably the one thinking the TC ingress
> >abstraction is wrong the strongest - yes, it's an interesting idea.
> >Unlike egress qdiscs, ingress only has a single classifier chain
> >anyways, so there is no qdisc internal classification structure to
> 
> Yes, exactly.
> 
> >be observed. It should be possible to skip the ingress invocation
> >for classification purposes completely and only use it to expose
> >it to userspace for management purposes, while invoking the chain
> >directly.
> 
> That would also meet our goals to only have a single classifier
> chain, eventually. ;)

For TC, yes. What Pablo is trying to achieve (and I agree that it's
worthwhile) is multiple things:

* have people use TC ingress as before, obviously
* have people use netfilter, if they prefer
* have them combine both if for some unfortunate reason they require it

Netfilter is based on hook chains. The cost when only using a single hook
is minimal (as Pablo showed in his numbers), but even if only using
TC and a single netfilter classifier chain, there has to be some relative
ordering and the hooks provide this in a generic way.

So I'm all in favour of reducing costs for single or for multi-hook
cases, but it wouldn't be usable for us anymore when doing away with
multiple hooks completely.

One more point to consider is - this is generic infrastructure - the
next person requiring a hook will not have to touch the ingress path
at all. He can register with the hooks, having zero additional costs
for people not using it.
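
As a rough user-space analogue of that zero-cost property (in the
kernel the check is a static key, i.e. a patched branch; the plain
flag below only mimics the resulting control flow, and all names are
made up):

#include <stdbool.h>

struct pkt { const char *data; };

/* in the kernel this is a static key toggled at hook (un)registration
 * time; a plain flag here only stands in for it */
static bool ingress_hooks_registered;

static int run_ingress_hooks(struct pkt *pkt)
{
	(void)pkt;	/* the hook-list walk is omitted in this sketch */
	return 0;
}

static int receive_packet(struct pkt *pkt)
{
	/* never taken while nothing is registered, so non-users keep
	 * the old fast path */
	if (ingress_hooks_registered)
		return run_ingress_hooks(pkt);
	return 0;
}

int main(void)
{
	struct pkt p = { .data = "dummy" };

	return receive_packet(&p);
}
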
Daniel Borkmann April 30, 2015, 1:04 a.m. UTC | #15
On 04/30/2015 02:37 AM, Patrick McHardy wrote:
> On 30.04, Pablo Neira Ayuso wrote:
>> On the bugfix front, the illegal mangling of shared skb from actions
>> like stateless nat and bpf look also important to be addressed to me.
>> David already suggested to propagate some state object that keeps a
>> pointer to the skb that is passed to the action. Thus, the action can
>> clone it and get the skb back to the ingress path. I started a
>> patchset to do so here, it's a bit large since it requires quite a lot
>> of function signature adjustment.
>
> Jumping in on this point - the fact that roughly 2/3 of TC actions will
> simply BUG under not unlikely circumstances when used in ingress (I went
> through them one by one with Pablo a week ago) is also telling. Nobody
> seems to be using that. All packet mangling actions will BUG while any
> tap is active. Its nothing easily fixed, but apparently nobody has cared
> in ten years. ipt is trivial to crash differently, connmark is as well.
>
> So I'm wondering what are we actually arguing about here. Whether we are
> affecting the performance how fast TC will crash? We *do* actually care
> about these thing, in TC apparently nobody has for the past ten years.

Totally agree with you that the situation is quite a mess. From tc ingress/
egress side, at least my use case is to have an as minimal as possible entry
point for cls_bpf/act_bpf, which is what we were working on recently. That
is rather ``fresh'' compared to the remaining history of cls/act in tc.

Cheers,
Daniel
Alexei Starovoitov April 30, 2015, 1:16 a.m. UTC | #16
On Thu, Apr 30, 2015 at 02:48:39AM +0200, Patrick McHardy wrote:
> Netfilter is based on hook chains. The cost when only using a single hook
> is minimal (as Pablo showed in his numbers), but even if only using
> TC and a single netfilter classifier chain, there has to be some relative
> ordering and the hooks provide this in a generic way.

Unfortunately the numbers that Pablo has shown are not measuring
the right thing.

> 840203pps 403Mb/sec

this is 20 times less than what they should be.
Something else was measured together with netif_receive_skb.

I've applied these patches and see the following
for eth0 + ingress + u32:

18.0 Mpps
  21.43%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
   9.88%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
   9.79%  kpktgend_0   [cls_u32]         [k] u32_classify
   9.16%  kpktgend_0   [kernel.vmlinux]  [k] _raw_spin_lock
   8.16%  kpktgend_0   [kernel.vmlinux]  [k] nf_iterate
   5.28%  kpktgend_0   [sch_ingress]     [k] handle_ing
   4.51%  kpktgend_0   [sch_ingress]     [k] ingress_enqueue
   4.42%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify_compat
   3.16%  kpktgend_0   [kernel.vmlinux]  [k] nf_hook_slow
   3.01%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
   2.70%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify

without these patches:

22.4 Mpps
  25.89%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
  14.41%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
  14.05%  kpktgend_0   [kernel.vmlinux]  [k] _raw_spin_lock
  11.75%  kpktgend_0   [cls_u32]         [k] u32_classify
   6.48%  kpktgend_0   [sch_ingress]     [k] ingress_enqueue
   6.06%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify_compat
   4.16%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify
   3.77%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv

clearly nf_iterate/nf_hook_slow are slowing things down.

I've spent more than a week trying to speed up the ingress qdisc
and, so far, got from 22.4 Mpps to 27.2 Mpps,
so this 'generalization' is totally not acceptable to me.

You're right that for 10 years no one cared about performance
of ingress qdisc, but that doesn't mean it's a wrong abstraction
and wrong architecture. Now I care about its performance and
I hope other people will do too.
So please leave ingress qdisc alone, this 'generalization'
is too costly.

That doesn't mean that netfilter shouldn't have its own hook
on ingress. Without patch 6, the set looks good.

Patrick McHardy April 30, 2015, 1:34 a.m. UTC | #17
On 29.04, Alexei Starovoitov wrote:
> On Thu, Apr 30, 2015 at 02:48:39AM +0200, Patrick McHardy wrote:
> > Netfilter is based on hook chains. The cost when only using a single hook
> > is minimal (as Pablo showed in his numbers), but even if only using
> > TC and a single netfilter classifier chain, there has to be some relative
> > ordering and the hooks provide this in a generic way.
> 
> Unfortunately the numbers that Pablo shown are not measuring
> the right thing.
> 
> > 840203pps 403Mb/sec
> 
> this is 20 times less than what they should be.
> Something else were measured together with netif_receive_skb.
> 
> I've applied these patches and see the following
> for eth0 + ingress + u32:
> 
> 18.0 Mpps
>   21.43%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
>    9.88%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
>    9.79%  kpktgend_0   [cls_u32]         [k] u32_classify
>    9.16%  kpktgend_0   [kernel.vmlinux]  [k] _raw_spin_lock
>    8.16%  kpktgend_0   [kernel.vmlinux]  [k] nf_iterate
>    5.28%  kpktgend_0   [sch_ingress]     [k] handle_ing
>    4.51%  kpktgend_0   [sch_ingress]     [k] ingress_enqueue
>    4.42%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify_compat
>    3.16%  kpktgend_0   [kernel.vmlinux]  [k] nf_hook_slow
>    3.01%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
>    2.70%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify
> 
> without these patches:
> 
> 22.4 Mpps
>   25.89%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
>   14.41%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
>   14.05%  kpktgend_0   [kernel.vmlinux]  [k] _raw_spin_lock
>   11.75%  kpktgend_0   [cls_u32]         [k] u32_classify
>    6.48%  kpktgend_0   [sch_ingress]     [k] ingress_enqueue
>    6.06%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify_compat
>    4.16%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify
>    3.77%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
> 
> clearly nf_iterate/nf_hook_slow are slowing things down.
> 
> I've spent more than a week trying to speedup ingress qdisc
> and, so far, got from 22.4 Mpps to 27.2 Mpps,
> so this 'generalization' is totally not acceptable to me.
> 
> You're right that for 10 years no one cared about performance
> of ingress qdisc, but that doesn't mean it's a wrong abstraction
> and wrong architecture. Now I care about its performance and
> I hope other people will do too.

The wrong abstraction is using a qdisc for ingress. An abstraction
is not about performance. Why do you think ingress exists? For
queueing? Or as providing a hooking point for a bunch of broken
(at ingress) actions? You're (one of) the ones who painfully realized
how broken any kind of packet mangling at that point is. The
infrastructure is simply crap and always has been.

> So please leave ingress qdisc alone, this 'generalization'
> is too costly.

Sorry, we are of the opinion that TC classifiers suck, so we will
not leave that path alone :) Your numbers are well appreciated,
we will fix this and return with better numbers.

> That doesn't mean that netfilter shouldn't have its own hook
> on ingress. Without patch 6, the set looks good.

I don't agree. It would be preferable to optimize the single hook
case not only for ingress's sake, but for all the already existing
hooks.

Cheers,
Patrick
Patrick McHardy April 30, 2015, 1:43 a.m. UTC | #18
On 30.04, Daniel Borkmann wrote:
> On 04/30/2015 02:37 AM, Patrick McHardy wrote:
> >On 30.04, Pablo Neira Ayuso wrote:
> >>On the bugfix front, the illegal mangling of shared skb from actions
> >>like stateless nat and bpf look also important to be addressed to me.
> >>David already suggested to propagate some state object that keeps a
> >>pointer to the skb that is passed to the action. Thus, the action can
> >>clone it and get the skb back to the ingress path. I started a
> >>patchset to do so here, it's a bit large since it requires quite a lot
> >>of function signature adjustment.
> >
> >Jumping in on this point - the fact that roughly 2/3 of TC actions will
> >simply BUG under not unlikely circumstances when used in ingress (I went
> >through them one by one with Pablo a week ago) is also telling. Nobody
> >seems to be using that. All packet mangling actions will BUG while any
> >tap is active. Its nothing easily fixed, but apparently nobody has cared
> >in ten years. ipt is trivial to crash differently, connmark is as well.
> >
> >So I'm wondering what are we actually arguing about here. Whether we are
> >affecting the performance how fast TC will crash? We *do* actually care
> >about these thing, in TC apparently nobody has for the past ten years.
> 
> Totally agree with you that the situation is quite a mess. From tc ingress/
> egress side, at least my use case is to have an as minimal as possible entry
> point for cls_bpf/act_bpf, which is what we were working on recently. That
> is rather ``fresh'' compared to the remaining history of cls/act in tc.

It's more than a mess. Leaving aside the fully broken code at ingress,
just look at the TC action APIs. It's "a failed state". TC is as well, but
on egress only on the userspace side - Thomas's presentation 'TC - -EWTF'
IIRC put it quite well. Our long term goal is to fix both, but ingress is
actually simple: for any realistic case (though my experience is limited),
meaning you have more than a single classifier, some structure in your rules
and more than a single CPU, we'll do better right now.

But Alexei's point is well taken. If we can't manage to add our hooks
without impacting existing use cases, we'll try harder until we can.
The single hook case seems to be possible to optimize to the benefit of
all the layers quite easily, and for people using both, we'll try to
compete on a different level.
Jamal Hadi Salim April 30, 2015, 2:22 a.m. UTC | #19
On 04/29/15 21:34, Patrick McHardy wrote:
> On 29.04, Alexei Starovoitov wrote:
>> On Thu, Apr 30, 2015 at 02:48:39AM +0200, Patrick McHardy wrote:
[..]
>>
>>> 840203pps 403Mb/sec
>>
>> this is 20 times less than what they should be.
>> Something else were measured together with netif_receive_skb.
>>
>> I've applied these patches and see the following
>> for eth0 + ingress + u32:
>>
>> 18.0 Mpps

[..]

>> without these patches:
>>
>> 22.4 Mpps
>>    25.89%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
>>    14.41%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
>>    14.05%  kpktgend_0   [kernel.vmlinux]  [k] _raw_spin_lock
>>    11.75%  kpktgend_0   [cls_u32]         [k] u32_classify
>>     6.48%  kpktgend_0   [sch_ingress]     [k] ingress_enqueue
>>     6.06%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify_compat
>>     4.16%  kpktgend_0   [kernel.vmlinux]  [k] tc_classify
>>     3.77%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
>>
>> clearly nf_iterate/nf_hook_slow are slowing things down.
>>
>> I've spent more than a week trying to speedup ingress qdisc
>> and, so far, got from 22.4 Mpps to 27.2 Mpps,
>> so this 'generalization' is totally not acceptable to me.
>>
>> You're right that for 10 years no one cared about performance
>> of ingress qdisc, but that doesn't mean it's a wrong abstraction
>> and wrong architecture. Now I care about its performance and
>> I hope other people will do too.
>

Well, even if you didn't touch tc and got rid of the lock
it will still be about an order of magnitude faster than netfilter ;->
How about I call netfilter crap, Patrick? It is slower than
molasses. For more than 10 years no one has found good ways
to bring it up to speed.

For sure the attempt to re-use netfilter code from tc has failed.
I think that's what you probably keep repeating as something that will
crash. Apart from some help from Pablo in recent times, this re-use
has been a disaster even when linking in user space (mostly from lack
of backward compatibility).
It seems to me we may just have to drop support for ipt and anything
netfilter from tc.
And at some point we are going to need those features, we'll just need
to write faster versions of them. eBPF may be helpful.

> The wrong abstraction is using a qdisc for ingress. An abstraction
> is not about performance. Why do you thing ingress exists? For
> queueing? Or as providing a hooking point for a bunch of broken
> (at ingress) actions? You're (one of) the one who painfully realized
> how broken any kind of packet mangling at that point is. The
> infrastructure is simply crap and always has been.
>

What abstraction is broken? The ingress expects L3 headers. The egress
expects L2 headers. I don't think there is disagreement that the ingress
should see L2 headers. The view is that it will be a bit of work.
Discussion is ongoing to resolve this without penalizing performance.
You don't care about performance - netfilter is all about how to have
nice usability. I care about usability, but not as much as I do
about performance.

>> So please leave ingress qdisc alone, this 'generalization'
>> is too costly.
>
> Sorry, we are of the opinion that TC classifiers suck, so we will
> not leave that path alone :) You're numbers are well appreciated,
> we will fix this and return with better numbers.
>
>> That doesn't mean that netfilter shouldn't have its own hook
>> on ingress. Without patch 6, the set looks good.
>
> I don't agree. It would be preferable to optimize the single hook
> case not only for ingress's sake, but for all the already existing
> hooks.
>

And I strongly disagree. Netfilter is slow as a boring hot day.
Please don't force it on us. I thought you were kidding when you said
you want to move ingress back to netfilter. We ran away from there
years ago.

cheers,
jamal


Jamal Hadi Salim April 30, 2015, 2:35 a.m. UTC | #20
On 04/29/15 21:43, Patrick McHardy wrote:
> On 30.04, Daniel Borkmann wrote:
>> On 04/30/2015 02:37 AM, Patrick McHardy wrote:
>>> On 30.04, Pablo Neira Ayuso wrote:

>> Totally agree with you that the situation is quite a mess. From tc ingress/
>> egress side, at least my use case is to have an as minimal as possible entry
>> point for cls_bpf/act_bpf, which is what we were working on recently. That
>> is rather ``fresh'' compared to the remaining history of cls/act in tc.
>
> It's more than a mess. Leaving aside the fully broken code at ingress,
> just look at the TC action APIs. Its "a failed state".

Since you've repeated about 100 times that the tc API is broken, maybe
you can explain more rationally? By that I mean don't use words
like "crap" or "failed state", and no chest-thumping.
Let's say we totally stopped trying to reuse netfilter code - what are
you talking about?

I think there is confusion about usability vs merits of performance.

cheers,
jamal

Patrick McHardy April 30, 2015, 3:11 a.m. UTC | #21
On 29.04, Jamal Hadi Salim wrote:
> >>You're right that for 10 years no one cared about performance
> >>of ingress qdisc, but that doesn't mean it's a wrong abstraction
> >>and wrong architecture. Now I care about its performance and
> >>I hope other people will do too.
> >
> 
> Well, even if you didnt touch tc and got rid of the lock
> it will still be about a magnitude faster than netfilter ;->
> How about i call netfilter crap Patrick? It is slower than
> molasses. For more than 10+ years no-one has found good ways
> to bring it up to speed.

Quite frankly, that is ridiculous. I realize we turn in circles
in this discussion, so we will provide the numbers. Give us a test
case if you want, something you think is hard.

> For sure the attempt to re-use netfilter code from tc has failed.

Agreed.

> I think thats what you keep probably repeating as something that will
> crash.

*Everything* except the purely passive ones, such as the policer, will crash.
Everything. It's well known. No point in repeating it.

> Apart from some help from Pablo in recent times - this re-use
> has been a disaster even when linking in user space (from lack of
> backward incompatibilities mostly).
> It seems to me we may just have to drop support for ipt and anything
> netfilter from tc.
> And at some point we are going to need those features, we'll just need
> to write faster versions of them. ebpf maybe helpful.

I stopped caring about TC a long time ago, basically when actions were
merged over my head, despite me being the only person actually taking
care of TC. You might remember, I spent multiple months cleaning up the
action mess when it was merged, and that was only the *really* bad bugs.
Magic numbers in the APIs, lists limited to more magic numbers etc,
I didn't care. Today I do not think it's about individual problems anymore,
the entire thing is broken. Egress has some shared problems with ingress,
such as horrible userspace code, horrible error reporting, no way to use it
even for me without looking at both TC and kernel code at the same time, but
ingress is worse in the sense that the supermajority of it will actually
crash in very likely cases, and it's not about ipt, it's everything touching
a packet in the presence of a tap, like 2/3 of all actions. The qdiscs are
mostly (except integration of classifiers) fine, everything else is crap.

> >The wrong abstraction is using a qdisc for ingress. An abstraction
> >is not about performance. Why do you thing ingress exists? For
> >queueing? Or as providing a hooking point for a bunch of broken
> >(at ingress) actions? You're (one of) the one who painfully realized
> >how broken any kind of packet mangling at that point is. The
> >infrastructure is simply crap and always has been.
> 
> What abstraction is broken? The ingress expects L3 headers. The egress
> expects L2 headers. I dont think there is disagreement that the ingress
> should see L2 headers. The view is that it will be a bit of work.
> Discussion is ongoing to resolve this without penalizing performance.
> You dont care about performance - netfilter is all about how to have
> nice usability. I care about usability but not as much as i do
> about performance.

The abstraction is *wrong*. There is no queueing, hence using a qdisc
is the wrong abstraction. Why are we arguing about that? Its a
mechanism to invoke classifiers and actions. What is the qdisc actually
used for? It certainly doesn't queue anything, the only thing it's
doing is imposing the worst locking imaginable on the entire path.

If that's not enough, look at the actions. I mean, you're not classifying
for the sake of it, there are not even classes on ingress, its purely
done for the actions. And they are, as stated, almost completely
broken. We can do policing. That actually works. Maybe one or two more.
Except policing, none of those is even remotely related to QoS.
So let me ask again, in what sense is the abstraction actually right?

> >>So please leave ingress qdisc alone, this 'generalization'
> >>is too costly.
> >
> >Sorry, we are of the opinion that TC classifiers suck, so we will
> >not leave that path alone :) You're numbers are well appreciated,
> >we will fix this and return with better numbers.
> >
> >>That doesn't mean that netfilter shouldn't have its own hook
> >>on ingress. Without patch 6, the set looks good.
> >
> >I don't agree. It would be preferable to optimize the single hook
> >case not only for ingress's sake, but for all the already existing
> >hooks.
> >
> 
> And i strongly disagree. Netfilter is slow as a boring hot day.
> Please dont enforce it on us. I thought you were kidding when you said
> you want to move the ingress back to netfilter. We run away from there
> years ago.

Quite frankly, talking about netfilter at all is so wrong that there
isn't even a possibility to respond. Netfilter is a mechanism - a hook
to receive a packet. And even that kicks your single-threaded ingress
qdisc's ass every packet of the day. What we're talking about are the
things built on top. There's no question we also win on hardware that's
not from 1975 there, because people are actually using it.
Patrick McHardy April 30, 2015, 3:29 a.m. UTC | #22
On 29.04, Jamal Hadi Salim wrote:
> On 04/29/15 21:43, Patrick McHardy wrote:
> >On 30.04, Daniel Borkmann wrote:
> >>On 04/30/2015 02:37 AM, Patrick McHardy wrote:
> >>>On 30.04, Pablo Neira Ayuso wrote:
> 
> >>Totally agree with you that the situation is quite a mess. From tc ingress/
> >>egress side, at least my use case is to have an as minimal as possible entry
> >>point for cls_bpf/act_bpf, which is what we were working on recently. That
> >>is rather ``fresh'' compared to the remaining history of cls/act in tc.
> >
> >It's more than a mess. Leaving aside the fully broken code at ingress,
> >just look at the TC action APIs. Its "a failed state".
> 
> Since youve repeated about 100 that tc api being broken, maybe
> you can explain more rationally? By that i mean dont use words
> like words like "crap" or "failed state" or no chest-thumping.
> Lets say we totally stopped trying to reuse netfilter code, what are
> you talking about?
> 
> I think there is confusion about usability vs merits of performance.

I noticed tons of semantic holes when actually working on this, but
let me just give a single example which is still at the top of my head:

tcf_action_init:

        for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                act = tcf_action_init_1(net, tb[i], est, name, ovr, bind);

So the API allows up to *magic number* (32) actions in one list, but not
more. That many will be atomic at least from an rtnetlink perspective, but
certainly not from a packet processing perspective; from that one, they
will simply be processed as they are instantiated. So we simulate
atomicity (what netlink is about) for a magic number of elements, but
do not even provide it at runtime.

Even ignoring the harder problem of having transactional semantics for
multiple actions, why even support a *magic number* of elements in one
message and pretend atomicity? Smells a lot like premature optimization,
ignoring semantic expectations.

And yes, I do think it breaks with every concept of rtnetlink (messages
are atomic) for no reason at all. I remember TC actions were full of
these "special" interpretations. If you insist, I can do a review again,
but I did all that and stated it years ago.

Regarding TC as a whole, I think the problem is shared between userspace
and the kernel. iproute TC has certainly failed completely: it's unusable
without looking at the kernel and iproute code, and it hasn't even made the
slightest infrastructural progress in the past 15 years (*(u16)RTN_DATA(bla))?)
and that is telling in itself. I don't extend that to ip, even though it
suffers from the same coding problems, but let's be honest, do you really
expect some magic is going to happen and TC userspace will suddenly become
usable?

I don't. And we intend to provide an alternative.
Patrick McHardy April 30, 2015, 4:05 a.m. UTC | #23
On 30.04, Patrick McHardy wrote:
> Regarding TC as a whole, I think the problem is shared between userspace
> and the kernel. iproute TC is certainly completely failed, its unusable
> without looking at the kernel and iproute code, it hasn't even made the
> slightest infrastructrual progess in the past 15 years (*(u16)RTN_DATA(bla))?)
> and that is telling for itself. I don't extend that to ip, even though it
> suffers from the same coding problems, but let's be honest, do you really
> expect some magic is going to happen and TC userspace will suddenly become
> usable?
> 
> I don't. And we intend to provide an alternative.

For the sake of completeness - on ingress, this alternative means simply
ditching TC. On egress, we will provide a nicely integrated way of qdiscs
and classifiers where nftables will obviously respect the qdiscs. On
ingress, there is simply no benefit in doing that at all. Hence Pablo's
patches.
Alexei Starovoitov April 30, 2015, 6:02 a.m. UTC | #24
On Thu, Apr 30, 2015 at 06:05:37AM +0200, Patrick McHardy wrote:

> 06:05:37AM

do you ever sleep? ;)

> For the sake of completeness - on ingress, this alternativ means simply
> ditching TC. 

I'm sure 'ditching TC' doesn't mean to kill ingress qdisc
completely. Compatibility needs to be preserved.

What I'm doing in my 'experimental ingress qdisc' acceleration
boils down to:
@@ -1649,6 +1649,7 @@ struct net_device {
        rx_handler_func_t __rcu *rx_handler;
        void __rcu              *rx_handler_data;

-       struct netdev_queue __rcu *ingress_queue;
+       struct tcf_proto __rcu  *ingress_filter_list;

so to call tc_classify() and reach cls_bpf I don't need to walk
down skb->dev->ingress_queue->qdisc->enqueue/qdisc_priv->filter_list
and can just do skb->dev->ingress_filter_list
and can skip several unnecessary derefs like sch->stab, skb->len, etc.
Both ingress_queue and ingress qdisc are no longer allocated
and stay only as a shim to preserve uapi.
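
A user-space analogue of the two dereference chains (made-up types,
just to show the difference in pointer chasing):

struct filter { int (*classify)(const char *data); };
struct Qdisc { struct filter *filter_list; };
struct netdev_queue { struct Qdisc *qdisc; };

struct net_device_old { struct netdev_queue *ingress_queue; };
struct net_device_new { struct filter *ingress_filter_list; };

/* three dependent loads before the classifier is even reached */
static int classify_old(struct net_device_old *dev, const char *data)
{
	return dev->ingress_queue->qdisc->filter_list->classify(data);
}

/* one load, then straight into the classifier */
static int classify_new(struct net_device_new *dev, const char *data)
{
	return dev->ingress_filter_list->classify(data);
}

static int dummy_classify(const char *data) { (void)data; return 0; }

int main(void)
{
	struct filter f = { .classify = dummy_classify };
	struct Qdisc q = { .filter_list = &f };
	struct netdev_queue nq = { .qdisc = &q };
	struct net_device_old o = { .ingress_queue = &nq };
	struct net_device_new n = { .ingress_filter_list = &f };

	return classify_old(&o, "x") + classify_new(&n, "x");
}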

My point is that I agree that cleanup of ingress qdisc is needed.
I disagree with drastic measures.
Just add your nf_hook to ingress and let's see how things evolve.
We have rx_handler and all of ptype hooks in there. One can argue
that rx_handler overlaps with nf_hook too ? ;)
We cannot generalize them all under one 'hook' infra.
nf needs to do nf_hook_state_init() and pass it around which
no one else needs. That's the cost others should not pay.

Daniel Borkmann April 30, 2015, 9:24 a.m. UTC | #25
On 04/30/2015 08:02 AM, Alexei Starovoitov wrote:
...
> My point is that I agree that cleanup of ingress qdisc is needed.
> I disagree with drastic measures.
> Just add your nf_hook to ingress and let's see how things evolve.
> We have rx_handler and all of ptype hooks in there. One can argue
> that rx_handler overlaps with nf_hook too ? ;)
> We cannot generalize them all under one 'hook' infra.
> nf needs to do nf_hook_state_init() and pass it around which
> no one else needs. That's the cost others should not pay.

+1
Pablo Neira Ayuso April 30, 2015, 10:12 a.m. UTC | #26
On Wed, Apr 29, 2015 at 06:16:35PM -0700, Alexei Starovoitov wrote:
> On Thu, Apr 30, 2015 at 02:48:39AM +0200, Patrick McHardy wrote:
> > Netfilter is based on hook chains. The cost when only using a single hook
> > is minimal (as Pablo showed in his numbers), but even if only using
> > TC and a single netfilter classifier chain, there has to be some relative
> > ordering and the hooks provide this in a generic way.
>
> Unfortunately the numbers that Pablo shown are not measuring
> the right thing.

The numbers I've shown were measuring the critical ingress path. I
never posted numbers on generic hook infrastructure.

> > 840203pps 403Mb/sec
>
> this is 20 times less than what they should be.

It's ~10% less, because obviously we run more code in the generic hook
infrastructure, as it provides more features. So far it was only
possible to register one single chain at ingress. Eric Dumazet already
indicated that this benchmark only measures *one single CPU*, with the
icache fully populated. The numbers with SMP will get worse because
of the central lock contention.

These are the numbers I got banging *one single CPU*:

* Without patches + qdisc ingress:

Result: OK: 16298126(c16298125+d0) usec, 10000000 (60byte,0frags)
  613567pps 294Mb/sec (294512160bps) errors: 10000000

* With patches + qdisc ingress on top of hooks:

Result: OK: 18339281(c18339280+d0) usec, 10000000 (60byte,0frags)
  545277pps 261Mb/sec (261732960bps) errors: 10000000

* With patches + nftables ingress chain:

Result: OK: 17118167(c17118167+d0) usec, 10000000 (60byte,0frags)
  584174pps 280Mb/sec (280403520bps) errors: 10000000
Pablo Neira Ayuso April 30, 2015, 10:28 a.m. UTC | #27
On Thu, Apr 30, 2015 at 11:24:57AM +0200, Daniel Borkmann wrote:
> On 04/30/2015 08:02 AM, Alexei Starovoitov wrote:
> ...
> >My point is that I agree that cleanup of ingress qdisc is needed.
> >I disagree with drastic measures.
> >Just add your nf_hook to ingress and let's see how things evolve.
> >We have rx_handler and all of ptype hooks in there. One can argue
> >that rx_handler overlaps with nf_hook too ? ;)
> >We cannot generalize them all under one 'hook' infra.
> >nf needs to do nf_hook_state_init() and pass it around which
> >no one else needs. That's the cost others should not pay.
> 
> +1

Actually, the state object can be useful to resolve the major bug in
actions that mangle skbs in an illegal way, as we can use it to pass
the new skb_share_check()'ed skb back to the ingress path.

The genericity that they state object introduces comes with a cost, no
doubt, but it helps to extend things later on and resolve tricky
situation like the one above without large patches to propagate new
state information that you need all over the code.
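
Roughly the idea, as a sketch only: the skb field on nf_hook_state
below is hypothetical and does not exist in the current structure, it
just illustrates how a hook could hand an unshared skb back to the
caller.

static unsigned int unshare_ingress_hook(const struct nf_hook_ops *ops,
                                         struct sk_buff *skb,
                                         const struct nf_hook_state *state)
{
        struct sk_buff *nskb = skb_share_check(skb, GFP_ATOMIC);

        if (!nskb)
                return NF_STOLEN;       /* skb_share_check() already freed it */

        /* hypothetical field: pass the (possibly new) skb back so that
         * the ingress path continues working on it
         */
        ((struct nf_hook_state *)state)->skb = nskb;
        return NF_ACCEPT;
}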

Regarding the performance argument that keeps being repeated over and
over again, we are all quite aware that there is *good room for
improvement* in qdisc ingress itself...
Jamal Hadi Salim April 30, 2015, 11:55 a.m. UTC | #28
On 04/29/15 23:11, Patrick McHardy wrote:

> Quite frankly, that is ridiculous. I realize we turn in circles
> in this discussion, so we will provide the numbers. Give us a test
> case if you want, something you think is hard.


The usual tests I used to run are throughput and latency of the
datapath. Then, more importantly, update rates of policies - this with
and without traffic.
Start with zero rules. Add them logarithmically (with and without
traffic running), i.e. in order of {0, 1, 10, 100, 1000, ...}.
With a single rule you don't notice much difference. Start adding
rules and it becomes very obvious.
The latency of adding rule 1001 when there are, say, 1000 rules
already added is important from a control perspective, in particular
whether it stays linear.
I am sure this would make a good paper. Where are the firewo/men when
you need them?

>> For sure the attempt to re-use netfilter code from tc has failed.
>
> Agreed.
>

And I do note that, historically, netfilter folks have not been very
willing to help make this better over the years. Actually, I would
trace that back to Harald's time. Pablo has been a lot more helpful in
recent times, but now your answer is essentially to subsume tc into
netfilter;-> Great.
I could send a patch to delete it, but I worry since people do send
bug reports.

>> I think thats what you keep probably repeating as something that will
>> crash.
>
> *Everything* except the purely passive ones as the policer will crash.
> Everything. Its well known. No point in repeating it.
>

Please send fixes or pointers, or even private mail, so either
myself or someone else can fix it.
I am sure if you stared long enough at any code you'd find bugs.
As would be the case for netfilter. It depends how much time you are
willing to invest. But to just handwave with "fire fire" is not
helpful. In the engineering world people come up with and use
solutions, and when they find new use cases they either fix bugs that
existed or add new features. This is how it works in Linux. Bugs _do
get fixed_. I know that is hard to accept for people who take a lot
of pride in their work.
To illustrate with an example:
I saw you mentioning the stateless nat action in the other email;
I happen to know that it is _heavily_ used by some large players in
very serious environments. You are probably using tc nat without
knowing it. Are there bugs? Possibly, yes. Do they apply to those
two big systems? Very unlikely, otherwise they'd be fixed by now.
We don't intentionally keep bugs - but we don't aim
for perfection either. Let's refactor and fix bugs as needed.
I've known you for years and I know you genuinely care about bugs - so
I am in no way trying to diminish what you are saying.
I have just never agreed when you get outraged and try to kill
something because you found bugs. IOW, the logic of "there are bugs
in there, let's burn down the house" is not how we operate.


> I stopped caring about TC a long time ago, basically when actions were
> merged over my head, despite me being the only person actually taking
> care of TC. You might remember, I spent multiple months cleaning up the
> action mess when it was merged, and that was only the *really* bad bugs.
> Magic numbers in the APIs, lists limited to more magic numbers etc,

Yes, you fixed bugs, as did other people over the years. And I have no
doubt you introduced bugs while you were doing that, as did other
people. Magic numbers? Seriously? I am sure people are still fixing
magic numbers in netfilter this century and possibly this year.
If you care, fix it. Note:
there is a difference between fixing bugs and trying to reshape
code into your view of the world. The struggle over the years was you
trying to reshape things to your world - as it is in this case.

> I didn't care. Today I do not think its about individual problems anymore,
> the entire thing is broken. Egress has some shared problems with ingress,
> such as horrible userspace code, horrible error reporting, no way to use it
> even for me without looking at both TC and kernel code as the same time, but
> ingress is worse in the sense that the supermajority of it will actually
> crash in very likely cases, and its not about ipt, its everything touching
> a packet in the presence of a tap, like 2/3 of all actions. The qdiscs are
> mostly (except integration of classifiers) fine, everything else is crap.
>

Again, provide fixes or pointers so things can be fixed.

>
> The abstraction is *wrong*. There is no queueing, hence using a qdisc
> is the wrong abstraction. Why are we arguing about that? Its a
> mechanism to invoke classifiers and actions. What is the qdisc actually
> used for?

With all due respect, that is a very flawed argument in the Linux
world. Let me give you an example.
People use netdevs because they provide a nice abstraction and
tooling to go with it. Shall we kill all those netdevs which are
conveniences that take a packet in and out, because, hell, they have
nothing to do with ports? What about xfrm? Etc.
The qdisc provided all those niceties.

> It certainly doesn't queue anything, the only thing it's
> doing is imposing the worst locking imaginable on the entire path.
>

And yet despite this "worst locking imaginable" it is still faster than
your lockless approach.

> If that's not enough, look at the actions. I mean, you're not classifying
> for the sake of it, there are not even classes on ingress, its purely
> done for the actions. And they are, as stated, almost completely
> broken. We can do policing. That actually works. Maybe one or two more.
> Except policing, none of those is even remotely related to QoS.
> So let me ask again, in what sense is the abstraction actually right?
>

In the sense I described above.


> Quite frankly, talking about netfilter at all is so wrong that there
> isn't even a possibility to respond. Netfilter is a mechanism - a hook
> to receive a packet. And even that kicks your single threaded ingress
> qdiscs ass every packet of the day. What we're talking about are the
> things built on top. There's no question we also win on hardware that's
> not 1975 there, because people are actually using it.
>

Netfilter is a lot more usable - no doubt about it. It has always
been very good. But there are caveats.
At one point I think we were thankful all the crackheads were using
netfilter instead of tc because tc was harder to understand ;->.
In retrospect I think that was wrong ;->

So let me see if I can summarize your argument:
hardware is faster than in 1975, therefore let's just use netfilter
for usability reasons.
That's what all those kids using Ruby on Rails are arguing for.
And there is room for that.
I claim there is still room for C. Heck, we are even trying to move
things into hardware so we can go faster.
But I admit to being envious sometimes when I see code written very
rapidly, even though it makes me puke when I see how crappy the
performance is. I still haven't learnt to accept that - even though I
have come to terms with accepting bugs.

cheers,
jamal

Pablo Neira Ayuso April 30, 2015, 3:33 p.m. UTC | #29
Hi jamal,

On Thu, Apr 30, 2015 at 07:55:22AM -0400, Jamal Hadi Salim wrote:
[...]
> Start with a zero rules. Add them logarithmically (with and without
> traffic running). i.e in order of {0, 1, 10, 100, 1000, ...}
> With a single rule you dont notice much difference. Start adding rules
> and it becomes very obvious.

I think the days of linear ruleset performance competitions are over;
we have better data structures that allow users to arrange the ruleset
through multidimensional dictionaries and arbitrary state flows that
minimize the number of inspections, which is what harms performance
when it comes to packet classification.

> ... but now your answer is essentially to subsume tc into netfilter
> ;-> Great.

We're not subsuming anything under netfilter; that hook infrastructure
is generic, and qdisc ingress will not depend on layer 3 hooks. It's
just a bit more code to allow more than a single chain from ingress
and to potentially support more features from that entry point, in an
optional fashion.
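
Just to illustrate "more than a single chain": a second hook can sit
on the same NF_NETDEV_INGRESS point with a different priority,
following the same nf_hook_ops registration pattern as the sch_ingress
patch below (the extra hook, its name and its priority value are only
an example):

static unsigned int extra_ingress_hook(const struct nf_hook_ops *ops,
                                       struct sk_buff *skb,
                                       const struct nf_hook_state *state)
{
        /* a second classifier chain; let the packet continue */
        return NF_ACCEPT;
}

static struct nf_hook_ops extra_ops = {
        .hook     = extra_ingress_hook,
        .pf       = NFPROTO_NETDEV,
        .hooknum  = NF_NETDEV_INGRESS,
        .priority = 10,         /* runs after a priority 0 hook */
};

/* per device: extra_ops.dev = dev; nf_register_hook(&extra_ops); */

Both entries end up on the same hook list that nf_hook_slow() walks in
priority order.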

> ... IOW, the logic of "there are bugs
> in there, lets burn down the house" is not how we operate.

We're not burning down any house; after this patchset, qdisc ingress
will still be there. Qdisc ingress can keep going, fix the bugs and
improve its performance - there is a lot of room for it in its own
niche, through the flexibility that the generic hook infrastructure
provides.

> Netfilter is a lot more usable - no doubt about it. It has always
> been very good. But there are caveats.
> At one point i think we were thankful all the crackheads were using
> netfilter instead of tc because tc was harder to understand ;->.
> In retrospect i think that was wrong;->
> 
> So let me see if i can summarize your arguement:
> Hardware is faster than 1975 therefore lets just use netfilter
> for usability reasons. [...]

You keep saying that qdisc ingress outperforms; that is only true by a
very slight margin when comparing it with no rules on a single CPU
(when ported to the common playground of the generic hook
infrastructure). On SMP, nftables will outperform it, even more so if
the ruleset is arranged in a non-linear-list fashion, with all the new
tricks that we got.

Anyway, let's bring this "nftables vs. qdisc ingress" discussion to an
end. I think the main point of this discussion is to provide a generic
entry point for ingress filtering (for both qdisc ingress and
nftables) that, if unused, doesn't harm performance of the critical
netif_receive_core() path at all. Thus, users can choose what they
want. I have heard you saying several times "To each their poison",
and I like that.

Thanks.
Daniel Borkmann April 30, 2015, 4:09 p.m. UTC | #30
On 04/30/2015 05:33 PM, Pablo Neira Ayuso wrote:
...
> On Thu, Apr 30, 2015 at 07:55:22AM -0400, Jamal Hadi Salim wrote:
> [...]
>> Start with a zero rules. Add them logarithmically (with and without
>> traffic running). i.e in order of {0, 1, 10, 100, 1000, ...}
>> With a single rule you dont notice much difference. Start adding rules
>> and it becomes very obvious.
>
> I think the days of linear ruleset performance competitions are over,

Totally agree with you. You want to have a single classification pass
that parses the packet once and comes to a verdict immediately.

> we have better data structures to allow users to arrange the ruleset
> through the multidimensional dictionaries and the arbitrary state
> flows that minimize the number of inspections, which is what it harms
> performance when it comes to packet classification.

I think both have different use cases, though, but on the cls_bpf side
you have the maps infrastructure, which is evolving as well. Not
really speaking about the other remaining classifiers, however. I also
don't want to go any further into this vim vs emacs debate. ;) And,
personally, I also don't have any issue with offering alternatives to
users.

However, I still disagree with moving ingress behind this artificial
barrier if it's just not necessary. I believe, in your RFC v1 patch,
you had a second ingress hook as a static key for nft; I tend to like
that much better, consensus-wise. Both subsystems should not put
unnecessary barriers in each other's way, really.
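
A rough sketch of that variant, mirroring the ingress_needed code that
this patch removes from net/core/dev.c (the nft_ingress_needed name
and helpers here are made up for illustration):

static struct static_key nft_ingress_needed __read_mostly;

void net_inc_nft_ingress_queue(void)
{
        static_key_slow_inc(&nft_ingress_needed);
}
EXPORT_SYMBOL_GPL(net_inc_nft_ingress_queue);

void net_dec_nft_ingress_queue(void)
{
        static_key_slow_dec(&nft_ingress_needed);
}
EXPORT_SYMBOL_GPL(net_dec_nft_ingress_queue);

/* in __netif_receive_skb_core(), next to the existing tc key:
 *
 *      if (static_key_false(&ingress_needed)) {
 *              skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
 *              if (!skb)
 *                      goto unlock;
 *      }
 *      if (static_key_false(&nft_ingress_needed)) {
 *              ret = nf_hook_ingress(skb, &pt_prev, &ret, orig_dev);
 *              if (ret < 0)
 *                      goto unlock;
 *      }
 */

That way each subsystem only pays for its own key when it is actually
in use.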

Best,
Daniel
Pablo Neira Ayuso April 30, 2015, 4:36 p.m. UTC | #31
On Thu, Apr 30, 2015 at 06:09:25PM +0200, Daniel Borkmann wrote:
> I think both have different use cases, though, but on cls_bpf side you
> have maps infrastructure that is evolving as well. Not really speaking
> about the other remaining classifiers, however. I also don't want to go
> any further into this vim vs emacs debate. ;) And, personally, I also
> don't have any issue offering alternatives to users.
> 
> However, I still disagree with moving ingress behind this artificial
> barrier if it's just not necessary. I believe, in your RFC v1 patch,
> you had a second ingress hook as a static key for nft, I tend to like
> that much better consensus-wise. Both subsystems should not put
> unnecessary barriers into their way, really.

I'm coming to think that it would be good to have a single entry
point for ingress filtering.

But where are the barriers? These unfounded performance claims are
simply absurd; qdisc ingress performs only a bit better, just because
it executes a bit less code, and only in the single-CPU scenario with
no rules at all.
Alexei Starovoitov April 30, 2015, 7:05 p.m. UTC | #32
On Thu, Apr 30, 2015 at 12:12:04PM +0200, Pablo Neira Ayuso wrote:
> 
> These are the numbers I got banging *one single CPU*:
> 
> * Without patches + qdisc ingress:
> 
> Result: OK: 16298126(c16298125+d0) usec, 10000000 (60byte,0frags)
>   613567pps 294Mb/sec (294512160bps) errors: 10000000
> 
> * With patches + qdisc ingress on top of hooks:
> 
> Result: OK: 18339281(c18339280+d0) usec, 10000000 (60byte,0frags)
>   545277pps 261Mb/sec (261732960bps) errors: 10000000
> 
> * With patches + nftables ingress chain:
> 
> Result: OK: 17118167(c17118167+d0) usec, 10000000 (60byte,0frags)
> 
>   584174pps 280Mb/sec (280403520bps) errors: 10000000

So in other words you're saying: tc has to live with a 12%
slowdown (613k / 545k) only because _you_ want one hook
for both nft and tc?!

The numbers from my box are 22.4 Mpps vs 18 Mpps, which is a 24%
slowdown for TC due to nf_hook.
Notice I'm seeing _millions_ of packets per second processed by
netif_receive_skb->ingress_qdisc->u32,
whereas you're talking about _thousands_.
Even if your box is very old, it still doesn't explain this
huge difference.
Please post 'perf report' numbers, so we can help analyze
what is actually being measured. I bet netif_receive_skb
is not even in the top 10.

Daniel Borkmann April 30, 2015, 7:16 p.m. UTC | #33
On 04/30/2015 06:36 PM, Pablo Neira Ayuso wrote:
...
> But where are the barriers? These unfounded performance claims are
> simply absurd, qdisc ingress barely performs a bit better just because
> it executes a bit less code and only in the single CPU scenario with
> no rules at all.

I think we're going in circles a bit. :( You are right in saying that
there's currently a central spinlock, which is being worked on to get
rid of; you've seen the patch floating around on the list already. The
single-CPU, artificial micro-benchmarks that were done show a drop
from ~613Kpps to ~545Kpps on your machine; others have seen it more
amplified, as a 22.4Mpps to 18.0Mpps drop from
__netif_receive_skb_core() up to an empty dummy u32_classify() rule,
and it has already been acknowledged that this gap needs to be
improved. Let's call that unfounded then. I think we wouldn't even be
having this discussion if we didn't try to brute-force both worlds
behind this single static key, or have both invoked from within the
same layer/list.

Cheers,
Daniel
Daniel Borkmann April 30, 2015, 11:01 p.m. UTC | #34
On 04/30/2015 09:16 PM, Daniel Borkmann wrote:
> On 04/30/2015 06:36 PM, Pablo Neira Ayuso wrote:
> ...
>> But where are the barriers? These unfounded performance claims are
>> simply absurd, qdisc ingress barely performs a bit better just because
>> it executes a bit less code and only in the single CPU scenario with
>> no rules at all.
>
> I think we're going in circles a bit. :( You are right in saying that
> currently, there's a central spinlock, which is worked on to get rid
> of, you've seen the patch on the list floating around already. Single
> CPU, artificial micro-benchmark, which were done show that you see on
> your machine ~613Kpps to ~545Kpps, others have seen it more amplified
> as 22.4Mpps to 18.0Mpps drop from __netif_receive_skb_core() up to an
> empty dummy u32_classify() rule, which has already been acknowledged
> that this gap needs to be improved. Lets call it unfounded then. I
> think we wouldn't even have this discussion if we wouldn't try brute
> forcing both worlds behind this single static key, or, have both
> invoked from within the same layer/list.

Ok, out of curiosity, I did the same as both of you: I'm using a pretty
standard Supermicro X10SLM-F/X10SLM-F, Xeon E3-1240 v3.

*** ingress + dummy u32, net-next:

w/o perf:
...
Result: OK: 5157948(c5157388+d559) usec, 100000000 (60byte,0frags)
   19387551pps 9306Mb/sec (9306024480bps) errors: 100000000

perf record -C0 -ecycles:k ./pktgen.sh p1p1
...
Result: OK: 5182638(c5182057+d580) usec, 100000000 (60byte,0frags)
   19295191pps 9261Mb/sec (9261691680bps) errors: 100000000

   26.07%   kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
   14.39%   kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
   13.69%   kpktgend_0  [cls_u32]          [k] u32_classify
   11.75%   kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
    5.34%   kpktgend_0  [sch_ingress]      [k] ingress_enqueue
    5.21%   kpktgend_0  [kernel.kallsyms]  [k] tc_classify_compat
    4.93%   kpktgend_0  [kernel.kallsyms]  [k] skb_defer_rx_timestamp
    3.41%   kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
    3.21%   kpktgend_0  [pktgen]           [k] pktgen_thread_worker
    3.16%   kpktgend_0  [kernel.kallsyms]  [k] tc_classify
    3.08%   kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
    2.05%   kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
    1.60%   kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
    1.15%   kpktgend_0  [kernel.kallsyms]  [k] classify
    0.45%   kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip

*** nf hook infra + ingress + dummy u32, net-next:

w/o perf:
...
Result: OK: 6555903(c6555744+d159) usec, 100000000 (60byte,0frags)
   15253426pps 7321Mb/sec (7321644480bps) errors: 100000000

perf record -C0 -ecycles:k ./pktgen.sh p1p1
...
Result: OK: 6591291(c6591153+d138) usec, 100000000 (60byte,0frags)
   15171532pps 7282Mb/sec (7282335360bps) errors: 100000000

   25.94%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
   12.19%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
   11.00%  kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
   10.58%  kpktgend_0  [cls_u32]          [k] u32_classify
    5.34%  kpktgend_0  [sch_ingress]      [k] handle_ing
    4.68%  kpktgend_0  [kernel.kallsyms]  [k] nf_iterate
    4.33%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify_compat
    4.32%  kpktgend_0  [sch_ingress]      [k] ingress_enqueue
    3.62%  kpktgend_0  [kernel.kallsyms]  [k] skb_defer_rx_timestamp
    2.95%  kpktgend_0  [kernel.kallsyms]  [k] nf_hook_slow
    2.75%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
    2.60%  kpktgend_0  [kernel.kallsyms]  [k] tc_classify
    2.52%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
    2.50%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
    1.77%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
    1.28%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
    0.94%  kpktgend_0  [kernel.kallsyms]  [k] classify
    0.38%  kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip

*** drop ingress spinlock (patch w/ bstats addition) + ingress +
     dummy u32, net-next:

w/o perf:
...
Result: OK: 4789828(c4789353+d474) usec, 100000000 (60byte,0frags)
   20877576pps 10021Mb/sec (10021236480bps) errors: 100000000

perf record -C0 -ecycles:k ./pktgen.sh p1p1
...
Result: OK: 4829276(c4828437+d839) usec, 100000000 (60byte,0frags)
   20707036pps 9939Mb/sec (9939377280bps) errors: 100000000

   33.11%   kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
   15.27%   kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
   14.60%   kpktgend_0  [cls_u32]          [k] u32_classify
    6.06%   kpktgend_0  [sch_ingress]      [k] ingress_enqueue
    5.55%   kpktgend_0  [kernel.kallsyms]  [k] tc_classify_compat
    5.31%   kpktgend_0  [kernel.kallsyms]  [k] skb_defer_rx_timestamp
    3.77%   kpktgend_0  [pktgen]           [k] pktgen_thread_worker
    3.45%   kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
    3.33%   kpktgend_0  [kernel.kallsyms]  [k] tc_classify
    3.33%   kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
    2.34%   kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
    1.78%   kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
    1.15%   kpktgend_0  [kernel.kallsyms]  [k] classify
    0.48%   kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip

That means that moving ingress behind the nf hooks shows, in this
micro-benchmark, a similar slowdown to the one Alexei saw, with a
worst case of ~27%.

Now, in the real world, that might just end up as a few percent
depending on the use case, but really, why should we go down that
path if we can just avoid it?

If you find a way where both tc/nf hooks are triggered from within
the same list, then that would probably look better already. Or, as a
start, as mentioned, go with a second static key for netfilter, which
can later still be reworked for a better integration, although I agree
with you that it's less clean, and I see the point of consolidating
code.

If you want, I'm happy to provide numbers for a next set as well;
feel free to ping me.

Thanks,
Daniel
Jamal Hadi Salim May 1, 2015, 1:15 a.m. UTC | #35
Pablo,

On 04/30/15 11:33, Pablo Neira Ayuso wrote:

>
> You keep saying that qdisc ingress outperforms, that's only right for
> just a very slight difference when comparing it with no rules on
> single CPU (when ported to the common playground of the generic hook
> infrastructure). On SMP nftables will outperform, even more if the
> ruleset is arranged in a non-linear list fashion, with all the new
> tricks that we got.
>

I am interested to see the numbers. I think this would be a great paper;
it is extremely tempting to spend time on it.

> Anyway, let's take this "nftables vs. qdisc ingress" discussion to an
> end. I think the main point of this discussion is to provide a generic
> entry point to ingress filtering (for both qdisc ingress and nftables)
> that, if unused, doesn't harm performance of the critical path
> netif_receive_core() path at all. Thus, users can choose what they
> want, I have heard you saying several times: "To each their poison"
> and I like that.
>

Yes - but my good friend Patrick is not saying that.
I don't want to turn on netfilter in order to get tc actions on
ingress. And I don't want to be slowed down because the code path has
now become longer. We are trying to prune the code path. If somehow
you can work to not affect performance, then we can live well
together.

cheers,
jamal
diff mbox

Patch

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 2da5d10..5b2cc04 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -79,19 +79,6 @@  static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
 
 struct netdev_queue *dev_ingress_queue_create(struct net_device *dev);
 
-#ifdef CONFIG_NET_CLS_ACT
-void net_inc_ingress_queue(void);
-void net_dec_ingress_queue(void);
-#else
-static inline void net_inc_ingress_queue(void)
-{
-}
-
-static inline void net_dec_ingress_queue(void)
-{
-}
-#endif
-
 extern void rtnetlink_init(void);
 extern void __rtnl_unlock(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fa8a262..af03263 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1631,22 +1631,6 @@  int call_netdevice_notifiers(unsigned long val, struct net_device *dev)
 }
 EXPORT_SYMBOL(call_netdevice_notifiers);
 
-#ifdef CONFIG_NET_CLS_ACT
-static struct static_key ingress_needed __read_mostly;
-
-void net_inc_ingress_queue(void)
-{
-	static_key_slow_inc(&ingress_needed);
-}
-EXPORT_SYMBOL_GPL(net_inc_ingress_queue);
-
-void net_dec_ingress_queue(void)
-{
-	static_key_slow_dec(&ingress_needed);
-}
-EXPORT_SYMBOL_GPL(net_dec_ingress_queue);
-#endif
-
 static struct static_key netstamp_needed __read_mostly;
 #ifdef HAVE_JUMP_LABEL
 /* We are not allowed to call static_key_slow_dec() from irq context
@@ -3521,67 +3505,6 @@  int (*br_fdb_test_addr_hook)(struct net_device *dev,
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-#ifdef CONFIG_NET_CLS_ACT
-/* TODO: Maybe we should just force sch_ingress to be compiled in
- * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
- * a compare and 2 stores extra right now if we dont have it on
- * but have CONFIG_NET_CLS_ACT
- * NOTE: This doesn't stop any functionality; if you dont have
- * the ingress scheduler, you just can't add policies on ingress.
- *
- */
-static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
-{
-	struct net_device *dev = skb->dev;
-	u32 ttl = G_TC_RTTL(skb->tc_verd);
-	int result = TC_ACT_OK;
-	struct Qdisc *q;
-
-	if (unlikely(MAX_RED_LOOP < ttl++)) {
-		net_warn_ratelimited("Redir loop detected Dropping packet (%d->%d)\n",
-				     skb->skb_iif, dev->ifindex);
-		return TC_ACT_SHOT;
-	}
-
-	skb->tc_verd = SET_TC_RTTL(skb->tc_verd, ttl);
-	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
-
-	q = rcu_dereference(rxq->qdisc);
-	if (q != &noop_qdisc) {
-		spin_lock(qdisc_lock(q));
-		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			result = qdisc_enqueue_root(skb, q);
-		spin_unlock(qdisc_lock(q));
-	}
-
-	return result;
-}
-
-static inline struct sk_buff *handle_ing(struct sk_buff *skb,
-					 struct packet_type **pt_prev,
-					 int *ret, struct net_device *orig_dev)
-{
-	struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
-
-	if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	switch (ing_filter(skb, rxq)) {
-	case TC_ACT_SHOT:
-	case TC_ACT_STOLEN:
-		kfree_skb(skb);
-		return NULL;
-	}
-
-	return skb;
-}
-#endif
-
 /**
  *	netdev_rx_handler_register - register receive handler
  *	@dev: device to register a handler for
@@ -3726,16 +3649,6 @@  another_round:
 	}
 
 skip_taps:
-#ifdef CONFIG_NET_CLS_ACT
-	if (static_key_false(&ingress_needed)) {
-		skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
-		if (!skb)
-			goto unlock;
-	}
-
-	skb->tc_verd = 0;
-ncls:
-#endif
 	if (nf_hook_ingress_active(skb)) {
 		ret = nf_hook_ingress(skb, &pt_prev, &ret, orig_dev);
 		if (ret < 0) {
@@ -3743,7 +3656,10 @@  ncls:
 			goto unlock;
 		}
 	}
-
+#ifdef CONFIG_NET_CLS_ACT
+	skb->tc_verd = 0;
+ncls:
+#endif
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
 		goto drop;
 
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 2274e72..23b57da 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -312,6 +312,7 @@  config NET_SCH_PIE
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
+	select NETFILTER_INGRESS
 	---help---
 	  Say Y here if you want to use classifiers for incoming packets.
 	  If unsure, say Y.
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 4cdbfb8..996c823 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -14,10 +14,11 @@ 
 #include <linux/rtnetlink.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
-
+#include <linux/netfilter_hooks.h>
 
 struct ingress_qdisc_data {
 	struct tcf_proto __rcu	*filter_list;
+	struct nf_hook_ops	ops;
 };
 
 /* ------------------------- Class/flow operations ------------------------- */
@@ -88,11 +89,62 @@  static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 
 /* ------------------------------------------------------------- */
 
+static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
+{
+	struct net_device *dev = skb->dev;
+	u32 ttl = G_TC_RTTL(skb->tc_verd);
+	int result = TC_ACT_OK;
+	struct Qdisc *q;
+
+	if (unlikely(MAX_RED_LOOP < ttl++)) {
+		net_warn_ratelimited("Redir loop detected Dropping packet (%d->%d)\n",
+				     skb->skb_iif, dev->ifindex);
+		return TC_ACT_SHOT;
+	}
+
+	skb->tc_verd = SET_TC_RTTL(skb->tc_verd, ttl);
+	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
+
+	q = rcu_dereference(rxq->qdisc);
+	if (q != &noop_qdisc) {
+		spin_lock(qdisc_lock(q));
+		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
+			result = qdisc_enqueue_root(skb, q);
+		spin_unlock(qdisc_lock(q));
+	}
+
+	return result;
+}
+
+static unsigned int handle_ing(const struct nf_hook_ops *ops,
+			       struct sk_buff *skb,
+			       const struct nf_hook_state *state)
+{
+	struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
+
+	if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
+		return NF_ACCEPT;
+
+	switch (ing_filter(skb, rxq)) {
+	case TC_ACT_SHOT:
+	case TC_ACT_STOLEN:
+		return NF_DROP;
+	}
+
+	return NF_ACCEPT;
+}
+
 static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
 {
-	net_inc_ingress_queue();
+	struct ingress_qdisc_data *p = qdisc_priv(sch);
+
+	p->ops.hook	= handle_ing;
+	p->ops.pf	= NFPROTO_NETDEV;
+	p->ops.hooknum	= NF_NETDEV_INGRESS;
+	p->ops.priority = 0;
+	p->ops.dev	= qdisc_dev(sch);
 
-	return 0;
+	return nf_register_hook(&p->ops);
 }
 
 static void ingress_destroy(struct Qdisc *sch)
@@ -100,7 +152,7 @@  static void ingress_destroy(struct Qdisc *sch)
 	struct ingress_qdisc_data *p = qdisc_priv(sch);
 
 	tcf_destroy_chain(&p->filter_list);
-	net_dec_ingress_queue();
+	nf_unregister_hook(&p->ops);
 }
 
 static int ingress_dump(struct Qdisc *sch, struct sk_buff *skb)