
[v3,net-next,2/2] tc: add 'needs_l2' flag to ingress qdisc

Message ID 1428535575-7736-2-git-send-email-ast@plumgrid.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Alexei Starovoitov April 8, 2015, 11:26 p.m. UTC
TC classifiers and actions attached to ingress and egress qdiscs see
inconsistent skb->data: for ingress the L2 header is already pulled,
whereas for egress it is present. Introduce an optional flag for the
ingress qdisc which, if set, causes ingress to push the L2 header before
calling into classifiers/actions and pull it back afterwards.

The following cls/act modules assume L2 and are now marked 'needs_l2':
cls_bpf, act_bpf, cls_rsvp, act_csum, act_nat.
Users can attach them to an ingress qdisc created with the 'needs_l2'
flag and to any egress qdisc; using them with vanilla ingress is disallowed.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
V2->V3: rewrite to use an optional flag for the ingress qdisc and
corresponding flags for classifiers and actions

cls_bpf/act_bpf will always require L2.

cls_rsvp, act_csum and act_nat do pskb_may_pull/skb_clone_writable
assuming that skb->data points to L2. Should be fixable.

Everything that calls into ematch is also potentially broken on ingress.
For example, em_canid does:
  struct can_frame *cf = (struct can_frame *)skb->data;
and other em-s do:
  case TCF_LAYER_LINK:
    return skb->data;
Should be fixable as well.

skb_share_check() can be costly when taps are used. In my tc+bpf use
cases that doesn't happen, so I'm not worried at the moment.
If tc+bpf usage changes in the future, we can add an 'early_ingress_l2'
flag that will bypass taps and avoid skb_share_check. This
early_handle_ing() hook can be wrapped with a static_key as well.

 include/net/act_api.h          |    9 +++++----
 include/net/sch_generic.h      |    3 +++
 include/uapi/linux/pkt_sched.h |   10 ++++++++++
 net/core/dev.c                 |   22 ++++++++++++++++++++--
 net/sched/act_api.c            |   21 +++++++++++++++------
 net/sched/act_bpf.c            |    1 +
 net/sched/act_csum.c           |    2 ++
 net/sched/act_nat.c            |    2 ++
 net/sched/cls_api.c            |   21 +++++++++++++++++++--
 net/sched/cls_bpf.c            |    1 +
 net/sched/cls_rsvp.h           |    2 ++
 net/sched/sch_ingress.c        |   27 +++++++++++++++++++++++++++
 samples/bpf/tcbpf1_kern.c      |    6 ------
 13 files changed, 107 insertions(+), 20 deletions(-)

Comments

David Miller April 9, 2015, 2:44 a.m. UTC | #1
From: Alexei Starovoitov <ast@plumgrid.com>
Date: Wed,  8 Apr 2015 16:26:15 -0700

> index 4cd5cf1aedf8..887033fbdfa6 100644
> --- a/net/sched/act_csum.c
> +++ b/net/sched/act_csum.c
> @@ -565,6 +565,8 @@ static struct tc_action_ops act_csum_ops = {
>  	.act		= tcf_csum,
>  	.dump		= tcf_csum_dump,
>  	.init		= tcf_csum_init,
> +	/* todo: fix pskb_may_pull in tcf_csum to not assume L2 */
> +	.flags		= TCA_NEEDS_L2,
>  };
>  
>  MODULE_DESCRIPTION("Checksum updating actions");


All this module really cares about is where skb_network_header() is
for data accesses.  And a simple offset would allow it to calculate
the pskb_may_pull() length argument correctly.

And it would therefore work without all of these expensive SKB share
check calls etc.

I really don't like your approach, and I'd rather you propagate the
offset into the BPF programs you generate.  I bet you can do it in
an efficient way, and just haven't put in the effort to discover
how.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov April 9, 2015, 3:05 a.m. UTC | #2
On 4/8/15 7:44 PM, David Miller wrote:
> From: Alexei Starovoitov <ast@plumgrid.com>
> Date: Wed,  8 Apr 2015 16:26:15 -0700
>
>> index 4cd5cf1aedf8..887033fbdfa6 100644
>> --- a/net/sched/act_csum.c
>> +++ b/net/sched/act_csum.c
>> @@ -565,6 +565,8 @@ static struct tc_action_ops act_csum_ops = {
>>   	.act		= tcf_csum,
>>   	.dump		= tcf_csum_dump,
>>   	.init		= tcf_csum_init,
>> +	/* todo: fix pskb_may_pull in tcf_csum to not assume L2 */
>> +	.flags		= TCA_NEEDS_L2,
>>   };
>>
>>   MODULE_DESCRIPTION("Checksum updating actions");
>
>
> All this module really cares about is where skb_network_header() is
> for data accesses.  And a simple offset would allow it to calculate
> the pskb_may_pull() length argument correctly.
>
> And it would therefore work without all of these expensive SKB share
> check calls etc.

Yes, act_csum can be fixed; I mentioned that in the commit log.
I can drop the markings for anything other than cls_bpf/act_bpf if you like?

> I really don't like your approach, and I'd rather you propagate the
> offset into the BPF programs you generate.  I bet you can do it in
> an efficient way, and just haven't put in the effort to discover
> how.

I'm sure there is a way to propagate the offset into the programs.
It's not about efficiency of the programs, but about consistency.
Programs should know nothing about the kernel. Sending the network
offset into them exposes this very specific kernel behavior.
As soon as we do that, all programs need to be written with this
offset in mind so they can work with both ingress and egress. It would
also break sockex1_kern.c, sockex2_kern.c and all other bpf programs.
That's not acceptable to me.
As I said, I'd rather disable attaching them to ingress than force
all program authors to always use the network offset.
'Offset 0 points to L2' was always a fundamental assumption of BPF,
both in classic and now carried over into extended.
We cannot change that.
We can cheat and say 'bpf for ingress' has to use new rules, but
that's just wrong. Imagine documenting it everywhere and all the
confusion it will cause.

I still don't understand why you don't like it.
The bridge does skb_share_check and that's acceptable there.
What's wrong with doing it for ingress here, under a flag?
Would an 'early_ingress_l2' hook and flag be better?
Like adding:
skb = early_handle_ing(skb, &pt_prev, &ret, orig_dev);
right before the taps in __netif_receive_skb_core?
This way we can safely push/pull without the share check.

David Miller April 9, 2015, 3:14 a.m. UTC | #3
From: Alexei Starovoitov <ast@plumgrid.com>
Date: Wed, 08 Apr 2015 20:05:13 -0700

> I'm sure there is a way to propagate the offset into the programs.
> It's not about efficiency of programs, but about consistency.
> Programs should know nothing about kernel. Sending network offset
> into them is exposing this very specific kernel behavior.

It can be performed by the data access helpers the JIT'd programs
have to invoke anyways.
Alexei Starovoitov April 9, 2015, 3:39 a.m. UTC | #4
On 4/8/15 8:14 PM, David Miller wrote:
> From: Alexei Starovoitov <ast@plumgrid.com>
> Date: Wed, 08 Apr 2015 20:05:13 -0700
>
>> I'm sure there is a way to propagate the offset into the programs.
>> It's not about efficiency of programs, but about consistency.
>> Programs should know nothing about kernel. Sending network offset
>> into them is exposing this very specific kernel behavior.
>
> It can be performed by the data access helpers the JIT'd programs
> have to invoke anyways.

hmm, not sure what you mean.
Let's take a specific line from sockex1_kern.c:
int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
This C code is compiled into
R0 = LD_ABS_B 23
(an instruction with a fixed offset), which the interpreter executes as:
skb_header_pointer(skb, 23, 1, buffer);
and JITs similarly emit
r0 = *(char *)(skb->data + 23)
in this case.

Are you proposing to change the semantics of the LD_ABS instruction to
use skb->head + skb->mac_header instead of skb->data in the interpreter
and in all JITs?
Performance-wise it would be ok, since JITs can cache that pointer,
but that would be a huge and very risky change.
I'm not sure yet whether all programs would keep working afterwards.
Is it really worth taking so much risk vs a push/pull of L2?
If you say 'let's take the risk', sure, I can try hacking all the bits
and see what the cost is.
Alexei Starovoitov April 9, 2015, 5:20 a.m. UTC | #5
On 4/8/15 8:39 PM, Alexei Starovoitov wrote:
> On 4/8/15 8:14 PM, David Miller wrote:
>> From: Alexei Starovoitov <ast@plumgrid.com>
>> Date: Wed, 08 Apr 2015 20:05:13 -0700
>>
>>> I'm sure there is a way to propagate the offset into the programs.
>>> It's not about efficiency of programs, but about consistency.
>>> Programs should know nothing about kernel. Sending network offset
>>> into them is exposing this very specific kernel behavior.
>>
>> It can be performed by the data access helpers the JIT'd programs
>> have to invoke anyways.
>
> hmm, not sure what you mean.
> Let's take specific line from sockex1_kern.c:
> int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
> this C code is compiled into
> R0 = LD_ABS_B 23
> (instruction with fixed offset)
> which is being interpreted as:
> skb_header_pointer(skb, 23, 1, buffer);
> and JITs similarly emit
> r0 = *(char *)(skb->data + 23)
> in this case.
>
> Are you proposing to change semantics of LD_ABS instruction to use
> skb->head + skb->mac_header instead of skb->data in interpreter
> and in all JITs?
> Performance wise it will be ok, since JITs can cache that pointer.
> But that will be huge and very risky change.
> I'm not sure yet whether all programs will keep working afterwards.
> Is it really worth taking so much risk vs push/pull of L2?
> If you say, let's take the risk, sure, I can try hacking all the bits
> and see what the cost.

I have to correct myself:
we cannot change the meaning of ld_abs, since for dgram sockets
the offset is actually not pointing to L2.
af_packet does:
  if (sk->sk_type != SOCK_DGRAM)
       skb_push(skb, skb->data - skb_mac_header(skb));
  res = run_filter(skb, sk, snaplen);
so not everything assumes L2. Raw socket taps, ppp, team and cls_bpf
certainly want to see L2.
I still hate to see ingress and egress cls/act programs being different.
af_packet also does:
skb = skb_share_check(skb, GFP_ATOMIC);

so Dave, would you consider an early_ingress_l2() hook that doesn't
need to do skb_share_check?
If not, the only option I have is to introduce a set of helpers
that can read from the packet where the offset always starts at L2.
Then we can say that cls/act_bpf should never use the ld_abs/ld_ind
instructions if they want to run on both ingress and egress, and
should use these new read helpers instead.

David Miller April 9, 2015, 5:25 a.m. UTC | #6
From: Alexei Starovoitov <ast@plumgrid.com>
Date: Wed, 08 Apr 2015 22:20:47 -0700

> so Dave, would you consider early_ingress_l2() hook that doesn't need
> to do skb_share_check ?

Let me think about it some more.
Jamal Hadi Salim April 9, 2015, 10:49 a.m. UTC | #7
On 04/08/15 23:05, Alexei Starovoitov wrote:
[..]
> As I said I'd rather disable them attaching to ingress than force
> all program authors to always use network_offset.
> Offset 0 points to L2 was always fundamental assumption of BPF.

The issue is not really the ingress qdisc here - rather, the ingress
qdisc tries to cope with _how things work_.
The challenge is that *some netdevs* (not all) strip off the link
layer _before_ the ingress qdisc sees them. You have to recognize
those devices and special-case only them. Your assumption of
blindly pushing/pulling will break in some cases (take a look at
mirred).

Having said that:
why could you not use the AT bits to identify whether you are
being invoked at ingress in the ebpf cls/act wrapper? IOW:
have bpf cls/act be responsible for that knowledge.
Your changes penalize everyone else because of this assumption
bpf makes. We have always tried to be sensitive to performance.
People like Eric D. complain about a single if statement in this
code path, and you are adding a bunch more unconditionally.
If all I wanted to do was police or account (a good number of use
cases), suddenly this gets more costly.

cheers,
jamal
Jamal Hadi Salim April 9, 2015, 11:02 a.m. UTC | #8
On 04/09/15 06:49, Jamal Hadi Salim wrote:
> On 04/08/15 23:05, Alexei Starovoitov wrote:
> [..]
>> As I said I'd rather disable them attaching to ingress than force
>> all program authors to always use network_offset.
>> Offset 0 points to L2 was always fundamental assumption of BPF.
>
> The issue is not really the ingress qdisc here - rather, the ingress
> qdisc tries to cope with _how things work_.
> The challenge is that *some netdevs* (not all) strip off the link
> layer _before_ the ingress qdisc sees them. You have to recognize
> those devices and special-case only them. Your assumption of
> blindly pushing/pulling will break in some cases (take a look at
> mirred).
>

To be precise, look at tcf_mirred_init() and how ok_push is decided
based on dev->type. We do this in the slow path and then make a
simple compare in the fast path on whether to push the L2 header.

cheers,
jamal
Daniel Borkmann April 9, 2015, 3:15 p.m. UTC | #9
On 04/09/2015 07:20 AM, Alexei Starovoitov wrote:
...
> we cannot change the meaning of ld_abs, since for dgram sockets
> offset is actually not pointing to L2.
> af_packet is doing:
>   if (sk->sk_type != SOCK_DGRAM)
>        skb_push(skb, skb->data - skb_mac_header(skb));
>   res = run_filter(skb, sk, snaplen);
> so not everything assumes L2.

It would not just be that; e.g. sock_queue_rcv_skb() users
have differing offset assumptions as well when they call
from their protocols into sk_filter().
Eric Dumazet April 9, 2015, 3:36 p.m. UTC | #10
On Thu, 2015-04-09 at 17:15 +0200, Daniel Borkmann wrote:
> On 04/09/2015 07:20 AM, Alexei Starovoitov wrote:
> ...
> > we cannot change the meaning of ld_abs, since for dgram sockets
> > offset is actually not pointing to L2.
> > af_packet is doing:
> >   if (sk->sk_type != SOCK_DGRAM)
> >        skb_push(skb, skb->data - skb_mac_header(skb));
> >   res = run_filter(skb, sk, snaplen);
> > so not everything assumes L2.
> 
> It would be not just that e.g. sock_queue_rcv_skb() users
> have differing offset assumptions as well when they call
> from their protocols into sk_filter().

It seems that passing an extra parameter
could help those users, and would remove the requirement
of skb_clone() from dev_queue_xmit_nit():
skb_clone() would only be done after run_filter().

Maybe we'll finally have a fast packet capture, instead of simply
focusing on the filter part, which is tiny.

Let's take this discussion as the incentive to finally get this done
right.


Daniel Borkmann April 9, 2015, 3:38 p.m. UTC | #11
On 04/09/2015 12:49 PM, Jamal Hadi Salim wrote:
...
> Your changes penalize everyone else because of this assumption
> bpf makes. We have always tried to be sensitive to performance.

That includes also BPF, right? ;) I mean, you'd need to push extra
unneeded per-packet instructions down into the interpreter and
JITs that neither the output path of {cls,act}_bpf nor other users
working on skbs need - the team driver, all possible kinds of
sockets with filters attached, xt_bpf, etc. - just to accommodate
the ingress use case. I understand your concern, but making BPF
cls/act responsible for that knowledge has its downsides just as well.

Moreover, we'd force user space to start programming with
unintuitive negative offsets when accessing the mac layer, and cls_bpf
at least, since it's been around for some time, would need to start
differentiating between classic and native eBPF to keep compat
with old BPF programs for the output path. That's pretty messy. :/
Alexei Starovoitov April 9, 2015, 5:03 p.m. UTC | #12
On 4/9/15 3:49 AM, Jamal Hadi Salim wrote:
> those devices and special-case only them. Your assumption of
> blindly pushing/pulling will break in some cases (take a look at
> mirred).

You underestimate how much time I've spent studying mirred ;)
In particular, my v2 patch fixes two bugs in it:
- it was forwarding packets with L2 present to tunnel devices when used
   with an egress qdisc
- it wasn't updating skb->csum with an ingress qdisc

In v3, since 'needs_l2' is an optional flag, I didn't touch mirred
and left its bugs for future patches.

> Your changes penalize everyone else because of this assumption
> bpf makes. We have always tried to be sensitive to perfomance.

hmm, penalize everyone? It's an optional flag.
If ingress_l2 is used with cls_bpf plus some other classifier, then
sure, the other classifier pays the potential cost of skb_share_check
if taps are active. The chances of having more than a couple of
classifiers on a single device are very slim, since they're called
sequentially, which kills performance regardless.
In most of the use cases I'm after, cls_bpf will be the only classifier
attached. In some cases there may be a few others, but cls_bpf will be
the first and the heaviest one, so it makes sense to optimize for it.

> code path and you are adding a bunch more unconditionally.

Not true at all. It looks like you misread the v3 patch completely:
it's an optional flag.

Alexei Starovoitov April 9, 2015, 5:20 p.m. UTC | #13
On 4/9/15 8:36 AM, Eric Dumazet wrote:
>
> Maybe we'll finally have a fast packet capture, instead of simply
> focusing on the filter part, which is tiny.

My plan for that was to use an early_ingress_l2() hook before the taps,
which will push L2 and call into the program _without_ skb_share_check.
We'll use it on the main interface where all traffic is seen.
The program will parse the packet the way it likes and redirect
the interesting packets to either a different physical interface or
another tap/veth where user space will capture them. Then for 99%
of traffic on the main interface we don't pay any extra cost, and if
an external physical device is used for capturing it's even cheaper.
On the tracing side we were planning to use the existing perf ring
buffer infra to pass data to user space. The same helpers can be used
to pass info about the packet when user space doesn't need to see the
whole packet. In that case the bpf program will extract the interesting
headers from the packet, store them into the ring buffer and copy them
to user space, which is extremely fast.


Jamal Hadi Salim April 10, 2015, 11:49 a.m. UTC | #14
On 04/09/15 11:38, Daniel Borkmann wrote:
> On 04/09/2015 12:49 PM, Jamal Hadi Salim wrote:
> ...
>> Your changes penalize everyone else because of this assumption
>> bpf makes. We have always tried to be sensitive to perfomance.
>
> That includes also BPF, right? ;)

Yes, of course - we are trying to reach some conclusion, I hope. I
have no qualms with ebpf, but I have concerns for other users of tc.
If the whole world is suddenly going to shift to ebpf then it is
easy to make the call; I doubt that will _ever_ happen.
As the new kid on the block, ebpf needs to provide a strong case for
making such changes, which affect everyone else.
The simple "fix" that Alexei posted will not suffice to make
everything on ingress appear at offset 0. I am afraid a lot more
intrusion will be needed (and we have avoided it all these years).

> I mean you'd need to push extra
> unneeded per-packet instructions down into the interpreter and
> JITs that neither the output path needs in case of {cls,act}_bpf,
> and generally other users working on skbs such as team driver, all
> possible kind of sockets with filters attached, xt_bpf, etc, etc
> just to accommodate for the ingress use-case. I mean I understand
> your concern, but making BPF cls/act responsible for that knowledge
> has it's downsides just as well.
>

The main downside is usability for someone who wants to write code
that is inserted into the kernel. They have to know whether their code
will run on ingress or egress, the type of device, etc.
The AT_XXX provides that signal and dev->type fills in the other gap.
Such coders, I hope, will have enough knowledge. It is close to
someone writing netfilter hooks needing to know whether something is
at post/pre-routing etc.

> Moreover, we'd enforce user space to start programming with
> unintuitive negative offsets when accessing mac layer, and cls_bpf
> at least, since it's around for some time, would need to start
> differentiating between classic and native eBPF to keep compat
> with old BPF programs for the output path. That's pretty messy. :/


I agree that user-facing usability issues need to be addressed, but that
is not the same as someone writing ebpf code.
The negative-offset presentation to the user does look ugly. You can
teach tc, for the case of u32, to hide that from the user.

cheers,
jamal
Jamal Hadi Salim April 10, 2015, 12:48 p.m. UTC | #15
On 04/09/15 13:03, Alexei Starovoitov wrote:
> On 4/9/15 3:49 AM, Jamal Hadi Salim wrote:
>> those devices and special-case only them. Your assumption of
>> blindly pushing/pulling will break in some cases (take a look at
>> mirred).
>
> you underestimated how much time I've spent studying mirred ;)
> In particular my v2 patch fixes two bugs in it:
> - it was forwarding packets with L2 present to tunnel devices if used
>    with egress qdisc
> - it wasn't updating skb->csum with ingress qdisc
>
> In v3 since 'needs_l2' is an optional flag, I didn't touch mirred
> and left its bugs for the future patches.
>

Alexei, when you do things like this based on some policy, as you
suggest:
+	skb_push(skb, hard_header_len);
+	skb_postpush_rcsum(skb, skb->data, hard_header_len);

That doesn't give you L2 at offset 0 for _some devices_, which is what
you say you need. On some devices it adds a second L2 header.
This is why I keep pointing you to mirred. That is just how
the system works today.

To your "bugs" comments:
- It is a reasonable but borderline grey-area interpretation of policy
intent to call the additional L2 header a "bug". It is clearly
undesirable to send additional L2 headers (depending on the tunnel
device), and up to now I hadn't thought of it as a "bug", but I would
concede that point. In my testing I thought that was a feature. This
is different from the ingress side, where it simply won't work if
the L2 header is missing.
- Updating csum: earlier I said I was conflicted about it being a
useful "feature". You are repeating again that it is a bug. It is not.
This action is intended to mirror or redirect packets, period.
Unix philosophy: do small things and do them well.
Adding checksum recomputation may be useful if some preceding action
required it, but calling the lack of checksum recomputation a bug is
a big stretch.


>
> hmm. penalize everyone? it's an optional flag.

You added at minimum two if conditions that execute at the per-packet
level. Eric D. has been hounding me for just having one.

> If ingress_l2 is used with cls_bpf plus some other classifier,
> sure, the other classifier is paying potential cost skb_share_check
> if taps are active. The chances of having more than a couple classifiers
> on a single device are very slim, since they're called sequentially
> and killing performance regardless.

In my world it is not uncommon.

> In most of the use cases I'm after, cls_bpf will be the only classifier
> attached.

And i am pointing there are other use cases.

>In some cases there may be few others, but cls_bpf will be
> the first and the heaviest one. So it makes sense to optimize for it.
>

I am sorry - but i remain unconvinced.

>> code path and you are adding a bunch more unconditionally.
>
> not true at all. Looks like you misread the v3 patch completely.
> It's an optional flag.

+		needs_l2 = q->flags & TCQ_F_INGRESS_L2;
+		if (needs_l2) {
..
....
+		if (needs_l2)

That is executed for _every_ packet going through the ingress qdisc.

cheers,
jamal
Alexei Starovoitov April 10, 2015, 10:35 p.m. UTC | #16
On 4/10/15 5:48 AM, Jamal Hadi Salim wrote:
>
> That doesnt give you L2 at offset 0 for _some devices_ which is what you
> say you need. In some devices it adds a second L2 header.

Ok, fair point. I will add a check to make sure such an ingress_l2
qdisc doesn't get applied to devices without header_ops.
The same check is done by af_packet.

>
> To your "bugs" comments:
> - updating csum; earlier i said i was conflicted it being a
> useful "feature".
> You are repeating again that it is a bug. It is not.
> This action is intended to  mirror or redirect packets, period.

Without updating skb->csum, act_mirred breaks the csum for
CHECKSUM_COMPLETE devices. If these redirected skbs are eventually
seen by the stack, the stack will drop them. I cannot use such a
'feature', so I would need to find a way to do mirred without fixing
act_mirred. Fine.

> +        needs_l2 = q->flags & TCQ_F_INGRESS_L2;
> +        if (needs_l2) {
> ..
> ....
> +        if (needs_l2)
>
> That is executed for _every_ packet going after ingress qdisc.

and there is a hugely expensive spin_lock/unlock around ingress.
Compared to a lock cmpxchg, the cost of a branch is noise.
But fine, I'll find a way to optimize the whole thing without
adding extra branches.

> They have to know whether their code
> will run on ingress or egress and the type of device etc.
> The AT_XXX provides that signal and dev->type fills in the other gap.

True. The program authors may want to know whether the packet is seen
on ingress or egress and take different actions inside the program,
but forcing them to _always_ parse the packet differently because of
it is not acceptable.

Jamal Hadi Salim April 13, 2015, 2:28 p.m. UTC | #17
On 04/10/15 18:35, Alexei Starovoitov wrote:

>> To your "bugs" comments:
>> - updating csum; earlier i said i was conflicted it being a
>> useful "feature".
>> You are repeating again that it is a bug. It is not.
>> This action is intended to  mirror or redirect packets, period.
>
> without updating skb->csum act_mirred is breaking csum for
> checksum_complete devices.

Ok, could this then be checked for in dev features flags and
only then recomputed?


> True. The program authors may want to know whether the packet is seen
> on ingress or egress and take different actions inside the program,
> but forcing them to _always_ parse the packet differently because of
> it is not acceptable.


Yes, I see your point here as a reasonable dilemma. But could you add an
API call the user always makes that resets and unsets the header?
It would be a single branch failure for the egress/stack source.


cheers,
jamal


Alexei Starovoitov April 13, 2015, 4:13 p.m. UTC | #18
On 4/13/15 7:28 AM, Jamal Hadi Salim wrote:
> On 04/10/15 18:35, Alexei Starovoitov wrote:
>
>>> To your "bugs" comments:
>>> - updating csum; earlier i said i was conflicted it being a
>>> useful "feature".
>>> You are repeating again that it is a bug. It is not.
>>> This action is intended to  mirror or redirect packets, period.
>>
>> without updating skb->csum act_mirred is breaking csum for
>> checksum_complete devices.
>
> Ok, could this then be checked for in dev features flags and
> only then recomputed?

that's what my v2 patch did :)
if ip_summed == CHECKSUM_COMPLETE -> recompute skb->csum

> Yes, I see your point here as reasonable dilema. But you could add an
> API call the user always make that resets and unsets the header?
> It would be a single branch failure for egress/stack source.

We cannot do that from the program. skb->data and the header length are
already cached by the JITs and the skb pointer itself is in some bpf
registers, so push/pull/skb_share_check is _not_ possible while the
program is running. It can only be done before the program starts.
That's exactly what all my patches are doing.


Patch

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 3ee4c92afd1b..c52bd98b7dd9 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -95,6 +95,8 @@  struct tc_action_ops {
 			struct nlattr *est, struct tc_action *act, int ovr,
 			int bind);
 	int     (*walk)(struct sk_buff *, struct netlink_callback *, int, struct tc_action *);
+	u32	flags;
+#define TCA_NEEDS_L2	1
 };
 
 int tcf_hash_search(struct tc_action *a, u32 index);
@@ -112,12 +114,11 @@  int tcf_unregister_action(struct tc_action_ops *a);
 int tcf_action_destroy(struct list_head *actions, int bind);
 int tcf_action_exec(struct sk_buff *skb, const struct list_head *actions,
 		    struct tcf_result *res);
-int tcf_action_init(struct net *net, struct nlattr *nla,
-				  struct nlattr *est, char *n, int ovr,
-				  int bind, struct list_head *);
+int tcf_action_init(struct net *net, struct nlattr *nla, struct nlattr *est,
+		    char *n, int ovr, int bind, struct list_head *, bool);
 struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
 				    struct nlattr *est, char *n, int ovr,
-				    int bind);
+				    int bind, bool qdisc_has_l2);
 int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int);
 int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int);
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6d778efcfdfd..5f0efa556a35 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -61,6 +61,7 @@  struct Qdisc {
 				      */
 #define TCQ_F_WARN_NONWC	(1 << 16)
 #define TCQ_F_CPUSTATS		0x20 /* run using percpu statistics */
+#define TCQ_F_INGRESS_L2	0x40 /* ingress qdisc with L2 header */
 	u32			limit;
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
@@ -228,6 +229,8 @@  struct tcf_proto_ops {
 					struct sk_buff *skb, struct tcmsg*);
 
 	struct module		*owner;
+	u32			flags;
+#define TCF_NEEDS_L2	1
 };
 
 struct tcf_proto {
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 534b84710745..1076c56e6458 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -845,4 +845,14 @@  struct tc_pie_xstats {
 	__u32 maxq;             /* maximum queue size */
 	__u32 ecn_mark;         /* packets marked with ecn*/
 };
+
+/* INGRESS section */
+
+enum {
+	TCA_INGRESS_UNSPEC,
+	TCA_INGRESS_NEEDS_L2,
+	__TCA_INGRESS_MAX,
+};
+
+#define TCA_INGRESS_MAX (__TCA_INGRESS_MAX - 1)
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index b2775f06c710..957a31ef25ee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3513,10 +3513,13 @@  EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
  * the ingress scheduler, you just can't add policies on ingress.
  *
  */
-static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
+static int ing_filter(struct sk_buff **pskb, struct netdev_queue *rxq)
 {
+	struct sk_buff *skb = *pskb;
 	struct net_device *dev = skb->dev;
+	u32 hard_header_len = dev->hard_header_len;
 	u32 ttl = G_TC_RTTL(skb->tc_verd);
+	u32 needs_l2;
 	int result = TC_ACT_OK;
 	struct Qdisc *q;
 
@@ -3531,10 +3534,25 @@  static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
 
 	q = rcu_dereference(rxq->qdisc);
 	if (q != &noop_qdisc) {
+		needs_l2 = q->flags & TCQ_F_INGRESS_L2;
+		if (needs_l2) {
+			*pskb = skb = skb_share_check(skb, GFP_ATOMIC);
+
+			if (unlikely(!skb))
+				/* original skb was freed by skb_share_check */
+				return TC_ACT_UNSPEC;
+
+			skb_push(skb, hard_header_len);
+			skb_postpush_rcsum(skb, skb->data, hard_header_len);
+		}
+
 		spin_lock(qdisc_lock(q));
 		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
 			result = qdisc_enqueue_root(skb, q);
 		spin_unlock(qdisc_lock(q));
+
+		if (needs_l2)
+			skb_pull_rcsum(skb, hard_header_len);
 	}
 
 	return result;
@@ -3554,7 +3572,7 @@  static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 		*pt_prev = NULL;
 	}
 
-	switch (ing_filter(skb, rxq)) {
+	switch (ing_filter(&skb, rxq)) {
 	case TC_ACT_SHOT:
 	case TC_ACT_STOLEN:
 		kfree_skb(skb);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3d43e4979f27..0fe93a60c59b 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -484,7 +484,7 @@  errout:
 
 struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
 				    struct nlattr *est, char *name, int ovr,
-				    int bind)
+				    int bind, bool qdisc_has_l2)
 {
 	struct tc_action *a;
 	struct tc_action_ops *a_o;
@@ -533,6 +533,14 @@  struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
 		goto err_out;
 	}
 
+	if ((a_o->flags & TCA_NEEDS_L2) && !qdisc_has_l2) {
+		/* actions that require L2 cannot be attached
+		 * to vanilla ingress qdisc
+		 */
+		err = -EINVAL;
+		goto err_mod;
+	}
+
 	err = -ENOMEM;
 	a = kzalloc(sizeof(*a), GFP_KERNEL);
 	if (a == NULL)
@@ -565,9 +573,9 @@  err_out:
 	return ERR_PTR(err);
 }
 
-int tcf_action_init(struct net *net, struct nlattr *nla,
-				  struct nlattr *est, char *name, int ovr,
-				  int bind, struct list_head *actions)
+int tcf_action_init(struct net *net, struct nlattr *nla, struct nlattr *est,
+		    char *name, int ovr, int bind, struct list_head *actions,
+		    bool qdisc_has_l2)
 {
 	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
 	struct tc_action *act;
@@ -579,7 +587,8 @@  int tcf_action_init(struct net *net, struct nlattr *nla,
 		return err;
 
 	for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
-		act = tcf_action_init_1(net, tb[i], est, name, ovr, bind);
+		act = tcf_action_init_1(net, tb[i], est, name, ovr, bind,
+					qdisc_has_l2);
 		if (IS_ERR(act)) {
 			err = PTR_ERR(act);
 			goto err;
@@ -931,7 +940,7 @@  tcf_action_add(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
 	int ret = 0;
 	LIST_HEAD(actions);
 
-	ret = tcf_action_init(net, nla, NULL, NULL, ovr, 0, &actions);
+	ret = tcf_action_init(net, nla, NULL, NULL, ovr, 0, &actions, true);
 	if (ret)
 		goto done;
 
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 4d2cede17468..8f89d97dfa4a 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -339,6 +339,7 @@  static struct tc_action_ops act_bpf_ops __read_mostly = {
 	.dump		=	tcf_bpf_dump,
 	.cleanup	=	tcf_bpf_cleanup,
 	.init		=	tcf_bpf_init,
+	.flags		=	TCA_NEEDS_L2,
 };
 
 static int __init bpf_init_module(void)
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 4cd5cf1aedf8..887033fbdfa6 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -565,6 +565,8 @@  static struct tc_action_ops act_csum_ops = {
 	.act		= tcf_csum,
 	.dump		= tcf_csum_dump,
 	.init		= tcf_csum_init,
+	/* todo: fix pskb_may_pull in tcf_csum to not assume L2 */
+	.flags		= TCA_NEEDS_L2,
 };
 
 MODULE_DESCRIPTION("Checksum updating actions");
diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
index 270a030d5fd0..de83aa016302 100644
--- a/net/sched/act_nat.c
+++ b/net/sched/act_nat.c
@@ -287,6 +287,8 @@  static struct tc_action_ops act_nat_ops = {
 	.act		=	tcf_nat,
 	.dump		=	tcf_nat_dump,
 	.init		=	tcf_nat_init,
+	/* todo: fix pskb_may_pull in tcf_nat to not assume L2 */
+	.flags		=	TCA_NEEDS_L2,
 };
 
 MODULE_DESCRIPTION("Stateless NAT actions");
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8b0470e418dc..ae1b1b769136 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -257,6 +257,17 @@  replay:
 			kfree(tp);
 			goto errout;
 		}
+		if ((tp_ops->flags & TCF_NEEDS_L2) &&
+		    (q->flags & TCQ_F_INGRESS) &&
+		    !(q->flags & TCQ_F_INGRESS_L2)) {
+			/* classifiers that need L2 header cannot be
+			 * attached to vanilla ingress qdisc
+			 */
+			err = -EINVAL;
+			module_put(tp_ops->owner);
+			kfree(tp);
+			goto errout;
+		}
 		tp->ops = tp_ops;
 		tp->protocol = protocol;
 		tp->prio = nprio ? :
@@ -517,12 +528,17 @@  int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 #ifdef CONFIG_NET_CLS_ACT
 	{
 		struct tc_action *act;
+		bool qdisc_has_l2 = true;
+
+		if ((tp->q->flags & TCQ_F_INGRESS) &&
+		    !(tp->q->flags & TCQ_F_INGRESS_L2))
+			qdisc_has_l2 = false;
 
 		INIT_LIST_HEAD(&exts->actions);
 		if (exts->police && tb[exts->police]) {
 			act = tcf_action_init_1(net, tb[exts->police], rate_tlv,
 						"police", ovr,
-						TCA_ACT_BIND);
+						TCA_ACT_BIND, qdisc_has_l2);
 			if (IS_ERR(act))
 				return PTR_ERR(act);
 
@@ -532,7 +548,8 @@  int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 			int err;
 			err = tcf_action_init(net, tb[exts->action], rate_tlv,
 					      NULL, ovr,
-					      TCA_ACT_BIND, &exts->actions);
+					      TCA_ACT_BIND, &exts->actions,
+					      qdisc_has_l2);
 			if (err)
 				return err;
 		}
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 5c4171c5d2bd..b99e677e2800 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -476,6 +476,7 @@  static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
 	.delete		=	cls_bpf_delete,
 	.walk		=	cls_bpf_walk,
 	.dump		=	cls_bpf_dump,
+	.flags		=	TCF_NEEDS_L2,
 };
 
 static int __init cls_bpf_init_mod(void)
diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index 02fa82792dab..9f1ff1e19f20 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -716,6 +716,8 @@  static struct tcf_proto_ops RSVP_OPS __read_mostly = {
 	.walk		=	rsvp_walk,
 	.dump		=	rsvp_dump,
 	.owner		=	THIS_MODULE,
+	/* todo: fix pskb_network_may_pull in rsvp_classify to not assume L2 */
+	.flags		=	TCF_NEEDS_L2,
 };
 
 static int __init init_rsvp(void)
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index eb5b8445fef9..431ae6ffd142 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -109,6 +109,32 @@  nla_put_failure:
 	return -1;
 }
 
+static const struct nla_policy ingress_policy[TCA_INGRESS_MAX + 1] = {
+	[TCA_INGRESS_NEEDS_L2]	= { .type = NLA_U32 },
+};
+
+/* NEEDS_L2 flag can only be set at init time */
+static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct nlattr *tb[TCA_INGRESS_MAX + 1];
+	int err;
+
+	if (!opt)
+		return 0;
+
+	err = nla_parse_nested(tb, TCA_INGRESS_MAX, opt, ingress_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_INGRESS_NEEDS_L2]) {
+		if (nla_get_u32(tb[TCA_INGRESS_NEEDS_L2]))
+			sch->flags |= TCQ_F_INGRESS_L2;
+		else
+			sch->flags &= ~TCQ_F_INGRESS_L2;
+	}
+	return 0;
+}
+
 static const struct Qdisc_class_ops ingress_class_ops = {
 	.leaf		=	ingress_leaf,
 	.get		=	ingress_get,
@@ -123,6 +149,7 @@  static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
 	.cl_ops		=	&ingress_class_ops,
 	.id		=	"ingress",
 	.priv_size	=	sizeof(struct ingress_qdisc_data),
+	.init		=	ingress_init,
 	.enqueue	=	ingress_enqueue,
 	.destroy	=	ingress_destroy,
 	.dump		=	ingress_dump,
diff --git a/samples/bpf/tcbpf1_kern.c b/samples/bpf/tcbpf1_kern.c
index 7cf3f42a6e39..76cdaab63058 100644
--- a/samples/bpf/tcbpf1_kern.c
+++ b/samples/bpf/tcbpf1_kern.c
@@ -14,12 +14,6 @@  static inline void set_dst_mac(struct __sk_buff *skb, char *mac)
 	bpf_skb_store_bytes(skb, 0, mac, ETH_ALEN, 1);
 }
 
-/* use 1 below for ingress qdisc and 0 for egress */
-#if 0
-#undef ETH_HLEN
-#define ETH_HLEN 0
-#endif
-
 #define IP_CSUM_OFF (ETH_HLEN + offsetof(struct iphdr, check))
 #define TOS_OFF (ETH_HLEN + offsetof(struct iphdr, tos))