diff mbox series

[RFC,bpf-next,v3,6/8] flow_dissector: handle no-skb use case

Message ID 20190322195903.162516-7-sdf@google.com
State RFC
Delegated to: BPF Maintainers
Headers show
Series net: flow_dissector: trigger BPF hook when called from eth_get_headlen | expand

Commit Message

Stanislav Fomichev March 22, 2019, 7:59 p.m. UTC
When called without skb, gather all required data from the
__skb_flow_dissect's arguments and use recently introduces
no-skb mode of bpf flow dissector.

Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/core/flow_dissector.c | 54 +++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 31 deletions(-)

Comments

Alexei Starovoitov March 23, 2019, 1 a.m. UTC | #1
On Fri, Mar 22, 2019 at 12:59:01PM -0700, Stanislav Fomichev wrote:
> When called without skb, gather all required data from the
> __skb_flow_dissect's arguments and use recently introduces
> no-skb mode of bpf flow dissector.
> 
> Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> +			struct bpf_flow_keys flow_keys;
> +			struct bpf_flow_dissector ctx = {
> +				.flow_keys = &flow_keys,
> +				.data = data,
> +				.data_end = data + hlen,
> +				.protocol = proto,
> +			};
> +
> +			if (skb) {
> +				ctx.skb = skb;
> +				ctx.protocol = skb->protocol;
> +				ctx.vlan_tci = skb->vlan_tci;
> +				ctx.vlan_proto = skb->vlan_proto;
> +				ctx.vlan_present = skb->vlan_present;
> +			}

are you suggesting to have vlan* fields that only work properly for skb case?
It means that progs/bpf_flow.c would not work as-is for eth_get_headlen.
And to have unified program that works in both cases the program would need to rely
on above internal implementation detail, like checking that ctx->protocol == 0 ?
imo that is worse than having two different flow dissector program types.

May be remove protocol and vlan* from ctx ?
Then the only thing program can do is look at the packet data which is
eth_get_headlen use case. For skb case the existence of vlan can be retrofitted into
dissector results by the kernel after the program finished.
Stanislav Fomichev March 23, 2019, 1:19 a.m. UTC | #2
On 03/22, Alexei Starovoitov wrote:
> On Fri, Mar 22, 2019 at 12:59:01PM -0700, Stanislav Fomichev wrote:
> > When called without skb, gather all required data from the
> > __skb_flow_dissect's arguments and use recently introduces
> > no-skb mode of bpf flow dissector.
> > 
> > Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
> > 
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > +			struct bpf_flow_keys flow_keys;
> > +			struct bpf_flow_dissector ctx = {
> > +				.flow_keys = &flow_keys,
> > +				.data = data,
> > +				.data_end = data + hlen,
> > +				.protocol = proto,
> > +			};
> > +
> > +			if (skb) {
> > +				ctx.skb = skb;
> > +				ctx.protocol = skb->protocol;
> > +				ctx.vlan_tci = skb->vlan_tci;
> > +				ctx.vlan_proto = skb->vlan_proto;
> > +				ctx.vlan_present = skb->vlan_present;
> > +			}
> 
> are you suggesting to have vlan* fields that only work properly for skb case?
> It means that progs/bpf_flow.c would not work as-is for eth_get_headlen.
> And to have unified program that works in both cases the program would need to rely
> on above internal implementation detail, like checking that ctx->protocol == 0 ?
> imo that is worse than having two different flow dissector program types.
Yeah. The reference implementation (bpf_flow.c) should handle it though.
In parse_eth_proto we handle ETH_P_8021Q/ETH_P_8021AD, so I'm not sure
why we even had that skb->vlan_present check in the first place.
[But, again, who knows what's out there besides bpf_flow.c]

In general, this RFC's approach is to have a "best effort" vlan
detection, bpf program would still have to handle it.
Agree, that it's confusing.

> May be remove protocol and vlan* from ctx ?
> Then the only thing program can do is look at the packet data which is
> eth_get_headlen use case. For skb case the existence of vlan can be retrofitted into
> dissector results by the kernel after the program finished.
Are we ok with breaking api in this case? I'm all in on removing this
extra information. We can always put it back if somebody complains (and
manually parse in eth_get_headlen case).

Regarding dissector results: that's currently not implemented, we don't
export vlan information.

We can still have protocol, because in both skb/skb-less cases we have
it.
Alexei Starovoitov March 23, 2019, 1:41 a.m. UTC | #3
On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> Are we ok with breaking api in this case? I'm all in on removing this
> extra information. We can always put it back if somebody complains (and
> manually parse in eth_get_headlen case).

Fine. That seems to be the only way forward to clean it all up.
Could you submit patch 1 to bpf tree disallowing vlan fields?
Patch 3 looks like candidate as well?

> We can still have protocol, because in both skb/skb-less cases we have
> it.

proto can work in both cases, but is it needed ? Does program benefit from it?
The kernel side burns extra bytes by copying it and extra branches to handle it.
May be drop it as well?
Stanislav Fomichev March 23, 2019, 4:05 p.m. UTC | #4
On 03/22, Alexei Starovoitov wrote:
> On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > Are we ok with breaking api in this case? I'm all in on removing this
> > extra information. We can always put it back if somebody complains (and
> > manually parse in eth_get_headlen case).
> 
> Fine. That seems to be the only way forward to clean it all up.
> Could you submit patch 1 to bpf tree disallowing vlan fields?
> Patch 3 looks like candidate as well?
SGTM, will do. Let me also spend some time and do a simple test for
the vlan case, to make sure I didn't miss something important.
One question here though: would I need to wait for bpf and bpf-next
to re-merge to continues the series? Or we can cherry-pick those
patches to bpf-next as well (and git will work it out during the
merge)?

> > We can still have protocol, because in both skb/skb-less cases we have
> > it.
> 
> proto can work in both cases, but is it needed ? Does program benefit from it?
> The kernel side burns extra bytes by copying it and extra branches to handle it.
> May be drop it as well?
I feel like the program benefits from it, there is no need to go back and
re-parse that (and in the skb case, this data is already pulled). I was
also thinking about re-purposing flow_keys->n_proto for that (instead
of skb->protocol), so it functions as input and output, maybe that's a
more clear way to do it.
Alexei Starovoitov March 26, 2019, 12:35 a.m. UTC | #5
On Sat, Mar 23, 2019 at 09:05:31AM -0700, Stanislav Fomichev wrote:
> On 03/22, Alexei Starovoitov wrote:
> > On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > > Are we ok with breaking api in this case? I'm all in on removing this
> > > extra information. We can always put it back if somebody complains (and
> > > manually parse in eth_get_headlen case).
> > 
> > Fine. That seems to be the only way forward to clean it all up.
> > Could you submit patch 1 to bpf tree disallowing vlan fields?
> > Patch 3 looks like candidate as well?
> SGTM, will do. Let me also spend some time and do a simple test for
> the vlan case, to make sure I didn't miss something important.
> One question here though: would I need to wait for bpf and bpf-next
> to re-merge to continues the series? Or we can cherry-pick those
> patches to bpf-next as well (and git will work it out during the
> merge)?
> 
> > > We can still have protocol, because in both skb/skb-less cases we have
> > > it.
> > 
> > proto can work in both cases, but is it needed ? Does program benefit from it?
> > The kernel side burns extra bytes by copying it and extra branches to handle it.
> > May be drop it as well?
> I feel like the program benefits from it, there is no need to go back and
> re-parse that (and in the skb case, this data is already pulled). I was
> also thinking about re-purposing flow_keys->n_proto for that (instead
> of skb->protocol), so it functions as input and output, maybe that's a
> more clear way to do it.

Are you saying that skb-less and skb flow dissector progs are looking
at different positions into the packet ?
In case of with-skb it's already after eth header was pulled?
In such case skb-less should be different program type or
both should point at the same point.
Stanislav Fomichev March 26, 2019, 4:45 p.m. UTC | #6
On 03/25, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 09:05:31AM -0700, Stanislav Fomichev wrote:
> > On 03/22, Alexei Starovoitov wrote:
> > > On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > > > Are we ok with breaking api in this case? I'm all in on removing this
> > > > extra information. We can always put it back if somebody complains (and
> > > > manually parse in eth_get_headlen case).
> > > 
> > > Fine. That seems to be the only way forward to clean it all up.
> > > Could you submit patch 1 to bpf tree disallowing vlan fields?
> > > Patch 3 looks like candidate as well?
> > SGTM, will do. Let me also spend some time and do a simple test for
> > the vlan case, to make sure I didn't miss something important.
> > One question here though: would I need to wait for bpf and bpf-next
> > to re-merge to continues the series? Or we can cherry-pick those
> > patches to bpf-next as well (and git will work it out during the
> > merge)?
> > 
> > > > We can still have protocol, because in both skb/skb-less cases we have
> > > > it.
> > > 
> > > proto can work in both cases, but is it needed ? Does program benefit from it?
> > > The kernel side burns extra bytes by copying it and extra branches to handle it.
> > > May be drop it as well?
> > I feel like the program benefits from it, there is no need to go back and
> > re-parse that (and in the skb case, this data is already pulled). I was
> > also thinking about re-purposing flow_keys->n_proto for that (instead
> > of skb->protocol), so it functions as input and output, maybe that's a
> > more clear way to do it.
> 
> Are you saying that skb-less and skb flow dissector progs are looking
> at different positions into the packet ?
No, sorry for confusion, they are both called to parse (optional) L2-vlan
and L3+ headers. However, with-skb case can be called with l2-vlan
parsed (post RFS) or with l2-vlan unparsed (RFS). The vlan is pulled in
__netif_receive_skb_core, but we can still invoke flow dissector prior to
that when doing RFS (get_rps_cpu).

That's why have this 'if skb->vlan_present' check in the bpf_flow.c program
(and then also manually test for ETH_P_8021Q/ETH_P_8021AD).

Let me try to post the patches to bpf tree somewhere this week, we
can discuss the API changes there.

> In case of with-skb it's already after eth header was pulled?
> In such case skb-less should be different program type or
> both should point at the same point.
Alexei Starovoitov March 26, 2019, 5:48 p.m. UTC | #7
On Tue, Mar 26, 2019 at 09:45:29AM -0700, Stanislav Fomichev wrote:
> On 03/25, Alexei Starovoitov wrote:
> > On Sat, Mar 23, 2019 at 09:05:31AM -0700, Stanislav Fomichev wrote:
> > > On 03/22, Alexei Starovoitov wrote:
> > > > On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > > > > Are we ok with breaking api in this case? I'm all in on removing this
> > > > > extra information. We can always put it back if somebody complains (and
> > > > > manually parse in eth_get_headlen case).
> > > > 
> > > > Fine. That seems to be the only way forward to clean it all up.
> > > > Could you submit patch 1 to bpf tree disallowing vlan fields?
> > > > Patch 3 looks like candidate as well?
> > > SGTM, will do. Let me also spend some time and do a simple test for
> > > the vlan case, to make sure I didn't miss something important.
> > > One question here though: would I need to wait for bpf and bpf-next
> > > to re-merge to continues the series? Or we can cherry-pick those
> > > patches to bpf-next as well (and git will work it out during the
> > > merge)?
> > > 
> > > > > We can still have protocol, because in both skb/skb-less cases we have
> > > > > it.
> > > > 
> > > > proto can work in both cases, but is it needed ? Does program benefit from it?
> > > > The kernel side burns extra bytes by copying it and extra branches to handle it.
> > > > May be drop it as well?
> > > I feel like the program benefits from it, there is no need to go back and
> > > re-parse that (and in the skb case, this data is already pulled). I was
> > > also thinking about re-purposing flow_keys->n_proto for that (instead
> > > of skb->protocol), so it functions as input and output, maybe that's a
> > > more clear way to do it.
> > 
> > Are you saying that skb-less and skb flow dissector progs are looking
> > at different positions into the packet ?
> No, sorry for confusion, they are both called to parse (optional) L2-vlan
> and L3+ headers. However, with-skb case can be called with l2-vlan
> parsed (post RFS) or with l2-vlan unparsed (RFS). The vlan is pulled in
> __netif_receive_skb_core, but we can still invoke flow dissector prior to
> that when doing RFS (get_rps_cpu).
> 
> That's why have this 'if skb->vlan_present' check in the bpf_flow.c program
> (and then also manually test for ETH_P_8021Q/ETH_P_8021AD).
> 
> Let me try to post the patches to bpf tree somewhere this week, we
> can discuss the API changes there.

let's figure out what we disable for bpf/stable first.
Sound like skb->protocol isn't helping.
prog has to read it from eth header anyway. then let's drop it from ctx?
skb->vlan_present also seems to be misleading.
For normal processing skb->vlan_present means that there is a vlan hdr
in the packet, but for flow dissector is some sort of weird hint
whether skb-based dissector was pre- or post- rfs.
I think we need make vlan semantics unambiguous first.
Willem de Bruijn March 26, 2019, 5:51 p.m. UTC | #8
On Tue, Mar 26, 2019 at 1:48 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 26, 2019 at 09:45:29AM -0700, Stanislav Fomichev wrote:
> > On 03/25, Alexei Starovoitov wrote:
> > > On Sat, Mar 23, 2019 at 09:05:31AM -0700, Stanislav Fomichev wrote:
> > > > On 03/22, Alexei Starovoitov wrote:
> > > > > On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > > > > > Are we ok with breaking api in this case? I'm all in on removing this
> > > > > > extra information. We can always put it back if somebody complains (and
> > > > > > manually parse in eth_get_headlen case).
> > > > >
> > > > > Fine. That seems to be the only way forward to clean it all up.
> > > > > Could you submit patch 1 to bpf tree disallowing vlan fields?
> > > > > Patch 3 looks like candidate as well?
> > > > SGTM, will do. Let me also spend some time and do a simple test for
> > > > the vlan case, to make sure I didn't miss something important.
> > > > One question here though: would I need to wait for bpf and bpf-next
> > > > to re-merge to continues the series? Or we can cherry-pick those
> > > > patches to bpf-next as well (and git will work it out during the
> > > > merge)?
> > > >
> > > > > > We can still have protocol, because in both skb/skb-less cases we have
> > > > > > it.
> > > > >
> > > > > proto can work in both cases, but is it needed ? Does program benefit from it?
> > > > > The kernel side burns extra bytes by copying it and extra branches to handle it.
> > > > > May be drop it as well?
> > > > I feel like the program benefits from it, there is no need to go back and
> > > > re-parse that (and in the skb case, this data is already pulled). I was
> > > > also thinking about re-purposing flow_keys->n_proto for that (instead
> > > > of skb->protocol), so it functions as input and output, maybe that's a
> > > > more clear way to do it.
> > >
> > > Are you saying that skb-less and skb flow dissector progs are looking
> > > at different positions into the packet ?
> > No, sorry for confusion, they are both called to parse (optional) L2-vlan
> > and L3+ headers. However, with-skb case can be called with l2-vlan
> > parsed (post RFS) or with l2-vlan unparsed (RFS). The vlan is pulled in
> > __netif_receive_skb_core, but we can still invoke flow dissector prior to
> > that when doing RFS (get_rps_cpu).
> >
> > That's why have this 'if skb->vlan_present' check in the bpf_flow.c program
> > (and then also manually test for ETH_P_8021Q/ETH_P_8021AD).
> >
> > Let me try to post the patches to bpf tree somewhere this week, we
> > can discuss the API changes there.
>
> let's figure out what we disable for bpf/stable first.
> Sound like skb->protocol isn't helping.
> prog has to read it from eth header anyway. then let's drop it from ctx?

The C flow dissector does not parse link layer headers It starts with
proto either given as argument or derived from skb->protocol (or
skb->vlan_proto).

The BPF flow dissector should work the same. It is fine to pass the
data including ethernet header, but parsing can start at nhoff with
proto explicitly passed.

We should not assume Ethernet link layer.
Alexei Starovoitov March 26, 2019, 6:08 p.m. UTC | #9
On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> The BPF flow dissector should work the same. It is fine to pass the
> data including ethernet header, but parsing can start at nhoff with
> proto explicitly passed.
>
> We should not assume Ethernet link layer.

then skb-less dissector has to be different program type
because semantics are different.
Stanislav Fomichev March 26, 2019, 6:17 p.m. UTC | #10
On 03/26, Alexei Starovoitov wrote:
> On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> > The BPF flow dissector should work the same. It is fine to pass the
> > data including ethernet header, but parsing can start at nhoff with
> > proto explicitly passed.
> >
> > We should not assume Ethernet link layer.
> 
> then skb-less dissector has to be different program type
> because semantics are different.
The semantics are the same as for c-based __skb_flow_dissect.
We just need to pass nhoff and proto that has been passed to
__skb_flow_dissect to the bpf program. In case of with-skb,
take this initial data from skb, like __skb_flow_dissect does (and don't
ask BPF program to do it essentially):

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763

I was thinking of passing proto as flow_keys->n_proto and we already
pass flow_keys->nhoff, so no need to do anything for it. With that,
BPF program doesn't need to look into skb and can parse optional vlan
and L3+ headers. The same way __skb_flow_dissect does that.
Alexei Starovoitov March 26, 2019, 6:30 p.m. UTC | #11
On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> On 03/26, Alexei Starovoitov wrote:
> > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > > The BPF flow dissector should work the same. It is fine to pass the
> > > data including ethernet header, but parsing can start at nhoff with
> > > proto explicitly passed.
> > >
> > > We should not assume Ethernet link layer.
> > 
> > then skb-less dissector has to be different program type
> > because semantics are different.
> The semantics are the same as for c-based __skb_flow_dissect.
> We just need to pass nhoff and proto that has been passed to
> __skb_flow_dissect to the bpf program. In case of with-skb,
> take this initial data from skb, like __skb_flow_dissect does (and don't
> ask BPF program to do it essentially):
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> 
> I was thinking of passing proto as flow_keys->n_proto and we already
> pass flow_keys->nhoff, so no need to do anything for it. With that,
> BPF program doesn't need to look into skb and can parse optional vlan
> and L3+ headers. The same way __skb_flow_dissect does that.

makes sense. then I'd also prefer for proto to be in flow_keys to
high light this difference.
may be add vlan_proto/present/tci there as well?
At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
Stanislav Fomichev March 26, 2019, 6:54 p.m. UTC | #12
On 03/26, Alexei Starovoitov wrote:
> On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > On 03/26, Alexei Starovoitov wrote:
> > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > data including ethernet header, but parsing can start at nhoff with
> > > > proto explicitly passed.
> > > >
> > > > We should not assume Ethernet link layer.
> > > 
> > > then skb-less dissector has to be different program type
> > > because semantics are different.
> > The semantics are the same as for c-based __skb_flow_dissect.
> > We just need to pass nhoff and proto that has been passed to
> > __skb_flow_dissect to the bpf program. In case of with-skb,
> > take this initial data from skb, like __skb_flow_dissect does (and don't
> > ask BPF program to do it essentially):
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > 
> > I was thinking of passing proto as flow_keys->n_proto and we already
> > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > BPF program doesn't need to look into skb and can parse optional vlan
> > and L3+ headers. The same way __skb_flow_dissect does that.
> 
> makes sense. then I'd also prefer for proto to be in flow_keys to
> high light this difference.
Maybe rename existing flow_keys->n_proto to flow_keys->proto?
That would match __skb_flow_dissect and remove ambiguity with both proto
and n_proto in flow_keys.

> may be add vlan_proto/present/tci there as well?
> At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
Why do you think we need them? My understanding was that when
skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
that vlan info has been already parsed out of the packet and stored in
the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
points to proper L3 header.

If that's correct, BPF flow dissector should not care about that. For
example, look at how C-based flow dissector does that:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944

If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
and move on.

But, we would need vlan_proto/present/tci in the flow_keys in the future.
We don't currently return parsed vlan data from the BPF flow dissector.
But it feels like it's getting into bpf-next territory :-)
Alexei Starovoitov March 27, 2019, 1:41 a.m. UTC | #13
On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> On 03/26, Alexei Starovoitov wrote:
> > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > On 03/26, Alexei Starovoitov wrote:
> > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > proto explicitly passed.
> > > > >
> > > > > We should not assume Ethernet link layer.
> > > > 
> > > > then skb-less dissector has to be different program type
> > > > because semantics are different.
> > > The semantics are the same as for c-based __skb_flow_dissect.
> > > We just need to pass nhoff and proto that has been passed to
> > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > ask BPF program to do it essentially):
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > 
> > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > BPF program doesn't need to look into skb and can parse optional vlan
> > > and L3+ headers. The same way __skb_flow_dissect does that.
> > 
> > makes sense. then I'd also prefer for proto to be in flow_keys to
> > high light this difference.
> Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> That would match __skb_flow_dissect and remove ambiguity with both proto
> and n_proto in flow_keys.

disabling useless fields in ctx is one thing, since probability of breaking users
is low, but renaming n_proto is imo too much.

> > may be add vlan_proto/present/tci there as well?
> > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> Why do you think we need them? My understanding was that when
> skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> that vlan info has been already parsed out of the packet and stored in
> the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> points to proper L3 header.
> 
> If that's correct, BPF flow dissector should not care about that. For
> example, look at how C-based flow dissector does that:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> 
> If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> and move on.
> 
> But, we would need vlan_proto/present/tci in the flow_keys in the future.
> We don't currently return parsed vlan data from the BPF flow dissector.
> But it feels like it's getting into bpf-next territory :-)

Whether ctx->data points to L2 or L3 is uapi regardless whether
progs/bpf_flow.c is relying on that or not.
So far I think you're saying that in all three cases:
no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
This has to be preserved.
Only now after reading bpf_flow.c for Nth time I realized what semantics
you gave to skb->vlan* and skb->protocol fields. All of them have
to be kept as-is.
For no-skb cases all of them should be available with the same logic
and it has to documented, since it's different from other bpf progs
that access these fields.
Stanislav Fomichev March 27, 2019, 2:44 a.m. UTC | #14
On 03/26, Alexei Starovoitov wrote:
> On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > On 03/26, Alexei Starovoitov wrote:
> > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > On 03/26, Alexei Starovoitov wrote:
> > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > proto explicitly passed.
> > > > > >
> > > > > > We should not assume Ethernet link layer.
> > > > > 
> > > > > then skb-less dissector has to be different program type
> > > > > because semantics are different.
> > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > We just need to pass nhoff and proto that has been passed to
> > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > ask BPF program to do it essentially):
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > 
> > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > 
> > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > high light this difference.
> > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > That would match __skb_flow_dissect and remove ambiguity with both proto
> > and n_proto in flow_keys.
> 
> disabling useless fields in ctx is one thing, since probability of breaking users
> is low, but renaming n_proto is imo too much.
> 
> > > may be add vlan_proto/present/tci there as well?
> > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > Why do you think we need them? My understanding was that when
> > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > that vlan info has been already parsed out of the packet and stored in
> > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > points to proper L3 header.
> > 
> > If that's correct, BPF flow dissector should not care about that. For
> > example, look at how C-based flow dissector does that:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > 
> > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > and move on.
> > 
> > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > We don't currently return parsed vlan data from the BPF flow dissector.
> > But it feels like it's getting into bpf-next territory :-)
> 
> Whether ctx->data points to L2 or L3 is uapi regardless whether
> progs/bpf_flow.c is relying on that or not.
> So far I think you're saying that in all three cases:
> no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> This has to be preserved.
It points to L3 (or vlan). And this will be preserved, I have no
intention to change that.

Just to make sure, we are on the same page, here is what
__skb_flow_dissect (and BPF prog) is seeing in nhoff.

NO-VLAN is always the same for both with-skb/no-skb:
+----+----+-----+--+
|DMAC|SMAC|PROTO|L3|
+----+----+-----+--+
                 ^
                 +-- nhoff
                     proto = PROTO

VLAN no-skb (eth_get_headlen):
+----+----+----+---+-----+--+
|DMAC|SMAC|TPID|TCI|PROTO|L3|
+----+----+----+---+-----+--+
                ^
                +-- nhoff
                    proto = TPID

VLAN with-skb, RFS (pre __netif_receive_skb_core):
+----+----+----+---+-----+--+
|DMAC|SMAC|TPID|TCI|PROTO|L3|
+----+----+----+---+-----+--+
                ^
                +-- nhoff
                    proto = TPID

VLAN with-skb, post RFS (post __netif_receive_skb_core / skb_vlan_untag):
+----+----+----+---+-----+--+
|DMAC|SMAC|TPID|TCI|PROTO|L3|
+----+----+----+---+-----+--+
                          ^
                          +-- nhoff
			      proto = PROTO

And in the last case, networking stack sets:
 * skb->vlan_present to true
 * skb->vlan_proto to TPID
 * skb->vlan_tci to TCI
 * skb->protocol to PROTO
 * pulls vlan header, so skb->data points to L3 header

> Only now after reading bpf_flow.c for Nth time I realized what semantics
> you gave to skb->vlan* and skb->protocol fields. All of them have
> to be kept as-is.
Don't read too much into current bpf_flow.c, I don't think it really
works with vlans in all the cases :-/

It always looks back, assuming post RFS situation; that needs to be
changed by dropping that "if (!skb->vlan_present)" and just looking
into input 'proto' (and optionally parsing vlan hdr if proto ==
802.1q/ad, which we already, sort of, do).

I'm gonna add a small testcase for BPF_PROG_TEST_RUN.

> For no-skb cases all of them should be available with the same logic
> and it has to documented, since it's different from other bpf progs
> that access these fields.
I feel like dropping those vlan_{present,proto,tci} from bpf flow dissector.
It should not care what's in the skb and should just rely on the input 'proto'
to optionally parse vlan header.

+1 on documenting all of that
Alexei Starovoitov March 27, 2019, 5:55 p.m. UTC | #15
On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> On 03/26, Alexei Starovoitov wrote:
> > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > On 03/26, Alexei Starovoitov wrote:
> > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > proto explicitly passed.
> > > > > > >
> > > > > > > We should not assume Ethernet link layer.
> > > > > > 
> > > > > > then skb-less dissector has to be different program type
> > > > > > because semantics are different.
> > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > We just need to pass nhoff and proto that has been passed to
> > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > ask BPF program to do it essentially):
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > 
> > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > 
> > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > high light this difference.
> > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > and n_proto in flow_keys.
> > 
> > disabling useless fields in ctx is one thing, since probability of breaking users
> > is low, but renaming n_proto is imo too much.
> > 
> > > > may be add vlan_proto/present/tci there as well?
> > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > Why do you think we need them? My understanding was that when
> > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > that vlan info has been already parsed out of the packet and stored in
> > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > points to proper L3 header.
> > > 
> > > If that's correct, BPF flow dissector should not care about that. For
> > > example, look at how C-based flow dissector does that:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > 
> > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > and move on.
> > > 
> > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > But it feels like it's getting into bpf-next territory :-)
> > 
> > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > progs/bpf_flow.c is relying on that or not.
> > So far I think you're saying that in all three cases:
> > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > This has to be preserved.
> It points to L3 (or vlan). And this will be preserved, I have no
> intention to change that.
> 
> Just to make sure, we are on the same page, here is what
> __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> 
> NO-VLAN is always the same for both with-skb/no-skb:
> +----+----+-----+--+
> |DMAC|SMAC|PROTO|L3|
> +----+----+-----+--+
>                  ^
>                  +-- nhoff
>                      proto = PROTO
> 
> VLAN no-skb (eth_get_headlen):
> +----+----+----+---+-----+--+
> |DMAC|SMAC|TPID|TCI|PROTO|L3|
> +----+----+----+---+-----+--+
>                 ^
>                 +-- nhoff
>                     proto = TPID

where ctx->data will point to ?
These nhoff differences are fine.
I want to make sure that ctx->data is the same for all.

> And in the last case, networking stack sets:
>  * skb->vlan_present to true
>  * skb->vlan_proto to TPID
>  * skb->vlan_tci to TCI
>  * skb->protocol to PROTO
>  * pulls vlan header, so skb->data points to L3 header

then skb-less should point after L2 as well
or point to L2 for all.

I'm arguing about it because early on during tc-bpf days
we had this discrepancy and it caused issues.
Thankfully we addressed it before it became long term headache.
Stanislav Fomichev March 27, 2019, 7:58 p.m. UTC | #16
On 03/27, Alexei Starovoitov wrote:
> On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > On 03/26, Alexei Starovoitov wrote:
> > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > On 03/26, Alexei Starovoitov wrote:
> > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > proto explicitly passed.
> > > > > > > >
> > > > > > > > We should not assume Ethernet link layer.
> > > > > > > 
> > > > > > > then skb-less dissector has to be different program type
> > > > > > > because semantics are different.
> > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > ask BPF program to do it essentially):
> > > > > > 
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > 
> > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > 
> > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > high light this difference.
> > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > and n_proto in flow_keys.
> > > 
> > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > is low, but renaming n_proto is imo too much.
> > > 
> > > > > may be add vlan_proto/present/tci there as well?
> > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > Why do you think we need them? My understanding was that when
> > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > that vlan info has been already parsed out of the packet and stored in
> > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > points to proper L3 header.
> > > > 
> > > > If that's correct, BPF flow dissector should not care about that. For
> > > > example, look at how C-based flow dissector does that:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > 
> > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > and move on.
> > > > 
> > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > But it feels like it's getting into bpf-next territory :-)
> > > 
> > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > progs/bpf_flow.c is relying on that or not.
> > > So far I think you're saying that in all three cases:
> > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > This has to be preserved.
> > It points to L3 (or vlan). And this will be preserved, I have no
> > intention to change that.
> > 
> > Just to make sure, we are on the same page, here is what
> > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > 
> > NO-VLAN is always the same for both with-skb/no-skb:
> > +----+----+-----+--+
> > |DMAC|SMAC|PROTO|L3|
> > +----+----+-----+--+
> >                  ^
> >                  +-- nhoff
> >                      proto = PROTO
> > 
> > VLAN no-skb (eth_get_headlen):
> > +----+----+----+---+-----+--+
> > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > +----+----+----+---+-----+--+
> >                 ^
> >                 +-- nhoff
> >                     proto = TPID
> 
> where ctx->data will point to ?
> These nhoff differences are fine.
> I want to make sure that ctx->data is the same for all.
For with-skb, nhoff would be zero, and ctx->data would point to
TCI/L3.
For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
non-zero (TCI/L3 offset).

If you want, for skb-less case, when calling BPF program we can do the math
ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
But I'm not sure whether we need to do that; flow dissector is supposed
to look at ctx->data + nhoff, it should not matter what each individual
value is, they only make sense together.

> > And in the last case, networking stack sets:
> >  * skb->vlan_present to true
> >  * skb->vlan_proto to TPID
> >  * skb->vlan_tci to TCI
> >  * skb->protocol to PROTO
> >  * pulls vlan header, so skb->data points to L3 header
> 
> then skb-less should point after L2 as well
> or point to L2 for all.
Sure, we can do that, see above, we can manually move ctx->data to nhoff
and set nhoff to zero.

> I'm arguing about it because early on during tc-bpf days
> we had this discrepancy and it caused issues.
> Thankfully we addressed it before it became long term headache.
Ack, that's what we are trying to do here as well :-) Address all
existing shortcomings before it's too late.
Alexei Starovoitov March 28, 2019, 1:26 a.m. UTC | #17
On Wed, Mar 27, 2019 at 12:58:20PM -0700, Stanislav Fomichev wrote:
> On 03/27, Alexei Starovoitov wrote:
> > On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > > On 03/26, Alexei Starovoitov wrote:
> > > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > > proto explicitly passed.
> > > > > > > > >
> > > > > > > > > We should not assume Ethernet link layer.
> > > > > > > > 
> > > > > > > > then skb-less dissector has to be different program type
> > > > > > > > because semantics are different.
> > > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > > ask BPF program to do it essentially):
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > > 
> > > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > > 
> > > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > > high light this difference.
> > > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > > and n_proto in flow_keys.
> > > > 
> > > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > > is low, but renaming n_proto is imo too much.
> > > > 
> > > > > > may be add vlan_proto/present/tci there as well?
> > > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > > Why do you think we need them? My understanding was that when
> > > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > > that vlan info has been already parsed out of the packet and stored in
> > > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > > points to proper L3 header.
> > > > > 
> > > > > If that's correct, BPF flow dissector should not care about that. For
> > > > > example, look at how C-based flow dissector does that:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > > 
> > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > and move on.
> > > > > 
> > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > But it feels like it's getting into bpf-next territory :-)
> > > > 
> > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > progs/bpf_flow.c is relying on that or not.
> > > > So far I think you're saying that in all three cases:
> > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > This has to be preserved.
> > > It points to L3 (or vlan). And this will be preserved, I have no
> > > intention to change that.
> > > 
> > > Just to make sure, we are on the same page, here is what
> > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > 
> > > NO-VLAN is always the same for both with-skb/no-skb:
> > > +----+----+-----+--+
> > > |DMAC|SMAC|PROTO|L3|
> > > +----+----+-----+--+
> > >                  ^
> > >                  +-- nhoff
> > >                      proto = PROTO
> > > 
> > > VLAN no-skb (eth_get_headlen):
> > > +----+----+----+---+-----+--+
> > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > +----+----+----+---+-----+--+
> > >                 ^
> > >                 +-- nhoff
> > >                     proto = TPID
> > 
> > where ctx->data will point to ?
> > These nhoff differences are fine.
> > I want to make sure that ctx->data is the same for all.
> For with-skb, nhoff would be zero, and ctx->data would point to
> TCI/L3.
> For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> non-zero (TCI/L3 offset).
> 
> If you want, for skb-less case, when calling BPF program we can do the math
> ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> But I'm not sure whether we need to do that; flow dissector is supposed
> to look at ctx->data + nhoff, it should not matter what each individual
> value is, they only make sense together.

My strong preference is to have data to point to L2 in all cases.
Semantics of requiring bpf prog to start processing from a tuple
(data + nhoff) where both point to random places is very confusing.
Willem de Bruijn March 28, 2019, 3:14 a.m. UTC | #18
On Wed, Mar 27, 2019 at 9:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 27, 2019 at 12:58:20PM -0700, Stanislav Fomichev wrote:
> > On 03/27, Alexei Starovoitov wrote:
> > > On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > > > On 03/26, Alexei Starovoitov wrote:
> > > > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > > > proto explicitly passed.
> > > > > > > > > >
> > > > > > > > > > We should not assume Ethernet link layer.
> > > > > > > > >
> > > > > > > > > then skb-less dissector has to be different program type
> > > > > > > > > because semantics are different.
> > > > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > > > ask BPF program to do it essentially):
> > > > > > > >
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > > >
> > > > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > > >
> > > > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > > > high light this difference.
> > > > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > > > and n_proto in flow_keys.
> > > > >
> > > > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > > > is low, but renaming n_proto is imo too much.
> > > > >
> > > > > > > may be add vlan_proto/present/tci there as well?
> > > > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > > > Why do you think we need them? My understanding was that when
> > > > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > > > that vlan info has been already parsed out of the packet and stored in
> > > > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > > > points to proper L3 header.
> > > > > >
> > > > > > If that's correct, BPF flow dissector should not care about that. For
> > > > > > example, look at how C-based flow dissector does that:
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > > >
> > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > and move on.
> > > > > >
> > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > >
> > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > progs/bpf_flow.c is relying on that or not.
> > > > > So far I think you're saying that in all three cases:
> > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > This has to be preserved.
> > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > intention to change that.
> > > >
> > > > Just to make sure, we are on the same page, here is what
> > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > >
> > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > +----+----+-----+--+
> > > > |DMAC|SMAC|PROTO|L3|
> > > > +----+----+-----+--+
> > > >                  ^
> > > >                  +-- nhoff
> > > >                      proto = PROTO
> > > >
> > > > VLAN no-skb (eth_get_headlen):
> > > > +----+----+----+---+-----+--+
> > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > +----+----+----+---+-----+--+
> > > >                 ^
> > > >                 +-- nhoff
> > > >                     proto = TPID
> > >
> > > where ctx->data will point to ?
> > > These nhoff differences are fine.
> > > I want to make sure that ctx->data is the same for all.
> > For with-skb, nhoff would be zero, and ctx->data would point to
> > TCI/L3.
> > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > non-zero (TCI/L3 offset).
> >
> > If you want, for skb-less case, when calling BPF program we can do the math
> > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > But I'm not sure whether we need to do that; flow dissector is supposed
> > to look at ctx->data + nhoff, it should not matter what each individual
> > value is, they only make sense together.
>
> My strong preference is to have data to point to L2 in all cases.
> Semantics of requiring bpf prog to start processing from a tuple
> (data + nhoff) where both point to random places is very confusing.

Since flow dissection starts at the network layer, I would then
suggest data always at L3 and nhoff 0.

This can be derived in the same manner as __skb_flow_dissect
already does if !data, using only skb_network_offset.

>From a quick scan, skb_mac_offset should also be valid in all cases
where the flow dissector is called today, so the other can be computed, too.

But this is less obvious. For instance, tun_get_user calls into the flow
dissector up to three times (wow) and IFF_TUN has no link layer
(ARPHRD_NONE). And then there are also fun variable length link layer
protocols to deal with..
Alexei Starovoitov March 28, 2019, 3:32 a.m. UTC | #19
On Wed, Mar 27, 2019 at 11:14:46PM -0400, Willem de Bruijn wrote:
> On Wed, Mar 27, 2019 at 9:26 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Mar 27, 2019 at 12:58:20PM -0700, Stanislav Fomichev wrote:
> > > On 03/27, Alexei Starovoitov wrote:
> > > > On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > > > > proto explicitly passed.
> > > > > > > > > > >
> > > > > > > > > > > We should not assume Ethernet link layer.
> > > > > > > > > >
> > > > > > > > > > then skb-less dissector has to be different program type
> > > > > > > > > > because semantics are different.
> > > > > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > > > > ask BPF program to do it essentially):
> > > > > > > > >
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > > > >
> > > > > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > > > >
> > > > > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > > > > high light this difference.
> > > > > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > > > > and n_proto in flow_keys.
> > > > > >
> > > > > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > > > > is low, but renaming n_proto is imo too much.
> > > > > >
> > > > > > > > may be add vlan_proto/present/tci there as well?
> > > > > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > > > > Why do you think we need them? My understanding was that when
> > > > > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > > > > that vlan info has been already parsed out of the packet and stored in
> > > > > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > > > > points to proper L3 header.
> > > > > > >
> > > > > > > If that's correct, BPF flow dissector should not care about that. For
> > > > > > > example, look at how C-based flow dissector does that:
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > > > >
> > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > and move on.
> > > > > > >
> > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > >
> > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > So far I think you're saying that in all three cases:
> > > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > > This has to be preserved.
> > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > intention to change that.
> > > > >
> > > > > Just to make sure, we are on the same page, here is what
> > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > >
> > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > +----+----+-----+--+
> > > > > |DMAC|SMAC|PROTO|L3|
> > > > > +----+----+-----+--+
> > > > >                  ^
> > > > >                  +-- nhoff
> > > > >                      proto = PROTO
> > > > >
> > > > > VLAN no-skb (eth_get_headlen):
> > > > > +----+----+----+---+-----+--+
> > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > +----+----+----+---+-----+--+
> > > > >                 ^
> > > > >                 +-- nhoff
> > > > >                     proto = TPID
> > > >
> > > > where ctx->data will point to ?
> > > > These nhoff differences are fine.
> > > > I want to make sure that ctx->data is the same for all.
> > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > TCI/L3.
> > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > non-zero (TCI/L3 offset).
> > >
> > > If you want, for skb-less case, when calling BPF program we can do the math
> > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > to look at ctx->data + nhoff, it should not matter what each individual
> > > value is, they only make sense together.
> >
> > My strong preference is to have data to point to L2 in all cases.
> > Semantics of requiring bpf prog to start processing from a tuple
> > (data + nhoff) where both point to random places is very confusing.
> 
> Since flow dissection starts at the network layer, I would then
> suggest data always at L3 and nhoff 0.
> 
> This can be derived in the same manner as __skb_flow_dissect
> already does if !data, using only skb_network_offset.
> 
> From a quick scan, skb_mac_offset should also be valid in all cases
> where the flow dissector is called today, so the other can be computed, too.
> 
> But this is less obvious. For instance, tun_get_user calls into the flow
> dissector up to three times (wow) and IFF_TUN has no link layer
> (ARPHRD_NONE). And then there are also fun variable length link layer
> protocols to deal with..

ahh. ok. Can we guarantee some stable position?
Current bpf_flow_dissect_get_header assumes that
ctx->data + ctx->flow_keys->thoff point to IP, right?
Based on what Stanislav saying above even that is not a guarantee?
I'm struggling to see how users can wrap their heads around this.
It seems bpf_flow.c will become the only prog that can deal with
this range of possible inputs.

I propose to start with the doc that describes all cases, where
things point to and how prog suppose to parse that.
Stanislav Fomichev March 28, 2019, 4:17 a.m. UTC | #20
On 03/27, Alexei Starovoitov wrote:
> On Wed, Mar 27, 2019 at 11:14:46PM -0400, Willem de Bruijn wrote:
> > On Wed, Mar 27, 2019 at 9:26 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Mar 27, 2019 at 12:58:20PM -0700, Stanislav Fomichev wrote:
> > > > On 03/27, Alexei Starovoitov wrote:
> > > > > On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > > > > > proto explicitly passed.
> > > > > > > > > > > >
> > > > > > > > > > > > We should not assume Ethernet link layer.
> > > > > > > > > > >
> > > > > > > > > > > then skb-less dissector has to be different program type
> > > > > > > > > > > because semantics are different.
> > > > > > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > > > > > ask BPF program to do it essentially):
> > > > > > > > > >
> > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > > > > >
> > > > > > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > > > > >
> > > > > > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > > > > > high light this difference.
> > > > > > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > > > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > > > > > and n_proto in flow_keys.
> > > > > > >
> > > > > > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > > > > > is low, but renaming n_proto is imo too much.
> > > > > > >
> > > > > > > > > may be add vlan_proto/present/tci there as well?
> > > > > > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > > > > > Why do you think we need them? My understanding was that when
> > > > > > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > > > > > that vlan info has been already parsed out of the packet and stored in
> > > > > > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > > > > > points to proper L3 header.
> > > > > > > >
> > > > > > > > If that's correct, BPF flow dissector should not care about that. For
> > > > > > > > example, look at how C-based flow dissector does that:
> > > > > > > >
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > > > > >
> > > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > > and move on.
> > > > > > > >
> > > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > > >
> > > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > > So far I think you're saying that in all three cases:
> > > > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > > > This has to be preserved.
> > > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > > intention to change that.
> > > > > >
> > > > > > Just to make sure, we are on the same page, here is what
> > > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > > >
> > > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > > +----+----+-----+--+
> > > > > > |DMAC|SMAC|PROTO|L3|
> > > > > > +----+----+-----+--+
> > > > > >                  ^
> > > > > >                  +-- nhoff
> > > > > >                      proto = PROTO
> > > > > >
> > > > > > VLAN no-skb (eth_get_headlen):
> > > > > > +----+----+----+---+-----+--+
> > > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > > +----+----+----+---+-----+--+
> > > > > >                 ^
> > > > > >                 +-- nhoff
> > > > > >                     proto = TPID
> > > > >
> > > > > where ctx->data will point to ?
> > > > > These nhoff differences are fine.
> > > > > I want to make sure that ctx->data is the same for all.
> > > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > > TCI/L3.
> > > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > > non-zero (TCI/L3 offset).
> > > >
> > > > If you want, for skb-less case, when calling BPF program we can do the math
> > > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > > to look at ctx->data + nhoff, it should not matter what each individual
> > > > value is, they only make sense together.
> > >
> > > My strong preference is to have data to point to L2 in all cases.
> > > Semantics of requiring bpf prog to start processing from a tuple
> > > (data + nhoff) where both point to random places is very confusing.
> > 
> > Since flow dissection starts at the network layer, I would then
> > suggest data always at L3 and nhoff 0.
For eth_get_headlen we need to manually parse 802.1q header. And for RFS
case as well (unless I'm missing something).

> > This can be derived in the same manner as __skb_flow_dissect
> > already does if !data, using only skb_network_offset.
> > 
> > From a quick scan, skb_mac_offset should also be valid in all cases
> > where the flow dissector is called today, so the other can be computed, too.
> > 
> > But this is less obvious. For instance, tun_get_user calls into the flow
> > dissector up to three times (wow) and IFF_TUN has no link layer
> > (ARPHRD_NONE). And then there are also fun variable length link layer
> > protocols to deal with..
> 
> ahh. ok. Can we guarantee some stable position?
I don't think so. Pre RFS ctx->data+nhoff can point to 802.1q header,
post RFS it will point to L3. The only thing we can do is to have
nhoff=0 (and adjust ctx->data accordingly) when the main bpf
flow dissector procedure is called. But that would require bringing
this new kernel context (bpf_flow_dissector) into bpf/stable.
(And it's not clear what's the benefit, since tail calls would still
have to look at that offset).

> Current bpf_flow_dissect_get_header assumes that
> ctx->data + ctx->flow_keys->thoff point to IP, right?
Yes, mostly, except that if skb->protocol is 802.1q/ad, it's 802.1q header.
And it's only for the "main" call; bpf program adjusts this thoff
to make sure that tail calls preserve some sense of progress (so it
eventually points to L4 and that's what we export back).

> Based on what Stanislav saying above even that is not a guarantee?
> I'm struggling to see how users can wrap their heads around this.
> It seems bpf_flow.c will become the only prog that can deal with
> this range of possible inputs.
> 
> I propose to start with the doc that describes all cases, where
> things point to and how prog suppose to parse that.
Yeah, that is what I was going to propose - add a doc along with the
patch series. I don't see how we can make it simple(r) at this point :-(
I can try to document everything so users don't have to read the
kernel code to understand how to write the bpf flow dissector programs.
Willem de Bruijn March 28, 2019, 12:58 p.m. UTC | #21
> > > > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > > > and move on.
> > > > > > > > >
> > > > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > > > >
> > > > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > > > So far I think you're saying that in all three cases:
> > > > > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > > > > This has to be preserved.
> > > > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > > > intention to change that.
> > > > > > >
> > > > > > > Just to make sure, we are on the same page, here is what
> > > > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > > > >
> > > > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > > > +----+----+-----+--+
> > > > > > > |DMAC|SMAC|PROTO|L3|
> > > > > > > +----+----+-----+--+
> > > > > > >                  ^
> > > > > > >                  +-- nhoff
> > > > > > >                      proto = PROTO
> > > > > > >
> > > > > > > VLAN no-skb (eth_get_headlen):
> > > > > > > +----+----+----+---+-----+--+
> > > > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > > > +----+----+----+---+-----+--+
> > > > > > >                 ^
> > > > > > >                 +-- nhoff
> > > > > > >                     proto = TPID
> > > > > >
> > > > > > where ctx->data will point to ?
> > > > > > These nhoff differences are fine.
> > > > > > I want to make sure that ctx->data is the same for all.
> > > > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > > > TCI/L3.
> > > > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > > > non-zero (TCI/L3 offset).
> > > > >
> > > > > If you want, for skb-less case, when calling BPF program we can do the math
> > > > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > > > to look at ctx->data + nhoff, it should not matter what each individual
> > > > > value is, they only make sense together.
> > > >
> > > > My strong preference is to have data to point to L2 in all cases.
> > > > Semantics of requiring bpf prog to start processing from a tuple
> > > > (data + nhoff) where both point to random places is very confusing.
> > >
> > > Since flow dissection starts at the network layer, I would then
> > > suggest data always at L3 and nhoff 0.
> For eth_get_headlen we need to manually parse 802.1q header. And for RFS
> case as well (unless I'm missing something).
>
> > > This can be derived in the same manner as __skb_flow_dissect
> > > already does if !data, using only skb_network_offset.
> > >
> > > From a quick scan, skb_mac_offset should also be valid in all cases
> > > where the flow dissector is called today, so the other can be computed, too.
> > >
> > > But this is less obvious. For instance, tun_get_user calls into the flow
> > > dissector up to three times (wow) and IFF_TUN has no link layer
> > > (ARPHRD_NONE). And then there are also fun variable length link layer
> > > protocols to deal with..
> >
> > ahh. ok. Can we guarantee some stable position?
> I don't think so. Pre RFS ctx->data+nhoff can point to 802.1q header,
> post RFS it will point to L3. The only thing we can do is to have
> nhoff=0 (and adjust ctx->data accordingly) when the main bpf
> flow dissector procedure is called. But that would require bringing
> this new kernel context (bpf_flow_dissector) into bpf/stable.
> (And it's not clear what's the benefit, since tail calls would still
> have to look at that offset).

The flow dissector can be called also before and after tunneling, in
which case skb_network_offset points to an inner header. Or after
MPLS, which stumps a flow dissector called earlier as that has no
information about the encapsulated protocol.

I don't think that there should be a goal that flow dissection starts
at the same point in the packet for all callsites along the datapath.
As long as it always starts at a known ETH_P_.. type protocol header
the program should be able to parse that. That is how the non-BPF
flow dissector works.

> > Current bpf_flow_dissect_get_header assumes that
> > ctx->data + ctx->flow_keys->thoff point to IP, right?
> Yes, mostly, except that if skb->protocol is 802.1q/ad, it's 802.1q header.
> And it's only for the "main" call; bpf program adjusts this thoff
> to make sure that tail calls preserve some sense of progress (so it
> eventually points to L4 and that's what we export back).
>
> > Based on what Stanislav saying above even that is not a guarantee?
> > I'm struggling to see how users can wrap their heads around this.
> > It seems bpf_flow.c will become the only prog that can deal with
> > this range of possible inputs.
> >
> > I propose to start with the doc that describes all cases, where
> > things point to and how prog suppose to parse that.
> Yeah, that is what I was going to propose - add a doc along with the
> patch series. I don't see how we can make it simple(r) at this point :-(

Does it have to be simpler? A flow dissector should be ready to
dissect VLAN tags. That's the only complication here?



> I can try to document everything so users don't have to read the
> kernel code to understand how to write the bpf flow dissector programs.
Stanislav Fomichev April 1, 2019, 4:30 p.m. UTC | #22
On 03/28, Willem de Bruijn wrote:
> > > > > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > > > > and move on.
> > > > > > > > > >
> > > > > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > > > > >
> > > > > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > > > > So far I think you're saying that in all three cases:
> > > > > > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > > > > > This has to be preserved.
> > > > > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > > > > intention to change that.
> > > > > > > >
> > > > > > > > Just to make sure, we are on the same page, here is what
> > > > > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > > > > >
> > > > > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > > > > +----+----+-----+--+
> > > > > > > > |DMAC|SMAC|PROTO|L3|
> > > > > > > > +----+----+-----+--+
> > > > > > > >                  ^
> > > > > > > >                  +-- nhoff
> > > > > > > >                      proto = PROTO
> > > > > > > >
> > > > > > > > VLAN no-skb (eth_get_headlen):
> > > > > > > > +----+----+----+---+-----+--+
> > > > > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > > > > +----+----+----+---+-----+--+
> > > > > > > >                 ^
> > > > > > > >                 +-- nhoff
> > > > > > > >                     proto = TPID
> > > > > > >
> > > > > > > where ctx->data will point to ?
> > > > > > > These nhoff differences are fine.
> > > > > > > I want to make sure that ctx->data is the same for all.
> > > > > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > > > > TCI/L3.
> > > > > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > > > > non-zero (TCI/L3 offset).
> > > > > >
> > > > > > If you want, for skb-less case, when calling BPF program we can do the math
> > > > > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > > > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > > > > to look at ctx->data + nhoff, it should not matter what each individual
> > > > > > value is, they only make sense together.
> > > > >
> > > > > My strong preference is to have data to point to L2 in all cases.
> > > > > Semantics of requiring bpf prog to start processing from a tuple
> > > > > (data + nhoff) where both point to random places is very confusing.
> > > >
> > > > Since flow dissection starts at the network layer, I would then
> > > > suggest data always at L3 and nhoff 0.
> > For eth_get_headlen we need to manually parse 802.1q header. And for RFS
> > case as well (unless I'm missing something).
> >
> > > > This can be derived in the same manner as __skb_flow_dissect
> > > > already does if !data, using only skb_network_offset.
> > > >
> > > > From a quick scan, skb_mac_offset should also be valid in all cases
> > > > where the flow dissector is called today, so the other can be computed, too.
> > > >
> > > > But this is less obvious. For instance, tun_get_user calls into the flow
> > > > dissector up to three times (wow) and IFF_TUN has no link layer
> > > > (ARPHRD_NONE). And then there are also fun variable length link layer
> > > > protocols to deal with..
> > >
> > > ahh. ok. Can we guarantee some stable position?
> > I don't think so. Pre RFS ctx->data+nhoff can point to 802.1q header,
> > post RFS it will point to L3. The only thing we can do is to have
> > nhoff=0 (and adjust ctx->data accordingly) when the main bpf
> > flow dissector procedure is called. But that would require bringing
> > this new kernel context (bpf_flow_dissector) into bpf/stable.
> > (And it's not clear what's the benefit, since tail calls would still
> > have to look at that offset).
> 
> The flow dissector can be called also before and after tunneling, in
> which case skb_network_offset points to an inner header. Or after
> MPLS, which stumps a flow dissector called earlier as that has no
> information about the encapsulated protocol.
> 
> I don't think that there should be a goal that flow dissection starts
> at the same point in the packet for all callsites along the datapath.
> As long as it always starts at a known ETH_P_.. type protocol header
> the program should be able to parse that. That is how the non-BPF
> flow dissector works.
> 
> > > Current bpf_flow_dissect_get_header assumes that
> > > ctx->data + ctx->flow_keys->thoff point to IP, right?
> > Yes, mostly, except that if skb->protocol is 802.1q/ad, it's 802.1q header.
> > And it's only for the "main" call; bpf program adjusts this thoff
> > to make sure that tail calls preserve some sense of progress (so it
> > eventually points to L4 and that's what we export back).
> >
> > > Based on what Stanislav saying above even that is not a guarantee?
> > > I'm struggling to see how users can wrap their heads around this.
> > > It seems bpf_flow.c will become the only prog that can deal with
> > > this range of possible inputs.
> > >
> > > I propose to start with the doc that describes all cases, where
> > > things point to and how prog suppose to parse that.
> > Yeah, that is what I was going to propose - add a doc along with the
> > patch series. I don't see how we can make it simple(r) at this point :-(
> 
> Does it have to be simpler? A flow dissector should be ready to
> dissect VLAN tags. That's the only complication here?
I don't see how it can be made simpler. That's the context from which
existing __skb_flow_dissect is called and that's what we have to dissect
from the BPF as well. We can try to make nhoff to be 0 when the
dissector is called, that's probably the only simplification we can
attempt to do (but, as I said previously, it requires bringing
new kernel context to bpf/stable and seems more complicated than
necessary).

Let me prepare a series for bpf/stable with the small doc describing
BPF flow dissector environment. We can continue the discussion
from there :-)

> > I can try to document everything so users don't have to read the
> > kernel code to understand how to write the bpf flow dissector programs.
diff mbox series

Patch

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 1e2e4a9330af..ec8ebafe333f 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -683,26 +683,6 @@  static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 	}
 }
 
-bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
-			    const struct sk_buff *skb,
-			    struct flow_dissector *flow_dissector,
-			    struct bpf_flow_keys *flow_keys)
-{
-	struct bpf_flow_dissector ctx = {
-		.flow_keys = flow_keys,
-		.skb = skb,
-		.protocol = skb->protocol,
-		.vlan_tci = skb->vlan_tci,
-		.vlan_proto = skb->vlan_proto,
-		.vlan_present = skb->vlan_present,
-		.data = skb->data,
-		.data_end = skb->data + skb_headlen(skb),
-	};
-
-	return bpf_flow_dissect(prog, &ctx, skb_network_offset(skb),
-				skb_headlen(skb));
-}
-
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
 		      int nhoff, int hlen)
 {
@@ -754,6 +734,7 @@  bool __skb_flow_dissect(const struct net *net,
 	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
+	struct bpf_prog *attached = NULL;
 	enum flow_dissect_ret fdret;
 	enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
 	int num_hdrs = 0;
@@ -796,26 +777,37 @@  bool __skb_flow_dissect(const struct net *net,
 					      target_container);
 
 	if (skb) {
-		struct bpf_flow_keys flow_keys;
-		struct bpf_prog *attached = NULL;
-
-		rcu_read_lock();
 		if (!net) {
 			if (skb->dev)
 				net = dev_net(skb->dev);
 			else if (skb->sk)
 				net = sock_net(skb->sk);
-			else
-				WARN_ON_ONCE(1);
 		}
+	}
 
-		if (net)
-			attached = rcu_dereference(net->flow_dissector_prog);
+	WARN_ON_ONCE(!net);
+	if (net) {
+		rcu_read_lock();
+		attached = rcu_dereference(net->flow_dissector_prog);
 
 		if (attached) {
-			ret = __skb_flow_bpf_dissect(attached, skb,
-						     flow_dissector,
-						     &flow_keys);
+			struct bpf_flow_keys flow_keys;
+			struct bpf_flow_dissector ctx = {
+				.flow_keys = &flow_keys,
+				.data = data,
+				.data_end = data + hlen,
+				.protocol = proto,
+			};
+
+			if (skb) {
+				ctx.skb = skb;
+				ctx.protocol = skb->protocol;
+				ctx.vlan_tci = skb->vlan_tci;
+				ctx.vlan_proto = skb->vlan_proto;
+				ctx.vlan_present = skb->vlan_present;
+			}
+
+			ret = bpf_flow_dissect(attached, &ctx, nhoff, hlen);
 			__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
 						 target_container);
 			rcu_read_unlock();