
[v6,net-next,1/4] net: flow_dissector: avoid multiple calls in eBPF

Message ID 1401389758-13252-1-git-send-email-chema@google.com
State Rejected, archived
Delegated to: David Miller

Commit Message

Chema Gonzalez May 29, 2014, 6:55 p.m. UTC
This patch makes __skb_get_pay_offset() (the function called by
the eBPF call generated from "ld #poff") store the output of the flow
dissector (a flow_keys structure) in the eBPF stack. This way, multiple
invocations of this eBPF call in the same BPF filter require only a
single call to the flow dissector.

Note that other eBPF calls that use the flow dissector can take the
same approach and share the results, so that in the end there is only a
single call to the flow dissector per packet.
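The caching scheme described above can be sketched as follows. This is a
Python model, not the kernel code: the real patch stores a flow_keys
structure in the eBPF stack frame (presumably alongside an
"already dissected" marker); all names here are illustrative.

```python
class FlowDissectCache:
    """Models the per-filter-run cache: the flow dissector runs at most
    once per packet; later 'ld poff'-style calls reuse the stored keys."""

    def __init__(self, dissect_fn):
        self.dissect_fn = dissect_fn   # the expensive flow dissector
        self.keys = None               # cached flow_keys equivalent
        self.calls = 0                 # how often the dissector really ran

    def get_keys(self, packet):
        if self.keys is None:          # first call: dissect and cache
            self.keys = self.dissect_fn(packet)
            self.calls += 1
        return self.keys               # later calls: cache hit


def pay_offset(cache, packet):
    """Models __skb_get_pay_offset(): read poff out of the cached keys."""
    return cache.get_keys(packet)["poff"]
```

With this model, four consecutive `pay_offset()` calls on the same packet
dissect it only once, which is exactly what the test program below checks
in the kernel.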

Tested:

$ cat tools/net/ipv4_tcp_poff2.bpf
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ld poff
ld poff
ld poff
ld poff
ret #-1
drop: ret #0
$ ./tools/net/bpf_asm tools/net/ipv4_tcp_poff2.bpf
10,40 0 0 12,21 0 7 2048,48 0 0 23,21 0 5 6,32 0 0 4294963252,32 0 0 4294963252,32 0 0 4294963252,32 0 0 4294963252,6 0 0 4294967295,6 0 0 0,
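For reference, bpf_asm's raw output is the instruction count followed by
comma-separated `code jt jf k` quadruples. Each "ld poff" assembles to
code 32 (BPF_LD|BPF_W|BPF_ABS) with k = 4294963252, which is the u32 view
of SKF_AD_OFF (-0x1000) + SKF_AD_PAY_OFF (52). A small sketch
(a hypothetical helper, not part of tools/net) that parses the format back
into tuples:

```python
def parse_raw_filter(raw):
    """Parse bpf_asm raw output, '<count>,<code> <jt> <jf> <k>,...',
    into a list of (code, jt, jf, k) tuples."""
    fields = [f for f in raw.split(",") if f.strip()]  # drop trailing empty
    count = int(fields[0])
    insns = []
    for field in fields[1:]:
        code, jt, jf, k = (int(x) for x in field.split())
        insns.append((code, jt, jf, k))
    assert len(insns) == count, "count header must match instruction list"
    return insns
```

Running it on the string above yields 10 instructions, the first being
(40, 0, 0, 12), i.e. the `ldh [12]` ethertype load.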

And then, in a VM, we ran:

$ tcpdump -n -i eth0 -f "10,40 0 0 12,21 0 7 2048,48 0 0 23,21 0 5 6,32 0 0 4294963252,32 0 0 4294963252,32 0 0 4294963252,32 0 0 4294963252,6 0 0 4294967295,6 0 0 0,"

This tcpdump is github's tcpdump HEAD with pull request #353, which
allows using raw filters in tcpdump.

Debugging the kernel, we can see that only the first "ld poff" call
invokes the flow dissector. The other calls reuse the value in the
eBPF stack.

Also tested the test_bpf module.

Signed-off-by: Chema Gonzalez <chema@google.com>
---
 include/linux/skbuff.h    |  3 +-
 lib/test_bpf.c            | 52 ++++++++++++++++++++++++++++
 net/core/filter.c         | 86 ++++++++++++++++++++++++++++++++++++++++++-----
 net/core/flow_dissector.c | 16 +++++----
 4 files changed, 141 insertions(+), 16 deletions(-)

Comments

Daniel Borkmann May 29, 2014, 11:54 p.m. UTC | #1
On 05/29/2014 08:55 PM, Chema Gonzalez wrote:
> This patch makes __skb_get_pay_offset() (the function called by
> the eBPF call generated from "ld #poff") store the output of the flow
> dissector (a flow_keys structure) in the eBPF stack. This way, multiple
> invocations of this eBPF call in the same BPF filter will only require
> one single call to the flow dissector.
>
> Note that other eBPF calls that use the flow dissector can use the same
> approach, and share the results, so at the end there is only one single
> call to the flow dissector per packet.
>
> Tested:
>
> $ cat tools/net/ipv4_tcp_poff2.bpf
> ldh [12]
> jne #0x800, drop
> ldb [23]
> jneq #6, drop
> ld poff
> ld poff
> ld poff
> ld poff
> ret #-1
> drop: ret #0
> $ ./tools/net/bpf_asm tools/net/ipv4_tcp_poff2.bpf
> 10,40 0 0 12,21 0 7 2048,48 0 0 23,21 0 5 6,32 0 0 4294963252,32 0 0 4294963252,32 0 0 4294963252,32 0 0 4294963252,6 0 0 4294967295,6 0 0 0,
>
> And then, in a VM, we ran:
>
> $ tcpdump -n -i eth0 -f "10,40 0 0 12,21 0 7 2048,48 0 0 23,21 0 5 6,32 0 0 4294963252,32 0 0 4294963252,32 0 0 4294963252,32 0 0 4294963252,6 0 0 4294967295,6 0 0 0,"
>
> This tcpdump is github's tcpdump HEAD with pull request #353, which
> allows using raw filters in tcpdump.
>
> Debugging the kernel, we can see that there only the first "ld poff" call
> does invoke the flow dissector. The other calls reuse the value in the
> eBPF stack.
>
> Also tested the test_bpf module.
>
> Signed-off-by: Chema Gonzalez <chema@google.com>

I actually liked [1] most ... ;)

Just an idea ...

Have you taken the following into account: since thoff is u16 and ip_proto
is u8, and you clearly seem to want the value of these two in your
patch set (or even the poff value as well), why not squash all of these into
one extension, e.g. SKF_AD_DISSECT, which calls the external
skb_flow_dissect() helper or similar?

I could imagine that these offsets are either squashed into the return
value or, if you really need them, stored from the struct flow_keys into
M[] itself. So you would end up with a single e.g. 'ld #keys' call that
fills out/overwrites your M[] with the dissected metadata. Then you'd only
need to call it once and do further processing based on that information.
Whether you also need poff could e.g. be determined by the user doing e.g.
'ldx #1' or so, to indicate to the function being called that it should
dissect further and also return poff in fixed, defined M[] slots.

Thus, if someone really only cares about headers, a use case like 'ld #poff +
ret a' is sufficient; but if you need much more and don't want to do it in
BPF, as in your case, you go with 'ld #keys' and process your BPF program
further via M[].

Thanks,

Daniel

  [1] http://patchwork.ozlabs.org/patch/344271/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chema Gonzalez May 30, 2014, 5:12 p.m. UTC | #2
On Thu, May 29, 2014 at 4:54 PM, Daniel Borkmann <dborkman@redhat.com> wrote:
> I actually liked [1] most ... ;)
>
> Just an idea ...
>
> Have you taken into account the following: since thoff is u16 and ip_proto
> is u8 and you clearly seem to want to have the value of these two in your
> patch set (or even plus the poff value), why not squashing all these into
Note that I want not only thoff and ip_proto (the ones in patches #1 and
#2), but also nhoff and nproto. I could use the other flow_keys fields
(saddr, daddr, sport, dport), but they're one proto
comparison+indirect load away from [tn]hoff/[ip_|n]proto, so I don't
think it's worth changing BPF for them. Now, nhoff and nproto are not
kept by the flow dissector. We cannot just add them to flow_keys, as
the struct is used in some skb CBs that don't have any space left. I
have a patch that replaces thoff/ip_proto with nhoff/nproto (as the L4
info is easily obtainable from the L3 info, but not the other way
around), plus a fix for the couple of places where the L4 info is
actually used. I'll post it after these patches (and after your
code/decode cleaning goes through).
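[The "L4 info is easily obtainable from the L3 info" derivation can be
illustrated for plain IPv4 (no tunnels, no IPv6): given nhoff and nproto,
ip_proto is the IP header's protocol byte and thoff = nhoff + IHL*4. A
hedged sketch, with illustrative names:]

```python
ETH_P_IP = 0x0800

def l4_from_l3(pkt, nhoff, nproto):
    """Derive (thoff, ip_proto) from (nhoff, nproto) for IPv4:
    ip_proto is the protocol field at nhoff + 9, and
    thoff = nhoff + IHL * 4 (IHL counts 32-bit words)."""
    if nproto != ETH_P_IP:
        raise NotImplementedError("sketch handles IPv4 only")
    ihl = pkt[nhoff] & 0x0F            # low nibble of version/IHL byte
    ip_proto = pkt[nhoff + 9]          # protocol field (6 == TCP)
    return nhoff + ihl * 4, ip_proto
```

For a standard Ethernet + 20-byte IPv4 TCP packet this gives
thoff = 14 + 20 = 34 and ip_proto = 6, matching the filter in the commit
message.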

> one extension e.g. SKF_AD_DISSECT where you call the external
> skb_flow_dissect()
> helper or similar on?
>
> I could imagine that either these offsets are squashed into the return or
> stored if you really need them from the struct flow_keys into M[] itself. So
> you would end up with one e.g. 'ld #keys' call that e.g. fills
> out/overwrites
> your M[] with the dissected metadata. Then, you'd only have a reason to call
> that once and do further processing based on these information. Whether you
> also need poff could e.g. be determined if the user does e.g. 'ldx #1' or so
> to indicate to the function it eventually calls, that it would further do
> the dissect and also give you poff into fixed defined M[] slots back.

IIUC, you're suggesting that the classic BPF user write "ld #keys"
to load flow_keys into the stack and then *explicitly* add "ld
mem[<offset>]" to access the field she's interested in. Note that if
the "ld mem[]" is implicit, then the user must use "ld #thoff" to let
the compiler know she wants "ld #keys" followed by "ld
mem[<thoff_offset>]" (and not "ld mem[<other_offset>]"), which is
equivalent to what we're doing right now*.

The only advantage I can see is that we're only adding one new
ancillary load, which is more elegant. I can see some issues with this
approach:

- usability. The user has to actually know the right offset of
thoff, which is cumbersome and may change between kernels. We'd
effectively be exporting the flow_keys struct layout to users.
- we have to ensure that the classic BPF filter does not muck with
mem[] between the "ld #keys" and all of the "ld* mem[]" instructions
that access it. Note that we're currently storing the flow_keys struct
in the stack outside of the mem[] words, so this is not a problem
here. (It is only accessible using eBPF insns.)

*Now, if your comment is eBPF-only, that could be OK. That would mean:

- a. the user writes "ld #thoff"
- b. the BPF->eBPF compiler generates one common BPF_CALL
(__skb_get_flow_dissect) plus an "ebpf_ldx stack[right_offset]". This
would not include payload_offset (which needs to run its own function
to get the poff from flow_keys).

Is this what you are proposing?

-Chema


> Thus if someone really only cares about headers, a use case like 'ld #poff +
> ret a' is sufficient, but if you need much more and don't want to do this in
> BPF like in your case, you go with 'ld #keys' and process your BPF program
> further via M[].
>
> Thanks,
>
> Daniel
>
>  [1] http://patchwork.ozlabs.org/patch/344271/
Daniel Borkmann June 2, 2014, 12:36 p.m. UTC | #3
On 05/30/2014 07:12 PM, Chema Gonzalez wrote:
> On Thu, May 29, 2014 at 4:54 PM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> I actually liked [1] most ... ;)
...
>> one extension e.g. SKF_AD_DISSECT where you call the external
>> skb_flow_dissect()
>> helper or similar on?
>>
>> I could imagine that either these offsets are squashed into the return or
>> stored if you really need them from the struct flow_keys into M[] itself. So
>> you would end up with one e.g. 'ld #keys' call that e.g. fills
>> out/overwrites
>> your M[] with the dissected metadata. Then, you'd only have a reason to call
>> that once and do further processing based on these information. Whether you
>> also need poff could e.g. be determined if the user does e.g. 'ldx #1' or so
>> to indicate to the function it eventually calls, that it would further do
>> the dissect and also give you poff into fixed defined M[] slots back.
>
> IIUC, you're suggesting to have the classic BPF user writes "ld #keys"
> to load flow_keys in the stack and then *explicitly* adds "ld
> mem[<offset>]" to access the field she's interested on. Note that if
> the "ld mem[]" is implicit, then the user must use "ld #thoff" to let
> the compiler know she wants "ld #keys" and then "ld
> mem[<thoff_offset>]" (and not "ld mem[<other_offset>]"), which is
> equivalent to what we're doing right now*.

I think there are many ways to design that, but for something possibly
generic, what I mean is something like this:

We would know a priori, per the API documentation, which slots in M[]
are filled with which information; just for example:

M[0] <-- nhoff  [network header off]
M[1] <-- nproto [network header proto]
M[2] <-- thoff  [transport header off]
M[3] <-- tproto [transport header proto]
M[4] <-- poff   [payload off]

Now a user could load the following pseudo bpf_asm style program:

ld #5     <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
ld #keys  <-- triggers the extension to fill the M[] slots
ld M[0]   <-- loads nhoff from M[0] into accu
<do sth with it>
ld M[3]   <-- loads tproto into accu, etc
...

A program like:

ld #2
ld #keys
...

Would then, on the other hand, only fill the first two slots of M[], as
the user does not need all the information from the kernel's flow
dissector, and thus the skb also does not need to be fully dissected.

This also permits adding new fields in the future by enabling them with
ld #6 etc., for example.
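The proposed 'ld #N; ld #keys' semantics could be modeled like this
(a Python sketch with illustrative names; the real mechanism would live in
the classic BPF converter and verifier):

```python
# API-documented slot order: which dissected value lands in which M[] slot.
SLOT_ORDER = ["nhoff", "nproto", "thoff", "tproto", "poff"]

def ld_keys(mem, nslots, dissect_fn, packet):
    """Model of 'ld #N' followed by 'ld #keys': dissect the packet once
    and fill the first nslots M[] slots in the documented order."""
    keys = dissect_fn(packet)          # one flow-dissector call per packet
    for i in range(nslots):
        mem[i] = keys[SLOT_ORDER[i]]
    return mem
```

So 'ld #2; ld #keys' fills only M[0] (nhoff) and M[1] (nproto), leaving the
remaining slots, and the dissector's work, untouched.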

I think this would fit much better into the design of BPF (although I
don't strictly like referring to the kernel's flow dissector and thus
bypassing BPF's actual purpose; but I understand that it can get quite
complicated with tunnels, etc.).

But this idea would allow you to add only 1 new extension and have it
dynamically return part or all of the information you need in your
program. It thus solves the issue of calling the skb flow dissector
multiple times just because we have ld #thoff, ld #nhoff, etc.; we can
avoid such a stack_layout + filter_prologue hack and design it more
cleanly into the BPF architecture.

> The only advantage I can see is that we're only adding one new
> ancillary load, which is more elegant. I can see some issues with this
> approach:
>
> - usability. The user has to actually know what's the right offset of
> the thoff, which is cumbersome and may change with kernels. We'd be
> effectively exporting the flow_keys struct layout to the users.

See above.

> - have to take care that the classic BPF filter does not muck with
> mem[] between the "ld #keys" and all of the the "ld* mem[]" that
> access to it. Note that we're currently storing the flow_keys struct
> in the stack outside of the mem[] words, so this is not a problem
> here. (It is only accessible using ebpf insns.)

Since it is part of the API/ABI, and a new instruction, it is then
known to the user that ld #keys fills particular slots of M[].
That, however, must be carefully designed, so that in the future one
doesn't need to add an ld #keys2. Not mucking with the M[] fields is
then part of the user's task, just the same way a user shouldn't store
something to A resp. X while an ld resp. ldx would knowingly overwrite
the value stored previously. I don't think it would be any different
from that.

> *Now, if your comment is ebpf-only, that could be ok. That would mean:
>
> - a. the user writes "ld #thoff"
> - b. the BPF->eBPF compiler generates one common BPF_CALL
> (__skb_get_flow_dissect) plus a "ebpf_ldx stack[right_offset]". This
> would not include payload_offset (which needs to run its own function
> to get the poff from flow_keys).

Since we're not using eBPF in user space right now, the comment is not
about eBPF. I think if you were using eBPF from user space, you wouldn't
need to add such extensions anymore, but could just write yourself a
minimal flow dissector in 'restricted' C to solve that problem.

Cheers,

Daniel
Alexei Starovoitov June 2, 2014, 4:48 p.m. UTC | #4
Extending the cc-list, since I think this thread is related to the bpf split thread.

On Mon, Jun 2, 2014 at 5:36 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 05/30/2014 07:12 PM, Chema Gonzalez wrote:
>>
>> On Thu, May 29, 2014 at 4:54 PM, Daniel Borkmann <dborkman@redhat.com>
>> wrote:
>>>
>>> I actually liked [1] most ... ;)
>
> ...
>
>>> one extension e.g. SKF_AD_DISSECT where you call the external
>>> skb_flow_dissect()
>>> helper or similar on?
>>>
>>> I could imagine that either these offsets are squashed into the return or
>>> stored if you really need them from the struct flow_keys into M[] itself.
>>> So
>>> you would end up with one e.g. 'ld #keys' call that e.g. fills
>>> out/overwrites
>>> your M[] with the dissected metadata. Then, you'd only have a reason to
>>> call
>>> that once and do further processing based on these information. Whether
>>> you
>>> also need poff could e.g. be determined if the user does e.g. 'ldx #1' or
>>> so
>>> to indicate to the function it eventually calls, that it would further do
>>> the dissect and also give you poff into fixed defined M[] slots back.
>>
>>
>> IIUC, you're suggesting to have the classic BPF user writes "ld #keys"
>> to load flow_keys in the stack and then *explicitly* adds "ld
>> mem[<offset>]" to access the field she's interested on. Note that if
>> the "ld mem[]" is implicit, then the user must use "ld #thoff" to let
>> the compiler know she wants "ld #keys" and then "ld
>> mem[<thoff_offset>]" (and not "ld mem[<other_offset>]"), which is
>> equivalent to what we're doing right now*.
>
>
> I think there are many ways to design that, but for something possibly
> generic, what I mean is something like this:
>
> We would a-priori know per API documentation what slots in M[] are
> filled with which information, just for example:
>
> M[0] <-- nhoff  [network header off]
> M[1] <-- nproto [network header proto]
> M[2] <-- thoff  [transport header off]
> M[3] <-- tproto [transport header proto]
> M[4] <-- poff   [payload off]
>
> Now a user could load the following pseudo bpf_asm style program:
>
> ld #5     <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
> ld #keys  <-- triggers the extension to fill the M[] slots
> ld M[0]   <-- loads nhoff from M[0] into accu
> <do sth with it>
> ld M[3]   <-- loads tproto into accu, etc
> …

imo there are pros and cons in both Daniel's and Chema's proposals
for classic BPF extensions.
I like Chema's a bit more, since his proposal doesn't require changing
the classic BPF verifier, and that's always a delicate area
(seccomp also allows the use of M[] slots).
Both still look like a stop gap until the next classic extension
appears on the horizon.

I think an explicit eBPF program solves this particular task more
cleanly, and it doesn't require changing eBPF.

> A program like:
>
> ld #2
> ld #keys
> ...
>
> Would then on the other hand only fill the first two slots of M[] as
> the user does not need all information from the kernel's flow dissector
> and thus also does not fully need to dissect the skb.
>
> This also permits in future to add new fields by enabling them with
> ld #6 etc, for example.
>
> I think this would fit much more into the design of BPF (although I
> don't strictly like referring to the kernel's flow dissector and thus
> bypassing BPF's actual purpose; but I understand that it can get quite
> complicated with tunnels, etc).
>
> But this idea would allow you to only add 1 new extension and to have
> it return dynamically a part or all information you would need in your
> program, and thus solves the issue of calling the skb flow dissector
> multiple times just because we have ld #thoff, ld #nhoff, etc, we can
> avoid making such a stack_layout + filter_prologue hack and thus design
> it more cleanly into the BPF architecture.
>
>
>> The only advantage I can see is that we're only adding one new
>> ancillary load, which is more elegant. I can see some issues with this
>> approach:
>>
>> - usability. The user has to actually know what's the right offset of
>> the thoff, which is cumbersome and may change with kernels. We'd be
>> effectively exporting the flow_keys struct layout to the users.
>
>
> See above.
>
>
>> - have to take care that the classic BPF filter does not muck with
>> mem[] between the "ld #keys" and all of the the "ld* mem[]" that
>> access to it. Note that we're currently storing the flow_keys struct
>> in the stack outside of the mem[] words, so this is not a problem
>> here. (It is only accessible using ebpf insns.)
>
>
> Since it is part of the API/ABI, and a new instruction, it is then
> known that ld #keys would fill particular slots of M[] to the user.
> That however, must be carefully designed, so that in future one
> doesn't need to add ld #keys2. The part of not mucking with M[] fields
> is then part of the user's task actually, just the same way as a
> user shouldn't store something to A resp. X while an ld resp. ldx
> knowingly would overwrite that value stored previously. I don't
> think it would be any different than that.
>
>
>> *Now, if your comment is ebpf-only, that could be ok. That would mean:
>>
>> - a. the user writes "ld #thoff"
>> - b. the BPF->eBPF compiler generates one common BPF_CALL
>> (__skb_get_flow_dissect) plus a "ebpf_ldx stack[right_offset]". This
>> would not include payload_offset (which needs to run its own function
>> to get the poff from flow_keys).
>
>
> Since we're not using eBPF in user space right now, the comment is not
> about eBPF. I think if you would use eBPF from user space, you don't
> need to add such extensions anymore, but could just write yourself a
> minimal flow dissector in 'restricted' C to solve that problem.

Exactly.
I think exposing eBPF to user space is not a matter of 'if' but 'when'.

I'm not sure how pressing it is to add another extension to classic BPF
now, when the goal of this patch set fits into the eBPF model just fine.
Yes, the eBPF verifier is not in tree yet, and it will obviously take
longer to review than Chema's or Daniel's set.

When eBPF is exposed to user space, inner header access can be done in
two ways without changing the eBPF instruction set or the eBPF verifier.

eBPF approach #1:
-- code the packet parser in restricted C
Pros:
skb_flow_dissect() stays hidden in the kernel.
No additional uapi headache, which would exist if we started reusing the
in-kernel skb_flow_dissect() either via classic BPF or eBPF.
Such an eBPF program will be just as fast as the in-kernel
skb_flow_dissect() (I provided performance numbers before).
Cons:
the in-kernel skb_flow_dissect() is not reused, but mostly copy-pasted
to userspace.

eBPF approach #2:
-- reuse the in-kernel skb_flow_dissect() via a bpf_call insn from the
eBPF program
Pros:
the eBPF program becomes much shorter and can be done in asm like:
bpf_mov r2, fp
bpf_sub r2, 64
bpf_call bpf_skb_flow_dissect  // call in-kernel helper from eBPF program
bpf_ldx r1, [fp - 64]  // access flow_keys data
bpf_ldx r2, [fp - 60]

Cons:
bpf_skb_flow_dissect() (a wrapper on top of skb_flow_dissect()) becomes
a uapi-visible helper function
imo both eBPF approaches are cleaner than extending classic.
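[The calling convention of approach #2 can be modeled as follows. This is
a Python sketch with an *assumed* layout (thoff as u16 at fp - 64,
ip_proto as u32 at fp - 60, matching the two loads in the asm above); the
real struct flow_keys layout differs:]

```python
import struct

def bpf_skb_flow_dissect(stack, off, keys):
    """Model of the proposed helper: write dissected keys into the
    caller's eBPF stack frame at the pointer it passed in (fp - 64).
    Assumed layout: thoff (u16), 2 bytes pad, ip_proto (u32)."""
    struct.pack_into("<HHI", stack, off, keys["thoff"], 0, keys["ip_proto"])

STACK_SIZE = 512
stack = bytearray(STACK_SIZE)
fp = STACK_SIZE                          # frame pointer at top of frame

# bpf_mov r2, fp; bpf_sub r2, 64; bpf_call bpf_skb_flow_dissect
bpf_skb_flow_dissect(stack, fp - 64, {"thoff": 34, "ip_proto": 6})

thoff = struct.unpack_from("<H", stack, fp - 64)[0]     # bpf_ldx r1, [fp - 64]
ip_proto = struct.unpack_from("<I", stack, fp - 60)[0]  # bpf_ldx r2, [fp - 60]
```

The helper fills the caller-provided stack region exactly once, and every
subsequent field access is a plain stack load, which is the point of the
approach.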
Daniel Borkmann June 3, 2014, 8:33 a.m. UTC | #5
On 06/02/2014 06:48 PM, Alexei Starovoitov wrote:
> extending cc-list, since I think this thread is related to bpf split thread.
>
> On Mon, Jun 2, 2014 at 5:36 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> On 05/30/2014 07:12 PM, Chema Gonzalez wrote:
>>>
>>> On Thu, May 29, 2014 at 4:54 PM, Daniel Borkmann <dborkman@redhat.com>
>>> wrote:
>>>>
>>>> I actually liked [1] most ... ;)
>>
>> ...
>>
>>>> one extension e.g. SKF_AD_DISSECT where you call the external
>>>> skb_flow_dissect()
>>>> helper or similar on?
>>>>
>>>> I could imagine that either these offsets are squashed into the return or
>>>> stored if you really need them from the struct flow_keys into M[] itself.
>>>> So
>>>> you would end up with one e.g. 'ld #keys' call that e.g. fills
>>>> out/overwrites
>>>> your M[] with the dissected metadata. Then, you'd only have a reason to
>>>> call
>>>> that once and do further processing based on these information. Whether
>>>> you
>>>> also need poff could e.g. be determined if the user does e.g. 'ldx #1' or
>>>> so
>>>> to indicate to the function it eventually calls, that it would further do
>>>> the dissect and also give you poff into fixed defined M[] slots back.
>>>
>>>
>>> IIUC, you're suggesting to have the classic BPF user writes "ld #keys"
>>> to load flow_keys in the stack and then *explicitly* adds "ld
>>> mem[<offset>]" to access the field she's interested on. Note that if
>>> the "ld mem[]" is implicit, then the user must use "ld #thoff" to let
>>> the compiler know she wants "ld #keys" and then "ld
>>> mem[<thoff_offset>]" (and not "ld mem[<other_offset>]"), which is
>>> equivalent to what we're doing right now*.
>>
>>
>> I think there are many ways to design that, but for something possibly
>> generic, what I mean is something like this:
>>
>> We would a-priori know per API documentation what slots in M[] are
>> filled with which information, just for example:
>>
>> M[0] <-- nhoff  [network header off]
>> M[1] <-- nproto [network header proto]
>> M[2] <-- thoff  [transport header off]
>> M[3] <-- tproto [transport header proto]
>> M[4] <-- poff   [payload off]
>>
>> Now a user could load the following pseudo bpf_asm style program:
>>
>> ld #5     <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
>> ld #keys  <-- triggers the extension to fill the M[] slots
>> ld M[0]   <-- loads nhoff from M[0] into accu
>> <do sth with it>
>> ld M[3]   <-- loads tproto into accu, etc
>> …
>
> imo there are pros and cons in Daniel's and Chema's proposals
> for classic BPF extensions.
> I like Chema's a bit more, since his proposal doesn't require to
> change classic BPF verifier and that's always a delicate area
> (seccomp also allows to use M[] slots).

What bothers me is that you have to *special case* this extension
all over the BPF converter and stack; you even need to introduce a
prologue just to walk the whole filter program and check whether the
extension is present. On top of that, instead of one extension you
need to add a couple of them to the uapi. I understand Chema's proposal
and his need to easily access these data, but that's why I proposed
the above if he wants to go that route. Btw, seccomp doesn't allow
loading BPF extensions; you said so yourself, Alexei.

> Both still look like a stop gap until next classic extension
> appears on horizon.
>
> I think explicit eBPF program solves this particular task cleaner
> and it doesn't require changing eBPF.
>
>> A program like:
>>
>> ld #2
>> ld #keys
>> ...
>>
>> Would then on the other hand only fill the first two slots of M[] as
>> the user does not need all information from the kernel's flow dissector
>> and thus also does not fully need to dissect the skb.
>>
>> This also permits in future to add new fields by enabling them with
>> ld #6 etc, for example.
>>
>> I think this would fit much more into the design of BPF (although I
>> don't strictly like referring to the kernel's flow dissector and thus
>> bypassing BPF's actual purpose; but I understand that it can get quite
>> complicated with tunnels, etc).
>>
>> But this idea would allow you to only add 1 new extension and to have
>> it return dynamically a part or all information you would need in your
>> program, and thus solves the issue of calling the skb flow dissector
>> multiple times just because we have ld #thoff, ld #nhoff, etc, we can
>> avoid making such a stack_layout + filter_prologue hack and thus design
>> it more cleanly into the BPF architecture.
>>
>>
>>> The only advantage I can see is that we're only adding one new
>>> ancillary load, which is more elegant. I can see some issues with this
>>> approach:
>>>
>>> - usability. The user has to actually know what's the right offset of
>>> the thoff, which is cumbersome and may change with kernels. We'd be
>>> effectively exporting the flow_keys struct layout to the users.
>>
>>
>> See above.
>>
>>
>>> - have to take care that the classic BPF filter does not muck with
>>> mem[] between the "ld #keys" and all of the the "ld* mem[]" that
>>> access to it. Note that we're currently storing the flow_keys struct
>>> in the stack outside of the mem[] words, so this is not a problem
>>> here. (It is only accessible using ebpf insns.)
>>
>>
>> Since it is part of the API/ABI, and a new instruction, it is then
>> known that ld #keys would fill particular slots of M[] to the user.
>> That however, must be carefully designed, so that in future one
>> doesn't need to add ld #keys2. The part of not mucking with M[] fields
>> is then part of the user's task actually, just the same way as a
>> user shouldn't store something to A resp. X while an ld resp. ldx
>> knowingly would overwrite that value stored previously. I don't
>> think it would be any different than that.
>>
>>
>>> *Now, if your comment is ebpf-only, that could be ok. That would mean:
>>>
>>> - a. the user writes "ld #thoff"
>>> - b. the BPF->eBPF compiler generates one common BPF_CALL
>>> (__skb_get_flow_dissect) plus a "ebpf_ldx stack[right_offset]". This
>>> would not include payload_offset (which needs to run its own function
>>> to get the poff from flow_keys).
>>
>>
>> Since we're not using eBPF in user space right now, the comment is not
>> about eBPF. I think if you would use eBPF from user space, you don't
>> need to add such extensions anymore, but could just write yourself a
>> minimal flow dissector in 'restricted' C to solve that problem.
>
> Exactly.
> I think exposing eBPF to user space is not a matter of 'if' but 'when'.
>
> I'm not sure how pressing it is now to add another extension to classic,
> when the goal of this patch set fits into eBPF model just fine.
> yes, eBPF verifier is not in tree yet and it will obviously take longer to
> review than Chema's or Daniel's set.
>
> When eBPF is exposed to user space the inner header access can
> be done in two ways without changing eBPF instruction set or eBPF
> verifier.
>
> eBPF approach #1:
> -- code packet parser in restricted C
> Pros:
> skb_flow_dissect() stays hidden in kernel.
> No additional uapi headache which exist if we start reusing
> in-kernel skb_flow_dissect() either via classic or eBPF.
> Such eBPF program will be just as fast as in-kernel skb_flow_dissect()
> (I provided performance numbers before)
> Cons:
> in-kernel skb_flow_dissect() is not reused, but mostly copy-pasted to userspace.
>
> eBPF approach #2:
> -- reuse in-kernel skb_flow_dissect() via bpf_call insn from eBPF program
> Pros:
> eBPF program becomes much shorter and can be done in asm like:
> bpf_mov r2, fp
> bpf_sub r2, 64
> bpf_call bpf_skb_flow_dissect  // call in-kernel helper function from
> eBPF program
> bpf_ldx r1, [fp - 64]  // access flow_keys data
> bpf_ldx r2, [fp - 60]
>
> Cons:
> bpf_skb_flow_dissect() (wrapper on top of skb_flow_dissect()) becomes
> uapi visible helper function
>
> imo both eBPF approaches are cleaner than extending classic.
>
Alexei Starovoitov June 3, 2014, 8:15 p.m. UTC | #6
On Tue, Jun 3, 2014 at 1:33 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 06/02/2014 06:48 PM, Alexei Starovoitov wrote:
>>
>> imo there are pros and cons in Daniel's and Chema's proposals
>> for classic BPF extensions.
>> I like Chema's a bit more, since his proposal doesn't require to
>> change classic BPF verifier and that's always a delicate area
>> (seccomp also allows to use M[] slots).
>
>
> What bothers me is that you have to *special case* this extension
> all over the BPF converter and stack, even you need to introduce a
> prologue just to walk the whole filter program and to check if the
> extension is present; next to that instead of one extension you
> need to add a couple of them to uapi. I understand Chema's proposal

Agreed. The walking part and the multiple ancillary loads are not clean,
but in your approach you'd need to hack sk_chk_filter() to recognize a
very specific ancillary load, and do even fancier things in
check_load_and_stores() to make sure that 'ld #5, ld #keys' is safe from
an M[] usage point of view.

> or his need to easily access these data but that's why I proposed
> the above if he wants to go for that. Btw, seccomp doesn't allow
> for loading BPF extensions, you said so yourself Alexei.

Of course, but seccomp does use ld/st M[], which would overlap
with your socket extension, and since check_load_and_stores()
would be hacked, seccomp verification needs to be considered as well.
From the user API point of view, your approach is cleaner than Chema's,
but from the implementation point of view, Chema's is much safer and
smaller.

Anyway, as I said before, I'm not excited about either.
I don't think we should be adding classic BPF extensions any more.
The long-term headache of supporting classic BPF extensions
outweighs the short-term benefits.

Not having to constantly tweak the kernel for such cases was the key
goal of eBPF. My two eBPF approaches to solving Chema's need
are both cleaner, since nothing special is done in the eBPF
core to support this new need. Both the instruction set and the verifier
stay generic and cover tracing and this new socket use at once.
I do realize that I'm talking here about a future eBPF verifier that
is not in tree yet, and that eBPF is not exposed to user space, but
I think we should consider the longer-term perspective.
If I'm forced to pick between your classic extensions and Chema's,
I would pick Chema's, because it's the lesser evil :)

>> I think exposing eBPF to user space is not a matter of 'if' but 'when'.
>>
>> I'm not sure how pressing it is now to add another extension to classic,
>> when the goal of this patch set fits into eBPF model just fine.
>> yes, eBPF verifier is not in tree yet and it will obviously take longer to
>> review than Chema's or Daniel's set.
>>
>> When eBPF is exposed to user space the inner header access can
>> be done in two ways without changing eBPF instruction set or eBPF
>> verifier.
>>
>> eBPF approach #1:
>> -- code packet parser in restricted C
>> Pros:
>> skb_flow_dissect() stays hidden in kernel.
>> No additional uapi headache which exist if we start reusing
>> in-kernel skb_flow_dissect() either via classic or eBPF.
>> Such eBPF program will be just as fast as in-kernel skb_flow_dissect()
>> (I provided performance numbers before)
>> Cons:
>> in-kernel skb_flow_dissect() is not reused, but mostly copy-pasted to
>> userspace.
>>
>> eBPF approach #2:
>> -- reuse in-kernel skb_flow_dissect() via bpf_call insn from eBPF program
>> Pros:
>> eBPF program becomes much shorter and can be done in asm like:
>> bpf_mov r2, fp
>> bpf_sub r2, 64
>> bpf_call bpf_skb_flow_dissect  // call in-kernel helper function from
>> eBPF program
>> bpf_ldx r1, [fp - 64]  // access flow_keys data
>> bpf_ldx r2, [fp - 60]
>>
>> Cons:
>> bpf_skb_flow_dissect() (wrapper on top of skb_flow_dissect()) becomes
>> uapi visible helper function
>>
>> imo both eBPF approaches are cleaner than extending classic.
>>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chema Gonzalez June 3, 2014, 9:12 p.m. UTC | #7
On Tue, Jun 3, 2014 at 1:15 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Tue, Jun 3, 2014 at 1:33 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> On 06/02/2014 06:48 PM, Alexei Starovoitov wrote:
>>>
>>> imo there are pros and cons in Daniel's and Chema's proposals
>>> for classic BPF extensions.
>>> I like Chema's a bit more, since his proposal doesn't require to
>>> change classic BPF verifier and that's always a delicate area
>>> (seccomp also allows to use M[] slots).
>>
>>
>> What bothers me is that you have to *special case* this extension
>> all over the BPF converter and stack, even you need to introduce a
>> prologue just to walk the whole filter program and to check if the
>> extension is present; next to that instead of one extension you
>> need to add a couple of them to uapi. I understand Chema's proposal
>
> Agree. The walking part and multiple anc loads are not clean, but in your
> approach you'd need to hack sk_chk_filter() to recognize very specific
> anc load and do even fancier things in check_load_and_stores()
> to make sure that 'ld #5, ld #keys' is safe from M[] usage point of view.
I think you may be confusing the two parts of the patch, namely
whether we implement the code as 1 anc load with N parameters or as N
anc loads, and the existence of a prologue.

- Adding just 1 anc load ("ld #keys" in your pseudo-code) requires
fewer UAPI changes, but effectively exports the flow_keys struct to the
user (which means we cannot change it in the future). The fact that
you're proposing your own version of the flow_keys ( {nhoff, nproto,
thoff, tproto, poff} ) does not make it less of a problem: You're just
exporting an alternative (albeit IMO way better one, as all the
dissected info can be obtained from your 5-tuple) flow_keys. From a
usability PoV, the user may have to either do "ld #0; ld #keys"
(knowing that "#0" refers to "nhoff"), or "ld #nhoff". I definitely
prefer the second, but I can easily rewrite the patch to use the
first.

- Now, the existence of a prologue is a must if you want to ensure the
filter only calls the actual flow dissector once (this is the main
goal of the patch, actually). Having 2 insns separated by N insns that
access data obtained from the same flow dissector call means you have
to both (a) ensure the first caller -- whether it's the first or the
second insn -- does perform the BPF_CALL, and (b) ensure that none of
the N insns in the middle mucks with the result of the first call (my
solution deals with (b) by using the stack outside of M[], while yours
requires verifying the code. I find mine easier).

In order to achieve (a), you need the prologue code: Because the code
can have jumps, you can never know whether the "ld #keys" was actually
called or not. What my solution's prologue does is write a 0 to a
flow_inited u32, which states that the flow dissector hasn't been
called. Now, every call for any of the sub-fields checks this
flow_inited field, and runs the flow dissector iff it's zero (hasn't
been called yet), in which case it also sets flow_inited to 1.

Your approach needs it too. Citing from your pseudo-code:

> ld #5     <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
> ld #keys  <-- triggers the extension to fill the M[] slots
> ld M[0]   <-- loads nhoff from M[0] into accu

How does the "ld M[0]" know that the actual flow dissector has already
been called? What if the insn just before the "ld #5" was "jmp +2" ?
In that case, the "ld #keys" would have never been called.
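The hazard can be sketched in the bpf_asm DSL used earlier in this thread ("ld #5" and "ld #keys" are the proposed hypothetical extensions, not existing bpf_asm syntax):

```
ldb [0]
jeq #0x42, fill, skip  ; packet-dependent branch
fill: ld #5            ; (hypothetical) request filling M[0]..M[4]
ld #keys               ; (hypothetical) run the flow dissector once
skip: ld M[0]          ; if 'skip' was taken, M[0] was never written
ret a
```

When the branch goes straight to 'skip', the "ld M[0]" reads a slot the flow dissector never filled, so the checker has to reject or special-case exactly this pattern.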

> ld M[0]   <-- loads nhoff from M[0] into accu
> <do sth with it>
> ld M[3]   <-- loads tproto into accu, etc

>> or his need to easily access these data but that's why I proposed
>> the above if he wants to go for that. Btw, seccomp doesn't allow
>> for loading BPF extensions, you said so yourself Alexei.
>
> of course, but seccomp does use ld/st M[] which will overlap
> with your socket extension and since check_load_and_stores()
> will be hacked, seccomp verification needs to be considered as well.
> From user api point of view, your approach is cleaner than Chema's,
> but from implementation Chema's is much safer and smaller.
>
> Anyway as I said before I'm not excited about either.
> I don't think we should be adding classic BPF extensions any more.
> The long term headache of supporting classic BPF extensions
> outweighs the short term benefits.
I see a couple of issues with (effectively) freezing classic BPF
development while waiting for direct eBPF access to happen. The first
one is that the kernel has to accept it. I can see many questions
about this, especially security and usability (I'll send an email
about the "split BPF out of core later"). Now, the main issue is
whether/when the tools will support it. IMO, this is useful iff I can
quickly write/reuse filters and run tcpdump filters based on them. I'm
trying to get upstream libpcap to accept support for raw (classic) BPF
filters, and it's taking a long time. I can imagine how they may be
less receptive about supporting a Linux-only eBPF mechanism. Tools do
matter.

Even if eBPF happens, it's not that the extensions are so hard to port
to eBPF: It's one BPF_CALL per extension. And they have a
straightforward porting path to the C-like code.

-Chema


> Not having to constantly tweak kernel for such cases was the key
> goal of eBPF. My two eBPF approaches to solve Chema's need
> are both cleaner, since nothing special is being done in eBPF
> core to support this new need. Both instruction set and verifier
> stay generic and cover tracing and this new socket use at once.
> I do realize that I'm talking here about future eBPF verifier that
> is not in tree yet and eBPF is not exposed to user space, but
> I think we should consider longer term perspective.
> If I'm forced to pick between yours or Chema's classic extensions,
> I would pick Chema's because it's lesser evil :)
>
>>> I think exposing eBPF to user space is not a matter of 'if' but 'when'.
>>>
>>> I'm not sure how pressing it is now to add another extension to classic,
>>> when the goal of this patch set fits into eBPF model just fine.
>>> yes, eBPF verifier is not in tree yet and it will obviously take longer to
>>> review than Chema's or Daniel's set.
>>>
>>> When eBPF is exposed to user space the inner header access can
>>> be done in two ways without changing eBPF instruction set or eBPF
>>> verifier.
>>>
>>> eBPF approach #1:
>>> -- code packet parser in restricted C
>>> Pros:
>>> skb_flow_dissect() stays hidden in kernel.
>>> No additional uapi headache which exist if we start reusing
>>> in-kernel skb_flow_dissect() either via classic or eBPF.
>>> Such eBPF program will be just as fast as in-kernel skb_flow_dissect()
>>> (I provided performance numbers before)
>>> Cons:
>>> in-kernel skb_flow_dissect() is not reused, but mostly copy-pasted to
>>> userspace.
>>>
>>> eBPF approach #2:
>>> -- reuse in-kernel skb_flow_dissect() via bpf_call insn from eBPF program
>>> Pros:
>>> eBPF program becomes much shorter and can be done in asm like:
>>> bpf_mov r2, fp
>>> bpf_sub r2, 64
>>> bpf_call bpf_skb_flow_dissect  // call in-kernel helper function from
>>> eBPF program
>>> bpf_ldx r1, [fp - 64]  // access flow_keys data
>>> bpf_ldx r2, [fp - 60]
>>>
>>> Cons:
>>> bpf_skb_flow_dissect() (wrapper on top of skb_flow_dissect()) becomes
>>> uapi visible helper function
>>>
>>> imo both eBPF approaches are cleaner than extending classic.
>>>
>>
Daniel Borkmann June 4, 2014, 8:51 a.m. UTC | #8
On 06/03/2014 11:12 PM, Chema Gonzalez wrote:
...
> Your approach needs it too. Citing from your pseudo-code:
>
>> ld #5     <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
>> ld #keys  <-- triggers the extension to fill the M[] slots
>> ld M[0]   <-- loads nhoff from M[0] into accu
>
> How does the "ld M[0]" know that the actual flow dissector has already
> been called? What if the insn just before the "ld #5" was "jmp +2" ?
> In that case, the "ld #keys" would have never been called.

But that case would be no different from doing something like ...

[...]
   jmp foo
   ldi #42
   st M[0]
foo:
   ld M[0]
[...]

... and would then not pass the checker in check_load_and_stores(),
which, as others have already stated, would need to be extended,
of course. It's one possible approach.

>> Anyway as I said before I'm not excited about either.
>> I don't think we should be adding classic BPF extensions any more.
>> The long term headache of supporting classic BPF extensions
>> outweighs the short term benefits.
...
> I see a couple of issues with (effectively) freezing classic BPF
> development while waiting for direct eBPF access to happen. The first
> one is that the kernel has to accept it. I can see many questions
> about this, especially security and usability (I'll send an email
> about the "split BPF out of core later"). Now, the main issue is
> whether/when the tools will support it. IMO, this is useful iff I can
> quickly write/reuse filters and run tcpdump filters based on them. I'm
> trying to get upstream libpcap to accept support for raw (classic) BPF
> filters, and it's taking a long time. I can imagine how they may be
> less receptive about supporting a Linux-only eBPF mechanism. Tools do
> matter.

Grepping through libpcap code, which tries to be platform independent,
it seems that after all these years, the only extensions their code
shows support for are SKF_AD_PKTTYPE and SKF_AD_PROTOCOL. Perhaps they
just don't care, perhaps they do, who knows, but it looks to me a bit
that they are reluctant to these improvements, maybe for one reason
that other OSes don't support it. That was also one of the reasons that
led me to start writing bpf_asm (net/tools/) for having a small DSL
for more easily trying out BPF code while having _full_ control over it.

Maybe someone should start a binary-compatible Linux-only version of
libpcap, where tcpdump will transparently make use of these low level
improvements eventually. </rant> ;)

Thanks,

Daniel
David Miller June 5, 2014, 6:55 a.m. UTC | #9
FWIW, I'm not applying this series.

There is no real consensus and a lot more discussion and time is
needed before anything happens here.
Chema Gonzalez June 20, 2014, 9:56 p.m. UTC | #10
I'll try to revive the discussion for this patch, in case I can
convince you about its implementation. I rebased it to the latest
HEAD, and I'm ready to re-submit.

On Wed, Jun 4, 2014 at 1:51 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 06/03/2014 11:12 PM, Chema Gonzalez wrote:
> ...
>
>> Your approach needs it too. Citing from your pseudo-code:
>>
>>> ld #5     <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
>>> ld #keys  <-- triggers the extension to fill the M[] slots
>>> ld M[0]   <-- loads nhoff from M[0] into accu
>>
>>
>> How does the "ld M[0]" know that the actual flow dissector has already
>> been called? What if the insn just before the "ld #5" was "jmp +2" ?
>> In that case, the "ld #keys" would have never been called.
>
>
> But that case would be no different from doing something like ...
>
> [...]
>   jmp foo
>   ldi #42
>   st M[0]
> foo:
>   ld M[0]
> [...]
>
> ... and would then not pass the checker in check_load_and_stores(),
> which, as others have already stated, would need to be extended,
> of course. It's one possible approach.

As Alexei/you have noted, this is moving the complexity to
check_load_and_stores(). It's more complicated than the current
ensure-there's-a-store-preceding-each-load-for-each-M[]-position
check. In particular, your approach requires 2 insns before a memory
position can be considered filled, namely the "ld #5" and the "ld
#keys". That means adding inter-instruction state to the FSM in
check_load_and_stores().

A second issue is that you need to ensure that M[0:5] does not get
polluted between the moment you call "ld #5; ld #keys" and the moment
the code uses the flow-dissector values. This is more complex than
just ensuring the code doesn't access uninit'ed M[].

There are also some concerns about effectively adding a prologue. We
already have one:

$ cat net/core/filter.c
...
int sk_convert_filter(struct sock_filter *prog, int len,
          struct sock_filter_int *new_prog, int *new_len)
{
  ...
  if (new_insn)
    *new_insn = BPF_MOV64_REG(BPF_REG_CTX, BPF_REG_ARG1);
  new_insn++;

  for (i = 0; i < len; fp++, i++) {
    ...

The current prologue is embedded in sk_convert_filter(). The patch
just moves it to its own function, and (potentially) adds another
instruction (which zeroes flow_inited iff there's any call to the flow
dissector).

>>> Anyway as I said before I'm not excited about either.
>>> I don't think we should be adding classic BPF extensions any more.
>>> The long term headache of supporting classic BPF extensions
>>> outweighs the short term benefits.
>> I see a couple of issues with (effectively) freezing classic BPF
>> development while waiting for direct eBPF access to happen. The first
>> one is that the kernel has to accept it. I can see many questions
>> about this, especially security and usability (I'll send an email
>> about the "split BPF out of core later"). Now, the main issue is
>> whether/when the tools will support it. IMO, this is useful iff I can
>> quickly write/reuse filters and run tcpdump filters based on them. I'm
>> trying to get upstream libpcap to accept support for raw (classic) BPF
>> filters, and it's taking a long time. I can imagine how they may be
>> less receptive about supporting a Linux-only eBPF mechanism. Tools do
>> matter.
This is a high-level decision, more than a technical one. Do we want
to freeze classic BPF development in Linux, even before we have a
complete eBPF replacement and with zero eBPF tool (libpcap) support?

> Grepping through libpcap code, which tries to be platform independent,
> it seems that after all these years, the only extensions their code
> shows support for are SKF_AD_PKTTYPE and SKF_AD_PROTOCOL. Perhaps they
Actually they recently added MOD/XOR support. Woo-hoo!

> just don't care, perhaps they do, who knows, but it looks to me a bit
> that they are reluctant to these improvements, maybe for one reason
> that other OSes don't support it.
From the comments in the MOD/XOR patch, the latter seems to be the issue.

> That was also one of the reasons that
> led me to start writing bpf_asm (net/tools/) for having a small DSL
> for more easily trying out BPF code while having _full_ control over it.
>
> Maybe someone should start a binary-compatible Linux-only version of
> libpcap, where tcpdump will transparently make use of these low level
> improvements eventually. </rant> ;)
There's too much code dependent on libpcap to make a replacement possible.

Thanks,
-Chema
Daniel Borkmann June 24, 2014, 8:14 a.m. UTC | #11
On 06/20/2014 11:56 PM, Chema Gonzalez wrote:
...
>>>> Anyway as I said before I'm not excited about either.
>>>> I don't think we should be adding classic BPF extensions any more.
>>>> The long term headache of supporting classic BPF extensions
>>>> outweighs the short term benefits.
 >>>
>>> I see a couple of issues with (effectively) freezing classic BPF
>>> development while waiting for direct eBPF access to happen. The first
>>> one is that the kernel has to accept it. I can see many questions
>>> about this, especially security and usability (I'll send an email
>>> about the "split BPF out of core later"). Now, the main issue is
>>> whether/when the tools will support it. IMO, this is useful iff I can
>>> quickly write/reuse filters and run tcpdump filters based on them. I'm
>>> trying to get upstream libpcap to accept support for raw (classic) BPF
>>> filters, and it's taking a long time. I can imagine how they may be
>>> less receptive about supporting a Linux-only eBPF mechanism. Tools do
>>> matter.
 >
> This is a high-level decision, more than a technical one. Do we want
> to freeze classic BPF development in Linux, even before we have a
> complete eBPF replacement and with zero eBPF tool (libpcap) support?

In my opinion, I don't think we strictly have to hard-freeze it. The
only concern I see is that conceptually hooking into the flow_dissector
to read out all keys for further processing on top of them 1) sort
of breaks/bypasses the concept of BPF (as it's actually the task of
BPF itself for doing this), 2) effectively freezes any changes to the
flow_dissector as BPF applications making use of it now depend on the
provided offsets for doing further processing on top of them, 3) it
can already be resolved by (re-)writing the kernel's flow dissector
in C-like syntax in user space iff eBPF can be loaded from there with
similar performance. So shouldn't we rather work towards that as a
more generic approach/goal in the mid term and w/o having to maintain
a very short term intermediate solution that we need to special case
along the code and have to carry around forever ...

>> Grepping through libpcap code, which tries to be platform independent,
>> it seems that after all these years, the only extensions their code
>> shows support for are SKF_AD_PKTTYPE and SKF_AD_PROTOCOL. Perhaps they
 >
> Actually they recently added MOD/XOR support. Woo-hoo!

Great to hear, still quite some things missing, unfortunately. :/

>> just don't care, perhaps they do, who knows, but it looks to me a bit
>> that they are reluctant to these improvements, maybe for one reason
>> that other OSes don't support it.
 >
>  From the comments in the MOD/XOR patch, the latter seems to be the issue.

Yep, that's the pain you need to live with when trying to be multi-OS
capable. I assume that in its very origin, the [libpcap] compiler was
probably not designed for handling such differences in various
operating systems (likely even ran in user space from libpcap directly).

>> That was also one of the reasons that
>> led me to start writing bpf_asm (net/tools/) for having a small DSL
>> for more easily trying out BPF code while having _full_ control over it.
>>
>> Maybe someone should start a binary-compatible Linux-only version of
>> libpcap, where tcpdump will transparently make use of these low level
>> improvements eventually. </rant> ;)
 >
> There's too much code dependent on libpcap to make a replacement possible.

Well, I wrote binary-compatible, so applications on top of it won't
care much if it could be used as a drop-in replacement. That would perhaps
also allow for fanout and other features to be used ...
Chema Gonzalez June 25, 2014, 10 p.m. UTC | #12
On Tue, Jun 24, 2014 at 1:14 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> This is a high-level decision, more than a technical one. Do we want
>> to freeze classic BPF development in linux, even before we have a
>> complete eBPF replacement, and zero eBPF tool (libpcap) support?
>
>
> In my opinion, I don't think we strictly have to hard-freeze it. The
> only concern I see is that conceptually hooking into the flow_dissector
> to read out all keys for further processing on top of them 1) sort
> of breaks/bypasses the concept of BPF (as it's actually the task of
> BPF itself for doing this),
I don't think we want to do flow dissection using BPF insns. It's not
easy to write BPF insns, and we already have kernel code that does
that. IMO that's what eBPF calls/BPF ancillary loads are for (e.g.
vlan access).

> 2) effectively freezes any changes to the
> flow_dissector as BPF applications making use of it now depend on the
> provided offsets for doing further processing on top of them, 3) it
Remember that my approach does not have (user-visible) offsets. It
uses the eBPF stack to dump the output (struct flow_keys) of the flow
dissector (skb_flow_dissect()). The only dependencies we're adding is
that, once we provide a BPF ancillary load to access e.g. thoff, we
have to keep providing it.

> can already be resolved by (re-)writing the kernel's flow dissector
> in C-like syntax in user space iff eBPF can be loaded from there with
> similar performance. So shouldn't we rather work towards that as a
> more generic approach/goal in the mid term and w/o having to maintain
> a very short term intermediate solution that we need to special case
> along the code and have to carry around forever ...
Once (if) we reach the point where we can do eBPF filters in "C-like
syntax," I'd agree with you that it would be nice to be able to reuse
the same function inside the kernel and as an eBPF library. The same
probably applies to other network functions. Now, I'm not sure what's
the model to reuse: Are we planning to re-write (maybe "re-write" is
too strong, as we will probably only need some minor changes) some of
the kernel functions into this "C--" language so that eBPF can use
them? Do other people agree with this vision?

There's still the problem of whether we want to obsolete classic BPF
in the kernel before the tools (libpcap mainly) accept eBPF. This can
take a lot.

Finally, what's the user's CLI interface you have in mind? Right now,
tcpdump expressions are very handy: I know I can pass "ip[2:2] ==
1500" or "(tcp[13] & 0x03)" to any libpcap-based application. This is
very convenient to log into a machine and quickly run tcpdump to get the
packets I'm interested in. What would be the model for using C-- eBPF
filters in the same manner?
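For comparison, the classic BPF behind "ip[2:2] == 1500" is small enough to write by hand in the bpf_asm DSL (a sketch assuming an Ethernet link layer with no VLAN tag, so the IP header starts at offset 14):

```
ldh [12]          ; EtherType
jne #0x800, drop  ; not IPv4
ldh [16]          ; ip[2:2] = IP total length (14-byte Ethernet header + 2)
jne #1500, drop
ret #-1           ; match: accept the whole packet
drop: ret #0      ; no match
```

The convenience of the tcpdump expression is precisely that the user never has to spell out these offsets.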

Thanks again,
-Chema
Daniel Borkmann June 27, 2014, 10:19 a.m. UTC | #13
On 06/26/2014 12:00 AM, Chema Gonzalez wrote:
...
> There's still the problem of whether we want to obsolete classic BPF
> in the kernel before the tools (libpcap mainly) accept eBPF. This can
> take a lot.
>
> Finally, what's the user's CLI interface you have in mind? Right now,
> tcpdump expressions are very handy: I know I can pass "ip[2:2] ==
> 1500" or "(tcp[13] & 0x03)" to any libpcap-based application. This is
> very convenient to log into a machine and quickly run tcpdump to get the
> packets I'm interested in. What would be the model for using C-- eBPF
> filters in the same manner?

Yes, imho, it's a valid question to ask. I think there are a couple
of possibilities for libpcap/tcpdump from a user point of view (note,
I don't strictly think it's the _only_ main user though): 1) iff a
llvm and/or gcc backend gets merged from the compiler side, one could
add a cli interface to run the generated opcodes from a file for
advanced filters while perhaps classic BPF continues to be supported
via its high-level filter expressions; 2) there could be a Linux-only
compiler in libpcap that translates and makes use of full eBPF (though
significantly more effort to implement); 3) libpcap continues to use
classic BPF as it's currently doing.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7a9beeb..0214b5a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3065,7 +3065,8 @@  bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
 
 int skb_checksum_setup(struct sk_buff *skb, bool recalculate);
 
-u32 __skb_get_poff(const struct sk_buff *skb);
+u32 __skb_get_poff(const struct sk_buff *skb, struct flow_keys *flow,
+		u32 *flow_inited);
 
 /**
  * skb_head_is_locked - Determine if the skb->head is locked down
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index af677cb..0f8128f 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -446,6 +446,58 @@  static struct bpf_test tests[] = {
 		{ { 30, 0 }, { 100, 42 } },
 	},
 	{
+		"LD_PAYLOAD_OFF_STACK",
+		.u.insns = {
+			BPF_STMT(BPF_LD | BPF_IMM, 0x11111111),
+			BPF_STMT(BPF_ST, 0),
+			BPF_STMT(BPF_LD | BPF_IMM, 0x22222222),
+			BPF_STMT(BPF_ST, 1),
+			BPF_STMT(BPF_LD | BPF_IMM, 0xeeeeeeee),
+			BPF_STMT(BPF_ST, 14),
+			BPF_STMT(BPF_LD | BPF_IMM, 0xffffffff),
+			BPF_STMT(BPF_ST, 15),
+			BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+			    SKF_AD_OFF + SKF_AD_PAY_OFFSET),
+			BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 54, 1, 0),
+			BPF_STMT(BPF_RET | BPF_K, 0),
+			BPF_STMT(BPF_LD | BPF_MEM, 0),
+			BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x11111111, 1, 0),
+			BPF_STMT(BPF_RET | BPF_K, 0),
+			BPF_STMT(BPF_LD | BPF_MEM, 1),
+			BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x22222222, 1, 0),
+			BPF_STMT(BPF_RET | BPF_K, 0),
+			BPF_STMT(BPF_LD | BPF_MEM, 14),
+			BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0xeeeeeeee, 1, 0),
+			BPF_STMT(BPF_RET | BPF_K, 0),
+			BPF_STMT(BPF_LD | BPF_MEM, 15),
+			BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0xffffffff, 1, 0),
+			BPF_STMT(BPF_RET | BPF_K, 0),
+			BPF_STMT(BPF_RET | BPF_K, 1)
+		},
+		CLASSIC,
+		/* 01:02:03:04:05:06 < 07:08:09:0a:0b:0c, ethertype IPv4(0x0800)
+		 * length 94: 10.1.1.1.10000 > 10.1.1.2.22: Flags [P.],
+		 * seq 1:21, ack 2, win 14400, length 20
+		 */
+		{ 0x01, 0x02, 0x03, 0x04, 0x05, 0x06,
+		  0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c,
+		  0x08, 0x00,
+		  /* ip header */
+		  0x45, 0x10, 0x00, 0x5e,
+		  0x75, 0xb5, 0x40, 0x00,
+		  0x40, 0x06, 0xad, 0x2e,
+		  0x0a, 0x01, 0x01, 0x01, /* ip src */
+		  0x0a, 0x01, 0x01, 0x02, /* ip dst */
+		  /* tcp header */
+		  0x27, 0x10, 0x00, 0x16, /* tcp src/dst port */
+		  0x00, 0x00, 0x00, 0x01, /* tcp seq# */
+		  0x00, 0x00, 0x00, 0x02, /* tcp ack# */
+			0x50, 0x00, 0x38, 0x40,
+			0x9a, 0x42, 0x00, 0x00,
+		},
+		{ { 100, 1 } },
+	},
+	{
 		"LD_ANC_XOR",
 		.u.insns = {
 			BPF_STMT(BPF_LD | BPF_IMM, 10),
diff --git a/net/core/filter.c b/net/core/filter.c
index 2c2d35d..0c55252 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -66,6 +66,24 @@ 
 #define CTX	regs[BPF_REG_CTX]
 #define K	insn->imm
 
+/* classic BPF stack layout:
+ *
+ * This is the layout for the stack for eBPF filters generated from
+ * classic BPF filters.
+ *
+ * Top (BPF_MEMWORDS * 4) bytes are used to represent classic BPF
+ * mem[0-15] slots.
+ *
+ * Flow dissector users (poff so far) use the space just below mem[]
+ * to share the flow_keys obtained from dissecting the flow, and a
+ * bool stating whether such field has been inited.
+ */
+struct classic_bpf_stack_layout {
+	u32 flow_inited;
+	struct flow_keys flow;
+	u32 mem[BPF_MEMWORDS];
+};
+
 /* No hurry in this branch
  *
  * Exported for the bpf jit load helper.
@@ -596,9 +614,13 @@  static unsigned int pkt_type_offset(void)
 	return -1;
 }
 
-static u64 __skb_get_pay_offset(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
+static u64 __skb_get_pay_offset(u64 ctx, u64 a, u64 x, u64 fp, u64 r5)
 {
-	return __skb_get_poff((struct sk_buff *)(unsigned long) ctx);
+	struct classic_bpf_stack_layout *stack_layout =
+	    (void *) fp - sizeof(struct classic_bpf_stack_layout);
+	return __skb_get_poff((struct sk_buff *)(unsigned long) ctx,
+			      &stack_layout->flow,
+			      &stack_layout->flow_inited);
 }
 
 static u64 __skb_get_nlattr(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
@@ -779,7 +801,11 @@  static bool convert_bpf_extensions(struct sock_filter *fp,
 		*insn = BPF_ALU64_REG(BPF_MOV, BPF_REG_ARG3, BPF_REG_X);
 		insn++;
 
-		/* Emit call(ctx, arg2=A, arg3=X) */
+		/* arg4 = FP */
+		*insn = BPF_ALU64_REG(BPF_MOV, BPF_REG_ARG4, BPF_REG_FP);
+		insn++;
+
+		/* Emit call(ctx, arg2=A, arg3=X, arg4=FP) */
 		insn->code = BPF_JMP | BPF_CALL;
 		switch (fp->k) {
 		case SKF_AD_OFF + SKF_AD_PAY_OFFSET:
@@ -818,6 +844,45 @@  static bool convert_bpf_extensions(struct sock_filter *fp,
 	return true;
 }
 
+void __sk_convert_filter_prologue(struct sock_filter *fp, int len,
+		struct sock_filter_int **new_insn)
+{
+	bool use_flow_dissector = false;
+	bool do_write_code = false;
+	int i;
+
+	/* check if there are any insn's that use the flow dissector */
+	for (i = 0; i < len; fp++, i++) {
+		if (BPF_CLASS(fp->code) == BPF_LD &&
+		    BPF_MODE(fp->code) == BPF_ABS &&
+		    fp->k == SKF_AD_OFF + SKF_AD_PAY_OFFSET) {
+			use_flow_dissector = true;
+			break;
+		}
+	}
+
+	do_write_code = (*new_insn != NULL);
+
+	/* first init the stack */
+	if (use_flow_dissector) {
+		/* stack_layout->flow_inited = 0; */
+		if (do_write_code) {
+			(*new_insn)->code = BPF_ST | BPF_MEM | BPF_W;
+			(*new_insn)->a_reg = BPF_REG_FP;
+			(*new_insn)->x_reg = 0;
+			(*new_insn)->off = (s16) (
+			-sizeof(struct classic_bpf_stack_layout) +
+			offsetof(struct classic_bpf_stack_layout, flow_inited));
+			(*new_insn)->imm = 0;
+		}
+		(*new_insn)++;
+	}
+	/* ctx = arg1 */
+	if (do_write_code)
+		**new_insn = BPF_ALU64_REG(BPF_MOV, BPF_REG_CTX, BPF_REG_ARG1);
+	(*new_insn)++;
+}
+
 /**
  *	sk_convert_filter - convert filter program
  *	@prog: the user passed filter program
@@ -867,10 +932,7 @@  do_pass:
 	new_insn = new_prog;
 	fp = prog;
 
-	if (new_insn) {
-		*new_insn = BPF_ALU64_REG(BPF_MOV, BPF_REG_CTX, BPF_REG_ARG1);
-	}
-	new_insn++;
+	__sk_convert_filter_prologue(fp, len, &new_insn);
 
 	for (i = 0; i < len; fp++, i++) {
 		struct sock_filter_int tmp_insns[6] = { };
@@ -1041,7 +1103,10 @@  do_pass:
 			insn->a_reg = BPF_REG_FP;
 			insn->x_reg = fp->code == BPF_ST ?
 				      BPF_REG_A : BPF_REG_X;
-			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			insn->off =
+			    -sizeof(struct classic_bpf_stack_layout) +
+			    offsetof(struct classic_bpf_stack_layout,
+				     mem[fp->k]);
 			break;
 
 		/* Load from stack. */
@@ -1051,7 +1116,10 @@  do_pass:
 			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ?
 				      BPF_REG_A : BPF_REG_X;
 			insn->x_reg = BPF_REG_FP;
-			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			insn->off =
+			    -sizeof(struct classic_bpf_stack_layout) +
+			    offsetof(struct classic_bpf_stack_layout,
+				     mem[fp->k]);
 			break;
 
 		/* A = K or X = K */
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 107ed12..bf3cb99 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -275,16 +275,20 @@  EXPORT_SYMBOL(__skb_tx_hash);
  * truncate packets without needing to push actual payload to the user
  * space and can analyze headers only, instead.
  */
-u32 __skb_get_poff(const struct sk_buff *skb)
+u32 __skb_get_poff(const struct sk_buff *skb, struct flow_keys *flow,
+		u32 *flow_inited)
 {
-	struct flow_keys keys;
 	u32 poff = 0;
 
-	if (!skb_flow_dissect(skb, &keys))
-		return 0;
+	/* check whether the flow dissector has already been run */
+	if (!*flow_inited) {
+		if (!skb_flow_dissect(skb, flow))
+			return 0;
+		*flow_inited = 1;
+	}
 
-	poff += keys.thoff;
-	switch (keys.ip_proto) {
+	poff += flow->thoff;
+	switch (flow->ip_proto) {
 	case IPPROTO_TCP: {
 		const struct tcphdr *tcph;
 		struct tcphdr _tcph;