
[v7,net-next,1/3] filter: add Extended BPF interpreter and converter

Message ID 1394320530-3508-2-git-send-email-ast@plumgrid.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Alexei Starovoitov March 8, 2014, 11:15 p.m. UTC
Extended BPF extends old BPF in the following ways:
- from 2 to 10 registers
  Original BPF has two registers (A and X) and hidden frame pointer.
  Extended BPF has ten registers and read-only frame pointer.
- from 32-bit registers to 64-bit registers
  semantics of old 32-bit ALU operations are preserved via 32-bit
  subregisters
- if (cond) jump_true; else jump_false;
  old BPF insns are replaced with:
  if (cond) jump_true; /* else fallthrough */
- adds signed > and >= insns
- 16 4-byte stack slots for register spill/fill are replaced with
  up to 512 bytes of multi-use stack space
- introduces bpf_call insn and register passing convention for zero
  overhead calls from/to other kernel functions (not part of this patch)
- adds arithmetic right shift insn
- adds swab32/swab64 insns
- adds atomic_add insn
- old tax/txa insns are replaced with 'mov dst,src' insn

Extended BPF is designed to be JITed with a one-to-one mapping, which
allows GCC/LLVM backends to generate optimized BPF code that performs
almost as fast as natively compiled code.

sk_convert_filter() remaps old-style insns into extended ones:
'sock_filter' instructions are remapped on the fly to
'sock_filter_ext' extended instructions when
sysctl net.core.bpf_ext_enable=1.

An old filter comes in through sk_attach_filter() or sk_unattached_filter_create():
 if (bpf_ext_enable) {
    convert to new
    sk_chk_filter() - check old bpf
    use sk_run_filter_ext() - new interpreter
 } else {
    sk_chk_filter() - check old bpf
    if (bpf_jit_enable)
        use old jit
    else
        use sk_run_filter() - old interpreter
 }
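
As a rough C sketch of the flow above (not the patch code itself:
sk_convert_filter()'s exact signature, the insns_ext[] placement inside
sk_filter and the bpf_func glue for the extended interpreter are
assumptions here):

/* simplified sketch of the attach path, for illustration only */
static int prepare_filter_sketch(struct sk_filter *fp, struct sock_fprog *fprog)
{
	int err;

	/* the classic program is always validated by the old checker */
	err = sk_chk_filter(fprog->filter, fprog->len);
	if (err)
		return err;

	if (bpf_ext_enable) {
		/* remap classic insns into insns_ext[] and run them with
		 * the new interpreter (signature of the converter assumed)
		 */
		err = sk_convert_filter(fprog->filter, fprog->len,
					fp->insns_ext, &fp->len);
		if (err)
			return err;
		fp->bpf_func = sk_run_filter_ext_wrapper; /* assumed glue */
	} else {
		/* old path: interpreter, replaced by the old JIT when
		 * bpf_jit_enable=1
		 */
		fp->bpf_func = sk_run_filter;
		bpf_jit_compile(fp);
	}
	return 0;
}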

The sk_run_filter_ext() interpreter is noticeably faster
than sk_run_filter() for two reasons:

1. fall-through jumps
  Old BPF jump instructions are forced to take either the 'true' or the
  'false' branch, which causes branch-miss penalties.
  Extended BPF jump instructions have one branch and a fall-through,
  which fits CPU branch-predictor logic better.
  'perf stat' shows a drastic difference in branch-misses.

2. jump-threaded implementation of the interpreter vs. a switch statement
  Instead of a single tablejump at the top of a 'switch' statement, GCC
  generates multiple tablejump instructions, which helps the CPU branch
  predictor (see the stand-alone sketch below).
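
A tiny stand-alone illustration of the second point (plain user-space C,
relying on GCC's computed-goto extension; this only demonstrates the
dispatch technique, it is not the kernel interpreter):

#include <stdio.h>

enum { OP_ADD, OP_SUB, OP_HALT };

struct insn { int op; int imm; };

/* Jump-threaded dispatch: every handler ends with its own indirect jump,
 * so the CPU sees many independent indirect branches instead of one
 * shared tablejump at the top of a switch statement.
 */
static int run(const struct insn *pc)
{
	static void *jt[] = { &&do_add, &&do_sub, &&do_halt };
	int acc = 0;

	goto *jt[pc->op];
do_add:
	acc += pc->imm;
	pc++;
	goto *jt[pc->op];
do_sub:
	acc -= pc->imm;
	pc++;
	goto *jt[pc->op];
do_halt:
	return acc;
}

int main(void)
{
	struct insn prog[] = {
		{ OP_ADD, 5 }, { OP_SUB, 2 }, { OP_ADD, 7 }, { OP_HALT, 0 },
	};

	printf("%d\n", run(prog));	/* 5 - 2 + 7 = 10 */
	return 0;
}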

Performance of two BPF filters generated by libpcap was measured
on x86_64, i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs have similar performance differences.

Raw performance data from the BPF micro-benchmark:
SK_RUN_FILTER on the same SKB (cache-hit) or on 10k SKBs (cache-miss),
time in nsec per call, smaller is better
--x86_64--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF         90        101         192        202
ext BPF         31         71          47         97
old BPF jit     12         34          17         44
ext BPF jit    TBD

--i386--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF        107        136         227        252
ext BPF         40        119          69        172

--arm32--
             fprog #1   fprog #1    fprog #2   fprog #2
             cache-hit  cache-miss  cache-hit  cache-miss
old BPF        202        300         475        540
ext BPF        180        270         330        470
old BPF jit     26        182          37        202
ext BPF jit    TBD

Tested with trinify BPF fuzzer

Future work:

0. add bpf/ebpf testsuite to tools/testing/selftests/net/bpf

1. add extended BPF JIT for x86_64

2. add in-band old/new demux and an extended BPF verifier, so that new programs
   can be loaded through the old sk_attach_filter() and sk_unattached_filter_create()
   interfaces

3. systemtap-like tracing filters with extended BPF

4. OVS with extended BPF

5. nftables with extended BPF

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
---
I think typecasting fixes are minor, so I kept Daniel's and Hagen's rev-by/ack.

 arch/arm/net/bpf_jit_32.c       |    3 +-
 arch/powerpc/net/bpf_jit_comp.c |    3 +-
 arch/s390/net/bpf_jit_comp.c    |    3 +-
 arch/sparc/net/bpf_jit_comp.c   |    3 +-
 arch/x86/net/bpf_jit_comp.c     |    3 +-
 include/linux/filter.h          |   16 +-
 include/linux/netdevice.h       |    1 +
 include/uapi/linux/filter.h     |   33 +-
 net/core/filter.c               |  801 ++++++++++++++++++++++++++++++++++++++-
 net/core/sysctl_net_core.c      |    7 +
 10 files changed, 840 insertions(+), 33 deletions(-)

Comments

Daniel Borkmann March 9, 2014, 12:29 p.m. UTC | #1
On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
> Extended BPF extends old BPF in the following ways:
> [...]

One more question or possible issue that came to my mind: when someone
attaches a socket filter from user space and bpf_ext_enable=1, the old
filter will transparently be converted to the new representation. If user
space (e.g. through checkpoint/restore) then issues sk_get_filter(), we
call sk_decode_filter() on sk->sk_filter and therefore try to decode what
we stored in insns_ext[] under the assumption that we still have the old
code. Would that actually crash (or leak memory, or just return garbage),
as we access the decodes[] array with filt->code? Would be great if you
could double-check.

The assumption with sk_get_filter() is that it returns the same filter
that was previously attached, so that it can be re-attached again at
a later point in time.

Cheers,

Daniel
Eric Dumazet March 9, 2014, 2:45 p.m. UTC | #2
On Sat, 2014-03-08 at 15:15 -0800, Alexei Starovoitov wrote:

> +/**
> + *	sk_run_filter_ext - run an extended filter
> + *	@ctx: buffer to run the filter on
> + *	@insn: filter to apply
> + *
> + * Decode and execute extended BPF instructions.
> + * @ctx is the data we are operating on.
> + * @filter is the array of filter instructions.
> + */
> +notrace u32 sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn)
> +{
> +	u64 stack[64];
> +	u64 regs[16];
> +	void *ptr;
> +	u64 tmp;
> +	int off;

Why is this 'notrace' ?

80 u64 on the stack, that is 640 bytes to run a filter ????


Eric Dumazet March 9, 2014, 2:49 p.m. UTC | #3
On Sat, 2014-03-08 at 15:15 -0800, Alexei Starovoitov wrote:

> +			if (BPF_SRC(fp->code) == BPF_K &&
> +			    (int)fp->k < 0) {
> +				/* extended BPF immediates are signed,
> +				 * zero extend immediate into tmp register
> +				 * and use it in compare insn
> +				 */
> +				insn->code = BPF_ALU | BPF_MOV | BPF_K;
> +				insn->a_reg = 2;
> +				insn->imm = fp->k;
> +				insn++;
> +
> +				insn->a_reg = 6;
> +				insn->x_reg = 2;
> +				bpf_src = BPF_X;
> +			} else {
> +				insn->a_reg = 6;
> +				insn->x_reg = 7;
> +				insn->imm = fp->k;
> +				bpf_src = BPF_SRC(fp->code);
> +			}
> +			/* common case where 'jump_false' is next insn */
> +			if (fp->jf == 0) {
> +				insn->code = BPF_JMP | BPF_OP(fp->code) |
> +					bpf_src;
> +				tgt = i + fp->jt + 1;
> +				EMIT_JMP;
> +				break;
> +			}
> +			/* convert JEQ into JNE when 'jump_true' is next insn */
> +			if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
> +				insn->code = BPF_JMP | BPF_JNE | bpf_src;
> +				tgt = i + fp->jf + 1;
> +				EMIT_JMP;
> +				break;
> +			}
> +			/* other jumps are mapped into two insns: Jxx and JA */
> +			tgt = i + fp->jt + 1;
> +			insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
> +			EMIT_JMP;
> +
> +			insn++;
> +			insn->code = BPF_JMP | BPF_JA;
> +			tgt = i + fp->jf + 1;
> +			EMIT_JMP;
> +			break;
> +
> +		/* ldxb 4*([14]&0xf) is remapped into 3 insns */
> +		case BPF_LDX | BPF_MSH | BPF_B:
> +			insn->code = BPF_LD | BPF_ABS | BPF_B;
> +			insn->a_reg = 7;
> +			insn->imm = fp->k;
> +
> +			insn++;
> +			insn->code = BPF_ALU | BPF_AND | BPF_K;
> +			insn->a_reg = 7;
> +			insn->imm = 0xf;
> +
> +			insn++;
> +			insn->code = BPF_ALU | BPF_LSH | BPF_K;
> +			insn->a_reg = 7;
> +			insn->imm = 2;
> +			break;
> +
> +		/* RET_K, RET_A are remapped into 2 insns */
> +		case BPF_RET | BPF_A:
> +		case BPF_RET | BPF_K:
> +			insn->code = BPF_ALU | BPF_MOV |
> +				(BPF_RVAL(fp->code) == BPF_K ? BPF_K : BPF_X);
> +			insn->a_reg = 0;
> +			insn->x_reg = 6;
> +			insn->imm = fp->k;
> +
> +			insn++;
> +			insn->code = BPF_RET | BPF_K;
> +			break;


What the hell is this?

All these magic values, like 2, 6, 7, 10.

I am afraid nobody will be able to read this but you.



Alexei Starovoitov March 9, 2014, 5:08 p.m. UTC | #4
On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@iogearbox.net> wrote:
> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>
>> Extended BPF extends old BPF in the following ways:
>> [...]
>
>
> One more question or possible issue that came through my mind: When
> someone attaches a socket filter from user space, and bpf_ext_enable=1
> then the old filter will transparently be converted to the new
> representation. If then user space (e.g. through checkpoint restore)
> will issue a sk_get_filter() and thus we're calling sk_decode_filter()
> on sk->sk_filter and, therefore, try to decode what we stored in
> insns_ext[] with the assumption we still have the old code. Would that
> actually crash (or leak memory, or just return garbage), as we access
> decodes[] array with filt->code? Would be great if you could double-check.

Ohh, yes, I missed that.
When bpf_ext_enable=1 I think it's cleaner to return the ebpf filter.
This way user space can see how the old bpf filter was converted.

Of course we can allocate extra memory and keep the original bpf code there
just to return it via sk_get_filter(), but that seems like overkill.

> The assumption with sk_get_filter() is that it returns the same filter
> that was previously attached, so that it can be re-attached again at
> a later point in time.

When bpf_ext_enable=1 and an old filter is loaded, sk_get_filter() returns the
new ebpf. This ebpf will be re-attachable, since there will be an in-band demux
for bpf/ebpf.

Thanks
Alexei
Alexei Starovoitov March 9, 2014, 5:38 p.m. UTC | #5
On Sun, Mar 9, 2014 at 7:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2014-03-08 at 15:15 -0800, Alexei Starovoitov wrote:
>
>> +/**
>> + *   sk_run_filter_ext - run an extended filter
>> + *   @ctx: buffer to run the filter on
>> + *   @insn: filter to apply
>> + *
>> + * Decode and execute extended BPF instructions.
>> + * @ctx is the data we are operating on.
>> + * @filter is the array of filter instructions.
>> + */
>> +notrace u32 sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn)
>> +{
>> +     u64 stack[64];
>> +     u64 regs[16];
>> +     void *ptr;
>> +     u64 tmp;
>> +     int off;

First of all, great that you finally reviewed it! Feedback is appreciated :)

> Why is this 'notrace' ?

To avoid the overhead of a dummy call.
JITed filters do not add this dummy call.
So 'notrace' on the interpreter brings it to parity with JITed filters.

> 80 u64 on the stack, that is 640 bytes to run a filter ????

Yes, that was described in the commit log and in Doc...filter.txt:
"
- 16 4-byte stack slots for register spill-fill replaced with
  up to 512 bytes of multi-use stack space
"

For the interpreter it is prohibitive to dynamically allocate stack space,
so it just grabs 64*8 bytes to run any program.
For the JIT it's going to be close to zero for the majority of filters, since
the generated program will allocate only as much as was allowed
by sk_chk_filter_ext(). Only the largest programs would need the full 512.
This much stack would be needed by programs that use
large key/value pairs in their ebpf tables.
So far I haven't seen a program that approaches this limit,
but 512 seems reasonable to me, since the kernel warns on
functions with > 1k of stack.

Btw, the current x86 JIT just does 'subq $96,%rsp'.
I think the ebpf JIT should use the minimum amount of stack, only the amount
that is needed.
Maybe I'm overthinking it and having 'subq $512, %rsp' for the JIT is also fine.
Let me know.

Thanks
Alexei
Alexei Starovoitov March 9, 2014, 6:02 p.m. UTC | #6
On Sun, Mar 9, 2014 at 7:49 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2014-03-08 at 15:15 -0800, Alexei Starovoitov wrote:
>
>> +                     if (BPF_SRC(fp->code) == BPF_K &&
>> +                         (int)fp->k < 0) {
>> +                             /* extended BPF immediates are signed,
>> +                              * zero extend immediate into tmp register
>> +                              * and use it in compare insn
>> +                              */
>> +                             insn->code = BPF_ALU | BPF_MOV | BPF_K;
>> +                             insn->a_reg = 2;
>> +                             insn->imm = fp->k;
>> +                             insn++;
>> +
>> +                             insn->a_reg = 6;
>> +                             insn->x_reg = 2;
>> +                             bpf_src = BPF_X;
>> +                     } else {
>> +                             insn->a_reg = 6;
>> +                             insn->x_reg = 7;
>> +                             insn->imm = fp->k;
>> +                             bpf_src = BPF_SRC(fp->code);
>> +                     }
>> +                     /* common case where 'jump_false' is next insn */
>> +                     if (fp->jf == 0) {
>> +                             insn->code = BPF_JMP | BPF_OP(fp->code) |
>> +                                     bpf_src;
>> +                             tgt = i + fp->jt + 1;
>> +                             EMIT_JMP;
>> +                             break;
>> +                     }
>> +                     /* convert JEQ into JNE when 'jump_true' is next insn */
>> +                     if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
>> +                             insn->code = BPF_JMP | BPF_JNE | bpf_src;
>> +                             tgt = i + fp->jf + 1;
>> +                             EMIT_JMP;
>> +                             break;
>> +                     }
>> +                     /* other jumps are mapped into two insns: Jxx and JA */
>> +                     tgt = i + fp->jt + 1;
>> +                     insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
>> +                     EMIT_JMP;
>> +
>> +                     insn++;
>> +                     insn->code = BPF_JMP | BPF_JA;
>> +                     tgt = i + fp->jf + 1;
>> +                     EMIT_JMP;
>> +                     break;
>> +
>> +             /* ldxb 4*([14]&0xf) is remapped into 3 insns */
>> +             case BPF_LDX | BPF_MSH | BPF_B:
>> +                     insn->code = BPF_LD | BPF_ABS | BPF_B;
>> +                     insn->a_reg = 7;
>> +                     insn->imm = fp->k;
>> +
>> +                     insn++;
>> +                     insn->code = BPF_ALU | BPF_AND | BPF_K;
>> +                     insn->a_reg = 7;
>> +                     insn->imm = 0xf;
>> +
>> +                     insn++;
>> +                     insn->code = BPF_ALU | BPF_LSH | BPF_K;
>> +                     insn->a_reg = 7;
>> +                     insn->imm = 2;
>> +                     break;
>> +
>> +             /* RET_K, RET_A are remapped into 2 insns */
>> +             case BPF_RET | BPF_A:
>> +             case BPF_RET | BPF_K:
>> +                     insn->code = BPF_ALU | BPF_MOV |
>> +                             (BPF_RVAL(fp->code) == BPF_K ? BPF_K : BPF_X);
>> +                     insn->a_reg = 0;
>> +                     insn->x_reg = 6;
>> +                     insn->imm = fp->k;
>> +
>> +                     insn++;
>> +                     insn->code = BPF_RET | BPF_K;
>> +                     break;
>
>
> What the hell is this ?
>
> All this magical values, like 2, 6, 7, 10.

They are register numbers, since they are assigned into 'a_reg' and 'x_reg',
which are described in uapi/filter.h:
        __u8    a_reg:4; /* dest register */
        __u8    x_reg:4; /* source register */
and in Doc...filter.txt.
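
(For orientation, a plausible full layout of 'sock_filter_ext', pieced
together from the fields referenced in this thread; the widths of 'off'
and 'imm' below are assumptions, not a quote of the patch:)

struct sock_filter_ext {
	__u8	code;		/* opcode */
	__u8	a_reg:4;	/* dest register */
	__u8	x_reg:4;	/* source register */
	__s16	off;		/* offset, e.g. jump target delta (width assumed) */
	__s32	imm;		/* signed immediate constant (width assumed) */
};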

In the V1 series I had a bunch of #defines like:
#define R1 1
#define R2 2
which seemed as silly as doing '#define one 1'.
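
(For readers of the thread, the magic numbers can be decoded from the
quoted converter; the naming below is purely illustrative and not part
of the patch:)

enum {
	R0  = 0,	/* return value: BPF_RET moves A or K here */
	R2  = 2,	/* temporary used to zero-extend a negative K for compares */
	R6  = 6,	/* holds classic BPF's accumulator A */
	R7  = 7,	/* holds classic BPF's index register X */
	R10 = 10,	/* presumably the read-only frame pointer from the commit message */
};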

I thought the sk_convert_filter() code was pretty clear in terms
of what it's doing, but I'm happy to add an extensive comment to
describe the mechanics.
Also, it felt like most of the time you and other folks wanted me to remove
comments, so I figured I'd add comments on demand.
Looks like this is such a case.

> I am afraid nobody will be able to read this but you.

That's certainly not the intent. I presented it at the last Plumbers conference
and would like to share more, since I think ebpf is a fundamental
breakthrough that can be used by many kernel subsystems.
This patch only covers old filters and seccomp.
We can do a lot more interesting things with tracing+ebpf and so on.

Regards,
Alexei
Eric Dumazet March 9, 2014, 6:11 p.m. UTC | #7
On Sun, 2014-03-09 at 10:38 -0700, Alexei Starovoitov wrote:
> On Sun, Mar 9, 2014 at 7:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Sat, 2014-03-08 at 15:15 -0800, Alexei Starovoitov wrote:
> >
> >> +/**
> >> + *   sk_run_filter_ext - run an extended filter
> >> + *   @ctx: buffer to run the filter on
> >> + *   @insn: filter to apply
> >> + *
> >> + * Decode and execute extended BPF instructions.
> >> + * @ctx is the data we are operating on.
> >> + * @filter is the array of filter instructions.
> >> + */
> >> +notrace u32 sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn)
> >> +{
> >> +     u64 stack[64];
> >> +     u64 regs[16];
> >> +     void *ptr;
> >> +     u64 tmp;
> >> +     int off;
> 
> First of all, great that you finally reviewed it! Feedback is appreciated :)
> 
> > Why is this 'notrace' ?
> 
> to avoid overhead of dummy call.
> JITed filters are not adding this dummy call.
> So 'notrace' on interpreter brings it to parity with JITed filters.

Then it's the wrong reason.

At the time we wrote the JIT, there was no support (yet) for profiling JITed
code from perf tools. I asked for help and nobody answered.

Maybe this has changed; if so, please someone add support.


> 
> > 80 u64 on the stack, that is 640 bytes to run a filter ????
> 
> yes. that was described in commit log and in Doc...filter.txt:
> "
> - 16 4-byte stack slots for register spill-fill replaced with
>   up to 512 bytes of multi-use stack space
> "
> 
> For interpreter it is prohibitive to dynamically allocate stack space
> that's why it just grabs 64*8 to run any program.

Where is the maximum capacity of this stack checked?



Alexei Starovoitov March 9, 2014, 6:57 p.m. UTC | #8
On Sun, Mar 9, 2014 at 11:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2014-03-09 at 10:38 -0700, Alexei Starovoitov wrote:
>> On Sun, Mar 9, 2014 at 7:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Sat, 2014-03-08 at 15:15 -0800, Alexei Starovoitov wrote:
>> >
>> >> +/**
>> >> + *   sk_run_filter_ext - run an extended filter
>> >> + *   @ctx: buffer to run the filter on
>> >> + *   @insn: filter to apply
>> >> + *
>> >> + * Decode and execute extended BPF instructions.
>> >> + * @ctx is the data we are operating on.
>> >> + * @filter is the array of filter instructions.
>> >> + */
>> >> +notrace u32 sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn)
>> >> +{
>> >> +     u64 stack[64];
>> >> +     u64 regs[16];
>> >> +     void *ptr;
>> >> +     u64 tmp;
>> >> +     int off;
>>
>> First of all, great that you finally reviewed it! Feedback is appreciated :)
>>
>> > Why is this 'notrace' ?
>>
>> to avoid overhead of dummy call.
>> JITed filters are not adding this dummy call.
>> So 'notrace' on interpreter brings it to parity with JITed filters.
>
> Then its a wrong reason.

fine. I'll remove it then.

> At the time we wrote JIT, there was (yet) no support for profiling JIT
> from perf tools. I asked for help and nobody answered.
>
> Maybe this has changed, if so, please someone add support.

I have a few ideas on how to get line-number info from C
into ebpf via LLVM, mainly to have nice messages when the
kernel rejects an ebpf filter that fails the safety check,
but that will come several commits from now.
This info can be used to beautify perf tools as well.

>>
>> > 80 u64 on the stack, that is 640 bytes to run a filter ????
>>
>> yes. that was described in commit log and in Doc...filter.txt:
>> "
>> - 16 4-byte stack slots for register spill-fill replaced with
>>   up to 512 bytes of multi-use stack space
>> "
>>
>> For interpreter it is prohibitive to dynamically allocate stack space
>> that's why it just grabs 64*8 to run any program.
>
> Where is checked the max capacity of this stack ?

In the sk_convert_filter() case the converted ebpf filter
is guaranteed to use at most 4*16 bytes of stack, since sk_chk_filter()
verified the old bpf.

In the case of ebpf, the check is done in several places.
In the V1 series there are check_stack_boundary() and
check_mem_access() functions, which are called from bpf_check()
(which will be renamed to sk_chk_filter_ext()).
Here is a macro and comment from the V1 series:
+/* JITed code allocates 512 bytes and uses the bottom 4 slots
+ * to save R6-R9
+ */
+#define MAX_BPF_STACK (512 - 4 * 8)

So sk_chk_filter_ext() enforces a 480-byte limit for the ebpf program
itself, and 512 bytes of real CPU stack are allocated by the JITed ebpf program.
As I was saying in the previous email, I was planning to make
this stack allocation less hard-coded for JITed programs.
Here is the relevant comment from the V1 series:
+ * Future improvements:
+ * stack size is hardcoded to 512 bytes maximum per program, relax it

In sk_run_filter_ext() I used "u64 stack[64];". "u64 stack[60];" would be
safe too, but I didn't want to go into an extensive explanation
of the 'magic' number 60 in the first patch, so I just rounded it up to 64.
Now that you understand it, I can make it stack[60] :)

Or, even better, I can reintroduce MAX_BPF_STACK into
this patch set and use it in sk_run_filter_ext()...

I will also add:
BUILD_BUG_ON(BPF_MEMWORDS * sizeof(u32) > MAX_BPF_STACK);
to make sure that old filters via sk_convert_filter() stay correct.
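
Putting those pieces together, the intent would look roughly like this
(a sketch only; everything except the names already quoted in this thread
is an assumption):

/* JITed code allocates 512 bytes and uses the bottom 4 slots to save R6-R9 */
#define MAX_BPF_STACK	(512 - 4 * 8)

u32 sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn)
{
	u64 stack[MAX_BPF_STACK / sizeof(u64)];	/* 480 bytes, i.e. 60 slots */
	u64 regs[16];
	...
}

/* and in sk_convert_filter(), so converted classic filters stay in bounds: */
BUILD_BUG_ON(BPF_MEMWORDS * sizeof(u32) > MAX_BPF_STACK);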

Thanks
Alexei
Eric Dumazet March 9, 2014, 7:11 p.m. UTC | #9
On Sun, 2014-03-09 at 11:57 -0700, Alexei Starovoitov wrote:

> In sk_run_filter_ext() I used "u64 stack[64];", but "u64 stack[60];" is
> safe too, but I didn't want to go into extensive explanation
> of 'magic' 60 number in the first patch, so I just rounded it to 64.
> Since now you understand, I can make it stack[60] now :)

My point was: you should not use 64 or 60 in the C code.

I should not have to ask you why it is safe.

It should be obvious just from reading the source code. And so far it is not.

That's why we use macros and comments at the macro definition.



Alexei Starovoitov March 9, 2014, 7:20 p.m. UTC | #10
On Sun, Mar 9, 2014 at 12:11 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2014-03-09 at 11:57 -0700, Alexei Starovoitov wrote:
>
>> In sk_run_filter_ext() I used "u64 stack[64];", but "u64 stack[60];" is
>> safe too, but I didn't want to go into extensive explanation
>> of 'magic' 60 number in the first patch, so I just rounded it to 64.
>> Since now you understand, I can make it stack[60] now :)
>
> My point was : You should not use 64 or 60 in the C code.
>
> I should not have to ask you why it is safe.
>
> It should be obvious just reading the source code. And so far it is not.
>
> Thats why we use macros and comments at the macro definition.

Agreed. Will fix these issues and the sk_get_filter() problem
caught by Daniel, and send V8.

Thanks!
Daniel Borkmann March 9, 2014, 10 p.m. UTC | #11
On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@iogearbox.net> wrote:
>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>
>>> Extended BPF extends old BPF in the following ways:
>>> [...]
>>
>>
>> One more question or possible issue that came through my mind: When
>> someone attaches a socket filter from user space, and bpf_ext_enable=1
>> then the old filter will transparently be converted to the new
>> representation. If then user space (e.g. through checkpoint restore)
>> will issue a sk_get_filter() and thus we're calling sk_decode_filter()
>> on sk->sk_filter and, therefore, try to decode what we stored in
>> insns_ext[] with the assumption we still have the old code. Would that
>> actually crash (or leak memory, or just return garbage), as we access
>> decodes[] array with filt->code? Would be great if you could double-check.
>
> ohh. yes. missed that.
> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
> This way the user space can see how old bpf filter was converted.
>
> Of course we can allocate extra memory and keep original bpf code there
> just to return it via sk_get_filter(), but that seems overkill.

Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
filter program (v2)").

I think the issue could be that applications may get migrated
from one machine to another whose kernel doesn't support ebpf yet;
then the filter could not be loaded this way, as sk_get_filter() is
expected to return what the user loaded. The trade-off, however, is
that the original BPF code needs to be stored as well. :(

>> The assumption with sk_get_filter() is that it returns the same filter
>> that was previously attached, so that it can be re-attached again at
>> a later point in time.
>
> when bpf_ext_enable=1, load old, sk_get_filter() returns new ebpf,
> this ebpf will be re-attachable, since there will be inband demux for bpf/ebpf.
>
> Thanks
> Alexei
>

Alexei Starovoitov March 10, 2014, 12:41 a.m. UTC | #12
On Sun, Mar 9, 2014 at 3:00 PM, Daniel Borkmann <borkmann@iogearbox.net> wrote:
> On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
>>
>> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@iogearbox.net>
>> wrote:
>>>
>>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>>
>>>>
>>>> Extended BPF extends old BPF in the following ways:
>>>> [...]
>>>
>>>
>>>
>>> One more question or possible issue that came through my mind: When
>>> someone attaches a socket filter from user space, and bpf_ext_enable=1
>>> then the old filter will transparently be converted to the new
>>> representation. If then user space (e.g. through checkpoint restore)
>>> will issue a sk_get_filter() and thus we're calling sk_decode_filter()
>>> on sk->sk_filter and, therefore, try to decode what we stored in
>>> insns_ext[] with the assumption we still have the old code. Would that
>>> actually crash (or leak memory, or just return garbage), as we access
>>> decodes[] array with filt->code? Would be great if you could
>>> double-check.
>>
>>
>> ohh. yes. missed that.
>> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
>> This way the user space can see how old bpf filter was converted.
>>
>> Of course we can allocate extra memory and keep original bpf code there
>> just to return it via sk_get_filter(), but that seems overkill.
>
>
> Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
> filter program (v2)").
>
> I think the issue can be that when applications could get migrated
> from one machine to another and their kernel won't support ebpf yet,
> then filter could not get loaded this way as it's expected to return
> what the user loaded. The trade-off, however, is that the original
> BPF code needs to be stored as well. :(

I see.
...even on one machine:
bpf_ext=1, attach, get_filter, bpf_ext=0, re-attach...
So we need to save the original.
At least we don't need to keep it for 'unattached' filters.
Should the memory come from the sk_optmem budget, or is plain kmalloc enough?
The latter would have a simpler implementation, but the former is probably cleaner?
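
(A possible shape for saving the original, purely as a sketch; the field
names below are assumptions, not the patch:)

struct sk_filter_sketch {
	unsigned int		orig_len;	/* number of classic insns attached */
	struct sock_filter	*orig_insns;	/* verbatim copy of what user space gave us */
	unsigned int		ext_len;	/* number of converted insns */
	struct sock_filter_ext	insns_ext[];	/* converted program that actually runs */
};

/* sk_get_filter() could then copy orig_insns back to user space as-is,
 * instead of running sk_decode_filter() over insns_ext[].
 */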

Thanks
Alexei
Pavel Emelyanov March 11, 2014, 5:40 p.m. UTC | #13
On 03/10/2014 02:00 AM, Daniel Borkmann wrote:
> On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
>> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>>
>>>> Extended BPF extends old BPF in the following ways:
>>>> [...]
>>>
>>>
>>> One more question or possible issue that came through my mind: When
>>> someone attaches a socket filter from user space, and bpf_ext_enable=1
>>> then the old filter will transparently be converted to the new
>>> representation. If then user space (e.g. through checkpoint restore)
>>> will issue a sk_get_filter() and thus we're calling sk_decode_filter()
>>> on sk->sk_filter and, therefore, try to decode what we stored in
>>> insns_ext[] with the assumption we still have the old code. Would that
>>> actually crash (or leak memory, or just return garbage), as we access
>>> decodes[] array with filt->code? Would be great if you could double-check.
>>
>> ohh. yes. missed that.
>> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
>> This way the user space can see how old bpf filter was converted.
>>
>> Of course we can allocate extra memory and keep original bpf code there
>> just to return it via sk_get_filter(), but that seems overkill.
> 
> Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
> filter program (v2)").
> 
> I think the issue can be that when applications could get migrated
> from one machine to another and their kernel won't support ebpf yet,
> then filter could not get loaded this way as it's expected to return
> what the user loaded. The trade-off, however, is that the original
> BPF code needs to be stored as well. :(

Sorry if I'm missing the point, but isn't the original filter kept on the socket?
sk_attach_filter() does so, then calls __sk_prepare_filter(), which
in turn calls bpf_jit_compile(), and the latter two keep the insns in place.

Thanks,
Pavel

Alexei Starovoitov March 11, 2014, 6:03 p.m. UTC | #14
On Tue, Mar 11, 2014 at 10:40 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> On 03/10/2014 02:00 AM, Daniel Borkmann wrote:
>> On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
>>> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>>>
>>>>> Extended BPF extends old BPF in the following ways:
>>>>> [...]
>>>>
>>>>
>>>> One more question or possible issue that came through my mind: When
>>>> someone attaches a socket filter from user space, and bpf_ext_enable=1
>>>> then the old filter will transparently be converted to the new
>>>> representation. If then user space (e.g. through checkpoint restore)
>>>> will issue a sk_get_filter() and thus we're calling sk_decode_filter()
>>>> on sk->sk_filter and, therefore, try to decode what we stored in
>>>> insns_ext[] with the assumption we still have the old code. Would that
>>>> actually crash (or leak memory, or just return garbage), as we access
>>>> decodes[] array with filt->code? Would be great if you could double-check.
>>>
>>> ohh. yes. missed that.
>>> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
>>> This way the user space can see how old bpf filter was converted.
>>>
>>> Of course we can allocate extra memory and keep original bpf code there
>>> just to return it via sk_get_filter(), but that seems overkill.
>>
>> Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
>> filter program (v2)").
>>
>> I think the issue is that applications could get migrated from one
>> machine to another whose kernel doesn't support ebpf yet; then the
>> filter could not be loaded this way, as sk_get_filter() is expected
>> to return what the user loaded. The trade-off, however, is that the
>> original BPF code needs to be stored as well. :(
>
> Sorry if I'm missing the point, but isn't the original filter kept on the
> socket? sk_attach_filter() copies it there, then calls __sk_prepare_filter(),
> which in turn calls bpf_jit_compile(), and the latter two keep the insns
> in place.

Yes, in the V8/V9 series the original filter is kept on the socket.
And your crtools/test/zdtm/live/static/socket_filter.c test passes.
Let me know if there are any other tests I can try.
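
For illustration, one way to keep both programs around (hypothetical names
and layout, not necessarily what the V8/V9 series does) is to let the
converted insns drive execution while the saved classic program is what
sk_get_filter() copies back:

struct classic_insn {			/* as struct sock_filter in uapi */
	unsigned short code;
	unsigned char jt, jf;
	unsigned int k;
};

struct ext_insn {			/* simplified sock_filter_ext */
	unsigned char code;
	unsigned char regs;		/* dst/src register nibbles */
	short off;
	int imm;
};

struct sk_filter_sketch {
	unsigned int len;		/* length of the converted program */
	unsigned int orig_len;		/* length of the user's classic program */
	struct classic_insn *orig_insns;/* what sk_get_filter() copies back */
	struct ext_insn insns_ext[];	/* what sk_run_filter_ext() executes */
};

/* sk_get_filter() must walk orig_insns only; feeding insns_ext[] to the
 * classic decoder would index its decodes[] table with extended opcodes
 * and return garbage.
 */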

Thanks
Alexei

Pavel Emelyanov March 11, 2014, 6:19 p.m. UTC | #15
On 03/11/2014 10:03 PM, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2014 at 10:40 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>> On 03/10/2014 02:00 AM, Daniel Borkmann wrote:
>>> On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
>>>> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>>> One more question or possible issue that came through my mind: When
>>>>> someone attaches a socket filter from user space, and bpf_ext_enable=1
>>>>> then the old filter will transparently be converted to the new
>>>>> representation. If then user space (e.g. through checkpoint restore)
>>>>> will issue a sk_get_filter() and thus we're calling sk_decode_filter()
>>>>> on sk->sk_filter and, therefore, try to decode what we stored in
>>>>> insns_ext[] with the assumption we still have the old code. Would that
>>>>> actually crash (or leak memory, or just return garbage), as we access
>>>>> decodes[] array with filt->code? Would be great if you could double-check.
>>>>
>>>> ohh. yes. missed that.
>>>> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
>>>> This way the user space can see how old bpf filter was converted.
>>>>
>>>> Of course we can allocate extra memory and keep original bpf code there
>>>> just to return it via sk_get_filter(), but that seems overkill.
>>>
>>> Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
>>> filter program (v2)").
>>>
>>> I think the issue is that applications could get migrated from one
>>> machine to another whose kernel doesn't support ebpf yet; then the
>>> filter could not be loaded this way, as sk_get_filter() is expected
>>> to return what the user loaded. The trade-off, however, is that the
>>> original BPF code needs to be stored as well. :(
>>
>> Sorry if I'm missing the point, but isn't the original filter kept on the
>> socket? sk_attach_filter() copies it there, then calls __sk_prepare_filter(),
>> which in turn calls bpf_jit_compile(), and the latter two keep the insns
>> in place.
> 
> Yes, in the V8/V9 series the original filter is kept on the socket.

Ah, I see :)

> And your crtools/test/zdtm/live/static/socket_filter.c test passes.
> Let me know if there are any other tests I can try.

No, that's the only test we need wrt sk-filter.
Thanks for keeping an eye on it :)
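
For context, the round trip such a test depends on looks roughly like the
sketch below (illustrative user-space code; the SO_GET_FILTER length
handling, with optlen counted in filter blocks, is an assumption here, and
error reporting is trimmed):

#include <stdlib.h>
#include <sys/socket.h>
#include <linux/filter.h>

#ifndef SO_GET_FILTER
#define SO_GET_FILTER SO_ATTACH_FILTER
#endif

/* Dump the classic BPF program from one socket and re-attach it to
 * another, the way a checkpoint/restore tool might.  Illustrative only.
 */
static int migrate_filter(int old_sk, int new_sk)
{
	struct sock_filter *insns;
	struct sock_fprog fprog;
	socklen_t len = 0;

	/* first call: learn how many filter blocks are attached */
	if (getsockopt(old_sk, SOL_SOCKET, SO_GET_FILTER, NULL, &len))
		return -1;

	insns = calloc(len, sizeof(*insns));
	if (!insns)
		return -1;

	/* second call: fetch the program itself */
	if (getsockopt(old_sk, SOL_SOCKET, SO_GET_FILTER, insns, &len)) {
		free(insns);
		return -1;
	}

	fprog.len = len;
	fprog.filter = insns;
	if (setsockopt(new_sk, SOL_SOCKET, SO_ATTACH_FILTER, &fprog,
		       sizeof(fprog))) {
		free(insns);
		return -1;
	}

	free(insns);
	return 0;
}

The final SO_ATTACH_FILTER step is what breaks if getsockopt() were to
return the converted ebpf program instead of the classic one the user
loaded.
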
diff mbox

Patch

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 271b5e971568..e72ff51f4561 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -925,6 +925,7 @@  void bpf_jit_compile(struct sk_filter *fp)
 		bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
 
 	fp->bpf_func = (void *)ctx.target;
+	fp->jited = 1;
 out:
 	kfree(ctx.offsets);
 	return;
@@ -932,7 +933,7 @@  out:
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-	if (fp->bpf_func != sk_run_filter)
+	if (fp->jited)
 		module_free(NULL, fp->bpf_func);
 	kfree(fp);
 }
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 555034f8505e..c0c5fcb0736a 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -689,6 +689,7 @@  void bpf_jit_compile(struct sk_filter *fp)
 		((u64 *)image)[0] = (u64)code_base;
 		((u64 *)image)[1] = local_paca->kernel_toc;
 		fp->bpf_func = (void *)image;
+		fp->jited = 1;
 	}
 out:
 	kfree(addrs);
@@ -697,7 +698,7 @@  out:
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-	if (fp->bpf_func != sk_run_filter)
+	if (fp->jited)
 		module_free(NULL, fp->bpf_func);
 	kfree(fp);
 }
diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index 708d60e40066..bf56fe51b5c1 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -877,6 +877,7 @@  void bpf_jit_compile(struct sk_filter *fp)
 	if (jit.start) {
 		set_memory_ro((unsigned long)header, header->pages);
 		fp->bpf_func = (void *) jit.start;
+		fp->jited = 1;
 	}
 out:
 	kfree(addrs);
@@ -887,7 +888,7 @@  void bpf_jit_free(struct sk_filter *fp)
 	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
 	struct bpf_binary_header *header = (void *)addr;
 
-	if (fp->bpf_func == sk_run_filter)
+	if (!fp->jited)
 		goto free_filter;
 	set_memory_rw(addr, header->pages);
 	module_free(NULL, header);
diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c
index 01fe9946d388..8c01be66f67d 100644
--- a/arch/sparc/net/bpf_jit_comp.c
+++ b/arch/sparc/net/bpf_jit_comp.c
@@ -809,6 +809,7 @@  cond_branch:			f_offset = addrs[i + filter[i].jf];
 	if (image) {
 		bpf_flush_icache(image, image + proglen);
 		fp->bpf_func = (void *)image;
+		fp->jited = 1;
 	}
 out:
 	kfree(addrs);
@@ -817,7 +818,7 @@  out:
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-	if (fp->bpf_func != sk_run_filter)
+	if (fp->jited)
 		module_free(NULL, fp->bpf_func);
 	kfree(fp);
 }
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4ed75dd81d05..7fa182cd3973 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -772,6 +772,7 @@  cond_branch:			f_offset = addrs[i + filter[i].jf] - addrs[i];
 		bpf_flush_icache(header, image + proglen);
 		set_memory_ro((unsigned long)header, header->pages);
 		fp->bpf_func = (void *)image;
+		fp->jited = 1;
 	}
 out:
 	kfree(addrs);
@@ -791,7 +792,7 @@  static void bpf_jit_free_deferred(struct work_struct *work)
 
 void bpf_jit_free(struct sk_filter *fp)
 {
-	if (fp->bpf_func != sk_run_filter) {
+	if (fp->jited) {
 		INIT_WORK(&fp->work, bpf_jit_free_deferred);
 		schedule_work(&fp->work);
 	} else {
diff --git a/include/linux/filter.h b/include/linux/filter.h
index e568c8ef896b..0a9278258763 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -26,11 +26,17 @@  struct sk_filter
 {
 	atomic_t		refcnt;
 	unsigned int         	len;	/* Number of filter blocks */
+	unsigned int		jited:1;
 	struct rcu_head		rcu;
-	unsigned int		(*bpf_func)(const struct sk_buff *skb,
-					    const struct sock_filter *filter);
+	union {
+		unsigned int (*bpf_func)(const struct sk_buff *skb,
+					 const struct sock_filter *fp);
+		unsigned int (*bpf_func_ext)(void *ctx,
+					     const struct sock_filter_ext *fp);
+	};
 	union {
 		struct sock_filter     	insns[0];
+		struct sock_filter_ext	insns_ext[0];
 		struct work_struct	work;
 	};
 };
@@ -52,7 +58,11 @@  extern int sk_detach_filter(struct sock *sk);
 extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
 extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
 extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
+int sk_convert_filter(struct sock_filter *old_prog, int len,
+		      struct sock_filter_ext *new_prog,	int *p_new_len);
+unsigned int sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn);
 
+#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #ifdef CONFIG_BPF_JIT
 #include <stdarg.h>
 #include <linux/linkage.h>
@@ -70,7 +80,6 @@  static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
 		print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
 			       16, 1, image, proglen, false);
 }
-#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #else
 #include <linux/slab.h>
 static inline void bpf_jit_compile(struct sk_filter *fp)
@@ -80,7 +89,6 @@  static inline void bpf_jit_free(struct sk_filter *fp)
 {
 	kfree(fp);
 }
-#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
 #endif
 
 static inline int bpf_tell_extensions(void)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1a869488b8ae..2c13d000389c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3054,6 +3054,7 @@  extern int		netdev_max_backlog;
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
+extern int		bpf_ext_enable;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 8eb9ccaa5b48..4e98fe16ba88 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -1,5 +1,6 @@ 
 /*
  * Linux Socket Filter Data Structures
+ * Extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
  */
 
 #ifndef _UAPI__LINUX_FILTER_H__
@@ -19,7 +20,7 @@ 
  *	Try and keep these values and structures similar to BSD, especially
  *	the BPF code definitions which need to match so you can share filters
  */
- 
+
 struct sock_filter {	/* Filter block */
 	__u16	code;   /* Actual filter code */
 	__u8	jt;	/* Jump true */
@@ -27,6 +28,14 @@  struct sock_filter {	/* Filter block */
 	__u32	k;      /* Generic multiuse field */
 };
 
+struct sock_filter_ext {
+	__u8	code;    /* opcode */
+	__u8    a_reg:4; /* dest register */
+	__u8    x_reg:4; /* source register */
+	__s16	off;     /* signed offset */
+	__s32	imm;     /* signed immediate constant */
+};
+
 struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 	unsigned short		len;	/* Number of filter blocks */
 	struct sock_filter __user *filter;
@@ -45,12 +54,14 @@  struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_JMP         0x05
 #define         BPF_RET         0x06
 #define         BPF_MISC        0x07
+#define         BPF_ALU64       0x07
 
 /* ld/ldx fields */
 #define BPF_SIZE(code)  ((code) & 0x18)
 #define         BPF_W           0x00
 #define         BPF_H           0x08
 #define         BPF_B           0x10
+#define         BPF_DW          0x18
 #define BPF_MODE(code)  ((code) & 0xe0)
 #define         BPF_IMM         0x00
 #define         BPF_ABS         0x20
@@ -58,6 +69,7 @@  struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_MEM         0x60
 #define         BPF_LEN         0x80
 #define         BPF_MSH         0xa0
+#define         BPF_XADD        0xc0 /* exclusive add */
 
 /* alu/jmp fields */
 #define BPF_OP(code)    ((code) & 0xf0)
@@ -68,16 +80,24 @@  struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_OR          0x40
 #define         BPF_AND         0x50
 #define         BPF_LSH         0x60
-#define         BPF_RSH         0x70
+#define         BPF_RSH         0x70 /* logical shift right */
 #define         BPF_NEG         0x80
 #define		BPF_MOD		0x90
 #define		BPF_XOR		0xa0
+#define		BPF_MOV		0xb0 /* mov reg to reg */
+#define		BPF_ARSH	0xc0 /* sign extending arithmetic shift right */
+#define		BPF_BSWAP32	0xd0 /* swap lower 4 bytes of 64-bit register */
+#define		BPF_BSWAP64	0xe0 /* swap all 8 bytes of 64-bit register */
 
 #define         BPF_JA          0x00
-#define         BPF_JEQ         0x10
-#define         BPF_JGT         0x20
-#define         BPF_JGE         0x30
-#define         BPF_JSET        0x40
+#define         BPF_JEQ         0x10 /* jump == */
+#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
+#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86 */
+#define         BPF_JSET        0x40 /* if (A & X) */
+#define         BPF_JNE         0x50 /* jump != */
+#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
+#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
+#define         BPF_CALL        0x80 /* function call */
 #define BPF_SRC(code)   ((code) & 0x08)
 #define         BPF_K           0x00
 #define         BPF_X           0x08
@@ -134,5 +154,4 @@  struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
-
 #endif /* _UAPI__LINUX_FILTER_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index ad30d626a5bd..ad6f7546ce64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1,5 +1,6 @@ 
 /*
  * Linux Socket Filter - Kernel level socket filtering
+ * Extended BPF is Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
  *
  * Author:
  *     Jay Schulist <jschlst@samba.org>
@@ -40,6 +41,8 @@ 
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
 
+int bpf_ext_enable __read_mostly;
+
 /* No hurry in this branch
  *
  * Exported for the bpf jit load helper.
@@ -134,11 +137,7 @@  unsigned int sk_run_filter(const struct sk_buff *skb,
 	 * Process array of filter instructions.
 	 */
 	for (;; fentry++) {
-#if defined(CONFIG_X86_32)
-#define	K (fentry->k)
-#else
 		const u32 K = fentry->k;
-#endif
 
 		switch (fentry->code) {
 		case BPF_S_ALU_ADD_X:
@@ -646,6 +645,7 @@  static int __sk_prepare_filter(struct sk_filter *fp)
 	int err;
 
 	fp->bpf_func = sk_run_filter;
+	fp->jited = 0;
 
 	err = sk_chk_filter(fp->insns, fp->len);
 	if (err)
@@ -655,6 +655,84 @@  static int __sk_prepare_filter(struct sk_filter *fp)
 	return 0;
 }
 
+static int sk_prepare_filter_ext(struct sk_filter **pfp,
+				 struct sock_fprog *fprog, struct sock *sk)
+{
+	unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
+	struct sock_filter *old_prog;
+	unsigned int sk_fsize;
+	struct sk_filter *fp;
+	int new_len;
+	int err;
+
+	BUILD_BUG_ON(sizeof(struct sock_filter) !=
+		     sizeof(struct sock_filter_ext));
+
+	/* store old program into buffer, since chk_filter will remap opcodes */
+	old_prog = kmalloc(fsize, GFP_KERNEL);
+	if (!old_prog)
+		return -ENOMEM;
+
+	if (sk) {
+		if (copy_from_user(old_prog, fprog->filter, fsize)) {
+			err = -EFAULT;
+			goto free_prog;
+		}
+	} else {
+		memcpy(old_prog, fprog->filter, fsize);
+	}
+
+	/* calculate bpf_ext program length */
+	err = sk_convert_filter(fprog->filter, fprog->len, NULL, &new_len);
+	if (err)
+		goto free_prog;
+
+	sk_fsize = sk_filter_size(new_len);
+	/* allocate sk_filter to store bpf_ext program */
+	if (sk)
+		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+	else
+		fp = kmalloc(sk_fsize, GFP_KERNEL);
+	if (!fp) {
+		err = -ENOMEM;
+		goto free_prog;
+	}
+
+	/* remap sock_filter insns into sock_filter_ext insns */
+	err = sk_convert_filter(old_prog, fprog->len, fp->insns_ext, &new_len);
+	if (err)
+		/* 2nd sk_convert_filter() can fail only if it fails
+		 * to allocate memory, remapping must succeed
+		 */
+		goto free_fp;
+
+	/* now chk_filter can overwrite old_prog while checking */
+	err = sk_chk_filter(old_prog, fprog->len);
+	if (err)
+		goto free_fp;
+
+	/* discard old prog */
+	kfree(old_prog);
+
+	atomic_set(&fp->refcnt, 1);
+	fp->len = new_len;
+	fp->jited = 0;
+
+	/* sock_filter_ext insns must be executed by sk_run_filter_ext */
+	fp->bpf_func_ext = sk_run_filter_ext;
+
+	*pfp = fp;
+	return 0;
+free_fp:
+	if (sk)
+		sock_kfree_s(sk, fp, sk_fsize);
+	else
+		kfree(fp);
+free_prog:
+	kfree(old_prog);
+	return err;
+}
+
 /**
  *	sk_unattached_filter_create - create an unattached filter
  *	@fprog: the filter program
@@ -676,6 +754,9 @@  int sk_unattached_filter_create(struct sk_filter **pfp,
 	if (fprog->filter == NULL)
 		return -EINVAL;
 
+	if (bpf_ext_enable)
+		return sk_prepare_filter_ext(pfp, fprog, NULL);
+
 	fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
 	if (!fp)
 		return -ENOMEM;
@@ -726,21 +807,27 @@  int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 	if (fprog->filter == NULL)
 		return -EINVAL;
 
-	fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
-	if (!fp)
-		return -ENOMEM;
-	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
-		sock_kfree_s(sk, fp, sk_fsize);
-		return -EFAULT;
-	}
+	if (bpf_ext_enable) {
+		err = sk_prepare_filter_ext(&fp, fprog, sk);
+		if (err)
+			return err;
+	} else {
+		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+		if (!fp)
+			return -ENOMEM;
+		if (copy_from_user(fp->insns, fprog->filter, fsize)) {
+			sock_kfree_s(sk, fp, sk_fsize);
+			return -EFAULT;
+		}
 
-	atomic_set(&fp->refcnt, 1);
-	fp->len = fprog->len;
+		atomic_set(&fp->refcnt, 1);
+		fp->len = fprog->len;
 
-	err = __sk_prepare_filter(fp);
-	if (err) {
-		sk_filter_uncharge(sk, fp);
-		return err;
+		err = __sk_prepare_filter(fp);
+		if (err) {
+			sk_filter_uncharge(sk, fp);
+			return err;
+		}
 	}
 
 	old_fp = rcu_dereference_protected(sk->sk_filter,
@@ -882,3 +969,683 @@  out:
 	release_sock(sk);
 	return ret;
 }
+
+/**
+ *	sk_convert_filter - convert filter program
+ *	@old_prog: the filter program
+ *	@len: the length of filter program
+ *	@new_prog: buffer where converted program will be stored
+ *	@p_new_len: pointer to store length of converted program
+ *
+ * remap 'sock_filter' style BPF instruction set to 'sock_filter_ext' style
+ *
+ * first, call sk_convert_filter(old_prog, len, NULL, &new_len) to calculate new
+ * program length in one pass
+ *
+ * then new_prog = kmalloc(sizeof(struct sock_filter_ext) * new_len);
+ *
+ * and call it again: sk_convert_filter(old_prog, len, new_prog, &new_len);
+ * to remap in two passes: 1st pass finds new jump offsets, 2nd pass remaps
+ */
+int sk_convert_filter(struct sock_filter *old_prog, int len,
+		      struct sock_filter_ext *new_prog, int *p_new_len)
+{
+	struct sock_filter_ext *new_insn;
+	struct sock_filter *fp;
+	int *addrs = NULL;
+	int new_len = 0;
+	int pass = 0;
+	int tgt, i;
+	u8 bpf_src;
+
+	if (len <= 0 || len >= BPF_MAXINSNS)
+		return -EINVAL;
+
+	if (new_prog) {
+		addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
+		if (!addrs)
+			return -ENOMEM;
+	}
+
+do_pass:
+	new_insn = new_prog;
+	fp = old_prog;
+	for (i = 0; i < len; fp++, i++) {
+		struct sock_filter_ext tmp_insns[3] = {};
+		struct sock_filter_ext *insn = tmp_insns;
+
+		if (addrs)
+			addrs[i] = new_insn - new_prog;
+
+		switch (fp->code) {
+		/* all arithmetic insns and skb loads map as-is */
+		case BPF_ALU | BPF_ADD | BPF_X:
+		case BPF_ALU | BPF_ADD | BPF_K:
+		case BPF_ALU | BPF_SUB | BPF_X:
+		case BPF_ALU | BPF_SUB | BPF_K:
+		case BPF_ALU | BPF_AND | BPF_X:
+		case BPF_ALU | BPF_AND | BPF_K:
+		case BPF_ALU | BPF_OR | BPF_X:
+		case BPF_ALU | BPF_OR | BPF_K:
+		case BPF_ALU | BPF_LSH | BPF_X:
+		case BPF_ALU | BPF_LSH | BPF_K:
+		case BPF_ALU | BPF_RSH | BPF_X:
+		case BPF_ALU | BPF_RSH | BPF_K:
+		case BPF_ALU | BPF_XOR | BPF_X:
+		case BPF_ALU | BPF_XOR | BPF_K:
+		case BPF_ALU | BPF_MUL | BPF_X:
+		case BPF_ALU | BPF_MUL | BPF_K:
+		case BPF_ALU | BPF_DIV | BPF_X:
+		case BPF_ALU | BPF_DIV | BPF_K:
+		case BPF_ALU | BPF_MOD | BPF_X:
+		case BPF_ALU | BPF_MOD | BPF_K:
+		case BPF_ALU | BPF_NEG:
+		case BPF_LD | BPF_ABS | BPF_W:
+		case BPF_LD | BPF_ABS | BPF_H:
+		case BPF_LD | BPF_ABS | BPF_B:
+		case BPF_LD | BPF_IND | BPF_W:
+		case BPF_LD | BPF_IND | BPF_H:
+		case BPF_LD | BPF_IND | BPF_B:
+			insn->code = fp->code;
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			insn->imm = fp->k;
+			break;
+
+		/* jump opcodes map as-is, but offsets need adjustment */
+		case BPF_JMP | BPF_JA:
+			tgt = i + fp->k + 1;
+			insn->code = fp->code;
+#define EMIT_JMP \
+	do { \
+		if (tgt >= len || tgt < 0) \
+			goto err; \
+		insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
+		/* adjust pc relative offset for 2nd or 3rd insn */ \
+		insn->off -= insn - tmp_insns; \
+	} while (0)
+
+			EMIT_JMP;
+			break;
+
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JEQ | BPF_X:
+		case BPF_JMP | BPF_JSET | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_X:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_X:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_X:
+			if (BPF_SRC(fp->code) == BPF_K &&
+			    (int)fp->k < 0) {
+				/* extended BPF immediates are signed,
+				 * zero extend immediate into tmp register
+				 * and use it in compare insn
+				 */
+				insn->code = BPF_ALU | BPF_MOV | BPF_K;
+				insn->a_reg = 2;
+				insn->imm = fp->k;
+				insn++;
+
+				insn->a_reg = 6;
+				insn->x_reg = 2;
+				bpf_src = BPF_X;
+			} else {
+				insn->a_reg = 6;
+				insn->x_reg = 7;
+				insn->imm = fp->k;
+				bpf_src = BPF_SRC(fp->code);
+			}
+			/* common case where 'jump_false' is next insn */
+			if (fp->jf == 0) {
+				insn->code = BPF_JMP | BPF_OP(fp->code) |
+					bpf_src;
+				tgt = i + fp->jt + 1;
+				EMIT_JMP;
+				break;
+			}
+			/* convert JEQ into JNE when 'jump_true' is next insn */
+			if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
+				insn->code = BPF_JMP | BPF_JNE | bpf_src;
+				tgt = i + fp->jf + 1;
+				EMIT_JMP;
+				break;
+			}
+			/* other jumps are mapped into two insns: Jxx and JA */
+			tgt = i + fp->jt + 1;
+			insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
+			EMIT_JMP;
+
+			insn++;
+			insn->code = BPF_JMP | BPF_JA;
+			tgt = i + fp->jf + 1;
+			EMIT_JMP;
+			break;
+
+		/* ldxb 4*([14]&0xf) is remapped into 3 insns */
+		case BPF_LDX | BPF_MSH | BPF_B:
+			insn->code = BPF_LD | BPF_ABS | BPF_B;
+			insn->a_reg = 7;
+			insn->imm = fp->k;
+
+			insn++;
+			insn->code = BPF_ALU | BPF_AND | BPF_K;
+			insn->a_reg = 7;
+			insn->imm = 0xf;
+
+			insn++;
+			insn->code = BPF_ALU | BPF_LSH | BPF_K;
+			insn->a_reg = 7;
+			insn->imm = 2;
+			break;
+
+		/* RET_K, RET_A are remapped into 2 insns */
+		case BPF_RET | BPF_A:
+		case BPF_RET | BPF_K:
+			insn->code = BPF_ALU | BPF_MOV |
+				(BPF_RVAL(fp->code) == BPF_K ? BPF_K : BPF_X);
+			insn->a_reg = 0;
+			insn->x_reg = 6;
+			insn->imm = fp->k;
+
+			insn++;
+			insn->code = BPF_RET | BPF_K;
+			break;
+
+		/* store to stack */
+		case BPF_ST:
+		case BPF_STX:
+			insn->code = BPF_STX | BPF_MEM | BPF_W;
+			insn->a_reg = 10;
+			insn->x_reg = fp->code == BPF_ST ? 6 : 7;
+			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			break;
+
+		/* load from stack */
+		case BPF_LD | BPF_MEM:
+		case BPF_LDX | BPF_MEM:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->x_reg = 10;
+			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			break;
+
+		/* A = K or X = K */
+		case BPF_LD | BPF_IMM:
+		case BPF_LDX | BPF_IMM:
+			insn->code = BPF_ALU | BPF_MOV | BPF_K;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->imm = fp->k;
+			break;
+
+		/* X = A */
+		case BPF_MISC | BPF_TAX:
+			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+			insn->a_reg = 7;
+			insn->x_reg = 6;
+			break;
+
+		/* A = X */
+		case BPF_MISC | BPF_TXA:
+			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			break;
+
+		/* A = skb->len or X = skb->len */
+		case BPF_LD | BPF_W | BPF_LEN:
+		case BPF_LDX | BPF_W | BPF_LEN:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->x_reg = 1;
+			insn->off = offsetof(struct sk_buff, len);
+			break;
+
+		/* access seccomp_data fields */
+		case BPF_LDX | BPF_ABS | BPF_W:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = 6;
+			insn->x_reg = 1;
+			insn->off = fp->k;
+			break;
+
+		default:
+			/* pr_err("unknown opcode %02x\n", fp->code); */
+			goto err;
+		}
+
+		insn++;
+		if (new_prog) {
+			memcpy(new_insn, tmp_insns,
+			       sizeof(*insn) * (insn - tmp_insns));
+		}
+		new_insn += insn - tmp_insns;
+	}
+
+	if (!new_prog) {
+		/* only calculating new length */
+		*p_new_len = new_insn - new_prog;
+		return 0;
+	}
+
+	pass++;
+	if (new_len != new_insn - new_prog) {
+		new_len = new_insn - new_prog;
+		if (pass > 2)
+			goto err;
+		goto do_pass;
+	}
+	kfree(addrs);
+	if (*p_new_len != new_len)
+		/* inconsistent new program length */
+		pr_err("sk_convert_filter() usage error\n");
+	return 0;
+err:
+	kfree(addrs);
+	return -EINVAL;
+}
+
+/**
+ *	sk_run_filter_ext - run an extended filter
+ *	@ctx: buffer to run the filter on
+ *	@insn: filter to apply
+ *
+ * Decode and execute extended BPF instructions.
+ * @ctx is the data we are operating on.
+ * @filter is the array of filter instructions.
+ */
+notrace u32 sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn)
+{
+	u64 stack[64];
+	u64 regs[16];
+	void *ptr;
+	u64 tmp;
+	int off;
+
+#define K insn->imm
+#define A regs[insn->a_reg]
+#define X regs[insn->x_reg]
+
+#define CONT ({insn++; goto select_insn; })
+#define CONT_JMP ({insn++; goto select_insn; })
+/* some compilers may need help:
+ * #define CONT_JMP ({insn++; goto *jumptable[insn->code]; })
+ */
+
+	static const void *jumptable[256] = {
+		[0 ... 255] = &&default_label,
+#define DL(A, B, C) [A|B|C] = &&A##_##B##_##C,
+		DL(BPF_ALU, BPF_ADD, BPF_X)
+		DL(BPF_ALU, BPF_ADD, BPF_K)
+		DL(BPF_ALU, BPF_SUB, BPF_X)
+		DL(BPF_ALU, BPF_SUB, BPF_K)
+		DL(BPF_ALU, BPF_AND, BPF_X)
+		DL(BPF_ALU, BPF_AND, BPF_K)
+		DL(BPF_ALU, BPF_OR, BPF_X)
+		DL(BPF_ALU, BPF_OR, BPF_K)
+		DL(BPF_ALU, BPF_LSH, BPF_X)
+		DL(BPF_ALU, BPF_LSH, BPF_K)
+		DL(BPF_ALU, BPF_RSH, BPF_X)
+		DL(BPF_ALU, BPF_RSH, BPF_K)
+		DL(BPF_ALU, BPF_XOR, BPF_X)
+		DL(BPF_ALU, BPF_XOR, BPF_K)
+		DL(BPF_ALU, BPF_MUL, BPF_X)
+		DL(BPF_ALU, BPF_MUL, BPF_K)
+		DL(BPF_ALU, BPF_MOV, BPF_X)
+		DL(BPF_ALU, BPF_MOV, BPF_K)
+		DL(BPF_ALU, BPF_DIV, BPF_X)
+		DL(BPF_ALU, BPF_DIV, BPF_K)
+		DL(BPF_ALU, BPF_MOD, BPF_X)
+		DL(BPF_ALU, BPF_MOD, BPF_K)
+		DL(BPF_ALU64, BPF_ADD, BPF_X)
+		DL(BPF_ALU64, BPF_ADD, BPF_K)
+		DL(BPF_ALU64, BPF_SUB, BPF_X)
+		DL(BPF_ALU64, BPF_SUB, BPF_K)
+		DL(BPF_ALU64, BPF_AND, BPF_X)
+		DL(BPF_ALU64, BPF_AND, BPF_K)
+		DL(BPF_ALU64, BPF_OR, BPF_X)
+		DL(BPF_ALU64, BPF_OR, BPF_K)
+		DL(BPF_ALU64, BPF_LSH, BPF_X)
+		DL(BPF_ALU64, BPF_LSH, BPF_K)
+		DL(BPF_ALU64, BPF_RSH, BPF_X)
+		DL(BPF_ALU64, BPF_RSH, BPF_K)
+		DL(BPF_ALU64, BPF_XOR, BPF_X)
+		DL(BPF_ALU64, BPF_XOR, BPF_K)
+		DL(BPF_ALU64, BPF_MUL, BPF_X)
+		DL(BPF_ALU64, BPF_MUL, BPF_K)
+		DL(BPF_ALU64, BPF_MOV, BPF_X)
+		DL(BPF_ALU64, BPF_MOV, BPF_K)
+		DL(BPF_ALU64, BPF_ARSH, BPF_X)
+		DL(BPF_ALU64, BPF_ARSH, BPF_K)
+		DL(BPF_ALU64, BPF_DIV, BPF_X)
+		DL(BPF_ALU64, BPF_DIV, BPF_K)
+		DL(BPF_ALU64, BPF_MOD, BPF_X)
+		DL(BPF_ALU64, BPF_MOD, BPF_K)
+		DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
+		DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
+		DL(BPF_ALU, BPF_NEG, 0)
+		DL(BPF_JMP, BPF_CALL, 0)
+		DL(BPF_JMP, BPF_JA, 0)
+		DL(BPF_JMP, BPF_JEQ, BPF_X)
+		DL(BPF_JMP, BPF_JEQ, BPF_K)
+		DL(BPF_JMP, BPF_JNE, BPF_X)
+		DL(BPF_JMP, BPF_JNE, BPF_K)
+		DL(BPF_JMP, BPF_JGT, BPF_X)
+		DL(BPF_JMP, BPF_JGT, BPF_K)
+		DL(BPF_JMP, BPF_JGE, BPF_X)
+		DL(BPF_JMP, BPF_JGE, BPF_K)
+		DL(BPF_JMP, BPF_JSGT, BPF_X)
+		DL(BPF_JMP, BPF_JSGT, BPF_K)
+		DL(BPF_JMP, BPF_JSGE, BPF_X)
+		DL(BPF_JMP, BPF_JSGE, BPF_K)
+		DL(BPF_JMP, BPF_JSET, BPF_X)
+		DL(BPF_JMP, BPF_JSET, BPF_K)
+		DL(BPF_STX, BPF_MEM, BPF_B)
+		DL(BPF_STX, BPF_MEM, BPF_H)
+		DL(BPF_STX, BPF_MEM, BPF_W)
+		DL(BPF_STX, BPF_MEM, BPF_DW)
+		DL(BPF_ST, BPF_MEM, BPF_B)
+		DL(BPF_ST, BPF_MEM, BPF_H)
+		DL(BPF_ST, BPF_MEM, BPF_W)
+		DL(BPF_ST, BPF_MEM, BPF_DW)
+		DL(BPF_LDX, BPF_MEM, BPF_B)
+		DL(BPF_LDX, BPF_MEM, BPF_H)
+		DL(BPF_LDX, BPF_MEM, BPF_W)
+		DL(BPF_LDX, BPF_MEM, BPF_DW)
+		DL(BPF_STX, BPF_XADD, BPF_W)
+#ifdef CONFIG_64BIT
+		DL(BPF_STX, BPF_XADD, BPF_DW)
+#endif
+		DL(BPF_LD, BPF_ABS, BPF_W)
+		DL(BPF_LD, BPF_ABS, BPF_H)
+		DL(BPF_LD, BPF_ABS, BPF_B)
+		DL(BPF_LD, BPF_IND, BPF_W)
+		DL(BPF_LD, BPF_IND, BPF_H)
+		DL(BPF_LD, BPF_IND, BPF_B)
+		DL(BPF_RET, BPF_K, 0)
+#undef DL
+	};
+
+	regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
+	regs[1/* BPF R1 */] = (u64)(ulong)ctx;
+
+	/* execute 1st insn */
+select_insn:
+	goto *jumptable[insn->code];
+
+	/* ALU */
+#define ALU(OPCODE, OP) \
+	BPF_ALU64_##OPCODE##_BPF_X: \
+		A = A OP X; \
+		CONT; \
+	BPF_ALU_##OPCODE##_BPF_X: \
+		A = (u32)A OP (u32)X; \
+		CONT; \
+	BPF_ALU64_##OPCODE##_BPF_K: \
+		A = A OP K; \
+		CONT; \
+	BPF_ALU_##OPCODE##_BPF_K: \
+		A = (u32)A OP (u32)K; \
+		CONT;
+
+	ALU(BPF_ADD, +)
+	ALU(BPF_SUB, -)
+	ALU(BPF_AND, &)
+	ALU(BPF_OR, |)
+	ALU(BPF_LSH, <<)
+	ALU(BPF_RSH, >>)
+	ALU(BPF_XOR, ^)
+	ALU(BPF_MUL, *)
+#undef ALU
+
+BPF_ALU_BPF_NEG_0:
+	A = (u32)-A;
+	CONT;
+BPF_ALU_BPF_MOV_BPF_X:
+	A = (u32)X;
+	CONT;
+BPF_ALU_BPF_MOV_BPF_K:
+	A = (u32)K;
+	CONT;
+BPF_ALU64_BPF_MOV_BPF_X:
+	A = X;
+	CONT;
+BPF_ALU64_BPF_MOV_BPF_K:
+	A = K;
+	CONT;
+BPF_ALU64_BPF_ARSH_BPF_X:
+	(*(s64 *) &A) >>= X;
+	CONT;
+BPF_ALU64_BPF_ARSH_BPF_K:
+	(*(s64 *) &A) >>= K;
+	CONT;
+BPF_ALU64_BPF_MOD_BPF_X:
+	tmp = A;
+	if (X)
+		A = do_div(tmp, X);
+	CONT;
+BPF_ALU_BPF_MOD_BPF_X:
+	tmp = (u32)A;
+	if (X)
+		A = do_div(tmp, (u32)X);
+	CONT;
+BPF_ALU64_BPF_MOD_BPF_K:
+	tmp = A;
+	if (K)
+		A = do_div(tmp, K);
+	CONT;
+BPF_ALU_BPF_MOD_BPF_K:
+	tmp = (u32)A;
+	if (K)
+		A = do_div(tmp, (u32)K);
+	CONT;
+BPF_ALU64_BPF_DIV_BPF_X:
+	if (X)
+		do_div(A, X);
+	CONT;
+BPF_ALU_BPF_DIV_BPF_X:
+	tmp = (u32)A;
+	if (X)
+		do_div(tmp, (u32)X);
+	A = (u32)tmp;
+	CONT;
+BPF_ALU64_BPF_DIV_BPF_K:
+	if (K)
+		do_div(A, K);
+	CONT;
+BPF_ALU_BPF_DIV_BPF_K:
+	tmp = (u32)A;
+	if (K)
+		do_div(tmp, (u32)K);
+	A = (u32)tmp;
+	CONT;
+BPF_ALU64_BPF_BSWAP32_BPF_X:
+	A = swab32(A);
+	CONT;
+BPF_ALU64_BPF_BSWAP64_BPF_X:
+	A = swab64(A);
+	CONT;
+
+	/* CALL */
+BPF_JMP_BPF_CALL_0:
+	return 0; /* not implemented yet */
+
+	/* JMP */
+BPF_JMP_BPF_JA_0:
+	insn += insn->off;
+	CONT;
+BPF_JMP_BPF_JEQ_BPF_X:
+	if (A == X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JEQ_BPF_K:
+	if (A == K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JNE_BPF_X:
+	if (A != X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JNE_BPF_K:
+	if (A != K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JGT_BPF_X:
+	if (A > X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JGT_BPF_K:
+	if (A > K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JGE_BPF_X:
+	if (A >= X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JGE_BPF_K:
+	if (A >= K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSGT_BPF_X:
+	if (((s64)A) > ((s64)X)) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSGT_BPF_K:
+	if (((s64)A) > ((s64)K)) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSGE_BPF_X:
+	if (((s64)A) >= ((s64)X)) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSGE_BPF_K:
+	if (((s64)A) >= ((s64)K)) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSET_BPF_X:
+	if (A & X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSET_BPF_K:
+	if (A & K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+
+	/* STX and ST and LDX*/
+#define LDST(SIZEOP, SIZE) \
+	BPF_STX_BPF_MEM_##SIZEOP: \
+		*(SIZE *)(ulong)(A + insn->off) = X; \
+		CONT; \
+	BPF_ST_BPF_MEM_##SIZEOP: \
+		*(SIZE *)(ulong)(A + insn->off) = K; \
+		CONT; \
+	BPF_LDX_BPF_MEM_##SIZEOP: \
+		A = *(SIZE *)(ulong)(X + insn->off); \
+		CONT;
+
+	LDST(BPF_B, u8)
+	LDST(BPF_H, u16)
+	LDST(BPF_W, u32)
+	LDST(BPF_DW, u64)
+#undef LDST
+
+BPF_STX_BPF_XADD_BPF_W: /* lock xadd *(u32 *)(A + insn->off) += X */
+	atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
+	CONT;
+#ifdef CONFIG_64BIT
+BPF_STX_BPF_XADD_BPF_DW: /* lock xadd *(u64 *)(A + insn->off) += X */
+	atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
+	CONT;
+#endif
+
+BPF_LD_BPF_ABS_BPF_W: /* A = *(u32 *)(SKB + K) */
+	off = K;
+load_word:
+	/* sk_convert_filter() and sk_chk_filter_ext() will make sure
+	 * that BPF_LD+BPF_ABS and BPF_LD+BPF_IND insns are only
+	 * appearing in the programs where ctx == skb
+	 */
+	ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
+	if (likely(ptr != NULL)) {
+		A = get_unaligned_be32(ptr);
+		CONT;
+	}
+	return 0;
+
+BPF_LD_BPF_ABS_BPF_H: /* A = *(u16 *)(SKB + K) */
+	off = K;
+load_half:
+	ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
+	if (likely(ptr != NULL)) {
+		A = get_unaligned_be16(ptr);
+		CONT;
+	}
+	return 0;
+
+BPF_LD_BPF_ABS_BPF_B: /* A = *(u8 *)(SKB + K) */
+	off = K;
+load_byte:
+	ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
+	if (likely(ptr != NULL)) {
+		A = *(u8 *)ptr;
+		CONT;
+	}
+	return 0;
+
+BPF_LD_BPF_IND_BPF_W: /* A = *(u32 *)(SKB + X + K) */
+	off = K + X;
+	goto load_word;
+
+BPF_LD_BPF_IND_BPF_H: /* A = *(u16 *)(SKB + X + K) */
+	off = K + X;
+	goto load_half;
+
+BPF_LD_BPF_IND_BPF_B: /* A = *(u8 *)(SKB + X + K) */
+	off = K + X;
+	goto load_byte;
+
+	/* RET */
+BPF_RET_BPF_K_0:
+	return regs[0/* R0 */];
+
+default_label:
+	/* sk_chk_filter_ext() and sk_convert_filter() guarantee
+	 * that we never reach here
+	 */
+	WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+	return 0;
+#undef CONT
+#undef A
+#undef X
+#undef K
+#undef LOAD_IMM
+}
+EXPORT_SYMBOL(sk_run_filter_ext);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..e1b979312588 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,6 +273,13 @@  static struct ctl_table net_core_table[] = {
 	},
 #endif
 	{
+		.procname	= "bpf_ext_enable",
+		.data		= &bpf_ext_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "netdev_tstamp_prequeue",
 		.data		= &netdev_tstamp_prequeue,
 		.maxlen		= sizeof(int),