diff mbox series

[bpf,v3] bpf: verifier, do_refine_retval_range may clamp umin to 0 incorrectly

Message ID 158015334199.28573.4940395881683556537.stgit@john-XPS-13-9370
State Accepted
Delegated to: BPF Maintainers
Headers show
Series [bpf,v3] bpf: verifier, do_refine_retval_range may clamp umin to 0 incorrectly | expand

Commit Message

John Fastabend Jan. 27, 2020, 7:29 p.m. UTC
do_refine_retval_range() is called to refine return values from specified
helpers, probe_read_str and get_stack at the moment. The reasoning is that
both have a max value as part of their input arguments, and because the
helpers ensure the return value will not be larger than this we can set
the smax value of the return register, r0.

However, the return value is a signed integer, so setting umax is
incorrect. It leads to further confusion when do_refine_retval_range()
then calls __reg_deduce_bounds(), which will see a umax value as meaning
the value is unsigned and, assuming it is unsigned, set smin = umin. In
this case that results in 'smin = 0' and 'smax = X', where X is the input
argument from the helper call.

Here are the comments from __reg_deduce_bounds() on why this would be safe
to do.

 /* Learn sign from unsigned bounds.  Signed bounds cross the sign
  * boundary, so we must be careful.
  */
 if ((s64)reg->umax_value >= 0) {
	/* Positive.  We can't learn anything from the smin, but smax
	 * is positive, hence safe.
	 */
	reg->smin_value = reg->umin_value;
	reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value,
						  reg->umax_value);

But now we incorrectly have a return value of type int with the signed
bounds (0, X). Suppose the return value is negative, which is possible;
then we have the verifier and reality out of sync. Among other things
this may result in any error handling code being falsely detected as
dead code and removed. For instance the example below shows that using
bpf_probe_read_str() causes the error path to be identified as dead
code and removed.
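
For context, the kind of program that hits this looks roughly like the
sketch below. This is an illustrative, hypothetical reproducer (attach
point, pointer source and surrounding program are assumptions, not the
exact source behind the dumps that follow):

 #include <linux/ptrace.h>
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 /* Hypothetical attach point; the probed pointer is only for illustration. */
 SEC("kprobe/do_sys_open")
 int probe(struct pt_regs *ctx)
 {
 	char buf[100];
 	const void *unsafe_ptr = ctx;	/* any kernel pointer works for the demo */

 	/* returns a negative errno on failure, copied length on success */
 	int len = bpf_probe_read_str(buf, sizeof(buf), unsafe_ptr);
 	if (len < 0)		/* the error branch that gets wrongly pruned */
 		return 0;

 	return len;
 }

 char _license[] SEC("license") = "GPL";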

From the 'llvm-objdump -S' dump,

 r2 = 100
 call 45
 if r0 s< 0 goto +4
 r4 = *(u32 *)(r7 + 0)

But from the xlated dump,

  (b7) r2 = 100
  (85) call bpf_probe_read_compat_str#-96768
  (61) r4 = *(u32 *)(r7 +0)  <-- dropped if goto

This is due to the verifier state after the call being

 R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f))

To fix this, omit setting the umax value because it is not safe. The only
actual bound we know is the smax. This results in the correct bounds
(SMIN, X), where X is the max length from the helper. With this the
verifier state after call 45 looks like the following.

R0=inv(id=0,smax_value=100)

The xlated version then no longer removes the dead code, giving the
expected result,

  (b7) r2 = 100
  (85) call bpf_probe_read_compat_str#-96768
  (c5) if r0 s< 0x0 goto pc+4
  (61) r4 = *(u32 *)(r7 +0)
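
For reference, the shape of the resulting change is roughly the sketch
below. Field and helper names follow the verifier code of this era
(msize_max_value is the metadata introduced in v2, see the note below);
treat it as an illustration of the idea rather than the exact applied
diff:

 static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
 				   int func_id,
 				   struct bpf_call_arg_meta *meta)
 {
 	struct bpf_reg_state *ret_reg = &regs[BPF_REG_0];

 	if (ret_type != RET_INTEGER ||
 	    (func_id != BPF_FUNC_get_stack &&
 	     func_id != BPF_FUNC_probe_read_str))
 		return;

 	/* Only the signed max is known to be safe here. Leaving umax alone
 	 * keeps __reg_deduce_bounds() from wrongly learning smin = umin = 0.
 	 */
 	ret_reg->smax_value = meta->msize_max_value;
 	__reg_deduce_bounds(ret_reg);
 	__reg_bound_offset(ret_reg);
 	__update_reg_bounds(ret_reg);
 }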

Note, bpf_probe_read_* calls are root only so we won't hit this case
with non-root bpf users.

v3: the comment had some documentation about the meta-set-to-NULL case,
which is not relevant here and was confusing to include in the comment.

v2 note: In the original version we set msize_smax_value from
check_func_arg() and propagated this into the smax of the retval. The
logic was that smax is the bound on the retval we set, and because the
type in the helper is ARG_CONST_SIZE we know the reg is a positive
tnum_const(), so umax == smax. Alexei pointed out though that this is a
bit odd to read, because the register in check_func_arg() has a C type
of u32 and the umax bound would normally be the relevant bound here.
Pulling in extra knowledge about future checks makes reading the code a
bit tricky. Further, having signed metadata that can only ever be
positive is also a bit odd. So we dropped the msize_smax_value metadata
and made it a u64 msize_max_value to indicate it is unsigned, and
additionally save the bound from the umax value in check_func_arg(),
which is the same as smax due to the tnum_const() and negative check
noted above, but reads better. By my analysis nothing functionally
changes in v2, but it does get easier to read, so that is a win.
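
A minimal sketch of the check_func_arg() side described above, again
illustrative rather than the exact hunk (only the metadata assignment
matters for this discussion):

 	} else if (arg_type_is_mem_size(arg_type)) {
 		/* Save the max mem size as an unsigned (u64) bound; the size
 		 * register has a u32 C type, so umax is the naturally
 		 * relevant bound, and for a constant it equals smax anyway.
 		 */
 		meta->msize_max_value = reg->umax_value;

 		/* ... existing zero-size and min/max access checks ... */
 	}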

Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/verifier.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

Comments

Daniel Borkmann Jan. 29, 2020, 4:25 p.m. UTC | #1
On 1/27/20 8:29 PM, John Fastabend wrote:
> do_refine_retval_range() is called to refine return values from specified
> helpers, probe_read_str and get_stack at the moment, the reasoning is
> because both have a max value as part of their input arguments and
> because the helper ensure the return value will not be larger than this
> we can set smax values of the return register, r0.
> 
> However, the return value is a signed integer so setting umax is incorrect
> It leads to further confusion when the do_refine_retval_range() then calls,
> __reg_deduce_bounds() which will see a umax value as meaning the value is
> unsigned and then assuming it is unsigned set the smin = umin which in this
> case results in 'smin = 0' and an 'smax = X' where X is the input argument
> from the helper call.
> 
> Here are the comments from _reg_deduce_bounds() on why this would be safe
> to do.
> 
>   /* Learn sign from unsigned bounds.  Signed bounds cross the sign
>    * boundary, so we must be careful.
>    */
>   if ((s64)reg->umax_value >= 0) {
> 	/* Positive.  We can't learn anything from the smin, but smax
> 	 * is positive, hence safe.
> 	 */
> 	reg->smin_value = reg->umin_value;
> 	reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value,
> 						  reg->umax_value);
> 
> But now we incorrectly have a return value with type int with the
> signed bounds (0,X). Suppose the return value is negative, which is
> possible the we have the verifier and reality out of sync. Among other
> things this may result in any error handling code being falsely detected
> as dead-code and removed. For instance the example below shows using
> bpf_probe_read_str() causes the error path to be identified as dead
> code and removed.
> 
>>From the 'llvm-object -S' dump,
> 
>   r2 = 100
>   call 45
>   if r0 s< 0 goto +4
>   r4 = *(u32 *)(r7 + 0)
> 
> But from dump xlate
> 
>    (b7) r2 = 100
>    (85) call bpf_probe_read_compat_str#-96768
>    (61) r4 = *(u32 *)(r7 +0)  <-- dropped if goto
> 
> Due to verifier state after call being
> 
>   R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f))
> 
> To fix omit setting the umax value because its not safe. The only
> actual bounds we know is the smax. This results in the correct bounds
> (SMIN, X) where X is the max length from the helper. After this the
> new verifier state looks like the following after call 45.
> 
> R0=inv(id=0,smax_value=100)
> 
> Then xlated version no longer removed dead code giving the expected
> result,
> 
>    (b7) r2 = 100
>    (85) call bpf_probe_read_compat_str#-96768
>    (c5) if r0 s< 0x0 goto pc+4
>    (61) r4 = *(u32 *)(r7 +0)
> 
> Note, bpf_probe_read_* calls are root only so we wont hit this case
> with non-root bpf users.
> 
> v3: comment had some documentation about meta set to null case which
> is not relevant here and confusing to include in the comment.
> 
> v2 note: In original version we set msize_smax_value from check_func_arg()
> and propagated this into smax of retval. The logic was smax is the bound
> on the retval we set and because the type in the helper is ARG_CONST_SIZE
> we know that the reg is a positive tnum_const() so umax=smax. Alexei
> pointed out though this is a bit odd to read because the register in
> check_func_arg() has a C type of u32 and the umax bound would be the
> normally relavent bound here. Pulling in extra knowledge about future
> checks makes reading the code a bit tricky. Further having a signed
> meta data that can only ever be positive is also a bit odd. So dropped
> the msize_smax_value metadata and made it a u64 msize_max_value to
> indicate its unsigned. And additionally save bound from umax value in
> check_arg_funcs which is the same as smax due to as noted above tnumx_cont
> and negative check but reads better. By my analysis nothing functionally
> changes in v2 but it does get easier to read so that is win.
> 
> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Applied, thanks!
Alexei Starovoitov Jan. 29, 2020, 7:28 p.m. UTC | #2
On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>
> Applied, thanks!

Daniel,
did you run the selftests before applying?
This patch breaks two.
We have to find a different fix.

./test_progs -t get_stack
68: (85) call bpf_get_stack#67
 R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
R2 unbounded memory access, make sure to bounds check any array access
into a map

John,
did you run the selftests?
Daniel Borkmann Jan. 29, 2020, 10:20 p.m. UTC | #3
On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
> On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>
>>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>
>> Applied, thanks!
> 
> Daniel,
> did you run the selftests before applying?
> This patch breaks two.
> We have to find a different fix.
> 
> ./test_progs -t get_stack
> 68: (85) call bpf_get_stack#67
>   R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
> R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
> 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
> 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
> R2 unbounded memory access, make sure to bounds check any array access
> into a map

Sigh, I had it in my wip pre-rebase tree when running tests. I've reverted
it from the tree since this needs to be addressed. Sorry for the trouble.

Thanks,
Daniel
John Fastabend Jan. 29, 2020, 10:52 p.m. UTC | #4
Daniel Borkmann wrote:
> On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
> > On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>
> >>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> >>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >>
> >> Applied, thanks!
> > 
> > Daniel,
> > did you run the selftests before applying?
> > This patch breaks two.
> > We have to find a different fix.
> > 
> > ./test_progs -t get_stack
> > 68: (85) call bpf_get_stack#67
> >   R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
> > R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
> > 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
> > 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
> > R2 unbounded memory access, make sure to bounds check any array access
> > into a map
> 
> Sigh, had it in my wip pre-rebase tree when running tests. I've revert it from the
> tree since this needs to be addressed. Sorry for the trouble.

Thanks, I'm looking into it now. Not sure how I missed it in the
selftests; either it was an older branch or I missed the test somehow.
I've updated the toolchain and kernel now so it shouldn't happen again.

Thanks,
John
Alexei Starovoitov Jan. 30, 2020, 12:04 a.m. UTC | #5
On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
> Daniel Borkmann wrote:
> > On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
> > > On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >>>
> > >>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> > >>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > >>
> > >> Applied, thanks!
> > > 
> > > Daniel,
> > > did you run the selftests before applying?
> > > This patch breaks two.
> > > We have to find a different fix.
> > > 
> > > ./test_progs -t get_stack
> > > 68: (85) call bpf_get_stack#67
> > >   R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
> > > R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
> > > 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
> > > 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
> > > R2 unbounded memory access, make sure to bounds check any array access
> > > into a map
> > 
> > Sigh, had it in my wip pre-rebase tree when running tests. I've revert it from the
> > tree since this needs to be addressed. Sorry for the trouble.
> 
> Thanks I'm looking into it now. Not sure how I missed it on
> selftests either older branch or I missed the test somehow. I've
> updated toolchain and kernel now so shouldn't happen again.

Looks like smax_value was nuked by <<32 >>32 shifts.
53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
55: (c7) r8 s>>= 32
; if (usize < 0)
56: (c5) if r8 s< 0x0 goto pc+28
// and here "less than zero check" doesn't help anymore.

Not sure how to fix it yet, but the code pattern used in
progs/test_get_stack_rawtp.c
is real. Plenty of bpf progs rely on this.
John Fastabend Jan. 30, 2020, 5:38 p.m. UTC | #6
Alexei Starovoitov wrote:
> On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
> > Daniel Borkmann wrote:
> > > On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
> > > > On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >>>
> > > >>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> > > >>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > > >>
> > > >> Applied, thanks!
> > > > 
> > > > Daniel,
> > > > did you run the selftests before applying?
> > > > This patch breaks two.
> > > > We have to find a different fix.
> > > > 
> > > > ./test_progs -t get_stack
> > > > 68: (85) call bpf_get_stack#67
> > > >   R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
> > > > R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
> > > > 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
> > > > 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
> > > > R2 unbounded memory access, make sure to bounds check any array access
> > > > into a map
> > > 
> > > Sigh, had it in my wip pre-rebase tree when running tests. I've revert it from the
> > > tree since this needs to be addressed. Sorry for the trouble.
> > 
> > Thanks I'm looking into it now. Not sure how I missed it on
> > selftests either older branch or I missed the test somehow. I've
> > updated toolchain and kernel now so shouldn't happen again.
> 
> Looks like smax_value was nuked by <<32 >>32 shifts.
> 53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
> 54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
> 55: (c7) r8 s>>= 32
> ; if (usize < 0)
> 56: (c5) if r8 s< 0x0 goto pc+28
> // and here "less than zero check" doesn't help anymore.
> 
> Not sure how to fix it yet, but the code pattern used in
> progs/test_get_stack_rawtp.c
> is real. Plenty of bpf progs rely on this.

OK, I see what happened: I have some patches on my llvm tree and forgot to
pop them off before running selftests :/ This <<=32 s>>=32 pattern pops up
in a few places for us and causes verifier trouble whenever it is hit.

I think I have a fix for this in llvm, if that is OK. And we can make
the BPF_RSH and BPF_LSH verifier bounds tighter if we also define the
architecture expectation on the jit side. For example x86 jit code here,

146:   shl    $0x20,%rdi
14a:   shr    $0x20,%rdi

the shr will clear the most significant bit so we can say something about
the min sign value. I'll generate a couple patches today and send them
out to discuss. Probably easier to explain with code and examples.

Thanks,
John
Alexei Starovoitov Jan. 30, 2020, 5:59 p.m. UTC | #7
On Thu, Jan 30, 2020 at 09:38:07AM -0800, John Fastabend wrote:
> Alexei Starovoitov wrote:
> > On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
> > > Daniel Borkmann wrote:
> > > > On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
> > > > > On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > >>>
> > > > >>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> > > > >>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > > > >>
> > > > >> Applied, thanks!
> > > > > 
> > > > > Daniel,
> > > > > did you run the selftests before applying?
> > > > > This patch breaks two.
> > > > > We have to find a different fix.
> > > > > 
> > > > > ./test_progs -t get_stack
> > > > > 68: (85) call bpf_get_stack#67
> > > > >   R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
> > > > > R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
> > > > > 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
> > > > > 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
> > > > > R2 unbounded memory access, make sure to bounds check any array access
> > > > > into a map
> > > > 
> > > > Sigh, had it in my wip pre-rebase tree when running tests. I've revert it from the
> > > > tree since this needs to be addressed. Sorry for the trouble.
> > > 
> > > Thanks I'm looking into it now. Not sure how I missed it on
> > > selftests either older branch or I missed the test somehow. I've
> > > updated toolchain and kernel now so shouldn't happen again.
> > 
> > Looks like smax_value was nuked by <<32 >>32 shifts.
> > 53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
> > 54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
> > 55: (c7) r8 s>>= 32
> > ; if (usize < 0)
> > 56: (c5) if r8 s< 0x0 goto pc+28
> > // and here "less than zero check" doesn't help anymore.
> > 
> > Not sure how to fix it yet, but the code pattern used in
> > progs/test_get_stack_rawtp.c
> > is real. Plenty of bpf progs rely on this.
> 
> OK I see what happened I have some patches on my llvm tree and forgot to
> pop them off before running selftests :/ These <<=32 s>>=32 pattern pops up
> in a few places for us and causes verifier trouble whenever it is hit.
> 
> I think I have a fix for this in llvm, if that is OK. And we can make
> the BPF_RSH and BPF_LSH verifier bounds tighter if we also define the
> architecture expectation on the jit side. For example x86 jit code here,
> 
> 146:   shl    $0x20,%rdi
> 14a:   shr    $0x20,%rdi
> 
> the shr will clear the most significant bit so we can say something about
> the min sign value. I'll generate a couple patches today and send them
> out to discuss. Probably easier to explain with code and examples.

How about we detect this pattern on the verifier side and replace it with
a pseudo insn that does a 32-bit sign extend? Most archs have a special
cpu instruction to do this much more efficiently than two shifts.
If a JIT doesn't implement that pseudo yet, the verifier can convert it
back to two shifts.
Then during verification it will process the pseudo_sign_extend op easily.
So the idea:
1. pattern match the sequence of two shifts (a rough sketch follows
   below) in a pass similar to replace_map_fd_with_map_ptr() before the
   main do_check()
2. pseudo_sign_extend gets processed in do_check(), doing the right thing
   with bpf_reg_state.
3. JIT this pseudo insn or convert it back

Long term we can upgrade this pseudo insn into uapi and let llvm emit it.
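
For illustration only, the detection in step 1 could look something like
the sketch below; the pseudo-op itself and the rewrite pass are omitted,
and none of this is existing kernel code, so treat the name and structure
as hypothetical:

 /* Hypothetical helper: detect "rX <<= 32; rX s>>= 32" on the same
  * register, i.e. the two-shift form of a 32-bit sign extend.
  */
 static bool insn_pair_is_sext32(const struct bpf_insn *a,
 				const struct bpf_insn *b)
 {
 	return a->code == (BPF_ALU64 | BPF_LSH | BPF_K) && a->imm == 32 &&
 	       b->code == (BPF_ALU64 | BPF_ARSH | BPF_K) && b->imm == 32 &&
 	       a->dst_reg == b->dst_reg;
 }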
John Fastabend Jan. 30, 2020, 11:34 p.m. UTC | #8
Alexei Starovoitov wrote:
> On Thu, Jan 30, 2020 at 09:38:07AM -0800, John Fastabend wrote:
> > Alexei Starovoitov wrote:
> > > On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
> > > > Daniel Borkmann wrote:
> > > > > On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
> > > > > > On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > > >>>
> > > > > >>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> > > > > >>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > > > > >>
> > > > > >> Applied, thanks!
> > > > > > 
> > > > > > Daniel,
> > > > > > did you run the selftests before applying?
> > > > > > This patch breaks two.
> > > > > > We have to find a different fix.
> > > > > > 
> > > > > > ./test_progs -t get_stack
> > > > > > 68: (85) call bpf_get_stack#67
> > > > > >   R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
> > > > > > R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
> > > > > > 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
> > > > > > 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
> > > > > > R2 unbounded memory access, make sure to bounds check any array access
> > > > > > into a map
> > > > > 
> > > > > Sigh, had it in my wip pre-rebase tree when running tests. I've revert it from the
> > > > > tree since this needs to be addressed. Sorry for the trouble.
> > > > 
> > > > Thanks I'm looking into it now. Not sure how I missed it on
> > > > selftests either older branch or I missed the test somehow. I've
> > > > updated toolchain and kernel now so shouldn't happen again.
> > > 
> > > Looks like smax_value was nuked by <<32 >>32 shifts.
> > > 53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
> > > 54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
> > > 55: (c7) r8 s>>= 32
> > > ; if (usize < 0)
> > > 56: (c5) if r8 s< 0x0 goto pc+28
> > > // and here "less than zero check" doesn't help anymore.
> > > 
> > > Not sure how to fix it yet, but the code pattern used in
> > > progs/test_get_stack_rawtp.c
> > > is real. Plenty of bpf progs rely on this.
> > 
> > OK I see what happened I have some patches on my llvm tree and forgot to
> > pop them off before running selftests :/ These <<=32 s>>=32 pattern pops up
> > in a few places for us and causes verifier trouble whenever it is hit.
> > 
> > I think I have a fix for this in llvm, if that is OK. And we can make
> > the BPF_RSH and BPF_LSH verifier bounds tighter if we also define the
> > architecture expectation on the jit side. For example x86 jit code here,
> > 
> > 146:   shl    $0x20,%rdi
> > 14a:   shr    $0x20,%rdi
> > 
> > the shr will clear the most significant bit so we can say something about
> > the min sign value. I'll generate a couple patches today and send them
> > out to discuss. Probably easier to explain with code and examples.
> 
> How about we detect this pattern on the verifier side and replace with
> pseudo insn that will do 32-bit sign extend. Most archs have special
> cpu instruction to do this much more efficiently than two shifts.
> If JIT doesn't implement that pseudo yet the verifier can convert
> it back to two shifts.
> Then during verification it will process pseudo_sign_extend op easily.
> So the idea:
> 1. pattern match sequence of two shifts in a pass similar to
>    replace_map_fd_with_map_ptr() before main do_check()
> 2. pseudo_sign_extend gets process in do_check() doing the right thing
>    with bpf_reg_state.
> 3. JIT this pseudo insn or convert back
> 
> Long term we can upgrade this pseudo insn into uapi and let llvm emit it.

I'm not sure pattern matching in the verifier is best. This particular
case of lsh/rsh games is the result of the BPF backend generating it from
an LLVM IR zext.

Here is the LLVM IR generated from test_get_stack_rawtp that produces
the zext.


 %26 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32, i64)*)(i8* %0, i8* nonnull %23, i32 800, i64 256) #3, !dbg !166
  call void @llvm.dbg.value(metadata i32 %26, metadata !124, metadata !DIExpression()), !dbg !130
  %27 = icmp slt i32 %26, 0, !dbg !167
  br i1 %27, label %41, label %28, !dbg !169

28:                                               ; preds = %25
  %29 = zext i32 %26 to i64, !dbg !170
  %30 = getelementptr i8, i8* %23, i64 %29, !dbg !170


Clang wants to do a zext because we are promoting an i32 to an i64. In the
BPF backend code we pattern match this as follows,

 def : Pat<(i64 (zext GPR32:$src)),
           (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;

Which generates the object code (again from test_get_stack_rawtp),

      56:       bc 81 00 00 00 00 00 00 w1 = w8
      57:       67 01 00 00 20 00 00 00 r1 <<= 32
      58:       77 01 00 00 20 00 00 00 r1 >>= 32

Unfortunately, this is a pretty poor form of zext from the verifier's point
of view; it completely nukes the bounds as you observed. So how about doing
something a bit simpler from the backend side? Noting that a 32-bit to
32-bit move zero-extends on x86, and we also make that assumption
elsewhere, it should be safe to implement the zext from the above object
dump as just the mov

  w1 = w8

Which we can implement in the backend with this patch,

diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
index 0f39294..a187103 100644
--- a/llvm/lib/Target/BPF/BPFInstrInfo.td
+++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
@@ -733,7 +733,7 @@ def : Pat<(i64 (sext GPR32:$src)),
           (SRA_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
 
 def : Pat<(i64 (zext GPR32:$src)),
-          (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
+          (MOV_32_64 GPR32:$src)>;
 
Now the new object code is simply,

      54:       c6 08 14 00 00 00 00 00 if w8 s< 0 goto +20 <LBB0_6>
      55:       1c 89 00 00 00 00 00 00 w9 -= w8
      56:       bc 81 00 00 00 00 00 00 w1 = w8
      57:       bf 72 00 00 00 00 00 00 r2 = r7
      58:       0f 12 00 00 00 00 00 00 r2 += r1
      59:       bf 61 00 00 00 00 00 00 r1 = r6
      60:       bc 93 00 00 00 00 00 00 w3 = w9
      61:       b7 04 00 00 00 00 00 00 r4 = 0
      62:       85 00 00 00 43 00 00 00 call 67
;       if (ksize < 0)

That is the block from your original trace. But one issue still remains:
the above llvm backend update alone doesn't fix the verifier problem
created by my patch, because in the false branch after line 54 above we
don't have the right bounds.

 53: (bc) w8 = w0
 ; if (usize < 0)
 54: (c6) if w8 s< 0x0 goto pc+20
  R0=inv(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R9=inv800 R10=fp0 fp-8=mmmm????
 ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
 55: (1c) w9 -= w8
 ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
 56: (bc) w1 = w8
 57: (bf) r2 = r7
 58: (0f) r2 += r1
  R0_rw=invP(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7_rw=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R9_rw=inv800 R10=fp0 fp-8=mmmm????
 parent didn't have regs=1 stack=0 marks
 last_idx 52 first_idx 40
 regs=1 stack=0 before 52: (85) call bpf_get_stack#67
 ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
 59: (bf) r1 = r6
 60: (bc) w3 = w9
 61: (b7) r4 = 0
 62: (85) call bpf_get_stack#67

 At line 54, R0 has bounds [SMIN, 800], which by line 53 are also the
 bounds in w8, remembering a 32-bit move will zero extend there. So we
 should expect the false branch to have bounds [0, 800], exactly as we
 want. But instead we only got a umax_value? Digging deeper, we are
 failing here,

 /* Return true if VAL is compared with a s64 sign extended from s32, and they
  * are with the same signedness.
  */
 static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg)
 {
         return ((s32)sval >= 0 &&
                 reg->smin_value >= 0 && reg->smax_value <= S32_MAX) ||
                ((s32)sval < 0 &&
                 reg->smax_value <= 0 && reg->smin_value >= S32_MIN);
 }

This appears too conservative. I'll need to analyze it a bit, but it
should be safe to relax it to catch the above <0 case. After that I expect
we should be passing again.

Sorry for the long thread, but those are the details. What do you think?
In the meantime I'll generate the relaxed bounds on cmp_val_with_extended
and see what we can cook up with Daniel. It avoids pseudo instructions
and pattern matching, which I think is a bit more general.

Thanks,
John
Yonghong Song Jan. 31, 2020, 12:15 a.m. UTC | #9
On 1/30/20 3:34 PM, John Fastabend wrote:
> Alexei Starovoitov wrote:
>> On Thu, Jan 30, 2020 at 09:38:07AM -0800, John Fastabend wrote:
>>> Alexei Starovoitov wrote:
>>>> On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
>>>>> Daniel Borkmann wrote:
>>>>>> On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>>>>>
>>>>>>>>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
>>>>>>>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>>>>>>>
>>>>>>>> Applied, thanks!
>>>>>>>
>>>>>>> Daniel,
>>>>>>> did you run the selftests before applying?
>>>>>>> This patch breaks two.
>>>>>>> We have to find a different fix.
>>>>>>>
>>>>>>> ./test_progs -t get_stack
>>>>>>> 68: (85) call bpf_get_stack#67
>>>>>>>    R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
>>>>>>> R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
>>>>>>> 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
>>>>>>> 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
>>>>>>> R2 unbounded memory access, make sure to bounds check any array access
>>>>>>> into a map
>>>>>>
>>>>>> Sigh, had it in my wip pre-rebase tree when running tests. I've revert it from the
>>>>>> tree since this needs to be addressed. Sorry for the trouble.
>>>>>
>>>>> Thanks I'm looking into it now. Not sure how I missed it on
>>>>> selftests either older branch or I missed the test somehow. I've
>>>>> updated toolchain and kernel now so shouldn't happen again.
>>>>
>>>> Looks like smax_value was nuked by <<32 >>32 shifts.
>>>> 53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
>>>> 54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
>>>> 55: (c7) r8 s>>= 32
>>>> ; if (usize < 0)
>>>> 56: (c5) if r8 s< 0x0 goto pc+28
>>>> // and here "less than zero check" doesn't help anymore.
>>>>
>>>> Not sure how to fix it yet, but the code pattern used in
>>>> progs/test_get_stack_rawtp.c
>>>> is real. Plenty of bpf progs rely on this.
>>>
>>> OK I see what happened I have some patches on my llvm tree and forgot to
>>> pop them off before running selftests :/ These <<=32 s>>=32 pattern pops up
>>> in a few places for us and causes verifier trouble whenever it is hit.
>>>
>>> I think I have a fix for this in llvm, if that is OK. And we can make
>>> the BPF_RSH and BPF_LSH verifier bounds tighter if we also define the
>>> architecture expectation on the jit side. For example x86 jit code here,
>>>
>>> 146:   shl    $0x20,%rdi
>>> 14a:   shr    $0x20,%rdi
>>>
>>> the shr will clear the most significant bit so we can say something about
>>> the min sign value. I'll generate a couple patches today and send them
>>> out to discuss. Probably easier to explain with code and examples.
>>
>> How about we detect this pattern on the verifier side and replace with
>> pseudo insn that will do 32-bit sign extend. Most archs have special
>> cpu instruction to do this much more efficiently than two shifts.
>> If JIT doesn't implement that pseudo yet the verifier can convert
>> it back to two shifts.
>> Then during verification it will process pseudo_sign_extend op easily.
>> So the idea:
>> 1. pattern match sequence of two shifts in a pass similar to
>>     replace_map_fd_with_map_ptr() before main do_check()
>> 2. pseudo_sign_extend gets process in do_check() doing the right thing
>>     with bpf_reg_state.
>> 3. JIT this pseudo insn or convert back
>>
>> Long term we can upgrade this pseudo insn into uapi and let llvm emit it.
> 
> I'm not sure pattern matching in the verifier is best. This paticular
> case of lsh/rsh games is the result of BPF backend generating it from
> a LLVM IR zext.
> 
> Here is the LLVM IR generated from test_get_stack_rawtp that produces
> the zext.
> 
> 
>   %26 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32, i64)*)(i8* %0, i8* nonnull %23, i32 800, i64 256) #3, !dbg !166
>    call void @llvm.dbg.value(metadata i32 %26, metadata !124, metadata !DIExpression()), !dbg !130
>    %27 = icmp slt i32 %26, 0, !dbg !167
>    br i1 %27, label %41, label %28, !dbg !169
> 
> 28:                                               ; preds = %25
>    %29 = zext i32 %26 to i64, !dbg !170
>    %30 = getelementptr i8, i8* %23, i64 %29, !dbg !170
> 
> 
> Clang wants to do zext because we are promoting a i32 to i64. In the
> BPF backend code we pattern match this as follows,
> 
>   def : Pat<(i64 (zext GPR32:$src)),
>             (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> 
> Which generates the object code (again from test_get_stack_rawtp),
> 
>        56:       bc 81 00 00 00 00 00 00 w1 = w8
>        57:       67 01 00 00 20 00 00 00 r1 <<= 32
>        58:       77 01 00 00 20 00 00 00 r1 >>= 32
> 
> Unfortunately, this is a pretty poor form of zext from the verifiers point
> of view it completely nukes bounds as you observed. So how about doing
> something a bit simpler from the backend side. Noting that moving 32bit
> into 32bit zero-extends on x86 and we also make that assumption elsewhere
> so it should be safe to implement the zext from above object dump as just
> the mov
> 
>    w1 = w8
> 
> Which we can implement in the backend with this patch,
> 
> diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
> index 0f39294..a187103 100644
> --- a/llvm/lib/Target/BPF/BPFInstrInfo.td
> +++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
> @@ -733,7 +733,7 @@ def : Pat<(i64 (sext GPR32:$src)),
>             (SRA_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
>   
>   def : Pat<(i64 (zext GPR32:$src)),
> -          (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> +          (MOV_32_64 GPR32:$src)>;

Thanks. This should work.

Also, in the BPF backend, we have a phase in BPFMIPeephole.cpp to
remove such <<= and >>= pairs in general. It may miss some cases.

But the verifier seems to handle <<= and >>= correctly, right?
Even if we have them, the verifier should reach the same conclusion
compared to not having them, right?

>   
> Now the new object code is simply,
> 
>        54:       c6 08 14 00 00 00 00 00 if w8 s< 0 goto +20 <LBB0_6>
>        55:       1c 89 00 00 00 00 00 00 w9 -= w8
>        56:       bc 81 00 00 00 00 00 00 w1 = w8
>        57:       bf 72 00 00 00 00 00 00 r2 = r7
>        58:       0f 12 00 00 00 00 00 00 r2 += r1
>        59:       bf 61 00 00 00 00 00 00 r1 = r6
>        60:       bc 93 00 00 00 00 00 00 w3 = w9
>        61:       b7 04 00 00 00 00 00 00 r4 = 0
>        62:       85 00 00 00 43 00 00 00 call 67
> ;       if (ksize < 0)
> 
> That is the block from your originally trace. But one issue still
> remains and just the above llvm backend update doesn't fix the verifier
> problem created by my patch because in the false branch after line 54
> above we don't have the right bounds.
> 
>   53: (bc) w8 = w0
>   ; if (usize < 0)
>   54: (c6) if w8 s< 0x0 goto pc+20
>    R0=inv(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R9=inv800 R10=fp0 fp-8=mmmm????
>   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>   55: (1c) w9 -= w8
>   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>   56: (bc) w1 = w8
>   57: (bf) r2 = r7
>   58: (0f) r2 += r1
>    R0_rw=invP(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7_rw=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R9_rw=inv800 R10=fp0 fp-8=mmmm????
>   parent didn't have regs=1 stack=0 marks
>   last_idx 52 first_idx 40
>   regs=1 stack=0 before 52: (85) call bpf_get_stack#67
>   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>   59: (bf) r1 = r6
>   60: (bc) w3 = w9
>   61: (b7) r4 = 0
>   62: (85) call bpf_get_stack#67
> 
> At line :54 R0 has bounds [SMIN, 800] which by 53: are the bounds in
> w8 remembering a load will zero extend there.  So we should expect
> the false branch to have bounds [0, 800] exactly as we want. But,
> we instead got only a umax_value? Digging deeper we are failing here,
> 
>   /* Return true if VAL is compared with a s64 sign extended from s32, and they
>    * are with the same signedness.
>    */
>   static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg)
>   {
>           return ((s32)sval >= 0 &&
>                   reg->smin_value >= 0 && reg->smax_value <= S32_MAX) ||
>                  ((s32)sval < 0 &&
>                   reg->smax_value <= 0 && reg->smin_value >= S32_MIN);
>   }
> 
> This appears to conservative. I'll need to analyze that a bit but it
> should be safe to relax to catch above <0 case. After that I expect
> we should be passing again.
> 
> Sorry for the long thread but those are the details. What do you think,
> in the meantime I'll generate the relaxed bounds on cmp_val_with_extended
> and see what we can cook up with Daniel. It avoid pseudo instructions
> and pattern matching which I think is a bit more general.
> 
> Thanks,
> John
>
Yonghong Song Jan. 31, 2020, 12:28 a.m. UTC | #10
On 1/30/20 3:34 PM, John Fastabend wrote:
> Alexei Starovoitov wrote:
>> On Thu, Jan 30, 2020 at 09:38:07AM -0800, John Fastabend wrote:
>>> Alexei Starovoitov wrote:
>>>> On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
>>>>> Daniel Borkmann wrote:
>>>>>> On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>>>>>
>>>>>>>>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
>>>>>>>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>>>>>>>
>>>>>>>> Applied, thanks!
>>>>>>>
>>>>>>> Daniel,
>>>>>>> did you run the selftests before applying?
>>>>>>> This patch breaks two.
>>>>>>> We have to find a different fix.
>>>>>>>
>>>>>>> ./test_progs -t get_stack
>>>>>>> 68: (85) call bpf_get_stack#67
>>>>>>>    R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
>>>>>>> R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
>>>>>>> 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
>>>>>>> 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
>>>>>>> R2 unbounded memory access, make sure to bounds check any array access
>>>>>>> into a map
>>>>>>
>>>>>> Sigh, had it in my wip pre-rebase tree when running tests. I've revert it from the
>>>>>> tree since this needs to be addressed. Sorry for the trouble.
>>>>>
>>>>> Thanks I'm looking into it now. Not sure how I missed it on
>>>>> selftests either older branch or I missed the test somehow. I've
>>>>> updated toolchain and kernel now so shouldn't happen again.
>>>>
>>>> Looks like smax_value was nuked by <<32 >>32 shifts.
>>>> 53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
>>>> 54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
>>>> 55: (c7) r8 s>>= 32
>>>> ; if (usize < 0)
>>>> 56: (c5) if r8 s< 0x0 goto pc+28
>>>> // and here "less than zero check" doesn't help anymore.
>>>>
>>>> Not sure how to fix it yet, but the code pattern used in
>>>> progs/test_get_stack_rawtp.c
>>>> is real. Plenty of bpf progs rely on this.
>>>
>>> OK I see what happened I have some patches on my llvm tree and forgot to
>>> pop them off before running selftests :/ These <<=32 s>>=32 pattern pops up
>>> in a few places for us and causes verifier trouble whenever it is hit.
>>>
>>> I think I have a fix for this in llvm, if that is OK. And we can make
>>> the BPF_RSH and BPF_LSH verifier bounds tighter if we also define the
>>> architecture expectation on the jit side. For example x86 jit code here,
>>>
>>> 146:   shl    $0x20,%rdi
>>> 14a:   shr    $0x20,%rdi
>>>
>>> the shr will clear the most significant bit so we can say something about
>>> the min sign value. I'll generate a couple patches today and send them
>>> out to discuss. Probably easier to explain with code and examples.
>>
>> How about we detect this pattern on the verifier side and replace with
>> pseudo insn that will do 32-bit sign extend. Most archs have special
>> cpu instruction to do this much more efficiently than two shifts.
>> If JIT doesn't implement that pseudo yet the verifier can convert
>> it back to two shifts.
>> Then during verification it will process pseudo_sign_extend op easily.
>> So the idea:
>> 1. pattern match sequence of two shifts in a pass similar to
>>     replace_map_fd_with_map_ptr() before main do_check()
>> 2. pseudo_sign_extend gets process in do_check() doing the right thing
>>     with bpf_reg_state.
>> 3. JIT this pseudo insn or convert back
>>
>> Long term we can upgrade this pseudo insn into uapi and let llvm emit it.
> 
> I'm not sure pattern matching in the verifier is best. This paticular
> case of lsh/rsh games is the result of BPF backend generating it from
> a LLVM IR zext.
> 
> Here is the LLVM IR generated from test_get_stack_rawtp that produces
> the zext.
> 
> 
>   %26 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32, i64)*)(i8* %0, i8* nonnull %23, i32 800, i64 256) #3, !dbg !166
>    call void @llvm.dbg.value(metadata i32 %26, metadata !124, metadata !DIExpression()), !dbg !130
>    %27 = icmp slt i32 %26, 0, !dbg !167
>    br i1 %27, label %41, label %28, !dbg !169
> 
> 28:                                               ; preds = %25
>    %29 = zext i32 %26 to i64, !dbg !170
>    %30 = getelementptr i8, i8* %23, i64 %29, !dbg !170
> 
> 
> Clang wants to do zext because we are promoting a i32 to i64. In the
> BPF backend code we pattern match this as follows,
> 
>   def : Pat<(i64 (zext GPR32:$src)),
>             (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> 
> Which generates the object code (again from test_get_stack_rawtp),
> 
>        56:       bc 81 00 00 00 00 00 00 w1 = w8
>        57:       67 01 00 00 20 00 00 00 r1 <<= 32
>        58:       77 01 00 00 20 00 00 00 r1 >>= 32
> 
> Unfortunately, this is a pretty poor form of zext from the verifiers point
> of view it completely nukes bounds as you observed. So how about doing
> something a bit simpler from the backend side. Noting that moving 32bit
> into 32bit zero-extends on x86 and we also make that assumption elsewhere
> so it should be safe to implement the zext from above object dump as just
> the mov
> 
>    w1 = w8
> 
> Which we can implement in the backend with this patch,
> 
> diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
> index 0f39294..a187103 100644
> --- a/llvm/lib/Target/BPF/BPFInstrInfo.td
> +++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
> @@ -733,7 +733,7 @@ def : Pat<(i64 (sext GPR32:$src)),
>             (SRA_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
>   
>   def : Pat<(i64 (zext GPR32:$src)),
> -          (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> +          (MOV_32_64 GPR32:$src)>;
>   
> Now the new object code is simply,
> 
>        54:       c6 08 14 00 00 00 00 00 if w8 s< 0 goto +20 <LBB0_6>
>        55:       1c 89 00 00 00 00 00 00 w9 -= w8
>        56:       bc 81 00 00 00 00 00 00 w1 = w8
>        57:       bf 72 00 00 00 00 00 00 r2 = r7
>        58:       0f 12 00 00 00 00 00 00 r2 += r1
>        59:       bf 61 00 00 00 00 00 00 r1 = r6
>        60:       bc 93 00 00 00 00 00 00 w3 = w9
>        61:       b7 04 00 00 00 00 00 00 r4 = 0
>        62:       85 00 00 00 43 00 00 00 call 67
> ;       if (ksize < 0)
> 
> That is the block from your originally trace. But one issue still
> remains and just the above llvm backend update doesn't fix the verifier
> problem created by my patch because in the false branch after line 54
> above we don't have the right bounds.
> 
>   53: (bc) w8 = w0
>   ; if (usize < 0)
>   54: (c6) if w8 s< 0x0 goto pc+20
>    R0=inv(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R9=inv800 R10=fp0 fp-8=mmmm????
>   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>   55: (1c) w9 -= w8
>   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>   56: (bc) w1 = w8
>   57: (bf) r2 = r7
>   58: (0f) r2 += r1
>    R0_rw=invP(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7_rw=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R9_rw=inv800 R10=fp0 fp-8=mmmm????
>   parent didn't have regs=1 stack=0 marks
>   last_idx 52 first_idx 40
>   regs=1 stack=0 before 52: (85) call bpf_get_stack#67
>   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>   59: (bf) r1 = r6
>   60: (bc) w3 = w9
>   61: (b7) r4 = 0
>   62: (85) call bpf_get_stack#67
> 
> At line :54 R0 has bounds [SMIN, 800] which by 53: are the bounds in
> w8 remembering a load will zero extend there.  So we should expect
> the false branch to have bounds [0, 800] exactly as we want. But,
> we instead got only a umax_value? Digging deeper we are failing here,
> 
>   /* Return true if VAL is compared with a s64 sign extended from s32, and they
>    * are with the same signedness.
>    */
>   static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg)
>   {
>           return ((s32)sval >= 0 &&
>                   reg->smin_value >= 0 && reg->smax_value <= S32_MAX) ||
>                  ((s32)sval < 0 &&
>                   reg->smax_value <= 0 && reg->smin_value >= S32_MIN);
>   }
> 
> This appears to conservative. I'll need to analyze that a bit but it
> should be safe to relax to catch above <0 case. After that I expect
> we should be passing again.

Yes, this is the place where I have a problem as well. In my case, the
top 32 bits may not be 0; they will somehow be magically cleared later
through w_x = w_y style ALU operations.

> 
> Sorry for the long thread but those are the details. What do you think,
> in the meantime I'll generate the relaxed bounds on cmp_val_with_extended
> and see what we can cook up with Daniel. It avoid pseudo instructions
> and pattern matching which I think is a bit more general.
> 
> Thanks,
> John
>
John Fastabend Jan. 31, 2020, 12:44 a.m. UTC | #11
Yonghong Song wrote:
> 
> 
> On 1/30/20 3:34 PM, John Fastabend wrote:
> > Alexei Starovoitov wrote:
> >> On Thu, Jan 30, 2020 at 09:38:07AM -0800, John Fastabend wrote:
> >>> Alexei Starovoitov wrote:
> >>>> On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
> >>>>> Daniel Borkmann wrote:
> >>>>>> On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
> >>>>>>> On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>>>>>>>
> >>>>>>>>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> >>>>>>>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >>>>>>>>
> >>>>>>>> Applied, thanks!
> >>>>>>>
> >>>>>>> Daniel,
> >>>>>>> did you run the selftests before applying?
> >>>>>>> This patch breaks two.
> >>>>>>> We have to find a different fix.
> >>>>>>>
> >>>>>>> ./test_progs -t get_stack
> >>>>>>> 68: (85) call bpf_get_stack#67
> >>>>>>>    R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
> >>>>>>> R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
> >>>>>>> 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
> >>>>>>> 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
> >>>>>>> R2 unbounded memory access, make sure to bounds check any array access
> >>>>>>> into a map
> >>>>>>
> >>>>>> Sigh, had it in my wip pre-rebase tree when running tests. I've revert it from the
> >>>>>> tree since this needs to be addressed. Sorry for the trouble.
> >>>>>
> >>>>> Thanks I'm looking into it now. Not sure how I missed it on
> >>>>> selftests either older branch or I missed the test somehow. I've
> >>>>> updated toolchain and kernel now so shouldn't happen again.
> >>>>
> >>>> Looks like smax_value was nuked by <<32 >>32 shifts.
> >>>> 53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
> >>>> 54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
> >>>> 55: (c7) r8 s>>= 32
> >>>> ; if (usize < 0)
> >>>> 56: (c5) if r8 s< 0x0 goto pc+28
> >>>> // and here "less than zero check" doesn't help anymore.
> >>>>
> >>>> Not sure how to fix it yet, but the code pattern used in
> >>>> progs/test_get_stack_rawtp.c
> >>>> is real. Plenty of bpf progs rely on this.
> >>>
> >>> OK I see what happened I have some patches on my llvm tree and forgot to
> >>> pop them off before running selftests :/ These <<=32 s>>=32 pattern pops up
> >>> in a few places for us and causes verifier trouble whenever it is hit.
> >>>
> >>> I think I have a fix for this in llvm, if that is OK. And we can make
> >>> the BPF_RSH and BPF_LSH verifier bounds tighter if we also define the
> >>> architecture expectation on the jit side. For example x86 jit code here,
> >>>
> >>> 146:   shl    $0x20,%rdi
> >>> 14a:   shr    $0x20,%rdi
> >>>
> >>> the shr will clear the most significant bit so we can say something about
> >>> the min sign value. I'll generate a couple patches today and send them
> >>> out to discuss. Probably easier to explain with code and examples.
> >>
> >> How about we detect this pattern on the verifier side and replace with
> >> pseudo insn that will do 32-bit sign extend. Most archs have special
> >> cpu instruction to do this much more efficiently than two shifts.
> >> If JIT doesn't implement that pseudo yet the verifier can convert
> >> it back to two shifts.
> >> Then during verification it will process pseudo_sign_extend op easily.
> >> So the idea:
> >> 1. pattern match sequence of two shifts in a pass similar to
> >>     replace_map_fd_with_map_ptr() before main do_check()
> >> 2. pseudo_sign_extend gets process in do_check() doing the right thing
> >>     with bpf_reg_state.
> >> 3. JIT this pseudo insn or convert back
> >>
> >> Long term we can upgrade this pseudo insn into uapi and let llvm emit it.
> > 
> > I'm not sure pattern matching in the verifier is best. This paticular
> > case of lsh/rsh games is the result of BPF backend generating it from
> > a LLVM IR zext.
> > 
> > Here is the LLVM IR generated from test_get_stack_rawtp that produces
> > the zext.
> > 
> > 
> >   %26 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32, i64)*)(i8* %0, i8* nonnull %23, i32 800, i64 256) #3, !dbg !166
> >    call void @llvm.dbg.value(metadata i32 %26, metadata !124, metadata !DIExpression()), !dbg !130
> >    %27 = icmp slt i32 %26, 0, !dbg !167
> >    br i1 %27, label %41, label %28, !dbg !169
> > 
> > 28:                                               ; preds = %25
> >    %29 = zext i32 %26 to i64, !dbg !170
> >    %30 = getelementptr i8, i8* %23, i64 %29, !dbg !170
> > 
> > 
> > Clang wants to do zext because we are promoting a i32 to i64. In the
> > BPF backend code we pattern match this as follows,
> > 
> >   def : Pat<(i64 (zext GPR32:$src)),
> >             (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> > 
> > Which generates the object code (again from test_get_stack_rawtp),
> > 
> >        56:       bc 81 00 00 00 00 00 00 w1 = w8
> >        57:       67 01 00 00 20 00 00 00 r1 <<= 32
> >        58:       77 01 00 00 20 00 00 00 r1 >>= 32
> > 
> > Unfortunately, this is a pretty poor form of zext from the verifiers point
> > of view it completely nukes bounds as you observed. So how about doing
> > something a bit simpler from the backend side. Noting that moving 32bit
> > into 32bit zero-extends on x86 and we also make that assumption elsewhere
> > so it should be safe to implement the zext from above object dump as just
> > the mov
> > 
> >    w1 = w8
> > 
> > Which we can implement in the backend with this patch,
> > 
> > diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
> > index 0f39294..a187103 100644
> > --- a/llvm/lib/Target/BPF/BPFInstrInfo.td
> > +++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
> > @@ -733,7 +733,7 @@ def : Pat<(i64 (sext GPR32:$src)),
> >             (SRA_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> >   
> >   def : Pat<(i64 (zext GPR32:$src)),
> > -          (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> > +          (MOV_32_64 GPR32:$src)>;
> 
> Thanks. This should work.
> 
> Also, in BPF backend, we have a phase in BPFMIPeephole.cpp to
> remove such <<= and >>= in general. It may miss some cases.
> 
> But verifier seems handling <<= and >>= correctly, right?
> Even we have it, the verifier should reach the same conclusion
> compared to not having it, right?
> 

No, the verifier marks it unknown and doesn't track it fully. We
can perhaps improve the verifier, but the above is a nice
fix/improvement for the backend imo regardless.

	case BPF_LSH:
		if (umax_val >= insn_bitness) {
			/* Shifts greater than 31 or 63 are undefined.
			 * This includes shifts by a negative number.
			 */
			mark_reg_unknown(env, regs, insn->dst_reg);
			break;
		}
		/* We lose all sign bit information (except what we can pick
		 * up from var_off)
		 */

> >   
> > Now the new object code is simply,
> > 
> >        54:       c6 08 14 00 00 00 00 00 if w8 s< 0 goto +20 <LBB0_6>
> >        55:       1c 89 00 00 00 00 00 00 w9 -= w8
> >        56:       bc 81 00 00 00 00 00 00 w1 = w8
> >        57:       bf 72 00 00 00 00 00 00 r2 = r7
> >        58:       0f 12 00 00 00 00 00 00 r2 += r1
> >        59:       bf 61 00 00 00 00 00 00 r1 = r6
> >        60:       bc 93 00 00 00 00 00 00 w3 = w9
> >        61:       b7 04 00 00 00 00 00 00 r4 = 0
> >        62:       85 00 00 00 43 00 00 00 call 67
> > ;       if (ksize < 0)
> > 
> > That is the block from your originally trace. But one issue still
> > remains and just the above llvm backend update doesn't fix the verifier
> > problem created by my patch because in the false branch after line 54
> > above we don't have the right bounds.
> > 
> >   53: (bc) w8 = w0
> >   ; if (usize < 0)
> >   54: (c6) if w8 s< 0x0 goto pc+20
> >    R0=inv(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R9=inv800 R10=fp0 fp-8=mmmm????
> >   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> >   55: (1c) w9 -= w8
> >   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> >   56: (bc) w1 = w8
> >   57: (bf) r2 = r7
> >   58: (0f) r2 += r1
> >    R0_rw=invP(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7_rw=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R9_rw=inv800 R10=fp0 fp-8=mmmm????
> >   parent didn't have regs=1 stack=0 marks
> >   last_idx 52 first_idx 40
> >   regs=1 stack=0 before 52: (85) call bpf_get_stack#67
> >   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> >   59: (bf) r1 = r6
> >   60: (bc) w3 = w9
> >   61: (b7) r4 = 0
> >   62: (85) call bpf_get_stack#67
> > 
> > At line :54 R0 has bounds [SMIN, 800] which by 53: are the bounds in
> > w8 remembering a load will zero extend there.  So we should expect
> > the false branch to have bounds [0, 800] exactly as we want. But,
> > we instead got only a umax_value? Digging deeper we are failing here,
> > 
> >   /* Return true if VAL is compared with a s64 sign extended from s32, and they
> >    * are with the same signedness.
> >    */
> >   static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg)
> >   {
> >           return ((s32)sval >= 0 &&
> >                   reg->smin_value >= 0 && reg->smax_value <= S32_MAX) ||
> >                  ((s32)sval < 0 &&
> >                   reg->smax_value <= 0 && reg->smin_value >= S32_MIN);
> >   }
> > 
> > This appears to conservative. I'll need to analyze that a bit but it
> > should be safe to relax to catch above <0 case. After that I expect
> > we should be passing again.
> > 
> > Sorry for the long thread but those are the details. What do you think,
> > in the meantime I'll generate the relaxed bounds on cmp_val_with_extended
> > and see what we can cook up with Daniel. It avoid pseudo instructions
> > and pattern matching which I think is a bit more general.
> > 
> > Thanks,
> > John
> >
John Fastabend Jan. 31, 2020, 12:48 a.m. UTC | #12
Yonghong Song wrote:
> 
> 
> On 1/30/20 3:34 PM, John Fastabend wrote:
> > Alexei Starovoitov wrote:
> >> On Thu, Jan 30, 2020 at 09:38:07AM -0800, John Fastabend wrote:
> >>> Alexei Starovoitov wrote:
> >>>> On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
> >>>>> Daniel Borkmann wrote:
> >>>>>> On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
> >>>>>>> On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>>>>>>>
> >>>>>>>>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> >>>>>>>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >>>>>>>>
> >>>>>>>> Applied, thanks!
> >>>>>>>
> >>>>>>> Daniel,
> >>>>>>> did you run the selftests before applying?
> >>>>>>> This patch breaks two.
> >>>>>>> We have to find a different fix.
> >>>>>>>
> >>>>>>> ./test_progs -t get_stack
> >>>>>>> 68: (85) call bpf_get_stack#67
> >>>>>>>    R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
> >>>>>>> R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
> >>>>>>> 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
> >>>>>>> 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
> >>>>>>> R2 unbounded memory access, make sure to bounds check any array access
> >>>>>>> into a map
> >>>>>>
> >>>>>> Sigh, had it in my wip pre-rebase tree when running tests. I've reverted it from the
> >>>>>> tree since this needs to be addressed. Sorry for the trouble.
> >>>>>
> >>>>> Thanks, I'm looking into it now. Not sure how I missed it on
> >>>>> selftests; either an older branch or I missed the test somehow. I've
> >>>>> updated toolchain and kernel now so shouldn't happen again.
> >>>>
> >>>> Looks like smax_value was nuked by <<32 >>32 shifts.
> >>>> 53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
> >>>> 54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
> >>>> 55: (c7) r8 s>>= 32
> >>>> ; if (usize < 0)
> >>>> 56: (c5) if r8 s< 0x0 goto pc+28
> >>>> // and here "less than zero check" doesn't help anymore.
> >>>>
> >>>> Not sure how to fix it yet, but the code pattern used in
> >>>> progs/test_get_stack_rawtp.c
> >>>> is real. Plenty of bpf progs rely on this.
> >>>
> >>> OK, I see what happened: I have some patches on my llvm tree and forgot to
> >>> pop them off before running selftests :/ This <<=32 s>>=32 pattern pops up
> >>> in a few places for us and causes verifier trouble whenever it is hit.
> >>>
> >>> I think I have a fix for this in llvm, if that is OK. And we can make
> >>> the BPF_RSH and BPF_LSH verifier bounds tighter if we also define the
> >>> architecture expectation on the jit side. For example x86 jit code here,
> >>>
> >>> 146:   shl    $0x20,%rdi
> >>> 14a:   shr    $0x20,%rdi
> >>>
> >>> the shr will clear the most significant bit so we can say something about
> >>> the min sign value. I'll generate a couple patches today and send them
> >>> out to discuss. Probably easier to explain with code and examples.
> >>
> >> How about we detect this pattern on the verifier side and replace with
> >> pseudo insn that will do 32-bit sign extend. Most archs have special
> >> cpu instruction to do this much more efficiently than two shifts.
> >> If JIT doesn't implement that pseudo yet the verifier can convert
> >> it back to two shifts.
> >> Then during verification it will process pseudo_sign_extend op easily.
> >> So the idea:
> >> 1. pattern match sequence of two shifts in a pass similar to
> >>     replace_map_fd_with_map_ptr() before main do_check()
> >> 2. pseudo_sign_extend gets processed in do_check() doing the right thing
> >>     with bpf_reg_state.
> >> 3. JIT this pseudo insn or convert back
> >>
> >> Long term we can upgrade this pseudo insn into uapi and let llvm emit it.
> > 
> > I'm not sure pattern matching in the verifier is best. This particular
> > case of lsh/rsh games is the result of the BPF backend generating it from
> > an LLVM IR zext.
> > 
> > Here is the LLVM IR generated from test_get_stack_rawtp that produces
> > the zext.
> > 
> > 
> >   %26 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32, i64)*)(i8* %0, i8* nonnull %23, i32 800, i64 256) #3, !dbg !166
> >    call void @llvm.dbg.value(metadata i32 %26, metadata !124, metadata !DIExpression()), !dbg !130
> >    %27 = icmp slt i32 %26, 0, !dbg !167
> >    br i1 %27, label %41, label %28, !dbg !169
> > 
> > 28:                                               ; preds = %25
> >    %29 = zext i32 %26 to i64, !dbg !170
> >    %30 = getelementptr i8, i8* %23, i64 %29, !dbg !170
> > 
> > 
> > Clang wants to do zext because we are promoting an i32 to an i64. In the
> > BPF backend code we pattern match this as follows,
> > 
> >   def : Pat<(i64 (zext GPR32:$src)),
> >             (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> > 
> > Which generates the object code (again from test_get_stack_rawtp),
> > 
> >        56:       bc 81 00 00 00 00 00 00 w1 = w8
> >        57:       67 01 00 00 20 00 00 00 r1 <<= 32
> >        58:       77 01 00 00 20 00 00 00 r1 >>= 32
> > 
> > Unfortunately, this is a pretty poor form of zext from the verifier's point
> > of view; it completely nukes the bounds, as you observed. So how about doing
> > something a bit simpler from the backend side? Noting that moving a 32-bit
> > value into a 32-bit register zero-extends on x86, and we already make that
> > assumption elsewhere, it should be safe to implement the zext from the above
> > object dump as just the mov
> > 
> >    w1 = w8
> > 
> > Which we can implement in the backend with this patch,
> > 
> > diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
> > index 0f39294..a187103 100644
> > --- a/llvm/lib/Target/BPF/BPFInstrInfo.td
> > +++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
> > @@ -733,7 +733,7 @@ def : Pat<(i64 (sext GPR32:$src)),
> >             (SRA_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> >   
> >   def : Pat<(i64 (zext GPR32:$src)),
> > -          (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> > +          (MOV_32_64 GPR32:$src)>;
> >   
> > Now the new object code is simply,
> > 
> >        54:       c6 08 14 00 00 00 00 00 if w8 s< 0 goto +20 <LBB0_6>
> >        55:       1c 89 00 00 00 00 00 00 w9 -= w8
> >        56:       bc 81 00 00 00 00 00 00 w1 = w8
> >        57:       bf 72 00 00 00 00 00 00 r2 = r7
> >        58:       0f 12 00 00 00 00 00 00 r2 += r1
> >        59:       bf 61 00 00 00 00 00 00 r1 = r6
> >        60:       bc 93 00 00 00 00 00 00 w3 = w9
> >        61:       b7 04 00 00 00 00 00 00 r4 = 0
> >        62:       85 00 00 00 43 00 00 00 call 67
> > ;       if (ksize < 0)
> > 
> > That is the block from your original trace. But one issue still
> > remains: the above llvm backend update alone doesn't fix the verifier
> > problem created by my patch, because in the false branch after line 54
> > above we don't have the right bounds.
> > 
> >   53: (bc) w8 = w0
> >   ; if (usize < 0)
> >   54: (c6) if w8 s< 0x0 goto pc+20
> >    R0=inv(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R9=inv800 R10=fp0 fp-8=mmmm????
> >   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> >   55: (1c) w9 -= w8
> >   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> >   56: (bc) w1 = w8
> >   57: (bf) r2 = r7
> >   58: (0f) r2 += r1
> >    R0_rw=invP(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7_rw=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R9_rw=inv800 R10=fp0 fp-8=mmmm????
> >   parent didn't have regs=1 stack=0 marks
> >   last_idx 52 first_idx 40
> >   regs=1 stack=0 before 52: (85) call bpf_get_stack#67
> >   ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> >   59: (bf) r1 = r6
> >   60: (bc) w3 = w9
> >   61: (b7) r4 = 0
> >   62: (85) call bpf_get_stack#67
> > 
> > At line 54 R0 has bounds [SMIN, 800], which by line 53 are also the bounds in
> > w8, remembering a load will zero extend there. So we should expect
> > the false branch to have bounds [0, 800], exactly as we want. But
> > we instead got only a umax_value? Digging deeper, we are failing here,
> > 
> >   /* Return true if VAL is compared with a s64 sign extended from s32, and they
> >    * are with the same signedness.
> >    */
> >   static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg)
> >   {
> >           return ((s32)sval >= 0 &&
> >                   reg->smin_value >= 0 && reg->smax_value <= S32_MAX) ||
> >                  ((s32)sval < 0 &&
> >                   reg->smax_value <= 0 && reg->smin_value >= S32_MIN);
> >   }
> > 
> > This appears to be too conservative. I'll need to analyze that a bit, but it
> > should be safe to relax it to catch the above <0 case. After that I expect
> > we should be passing again.
> 
> Yes, this is the place where I have a problem as well. In my case, the top
> 32 bits may not be 0; they will be somehow magically cleared later through
> w_x = w_y style ALU operations.

The w_x = w_y MOV can be improved a bit as well, and this helps my case.
It needs some more thought, but roughly, coerce_reg_to_size() called from
the MOV handling can do better with the max bound. At the moment it loses
all knowledge of previous bounds checks on the move because of the
smax_value = umax_value below,

static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
{
	u64 mask;

	/* clear high bits in bit representation */
	reg->var_off = tnum_cast(reg->var_off, size);

	/* fix arithmetic bounds */
	mask = ((u64)1 << (size * 8)) - 1;
	if ((reg->umin_value & ~mask) == (reg->umax_value & ~mask)) {
		reg->umin_value &= mask;
		reg->umax_value &= mask;
	} else {
		reg->umin_value = 0;
		reg->umax_value = mask;
	}
	reg->smin_value = reg->umin_value;
	reg->smax_value = reg->umax_value;
}

I believe smax_value can likely be max(reg->smax_value, reg->umax_value),
but I need to check (assuming smax_value < UMAX_32). I need to drop off for
tonight but will dig into it tomorrow.
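
Roughly, the idea is something like the sketch below (untested, just to show
the direction; it keeps an already-tight non-negative smax instead of
widening it to umax):

	reg->smin_value = reg->umin_value;
	/* sketch only: keep a known non-negative smax if it still fits
	 * inside the truncated unsigned range
	 */
	if (reg->smax_value < 0 || reg->smax_value > reg->umax_value)
		reg->smax_value = reg->umax_value;
	/* else keep reg->smax_value as-is */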

> 
> > 
> > Sorry for the long thread but those are the details. What do you think?
> > In the meantime I'll generate the relaxed bounds on cmp_val_with_extended
> > and see what we can cook up with Daniel. It avoids pseudo instructions
> > and pattern matching, which I think is a bit more general.
> > 
> > Thanks,
> > John
> >
Yonghong Song Jan. 31, 2020, 12:52 a.m. UTC | #13
On 1/30/20 4:44 PM, John Fastabend wrote:
> Yonghong Song wrote:
>>
>>
>> On 1/30/20 3:34 PM, John Fastabend wrote:
>>> Alexei Starovoitov wrote:
>>>> On Thu, Jan 30, 2020 at 09:38:07AM -0800, John Fastabend wrote:
>>>>> Alexei Starovoitov wrote:
>>>>>> On Wed, Jan 29, 2020 at 02:52:10PM -0800, John Fastabend wrote:
>>>>>>> Daniel Borkmann wrote:
>>>>>>>> On 1/29/20 8:28 PM, Alexei Starovoitov wrote:
>>>>>>>>> On Wed, Jan 29, 2020 at 8:25 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
>>>>>>>>>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>>>>>>>>>
>>>>>>>>>> Applied, thanks!
>>>>>>>>>
>>>>>>>>> Daniel,
>>>>>>>>> did you run the selftests before applying?
>>>>>>>>> This patch breaks two.
>>>>>>>>> We have to find a different fix.
>>>>>>>>>
>>>>>>>>> ./test_progs -t get_stack
>>>>>>>>> 68: (85) call bpf_get_stack#67
>>>>>>>>>     R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0)
>>>>>>>>> R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0;
>>>>>>>>> 0xffffffff)) R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0;
>>>>>>>>> 0xffffffff)) R4_w=inv0 R6=ctx(id=0,off=0,im?
>>>>>>>>> R2 unbounded memory access, make sure to bounds check any array access
>>>>>>>>> into a map
>>>>>>>>
>>>>>>>> Sigh, had it in my wip pre-rebase tree when running tests. I've reverted it from the
>>>>>>>> tree since this needs to be addressed. Sorry for the trouble.
>>>>>>>
>>>>>>> Thanks, I'm looking into it now. Not sure how I missed it on
>>>>>>> selftests; either an older branch or I missed the test somehow. I've
>>>>>>> updated toolchain and kernel now so shouldn't happen again.
>>>>>>
>>>>>> Looks like smax_value was nuked by <<32 >>32 shifts.
>>>>>> 53: (bf) r8 = r0   // R0=inv(id=0,smax_value=800)
>>>>>> 54: (67) r8 <<= 32  // R8->smax_value = S64_MAX; in adjust_scalar_min_max_vals()
>>>>>> 55: (c7) r8 s>>= 32
>>>>>> ; if (usize < 0)
>>>>>> 56: (c5) if r8 s< 0x0 goto pc+28
>>>>>> // and here "less than zero check" doesn't help anymore.
>>>>>>
>>>>>> Not sure how to fix it yet, but the code pattern used in
>>>>>> progs/test_get_stack_rawtp.c
>>>>>> is real. Plenty of bpf progs rely on this.
>>>>>
>>>>> OK, I see what happened: I have some patches on my llvm tree and forgot to
>>>>> pop them off before running selftests :/ This <<=32 s>>=32 pattern pops up
>>>>> in a few places for us and causes verifier trouble whenever it is hit.
>>>>>
>>>>> I think I have a fix for this in llvm, if that is OK. And we can make
>>>>> the BPF_RSH and BPF_LSH verifier bounds tighter if we also define the
>>>>> architecture expectation on the jit side. For example x86 jit code here,
>>>>>
>>>>> 146:   shl    $0x20,%rdi
>>>>> 14a:   shr    $0x20,%rdi
>>>>>
>>>>> the shr will clear the most significant bit so we can say something about
>>>>> the min sign value. I'll generate a couple patches today and send them
>>>>> out to discuss. Probably easier to explain with code and examples.
>>>>
>>>> How about we detect this pattern on the verifier side and replace with
>>>> pseudo insn that will do 32-bit sign extend. Most archs have special
>>>> cpu instruction to do this much more efficiently than two shifts.
>>>> If JIT doesn't implement that pseudo yet the verifier can convert
>>>> it back to two shifts.
>>>> Then during verification it will process pseudo_sign_extend op easily.
>>>> So the idea:
>>>> 1. pattern match sequence of two shifts in a pass similar to
>>>>      replace_map_fd_with_map_ptr() before main do_check()
>>>> 2. pseudo_sign_extend gets processed in do_check() doing the right thing
>>>>      with bpf_reg_state.
>>>> 3. JIT this pseudo insn or convert back
>>>>
>>>> Long term we can upgrade this pseudo insn into uapi and let llvm emit it.
>>>
>>> I'm not sure pattern matching in the verifier is best. This particular
>>> case of lsh/rsh games is the result of the BPF backend generating it from
>>> an LLVM IR zext.
>>>
>>> Here is the LLVM IR generated from test_get_stack_rawtp that produces
>>> the zext.
>>>
>>>
>>>    %26 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32, i64)*)(i8* %0, i8* nonnull %23, i32 800, i64 256) #3, !dbg !166
>>>     call void @llvm.dbg.value(metadata i32 %26, metadata !124, metadata !DIExpression()), !dbg !130
>>>     %27 = icmp slt i32 %26, 0, !dbg !167
>>>     br i1 %27, label %41, label %28, !dbg !169
>>>
>>> 28:                                               ; preds = %25
>>>     %29 = zext i32 %26 to i64, !dbg !170
>>>     %30 = getelementptr i8, i8* %23, i64 %29, !dbg !170
>>>
>>>
>>> Clang wants to do zext because we are promoting an i32 to an i64. In the
>>> BPF backend code we pattern match this as follows,
>>>
>>>    def : Pat<(i64 (zext GPR32:$src)),
>>>              (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
>>>
>>> Which generates the object code (again from test_get_stack_rawtp),
>>>
>>>         56:       bc 81 00 00 00 00 00 00 w1 = w8
>>>         57:       67 01 00 00 20 00 00 00 r1 <<= 32
>>>         58:       77 01 00 00 20 00 00 00 r1 >>= 32
>>>
>>> Unfortunately, this is a pretty poor form of zext from the verifier's point
>>> of view; it completely nukes the bounds, as you observed. So how about doing
>>> something a bit simpler from the backend side? Noting that moving a 32-bit
>>> value into a 32-bit register zero-extends on x86, and we already make that
>>> assumption elsewhere, it should be safe to implement the zext from the above
>>> object dump as just the mov
>>>
>>>     w1 = w8
>>>
>>> Which we can implement in the backend with this patch,
>>>
>>> diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
>>> index 0f39294..a187103 100644
>>> --- a/llvm/lib/Target/BPF/BPFInstrInfo.td
>>> +++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
>>> @@ -733,7 +733,7 @@ def : Pat<(i64 (sext GPR32:$src)),
>>>              (SRA_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
>>>    
>>>    def : Pat<(i64 (zext GPR32:$src)),
>>> -          (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
>>> +          (MOV_32_64 GPR32:$src)>;
>>
>> Thanks. This should work.
>>
>> Also, in BPF backend, we have a phase in BPFMIPeephole.cpp to
>> remove such <<= and >>= in general. It may miss some cases.
>>
>> But verifier seems handling <<= and >>= correctly, right?
>> Even we have it, the verifier should reach the same conclusion
>> compared to not having it, right?
>>
> 
> No, verifier marks it unknown and doesn't track it fully. We
> can perhaps improve the verifier but the above is a nice
> fix/improvement for the backend imo regardless.
> 
> 	case BPF_LSH:
> 		if (umax_val >= insn_bitness) {
> 			/* Shifts greater than 31 or 63 are undefined.
> 			 * This includes shifts by a negative number.
> 			 */
> 			mark_reg_unknown(env, regs, insn->dst_reg);
> 			break;
> 		}
> 		/* We lose all sign bit information (except what we can pick
> 		 * up from var_off)
> 		 */

But in this case, insn_bitness is 64 and supposedly umax_val == umin_val == 32,
since the original instruction is r1 <<= 32.

Probably something there needs improvement.

I do agree that the LLVM change is good; please submit a patch for it.
Thanks!

> 
>>>    
>>> Now the new object code is simply,
>>>
>>>         54:       c6 08 14 00 00 00 00 00 if w8 s< 0 goto +20 <LBB0_6>
>>>         55:       1c 89 00 00 00 00 00 00 w9 -= w8
>>>         56:       bc 81 00 00 00 00 00 00 w1 = w8
>>>         57:       bf 72 00 00 00 00 00 00 r2 = r7
>>>         58:       0f 12 00 00 00 00 00 00 r2 += r1
>>>         59:       bf 61 00 00 00 00 00 00 r1 = r6
>>>         60:       bc 93 00 00 00 00 00 00 w3 = w9
>>>         61:       b7 04 00 00 00 00 00 00 r4 = 0
>>>         62:       85 00 00 00 43 00 00 00 call 67
>>> ;       if (ksize < 0)
>>>
>>> That is the block from your original trace. But one issue still
>>> remains: the above llvm backend update alone doesn't fix the verifier
>>> problem created by my patch, because in the false branch after line 54
>>> above we don't have the right bounds.
>>>
>>>    53: (bc) w8 = w0
>>>    ; if (usize < 0)
>>>    54: (c6) if w8 s< 0x0 goto pc+20
>>>     R0=inv(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R9=inv800 R10=fp0 fp-8=mmmm????
>>>    ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>>>    55: (1c) w9 -= w8
>>>    ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>>>    56: (bc) w1 = w8
>>>    57: (bf) r2 = r7
>>>    58: (0f) r2 += r1
>>>     R0_rw=invP(id=0,smax_value=800) R6=ctx(id=0,off=0,imm=0) R7_rw=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R9_rw=inv800 R10=fp0 fp-8=mmmm????
>>>    parent didn't have regs=1 stack=0 marks
>>>    last_idx 52 first_idx 40
>>>    regs=1 stack=0 before 52: (85) call bpf_get_stack#67
>>>    ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>>>    59: (bf) r1 = r6
>>>    60: (bc) w3 = w9
>>>    61: (b7) r4 = 0
>>>    62: (85) call bpf_get_stack#67
>>>
>>> At line 54 R0 has bounds [SMIN, 800], which by line 53 are also the bounds in
>>> w8, remembering a load will zero extend there. So we should expect
>>> the false branch to have bounds [0, 800], exactly as we want. But
>>> we instead got only a umax_value? Digging deeper, we are failing here,
>>>
>>>    /* Return true if VAL is compared with a s64 sign extended from s32, and they
>>>     * are with the same signedness.
>>>     */
>>>    static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg)
>>>    {
>>>            return ((s32)sval >= 0 &&
>>>                    reg->smin_value >= 0 && reg->smax_value <= S32_MAX) ||
>>>                   ((s32)sval < 0 &&
>>>                    reg->smax_value <= 0 && reg->smin_value >= S32_MIN);
>>>    }
>>>
>>> This appears to be too conservative. I'll need to analyze that a bit, but it
>>> should be safe to relax it to catch the above <0 case. After that I expect
>>> we should be passing again.
>>>
>>> Sorry for the long thread but those are the details. What do you think?
>>> In the meantime I'll generate the relaxed bounds on cmp_val_with_extended
>>> and see what we can cook up with Daniel. It avoids pseudo instructions
>>> and pattern matching, which I think is a bit more general.
>>>
>>> Thanks,
>>> John
>>>
> 
>
Alexei Starovoitov Jan. 31, 2020, 2:46 a.m. UTC | #14
On Thu, Jan 30, 2020 at 03:34:27PM -0800, John Fastabend wrote:
> 
> diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
> index 0f39294..a187103 100644
> --- a/llvm/lib/Target/BPF/BPFInstrInfo.td
> +++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
> @@ -733,7 +733,7 @@ def : Pat<(i64 (sext GPR32:$src)),
>            (SRA_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
>  
>  def : Pat<(i64 (zext GPR32:$src)),
> -          (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> +          (MOV_32_64 GPR32:$src)>;

That's a good optimization, and the 32-bit zero-extend after mov32 is not x86
specific; it's mandatory on all archs.

But it won't solve this problem. There are both signed and unsigned extensions
in that program. The one that breaks is the _signed_ one, and it cannot be
optimized into any other instruction by llvm.
Hence the proposal to do a pseudo insn for it and upgrade it to uapi later.

llvm-objdump -S test_get_stack_rawtp.o
; usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
      52:	85 00 00 00 43 00 00 00	call 67 
      53:	b7 02 00 00 00 00 00 00	r2 = 0
      54:	bc 08 00 00 00 00 00 00	w8 = w0
; 	if (usize < 0)
      55:	bc 81 00 00 00 00 00 00	w1 = w8
      56:	67 01 00 00 20 00 00 00	r1 <<= 32
      57:	c7 01 00 00 20 00 00 00	r1 s>>= 32
      58:	6d 12 1a 00 00 00 00 00	if r2 s> r1 goto +26 <LBB0_6>
56 and 57 are the shifts.
Alexei Starovoitov Jan. 31, 2020, 2:50 a.m. UTC | #15
On Thu, Jan 30, 2020 at 04:44:09PM -0800, John Fastabend wrote:
> > 
> > But verifier seems handling <<= and >>= correctly, right?
> > Even we have it, the verifier should reach the same conclusion
> > compared to not having it, right?
> > 
> 
> No, verifier marks it unknown and doesn't track it fully. We
> can perhaps improve the verifier but the above is a nice
> fix/improvement for the backend imo regardless.

I don't see how the verifier can be taught to carry smax information after a <<32
shift. The verifier has to do dst_reg->smax_value = S64_MAX.
The only way to carry smax is to recognize the sequence <<32 s>>32.
While at it we can recognize both <<32 >>32 and <<32 s>>32 as pseudo insns.
John Fastabend Jan. 31, 2020, 5:48 a.m. UTC | #16
Alexei Starovoitov wrote:
> On Thu, Jan 30, 2020 at 03:34:27PM -0800, John Fastabend wrote:
> > 
> > diff --git a/llvm/lib/Target/BPF/BPFInstrInfo.td b/llvm/lib/Target/BPF/BPFInstrInfo.td
> > index 0f39294..a187103 100644
> > --- a/llvm/lib/Target/BPF/BPFInstrInfo.td
> > +++ b/llvm/lib/Target/BPF/BPFInstrInfo.td
> > @@ -733,7 +733,7 @@ def : Pat<(i64 (sext GPR32:$src)),
> >            (SRA_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> >  
> >  def : Pat<(i64 (zext GPR32:$src)),
> > -          (SRL_ri (SLL_ri (MOV_32_64 GPR32:$src), 32), 32)>;
> > +          (MOV_32_64 GPR32:$src)>;
> 
> That's a good optimization, and the 32-bit zero-extend after mov32 is not x86
> specific; it's mandatory on all archs.
>
> But it won't solve this problem. There are both signed and unsigned extensions
> in that program. The one that breaks is the _signed_ one, and it cannot be
> optimized into any other instruction by llvm.
> Hence the proposal to do a pseudo insn for it and upgrade it to uapi later.

Those are both coming from the llvm IR zext call; with the above patch,
56 and 57 are omitted so there are no shifts. I'll check again
just to be sure and put the details in a patch for the backend.

> 
> llvm-objdump -S test_get_stack_rawtp.o
> ; usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>       52:	85 00 00 00 43 00 00 00	call 67 
>       53:	b7 02 00 00 00 00 00 00	r2 = 0
>       54:	bc 08 00 00 00 00 00 00	w8 = w0
> ; 	if (usize < 0)
>       55:	bc 81 00 00 00 00 00 00	w1 = w8
>       56:	67 01 00 00 20 00 00 00	r1 <<= 32
>       57:	c7 01 00 00 20 00 00 00	r1 s>>= 32
>       58:	6d 12 1a 00 00 00 00 00	if r2 s> r1 goto +26 <LBB0_6>
> 56 and 57 are the shifts.

Agreed, it doesn't make much sense to me at the moment that the r1 s>>= 32
is signed. I'll take a look in the morning. That fragment at 55, 56,
57 is coming from a zext in llvm though.

FWIW, once the shifts are removed, the next issue is that coerce_reg_to_size()
loses the smax info it needs. Something like this is needed so that if
we have a tight smax_value we don't lose it to the mask.

@@ -2805,9 +2804,32 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
                reg->umax_value = mask;
        }
        reg->smin_value = reg->umin_value;
-       reg->smax_value = reg->umax_value;
+       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value)
+               reg->smax_value = reg->umax_value;
+}
+

I'll write up the details in a patch once we iron out the LLVM zext
IR signed shift.


.John
Alexei Starovoitov Jan. 31, 2020, 6:18 a.m. UTC | #17
On Thu, Jan 30, 2020 at 9:48 PM John Fastabend <john.fastabend@gmail.com> wrote:
> at the moment. I'll take a look in the morning. That fragment 55,56,
> 57 are coming from a zext in llvm though.

I don't think so. Here is how IR looks after all optimizations
and right before instruction selection:
  %call12 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32,
i64)*)(i8* %ctx, i8* nonnull %call8, i32 800, i64 256) #2
  %cmp = icmp slt i32 %call12, 0
  br i1 %cmp, label %cleanup, label %if.end15

if.end15:                                         ; preds = %if.end11
  %idx.ext70 = zext i32 %call12 to i64
  %add.ptr = getelementptr i8, i8* %call8, i64 %idx.ext70
  %sub = sub nsw i32 800, %call12
  %call16 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32,
i64)*)(i8* %ctx, i8* %add.ptr, i32 %sub, i64 0) #2
  %cmp17 = icmp slt i32 %call16, 0
  br i1 %cmp17, label %cleanup, label %if.end20

and corresponding C code:
        usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
        if (usize < 0)
                return 0;

        ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
        if (ksize < 0)

%idx.ext70 = zext i32 %call12 to i64
that you see is part of the 'raw_data + usize' math.
The result of the first bpf_get_stack() is directly passed into
"icmp slt i32 %call12, 0"
and during instruction selection the backend does
sign extension with <<32 s>>32.

I agree that peephole zext->mov32_64 is correct and a nice optimization,
but I still don't see how it helps this case.
John Fastabend Jan. 31, 2020, 5:16 p.m. UTC | #18
Alexei Starovoitov wrote:
> On Thu, Jan 30, 2020 at 9:48 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > at the moment. I'll take a look in the morning. That fragment 55,56,
> > 57 are coming from a zext in llvm though.
> 

OK, I see the disconnect now. I don't get the same instructions at all; I only
have the <<32 >>32, no signed shift s>>32.

> I don't think so. Here is how IR looks after all optimizations
> and right before instruction selection:

Same llvm ir though

>   %call12 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32,
> i64)*)(i8* %ctx, i8* nonnull %call8, i32 800, i64 256) #2
>   %cmp = icmp slt i32 %call12, 0
>   br i1 %cmp, label %cleanup, label %if.end15
> 
> if.end15:                                         ; preds = %if.end11
>   %idx.ext70 = zext i32 %call12 to i64
>   %add.ptr = getelementptr i8, i8* %call8, i64 %idx.ext70
>   %sub = sub nsw i32 800, %call12
>   %call16 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32,
> i64)*)(i8* %ctx, i8* %add.ptr, i32 %sub, i64 0) #2
>   %cmp17 = icmp slt i32 %call16, 0
>   br i1 %cmp17, label %cleanup, label %if.end20
> 

 %26 = call i32 inttoptr (i64 67 to i32 (i8*, i8*, i32, i64)*)(i8* %0, i8* nonnull %23, i32 800, i64 256) #3, !dbg !166
  %27 = icmp slt i32 %26, 0, !dbg !167
  br i1 %27, label %41, label %28, !dbg !169

28:                                               ; preds = %25
  %29 = zext i32 %26 to i64, !dbg !170
  %30 = getelementptr i8, i8* %23, i64 %29, !dbg !170


> and corresponding C code:
>         usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>         if (usize < 0)
>                 return 0;
> 
>         ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>         if (ksize < 0)

same as well. But my object code only has this (unpatched llvm)

      56:       bc 81 00 00 00 00 00 00 w1 = w8
      57:       67 01 00 00 20 00 00 00 r1 <<= 32
      58:       77 01 00 00 20 00 00 00 r1 >>= 32

> 
> %idx.ext70 = zext i32 %call12 to i64
> that you see is a part of 'raw_data + usize' math.
> The result of first bpf_get_stack() is directly passed into
> "icmp slt i32 %call12, 0"
> and during instruction selection the backend does
> sign extension with <<32 s>>32.

Assuming latest bpf tree and llvm master branch? 

> 
> I agree that peephole zext->mov32_64 is correct and a nice optimization,
> but I still don't see how it helps this case.

Also, I don't mind building a pseudo instruction here for signed extension,
but it's not clear to me why we are getting different instruction
selections. Why is sext being chosen in your case?

.John
Alexei Starovoitov Jan. 31, 2020, 9:36 p.m. UTC | #19
On Fri, Jan 31, 2020 at 9:16 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Also don't mind to build pseudo instruction here for signed extension
> but its not clear to me why we are getting different instruction
> selections? Its not clear to me why sext is being chosen in your case?

Sign extension has to be there if jmp64 is used.
So the difference is due to -mcpu=v2 vs -mcpu=v3:
v2 does alu32, but not jmp32;
v3 does both.
By default selftests are using -mcpu=probe, which
detects v2/v3 depending on the running kernel.

llc -mattr=dwarfris -march=bpf -mcpu=v3  -mattr=+alu32
;       usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
      48:       bf 61 00 00 00 00 00 00 r1 = r6
      49:       bf 72 00 00 00 00 00 00 r2 = r7
      50:       b4 03 00 00 20 03 00 00 w3 = 800
      51:       b7 04 00 00 00 01 00 00 r4 = 256
      52:       85 00 00 00 43 00 00 00 call 67
      53:       bc 08 00 00 00 00 00 00 w8 = w0
;       if (usize < 0)
      54:       c6 08 16 00 00 00 00 00 if w8 s< 0 goto +22 <LBB0_6>
;       ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
      55:       1c 89 00 00 00 00 00 00 w9 -= w8
      56:       bc 81 00 00 00 00 00 00 w1 = w8
      57:       67 01 00 00 20 00 00 00 r1 <<= 32
      58:       77 01 00 00 20 00 00 00 r1 >>= 32
      59:       bf 72 00 00 00 00 00 00 r2 = r7
      60:       0f 12 00 00 00 00 00 00 r2 += r1
      61:       bf 61 00 00 00 00 00 00 r1 = r6
      62:       bc 93 00 00 00 00 00 00 w3 = w9
      63:       b7 04 00 00 00 00 00 00 r4 = 0
      64:       85 00 00 00 43 00 00 00 call 67
;       if (ksize < 0)
      65:       c6 00 0b 00 00 00 00 00 if w0 s< 0 goto +11 <LBB0_6>

llc -mattr=dwarfris -march=bpf -mcpu=v2  -mattr=+alu32
;       usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
      48:       bf 61 00 00 00 00 00 00 r1 = r6
      49:       bf 72 00 00 00 00 00 00 r2 = r7
      50:       b4 03 00 00 20 03 00 00 w3 = 800
      51:       b7 04 00 00 00 01 00 00 r4 = 256
      52:       85 00 00 00 43 00 00 00 call 67
      53:       bc 08 00 00 00 00 00 00 w8 = w0
;       if (usize < 0)
      54:       bc 81 00 00 00 00 00 00 w1 = w8
      55:       67 01 00 00 20 00 00 00 r1 <<= 32
      56:       c7 01 00 00 20 00 00 00 r1 s>>= 32
      57:       c5 01 19 00 00 00 00 00 if r1 s< 0 goto +25 <LBB0_6>
;       ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
      58:       1c 89 00 00 00 00 00 00 w9 -= w8
      59:       bc 81 00 00 00 00 00 00 w1 = w8
      60:       67 01 00 00 20 00 00 00 r1 <<= 32
      61:       77 01 00 00 20 00 00 00 r1 >>= 32
      62:       bf 72 00 00 00 00 00 00 r2 = r7
      63:       0f 12 00 00 00 00 00 00 r2 += r1
      64:       bf 61 00 00 00 00 00 00 r1 = r6
      65:       bc 93 00 00 00 00 00 00 w3 = w9
      66:       b7 04 00 00 00 00 00 00 r4 = 0
      67:       85 00 00 00 43 00 00 00 call 67
;       if (ksize < 0)
      68:       bc 01 00 00 00 00 00 00 w1 = w0
      69:       67 01 00 00 20 00 00 00 r1 <<= 32
      70:       c7 01 00 00 20 00 00 00 r1 s>>= 32
      71:       c5 01 0b 00 00 00 00 00 if r1 s< 0 goto +11 <LBB0_6>

zext is there in both cases and it will be optimized with your llvm patch.
So please send it. Don't delay :)
John Fastabend Feb. 4, 2020, 7:55 p.m. UTC | #20
Alexei Starovoitov wrote:
> On Fri, Jan 31, 2020 at 9:16 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Also don't mind to build pseudo instruction here for signed extension
> > but its not clear to me why we are getting different instruction
> > selections? Its not clear to me why sext is being chosen in your case?
> 
> Sign extension has to be there if jmp64 is used.
> So the difference is due to -mcpu=v2 vs -mcpu=v3
> v2 does alu32, but not jmp32
> v3 does both.
> By default selftests are using -mcpu=probe which
> detects v2/v3 depending on running kernel.
> 
> llc -mattr=dwarfris -march=bpf -mcpu=v3  -mattr=+alu32
> ;       usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>       48:       bf 61 00 00 00 00 00 00 r1 = r6
>       49:       bf 72 00 00 00 00 00 00 r2 = r7
>       50:       b4 03 00 00 20 03 00 00 w3 = 800
>       51:       b7 04 00 00 00 01 00 00 r4 = 256
>       52:       85 00 00 00 43 00 00 00 call 67
>       53:       bc 08 00 00 00 00 00 00 w8 = w0
> ;       if (usize < 0)
>       54:       c6 08 16 00 00 00 00 00 if w8 s< 0 goto +22 <LBB0_6>
> ;       ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>       55:       1c 89 00 00 00 00 00 00 w9 -= w8
>       56:       bc 81 00 00 00 00 00 00 w1 = w8
>       57:       67 01 00 00 20 00 00 00 r1 <<= 32
>       58:       77 01 00 00 20 00 00 00 r1 >>= 32
>       59:       bf 72 00 00 00 00 00 00 r2 = r7
>       60:       0f 12 00 00 00 00 00 00 r2 += r1
>       61:       bf 61 00 00 00 00 00 00 r1 = r6
>       62:       bc 93 00 00 00 00 00 00 w3 = w9
>       63:       b7 04 00 00 00 00 00 00 r4 = 0
>       64:       85 00 00 00 43 00 00 00 call 67
> ;       if (ksize < 0)
>       65:       c6 00 0b 00 00 00 00 00 if w0 s< 0 goto +11 <LBB0_6>
> 
> llc -mattr=dwarfris -march=bpf -mcpu=v2  -mattr=+alu32
> ;       usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>       48:       bf 61 00 00 00 00 00 00 r1 = r6
>       49:       bf 72 00 00 00 00 00 00 r2 = r7
>       50:       b4 03 00 00 20 03 00 00 w3 = 800
>       51:       b7 04 00 00 00 01 00 00 r4 = 256
>       52:       85 00 00 00 43 00 00 00 call 67
>       53:       bc 08 00 00 00 00 00 00 w8 = w0
> ;       if (usize < 0)
>       54:       bc 81 00 00 00 00 00 00 w1 = w8
>       55:       67 01 00 00 20 00 00 00 r1 <<= 32
>       56:       c7 01 00 00 20 00 00 00 r1 s>>= 32
>       57:       c5 01 19 00 00 00 00 00 if r1 s< 0 goto +25 <LBB0_6>
> ;       ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>       58:       1c 89 00 00 00 00 00 00 w9 -= w8
>       59:       bc 81 00 00 00 00 00 00 w1 = w8
>       60:       67 01 00 00 20 00 00 00 r1 <<= 32
>       61:       77 01 00 00 20 00 00 00 r1 >>= 32
>       62:       bf 72 00 00 00 00 00 00 r2 = r7
>       63:       0f 12 00 00 00 00 00 00 r2 += r1
>       64:       bf 61 00 00 00 00 00 00 r1 = r6
>       65:       bc 93 00 00 00 00 00 00 w3 = w9
>       66:       b7 04 00 00 00 00 00 00 r4 = 0
>       67:       85 00 00 00 43 00 00 00 call 67
> ;       if (ksize < 0)
>       68:       bc 01 00 00 00 00 00 00 w1 = w0
>       69:       67 01 00 00 20 00 00 00 r1 <<= 32
>       70:       c7 01 00 00 20 00 00 00 r1 s>>= 32
>       71:       c5 01 0b 00 00 00 00 00 if r1 s< 0 goto +11 <LBB0_6>
> 
> zext is there both cases and it will be optimized with your llvm patch.
> So please send it. Don't delay :)

LLVM patch here, https://reviews.llvm.org/D73985

With the updated LLVM I can pass selftests with the above fix and the additional
patch below to get tighter bounds on 32-bit registers. So going forward I think
we need to review the above llvm patch and, assuming it looks good, commit it
and then go forward with this series.

---

bpf: coerce reg use tighter max bound if possible
    
When we do a coerce_reg_to_size() we lose possibly valid upper bounds in
the case where (a) smax is non-negative and (b) smax is less than the max
value in the new reg size. If both (a) and (b) are satisfied we can keep
the smax bound. (a) is required to ensure we do not remove the upper sign
bit, and (b) is required to ensure previously set bits are contained
inside the new reg bits.
    
Signed-off-by: John Fastabend <john.fastabend@gmail.com>

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1cc945d..e5349d6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2805,7 +2805,8 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
 		reg->umax_value = mask;
 	}
 	reg->smin_value = reg->umin_value;
-	reg->smax_value = reg->umax_value;
+	if (reg->smax_value < 0 || reg->smax_value > reg->umax_value)
+		reg->smax_value = reg->umax_value;
 }
 
 static bool bpf_map_is_rdonly(const struct bpf_map *map)
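
For illustration only (not part of the patch), the intended effect on the
earlier get_stack trace, assuming R0 leaves the helper call with
smax_value=800 from do_refine_retval_range(), is roughly:

	/* 53: (bc) w8 = w0  ->  coerce_reg_to_size(dst_reg, 4)
	 *
	 * before the patch: R8 smin=0, smax=0xffffffff (smax lost to the mask)
	 * after the patch:  R8 smin=0, smax=800        (kept, since 0 <= 800 <= umax)
	 *
	 * so 54: (c6) if w8 s< 0x0 can prune the negative range, and the
	 * false branch sees w8 bounded by [0, 800].
	 */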
Yonghong Song Feb. 5, 2020, 1:21 a.m. UTC | #21
On 2/4/20 11:55 AM, John Fastabend wrote:
> Alexei Starovoitov wrote:
>> On Fri, Jan 31, 2020 at 9:16 AM John Fastabend <john.fastabend@gmail.com> wrote:
>>>
>>> Also don't mind to build pseudo instruction here for signed extension
>>> but its not clear to me why we are getting different instruction
>>> selections? Its not clear to me why sext is being chosen in your case?
>>
>> Sign extension has to be there if jmp64 is used.
>> So the difference is due to -mcpu=v2 vs -mcpu=v3
>> v2 does alu32, but not jmp32
>> v3 does both.
>> By default selftests are using -mcpu=probe which
>> detects v2/v3 depending on running kernel.
>>
>> llc -mattr=dwarfris -march=bpf -mcpu=v3  -mattr=+alu32
>> ;       usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>>        48:       bf 61 00 00 00 00 00 00 r1 = r6
>>        49:       bf 72 00 00 00 00 00 00 r2 = r7
>>        50:       b4 03 00 00 20 03 00 00 w3 = 800
>>        51:       b7 04 00 00 00 01 00 00 r4 = 256
>>        52:       85 00 00 00 43 00 00 00 call 67
>>        53:       bc 08 00 00 00 00 00 00 w8 = w0
>> ;       if (usize < 0)
>>        54:       c6 08 16 00 00 00 00 00 if w8 s< 0 goto +22 <LBB0_6>
>> ;       ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>>        55:       1c 89 00 00 00 00 00 00 w9 -= w8
>>        56:       bc 81 00 00 00 00 00 00 w1 = w8
>>        57:       67 01 00 00 20 00 00 00 r1 <<= 32
>>        58:       77 01 00 00 20 00 00 00 r1 >>= 32
>>        59:       bf 72 00 00 00 00 00 00 r2 = r7
>>        60:       0f 12 00 00 00 00 00 00 r2 += r1
>>        61:       bf 61 00 00 00 00 00 00 r1 = r6
>>        62:       bc 93 00 00 00 00 00 00 w3 = w9
>>        63:       b7 04 00 00 00 00 00 00 r4 = 0
>>        64:       85 00 00 00 43 00 00 00 call 67
>> ;       if (ksize < 0)
>>        65:       c6 00 0b 00 00 00 00 00 if w0 s< 0 goto +11 <LBB0_6>
>>
>> llc -mattr=dwarfris -march=bpf -mcpu=v2  -mattr=+alu32
>> ;       usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>>        48:       bf 61 00 00 00 00 00 00 r1 = r6
>>        49:       bf 72 00 00 00 00 00 00 r2 = r7
>>        50:       b4 03 00 00 20 03 00 00 w3 = 800
>>        51:       b7 04 00 00 00 01 00 00 r4 = 256
>>        52:       85 00 00 00 43 00 00 00 call 67
>>        53:       bc 08 00 00 00 00 00 00 w8 = w0
>> ;       if (usize < 0)
>>        54:       bc 81 00 00 00 00 00 00 w1 = w8
>>        55:       67 01 00 00 20 00 00 00 r1 <<= 32
>>        56:       c7 01 00 00 20 00 00 00 r1 s>>= 32
>>        57:       c5 01 19 00 00 00 00 00 if r1 s< 0 goto +25 <LBB0_6>
>> ;       ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>>        58:       1c 89 00 00 00 00 00 00 w9 -= w8
>>        59:       bc 81 00 00 00 00 00 00 w1 = w8
>>        60:       67 01 00 00 20 00 00 00 r1 <<= 32
>>        61:       77 01 00 00 20 00 00 00 r1 >>= 32
>>        62:       bf 72 00 00 00 00 00 00 r2 = r7
>>        63:       0f 12 00 00 00 00 00 00 r2 += r1
>>        64:       bf 61 00 00 00 00 00 00 r1 = r6
>>        65:       bc 93 00 00 00 00 00 00 w3 = w9
>>        66:       b7 04 00 00 00 00 00 00 r4 = 0
>>        67:       85 00 00 00 43 00 00 00 call 67
>> ;       if (ksize < 0)
>>        68:       bc 01 00 00 00 00 00 00 w1 = w0
>>        69:       67 01 00 00 20 00 00 00 r1 <<= 32
>>        70:       c7 01 00 00 20 00 00 00 r1 s>>= 32
>>        71:       c5 01 0b 00 00 00 00 00 if r1 s< 0 goto +11 <LBB0_6>
>>
>> zext is there both cases and it will be optimized with your llvm patch.
>> So please send it. Don't delay :)
> 
> LLVM patch here, https://reviews.llvm.org/D73985
> 
> With updated LLVM I can pass selftests with above fix and additional patch
> below to get tighter bounds on 32bit registers. So going forward I think
> we need to review and assuming it looks good commit above llvm patch and
> then go forward with this series.

Thanks. The llvm patch looks sane, but after applying the patch, I hit 
several selftest failures. For example, strobemeta_nounroll1.o.

The following is a brief analysis of the verifier state:

184: R0=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 0xffffffff00000001)) R7=inv0

184: (bc) w7 = w0
185: R0=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 0xffffffff00000001)) R7_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0x1))

185: (67) r7 <<= 32
186: R0=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 0xffffffff00000001)) R7_w=inv(id=0,umax_value=4294967296,var_off=(0x0; 0x100000000))

186: (77) r7 >>= 32
187: R0=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 0xffffffff00000001)) R7_w=inv(id=0,umax_value=1,var_off=(0x0; 0x1))

You can see that after the left and right shifts we got a better R7 umax_value=1.
Without the left and right shifts, the verifier eventually complains.

Can we make umax_value=1 at insn 'w7 = w0'?
Currently, we cannot do this due to the logic in coerce_reg_to_size().
It looks correct to me to ignore the top mask as we know the upper 32 bits
will be discarded.
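
To spell out the arithmetic (a sketch, not actual verifier output):

	/* 184: (bc) w7 = w0, with R0 var_off=(0x0; 0xffffffff00000001):
	 *
	 * tnum_cast(var_off, 4)        -> (0x0; 0x1), so w7 can only be 0 or 1
	 * current coerce_reg_to_size() -> umin/umax disagree above bit 31, so
	 *                                 umax_value is widened to 0xffffffff
	 * masking unconditionally      -> umax_value = 0xffffffff00000001 & 0xffffffff = 1,
	 *                                 which is what the <<32 >>32 pair
	 *                                 recovers at insn 187
	 */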

I implemented this in my previous patch to deal with signed compares.
The following is the patch to fix this particular issue:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1cc945daa9c8..5aa2adad18c9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2788,7 +2788,8 @@ static int check_tp_buffer_access(struct bpf_verifier_env *env,
 /* truncate register to smaller size (in bytes)
  * must be called with size < BPF_REG_SIZE
  */
-static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
+static void coerce_reg_to_size(struct bpf_reg_state *reg, int size,
+                              bool ignore_upper_mask)
 {
        u64 mask;

@@ -2797,7 +2798,8 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)

        /* fix arithmetic bounds */
        mask = ((u64)1 << (size * 8)) - 1;
-       if ((reg->umin_value & ~mask) == (reg->umax_value & ~mask)) {
+       if (ignore_upper_mask ||
+           (reg->umin_value & ~mask) == (reg->umax_value & ~mask)) {
                reg->umin_value &= mask;
                reg->umax_value &= mask;
        } else {
@@ -3066,7 +3068,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
        if (!err && size < BPF_REG_SIZE && value_regno >= 0 && t == BPF_READ &&
            regs[value_regno].type == SCALAR_VALUE) {
                /* b/h/w load zero-extends, mark upper bits as known 0 */
-               coerce_reg_to_size(&regs[value_regno], size);
+               coerce_reg_to_size(&regs[value_regno], size, false);
        }
        return err;
 }
@@ -4859,8 +4861,8 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
                 * LSB, so it isn't sufficient to only truncate the output to
                 * 32 bits.
                 */
-               coerce_reg_to_size(dst_reg, 4);
-               coerce_reg_to_size(&src_reg, 4);
+               coerce_reg_to_size(dst_reg, 4, false);
+               coerce_reg_to_size(&src_reg, 4, false);
        }

        smin_val = src_reg.smin_value;
@@ -5114,7 +5116,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,

        if (BPF_CLASS(insn->code) != BPF_ALU64) {
                /* 32-bit ALU ops are (32,32)->32 */
-               coerce_reg_to_size(dst_reg, 4);
+               coerce_reg_to_size(dst_reg, 4, false);
        }

        __reg_deduce_bounds(dst_reg);
@@ -5290,7 +5292,7 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
                                        mark_reg_unknown(env, regs,
                                                         insn->dst_reg);
                                }
-                               coerce_reg_to_size(dst_reg, 4);
+                               coerce_reg_to_size(dst_reg, 4, true);
                        }
                } else {
                        /* case: R = imm
@@ -5482,7 +5484,7 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode,
                 * could truncate high bits and update umin/umax according to
                 * information of low bits.
                 */
-               coerce_reg_to_size(reg, 4);
+               coerce_reg_to_size(reg, 4, false);
                /* smin/smax need special handling. For example, after coerce,
                 * if smin_value is 0x00000000ffffffffLL, the value is -1 when
                 * used as operand to JMP32. It is a negative number from s32's
@@ -6174,8 +6176,8 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,

                dst_lo = &lo_reg0;
                src_lo = &lo_reg1;
-               coerce_reg_to_size(dst_lo, 4);
-               coerce_reg_to_size(src_lo, 4);
+               coerce_reg_to_size(dst_lo, 4, false);
+               coerce_reg_to_size(src_lo, 4, false);

                if (dst_reg->type == SCALAR_VALUE &&
                    src_reg->type == SCALAR_VALUE) {

With the above patch, there is still one more issue in test_seg6_loop.o,
which is related to llvm code generation, w.r.t. our strange 32-bit
packet begin and packet end.

The following code is generated:

2: (61) r1 = *(u32 *)(r6 +76)
3: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
; cursor = (void *)(long)skb->data;
3: (bc) w8 = w1
4: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
; if ((void *)ipver + sizeof(*ipver) > data_end)
4: (bf) r3 = r8

In the above, r1 is a packet pointer, and after the assignment the value in
w8 becomes a scalar, which leads to a later verification failure.

Without the patch, we generate:
1: R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
; data_end = (void *)(long)skb->data_end;
1: (61) r1 = *(u32 *)(r6 +80)
2: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
; cursor = (void *)(long)skb->data;
2: (61) r8 = *(u32 *)(r6 +76)
3: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
; if ((void *)ipver + sizeof(*ipver) > data_end)
3: (bf) r2 = r8
4: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=0,r=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
4: (07) r2 += 1
5: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=1,r=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0

r2 stays a packet pointer, so we don't have issues later.

Not sure how we could fix this in llvm, as llvm does not really have any idea
that the above w1 in 'w8 = w1' is a packet pointer.

> 
> ---
> 
> bpf: coerce reg use tighter max bound if possible
>      
> When we do a coerce_reg_to_size we lose possibly valid upper bounds in
> the case where, (a) smax is non-negative and (b) smax is less than max
> value in new reg size. If both (a) and (b) are satisfied we can keep
> the smax bound. (a) is required to ensure we do not remove upper sign
> bit. And (b) is required to ensure previously set bits are contained
> inside the new reg bits.
>      
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1cc945d..e5349d6 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2805,7 +2805,8 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
>   		reg->umax_value = mask;
>   	}
>   	reg->smin_value = reg->umin_value;
> -	reg->smax_value = reg->umax_value;
> +	if (reg->smax_value < 0 || reg->smax_value > reg->umax_value)
> +		reg->smax_value = reg->umax_value;
>   }
>   
>   static bool bpf_map_is_rdonly(const struct bpf_map *map)
>
John Fastabend Feb. 5, 2020, 3:05 a.m. UTC | #22
Yonghong Song wrote:
> 
> 
> On 2/4/20 11:55 AM, John Fastabend wrote:
> > Alexei Starovoitov wrote:
> >> On Fri, Jan 31, 2020 at 9:16 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >>>
> >>> Also don't mind to build pseudo instruction here for signed extension
> >>> but its not clear to me why we are getting different instruction
> >>> selections? Its not clear to me why sext is being chosen in your case?

[...]

> >> zext is there both cases and it will be optimized with your llvm patch.
> >> So please send it. Don't delay :)
> > 
> > LLVM patch here, https://reviews.llvm.org/D73985
> > 
> > With updated LLVM I can pass selftests with above fix and additional patch
> > below to get tighter bounds on 32bit registers. So going forward I think
> > we need to review and assuming it looks good commit above llvm patch and
> > then go forward with this series.
> 
> Thanks. The llvm patch looks sane, but after applying the patch, I hit 
> several selftest failures. For example, strobemeta_nounroll1.o.
> 
> The following is a brief analysis of the verifier state:
> 
> 184: 
> R0=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 
> 0xffffffff00000001))
> R7=inv0
> 
> 184: (bc) w7 = w0
> 185: 
> R0=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 
> 0xffffffff00000001))
> R7_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0x1))
> 
> 185: (67) r7 <<= 32
> 186: 
> R0=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 
> 0xffffffff00000001))
> R7_w=inv(id=0,umax_value=4294967296,var_off=(0x0; 0x100000000))
> 
> 186: (77) r7 >>= 32
> 187: 
> R0=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 
> 0xffffffff00000001))
> R7_w=inv(id=0,umax_value=1,var_off=(0x0; 0x1))
> 
> You can see after left and right shift, we got a better R7 umax_value=1.
> Without the left and right shift, eventually verifier complains.
> 
> Can we make umax_value=1 at insn 'w7 = w0'?
> Currently, we cannot do this due to the logic in coerce_reg_to_size().
> It looks correct to me to ignore the top mask as we know the upper 32bit 
> will be discarded.
> 
> I have implemented in my previous patch to deal with signed compares.
> The following is the patch to fix this particular issue:

[...]

> 
> With the above patch, there is still one more issue in test_seg6_loop.o, 
> which is related to llvm code generation, w.r.t. our strange 32bit 
> packet begin and packet end.
> 
> The following patch is generated:
> 
> 2: (61) r1 = *(u32 *)(r6 +76)
> 3: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0) 
> R6_w=ctx(id=0,off=0,imm=0) R10=fp0
> ; cursor = (void *)(long)skb->data;
> 3: (bc) w8 = w1
> 4: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0) 
> R6_w=ctx(id=0,off=0,imm=0) 
> R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
> ; if ((void *)ipver + sizeof(*ipver) > data_end)
> 4: (bf) r3 = r8
> 
> In the above r1 is packet pointer and after the assignment, it becomes a 
> scalar and will lead later verification failure.
> 
> Without the patch, we generates:
> 1: R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
> ; data_end = (void *)(long)skb->data_end;
> 1: (61) r1 = *(u32 *)(r6 +80)
> 2: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
> ; cursor = (void *)(long)skb->data;
> 2: (61) r8 = *(u32 *)(r6 +76)
> 3: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) 
> R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
> ; if ((void *)ipver + sizeof(*ipver) > data_end)
> 3: (bf) r2 = r8
> 4: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=0,r=0,imm=0) 
> R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
> 4: (07) r2 += 1
> 5: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=1,r=0,imm=0) 
> R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
> 
> r2 keeps as a packet pointer, so we don't have issues later.
> 
> Not sure how we could fix this in llvm as llvm does not really have idea
> the above w1 in w8 = w1 is a packet pointer.
> 

OK, thanks for the analysis. I have this on my stack as well but still need to
check that it's correct:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 320e2df..3072dba7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2804,8 +2804,11 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
                reg->umax_value = mask;
        }
        reg->smin_value = reg->umin_value;
-       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value)
+       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value) {
                reg->smax_value = reg->umax_value;
+       } else {
+               reg->umax_value = reg->smax_value;
+       }
 }

This helps but we're still hitting the above issue with the packet pointer, as
you pointed out. I'll sort out how we can fix this. Somewhat related,
we have a similar issue we hit fairly consistently that I've been meaning
to sort out, where the cmp happens on a different register than the one
used in the call, for example something like this pseudocode:

   r8 = r2
   if r8 > blah goto +label
   r1 = dest_ptr
   r1 += r2
   r2 = size
   r3 = ptr
   call #some_call

and the verifier aborts because r8 was verified instead of r2. The
working plan was to walk back in the def-use chain and sort it out
but tbd.
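
A hypothetical C shape that tends to lower to that pseudocode (all names
invented for illustration):

	long len = get_len();            /* r2                       */
	long tmp = len;                  /* r8 = r2                  */

	if (tmp > MAX_LEN)               /* if r8 > blah goto +label */
		return 0;

	bpf_probe_read(dest + len,       /* r1 = dest_ptr; r1 += r2  */
		       sizeof(rec),      /* r2 = size                */
		       src);             /* r3 = ptr; call           */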

.John
Yonghong Song Feb. 6, 2020, 1:24 a.m. UTC | #23
On 2/4/20 7:05 PM, John Fastabend wrote:
> Yonghong Song wrote:
>>
>>
>> On 2/4/20 11:55 AM, John Fastabend wrote:
>>> Alexei Starovoitov wrote:
>>>> On Fri, Jan 31, 2020 at 9:16 AM John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>
>>>>> Also don't mind to build pseudo instruction here for signed extension
>>>>> but its not clear to me why we are getting different instruction
>>>>> selections? Its not clear to me why sext is being chosen in your case?
> 
> [...]
> 
>>>> zext is there both cases and it will be optimized with your llvm patch.
>>>> So please send it. Don't delay :)
>>>
>>> LLVM patch here, https://reviews.llvm.org/D73985
>>>
>>> With updated LLVM I can pass selftests with above fix and additional patch
>>> below to get tighter bounds on 32bit registers. So going forward I think
>>> we need to review and assuming it looks good commit above llvm patch and
>>> then go forward with this series.
>>[...]
>> With the above patch, there is still one more issue in test_seg6_loop.o,
>> which is related to llvm code generation, w.r.t. our strange 32bit
>> packet begin and packet end.
>>
>> The following patch is generated:
>>
>> 2: (61) r1 = *(u32 *)(r6 +76)
>> 3: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0)
>> R6_w=ctx(id=0,off=0,imm=0) R10=fp0
>> ; cursor = (void *)(long)skb->data;
>> 3: (bc) w8 = w1
>> 4: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0)
>> R6_w=ctx(id=0,off=0,imm=0)
>> R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
>> ; if ((void *)ipver + sizeof(*ipver) > data_end)
>> 4: (bf) r3 = r8
>>
>> In the above r1 is packet pointer and after the assignment, it becomes a
>> scalar and will lead later verification failure.
>>
>> Without the patch, we generates:
>> 1: R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
>> ; data_end = (void *)(long)skb->data_end;
>> 1: (61) r1 = *(u32 *)(r6 +80)
>> 2: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
>> ; cursor = (void *)(long)skb->data;
>> 2: (61) r8 = *(u32 *)(r6 +76)
>> 3: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
>> R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
>> ; if ((void *)ipver + sizeof(*ipver) > data_end)
>> 3: (bf) r2 = r8
>> 4: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=0,r=0,imm=0)
>> R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
>> 4: (07) r2 += 1
>> 5: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=1,r=0,imm=0)
>> R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
>>
>> r2 keeps as a packet pointer, so we don't have issues later.
>>
>> Not sure how we could fix this in llvm as llvm does not really have idea
>> the above w1 in w8 = w1 is a packet pointer.
>>
> 
> OK thanks for analysis. I have this on my stack as well but need to
> check its correct still,
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 320e2df..3072dba7 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2804,8 +2804,11 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
>                  reg->umax_value = mask;
>          }
>          reg->smin_value = reg->umin_value;
> -       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value)
> +       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value) {
>                  reg->smax_value = reg->umax_value;
> +       } else {
> +               reg->umax_value = reg->smax_value;
> +       }
>   }
> 
> this helps but still hitting above issue with the packet pointer as
> you pointed out. I'll sort out how we can fix this. Somewhat related

I just fixed llvm so it can do a better job with zext code gen:
https://reviews.llvm.org/D74101
This should solve the above issue.

> we have a similar issue we hit fairly consistently I've been meaning
> to sort out where the cmp happens on a different register then is
> used in the call, for example something like this pseudocode
> 
>     r8 = r2
>     if r8 > blah goto +label
>     r1 = dest_ptr
>     r1 += r2
>     r2 = size
>     r3 = ptr
>     call #some_call
> 
> and the verifier aborts because r8 was verified instead of r2. The
> working plan was to walk back in the def-use chain and sort it out
> but tbd.

I have another llvm patch (not merged yet),
   https://reviews.llvm.org/D72787
which undoes some llvm optimization so that we do not generate the above
code. But the resulting byte code needs a kernel verifier change. The
following is my previous attempt, which you commented on:

https://lore.kernel.org/bpf/20200123191815.1364298-1-yhs@fb.com/T/#m8e3dee022801542ddf15b8e406dc05185f959b4f

I think this is better than making the verifier more complex to do
backtracking. What do you think?


> 
> .John
>     
>
John Fastabend Feb. 7, 2020, 8:47 p.m. UTC | #24
Yonghong Song wrote:
> 
> 
> On 2/4/20 7:05 PM, John Fastabend wrote:
> > Yonghong Song wrote:
> >>
> >>
> >> On 2/4/20 11:55 AM, John Fastabend wrote:
> >>> Alexei Starovoitov wrote:
> >>>> On Fri, Jan 31, 2020 at 9:16 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >>>>>
> >>>>> Also don't mind to build pseudo instruction here for signed extension
> >>>>> but its not clear to me why we are getting different instruction
> >>>>> selections? Its not clear to me why sext is being chosen in your case?
> > 
> > [...]
> > 
> >>>> zext is there both cases and it will be optimized with your llvm patch.
> >>>> So please send it. Don't delay :)
> >>>
> >>> LLVM patch here, https://reviews.llvm.org/D73985
> >>>
> >>> With updated LLVM I can pass selftests with above fix and additional patch
> >>> below to get tighter bounds on 32bit registers. So going forward I think
> >>> we need to review and assuming it looks good commit above llvm patch and
> >>> then go forward with this series.
> >>[...]
> >> With the above patch, there is still one more issue in test_seg6_loop.o,
> >> which is related to llvm code generation, w.r.t. our strange 32bit
> >> packet begin and packet end.
> >>
> >> The following patch is generated:
> >>
> >> 2: (61) r1 = *(u32 *)(r6 +76)
> >> 3: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0)
> >> R6_w=ctx(id=0,off=0,imm=0) R10=fp0
> >> ; cursor = (void *)(long)skb->data;
> >> 3: (bc) w8 = w1
> >> 4: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0)
> >> R6_w=ctx(id=0,off=0,imm=0)
> >> R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
> >> ; if ((void *)ipver + sizeof(*ipver) > data_end)
> >> 4: (bf) r3 = r8
> >>
> >> In the above r1 is packet pointer and after the assignment, it becomes a
> >> scalar and will lead later verification failure.
> >>
> >> Without the patch, we generates:
> >> 1: R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
> >> ; data_end = (void *)(long)skb->data_end;
> >> 1: (61) r1 = *(u32 *)(r6 +80)
> >> 2: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
> >> ; cursor = (void *)(long)skb->data;
> >> 2: (61) r8 = *(u32 *)(r6 +76)
> >> 3: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
> >> R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
> >> ; if ((void *)ipver + sizeof(*ipver) > data_end)
> >> 3: (bf) r2 = r8
> >> 4: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=0,r=0,imm=0)
> >> R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
> >> 4: (07) r2 += 1
> >> 5: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=1,r=0,imm=0)
> >> R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
> >>
> >> r2 keeps as a packet pointer, so we don't have issues later.
> >>
> >> Not sure how we could fix this in llvm as llvm does not really have idea
> >> the above w1 in w8 = w1 is a packet pointer.
> >>
> > 
> > OK thanks for analysis. I have this on my stack as well but need to
> > check its correct still,
> > 
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 320e2df..3072dba7 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -2804,8 +2804,11 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
> >                  reg->umax_value = mask;
> >          }
> >          reg->smin_value = reg->umin_value;
> > -       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value)
> > +       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value) {
> >                  reg->smax_value = reg->umax_value;
> > +       } else {
> > +               reg->umax_value = reg->smax_value;
> > +       }
> >   }
> > 
> > this helps but still hitting above issue with the packet pointer as
> > you pointed out. I'll sort out how we can fix this. Somewhat related
> 
> I just fixed llvm to allow itself doing a better job for zext code gen.
> https://reviews.llvm.org/D74101
> This should solve the above issue.

Great, I applied this but still have one more issue to resolve.

> 
> > we have a similar issue we hit fairly consistently I've been meaning
> > to sort out where the cmp happens on a different register then is
> > used in the call, for example something like this pseudocode
> > 
> >     r8 = r2
> >     if r8 > blah goto +label
> >     r1 = dest_ptr
> >     r1 += r2
> >     r2 = size
> >     r3 = ptr
> >     call #some_call
> > 
> > and the verifier aborts because r8 was verified instead of r2. The
> > working plan was to walk back in the def-use chain and sort it out
> > but tbd.
> 
> I have another llvm patch (not merged yet)
>    https://reviews.llvm.org/D72787
> to undo some llvm optimization so we do not have the above code.
> But the resulted byte code needs some kernel verifier change. The 
> following is my previous attempt and you commented on.
> 
> https://lore.kernel.org/bpf/20200123191815.1364298-1-yhs@fb.com/T/#m8e3dee022801542ddf15b8e406dc05185f959b4f
> 
> I think this is better than making verifier more complex to do 
> backtracking. What do you think?
> 

In general I think it's nice if llvm can continue to optimize as it
wants and we can still verify the result. I was hoping to try this next
week and see how it falls out after getting the above resolved. If it
gets too ugly, sure, we can fall back to removing the optimization.
Yonghong Song Feb. 8, 2020, 6:23 a.m. UTC | #25
On 2/7/20 12:47 PM, John Fastabend wrote:
> Yonghong Song wrote:
>>
>>
>> On 2/4/20 7:05 PM, John Fastabend wrote:
>>> Yonghong Song wrote:
>>>>
>>>>
>>>> On 2/4/20 11:55 AM, John Fastabend wrote:
>>>>> Alexei Starovoitov wrote:
>>>>>> On Fri, Jan 31, 2020 at 9:16 AM John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>>>
>>>>>>> Also don't mind to build pseudo instruction here for signed extension
>>>>>>> but its not clear to me why we are getting different instruction
>>>>>>> selections? Its not clear to me why sext is being chosen in your case?
>>>
>>> [...]
>>>
>>>>>> zext is there both cases and it will be optimized with your llvm patch.
>>>>>> So please send it. Don't delay :)
>>>>>
> >>>>> LLVM patch here, https://reviews.llvm.org/D73985
>>>>>
>>>>> With updated LLVM I can pass selftests with above fix and additional patch
>>>>> below to get tighter bounds on 32bit registers. So going forward I think
>>>>> we need to review and assuming it looks good commit above llvm patch and
>>>>> then go forward with this series.
>>>> [...]
>>>> With the above patch, there is still one more issue in test_seg6_loop.o,
>>>> which is related to llvm code generation, w.r.t. our strange 32bit
>>>> packet begin and packet end.
>>>>
>>>> The following patch is generated:
>>>>
>>>> 2: (61) r1 = *(u32 *)(r6 +76)
>>>> 3: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0)
>>>> R6_w=ctx(id=0,off=0,imm=0) R10=fp0
>>>> ; cursor = (void *)(long)skb->data;
>>>> 3: (bc) w8 = w1
>>>> 4: R1_w=pkt(id=0,off=0,r=0,imm=0) R2_w=pkt_end(id=0,off=0,imm=0)
>>>> R6_w=ctx(id=0,off=0,imm=0)
>>>> R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
>>>> ; if ((void *)ipver + sizeof(*ipver) > data_end)
>>>> 4: (bf) r3 = r8
>>>>
>>>> In the above r1 is packet pointer and after the assignment, it becomes a
>>>> scalar and will lead later verification failure.
>>>>
>>>> Without the patch, we generates:
>>>> 1: R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
>>>> ; data_end = (void *)(long)skb->data_end;
>>>> 1: (61) r1 = *(u32 *)(r6 +80)
>>>> 2: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
>>>> ; cursor = (void *)(long)skb->data;
>>>> 2: (61) r8 = *(u32 *)(r6 +76)
>>>> 3: R1_w=pkt_end(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
>>>> R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
>>>> ; if ((void *)ipver + sizeof(*ipver) > data_end)
>>>> 3: (bf) r2 = r8
>>>> 4: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=0,r=0,imm=0)
>>>> R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
>>>> 4: (07) r2 += 1
>>>> 5: R1_w=pkt_end(id=0,off=0,imm=0) R2_w=pkt(id=0,off=1,r=0,imm=0)
>>>> R6_w=ctx(id=0,off=0,imm=0) R8_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
>>>>
>>>> r2 keeps as a packet pointer, so we don't have issues later.
>>>>
>>>> Not sure how we could fix this in llvm as llvm does not really have idea
>>>> the above w1 in w8 = w1 is a packet pointer.
>>>>
>>>
>>> OK thanks for analysis. I have this on my stack as well but need to
>>> check its correct still,
>>>
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index 320e2df..3072dba7 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -2804,8 +2804,11 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
>>>                   reg->umax_value = mask;
>>>           }
>>>           reg->smin_value = reg->umin_value;
>>> -       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value)
>>> +       if (reg->smax_value < 0 || reg->smax_value > reg->umax_value) {
>>>                   reg->smax_value = reg->umax_value;
>>> +       } else {
>>> +               reg->umax_value = reg->smax_value;
>>> +       }
>>>    }
>>>
>>> this helps but still hitting above issue with the packet pointer as
>>> you pointed out. I'll sort out how we can fix this. Somewhat related
>>
>> I just fixed llvm to allow itself doing a better job for zext code gen.
>> https://reviews.llvm.org/D74101
>> This should solve the above issue.
> 
> Great applied this but still have one more issue to resolve.

So this remaining issue is related to coerce_reg_to_size(), right?

> 
>>
>>> we have a similar issue we hit fairly consistently I've been meaning
>>> to sort out where the cmp happens on a different register then is
>>> used in the call, for example something like this pseudocode
>>>
>>>      r8 = r2
>>>      if r8 > blah goto +label
>>>      r1 = dest_ptr
>>>      r1 += r2
>>>      r2 = size
>>>      r3 = ptr
>>>      call #some_call
>>>
>>> and the verifier aborts because r8 was verified instead of r2. The
>>> working plan was to walk back in the def-use chain and sort it out
>>> but tbd.
>>
>> I have another llvm patch (not merged yet)
>>     https://reviews.llvm.org/D72787
>> to undo some llvm optimization so we do not have the above code.
>> But the resulted byte code needs some kernel verifier change. The
>> following is my previous attempt and you commented on.
>>
>> https://lore.kernel.org/bpf/20200123191815.1364298-1-yhs@fb.com/T/#m8e3dee022801542ddf15b8e406dc05185f959b4f
>>
>> I think this is better than making verifier more complex to do
>> backtracking. What do you think?
>>
> 
> In general I think its nice if llvm can continue to optimize as it
> wants and we can verify it. I was hoping to try this next week and
> see how it falls out after getting the above resolved. If it gets
> too ugly sure we can fall back to removing the optimization.

Sounds good. Looking forward to your patch. Thanks!
Lorenzo Fontana April 9, 2020, 3:03 p.m. UTC | #26
On 1/27/20 8:29 PM, John Fastabend wrote:
> do_refine_retval_range() is called to refine return values from specified
> helpers, probe_read_str and get_stack at the moment, the reasoning is
> because both have a max value as part of their input arguments and
> because the helper ensure the return value will not be larger than this
> we can set smax values of the return register, r0.
> 
> However, the return value is a signed integer so setting umax is incorrect
> It leads to further confusion when the do_refine_retval_range() then calls,
> __reg_deduce_bounds() which will see a umax value as meaning the value is
> unsigned and then assuming it is unsigned set the smin = umin which in this
> case results in 'smin = 0' and an 'smax = X' where X is the input argument
> from the helper call.
> 
> Here are the comments from _reg_deduce_bounds() on why this would be safe
> to do.
> 
>  /* Learn sign from unsigned bounds.  Signed bounds cross the sign
>   * boundary, so we must be careful.
>   */
>  if ((s64)reg->umax_value >= 0) {
> 	/* Positive.  We can't learn anything from the smin, but smax
> 	 * is positive, hence safe.
> 	 */
> 	reg->smin_value = reg->umin_value;
> 	reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value,
> 						  reg->umax_value);
> 
> But now we incorrectly have a return value with type int with the
> signed bounds (0,X). Suppose the return value is negative, which is
> possible the we have the verifier and reality out of sync. Among other
> things this may result in any error handling code being falsely detected
> as dead-code and removed. For instance the example below shows using
> bpf_probe_read_str() causes the error path to be identified as dead
> code and removed.
> 
>>From the 'llvm-object -S' dump,
> 
>  r2 = 100
>  call 45
>  if r0 s< 0 goto +4
>  r4 = *(u32 *)(r7 + 0)
> 
> But from dump xlate
> 
>   (b7) r2 = 100
>   (85) call bpf_probe_read_compat_str#-96768
>   (61) r4 = *(u32 *)(r7 +0)  <-- dropped if goto
> 
> Due to verifier state after call being
> 
>  R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f))
> 
> To fix omit setting the umax value because its not safe. The only
> actual bounds we know is the smax. This results in the correct bounds
> (SMIN, X) where X is the max length from the helper. After this the
> new verifier state looks like the following after call 45.
> 
> R0=inv(id=0,smax_value=100)
> 
> Then xlated version no longer removed dead code giving the expected
> result,
> 
>   (b7) r2 = 100
>   (85) call bpf_probe_read_compat_str#-96768
>   (c5) if r0 s< 0x0 goto pc+4
>   (61) r4 = *(u32 *)(r7 +0)
> 
> Note, bpf_probe_read_* calls are root only so we wont hit this case
> with non-root bpf users.
> 
> v3: comment had some documentation about meta set to null case which
> is not relevant here and confusing to include in the comment.
> 
> v2 note: In original version we set msize_smax_value from check_func_arg()
> and propagated this into smax of retval. The logic was smax is the bound
> on the retval we set and because the type in the helper is ARG_CONST_SIZE
> we know that the reg is a positive tnum_const() so umax=smax. Alexei
> pointed out though this is a bit odd to read because the register in
> check_func_arg() has a C type of u32 and the umax bound would be the
> normally relavent bound here. Pulling in extra knowledge about future
> checks makes reading the code a bit tricky. Further having a signed
> meta data that can only ever be positive is also a bit odd. So dropped
> the msize_smax_value metadata and made it a u64 msize_max_value to
> indicate its unsigned. And additionally save bound from umax value in
> check_arg_funcs which is the same as smax due to as noted above tnumx_cont
> and negative check but reads better. By my analysis nothing functionally
> changes in v2 but it does get easier to read so that is win.
> 
> Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  kernel/bpf/verifier.c |   19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7d530ce8719d..adeee88102e5 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -227,8 +227,7 @@ struct bpf_call_arg_meta {
>  	bool pkt_access;
>  	int regno;
>  	int access_size;
> -	s64 msize_smax_value;
> -	u64 msize_umax_value;
> +	u64 msize_max_value;
>  	int ref_obj_id;
>  	int func_id;
>  	u32 btf_id;
> @@ -3569,11 +3568,15 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  	} else if (arg_type_is_mem_size(arg_type)) {
>  		bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
>  
> -		/* remember the mem_size which may be used later
> -		 * to refine return values.
> +		/* This is used to refine r0 return value bounds for helpers
> +		 * that enforce this value as an upper bound on return values.
> +		 * See do_refine_retval_range() for helpers that can refine
> +		 * the return value. C type of helper is u32 so we pull register
> +		 * bound from umax_value however, if negative verifier errors
> +		 * out. Only upper bounds can be learned because retval is an
> +		 * int type and negative retvals are allowed.
>  		 */
> -		meta->msize_smax_value = reg->smax_value;
> -		meta->msize_umax_value = reg->umax_value;
> +		meta->msize_max_value = reg->umax_value;
>  
>  		/* The register is SCALAR_VALUE; the access check
>  		 * happens using its boundaries.
> @@ -4077,10 +4080,10 @@ static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
>  	     func_id != BPF_FUNC_probe_read_str))
>  		return;
>  
> -	ret_reg->smax_value = meta->msize_smax_value;
> -	ret_reg->umax_value = meta->msize_umax_value;
> +	ret_reg->smax_value = meta->msize_max_value;
>  	__reg_deduce_bounds(ret_reg);
>  	__reg_bound_offset(ret_reg);
> +	__update_reg_bounds(ret_reg);
>  }
>  
>  static int
> 



I've been working on this problem too. Based on what we did to fix it in our BPF program by changing the return value conditionals [0], the reasoning behind this patch looks good to me. I also applied this patch series and I no longer see the loop in the xlated code.

Thanks for working on this, John.


[0] https://patch-diff.githubusercontent.com/raw/draios/sysdig/pull/1612.patch
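
For reference, a minimal sketch of the affected pattern (illustrative
only, not the actual change from the sysdig patch; the section name and
src_addr are made up):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char buf[100];
const volatile __u64 src_addr = 0;	/* set by the loader; hypothetical */

SEC("kprobe/illustrative")
int check_retval_signed(void *ctx)
{
	int ret = bpf_probe_read_str(buf, sizeof(buf), (const void *)src_addr);

	/* error path: with the buggy (0, 100) bounds on r0 the verifier
	 * believed this branch was dead and dropped it
	 */
	if (ret < 0)
		return 0;

	/* success path: ret is the length read (including NUL), <= 100 */
	return ret;
}

char LICENSE[] SEC("license") = "GPL";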

Patch

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7d530ce8719d..adeee88102e5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -227,8 +227,7 @@  struct bpf_call_arg_meta {
 	bool pkt_access;
 	int regno;
 	int access_size;
-	s64 msize_smax_value;
-	u64 msize_umax_value;
+	u64 msize_max_value;
 	int ref_obj_id;
 	int func_id;
 	u32 btf_id;
@@ -3569,11 +3568,15 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 	} else if (arg_type_is_mem_size(arg_type)) {
 		bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
 
-		/* remember the mem_size which may be used later
-		 * to refine return values.
+		/* This is used to refine r0 return value bounds for helpers
+		 * that enforce this value as an upper bound on return values.
+		 * See do_refine_retval_range() for helpers that can refine
+		 * the return value. C type of helper is u32 so we pull register
+		 * bound from umax_value however, if negative verifier errors
+		 * out. Only upper bounds can be learned because retval is an
+		 * int type and negative retvals are allowed.
 		 */
-		meta->msize_smax_value = reg->smax_value;
-		meta->msize_umax_value = reg->umax_value;
+		meta->msize_max_value = reg->umax_value;
 
 		/* The register is SCALAR_VALUE; the access check
 		 * happens using its boundaries.
@@ -4077,10 +4080,10 @@  static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
 	     func_id != BPF_FUNC_probe_read_str))
 		return;
 
-	ret_reg->smax_value = meta->msize_smax_value;
-	ret_reg->umax_value = meta->msize_umax_value;
+	ret_reg->smax_value = meta->msize_max_value;
 	__reg_deduce_bounds(ret_reg);
 	__reg_bound_offset(ret_reg);
+	__update_reg_bounds(ret_reg);
 }
 
 static int