diff mbox series

[v3,bpf-next,08/19] bpf: insert explicit zero extension insn when hardware doesn't do it implicitly

Message ID 1555106392-20117-9-git-send-email-jiong.wang@netronome.com
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series bpf: eliminate zero extensions for sub-register writes | expand

Commit Message

Jiong Wang April 12, 2019, 9:59 p.m. UTC
After the previous patches, the verifier has marked those instructions that
really need zero extension on dst_reg.

It is then up to each back-end to decide how to use this information to
eliminate unnecessary zero-extension code-gen during JIT compilation.

One approach is:
  1. The verifier inserts explicit zero extension for those instructions
     that need it.
  2. JIT back-ends do NOT generate zero extension for sub-register
     writes any more.

The good thing about this approach is that it requires no major change to
the JIT back-end interface, so all back-ends get this optimization.

However, only those back-ends that do not have hardware zero extension
want this optimization. For back-ends like x86_64 and AArch64, there is
hardware support, so zext insertion should be disabled.

This patch introduces a new target hook, "bpf_jit_hardware_zext", which
defaults to true, meaning the underlying hardware does zero extension
implicitly, so zext insertion by the verifier is disabled. Once a back-end
overrides this hook to return false, the verifier will insert zext
sequences to clear the high 32 bits of definitions when necessary.

Offload targets do not use this native target hook; instead, they can get
the optimization results using bpf_prog_offload_ops.finalize.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
---
 include/linux/bpf.h    |  1 +
 include/linux/filter.h |  1 +
 kernel/bpf/core.c      |  8 +++++
 kernel/bpf/verifier.c  | 87 +++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 96 insertions(+), 1 deletion(-)

Comments

Naveen N. Rao April 15, 2019, 9:59 a.m. UTC | #1
Hi Jiong,

Jiong Wang wrote:
> After previous patches, verifier has marked those instructions that really
> need zero extension on dst_reg.

Thanks for implementing this -- this is very helpful on architectures 
without sub-register instructions, especially in comparison with legacy 
BPF, since the move to eBPF resulted in many more instructions being 
generated.

I have a small nit below on the overall approach...

> 
> It is then for all back-ends to decide how to use such information to
> eliminate unnecessary zero extension code-gen during JIT compilation.
> 
> One approach is:
>   1. Verifier insert explicit zero extension for those instructions that
>      need zero extension.
>   2. All JIT back-ends do NOT generate zero extension for sub-register
>      write any more.

Is it possible to instead give a hint to the JIT back-ends on the 
instructions needing zero extension? That would help in the case of 
architectures that have a single, more optimal instruction for zero 
extension, compared to having to emit 2 instructions with the current 
approach.

- Naveen
Naveen N. Rao April 15, 2019, 10:11 a.m. UTC | #2
Naveen N. Rao wrote:
>> It is then for all back-ends to decide how to use such information to
>> eliminate unnecessary zero extension code-gen during JIT compilation.
>> 
>> One approach is:
>>   1. Verifier insert explicit zero extension for those instructions that
>>      need zero extension.
>>   2. All JIT back-ends do NOT generate zero extension for sub-register
>>      write any more.
> 
> Is it possible to instead give a hint to the JIT back-ends on the 
> instructions needing zero-extension? That would help in case of 
> architectures that have single/more-optimal instruction for zero 
> extension, compared to having to emit 2 instructions with the current 
> approach.

I just noticed your discussion with Alexei on RFC v1 after posting this.  
I agree that this can be looked into subsequently -- either a new 
instruction, or detecting this during JIT.

- Naveen
Jiong Wang April 15, 2019, 11:24 a.m. UTC | #3
Naveen N. Rao writes:

> Naveen N. Rao wrote:
>>> It is then for all back-ends to decide how to use such information to
>>> eliminate unnecessary zero extension code-gen during JIT compilation.
>>> 
>>> One approach is:
>>>   1. Verifier insert explicit zero extension for those instructions that
>>>      need zero extension.
>>>   2. All JIT back-ends do NOT generate zero extension for sub-register
>>>      write any more.
>> 
>> Is it possible to instead give a hint to the JIT back-ends on the 
>> instructions needing zero-extension? That would help in case of 
>> architectures that have single/more-optimal instruction for zero 
>> extension, compared to having to emit 2 instructions with the current 
>> approach.
>
> I just noticed your discussion with Alexei on RFC v1 after posting this.  
> I agree that this can be looked into subsequently -- either a new 
> instruction, or detecting this during JIT.

Thanks Naveen.

It will be great if you could test the latest set on PowerPC to see if
there is any regression, for example in the tests under test_progs and
test_verifier.

And it will be even better if you also use the latest llvm snapshot for the
testing, which will then enable test_progs_32 etc.

Thanks.

Regards,
Jiong
Naveen N. Rao April 15, 2019, 6:21 p.m. UTC | #4
Jiong Wang wrote:
> 
> It will be great if you could test the latest set on PowerPC to see if
> there is any regression for example for those under test_progs and
> test_verifier.

With test_bpf, I am seeing a few failures with this patchset.

> 
> And it will be even greater if you also use latest llvm snapshot for the
> testing, which then will enable test_progs_32 etc.

Is a newer llvm a dependency? Or, is this also expected to work with 
older llvm levels?

The set of tests that are failing is listed further below. I looked 
into MUL_X2, and it looks like the zero extension for the two initial ALU32 
loads (-1) is being removed, resulting in the failure.

I didn't get to look into this in detail -- am I missing something?


- Naveen


---
$ cat ~/jit_fail.out | grep -v "JIT code" | grep -B4 FAIL
test_bpf: #38 INT: MUL_X2 Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=9 proglen=64 pass=3 image=d000000006bfca9c from=insmod pid=8923
jited:1 ret -1 != 1 FAIL (1 times)
test_bpf: #39 INT: MUL32_X 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=9 proglen=64 pass=3 image=d000000006c335fc from=insmod pid=8923
jited:1 ret -1 != 1 FAIL (1 times)

test_bpf: #49 INT: shifts by register 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=30 proglen=192 pass=3 image=d000000006eb80e4 from=insmod pid=8923
jited:1 ret -1234 != -1 FAIL (1 times)

test_bpf: #68 ALU_MOV_K: 0x0000ffffffff0000 = 0x00000000ffffffff 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=10 proglen=76 pass=3 image=d000000007290e48 from=insmod pid=8923
jited:1 ret 2 != 1 FAIL (1 times)

test_bpf: #75 ALU_ADD_X: 2 + 4294967294 = 0 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=10 proglen=64 pass=3 image=d0000000074537b0 from=insmod pid=8923
jited:1 ret 0 != 1 FAIL (1 times)

test_bpf: #82 ALU_ADD_K: 4294967294 + 2 = 0 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=8 proglen=60 pass=3 image=d00000000761af8c from=insmod pid=8923
jited:1 ret 0 != 1 FAIL (1 times)
test_bpf: #83 ALU_ADD_K: 0 + (-1) = 0x00000000ffffffff 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=10 proglen=64 pass=3 image=d0000000076579dc from=insmod pid=8923
jited:1 ret 2 != 1 FAIL (1 times)

test_bpf: #86 ALU_ADD_K: 0 + 0x80000000 = 0x80000000 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=10 proglen=64 pass=3 image=d000000007719958 from=insmod pid=8923
jited:1 ret 2 != 1 FAIL (1 times)
test_bpf: #87 ALU_ADD_K: 0 + 0x80008000 = 0x80008000 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=10 proglen=72 pass=3 image=d000000007752510 from=insmod pid=8923
jited:1 ret 2 != 1 FAIL (1 times)

test_bpf: #118 ALU_MUL_K: 1 * (-1) = 0x00000000ffffffff 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=10 proglen=64 pass=3 image=d000000007f184f8 from=insmod pid=8923
jited:1 ret 2 != 1 FAIL (1 times)

test_bpf: #371 JNE signed compare, test 1 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=8 proglen=60 pass=3 image=d000000002394ab8 from=insmod pid=8923
jited:1 ret 2 != 1 FAIL (1 times)
test_bpf: #372 JNE signed compare, test 2 
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=8 proglen=60 pass=3 image=d0000000023d98b4 from=insmod pid=8923
jited:1 ret 2 != 1 FAIL (1 times)

Pass 1: shrink = 0, seen = 0x18
Pass 2: shrink = 0, seen = 0x18
flen=13 proglen=92 pass=3 image=d0000000025105f8 from=insmod pid=8923
jited:1 12 PASS
test_bpf: Summary: 366 PASSED, 12 FAILED, [366/366 JIT'ed]
Jiong Wang April 15, 2019, 7:28 p.m. UTC | #5
> On 15 Apr 2019, at 19:21, Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> wrote:
> 
> Jiong Wang wrote:
>> It will be great if you could test the latest set on PowerPC to see if
>> there is any regression for example for those under test_progs and
>> test_verifier.
> 
> With test_bpf, I am seeing a few failures with this patchset.
> 
>> And it will be even greater if you also use latest llvm snapshot for the
>> testing, which then will enable test_progs_32 etc.
> 
> Is a newer llvm a dependency? Or, is this also expected to work with older llvm levels?

There is no newer LLVM dependency. This set should work with older llvm.

It is just that newer LLVM has better sub-register code-gen support, which
can generate bpf programs containing more elimination opportunities for the
verifier.

> 
> The set of tests that are failing are listed further below. I looked into MUL_X2 and it looks like zero extension for the two initial ALU32 loads (-1) are being removed, resulting in the failure.
> 
> I didn't get to look into this in detail -- am I missing something?

Hmm, I guess the issue is:
                                                                                
  1. test_bpf.c is a testsuite running inside kernel space, it is calling some
     kernel eBPF jit interface directly without calling verifier first, so this
     set actually hasn’t been triggered.

  2. However, the elimination information at the moment is passed from verifier
     to JIT backend through

       fp->aux->no_verifier_zext

     “no_verifier_zext” starts out false; the verifier sets it to true only
     when it has not inserted zero extensions (e.g. because the hardware
     does them implicitly).

     Now, for test_bpf, because it doesn’t go through the verifier at all,
     “no_verifier_zext” is left at its default value, false, meaning the
     verifier has inserted zero extensions, so the PPC back-end thinks it is
     safe to eliminate zero extension by itself.

     Perhaps we should change “no_verifier_zext” to “verifier_zext”; then the
     default is false and it will only be true when the verifier really has
     inserted zext.
      
     I was thinking this will cause the JIT back-end to write the check like:

        if (no_verifier_zext)
          insert_zext_by_JIT

     which is better than:

        if (!verifier_zext)
          insert_zext_by_JIT

BTW, do test_progs and test_verifier have a full pass on PowerPC?
On an arch without hardware zext like PowerPC, the verifier will insert zext,
and test mode will still randomise the high 32 bits of those sub-registers
not zext'ed; this is a very stressful test.

Regards,
Jiong

Naveen N. Rao April 16, 2019, 6:41 a.m. UTC | #6
Jiong Wang wrote:
> 
>> On 15 Apr 2019, at 19:21, Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> wrote:
>> 
>> Jiong Wang wrote:
>>> It will be great if you could test the latest set on PowerPC to see if
>>> there is any regression for example for those under test_progs and
>>> test_verifier.
>> 
>> With test_bpf, I am seeing a few failures with this patchset.
>> 
>>> And it will be even greater if you also use latest llvm snapshot for the
>>> testing, which then will enable test_progs_32 etc.
>> 
>> Is a newer llvm a dependency? Or, is this also expected to work with older llvm levels?
> 
> There is no newer LLVM dependency. This set should work with older llvm.        
>                                                                                 
> It is just newer LLVM has better sub-register code-gen support that could       
> the generate bpf program contains more elimination opportunities for verifier.

Ok, I will try and get to that by next week (busy with other things 
right now).

> 
>> 
>> The set of tests that are failing are listed further below. I looked into MUL_X2 and it looks like zero extension for the two initial ALU32 loads (-1) are being removed, resulting in the failure.
>> 
>> I didn't get to look into this in detail -- am I missing something?
> 
> Hmm, I guess the issue is:
>                                                                                 
>   1. test_bpf.c is a testsuite running inside kernel space, it is calling some
>      kernel eBPF jit interface directly without calling verifier first, so this
>      set actually hasn’t been triggered.

Ah, indeed.

> 
>   2. However, the elimination information at the moment is passed from verifier
>      to JIT backend through
> 
>        fp->aux->no_verifier_zext
> 
>      “no_verifier_zext” is initially false, and once verifier inserted zero
>      extension, it will be set to true.
> 
>      Now, for test_bpf, because it doesn’t go through verifier at all, so
>      “no_verifier_zext” is left at default value which is false, meaning
>      verifier has inserted zero-extension, so PPC backend then thinks it is
>      safe to eliminate zero-extension by himself.
> 
>      Perhaps should change “no_verifier_zext” to “verifier_zext”, then default
>      is false and will only be true when verifier really has inserted zext.

Yes, that's probably better.

>       
>      Was thinking, this will cause JIT backend writing the check like
>         if (no_verifier_zext)
>           insert_zext_by_JIT
>      
>      is better than:
>         
>         if (!verifier_zext)
>           insert_zext_by_JIT
> 
> BTW, does test_progs and test_verifier has a full pass on PowerPC?
> On arch without hardware zext like PowerPC, verifier will insert zext and test
> mode will still randomisation high 32-bit for those sub-registers not zext,
> this is very stressful test.

test_verifier is throwing up one failure with this patchset:
#569/p ld_abs: vlan + abs, test 1 FAIL
Failed to load prog 'Success'!
insn 2463 cannot be patched due to 16-bit range
verification time 172602 usec
stack depth 0
processed 30728 insns (limit 1000000) max_states_per_insn 1 total_states 1022 peak_states 1022 mark_read 1

This test passes with bpf-next/master. Btw, I tried with your v4 patches 
though I am replying here...

test_progs has no regression, but has 15 failures even without these 
patches that I need to look into.


- Naveen
Jiong Wang April 16, 2019, 7:47 a.m. UTC | #7
Naveen N. Rao writes:

> Jiong Wang wrote:
>>
>>> On 15 Apr 2019, at 19:21, Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> wrote:
>>>
>>> Jiong Wang wrote:
>>>> It will be great if you could test the latest set on PowerPC to see if
>>>> there is any regression for example for those under test_progs and
>>>> test_verifier.
>>>
>>> With test_bpf, I am seeing a few failures with this patchset.
>>>
>>>> And it will be even greater if you also use latest llvm snapshot for the
>>>> testing, which then will enable test_progs_32 etc.
>>>
>>> Is a newer llvm a dependency? Or, is this also expected to work with older llvm levels?
>>
>> There is no newer LLVM dependency. This set should work with older llvm.
>>
>> It is just newer LLVM has better sub-register code-gen support that could
>> generate bpf programs containing more elimination opportunities for the
>> verifier.
>
> Ok, I will try and get to that by next week (busy with other things
> right now).

Great, thanks!

>>>
>>> The set of tests that are failing are listed further below. I looked into MUL_X2 and it looks like zero extension for the two initial ALU32 loads (-1) are being removed, resulting in the failure.
>>>
>>> I didn't get to look into this in detail -- am I missing something?
>>
>> Hmm, I guess the issue is:
>>
>>   1. test_bpf.c is a testsuite running inside kernel space, it is calling some
>>      kernel eBPF jit interface directly without calling verifier first, so this
>>      set actually hasn’t been triggered.
>
> Ah, indeed.
>
>>
>>   2. However, the elimination information at the moment is passed from verifier
>>      to JIT backend through
>>
>>        fp->aux->no_verifier_zext
>>
>>      “no_verifier_zext” is initially false, and once verifier inserted zero
>>      extension, it will be set to true.
>>
>>      Now, for test_bpf, because it doesn’t go through verifier at all, so
>>      “no_verifier_zext” is left at default value which is false, meaning
>>      verifier has inserted zero-extension, so PPC backend then thinks it is
>>      safe to eliminate zero-extension by himself.
>>
>>      Perhaps should change “no_verifier_zext” to “verifier_zext”, then default
>>      is false and will only be true when verifier really has inserted zext.
>
> Yes, that's probably better.
>
>>      Was thinking, this will cause JIT backend writing the check like
>>         if (no_verifier_zext)
>>           insert_zext_by_JIT
>>      is better than:
>>         if (!verifier_zext)
>>           insert_zext_by_JIT
>>
>> BTW, does test_progs and test_verifier has a full pass on PowerPC?
>> On arch without hardware zext like PowerPC, verifier will insert zext and test
>> mode will still randomisation high 32-bit for those sub-registers not zext,
>> this is very stressful test.
>
> test_verfier is throwing up one failure with this patchset:
> #569/p ld_abs: vlan + abs, test 1 FAIL
> Failed to load prog 'Success'!
> insn 2463 cannot be patched due to 16-bit range
> verification time 172602 usec
> stack depth 0
> processed 30728 insns (limit 1000000) max_states_per_insn 1 total_states 1022 peak_states 1022 mark_read 1
>
> This test passes with bpf-next/master. Btw, I tried with your v4
> patches though I am replying here...

ld_abs: vlan + abs is a special test which calls a helper,
"bpf_fill_ld_abs_vlan_push_pop", to fill (1 << 15) insns, which is the
maximum jump distance. Extra code insertion may overflow some jump inside
the test. The selftest patch in this set changed one place to ALU64 to
avoid insertion of the high 32-bit randomization sequence. Now for PowerPC,
zero extension of the low 32 bits can be inserted, so this testcase needs
further adjustment.

I will try to emulate and fix this issue on my x86_64 env.

> test_progs has no regression, but has 15 failures even without these
> patches that I need to look into.

That's good news, that there is no regression on test_progs.

Thanks.

Regards,
Jiong

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 884b8e1..bdab6e7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -368,6 +368,7 @@  struct bpf_prog_aux {
 	u32 id;
 	u32 func_cnt; /* used by non-func prog as the number of func progs */
 	u32 func_idx; /* 0 for non-func prog, the index in func array for func prog */
+	bool no_verifier_zext; /* No zero extension insertion by verifier. */
 	bool offload_requested;
 	struct bpf_prog **func;
 	void *jit_data; /* JIT specific data. arch dependent */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fb0edad..8750657 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -821,6 +821,7 @@  u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
+bool bpf_jit_hardware_zext(void);
 bool bpf_helper_changes_pkt_data(void *func);
 
 static inline bool bpf_dump_raw_ok(void)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2792eda..1c54274 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2091,6 +2091,14 @@  bool __weak bpf_helper_changes_pkt_data(void *func)
 	return false;
 }
 
+/* Return TRUE if the target hardware of JIT will do zero extension to high bits
+ * when writing to the low 32-bit of one register. Otherwise, return FALSE.
+ */
+bool __weak bpf_jit_hardware_zext(void)
+{
+	return true;
+}
+
 /* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
  * skb_copy_bits(), so provide a weak definition of it for NET-less config.
  */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 83b3f83..016f81d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7551,6 +7551,80 @@  static int opt_remove_nops(struct bpf_verifier_env *env)
 	return 0;
 }
 
+static int opt_subreg_zext_lo32(struct bpf_verifier_env *env)
+{
+	struct bpf_insn_aux_data orig_aux, *aux = env->insn_aux_data;
+	struct bpf_insn *insns = env->prog->insnsi;
+	int i, delta = 0, len = env->prog->len;
+	struct bpf_insn zext_patch[3];
+	struct bpf_prog *new_prog;
+
+	zext_patch[1] = BPF_ALU64_IMM(BPF_LSH, 0, 32);
+	zext_patch[2] = BPF_ALU64_IMM(BPF_RSH, 0, 32);
+	for (i = 0; i < len; i++) {
+		int adj_idx = i + delta;
+		struct bpf_insn insn;
+
+		if (!aux[adj_idx].zext_dst)
+			continue;
+
+		insn = insns[adj_idx];
+		/* "adjust_insn_aux_data" only retains the original insn aux
+		 * data if insn at patched offset is at the end of the patch
+		 * buffer. That is to say, given the following insn sequence:
+		 *
+		 *   insn 1
+		 *   insn 2
+		 *   insn 3
+		 *
+		 * if the patch offset is at insn 2, then the patch buffer must
+		 * be the following that original insn aux data can be retained.
+		 *
+		 *   {lshift, rshift, insn2}
+		 *
+		 * However, zero extension needs to be inserted after insn2, so
+		 * insn patch buffer needs to be the following:
+		 *
+		 *   {insn2, lshift, rshift}
+		 *
+		 * which would cause the insn aux data of insn2 to be lost, and
+		 * that data is critical for the ctx field load instruction to
+		 * be transformed correctly later inside "convert_ctx_accesses".
+		 *
+		 * The simplest way to fix this is to build the following patch
+		 * buffer:
+		 *
+		 *   {lshift, rshift, insn-next-to-insn2}
+		 *
+		 * Given insn2 defines a value, it can't be a JMP, hence there
+		 * must be a next insn for it otherwise CFG check should have
+		 * rejected this program. However, insn-next-to-insn2 could
+		 * be a JMP and verifier insn patch infrastructure doesn't
+		 * support adjust offset for JMP inside patch buffer. We would
+		 * end up with a few insn check and offset adj code outside of
+		 * the generic insn patch helpers if we go with this approach.
+		 *
+		 * Therefore, we still use {insn2, lshift, rshift} as the patch
+		 * buffer, but copy and restore the insn aux data for insn2
+		 * explicitly. The change looks simpler and smaller.
+		 */
+		zext_patch[0] = insns[adj_idx];
+		zext_patch[1].dst_reg = insn.dst_reg;
+		zext_patch[2].dst_reg = insn.dst_reg;
+		memcpy(&orig_aux, &aux[adj_idx], sizeof(orig_aux));
+		new_prog = bpf_patch_insn_data(env, adj_idx, zext_patch, 3);
+		if (!new_prog)
+			return -ENOMEM;
+		env->prog = new_prog;
+		insns = new_prog->insnsi;
+		aux = env->insn_aux_data;
+		memcpy(&aux[adj_idx], &orig_aux, sizeof(orig_aux));
+		delta += 2;
+	}
+
+	return 0;
+}
+
 /* convert load instructions that access fields of a context type into a
  * sequence of instructions that access fields of the underlying structure:
  *     struct __sk_buff    -> struct sk_buff
@@ -8382,7 +8456,18 @@  int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 	if (ret == 0)
 		ret = check_max_stack_depth(env);
 
-	/* instruction rewrites happen after this point */
+	/* Instruction rewrites happen after this point.
+	 * For offload target, finalize hook has all aux insn info, do any
+	 * customized work there.
+	 */
+	if (ret == 0 && !bpf_jit_hardware_zext() &&
+	    !bpf_prog_is_dev_bound(env->prog->aux)) {
+		ret = opt_subreg_zext_lo32(env);
+		env->prog->aux->no_verifier_zext = !!ret;
+	} else {
+		env->prog->aux->no_verifier_zext = true;
+	}
+
 	if (is_priv) {
 		if (ret == 0)
 			opt_hard_wire_dead_code_branches(env);