bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K

Message ID 20181206170646.3736-1-mdroth@linux.vnet.ibm.com
State New
Headers show
Series
  • bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K
Related show

Checks

Context Check Description
snowpatch_ozlabs/checkpatch fail total: 1 errors, 1 warnings, 2 checks, 9 lines checked
snowpatch_ozlabs/build-pmac32 success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64e success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64be success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64le success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/apply_patch success next/apply_patch Successfully applied

Commit Message

Michael Roth Dec. 6, 2018, 5:06 p.m.
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
and is adjusted again at init time if MODULES_VADDR is defined.

For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c400000 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:

  root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
  -1673527296

and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:

  setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)

and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
would time out on systemd/netplan configuring a virtio-net NIC with no
noticeable errors in the logs.

Fix this by limiting the compile-time default for bpf_jit_limit to
INT_MAX.

Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Cc: linuxppc-dev@ozlabs.org
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 kernel/bpf/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Michael Ellerman Dec. 7, 2018, 12:31 p.m. | #1
Michael Roth <mdroth@linux.vnet.ibm.com> writes:

> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
> JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
> and is adjusted again at init time if MODULES_VADDR is defined.
>
> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with

But maybe it should be, I don't know why we don't define it.

> the compile-time default at boot-time, which is 0x9c400000 when
> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
> value:
>
>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
>   -1673527296
>
> and can cause various unexpected failures throughout the network
> stack. In one case `strace dhclient eth0` reported:
>
>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
>
> and similar failures can be seen with tools like tcpdump. This doesn't
> always reproduce however, and I'm not sure why. The more consistent
> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
> would time out on systemd/netplan configuring a virtio-net NIC with no
> noticeable errors in the logs.
>
> Fix this by limiting the compile-time default for bpf_jit_limit to
> INT_MAX.

INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on
whether we should be using PAGE_SIZE here at all. I guess each BPF
program uses at least one page is the thinking?

Thanks for tracking this down. For some reason none of my ~10 test boxes
have hit this, perhaps I don't have new enough userspace?

You don't mention why you needed to add BPF_MIN(), I assume because the
kernel version of min() has gotten too complicated to work here?

Daniel I assume you'll merge this via your tree?

cheers

> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
> Cc: linuxppc-dev@ozlabs.org
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Sandipan Das <sandipan@linux.ibm.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
>  kernel/bpf/core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index b1a3545d0ec8..55de4746cdfd 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>  }
>  
>  #ifdef CONFIG_BPF_JIT
> -# define BPF_JIT_LIMIT_DEFAULT	(PAGE_SIZE * 40000)
> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
> +# define BPF_JIT_LIMIT_DEFAULT	BPF_MIN((PAGE_SIZE * 40000), INT_MAX)
>  
>  /* All BPF JIT sysctl knobs here. */
>  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
> -- 
> 2.17.1
Michael Roth Dec. 7, 2018, 3:36 p.m. | #2
Quoting Michael Ellerman (2018-12-07 06:31:13)
> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> 
> > Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
> > JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
> > and is adjusted again at init time if MODULES_VADDR is defined.
> >
> > For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
> 
> But maybe it should be, I don't know why we don't define it.
> 
> > the compile-time default at boot-time, which is 0x9c400000 when
> > using 64K page size. This overflows the signed 32-bit bpf_jit_limit
> > value:
> >
> >   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
> >   -1673527296
> >
> > and can cause various unexpected failures throughout the network
> > stack. In one case `strace dhclient eth0` reported:
> >
> >   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
> >
> > and similar failures can be seen with tools like tcpdump. This doesn't
> > always reproduce however, and I'm not sure why. The more consistent
> > failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
> > would time out on systemd/netplan configuring a virtio-net NIC with no
> > noticeable errors in the logs.
> >
> > Fix this by limiting the compile-time default for bpf_jit_limit to
> > INT_MAX.
> 
> INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on
> whether we should be using PAGE_SIZE here at all. I guess each BPF
> program uses at least one page is the thinking?

That seems to be the case, at least, the max number of minimum-sized
allocations would be less on ppc64 since the allocations are always at
least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
so it seemed consistent to do that here too.

> 
> Thanks for tracking this down. For some reason none of my ~10 test boxes
> have hit this, perhaps I don't have new enough userspace?

I'm not too sure, I would've thought things like the dhclient case in
the commit log would fail every time, but sometimes I need to reboot the
guest before I start seeing the behavior. Maybe there's something special
about when JIT allocations are actually done that can affect
reproducibility?

In my case at least the virtio-net networking timeout was consistent
enough for a bisect, but maybe it depends on the specific network
configuration (single NIC, basic DHCP through netplan/systemd in my case).

> 
> You don't mention why you needed to add BPF_MIN(), I assume because the
> kernel version of min() has gotten too complicated to work here?

I wasn't sure if it was safe here or not, so I tried looking at other
users and came across:

mm/vmalloc.c:777:#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */

I'm not sure what the reasoning was (or whether it still applies), but I
figured it was safer to do the same here. Maybe Nick still recalls?

> 
> Daniel I assume you'll merge this via your tree?
> 
> cheers
> 
> > Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
> > Cc: linuxppc-dev@ozlabs.org
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Sandipan Das <sandipan@linux.ibm.com>
> > Cc: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> > ---
> >  kernel/bpf/core.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index b1a3545d0ec8..55de4746cdfd 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
> >  }
> >  
> >  #ifdef CONFIG_BPF_JIT
> > -# define BPF_JIT_LIMIT_DEFAULT       (PAGE_SIZE * 40000)
> > +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
> > +# define BPF_JIT_LIMIT_DEFAULT       BPF_MIN((PAGE_SIZE * 40000), INT_MAX)
> >  
> >  /* All BPF JIT sysctl knobs here. */
> >  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
> > -- 
> > 2.17.1
>
Daniel Borkmann Dec. 10, 2018, 2:26 p.m. | #3
On 12/07/2018 04:36 PM, Michael Roth wrote:
> Quoting Michael Ellerman (2018-12-07 06:31:13)
>> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
>>
>>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
>>> JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
>>> and is adjusted again at init time if MODULES_VADDR is defined.
>>>
>>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
>>
>> But maybe it should be, I don't know why we don't define it.
>>
>>> the compile-time default at boot-time, which is 0x9c400000 when
>>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
>>> value:
>>>
>>>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
>>>   -1673527296
>>>
>>> and can cause various unexpected failures throughout the network
>>> stack. In one case `strace dhclient eth0` reported:
>>>
>>>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
>>>
>>> and similar failures can be seen with tools like tcpdump. This doesn't
>>> always reproduce however, and I'm not sure why. The more consistent
>>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
>>> would time out on systemd/netplan configuring a virtio-net NIC with no
>>> noticeable errors in the logs.
>>>
>>> Fix this by limiting the compile-time default for bpf_jit_limit to
>>> INT_MAX.
>>
>> INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on
>> whether we should be using PAGE_SIZE here at all. I guess each BPF
>> program uses at least one page is the thinking?
> 
> That seems to be the case, at least, the max number of minimum-sized
> allocations would be less on ppc64 since the allocations are always at
> least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
> so it seemed consistent to do that here too.
> 
>>
>> Thanks for tracking this down. For some reason none of my ~10 test boxes
>> have hit this, perhaps I don't have new enough userspace?
> 
> I'm not too sure, I would've thought things like the dhclient case in
> the commit log would fail every time, but sometimes I need to reboot the
> guest before I start seeing the behavior. Maybe there's something special
> about when JIT allocations are actually done that can affect
> reproducibility?
> 
> In my case at least the virtio-net networking timeout was consistent
> enough for a bisect, but maybe it depends on the specific network
> configuration (single NIC, basic DHCP through netplan/systemd in my case).
> 
>>
>> You don't mention why you needed to add BPF_MIN(), I assume because the
>> kernel version of min() has gotten too complicated to work here?
> 
> I wasn't sure if it was safe here or not, so I tried looking at other
> users and came across:
> 
> mm/vmalloc.c:777:#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
> 
> I'm not sure what the reasoning was (or whether it still applies), but I
> figured it was safer to do the same here. Maybe Nick still recalls?
> 
>>
>> Daniel I assume you'll merge this via your tree?
>>
>> cheers
>>
>>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
>>> Cc: linuxppc-dev@ozlabs.org
>>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>>> Cc: Sandipan Das <sandipan@linux.ibm.com>
>>> Cc: Alexei Starovoitov <ast@kernel.org>
>>> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>

Thanks for the reports / fixes and sorry for my late reply (bit too
swamped last week), some more thoughts below.

>>>  kernel/bpf/core.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index b1a3545d0ec8..55de4746cdfd 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>>>  }
>>>  
>>>  #ifdef CONFIG_BPF_JIT
>>> -# define BPF_JIT_LIMIT_DEFAULT       (PAGE_SIZE * 40000)
>>> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
>>> +# define BPF_JIT_LIMIT_DEFAULT       BPF_MIN((PAGE_SIZE * 40000), INT_MAX)
>>>  
>>>  /* All BPF JIT sysctl knobs here. */
>>>  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);

I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT
define also given for 4.21 arm64 will have its own dedicated area for
JIT allocations where neither the above limit nor the MODULES_END/
MODULES_VADDR one would fit and I don't want to make this even more
ugly with adding further cases into the core. Would the below variant
work for you?

Thanks,
Daniel

From da9daf462d41ce5506c6b6318a9fa3d6d8a64f6c Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 10 Dec 2018 14:30:27 +0100
Subject: [PATCH bpf] bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K

Michael and Sandipan report:

  Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
  JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
  and is adjusted again at init time if MODULES_VADDR is defined.

  For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
  the compile-time default at boot-time, which is 0x9c400000 when
  using 64K page size. This overflows the signed 32-bit bpf_jit_limit
  value:

  root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
  -1673527296

  and can cause various unexpected failures throughout the network
  stack. In one case `strace dhclient eth0` reported:

  setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
             16) = -1 ENOTSUPP (Unknown error 524)

  and similar failures can be seen with tools like tcpdump. This doesn't
  always reproduce however, and I'm not sure why. The more consistent
  failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
  host would time out on systemd/netplan configuring a virtio-net NIC
  with no noticeable errors in the logs.

Given this and also given that in near future some architectures like
arm64 will have a custom area for BPF JIT image allocations we should
get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
so therefore add another overridable bpf_jit_alloc_exec_limit() helper
function which returns the possible size of the memory area for deriving
the default heuristic in bpf_jit_charge_init().

Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
JIT memory provider, and therefore in case archs implement their custom
module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.

Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/core.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index b1a3545..6c2332e 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -365,13 +365,11 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 }

 #ifdef CONFIG_BPF_JIT
-# define BPF_JIT_LIMIT_DEFAULT	(PAGE_SIZE * 40000)
-
 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
 int bpf_jit_harden   __read_mostly;
 int bpf_jit_kallsyms __read_mostly;
-int bpf_jit_limit    __read_mostly = BPF_JIT_LIMIT_DEFAULT;
+int bpf_jit_limit    __read_mostly;

 static __always_inline void
 bpf_get_prog_addr_region(const struct bpf_prog *prog,
@@ -580,16 +578,27 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,

 static atomic_long_t bpf_jit_current;

+/* Can be overridden by an arch's JIT compiler if it has a custom,
+ * dedicated BPF backend memory area, or if neither of the two
+ * below apply.
+ */
+u64 __weak bpf_jit_alloc_exec_limit(void)
+{
 #if defined(MODULES_VADDR)
+	return MODULES_END - MODULES_VADDR;
+#else
+	return VMALLOC_END - VMALLOC_START;
+#endif
+}
+
 static int __init bpf_jit_charge_init(void)
 {
 	/* Only used as heuristic here to derive limit. */
-	bpf_jit_limit = min_t(u64, round_up((MODULES_END - MODULES_VADDR) >> 2,
+	bpf_jit_limit = min_t(u64, round_up(bpf_jit_alloc_exec_limit() >> 2,
 					    PAGE_SIZE), INT_MAX);
 	return 0;
 }
 pure_initcall(bpf_jit_charge_init);
-#endif

 static int bpf_jit_charge_modmem(u32 pages)
 {
Michael Roth Dec. 10, 2018, 5:27 p.m. | #4
Quoting Daniel Borkmann (2018-12-10 08:26:31)
> On 12/07/2018 04:36 PM, Michael Roth wrote:
> > Quoting Michael Ellerman (2018-12-07 06:31:13)
> >> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> >>
> >>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
> >>> JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
> >>> and is adjusted again at init time if MODULES_VADDR is defined.
> >>>
> >>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
> >>
> >> But maybe it should be, I don't know why we don't define it.
> >>
> >>> the compile-time default at boot-time, which is 0x9c400000 when
> >>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
> >>> value:
> >>>
> >>>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
> >>>   -1673527296
> >>>
> >>> and can cause various unexpected failures throughout the network
> >>> stack. In one case `strace dhclient eth0` reported:
> >>>
> >>>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
> >>>
> >>> and similar failures can be seen with tools like tcpdump. This doesn't
> >>> always reproduce however, and I'm not sure why. The more consistent
> >>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
> >>> would time out on systemd/netplan configuring a virtio-net NIC with no
> >>> noticeable errors in the logs.
> >>>
> >>> Fix this by limiting the compile-time default for bpf_jit_limit to
> >>> INT_MAX.
> >>
> >> INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on
> >> whether we should be using PAGE_SIZE here at all. I guess each BPF
> >> program uses at least one page is the thinking?
> > 
> > That seems to be the case, at least, the max number of minimum-sized
> > allocations would be less on ppc64 since the allocations are always at
> > least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
> > so it seemed consistent to do that here too.
> > 
> >>
> >> Thanks for tracking this down. For some reason none of my ~10 test boxes
> >> have hit this, perhaps I don't have new enough userspace?
> > 
> > I'm not too sure, I would've thought things like the dhclient case in
> > the commit log would fail every time, but sometimes I need to reboot the
> > guest before I start seeing the behavior. Maybe there's something special
> > about when JIT allocations are actually done that can affect
> > reproducibility?
> > 
> > In my case at least the virtio-net networking timeout was consistent
> > enough for a bisect, but maybe it depends on the specific network
> > configuration (single NIC, basic DHCP through netplan/systemd in my case).
> > 
> >>
> >> You don't mention why you needed to add BPF_MIN(), I assume because the
> >> kernel version of min() has gotten too complicated to work here?
> > 
> > I wasn't sure if it was safe here or not, so I tried looking at other
> > users and came across:
> > 
> > mm/vmalloc.c:777:#define VMAP_MIN(x, y)               ((x) < (y) ? (x) : (y)) /* can't use min() */
> > 
> > I'm not sure what the reasoning was (or whether it still applies), but I
> > figured it was safer to do the same here. Maybe Nick still recalls?
> > 
> >>
> >> Daniel I assume you'll merge this via your tree?
> >>
> >> cheers
> >>
> >>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
> >>> Cc: linuxppc-dev@ozlabs.org
> >>> Cc: Daniel Borkmann <daniel@iogearbox.net>
> >>> Cc: Sandipan Das <sandipan@linux.ibm.com>
> >>> Cc: Alexei Starovoitov <ast@kernel.org>
> >>> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> 
> Thanks for the reports / fixes and sorry for my late reply (bit too
> swamped last week), some more thoughts below.
> 
> >>>  kernel/bpf/core.c | 3 ++-
> >>>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> >>> index b1a3545d0ec8..55de4746cdfd 100644
> >>> --- a/kernel/bpf/core.c
> >>> +++ b/kernel/bpf/core.c
> >>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
> >>>  }
> >>>  
> >>>  #ifdef CONFIG_BPF_JIT
> >>> -# define BPF_JIT_LIMIT_DEFAULT       (PAGE_SIZE * 40000)
> >>> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
> >>> +# define BPF_JIT_LIMIT_DEFAULT       BPF_MIN((PAGE_SIZE * 40000), INT_MAX)
> >>>  
> >>>  /* All BPF JIT sysctl knobs here. */
> >>>  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
> 
> I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT
> define also given for 4.21 arm64 will have its own dedicated area for
> JIT allocations where neither the above limit nor the MODULES_END/
> MODULES_VADDR one would fit and I don't want to make this even more
> ugly with adding further cases into the core. Would the below variant
> work for you?

Looks good to me. My one concern (which is probably a separate
issue) is that the INT_MAX limit is a bit more punishing for larger
page sizes since the minimum allocations seem to be 1 page. Are there
reasonable workloads that could actually push this (INT_MAX >>
64K_PAGE_SHIFT) limit, or is that pretty generous in practice?

> 
> Thanks,
> Daniel
> 
> From da9daf462d41ce5506c6b6318a9fa3d6d8a64f6c Mon Sep 17 00:00:00 2001
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Mon, 10 Dec 2018 14:30:27 +0100
> Subject: [PATCH bpf] bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K
> 
> Michael and Sandipan report:
> 
>   Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
>   JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
>   and is adjusted again at init time if MODULES_VADDR is defined.
> 
>   For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
>   the compile-time default at boot-time, which is 0x9c400000 when
>   using 64K page size. This overflows the signed 32-bit bpf_jit_limit
>   value:
> 
>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
>   -1673527296
> 
>   and can cause various unexpected failures throughout the network
>   stack. In one case `strace dhclient eth0` reported:
> 
>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
>              16) = -1 ENOTSUPP (Unknown error 524)
> 
>   and similar failures can be seen with tools like tcpdump. This doesn't
>   always reproduce however, and I'm not sure why. The more consistent
>   failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
>   host would time out on systemd/netplan configuring a virtio-net NIC
>   with no noticeable errors in the logs.
> 
> Given this and also given that in near future some architectures like
> arm64 will have a custom area for BPF JIT image allocations we should
> get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
> 4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
> so therefore add another overridable bpf_jit_alloc_exec_limit() helper
> function which returns the possible size of the memory area for deriving
> the default heuristic in bpf_jit_charge_init().
> 
> Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
> bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
> JIT memory provider, and therefore in case archs implement their custom
> module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
> vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
> 
> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
> Reported-by: Sandipan Das <sandipan@linux.ibm.com>
> Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  kernel/bpf/core.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index b1a3545..6c2332e 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -365,13 +365,11 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>  }
> 
>  #ifdef CONFIG_BPF_JIT
> -# define BPF_JIT_LIMIT_DEFAULT (PAGE_SIZE * 40000)
> -
>  /* All BPF JIT sysctl knobs here. */
>  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
>  int bpf_jit_harden   __read_mostly;
>  int bpf_jit_kallsyms __read_mostly;
> -int bpf_jit_limit    __read_mostly = BPF_JIT_LIMIT_DEFAULT;
> +int bpf_jit_limit    __read_mostly;
> 
>  static __always_inline void
>  bpf_get_prog_addr_region(const struct bpf_prog *prog,
> @@ -580,16 +578,27 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
> 
>  static atomic_long_t bpf_jit_current;
> 
> +/* Can be overridden by an arch's JIT compiler if it has a custom,
> + * dedicated BPF backend memory area, or if neither of the two
> + * below apply.
> + */
> +u64 __weak bpf_jit_alloc_exec_limit(void)
> +{
>  #if defined(MODULES_VADDR)
> +       return MODULES_END - MODULES_VADDR;
> +#else
> +       return VMALLOC_END - VMALLOC_START;
> +#endif
> +}
> +
>  static int __init bpf_jit_charge_init(void)
>  {
>         /* Only used as heuristic here to derive limit. */
> -       bpf_jit_limit = min_t(u64, round_up((MODULES_END - MODULES_VADDR) >> 2,
> +       bpf_jit_limit = min_t(u64, round_up(bpf_jit_alloc_exec_limit() >> 2,
>                                             PAGE_SIZE), INT_MAX);
>         return 0;
>  }
>  pure_initcall(bpf_jit_charge_init);
> -#endif
> 
>  static int bpf_jit_charge_modmem(u32 pages)
>  {
> -- 
> 2.9.5
>
Daniel Borkmann Dec. 10, 2018, 11:02 p.m. | #5
On 12/10/2018 06:27 PM, Michael Roth wrote:
> Quoting Daniel Borkmann (2018-12-10 08:26:31)
>> On 12/07/2018 04:36 PM, Michael Roth wrote:
>>> Quoting Michael Ellerman (2018-12-07 06:31:13)
>>>> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
>>>>
>>>>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
>>>>> JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
>>>>> and is adjusted again at init time if MODULES_VADDR is defined.
>>>>>
>>>>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
>>>>
>>>> But maybe it should be, I don't know why we don't define it.
>>>>
>>>>> the compile-time default at boot-time, which is 0x9c400000 when
>>>>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
>>>>> value:
>>>>>
>>>>>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
>>>>>   -1673527296
>>>>>
>>>>> and can cause various unexpected failures throughout the network
>>>>> stack. In one case `strace dhclient eth0` reported:
>>>>>
>>>>>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
>>>>>
>>>>> and similar failures can be seen with tools like tcpdump. This doesn't
>>>>> always reproduce however, and I'm not sure why. The more consistent
>>>>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
>>>>> would time out on systemd/netplan configuring a virtio-net NIC with no
>>>>> noticeable errors in the logs.
>>>>>
>>>>> Fix this by limiting the compile-time default for bpf_jit_limit to
>>>>> INT_MAX.
>>>>
>>>> INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on
>>>> whether we should be using PAGE_SIZE here at all. I guess each BPF
>>>> program uses at least one page is the thinking?
>>>
>>> That seems to be the case, at least, the max number of minimum-sized
>>> allocations would be less on ppc64 since the allocations are always at
>>> least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
>>> so it seemed consistent to do that here too.
>>>
>>>>
>>>> Thanks for tracking this down. For some reason none of my ~10 test boxes
>>>> have hit this, perhaps I don't have new enough userspace?
>>>
>>> I'm not too sure, I would've thought things like the dhclient case in
>>> the commit log would fail every time, but sometimes I need to reboot the
>>> guest before I start seeing the behavior. Maybe there's something special
>>> about when JIT allocations are actually done that can affect
>>> reproducibility?
>>>
>>> In my case at least the virtio-net networking timeout was consistent
>>> enough for a bisect, but maybe it depends on the specific network
>>> configuration (single NIC, basic DHCP through netplan/systemd in my case).
>>>
>>>>
>>>> You don't mention why you needed to add BPF_MIN(), I assume because the
>>>> kernel version of min() has gotten too complicated to work here?
>>>
>>> I wasn't sure if it was safe here or not, so I tried looking at other
>>> users and came across:
>>>
>>> mm/vmalloc.c:777:#define VMAP_MIN(x, y)               ((x) < (y) ? (x) : (y)) /* can't use min() */
>>>
>>> I'm not sure what the reasoning was (or whether it still applies), but I
>>> figured it was safer to do the same here. Maybe Nick still recalls?
>>>
>>>>
>>>> Daniel I assume you'll merge this via your tree?
>>>>
>>>> cheers
>>>>
>>>>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
>>>>> Cc: linuxppc-dev@ozlabs.org
>>>>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>>>>> Cc: Sandipan Das <sandipan@linux.ibm.com>
>>>>> Cc: Alexei Starovoitov <ast@kernel.org>
>>>>> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
>>
>> Thanks for the reports / fixes and sorry for my late reply (bit too
>> swamped last week), some more thoughts below.
>>
>>>>>  kernel/bpf/core.c | 3 ++-
>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>>>> index b1a3545d0ec8..55de4746cdfd 100644
>>>>> --- a/kernel/bpf/core.c
>>>>> +++ b/kernel/bpf/core.c
>>>>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>>>>>  }
>>>>>  
>>>>>  #ifdef CONFIG_BPF_JIT
>>>>> -# define BPF_JIT_LIMIT_DEFAULT       (PAGE_SIZE * 40000)
>>>>> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
>>>>> +# define BPF_JIT_LIMIT_DEFAULT       BPF_MIN((PAGE_SIZE * 40000), INT_MAX)
>>>>>  
>>>>>  /* All BPF JIT sysctl knobs here. */
>>>>>  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
>>
>> I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT
>> define also given for 4.21 arm64 will have its own dedicated area for
>> JIT allocations where neither the above limit nor the MODULES_END/
>> MODULES_VADDR one would fit and I don't want to make this even more
>> ugly with adding further cases into the core. Would the below variant
>> work for you?
> 
> Looks good to me. My one concern (which is probably a separate
> issue) is that the INT_MAX limit is a bit more punishing for larger
> page sizes since the minimum allocations seem to be 1 page. Are there
> reasonable workloads that could actually push this (INT_MAX >>
> 64K_PAGE_SHIFT) limit, or is that pretty generous in practice?

I don't think there are today, but it definitely doesn't hurt to change
the interface to be on safe side which I just did for the sent out one.

Thanks,
Daniel
Michael Ellerman Dec. 11, 2018, 1:10 p.m. | #6
Daniel Borkmann <daniel@iogearbox.net> writes:
< snip >
>
> I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT
> define also given for 4.21 arm64 will have its own dedicated area for
> JIT allocations where neither the above limit nor the MODULES_END/
> MODULES_VADDR one would fit and I don't want to make this even more
> ugly with adding further cases into the core. Would the below variant
> work for you?

I haven't actually hit the bug so I won't ack/tested-by this, but it
looks fine to me.

cheers

> From da9daf462d41ce5506c6b6318a9fa3d6d8a64f6c Mon Sep 17 00:00:00 2001
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Mon, 10 Dec 2018 14:30:27 +0100
> Subject: [PATCH bpf] bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K
>
> Michael and Sandipan report:
>
>   Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
>   JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
>   and is adjusted again at init time if MODULES_VADDR is defined.
>
>   For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
>   the compile-time default at boot-time, which is 0x9c400000 when
>   using 64K page size. This overflows the signed 32-bit bpf_jit_limit
>   value:
>
>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
>   -1673527296
>
>   and can cause various unexpected failures throughout the network
>   stack. In one case `strace dhclient eth0` reported:
>
>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
>              16) = -1 ENOTSUPP (Unknown error 524)
>
>   and similar failures can be seen with tools like tcpdump. This doesn't
>   always reproduce however, and I'm not sure why. The more consistent
>   failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
>   host would time out on systemd/netplan configuring a virtio-net NIC
>   with no noticeable errors in the logs.
>
> Given this and also given that in near future some architectures like
> arm64 will have a custom area for BPF JIT image allocations we should
> get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
> 4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
> so therefore add another overridable bpf_jit_alloc_exec_limit() helper
> function which returns the possible size of the memory area for deriving
> the default heuristic in bpf_jit_charge_init().
>
> Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
> bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
> JIT memory provider, and therefore in case archs implement their custom
> module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
> vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
>
> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
> Reported-by: Sandipan Das <sandipan@linux.ibm.com>
> Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  kernel/bpf/core.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index b1a3545..6c2332e 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -365,13 +365,11 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>  }
>
>  #ifdef CONFIG_BPF_JIT
> -# define BPF_JIT_LIMIT_DEFAULT	(PAGE_SIZE * 40000)
> -
>  /* All BPF JIT sysctl knobs here. */
>  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
>  int bpf_jit_harden   __read_mostly;
>  int bpf_jit_kallsyms __read_mostly;
> -int bpf_jit_limit    __read_mostly = BPF_JIT_LIMIT_DEFAULT;
> +int bpf_jit_limit    __read_mostly;
>
>  static __always_inline void
>  bpf_get_prog_addr_region(const struct bpf_prog *prog,
> @@ -580,16 +578,27 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
>
>  static atomic_long_t bpf_jit_current;
>
> +/* Can be overridden by an arch's JIT compiler if it has a custom,
> + * dedicated BPF backend memory area, or if neither of the two
> + * below apply.
> + */
> +u64 __weak bpf_jit_alloc_exec_limit(void)
> +{
>  #if defined(MODULES_VADDR)
> +	return MODULES_END - MODULES_VADDR;
> +#else
> +	return VMALLOC_END - VMALLOC_START;
> +#endif
> +}
> +
>  static int __init bpf_jit_charge_init(void)
>  {
>  	/* Only used as heuristic here to derive limit. */
> -	bpf_jit_limit = min_t(u64, round_up((MODULES_END - MODULES_VADDR) >> 2,
> +	bpf_jit_limit = min_t(u64, round_up(bpf_jit_alloc_exec_limit() >> 2,
>  					    PAGE_SIZE), INT_MAX);
>  	return 0;
>  }
>  pure_initcall(bpf_jit_charge_init);
> -#endif
>
>  static int bpf_jit_charge_modmem(u32 pages)
>  {
> -- 
> 2.9.5

Patch

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index b1a3545d0ec8..55de4746cdfd 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -365,7 +365,8 @@  void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 }
 
 #ifdef CONFIG_BPF_JIT
-# define BPF_JIT_LIMIT_DEFAULT	(PAGE_SIZE * 40000)
+# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
+# define BPF_JIT_LIMIT_DEFAULT	BPF_MIN((PAGE_SIZE * 40000), INT_MAX)
 
 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);