diff mbox series

[bpf,v3,1/2] bpf: don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE

Message ID 20200608182748.6998-1-sdf@google.com
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series [bpf,v3,1/2] bpf: don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE | expand

Commit Message

Stanislav Fomichev June 8, 2020, 6:27 p.m. UTC
Attaching to these hooks can break iptables because its optval is
usually quite big, or at least bigger than the current PAGE_SIZE limit.
David also mentioned some SCTP options can be big (around 256k).

There are two possible ways to fix it:
1. Increase the limit to match iptables max optval. There is, however,
   no clear upper limit. Technically, iptables can accept up to
   512M of data (not sure how practical it is though).

2. Bypass the value (don't expose to BPF) if it's too big and trigger
   BPF only with level/optname so BPF can still decide whether
   to allow/deny big sockopts.

The initial attempt was implemented using strategy #1. Due to
listed shortcomings, let's switch to strategy #2. When there is
legitimate a real use-case for iptables/SCTP, we can consider increasing
the PAGE_SIZE limit.

v3:
* don't increase the limit, bypass the argument

v2:
* proper comments formatting (Jakub Kicinski)

Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
Cc: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 kernel/bpf/cgroup.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Comments

Alexei Starovoitov June 13, 2020, 12:33 a.m. UTC | #1
On Mon, Jun 08, 2020 at 11:27:47AM -0700, Stanislav Fomichev wrote:
> Attaching to these hooks can break iptables because its optval is
> usually quite big, or at least bigger than the current PAGE_SIZE limit.
> David also mentioned some SCTP options can be big (around 256k).
> 
> There are two possible ways to fix it:
> 1. Increase the limit to match iptables max optval. There is, however,
>    no clear upper limit. Technically, iptables can accept up to
>    512M of data (not sure how practical it is though).
> 
> 2. Bypass the value (don't expose to BPF) if it's too big and trigger
>    BPF only with level/optname so BPF can still decide whether
>    to allow/deny big sockopts.
> 
> The initial attempt was implemented using strategy #1. Due to
> listed shortcomings, let's switch to strategy #2. When there is
> legitimate a real use-case for iptables/SCTP, we can consider increasing
> the PAGE_SIZE limit.
> 
> v3:
> * don't increase the limit, bypass the argument
> 
> v2:
> * proper comments formatting (Jakub Kicinski)
> 
> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
> Cc: David Laight <David.Laight@ACULAB.COM>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  kernel/bpf/cgroup.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index fdf7836750a3..758082853086 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1276,9 +1276,18 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
>  
>  static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
>  {
> -	if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0)
> +	if (unlikely(max_optlen < 0))
>  		return -EINVAL;
>  
> +	if (unlikely(max_optlen > PAGE_SIZE)) {
> +		/* We don't expose optvals that are greater than PAGE_SIZE
> +		 * to the BPF program.
> +		 */
> +		ctx->optval = NULL;
> +		ctx->optval_end = NULL;
> +		return 0;
> +	}

It's probably ok, but makes me uneasy about verifier consequences.
ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov.
Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ?
I don't think we have such tests. I guess bpf prog won't be able to read
anything and nothing will crash, but having PTR_TO_PACKET that is
actually NULL would be an odd special case to keep in mind for everyone
who will work on the verifier from now on.

Also consider bpf prog that simply reads something small like 4 bytes.
IP_FREEBIND sockopt (like your selftest in the patch 2) will have
those 4 bytes, so it's natural for the prog to assume that it can read it.
It will have
p = ctx->optval;
if (p + 4 > ctx->optval_end)
 /* goto out and don't bother logging, since that never happens */
*(u32*)p;

but 'clever' user space would pass long optlen and prog suddenly
'not seeing' the sockopt. It didn't crash, but debugging would be
surprising.

I feel it's better to copy the first 4k and let the program see it.
Stanislav Fomichev June 13, 2020, 1:03 a.m. UTC | #2
On Fri, Jun 12, 2020 at 5:34 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 08, 2020 at 11:27:47AM -0700, Stanislav Fomichev wrote:
> > Attaching to these hooks can break iptables because its optval is
> > usually quite big, or at least bigger than the current PAGE_SIZE limit.
> > David also mentioned some SCTP options can be big (around 256k).
> >
> > There are two possible ways to fix it:
> > 1. Increase the limit to match iptables max optval. There is, however,
> >    no clear upper limit. Technically, iptables can accept up to
> >    512M of data (not sure how practical it is though).
> >
> > 2. Bypass the value (don't expose to BPF) if it's too big and trigger
> >    BPF only with level/optname so BPF can still decide whether
> >    to allow/deny big sockopts.
> >
> > The initial attempt was implemented using strategy #1. Due to
> > listed shortcomings, let's switch to strategy #2. When there is
> > legitimate a real use-case for iptables/SCTP, we can consider increasing
> > the PAGE_SIZE limit.
> >
> > v3:
> > * don't increase the limit, bypass the argument
> >
> > v2:
> > * proper comments formatting (Jakub Kicinski)
> >
> > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
> > Cc: David Laight <David.Laight@ACULAB.COM>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  kernel/bpf/cgroup.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index fdf7836750a3..758082853086 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -1276,9 +1276,18 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
> >
> >  static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> >  {
> > -     if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0)
> > +     if (unlikely(max_optlen < 0))
> >               return -EINVAL;
> >
> > +     if (unlikely(max_optlen > PAGE_SIZE)) {
> > +             /* We don't expose optvals that are greater than PAGE_SIZE
> > +              * to the BPF program.
> > +              */
> > +             ctx->optval = NULL;
> > +             ctx->optval_end = NULL;
> > +             return 0;
> > +     }
>
> It's probably ok, but makes me uneasy about verifier consequences.
> ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov.
> Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ?
> I don't think we have such tests. I guess bpf prog won't be able to read
> anything and nothing will crash, but having PTR_TO_PACKET that is
> actually NULL would be an odd special case to keep in mind for everyone
> who will work on the verifier from now on.
>
> Also consider bpf prog that simply reads something small like 4 bytes.
> IP_FREEBIND sockopt (like your selftest in the patch 2) will have
> those 4 bytes, so it's natural for the prog to assume that it can read it.
> It will have
> p = ctx->optval;
> if (p + 4 > ctx->optval_end)
>  /* goto out and don't bother logging, since that never happens */
> *(u32*)p;
>
> but 'clever' user space would pass long optlen and prog suddenly
> 'not seeing' the sockopt. It didn't crash, but debugging would be
> surprising.
>
> I feel it's better to copy the first 4k and let the program see it.
Agreed with the IP_FREEBIND example wrt observability, however it's
not clear what to do with the cropped buffer if the bpf program
modifies it.

Consider that huge iptables setsockopts where the usespace passes
PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it.
Now, if the bpf program modifies the buffer (say, flips some byte), we
are back to square one. We either have to silently discard that buffer
or reallocate/merge. My reasoning with data == NULL, is that at least
there is a clear signal that the program can't access the data (and
can look at optlen to see if the original buffer is indeed non-zero
and maybe deny such requests?).
At this point I'm really starting to think that maybe we should just
vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an
upper bound :-/
I'll try to think about this a bit more over the weekend.
Alexei Starovoitov June 13, 2020, 3:50 a.m. UTC | #3
On Fri, Jun 12, 2020 at 06:03:03PM -0700, Stanislav Fomichev wrote:
> On Fri, Jun 12, 2020 at 5:34 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jun 08, 2020 at 11:27:47AM -0700, Stanislav Fomichev wrote:
> > > Attaching to these hooks can break iptables because its optval is
> > > usually quite big, or at least bigger than the current PAGE_SIZE limit.
> > > David also mentioned some SCTP options can be big (around 256k).
> > >
> > > There are two possible ways to fix it:
> > > 1. Increase the limit to match iptables max optval. There is, however,
> > >    no clear upper limit. Technically, iptables can accept up to
> > >    512M of data (not sure how practical it is though).
> > >
> > > 2. Bypass the value (don't expose to BPF) if it's too big and trigger
> > >    BPF only with level/optname so BPF can still decide whether
> > >    to allow/deny big sockopts.
> > >
> > > The initial attempt was implemented using strategy #1. Due to
> > > listed shortcomings, let's switch to strategy #2. When there is
> > > legitimate a real use-case for iptables/SCTP, we can consider increasing
> > > the PAGE_SIZE limit.
> > >
> > > v3:
> > > * don't increase the limit, bypass the argument
> > >
> > > v2:
> > > * proper comments formatting (Jakub Kicinski)
> > >
> > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
> > > Cc: David Laight <David.Laight@ACULAB.COM>
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  kernel/bpf/cgroup.c | 18 ++++++++++++++----
> > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index fdf7836750a3..758082853086 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -1276,9 +1276,18 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
> > >
> > >  static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > >  {
> > > -     if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0)
> > > +     if (unlikely(max_optlen < 0))
> > >               return -EINVAL;
> > >
> > > +     if (unlikely(max_optlen > PAGE_SIZE)) {
> > > +             /* We don't expose optvals that are greater than PAGE_SIZE
> > > +              * to the BPF program.
> > > +              */
> > > +             ctx->optval = NULL;
> > > +             ctx->optval_end = NULL;
> > > +             return 0;
> > > +     }
> >
> > It's probably ok, but makes me uneasy about verifier consequences.
> > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov.
> > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ?
> > I don't think we have such tests. I guess bpf prog won't be able to read
> > anything and nothing will crash, but having PTR_TO_PACKET that is
> > actually NULL would be an odd special case to keep in mind for everyone
> > who will work on the verifier from now on.
> >
> > Also consider bpf prog that simply reads something small like 4 bytes.
> > IP_FREEBIND sockopt (like your selftest in the patch 2) will have
> > those 4 bytes, so it's natural for the prog to assume that it can read it.
> > It will have
> > p = ctx->optval;
> > if (p + 4 > ctx->optval_end)
> >  /* goto out and don't bother logging, since that never happens */
> > *(u32*)p;
> >
> > but 'clever' user space would pass long optlen and prog suddenly
> > 'not seeing' the sockopt. It didn't crash, but debugging would be
> > surprising.
> >
> > I feel it's better to copy the first 4k and let the program see it.
> Agreed with the IP_FREEBIND example wrt observability, however it's
> not clear what to do with the cropped buffer if the bpf program
> modifies it.
> 
> Consider that huge iptables setsockopts where the usespace passes
> PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it.
> Now, if the bpf program modifies the buffer (say, flips some byte), we
> are back to square one. We either have to silently discard that buffer
> or reallocate/merge. My reasoning with data == NULL, is that at least
> there is a clear signal that the program can't access the data (and
> can look at optlen to see if the original buffer is indeed non-zero
> and maybe deny such requests?).
> At this point I'm really starting to think that maybe we should just
> vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an
> upper bound :-/
> I'll try to think about this a bit more over the weekend.

Yeah. Tough choices.
We can also detect in the verifier whether program accessed ctx->optval
and skip alloc/copy if program didn't touch it, but I suspect in most
case the program would want to read it.
I think vmallocing what optlen said is DoS-able. It's better to
stick with single page.
Let's keep brainstorming.
Stanislav Fomichev June 15, 2020, 4:41 p.m. UTC | #4
On Fri, Jun 12, 2020 at 8:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
[ .. ]
> > > It's probably ok, but makes me uneasy about verifier consequences.
> > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov.
> > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ?
> > > I don't think we have such tests. I guess bpf prog won't be able to read
> > > anything and nothing will crash, but having PTR_TO_PACKET that is
> > > actually NULL would be an odd special case to keep in mind for everyone
> > > who will work on the verifier from now on.
> > >
> > > Also consider bpf prog that simply reads something small like 4 bytes.
> > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have
> > > those 4 bytes, so it's natural for the prog to assume that it can read it.
> > > It will have
> > > p = ctx->optval;
> > > if (p + 4 > ctx->optval_end)
> > >  /* goto out and don't bother logging, since that never happens */
> > > *(u32*)p;
> > >
> > > but 'clever' user space would pass long optlen and prog suddenly
> > > 'not seeing' the sockopt. It didn't crash, but debugging would be
> > > surprising.
> > >
> > > I feel it's better to copy the first 4k and let the program see it.
> > Agreed with the IP_FREEBIND example wrt observability, however it's
> > not clear what to do with the cropped buffer if the bpf program
> > modifies it.
> >
> > Consider that huge iptables setsockopts where the usespace passes
> > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it.
> > Now, if the bpf program modifies the buffer (say, flips some byte), we
> > are back to square one. We either have to silently discard that buffer
> > or reallocate/merge. My reasoning with data == NULL, is that at least
> > there is a clear signal that the program can't access the data (and
> > can look at optlen to see if the original buffer is indeed non-zero
> > and maybe deny such requests?).
> > At this point I'm really starting to think that maybe we should just
> > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an
> > upper bound :-/
> > I'll try to think about this a bit more over the weekend.
>
> Yeah. Tough choices.
> We can also detect in the verifier whether program accessed ctx->optval
> and skip alloc/copy if program didn't touch it, but I suspect in most
> case the program would want to read it.
> I think vmallocing what optlen said is DoS-able. It's better to
> stick with single page.
> Let's keep brainstorming.
Btw, can we use sleepable bpf for that? As in, do whatever I suggested
in these patches (don't expose optval>PAGE_SIZE via context), but add
a new helper where you can say 'copy x bytes from y offset of the
original optval' (the helper will do sleepable copy_form_user).
That way we have a clean signal to the BPF that the value is too big
(optval==optval_end==NULL) and the user can fallback to the helper to
inspect the value. We can also provide another helper to export new
value for this case.
WDYT?
Alexei Starovoitov June 16, 2020, 11:03 p.m. UTC | #5
On Mon, Jun 15, 2020 at 09:41:38AM -0700, Stanislav Fomichev wrote:
> On Fri, Jun 12, 2020 at 8:50 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> [ .. ]
> > > > It's probably ok, but makes me uneasy about verifier consequences.
> > > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov.
> > > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ?
> > > > I don't think we have such tests. I guess bpf prog won't be able to read
> > > > anything and nothing will crash, but having PTR_TO_PACKET that is
> > > > actually NULL would be an odd special case to keep in mind for everyone
> > > > who will work on the verifier from now on.
> > > >
> > > > Also consider bpf prog that simply reads something small like 4 bytes.
> > > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have
> > > > those 4 bytes, so it's natural for the prog to assume that it can read it.
> > > > It will have
> > > > p = ctx->optval;
> > > > if (p + 4 > ctx->optval_end)
> > > >  /* goto out and don't bother logging, since that never happens */
> > > > *(u32*)p;
> > > >
> > > > but 'clever' user space would pass long optlen and prog suddenly
> > > > 'not seeing' the sockopt. It didn't crash, but debugging would be
> > > > surprising.
> > > >
> > > > I feel it's better to copy the first 4k and let the program see it.
> > > Agreed with the IP_FREEBIND example wrt observability, however it's
> > > not clear what to do with the cropped buffer if the bpf program
> > > modifies it.
> > >
> > > Consider that huge iptables setsockopts where the usespace passes
> > > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it.
> > > Now, if the bpf program modifies the buffer (say, flips some byte), we
> > > are back to square one. We either have to silently discard that buffer
> > > or reallocate/merge. My reasoning with data == NULL, is that at least
> > > there is a clear signal that the program can't access the data (and
> > > can look at optlen to see if the original buffer is indeed non-zero
> > > and maybe deny such requests?).
> > > At this point I'm really starting to think that maybe we should just
> > > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an
> > > upper bound :-/
> > > I'll try to think about this a bit more over the weekend.
> >
> > Yeah. Tough choices.
> > We can also detect in the verifier whether program accessed ctx->optval
> > and skip alloc/copy if program didn't touch it, but I suspect in most
> > case the program would want to read it.
> > I think vmallocing what optlen said is DoS-able. It's better to
> > stick with single page.
> > Let's keep brainstorming.
> Btw, can we use sleepable bpf for that? As in, do whatever I suggested
> in these patches (don't expose optval>PAGE_SIZE via context), but add
> a new helper where you can say 'copy x bytes from y offset of the
> original optval' (the helper will do sleepable copy_form_user).
> That way we have a clean signal to the BPF that the value is too big
> (optval==optval_end==NULL) and the user can fallback to the helper to
> inspect the value. We can also provide another helper to export new
> value for this case.

sleepable will be read-only and with toctou.
I guess this patch is the least evil then ?
But I'm confused with the test in patch 2.
Why does it do 'if (optval > optval_end)' ?
How is that possible when patch 1 makes them equal.
may be another idea:
allocate 4k, copy first 4k into it, but keep ctx.optlen as original.
if bpf prog reads optval and finds it ok in setsockopt,
it can set ctx.optlen = 0
which would mean run the rest of setsockopt handling with original
'__user *optval' and ignore the buffer that was passed to bpf prog.
In case of ctx.optlen < 4k the behavior won't change from bpf prog
and from kernel pov.
When ctx.optlen > 4k and prog didn't adjust it
__cgroup_bpf_run_filter_setsockopt will do ret = -EFAULT;
and reject sockopt.
So bpf prog would be force to do something with large optvals.
yet it will be able to examine first 4k bytes of it.
It's a bit of change in behavior, but I don't think bpf prog is
doing ctx.optlen = 0, since that's more or less the same as
doing ctx.optlen = -2.
or may be use some other indicator ?
And do something similar with getsockopt.
Stanislav Fomichev June 16, 2020, 11:20 p.m. UTC | #6
On Tue, Jun 16, 2020 at 4:04 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 09:41:38AM -0700, Stanislav Fomichev wrote:
> > On Fri, Jun 12, 2020 at 8:50 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > [ .. ]
> > > > > It's probably ok, but makes me uneasy about verifier consequences.
> > > > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov.
> > > > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ?
> > > > > I don't think we have such tests. I guess bpf prog won't be able to read
> > > > > anything and nothing will crash, but having PTR_TO_PACKET that is
> > > > > actually NULL would be an odd special case to keep in mind for everyone
> > > > > who will work on the verifier from now on.
> > > > >
> > > > > Also consider bpf prog that simply reads something small like 4 bytes.
> > > > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have
> > > > > those 4 bytes, so it's natural for the prog to assume that it can read it.
> > > > > It will have
> > > > > p = ctx->optval;
> > > > > if (p + 4 > ctx->optval_end)
> > > > >  /* goto out and don't bother logging, since that never happens */
> > > > > *(u32*)p;
> > > > >
> > > > > but 'clever' user space would pass long optlen and prog suddenly
> > > > > 'not seeing' the sockopt. It didn't crash, but debugging would be
> > > > > surprising.
> > > > >
> > > > > I feel it's better to copy the first 4k and let the program see it.
> > > > Agreed with the IP_FREEBIND example wrt observability, however it's
> > > > not clear what to do with the cropped buffer if the bpf program
> > > > modifies it.
> > > >
> > > > Consider that huge iptables setsockopts where the usespace passes
> > > > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it.
> > > > Now, if the bpf program modifies the buffer (say, flips some byte), we
> > > > are back to square one. We either have to silently discard that buffer
> > > > or reallocate/merge. My reasoning with data == NULL, is that at least
> > > > there is a clear signal that the program can't access the data (and
> > > > can look at optlen to see if the original buffer is indeed non-zero
> > > > and maybe deny such requests?).
> > > > At this point I'm really starting to think that maybe we should just
> > > > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an
> > > > upper bound :-/
> > > > I'll try to think about this a bit more over the weekend.
> > >
> > > Yeah. Tough choices.
> > > We can also detect in the verifier whether program accessed ctx->optval
> > > and skip alloc/copy if program didn't touch it, but I suspect in most
> > > case the program would want to read it.
> > > I think vmallocing what optlen said is DoS-able. It's better to
> > > stick with single page.
> > > Let's keep brainstorming.
> > Btw, can we use sleepable bpf for that? As in, do whatever I suggested
> > in these patches (don't expose optval>PAGE_SIZE via context), but add
> > a new helper where you can say 'copy x bytes from y offset of the
> > original optval' (the helper will do sleepable copy_form_user).
> > That way we have a clean signal to the BPF that the value is too big
> > (optval==optval_end==NULL) and the user can fallback to the helper to
> > inspect the value. We can also provide another helper to export new
> > value for this case.
>
> sleepable will be read-only and with toctou.
> I guess this patch is the least evil then ?
> But I'm confused with the test in patch 2.
> Why does it do 'if (optval > optval_end)' ?
> How is that possible when patch 1 makes them equal.
Right, it should really be 'optval < optval_end', good point!
I want to make sure in the BPF program that we can't access the buffer
(essentially return EPERM to the userpace so the test fails).
Will fix in a follow up.

> may be another idea:
> allocate 4k, copy first 4k into it, but keep ctx.optlen as original.
> if bpf prog reads optval and finds it ok in setsockopt,
> it can set ctx.optlen = 0
> which would mean run the rest of setsockopt handling with original
> '__user *optval' and ignore the buffer that was passed to bpf prog.
> In case of ctx.optlen < 4k the behavior won't change from bpf prog
> and from kernel pov.
> When ctx.optlen > 4k and prog didn't adjust it
> __cgroup_bpf_run_filter_setsockopt will do ret = -EFAULT;
> and reject sockopt.
> So bpf prog would be force to do something with large optvals.
> yet it will be able to examine first 4k bytes of it.
> It's a bit of change in behavior, but I don't think bpf prog is
> doing ctx.optlen = 0, since that's more or less the same as
> doing ctx.optlen = -2.
> or may be use some other indicator ?
> And do something similar with getsockopt.
Good suggestion! That sounds doable in theory, let me try and see if I
hit something unexpected.
I suppose trimming provided optval to zero (optlen=0) is not something
that sensible programs would do?

In this case please ignore the v4 I posted recently!
diff mbox series

Patch

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index fdf7836750a3..758082853086 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1276,9 +1276,18 @@  static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
 
 static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
 {
-	if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0)
+	if (unlikely(max_optlen < 0))
 		return -EINVAL;
 
+	if (unlikely(max_optlen > PAGE_SIZE)) {
+		/* We don't expose optvals that are greater than PAGE_SIZE
+		 * to the BPF program.
+		 */
+		ctx->optval = NULL;
+		ctx->optval_end = NULL;
+		return 0;
+	}
+
 	ctx->optval = kzalloc(max_optlen, GFP_USER);
 	if (!ctx->optval)
 		return -ENOMEM;
@@ -1325,7 +1334,7 @@  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 
 	ctx.optlen = *optlen;
 
-	if (copy_from_user(ctx.optval, optval, *optlen) != 0) {
+	if (ctx.optval && copy_from_user(ctx.optval, optval, *optlen) != 0) {
 		ret = -EFAULT;
 		goto out;
 	}
@@ -1407,7 +1416,8 @@  int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 		if (ctx.optlen > max_optlen)
 			ctx.optlen = max_optlen;
 
-		if (copy_from_user(ctx.optval, optval, ctx.optlen) != 0) {
+		if (ctx.optval &&
+		    copy_from_user(ctx.optval, optval, ctx.optlen) != 0) {
 			ret = -EFAULT;
 			goto out;
 		}
@@ -1436,7 +1446,7 @@  int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 		goto out;
 	}
 
-	if (copy_to_user(optval, ctx.optval, ctx.optlen) ||
+	if ((ctx.optval && copy_to_user(optval, ctx.optval, ctx.optlen)) ||
 	    put_user(ctx.optlen, optlen)) {
 		ret = -EFAULT;
 		goto out;