Message ID | 20200608182748.6998-1-sdf@google.com |
---|---|
State | Changes Requested |
Delegated to: | BPF Maintainers |
Headers | show |
Series | [bpf,v3,1/2] bpf: don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE | expand |
On Mon, Jun 08, 2020 at 11:27:47AM -0700, Stanislav Fomichev wrote: > Attaching to these hooks can break iptables because its optval is > usually quite big, or at least bigger than the current PAGE_SIZE limit. > David also mentioned some SCTP options can be big (around 256k). > > There are two possible ways to fix it: > 1. Increase the limit to match iptables max optval. There is, however, > no clear upper limit. Technically, iptables can accept up to > 512M of data (not sure how practical it is though). > > 2. Bypass the value (don't expose to BPF) if it's too big and trigger > BPF only with level/optname so BPF can still decide whether > to allow/deny big sockopts. > > The initial attempt was implemented using strategy #1. Due to > listed shortcomings, let's switch to strategy #2. When there is > legitimate a real use-case for iptables/SCTP, we can consider increasing > the PAGE_SIZE limit. > > v3: > * don't increase the limit, bypass the argument > > v2: > * proper comments formatting (Jakub Kicinski) > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > Cc: David Laight <David.Laight@ACULAB.COM> > Signed-off-by: Stanislav Fomichev <sdf@google.com> > --- > kernel/bpf/cgroup.c | 18 ++++++++++++++---- > 1 file changed, 14 insertions(+), 4 deletions(-) > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index fdf7836750a3..758082853086 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -1276,9 +1276,18 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, > > static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) > { > - if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0) > + if (unlikely(max_optlen < 0)) > return -EINVAL; > > + if (unlikely(max_optlen > PAGE_SIZE)) { > + /* We don't expose optvals that are greater than PAGE_SIZE > + * to the BPF program. > + */ > + ctx->optval = NULL; > + ctx->optval_end = NULL; > + return 0; > + } It's probably ok, but makes me uneasy about verifier consequences. ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov. Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ? I don't think we have such tests. I guess bpf prog won't be able to read anything and nothing will crash, but having PTR_TO_PACKET that is actually NULL would be an odd special case to keep in mind for everyone who will work on the verifier from now on. Also consider bpf prog that simply reads something small like 4 bytes. IP_FREEBIND sockopt (like your selftest in the patch 2) will have those 4 bytes, so it's natural for the prog to assume that it can read it. It will have p = ctx->optval; if (p + 4 > ctx->optval_end) /* goto out and don't bother logging, since that never happens */ *(u32*)p; but 'clever' user space would pass long optlen and prog suddenly 'not seeing' the sockopt. It didn't crash, but debugging would be surprising. I feel it's better to copy the first 4k and let the program see it.
On Fri, Jun 12, 2020 at 5:34 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Jun 08, 2020 at 11:27:47AM -0700, Stanislav Fomichev wrote: > > Attaching to these hooks can break iptables because its optval is > > usually quite big, or at least bigger than the current PAGE_SIZE limit. > > David also mentioned some SCTP options can be big (around 256k). > > > > There are two possible ways to fix it: > > 1. Increase the limit to match iptables max optval. There is, however, > > no clear upper limit. Technically, iptables can accept up to > > 512M of data (not sure how practical it is though). > > > > 2. Bypass the value (don't expose to BPF) if it's too big and trigger > > BPF only with level/optname so BPF can still decide whether > > to allow/deny big sockopts. > > > > The initial attempt was implemented using strategy #1. Due to > > listed shortcomings, let's switch to strategy #2. When there is > > legitimate a real use-case for iptables/SCTP, we can consider increasing > > the PAGE_SIZE limit. > > > > v3: > > * don't increase the limit, bypass the argument > > > > v2: > > * proper comments formatting (Jakub Kicinski) > > > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > Cc: David Laight <David.Laight@ACULAB.COM> > > Signed-off-by: Stanislav Fomichev <sdf@google.com> > > --- > > kernel/bpf/cgroup.c | 18 ++++++++++++++---- > > 1 file changed, 14 insertions(+), 4 deletions(-) > > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > > index fdf7836750a3..758082853086 100644 > > --- a/kernel/bpf/cgroup.c > > +++ b/kernel/bpf/cgroup.c > > @@ -1276,9 +1276,18 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, > > > > static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) > > { > > - if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0) > > + if (unlikely(max_optlen < 0)) > > return -EINVAL; > > > > + if (unlikely(max_optlen > PAGE_SIZE)) { > > + /* We don't expose optvals that are greater than PAGE_SIZE > > + * to the BPF program. > > + */ > > + ctx->optval = NULL; > > + ctx->optval_end = NULL; > > + return 0; > > + } > > It's probably ok, but makes me uneasy about verifier consequences. > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov. > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ? > I don't think we have such tests. I guess bpf prog won't be able to read > anything and nothing will crash, but having PTR_TO_PACKET that is > actually NULL would be an odd special case to keep in mind for everyone > who will work on the verifier from now on. > > Also consider bpf prog that simply reads something small like 4 bytes. > IP_FREEBIND sockopt (like your selftest in the patch 2) will have > those 4 bytes, so it's natural for the prog to assume that it can read it. > It will have > p = ctx->optval; > if (p + 4 > ctx->optval_end) > /* goto out and don't bother logging, since that never happens */ > *(u32*)p; > > but 'clever' user space would pass long optlen and prog suddenly > 'not seeing' the sockopt. It didn't crash, but debugging would be > surprising. > > I feel it's better to copy the first 4k and let the program see it. Agreed with the IP_FREEBIND example wrt observability, however it's not clear what to do with the cropped buffer if the bpf program modifies it. Consider that huge iptables setsockopts where the usespace passes PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it. Now, if the bpf program modifies the buffer (say, flips some byte), we are back to square one. We either have to silently discard that buffer or reallocate/merge. My reasoning with data == NULL, is that at least there is a clear signal that the program can't access the data (and can look at optlen to see if the original buffer is indeed non-zero and maybe deny such requests?). At this point I'm really starting to think that maybe we should just vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an upper bound :-/ I'll try to think about this a bit more over the weekend.
On Fri, Jun 12, 2020 at 06:03:03PM -0700, Stanislav Fomichev wrote: > On Fri, Jun 12, 2020 at 5:34 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Mon, Jun 08, 2020 at 11:27:47AM -0700, Stanislav Fomichev wrote: > > > Attaching to these hooks can break iptables because its optval is > > > usually quite big, or at least bigger than the current PAGE_SIZE limit. > > > David also mentioned some SCTP options can be big (around 256k). > > > > > > There are two possible ways to fix it: > > > 1. Increase the limit to match iptables max optval. There is, however, > > > no clear upper limit. Technically, iptables can accept up to > > > 512M of data (not sure how practical it is though). > > > > > > 2. Bypass the value (don't expose to BPF) if it's too big and trigger > > > BPF only with level/optname so BPF can still decide whether > > > to allow/deny big sockopts. > > > > > > The initial attempt was implemented using strategy #1. Due to > > > listed shortcomings, let's switch to strategy #2. When there is > > > legitimate a real use-case for iptables/SCTP, we can consider increasing > > > the PAGE_SIZE limit. > > > > > > v3: > > > * don't increase the limit, bypass the argument > > > > > > v2: > > > * proper comments formatting (Jakub Kicinski) > > > > > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > > Cc: David Laight <David.Laight@ACULAB.COM> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com> > > > --- > > > kernel/bpf/cgroup.c | 18 ++++++++++++++---- > > > 1 file changed, 14 insertions(+), 4 deletions(-) > > > > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > > > index fdf7836750a3..758082853086 100644 > > > --- a/kernel/bpf/cgroup.c > > > +++ b/kernel/bpf/cgroup.c > > > @@ -1276,9 +1276,18 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, > > > > > > static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) > > > { > > > - if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0) > > > + if (unlikely(max_optlen < 0)) > > > return -EINVAL; > > > > > > + if (unlikely(max_optlen > PAGE_SIZE)) { > > > + /* We don't expose optvals that are greater than PAGE_SIZE > > > + * to the BPF program. > > > + */ > > > + ctx->optval = NULL; > > > + ctx->optval_end = NULL; > > > + return 0; > > > + } > > > > It's probably ok, but makes me uneasy about verifier consequences. > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov. > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ? > > I don't think we have such tests. I guess bpf prog won't be able to read > > anything and nothing will crash, but having PTR_TO_PACKET that is > > actually NULL would be an odd special case to keep in mind for everyone > > who will work on the verifier from now on. > > > > Also consider bpf prog that simply reads something small like 4 bytes. > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have > > those 4 bytes, so it's natural for the prog to assume that it can read it. > > It will have > > p = ctx->optval; > > if (p + 4 > ctx->optval_end) > > /* goto out and don't bother logging, since that never happens */ > > *(u32*)p; > > > > but 'clever' user space would pass long optlen and prog suddenly > > 'not seeing' the sockopt. It didn't crash, but debugging would be > > surprising. > > > > I feel it's better to copy the first 4k and let the program see it. > Agreed with the IP_FREEBIND example wrt observability, however it's > not clear what to do with the cropped buffer if the bpf program > modifies it. > > Consider that huge iptables setsockopts where the usespace passes > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it. > Now, if the bpf program modifies the buffer (say, flips some byte), we > are back to square one. We either have to silently discard that buffer > or reallocate/merge. My reasoning with data == NULL, is that at least > there is a clear signal that the program can't access the data (and > can look at optlen to see if the original buffer is indeed non-zero > and maybe deny such requests?). > At this point I'm really starting to think that maybe we should just > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an > upper bound :-/ > I'll try to think about this a bit more over the weekend. Yeah. Tough choices. We can also detect in the verifier whether program accessed ctx->optval and skip alloc/copy if program didn't touch it, but I suspect in most case the program would want to read it. I think vmallocing what optlen said is DoS-able. It's better to stick with single page. Let's keep brainstorming.
On Fri, Jun 12, 2020 at 8:50 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: [ .. ] > > > It's probably ok, but makes me uneasy about verifier consequences. > > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov. > > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ? > > > I don't think we have such tests. I guess bpf prog won't be able to read > > > anything and nothing will crash, but having PTR_TO_PACKET that is > > > actually NULL would be an odd special case to keep in mind for everyone > > > who will work on the verifier from now on. > > > > > > Also consider bpf prog that simply reads something small like 4 bytes. > > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have > > > those 4 bytes, so it's natural for the prog to assume that it can read it. > > > It will have > > > p = ctx->optval; > > > if (p + 4 > ctx->optval_end) > > > /* goto out and don't bother logging, since that never happens */ > > > *(u32*)p; > > > > > > but 'clever' user space would pass long optlen and prog suddenly > > > 'not seeing' the sockopt. It didn't crash, but debugging would be > > > surprising. > > > > > > I feel it's better to copy the first 4k and let the program see it. > > Agreed with the IP_FREEBIND example wrt observability, however it's > > not clear what to do with the cropped buffer if the bpf program > > modifies it. > > > > Consider that huge iptables setsockopts where the usespace passes > > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it. > > Now, if the bpf program modifies the buffer (say, flips some byte), we > > are back to square one. We either have to silently discard that buffer > > or reallocate/merge. My reasoning with data == NULL, is that at least > > there is a clear signal that the program can't access the data (and > > can look at optlen to see if the original buffer is indeed non-zero > > and maybe deny such requests?). > > At this point I'm really starting to think that maybe we should just > > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an > > upper bound :-/ > > I'll try to think about this a bit more over the weekend. > > Yeah. Tough choices. > We can also detect in the verifier whether program accessed ctx->optval > and skip alloc/copy if program didn't touch it, but I suspect in most > case the program would want to read it. > I think vmallocing what optlen said is DoS-able. It's better to > stick with single page. > Let's keep brainstorming. Btw, can we use sleepable bpf for that? As in, do whatever I suggested in these patches (don't expose optval>PAGE_SIZE via context), but add a new helper where you can say 'copy x bytes from y offset of the original optval' (the helper will do sleepable copy_form_user). That way we have a clean signal to the BPF that the value is too big (optval==optval_end==NULL) and the user can fallback to the helper to inspect the value. We can also provide another helper to export new value for this case. WDYT?
On Mon, Jun 15, 2020 at 09:41:38AM -0700, Stanislav Fomichev wrote: > On Fri, Jun 12, 2020 at 8:50 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > [ .. ] > > > > It's probably ok, but makes me uneasy about verifier consequences. > > > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov. > > > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ? > > > > I don't think we have such tests. I guess bpf prog won't be able to read > > > > anything and nothing will crash, but having PTR_TO_PACKET that is > > > > actually NULL would be an odd special case to keep in mind for everyone > > > > who will work on the verifier from now on. > > > > > > > > Also consider bpf prog that simply reads something small like 4 bytes. > > > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have > > > > those 4 bytes, so it's natural for the prog to assume that it can read it. > > > > It will have > > > > p = ctx->optval; > > > > if (p + 4 > ctx->optval_end) > > > > /* goto out and don't bother logging, since that never happens */ > > > > *(u32*)p; > > > > > > > > but 'clever' user space would pass long optlen and prog suddenly > > > > 'not seeing' the sockopt. It didn't crash, but debugging would be > > > > surprising. > > > > > > > > I feel it's better to copy the first 4k and let the program see it. > > > Agreed with the IP_FREEBIND example wrt observability, however it's > > > not clear what to do with the cropped buffer if the bpf program > > > modifies it. > > > > > > Consider that huge iptables setsockopts where the usespace passes > > > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it. > > > Now, if the bpf program modifies the buffer (say, flips some byte), we > > > are back to square one. We either have to silently discard that buffer > > > or reallocate/merge. My reasoning with data == NULL, is that at least > > > there is a clear signal that the program can't access the data (and > > > can look at optlen to see if the original buffer is indeed non-zero > > > and maybe deny such requests?). > > > At this point I'm really starting to think that maybe we should just > > > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an > > > upper bound :-/ > > > I'll try to think about this a bit more over the weekend. > > > > Yeah. Tough choices. > > We can also detect in the verifier whether program accessed ctx->optval > > and skip alloc/copy if program didn't touch it, but I suspect in most > > case the program would want to read it. > > I think vmallocing what optlen said is DoS-able. It's better to > > stick with single page. > > Let's keep brainstorming. > Btw, can we use sleepable bpf for that? As in, do whatever I suggested > in these patches (don't expose optval>PAGE_SIZE via context), but add > a new helper where you can say 'copy x bytes from y offset of the > original optval' (the helper will do sleepable copy_form_user). > That way we have a clean signal to the BPF that the value is too big > (optval==optval_end==NULL) and the user can fallback to the helper to > inspect the value. We can also provide another helper to export new > value for this case. sleepable will be read-only and with toctou. I guess this patch is the least evil then ? But I'm confused with the test in patch 2. Why does it do 'if (optval > optval_end)' ? How is that possible when patch 1 makes them equal. may be another idea: allocate 4k, copy first 4k into it, but keep ctx.optlen as original. if bpf prog reads optval and finds it ok in setsockopt, it can set ctx.optlen = 0 which would mean run the rest of setsockopt handling with original '__user *optval' and ignore the buffer that was passed to bpf prog. In case of ctx.optlen < 4k the behavior won't change from bpf prog and from kernel pov. When ctx.optlen > 4k and prog didn't adjust it __cgroup_bpf_run_filter_setsockopt will do ret = -EFAULT; and reject sockopt. So bpf prog would be force to do something with large optvals. yet it will be able to examine first 4k bytes of it. It's a bit of change in behavior, but I don't think bpf prog is doing ctx.optlen = 0, since that's more or less the same as doing ctx.optlen = -2. or may be use some other indicator ? And do something similar with getsockopt.
On Tue, Jun 16, 2020 at 4:04 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Jun 15, 2020 at 09:41:38AM -0700, Stanislav Fomichev wrote: > > On Fri, Jun 12, 2020 at 8:50 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > [ .. ] > > > > > It's probably ok, but makes me uneasy about verifier consequences. > > > > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov. > > > > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ? > > > > > I don't think we have such tests. I guess bpf prog won't be able to read > > > > > anything and nothing will crash, but having PTR_TO_PACKET that is > > > > > actually NULL would be an odd special case to keep in mind for everyone > > > > > who will work on the verifier from now on. > > > > > > > > > > Also consider bpf prog that simply reads something small like 4 bytes. > > > > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have > > > > > those 4 bytes, so it's natural for the prog to assume that it can read it. > > > > > It will have > > > > > p = ctx->optval; > > > > > if (p + 4 > ctx->optval_end) > > > > > /* goto out and don't bother logging, since that never happens */ > > > > > *(u32*)p; > > > > > > > > > > but 'clever' user space would pass long optlen and prog suddenly > > > > > 'not seeing' the sockopt. It didn't crash, but debugging would be > > > > > surprising. > > > > > > > > > > I feel it's better to copy the first 4k and let the program see it. > > > > Agreed with the IP_FREEBIND example wrt observability, however it's > > > > not clear what to do with the cropped buffer if the bpf program > > > > modifies it. > > > > > > > > Consider that huge iptables setsockopts where the usespace passes > > > > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it. > > > > Now, if the bpf program modifies the buffer (say, flips some byte), we > > > > are back to square one. We either have to silently discard that buffer > > > > or reallocate/merge. My reasoning with data == NULL, is that at least > > > > there is a clear signal that the program can't access the data (and > > > > can look at optlen to see if the original buffer is indeed non-zero > > > > and maybe deny such requests?). > > > > At this point I'm really starting to think that maybe we should just > > > > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an > > > > upper bound :-/ > > > > I'll try to think about this a bit more over the weekend. > > > > > > Yeah. Tough choices. > > > We can also detect in the verifier whether program accessed ctx->optval > > > and skip alloc/copy if program didn't touch it, but I suspect in most > > > case the program would want to read it. > > > I think vmallocing what optlen said is DoS-able. It's better to > > > stick with single page. > > > Let's keep brainstorming. > > Btw, can we use sleepable bpf for that? As in, do whatever I suggested > > in these patches (don't expose optval>PAGE_SIZE via context), but add > > a new helper where you can say 'copy x bytes from y offset of the > > original optval' (the helper will do sleepable copy_form_user). > > That way we have a clean signal to the BPF that the value is too big > > (optval==optval_end==NULL) and the user can fallback to the helper to > > inspect the value. We can also provide another helper to export new > > value for this case. > > sleepable will be read-only and with toctou. > I guess this patch is the least evil then ? > But I'm confused with the test in patch 2. > Why does it do 'if (optval > optval_end)' ? > How is that possible when patch 1 makes them equal. Right, it should really be 'optval < optval_end', good point! I want to make sure in the BPF program that we can't access the buffer (essentially return EPERM to the userpace so the test fails). Will fix in a follow up. > may be another idea: > allocate 4k, copy first 4k into it, but keep ctx.optlen as original. > if bpf prog reads optval and finds it ok in setsockopt, > it can set ctx.optlen = 0 > which would mean run the rest of setsockopt handling with original > '__user *optval' and ignore the buffer that was passed to bpf prog. > In case of ctx.optlen < 4k the behavior won't change from bpf prog > and from kernel pov. > When ctx.optlen > 4k and prog didn't adjust it > __cgroup_bpf_run_filter_setsockopt will do ret = -EFAULT; > and reject sockopt. > So bpf prog would be force to do something with large optvals. > yet it will be able to examine first 4k bytes of it. > It's a bit of change in behavior, but I don't think bpf prog is > doing ctx.optlen = 0, since that's more or less the same as > doing ctx.optlen = -2. > or may be use some other indicator ? > And do something similar with getsockopt. Good suggestion! That sounds doable in theory, let me try and see if I hit something unexpected. I suppose trimming provided optval to zero (optlen=0) is not something that sensible programs would do? In this case please ignore the v4 I posted recently!
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index fdf7836750a3..758082853086 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -1276,9 +1276,18 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) { - if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0) + if (unlikely(max_optlen < 0)) return -EINVAL; + if (unlikely(max_optlen > PAGE_SIZE)) { + /* We don't expose optvals that are greater than PAGE_SIZE + * to the BPF program. + */ + ctx->optval = NULL; + ctx->optval_end = NULL; + return 0; + } + ctx->optval = kzalloc(max_optlen, GFP_USER); if (!ctx->optval) return -ENOMEM; @@ -1325,7 +1334,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, ctx.optlen = *optlen; - if (copy_from_user(ctx.optval, optval, *optlen) != 0) { + if (ctx.optval && copy_from_user(ctx.optval, optval, *optlen) != 0) { ret = -EFAULT; goto out; } @@ -1407,7 +1416,8 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, if (ctx.optlen > max_optlen) ctx.optlen = max_optlen; - if (copy_from_user(ctx.optval, optval, ctx.optlen) != 0) { + if (ctx.optval && + copy_from_user(ctx.optval, optval, ctx.optlen) != 0) { ret = -EFAULT; goto out; } @@ -1436,7 +1446,7 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, goto out; } - if (copy_to_user(optval, ctx.optval, ctx.optlen) || + if ((ctx.optval && copy_to_user(optval, ctx.optval, ctx.optlen)) || put_user(ctx.optlen, optlen)) { ret = -EFAULT; goto out;
Attaching to these hooks can break iptables because its optval is usually quite big, or at least bigger than the current PAGE_SIZE limit. David also mentioned some SCTP options can be big (around 256k). There are two possible ways to fix it: 1. Increase the limit to match iptables max optval. There is, however, no clear upper limit. Technically, iptables can accept up to 512M of data (not sure how practical it is though). 2. Bypass the value (don't expose to BPF) if it's too big and trigger BPF only with level/optname so BPF can still decide whether to allow/deny big sockopts. The initial attempt was implemented using strategy #1. Due to listed shortcomings, let's switch to strategy #2. When there is legitimate a real use-case for iptables/SCTP, we can consider increasing the PAGE_SIZE limit. v3: * don't increase the limit, bypass the argument v2: * proper comments formatting (Jakub Kicinski) Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Cc: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Stanislav Fomichev <sdf@google.com> --- kernel/bpf/cgroup.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-)