[RFC] net: bpf: make __bpf_skb_max_len(skb) an skb-independent constant

Message ID 20200420231427.63894-1-zenczykowski@gmail.com
State RFC
Delegated to: BPF Maintainers

Commit Message

Maciej Żenczykowski April 20, 2020, 11:14 p.m. UTC
From: Maciej Żenczykowski <maze@google.com>

This function is used from:
  bpf_skb_adjust_room
  __bpf_skb_change_tail
  __bpf_skb_change_head

but in the case of forwarding we're likely calling these functions
during receive processing on ingress and bpf_redirect()'ing at
a later point in time to egress on another interface, thus these
mtu checks are for the wrong device.

This is particularly problematic if we're receiving on an L3 1500 mtu
cellular interface, trying to add an L2 header and forwarding to
an L3 1500 mtu wifi/ethernet device.  The mtu check prevents
us from adding the ethernet header prior to forwarding the packet.
After the packet has already been redirected, we'd need to add
an additional 2nd ebpf program on the target device's egress tc hook,
but then we'd also see non-redirected traffic and have no easy
way to tell apart normal egress with ethernet header packets
from forwarded ethernet headerless packets.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Maciej Żenczykowski April 20, 2020, 11:26 p.m. UTC | #1
This is only a semi-serious patch.

But, I've spent a long time trying to come up with a solution that works,
and everything seems broken.

I'm hoping someone else has some ideas.

As is, forwarding doesn't work.

Here's an example scenario:

cell0 - 1500 l3 mtu, raw_ip, 0 l2 header
wlan0 - 1500 l3 mtu, ethernet, 14 l2 header

cell0 -> wlan0 forwarding

tc ingress hook on cell0:
  map lookups, other stuff, eventually
  skb_modifications to add ethernet header (via skb_change_head or
  bpf_skb_adjust_room)
  bpf_redirect(wlan0, egress)

This fails because adding the ethernet header goes above cell0->mtu + header_len,
even though it would be fine if we tested against wlan0->mtu + header_len.
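
A minimal userspace sketch of the arithmetic in this scenario. The struct
and function names are hypothetical stand-ins, not kernel code; the only
thing taken from the thread is the limit of mtu + hard_header_len and the
cell0/wlan0 numbers. It assumes a full-size 1500-byte packet arriving on
cell0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two net_device fields the check reads. */
struct dev_model {
	uint32_t mtu;             /* L3 MTU */
	uint32_t hard_header_len; /* L2 header length */
};

/* Mirrors the pre-patch limit: dev->mtu + dev->hard_header_len. */
static uint32_t max_len_for(const struct dev_model *dev)
{
	return dev->mtu + dev->hard_header_len;
}

/* True if a packet of cur_len bytes may grow by 'grow' bytes under dev's limit. */
static bool grow_allowed(const struct dev_model *dev, uint32_t cur_len,
			 uint32_t grow)
{
	return cur_len + grow <= max_len_for(dev);
}

/* Devices from the example scenario. */
static const struct dev_model cell0 = { .mtu = 1500, .hard_header_len = 0 };
static const struct dev_model wlan0 = { .mtu = 1500, .hard_header_len = 14 };
```

With these values, grow_allowed(&cell0, 1500, 14) is false while
grow_allowed(&wlan0, 1500, 14) is true, which is the mismatch described
above: the check runs against the ingress device, not the egress one.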

Indeed the only solution that would perhaps work is to have 2 bpf programs

tc ingress hook on cell0: redirect to wlan0
tc egress hook on wlan0: actually add the header

but this requires doing the lookups twice - first to determine if we
should redirect and where, and then to actually add the header.
Additionally the packet we get on wlan0 might not have come from the
redirect... and that's hard to detect...

so you actually need to do:

tc ingress hook on cell0: redirect to dummy0, which has larger mtu
tc ingress hook on dummy0: add header, redirect to wlan0

this still requires a double set of bpf programs and lookups...
it's ugly.

Calling bpf_redirect() prior to skb_change_head() isn't enough, since the
latter checks skb->dev, not tgt_index.  Although I guess we could save the
redirect device's mtu in the redirect struct and test against that in
preference to testing against skb->dev...
but that's really a pointless test, because you can call bpf_redirect
multiple times, changing the device, i.e.:

bpf_redirect(dummy with large mtu)
skb_change_head()
bpf_redirect(wlan0)

so basically this would make the test worthless...
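
A small userspace model of that sequence, under the assumption that the
size check were changed to use the most recently recorded redirect target.
All names here (tgt_model, pkt_model, model_redirect, model_change_head)
are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a redirect target's size limits. */
struct tgt_model {
	uint32_t mtu;
	uint32_t hard_header_len;
};

/* Hypothetical per-packet state: length and last recorded redirect target. */
struct pkt_model {
	uint32_t len;
	const struct tgt_model *redirect_tgt;
};

/* Records the target; a later call simply overwrites the earlier one. */
static void model_redirect(struct pkt_model *pkt, const struct tgt_model *tgt)
{
	pkt->redirect_tgt = tgt;
}

/* Grows the packet only if it fits the target recorded at this moment. */
static bool model_change_head(struct pkt_model *pkt, uint32_t grow)
{
	const struct tgt_model *tgt = pkt->redirect_tgt;

	if (!tgt || pkt->len + grow > tgt->mtu + tgt->hard_header_len)
		return false;
	pkt->len += grow;
	return true;
}
```

Redirecting to a large-mtu dummy target, growing the packet, and then
redirecting to wlan0 leaves a packet that was never checked against
wlan0's limit.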

I considered simply removing the mtu check from these skb modifying functions...
it's not like it even does the right thing:
(a) device mtu is only an upper limit - we should really be testing
    against path mtu, and that's probably only something the bpf code knows
(b) it ignores mtu entirely for gso packets: but gso max seg size
    should be tested instead...

Or maybe add a bpf uapi visible flag to ignore the mtu check...

Or maybe simply pass in 16-bits of mtu via the currently unused flags field...

... etc ...

- Maciej
Jakub Kicinski April 21, 2020, 5:27 p.m. UTC | #2
On Mon, 20 Apr 2020 16:14:27 -0700 Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> This function is used from:
>   bpf_skb_adjust_room
>   __bpf_skb_change_tail
>   __bpf_skb_change_head
> 
> but in the case of forwarding we're likely calling these functions
> during receive processing on ingress and bpf_redirect()'ing at
> a later point in time to egress on another interface, thus these
> mtu checks are for the wrong device.

Interesting. Without redirecting there should also be no reason
to do this check at ingress, right? So at ingress it's either 
incorrect or unnecessary?
Maciej Żenczykowski April 21, 2020, 8:36 p.m. UTC | #3
> > This function is used from:
> >   bpf_skb_adjust_room
> >   __bpf_skb_change_tail
> >   __bpf_skb_change_head
> >
> > but in the case of forwarding we're likely calling these functions
> > during receive processing on ingress and bpf_redirect()'ing at
> > a later point in time to egress on another interface, thus these
> > mtu checks are for the wrong device.
>
> Interesting. Without redirecting there should also be no reason
> to do this check at ingress, right? So at ingress it's either
> incorrect or unnecessary?

Well, I guess there's technically a chance that you'd want to mutate
the packet somehow during ingress pre-receive processing (without
redirecting)...
But yeah, I can't really think of a case where that would be
increasing the size of the packet.

Usually you'd be decapsulating at ingress and encapsulating at egress,
or doing ingress rewrite & redirect to egress...

(Also, note that relying on a sequence where at ingress you first call
bpf_redirect(ifindex, EGRESS); then change the packet size, and then
return TC_ACT_REDIRECT; thus being able to use the redirect ifindex
for mtu checks in the packet mutation functions is potentially buggy,
since there's no guarantee you won't call bpf_redirect again to change
the ifindex, or even return from the bpf program without returning
TC_ACT_REDIRECT --- so while that could be *more* correct, it would
still have holes...)
Alexei Starovoitov April 28, 2020, 5:53 p.m. UTC | #4
On Tue, Apr 21, 2020 at 01:36:08PM -0700, Maciej Żenczykowski wrote:
> > > This function is used from:
> > >   bpf_skb_adjust_room
> > >   __bpf_skb_change_tail
> > >   __bpf_skb_change_head
> > >
> > > but in the case of forwarding we're likely calling these functions
> > > during receive processing on ingress and bpf_redirect()'ing at
> > > a later point in time to egress on another interface, thus these
> > > mtu checks are for the wrong device.
> >
> > Interesting. Without redirecting there should also be no reason
> > to do this check at ingress, right? So at ingress it's either
> > incorrect or unnecessary?
> 
> Well, I guess there's technically a chance that you'd want to mutate
> the packet somehow during ingress pre-receive processing (without
> redirecting)...
> But yeah, I can't really think of a case where that would be
> increasing the size of the packet.
> 
> Usually you'd be decapsulating at ingress and encapsulating at egress,
> or doing ingress rewrite & redirect to egress...
> 
> (Also, note that relying on a sequence where at ingress you first call
> bpf_redirect(ifindex, EGRESS); then change the packet size, and then
> return TC_ACT_REDIRECT; thus being able to use the redirect ifindex
> for mtu checks in the packet mutation functions is potentially buggy,
> since there's no guarantee you won't call bpf_redirect again to change
> the ifindex, or even return from the bpf program without returning
> TC_ACT_REDIRECT --- so while that could be *more* correct, it would
> still have holes...)

yeah. there is no good fix here, since target netdev is not known,
but dropping the check also doesn't seem right.
How about:
	if (skb->dev) {
		u32 header_len = skb->dev->hard_header_len;

		if (!header_len)
			header_len = ETH_HLEN;
		return skb->dev->mtu + header_len;
	} else {
		return SKB_MAX_ALLOC;
	}

The idea is that l3 devices won't have an l2 header, so here we assume
that an l2 header can be added sooner or later.
It's not pretty either, but will it solve your wifi->eth use case,
while keeping basic sanity for other cases?
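
A userspace sketch of how the proposed fallback would behave for the
devices in the example scenario. The struct, function and MODEL_* names
are hypothetical; MODEL_MAX_ALLOC is an arbitrary placeholder, since the
kernel's SKB_MAX_ALLOC depends on page size and skb overhead.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_HLEN 14u /* Ethernet header length in bytes */
/* Placeholder value only; not the kernel's SKB_MAX_ALLOC. */
#define MODEL_MAX_ALLOC (16u * 1024u)

/* Hypothetical stand-in for the net_device fields the check reads. */
struct netdev_model {
	uint32_t mtu;
	uint32_t hard_header_len;
};

/* Models the proposal: headerless devices are treated as if an
 * Ethernet header may still be added later. */
static uint32_t proposed_max_len(const struct netdev_model *dev)
{
	if (dev) {
		uint32_t header_len = dev->hard_header_len;

		if (!header_len)
			header_len = MODEL_ETH_HLEN;
		return dev->mtu + header_len;
	}
	return MODEL_MAX_ALLOC;
}
```

Under this model a raw-ip device with mtu 1500 and no l2 header gets the
same 1514-byte limit as a 1500 mtu ethernet device, so adding a 14-byte
header at ingress on cell0 would pass the check.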
Patch

diff --git a/net/core/filter.c b/net/core/filter.c
index ec567d1e6fb9..1e119a47f9fe 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3159,8 +3159,7 @@  static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 
 static u32 __bpf_skb_max_len(const struct sk_buff *skb)
 {
-	return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
-			  SKB_MAX_ALLOC;
+	return SKB_MAX_ALLOC;
 }
 
 BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,