Message ID | CAOftzPisP-3jN8drC6RXcTigXJjdwEnvTRvTHR-Kv4LKn4rhQQ@mail.gmail.com |
---|---|
State | RFC |
Delegated to: | David Miller |
Headers | show |
Series | Removing skb_orphan() from ip_rcv_core() | expand |
Joe Stringer <joe@wand.net.nz> wrote: > As discussed during LSFMM, I've been looking at adding something like > an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can > be implemented with integration into other BPF logic, however > currently any attempts to do so are blocked by the skb_orphan() call > in ip_rcv_core() (which will effectively ignore any socket assign > decision made by the TC BPF program). > > Recently I was attempting to remove the skb_orphan() call, and I've > been trying different things but there seems to be some context I'm > missing. Here's the core of the patch: > > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c > index ed97724c5e33..16aea980318a 100644 > --- a/net/ipv4/ip_input.c > +++ b/net/ipv4/ip_input.c > @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff > *skb, struct net *net) > memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); > IPCB(skb)->iif = skb->skb_iif; > > - /* Must drop socket now because of tproxy. */ > - skb_orphan(skb); > > return skb; > > The statement that the socket must be dropped because of tproxy > doesn't make sense to me, because the PRE_ROUTING hook is hit after > this, which will call into the tproxy logic and eventually > nf_tproxy_assign_sock() which already does the skb_orphan() itself. in comment: s/tproxy/skb_steal_sock/ at least thats what I concluded a few years ago when I looked into the skb_oprhan() need. IIRC some device drivers use skb->sk for backpressure, so without this non-tcp socket would be stolen by skb_steal_sock. We also recently removed skb orphan when crossing netns: commit 9c4c325252c54b34d53b3d0ffd535182b744e03d Author: Flavio Leitner <fbl@redhat.com> skbuff: preserve sock reference when scrubbing the skb. So thats another case where this orphan is needed. What could be done is adding some way to delay/defer the orphaning further, but we would need at the very least some annotation for skb_steal_sock to know when the skb->sk is really from TPROXY or if it has to orphan. Same for the safety check in the forwarding path. Netfilter modules need o be audited as well, they might make assumptions wrt. skb->sk being inet sockets (set by local stack or early demux). > However, if I drop these lines then I end up causing sockets to > release references too many times. Seems like if we don't orphan the > skb here, then later logic assumes that we have one more reference > than we actually have, and decrements the count when it shouldn't > (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems > to assume we always have a reference to the socket?) We might be calling the wrong destructor (i.e., the one set by tcp receive instead of the one set at tx time)?
On 6/21/19 10:58 AM, Joe Stringer wrote: > Hi folks, picking this up again.. > > As discussed during LSFMM, I've been looking at adding something like > an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can > be implemented with integration into other BPF logic, however > currently any attempts to do so are blocked by the skb_orphan() call > in ip_rcv_core() (which will effectively ignore any socket assign > decision made by the TC BPF program). > > Recently I was attempting to remove the skb_orphan() call, and I've > been trying different things but there seems to be some context I'm > missing. Here's the core of the patch: > > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c > index ed97724c5e33..16aea980318a 100644 > --- a/net/ipv4/ip_input.c > +++ b/net/ipv4/ip_input.c > @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff > *skb, struct net *net) > memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); > IPCB(skb)->iif = skb->skb_iif; > > - /* Must drop socket now because of tproxy. */ > - skb_orphan(skb); > > return skb; > > The statement that the socket must be dropped because of tproxy > doesn't make sense to me, because the PRE_ROUTING hook is hit after > this, which will call into the tproxy logic and eventually > nf_tproxy_assign_sock() which already does the skb_orphan() itself. > > However, if I drop these lines then I end up causing sockets to > release references too many times. Seems like if we don't orphan the > skb here, then later logic assumes that we have one more reference > than we actually have, and decrements the count when it shouldn't > (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems > to assume we always have a reference to the socket?) > > Splat: > > refcount_t hit zero at sk_stop_timer+0x2c/0x30 in cilium-agent[16359], > uid/euid: 0/0 > WARNING: CPU: 0 PID: 16359 at kernel/panic.c:686 refcount_error_report+0x9c/0xa1 > ... > ? inet_put_port+0xa6/0xd0 > inet_csk_clear_xmit_timers+0x2e/0x50 > tcp_done+0x8b/0xf0 > tcp_reset+0x49/0xc0 > tcp_validate_incoming+0x2f7/0x410 > tcp_rcv_state_process+0x250/0xdb6 > ? tcp_v4_connect+0x46f/0x4e0 > tcp_v4_do_rcv+0xbd/0x1f0 > __release_sock+0x84/0xd0 > release_sock+0x30/0xa0 > inet_stream_connect+0x47/0x60 > > (Full version: https://gist.github.com/joestringer/d5313e4bf4231e2c46405bd7a3053936 > ) > > This seems potentially related to some of the socket referencing > discussion in the peer thread "[RFC bpf-next 0/7] Programming socket > lookup with BPF". > > During LSFMM, it seemed like no-one knew quite why the skb_orphan() is > necessary in that path in the current version of the code, and that we > may be able to remove it. Florian, I know you weren't in the room for > that discussion, so raising it again now with a stack trace, Do you > have some sense what's going on here and whether there's a path > towards removing it from this path or allowing the skb->sk to be > retained during ip_rcv() in some conditions? > I guess you might missed part of the LSFMM discussion : We need to make sure skb_orphan() is called from any virtual driver ndo_start_xmit() that re-injects packet back to the stack. loopback driver has it, but probably some driver lacks it.
On 2019-06-21 1:58 p.m., Joe Stringer wrote: > Hi folks, picking this up again.. [..] > During LSFMM, it seemed like no-one knew quite why the skb_orphan() is > necessary in that path in the current version of the code, and that we > may be able to remove it. Florian, I know you weren't in the room for > that discussion, so raising it again now with a stack trace, Do you > have some sense what's going on here and whether there's a path > towards removing it from this path or allowing the skb->sk to be > retained during ip_rcv() in some conditions? Sorry - I havent followed the discussion but saw your email over the weekend and wanted to be at work to refresh my memory on some code. For maybe 2-3 years we have deployed the tproxy equivalent as a tc action on ingress (with no netfilter dependency). And, of course, we had to work around that specific code you are referring to - we didnt remove it. The tc action code increments the sk refcount and sets the tc index. The net core doesnt orphan the skb if a speacial tc index value is set (see attached patch) I never bothered up streaming the patch because the hack is a bit embarrassing (but worked ;->); and never posted the action code either because i thought this was just us that had this requirement. I am glad other people see the need for this feature. Is there effort to make this _not_ depend on iptables/netfilter? I am guessing if you want to do this from ebpf (tc or xdp) that is a requirement. Our need was with tcp at the time; so left udp dependency on netfilter alone. cheers, jamal
On 6/24/19 7:47 AM, Jamal Hadi Salim wrote: > On 2019-06-21 1:58 p.m., Joe Stringer wrote: >> Hi folks, picking this up again.. > [..] >> During LSFMM, it seemed like no-one knew quite why the skb_orphan() is >> necessary in that path in the current version of the code, and that we >> may be able to remove it. Florian, I know you weren't in the room for >> that discussion, so raising it again now with a stack trace, Do you >> have some sense what's going on here and whether there's a path >> towards removing it from this path or allowing the skb->sk to be >> retained during ip_rcv() in some conditions? > > > Sorry - I havent followed the discussion but saw your email over > the weekend and wanted to be at work to refresh my memory on some > code. For maybe 2-3 years we have deployed the tproxy > equivalent as a tc action on ingress (with no netfilter dependency). > > And, of course, we had to work around that specific code you are > referring to - we didnt remove it. The tc action code increments > the sk refcount and sets the tc index. The net core doesnt orphan > the skb if a speacial tc index value is set (see attached patch) > > I never bothered up streaming the patch because the hack is a bit embarrassing (but worked ;->); and never posted the action code > either because i thought this was just us that had this requirement. > I am glad other people see the need for this feature. Is there effort > to make this _not_ depend on iptables/netfilter? I am guessing if you > want to do this from ebpf (tc or xdp) that is a requirement. > Our need was with tcp at the time; so left udp dependency on netfilter > alone. > Well, I would simply remove the skb_orphan() call completely.
On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote: > > Joe Stringer <joe@wand.net.nz> wrote: > > As discussed during LSFMM, I've been looking at adding something like > > an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can > > be implemented with integration into other BPF logic, however > > currently any attempts to do so are blocked by the skb_orphan() call > > in ip_rcv_core() (which will effectively ignore any socket assign > > decision made by the TC BPF program). > > > > Recently I was attempting to remove the skb_orphan() call, and I've > > been trying different things but there seems to be some context I'm > > missing. Here's the core of the patch: > > > > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c > > index ed97724c5e33..16aea980318a 100644 > > --- a/net/ipv4/ip_input.c > > +++ b/net/ipv4/ip_input.c > > @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff > > *skb, struct net *net) > > memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); > > IPCB(skb)->iif = skb->skb_iif; > > > > - /* Must drop socket now because of tproxy. */ > > - skb_orphan(skb); > > > > return skb; > > > > The statement that the socket must be dropped because of tproxy > > doesn't make sense to me, because the PRE_ROUTING hook is hit after > > this, which will call into the tproxy logic and eventually > > nf_tproxy_assign_sock() which already does the skb_orphan() itself. > > in comment: s/tproxy/skb_steal_sock/ For reference, I was following the path like this: ip_rcv() ( -> ip_rcv_core() for skb_orphan) -> NF_INET_PRE_ROUTING hook (... invoke iptables hooks) -> iptable_mangle_hook() -> ipt_do_table() ... -> tproxy_tg4() ... -> nf_tproxy_assign_sock() -> skb_orphan() (... finish iptables processing) ( -> ip_rcv_finish()) ( ... -> ip_rcv_finish_core() for early demux / route lookup ) (... -> dst_input()) (... -> tcp_v4_rcv()) ( -> __inet_lookup_skb()) ( -> skb_steal_sock() ) > at least thats what I concluded a few years ago when I looked into > the skb_oprhan() need. > > IIRC some device drivers use skb->sk for backpressure, so without this > non-tcp socket would be stolen by skb_steal_sock. Do you happen to recall which device drivers? Or have some idea of a list I could try to go through? Are you referring to virtual drivers like veth or something else? > We also recently removed skb orphan when crossing netns: > > commit 9c4c325252c54b34d53b3d0ffd535182b744e03d > Author: Flavio Leitner <fbl@redhat.com> > skbuff: preserve sock reference when scrubbing the skb. > > So thats another case where this orphan is needed. Presumably the orphan is only needed in this case if the packet crosses a namespace and then is subsequently passed back into the stack? > What could be done is adding some way to delay/defer the orphaning > further, but we would need at the very least some annotation for > skb_steal_sock to know when the skb->sk is really from TPROXY or > if it has to orphan. Eric mentions in another response to this thread that skb_orphan() should be called from any ndo_start_xmit() which sends traffic back into the stack. With that, presumably we would be pushing the orphaning earlier such that the only way that the skb->sk ref can be non-NULL around this point in receive would be because it was specifically set by some kind of tproxy logic? > Same for the safety check in the forwarding path. > Netfilter modules need o be audited as well, they might make assumptions > wrt. skb->sk being inet sockets (set by local stack or early demux). > > > However, if I drop these lines then I end up causing sockets to > > release references too many times. Seems like if we don't orphan the > > skb here, then later logic assumes that we have one more reference > > than we actually have, and decrements the count when it shouldn't > > (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems > > to assume we always have a reference to the socket?) > > We might be calling the wrong destructor (i.e., the one set by tcp > receive instead of the one set at tx time)? Hmm, interesting thought. Sure enough, with a bit of bpftrace debugging we find it's tcp_wfree(): $ cat ip_rcv.bt #include <linux/skbuff.h> kprobe:ip_rcv { $sk = ((struct sk_buff *)arg0)->sk; $des = ((struct sk_buff *)arg0)->destructor; if ($sk) { if ($des) { printf("received %s on %s with sk destructor %s set\n", str(arg0), str(arg1), ksym($des)); @ip4_stacks[kstack] = count(); } } } $ sudo bpftrace ip_rcv.bt Attaching 1 probe... received on eth0 with sk destructor tcp_wfree set ^C @ip4_stacks[ ip_rcv+1 __netif_receive_skb+24 process_backlog+179 net_rx_action+304 __do_softirq+220 do_softirq_own_stack+42 do_softirq.part.17+70 __local_bh_enable_ip+101 ip_finish_output2+421 __ip_finish_output+187 ip_finish_output+44 ip_output+109 ip_local_out+59 __ip_queue_xmit+368 ip_queue_xmit+16 __tcp_transmit_skb+1303 tcp_connect+2758 tcp_v4_connect+1135 __inet_stream_connect+214 inet_stream_connect+59 __sys_connect+237 __x64_sys_connect+26 do_syscall_64+90 entry_SYSCALL_64_after_hwframe+68 ]: 1 Is there a solution here where we call the destructor if it's not sock_efree()? When the socket is later stolen, it will only return the reference via a call to sock_put(), so presumably at that point in the stack we already assume that the skb->destructor is not one of these other destructors (otherwise we wouldn't release the resources correctly).
On Mon, Jun 24, 2019 at 7:47 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On 2019-06-21 1:58 p.m., Joe Stringer wrote: > > Hi folks, picking this up again.. > [..] > > During LSFMM, it seemed like no-one knew quite why the skb_orphan() is > > necessary in that path in the current version of the code, and that we > > may be able to remove it. Florian, I know you weren't in the room for > > that discussion, so raising it again now with a stack trace, Do you > > have some sense what's going on here and whether there's a path > > towards removing it from this path or allowing the skb->sk to be > > retained during ip_rcv() in some conditions? > > > Sorry - I havent followed the discussion but saw your email over > the weekend and wanted to be at work to refresh my memory on some > code. For maybe 2-3 years we have deployed the tproxy > equivalent as a tc action on ingress (with no netfilter dependency). > > And, of course, we had to work around that specific code you are > referring to - we didnt remove it. The tc action code increments > the sk refcount and sets the tc index. The net core doesnt orphan > the skb if a speacial tc index value is set (see attached patch) > > I never bothered up streaming the patch because the hack is a bit > embarrassing (but worked ;->); and never posted the action code > either because i thought this was just us that had this requirement. > I am glad other people see the need for this feature. Is there effort > to make this _not_ depend on iptables/netfilter? I am guessing if you > want to do this from ebpf (tc or xdp) that is a requirement. > Our need was with tcp at the time; so left udp dependency on netfilter > alone. I haven't got as far as UDP yet, but I didn't see any need for a dependency on netfilter.
On 6/24/19 8:17 PM, Joe Stringer wrote: > On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote: >> >> Joe Stringer <joe@wand.net.nz> wrote: >>> As discussed during LSFMM, I've been looking at adding something like >>> an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can >>> be implemented with integration into other BPF logic, however >>> currently any attempts to do so are blocked by the skb_orphan() call >>> in ip_rcv_core() (which will effectively ignore any socket assign >>> decision made by the TC BPF program). >>> >>> Recently I was attempting to remove the skb_orphan() call, and I've >>> been trying different things but there seems to be some context I'm >>> missing. Here's the core of the patch: >>> >>> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c >>> index ed97724c5e33..16aea980318a 100644 >>> --- a/net/ipv4/ip_input.c >>> +++ b/net/ipv4/ip_input.c >>> @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff >>> *skb, struct net *net) >>> memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); >>> IPCB(skb)->iif = skb->skb_iif; >>> >>> - /* Must drop socket now because of tproxy. */ >>> - skb_orphan(skb); >>> >>> return skb; >>> >>> The statement that the socket must be dropped because of tproxy >>> doesn't make sense to me, because the PRE_ROUTING hook is hit after >>> this, which will call into the tproxy logic and eventually >>> nf_tproxy_assign_sock() which already does the skb_orphan() itself. >> >> in comment: s/tproxy/skb_steal_sock/ > > For reference, I was following the path like this: > > ip_rcv() > ( -> ip_rcv_core() for skb_orphan) > -> NF_INET_PRE_ROUTING hook > (... invoke iptables hooks) > -> iptable_mangle_hook() > -> ipt_do_table() > ... -> tproxy_tg4() > ... -> nf_tproxy_assign_sock() > -> skb_orphan() > (... finish iptables processing) > ( -> ip_rcv_finish()) > ( ... -> ip_rcv_finish_core() for early demux / route lookup ) > (... -> dst_input()) > (... -> tcp_v4_rcv()) > ( -> __inet_lookup_skb()) > ( -> skb_steal_sock() ) > >> at least thats what I concluded a few years ago when I looked into >> the skb_oprhan() need. >> >> IIRC some device drivers use skb->sk for backpressure, so without this >> non-tcp socket would be stolen by skb_steal_sock. > > Do you happen to recall which device drivers? Or have some idea of a > list I could try to go through? Are you referring to virtual drivers > like veth or something else? > >> We also recently removed skb orphan when crossing netns: >> >> commit 9c4c325252c54b34d53b3d0ffd535182b744e03d >> Author: Flavio Leitner <fbl@redhat.com> >> skbuff: preserve sock reference when scrubbing the skb. >> >> So thats another case where this orphan is needed. > > Presumably the orphan is only needed in this case if the packet > crosses a namespace and then is subsequently passed back into the > stack? Yes, I understand we do not want the skb_orphan() when 'srubing' the skb. But we want the skb_orphan() right before the packet is reinjected in ingress path. > >> What could be done is adding some way to delay/defer the orphaning >> further, but we would need at the very least some annotation for >> skb_steal_sock to know when the skb->sk is really from TPROXY or >> if it has to orphan. > > Eric mentions in another response to this thread that skb_orphan() > should be called from any ndo_start_xmit() which sends traffic back > into the stack. With that, presumably we would be pushing the > orphaning earlier such that the only way that the skb->sk ref can be > non-NULL around this point in receive would be because it was > specifically set by some kind of tproxy logic? > >> Same for the safety check in the forwarding path. >> Netfilter modules need o be audited as well, they might make assumptions >> wrt. skb->sk being inet sockets (set by local stack or early demux). >> >>> However, if I drop these lines then I end up causing sockets to >>> release references too many times. Seems like if we don't orphan the >>> skb here, then later logic assumes that we have one more reference >>> than we actually have, and decrements the count when it shouldn't >>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems >>> to assume we always have a reference to the socket?) >> >> We might be calling the wrong destructor (i.e., the one set by tcp >> receive instead of the one set at tx time)? > > Hmm, interesting thought. Sure enough, with a bit of bpftrace > debugging we find it's tcp_wfree(): > > $ cat ip_rcv.bt > #include <linux/skbuff.h> > > kprobe:ip_rcv { > $sk = ((struct sk_buff *)arg0)->sk; > $des = ((struct sk_buff *)arg0)->destructor; > if ($sk) { > if ($des) { > printf("received %s on %s with sk destructor %s > set\n", str(arg0), str(arg1), ksym($des)); > @ip4_stacks[kstack] = count(); > } > } > } > $ sudo bpftrace ip_rcv.bt > Attaching 1 probe... > received on eth0 with sk destructor tcp_wfree set > ^C > > @ip4_stacks[ > ip_rcv+1 > __netif_receive_skb+24 > process_backlog+179 > net_rx_action+304 > __do_softirq+220 > do_softirq_own_stack+42 > do_softirq.part.17+70 > __local_bh_enable_ip+101 > ip_finish_output2+421 > __ip_finish_output+187 > ip_finish_output+44 > ip_output+109 > ip_local_out+59 > __ip_queue_xmit+368 > ip_queue_xmit+16 > __tcp_transmit_skb+1303 > tcp_connect+2758 > tcp_v4_connect+1135 > __inet_stream_connect+214 > inet_stream_connect+59 > __sys_connect+237 > __x64_sys_connect+26 > do_syscall_64+90 > entry_SYSCALL_64_after_hwframe+68 > ]: 1 > > Is there a solution here where we call the destructor if it's not > sock_efree()? When the socket is later stolen, it will only return the > reference via a call to sock_put(), so presumably at that point in the > stack we already assume that the skb->destructor is not one of these > other destructors (otherwise we wouldn't release the resources > correctly). > What was the driver here ? In any case, the following patch should help. diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3699,6 +3699,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev, return NET_RX_DROP; } + skb_orphan(skb); skb_scrub_packet(skb, true); skb->priority = 0; return 0;
On 06/25/2019 08:37 AM, Eric Dumazet wrote: > On 6/24/19 8:17 PM, Joe Stringer wrote: >> On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote: >>> >>> Joe Stringer <joe@wand.net.nz> wrote: >>>> As discussed during LSFMM, I've been looking at adding something like >>>> an `skb_sk_assign()` helper to BPF so that logic similar to TPROXY can >>>> be implemented with integration into other BPF logic, however >>>> currently any attempts to do so are blocked by the skb_orphan() call >>>> in ip_rcv_core() (which will effectively ignore any socket assign >>>> decision made by the TC BPF program). >>>> >>>> Recently I was attempting to remove the skb_orphan() call, and I've >>>> been trying different things but there seems to be some context I'm >>>> missing. Here's the core of the patch: >>>> >>>> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c >>>> index ed97724c5e33..16aea980318a 100644 >>>> --- a/net/ipv4/ip_input.c >>>> +++ b/net/ipv4/ip_input.c >>>> @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff >>>> *skb, struct net *net) >>>> memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); >>>> IPCB(skb)->iif = skb->skb_iif; >>>> >>>> - /* Must drop socket now because of tproxy. */ >>>> - skb_orphan(skb); >>>> >>>> return skb; >>>> >>>> The statement that the socket must be dropped because of tproxy >>>> doesn't make sense to me, because the PRE_ROUTING hook is hit after >>>> this, which will call into the tproxy logic and eventually >>>> nf_tproxy_assign_sock() which already does the skb_orphan() itself. >>> >>> in comment: s/tproxy/skb_steal_sock/ >> >> For reference, I was following the path like this: >> >> ip_rcv() >> ( -> ip_rcv_core() for skb_orphan) >> -> NF_INET_PRE_ROUTING hook >> (... invoke iptables hooks) >> -> iptable_mangle_hook() >> -> ipt_do_table() >> ... -> tproxy_tg4() >> ... -> nf_tproxy_assign_sock() >> -> skb_orphan() >> (... finish iptables processing) >> ( -> ip_rcv_finish()) >> ( ... -> ip_rcv_finish_core() for early demux / route lookup ) >> (... -> dst_input()) >> (... -> tcp_v4_rcv()) >> ( -> __inet_lookup_skb()) >> ( -> skb_steal_sock() ) >> >>> at least thats what I concluded a few years ago when I looked into >>> the skb_oprhan() need. >>> >>> IIRC some device drivers use skb->sk for backpressure, so without this >>> non-tcp socket would be stolen by skb_steal_sock. >> >> Do you happen to recall which device drivers? Or have some idea of a >> list I could try to go through? Are you referring to virtual drivers >> like veth or something else? >> >>> We also recently removed skb orphan when crossing netns: >>> >>> commit 9c4c325252c54b34d53b3d0ffd535182b744e03d >>> Author: Flavio Leitner <fbl@redhat.com> >>> skbuff: preserve sock reference when scrubbing the skb. >>> >>> So thats another case where this orphan is needed. >> >> Presumably the orphan is only needed in this case if the packet >> crosses a namespace and then is subsequently passed back into the >> stack? > > Yes, I understand we do not want the skb_orphan() when 'srubing' the skb. > > But we want the skb_orphan() right before the packet is reinjected in ingress path. > >>> What could be done is adding some way to delay/defer the orphaning >>> further, but we would need at the very least some annotation for >>> skb_steal_sock to know when the skb->sk is really from TPROXY or >>> if it has to orphan. >> >> Eric mentions in another response to this thread that skb_orphan() >> should be called from any ndo_start_xmit() which sends traffic back >> into the stack. With that, presumably we would be pushing the >> orphaning earlier such that the only way that the skb->sk ref can be >> non-NULL around this point in receive would be because it was >> specifically set by some kind of tproxy logic? >> >>> Same for the safety check in the forwarding path. >>> Netfilter modules need o be audited as well, they might make assumptions >>> wrt. skb->sk being inet sockets (set by local stack or early demux). >>> >>>> However, if I drop these lines then I end up causing sockets to >>>> release references too many times. Seems like if we don't orphan the >>>> skb here, then later logic assumes that we have one more reference >>>> than we actually have, and decrements the count when it shouldn't >>>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems >>>> to assume we always have a reference to the socket?) >>> >>> We might be calling the wrong destructor (i.e., the one set by tcp >>> receive instead of the one set at tx time)? >> >> Hmm, interesting thought. Sure enough, with a bit of bpftrace >> debugging we find it's tcp_wfree(): >> >> $ cat ip_rcv.bt >> #include <linux/skbuff.h> >> >> kprobe:ip_rcv { >> $sk = ((struct sk_buff *)arg0)->sk; >> $des = ((struct sk_buff *)arg0)->destructor; >> if ($sk) { >> if ($des) { >> printf("received %s on %s with sk destructor %s >> set\n", str(arg0), str(arg1), ksym($des)); >> @ip4_stacks[kstack] = count(); >> } >> } >> } >> $ sudo bpftrace ip_rcv.bt >> Attaching 1 probe... >> received on eth0 with sk destructor tcp_wfree set >> ^C >> >> @ip4_stacks[ >> ip_rcv+1 >> __netif_receive_skb+24 >> process_backlog+179 >> net_rx_action+304 >> __do_softirq+220 >> do_softirq_own_stack+42 >> do_softirq.part.17+70 >> __local_bh_enable_ip+101 >> ip_finish_output2+421 >> __ip_finish_output+187 >> ip_finish_output+44 >> ip_output+109 >> ip_local_out+59 >> __ip_queue_xmit+368 >> ip_queue_xmit+16 >> __tcp_transmit_skb+1303 >> tcp_connect+2758 >> tcp_v4_connect+1135 >> __inet_stream_connect+214 >> inet_stream_connect+59 >> __sys_connect+237 >> __x64_sys_connect+26 >> do_syscall_64+90 >> entry_SYSCALL_64_after_hwframe+68 >> ]: 1 >> >> Is there a solution here where we call the destructor if it's not >> sock_efree()? When the socket is later stolen, it will only return the >> reference via a call to sock_put(), so presumably at that point in the >> stack we already assume that the skb->destructor is not one of these >> other destructors (otherwise we wouldn't release the resources >> correctly). > > What was the driver here ? In any case, the following patch should help. > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -3699,6 +3699,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev, > return NET_RX_DROP; > } > > + skb_orphan(skb); > skb_scrub_packet(skb, true); > skb->priority = 0; > return 0; > But wasn't the whole point of 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb.") to defer orphaning to as late as possible? If I'm not missing anything, then above would reintroduce the issues that 9c4c325252c5 was trying to solve wrt TSQ/XPS/etc when skb was sent via veth based data path to cross netns and then forwarded to phys dev for transmission; meaning, skb->sk is lost at the point of dev_queue_xmit() for the latter. A side-effect this would also have is that this changes behavior again for tc egress programs sitting on phys dev (e.g. querying sock cookie or other related features). Thanks, Daniel
On 2019-06-24 12:49 p.m., Eric Dumazet wrote: > > > Well, I would simply remove the skb_orphan() call completely. My experience: You still need to call skb_orphan() from the lower level (and set the skb destructor etc). cheers, jamal
On 2019-06-24 11:26 p.m., Joe Stringer wrote: [..] > > I haven't got as far as UDP yet, but I didn't see any need for a > dependency on netfilter. I'd be curious to see what you did. My experience, even for TCP is the socket(transparent/tproxy) lookup code (to set skb->sk either listening or established) is entangled in CONFIG_NETFILTER_SOMETHING_OR_OTHER. You have to rip it out of there (in the tproxy tc action into that code). Only then can you compile out netfilter. I didnt bother to rip out code for udp case. i.e if you needed udp to work with the tc action, youd have to turn on NF. But that was because we had no need for udp transparent proxying. IOW: There is really no reason, afaik, for tproxy code to only be accessed if netfilter is compiled in. Not sure i made sense. cheers, jamal
On 6/25/19 2:35 AM, Daniel Borkmann wrote: > > But wasn't the whole point of 9c4c325252c5 ("skbuff: preserve sock reference when > scrubbing the skb.") to defer orphaning to as late as possible? If I'm not missing > anything, then above would reintroduce the issues that 9c4c325252c5 was trying to > solve wrt TSQ/XPS/etc when skb was sent via veth based data path to cross netns and > then forwarded to phys dev for transmission; meaning, skb->sk is lost at the point > of dev_queue_xmit() for the latter. A side-effect this would also have is that this > changes behavior again for tc egress programs sitting on phys dev (e.g. querying > sock cookie or other related features). Unless we can detect/decide that a packet going through veth pair is going to be locally consumed, or forwarded to a physical device (another ndo_start_xmit()), we need to skb_orphan() the packet, exactly the same way than loopback ndo_start_xmit() (We could have setups where these packets going through lo interface could be forwarded to a NIC...) Backpressure is a best effort, we should not make it an absolute requirement and prevent doing early demux as early as possible in RX path.
On Mon, Jun 24, 2019 at 11:37 PM Eric Dumazet <eric.dumazet@gmail.com> wrote: > On 6/24/19 8:17 PM, Joe Stringer wrote: > > On Fri, Jun 21, 2019 at 1:59 PM Florian Westphal <fw@strlen.de> wrote: > >> Joe Stringer <joe@wand.net.nz> wrote: > >>> However, if I drop these lines then I end up causing sockets to > >>> release references too many times. Seems like if we don't orphan the > >>> skb here, then later logic assumes that we have one more reference > >>> than we actually have, and decrements the count when it shouldn't > >>> (perhaps the skb_steal_sock() call in __inet_lookup_skb() which seems > >>> to assume we always have a reference to the socket?) > >> > >> We might be calling the wrong destructor (i.e., the one set by tcp > >> receive instead of the one set at tx time)? > > > > Hmm, interesting thought. Sure enough, with a bit of bpftrace > > debugging we find it's tcp_wfree(): > > > > $ cat ip_rcv.bt > > #include <linux/skbuff.h> > > > > kprobe:ip_rcv { > > $sk = ((struct sk_buff *)arg0)->sk; > > $des = ((struct sk_buff *)arg0)->destructor; > > if ($sk) { > > if ($des) { > > printf("received %s on %s with sk destructor %s > > set\n", str(arg0), str(arg1), ksym($des)); > > @ip4_stacks[kstack] = count(); > > } > > } > > } > > $ sudo bpftrace ip_rcv.bt > > Attaching 1 probe... > > received on eth0 with sk destructor tcp_wfree set > > ^C > > > > @ip4_stacks[ > > ip_rcv+1 > > __netif_receive_skb+24 > > process_backlog+179 > > net_rx_action+304 > > __do_softirq+220 > > do_softirq_own_stack+42 > > do_softirq.part.17+70 > > __local_bh_enable_ip+101 > > ip_finish_output2+421 > > __ip_finish_output+187 > > ip_finish_output+44 > > ip_output+109 > > ip_local_out+59 > > __ip_queue_xmit+368 > > ip_queue_xmit+16 > > __tcp_transmit_skb+1303 > > tcp_connect+2758 > > tcp_v4_connect+1135 > > __inet_stream_connect+214 > > inet_stream_connect+59 > > __sys_connect+237 > > __x64_sys_connect+26 > > do_syscall_64+90 > > entry_SYSCALL_64_after_hwframe+68 > > ]: 1 > > > > Is there a solution here where we call the destructor if it's not > > sock_efree()? When the socket is later stolen, it will only return the > > reference via a call to sock_put(), so presumably at that point in the > > stack we already assume that the skb->destructor is not one of these > > other destructors (otherwise we wouldn't release the resources > > correctly). > > > > What was the driver here ? In any case, the following patch should help. > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index eeacebd7debbe6a55daedb92f00afd48051ebaf8..5075b4b267af7057f69fcb935226fce097a920e2 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -3699,6 +3699,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev, > return NET_RX_DROP; > } > > + skb_orphan(skb); > skb_scrub_packet(skb, true); > skb->priority = 0; > return 0; Looks like it was bridge in the end, found by attaching a similar bpftrace program to __dev_forward_sk(). Interestingly enough, the device attached to the skb reported its name as "eth0" despite not having such a named link or named bridge that I could find anywhere via "ip link" / "brctl show".. __dev_forward_skb+1 dev_hard_start_xmit+151 __dev_queue_xmit+1928 dev_queue_xmit+16 br_dev_queue_push_xmit+123 br_forward_finish+69 __br_forward+327 br_forward+204 br_dev_xmit+598 dev_hard_start_xmit+151 __dev_queue_xmit+1928 dev_queue_xmit+16 neigh_resolve_output+339 ip_finish_output2+402 __ip_finish_output+187 ip_finish_output+44 ip_output+109 ip_local_out+59 __ip_queue_xmit+368 ip_queue_xmit+16 __tcp_transmit_skb+1303 tcp_connect+2758 tcp_v4_connect+1135 __inet_stream_connect+214 inet_stream_connect+59 __sys_connect+237 __x64_sys_connect+26 do_syscall_64+90 entry_SYSCALL_64_after_hwframe+68 So I guess something like this could be another alternative: diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index 82225b8b54f5..c2de2bb35080 100644 --- a/net/bridge/br_forward.c +++ b/net/bridge/br_forward.c @@ -65,6 +65,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit); int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { + skb_orphan(skb); skb->tstamp = 0; return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING, net, sk, skb, NULL, skb->dev,
On Tue, Jun 25, 2019 at 4:07 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On 2019-06-24 11:26 p.m., Joe Stringer wrote: > [..] > > > > I haven't got as far as UDP yet, but I didn't see any need for a > > dependency on netfilter. > > I'd be curious to see what you did. My experience, even for TCP is > the socket(transparent/tproxy) lookup code (to set skb->sk either > listening or established) is entangled in > CONFIG_NETFILTER_SOMETHING_OR_OTHER. You have to rip it out of > there (in the tproxy tc action into that code). Only then can you > compile out netfilter. > I didnt bother to rip out code for udp case. > i.e if you needed udp to work with the tc action, > youd have to turn on NF. But that was because we had > no need for udp transparent proxying. > IOW: > There is really no reason, afaik, for tproxy code to only be > accessed if netfilter is compiled in. Not sure i made sense. Oh, I see. Between the existing bpf_skc_lookup_tcp() and bpf_sk_lookup_tcp() helpers in BPF, plus a new bpf_sk_assign() helper and a little bit of lookup code using the appropriate tproxy ports etc. from the BPF side, I was able to get it working. One could imagine perhaps wrapping all this logic up in a higher level "bpf_sk_lookup_tproxy()" helper call or similar, but I didn't go that direction given that the BPF socket primitives seemed to provide the necessary functionality in a more generic manner.
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index ed97724c5e33..16aea980318a 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -500,8 +500,6 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net) memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); IPCB(skb)->iif = skb->skb_iif; - /* Must drop socket now because of tproxy. */ - skb_orphan(skb); return skb;