Message ID | 332514c4-8391-4eae-ad63-fa02c19cecc3@k2.cloud |
---|---|
State | Not Applicable |
Delegated to: | Ilya Maximets |
Headers | show |
Series | [ovs-dev] STT kmod kernel panic investigated. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | fail | apply and check: fail |
On 11/7/24 09:26, Aleksandr Smirnov (K2 Cloud) wrote: > Hello, > > We got kernel crash in STT module in our could environment. I have made > root cause analysis and I happy to provide RCA report here: Thanks for the detailed analysis! This may be very useful for people still using STT for whatever reason. However, at this point in time, I don't think it makes a lot of sense to apply any changes to the kernel module distributed within branch-2.17. It contains way more serious issues, fixes for which were not backported form the upstream kernel. And it is also deprecated for a long time. The 2.17 branch itself will reach end of life in February. I'm also planing on full deprecation of STT and LISP tunnels support in OVS, since there are no supported datapath implementations for these tunnels and STT was also rejected upstream. There is an STT implementation for the windows datapath, but it should probably be removed, as we're not really testing it and other datapaths will not get support for STT. Also, the protocol itself is not really friendly for the wild internet as it may break middleboxes expecting proper TCP (one of the reasons it was rejected). I'll send deprecation patches in coming weeks for review, so we can mark them in OVS 3.5 and fully remove support in OVS 3.6. Best regards, Ilya Maximets. > > Version reported: OVS v2.17.6 > > Kernel crash: > > [ 120.574175] ------------[ cut here ]------------ > [ 120.574178] kernel BUG at net/core/skbuff.c:122! > [ 120.574510] invalid opcode: 0000 [#1] SMP NOPTI > [ 120.574840] CPU: 4 PID: 40 Comm: ksoftirqd/4 Kdump: loaded Tainted: > G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1 > [ 120.575538] Hardware name: Dell Inc. PowerEdge R720/0X6FFV, BIOS > 2.5.2 01/28/2015 > [ 120.575916] RIP: 0010:skb_panic+0x4b/0x4d > [ 120.576288] Code: 00 00 50 8b 87 dc 00 00 00 50 8b 87 d8 00 00 00 50 > ff b7 e8 00 00 00 4c 8b 8f e0 00 00 00 48 c7 c7 f8 0a 1a 9d e8 58 10 95 > ff <0f> 0 > b 48 8b 14 24 48 c7 c1 c0 ea ee 9c e8 a3 ff ff ff 48 c7 c6 00 > [ 120.577132] RSP: 0018:ffffba2581b7f678 EFLAGS: 00010246 > [ 120.577553] RAX: 000000000000008b RBX: ffff9087064a4b00 RCX: > 0000000000000000 > [ 120.577989] RDX: 0000000000000000 RSI: ffff9086f789e698 RDI: > ffff9086f789e698 > [ 120.578428] RBP: 0000000000009ec2 R08: 0000000000000000 R09: > c0000000ffff7fff > [ 120.578877] R10: 0000000000000001 R11: ffffba2581b7f498 R12: > 0000000000002f1d > [ 120.579576] R13: 0000000000000c20 R14: 0000000000000008 R15: > ffff9085c999a400 > [ 120.580041] FS: 0000000000000000(0000) GS:ffff9086f7880000(0000) > knlGS:0000000000000000 > [ 120.580512] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 120.580971] CR2: 00007f0eb711e340 CR3: 00000003b4410006 CR4: > 00000000000606e0 > [ 120.581445] Call Trace: > [ 120.581906] skb_push.cold.98+0x10/0x10 > [ 120.582374] ovs_stt_xmit.cold.48+0x166/0x480 [openvswitch] > [ 120.582870] do_execute_actions+0xc51/0xc90 [openvswitch] > [ 120.583361] ? flow_lookup.isra.12+0x40/0xb0 [openvswitch] > > > Open vSwitch background to reproduce: > > 1. OVS action field in flow must have two 'output' commands executed on > the same packet and routed to different ports. This double output > results in input socket buffer will be cloned and used in subsequent > xmit calls. Here I provide part of code that will be executed in > do_execute_actions() > > case OVS_ACTION_ATTR_OUTPUT: { > int port = nla_get_u32(a); > struct sk_buff *clone; > > /* Every output action needs a separate clone > * of 'skb', In case the output action is the > * last action, cloning can be avoided. > */ > if (nla_is_last(a, rem)) { > do_output(dp, skb, port, key); > /* 'skb' has been used for output. > */ > return 0; > } > > clone = skb_clone(skb, GFP_ATOMIC); > if (clone) > do_output(dp, clone, port, key); > > 2. Input packet must be split into at least two fragments. > > 3. STT tunneling must be in use. > > > Open vSwitch kernel module background to reproduce. > > > 1. We must have input packet consisting two fragments: > Head part A (struct of sk_buff) and 2-nd part F (struct of sk_buff) > A->frag_list == F > > 2. We call B = skb_clone(A) so, we create clone of input sk_buff. Note, > that after clone second part of packet is not copied, thus A->flag_list > == B->frag_list > > 3. Call send action on both buffers, in fact, ovs_stt_xmit(A), > ovs_stt_xmit(B) > > 4. We get kernel crash. > > > Why it happens: > > STT code wants to prepend STT header in face of incoming packet. > For input fragmented packet it wants to add STT header to every > fragment. Thus, every fragment supplied with STT header cannot be > enitire network packet anymore but rather set of individual packets. > Therefore STT rebuilds input packet detaching fragments and turn them > into individual packets. Technically say, it moves A->frag_list pointer > to A->next pointer. Then STT header is prepended to every element in skb > chain, (STT+)A, (STT+)A->next -- which is F. > > For now a situation is not very cute but somehow correct. > > Then we invoke same steps for B. B->frag_list still have pointer to > F which already has STT header prepended. > > Running same way we fall into situation where B, B->next (==F) come > to the code that adds STT headers. As result we could potentially have > (STT+)B, (STT+STT+)B->next - which is F, so, second part of input packet > will be corrupted, but, hopefully (?), a second attempt to save STT > header alarmed memory availability guard code. > > What is proposed to do: > > Note that Geneve code does not process fragmented-case > individually. Logically say, we always have to prepend exactly ONE > tunnel header in front of entire packet, it does not matter how many > fragments it had on input. When such packet comes to output it shall be > reassembled and possibly fragmented again by ip code. Since I find no > reasons (except maybe historical reasons) to convert fragmented packet > to several individual STT packets I have made an attempt to just remove > STT packet defragmentation call, so, such like Geneve, STT adds only one > header in front of entire packet. After such change kernel crash gone > and all packets were transmitted correctly in out testing environment. > I append this mail with possible patch just in case you want to have a fix. > > Special attention required. > > STT code has installed NF hook that really need to reassemble > fragments to whole packet. I think a same or similar problem could > happen here under same conditions. I did not make efforts to investigate > this case. An only question I have is Do we really need to cooperate > with NF here? > > =============== PATCH Here ================== > > From d8543538dd08bc53c729c26d119d430dd16379f6 Mon Sep 17 00:00:00 2001 > From: Aleksandr Smirnov <alekssmirnov@k2.cloud> > Date: Fri, 1 Nov 2024 14:00:36 +0300 > Subject: [PATCH] stt: fix kernel panic skb_push > > --- > datapath/linux/compat/stt.c | 16 ---------------- > 1 file changed, 16 deletions(-) > > diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c > index 39a294764..7b1c7f3e7 100644 > --- a/datapath/linux/compat/stt.c > +++ b/datapath/linux/compat/stt.c > @@ -663,22 +663,6 @@ static struct sk_buff *push_stt_header(struct > sk_buff *head, __be64 tun_id, > { > struct sk_buff *skb; > > - if (skb_shinfo(head)->frag_list) { > - bool ipv4 = (l3_proto == htons(ETH_P_IP)); > - bool tcp = (l4_proto == IPPROTO_TCP); > - bool csum_partial = (head->ip_summed == CHECKSUM_PARTIAL); > - int l4_offset = skb_transport_offset(head); > - > - /* Need to call skb_orphan() to report currect true-size. > - * calling skb_orphan() in this layer is odd but SKB with > - * frag-list should not be associated with any socket, so > - * skb-orphan should be no-op. */ > - skb_orphan(head); > - if (unlikely(segment_skb(&head, csum_partial, > - ipv4, tcp, l4_offset))) > - goto error; > - } > - > for (skb = head; skb; skb = skb->next) { > if (__push_stt_header(skb, tun_id, s_port, d_port, saddr, dst, > l3_proto, l4_proto, dst_mtu))
diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c index 39a294764..7b1c7f3e7 100644 --- a/datapath/linux/compat/stt.c +++ b/datapath/linux/compat/stt.c @@ -663,22 +663,6 @@ static struct sk_buff *push_stt_header(struct sk_buff *head, __be64 tun_id, { struct sk_buff *skb; - if (skb_shinfo(head)->frag_list) { - bool ipv4 = (l3_proto == htons(ETH_P_IP)); - bool tcp = (l4_proto == IPPROTO_TCP); - bool csum_partial = (head->ip_summed == CHECKSUM_PARTIAL); - int l4_offset = skb_transport_offset(head); - - /* Need to call skb_orphan() to report currect true-size. - * calling skb_orphan() in this layer is odd but SKB with - * frag-list should not be associated with any socket, so - * skb-orphan should be no-op. */ - skb_orphan(head); - if (unlikely(segment_skb(&head, csum_partial, - ipv4, tcp, l4_offset))) - goto error; - } - for (skb = head; skb; skb = skb->next) { if (__push_stt_header(skb, tun_id, s_port, d_port, saddr, dst,
Hello, We got kernel crash in STT module in our could environment. I have made root cause analysis and I happy to provide RCA report here: Version reported: OVS v2.17.6 Kernel crash: [ 120.574175] ------------[ cut here ]------------ [ 120.574178] kernel BUG at net/core/skbuff.c:122! [ 120.574510] invalid opcode: 0000 [#1] SMP NOPTI [ 120.574840] CPU: 4 PID: 40 Comm: ksoftirqd/4 Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1 [ 120.575538] Hardware name: Dell Inc. PowerEdge R720/0X6FFV, BIOS 2.5.2 01/28/2015 [ 120.575916] RIP: 0010:skb_panic+0x4b/0x4d [ 120.576288] Code: 00 00 50 8b 87 dc 00 00 00 50 8b 87 d8 00 00 00 50 ff b7 e8 00 00 00 4c 8b 8f e0 00 00 00 48 c7 c7 f8 0a 1a 9d e8 58 10 95 ff <0f> 0 b 48 8b 14 24 48 c7 c1 c0 ea ee 9c e8 a3 ff ff ff 48 c7 c6 00 [ 120.577132] RSP: 0018:ffffba2581b7f678 EFLAGS: 00010246 [ 120.577553] RAX: 000000000000008b RBX: ffff9087064a4b00 RCX: 0000000000000000 [ 120.577989] RDX: 0000000000000000 RSI: ffff9086f789e698 RDI: ffff9086f789e698 [ 120.578428] RBP: 0000000000009ec2 R08: 0000000000000000 R09: c0000000ffff7fff [ 120.578877] R10: 0000000000000001 R11: ffffba2581b7f498 R12: 0000000000002f1d [ 120.579576] R13: 0000000000000c20 R14: 0000000000000008 R15: ffff9085c999a400 [ 120.580041] FS: 0000000000000000(0000) GS:ffff9086f7880000(0000) knlGS:0000000000000000 [ 120.580512] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 120.580971] CR2: 00007f0eb711e340 CR3: 00000003b4410006 CR4: 00000000000606e0 [ 120.581445] Call Trace: [ 120.581906] skb_push.cold.98+0x10/0x10 [ 120.582374] ovs_stt_xmit.cold.48+0x166/0x480 [openvswitch] [ 120.582870] do_execute_actions+0xc51/0xc90 [openvswitch] [ 120.583361] ? flow_lookup.isra.12+0x40/0xb0 [openvswitch] Open vSwitch background to reproduce: 1. OVS action field in flow must have two 'output' commands executed on the same packet and routed to different ports. This double output results in input socket buffer will be cloned and used in subsequent xmit calls. Here I provide part of code that will be executed in do_execute_actions() case OVS_ACTION_ATTR_OUTPUT: { int port = nla_get_u32(a); struct sk_buff *clone; /* Every output action needs a separate clone * of 'skb', In case the output action is the * last action, cloning can be avoided. */ if (nla_is_last(a, rem)) { do_output(dp, skb, port, key); /* 'skb' has been used for output. */ return 0; } clone = skb_clone(skb, GFP_ATOMIC); if (clone) do_output(dp, clone, port, key); 2. Input packet must be split into at least two fragments. 3. STT tunneling must be in use. Open vSwitch kernel module background to reproduce. 1. We must have input packet consisting two fragments: Head part A (struct of sk_buff) and 2-nd part F (struct of sk_buff) A->frag_list == F 2. We call B = skb_clone(A) so, we create clone of input sk_buff. Note, that after clone second part of packet is not copied, thus A->flag_list == B->frag_list 3. Call send action on both buffers, in fact, ovs_stt_xmit(A), ovs_stt_xmit(B) 4. We get kernel crash. Why it happens: STT code wants to prepend STT header in face of incoming packet. For input fragmented packet it wants to add STT header to every fragment. Thus, every fragment supplied with STT header cannot be enitire network packet anymore but rather set of individual packets. Therefore STT rebuilds input packet detaching fragments and turn them into individual packets. Technically say, it moves A->frag_list pointer to A->next pointer. Then STT header is prepended to every element in skb chain, (STT+)A, (STT+)A->next -- which is F. For now a situation is not very cute but somehow correct. Then we invoke same steps for B. B->frag_list still have pointer to F which already has STT header prepended. Running same way we fall into situation where B, B->next (==F) come to the code that adds STT headers. As result we could potentially have (STT+)B, (STT+STT+)B->next - which is F, so, second part of input packet will be corrupted, but, hopefully (?), a second attempt to save STT header alarmed memory availability guard code. What is proposed to do: Note that Geneve code does not process fragmented-case individually. Logically say, we always have to prepend exactly ONE tunnel header in front of entire packet, it does not matter how many fragments it had on input. When such packet comes to output it shall be reassembled and possibly fragmented again by ip code. Since I find no reasons (except maybe historical reasons) to convert fragmented packet to several individual STT packets I have made an attempt to just remove STT packet defragmentation call, so, such like Geneve, STT adds only one header in front of entire packet. After such change kernel crash gone and all packets were transmitted correctly in out testing environment. I append this mail with possible patch just in case you want to have a fix. Special attention required. STT code has installed NF hook that really need to reassemble fragments to whole packet. I think a same or similar problem could happen here under same conditions. I did not make efforts to investigate this case. An only question I have is Do we really need to cooperate with NF here? =============== PATCH Here ================== From d8543538dd08bc53c729c26d119d430dd16379f6 Mon Sep 17 00:00:00 2001 From: Aleksandr Smirnov <alekssmirnov@k2.cloud> Date: Fri, 1 Nov 2024 14:00:36 +0300 Subject: [PATCH] stt: fix kernel panic skb_push --- datapath/linux/compat/stt.c | 16 ---------------- 1 file changed, 16 deletions(-) l3_proto, l4_proto, dst_mtu))