
aquantia: Reserve space when allocating an SKB

Message ID CY4PR1001MB23118EE23F7F5196817B8B2EE8E10@CY4PR1001MB2311.namprd10.prod.outlook.com
State Superseded
Series: aquantia: Reserve space when allocating an SKB

Commit Message

Ramsay, Lincoln Nov. 18, 2020, 1:52 a.m. UTC
When performing IPv6 forwarding, there is an expectation that SKBs
will have some headroom. When forwarding a packet from the aquantia
driver, this does not always happen, triggering a kernel warning.

It was observed that napi_alloc_skb and other ethernet drivers
reserve (NET_SKB_PAD + NET_IP_ALIGN) bytes in new SKBs. Do this
when calling build_skb as well.

Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
---

We have an Aquantia 10G ethernet interface in one of our devices. While testing a new feature, we discovered a problem with it. The problem only shows up in a very specific situation, however.

We are using firewalld as a frontend to nftables.
It sets up port forwarding (e.g. incoming port 5022 -> other_machine:22).
We also use masquerading on the outgoing packet, although I'm not sure this is relevant to the issue.
IPv4 works fine, IPv6 is a problem.
The bug is triggered by trying to hit this forwarded port (ssh -p 5022 addr). It is 100% reproducible.

The problem is that we get a kernel warning. It is triggered by this line in neighbour.h:
    if (WARN_ON_ONCE(skb_headroom(skb) < hh_alen)) {

It seems that skb_headroom is only 14, when it is expected to be >= 16.
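
For reference, here is the arithmetic behind those two numbers as I understand it (a standalone illustration mirroring the relevant constants, not kernel code):

#include <stdio.h>

#define ETH_HLEN	14	/* Ethernet header length (include/uapi/linux/if_ether.h) */
#define HH_DATA_MOD	16	/* cached L2 headers are stored 16-byte aligned (include/net/neighbour.h) */
#define HH_DATA_ALIGN(len)	(((len) + (HH_DATA_MOD - 1)) & ~(HH_DATA_MOD - 1))

int main(void)
{
	/* An skb built by build_skb() starts with 0 bytes of headroom;
	 * pulling the 14-byte Ethernet header on receive leaves exactly
	 * 14 bytes in front of skb->data.
	 */
	unsigned int headroom = ETH_HLEN;

	/* The forwarding path pushes the cached L2 header using its
	 * 16-byte-aligned length.
	 */
	unsigned int hh_alen = HH_DATA_ALIGN(ETH_HLEN);

	printf("headroom=%u, needed=%u -> %s\n", headroom, hh_alen,
	       headroom < hh_alen ? "WARN_ON_ONCE fires" : "ok");
	return 0;
}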

2020-10-19 21:24:24 DEBUG   [console] ------------[ cut here ]------------
2020-10-19 21:24:24 DEBUG   [console] WARNING: CPU: 3 PID: 0 at include/net/neighbour.h:493 ip6_finish_output2+0x538/0x580
2020-10-19 21:24:24 DEBUG   [console] Modules linked in: xt_addrtype xt_MASQUERADE iptable_filter iptable_nat ip6table_raw ip6_tables xt_CT xt_tcpudp iptable_raw ip_tables nf_nat_tftp nft_nat nft_masq nft_objref nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_chain_nat nf_nat xfrm_user nf_conntrack_tftp nf_tables_set x_tables nft_ct nf_tables nfnetlink amd_spirom_nor(O) spi_nor(O) mtd(O) atlantic nct5104_wdt(O) gpio_amd(O) nct7491(O) sch_fq_codel tun qmi_wwan usbnet mii qcserial usb_wwan qcaux nsh nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 i2c_dev cdc_wdm br_netfilter bridge stp llc [last unloaded: nft_reject]
2020-10-19 21:24:24 DEBUG   [console] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G           O      5.4.65-og #1
2020-10-19 21:24:24 DEBUG   [console] RIP: 0010:ip6_finish_output2+0x538/0x580
2020-10-19 21:24:24 DEBUG   [console] Code: 87 e9 fc ff ff 44 89 fa 48 89 74 24 20 48 29 d7 e8 2d 4f 0c 00 48 8b 74 24 20 e9 cf fc ff ff 41 bf 10 00 00 00 e9 c4 fc ff ff <0f> 0b 4c 89 ef 41 bc 01 00 00 00 e8 d8 89 f0 ff e9 ee fc ff ff e8
2020-10-19 21:24:24 DEBUG   [console] RSP: 0018:ffffac2040114ab0 EFLAGS: 00010212
2020-10-19 21:24:24 DEBUG   [console] RAX: ffff9c041a0bf00e RBX: 000000000000000e RCX: ffff9c041a0bf00e
2020-10-19 21:24:24 DEBUG   [console] RDX: 000000000000000e RSI: ffff9c03ddf606c8 RDI: 0000000000000000
2020-10-19 21:24:24 DEBUG   [console] RBP: ffffac2040114b38 R08: 00000000f2000000 R09: 0000000002ec5955
2020-10-19 21:24:24 DEBUG   [console] R10: ffff9c041e57a440 R11: 000000000000000a R12: ffff9c03ddf60600
2020-10-19 21:24:24 DEBUG   [console] R13: ffff9c03dcf24800 R14: 0000000000000000 R15: 0000000000000010
2020-10-19 21:24:24 DEBUG   [console] FS:  0000000000000000(0000) GS:ffff9c0426b80000(0000) knlGS:0000000000000000
2020-10-19 21:24:24 DEBUG   [console] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-10-19 21:24:24 DEBUG   [console] CR2: 0000000000a0b4d8 CR3: 0000000222054000 CR4: 00000000000406e0
2020-10-19 21:24:24 DEBUG   [console] Call Trace:
2020-10-19 21:24:24 DEBUG   [console]  <IRQ>
2020-10-19 21:24:24 DEBUG   [console]  ? ipv6_confirm+0x85/0xf0 [nf_conntrack]
2020-10-19 21:24:24 DEBUG   [console]  ip6_output+0x67/0x130
2020-10-19 21:24:24 DEBUG   [console]  ? __ip6_finish_output+0x110/0x110
2020-10-19 21:24:24 DEBUG   [console]  ip6_forward+0x582/0x920
2020-10-19 21:24:24 DEBUG   [console]  ? ip6_frag_init+0x40/0x40
2020-10-19 21:24:24 DEBUG   [console]  ip6_sublist_rcv_finish+0x33/0x50
2020-10-19 21:24:24 DEBUG   [console]  ip6_sublist_rcv+0x212/0x240
2020-10-19 21:24:24 DEBUG   [console]  ? ip6_rcv_finish_core.isra.0+0xc0/0xc0
2020-10-19 21:24:24 DEBUG   [console]  ipv6_list_rcv+0x116/0x140
2020-10-19 21:24:24 DEBUG   [console]  __netif_receive_skb_list_core+0x1b1/0x260
2020-10-19 21:24:24 DEBUG   [console]  netif_receive_skb_list_internal+0x1ba/0x2d0
2020-10-19 21:24:24 DEBUG   [console]  ? napi_gro_receive+0x50/0x90
2020-10-19 21:24:24 DEBUG   [console]  gro_normal_list.part.0+0x14/0x30
2020-10-19 21:24:24 DEBUG   [console]  napi_complete_done+0x81/0x100
2020-10-19 21:24:24 DEBUG   [console]  aq_vec_poll+0x166/0x190 [atlantic]
2020-10-19 21:24:24 DEBUG   [console]  net_rx_action+0x12b/0x2f0
2020-10-19 21:24:24 DEBUG   [console]  __do_softirq+0xd1/0x213
2020-10-19 21:24:24 DEBUG   [console]  irq_exit+0xc8/0xd0
2020-10-19 21:24:24 DEBUG   [console]  do_IRQ+0x48/0xd0
2020-10-19 21:24:24 DEBUG   [console]  common_interrupt+0xf/0xf
2020-10-19 21:24:24 DEBUG   [console]  </IRQ>
2020-10-19 21:24:24 DEBUG   [console] ---[ end trace c1cba758301d342f ]---

After much hunting and debugging, I think I have figured out the issue here.

aq_ring.c has this code (edited slightly for brevity):

if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
    skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
    skb_put(skb, buff->len);
} else {
    skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);

There is a significant difference between the SKB produced by these 2 code paths. When napi_alloc_skb creates an SKB, there is a certain amount of headroom reserved. The same pattern appears to be used in all of the other ethernet drivers I have looked at. However, this is not done in the build_skb codepath.
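
To make the difference concrete, here is a minimal sketch of the two paths and the headroom each one leaves (illustration only, not proposed driver code; the function name and parameters are made up):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch: how much headroom each RX allocation path leaves. */
static struct sk_buff *rx_headroom_sketch(struct napi_struct *napi, bool fast,
					  void *hw_buf, unsigned int buf_size,
					  unsigned int pkt_len)
{
	struct sk_buff *skb;

	if (!fast) {
		/* napi_alloc_skb() reserves NET_SKB_PAD + NET_IP_ALIGN bytes
		 * internally, so skb_headroom(skb) is already >= 32 before
		 * any packet bytes are copied in.
		 */
		return napi_alloc_skb(napi, pkt_len);
	}

	/* build_skb() wraps the DMA buffer as-is: skb->data points at the
	 * first byte the HW wrote, so skb_headroom(skb) is 0 here.  After
	 * eth_type_trans() pulls the 14-byte Ethernet header, only 14 bytes
	 * of headroom remain - less than the 16 the IPv6 forwarding path
	 * wants.
	 */
	skb = build_skb(hw_buf, buf_size);
	if (skb)
		skb_put(skb, pkt_len);
	return skb;
}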

I believe that this is the ultimate cause of the warning we are seeing.

I have written a patch that reserves some headroom in the SKB. The logic is inspired by the igb driver. This was originally developed against Linux 5.4, then migrated to Linux 5.8. It has been tested on our product against both versions. The patch below was migrated to Linux master (some context changed, but otherwise it applied cleanly).

--
2.17.1

Comments

Igor Russkikh Nov. 18, 2020, 2:02 p.m. UTC | #1
Hi Ramsay,


> When performing IPv6 forwarding, there is an expectation that SKBs
> will have some headroom. When forwarding a packet from the aquantia
> driver, this does not always happen, triggering a kernel warning.
> 
> It was observed that napi_alloc_skb and other ethernet drivers
> reserve (NET_SKB_PAD + NET_IP_ALIGN) bytes in new SKBs. Do this
> when calling build_skb as well.

Thanks for the analysis, but I think the solution you propose is invalid.


> After much hunting and debugging, I think I have figured out the issue
> here.
> 
> aq_ring.c has this code (edited slightly for brevity):
> 
> if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
>     skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
>     skb_put(skb, buff->len);
> } else {
>     skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> 
> There is a significant difference between the SKB produced by these 2 code
> paths. When napi_alloc_skb creates an SKB, there is a certain amount of
> headroom reserved. The same pattern appears to be used in all of the other
> ethernet drivers I have looked at. However, this is not done in the
> build_skb codepath.

...

> -	rxpage->pg_off = 0;
> +	rxpage->pg_off = AQ_SKB_PAD;
> 
>  	return 0;
> 
> @@ -67,8 +69,8 @@ static int aq_get_rxpages(struct aq_ring_s *self, struct aq_ring_buff_s *rxbuf,
>  		/* One means ring is the only user and can reuse */
>  		if (page_ref_count(rxbuf->rxdata.page) > 1) {
>  			/* Try reuse buffer */
> -			rxbuf->rxdata.pg_off += AQ_CFG_RX_FRAME_MAX;
> -			if (rxbuf->rxdata.pg_off + AQ_CFG_RX_FRAME_MAX <=
> +			rxbuf->rxdata.pg_off += AQ_CFG_RX_FRAME_MAX + AQ_SKB_PAD;
> +			if (rxbuf->rxdata.pg_off + AQ_CFG_RX_FRAME_MAX + AQ_SKB_PAD <=
>  				(PAGE_SIZE << order)) {


Here I understand your intention. You are trying to "offset" the placement of
the packet data, and then restore it when constructing the SKB.

The problem, however, is that the hardware is programmed with a fixed
descriptor size for placement, and it is equal to AQ_CFG_RX_FRAME_MAX (2K by
default).

This means the HW will write up to 2K of packet data into a single descriptor
and then, if that is not enough, move on to the next descriptor's data.

With your solution, packets larger than (AQ_CFG_RX_FRAME_MAX - AQ_SKB_PAD) and
up to AQ_CFG_RX_FRAME_MAX in size will overwrite the area of the page they are
designated to. Ultimately, the HW will corrupt memory in the next page.

The limitation here is that we can't tell the HW to use a granularity of less than 1K.

I think the only acceptable solution here would be to remove that optimized
build_skb path and keep only napi_alloc_skb. Or we could keep it under some
configuration condition (which is also not good).

So far I can't imagine any other good solution.

The HW also supports header split - this could be used to keep the optimized
path, but that's not an easy thing to implement.

Regards,
  Igor
Ramsay, Lincoln Nov. 19, 2020, 12:14 a.m. UTC | #2
Hi Igor,

> Here I understand your intention. You are trying to "offset" the placement of
> the packet data, and then restore it when constructing the SKB.

Originally, I just added the skb_reserve call, but that broke everything. When I looked at what the igb driver was doing, this approach seemed reasonable, but I wasn't sure it would work.

> The problem, however, is that the hardware is programmed with a fixed
> descriptor size for placement, and it is equal to AQ_CFG_RX_FRAME_MAX (2K by
> default).
> 
> This means the HW will write up to 2K of packet data into a single descriptor
> and then, if that is not enough, move on to the next descriptor's data.
> 
> With your solution, packets larger than (AQ_CFG_RX_FRAME_MAX - AQ_SKB_PAD) and
> up to AQ_CFG_RX_FRAME_MAX in size will overwrite the area of the page they are
> designated to. Ultimately, the HW will corrupt memory in the next page.

Yeah... this is the kind of thing I was worried about. It seemed to me that the SKB was being built around a hardware buffer rather than around heap-allocated memory. I just hoped that the rx_off value would somehow make it work.

The code in aq_get_rxpages seems to suggest that multiple frames can fit in a rxpage, so maybe the logic there prevents overwriting? (at the expense of not fitting as many frames into the page before it has to get a new one?)

I didn't notice any issues when I was testing, but apart from port forwarding ssh (which is tiny) and some copying of files (probably not even close to saturating the link), there's not a huge network load placed on the device. I guess it's entirely possible that an overwrite problem would only show up under heavy load? (i.e. more, and larger, amounts of data in flight through the kernel at once)

> I think the only acceptable solution here would be to remove that optimized
> build_skb path and keep only napi_alloc_skb. Or we could keep it under some
> configuration condition (which is also not good).

I'll attempt to confirm that this works too, at least for our tests :)

Lincoln
Ramsay, Lincoln Nov. 19, 2020, 5:19 a.m. UTC | #3
Hi Igor,

> > With your solution, packets larger than (AQ_CFG_RX_FRAME_MAX - AQ_SKB_PAD)
> > and up to AQ_CFG_RX_FRAME_MAX in size will overwrite the area of the page
> > they are designated to. Ultimately, the HW will corrupt memory in the next
> > page.
> 
> The code in aq_get_rxpages seems to suggest that multiple frames can fit in a rxpage, so
> maybe the logic there prevents overwriting? (at the expense of not fitting as many
> frames into the page before it has to get a new one?)

I am not terribly experienced with such low-level code, but it looks to me like a page is allocated (4k) and then DMA mapped to the device. Frames are 2k, so only 2 can fit into a single mapped page. If the mapping was done with an offset of AQ_SKB_PAD, that'd leave space for the SKB headroom, but it would mean only a single frame could fit into that mapped page. Since this is the "I only have 1 fragment less than 2k" code path, maybe that's ok? I'm not sure if the hardware side can know that it's only allowed to write 1 frame into the buffer...
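
As a sanity check of that reading, here is a standalone back-of-the-envelope version of the reuse test from my patch, with assumed defaults (order-0 page, PAGE_SIZE 4096, AQ_CFG_RX_FRAME_MAX 2048, AQ_SKB_PAD 64 on a 64-byte cache line machine); illustration only:

#include <stdio.h>

int main(void)
{
	const unsigned int page_size = 4096;	/* PAGE_SIZE << order, order 0 */
	const unsigned int frame_max = 2048;	/* AQ_CFG_RX_FRAME_MAX */
	const unsigned int skb_pad   = 64;	/* AQ_SKB_PAD (assumed) */
	unsigned int pg_off = skb_pad;		/* first buffer in the page */

	/* The "Try reuse buffer" branch from the patched aq_get_rxpages() */
	pg_off += frame_max + skb_pad;
	if (pg_off + frame_max + skb_pad <= page_size)
		printf("second frame fits: pg_off=%u\n", pg_off);
	else
		printf("no reuse: %u + %u + %u > %u, one frame per page\n",
		       pg_off, frame_max, skb_pad, page_size);
	return 0;
}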

I noticed on my device that aq_ring_rx_clean always hits the "fast" codepath. I guess that just means I am not pushing it hard enough?

> > I think the only acceptable solution here would be to remove that optimized
> > build_skb path and keep only napi_alloc_skb. Or we could keep it under some
> > configuration condition (which is also not good).
> 
> I'll attempt to confirm that this works too, at least for our tests :)

FWIW: This does work. I notice that this has to copy data into an allocated skb rather than building the skb around the data. I guess avoiding that copy is why the fast path existed in the first place?
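
For clarity, the rough shape of that copy-only path is something like this (a sketch only, not the actual driver code; the helper name is made up):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch: always allocate via napi_alloc_skb() and copy the packet out of
 * the DMA buffer, instead of building the skb around the buffer.
 */
static struct sk_buff *rx_copy_only_sketch(struct napi_struct *napi,
					   void *buf_vaddr, unsigned int len)
{
	struct sk_buff *skb;

	/* napi_alloc_skb() already reserves NET_SKB_PAD + NET_IP_ALIGN of
	 * headroom, so forwarded packets keep enough room for the L2 header.
	 */
	skb = napi_alloc_skb(napi, len);
	if (unlikely(!skb))
		return NULL;

	/* The cost: the packet bytes are copied rather than used in place. */
	skb_put_data(skb, buf_vaddr, len);
	return skb;
}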

Would you like me to post this patch (removing the fast path)?

Lincoln

Patch

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 4f913658eea4..57150e3d3257 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -16,6 +16,8 @@ 
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>

+#define AQ_SKB_PAD	(NET_SKB_PAD + NET_IP_ALIGN)
+
 static inline void aq_free_rxpage(struct aq_rxpage *rxpage, struct device *dev)
 {
 	unsigned int len = PAGE_SIZE << rxpage->order;
@@ -47,7 +49,7 @@  static int aq_get_rxpage(struct aq_rxpage *rxpage, unsigned int order,
 	rxpage->page = page;
 	rxpage->daddr = daddr;
 	rxpage->order = order;
-	rxpage->pg_off = 0;
+	rxpage->pg_off = AQ_SKB_PAD;

 	return 0;

@@ -67,8 +69,8 @@  static int aq_get_rxpages(struct aq_ring_s *self, struct aq_ring_buff_s *rxbuf,
 		/* One means ring is the only user and can reuse */
 		if (page_ref_count(rxbuf->rxdata.page) > 1) {
 			/* Try reuse buffer */
-			rxbuf->rxdata.pg_off += AQ_CFG_RX_FRAME_MAX;
-			if (rxbuf->rxdata.pg_off + AQ_CFG_RX_FRAME_MAX <=
+			rxbuf->rxdata.pg_off += AQ_CFG_RX_FRAME_MAX + AQ_SKB_PAD;
+			if (rxbuf->rxdata.pg_off + AQ_CFG_RX_FRAME_MAX + AQ_SKB_PAD <=
 				(PAGE_SIZE << order)) {
 				u64_stats_update_begin(&self->stats.rx.syncp);
 				self->stats.rx.pg_flips++;
@@ -84,7 +86,7 @@  static int aq_get_rxpages(struct aq_ring_s *self, struct aq_ring_buff_s *rxbuf,
 				u64_stats_update_end(&self->stats.rx.syncp);
 			}
 		} else {
-			rxbuf->rxdata.pg_off = 0;
+			rxbuf->rxdata.pg_off = AQ_SKB_PAD;
 			u64_stats_update_begin(&self->stats.rx.syncp);
 			self->stats.rx.pg_reuses++;
 			u64_stats_update_end(&self->stats.rx.syncp);
@@ -416,8 +418,8 @@  int aq_ring_rx_clean(struct aq_ring_s *self,
 		/* for single fragment packets use build_skb() */
 		if (buff->is_eop &&
 		    buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
-			skb = build_skb(aq_buf_vaddr(&buff->rxdata),
-					AQ_CFG_RX_FRAME_MAX);
+			skb = build_skb(aq_buf_vaddr(&buff->rxdata) - AQ_SKB_PAD,
+					AQ_CFG_RX_FRAME_MAX + AQ_SKB_PAD);
 			if (unlikely(!skb)) {
 				u64_stats_update_begin(&self->stats.rx.syncp);
 				self->stats.rx.skb_alloc_fails++;
@@ -425,6 +427,7 @@  int aq_ring_rx_clean(struct aq_ring_s *self,
 				err = -ENOMEM;
 				goto err_exit;
 			}
+			skb_reserve(skb, AQ_SKB_PAD);
 			if (is_ptp_ring)
 				buff->len -=
 					aq_ptp_extract_ts(self->aq_nic, skb,