bnx2x: Prevent load reordering in tx completion processing
diff mbox series

Message ID 1563226910-21660-1-git-send-email-brking@linux.vnet.ibm.com
State Accepted
Delegated to: David Miller
Headers show
Series
  • bnx2x: Prevent load reordering in tx completion processing
Related show

Commit Message

Brian King July 15, 2019, 9:41 p.m. UTC
This patch fixes an issue seen on Power systems with bnx2x which results
in the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb
pointer getting loaded in bnx2x_free_tx_pkt prior to the hw_cons
load in bnx2x_tx_int. Adding a read memory barrier resolves the issue.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

David Miller July 17, 2019, 7 p.m. UTC | #1
From: Brian King <brking@linux.vnet.ibm.com>
Date: Mon, 15 Jul 2019 16:41:50 -0500

> This patch fixes an issue seen on Power systems with bnx2x which results
> in the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb
> pointer getting loaded in bnx2x_free_tx_pkt prior to the hw_cons
> load in bnx2x_tx_int. Adding a read memory barrier resolves the issue.
> 
> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>

Marvell folks, please review.
Manish Chopra July 18, 2019, 10:12 a.m. UTC | #2
> -----Original Message-----
> From: Brian King <brking@linux.vnet.ibm.com>
> Sent: Tuesday, July 16, 2019 3:12 AM
> To: GR-everest-linux-l2 <GR-everest-linux-l2@marvell.com>
> Cc: Sudarsana Reddy Kalluru <skalluru@marvell.com>; Ariel Elior
> <aelior@marvell.com>; netdev@vger.kernel.org; Brian King
> <brking@linux.vnet.ibm.com>
> Subject: [EXT] [PATCH] bnx2x: Prevent load reordering in tx completion
> processing
> 
> External Email
> 
> ----------------------------------------------------------------------
> This patch fixes an issue seen on Power systems with bnx2x which results in
> the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb pointer
> getting loaded in bnx2x_free_tx_pkt prior to the hw_cons load in
> bnx2x_tx_int. Adding a read memory barrier resolves the issue.
> 
> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index 656ed80..e2be5a6 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -285,6 +285,9 @@ int bnx2x_tx_int(struct bnx2x *bp, struct
> bnx2x_fp_txdata *txdata)
>  	hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
>  	sw_cons = txdata->tx_pkt_cons;
> 
> +	/* Ensure subsequent loads occur after hw_cons */
> +	smp_rmb();
> +
>  	while (sw_cons != hw_cons) {
>  		u16 pkt_cons;
> 
> --
> 1.8.3.1

Could you please explain a bit in detail what could have caused skb to NULL exactly ?
Curious that if skb would have been NULL for some reason it did not cause NULL pointer dereference in bnx2x_free_tx_pkt() on below call -

prefetch(&skb->end);

Which is prior to the said WARN_ON(!skb) in bnx2x_free_tx_pkt().
Brian King July 18, 2019, 6:26 p.m. UTC | #3
On 7/18/19 5:12 AM, Manish Chopra wrote:
>> -----Original Message-----
>> From: Brian King <brking@linux.vnet.ibm.com>
>> Sent: Tuesday, July 16, 2019 3:12 AM
>> To: GR-everest-linux-l2 <GR-everest-linux-l2@marvell.com>
>> Cc: Sudarsana Reddy Kalluru <skalluru@marvell.com>; Ariel Elior
>> <aelior@marvell.com>; netdev@vger.kernel.org; Brian King
>> <brking@linux.vnet.ibm.com>
>> Subject: [EXT] [PATCH] bnx2x: Prevent load reordering in tx completion
>> processing
>>
>> External Email
>>
>> ----------------------------------------------------------------------
>> This patch fixes an issue seen on Power systems with bnx2x which results in
>> the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb pointer
>> getting loaded in bnx2x_free_tx_pkt prior to the hw_cons load in
>> bnx2x_tx_int. Adding a read memory barrier resolves the issue.
>>
>> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
>> ---
>>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
>> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
>> index 656ed80..e2be5a6 100644
>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
>> @@ -285,6 +285,9 @@ int bnx2x_tx_int(struct bnx2x *bp, struct
>> bnx2x_fp_txdata *txdata)
>>  	hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
>>  	sw_cons = txdata->tx_pkt_cons;
>>
>> +	/* Ensure subsequent loads occur after hw_cons */
>> +	smp_rmb();
>> +
>>  	while (sw_cons != hw_cons) {
>>  		u16 pkt_cons;
>>
>> --
>> 1.8.3.1
> 
> Could you please explain a bit in detail what could have caused skb to NULL exactly ?
> Curious that if skb would have been NULL for some reason it did not cause NULL pointer dereference in bnx2x_free_tx_pkt() on below call -
> 
> prefetch(&skb->end);
> 
> Which is prior to the said WARN_ON(!skb) in bnx2x_free_tx_pkt().

Right. In this case, that would end up passing an invalid address to prefetch. On a
Power processor, that turns into a dcbt instruction (data cache block touch), which
is a hint to the process that the subsequent code may access that data cache block.
Passing an invalid address to dcbt causes no harm. I just built a userspace program
to validate that doing something very similar to what is happening here, and the dcbt
executed with no errors, no segfault to the userspace process.

This is the scenario I think is occurring. 

CPU[0]
bnx2x_start_xmit
[1] tx_buf->skb = skb; /* store skb pointer */
[2] ...
[3] wmb(); 
[4] DOORBELL_RELAXED

CPU[1]
bnx2x_tx_int
[5]  hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
[6]  sw_cons = txdata->tx_pkt_cons;
[7]  while (sw_cons != hw_cons) {
[8]  pkt_cons = TX_BD(sw_cons);
[9]  bnx2x_free_tx_pkt
[10]  tx_buf = &txdata->tx_buf_ring[pkt_cons];
[11]  skb = tx_buf->skb;
[12]  prefetch(&skb->end);
[13]  ...
[14]  WARN_ON(!skb);


On CPU0 we are in the process of sending a TX buffer to the adapter. We have a wmb at [3]
to ensure that all stores to cacheable storage are coherent with respect to the
non-cacheable store at [4] to tell the adapter about the new TX buffer.

On CPU1, we have a potential race condition if the processor is aggressively reordering
loads. If we find ourselves in bnx2x_tx_int, while still in bn2x_start_xmit on CPU0,
its possible the processor could begin speculatively executing well into [11]. Since there
is no read barrier of any sort to tell the processor that the load of the skb pointer
cannot happen until the load of hw_cons has occurred, we could end up speculatively
loading the skb pointer before the write in [1] has completed with respect to CPU1. 

Adding the smp_rmb between 6 and 7 ensures that the load at 5 occurs prior to the
load at 11.

This was reproducing consistently on a 16 socket Power 9 system built of four, four socket
nodes, with 10 bnx2x adapters and 26TB of memory. The NUMA effects can get to be exaggerated
a bit on these systems. With this patch, the system runs without issues. We've now been able to
run the workload multiple times with no issues.

Thanks,

Brian
David Miller July 18, 2019, 11:30 p.m. UTC | #4
From: Brian King <brking@linux.vnet.ibm.com>
Date: Mon, 15 Jul 2019 16:41:50 -0500

> This patch fixes an issue seen on Power systems with bnx2x which results
> in the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb
> pointer getting loaded in bnx2x_free_tx_pkt prior to the hw_cons
> load in bnx2x_tx_int. Adding a read memory barrier resolves the issue.
> 
> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>

I agree with Brian's explanation of how the reordering can happen, and why
the crash doesn't show up in the prefetch() call.

I'll give the Marvell folks one more day to give a proper ACK.

Thanks.
David Miller July 21, 2019, 7:29 p.m. UTC | #5
From: Brian King <brking@linux.vnet.ibm.com>
Date: Mon, 15 Jul 2019 16:41:50 -0500

> This patch fixes an issue seen on Power systems with bnx2x which results
> in the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb
> pointer getting loaded in bnx2x_free_tx_pkt prior to the hw_cons
> load in bnx2x_tx_int. Adding a read memory barrier resolves the issue.
> 
> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>

Applied and queued up for -stable.

Patch
diff mbox series

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 656ed80..e2be5a6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -285,6 +285,9 @@  int bnx2x_tx_int(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata)
 	hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
 	sw_cons = txdata->tx_pkt_cons;
 
+	/* Ensure subsequent loads occur after hw_cons */
+	smp_rmb();
+
 	while (sw_cons != hw_cons) {
 		u16 pkt_cons;