Message ID | 20180723142524.24224-2-suresh.reddy@broadcom.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
Series | be2net: patch-set | expand |
From: Suresh Reddy <suresh.reddy@broadcom.com> Date: Mon, 23 Jul 2018 10:25:23 -0400 > Driver dumps tx_queue, tx_compl, pending SKBs information in tx_timeout. > This debug data used to idenfiy the cause of the time out. > > Also reset Lancer chip in tx_timeout. > > Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> The purpose of the tx timeout NDO operation is to do whatever is necessary to handle the TX timeout. Outputting debugging information is useful, but is secondary. I see that you do reset the Lancer, but that is far from what really needs to happen here. When you get a TX timeout, the hardware is not processing TX ring entries, nor signalling completion any longer. Therefore the only way to get things going again is to reset all of the TX side data structure and logic. This means shutting down the TX engine, freeing up all of the TX SKBs in the ring, resetting the TX ring software state, and then finally reprogramming the head/tail pointer registers and re-enabling TX DMA processing.
On Mon, Jul 23, 2018 at 11:53 PM, David Miller <davem@davemloft.net> wrote: > From: Suresh Reddy <suresh.reddy@broadcom.com> > Date: Mon, 23 Jul 2018 10:25:23 -0400 > >> Driver dumps tx_queue, tx_compl, pending SKBs information in tx_timeout. >> This debug data used to idenfiy the cause of the time out. >> >> Also reset Lancer chip in tx_timeout. >> >> Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> > > The purpose of the tx timeout NDO operation is to do whatever is > necessary to handle the TX timeout. > > Outputting debugging information is useful, but is secondary. > > I see that you do reset the Lancer, but that is far from what really > needs to happen here. > > When you get a TX timeout, the hardware is not processing TX ring > entries, nor signalling completion any longer. > > Therefore the only way to get things going again is to reset all of > the TX side data structure and logic. This means shutting down the TX > engine, freeing up all of the TX SKBs in the ring, resetting the TX > ring software state, and then finally reprogramming the head/tail > pointer registers and re-enabling TX DMA processing. The current patch does recover from a TX-timeout by resetting the chip itself that *includes* resetting the TX block. Freeing up TX SKBs and rings and re-creating/registering them is also being done. Now, this recovery is not supported in BE3 and Skyhawk NICs but is supported in Lancer NICs. That's why this patch performs recovery in the case of Lancer NICs but only gathers diagnostic information in the case of BE3 and Skyhawk NICs. Regards, Suresh.
On Wed, Jul 25, 2018 at 6:14 PM, Suresh Kumar Reddy Reddygari <suresh.reddy@broadcom.com> wrote: > On Mon, Jul 23, 2018 at 11:53 PM, David Miller <davem@davemloft.net> wrote: >> From: Suresh Reddy <suresh.reddy@broadcom.com> >> Date: Mon, 23 Jul 2018 10:25:23 -0400 >> >>> Driver dumps tx_queue, tx_compl, pending SKBs information in tx_timeout. >>> This debug data used to idenfiy the cause of the time out. >>> >>> Also reset Lancer chip in tx_timeout. >>> ... >> >> The purpose of the tx timeout NDO operation is to do whatever is >> necessary to handle the TX timeout. >> >> Outputting debugging information is useful, but is secondary. >> >> I see that you do reset the Lancer, but that is far from what really >> needs to happen here. >> >> When you get a TX timeout, the hardware is not processing TX ring >> entries, nor signalling completion any longer. >> >> Therefore the only way to get things going again is to reset all of >> the TX side data structure and logic. This means shutting down the TX >> engine, freeing up all of the TX SKBs in the ring, resetting the TX >> ring software state, and then finally reprogramming the head/tail >> pointer registers and re-enabling TX DMA processing. Hi David, I am clarifying again about the patch (Lancer reset) as I didnt see a reply from you after my clarification. > +static void be_tx_timeout(struct net_device *netdev) > +{ ... > + > + if (lancer_chip(adapter)) { > + dev_info(dev, "Initiating reset due to tx timeout\n"); > + dev_info(dev, "Resetting adapter\n"); > + status = lancer_physdev_ctrl(adapter, > + PHYSDEV_CONTROL_FW_RESET_MASK); This patch does recover from a TX-timeout by resetting the chip itself that *includes* resetting the TX block. Freeing up TX SKBs and rings and re-creating/registering them is also being done. Driver and Firmware does the following in chip reset path. 1. When driver sets the PHYSDEV_RESET_MASK bit, Lancer firmware goes into an error state and indicates this back to the driver via a bit in a doorbell register. 2. Driver detects this and calls be_err_recover(). be_close() and be_clear() are called in this error recovery path. a) In be_close(), we cleanup all pending TX queue entries and SKBs. b) In be_clear(), we destroy TX, RX and all other queues. 3. be_resume() is called after be_cleanup() in error recover path. In this routine, we create TX, RX queues and other resources which were destroyed in cleanup path. Now, this recovery is supported only on Lancer NICs (and not in BE3 and Skyhawk NICs). That's why this patch only gathers diagnostic information in the case of BE3 and Skyhawk NICs, but on Lancer NICs does the extra step of resetting and recovering. Please let me know if this makes sense. I send a v2 patch with this explanation in the commit message. Regards, Suresh.
From: Suresh Kumar Reddy Reddygari <suresh.reddy@broadcom.com> Date: Mon, 30 Jul 2018 16:23:12 +0530 > I am clarifying again about the patch (Lancer reset) as I didnt see a reply > from you after my clarification. Ok, please resubmit this patch. Thank you.
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c index 05e4c0b..580cdec 100644 --- a/drivers/net/ethernet/emulex/benet/be_main.c +++ b/drivers/net/ethernet/emulex/benet/be_main.c @@ -1412,6 +1412,83 @@ static netdev_tx_t be_xmit(struct sk_buff *skb, struct net_device *netdev) return NETDEV_TX_OK; } +static void be_tx_timeout(struct net_device *netdev) +{ + struct be_adapter *adapter = netdev_priv(netdev); + struct device *dev = &adapter->pdev->dev; + struct be_tx_obj *txo; + struct sk_buff *skb; + struct tcphdr *tcphdr; + struct udphdr *udphdr; + u32 *entry; + int status; + int i, j; + + for_all_tx_queues(adapter, txo, i) { + dev_info(dev, "TXQ Dump: %d H: %d T: %d used: %d, qid: 0x%x\n", + i, txo->q.head, txo->q.tail, + atomic_read(&txo->q.used), txo->q.id); + + entry = txo->q.dma_mem.va; + for (j = 0; j < TX_Q_LEN * 4; j += 4) { + if (entry[j] != 0 || entry[j + 1] != 0 || + entry[j + 2] != 0 || entry[j + 3] != 0) { + dev_info(dev, "Entry %d 0x%x 0x%x 0x%x 0x%x\n", + j, entry[j], entry[j + 1], + entry[j + 2], entry[j + 3]); + } + } + + entry = txo->cq.dma_mem.va; + dev_info(dev, "TXCQ Dump: %d H: %d T: %d used: %d\n", + i, txo->cq.head, txo->cq.tail, + atomic_read(&txo->cq.used)); + for (j = 0; j < TX_CQ_LEN * 4; j += 4) { + if (entry[j] != 0 || entry[j + 1] != 0 || + entry[j + 2] != 0 || entry[j + 3] != 0) { + dev_info(dev, "Entry %d 0x%x 0x%x 0x%x 0x%x\n", + j, entry[j], entry[j + 1], + entry[j + 2], entry[j + 3]); + } + } + + for (j = 0; j < TX_Q_LEN; j++) { + if (txo->sent_skb_list[j]) { + skb = txo->sent_skb_list[j]; + if (ip_hdr(skb)->protocol == IPPROTO_TCP) { + tcphdr = tcp_hdr(skb); + dev_info(dev, "TCP source port %d\n", + ntohs(tcphdr->source)); + dev_info(dev, "TCP dest port %d\n", + ntohs(tcphdr->dest)); + dev_info(dev, "TCP seqence num %d\n", + ntohs(tcphdr->seq)); + dev_info(dev, "TCP ack_seq %d\n", + ntohs(tcphdr->ack_seq)); + } else if (ip_hdr(skb)->protocol == + IPPROTO_UDP) { + udphdr = udp_hdr(skb); + dev_info(dev, "UDP source port %d\n", + ntohs(udphdr->source)); + dev_info(dev, "UDP dest port %d\n", + ntohs(udphdr->dest)); + } + dev_info(dev, "skb[%d] %p len %d proto 0x%x\n", + j, skb, skb->len, skb->protocol); + } + } + } + + if (lancer_chip(adapter)) { + dev_info(dev, "Initiating reset due to tx timeout\n"); + dev_info(dev, "Resetting adapter\n"); + status = lancer_physdev_ctrl(adapter, + PHYSDEV_CONTROL_FW_RESET_MASK); + if (status) + dev_err(dev, "Reset failed .. Reboot server\n"); + } +} + static inline bool be_in_all_promisc(struct be_adapter *adapter) { return (adapter->if_flags & BE_IF_FLAGS_ALL_PROMISCUOUS) == @@ -3274,7 +3351,7 @@ void be_detect_error(struct be_adapter *adapter) /* Do not log error messages if its a FW reset */ if (sliport_err1 == SLIPORT_ERROR_FW_RESET1 && sliport_err2 == SLIPORT_ERROR_FW_RESET2) { - dev_info(dev, "Firmware update in progress\n"); + dev_info(dev, "Reset is in progress\n"); } else { dev_err(dev, "Error detected in the card\n"); dev_err(dev, "ERR: sliport status 0x%x\n", @@ -5218,6 +5295,7 @@ static const struct net_device_ops be_netdev_ops = { .ndo_get_vf_config = be_get_vf_config, .ndo_set_vf_link_state = be_set_vf_link_state, .ndo_set_vf_spoofchk = be_set_vf_spoofchk, + .ndo_tx_timeout = be_tx_timeout, #ifdef CONFIG_NET_POLL_CONTROLLER .ndo_poll_controller = be_netpoll, #endif
Driver dumps tx_queue, tx_compl, pending SKBs information in tx_timeout. This debug data used to idenfiy the cause of the time out. Also reset Lancer chip in tx_timeout. Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> --- drivers/net/ethernet/emulex/benet/be_main.c | 80 ++++++++++++++++++++++++++++- 1 file changed, 79 insertions(+), 1 deletion(-)