Patchwork 2.6.29-rc3: tg3 dead after resume

login
register
mail settings
Submitter Matt Carlson
Date Jan. 30, 2009, 6:40 p.m.
Message ID <20090130184030.GA14933@xw6200.broadcom.net>
Download mbox | patch
Permalink /patch/21248/
State RFC
Delegated to: David Miller
Headers show

Comments

Matt Carlson - Jan. 30, 2009, 6:40 p.m.
On Thu, Jan 29, 2009 at 02:35:44PM -0800, Parag Warudkar wrote:
> 
> 
> On Thu, 29 Jan 2009, Matt Carlson wrote:
> 
> > FWIW, I can suspend and resume using the latest linux-2.6 kernel
> > on a machine with a similar chip here.  The problem doesn't seem to
> > affect all Broadcom devices.
> 
> It is failing for me on HP xw6600 workstation, if that helps in any way.
> 
> Parag

O.K.  Let's test some more assumptions.  Can you apply the following
patch and observe the system logs when the device is first loaded and
again after resume.  The patch looks at the pci command register to
verify that memory space IO is indeed enabled.  (It should be.)  This is
all that should be needed for MMIO to work.

If the PCI_COMMAND message doesn't match, then it means that the
PCI_COMMAND register isn't getting restored for some reason.  If they do
match, then something else in the system is not getting restored
correctly.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Parag Warudkar - Jan. 30, 2009, 10:50 p.m.
On Fri, 30 Jan 2009, Matt Carlson wrote:

> 
> If the PCI_COMMAND message doesn't match, then it means that the
> PCI_COMMAND register isn't getting restored for some reason.  If they do
> On Thu, Jan 29, 2009 at 02:35:44PM -0800, Parag Warudkar wrote:
> > 
> > 
> > On Thu, 29 Jan 2009, Matt Carlson wrote:
> > 
> > > FWIW, I can suspend and resume using the latest linux-2.6 kernel
> > > on a machine with a similar chip here.  The problem doesn't seem to
> > > affect all Broadcom devices.
> > 
> > It is failing for me on HP xw6600 workstation, if that helps in any way.
> > 
> > Parag
> 
> O.K.  Let's test some more assumptions.  Can you apply the following
> patch and observe the system logs when the device is first loaded and
> again after resume.  The patch looks at the pci command register to
> verify that memory space IO is indeed enabled.  (It should be.)  This is
> all that should be needed for MMIO to work.
> On Thu, Jan 29, 2009 at 02:35:44PM -0800, Parag Warudkar wrote:
> > 
> > 
> > On Thu, 29 Jan 2009, Matt Carlson wrote:

> O.K.  Let's test some more assumptions.  Can you apply the following
> patch and observe the system logs when the device is first loaded and
> again after resume.  The patch looks at the pci command register to
> verify that memory space IO is indeed enabled.  (It should be.)  This is
> all that should be needed for MMIO to work.
> match, then something else in the system is not getting restored
> correctly.
> 
Here is the output after applying the patch (fresh boot btw) -

[   29.698877] eth0: PCI_COMMAND reg = 0x406 (bit 1 is on)
[   29.698880] eth0: Reg value at offset 0x0 is 0x167b14e4
[   29.758169] ADDRCONF(NETDEV_UP): eth0: link is not ready
[   31.295087] tg3: eth0: Link is up at 100 Mbps, full duplex.
[   31.295090] tg3: eth0: Flow control is off for TX and off for RX.
[   31.297574] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   41.872007] eth0: no IPv6 routers present
^^^ Pre-Suspend

[  245.924484] eth0: PCI_COMMAND reg = 0x406 (bit 1 is on)
[  245.924487] eth0: Reg value at offset 0x0 is 0xffffffff
[  247.317971] tg3: eth0: No firmware running.
[  258.710634] ADDRCONF(NETDEV_UP): eth0: link is not ready
^^^ Post-Suspend

So it looks like the memory space IO is enabled before and after suspend.
The device/vendor id goes 0xffffffff after resume - just like before.
Does that one matter? (Firmware may be looking at it?) 

> 
> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> index 8b3f846..67bb29f 100644
> --- a/drivers/net/tg3.c
> +++ b/drivers/net/tg3.c
> @@ -7225,8 +7225,17 @@ static int tg3_reset_hw(struct tg3 *tp, int reset_phy)
>   */
>  static int tg3_init_hw(struct tg3 *tp, int reset_phy)
>  {
> +	u16 cmd;
> +
>  	tg3_switch_clocks(tp);
>  
> +	pci_read_config_word(tp->pdev, PCI_COMMAND, &cmd);
> +
> +	printk(KERN_NOTICE "%s: PCI_COMMAND reg = 0x%x (bit 1 is %s)\n",
> +	       tp->dev->name, cmd, (cmd & PCI_COMMAND_MEMORY) ? "on" : "off");
> +	printk(KERN_NOTICE "%s: Reg value at offset 0x0 is 0x%x\n",
> +	       tp->dev->name, tr32(0x0));
> +
>  	tw32(TG3PCI_MEM_WIN_BASE_ADDR, 0);
>  
>  	return tg3_reset_hw(tp, reset_phy);
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds - Jan. 30, 2009, 11:06 p.m.
On Fri, 30 Jan 2009, Parag Warudkar wrote:
> 
> [  245.924484] eth0: PCI_COMMAND reg = 0x406 (bit 1 is on)
> [  245.924487] eth0: Reg value at offset 0x0 is 0xffffffff
> [  247.317971] tg3: eth0: No firmware running.
> [  258.710634] ADDRCONF(NETDEV_UP): eth0: link is not ready
> ^^^ Post-Suspend
> 
> So it looks like the memory space IO is enabled before and after suspend.
> The device/vendor id goes 0xffffffff after resume - just like before.
> Does that one matter? (Firmware may be looking at it?) 

One thing strikes me - are there any bridges between the host (CPU) and 
that tg3 device?

Because we obviously have two people who say that their tg3 suspend/resume 
works fine, so the tg3 driver is obviously not _totally_ broken. So I'm 
wondering if there is something funny in between the CPU and the tg3, like 
a hotplug bridge that needs magic to wake up properly.

Because clearly the PCI config space addresses are working fine, but the 
thing is, while PCI config space accesses are routed by the device number 
(and the bridges notion of secondary bridging), the PCI memory space 
routing is based on address. So a PCI bridge can easily get one right (in 
fact, it's really hard to get config space accesses wrong without the 
bridges being _totally_ screwed up), while not routing the other at all.

So just do that "lspci -vvxxx" for the whole box, before and after, and 
send us the "before" and the "diff -u before after" thing, and maybe that 
shows something interesting. Because some bridge chip being confused would 
also explain why a total re-init of the whole tg3 chip by a driver unload 
and reload doesn't seem to help.

		Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds - Jan. 30, 2009, 11:33 p.m.
On Fri, 30 Jan 2009, Linus Torvalds wrote:
> 
> So just do that "lspci -vvxxx" for the whole box, before and after, and 
> send us the "before" and the "diff -u before after" thing, and maybe that 
> shows something interesting. Because some bridge chip being confused would 
> also explain why a total re-init of the whole tg3 chip by a driver unload 
> and reload doesn't seem to help.

It might also be instructive to see the same thing for a working kernel. I 
assume plain 2.6.28 works for you?

			Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 8b3f846..67bb29f 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -7225,8 +7225,17 @@  static int tg3_reset_hw(struct tg3 *tp, int reset_phy)
  */
 static int tg3_init_hw(struct tg3 *tp, int reset_phy)
 {
+	u16 cmd;
+
 	tg3_switch_clocks(tp);
 
+	pci_read_config_word(tp->pdev, PCI_COMMAND, &cmd);
+
+	printk(KERN_NOTICE "%s: PCI_COMMAND reg = 0x%x (bit 1 is %s)\n",
+	       tp->dev->name, cmd, (cmd & PCI_COMMAND_MEMORY) ? "on" : "off");
+	printk(KERN_NOTICE "%s: Reg value at offset 0x0 is 0x%x\n",
+	       tp->dev->name, tr32(0x0));
+
 	tw32(TG3PCI_MEM_WIN_BASE_ADDR, 0);
 
 	return tg3_reset_hw(tp, reset_phy);