diff mbox

[bisected] tg3 broken in 3.18.0?

Message ID 1418930889.3433.8.camel@prashant
State Not Applicable
Headers show

Commit Message

Prashant Sreedharan Dec. 18, 2014, 7:28 p.m. UTC
On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
> On Tue, Dec 16, 2014 at 12:54 PM, Michael Chan <mchan@broadcom.com> wrote:
> > On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
> >> It's a
> >> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
> >> Gigabit Ethernet PCI Express
> >> over here
> >>
> >> I put a WARN_ON(1) after those printks, and this is what I got:
> >>
> >> [    1.550640] pci 0000:02:00.0: 1st 1 1
> >> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
> >> [    1.550645] ------------[ cut here ]------------
> >> [    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
> >> [    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
> >> [    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
> >> [    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
> >> [    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
> >> [    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
> >> [    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
> >> [    1.550669] Call Trace:
> >> [    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
> >> [    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
> >> [    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
> >> [    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
> >> [    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
> >> [    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
> >> [    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
> >> [    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3]
> >
> > So does it work if you use a non-zero crs_timeout?  The driver has
> > called tg3_halt() which may affect configuration read responses.  I need
> > to check with the hardware team to see if the 5722 will return CRS in
> > this scenario.
> 
> Any updates from the hardware team?
> 
> This is a pretty serious regression, but as far as I can tell, it is
> not a PCI bug.  The device should respond to a config read of vendor
> ID.  If the driver does something that make the read return CRS
> status, I think the driver is responsible for doing whatever delay or
> other fixup is required.
> 
> I'm inclined to reassign this bug to the tg3 driver unless you think
> the PCI core is doing something wrong here.
> 
> Bjorn

We were not able to reproduce this issue, could you please check what is
the value of reg 0x70, before the pci_device_is_present call is made ?
if bit 15 is set config access will be retried.


 



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Marcelo Ricardo Leitner Dec. 18, 2014, 8:09 p.m. UTC | #1
On 18-12-2014 17:28, Prashant Sreedharan wrote:
> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>> On Tue, Dec 16, 2014 at 12:54 PM, Michael Chan <mchan@broadcom.com> wrote:
>>> On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
>>>> It's a
>>>> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
>>>> Gigabit Ethernet PCI Express
>>>> over here
>>>>
>>>> I put a WARN_ON(1) after those printks, and this is what I got:
>>>>
>>>> [    1.550640] pci 0000:02:00.0: 1st 1 1
>>>> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
>>>> [    1.550645] ------------[ cut here ]------------
>>>> [    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
>>>> [    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
>>>> [    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
>>>> [    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
>>>> [    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
>>>> [    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
>>>> [    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
>>>> [    1.550669] Call Trace:
>>>> [    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
>>>> [    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
>>>> [    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
>>>> [    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
>>>> [    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
>>>> [    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
>>>> [    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
>>>> [    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3]
>>>
>>> So does it work if you use a non-zero crs_timeout?  The driver has
>>> called tg3_halt() which may affect configuration read responses.  I need
>>> to check with the hardware team to see if the 5722 will return CRS in
>>> this scenario.
>>
>> Any updates from the hardware team?
>>
>> This is a pretty serious regression, but as far as I can tell, it is
>> not a PCI bug.  The device should respond to a config read of vendor
>> ID.  If the driver does something that make the read return CRS
>> status, I think the driver is responsible for doing whatever delay or
>> other fixup is required.
>>
>> I'm inclined to reassign this bug to the tg3 driver unless you think
>> the PCI core is doing something wrong here.
>>
>> Bjorn
> 
> We were not able to reproduce this issue, could you please check what is
> the value of reg 0x70, before the pci_device_is_present call is made ?
> if bit 15 is set config access will be retried.
> 
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>          void (*write_op)(struct tg3 *, u32, u32);
>          int i, err;
>   
> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>          if (!pci_device_is_present(tp->pdev))
>                  return -ENODEV;
>   

With that PCI patch applied and my debugs, without the timeout hack (so crs_timeout=0):

[    1.545554] config state: 12b2
[    1.548636] pci 0000:02:00.0: 1st 1 1
[    1.548637] pci 0000:02:00.0: crs_timeout: 0
[    1.548783] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95722) rev a200] (PCI Express) MAC address 00:0a:f7:2b:9b:39
[    1.548785] tg3 0000:02:00.0 eth0: attached PHY is 5722/5756 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
[    1.548786] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    1.548787] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[    1.554389] tg3 0000:02:00.0 p1p1: renamed from eth0
...

That's the only time your printk got printed.

  Marcelo

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nils Holland Dec. 18, 2014, 8:26 p.m. UTC | #2
On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
> > 
> > Any updates from the hardware team?
> > 
> > This is a pretty serious regression, but as far as I can tell, it is
> > not a PCI bug.  The device should respond to a config read of vendor
> > ID.  If the driver does something that make the read return CRS
> > status, I think the driver is responsible for doing whatever delay or
> > other fixup is required.
> > 
> > I'm inclined to reassign this bug to the tg3 driver unless you think
> > the PCI core is doing something wrong here.
> > 
> > Bjorn
> 
> We were not able to reproduce this issue, could you please check what is
> the value of reg 0x70, before the pci_device_is_present call is made ?
> if bit 15 is set config access will be retried.
> 
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>         void (*write_op)(struct tg3 *, u32, u32);
>         int i, err;
>  
> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>         if (!pci_device_is_present(tp->pdev))
>                 return -ENODEV;

No problem, I gave this a try and here is what I get:

[    2.185190] libphy: tg3 mdio bus: probed
[    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
[    2.244993] config state: 1292
[    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001]
        (PCI Express) MAC address 00:19:99:ce:13:a6
[    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780]
        (mii_bus:phy_addr=200:01)
[    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
        MIirq[0] ASF[0] TSOcap[1]
[    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[...]
[   12.204692] tg3 0000:02:00.0
        enp2s0: No firmware running
[   12.206653] config state: 1292
[   12.208655] config state: 1292

That's all of the three times the new debugging line gets hit when I
boot my system using the supplied diagnostic patch.

Hope that helps - of course, I'd gladly test any further
(diagnostic) patches if required! Also, if I can provide any
additional information that might be of value, just ask:-)

Greetings,
Nils
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Ricardo Leitner Dec. 18, 2014, 8:33 p.m. UTC | #3
On 18-12-2014 18:09, Marcelo Ricardo Leitner wrote:
> On 18-12-2014 17:28, Prashant Sreedharan wrote:
>> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>>> On Tue, Dec 16, 2014 at 12:54 PM, Michael Chan <mchan@broadcom.com> wrote:
>>>> On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
>>>>> It's a
>>>>> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
>>>>> Gigabit Ethernet PCI Express
>>>>> over here
>>>>>
>>>>> I put a WARN_ON(1) after those printks, and this is what I got:
>>>>>
>>>>> [    1.550640] pci 0000:02:00.0: 1st 1 1
>>>>> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
>>>>> [    1.550645] ------------[ cut here ]------------
>>>>> [    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
>>>>> [    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
>>>>> [    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
>>>>> [    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
>>>>> [    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
>>>>> [    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
>>>>> [    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
>>>>> [    1.550669] Call Trace:
>>>>> [    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
>>>>> [    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
>>>>> [    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
>>>>> [    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
>>>>> [    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
>>>>> [    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
>>>>> [    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
>>>>> [    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3]
>>>>
>>>> So does it work if you use a non-zero crs_timeout?  The driver has
>>>> called tg3_halt() which may affect configuration read responses.  I need
>>>> to check with the hardware team to see if the 5722 will return CRS in
>>>> this scenario.
>>>
>>> Any updates from the hardware team?
>>>
>>> This is a pretty serious regression, but as far as I can tell, it is
>>> not a PCI bug.  The device should respond to a config read of vendor
>>> ID.  If the driver does something that make the read return CRS
>>> status, I think the driver is responsible for doing whatever delay or
>>> other fixup is required.
>>>
>>> I'm inclined to reassign this bug to the tg3 driver unless you think
>>> the PCI core is doing something wrong here.
>>>
>>> Bjorn
>>
>> We were not able to reproduce this issue, could you please check what is
>> the value of reg 0x70, before the pci_device_is_present call is made ?
>> if bit 15 is set config access will be retried.
>>
>> --- a/drivers/net/ethernet/broadcom/tg3.c
>> +++ b/drivers/net/ethernet/broadcom/tg3.c
>> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>>           void (*write_op)(struct tg3 *, u32, u32);
>>           int i, err;
>>
>> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>>           if (!pci_device_is_present(tp->pdev))
>>                   return -ENODEV;
>>
>
> With that PCI patch applied and my debugs, without the timeout hack (so crs_timeout=0):
>
> [    1.545554] config state: 12b2
> [    1.548636] pci 0000:02:00.0: 1st 1 1
> [    1.548637] pci 0000:02:00.0: crs_timeout: 0
> [    1.548783] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95722) rev a200] (PCI Express) MAC address 00:0a:f7:2b:9b:39
> [    1.548785] tg3 0000:02:00.0 eth0: attached PHY is 5722/5756 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
> [    1.548786] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [    1.548787] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
> [    1.554389] tg3 0000:02:00.0 p1p1: renamed from eth0
> ...
>
> That's the only time your printk got printed.

My bad, I forgot I had configured the system to not bring that iface up 
anymore.. when doing so, just like Nils had too:

[ 1743.678714] tg3 0000:02:00.0: irq 32 for MSI/MSI-X
[ 1745.554039] tg3 0000:02:00.0 p1p1: No firmware running
[ 1745.554724] config state: 12b2
[ 1745.557822] pci 0000:02:00.0: 1st 1 1
[ 1745.557827] pci 0000:02:00.0: crs_timeout: 0
[ 1745.559383] config state: 12b2
[ 1745.562470] pci 0000:02:00.0: 1st 1 1
[ 1745.562471] pci 0000:02:00.0: crs_timeout: 0

   Marcelo

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Prashant Sreedharan Dec. 19, 2014, 2:10 a.m. UTC | #4
On Thu, 2014-12-18 at 21:26 +0100, Nils Holland wrote:
> On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
> > On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
> > > 
> > > Any updates from the hardware team?
> > > 
> > > This is a pretty serious regression, but as far as I can tell, it is
> > > not a PCI bug.  The device should respond to a config read of vendor
> > > ID.  If the driver does something that make the read return CRS
> > > status, I think the driver is responsible for doing whatever delay or
> > > other fixup is required.
> > > 
> > > I'm inclined to reassign this bug to the tg3 driver unless you think
> > > the PCI core is doing something wrong here.
> > > 
> > > Bjorn
> > 
> > We were not able to reproduce this issue, could you please check what is
> > the value of reg 0x70, before the pci_device_is_present call is made ?
> > if bit 15 is set config access will be retried.
> > 
> > --- a/drivers/net/ethernet/broadcom/tg3.c
> > +++ b/drivers/net/ethernet/broadcom/tg3.c
> > @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
> >         void (*write_op)(struct tg3 *, u32, u32);
> >         int i, err;
> >  
> > +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
> >         if (!pci_device_is_present(tp->pdev))
> >                 return -ENODEV;
> 
> No problem, I gave this a try and here is what I get:
> 
> [    2.185190] libphy: tg3 mdio bus: probed
> [    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
> [    2.244993] config state: 1292
> [    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001]
>         (PCI Express) MAC address 00:19:99:ce:13:a6
> [    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780]
>         (mii_bus:phy_addr=200:01)
> [    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
>         MIirq[0] ASF[0] TSOcap[1]
> [    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
> [...]
> [   12.204692] tg3 0000:02:00.0
>         enp2s0: No firmware running
> [   12.206653] config state: 1292
> [   12.208655] config state: 1292
> 
> That's all of the three times the new debugging line gets hit when I
> boot my system using the supplied diagnostic patch.
> 
> Hope that helps - of course, I'd gladly test any further
> (diagnostic) patches if required! Also, if I can provide any
> additional information that might be of value, just ask:-)
> 
Nils/Marcelo thanks for inputs, since reg 0x70 bit 15 is clear it
indicates the chip is not setting the config retry bit. We were hoping
this bit is causing the config access to return CRS but looks like it is
not. 

Btw after forcing the error path (tg3_init_one -> tg3_halt) in the
driver now we are able to reproduce the problem on 5722 in house. We are
working with the HW team to narrow this down.

Also it is not clear to me how reverting commit cfa6a7877b17a667 fixes
the problem.
> Greetings,
> Nils
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Dec. 19, 2014, 5:09 p.m. UTC | #5
On Thu, Dec 18, 2014 at 7:10 PM, Prashant Sreedharan
<prashant@broadcom.com> wrote:
> On Thu, 2014-12-18 at 21:26 +0100, Nils Holland wrote:
>> On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
>> > On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>> > >
>> > > Any updates from the hardware team?
>> > >
>> > > This is a pretty serious regression, but as far as I can tell, it is
>> > > not a PCI bug.  The device should respond to a config read of vendor
>> > > ID.  If the driver does something that make the read return CRS
>> > > status, I think the driver is responsible for doing whatever delay or
>> > > other fixup is required.
>> > >
>> > > I'm inclined to reassign this bug to the tg3 driver unless you think
>> > > the PCI core is doing something wrong here.
>> > >
>> > > Bjorn
>> >
>> > We were not able to reproduce this issue, could you please check what is
>> > the value of reg 0x70, before the pci_device_is_present call is made ?
>> > if bit 15 is set config access will be retried.
>> >
>> > --- a/drivers/net/ethernet/broadcom/tg3.c
>> > +++ b/drivers/net/ethernet/broadcom/tg3.c
>> > @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>> >         void (*write_op)(struct tg3 *, u32, u32);
>> >         int i, err;
>> >
>> > +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>> >         if (!pci_device_is_present(tp->pdev))
>> >                 return -ENODEV;
>>
>> No problem, I gave this a try and here is what I get:
>>
>> [    2.185190] libphy: tg3 mdio bus: probed
>> [    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
>> [    2.244993] config state: 1292
>> [    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001]
>>         (PCI Express) MAC address 00:19:99:ce:13:a6
>> [    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780]
>>         (mii_bus:phy_addr=200:01)
>> [    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
>>         MIirq[0] ASF[0] TSOcap[1]
>> [    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
>> [...]
>> [   12.204692] tg3 0000:02:00.0
>>         enp2s0: No firmware running
>> [   12.206653] config state: 1292
>> [   12.208655] config state: 1292
>>
>> That's all of the three times the new debugging line gets hit when I
>> boot my system using the supplied diagnostic patch.
>>
>> Hope that helps - of course, I'd gladly test any further
>> (diagnostic) patches if required! Also, if I can provide any
>> additional information that might be of value, just ask:-)
>>
> Nils/Marcelo thanks for inputs, since reg 0x70 bit 15 is clear it
> indicates the chip is not setting the config retry bit. We were hoping
> this bit is causing the config access to return CRS but looks like it is
> not.
>
> Btw after forcing the error path (tg3_init_one -> tg3_halt) in the
> driver now we are able to reproduce the problem on 5722 in house. We are
> working with the HW team to narrow this down.
>
> Also it is not clear to me how reverting commit cfa6a7877b17a667 fixes
> the problem.

The full commit is 89665a6a71408796565bfd29cfa6a7877b17a667, and git
works with any unique *prefix* of that.  The current convention is to
use the first 12 characters (I have "[core] abbrev = 12" in my
.git/config).  Unfortunately, suffixes don't work at all.

Anyway, here's why I think 89665a6a7140 makes a difference.  We're in this path:

  pci_device_is_present
    pci_bus_read_dev_vendor_id(..., crs_timeout = 0)
      pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)

and for some reason the chip returns 0x00010001 for that 32-bit read.
Before 89665a6a7140, we compared all 32 bits with "*l == 0xffff0001".
This is false, so pci_bus_read_dev_vendor_id() returns true, which
means pci_device_is_present() is also true.

After 89665a6a7140, we compare only the low 16 bits with ((*l &
0xffff) == 0x0001), which is true, so pci_bus_read_dev_vendor_id()
returns false, and pci_device_is_present() is false.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Ricardo Leitner Dec. 19, 2014, 5:16 p.m. UTC | #6
On 19-12-2014 15:09, Bjorn Helgaas wrote:
> On Thu, Dec 18, 2014 at 7:10 PM, Prashant Sreedharan
> <prashant@broadcom.com> wrote:
>> On Thu, 2014-12-18 at 21:26 +0100, Nils Holland wrote:
>>> On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
>>>> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>>>>>
>>>>> Any updates from the hardware team?
>>>>>
>>>>> This is a pretty serious regression, but as far as I can tell, it is
>>>>> not a PCI bug.  The device should respond to a config read of vendor
>>>>> ID.  If the driver does something that make the read return CRS
>>>>> status, I think the driver is responsible for doing whatever delay or
>>>>> other fixup is required.
>>>>>
>>>>> I'm inclined to reassign this bug to the tg3 driver unless you think
>>>>> the PCI core is doing something wrong here.
>>>>>
>>>>> Bjorn
>>>>
>>>> We were not able to reproduce this issue, could you please check what is
>>>> the value of reg 0x70, before the pci_device_is_present call is made ?
>>>> if bit 15 is set config access will be retried.
>>>>
>>>> --- a/drivers/net/ethernet/broadcom/tg3.c
>>>> +++ b/drivers/net/ethernet/broadcom/tg3.c
>>>> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>>>>          void (*write_op)(struct tg3 *, u32, u32);
>>>>          int i, err;
>>>>
>>>> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>>>>          if (!pci_device_is_present(tp->pdev))
>>>>                  return -ENODEV;
>>>
>>> No problem, I gave this a try and here is what I get:
>>>
>>> [    2.185190] libphy: tg3 mdio bus: probed
>>> [    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
>>> [    2.244993] config state: 1292
>>> [    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001]
>>>          (PCI Express) MAC address 00:19:99:ce:13:a6
>>> [    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780]
>>>          (mii_bus:phy_addr=200:01)
>>> [    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
>>>          MIirq[0] ASF[0] TSOcap[1]
>>> [    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
>>> [...]
>>> [   12.204692] tg3 0000:02:00.0
>>>          enp2s0: No firmware running
>>> [   12.206653] config state: 1292
>>> [   12.208655] config state: 1292
>>>
>>> That's all of the three times the new debugging line gets hit when I
>>> boot my system using the supplied diagnostic patch.
>>>
>>> Hope that helps - of course, I'd gladly test any further
>>> (diagnostic) patches if required! Also, if I can provide any
>>> additional information that might be of value, just ask:-)
>>>
>> Nils/Marcelo thanks for inputs, since reg 0x70 bit 15 is clear it
>> indicates the chip is not setting the config retry bit. We were hoping
>> this bit is causing the config access to return CRS but looks like it is
>> not.
>>
>> Btw after forcing the error path (tg3_init_one -> tg3_halt) in the
>> driver now we are able to reproduce the problem on 5722 in house. We are
>> working with the HW team to narrow this down.
>>
>> Also it is not clear to me how reverting commit cfa6a7877b17a667 fixes
>> the problem.
>
> The full commit is 89665a6a71408796565bfd29cfa6a7877b17a667, and git
> works with any unique *prefix* of that.  The current convention is to
> use the first 12 characters (I have "[core] abbrev = 12" in my
> .git/config).  Unfortunately, suffixes don't work at all.
>
> Anyway, here's why I think 89665a6a7140 makes a difference.  We're in this path:
>
>    pci_device_is_present
>      pci_bus_read_dev_vendor_id(..., crs_timeout = 0)
>        pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)
>
> and for some reason the chip returns 0x00010001 for that 32-bit read.

Actually it returns just 0x00000001, but yeah, that's my understanding too.

   Marcelo

> Before 89665a6a7140, we compared all 32 bits with "*l == 0xffff0001".
> This is false, so pci_bus_read_dev_vendor_id() returns true, which
> means pci_device_is_present() is also true.
>
> After 89665a6a7140, we compare only the low 16 bits with ((*l &
> 0xffff) == 0x0001), which is true, so pci_bus_read_dev_vendor_id()
> returns false, and pci_device_is_present() is false.
>
> Bjorn
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rajat Jain Dec. 19, 2014, 6:24 p.m. UTC | #7
One of the reasons to replace the condition (*l == 0xffff0001) with
(*l & 0xffff) == 0x0001) was that some devices apparently returned
0001 for device id to indicate CRS, but returned actual vendor id in
the vendor ID field (hence the need to ignore vendor field).

Are we saying that the tg3 device returns 0x0001 for the device ID and
yet expects it to be treated as a good value (not CRS)?

Thanks,

Rajat

On Fri, Dec 19, 2014 at 9:16 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On 19-12-2014 15:09, Bjorn Helgaas wrote:
>>
>> On Thu, Dec 18, 2014 at 7:10 PM, Prashant Sreedharan
>> <prashant@broadcom.com> wrote:
>>>
>>> On Thu, 2014-12-18 at 21:26 +0100, Nils Holland wrote:
>>>>
>>>> On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
>>>>>
>>>>> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>>>>>>
>>>>>>
>>>>>> Any updates from the hardware team?
>>>>>>
>>>>>> This is a pretty serious regression, but as far as I can tell, it is
>>>>>> not a PCI bug.  The device should respond to a config read of vendor
>>>>>> ID.  If the driver does something that make the read return CRS
>>>>>> status, I think the driver is responsible for doing whatever delay or
>>>>>> other fixup is required.
>>>>>>
>>>>>> I'm inclined to reassign this bug to the tg3 driver unless you think
>>>>>> the PCI core is doing something wrong here.
>>>>>>
>>>>>> Bjorn
>>>>>
>>>>>
>>>>> We were not able to reproduce this issue, could you please check what
>>>>> is
>>>>> the value of reg 0x70, before the pci_device_is_present call is made ?
>>>>> if bit 15 is set config access will be retried.
>>>>>
>>>>> --- a/drivers/net/ethernet/broadcom/tg3.c
>>>>> +++ b/drivers/net/ethernet/broadcom/tg3.c
>>>>> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>>>>>          void (*write_op)(struct tg3 *, u32, u32);
>>>>>          int i, err;
>>>>>
>>>>> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>>>>>          if (!pci_device_is_present(tp->pdev))
>>>>>                  return -ENODEV;
>>>>
>>>>
>>>> No problem, I gave this a try and here is what I get:
>>>>
>>>> [    2.185190] libphy: tg3 mdio bus: probed
>>>> [    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
>>>> [    2.244993] config state: 1292
>>>> [    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev
>>>> 57780001]
>>>>          (PCI Express) MAC address 00:19:99:ce:13:a6
>>>> [    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom
>>>> BCM57780]
>>>>          (mii_bus:phy_addr=200:01)
>>>> [    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
>>>>          MIirq[0] ASF[0] TSOcap[1]
>>>> [    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000]
>>>> dma_mask[64-bit]
>>>> [...]
>>>> [   12.204692] tg3 0000:02:00.0
>>>>          enp2s0: No firmware running
>>>> [   12.206653] config state: 1292
>>>> [   12.208655] config state: 1292
>>>>
>>>> That's all of the three times the new debugging line gets hit when I
>>>> boot my system using the supplied diagnostic patch.
>>>>
>>>> Hope that helps - of course, I'd gladly test any further
>>>> (diagnostic) patches if required! Also, if I can provide any
>>>> additional information that might be of value, just ask:-)
>>>>
>>> Nils/Marcelo thanks for inputs, since reg 0x70 bit 15 is clear it
>>> indicates the chip is not setting the config retry bit. We were hoping
>>> this bit is causing the config access to return CRS but looks like it is
>>> not.
>>>
>>> Btw after forcing the error path (tg3_init_one -> tg3_halt) in the
>>> driver now we are able to reproduce the problem on 5722 in house. We are
>>> working with the HW team to narrow this down.
>>>
>>> Also it is not clear to me how reverting commit cfa6a7877b17a667 fixes
>>> the problem.
>>
>>
>> The full commit is 89665a6a71408796565bfd29cfa6a7877b17a667, and git
>> works with any unique *prefix* of that.  The current convention is to
>> use the first 12 characters (I have "[core] abbrev = 12" in my
>> .git/config).  Unfortunately, suffixes don't work at all.
>>
>> Anyway, here's why I think 89665a6a7140 makes a difference.  We're in this
>> path:
>>
>>    pci_device_is_present
>>      pci_bus_read_dev_vendor_id(..., crs_timeout = 0)
>>        pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)
>>
>> and for some reason the chip returns 0x00010001 for that 32-bit read.
>
>
> Actually it returns just 0x00000001, but yeah, that's my understanding too.
>
>   Marcelo
>
>
>> Before 89665a6a7140, we compared all 32 bits with "*l == 0xffff0001".
>> This is false, so pci_bus_read_dev_vendor_id() returns true, which
>> means pci_device_is_present() is also true.
>>
>> After 89665a6a7140, we compare only the low 16 bits with ((*l &
>> 0xffff) == 0x0001), which is true, so pci_bus_read_dev_vendor_id()
>> returns false, and pci_device_is_present() is false.
>>
>> Bjorn
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Prashant Sreedharan Dec. 19, 2014, 6:53 p.m. UTC | #8
On Fri, 2014-12-19 at 10:24 -0800, Rajat Jain wrote:
> One of the reasons to replace the condition (*l == 0xffff0001) with
> (*l & 0xffff) == 0x0001) was that some devices apparently returned
> 0001 for device id to indicate CRS, but returned actual vendor id in
> the vendor ID field (hence the need to ignore vendor field).
> 
> Are we saying that the tg3 device returns 0x0001 for the device ID and
> yet expects it to be treated as a good value (not CRS)?
> 
No it should not be treated as a good value, this commit has
surfaced/exposed a problem in 5722 chipset which was previously masked. 



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rajat Jain Dec. 19, 2014, 7:37 p.m. UTC | #9
On Fri, Dec 19, 2014 at 10:53 AM, Prashant Sreedharan
<prashant@broadcom.com> wrote:
> On Fri, 2014-12-19 at 10:24 -0800, Rajat Jain wrote:
>> One of the reasons to replace the condition (*l == 0xffff0001) with
>> (*l & 0xffff) == 0x0001) was that some devices apparently returned
>> 0001 for device id to indicate CRS, but returned actual vendor id in
>> the vendor ID field (hence the need to ignore vendor field).
>>
>> Are we saying that the tg3 device returns 0x0001 for the device ID and
>> yet expects it to be treated as a good value (not CRS)?
>>
> No it should not be treated as a good value, this commit has
> surfaced/exposed a problem in 5722 chipset which was previously masked.
>

Got it. Thanks.

I assume you mean its a HW issue, and the workaround if required shall
be taken care in tg3 driver.

Thanks,

Rajat

>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -9025,6 +9025,7 @@  static int tg3_chip_reset(struct tg3 *tp)
        void (*write_op)(struct tg3 *, u32, u32);
        int i, err;
 
+       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
        if (!pci_device_is_present(tp->pdev))
                return -ENODEV;