pci: fix incorrect value returned from pcie_get_speed_cap

Message ID alpine.LRH.2.02.1810301232310.23106@file01.intranet.prod.int.rdu2.redhat.com
State Accepted
Delegated to: Bjorn Helgaas
Headers show
Series
  • pci: fix incorrect value returned from pcie_get_speed_cap
Related show

Commit Message

Mikulas Patocka Oct. 30, 2018, 4:36 p.m.
The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
the register and compare it against them.

This patch fixes errors "amdgpu: [powerplay] failed to send message 261
ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
the slot is being incorrectly reported as PCIe-v3 capable.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
Cc: stable@vger.kernel.org	# v4.17+

---
 drivers/pci/pci.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Alex Deucher Oct. 30, 2018, 6 p.m. | #1
On Tue, Oct 30, 2018 at 12:36 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> the register and compare it against them.
>
> This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> the slot is being incorrectly reported as PCIe-v3 capable.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> Cc: stable@vger.kernel.org      # v4.17+

Acked-by: Alex Deucher <alexander.deucher@amd.com>

>
> ---
>  drivers/pci/pci.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> Index: linux-4.19/drivers/pci/pci.c
> ===================================================================
> --- linux-4.19.orig/drivers/pci/pci.c   2018-10-30 16:58:58.000000000 +0100
> +++ linux-4.19/drivers/pci/pci.c        2018-10-30 16:58:58.000000000 +0100
> @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
>
>         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
>         if (lnkcap) {
> -               if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> +               if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
>                         return PCIE_SPEED_16_0GT;
> -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
>                         return PCIE_SPEED_8_0GT;
> -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
>                         return PCIE_SPEED_5_0GT;
> -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
>                         return PCIE_SPEED_2_5GT;
>         }
>
Alex Deucher Nov. 14, 2018, 3:44 p.m. | #2
On Tue, Oct 30, 2018 at 2:00 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, Oct 30, 2018 at 12:36 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > the register and compare it against them.
> >
> > This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> > ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> > the slot is being incorrectly reported as PCIe-v3 capable.
> >
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > Cc: stable@vger.kernel.org      # v4.17+
>
> Acked-by: Alex Deucher <alexander.deucher@amd.com>


Has this patch been picked up yet?  It fixes a ton of bugs.  Please apply!

Alex

>
> >
> > ---
> >  drivers/pci/pci.c |    8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > Index: linux-4.19/drivers/pci/pci.c
> > ===================================================================
> > --- linux-4.19.orig/drivers/pci/pci.c   2018-10-30 16:58:58.000000000 +0100
> > +++ linux-4.19/drivers/pci/pci.c        2018-10-30 16:58:58.000000000 +0100
> > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> >
> >         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> >         if (lnkcap) {
> > -               if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > +               if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> >                         return PCIE_SPEED_16_0GT;
> > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> >                         return PCIE_SPEED_8_0GT;
> > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> >                         return PCIE_SPEED_5_0GT;
> > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> >                         return PCIE_SPEED_2_5GT;
> >         }
> >
Alex Deucher Nov. 19, 2018, 7:14 p.m. | #3
On Wed, Nov 14, 2018 at 10:44 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, Oct 30, 2018 at 2:00 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Tue, Oct 30, 2018 at 12:36 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > > the register and compare it against them.
> > >
> > > This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> > > ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> > > the slot is being incorrectly reported as PCIe-v3 capable.
> > >
> > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > > Cc: stable@vger.kernel.org      # v4.17+
> >
> > Acked-by: Alex Deucher <alexander.deucher@amd.com>
>
>
> Has this patch been picked up yet?  It fixes a ton of bugs.  Please apply!

Any word on this?  Fixes:
https://bugs.freedesktop.org/show_bug.cgi?id=108704
https://bugs.freedesktop.org/show_bug.cgi?id=108778
and several others.

Alex

>
> Alex
>
> >
> > >
> > > ---
> > >  drivers/pci/pci.c |    8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > Index: linux-4.19/drivers/pci/pci.c
> > > ===================================================================
> > > --- linux-4.19.orig/drivers/pci/pci.c   2018-10-30 16:58:58.000000000 +0100
> > > +++ linux-4.19/drivers/pci/pci.c        2018-10-30 16:58:58.000000000 +0100
> > > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> > >
> > >         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > >         if (lnkcap) {
> > > -               if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > > +               if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> > >                         return PCIE_SPEED_16_0GT;
> > > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > > +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> > >                         return PCIE_SPEED_8_0GT;
> > > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > > +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> > >                         return PCIE_SPEED_5_0GT;
> > > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > > +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> > >                         return PCIE_SPEED_2_5GT;
> > >         }
> > >
Bjorn Helgaas Nov. 20, 2018, 12:47 a.m. | #4
On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> the register and compare it against them.
> 
> This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> the slot is being incorrectly reported as PCIe-v3 capable.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> Cc: stable@vger.kernel.org	# v4.17+
> 
> ---
>  drivers/pci/pci.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> Index: linux-4.19/drivers/pci/pci.c
> ===================================================================
> --- linux-4.19.orig/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> +++ linux-4.19/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
>  
>  	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
>  	if (lnkcap) {
> -		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> +		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
>  			return PCIE_SPEED_16_0GT;
> -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
>  			return PCIE_SPEED_8_0GT;
> -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
>  			return PCIE_SPEED_5_0GT;
> -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
>  			return PCIE_SPEED_2_5GT;
>  	}
>  

I'd like to apply this as below, where I removed the 8_0GB and 16_0GB
cases as recommended by the spec.  I can't test it myself, and the
bugzillas don't contain enough information for me to confirm that the
patch below is enough (the "lspci -vv" output of the root port and GPU
is what I would need).

I'm confused about the fact that 6cf57be0f78e appeared in v4.17, but
v4.18 works fine according to both bugzillas.

I also don't have a good feel for whether this is urgent enough to be
a v4.20 fix or whether it can wait for v4.21.  Evidence either way
would help.

We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
(hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
qla24xx_pci_info_str(), and maybe a couple other places.

Bjorn


commit 871f73abf4b8e6aee8a206775f944ede7c7d7250
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Tue Oct 30 12:36:08 2018 -0400

    PCI: Fix incorrect value returned from pcie_get_speed_cap()
    
    The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks.  We must mask
    the register and compare it against them.
    
    This patch fixes errors "amdgpu: [powerplay] failed to send message 261 ret
    is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because the
    slot is being incorrectly reported as PCIe-v3 capable.
    
    Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
    Link: https://bugs.freedesktop.org/show_bug.cgi?id=108704
    Link: https://bugs.freedesktop.org/show_bug.cgi?id=108778
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    [bhelgaas: update comment, remove use of PCI_EXP_LNKCAP_SLS_8_0GB and
    PCI_EXP_LNKCAP_SLS_16_0GB since those should be covered by PCI_EXP_LNKCAP2]
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Acked-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org      # v4.17+

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d068f11d08a7..8563d1d9b102 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5556,9 +5556,13 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
 	u32 lnkcap2, lnkcap;
 
 	/*
-	 * PCIe r4.0 sec 7.5.3.18 recommends using the Supported Link
-	 * Speeds Vector in Link Capabilities 2 when supported, falling
-	 * back to Max Link Speed in Link Capabilities otherwise.
+	 * Link Capabilities 2 was added in PCIe r3.0, sec 7.8.18.  The
+	 * implementation note there recommends using the Supported Link
+	 * Speeds Vector in Link Capabilities 2 when supported.
+	 *
+	 * Without Link Capabilities 2, i.e., prior to PCIe r3.0, software
+	 * should use the Supported Link Speeds field in Link Capabilities,
+	 * where only 2.5 GT/s and 5.0 GT/s speeds were defined.
 	 */
 	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
 	if (lnkcap2) { /* PCIe r3.0-compliant */
@@ -5575,13 +5579,9 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
 
 	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
 	if (lnkcap) {
-		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
-			return PCIE_SPEED_16_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
-			return PCIE_SPEED_8_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
+		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
 			return PCIE_SPEED_5_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
 			return PCIE_SPEED_2_5GT;
 	}
Alex Deucher Nov. 20, 2018, 2:17 p.m. | #5
On Mon, Nov 19, 2018 at 7:47 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > the register and compare it against them.
> >
> > This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> > ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> > the slot is being incorrectly reported as PCIe-v3 capable.
> >
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > Cc: stable@vger.kernel.org    # v4.17+
> >
> > ---
> >  drivers/pci/pci.c |    8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > Index: linux-4.19/drivers/pci/pci.c
> > ===================================================================
> > --- linux-4.19.orig/drivers/pci/pci.c 2018-10-30 16:58:58.000000000 +0100
> > +++ linux-4.19/drivers/pci/pci.c      2018-10-30 16:58:58.000000000 +0100
> > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> >
> >       pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> >       if (lnkcap) {
> > -             if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > +             if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> >                       return PCIE_SPEED_16_0GT;
> > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> >                       return PCIE_SPEED_8_0GT;
> > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> >                       return PCIE_SPEED_5_0GT;
> > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> >                       return PCIE_SPEED_2_5GT;
> >       }
> >
>
> I'd like to apply this as below, where I removed the 8_0GB and 16_0GB
> cases as recommended by the spec.  I can't test it myself, and the
> bugzillas don't contain enough information for me to confirm that the
> patch below is enough (the "lspci -vv" output of the root port and GPU
> is what I would need).
>
> I'm confused about the fact that 6cf57be0f78e appeared in v4.17, but
> v4.18 works fine according to both bugzillas.

This issue affects AMD GPUs because we switched from using an open
coded check for pcie link speeds in the driver to using the common
pcie variants in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.19&id=5d9a6330403271fbb1244f14380a7cc44662796f

The patch below would regress performance, at least on AMD GPUs, since
we'd end up reporting a max speed of gen 2 (5 GT/s) which would cause
the driver to limit the speed to gen2 even if gen3 or 4 are available.

>
> I also don't have a good feel for whether this is urgent enough to be
> a v4.20 fix or whether it can wait for v4.21.  Evidence either way
> would help.

I'd like it to land for 4.19 and 4.20 at least.  Alternatively, we
could revert all of the drm patches to and bring back all the open
coded implementations, but it's a fairly large number of patches to
revert.

Alex

>
> We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
> (hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
> qla24xx_pci_info_str(), and maybe a couple other places.
>
> Bjorn
>
>
> commit 871f73abf4b8e6aee8a206775f944ede7c7d7250
> Author: Mikulas Patocka <mpatocka@redhat.com>
> Date:   Tue Oct 30 12:36:08 2018 -0400
>
>     PCI: Fix incorrect value returned from pcie_get_speed_cap()
>
>     The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks.  We must mask
>     the register and compare it against them.
>
>     This patch fixes errors "amdgpu: [powerplay] failed to send message 261 ret
>     is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because the
>     slot is being incorrectly reported as PCIe-v3 capable.
>
>     Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
>     Link: https://bugs.freedesktop.org/show_bug.cgi?id=108704
>     Link: https://bugs.freedesktop.org/show_bug.cgi?id=108778
>     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>     [bhelgaas: update comment, remove use of PCI_EXP_LNKCAP_SLS_8_0GB and
>     PCI_EXP_LNKCAP_SLS_16_0GB since those should be covered by PCI_EXP_LNKCAP2]
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     Acked-by: Alex Deucher <alexander.deucher@amd.com>
>     Cc: stable@vger.kernel.org      # v4.17+
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d068f11d08a7..8563d1d9b102 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5556,9 +5556,13 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
>         u32 lnkcap2, lnkcap;
>
>         /*
> -        * PCIe r4.0 sec 7.5.3.18 recommends using the Supported Link
> -        * Speeds Vector in Link Capabilities 2 when supported, falling
> -        * back to Max Link Speed in Link Capabilities otherwise.
> +        * Link Capabilities 2 was added in PCIe r3.0, sec 7.8.18.  The
> +        * implementation note there recommends using the Supported Link
> +        * Speeds Vector in Link Capabilities 2 when supported.
> +        *
> +        * Without Link Capabilities 2, i.e., prior to PCIe r3.0, software
> +        * should use the Supported Link Speeds field in Link Capabilities,
> +        * where only 2.5 GT/s and 5.0 GT/s speeds were defined.
>          */
>         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
>         if (lnkcap2) { /* PCIe r3.0-compliant */
> @@ -5575,13 +5579,9 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
>
>         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
>         if (lnkcap) {
> -               if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> -                       return PCIE_SPEED_16_0GT;
> -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> -                       return PCIE_SPEED_8_0GT;
> -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> +               if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
>                         return PCIE_SPEED_5_0GT;
> -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
>                         return PCIE_SPEED_2_5GT;
>         }
>
Bjorn Helgaas Nov. 20, 2018, 3:13 p.m. | #6
On Tue, Nov 20, 2018 at 09:17:52AM -0500, Alex Deucher wrote:
> On Mon, Nov 19, 2018 at 7:47 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> > > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > > the register and compare it against them.
> > >
> > > This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> > > ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> > > the slot is being incorrectly reported as PCIe-v3 capable.
> > >
> > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > > Cc: stable@vger.kernel.org    # v4.17+
> > >
> > > ---
> > >  drivers/pci/pci.c |    8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > Index: linux-4.19/drivers/pci/pci.c
> > > ===================================================================
> > > --- linux-4.19.orig/drivers/pci/pci.c 2018-10-30 16:58:58.000000000 +0100
> > > +++ linux-4.19/drivers/pci/pci.c      2018-10-30 16:58:58.000000000 +0100
> > > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> > >
> > >       pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > >       if (lnkcap) {
> > > -             if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > > +             if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> > >                       return PCIE_SPEED_16_0GT;
> > > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> > >                       return PCIE_SPEED_8_0GT;
> > > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> > >                       return PCIE_SPEED_5_0GT;
> > > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> > >                       return PCIE_SPEED_2_5GT;
> > >       }
> > >
> >
> > I'd like to apply this as below, where I removed the 8_0GB and 16_0GB
> > cases as recommended by the spec.  I can't test it myself, and the
> > bugzillas don't contain enough information for me to confirm that the
> > patch below is enough (the "lspci -vv" output of the root port and GPU
> > is what I would need).
> >
> > I'm confused about the fact that 6cf57be0f78e appeared in v4.17, but
> > v4.18 works fine according to both bugzillas.
> 
> This issue affects AMD GPUs because we switched from using an open
> coded check for pcie link speeds in the driver to using the common
> pcie variants in
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.19&id=5d9a6330403271fbb1244f14380a7cc44662796f

OK, thanks.  I added that SHA1 and a note to explain the connection.

> The patch below would regress performance, at least on AMD GPUs, since
> we'd end up reporting a max speed of gen 2 (5 GT/s) which would cause
> the driver to limit the speed to gen2 even if gen3 or 4 are available.

I guess this means these parts are broken with respect to the spec,
since they support gen3 speeds but don't implement LnkCap2?

Gen2, i.e., PCIe r2.1, only defined 2.5 GT/s and 5 GT/s.  Devices
capable of the higher speeds added by PCIe r3.0 are supposed to
implement LnkCap2, but if we're even getting to this code, it means
LnkCap2 was zero.

If we confirm that this is a device defect, the question is what the
best way to work around it is.  Probably the original patch is easier
than some sort of quirk, but we'd need to expand the comment a little
bit to explain why we're not following the spec recommendation.

It looks like lspci probably also needs some updating here -- it
currently doesn't do anything at all with LnkCap2.

> > I also don't have a good feel for whether this is urgent enough to be
> > a v4.20 fix or whether it can wait for v4.21.  Evidence either way
> > would help.
> 
> I'd like it to land for 4.19 and 4.20 at least.  Alternatively, we
> could revert all of the drm patches to and bring back all the open
> coded implementations, but it's a fairly large number of patches to
> revert.

OK, sounds like it makes sense to do this for v4.20 and backport it at
least to v4.19 stable.  I do want to get the places below fixed also.
They may not be as urgent, but we might as well try and make
everything consistent while we're looking at it.

> > We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
> > (hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
> > qla24xx_pci_info_str(), and maybe a couple other places.
> >
> > Bjorn
> >
> >
> > commit 871f73abf4b8e6aee8a206775f944ede7c7d7250
> > Author: Mikulas Patocka <mpatocka@redhat.com>
> > Date:   Tue Oct 30 12:36:08 2018 -0400
> >
> >     PCI: Fix incorrect value returned from pcie_get_speed_cap()
> >
> >     The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks.  We must mask
> >     the register and compare it against them.
> >
> >     This patch fixes errors "amdgpu: [powerplay] failed to send message 261 ret
> >     is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because the
> >     slot is being incorrectly reported as PCIe-v3 capable.
> >
> >     Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> >     Link: https://bugs.freedesktop.org/show_bug.cgi?id=108704
> >     Link: https://bugs.freedesktop.org/show_bug.cgi?id=108778
> >     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >     [bhelgaas: update comment, remove use of PCI_EXP_LNKCAP_SLS_8_0GB and
> >     PCI_EXP_LNKCAP_SLS_16_0GB since those should be covered by PCI_EXP_LNKCAP2]
> >     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> >     Acked-by: Alex Deucher <alexander.deucher@amd.com>
> >     Cc: stable@vger.kernel.org      # v4.17+
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index d068f11d08a7..8563d1d9b102 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5556,9 +5556,13 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
> >         u32 lnkcap2, lnkcap;
> >
> >         /*
> > -        * PCIe r4.0 sec 7.5.3.18 recommends using the Supported Link
> > -        * Speeds Vector in Link Capabilities 2 when supported, falling
> > -        * back to Max Link Speed in Link Capabilities otherwise.
> > +        * Link Capabilities 2 was added in PCIe r3.0, sec 7.8.18.  The
> > +        * implementation note there recommends using the Supported Link
> > +        * Speeds Vector in Link Capabilities 2 when supported.
> > +        *
> > +        * Without Link Capabilities 2, i.e., prior to PCIe r3.0, software
> > +        * should use the Supported Link Speeds field in Link Capabilities,
> > +        * where only 2.5 GT/s and 5.0 GT/s speeds were defined.
> >          */
> >         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
> >         if (lnkcap2) { /* PCIe r3.0-compliant */
> > @@ -5575,13 +5579,9 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
> >
> >         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> >         if (lnkcap) {
> > -               if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > -                       return PCIE_SPEED_16_0GT;
> > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > -                       return PCIE_SPEED_8_0GT;
> > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > +               if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
> >                         return PCIE_SPEED_5_0GT;
> > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> >                         return PCIE_SPEED_2_5GT;
> >         }
> >
Alex Deucher Nov. 20, 2018, 4:29 p.m. | #7
On Tue, Nov 20, 2018 at 10:13 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, Nov 20, 2018 at 09:17:52AM -0500, Alex Deucher wrote:
> > On Mon, Nov 19, 2018 at 7:47 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> > > > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > > > the register and compare it against them.
> > > >
> > > > This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> > > > ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> > > > the slot is being incorrectly reported as PCIe-v3 capable.
> > > >
> > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > > > Cc: stable@vger.kernel.org    # v4.17+
> > > >
> > > > ---
> > > >  drivers/pci/pci.c |    8 ++++----
> > > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > >
> > > > Index: linux-4.19/drivers/pci/pci.c
> > > > ===================================================================
> > > > --- linux-4.19.orig/drivers/pci/pci.c 2018-10-30 16:58:58.000000000 +0100
> > > > +++ linux-4.19/drivers/pci/pci.c      2018-10-30 16:58:58.000000000 +0100
> > > > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> > > >
> > > >       pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > > >       if (lnkcap) {
> > > > -             if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > > > +             if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> > > >                       return PCIE_SPEED_16_0GT;
> > > > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > > > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> > > >                       return PCIE_SPEED_8_0GT;
> > > > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > > > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> > > >                       return PCIE_SPEED_5_0GT;
> > > > -             else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > > > +             else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> > > >                       return PCIE_SPEED_2_5GT;
> > > >       }
> > > >
> > >
> > > I'd like to apply this as below, where I removed the 8_0GB and 16_0GB
> > > cases as recommended by the spec.  I can't test it myself, and the
> > > bugzillas don't contain enough information for me to confirm that the
> > > patch below is enough (the "lspci -vv" output of the root port and GPU
> > > is what I would need).
> > >
> > > I'm confused about the fact that 6cf57be0f78e appeared in v4.17, but
> > > v4.18 works fine according to both bugzillas.
> >
> > This issue affects AMD GPUs because we switched from using an open
> > coded check for pcie link speeds in the driver to using the common
> > pcie variants in
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.19&id=5d9a6330403271fbb1244f14380a7cc44662796f
>
> OK, thanks.  I added that SHA1 and a note to explain the connection.
>
> > The patch below would regress performance, at least on AMD GPUs, since
> > we'd end up reporting a max speed of gen 2 (5 GT/s) which would cause
> > the driver to limit the speed to gen2 even if gen3 or 4 are available.
>
> I guess this means these parts are broken with respect to the spec,
> since they support gen3 speeds but don't implement LnkCap2?

Sorry, I mis-read the patch.  The patch is correct and our parts are
compliant.  We are only hitting this issue on non-gen3 platforms which
would end up in the second case.  Sorry for the confusion.  You new
patch is:
Acked-by: Alex Deucher <alexander.deucher@amd.com>

Alex

>
> Gen2, i.e., PCIe r2.1, only defined 2.5 GT/s and 5 GT/s.  Devices
> capable of the higher speeds added by PCIe r3.0 are supposed to
> implement LnkCap2, but if we're even getting to this code, it means
> LnkCap2 was zero.
>
> If we confirm that this is a device defect, the question is what the
> best way to work around it is.  Probably the original patch is easier
> than some sort of quirk, but we'd need to expand the comment a little
> bit to explain why we're not following the spec recommendation.
>
> It looks like lspci probably also needs some updating here -- it
> currently doesn't do anything at all with LnkCap2.
>
> > > I also don't have a good feel for whether this is urgent enough to be
> > > a v4.20 fix or whether it can wait for v4.21.  Evidence either way
> > > would help.
> >
> > I'd like it to land for 4.19 and 4.20 at least.  Alternatively, we
> > could revert all of the drm patches to and bring back all the open
> > coded implementations, but it's a fairly large number of patches to
> > revert.
>
> OK, sounds like it makes sense to do this for v4.20 and backport it at
> least to v4.19 stable.  I do want to get the places below fixed also.
> They may not be as urgent, but we might as well try and make
> everything consistent while we're looking at it.
>
> > > We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
> > > (hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
> > > qla24xx_pci_info_str(), and maybe a couple other places.
> > >
> > > Bjorn
> > >
> > >
> > > commit 871f73abf4b8e6aee8a206775f944ede7c7d7250
> > > Author: Mikulas Patocka <mpatocka@redhat.com>
> > > Date:   Tue Oct 30 12:36:08 2018 -0400
> > >
> > >     PCI: Fix incorrect value returned from pcie_get_speed_cap()
> > >
> > >     The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks.  We must mask
> > >     the register and compare it against them.
> > >
> > >     This patch fixes errors "amdgpu: [powerplay] failed to send message 261 ret
> > >     is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because the
> > >     slot is being incorrectly reported as PCIe-v3 capable.
> > >
> > >     Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > >     Link: https://bugs.freedesktop.org/show_bug.cgi?id=108704
> > >     Link: https://bugs.freedesktop.org/show_bug.cgi?id=108778
> > >     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > >     [bhelgaas: update comment, remove use of PCI_EXP_LNKCAP_SLS_8_0GB and
> > >     PCI_EXP_LNKCAP_SLS_16_0GB since those should be covered by PCI_EXP_LNKCAP2]
> > >     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > >     Acked-by: Alex Deucher <alexander.deucher@amd.com>
> > >     Cc: stable@vger.kernel.org      # v4.17+
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index d068f11d08a7..8563d1d9b102 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -5556,9 +5556,13 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
> > >         u32 lnkcap2, lnkcap;
> > >
> > >         /*
> > > -        * PCIe r4.0 sec 7.5.3.18 recommends using the Supported Link
> > > -        * Speeds Vector in Link Capabilities 2 when supported, falling
> > > -        * back to Max Link Speed in Link Capabilities otherwise.
> > > +        * Link Capabilities 2 was added in PCIe r3.0, sec 7.8.18.  The
> > > +        * implementation note there recommends using the Supported Link
> > > +        * Speeds Vector in Link Capabilities 2 when supported.
> > > +        *
> > > +        * Without Link Capabilities 2, i.e., prior to PCIe r3.0, software
> > > +        * should use the Supported Link Speeds field in Link Capabilities,
> > > +        * where only 2.5 GT/s and 5.0 GT/s speeds were defined.
> > >          */
> > >         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
> > >         if (lnkcap2) { /* PCIe r3.0-compliant */
> > > @@ -5575,13 +5579,9 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
> > >
> > >         pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > >         if (lnkcap) {
> > > -               if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > > -                       return PCIE_SPEED_16_0GT;
> > > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > > -                       return PCIE_SPEED_8_0GT;
> > > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > > +               if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
> > >                         return PCIE_SPEED_5_0GT;
> > > -               else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > > +               else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> > >                         return PCIE_SPEED_2_5GT;
> > >         }
> > >
Bjorn Helgaas Nov. 26, 2018, 10:33 p.m. | #8
On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> the register and compare it against them.
> 
> This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> the slot is being incorrectly reported as PCIe-v3 capable.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

Applied the patch below to for-linus for v4.20 and backport to v4.17+
stable kernels, thanks!

I removed the "if (lnkcap)" test because the LNKCAP case is not
parallel to the LNKCAP2 case.  LNKCAP2 is optional so we have to test
for lnkcap2 being non-zero.  But LNKCAP is required for all devices.
If the config read of it fails, we should get either 0 or ~0, and
neither one will satisfy the 5.0 or 2.5 checks.

> Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> Cc: stable@vger.kernel.org	# v4.17+
> 
> ---
>  drivers/pci/pci.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> Index: linux-4.19/drivers/pci/pci.c
> ===================================================================
> --- linux-4.19.orig/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> +++ linux-4.19/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
>  
>  	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
>  	if (lnkcap) {
> -		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> +		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
>  			return PCIE_SPEED_16_0GT;
> -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
>  			return PCIE_SPEED_8_0GT;
> -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
>  			return PCIE_SPEED_5_0GT;
> -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
>  			return PCIE_SPEED_2_5GT;
>  	}
>  

commit 94ea01a6d9a6
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Mon Nov 26 10:37:13 2018 -0600

    PCI: Fix incorrect value returned from pcie_get_speed_cap()
    
    The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks.  We must mask
    the register and compare it against them.
    
    This fixes errors like this:
    
      amdgpu: [powerplay] failed to send message 261 ret is 0
    
    when PCIe-v3 card is plugged into a PCIe-v1 slot, because the slot is being
    incorrectly reported as PCIe-v3 capable.
    
    6cf57be0f78e, which appeared in v4.17, added pcie_get_speed_cap() with the
    incorrect test of PCI_EXP_LNKCAP_SLS as a bitmask.  5d9a63304032, which
    appeared in v4.19, changed amdgpu to use pcie_get_speed_cap(), so the
    amdgpu bug reports below are regressions in v4.19.
    
    Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
    Fixes: 5d9a63304032 ("drm/amdgpu: use pcie functions for link width and speed")
    Link: https://bugs.freedesktop.org/show_bug.cgi?id=108704
    Link: https://bugs.freedesktop.org/show_bug.cgi?id=108778
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    [bhelgaas: update comment, remove use of PCI_EXP_LNKCAP_SLS_8_0GB and
    PCI_EXP_LNKCAP_SLS_16_0GB since those should be covered by PCI_EXP_LNKCAP2,
    remove test of PCI_EXP_LNKCAP for zero, since that register is required]
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Acked-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org      # v4.17+

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d068f11d08a7..c9d8e3c837de 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5556,9 +5556,13 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
 	u32 lnkcap2, lnkcap;
 
 	/*
-	 * PCIe r4.0 sec 7.5.3.18 recommends using the Supported Link
-	 * Speeds Vector in Link Capabilities 2 when supported, falling
-	 * back to Max Link Speed in Link Capabilities otherwise.
+	 * Link Capabilities 2 was added in PCIe r3.0, sec 7.8.18.  The
+	 * implementation note there recommends using the Supported Link
+	 * Speeds Vector in Link Capabilities 2 when supported.
+	 *
+	 * Without Link Capabilities 2, i.e., prior to PCIe r3.0, software
+	 * should use the Supported Link Speeds field in Link Capabilities,
+	 * where only 2.5 GT/s and 5.0 GT/s speeds were defined.
 	 */
 	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
 	if (lnkcap2) { /* PCIe r3.0-compliant */
@@ -5574,16 +5578,10 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
 	}
 
 	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
-	if (lnkcap) {
-		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
-			return PCIE_SPEED_16_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
-			return PCIE_SPEED_8_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
-			return PCIE_SPEED_5_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
-			return PCIE_SPEED_2_5GT;
-	}
+	if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
+		return PCIE_SPEED_5_0GT;
+	else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
+		return PCIE_SPEED_2_5GT;
 
 	return PCI_SPEED_UNKNOWN;
 }
Bjorn Helgaas Nov. 26, 2018, 10:40 p.m. | #9
On Mon, Nov 19, 2018 at 06:47:04PM -0600, Bjorn Helgaas wrote:
> On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > the register and compare it against them.
> > 
> > This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> > ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> > the slot is being incorrectly reported as PCIe-v3 capable.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > Cc: stable@vger.kernel.org	# v4.17+
> > 
> > ---
> >  drivers/pci/pci.c |    8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > Index: linux-4.19/drivers/pci/pci.c
> > ===================================================================
> > --- linux-4.19.orig/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> > +++ linux-4.19/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> >  
> >  	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> >  	if (lnkcap) {
> > -		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > +		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> >  			return PCIE_SPEED_16_0GT;
> > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> >  			return PCIE_SPEED_8_0GT;
> > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> >  			return PCIE_SPEED_5_0GT;
> > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> >  			return PCIE_SPEED_2_5GT;
> >  	}

> We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
> (hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
> qla24xx_pci_info_str(), and maybe a couple other places.

Does anybody want to volunteer to fix the places above as well?  I
found them by grepping for PCI_EXP_LNKCAP, and they're all broken in
ways similar to pcie_get_speed_cap().  Possibly some of these places
could use pcie_get_speed_cap() directly.

Bjorn
Mikulas Patocka Nov. 27, 2018, 4:49 p.m. | #10
On Mon, 26 Nov 2018, Bjorn Helgaas wrote:

> On Mon, Nov 19, 2018 at 06:47:04PM -0600, Bjorn Helgaas wrote:
> > On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> > > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > > the register and compare it against them.
> > > 
> > > This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> > > ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> > > the slot is being incorrectly reported as PCIe-v3 capable.
> > > 
> > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > > Cc: stable@vger.kernel.org	# v4.17+
> > > 
> > > ---
> > >  drivers/pci/pci.c |    8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > 
> > > Index: linux-4.19/drivers/pci/pci.c
> > > ===================================================================
> > > --- linux-4.19.orig/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> > > +++ linux-4.19/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> > > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> > >  
> > >  	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > >  	if (lnkcap) {
> > > -		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > > +		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> > >  			return PCIE_SPEED_16_0GT;
> > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> > >  			return PCIE_SPEED_8_0GT;
> > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> > >  			return PCIE_SPEED_5_0GT;
> > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> > >  			return PCIE_SPEED_2_5GT;
> > >  	}
> 
> > We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
> > (hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
> > qla24xx_pci_info_str(), and maybe a couple other places.
> 
> Does anybody want to volunteer to fix the places above as well?  I
> found them by grepping for PCI_EXP_LNKCAP, and they're all broken in
> ways similar to pcie_get_speed_cap().  Possibly some of these places
> could use pcie_get_speed_cap() directly.
> 
> Bjorn
> 

They are not broken, they are masking the value with PCI_EXP_LNKCAP_SLS - 
that is correct.

pci_set_bus_speed:
                pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
                bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];

pcie_speeds:
	if ((linkcap & PCI_EXP_LNKCAP_SLS) != PCI_EXP_LNKCAP_SLS_8_0GB)

cobalt_pcie_status_show:
	just prints the values without doing anything with them

hba_ioctl_callback:
	gai->pci.link_speed_max = (u8)(caps & PCI_EXP_LNKCAP_SLS);

	gai->pci.link_width_max = (u8)((caps & PCI_EXP_LNKCAP_MLW) >> 4);

qla24xx_pci_info_str:
	lspeed = lstat & PCI_EXP_LNKCAP_SLS;
	lwidth = (lstat & PCI_EXP_LNKCAP_MLW) >> 4;

Mikulas
Bjorn Helgaas Nov. 27, 2018, 8:40 p.m. | #11
On Tue, Nov 27, 2018 at 11:49:43AM -0500, Mikulas Patocka wrote:
> On Mon, 26 Nov 2018, Bjorn Helgaas wrote:
> > On Mon, Nov 19, 2018 at 06:47:04PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Oct 30, 2018 at 12:36:08PM -0400, Mikulas Patocka wrote:
> > > > The macros PCI_EXP_LNKCAP_SLS_*GB are values, not bit masks. We must mask
> > > > the register and compare it against them.
> > > > 
> > > > This patch fixes errors "amdgpu: [powerplay] failed to send message 261
> > > > ret is 0" errors when PCIe-v3 card is plugged into PCIe-v1 slot, because
> > > > the slot is being incorrectly reported as PCIe-v3 capable.
> > > > 
> > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > > Fixes: 6cf57be0f78e ("PCI: Add pcie_get_speed_cap() to find max supported link speed")
> > > > Cc: stable@vger.kernel.org	# v4.17+
> > > > 
> > > > ---
> > > >  drivers/pci/pci.c |    8 ++++----
> > > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > > 
> > > > Index: linux-4.19/drivers/pci/pci.c
> > > > ===================================================================
> > > > --- linux-4.19.orig/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> > > > +++ linux-4.19/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
> > > > @@ -5492,13 +5492,13 @@ enum pci_bus_speed pcie_get_speed_cap(st
> > > >  
> > > >  	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > > >  	if (lnkcap) {
> > > > -		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
> > > > +		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
> > > >  			return PCIE_SPEED_16_0GT;
> > > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
> > > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
> > > >  			return PCIE_SPEED_8_0GT;
> > > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
> > > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
> > > >  			return PCIE_SPEED_5_0GT;
> > > > -		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
> > > > +		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
> > > >  			return PCIE_SPEED_2_5GT;
> > > >  	}
> > 
> > > We also need similar fixes in pci_set_bus_speed(), pcie_speeds()
> > > (hfi1), cobalt_pcie_status_show(), hba_ioctl_callback(),
> > > qla24xx_pci_info_str(), and maybe a couple other places.
> > 
> > Does anybody want to volunteer to fix the places above as well?  I
> > found them by grepping for PCI_EXP_LNKCAP, and they're all broken in
> > ways similar to pcie_get_speed_cap().  Possibly some of these places
> > could use pcie_get_speed_cap() directly.
> 
> They are not broken, they are masking the value with PCI_EXP_LNKCAP_SLS - 
> that is correct.

You're right, "broken" is an overstatement except in terms of
maintainance: they do happen to work correctly, but there's no reason
to reimplement the same functionality several places in slightly
different ways.

pcie_get_speed_cap() looks at LNKCAP2, then LNKCAP.  The other places
look only at LNKCAP.  Since these are all finding only the max link
speed (which can be determined using only LKNCAP), not the set of
supported speeds (which requires both LNKCAP2 and LNKCAP), they should
all do it the same way.

> pci_set_bus_speed:
>                 pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
>                 bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];
> 
> pcie_speeds:
> 	if ((linkcap & PCI_EXP_LNKCAP_SLS) != PCI_EXP_LNKCAP_SLS_8_0GB)
> 
> cobalt_pcie_status_show:
> 	just prints the values without doing anything with them
> 
> hba_ioctl_callback:
> 	gai->pci.link_speed_max = (u8)(caps & PCI_EXP_LNKCAP_SLS);
> 
> 	gai->pci.link_width_max = (u8)((caps & PCI_EXP_LNKCAP_MLW) >> 4);
> 
> qla24xx_pci_info_str:
> 	lspeed = lstat & PCI_EXP_LNKCAP_SLS;
> 	lwidth = (lstat & PCI_EXP_LNKCAP_MLW) >> 4;
> 
> Mikulas

Patch

Index: linux-4.19/drivers/pci/pci.c
===================================================================
--- linux-4.19.orig/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
+++ linux-4.19/drivers/pci/pci.c	2018-10-30 16:58:58.000000000 +0100
@@ -5492,13 +5492,13 @@  enum pci_bus_speed pcie_get_speed_cap(st
 
 	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
 	if (lnkcap) {
-		if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
+		if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_16_0GB)
 			return PCIE_SPEED_16_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_8_0GB)
 			return PCIE_SPEED_8_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) ==PCI_EXP_LNKCAP_SLS_5_0GB)
 			return PCIE_SPEED_5_0GT;
-		else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
+		else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
 			return PCIE_SPEED_2_5GT;
 	}