[v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
diff mbox series

Message ID 20191016144449.24646-1-kherbst@redhat.com
State Superseded
Delegated to: Bjorn Helgaas
Headers show
Series
  • [v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
Related show

Commit Message

Karol Herbst Oct. 16, 2019, 2:44 p.m. UTC
Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
states.

v2: convert to pci_dev quirk
    put a proper technical explanation of the issue as a in-code comment
v3: disable it only for certain combinations of intel and nvidia hardware

Signed-off-by: Karol Herbst <kherbst@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Mika Westerberg <mika.westerberg@intel.com>
Cc: linux-pci@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
---
 drivers/pci/pci.c    | 11 ++++++++++
 drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h  |  1 +
 3 files changed, 64 insertions(+)

Comments

Bjorn Helgaas Oct. 16, 2019, 7:14 p.m. UTC | #1
On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> states.
> 
> v2: convert to pci_dev quirk
>     put a proper technical explanation of the issue as a in-code comment
> v3: disable it only for certain combinations of intel and nvidia hardware
> 
> Signed-off-by: Karol Herbst <kherbst@redhat.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> Cc: Mika Westerberg <mika.westerberg@intel.com>
> Cc: linux-pci@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> ---
>  drivers/pci/pci.c    | 11 ++++++++++
>  drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pci.h  |  1 +
>  3 files changed, 64 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b97d9e10c9cc..8e056eb7e6ff 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
>  	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
>  }
>  
> +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> +{
> +	if (!dev->bus || !dev->bus->self)
> +		return false;
> +	return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> +}
> +
>  /**
>   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
>   *			     given PCI device
> @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
>  	   || (state == PCI_D2 && !dev->d2_support))
>  		return -EIO;
>  
> +	/* check if the bus controller causes issues */
> +	if (state != PCI_D0 && parent_broken_child_pm(dev))
> +		return 0;
> +
>  	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>  
>  	/*
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44c4ae1abd00..c2f20b745dd4 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
>  			      PCI_CLASS_DISPLAY_VGA, 8,
>  			      quirk_reset_lenovo_thinkpad_p50_nvgpu);
> +
> +/*
> + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> + * those were put into D3cold state if they were put into a non D0 PCI PM
> + * device state before doing so.
> + *
> + * This leads to various issue different issues which all manifest differently,
> + * but have the same root cause:
> + *  - AIML code execution hits an infinite loop (as the coe waits on device
> + *    memory to change).
> + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> + *    to handle well enough.
> + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> + *    userspace tries to access the GPU.
> + *
> + * In all cases dmesg will contain at least one line like this:
> + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
> + * followed by a lot of nouveau timeouts.
> + *
> + * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the
> + * PCIe bridge controller in order to power down the GPU.
> + * Nonetheless, there are other code paths inside the ACPI firmware which use
> + * other registers, which seem to work fine:
> + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> + *  - 0xb0 bit 0x10 (link disable)
> + * Changing the conditions inside the firmware by poking into the relevant
> + * addresses does resolve the issue, but it seemed to be ACPI private memory
> + * and not any device accessible memory at all, so there is no portable way of
> + * changing the conditions.
> + *
> + * The only systems where this behavior can be seen are hybrid graphics laptops
> + * with a secondary Nvidia Pascal GPU. It cannot be ruled out that this issue
> + * only occurs in combination with listed Intel PCIe bridge controllers and
> + * the mentioned GPUs or if it's only a hw bug in the bridge controller.
> + *
> + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
> + * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
> + * in the bridge controller rather than in the GPU.
> + *
> + * This issue was not able to be reproduced on non laptop systems.
> + */
> +
> +static void quirk_broken_nv_runpm(struct pci_dev *dev)
> +{
> +	dev->broken_nv_runpm = 1;

Can you use the existing PCI_DEV_FLAGS_NO_D3 flag for this instead of
adding a new flag?

I would put the parent_broken_child_pm() logic here, if possible,
e.g., something like:

  struct pci_dev *bridge = pci_upstream_bridge(dev);

  if (bridge &&
      bridge->vendor == PCI_VENDOR_ID_INTEL && bridge->device == 0x1901)
	dev->dev_flags |= PCI_DEV_FLAGS_NO_D3;

> +}
> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> +			      PCI_BASE_CLASS_DISPLAY, 16,
> +			      quirk_broken_nv_runpm);
> +/* kaby lake */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1901,
> +			quirk_broken_nv_runpm);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index ac8a6c4e1792..903a0b3a39ec 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -416,6 +416,7 @@ struct pci_dev {
>  	unsigned int	__aer_firmware_first_valid:1;
>  	unsigned int	__aer_firmware_first:1;
>  	unsigned int	broken_intx_masking:1;	/* INTx masking can't be used */
> +	unsigned int	broken_nv_runpm:1;	/* some combinations of intel bridge controller and nvidia GPUs break rtd3 */
>  	unsigned int	io_window_1k:1;		/* Intel bridge 1K I/O windows */
>  	unsigned int	irq_managed:1;
>  	unsigned int	has_secondary_link:1;
> -- 
> 2.21.0
>
Karol Herbst Oct. 16, 2019, 7:18 p.m. UTC | #2
but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
platform means of putting the device into D3cold, right? That's
actually what should still happen, just the D3hot step should be
skipped.

On Wed, Oct 16, 2019 at 9:14 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> > states.
> >
> > v2: convert to pci_dev quirk
> >     put a proper technical explanation of the issue as a in-code comment
> > v3: disable it only for certain combinations of intel and nvidia hardware
> >
> > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: Lyude Paul <lyude@redhat.com>
> > Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> > Cc: Mika Westerberg <mika.westerberg@intel.com>
> > Cc: linux-pci@vger.kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: nouveau@lists.freedesktop.org
> > ---
> >  drivers/pci/pci.c    | 11 ++++++++++
> >  drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/pci.h  |  1 +
> >  3 files changed, 64 insertions(+)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index b97d9e10c9cc..8e056eb7e6ff 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
> >       return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
> >  }
> >
> > +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> > +{
> > +     if (!dev->bus || !dev->bus->self)
> > +             return false;
> > +     return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> > +}
> > +
> >  /**
> >   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
> >   *                        given PCI device
> > @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> >          || (state == PCI_D2 && !dev->d2_support))
> >               return -EIO;
> >
> > +     /* check if the bus controller causes issues */
> > +     if (state != PCI_D0 && parent_broken_child_pm(dev))
> > +             return 0;
> > +
> >       pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> >
> >       /*
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 44c4ae1abd00..c2f20b745dd4 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
> >  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> >                             PCI_CLASS_DISPLAY_VGA, 8,
> >                             quirk_reset_lenovo_thinkpad_p50_nvgpu);
> > +
> > +/*
> > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> > + * those were put into D3cold state if they were put into a non D0 PCI PM
> > + * device state before doing so.
> > + *
> > + * This leads to various issue different issues which all manifest differently,
> > + * but have the same root cause:
> > + *  - AIML code execution hits an infinite loop (as the coe waits on device
> > + *    memory to change).
> > + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> > + *    to handle well enough.
> > + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> > + *    userspace tries to access the GPU.
> > + *
> > + * In all cases dmesg will contain at least one line like this:
> > + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
> > + * followed by a lot of nouveau timeouts.
> > + *
> > + * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the
> > + * PCIe bridge controller in order to power down the GPU.
> > + * Nonetheless, there are other code paths inside the ACPI firmware which use
> > + * other registers, which seem to work fine:
> > + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> > + *  - 0xb0 bit 0x10 (link disable)
> > + * Changing the conditions inside the firmware by poking into the relevant
> > + * addresses does resolve the issue, but it seemed to be ACPI private memory
> > + * and not any device accessible memory at all, so there is no portable way of
> > + * changing the conditions.
> > + *
> > + * The only systems where this behavior can be seen are hybrid graphics laptops
> > + * with a secondary Nvidia Pascal GPU. It cannot be ruled out that this issue
> > + * only occurs in combination with listed Intel PCIe bridge controllers and
> > + * the mentioned GPUs or if it's only a hw bug in the bridge controller.
> > + *
> > + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
> > + * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
> > + * in the bridge controller rather than in the GPU.
> > + *
> > + * This issue was not able to be reproduced on non laptop systems.
> > + */
> > +
> > +static void quirk_broken_nv_runpm(struct pci_dev *dev)
> > +{
> > +     dev->broken_nv_runpm = 1;
>
> Can you use the existing PCI_DEV_FLAGS_NO_D3 flag for this instead of
> adding a new flag?
>
> I would put the parent_broken_child_pm() logic here, if possible,
> e.g., something like:
>
>   struct pci_dev *bridge = pci_upstream_bridge(dev);
>
>   if (bridge &&
>       bridge->vendor == PCI_VENDOR_ID_INTEL && bridge->device == 0x1901)
>         dev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
>
> > +}
> > +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > +                           PCI_BASE_CLASS_DISPLAY, 16,
> > +                           quirk_broken_nv_runpm);
> > +/* kaby lake */
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1901,
> > +                     quirk_broken_nv_runpm);
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index ac8a6c4e1792..903a0b3a39ec 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -416,6 +416,7 @@ struct pci_dev {
> >       unsigned int    __aer_firmware_first_valid:1;
> >       unsigned int    __aer_firmware_first:1;
> >       unsigned int    broken_intx_masking:1;  /* INTx masking can't be used */
> > +     unsigned int    broken_nv_runpm:1;      /* some combinations of intel bridge controller and nvidia GPUs break rtd3 */
> >       unsigned int    io_window_1k:1;         /* Intel bridge 1K I/O windows */
> >       unsigned int    irq_managed:1;
> >       unsigned int    has_secondary_link:1;
> > --
> > 2.21.0
> >
Bjorn Helgaas Oct. 16, 2019, 9:37 p.m. UTC | #3
[+cc linux-acpi]

On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> platform means of putting the device into D3cold, right? That's
> actually what should still happen, just the D3hot step should be
> skipped.

If I understand correctly, when we put a device in D3cold on an ACPI
system, we do something like this:

  pci_set_power_state(D3cold)
    if (PCI_DEV_FLAGS_NO_D3)
      return 0                                   <-- nothing at all if quirked
    pci_raw_set_power_state
      pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
    __pci_complete_power_transition(D3cold)
      pci_platform_power_transition(D3cold)
        platform_pci_set_power_state(D3cold)
          acpi_pci_set_power_state(D3cold)
            acpi_device_set_power(ACPI_STATE_D3_COLD)
              ...
                acpi_evaluate_object("_OFF")     <-- set to D3cold

I did not understand the connection with platform (ACPI) power
management from your patch.  It sounds like you want this entire path
except that you want to skip the PCI_PM_CTRL write?

That seems like something Rafael should weigh in on.  I don't know
why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
methods, and I don't know what the effect of skipping that is.  It
seems a little messy to slice out this tiny piece from the middle, but
maybe it makes sense.

> On Wed, Oct 16, 2019 at 9:14 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> > > states.
> > >
> > > v2: convert to pci_dev quirk
> > >     put a proper technical explanation of the issue as a in-code comment
> > > v3: disable it only for certain combinations of intel and nvidia hardware
> > >
> > > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > > Cc: Lyude Paul <lyude@redhat.com>
> > > Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> > > Cc: Mika Westerberg <mika.westerberg@intel.com>
> > > Cc: linux-pci@vger.kernel.org
> > > Cc: linux-pm@vger.kernel.org
> > > Cc: dri-devel@lists.freedesktop.org
> > > Cc: nouveau@lists.freedesktop.org
> > > ---
> > >  drivers/pci/pci.c    | 11 ++++++++++
> > >  drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/pci.h  |  1 +
> > >  3 files changed, 64 insertions(+)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index b97d9e10c9cc..8e056eb7e6ff 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
> > >       return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
> > >  }
> > >
> > > +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> > > +{
> > > +     if (!dev->bus || !dev->bus->self)
> > > +             return false;
> > > +     return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> > > +}
> > > +
> > >  /**
> > >   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
> > >   *                        given PCI device
> > > @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> > >          || (state == PCI_D2 && !dev->d2_support))
> > >               return -EIO;
> > >
> > > +     /* check if the bus controller causes issues */
> > > +     if (state != PCI_D0 && parent_broken_child_pm(dev))
> > > +             return 0;
> > > +
> > >       pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > >
> > >       /*
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 44c4ae1abd00..c2f20b745dd4 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
> > >  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> > >                             PCI_CLASS_DISPLAY_VGA, 8,
> > >                             quirk_reset_lenovo_thinkpad_p50_nvgpu);
> > > +
> > > +/*
> > > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> > > + * those were put into D3cold state if they were put into a non D0 PCI PM
> > > + * device state before doing so.
> > > + *
> > > + * This leads to various issue different issues which all manifest differently,
> > > + * but have the same root cause:
> > > + *  - AIML code execution hits an infinite loop (as the coe waits on device
> > > + *    memory to change).
> > > + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> > > + *    to handle well enough.
> > > + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> > > + *    userspace tries to access the GPU.
> > > + *
> > > + * In all cases dmesg will contain at least one line like this:
> > > + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
> > > + * followed by a lot of nouveau timeouts.
> > > + *
> > > + * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the
> > > + * PCIe bridge controller in order to power down the GPU.
> > > + * Nonetheless, there are other code paths inside the ACPI firmware which use
> > > + * other registers, which seem to work fine:
> > > + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> > > + *  - 0xb0 bit 0x10 (link disable)
> > > + * Changing the conditions inside the firmware by poking into the relevant
> > > + * addresses does resolve the issue, but it seemed to be ACPI private memory
> > > + * and not any device accessible memory at all, so there is no portable way of
> > > + * changing the conditions.
> > > + *
> > > + * The only systems where this behavior can be seen are hybrid graphics laptops
> > > + * with a secondary Nvidia Pascal GPU. It cannot be ruled out that this issue
> > > + * only occurs in combination with listed Intel PCIe bridge controllers and
> > > + * the mentioned GPUs or if it's only a hw bug in the bridge controller.
> > > + *
> > > + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
> > > + * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
> > > + * in the bridge controller rather than in the GPU.
> > > + *
> > > + * This issue was not able to be reproduced on non laptop systems.
> > > + */
> > > +
> > > +static void quirk_broken_nv_runpm(struct pci_dev *dev)
> > > +{
> > > +     dev->broken_nv_runpm = 1;
> >
> > Can you use the existing PCI_DEV_FLAGS_NO_D3 flag for this instead of
> > adding a new flag?
> >
> > I would put the parent_broken_child_pm() logic here, if possible,
> > e.g., something like:
> >
> >   struct pci_dev *bridge = pci_upstream_bridge(dev);
> >
> >   if (bridge &&
> >       bridge->vendor == PCI_VENDOR_ID_INTEL && bridge->device == 0x1901)
> >         dev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
> >
> > > +}
> > > +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > > +                           PCI_BASE_CLASS_DISPLAY, 16,
> > > +                           quirk_broken_nv_runpm);
> > > +/* kaby lake */
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1901,
> > > +                     quirk_broken_nv_runpm);
> > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > index ac8a6c4e1792..903a0b3a39ec 100644
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -416,6 +416,7 @@ struct pci_dev {
> > >       unsigned int    __aer_firmware_first_valid:1;
> > >       unsigned int    __aer_firmware_first:1;
> > >       unsigned int    broken_intx_masking:1;  /* INTx masking can't be used */
> > > +     unsigned int    broken_nv_runpm:1;      /* some combinations of intel bridge controller and nvidia GPUs break rtd3 */
> > >       unsigned int    io_window_1k:1;         /* Intel bridge 1K I/O windows */
> > >       unsigned int    irq_managed:1;
> > >       unsigned int    has_secondary_link:1;
> > > --
> > > 2.21.0
> > >
Karol Herbst Oct. 16, 2019, 9:48 p.m. UTC | #4
On Wed, Oct 16, 2019 at 11:37 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc linux-acpi]
>
> On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> > but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> > platform means of putting the device into D3cold, right? That's
> > actually what should still happen, just the D3hot step should be
> > skipped.
>
> If I understand correctly, when we put a device in D3cold on an ACPI
> system, we do something like this:
>
>   pci_set_power_state(D3cold)
>     if (PCI_DEV_FLAGS_NO_D3)
>       return 0                                   <-- nothing at all if quirked
>     pci_raw_set_power_state
>       pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
>     __pci_complete_power_transition(D3cold)
>       pci_platform_power_transition(D3cold)
>         platform_pci_set_power_state(D3cold)
>           acpi_pci_set_power_state(D3cold)
>             acpi_device_set_power(ACPI_STATE_D3_COLD)
>               ...
>                 acpi_evaluate_object("_OFF")     <-- set to D3cold
>
> I did not understand the connection with platform (ACPI) power
> management from your patch.  It sounds like you want this entire path
> except that you want to skip the PCI_PM_CTRL write?
>

exactly. I am running with this workaround for a while now and never
had any fails with it anymore. The GPU gets turned off correctly and I
see the same power savings, just that the GPU can be powered on again.

> That seems like something Rafael should weigh in on.  I don't know
> why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
> methods, and I don't know what the effect of skipping that is.  It
> seems a little messy to slice out this tiny piece from the middle, but
> maybe it makes sense.
>

afaik when I was talking with others in the past about it, Windows is
doing that before using ACPI calls, but maybe they have some similar
workarounds for certain intel bridges as well? I am sure it affects
more than the one I am blacklisting here, but I rather want to check
each device before blacklisting all kabylake and sky lake bridges (as
those are the ones were this issue can be observed).

Sadly we had no luck getting any information about such workaround out
of Nvidia or Intel.

> > On Wed, Oct 16, 2019 at 9:14 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> > > > states.
> > > >
> > > > v2: convert to pci_dev quirk
> > > >     put a proper technical explanation of the issue as a in-code comment
> > > > v3: disable it only for certain combinations of intel and nvidia hardware
> > > >
> > > > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > > > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > > > Cc: Lyude Paul <lyude@redhat.com>
> > > > Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> > > > Cc: Mika Westerberg <mika.westerberg@intel.com>
> > > > Cc: linux-pci@vger.kernel.org
> > > > Cc: linux-pm@vger.kernel.org
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: nouveau@lists.freedesktop.org
> > > > ---
> > > >  drivers/pci/pci.c    | 11 ++++++++++
> > > >  drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> > > >  include/linux/pci.h  |  1 +
> > > >  3 files changed, 64 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > index b97d9e10c9cc..8e056eb7e6ff 100644
> > > > --- a/drivers/pci/pci.c
> > > > +++ b/drivers/pci/pci.c
> > > > @@ -805,6 +805,13 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
> > > >       return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
> > > >  }
> > > >
> > > > +static inline bool parent_broken_child_pm(struct pci_dev *dev)
> > > > +{
> > > > +     if (!dev->bus || !dev->bus->self)
> > > > +             return false;
> > > > +     return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
> > > > +}
> > > > +
> > > >  /**
> > > >   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
> > > >   *                        given PCI device
> > > > @@ -850,6 +857,10 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> > > >          || (state == PCI_D2 && !dev->d2_support))
> > > >               return -EIO;
> > > >
> > > > +     /* check if the bus controller causes issues */
> > > > +     if (state != PCI_D0 && parent_broken_child_pm(dev))
> > > > +             return 0;
> > > > +
> > > >       pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > > >
> > > >       /*
> > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > index 44c4ae1abd00..c2f20b745dd4 100644
> > > > --- a/drivers/pci/quirks.c
> > > > +++ b/drivers/pci/quirks.c
> > > > @@ -5268,3 +5268,55 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
> > > >  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> > > >                             PCI_CLASS_DISPLAY_VGA, 8,
> > > >                             quirk_reset_lenovo_thinkpad_p50_nvgpu);
> > > > +
> > > > +/*
> > > > + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> > > > + * those were put into D3cold state if they were put into a non D0 PCI PM
> > > > + * device state before doing so.
> > > > + *
> > > > + * This leads to various issue different issues which all manifest differently,
> > > > + * but have the same root cause:
> > > > + *  - AIML code execution hits an infinite loop (as the coe waits on device
> > > > + *    memory to change).
> > > > + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> > > > + *    to handle well enough.
> > > > + *  - sudden shutdowns, as the kernel identified an unrecoverable error after
> > > > + *    userspace tries to access the GPU.
> > > > + *
> > > > + * In all cases dmesg will contain at least one line like this:
> > > > + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
> > > > + * followed by a lot of nouveau timeouts.
> > > > + *
> > > > + * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the
> > > > + * PCIe bridge controller in order to power down the GPU.
> > > > + * Nonetheless, there are other code paths inside the ACPI firmware which use
> > > > + * other registers, which seem to work fine:
> > > > + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> > > > + *  - 0xb0 bit 0x10 (link disable)
> > > > + * Changing the conditions inside the firmware by poking into the relevant
> > > > + * addresses does resolve the issue, but it seemed to be ACPI private memory
> > > > + * and not any device accessible memory at all, so there is no portable way of
> > > > + * changing the conditions.
> > > > + *
> > > > + * The only systems where this behavior can be seen are hybrid graphics laptops
> > > > + * with a secondary Nvidia Pascal GPU. It cannot be ruled out that this issue
> > > > + * only occurs in combination with listed Intel PCIe bridge controllers and
> > > > + * the mentioned GPUs or if it's only a hw bug in the bridge controller.
> > > > + *
> > > > + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
> > > > + * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
> > > > + * in the bridge controller rather than in the GPU.
> > > > + *
> > > > + * This issue was not able to be reproduced on non laptop systems.
> > > > + */
> > > > +
> > > > +static void quirk_broken_nv_runpm(struct pci_dev *dev)
> > > > +{
> > > > +     dev->broken_nv_runpm = 1;
> > >
> > > Can you use the existing PCI_DEV_FLAGS_NO_D3 flag for this instead of
> > > adding a new flag?
> > >
> > > I would put the parent_broken_child_pm() logic here, if possible,
> > > e.g., something like:
> > >
> > >   struct pci_dev *bridge = pci_upstream_bridge(dev);
> > >
> > >   if (bridge &&
> > >       bridge->vendor == PCI_VENDOR_ID_INTEL && bridge->device == 0x1901)
> > >         dev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
> > >
> > > > +}
> > > > +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > > > +                           PCI_BASE_CLASS_DISPLAY, 16,
> > > > +                           quirk_broken_nv_runpm);
> > > > +/* kaby lake */
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1901,
> > > > +                     quirk_broken_nv_runpm);
> > > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > > index ac8a6c4e1792..903a0b3a39ec 100644
> > > > --- a/include/linux/pci.h
> > > > +++ b/include/linux/pci.h
> > > > @@ -416,6 +416,7 @@ struct pci_dev {
> > > >       unsigned int    __aer_firmware_first_valid:1;
> > > >       unsigned int    __aer_firmware_first:1;
> > > >       unsigned int    broken_intx_masking:1;  /* INTx masking can't be used */
> > > > +     unsigned int    broken_nv_runpm:1;      /* some combinations of intel bridge controller and nvidia GPUs break rtd3 */
> > > >       unsigned int    io_window_1k:1;         /* Intel bridge 1K I/O windows */
> > > >       unsigned int    irq_managed:1;
> > > >       unsigned int    has_secondary_link:1;
> > > > --
> > > > 2.21.0
> > > >
Bjorn Helgaas Oct. 16, 2019, 10:03 p.m. UTC | #5
On Wed, Oct 16, 2019 at 11:48:22PM +0200, Karol Herbst wrote:
> On Wed, Oct 16, 2019 at 11:37 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> > > but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> > > platform means of putting the device into D3cold, right? That's
> > > actually what should still happen, just the D3hot step should be
> > > skipped.
> >
> > If I understand correctly, when we put a device in D3cold on an ACPI
> > system, we do something like this:
> >
> >   pci_set_power_state(D3cold)
> >     if (PCI_DEV_FLAGS_NO_D3)
> >       return 0                                   <-- nothing at all if quirked
> >     pci_raw_set_power_state
> >       pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
> >     __pci_complete_power_transition(D3cold)
> >       pci_platform_power_transition(D3cold)
> >         platform_pci_set_power_state(D3cold)
> >           acpi_pci_set_power_state(D3cold)
> >             acpi_device_set_power(ACPI_STATE_D3_COLD)
> >               ...
> >                 acpi_evaluate_object("_OFF")     <-- set to D3cold
> >
> > I did not understand the connection with platform (ACPI) power
> > management from your patch.  It sounds like you want this entire path
> > except that you want to skip the PCI_PM_CTRL write?
> >
> 
> exactly. I am running with this workaround for a while now and never
> had any fails with it anymore. The GPU gets turned off correctly and I
> see the same power savings, just that the GPU can be powered on again.
> 
> > That seems like something Rafael should weigh in on.  I don't know
> > why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
> > methods, and I don't know what the effect of skipping that is.  It
> > seems a little messy to slice out this tiny piece from the middle, but
> > maybe it makes sense.
> >
> 
> afaik when I was talking with others in the past about it, Windows is
> doing that before using ACPI calls, but maybe they have some similar
> workarounds for certain intel bridges as well? I am sure it affects
> more than the one I am blacklisting here, but I rather want to check
> each device before blacklisting all kabylake and sky lake bridges (as
> those are the ones were this issue can be observed).

From a quick look at the ACPI spec, I didn't see conditions like "OSPM
must put PCI devices in D3hot before executing _OFF".  But obviously
there's *some* reason and I probably just missed it.

> Sadly we had no luck getting any information about such workaround out
> of Nvidia or Intel.

I'm not surprised; it doesn't seem like we really have the details
needed to get to a root cause yet.  I think what we really need is a
PCIe analyzer trace to see what happens when the device "falls off the
bus".

Bjorn
Mika Westerberg Oct. 21, 2019, 11:40 a.m. UTC | #6
Hi Karol,

Sorry for commenting late, I just came back from vacation.

On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> states.
> 
> v2: convert to pci_dev quirk
>     put a proper technical explanation of the issue as a in-code comment
> v3: disable it only for certain combinations of intel and nvidia hardware
> 
> Signed-off-by: Karol Herbst <kherbst@redhat.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> Cc: Mika Westerberg <mika.westerberg@intel.com>
> Cc: linux-pci@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> ---
>  drivers/pci/pci.c    | 11 ++++++++++
>  drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++

I may be missing something but why you can't do this in the nouveau
driver itself?
Karol Herbst Oct. 21, 2019, noon UTC | #7
On Mon, Oct 21, 2019 at 1:40 PM Mika Westerberg
<mika.westerberg@intel.com> wrote:
>
> Hi Karol,
>
> Sorry for commenting late, I just came back from vacation.
>
> On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> > states.
> >
> > v2: convert to pci_dev quirk
> >     put a proper technical explanation of the issue as a in-code comment
> > v3: disable it only for certain combinations of intel and nvidia hardware
> >
> > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: Lyude Paul <lyude@redhat.com>
> > Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> > Cc: Mika Westerberg <mika.westerberg@intel.com>
> > Cc: linux-pci@vger.kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: nouveau@lists.freedesktop.org
> > ---
> >  drivers/pci/pci.c    | 11 ++++++++++
> >  drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
>
> I may be missing something but why you can't do this in the nouveau
> driver itself?

What do you mean precisely? Move the quirk into nouveau, but keep the
changes to pci core?
Mika Westerberg Oct. 21, 2019, 12:06 p.m. UTC | #8
On Mon, Oct 21, 2019 at 02:00:46PM +0200, Karol Herbst wrote:
> On Mon, Oct 21, 2019 at 1:40 PM Mika Westerberg
> <mika.westerberg@intel.com> wrote:
> >
> > Hi Karol,
> >
> > Sorry for commenting late, I just came back from vacation.
> >
> > On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> > > states.
> > >
> > > v2: convert to pci_dev quirk
> > >     put a proper technical explanation of the issue as a in-code comment
> > > v3: disable it only for certain combinations of intel and nvidia hardware
> > >
> > > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > > Cc: Lyude Paul <lyude@redhat.com>
> > > Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> > > Cc: Mika Westerberg <mika.westerberg@intel.com>
> > > Cc: linux-pci@vger.kernel.org
> > > Cc: linux-pm@vger.kernel.org
> > > Cc: dri-devel@lists.freedesktop.org
> > > Cc: nouveau@lists.freedesktop.org
> > > ---
> > >  drivers/pci/pci.c    | 11 ++++++++++
> > >  drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> >
> > I may be missing something but why you can't do this in the nouveau
> > driver itself?
> 
> What do you mean precisely? Move the quirk into nouveau, but keep the
> changes to pci core?

No, just block runtime PM from the device in nouveau driver.
Karol Herbst Oct. 21, 2019, 1:02 p.m. UTC | #9
On Mon, Oct 21, 2019 at 2:06 PM Mika Westerberg
<mika.westerberg@intel.com> wrote:
>
> On Mon, Oct 21, 2019 at 02:00:46PM +0200, Karol Herbst wrote:
> > On Mon, Oct 21, 2019 at 1:40 PM Mika Westerberg
> > <mika.westerberg@intel.com> wrote:
> > >
> > > Hi Karol,
> > >
> > > Sorry for commenting late, I just came back from vacation.
> > >
> > > On Wed, Oct 16, 2019 at 04:44:49PM +0200, Karol Herbst wrote:
> > > > Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> > > > states.
> > > >
> > > > v2: convert to pci_dev quirk
> > > >     put a proper technical explanation of the issue as a in-code comment
> > > > v3: disable it only for certain combinations of intel and nvidia hardware
> > > >
> > > > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > > > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > > > Cc: Lyude Paul <lyude@redhat.com>
> > > > Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> > > > Cc: Mika Westerberg <mika.westerberg@intel.com>
> > > > Cc: linux-pci@vger.kernel.org
> > > > Cc: linux-pm@vger.kernel.org
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: nouveau@lists.freedesktop.org
> > > > ---
> > > >  drivers/pci/pci.c    | 11 ++++++++++
> > > >  drivers/pci/quirks.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> > >
> > > I may be missing something but why you can't do this in the nouveau
> > > driver itself?
> >
> > What do you mean precisely? Move the quirk into nouveau, but keep the
> > changes to pci core?
>
> No, just block runtime PM from the device in nouveau driver.

but that's not what the patch does. It only skips the PCI PM reg
write, but still let the ACPI method be invoked to put the device into
D3cold
Mika Westerberg Oct. 21, 2019, 1:21 p.m. UTC | #10
On Mon, Oct 21, 2019 at 03:02:23PM +0200, Karol Herbst wrote:
> > No, just block runtime PM from the device in nouveau driver.
> 
> but that's not what the patch does. It only skips the PCI PM reg
> write, but still let the ACPI method be invoked to put the device into
> D3cold

Oh, indeed it does. I did not realize that.

Which makes me wonder whether ACPI _OFF() expects the device to be in D0
for some reason.
Mika Westerberg Oct. 21, 2019, 1:33 p.m. UTC | #11
On Wed, Oct 16, 2019 at 11:48:22PM +0200, Karol Herbst wrote:
> On Wed, Oct 16, 2019 at 11:37 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc linux-acpi]
> >
> > On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> > > but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> > > platform means of putting the device into D3cold, right? That's
> > > actually what should still happen, just the D3hot step should be
> > > skipped.
> >
> > If I understand correctly, when we put a device in D3cold on an ACPI
> > system, we do something like this:
> >
> >   pci_set_power_state(D3cold)
> >     if (PCI_DEV_FLAGS_NO_D3)
> >       return 0                                   <-- nothing at all if quirked
> >     pci_raw_set_power_state
> >       pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
> >     __pci_complete_power_transition(D3cold)
> >       pci_platform_power_transition(D3cold)
> >         platform_pci_set_power_state(D3cold)
> >           acpi_pci_set_power_state(D3cold)
> >             acpi_device_set_power(ACPI_STATE_D3_COLD)
> >               ...
> >                 acpi_evaluate_object("_OFF")     <-- set to D3cold
> >
> > I did not understand the connection with platform (ACPI) power
> > management from your patch.  It sounds like you want this entire path
> > except that you want to skip the PCI_PM_CTRL write?
> >
> 
> exactly. I am running with this workaround for a while now and never
> had any fails with it anymore. The GPU gets turned off correctly and I
> see the same power savings, just that the GPU can be powered on again.
> 
> > That seems like something Rafael should weigh in on.  I don't know
> > why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
> > methods, and I don't know what the effect of skipping that is.  It
> > seems a little messy to slice out this tiny piece from the middle, but
> > maybe it makes sense.
> >
> 
> afaik when I was talking with others in the past about it, Windows is
> doing that before using ACPI calls, but maybe they have some similar
> workarounds for certain intel bridges as well? I am sure it affects
> more than the one I am blacklisting here, but I rather want to check
> each device before blacklisting all kabylake and sky lake bridges (as
> those are the ones were this issue can be observed).
> 
> Sadly we had no luck getting any information about such workaround out
> of Nvidia or Intel.

I really would like to provide you more information about such
workaround but I'm not aware of any ;-) I have not seen any issues like
this when D3cold is properly implemented in the platform.  That's why
I'm bit skeptical that this has anything to do with specific Intel PCIe
ports. More likely it is some power sequence in the _ON/_OFF() methods
that is run differently on Windows.
Karol Herbst Oct. 21, 2019, 1:54 p.m. UTC | #12
On Mon, Oct 21, 2019 at 3:33 PM Mika Westerberg
<mika.westerberg@intel.com> wrote:
>
> On Wed, Oct 16, 2019 at 11:48:22PM +0200, Karol Herbst wrote:
> > On Wed, Oct 16, 2019 at 11:37 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > [+cc linux-acpi]
> > >
> > > On Wed, Oct 16, 2019 at 09:18:32PM +0200, Karol Herbst wrote:
> > > > but setting the PCI_DEV_FLAGS_NO_D3 flag does prevent using the
> > > > platform means of putting the device into D3cold, right? That's
> > > > actually what should still happen, just the D3hot step should be
> > > > skipped.
> > >
> > > If I understand correctly, when we put a device in D3cold on an ACPI
> > > system, we do something like this:
> > >
> > >   pci_set_power_state(D3cold)
> > >     if (PCI_DEV_FLAGS_NO_D3)
> > >       return 0                                   <-- nothing at all if quirked
> > >     pci_raw_set_power_state
> > >       pci_write_config_word(PCI_PM_CTRL, D3hot)  <-- set to D3hot
> > >     __pci_complete_power_transition(D3cold)
> > >       pci_platform_power_transition(D3cold)
> > >         platform_pci_set_power_state(D3cold)
> > >           acpi_pci_set_power_state(D3cold)
> > >             acpi_device_set_power(ACPI_STATE_D3_COLD)
> > >               ...
> > >                 acpi_evaluate_object("_OFF")     <-- set to D3cold
> > >
> > > I did not understand the connection with platform (ACPI) power
> > > management from your patch.  It sounds like you want this entire path
> > > except that you want to skip the PCI_PM_CTRL write?
> > >
> >
> > exactly. I am running with this workaround for a while now and never
> > had any fails with it anymore. The GPU gets turned off correctly and I
> > see the same power savings, just that the GPU can be powered on again.
> >
> > > That seems like something Rafael should weigh in on.  I don't know
> > > why we set the device to D3hot with PCI_PM_CTRL before using the ACPI
> > > methods, and I don't know what the effect of skipping that is.  It
> > > seems a little messy to slice out this tiny piece from the middle, but
> > > maybe it makes sense.
> > >
> >
> > afaik when I was talking with others in the past about it, Windows is
> > doing that before using ACPI calls, but maybe they have some similar
> > workarounds for certain intel bridges as well? I am sure it affects
> > more than the one I am blacklisting here, but I rather want to check
> > each device before blacklisting all kabylake and sky lake bridges (as
> > those are the ones were this issue can be observed).
> >
> > Sadly we had no luck getting any information about such workaround out
> > of Nvidia or Intel.
>
> I really would like to provide you more information about such
> workaround but I'm not aware of any ;-) I have not seen any issues like
> this when D3cold is properly implemented in the platform.  That's why
> I'm bit skeptical that this has anything to do with specific Intel PCIe
> ports. More likely it is some power sequence in the _ON/_OFF() methods
> that is run differently on Windows.

yeah.. maybe. I really don't know what's the actual root cause. I just
know that with this workaround it works perfectly fine on my and some
other systems it was tested on. Do you know who would be best to
approach to get proper documentation about those methods and what are
the actual prerequisites of those methods?

We kind of tried with Nvidia, but maybe having a more specific
question would help here... I will try to bring that issue up the next
time with them.
Mika Westerberg Oct. 21, 2019, 2:08 p.m. UTC | #13
On Mon, Oct 21, 2019 at 03:54:09PM +0200, Karol Herbst wrote:
> > I really would like to provide you more information about such
> > workaround but I'm not aware of any ;-) I have not seen any issues like
> > this when D3cold is properly implemented in the platform.  That's why
> > I'm bit skeptical that this has anything to do with specific Intel PCIe
> > ports. More likely it is some power sequence in the _ON/_OFF() methods
> > that is run differently on Windows.
> 
> yeah.. maybe. I really don't know what's the actual root cause. I just
> know that with this workaround it works perfectly fine on my and some
> other systems it was tested on. Do you know who would be best to
> approach to get proper documentation about those methods and what are
> the actual prerequisites of those methods?

Those should be documented in the ACPI spec. Chapter 7 should explain
power resources and the device power methods in detail.
Karol Herbst Oct. 21, 2019, 2:49 p.m. UTC | #14
On Mon, Oct 21, 2019 at 4:09 PM Mika Westerberg
<mika.westerberg@intel.com> wrote:
>
> On Mon, Oct 21, 2019 at 03:54:09PM +0200, Karol Herbst wrote:
> > > I really would like to provide you more information about such
> > > workaround but I'm not aware of any ;-) I have not seen any issues like
> > > this when D3cold is properly implemented in the platform.  That's why
> > > I'm bit skeptical that this has anything to do with specific Intel PCIe
> > > ports. More likely it is some power sequence in the _ON/_OFF() methods
> > > that is run differently on Windows.
> >
> > yeah.. maybe. I really don't know what's the actual root cause. I just
> > know that with this workaround it works perfectly fine on my and some
> > other systems it was tested on. Do you know who would be best to
> > approach to get proper documentation about those methods and what are
> > the actual prerequisites of those methods?
>
> Those should be documented in the ACPI spec. Chapter 7 should explain
> power resources and the device power methods in detail.

either I looked up the wrong spec or the documentation isn't really
saying much there.
Mika Westerberg Oct. 21, 2019, 3:46 p.m. UTC | #15
On Mon, Oct 21, 2019 at 04:49:09PM +0200, Karol Herbst wrote:
> On Mon, Oct 21, 2019 at 4:09 PM Mika Westerberg
> <mika.westerberg@intel.com> wrote:
> >
> > On Mon, Oct 21, 2019 at 03:54:09PM +0200, Karol Herbst wrote:
> > > > I really would like to provide you more information about such
> > > > workaround but I'm not aware of any ;-) I have not seen any issues like
> > > > this when D3cold is properly implemented in the platform.  That's why
> > > > I'm bit skeptical that this has anything to do with specific Intel PCIe
> > > > ports. More likely it is some power sequence in the _ON/_OFF() methods
> > > > that is run differently on Windows.
> > >
> > > yeah.. maybe. I really don't know what's the actual root cause. I just
> > > know that with this workaround it works perfectly fine on my and some
> > > other systems it was tested on. Do you know who would be best to
> > > approach to get proper documentation about those methods and what are
> > > the actual prerequisites of those methods?
> >
> > Those should be documented in the ACPI spec. Chapter 7 should explain
> > power resources and the device power methods in detail.
> 
> either I looked up the wrong spec or the documentation isn't really
> saying much there.

Well it explains those methods, _PSx, _PRx and _ON()/_OFF(). In case of
PCIe device you also want to check PCIe spec. PCIe 5.0 section 5.8 "PCI
Function Power State Transitions" has a picture about the supported
power state transitions and there we can find that function must be in
D3hot before it can be transitioned into D3cold so if the _OFF() for
example blindly assumes that the device is in D0 when it is called, it
is a bug in the BIOS.

BTW, where can I find acpidump of such system?
Karol Herbst Oct. 22, 2019, 9:16 a.m. UTC | #16
I think there is something I totally forgot about:

When there was never a driver bound to the GPU, and if runtime power
management gets enabled on that device, runtime suspend/resume works
as expected (I am not 100% sure on if that always works, but I will
recheck that).
In the past I know that at some point I "bisected" the nouveau driver
to figure out what actually breaks it and found out that some script
executed with the help of an on-chip engine (signed script, signed
firmware, both vbios provided) makes it break. Debugging the script
lead me to the PCIe link speed changes done inside the script breaking
it.

But as "reverting" the speed change didn't make it work reliably
again, I think I need to get back on that and check if it's something
else. I will try to convert the script into C or python code to make
it more accessible to debug and hopefully I'll find something I
overlooked the last time.

On Mon, Oct 21, 2019 at 6:40 PM Karol Herbst <kherbst@redhat.com> wrote:
>
> On Mon, Oct 21, 2019 at 5:46 PM Mika Westerberg
> <mika.westerberg@intel.com> wrote:
> >
> > On Mon, Oct 21, 2019 at 04:49:09PM +0200, Karol Herbst wrote:
> > > On Mon, Oct 21, 2019 at 4:09 PM Mika Westerberg
> > > <mika.westerberg@intel.com> wrote:
> > > >
> > > > On Mon, Oct 21, 2019 at 03:54:09PM +0200, Karol Herbst wrote:
> > > > > > I really would like to provide you more information about such
> > > > > > workaround but I'm not aware of any ;-) I have not seen any issues like
> > > > > > this when D3cold is properly implemented in the platform.  That's why
> > > > > > I'm bit skeptical that this has anything to do with specific Intel PCIe
> > > > > > ports. More likely it is some power sequence in the _ON/_OFF() methods
> > > > > > that is run differently on Windows.
> > > > >
> > > > > yeah.. maybe. I really don't know what's the actual root cause. I just
> > > > > know that with this workaround it works perfectly fine on my and some
> > > > > other systems it was tested on. Do you know who would be best to
> > > > > approach to get proper documentation about those methods and what are
> > > > > the actual prerequisites of those methods?
> > > >
> > > > Those should be documented in the ACPI spec. Chapter 7 should explain
> > > > power resources and the device power methods in detail.
> > >
> > > either I looked up the wrong spec or the documentation isn't really
> > > saying much there.
> >
> > Well it explains those methods, _PSx, _PRx and _ON()/_OFF(). In case of
> > PCIe device you also want to check PCIe spec. PCIe 5.0 section 5.8 "PCI
> > Function Power State Transitions" has a picture about the supported
> > power state transitions and there we can find that function must be in
> > D3hot before it can be transitioned into D3cold so if the _OFF() for
> > example blindly assumes that the device is in D0 when it is called, it
> > is a bug in the BIOS.
> >
> > BTW, where can I find acpidump of such system?
>
> I am sure it's uploaded somewhere already. But it's not an issue of
> just one system. It's essentially hitting every single laptop with a
> skylake or kaby lake CPU + Nvidia GPU. I didn't see any system where
> it's actually working right now (and we are pestering nvidia about
> this issue for over a year already with no solution)
>
> I've attached an acpidump from my system.
Mika Westerberg Oct. 22, 2019, 12:44 p.m. UTC | #17
On Tue, Oct 22, 2019 at 11:16:14AM +0200, Karol Herbst wrote:
> I think there is something I totally forgot about:
> 
> When there was never a driver bound to the GPU, and if runtime power
> management gets enabled on that device, runtime suspend/resume works
> as expected (I am not 100% sure on if that always works, but I will
> recheck that).

AFAIK, if there is no driver bound to the PCI device it is left to D0
regardless of the runtime PM state which could explain why it works in
that case (it is never put into D3hot).

I looked at the acpidump you sent and there is one thing that may
explain the differences between Windows and Linux. Not sure if you were
aware of this already, though. The power resource PGOF() method has
this:

   If (((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05)))) {
      ...
   }  

If I read it right, the later condition tries to detect Linux which
fails nowadays but if you have acpi_rev_override in the command line (or
the machine is listed in acpi_rev_dmi_table) this check passes and does
some magic which is not clear to me. There is similar in PGON() side
which is used to turn the device back on.

You can check what actually happens when _ON()/_OFF() is called by
passing something like below to the kernel command line:

  acpi.trace_debug_layer=0x80 acpi.trace_debug_level=0x10 acpi.trace_method_name=\_SB.PCI0.PEG0.PG00._ON acpi.trace_state=method

(See also Documentation/firmware-guide/acpi/method-tracing.rst).

Trace goes to system dmesg.
Karol Herbst Oct. 22, 2019, 12:51 p.m. UTC | #18
On Tue, Oct 22, 2019 at 2:45 PM Mika Westerberg
<mika.westerberg@intel.com> wrote:
>
> On Tue, Oct 22, 2019 at 11:16:14AM +0200, Karol Herbst wrote:
> > I think there is something I totally forgot about:
> >
> > When there was never a driver bound to the GPU, and if runtime power
> > management gets enabled on that device, runtime suspend/resume works
> > as expected (I am not 100% sure on if that always works, but I will
> > recheck that).
>
> AFAIK, if there is no driver bound to the PCI device it is left to D0
> regardless of the runtime PM state which could explain why it works in
> that case (it is never put into D3hot).
>
> I looked at the acpidump you sent and there is one thing that may
> explain the differences between Windows and Linux. Not sure if you were
> aware of this already, though. The power resource PGOF() method has
> this:
>
>    If (((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05)))) {
>       ...
>    }
>

I think this is the fallback to some older method of runtime
suspending the device, and I think it will end up touching different
registers on the bridge controller which do not show the broken
behaviour.

You'll find references to following variables which all cause a link
to be powered down: Q0L2 (newest), P0L2, P0LD (oldest, I think).

Maybe I remember incorrectly and have to read the code again... okay,
the fallback path uses P0LD indeed. That's actually the only register
of those being documented by Intel afaik.

> If I read it right, the later condition tries to detect Linux which
> fails nowadays but if you have acpi_rev_override in the command line (or
> the machine is listed in acpi_rev_dmi_table) this check passes and does
> some magic which is not clear to me. There is similar in PGON() side
> which is used to turn the device back on.
>
> You can check what actually happens when _ON()/_OFF() is called by
> passing something like below to the kernel command line:
>
>   acpi.trace_debug_layer=0x80 acpi.trace_debug_level=0x10 acpi.trace_method_name=\_SB.PCI0.PEG0.PG00._ON acpi.trace_state=method
>
> (See also Documentation/firmware-guide/acpi/method-tracing.rst).
>
> Trace goes to system dmesg.

This sounds to be very helpful, I'll give it a try.
Mika Westerberg Oct. 23, 2019, 9 a.m. UTC | #19
On Tue, Oct 22, 2019 at 02:51:53PM +0200, Karol Herbst wrote:
> On Tue, Oct 22, 2019 at 2:45 PM Mika Westerberg
> <mika.westerberg@intel.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 11:16:14AM +0200, Karol Herbst wrote:
> > > I think there is something I totally forgot about:
> > >
> > > When there was never a driver bound to the GPU, and if runtime power
> > > management gets enabled on that device, runtime suspend/resume works
> > > as expected (I am not 100% sure on if that always works, but I will
> > > recheck that).
> >
> > AFAIK, if there is no driver bound to the PCI device it is left to D0
> > regardless of the runtime PM state which could explain why it works in
> > that case (it is never put into D3hot).
> >
> > I looked at the acpidump you sent and there is one thing that may
> > explain the differences between Windows and Linux. Not sure if you were
> > aware of this already, though. The power resource PGOF() method has
> > this:
> >
> >    If (((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05)))) {
> >       ...
> >    }
> >
> 
> I think this is the fallback to some older method of runtime
> suspending the device, and I think it will end up touching different
> registers on the bridge controller which do not show the broken
> behaviour.

I think it actually tries to identify older Windows and then Linux (the
_REV == 0x05 check comes from that). So at least some point Dell people
have experiment this on Linux.

Patch
diff mbox series

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b97d9e10c9cc..8e056eb7e6ff 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -805,6 +805,13 @@  static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
 	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
 }
 
+static inline bool parent_broken_child_pm(struct pci_dev *dev)
+{
+	if (!dev->bus || !dev->bus->self)
+		return false;
+	return dev->bus->self->broken_nv_runpm && dev->broken_nv_runpm;
+}
+
 /**
  * pci_raw_set_power_state - Use PCI PM registers to set the power state of
  *			     given PCI device
@@ -850,6 +857,10 @@  static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
 	   || (state == PCI_D2 && !dev->d2_support))
 		return -EIO;
 
+	/* check if the bus controller causes issues */
+	if (state != PCI_D0 && parent_broken_child_pm(dev))
+		return 0;
+
 	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
 
 	/*
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44c4ae1abd00..c2f20b745dd4 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5268,3 +5268,55 @@  static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
 			      PCI_CLASS_DISPLAY_VGA, 8,
 			      quirk_reset_lenovo_thinkpad_p50_nvgpu);
+
+/*
+ * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
+ * those were put into D3cold state if they were put into a non D0 PCI PM
+ * device state before doing so.
+ *
+ * This leads to various issue different issues which all manifest differently,
+ * but have the same root cause:
+ *  - AIML code execution hits an infinite loop (as the coe waits on device
+ *    memory to change).
+ *  - kernel crashes, as all pci reads return -1, which most code isn't able
+ *    to handle well enough.
+ *  - sudden shutdowns, as the kernel identified an unrecoverable error after
+ *    userspace tries to access the GPU.
+ *
+ * In all cases dmesg will contain at least one line like this:
+ * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
+ * followed by a lot of nouveau timeouts.
+ *
+ * ACPI code writes bit 0x80 to the not documented PCI register 0x248 of the
+ * PCIe bridge controller in order to power down the GPU.
+ * Nonetheless, there are other code paths inside the ACPI firmware which use
+ * other registers, which seem to work fine:
+ *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
+ *  - 0xb0 bit 0x10 (link disable)
+ * Changing the conditions inside the firmware by poking into the relevant
+ * addresses does resolve the issue, but it seemed to be ACPI private memory
+ * and not any device accessible memory at all, so there is no portable way of
+ * changing the conditions.
+ *
+ * The only systems where this behavior can be seen are hybrid graphics laptops
+ * with a secondary Nvidia Pascal GPU. It cannot be ruled out that this issue
+ * only occurs in combination with listed Intel PCIe bridge controllers and
+ * the mentioned GPUs or if it's only a hw bug in the bridge controller.
+ *
+ * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
+ * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
+ * in the bridge controller rather than in the GPU.
+ *
+ * This issue was not able to be reproduced on non laptop systems.
+ */
+
+static void quirk_broken_nv_runpm(struct pci_dev *dev)
+{
+	dev->broken_nv_runpm = 1;
+}
+DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+			      PCI_BASE_CLASS_DISPLAY, 16,
+			      quirk_broken_nv_runpm);
+/* kaby lake */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1901,
+			quirk_broken_nv_runpm);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index ac8a6c4e1792..903a0b3a39ec 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -416,6 +416,7 @@  struct pci_dev {
 	unsigned int	__aer_firmware_first_valid:1;
 	unsigned int	__aer_firmware_first:1;
 	unsigned int	broken_intx_masking:1;	/* INTx masking can't be used */
+	unsigned int	broken_nv_runpm:1;	/* some combinations of intel bridge controller and nvidia GPUs break rtd3 */
 	unsigned int	io_window_1k:1;		/* Intel bridge 1K I/O windows */
 	unsigned int	irq_managed:1;
 	unsigned int	has_secondary_link:1;