diff mbox

PCI: Revert "PCI: Add runtime PM support for PCIe ports"

Message ID 16155135.jUckSz7QPL@aspire.rjw.lan
State Not Applicable
Headers show

Commit Message

Rafael J. Wysocki Jan. 3, 2017, 11:13 p.m. UTC
On Tuesday, January 03, 2017 04:25:09 PM Bjorn Helgaas wrote:
> On Tue, Jan 03, 2017 at 10:38:24PM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 03, 2017 12:12:21 PM Bjorn Helgaas wrote:
> > > On Tue, Jan 03, 2017 at 05:31:30PM +0100, Peter Wu wrote:
> > > > On Tue, Jan 03, 2017 at 05:11:23PM +0100, Lukas Wunner wrote:
> > > > > [cc += Dave Airlie:
> > > > > 
> > > > > Dave, we're about to lose support for newer Optimus laptops which use
> > > > > _PR3 to cut power to the discrete GPU because Bjorn Helgaas has queued
> > > > > a commit on his for-linus branch to remove runtime PM for PCIe ports.
> > > > > This fixes a regression on Kilian Singer's laptop on which locking the
> > > > > screen breaks USB and PS/2 input devices:  Mouse movements are still
> > > > > visible, but button or key presses no longer have any effect.  The GPU
> > > > > is powered down upon locking the screen and the current theory is that
> > > > > this causes the issues.]
> > > > 
> > > > (+cc Alex: this might affect amdgpu/radeon too.]
> > > > 
> > > > Bjorn, please reconsider the rpm patch. Reverting support would
> > > > introduce other regressions (see issues below) and make future
> > > > Thunderbolt work harder (according to Lukas). If Kilian's laptop has
> > > > issues, what about a "temporary" quirk?
> > > 
> > > As I mentioned at the beginning, the outcome I'm hoping for is a patch
> > > that fixes Kilian's laptop while preserving the runtime PM support.
> > > 
> > > As I also mentioned at the beginning, preserving the runtime PM
> > > support at the expense of breaking Kilian's laptop is not one of the
> > > options.
> > 
> > But the revert doesn't really help.
> > 
> > It doesn't fix system suspend/resume on that laptop, which also breaks when
> > PCIe ports PM is enabled on it.
> > 
> > If you really want to use a sledgehammer approach here (which I don't recommend,
> > but that's your call), you can change the initial value of pci_bridge_d3_disable to
> > "true" (and update the pcie_ports_pm= command line to take "on" in case
> > someone wants to enable the feature).  That at least will take care of the
> > regression entirely and not just partly.
> 
> What the heck is the problem here?  I'm not trying to be difficult,
> but I didn't write this code and I'm not really interested in figuring
> out how to fix it, so my only real option is to solicit fixes and, if
> none appear, revert changes that break things.
> 
> As I've said more than once, I hope and expect that there is a better
> solution than reverting the patch.  But *I* am not going to write it.
> As soon as somebody proposes a better patch, I'll use it instead of
> the revert.
> 
> If you want to fix the regression by changing the
> pci_bridge_d3_disable value, all you have to do is post a patch doing
> that.

OK, please find appended.

> I really don't understand why people are so wrapped around the axle
> about this.  This is just the way Linux works -- we try really hard
> not to cause regressions on platforms that used to work.

I haven't seen anyone in this thread questioning that.

IMO the point people are trying to make is that reverting stuff may not really be
the way to go.

> I *SAID* in the very first posting of the revert that I assume Mika will have a
> better solution soon.

In which case I wouldn't have queued up a revert had I been you.

> When a better patch appears, I'll take that and drop the revert.
> What's the problem with that?

There are people for whom the commit in question fixed serious issues and the
revert would just take that away from them without any option to make their
systems work.

Thanks,
Rafael


---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] PCI / PM: Disable power management of PCIe ports by default

Due to regressions introduced by enabling power management of
PCIe ports by default, disable it for the time being, but still
allow it to be enabled via a kernel command line option.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=190861
Tentatively-signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This particular patch hasn't been tested, but the result of it should be the
same as passing pcie_port_pm=off in the kernel command line, which has
been tested in the BZ entry above.

---
 Documentation/admin-guide/kernel-parameters.txt |    3 --
 drivers/pci/pci.c                               |   26 +++++-------------------
 2 files changed, 7 insertions(+), 22 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bjorn Helgaas Jan. 4, 2017, 12:05 a.m. UTC | #1
On Wed, Jan 04, 2017 at 12:13:18AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, January 03, 2017 04:25:09 PM Bjorn Helgaas wrote:
> > On Tue, Jan 03, 2017 at 10:38:24PM +0100, Rafael J. Wysocki wrote:
> > > On Tuesday, January 03, 2017 12:12:21 PM Bjorn Helgaas wrote:
> > > > On Tue, Jan 03, 2017 at 05:31:30PM +0100, Peter Wu wrote:
> > > > > On Tue, Jan 03, 2017 at 05:11:23PM +0100, Lukas Wunner wrote:
> > > > > > [cc += Dave Airlie:
> > > > > > 
> > > > > > Dave, we're about to lose support for newer Optimus laptops which use
> > > > > > _PR3 to cut power to the discrete GPU because Bjorn Helgaas has queued
> > > > > > a commit on his for-linus branch to remove runtime PM for PCIe ports.
> > > > > > This fixes a regression on Kilian Singer's laptop on which locking the
> > > > > > screen breaks USB and PS/2 input devices:  Mouse movements are still
> > > > > > visible, but button or key presses no longer have any effect.  The GPU
> > > > > > is powered down upon locking the screen and the current theory is that
> > > > > > this causes the issues.]
> > > > > 
> > > > > (+cc Alex: this might affect amdgpu/radeon too.]
> > > > > 
> > > > > Bjorn, please reconsider the rpm patch. Reverting support would
> > > > > introduce other regressions (see issues below) and make future
> > > > > Thunderbolt work harder (according to Lukas). If Kilian's laptop has
> > > > > issues, what about a "temporary" quirk?
> > > > 
> > > > As I mentioned at the beginning, the outcome I'm hoping for is a patch
> > > > that fixes Kilian's laptop while preserving the runtime PM support.
> > > > 
> > > > As I also mentioned at the beginning, preserving the runtime PM
> > > > support at the expense of breaking Kilian's laptop is not one of the
> > > > options.
> > > 
> > > But the revert doesn't really help.
> > > 
> > > It doesn't fix system suspend/resume on that laptop, which also breaks when
> > > PCIe ports PM is enabled on it.
> > > 
> > > If you really want to use a sledgehammer approach here (which I don't recommend,
> > > but that's your call), you can change the initial value of pci_bridge_d3_disable to
> > > "true" (and update the pcie_ports_pm= command line to take "on" in case
> > > someone wants to enable the feature).  That at least will take care of the
> > > regression entirely and not just partly.
> > 
> > What the heck is the problem here?  I'm not trying to be difficult,
> > but I didn't write this code and I'm not really interested in figuring
> > out how to fix it, so my only real option is to solicit fixes and, if
> > none appear, revert changes that break things.
> > 
> > As I've said more than once, I hope and expect that there is a better
> > solution than reverting the patch.  But *I* am not going to write it.
> > As soon as somebody proposes a better patch, I'll use it instead of
> > the revert.
> > 
> > If you want to fix the regression by changing the
> > pci_bridge_d3_disable value, all you have to do is post a patch doing
> > that.
> 
> OK, please find appended.
> 
> > I really don't understand why people are so wrapped around the axle
> > about this.  This is just the way Linux works -- we try really hard
> > not to cause regressions on platforms that used to work.
> 
> I haven't seen anyone in this thread questioning that.
> 
> IMO the point people are trying to make is that reverting stuff may not really be
> the way to go.
> 
> > I *SAID* in the very first posting of the revert that I assume Mika will have a
> > better solution soon.
> 
> In which case I wouldn't have queued up a revert had I been you.
> 
> > When a better patch appears, I'll take that and drop the revert.
> > What's the problem with that?
> 
> There are people for whom the commit in question fixed serious issues and the
> revert would just take that away from them without any option to make their
> systems work.

I don't *want* to apply the revert.  It's on my for-linus branch as a
worst-case scenario change if we can't figure out a better fix.

The patch below is preferable, but I'd rather not take even it,
because it takes away functionality and forces people to use a boot
parameter to restore it.  I expect that somebody will figure out how
to fix the regression Kilian found and also keep the new functionality
(without requiring boot parameters) before v4.10.

Of course, if a better fix is far off and the patch below is much
better in the interim (avoids memory corruption, fixes problems for
more people, etc.), I will replace the revert with it.  I just
haven't seen the argument for doing that.

My main point is that Kilian found a pretty serious regression and
spent a lot of time bisecting it and testing things, and we need to
address it in some way before v4.10.

> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH] PCI / PM: Disable power management of PCIe ports by default
> 
> Due to regressions introduced by enabling power management of
> PCIe ports by default, disable it for the time being, but still
> allow it to be enabled via a kernel command line option.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=190861
> Tentatively-signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> This particular patch hasn't been tested, but the result of it should be the
> same as passing pcie_port_pm=off in the kernel command line, which has
> been tested in the BZ entry above.
> 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |    3 --
>  drivers/pci/pci.c                               |   26 +++++-------------------
>  2 files changed, 7 insertions(+), 22 deletions(-)
> 
> Index: linux-pm/drivers/pci/pci.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci.c
> +++ linux-pm/drivers/pci/pci.c
> @@ -108,17 +108,14 @@ unsigned int pcibios_max_latency = 255;
>  /* If set, the PCIe ARI capability will not be used. */
>  static bool pcie_ari_disabled;
>  
> -/* Disable bridge_d3 for all PCIe ports */
> -static bool pci_bridge_d3_disable;
> -/* Force bridge_d3 for all PCIe ports */
> -static bool pci_bridge_d3_force;
> +/* Enable bridge_d3 for all PCIe ports */
> +static bool pci_bridge_d3_enable;
>  
>  static int __init pcie_port_pm_setup(char *str)
>  {
> -	if (!strcmp(str, "off"))
> -		pci_bridge_d3_disable = true;
> -	else if (!strcmp(str, "force"))
> -		pci_bridge_d3_force = true;
> +	if (!strcmp(str, "on"))
> +		pci_bridge_d3_enable = true;
> +
>  	return 1;
>  }
>  __setup("pcie_port_pm=", pcie_port_pm_setup);
> @@ -2237,7 +2234,7 @@ bool pci_bridge_d3_possible(struct pci_d
>  	case PCI_EXP_TYPE_ROOT_PORT:
>  	case PCI_EXP_TYPE_UPSTREAM:
>  	case PCI_EXP_TYPE_DOWNSTREAM:
> -		if (pci_bridge_d3_disable)
> +		if (!pci_bridge_d3_enable)
>  			return false;
>  
>  		/*
> @@ -2247,17 +2244,6 @@ bool pci_bridge_d3_possible(struct pci_d
>  		if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
>  			return false;
>  
> -		if (pci_bridge_d3_force)
> -			return true;
> -
> -		/*
> -		 * It should be safe to put PCIe ports from 2015 or newer
> -		 * to D3.
> -		 */
> -		if (dmi_get_date(DMI_BIOS_DATE, &year, NULL, NULL) &&
> -		    year >= 2015) {
> -			return true;
> -		}
>  		break;
>  	}
>  
> Index: linux-pm/Documentation/admin-guide/kernel-parameters.txt
> ===================================================================
> --- linux-pm.orig/Documentation/admin-guide/kernel-parameters.txt
> +++ linux-pm/Documentation/admin-guide/kernel-parameters.txt
> @@ -2984,8 +2984,7 @@
>  			ports driver.
>  
>  	pcie_port_pm=	[PCIE] PCIe port power management handling:
> -		off	Disable power management of all PCIe ports
> -		force	Forcibly enable power management of all PCIe ports
> +		on	Enable power management of PCIe ports
>  
>  	pcie_pme=	[PCIE,PM] Native PCIe PME signaling options:
>  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 4, 2017, 1:09 a.m. UTC | #2
On Tuesday, January 03, 2017 06:05:57 PM Bjorn Helgaas wrote:
> On Wed, Jan 04, 2017 at 12:13:18AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 03, 2017 04:25:09 PM Bjorn Helgaas wrote:
> > > On Tue, Jan 03, 2017 at 10:38:24PM +0100, Rafael J. Wysocki wrote:
> > > > On Tuesday, January 03, 2017 12:12:21 PM Bjorn Helgaas wrote:
> > > > > On Tue, Jan 03, 2017 at 05:31:30PM +0100, Peter Wu wrote:
> > > > > > On Tue, Jan 03, 2017 at 05:11:23PM +0100, Lukas Wunner wrote:
> > > > > > > [cc += Dave Airlie:
> > > > > > > 
> > > > > > > Dave, we're about to lose support for newer Optimus laptops which use
> > > > > > > _PR3 to cut power to the discrete GPU because Bjorn Helgaas has queued
> > > > > > > a commit on his for-linus branch to remove runtime PM for PCIe ports.
> > > > > > > This fixes a regression on Kilian Singer's laptop on which locking the
> > > > > > > screen breaks USB and PS/2 input devices:  Mouse movements are still
> > > > > > > visible, but button or key presses no longer have any effect.  The GPU
> > > > > > > is powered down upon locking the screen and the current theory is that
> > > > > > > this causes the issues.]
> > > > > > 
> > > > > > (+cc Alex: this might affect amdgpu/radeon too.]
> > > > > > 
> > > > > > Bjorn, please reconsider the rpm patch. Reverting support would
> > > > > > introduce other regressions (see issues below) and make future
> > > > > > Thunderbolt work harder (according to Lukas). If Kilian's laptop has
> > > > > > issues, what about a "temporary" quirk?
> > > > > 
> > > > > As I mentioned at the beginning, the outcome I'm hoping for is a patch
> > > > > that fixes Kilian's laptop while preserving the runtime PM support.
> > > > > 
> > > > > As I also mentioned at the beginning, preserving the runtime PM
> > > > > support at the expense of breaking Kilian's laptop is not one of the
> > > > > options.
> > > > 
> > > > But the revert doesn't really help.
> > > > 
> > > > It doesn't fix system suspend/resume on that laptop, which also breaks when
> > > > PCIe ports PM is enabled on it.
> > > > 
> > > > If you really want to use a sledgehammer approach here (which I don't recommend,
> > > > but that's your call), you can change the initial value of pci_bridge_d3_disable to
> > > > "true" (and update the pcie_ports_pm= command line to take "on" in case
> > > > someone wants to enable the feature).  That at least will take care of the
> > > > regression entirely and not just partly.
> > > 
> > > What the heck is the problem here?  I'm not trying to be difficult,
> > > but I didn't write this code and I'm not really interested in figuring
> > > out how to fix it, so my only real option is to solicit fixes and, if
> > > none appear, revert changes that break things.
> > > 
> > > As I've said more than once, I hope and expect that there is a better
> > > solution than reverting the patch.  But *I* am not going to write it.
> > > As soon as somebody proposes a better patch, I'll use it instead of
> > > the revert.
> > > 
> > > If you want to fix the regression by changing the
> > > pci_bridge_d3_disable value, all you have to do is post a patch doing
> > > that.
> > 
> > OK, please find appended.
> > 
> > > I really don't understand why people are so wrapped around the axle
> > > about this.  This is just the way Linux works -- we try really hard
> > > not to cause regressions on platforms that used to work.
> > 
> > I haven't seen anyone in this thread questioning that.
> > 
> > IMO the point people are trying to make is that reverting stuff may not really be
> > the way to go.
> > 
> > > I *SAID* in the very first posting of the revert that I assume Mika will have a
> > > better solution soon.
> > 
> > In which case I wouldn't have queued up a revert had I been you.
> > 
> > > When a better patch appears, I'll take that and drop the revert.
> > > What's the problem with that?
> > 
> > There are people for whom the commit in question fixed serious issues and the
> > revert would just take that away from them without any option to make their
> > systems work.
> 
> I don't *want* to apply the revert.  It's on my for-linus branch as a
> worst-case scenario change if we can't figure out a better fix.
> 
> The patch below is preferable, but I'd rather not take even it,
> because it takes away functionality and forces people to use a boot
> parameter to restore it.  I expect that somebody will figure out how
> to fix the regression Kilian found and also keep the new functionality
> (without requiring boot parameters) before v4.10.
> 
> Of course, if a better fix is far off and the patch below is much
> better in the interim (avoids memory corruption, fixes problems for
> more people, etc.), I will replace the revert with it.  I just
> haven't seen the argument for doing that.

That is very simple.

If you revert, runtime PM will not work for any PCIe ports no matter what and
there is no way to enable it whatever.  Therefore, if there's anyone who
depends on it whatever the reason, they have no way to enable it other than
patching the kernel and rebuilding it.  There are users who may not be able
to do that.

With this patch, in turn, they at least have a kernel command line option to
enable the feature if they need it.  To me, this would be a good enough reason
to apply this patch instead of the revert.

> My main point is that Kilian found a pretty serious regression and
> spent a lot of time bisecting it and testing things, and we need to
> address it in some way before v4.10.

Let me repeat then that nobody here is questioning the need to address the
issue.

That said to me, reverting would be almost as bad as leaving it unfixed.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Wunner Jan. 4, 2017, 8:16 a.m. UTC | #3
On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> I don't *want* to apply the revert.  It's on my for-linus branch as a
> worst-case scenario change if we can't figure out a better fix.
> 
> The patch below is preferable, but I'd rather not take even it,
> because it takes away functionality and forces people to use a boot
> parameter to restore it.  I expect that somebody will figure out how
> to fix the regression Kilian found and also keep the new functionality
> (without requiring boot parameters) before v4.10.

The issue is constrained to hybrid graphics laptops with Nvidia discrete
GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
PCI core.

(AFAIUI, laptops with AMD discrete GPU are not affected as it is known
when and how to call an ACPI method versus using PR3.)

(Neither are laptops using the Nvidia proprietary driver as it doesn't
runtime suspend the card.  But battery life will be terrible then.)

We're at rc2 so the time frame for coming up with a fix is probably
4 weeks.  Peter and others have tried for months to reverse-engineer
how to handle runtime PM on newer Nvidia cards.  It seems likely that
we'll not find the ultimate solution to the problem within 4 weeks.

The way it is now, i.e. defaulting to PR3 when available, regresses
certain laptops such as Kilian's.  If on the other hand we default to
DSM when available, we'll regress certain other laptops, as Peter has
pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
approach either, ideally we'd want to use PR3 as Windows does.

As said, the only short-term solution I see is to add an "optimus"
module_param to nouveau to allow users to select which method to use.
So in Kilian's case an additional command line parameter would be
necessary to fix the issue.

Does anyone see a better solution or can we agree on this one?  If so
I can come up with a patch.  This could go in via Dave Airlie's tree.

Thanks,

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kilian Singer Jan. 4, 2017, 10:33 a.m. UTC | #4
Dear all,

the weird thing is also that when calling:

echo 0 > /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
or
echo on > /sys/bus/pci/devices/0000:01:00.0/power/control

on a command line then the command line crashes. The system stays responsive but still
becomes unresponsive when I lock the screen.

Maybe a similar thing are happening in the kernel. Maybe one could use the above fact
and a timeout to test for the problem and then take measures to deactivate the pm.

I hope this gives you some more clues.

Best regards
Kilian

PS: let me know which patches I should test. I am currently terribly busy with work. So if you would select the most urgent patches I would be delighted.



On 04-Jan-17 09:16, Lukas Wunner wrote:
> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
>> I don't *want* to apply the revert.  It's on my for-linus branch as a
>> worst-case scenario change if we can't figure out a better fix.
>>
>> The patch below is preferable, but I'd rather not take even it,
>> because it takes away functionality and forces people to use a boot
>> parameter to restore it.  I expect that somebody will figure out how
>> to fix the regression Kilian found and also keep the new functionality
>> (without requiring boot parameters) before v4.10.
> The issue is constrained to hybrid graphics laptops with Nvidia discrete
> GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
> PCI core.
>
> (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
> when and how to call an ACPI method versus using PR3.)
>
> (Neither are laptops using the Nvidia proprietary driver as it doesn't
> runtime suspend the card.  But battery life will be terrible then.)
>
> We're at rc2 so the time frame for coming up with a fix is probably
> 4 weeks.  Peter and others have tried for months to reverse-engineer
> how to handle runtime PM on newer Nvidia cards.  It seems likely that
> we'll not find the ultimate solution to the problem within 4 weeks.
>
> The way it is now, i.e. defaulting to PR3 when available, regresses
> certain laptops such as Kilian's.  If on the other hand we default to
> DSM when available, we'll regress certain other laptops, as Peter has
> pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
> approach either, ideally we'd want to use PR3 as Windows does.
>
> As said, the only short-term solution I see is to add an "optimus"
> module_param to nouveau to allow users to select which method to use.
> So in Kilian's case an additional command line parameter would be
> necessary to fix the issue.
>
> Does anyone see a better solution or can we agree on this one?  If so
> I can come up with a patch.  This could go in via Dave Airlie's tree.
>
> Thanks,
>
> Lukas

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mika Westerberg Jan. 4, 2017, 12:29 p.m. UTC | #5
On Wed, Jan 04, 2017 at 11:33:16AM +0100, Kilian Singer wrote:
> Dear all,
> 
> the weird thing is also that when calling:
> 
> echo 0 > /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
> or
> echo on > /sys/bus/pci/devices/0000:01:00.0/power/control
> 
> on a command line then the command line crashes. The system stays responsive but still
> becomes unresponsive when I lock the screen.

Most probably because the device has already been runtime suspended.
Setting them from command line may be too late.

> Maybe a similar thing are happening in the kernel. Maybe one could use the above fact
> and a timeout to test for the problem and then take measures to deactivate the pm.
> 
> I hope this gives you some more clues.

You could try to run 'lspci -vv' once the machine is unresponsive (if
you can do anything anymore). That should show whether the device and
the root port are still in D3.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Deucher, Alexander Jan. 4, 2017, 3:50 p.m. UTC | #6
> -----Original Message-----
> From: Lukas Wunner [mailto:lukas@wunner.de]
> Sent: Wednesday, January 04, 2017 3:17 AM
> To: Bjorn Helgaas
> Cc: Rafael J. Wysocki; Peter Wu; Mika Westerberg; Kilian Singer; linux-pci;
> Deucher, Alexander; Dave Airlie
> Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe ports"
> 
> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> > I don't *want* to apply the revert.  It's on my for-linus branch as a
> > worst-case scenario change if we can't figure out a better fix.
> >
> > The patch below is preferable, but I'd rather not take even it,
> > because it takes away functionality and forces people to use a boot
> > parameter to restore it.  I expect that somebody will figure out how
> > to fix the regression Kilian found and also keep the new functionality
> > (without requiring boot parameters) before v4.10.
> 
> The issue is constrained to hybrid graphics laptops with Nvidia discrete
> GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
> PCI core.
> 
> (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
> when and how to call an ACPI method versus using PR3.)
> 
> (Neither are laptops using the Nvidia proprietary driver as it doesn't
> runtime suspend the card.  But battery life will be terrible then.)
> 
> We're at rc2 so the time frame for coming up with a fix is probably
> 4 weeks.  Peter and others have tried for months to reverse-engineer
> how to handle runtime PM on newer Nvidia cards.  It seems likely that
> we'll not find the ultimate solution to the problem within 4 weeks.
> 
> The way it is now, i.e. defaulting to PR3 when available, regresses
> certain laptops such as Kilian's.  If on the other hand we default to
> DSM when available, we'll regress certain other laptops, as Peter has
> pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
> approach either, ideally we'd want to use PR3 as Windows does.
> 
> As said, the only short-term solution I see is to add an "optimus"
> module_param to nouveau to allow users to select which method to use.
> So in Kilian's case an additional command line parameter would be
> necessary to fix the issue.
> 
> Does anyone see a better solution or can we agree on this one?  If so
> I can come up with a patch.  This could go in via Dave Airlie's tree.

I think an option may be useful for testing, but I think the best solution is probably a quirk for Kilian's system unless there are a lot of users having similar problems to Killian.  PR3 standardizes dGPU power control so things should get better across the board.

Alex


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Wu Jan. 4, 2017, 9:09 p.m. UTC | #7
On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> > I don't *want* to apply the revert.  It's on my for-linus branch as a
> > worst-case scenario change if we can't figure out a better fix.
> > 
> > The patch below is preferable, but I'd rather not take even it,
> > because it takes away functionality and forces people to use a boot
> > parameter to restore it.  I expect that somebody will figure out how
> > to fix the regression Kilian found and also keep the new functionality
> > (without requiring boot parameters) before v4.10.
> 
> The issue is constrained to hybrid graphics laptops with Nvidia discrete
> GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
> PCI core.

The problem is not necessarily in the nouveau driver, the same problem
occurs when you enable RPM without loading nouveau. The issue is limited
though to some newer hybrid graphics laptops with Nvidia GPUs. While a
quirk can be added to nouveau, I think that a (temporary) quirk in core
would also be reasonable (since it also occurs without nouveau).

> (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
> when and how to call an ACPI method versus using PR3.)
> 
> (Neither are laptops using the Nvidia proprietary driver as it doesn't
> runtime suspend the card.  But battery life will be terrible then.)
> 
> We're at rc2 so the time frame for coming up with a fix is probably
> 4 weeks.  Peter and others have tried for months to reverse-engineer
> how to handle runtime PM on newer Nvidia cards.  It seems likely that
> we'll not find the ultimate solution to the problem within 4 weeks.

Yep, a quick proper fix seems unlikely.
[ Help/ideas are welcome, I suspect that these failures to restore power
on laptops designed for Win8+ all have the same cause, related to some
unknown interaction between ACPI and PCI. Some links:
https://bugzilla.kernel.org/show_bug.cgi?id=190861
https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]

> The way it is now, i.e. defaulting to PR3 when available, regresses
> certain laptops such as Kilian's.  If on the other hand we default to
> DSM when available, we'll regress certain other laptops, as Peter has
> pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
> approach either, ideally we'd want to use PR3 as Windows does.
> 
> As said, the only short-term solution I see is to add an "optimus"
> module_param to nouveau to allow users to select which method to use.
> So in Kilian's case an additional command line parameter would be
> necessary to fix the issue.
> 
> Does anyone see a better solution or can we agree on this one?  If so
> I can come up with a patch.  This could go in via Dave Airlie's tree.

As pcie_port_pm=off already reverts to DSM, I do not think that an
additional (temporary) nouveau module parameter is going to help. I
instead propose a (hopefully temporary) quirk in pci core that disables
D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
pcie_port_pm=off). Then the option pcie_port_pm=force can still be used
to test possible solutions in the future.
Rafael J. Wysocki Jan. 4, 2017, 9:55 p.m. UTC | #8
On Wednesday, January 04, 2017 09:16:39 AM Lukas Wunner wrote:
> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> > I don't *want* to apply the revert.  It's on my for-linus branch as a
> > worst-case scenario change if we can't figure out a better fix.
> > 
> > The patch below is preferable, but I'd rather not take even it,
> > because it takes away functionality and forces people to use a boot
> > parameter to restore it.  I expect that somebody will figure out how
> > to fix the regression Kilian found and also keep the new functionality
> > (without requiring boot parameters) before v4.10.
> 
> The issue is constrained to hybrid graphics laptops with Nvidia discrete
> GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
> PCI core.

But it may boil down to the fact that on some systems some ACPI power
resources are not usable to us.  They shouldn't be used for power management
at all then and I'm not sure whether or not addressing that in nouveau alone is
entirely viable.

> (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
> when and how to call an ACPI method versus using PR3.)
> 
> (Neither are laptops using the Nvidia proprietary driver as it doesn't
> runtime suspend the card.  But battery life will be terrible then.)
> 
> We're at rc2 so the time frame for coming up with a fix is probably
> 4 weeks.  Peter and others have tried for months to reverse-engineer
> how to handle runtime PM on newer Nvidia cards.  It seems likely that
> we'll not find the ultimate solution to the problem within 4 weeks.
> 
> The way it is now, i.e. defaulting to PR3 when available, regresses
> certain laptops such as Kilian's.  If on the other hand we default to
> DSM when available, we'll regress certain other laptops, as Peter has
> pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
> approach either, ideally we'd want to use PR3 as Windows does.
> 
> As said, the only short-term solution I see is to add an "optimus"
> module_param to nouveau to allow users to select which method to use.
> So in Kilian's case an additional command line parameter would be
> necessary to fix the issue.

There is a command line arg he can use for that already, so adding just
another one for the same purpose doesn't look like a great improvement to me.

> Does anyone see a better solution or can we agree on this one?  If so
> I can come up with a patch.  This could go in via Dave Airlie's tree.

We basically need to quirk ACPI power resources on those systems somehow.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 4, 2017, 9:58 p.m. UTC | #9
On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
> > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> > > I don't *want* to apply the revert.  It's on my for-linus branch as a
> > > worst-case scenario change if we can't figure out a better fix.
> > > 
> > > The patch below is preferable, but I'd rather not take even it,
> > > because it takes away functionality and forces people to use a boot
> > > parameter to restore it.  I expect that somebody will figure out how
> > > to fix the regression Kilian found and also keep the new functionality
> > > (without requiring boot parameters) before v4.10.
> > 
> > The issue is constrained to hybrid graphics laptops with Nvidia discrete
> > GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
> > PCI core.
> 
> The problem is not necessarily in the nouveau driver, the same problem
> occurs when you enable RPM without loading nouveau. The issue is limited
> though to some newer hybrid graphics laptops with Nvidia GPUs. While a
> quirk can be added to nouveau, I think that a (temporary) quirk in core
> would also be reasonable (since it also occurs without nouveau).
> 
> > (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
> > when and how to call an ACPI method versus using PR3.)
> > 
> > (Neither are laptops using the Nvidia proprietary driver as it doesn't
> > runtime suspend the card.  But battery life will be terrible then.)
> > 
> > We're at rc2 so the time frame for coming up with a fix is probably
> > 4 weeks.  Peter and others have tried for months to reverse-engineer
> > how to handle runtime PM on newer Nvidia cards.  It seems likely that
> > we'll not find the ultimate solution to the problem within 4 weeks.
> 
> Yep, a quick proper fix seems unlikely.
> [ Help/ideas are welcome, I suspect that these failures to restore power
> on laptops designed for Win8+ all have the same cause, related to some
> unknown interaction between ACPI and PCI. Some links:
> https://bugzilla.kernel.org/show_bug.cgi?id=190861
> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> 
> > The way it is now, i.e. defaulting to PR3 when available, regresses
> > certain laptops such as Kilian's.  If on the other hand we default to
> > DSM when available, we'll regress certain other laptops, as Peter has
> > pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
> > approach either, ideally we'd want to use PR3 as Windows does.
> > 
> > As said, the only short-term solution I see is to add an "optimus"
> > module_param to nouveau to allow users to select which method to use.
> > So in Kilian's case an additional command line parameter would be
> > necessary to fix the issue.
> > 
> > Does anyone see a better solution or can we agree on this one?  If so
> > I can come up with a patch.  This could go in via Dave Airlie's tree.
> 
> As pcie_port_pm=off already reverts to DSM, I do not think that an
> additional (temporary) nouveau module parameter is going to help. I
> instead propose a (hopefully temporary) quirk in pci core that disables
> D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
> pcie_port_pm=off). Then the option pcie_port_pm=force can still be used
> to test possible solutions in the future.

I would rather add a quirk to the ACPI core to prevent the power resources in
question from being enumerated.  Or even to prevent ACPI PM from being
used for the port in question.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Airlie Jan. 4, 2017, 11:21 p.m. UTC | #10
> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
> > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
> > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> > > > I don't *want* to apply the revert.  It's on my for-linus branch as a
> > > > worst-case scenario change if we can't figure out a better fix.
> > > > 
> > > > The patch below is preferable, but I'd rather not take even it,
> > > > because it takes away functionality and forces people to use a boot
> > > > parameter to restore it.  I expect that somebody will figure out how
> > > > to fix the regression Kilian found and also keep the new functionality
> > > > (without requiring boot parameters) before v4.10.
> > > 
> > > The issue is constrained to hybrid graphics laptops with Nvidia discrete
> > > GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
> > > PCI core.
> > 
> > The problem is not necessarily in the nouveau driver, the same problem
> > occurs when you enable RPM without loading nouveau. The issue is limited
> > though to some newer hybrid graphics laptops with Nvidia GPUs. While a
> > quirk can be added to nouveau, I think that a (temporary) quirk in core
> > would also be reasonable (since it also occurs without nouveau).
> > 
> > > (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
> > > when and how to call an ACPI method versus using PR3.)
> > > 
> > > (Neither are laptops using the Nvidia proprietary driver as it doesn't
> > > runtime suspend the card.  But battery life will be terrible then.)
> > > 
> > > We're at rc2 so the time frame for coming up with a fix is probably
> > > 4 weeks.  Peter and others have tried for months to reverse-engineer
> > > how to handle runtime PM on newer Nvidia cards.  It seems likely that
> > > we'll not find the ultimate solution to the problem within 4 weeks.
> > 
> > Yep, a quick proper fix seems unlikely.
> > [ Help/ideas are welcome, I suspect that these failures to restore power
> > on laptops designed for Win8+ all have the same cause, related to some
> > unknown interaction between ACPI and PCI. Some links:
> > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > 
> > > The way it is now, i.e. defaulting to PR3 when available, regresses
> > > certain laptops such as Kilian's.  If on the other hand we default to
> > > DSM when available, we'll regress certain other laptops, as Peter has
> > > pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
> > > approach either, ideally we'd want to use PR3 as Windows does.
> > > 
> > > As said, the only short-term solution I see is to add an "optimus"
> > > module_param to nouveau to allow users to select which method to use.
> > > So in Kilian's case an additional command line parameter would be
> > > necessary to fix the issue.
> > > 
> > > Does anyone see a better solution or can we agree on this one?  If so
> > > I can come up with a patch.  This could go in via Dave Airlie's tree.
> > 
> > As pcie_port_pm=off already reverts to DSM, I do not think that an
> > additional (temporary) nouveau module parameter is going to help. I
> > instead propose a (hopefully temporary) quirk in pci core that disables
> > D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
> > pcie_port_pm=off). Then the option pcie_port_pm=force can still be used
> > to test possible solutions in the future.
> 
> I would rather add a quirk to the ACPI core to prevent the power resources in
> question from being enumerated.  Or even to prevent ACPI PM from being
> used for the port in question.

I do have a W541 in a cupboard in the office somewhere, but I won't be close to
it for a couple of weeks. The W541 was the first place I tested the pm patches
so I'm kinda wondering whether it's all W541's or just some specific model/bios
combo.

However I'm pretty much unavailable to do anything much until late Jan on this.

Dave.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mika Westerberg Jan. 5, 2017, 10:49 a.m. UTC | #11
On Wed, Jan 04, 2017 at 10:58:10PM +0100, Rafael J. Wysocki wrote:
> I would rather add a quirk to the ACPI core to prevent the power resources in
> question from being enumerated.  Or even to prevent ACPI PM from being
> used for the port in question.

If we are going to add a quirk, I agree that it should be put to the
ACPI core.

However, Windows seems to be able to use _PR3 just fine. So there is
something that we are missing or do not implement properly which causes
all the troubles. IMHO we should try to find out what that difference is
and fix that if possible.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 5, 2017, 2:19 p.m. UTC | #12
On Thursday, January 05, 2017 12:49:40 PM Mika Westerberg wrote:
> On Wed, Jan 04, 2017 at 10:58:10PM +0100, Rafael J. Wysocki wrote:
> > I would rather add a quirk to the ACPI core to prevent the power resources in
> > question from being enumerated.  Or even to prevent ACPI PM from being
> > used for the port in question.
> 
> If we are going to add a quirk, I agree that it should be put to the
> ACPI core.
> 
> However, Windows seems to be able to use _PR3 just fine. So there is
> something that we are missing or do not implement properly which causes
> all the troubles. IMHO we should try to find out what that difference is
> and fix that if possible.

But we are time-constrained and that may take forever.

In the particular case of the Kilian's system, the power resource used by
_PR0 and _PR3 for the port actually operates the same hardware registers
as _PS0 and _PS3 for the VID device under the port (if I remember the name
correctly), so there is something in Windows switching between the two
depending on something.

I gess that depends on the version of Windows, but that's pure speculation.

We don't know what that is and a have little hope to learn about that, so let's
just say that this power resource is fishy and don't use it until we find out
(which frankly may or may not happen).

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Wunner Jan. 5, 2017, 2:42 p.m. UTC | #13
On Wed, Jan 04, 2017 at 10:09:54PM +0100, Peter Wu wrote:
> [ Help/ideas are welcome, I suspect that these failures to restore power
> on laptops designed for Win8+ all have the same cause, related to some
> unknown interaction between ACPI and PCI. Some links:
> https://bugzilla.kernel.org/show_bug.cgi?id=190861
> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]

Looking at Kilian's acpidump again I notice that the methods to power
the GPU on or off (GPON / GPOF) are called from two places:

- From the _PS0 and _PS3 methods of the GPU and
- from the _PR3 power resource of the root port above the GPU.

In the former case they're called for pre Windows 2013 or if VDAD is true.
In the latter case they're called unconditionally but GPOF becomes a no-op
in the pre Windows 2013 case.

This means that GPOF would be executed *twice* on Windows 2013+ if VDAD
is true.  I could imagine this to cause issues.

VDAD is at 0x7CE7D018 + 0xEE2 + 6. It's not set in the DSDT.

@Kilian, what do you get if you execute this as root:

dd iflag=skip_bytes,count_bytes skip=$((0x7CE7D018 + 0xEE2 + 6)) count=1 \
  if=/dev/mem 2>/dev/null | hexdump


Another oddity I've noticed is that when calling the Optimus DSM with
the capabilities function number (0x1A, NOUVEAU_DSM_OPTIMUS_CAPS) and
a special argument, it's possible to influence the behaviour of GPOF
(the method to power the GPU off):  GPOF is a no-op unless it's running
on Windows 2013+ or OMPR has value 0x03.  Initially OMPR has value 0x02,
but by setting bits 18 and 19 in the argument given to the capabilities
function, it can be set to 0x3.  After GPOF has finished, OMPR reverts
back to 0x02.  This means that pre Windows 2013, GPOF only has any effect
if the DSM capabilities function is called with an appropriate argument.
The same functionality can be seen in the Clevo P651RA ssdt3/7.  What
confuses me is that the bits are at position 18 and 19, but in
nouveau_switcheroo_optimus_dsm() we're setting bits 0 and 1 as well as
bits 24 and 25?  This may be a dumb question, I'm not familiar with
Optimus, only Macs.


Thanks,

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Wunner Jan. 5, 2017, 3:06 p.m. UTC | #14
On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
> > On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
> > > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
> > > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> > > > > I don't *want* to apply the revert.  It's on my for-linus branch as a
> > > > > worst-case scenario change if we can't figure out a better fix.
> > > > > 
> > > > > The patch below is preferable, but I'd rather not take even it,
> > > > > because it takes away functionality and forces people to use a boot
> > > > > parameter to restore it.  I expect that somebody will figure out how
> > > > > to fix the regression Kilian found and also keep the new functionality
> > > > > (without requiring boot parameters) before v4.10.
> > > > 
> > > > The issue is constrained to hybrid graphics laptops with Nvidia discrete
> > > > GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
> > > > PCI core.
> > > 
> > > The problem is not necessarily in the nouveau driver, the same problem
> > > occurs when you enable RPM without loading nouveau. The issue is limited
> > > though to some newer hybrid graphics laptops with Nvidia GPUs. While a
> > > quirk can be added to nouveau, I think that a (temporary) quirk in core
> > > would also be reasonable (since it also occurs without nouveau).
> > > 
> > > > (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
> > > > when and how to call an ACPI method versus using PR3.)
> > > > 
> > > > (Neither are laptops using the Nvidia proprietary driver as it doesn't
> > > > runtime suspend the card.  But battery life will be terrible then.)
> > > > 
> > > > We're at rc2 so the time frame for coming up with a fix is probably
> > > > 4 weeks.  Peter and others have tried for months to reverse-engineer
> > > > how to handle runtime PM on newer Nvidia cards.  It seems likely that
> > > > we'll not find the ultimate solution to the problem within 4 weeks.
> > > 
> > > Yep, a quick proper fix seems unlikely.
> > > [ Help/ideas are welcome, I suspect that these failures to restore power
> > > on laptops designed for Win8+ all have the same cause, related to some
> > > unknown interaction between ACPI and PCI. Some links:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > > 
> > > > The way it is now, i.e. defaulting to PR3 when available, regresses
> > > > certain laptops such as Kilian's.  If on the other hand we default to
> > > > DSM when available, we'll regress certain other laptops, as Peter has
> > > > pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
> > > > approach either, ideally we'd want to use PR3 as Windows does.
> > > > 
> > > > As said, the only short-term solution I see is to add an "optimus"
> > > > module_param to nouveau to allow users to select which method to use.
> > > > So in Kilian's case an additional command line parameter would be
> > > > necessary to fix the issue.
> > > > 
> > > > Does anyone see a better solution or can we agree on this one?  If so
> > > > I can come up with a patch.  This could go in via Dave Airlie's tree.
> > > 
> > > As pcie_port_pm=off already reverts to DSM, I do not think that an
> > > additional (temporary) nouveau module parameter is going to help. I
> > > instead propose a (hopefully temporary) quirk in pci core that disables
> > > D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
> > > pcie_port_pm=off). Then the option pcie_port_pm=force can still be used
> > > to test possible solutions in the future.
> > 
> > I would rather add a quirk to the ACPI core to prevent the power resources in
> > question from being enumerated.  Or even to prevent ACPI PM from being
> > used for the port in question.
> 
> I do have a W541 in a cupboard in the office somewhere, but I won't be close to
> it for a couple of weeks. The W541 was the first place I tested the pm patches
> so I'm kinda wondering whether it's all W541's or just some specific model/bios
> combo.
> 
> However I'm pretty much unavailable to do anything much until late Jan on this.

Is there anyone else at Red Hat who might be able to look into this?

ISTR that Hans de Goede is working on improving laptop support in Fedora,
and Peter Jones recently got a patch merged for the W541 with the exact
same firmware Kilian is using to work around a botched EFI memory map.
Adding them to cc: in the hope that they may be able to help.

@Peter, have you noticed issues with the discrete Nvidia GPU on your W541
related to runtime suspend and system sleep?

Thanks,

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Jones Jan. 5, 2017, 6:13 p.m. UTC | #15
On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
> > > On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
> > > > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
> > > > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> > > > > > I don't *want* to apply the revert.  It's on my for-linus branch as a
> > > > > > worst-case scenario change if we can't figure out a better fix.
> > > > > > 
> > > > > > The patch below is preferable, but I'd rather not take even it,
> > > > > > because it takes away functionality and forces people to use a boot
> > > > > > parameter to restore it.  I expect that somebody will figure out how
> > > > > > to fix the regression Kilian found and also keep the new functionality
> > > > > > (without requiring boot parameters) before v4.10.
> > > > > 
> > > > > The issue is constrained to hybrid graphics laptops with Nvidia discrete
> > > > > GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
> > > > > PCI core.
> > > > 
> > > > The problem is not necessarily in the nouveau driver, the same problem
> > > > occurs when you enable RPM without loading nouveau. The issue is limited
> > > > though to some newer hybrid graphics laptops with Nvidia GPUs. While a
> > > > quirk can be added to nouveau, I think that a (temporary) quirk in core
> > > > would also be reasonable (since it also occurs without nouveau).
> > > > 
> > > > > (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
> > > > > when and how to call an ACPI method versus using PR3.)
> > > > > 
> > > > > (Neither are laptops using the Nvidia proprietary driver as it doesn't
> > > > > runtime suspend the card.  But battery life will be terrible then.)
> > > > > 
> > > > > We're at rc2 so the time frame for coming up with a fix is probably
> > > > > 4 weeks.  Peter and others have tried for months to reverse-engineer
> > > > > how to handle runtime PM on newer Nvidia cards.  It seems likely that
> > > > > we'll not find the ultimate solution to the problem within 4 weeks.
> > > > 
> > > > Yep, a quick proper fix seems unlikely.
> > > > [ Help/ideas are welcome, I suspect that these failures to restore power
> > > > on laptops designed for Win8+ all have the same cause, related to some
> > > > unknown interaction between ACPI and PCI. Some links:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > > > 
> > > > > The way it is now, i.e. defaulting to PR3 when available, regresses
> > > > > certain laptops such as Kilian's.  If on the other hand we default to
> > > > > DSM when available, we'll regress certain other laptops, as Peter has
> > > > > pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
> > > > > approach either, ideally we'd want to use PR3 as Windows does.
> > > > > 
> > > > > As said, the only short-term solution I see is to add an "optimus"
> > > > > module_param to nouveau to allow users to select which method to use.
> > > > > So in Kilian's case an additional command line parameter would be
> > > > > necessary to fix the issue.
> > > > > 
> > > > > Does anyone see a better solution or can we agree on this one?  If so
> > > > > I can come up with a patch.  This could go in via Dave Airlie's tree.
> > > > 
> > > > As pcie_port_pm=off already reverts to DSM, I do not think that an
> > > > additional (temporary) nouveau module parameter is going to help. I
> > > > instead propose a (hopefully temporary) quirk in pci core that disables
> > > > D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
> > > > pcie_port_pm=off). Then the option pcie_port_pm=force can still be used
> > > > to test possible solutions in the future.
> > > 
> > > I would rather add a quirk to the ACPI core to prevent the power resources in
> > > question from being enumerated.  Or even to prevent ACPI PM from being
> > > used for the port in question.
> > 
> > I do have a W541 in a cupboard in the office somewhere, but I won't be close to
> > it for a couple of weeks. The W541 was the first place I tested the pm patches
> > so I'm kinda wondering whether it's all W541's or just some specific model/bios
> > combo.

They seem to all ship with the 1.10 firmware, and 2.80 is current (there
are a bunch of intermediate 2.xx versions).  Somewhere along the line
they introduced some bugs in the UEFI stuff, so it wouldn't be
surprising if there's bugs introduced elsewhere as well.

> > However I'm pretty much unavailable to do anything much until late Jan on this.
> 
> Is there anyone else at Red Hat who might be able to look into this?
> 
> ISTR that Hans de Goede is working on improving laptop support in Fedora,
> and Peter Jones recently got a patch merged for the W541 with the exact
> same firmware Kilian is using to work around a botched EFI memory map.
> Adding them to cc: in the hope that they may be able to help.
> 
> @Peter, have you noticed issues with the discrete Nvidia GPU on your W541
> related to runtime suspend and system sleep?

I was using a borrowed one (I can certainly find it again, but I'm not
working on graphics/pm really), but yeah - shutdown and lspci both broke
sometime after pci_pm_runtime_resume().  Here's the traceback from
SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67

Dave, if you know who in Westford should have a look at this, I can see
about getting them hardware.  I am more or less surrounded by that team.
David Airlie Jan. 5, 2017, 7:36 p.m. UTC | #16
(cc'ing Lyude, who has the hw also I think).

----- Original Message -----
> From: "Peter Jones" <pjones@redhat.com>
> To: "Lukas Wunner" <lukas@wunner.de>
> Cc: "David Airlie" <airlied@redhat.com>, "Rafael J. Wysocki" <rjw@rjwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.westerberg@linux.intel.com>, "Kilian Singer"
> <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-pci@vger.kernel.org>, "Alex Deucher"
> <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redhat.com>
> Sent: Friday, 6 January, 2017 4:13:23 AM
> Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe ports"
> 
> On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
> > On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
> > > > On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
> > > > > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
> > > > > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
> > > > > > > I don't *want* to apply the revert.  It's on my for-linus branch
> > > > > > > as a
> > > > > > > worst-case scenario change if we can't figure out a better fix.
> > > > > > > 
> > > > > > > The patch below is preferable, but I'd rather not take even it,
> > > > > > > because it takes away functionality and forces people to use a
> > > > > > > boot
> > > > > > > parameter to restore it.  I expect that somebody will figure out
> > > > > > > how
> > > > > > > to fix the regression Kilian found and also keep the new
> > > > > > > functionality
> > > > > > > (without requiring boot parameters) before v4.10.
> > > > > > 
> > > > > > The issue is constrained to hybrid graphics laptops with Nvidia
> > > > > > discrete
> > > > > > GPU using nouveau.  Hence it needs to be fixed in nouveau, not in
> > > > > > the
> > > > > > PCI core.
> > > > > 
> > > > > The problem is not necessarily in the nouveau driver, the same
> > > > > problem
> > > > > occurs when you enable RPM without loading nouveau. The issue is
> > > > > limited
> > > > > though to some newer hybrid graphics laptops with Nvidia GPUs. While
> > > > > a
> > > > > quirk can be added to nouveau, I think that a (temporary) quirk in
> > > > > core
> > > > > would also be reasonable (since it also occurs without nouveau).
> > > > > 
> > > > > > (AFAIUI, laptops with AMD discrete GPU are not affected as it is
> > > > > > known
> > > > > > when and how to call an ACPI method versus using PR3.)
> > > > > > 
> > > > > > (Neither are laptops using the Nvidia proprietary driver as it
> > > > > > doesn't
> > > > > > runtime suspend the card.  But battery life will be terrible then.)
> > > > > > 
> > > > > > We're at rc2 so the time frame for coming up with a fix is probably
> > > > > > 4 weeks.  Peter and others have tried for months to
> > > > > > reverse-engineer
> > > > > > how to handle runtime PM on newer Nvidia cards.  It seems likely
> > > > > > that
> > > > > > we'll not find the ultimate solution to the problem within 4 weeks.
> > > > > 
> > > > > Yep, a quick proper fix seems unlikely.
> > > > > [ Help/ideas are welcome, I suspect that these failures to restore
> > > > > power
> > > > > on laptops designed for Win8+ all have the same cause, related to
> > > > > some
> > > > > unknown interaction between ACPI and PCI. Some links:
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > > > > 
> > > > > > The way it is now, i.e. defaulting to PR3 when available, regresses
> > > > > > certain laptops such as Kilian's.  If on the other hand we default
> > > > > > to
> > > > > > DSM when available, we'll regress certain other laptops, as Peter
> > > > > > has
> > > > > > pointed out.  Whitelisting or blacklisting laptops doesn't seem a
> > > > > > good
> > > > > > approach either, ideally we'd want to use PR3 as Windows does.
> > > > > > 
> > > > > > As said, the only short-term solution I see is to add an "optimus"
> > > > > > module_param to nouveau to allow users to select which method to
> > > > > > use.
> > > > > > So in Kilian's case an additional command line parameter would be
> > > > > > necessary to fix the issue.
> > > > > > 
> > > > > > Does anyone see a better solution or can we agree on this one?  If
> > > > > > so
> > > > > > I can come up with a patch.  This could go in via Dave Airlie's
> > > > > > tree.
> > > > > 
> > > > > As pcie_port_pm=off already reverts to DSM, I do not think that an
> > > > > additional (temporary) nouveau module parameter is going to help. I
> > > > > instead propose a (hopefully temporary) quirk in pci core that
> > > > > disables
> > > > > D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
> > > > > pcie_port_pm=off). Then the option pcie_port_pm=force can still be
> > > > > used
> > > > > to test possible solutions in the future.
> > > > 
> > > > I would rather add a quirk to the ACPI core to prevent the power
> > > > resources in
> > > > question from being enumerated.  Or even to prevent ACPI PM from being
> > > > used for the port in question.
> > > 
> > > I do have a W541 in a cupboard in the office somewhere, but I won't be
> > > close to
> > > it for a couple of weeks. The W541 was the first place I tested the pm
> > > patches
> > > so I'm kinda wondering whether it's all W541's or just some specific
> > > model/bios
> > > combo.
> 
> They seem to all ship with the 1.10 firmware, and 2.80 is current (there
> are a bunch of intermediate 2.xx versions).  Somewhere along the line
> they introduced some bugs in the UEFI stuff, so it wouldn't be
> surprising if there's bugs introduced elsewhere as well.
> 
> > > However I'm pretty much unavailable to do anything much until late Jan on
> > > this.
> > 
> > Is there anyone else at Red Hat who might be able to look into this?
> > 
> > ISTR that Hans de Goede is working on improving laptop support in Fedora,
> > and Peter Jones recently got a patch merged for the W541 with the exact
> > same firmware Kilian is using to work around a botched EFI memory map.
> > Adding them to cc: in the hope that they may be able to help.
> > 
> > @Peter, have you noticed issues with the discrete Nvidia GPU on your W541
> > related to runtime suspend and system sleep?
> 
> I was using a borrowed one (I can certainly find it again, but I'm not
> working on graphics/pm really), but yeah - shutdown and lspci both broke
> sometime after pci_pm_runtime_resume().  Here's the traceback from
> SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
> 
> Dave, if you know who in Westford should have a look at this, I can see
> about getting them hardware.  I am more or less surrounded by that team.
> 
> --
>         Peter
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 6, 2017, 1:21 a.m. UTC | #17
On Thursday, January 05, 2017 03:42:20 PM Lukas Wunner wrote:
> On Wed, Jan 04, 2017 at 10:09:54PM +0100, Peter Wu wrote:
> > [ Help/ideas are welcome, I suspect that these failures to restore power
> > on laptops designed for Win8+ all have the same cause, related to some
> > unknown interaction between ACPI and PCI. Some links:
> > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> 
> Looking at Kilian's acpidump again I notice that the methods to power
> the GPU on or off (GPON / GPOF) are called from two places:
> 
> - From the _PS0 and _PS3 methods of the GPU and
> - from the _PR3 power resource of the root port above the GPU.
> 
> In the former case they're called for pre Windows 2013 or if VDAD is true.
> In the latter case they're called unconditionally but GPOF becomes a no-op
> in the pre Windows 2013 case.
> 
> This means that GPOF would be executed *twice* on Windows 2013+ if VDAD
> is true.  I could imagine this to cause issues.

Right.  Exactly my observation (http://marc.info/?l=linux-pci&m=148362622326066&w=2).

So on (newer) Windows something is done in order to make it work in addition to
the _PR3 _OSC handshake.

So, I'd like to try to follow the Mika's suggestion to use the response we get
from the _OSC handshake for \_SB and if that says "no _PR3", ignore power
resources for PCIe ports at least.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mika Westerberg Jan. 7, 2017, 6:50 a.m. UTC | #18
On Fri, Jan 06, 2017 at 02:21:11AM +0100, Rafael J. Wysocki wrote:
> On Thursday, January 05, 2017 03:42:20 PM Lukas Wunner wrote:
> > On Wed, Jan 04, 2017 at 10:09:54PM +0100, Peter Wu wrote:
> > > [ Help/ideas are welcome, I suspect that these failures to restore power
> > > on laptops designed for Win8+ all have the same cause, related to some
> > > unknown interaction between ACPI and PCI. Some links:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > 
> > Looking at Kilian's acpidump again I notice that the methods to power
> > the GPU on or off (GPON / GPOF) are called from two places:
> > 
> > - From the _PS0 and _PS3 methods of the GPU and
> > - from the _PR3 power resource of the root port above the GPU.
> > 
> > In the former case they're called for pre Windows 2013 or if VDAD is true.
> > In the latter case they're called unconditionally but GPOF becomes a no-op
> > in the pre Windows 2013 case.
> > 
> > This means that GPOF would be executed *twice* on Windows 2013+ if VDAD
> > is true.  I could imagine this to cause issues.
> 
> Right.  Exactly my observation (http://marc.info/?l=linux-pci&m=148362622326066&w=2).
> 
> So on (newer) Windows something is done in order to make it work in addition to
> the _PR3 _OSC handshake.
> 
> So, I'd like to try to follow the Mika's suggestion to use the response we get
> from the _OSC handshake for \_SB and if that says "no _PR3", ignore power
> resources for PCIe ports at least.

Kilian send the dmesg back to me and unfortunately the BIOS on that
machine acks use of _PR3:

[    0.699776] ACPI: Added _OSI(Module Device)
[    0.699777] ACPI: Added _OSI(Processor Device)
[    0.699777] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.699778] ACPI: Added _OSI(Processor Aggregator Device)
[    0.699783] ACPI : EC: EC started
[    0.700466] ACPI: \: Used as first EC
[    0.700467] ACPI: \: GPE=0x11, EC_CMD/EC_SC=0x66, EC_DATA=0x62
[    0.700468] ACPI: \: Used as boot ECDT EC to handle transactions
[    0.703905] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[    0.880246] ACPI: \_SB_: Supported caps: 0x0000009f
[    0.880278] ACPI: \_SB_: Acked caps: 0x0000009f (_PR3: on)

So ignoring _PR3 based on that will not solve the issue :-(
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Wu Jan. 7, 2017, 11:35 a.m. UTC | #19
On Thu, Jan 05, 2017 at 03:42:20PM +0100, Lukas Wunner wrote:
> On Wed, Jan 04, 2017 at 10:09:54PM +0100, Peter Wu wrote:
> > [ Help/ideas are welcome, I suspect that these failures to restore power
> > on laptops designed for Win8+ all have the same cause, related to some
> > unknown interaction between ACPI and PCI. Some links:
> > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> 
> Looking at Kilian's acpidump again I notice that the methods to power
> the GPU on or off (GPON / GPOF) are called from two places:
> 
> - From the _PS0 and _PS3 methods of the GPU and
> - from the _PR3 power resource of the root port above the GPU.
> 
> In the former case they're called for pre Windows 2013 or if VDAD is true.
> In the latter case they're called unconditionally but GPOF becomes a no-op
> in the pre Windows 2013 case.
> 
> This means that GPOF would be executed *twice* on Windows 2013+ if VDAD
> is true.  I could imagine this to cause issues.

There is a flag "DGOS" which is set when PGON/PGOF are called, so
multiple invocations should not matter for the powerdown/up sequence.
There are some SMI calls though that might have side-effects though.

> VDAD is at 0x7CE7D018 + 0xEE2 + 6. It's not set in the DSDT.
> 
> @Kilian, what do you get if you execute this as root:
> 
> dd iflag=skip_bytes,count_bytes skip=$((0x7CE7D018 + 0xEE2 + 6)) count=1 \
>   if=/dev/mem 2>/dev/null | hexdump
> 
> Another oddity I've noticed is that when calling the Optimus DSM with
> the capabilities function number (0x1A, NOUVEAU_DSM_OPTIMUS_CAPS) and
> a special argument, it's possible to influence the behaviour of GPOF
> (the method to power the GPU off):  GPOF is a no-op unless it's running
> on Windows 2013+ or OMPR has value 0x03.  Initially OMPR has value 0x02,
> but by setting bits 18 and 19 in the argument given to the capabilities
> function, it can be set to 0x3.  After GPOF has finished, OMPR reverts
> back to 0x02.  This means that pre Windows 2013, GPOF only has any effect
> if the DSM capabilities function is called with an appropriate argument.

Pre-Windows 2013 (Win8), the DSM method was used to regulate power.
Value 3 means that _PS3 should power down the dGPU.  Value 2 means that
the platform should not do that.

Starting from Win8, PR3 is supported so this is used instead of DSM.

> The same functionality can be seen in the Clevo P651RA ssdt3/7.  What
> confuses me is that the bits are at position 18 and 19, but in
> nouveau_switcheroo_optimus_dsm() we're setting bits 0 and 1 as well as
> bits 24 and 25?  This may be a dumb question, I'm not familiar with
> Optimus, only Macs.

nouveau_switcheroo_optimus_dsm calls two different functions:

 - NOUVEAU_DSM_OPTIMUS_CAPS (0x1A) with bits 25:24 set (value 3 << 24).
   This enables powering down in _PS3.
 - NOUVEAU_DSM_OPTIMUS_FLAGS (0x1B) with bits 1:0 set (value 3).
   This enables the "dGPU audio codec flag" via SMI.

When the old DSM method is in use, these functions are always invoked
before _PS3.
Hans de Goede Jan. 7, 2017, 11:45 a.m. UTC | #20
Hi,

On 05-01-17 16:06, Lukas Wunner wrote:
> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
>>> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
>>>> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
>>>>> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
>>>>>> I don't *want* to apply the revert.  It's on my for-linus branch as a
>>>>>> worst-case scenario change if we can't figure out a better fix.
>>>>>>
>>>>>> The patch below is preferable, but I'd rather not take even it,
>>>>>> because it takes away functionality and forces people to use a boot
>>>>>> parameter to restore it.  I expect that somebody will figure out how
>>>>>> to fix the regression Kilian found and also keep the new functionality
>>>>>> (without requiring boot parameters) before v4.10.
>>>>>
>>>>> The issue is constrained to hybrid graphics laptops with Nvidia discrete
>>>>> GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
>>>>> PCI core.
>>>>
>>>> The problem is not necessarily in the nouveau driver, the same problem
>>>> occurs when you enable RPM without loading nouveau. The issue is limited
>>>> though to some newer hybrid graphics laptops with Nvidia GPUs. While a
>>>> quirk can be added to nouveau, I think that a (temporary) quirk in core
>>>> would also be reasonable (since it also occurs without nouveau).
>>>>
>>>>> (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
>>>>> when and how to call an ACPI method versus using PR3.)
>>>>>
>>>>> (Neither are laptops using the Nvidia proprietary driver as it doesn't
>>>>> runtime suspend the card.  But battery life will be terrible then.)
>>>>>
>>>>> We're at rc2 so the time frame for coming up with a fix is probably
>>>>> 4 weeks.  Peter and others have tried for months to reverse-engineer
>>>>> how to handle runtime PM on newer Nvidia cards.  It seems likely that
>>>>> we'll not find the ultimate solution to the problem within 4 weeks.
>>>>
>>>> Yep, a quick proper fix seems unlikely.
>>>> [ Help/ideas are welcome, I suspect that these failures to restore power
>>>> on laptops designed for Win8+ all have the same cause, related to some
>>>> unknown interaction between ACPI and PCI. Some links:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
>>>>
>>>>> The way it is now, i.e. defaulting to PR3 when available, regresses
>>>>> certain laptops such as Kilian's.  If on the other hand we default to
>>>>> DSM when available, we'll regress certain other laptops, as Peter has
>>>>> pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
>>>>> approach either, ideally we'd want to use PR3 as Windows does.
>>>>>
>>>>> As said, the only short-term solution I see is to add an "optimus"
>>>>> module_param to nouveau to allow users to select which method to use.
>>>>> So in Kilian's case an additional command line parameter would be
>>>>> necessary to fix the issue.
>>>>>
>>>>> Does anyone see a better solution or can we agree on this one?  If so
>>>>> I can come up with a patch.  This could go in via Dave Airlie's tree.
>>>>
>>>> As pcie_port_pm=off already reverts to DSM, I do not think that an
>>>> additional (temporary) nouveau module parameter is going to help. I
>>>> instead propose a (hopefully temporary) quirk in pci core that disables
>>>> D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
>>>> pcie_port_pm=off). Then the option pcie_port_pm=force can still be used
>>>> to test possible solutions in the future.
>>>
>>> I would rather add a quirk to the ACPI core to prevent the power resources in
>>> question from being enumerated.  Or even to prevent ACPI PM from being
>>> used for the port in question.
>>
>> I do have a W541 in a cupboard in the office somewhere, but I won't be close to
>> it for a couple of weeks. The W541 was the first place I tested the pm patches
>> so I'm kinda wondering whether it's all W541's or just some specific model/bios
>> combo.
>>
>> However I'm pretty much unavailable to do anything much until late Jan on this.
>
> Is there anyone else at Red Hat who might be able to look into this?
>
> ISTR that Hans de Goede is working on improving laptop support in Fedora,
> and Peter Jones recently got a patch merged for the W541 with the exact
> same firmware Kilian is using to work around a botched EFI memory map.
> Adding them to cc: in the hope that they may be able to help.
>
> @Peter, have you noticed issues with the discrete Nvidia GPU on your W541
> related to runtime suspend and system sleep?

I've a W541 sitting in my home office at well. I will take it through
some gpu runtime suspend/resume testing. Which kernel introduces the
problem I'm looking for ?

I believe mine has the old BIOS / EFI which is less troublesome so I
will first see if I can reproduce the problem with that and then upgrade
to see if that introduces the problem.

Peter IIRC you said that after upgrading the firmware I need a new enough
kernel to be able to even boot, from which kernel onwards will the machine
boot with the new firmware ?

Also is it possible to downgrade the EFI again ? ...

Regards,

Hans
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Wunner Jan. 7, 2017, 12:16 p.m. UTC | #21
On Sat, Jan 07, 2017 at 12:45:35PM +0100, Hans de Goede wrote:
> I've a W541 sitting in my home office at well. I will take it through
> some gpu runtime suspend/resume testing. Which kernel introduces the
> problem I'm looking for ?

v4.8, it adds runtime PM for PCIe ports.  Or anything newer.


> I believe mine has the old BIOS / EFI which is less troublesome so I
> will first see if I can reproduce the problem with that and then upgrade
> to see if that introduces the problem.

Please be sure to make an acpidump before upgrading the BIOS
so that we can compare what they've changed.

Thanks!

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Wunner Jan. 7, 2017, 12:19 p.m. UTC | #22
On Sat, Jan 07, 2017 at 12:35:10PM +0100, Peter Wu wrote:
> On Thu, Jan 05, 2017 at 03:42:20PM +0100, Lukas Wunner wrote:
> > On Wed, Jan 04, 2017 at 10:09:54PM +0100, Peter Wu wrote:
> > > [ Help/ideas are welcome, I suspect that these failures to restore power
> > > on laptops designed for Win8+ all have the same cause, related to some
> > > unknown interaction between ACPI and PCI. Some links:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > 
> > Looking at Kilian's acpidump again I notice that the methods to power
> > the GPU on or off (GPON / GPOF) are called from two places:
> > 
> > - From the _PS0 and _PS3 methods of the GPU and
> > - from the _PR3 power resource of the root port above the GPU.
> > 
> > In the former case they're called for pre Windows 2013 or if VDAD is true.
> > In the latter case they're called unconditionally but GPOF becomes a no-op
> > in the pre Windows 2013 case.
> > 
> > This means that GPOF would be executed *twice* on Windows 2013+ if VDAD
> > is true.  I could imagine this to cause issues.
> 
> There is a flag "DGOS" which is set when PGON/PGOF are called, so
> multiple invocations should not matter for the powerdown/up sequence.
> There are some SMI calls though that might have side-effects though.

The PGON method becomes a no-op if DGOS is true.  But the PGOF method
doesn't check DGOS.

Thanks,

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Wu Jan. 7, 2017, 12:36 p.m. UTC | #23
On Sat, Jan 07, 2017 at 01:19:59PM +0100, Lukas Wunner wrote:
> On Sat, Jan 07, 2017 at 12:35:10PM +0100, Peter Wu wrote:
> > On Thu, Jan 05, 2017 at 03:42:20PM +0100, Lukas Wunner wrote:
> > > On Wed, Jan 04, 2017 at 10:09:54PM +0100, Peter Wu wrote:
> > > > [ Help/ideas are welcome, I suspect that these failures to restore power
> > > > on laptops designed for Win8+ all have the same cause, related to some
> > > > unknown interaction between ACPI and PCI. Some links:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > > 
> > > Looking at Kilian's acpidump again I notice that the methods to power
> > > the GPU on or off (GPON / GPOF) are called from two places:
> > > 
> > > - From the _PS0 and _PS3 methods of the GPU and
> > > - from the _PR3 power resource of the root port above the GPU.
> > > 
> > > In the former case they're called for pre Windows 2013 or if VDAD is true.
> > > In the latter case they're called unconditionally but GPOF becomes a no-op
> > > in the pre Windows 2013 case.
> > > 
> > > This means that GPOF would be executed *twice* on Windows 2013+ if VDAD
> > > is true.  I could imagine this to cause issues.
> > 
> > There is a flag "DGOS" which is set when PGON/PGOF are called, so
> > multiple invocations should not matter for the powerdown/up sequence.
> > There are some SMI calls though that might have side-effects though.
> 
> The PGON method becomes a no-op if DGOS is true.  But the PGOF method
> doesn't check DGOS.

You are right, GPON does not check that. Hopefully "VDAD" is 0 then when
_PS3 is called, otherwise it might invoke PGOF multiple times (though
NVP3._OFF and through _PS3).
Lukas Wunner Jan. 8, 2017, 2:05 p.m. UTC | #24
On Sat, Jan 07, 2017 at 01:36:35PM +0100, Peter Wu wrote:
> On Sat, Jan 07, 2017 at 01:19:59PM +0100, Lukas Wunner wrote:
> > On Sat, Jan 07, 2017 at 12:35:10PM +0100, Peter Wu wrote:
> > > On Thu, Jan 05, 2017 at 03:42:20PM +0100, Lukas Wunner wrote:
> > > > On Wed, Jan 04, 2017 at 10:09:54PM +0100, Peter Wu wrote:
> > > > > [ Help/ideas are welcome, I suspect that these failures to restore
> > > > > power on laptops designed for Win8+ all have the same cause,
> > > > > related to some unknown interaction between ACPI and PCI.
> > > > > Some links:
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > > > 
> > > > Looking at Kilian's acpidump again I notice that the methods to power
> > > > the GPU on or off (GPON / GPOF) are called from two places:
> > > > 
> > > > - From the _PS0 and _PS3 methods of the GPU and
> > > > - from the _PR3 power resource of the root port above the GPU.
> > > > 
> > > > In the former case they're called for pre Windows 2013 or if VDAD is
> > > > true.  In the latter case they're called unconditionally but GPOF
> > > > becomes a no-op in the pre Windows 2013 case.
> > > > 
> > > > This means that GPOF would be executed *twice* on Windows 2013+ if
> > > > VDAD is true.  I could imagine this to cause issues.
> > > 
> > > There is a flag "DGOS" which is set when PGON/PGOF are called, so
> > > multiple invocations should not matter for the powerdown/up sequence.
> > > There are some SMI calls though that might have side-effects though.
> > 
> > The PGON method becomes a no-op if DGOS is true.  But the PGOF method
> > doesn't check DGOS.
> 
> You are right, GPON does not check that. Hopefully "VDAD" is 0 then when
> _PS3 is called, otherwise it might invoke PGOF multiple times (though
> NVP3._OFF and through _PS3).

Kilian responded off-list that the byte containing the VDAD bit has
value 0x01.  So one of the 8 bits is set but I'm not sure if that's
the VDAD bit.  In the DSDT the bit is defined thus:

    OperationRegion (MNVS, SystemMemory, 0x7CE7D018, 0x1000)
    Field (MNVS, DWordAcc, NoLock, Preserve)
    {
        Offset (0xEE2),
        TPPP,   8,
        TPPC,   8,
        WKRS,   8,
        FNWK,   8,
        USBC,   8,
        ODDF,   8,
        VDAD,   1,                    <-------
        Offset (0xEE9),

Does this mean that VDAD is the most significant or least significant
bit?  The value read on Kilian's laptop has the least significant bit
set.

In any case, he cleared that bit but locking the screen still breaks
keyboard/mouse input.

I'm running out of ideas, the only viable solution I can see is to
add a model-specific quirk which causes nouveau to fall back to DSM. :-(

Best regards,

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lyude Paul Jan. 9, 2017, 3:11 p.m. UTC | #25
fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64 running on
here and so far it seems to suspend/resume just fine using firmware
version 2.19

On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
> (cc'ing Lyude, who has the hw also I think).
> 
> ----- Original Message -----
> > From: "Peter Jones" <pjones@redhat.com>
> > To: "Lukas Wunner" <lukas@wunner.de>
> > Cc: "David Airlie" <airlied@redhat.com>, "Rafael J. Wysocki" <rjw@r
> > jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> > "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.weste
> > rberg@linux.intel.com>, "Kilian Singer"
> > <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-pci@vger
> > .kernel.org>, "Alex Deucher"
> > <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redhat.com>
> > Sent: Friday, 6 January, 2017 4:13:23 AM
> > Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe
> > ports"
> > 
> > On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
> > > On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
> > > > > On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
> > > > > > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner
> > > > > > wrote:
> > > > > > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas
> > > > > > > wrote:
> > > > > > > > I don't *want* to apply the revert.  It's on my for-
> > > > > > > > linus branch
> > > > > > > > as a
> > > > > > > > worst-case scenario change if we can't figure out a
> > > > > > > > better fix.
> > > > > > > > 
> > > > > > > > The patch below is preferable, but I'd rather not take
> > > > > > > > even it,
> > > > > > > > because it takes away functionality and forces people
> > > > > > > > to use a
> > > > > > > > boot
> > > > > > > > parameter to restore it.  I expect that somebody will
> > > > > > > > figure out
> > > > > > > > how
> > > > > > > > to fix the regression Kilian found and also keep the
> > > > > > > > new
> > > > > > > > functionality
> > > > > > > > (without requiring boot parameters) before v4.10.
> > > > > > > 
> > > > > > > The issue is constrained to hybrid graphics laptops with
> > > > > > > Nvidia
> > > > > > > discrete
> > > > > > > GPU using nouveau.  Hence it needs to be fixed in
> > > > > > > nouveau, not in
> > > > > > > the
> > > > > > > PCI core.
> > > > > > 
> > > > > > The problem is not necessarily in the nouveau driver, the
> > > > > > same
> > > > > > problem
> > > > > > occurs when you enable RPM without loading nouveau. The
> > > > > > issue is
> > > > > > limited
> > > > > > though to some newer hybrid graphics laptops with Nvidia
> > > > > > GPUs. While
> > > > > > a
> > > > > > quirk can be added to nouveau, I think that a (temporary)
> > > > > > quirk in
> > > > > > core
> > > > > > would also be reasonable (since it also occurs without
> > > > > > nouveau).
> > > > > > 
> > > > > > > (AFAIUI, laptops with AMD discrete GPU are not affected
> > > > > > > as it is
> > > > > > > known
> > > > > > > when and how to call an ACPI method versus using PR3.)
> > > > > > > 
> > > > > > > (Neither are laptops using the Nvidia proprietary driver
> > > > > > > as it
> > > > > > > doesn't
> > > > > > > runtime suspend the card.  But battery life will be
> > > > > > > terrible then.)
> > > > > > > 
> > > > > > > We're at rc2 so the time frame for coming up with a fix
> > > > > > > is probably
> > > > > > > 4 weeks.  Peter and others have tried for months to
> > > > > > > reverse-engineer
> > > > > > > how to handle runtime PM on newer Nvidia cards.  It seems
> > > > > > > likely
> > > > > > > that
> > > > > > > we'll not find the ultimate solution to the problem
> > > > > > > within 4 weeks.
> > > > > > 
> > > > > > Yep, a quick proper fix seems unlikely.
> > > > > > [ Help/ideas are welcome, I suspect that these failures to
> > > > > > restore
> > > > > > power
> > > > > > on laptops designed for Win8+ all have the same cause,
> > > > > > related to
> > > > > > some
> > > > > > unknown interaction between ACPI and PCI. Some links:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > > > > > 
> > > > > > > The way it is now, i.e. defaulting to PR3 when available,
> > > > > > > regresses
> > > > > > > certain laptops such as Kilian's.  If on the other hand
> > > > > > > we default
> > > > > > > to
> > > > > > > DSM when available, we'll regress certain other laptops,
> > > > > > > as Peter
> > > > > > > has
> > > > > > > pointed out.  Whitelisting or blacklisting laptops
> > > > > > > doesn't seem a
> > > > > > > good
> > > > > > > approach either, ideally we'd want to use PR3 as Windows
> > > > > > > does.
> > > > > > > 
> > > > > > > As said, the only short-term solution I see is to add an
> > > > > > > "optimus"
> > > > > > > module_param to nouveau to allow users to select which
> > > > > > > method to
> > > > > > > use.
> > > > > > > So in Kilian's case an additional command line parameter
> > > > > > > would be
> > > > > > > necessary to fix the issue.
> > > > > > > 
> > > > > > > Does anyone see a better solution or can we agree on this
> > > > > > > one?  If
> > > > > > > so
> > > > > > > I can come up with a patch.  This could go in via Dave
> > > > > > > Airlie's
> > > > > > > tree.
> > > > > > 
> > > > > > As pcie_port_pm=off already reverts to DSM, I do not think
> > > > > > that an
> > > > > > additional (temporary) nouveau module parameter is going to
> > > > > > help. I
> > > > > > instead propose a (hopefully temporary) quirk in pci core
> > > > > > that
> > > > > > disables
> > > > > > D3cold RPM for just Kilians Lenovo laptop (basically
> > > > > > defaulting to
> > > > > > pcie_port_pm=off). Then the option pcie_port_pm=force can
> > > > > > still be
> > > > > > used
> > > > > > to test possible solutions in the future.
> > > > > 
> > > > > I would rather add a quirk to the ACPI core to prevent the
> > > > > power
> > > > > resources in
> > > > > question from being enumerated.  Or even to prevent ACPI PM
> > > > > from being
> > > > > used for the port in question.
> > > > 
> > > > I do have a W541 in a cupboard in the office somewhere, but I
> > > > won't be
> > > > close to
> > > > it for a couple of weeks. The W541 was the first place I tested
> > > > the pm
> > > > patches
> > > > so I'm kinda wondering whether it's all W541's or just some
> > > > specific
> > > > model/bios
> > > > combo.
> > 
> > They seem to all ship with the 1.10 firmware, and 2.80 is current
> > (there
> > are a bunch of intermediate 2.xx versions).  Somewhere along the
> > line
> > they introduced some bugs in the UEFI stuff, so it wouldn't be
> > surprising if there's bugs introduced elsewhere as well.
> > 
> > > > However I'm pretty much unavailable to do anything much until
> > > > late Jan on
> > > > this.
> > > 
> > > Is there anyone else at Red Hat who might be able to look into
> > > this?
> > > 
> > > ISTR that Hans de Goede is working on improving laptop support in
> > > Fedora,
> > > and Peter Jones recently got a patch merged for the W541 with the
> > > exact
> > > same firmware Kilian is using to work around a botched EFI memory
> > > map.
> > > Adding them to cc: in the hope that they may be able to help.
> > > 
> > > @Peter, have you noticed issues with the discrete Nvidia GPU on
> > > your W541
> > > related to runtime suspend and system sleep?
> > 
> > I was using a borrowed one (I can certainly find it again, but I'm
> > not
> > working on graphics/pm really), but yeah - shutdown and lspci both
> > broke
> > sometime after pci_pm_runtime_resume().  Here's the traceback from
> > SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
> > 
> > Dave, if you know who in Westford should have a look at this, I can
> > see
> > about getting them hardware.  I am more or less surrounded by that
> > team.
> > 
> > --
> >         Peter
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hans de Goede Jan. 9, 2017, 3:21 p.m. UTC | #26
Hi Lyude,

On 09-01-17 16:11, Lyude Paul wrote:
> fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64 running on
> here and so far it seems to suspend/resume just fine using firmware
> version 2.19

Note this is not about normal suspend resume, but runtime
suspend/resume of the nvidia discrete GPU...

Try running glxgears like this:

DRI_PRIME=1 glxgears -info | grep REND

(the grep is to check you're really running on the nvidia GPU).

Then you should see msgs in dmesg about nouveau resuming the gpu,
then kill glxgears and wait for 5 seconds, now the nouveau drv
should say the gpu is suspending, etc.

If it never runtime suspends, then make sure you are not using
any external screens, only the built-in laptop screen.

Regards,

Hans


>
> On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
>> (cc'ing Lyude, who has the hw also I think).
>>
>> ----- Original Message -----
>>> From: "Peter Jones" <pjones@redhat.com>
>>> To: "Lukas Wunner" <lukas@wunner.de>
>>> Cc: "David Airlie" <airlied@redhat.com>, "Rafael J. Wysocki" <rjw@r
>>> jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
>>> "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.weste
>>> rberg@linux.intel.com>, "Kilian Singer"
>>> <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-pci@vger
>>> .kernel.org>, "Alex Deucher"
>>> <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redhat.com>
>>> Sent: Friday, 6 January, 2017 4:13:23 AM
>>> Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe
>>> ports"
>>>
>>> On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
>>>> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
>>>>>> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
>>>>>>> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner
>>>>>>> wrote:
>>>>>>>> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas
>>>>>>>> wrote:
>>>>>>>>> I don't *want* to apply the revert.  It's on my for-
>>>>>>>>> linus branch
>>>>>>>>> as a
>>>>>>>>> worst-case scenario change if we can't figure out a
>>>>>>>>> better fix.
>>>>>>>>>
>>>>>>>>> The patch below is preferable, but I'd rather not take
>>>>>>>>> even it,
>>>>>>>>> because it takes away functionality and forces people
>>>>>>>>> to use a
>>>>>>>>> boot
>>>>>>>>> parameter to restore it.  I expect that somebody will
>>>>>>>>> figure out
>>>>>>>>> how
>>>>>>>>> to fix the regression Kilian found and also keep the
>>>>>>>>> new
>>>>>>>>> functionality
>>>>>>>>> (without requiring boot parameters) before v4.10.
>>>>>>>>
>>>>>>>> The issue is constrained to hybrid graphics laptops with
>>>>>>>> Nvidia
>>>>>>>> discrete
>>>>>>>> GPU using nouveau.  Hence it needs to be fixed in
>>>>>>>> nouveau, not in
>>>>>>>> the
>>>>>>>> PCI core.
>>>>>>>
>>>>>>> The problem is not necessarily in the nouveau driver, the
>>>>>>> same
>>>>>>> problem
>>>>>>> occurs when you enable RPM without loading nouveau. The
>>>>>>> issue is
>>>>>>> limited
>>>>>>> though to some newer hybrid graphics laptops with Nvidia
>>>>>>> GPUs. While
>>>>>>> a
>>>>>>> quirk can be added to nouveau, I think that a (temporary)
>>>>>>> quirk in
>>>>>>> core
>>>>>>> would also be reasonable (since it also occurs without
>>>>>>> nouveau).
>>>>>>>
>>>>>>>> (AFAIUI, laptops with AMD discrete GPU are not affected
>>>>>>>> as it is
>>>>>>>> known
>>>>>>>> when and how to call an ACPI method versus using PR3.)
>>>>>>>>
>>>>>>>> (Neither are laptops using the Nvidia proprietary driver
>>>>>>>> as it
>>>>>>>> doesn't
>>>>>>>> runtime suspend the card.  But battery life will be
>>>>>>>> terrible then.)
>>>>>>>>
>>>>>>>> We're at rc2 so the time frame for coming up with a fix
>>>>>>>> is probably
>>>>>>>> 4 weeks.  Peter and others have tried for months to
>>>>>>>> reverse-engineer
>>>>>>>> how to handle runtime PM on newer Nvidia cards.  It seems
>>>>>>>> likely
>>>>>>>> that
>>>>>>>> we'll not find the ultimate solution to the problem
>>>>>>>> within 4 weeks.
>>>>>>>
>>>>>>> Yep, a quick proper fix seems unlikely.
>>>>>>> [ Help/ideas are welcome, I suspect that these failures to
>>>>>>> restore
>>>>>>> power
>>>>>>> on laptops designed for Win8+ all have the same cause,
>>>>>>> related to
>>>>>>> some
>>>>>>> unknown interaction between ACPI and PCI. Some links:
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
>>>>>>>
>>>>>>>> The way it is now, i.e. defaulting to PR3 when available,
>>>>>>>> regresses
>>>>>>>> certain laptops such as Kilian's.  If on the other hand
>>>>>>>> we default
>>>>>>>> to
>>>>>>>> DSM when available, we'll regress certain other laptops,
>>>>>>>> as Peter
>>>>>>>> has
>>>>>>>> pointed out.  Whitelisting or blacklisting laptops
>>>>>>>> doesn't seem a
>>>>>>>> good
>>>>>>>> approach either, ideally we'd want to use PR3 as Windows
>>>>>>>> does.
>>>>>>>>
>>>>>>>> As said, the only short-term solution I see is to add an
>>>>>>>> "optimus"
>>>>>>>> module_param to nouveau to allow users to select which
>>>>>>>> method to
>>>>>>>> use.
>>>>>>>> So in Kilian's case an additional command line parameter
>>>>>>>> would be
>>>>>>>> necessary to fix the issue.
>>>>>>>>
>>>>>>>> Does anyone see a better solution or can we agree on this
>>>>>>>> one?  If
>>>>>>>> so
>>>>>>>> I can come up with a patch.  This could go in via Dave
>>>>>>>> Airlie's
>>>>>>>> tree.
>>>>>>>
>>>>>>> As pcie_port_pm=off already reverts to DSM, I do not think
>>>>>>> that an
>>>>>>> additional (temporary) nouveau module parameter is going to
>>>>>>> help. I
>>>>>>> instead propose a (hopefully temporary) quirk in pci core
>>>>>>> that
>>>>>>> disables
>>>>>>> D3cold RPM for just Kilians Lenovo laptop (basically
>>>>>>> defaulting to
>>>>>>> pcie_port_pm=off). Then the option pcie_port_pm=force can
>>>>>>> still be
>>>>>>> used
>>>>>>> to test possible solutions in the future.
>>>>>>
>>>>>> I would rather add a quirk to the ACPI core to prevent the
>>>>>> power
>>>>>> resources in
>>>>>> question from being enumerated.  Or even to prevent ACPI PM
>>>>>> from being
>>>>>> used for the port in question.
>>>>>
>>>>> I do have a W541 in a cupboard in the office somewhere, but I
>>>>> won't be
>>>>> close to
>>>>> it for a couple of weeks. The W541 was the first place I tested
>>>>> the pm
>>>>> patches
>>>>> so I'm kinda wondering whether it's all W541's or just some
>>>>> specific
>>>>> model/bios
>>>>> combo.
>>>
>>> They seem to all ship with the 1.10 firmware, and 2.80 is current
>>> (there
>>> are a bunch of intermediate 2.xx versions).  Somewhere along the
>>> line
>>> they introduced some bugs in the UEFI stuff, so it wouldn't be
>>> surprising if there's bugs introduced elsewhere as well.
>>>
>>>>> However I'm pretty much unavailable to do anything much until
>>>>> late Jan on
>>>>> this.
>>>>
>>>> Is there anyone else at Red Hat who might be able to look into
>>>> this?
>>>>
>>>> ISTR that Hans de Goede is working on improving laptop support in
>>>> Fedora,
>>>> and Peter Jones recently got a patch merged for the W541 with the
>>>> exact
>>>> same firmware Kilian is using to work around a botched EFI memory
>>>> map.
>>>> Adding them to cc: in the hope that they may be able to help.
>>>>
>>>> @Peter, have you noticed issues with the discrete Nvidia GPU on
>>>> your W541
>>>> related to runtime suspend and system sleep?
>>>
>>> I was using a borrowed one (I can certainly find it again, but I'm
>>> not
>>> working on graphics/pm really), but yeah - shutdown and lspci both
>>> broke
>>> sometime after pci_pm_runtime_resume().  Here's the traceback from
>>> SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
>>>
>>> Dave, if you know who in Westford should have a look at this, I can
>>> see
>>> about getting them hardware.  I am more or less surrounded by that
>>> team.
>>>
>>> --
>>>         Peter
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kilian Singer Jan. 9, 2017, 6:48 p.m. UTC | #27
Hi Lyude Paul,

normal supend resume does not work neither on my machine.

Best regards

Kilian


On 09-Jan-17 16:21, Hans de Goede wrote:
> Hi Lyude,
>
> On 09-01-17 16:11, Lyude Paul wrote:
>> fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64 running on
>> here and so far it seems to suspend/resume just fine using firmware
>> version 2.19
>
> Note this is not about normal suspend resume, but runtime
> suspend/resume of the nvidia discrete GPU...
>
> Try running glxgears like this:
>
> DRI_PRIME=1 glxgears -info | grep REND
>
> (the grep is to check you're really running on the nvidia GPU).
>
> Then you should see msgs in dmesg about nouveau resuming the gpu,
> then kill glxgears and wait for 5 seconds, now the nouveau drv
> should say the gpu is suspending, etc.
>
> If it never runtime suspends, then make sure you are not using
> any external screens, only the built-in laptop screen.
>
> Regards,
>
> Hans
>
>
>>
>> On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
>>> (cc'ing Lyude, who has the hw also I think).
>>>
>>> ----- Original Message -----
>>>> From: "Peter Jones" <pjones@redhat.com>
>>>> To: "Lukas Wunner" <lukas@wunner.de>
>>>> Cc: "David Airlie" <airlied@redhat.com>, "Rafael J. Wysocki" <rjw@r
>>>> jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
>>>> "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.weste
>>>> rberg@linux.intel.com>, "Kilian Singer"
>>>> <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-pci@vger
>>>> .kernel.org>, "Alex Deucher"
>>>> <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redhat.com>
>>>> Sent: Friday, 6 January, 2017 4:13:23 AM
>>>> Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe
>>>> ports"
>>>>
>>>> On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
>>>>> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
>>>>>>> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
>>>>>>>> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner
>>>>>>>> wrote:
>>>>>>>>> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas
>>>>>>>>> wrote:
>>>>>>>>>> I don't *want* to apply the revert.  It's on my for-
>>>>>>>>>> linus branch
>>>>>>>>>> as a
>>>>>>>>>> worst-case scenario change if we can't figure out a
>>>>>>>>>> better fix.
>>>>>>>>>>
>>>>>>>>>> The patch below is preferable, but I'd rather not take
>>>>>>>>>> even it,
>>>>>>>>>> because it takes away functionality and forces people
>>>>>>>>>> to use a
>>>>>>>>>> boot
>>>>>>>>>> parameter to restore it.  I expect that somebody will
>>>>>>>>>> figure out
>>>>>>>>>> how
>>>>>>>>>> to fix the regression Kilian found and also keep the
>>>>>>>>>> new
>>>>>>>>>> functionality
>>>>>>>>>> (without requiring boot parameters) before v4.10.
>>>>>>>>>
>>>>>>>>> The issue is constrained to hybrid graphics laptops with
>>>>>>>>> Nvidia
>>>>>>>>> discrete
>>>>>>>>> GPU using nouveau.  Hence it needs to be fixed in
>>>>>>>>> nouveau, not in
>>>>>>>>> the
>>>>>>>>> PCI core.
>>>>>>>>
>>>>>>>> The problem is not necessarily in the nouveau driver, the
>>>>>>>> same
>>>>>>>> problem
>>>>>>>> occurs when you enable RPM without loading nouveau. The
>>>>>>>> issue is
>>>>>>>> limited
>>>>>>>> though to some newer hybrid graphics laptops with Nvidia
>>>>>>>> GPUs. While
>>>>>>>> a
>>>>>>>> quirk can be added to nouveau, I think that a (temporary)
>>>>>>>> quirk in
>>>>>>>> core
>>>>>>>> would also be reasonable (since it also occurs without
>>>>>>>> nouveau).
>>>>>>>>
>>>>>>>>> (AFAIUI, laptops with AMD discrete GPU are not affected
>>>>>>>>> as it is
>>>>>>>>> known
>>>>>>>>> when and how to call an ACPI method versus using PR3.)
>>>>>>>>>
>>>>>>>>> (Neither are laptops using the Nvidia proprietary driver
>>>>>>>>> as it
>>>>>>>>> doesn't
>>>>>>>>> runtime suspend the card.  But battery life will be
>>>>>>>>> terrible then.)
>>>>>>>>>
>>>>>>>>> We're at rc2 so the time frame for coming up with a fix
>>>>>>>>> is probably
>>>>>>>>> 4 weeks.  Peter and others have tried for months to
>>>>>>>>> reverse-engineer
>>>>>>>>> how to handle runtime PM on newer Nvidia cards.  It seems
>>>>>>>>> likely
>>>>>>>>> that
>>>>>>>>> we'll not find the ultimate solution to the problem
>>>>>>>>> within 4 weeks.
>>>>>>>>
>>>>>>>> Yep, a quick proper fix seems unlikely.
>>>>>>>> [ Help/ideas are welcome, I suspect that these failures to
>>>>>>>> restore
>>>>>>>> power
>>>>>>>> on laptops designed for Win8+ all have the same cause,
>>>>>>>> related to
>>>>>>>> some
>>>>>>>> unknown interaction between ACPI and PCI. Some links:
>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
>>>>>>>>
>>>>>>>>> The way it is now, i.e. defaulting to PR3 when available,
>>>>>>>>> regresses
>>>>>>>>> certain laptops such as Kilian's.  If on the other hand
>>>>>>>>> we default
>>>>>>>>> to
>>>>>>>>> DSM when available, we'll regress certain other laptops,
>>>>>>>>> as Peter
>>>>>>>>> has
>>>>>>>>> pointed out.  Whitelisting or blacklisting laptops
>>>>>>>>> doesn't seem a
>>>>>>>>> good
>>>>>>>>> approach either, ideally we'd want to use PR3 as Windows
>>>>>>>>> does.
>>>>>>>>>
>>>>>>>>> As said, the only short-term solution I see is to add an
>>>>>>>>> "optimus"
>>>>>>>>> module_param to nouveau to allow users to select which
>>>>>>>>> method to
>>>>>>>>> use.
>>>>>>>>> So in Kilian's case an additional command line parameter
>>>>>>>>> would be
>>>>>>>>> necessary to fix the issue.
>>>>>>>>>
>>>>>>>>> Does anyone see a better solution or can we agree on this
>>>>>>>>> one?  If
>>>>>>>>> so
>>>>>>>>> I can come up with a patch.  This could go in via Dave
>>>>>>>>> Airlie's
>>>>>>>>> tree.
>>>>>>>>
>>>>>>>> As pcie_port_pm=off already reverts to DSM, I do not think
>>>>>>>> that an
>>>>>>>> additional (temporary) nouveau module parameter is going to
>>>>>>>> help. I
>>>>>>>> instead propose a (hopefully temporary) quirk in pci core
>>>>>>>> that
>>>>>>>> disables
>>>>>>>> D3cold RPM for just Kilians Lenovo laptop (basically
>>>>>>>> defaulting to
>>>>>>>> pcie_port_pm=off). Then the option pcie_port_pm=force can
>>>>>>>> still be
>>>>>>>> used
>>>>>>>> to test possible solutions in the future.
>>>>>>>
>>>>>>> I would rather add a quirk to the ACPI core to prevent the
>>>>>>> power
>>>>>>> resources in
>>>>>>> question from being enumerated.  Or even to prevent ACPI PM
>>>>>>> from being
>>>>>>> used for the port in question.
>>>>>>
>>>>>> I do have a W541 in a cupboard in the office somewhere, but I
>>>>>> won't be
>>>>>> close to
>>>>>> it for a couple of weeks. The W541 was the first place I tested
>>>>>> the pm
>>>>>> patches
>>>>>> so I'm kinda wondering whether it's all W541's or just some
>>>>>> specific
>>>>>> model/bios
>>>>>> combo.
>>>>
>>>> They seem to all ship with the 1.10 firmware, and 2.80 is current
>>>> (there
>>>> are a bunch of intermediate 2.xx versions).  Somewhere along the
>>>> line
>>>> they introduced some bugs in the UEFI stuff, so it wouldn't be
>>>> surprising if there's bugs introduced elsewhere as well.
>>>>
>>>>>> However I'm pretty much unavailable to do anything much until
>>>>>> late Jan on
>>>>>> this.
>>>>>
>>>>> Is there anyone else at Red Hat who might be able to look into
>>>>> this?
>>>>>
>>>>> ISTR that Hans de Goede is working on improving laptop support in
>>>>> Fedora,
>>>>> and Peter Jones recently got a patch merged for the W541 with the
>>>>> exact
>>>>> same firmware Kilian is using to work around a botched EFI memory
>>>>> map.
>>>>> Adding them to cc: in the hope that they may be able to help.
>>>>>
>>>>> @Peter, have you noticed issues with the discrete Nvidia GPU on
>>>>> your W541
>>>>> related to runtime suspend and system sleep?
>>>>
>>>> I was using a borrowed one (I can certainly find it again, but I'm
>>>> not
>>>> working on graphics/pm really), but yeah - shutdown and lspci both
>>>> broke
>>>> sometime after pci_pm_runtime_resume().  Here's the traceback from
>>>> SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
>>>>
>>>> Dave, if you know who in Westford should have a look at this, I can
>>>> see
>>>> about getting them hardware.  I am more or less surrounded by that
>>>> team.
>>>>
>>>> -- 
>>>>         Peter
>>>>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Jones Jan. 9, 2017, 11 p.m. UTC | #28
On Sat, Jan 07, 2017 at 12:45:35PM +0100, Hans de Goede wrote:

> I've a W541 sitting in my home office at well. I will take it through
> some gpu runtime suspend/resume testing. Which kernel introduces the
> problem I'm looking for ?
> 
> I believe mine has the old BIOS / EFI which is less troublesome so I
> will first see if I can reproduce the problem with that and then upgrade
> to see if that introduces the problem.
> 
> Peter IIRC you said that after upgrading the firmware I need a new enough
> kernel to be able to even boot, from which kernel onwards will the machine
> boot with the new firmware ?

That fix is currently commit b2a91a35 in the "next" branch of the repo
at git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi .  I'm not sure
what the timeframe for it landing in linus' tree is, but it doesn't look
like it has yet.

> Also is it possible to downgrade the EFI again ? ...

IIRC that model has a switch in the firmware to enable downgrading.  I
have not tried it.  Also, there's some chance the firmware you're
starting from isn't available.
David Airlie Jan. 10, 2017, 12:17 a.m. UTC | #29
> 
> On Sat, Jan 07, 2017 at 12:45:35PM +0100, Hans de Goede wrote:
> 
> > I've a W541 sitting in my home office at well. I will take it through
> > some gpu runtime suspend/resume testing. Which kernel introduces the
> > problem I'm looking for ?
> > 
> > I believe mine has the old BIOS / EFI which is less troublesome so I
> > will first see if I can reproduce the problem with that and then upgrade
> > to see if that introduces the problem.
> > 
> > Peter IIRC you said that after upgrading the firmware I need a new enough
> > kernel to be able to even boot, from which kernel onwards will the machine
> > boot with the new firmware ?
> 
> That fix is currently commit b2a91a35 in the "next" branch of the repo
> at git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi .  I'm not sure
> what the timeframe for it landing in linus' tree is, but it doesn't look
> like it has yet.
> 
> > Also is it possible to downgrade the EFI again ? ...
> 
> IIRC that model has a switch in the firmware to enable downgrading.  I
> have not tried it.  Also, there's some chance the firmware you're
> starting from isn't available.
> 
> --
>         Peter
> 

just FYI, but W541 with Fedora 25 and Linux 4.10-rc3 + drm-next and the efi
fix (you might want to motivate that fix a bit harder), seems to be working
well.

I can suspend/resume, and the nvidia seems to go off.

[  411.799035] nouveau 0000:01:00.0: DRM: suspending console...
[  411.799059] nouveau 0000:01:00.0: DRM: suspending display...
[  411.799119] nouveau 0000:01:00.0: DRM: evicting buffers...
[  411.799125] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
[  411.799176] nouveau 0000:01:00.0: DRM: suspending client object trees...
[  411.805616] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
[  413.217090] device_pm-0235 device_set_power      : Device [VID1] transitioned to D3hot
[  413.217099] device_pm-0124 device_get_power      : Device [VID1] power state is (unknown)
[  413.230201] thinkpad_acpi: EC reports that Thermal Table has changed
[  413.351497]     power-0275 __acpi_power_off      : Power resource [NVP3] turned off
[  413.351507] device_pm-0235 device_set_power      : Device [PEG] transitioned to D3hot
[  413.351526]     power-0189 power_get_state       : Resource [NVP3] is off
[  413.351530]     power-0219 power_get_list_state  : Resource list is off
[  413.351542]     power-0189 power_get_state       : Resource [NVP2] is on
[  413.351545]     power-0219 power_get_list_state  : Resource list is on
[  413.351548] device_pm-0124 device_get_power      : Device [PEG] power state is D2

That is with some acpi debugging enabled (though I'm not quite sure why D2 is where it ends up).

Dave.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Airlie Jan. 10, 2017, 12:33 a.m. UTC | #30
Hi Killian,

do you use powertop or have you ever used it, I'm guessing some port is getting into suspend on your machine that isn't on ours due to differeing userspace or powertop settings.

Dave.

----- Original Message -----
> From: "Kilian Singer" <kilian.singer@quantumtechnology.info>
> To: "Hans de Goede" <hdegoede@redhat.com>, "Lyude Paul" <lyude@redhat.com>, "David Airlie" <airlied@redhat.com>,
> "Peter Jones" <pjones@redhat.com>
> Cc: "Lukas Wunner" <lukas@wunner.de>, "Rafael J. Wysocki" <rjw@rjwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.westerberg@linux.intel.com>, "linux-pci"
> <linux-pci@vger.kernel.org>, "Alex Deucher" <alexander.deucher@amd.com>, "Lyude" <cpaul@redhat.com>
> Sent: Tuesday, 10 January, 2017 4:48:22 AM
> Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe ports"
> 
> Hi Lyude Paul,
> 
> normal supend resume does not work neither on my machine.
> 
> Best regards
> 
> Kilian
> 
> 
> On 09-Jan-17 16:21, Hans de Goede wrote:
> > Hi Lyude,
> >
> > On 09-01-17 16:11, Lyude Paul wrote:
> >> fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64 running on
> >> here and so far it seems to suspend/resume just fine using firmware
> >> version 2.19
> >
> > Note this is not about normal suspend resume, but runtime
> > suspend/resume of the nvidia discrete GPU...
> >
> > Try running glxgears like this:
> >
> > DRI_PRIME=1 glxgears -info | grep REND
> >
> > (the grep is to check you're really running on the nvidia GPU).
> >
> > Then you should see msgs in dmesg about nouveau resuming the gpu,
> > then kill glxgears and wait for 5 seconds, now the nouveau drv
> > should say the gpu is suspending, etc.
> >
> > If it never runtime suspends, then make sure you are not using
> > any external screens, only the built-in laptop screen.
> >
> > Regards,
> >
> > Hans
> >
> >
> >>
> >> On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
> >>> (cc'ing Lyude, who has the hw also I think).
> >>>
> >>> ----- Original Message -----
> >>>> From: "Peter Jones" <pjones@redhat.com>
> >>>> To: "Lukas Wunner" <lukas@wunner.de>
> >>>> Cc: "David Airlie" <airlied@redhat.com>, "Rafael J. Wysocki" <rjw@r
> >>>> jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> >>>> "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.weste
> >>>> rberg@linux.intel.com>, "Kilian Singer"
> >>>> <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-pci@vger
> >>>> .kernel.org>, "Alex Deucher"
> >>>> <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redhat.com>
> >>>> Sent: Friday, 6 January, 2017 4:13:23 AM
> >>>> Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe
> >>>> ports"
> >>>>
> >>>> On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
> >>>>> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
> >>>>>>> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
> >>>>>>>> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner
> >>>>>>>> wrote:
> >>>>>>>>> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas
> >>>>>>>>> wrote:
> >>>>>>>>>> I don't *want* to apply the revert.  It's on my for-
> >>>>>>>>>> linus branch
> >>>>>>>>>> as a
> >>>>>>>>>> worst-case scenario change if we can't figure out a
> >>>>>>>>>> better fix.
> >>>>>>>>>>
> >>>>>>>>>> The patch below is preferable, but I'd rather not take
> >>>>>>>>>> even it,
> >>>>>>>>>> because it takes away functionality and forces people
> >>>>>>>>>> to use a
> >>>>>>>>>> boot
> >>>>>>>>>> parameter to restore it.  I expect that somebody will
> >>>>>>>>>> figure out
> >>>>>>>>>> how
> >>>>>>>>>> to fix the regression Kilian found and also keep the
> >>>>>>>>>> new
> >>>>>>>>>> functionality
> >>>>>>>>>> (without requiring boot parameters) before v4.10.
> >>>>>>>>>
> >>>>>>>>> The issue is constrained to hybrid graphics laptops with
> >>>>>>>>> Nvidia
> >>>>>>>>> discrete
> >>>>>>>>> GPU using nouveau.  Hence it needs to be fixed in
> >>>>>>>>> nouveau, not in
> >>>>>>>>> the
> >>>>>>>>> PCI core.
> >>>>>>>>
> >>>>>>>> The problem is not necessarily in the nouveau driver, the
> >>>>>>>> same
> >>>>>>>> problem
> >>>>>>>> occurs when you enable RPM without loading nouveau. The
> >>>>>>>> issue is
> >>>>>>>> limited
> >>>>>>>> though to some newer hybrid graphics laptops with Nvidia
> >>>>>>>> GPUs. While
> >>>>>>>> a
> >>>>>>>> quirk can be added to nouveau, I think that a (temporary)
> >>>>>>>> quirk in
> >>>>>>>> core
> >>>>>>>> would also be reasonable (since it also occurs without
> >>>>>>>> nouveau).
> >>>>>>>>
> >>>>>>>>> (AFAIUI, laptops with AMD discrete GPU are not affected
> >>>>>>>>> as it is
> >>>>>>>>> known
> >>>>>>>>> when and how to call an ACPI method versus using PR3.)
> >>>>>>>>>
> >>>>>>>>> (Neither are laptops using the Nvidia proprietary driver
> >>>>>>>>> as it
> >>>>>>>>> doesn't
> >>>>>>>>> runtime suspend the card.  But battery life will be
> >>>>>>>>> terrible then.)
> >>>>>>>>>
> >>>>>>>>> We're at rc2 so the time frame for coming up with a fix
> >>>>>>>>> is probably
> >>>>>>>>> 4 weeks.  Peter and others have tried for months to
> >>>>>>>>> reverse-engineer
> >>>>>>>>> how to handle runtime PM on newer Nvidia cards.  It seems
> >>>>>>>>> likely
> >>>>>>>>> that
> >>>>>>>>> we'll not find the ultimate solution to the problem
> >>>>>>>>> within 4 weeks.
> >>>>>>>>
> >>>>>>>> Yep, a quick proper fix seems unlikely.
> >>>>>>>> [ Help/ideas are welcome, I suspect that these failures to
> >>>>>>>> restore
> >>>>>>>> power
> >>>>>>>> on laptops designed for Win8+ all have the same cause,
> >>>>>>>> related to
> >>>>>>>> some
> >>>>>>>> unknown interaction between ACPI and PCI. Some links:
> >>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
> >>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> >>>>>>>>
> >>>>>>>>> The way it is now, i.e. defaulting to PR3 when available,
> >>>>>>>>> regresses
> >>>>>>>>> certain laptops such as Kilian's.  If on the other hand
> >>>>>>>>> we default
> >>>>>>>>> to
> >>>>>>>>> DSM when available, we'll regress certain other laptops,
> >>>>>>>>> as Peter
> >>>>>>>>> has
> >>>>>>>>> pointed out.  Whitelisting or blacklisting laptops
> >>>>>>>>> doesn't seem a
> >>>>>>>>> good
> >>>>>>>>> approach either, ideally we'd want to use PR3 as Windows
> >>>>>>>>> does.
> >>>>>>>>>
> >>>>>>>>> As said, the only short-term solution I see is to add an
> >>>>>>>>> "optimus"
> >>>>>>>>> module_param to nouveau to allow users to select which
> >>>>>>>>> method to
> >>>>>>>>> use.
> >>>>>>>>> So in Kilian's case an additional command line parameter
> >>>>>>>>> would be
> >>>>>>>>> necessary to fix the issue.
> >>>>>>>>>
> >>>>>>>>> Does anyone see a better solution or can we agree on this
> >>>>>>>>> one?  If
> >>>>>>>>> so
> >>>>>>>>> I can come up with a patch.  This could go in via Dave
> >>>>>>>>> Airlie's
> >>>>>>>>> tree.
> >>>>>>>>
> >>>>>>>> As pcie_port_pm=off already reverts to DSM, I do not think
> >>>>>>>> that an
> >>>>>>>> additional (temporary) nouveau module parameter is going to
> >>>>>>>> help. I
> >>>>>>>> instead propose a (hopefully temporary) quirk in pci core
> >>>>>>>> that
> >>>>>>>> disables
> >>>>>>>> D3cold RPM for just Kilians Lenovo laptop (basically
> >>>>>>>> defaulting to
> >>>>>>>> pcie_port_pm=off). Then the option pcie_port_pm=force can
> >>>>>>>> still be
> >>>>>>>> used
> >>>>>>>> to test possible solutions in the future.
> >>>>>>>
> >>>>>>> I would rather add a quirk to the ACPI core to prevent the
> >>>>>>> power
> >>>>>>> resources in
> >>>>>>> question from being enumerated.  Or even to prevent ACPI PM
> >>>>>>> from being
> >>>>>>> used for the port in question.
> >>>>>>
> >>>>>> I do have a W541 in a cupboard in the office somewhere, but I
> >>>>>> won't be
> >>>>>> close to
> >>>>>> it for a couple of weeks. The W541 was the first place I tested
> >>>>>> the pm
> >>>>>> patches
> >>>>>> so I'm kinda wondering whether it's all W541's or just some
> >>>>>> specific
> >>>>>> model/bios
> >>>>>> combo.
> >>>>
> >>>> They seem to all ship with the 1.10 firmware, and 2.80 is current
> >>>> (there
> >>>> are a bunch of intermediate 2.xx versions).  Somewhere along the
> >>>> line
> >>>> they introduced some bugs in the UEFI stuff, so it wouldn't be
> >>>> surprising if there's bugs introduced elsewhere as well.
> >>>>
> >>>>>> However I'm pretty much unavailable to do anything much until
> >>>>>> late Jan on
> >>>>>> this.
> >>>>>
> >>>>> Is there anyone else at Red Hat who might be able to look into
> >>>>> this?
> >>>>>
> >>>>> ISTR that Hans de Goede is working on improving laptop support in
> >>>>> Fedora,
> >>>>> and Peter Jones recently got a patch merged for the W541 with the
> >>>>> exact
> >>>>> same firmware Kilian is using to work around a botched EFI memory
> >>>>> map.
> >>>>> Adding them to cc: in the hope that they may be able to help.
> >>>>>
> >>>>> @Peter, have you noticed issues with the discrete Nvidia GPU on
> >>>>> your W541
> >>>>> related to runtime suspend and system sleep?
> >>>>
> >>>> I was using a borrowed one (I can certainly find it again, but I'm
> >>>> not
> >>>> working on graphics/pm really), but yeah - shutdown and lspci both
> >>>> broke
> >>>> sometime after pci_pm_runtime_resume().  Here's the traceback from
> >>>> SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
> >>>>
> >>>> Dave, if you know who in Westford should have a look at this, I can
> >>>> see
> >>>> about getting them hardware.  I am more or less surrounded by that
> >>>> team.
> >>>>
> >>>> --
> >>>>         Peter
> >>>>
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Wunner Jan. 10, 2017, 1:24 a.m. UTC | #31
On Mon, Jan 09, 2017 at 07:17:17PM -0500, David Airlie wrote:
> just FYI, but W541 with Fedora 25 and Linux 4.10-rc3 + drm-next and the efi
> fix (you might want to motivate that fix a bit harder), seems to be working
> well.

The efi fix is on the efi.git next branch without stable designation,
i.e. slated for 4.11 not 4.10, and Matt Fleming usually sends his pull
to Ingo between rc4 and rc5.


> I can suspend/resume, and the nvidia seems to go off.
> 
> [  411.799035] nouveau 0000:01:00.0: DRM: suspending console...
> [  411.799059] nouveau 0000:01:00.0: DRM: suspending display...
> [  411.799119] nouveau 0000:01:00.0: DRM: evicting buffers...
> [  411.799125] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
> [  411.799176] nouveau 0000:01:00.0: DRM: suspending client object trees...
> [  411.805616] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
> [  413.217090] device_pm-0235 device_set_power      : Device [VID1] transitioned to D3hot
> [  413.217099] device_pm-0124 device_get_power      : Device [VID1] power state is (unknown)
> [  413.230201] thinkpad_acpi: EC reports that Thermal Table has changed
> [  413.351497]     power-0275 __acpi_power_off      : Power resource [NVP3] turned off
> [  413.351507] device_pm-0235 device_set_power      : Device [PEG] transitioned to D3hot
> [  413.351526]     power-0189 power_get_state       : Resource [NVP3] is off
> [  413.351530]     power-0219 power_get_list_state  : Resource list is off
> [  413.351542]     power-0189 power_get_state       : Resource [NVP2] is on
> [  413.351545]     power-0219 power_get_list_state  : Resource list is on
> [  413.351548] device_pm-0124 device_get_power      : Device [PEG] power state is D2
> 
> That is with some acpi debugging enabled (though I'm not quite sure why D2 is where it ends up).

The ACPI D2 state seems fishy indeed, could you post the log for a
runtime resume as well?  It would be interesting to see how it gets
out of this incorrect power state.  Also, what is the firmware version
that you're using?  If it's not GNET80WW (2.28), could you attach an
acpidump to the bugzilla entry?

https://bugzilla.kernel.org/show_bug.cgi?id=190861

Thanks,

Lukas

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Airlie Jan. 10, 2017, 2:15 a.m. UTC | #32
> On Mon, Jan 09, 2017 at 07:17:17PM -0500, David Airlie wrote:
> > just FYI, but W541 with Fedora 25 and Linux 4.10-rc3 + drm-next and the efi
> > fix (you might want to motivate that fix a bit harder), seems to be working
> > well.
> 
> The efi fix is on the efi.git next branch without stable designation,
> i.e. slated for 4.11 not 4.10, and Matt Fleming usually sends his pull
> to Ingo between rc4 and rc5.
> 
> 
> > I can suspend/resume, and the nvidia seems to go off.
> > 
> > [  411.799035] nouveau 0000:01:00.0: DRM: suspending console...
> > [  411.799059] nouveau 0000:01:00.0: DRM: suspending display...
> > [  411.799119] nouveau 0000:01:00.0: DRM: evicting buffers...
> > [  411.799125] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go
> > idle...
> > [  411.799176] nouveau 0000:01:00.0: DRM: suspending client object trees...
> > [  411.805616] nouveau 0000:01:00.0: DRM: suspending kernel object tree...
> > [  413.217090] device_pm-0235 device_set_power      : Device [VID1]
> > transitioned to D3hot
> > [  413.217099] device_pm-0124 device_get_power      : Device [VID1] power
> > state is (unknown)
> > [  413.230201] thinkpad_acpi: EC reports that Thermal Table has changed
> > [  413.351497]     power-0275 __acpi_power_off      : Power resource [NVP3]
> > turned off
> > [  413.351507] device_pm-0235 device_set_power      : Device [PEG]
> > transitioned to D3hot
> > [  413.351526]     power-0189 power_get_state       : Resource [NVP3] is
> > off
> > [  413.351530]     power-0219 power_get_list_state  : Resource list is off
> > [  413.351542]     power-0189 power_get_state       : Resource [NVP2] is on
> > [  413.351545]     power-0219 power_get_list_state  : Resource list is on
> > [  413.351548] device_pm-0124 device_get_power      : Device [PEG] power
> > state is D2
> > 
> > That is with some acpi debugging enabled (though I'm not quite sure why D2
> > is where it ends up).
> 
> The ACPI D2 state seems fishy indeed, could you post the log for a
> runtime resume as well?  It would be interesting to see how it gets
> out of this incorrect power state.  Also, what is the firmware version
> that you're using?  If it's not GNET80WW (2.28), could you attach an
> acpidump to the bugzilla entry?
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=190861
> 

Okay I found a possible race and sent patches to fix it dri-devel

https://patchwork.freedesktop.org/series/17731/

also https://cgit.freedesktop.org/~airlied/linux/log/?h=drm-next-wip-fix-runtime-race
here with the efi patch also.

It might just be that enabling runtime PM makes things actually suspend/resume and
we can hit this, or else I've just found something else in the area.

Dave.

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kilian Singer Jan. 10, 2017, 9:17 a.m. UTC | #33
It is a standart debian installation.

I have not installed powertop.


On 10-Jan-17 01:33, David Airlie wrote:
> Hi Killian,
>
> do you use powertop or have you ever used it, I'm guessing some port is getting into suspend on your machine that isn't on ours due to differeing userspace or powertop settings.
>
> Dave.
>
> ----- Original Message -----
>> From: "Kilian Singer" <kilian.singer@quantumtechnology.info>
>> To: "Hans de Goede" <hdegoede@redhat.com>, "Lyude Paul" <lyude@redhat.com>, "David Airlie" <airlied@redhat.com>,
>> "Peter Jones" <pjones@redhat.com>
>> Cc: "Lukas Wunner" <lukas@wunner.de>, "Rafael J. Wysocki" <rjw@rjwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
>> "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.westerberg@linux.intel.com>, "linux-pci"
>> <linux-pci@vger.kernel.org>, "Alex Deucher" <alexander.deucher@amd.com>, "Lyude" <cpaul@redhat.com>
>> Sent: Tuesday, 10 January, 2017 4:48:22 AM
>> Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe ports"
>>
>> Hi Lyude Paul,
>>
>> normal supend resume does not work neither on my machine.
>>
>> Best regards
>>
>> Kilian
>>
>>
>> On 09-Jan-17 16:21, Hans de Goede wrote:
>>> Hi Lyude,
>>>
>>> On 09-01-17 16:11, Lyude Paul wrote:
>>>> fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64 running on
>>>> here and so far it seems to suspend/resume just fine using firmware
>>>> version 2.19
>>> Note this is not about normal suspend resume, but runtime
>>> suspend/resume of the nvidia discrete GPU...
>>>
>>> Try running glxgears like this:
>>>
>>> DRI_PRIME=1 glxgears -info | grep REND
>>>
>>> (the grep is to check you're really running on the nvidia GPU).
>>>
>>> Then you should see msgs in dmesg about nouveau resuming the gpu,
>>> then kill glxgears and wait for 5 seconds, now the nouveau drv
>>> should say the gpu is suspending, etc.
>>>
>>> If it never runtime suspends, then make sure you are not using
>>> any external screens, only the built-in laptop screen.
>>>
>>> Regards,
>>>
>>> Hans
>>>
>>>
>>>> On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
>>>>> (cc'ing Lyude, who has the hw also I think).
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Peter Jones" <pjones@redhat.com>
>>>>>> To: "Lukas Wunner" <lukas@wunner.de>
>>>>>> Cc: "David Airlie" <airlied@redhat.com>, "Rafael J. Wysocki" <rjw@r
>>>>>> jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
>>>>>> "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.weste
>>>>>> rberg@linux.intel.com>, "Kilian Singer"
>>>>>> <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-pci@vger
>>>>>> .kernel.org>, "Alex Deucher"
>>>>>> <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redhat.com>
>>>>>> Sent: Friday, 6 January, 2017 4:13:23 AM
>>>>>> Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe
>>>>>> ports"
>>>>>>
>>>>>> On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
>>>>>>> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
>>>>>>>>> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
>>>>>>>>>> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner
>>>>>>>>>> wrote:
>>>>>>>>>>> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas
>>>>>>>>>>> wrote:
>>>>>>>>>>>> I don't *want* to apply the revert.  It's on my for-
>>>>>>>>>>>> linus branch
>>>>>>>>>>>> as a
>>>>>>>>>>>> worst-case scenario change if we can't figure out a
>>>>>>>>>>>> better fix.
>>>>>>>>>>>>
>>>>>>>>>>>> The patch below is preferable, but I'd rather not take
>>>>>>>>>>>> even it,
>>>>>>>>>>>> because it takes away functionality and forces people
>>>>>>>>>>>> to use a
>>>>>>>>>>>> boot
>>>>>>>>>>>> parameter to restore it.  I expect that somebody will
>>>>>>>>>>>> figure out
>>>>>>>>>>>> how
>>>>>>>>>>>> to fix the regression Kilian found and also keep the
>>>>>>>>>>>> new
>>>>>>>>>>>> functionality
>>>>>>>>>>>> (without requiring boot parameters) before v4.10.
>>>>>>>>>>> The issue is constrained to hybrid graphics laptops with
>>>>>>>>>>> Nvidia
>>>>>>>>>>> discrete
>>>>>>>>>>> GPU using nouveau.  Hence it needs to be fixed in
>>>>>>>>>>> nouveau, not in
>>>>>>>>>>> the
>>>>>>>>>>> PCI core.
>>>>>>>>>> The problem is not necessarily in the nouveau driver, the
>>>>>>>>>> same
>>>>>>>>>> problem
>>>>>>>>>> occurs when you enable RPM without loading nouveau. The
>>>>>>>>>> issue is
>>>>>>>>>> limited
>>>>>>>>>> though to some newer hybrid graphics laptops with Nvidia
>>>>>>>>>> GPUs. While
>>>>>>>>>> a
>>>>>>>>>> quirk can be added to nouveau, I think that a (temporary)
>>>>>>>>>> quirk in
>>>>>>>>>> core
>>>>>>>>>> would also be reasonable (since it also occurs without
>>>>>>>>>> nouveau).
>>>>>>>>>>
>>>>>>>>>>> (AFAIUI, laptops with AMD discrete GPU are not affected
>>>>>>>>>>> as it is
>>>>>>>>>>> known
>>>>>>>>>>> when and how to call an ACPI method versus using PR3.)
>>>>>>>>>>>
>>>>>>>>>>> (Neither are laptops using the Nvidia proprietary driver
>>>>>>>>>>> as it
>>>>>>>>>>> doesn't
>>>>>>>>>>> runtime suspend the card.  But battery life will be
>>>>>>>>>>> terrible then.)
>>>>>>>>>>>
>>>>>>>>>>> We're at rc2 so the time frame for coming up with a fix
>>>>>>>>>>> is probably
>>>>>>>>>>> 4 weeks.  Peter and others have tried for months to
>>>>>>>>>>> reverse-engineer
>>>>>>>>>>> how to handle runtime PM on newer Nvidia cards.  It seems
>>>>>>>>>>> likely
>>>>>>>>>>> that
>>>>>>>>>>> we'll not find the ultimate solution to the problem
>>>>>>>>>>> within 4 weeks.
>>>>>>>>>> Yep, a quick proper fix seems unlikely.
>>>>>>>>>> [ Help/ideas are welcome, I suspect that these failures to
>>>>>>>>>> restore
>>>>>>>>>> power
>>>>>>>>>> on laptops designed for Win8+ all have the same cause,
>>>>>>>>>> related to
>>>>>>>>>> some
>>>>>>>>>> unknown interaction between ACPI and PCI. Some links:
>>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
>>>>>>>>>>
>>>>>>>>>>> The way it is now, i.e. defaulting to PR3 when available,
>>>>>>>>>>> regresses
>>>>>>>>>>> certain laptops such as Kilian's.  If on the other hand
>>>>>>>>>>> we default
>>>>>>>>>>> to
>>>>>>>>>>> DSM when available, we'll regress certain other laptops,
>>>>>>>>>>> as Peter
>>>>>>>>>>> has
>>>>>>>>>>> pointed out.  Whitelisting or blacklisting laptops
>>>>>>>>>>> doesn't seem a
>>>>>>>>>>> good
>>>>>>>>>>> approach either, ideally we'd want to use PR3 as Windows
>>>>>>>>>>> does.
>>>>>>>>>>>
>>>>>>>>>>> As said, the only short-term solution I see is to add an
>>>>>>>>>>> "optimus"
>>>>>>>>>>> module_param to nouveau to allow users to select which
>>>>>>>>>>> method to
>>>>>>>>>>> use.
>>>>>>>>>>> So in Kilian's case an additional command line parameter
>>>>>>>>>>> would be
>>>>>>>>>>> necessary to fix the issue.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone see a better solution or can we agree on this
>>>>>>>>>>> one?  If
>>>>>>>>>>> so
>>>>>>>>>>> I can come up with a patch.  This could go in via Dave
>>>>>>>>>>> Airlie's
>>>>>>>>>>> tree.
>>>>>>>>>> As pcie_port_pm=off already reverts to DSM, I do not think
>>>>>>>>>> that an
>>>>>>>>>> additional (temporary) nouveau module parameter is going to
>>>>>>>>>> help. I
>>>>>>>>>> instead propose a (hopefully temporary) quirk in pci core
>>>>>>>>>> that
>>>>>>>>>> disables
>>>>>>>>>> D3cold RPM for just Kilians Lenovo laptop (basically
>>>>>>>>>> defaulting to
>>>>>>>>>> pcie_port_pm=off). Then the option pcie_port_pm=force can
>>>>>>>>>> still be
>>>>>>>>>> used
>>>>>>>>>> to test possible solutions in the future.
>>>>>>>>> I would rather add a quirk to the ACPI core to prevent the
>>>>>>>>> power
>>>>>>>>> resources in
>>>>>>>>> question from being enumerated.  Or even to prevent ACPI PM
>>>>>>>>> from being
>>>>>>>>> used for the port in question.
>>>>>>>> I do have a W541 in a cupboard in the office somewhere, but I
>>>>>>>> won't be
>>>>>>>> close to
>>>>>>>> it for a couple of weeks. The W541 was the first place I tested
>>>>>>>> the pm
>>>>>>>> patches
>>>>>>>> so I'm kinda wondering whether it's all W541's or just some
>>>>>>>> specific
>>>>>>>> model/bios
>>>>>>>> combo.
>>>>>> They seem to all ship with the 1.10 firmware, and 2.80 is current
>>>>>> (there
>>>>>> are a bunch of intermediate 2.xx versions).  Somewhere along the
>>>>>> line
>>>>>> they introduced some bugs in the UEFI stuff, so it wouldn't be
>>>>>> surprising if there's bugs introduced elsewhere as well.
>>>>>>
>>>>>>>> However I'm pretty much unavailable to do anything much until
>>>>>>>> late Jan on
>>>>>>>> this.
>>>>>>> Is there anyone else at Red Hat who might be able to look into
>>>>>>> this?
>>>>>>>
>>>>>>> ISTR that Hans de Goede is working on improving laptop support in
>>>>>>> Fedora,
>>>>>>> and Peter Jones recently got a patch merged for the W541 with the
>>>>>>> exact
>>>>>>> same firmware Kilian is using to work around a botched EFI memory
>>>>>>> map.
>>>>>>> Adding them to cc: in the hope that they may be able to help.
>>>>>>>
>>>>>>> @Peter, have you noticed issues with the discrete Nvidia GPU on
>>>>>>> your W541
>>>>>>> related to runtime suspend and system sleep?
>>>>>> I was using a borrowed one (I can certainly find it again, but I'm
>>>>>> not
>>>>>> working on graphics/pm really), but yeah - shutdown and lspci both
>>>>>> broke
>>>>>> sometime after pci_pm_runtime_resume().  Here's the traceback from
>>>>>> SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
>>>>>>
>>>>>> Dave, if you know who in Westford should have a look at this, I can
>>>>>> see
>>>>>> about getting them hardware.  I am more or less surrounded by that
>>>>>> team.
>>>>>>
>>>>>> --
>>>>>>         Peter
>>>>>>
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hans de Goede Jan. 11, 2017, 11:04 a.m. UTC | #34
HI,

On 05-01-17 16:06, Lukas Wunner wrote:
> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
>>> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
>>>> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
>>>>> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
>>>>>> I don't *want* to apply the revert.  It's on my for-linus branch as a
>>>>>> worst-case scenario change if we can't figure out a better fix.
>>>>>>
>>>>>> The patch below is preferable, but I'd rather not take even it,
>>>>>> because it takes away functionality and forces people to use a boot
>>>>>> parameter to restore it.  I expect that somebody will figure out how
>>>>>> to fix the regression Kilian found and also keep the new functionality
>>>>>> (without requiring boot parameters) before v4.10.
>>>>>
>>>>> The issue is constrained to hybrid graphics laptops with Nvidia discrete
>>>>> GPU using nouveau.  Hence it needs to be fixed in nouveau, not in the
>>>>> PCI core.
>>>>
>>>> The problem is not necessarily in the nouveau driver, the same problem
>>>> occurs when you enable RPM without loading nouveau. The issue is limited
>>>> though to some newer hybrid graphics laptops with Nvidia GPUs. While a
>>>> quirk can be added to nouveau, I think that a (temporary) quirk in core
>>>> would also be reasonable (since it also occurs without nouveau).
>>>>
>>>>> (AFAIUI, laptops with AMD discrete GPU are not affected as it is known
>>>>> when and how to call an ACPI method versus using PR3.)
>>>>>
>>>>> (Neither are laptops using the Nvidia proprietary driver as it doesn't
>>>>> runtime suspend the card.  But battery life will be terrible then.)
>>>>>
>>>>> We're at rc2 so the time frame for coming up with a fix is probably
>>>>> 4 weeks.  Peter and others have tried for months to reverse-engineer
>>>>> how to handle runtime PM on newer Nvidia cards.  It seems likely that
>>>>> we'll not find the ultimate solution to the problem within 4 weeks.
>>>>
>>>> Yep, a quick proper fix seems unlikely.
>>>> [ Help/ideas are welcome, I suspect that these failures to restore power
>>>> on laptops designed for Win8+ all have the same cause, related to some
>>>> unknown interaction between ACPI and PCI. Some links:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
>>>>
>>>>> The way it is now, i.e. defaulting to PR3 when available, regresses
>>>>> certain laptops such as Kilian's.  If on the other hand we default to
>>>>> DSM when available, we'll regress certain other laptops, as Peter has
>>>>> pointed out.  Whitelisting or blacklisting laptops doesn't seem a good
>>>>> approach either, ideally we'd want to use PR3 as Windows does.
>>>>>
>>>>> As said, the only short-term solution I see is to add an "optimus"
>>>>> module_param to nouveau to allow users to select which method to use.
>>>>> So in Kilian's case an additional command line parameter would be
>>>>> necessary to fix the issue.
>>>>>
>>>>> Does anyone see a better solution or can we agree on this one?  If so
>>>>> I can come up with a patch.  This could go in via Dave Airlie's tree.
>>>>
>>>> As pcie_port_pm=off already reverts to DSM, I do not think that an
>>>> additional (temporary) nouveau module parameter is going to help. I
>>>> instead propose a (hopefully temporary) quirk in pci core that disables
>>>> D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
>>>> pcie_port_pm=off). Then the option pcie_port_pm=force can still be used
>>>> to test possible solutions in the future.
>>>
>>> I would rather add a quirk to the ACPI core to prevent the power resources in
>>> question from being enumerated.  Or even to prevent ACPI PM from being
>>> used for the port in question.
>>
>> I do have a W541 in a cupboard in the office somewhere, but I won't be close to
>> it for a couple of weeks. The W541 was the first place I tested the pm patches
>> so I'm kinda wondering whether it's all W541's or just some specific model/bios
>> combo.
>>
>> However I'm pretty much unavailable to do anything much until late Jan on this.
>
> Is there anyone else at Red Hat who might be able to look into this?
>
> ISTR that Hans de Goede is working on improving laptop support in Fedora,
> and Peter Jones recently got a patch merged for the W541 with the exact
> same firmware Kilian is using to work around a botched EFI memory map.
> Adding them to cc: in the hope that they may be able to help.
>
> @Peter, have you noticed issues with the discrete Nvidia GPU on your W541
> related to runtime suspend and system sleep?

I've tried to reproduce this problem on my W541, which has the exact
same CPU + GPU combo as the reporter of:

https://bugzilla.kernel.org/show_bug.cgi?id=190861

But no luck, I started out with BIOS-2.27 and when I could not reproduce
I updated to 2.29 (should have tried 2.28 which is what the reporter
has first in retrospect) and still no luck in reproducing this.

I'll attach acpidumps of the 2 Bios versions I've tried to the bug.

Regards,

Hans

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kilian Singer Jan. 11, 2017, 1:24 p.m. UTC | #35
Dear all,

sounds interesting I could try to update to 2.29.

Shall I do so?

Best regards

Kilian



On 11-Jan-17 12:04, Hans de Goede wrote:
> HI,
>
> On 05-01-17 16:06, Lukas Wunner wrote:
>> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
>>>> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
>>>>> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
>>>>>> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
>>>>>>> I don't *want* to apply the revert.  It's on my for-linus branch
>>>>>>> as a
>>>>>>> worst-case scenario change if we can't figure out a better fix.
>>>>>>>
>>>>>>> The patch below is preferable, but I'd rather not take even it,
>>>>>>> because it takes away functionality and forces people to use a boot
>>>>>>> parameter to restore it.  I expect that somebody will figure out
>>>>>>> how
>>>>>>> to fix the regression Kilian found and also keep the new
>>>>>>> functionality
>>>>>>> (without requiring boot parameters) before v4.10.
>>>>>>
>>>>>> The issue is constrained to hybrid graphics laptops with Nvidia
>>>>>> discrete
>>>>>> GPU using nouveau.  Hence it needs to be fixed in nouveau, not in
>>>>>> the
>>>>>> PCI core.
>>>>>
>>>>> The problem is not necessarily in the nouveau driver, the same
>>>>> problem
>>>>> occurs when you enable RPM without loading nouveau. The issue is
>>>>> limited
>>>>> though to some newer hybrid graphics laptops with Nvidia GPUs.
>>>>> While a
>>>>> quirk can be added to nouveau, I think that a (temporary) quirk in
>>>>> core
>>>>> would also be reasonable (since it also occurs without nouveau).
>>>>>
>>>>>> (AFAIUI, laptops with AMD discrete GPU are not affected as it is
>>>>>> known
>>>>>> when and how to call an ACPI method versus using PR3.)
>>>>>>
>>>>>> (Neither are laptops using the Nvidia proprietary driver as it
>>>>>> doesn't
>>>>>> runtime suspend the card.  But battery life will be terrible then.)
>>>>>>
>>>>>> We're at rc2 so the time frame for coming up with a fix is probably
>>>>>> 4 weeks.  Peter and others have tried for months to reverse-engineer
>>>>>> how to handle runtime PM on newer Nvidia cards.  It seems likely
>>>>>> that
>>>>>> we'll not find the ultimate solution to the problem within 4 weeks.
>>>>>
>>>>> Yep, a quick proper fix seems unlikely.
>>>>> [ Help/ideas are welcome, I suspect that these failures to restore
>>>>> power
>>>>> on laptops designed for Win8+ all have the same cause, related to
>>>>> some
>>>>> unknown interaction between ACPI and PCI. Some links:
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
>>>>>
>>>>>> The way it is now, i.e. defaulting to PR3 when available, regresses
>>>>>> certain laptops such as Kilian's.  If on the other hand we
>>>>>> default to
>>>>>> DSM when available, we'll regress certain other laptops, as Peter
>>>>>> has
>>>>>> pointed out.  Whitelisting or blacklisting laptops doesn't seem a
>>>>>> good
>>>>>> approach either, ideally we'd want to use PR3 as Windows does.
>>>>>>
>>>>>> As said, the only short-term solution I see is to add an "optimus"
>>>>>> module_param to nouveau to allow users to select which method to
>>>>>> use.
>>>>>> So in Kilian's case an additional command line parameter would be
>>>>>> necessary to fix the issue.
>>>>>>
>>>>>> Does anyone see a better solution or can we agree on this one? 
>>>>>> If so
>>>>>> I can come up with a patch.  This could go in via Dave Airlie's
>>>>>> tree.
>>>>>
>>>>> As pcie_port_pm=off already reverts to DSM, I do not think that an
>>>>> additional (temporary) nouveau module parameter is going to help. I
>>>>> instead propose a (hopefully temporary) quirk in pci core that
>>>>> disables
>>>>> D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
>>>>> pcie_port_pm=off). Then the option pcie_port_pm=force can still be
>>>>> used
>>>>> to test possible solutions in the future.
>>>>
>>>> I would rather add a quirk to the ACPI core to prevent the power
>>>> resources in
>>>> question from being enumerated.  Or even to prevent ACPI PM from being
>>>> used for the port in question.
>>>
>>> I do have a W541 in a cupboard in the office somewhere, but I won't
>>> be close to
>>> it for a couple of weeks. The W541 was the first place I tested the
>>> pm patches
>>> so I'm kinda wondering whether it's all W541's or just some specific
>>> model/bios
>>> combo.
>>>
>>> However I'm pretty much unavailable to do anything much until late
>>> Jan on this.
>>
>> Is there anyone else at Red Hat who might be able to look into this?
>>
>> ISTR that Hans de Goede is working on improving laptop support in
>> Fedora,
>> and Peter Jones recently got a patch merged for the W541 with the exact
>> same firmware Kilian is using to work around a botched EFI memory map.
>> Adding them to cc: in the hope that they may be able to help.
>>
>> @Peter, have you noticed issues with the discrete Nvidia GPU on your
>> W541
>> related to runtime suspend and system sleep?
>
> I've tried to reproduce this problem on my W541, which has the exact
> same CPU + GPU combo as the reporter of:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>
> But no luck, I started out with BIOS-2.27 and when I could not reproduce
> I updated to 2.29 (should have tried 2.28 which is what the reporter
> has first in retrospect) and still no luck in reproducing this.
>
> I'll attach acpidumps of the 2 Bios versions I've tried to the bug.
>
> Regards,
>
> Hans
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hans de Goede Jan. 11, 2017, 1:26 p.m. UTC | #36
Hi,

On 11-01-17 14:24, Kilian Singer wrote:
> Dear all,
>
> sounds interesting I could try to update to 2.29.
>
> Shall I do so?

According to the BIOS changelog 2.29 has some
fixes for bugs introduces in 2.28, so trying
2.29 is probably a good idea.

Regards,

Hans



>
> Best regards
>
> Kilian
>
>
>
> On 11-Jan-17 12:04, Hans de Goede wrote:
>> HI,
>>
>> On 05-01-17 16:06, Lukas Wunner wrote:
>>> On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
>>>>> On Wednesday, January 04, 2017 10:09:54 PM Peter Wu wrote:
>>>>>> On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner wrote:
>>>>>>> On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn Helgaas wrote:
>>>>>>>> I don't *want* to apply the revert.  It's on my for-linus branch
>>>>>>>> as a
>>>>>>>> worst-case scenario change if we can't figure out a better fix.
>>>>>>>>
>>>>>>>> The patch below is preferable, but I'd rather not take even it,
>>>>>>>> because it takes away functionality and forces people to use a boot
>>>>>>>> parameter to restore it.  I expect that somebody will figure out
>>>>>>>> how
>>>>>>>> to fix the regression Kilian found and also keep the new
>>>>>>>> functionality
>>>>>>>> (without requiring boot parameters) before v4.10.
>>>>>>>
>>>>>>> The issue is constrained to hybrid graphics laptops with Nvidia
>>>>>>> discrete
>>>>>>> GPU using nouveau.  Hence it needs to be fixed in nouveau, not in
>>>>>>> the
>>>>>>> PCI core.
>>>>>>
>>>>>> The problem is not necessarily in the nouveau driver, the same
>>>>>> problem
>>>>>> occurs when you enable RPM without loading nouveau. The issue is
>>>>>> limited
>>>>>> though to some newer hybrid graphics laptops with Nvidia GPUs.
>>>>>> While a
>>>>>> quirk can be added to nouveau, I think that a (temporary) quirk in
>>>>>> core
>>>>>> would also be reasonable (since it also occurs without nouveau).
>>>>>>
>>>>>>> (AFAIUI, laptops with AMD discrete GPU are not affected as it is
>>>>>>> known
>>>>>>> when and how to call an ACPI method versus using PR3.)
>>>>>>>
>>>>>>> (Neither are laptops using the Nvidia proprietary driver as it
>>>>>>> doesn't
>>>>>>> runtime suspend the card.  But battery life will be terrible then.)
>>>>>>>
>>>>>>> We're at rc2 so the time frame for coming up with a fix is probably
>>>>>>> 4 weeks.  Peter and others have tried for months to reverse-engineer
>>>>>>> how to handle runtime PM on newer Nvidia cards.  It seems likely
>>>>>>> that
>>>>>>> we'll not find the ultimate solution to the problem within 4 weeks.
>>>>>>
>>>>>> Yep, a quick proper fix seems unlikely.
>>>>>> [ Help/ideas are welcome, I suspect that these failures to restore
>>>>>> power
>>>>>> on laptops designed for Win8+ all have the same cause, related to
>>>>>> some
>>>>>> unknown interaction between ACPI and PCI. Some links:
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
>>>>>>
>>>>>>> The way it is now, i.e. defaulting to PR3 when available, regresses
>>>>>>> certain laptops such as Kilian's.  If on the other hand we
>>>>>>> default to
>>>>>>> DSM when available, we'll regress certain other laptops, as Peter
>>>>>>> has
>>>>>>> pointed out.  Whitelisting or blacklisting laptops doesn't seem a
>>>>>>> good
>>>>>>> approach either, ideally we'd want to use PR3 as Windows does.
>>>>>>>
>>>>>>> As said, the only short-term solution I see is to add an "optimus"
>>>>>>> module_param to nouveau to allow users to select which method to
>>>>>>> use.
>>>>>>> So in Kilian's case an additional command line parameter would be
>>>>>>> necessary to fix the issue.
>>>>>>>
>>>>>>> Does anyone see a better solution or can we agree on this one?
>>>>>>> If so
>>>>>>> I can come up with a patch.  This could go in via Dave Airlie's
>>>>>>> tree.
>>>>>>
>>>>>> As pcie_port_pm=off already reverts to DSM, I do not think that an
>>>>>> additional (temporary) nouveau module parameter is going to help. I
>>>>>> instead propose a (hopefully temporary) quirk in pci core that
>>>>>> disables
>>>>>> D3cold RPM for just Kilians Lenovo laptop (basically defaulting to
>>>>>> pcie_port_pm=off). Then the option pcie_port_pm=force can still be
>>>>>> used
>>>>>> to test possible solutions in the future.
>>>>>
>>>>> I would rather add a quirk to the ACPI core to prevent the power
>>>>> resources in
>>>>> question from being enumerated.  Or even to prevent ACPI PM from being
>>>>> used for the port in question.
>>>>
>>>> I do have a W541 in a cupboard in the office somewhere, but I won't
>>>> be close to
>>>> it for a couple of weeks. The W541 was the first place I tested the
>>>> pm patches
>>>> so I'm kinda wondering whether it's all W541's or just some specific
>>>> model/bios
>>>> combo.
>>>>
>>>> However I'm pretty much unavailable to do anything much until late
>>>> Jan on this.
>>>
>>> Is there anyone else at Red Hat who might be able to look into this?
>>>
>>> ISTR that Hans de Goede is working on improving laptop support in
>>> Fedora,
>>> and Peter Jones recently got a patch merged for the W541 with the exact
>>> same firmware Kilian is using to work around a botched EFI memory map.
>>> Adding them to cc: in the hope that they may be able to help.
>>>
>>> @Peter, have you noticed issues with the discrete Nvidia GPU on your
>>> W541
>>> related to runtime suspend and system sleep?
>>
>> I've tried to reproduce this problem on my W541, which has the exact
>> same CPU + GPU combo as the reporter of:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=190861
>>
>> But no luck, I started out with BIOS-2.27 and when I could not reproduce
>> I updated to 2.29 (should have tried 2.28 which is what the reporter
>> has first in retrospect) and still no luck in reproducing this.
>>
>> I'll attach acpidumps of the 2 Bios versions I've tried to the bug.
>>
>> Regards,
>>
>> Hans
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Jones Jan. 11, 2017, 4:24 p.m. UTC | #37
On Wed, Jan 11, 2017 at 02:26:09PM +0100, Hans de Goede wrote:
> Hi,
> 
> On 11-01-17 14:24, Kilian Singer wrote:
> > Dear all,
> > 
> > sounds interesting I could try to update to 2.29.
> > 
> > Shall I do so?
> 
> According to the BIOS changelog 2.29 has some
> fixes for bugs introduces in 2.28, so trying
> 2.29 is probably a good idea.

I'd also be interested in seeing dmesg from that with efi=debug on the
kernel command line, if you don't mind.
Kilian Singer Jan. 11, 2017, 7:20 p.m. UTC | #38
I updated ti 2.29 but issue is not resolved.


On 11-Jan-17 17:24, Peter Jones wrote:
> On Wed, Jan 11, 2017 at 02:26:09PM +0100, Hans de Goede wrote:
>> Hi,
>>
>> On 11-01-17 14:24, Kilian Singer wrote:
>>> Dear all,
>>>
>>> sounds interesting I could try to update to 2.29.
>>>
>>> Shall I do so?
>> According to the BIOS changelog 2.29 has some
>> fixes for bugs introduces in 2.28, so trying
>> 2.29 is probably a good idea.
> I'd also be interested in seeing dmesg from that with efi=debug on the
> kernel command line, if you don't mind.
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lyude Paul Jan. 11, 2017, 8:40 p.m. UTC | #39
Alright yeah, runtime suspend definitely doesn't seem to work on this
one either. I thought I had sent a patch for this a while back, but
trying my patch now it doesn't seem to fix the issue…

On Mon, 2017-01-09 at 16:21 +0100, Hans de Goede wrote:
> Hi Lyude,
> 
> On 09-01-17 16:11, Lyude Paul wrote:
> > fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64
> > running on
> > here and so far it seems to suspend/resume just fine using firmware
> > version 2.19
> 
> Note this is not about normal suspend resume, but runtime
> suspend/resume of the nvidia discrete GPU...
> 
> Try running glxgears like this:
> 
> DRI_PRIME=1 glxgears -info | grep REND
> 
> (the grep is to check you're really running on the nvidia GPU).
> 
> Then you should see msgs in dmesg about nouveau resuming the gpu,
> then kill glxgears and wait for 5 seconds, now the nouveau drv
> should say the gpu is suspending, etc.
> 
> If it never runtime suspends, then make sure you are not using
> any external screens, only the built-in laptop screen.
> 
> Regards,
> 
> Hans
> 
> 
> > 
> > On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
> > > (cc'ing Lyude, who has the hw also I think).
> > > 
> > > ----- Original Message -----
> > > > From: "Peter Jones" <pjones@redhat.com>
> > > > To: "Lukas Wunner" <lukas@wunner.de>
> > > > Cc: "David Airlie" <airlied@redhat.com>, "Rafael J. Wysocki" <r
> > > > jw@r
> > > > jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> > > > "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg"
> > > > <mika.weste
> > > > rberg@linux.intel.com>, "Kilian Singer"
> > > > <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-pci@
> > > > vger
> > > > .kernel.org>, "Alex Deucher"
> > > > <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redhat.c
> > > > om>
> > > > Sent: Friday, 6 January, 2017 4:13:23 AM
> > > > Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe
> > > > ports"
> > > > 
> > > > On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
> > > > > On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie wrote:
> > > > > > > On Wednesday, January 04, 2017 10:09:54 PM Peter Wu
> > > > > > > wrote:
> > > > > > > > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas Wunner
> > > > > > > > wrote:
> > > > > > > > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn
> > > > > > > > > Helgaas
> > > > > > > > > wrote:
> > > > > > > > > > I don't *want* to apply the revert.  It's on my
> > > > > > > > > > for-
> > > > > > > > > > linus branch
> > > > > > > > > > as a
> > > > > > > > > > worst-case scenario change if we can't figure out a
> > > > > > > > > > better fix.
> > > > > > > > > > 
> > > > > > > > > > The patch below is preferable, but I'd rather not
> > > > > > > > > > take
> > > > > > > > > > even it,
> > > > > > > > > > because it takes away functionality and forces
> > > > > > > > > > people
> > > > > > > > > > to use a
> > > > > > > > > > boot
> > > > > > > > > > parameter to restore it.  I expect that somebody
> > > > > > > > > > will
> > > > > > > > > > figure out
> > > > > > > > > > how
> > > > > > > > > > to fix the regression Kilian found and also keep
> > > > > > > > > > the
> > > > > > > > > > new
> > > > > > > > > > functionality
> > > > > > > > > > (without requiring boot parameters) before v4.10.
> > > > > > > > > 
> > > > > > > > > The issue is constrained to hybrid graphics laptops
> > > > > > > > > with
> > > > > > > > > Nvidia
> > > > > > > > > discrete
> > > > > > > > > GPU using nouveau.  Hence it needs to be fixed in
> > > > > > > > > nouveau, not in
> > > > > > > > > the
> > > > > > > > > PCI core.
> > > > > > > > 
> > > > > > > > The problem is not necessarily in the nouveau driver,
> > > > > > > > the
> > > > > > > > same
> > > > > > > > problem
> > > > > > > > occurs when you enable RPM without loading nouveau. The
> > > > > > > > issue is
> > > > > > > > limited
> > > > > > > > though to some newer hybrid graphics laptops with
> > > > > > > > Nvidia
> > > > > > > > GPUs. While
> > > > > > > > a
> > > > > > > > quirk can be added to nouveau, I think that a
> > > > > > > > (temporary)
> > > > > > > > quirk in
> > > > > > > > core
> > > > > > > > would also be reasonable (since it also occurs without
> > > > > > > > nouveau).
> > > > > > > > 
> > > > > > > > > (AFAIUI, laptops with AMD discrete GPU are not
> > > > > > > > > affected
> > > > > > > > > as it is
> > > > > > > > > known
> > > > > > > > > when and how to call an ACPI method versus using
> > > > > > > > > PR3.)
> > > > > > > > > 
> > > > > > > > > (Neither are laptops using the Nvidia proprietary
> > > > > > > > > driver
> > > > > > > > > as it
> > > > > > > > > doesn't
> > > > > > > > > runtime suspend the card.  But battery life will be
> > > > > > > > > terrible then.)
> > > > > > > > > 
> > > > > > > > > We're at rc2 so the time frame for coming up with a
> > > > > > > > > fix
> > > > > > > > > is probably
> > > > > > > > > 4 weeks.  Peter and others have tried for months to
> > > > > > > > > reverse-engineer
> > > > > > > > > how to handle runtime PM on newer Nvidia cards.  It
> > > > > > > > > seems
> > > > > > > > > likely
> > > > > > > > > that
> > > > > > > > > we'll not find the ultimate solution to the problem
> > > > > > > > > within 4 weeks.
> > > > > > > > 
> > > > > > > > Yep, a quick proper fix seems unlikely.
> > > > > > > > [ Help/ideas are welcome, I suspect that these failures
> > > > > > > > to
> > > > > > > > restore
> > > > > > > > power
> > > > > > > > on laptops designed for Win8+ all have the same cause,
> > > > > > > > related to
> > > > > > > > some
> > > > > > > > unknown interaction between ACPI and PCI. Some links:
> > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > > > > > > > 
> > > > > > > > > The way it is now, i.e. defaulting to PR3 when
> > > > > > > > > available,
> > > > > > > > > regresses
> > > > > > > > > certain laptops such as Kilian's.  If on the other
> > > > > > > > > hand
> > > > > > > > > we default
> > > > > > > > > to
> > > > > > > > > DSM when available, we'll regress certain other
> > > > > > > > > laptops,
> > > > > > > > > as Peter
> > > > > > > > > has
> > > > > > > > > pointed out.  Whitelisting or blacklisting laptops
> > > > > > > > > doesn't seem a
> > > > > > > > > good
> > > > > > > > > approach either, ideally we'd want to use PR3 as
> > > > > > > > > Windows
> > > > > > > > > does.
> > > > > > > > > 
> > > > > > > > > As said, the only short-term solution I see is to add
> > > > > > > > > an
> > > > > > > > > "optimus"
> > > > > > > > > module_param to nouveau to allow users to select
> > > > > > > > > which
> > > > > > > > > method to
> > > > > > > > > use.
> > > > > > > > > So in Kilian's case an additional command line
> > > > > > > > > parameter
> > > > > > > > > would be
> > > > > > > > > necessary to fix the issue.
> > > > > > > > > 
> > > > > > > > > Does anyone see a better solution or can we agree on
> > > > > > > > > this
> > > > > > > > > one?  If
> > > > > > > > > so
> > > > > > > > > I can come up with a patch.  This could go in via
> > > > > > > > > Dave
> > > > > > > > > Airlie's
> > > > > > > > > tree.
> > > > > > > > 
> > > > > > > > As pcie_port_pm=off already reverts to DSM, I do not
> > > > > > > > think
> > > > > > > > that an
> > > > > > > > additional (temporary) nouveau module parameter is
> > > > > > > > going to
> > > > > > > > help. I
> > > > > > > > instead propose a (hopefully temporary) quirk in pci
> > > > > > > > core
> > > > > > > > that
> > > > > > > > disables
> > > > > > > > D3cold RPM for just Kilians Lenovo laptop (basically
> > > > > > > > defaulting to
> > > > > > > > pcie_port_pm=off). Then the option pcie_port_pm=force
> > > > > > > > can
> > > > > > > > still be
> > > > > > > > used
> > > > > > > > to test possible solutions in the future.
> > > > > > > 
> > > > > > > I would rather add a quirk to the ACPI core to prevent
> > > > > > > the
> > > > > > > power
> > > > > > > resources in
> > > > > > > question from being enumerated.  Or even to prevent ACPI
> > > > > > > PM
> > > > > > > from being
> > > > > > > used for the port in question.
> > > > > > 
> > > > > > I do have a W541 in a cupboard in the office somewhere, but
> > > > > > I
> > > > > > won't be
> > > > > > close to
> > > > > > it for a couple of weeks. The W541 was the first place I
> > > > > > tested
> > > > > > the pm
> > > > > > patches
> > > > > > so I'm kinda wondering whether it's all W541's or just some
> > > > > > specific
> > > > > > model/bios
> > > > > > combo.
> > > > 
> > > > They seem to all ship with the 1.10 firmware, and 2.80 is
> > > > current
> > > > (there
> > > > are a bunch of intermediate 2.xx versions).  Somewhere along
> > > > the
> > > > line
> > > > they introduced some bugs in the UEFI stuff, so it wouldn't be
> > > > surprising if there's bugs introduced elsewhere as well.
> > > > 
> > > > > > However I'm pretty much unavailable to do anything much
> > > > > > until
> > > > > > late Jan on
> > > > > > this.
> > > > > 
> > > > > Is there anyone else at Red Hat who might be able to look
> > > > > into
> > > > > this?
> > > > > 
> > > > > ISTR that Hans de Goede is working on improving laptop
> > > > > support in
> > > > > Fedora,
> > > > > and Peter Jones recently got a patch merged for the W541 with
> > > > > the
> > > > > exact
> > > > > same firmware Kilian is using to work around a botched EFI
> > > > > memory
> > > > > map.
> > > > > Adding them to cc: in the hope that they may be able to help.
> > > > > 
> > > > > @Peter, have you noticed issues with the discrete Nvidia GPU
> > > > > on
> > > > > your W541
> > > > > related to runtime suspend and system sleep?
> > > > 
> > > > I was using a borrowed one (I can certainly find it again, but
> > > > I'm
> > > > not
> > > > working on graphics/pm really), but yeah - shutdown and lspci
> > > > both
> > > > broke
> > > > sometime after pci_pm_runtime_resume().  Here's the traceback
> > > > from
> > > > SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
> > > > 
> > > > Dave, if you know who in Westford should have a look at this, I
> > > > can
> > > > see
> > > > about getting them hardware.  I am more or less surrounded by
> > > > that
> > > > team.
> > > > 
> > > > --
> > > >         Peter
> > > >
Lyude Paul Jan. 12, 2017, 1:13 a.m. UTC | #40
Neat, went through one of my old kernel git repos and indeed I did
write a patch for fixing this exact issue. I must have forgotten to
follow up with it after sending it out to the mailing list.

Anyway, the original patch I had doesn't fix the problem entirely on
new kernels so I added some more deadlock fixes into it, now RPM seems
to work fine. I will respond with a koji build of the F25 kernel with
the patches applied once I finish wrestling with getting kerberos to
work with FAS…

On Wed, 2017-01-11 at 15:40 -0500, Lyude Paul wrote:
> Alright yeah, runtime suspend definitely doesn't seem to work on this
> one either. I thought I had sent a patch for this a while back, but
> trying my patch now it doesn't seem to fix the issue…
> 
> On Mon, 2017-01-09 at 16:21 +0100, Hans de Goede wrote:
> > Hi Lyude,
> > 
> > On 09-01-17 16:11, Lyude Paul wrote:
> > > fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64
> > > running on
> > > here and so far it seems to suspend/resume just fine using
> > > firmware
> > > version 2.19
> > 
> > Note this is not about normal suspend resume, but runtime
> > suspend/resume of the nvidia discrete GPU...
> > 
> > Try running glxgears like this:
> > 
> > DRI_PRIME=1 glxgears -info | grep REND
> > 
> > (the grep is to check you're really running on the nvidia GPU).
> > 
> > Then you should see msgs in dmesg about nouveau resuming the gpu,
> > then kill glxgears and wait for 5 seconds, now the nouveau drv
> > should say the gpu is suspending, etc.
> > 
> > If it never runtime suspends, then make sure you are not using
> > any external screens, only the built-in laptop screen.
> > 
> > Regards,
> > 
> > Hans
> > 
> > 
> > > 
> > > On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
> > > > (cc'ing Lyude, who has the hw also I think).
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Peter Jones" <pjones@redhat.com>
> > > > > To: "Lukas Wunner" <lukas@wunner.de>
> > > > > Cc: "David Airlie" <airlied@redhat.com>, "Rafael J. Wysocki"
> > > > > <r
> > > > > jw@r
> > > > > jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> > > > > "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg"
> > > > > <mika.weste
> > > > > rberg@linux.intel.com>, "Kilian Singer"
> > > > > <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-
> > > > > pci@
> > > > > vger
> > > > > .kernel.org>, "Alex Deucher"
> > > > > <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redhat
> > > > > .c
> > > > > om>
> > > > > Sent: Friday, 6 January, 2017 4:13:23 AM
> > > > > Subject: Re: PCI: Revert "PCI: Add runtime PM support for
> > > > > PCIe
> > > > > ports"
> > > > > 
> > > > > On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner wrote:
> > > > > > On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie
> > > > > > wrote:
> > > > > > > > On Wednesday, January 04, 2017 10:09:54 PM Peter Wu
> > > > > > > > wrote:
> > > > > > > > > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas
> > > > > > > > > Wunner
> > > > > > > > > wrote:
> > > > > > > > > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn
> > > > > > > > > > Helgaas
> > > > > > > > > > wrote:
> > > > > > > > > > > I don't *want* to apply the revert.  It's on my
> > > > > > > > > > > for-
> > > > > > > > > > > linus branch
> > > > > > > > > > > as a
> > > > > > > > > > > worst-case scenario change if we can't figure out
> > > > > > > > > > > a
> > > > > > > > > > > better fix.
> > > > > > > > > > > 
> > > > > > > > > > > The patch below is preferable, but I'd rather not
> > > > > > > > > > > take
> > > > > > > > > > > even it,
> > > > > > > > > > > because it takes away functionality and forces
> > > > > > > > > > > people
> > > > > > > > > > > to use a
> > > > > > > > > > > boot
> > > > > > > > > > > parameter to restore it.  I expect that somebody
> > > > > > > > > > > will
> > > > > > > > > > > figure out
> > > > > > > > > > > how
> > > > > > > > > > > to fix the regression Kilian found and also keep
> > > > > > > > > > > the
> > > > > > > > > > > new
> > > > > > > > > > > functionality
> > > > > > > > > > > (without requiring boot parameters) before v4.10.
> > > > > > > > > > 
> > > > > > > > > > The issue is constrained to hybrid graphics laptops
> > > > > > > > > > with
> > > > > > > > > > Nvidia
> > > > > > > > > > discrete
> > > > > > > > > > GPU using nouveau.  Hence it needs to be fixed in
> > > > > > > > > > nouveau, not in
> > > > > > > > > > the
> > > > > > > > > > PCI core.
> > > > > > > > > 
> > > > > > > > > The problem is not necessarily in the nouveau driver,
> > > > > > > > > the
> > > > > > > > > same
> > > > > > > > > problem
> > > > > > > > > occurs when you enable RPM without loading nouveau.
> > > > > > > > > The
> > > > > > > > > issue is
> > > > > > > > > limited
> > > > > > > > > though to some newer hybrid graphics laptops with
> > > > > > > > > Nvidia
> > > > > > > > > GPUs. While
> > > > > > > > > a
> > > > > > > > > quirk can be added to nouveau, I think that a
> > > > > > > > > (temporary)
> > > > > > > > > quirk in
> > > > > > > > > core
> > > > > > > > > would also be reasonable (since it also occurs
> > > > > > > > > without
> > > > > > > > > nouveau).
> > > > > > > > > 
> > > > > > > > > > (AFAIUI, laptops with AMD discrete GPU are not
> > > > > > > > > > affected
> > > > > > > > > > as it is
> > > > > > > > > > known
> > > > > > > > > > when and how to call an ACPI method versus using
> > > > > > > > > > PR3.)
> > > > > > > > > > 
> > > > > > > > > > (Neither are laptops using the Nvidia proprietary
> > > > > > > > > > driver
> > > > > > > > > > as it
> > > > > > > > > > doesn't
> > > > > > > > > > runtime suspend the card.  But battery life will be
> > > > > > > > > > terrible then.)
> > > > > > > > > > 
> > > > > > > > > > We're at rc2 so the time frame for coming up with a
> > > > > > > > > > fix
> > > > > > > > > > is probably
> > > > > > > > > > 4 weeks.  Peter and others have tried for months to
> > > > > > > > > > reverse-engineer
> > > > > > > > > > how to handle runtime PM on newer Nvidia cards.  It
> > > > > > > > > > seems
> > > > > > > > > > likely
> > > > > > > > > > that
> > > > > > > > > > we'll not find the ultimate solution to the problem
> > > > > > > > > > within 4 weeks.
> > > > > > > > > 
> > > > > > > > > Yep, a quick proper fix seems unlikely.
> > > > > > > > > [ Help/ideas are welcome, I suspect that these
> > > > > > > > > failures
> > > > > > > > > to
> > > > > > > > > restore
> > > > > > > > > power
> > > > > > > > > on laptops designed for Win8+ all have the same
> > > > > > > > > cause,
> > > > > > > > > related to
> > > > > > > > > some
> > > > > > > > > unknown interaction between ACPI and PCI. Some links:
> > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341 ]
> > > > > > > > > 
> > > > > > > > > > The way it is now, i.e. defaulting to PR3 when
> > > > > > > > > > available,
> > > > > > > > > > regresses
> > > > > > > > > > certain laptops such as Kilian's.  If on the other
> > > > > > > > > > hand
> > > > > > > > > > we default
> > > > > > > > > > to
> > > > > > > > > > DSM when available, we'll regress certain other
> > > > > > > > > > laptops,
> > > > > > > > > > as Peter
> > > > > > > > > > has
> > > > > > > > > > pointed out.  Whitelisting or blacklisting laptops
> > > > > > > > > > doesn't seem a
> > > > > > > > > > good
> > > > > > > > > > approach either, ideally we'd want to use PR3 as
> > > > > > > > > > Windows
> > > > > > > > > > does.
> > > > > > > > > > 
> > > > > > > > > > As said, the only short-term solution I see is to
> > > > > > > > > > add
> > > > > > > > > > an
> > > > > > > > > > "optimus"
> > > > > > > > > > module_param to nouveau to allow users to select
> > > > > > > > > > which
> > > > > > > > > > method to
> > > > > > > > > > use.
> > > > > > > > > > So in Kilian's case an additional command line
> > > > > > > > > > parameter
> > > > > > > > > > would be
> > > > > > > > > > necessary to fix the issue.
> > > > > > > > > > 
> > > > > > > > > > Does anyone see a better solution or can we agree
> > > > > > > > > > on
> > > > > > > > > > this
> > > > > > > > > > one?  If
> > > > > > > > > > so
> > > > > > > > > > I can come up with a patch.  This could go in via
> > > > > > > > > > Dave
> > > > > > > > > > Airlie's
> > > > > > > > > > tree.
> > > > > > > > > 
> > > > > > > > > As pcie_port_pm=off already reverts to DSM, I do not
> > > > > > > > > think
> > > > > > > > > that an
> > > > > > > > > additional (temporary) nouveau module parameter is
> > > > > > > > > going to
> > > > > > > > > help. I
> > > > > > > > > instead propose a (hopefully temporary) quirk in pci
> > > > > > > > > core
> > > > > > > > > that
> > > > > > > > > disables
> > > > > > > > > D3cold RPM for just Kilians Lenovo laptop (basically
> > > > > > > > > defaulting to
> > > > > > > > > pcie_port_pm=off). Then the option pcie_port_pm=force
> > > > > > > > > can
> > > > > > > > > still be
> > > > > > > > > used
> > > > > > > > > to test possible solutions in the future.
> > > > > > > > 
> > > > > > > > I would rather add a quirk to the ACPI core to prevent
> > > > > > > > the
> > > > > > > > power
> > > > > > > > resources in
> > > > > > > > question from being enumerated.  Or even to prevent
> > > > > > > > ACPI
> > > > > > > > PM
> > > > > > > > from being
> > > > > > > > used for the port in question.
> > > > > > > 
> > > > > > > I do have a W541 in a cupboard in the office somewhere,
> > > > > > > but
> > > > > > > I
> > > > > > > won't be
> > > > > > > close to
> > > > > > > it for a couple of weeks. The W541 was the first place I
> > > > > > > tested
> > > > > > > the pm
> > > > > > > patches
> > > > > > > so I'm kinda wondering whether it's all W541's or just
> > > > > > > some
> > > > > > > specific
> > > > > > > model/bios
> > > > > > > combo.
> > > > > 
> > > > > They seem to all ship with the 1.10 firmware, and 2.80 is
> > > > > current
> > > > > (there
> > > > > are a bunch of intermediate 2.xx versions).  Somewhere along
> > > > > the
> > > > > line
> > > > > they introduced some bugs in the UEFI stuff, so it wouldn't
> > > > > be
> > > > > surprising if there's bugs introduced elsewhere as well.
> > > > > 
> > > > > > > However I'm pretty much unavailable to do anything much
> > > > > > > until
> > > > > > > late Jan on
> > > > > > > this.
> > > > > > 
> > > > > > Is there anyone else at Red Hat who might be able to look
> > > > > > into
> > > > > > this?
> > > > > > 
> > > > > > ISTR that Hans de Goede is working on improving laptop
> > > > > > support in
> > > > > > Fedora,
> > > > > > and Peter Jones recently got a patch merged for the W541
> > > > > > with
> > > > > > the
> > > > > > exact
> > > > > > same firmware Kilian is using to work around a botched EFI
> > > > > > memory
> > > > > > map.
> > > > > > Adding them to cc: in the hope that they may be able to
> > > > > > help.
> > > > > > 
> > > > > > @Peter, have you noticed issues with the discrete Nvidia
> > > > > > GPU
> > > > > > on
> > > > > > your W541
> > > > > > related to runtime suspend and system sleep?
> > > > > 
> > > > > I was using a borrowed one (I can certainly find it again,
> > > > > but
> > > > > I'm
> > > > > not
> > > > > working on graphics/pm really), but yeah - shutdown and lspci
> > > > > both
> > > > > broke
> > > > > sometime after pci_pm_runtime_resume().  Here's the traceback
> > > > > from
> > > > > SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
> > > > > 
> > > > > Dave, if you know who in Westford should have a look at this,
> > > > > I
> > > > > can
> > > > > see
> > > > > about getting them hardware.  I am more or less surrounded by
> > > > > that
> > > > > team.
> > > > > 
> > > > > --
> > > > >         Peter
> > > > >
Lyude Paul Jan. 12, 2017, 2:04 a.m. UTC | #41
Finally got koji to work :). People having the runtime resume problem,
can you give this kernel RPM a try and tell me if the issue still
persists?

https://koji.fedoraproject.org/koji/taskinfo?taskID=17250338

On Wed, 2017-01-11 at 20:13 -0500, Lyude Paul wrote:
> Neat, went through one of my old kernel git repos and indeed I did
> write a patch for fixing this exact issue. I must have forgotten to
> follow up with it after sending it out to the mailing list.
> 
> Anyway, the original patch I had doesn't fix the problem entirely on
> new kernels so I added some more deadlock fixes into it, now RPM
> seems
> to work fine. I will respond with a koji build of the F25 kernel with
> the patches applied once I finish wrestling with getting kerberos to
> work with FAS…
> 
> On Wed, 2017-01-11 at 15:40 -0500, Lyude Paul wrote:
> > Alright yeah, runtime suspend definitely doesn't seem to work on
> > this
> > one either. I thought I had sent a patch for this a while back, but
> > trying my patch now it doesn't seem to fix the issue…
> > 
> > On Mon, 2017-01-09 at 16:21 +0100, Hans de Goede wrote:
> > > Hi Lyude,
> > > 
> > > On 09-01-17 16:11, Lyude Paul wrote:
> > > > fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64
> > > > running on
> > > > here and so far it seems to suspend/resume just fine using
> > > > firmware
> > > > version 2.19
> > > 
> > > Note this is not about normal suspend resume, but runtime
> > > suspend/resume of the nvidia discrete GPU...
> > > 
> > > Try running glxgears like this:
> > > 
> > > DRI_PRIME=1 glxgears -info | grep REND
> > > 
> > > (the grep is to check you're really running on the nvidia GPU).
> > > 
> > > Then you should see msgs in dmesg about nouveau resuming the gpu,
> > > then kill glxgears and wait for 5 seconds, now the nouveau drv
> > > should say the gpu is suspending, etc.
> > > 
> > > If it never runtime suspends, then make sure you are not using
> > > any external screens, only the built-in laptop screen.
> > > 
> > > Regards,
> > > 
> > > Hans
> > > 
> > > 
> > > > 
> > > > On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
> > > > > (cc'ing Lyude, who has the hw also I think).
> > > > > 
> > > > > ----- Original Message -----
> > > > > > From: "Peter Jones" <pjones@redhat.com>
> > > > > > To: "Lukas Wunner" <lukas@wunner.de>
> > > > > > Cc: "David Airlie" <airlied@redhat.com>, "Rafael J.
> > > > > > Wysocki"
> > > > > > <r
> > > > > > jw@r
> > > > > > jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> > > > > > "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg"
> > > > > > <mika.weste
> > > > > > rberg@linux.intel.com>, "Kilian Singer"
> > > > > > <kilian.singer@quantumtechnology.info>, "linux-pci" <linux-
> > > > > > pci@
> > > > > > vger
> > > > > > .kernel.org>, "Alex Deucher"
> > > > > > <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@redh
> > > > > > at
> > > > > > .c
> > > > > > om>
> > > > > > Sent: Friday, 6 January, 2017 4:13:23 AM
> > > > > > Subject: Re: PCI: Revert "PCI: Add runtime PM support for
> > > > > > PCIe
> > > > > > ports"
> > > > > > 
> > > > > > On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner
> > > > > > wrote:
> > > > > > > On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie
> > > > > > > wrote:
> > > > > > > > > On Wednesday, January 04, 2017 10:09:54 PM Peter Wu
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas
> > > > > > > > > > Wunner
> > > > > > > > > > wrote:
> > > > > > > > > > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn
> > > > > > > > > > > Helgaas
> > > > > > > > > > > wrote:
> > > > > > > > > > > > I don't *want* to apply the revert.  It's on my
> > > > > > > > > > > > for-
> > > > > > > > > > > > linus branch
> > > > > > > > > > > > as a
> > > > > > > > > > > > worst-case scenario change if we can't figure
> > > > > > > > > > > > out
> > > > > > > > > > > > a
> > > > > > > > > > > > better fix.
> > > > > > > > > > > > 
> > > > > > > > > > > > The patch below is preferable, but I'd rather
> > > > > > > > > > > > not
> > > > > > > > > > > > take
> > > > > > > > > > > > even it,
> > > > > > > > > > > > because it takes away functionality and forces
> > > > > > > > > > > > people
> > > > > > > > > > > > to use a
> > > > > > > > > > > > boot
> > > > > > > > > > > > parameter to restore it.  I expect that
> > > > > > > > > > > > somebody
> > > > > > > > > > > > will
> > > > > > > > > > > > figure out
> > > > > > > > > > > > how
> > > > > > > > > > > > to fix the regression Kilian found and also
> > > > > > > > > > > > keep
> > > > > > > > > > > > the
> > > > > > > > > > > > new
> > > > > > > > > > > > functionality
> > > > > > > > > > > > (without requiring boot parameters) before
> > > > > > > > > > > > v4.10.
> > > > > > > > > > > 
> > > > > > > > > > > The issue is constrained to hybrid graphics
> > > > > > > > > > > laptops
> > > > > > > > > > > with
> > > > > > > > > > > Nvidia
> > > > > > > > > > > discrete
> > > > > > > > > > > GPU using nouveau.  Hence it needs to be fixed in
> > > > > > > > > > > nouveau, not in
> > > > > > > > > > > the
> > > > > > > > > > > PCI core.
> > > > > > > > > > 
> > > > > > > > > > The problem is not necessarily in the nouveau
> > > > > > > > > > driver,
> > > > > > > > > > the
> > > > > > > > > > same
> > > > > > > > > > problem
> > > > > > > > > > occurs when you enable RPM without loading nouveau.
> > > > > > > > > > The
> > > > > > > > > > issue is
> > > > > > > > > > limited
> > > > > > > > > > though to some newer hybrid graphics laptops with
> > > > > > > > > > Nvidia
> > > > > > > > > > GPUs. While
> > > > > > > > > > a
> > > > > > > > > > quirk can be added to nouveau, I think that a
> > > > > > > > > > (temporary)
> > > > > > > > > > quirk in
> > > > > > > > > > core
> > > > > > > > > > would also be reasonable (since it also occurs
> > > > > > > > > > without
> > > > > > > > > > nouveau).
> > > > > > > > > > 
> > > > > > > > > > > (AFAIUI, laptops with AMD discrete GPU are not
> > > > > > > > > > > affected
> > > > > > > > > > > as it is
> > > > > > > > > > > known
> > > > > > > > > > > when and how to call an ACPI method versus using
> > > > > > > > > > > PR3.)
> > > > > > > > > > > 
> > > > > > > > > > > (Neither are laptops using the Nvidia proprietary
> > > > > > > > > > > driver
> > > > > > > > > > > as it
> > > > > > > > > > > doesn't
> > > > > > > > > > > runtime suspend the card.  But battery life will
> > > > > > > > > > > be
> > > > > > > > > > > terrible then.)
> > > > > > > > > > > 
> > > > > > > > > > > We're at rc2 so the time frame for coming up with
> > > > > > > > > > > a
> > > > > > > > > > > fix
> > > > > > > > > > > is probably
> > > > > > > > > > > 4 weeks.  Peter and others have tried for months
> > > > > > > > > > > to
> > > > > > > > > > > reverse-engineer
> > > > > > > > > > > how to handle runtime PM on newer Nvidia
> > > > > > > > > > > cards.  It
> > > > > > > > > > > seems
> > > > > > > > > > > likely
> > > > > > > > > > > that
> > > > > > > > > > > we'll not find the ultimate solution to the
> > > > > > > > > > > problem
> > > > > > > > > > > within 4 weeks.
> > > > > > > > > > 
> > > > > > > > > > Yep, a quick proper fix seems unlikely.
> > > > > > > > > > [ Help/ideas are welcome, I suspect that these
> > > > > > > > > > failures
> > > > > > > > > > to
> > > > > > > > > > restore
> > > > > > > > > > power
> > > > > > > > > > on laptops designed for Win8+ all have the same
> > > > > > > > > > cause,
> > > > > > > > > > related to
> > > > > > > > > > some
> > > > > > > > > > unknown interaction between ACPI and PCI. Some
> > > > > > > > > > links:
> > > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=190861
> > > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341
> > > > > > > > > > ]
> > > > > > > > > > 
> > > > > > > > > > > The way it is now, i.e. defaulting to PR3 when
> > > > > > > > > > > available,
> > > > > > > > > > > regresses
> > > > > > > > > > > certain laptops such as Kilian's.  If on the
> > > > > > > > > > > other
> > > > > > > > > > > hand
> > > > > > > > > > > we default
> > > > > > > > > > > to
> > > > > > > > > > > DSM when available, we'll regress certain other
> > > > > > > > > > > laptops,
> > > > > > > > > > > as Peter
> > > > > > > > > > > has
> > > > > > > > > > > pointed out.  Whitelisting or blacklisting
> > > > > > > > > > > laptops
> > > > > > > > > > > doesn't seem a
> > > > > > > > > > > good
> > > > > > > > > > > approach either, ideally we'd want to use PR3 as
> > > > > > > > > > > Windows
> > > > > > > > > > > does.
> > > > > > > > > > > 
> > > > > > > > > > > As said, the only short-term solution I see is to
> > > > > > > > > > > add
> > > > > > > > > > > an
> > > > > > > > > > > "optimus"
> > > > > > > > > > > module_param to nouveau to allow users to select
> > > > > > > > > > > which
> > > > > > > > > > > method to
> > > > > > > > > > > use.
> > > > > > > > > > > So in Kilian's case an additional command line
> > > > > > > > > > > parameter
> > > > > > > > > > > would be
> > > > > > > > > > > necessary to fix the issue.
> > > > > > > > > > > 
> > > > > > > > > > > Does anyone see a better solution or can we agree
> > > > > > > > > > > on
> > > > > > > > > > > this
> > > > > > > > > > > one?  If
> > > > > > > > > > > so
> > > > > > > > > > > I can come up with a patch.  This could go in via
> > > > > > > > > > > Dave
> > > > > > > > > > > Airlie's
> > > > > > > > > > > tree.
> > > > > > > > > > 
> > > > > > > > > > As pcie_port_pm=off already reverts to DSM, I do
> > > > > > > > > > not
> > > > > > > > > > think
> > > > > > > > > > that an
> > > > > > > > > > additional (temporary) nouveau module parameter is
> > > > > > > > > > going to
> > > > > > > > > > help. I
> > > > > > > > > > instead propose a (hopefully temporary) quirk in
> > > > > > > > > > pci
> > > > > > > > > > core
> > > > > > > > > > that
> > > > > > > > > > disables
> > > > > > > > > > D3cold RPM for just Kilians Lenovo laptop
> > > > > > > > > > (basically
> > > > > > > > > > defaulting to
> > > > > > > > > > pcie_port_pm=off). Then the option
> > > > > > > > > > pcie_port_pm=force
> > > > > > > > > > can
> > > > > > > > > > still be
> > > > > > > > > > used
> > > > > > > > > > to test possible solutions in the future.
> > > > > > > > > 
> > > > > > > > > I would rather add a quirk to the ACPI core to
> > > > > > > > > prevent
> > > > > > > > > the
> > > > > > > > > power
> > > > > > > > > resources in
> > > > > > > > > question from being enumerated.  Or even to prevent
> > > > > > > > > ACPI
> > > > > > > > > PM
> > > > > > > > > from being
> > > > > > > > > used for the port in question.
> > > > > > > > 
> > > > > > > > I do have a W541 in a cupboard in the office somewhere,
> > > > > > > > but
> > > > > > > > I
> > > > > > > > won't be
> > > > > > > > close to
> > > > > > > > it for a couple of weeks. The W541 was the first place
> > > > > > > > I
> > > > > > > > tested
> > > > > > > > the pm
> > > > > > > > patches
> > > > > > > > so I'm kinda wondering whether it's all W541's or just
> > > > > > > > some
> > > > > > > > specific
> > > > > > > > model/bios
> > > > > > > > combo.
> > > > > > 
> > > > > > They seem to all ship with the 1.10 firmware, and 2.80 is
> > > > > > current
> > > > > > (there
> > > > > > are a bunch of intermediate 2.xx versions).  Somewhere
> > > > > > along
> > > > > > the
> > > > > > line
> > > > > > they introduced some bugs in the UEFI stuff, so it wouldn't
> > > > > > be
> > > > > > surprising if there's bugs introduced elsewhere as well.
> > > > > > 
> > > > > > > > However I'm pretty much unavailable to do anything much
> > > > > > > > until
> > > > > > > > late Jan on
> > > > > > > > this.
> > > > > > > 
> > > > > > > Is there anyone else at Red Hat who might be able to look
> > > > > > > into
> > > > > > > this?
> > > > > > > 
> > > > > > > ISTR that Hans de Goede is working on improving laptop
> > > > > > > support in
> > > > > > > Fedora,
> > > > > > > and Peter Jones recently got a patch merged for the W541
> > > > > > > with
> > > > > > > the
> > > > > > > exact
> > > > > > > same firmware Kilian is using to work around a botched
> > > > > > > EFI
> > > > > > > memory
> > > > > > > map.
> > > > > > > Adding them to cc: in the hope that they may be able to
> > > > > > > help.
> > > > > > > 
> > > > > > > @Peter, have you noticed issues with the discrete Nvidia
> > > > > > > GPU
> > > > > > > on
> > > > > > > your W541
> > > > > > > related to runtime suspend and system sleep?
> > > > > > 
> > > > > > I was using a borrowed one (I can certainly find it again,
> > > > > > but
> > > > > > I'm
> > > > > > not
> > > > > > working on graphics/pm really), but yeah - shutdown and
> > > > > > lspci
> > > > > > both
> > > > > > broke
> > > > > > sometime after pci_pm_runtime_resume().  Here's the
> > > > > > traceback
> > > > > > from
> > > > > > SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
> > > > > > 
> > > > > > Dave, if you know who in Westford should have a look at
> > > > > > this,
> > > > > > I
> > > > > > can
> > > > > > see
> > > > > > about getting them hardware.  I am more or less surrounded
> > > > > > by
> > > > > > that
> > > > > > team.
> > > > > > 
> > > > > > --
> > > > > >         Peter
> > > > > >
Lukas Wunner Jan. 12, 2017, 2:12 a.m. UTC | #42
On Wed, Jan 11, 2017 at 09:04:48PM -0500, Lyude Paul wrote:
> Finally got koji to work :). People having the runtime resume problem,
> can you give this kernel RPM a try and tell me if the issue still
> persists?
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=17250338

Kilian uses Debian but has a git repo with Linus' tree.  Could you post
the relevant patch(es) based on either the tip of Linus' master branch
or one of his tags (such as v4.9, v4.10-rc3)?  This would also allow us
to review/comment on the patches.

Thanks!

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lyude Paul Jan. 12, 2017, 6:10 p.m. UTC | #43
Fwiw, danvet showed me a patch he had already submitted that actually
fixes this issue as well:

https://patchwork.freedesktop.org/patch/132477/

So we're going to go with that. This doesn't fix the race conditions
I've noticed in fbcon(), but danvet suggested that some of the code for
that in nouveau should be cleaned up anyway.

On Tue, 2017-01-10 at 10:17 +0100, Kilian Singer wrote:
> It is a standart debian installation.
> 
> I have not installed powertop.
> 
> 
> On 10-Jan-17 01:33, David Airlie wrote:
> > Hi Killian,
> > 
> > do you use powertop or have you ever used it, I'm guessing some
> > port is getting into suspend on your machine that isn't on ours due
> > to differeing userspace or powertop settings.
> > 
> > Dave.
> > 
> > ----- Original Message -----
> > > From: "Kilian Singer" <kilian.singer@quantumtechnology.info>
> > > To: "Hans de Goede" <hdegoede@redhat.com>, "Lyude Paul" <lyude@re
> > > dhat.com>, "David Airlie" <airlied@redhat.com>,
> > > "Peter Jones" <pjones@redhat.com>
> > > Cc: "Lukas Wunner" <lukas@wunner.de>, "Rafael J. Wysocki" <rjw@rj
> > > wysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> > > "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg" <mika.wes
> > > terberg@linux.intel.com>, "linux-pci"
> > > <linux-pci@vger.kernel.org>, "Alex Deucher" <alexander.deucher@am
> > > d.com>, "Lyude" <cpaul@redhat.com>
> > > Sent: Tuesday, 10 January, 2017 4:48:22 AM
> > > Subject: Re: PCI: Revert "PCI: Add runtime PM support for PCIe
> > > ports"
> > > 
> > > Hi Lyude Paul,
> > > 
> > > normal supend resume does not work neither on my machine.
> > > 
> > > Best regards
> > > 
> > > Kilian
> > > 
> > > 
> > > On 09-Jan-17 16:21, Hans de Goede wrote:
> > > > Hi Lyude,
> > > > 
> > > > On 09-01-17 16:11, Lyude Paul wrote:
> > > > > fwiw, I just tried on the W541 I have 4.8.15-300.fc25.x86_64
> > > > > running on
> > > > > here and so far it seems to suspend/resume just fine using
> > > > > firmware
> > > > > version 2.19
> > > > 
> > > > Note this is not about normal suspend resume, but runtime
> > > > suspend/resume of the nvidia discrete GPU...
> > > > 
> > > > Try running glxgears like this:
> > > > 
> > > > DRI_PRIME=1 glxgears -info | grep REND
> > > > 
> > > > (the grep is to check you're really running on the nvidia GPU).
> > > > 
> > > > Then you should see msgs in dmesg about nouveau resuming the
> > > > gpu,
> > > > then kill glxgears and wait for 5 seconds, now the nouveau drv
> > > > should say the gpu is suspending, etc.
> > > > 
> > > > If it never runtime suspends, then make sure you are not using
> > > > any external screens, only the built-in laptop screen.
> > > > 
> > > > Regards,
> > > > 
> > > > Hans
> > > > 
> > > > 
> > > > > On Thu, 2017-01-05 at 14:36 -0500, David Airlie wrote:
> > > > > > (cc'ing Lyude, who has the hw also I think).
> > > > > > 
> > > > > > ----- Original Message -----
> > > > > > > From: "Peter Jones" <pjones@redhat.com>
> > > > > > > To: "Lukas Wunner" <lukas@wunner.de>
> > > > > > > Cc: "David Airlie" <airlied@redhat.com>, "Rafael J.
> > > > > > > Wysocki" <rjw@r
> > > > > > > jwysocki.net>, "Peter Wu" <peter@lekensteyn.nl>,
> > > > > > > "Bjorn Helgaas" <helgaas@kernel.org>, "Mika Westerberg"
> > > > > > > <mika.weste
> > > > > > > rberg@linux.intel.com>, "Kilian Singer"
> > > > > > > <kilian.singer@quantumtechnology.info>, "linux-pci" <linu
> > > > > > > x-pci@vger
> > > > > > > .kernel.org>, "Alex Deucher"
> > > > > > > <alexander.deucher@amd.com>, "Hans de Goede" <hdegoede@re
> > > > > > > dhat.com>
> > > > > > > Sent: Friday, 6 January, 2017 4:13:23 AM
> > > > > > > Subject: Re: PCI: Revert "PCI: Add runtime PM support for
> > > > > > > PCIe
> > > > > > > ports"
> > > > > > > 
> > > > > > > On Thu, Jan 05, 2017 at 04:06:46PM +0100, Lukas Wunner
> > > > > > > wrote:
> > > > > > > > On Wed, Jan 04, 2017 at 06:21:14PM -0500, David Airlie
> > > > > > > > wrote:
> > > > > > > > > > On Wednesday, January 04, 2017 10:09:54 PM Peter Wu
> > > > > > > > > > wrote:
> > > > > > > > > > > On Wed, Jan 04, 2017 at 09:16:39AM +0100, Lukas
> > > > > > > > > > > Wunner
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Tue, Jan 03, 2017 at 06:05:57PM -0600, Bjorn
> > > > > > > > > > > > Helgaas
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > I don't *want* to apply the revert.  It's on
> > > > > > > > > > > > > my for-
> > > > > > > > > > > > > linus branch
> > > > > > > > > > > > > as a
> > > > > > > > > > > > > worst-case scenario change if we can't figure
> > > > > > > > > > > > > out a
> > > > > > > > > > > > > better fix.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The patch below is preferable, but I'd rather
> > > > > > > > > > > > > not take
> > > > > > > > > > > > > even it,
> > > > > > > > > > > > > because it takes away functionality and
> > > > > > > > > > > > > forces people
> > > > > > > > > > > > > to use a
> > > > > > > > > > > > > boot
> > > > > > > > > > > > > parameter to restore it.  I expect that
> > > > > > > > > > > > > somebody will
> > > > > > > > > > > > > figure out
> > > > > > > > > > > > > how
> > > > > > > > > > > > > to fix the regression Kilian found and also
> > > > > > > > > > > > > keep the
> > > > > > > > > > > > > new
> > > > > > > > > > > > > functionality
> > > > > > > > > > > > > (without requiring boot parameters) before
> > > > > > > > > > > > > v4.10.
> > > > > > > > > > > > 
> > > > > > > > > > > > The issue is constrained to hybrid graphics
> > > > > > > > > > > > laptops with
> > > > > > > > > > > > Nvidia
> > > > > > > > > > > > discrete
> > > > > > > > > > > > GPU using nouveau.  Hence it needs to be fixed
> > > > > > > > > > > > in
> > > > > > > > > > > > nouveau, not in
> > > > > > > > > > > > the
> > > > > > > > > > > > PCI core.
> > > > > > > > > > > 
> > > > > > > > > > > The problem is not necessarily in the nouveau
> > > > > > > > > > > driver, the
> > > > > > > > > > > same
> > > > > > > > > > > problem
> > > > > > > > > > > occurs when you enable RPM without loading
> > > > > > > > > > > nouveau. The
> > > > > > > > > > > issue is
> > > > > > > > > > > limited
> > > > > > > > > > > though to some newer hybrid graphics laptops with
> > > > > > > > > > > Nvidia
> > > > > > > > > > > GPUs. While
> > > > > > > > > > > a
> > > > > > > > > > > quirk can be added to nouveau, I think that a
> > > > > > > > > > > (temporary)
> > > > > > > > > > > quirk in
> > > > > > > > > > > core
> > > > > > > > > > > would also be reasonable (since it also occurs
> > > > > > > > > > > without
> > > > > > > > > > > nouveau).
> > > > > > > > > > > 
> > > > > > > > > > > > (AFAIUI, laptops with AMD discrete GPU are not
> > > > > > > > > > > > affected
> > > > > > > > > > > > as it is
> > > > > > > > > > > > known
> > > > > > > > > > > > when and how to call an ACPI method versus
> > > > > > > > > > > > using PR3.)
> > > > > > > > > > > > 
> > > > > > > > > > > > (Neither are laptops using the Nvidia
> > > > > > > > > > > > proprietary driver
> > > > > > > > > > > > as it
> > > > > > > > > > > > doesn't
> > > > > > > > > > > > runtime suspend the card.  But battery life
> > > > > > > > > > > > will be
> > > > > > > > > > > > terrible then.)
> > > > > > > > > > > > 
> > > > > > > > > > > > We're at rc2 so the time frame for coming up
> > > > > > > > > > > > with a fix
> > > > > > > > > > > > is probably
> > > > > > > > > > > > 4 weeks.  Peter and others have tried for
> > > > > > > > > > > > months to
> > > > > > > > > > > > reverse-engineer
> > > > > > > > > > > > how to handle runtime PM on newer Nvidia
> > > > > > > > > > > > cards.  It seems
> > > > > > > > > > > > likely
> > > > > > > > > > > > that
> > > > > > > > > > > > we'll not find the ultimate solution to the
> > > > > > > > > > > > problem
> > > > > > > > > > > > within 4 weeks.
> > > > > > > > > > > 
> > > > > > > > > > > Yep, a quick proper fix seems unlikely.
> > > > > > > > > > > [ Help/ideas are welcome, I suspect that these
> > > > > > > > > > > failures to
> > > > > > > > > > > restore
> > > > > > > > > > > power
> > > > > > > > > > > on laptops designed for Win8+ all have the same
> > > > > > > > > > > cause,
> > > > > > > > > > > related to
> > > > > > > > > > > some
> > > > > > > > > > > unknown interaction between ACPI and PCI. Some
> > > > > > > > > > > links:
> > > > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=19086
> > > > > > > > > > > 1
> > > > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=15634
> > > > > > > > > > > 1 ]
> > > > > > > > > > > 
> > > > > > > > > > > > The way it is now, i.e. defaulting to PR3 when
> > > > > > > > > > > > available,
> > > > > > > > > > > > regresses
> > > > > > > > > > > > certain laptops such as Kilian's.  If on the
> > > > > > > > > > > > other hand
> > > > > > > > > > > > we default
> > > > > > > > > > > > to
> > > > > > > > > > > > DSM when available, we'll regress certain other
> > > > > > > > > > > > laptops,
> > > > > > > > > > > > as Peter
> > > > > > > > > > > > has
> > > > > > > > > > > > pointed out.  Whitelisting or blacklisting
> > > > > > > > > > > > laptops
> > > > > > > > > > > > doesn't seem a
> > > > > > > > > > > > good
> > > > > > > > > > > > approach either, ideally we'd want to use PR3
> > > > > > > > > > > > as Windows
> > > > > > > > > > > > does.
> > > > > > > > > > > > 
> > > > > > > > > > > > As said, the only short-term solution I see is
> > > > > > > > > > > > to add an
> > > > > > > > > > > > "optimus"
> > > > > > > > > > > > module_param to nouveau to allow users to
> > > > > > > > > > > > select which
> > > > > > > > > > > > method to
> > > > > > > > > > > > use.
> > > > > > > > > > > > So in Kilian's case an additional command line
> > > > > > > > > > > > parameter
> > > > > > > > > > > > would be
> > > > > > > > > > > > necessary to fix the issue.
> > > > > > > > > > > > 
> > > > > > > > > > > > Does anyone see a better solution or can we
> > > > > > > > > > > > agree on this
> > > > > > > > > > > > one?  If
> > > > > > > > > > > > so
> > > > > > > > > > > > I can come up with a patch.  This could go in
> > > > > > > > > > > > via Dave
> > > > > > > > > > > > Airlie's
> > > > > > > > > > > > tree.
> > > > > > > > > > > 
> > > > > > > > > > > As pcie_port_pm=off already reverts to DSM, I do
> > > > > > > > > > > not think
> > > > > > > > > > > that an
> > > > > > > > > > > additional (temporary) nouveau module parameter
> > > > > > > > > > > is going to
> > > > > > > > > > > help. I
> > > > > > > > > > > instead propose a (hopefully temporary) quirk in
> > > > > > > > > > > pci core
> > > > > > > > > > > that
> > > > > > > > > > > disables
> > > > > > > > > > > D3cold RPM for just Kilians Lenovo laptop
> > > > > > > > > > > (basically
> > > > > > > > > > > defaulting to
> > > > > > > > > > > pcie_port_pm=off). Then the option
> > > > > > > > > > > pcie_port_pm=force can
> > > > > > > > > > > still be
> > > > > > > > > > > used
> > > > > > > > > > > to test possible solutions in the future.
> > > > > > > > > > 
> > > > > > > > > > I would rather add a quirk to the ACPI core to
> > > > > > > > > > prevent the
> > > > > > > > > > power
> > > > > > > > > > resources in
> > > > > > > > > > question from being enumerated.  Or even to prevent
> > > > > > > > > > ACPI PM
> > > > > > > > > > from being
> > > > > > > > > > used for the port in question.
> > > > > > > > > 
> > > > > > > > > I do have a W541 in a cupboard in the office
> > > > > > > > > somewhere, but I
> > > > > > > > > won't be
> > > > > > > > > close to
> > > > > > > > > it for a couple of weeks. The W541 was the first
> > > > > > > > > place I tested
> > > > > > > > > the pm
> > > > > > > > > patches
> > > > > > > > > so I'm kinda wondering whether it's all W541's or
> > > > > > > > > just some
> > > > > > > > > specific
> > > > > > > > > model/bios
> > > > > > > > > combo.
> > > > > > > 
> > > > > > > They seem to all ship with the 1.10 firmware, and 2.80 is
> > > > > > > current
> > > > > > > (there
> > > > > > > are a bunch of intermediate 2.xx versions).  Somewhere
> > > > > > > along the
> > > > > > > line
> > > > > > > they introduced some bugs in the UEFI stuff, so it
> > > > > > > wouldn't be
> > > > > > > surprising if there's bugs introduced elsewhere as well.
> > > > > > > 
> > > > > > > > > However I'm pretty much unavailable to do anything
> > > > > > > > > much until
> > > > > > > > > late Jan on
> > > > > > > > > this.
> > > > > > > > 
> > > > > > > > Is there anyone else at Red Hat who might be able to
> > > > > > > > look into
> > > > > > > > this?
> > > > > > > > 
> > > > > > > > ISTR that Hans de Goede is working on improving laptop
> > > > > > > > support in
> > > > > > > > Fedora,
> > > > > > > > and Peter Jones recently got a patch merged for the
> > > > > > > > W541 with the
> > > > > > > > exact
> > > > > > > > same firmware Kilian is using to work around a botched
> > > > > > > > EFI memory
> > > > > > > > map.
> > > > > > > > Adding them to cc: in the hope that they may be able to
> > > > > > > > help.
> > > > > > > > 
> > > > > > > > @Peter, have you noticed issues with the discrete
> > > > > > > > Nvidia GPU on
> > > > > > > > your W541
> > > > > > > > related to runtime suspend and system sleep?
> > > > > > > 
> > > > > > > I was using a borrowed one (I can certainly find it
> > > > > > > again, but I'm
> > > > > > > not
> > > > > > > working on graphics/pm really), but yeah - shutdown and
> > > > > > > lspci both
> > > > > > > broke
> > > > > > > sometime after pci_pm_runtime_resume().  Here's the
> > > > > > > traceback from
> > > > > > > SYS_reboot(): https://goo.gl/photos/T1fr1bksHQb9RSU67
> > > > > > > 
> > > > > > > Dave, if you know who in Westford should have a look at
> > > > > > > this, I can
> > > > > > > see
> > > > > > > about getting them hardware.  I am more or less
> > > > > > > surrounded by that
> > > > > > > team.
> > > > > > > 
> > > > > > > --
> > > > > > >         Peter
> > > > > > > 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mika Westerberg Jan. 17, 2017, 3:55 p.m. UTC | #44
On Thu, Jan 12, 2017 at 03:12:35AM +0100, Lukas Wunner wrote:
> On Wed, Jan 11, 2017 at 09:04:48PM -0500, Lyude Paul wrote:
> > Finally got koji to work :). People having the runtime resume problem,
> > can you give this kernel RPM a try and tell me if the issue still
> > persists?
> > 
> > https://koji.fedoraproject.org/koji/taskinfo?taskID=17250338
> 
> Kilian uses Debian but has a git repo with Linus' tree.  Could you post
> the relevant patch(es) based on either the tip of Linus' master branch
> or one of his tags (such as v4.9, v4.10-rc3)?  This would also allow us
> to review/comment on the patches.

Hi Lyude,

Just a reminder. Can you post the patch(es) here so that people can try
them out and comment?
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lyude Paul Jan. 17, 2017, 6:06 p.m. UTC | #45
Did you not get the e-mails I CC'd you in? I did originally send you
guys the patches, but I ended up finding out that Daniel Vetter had
already submitted a fix for this:

https://patchwork.freedesktop.org/patch/132478/

On Tue, 2017-01-17 at 17:55 +0200, Mika Westerberg wrote:
> On Thu, Jan 12, 2017 at 03:12:35AM +0100, Lukas Wunner wrote:
> > On Wed, Jan 11, 2017 at 09:04:48PM -0500, Lyude Paul wrote:
> > > Finally got koji to work :). People having the runtime resume
> > > problem,
> > > can you give this kernel RPM a try and tell me if the issue still
> > > persists?
> > > 
> > > https://koji.fedoraproject.org/koji/taskinfo?taskID=17250338
> > 
> > Kilian uses Debian but has a git repo with Linus' tree.  Could you
> > post
> > the relevant patch(es) based on either the tip of Linus' master
> > branch
> > or one of his tags (such as v4.9, v4.10-rc3)?  This would also
> > allow us
> > to review/comment on the patches.
> 
> Hi Lyude,
> 
> Just a reminder. Can you post the patch(es) here so that people can
> try
> them out and comment?
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Jan. 17, 2017, 7:10 p.m. UTC | #46
[+cc Daniel]

On Tue, Jan 17, 2017 at 01:06:06PM -0500, Lyude Paul wrote:
> Did you not get the e-mails I CC'd you in? I did originally send you
> guys the patches, but I ended up finding out that Daniel Vetter had
> already submitted a fix for this:
> 
> https://patchwork.freedesktop.org/patch/132478/

Is this related to https://bugzilla.kernel.org/show_bug.cgi?id=190861 ?

If so, we need to connect the dots a little bit by mentioning it in the
changelog, CC'ing this thread on linux-pci, and figuring out how to connect
the fix with the regression, i.e., stable backports, etc.

I haven't followed the entire thread, so I apologize if the patch you
mention is unrelated to this bugzilla.

> On Tue, 2017-01-17 at 17:55 +0200, Mika Westerberg wrote:
> > On Thu, Jan 12, 2017 at 03:12:35AM +0100, Lukas Wunner wrote:
> > > On Wed, Jan 11, 2017 at 09:04:48PM -0500, Lyude Paul wrote:
> > > > Finally got koji to work :). People having the runtime resume
> > > > problem,
> > > > can you give this kernel RPM a try and tell me if the issue still
> > > > persists?
> > > > 
> > > > https://koji.fedoraproject.org/koji/taskinfo?taskID=17250338
> > > 
> > > Kilian uses Debian but has a git repo with Linus' tree.  Could you
> > > post
> > > the relevant patch(es) based on either the tip of Linus' master
> > > branch
> > > or one of his tags (such as v4.9, v4.10-rc3)?  This would also
> > > allow us
> > > to review/comment on the patches.
> > 
> > Hi Lyude,
> > 
> > Just a reminder. Can you post the patch(es) here so that people can
> > try
> > them out and comment?
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lyude Paul Jan. 17, 2017, 7:49 p.m. UTC | #47
On Tue, 2017-01-17 at 13:10 -0600, Bjorn Helgaas wrote:
> [+cc Daniel]
> 
> On Tue, Jan 17, 2017 at 01:06:06PM -0500, Lyude Paul wrote:
> > Did you not get the e-mails I CC'd you in? I did originally send
> > you
> > guys the patches, but I ended up finding out that Daniel Vetter had
> > already submitted a fix for this:
> > 
> > https://patchwork.freedesktop.org/patch/132478/
> 
> Is this related to https://bugzilla.kernel.org/show_bug.cgi?id=190861
>  ?
> 
> If so, we need to connect the dots a little bit by mentioning it in
> the
> changelog, CC'ing this thread on linux-pci, and figuring out how to
> connect
> the fix with the regression, i.e., stable backports, etc.
> 
> I haven't followed the entire thread, so I apologize if the patch you
> mention is unrelated to this bugzilla.

Nice catch. I'm not entirely sure if it's related either but judging
from the bz comments it's certainly plausible. I'll link to the patch
there and see what happens.
> 
> > On Tue, 2017-01-17 at 17:55 +0200, Mika Westerberg wrote:
> > > On Thu, Jan 12, 2017 at 03:12:35AM +0100, Lukas Wunner wrote:
> > > > On Wed, Jan 11, 2017 at 09:04:48PM -0500, Lyude Paul wrote:
> > > > > Finally got koji to work :). People having the runtime resume
> > > > > problem,
> > > > > can you give this kernel RPM a try and tell me if the issue
> > > > > still
> > > > > persists?
> > > > > 
> > > > > https://koji.fedoraproject.org/koji/taskinfo?taskID=17250338
> > > > 
> > > > Kilian uses Debian but has a git repo with Linus' tree.  Could
> > > > you
> > > > post
> > > > the relevant patch(es) based on either the tip of Linus' master
> > > > branch
> > > > or one of his tags (such as v4.9, v4.10-rc3)?  This would also
> > > > allow us
> > > > to review/comment on the patches.
> > > 
> > > Hi Lyude,
> > > 
> > > Just a reminder. Can you post the patch(es) here so that people
> > > can
> > > try
> > > them out and comment?
Lukas Wunner Jan. 24, 2017, 4:59 a.m. UTC | #48
On Thu, Jan 12, 2017 at 01:10:35PM -0500, Lyude Paul wrote:
> Fwiw, danvet showed me a patch he had already submitted that actually
> fixes this issue as well:
> 
> https://patchwork.freedesktop.org/patch/132477/
> 
> So we're going to go with that. This doesn't fix the race conditions
> I've noticed in fbcon(), but danvet suggested that some of the code for
> that in nouveau should be cleaned up anyway.

@Lyude:  Since Daniel's patch landed in Linus' tree tonight, what else
is needed to fix the race conditions you mention above that might be at
the root of Kilian's as well as Peter's issues with nouveau?

See:
https://bugzilla.kernel.org/show_bug.cgi?id=190861
https://bugzilla.kernel.org/show_bug.cgi?id=156341

Could you propose a patch on top of Daniel's fix?

Thanks,

Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lyude Paul Jan. 24, 2017, 7:09 p.m. UTC | #49
On Tue, 2017-01-24 at 05:59 +0100, Lukas Wunner wrote:
> On Thu, Jan 12, 2017 at 01:10:35PM -0500, Lyude Paul wrote:
> > Fwiw, danvet showed me a patch he had already submitted that
> > actually
> > fixes this issue as well:
> > 
> > https://patchwork.freedesktop.org/patch/132477/
> > 
> > So we're going to go with that. This doesn't fix the race
> > conditions
> > I've noticed in fbcon(), but danvet suggested that some of the code
> > for
> > that in nouveau should be cleaned up anyway.
fwiw, this should only ever come up if you are manually trying to turn
the GPU on or off using the debugfs interface with vga-switcheroo
> 
> @Lyude:  Since Daniel's patch landed in Linus' tree tonight, what
> else
> is needed to fix the race conditions you mention above that might be
> at
> the root of Kilian's as well as Peter's issues with nouveau?

I'm not sure what you mean here, are you still seeing these issues?
Daniel's patch should be all that's required for fixing this

> 
> See:
> https://bugzilla.kernel.org/show_bug.cgi?id=190861
> https://bugzilla.kernel.org/show_bug.cgi?id=156341
> 
> Could you propose a patch on top of Daniel's fix?
> 
> Thanks,
> 
> Lukas
diff mbox

Patch

Index: linux-pm/drivers/pci/pci.c
===================================================================
--- linux-pm.orig/drivers/pci/pci.c
+++ linux-pm/drivers/pci/pci.c
@@ -108,17 +108,14 @@  unsigned int pcibios_max_latency = 255;
 /* If set, the PCIe ARI capability will not be used. */
 static bool pcie_ari_disabled;
 
-/* Disable bridge_d3 for all PCIe ports */
-static bool pci_bridge_d3_disable;
-/* Force bridge_d3 for all PCIe ports */
-static bool pci_bridge_d3_force;
+/* Enable bridge_d3 for all PCIe ports */
+static bool pci_bridge_d3_enable;
 
 static int __init pcie_port_pm_setup(char *str)
 {
-	if (!strcmp(str, "off"))
-		pci_bridge_d3_disable = true;
-	else if (!strcmp(str, "force"))
-		pci_bridge_d3_force = true;
+	if (!strcmp(str, "on"))
+		pci_bridge_d3_enable = true;
+
 	return 1;
 }
 __setup("pcie_port_pm=", pcie_port_pm_setup);
@@ -2237,7 +2234,7 @@  bool pci_bridge_d3_possible(struct pci_d
 	case PCI_EXP_TYPE_ROOT_PORT:
 	case PCI_EXP_TYPE_UPSTREAM:
 	case PCI_EXP_TYPE_DOWNSTREAM:
-		if (pci_bridge_d3_disable)
+		if (!pci_bridge_d3_enable)
 			return false;
 
 		/*
@@ -2247,17 +2244,6 @@  bool pci_bridge_d3_possible(struct pci_d
 		if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
 			return false;
 
-		if (pci_bridge_d3_force)
-			return true;
-
-		/*
-		 * It should be safe to put PCIe ports from 2015 or newer
-		 * to D3.
-		 */
-		if (dmi_get_date(DMI_BIOS_DATE, &year, NULL, NULL) &&
-		    year >= 2015) {
-			return true;
-		}
 		break;
 	}
 
Index: linux-pm/Documentation/admin-guide/kernel-parameters.txt
===================================================================
--- linux-pm.orig/Documentation/admin-guide/kernel-parameters.txt
+++ linux-pm/Documentation/admin-guide/kernel-parameters.txt
@@ -2984,8 +2984,7 @@ 
 			ports driver.
 
 	pcie_port_pm=	[PCIE] PCIe port power management handling:
-		off	Disable power management of all PCIe ports
-		force	Forcibly enable power management of all PCIe ports
+		on	Enable power management of PCIe ports
 
 	pcie_pme=	[PCIE,PM] Native PCIe PME signaling options:
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes