diff mbox

2.6.29-rc3: tg3 dead after resume

Message ID alpine.LFD.2.00.0901301714080.3150@localhost.localdomain
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Linus Torvalds Jan. 31, 2009, 1:41 a.m. UTC
On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> 
> Can you test the patch below, please?

Rafael, you're making this _way_ too difficult.

Don't make it use the new PM infrastructure, because that one is certainly 
broken: pci_pm_default_suspend_generic() is total crap.

It's saving the disabled state. No WAY is that correct.

That "pci_disable_enabled_device()" should be removed, but even then 
that's wrong, because if the driver suspend disabled it, you're now 
(again) saving the disabled state.

But all of that is only called if you use the new PM infrastructure. So 
the thing is, when you're trying to move the PCI-E drive to the new pm 
infrastructure, you're making things _worse_.

The legacy PM infrastructure at least does the whole

	pci_dev->state_saved = false;
	i = drv->suspend(pci_dev, state);
	..
	if (pci_dev->state_saved)
		goto Fixup;

thing, which will avoid overwriting the state if it was already saved.

HOWEVER. The problem here (I think) is that PCI-E actually does the state 
save late, so it won't ever see the "state_saved" in the early ->suspend. 
I think a patch like the one below at least simplifies this all, and lets 
the PCI layer itself do all the core restore stuff.

The new PM infrastructure gets this totally wrong, and
 (a) disables the device before saving state
and
 (b) overwrites the (now corrupted) saved state that the driver perhaps 
     already saved, after the driver may even have put it to sleep.

So let's not use the new PM infrastructure - I don't think it's ready yet.

Let's start simplifying first. Start off by getting rid of the 
suspend_early/resume_late, since the PCI layer now does it for us.

I don't see why we don't resume with IO/MEM on, though. The legacy suspend 
sequence shouldn't disable them, afaik.

		Linus

---
 drivers/pci/pcie/portdrv_pci.c |   14 --------------
 1 files changed, 0 insertions(+), 14 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Rafael J. Wysocki Jan. 31, 2009, 9:08 p.m. UTC | #1
On Saturday 31 January 2009, Linus Torvalds wrote:
> 
> On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> > 
> > Can you test the patch below, please?
> 
> Rafael, you're making this _way_ too difficult.
> 
> Don't make it use the new PM infrastructure, because that one is certainly 
> broken: pci_pm_default_suspend_generic() is total crap.

Oh my.  I beg to differ.

> It's saving the disabled state. No WAY is that correct.

So why does it work, actually?

> That "pci_disable_enabled_device()" should be removed, but even then 
> that's wrong, because if the driver suspend disabled it, you're now 
> (again) saving the disabled state.
> 
> But all of that is only called if you use the new PM infrastructure. So 
> the thing is, when you're trying to move the PCI-E drive to the new pm 
> infrastructure, you're making things _worse_.

I don't really think so (see below).

> The legacy PM infrastructure at least does the whole
> 
> 	pci_dev->state_saved = false;
> 	i = drv->suspend(pci_dev, state);
> 	..
> 	if (pci_dev->state_saved)
> 		goto Fixup;
> 
> thing, which will avoid overwriting the state if it was already saved.
> 
> HOWEVER. The problem here (I think) is that PCI-E actually does the state 
> save late, so it won't ever see the "state_saved" in the early ->suspend. 
> I think a patch like the one below at least simplifies this all, and lets 
> the PCI layer itself do all the core restore stuff.
> 
> The new PM infrastructure gets this totally wrong, and
>  (a) disables the device before saving state

pci_disable_device() does really only one thing: it clears the bus master bit.
Yes, it also calls pcibios_disable_device(), but on x86 this is a NOP.

I don't think it is SO bad, is it?

> and
>  (b) overwrites the (now corrupted) saved state that the driver perhaps 
>      already saved, after the driver may even have put it to sleep.

The driver using the new model is not supposed to save the state and power
off the device.  Still, it's probably a good idea not to trust the drivers. :-)

> So let's not use the new PM infrastructure - I don't think it's ready yet.
>
> Let's start simplifying first. Start off by getting rid of the 
> suspend_early/resume_late, since the PCI layer now does it for us.
> 
> I don't see why we don't resume with IO/MEM on, though. The legacy suspend 
> sequence shouldn't disable them, afaik.

No, it shouldn't.

However, the patch below actually may help and it really is not too different
from my "new infrastructure" patch.  It leaves the pcie_portdrv_restore_config()
in the PCIe port driver's ->resume(), but that shouldn't change things,
pci_enable_device() in there shouldn't do anything and the bus master bit
should already be set.

The "new infrastracture" patch makes pci_disable_enabled_device() be called in
the suspend code path, but that only disables the bus master bit, and
pci_reenable_device() be called in the resume code path, but that only sets the
bus master bit back again.  So, they are almost the same and I'd be surprised
if your patch didn't help.

I had that "new infrastracture" patch ready yesterday and I thought it might
help, so I sent it.  I was too tired to prepare a new patch and that would
probably look like the one below (I'd remove the pcie_portdrv_restore_config()
from ->resume too, but that's only a detail).

Thanks,
Rafael

> ---
>  drivers/pci/pcie/portdrv_pci.c |   14 --------------
>  1 files changed, 0 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index 99a914a..08a8e3c 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -55,16 +55,6 @@ static int pcie_portdrv_suspend(struct pci_dev *dev, pm_message_t state)
>  
>  }
>  
> -static int pcie_portdrv_suspend_late(struct pci_dev *dev, pm_message_t state)
> -{
> -	return pci_save_state(dev);
> -}
> -
> -static int pcie_portdrv_resume_early(struct pci_dev *dev)
> -{
> -	return pci_restore_state(dev);
> -}
> -
>  static int pcie_portdrv_resume(struct pci_dev *dev)
>  {
>  	pcie_portdrv_restore_config(dev);
> @@ -72,8 +62,6 @@ static int pcie_portdrv_resume(struct pci_dev *dev)
>  }
>  #else
>  #define pcie_portdrv_suspend NULL
> -#define pcie_portdrv_suspend_late NULL
> -#define pcie_portdrv_resume_early NULL
>  #define pcie_portdrv_resume NULL
>  #endif
>  
> @@ -292,8 +280,6 @@ static struct pci_driver pcie_portdriver = {
>  	.remove		= pcie_portdrv_remove,
>  
>  	.suspend	= pcie_portdrv_suspend,
> -	.suspend_late	= pcie_portdrv_suspend_late,
> -	.resume_early	= pcie_portdrv_resume_early,
>  	.resume		= pcie_portdrv_resume,
>  
>  	.err_handler 	= &pcie_portdrv_err_handler,
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 31, 2009, 9:42 p.m. UTC | #2
On Saturday 31 January 2009, Rafael J. Wysocki wrote:
> On Saturday 31 January 2009, Linus Torvalds wrote:
> > 
> > On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> > > 
> > > Can you test the patch below, please?
> > 
> > Rafael, you're making this _way_ too difficult.
> > 
> > Don't make it use the new PM infrastructure, because that one is certainly 
> > broken: pci_pm_default_suspend_generic() is total crap.
> 
> Oh my.  I beg to differ.
> 
> > It's saving the disabled state. No WAY is that correct.
> 
> So why does it work, actually?
> 
> > That "pci_disable_enabled_device()" should be removed, but even then 
> > that's wrong, because if the driver suspend disabled it, you're now 
> > (again) saving the disabled state.
> > 
> > But all of that is only called if you use the new PM infrastructure. So 
> > the thing is, when you're trying to move the PCI-E drive to the new pm 
> > infrastructure, you're making things _worse_.
> 
> I don't really think so (see below).
> 
> > The legacy PM infrastructure at least does the whole
> > 
> > 	pci_dev->state_saved = false;
> > 	i = drv->suspend(pci_dev, state);
> > 	..
> > 	if (pci_dev->state_saved)
> > 		goto Fixup;
> > 
> > thing, which will avoid overwriting the state if it was already saved.
> > 
> > HOWEVER. The problem here (I think) is that PCI-E actually does the state 
> > save late, so it won't ever see the "state_saved" in the early ->suspend. 
> > I think a patch like the one below at least simplifies this all, and lets 
> > the PCI layer itself do all the core restore stuff.
> > 
> > The new PM infrastructure gets this totally wrong, and
> >  (a) disables the device before saving state
> 
> pci_disable_device() does really only one thing: it clears the bus master bit.
> Yes, it also calls pcibios_disable_device(), but on x86 this is a NOP.
> 
> I don't think it is SO bad, is it?
> 
> > and
> >  (b) overwrites the (now corrupted) saved state that the driver perhaps 
> >      already saved, after the driver may even have put it to sleep.
> 
> The driver using the new model is not supposed to save the state and power
> off the device.  Still, it's probably a good idea not to trust the drivers. :-)
> 
> > So let's not use the new PM infrastructure - I don't think it's ready yet.
> >
> > Let's start simplifying first. Start off by getting rid of the 
> > suspend_early/resume_late, since the PCI layer now does it for us.
> > 
> > I don't see why we don't resume with IO/MEM on, though. The legacy suspend 
> > sequence shouldn't disable them, afaik.
> 
> No, it shouldn't.
> 
> However, the patch below actually may help and it really is not too different
> from my "new infrastructure" patch.  It leaves the pcie_portdrv_restore_config()
> in the PCIe port driver's ->resume(), but that shouldn't change things,
> pci_enable_device() in there shouldn't do anything and the bus master bit
> should already be set.
> 
> The "new infrastracture" patch makes pci_disable_enabled_device() be called in
> the suspend code path, but that only disables the bus master bit, and
> pci_reenable_device() be called in the resume code path, but that only sets the
> bus master bit back again.

I should have said "in this particular case", because it actually makes a
difference for devices using interrupt pins.  Namely, pcibios_enable_device()
additionally enables PCI resources (that may make a difference, but everything
should have been restored already) and sets up an interrupt link for the
device.  If the link has been set up already, which often is the case, it
increases a reference count and exits.

Now, this reference count is not used for anything, but it was supposed to be
used for disabling the interrupt links that are no longer needed.  This
mechanism is currently disabled, but if we enable it at one point (which may
be necessary to fix suspend-resume on some boxes), it won't work if
the "enable" calls are not balanced with "disable" ones.  IOW, for every
pcibios_enable_device() call there should be a complementary
pcibios_disable_device() call.  That's why I decided to put the
disable/enable things into the PCI core's suspend/resume code paths (also,
because pci_reenable_device() has been already there for driverless devices).
This might not be the right choice, but I don't really think it does break
things.

Anyway, from what I can tell reading your messages in this thread so far,
you seem to want the PCI core to:
(1) save the state of devices during suspend (avoid doing that if the driver has
    already saved the state),
(2) put devices into D0 during resume (early),
(3) restore the state of devices during resume (early).
Still, you don't want the core to disable devices during suspend and to enable
(or reenable) them during resume.

What about putting devices into low power states?  [Note that ACPI may be
necessary for this purpose.]

What about devices with no drivers and/or without suspend/resume support?
Do you want them to be disabled during suspend and enabled during resume by
the core?  [I guess you do, they have no interrupt handlers that may break
after all.]

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Jan. 31, 2009, 9:47 p.m. UTC | #3
On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> > 
> > Don't make it use the new PM infrastructure, because that one is certainly 
> > broken: pci_pm_default_suspend_generic() is total crap.
> 
> Oh my.  I beg to differ.

Imagine that you're a bridge. Imagine what this does to any device 
_behind_ you. Any device that plans to still do something at suspend_late 
time, to be specific.

> > It's saving the disabled state. No WAY is that correct.
> 
> So why does it work, actually?

And I suspect it simply doesn't. And since almost nobody uses the new PM 
state, you just don't see it.

But as to why it fixes Parag's case - I think that's because the new PM 
resume does more than the legacy resume does, so it ends up re-enabling 
things anyway. It does it too late, but it doesn't matter in this case (no 
shared irq issues with the only device behind the pci-e bridge).

> > But all of that is only called if you use the new PM infrastructure. So 
> > the thing is, when you're trying to move the PCI-E drive to the new pm 
> > infrastructure, you're making things _worse_.
> 
> I don't really think so (see below).

See above. I think you really haven't thought the new PM code through.

> pci_disable_device() does really only one thing: it clears the bus master bit.
> Yes, it also calls pcibios_disable_device(), but on x86 this is a NOP.
> 
> I don't think it is SO bad, is it?

It's bad. It means that DMA won't work across such a bridge. Yes, it is 
probably bridge-dependent, and I know for a fact that at least some Intel 
bridges just hard-code the busmaster bit to 1 (at a minimum the host 
bridges do, I'm not sure about others), but I also know for a fact that 
some other bridges _will_ stop DMA to devices behind them if the BM bit is 
clear.

But more importantly: Why do you do it? What's the upside? I don't see it. 
There's a known downside - you're saving state that is something else than 
what the driver really expects. 

So I think clearing bus-master is a huge bug on a bridge, but I think that 
on normal devices it's just pointless.

> The driver using the new model is not supposed to save the state and power
> off the device.  Still, it's probably a good idea not to trust the drivers. :-)

How about devices that have magic power-down sequences? For example, a 
quick grep shows that USB on a PPC PMAC has a special "disable ASIC clocks 
for USB" thing after it puts the USB controller to sleep.

That was literally the _first_ driver I looked at. Admittedly because I 
knew that USB host controllers tend to be more aware of all the issues 
than most random drivers, but still...

I agree that the new model should turn off devices by default, but the 
thing is, it should also allow drivers that really know better to do magic 
things.

Of course, we can say that such devices should just continue to use the 
legacy model, but I thought that the long-term plan was to just replace 
it. And if you do, you need to allow for drivers that do special things 
due to known motherboard- or chip-specific "issues" (aka "PCI extensions" 
aka "hardware bugs").

And yes, I suspect that the magic PPC USB clock thing could maybe be 
rewritten as a system device, so there are other alternatives here, but I 
do suspect that it will be very painful if the new PM layer _forces_ a 
very specific model on drivers that they can't modify at all.

> > I don't see why we don't resume with IO/MEM on, though. The legacy suspend 
> > sequence shouldn't disable them, afaik.
> 
> No, it shouldn't.
> 
> However, the patch below actually may help and it really is not too different
> from my "new infrastructure" patch.  It leaves the pcie_portdrv_restore_config()
> in the PCIe port driver's ->resume(), but that shouldn't change things,
> pci_enable_device() in there shouldn't do anything and the bus master bit
> should already be set.

I absolutely agree with this patch. I'd just not expect it to make a 
difference (except for the "cleanup factor"). I think it's worth applying, 
and _if_ it makes a difference for Parag it's very interesting indeed.

> The "new infrastracture" patch makes pci_disable_enabled_device() be called in
> the suspend code path, but that only disables the bus master bit, and
> pci_reenable_device() be called in the resume code path, but that only sets the
> bus master bit back again.  So, they are almost the same and I'd be surprised
> if your patch didn't help.

Hmm. We'll see. I'm a bit doubtful. But we'll see..

		Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Jan. 31, 2009, 9:59 p.m. UTC | #4
On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> 
> Anyway, from what I can tell reading your messages in this thread so far,
> you seem to want the PCI core to:
> (1) save the state of devices during suspend (avoid doing that if the driver has
>     already saved the state),

Yes, but I'd at least want the device drivers to have the _option_ to 
do it themselves.

I would expect that by default any normal PCI driver would just rely on 
the PCI layer doing it all for them. But I suspect we may have cases where 
the chip driver will simply want to override things, for one reason or 
another.

We do know that some devices seem to be very picky and get unhappy about 
being put to sleep (we don't put devices into D3 by default in the legacy 
PM case for a reason!), and we do know that some existing drivers do extra 
things _after_ they've put the device to D3.

So there very much are arguments for drivers wanting to do their own "save 
state and power off" if they have special needs.

(Side note: it's entirely possible that one of the reasons we don't put 
devices into D3 in the legacy code-path is purely historical: maybe not 
because the devices were unhappy, but simply because it triggered the 
whole "interrupt at an unlucky place" thing. So I'm hoping that we'll 
actually not have this as a real issue, but..)

> (2) put devices into D0 during resume (early),
> (3) restore the state of devices during resume (early).

Yes.

> Still, you don't want the core to disable devices during suspend and to enable
> (or reenable) them during resume.

At an absolute _minimum_, bridges are special. 

We already know bridges are special: we do things like

	if (!pci_is_bridge(pci_dev))
		pci_prepare_to_sleep(pci_dev);


ie we don't actually put the bridges into D3 sleep, because the devices 
behind the bridge still need to be available until at LEAST the 
"suspend_late()" stage.

But then pci_pm_default_suspend_generic() does that 
pci_disable_enabled_device() unconditionally - even for bridges. That's 
just wrong, wrong, wrong.

> What about putting devices into low power states?  [Note that ACPI may be
> necessary for this purpose.]

I do think we should do it, although I'd at least personally prefer 
delaying it to the suspend_late (noirq) phase.

Why? Think about a shared interrupt again - but now coming in at just the 
wrong time during _suspend_. The PCI layer has turned off the device. 
Oops. Lockup. The same lock-up we worked so hard to avoid during resume.

> What about devices with no drivers and/or without suspend/resume support?

Oh, suspending those early and aggressively may be the right thing. But 
again, at least bridges are special.

Some bridges have drivers (pci-e and cardbus bridges at least), others 
don't (regular pci bridges). So bridges hit both the "has drivers" and 
"don't have drivers" case, and are special in both cases.

				Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 31, 2009, 10:46 p.m. UTC | #5
On Saturday 31 January 2009, Linus Torvalds wrote:
> 
> On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> > > 
> > > Don't make it use the new PM infrastructure, because that one is certainly 
> > > broken: pci_pm_default_suspend_generic() is total crap.
> > 
> > Oh my.  I beg to differ.
> 
> Imagine that you're a bridge. Imagine what this does to any device 
> _behind_ you. Any device that plans to still do something at suspend_late 
> time, to be specific.
> 
> > > It's saving the disabled state. No WAY is that correct.
> > 
> > So why does it work, actually?
> 
> And I suspect it simply doesn't. And since almost nobody uses the new PM 
> state, you just don't see it.

But many drivers have an analogous code sequence in their PM callbacks and
I've tested it with several drivers on my test boxes.  It's never failed for me.

> But as to why it fixes Parag's case - I think that's because the new PM 
> resume does more than the legacy resume does, so it ends up re-enabling 
> things anyway. It does it too late, but it doesn't matter in this case (no 
> shared irq issues with the only device behind the pci-e bridge).

Still, the 2.6.28 resume didn't do the "reenable device" thing and it worked.

I think in the Parag's case the problem is the "double restore".

> > > But all of that is only called if you use the new PM infrastructure. So 
> > > the thing is, when you're trying to move the PCI-E drive to the new pm 
> > > infrastructure, you're making things _worse_.
> > 
> > I don't really think so (see below).
> 
> See above. I think you really haven't thought the new PM code through.

Yes, I have, but my experience apparently doesn't match yours.

> > pci_disable_device() does really only one thing: it clears the bus master bit.
> > Yes, it also calls pcibios_disable_device(), but on x86 this is a NOP.
> > 
> > I don't think it is SO bad, is it?
> 
> It's bad. It means that DMA won't work across such a bridge. Yes, it is 
> probably bridge-dependent, and I know for a fact that at least some Intel 
> bridges just hard-code the busmaster bit to 1 (at a minimum the host 
> bridges do, I'm not sure about others), but I also know for a fact that 
> some other bridges _will_ stop DMA to devices behind them if the BM bit is 
> clear.

DMA will only not work until the ->resume sets the bus master bit, which
happes before the ->resume of any device behind the bridge runs.  There only
is a small window where something (theoretically) may go wrong and I really
don't expect any driver to start DMA from its ->resume_realy or an interrupt
handler.

> But more importantly: Why do you do it? What's the upside? I don't see it.

OK, point taken.
 
> There's a known downside - you're saving state that is something else than 
> what the driver really expects. 
> 
> So I think clearing bus-master is a huge bug on a bridge, but I think that 
> on normal devices it's just pointless.
>
> > The driver using the new model is not supposed to save the state and power
> > off the device.  Still, it's probably a good idea not to trust the drivers. :-)
> 
> How about devices that have magic power-down sequences? For example, a 
> quick grep shows that USB on a PPC PMAC has a special "disable ASIC clocks 
> for USB" thing after it puts the USB controller to sleep.

This is exceptional, from what I can tell.

> That was literally the _first_ driver I looked at. Admittedly because I 
> knew that USB host controllers tend to be more aware of all the issues 
> than most random drivers, but still...
> 
> I agree that the new model should turn off devices by default, but the 
> thing is, it should also allow drivers that really know better to do magic 
> things.
> 
> Of course, we can say that such devices should just continue to use the 
> legacy model, but I thought that the long-term plan was to just replace 
> it. And if you do, you need to allow for drivers that do special things 
> due to known motherboard- or chip-specific "issues" (aka "PCI extensions" 
> aka "hardware bugs").

We may need an "override default resume" flag for such drivers.

> And yes, I suspect that the magic PPC USB clock thing could maybe be 
> rewritten as a system device, so there are other alternatives here, but I 
> do suspect that it will be very painful if the new PM layer _forces_ a 
> very specific model on drivers that they can't modify at all.
> 
> > > I don't see why we don't resume with IO/MEM on, though. The legacy suspend 
> > > sequence shouldn't disable them, afaik.
> > 
> > No, it shouldn't.
> > 
> > However, the patch below actually may help and it really is not too different
> > from my "new infrastructure" patch.  It leaves the pcie_portdrv_restore_config()
> > in the PCIe port driver's ->resume(), but that shouldn't change things,
> > pci_enable_device() in there shouldn't do anything and the bus master bit
> > should already be set.
> 
> I absolutely agree with this patch.

Well, it's your patch after all, isn't it? ;-)

Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Jan. 31, 2009, 11:01 p.m. UTC | #6
On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> 
> But many drivers have an analogous code sequence in their PM callbacks and
> I've tested it with several drivers on my test boxes.  It's never failed for me.

Rafael!

Read what I write. Twice.

Here it is again: "Imagine that you're a bridge."

Stop the idiocy of just ignoring what I write, and talking about something 
else.

Bridges are special.

> > But as to why it fixes Parag's case - I think that's because the new PM 
> > resume does more than the legacy resume does, so it ends up re-enabling 
> > things anyway. It does it too late, but it doesn't matter in this case (no 
> > shared irq issues with the only device behind the pci-e bridge).
> 
> Still, the 2.6.28 resume didn't do the "reenable device" thing and it worked.
> 
> I think in the Parag's case the problem is the "double restore".

It's possible. Still, it really shouldn't matter.

Or at least, it shouldn't matter, as long as what you restore is sane. 
You're just going to rewrite the same data, after all.

That's why I was trying to see if the IO/MEM bits got cleared in the save 
image for some reason.

> > See above. I think you really haven't thought the new PM code through.
> 
> Yes, I have, but my experience apparently doesn't match yours.

The problem is that you're looking at some individual devices, and saying 
"it works for them, so it must work for everybody". Add to that the fact 
that you apparently _still_ haven't figured out the difference between 
bridges and regular devices, and that most most motherboards probably keep 
the PCI bridges powered anyway, and...

And yes, I realize that this is how PM under Linux has worked for a long 
time. But it's what I think we should get away from. It's why I pushed so 
hard to get the whole interrupt handling sane and stable. 

The argument that "it works for a lot of machines" IS NOT AN ARGUMENT, 
Rafael! Stop using it as such.

We know that 2.6.28 suspend/resume works for a lot of laptops. Even 
possibly _most_ laptops. But it was still broken. We want to get _away_ 
from that.

> DMA will only not work until the ->resume sets the bus master bit, which
> happes before the ->resume of any device behind the bridge runs.

Read my emails. THIS ISN'T EVEN A RESUME-TIME PROBLEM!

The problems happen on purely the suspend path. How the f*ck do you know 
that the drivers behind the bridge don't do everything at 'suspend_late' 
time, and expect to be working up until that time?

Here's a big hint: YOU DO NOT KNOW. YOU MUST NOT TURN OFF THE BRIDGE AT 
SUSPEND TIME!

I'm getting really fed up with you here. You're not even listening. And 
you are _definitely_ not doing any "deep thinking" here.

> > How about devices that have magic power-down sequences? For example, a 
> > quick grep shows that USB on a PPC PMAC has a special "disable ASIC clocks 
> > for USB" thing after it puts the USB controller to sleep.
> 
> This is exceptional, from what I can tell.

So?

Irrelevant. We want to handle the exceptional case too. And we generally 
want to handle them _automatically, rather than by:

> We may need an "override default resume" flag for such drivers.

.. why? Wouldn't it be a hell of a lot nicer if the PCI layer just did 
things right automatically. 

Which the legacy layer already does. It sees "ok, the driver did it's own 
pci_save_state(), I'm not going to do it for it".

THAT is robust. And simple. Wouldn't you agree?

So why not do the same in the new one? Why do you want to make the new 
interfaces _inferior_ to the old ones?

			Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 31, 2009, 11:08 p.m. UTC | #7
On Saturday 31 January 2009, Linus Torvalds wrote:
> 
> On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> > 
> > Anyway, from what I can tell reading your messages in this thread so far,
> > you seem to want the PCI core to:
> > (1) save the state of devices during suspend (avoid doing that if the driver has
> >     already saved the state),
> 
> Yes, but I'd at least want the device drivers to have the _option_ to 
> do it themselves.

(As I wrote in the other message just sent) I wonder if it's a good idea to
introduce an "override default PCI suspend/resume" flag for this purpose.

The drivers that want to handle the PCI stuff themselves would be able to use
this flag to tell the core not to try to power manage the device etc.

Alternatively, the core may check the state_saved bit and look at current_state
to see if the device need not be put into a low power state.  Still, putting
the device into a low power state need not be desirable anyway.

> I would expect that by default any normal PCI driver would just rely on 
> the PCI layer doing it all for them. But I suspect we may have cases where 
> the chip driver will simply want to override things, for one reason or 
> another.
> 
> We do know that some devices seem to be very picky and get unhappy about 
> being put to sleep (we don't put devices into D3 by default in the legacy 
> PM case for a reason!), and we do know that some existing drivers do extra 
> things _after_ they've put the device to D3.
> 
> So there very much are arguments for drivers wanting to do their own "save 
> state and power off" if they have special needs.
> 
> (Side note: it's entirely possible that one of the reasons we don't put 
> devices into D3 in the legacy code-path is purely historical: maybe not 
> because the devices were unhappy, but simply because it triggered the 
> whole "interrupt at an unlucky place" thing. So I'm hoping that we'll 
> actually not have this as a real issue, but..)

It actually has been confirmed recently that this is a real issue.  Sadly.

> > (2) put devices into D0 during resume (early),
> > (3) restore the state of devices during resume (early).
> 
> Yes.
> 
> > Still, you don't want the core to disable devices during suspend and to enable
> > (or reenable) them during resume.
> 
> At an absolute _minimum_, bridges are special. 
>
> We already know bridges are special: we do things like
> 
> 	if (!pci_is_bridge(pci_dev))
> 		pci_prepare_to_sleep(pci_dev);
> 
> 
> ie we don't actually put the bridges into D3 sleep, because the devices 
> behind the bridge still need to be available until at LEAST the 
> "suspend_late()" stage.
> 
> But then pci_pm_default_suspend_generic() does that 
> pci_disable_enabled_device() unconditionally - even for bridges. That's 
> just wrong, wrong, wrong.

OK, I see your point.

> > What about putting devices into low power states?  [Note that ACPI may be
> > necessary for this purpose.]
> 
> I do think we should do it, although I'd at least personally prefer 
> delaying it to the suspend_late (noirq) phase.
> 
> Why? Think about a shared interrupt again - but now coming in at just the 
> wrong time during _suspend_. The PCI layer has turned off the device. 
> Oops. Lockup. The same lock-up we worked so hard to avoid during resume.

I know.  Still, all of the drivers that implement suspend-resume put the
devices into low power states in ->suspend and it's never been observed to
be a source of problems.

Unfortunately, on many systems we are supposed to use ACPI for putting PCI
devices into low power states and right now we can't do that with interrupts
off due to the limitations of our AML interpreter.

The same applies to resume, BTW, but resume is easier, because devices tend
to already be in D0 from the start.  Admittedly, though, we may have a problem
with devices that are not in D0 at that point and _require_ ACPI to power them
up (ie. don't support the native PCI PM).  Not that I know of any, but still.

ISTR that some devices will not wake up the system if they are not put into
the low power state by ACPI during suspend.

So, I wonder.  Do we need to make the AML interpreter allow us to run code
with interrupts off?

> > What about devices with no drivers and/or without suspend/resume support?
> 
> Oh, suspending those early and aggressively may be the right thing. But 
> again, at least bridges are special.
> 
> Some bridges have drivers (pci-e and cardbus bridges at least), others 
> don't (regular pci bridges). So bridges hit both the "has drivers" and 
> "don't have drivers" case, and are special in both cases.

Yes, the bridges with drivers are somewhat special.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Jan. 31, 2009, 11:27 p.m. UTC | #8
On Sun, 1 Feb 2009, Rafael J. Wysocki wrote:
> 
> Alternatively, the core may check the state_saved bit and look at current_state
> to see if the device need not be put into a low power state.  Still, putting
> the device into a low power state need not be desirable anyway.

I'd be surprised if we didn't have devices that cannot act as wakeup 
devices in D3, for example. Yes, I'm sure it _should_ work, and I realize 
that the whole "platform_pci_choose_state()" is supposed to do this all 
right, but let me just take a wild stab at guessing that it's not always 
going to work.

So I would not be surprised if some devices will want to not be put in D3.

There's also the issue of debugging: we may well want to simply skip 
suspending the console device (and all bridges leading up to it). We don't 
do that right now, and we basically depend on "no_suspend_console" just 
working _despite_ that (because our legacy PCI power management never put 
things into sleep states), but that's another example of something where a 
driver might simply decide that it doesn't want the default PCI layer 
decisions: not because the device cannot do it, but because _we_ don't 
want it to do it.

> > Why? Think about a shared interrupt again - but now coming in at just the 
> > wrong time during _suspend_. The PCI layer has turned off the device. 
> > Oops. Lockup. The same lock-up we worked so hard to avoid during resume.
> 
> I know.  Still, all of the drivers that implement suspend-resume put the
> devices into low power states in ->suspend and it's never been observed to
> be a source of problems.

I do agree that problems at suspend time are probably somewhat less likely 
than at resume time. The devices are mostly in known states (rather than 
whatever random state they come up in and the BIOS early programming may 
do), and hopefully quiescent (because we told user space to freeze).

But I wouldn't bet on it in general. Think network cards on a shared 
interrupt, where the network card gets suspended _after_ the device that 
it shares interrupts with. Is the network quiescent? On a lot of desktop 
machines it probably is.

But how many people test STR while doing a "ping -f" from another machine?

It _should_ work. Do you guarantee that it does?

I think we should aim for "yes, we do guarantee that it does".

			Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Jan. 31, 2009, 11:39 p.m. UTC | #9
On Sat, 31 Jan 2009, Linus Torvalds wrote:
> 
> But how many people test STR while doing a "ping -f" from another machine?
> 
> It _should_ work. Do you guarantee that it does?

Btw, this really only is interesting if there's a shared interrupt. 

I'm sure that there are network drivers that will crash even on their own 
with _just_ the right timing (imagine having a delayed interrupt pending, 
then doing the "pci_set_power_state(PCI_D3hot)" thing, and then get the 
interrupt handler invoked on another CPU _just_ afterwards), but it's 
probably really hard to trigger, and a bug in that specific driver anyway.

But what's much more interesting (and not necessarily a driver bug, but a 
general PM infrastructure problem) is if we have that shared interrupt 
case, and the network driver gets lots of interrupts just as "driver X" is 
shutting down with that interrupt shared. Then, "driver X" will get 
interrupts after the PM layer has put its device to sleep, and now "driver 
X" is quite understandably confused - it didn't even do the "put to sleep" 
itself, but now its device is no longer responding.

And now it's not a really unlikely race condition any more. 

			Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Feb. 1, 2009, 12:11 a.m. UTC | #10
On Sunday 01 February 2009, Linus Torvalds wrote:
> 
> On Sat, 31 Jan 2009, Rafael J. Wysocki wrote:
> > 
> > But many drivers have an analogous code sequence in their PM callbacks and
> > I've tested it with several drivers on my test boxes.  It's never failed for me.
> 
> Rafael!
> 
> Read what I write. Twice.

You didn't give me a chance by removing it. ;-)

> Here it is again: "Imagine that you're a bridge."
> 
> Stop the idiocy of just ignoring what I write, and talking about something 
> else.
> 
> Bridges are special.

[Confused] Yes, they are.

And yes, I agreed that the bus master bit shouldn't be cleared for them.  Which
does not necessarily imply that clearing that bit will always lead to problems
in practice.

> > > But as to why it fixes Parag's case - I think that's because the new PM 
> > > resume does more than the legacy resume does, so it ends up re-enabling 
> > > things anyway. It does it too late, but it doesn't matter in this case (no 
> > > shared irq issues with the only device behind the pci-e bridge).
> > 
> > Still, the 2.6.28 resume didn't do the "reenable device" thing and it worked.
> > 
> > I think in the Parag's case the problem is the "double restore".
> 
> It's possible. Still, it really shouldn't matter.
> 
> Or at least, it shouldn't matter, as long as what you restore is sane. 
> You're just going to rewrite the same data, after all.

Yes, but I suspect the device is misbehaving due to some timing issues.

With devices that behave totally correctly the current code doesn't appear to
cause problems.

> That's why I was trying to see if the IO/MEM bits got cleared in the save 
> image for some reason.

Well, we never clear them.

> > > See above. I think you really haven't thought the new PM code through.
> > 
> > Yes, I have, but my experience apparently doesn't match yours.
> 
> The problem is that you're looking at some individual devices, and saying 
> "it works for them, so it must work for everybody".

Look.  I have only a limited set of data from devices that have been tested.
On the other hand I have some pieces of documentation that are sometimes not
very clear.  I try to use the data and the docs I have to figure out how things
actually work.

That's why the cases like the Parag's one are so important.  They allow me to
verify assumptions.

So no, I don't say "it must work for everybody", but I have to assume
_something_ if I don't know that _for_ _sure_.  I sometimes make wrong
assumptions, which is a consequence of the fact that I can only use a limited
set of data to verify my observations.

And please note, we still don't know for sure why the Parag's box actually
fails.  We have some theories that may or may not be correct, but that's it.

> Add to that the fact that you apparently _still_ haven't figured out the
> difference between bridges and regular devices, and that most most
> motherboards probably keep the PCI bridges powered anyway, and...

[Confused again] Why are you talking about keeping bridges powered?
 
> And yes, I realize that this is how PM under Linux has worked for a long 
> time. But it's what I think we should get away from. It's why I pushed so 
> hard to get the whole interrupt handling sane and stable. 
> 
> The argument that "it works for a lot of machines" IS NOT AN ARGUMENT, 
> Rafael! Stop using it as such.
>
> We know that 2.6.28 suspend/resume works for a lot of laptops. Even 
> possibly _most_ laptops. But it was still broken. We want to get _away_ 
> from that.

Working on many (different) systems is a good indication of what is and what
is not going to work in general.  It obviously doesn't imply correctness,
though.

> > DMA will only not work until the ->resume sets the bus master bit, which
> > happes before the ->resume of any device behind the bridge runs.
> 
> Read my emails. THIS ISN'T EVEN A RESUME-TIME PROBLEM!
> 
> The problems happen on purely the suspend path. How the f*ck do you know 
> that the drivers behind the bridge don't do everything at 'suspend_late' 
> time, and expect to be working up until that time?

DMA from suspend_late?  Come on.

> Here's a big hint: YOU DO NOT KNOW. YOU MUST NOT TURN OFF THE BRIDGE AT 
> SUSPEND TIME!
> 
> I'm getting really fed up with you here. You're not even listening. And 
> you are _definitely_ not doing any "deep thinking" here.

Why are you shouting at me?  Did you read my reply to your other message?
How many times do I have to tell you I AGREE THAT I SHOULD NOT TURN OFF THE
BUS MASTER BIT FOR BRIDGES so that you actually get it?

> > > How about devices that have magic power-down sequences? For example, a 
> > > quick grep shows that USB on a PPC PMAC has a special "disable ASIC clocks 
> > > for USB" thing after it puts the USB controller to sleep.
> > 
> > This is exceptional, from what I can tell.
> 
> So?
> 
> Irrelevant. We want to handle the exceptional case too. And we generally 
> want to handle them _automatically, rather than by:
> 
> > We may need an "override default resume" flag for such drivers.
> 
> .. why? Wouldn't it be a hell of a lot nicer if the PCI layer just did 
> things right automatically. 
> 
> Which the legacy layer already does. It sees "ok, the driver did it's own 
> pci_save_state(), I'm not going to do it for it".
> 
> THAT is robust. And simple. Wouldn't you agree?
> 
> So why not do the same in the new one? Why do you want to make the new 
> interfaces _inferior_ to the old ones?

On suspend we have two things to do:
- save the state
- put the device into a low power state

The state_saved flag can be used to see if the driver has saved the state.

What about powering off?  Are we going to assume that if the driver has saved
the state of the device, the core should not put the device into a low power
state?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Feb. 1, 2009, 12:32 a.m. UTC | #11
On Sun, 1 Feb 2009, Rafael J. Wysocki wrote:
> > The problems happen on purely the suspend path. How the f*ck do you know 
> > that the drivers behind the bridge don't do everything at 'suspend_late' 
> > time, and expect to be working up until that time?
> 
> DMA from suspend_late?  Come on.

Rafael. Stop being a total idiot.

Read what I wrote.

I'm saying that the driver may not do anything at all at suspend() time, 
and leaves everything until suspend_late. Then, at suspend_late(), it 
finally really shuts down.

That's actually a very reasonable thing to do in some circumstances. It 
simplifies everything, in particular all interrupt handling, since the 
device is now fully live all the way while interrupts can happen.

For a USB host controller, for example, it really could make sense to do 
that - just leave all the core host controller stuff running, and the only 
thing the "suspend()" callback does is to send the commands to the actual 
devices, it doesn't necessarily touch the host controller itself at all.

Then, at suspend_late time, you just clear the "running" bit in the 
controller (and perhaps not even that - because you want to still push 
things out for debugging). End result: you never actually had to shut 
anything down at all, and you could (for example) still run a USB serial 
port console all the way to shutdown.

And yes, I wanted to do basically exactly that when I was debugging some 
issues a year or two ago.

See? The device and driver may be totally alive over a ->suspend() call. 
And that means that the bridge CANNOT KNOW that it's ok to shut down DMA. 
Because DMA may be the only way the device communicates (again: USB 
actually works that way).

So dammit, just admit that you were wrong, instead of continually sending 
these _idiotic_ replies. That "turn off bus mastering" was bogus and 
idiotic, and had no real cause. Ok to just admit it already?

		Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Feb. 1, 2009, 12:36 a.m. UTC | #12
On Sunday 01 February 2009, Linus Torvalds wrote:
> 
> On Sat, 31 Jan 2009, Linus Torvalds wrote:
> > 
> > But how many people test STR while doing a "ping -f" from another machine?
> > 
> > It _should_ work. Do you guarantee that it does?
> 
> Btw, this really only is interesting if there's a shared interrupt. 
> 
> I'm sure that there are network drivers that will crash even on their own 
> with _just_ the right timing (imagine having a delayed interrupt pending, 
> then doing the "pci_set_power_state(PCI_D3hot)" thing, and then get the 
> interrupt handler invoked on another CPU _just_ afterwards), but it's 
> probably really hard to trigger, and a bug in that specific driver anyway.
> 
> But what's much more interesting (and not necessarily a driver bug, but a 
> general PM infrastructure problem) is if we have that shared interrupt 
> case, and the network driver gets lots of interrupts just as "driver X" is 
> shutting down with that interrupt shared. Then, "driver X" will get 
> interrupts after the PM layer has put its device to sleep, and now "driver 
> X" is quite understandably confused - it didn't even do the "put to sleep" 
> itself, but now its device is no longer responding.
> 
> And now it's not a really unlikely race condition any more. 

All this leads to the conclusion that we should put devices into low power
states with interrupts off and this seems to imply that we'll need to make the
AML interpreter allow us to run AML with interrupts off.

Still, what about the following rule:
- If the device is supposed to wake up the system, the driver should prepare it
  and put it into a low power state using the existing PCI callbacks, in
  ->suspend().  In that case, the driver is also required to save the state of
  the device before putting it into the low power state.  It is also required
  to make sure that its interrupt handler will not get confused in case of
  shared interrupts.
- If the state of the device hasn't been saved by the driver, the core is
  required to save its state (with interrupts off, I suppose?).
- If the state of the device hasn't been saved by the driver, the core will
  attempt to put the device into a low power state, using the native PCI PM and
  with interrupts off, unless PCI_DEV_FLAGS_NO_D3 is set in dev->flags.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Feb. 1, 2009, 12:41 a.m. UTC | #13
On Sunday 01 February 2009, Linus Torvalds wrote:
> 
> On Sun, 1 Feb 2009, Rafael J. Wysocki wrote:
> > > The problems happen on purely the suspend path. How the f*ck do you know 
> > > that the drivers behind the bridge don't do everything at 'suspend_late' 
> > > time, and expect to be working up until that time?
> > 
> > DMA from suspend_late?  Come on.
> 
> Rafael. Stop being a total idiot.
> 
> Read what I wrote.
> 
> I'm saying that the driver may not do anything at all at suspend() time, 
> and leaves everything until suspend_late. Then, at suspend_late(), it 
> finally really shuts down.
> 
> That's actually a very reasonable thing to do in some circumstances. It 
> simplifies everything, in particular all interrupt handling, since the 
> device is now fully live all the way while interrupts can happen.
> 
> For a USB host controller, for example, it really could make sense to do 
> that - just leave all the core host controller stuff running, and the only 
> thing the "suspend()" callback does is to send the commands to the actual 
> devices, it doesn't necessarily touch the host controller itself at all.
> 
> Then, at suspend_late time, you just clear the "running" bit in the 
> controller (and perhaps not even that - because you want to still push 
> things out for debugging). End result: you never actually had to shut 
> anything down at all, and you could (for example) still run a USB serial 
> port console all the way to shutdown.
> 
> And yes, I wanted to do basically exactly that when I was debugging some 
> issues a year or two ago.
> 
> See? The device and driver may be totally alive over a ->suspend() call. 
> And that means that the bridge CANNOT KNOW that it's ok to shut down DMA. 
> Because DMA may be the only way the device communicates (again: USB 
> actually works that way).
> 
> So dammit, just admit that you were wrong,

I said I was.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Feb. 1, 2009, 12:51 a.m. UTC | #14
On Sat, 31 Jan 2009, Linus Torvalds wrote:
> 
> For a USB host controller, for example, it really could make sense to do 
> that - just leave all the core host controller stuff running, and the only 
> thing the "suspend()" callback does is to send the commands to the actual 
> devices, it doesn't necessarily touch the host controller itself at all.

Same is quite likely true of things like video graphics adapters. Again, 
for all the same reasons. Think about all those fbcon drivers. They will 
use DMA for things. And again, there are very compelling debugging reasons 
to not suspend them for real until suspend_late (if even then).

		Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Feb. 1, 2009, 1:06 a.m. UTC | #15
On Sun, 1 Feb 2009, Rafael J. Wysocki wrote:
> 
> All this leads to the conclusion that we should put devices into low power
> states with interrupts off and this seems to imply that we'll need to make the
> AML interpreter allow us to run AML with interrupts off.

How many devices actually have the _PS3 method (or whatever it is that we 
end up executing)? We might be able to simply flag it, and say "ok, if we 
have a _PS3 method, we'll have to suspend early, otherwise we can leave it 
for a late suspend".

Definitely not perfect, but perhaps a way to get the safe thing on 99% of 
all cases, and have to live with the horrid ACPI rules on some things.

I thought the _DSW thing is common for setting up wakeup, but _PSx is not. 
But I have not looked at many ACPI tables in my life. I try to active 
avoid it if I at all humanly can.

			Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Feb. 1, 2009, 1:13 a.m. UTC | #16
On Sat, 31 Jan 2009, Linus Torvalds wrote:
> 
> I thought the _DSW thing is common for setting up wakeup, but _PSx is not. 
> But I have not looked at many ACPI tables in my life. I try to active 
> avoid it if I at all humanly can.

Doing a quick grep on my laptop seems to confirm that. No actual _PSx 
things found in any acpi tables at all that I can see.

And on a mac mini, there are _PS3 entries that _look_ like they are 
connected to the IDE controller and the realtime clock, but it looks like 
they don't happen for the PCI devices.

But again - I may have screwed that up. ACPI tables are not my favourite 
data structure.

		Linus
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arjan van de Ven Feb. 1, 2009, 1:20 a.m. UTC | #17
On Sat, 31 Jan 2009 17:06:47 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Sun, 1 Feb 2009, Rafael J. Wysocki wrote:
> > 
> > All this leads to the conclusion that we should put devices into
> > low power states with interrupts off and this seems to imply that
> > we'll need to make the AML interpreter allow us to run AML with
> > interrupts off.
> 
> How many devices actually have the _PS3 method (or whatever it is
> that we end up executing)? We might be able to simply flag it, and
> say "ok, if we have a _PS3 method, we'll have to suspend early,
> otherwise we can leave it for a late suspend".
> 

in this area there's a pet pieve of mine, or rather something that
shows up on kerneloops.org quite a bit:
There are several PCI quirks that get run with irqs off, but they are
just the "normal" boot time quirks, and when they get run there, they
can sleep. Some of them do things like call ioremap() and the like....
from memory some of the asus and via quirks come to mind...
Rafael J. Wysocki Feb. 1, 2009, 1:24 a.m. UTC | #18
On Sunday 01 February 2009, Linus Torvalds wrote:
> 
> On Sun, 1 Feb 2009, Rafael J. Wysocki wrote:
> > 
> > All this leads to the conclusion that we should put devices into low power
> > states with interrupts off and this seems to imply that we'll need to make the
> > AML interpreter allow us to run AML with interrupts off.
> 
> How many devices actually have the _PS3 method (or whatever it is that we 
> end up executing)? We might be able to simply flag it, and say "ok, if we 
> have a _PS3 method, we'll have to suspend early, otherwise we can leave it 
> for a late suspend".

That seems doable at first sight, although I think we should take D1 and D2
into account too (the ACPI rules may be that for S3, ie. suspend to RAM, given
device should be put into D2, for example).  We have a function for checking
if device is power-manageable by ACPI.

Still, in that case, should the rule be that if the device is power-manageable
by ACPI, the PCI core is supposed to put it into a low power state (using ACPI)
with interrupts on and if the device is not power-manageable by ACPI, the
PCI core is supposed to put it into a low power state using the native PM?

> Definitely not perfect, but perhaps a way to get the safe thing on 99% of 
> all cases, and have to live with the horrid ACPI rules on some things.

They are not that uncommon AFAICS.  On all of my boxes there are devices
power-manageable by ACPI.  Usually they are USB controllers and network
adapters, but sometimes it happens to sound cards too.

> I thought the _DSW thing is common for setting up wakeup, but _PSx is not. 
> But I have not looked at many ACPI tables in my life. I try to active 
> avoid it if I at all humanly can.

Usually, we need to use ACPI to set up wake-up and then use ACPI to put the
device into a low power state.  Otherwise, the wake-up may not work.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Feb. 7, 2009, 3:26 a.m. UTC | #19
> 
> Here's a big hint: YOU DO NOT KNOW. YOU MUST NOT TURN OFF THE BRIDGE AT 
> SUSPEND TIME!

A quick catch up on old emails ....

A good reason not to turn bridges off is that behind bridges, you'll
find more than your "normal" PCI devices. You'll find system devices,
things like your motherboard GPIO & clock control, i2c controllers,
service processors, your PIC, etc... 

And your BIOS will need them. Your ACPI stuff will indirectly access
these things, for example to get the machine into actual suspend state
etc...

Disabling bridges is bad :-)

If they have to be disabled for power management purposes, then it's
likely that ACPI/BIOS itself will do it at the very last minute when
going into S3, and bring them back itself on the way back.

For embedded platforms who want to override that, they can always stick
a driver there or do it manually from a sysdev or platform code.

IE. I very very much doubt windows does anything to bridges, appart
-maybe- saving/restoring the standard window settings -just-in-case-
they got wiped out and the BIOS didn't restore them.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Feb. 7, 2009, 3:27 a.m. UTC | #20
On Sat, 2009-01-31 at 16:51 -0800, Linus Torvalds wrote:
> 
> On Sat, 31 Jan 2009, Linus Torvalds wrote:
> > 
> > For a USB host controller, for example, it really could make sense to do 
> > that - just leave all the core host controller stuff running, and the only 
> > thing the "suspend()" callback does is to send the commands to the actual 
> > devices, it doesn't necessarily touch the host controller itself at all.
> 
> Same is quite likely true of things like video graphics adapters. Again, 
> for all the same reasons. Think about all those fbcon drivers. They will 
> use DMA for things. And again, there are very compelling debugging reasons 
> to not suspend them for real until suspend_late (if even then).

Actually, there are things like service processors etc... that only know
to communicate via DMA and little else via a little mailbox in some
reserved area of system memory...

Ben.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Machek Feb. 7, 2009, 9:21 a.m. UTC | #21
Hi!

> What about devices with no drivers and/or without suspend/resume support?
> Do you want them to be disabled during suspend and enabled during resume by
> the core?  [I guess you do, they have no interrupt handlers that may break
> after all.]

Device with driver but with no suspend/resume support can still use
interrupts. But I don't think there's much we can do in that
case... except perhaps failing the suspend.
diff mbox

Patch

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 99a914a..08a8e3c 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -55,16 +55,6 @@  static int pcie_portdrv_suspend(struct pci_dev *dev, pm_message_t state)
 
 }
 
-static int pcie_portdrv_suspend_late(struct pci_dev *dev, pm_message_t state)
-{
-	return pci_save_state(dev);
-}
-
-static int pcie_portdrv_resume_early(struct pci_dev *dev)
-{
-	return pci_restore_state(dev);
-}
-
 static int pcie_portdrv_resume(struct pci_dev *dev)
 {
 	pcie_portdrv_restore_config(dev);
@@ -72,8 +62,6 @@  static int pcie_portdrv_resume(struct pci_dev *dev)
 }
 #else
 #define pcie_portdrv_suspend NULL
-#define pcie_portdrv_suspend_late NULL
-#define pcie_portdrv_resume_early NULL
 #define pcie_portdrv_resume NULL
 #endif
 
@@ -292,8 +280,6 @@  static struct pci_driver pcie_portdriver = {
 	.remove		= pcie_portdrv_remove,
 
 	.suspend	= pcie_portdrv_suspend,
-	.suspend_late	= pcie_portdrv_suspend_late,
-	.resume_early	= pcie_portdrv_resume_early,
 	.resume		= pcie_portdrv_resume,
 
 	.err_handler 	= &pcie_portdrv_err_handler,