diff mbox

[v9,07/10] block: add a new interface to block events

Message ID 1352443922-13734-8-git-send-email-aaron.lu@intel.com
State Not Applicable
Delegated to: David Miller
Headers show

Commit Message

Aaron Lu Nov. 9, 2012, 6:51 a.m. UTC
A new interface to block disk events is added, this interface is
meant to eliminate a race between PM runtime callback and disk events
checking.

Suppose the following device tree:
device_sata_port  (the parent)
  device_ODD      (the child)

When ODD is runtime suspended, sata port will have a chance to enter
runtime suspended state. And in sata port's runtime suspend callback,
it will check if it is OK to omit the power of the ODD. And if yes, the
periodically running events checking work will be stopped, as the ODD
will be waken up by that check and cancel it can make the ODD stay in
zero power state much longer(no worry about how the ODD gets media
change event in ZPODD's case).

I used disk_block_events to do the events blocking, but there is a race
and can lead to a deadlock: when I call disk_block_events in sata port's
runtime suspend callback, the events checking work may already be running
and it will resume the ODD synchronously, and PM core will try to resume
its parent, the sata port first. The PM core makes sure that runtime
resume callback does not run concurrently with runtime suspend callback,
and if the runtime suspend callback is running, the runtime resume
callback will wait for it done. So the events checking work will wait
for sata port's runtime suspend callback, while the sata port's runtime
suspend callback is waiting for the disk events work to finish. Deadlock.

ODD_suspend                        disk_events_workfn
  ata_port_suspend                   check_events
    disk_block_events                  resume ODD
      cancel_delayed_work_sync           resume parent
      (waiting for disk_events_workfn)   (waiting for suspend callback)

So a new function disk_try_block_events is added, it will try to
cancel the delayed work if it is pending. If succeed, disk_block_events
will be called and we are done; if failed, false is returned without
doing anything. In this way, the race can be avoided.

The newly added interface and the disk_unblock_events are exported, as
sr driver will need to use them to block/unblock disk events.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 block/genhd.c         | 26 ++++++++++++++++++++++++++
 include/linux/genhd.h |  1 +
 2 files changed, 27 insertions(+)

Comments

Tejun Heo Nov. 12, 2012, 7:14 p.m. UTC | #1
On Fri, Nov 09, 2012 at 02:51:59PM +0800, Aaron Lu wrote:
> A new interface to block disk events is added, this interface is
> meant to eliminate a race between PM runtime callback and disk events
> checking.
> 
> Suppose the following device tree:
> device_sata_port  (the parent)
>   device_ODD      (the child)

Weren't you gonna do something different about this?  I mean, if sr
knows that autopm kicked in, it can simply tell the block layer that
nothing is going on.  Wouldn't that be simpler?

Thanks.
Alan Stern Nov. 12, 2012, 7:18 p.m. UTC | #2
On Mon, 12 Nov 2012, Tejun Heo wrote:

> On Fri, Nov 09, 2012 at 02:51:59PM +0800, Aaron Lu wrote:
> > A new interface to block disk events is added, this interface is
> > meant to eliminate a race between PM runtime callback and disk events
> > checking.
> > 
> > Suppose the following device tree:
> > device_sata_port  (the parent)
> >   device_ODD      (the child)
> 
> Weren't you gonna do something different about this?  I mean, if sr
> knows that autopm kicked in, it can simply tell the block layer that
> nothing is going on.  Wouldn't that be simpler?

It wouldn't work for non-ZP drives.  They do need to be polled, even 
when suspended.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Nov. 12, 2012, 7:21 p.m. UTC | #3
Hello, Alan.

On Mon, Nov 12, 2012 at 02:18:11PM -0500, Alan Stern wrote:
> > Weren't you gonna do something different about this?  I mean, if sr
> > knows that autopm kicked in, it can simply tell the block layer that
> > nothing is going on.  Wouldn't that be simpler?
> 
> It wouldn't work for non-ZP drives.  They do need to be polled, even 
> when suspended.

Hmmm... a bit confused, what would autopm do for those non-ZP devices?
Would it make any difference?

Thanks.
Alan Stern Nov. 12, 2012, 7:34 p.m. UTC | #4
On Mon, 12 Nov 2012, Tejun Heo wrote:

> Hello, Alan.
> 
> On Mon, Nov 12, 2012 at 02:18:11PM -0500, Alan Stern wrote:
> > > Weren't you gonna do something different about this?  I mean, if sr
> > > knows that autopm kicked in, it can simply tell the block layer that
> > > nothing is going on.  Wouldn't that be simpler?
> > 
> > It wouldn't work for non-ZP drives.  They do need to be polled, even 
> > when suspended.
> 
> Hmmm... a bit confused, what would autopm do for those non-ZP devices?
> Would it make any difference?

It wouldn't do anything for the device, but it would allow the device's
parent to go to low power in between polls.  That's why Aaron at one
point suggested lengthening the polling interval.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Nov. 18, 2012, 3:05 p.m. UTC | #5
Hello, Alan.

On Mon, Nov 12, 2012 at 02:34:01PM -0500, Alan Stern wrote:
> It wouldn't do anything for the device, but it would allow the device's
> parent to go to low power in between polls.  That's why Aaron at one
> point suggested lengthening the polling interval.

Ah, okay.  Entering and leaving suspend even every two seconds may
save noticeable power for some controllers.  Does that, tho, mean that
the port would get reset every two seconds?  If so (I can't see how
else it would behave), it may not really work.  I mean, you would be
spending about ten seconds reinitiaizing everything and then enter
sleep and do it all over again two seconds later, most likely wasting
power in the process.  Sounds weird to me.  What am I missing?

Thanks.
Alan Stern Nov. 18, 2012, 5:41 p.m. UTC | #6
On Sun, 18 Nov 2012, Tejun Heo wrote:

> Hello, Alan.

Hi.

> On Mon, Nov 12, 2012 at 02:34:01PM -0500, Alan Stern wrote:
> > It wouldn't do anything for the device, but it would allow the device's
> > parent to go to low power in between polls.  That's why Aaron at one
> > point suggested lengthening the polling interval.
> 
> Ah, okay.  Entering and leaving suspend even every two seconds may
> save noticeable power for some controllers.  Does that, tho, mean that
> the port would get reset every two seconds?

I don't know how suspend works for (S)ATA ports.  For USB mass storage, 
suspend/resume does not involve any resets unless something goes wrong.

>  If so (I can't see how
> else it would behave), it may not really work.  I mean, you would be
> spending about ten seconds reinitiaizing everything and then enter
> sleep and do it all over again two seconds later, most likely wasting
> power in the process.  Sounds weird to me.  What am I missing?

Does it really take 10 seconds to recover from an ATA suspend?  That 
sounds awfully long.

The two second value is merely the default interval for media polling.  
Devices with no async notification mechanism require some sort of
polling if anybody wants to know when media is inserted or removed,
and the polling can't be carried out if the device's port is suspended.

You asked about autopm.  "Autopm" is a rather vague and ugly term I
made up; it means that the kernel automatically suspends the device
after some specified idle time.  In the case of optical drives, "idle"  
currently means the device file is not open (which implies the drive
doesn't contain a mounted filesystem).  As far as I know, the kernel
has no direct way to affect the power level of an optical drive that
isn't ZP.  For these devices "suspended" is merely a logical state,
meaning that the device isn't in use and therefore its parent can be
put into a low-power mode.

Regardless, an optical drive can't remain suspended when polling
occurs.  So unless there's some other mechanism for getting the
information about media changes, suspended drives (and their parents)  
will have to be resumed periodically for polling.

Hence there's a tradeoff.  How can we use the minimum amount of energy
while still polling the drive acceptably often?  In general, the kernel
doesn't know.  That's why these things can be controlled from
userspace.  And the answer may be different for ATA drives vs.  
USB-connected drives.

Does this answer your questions?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Nov. 18, 2012, 9:56 p.m. UTC | #7
Hello, Alan.

On Sun, Nov 18, 2012 at 12:41:06PM -0500, Alan Stern wrote:
> >  If so (I can't see how
> > else it would behave), it may not really work.  I mean, you would be
> > spending about ten seconds reinitiaizing everything and then enter
> > sleep and do it all over again two seconds later, most likely wasting
> > power in the process.  Sounds weird to me.  What am I missing?
> 
> Does it really take 10 seconds to recover from an ATA suspend?  That 
> sounds awfully long.

If it's using the same ->suspend ops as the regular system suspend,
10secs isn't a crazy number.  If the controller goes offline, the PHY
would go offline too and the only way to come back from that is
performing full probing sequence.  From SATA procotol POV, it isn't
too bad.  It's just link initiazliation followed by IDENTIFY and some
config commands.

The problem is that SATA devices aren't really designed to be used
like that.  If you reset a ODD w/ a media in it, it'll probably spin
it up and try to re-establish media state during probe sequence.  It
isn't designed that way and has never been used like that.  SATA has
its own dynamic link power management (DIPM/HIPM) tho.

So, this whole autopm thing doesn't sound like a good idea to me.

> The two second value is merely the default interval for media polling.  
> Devices with no async notification mechanism require some sort of
> polling if anybody wants to know when media is inserted or removed,
> and the polling can't be carried out if the device's port is suspended.

Yeah, I know.  I wrote that thing. :)

> You asked about autopm.  "Autopm" is a rather vague and ugly term I
> made up; it means that the kernel automatically suspends the device
> after some specified idle time.  In the case of optical drives, "idle"  
> currently means the device file is not open (which implies the drive
> doesn't contain a mounted filesystem).  As far as I know, the kernel
> has no direct way to affect the power level of an optical drive that
> isn't ZP.  For these devices "suspended" is merely a logical state,
> meaning that the device isn't in use and therefore its parent can be
> put into a low-power mode.
>
> Regardless, an optical drive can't remain suspended when polling
> occurs.  So unless there's some other mechanism for getting the
> information about media changes, suspended drives (and their parents)  
> will have to be resumed periodically for polling.
> 
> Hence there's a tradeoff.  How can we use the minimum amount of energy
> while still polling the drive acceptably often?  In general, the kernel
> doesn't know.  That's why these things can be controlled from
> userspace.  And the answer may be different for ATA drives vs.  
> USB-connected drives.
> 
> Does this answer your questions?

I think the only reason autopm doesn't blow up for SATA devices is
that userland usually automatically mounts them thus effectively
disabling autopm.  I *think* the only sane thing we can do is doing
autopm iff zpodd is available && no media.

Thanks.
Tejun Heo Nov. 18, 2012, 9:58 p.m. UTC | #8
On Sun, Nov 18, 2012 at 01:56:57PM -0800, Tejun Heo wrote:
> I think the only reason autopm doesn't blow up for SATA devices is
> that userland usually automatically mounts them thus effectively
> disabling autopm.  I *think* the only sane thing we can do is doing
> autopm iff zpodd is available && no media.

That is, at least for devices which get polled.  If devices are to
leave unused for some time, suspending is fine but auto-suspending
inbetween event polling wouldn't work.

Thanks.
Alan Stern Nov. 18, 2012, 11:28 p.m. UTC | #9
On Sun, 18 Nov 2012, Tejun Heo wrote:

> > Does it really take 10 seconds to recover from an ATA suspend?  That 
> > sounds awfully long.
> 
> If it's using the same ->suspend ops as the regular system suspend,
> 10secs isn't a crazy number.  If the controller goes offline, the PHY
> would go offline too and the only way to come back from that is
> performing full probing sequence.  From SATA procotol POV, it isn't
> too bad.  It's just link initiazliation followed by IDENTIFY and some
> config commands.
> 
> The problem is that SATA devices aren't really designed to be used
> like that.  If you reset a ODD w/ a media in it, it'll probably spin
> it up and try to re-establish media state during probe sequence.  It
> isn't designed that way and has never been used like that.  SATA has
> its own dynamic link power management (DIPM/HIPM) tho.

Is it possible to use those to implement runtime suspend?  Or are they 
handled autonomously by the hardware?

> So, this whole autopm thing doesn't sound like a good idea to me.

No doubt it's better suited to some devices than to others.

> > Hence there's a tradeoff.  How can we use the minimum amount of energy
> > while still polling the drive acceptably often?  In general, the kernel
> > doesn't know.  That's why these things can be controlled from
> > userspace.  And the answer may be different for ATA drives vs.  
> > USB-connected drives.
> > 
> > Does this answer your questions?
> 
> I think the only reason autopm doesn't blow up for SATA devices is
> that userland usually automatically mounts them thus effectively
> disabling autopm.

Either that or else because it hasn't been fully implemented yet.  :-)

>  I *think* the only sane thing we can do is doing
> autopm iff zpodd is available && no media.

That may be true for SATA.  For USB optical drives, it does make sense
to power-down the host controller when the drive isn't in use.  USB
suspend/resume takes on the order of 50-100 ms or so.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Nov. 18, 2012, 11:35 p.m. UTC | #10
Hey, Alan.

On Sun, Nov 18, 2012 at 06:28:24PM -0500, Alan Stern wrote:
> > The problem is that SATA devices aren't really designed to be used
> > like that.  If you reset a ODD w/ a media in it, it'll probably spin
> > it up and try to re-establish media state during probe sequence.  It
> > isn't designed that way and has never been used like that.  SATA has
> > its own dynamic link power management (DIPM/HIPM) tho.
> 
> Is it possible to use those to implement runtime suspend?  Or are they 
> handled autonomously by the hardware?

It's controlled by /sys/class/scsi_host/hostX/link_power_management.
Once enabled, the power saving is completedly handled by hardware.  If
link stays idle longer than certain duration, the hardware initiates
low power state and leaves it when something needs to happen.

> > So, this whole autopm thing doesn't sound like a good idea to me.
> 
> No doubt it's better suited to some devices than to others.

Yeah, SATA seems to need a different approach than USB.

> > I think the only reason autopm doesn't blow up for SATA devices is
> > that userland usually automatically mounts them thus effectively
> > disabling autopm.
> 
> Either that or else because it hasn't been fully implemented yet.  :-)

Ah.. that makes sense. :)

> >  I *think* the only sane thing we can do is doing
> > autopm iff zpodd is available && no media.
> 
> That may be true for SATA.  For USB optical drives, it does make sense
> to power-down the host controller when the drive isn't in use.  USB
> suspend/resume takes on the order of 50-100 ms or so.

I see.  For SATA too, the controller / link bring up doesn't take too
long.  The crappy part is what the attached, especially ATAPI, devices
would do after such events as those events will be visible to them
basically as random link resets.

So, at least for SATA, I think what autopm can do is...

* Trigger zpodd if possible.

* Trigger suspend iff polling isn't happening on the device.

Thanks.
Alan Stern Nov. 19, 2012, 2:07 a.m. UTC | #11
On Sun, 18 Nov 2012, Tejun Heo wrote:

> Hey, Alan.

Hey...

> It's controlled by /sys/class/scsi_host/hostX/link_power_management.
> Once enabled, the power saving is completedly handled by hardware.  If
> link stays idle longer than certain duration, the hardware initiates
> low power state and leaves it when something needs to happen.
> 
> > > So, this whole autopm thing doesn't sound like a good idea to me.
> > 
> > No doubt it's better suited to some devices than to others.
> 
> Yeah, SATA seems to need a different approach than USB.

Based on your description, I agree.

> > That may be true for SATA.  For USB optical drives, it does make sense
> > to power-down the host controller when the drive isn't in use.  USB
> > suspend/resume takes on the order of 50-100 ms or so.
> 
> I see.  For SATA too, the controller / link bring up doesn't take too
> long.  The crappy part is what the attached, especially ATAPI, devices
> would do after such events as those events will be visible to them
> basically as random link resets.
> 
> So, at least for SATA, I think what autopm can do is...
> 
> * Trigger zpodd if possible.
> 
> * Trigger suspend iff polling isn't happening on the device.

That sums it up nicely.  Of course, the PM core is unaware of details
such as media polling.  What we can do is have the SATA runtime-idle
method return -EBUSY if the device isn't a ZPODD and if polling is
enabled.  Unfortunately I don't think there's any way currently to
trigger autopm when the user turns off polling.  Maybe something could
be added to the appropriate sysfs handler.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aaron Lu Nov. 19, 2012, 3:21 a.m. UTC | #12
On 11/19/2012 10:07 AM, Alan Stern wrote:
> On Sun, 18 Nov 2012, Tejun Heo wrote:
> 
>> Hey, Alan.
> 
> Hey...
> 
>> It's controlled by /sys/class/scsi_host/hostX/link_power_management.
>> Once enabled, the power saving is completedly handled by hardware.  If
>> link stays idle longer than certain duration, the hardware initiates
>> low power state and leaves it when something needs to happen.
>>
>>>> So, this whole autopm thing doesn't sound like a good idea to me.
>>>
>>> No doubt it's better suited to some devices than to others.
>>
>> Yeah, SATA seems to need a different approach than USB.
> 
> Based on your description, I agree.
> 
>>> That may be true for SATA.  For USB optical drives, it does make sense
>>> to power-down the host controller when the drive isn't in use.  USB
>>> suspend/resume takes on the order of 50-100 ms or so.
>>
>> I see.  For SATA too, the controller / link bring up doesn't take too
>> long.  The crappy part is what the attached, especially ATAPI, devices
>> would do after such events as those events will be visible to them
>> basically as random link resets.
>>
>> So, at least for SATA, I think what autopm can do is...
>>
>> * Trigger zpodd if possible.
>>
>> * Trigger suspend iff polling isn't happening on the device.
> 
> That sums it up nicely.  Of course, the PM core is unaware of details
> such as media polling.  What we can do is have the SATA runtime-idle
> method return -EBUSY if the device isn't a ZPODD and if polling is
> enabled.  Unfortunately I don't think there's any way currently to
> trigger autopm when the user turns off polling.  Maybe something could
> be added to the appropriate sysfs handler.

OK, thanks for your(both of you) suggestions.

But probably for ZPODD devices first, as it has a clear use case and
should already have productions. For async notification capable ODDs,
I'll leave it for someone else or maybe at a later time if I can get
such a device to test on.

To conclude, the ata port's runtime idle callback will return 0 when:
1 It attached a ZPODD capable ODD;
2 Or it attached a hard disk.
For all other cases, it will return -EBUSY.
Does this look correct?

Thanks,
Aaron

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Nov. 19, 2012, 2:50 p.m. UTC | #13
On Sun, Nov 18, 2012 at 09:07:22PM -0500, Alan Stern wrote:
> Hey...

Hey, again. :P

> > So, at least for SATA, I think what autopm can do is...
> > 
> > * Trigger zpodd if possible.
> > 
> > * Trigger suspend iff polling isn't happening on the device.
> 
> That sums it up nicely.  Of course, the PM core is unaware of details
> such as media polling.  What we can do is have the SATA runtime-idle
> method return -EBUSY if the device isn't a ZPODD and if polling is
> enabled.  Unfortunately I don't think there's any way currently to
> trigger autopm when the user turns off polling.  Maybe something could
> be added to the appropriate sysfs handler.

Given that genhd event polling and PM are likely to interact with each
other, I think it would be a good idea to implement a proper way for
the two to communicate.  Dunno what it should be tho.  The tricky part
would be mapping the block device to the underlying hardware one.

Thanks.
diff mbox

Patch

diff --git a/block/genhd.c b/block/genhd.c
index 6cace66..8632fd3 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1469,6 +1469,31 @@  void disk_block_events(struct gendisk *disk)
 	mutex_unlock(&ev->block_mutex);
 }
 
+/*
+ * Under some circumstances, there is a race between the calling thread
+ * of disk_block_events and the events checking function. To avoid such a race,
+ * this function will check if the delayed work is pending. If not, it means
+ * the work is either not queued or is already running, false is returned.
+ * And if yes, try to cancel the delayed work. If succedded, disk_block_events
+ * will be called and there is no worry that cancel_delayed_work_sync will
+ * deadlock the events checking function. And if failed, false is returned.
+ */
+bool disk_try_block_events(struct gendisk *disk)
+{
+	struct disk_events *ev = disk->ev;
+
+	if (!ev)
+		return false;
+
+	if (cancel_delayed_work(&disk->ev->dwork)) {
+		disk_block_events(disk);
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(disk_try_block_events);
+
 static void __disk_unblock_events(struct gendisk *disk, bool check_now)
 {
 	struct disk_events *ev = disk->ev;
@@ -1512,6 +1537,7 @@  void disk_unblock_events(struct gendisk *disk)
 	if (disk->ev)
 		__disk_unblock_events(disk, false);
 }
+EXPORT_SYMBOL(disk_unblock_events);
 
 /**
  * disk_flush_events - schedule immediate event checking and flushing
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 4f440b3..b67247f 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -420,6 +420,7 @@  static inline int get_disk_ro(struct gendisk *disk)
 }
 
 extern void disk_block_events(struct gendisk *disk);
+extern bool disk_try_block_events(struct gendisk *disk);
 extern void disk_unblock_events(struct gendisk *disk);
 extern void disk_flush_events(struct gendisk *disk, unsigned int mask);
 extern unsigned int disk_clear_events(struct gendisk *disk, unsigned int mask);