diff mbox

Re: [REGRESSION] rtc/interface.c: kills suspend-to-ram

Message ID 4F8CFC12.6050700@linaro.org
State Superseded
Headers show

Commit Message

John Stultz April 17, 2012, 5:13 a.m. UTC
On 04/16/2012 07:30 PM, Mark Lord wrote:
>
> Thanks for looking into it, John.
>
> I also spent many more hours digging away at it here today,
> and I now understand (mostly) what is happening and why.
>
> The code above introduces a new access to the RTC that never existed before.
> For the case where the Alarm has never been enabled by software,
> I believe the code above will still try to "disable" it.
> That's the new behaviour we didn't have prior to this patch.
>
> And.. on some of the systems I'm testing on, the BIOS setup has
> the RTC Alarm "enabled", which means "under BIOS control",
> as opposed to "disabled" which means "under software control".
>
> It's the "under BIOS control" systems that the above patch breaks.
>
> So I think the code may just need to be slightly more clever,
> and not disable an Alarm that was never enabled by software in the first place.

Thanks for the extra info. Although I'm still a little perplexed why 
that's causing trouble.
When "under BIOS control" is the RTC unusable by the kernel? Will any 
access cause problems? Or just the extra disable path?

On a hunch, I wonder if your tripping over the alarmtimer initialization 
issue that was recently fixed.
Have you also seen this issue w/ 3.4-rc2+ ?

I still can't trigger anything similar playing with the BIOS options for 
my system. If its not too much trouble, could you try the following two 
changes?

thanks
-john

I guess I'm curious why you're hitting the rtc_alarm_disable if you're 
not using the alarm. If you use the following diff, can you provide the 
resulting stack traces?

Comments

Mark Lord April 17, 2012, 12:51 p.m. UTC | #1
On 12-04-17 01:13 AM, John Stultz wrote:
> On 04/16/2012 07:30 PM, Mark Lord wrote:
..
>> And.. on some of the systems I'm testing on, the BIOS setup has
>> the RTC Alarm "enabled", which means "under BIOS control",
>> as opposed to "disabled" which means "under software control".
>>
>> It's the "under BIOS control" systems that the above patch breaks.
..
> Thanks for the extra info. Although I'm still a little perplexed why that's causing trouble.
> When "under BIOS control" is the RTC unusable by the kernel? Will any access cause problems? Or just
> the extra disable path?

Not totally clear yet, but I have noticed failures just reading
the RTC alarm setting when it is "under BIOS control",
so it does seem to cause issues.

I've further isolated it down to one specific revision of motherboard here.
The later rev of the same mobo doesn't exhibit the problem (I have both revs).
Interesting, that.  But if this board does it, then there probably are others
out there which could also be affected.

> On a hunch, I wonder if your tripping over the alarmtimer initialization issue that was recently fixed.
> Have you also seen this issue w/ 3.4-rc2+ ?

These boxes aren't ready for 3.4-xx yet.  :)

> I guess I'm curious why you're hitting the rtc_alarm_disable if you're not using the alarm. If you
> use the following diff, can you provide the resulting stack traces?


Yeah, excellent suggestion.  I'm very curious who calls this and when.

> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
> index eb415bd..4c98ee5 100644
> --- a/drivers/rtc/interface.c
> +++ b/drivers/rtc/interface.c
> @@ -786,7 +786,8 @@ static void rtc_alarm_disable(struct rtc_device *rtc)
>      if (!rtc->ops || !rtc->ops->alarm_irq_enable)
>          return;
> 
> -    rtc->ops->alarm_irq_enable(rtc->dev.parent, false);
> +    //rtc->ops->alarm_irq_enable(rtc->dev.parent, false);
> +    dump_stack();
>  }
..

> Then un-comment/re-add the alarm_irq_enable() call above, and try the following, to see if the
> behavior changes? Then re-add each line one by one to see if you can isolate where things go wrong
> in the cmos code?
> 
> diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
> index 7d5f56e..c500bce 100644
> --- a/drivers/rtc/rtc-cmos.c
> +++ b/drivers/rtc/rtc-cmos.c
> @@ -318,9 +318,9 @@ static void cmos_irq_disable(struct cmos_rtc *cmos, unsigned char mask)
>      rtc_control = CMOS_READ(RTC_CONTROL);
>      rtc_control&= ~mask;
>      CMOS_WRITE(rtc_control, RTC_CONTROL);
> -    hpet_mask_rtc_irq_bit(mask);
> +    //hpet_mask_rtc_irq_bit(mask);
> 
> -    cmos_checkintr(cmos, rtc_control);
> +    //cmos_checkintr(cmos, rtc_control);
..

Ack.  Will do.  Probably not for 12-24 hours though.

Cheers
Mark Lord April 17, 2012, 8:11 p.m. UTC | #2
On 12-04-17 01:13 AM, John Stultz wrote:
..
> -    rtc->ops->alarm_irq_enable(rtc->dev.parent, false);
> +    //rtc->ops->alarm_irq_enable(rtc->dev.parent, false);
> +    dump_stack();
..

Okay, the call into here is coming from a "hwclock -w -u" line
in the system suspend script.

Since that command isn't touching the hardware Alarm,
then neither should the Linux kernel.  Yet it is touching it.

>      CMOS_WRITE(rtc_control, RTC_CONTROL);
> -    hpet_mask_rtc_irq_bit(mask);
> +    //hpet_mask_rtc_irq_bit(mask);
> 
> -    cmos_checkintr(cmos, rtc_control);
> +    //cmos_checkintr(cmos, rtc_control); 
...

The problem still occurs (lockup on suspend)
with both lines above commented out.

Note that it's not 100% in any case, more like 8/10,
indicating a possible strong race condition somewhere.

I think all that should be done here, is to change the kernel
to NOT enable/disable the Alarm unless told to do so by
an explicit userspace action.  Eg. writing to /sys/../wakealarm
and/or /proc/acpi/alarm.

If userspace leaves the alarm alone, then so should the kernel when possible.
That's the old behaviour before the new alarm_irq_enable() stuff.

Cheers
Mark Lord April 17, 2012, 8:12 p.m. UTC | #3
On 12-04-17 04:11 PM, Mark Lord wrote:
> On 12-04-17 01:13 AM, John Stultz wrote:
> ..
>> -    rtc->ops->alarm_irq_enable(rtc->dev.parent, false);
>> +    //rtc->ops->alarm_irq_enable(rtc->dev.parent, false);
>> +    dump_stack();
> ..
> 
> Okay, the call into here is coming from a "hwclock -w -u" line
> in the system suspend script.

Forgot the stack dump:

Pid: 4353, comm: hwclock Tainted: P           O 3.3.2 #5
Call Trace:
 [<ffffffff8123cfcd>] ? rtc_timer_remove+0x66/0xb2
 [<ffffffff8103d2ff>] ? should_resched+0x5/0x23
 [<ffffffff8123d21b>] ? rtc_update_irq_enable+0xd0/0x108
 [<ffffffff812dd582>] ? __mutex_lock_common.isra.5+0x3b/0x166
 [<ffffffff8123e058>] ? rtc_dev_ioctl+0x36d/0x468
 [<ffffffff8101b78a>] ? do_page_fault+0x264/0x2ce
 [<ffffffff81027650>] ? timespec_add_safe+0x33/0x63
 [<ffffffff810077a8>] ? read_tsc+0x5/0x14
 [<ffffffff810483fa>] ? timekeeping_get_ns+0xd/0x2a
 [<ffffffff810ab492>] ? do_vfs_ioctl+0x45a/0x49c
 [<ffffffff810abb8e>] ? poll_select_copy_remaining+0xdb/0xfb
 [<ffffffff810ab511>] ? sys_ioctl+0x3d/0x60
 [<ffffffff812df222>] ? system_call_fastpath+0x16/0x1b

> Since that command isn't touching the hardware Alarm,
> then neither should the Linux kernel.  Yet it is touching it.
> 
>>      CMOS_WRITE(rtc_control, RTC_CONTROL);
>> -    hpet_mask_rtc_irq_bit(mask);
>> +    //hpet_mask_rtc_irq_bit(mask);
>>
>> -    cmos_checkintr(cmos, rtc_control);
>> +    //cmos_checkintr(cmos, rtc_control); 
> ...
> 
> The problem still occurs (lockup on suspend)
> with both lines above commented out.
> 
> Note that it's not 100% in any case, more like 8/10,
> indicating a possible strong race condition somewhere.
> 
> I think all that should be done here, is to change the kernel
> to NOT enable/disable the Alarm unless told to do so by
> an explicit userspace action.  Eg. writing to /sys/../wakealarm
> and/or /proc/acpi/alarm.
> 
> If userspace leaves the alarm alone, then so should the kernel when possible.
> That's the old behaviour before the new alarm_irq_enable() stuff.
> 
> Cheers
diff mbox

Patch

diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index eb415bd..4c98ee5 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -786,7 +786,8 @@  static void rtc_alarm_disable(struct rtc_device *rtc)
  	if (!rtc->ops || !rtc->ops->alarm_irq_enable)
  		return;

-	rtc->ops->alarm_irq_enable(rtc->dev.parent, false);
+	//rtc->ops->alarm_irq_enable(rtc->dev.parent, false);
+	dump_stack();
  }

  /**




Then un-comment/re-add the alarm_irq_enable() call above, and try the 
following, to see if the behavior changes? Then re-add each line one by 
one to see if you can isolate where things go wrong in the cmos code?

diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
index 7d5f56e..c500bce 100644
--- a/drivers/rtc/rtc-cmos.c
+++ b/drivers/rtc/rtc-cmos.c
@@ -318,9 +318,9 @@  static void cmos_irq_disable(struct cmos_rtc *cmos, unsigned char mask)
  	rtc_control = CMOS_READ(RTC_CONTROL);
  	rtc_control&= ~mask;
  	CMOS_WRITE(rtc_control, RTC_CONTROL);
-	hpet_mask_rtc_irq_bit(mask);
+	//hpet_mask_rtc_irq_bit(mask);

-	cmos_checkintr(cmos, rtc_control);
+	//cmos_checkintr(cmos, rtc_control);
  }

  static int cmos_set_alarm(struct device *dev, struct rtc_wkalrm *t)