Extreme time jitter with suspend/resume cycles

Message ID CAOdF7nvHfh_M3UQDYjbofNOm0-S_-LXHdmsMY7DeEwgOQCmfFA@mail.gmail.com
State New
Headers show
Series
  • Extreme time jitter with suspend/resume cycles
Related show

Commit Message

Gabriel Beddingfield Oct. 4, 2017, 4:11 p.m.
TL;DR: the "delta_delta" hack[1 and 2] in kernel/time/timekeeping.c
and drivers/rtc/class.c undermines the NTP system. It's not
appropriate to use if sub-second precision is available. I've attached
a patch to resolve this... please let me know the ways you hate it.
:-)

Hello Kernel Timekeeping Maintainers,

We have been developing a device that has very a very aggressive power
policy, doing suspend/resume cycles a few times a minute ("echo mem >
/sys/power/state"). In doing so, we found that the system time
experiences a lot of jitter (compared to, say, an NTP server). It was
not uncommon for us to see time corrections of 1s to 4s on a regular
basis. This didn't happen when the device stayed awake, only when it
was allowed to do suspend/resume.

We found that the problem is an interaction between the NTP code and
what I call the "delta_delta hack." (see [1] and [2]) This code
allocates a static variable in a function that contains an offset from
the system time to the persistent/rtc clock. It uses that time to
fudge the suspend timestamp so that on resume the sleep time will be
compensated. It's kind of a statistical hack that assumes things will
average out. It seems to have two main assumptions:

  1. The persistent/rtc clock has only single-second precision
  2. The system does not frequently suspend/resume.
  3. If delta_delta is less than 2 seconds, these assumptions are "true"

Because the delta_delta hack is trying to maintain an offset from the
system time to the persistent/rtc clock, any minor NTP corrections
that have occurred since the last suspend will be discarded. However,
the NTP subsystem isn't notified that this is happening -- and so it
causes some level of instability in its PLL logic.

This problem affects any device that does "frequent" suspend/resume
cycles. I.e. any battery-powered "linux" device (like Android phones).

Many ARM systems provide a "persistent clock." Most of them are backed
by a 32kHz clock that gives good precision and makes the delta_delta
hack unnecessary. However, devices that only have single-second
precision for the persistent clock and/or are forced to use the RTC
(whose API only allows for single-second precision) -- they still need
this hack.

I've attached a patch that we developed in-house. I have a feeling you
won't like it... since it pushes the responsibility on whoever
configures the kernel. It also ignores the RTC problem (which will
still affect a lot of battery-powered devices).

Please let me know what you think -- and what the right approach for
solving this would be.

Thanks,
Gabe

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/time/timekeeping.c?h=v4.13.4#n1717
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/drivers/rtc/class.c?h=v4.13.4#n76

Comments

Thomas Gleixner Oct. 4, 2017, 6:22 p.m. | #1
Gabriel,

On Wed, 4 Oct 2017, Gabriel Beddingfield wrote:

thanks for bringing this up. Let's see where this goes.

Disclaimer: I did not bother to look at the patch yet because:

  1) It's an attachment and I'm too lazy to open and convert it to inline
     for reply. Sending patches inline is the preferred way, but maybe I
     should be thankful that you did not, as you consider it as ugly
     yourself.

  2) I want to discuss the conceptual issues first w/o being biased by
     looking at the patch.

> We have been developing a device that has very a very aggressive power
> policy, doing suspend/resume cycles a few times a minute ("echo mem >
> /sys/power/state"). In doing so, we found that the system time
> experiences a lot of jitter (compared to, say, an NTP server). It was
> not uncommon for us to see time corrections of 1s to 4s on a regular
> basis. This didn't happen when the device stayed awake, only when it
> was allowed to do suspend/resume.
> 
> We found that the problem is an interaction between the NTP code and
> what I call the "delta_delta hack." (see [1] and [2])

What you consider a hack might other people not. In fact it is the least
resort the kernel has on a lot of systems to maintain some halfways
accurate notion of walltime across suspend/resume.

I personally consider using high frequency suspend to mem a hack as well,
because contrary to suspend to idle and proper runtime power management it
must go through a lot of work which drains power and it prevents the kernel
from using all of its knowledge of power saving / wakeup predictions etc to
avoid that.  But it might be as well the least resort on a particular
system or use case scenario to squeeze the most lifetime out of the
battery.

Calling things a hack which might have been thought out carefully does not
make people more receptive.

> This code allocates a static variable in a function that contains an
> offset from the system time to the persistent/rtc clock. It uses that
> time to fudge the suspend timestamp so that on resume the sleep time will
> be compensated. It's kind of a statistical hack that assumes things will
> average out. It seems to have two main assumptions:

> 
>   1. The persistent/rtc clock has only single-second precision
>   2. The system does not frequently suspend/resume.
>   3. If delta_delta is less than 2 seconds, these assumptions are "true"
> 
> Because the delta_delta hack is trying to maintain an offset from the
> system time to the persistent/rtc clock, any minor NTP corrections
> that have occurred since the last suspend will be discarded. However,
> the NTP subsystem isn't notified that this is happening -- and so it
> causes some level of instability in its PLL logic.

The latter might be trivial to solve by adding TK_NTP_CLEAR to the
timekeeping_update() flags if the RTC was used for injecting the sleep
time, but I leave that to John for judgement.

> This problem affects any device that does "frequent" suspend/resume
> cycles. I.e. any battery-powered "linux" device (like Android phones).

Depends as there are other options depending on your SoC/CPU.

> Many ARM systems provide a "persistent clock." Most of them are backed
> by a 32kHz clock that gives good precision and makes the delta_delta
> hack unnecessary. However, devices that only have single-second
> precision for the persistent clock and/or are forced to use the RTC
> (whose API only allows for single-second precision) -- they still need
> this hack.

I have no idea what you are trying to tell me. We know that there are
systems which have a clocksource which continues to tick across
suspend/resume.

Such clocksources can be flagged with CLOCK_SOURCE_SUSPEND_NONSTOP and the
timekeeping resume code uses them when available instead of using the
RTC. There are a few of them flagged as such in the kernel, but it might
not be the complete set of capable devices which is neither a problem of
the timekeeping core code nor of the maintainers, as we are not able to
verify every single hardware detail of a submitted driver for obvious
reasons.

It's neither a problem of the timekeeping core code to select the
appropriate clocksource for a system. That's up to the developer who
implemented the SoC/CPU support and made a decision about rating the
clocksource(s), which might end up selecting one which stops in suspend.

If your particular use case is not working with the rating decision of the
developers you can override that decision on the kernel command line or
after boot through sysfs. With some restrictions you can even switch back
and forth during runtime, though that might not be the best idea if you
want to maintain a high precision wall time clock. Again, that depends on
your use case and system.

If the clocksource on your system is missing a flag despite being nonstop
across suspend, feel free to send a patch.

> I've attached a patch that we developed in-house. I have a feeling you
> won't like it... since it pushes the responsibility on whoever
> configures the kernel. It also ignores the RTC problem (which will
> still affect a lot of battery-powered devices).

Without looking at what it does, I can tell you that making it a config
option is not going to fly. It has to be runtime discoverable as we have to
support multi platform kernels.

Though I still want to know exactly why you think that you need some extra
magic in the timekeeping core code. If your system has a clocksource which
is not stopping on suspend and it lacks the flag, then this is a one liner
patch. If there is something else, then please elaborate.

Thanks,

	tglx
Gabriel Beddingfield Oct. 4, 2017, 11:10 p.m. | #2
Hi Thomas,

Thanks for your reply!

On Wed, Oct 4, 2017 at 11:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Calling things a hack which might have been thought out carefully does not
> make people more receptive.

My bad... sorry! You're right. The code in question is better than a "hack."

>> Many ARM systems provide a "persistent clock." Most of them are backed
>> by a 32kHz clock that gives good precision and makes the delta_delta
>> hack unnecessary. However, devices that only have single-second
>> precision for the persistent clock and/or are forced to use the RTC
>> (whose API only allows for single-second precision) -- they still need
>> this hack.
>
> I have no idea what you are trying to tell me. We know that there are
> systems which have a clocksource which continues to tick across
> suspend/resume.

I'm referring to read_persistent_clock64() API... which is distinct from the
CLOCK_SOURCE_SUSPEND_NONSTOP flag.

> Such clocksources can be flagged with CLOCK_SOURCE_SUSPEND_NONSTOP and the
> timekeeping resume code uses them when available instead of using the
> RTC. There are a few of them flagged as such in the kernel, but it might
...
> It's neither a problem of the timekeeping core code to select the
> appropriate clocksource for a system. That's up to the developer who
> implemented the SoC/CPU support and made a decision about rating the
> clocksource(s), which might end up selecting one which stops in suspend.

This was our first approach. However, because of some hardware
limitations we couldn't
move the system's monotonic clock to the persistent clock. They had to
be two different
clocks. (Details below.)

> Without looking at what it does, I can tell you that making it a config
> option is not going to fly. It has to be runtime discoverable as we have to
> support multi platform kernels.

OK. Here's a couple ideas...

APPROACH ONE: Use a heuristic

If read_persistent_clock64() ever returns fractional seconds (as
apposed to whole seconds), then
permanently disable the compensation.

APPROACH TWO: Use a property

Provide another weak symbol like this:

    extern u64 persistent_clock_precision;

or..

    void get_persistent_clock_precision(struct timespec *ts);
    void get_persistent_clock_percision64(struct timespec64 *ts);

And the value returned should be, approximately, the minimum time
between clock ticks.
For clocks that only supply whole-second precision... the value should
be NSECS_PER_SEC.
For clocks that are backed by, say, a 32kHz clock... the value should be
(NSECS_PER_SEC/32768).

> Though I still want to know exactly why you think that you need some extra
> magic in the timekeeping core code. If your system has a clocksource which
> is not stopping on suspend and it lacks the flag, then this is a one liner
> patch. If there is something else, then please elaborate.

Long story short: you can't always have your low-power clock be your
monotonic/sched
clock.

The SoC we use backs the monotonic clock (sched_clock_register()) with
a counter that is
high frequency (>10 MHz) in their reference implementation. But it
does not count when the
system is in low-power mode. However, it can be configured to use a
32kHz clock that *does*
count when the system is in low-power mode. So, we started by using
this clock and setting the
CLOCK_SOURCE_SUSPEND_NONSTOP flag. It worked great... at first.

Then we found that devices would randomly experience a 36-hour time jump.
While we don't have a definitive root cause, the current theory is
that we are getting
non-atomic reads because the clock source isn't synchronized with the
the high frequency
clock (which is used for most of the digital logic on the SoC).

Therefore, we moved the monotonic/sched clock back to the high-frequency source.

Meanwhile, we can directly read the RTC clock on this system, and it
will give us 32kHz
resolution and also runs non-stop. Since reads are non-atomic, we have
to read the
registers in a loop. We used this register to implement
read_persistent_clock64().
Because we have to read the registers in a loop, it seemed unfit for use as the
monotonic/sched clock.

Thanks,
Gabe
John Stultz Oct. 5, 2017, 12:16 a.m. | #3
On Wed, Oct 4, 2017 at 9:11 AM, Gabriel Beddingfield <gabe@nestlabs.com> wrote:
> TL;DR: the "delta_delta" hack[1 and 2] in kernel/time/timekeeping.c
> and drivers/rtc/class.c undermines the NTP system. It's not
> appropriate to use if sub-second precision is available. I've attached
> a patch to resolve this... please let me know the ways you hate it.
> :-)
>
> Hello Kernel Timekeeping Maintainers,
>
> We have been developing a device that has very a very aggressive power
> policy, doing suspend/resume cycles a few times a minute ("echo mem >
> /sys/power/state"). In doing so, we found that the system time
> experiences a lot of jitter (compared to, say, an NTP server). It was
> not uncommon for us to see time corrections of 1s to 4s on a regular
> basis. This didn't happen when the device stayed awake, only when it
> was allowed to do suspend/resume.
>
> We found that the problem is an interaction between the NTP code and
> what I call the "delta_delta hack." (see [1] and [2]) This code
> allocates a static variable in a function that contains an offset from
> the system time to the persistent/rtc clock. It uses that time to
> fudge the suspend timestamp so that on resume the sleep time will be
> compensated. It's kind of a statistical hack that assumes things will
> average out. It seems to have two main assumptions:
>
>   1. The persistent/rtc clock has only single-second precision
>   2. The system does not frequently suspend/resume.
>   3. If delta_delta is less than 2 seconds, these assumptions are "true"
>
> Because the delta_delta hack is trying to maintain an offset from the
> system time to the persistent/rtc clock, any minor NTP corrections
> that have occurred since the last suspend will be discarded. However,
> the NTP subsystem isn't notified that this is happening -- and so it
> causes some level of instability in its PLL logic.

So, on resume when we call __timekeeping_inject_sleeptime(), that uses
the TK_CLEAR_NTP which clears the NTP state (sets STA_UNSYNC, etc) .
I'm not sure how else we can notify userspace.  It may be that ntpd
doesn't expect the kernel to set things as unsynced and doesn't
recover well, but the proper fix for that probably is in userspace.

>
> This problem affects any device that does "frequent" suspend/resume
> cycles. I.e. any battery-powered "linux" device (like Android phones).

Ironically, if I recall correctly, the delta_delta "hack" originally
came from Android developers who were trying to find a solution to the
~1sec per suspend  time error that we had before.


> Many ARM systems provide a "persistent clock." Most of them are backed
> by a 32kHz clock that gives good precision and makes the delta_delta
> hack unnecessary. However, devices that only have single-second
> precision for the persistent clock and/or are forced to use the RTC
> (whose API only allows for single-second precision) -- they still need
> this hack.
>
> I've attached a patch that we developed in-house. I have a feeling you
> won't like it... since it pushes the responsibility on whoever
> configures the kernel. It also ignores the RTC problem (which will
> still affect a lot of battery-powered devices).
>
> Please let me know what you think -- and what the right approach for
> solving this would be.

So I suspect the best solution for you here is: provide some
infrastructure so clocksources that set CLOCK_SOURCE_SUSPEND_NONSTOP
which are not the current clocksource can be used for suspend timing.

We should also figure out how to best handle ntpd in userspace dealing
with frequent suspend/resume cycles. This is problematic, as the
closest analogy is trying driving on the road while frequently falling
asleep.  This is not something I think ntpd handles well.  I suspect
our options are that any ntp adjustments being made might be made for
far too long (causing potentially massive over-correction) or not at
all, and not at all seems slightly better in my mind.

Miroslav may have other thoughts?

thanks
-john
John Stultz Oct. 5, 2017, 12:20 a.m. | #4
On Wed, Oct 4, 2017 at 4:10 PM, Gabriel Beddingfield <gabe@nestlabs.com> wrote:
> Hi Thomas,
>
> Thanks for your reply!
>
> On Wed, Oct 4, 2017 at 11:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Calling things a hack which might have been thought out carefully does not
>> make people more receptive.
>
> My bad... sorry! You're right. The code in question is better than a "hack."
>
>>> Many ARM systems provide a "persistent clock." Most of them are backed
>>> by a 32kHz clock that gives good precision and makes the delta_delta
>>> hack unnecessary. However, devices that only have single-second
>>> precision for the persistent clock and/or are forced to use the RTC
>>> (whose API only allows for single-second precision) -- they still need
>>> this hack.
>>
>> I have no idea what you are trying to tell me. We know that there are
>> systems which have a clocksource which continues to tick across
>> suspend/resume.
>
> I'm referring to read_persistent_clock64() API... which is distinct from the
> CLOCK_SOURCE_SUSPEND_NONSTOP flag.
>
>> Such clocksources can be flagged with CLOCK_SOURCE_SUSPEND_NONSTOP and the
>> timekeeping resume code uses them when available instead of using the
>> RTC. There are a few of them flagged as such in the kernel, but it might
> ...
>> It's neither a problem of the timekeeping core code to select the
>> appropriate clocksource for a system. That's up to the developer who
>> implemented the SoC/CPU support and made a decision about rating the
>> clocksource(s), which might end up selecting one which stops in suspend.
>
> This was our first approach. However, because of some hardware
> limitations we couldn't
> move the system's monotonic clock to the persistent clock. They had to
> be two different
> clocks. (Details below.)
>
>> Without looking at what it does, I can tell you that making it a config
>> option is not going to fly. It has to be runtime discoverable as we have to
>> support multi platform kernels.
>
> OK. Here's a couple ideas...
>
> APPROACH ONE: Use a heuristic
>
> If read_persistent_clock64() ever returns fractional seconds (as
> apposed to whole seconds), then
> permanently disable the compensation.


This seems reasonable to me as well.

>> Though I still want to know exactly why you think that you need some extra
>> magic in the timekeeping core code. If your system has a clocksource which
>> is not stopping on suspend and it lacks the flag, then this is a one liner
>> patch. If there is something else, then please elaborate.
>
> Long story short: you can't always have your low-power clock be your
> monotonic/sched
> clock.
>
> The SoC we use backs the monotonic clock (sched_clock_register()) with
> a counter that is
> high frequency (>10 MHz) in their reference implementation. But it
> does not count when the
> system is in low-power mode. However, it can be configured to use a
> 32kHz clock that *does*
> count when the system is in low-power mode. So, we started by using
> this clock and setting the
> CLOCK_SOURCE_SUSPEND_NONSTOP flag. It worked great... at first.
>
> Then we found that devices would randomly experience a 36-hour time jump.
> While we don't have a definitive root cause, the current theory is
> that we are getting
> non-atomic reads because the clock source isn't synchronized with the
> the high frequency
> clock (which is used for most of the digital logic on the SoC).
>
> Therefore, we moved the monotonic/sched clock back to the high-frequency source.
>
> Meanwhile, we can directly read the RTC clock on this system, and it
> will give us 32kHz
> resolution and also runs non-stop. Since reads are non-atomic, we have
> to read the
> registers in a loop. We used this register to implement
> read_persistent_clock64().
> Because we have to read the registers in a loop, it seemed unfit for use as the
> monotonic/sched clock.

Yea. I thought arm devices often had read_persistent_clock64() backed
by the 32k timer (which is poor for time initialization but works well
for suspend timing).

Maybe I misunderstood on the first read. Is it then that the
relatively fine-grained read_persistent_clock64() is colliding with
the delta_delta logic that assumes we get coarse 1sec resolution? In
that case the huristic above seems sane.

thanks
-john
Thomas Gleixner Oct. 5, 2017, 11:01 a.m. | #5
On Wed, 4 Oct 2017, Gabriel Beddingfield wrote:
> On Wed, Oct 4, 2017 at 11:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Long story short: you can't always have your low-power clock be your
> monotonic/sched clock.

sched_clock and the clocksource for timekeeping, which feeds monotonic are
not required to be the same thing.
 
> The SoC we use backs the monotonic clock (sched_clock_register()) with a

Again. monotonic clock != sched clock. The clocksource which feeds the
monotonic timekeeper clock is registered via clocksource_register() & al.

> counter that is high frequency (>10 MHz) in their reference
> implementation. But it does not count when the system is in low-power
> mode. However, it can be configured to use a 32kHz clock that *does*
> count when the system is in low-power mode. So, we started by using this
> clock and setting the CLOCK_SOURCE_SUSPEND_NONSTOP flag. It worked
> great... at first.
> 
> Then we found that devices would randomly experience a 36-hour time jump.
> While we don't have a definitive root cause, the current theory is that
> we are getting non-atomic reads because the clock source isn't
> synchronized with the the high frequency clock (which is used for most of
> the digital logic on the SoC).

Groan. Engineering based on theories is doomed to begin with.

Your 36 hour time jump is probably exactly 36.4089 hours as that's

     ((1 << 32) / 32768) / 3600

i.e. the 32bit rollover of the clocksource. So, if the clocksource->read()
function returns a full 64bit counter value, then it must have protection
against observing the rollover independent of the clock which feeds that
counter. Of course the frequency changes the probablity of observing it,
but still the read function must be protected against observing the
rollover unconditionally.

Which SoC/clocksource driver are you talking about?

> Therefore, we moved the monotonic/sched clock back to the high-frequency source.

Please stop confusing timekeeping clock source and sched clock. They might
be the same physical device but conceptually they are different.

> Meanwhile, we can directly read the RTC clock on this system, and it will
> give us 32kHz resolution and also runs non-stop. Since reads are
> non-atomic, we have to read the registers in a loop. We used this
> register to implement read_persistent_clock64().  Because we have to read
> the registers in a loop, it seemed unfit for use as the monotonic/sched
> clock.

I can understand that. Though, using that value for injecting accurate
sleep time should just work with the existing code no matter how long the
actual sleep time was. The timekeeping core takes the nsec part of the
timespec value retrieved via read_persistent_clock64() into account.

I still have a hard time to figure out what you are trying to achieve.

Thanks,

	tglx
Thomas Gleixner Oct. 5, 2017, 11:05 a.m. | #6
On Wed, 4 Oct 2017, John Stultz wrote:
> On Wed, Oct 4, 2017 at 9:11 AM, Gabriel Beddingfield <gabe@nestlabs.com> wrote:
> > TL;DR: the "delta_delta" hack[1 and 2] in kernel/time/timekeeping.c
> > and drivers/rtc/class.c undermines the NTP system. It's not
> > appropriate to use if sub-second precision is available. I've attached
> > a patch to resolve this... please let me know the ways you hate it.
> > :-)
> >
> > Hello Kernel Timekeeping Maintainers,
> >
> > We have been developing a device that has very a very aggressive power
> > policy, doing suspend/resume cycles a few times a minute ("echo mem >
> > /sys/power/state"). In doing so, we found that the system time
> > experiences a lot of jitter (compared to, say, an NTP server). It was
> > not uncommon for us to see time corrections of 1s to 4s on a regular
> > basis. This didn't happen when the device stayed awake, only when it
> > was allowed to do suspend/resume.
> >
> > We found that the problem is an interaction between the NTP code and
> > what I call the "delta_delta hack." (see [1] and [2]) This code
> > allocates a static variable in a function that contains an offset from
> > the system time to the persistent/rtc clock. It uses that time to
> > fudge the suspend timestamp so that on resume the sleep time will be
> > compensated. It's kind of a statistical hack that assumes things will
> > average out. It seems to have two main assumptions:
> >
> >   1. The persistent/rtc clock has only single-second precision
> >   2. The system does not frequently suspend/resume.
> >   3. If delta_delta is less than 2 seconds, these assumptions are "true"
> >
> > Because the delta_delta hack is trying to maintain an offset from the
> > system time to the persistent/rtc clock, any minor NTP corrections
> > that have occurred since the last suspend will be discarded. However,
> > the NTP subsystem isn't notified that this is happening -- and so it
> > causes some level of instability in its PLL logic.
> 
> So, on resume when we call __timekeeping_inject_sleeptime(), that uses
> the TK_CLEAR_NTP which clears the NTP state (sets STA_UNSYNC, etc) .
> I'm not sure how else we can notify userspace.  It may be that ntpd
> doesn't expect the kernel to set things as unsynced and doesn't
> recover well, but the proper fix for that probably is in userspace.

Errm. No, __timekeeping_inject_sleeptime() only updates the timekeeper.

We have two call sites:

timekeeping_resume()
{
	.....
	if (sleeptime_injected)
		__timekeeping_inject_sleeptime(tk, &ts_delta);
	...
	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
	...
}

and

timekeeping_inject_sleeptime64()
{
	__timekeeping_inject_sleeptime(tk, &delta);
	...
	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);
	...
}

But Gabriel talks about the effects from injecting sleep time in
timekeeping_resume() because that's where we use
read_persistent_clock64(). And there we don't clear NTP, unless there is
some magic I'm missing completely.

Thanks,

	tglx
Miroslav Lichvar Oct. 5, 2017, 11:08 a.m. | #7
On Wed, Oct 04, 2017 at 05:16:31PM -0700, John Stultz wrote:
> On Wed, Oct 4, 2017 at 9:11 AM, Gabriel Beddingfield <gabe@nestlabs.com> wrote:
> > We found that the problem is an interaction between the NTP code and
> > what I call the "delta_delta hack." (see [1] and [2]) This code
> > allocates a static variable in a function that contains an offset from
> > the system time to the persistent/rtc clock. It uses that time to
> > fudge the suspend timestamp so that on resume the sleep time will be
> > compensated. It's kind of a statistical hack that assumes things will
> > average out. It seems to have two main assumptions:
> >
> >   1. The persistent/rtc clock has only single-second precision
> >   2. The system does not frequently suspend/resume.
> >   3. If delta_delta is less than 2 seconds, these assumptions are "true"
> >
> > Because the delta_delta hack is trying to maintain an offset from the
> > system time to the persistent/rtc clock, any minor NTP corrections
> > that have occurred since the last suspend will be discarded. However,
> > the NTP subsystem isn't notified that this is happening -- and so it
> > causes some level of instability in its PLL logic.

This is interesting. What polling interval was ntpd using? If I
understand it correctly, with a high-resolution persistent clock the
delta-delta compensation should be very small and shouldn't disrupt
ntpd. Does this instability disappear when ntpd is not controlling the
clock (i.e. "disable ntp" in ntp.conf)?

> We should also figure out how to best handle ntpd in userspace dealing
> with frequent suspend/resume cycles. This is problematic, as the
> closest analogy is trying driving on the road while frequently falling
> asleep.  This is not something I think ntpd handles well.  I suspect
> our options are that any ntp adjustments being made might be made for
> far too long (causing potentially massive over-correction) or not at
> all, and not at all seems slightly better in my mind.

Yeah, controlling the clock in such conditions will be difficult. The
kernel/ntp PLL requires periodic updates. There is some code in
ntp_update_offset() that reduces the frequency adjustment when PLL
updates are missing, but I'm not actually sure if it works correctly
with suspend.
Thomas Gleixner Oct. 5, 2017, 2:11 p.m. | #8
On Thu, 5 Oct 2017, Thomas Gleixner wrote:
> On Wed, 4 Oct 2017, John Stultz wrote:
> > So, on resume when we call __timekeeping_inject_sleeptime(), that uses
> > the TK_CLEAR_NTP which clears the NTP state (sets STA_UNSYNC, etc) .
> > I'm not sure how else we can notify userspace.  It may be that ntpd
> > doesn't expect the kernel to set things as unsynced and doesn't
> > recover well, but the proper fix for that probably is in userspace.
> 
> Errm. No, __timekeeping_inject_sleeptime() only updates the timekeeper.

That should read:

     updates the timekeeper data fields, but does not call
     timekeeping_update().

> 
> We have two call sites:
> 
> timekeeping_resume()
> {
> 	.....
> 	if (sleeptime_injected)
> 		__timekeeping_inject_sleeptime(tk, &ts_delta);
> 	...
> 	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
> 	...
> }
> 
> and
> 
> timekeeping_inject_sleeptime64()
> {
> 	__timekeeping_inject_sleeptime(tk, &delta);
> 	...
> 	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);
> 	...
> }
> 
> But Gabriel talks about the effects from injecting sleep time in
> timekeeping_resume() because that's where we use
> read_persistent_clock64(). And there we don't clear NTP, unless there is
> some magic I'm missing completely.
> 
> Thanks,
> 
> 	tglx
> 
> 
>
Gabriel Beddingfield Oct. 5, 2017, 4:46 p.m. | #9
Hi John,

First, my apologies for calling it a "hack." I just went back and looked at the
commit history and this is first-class stuff... and you explained it very well
(including the NTP interaction) in the commit message. I'm pretty sure I
read this before, but I reckon most of it went over my head and I garbled it.

On Wed, Oct 4, 2017 at 5:20 PM, John Stultz <john.stultz@linaro.org> wrote:
> Yea. I thought arm devices often had read_persistent_clock64() backed
> by the 32k timer (which is poor for time initialization but works well
> for suspend timing).
>
> Maybe I misunderstood on the first read. Is it then that the
> relatively fine-grained read_persistent_clock64() is colliding with
> the delta_delta logic that assumes we get coarse 1sec resolution? In
> that case the huristic above seems sane.

Yes, exactly.

-gabe
Gabriel Beddingfield Oct. 5, 2017, 4:47 p.m. | #10
Hi Thomas,

On Thu, Oct 5, 2017 at 4:01 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > The SoC we use backs the monotonic clock (sched_clock_register()) with a
>
> Again. monotonic clock != sched clock. The clocksource which feeds the
> monotonic timekeeper clock is registered via clocksource_register() & al.

Sorry. I was hedging since I wasn't sure *which* term to use.

> Groan. Engineering based on theories is doomed to begin with.

I knew you'd say that. :-p

> Your 36 hour time jump is probably exactly 36.4089 hours as that's
>
>      ((1 << 32) / 32768) / 3600
>
> i.e. the 32bit rollover of the clocksource. So, if the clocksource->read()
> function returns a full 64bit counter value, then it must have protection
> against observing the rollover independent of the clock which feeds that
> counter. Of course the frequency changes the probablity of observing it,
> but still the read function must be protected against observing the
> rollover unconditionally.

Right, but isn't this what clocksource->mask is supposed to do? When we change
the back-end frequency, we're still using the same front-end 32-bit register and
we don't see the same jumps.

> Which SoC/clocksource driver are you talking about?

NXP i.MX 6SoloX
drivers/clocksource/timer-imx-gpt.c
drivers/rtc/rtc-snvs.c
arch/arm/boot/dts/imx6sx.dtsi (included from imx6sx-sdb.dts)
   (node: soc/aips1/gpt)

We patched the RTC driver to call register_persistent_clock()
(arch/arm/kernel/time.c)

The upstream driver doesn't support using the 32kHz clock, but the silicone
does... so we modified the driver to accept it and set the "NONSTOP" flag.

> I can understand that. Though, using that value for injecting accurate
> sleep time should just work with the existing code no matter how long the
> actual sleep time was. The timekeeping core takes the nsec part of the
> timespec value retrieved via read_persistent_clock64() into account.
>
> I still have a hard time to figure out what you are trying to achieve.

I'm trying to disable this block:

int timekeeping_suspend(void)
{
        struct timekeeper *tk = &tk_core.timekeeper;
        unsigned long flags;
#ifdef CONFIG_PERSISTENT_CLOCK_IS_LOW_PRECISION
        struct timespec64               delta, delta_delta;
        static struct timespec64        old_delta;
#endif

        read_persistent_clock64(&timekeeping_suspend_time);

//...

#ifdef CONFIG_PERSISTENT_CLOCK_IS_LOW_PRECISION
        if (persistent_clock_exists) {
                /*
                 * To avoid drift caused by repeated suspend/resumes,
                 * which each can add ~1 second drift error,
                 * try to compensate so the difference in system time
                 * and persistent_clock time stays close to constant.
                 */
                delta = timespec64_sub(tk_xtime(tk), timekeeping_suspend_time);
                delta_delta = timespec64_sub(delta, old_delta);
                if (abs(delta_delta.tv_sec) >= 2) {
                        /*
                         * if delta_delta is too large, assume time correction
                         * has occurred and set old_delta to the current delta.
                         */
                        old_delta = delta;
                } else {
                        /* Otherwise try to adjust old_system to compensate */
                        timekeeping_suspend_time =

timespec64_add(timekeeping_suspend_time, delta_delta);
                }
        }
#endif

In the typical case, it is trying to maintain the assumption that
(ideally) 'delta'
is a constant. It tries to maintain this assumption by fiddling with
the "suspend
time" -- adding or subtracting a little to it.

However, if NTP has adjusted the system time but *not* adjusted the persistent
clock's time... then the assumption is violated, 'delta' is *not*
constant, and this
block cancels out the NTP corrections.

-gabe
Thomas Gleixner Oct. 5, 2017, 6:01 p.m. | #11
Gabriel,

On Thu, 5 Oct 2017, Gabriel Beddingfield wrote:
> On Thu, Oct 5, 2017 at 4:01 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > i.e. the 32bit rollover of the clocksource. So, if the clocksource->read()
> > function returns a full 64bit counter value, then it must have protection
> > against observing the rollover independent of the clock which feeds that
> > counter. Of course the frequency changes the probablity of observing it,
> > but still the read function must be protected against observing the
> > rollover unconditionally.
> 
> Right, but isn't this what clocksource->mask is supposed to do? When we change
> the back-end frequency, we're still using the same front-end 32-bit register and
> we don't see the same jumps.

Right. That's what the mask should protect. I was assuming that this is one
of the fancy clocksources which expose two 32bit registers of a 64bit
counter and the rollover protection was missing. So that's not the
case. Good, or not so good :)

> > Which SoC/clocksource driver are you talking about?
> 
> NXP i.MX 6SoloX
> drivers/clocksource/timer-imx-gpt.c

So that clocksource driver looks correct. Do you have an idea in which
context this time jump happens? Does it happen when you exercise your high
frequency suspend/resume dance or is that happening just when you let the
machine run forever as well?

The timekeeping_resume() path definitely has an issue:

        cycle_now = tk_clock_read(&tk->tkr_mono);
        if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
                cycle_now > tk->tkr_mono.cycle_last) {

This works nice for clocksources which wont wrap across suspend/resume but
not for those which can. That cycle_now -> cycle_last check should take
cs-mask into account ...

Of course for clocksources which can wrap within realistic suspend times,
which 36 hours might be accounted for, this would need an extra sanity
check against a RTC whether wrap time has been exceeded.

I haven't thought it through whether that buggered check fully explains
what you are observing, but it's wrong nevertheless. John?

Thanks,

	tglx
Gabriel Beddingfield Oct. 5, 2017, 8:14 p.m. | #12
Hi John,

On Wed, Oct 4, 2017 at 5:16 PM, John Stultz <john.stultz@linaro.org> wrote:
>> Please let me know what you think -- and what the right approach for
>> solving this would be.
>
> So I suspect the best solution for you here is: provide some
> infrastructure so clocksources that set CLOCK_SOURCE_SUSPEND_NONSTOP
> which are not the current clocksource can be used for suspend timing.

Let me see if I understand how this might work in my situation...

1. I register a clocksource and set the NONSTOP flag.
2. Give it a "low" rating so that it's not selected as the timekeeping
clocksource.
3. Create functions clocksource_select_persistent() and
timekeeping_notify_persistent()
4. Add `struct tk_read_base tk_persistent' to `struct timekeeper'
5. Possibly add a change_persistent_clocksource() function to timekeeping.c

> We should also figure out how to best handle ntpd in userspace dealing
> with frequent suspend/resume cycles. This is problematic, as the
> closest analogy is trying driving on the road while frequently falling
> asleep.  This is not something I think ntpd handles well.  I suspect
> our options are that any ntp adjustments being made might be made for
> far too long (causing potentially massive over-correction) or not at
> all, and not at all seems slightly better in my mind.

Indeed. Because of this, we've actually disabled NTP time slewing for now. We
always "jump" the time whenever we sync with the server.

However, I set up a test with the chrony NTP daemon (and the "delta_delta"
compensation disabled). I modified chrony to do the following:

    1. hold a "wake lock" with the power management daemon whenever
doing anything (e.g. sync with time server)
    2. use an alarmtimer (timerfd backed by CLOCK_BOOTTIME_ALARM) to
back up the select(2) timeout.

In this configuration, the NTP corrections were very stable, the drift
converged, the
jitter was negligible (less than 0.010 sec), and the time synchronized
very slowly
and methodically (as it should).

In a similar test with connman's NTP implementation (which is much less
sophisticated and does not estimate drift), we got "good" results.

Thanks,
Gabe
Gabriel Beddingfield Oct. 5, 2017, 8:51 p.m. | #13
Hi Thomas,

On Thu, Oct 5, 2017 at 11:01 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Which SoC/clocksource driver are you talking about?
>>
>> NXP i.MX 6SoloX
>> drivers/clocksource/timer-imx-gpt.c
>
> So that clocksource driver looks correct. Do you have an idea in which
> context this time jump happens? Does it happen when you exercise your high
> frequency suspend/resume dance or is that happening just when you let the
> machine run forever as well?

We couldn't devise any reproduction steps. We observed it happening at
unexpected
times in a fleet of devices -- and we couldn't find any patterns to clue us in.

>
> The timekeeping_resume() path definitely has an issue:
>
>         cycle_now = tk_clock_read(&tk->tkr_mono);
>         if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
>                 cycle_now > tk->tkr_mono.cycle_last) {
>
> This works nice for clocksources which wont wrap across suspend/resume but
> not for those which can. That cycle_now -> cycle_last check should take
> cs-mask into account ...
>
> Of course for clocksources which can wrap within realistic suspend times,
> which 36 hours might be accounted for, this would need an extra sanity
> check against a RTC whether wrap time has been exceeded.
>
> I haven't thought it through whether that buggered check fully explains
> what you are observing, but it's wrong nevertheless. John?

Nah. It looks like the consequence is that you'll either fail to
inject the sleep time
or you'll fall back to having the RTC inject the sleep time. In our
case, we never
sleep for more than a couple of minutes so the error would be seconds rather
than hours.

-gabe
Thomas Gleixner Oct. 5, 2017, 9:04 p.m. | #14
On Thu, 5 Oct 2017, Gabriel Beddingfield wrote:

> Hi Thomas,
> 
> On Thu, Oct 5, 2017 at 11:01 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > Which SoC/clocksource driver are you talking about?
> >>
> >> NXP i.MX 6SoloX
> >> drivers/clocksource/timer-imx-gpt.c
> >
> > So that clocksource driver looks correct. Do you have an idea in which
> > context this time jump happens? Does it happen when you exercise your high
> > frequency suspend/resume dance or is that happening just when you let the
> > machine run forever as well?
> 
> We couldn't devise any reproduction steps. We observed it happening at
> unexpected times in a fleet of devices -- and we couldn't find any
> patterns to clue us in.

Ok. Did you talk to NXP about that? Or did you try to exercise reads in a
loop to detect the wreckage and maybe a pattern in there?

> > The timekeeping_resume() path definitely has an issue:
> >
> >         cycle_now = tk_clock_read(&tk->tkr_mono);
> >         if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
> >                 cycle_now > tk->tkr_mono.cycle_last) {
> >
> > This works nice for clocksources which wont wrap across suspend/resume but
> > not for those which can. That cycle_now -> cycle_last check should take
> > cs-mask into account ...
> >
> > Of course for clocksources which can wrap within realistic suspend times,
> > which 36 hours might be accounted for, this would need an extra sanity
> > check against a RTC whether wrap time has been exceeded.
> >
> > I haven't thought it through whether that buggered check fully explains
> > what you are observing, but it's wrong nevertheless. John?
> 
> Nah. It looks like the consequence is that you'll either fail to inject
> the sleep time or you'll fall back to having the RTC inject the sleep
> time. In our case, we never sleep for more than a couple of minutes so
> the error would be seconds rather than hours.

Fair enough. It's still wrong though and wants to be fixed.

Thanks,

	tglx
Gabriel Beddingfield Oct. 5, 2017, 9:12 p.m. | #15
On Thu, Oct 5, 2017 at 2:04 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > So that clocksource driver looks correct. Do you have an idea in which
>> > context this time jump happens? Does it happen when you exercise your high
>> > frequency suspend/resume dance or is that happening just when you let the
>> > machine run forever as well?
>>
>> We couldn't devise any reproduction steps. We observed it happening at
>> unexpected times in a fleet of devices -- and we couldn't find any
>> patterns to clue us in.
>
> Ok. Did you talk to NXP about that? Or did you try to exercise reads in a
> loop to detect the wreckage and maybe a pattern in there?

Yes, we talked to NXP about it. They don't have a conclusion on what happened.
While they've been very helpful... we were off the path from their reference
implementation and so it's not a high priority for them.

No, we didn't try that because we prioritized the "persistent clock" approach.
I have a little more time now and can try the loop-reading strategy.

-gabe
Thomas Gleixner Oct. 5, 2017, 9:31 p.m. | #16
Gabriel,

On Thu, 5 Oct 2017, Gabriel Beddingfield wrote:

> Hi John,
> 
> On Wed, Oct 4, 2017 at 5:16 PM, John Stultz <john.stultz@linaro.org> wrote:
> >> Please let me know what you think -- and what the right approach for
> >> solving this would be.
> >
> > So I suspect the best solution for you here is: provide some
> > infrastructure so clocksources that set CLOCK_SOURCE_SUSPEND_NONSTOP
> > which are not the current clocksource can be used for suspend timing.
> 
> Let me see if I understand how this might work in my situation...
> 
> 1. I register a clocksource and set the NONSTOP flag.
> 2. Give it a "low" rating so that it's not selected as the timekeeping
> clocksource.
> 3. Create functions clocksource_select_persistent() and
> timekeeping_notify_persistent()
> 4. Add `struct tk_read_base tk_persistent' to `struct timekeeper'
> 5. Possibly add a change_persistent_clocksource() function to timekeeping.c

That might work, but that looks a tad too complex. Let me give it a try:

1) Create a new flag CLOCK_SOURCE_SUSPEND_BACKUP

   This makes sense because such a clocksource is likely to be something
   which you don't want ever to use for timekeeping even if its the only
   thing aside of jiffies.

2) Set this flag when registering a clocksource, which excludes it from the
   normal selection process.

3) Make the registration code select such a clocksource as the backup for
   suspend resume to brige the gap when the timekeeper clocksource does not
   have the NONSTOP flag set.

   You don't need the extra timekeeping_notify_persistent() because in that
   case the maybe current backup clocksource is definitely not in use and
   can be replaced without the stompmachine muck which we need for
   replacing the current timekeeper clocksource. The system cannot be in
   suspend at this point obviously. so all it needs is to switch a pointer.

   You neither need this extra stuff in struct timekeeper, it's big enough
   anyway. A simple static struct tk_read_base should be sufficient.

On suspend you do

   if (tk_backup->clock)
   	sync_data(timekeeper, tk_backup);

   You still want to record the RTC based time if possible, in case that
   the backup timekeeper can wrap so you have a sanity check for that
   similar to the one we need for NONSTOP clocksources. If there is no RTC
   then we need a sensible cutoff for that wraparound time which makes sure
   that we don't trip over our own feet.

On resume you check tk_backup->clock again, do the RTC sanity check, if
available and valid (either wraps above cutoff or is checked via RTC). If
that's ok, then you update the timekeeper and proceed. If not, use the
fallback or do nothing in the worst case.

Thoughts?

Thanks,

	tglx
Thomas Gleixner Oct. 5, 2017, 9:33 p.m. | #17
On Thu, 5 Oct 2017, Thomas Gleixner wrote:
> On Thu, 5 Oct 2017, Gabriel Beddingfield wrote:
> 
> > Hi Thomas,
> > 
> > On Thu, Oct 5, 2017 at 11:01 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > >> > Which SoC/clocksource driver are you talking about?
> > >>
> > >> NXP i.MX 6SoloX
> > >> drivers/clocksource/timer-imx-gpt.c
> > >
> > > So that clocksource driver looks correct. Do you have an idea in which
> > > context this time jump happens? Does it happen when you exercise your high
> > > frequency suspend/resume dance or is that happening just when you let the
> > > machine run forever as well?
> > 
> > We couldn't devise any reproduction steps. We observed it happening at
> > unexpected times in a fleet of devices -- and we couldn't find any
> > patterns to clue us in.
> 
> Ok. Did you talk to NXP about that? Or did you try to exercise reads in a
> loop to detect the wreckage and maybe a pattern in there?

The reason I'm asking is to exclude any weird issue in the timekeeping
code, which is still a possibility, despite the fact that I went through it
with a fine comb after stumbling over that check in the resume path.

Thanks,

	tglx
Pavel Machek Oct. 15, 2017, 6:39 a.m. | #18
H!

> We have been developing a device that has very a very aggressive power
> policy, doing suspend/resume cycles a few times a minute ("echo mem >
> /sys/power/state"). In doing so, we found that the system time
> experiences a lot of jitter (compared to, say, an NTP server). It was
> not uncommon for us to see time corrections of 1s to 4s on a regular
> basis. This didn't happen when the device stayed awake, only when it
> was allowed to do suspend/resume.

Ok, so I have various machines here. It seems only about half of them has
working RTC. That are the boring ones.

And even the boring ones have pretty imprecise RTCs... For example Nokia N9.
I only power it up from time to time, I believe it drifts something like
minute per month... For normal use with SIM card, it can probably correct
from GSM network if you happen to have a cell phone signal, but...

More interesting machines... Old thinkpad is running without CMOS battery.
ARM OLPC has _three_ RTCs, but not a single working one. N900 has working
RTC but no or dead backup battery. On these, RTC driver probably knows
time is not valid, but feeds the garbage into the system time, anyway. Ouch.
Neither Sharp Zaurus SL-5500 nor C-3000 had battery backup on RTC...

Even in new end-user machines, time quality varies a lot. "First boot, please
enter time" is only accurate to seconds, if the user is careful. RTC is usually
not very accurate, either... and noone uses adjtime these days. GSM time and
ntpdate are probably accurate to miliseconds, GPS can provide time down to
picoseconds... And broken systems are so common "swclock" is available in
init system to store time in file, so it at least does not go backwards.

https (and other crypto) depends on time... so it is important to know approximate
month we are in.

Is it time we handle it better?

Could we return both time and log2(expected error) from system calls?

That way we could hide the clock in GUI if time is not available or not precise
to minutes, ignore certificate dates when time is not precise to months, and
you would not have to send me a "Pavel, are you time traveling, again?" message
next time my mailer sends email dated to 1970.

									Pavel

PS: And yes, I'm time-travaling again, this time by ~15 hours.
Alan Cox Oct. 18, 2017, 8:34 p.m. | #19
> And even the boring ones have pretty imprecise RTCs... For example Nokia N9.
> I only power it up from time to time, I believe it drifts something like
> minute per month... For normal use with SIM card, it can probably correct
> from GSM network if you happen to have a cell phone signal, but...
> 
> More interesting machines... Old thinkpad is running without CMOS battery.
> ARM OLPC has _three_ RTCs, but not a single working one. N900 has working
> RTC but no or dead backup battery. On these, RTC driver probably knows
> time is not valid, but feeds the garbage into the system time, anyway. Ouch.
> Neither Sharp Zaurus SL-5500 nor C-3000 had battery backup on RTC...

Not a new problem, RTC's used to cost lots of money 8)

Most early Unixen set the clock at boot from the superblock timestamp of
the root fs, some with RTC's also used to scream at you if the superblock
stamp was too far head of current time.That doesn't quite work with
initrd but you can do the same in userspace on Linux so you'll at least
get 'when I last booted it' and because it's always moving forward lots
of other messes don't happen.

Alan
Pavel Machek Oct. 18, 2017, 9:08 p.m. | #20
On Wed 2017-10-18 21:34:49, Alan Cox wrote:
> > And even the boring ones have pretty imprecise RTCs... For example Nokia N9.
> > I only power it up from time to time, I believe it drifts something like
> > minute per month... For normal use with SIM card, it can probably correct
> > from GSM network if you happen to have a cell phone signal, but...
> > 
> > More interesting machines... Old thinkpad is running without CMOS battery.
> > ARM OLPC has _three_ RTCs, but not a single working one. N900 has working
> > RTC but no or dead backup battery. On these, RTC driver probably knows
> > time is not valid, but feeds the garbage into the system time, anyway. Ouch.
> > Neither Sharp Zaurus SL-5500 nor C-3000 had battery backup on RTC...
> 
> Not a new problem, RTC's used to cost lots of money 8)
> 
> Most early Unixen set the clock at boot from the superblock timestamp of
> the root fs, some with RTC's also used to scream at you if the superblock
> stamp was too far head of current time.That doesn't quite work with
> initrd but you can do the same in userspace on Linux so you'll at least
> get 'when I last booted it' and because it's always moving forward lots
> of other messes don't happen.

Yeah, that's what "swclock" module of init system does. Someone solved
that one for me. But... that's not good enough. In particular, I'd
like time not to be displayed when "swclock" was used. Nokia 6230
could do it, so linux should be able to do it, too.

It seems kernel should pass accurancy info, or at least "this time is
probably off by hours at least" from gettimeofday() and similar
syscalls...


									Pavel
Alexandre Belloni Oct. 18, 2017, 9:26 p.m. | #21
On 18/10/2017 at 23:08:28 +0200, Pavel Machek wrote:
> On Wed 2017-10-18 21:34:49, Alan Cox wrote:
> > > And even the boring ones have pretty imprecise RTCs... For example Nokia N9.
> > > I only power it up from time to time, I believe it drifts something like
> > > minute per month... For normal use with SIM card, it can probably correct
> > > from GSM network if you happen to have a cell phone signal, but...
> > > 
> > > More interesting machines... Old thinkpad is running without CMOS battery.
> > > ARM OLPC has _three_ RTCs, but not a single working one. N900 has working
> > > RTC but no or dead backup battery. On these, RTC driver probably knows
> > > time is not valid, but feeds the garbage into the system time, anyway. Ouch.
> > > Neither Sharp Zaurus SL-5500 nor C-3000 had battery backup on RTC...
> > 
> > Not a new problem, RTC's used to cost lots of money 8)
> > 
> > Most early Unixen set the clock at boot from the superblock timestamp of
> > the root fs, some with RTC's also used to scream at you if the superblock
> > stamp was too far head of current time.That doesn't quite work with
> > initrd but you can do the same in userspace on Linux so you'll at least
> > get 'when I last booted it' and because it's always moving forward lots
> > of other messes don't happen.
> 
> Yeah, that's what "swclock" module of init system does. Someone solved
> that one for me. But... that's not good enough. In particular, I'd
> like time not to be displayed when "swclock" was used. Nokia 6230
> could do it, so linux should be able to do it, too.
> 
> It seems kernel should pass accurancy info, or at least "this time is
> probably off by hours at least" from gettimeofday() and similar
> syscalls...

The question being how do you expect the kernel to get that information?

Some RTCs will tell you when they lost time/time accuracy and this
should be properly reported by the driver. If not, this has to be
implemented.

For anything else, it is probably the job of userspace to try to be
clever.
Pavel Machek Oct. 18, 2017, 9:56 p.m. | #22
On Wed 2017-10-18 23:26:45, Alexandre Belloni wrote:
> On 18/10/2017 at 23:08:28 +0200, Pavel Machek wrote:
> > On Wed 2017-10-18 21:34:49, Alan Cox wrote:
> > > > And even the boring ones have pretty imprecise RTCs... For example Nokia N9.
> > > > I only power it up from time to time, I believe it drifts something like
> > > > minute per month... For normal use with SIM card, it can probably correct
> > > > from GSM network if you happen to have a cell phone signal, but...
> > > > 
> > > > More interesting machines... Old thinkpad is running without CMOS battery.
> > > > ARM OLPC has _three_ RTCs, but not a single working one. N900 has working
> > > > RTC but no or dead backup battery. On these, RTC driver probably knows
> > > > time is not valid, but feeds the garbage into the system time, anyway. Ouch.
> > > > Neither Sharp Zaurus SL-5500 nor C-3000 had battery backup on RTC...
> > > 
> > > Not a new problem, RTC's used to cost lots of money 8)
> > > 
> > > Most early Unixen set the clock at boot from the superblock timestamp of
> > > the root fs, some with RTC's also used to scream at you if the superblock
> > > stamp was too far head of current time.That doesn't quite work with
> > > initrd but you can do the same in userspace on Linux so you'll at least
> > > get 'when I last booted it' and because it's always moving forward lots
> > > of other messes don't happen.
> > 
> > Yeah, that's what "swclock" module of init system does. Someone solved
> > that one for me. But... that's not good enough. In particular, I'd
> > like time not to be displayed when "swclock" was used. Nokia 6230
> > could do it, so linux should be able to do it, too.
> > 
> > It seems kernel should pass accurancy info, or at least "this time is
> > probably off by hours at least" from gettimeofday() and similar
> > syscalls...
> 
> The question being how do you expect the kernel to get that information?

I expect userland to tell kernel. "time is now 12.34, source is ntp,
to it is precise to milliseconds". Or maybe "time is now 23.45, source
is swclock, so it may be off by months".

> Some RTCs will tell you when they lost time/time accuracy and this
> should be properly reported by the driver. If not, this has to be
> implemented.

How is it reported to the userspace?

> For anything else, it is probably the job of userspace to try to be
> clever.

Userspace would be fine with me, but as far as I can tell, there's no
good way to do it in userspace.

My proposal would be: kernel keeps accuracy for timeofday.

If RTC says time is bad, accuracy is set to ~0.

settimeofday sets accuracy to 0 (completely accurate).

new_settimeofday gets new argument, accuracy.

new_gettimofday returns accuracy, too.

Does that sound sane? I'm not sure what other interfaces need to be extended.

Best regards,
									Pavel
Alexandre Belloni Nov. 4, 2017, 3:34 p.m. | #23
On 18/10/2017 at 23:56:38 +0200, Pavel Machek wrote:
> > Some RTCs will tell you when they lost time/time accuracy and this
> > should be properly reported by the driver. If not, this has to be
> > implemented.
> 
> How is it reported to the userspace?
> 

Userspace will get -EINVAL when using the RTC_RD_TIME ioctl. The kernel
also get the same error when reading the time from the RTC.

> > For anything else, it is probably the job of userspace to try to be
> > clever.
> 
> Userspace would be fine with me, but as far as I can tell, there's no
> good way to do it in userspace.
> 
> My proposal would be: kernel keeps accuracy for timeofday.
> 
> If RTC says time is bad, accuracy is set to ~0.
> 
> settimeofday sets accuracy to 0 (completely accurate).
> 
> new_settimeofday gets new argument, accuracy.
> 
> new_gettimofday returns accuracy, too.
> 
> Does that sound sane? I'm not sure what other interfaces need to be extended.
> 
> Best regards,
> 									Pavel
> 
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Patch

From c03ceced9a210b48f2552e7dafa9099ef2449370 Mon Sep 17 00:00:00 2001
From: "Gabriel M. Beddingfield" <beddingfield@nestlabs.com>
Date: Wed, 4 Oct 2017 08:38:57 -0700
Subject: [PATCH] time: add CONFIG_PERSISTENT_CLOCK_IS_LOW_PRECISION to disable
 suspend/resume hack

Signed-off-by: Gabriel M. Beddingfield <beddingfield@nestlabs.com>
Signed-off-by: Guy <guy@nestlabs.com>
---
 kernel/time/Kconfig       | 16 ++++++++++++++++
 kernel/time/timekeeping.c |  4 ++++
 2 files changed, 20 insertions(+)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index ac09bc29eb08..32d54086c96c 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -143,5 +143,21 @@  config HIGH_RES_TIMERS
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
+config PERSISTENT_CLOCK_IS_LOW_PRECISION
+	bool "The persistent clock has only single-second precision"
+	default y
+	help
+	  When enabled, then on suspend/resume the timekeeping code will
+	  try to maintain a constant offset between the system time and
+	  the persistent clock as a means of compensating for the coarse
+	  (+/- 1 second) sleep time calculation. However, this will discard
+	  any "small" NTP corrections that have happened since the last
+	  resume. However, if the system's persistent clock has better
+	  precision (e.g. because it's backed by a 32kHz clock), this is
+	  not necessary and introduces unneeded time jitter.
+
+	  If your persistent clock has only single-second precision, say Y.
+	  If your persistent clock has sub-second precision, say N.
+
 endmenu
 endif
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 2cafb49aa65e..b2c7b443ef37 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1694,8 +1694,10 @@  int timekeeping_suspend(void)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
 	unsigned long flags;
+#ifdef CONFIG_PERSISTENT_CLOCK_IS_LOW_PRECISION
 	struct timespec64		delta, delta_delta;
 	static struct timespec64	old_delta;
+#endif
 
 	read_persistent_clock64(&timekeeping_suspend_time);
 
@@ -1712,6 +1714,7 @@  int timekeeping_suspend(void)
 	timekeeping_forward_now(tk);
 	timekeeping_suspended = 1;
 
+#ifdef CONFIG_PERSISTENT_CLOCK_IS_LOW_PRECISION
 	if (persistent_clock_exists) {
 		/*
 		 * To avoid drift caused by repeated suspend/resumes,
@@ -1733,6 +1736,7 @@  int timekeeping_suspend(void)
 				timespec64_add(timekeeping_suspend_time, delta_delta);
 		}
 	}
+#endif
 
 	timekeeping_update(tk, TK_MIRROR);
 	halt_fast_timekeeper(tk);
-- 
2.14.2.920.gcf0c67979c-goog