Patchwork powerpc/timebase_read: don't return time older than cycle_last

login
register
mail settings
Submitter Scott Wood
Date June 27, 2011, 9:56 p.m.
Message ID <20110627215613.GA13676@schlenkerla.am.freescale.net>
Download mbox | patch
Permalink /patch/102285/
State Changes Requested
Headers show

Comments

Scott Wood - June 27, 2011, 9:56 p.m.
As is done in read_tsc() on x86, make sure that we don't return a timebase
value smaller than cycle_last, which can happen on SMP if the timebases are
not perfectly synchronized.  It is less expensive than total enforcement of
monotonicity, since we don't need to add another variable and update it on
each read, but it will prevent core timekeeping functions from translating
a small timebase regression into a large jump forward.

Based on commit d8bb6f4c1670c8324e4135c61ef07486f7f17379 for x86.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kernel/time.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)
Benjamin Herrenschmidt - June 28, 2011, 12:45 a.m.
On Mon, 2011-06-27 at 16:56 -0500, Scott Wood wrote:
> As is done in read_tsc() on x86, make sure that we don't return a timebase
> value smaller than cycle_last, which can happen on SMP if the timebases are
> not perfectly synchronized.  It is less expensive than total enforcement of
> monotonicity, since we don't need to add another variable and update it on
> each read, but it will prevent core timekeeping functions from translating
> a small timebase regression into a large jump forward.
> 
> Based on commit d8bb6f4c1670c8324e4135c61ef07486f7f17379 for x86.

You are applying a bandage on a wooden leg here .... userspace (vDSO)
will see the time going backward if you aren't well synchronized as
well, so you're stuffed anyways.

Cheers,
Ben.

> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
>  arch/powerpc/kernel/time.c |   15 ++++++++++++++-
>  1 files changed, 14 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index f33acfd..b66ce41 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -806,9 +806,22 @@ static cycle_t rtc_read(struct clocksource *cs)
>  	return (cycle_t)get_rtc();
>  }
>  
> +/*
> + * We compare the timebase to the cycle_last value in the clocksource
> + * structure to avoid a nasty time-warp.  This can be observed in a very
> + * small window right after one CPU updated cycle_last under the xtime lock,
> + * and the other CPU reads a TSC value which is smaller than the cycle_last
> + * reference value due to a TSC which is slighty behind.  This delta is
> + * nowhere else observable, but in that case it results in a large forward
> + * time jump due to the unsigned delta calculation of the time keeping core
> + * code, which is necessary to support wrapping clocksources like pm timer.
> + */
>  static cycle_t timebase_read(struct clocksource *cs)
>  {
> -	return (cycle_t)get_tb();
> +	cycle_t ret = (cycle_t)get_tb();
> +
> +	return ret >= clocksource_timebase.cycle_last ?
> +	       ret : clocksource_timebase.cycle_last;
>  }
>  
>  void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
Scott Wood - June 28, 2011, 4:14 p.m.
On Tue, 28 Jun 2011 10:45:43 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Mon, 2011-06-27 at 16:56 -0500, Scott Wood wrote:
> > As is done in read_tsc() on x86, make sure that we don't return a timebase
> > value smaller than cycle_last, which can happen on SMP if the timebases are
> > not perfectly synchronized.  It is less expensive than total enforcement of
> > monotonicity, since we don't need to add another variable and update it on
> > each read, but it will prevent core timekeeping functions from translating
> > a small timebase regression into a large jump forward.
> > 
> > Based on commit d8bb6f4c1670c8324e4135c61ef07486f7f17379 for x86.
> 
> You are applying a bandage on a wooden leg here .... userspace (vDSO)
> will see the time going backward if you aren't well synchronized as
> well, so you're stuffed anyways.

Sure -- but we should avoid turning a slight backwards drift into a huge
positive offset in the kernel's calculations.  One way to do that is for
the generic timekeeping code to be robust against this, for all time
sources.  The other is to apply this sort of hack on time sources that are
known to possibly go backwards.  The former is the better fix IMHO, but the
latter is what was already done for TSC on x86, so I went with the less
intrusive change.

-Scott
Benjamin Herrenschmidt - June 28, 2011, 11:25 p.m.
On Tue, 2011-06-28 at 11:14 -0500, Scott Wood wrote:
> > You are applying a bandage on a wooden leg here .... userspace (vDSO)
> > will see the time going backward if you aren't well synchronized as
> > well, so you're stuffed anyways.
> 
> Sure -- but we should avoid turning a slight backwards drift into a huge
> positive offset in the kernel's calculations.  One way to do that is for
> the generic timekeeping code to be robust against this, for all time
> sources.  The other is to apply this sort of hack on time sources that are
> known to possibly go backwards.  The former is the better fix IMHO, but the
> latter is what was already done for TSC on x86, so I went with the less
> intrusive change.

Ok two things. One is first fix the comments then to stop mentioning
"TSC" :-)

Second is, I still don't think it's right. There's an expectation on
powerpc that the timebase works properly. If not, you have a userspace
visible breakage. There's no such thing as "a small drift". We assume no
difference is visible to software, period. We make hard assumptions here
and in various places actually.

So if you want to do that test, I would require that you also add a
warning, of the _rate_limited or _once, kind, indicating to the user
that something's badly wrong.

Cheers,
Ben.
Scott Wood - June 29, 2011, 12:08 a.m.
On Wed, 29 Jun 2011 09:25:08 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2011-06-28 at 11:14 -0500, Scott Wood wrote:
> > > You are applying a bandage on a wooden leg here .... userspace (vDSO)
> > > will see the time going backward if you aren't well synchronized as
> > > well, so you're stuffed anyways.
> > 
> > Sure -- but we should avoid turning a slight backwards drift into a huge
> > positive offset in the kernel's calculations.  One way to do that is for
> > the generic timekeeping code to be robust against this, for all time
> > sources.  The other is to apply this sort of hack on time sources that are
> > known to possibly go backwards.  The former is the better fix IMHO, but the
> > latter is what was already done for TSC on x86, so I went with the less
> > intrusive change.
> 
> Ok two things. One is first fix the comments then to stop mentioning
> "TSC" :-)

Doh, sorry...

> Second is, I still don't think it's right. There's an expectation on
> powerpc that the timebase works properly. If not, you have a userspace
> visible breakage.

As the changelog notes, this isn't a full enforement of monotonicity, it's
a way to avoid specific problems where the generic kernel timekeeping code
blows up if it goes backwards.  Fixing userspace reads to be fully
monotonic would be nice too, but it's a separate issue from the kernel
throwing a timer into the distant future because the timebase went
backwards one tick.

> There's no such thing as "a small drift". We assume no
> difference is visible to software, period.

On what do we base this assumption, and what does making the assumption
buy us?

Will smp-tbsync.c always converge on perfect sync (it has a limit on how
long it will try, and the only indication it failed is a pr_debug)?  Will
the timebase always increment on all cores at the same time, including on
emulated hardware?

We had a bug in U-Boot's timebase sync where the boot core would sometimes
be one tick faster than the other cores.  It's been fixed, but there are
probably people still running the old U-Boot.  It seems like the kind of
thing where defensive robustness is called for, like timing out instead of
hanging if a hardware register never flips the bit we're waiting for.

> We make hard assumptions here and in various places actually.

Are there any in the kernel that this doesn't cover?
 
> So if you want to do that test, I would require that you also add a
> warning, of the _rate_limited or _once, kind, indicating to the user
> that something's badly wrong.

OK.

-Scott
Benjamin Herrenschmidt - June 29, 2011, 1:06 a.m.
> > Ok two things. One is first fix the comments then to stop mentioning
> > "TSC" :-)
> 
> Doh, sorry...
> 
> > Second is, I still don't think it's right. There's an expectation on
> > powerpc that the timebase works properly. If not, you have a userspace
> > visible breakage.
> 
> As the changelog notes, this isn't a full enforement of monotonicity, it's
> a way to avoid specific problems where the generic kernel timekeeping code
> blows up if it goes backwards.  Fixing userspace reads to be fully
> monotonic would be nice too, but it's a separate issue from the kernel
> throwing a timer into the distant future because the timebase went
> backwards one tick.

I don't think we ever want to "fix" userspace... how would you "fix" the
vDSO gettimeofday implementation for example since the vDSO has no
storage ?

> > There's no such thing as "a small drift". We assume no
> > difference is visible to software, period.
> 
> On what do we base this assumption, and what does making the assumption
> buy us?

We base this assumption on what I believe is an architectural
requirement tho of course it's not worded very explicitely, and probably
just "derived" from the architecture statement that the timebase can
always be used as a monotonic source of time.

It has always been the assumption of Linux/ppc port that the timebase
cannot be observed going backward accross the SMP fabric.

They -MUST- be sourced from the same clock (not drift) and the initial
synchronization must be "good enough" to make it impossible to observe
it going backward.

What it does buy us is a lot of complexity avoided in the time keeping
code and the ability to have things like vDSO
gettimeofday/clock_gettime, ie, a very fast path to reliably timestamp
things (which is among others a serious benefit for networking).

> Will smp-tbsync.c always converge on perfect sync (it has a limit on how
> long it will try, and the only indication it failed is a pr_debug)?  Will
> the timebase always increment on all cores at the same time, including on
> emulated hardware?

smp-tbsync.c is and has always been a "workaround" for broken HW.
Anybody with half a clue should follow the recommendation of the
architecture (this one is actually spelled out, but as a recommendation
only) to have a TB enable pin and use it to perform a perfect sync at
boot time.

> We had a bug in U-Boot's timebase sync where the boot core would sometimes
> be one tick faster than the other cores.

It's scary to think that your cores TBs seem to be soured from different
clock sources... ie even if you fix uBoot, can you guarantee they won't
drift ? I hope so ... I would consider that an unfixable architecture
violation and I am not at this stage keen on implementing the necessary
"workarounds" in Linux (the userspace case is nasty, really nasty).

PowerPC always prided itself on having a "sane" time base mechanism
unlike x86, please don't tell me that you guys are now breaking that
assumption.
 
> It's been fixed, but there are
> probably people still running the old U-Boot.  It seems like the kind of
> thing where defensive robustness is called for, like timing out instead of
> hanging if a hardware register never flips the bit we're waiting for.

No, you'll just "hide" the problem from the kernel and horrible &
unexplainable things will happen in userspace. At the VERY LEAST you
must warn very loudly if you detect this is happening.

> > We make hard assumptions here and in various places actually.
> 
> Are there any in the kernel that this doesn't cover?

Check gtod implementation, I'm not sure whether that's enough at this
stage or not for it, and then there's the vDSO of course. Not sure
what's up with sched_clock() and whether that has similar constraints.
 
> > So if you want to do that test, I would require that you also add a
> > warning, of the _rate_limited or _once, kind, indicating to the user
> > that something's badly wrong.
> 
> OK.

Cheers,
Ben.
Scott Wood - June 29, 2011, 7:19 p.m.
On Wed, 29 Jun 2011 11:06:36 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> I don't think we ever want to "fix" userspace... how would you "fix" the
> vDSO gettimeofday implementation for example since the vDSO has no
> storage ?

Hmm... I guess you could at least make the libc interface monotonic, though
that wouldn't help if multiple processes are doing some shared memory
communication.

Ideally, anything that uses timestamps should be robust against small jumps
backwards (i.e. not blow it up into some huge jump forward or similar
breakage), because reality just isn't perfect.  Especially if you're writing
userspace code that could run on all sorts of hardware.

> We base this assumption on what I believe is an architectural
> requirement tho of course it's not worded very explicitely, and probably
> just "derived" from the architecture statement that the timebase can
> always be used as a monotonic source of time.

Is it making any statement about monotonicity across CPUs?

> smp-tbsync.c is and has always been a "workaround" for broken HW.

As is this (or for broken firmware, or broken emulation).

> Anybody with half a clue should follow the recommendation of the
> architecture (this one is actually spelled out, but as a recommendation
> only) to have a TB enable pin and use it to perform a perfect sync at
> boot time.

Where in the ISA is this?

Closest I see from a quick scan is "There must be a method for getting all
Time Bases in the system to start incrementing with values that are
identical or almost identical." (7.2 S, 9.2.1 E).  Note the "almost".

And while we do provide this beginning with the e500mc-based chips, e500v2
isn't dead yet.  Kexec is also currently breaking the boot sync, requiring
smp-tbsync -- though ideally kexec could be reworked to not need to
physically reset the core.

> > We had a bug in U-Boot's timebase sync where the boot core would sometimes
> > be one tick faster than the other cores.
> 
> It's scary to think that your cores TBs seem to be soured from different
> clock sources...

They're not.  We use the boot core's timebase for a while, then disable it,
reset it to zero, and enable all the timebases at once.  The bug was a
missing readback to ensure that it was stopped before it was reset, so
sometimes it would tick up to 1 on the boot core after reset, before being
enabled again.  This resulted in the boot core's timebase always reading
one greater than the other cores'.

It can also be an issue when running on simulated hardware.

> ie even if you fix uBoot, can you guarantee they won't
> drift ? I hope so ...

It shouldn't drift even with the old U-boot -- it's just a constant offset.

> I would consider that an unfixable architecture
> violation and I am not at this stage keen on implementing the necessary
> "workarounds" in Linux (the userspace case is nasty, really nasty).
> 
> PowerPC always prided itself on having a "sane" time base mechanism
> unlike x86, please don't tell me that you guys are now breaking that
> assumption.

If you mean "are we introducing new chips that have timebase problems", no.

I'm questioning whether the assumption was ever fully valid under all
circumstances.

> > It's been fixed, but there are
> > probably people still running the old U-Boot.  It seems like the kind of
> > thing where defensive robustness is called for, like timing out instead of
> > hanging if a hardware register never flips the bit we're waiting for.
> 
> No, you'll just "hide" the problem from the kernel and horrible &
> unexplainable things will happen in userspace. At the VERY LEAST you
> must warn very loudly if you detect this is happening.

A warning message is OK.

The current situation hides it as well, since it appears to work fine
until you hit the race, and suddenly things get stuck and it takes a lot of
digging to find out why.

> > > We make hard assumptions here and in various places actually.
> > 
> > Are there any in the kernel that this doesn't cover?
> 
> Check gtod implementation, I'm not sure whether that's enough at this
> stage or not for it, and then there's the vDSO of course.

I think the in-kernel gtod is OK after this patch (it breaks if it reads a
timestamp that is less than cycle_last).

The 32-bit vdso looks OK as is since it doesn't convert to nsec until after
adding to xtime.  Userspace will still see time go backwards a bit if the
timebase does, but it shouldn't get a wildly wrong answer.  The 64-bit vdso
uses srdi instead of sradi, and does so before adding to the upper
half of xtime, so if the race is hit the returned time will be too high by
2^32 seconds.

The problem in the in-kernel gettimeofday that causes it to be sensitive to
times less than cycles_last appears to be the same thing --
clocksource_cyc2ns() does an unsigned shift, even though it claims to be
returning a signed result.

> Not sure what's up with sched_clock() and whether that has similar constraints.

sched_clock() should be OK as long as the timestamp is always greater than
boot_tb, and hopefully any timebase skew is less than the delay from when
boot_tb is set to when secondary cores start up.

Delay loops could end early if they use unsigned comparisons, but only if
the timebase skew is greater than the time it takes to move a thread from
one cpu to another.

-Scott
Benjamin Herrenschmidt - June 30, 2011, 12:29 a.m.
On Wed, 2011-06-29 at 14:19 -0500, Scott Wood wrote:
> On Wed, 29 Jun 2011 11:06:36 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > I don't think we ever want to "fix" userspace... how would you "fix" the
> > vDSO gettimeofday implementation for example since the vDSO has no
> > storage ?
> 
> Hmm... I guess you could at least make the libc interface monotonic, though
> that wouldn't help if multiple processes are doing some shared memory
> communication.
> 
> Ideally, anything that uses timestamps should be robust against small jumps
> backwards (i.e. not blow it up into some huge jump forward or similar
> breakage), because reality just isn't perfect.  Especially if you're writing
> userspace code that could run on all sorts of hardware.

Well, powerpc hardware so far has been "perfect" in that regard and I'm
unwilling to bend on that one without throwing a lot of shame at your
face for screwing up :-)

Point is, it's an assumption made by Linux, but also a pile of userspace
code, and I wouldn't be surprised if it extends way beyond just glibc.

So even if it's not spelled out in the PPC_AS (tho it is in PAPR), we
can define it as a Linux requirement :-)
 
> > We base this assumption on what I believe is an architectural
> > requirement tho of course it's not worded very explicitely, and probably
> > just "derived" from the architecture statement that the timebase can
> > always be used as a monotonic source of time.
> 
> Is it making any statement about monotonicity across CPUs?

I think PAPR does actually, I suspect it's been mostly implied and never
written down from a more generic arch perspective.

> > smp-tbsync.c is and has always been a "workaround" for broken HW.
> 
> As is this (or for broken firmware, or broken emulation).
> 
> > Anybody with half a clue should follow the recommendation of the
> > architecture (this one is actually spelled out, but as a recommendation
> > only) to have a TB enable pin and use it to perform a perfect sync at
> > boot time.
> 
> Where in the ISA is this?

In the TB chapter of 2.06

> Closest I see from a quick scan is "There must be a method for getting all
> Time Bases in the system to start incrementing with values that are
> identical or almost identical." (7.2 S, 9.2.1 E).  Note the "almost".
> 
> And while we do provide this beginning with the e500mc-based chips, e500v2
> isn't dead yet.  Kexec is also currently breaking the boot sync, requiring
> smp-tbsync -- though ideally kexec could be reworked to not need to
> physically reset the core.

Right, fixable :-)

> > > We had a bug in U-Boot's timebase sync where the boot core would sometimes
> > > be one tick faster than the other cores.
> > 
> > It's scary to think that your cores TBs seem to be soured from different
> > clock sources...
> 
> They're not.  We use the boot core's timebase for a while, then disable it,
> reset it to zero, and enable all the timebases at once.  The bug was a
> missing readback to ensure that it was stopped before it was reset, so
> sometimes it would tick up to 1 on the boot core after reset, before being
> enabled again.  This resulted in the boot core's timebase always reading
> one greater than the other cores'.
> 
> It can also be an issue when running on simulated hardware.

Ok.

> > ie even if you fix uBoot, can you guarantee they won't
> > drift ? I hope so ...
> 
> It shouldn't drift even with the old U-boot -- it's just a constant offset.

That's easier to fix.

> > I would consider that an unfixable architecture
> > violation and I am not at this stage keen on implementing the necessary
> > "workarounds" in Linux (the userspace case is nasty, really nasty).
> > 
> > PowerPC always prided itself on having a "sane" time base mechanism
> > unlike x86, please don't tell me that you guys are now breaking that
> > assumption.
> 
> If you mean "are we introducing new chips that have timebase problems", no.
> 
> I'm questioning whether the assumption was ever fully valid under all
> circumstances.

I think it was, and regardless, the majority of code out there was
written with that assumption.

> > > It's been fixed, but there are
> > > probably people still running the old U-Boot.  It seems like the kind of
> > > thing where defensive robustness is called for, like timing out instead of
> > > hanging if a hardware register never flips the bit we're waiting for.
> > 
> > No, you'll just "hide" the problem from the kernel and horrible &
> > unexplainable things will happen in userspace. At the VERY LEAST you
> > must warn very loudly if you detect this is happening.
> 
> A warning message is OK.
> 
> The current situation hides it as well, since it appears to work fine
> until you hit the race, and suddenly things get stuck and it takes a lot of
> digging to find out why.

Yes but you can't hide it completely, so you can't "contain" the damage.

> > > > We make hard assumptions here and in various places actually.
> > > 
> > > Are there any in the kernel that this doesn't cover?
> > 
> > Check gtod implementation, I'm not sure whether that's enough at this
> > stage or not for it, and then there's the vDSO of course.
> 
> I think the in-kernel gtod is OK after this patch (it breaks if it reads a
> timestamp that is less than cycle_last).
> 
> The 32-bit vdso looks OK as is since it doesn't convert to nsec until after
> adding to xtime.  Userspace will still see time go backwards a bit if the
> timebase does, but it shouldn't get a wildly wrong answer.

But it might calculate one. There's no such thing as a "small problem"
here. If userspace sees the time go backward, even by 1ns, horrible
unpredictable things will happen.

It's either fully correct or not correct at all, and our userspace makes
the assumption that it's always fully correct.

>   The 64-bit vdso
> uses srdi instead of sradi, and does so before adding to the upper
> half of xtime, so if the race is hit the returned time will be too high by
> 2^32 seconds.
> 
> The problem in the in-kernel gettimeofday that causes it to be sensitive to
> times less than cycles_last appears to be the same thing --
> clocksource_cyc2ns() does an unsigned shift, even though it claims to be
> returning a signed result.
> 
> > Not sure what's up with sched_clock() and whether that has similar constraints.
> 
> sched_clock() should be OK as long as the timestamp is always greater than
> boot_tb, and hopefully any timebase skew is less than the delay from when
> boot_tb is set to when secondary cores start up.
> 
> Delay loops could end early if they use unsigned comparisons, but only if
> the timebase skew is greater than the time it takes to move a thread from
> one cpu to another.

Cheers,
Ben.

Patch

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f33acfd..b66ce41 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -806,9 +806,22 @@  static cycle_t rtc_read(struct clocksource *cs)
 	return (cycle_t)get_rtc();
 }
 
+/*
+ * We compare the timebase to the cycle_last value in the clocksource
+ * structure to avoid a nasty time-warp.  This can be observed in a very
+ * small window right after one CPU updated cycle_last under the xtime lock,
+ * and the other CPU reads a TSC value which is smaller than the cycle_last
+ * reference value due to a TSC which is slighty behind.  This delta is
+ * nowhere else observable, but in that case it results in a large forward
+ * time jump due to the unsigned delta calculation of the time keeping core
+ * code, which is necessary to support wrapping clocksources like pm timer.
+ */
 static cycle_t timebase_read(struct clocksource *cs)
 {
-	return (cycle_t)get_tb();
+	cycle_t ret = (cycle_t)get_tb();
+
+	return ret >= clocksource_timebase.cycle_last ?
+	       ret : clocksource_timebase.cycle_last;
 }
 
 void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,